Elixir is a modern language with first-class support for concurrency and distributed applications. Things that can be challenging in other languages can be done in Elixir with minimal effort. One of those things is a distributed cache without any additional infrastructure. Such an approach can reduce your infrastructure costs, provide excellent performance, and work well with cloud autoscaling.
This article’s primary goal is to show how much effort it takes to bring a distributed caching solution to an Elixir app, and to provide some insights and helpful hints based on our experience.
We have a service that holds configurations. Our microservices platform can request multiple configurations during one external request, and each configuration request can involve multiple calls to underlying storage such as DynamoDB. Configurations, however, do not change often, so they are an ideal subject for in-memory caching optimization.
In our team, every second week we have an “Innovation Day” when we can freely experiment with our services, propose any improvements, and clean up annoying parts of the code. We can do any technical work that can be useful, without bureaucratic stoppers. Last year on one of these days, a week before Black Friday, we made a prototype of distributed caching. As a retail platform, we expect a significant load spike during Black Friday and Cyber Monday, and we realized that distributed caching could be a remarkable improvement to our system before this period. So, we spent one more day polishing it and deployed it to production. It worked without any issues! As a result, our service started to return sub-millisecond responses!
Fortunately, our solution is pretty simple, so let me show you how you can enable distributed caching in your own Elixir project. In practice, you probably already have an application to speed up, but let’s set up a simple HTTP service in Elixir before we proceed. Along the way, I’ll also highlight a few essential details about Elixir and its ecosystem in case you’re not familiar with this technology.
Elixir is a VM-based language. It uses Erlang's VM called BEAM. An instance of BEAM is called a node. One node can efficiently utilize all your CPU cores, so you don’t usually need more than one node on one physical machine. In most cases, when you run your Elixir app locally, you start one VM node.
Elixir has first-class support for parallelism and utilizes the Actor Concurrency Model (actors are called processes in Erlang terminology). On top of that, Elixir follows Erlang/OTP conventions, and each Elixir app is a supervisor tree. Oversimplifying: an app consists of independent processes that work like small services, and supervisors ensure that all processes are running or are restarted if needed. Yes, the Elixir/Erlang runtime resembles a set of microservices orchestrated by Kubernetes.
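To give a feel for the model, here is a tiny, self-contained sketch of two processes exchanging messages (plain Elixir, nothing specific to our service):

# spawn a process that waits for a ping and answers with a pong
pid =
  spawn(fn ->
    receive do
      {:ping, from} -> send(from, :pong)
    end
  end)

# send it a message and wait for the reply
send(pid, {:ping, self()})

receive do
  :pong -> IO.puts("got pong")
end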
If you are interested in learning and experimenting with Elixir, consider visiting the official “Getting Started” guide or Elixir School's resources.
The diagram below visualizes what our simple HTTP server will look like (with a focus on the supervisor tree).
In Elixir, code is organized into modules. A module is a named set of functions. By following certain rules, a module can implement some behavior, for example the supervisor behavior. Roughly speaking, you can think of two types of modules: plain sets of functions and process behavior implementations. Usually, each node in the supervisor tree is implemented as a separate module, and you can pass initialization arguments to it. In the diagram, we refer to Plug.Cowboy, but we provide it with our own routing module (CacheDemo).
This repository contains the final form of the application we are discussing: https://gitlab.com/newstore/demos/elixir_cache (except for the Kubernetes setup part).
To simplify testing, let’s assume we have some external data storage, and it takes 20ms to get data from it. Let’s implement it in a simple CacheDemo.Repo module:
defmodule CacheDemo.Repo do
  def get_data do
    Process.sleep(20)
    "THIS IS DATA!"
  end
end
And let’s expose it via HTTP endpoints by defining a router:
defmodule CacheDemo do
  use Plug.Router

  # plug Plug.Logger
  plug :match
  plug :dispatch

  get "/ping" do
    send_resp(conn, 200, "pong")
  end

  get "/no_cache" do
    send_resp(conn, 200, CacheDemo.Repo.get_data())
  end

  match _ do
    send_resp(conn, 404, "route not found")
  end
end
And make it alive by attaching it to the root supervisor:
defmodule CacheDemo.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      {Plug.Cowboy, scheme: :http, plug: CacheDemo, port: 9000}
    ]

    opts = [strategy: :one_for_one, name: CacheDemo.Supervisor]
    Supervisor.start_link(children, opts)
  end
end
Now we can call our endpoints. Let’s use hey for benchmarking. As a benchmark, let’s send 5k requests with a concurrency of 50. (Concurrency will affect our result significantly, so it’s important to stick to the same value.) hey will provide a lot of data. For simplicity, we will focus on two values: average time and 99th percentile. All the benchmarks in this article are done on the same machine in similar conditions.
This is our “no operation” endpoint. We cannot be faster than it:
hey -n 5000 -c 50 http://127.0.0.1:9000/ping
> Average: 0.0010 secs
> 99% in 0.0033 secs
And this is our “not cached” endpoint. We should not be slower than it:
hey -n 5000 -c 50 http://127.0.0.1:9000/no_cache
> Average: 0.0214 secs
> 99% in 0.0240 secs
Earlier in this article, I told you about sub-millisecond responses, so you may be curious why even a no-op endpoint takes about 1ms here. The answer is concurrency. In production, we rarely have that many parallel requests per instance. So, if we change the concurrency of hey to a number lower than the number of available CPU cores, the results will be much more impressive (I have a 6-core Intel Core i7 CPU).
hey -n 5000 -c 4 http://127.0.0.1:9000/ping
> Average: 0.0002 secs
> 99% in 0.0004 secs
Let’s introduce a node-local cache. In some situations, it would be enough just to use ETS, Erlang’s built-in in-memory storage that can be used in a key-value manner. Since it is part of the Erlang standard library, it is reasonable to consider it the default solution. But we will not use it directly, for two reasons: first, we would have to implement cache concerns such as TTL-based expiration and eviction on top of it ourselves; second, plain ETS is strictly node-local, while we will want distributed invalidation later.
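For reference, this is roughly what raw ETS usage looks like; a minimal sketch, not what we will actually do in this article:

# create a named, public table and use it as a key-value store
:ets.new(:my_cache, [:set, :public, :named_table])
:ets.insert(:my_cache, {"user:1", %{name: "Ada"}})
:ets.lookup(:my_cache, "user:1")
#=> [{"user:1", %{name: "Ada"}}]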
So, let’s use a library that will handle these problems for us (it’s based on ETS under the hood): Nebulex. It has many features and can be used for a variety of cases. Even multilevel caches are possible with it. That makes it a flexible solution, and we can use the same library for different cases.
Moreover, its overhead on top of ETS is almost zero; we will see this in the first caching benchmark.
Our goal is the following supervisor tree:
This article is not a Nebulex usage manual, so it will not cover all of its details and possibilities. The real goal of the article is to show the amount of coding effort needed to spin up a cache. It’s time to add the cache! In mix.exs:
# ...
defp deps do
  [
    # ...
    {:nebulex, "~> 2.3"},
    {:decorator, "~> 1.4"} # an optional dependency for nebulex; skip it if you don't use decorators
  ]
end
# ...
Then we have to create a module that represents the cache:
defmodule CacheDemo.LocalCache do
  use Nebulex.Cache,
    otp_app: :cache_demo,
    adapter: Nebulex.Adapters.Local
end
The module is started as a process within our application, so to make the cache work, we have to add it to the supervisor tree of our app:
defmodule CacheDemo.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      {CacheDemo.LocalCache, []}, # <---------- order matters. See Supervisor docs.
      {Plug.Cowboy, scheme: :http, plug: CacheDemo, port: 9000}
    ]

    opts = [strategy: :one_for_one, name: CacheDemo.Supervisor]
    Supervisor.start_link(children, opts)
  end
end
We can configure our cache behavior in config/*.exs files:
config :cache_demo, CacheDemo.LocalCache,
  # we don't use it, but let's enable it because in a complex project, it can be useful
  stats: true,
  telemetry: true
A configuration like this allows us to have different cache behavior for development, test, and production purposes. For example, you can set the global TTL for development to a low value.
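One way to do that is to read the TTL from configuration instead of hard-coding it; a minimal sketch, assuming a :cache_ttl application key that we introduce ourselves (it is not a Nebulex option):

# config/dev.exs
config :cache_demo, :cache_ttl, :timer.minutes(1)

# config/prod.exs
config :cache_demo, :cache_ttl, :timer.hours(1)

# and in the repo module, instead of the hard-coded value used below:
# @ttl Application.compile_env(:cache_demo, :cache_ttl)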
And now we can use the cache in our repo (we’re using decorators here, but it’s also possible to interact with the cache directly; see the sketch after the snippet below):
defmodule CacheDemo.Repo do
  use Nebulex.Caching

  @ttl :timer.hours(1)

  # ...

  @decorate cacheable(cache: CacheDemo.LocalCache, opts: [ttl: @ttl])
  def get_cached(_id) do
    Process.sleep(20)
    "THIS IS DATA!"
  end
end
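Without the decorator, the same lookup would use the cache API directly; a sketch of the usual get/put pattern inside the same module:

# same module as above, so @ttl is available
def get_cached_manually(id) do
  case CacheDemo.LocalCache.get(id) do
    nil ->
      Process.sleep(20)
      value = "THIS IS DATA!"
      CacheDemo.LocalCache.put(id, value, ttl: @ttl)
      value

    value ->
      value
  end
end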
And expose it over HTTP:
defmodule CacheDemo do
  # ...

  get "/local_cache" do
    send_resp(conn, 200, CacheDemo.Repo.get_cached(123))
  end

  # ...
end
And benchmark it:
hey -n 5000 -c 50 http://127.0.0.1:9000/local_cache
> Average: 0.0011 secs (/ping result was 0.0010)
> 99% in 0.0033 secs (/ping result was 0.0033)
As we see, local cache usage hardly differs from the no-op endpoint. That’s an incredible result already! It means that in most cases you don’t even need to think about direct ETS usage for optimization reasons – library overhead is minimal!
Results like this are not possible with Redis and Memcached due to network communication and parsing overheads. Moreover, data in ETS is stored in a raw format, with no (de)serialization involved. So, you don’t need to worry about things like, “Can my data be serialized properly?” and “Will I have the same data after deserialization?” In sum, it makes ETS-based cache usage more performant, reliable, and easy to use than third-party solutions.
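A quick IEx illustration of that point (assuming the LocalCache from above is running): even values that cannot be meaningfully serialized, like PIDs or function captures, survive a round trip unchanged.

iex> CacheDemo.LocalCache.put("raw", %{pid: self(), fun: &String.upcase/1})
:ok
iex> CacheDemo.LocalCache.get("raw")
%{fun: &String.upcase/1, pid: #PID<0.245.0>}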
On the other hand, third-party solutions can be more convenient if you need a persistent cache.
So far, we haven’t done anything really impressive. We could achieve similar results with an in-memory cache library in any popular language. But done this way, we would have a problem: manual invalidation. Nowadays, we rarely run an app on just one machine. In most cases, we dynamically adjust the number of app instances behind a load balancer. When we manually invalidate some cache entry, we invalidate it only on one instance. If we had to update the value in the cache, one instance would update it in both its own cache and the underlying storage, while the rest of the instances would keep returning the outdated value. To solve this problem, we need to invalidate the cache on all instances at once.
A pretty popular solution is an external cache like Memcached or Redis, but that would make our performance worse, add complexity to the infrastructure, and introduce a new failure point. And let’s not forget about serialization/deserialization complexity. Fortunately, in OTP-powered languages like Erlang and Elixir, we have one more option: make our app distributed. All our instances will form a cluster where they can communicate with each other. And with Elixir, it’s surprisingly easy: the trickiest parts are already handled by OTP and the libcluster library! As for the cache, Nebulex already supports distributed operation.
As the first step, let’s set up a local cluster by running two app instances locally. In mix.exs add the dependency:
# ...
defp deps do
  [
    # ...
    {:libcluster, "~> 3.3"}
  ]
end
# ...
Then configure libcluster to use the local strategy (most likely in the config/dev.exs file of your project):
config :libcluster,
  topologies: [
    local: [
      strategy: Cluster.Strategy.LocalEpmd
    ]
  ]
Start libcluster's worker as part of our app (application.ex):
# ...
def start(_type, _args) do
  topologies = Application.get_env(:libcluster, :topologies, [])

  children = [
    {Cluster.Supervisor, [topologies, [name: CacheDemo.ClusterSupervisor]]},
    # ...
  ]

  # ...
end
# ...
To avoid port conflicts, do not forget to make the port configurable via an environment variable:
config :cache_demo, CacheDemo,
  port: System.get_env("PORT", "9000") |> String.to_integer()
Now we can start two instances of our application with a REPL attached:
PORT=9000 iex --sname node-A -S mix run
PORT=9001 iex --sname node-B -S mix run
Check that the nodes can see each other using Node.list/0:
# on node-A:
iex(node-A@yourhost)1> Node.list()
[:"node-B@yourhost"]
# on node-B
iex(node-B@yourhost)1> Node.list()
[:"node-A@yourhost"]
That is all we need to set up a distributed VM environment locally! Let’s make a replicated cache:
defmodule CacheDemo.ReplicatedCache do
  use Nebulex.Cache,
    otp_app: :cache_demo,
    # Replicated adapter instead of Local
    adapter: Nebulex.Adapters.Replicated,
    # and the Local adapter stores this node's replica of the cache
    primary_storage_adapter: Nebulex.Adapters.Local
end
# in application.ex
children = [
  {Cluster.Supervisor, [topologies, [name: CacheDemo.ClusterSupervisor]]},
  {CacheDemo.LocalCache, []},
  {CacheDemo.ReplicatedCache, []}, # <---
  {Plug.Cowboy,
   scheme: :http, plug: CacheDemo, options: Application.get_env(:cache_demo, CacheDemo)}
]

# in config.exs, let's use a similar config:
config :cache_demo, CacheDemo.ReplicatedCache,
  # we don't use it, but let's enable it because in a complex project, it can be useful
  stats: true,
  telemetry: true

# and add an HTTP endpoint:
get "/replicated_cache" do
  send_resp(conn, 200, CacheDemo.Repo.get_cached_replicated(123))
end
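The endpoint above calls CacheDemo.Repo.get_cached_replicated/1, which we haven’t shown yet; it mirrors get_cached/1 but targets the replicated cache (a sketch; the full version is in the linked repository):

# in CacheDemo.Repo, next to get_cached/1
@decorate cacheable(cache: CacheDemo.ReplicatedCache, opts: [ttl: @ttl])
def get_cached_replicated(_id) do
  Process.sleep(20)
  "THIS IS DATA!"
end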
The benchmark shows the same results as for the local cache (!!!):
hey -n 5000 -c 50 http://127.0.0.1:9000/replicated_cache
> Average: 0.0011 secs (/ping result was 0.0010)
> 99% in 0.0035 secs (/ping result was 0.0033)
You can experiment with the cache API and confirm that when you invalidate a cache entry on one node, it’s invalidated on the other. You can also check the behavior when you remove nodes from and add nodes to the cluster:
# on node B:
iex(node-B@yourhost)1> CacheDemo.ReplicatedCache.get("my_key")
nil
# on node A:
iex(node-A@yourhost)1> CacheDemo.ReplicatedCache.get("my_key")
nil
iex(node-A@yourhost)2> CacheDemo.ReplicatedCache.put("my_key", "DATA")
:ok
iex(node-A@yourhost)3> CacheDemo.ReplicatedCache.get("my_key")
"DATA"
# on node B:
iex(node-B@yourhost)2> CacheDemo.ReplicatedCache.get("my_key")
"DATA"
iex(node-B@yourhost)3> CacheDemo.ReplicatedCache.delete("my_key")
:ok
iex(node-B@yourhost)4> CacheDemo.ReplicatedCache.get("my_key")
nil
# on node A:
iex(node-A@yourhost)4> CacheDemo.ReplicatedCache.get("my_key")
nil
At this point, we were impressed by how little effort it took to spin up a distributed cache with Elixir.
“But it works locally!” is a pretty common joke in the IT community. To actually show success, we have to deploy it to production. All you need is to configure libcluster to work well with your production environment. The library ships with several strategies for different situations, and there are also some third-party ones. This article aims to give an idea of how much effort is needed to spin up a distributed cache in the BEAM environment, so let me illustrate what we needed to do to make it work in production.
We’re using a Kubernetes cluster, and our final approach looks like this:
We decided to use the Elixir.Cluster.Strategy.Kubernetes strategy. So, we put our configuration in config/prod.exs:
config :libcluster,
  topologies: [
    kubernetes: [
      strategy: Elixir.Cluster.Strategy.Kubernetes,
      config: [
        # these two options mean that the "pod list" API call will
        # be used to determine the pods' IP addresses
        mode: :ip,
        kubernetes_ip_lookup_mode: :pods,
        kubernetes_node_basename: "node-name",
        kubernetes_selector: "app=app-name", # selector for pods
        kubernetes_namespace: "our-k8s-namespace",
        polling_interval: 10_000
      ]
    ]
  ]
Each Kubernetes pod has a token to access Kubernetes API. The library knows where to get this token and how to make a call. But we need to specify details like those shown above.
We need some additional work on the infrastructure side to make this configuration work. It’s essential to have proper node naming to avoid name clashes, so we have to change some things in the release setup. We’re using Elixir releases wrapped into a Docker image. To configure node naming, add the following to rel/env.sh.eex (execute mix release.init if you do not have this folder):
export RELEASE_DISTRIBUTION=name
export RELEASE_NODE="app-name@${POD_IP}"
So, our nodes will be named after their private IPs. We control deployments via Terraform. To make things work, our service’s deployment should be configured like this:
locals {
  namespace = "our-k8s-namespace"
}

resource "kubernetes_deployment" "app-name" {
  metadata {
    name      = "app-name"
    namespace = local.namespace
  }

  spec {
    # ...
    # this is a pod spec template
    template {
      metadata {
        labels = {
          # this label is essential for discovering new nodes
          # (see kubernetes_selector in the libcluster config)
          app = "app-name"
        }
      }

      spec {
        # we need this because we have to provide access to the k8s API "pod list" action
        service_account_name = kubernetes_service_account.app-name.metadata[0].name
        # ...

        container {
          # ...
          # this block is essential for proper node naming
          env {
            name = "POD_IP"
            value_from {
              field_ref {
                field_path = "status.podIP"
              }
            }
          }
          # ...
        }
      }
    }
  }
}
Then we have to configure service account:
resource "kubernetes_service_account" "app-name" {
metadata {
name = "app-name"
namespace = local.namespace
annotations = {
# we're using k8s in AWS; your case can be different
# so let me skip AWS role details.
"eks.amazonaws.com/role-arn" = aws_iam_role.app-name.arn
}
}
}
resource "kubernetes_role" "app-name" {
metadata {
name = "app-name"
namespace = local.namespace
}
rule {
# this is essential block!!!
# It allowes libcluster to discover other nodes' private IPs via k8s API call.
api_groups = ["*"]
resources = ["pods"]
verbs = ["list", "get"]
}
}
resource "kubernetes_role_binding" "app-name" {
metadata {
name = "app-name"
namespace = local.namespace
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "Role"
name = kubernetes_role.app-name.metadata.0.name
}
subject {
namespace = local.namespace
kind = "ServiceAccount"
name = kubernetes_service_account.app-name.metadata.0.name
}
}
And that’s enough to make it work even with autoscaling. Nodes will join and leave the cluster automatically. But there is one more implicit but essential thing here: Erlang nodes can only connect to each other if they share the same distribution cookie, the so-called “magic cookie.”
In the case of Elixir releases, that “magic cookie” is randomly generated when we call mix release:
:cookie – a string representing the Erlang Distribution cookie. If this option is not set, a random cookie is written to the releases/COOKIE file when the first release is assembled. At runtime, we will first attempt to fetch the cookie from the RELEASE_COOKIE environment variable and then we’ll read the releases/COOKIE file.

In our case, it means that any two different versions of a Docker image will have different cookies. So, when you do a rolling update, the cache will not be transferred between old and new pods. And that’s good: we can safely change the cache structure without worrying about compatibility with older deployments.
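If, for some reason, you did want the cache to survive a rolling update, the same mechanism would let you pin the cookie explicitly (a sketch; we deliberately do not do this):

# rel/env.sh.eex
export RELEASE_COOKIE="some-long-stable-secret"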
Also, in addition to the things already highlighted here, we made the cache controllable by an environment variable to make it easy to disable in case of any problems. But we didn’t have any issues.
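A toggle like that might look as follows; a minimal sketch, assuming a CACHE_ENABLED variable of our own (the exact mechanism in our service differs):

# config/runtime.exs
config :cache_demo, :cache_enabled, System.get_env("CACHE_ENABLED", "true") == "true"

# in the repo, fall back to the uncached path when the flag is off
def get_data_maybe_cached(id) do
  if Application.get_env(:cache_demo, :cache_enabled, true) do
    get_cached_replicated(id)
  else
    get_data()
  end
end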
It took one day to make it work locally, and just one more day to adjust the infrastructure so that pods could discover each other. As a result, our average response time dropped from ~7ms to sub-millisecond. I agree that introducing AWS-managed Redis or Memcached would take a similar amount of time and effort, but our way has some valuable advantages:
In Elixir, you can start a REPL inside a running VM. So, we can inspect what’s inside the in-memory cache on a particular pod without any problems and without learning new tools or query syntax. We do not sacrifice debugging convenience here; we made it even simpler!
Of course, if we wanted a persistent cache, the situation would be different. For example, in Kubernetes, our services should be stateless, or else we would have to deal with a much more complex setup. But the described solution is perfect for wrapping external data sources in an LRU (least recently used) manner.
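For that use case, the local cache can also be bounded in size so that old entries are evicted; a sketch, assuming the local adapter's gc_interval and max_size options (check the Nebulex adapter docs for your version):

config :cache_demo, CacheDemo.LocalCache,
  stats: true,
  telemetry: true,
  # run the generational garbage collection periodically
  gc_interval: :timer.hours(12),
  # cap the number of cached entries
  max_size: 100_000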
And last but not least: we deployed this solution several months ago, and we have not had a single problem with it since.