Adding nodes to a Memcached cluster without downtime is achievable because sharding happens on the client side: the servers never coordinate with each other, so a new node only has to be added to each client’s server list.
Here’s a Memcached cluster in action. We’ll start with two nodes and add a third.
# Start the first Memcached instance (-d runs it in the background as a daemon)
memcached -d -p 11211 -m 64 -l 127.0.0.1
# Start the second Memcached instance
memcached -d -p 11212 -m 64 -l 127.0.0.1
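Now, let’s imagine our application uses a client library that supports consistent hashing, such as libmemcached with its Ketama distribution enabled or pymemcache’s HashClient (which uses rendezvous hashing by default). Note that python-memcached, used in the snippet below to keep things short, picks a server with a plain hash-modulo scheme, so treat the snippet as an illustration of the client API rather than of consistent hashing. With two nodes, the keys are split between them: for example, my_key_1 might go to 127.0.0.1:11211 and my_key_2 to 127.0.0.1:11212.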
import memcache
# Initializing with two servers
mc = memcache.Client(['127.0.0.1:11211', '127.0.0.1:11212'], debug=0)
mc.set('user:100', '{"name": "Alice"}')
mc.set('product:50', '{"name": "Gadget"}')
print(mc.get('user:100'))
print(mc.get('product:50'))
When we add a third node and tell the clients about it, the client library’s consistent hashing algorithm re-evaluates the distribution. Crucially, nothing moves existing data: neither the client nor the servers copy anything. The client simply starts sending requests for the keys that now map to the new node to that node.
# Start the third Memcached instance (-d runs it in the background as a daemon)
memcached -d -p 11213 -m 64 -l 127.0.0.1
The mental model here is that the cluster isn’t a single, monolithic entity that needs to be restarted. Instead, it’s a collection of independent servers, and the "intelligence" for distribution resides in the clients. When a new server is added, the clients simply update their internal mapping of which server is responsible for which hash range of keys.
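To make the "clients update their internal mapping" idea concrete, here is a minimal sketch of a consistent-hash ring next to plain modulo hashing. It is a toy written for this article, not the algorithm of any particular client library: the `HashRing` class, the MD5-based `_hash` helper, and the choice of 100 virtual nodes per server are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a 32-bit integer using MD5 (an arbitrary choice here)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:4], "big")


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, servers, vnodes=100):
        # Place each server on the ring at `vnodes` pseudo-random points.
        self._points = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def server_for(self, key: str) -> str:
        # Find the first ring point at or after the key's hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_left(self._hashes, _hash(key)) % len(self._points)
        return self._points[idx][1]


def modulo_server(key, servers):
    """Naive alternative: hash(key) % num_servers."""
    return servers[_hash(key) % len(servers)]


def moved_fraction(before, after, keys):
    """Fraction of keys whose assigned server differs between two mappings."""
    moved = sum(1 for k in keys if before(k) != after(k))
    return moved / len(keys)


two = ["127.0.0.1:11211", "127.0.0.1:11212"]
three = two + ["127.0.0.1:11213"]
keys = [f"user:{i}" for i in range(10_000)]

ring_two, ring_three = HashRing(two), HashRing(three)
ring_moved = moved_fraction(ring_two.server_for, ring_three.server_for, keys)
modulo_moved = moved_fraction(
    lambda k: modulo_server(k, two), lambda k: modulo_server(k, three), keys
)

print(f"consistent hashing: {ring_moved:.0%} of keys changed server")
print(f"modulo hashing:     {modulo_moved:.0%} of keys changed server")
```

With an evenly balanced ring, going from two servers to three should remap roughly one third of the keys, and every remapped key lands on the new server. Modulo hashing should remap roughly two thirds. The exact figures depend on the hash function and the number of virtual nodes.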
Consider the process from the client’s perspective:
- Initial State: The client knows about ServerA and ServerB. Each key is hashed and assigned to one of them.
- Adding ServerC: The client is updated with the address of ServerC.
- Re-hashing: The client’s consistent hashing algorithm recalculates the key distribution. Some keys previously mapped to ServerA or ServerB are now mapped to ServerC.
- Data Migration (Implicit): When a client issues a GET for a key that now maps to ServerC but whose value still sits on ServerA (because it has not yet expired or been evicted), the request goes to ServerC and misses. The application then typically reloads the value from its source of truth and issues a SET, which lands on ServerC. Over time, as keys are re-fetched or overwritten, the data converges on the new distribution, while the stale copies on the old servers age out through expiration or eviction. For critical data that must be available on the new node immediately, you might need a separate process that iterates through the relevant keys and SETs them using the new cluster configuration, forcing them onto the correct nodes.
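The "separate process" mentioned in the last step could look something like the sketch below. It assumes the key list comes from the application itself (for example, a database query), and that both clients expose memcached-style `get(key)` and `set(key, value, ttl)` methods. The `warm_new_layout` function and the in-memory `DictClient` stand-in are hypothetical names introduced here. Because a plain GET does not return a key’s remaining lifetime, the sketch applies a fresh, assumed TTL.

```python
def warm_new_layout(keys, old_client, new_client, ttl=300):
    """Copy values still sitting on the old layout into the new one.

    Both clients are any objects with memcached-style get(key) and
    set(key, value, ttl) methods. Returns the number of keys copied.
    """
    copied = 0
    for key in keys:
        if new_client.get(key) is not None:
            continue  # already present under the new layout; keep the newer value
        value = old_client.get(key)
        if value is None:
            continue  # expired or evicted; the application will repopulate it
        new_client.set(key, value, ttl)
        copied += 1
    return copied


class DictClient:
    """In-memory stand-in for a memcached client, for demonstration only."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value, ttl=0):
        self.data[key] = value
        return True


old, new = DictClient(), DictClient()
old.set("user:100", '{"name": "Alice"}')
old.set("product:50", '{"name": "Gadget"}')
new.set("product:50", '{"name": "Gadget v2"}')  # already rewritten under the new layout

copied = warm_new_layout(["user:100", "product:50", "missing"], old, new)
print(copied)  # 1: only user:100 needed copying
```

In a real deployment, `old_client` would be configured with the previous server list and `new_client` with the expanded one. Checking the new layout first avoids overwriting a fresher value with a stale one.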
The levers you control are:
- Client Library: The choice of Memcached client library is paramount. It must support consistent hashing (e.g., Ketama or rendezvous hashing) for seamless scaling. Libraries that simply use modulo arithmetic (hash(key) % num_servers) remap most keys whenever the server count changes (about two thirds of them when going from two servers to three), leading to a cache stampede and, effectively, downtime.
- Server Addition: The act of starting a new Memcached instance on a new IP/port.
- Client Reconfiguration: Updating your application’s client instances to include the new server address. This is often done by updating a configuration file or environment variable and then restarting/reloading the application instances.
- Graceful Deactivation (for removal): When removing nodes, you’d typically tell the clients to stop sending traffic to a specific node, then wait for its connections to drain before shutting it down.
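As a sketch of the client reconfiguration lever, the snippet below reads the server list from an environment variable so that adding a node becomes a configuration change followed by a reload. The variable name `MEMCACHED_SERVERS`, the `parse_servers` helper, and the default value are assumptions made for this example, not a convention of Memcached or of any client library.

```python
import os

DEFAULT_SERVERS = "127.0.0.1:11211,127.0.0.1:11212"


def parse_servers(raw: str) -> list[str]:
    """Split a comma-separated "host:port" list, validating each entry."""
    servers = []
    for entry in raw.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate trailing commas and blank entries
        host, sep, port = entry.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {entry!r}")
        servers.append(f"{host}:{int(port)}")
    return servers


# MEMCACHED_SERVERS is a hypothetical variable name chosen for this sketch.
servers = parse_servers(os.environ.get("MEMCACHED_SERVERS", DEFAULT_SERVERS))
print(servers)
```

The resulting list has the same shape as the one passed to memcache.Client earlier, so adding 127.0.0.1:11213 to the variable and reloading the application is all the client-side change requires.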
The real magic is that Memcached itself doesn’t know or care about other Memcached instances. Each server is an independent, shared-nothing cache with no replication and no cluster membership protocol. All the complexity of managing a "cluster" is pushed to the clients. This is why adding nodes is just a matter of updating the client’s view of the world and letting the hashing algorithm do its work. The data then follows.
What most people don’t realize is that during this transition, there’s a period where your cache hit rate will temporarily dip. Keys that now map to the new node will miss on their first read, because the new node starts empty. The client library’s consistent hashing minimizes the number of keys affected, but it doesn’t eliminate those misses: each remapped key keeps missing until the application re-populates it on its new server.
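The re-population that ends the dip usually happens through a cache-aside read path. Here is a minimal sketch, assuming a memcached-style client with `get` and `set` and a hypothetical `loader` callable that fetches from the source of truth. The `get_or_load`, `EmptyNode`, and `load_from_db` names are invented for illustration.

```python
def get_or_load(cache, key, loader, ttl=300):
    """Return the cached value, loading and caching it on a miss."""
    value = cache.get(key)
    if value is not None:
        return value  # cache hit
    value = loader(key)  # cache miss: fall back to the source of truth
    if value is not None:
        cache.set(key, value, ttl)  # lands on whichever server now owns the key
    return value


class EmptyNode:
    """In-memory stand-in for a freshly added, empty cache node."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value, ttl=0):
        self.data[key] = value
        return True


loads = []  # records every trip to the (pretend) database


def load_from_db(key):
    loads.append(key)
    return f"value-for-{key}"


node = EmptyNode()
first = get_or_load(node, "user:100", load_from_db)   # miss, then populate
second = get_or_load(node, "user:100", load_from_db)  # hit
print(len(loads))  # 1: only the first read reached the loader
```

Each remapped key costs one extra load on its first read, which is why the dip is temporary rather than permanent.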
The next challenge you’ll face is handling cache stampedes during a significant load increase or when a node does fail.