The MongoDB balancer isn’t just about moving data; it’s a sophisticated distributed system that orchestrates data distribution across shards by migrating chunks, and understanding its nuances is key to preventing performance bottlenecks and ensuring even load.

Let’s see it in action. Imagine you have a sharded cluster with two shards, shard1 and shard2. A collection mycollection is sharded by user_id. As users add data, mycollection grows, and MongoDB’s balancer kicks in.

Here’s a simplified view of what happens:

  1. Chunk Identification: MongoDB internally tracks data ranges (chunks) within a sharded collection. For mycollection, a chunk might represent user_id values from 1000 to 2000.
  2. Balancing Trigger: The balancer monitors chunk distribution. If shard1 has too many chunks, or chunks with a disproportionately large amount of data compared to shard2, the balancer decides to migrate a chunk.
  3. Migration Plan: The balancer selects a chunk to move, say, from shard1 to shard2.
  4. Data Transfer: The actual migration involves two phases:
    • Clone and Catch-up: The destination shard (shard2) copies the chunk’s documents from the source shard (shard1), then applies any changes that were made while the copy was running. Throughout this phase the source shard still owns the chunk and serves all reads and writes for it, so applications see no change.
    • Commit Migration: Once the destination has caught up, the source briefly blocks writes to the chunk while the ownership change is recorded in the config server metadata. From then on shard2 owns the chunk, and reads and writes for it are directed there. This phase is very brief and designed to minimize write disruption.
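  5. Metadata Update: The mongos instances (query routers) pick up the new chunk ownership, typically by refreshing their cached routing table the first time a request reaches a shard that no longer owns the range. Subsequent queries hitting those user_id ranges will be routed to shard2.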
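
The steps above can be pictured with a toy routing table. This is a minimal sketch in plain JavaScript, not MongoDB’s implementation: the chunks array, routeTo, and commitMigration are hypothetical names, and the ranges are made up for illustration. It shows that a committed migration changes only who owns a range, not the range itself.

```javascript
// Toy routing table: each chunk covers [min, max) of user_id and has one owner.
// Names and ranges are illustrative, not read from a real cluster.
const chunks = [
  { min: 0, max: 1000, shard: "shard1" },
  { min: 1000, max: 2000, shard: "shard1" },
  { min: 2000, max: 3000, shard: "shard2" },
];

// Find the shard that owns the chunk containing a given user_id.
function routeTo(table, userId) {
  const chunk = table.find((c) => userId >= c.min && userId < c.max);
  return chunk ? chunk.shard : null;
}

// Model the commit step: return a new table in which one chunk changes owner.
function commitMigration(table, min, toShard) {
  return table.map((c) => (c.min === min ? { ...c, shard: toShard } : c));
}

const before = routeTo(chunks, 1500);
const migrated = commitMigration(chunks, 1000, "shard2");
const after = routeTo(migrated, 1500);
console.log(before, "->", after);
```

Until the commit, every lookup for user_id 1500 resolves to the source shard; afterwards the same lookup resolves to the destination.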

The balancer itself is controlled by a settings document with _id "balancer", stored in the settings collection of the config database (config.settings).

{
  "_id" : "balancer",
  "stopped" : false,
  "activeWindow" : {
    "start" : "00:00",
    "stop" : "24:00"
  }
}

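Here, stopped: false means the balancer is enabled. activeWindow defines the hours during which balancing can occur, and the window shown above is an illustrative one that spans the whole day. A cluster whose balancer settings have never been changed may not have this document, or an activeWindow field, at all; in that case the balancer is enabled and can run at any time.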

The core problem this solves is preventing data hotspots. Without a balancer, one shard could accumulate most of the data, leading to:

  • Uneven Read/Write Load: Queries targeting the overloaded shard would be slow, impacting application performance.
  • Disk Space Exhaustion: The overloaded shard might run out of disk space while others are mostly empty.
  • Resource Contention: CPU, memory, and network I/O would be disproportionately high on the overloaded shard.

The balancer aims to keep the number of chunks and the data distribution relatively even across all shards, ensuring that read and write operations are spread out.
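
To make the idea of evening out chunk counts concrete, here is a rough simulation in plain JavaScript. It is a sketch under stated assumptions: planMoves and its threshold parameter are hypothetical, and the real balancer uses its own thresholds and, in newer versions, data size rather than chunk count. The sketch repeatedly moves one chunk from the most loaded shard to the least loaded one until the gap falls below the threshold.

```javascript
// Hypothetical model: chunk counts per shard, one chunk moved per step.
// `threshold` is an illustrative parameter, not a MongoDB setting.
function planMoves(counts, threshold = 2) {
  if (threshold < 2) {
    // A gap of 1 cannot be reduced by moving a whole chunk; avoid ping-ponging.
    throw new RangeError("threshold must be at least 2");
  }
  const state = { ...counts };
  const moves = [];
  for (;;) {
    // Order shard names from fewest to most chunks.
    const names = Object.keys(state).sort((a, b) => state[a] - state[b]);
    if (names.length < 2) break;
    const least = names[0];
    const most = names[names.length - 1];
    if (state[most] - state[least] < threshold) break;
    state[most] -= 1;
    state[least] += 1;
    moves.push({ from: most, to: least });
  }
  return { moves, final: state };
}

const plan = planMoves({ shard1: 10, shard2: 4 });
console.log(plan.moves.length, plan.final);
```

Each move narrows the gap by two chunks, which is why the loop stops once the gap is smaller than the threshold instead of chasing a perfectly equal split.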

To tune the balancer, you primarily interact with two concepts: its active window and its throttling.

The activeWindow is crucial for production environments. You might want to prevent balancing during peak hours to avoid any potential, albeit minimal, performance impact. For example, to only allow balancing between 2 AM and 6 AM:

use config
// Restrict balancing to 02:00-06:00; updateOne replaces the deprecated update helper.
db.settings.updateOne(
  { _id: "balancer" },
  { $set: { activeWindow: { start: "02:00", stop: "06:00" } } },
  { upsert: true }
)

This tells the balancer to only consider migrating chunks within this specific window. If a migration is in progress when the window closes, it will be allowed to complete.
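
The window semantics can be sketched as a small check in plain JavaScript. This is an illustration only: toMinutes and inActiveWindow are hypothetical helpers, and treating the stop time as exclusive is an assumption. The sketch also handles a window such as 23:00 to 06:00 that wraps past midnight.

```javascript
// Convert "HH:MM" to minutes since midnight.
function toMinutes(hhmm) {
  const [h, m] = hhmm.split(":").map(Number);
  return h * 60 + m;
}

// True if `now` ("HH:MM") falls inside the window. A window whose stop is
// earlier than its start is treated as wrapping past midnight.
// Assumption: start is inclusive and stop is exclusive.
function inActiveWindow(window, now) {
  const start = toMinutes(window.start);
  const stop = toMinutes(window.stop);
  const t = toMinutes(now);
  if (start <= stop) return t >= start && t < stop;
  return t >= start || t < stop;
}

const nightly = { start: "02:00", stop: "06:00" };
console.log(inActiveWindow(nightly, "03:30"), inActiveWindow(nightly, "14:00"));
```

When choosing a window, remember that the times are interpreted on the server side, so the time zone of the config servers matters more than the time zone of the machine you run the shell on.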

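The other tuning concern is throttling, that is, how hard migrations are allowed to push your network and disks. The setting used here, chunkMigrationMaxRate, is meant to cap how fast chunks are migrated. Confirm that your MongoDB version supports it before relying on it: the throttling options documented for the balancer settings document are _secondaryThrottle and _waitForDelete, and each shard takes part in at most one migration at a time. The defaults are usually sufficient, but in scenarios with very large datasets or constrained network bandwidth between shards, you might need to adjust them.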

To check the current rate:

use config
db.settings.findOne({ _id: "balancer" })

If you need to decrease it, for instance, to 1 chunk per second:

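use config
// chunkMigrationMaxRate is the setting name used in this article;
// confirm that your MongoDB version supports it before applying this.
db.settings.updateOne(
  { _id: "balancer" },
  { $set: { chunkMigrationMaxRate: 1 } },
  { upsert: true }
)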

Increasing this rate (e.g., to 5) can speed up balancing but requires sufficient network bandwidth and I/O capacity on your shards.

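A common misconception is that the balancer is a "fire and forget" feature. It needs monitoring. Since MongoDB 3.4 the balancer runs on the primary of the config server replica set rather than on mongos, so look for chunk migration messages in the config server primary’s log and in the logs of the shards taking part. In the shell, sh.status() summarizes the balancer state and chunk distribution, and sh.isBalancerRunning() reports whether a balancing round is in progress. You can also query the config.chunks collection to see how chunks are distributed.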

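To see the current state of chunk distribution for a specific database (mydatabase) and collection (mycollection), group the chunk documents by shard. The query below matches on the ns field, which config.chunks documents carry before MongoDB 5.0; from 5.0 on, chunks are keyed by the collection’s uuid from config.collections, so match on uuid instead: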

use config
db.chunks.aggregate([
  { $match: { ns: "mydatabase.mycollection" } },
  { $group: { _id: "$shard", count: { $sum: 1 } } }
])

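This shows how many chunks reside on each shard for that collection. Ideally the counts are roughly equal across shards, although newer MongoDB versions balance by data size rather than chunk count, so equal counts are a guide rather than a guarantee. If you see significant disparities, the balancer may still be catching up, may be stopped or outside its active window, or may be held back by chunks it cannot move, so check its state before adjusting its configuration.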
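
To turn the aggregation output into something easier to scan, a small helper can compute each shard’s share. This is a sketch in plain JavaScript: summarizeDistribution is a hypothetical helper, and the sample rows are made-up values in the same shape the aggregation returns.

```javascript
// `rows` has the shape returned by the aggregation above:
// [{ _id: <shard name>, count: <number of chunks> }, ...]
function summarizeDistribution(rows) {
  if (rows.length === 0) return { total: 0, spread: 0, shares: [] };
  const counts = rows.map((r) => r.count);
  const total = counts.reduce((sum, n) => sum + n, 0);
  return {
    total,
    // Gap between the most and least loaded shard, in chunks.
    spread: Math.max(...counts) - Math.min(...counts),
    shares: rows.map((r) => ({
      shard: r._id,
      count: r.count,
      percent: total === 0 ? 0 : Math.round((r.count / total) * 100),
    })),
  };
}

// Made-up sample in the aggregation's output shape.
const sample = [
  { _id: "shard1", count: 12 },
  { _id: "shard2", count: 4 },
];
const summary = summarizeDistribution(sample);
console.log(summary.total, summary.spread, summary.shares);
```

In the shell you could pass the real result with db.chunks.aggregate([...]).toArray(). Keep in mind that a shard holding no chunks for the collection does not appear in the output at all, so compare the result against your full list of shards.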

The most surprising thing about MongoDB’s chunk migration is that it doesn’t simply move data files; it employs a phased approach that allows reads to continue from the source and later from the destination, and a very short commit phase to minimize write disruption, making the process remarkably transparent to applications.

The next concept you’ll likely encounter is how to effectively choose a shard key that promotes good data distribution and minimizes the need for aggressive balancing.
