The MongoDB oplog isn’t just a log; it’s the heartbeat of your replica set. If it’s too small, a lagging secondary can fall off the end of it, and the only way back is a full resync.
Imagine your replica set as a group of people copying a very long book, page by page. The primary writes the original. The secondaries are the copiers. The oplog is a fixed-size tray holding the most recent pages the primary has written. When the tray is full, the primary discards the oldest page to make room for the newest one, and it keeps on writing. If a copier is slow, or the primary writes pages really fast, the page the copier needs next may already be gone. At that point the copier can’t continue from where it left off and has to start again from a fresh copy of the whole book.
The core problem is that the oplog is a capped collection: it has a fixed size and automatically overwrites the oldest entries when it fills up. If a secondary is slow to read and apply operations, and the oplog is too small, the primary can overwrite an operation before the secondary has fetched it. The secondary is then too stale to catch up. It can no longer replicate from the oplog, and it stays out of service until you resynchronize it from scratch with an initial sync.
Here’s how to diagnose and fix your oplog size:
1. Check Current Oplog Size and Usage
First, you need to see how big your oplog is and how much of it is being used. Connect to your primary MongoDB instance using the mongosh shell.
mongosh "mongodb://your_primary_host:27017"
Once connected, run this command:
db.getReplicationInfo()
This will output something like:
{
"logSizeMB" : 512,
"usedMB" : 450,
"timeDiff" : 3600,
"timeDiffHours" : 1
}
logSizeMB is the total size of the oplog, and usedMB is how much of it is currently filled. timeDiff is the duration in seconds that the current oplog entries cover. A small timeDiff indicates that operations are being written and then overwritten very quickly.
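To make those numbers easier to read at a glance, here is a minimal sketch that turns the output into a fill percentage and a window in hours. The helper name summarizeOplog is made up for this example and is not a mongosh built-in; it only assumes the logSizeMB, usedMB, and timeDiff fields described above.

```javascript
// Summarize the document returned by db.getReplicationInfo().
// summarizeOplog is a hypothetical helper, not part of mongosh.
function summarizeOplog(info) {
  const fillPercent = (info.usedMB / info.logSizeMB) * 100;
  const windowHours = info.timeDiff / 3600; // timeDiff is in seconds
  return {
    fillPercent: Math.round(fillPercent * 10) / 10, // one decimal place
    windowHours: Math.round(windowHours * 100) / 100, // two decimal places
  };
}

// In mongosh you would pass db.getReplicationInfo() instead of this sample,
// which reuses the example values shown above.
const sampleInfo = { logSizeMB: 512, usedMB: 450, timeDiff: 3600 };
const oplogSummary = summarizeOplog(sampleInfo);
console.log(oplogSummary);
```

The window in hours is the number to watch: it tells you how long a secondary can be offline or lagging before it needs a full resync.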
2. Identify the Cause of Oplog Lag
There are several reasons why a secondary might lag far enough behind to fall off the end of the oplog:
- Insufficient Oplog Size: This is the most common. The oplog simply isn’t large enough to accommodate the write volume and the replication lag.
- Slow Secondaries: The secondary servers themselves might be under-resourced (CPU, RAM, disk I/O) or experiencing network issues, preventing them from applying oplog entries fast enough.
- Large Write Operations: A single, very large write operation (e.g., inserting millions of documents in one go, or an aggregation that writes its results out with $out or $merge) can consume a disproportionate amount of oplog space, because every document written gets its own oplog entry.
- High Write Volume: A sustained high rate of writes to the primary can overwhelm even a reasonably sized oplog if secondaries can’t keep up.
- Network Latency/Bandwidth: If the network between the primary and secondaries is slow or unreliable, replication will be delayed.
- Disk I/O on Secondaries: The secondary servers need to read from the oplog and write to their data files. If their disk I/O is saturated, they’ll fall behind.
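Whatever the cause, the first thing to measure is how far each secondary actually is behind the primary. The sketch below derives that from an rs.status() document by comparing each member's optimeDate with the primary's. The function name replicationLagSeconds and the host names in the sample are placeholders for this example.

```javascript
// Compute per-secondary replication lag from an rs.status()-style document.
// replicationLagSeconds is a hypothetical helper, not part of mongosh.
function replicationLagSeconds(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) {
    throw new Error("no primary found in replica set status");
  }
  return status.members
    .filter((m) => m.stateStr === "SECONDARY")
    .map((m) => ({
      name: m.name,
      // Subtracting two Date objects yields milliseconds.
      lagSeconds: (primary.optimeDate - m.optimeDate) / 1000,
    }));
}

// In mongosh you would pass rs.status(); this mock uses made-up hosts and times.
const mockStatus = {
  members: [
    { name: "host-a:27017", stateStr: "PRIMARY", optimeDate: new Date("2024-01-01T12:00:00Z") },
    { name: "host-b:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-01-01T11:59:55Z") },
    { name: "host-c:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-01-01T11:50:00Z") },
  ],
};
const lagReport = replicationLagSeconds(mockStatus);
console.log(lagReport);
```

A secondary whose lag keeps growing under steady load points at the secondary itself or the network; lag that spikes only during bulk jobs points at large or bursty writes.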
3. Determine the Required Oplog Size
A common rule of thumb is that your oplog should be large enough to hold at least 24 hours of operations at your peak write rate, plus headroom for maintenance windows and initial syncs.
Let’s say you have a peak write rate of 1000 operations per second, and each oplog entry is roughly 1 KB. That’s about 1 MB per second. Over 24 hours (86,400 seconds), that’s 86.4 GB. If your peak rate or your entries are larger than that, scale the estimate accordingly.
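The same arithmetic as a small function, in case you want to try other rates or windows. This is a rough estimate only: it uses the same round numbers as the text (1 KB = 1000 bytes, 1 MB = 1,000,000 bytes), and the function name and the optional headroom multiplier are assumptions made for this sketch.

```javascript
// Estimate the oplog size needed to cover a given time window.
// requiredOplogMB is a hypothetical helper; headroom is an optional safety multiplier.
function requiredOplogMB(opsPerSecond, bytesPerOp, windowHours, headroom = 1) {
  const bytes = opsPerSecond * bytesPerOp * windowHours * 3600 * headroom;
  return Math.ceil(bytes / 1e6); // decimal megabytes, matching the text
}

// 1000 ops/s at roughly 1 KB each, kept for 24 hours.
const baselineMB = requiredOplogMB(1000, 1000, 24);
// The same workload with 50% headroom.
const paddedMB = requiredOplogMB(1000, 1000, 24, 1.5);
console.log({ baselineMB, paddedMB });
```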
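A more practical approach is to look at your db.getReplicationInfo() output over time. Once the oplog has wrapped around, usedMB sitting close to logSizeMB is normal for a capped collection, so the number that matters is timeDiff. If timeDiff is consistently low (e.g., less than a few hours), or shorter than the longest outage you want a secondary to survive, you need a larger oplog.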
4. Resizing the Oplog
Since MongoDB 3.6, you can resize the oplog on a running replica set without downtime, provided the members use the WiredTiger storage engine. The replSetResizeOplog command changes the size on one member at a time, so run it on each secondary first and on the primary last:
- Connect directly to a secondary:
mongosh "mongodb://your_secondary_host:27017"
- Check the current oplog size. maxSize is reported in bytes:
use local
db.oplog.rs.stats().maxSize
- Resize the oplog. The size is specified in MB and must be at least 990. For example, to set it to 50 GB (51200 MB):
db.adminCommand({ replSetResizeOplog: 1, size: 51200 })
- Repeat on the remaining secondaries, then on the primary.
Don’t try to drop and recreate local.oplog.rs by hand. The size option of db.createCollection() is in bytes, not MB, and a member that loses its oplog history has to be resynced from scratch.
Why this works: replSetResizeOplog raises the cap on the existing capped collection in place, so the entries already in the oplog are kept and replication carries on uninterrupted. With a larger cap, the oldest entries survive longer, which gives a lagging secondary more time to catch up.
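Whichever way you resize, it is worth confirming the new cap on every member afterwards. Here is a minimal sketch that checks the maxSize value (in bytes) from db.getSiblingDB("local").oplog.rs.stats() against a target in MB. The helper name checkOplogSize is invented for this example, and the string conversion is a defensive assumption in case the shell hands back a 64-bit integer wrapper rather than a plain number.

```javascript
// Compare the oplog's configured cap with a target size in MB.
// checkOplogSize is a hypothetical helper, not part of mongosh.
function checkOplogSize(stats, targetMB) {
  // Accept a plain number or a wrapper object that stringifies to a decimal integer.
  const maxBytes =
    typeof stats.maxSize === "number" ? stats.maxSize : Number(stats.maxSize.toString());
  const actualMB = maxBytes / (1024 * 1024);
  return { actualMB, ok: actualMB >= targetMB };
}

// In mongosh: checkOplogSize(db.getSiblingDB("local").oplog.rs.stats(), 51200)
// Mock stats for a member that was resized to 50 GB and one still at 512 MB.
const resizedCheck = checkOplogSize({ maxSize: 51200 * 1024 * 1024 }, 51200);
const staleCheck = checkOplogSize({ maxSize: 512 * 1024 * 1024 }, 51200);
console.log(resizedCheck, staleCheck);
```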
5. Optimizing Secondary Performance
If resizing the oplog doesn’t solve the problem, or if you have a truly massive write load, you need to address secondary performance:
- Resource Allocation: Ensure your secondary servers have adequate CPU, RAM, and fast disk I/O (SSDs are highly recommended).
- Network: Verify network connectivity and bandwidth between your primary and secondaries.
- Monitoring: Monitor your secondary servers for resource contention (CPU, disk I/O, network) using tools like mongostat and system-level monitoring.
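One check worth adding to that monitoring is how much of the oplog window a secondary's lag has already eaten. The sketch below classifies a secondary from two numbers you already have: its lag in seconds and the oplog window in seconds (timeDiff). The function name, the level labels, and the 50% warning threshold are arbitrary choices for this example, not MongoDB defaults.

```javascript
// Classify how close a secondary is to falling off the oplog.
// staleRisk and its thresholds are hypothetical, chosen for illustration.
function staleRisk(lagSeconds, oplogWindowSeconds, warnRatio = 0.5) {
  if (oplogWindowSeconds <= 0) {
    throw new Error("oplog window must be positive");
  }
  const ratio = lagSeconds / oplogWindowSeconds;
  let level = "ok";
  if (ratio >= 1) {
    level = "stale"; // lag exceeds the window: the needed entries are likely gone
  } else if (ratio >= warnRatio) {
    level = "warning"; // more than warnRatio of the window is already used up
  }
  return { ratio, level };
}

console.log(staleRisk(60, 3600)); // one minute behind, one-hour window
console.log(staleRisk(2700, 3600)); // 45 minutes behind, one-hour window
```

Alerting on the ratio rather than on raw lag keeps the threshold meaningful after you resize the oplog.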
6. The Next Problem You’ll Hit
Once your oplog is sized correctly and your secondaries are keeping up, the next challenge you’ll face is managing read load. You might start seeing increased latency on read operations as your secondaries become busy applying writes, leading you to consider read preference configurations.