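The MongoDB replica set primary is crashing because its `mongod` process consumes too much memory. The WiredTiger cache has not been sized explicitly, so it falls back to its default (the larger of 50% of (RAM - 1 GB) and 256 MB), and `mongod` allocates further memory outside the cache for connections, sorts, and aggregations. On a host that is shared with other processes or that has a tight memory limit, total usage can grow until the OS OOM killer terminates the process to protect system stability.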
Common Causes and Fixes
1. Default WiredTiger Cache Size (Most Common)
- Diagnosis: Check the `mongod` process's memory usage. On Linux, run `ps aux | grep mongod` and look at the `RSS` (Resident Set Size) column. If the process holds a large share of system RAM (e.g., more than 70-80%), this is the likely culprit.
- Fix: Set the `storage.wiredTiger.engineConfig.cacheSizeGB` parameter in your `mongod.conf` file. The default is already roughly 50% of system RAM, so on a host that `mongod` shares with other processes, choose an explicit, lower value such as `4` (GB) that leaves room for the OS and everything else.

      storage:
        wiredTiger:
          engineConfig:
            cacheSizeGB: 4  # Example: set cache to 4GB

- Why it works: This caps the amount of RAM WiredTiger can use for its internal cache, preventing the cache from crowding out the rest of the system. Note that `mongod` still needs memory beyond the cache, so its total RSS will be higher than this value.
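To see what an unconfigured cache amounts to on a given host, the default sizing rule can be sketched as a small function. This is a minimal illustration of the documented formula (the larger of 50% of (RAM - 1 GB) and 0.25 GB); the function name `defaultCacheSizeGB` is made up for this example and is not part of MongoDB.

```javascript
// Sketch of the documented default WiredTiger cache size:
// the larger of 50% of (RAM - 1 GB) and 0.25 GB (256 MB).
// `defaultCacheSizeGB` is a hypothetical helper name, not a MongoDB API.
function defaultCacheSizeGB(totalRamGB) {
  return Math.max(0.5 * (totalRamGB - 1), 0.25);
}

console.log(defaultCacheSizeGB(16)); // 7.5
console.log(defaultCacheSizeGB(8));  // 3.5
console.log(defaultCacheSizeGB(1));  // 0.25
```

Comparing this number with the RAM that other processes on the host need gives a rough idea of whether an explicit, lower `cacheSizeGB` is warranted.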
2. Large Working Sets Exceeding Cache
- Diagnosis: Even with a configured cache, if your working set (the data and indexes that queries touch frequently) is larger than the allocated cache, WiredTiger constantly evicts pages and reads them back from disk, which shows up as heavy disk I/O and slow operations. Run `db.serverStatus().wiredTiger.cache` and compare `bytes currently in the cache` with `maximum bytes configured`. Also look at `pages read into cache`, `unmodified pages evicted`, and `modified pages evicted`. Eviction and read counters that climb quickly suggest the cache is too small for your working set.
- Fix: Increase `storage.wiredTiger.engineConfig.cacheSizeGB` in `mongod.conf` to accommodate your working set, provided the host has enough RAM to back it. For example, if your working set is consistently around 10GB, you might set it to `12` or `16`.

      storage:
        wiredTiger:
          engineConfig:
            cacheSizeGB: 12  # Example: increased cache size

- Why it works: A larger cache keeps more of your frequently accessed data in RAM, reducing disk I/O and eviction churn. If the host cannot spare the RAM, raising the cache only makes an OOM kill more likely, so add memory or shrink the working set instead.
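The cache statistics are easier to read as ratios than as raw byte counts. The sketch below assumes an object shaped like the `wiredTiger.cache` section of `db.serverStatus()`; the helper name `cachePressure` and the sample numbers are invented for illustration and do not come from a real server.

```javascript
// Summarize cache fill and dirty ratios from the `wiredTiger.cache`
// section of db.serverStatus(). `cachePressure` is a hypothetical helper.
function cachePressure(cache) {
  const used = Number(cache["bytes currently in the cache"]);
  const max = Number(cache["maximum bytes configured"]);
  const dirty = Number(cache["tracked dirty bytes in the cache"]);
  return {
    fillRatio: used / max,   // share of the configured cache in use
    dirtyRatio: dirty / max, // share holding modified, unflushed pages
  };
}

// Made-up sample values for a 4 GB cache.
const sampleCache = {
  "bytes currently in the cache": 3 * 1024 ** 3,
  "maximum bytes configured": 4 * 1024 ** 3,
  "tracked dirty bytes in the cache": 0.5 * 1024 ** 3,
};

console.log(cachePressure(sampleCache)); // { fillRatio: 0.75, dirtyRatio: 0.125 }
```

In `mongosh` the same function could be called as `cachePressure(db.serverStatus().wiredTiger.cache)`. WiredTiger by default tries to hold the cache at roughly 80% full, so a fill ratio that stays well above that for long periods, together with fast-growing eviction counters, is one sign the working set does not fit.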
3. Excessive Index Usage and Fragmentation
- Diagnosis: Large or numerous indexes can consume significant memory. Use `db.collection.stats()` to check the size of your indexes (`totalIndexSize` and `indexSizes`). High memory usage might be due to indexes that are unused, redundant, or heavily fragmented. Run `db.collection.getIndexes()` and examine the index definitions for overlap and bloat.
- Fix: Rebuild fragmented indexes or drop unused/redundant ones. For example, to rebuild the indexes on a collection:

      db.collection.reIndex()

  In MongoDB 5.0 and later, `reIndex()` can only be run on standalone instances, so on a replica set you would drop and recreate the index instead. If an index is no longer needed, drop it:

      db.collection.dropIndex("index_name")

- Why it works: Rebuilding can defragment indexes, reducing their on-disk and in-memory footprint. Dropping unneeded indexes directly frees up memory that was used to hold their structures.
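When a collection has many indexes, ranking them by size helps decide which ones to review first. The following sketch works on an object shaped like the output of `db.collection.stats()`, using its `indexSizes` map and `totalIndexSize` field; the helper name `rankIndexesBySize` and the sample index names and sizes are hypothetical.

```javascript
// Rank indexes from largest to smallest using the `indexSizes` map
// returned by db.collection.stats(). `rankIndexesBySize` is a hypothetical helper.
function rankIndexesBySize(stats) {
  return Object.entries(stats.indexSizes)
    .map(([name, bytes]) => ({
      name,
      bytes,
      share: bytes / stats.totalIndexSize, // fraction of all index bytes
    }))
    .sort((a, b) => b.bytes - a.bytes);
}

// Made-up sample values, not output from a real collection.
const sampleStats = {
  totalIndexSize: 1000,
  indexSizes: { _id_: 200, email_1: 300, created_at_1: 500 },
};

console.log(rankIndexesBySize(sampleStats).map((i) => i.name));
// [ 'created_at_1', 'email_1', '_id_' ]
```

Size alone does not show whether an index is used. The `$indexStats` aggregation stage reports access counts per index, which is a better basis for deciding what to drop.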
4. Large Documents and Embedded Arrays
- Diagnosis: Storing very large documents (approaching the 16MB BSON limit) or documents with deeply nested arrays can increase memory usage when these documents are loaded into memory for processing. Analyze your schema for documents that are unusually large.
- Fix: Normalize your schema by breaking down large documents into smaller, related documents. This reduces the amount of data that needs to be loaded into memory at once for a single operation.
- Why it works: Smaller documents are more efficiently handled by the cache and require less memory to load and manipulate, reducing the overall memory footprint per document.
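One way to find unusually large documents is an aggregation that computes each document's BSON size and sorts by it. The sketch below only builds the pipeline as a plain array, so it runs without a server; it assumes MongoDB 4.4 or later for the `$bsonSize` operator, and the helper name `largestDocsPipeline` and the field name `sizeBytes` are made up for this example.

```javascript
// Build an aggregation pipeline that returns the `limit` largest documents
// (by BSON size) in a collection. Requires MongoDB 4.4+ for $bsonSize.
// `largestDocsPipeline` and `sizeBytes` are hypothetical names.
function largestDocsPipeline(limit) {
  return [
    { $project: { sizeBytes: { $bsonSize: "$$ROOT" } } }, // keeps _id plus the size
    { $sort: { sizeBytes: -1 } },                         // largest first
    { $limit: limit },
  ];
}

console.log(JSON.stringify(largestDocsPipeline(5), null, 2));
```

In `mongosh` this could be run as `db.orders.aggregate(largestDocsPipeline(5))`, where `orders` stands in for your own collection. The pipeline scans every document, so on a large collection it is probably best run against a secondary or during a quiet period.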
5. WiredTiger Transaction Log Size
- Diagnosis: Although it rarely causes an OOM kill directly, a very large WiredTiger journal (the `WiredTigerLog.<sequence>` files in the journal directory) can indirectly add memory pressure if checkpoints are not running efficiently or a backlog builds up. Check the size of the files in the `journal` directory within your data directory.
- Fix: Keep `storage.journal.commitIntervalMs` within its valid range of 1 to 500 milliseconds; the default is `100`. On versions that still accept `storage.journal.enabled`, make sure it is `true` (MongoDB 6.1 removed the option because journaling is always enabled). WiredTiger runs checkpoints automatically, but heavy write activity that outpaces checkpointing can lead to large journal files.

      storage:
        journal:
          enabled: true           # Omit on MongoDB 6.1+, where the option was removed
          commitIntervalMs: 100   # Example: the default, commit every 100 ms

- Why it works: This parameter controls how often MongoDB flushes journaled writes to disk. A reasonable interval preserves durability without overwhelming the journaling system or building up memory pressure from unflushed writes.
6. MongoDB Version and Bugs
- Diagnosis: Older MongoDB versions might have memory leaks or inefficiencies in the WiredTiger engine. Check your MongoDB version.
- Fix: Upgrade to the latest stable patch release of your current major version, or to a newer major version if recommended. Follow the official upgrade guide carefully.
- Why it works: Newer versions often contain bug fixes and performance improvements, including optimizations that reduce memory consumption.
After applying these fixes, restart your mongod instances. The next error you might encounter is related to insufficient disk I/O if your disk subsystem is not keeping up with the increased read/write demands from a larger working set.