RocksDB, the state store backend for Kafka Streams, can be a performance bottleneck if not tuned correctly, often manifesting as high latency in stream processing or even out-of-memory errors due to excessive memory usage.

Common Causes and Fixes for RocksDB Performance Issues

The core of the problem is usually RocksDB’s internal data management, specifically how it handles writes, reads, and memory. Let’s break down the most common culprits and how to address them.

1. Excessive Memory Usage Due to Unflushed Writes (Write Stall)

  • Diagnosis: Monitor the rocksdb.num-running-flushes and rocksdb.num-running-compactions metrics. If these are consistently high, especially num-running-flushes, it indicates writes are piling up faster than RocksDB can flush them to disk. Also, check rocksdb.block-cache-usage and rocksdb.mem-table-flush-pending.
  • Cause: Kafka Streams writes to RocksDB’s in-memory MemTables. When a MemTable reaches a certain size, it’s flushed to disk as an immutable file (SSTable). If the flush rate can’t keep up with the write rate, MemTables accumulate, consuming memory and eventually leading to write stalls. This is exacerbated by small, frequent writes.
  • Fix: Increase max-write-buffer-number and write-buffer-size.
    • Command/Config: In your Kafka Streams application’s StreamsConfig, set:
      props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, MyRocksDBConfigSetter.class.getName());
      
      And in MyRocksDBConfigSetter:
      public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
          // Example: Increase write buffer size and number of buffers
          options.setWriteBufferSize(64 * 1024 * 1024); // 64MB
          options.setMaxWriteBufferNumber(6);
          // Consider also:
          // options.setMinWriteBufferNumberToMerge(2);
      }
      
    • Why it works: Larger write buffers allow more data to be buffered in memory before a flush is triggered, smoothing out write bursts. Having more write buffers means RocksDB can have more MemTables ready to be flushed, providing more concurrency for write operations.

2. Slow Reads Due to Inefficient Compaction

  • Diagnosis: Monitor rocksdb.num-running-compactions and rocksdb.background-compactions-scheduled. High values and prolonged periods of high compaction activity indicate that RocksDB is spending too much CPU and I/O on merging SSTables. Also, observe rocksdb.block-cache-hit-ratio. A low hit ratio might suggest data isn’t staying in cache effectively.
  • Cause: RocksDB uses a Log-Structured Merge-tree (LSM-tree) structure. Data is written to MemTables, flushed to SSTables, and then these SSTables are periodically merged (compacted) into larger ones to improve read performance and reduce file count. If compaction is too aggressive or not configured optimally, it can consume significant resources.
  • Fix: Tune compaction strategies and levels.
    • Command/Config: In MyRocksDBConfigSetter:
      public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
          // Example: Use universal compaction and adjust levels
          options.setCompactionStyle(CompactionStyle.UNIVERSAL);
          // For universal compaction, tune max_subcompactions if using parallel threads
          // options.setMaxSubcompactions(2);
      
          // If using Level-based compaction, tune levels and tier sizes.
          // For example, set number of levels:
          // options.setLevels(7);
          // And tune size ratios:
          // options.setMaxBytesForLevelBase(1024 * 1024 * 1024); // 1GB for level 1 base
          // options.setTargetFileSizeBase(256 * 1024 * 1024); // 256MB target file size
      }
      
    • Why it works: Universal compaction (CompactionStyle.UNIVERSAL) merges all SSTables into a single one, which can be very efficient for read-heavy workloads or when the total data size is manageable. Level-based compaction allows finer control over how many levels of SSTables exist and their sizes, balancing read performance with write amplification. Tuning max_subcompactions or level sizes helps control the intensity of background merging.

3. Poor Cache Hit Ratio

  • Diagnosis: Monitor rocksdb.block-cache-usage and rocksdb.block-cache-hit-ratio. A consistently low hit ratio (e.g., below 80-90%) indicates data is frequently being fetched from disk rather than cache.
  • Cause: RocksDB uses a block cache to store frequently accessed data pages in memory. If the cache is too small, or if the data access pattern is highly sequential and doesn’t re-access pages often, the hit ratio will be low.
  • Fix: Increase the block-cache-size.
    • Command/Config: In MyRocksDBConfigSetter:
      public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
          // Example: Allocate 25% of JVM heap to block cache
          long heapSize = Runtime.getRuntime().maxMemory();
          long cacheSize = heapSize / 4; // 25%
          options.setBlockCacheSize(cacheSize);
          // Also consider enabling block cache for compressed blocks if using compression
          // options.setBlockCacheCompressedSize(cacheSize / 2);
      }
      
    • Why it works: A larger block cache can hold more data pages in memory, increasing the probability that subsequent reads find the data already cached, thus reducing disk I/O.

4. High Write Amplification

  • Diagnosis: This is harder to directly monitor with standard metrics but can be inferred. High compaction activity (rocksdb.num-running-compactions) and increasing disk I/O usage, especially during periods of moderate writes, can be indicators.
  • Cause: Write amplification occurs when the amount of data written to disk is significantly larger than the amount of data written by the application. This is inherent in LSM-trees due to the flushing and merging of SSTables. Aggressive compaction or suboptimal settings can worsen this.
  • Fix: Adjust compaction and MemTable settings.
    • Command/Config: In MyRocksDBConfigSetter:
      public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
          // Example: Reduce number of levels to reduce merge complexity
          options.setLevels(4); // Default is 7 for some strategies
          // Increase target file size for levels
          options.setTargetFileSizeBase(512 * 1024 * 1024); // 512MB
          // Consider enabling `disable_auto_compactions` and manually triggering compactions
          // if you have predictable write patterns, but this is complex.
      }
      
    • Why it works: Reducing the number of compaction levels or increasing target file sizes means fewer merge operations are needed overall, reducing the total amount of data rewritten to disk.

5. Incorrect Compression Settings

  • Diagnosis: Monitor CPU usage. If CPU is maxed out consistently, it might be due to compression/decompression overhead. Also, observe disk I/O. If disk I/O is surprisingly low for the amount of data being processed, compression might be the cause.
  • Cause: RocksDB supports various compression algorithms (Snappy, LZ4, Zstd). While compression reduces disk space and I/O, it adds CPU overhead for compression and decompression.
  • Fix: Choose an appropriate compression algorithm.
    • Command/Config: In MyRocksDBConfigSetter:
      public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
          // Example: Use Zstandard for better compression ratio (higher CPU)
          // options.setCompressionType(CompressionType.ZSTD);
          // options.setZstdCompressionLevel(3); // Adjust level for trade-off
      
          // Or use LZ4 for faster compression/decompression (lower CPU, less space saving)
          options.setCompressionType(CompressionType.LZ4);
          options.setZstdCompressionLevel(1); // This is ignored for LZ4, but good practice to set
      }
      
    • Why it works: Zstd generally offers the best compression ratio at the cost of higher CPU. LZ4 is much faster, reducing CPU load but saving less space. The optimal choice depends on whether your bottleneck is CPU or I/O.

6. Insufficient I/O Throughput on Storage

  • Diagnosis: Monitor disk I/O wait times (iowait in top or vmstat) and disk throughput (e.g., iostat). If these are consistently high, your underlying storage might be saturated.
  • Cause: RocksDB’s performance is fundamentally limited by the speed of the underlying storage. If your SSDs or network storage cannot keep up with the combined demands of flushes, compactions, and reads, performance will suffer.
  • Fix: Upgrade storage or optimize I/O.
    • Command/Config: This is usually an infrastructure change. Ensure you are using fast NVMe SSDs. If on cloud, provision instances with local SSDs or higher IOPS EBS volumes.
    • Why it works: Faster storage directly increases the rate at which RocksDB can read and write data, alleviating I/O bottlenecks.

After addressing these, the next common issue you might encounter is java.lang.OutOfMemoryError: Java heap space if RocksDB’s internal memory structures (like block cache) are too large relative to the JVM heap, or if Kafka Streams itself is consuming too much heap.

Want structured learning?

Take the full Kafka course →