The core issue is that your monolith’s request handling has become a distributed system problem, but it’s still running as a single process.
Here’s how you find and fix the bottlenecks:
Common Causes and Fixes
-
Database Connection Pool Exhaustion
- Diagnosis: Check your database connection pool metrics. Look for `ActiveConnections` approaching `MaxConnections` or `ThreadsWaitingForConnection` rising. In PostgreSQL, `pg_stat_activity` can show many idle connections.
- Fix: Increase the `maxPoolSize` in your application’s database connection configuration. For example, if you’re using HikariCP, change `maximumPoolSize=10` to `maximumPoolSize=25`.
- Why it works: This allows more concurrent requests to acquire a database connection simultaneously, preventing threads from being blocked waiting for an available connection.
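To sanity-check a pool size before changing it, a rough estimate from Little's law can help: the average number of connections in use is roughly the request rate multiplied by how long each request holds a connection. The sketch below is only an illustration; the class name `PoolSizing` and the traffic figures are hypothetical, and real sizing should be confirmed against your own pool metrics.

```java
/**
 * Back-of-the-envelope pool sizing using Little's law.
 * Hypothetical helper, not part of HikariCP or any other library.
 */
final class PoolSizing {
    private PoolSizing() {}

    /**
     * Average connections in use = arrival rate x average hold time.
     *
     * @param requestsPerSecond requests per second that need a connection
     * @param avgHoldMillis     average time a request holds a connection, in milliseconds
     * @return the average number of busy connections, rounded up
     */
    static int connectionsInUse(double requestsPerSecond, double avgHoldMillis) {
        if (requestsPerSecond < 0 || avgHoldMillis < 0) {
            throw new IllegalArgumentException("rate and hold time must be non-negative");
        }
        return (int) Math.ceil(requestsPerSecond * avgHoldMillis / 1000.0);
    }
}
```

With these assumed numbers, `PoolSizing.connectionsInUse(200, 100)` comes out at 20 busy connections on average, so a pool of 10 would queue requests while a pool of 25 leaves some headroom for bursts.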
-
Excessive Garbage Collection (GC) Pauses
- Diagnosis: Monitor your JVM GC logs. Look for long `Full GC` pauses (e.g., > 500ms) or frequent `Minor GC` pauses that significantly impact application throughput. Tools like `jvisualvm` or `GCViewer` can visualize this.
- Fix:
- Tune GC Algorithm: Switch to a more modern GC like G1GC (`-XX:+UseG1GC`) or ZGC (`-XX:+UseZGC` for newer JDKs).
- Increase Heap Size: If objects are being promoted too quickly, increase the heap size: `-Xmx4g -Xms4g`.
- Reduce Object Allocation: Profile your code to find high-allocation sites and optimize them.
- Why it works: Different GC algorithms have different pause time characteristics. G1GC and ZGC are designed for lower pause times. A larger heap gives the GC more room to work before needing to collect, and reducing object churn directly lowers GC pressure.
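If you want a quick in-process view alongside the GC logs, the standard `java.lang.management` API exposes per-collector counters. The following is a minimal sketch; the class name `GcStats` is made up for illustration. Note that `getCollectionTime()` is a cumulative, approximate figure, so it will not show individual pause lengths. You still need the GC logs for those.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that summarizes the JVM's garbage collector counters. */
final class GcStats {
    private GcStats() {}

    /** Returns one line per collector: name, collection count, cumulative time in ms. */
    static List<String> summarize() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both counters return -1 if the collector does not report them.
            lines.add(String.format("%s: collections=%d, totalTimeMs=%d",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
        }
        return lines;
    }
}
```

Calling `GcStats.summarize()` periodically (for example from a scheduled task) and comparing successive values gives a rough sense of how much time is going to collection between samples.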
-
Thread Contention and Deadlocks
- Diagnosis: Use thread dumps to identify threads in `BLOCKED` or `WAITING` states. Look for patterns where threads are waiting on each other for locks. `jstack <pid>` or `jcmd <pid> Thread.print` are your friends here. Profilers like YourKit or JProfiler can visualize lock contention.
- Fix:
- Reduce Synchronization Scope: Make `synchronized` blocks as small as possible.
- Use Concurrent Data Structures: Replace `synchronizedMap` with `ConcurrentHashMap`.
- Avoid Nested Locks: If you must use multiple locks, acquire them in a consistent order across all threads.
- Why it works: Minimizing the time spent holding locks or using lock-free data structures reduces the probability of threads blocking each other. Consistent lock ordering prevents circular dependencies that lead to deadlocks.
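Consistent lock ordering is easier to see in code. The sketch below assumes a hypothetical `Account` class with a unique `id`; both locks are always taken in ascending `id` order, so two threads transferring in opposite directions cannot end up waiting on each other in a cycle. It is one way to apply the idea, not the only one.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical account type used only to illustrate ordered lock acquisition. */
final class Account {
    final int id;                                  // unique id, doubles as the global lock order
    final ReentrantLock lock = new ReentrantLock();
    long balance;

    Account(int id, long balance) {
        this.id = id;
        this.balance = balance;
    }

    /** Moves amount from one account to another, always locking the lower id first. */
    static void transfer(Account from, Account to, long amount) {
        Account first = from.id < to.id ? from : to;
        Account second = (first == from) ? to : from;
        first.lock.lock();
        try {
            second.lock.lock();
            try {
                from.balance -= amount;
                to.balance += amount;
            } finally {
                second.lock.unlock();
            }
        } finally {
            first.lock.unlock();
        }
    }

    /** Runs two threads transferring in opposite directions; returns the combined balance. */
    static long runOpposingTransfers(int iterations) {
        Account a = new Account(1, 100);
        Account b = new Account(2, 100);
        Thread t1 = new Thread(() -> { for (int i = 0; i < iterations; i++) transfer(a, b, 1); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < iterations; i++) transfer(b, a, 1); });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for transfers", e);
        }
        return a.balance + b.balance;
    }
}
```

If `transfer` locked `from` before `to` regardless of id, the same two threads could each hold one lock and wait forever for the other, which is exactly the pattern a thread dump would show as a deadlock.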
-
Inefficient Application Code (CPU-Bound)
- Diagnosis: Use a CPU profiler (e.g., `async-profiler`, `perf`, `VisualVM`’s sampler) to identify methods consuming the most CPU time. Look for hot spots in your application logic, not just external calls.
- Fix:
- Algorithmic Improvements: Refactor inefficient algorithms (e.g., O(n^2) to O(n log n)).
- Caching: Implement in-memory caches for frequently accessed, expensive-to-compute data.
- Batching: Group similar operations to reduce overhead.
- Why it works: Directly optimizing the code that consumes the most CPU cycles reduces the overall processing time per request, freeing up threads and resources.
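For the caching fix, a small memoizing wrapper around `ConcurrentHashMap.computeIfAbsent` is often enough to get started. This is a minimal sketch with a hypothetical `MemoizingCache` class; it never evicts anything, so it only suits a small, bounded key space. The `loadCount()` method is there purely to make the effect visible.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical in-memory cache that computes each key's value at most once. */
final class MemoizingCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;
    private final AtomicInteger loads = new AtomicInteger();

    MemoizingCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, invoking the loader only on the first request for a key. */
    V get(K key) {
        return cache.computeIfAbsent(key, k -> {
            loads.incrementAndGet();   // counts how often the expensive path actually runs
            return loader.apply(k);
        });
    }

    /** Number of times the loader has been invoked so far. */
    int loadCount() {
        return loads.get();
    }
}
```

Wrapping an expensive function, say `new MemoizingCache<String, Integer>(String::length)` as a stand-in, and calling `get("abc")` twice runs the loader once. The loader must not call back into the same cache, because `computeIfAbsent` does not support recursive updates.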
-
Slow External API Calls (Network I/O Bound)
- Diagnosis: Monitor your application’s network I/O. Look for high latency on outbound HTTP requests or other network calls. Tracing tools like Jaeger or Zipkin are invaluable here, showing the duration of each span, including external service calls.
- Fix:
- Asynchronous Calls: Use non-blocking I/O (e.g., `CompletableFuture`, reactive libraries) for external calls.
- Timeouts and Retries: Configure aggressive but reasonable timeouts for external calls (`connectTimeout=500ms`, `readTimeout=1000ms` for HTTP clients). Implement backoff-based retries.
- Circuit Breakers: Implement circuit breaker patterns (e.g., Resilience4j) to quickly fail requests to an unhealthy external service.
- Why it works: Non-blocking I/O allows your threads to do other work while waiting for external responses. Proper timeouts and circuit breakers prevent cascading failures and stop your application from spending excessive time waiting on unresponsive services.
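The timeout and backoff ideas can be sketched with nothing but the JDK (Java 9 or later for `orTimeout`). The class `ResilientCall` and its parameters are hypothetical, and this is deliberately simpler than what a library such as Resilience4j provides: there is no jitter, no circuit breaker, and a timed-out task keeps running in the background because `orTimeout` does not cancel the underlying work.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical helper: per-attempt timeout plus exponential backoff between retries. */
final class ResilientCall {
    private ResilientCall() {}

    /**
     * Runs the call asynchronously, fails any attempt that exceeds the timeout,
     * and retries with delays of baseDelay, 2 x baseDelay, 4 x baseDelay, and so on.
     */
    static <T> T callWithRetry(Supplier<T> call, int maxAttempts,
                               Duration timeout, Duration baseDelay) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return CompletableFuture.supplyAsync(call)
                        .orTimeout(timeout.toMillis(), TimeUnit.MILLISECONDS)
                        .join();
            } catch (CompletionException e) {
                last = e;  // wraps either the call's own exception or a TimeoutException
            }
            if (attempt < maxAttempts) {
                sleep(baseDelay.toMillis() << (attempt - 1));  // exponential backoff
            }
        }
        throw last;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during backoff", e);
        }
    }
}
```

A call such as `ResilientCall.callWithRetry(() -> fetchQuote(), 3, Duration.ofMillis(1000), Duration.ofMillis(100))`, where `fetchQuote` stands in for your own client code, would give each attempt one second and wait 100 ms, then 200 ms, between attempts. Only retry operations that are safe to repeat.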
-
Insufficient Application Server Threads
- Diagnosis: Monitor your application server’s thread pool (e.g., Tomcat’s `maxThreads`, Undertow’s `worker-threads`). If `ActiveThreads` are consistently at `maxThreads` and requests are queued or dropped, this is your bottleneck.
- Fix: Increase the `maxThreads` setting in your application server’s configuration. For Tomcat, this is often `maxThreads="200"` in `server.xml`.
- Why it works: More threads can handle more concurrent requests, especially if the application is I/O bound (waiting for databases or external services). Be mindful of the trade-off with memory consumption and potential for increased contention.
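To see what saturation looks like in the numbers, here is a small stand-alone model using a plain `ThreadPoolExecutor` in place of a real application server pool. The class `PoolSaturationDemo` is hypothetical, and the latch simply simulates requests stuck on slow I/O. Tomcat and Undertow expose their own metrics, but the pattern to look for is the same: active threads pinned at the maximum while the queue grows.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of a saturated request-handling thread pool. */
final class PoolSaturationDemo {
    private PoolSaturationDemo() {}

    /** Returns {activeThreads, queuedTasks} observed while every worker is blocked. */
    static int[] observeSaturation(int maxThreads, int submittedTasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                maxThreads, maxThreads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch started = new CountDownLatch(Math.min(maxThreads, submittedTasks));
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < submittedTasks; i++) {
                pool.execute(() -> {
                    started.countDown();
                    awaitQuietly(release);   // simulates a request blocked on slow I/O
                });
            }
            started.await();                 // wait until every worker thread is busy
            return new int[] {pool.getActiveCount(), pool.getQueue().size()};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while observing the pool", e);
        } finally {
            release.countDown();             // unblock the simulated requests
            pool.shutdown();
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With 2 threads and 5 blocked requests, `observeSaturation(2, 5)` reports 2 active threads and 3 queued tasks. Raising the thread count drains the queue in this model, but only because the simulated work is waiting rather than burning CPU.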
-
Memory Leaks
- Diagnosis: Observe your application’s heap usage over time. If it steadily increases and never returns to a baseline, even after GC, you likely have a memory leak. Heap dumps analyzed with tools like Eclipse MAT can pinpoint leaking objects.
- Fix: Identify the leaking objects (e.g., unclosed resources, static collections holding references) and ensure they are properly released or cleared.
- Why it works: Eliminating memory leaks prevents the JVM from constantly running GC on ever-growing memory, which degrades performance and can lead to `OutOfMemoryError`.
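A common version of the "static collection holding references" leak is a cache that only ever grows. One way to cap it, sketched here with a hypothetical `BoundedCache` class, is to build on `LinkedHashMap` and override `removeEldestEntry` so the least recently used entry is dropped once a size limit is reached. The class is not thread-safe on its own; wrap it with `Collections.synchronizedMap` if several threads share it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical size-capped LRU cache built on LinkedHashMap. */
final class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedCache(int maxEntries) {
        super(16, 0.75f, true);   // accessOrder=true keeps entries in least-recently-used order
        if (maxEntries < 1) {
            throw new IllegalArgumentException("maxEntries must be at least 1");
        }
        this.maxEntries = maxEntries;
    }

    /** Called by LinkedHashMap after each insertion; returning true evicts the eldest entry. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

Inserting 1,000 entries into `new BoundedCache<Integer, String>(100)` leaves only the 100 most recent ones, so heap usage from the cache levels off instead of climbing until the JVM runs out of memory.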
You’ll likely hit java.lang.OutOfMemoryError: Metaspace if you’ve been dynamically loading and unloading classes without managing the classloader lifecycle.