This isn’t just a checklist for getting microservices into production; it’s a checklist for making sure they stay there, gracefully handling the chaos of real-world usage.
Let’s see what "production-ready" actually looks like when the rubber meets the road. Imagine a new service, user-profile, is deployed. It’s supposed to fetch user data from a user-db and sometimes from an external auth-service.
Here’s user-profile in action, handling a request to /users/123:
{
  "request_id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
  "method": "GET",
  "path": "/users/123",
  "timestamp": "2023-10-27T10:00:00Z",
  "duration_ms": 75,
  "status_code": 200,
  "response_body_size_bytes": 512,
  "upstream_requests": [
    {
      "service": "user-db",
      "method": "SELECT",
      "query": "SELECT * FROM users WHERE id = 123",
      "duration_ms": 40,
      "status_code": 200
    },
    {
      "service": "auth-service",
      "method": "GET",
      "path": "/auth/validate/123",
      "duration_ms": 30,
      "status_code": 200
    }
  ],
  "error": null
}
This log entry, often called an "access log" or "request log" in a microservices context, tells us a lot. It’s not just about the user-profile service itself, but also its dependencies. The request_id is crucial for tracing this single request across multiple services. The top-level duration_ms is the total time for the request (75 ms), and the duration_ms on each upstream entry breaks that down per dependency (40 ms for user-db, 30 ms for auth-service). Each status_code indicates whether that step succeeded or failed.
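As a small worked example of reading that breakdown, the sketch below subtracts the upstream durations from the total to estimate how long user-profile spent on its own work. It assumes the upstream calls ran one after another; the class and method names are made up for illustration.

```java
/** Sketch: estimate a service's own processing time from one request log entry. */
class RequestLogBreakdown {

    /** Total time minus upstream time, assuming the upstream calls ran sequentially. */
    static long ownTimeMs(long totalMs, long[] upstreamMs) {
        long upstreamTotal = 0;
        for (long ms : upstreamMs) {
            upstreamTotal += ms;
        }
        return totalMs - upstreamTotal;
    }

    public static void main(String[] args) {
        // Values from the log entry: 75 ms total, 40 ms in user-db, 30 ms in auth-service.
        long own = ownTimeMs(75, new long[] {40, 30});
        System.out.println("user-profile own time: " + own + " ms"); // prints 5 ms
    }
}
```

If the upstream calls were made in parallel, the subtraction would no longer hold, and you would compare the total against the slowest upstream call instead.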
The core benefit microservices offer is independent deployability and scalability. In a monolith, one bug or one slow feature can bring down the whole system; with microservices, each service can be updated, scaled, and even allowed to fail without taking the others down with it. Getting that benefit requires a completely different mindset around observability, fault tolerance, and deployment.
Here’s how user-profile might be configured to achieve this (a simplified application.yaml):
server:
  port: 8080
spring:
  application:
    name: user-profile
  datasource:
    url: jdbc:postgresql://user-db.internal:5432/users
    username: readonly_user
    password: ${DB_PASSWORD} # Injected via secrets management
  redis:
    host: cache.internal
    port: 6379
eureka: # Service Discovery
  client:
    serviceUrl:
      defaultZone: http://discovery.internal:8761/eureka/
management: # Health Checks and Metrics
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus # prometheus exposes the scrape endpoint
  endpoint:
    health:
      show-details: when_authorized
  metrics:
    export:
      prometheus:
        enabled: true
    tags:
      application: ${spring.application.name}
# Resilience patterns
feign: # For calling auth-service
  client:
    config:
      default:
        connectTimeout: 2000 # ms
        readTimeout: 5000 # ms
  circuitbreaker:
    enabled: true # Wrap Feign calls in a circuit breaker
resilience4j: # Assumes Resilience4j backs the breaker and the breaker is named auth-service
  circuitbreaker:
    instances:
      auth-service: # Specific config for auth-service calls
        slidingWindowType: COUNT_BASED # or TIME_BASED
        slidingWindowSize: 10 # number of calls in window
        minimumNumberOfCalls: 5 # min calls before the failure rate is evaluated
        failureRateThreshold: 50 # % of failures to open circuit
        waitDurationInOpenState: 10s # time to stay open before allowing trial calls
logging:
  pattern:
    level: "%5p [${spring.application.name}:%X{traceId:-},%X{spanId:-}]" # Include trace ID in logs
The eureka.client section tells user-profile where the service registry lives, so it can look up services such as auth-service by name instead of hardcoding their IPs. (user-db is not discovered this way here; the datasource URL addresses it by its internal DNS name.) The management.endpoints section exposes endpoints like /actuator/health, which Kubernetes or another orchestrator can poll to decide whether the service is alive and ready. The circuit breaker settings are key: if calls to auth-service start failing at or above the configured failure rate, the breaker opens and stops sending requests to it for a while, which helps prevent a cascading failure.
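To make the circuit breaker settings concrete, here is a minimal count-based breaker in plain Java. It is a sketch for illustration only, not the implementation a library such as Resilience4j uses; the class name and the simplified half-open behavior (one recorded outcome decides) are assumptions of this example. The constructor parameters mirror slidingWindowSize, minimumNumberOfCalls, failureRateThreshold, and waitDurationInOpenState.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Minimal count-based circuit breaker (illustrative sketch, not a library implementation). */
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;        // like slidingWindowSize
    private final int minimumCalls;      // like minimumNumberOfCalls
    private final double failureRatePct; // like failureRateThreshold
    private final long waitOpenMs;       // like waitDurationInOpenState

    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failed call
    private State state = State.CLOSED;
    private long openedAtMs;

    SimpleCircuitBreaker(int windowSize, int minimumCalls, double failureRatePct, long waitOpenMs) {
        this.windowSize = windowSize;
        this.minimumCalls = minimumCalls;
        this.failureRatePct = failureRatePct;
        this.waitOpenMs = waitOpenMs;
    }

    /** Returns true if a call may be attempted at time nowMs. */
    synchronized boolean allowRequest(long nowMs) {
        if (state == State.OPEN && nowMs - openedAtMs >= waitOpenMs) {
            state = State.HALF_OPEN; // wait is over: trial calls are allowed
        }
        return state != State.OPEN;
    }

    /** Records the outcome of a call that was allowed through. */
    synchronized void record(boolean failed, long nowMs) {
        if (state == State.HALF_OPEN) {
            // The first recorded trial decides: failure reopens, success closes.
            if (failed) {
                open(nowMs);
            } else {
                state = State.CLOSED;
                window.clear();
            }
            return;
        }
        window.addLast(failed);
        if (window.size() > windowSize) {
            window.removeFirst(); // keep only the most recent windowSize outcomes
        }
        if (window.size() >= minimumCalls && failureRate() >= failureRatePct) {
            open(nowMs);
        }
    }

    synchronized State state() {
        return state;
    }

    private double failureRate() {
        int failures = 0;
        for (boolean f : window) {
            if (f) failures++;
        }
        return 100.0 * failures / window.size();
    }

    private void open(long nowMs) {
        state = State.OPEN;
        openedAtMs = nowMs;
        window.clear();
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker cb = new SimpleCircuitBreaker(10, 5, 50.0, 10_000);
        for (int i = 0; i < 5; i++) {
            cb.record(true, 0); // five failures in a row
        }
        System.out.println(cb.state());             // OPEN
        System.out.println(cb.allowRequest(5_000)); // false: still inside the wait
        System.out.println(cb.allowRequest(10_000)); // true: trial call allowed
    }
}
```

Note that the breaker does not open after the first failure: with these settings it waits until at least five outcomes are recorded before it evaluates the failure rate.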
The logging.pattern matters just as much. It prints the traceId and spanId from the logging context (MDC) on every log line from user-profile; a tracing library such as Spring Cloud Sleuth or Micrometer Tracing is what puts them there. When user-profile calls auth-service, it should pass this traceId along, and auth-service then logs its own operations with the same traceId. This allows you to stitch together the full request flow across services.
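The propagation step can be sketched without any framework. The example below keeps the current trace ID in a ThreadLocal, reuses an incoming ID when one is present, and copies it onto outgoing headers. The header name X-Trace-Id and the TraceContext class are hypothetical; real tracing libraries define their own headers (for example the W3C traceparent header) and do this work for you.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Sketch of trace ID propagation; names and header are illustrative assumptions. */
class TraceContext {
    // Hypothetical header name, chosen for this example only.
    static final String TRACE_HEADER = "X-Trace-Id";

    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    /** On request arrival: reuse the caller's trace ID, or start a new trace. */
    static String begin(Map<String, String> incomingHeaders) {
        String traceId = incomingHeaders.get(TRACE_HEADER);
        if (traceId == null || traceId.isEmpty()) {
            traceId = UUID.randomUUID().toString();
        }
        CURRENT.set(traceId);
        return traceId;
    }

    /** Headers to attach to an outgoing call so the next service sees the same ID. */
    static Map<String, String> outgoingHeaders() {
        Map<String, String> headers = new HashMap<>();
        String traceId = CURRENT.get();
        if (traceId != null) {
            headers.put(TRACE_HEADER, traceId);
        }
        return headers;
    }

    /** Formats a log line with the current trace ID, in the spirit of %X{traceId:-}. */
    static String logLine(String service, String message) {
        String traceId = CURRENT.get();
        return "[" + service + ":" + (traceId == null ? "" : traceId) + "] " + message;
    }

    /** Clears the context when the request is done, so pooled threads do not leak IDs. */
    static void end() {
        CURRENT.remove();
    }

    public static void main(String[] args) {
        Map<String, String> incoming = new HashMap<>();
        incoming.put(TRACE_HEADER, "abc123");
        begin(incoming);
        System.out.println(logLine("user-profile", "fetching user 123"));
        System.out.println(outgoingHeaders()); // carries the same trace ID to auth-service
        end();
    }
}
```

A ThreadLocal only works while a request stays on one thread; asynchronous or reactive code needs the context handed over explicitly, which is one reason to rely on a tracing library in practice.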
The one thing most people don’t know about production readiness is that it’s less about a perfect, bug-free deployment and more about a controlled failure model. You’re not aiming for "never fails," but for "fails gracefully and predictably." This means building in mechanisms like circuit breakers, retries with exponential backoff, and well-defined fallback strategies so that when a dependency does fail, your service doesn’t just crash or hang indefinitely. It should degrade its functionality or return a sensible error.
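As one possible shape for "retries with exponential backoff" plus a fallback, here is a small framework-free sketch. The class name, parameters, and the choice to retry on any RuntimeException are assumptions made for the example; a production version would usually add random jitter to the delays and retry only operations that are safe to repeat.

```java
import java.util.function.Supplier;

/** Sketch: retry with exponential backoff, then fall back (illustrative only). */
class RetryWithFallback {

    /** Delay before the retry that follows attempt `attempt` (0-based): base * 2^attempt, capped. */
    static long backoffMs(int attempt, long baseMs, long maxMs) {
        long delay = baseMs << Math.min(attempt, 20); // clamp the shift to avoid overflow
        return Math.min(delay, maxMs);
    }

    /** Tries the action up to maxAttempts times, then returns the fallback value. */
    static <T> T call(Supplier<T> action, int maxAttempts, long baseMs, long maxMs, T fallback) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                if (attempt == maxAttempts - 1) {
                    break; // out of attempts: degrade instead of failing
                }
                try {
                    Thread.sleep(backoffMs(attempt, baseMs, maxMs));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve the interrupt and stop retrying
                    break;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // A dependency that is always down: after 3 attempts the caller gets the fallback.
        String result = call(() -> {
            throw new IllegalStateException("auth-service unavailable");
        }, 3, 10, 100, "guest-profile");
        System.out.println(result); // guest-profile
    }
}
```

With a base of 100 ms and a cap of 2000 ms, the delays run 100, 200, 400, 800, 1600 and then stay at 2000, so a struggling dependency gets progressively more breathing room rather than a burst of immediate retries.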
Next, you’ll want to think about how you manage the lifecycle of these services, especially when updates are frequent.