Deploy gRPC Services on Kubernetes with Session Affinity (2026)

Session affinity for a gRPC service on Kubernetes isn’t sticky in the way you might expect: it pins a client IP address to a pod, not a user or an individual RPC to the service.

Let’s see it in action. Imagine we have a simple gRPC service, greeter, that replies with a greeting for whatever name you send it. We’ll deploy it to Kubernetes.

First, the gRPC server code (Python):

# server.py
import grpc
import time
from concurrent import futures
import helloworld_pb2
import helloworld_pb2_grpc

class GreeterServicer(helloworld_pb2_grpc.GreeterServicer):
    def SayHello(self, request, context):
        print(f"Received message: {request.name}")
        return helloworld_pb2.HelloReply(message=f"Hello, {request.name}!")

def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    helloworld_pb2_grpc.add_GreeterServicer_to_server(GreeterServicer(), server)
    server.add_insecure_port('[::]:50051')
    server.start()
    print("Server started on port 50051")
    try:
        while True:
            time.sleep(86400)
    except KeyboardInterrupt:
        server.stop(0)

if __name__ == '__main__':
    serve()
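
To see affinity at work, it helps if each reply names the pod that produced it. A minimal sketch, assuming the pod’s hostname equals the pod name (the Kubernetes default); build_greeting is a hypothetical helper that SayHello could call in place of its inline f-string:

```python
import socket


def build_greeting(name, pod_name=None):
    """Return the greeting text, tagged with the pod that produced it.

    build_greeting is a hypothetical helper, not part of the original server.
    """
    # Inside a pod, the hostname defaults to the pod name.
    pod = pod_name if pod_name is not None else socket.gethostname()
    return f"Hello, {name}! (served by {pod})"


if __name__ == "__main__":
    print(build_greeting("KubernetesUser"))
```

With this in place, repeated client runs from one machine should keep reporting the same pod while the affinity entry is alive.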

And the client (Python):

# client.py
import grpc
import helloworld_pb2
import helloworld_pb2_grpc

def run():
    with grpc.insecure_channel('localhost:50051') as channel: # Will be updated to Kubernetes service IP
        stub = helloworld_pb2_grpc.GreeterStub(channel)
        try:
            response = stub.SayHello(helloworld_pb2.HelloRequest(name='KubernetesUser'))
            print("Greeter client received: " + response.message)
        except grpc.RpcError as e:
            print(f"Error: {e.code()} - {e.details()}")

if __name__ == '__main__':
    run()
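
Rather than editing the hard-coded localhost:50051 for each environment, the client could read its target from the environment. A small sketch, assuming a hypothetical GREETER_TARGET variable (the name is an illustration, not something gRPC defines):

```python
import os

# Fallback used for local testing, matching the original client.
DEFAULT_TARGET = "localhost:50051"


def resolve_target(env=None):
    """Return the host:port the client should dial.

    GREETER_TARGET is a hypothetical variable name chosen for this example.
    """
    env = os.environ if env is None else env
    return env.get("GREETER_TARGET", DEFAULT_TARGET)


if __name__ == "__main__":
    print(resolve_target())
```

The client would then call grpc.insecure_channel(resolve_target()). Inside the cluster, and assuming the Service lives in the default namespace, the value could be greeter-service.default.svc.cluster.local:50051; from outside, it would be the load balancer’s address.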

We’ll need a Dockerfile for the server:

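FROM python:3.9-slim
WORKDIR /app
RUN pip install --no-cache-dir grpcio grpcio-tools
COPY server.py .
# helloworld_pb2.py and helloworld_pb2_grpc.py are generated from helloworld.proto.
# Dockerfile comments must be on their own line; a trailing "# ..." is parsed as extra COPY arguments.
COPY helloworld_pb2.py helloworld_pb2_grpc.py ./
# -u disables output buffering so print() lines show up promptly in kubectl logs
CMD ["python", "-u", "server.py"]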

Both the server and the client import modules generated from the gRPC service definition (helloworld.proto), so create it before building the image:

syntax = "proto3";

package helloworld;

service Greeter {
  rpc SayHello (HelloRequest) returns (HelloReply) {}
}

message HelloRequest {
  string name = 1;
}

message HelloReply {
  string message = 1;
}

Compile it: python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. helloworld.proto

Next, the Kubernetes deployment and service:

# greeter-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: greeter-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: greeter
  template:
    metadata:
      labels:
        app: greeter
    spec:
      containers:
      - name: greeter
        image: your-dockerhub-username/greeter-service:latest # Replace with your image
        ports:
        - containerPort: 50051

---
# greeter-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: greeter-service
  annotations:
    # Ask AWS for a Network Load Balancer (layer 4); this does not implement session affinity
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: "50051"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: "deregistration_delay.timeout_seconds=30"
    # Connection draining for graceful pod removal (a Classic Load Balancer setting); not a session affinity setting
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout: "60"
spec:
  selector:
    app: greeter
  ports:
    - protocol: TCP
      port: 50051
      targetPort: 50051
  type: LoadBalancer
  sessionAffinity: ClientIP # Standard Kubernetes session affinity, enforced by kube-proxy
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800 # 3 hours

When you apply these manifests (kubectl apply -f greeter-deployment.yaml -f greeter-service.yaml), Kubernetes provisions a LoadBalancer service. On AWS, with the nlb annotation above, this creates a Network Load Balancer (NLB). The sessionAffinity: ClientIP setting is not passed to the load balancer, though. It is enforced inside the cluster by kube-proxy, which sends new connections from the same client IP address to the same backend pod.

Here’s where it gets interesting: kube-proxy on each node keeps a table mapping client IP addresses to backend pods. When a new connection arrives, kube-proxy checks the table. If the client IP is already there, the connection goes to the previously assigned pod. If not, kube-proxy picks a pod (effectively at random in iptables mode) and records the mapping. Because the table is local to each node, stickiness is only as good as the load balancer’s consistency in delivering a given client to the same node, and it depends on the client’s source IP still being visible when the packet reaches kube-proxy.

The timeoutSeconds: 10800 in sessionAffinityConfig means that kube-proxy remembers a mapping for up to 3 hours (which is also the default). After that, or if the mapped pod becomes unavailable, the next connection from that client IP might go to a different pod.
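
The lookup-or-assign behavior with a timeout can be sketched as a toy model. This is an illustration under stated assumptions, not the real implementation: the class name ClientIPAffinity is hypothetical, pods are chosen round-robin only to keep the example deterministic, and the timer is assumed to refresh on every new connection.

```python
import itertools


class ClientIPAffinity:
    """Toy model of ClientIP affinity: client IP -> (pod, last_seen)."""

    def __init__(self, pods, timeout_seconds=10800):
        self._next_pod = itertools.cycle(list(pods))  # round-robin for determinism
        self._timeout = timeout_seconds
        self._table = {}

    def route(self, client_ip, now):
        """Return the pod for a new connection arriving at time `now` (seconds)."""
        entry = self._table.get(client_ip)
        if entry is not None:
            pod, last_seen = entry
            if now - last_seen <= self._timeout:
                # Known client, mapping still fresh: reuse the pod, refresh the timer.
                self._table[client_ip] = (pod, now)
                return pod
        # Unknown client or expired mapping: pick a pod and record it.
        pod = next(self._next_pod)
        self._table[client_ip] = (pod, now)
        return pod


if __name__ == "__main__":
    affinity = ClientIPAffinity(["pod-a", "pod-b", "pod-c"])
    for t in (0, 60, 3600):
        print(t, affinity.route("203.0.113.7", now=t))
```

Routing the same IP again after more than timeout_seconds of silence falls through to the assignment branch, which is the point at which a client can land on a different pod.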

The service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled: "true" and service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout: "60" annotations are about graceful pod termination, not session affinity. When a pod is being scaled down or updated, connection draining lets existing connections finish within the specified timeout (60 seconds here) before the target is fully removed from the load balancer. Note that these two annotations are defined for Classic Load Balancers; on an NLB the comparable knob is the target group’s deregistration delay, which this manifest sets to 30 seconds through deregistration_delay.timeout_seconds. Either way, the goal is to avoid abrupt disconnections for clients that are pinned to that pod by session affinity.
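
Draining at the load balancer only helps if the server also lets in-flight RPCs finish. Kubernetes stops a pod with SIGTERM, whereas server.py above only reacts to KeyboardInterrupt and calls server.stop(0). A hedged sketch of a gentler shutdown: the helper names are hypothetical, the 25-second grace period is an assumption chosen to stay under the default 30-second terminationGracePeriodSeconds, and server is any object with grpc.Server’s stop(grace) method, which returns a threading.Event.

```python
import signal
import threading

# Set by the signal handler; the main thread waits on it.
shutdown_requested = threading.Event()


def request_shutdown(signum=None, frame=None):
    """Signal handler: only flag the request; the main thread does the work."""
    shutdown_requested.set()


def drain_and_stop(server, grace_seconds):
    """Refuse new RPCs and wait up to grace_seconds for in-flight ones."""
    stopped = server.stop(grace_seconds)  # grpc.Server.stop returns a threading.Event
    stopped.wait()


def serve_until_sigterm(server, grace_seconds=25):
    """Block until SIGTERM arrives, then shut the server down gracefully."""
    signal.signal(signal.SIGTERM, request_shutdown)
    shutdown_requested.wait()
    drain_and_stop(server, grace_seconds)
```

In serve(), calling serve_until_sigterm(server) after server.start() would take the place of the sleep loop.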

Crucially, if your gRPC client is behind a NAT gateway or a proxy, all its requests will appear to originate from a single IP address (the NAT gateway’s or proxy’s public IP). In such scenarios, ClientIP session affinity will effectively make all clients behind that NAT/proxy sticky to the same backend pod, potentially leading to uneven load distribution. This is a common pitfall.
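
The skew is easy to demonstrate with a small simulation. This is a sketch, not how the cluster picks pods: hashing the source IP is a deterministic stand-in for the affinity table, and the pod names and IP ranges are made up for the example.

```python
import hashlib
from collections import Counter

PODS = ["pod-a", "pod-b", "pod-c"]


def pick_pod(source_ip, pods=PODS):
    """Deterministically map a source IP to a pod (stand-in for the affinity table)."""
    digest = hashlib.sha256(source_ip.encode()).digest()
    return pods[int.from_bytes(digest[:4], "big") % len(pods)]


def load_by_pod(source_ips):
    """Count how many clients end up on each pod."""
    return Counter(pick_pod(ip) for ip in source_ips)


# 100 clients with their own public IPs vs. 100 clients sharing one NAT IP.
direct_clients = [f"198.51.100.{i}" for i in range(1, 101)]
nat_clients = ["203.0.113.10"] * 100

if __name__ == "__main__":
    print("direct:", dict(load_by_pod(direct_clients)))
    print("behind NAT:", dict(load_by_pod(nat_clients)))
```

The directly connected clients spread across the pods, while everything behind the NAT address lands on a single pod no matter how many clients there are.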

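The "session affinity" here is really source-IP affinity for connections. The Kubernetes Service object is the configuration interface, and kube-proxy implements the sticky behavior on each node; the cloud provider’s load balancer (or an ingress controller if you were using one) has its own, separately configured stickiness. For gRPC this matters less often than you might think: a channel multiplexes its RPCs over a long-lived HTTP/2 connection, so every call on that channel already reaches the same pod, and ClientIP affinity only decides where new connections land. gRPC is fundamentally stateless on the server side, but if you rely on caching or state that should stay consistent for a user session across reconnects, this is the primary built-in mechanism.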

The next thing you’ll likely encounter is managing gRPC streams across multiple requests from the same client, where you might want to maintain state not just per client IP, but per stream or per user identity.