gRPC keep-alive is surprisingly not about keeping connections alive, but rather about detecting when they’ve died silently.

Let’s watch a gRPC client and server communicate with keep-alive enabled.

Client (Go):

package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	pb "google.golang.org/grpc/examples/helloworld/helloworld" // Greeter service from the grpc-go examples
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Configure keep-alive parameters.
	// Send a ping after this much time without any activity on the connection.
	// grpc-go raises values below 10 seconds to 10 seconds.
	keepAliveTime := 30 * time.Second
	// Close the connection if the ping is not acknowledged within this time.
	keepAliveTimeout := 10 * time.Second
	// Send pings even when there are no active RPCs.
	keepAlivePermitWithoutStream := true

	conn, err := grpc.Dial(
		"localhost:50051", // Target server address
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`), // Example: round robin policy
		// Keep-alive is a connection-level dial option, not a per-call option.
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                keepAliveTime,
			Timeout:             keepAliveTimeout,
			PermitWithoutStream: keepAlivePermitWithoutStream,
		}),
	)
	if err != nil {
		log.Fatalf("did not connect: %v", err)
	}
	defer conn.Close()

	client := pb.NewGreeterClient(conn)
	log.Println("Client ready. Keep-alive pings are sent whenever the connection is idle.")

	// Simulate sending a request periodically
	ticker := time.NewTicker(60 * time.Second) // Requests are less frequent than the keep-alive interval
	defer ticker.Stop()

	for range ticker.C {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		_, err := client.SayHello(ctx, &pb.HelloRequest{Name: "gRPC"}) // SayHello is a unary RPC
		if err != nil {
			log.Printf("RPC failed: %v", err)
			// If the connection died while idle, the keep-alive timeout has usually
			// closed it already and gRPC is reconnecting in the background.
			// A failure here is typically codes.Unavailable while that reconnect is in progress.
		} else {
			log.Println("RPC successful.")
		}
		cancel()
	}
}

Server (Go):

package main

import (
	"context"
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	pb "google.golang.org/grpc/examples/helloworld/helloworld" // Greeter service from the grpc-go examples
	"google.golang.org/grpc/keepalive"
)

const (
	port = ":50051"
)

// server is used to implement helloworld.GreeterServer.
type server struct {
	pb.UnimplementedGreeterServer // Embed for forward compatibility
}

// SayHello implements helloworld.GreeterServer.
func (s *server) SayHello(ctx context.Context, in *pb.HelloRequest) (*pb.HelloReply, error) {
	log.Printf("Received: %v", in.GetName())
	return &pb.HelloReply{Message: "Hello " + in.GetName()}, nil
}

func main() {
	lis, err := net.Listen("tcp", port)
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}

	// Configure server-side keep-alive
	// The server pings the client if the connection has seen no activity for this duration.
	serverKeepAliveTime := 60 * time.Second
	// After sending a ping, the server waits this long for the acknowledgement before closing the connection.
	serverKeepAliveTimeout := 20 * time.Second
	// Whether the server accepts client pings when there are no active streams.
	// The server's own pings are sent regardless of this setting.
	serverKeepAlivePermitWithoutStream := true

	s := grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    serverKeepAliveTime,
			Timeout: serverKeepAliveTimeout,
		}),
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             5 * time.Second, // Minimum interval the server allows between client pings (default: 5 minutes)
			PermitWithoutStream: serverKeepAlivePermitWithoutStream, // Accept client pings even with no active streams
		}),
	)
	pb.RegisterGreeterServer(s, &server{})

	log.Printf("server listening at %v", lis.Addr())
	if err := s.Serve(lis); err != nil {
		log.Fatalf("failed to serve: %v", err)
	}
}

The core problem keep-alive solves is detecting when a network connection has become unusable without an explicit error being sent by either the client or the server. This often happens due to intermediate network devices (like firewalls or load balancers) silently dropping idle TCP connections. Without keep-alive, your application might continue to try sending data to a connection that’s effectively dead, leading to hangs, timeouts, and difficult-to-diagnose issues.
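
To make the detection window concrete, here is a minimal sketch in plain Go with no gRPC dependency. It assumes a simplified model: a ping goes out after Time with no activity, and the connection is closed if no acknowledgement arrives within Timeout. The keepaliveConfig type and the detectionWindow function are hypothetical names used only for illustration; they are not part of grpc-go.

```go
package main

import (
	"fmt"
	"time"
)

// keepaliveConfig mirrors the two timing fields discussed in this article.
// It is a hypothetical stand-in for illustration, not a grpc-go type.
type keepaliveConfig struct {
	Time    time.Duration // idle time before a ping is sent
	Timeout time.Duration // how long to wait for the ping acknowledgement
}

// detectionWindow returns the approximate upper bound on how long a silently
// dead connection can go unnoticed after the last activity on it, under the
// simplified model of one ping followed by one timeout.
func detectionWindow(c keepaliveConfig) time.Duration {
	return c.Time + c.Timeout
}

// Example values taken from the client and server listings above.
var (
	clientConfig = keepaliveConfig{Time: 30 * time.Second, Timeout: 10 * time.Second}
	serverConfig = keepaliveConfig{Time: 60 * time.Second, Timeout: 20 * time.Second}
)

func main() {
	fmt.Println("client notices a dead peer within about", detectionWindow(clientConfig))
	fmt.Println("server notices a dead peer within about", detectionWindow(serverConfig))
}
```

With the values used in this article, the client would notice a dead connection within roughly 40 seconds of the last activity, and the server within roughly 80 seconds.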

Common Causes and Fixes:

  1. Firewall/NAT Dropping Idle Connections:

    • Diagnosis: Observe your application. It works fine for a while, then requests start hanging or timing out after a period of inactivity. Network packet captures might show no traffic on the gRPC connection for extended periods.
    • Fix: Configure client-side and server-side keep-alive.
      • Client: Pass grpc.WithKeepaliveParams(keepalive.ClientParameters{Time: 30 * time.Second, Timeout: 10 * time.Second, PermitWithoutStream: true}) as a dial option.
      • Server: Set grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{MinTime: 5 * time.Second, PermitWithoutStream: true}) and grpc.KeepaliveParams(keepalive.ServerParameters{Time: 60 * time.Second, Timeout: 20 * time.Second}).
    • Why it works: The client or server periodically sends a small "ping" (keep-alive probe) even if there’s no application data to send. If the other side doesn’t respond within the Timeout period, the connection is considered broken and closed. PermitWithoutStream: true ensures these probes are sent even when no RPCs are active, which is crucial for fighting idle timeouts.
  2. Incorrect KeepaliveTime on Client:

    • Diagnosis: The client is sending keep-alive probes, but too infrequently: an intermediate device drops the connection between probes, and RPCs fail after a period of inactivity.
    • Fix: On the client, ensure KeepaliveTime (the Time field of keepalive.ClientParameters) is less than the idle timeout of any intermediate network devices (firewalls, load balancers), and no smaller than the server's MinTime. A value like 30 * time.Second is often a good starting point.
    • Why it works: The client must "refresh" the connection’s state in any intervening network hardware before that hardware decides to drop the connection.
  3. Incorrect KeepaliveTimeout on Client:

    • Diagnosis: The client sends probes, but the connection is being marked as dead too aggressively, even when the network is fine. This might manifest as intermittent RPC failures that seem random.
    • Fix: On the client, increase KeepaliveTimeout (the Timeout field of keepalive.ClientParameters). This is how long the client waits for the ping acknowledgement after sending a ping. A value like 10 * time.Second or 20 * time.Second is usually sufficient; grpc-go defaults to 20 seconds.
    • Why it works: Network latency can sometimes cause a slight delay in responses. A longer timeout gives the network more time to deliver the pong before the client prematurely declares the connection dead.
  4. Server Enforcement Policy Too Strict for the Client's Pings:

    • Diagnosis: Clients are configured with keep-alive, but the server treats their pings as abuse and closes the connection with a GOAWAY frame (error code ENHANCE_YOUR_CALM, debug data "too_many_pings"). With the default enforcement policy, a grpc-go server expects at least 5 minutes between client pings and none at all while no RPC is active.
    • Fix: On the server, configure grpc.KeepaliveEnforcementPolicy with MinTime (e.g., 5 * time.Second) and PermitWithoutStream: true. Also, configure grpc.KeepaliveParams for the server’s own probing.
    • Why it works: The EnforcementPolicy tells the server how to treat incoming keep-alive probes from clients. PermitWithoutStream: true is vital for the server to accept probes when no RPCs are active; without it, the server closes the very connections that clients are trying to keep alive. MinTime sets the shortest interval the server tolerates between client probes, so it must not exceed the client's Time.
  5. Client Not Sending Keep-Alive Without Streams:

    • Diagnosis: Applications that have long periods of inactivity between RPCs will experience dropped connections, even if KeepaliveTime and KeepaliveTimeout are set.
    • Fix: Ensure PermitWithoutStream: true is set on both the client (grpc.WithKeepaliveParams) and server (grpc.KeepaliveEnforcementPolicy).
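    • Why it works: With PermitWithoutStream left at its default of false, the client sends keep-alive probes only while at least one RPC is active. Between RPCs no probes are sent, so any intermediate network device will eventually time out and drop the idle TCP connection. The server-side flag does not control the server's own pings; it tells the server to accept client pings on a connection with no active streams instead of treating them as a violation.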
  6. Misconfigured MinTime on Server Enforcement Policy:

    • Diagnosis: Clients send keep-alive probes more often than the server allows, and the server closes the connection with a GOAWAY frame carrying ENHANCE_YOUR_CALM and the debug data "too_many_pings".
    • Fix: On the server, set grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{MinTime: 5 * time.Second, ...}). Adjust MinTime as needed, but keep it no greater than the client's Time; 5 * time.Second is a common and safe value.
    • Why it works: MinTime is a server-side protection to prevent clients from hammering the server with keep-alive probes too rapidly. It enforces a minimum interval between probes sent by the client, ensuring the server isn’t overwhelmed.
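
The interactions between the settings in the list above can be captured in a small consistency check. The following is a sketch in plain Go with no gRPC dependency; the settings type, the check function, and the 5-minute middlebox idle timeout are all hypothetical and exist only for illustration. The second example value combines this article's client with what grpc-go documents as the default enforcement policy (MinTime of 5 minutes, PermitWithoutStream false).

```go
package main

import (
	"fmt"
	"time"
)

// settings collects the values the causes above depend on.
// It is a hypothetical type for illustration, not part of grpc-go.
type settings struct {
	ClientTime                time.Duration // keepalive.ClientParameters.Time
	ClientPermitWithoutStream bool          // keepalive.ClientParameters.PermitWithoutStream
	ServerMinTime             time.Duration // keepalive.EnforcementPolicy.MinTime
	ServerPermitWithoutStream bool          // keepalive.EnforcementPolicy.PermitWithoutStream
	MiddleboxIdleTimeout      time.Duration // assumed idle timeout of a firewall, NAT, or load balancer
}

// check returns one human-readable warning per inconsistency it finds.
func check(s settings) []string {
	var warnings []string
	if s.ClientTime >= s.MiddleboxIdleTimeout {
		warnings = append(warnings, "client Time is not below the middlebox idle timeout")
	}
	if s.ClientTime < s.ServerMinTime {
		warnings = append(warnings, "client Time is below server MinTime; expect GOAWAY too_many_pings")
	}
	if !s.ClientPermitWithoutStream {
		warnings = append(warnings, "client sends no pings while idle (PermitWithoutStream is false)")
	}
	if s.ClientPermitWithoutStream && !s.ServerPermitWithoutStream {
		warnings = append(warnings, "server rejects pings on connections without active streams")
	}
	return warnings
}

var (
	// The values used in this article, with an assumed 5-minute middlebox timeout.
	articleSettings = settings{
		ClientTime:                30 * time.Second,
		ClientPermitWithoutStream: true,
		ServerMinTime:             5 * time.Second,
		ServerPermitWithoutStream: true,
		MiddleboxIdleTimeout:      5 * time.Minute,
	}
	// The same client against a server left at the default enforcement policy.
	defaultServerSettings = settings{
		ClientTime:                30 * time.Second,
		ClientPermitWithoutStream: true,
		ServerMinTime:             5 * time.Minute,
		ServerPermitWithoutStream: false,
		MiddleboxIdleTimeout:      5 * time.Minute,
	}
)

func main() {
	for _, w := range check(defaultServerSettings) {
		fmt.Println("warning:", w)
	}
	fmt.Println("warnings for the article's settings:", len(check(articleSettings)))
}
```

A check like this could run at startup or in a test, so that a mismatch between client and server settings is caught before it shows up as dropped connections in production.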

Once keep-alive is correctly configured, the next immediate issue you’ll encounter is understanding how to handle transient network errors that might still occur even with keep-alive.
