NATS metrics endpoints are the key to understanding your message bus’s health and performance, but they’re often overlooked in favor of more "critical" system metrics.
Let’s watch NATS in action. Imagine a simple NATS setup with two publishers and two subscribers, all running within a Docker Compose environment. We’ll configure NATS to expose its HTTP monitoring endpoint.
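version: '3.8'
services:
  nats:
    image: nats:latest
    ports:
      - "4222:4222" # NATS client port
      - "8222:8222" # NATS HTTP monitoring port
    command: ["-m", "8222"] # Enable monitoring on port 8222
    networks:
      - nats_net
  publisher1:
    build: ./publisher1
    environment:
      - NATS_URL=nats://nats:4222 # reach the server by its service name
    networks:
      - nats_net
  publisher2:
    build: ./publisher2
    environment:
      - NATS_URL=nats://nats:4222
    networks:
      - nats_net
  subscriber1:
    build: ./subscriber1
    environment:
      - NATS_URL=nats://nats:4222
    networks:
      - nats_net
  subscriber2:
    build: ./subscriber2
    environment:
      - NATS_URL=nats://nats:4222
    networks:
      - nats_net
networks:
  nats_net:
In this docker-compose.yml, the nats service is started with the -m flag, which takes a port number and enables the HTTP monitoring endpoint on that port (8222 here). The other services (publisher1, publisher2, subscriber1, subscriber2) represent your applications that will interact with NATS. Each one is handed NATS_URL=nats://nats:4222 so that a client can reach the server by its service name on nats_net; inside a container, localhost refers to the container itself, not to the NATS server.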
Now, let’s simulate some traffic. Here’s a basic Go publisher:
package main

import (
	"fmt"
	"log"
	"os"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Use NATS_URL when it is set (for example nats://nats:4222 inside
	// Compose); otherwise fall back to nats://127.0.0.1:4222.
	url := os.Getenv("NATS_URL")
	if url == "" {
		url = nats.DefaultURL
	}

	nc, err := nats.Connect(url)
	if err != nil {
		log.Fatalf("Error connecting to NATS: %v", err)
	}
	defer nc.Close()

	subject := "updates.messages"
	// Publish ten messages, one every 500 ms.
	for i := 0; i < 10; i++ {
		message := fmt.Sprintf("Hello from publisher1, message %d", i)
		if err := nc.Publish(subject, []byte(message)); err != nil {
			log.Printf("Error publishing message: %v", err)
		} else {
			log.Printf("Published: %s on subject %s", message, subject)
		}
		time.Sleep(500 * time.Millisecond)
	}
}
And a simple Go subscriber:
package main

import (
	"log"
	"os"

	"github.com/nats-io/nats.go"
)

func main() {
	// Use NATS_URL when it is set (for example nats://nats:4222 inside
	// Compose); otherwise fall back to nats://127.0.0.1:4222.
	url := os.Getenv("NATS_URL")
	if url == "" {
		url = nats.DefaultURL
	}

	nc, err := nats.Connect(url)
	if err != nil {
		log.Fatalf("Error connecting to NATS: %v", err)
	}
	defer nc.Close()

	subject := "updates.messages"
	// The callback runs on a client goroutine for every delivered message.
	_, err = nc.Subscribe(subject, func(msg *nats.Msg) {
		log.Printf("Received message: %s on subject %s", string(msg.Data), msg.Subject)
	})
	if err != nil {
		log.Fatalf("Error subscribing to subject %s: %v", subject, err)
	}
	log.Printf("Subscribed to subject: %s", subject)

	// Keep the subscriber running
	select {}
}
With these running, you can now query the NATS HTTP monitoring port. The server itself answers with JSON from endpoints such as /varz (server-wide counters) and /connz (per-connection detail); it has no built-in /metrics route. If the monitoring port is published on localhost:8222, you can curl it:
curl http://localhost:8222/varz
This returns a single JSON document with counters for messages in and out, bytes in and out, connection counts, and more. For instance, you might see (abridged to three fields, values illustrative):
{
  "total_connections": 5,
  "in_msgs": 120,
  "in_bytes": 6240
}
For Prometheus-formatted output, run the separate prometheus-nats-exporter alongside the server. It polls these JSON endpoints and republishes the values on its own /metrics endpoint for Prometheus to scrape.
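The monitoring port also serves these counters as JSON at /varz, which makes them easy to read from code. Below is a minimal Go sketch using only the standard library. It assumes the monitoring port is reachable at http://localhost:8222; the Varz struct and the parseVarz and fetchVarz helpers are names made up for this example, and only a handful of the fields in the real response are decoded.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// Varz holds a few of the counters returned by the /varz endpoint.
// The real response carries many more fields; unknown ones are ignored.
type Varz struct {
	Connections      int    `json:"connections"`
	TotalConnections uint64 `json:"total_connections"`
	InMsgs           int64  `json:"in_msgs"`
	OutMsgs          int64  `json:"out_msgs"`
	InBytes          int64  `json:"in_bytes"`
	OutBytes         int64  `json:"out_bytes"`
}

// parseVarz decodes a /varz JSON document into a Varz value.
func parseVarz(data []byte) (Varz, error) {
	var v Varz
	err := json.Unmarshal(data, &v)
	return v, err
}

// fetchVarz requests /varz from the monitoring address and decodes it.
func fetchVarz(baseURL string) (Varz, error) {
	resp, err := http.Get(baseURL + "/varz")
	if err != nil {
		return Varz{}, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return Varz{}, fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return Varz{}, err
	}
	return parseVarz(body)
}

func main() {
	// Assumption: the monitoring port is published on localhost:8222.
	v, err := fetchVarz("http://localhost:8222")
	if err != nil {
		fmt.Println("could not read /varz:", err)
		return
	}
	fmt.Printf("connections=%d in_msgs=%d out_msgs=%d in_bytes=%d out_bytes=%d\n",
		v.Connections, v.InMsgs, v.OutMsgs, v.InBytes, v.OutBytes)
}
```

Keeping the decoding in parseVarz, separate from the HTTP call, means the parsing can be exercised against a saved response without a running server.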
The core problem NATS monitoring solves is providing visibility into the internal state of the message bus itself. Without these metrics, you’re flying blind, unable to diagnose performance bottlenecks, identify unusual traffic patterns, or even confirm that NATS is functioning as expected.
Internally, NATS serves this data from an HTTP server that runs concurrently with the main NATS server. This HTTP server is deliberately lightweight, designed to provide detailed operational data without adding significant overhead. Its endpoints return JSON; to feed standard monitoring tools like Prometheus and Grafana, the prometheus-nats-exporter translates that JSON into the Prometheus exposition format.
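To make the Prometheus exposition format concrete, here is a small Go sketch that renders a set of counters as exposition text. It is an illustration of the format only, not how NATS or any exporter is implemented; renderExposition and the example_nats_* metric names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renderExposition formats counters in the Prometheus text exposition
// format: an optional HELP line, a TYPE line, then "name value".
func renderExposition(counters map[string]int64, help map[string]string) string {
	names := make([]string, 0, len(counters))
	for name := range counters {
		names = append(names, name)
	}
	sort.Strings(names) // stable output regardless of map iteration order

	var b strings.Builder
	for _, name := range names {
		if h, ok := help[name]; ok {
			fmt.Fprintf(&b, "# HELP %s %s\n", name, h)
		}
		fmt.Fprintf(&b, "# TYPE %s counter\n", name)
		fmt.Fprintf(&b, "%s %d\n", name, counters[name])
	}
	return b.String()
}

func main() {
	// Hypothetical metric names, chosen for illustration only.
	counters := map[string]int64{
		"example_nats_in_msgs_total":  120,
		"example_nats_in_bytes_total": 6240,
	}
	help := map[string]string{
		"example_nats_in_msgs_total":  "Messages received by the server.",
		"example_nats_in_bytes_total": "Bytes received by the server.",
	}
	fmt.Print(renderExposition(counters, help))
}
```

In practice you would rely on an existing exporter or a Prometheus client library to produce this text; the sketch only shows what a scraper expects to find on the wire.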
The key levers you control are primarily through NATS server configuration. The most direct is enabling the monitoring port, typically done via the -m flag (which takes the port, for example -m 8222) or within a configuration file:
# Example nats.conf snippet
# ... other settings
# http_port is a top-level option; it is not nested inside a server block
http_port: 8222
# ... other settings
You can also serve the monitoring endpoint over TLS if security is a concern, by setting https_port instead of http_port and supplying a tls block with the certificate and key, though this adds complexity. The metrics themselves are largely fixed, offering a comprehensive view of NATS’s operation. You don’t "configure" which metrics are exposed, but you do configure how they are exposed and where they are scraped from.
The one thing most people don’t know is that the /subsz endpoint on the monitoring port provides a real-time snapshot of the subscription table on the server you query: how many subscriptions it holds and how widely messages fan out. Adding ?subs=1 lists each subscription with its subject and client ID. This isn’t a Prometheus metric but a JSON endpoint that’s incredibly useful for debugging connectivity and fanout issues. You can curl http://localhost:8222/subsz to get a JSON output like:
{"num_subscriptions":2,"num_cache":1,"max_fanout":2,"avg_fanout":2}
(This example is abridged and its values are illustrative; a real response carries more fields, and subject and client details appear only when subs=1 is set.)
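For fanout debugging, the per-subscription detail is usually what you want. The Go sketch below requests /subsz?subs=1 and counts subscriptions per subject. It assumes the monitoring port is at http://localhost:8222 and that the detail list arrives under the subscriptions_list key with a subject field on each entry; the subsz struct and countBySubject helper are names invented for this example, and the endpoint pages its results, so a busy server may need several requests.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"sort"
)

// subsz mirrors only the part of the /subsz?subs=1 response used here.
type subsz struct {
	Subs []struct {
		Subject string `json:"subject"`
		Cid     uint64 `json:"cid"`
	} `json:"subscriptions_list"`
}

// countBySubject returns how many listed subscriptions exist per subject.
func countBySubject(data []byte) (map[string]int, error) {
	var s subsz
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	counts := make(map[string]int)
	for _, sub := range s.Subs {
		counts[sub.Subject]++
	}
	return counts, nil
}

func main() {
	// Assumption: the monitoring port is published on localhost:8222.
	resp, err := http.Get("http://localhost:8222/subsz?subs=1")
	if err != nil {
		fmt.Println("could not read /subsz:", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("could not read response body:", err)
		return
	}
	counts, err := countBySubject(body)
	if err != nil {
		fmt.Println("could not decode /subsz:", err)
		return
	}

	// Print subjects in sorted order so repeated runs are comparable.
	subjects := make([]string, 0, len(counts))
	for subject := range counts {
		subjects = append(subjects, subject)
	}
	sort.Strings(subjects)
	for _, subject := range subjects {
		fmt.Printf("%-30s %d subscription(s)\n", subject, counts[subject])
	}
}
```

With the two subscribers from the Compose setup connected, updates.messages should show up in this listing; if it shows fewer subscriptions than expected, one of the subscribers has not connected.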
Once you’ve got Prometheus scraping these metrics, the next logical step is to visualize them in Grafana, creating dashboards that track latency, throughput, and connection health over time.