MQTT brokers are surprisingly fragile state machines, and their perceived performance is often a direct reflection of the network’s ability to keep that state synchronized.
Let’s watch a tiny MQTT broker in action. Imagine a single client connecting to our broker, subscribing to a topic, and then publishing a message.
```python
import time

import paho.mqtt.client as mqtt

broker_address = "localhost"
broker_port = 1883


def on_connect(client, userdata, flags, rc):
    # rc == 0 means the broker accepted the connection.
    if rc == 0:
        print("Connected to MQTT Broker!")
        # Subscribe here so the subscription is renewed after a reconnect.
        client.subscribe("test/topic")
    else:
        print(f"Failed to connect, return code {rc}")


def on_message(client, userdata, msg):
    print(f"Received message on topic {msg.topic}: {msg.payload.decode()}")


# paho-mqtt 1.x constructor. With paho-mqtt 2.x, pass
# mqtt.CallbackAPIVersion.VERSION1 as the first argument to keep these
# callback signatures.
client = mqtt.Client("MonitorClient")
client.on_connect = on_connect
client.on_message = on_message

client.connect(broker_address, broker_port)
client.loop_start()  # Start a background thread to process network traffic

time.sleep(2)  # Give it time to connect and subscribe
client.publish("test/topic", "Hello from Python!")
time.sleep(5)  # Allow time for the message to be received

client.loop_stop()  # Stop the background thread
client.disconnect()
```
This script does three things: it connects, subscribes, and publishes. The call to client.loop_start() is crucial: it starts a background thread that sends and receives MQTT packets, handling keepalives, ACKs, and message delivery. Without it, no callbacks would fire and no keepalive pings would be sent, so the client would never see its own message and the broker would eventually drop the connection.
The core problem MQTT monitoring solves is understanding the health and capacity of the broker from the perspective of the clients. This means tracking not just the broker’s CPU and memory, but more importantly, its ability to process incoming connections, subscriptions, and message traffic efficiently.
Here’s what you’re actually controlling and observing:
- Connection Count: The most fundamental metric. How many clients are currently connected? A sudden drop might indicate a network issue or a broker crash. A steady climb beyond expected levels could signal a denial-of-service attack or a runaway client process.
- Subscription Count: For each connected client, how many unique topic subscriptions do they have? High subscription counts can strain broker resources, especially if many clients subscribe to wildcard topics (# or +).
- Message Throughput: Usually measured as messages published to the broker per second (inbound rate) and messages delivered to subscribers per second (outbound rate). It’s the lifeblood of most MQTT systems.
- Latency: The time it takes for a message published by one client to be received by another. This is heavily influenced by network conditions but also by broker processing load.
- Retained Messages: The number of messages the broker is holding onto for clients that connect later. This consumes broker memory.
- Will Messages: Messages clients set to be published if they disconnect unexpectedly. Monitoring these can help detect client failures.
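Throughput and latency from the list above can also be measured on the client side. Here is a minimal sketch, assuming each published payload carries the publisher’s send timestamp and that publisher and subscriber clocks are synchronized (or run on the same host). The class name ClientSideMetrics and its methods are hypothetical helpers for illustration, not part of any MQTT library.

```python
import math
import time
from collections import deque


class ClientSideMetrics:
    """Tracks delivery rate and end-to-end latency over a sliding time window."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self._samples = deque()  # (receive_time, latency_seconds)

    def record(self, sent_at, received_at=None):
        """Store one delivered message; sent_at is the publisher's timestamp."""
        received_at = time.time() if received_at is None else received_at
        self._samples.append((received_at, received_at - sent_at))
        self._evict(received_at)

    def _evict(self, now):
        # Drop samples that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()

    def messages_per_second(self):
        """Average delivery rate over the window."""
        return len(self._samples) / self.window_seconds

    def latency_percentile(self, pct):
        """Nearest-rank percentile of latency in seconds, or None if empty."""
        if not self._samples:
            return None
        ordered = sorted(latency for _, latency in self._samples)
        rank = max(1, math.ceil(pct / 100 * len(ordered)))
        return ordered[rank - 1]
```

In a subscriber, you would call record() from the on_message callback after parsing the timestamp out of msg.payload. Percentiles tend to be more useful than averages here, since a handful of slow deliveries is usually what clients notice first.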
To monitor these, you’ll typically use a combination of broker-specific metrics (many brokers expose these via an API or log files) and general network/system tools. For example, Mosquitto, a popular open-source broker, can be configured to log connection/disconnection events and publish internal statistics to specific MQTT topics itself.
Consider this Mosquitto configuration snippet:
```conf
# mosquitto.conf
log_dest file /var/log/mosquitto/mosquitto.log
log_type all

# Publish broker statistics under $SYS/broker/...
# The value is the update interval in seconds; 0 disables $SYS publishing.
sys_interval 10
```
With sys_interval set to a non-zero value, Mosquitto publishes metrics such as $SYS/broker/clients/connected, $SYS/broker/load/publish/received/1min, and $SYS/broker/retained messages/count. You’d then use a separate MQTT client to subscribe to these $SYS topics and log or visualize them.
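On the consuming side, a monitoring client needs to turn $SYS payloads into numbers. The sketch below assumes the payloads are plain text whose first token is numeric when the topic carries a metric (for example "12" or "3600 seconds"); parse_sys_payload and SysSnapshot are hypothetical names introduced for illustration.

```python
def parse_sys_payload(payload):
    """Return the leading numeric value of a $SYS payload, or None."""
    text = payload.decode("utf-8", errors="replace").strip()
    if not text:
        return None
    try:
        return float(text.split()[0])
    except ValueError:
        # Non-numeric payloads (e.g. a version string) are skipped.
        return None


class SysSnapshot:
    """Keeps the latest numeric value seen for each $SYS topic."""

    def __init__(self):
        self.values = {}

    def update(self, topic, payload):
        # Ignore anything outside the $SYS tree.
        if not topic.startswith("$SYS/"):
            return
        value = parse_sys_payload(payload)
        if value is not None:
            self.values[topic] = value
```

With a paho client you would subscribe to $SYS/# and call snapshot.update(msg.topic, msg.payload) from on_message. Note that a plain # subscription does not match topics beginning with $, so the $SYS tree has to be subscribed to explicitly.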
The most surprising thing about monitoring MQTT is how easily you can be misled by focusing solely on broker-side metrics. A broker might show low CPU and memory usage, yet clients experience high latency or dropped connections. This almost always points to network bottlenecks or, more subtly, the broker’s internal event loop being starved due to excessive I/O or blocking operations, often triggered by a specific client’s behavior or a misconfiguration in how clients are handling QoS levels.
When a client publishes a message with QoS 1 or 2, the broker must acknowledge it: a PUBACK for QoS 1, or the PUBREC/PUBREL/PUBCOMP exchange for QoS 2. If the network is slow or the broker is overloaded, that acknowledgment can be delayed. The publisher then waits, and if its timeout is reached, it might retry, creating a feedback loop that escalates latency and connection issues. Understanding the interplay between network, broker internal state, and client acknowledgments is key.
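To get a feel for how quickly that feedback loop can grow, here is a deliberately simplified model, not a description of how any particular client library behaves. It assumes the publisher resends the same message every retry_timeout seconds until the acknowledgment arrives ack_delay seconds after the first send. Both function names are hypothetical.

```python
import math


def transmissions_per_message(ack_delay, retry_timeout):
    """Copies of one message sent before its acknowledgment arrives.

    Simplified model: one initial send, plus one resend each time
    retry_timeout elapses while the acknowledgment is still outstanding.
    """
    if retry_timeout <= 0:
        raise ValueError("retry_timeout must be positive")
    return max(1, math.ceil(ack_delay / retry_timeout))


def offered_load(publish_rate, ack_delay, retry_timeout):
    """Messages per second the broker has to handle, resends included."""
    return publish_rate * transmissions_per_message(ack_delay, retry_timeout)
```

Under this model, a publisher sending 100 messages per second with a 2-second retry timeout puts 300 messages per second on the broker once acknowledgments take 5 seconds, which in turn makes acknowledgments slower still. Actual retry behavior depends on the client library and protocol version, so treat the numbers as an illustration of the trend rather than a prediction.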
You’ll eventually want to correlate broker metrics with client-side observed behavior to get the full picture.