Kubernetes nodes are far more than just VMs; they’re the fundamental building blocks of your cluster, and their health directly dictates the stability of your entire deployment.
Let’s see what a healthy node looks like in New Relic. Imagine you’ve got a pod that’s acting up: maybe it’s slow to respond, or it’s crashing intermittently. You’d naturally start by looking at the pod’s metrics in New Relic. But if the pod itself looks fine, the next logical step is to zoom out to the node it runs on.
Here’s a typical view of a Kubernetes node’s performance in New Relic. You’ll see CPU utilization, memory usage, disk I/O, and network traffic.
{
  "entity": {
    "guid": "MzUyODg3N3xJTkZSfE5vZGV8NDU2Nzg5MA",
    "type": "K8sNode",
    "name": "node-worker-01.us-west-2.compute.internal",
    "account": {
      "id": 3528877
    }
  },
  "metrics": {
    "cpu": {
      "usagePercent": 35.5,
      "cores": 8
    },
    "memory": {
      "usageBytes": 17179869184,
      "totalBytes": 34359738368
    },
    "disk": {
      "readBytesPerSecond": 15000,
      "writeBytesPerSecond": 10000
    },
    "network": {
      "receiveBytesPerSecond": 50000,
      "transmitBytesPerSecond": 25000
    }
  },
  "kubernetes": {
    "node": {
      "conditions": {
        "Ready": "True",
        "MemoryPressure": "False",
        "DiskPressure": "False",
        "PIDPressure": "False",
        "NetworkUnavailable": "False"
      }
    }
  }
}
This JSON snippet represents the kind of data New Relic collects for a single Kubernetes node. guid is New Relic’s unique identifier for the entity. name is the actual hostname of the node. cpu.usagePercent shows how much of the node’s CPU is being utilized, and cpu.cores gives the number of cores that percentage is measured against. memory.usageBytes and memory.totalBytes give you the node’s memory consumption (here 16 GiB used out of 32 GiB, or 50%). disk and network metrics provide I/O and traffic rates in bytes per second. Critically, kubernetes.node.conditions shows the status of key Kubernetes node conditions.
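To make the memory fields concrete, here is a minimal Python sketch that turns usageBytes and totalBytes into a percentage. It assumes the payload has the same shape as the illustrative snippet above, which is not an official New Relic schema; the helper name memory_usage_percent is made up for this example.

```python
import json

# Trimmed copy of the node payload shown above.
# Assumption: illustrative shape only, not an official New Relic schema.
NODE_JSON = """
{
  "entity": {"name": "node-worker-01.us-west-2.compute.internal"},
  "metrics": {
    "cpu": {"usagePercent": 35.5, "cores": 8},
    "memory": {"usageBytes": 17179869184, "totalBytes": 34359738368}
  }
}
"""


def memory_usage_percent(node: dict) -> float:
    """Return memory usage as a percentage of the node's total memory."""
    mem = node["metrics"]["memory"]
    return 100.0 * mem["usageBytes"] / mem["totalBytes"]


node = json.loads(NODE_JSON)
total_gib = node["metrics"]["memory"]["totalBytes"] / 2**30
print(f'{node["entity"]["name"]}: '
      f'{memory_usage_percent(node):.1f}% of {total_gib:.0f} GiB in use')
```

For the sample values this works out to 50.0% of 32 GiB, which is why the node counts as comfortably healthy on memory.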
The problem this solves is visibility. Before tools like New Relic Infrastructure, understanding node-level performance in a dynamic Kubernetes environment often meant SSHing into individual nodes and running top, htop, or iostat, a tedious process that usually gave an incomplete picture. New Relic aggregates this data and correlates it with the applications and services running on those nodes, giving you a unified view.
The core mechanism is the New Relic Infrastructure agent. It runs as a DaemonSet on your Kubernetes cluster, meaning a copy of the agent is scheduled onto every eligible node. This agent collects system-level metrics, Kubernetes-specific metadata (like node conditions, labels, and annotations), and custom attributes. It then securely sends this data to the New Relic platform for processing, analysis, and visualization.
You can configure the agent to collect specific metrics, adjust the collection interval, and set up integrations with cloud providers or other systems. For example, you might want to exclude certain high-traffic directories from disk metric collection to reduce noise.
The power comes from correlation. If you see a spike in cpu.usagePercent on node-worker-01.us-west-2.compute.internal, you can immediately drill down to see which pods and containers are consuming that CPU. You can also see if this node is experiencing MemoryPressure or DiskPressure, which are critical Kubernetes conditions that can lead to pod evictions.
When you’re troubleshooting performance issues, you’re not just looking at raw numbers; you’re looking at trends and anomalies. New Relic allows you to set up alerts based on these metrics. For instance, you could be alerted if a node’s CPU utilization exceeds 80% for more than 10 minutes, or if the MemoryPressure condition becomes True.
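The "above 80% for more than 10 minutes" rule is a sustained-threshold check, and its logic can be sketched in a few lines of Python. This is not New Relic’s alerting API (in New Relic you would define an alert condition instead); the Sample type, the breaches_sustained function, and the one-sample-per-minute spacing are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    minute: int          # minutes since the start of the observation window
    cpu_percent: float   # node CPU utilization at that minute


def breaches_sustained(samples, threshold=80.0, duration_minutes=10):
    """Return True if CPU stays above `threshold` for an unbroken run
    lasting more than `duration_minutes`."""
    run_start = None
    for s in sorted(samples, key=lambda s: s.minute):
        if s.cpu_percent > threshold:
            if run_start is None:
                run_start = s.minute      # a new run above the threshold begins
            if s.minute - run_start > duration_minutes:
                return True
        else:
            run_start = None              # any dip resets the run
    return False


# Hypothetical data: 12 consecutive minutes at 85% CPU.
hot = [Sample(minute=m, cpu_percent=85.0) for m in range(12)]
print(breaches_sustained(hot))
```

Resetting the run on any dip is the design choice that separates a sustained problem from a series of short spikes, which is usually what you want from a node-level CPU alert.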
The kubernetes.node.conditions are particularly insightful. A Ready: "True" condition is the baseline; if it’s False (or Unknown), the scheduler stops placing new pods on the node, and pods already running there can be evicted if the node stays unhealthy. MemoryPressure and DiskPressure indicate that the node is running low on memory or disk, and the kubelet may start evicting pods to reclaim those resources. PIDPressure means the node is running out of process IDs. NetworkUnavailable means the node’s network is not correctly configured. These aren’t just generic system warnings; they are Kubernetes-aware indicators of impending cluster instability.
Most people focus on CPU and memory usage percentages. They miss that kubernetes.node.conditions are the actual signals Kubernetes uses to manage node health. If a node reports MemoryPressure: "True", the kubelet will actively try to reclaim memory by evicting pods, even if the overall CPU usage looks fine. This is a critical distinction: it’s not just about how much of a resource is used, but whether the node is signaling resource starvation to Kubernetes.
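A small Python sketch shows why conditions deserve their own check alongside utilization. It assumes the conditions arrive as a dict of string values, as in the illustrative JSON earlier; node_problems and PRESSURE_CONDITIONS are hypothetical names, not part of any New Relic or Kubernetes client library.

```python
# Conditions that signal trouble when they are "True".
PRESSURE_CONDITIONS = (
    "MemoryPressure",
    "DiskPressure",
    "PIDPressure",
    "NetworkUnavailable",
)


def node_problems(conditions: dict) -> list:
    """List the node conditions that indicate a problem.

    Ready is healthy only when it is "True" (so "False" and "Unknown"
    both count as a problem); every other condition is healthy when
    it is anything other than "True".
    """
    problems = []
    if conditions.get("Ready") != "True":
        problems.append("NotReady")
    problems.extend(
        name for name in PRESSURE_CONDITIONS if conditions.get(name) == "True"
    )
    return problems


# Hypothetical node: CPU looks fine at 35.5%, but the node reports memory pressure.
cpu_usage_percent = 35.5
conditions = {
    "Ready": "True",
    "MemoryPressure": "True",
    "DiskPressure": "False",
    "PIDPressure": "False",
    "NetworkUnavailable": "False",
}
print(cpu_usage_percent, node_problems(conditions))
```

A dashboard or alert keyed only on the CPU figure would report this node as healthy, while the condition check surfaces the MemoryPressure signal that actually drives evictions.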
Once you’ve mastered node monitoring, the next logical step is understanding how these nodes contribute to your cluster’s overall networking topology and how to visualize traffic flows between them and external services.