nvidia-smi is your direct line to what your NVIDIA GPUs are actually doing, not what you think they’re doing. It’s the command-line utility that lets you inspect everything from compute utilization to the exact amount of memory, in MiB, held by a specific process.

Let’s see it in action. Open your terminal and type:

watch -n 1 nvidia-smi

This command will refresh the output of nvidia-smi every second, giving you a live, dynamic view. You’ll see a table with your GPUs. For each GPU, you’ll get:

  • GPU Name: The model of your graphics card.
  • Fan Speed: How fast the fans are spinning, in percentage.
  • Temperature: The current core temperature in Celsius.
  • Power Usage: How much wattage the GPU is currently drawing.
  • Power Limit: The maximum wattage the GPU is allowed to draw.
  • Memory Usage: How much VRAM is currently in use and the total available.
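  • Compute Usage (shown as GPU-Util): The percentage of the last sample period during which at least one kernel was running on the GPU. It says the GPU was busy, not how many of its cores were in use.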
  • Processes: A list of processes currently using the GPU, along with their GPU ID, process ID (PID), and the amount of VRAM they’re consuming.
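
The same fields can be requested explicitly with --query-gpu, which is easier to read in scripts than the table. The following is a minimal sketch rather than a canonical recipe: the helper name label_gpu_stats is hypothetical, the field names are the ones listed by nvidia-smi --help-query-gpu, and it assumes GPU names contain no commas. Some fields (fan.speed on passively cooled cards, for example) may come back as [N/A].

```shell
# Fields matching the columns described above (names per `nvidia-smi --help-query-gpu`).
GPU_FIELDS="index,name,fan.speed,temperature.gpu,power.draw,power.limit,memory.used,memory.total,utilization.gpu"

# label_gpu_stats (hypothetical helper): reads CSV rows in GPU_FIELDS order
# from stdin and prints one labeled line per GPU.
label_gpu_stats() {
  awk -F', *' '{
    printf "GPU %s (%s): fan %s%%, %s C, %s/%s W, %s/%s MiB, util %s%%\n",
           $1, $2, $3, $4, $5, $6, $7, $8, $9
  }'
}

# Usage, assuming nvidia-smi is on PATH:
#   nvidia-smi --query-gpu="$GPU_FIELDS" --format=csv,noheader,nounits | label_gpu_stats
```

Because the helper reads from stdin, it can be tried on saved output from another machine as well as on a live GPU.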

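This isn’t just for idle curiosity. Imagine you’re running a deep learning training job and performance is sluggish. The watch -n 1 nvidia-smi output might reveal that GPU utilization is hovering around 10% while memory usage is near 100%. Low utilization means the GPU spends most of its time waiting, which usually points to a bottleneck upstream of it, such as data loading, CPU preprocessing, or host-to-device transfers. The full memory tells you less than it seems to: it shows that VRAM has been allocated (some frameworks reserve most of it up front), but this view does not measure memory bandwidth. Alternatively, if utilization is low and temperature is high, the GPU may be thermal throttling, which you can check with nvidia-smi -q -d PERFORMANCE, or your workload may not be structured to keep the GPU busy.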
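
A single glance at the table can mislead, because utilization fluctuates from second to second. One way to get a steadier number is to sample it over a window and average the result. The sketch below assumes GPU 0 and a one-second interval; sample_util and avg_util are hypothetical helper names, not part of nvidia-smi.

```shell
# sample_util COUNT (hypothetical helper): prints the utilization of GPU 0
# once per second, COUNT times. Requires nvidia-smi on PATH.
sample_util() {
  i=0
  while [ "$i" -lt "$1" ]; do
    nvidia-smi -i 0 --query-gpu=utilization.gpu --format=csv,noheader,nounits
    sleep 1
    i=$((i + 1))
  done
}

# avg_util (hypothetical helper): reads one utilization value per line from
# stdin and prints the sample count and the mean.
avg_util() {
  awk '{ sum += $1; n++ } END { if (n) printf "samples=%d mean=%.1f\n", n, sum / n }'
}

# Usage: average GPU 0 utilization over roughly 30 seconds.
#   sample_util 30 | avg_util
```

A mean that stays low across the window is a stronger hint that the GPU is being starved than one low reading.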

nvidia-smi solves the problem of opaque GPU activity. Without it, you’d have to guess what your GPU is doing, or rely on application-specific logs that might not tell the whole story. It provides a unified, hardware-level view that’s indispensable for performance tuning, debugging, and resource management.

Internally, nvidia-smi is built on the NVIDIA Management Library (NVML), which queries the NVIDIA driver and hardware for real-time statistics. The driver acts as an intermediary, exposing performance counters and sensor readings in a form the library can return. The utility itself is a thin wrapper, primarily focused on presentation.

You control the granularity of information through command-line arguments. For example, to see only the PIDs and memory usage for processes on GPU 0, you’d use:

nvidia-smi --query-compute-apps=pid,used_gpu_memory --format=csv,noheader,nounits -i 0

This prints one line per process, for example 1234, 2048 (PID 1234 using 2048 MiB of memory). If you want the memory usage of every GPU in a more scriptable format, use:

nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits

This would give you a list of numbers, one for each GPU, representing MiB used. This is crucial for automated monitoring scripts that might alert you if a specific GPU exceeds a VRAM threshold.
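
As an illustration of such a threshold check, the sketch below flags any GPU whose used memory exceeds a limit. The helper name over_threshold and the 20000 MiB limit are hypothetical; the index field is added to the query so each value can be tied to a GPU.

```shell
# over_threshold LIMIT_MIB (hypothetical helper): reads "index, memory.used"
# CSV rows from stdin, prints a line for each GPU above the limit, and
# returns non-zero if any GPU exceeded it.
over_threshold() {
  awk -F', *' -v limit="$1" '
    $2 + 0 > limit + 0 {
      printf "GPU %s: %d MiB used (limit %d MiB)\n", $1, $2, limit
      bad = 1
    }
    END { exit bad }
  '
}

# Usage, assuming nvidia-smi is on PATH:
#   nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits \
#     | over_threshold 20000 || echo "VRAM alert"
```

The non-zero exit status lets a cron job or monitoring agent react without parsing the printed text.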

What many people miss is how much control nvidia-smi offers over GPU clocks and power limits, which can significantly affect performance and energy consumption. You can adjust these settings at runtime for specific use cases. For instance, to set the power limit of GPU 0 to 150 watts (assuming your card supports that value and you have root privileges):

sudo nvidia-smi -i 0 --power-limit=150

This isn’t just about pushing performance; it’s also about power efficiency. For tasks that don’t require peak performance, lowering the power limit can save significant electricity and reduce heat output without a catastrophic drop in throughput. It’s a delicate balance, and nvidia-smi gives you the tools to find it.
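
Before changing a power limit, it can help to confirm that the requested value lies within the range the card reports. The sketch below assumes GPU 0 and the 150 W figure used above; limit_in_range is a hypothetical helper, while power.min_limit and power.max_limit are query fields listed by nvidia-smi --help-query-gpu. Cards that report [N/A] for these fields fail the check.

```shell
# limit_in_range WATTS (hypothetical helper): reads one
# "power.min_limit, power.max_limit" CSV row from stdin and succeeds only
# if WATTS lies within that range.
limit_in_range() {
  awk -F', *' -v want="$1" '
    NR == 1 { ok = (want + 0 >= $1 + 0 && want + 0 <= $2 + 0) }
    END { exit !ok }
  '
}

# Usage, assuming nvidia-smi is on PATH: apply the limit only if it is valid.
#   nvidia-smi -i 0 --query-gpu=power.min_limit,power.max_limit --format=csv,noheader,nounits \
#     | limit_in_range 150 && sudo nvidia-smi -i 0 --power-limit=150
```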

Once you’ve mastered monitoring and basic adjustments, the next step is often integrating this data into more sophisticated monitoring and alerting systems.
