NVIDIA Nsight Systems and Nsight Compute are both profilers for GPU applications, but they operate at different levels and answer different questions. Nsight Systems gives you a high-level, timeline-based view of the whole application, showing how CPU and GPU activities interleave. Nsight Compute, on the other hand, dives deep into individual GPU kernels, providing granular performance metrics for each one.
Let’s see Nsight Systems in action. Imagine you’re running a simulation. You’d start Nsight Systems on your host machine, launch your application through it, and record a trace.
nsys profile --trace=cuda,nvtx --stats=true --output=my_app.nsys-rep ./my_app
This command tells Nsight Systems to trace CUDA API calls and NVTX (NVIDIA Tools Extension) ranges, collect basic statistics, and save the output to my_app.nsys-rep. When you open my_app.nsys-rep in the Nsight Systems GUI, you’ll see a timeline. On this timeline, you’ll see your application’s CPU threads executing. When a CPU thread calls a CUDA API function like cudaLaunchKernel, you’ll see a corresponding GPU activity appear on the GPU timeline.
Here’s what that might look like in the UI (simplified):
CPU Thread 0: [------ cudaLaunchKernel ------] [------ cudaMemcpy ------]
GPU Worker Thread 0: [====== Kernel Execution ======]
The key insight here is how the CPU and GPU work together. If the CPU is busy preparing work or is slow to submit it, you’ll see gaps on the GPU timeline where the GPU sits idle. Conversely, if the GPU is the bottleneck, you’ll see CPU threads blocked in synchronizing calls (for example cudaDeviceSynchronize or a synchronous cudaMemcpy), shown as long bars on the CPU timeline that line up with GPU activity.
Nsight Systems helps you answer questions like:
- Is my GPU being fed work fast enough by the CPU?
- Are my CUDA API calls introducing significant overhead?
- How much time is spent on data transfers versus kernel execution?
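The last question can be answered with simple arithmetic once you have the start and end times of the GPU activities. The sketch below is only an illustration: the `events` list and the `summarize` helper are hypothetical and are not part of any Nsight Systems API. It assumes the events are sorted by start time and do not overlap.

```python
# Hypothetical GPU timeline events: (kind, start_ms, end_ms).
# In practice these numbers would be read off a recorded timeline.
events = [
    ("memcpy", 0.0, 2.0),
    ("kernel", 2.5, 7.5),
    ("memcpy", 8.0, 9.0),
    ("kernel", 12.0, 14.0),
]


def summarize(events):
    """Return (time per kind, idle time, busy fraction) for the GPU.

    Assumes events are sorted by start time and never overlap.
    """
    totals = {}
    for kind, start, end in events:
        totals[kind] = totals.get(kind, 0.0) + (end - start)
    span = events[-1][2] - events[0][1]   # first start to last end
    busy = sum(totals.values())
    idle = span - busy                    # gaps where the GPU has no work
    return totals, idle, busy / span


totals, idle_ms, busy_fraction = summarize(events)
print("time per activity (ms):", totals)
print("GPU idle (ms):", idle_ms)
print("GPU busy fraction:", round(busy_fraction, 3))
```

A large idle share points to the CPU not feeding the GPU fast enough, while a memcpy total that rivals the kernel total suggests transfers deserve attention before the kernels do.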
Once Nsight Systems points you to a specific kernel that seems to be a bottleneck (e.g., a long-running kernel execution on the GPU timeline), you’d then use Nsight Compute to analyze that kernel in detail. You launch Nsight Compute from the command line, pointing it at your application and choosing which sets of metrics to collect.
ncu --target-processes all -o my_kernel_profile --section SpeedOfLight --section MemoryWorkloadAnalysis ./my_app
This command runs my_app and profiles every CUDA kernel it launches, including kernels in any child processes; to restrict collection to a single kernel, add --kernel-name followed by the kernel’s name. The --section flags tell Nsight Compute which sets of metrics to collect. SpeedOfLight reports the achieved compute and memory throughput as a percentage of the hardware’s theoretical maximum, and MemoryWorkloadAnalysis breaks down how the kernel uses the memory hierarchy.
When you open the resulting my_kernel_profile.ncu-rep report in the Nsight Compute GUI, the amount of data can be overwhelming. The two most useful starting points are the "Summary" and "Details" pages. The "Summary" page gives you a quick overview of potential performance issues, like low occupancy or memory bandwidth limitations. The "Details" page presents the collected metrics grouped into sections by what they measure: memory access, compute utilization, warp stalls, and so on.
Nsight Compute provides a detailed breakdown of where a kernel is spending its time. For example, you might see metrics like:
- DRAM Throughput: How much data is being read from or written to global memory.
- L2 Cache Hit Rate: How often data is found in the L2 cache.
- SM Utilization: How busy the Streaming Multiprocessors (SMs) are.
- Occupancy: The ratio of active warps to the maximum possible warps on an SM.
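To make the occupancy ratio concrete, here is a rough sketch of how a theoretical occupancy figure can be estimated from a kernel's launch configuration. The per-SM limits used as defaults (64 warps, 32 blocks, 65536 registers, 96 KiB of shared memory) are assumptions for illustration and vary between GPU architectures. The function name is hypothetical, and the model ignores register and shared-memory allocation granularity, so the value Nsight Compute reports can differ.

```python
def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          max_warps_per_sm=64, max_blocks_per_sm=32,
                          regs_per_sm=65536, smem_per_sm=98304, warp_size=32):
    """Estimate occupancy = resident warps / maximum warps on one SM.

    All hardware limits are illustrative defaults, not values for a
    specific GPU. Allocation granularity is deliberately ignored.
    """
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division

    # How many blocks fit on an SM under each resource limit.
    limit_warps = max_warps_per_sm // warps_per_block
    limit_regs = regs_per_sm // (regs_per_thread * warps_per_block * warp_size)
    if smem_per_block:
        limit_smem = smem_per_sm // smem_per_block
    else:
        limit_smem = max_blocks_per_sm  # no shared memory: not a constraint

    blocks = min(max_blocks_per_sm, limit_warps, limit_regs, limit_smem)
    return blocks * warps_per_block / max_warps_per_sm


# Same block size, three different resource footprints.
print(theoretical_occupancy(256, 32, 0))      # limited by warp slots
print(theoretical_occupancy(256, 64, 0))      # limited by registers
print(theoretical_occupancy(256, 32, 49152))  # limited by shared memory
```

The point of the sketch is that occupancy is set by whichever resource runs out first, which is why a small change in register usage or shared memory per block can change the figure abruptly.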
The goal is to identify the "long pole in the tent" – the primary reason your kernel isn’t running as fast as it could. Is it waiting on memory? Is it not using enough compute resources? Is it stalled waiting for instructions?
A common pattern is to see low SM Utilization and high DRAM Throughput alongside a low L2 Cache Hit Rate. This suggests the kernel is memory-bound and is not making effective use of the L1/L2 caches. The fix might involve restructuring your data access patterns to improve cache locality, or perhaps using shared memory more effectively.
For instance, if your kernel does a lot of independent reads from global memory, you might see low compute throughput while DRAM Throughput is high. The next thing to check is the cache hit rate. If it is low, the accesses are probably uncoalesced or too scattered for the caches to help. The fix could be to reorder your data structure or access pattern so that consecutive threads within a warp access contiguous memory locations. For example, if you were walking a 2D array with a large stride between neighboring threads, switching to a layout or indexing scheme in which neighboring threads read neighboring elements can dramatically improve cache utilization and effective global memory bandwidth.
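The cost of a strided access pattern can be estimated by counting how many memory sectors a single warp-wide load touches. The sketch below is a simplified model, not something Nsight Compute provides: it assumes 32 threads per warp, 32-byte sectors, 4-byte elements, and a base address aligned to a sector boundary. The function name is hypothetical.

```python
def sectors_per_warp_request(stride_elems, elem_bytes=4,
                             warp_size=32, sector_bytes=32):
    """Count distinct memory sectors touched by one warp-wide load.

    Lane i reads the element at index i * stride_elems. The base address
    is assumed to be sector-aligned, and each element fits in one sector.
    """
    addresses = [lane * stride_elems * elem_bytes for lane in range(warp_size)]
    sectors = {addr // sector_bytes for addr in addresses}
    return len(sectors)


for stride in (1, 2, 32):
    print("stride", stride, "->", sectors_per_warp_request(stride), "sectors")
```

With a stride of 1 the warp's 128 bytes fit in 4 sectors, while a stride of 32 elements spreads the same 32 loads over 32 separate sectors, so eight times as much data moves for the same useful work.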
The most surprising thing about Nsight Compute is how it can reveal performance bottlenecks that aren’t immediately obvious from the algorithm itself. You might have a perfectly correct algorithm, but the way it maps to the GPU’s hardware (its memory access patterns, instruction mix, and thread scheduling) can be the real performance limiter. The tool exposes these hardware-specific interactions.
For instance, when looking at warp-level statistics, you might notice a high number of Stall:Other events. This counter is a catch-all, but if you drill down into the instruction mix, you might find that a specific type of instruction, like a transcendental function (e.g., sin, cos, exp), is taking a disproportionately long time and causing warps to stall. The fix here isn’t always to rewrite the algorithm; sometimes it is enough to use lookup tables, polynomial approximations, or other cheaper approximations, if the required precision allows.
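Whether a polynomial approximation is acceptable depends on how much error the application tolerates, and that is easy to check on the host before touching the kernel. The sketch below measures the maximum error of a truncated Taylor polynomial for sin on [-pi/2, pi/2]. It assumes the inputs have already been reduced to that range, and the helper names are hypothetical. A production approximation would more likely use tuned (minimax) coefficients.

```python
import math


def sin_poly(x, terms):
    """Taylor polynomial for sin(x) around 0 using `terms` odd-power terms."""
    result, term = 0.0, x
    for k in range(terms):
        result += term
        # Next term: multiply by -x^2 / ((2k + 2)(2k + 3)).
        term *= -x * x / ((2 * k + 2) * (2 * k + 3))
    return result


def max_abs_error(terms, samples=1001):
    """Largest |sin_poly - sin| over evenly spaced points in [-pi/2, pi/2]."""
    lo, hi = -math.pi / 2, math.pi / 2
    xs = [lo + (hi - lo) * i / (samples - 1) for i in range(samples)]
    return max(abs(sin_poly(x, terms) - math.sin(x)) for x in xs)


for terms in (3, 4, 5):
    print(terms, "terms -> max error", max_abs_error(terms))
```

Each extra term costs a couple of multiply-adds and shrinks the error considerably, so the required precision directly determines how cheap the replacement can be.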
Once you’ve optimized a kernel based on Nsight Compute’s analysis, you’d go back to Nsight Systems to ensure your overall application performance has improved and that no new bottlenecks have been introduced.
The next step after mastering kernel profiling is understanding how to leverage different GPU architectures effectively, as the metrics and their interpretation can vary significantly between generations.