einpoklum · the day before yesterday
First - nice writeup which goes into a lot of nooks and crannies.
That said, a lot of the user-space "voodoo" goes away if you don't go through CUDA's "runtime API". If you use the driver API, take your kernel source as a string, and compile it with NVIDIA's run-time compiler (NVRTC), you'll have better visibility into a lot (not all) of what's going on. For the "raw" version of this, look at:
https://github.com/NVIDIA/cuda-samples/tree/master/cpp/0_Int...
but for a much more readable, and still fully transparent modern-C++ API version of the same, try this:
https://github.com/eyalroz/cuda-api-wrappers/blob/master/exa...
That's a sample program from my header-only CUDA API wrappers library.
kinow · yesterday
I just finished a master's in HPC, where I had to take classes on CUDA, MPI+CUDA, and OpenCL. Reading an article like this before those classes would have been very helpful! Especially the part just before and after "What does it mean for a warp to be eligible?".
mschuetz · yesterday
That was an interesting read. I also enjoyed reading about the semaphores in the default stream. It's great that CUDA implicitly handles synchronization of commands for users and makes parallel commands opt-in via streams, unlike Vulkan, which offloads the full complexity of synchronization onto users right from the start.
fooblaster · the day before yesterday
The hardware has some open documentation. You don't actually need to read the kernel source to find some of the method documentation or QMD formats. See https://github.com/NVIDIA/open-gpu-doc/blob/master/classes/c...
aberrahmane_b · yesterday
It's very useful. The doorbell and QMD parts were the most useful for me, because they connect the CUDA launch syntax to what actually gets submitted to the GPU. Most explanations stop around kernels, blocks, and warps, but this made the CPU-to-driver-to-GPU path much easier to follow.