Share your bug fixes

Dec 18, 2012 at 4:28 PM

The GPU is an unruly beast. Please share your most annoying bugs and how you eventually managed to fix them, so that others won’t have to go through that same horror.

Here are a few of my own, in no particular order:

- Number of parameters on kernel calls: a kernel’s stack space is limited and may vary with the hardware. As a rule of thumb, try to keep the total size of the arguments below 128 bytes. NVIDIA claims it can go up to 256, but with memory alignment and other issues, it can rarely reach that threshold safely. If you’re unsure how to calculate your kernel’s stack size, assume a worst case of 8 bytes for each normal argument and 16 bytes for each pointer (i.e., array argument). Exceed the stack size and you’ll get crazy errors.

- Launching kernels within a different context: if you’re getting “Invalid handle” while launching your kernel in a multi-GPU setting, it usually means that the “gpu” instance you’re using to launch the kernel belongs to a context other than the current one. Either use CUDAfy’s multi-threading management tools (check CUDAfy’s unit tests for examples), or manually make the gpu’s context the current context by calling mygpu.SetCurrentContext() before you launch the kernel.

- If you’re getting an ErrorUnknown either right after launching your kernel or (if you launched it asynchronously) on the next CUDAfy call, your kernel aborted unexpectedly. There could be a plethora of reasons for that, but the most frequent is a memory access violation. Use emulation mode to pinpoint your problem. The CUDA Toolkit’s “cuda-memcheck.exe” can also help you there. Other frequent reasons for ErrorUnknown are calling “return” on only a few of the threads within the block, dividing by zero, or even launching a kernel with an excessively large blockDim*gridDim.

- If you’re getting inconsistent numerical results that change on every run, you are most likely reading uninitialized memory somewhere in your computations. Another cause is not calling SyncThreads() after manipulating shared memory in a way where more than one thread contributes to a common shared result. Try placing SyncThreads() everywhere, and if that fixes your problem, selectively remove the calls. This has fixed some vexing problems I had where different hardware and architectures would produce inconsistent results.
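
To make the rule of thumb from the first item concrete, here is a rough sketch of the worst-case arithmetic. It is plain Python used only as a calculator, not CUDAfy code. The function names are made up for illustration, and the 8/16-byte costs and the 128-byte budget are just the estimates from that item, not values read from the hardware.

```python
def estimate_param_bytes(num_value_args, num_pointer_args,
                         value_bytes=8, pointer_bytes=16):
    """Worst-case size of a kernel's argument list, in bytes.

    Assumes the rule of thumb above: 8 bytes per normal argument,
    16 bytes per pointer (array) argument. Hypothetical helper.
    """
    if num_value_args < 0 or num_pointer_args < 0:
        raise ValueError("argument counts must be non-negative")
    return num_value_args * value_bytes + num_pointer_args * pointer_bytes


def fits_budget(num_value_args, num_pointer_args, budget_bytes=128):
    """True if the worst-case estimate stays below the chosen budget."""
    estimate = estimate_param_bytes(num_value_args, num_pointer_args)
    return estimate < budget_bytes
```

By this estimate, a kernel taking 6 scalars and 4 arrays needs 6*8 + 4*16 = 112 bytes and stays under the budget, while one taking 5 scalars and 6 arrays needs 136 bytes and does not. In the second case, packing several scalars into a single array argument is one way to get back under the limit.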


Sep 25, 2013 at 10:04 PM
Edited Nov 20, 2013 at 10:54 PM
  • Launching a kernel with an incorrect number of parameters will throw ErrorUnknown (immediately, whereas a division by zero, for example, could take a while to show up as the GPU runs)
  • ErrorUnknown, possibly with a display driver crash: the kernel is running for too long (the limit is usually 2 seconds)