Cudafy and long-running kernels

Dec 3, 2012 at 11:38 PM


I have some code that takes a long time to run and hits the Windows graphics card watchdog timer timeout.

I therefore split it up into smaller kernels. What is the best way to run them within Cudafy?

Let's say I have split it into 10 sets. I can see a few options for looping through them:

1) Call LaunchAsync with a new streamid (0-9) each time.

2) Call LaunchAsync with a set of streams (say 0-2).

3) Call Launch and then Synchronize each time.

As far as I can tell, (1) doesn't work: it still hits the watchdog timer. I've tried (2) and (3), but neither seems consistently better. What is best practice?


Dec 4, 2012 at 6:48 AM

None of these options will make your kernel run quicker.  The only way to do that is to reduce the number of loop iterations within a single kernel, for example by exiting early and picking up again where you left off in a following kernel.  If this is not easily feasible, you do not mind freezing the app and graphics for some time, and you have control over the users' machines, then increase the timeout via the registry.
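To make the "exit and pick up again" idea concrete, here is a minimal host-side sketch in plain C++. It only plans the chunks and walks through them in order; `plan_chunks`, `run_chunked` and the `launch_and_wait` callback are hypothetical names for illustration, not part of Cudafy or CUDA. In real code the callback would launch the kernel for iterations [begin, end) and then synchronize, so each launch stays well under the watchdog limit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One slice of the original loop: iterations [begin, end).
struct Chunk {
    std::size_t begin;
    std::size_t end;
};

// Split total_iters iterations into consecutive chunks of at most
// max_per_launch iterations. Each chunk is meant to become one short
// kernel launch that resumes where the previous one stopped.
std::vector<Chunk> plan_chunks(std::size_t total_iters, std::size_t max_per_launch) {
    assert(max_per_launch > 0);
    std::vector<Chunk> chunks;
    for (std::size_t begin = 0; begin < total_iters; begin += max_per_launch) {
        std::size_t end = std::min(begin + max_per_launch, total_iters);
        chunks.push_back({begin, end});
    }
    return chunks;
}

// Run every chunk in order. launch_and_wait(begin, end) is a placeholder:
// it stands in for "launch the kernel on this range, then synchronize".
template <typename LaunchFn>
void run_chunked(std::size_t total_iters, std::size_t max_per_launch, LaunchFn launch_and_wait) {
    for (const Chunk& c : plan_chunks(total_iters, max_per_launch)) {
        launch_and_wait(c.begin, c.end);
    }
}
```

The chunk size is something you would tune by timing a single launch. Any intermediate state the loop carries has to live in device memory between launches so the next chunk can continue from it.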

Dec 4, 2012 at 7:47 AM

Sorry, I wasn't clear: I have already made each kernel run quicker by splitting the work into 10 kernels with smaller loops.

My question is: when I run those 10 kernels, what is the best method?

Dec 4, 2012 at 9:42 AM

CUDA runs kernel launches sequentially even when they have different stream ids.  Copying can take place in parallel with kernel execution, so async is only of advantage when overlapping copies with launches.  Also note that GeForce cards have only one copy engine, so copies to and from the device are sequential.  High-end Quadro and Tesla cards have two copy engines, so copies to and from the device can take place in parallel.

So, in answer to your question, there is no way to speed up kernel execution using streams.  You can only attempt to overlap your I/O, if that also takes a significant amount of time.


Dec 4, 2012 at 2:12 PM

In fact, Fermi introduced concurrent execution of multiple kernels on a single GPU. You can check whether your GPU has that feature by calling GetDeviceProperties() and checking the concurrentKernels property (writing from memory here, there could be a typo). From the Fermi whitepaper (...'s_Fermi-The_First_Complete_GPU_Architecture.pdf):

"Fermi supports simultaneous execution of multiple kernels from the same application, each kernel being distributed to one or more SMs on the device. This capability avoids the situation where a kernel is only able to use part of the device and the rest goes unused".


If your kernels weren't already at full occupancy, you may gain some performance by launching them simultaneously on two parallel streams. If you have many pending kernel launches, you should interleave them across the streams (e.g. k1 -> s1, k2 -> s2, k3 -> s1, k4 -> s2, etc.) using LaunchAsync. I'm assuming that none of your kernels depends on the output of another kernel; otherwise you'd need to add some synchronization mechanism.
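The interleaving pattern above is just a round-robin assignment. A minimal sketch in plain C++, with `interleave_streams` as a hypothetical helper name and zero-based stream ids (so s1 and s2 above become 0 and 1); it only computes which stream each kernel would go to, and leaves the actual LaunchAsync calls out:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round-robin assignment: kernel i is queued on stream (i % num_streams).
// Returns the stream id for each kernel index, in launch order.
std::vector<int> interleave_streams(std::size_t num_kernels, int num_streams) {
    assert(num_streams > 0);
    std::vector<int> stream_of(num_kernels);
    for (std::size_t i = 0; i < num_kernels; ++i) {
        stream_of[i] = static_cast<int>(i % static_cast<std::size_t>(num_streams));
    }
    return stream_of;
}
```

With 4 kernels and 2 streams this gives streams 0, 1, 0, 1. You would then issue one asynchronous launch per kernel on its assigned stream and synchronize each stream once at the end. This only helps if the kernels are independent and the device reports support for concurrent kernels.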



Dec 5, 2012 at 1:24 PM

How do you guys define a "long running kernel"? What is the timeframe of execution here?

Dec 5, 2012 at 1:57 PM

Long-running, in my opinion, means a kernel that exceeds the timeout of the Windows display driver, which is around 1-2 seconds.  Others may set the bar lower, such as at the point where it becomes noticeable that the screen is frozen.