I have some code that takes a long time to run and hits the Windows graphics card watchdog timeout.
I therefore split it up into smaller kernels. What is the best way to do this within CUDAfy?
Let's say I have split the work into 10 sets. I can see a few options for looping through them:
1) Call LaunchAsync with a new stream ID (0-9) each time.
2) Call LaunchAsync, cycling through a small set of streams (say 0-2).
3) Call Launch and then Synchronize each time.
As far as I can tell, (1) doesn't work: it still hits the watchdog timer. I've tried (2) and (3), but neither seems consistently better. What is best practice?
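
For concreteness, option 3 amounts to the loop structure sketched below. This is plain host-side C++ rather than actual cudafy code: split_work, run_chunks_blocking, and the launch and synchronize callbacks are hypothetical stand-ins for the real Launch and Synchronize calls, and the near-even split is only an assumption about how the work divides.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// One chunk of the overall work: items [offset, offset + count).
struct Chunk {
    std::size_t offset;
    std::size_t count;
};

// Hypothetical helper: split `total` items into `parts` nearly equal chunks.
std::vector<Chunk> split_work(std::size_t total, std::size_t parts) {
    std::vector<Chunk> chunks;
    if (parts == 0) return chunks;
    const std::size_t base = total / parts;
    const std::size_t extra = total % parts;  // first `extra` chunks get one more item
    std::size_t offset = 0;
    for (std::size_t i = 0; i < parts; ++i) {
        const std::size_t count = base + (i < extra ? 1 : 0);
        chunks.push_back({offset, count});
        offset += count;
    }
    return chunks;
}

// Option 3: launch one chunk, then wait for it to finish before launching
// the next. `launch` and `synchronize` are placeholders for the GPU calls.
void run_chunks_blocking(const std::vector<Chunk>& chunks,
                         const std::function<void(const Chunk&)>& launch,
                         const std::function<void()>& synchronize) {
    for (const Chunk& c : chunks) {
        launch(c);       // start the kernel for this chunk only
        synchronize();   // block until that kernel has completed
    }
}
```

With 10 sets, this makes 10 launches, each followed by a wait, so only one chunk is ever outstanding at a time. Options 1 and 2 differ only in replacing the per-chunk wait with asynchronous launches on one or more streams.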