Multiple GPUs and multi-threading

Nov 26, 2012 at 7:13 PM

I'm trying to get this working with 2 GPUs in a multi-threaded application.

I create the two GPU objects, and when calling them I lock on two separate objects, one per GPU (so that each GPU can only be called by one thread at a time).

Within the 2 locks I call SetCurrentContext and then perform CopyToDevice, Launch, Synchronize, CopyFromDevice and FreeAll (on the respective GPU objects).

I'm getting the error "ErrorInvalidHandle" - do you have any idea what I could be doing wrong? This works fine with a single GPU in a multi-threaded environment.

Nov 27, 2012 at 12:40 AM

Hi, assuming this happens only while using CUDAfy, could you provide us with a code example showing where the exception is being thrown, exactly?

Nov 27, 2012 at 1:08 AM
Edited Nov 27, 2012 at 9:14 PM

Error is thrown on "Launch".

pseudo-code looks like this:

	using System;
	using System.Threading;
	using Cudafy;
	using Cudafy.Host;
	using Cudafy.Translator;

	public class GPUtester
	{
		CudaGPU[] gpus;
		object[] oLocks;
		const int numGPUS = 2;		// works fine if = 1
		const int numTHREADS = 2;

		public GPUtester()
		{
			oLocks = new object[numGPUS];
			for (int i = 0; i < numGPUS; i++) oLocks[i] = new object();
			gpus = new CudaGPU[numGPUS];
			CudafyModule km = CudafyTranslator.Cudafy(ePlatform.Auto, eArchitecture.sm_30, typeof(GPUtester));
			for (int i = 0; i < numGPUS; i++)
				gpus[i] = (CudaGPU)CudafyHost.GetDevice(CudafyModes.Target, i);
			Thread[] thds = new Thread[numTHREADS];
			for (int i = 0; i < numTHREADS; i++)
			{
				thds[i] = new Thread(RunIt);
				int iThread = i % numGPUS;
				thds[i].Start(iThread);		// assumed: the start call is not shown in the post
			}
		}

		public void RunIt(object o)
		{
			int iGPU = (int)o;
			int[] anArray = new int[1000];
			int numLaunched = 0;
			try
			{
				// per-GPU locking (lock on oLocks[iGPU], later gpus[iGPU].Lock()) is not shown in the post
				if (!gpus[iGPU].IsCurrentContext) throw new Exception("Not current context");
				int[] danArray = gpus[iGPU].CopyToDevice(anArray);
				float[] pIn = new float[100];
				float[] pOut = gpus[iGPU].Allocate<float>(pIn);
				if (!gpus[iGPU].IsCurrentContext) throw new Exception("Not current context");
				gpus[iGPU].Launch(100, 8, "GPUrunner", danArray, pOut);		// ERROR HERE: ErrorInvalidHandle
				if (!gpus[iGPU].IsCurrentContext) throw new Exception("Not current context");
				gpus[iGPU].CopyFromDevice(pOut, pIn);
				if (!gpus[iGPU].IsCurrentContext) throw new Exception("Not current context");
			}
			catch (Exception e)
			{
				// handler body is not shown in the post
			}
		}

		public static void GPUrunner(int[] theArray, float[] outArray)
		{
			for (int i = 0; i < theArray.Length; i++) outArray[i] = ((float)theArray[i]) / 3f;
		}
	}

Nov 27, 2012 at 7:58 PM

Hi, I'm sorry I can't help you much today, since my dual-GPU rig is being repaired. Maybe tomorrow it will be fixed and I'll be able to test your code. Until then, or until a dev can look into your problem, I have only a few ideas which might help you:

1 - You don't need to handle the locks yourself, since CudaGPU already has Lock/Unlock methods which you may call directly (see 2). The Lock method also "pushes" the correct context into scope, so you shouldn't have to call gpus[iGPU].SetCurrentContext() either, and Unlock "pops" that context again.

2 - Have you checked the Cudafy.Host.UnitTests project in CUDAfy's source code? It has a test that does something similar to what you're trying here, but it only launches 2 threads and copies device memory without calling a kernel. I suggest you try it, since it will help you narrow down the problem. I'll paste the code for your convenience:

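private GPGPU _gpu0;
private GPGPU _gpu1;
private uint[] _gpuuintBufferIn0;
private uint[] _uintBufferIn0;
private uint[] _uintBufferOut0;
private uint[] _gpuuintBufferIn1;
private uint[] _uintBufferIn1;
private uint[] _uintBufferOut1;

public void Test_TwoThreadTwoGPU()
{
    _gpu0 = CudafyHost.CreateDevice(CudafyModes.Target, 0);
    _gpu1 = CudafyHost.CreateDevice(CudafyModes.Target, 1);
    // assumed, not shown in the excerpt: Lock/Unlock require multithreading to be enabled
    _gpu0.EnableMultithreading();
    _gpu1.EnableMultithreading();
    // to-do: initialize the _uintBufferIn* and _uintBufferOut* arrays
    bool j1 = false;
    bool j2 = false;
    for (int i = 0; i < 10; i++)
    {
        Thread t1 = new Thread(Test_TwoThreadTwoGPU_Thread0);
        Thread t2 = new Thread(Test_TwoThreadTwoGPU_Thread1);
        t1.Start();     // assumed: the start calls are not shown in the excerpt
        t2.Start();
        j1 = t1.Join(10000);
        j2 = t2.Join(10000);
        if (!j1 || !j2)
            Assert.Fail("Thread did not complete");   // assumed: the body is not shown in the excerpt
    }
}

private void Test_TwoThreadTwoGPU_Thread0()
{
    _gpu0.Lock();       // assumed from point 1: pushes the context of GPU 0
    _gpuuintBufferIn0 = _gpu0.CopyToDevice(_uintBufferIn0);
    _gpu0.CopyFromDevice(_gpuuintBufferIn0, _uintBufferOut0);
    Assert.IsTrue(Compare(_uintBufferIn0, _uintBufferOut0));
    _gpu0.Unlock();     // assumed from point 1: pops the context again
}

private void Test_TwoThreadTwoGPU_Thread1()
{
    _gpu1.Lock();       // assumed from point 1: pushes the context of GPU 1
    _gpuuintBufferIn1 = _gpu1.CopyToDevice(_uintBufferIn1);
    _gpu1.CopyFromDevice(_gpuuintBufferIn1, _uintBufferOut1);
    Assert.IsTrue(Compare(_uintBufferIn1, _uintBufferOut1));
    _gpu1.Unlock();     // assumed from point 1: pops the context again
}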

3 - Try checking gpus[iGPU].IsCurrentContext right after you call gpus[iGPU].Lock(), to see if the context was properly set.


I hope this helps...



Nov 27, 2012 at 9:09 PM

thanks, I have made the changes you suggest (amended in my initial post above).

IsCurrentContext always seems to be true (no exception thrown there).

I'm still getting the same error, though not every time I run the test code (I hadn't noticed this previously, but it was probably already the case).

The unit test code you posted runs fine - no errors.


Here's the stack trace of my error if it helps:

   at GASS.CUDA.CUDA.set_LastError(CUResult value)
   at GASS.CUDA.CUDA.SetFunctionBlockShape(CUfunction func, Int32 x, Int32 y, Int32 z)
   at Cudafy.Host.CudaGPU.DoLaunch(dim3 gridSize, dim3 blockSize, Int32 streamId, KernelMethodInfo gpuMethodInfo, Object[] arguments)
   at Cudafy.Host.GPGPU.LaunchAsync(dim3 gridSize, dim3 blockSize, Int32 streamId, String methodName, Object[] arguments)
   at Cudafy.Host.GPGPU.Launch(Int32 gridSize, Int32 blockSize, String methodName, Object[] arguments)
   at Strats.GPUtester.RunIt(Object o) in s:\Users\Ben\Documents\SharpDevelop Projects\Strats\GPUtester.cs:line 126


thanks very much for your help.

Nov 28, 2012 at 6:49 AM


Currently I do not have access to a multi-GPU system, so I cannot run the test. Personally I've never seen the InvalidHandle error. InvalidContext, yes, I saw that one often enough when implementing the context support. From googling InvalidHandle, I believe your issue may be related to the fact that you loaded your kernel module in a different context from the one where you used it. See this link for more info.

Let us know how you get on.


Nov 28, 2012 at 8:01 AM

that's solved it! Thanks to both of you.

To be clear, the change I've made is to move this line:

CudafyModule km = CudafyTranslator.Cudafy(ePlatform.Auto, eArchitecture.sm_30,typeof(GPUtester));

inside the loop, just after GetDevice.


thanks again,


Nov 28, 2012 at 12:57 PM
Edited Nov 28, 2012 at 12:57 PM

It makes sense, in hindsight :)

I'm glad you sorted it out, especially since I'll have to use very similar code in the future.

Nick: a nice feature would be an optional parameter to "Cudafy", or any module-creating method, where you'd specify the target gpu.



Nov 28, 2012 at 1:16 PM

Can you elaborate on that?  What should it do?


Nov 28, 2012 at 1:55 PM
Edited Nov 28, 2012 at 2:05 PM

Ok, maybe I got this all wrong, so please bear with me :)

As far as I can tell, a kernel handle is bound to a context. Trying to use a kernel acquired in a different context will throw an error.

Presently, whenever one loads (and acquires a handle to) a kernel (using either CudafyTranslator.Cudafy, or CudafyModule.Deserialize, etc), the current context is implicitly used.

Therefore, it would make sense (for multi-threaded multi-gpu scenarios) to be able to specify a context when you load (and consequently acquire a handle to) a kernel.


(edited for clarity)

Nov 28, 2012 at 3:07 PM

I could easily be missing something here. As I see it, the problem only arises when a module is loaded (gpu.LoadModule(...)) in a different context from the one in which it is launched. Creating modules through the Cudafy method or by deserializing is context-agnostic.
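
The distinction can be illustrated with a toy model. This is only a sketch: ToyModule, ToyContext, ToyFunctionHandle and ToyDemo are hypothetical stand-ins invented for illustration, not CUDAfy or CUDA driver types. It assumes the behaviour described in this thread, namely that the translated module is plain data, while the handle obtained by loading it is only valid in the context that loaded it.

```csharp
using System;

// Hypothetical stand-ins for illustration only; none of these are CUDAfy or CUDA driver types.

// Context-agnostic artifact: just the translated kernel text.
public sealed class ToyModule
{
    public string Ptx { get; }
    public ToyModule(string ptx) { Ptx = ptx; }
}

// Handle produced by loading a module; it remembers the context that created it.
public sealed class ToyFunctionHandle
{
    public ToyContext Owner { get; }
    public ToyFunctionHandle(ToyContext owner) { Owner = owner; }
}

public sealed class ToyContext
{
    public int DeviceId { get; }
    public ToyContext(int deviceId) { DeviceId = deviceId; }

    // Loading is the context-bound step: the returned handle belongs to this context.
    public ToyFunctionHandle Load(ToyModule module)
    {
        return new ToyFunctionHandle(this);
    }

    // Launching with a handle from another context is rejected.
    public void Launch(ToyFunctionHandle handle)
    {
        if (!ReferenceEquals(handle.Owner, this))
            throw new InvalidOperationException(
                "ErrorInvalidHandle: handle belongs to device " + handle.Owner.DeviceId);
    }
}

public static class ToyDemo
{
    // Returns true when the launch is accepted, false when the handle is invalid here.
    public static bool TryLaunch(ToyContext context, ToyFunctionHandle handle)
    {
        try { context.Launch(handle); return true; }
        catch (InvalidOperationException) { return false; }
    }

    public static void Main()
    {
        var module = new ToyModule("// translated kernel text");   // created once, shared
        var ctx0 = new ToyContext(0);
        var ctx1 = new ToyContext(1);
        var handle0 = ctx0.Load(module);
        var handle1 = ctx1.Load(module);

        Console.WriteLine(TryLaunch(ctx0, handle0));   // True: loaded and launched in context 0
        Console.WriteLine(TryLaunch(ctx1, handle0));   // False: handle from context 0 used in context 1
        Console.WriteLine(TryLaunch(ctx1, handle1));   // True: same module, loaded again in context 1
    }
}
```

In this model the module is created once and shared, and only the Load step has to be repeated per context.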

Nov 28, 2012 at 3:10 PM


It might have been more useful to future readers of this thread if you had left your original "wrong" code in place, maybe with an edit adding a comment at the place where it went bad, and then re-posted the good code. Actually, this is also for selfish reasons, since I was trying to do a before-and-after comparison to better understand pedritolo1's point ;-)



Nov 28, 2012 at 5:44 PM
Edited Nov 28, 2012 at 5:47 PM

It was my mistake, now I see. I got confused and mixed up CudafyTranslator.Cudafy with gpu.LoadModule...



Nov 28, 2012 at 5:50 PM
Edited Nov 28, 2012 at 5:50 PM

hi Nick, 

that's what I've done - the original post has the CudafyTranslator.Cudafy outside the loop, and my comment says that I've moved it inside.

To be clear my NEW code (snippet) looks like this:


			gpus = new CudaGPU[numGPUS];
			for(int i=0;i<numGPUS;i++)
			{
				gpus[i] = (CudaGPU)CudafyHost.GetDevice(CudafyModes.Target, i);
				CudafyModule km = CudafyTranslator.Cudafy(ePlatform.Auto, eArchitecture.sm_30,typeof(GPUtester));
			}


Now I'm a little unsure - is that the change you were suggesting?

Nov 28, 2012 at 7:39 PM
Edited Nov 28, 2012 at 7:40 PM

Ah, hence my confusion. That CudafyTranslator.Cudafy line (the part in bold) wasn't supposed to be there at all, since it should be context-agnostic (it makes no sense to have to cudafy over and over for each GPU). But mcmillab claims it won't work without it, which led me to believe that CudafyTranslator.Cudafy was inadvertently loading a kernel onto the current context.


Dec 2, 2012 at 7:33 AM

yes that's right. 

I had expected the CudafyTranslator.Cudafy call to be context-agnostic, which is why I had it outside the loop to start with.

I've now been running for a couple of days with it inside the loop and it definitely seems to have solved my problem.

Dec 3, 2012 at 8:05 AM

This makes no sense. The Cudafy module simply contains the PTX (among other things), which is loaded during the LoadModule operation. You should only need to put the LoadModule call into the loop. Remember to cache Cudafy modules to improve re-run performance.
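
A minimal sketch of that arrangement, assuming two CUDA devices and the CUDAfy host API as recalled here (TryDeserialize/TryVerifyChecksums/Serialize for caching, LoadModule, EnableMultithreading, Lock/Unlock). The class name MultiGpuSketch, the Scale kernel and the sizes are hypothetical, and the sketch has not been run on a multi-GPU machine, so treat it as an outline rather than a verified fix.

```csharp
using System;
using System.Threading;
using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;

public class MultiGpuSketch
{
    const int NumGpus = 2;      // assumption: two CUDA devices are present
    const int N = 1000;         // hypothetical problem size

    // Hypothetical kernel: one GPU thread per element.
    [Cudafy]
    public static void Scale(GThread thread, int[] input, float[] output)
    {
        int i = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
        if (i < input.Length)
            output[i] = ((float)input[i]) / 3f;
    }

    public static void Main()
    {
        // Translate once. The module is only data (PTX plus metadata),
        // so it can be cached on disk and shared between devices.
        CudafyModule km = CudafyModule.TryDeserialize();
        if (km == null || !km.TryVerifyChecksums())
        {
            km = CudafyTranslator.Cudafy(ePlatform.Auto, eArchitecture.sm_30, typeof(MultiGpuSketch));
            km.Serialize();
        }

        GPGPU[] gpus = new GPGPU[NumGpus];
        for (int i = 0; i < NumGpus; i++)
        {
            gpus[i] = CudafyHost.GetDevice(CudafyModes.Target, i);
            gpus[i].LoadModule(km);             // the context-bound step: once per device
            gpus[i].EnableMultithreading();     // assumed prerequisite for Lock/Unlock
        }

        Thread[] threads = new Thread[NumGpus];
        for (int i = 0; i < NumGpus; i++)
        {
            GPGPU gpu = gpus[i];                // local copy so each closure sees its own device
            threads[i] = new Thread(() => Run(gpu));
            threads[i].Start();
        }
        foreach (Thread t in threads) t.Join();
    }

    static void Run(GPGPU gpu)
    {
        int[] input = new int[N];
        float[] result = new float[N];

        gpu.Lock();                             // makes this device's context current on this thread
        try
        {
            int[] devIn = gpu.CopyToDevice(input);
            float[] devOut = gpu.Allocate<float>(N);
            gpu.Launch((N + 127) / 128, 128, "Scale", devIn, devOut);
            gpu.CopyFromDevice(devOut, result);
            gpu.Free(devIn);
            gpu.Free(devOut);
        }
        finally
        {
            gpu.Unlock();                       // releases the context again
        }
    }
}
```

The translation and the on-disk cache are handled once, before the loop. Only LoadModule is repeated, so each device gets its own kernel handle in its own context.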