CUDAfy Sync vs Async Code

Mar 20, 2016 at 9:30 PM
Hey All,

First, I have to say I am loving CUDAfy! I was hoping I could get some advice on the code below. I have my routines working synchronously, and so far they work very well. The obvious next step was to get everything working asynchronously. I have been able to achieve this and it is working, but I feel like the code could be slimmed down and that I am possibly not taking proper advantage of the async methods. These routines will be run multiple times. The basic goal is to find specific colors in a byte array from an image and mark the ones that match. I process this twice because each launch uses different parameters to look for a different color. Hopefully this code does not look too bad.

            _gpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);

            if (_gpu != null)
            {
                var props = _gpu.GetDeviceProperties();
                Console.WriteLine("Using {0}optimized driver.", props.HighPerformanceDriver ? "" : "non-");
                Console.WriteLine("{0}, Platform: {1}, Capability: {2}, Max Block Threads: {3}", props.Name, props.PlatformName, props.Capability, props.MaxThreadsPerBlock);

                _cm = CudafyModule.TryDeserialize();
                if (_cm == null || !_cm.TryVerifyChecksums())
                    _cm = CudafyTranslator.Cudafy();

                if (_cm != null)
                {
                    float elapsedTime;
                    CloneParameters(HostParam[0], out DevParam[0]);
                    // Allocate the memory on the GPU
                    byte[] dev_Source = _gpu.Allocate<byte>(N); // Source Image Result
                    byte[] dev_Tire = _gpu.Allocate<byte>(N);   // Result 1 Image Result
                    byte[] dev_Wheel = _gpu.Allocate<byte>(N);  // Result 2 Image Result

                    // Allocate host locked memory, used to stream
                    IntPtr src_aPtr = _gpu.HostAllocate<byte>(N);
                    IntPtr host_aPtr = _gpu.HostAllocate<byte>(N);
                    IntPtr host_bPtr = _gpu.HostAllocate<byte>(N);

                    // Allocate & Write Settings
                    IntPtr dev_Param = _gpu.HostAllocate<MatchParameters>(Marshal.SizeOf(typeof(MatchParameters)));
                    dev_Param.Write(DevParam, 0, 0, Marshal.SizeOf(typeof(MatchParameters)));

                    // Write Source to Pointer
                    src_aPtr.Write(_source, 0, 0, N);

                    if (_gpu.IsCurrentContext)
                    {
                        _gpu.CopyToDeviceAsync(src_aPtr, 0, dev_Source, 0, N, 1);
                        _gpu.CopyToDeviceAsync(host_aPtr, 0, dev_Tire, 0, N, 2);
                        _gpu.CopyToDeviceAsync(host_bPtr, 0, dev_Wheel, 0, N, 3);
                        _gpu.CopyToConstantMemoryAsync(dev_Param, 0, DevParam, 0, 1, 4);

                        _gpu.LaunchAsync(new dim3(_bmp.Width, _bmp.Height), 1, 1, "MatchKernel", dev_Tire, dev_Source, 1);
                        _gpu.LaunchAsync(new dim3(_bmp.Width, _bmp.Height), 1, 2, "MatchKernel", dev_Wheel, dev_Source, 2);

                        _gpu.CopyFromDeviceAsync(dev_Tire, 0, host_aPtr, 0, N, 1);
                        _gpu.CopyFromDeviceAsync(dev_Wheel, 0, host_bPtr, 0, N, 2);


                        elapsedTime = _gpu.StopTimer();

                        byte[] host_a = new byte[N];
                        byte[] host_b = new byte[N];

                        GPGPU.CopyOnHost(host_aPtr, 0, host_a, 0, N);
                        GPGPU.CopyOnHost(host_bPtr, 0, host_b, 0, N);
                        Console.WriteLine("Elapsed: {0} ms", elapsedTime);
                    }
                }
            }


Mar 21, 2016 at 12:14 AM
Edited Mar 21, 2016 at 12:17 AM
Hi, some quick remarks:

1) I think you might want to fix this:

_gpu.CopyToDeviceAsync(host_bPtr, 0, dev_Wheel, 0, N, 3);
_gpu.CopyToConstantMemoryAsync(dev_Param, 0, DevParam, 0, 1, 4);

They are running on different streams from the kernels', which means they may not have completed when the kernels start, and I assume the kernels need that data to be available.

2) You are creating auxiliary pinned memory buffers on the host and then copying into them from the original host data. You should instead pin the original host memory buffer (use the "fixed" keyword in C# or the GCHandle.Alloc function) and copy it directly onto the device. Don't forget to unpin it when you're done.

3) The point of async kernel executions is usually to overlap the execution of a kernel with a memory copy operation, which doesn't seem to be what you're doing here.
For info on the latter, you should look at the examples within cudafy; you might also want to read

good luck
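
A minimal sketch of the pinning idea from point 2, using only the standard GCHandle API. The class and method names (PinningSketch, WithPinned) are hypothetical and not part of CUDAfy. Note that GCHandle pinning only stops the garbage collector from moving the array; whether the driver then treats that memory as page-locked for a truly asynchronous transfer is an assumption to verify on your setup.

```csharp
using System;
using System.Runtime.InteropServices;

public static class PinningSketch
{
    // Pins a managed buffer, hands its address to the callback,
    // and always unpins it afterwards, even if the callback throws.
    public static void WithPinned(byte[] buffer, Action<IntPtr> use)
    {
        GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
        try
        {
            // The address stays valid only while the handle is allocated.
            use(handle.AddrOfPinnedObject());
        }
        finally
        {
            handle.Free(); // unpin
        }
    }
}
```

If the callback starts an async copy from the pinned address, it should also synchronize that stream before returning, because the buffer is unpinned as soon as the callback exits.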
Mar 21, 2016 at 12:24 AM
Gotcha, will look at number 1.

Agreed. I had a feeling that this part could be improved and that I was not allocating the memory 100% the correct way. Will see if I can sort this out.

I will read this document over!

Thanks for your time!
Mar 21, 2016 at 1:45 AM
Hey pedritolo1,

I was hoping to lean on your advice once more real quick. I have been looking over that doc you sent and things are starting to make more sense. What I was hoping you could shed some light on is what to do when you need to share data/memory between streams. For example, I need the code below to work like this:
// Both Streams Need this data. Default stream? Or Copy to both streams 1 & 2
_gpu.CopyToDeviceAsync(src_aPtr, 0, dev_Source, 0, N, 1);

// Stream 1 
_gpu.CopyToDeviceAsync(host_aPtr, 0, dev_Tire, 0, N, 1);

// Stream 2
_gpu.CopyToDeviceAsync(host_bPtr, 0, dev_Wheel, 0, N, 2);

// Both Streams Need this data. Default stream? Or Copy to both streams 1 & 2
_gpu.CopyToConstantMemoryAsync(dev_Param, 0, DevParam, 0, 1, 4);

_gpu.LaunchAsync(new dim3(_bmp.Width, _bmp.Height), 1, 1, "MatchKernel", dev_Tire, dev_Source, 1);
_gpu.LaunchAsync(new dim3(_bmp.Width, _bmp.Height), 1, 2, "MatchKernel", dev_Wheel, dev_Source, 2);

_gpu.CopyFromDeviceAsync(dev_Tire, 0, host_aPtr, 0, N, 1);
_gpu.CopyFromDeviceAsync(dev_Wheel, 0, host_bPtr, 0, N, 2);
Mar 21, 2016 at 2:30 AM
So using the default stream 0 for my dev_Source data and constant memory data does produce the correct results, but is this the correct way to do this?
Mar 21, 2016 at 3:32 PM
Edited Mar 21, 2016 at 3:45 PM
If kernel A and kernel B (it can actually be the same code, but with different data) both rely on data buffer C, then you need to move C onto the device prior to executing A & B. You can achieve that by forcing the sync of the stream you used for C's transfer, and only then call A and/or B.

gpu.CopyToDeviceAsync(h_C, d_C, 1); // it could be any stream, as long as you sync it afterwards.
gpu.SynchronizeStream(1);

or simply copy it synchronously:

gpu.CopyToDevice(C); // will implicitly use stream 0 and force a sync at the end.

only then can you call anything that relies on C being on the device:

gpu.LaunchAsync(A, d_C, 1);
gpu.LaunchAsync(B, d_C, 2);

I notice you're calling lots of stuff async without ever synchronizing anything (except at the very end), and the stream numbers don't seem to follow any apparent logic either.

I don't think you're supposed to use constant memory in that manner. I suggest reading up on constant memory in CUDA, or simply using a normal buffer instead of a constant memory buffer.

I think you could gain from drawing a diagram picturing your desired goal, like those diagrams in the link I posted. It might help you see things more clearly.
Let's see here: you're trying to gain performance by moving memory onto the device while a kernel is executing on a different set of data. I don't see that in the code at all.
1st, move the memory common to both kernels onto the device. Synchronize.
2nd, copy the buffer needed for kernel A. Synchronize.
3rd, concurrently call kernel A while copying the data needed for kernel B. Synchronize.
4th, concurrently call kernel B while copying A's output back to the host. Synchronize.
5th, copy B's output back to the host. Synchronize.

Notice how there are only 2 steps where you stand to gain from concurrent execution: the 3rd and 4th.
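
The five steps above could be sketched roughly as follows, reusing only the calls that already appear in this thread (CopyToDeviceAsync, LaunchAsync, CopyFromDeviceAsync, SynchronizeStream). The helper name, its parameters, and the choice of streams 1 and 2 are assumptions for illustration. It also assumes the host pointers come from HostAllocate, the module is loaded, and the match parameters were already copied to constant memory synchronously.

```csharp
using System;
using Cudafy;
using Cudafy.Host;

public static class PipelineSketch
{
    // Hypothetical helper: kernel A writes devA, kernel B writes devB,
    // and both read the shared devSrc buffer.
    public static void Run(GPGPU gpu, dim3 grid, int n,
                           IntPtr hostSrc, IntPtr hostA, IntPtr hostB,
                           byte[] devSrc, byte[] devA, byte[] devB)
    {
        // 1st: data common to both kernels. Synchronize.
        gpu.CopyToDeviceAsync(hostSrc, 0, devSrc, 0, n, 1);
        gpu.SynchronizeStream(1);

        // 2nd: buffer needed by kernel A. Synchronize.
        gpu.CopyToDeviceAsync(hostA, 0, devA, 0, n, 1);
        gpu.SynchronizeStream(1);

        // 3rd: run kernel A (stream 1) while copying B's buffer (stream 2).
        gpu.LaunchAsync(grid, 1, 1, "MatchKernel", devA, devSrc, 1);
        gpu.CopyToDeviceAsync(hostB, 0, devB, 0, n, 2);
        gpu.SynchronizeStream(1);
        gpu.SynchronizeStream(2);

        // 4th: run kernel B (stream 2) while copying A's output back (stream 1).
        gpu.LaunchAsync(grid, 1, 2, "MatchKernel", devB, devSrc, 2);
        gpu.CopyFromDeviceAsync(devA, 0, hostA, 0, n, 1);
        gpu.SynchronizeStream(1);
        gpu.SynchronizeStream(2);

        // 5th: copy B's output back. Synchronize.
        gpu.CopyFromDeviceAsync(devB, 0, hostB, 0, n, 2);
        gpu.SynchronizeStream(2);
    }
}
```

Whether the overlap in the 3rd and 4th steps actually happens depends on the device and on the host buffers being page-locked, so it is worth timing this against the synchronous version.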
Mar 21, 2016 at 4:20 PM
You're right on several points here. I was working from some samples I have seen, and most of them synchronize (SynchronizeStream) the way I had. Clearly I need to do some more reading on this. You are right about 3 & 4, and this is where I thought I would see an improvement.

Since we have been on the topic, this is what I have been using, and I was looking to see if I can improve upon the performance:
byte[] buffer1 = new byte[_source.Length];
byte[] src1_dev_bitmap = _gpu.CopyToDevice(_source);
byte[] dst1_dev_bitmap = _gpu.Allocate<byte>(_source.Length);

_gpu.CopyToConstantMemory(HostParam, DevParam);
_gpu.Launch(new dim3(_bmp.Width, _bmp.Height), 1).MatchKernel(dst1_dev_bitmap, src1_dev_bitmap, 1);
_gpu.CopyFromDevice(dst1_dev_bitmap, buffer1);

byte[] buffer2 = new byte[_source.Length];
byte[] src2_dev_bitmap = _gpu.CopyToDevice(_source);
byte[] dst2_dev_bitmap = _gpu.Allocate<byte>(_source.Length);

_gpu.Launch(new dim3(_bmp.Width, _bmp.Height), 1).MatchKernel(dst2_dev_bitmap, src2_dev_bitmap, 2);
_gpu.CopyFromDevice(dst2_dev_bitmap, buffer2);
Mar 21, 2016 at 4:25 PM
Funny, since talking with you about this code I can see I was also doing a double copy in my current code that did not need to be there!
The source array/image was copied to the device twice.
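
A sketch of the synchronous version with the source uploaded only once, built from the same calls as the snippet above. The helper name and its parameters are hypothetical. It assumes the module is loaded and that the caller has already called CopyToConstantMemory for the match parameters.

```csharp
using Cudafy;
using Cudafy.Host;

public static class SingleUploadSketch
{
    // Hypothetical helper: runs MatchKernel twice against one device copy of the source.
    public static void Run(GPGPU gpu, dim3 grid, byte[] source,
                           byte[] result1, byte[] result2)
    {
        byte[] devSrc = gpu.CopyToDevice(source);            // uploaded once, read by both launches
        byte[] devDst1 = gpu.Allocate<byte>(source.Length);
        byte[] devDst2 = gpu.Allocate<byte>(source.Length);

        // First pass: parameter set 1.
        gpu.Launch(grid, 1).MatchKernel(devDst1, devSrc, 1);
        gpu.CopyFromDevice(devDst1, result1);

        // Second pass: parameter set 2, same source buffer.
        gpu.Launch(grid, 1).MatchKernel(devDst2, devSrc, 2);
        gpu.CopyFromDevice(devDst2, result2);

        // Release device memory, since the routine runs multiple times.
        gpu.Free(devSrc);
        gpu.Free(devDst1);
        gpu.Free(devDst2);
    }
}
```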