Why is CUDAfy.NET so slow?

Jul 11, 2014 at 1:37 PM
Being a die-hard C# coder, I hoped I could use CUDAfy.NET to harness my GPU's performance without going the C/C++ route.

However, after running the samples from the book, I realized that a release build of the C# version of the examples runs about 50% slower than the C version.

As an example, on a Tesla K20 the C release build of the Chapter 9 example (histogram) takes under 50 ms to run, versus 100 ms for the C# version.

Why's that?
Jul 11, 2014 at 8:00 PM
Hi
I seriously doubt that the examples fully exploit the GPU's capabilities. They are how-to examples, not performance studies. I know a few things about performance optimization, and I can attest that, on average (it varies a lot with GPU and CPU specs), I see performance improvements that can easily exceed 20x. It depends on the class of problem, so there isn't an easy answer.
As for CUDAfy versus CUDA in C, there's no difference between the two, except when all you do is launch billions of kernels that do absolutely nothing, in which case the interop delays become the bottleneck.
Anyhow, welcome to CUDAfy :)
Jul 14, 2014 at 7:11 AM
Edited Jul 14, 2014 at 7:12 AM
Thanks for your answer!

I don't think I made my point clear, though, or perhaps you misunderstood me.

The question was: why is a release build of the pure CUDA C version 50% faster than the C# CUDAfy version?

I'm talking about the Chapter 9 example from the book.
Jul 15, 2014 at 6:24 PM
If you are seeing those numbers, I think it might be for one of these reasons:
  1. You're also measuring CUDAfy's initialization, where the DLL is translated to C and then compiled on the fly to CUDA, which doesn't happen in the native version. If that's the case, do the CUDAfy initialization outside of your performance measurement.
  2. You're actually measuring the interop cost.
  3. The C code produced from the DLL's disassembly isn't as amenable to optimization by CUDA's compiler as hand-written native code.
Hope that helps.
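Point 1 can be checked with something like the following minimal sketch, which keeps translation, module loading and the host-to-device copy outside the timed region and times only the kernel launch. The class name `HistogramTiming`, the `CpuHistogram` helper, the fixed block count and the warm-up launch are assumptions made for illustration and are not taken from the attached project; the kernel follows the global-memory atomics structure of the Chapter 9 sample.

```csharp
using System;
using System.Diagnostics;
using Cudafy;
using Cudafy.Atomics;
using Cudafy.Host;
using Cudafy.Translator;

public static class HistogramTiming
{
    public const int Size = 100 * 1024 * 1024;

    // Global-memory atomics kernel: grid-stride loop over the input bytes.
    [Cudafy]
    public static void histo_kernel(GThread thread, byte[] buffer, int size, uint[] histo)
    {
        int i = thread.threadIdx.x + thread.blockIdx.x * thread.blockDim.x;
        int stride = thread.blockDim.x * thread.gridDim.x;
        while (i < size)
        {
            thread.atomicAdd(ref histo[buffer[i]], 1);
            i += stride;
        }
    }

    // Hypothetical helper: CPU reference used to validate the GPU result.
    public static uint[] CpuHistogram(byte[] data)
    {
        uint[] histo = new uint[256];
        foreach (byte b in data) histo[b]++;
        return histo;
    }

    public static void Main()
    {
        // One-time setup, deliberately NOT timed.
        CudafyModes.Target = eGPUType.Cuda;           // assumption: a CUDA device is wanted
        CudafyModule km = CudafyTranslator.Cudafy();  // translate to CUDA C and compile
        GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);
        gpu.LoadModule(km);

        byte[] buffer = new byte[Size];
        new Random(0).NextBytes(buffer);
        byte[] devBuffer = gpu.CopyToDevice(buffer);
        uint[] devHisto = gpu.Allocate<uint>(256);
        const int blocks = 32;                        // assumption: tune per device

        // Warm-up launch so first-launch overhead is excluded as well.
        gpu.Set(devHisto);
        gpu.Launch(blocks, 256).histo_kernel(devBuffer, Size, devHisto);
        gpu.Synchronize();

        // Timed region: one kernel launch only.
        gpu.Set(devHisto);
        Stopwatch sw = Stopwatch.StartNew();
        gpu.Launch(blocks, 256).histo_kernel(devBuffer, Size, devHisto);
        gpu.Synchronize();
        sw.Stop();
        Console.WriteLine("Kernel only: {0:F2} ms", sw.Elapsed.TotalMilliseconds);

        // Validate against the CPU reference.
        uint[] histo = new uint[256];
        gpu.CopyFromDevice(devHisto, histo);
        uint[] expected = CpuHistogram(buffer);
        bool ok = true;
        for (int b = 0; b < 256; b++) ok &= histo[b] == expected[b];
        Console.WriteLine(ok ? "Histogram matches CPU reference." : "Histogram mismatch!");

        gpu.FreeAll();
    }
}
```

For a fair comparison, the native build would need to be timed over the same region (kernel only, or kernel plus copies in both cases).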
Coordinator
Jul 15, 2014 at 7:52 PM
Are you definitely targeting a CUDA device and not an OpenCL device? Check line 23 of Program.cs. If it is OpenCL, you could even be running it on the CPU.
Jul 16, 2014 at 8:35 AM
For your convenience I've attached a VS 2012 solution containing both the C# and the C++ versions.

The C++ project is called HistogramGPUGmem and the C# version HistogramCPUCSCudafy (sorry for the typo).

https://dl.dropboxusercontent.com/u/29132054/Histrogram.7z

I'm positive that I'm targeting a CUDA device.
Coordinator
Jul 17, 2014 at 2:23 PM
Looks like you are running on the emulator!!
            CudafyModule km = CudafyTranslator.Cudafy();

            GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);
            if (gpu is CudaGPU && gpu.GetDeviceProperties().Capability < new Version(1, 2))
            {
                Console.WriteLine("Compute capability 1.2 or higher required for atomics.");
                return -1;
            }
            gpu.LoadModule(km);
Check the value of CudafyModes.Target.
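One way to make that check without stepping through the debugger is to print the target and the concrete host class before launching anything. This is only a sketch: the class name `TargetCheck` is made up for illustration, and setting the target to `eGPUType.Cuda` assumes a CUDA device is what is intended.

```csharp
using System;
using Cudafy;
using Cudafy.Host;

public static class TargetCheck
{
    public static void Main()
    {
        // Show whatever target is in effect before anything sets it explicitly.
        Console.WriteLine("CudafyModes.Target = {0}", CudafyModes.Target);

        // Assumption: a CUDA device is the intended target, so request it explicitly.
        CudafyModes.Target = eGPUType.Cuda;

        GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);

        // The runtime type of the host object tells emulator and device apart.
        Console.WriteLine("Host class  : {0}", gpu.GetType().Name);
        Console.WriteLine("Device name : {0}", gpu.GetDeviceProperties(false).Name);
    }
}
```

Comparing the first line of output with the host class printed afterwards shows whether the explicit assignment changed anything.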
Jul 17, 2014 at 4:09 PM
Hmm didn't think of that one :)
Coordinator
Jul 17, 2014 at 6:12 PM
You should spend a good proportion of your programming life in the debugger! Visual Studio has incredibly powerful debug support. One should also examine the Output window.
Jul 17, 2014 at 6:30 PM
Guys, no offense, but what have I done to you that you think I'm mentally handicapped? Or is this your general attitude towards people, that they are all stupid?

Of course I'm running on the device, unless CUDAfy returns "Tesla K20X" from gpu.GetDeviceProperties().Name even when the emulator is running.
Coordinator
Jul 17, 2014 at 7:36 PM
I don't think there is anything in the above replies to suggest anyone thinks you have any deficiencies. "You should spend a good proportion of your programming life in the debugger!" was a quote from the chief developer on Visual Studio at the 2009 Microsoft conference in the Netherlands.

I took the time to download your project, compile it and run it, and what I saw in the debugger was that the emulator was used. I could not find where you set CudafyModes.Target to any other value.
Jul 19, 2014 at 10:36 AM
Hi lightxx, I'm sorry if you got the wrong impression from my posts. I by no means wanted to imply anything about you from my replies. Just saw a problem and tried to fix it. And knowing Nick, I'm sure he didn't mean to either.
Aug 19, 2014 at 9:31 AM
Edited Aug 19, 2014 at 12:29 PM
Never mind. Sorry for being rude but I had a hard day / night when I wrote that post.

I'm running the code on the device. Evidently you don't have to set CudafyModes.Target explicitly to run a kernel on the device instead of the emulator. The emulator is orders of magnitude slower, which is how I know that the kernel runs on the device and not the host. If I explicitly set CudafyModes.Target to Emulator, the code runs at least 100 times slower than on the device.

What's even weirder: I have two kernels, one using shared and global memory atomics, and another using just global memory atomics. According to the book (Chapter 9), the former should be 100 times faster than the latter, yet on my Tesla K20 both versions run almost equally fast (or slow).

Global: http://pastebin.com/UE868vuW
Shared and Global: http://pastebin.com/VJ4f9BhW
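For readers without access to the pastebin links, the two variants being compared look roughly like the sketch below. It follows the structure of the Chapter 9 histogram kernels as they would be written in CUDAfy and is not a copy of the linked code; the class and method names are made up, and the shared-memory version assumes a launch with exactly 256 threads per block.

```csharp
using Cudafy;
using Cudafy.Atomics;

public static class HistogramKernels
{
    // Variant 1: every increment is an atomic on global memory.
    [Cudafy]
    public static void histo_global(GThread thread, byte[] buffer, int size, uint[] histo)
    {
        int i = thread.threadIdx.x + thread.blockIdx.x * thread.blockDim.x;
        int stride = thread.blockDim.x * thread.gridDim.x;
        while (i < size)
        {
            thread.atomicAdd(ref histo[buffer[i]], 1);
            i += stride;
        }
    }

    // Variant 2: each block accumulates into a private shared-memory
    // histogram, then merges it into global memory with one atomic per bin.
    // Assumes blockDim.x == 256 so each thread owns exactly one bin.
    [Cudafy]
    public static void histo_shared(GThread thread, byte[] buffer, int size, uint[] histo)
    {
        uint[] temp = thread.AllocateShared<uint>("temp", 256);
        temp[thread.threadIdx.x] = 0;
        thread.SyncThreads();

        int i = thread.threadIdx.x + thread.blockIdx.x * thread.blockDim.x;
        int stride = thread.blockDim.x * thread.gridDim.x;
        while (i < size)
        {
            thread.atomicAdd(ref temp[buffer[i]], 1);
            i += stride;
        }

        // Wait until the whole block has finished counting before merging.
        thread.SyncThreads();
        thread.atomicAdd(ref histo[thread.threadIdx.x], temp[thread.threadIdx.x]);
    }
}
```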

Could you take a look and solve that mystery?

Also, I still can't figure out for the life of me why the C++ version is 50% faster than the C# version.

EDIT: I'm targeting SM_35. The CUDA Handbook says that SM 3.5 features "faster global atomics". Could that be the reason?