CopyToDevice C# vs C, small blocks vs large blocks.

Jun 2, 2014 at 1:56 PM
I have a cuda(fy) project where the total amount of data does not fit on the GPU.
I need to "loop" blockwise through the data many times and do computations.

So if I have 10 GB of data, I could in theory split it into blocks of, for instance, 100 MB.
From a code-structure point of view it is easier to use blocks of 1 MB, so that is what we do for now.

At the moment the process is bound by the CopyToDevice portion.
It's already much faster than the CPU version, so we see the potential.
However, optimizing the GPU kernels is of no use, since 90% of the time is spent in the Cudafy CopyToDevice function.

Now I have 2 questions:

Is the CopyToDevice function in Cudafy much slower than, e.g., copying in native C?
Possibly due to overhead from data conversion, etc.
In that case we would need to switch to C. That would be a bummer.

If we copied in blocks of 10 to 100 MB instead of 1 MB, would this be beneficial?
Bigger blocks would give us some ugly code constructs, but if it's worth it we'd do it.
If it's a big benefit, we can stay in C#.

Thanks for the answer!
Jun 3, 2014 at 7:58 AM
Q1: Fractionally. I doubt you'd be able to measure it consistently.
Q2: The proof is in the pudding. You can test this very quickly by modifying a code sample in the CudafyByExample project. Remember that you can pipeline your transfers: put copies and launches in parallel. If you are using a Quadro or Tesla, you can have copy to device, launch, and copy from device all in parallel.
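
To get a rough feel for what pipelining could buy before touching any GPU code, here is a minimal timing model in Python. It is only a sketch, not Cudafy code: the function names and all numbers are hypothetical, and it assumes uniform per-block times, two device buffers, and hardware that can run one copy and one kernel at the same time.

```python
# Hypothetical timing model, not Cudafy code: names and numbers are illustrative.

def serial_time(n_blocks, copy_s, kernel_s):
    """Total time when each block is copied, then processed, one after another."""
    return n_blocks * (copy_s + kernel_s)


def pipelined_time(n_blocks, copy_s, kernel_s):
    """Total time when the copy of block i+1 overlaps the kernel on block i.

    Assumes two device buffers, uniform per-block times, and a device that
    can overlap one copy with one kernel.
    """
    if n_blocks <= 0:
        return 0.0
    # The first copy and the last kernel cannot be hidden; in between,
    # the slower of the two stages sets the pace.
    return copy_s + (n_blocks - 1) * max(copy_s, kernel_s) + kernel_s


if __name__ == "__main__":
    # Assumed per-block times where the copy is 90% of the work.
    n, copy_s, kernel_s = 10_000, 0.0009, 0.0001
    print(f"serial:    {serial_time(n, copy_s, kernel_s):.3f} s")
    print(f"pipelined: {pipelined_time(n, copy_s, kernel_s):.3f} s")
```

In this model the saving is (n_blocks - 1) times the shorter of the two stages, so overlap pays off most when copy and kernel times are of similar size. Real per-block times would have to be measured.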
Aug 24, 2014 at 9:16 AM
Just to inform you.
There was some overhead in CopyToDevice.
Batching the copies does improve the throughput.
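
One simple way to picture why batching can help, as a sketch only: assume every copy call pays a fixed cost on top of the time spent moving bytes. The Python below is a hypothetical model, not a measurement; the overhead and bandwidth values are placeholders that would need to be measured on the actual system.

```python
# Hypothetical cost model: each copy call pays a fixed overhead plus bytes / bandwidth.

def transfer_time(total_bytes, block_bytes, per_call_overhead_s, bandwidth_bytes_per_s):
    """Estimated time to move total_bytes in chunks of block_bytes."""
    n_calls = -(-total_bytes // block_bytes)  # ceiling division
    return n_calls * per_call_overhead_s + total_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    MB = 1024 * 1024
    total = 10 * 1024 * MB       # 10 GB of data, as in the question
    overhead_s = 20e-6           # assumed fixed cost per copy call
    bandwidth = 6 * 1024 * MB    # assumed sustained bytes per second
    for block_mb in (1, 10, 100):
        t = transfer_time(total, block_mb * MB, overhead_s, bandwidth)
        print(f"{block_mb:>4} MB blocks: {t:.3f} s")
```

Only the per-call term shrinks as blocks grow, so in this model the benefit of bigger blocks depends entirely on how large that fixed overhead really is.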

However, it turned out that my profiler somehow gave wrong information.
It was actually a kernel that took the most time.