Grid and block size - CPU (OpenCL) 4x faster than GPU?!

May 7, 2013 at 8:43 PM
Edited May 7, 2013 at 9:07 PM
Any suggestions on optimizing grid and block size?

I have three 2D arrays of slightly different sizes:
var dim = 200;

var ez = new float[dim + 1, dim + 1];
var hx = new float[dim, dim + 1];
var hy = new float[dim + 1, dim];
I call three kernels in a loop, each of which updates one of the arrays. Each update depends on that array's previous values and on the current values of "neighboring" cells in the other arrays, so the three kernels have to run sequentially.
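For reference, the dependency pattern looks roughly like this on the CPU (a C sketch with a made-up coefficient standing in for the real update equations; border rows/columns are skipped here):

```c
#include <stddef.h>

#define DIM 4       /* tiny size for illustration; I actually use 128/200 */
#define C   0.5f    /* made-up coefficient, stands in for the real ones */

static float ez[DIM + 1][DIM + 1];
static float hx[DIM][DIM + 1];
static float hy[DIM + 1][DIM];

/* One frame = three sequential passes. Each pass reads values the
   previous pass wrote, which is why the kernels can't run concurrently. */
static void step(void)
{
    /* Pass 1: interior ez cells read neighbouring hx/hy values. */
    for (size_t i = 1; i < DIM; i++)
        for (size_t j = 1; j < DIM; j++)
            ez[i][j] += C * ((hy[i][j] - hy[i - 1][j])
                           - (hx[i][j] - hx[i][j - 1]));

    /* Pass 2: hx cells read the ez values just written
       (last column left untouched as a border case). */
    for (size_t i = 0; i < DIM; i++)
        for (size_t j = 0; j < DIM; j++)
            hx[i][j] -= C * (ez[i][j + 1] - ez[i][j]);

    /* Pass 3: hy cells likewise (last row left untouched). */
    for (size_t i = 0; i < DIM; i++)
        for (size_t j = 0; j < DIM; j++)
            hy[i][j] -= C * (ez[i + 1][j] - ez[i][j]);
}
```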

At the moment, I'm just passing in a grid size equal to the array dimensions, and a block size of 1x1. This functions correctly, BUT:

For a "dim" value of 128, I'm getting about 280 FPS (one frame consists of an update of each of the three fields, so three kernel calls per frame). However, when I use eGPUType.OpenCL to run on the CPU instead, I see about 4x that frame rate.

(Yeah, the CPU in this case is cranking at about four times the speed of the GPU).

In case it matters:
Macbook Pro (Retina), Win7 x64, Geforce 650M (Kepler, 384 core), i7 2.7 GHz (quad core, 8 with hyperthreading).

(And I can verify that the faster version is running on the CPU - all 8 bars crank to ~100% in the performance analyzer when it's running.)

Also, the examples I've seen all use grid and block sizes that are powers of two. In this case, the +1 size offsets preclude that.
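(The usual workaround, if I understand the CUDA samples correctly, is to round the grid up and bounds-check inside the kernel so the extra threads do nothing — sketched here as a plain CPU loop:)

```c
/* CPU simulation of a rounded-up launch over a (dim+1) x (dim+1) array:
   every (block, thread) pair computes a global index, and a guard skips
   indices past the edge. Returns how many cells get processed. */
static int covered_cells(int dim, int block)
{
    int grid = (dim + 1 + block - 1) / block;  /* round up */
    int count = 0;

    for (int by = 0; by < grid; by++)
        for (int bx = 0; bx < grid; bx++)
            for (int ty = 0; ty < block; ty++)
                for (int tx = 0; tx < block; tx++) {
                    int x = bx * block + tx;   /* blockIdx*blockDim+threadIdx */
                    int y = by * block + ty;
                    if (x <= dim && y <= dim)  /* the in-kernel guard */
                        count++;
                }
    return count;
}
```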

Suggestions? I'm very surprised at the result.

(But, on the other hand, the OpenCL support appears to be working VERY well!)
May 8, 2013 at 5:37 PM
Edited May 10, 2013 at 6:20 PM
Getting code to run correctly on a GPU is only the first part. The second is improving performance: a lot can be achieved here, and the general considerations of GPU programming apply, i.e., it's not a problem specific to CUDAfy. I recommend reading NVIDIA's CUDA programming guide; it's necessary if you wish to move beyond basic GPU work.
Specifically, you need to know what grid size and block size stand for in hardware terms. In short, a block is a group of threads, and a grid is a batch of blocks. Threads within a block are aware of each other, more so within the same warp (look it up); threads in different blocks are not. To get real performance improvements, you'll need to optimize the grid and block dimensions. There's an Excel spreadsheet just for that (the CUDA Occupancy Calculator), somewhere in the CUDA install folders. Presently you're using only one thread per block, and therefore wasting the vast majority of the GPU's resources.
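To make that concrete, here's the usual launch-size arithmetic, assuming a 16x16 block as a starting point (the spreadsheet will refine the choice):

```c
/* Round-up division: how many blocks of size `block` cover n cells. */
static int div_up(int n, int block)
{
    return (n + block - 1) / block;
}

/* With dim = 128 the arrays are 129 cells wide, so a 16x16 block needs a
   9x9 grid: 9 * 16 = 144 >= 129. That launches 256 threads per block
   (8 full 32-lane warps) instead of 1 thread per block (1 of 32 lanes). */
```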
If the border cells are a problem, make them a special case and write a separate algorithm for them.
Once you've improved your grid/block dimensions, the next step is optimizing memory access, usually by staging data in shared memory. Again, reading the CUDA manual is vital.
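The payoff of staging is easy to estimate: with a 5-point stencil, each cell is read from global memory by up to five threads, whereas a block that stages a (TILE+2)^2 halo tile pays only that many reads for TILE^2 cells — roughly 1.56 reads per cell at TILE = 8 (numbers assume the 8x8 layout of the example):

```c
/* Global-memory reads per updated cell for a 5-point stencil:
   unstaged, each of the TILE*TILE threads reads 5 cells itself;
   staged, the block reads a (TILE+2)^2 halo tile once into shared
   memory and all neighbour reads come from there. */
static double reads_per_cell_unstaged(void)
{
    return 5.0;
}

static double reads_per_cell_staged(int tile)
{
    return (double)((tile + 2) * (tile + 2)) / (double)(tile * tile);
}
```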
For your specific problem, I see a grid of, for example, blocks of 8x8 threads each, where each thread corresponds to a cell/pixel:
1 - the relevant block of (8+2)x(8+2) data is loaded into shared memory;
2 - each thread reads whatever it needs from its neighbour cells in shared memory;
3 - each thread computes its result; and
4 - stores it directly to the correct location in the destination buffer;
5 - write slightly different code for the border cases.
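Steps 1 and 2 boil down to an index mapping like this (CPU sketch, TILE = 8 as in the example; out-of-range halo cells are zeroed here as a stand-in for the border handling of step 5):

```c
#include <string.h>

#define DIM  16   /* small global size for illustration */
#define TILE 8    /* 8x8 threads per block, as above */

/* Load the (TILE+2) x (TILE+2) halo tile for block (bx, by) out of a
   (DIM+1) x (DIM+1) global array. Cells falling outside the array are
   left at zero -- those are the border cases of step 5. */
static void load_tile(float g[DIM + 1][DIM + 1],
                      int bx, int by,
                      float tile[TILE + 2][TILE + 2])
{
    memset(tile, 0, (TILE + 2) * (TILE + 2) * sizeof(float));

    for (int ty = 0; ty < TILE + 2; ty++)
        for (int tx = 0; tx < TILE + 2; tx++) {
            int gx = bx * TILE + tx - 1;   /* -1 shifts for the halo ring */
            int gy = by * TILE + ty - 1;
            if (gx >= 0 && gx <= DIM && gy >= 0 && gy <= DIM)
                tile[ty][tx] = g[gy][gx];
        }
}
```

Thread (tx, ty) of the block then reads its neighbours from tile[ty + 1 ± 1][tx + 1 ± 1] in shared memory instead of issuing separate global loads.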