Is the arrangement of threads in a block important?

Mar 21, 2014 at 10:06 PM
Edited Mar 21, 2014 at 10:06 PM
I am just curious whether the arrangement of the threads in a block has an impact on performance. Let's say I have a task that requires 256 threads. I could pass just 256 as the argument to .Launch() or, likewise, 16x16x1 or even 8x8x4. Does it make a difference to the GPU, or is it just a matter of what's most convenient for me?
Likewise, what about the dimensions of the blocks? Also, when specifying the number of blocks, is it recommended to keep the number low and increase the number of calls to .Launch(), or to put everything in at once? I mean, does that make a difference in terms of efficiency? Is it preferable to keep the number of launches low, or to keep the number of blocks per launch low?

Thanks :)
Coordinator
Mar 23, 2014 at 7:42 AM
The proof is in the pudding. Part of the challenge (or fun) of GPU programming is tweaking these kinds of variables and observing the performance.
Mar 24, 2014 at 12:03 AM
Thanks, so I suppose I am going to try out different configurations (1D/2D/3D). Is there a guideline for when to use which (and how many) dimensions when specifying the number of threads in a block? Is there an established recommendation for specific tasks? For example, I could imagine that a 2D arrangement of threads would be quite efficient or handy when editing images.
The work I am doing is nothing like that. I just require "many threads" that don't share anything and do not depend on each other in any way, and I found it easiest to just pass in a 1D arrangement. Also, when using 1D arrangements, does it matter which dimension I use (x/y/z)? (Although this would be an easy test ;) )
Coordinator
Mar 24, 2014 at 8:19 AM
Typically, stick with 1D unless you are dealing with 2D data. You can use x, y or z, but of course it is best to just use x to avoid confusion!
Set the number of threads per block to something like 256 (it can be less if you need more memory per thread, or more if that's more convenient, but 256 is portable and efficient).
Then adjust the number of blocks to suit. Beyond a certain data size, #blocks * #threads may be less than the number of items, in which case make each thread in the kernel step through the data in strides of #blocks * #threads.

.Launch(2048, 256, ...)
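
As a rough illustration of "skipping through the data", here is a minimal CPU-side sketch in C++ of the index arithmetic such a kernel would typically use (often called a grid-stride loop). It is not CUDAfy code: the names `strided_thread` and `emulate_launch` are hypothetical, and the nested loops only stand in for what the GPU would run in parallel. In a real kernel, the block and thread indices would come from the built-in thread and block index variables rather than from function arguments.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the body of one GPU thread.
// The thread starts at its global index and then advances by the total
// number of launched threads until it runs past the end of the data.
void strided_thread(std::size_t block_idx, std::size_t thread_idx,
                    std::size_t block_dim, std::size_t grid_dim,
                    std::vector<int>& visits) {
    const std::size_t n = visits.size();
    const std::size_t stride = block_dim * grid_dim;  // #blocks * #threads
    for (std::size_t i = block_idx * block_dim + thread_idx; i < n; i += stride) {
        ++visits[i];  // "process" item i (here: just count the visit)
    }
}

// Hypothetical emulation of a 1D launch with grid_dim blocks of block_dim
// threads over n items. Returns how many times each item was processed.
std::vector<int> emulate_launch(std::size_t grid_dim, std::size_t block_dim,
                                std::size_t n) {
    std::vector<int> visits(n, 0);
    for (std::size_t b = 0; b < grid_dim; ++b) {
        for (std::size_t t = 0; t < block_dim; ++t) {
            strided_thread(b, t, block_dim, grid_dim, visits);
        }
    }
    return visits;
}
```

Calling `emulate_launch(2048, 256, 1000000)` mirrors the launch above: 2048 * 256 = 524,288 threads cover 1,000,000 items, so some threads handle two items and the rest handle one, and every item is handled exactly once. The same loop also works when there are more threads than items, because the surplus threads simply fail the `i < n` check.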
Apr 2, 2014 at 7:58 PM
Okay, so I suppose I am going to stick with 1D. In the CUDA by Example book I also read that the 2D/3D options are there for convenience when you're dealing with 2D or 3D data (which I am not).
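
To make the "convenience" point a little more concrete, the sketch below shows, in plain C++, how CUDA flattens a 2D or 3D thread index into a single linear thread ID, which is then grouped into warps. The function names are hypothetical, and the warp size of 32 is an assumption that holds for NVIDIA GPUs to date but should be queried from the device in real code.

```cpp
#include <cstddef>

// Assumption: warps contain 32 threads.
constexpr std::size_t kWarpSize = 32;

// Linear ID of thread (x, y, z) in a block of size (dim_x, dim_y, dim_z):
// x varies fastest, then y, then z. dim_z is not needed for the formula.
constexpr std::size_t linear_thread_id(std::size_t x, std::size_t y, std::size_t z,
                                       std::size_t dim_x, std::size_t dim_y) {
    return x + y * dim_x + z * dim_x * dim_y;
}

// Index of the warp that a given linear thread ID falls into.
constexpr std::size_t warp_index(std::size_t linear_id) {
    return linear_id / kWarpSize;
}

// Number of warps needed for a block with the given total thread count.
constexpr std::size_t warps_per_block(std::size_t threads_in_block) {
    return (threads_in_block + kWarpSize - 1) / kWarpSize;
}
```

Under these assumptions, blocks of 256, 16x16x1 and 8x8x4 threads all contain 256 threads and therefore 8 warps each. What changes is only which (x, y, z) coordinates share a warp: in a 16x16 block, for instance, the rows y = 0 and y = 1 together form warp 0. For fully independent threads this mainly affects how each thread computes which item it owns, and through that the memory access pattern.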