Max Values for GPU Launch (revisited)

Mar 13, 2015 at 5:27 AM
Hi,

I came across this discussion going back to July 2013 "Max Values for GPU Launch" (https://cudafy.codeplex.com/discussions/449624 ).

The question wasn't answered then, and now I have the same one: when I query the device properties (gpu.GetDeviceProperties ...) I get

Max threads per block: 1024
Max thread dimensions: (1024, 1024, 1)
Max grid dimensions: (2147483647, 65535, 1)

and would therefore expect to be able to do gpu.Launch(2147483647, 1024) (i.e., 2147483647 x 1024 threads runnable in parallel on the GPU device).

But CUDAfy doesn't allow me to launch more than gpu.Launch(65535, 1024).

1) What am I misunderstanding?
2) Using CUDAfy, how can I maximize the number of threads I can run on my GPU device in parallel?

Thanks for the great library,
Conrad
Mar 16, 2015 at 6:55 PM
You need to maximise the warp occupancy, not the number of threads: http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls

hth jd
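On the 65535 cap itself: CUDAfy's default compiler arguments elsewhere in this thread show -arch=sm_13, and on compute capability below 3.0 the x-dimension of a grid is limited to 65535 blocks; only code compiled for compute capability 3.0+ can use the 2147483647 limit the device reports. One way to process arbitrarily many elements without a huge grid is a grid-stride loop; a minimal sketch in the CudafyByExample style (kernel name and launch parameters are illustrative):

```csharp
// Hypothetical kernel: each thread strides across the whole array,
// so the grid never needs to approach the hardware limit.
[Cudafy]
public static void AddOne(GThread thread, int n, int[] data)
{
    int stride = thread.gridDim.x * thread.blockDim.x;
    for (int i = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
         i < n;
         i += stride)
    {
        data[i] += 1;
    }
}

// Launched with a modest grid that stays under the sm_13 limit, e.g.:
// gpu.Launch(4096, 1024).AddOne(n, devData);
```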
Mar 17, 2015 at 11:37 AM
Hi,
I've taken a look at the Occupancy Calculator, but I'm having trouble determining the following two fields for my kernel:
Registers Per Thread    
Shared Memory Per Block (bytes) 
The help section adds:
To determine the number of registers used per thread in your kernel, simply compile the kernel code using the option --ptxas-options=-v to nvcc.  
This will output information about register, local memory, and shared memory usage.
Given that I'm using Cudafy.NET, and (afaik) I don't have the option of using the --ptxas-options=-v compile option, I ran the CudafyByExample enum_gpu.cs code and got:
Registers per mp:  65536
Shared mem per mp: 49152
But when I run deviceQuery, I get:
GeForce GT 740M
CUDA Capability Major/Minor version number:    3.5
( 2) Multiprocessors, (192) CUDA Cores/MP:     384 CUDA Cores
Total number of registers available per block: 65536
Total amount of shared memory per block:       49152 bytes
Which seems to disagree with the output from enum_gpu.cs.

Assuming deviceQuery to be correct and 512 threads per block, I enter:
Registers Per Thread = 128  
Shared Memory Per Block (bytes) = 49152 
And see MP warp occupancy = 16. Is this correct? (and if so why does enum_gpu.cs give different output from deviceQuery)?
Mar 17, 2015 at 5:54 PM
Registers per thread relates to the variables declared in your kernel, and shared memory per block relates to the shared (fast) memory allocated by your kernel; neither relates to properties of the card itself.
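To make that distinction concrete, here is a sketch of a kernel in which both quantities arise (names are illustrative; the attribute and AllocateShared call follow the CudafyByExample pattern):

```csharp
// Hypothetical kernel copying data through shared memory.
[Cudafy]
public static void CopyViaShared(GThread thread, float[] input, float[] output)
{
    // Locals like tid and i live in registers: this is what the
    // calculator's "Registers Per Thread" field refers to.
    int tid = thread.threadIdx.x;
    int i = thread.blockIdx.x * thread.blockDim.x + tid;

    // Explicitly allocated shared memory: 256 floats * 4 bytes = 1024 bytes,
    // which is the "Shared Memory Per Block (bytes)" field.
    float[] cache = thread.AllocateShared<float>("cache", 256);

    cache[tid] = input[i];
    thread.SyncThreads();
    output[i] = cache[tid];
}
```

A kernel that never calls AllocateShared would enter 0 for shared memory per block.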
I have never used it myself, but there is an overload of CudafyTranslator.Cudafy that takes a CompileOptions parameter; you could try adding your extra args to that.

hth jd
Mar 18, 2015 at 12:01 PM
Thanks for the suggestion. I've added the NVCC compile options like this:
            CudafyModule km = CudafyTranslator.Cudafy();
            var options = NvccCompilerOptions.Createx64(eArchitecture.sm_35);
            options.AddOption("--ptxas-options=-v");
            km.CompilerOptionsList.Add(options);
            gpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);
            gpu.LoadModule(km);
On the last line when I debug the code I see that km.CompilerOptionsList has the following:
    "-m64"
    "-arch=sm_35"
    "--ptxas-options=-v"
But km.CompilerOutput only contains:
    "\r\n\r\nnvcc warning : The 'compute_11', 'compute_12', 'compute_13', 'sm_11', 'sm_12', and 'sm_13' architectures are deprecated, and may be removed in a future release.\r\nCUDAFYSOURCETEMP.cu\r\n"
i.e., no sign of the per-kernel register/memory report described here: http://stackoverflow.com/questions/12388207/interpreting-output-of-ptxas-options-v

I've also tried using "-ptxas-options=-v" and some nonsense values, but always get the same for km.CompilerOutput. So I don't think it's even using the compiler options I pass in, as km.CompilerArguments is also always the same:

" -I\"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\include\" -m64 -arch=sm_13 -ptx \"C:\Users\conrad\Documents\Visual Studio 2013\Projects\SimdTest\SimdTest\bin\Debug\CUDAFYSOURCETEMP.cu\" -o \"C:\Users\conrad\Documents\Visual Studio 2013\Projects\SimdTest\SimdTest\bin\Debug\CUDAFYSOURCETEMP.ptx\" "
Mar 18, 2015 at 12:07 PM
I think you need to pass the options to the Cudafy method.

jd
Mar 18, 2015 at 12:42 PM
Actually, no, it turns out I was missing:
            km.Compile(eGPUCompiler.CudaNvcc);
right after:
            km.CompilerOptionsList.Add(options);
See e.g. http://cudafy.codeplex.com/discussions/403375

I now see the additional options I specified in the km.CompilerOutput :
            "\r\n-m64, -arch=sm_35, --ptxas-options=-v,  Platform: x64,\r\n -I\"C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v6.5\\include\" -m64  -arch=sm_35  --ptxas-options=-v  \"C:\\Users\\conrad\\Documents\\Visual Studio 2013\\Projects\\SimdTest\\SimdTest\\bin\\Debug\\CUDAFYSOURCETEMP.cu\"  -o \"C:\\Users\\conrad\\Documents\\Visual Studio 2013\\Projects\\SimdTest\\SimdTest\\bin\\Debug\\CUDAFYSOURCETEMP.ptx\"  -ptx\r\n\r\nCUDAFYSOURCETEMP.cu\r\n"
But still none of the information on registers per thread, etc, I'm looking for.
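Putting the fix together, the full sequence would look something like this (a sketch assembled from the snippets in this thread):

```csharp
CudafyModule km = CudafyTranslator.Cudafy();

var options = NvccCompilerOptions.Createx64(eArchitecture.sm_35);
options.AddOption("--ptxas-options=-v");
km.CompilerOptionsList.Add(options);

// The missing step: without an explicit Compile, LoadModule reuses the
// module already produced by CudafyTranslator.Cudafy(), so the extra
// options are never passed to nvcc.
km.Compile(eGPUCompiler.CudaNvcc);

gpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);
gpu.LoadModule(km);
```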
Mar 18, 2015 at 2:21 PM
How complex is your kernel? The first 32 registers are "for free", and unless you are deliberately allocating shared memory that will be 0. After that, select a block size that is optimal in terms of occupancy and create a grid that covers the task with respect to the block size. My fairly naive scheme for 2D matrices is https://github.com/jaundice/SimpleRBM/blob/master/SimpleRBM.Cuda/ThreadOptimiser.cs

hth jd
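The "create a grid that covers the task" step is a ceiling division; a minimal sketch (helper name is illustrative, and unlike the linked ThreadOptimiser it ignores occupancy):

```csharp
// Size the grid so that gridSize * blockSize >= n.
static int GridSize(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize;  // ceiling division
}

// e.g. n = 1000000 elements with 512-thread blocks:
// GridSize(1000000, 512) == 1954, so
//     gpu.Launch(1954, 512).MyKernel(n, devData);
// Since 1954 * 512 > n, the kernel must guard: if (i < n) { ... }
```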