I've taken a look at the Occupancy Calculator but am having trouble determining the following two fields for my kernel:
Registers Per Thread
Shared Memory Per Block (bytes)
The help section adds:
To determine the number of registers used per thread in your kernel, simply compile the kernel code using the option --ptxas-options=-v to nvcc.
This will output information about register, local memory, shared memory
Given that I'm using Cudafy.net and (afaik) don't have the option of passing --ptxas-options=-v to the compiler, I run the CudafyByExample enum_gpu.cs code and get:
Registers per mp: 65536
Shared mem per mp: 49152
But when I run deviceQuery, I get:
GeForce GT 740M
CUDA Capability Major/Minor version number: 3.5
( 2) Multiprocessors, (192) CUDA Cores/MP: 384 CUDA Cores
Total number of registers available per block: 65536
Total amount of shared memory per block: 49152 bytes
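This seems to disagree with the output from enum_gpu.cs: the values are the same, but deviceQuery reports them as per-block limits rather than per-multiprocessor limits.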
Assuming deviceQuery to be correct and 512 threads per block, I enter:
Registers Per Thread = 128
Shared Memory Per Block (bytes) = 49152
The result is MP warp occupancy = 16. Is this correct? And if so, why does enum_gpu.cs give different output from deviceQuery?
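For reference, here is the arithmetic I think lies behind those inputs, as a rough Python sketch rather than the calculator's actual logic. The helper name estimate_active_warps is my own, and I'm assuming a limit of 64 resident warps per MP for compute capability 3.5, treating 65536 and 49152 as per-MP limits, and ignoring register/shared memory allocation granularity:

```python
WARP_SIZE = 32  # threads per warp


def estimate_active_warps(threads_per_block, regs_per_thread, smem_per_block,
                          regs_per_mp=65536, smem_per_mp=49152,
                          max_warps_per_mp=64):
    """Rough estimate of resident warps per MP (hypothetical helper).

    Assumptions: the register and shared memory figures are per-MP limits,
    max_warps_per_mp=64 applies to this device, and allocation granularity
    is ignored.
    """
    # Warps needed by one block, rounded up.
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    regs_per_block = regs_per_thread * threads_per_block

    # Each resource caps how many blocks fit on one MP at the same time.
    limits = [max_warps_per_mp // warps_per_block]
    if regs_per_block > 0:
        limits.append(regs_per_mp // regs_per_block)
    if smem_per_block > 0:
        limits.append(smem_per_mp // smem_per_block)

    blocks_per_mp = min(limits)
    active_warps = blocks_per_mp * warps_per_block
    return active_warps, active_warps / max_warps_per_mp


# My inputs: 512 threads per block, all registers split evenly across threads.
regs_per_thread = 65536 // 512
active_warps, occupancy = estimate_active_warps(512, regs_per_thread, 49152)
print(regs_per_thread, active_warps, occupancy)
```

With my inputs only one 512-thread block fits per MP, which gives 16 active warps (16/64 = 25% under the assumed warp limit) and would explain the 16 I see in the calculator.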