How to declare a fixed size array in a Cudafy GPU kernel

Mar 20, 2014 at 4:53 AM
I want to do this in a kernel:
int count[8];

I'm almost positive you can declare fixed size arrays within CUDA GPU kernels. So how do I go about doing this while using Cudafy? This does not work:

public static void kernelFunction(int[] input, int[] output)
{
    int count[8];
    // ....other stuff
}
The above code results in a C# error: "Array size cannot be specified in a variable declaration (try initializing with a 'new' expression)."
Mar 20, 2014 at 8:01 AM
Take a look at the CudafyByExample project. You need to use shared memory.
Mar 20, 2014 at 10:20 AM
Currently, I don't think Cudafy has the ability to allocate a separate array for each thread. I would recommend including this feature in the next release.
In many situations this would help readability (and perhaps performance provided the arrays are not too big).
The current alternative is to simply use global memory, ensuring that each thread has exclusive access to it.
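
A minimal sketch of that layout, written as plain C++ on the host rather than as Cudafy or CUDA code. It assumes one flat scratch buffer with 8 ints per thread. The names kCountsPerThread, scratch_offset and simulate_threads are hypothetical, chosen only for illustration. On the GPU, the thread id would come from the usual blockIdx.x * blockDim.x + threadIdx.x calculation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical constant: mirrors the "int count[8]" from the question.
constexpr std::size_t kCountsPerThread = 8;

// First index of the slice owned by the given global thread id.
// Slices are contiguous and disjoint, so no two threads share an element.
constexpr std::size_t scratch_offset(std::size_t global_thread_id) {
    return global_thread_id * kCountsPerThread;
}

// Host-side stand-in for the kernel body: every simulated thread
// writes only inside its own slice of the scratch buffer.
inline void simulate_threads(std::vector<int>& scratch, std::size_t num_threads) {
    assert(scratch.size() >= num_threads * kCountsPerThread);
    for (std::size_t tid = 0; tid < num_threads; ++tid) {
        int* count = scratch.data() + scratch_offset(tid);  // this thread's "count"
        for (std::size_t i = 0; i < kCountsPerThread; ++i) {
            count[i] = static_cast<int>(tid);
        }
    }
}
```

With this layout the scratch buffer must hold at least (number of threads) * 8 ints, and it would be allocated once on the device and passed to the kernel as an extra array argument.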
Mar 20, 2014 at 10:57 AM
Why would you use global memory, which is the slowest? Register space is limited per kernel, which is why shared memory is recommended: it is a little slower but more plentiful. Give each thread its own part of the shared memory.
Mar 20, 2014 at 11:16 AM
Given a shared memory size of 49152 bytes and 1024 threads per thread block, each thread can have at most 12 unique 4-byte integers in shared memory (49152 / 1024 = 48 bytes per thread).
Further, given that we may want to use shared memory to do something else, shared memory may not be available.
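
The budget above can be checked with a short C++ sketch. The helper name elements_per_thread is hypothetical, and the figures (49152 bytes, 1024 threads, 4-byte ints) are the ones quoted in this post, not values queried from a device.

```cpp
#include <cstddef>

// Largest number of elements each thread can own when a block's
// shared memory is split evenly among its threads.
constexpr std::size_t elements_per_thread(std::size_t shared_bytes,
                                          std::size_t threads_per_block,
                                          std::size_t element_bytes) {
    return shared_bytes / threads_per_block / element_bytes;
}

// Figures from the post: 49152 bytes, 1024 threads per block, 4-byte ints.
static_assert(elements_per_thread(49152, 1024, 4) == 12,
              "each thread gets at most 12 ints");

// An 8-int array per thread fits, but uses 8 * 4 * 1024 = 32768 bytes,
// leaving 16384 bytes of shared memory for anything else.
static_assert(49152 - 8 * 4 * 1024 == 16384,
              "remaining shared memory in bytes");
```

With fewer threads per block the per-thread budget grows accordingly, for example 24 ints at 512 threads.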

One alternative is to use local memory e.g. int count[8];
Here the compiler can do one of two things:
(1) give each element in count a separate register e.g. countr1, countr2, ... countr8. This is the fastest option available.
(2) if register pressure is too high, the compiler will spill count to local memory, which is cached in L1. From the Kepler Tuning Guide: "L1 caching in Kepler GPUs is reserved only for local memory accesses, such as register spills and stack data. Global loads are cached in L2 only (or in the Read-Only Data Cache)."

The above local memory feature is not currently available in Cudafy. Hence, as I described, we can keep count in global memory and hope that the L2 cache is working well.