BlockSize 7 gives different results from BlockSize 8

Jul 1, 2016 at 8:02 PM
Edited Jul 2, 2016 at 10:04 AM
Hello CUDAfy.NET Community,

I wanted to add GPU power to my application, which has some Parallel.For loops, and I am getting different results depending on whether the BlockSize is 7 or 8.

I have 10 items to loop. The GridSize is 1.
CASE 1: CudafyModes.Target = eGPUType.OpenCL and the BlockSize is 1 through 7. The results are correct.
CASE 2: CudafyModes.Target = eGPUType.OpenCL and the BlockSize is 8 or greater. The results are incorrect.
CASE 3: CudafyModes.Target = eGPUType.Emulator and the BlockSize is any value. The results are correct.

Initializing the variables...
        double[,] data;
        double[] nmin, nmax, gmin, gmax;

        void initializeVars()
        {
            data = new double[10, 10];
            for (int i = 0; i < 10; i++)
            {
                data[i, 0] = 100 + i;
                data[i, 1] = 32 + i;
                data[i, 2] = 22 + i;
                data[i, 3] = -20 - i;
                data[i, 4] = 5522 + 10 * i;
                data[i, 5] = 40 + i;
                data[i, 6] = 14 - i;
                data[i, 7] = 12 + i;
                data[i, 8] = -10 + i;
                data[i, 9] = 10 + 10 * i;
            }
            nmin = new double[10];
            nmax = new double[10];
            gmin = new double[10];
            gmax = new double[10];
            for (int i = 0; i < 10; i++)
            {
                nmin[i] = -1;
                nmax[i] = 1;
                gmin[i] = i;
                gmax[i] = 11 * i * i + 1;
            }
        }
gpu.Launch Code
        private void button1_Click(object sender, EventArgs e)
        {
            CudafyModes.Target = eGPUType.OpenCL;
            CudafyModes.DeviceId = 0;
            CudafyTranslator.Language = eLanguage.OpenCL;
            CudafyModule km = CudafyTranslator.Cudafy();
            Cudafy.Host.GPGPU gpu = Cudafy.Host.CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);
            gpu.LoadModule(km);
            initializeVars();
            double[,] devdata = gpu.Allocate<double>(data);
            gpu.CopyToDevice(data, devdata);
            double[] devnmin = gpu.Allocate<double>(nmin);
            gpu.CopyToDevice(nmin, devnmin);
            double[] devnmax = gpu.Allocate<double>(nmax);
            gpu.CopyToDevice(nmax, devnmax);
            double[] devgmin = gpu.Allocate<double>(gmin);
            gpu.CopyToDevice(gmin, devgmin);
            double[] devgmax = gpu.Allocate<double>(gmax);
            gpu.CopyToDevice(gmax, devgmax);
            double[] test = new double[10];
            double[] devtest = gpu.Allocate<double>(test);
            // GridSize 1, BlockSize 8: the failing configuration
            gpu.Launch(1, 8, "kernelfx_alldata", 10, devdata, devnmin,
                       devnmax, devgmin, devgmax, devtest);
            gpu.CopyFromDevice(devtest, test);
            gpu.FreeAll();
        }
the Cudafy kernel
        [Cudafy]
        public static void kernelfx_alldata(GThread thread, int N, double[,] data, double[] nmin, double[] nmax, double[] gmin, double[] gmax,  double[] test)
        {
            int tid = thread.threadIdx.x + thread.blockIdx.x * thread.blockDim.x;
            while (tid < N)
            {
                double[] tmp = thread.AllocateShared<double>("tmp", 10);
                tmp[0] = 1; 
                for (int i = 1; i < 10; i++)
                {
                    tmp[i] = data[tid, i - 1];
                }
                for (int i = 1; i < 10; i++)
                {
                    tmp[i] = (nmax[i - 1] - nmin[i - 1]) / (gmax[i - 1] - gmin[i - 1]) * (tmp[i] - gmin[i - 1]) + nmin[i - 1];
                }
                test[tid] = tmp[1];

                tid = tid + thread.blockDim.x * thread.gridDim.x;
            }
        }
The Correct (CASE 1 and CASE 3) Results are:
test[0]=199.0
test[1]=201.0
test[2]=203.0
test[3]=205.0
test[4]=207.0
test[5]=209.0
test[6]=211.0
test[7]=213.0
test[8]=215.0
test[9]=217.0


Incorrect (CASE 2) results are:
test[0]=213.0
test[1]=213.0
test[2]=213.0
test[3]=213.0
test[4]=213.0
test[5]=213.0
test[6]=213.0
test[7]=213.0
test[8]=217.0
test[9]=217.0

When the BlockSize is lower than 8, the results are correct, but when the BlockSize is 8 or greater the results are incorrect. In order to use the GPU efficiently, the BlockSize must be greater than 8.

What is the problem on this code?

I appreciate any help,

Best Regards...
Jul 2, 2016 at 12:27 PM
Hi
You seem to have an incorrect understanding of how shared memory works.
Shared memory is shared across all threads in a block. So if you write tmp[i] = thread.threadIdx.x (for example), you risk having several threads concurrently writing different values to the same memory location, with an undefined result.
Also, you should declare shared memory once at the start of your function, not within a loop.
Finally, from your sample, as far as I can tell, you only ever use tmp[1]. The other array cells are not part of the final computation, so you can simply use a register (a local variable).
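The race can be sketched in plain Python (a thought experiment, not Cudafy code; all names are illustrative). Assuming the block's threads execute the kernel's write loop in lockstep over the single shared tmp, the last writer wins before anyone reads back, and the sketch reproduces the incorrect CASE 2 numbers exactly:

```python
N = 10
block_size = 8
# same data as initializeVars()
data = [[100 + i, 32 + i, 22 + i, -20 - i, 5522 + 10 * i,
         40 + i, 14 - i, 12 + i, -10 + i, 10 + 10 * i] for i in range(N)]
nmin, nmax = [-1.0] * N, [1.0] * N
gmin = [float(i) for i in range(N)]
gmax = [11.0 * i * i + 1 for i in range(N)]

test = [0.0] * N
tids = list(range(block_size))            # active tids, one per thread
while tids:
    tmp = [0.0] * 10                      # ONE shared array for the whole block
    for tid in tids:                      # write phase: every thread writes...
        for i in range(1, 10):
            tmp[i] = data[tid][i - 1]     # ...so only the last writer survives
    for i in range(1, 10):                # normalization (same result for every thread)
        tmp[i] = ((nmax[i - 1] - nmin[i - 1]) / (gmax[i - 1] - gmin[i - 1])
                  * (tmp[i] - gmin[i - 1]) + nmin[i - 1])
    for tid in tids:                      # read phase: all threads read the clobbered tmp[1]
        test[tid] = tmp[1]
    tids = [t + block_size for t in tids if t + block_size < N]  # grid-stride step

print(test)  # [213.0, 213.0, 213.0, 213.0, 213.0, 213.0, 213.0, 213.0, 217.0, 217.0]
```

Note that with block_size set to 7 this same sketch would also give wrong answers, so the lockstep model alone does not explain why CASE 1 works; real scheduling on the device is evidently more subtle.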
Hope this helps.
Jul 2, 2016 at 12:50 PM
Edited Jul 2, 2016 at 12:51 PM
pedritolo1 wrote:
Hi
You seem to have an incorrect understanding of how shared memory works.
Shared memory is shared across all threads in a block. So if you write tmp[i] = thread.threadIdx.x (for example), you risk having several threads concurrently writing different values to the same memory location, with an undefined result.
Also, you should declare shared memory once at the start of your function, not within a loop.
Finally, from your sample, as far as I can tell, you only ever use tmp[1]. The other array cells are not part of the final computation, so you can simply use a register (a local variable).
Hope this helps.
Thank you for the reply.
Firstly, it is true that I only used tmp[1] in this code, but in fact I use all members of the tmp[] array; the code here was shortened to illustrate the problem.

Secondly, what I want is for the tmp[] array to be local to each thread, but declaring "double tmp[10];" is not allowed in CUDAfy.NET. Is there any way to declare a local array?

Thirdly, can you explain why it works as desired when the BlockSize is lower than 8?

Best regards...
Jul 2, 2016 at 11:21 PM
Well, you can declare your shared memory array as a 2D array, where the first dimension is the thread id and the second is the size of the "private" array.
I can't remember whether you can declare 2D shared memory arrays in Cudafy, but if you can't, an alternative is to emulate one with a 1D array of length dim1 * dim2; I'm sure you know what I mean. This way each thread has its own private working space within that big shared memory array. Make sure you try both row-first and column-first layouts, since one will be significantly faster than the other. Also be careful not to exceed the maximum shared memory size allowed per block (use the CUDA occupancy calculator spreadsheet to double-check).
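The flattened 1D scheme can be sketched in plain Python (block_dim, width, and the helper names are illustrative, not Cudafy API):

```python
block_dim, width = 8, 10                  # threads per block, private-array length
shared = [0.0] * (block_dim * width)      # stands in for one big shared 1D array

def row_first(tidx, i):
    # thread tidx owns the contiguous slice [tidx*width, (tidx+1)*width)
    return tidx * width + i

def col_first(tidx, i):
    # element i of each thread is interleaved across the block
    return i * block_dim + tidx

# every thread writes only its own slots, so no two threads collide
for tidx in range(block_dim):
    for i in range(width):
        shared[row_first(tidx, i)] = 100 * tidx + i

assert shared[row_first(3, 5)] == 305     # thread 3, private element 5
```

Both mappings are collision-free; which one is faster depends on the device's memory-bank layout, hence the suggestion to try both.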

Be careful with the type double, since some devices don't support it and will silently replace it with float or just stop working altogether.

I have no idea why it worked with fewer than 8 threads, sorry.
Jul 3, 2016 at 11:44 AM
pedritolo1 wrote:
Well, you can declare your shared memory array as a 2D array, where the first dimension is the thread id and the second is the size of the "private" array.
I can't remember whether you can declare 2D shared memory arrays in Cudafy, but if you can't, an alternative is to emulate one with a 1D array of length dim1 * dim2; I'm sure you know what I mean. This way each thread has its own private working space within that big shared memory array. Make sure you try both row-first and column-first layouts, since one will be significantly faster than the other. Also be careful not to exceed the maximum shared memory size allowed per block (use the CUDA occupancy calculator spreadsheet to double-check).

Be careful with the type double, since some devices don't support it and will silently replace it with float or just stop working altogether.

I have no idea why it worked with fewer than 8 threads, sorry.
Thank you very much, pedritolo1.
That is exactly what I needed; I declared a 2D array as you suggested.
By the way, Cudafy does allow declaring a 2D shared array, as shown below.
        [Cudafy]
        public static void kernelfx_alldata(GThread thread, int N, double[,] data, double[] nmin,
                                        double[] nmax, double[] gmin, double[] gmax, double[] test)
        {
            int tid = thread.threadIdx.x + thread.blockIdx.x * thread.blockDim.x;
            double[,] tmp = thread.AllocateShared<double>("tmp", 10, 10); // one row per tid, so threads no longer collide
            while (tid < N)
            {
                tmp[tid, 0] = 1;
                for (int i = 1; i < 10; i++)
                {
                    tmp[tid, i] = data[tid, i - 1];
                }
                for (int i = 1; i < 10; i++)
                {
                    tmp[tid, i] = (nmax[i - 1] - nmin[i - 1]) / (gmax[i - 1] - gmin[i - 1]) * (tmp[tid, i] - gmin[i - 1]) + nmin[i - 1];
                }
                test[tid] = tmp[tid, 1];

                tid = tid + thread.blockDim.x * thread.gridDim.x;
            }
        }
Thank you again,
Best regards...