This project is read-only.

Issues with Parallel Prefix Sum (scan) for large arrays

Dec 29, 2013 at 4:13 AM
Edited Sep 20, 2014 at 7:23 AM
Hello,

I am attempting to write a scan algorithm in parallel. As a guide, I am using the code provided in Chapter 9 of the book "Programming Massively Parallel Processors 2nd Edition". The CUDA C code I have attached below.

The authors mention that this kernel is incomplete and that it needs "to have each thread to load two X elements at the beginning and store two Y elements at the end" for completion.

I have attempted to write this code in CUDAfy (attached below the C code) but I have had no success. I am not 100% sure how I should go about adding the two elements at the beginning and end as the authors suggest. Currently, when running my code, it crashes and my screen flashes. I think it is an issue with how I am launching my kernel.

My code also differs in that I am using a 2D array of doubles instead of a 1D array of floats. This is because I plan to use this code in another application that involves using a large 2D array of doubles.

I am new to CUDA and parallel computing so any advice as to how I can complete this code and my implementation will be greatly appreciated.

EDIT: Also, if anyone knows of a library in CUDA or CUDAfy that already does this, that information would alos be greatly appreciated.


Here is the CUDA C code from "Programming Massively Parallel Processors 2nd Edition" They mention that they launch this kernel with 1024 blocks and that the variable SECTION_SIZE has a value of 2048. (I am assuming that BLOCK_SIZE is 1024).
__global__ void work_efficient_scan_kernel (float *X, float *Y, int InputSize) {

__shared__ float XY[SECTION_SIZE];

int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i < InputSize) {
   XY[threadIdx.x] = X[i];
}

for (unsigned int stride = 1; stride < blockDim.x; stride *=2) {
   __syncthreads();
   int index = (threadIdx.x + 1) * 2 * stride - 1;
   if (index < blockDim.x) {
        XY[index] += XY[index - stride];
    }
}

for (int stride = SECTION_SIZE/4; stride > 0; stride /= 2){
   __syncthreads;
   int index = (threadIdx.x + 1) * 2 * stride - 1;
   if(index + stride < BLOCK_SIZE) {
      XY[index + stride] += XY[index];
   }
}

__syncthreads();

Y[i]  = XY[threadIdx.x];

}
Here is my attempt at writing this code using .NET CUDAfy
namespace Parallel_Prefix_sum_CUDAfy
{
    class Program
    {
       //some constants for the size of my arrays, shared memory, etc.
        private const int BLOCK_SIZE = 2048;
        private const int INPUT = 1;
        private const int OUTPUT = 0;
        private const int BIGR = 4000;
        private const int BIGC = 2;

       //the kernel, here I need to add to this so that it stores two input elements at the beginning
       //and two output elements at the end
        [Cudafy]
        public static void inclusiveBigRunningSum(GThread thread, double[,] bigNums)
        {
            //initializing the shared memory
            double[] temp = thread.AllocateShared<double>("inclusiveTempMem", BLOCK_SIZE);
 
           //getting the index
            int i = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;

            int InputSize = bigNums.GetLength(0);

            if (i < InputSize)
            {
                temp[i] = bigNums[i, INPUT];
            }

            for (int stride = 1; stride < thread.blockDim.x; stride *= 2)
            {
                thread.SyncThreads();
                int index = (thread.threadIdx.x + 1) * 2 * stride - 1;
                if (index < thread.blockDim.x)
                {
                    temp[index] += temp[index - stride];
                }
            }

            for (int stride = BLOCK_SIZE / 4; stride > 0; stride /= 2)
            {
                thread.SyncThreads();
                int index = (thread.threadIdx.x + 1) * 2 * stride - 1;
                if (index + stride < BLOCK_SIZE / 2)
                {
                    temp[index + stride] += temp[index];
                }
            }

            thread.SyncThreads();

            bigNums[i, OUTPUT] = temp[thread.threadIdx.x];
        }

        static void Main(string[] args)
        {
            try
            {
                //loading the CUDA module
                CudafyModule km = CudafyTranslator.Cudafy(ePlatform.x64, eArchitecture.sm_13);
                _gpu = CudafyHost.GetDevice(eGPUType.Cuda);
                _gpu.LoadModule(km);

               //declaring my array of doubles
                double[,] bigNums = new double[BIGR, BIGC];
                
               //initializing my array
                for (int i = 0; i < BIGR; i++)
                {
                    bigNums[i, OUTPUT] = 0;
                    bigNums[i, INPUT] = i;
                }
                double [,] dev_bigNums = _gpu.CopyToDevice(bigNums);
                   
                  //launching the kernel. Here I think I am doing this incorrectly
                _gpu.Launch(((BLOCK_SIZE / 2) + BIGR) / (BLOCK_SIZE / 2), BLOCK_SIZE / 2).inclusiveBigRunningSum(dev_bigNums);

                _gpu.CopyFromDevice(dev_bigNums, bigNums);

                _gpu.FreeAll();

            }
            catch (CudafyLanguageException cle)
            {
                HandleException(cle);
            }
            catch (CudafyCompileException cce)
            {
                HandleException(cce);
            }
            catch (CudafyHostException che)
            {
                HandleException(che);
            }            
        }

        private static GPGPU _gpu;

        private static void HandleException(Exception ex)
        {
            Console.WriteLine(ex.Message);
        }

    
    }

}
Aug 25, 2014 at 12:32 PM
Have you ever managed to solve your issues? If so, could you please provide us your CUDAfy c# code?
Sep 11, 2014 at 3:41 PM
Edited Sep 20, 2014 at 7:12 AM
Hello lightxx,

I was able to solve the issue. I will post the source code here shortly. Just wanted to let you know for the time being that I did receive your message and that I was able solve the issue. Thanks for asking.

Regards,

KE0M
Sep 20, 2014 at 7:17 AM
Edited Sep 20, 2014 at 7:25 AM
Here is the code:
namespace Parallel_Prefix_sum_CUDAfy
{
    class Program
    {
        private const int BLOCK_SIZE = 256;
        private const int R = 8;
        private const int C = 2;
        private const int INPUT = 1;
        private const int OUTPUT = 0;
        private const int BIGR = 10000;
        private const int BIGC = 2;



        static void Main(string[] args)
        {
            try
            {

                CudafyModule km = CudafyTranslator.Cudafy(ePlatform.x64, eArchitecture.sm_20);
                _gpu = CudafyHost.GetDevice(eGPUType.Cuda);
                _gpu.LoadModule(km);

                double[,] bigNums = new double[BIGR, BIGC];

                for (int i = 0; i < BIGR; i++) //creating the big arrays
                {
                    bigNums[i, OUTPUT] = 0;
                    // Note : I change to = 1 for testing, you may replace this by i
                    bigNums[i, INPUT] = 1;
                }

               double [,] dev_bigNums = _gpu.CopyToDevice(bigNums);
               int grid_size = (BIGR - 1) / BLOCK_SIZE + 1;
               double[,] dev_aux = _gpu.Allocate<double>(grid_size, BIGC);


                // Note : Calling kernel to Inclusive Prefix Scan (1)
               _gpu.Launch(grid_size, BLOCK_SIZE).inclusiveBigRunningSum(dev_bigNums, dev_aux);


                // Note : this section for level 2 Prefix Scan - Don't change these code in both (1) and (2) cases.
                int grid_size_l2 = (grid_size - 1) / BLOCK_SIZE + 1;
                double[,] dev_aux2 = _gpu.Allocate<double>(grid_size_l2, BIGC);
                _gpu.Launch(grid_size_l2, BLOCK_SIZE).inclusiveBigRunningSum(dev_aux, dev_aux2);
                _gpu.Launch(grid_size, BLOCK_SIZE).finalUpdate(dev_bigNums, dev_aux);
                /////////////////

               _gpu.CopyFromDevice(dev_bigNums, bigNums);

               for (int i = 0; i < BIGR; i++)
                    Console.WriteLine(bigNums[i, INPUT]+ "\t" + bigNums[i, OUTPUT]);
                
                Console.ReadLine();

            }
            catch (CudafyLanguageException cle)
            {
                HandleException(cle);
            }
            catch (CudafyCompileException cce)
            {
                HandleException(cce);
            }
            catch (CudafyHostException che)
            {
                HandleException(che);
            }            
        }

        [Cudafy]
        public static void finalUpdate(GThread thread, double[,] bigNums, double[,] aux)
        {
            
            int i = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;

            double addValue = (thread.blockIdx.x > 0) ? aux[thread.blockIdx.x - 1, OUTPUT] : 0;
            int InputSize = bigNums.GetLength(0);
            if (i < InputSize)
            {
                bigNums[i,OUTPUT] = bigNums[i, OUTPUT]+ addValue;
            }
            
            
        }

        [Cudafy]
        public static void inclusiveBigRunningSum(GThread thread, double[,] bigNums, double [,] aux)
        {
            double[] temp = thread.AllocateShared<double>("inclusiveTempMem", BLOCK_SIZE);

            int i = thread.blockIdx.x * thread.blockDim.x  + thread.threadIdx.x;

            int InputSize = bigNums.GetLength(0);
            

            if (i < InputSize)
            {
                temp[thread.threadIdx.x] = bigNums[i, INPUT];
            }

            for (int stride = 1; stride < thread.blockDim.x; stride *= 2)
            {
                thread.SyncThreads();
                int index = (thread.threadIdx.x + 1) * 2 * stride - 1;
                if (index < thread.blockDim.x)
                {
                    temp[index] += temp[index - stride];
                }
            }

         

            for (int stride = BLOCK_SIZE / 4; stride > 0; stride /= 2)
            {
                thread.SyncThreads();
                int index = (thread.threadIdx.x + 1) * 2 * stride - 1;
                if (index + stride < BLOCK_SIZE)
                {
                    temp[index + stride] += temp[index];
                }
            }

            thread.SyncThreads();
            aux[thread.blockIdx.x, INPUT] = temp[BLOCK_SIZE - 1];
            bigNums[i, OUTPUT] = temp[thread.threadIdx.x];
        }
      private static GPGPU _gpu;

        private static void HandleException(Exception ex)
        {
            Console.WriteLine(ex.Message);
        }

    
    }

}
If you see a symbol like '+' it is an addition symbol. Not sure why I see it that way in my browser.

I hope that helps and let me know if you have any questions.

Regards,

KE0M