Sorting an array on GPU using cudafy

May 17, 2013 at 11:31 PM
Hi,

Does anyone know how to do array sorting using CUDAfy? I have been looking for that for a while.
I tried to run quicksort on the GPU, then found out that recursion is not supported on my Quadro 4000.
I have also tried shell sort and bubble sort on the GPU, but they are slower than quicksort, and they did not give the right answer compared to Array.Sort() in C#.

Any suggestions? Is there a sort function that can be used with CUDAfy to run on the GPU?

Thanks
Coordinator
May 18, 2013 at 7:15 AM
Can you post some code? There are also many CUDA examples that can be ported to CUDAfy.
May 18, 2013 at 1:08 PM
Edited May 18, 2013 at 1:09 PM
Radix sort is said to work well in parallel. I think Thrust has an implementation of it. If you must have O(n log n), then you should use merge sort (also implemented in Thrust, if I'm not mistaken). If you want it in CUDA(fy), you'll have to reverse-engineer the Thrust source files.
May 19, 2013 at 5:21 PM
Edited May 19, 2013 at 5:50 PM
Thanks all for your replies.

I have the following non-recursive quicksort function, since recursion is not supported on my Quadro 4000.

{

...................... some code
....................
..................
        float[,] dev_a = gpu.Allocate<float>(F);
        float[] dev_aa = gpu.Allocate<float>(w);
        float[] dev_dataUnsorted = gpu.Allocate<float>(w);
        float[] dev_w = gpu.Allocate<float>(w);
        int[] dev_b = gpu.Allocate<int>(Y);
        int[] dev_h = gpu.Allocate<int>(Y);
        int[] dev_y = gpu.Allocate<int>(Y);
        float[] dev_c = gpu.Allocate<float>(position.Count);
        int[] dev_indx = gpu.Allocate<int>(position.Count);
        float[] dev_al = gpu.Allocate<float>(position.Count);
        float[] dev_be = gpu.Allocate<float>(position.Count);
        float[] dev_t = gpu.Allocate<float>(position.Count);
        int[] beg_and_end = gpu.Allocate<int>(F.Length / 4);


        gpu.CopyToDevice(F, dev_a);

        gpu.CopyToDevice(Y, dev_y);
        gpu.CopyToDevice(w, dev_w);

        gpu.Launch(position.Count, 1).function_Cudafy(dev_b, dev_a, dev_aa, dev_dataUnsorted, dev_w, dev_h, dev_c, dev_y, dev_indx, beg_and_end, dev_al, dev_be, dev_t);
..........................................
..........................................
.........................................some code


}

[Cudafy]
    public static void function_Cudafy(GThread thread, int[] indices, float[,] data1, float[] data, float[] data_unsorted,
                                                     float[] w, int[] h, float[] c, int[] y, int[] indx,int[] beg_and_end,float[] a,float[] b,float[] t)
    {
            arr_all_quickSort arr = new arr_all_quickSort();
            arr.arr = data;
            arr.indx = indices;
            arr.beg = beg_and_end;
            arr.end = beg_and_end;


            arr = quickSort_non_recursive_Cuda(arr, data.Length);

            data = arr.arr;
            indices = arr.indx;
   }
[Cudafy]
    public static arr_all_quickSort quickSort_non_recursive_Cuda(arr_all_quickSort arr, int elements)
    {


        float piv;
        int i = 0, L, R, piv_indx;

        arr.beg[0] = 0; arr.end[0] = elements;
        while (i >= 0)
        {
            L = arr.beg[i]; R = arr.end[i] - 1;
            if (L < R)
            {
                piv = arr.arr[L];
                piv_indx = arr.indx[L];
                //if (i == MAX_LEVELS - 1) ;//return arr;
                while (L < R)
                {
                    while (arr.arr[R] >= piv && L < R)
                        R--;
                    if (L < R)
                    {
                        arr.indx[L] = arr.indx[R];
                        arr.arr[L++] = arr.arr[R];
                    }
                    while (arr.arr[L] <= piv && L < R)
                        L++;
                    if (L < R)
                    {
                        arr.indx[R] = arr.indx[L];
                        arr.arr[R--] = arr.arr[L];
                    }
                }
                arr.arr[L] = piv; arr.indx[L] = piv_indx; arr.beg[i + 1] = L + 1; arr.end[i + 1] = arr.end[i]; arr.end[i++] = L;
            }
            else
            {
                i--;
            }
        }
        return arr;
    }

and the following struct is the data type to be returned from the above sorting function:


[Cudafy]
    public struct arr_all_quickSort
    {
        public float[] arr;
        public int[] indx;
        public int[] beg;
        public int[] end;
    }


What I am trying to do is something similar to the built-in C# function Array.Sort().

So, for example, if I want to sort indices based on specific data, I write:
Array.Sort(data, indices);

Then the indices array will be sorted based on the data array. But this works on the CPU.
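In other words, the CPU side behaves like this (sample values just for illustration):

```csharp
float[] data = { 3.0f, 1.0f, 2.0f };
int[] indices = { 0, 1, 2 };

// Array.Sort(keys, items) sorts data ascending and applies
// the same permutation to indices.
Array.Sort(data, indices);
// data    -> { 1.0f, 2.0f, 3.0f }
// indices -> { 1, 2, 0 }
```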

Now I need to do the same thing on the GPU using CUDAfy, but I did not find an example of how to sort an array with CUDAfy.

So I have tried the above code, but the output I got is different from the output of Array.Sort(data, indices), and I don't know why.

Then I found the following article:
http://adnanboz.wordpress.com/2011/07/27/faster-sorting-in-c-by-utilizing-gpu-with-nvidia-cuda/

which is based on creating a DLL of the sorting function from the Thrust library, then referencing that DLL from another C# project. But I think this approach will not work for me, because the output of the sorting function comes back to system memory, and I would then need to copy it to GPU memory again. In that case it would not be worthwhile for me.

What I am looking for exactly is a GPU function calling the GPU sort function, so that everything happens on the GPU, not the CPU.


By the way, the reference for the non-recursive quicksort function mentioned above is the following link:

http://alienryderflex.com/quicksort/


Thank you.
Coordinator
May 21, 2013 at 9:02 AM
The Quadro 4000 is compute capability 2.0, so I would expect it to support recursive device functions.
May 22, 2013 at 1:08 PM
I have tried the following code, which is a recursive quicksort, but it gave me the following error:

[Cudafy]
    public struct arr_all
    {
        public float[] arr;
        public int[] indx;
    }
[Cudafy]
    public static arr_all quickSort(arr_all arr, int array_size)
    {
        return q_sort(arr, 0, array_size - 1);
    }


    [Cudafy]
    public static arr_all q_sort(arr_all numbers, int left, int right)
    {
        float pivot;
        int indx_p, l_hold, r_hold;

        l_hold = left;
        r_hold = right;
        pivot = numbers.arr[left];
        indx_p = numbers.indx[left];
        while (left < right)
        {
            while ((numbers.arr[right] >= pivot) && (left < right))
                right--;
            if (left != right)
            {
                numbers.arr[left] = numbers.arr[right];
                numbers.indx[left] = numbers.indx[right];

                left++;
            }
            while ((numbers.arr[left] <= pivot) && (left < right))
                left++;
            if (left != right)
            {
                numbers.arr[right] = numbers.arr[left];
                numbers.indx[right] = numbers.indx[left];
                right--;
            }
        }
        numbers.arr[left] = pivot;
        numbers.indx[left] = indx_p;
        indx_p = left;
        pivot = left;
        left = l_hold;
        right = r_hold;
        if (left < pivot)
            q_sort(numbers, left, (int)pivot - 1);
        if (right > pivot)
            q_sort(numbers, (int)pivot + 1, right);


        return numbers;
    }
The error:

{"Compilation error: CUDAFYSOURCETEMP.cu
tmpxft_000017f0_00000000-5_CUDAFYSOURCETEMP.cudafe1.gpu
tmpxft_000017f0_00000000-10_CUDAFYSOURCETEMP.cudafe2.gpu
C:/.................... /bin/Debug/CUDAFYSOURCETEMP.cu(256): Error: Recursive function call is not supported yet: q_sort(Form1arr_all, int, int)
C:/................./bin/Debug/CUDAFYSOURCETEMP.cu(252): Error: Recursive function call is not supported yet: q_sort(Form1arr_all, int, int)"}
Coordinator
May 22, 2013 at 1:21 PM
What architecture are you targeting? The CUDAfy default is 1.3 which would fail. You need to explicitly give the architecture (2.0 in your case) when calling the Cudafy(...) method.
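For example (a sketch; sm_20 corresponds to compute capability 2.0):

```csharp
// Translate targeting compute capability 2.0 (Fermi) instead of
// the 1.3 default, so recursive device functions compile.
CudafyModule km = CudafyTranslator.Cudafy(eArchitecture.sm_20);
GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target);
gpu.LoadModule(km);
```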
May 22, 2013 at 3:06 PM
Also, you're launching multiple blocks of 1 thread each (gpu.Launch(position.Count, 1)), but nowhere in your CUDAfy code do I see you referencing thread.<something>. You're running single-threaded code in a multi-threaded environment while sharing the same memory buffers, which will be randomly overwritten, resulting in invalid results.
May 22, 2013 at 3:41 PM
I ran CudafyByExample just to see the compute capability, since the following line will show me that:

Console.WriteLine("Compute capability: {0}.{1}", prop.Capability.Major, prop.Capability.Minor);

But I am wondering: it showed the name of the device, Quadro 4000, and the compute capability as 1.1.

So I am confused, because based on the following NVIDIA web site and on what you are saying, the compute capability should be 2.0:
https://developer.nvidia.com/cuda-gpus

The way I understand it, architecture 2.0 means the compute capability is 2.0.
Please correct me if I understand it wrong.

Thanks
May 22, 2013 at 3:50 PM
pedritolo1 wrote:
Also, you're launching multiple blocks of 1 thread each (gpu.Launch(position.Count, 1)), but nowhere in your CUDAfy code do I see you referencing thread.<something>. You're running single-threaded code in a multi-threaded environment while sharing the same memory buffers, which will be randomly overwritten, resulting in invalid results.
I have a two-dimensional array, so the way I was thinking to run this is to make each block in charge of one row of the 2D array.
That is why, when I launch gpu.Launch(position.Count, 1), I create a number of blocks matching the number of rows in the 2D array.
And since each row has around 13000 elements, I created one thread to deal with them.

This is the way I understand it; please correct me if I am wrong.

Thanks
May 22, 2013 at 4:05 PM
If that's what you're trying to do, then within the cuda code you need to fetch the proper buffer to work with based on the current block's index.
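A rough sketch of what I mean (the names and the inner sort are placeholders):

```csharp
[Cudafy]
public static void SortRows(GThread thread, float[,] data, int[,] indices,
                            int rows, int rowLength)
{
    int row = thread.blockIdx.x;  // one block per row
    if (row >= rows)
        return;

    // Touch only this block's own row, so blocks can't
    // trample each other's working buffers.
    for (int i = 0; i < rowLength; i++)
        indices[row, i] = i;

    // ... sort data[row, *] and indices[row, *] in place here ...
}
```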
May 22, 2013 at 4:29 PM
The following is how I am calling the CUDA function. But I am still trying to find out why it does not give me the right result when I compare it with the CPU results.
Anyway, I am investigating more and more to see if there is something wrong.

gpu.Launch(position.Count, 1).function_Cudafy(dev_b, dev_a, dev_aa, dev_dataUnsorted, dev_w, dev_h, dev_c, dev_y, dev_indx , dev_al, dev_be, dev_th,beg,end);


[Cudafy]
public static void function_Cudafy(GThread thread, int[] indices, float[,] data1, float[] data, float[] data_unsorted, float[] w, int[] h, float[] c, int[] y, int[] indx, float[] al, float[] be, float[] th, int[] beg, int[] end)
{

        int tid = thread.blockIdx.x;
        while (tid < position.Count)
        {
                 for (int i = 0; i < data.Length; i++)
            {
                data[i] = data1[tid, i];
                data_unsorted[i] = data1[tid, i];
                indices[i] = i;
            }

            arr_all arr = new arr_all();
            arr.arr = data;
            arr.indx = indices;


            arr = quickSort_non_recursive_Cuda2(arr, data.Length, beg, end);

            data = arr.arr;
            indices = arr.indx;    


           //.....some code

            c[tid] = value1;
            indx[tid] = value2;
            al[tid] = value3;
            be[tid] = value4;
            th[tid] = value5;


            tid++;

        }
}


[Cudafy]
    public static arr_all quickSort_non_recursive_Cuda2(arr_all arr, int elements, int[] beg, int[] end)
    {


        //int MAX_LEVELS = 500000;
        //int[] beg = new int[MAX_LEVELS];
        //int[] end = new int[MAX_LEVELS];
        int i = 0, L, R, swap, piv_indx;
        float piv;
        beg[0] = 0; end[0] = elements;
        while (i >= 0)
        {
            L = beg[i];
            R = end[i] - 1;
            if (L < R)
            {
                piv = arr.arr[L];
                piv_indx = arr.indx[L];
                while (L < R)
                {
                    while (arr.arr[R] >= piv && L < R)
                        R--;
                    if (L < R)
                    {
                        arr.indx[L] = arr.indx[R];
                        arr.arr[L++] = arr.arr[R];
                    }
                    while (arr.arr[L] <= piv && L < R)
                        L++;
                    if (L < R)
                    {
                        arr.indx[R] = arr.indx[L];
                        arr.arr[R--] = arr.arr[L];
                    }
                }
                arr.arr[L] = piv; arr.indx[L] = piv_indx; beg[i + 1] = L + 1; end[i + 1] = end[i]; end[i++] = L;
                if (end[i] - beg[i] > end[i - 1] - beg[i - 1])
                {
                    swap = beg[i]; beg[i] = beg[i - 1]; beg[i - 1] = swap;
                    swap = end[i]; end[i] = end[i - 1]; end[i - 1] = swap;
                }
            }
            else
            {
                i--;
            }
        }
        return arr;
    }

[Cudafy]
    public struct arr_all
    {
        public float[] arr;
        public int[] indx;
    }
Coordinator
May 22, 2013 at 5:09 PM
monther wrote:
I ran CudafyByExample just to see the compute capability, since the following line will show me that:

Console.WriteLine("Compute capability: {0}.{1}", prop.Capability.Major, prop.Capability.Minor);

But I am wondering: it showed the name of the device, Quadro 4000, and the compute capability as 1.1.

So I am confused, because based on the following NVIDIA web site and on what you are saying, the compute capability should be 2.0:
https://developer.nvidia.com/cuda-gpus

The way I understand it, architecture 2.0 means the compute capability is 2.0.
Please correct me if I understand it wrong.

Thanks
Looks like you are running in OpenCL mode! On a CUDA device that will return 1.1, since that is what NVIDIA supports for OpenCL. In V1.22 this will not be an issue if you specify architecture 2.0. If using an earlier version .... just upgrade.
May 22, 2013 at 6:07 PM
You are right. I just went through the code in CudafyByExample under Program.cs.

I found the following lines:
        CudafyModes.Target = eGPUType.OpenCL;
        CudafyModes.DeviceId = 0;
        CudafyTranslator.Language = eLanguage.OpenCL;
I downloaded V1.22 a couple of days ago, and I did not change anything in the code the first time I ran it, because I did not expect it to run in OpenCL mode.
But after I selected CUDA from eGPUType, it shows the right compute capability, which is 2.0.

Thanks a lot.

One more thing: since I am trying to debug my code to find out what is wrong, is there any way I can see the data inside an array allocated on the GPU?

Best regards.
May 23, 2013 at 6:02 PM
        public void run()
        {
        float[,] F = Fill_F_array();     

        CudafyModule km = CudafyTranslator.Cudafy(eArchitecture.sm_20); 
        GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target);
        gpu.LoadModule(km);

        int[] Y = new int[Total];
        float[] c = new float[position.Count];
        int[] indx = new int[position.Count];
        float[] al = new float[position.Count];
        float[] be = new float[position.Count];
        float[] th = new float[position.Count];

        float[,] dev_a = gpu.Allocate<float>(F);
        float[] dev_aa = gpu.Allocate<float>(w);
        float[] dev_dataUnsorted = gpu.Allocate<float>(w);
        float[] dev_w = gpu.Allocate<float>(w);
        int[] dev_b = gpu.Allocate<int>(Y);
        int[] dev_h = gpu.Allocate<int>(Y);
        int[] dev_y = gpu.Allocate<int>(Y);
        float[] dev_c = gpu.Allocate<float>(position.Count);
        int[] dev_indx = gpu.Allocate<int>(position.Count);
        float[] dev_al = gpu.Allocate<float>(position.Count);
        float[] dev_be = gpu.Allocate<float>(position.Count);
        float[] dev_th = gpu.Allocate<float>(position.Count);
        int MAX_LEVELS = F.Length / 4;
        int[] beg = gpu.Allocate<int>(MAX_LEVELS);
        int[] end = gpu.Allocate<int>(MAX_LEVELS);

        gpu.CopyToDevice(F, dev_a);           
        gpu.CopyToDevice(Y, dev_y);
        gpu.CopyToDevice(w, dev_w);   

        gpu.Launch(position.Count, 1).function_Cudafy(dev_b, dev_a, dev_aa, dev_dataUnsorted, dev_w, dev_h, dev_c, dev_y, dev_indx, dev_al, dev_be, dev_th, beg, end);

        gpu.CopyFromDevice(dev_c, c);
        gpu.CopyFromDevice(dev_indx, indx);
        gpu.CopyFromDevice(dev_al, al);
        gpu.CopyFromDevice(dev_be, be);
        gpu.CopyFromDevice(dev_th, th);

        gpu.FreeAll();
        }
[Cudafy]
    public static void function_Cudafy(GThread thread, int[] indices, float[,] dev_a , float[] dev_aa , float[] data_unsorted, float[] w, int[] h, float[] c, int[] y, int[] indx, float[] al, float[] be, float[] th, int[] beg, int[] end)
    {
        int tid = thread.blockIdx.x;          

        while (tid < position.Count)
        {
            for (int i = 0; i < dev_aa.Length; i++)
            {
                dev_aa[i] = dev_a[tid, i];
                data_unsorted[i] = dev_a[tid, i];
                indices[i] = i;
            }

            arr_all arr = new arr_all();
            arr.arr = dev_aa;
            arr.indx = indices;
            arr = quickSort_non_recursive_Cuda2(arr, dev_aa.Length, beg, end);
            dev_aa = arr.arr;
            indices = arr.indx;

            c[tid] = v1;
            indx[tid] = tid;
            al[tid] = v2;
            be[tid] = v3;
            th[tid] = v4;

            tid++;
        }
    }

[Cudafy]
public static arr_all quickSort_non_recursive_Cuda2(arr_all arr, int elements, int[] beg, int[] end)
{
    //int MAX_LEVELS = 500000;
    //int[] beg = new int[MAX_LEVELS];
    //int[] end = new int[MAX_LEVELS];
    int i = 0, L, R, swap, piv_indx;
    float piv;
    beg[0] = 0; end[0] = elements;
    while (i >= 0)
    {
        L = beg[i];
        R = end[i] - 1;
        if (L < R)
        {
            piv = arr.arr[L];
            piv_indx = arr.indx[L];
            while (L < R)
            {
                while (arr.arr[R] >= piv && L < R)
                    R--;
                if (L < R)
                {
                    arr.indx[L] = arr.indx[R];
                    arr.arr[L++] = arr.arr[R];
                }
                while (arr.arr[L] <= piv && L < R)
                    L++;
                if (L < R)
                {
                    arr.indx[R] = arr.indx[L];
                    arr.arr[R--] = arr.arr[L];
                }
            }
            arr.arr[L] = piv; arr.indx[L] = piv_indx; beg[i + 1] = L + 1; end[i + 1] = end[i]; end[i++] = L;
            if (end[i] - beg[i] > end[i - 1] - beg[i - 1])
            {
                swap = beg[i]; beg[i] = beg[i - 1]; beg[i - 1] = swap;
                swap = end[i]; end[i] = end[i - 1]; end[i - 1] = swap;
            }
        }
        else
        {
            i--;
        }
    }
    return arr;
}
[Cudafy]
public struct arr_all
{
    public float[] arr;
    public int[] indx;
}
I have a question regarding the above code. Based on what I understand about CUDAfy, I cannot create an array inside a CUDAfy function. For that reason, in the above code I allocated the arrays outside the CUDAfy function.
For example, in the above code,
float[] dev_aa = gpu.Allocate<float>(w);
dev_aa was allocated and passed as a parameter to the function named function_Cudafy.
Now inside function_Cudafy, what I am trying to do with this allocated array is to take one row from the 2D array dev_a and fill it into dev_aa. Then I sort the dev_aa array.
I think, but I am not sure yet, that this approach will not work correctly, since dev_aa is allocated in global memory. When all the blocks run in parallel, dev_aa will be overwritten by different blocks, so I will not get the correct result.

This is my guess, but I am not sure if it is correct.

Please correct me, and in this case what should I do to solve the problem.

Thanks.
May 23, 2013 at 8:24 PM
Edited May 23, 2013 at 8:30 PM
Yes, if you had all threads using your dev_aa simultaneously for write operations, of course you'd be in trouble. You should, for example, use a full copy of the original array dev_a and have each thread read/write only within the row assigned to it.
By the way, there's no need to return anything in the function quickSort_non_recursive_Cuda2. Just pass dev_aa and indices as function arguments.

I recommend that you read CUDA's manual (it's a good starting point, really), since it's impossible to write good parallel code without knowing the fundamentals. I say this because I see you're trying to forcefully fit a linear algorithm into a parallel engine, and also because you're not aware of basics such as coalesced memory access, shared memory, thread concurrency, warp usage, and thread/block ratios. Otherwise, even if you get your code above to work on a GPU, you won't get any significant performance improvement over a CPU.
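For example, a sketch of the per-row scratch idea (sizes and names are illustrative, and I'm assuming the two-dimensional Allocate overload here):

```csharp
// Host side: give every row its own scratch space instead of
// one shared dev_aa buffer.
float[,] dev_scratch = gpu.Allocate<float>(rows, rowLength);
int[,] dev_rowIndices = gpu.Allocate<int>(rows, rowLength);

// Device side: each block copies and sorts only its assigned row:
// int row = thread.blockIdx.x;
// for (int i = 0; i < rowLength; i++)
//     dev_scratch[row, i] = dev_a[row, i];
```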
May 30, 2013 at 5:52 PM
Thanks for replying.

I was wondering: if I want to create a local array, do I have to use something like this?

float[] cache = thread.AllocateShared<float>("cache", threadsPerBlock);

Based on what I understand, the problem here is that the size of the array will be limited to the number of threads per block.

So in my case, one block, with only one thread in that block, is in charge of each row of the 2D array. If I want to copy one row from the 2D array to shared memory, but the length of that row is 12000 and I have only one thread per block, what should I do?

Thanks.
May 30, 2013 at 11:15 PM
Hi

You're not limited to the thread count within a block when allocating shared memory. Read more about shared memory in CUDA's manual.
You could do this in your sample case:

thread.AllocateShared<float>("cache", n); // n is some arbitrary constant

Note that:
1 - shared memory is a limited resource; you mustn't allocate too much.
2 - one thread per block is an extraordinarily bad use of GPU resources. Read about GPU occupancy to learn more.
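As a sketch of what better occupancy looks like (names are illustrative): launch many threads per block and let them share the per-row copy loop:

```csharp
[Cudafy]
public static void CopyRow(GThread thread, float[,] src, float[] dst, int rowLength)
{
    int row = thread.blockIdx.x;
    // Strided loop: the block's threads split the ~12000 elements
    // between them instead of one thread walking them all.
    for (int i = thread.threadIdx.x; i < rowLength; i += thread.blockDim.x)
        dst[i] = src[row, i];
    thread.SyncThreads(); // wait until the whole row is copied
}
```

You would then launch it with something like gpu.Launch(rows, 256).CopyRow(...) rather than one thread per block.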

Here's a good example of GPU sorting done properly. If you adapt that code to your specific case, you'd only need to convert about 10% of it to CUDAfy.
http://www.naic.edu/~phil/hardware/nvidia/doc/src/particles/radixsort.cu