Recursion and required compute capability?

Jul 11, 2013 at 10:55 AM
Edited Jul 11, 2013 at 11:14 AM
I have just added a new kernel which is recursive and when i try and compile the code, i get the following error:
An error has occured and the operation has been terminated.
Compilation error: CUDAFYSOURCETEMP.cu
C:/Users/FrazMann/Desktop/CUDA - MK3/FrazerMann.Profiler.UserInterface/bin/x64/Debug/CUDAFYSOURCETEMP.cu(129): error: calling a global function("QuickSortOfValues") from a global function("QuickSortOfValues") is only allowed on the compute_35 architecture or above

C:/Users/FrazMann/Desktop/CUDA - MK3/FrazerMann.Profiler.UserInterface/bin/x64/Debug/CUDAFYSOURCETEMP.cu(129): error: a global function call must be configured

C:/Users/FrazMann/Desktop/CUDA - MK3/FrazerMann.Profiler.UserInterface/bin/x64/Debug/CUDAFYSOURCETEMP.cu(133): error: calling a global function("QuickSortOfValues") from a global function("QuickSortOfValues") is only allowed on the compute_35 architecture or above

C:/Users/FrazMann/Desktop/CUDA - MK3/FrazerMann.Profiler.UserInterface/bin/x64/Debug/CUDAFYSOURCETEMP.cu(133): error: a global function call must be configured

4 errors detected in the compilation of "C:/Users/..../Temp/tmpxft_00001bb4_00000000-8_CUDAFYSOURCETEMP.cpp1.ii".

Stack Trace for above error:
at Cudafy.CudafyModule.Compile(eGPUCompiler mode, Boolean deleteGeneratedCode)
at Cudafy.Translator.CudafyTranslator.Cudafy(ePlatform platform, eArchitecture arch, Version cudaVersion, Boolean compile, Type[] types)
at Cudafy.Translator.CudafyTranslator.Cudafy(ePlatform platform, eArchitecture arch, Type[] types)
at Cudafy.Translator.CudafyTranslator.Cudafy(eArchitecture arch)
at FrazerMann.Profiler.UserInterface.TaskManager.DataRetrieval() in C:\Users\FrazMann\Desktop\CUDA - MK3\FrazerMann.Profiler.UserInterface\System\TaskManager.cs:line 213
at FrazerMann.Profiler.UserInterface.MainView.btnEvaluation_Click(Object sender, EventArgs e) in C:\Users\FrazMann\Desktop\CUDA - MK3\FrazerMann.Profiler.UserInterface\View\MainView.cs:line 338
Inner exception:
Below is the stripped down method:
        [Cudafy]
        public static void QuickSortOfValues(GThread thread, long[,] tc, int[] lc, int left, int right)
        {
            int threadIndex = thread.threadIdx.x;
            int blockIndex = thread.blockIdx.x;
            int threadsPerBlock = thread.blockDim.x;
            int tickPosition = (threadIndex + (blockIndex * threadsPerBlock));

            int i = 0;
            int j = 0;

            if (left == -1)
            {
                i = 0;
                j = lc[tickPosition];
                left = i;
                right = j;
            }
            else
            {
                i = left;
                j = right;
            }

            long pivot = tc[((i + j) / 2),0];
 
            while (i <= j)
            {
                while (tc[i,0] < pivot)
                {
                    i++;
                }

                while (tc[j,0] > pivot)
                {
                    j--;
                }
 
                if (i <= j)
                {
                    // Swap
                    long tmp0 = tc[i,0];
                    long tmp1 = tc[i, 1];

                    tc[i, 0] = tc[j, 0];
                    tc[i, 1] = tc[j, 1];

                    tc[j, 0] = tmp0;
                    tc[j, 1] = tmp1;

                    i++;
                    j--;
                }
            }
 
            // Recursive calls
            if (left < j)
            {
                QuickSortOfValues(thread, tc, lc, left, j);
            }
 
            if (i < right)
            {
                QuickSortOfValues(thread, tc, lc, i, right);
            }
        }
Am i suppose to pass "thread" as a parameter during the recursive call?

I also have the following which means im using cuda and targeting 2.0 compute capability. I have a gtx 660 so thats as high as the card will allow if im not mistaken.
            CudafyModes.Target = eGPUType.Cuda;
            CudafyModes.DeviceId = 0;
            CudafyTranslator.Language = eLanguage.Cuda;

            CudafyModule km = CudafyTranslator.Cudafy(eArchitecture.sm_20);            
            _gpu = CudafyHost.GetDevice(eGPUType.Cuda);
            _gpu.LoadModule(km);
Any ideas why its asking for 3.5?
Jul 11, 2013 at 11:37 AM
Edited Jul 11, 2013 at 11:39 AM
You should avoid recursion in cuda, since the stack size is fairly small and/or located in global memory
Nevertheless, try using the following pattern (to differenciate between device and global functions):
// call this one from the host
[Cudafy]
public static void QuickSortOfValues(GThread thread, long[,] tc, int[] lc, int left, int right)
{
// some initialization here
// ...

QuickSortOfValues_Inner(thread, tc, lc, left, right);

// some finalization here
}
// returning something other than "void" is part of how cudafy differenciates between device and global modifiers
[Cudafy]
public static int QuickSortOfValues_Inner(GThread thread, long[,] tc, int[] lc, int left, int right)
{
//...

QuickSortOfValues_Inner(thread, tc, lc, SomeValue1, SomeValue2)

// etc


// finally, return some dummy value
return 0;
}
"Any ideas why its asking for 3.5?"
The compiler thinks you're trying to do dynamic parallelism. Don't worry about it. Compute 2.0 should be enough for standard recursion.
Jul 11, 2013 at 12:00 PM
Edited Jul 11, 2013 at 12:02 PM
Hey pedritolo1, thanks for replying. I was under the impression i wasnt able to call another method from a kernel.

So if im sorting an array[2000, 30000], where each thread is assigned a row to sort, i could run into problems with the stack size?

I'll give the your pattern a shot and see if that solves it.

Thanks again for your help :)
Jul 11, 2013 at 12:22 PM
The amount of used stack depends on how unsorted your lists are.
Generally speaking, quicksort is a terrible choice for a sorting algo. Try some non-recursive implementation of merge sort, for example. better still, look up on the web for a proper parallel implementation of a sort algo, since the paradigm changes abruptely once parallelism is used.
Jul 11, 2013 at 5:43 PM
Edited Jul 11, 2013 at 5:44 PM
@Pedritolo1, ive tried modifying my code, i created a CPU version, using the same parameters, so i could debug it and then i launched my GPU version several times. It keeps hanging for several minutes. The GPU has 15000 rows to do in parallel but the cpu managed to do 1 in 0.8 seconds so im assuming ~30sec and it should have done all 15000 rows.

1 thing i dont understand regarding the pattern above is the returning of the int. Shouldnt the 1st code block be
int dummyValue = QuickSortOfValues_Inner(thread, tc, lc, left, right);
If so, dont i have several threads trying to write back to a single non-thread safe variable?

Maybe i wasnt clear when i originally stated the question, if so i apologize, but each thread is sorting 1 row of the data. The GTX 660 has 960 cores (making the simplification that that equates to 960 threads) then i could sort 960 rows all at once, which i thought would be a reasonable way of sorting the data.

Regarding merge-sort, i did look into this, but i couldnt find an implementation (it took me a while to get my head around the basics of both) + as far as im aware, quick-sort is an in place sort whereas merge-sort requires more memory, which is something im running short on.

Thanks again for your help so far, i appreciate it.
Jul 11, 2013 at 6:39 PM
There are in-place parallel mergesort and radix sort C source codes. Somewhere out there. I posted a URL to one of them in one of my earlier posts.

The return value is just a way to force cudafy to mark the recursive function with the "device" modifier and it will later be removed by the optimizer, so don't pay it any mind.

To fix your problem you could perhaps use visual nSight to debug it while reducing drastically the size of your problem so it becames maneagable.
Sorry that I can't help further.

cheers
Jul 11, 2013 at 6:54 PM
So, is the modified code i just posted correct, or should i not assign the returned value to an int?

i had a look into radix sort but unfortunately the numbers im sorted are extremely large, hence the long data type, so converting them into binary and storing them in a datatype isnt possible without loosing precision, as far as im aware.

If ive understood correctly, i need to have a second GPU in order to debug and currently im still using a single GPU. Until i can confirm this is worth it im not willing to pay for a new one, being a student and all.

Regarding making the problem more managable, yeah thats a logical next step.

1 final question if i may, is it possible to add the "device" modifier to the code manually rather than modifying the return type of the 2nd method?

Thanks again :)
Jul 11, 2013 at 9:25 PM
"So, is the modified code i just posted correct"
It's correct, but pointless. Leave it as per my previous example. Disclaimer: Personally, I never tried recursivity in cuda. In theory it should work.

"If ive understood correctly, i need to have a second GPU in order to debug and currently im still "
Not anymore. With the latest nSight, you can use a single pc. I know, since I do it with on laptop.

"is it possible to add the "device" modifier to the code manually "
Not that I'm aware of.

"i had a look into radix sort but unfortunately ..."
Yeah, I see. You can try Bitonic sort, lovely algo, but it's a bit complex to imlement. There are many source codes around on the interwebs.
Jul 11, 2013 at 10:21 PM
Edited Jul 12, 2013 at 9:31 AM
"If ive understood correctly, i need to have a second GPU in order to debug and currently im still "
Not anymore. With the latest nSight, you can use a single pc. I know, since I do it with on laptop.

Nice, i just gave it a shot and managed to get it loaded up. Unfortunately i cant seem to evaluate the variables like the guy does on the nvidia website. I tried adding a variable to the watch list but its not working. I also dont get anything when i hover the cursor over the variable. I didnt name my variable "num2" so im assuming thats because cudafy made this code?

I have simplified the code so only one thread is executing and that works. Im going to try changing it back to ur original code pattern n see if that helps.

Ive only just started using sorting algo's so id prefer to stick with quicksort, if possible, atm.

Image


EDIT I decided to post the issue about the debugger in a separate question.
Coordinator
Jul 12, 2013 at 11:16 AM
By default cudafy functions with a void return will be assumed to be global. You cannot call a global function from the GPU unless you use a 3.5 device and dynamic parallelism which will be supported in next version of CUDAfy (a fair amount of jumping through hoops was required but that is another story).
You can override this behaviour by something like this:
        [Cudafy(eCudafyType.Device)]
        public static void childFunction(GThread thread, int[] a, int[] c, short coeff)
        {
            int tid = thread.blockIdx.x;
            if (tid < a.Length)
                c[tid] = a[tid] * coeff;
        }
Okay I will hunt down your debugger issue on other thread.
Jul 12, 2013 at 11:25 AM
Hi NikcKopp, thanks for replying. Thanks for the heads up regarding the modified attribute. Ill give it a shot and see if it works.

Odd thing is, the 1st row is sorted correctly, but the next row isnt. I hadnt checked the 3rd+ row but im assuming they are also not sorted correctly. Needless to say this is where a debugger would be pretty handy.

Thanks again for your time.
Jul 12, 2013 at 2:02 PM
[Cudafy(eCudafyType.Device)]
Oh nice! Didn't know that.
Jul 15, 2013 at 10:16 AM
@pedritolo1 / @NickKopp Are there any sorting algorithm frameworks or when using cudafy.net do you always have to implement your own?
Coordinator
Jul 15, 2013 at 10:22 AM
There is no oven ready sorting algorithm in CUDAfy. There are plenty of examples in CUDA C that could be ported. I would really like to encourage you if you do implement to share the code so we can also integrate into future CUDAfy releases. This is how some of the Cudafy.Math libraries came about. Unfortunately not many people have ever done this. Not directing this to you personally, more a general statement to all those stumbling on this thread: I believe that implementing more higher level libraries will be key to increasing the profile of CUDAfy.

Take home is: If you implement something with CUDAfy and it could be useful to more people, please consider submitting the code!
Jul 15, 2013 at 10:47 AM
Np, ive written a quick sort algorithm but after running it on the CPU it took 24 seconds to sort 30,000 records of values between 100,000,000 -> 900,000,000 which im assuming isnt very good so im not sure im the best coder for adding to cudafy.net but i'll happily share what i come up with and let u make a decission as to if its %^&* or not :)
Jul 15, 2013 at 11:43 AM
I'm considering writing an article on best practices on gpu programming, and I may focus on sorting algos. We'll see.
Sep 2, 2013 at 9:59 PM
FYI: an article on performance has been recently added to cudafy's documentation which contains the cudafy source for a gpu sorter