
Inconsistent Output

Jun 25, 2013 at 6:44 PM
Edited Jun 25, 2013 at 6:55 PM
Hello all,

So everything in my code works perfectly through the first iteration, and I have confirmed its output against an existing CPU solution. On the second iteration, however, the GPU code often, but not always, gets incorrect answers. I suspect the first method call is stepping on the end of the last method call, but I don't know how to fix it.

I've inserted gpu.Synchronize(); calls, but they don't seem to help. (What exactly do these do? I must be using them wrong.)
for (int ni = 0; ni < 2; ni++) //if this is ni < 1 it gets the correct answer, but it will need to be > 1000 for the final run
{
    gpu.Synchronize();
    gpu.Launch(Array1.GetLength(0) / N + 1, N).Method1(… Parameters …);
    gpu.Synchronize();

    if (Array2OnDevice == null)
    {   //probably a more efficient way to do this as well
        gpu.CopyFromDevice<float>(Array2OnDevice, Array2);
        Array3OnDevice = gpu.CopyToDevice<float>(Array2);
    }
    gpu.Synchronize();

    gpu.Launch(Array1.GetLength(0) / N + 1, N).Method2(… Parameters …);
    gpu.Synchronize();

    //Once I get the correct output, I will try to combine these calls to avoid the DLL overhead
    gpu.Launch(Array3.GetLength(0) / N + 1, N).Method3(… Parameters …);
    gpu.Synchronize();

    gpu.Launch(Array3.GetLength(0) / N + 1, N).Method4(… Parameters …);
    gpu.Synchronize();

    gpu.Launch(Array3.GetLength(0) / N + 1, N).Method5(… Parameters …);
    gpu.Synchronize();

    gpu.Launch(Array3.GetLength(0) / N + 1, N).Method6(… Parameters …);
    gpu.Synchronize();
}
Any ideas on how to fix this?

Thanks for all the help!

Zach
Jun 25, 2013 at 10:28 PM
Hi
You don't need all those Synchronize instructions. They are used when you are doing asynchronous kernel launches and memory copy operations, which isn't the case here (check the CUDA toolkit manual and the Cudafy examples for more info about asynchronous execution and streams).
Your problem with inconsistent results is most likely due to multiple threads concurrently changing the same data. I wrote a post about this a while back, but I can't seem to find it.
Anyhow, the best thing to do is to debug your kernel using NVIDIA's Nsight.
good luck
Jun 25, 2013 at 10:40 PM
OK, that's what I figured about the sync. So the CUDA toolkit manual is a good source of documentation? I've been looking for more documentation to help me understand all of this, but I had discounted that one because CUDA != Cudafy.

And yes, I figured it was different threads concurrently changing data, and I know it happens between the last Launch (Method6) and the next invocation of Method1... but what can I use to stop that? Is there a sync instruction?

I will look through the toolkit manual and report back what I find, unless someone answers it here before I find it.

Thank you,

Zach
Jun 25, 2013 at 11:26 PM
It looks like I should use this: "cudart_builtin cudaError_t cudaDeviceSynchronize (void):"

But how would I call something like that from Cudafy?

Also, is "cudaDeviceBlockingSync" supported in cudafy? If so I might test for performance once this is fixed.
Jun 26, 2013 at 12:44 AM
It seems to me you are mixing two different things. One is what happens on the CPU (the host) outside a kernel launch, the other is what goes on within a GPU kernel (the device). These are two different worlds. The one you should be focusing on is the kernel. Instead, you're trying to use API calls meant for the host to control the device (Cudafy already does all that for you, so you shouldn't worry about it).

And yes, the CUDA toolkit's "CUDA_C_Programming_Guide.pdf" is a mandatory read for anyone wishing to go anywhere with CUDA. It's, IMO, a very interesting read. The manual is well thought out and quite thorough. I suspect you might be trying to read "CUDA_Toolkit_Reference_Manual.pdf" instead, which is only a reference manual and likely confusing unless you know what you're looking for.
Once you've read it, go through Cudafy's (short) manual, and it will all make a lot of sense.
Cudafy basically allows you to write the kernel code directly in C# instead of C, which makes things neater and your code becomes cleaner and easier to maintain. But knowledge of the underlying CUDA is, IMO, essential.

Have fun
Jun 26, 2013 at 4:38 PM
Edited Jun 26, 2013 at 4:38 PM
OK, well, I skimmed "CUDA_C_Programming_Guide.pdf" about a year ago but haven't had to use it since, so I'll brush up on it ASAP.

And I understand the difference between host code calling a Cudafy kernel method and the kernel method itself; I'm just not familiar enough with the CUDA/Cudafy commands to know where I can use what yet. I also simply don't know what commands exist in Cudafy. (This is likely my problem, but I don't see a whole lot of documentation on this - the Cudafy manual is very short and mostly covers setting up Cudafy, plus examples - unless I'm looking at the wrong thing?)

Since my last post, I've simplified it to the code below and am still getting the same inconsistent output. To my understanding, if I use any command to "sync" threads on the GPU itself, it will only sync threads of the same warp, correct? This is why I was looking for something to sync the threads in the host code.

Is there any command to sync all the threads? And would it be used in C# or in the kernel?
for (int ni = 0; ni < 2; ni++)
{
    gpu.Launch(Array1.GetLength(0) / N + 1, N).Method1(… Parameters …);

    gpu.Launch(Array2.GetLength(0) / N + 1, N).Method2(… Parameters …);

    //If I can sync all warps on the device, I can combine these 3 calls into one and avoid the overhead. 
    //Any suggestion on how to do this?

    gpu.Launch(Array2.GetLength(0) / N + 1, N).Method3(… Parameters …);

    gpu.Launch(Array2.GetLength(0) / N + 1, N).Method4(… Parameters …); 
    //Need to sync all threads after this method, or at its end, before the loop repeats.
}
As always, thank you for your help,

Zach
Jun 26, 2013 at 5:18 PM
Since your host side is a single-threaded application and you aren't using CUDA streams, you won't need to synchronize anything on the CPU.
You want to work only on the kernel side; that's where the functions marked with the [Cudafy] attribute run.
Within your MethodX function, you have the "thread" argument, which you may use to call the method "thread.SyncThreads()" to sync all threads within the same block.

"if I use any command to "sync" threads on the GPU itself, it will only sync threads of the same warp, correct"
No. Within a warp, all threads are implicitly synchronized, so you'll only need to call thread.SyncThreads if you need synchronization across the whole block.
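For illustration, a minimal sketch of what I mean (the kernel name and buffer are made up):
[Cudafy]
public static void BlockSyncSketch(GThread thread, float[] data)
{
    int tid = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;

    if (tid < data.GetLength(0))
        data[tid] *= 2.0f;

    // Waits until every thread in THIS block reaches this point.
    // It says nothing about threads in other blocks.
    thread.SyncThreads();
}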
cheers
Jun 26, 2013 at 5:40 PM
pedritolo1 wrote:
Since your host side is a single threaded application and you aren't using cuda streams, then you won't need to synchronize anything on the cpu.
So any time control goes back to the host, every thread on the card will be complete? That is, between Method3 and Method4, there is no chance a single thread in Method4 starts before the last thread in Method3 finishes? Because that is what I believe is currently causing the inconsistencies (except between Method4 and Method1).
You want to work only on the kernel side, that's where the function marked with the [Cudafy] attribute will run.
Within your MethodX function, you have the "thread" argument, which you may use to call the method "thread.SyncThreads()" to synch all threads within the same block.

"if I use any command to "sync" threads on the GPU itself, it will only sync threads of the same warp, correct"
No. Within a warp, all threads are implicitly synchronized, so you'll only need to call thread.SyncThreads if you need synchronization across the whole block.
cheers
OK, so that syncs it across the whole block, but does it sync between the different blocks on the grid? Is there any way to ensure that?
Jun 26, 2013 at 8:14 PM
Edited Jun 26, 2013 at 9:18 PM
There must be something to this - Cudafy must not force all of the blocks to finish before making the next call, because adding a delay after Method4 decreased the occurrence of the error - fairly significantly, too (about 4x less likely now). This still isn't accurate enough, nor is the performance acceptable, but it helps key in on the problem.

Edit - never mind, it might still be an issue. During my testing I was apparently just getting "lucky" and saw an improvement where there might not be any. (I did over 1000 trials.)
Jun 26, 2013 at 9:06 PM
gpu.Launch and gpu.CopyFromDevice are both synchronous. So there's really no need to enforce synchronization. But if it puts your mind at ease, you can always call gpu.Synchronize after each call, for good measure.

"Ok, so that syncs it across the whole block, but does it sync between the different blocks on the grid?"

No, only within the block. But atomics work across the whole device, so you should be OK. If you post us a simplified, trimmed-down version of your kernel code that still exhibits the error, we may be able to help you further. Remember to use the memtest tool - it has an option to test for race conditions - and Nsight is also a useful tool in these scenarios.
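For a hedged illustration of what "atomics work across the whole device" means (this assumes the Cudafy.Atomics extensions are available and that the values are integers, since CUDA's atomicMax is integer-only; the names here are made up):
// using Cudafy.Atomics;
[Cudafy]
public static void DeviceWideMaxSketch(GThread thread, int[] values, int[] result)
{
    int tid = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
    if (tid < values.GetLength(0))
    {
        // atomicMax serializes the conflicting writes, so it is safe across ALL blocks,
        // unlike SyncThreads, which only coordinates threads within one block.
        thread.atomicMax(ref result[0], values[tid]);
    }
}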

"...fairly significantly too (about 4x less likely now..."
I think you're looking in the wrong place. The error likely lies within your kernel code.
Jun 28, 2013 at 10:57 PM
Unfortunately I haven't had time to tinker with this lately. But I've narrowed it down to this call:
public static void findMaxDistance(GThread thread, float[] maxDistance, float[] tempDistances, float[,] coordinates)
{   //coordinates contains X, Y and Z coordinates in index [tid, 0], [tid, 1], and [tid, 2] respectively
    //maxDistance is an array of length 1 simply to allow a modified value to be passed back
    int tid = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
    if (tid < tempDistances.GetLength(0))
    {
        tempDistances[tid] = (coordinates[tid, 0] * coordinates[tid, 0]) +
                             (coordinates[tid, 1] * coordinates[tid, 1]) +
                             (coordinates[tid, 2] * coordinates[tid, 2]);

        thread.SyncThreads();

        for (int i = tempDistances.GetLength(0) / 2; i > 0; i /= 2)
        {
            if (tid < i)
            {
                if (tempDistances[tid] < tempDistances[tid + i])
                {
                    tempDistances[tid] = tempDistances[tid + i];
                }    
            }
            thread.SyncThreads();
        }

        if (tid == 0)
        {
            maxDistance[0] = (float)Math.Sqrt(tempDistances[0]);
        }
    }   
}
Now, I'm guessing the error comes from the fact that the sync only syncs threads in the same block, while different blocks can execute out of order and hence cause an issue by reading out of order.

Any ideas of how to fix this? Or is there a better method to find the maxDistance?
Jun 29, 2013 at 12:41 PM
Hi
This is a typical reduction algo, and you're doing it wrong. You shouldn't even need the tempDistances buffer (which doesn't seem to have the needed length or to have been properly initialized throughout). Instead, use a buffer with the same size as the number of blocks, containing the partial reduction result for each block, which you'll aggregate on the CPU for the final answer.
Use shared memory for the reductions, and read the NVIDIA docs for many cool examples of reductions. A sketch of the idea follows.
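Something along these lines (a minimal sketch; the names are made up, and it assumes a fixed block size of 256 matching the launch configuration):
[Cudafy]
public static void blockMaxSketch(GThread thread, float[,] coordinates, float[] blockMax)
{
    // One shared buffer per block; 256 must match the block size used at launch.
    float[] cache = thread.AllocateShared<float>("cache", 256);

    int tid = thread.threadIdx.x;
    int gid = thread.blockIdx.x * thread.blockDim.x + tid;

    // One (coalesced) read from global memory into shared memory.
    float d = 0f;
    if (gid < coordinates.GetLength(0))
        d = coordinates[gid, 0] * coordinates[gid, 0] +
            coordinates[gid, 1] * coordinates[gid, 1] +
            coordinates[gid, 2] * coordinates[gid, 2];
    cache[tid] = d;
    thread.SyncThreads();

    // Block-level max reduction, entirely in shared memory.
    for (int i = thread.blockDim.x / 2; i > 0; i /= 2)
    {
        if (tid < i && cache[tid + i] > cache[tid])
            cache[tid] = cache[tid + i];
        thread.SyncThreads();
    }

    // One result per block goes back to global memory.
    if (tid == 0)
        blockMax[thread.blockIdx.x] = cache[0];
}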
cheers
Jul 1, 2013 at 4:58 PM
The temp buffer is initialized once on the CPU before any GPU code is called, and then passed in to be used repeatedly through each of the iterations.

If possible (and practical) I would like to keep everything on the GPU, as I currently do this on the CPU and the memory transfer in this inner loop is more time consuming than most of the calculations. (This whole findMaxDistance routine is in the inner loop.)

I can easily calculate the distance in the previous method and then just have a distance array to search through - would that likely perform better? I guess that would let me use a pre-built reduction function, but is there one that doesn't require powers of 2?

Looking through the CUDA example codes, I don't understand how my application is different from this one:
for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}
Which is found on page 15 here: http://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf

Sure, I understand that there are many optimizations to make after this point, but they start getting more technical than I have time for on this project.
Jul 1, 2013 at 6:13 PM
Edited Jul 1, 2013 at 6:16 PM
"Now, I'm guessing the error is coming from the fact that the sync is only syncing threads on the same block, but there are possibly different blocks that execute out of order and hence, cause an issue by reading out of order."

You're correct, that's where the error lies. But:
  • your temporary buffer for the 1st stage of reduction should be in shared memory, and not global.
  • the idea behind reduction is that each block operates independently from other blocks, reading data from a small interval of a large global buffer of coordinates. So, alloc a shared mem buffer within a block, fill it with distances, then preform a reduction on that shared mem buffer (yes, your algo was fine as long as it operates on shared mem, and not on global mem). The final result of the block's reduction should be stored on a global mem buffer, with as many elements as there are blocks. Here your kernel finishes.
  • now, you have a small buffer with partial reductions, one per each block. You can either choose to scan them sequentially on the cpu, or feed them again into a different kernel which wil do the final reduction. Since the buffer should be fairly small, using the cpu for the final step seems feasible, but not essencial.
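A rough sketch of the CPU option (hypothetical names; it assumes a blockMax buffer that a kernel filled with one partial max per block):
// Copy the per-block partial maxima back and finish the reduction on the host.
float[] blockMax = new float[numBlocks];          // numBlocks = grid size of the kernel launch
gpu.CopyFromDevice(blockMaxOnDevice, blockMax);   // synchronous copy

float max = blockMax[0];
for (int i = 1; i < numBlocks; i++)
    if (blockMax[i] > max)
        max = blockMax[i];

float maxDistance = (float)Math.Sqrt(max);        // the distances were kept squared on the GPU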
cheers
Jul 1, 2013 at 6:27 PM
Sounds good, I'll give it a try both ways (searching results on GPU and CPU) and see what has the best performance.

I'm a bit surprised that there isn't a built-in "find max" and "find min" function, though... Considering there's a standard Array.Max() function in C#, I would have thought something would exist.

Also, is there a built in sort method that I could use and just take the first/last element depending on the sort order? How much difference would this have on performance?

Thanks for your help.
Jul 1, 2013 at 8:01 PM
Edited Jul 1, 2013 at 8:02 PM
Would this work? Or does it not do what it appears?

public abstract int IAMAX( float[] vectorx, int n, int rowx, int incx);

It appears to search through vectorx and return the index of the max value. I believe n is the number of elements to search, but I'm not sure what rowx or incx do...

I couldn't find much documentation on this. Searching for IAMAX yields nothing of relevance...


There's also:
NppStatus nppsMax_32f (const Npp32f * pSrc, int nLength, Npp32f * pMax, Npp8u * pDeviceBuffer);

Any chance that's implemented in Cudafy?
Jul 2, 2013 at 12:06 AM
So this is what I've got. The goal is for each block to find its max value and place it at the beginning of its section of the array in the first method. The second method is then supposed to launch with only 1 block of many threads (as many threads as there were blocks the first time) and compare the first memory location of each section to find the max across the first method's blocks. For some reason it doesn't always work (although it does sometimes). I haven't tracked that down yet, but will soon:
[Cudafy]
public static void findMaxStepOne(GThread thread, float[] tempDistance)
{
    int tid = thread.threadIdx.x;
    int bindex = thread.blockIdx.x * thread.blockDim.x;

    if (tid + bindex < tempDistance.GetLength(0))
    {
        thread.SyncThreads();

        for (int i = thread.blockDim.x / 2; i > 0 && tid + bindex + i < tempDistance.GetLength(0); i /= 2)
        {
            if (tid < i && tempDistance[tid + bindex] < tempDistance[tid + bindex + i])
            {
                tempDistance[tid + bindex] = tempDistance[tid + bindex + i];
            }
            thread.SyncThreads();
        }
    }
}

[Cudafy]
public static void findMaxStepTwo(GThread thread, float[] tempDistance)
{
    int tid = thread.threadIdx.x;
    int OldBlockSize = thread.blockDim.x;

    if (tid + OldBlockSize * thread.blockIdx.x < tempDistance.GetLength(0))
    {
        thread.SyncThreads();

        for (int i = thread.blockDim.x / 2; i > 0; i /= 2)
        {
            if (tid < i)
            {
                if (tempDistance[tid * N] < tempDistance[(tid + i) * N])
                {
                    tempDistance[tid * N] = tempDistance[(tid + i) * N];
                }
            }
            thread.SyncThreads();
        }
    }
}
Is there really no clean, optimized way to do this already built into Cudafy???

Thanks for the help.
Jul 2, 2013 at 2:01 AM
You're getting closer.
But you're still not using shared memory for the reduction. And those offsets inside the arrays are a bit confusing.
Try to always think of global memory access as something prohibitively expensive, to be avoided at all costs. Read the data that your block will be working on, in one step, into a shared mem buffer, and work exclusively with that shared mem from then on. Do try to make that first memory read a coalesced access. At the very end, save the results to global mem in one step.

Anything inside a [Cudafy] function isn't really C# (hah!). It only appears to be. In fact, ultimately you're writing C code, so any library that you're used to working with in C# simply won't be available.
CUDA has some math libraries (BLAS, FFT, etc) that CUDAfy gives access to, but I am not aware of there being a sort or a max/min function on those libs.

"Also, is there a built in sort method that I could use and just take the first/last element depending on the sort order? How much difference would this have on performance?"
It's a huge difference. A scan searching for a max/min is only O(n), while a sort is O(n log n), not to mention all those memory writes.

I have no idea whatsoever what "IAMAX" and "nppsMax" are, sorry.
Jul 2, 2013 at 8:42 AM
Edited Jul 2, 2013 at 8:45 AM
Take a look at BLAS - it has functions for max and min. Consult the NVIDIA documentation on CUBLAS for details. rowx and incx allow more advanced stepping through the data. You can also look at BLAS_1D.cs unit test in Cudafy.Math.UnitTests project.
Jul 2, 2013 at 5:30 PM
Yes, I found a function in the CUBLAS methods and had tried using it, but as I mentioned above, I couldn't find sufficient documentation on any of it. I have followed the example in BLAS_1D.cs exactly, and yet these two snippets yield different results:
maxDistance[0] = tempDistance.Max();

 int index = blas.IAMAX(tempDistanceOnDevice);
 maxDistance[1] = tempDistance[index];
I have checked that the array is correct and everything is copied so that the array on device and the array on host are the same, but to no avail.
I have also tried:
int index = blas.IAMAX(tempDistanceOnDevice, 0, 0, 1);
maxDistance[1] = tempDistance[index];

int index = blas.IAMAX(tempDistanceOnDevice, tempDistance.GetLength(0), 0, 1);
maxDistance[1] = tempDistance[index];
to see if the other parameters were necessary (even though your example code doesn't require them), and to see if N was the number of elements to search, but to no avail.

Any ideas?




And pedritolo1: thanks for your help. Yes, I realize there's a lot I can do to make it perform better, and if necessary I will, but first I'm just trying to get the proper output; I can work on performance once it works. I'm really hoping there's a built-in function, though, as those tend to be heavily optimized and work much better than anything I could write in my limited time for this project. And yes, I understand that this is a wrapper going directly to C - I actually prefer to write in C/CUDA directly, but thought I'd give this a try. Also, IAMAX is a method in Cudafy's implementation of BLAS. Try:
GPGPUBLAS blas = GPGPUBLAS.Create(gpu);
blas.IAMAX(Array);
I haven't figured out the parameters that yield the correct answer yet, but it does return an index into the array and should supposedly return the index of the max...
Jul 2, 2013 at 6:20 PM
You can best consult the CUBLAS docs or any BLAS docs for that matter. It does indeed return the index of max value, not the value.
Jul 2, 2013 at 6:27 PM
Yes, and I've been looking through the CUBLAS documentation, but I cannot figure out why these are not equivalent:
(Sorry, the earlier reference to trouble finding documentation was in regard to the IAMAX() method.)
maxDistance[0] = tempDistance.Max();

 int index = blas.IAMAX(tempDistanceOnDevice);
 maxDistance[1] = tempDistance[index];
Jul 2, 2013 at 6:41 PM
Take another look at the unit tests. You will see that BLAS is 1-indexed, not 0-indexed.
        [Test]
        public void TestMAXInVectorFirstHalf()
        {
            CreateRamp(_hostInput1);
            _gpu.CopyToDevice(_hostInput1, _devPtr1);
            int index = _blas.IAMAX(_devPtr1, ciN / 2);
            float max = _hostInput1.Take(ciN / 2).Max();
            Assert.AreEqual(max, _hostInput1[index - 1]); // 1-indexed
        }
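Applied to your snippet above, the fix should presumably just be the off-by-one adjustment:
// IAMAX follows BLAS's 1-based indexing, so subtract 1 on the C# side.
int index = blas.IAMAX(tempDistanceOnDevice);
maxDistance[1] = tempDistance[index - 1];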
Jul 2, 2013 at 6:47 PM
Wow... I should not have missed that.

Thank you for your help!