
Hi
Looking for a solution:
I’m running my gpu function, with a fixed data set, no changing data, honest!
I run the gpu function in a loop several times:
Sometimes I get the full data set returned (1367 elements in my array), great, with the result I expect.
Other times I get different results in the first 1024 elements, from dev_c[0] to dev_c[1024],
but the remainder (1025-1366) are correct and what I expect.
I’m calling gpu function with:
_gpu.Launch(1024, 1024, "gpufuncnew", dev_w2, dev_c, dev_ds, dev_ia, dev_stk);
And data back with:
_gpu.CopyFromDevice(dev_c, gpu_results);
The 1024 can’t be coincidence? What am I overwriting?
Between each iteration I clear mem with:
_gpu.Free(dev_w2);
_gpu.Free(dev_ia);
_gpu.Free(dev_stk);
_gpu.Free(dev_c);
public const int N = 1367;
GPU func:
public static void gpufuncnew(GThread thread, Single[] weights, Single[] c, Single[,] ds, int[] ia, Single[] stk )
{
int tid = thread.threadIdx.x + thread.blockIdx.x * thread.blockDim.x;
while (tid < N )
{
… too big to copy the whole lot
Any simple answers please?
Thanks in advance
Carl



Most likely it's a race condition.
Take the time to build a trimmed-down example that still has the problem, and then post it for us to look at.
And maybe this will help:
If you’re getting inconsistent numerical results that change on every run, most likely you are accessing uninitialized memory somewhere in your computations. Another source can be not calling SyncThreads() after manipulating shared memory where more than one thread contributes to a common shared result. Try placing SyncThreads() everywhere and, if that fixes your problem, selectively remove them. This has fixed some vexing problems I had where different hardware and architectures would produce inconsistent results.



Many Thanks pedritolo :)
Here is the full code. I tried SyncThreads and GC.Collect() calls, but still get inconsistent results.
I hope it's understandable, please excuse my simple coding...
public const int N = 1248;
CudafyModule km = CudafyTranslator.Cudafy();
GPGPU _gpu = CudafyHost.GetDevice(CudafyModes.Target);
_gpu.LoadModule(km);
// This data looks really simple, and it is!
// I start with a population of chromosomes that are all functions in RPN, like "$12 sqrt ln".
// I translate each chromosome into an int array, swapping sin, cos, exp etc. for int values 1001, 1002, 1003 etc., so I can get them to the GPU and they can be evaluated,
// as the GPU does not seem to like strings!
// Eventually I want to pass a data set of 1000, maybe 10000, rows to the GPU and have it evaluate the RPN expression for each row of data in the set,
// returning a result set, which I can then average.
//
// What I'm trying to test here is the GPU function "gpufuncnew" running this same calculation 1248 times to simulate a data set.
// So I'm expecting the result array to be filled with the same answer, 1.24245334, for each row.
// Sometimes I get this, other times I get 0.3, 0.7 and NaN returned!
//
int[] data_array = new int[3] {12, 1009, 1008 };
Single[] stk = new Single[12];
Single[] dev_stk = _gpu.Allocate<Single>(stk);
dev_stk = _gpu.CopyToDevice(stk);
int[] dev_ia = _gpu.Allocate<int>(data_array);
dev_ia = _gpu.CopyToDevice(data_array);
Single[] result = new Single[1248];
Single[] dev_c = _gpu.Allocate<Single>(result);
dev_c = _gpu.CopyToDevice(result);
Single[] gpu_results = new Single[1248];
_gpu.Launch(1024, 1024, "gpufuncnew", dev_c, dev_ia, dev_stk);
_gpu.Synchronize();
_gpu.CopyFromDevice(dev_c, gpu_results);
//_gpu.FreeAll();
GC.Collect();
var AVG1 = gpu_results.Average();
[Cudafy]
public static void gpufuncnew(GThread thread, Single[] c, int[] ia, Single[] stk )
{
int tid = thread.threadIdx.x + thread.blockIdx.x * thread.blockDim.x;
//
// this is my attempt at a simple stack emulator, as a "stack" does not seem to work in Cudafy or cuda
// even had to replace indexof function with a loop as this was also not supported
// works on cpu fine consistent results every time
while (tid < N)
{
Single AccResult = 0f;
int stkPos = 0;
for (int x = 0; x < ia.Length; x++) // for each element on array rpn expression
{
if (ia[x] < 1000) // value add to stack
{
int has = 0; // Array.IndexOf(stk, 0);
for (int i = 0; i < stk.Length; i++) { if (stk[i] == 0) { has = i; break; } }
stk[has] = ia[x]; // replace this with lookup from real data set later after proof that this GPU func actually works consistently
AccResult = stk[has];
stkPos++;
}
else
{
//var has = Array.IndexOf(stk, 0);
if (ia[x] == 1001) // its an +
{
int has = 0; for (int i = 0; i < stk.Length; i++) { if (stk[i] == 0) { has = i; break; } } // get the first zero element of stack
AccResult = (Single)stk[has - 1] + stk[has - 2];
stk[has - 1] = 0;
stk[has - 2] = AccResult;
}
else
{
if (ia[x] == 1002) // its an -
{
int has = 0; for (int i = 0; i < stk.Length; i++) { if (stk[i] == 0) { has = i; break; } }
AccResult = (Single)stk[has - 2] - stk[has - 1];
stk[has - 1] = 0;
stk[has - 2] = AccResult;
}
else
{
if (ia[x] == 1003) // its an *
{
int has = 0; for (int i = 0; i < stk.Length; i++) { if (stk[i] == 0) { has = i; break; } }
AccResult = (Single)stk[has - 1] * stk[has - 2];
stk[has - 1] = 0;
stk[has - 2] = AccResult;
}
else
{
if (ia[x] == 1004) // its an /
{
int has = 0; for (int i = 0; i < stk.Length; i++) { if (stk[i] == 0) { has = i; break; } }
AccResult = (Single)stk[has - 2] / stk[has - 1];
stk[has - 1] = 0;
stk[has - 2] = AccResult;
}
else
{
if (ia[x] == 1005) // its an sin
{
int has = 0; for (int i = 0; i < stk.Length; i++) { if (stk[i] == 0) { has = i; break; } }
AccResult = (Single)Math.Sin(stk[has - 1]); //// convert from radians to degrees?
stk[has - 1] = AccResult;
}
else
{
if (ia[x] == 1006)
{
int has = 0; for (int i = 0; i < stk.Length; i++) { if (stk[i] == 0) { has = i; break; } }
AccResult = (Single)Math.Cos(stk[has - 1]); //// convert from radians to degrees?
double codd = Math.Cos(stk[has - 1]);
stk[has - 1] = AccResult;
}
else
{
if (ia[x] == 1007)
{
int has = 0; for (int i = 0; i < stk.Length; i++) { if (stk[i] == 0) { has = i; break; } }
AccResult = (Single)Math.Exp(stk[has - 1]);
stk[has - 1] = AccResult;
}
else
{
if (ia[x] == 1008)
{
int has = 0; for (int i = 0; i < stk.Length; i++) { if (stk[i] == 0) { has = i; break; } }
AccResult = (Single)Math.Log(stk[has - 1]);
stk[has - 1] = AccResult;
}
else
{
if (ia[x] == 1009)
{
int has = 0; for (int i = 0; i < stk.Length; i++)
{
if (stk[i] == 0)
{
has = i; break;
}
}
AccResult = (Single)Math.Sqrt(stk[has - 1]);
stk[has - 1] = AccResult;
}
}
}
}
}
}
}
}
}
}
}
c[tid] = AccResult;
thread.SyncThreads();
tid += thread.blockDim.x * thread.gridDim.x;
}
}
Regards Carl
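
A minimal sketch of one way to remove shared state from the evaluator, assuming the race suggested earlier in the thread comes from every thread reading and writing the same 12-element stk buffer. Each logical thread gets its own slice of a larger buffer and keeps an explicit stack pointer instead of searching for the first zero (a search that also misfires whenever a legitimate value on the stack is 0). The names RpnSketch, Evaluate and EvaluateAll are made up for illustration. This is plain C# running on the CPU that only mimics the grid-stride indexing; it is not a tested Cudafy kernel.

```csharp
using System;

public static class RpnSketch
{
    public const int StackDepth = 12; // same depth as stk in the post above

    // Evaluates one RPN program using only the stack slice that belongs to
    // 'threadId'. Assumes a well-formed program (no underflow or overflow).
    public static float Evaluate(int[] program, float[] stacks, int threadId)
    {
        int basePos = threadId * StackDepth; // start of this thread's slice
        int sp = 0;                          // number of values on the stack

        for (int x = 0; x < program.Length; x++)
        {
            int op = program[x];
            if (op < 1000)
            {
                stacks[basePos + sp] = op;   // operand: push
                sp++;
            }
            else if (op >= 1001 && op <= 1004)
            {
                // Binary operator: pop two values, push one result.
                float right = stacks[basePos + sp - 1];
                float left = stacks[basePos + sp - 2];
                float r;
                if (op == 1001) r = left + right;
                else if (op == 1002) r = left - right;
                else if (op == 1003) r = left * right;
                else r = left / right;
                sp--;
                stacks[basePos + sp - 1] = r;
            }
            else
            {
                // Unary function: replace the top of the stack in place.
                float v = stacks[basePos + sp - 1];
                if (op == 1005) v = (float)Math.Sin(v);
                else if (op == 1006) v = (float)Math.Cos(v);
                else if (op == 1007) v = (float)Math.Exp(v);
                else if (op == 1008) v = (float)Math.Log(v);
                else if (op == 1009) v = (float)Math.Sqrt(v);
                stacks[basePos + sp - 1] = v;
            }
        }
        return sp > 0 ? stacks[basePos + sp - 1] : 0f;
    }

    // CPU stand-in for the launch: 'threadCount' logical threads walk the
    // output with the same stride pattern as the kernel's while loop.
    public static float[] EvaluateAll(int[] program, int n, int threadCount)
    {
        float[] stacks = new float[threadCount * StackDepth];
        float[] c = new float[n];
        for (int id = 0; id < threadCount; id++)
            for (int tid = id; tid < n; tid += threadCount)
                c[tid] = Evaluate(program, stacks, id);
        return c;
    }
}
```

Calling RpnSketch.EvaluateAll(new int[] { 12, 1009, 1008 }, 1248, 256) fills every element with ln(sqrt(12)), roughly 1.2424533, which is the value the post expects in each row. In a kernel, the slice index would be the thread's fixed global id (threadIdx.x + blockIdx.x * blockDim.x), not the moving tid, and the stack buffer would need one slice per launched thread.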



Morning hope everyone had a great weekend!
OK, woke early today and back at my problem with a clearer head.
By process of elimination, I started to comment out sections of code.
When I replaced this line:
AccResult = (Single)Math.Sqrt(stk[has - 1]);
with:
AccResult = 2.0f; // (Single)Math.Sqrt(stk[has - 1]);
I got consistent results.
Given the data is constant during my testing, why does GPU/CUDA/Cudafy not like the Math.Sqrt function?





Mar 19, 2013 at 8:24 PM
Edited Mar 19, 2013 at 8:32 PM

Have you tried running it on the CPU under emulation mode (I mean eGPUType.Emulator), where things should work (mostly) as expected? I know 1024 threads is huge and the CPU would collapse under such strain, so you'll need to split up your problem into items with a much smaller number of threads, for emulation and error detection purposes.
Besides, only the very latest NVIDIA cards run properly under such a huge load of blocks/threads, and even then not always seamlessly. Personally, I've run into inconsistent results when using (th>=512, bl=1024) sizes under debug mode (CudafyTranslator.GenerateDebug = true). This was on a GTX 680.
Nevertheless, your example is still huge and really hard to read. God, all the nested ifs! Are you very sure you can't keep trimming it down? There will come a time when you remove something and suddenly get correct results. That will be your bug right there.
Also, if the sqrt is a source of rounding problems then you should temporarily remove it and compare the result with some CPU calculation which also had its sqrt removed.
Hope this helps
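
To put rough numbers on the thread counts discussed above: assuming the first two arguments of Launch are grid size and block size, Launch(1024, 1024, ...) requests 1024 x 1024 = 1,048,576 threads for 1248 outputs. The sketch below computes a smaller grid and walks the kernel's grid-stride loop on the CPU to check that every output index is still written exactly once. LaunchConfig, BlocksFor and CountWrites are hypothetical helper names, not part of Cudafy.

```csharp
using System;

public static class LaunchConfig
{
    // Smallest number of blocks such that blocks * threadsPerBlock >= n.
    public static int BlocksFor(int n, int threadsPerBlock)
    {
        return (n + threadsPerBlock - 1) / threadsPerBlock;
    }

    // CPU walk of the kernel's grid-stride loop: counts how many times each
    // output index would be written for a given grid and block size.
    public static int[] CountWrites(int n, int gridSize, int blockSize)
    {
        int[] writes = new int[n];
        for (int block = 0; block < gridSize; block++)
        {
            for (int t = 0; t < blockSize; t++)
            {
                int tid = t + block * blockSize;   // same formula as the kernel
                while (tid < n)
                {
                    writes[tid]++;
                    tid += blockSize * gridSize;   // same stride as the kernel
                }
            }
        }
        return writes;
    }
}
```

For N = 1248 and 256 threads per block, LaunchConfig.BlocksFor(1248, 256) gives 5 blocks (1280 threads), and CountWrites reports one write per index. A grid that small is also far easier to step through under emulation.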



Yes, you could try: AccResult = stk[has - 1]; // (Single)Math.Sqrt(stk[has - 1]);
I would also advise using GMath.Sqrt instead of Math.Sqrt. This can improve performance by using the native CUDA single-precision floating-point overload of sqrt. You also will not need the cast. I do not think that this is the cause of your issue, though.
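
As a rough illustration of the GMath suggestion, here is a small kernel that only takes square roots. It assumes the Cudafy.NET assembly is referenced and that GMath.Sqrt takes and returns a float, as described above. GMathSketch and sqrtKernel are hypothetical names, and the snippet has not been run on a device.

```csharp
using Cudafy;

public static class GMathSketch
{
    // Hypothetical kernel: writes sqrt(input[i]) to output[i] with the same
    // grid-stride loop used in the code posted earlier in the thread.
    [Cudafy]
    public static void sqrtKernel(GThread thread, float[] input, float[] output)
    {
        int tid = thread.threadIdx.x + thread.blockIdx.x * thread.blockDim.x;
        while (tid < input.Length)
        {
            // float in, float out: no (Single) cast and no trip through double
            output[tid] = GMath.Sqrt(input[tid]);
            tid += thread.blockDim.x * thread.gridDim.x;
        }
    }
}
```

Each thread here touches only its own input and output elements, so the kernel needs no SyncThreads() call.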

