Can't get simple algorithm to work? Been at it 24 hours no sleep! :-\

Sep 17, 2014 at 5:32 PM
This is the code:
    [Cudafy]
    public unsafe static void Add(GThread thread, byte[] sourceBuffer, byte[] destBuffer)
    {
        int threadId = thread.blockIdx.x;

        int loopSize = (TotalByteDataToBeMixed / NumCudaThreads);
        int startIndex = threadId * loopSize;
        int endIndex = startIndex + loopSize;
        int length = endIndex - startIndex;

        fixed (byte* pDestBuffer = &destBuffer[startIndex])
        {
            fixed (byte* pSourceBuffer = &sourceBuffer[startIndex])
            {
                float* pfDestBuffer = (float*)pDestBuffer;
                float* pfReadBuffer = (float*)pSourceBuffer;
                int samplesRead = length / 4;
                for (int n = 0; n < samplesRead; n++)
                {
                    pfDestBuffer[n] += pSourceBuffer[n];
                }
            }
        }
    }
I have test code that checks the output of this function, and the result is never correct.

It is only correct if I change the line which reads

pfDestBuffer[n] += pSourceBuffer[n];

to

pfDestBuffer[n] = pSourceBuffer[n];

The reason that version is correct is that on the first run destBuffer is filled with zeros. I validated this by using pfDestBuffer[n] = pDestBuffer[n]; and noting that it always came back full of zeros. I also know that I get values back with pfDestBuffer[n] += pSourceBuffer[n]; but they are not the correct ones.

So here is what it compiles to:
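One detail worth checking in the C# listing: the loop indexes pSourceBuffer, which is the byte pointer, rather than pfReadBuffer, the float pointer. The generated CUDA further down reads through the float pointer, so this may only be a slip in the post, but if the real source looks like the listing, the two reads are not the same thing. A minimal sketch in plain C of the difference (the function names are made up for illustration, and this is not CUDAfy code):

```c
#include <stddef.h>
#include <string.h>

/* Reads the n-th float sample stored in a raw byte buffer.
   This is what indexing the float pointer (pfReadBuffer[n]) does. */
float read_sample(const unsigned char *buffer, size_t n)
{
    float value;
    memcpy(&value, buffer + n * sizeof(float), sizeof value);
    return value;
}

/* Reads the n-th raw byte and converts it to float.
   This is what indexing the byte pointer (pSourceBuffer[n]) does. */
float read_byte_as_float(const unsigned char *buffer, size_t n)
{
    return (float)buffer[n];
}
```

For a buffer holding the floats 1.5 and 2.5, read_sample(buffer, 1) gives 2.5, while read_byte_as_float(buffer, 1) gives the value of a single byte of the first float.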

    // GpuBlaster.Program
    extern "C" __global__ void Add(unsigned char* sourceBuffer, int sourceBufferLen0, unsigned char* destBuffer, int destBufferLen0)
    {
        int x = blockIdx.x;
        int num = 37500;
        int num2 = x * num;
        int num3 = num2 + num;
        int num4 = num3 - num2;
        unsigned char* ptr = &destBuffer[(num2)];

        {
            unsigned char* ptr2 = &sourceBuffer[(num2)];

            {
                float* ptr3 = (float*)ptr;
                float* ptr4 = (float*)ptr2;
                int num5 = num4 / 4;
                for (int i = 0; i < num5; i++)
                {
                    ptr3[(i)] += *(ptr4 + i);
                }
            }
        }
    }
To me that seems correct. I took the same code, changed unsigned char to byte, ran it through C#, and the result was correct. But it isn't correct on the GPU?

I'm at a loss here: if I can't get a simple algorithm like that to work as intended, how can I proceed to more advanced ideas?
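For reference, the generated kernel gives each block index a slice of 37500 bytes, which is 9375 floats, and it uses only blockIdx.x. That means it is written for one thread per block. The launch configuration is not shown here, so this is only a guess, but if the kernel were launched with more than one thread per block, every thread in a block would run the same += over the same elements, while a plain = would be unaffected. The sketch below emulates that on the host in plain C (add_block and add_launch are hypothetical names, and the repeats run in sequence here, whereas on a GPU they would race):

```c
#include <stddef.h>
#include <string.h>

/* Host-side emulation of the generated kernel for one block index.
   bytes_per_block corresponds to `num` (37500 in the post). */
void add_block(int block, int bytes_per_block,
               const unsigned char *source, unsigned char *dest)
{
    size_t start = (size_t)block * (size_t)bytes_per_block;
    int samples = bytes_per_block / 4;              /* 4 bytes per float */
    for (int i = 0; i < samples; i++) {
        size_t offset = start + (size_t)i * 4;
        float s, d;
        memcpy(&s, source + offset, sizeof s);      /* read source sample */
        memcpy(&d, dest + offset, sizeof d);        /* read dest sample */
        d += s;
        memcpy(dest + offset, &d, sizeof d);        /* write the sum back */
    }
}

/* Runs every block, repeating each one threads_per_block times to mimic
   several threads sharing the same blockIdx.x. */
void add_launch(int blocks, int threads_per_block, int bytes_per_block,
                const unsigned char *source, unsigned char *dest)
{
    for (int b = 0; b < blocks; b++)
        for (int t = 0; t < threads_per_block; t++)
            add_block(b, bytes_per_block, source, dest);
}
```

With threads_per_block set to 1 the destination ends up as dest + source. With 2, this sequential emulation adds the source twice, and on real hardware the outcome of the unsynchronised += would not be predictable. A host routine like this can also serve as the reference result to compare the GPU output against.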

Any help much appreciated...

Thanks
Sep 18, 2014 at 6:41 AM
Well, I changed it to use a float array instead, and this cured it. Now the question is why the exact same code runs twice as fast on my CPU as on the GPU. It's a GTX 760.

I read somewhere about not updating an array in situ, but if I create a different array to perform the calculations on, how do I get the result back into the in-situ array? I have two memory blocks: one is the source and one is the destination. The destination always stays on the GPU, and each call passes in a new source to add to the destination using the formula above. When I have finished computing for multiple sources, I will fetch the destination from the device back to the host.
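On the in-place question, accumulating directly into the destination is safe as long as each element is written by exactly one thread, so no second array should be needed for that alone. On speed, two things may be worth looking at, though neither has been measured here: copying each new source to the device can cost more than the add itself, and a layout where one thread walks a long contiguous slice means neighbouring threads do not touch neighbouring floats. A common alternative is a strided loop in which thread k handles elements k, k + total, k + 2 * total and so on. The sketch below emulates that index mapping on the host in plain C; on the GPU the first index would come from blockIdx.x * blockDim.x + threadIdx.x and the stride from gridDim.x * blockDim.x. The function names are hypothetical.

```c
#include <stddef.h>

/* Work done by one emulated thread: visit first, first + stride, ... */
void add_strided(size_t first, size_t stride,
                 const float *source, float *dest, size_t count)
{
    for (size_t i = first; i < count; i += stride)
        dest[i] += source[i];   /* each index is touched by one thread only */
}

/* Emulates a launch of blocks x threads_per_block threads on the host. */
void add_grid(size_t blocks, size_t threads_per_block,
              const float *source, float *dest, size_t count)
{
    size_t total = blocks * threads_per_block;
    for (size_t b = 0; b < blocks; b++)
        for (size_t t = 0; t < threads_per_block; t++)
            add_strided(b * threads_per_block + t, total, source, dest, count);
}
```

Calling add_grid once per source buffer accumulates every source into the same destination, which matches the plan of keeping the destination on the device and fetching it only at the end.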