Different results on emulator and GPU

Feb 4, 2012 at 3:52 PM


This is probably a truncation issue or something simple I'm just not seeing, but here goes.  I'm calculating a Simple Moving Average.  Non-CUDA comparison code is this:

        public static float SimpleMovingAverageSerial(PriceInfo[] priceInfo, int[] period)
        {
            float average = 0;
            // Sum the closing prices of the last period[0] entries.
            for (int i = priceInfo.Length - 1; i > priceInfo.Length - period[0] - 1; i--)
                average += priceInfo[i].Close;
            return average / period[0];
        }

I have the following CUDA implementation thanks to this discussion group.

        public static void SimpleMovingAverage(GThread thread, PriceInfo[] priceInfo, int[] period, float[] result)
        {
            int i = thread.threadIdx.x * blockSize + thread.blockIdx.x + period[0];
            // Skip threads whose index falls outside the last period[0] elements.
            if (i < (priceInfo.Length - period[0]) || i >= priceInfo.Length)
                return;
            AtomicFunctions.atomicAdd(thread, ref result[0], priceInfo[i].Close / period[0]);
        }

Which I pull off the GPU with gpu.CopyFromDevice(d_result, h_result);

When I use GPGPU gpu = CudafyHost.GetDevice(eGPUType.Emulator); I get a result within .000000xx of my baseline (which for my case is certainly OK).

When I use GPGPU gpu = CudafyHost.GetDevice(eGPUType.Cuda, 0); I get something on the order of -1.9683E+38

For reference, I'm working with currencies, so my float values are on the order of x.xxxx.  I am also using a Compute 2.1 card and the sm_20 architecture.

Any idea where I'm going astray?

Feb 6, 2012 at 3:58 AM

OK, a little more information.  I'm trying to calculate a 50-day SMA (Simple Moving Average) on an array of 200 values.  I'm using the last 50 values in the array.  When I run in emulator mode everything is fine (although understandably slow).  When I drop the atomicAdd and return just the division part of the calculation in an array of 50 floats, the first 27 values are what I expect and the rest are all something different.  Some are huge, some are small, some are NaN.  Does anything like this sound familiar?

Feb 6, 2012 at 8:43 AM

Numbers with values like that typically correspond to uninitialized memory.  Can you supply the code for the PriceInfo struct and where you make the copies and the launch? Also when using atomic functions the preferred syntax is to use the extensions on GThread: thread.atomicAdd(ref ....).

Feb 6, 2012 at 4:04 PM
Edited Feb 7, 2012 at 2:42 AM

I had been using thread.atomicAdd, but at this point I've actually taken out the add altogether just for testing.

public struct PriceInfo
{
    public PriceInfo(float bid, float ask, float last, float close, int volume, long time)
    {
        Bid = bid;
        Ask = ask;
        Last = last;
        Close = close;
        Volume = volume;
        Time = time;
    }

    public float Bid;
    public float Ask;
    public float Last;
    public float Close;
    public int Volume;
    public long Time;
}

The rest of the code resembles this:


PriceInfo[] priceInfo = LoadPriceInfo(200);

CudafyModule km = CudafyTranslator.Cudafy(ePlatform.Auto, eArchitecture.sm_20, typeof(PriceInfo), typeof(TechnicalIndicators));
GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target, 0);
//GPGPU gpu = CudafyHost.GetDevice(eGPUType.Emulator);

float[] hostResult = new float[1];
PriceInfo[] pi = gpu.Allocate<PriceInfo>(priceInfo);
float[] dev_result = gpu.Allocate<float>(hostResult);
int[] period = new int[] { 50 };
int[] dev_peroid = gpu.Allocate<int>(period);

gpu.CopyToDevice(priceInfo, pi);
gpu.CopyToDevice(period, dev_peroid);

gpu.Launch(blockSize, Math.Min(maxThreads, period[0]), "SimpleMovingAverage", pi, dev_peroid, dev_result);

gpu.CopyFromDevice(dev_result, hostResult);


Feb 11, 2012 at 7:31 PM

It's an indexing problem for the memory.  This turns out to be the issue:

int i = thread.threadIdx.x * blockSize + thread.blockIdx.x + period[0];

I'm not sure why the emulator behaves differently, but I know this is the issue.  I can fix that issue and can access memory as I expect.  Now I only seem to be getting an issue where every value in my PriceInfo array of structures seems to be the same.  I'm going to try this with a simple float array and see if I can get past it.  Array of Structures (AoS) is less preferable to Structure of Arrays (SoA) from what I've read anyway, so perhaps I will change my libraries around to deal with this.
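
A minimal sketch of the indexing point, in plain C# so it runs without a GPU. It assumes the conventional CUDA global index, blockIdx.x * blockDim.x + threadIdx.x, offset to the start of the averaging window. The thread does not show the actual fix, so the IndexDemo class, its method names, and the launch shape (1 block of 50 threads) are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class IndexDemo
{
    // Conventional CUDA global index: unique for every (block, thread) pair.
    public static int GlobalIndex(int threadIdx, int blockIdx, int blockDim)
    {
        return blockIdx * blockDim + threadIdx;
    }

    // Index into the price array for a thread, starting at the first
    // element of the window (the last `period` entries of the array).
    public static int WindowIndex(int threadIdx, int blockIdx, int blockDim, int length, int period)
    {
        return (length - period) + GlobalIndex(threadIdx, blockIdx, blockDim);
    }

    // Enumerate every index a launch of gridDim blocks x blockDim threads would touch.
    public static List<int> Enumerate(int gridDim, int blockDim, int length, int period)
    {
        var indices = new List<int>();
        for (int b = 0; b < gridDim; b++)
            for (int t = 0; t < blockDim; t++)
                indices.Add(WindowIndex(t, b, blockDim, length, period));
        return indices;
    }

    public static void Main()
    {
        // Assumed shape: 200 prices, 50-day window, 1 block of 50 threads.
        List<int> indices = Enumerate(1, 50, 200, 50);
        // prints: first=150, last=199, distinct=50
        Console.WriteLine($"first={indices.First()}, last={indices.Last()}, distinct={indices.Distinct().Count()}");
    }
}
```

With more threads than window elements, a kernel would still need a bounds check such as i >= priceInfo.Length before reading.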

I've been looking at the .cu output from Cudafy and I have to say it's really quite good!  It even cleans up some of my bad code!  I'll report back soon.
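
For the "simple float array" idea mentioned in this post, one possible approach is to pull the Close field out of the struct array on the host before copying to the device. This is only a sketch: the PriceColumns class and ExtractCloses method are hypothetical names, the sample prices are made up, and the struct here is a trimmed copy of the one posted earlier.

```csharp
using System;

// Trimmed copy of the PriceInfo struct from the thread (constructor omitted).
public struct PriceInfo
{
    public float Bid;
    public float Ask;
    public float Last;
    public float Close;
    public int Volume;
    public long Time;
}

public static class PriceColumns
{
    // Copy one field of every struct into a flat array: a single
    // "column" of a Structure of Arrays layout.
    public static float[] ExtractCloses(PriceInfo[] prices)
    {
        var closes = new float[prices.Length];
        for (int i = 0; i < prices.Length; i++)
            closes[i] = prices[i].Close;
        return closes;
    }

    public static void Main()
    {
        // Made-up sample values, for illustration only.
        var prices = new PriceInfo[3];
        prices[0].Close = 1.3101f;
        prices[1].Close = 1.3110f;
        prices[2].Close = 1.3095f;

        float[] closes = ExtractCloses(prices);
        Console.WriteLine(closes.Length);
    }
}
```

The resulting float[] could then be passed to a kernel in place of the PriceInfo[] argument, which also makes it easier to check whether the struct layout is involved in the problem.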

Feb 11, 2012 at 7:45 PM

Resolved.  User error!  Too much time in managed code not realizing that atomic add will require initialization to zero!  Wow, I've been away from C++ for too long.  I was just about to jump back, but I think I'll stick with C# a little more now that CUDAFY is doing perfect translation for me!  Thanks!

Feb 12, 2012 at 8:34 AM

Pleased to hear it!  I'm no stranger to such things myself.  You can initialize the array to zero on the device using the Set method of the GPGPU instance.
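
A short sketch of that suggestion, assuming the CUDAfy.NET library is referenced and a GPGPU instance has already been obtained as in the host code earlier in the thread. The ResultBuffer class and AllocateZeroed method are hypothetical names; Allocate and Set are used here on the assumption that Allocate reserves device memory without defining its contents and Set zeroes it.

```csharp
using Cudafy;
using Cudafy.Host;

public static class ResultBuffer
{
    // Allocate a one-element accumulator on the device and zero it,
    // so that a later atomicAdd starts summing from 0.
    public static float[] AllocateZeroed(GPGPU gpu)
    {
        // Device memory is not guaranteed to be zero after allocation.
        float[] dev_result = gpu.Allocate<float>(1);

        // Zero the device array before launching the kernel.
        gpu.Set(dev_result);

        return dev_result;
    }
}
```

An alternative that matches the calls already used in the thread would be to copy the zeroed host array across before the launch, for example gpu.CopyToDevice(hostResult, dev_result).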