Example of parallel time series computation

Jan 23, 2012 at 7:22 PM

First off I'd like to say CUDAfy.NET is really cool!  I'm trying to build a Simple Moving Average calculation and I just can't seem to understand how to get CUDA (and thus CUDAfy.NET) to parallelize it for me.  The algorithm is very easy: given an array of floats (a) and an int (n), sum the last n elements of a and divide by n.  I know I can distribute the division and then sum the results, but will this require an array as temp space to hold the intermediate results (which then feeds the sum)?

I suppose what confuses me the most is that I only want a subset of the array processed.  Does this mean I should "pull up" some logic and only pass in the part of the array I want processed?  My main area of concern is that I'll have a few thousand floats in an ideally shared array, with many different permutations of calculations all running against it.  To respect memory bandwidth I'm trying to keep this lean.
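In plain sequential C#, what I'm after is roughly this (just a sketch of the algorithm described above; the method name is only illustrative):

public static float SimpleMovingAverage(float[] a, int n)
{
    float sum = 0f;
    for (int i = a.Length - n; i < a.Length; i++) // last n elements
        sum += a[i];
    return sum / n;
}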

Jan 23, 2012 at 8:27 PM
Edited Jan 23, 2012 at 8:28 PM

Dan,

I'm not sure exactly what you're doing, but maybe this launch code will help...

int maxThreads = 512; // Dependent on the GPU; if too many, reduce the number
int offset = 0; // Where you want to start in your item array
int loopCount = 5000; // How many items you want to process
int blockSize = Math.Max(maxThreads, (int)Math.Ceiling(loopCount / (float)maxThreads));

gpu.Launch(blockSize, Math.Min(maxThreads, loopCount)).Process(blockSize, offset, loopCount, dev_itemArray);

...

[Cudafy]
public static void Process(GThread thread, int blockSize, int offset, int count, float[] items)
{
    int i = thread.threadIdx.x * blockSize + thread.blockIdx.x + offset; // unique item index for this thread
    if (i - offset >= count)
        return; // exit, index is out of processing bounds
    // TODO: do your calculation here for i index in your array
    items[i] = items[i] + 7;
}

Basically this code takes the place of a for loop, and i comes out to be the index you want to process. If this doesn't help, maybe someone else can assist; I hope it helps though.

Jan 23, 2012 at 8:56 PM

I think this helps. I can divide each item in the array by the parameter value, which is part of the algorithm.  At the end I need to sum the values of everything in the array.  The code above gets me almost there; I just need the sum now.

Thanks!

Jan 23, 2012 at 9:03 PM

Glad that helps; I may be able to help you a little more. If you want another parameter for the result...

1. Make an array of size 1
2. Copy that array to the device so it's initialized to zero (using gpu.CopyToDevice())
3. Pass it to the process function
4. In the Process function use the function...

Cudafy.Atomics.AtomicFunctions.atomicAdd()

Like:

AtomicFunctions.atomicAdd(ref sum[0], amountToAdd);

5. Copy the sum out after launch, and you should have your result
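
Putting those steps together, a rough sketch might look like the following (it assumes the gpu object is already set up, the module is loaded as usual, and your card supports atomicAdd on floats; the names are just examples):

using Cudafy;
using Cudafy.Atomics;
using Cudafy.Host;
using Cudafy.Translator;

...

[Cudafy]
public static void SumItems(GThread thread, int count, float[] items, float[] sum)
{
    int i = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
    if (i >= count)
        return;
    // atomicAdd is an extension method on GThread from Cudafy.Atomics
    thread.atomicAdd(ref sum[0], items[i]);
}

// Host side:
float[] sum = new float[1];                      // step 1: result array of size 1 (zeroed)
float[] dev_sum = gpu.CopyToDevice(sum);         // step 2: copy it to the device
float[] dev_items = gpu.CopyToDevice(items);
int threads = 256;
int blocks = (items.Length + threads - 1) / threads;
gpu.Launch(blocks, threads).SumItems(items.Length, dev_items, dev_sum); // steps 3-4
gpu.CopyFromDevice(dev_sum, sum);                // step 5: sum[0] now holds the total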

Jan 23, 2012 at 9:55 PM

So far so good, but any idea about this?

error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (float *, float)

Jan 23, 2012 at 10:44 PM

I'm not sure off the top of my head, but are you doing "ref sum[0]" or "ref sum"? It should be with the [0].

Jan 23, 2012 at 10:47 PM

Also make sure you are passing two floats, and if that doesn't work try

 float dummyResult = AtomicFunctions.atomicAdd(ref sum[0], amountToAdd);

Jan 24, 2012 at 1:38 AM

Yes, I'm passing two floats: a ref to the element in the array in param 1 and the float I want to add in param 2.  I can't just call AtomicFunctions.atomicAdd directly because no overloads take two parameters (they take the GThread, which is really part of the extension method).  Could it be a compiler option, or maybe this GPU doesn't support atomicAdd?  I can try on a stronger GPU in a little bit.

Jan 24, 2012 at 1:41 AM

I'm not sure which CUDA version supports that function, but the default in Cudafy is 1.2. If you have a Fermi card, try compiling with 2.0 support by doing...

CudafyTranslator.Cudafy(ePlatform.Auto, eArchitecture.sm_20...
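
For example (a sketch; the exact overload may differ depending on how you pass your kernel types):

CudafyModule km = CudafyTranslator.Cudafy(ePlatform.Auto, eArchitecture.sm_20);
gpu.LoadModule(km);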

Jan 24, 2012 at 1:45 AM

I did find this thread, which I think is saying exactly what you just typed!  http://forums.nvidia.com/index.php?showtopic=194079

You are fast!

Jan 24, 2012 at 6:40 AM

Did everything work out for you? I just noticed a small bug... :)

int blockSize = Math.Max(maxThreads, (int)Math.Ceiling(loopCount / (float)maxThreads));

should be

int blockSize = Math.Min(maxThreads, (int)Math.Ceiling(loopCount / (float)maxThreads));

Jan 24, 2012 at 2:42 PM

It did, it was the architecture.  My laptop is only capable of Compute Capability 1.2, but my desktop is 2.1 (I believe atomicAdd on floats is a 2.0 capability).  Using that flag I can compile on the laptop; I just need to run on the desktop.  Again this demonstrates how AWESOME CUDAfy.NET is: total control over so much without falling back to C++.

Thanks for all your help and your quick responses!

Jan 24, 2012 at 10:15 PM

You're welcome. I would also keep in mind that there should be a way to eliminate the atomicAdd() call, or at least minimize it, to make the calculation even faster. I'm still quite new to this, but it has something to do with the threads; I think you don't have to call it for values handled by the same thread. But I'm not really sure.

Coordinator
Jan 25, 2012 at 6:40 AM

Do a Google search for CUDA sum reduction (or reduce).  There is also an example in the CUDA SDK.  By splitting the sum up into blocks, then summing the totals of those blocks, and so on, you make the operation as parallel as possible.
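
A minimal per-block reduction kernel along those lines might look like this (a sketch only, assuming 256 threads per block; the SDK versions are considerably more optimized):

[Cudafy]
public static void SumReduce(GThread thread, float[] items, int count, float[] blockSums)
{
    // Shared cache per block; the size must match the threads-per-block used at launch.
    float[] cache = thread.AllocateShared<float>("cache", 256);
    int tid = thread.threadIdx.x;
    int i = thread.blockIdx.x * thread.blockDim.x + tid;

    cache[tid] = 0f;
    if (i < count)
        cache[tid] = items[i];
    thread.SyncThreads();

    // Tree reduction within the block.
    for (int stride = thread.blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        thread.SyncThreads();
    }

    // One partial sum per block.
    if (tid == 0)
        blockSums[thread.blockIdx.x] = cache[0];
}

Each block writes one partial sum to blockSums; summing that small array on the host (or with a second launch over it) gives the final total without any atomicAdd.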