
running kernels in sequence, global data and passing arrays!

Jan 10, 2012 at 4:33 PM
Edited Jan 11, 2012 at 9:05 AM

Hello there,

I'm using Cudafy to speed up a flow model (for flood mapping etc.) that I've been using and developing for the last 10 years or so. I've managed to parallelise the code nicely to run on multi-core machines, so the next step is the GPU!

I've managed to get Cudafy working, and it seems to be giving correct results with a c. 4x speed-up. But I know the way I am doing things below is really clunky.

Basically, I have 4 or 5 2D arrays (e.g. 100 by 100 cells, but they can be up to 1000 by 1000) that I need to manipulate every iteration according to values in neighbouring cells and values in the other arrays. There are 3 sets of calculations that have to be done one after another, so I need to go through all the cells for calc1 (calcflow()), then all the cells for calc2, then for calc3, and so on. I effectively have a kernel each for calc1, calc2 and calc3, and call them one after another.

I know what's slowing my code is shuffling these arrays to and fro from the GPU, so what I really want to do (and am trying to do below) is transfer all the array values to the GPU, repeat calc1, calc2, calc3 (in sequence) say 100 times, then pull the results down (for display etc.), then do the same again, and so on.

How should I do this (as I'm pretty sure I'm doing it wrong here)? Should I set these arrays as global variables on the GPU and do all the calcs that way? The way I seem to have it working below involves moving all the values across at the start, then pulling the values down and transferring them back and forth from the kernels each time...

Sorry for the dumb question - I'm really pleased with how far I've got so far, and suspect I can get some pretty impressive speed-ups from what I have. Sorry for my horrible coding too! It's a mixture of trial and error and some informed guesswork!

All the best,

Tom

 

(main code loop below and kernels below that)

                ymax = 100;
                xmax = 100;

                int[,] dev_down_scan = _gpu.CopyToDevice(down_scan);
                double[,] dev_qx = _gpu.CopyToDevice(qx);
                double[,] dev_qy = _gpu.CopyToDevice(qy);
                double[,] dev_elev = _gpu.CopyToDevice(elev);
                double[,] dev_water_depth = _gpu.CopyToDevice(water_depth);

                // here is where I'm trying to loop the code within the GPU...

                for (int Q = 1; Q <= 50; Q++)
                {

                    local_time_factor = courant_number * (DX / Math.Sqrt(9.8 * (maxdepth)));

                    //qroute(local_time_factor);
                    //depth_update(local_time_factor);
                    //reach_water_and_sediment_input(local_time_factor);

                    _gpu.Launch(ymax / 4, 4).calcflow(DX, local_time_factor, dev_down_scan, ymax, xmax, dev_qx, dev_qy,
                         dev_elev, dev_water_depth);
                    _gpu.Synchronize();
                        //copy the array back from the GPU to the CPU
                    _gpu.CopyFromDevice(dev_qx, qx);
                    _gpu.CopyFromDevice(dev_qy, qy);

                    // depth update
                    ///
                    _gpu.Launch(ymax / 4, 4).calcdepths(DX, local_time_factor, dev_down_scan, ymax, dev_qx, dev_qy,
                         dev_water_depth);
                    _gpu.Synchronize();
                    // copy the array back from the GPU to the CPU
                    _gpu.CopyFromDevice(dev_water_depth, water_depth);

                    // scan area
                    _gpu.Launch(ymax / 4, 4).scanarea(xmax, ymax, dev_down_scan, dev_water_depth);
                    _gpu.Synchronize();
                    _gpu.CopyFromDevice(dev_down_scan, down_scan);

                    counter++;
                    cycle += local_time_factor / 60;

                    scan_area();

                    temptot = 0;
                    for (y = 1; y <= ymax; y++)
                    {
                        if (water_depth[xmax, y] > 0)
                        {
                            temptot += water_depth[xmax, y] * DX * DX / local_time_factor;
                            // and zero water depths at RH end
                            water_depth[xmax, y] = 0;
                        }
                    }
                    waterOut = temptot;
                }

                _gpu.FreeAll();






        [Cudafy]
        public static void calcflow(GThread thread,  double DX, double local_time_factor, int[,] down_scan, int ymax, int xmax, double[,] qx, double[,] qy,
             double[,] elev, double[,] water_depth)
        {
            // Get the id of the thread. The kernel runs once per row in parallel, so we need
            // to know which row (y) this thread is dealing with.
            double edgeslope = 0.001;

            int y = thread.blockDim.x * thread.blockIdx.x + thread.threadIdx.x;

            // Only process rows inside the grid, so we never read beyond the edge of the arrays
            if(y>1&&y<ymax)
            {
                int inc = 1;
                while (down_scan[y, inc] > 0)
                //for (int x = 1; x <= xmax; x++)
                {
                    int x = down_scan[y, inc];
                    inc++;

                    if (elev[x, y] > -9999) // to stop moving water in to -9999's on elev
                    {
                        // routing in x direction
                        if ((water_depth[x, y] > 0 || water_depth[x - 1, y] > 0) && elev[x - 1, y] > -9999)  // need to check water and not -9999 on elev
                        {
                            double hflow = Math.Max(elev[x, y] + water_depth[x, y], elev[x - 1, y] + water_depth[x - 1, y]) -
                                            Math.Max(elev[x - 1, y], elev[x, y]);

                            if (hflow > 0.001)
                            {
                                double tempslope = (((elev[x - 1, y] + water_depth[x - 1, y])) -
                                        (elev[x, y] + water_depth[x, y])) / DX;
                                if (tempslope > 0.1) tempslope = 0.1;
                                if (tempslope < -0.1) tempslope = -0.1;
                                if (x == xmax) tempslope = edgeslope;

                                qx[x, y] = ((qx[x, y] - (9.8 * hflow * local_time_factor * tempslope)) /
                                        (1 + 9.8 * hflow * local_time_factor * (0.03 * 0.03) * Math.Abs(qx[x, y]) /
                                        Math.Pow(hflow, (10 / 3.0))));

                            }
                            else
                            {
                                qx[x, y] = 0;

                            }
                        }

                        //routing in the y direction
                        if ((water_depth[x, y] > 0 || water_depth[x, y - 1] > 0) && elev[x, y - 1] > -9999)
                        {
                            double hflow = Math.Max(elev[x, y] + water_depth[x, y], elev[x, y - 1] + water_depth[x, y - 1]) -
                                            Math.Max(elev[x, y], elev[x, y - 1]);

                            if (hflow > 0.001)
                            {
                                double tempslope = (((elev[x, y - 1] + water_depth[x, y - 1])) -
                                    (elev[x, y] + water_depth[x, y])) / DX;
                                if (tempslope > 0.1) tempslope = 0.1;
                                if (tempslope < -0.1) tempslope = -0.1;
                                if (x == xmax) tempslope = edgeslope;

                                qy[x, y] = ((qy[x, y] - (9.8 * hflow * local_time_factor * tempslope)) /
                                        (1 + 9.8 * hflow * local_time_factor * (0.03 * 0.03) * Math.Abs(qy[x, y]) /
                                        Math.Pow(hflow, (10 / 3.0))));


                            }
                            else
                            {
                                qy[x, y] = 0;
                            }
                        }

                    }
                }
            }
        }

        [Cudafy]
        public static void scanarea(GThread thread, int xmax, int ymax, int[,] down_scan, double[,] water_depth)
        {
            int y = thread.blockDim.x * thread.blockIdx.x + thread.threadIdx.x;
            // Only process rows inside the grid, so we never read beyond the edge of the arrays
            if (y > 1 && y < ymax)
            {

                int inc = 1;

                for (int x = 1; x <= xmax; x++)
                {

                    // zero scan bit..
                    down_scan[y, x] = 0;

                    // and work out scanned area.
                    if (water_depth[x, y] > 0
                        || water_depth[x - 1, y] > 0
                        || water_depth[x - 1, y - 1] > 0
                        || water_depth[x - 1, y + 1] > 0
                        || water_depth[x + 1, y - 1] > 0
                        || water_depth[x + 1, y + 1] > 0
                        || water_depth[x, y - 1] > 0
                        || water_depth[x + 1, y] > 0
                        || water_depth[x, y + 1] > 0)
                    {
                        down_scan[y, inc] = x;
                        inc++;
                    }
                }
            }
        }

        [Cudafy]
        public static void calcdepths(GThread thread, double DX, double local_time_factor, int[,] down_scan, int ymax, double[,] qx, double[,] qy,
             double[,] water_depth)
        {
            // Get the id of the thread. The kernel runs once per row in parallel, so we need
            // to know which row (y) this thread is dealing with.

            int y = thread.blockDim.x * thread.blockIdx.x + thread.threadIdx.x;
            // Only process rows inside the grid, so we never read beyond the edge of the arrays
            if (y > 1 && y < ymax)
            {
                int inc = 1;
                while (down_scan[y, inc] > 0)
                //for (int x = 1; x <= xmax; x++)
                {
                    int x = down_scan[y, inc];
                    inc++;

                    // update water depths
                    water_depth[x, y] += local_time_factor * ((qx[x + 1, y] - qx[x, y] + qy[x, y + 1] - qy[x, y]) / DX);


                }
            }
        }
Jan 11, 2012 at 9:23 AM

Hi Tom,

Have you run your app through NVIDIA Compute Profiler?  CUDA optimization is very much try and see and profile, repeat.

It is difficult to figure out what values you need to keep in every loop - this is not obvious to me.  As a simple rule, minimize the transfer of data to and from the GPU: keep the data on the GPU.  On top of this there are further optimizations you can try:

  • Use pinned memory (host allocate) http://www.hybriddsp.com/Support/CudafyTutorials/FasterDataCopyingtoGPU.aspx
  • Overlap kernel and copy actions (this goes hand in hand with pinned memory)
  • Use floats instead of doubles.  This can make a big difference (like 4x) on GeForce cards.  By the way, if you have an early CUDA card then all double operations are treated as float anyway, so you will not see the difference.

Hope this helps. Let us know how it works out.

Nick

Jan 11, 2012 at 10:02 AM
Edited Jan 11, 2012 at 10:03 AM

Hi Nick,

Thanks for getting back to me. I think my issue is probably quite basic and I'm just not explaining myself. Sorry, I'll try again by distilling the problem to its simplest form.

Let's say I have a 2D array: depth[100,100]

Every iteration of the model,

Task 1. I have to alter every cell in depth[,] according to some parameters (it doesn't really matter what).

Task 2. I then have to alter every cell in depth[,] according to some other parameters.

Then repeat tasks 1 and 2, etc. Task 1 has to be carried out on all cells before task 2.

 

So, what I want to do, is

1. copy depth[,] over to the GPU

2. Carry out task1 then task2

3. repeat 2 say 100 times, then pull values for depth[,] back to the main program to look at them,

4. then repeat... 2 100 times then pull values etc..... and so on...

 

But at the moment, I'm...

1. copying depth[,] across to the GPU

2. running a kernel for task 1 - but then pulling depth[,] back again

3. copying depth[,] across to the GPU (again...)

4. running a kernel for task 2 - then pulling depth[,] back again...

 

Is there a way I can copy depth[,] across to the GPU, keep it (and its values) on the GPU, and run multiple kernels using depth[,] without having to pass it to and from the GPU? A global variable perhaps? (Sounds messy but might work.)

What's slowing me down at the moment is moving depth[,] back and forth every time I run a kernel. If I comment out the copying back, the speed shoots up.

Thanks, and I hope that makes more sense! I probably should have written it as pseudocode before... :)

All the best,

Tom


Jan 12, 2012 at 12:13 PM

I don't understand why you copy it back to the host within the loop.  Why don't you do all the calculations on the GPU (so also the for loop you do on the host)? Put the following in a device function - optimizing this for a GPU is not trivial but not too complex either; you'll need to divide dev_water_depth into subblocks and sum temptot per block, then sum all the temptots to give waterOut which you read back from the GPU (use an array of length 1 for this value).  There is nothing difficult or messy about this.

What does the host scan_area() do?  Can this also be done on GPU?

                   cycle += local_time_factor / 60;

                    scan_area();

                    temptot = 0;
                    for (y = 1; y <= ymax; y++)
                    {
                        if (water_depth[xmax, y] > 0)
                        {
                            temptot += water_depth[xmax, y] * DX * DX / local_time_factor;
                            // and zero water depths at RH end
                            water_depth[xmax, y] = 0;
                        }
                    }
                    waterOut = temptot;
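
A hedged sketch of the per-block sum described above, modelled on the usual shared-memory reduction pattern. It uses a single block of 256 threads for simplicity (enough for the ymax of up to 1000 mentioned earlier), so the one partial sum is also the final total and the result array has length 1. The kernel name outflow, the class and namespace names, the test data and the array sizes are assumptions for illustration. thread.AllocateShared and thread.SyncThreads are, to the best of my knowledge, the Cudafy counterparts of CUDA's __shared__ and __syncthreads(), so check them against your Cudafy version.

```csharp
using System;
using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;

namespace CudafySketches
{
    public class OutflowSketch
    {
        // CPU reference: the host loop from the post, kept for checking the GPU result.
        public static double OutflowCpu(double[,] water_depth, int xmax, int ymax,
                                        double DX, double local_time_factor)
        {
            double temptot = 0;
            for (int y = 1; y <= ymax; y++)
            {
                if (water_depth[xmax, y] > 0)
                {
                    temptot += water_depth[xmax, y] * DX * DX / local_time_factor;
                    water_depth[xmax, y] = 0;   // zero water depths at RH end
                }
            }
            return temptot;
        }

        // Launch with exactly 1 block of 256 threads: the block size must match the
        // shared array size and be a power of two for the halving loop below.
        [Cudafy]
        public static void outflow(GThread thread, double[,] water_depth, int xmax, int ymax,
                                   double DX, double local_time_factor, double[] result)
        {
            double[] cache = thread.AllocateShared<double>("cache", 256);
            int t = thread.threadIdx.x;

            // Each thread sums a strided subset of the right-hand column and zeroes it.
            double sum = 0;
            for (int y = 1 + t; y <= ymax; y += thread.blockDim.x)
            {
                if (water_depth[xmax, y] > 0)
                {
                    sum += water_depth[xmax, y] * DX * DX / local_time_factor;
                    water_depth[xmax, y] = 0;
                }
            }
            cache[t] = sum;
            thread.SyncThreads();

            // Tree reduction of the 256 per-thread sums in shared memory.
            for (int i = thread.blockDim.x / 2; i > 0; i /= 2)
            {
                if (t < i) cache[t] += cache[t + i];
                thread.SyncThreads();
            }
            if (t == 0) result[0] = cache[0];
        }

        public static void Main()
        {
            int xmax = 100, ymax = 100;                 // illustrative grid size
            double DX = 10.0, local_time_factor = 0.5;  // illustrative values

            double[,] water_depth = new double[xmax + 2, ymax + 2];
            for (int y = 1; y <= ymax; y++)
                water_depth[xmax, y] = (y % 2 == 0) ? 0.01 * y : 0.0;

            CudafyModule km = CudafyTranslator.Cudafy();
            GPGPU gpu = CudafyHost.GetDevice(eGPUType.Cuda);
            gpu.LoadModule(km);

            double[,] devDepth = gpu.CopyToDevice(water_depth);
            double[] result = new double[1];
            double[] devResult = gpu.CopyToDevice(result);

            gpu.Launch(1, 256).outflow(devDepth, xmax, ymax, DX, local_time_factor, devResult);
            gpu.CopyFromDevice(devResult, result);      // only one double comes back

            double expected = OutflowCpu(water_depth, xmax, ymax, DX, local_time_factor);
            Console.WriteLine("GPU waterOut = {0}, CPU waterOut = {1}", result[0], expected);

            gpu.FreeAll();
        }
    }
}
```

The two totals may differ in the last few digits because the GPU adds the terms in a different order. With several blocks, each block would write its partial sum to result[blockIdx.x] and a second small step would add those up.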
Jan 12, 2012 at 8:25 PM
Hello Nick - thanks for the reply -

NickKopp wrote:

I don't understand why you copy it back to the host within the loop.  Why don't you do all the calculations on the GPU (so also the for loop you do on the host)?

That's what I want to do, but I don't seem to be able to get it to work. Can I copy an array over to the GPU, then call (I assume) separate kernels to access the data and change it, without having to move the data back to the CPU and then over to the GPU again? The algorithms I use need to perform one calculation on all cells in an array, then perform another calculation on all the cells in the array: you have to process all cells for the first step, then all cells for the second, and so on.

I've been on a CUDA training course, and I know you can do this in CUDA, but rewriting everything in C would be a real pain, so I'm really impressed with Cudafy. I suspect what I'm trying to do is actually quite straightforward, but I'm just explaining it badly!


Put the following in a device function - optimizing this for a GPU is not trivial but not too complex either; you'll need to divide dev_water_depth into subblocks and sum temptot per block, then sum all the temptots to give waterOut which you read back from the GPU (use an array of length 1 for this value).  There is nothing difficult or messy about this.

What does the host scan_area() do?  Can this also be done on GPU?

                   cycle += local_time_factor / 60;

                    scan_area();

                    temptot = 0;
                    for (y = 1; y <= ymax; y++)
                    {
                        if (water_depth[xmax, y] > 0)
                        {
                            temptot += water_depth[xmax, y] * DX * DX / local_time_factor;
                            // and zero water depths at RH end
                            water_depth[xmax, y] = 0;
                        }
                    }
                    waterOut = temptot;

Thanks for looking at this code too :) This is really fast to execute and I can move it to the GPU readily - it's just in the host code for now, as I'm trying to get the rest of the code working well on the GPU and I needed this routine for it to work...

 

Jan 13, 2012 at 7:11 AM

So long as the data is on GPU you can launch a kernel function that uses it.  You can load your data, do your GPU kernel launch on it, then do another GPU kernel launch, etc.  You only need think of where the data is now - GPU or host.  If it is on GPU then just do your stuff.
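
A minimal sketch of that pattern, assuming a toy 100 by 100 depth array and two placeholder kernels named task1 and task2 (they are not the flow model's kernels; the class name, batch counts and update values are likewise made up). Only calls that already appear in this thread are used: CopyToDevice, Launch, CopyFromDevice and FreeAll. The array is uploaded once, both kernels are launched 100 times against the device copy, and the data is pulled back once per batch.

```csharp
using System;
using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;

namespace CudafySketches
{
    public class ResidentDataSketch
    {
        private const int N = 100;   // grid is N x N (illustrative size)

        // CPU reference for one cell: the value after `iterations` rounds of task1 + task2.
        public static double CpuReference(int iterations)
        {
            double d = 0.0;
            for (int i = 0; i < iterations; i++)
            {
                d += 1.0;   // what task1 does to each cell
                d -= 0.5;   // what task2 does to each cell
            }
            return d;
        }

        public static void Main()
        {
            CudafyModule km = CudafyTranslator.Cudafy();
            GPGPU gpu = CudafyHost.GetDevice(eGPUType.Cuda);
            gpu.LoadModule(km);

            double[,] depth = new double[N, N];             // host copy, starts at zero
            double[,] devDepth = gpu.CopyToDevice(depth);   // single upload

            for (int batch = 1; batch <= 3; batch++)
            {
                for (int it = 0; it < 100; it++)
                {
                    // Both kernels work on the device array; nothing is copied in between.
                    gpu.Launch(N, 1).task1(devDepth, N);
                    gpu.Launch(N, 1).task2(devDepth, N);
                }

                // One download per batch, e.g. for display.
                gpu.CopyFromDevice(devDepth, depth);
                Console.WriteLine("batch {0}: GPU {1}, expected {2}",
                                  batch, depth[1, 1], CpuReference(batch * 100));
            }

            gpu.FreeAll();
        }

        [Cudafy]
        public static void task1(GThread thread, double[,] depth, int n)
        {
            int y = thread.blockIdx.x;   // one block (of one thread) per row
            if (y < n)
            {
                for (int x = 0; x < n; x++)
                    depth[x, y] += 1.0;
            }
        }

        [Cudafy]
        public static void task2(GThread thread, double[,] depth, int n)
        {
            int y = thread.blockIdx.x;
            if (y < n)
            {
                for (int x = 0; x < n; x++)
                    depth[x, y] -= 0.5;
            }
        }
    }
}
```

The device array reference (devDepth here) is just a handle: as long as it is not freed, any later kernel launch can take it as an argument.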

Jan 13, 2012 at 1:38 PM

OK - I think I've actually been making this far more complicated than I need to! I made this example (it's pointless, but does roughly what I need to do: adding up adjacent values in arrays etc.) and it all seems to do what I want. Can you see any issues with this?

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;
namespace CudafyIntroduction
{
    class Program
    {
        private static int N = 100;
        
        static void Main(string[] args)
        {
            try
            {
                CudafyModule km = CudafyTranslator.Cudafy();
                _gpu = CudafyHost.GetDevice(eGPUType.Cuda);
                _gpu.LoadModule(km);


                double[] a = new double[N];

                // fill the arrays 'a' from 0 to 99
                for (int i = 0; i < N; i++)
                {
                    a[i] = i;
                }
                
                double[] dev_a = _gpu.CopyToDevice(a);

                for (int n = 1; n <= 50; n++)
                {
                    _gpu.Launch(N, 1).addVector(dev_a);
                }

                _gpu.CopyFromDevice(dev_a, a);

                Console.WriteLine("total of a[1] is {0} ", a[1]);

                // This used a bit more precious GPU memory than the earlier examples, so let's free it
                _gpu.FreeAll();


                Console.WriteLine("Done!");
            }
            catch (CudafyLanguageException cle)
            {
                HandleException(cle);
            }
            catch (CudafyCompileException cce)
            {
                HandleException(cce);
            }
            catch (CudafyHostException che)
            {
                HandleException(che);
            }
            
            Console.ReadLine();
        }


        [Cudafy]
        public static void addVector(GThread thread, double[] a)
        {
            int tid = thread.blockIdx.x;
            // To prevent reading beyond the end of the array we check that the id is less than Length
            if (tid < a.Length-1)
            {
                a[tid] = a[tid+1];
            }
        }


        private static GPGPU _gpu;

        private static void HandleException(Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}

I was wondering what the benefits are of allocating memory on the GPU as opposed to just copying values across as I am above?

Also, if I have kernels called sequentially, will it wait for the first kernel to finish before starting the second, or do I need to use a (sync?) command to make sure it's finished?

I'll try to implement my code using the above as a template over the weekend and see how I get on! It's probably easiest to copy all the variables over to the GPU and do the work there - even some serial code.

I'm really impressed with Cudafy, Nick. It's been a bit tricky for me to get set up and to get my head around some of the ideas, but it seems to work really well and integrates beautifully with existing C# code (which in my case deals with all the fancy drawing and file handling arrangements etc.).

Tom
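
Two hedged notes on the questions in the post above, followed by a sketch. First, in CUDA, kernels issued to the same stream run in launch order and a blocking copy back to the host waits for earlier work, so as far as I know consecutive Launch calls as used here do not need an explicit Synchronize between them for correctness (worth confirming for your Cudafy version). Second, the example kernel reads a[tid+1] while another block may be writing that same element during the same launch, and CUDA gives no ordering between blocks within a launch, so the result can depend on scheduling. One common way around this is to read from one buffer and write to another, swapping them between launches. The names shiftLeft, devSrc, devDst and ShiftReference below are made up for illustration.

```csharp
using System;
using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;

namespace CudafySketches
{
    public class PingPongSketch
    {
        private const int N = 100;

        // CPU reference: `steps` left shifts with the last element held fixed.
        // Assumes a non-empty array.
        public static double[] ShiftReference(double[] a, int steps)
        {
            double[] src = (double[])a.Clone();
            double[] dst = new double[a.Length];
            for (int s = 0; s < steps; s++)
            {
                for (int i = 0; i < a.Length - 1; i++) dst[i] = src[i + 1];
                dst[a.Length - 1] = src[a.Length - 1];
                double[] tmp = src; src = dst; dst = tmp;
            }
            return src;
        }

        [Cudafy]
        public static void shiftLeft(GThread thread, double[] src, double[] dst, int n)
        {
            int tid = thread.blockIdx.x;
            if (tid < n - 1)
                dst[tid] = src[tid + 1];   // reads only src, writes only dst
            else if (tid == n - 1)
                dst[tid] = src[tid];       // keep the last element
        }

        public static void Main()
        {
            CudafyModule km = CudafyTranslator.Cudafy();
            GPGPU gpu = CudafyHost.GetDevice(eGPUType.Cuda);
            gpu.LoadModule(km);

            double[] a = new double[N];
            for (int i = 0; i < N; i++) a[i] = i;
            double[] expected = ShiftReference(a, 50);

            double[] devSrc = gpu.CopyToDevice(a);
            double[] devDst = gpu.CopyToDevice(a);   // second buffer of the same size

            for (int n = 1; n <= 50; n++)
            {
                gpu.Launch(N, 1).shiftLeft(devSrc, devDst, N);
                double[] tmp = devSrc; devSrc = devDst; devDst = tmp;   // swap roles
            }

            gpu.CopyFromDevice(devSrc, a);   // after the swap, devSrc holds the latest result
            Console.WriteLine("a[1] on GPU = {0}, CPU reference = {1}", a[1], expected[1]);

            gpu.FreeAll();
        }
    }
}
```

The same read-one-buffer, write-another idea applies to any kernel in which a cell is updated from its neighbours' values of the same array.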

Jan 13, 2012 at 5:26 PM
Edited Jan 13, 2012 at 6:14 PM

Wow.

It's working! I'm getting the same results as my linear code... GPU'ometer is showing 99%, CPU 25% :)

So, some benchmarks. I'm doing all this on a laptop - a Dell Alienware m14x, i5 (quad core) and GT555m (3 GB memory, but DDR3).

I'm getting a pretty consistent 4x speed up with the GPU - but this is in comparison to the quad core, if I set affinity to 1 core (for the non GPU) then I'm getting an 11x speed up...  

I've a GTX560 on my desktop at work - which should go faster....

Thanks for the help - anything I can do to help with the project?

Tom

Edit: with a little 'gentle' overclocking of the GPU, I've got that speed-up to 5x... amazing how a 20% tweak in the GPU = 100% speed-up relative to the original!

Jan 14, 2012 at 7:10 AM

Good news!  Pleased to hear this.  I believe you could get even better performance but it depends on how much time you want to invest.  If you have not done so then run it through the NVIDIA Compute Visual Profiler.  

Help?  Yes that would certainly be appreciated.  Some possibles:

  • Publicity: CUDA is still a niche believe it or not, and CUDAfy is a tiny niche of that niche.  Articles or papers can help spread the word.  More users, more testing, fewer bugs, more features and higher level libraries. 
  • Testing and feedback: report bugs and idiosyncrasies, or things that could be better or clearer.  The maths libraries are being extended by contributors and also need testing due to their complexity.
  • Suggest and add to the higher level libraries such as the maths libraries.  Higher level libraries lower the hurdle to use.  Take a look at some 3rd party libs such as .NET Numerics.
  • Investigate graphics interop such as CUDAfy to XNA or SlimDX.  This is a bit of a hole in CUDAfy right now.  A means of letting people display CUDA data directly without copying it back to host first would be very useful!
May 1, 2012 at 8:47 PM
NickKopp wrote:

Good news!  Pleased to hear this.  I believe you could get even better performance but it depends on how much time you want to invest.  If you have not done so then run it through the NVIDIA Compute Visual Profiler.  

Help?  Yes that would certainly be appreciated.  Some possibles:

  • Publicity: CUDA is still a niche believe it or not, and CUDAfy is a tiny niche of that niche.  Articles or papers can help spread the word.  More users, more testing, fewer bugs, more features and higher level libraries. 

Hi Nick,

I'm picking up my Cudafy work again after a few months of having to do the rest of my day job! Anyway, as a recap, I'm using Cudafy to speed up a river flooding model, and I'm thinking of preparing a shortish paper for the journal 'Computers and Geosciences' describing what I did. It's a pretty simple paper: here's an existing model (that I've modified/adapted for C#) and these are the steps I took to get it running on GPGPU, etc. I was wondering if you wanted to be a co-author and contribute a couple of paragraphs of background on what Cudafy is and how it works? Probably 500-700 words or so. Interested? It might be some good publicity for Cudafy in the academic world. It's just an idea at the moment, but it would probably be pretty easy to put together. I'm happy to head things up, write the bulk and point out where sections are needed.

Tom

May 2, 2012 at 10:17 AM

Hi Tom,

Happy to help out however I can.  If you contact me via email then we can work out the details.

Cheers,

Nick

Dec 25, 2013 at 9:57 PM
Hi Tom and Nick,

Has anything happened with the proposed paper? I did some searching online and could not find anything. I would be really interested in learning about your implementation in greater detail, as I am working on something very similar.

If possible, please let me know.

Thanks in advance,

Joe
Dec 26, 2013 at 11:00 AM
Nothing on my side, Joe. Publicity, tutorials and articles are very useful in promoting the use of GPGPU from .NET. Tom - did you put together anything that could be used as basis?
Jan 8, 2014 at 8:43 AM

Hi Joe & Nick,

Sorry for picking this up late..

I didn't write anything up at the time. I got to the point with Cudafy where I was no longer getting any benefit over running the code on multi-core (i.e. 6-8 core etc.) Intel architecture. It's quite likely this is due to my own (inept!) programming capabilities, but to describe things as briefly as possible...

I make numerical models of rivers/estuaries operating over grids (matrices) of up to 1 million cells. Moving water from cell to cell (the hydraulics) works well with Cudafy and was fairly easy to code up; this involves c. 2-3 variables per cell and c. 2-3 calculations per cell. When you start to route sediment from cell to cell (the bit I'm interested in), you start to need c. 20 variables per cell and 20-30 calculations per cell. This means (on a large grid) that I start to move larger chunks of data, and thus memory, around from iteration to iteration, and things slow down to the point where it no longer seemed worthwhile.

As I said above, I'm probably doing something wrong (or could be doing something more right!), but there's a gap in my knowledge as a programmer, and especially with the nomenclature/language, that holds me back here...

Tom
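
A rough back-of-envelope for the sizes mentioned in the post above, assuming a 1000 by 1000 grid and that every per-cell variable is stored as one double (8 bytes) or one float (4 bytes). The counts of 3 and 20 variables per cell come from the post; the class and method names are illustrative only.

```csharp
using System;

namespace CudafySketches
{
    public static class GridFootprint
    {
        // Total bytes needed to hold every per-cell variable for the whole grid.
        public static long Bytes(long cells, int variablesPerCell, int bytesPerValue)
        {
            return cells * variablesPerCell * bytesPerValue;
        }

        public static void Main()
        {
            long cells = 1000L * 1000L;   // 1 million cells

            Console.WriteLine("hydraulics, double: {0} MB", Bytes(cells, 3, 8) / 1e6);   // 24
            Console.WriteLine("sediment,   double: {0} MB", Bytes(cells, 20, 8) / 1e6);  // 160
            Console.WriteLine("sediment,   float : {0} MB", Bytes(cells, 20, 4) / 1e6);  // 80
        }
    }
}
```

These totals fit comfortably in the 3 GB of the GT555m mentioned earlier, so capacity alone would not explain the slowdown. The thread does not say whether the cost was in host-device copies or in device memory traffic; a profiler run would be the way to find out.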