Performance

Feb 19, 2012 at 6:16 AM
Edited Feb 19, 2012 at 6:26 AM

I wrote a simple performance test, and the results aren't very good.

 

[Cudafy]
        public static void calc_e(GThread thread, int n, int[] dx, int[] dy, int[] e)
        {
            for (int i = 0; i < n; i++)
            {
                e[i] = 2 * dy[i] - dx[i];
            }
        }

        static void Main(string[] args)
        {
            int n = 2000000;
            
            Random r = new Random();

            int[] dx = new int[n];
            int[] dy = new int[n];
            int[] e = new int[n];

            // fill the arrays with random values
            for (int i = 0; i < n; i++)
            {
                dx[i] = r.Next();
                dy[i] = r.Next();
            }

            double t2 = MeasureTime(() =>
            {
                for (int i = 0; i < n; i++)
                {
                    e[i] = 2 * dy[i] - dx[i];
                }
            });

            CudafyModule km = CudafyTranslator.Cudafy(eArchitecture.sm_11);

            GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target, 0);
            gpu.LoadModule(km);            

            int[] dev_dx = gpu.Allocate<int>(dx);
            int[] dev_dy = gpu.Allocate<int>(dy);
            int[] dev_e = gpu.Allocate<int>(e);

            gpu.CopyToDevice(dx, dev_dx);
            gpu.CopyToDevice(dy, dev_dy);

            double t3 = MeasureTime(() =>
            {
                gpu.Launch(128, 1, "calc_e", n, dev_dx, dev_dy, dev_e);
            });

            double t4 = MeasureTime(() =>
            {
                gpu.CopyFromDevice(dev_e, e);
            });

            Console.WriteLine(string.Format("n = {0}", n));
            Console.WriteLine(string.Format("CPU ::: e = 2 * dy - dx ::: Excecution time: {0} ms", t2 * 1000));
            Console.WriteLine(string.Format("CUDA ::: e = 2 * dy - dx ::: Excecution time: {0} ms", t3 * 1000));
            Console.WriteLine(string.Format("CUDA copy to host {0} ms", t4 * 1000));
            Console.ReadKey();
        }

        static double MeasureTime(Action action)
        {
            Stopwatch watch = new Stopwatch();
            
            watch.Start();
            action.Invoke();
            watch.Stop();

            return watch.ElapsedTicks / (double)Stopwatch.Frequency;
        }

 

And I got the following results:

 n = 2000000

CPU ::: e = 2 * dy - dx ::: Excecution time: 16,4077245748105 ms

CUDA ::: e = 2 * dy - dx ::: Excecution time: 17,3898500616068 ms

CUDA copy to host 1917,33549611416 ms

The copy-to-host operation is extremely slow!

But I think this performance test code can be radically improved.

Any ideas?

 

Thanks in advance.

Feb 19, 2012 at 1:01 PM

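There are a few issues here. First of all, you are being very, very unfair to the GPU! Launch(128, 1, ...) starts 128 blocks of one thread each, and every one of those threads runs the whole loop, so you are forcing the GPU to do the 2000000 calculations 128 times in parallel. The CPU only makes one pass through the 2000000. You can instead split up the 2000000 elements across threads and launch far more threads in parallel.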

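Secondly, be very careful when timing device code. Preferably make use of the built-in timing and synchronization functions of the GPGPU class. Invoke also has unpredictable and significant overhead. If you change the Launch args to (1, 1, ...) and add

gpu.Synchronize();

after the Launch, you'll see that the Launch is basically asynchronous: it returns before the kernel has finished, so your t3 covers little more than the launch call itself, and the time you measured for CopyFromDevice most likely includes waiting for the kernel to complete.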
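
The effect of timing an asynchronous call can be reproduced without a GPU. This is only a CPU analogy, not CUDAfy code: it assumes Task.Run stands in for the asynchronous gpu.Launch and Task.Wait stands in for a blocking call such as gpu.Synchronize or gpu.CopyFromDevice. The names AsyncTimingDemo and Run are made up for this sketch; MeasureTime is the helper from the original post.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncTimingDemo
{
    // Same helper as in the original post.
    public static double MeasureTime(Action action)
    {
        Stopwatch watch = new Stopwatch();
        watch.Start();
        action.Invoke();
        watch.Stop();
        return watch.ElapsedTicks / (double)Stopwatch.Frequency;
    }

    // Returns { seconds spent starting the work, seconds spent waiting for it }.
    public static double[] Run(int workMs)
    {
        Task work = null;

        // Analogy for gpu.Launch: starts the work and returns immediately.
        double start = MeasureTime(() =>
        {
            work = Task.Run(() => Thread.Sleep(workMs));
        });

        // Analogy for the first blocking call after the launch:
        // it has to wait until the work is done, so it looks slow.
        double wait = MeasureTime(() => work.Wait());

        return new[] { start, wait };
    }

    public static void Main()
    {
        double[] t = Run(200);
        Console.WriteLine("start only: {0:F1} ms", t[0] * 1000);
        Console.WriteLine("blocking wait: {0:F1} ms", t[1] * 1000);
    }
}
```

Almost all of the elapsed time shows up in the second measurement, even though the waiting call does no work of its own. Putting the blocking call inside the same timed lambda as the start gives the figure you actually want.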

Here is a kernel that splits the work across threads, together with a matching launch:

        [Cudafy]
        public static void calc_e_v2(GThread thread, int n, int[] dx, int[] dy, int[] e)
        {
            int i = thread.blockDim.x * thread.blockIdx.x + thread.threadIdx.x;
            while(i < n)
            {
                e[i] = 2 * dy[i] - dx[i];
                i += (thread.blockDim.x * thread.gridDim.x);
            }
        }
...
            double t3 = MeasureTime(() =>
            {
                gpu.Launch(n / 512, 512, "calc_e_v2", n, dev_dx, dev_dy, dev_e);
            });
...
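
Note that n / 512 is an integer division, so the launch above starts fewer threads than there are elements; the while loop in calc_e_v2 picks up the remainder. The following CPU-only sketch replays the same index arithmetic to check that every element is handled exactly once. GridStrideDemo and CountVisits are hypothetical names, and the nested loops only simulate what the GPU threads would do in parallel.

```csharp
using System;

public static class GridStrideDemo
{
    // Simulates the indexing of calc_e_v2 on the CPU and counts
    // how many times each index i in [0, n) would be written.
    public static int[] CountVisits(int n, int gridDim, int blockDim)
    {
        int[] visits = new int[n];
        for (int blockIdx = 0; blockIdx < gridDim; blockIdx++)
        {
            for (int threadIdx = 0; threadIdx < blockDim; threadIdx++)
            {
                // Same arithmetic as in the kernel.
                int i = blockDim * blockIdx + threadIdx;
                while (i < n)
                {
                    visits[i]++;
                    i += blockDim * gridDim;
                }
            }
        }
        return visits;
    }

    public static void Main()
    {
        int n = 2000000;
        int gridDim = n / 512;   // 3906 blocks, rounded down
        int blockDim = 512;

        int[] visits = CountVisits(n, gridDim, blockDim);
        bool allOnce = Array.TrueForAll(visits, v => v == 1);

        Console.WriteLine("threads = {0}", gridDim * blockDim);    // prints "threads = 1999872"
        Console.WriteLine("every index visited once: {0}", allOnce); // prints "every index visited once: True"
    }
}
```

With these numbers the last 128 elements are covered by a second pass of threads 0 to 127, which is why the stride loop is needed rather than a plain if (i < n).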

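If you let CUDA "warm up" by running the timed section twice and keeping the second result, you'll see even better performance. Note that this version also times the copies to and from the device: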
            for (int x = 0; x < 2; x++)
            {
                t3 = MeasureTime(() =>
                {
                    //gpu.Launch(1, 1, "calc_e", n, dev_dx, dev_dy, dev_e);
                    gpu.CopyToDevice(dx, dev_dx);
                    gpu.CopyToDevice(dy, dev_dy);
                    gpu.Launch(n / 512, 512, "calc_e_v2", n, dev_dx, dev_dy, dev_e);
                    //gpu.Synchronize();
                    gpu.CopyFromDevice(dev_e, e);
                });
            }
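
The warm-up loop above can be folded into the timing helper. This is a minimal sketch of one way to do it, not part of CUDAfy: WarmupTiming and MeasureAfterWarmup are hypothetical names. For GPU work, the action passed in should end with a blocking call such as gpu.Synchronize() or gpu.CopyFromDevice(...), otherwise only the launch is timed.

```csharp
using System;
using System.Diagnostics;

public static class WarmupTiming
{
    // Runs the action warmupRuns times without timing it, then returns
    // the average time in seconds over timedRuns further runs.
    public static double MeasureAfterWarmup(Action action, int warmupRuns, int timedRuns)
    {
        if (timedRuns <= 0)
            throw new ArgumentOutOfRangeException("timedRuns");

        for (int i = 0; i < warmupRuns; i++)
            action();

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < timedRuns; i++)
            action();
        watch.Stop();

        return watch.ElapsedTicks / (double)Stopwatch.Frequency / timedRuns;
    }

    public static void Main()
    {
        int calls = 0;
        double seconds = MeasureAfterWarmup(() => { calls++; }, 1, 5);

        Console.WriteLine("calls = {0}", calls); // prints "calls = 6"
        Console.WriteLine("average = {0} ms", seconds * 1000);
    }
}
```

Averaging over several timed runs also smooths out the run-to-run jitter you get from a single Stopwatch measurement.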