Matrix multiplication

Apr 26, 2012 at 3:25 PM

I want to implement matrix multiplication using shared memory with Cudafy.NET so I can use it in my C# application.

I converted the kernel method from the CUDA C Programming Guide (p. 26) and ended up with this (the code snippet tool didn't work?!):

 

[Cudafy]

public static void MatMulKernel(GThread thread, double[,] A, double[,] B, double[,] C)
{
    // Block row and column
    int blockRow = thread.blockIdx.y;
    int blockCol = thread.blockIdx.x;
           
    // Each thread block computes one sub-matrix Csub of C
    double[,] Csub = thread.AllocateShared<double>("Csub", BLOCK_SIZE, BLOCK_SIZE);
    int subRowStart1 = blockRow * BLOCK_SIZE;
    int subColStart1 = blockCol * BLOCK_SIZE;

    for (int i = 0; i < BLOCK_SIZE; i++)
        for (int j = 0; j < BLOCK_SIZE; j++)
            Csub[i, j] = C[subRowStart1 + i, subColStart1 + j];
           
    // Each thread computes one element of Csub by accumulating results into Cvalue
    double Cvalue = 0;
           
    // Thread row and column within Csub
    int row = thread.threadIdx.y;
    int col = thread.threadIdx.x;

    // Loop over all the sub-matrices of A and B that are required to compute Csub.
    // Multiply each pair of sub-matrices together and accumulate the results
    for (int m = 0; m < (matrix1Cols / BLOCK_SIZE); m++)
    {
        // Get sub-matrix Asub of A
        double[,] Asub = thread.AllocateShared<double>("Asub", BLOCK_SIZE, BLOCK_SIZE);
        int subRowStart2 = blockRow * BLOCK_SIZE;
        int subColStart2 = m * BLOCK_SIZE;

        for (int i = 0; i < BLOCK_SIZE; i++)
            for (int j = 0; j < BLOCK_SIZE; j++)
                Asub[i, j] = A[subRowStart2 + i, subColStart2 + j];
               
        // Get sub-matrix Bsub of B
        double[,] Bsub = thread.AllocateShared<double>("Bsub", BLOCK_SIZE, BLOCK_SIZE);
        int subRowStart3 = m * BLOCK_SIZE;
        int subColStart3 = blockCol * BLOCK_SIZE;

        for (int i = 0; i < BLOCK_SIZE; i++)
            for (int j = 0; j < BLOCK_SIZE; j++)
                Bsub[i, j] = B[subRowStart3 + i, subColStart3 + j];
               
        // Synchronize to make sure the sub-matrices are loaded before starting the computation
        thread.SyncThreads();
               
        // Multiply Asub and Bsub together
        for (int e = 0; e < BLOCK_SIZE; e++)
            Cvalue += Asub[row, e] * Bsub[e, col];
               
        // Synchronize to make sure that the preceding computation is done before loading two new sub-matrices of A and B in the next iteration
        thread.SyncThreads();
    }
           
    // Write Csub to device memory. Each thread writes one element
    Csub[row, col] = Cvalue;
}

 

I call this method within:

 

public const int BLOCK_SIZE = 16;
public const int matrix1Rows = BLOCK_SIZE*10;
public const int matrix1Cols = BLOCK_SIZE*8;
public const int matrix2Cols = BLOCK_SIZE*10;

public static void Execute()
{
    // Translates this class to CUDA C and then compiles
    CudafyModule km = CudafyTranslator.Cudafy(eArchitecture.sm_11);

    // Get the first GPU and load the module
    GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target);
    gpu.LoadModule(km);

    double[,] matrix1 = new double[matrix1Rows, matrix1Cols];
    double[,] matrix2 = new double[matrix1Cols, matrix2Cols];

    int maxRows = matrix1Rows;
    int maxCols = matrix2Cols;

    double counter = 1;

    for (int i = 0; i < maxRows; i++)
        for (int j = 0; j < matrix1Cols; j++)
            matrix1[i, j] = counter++;

    for (int i = 0; i < matrix1Cols; i++)
        for (int j = 0; j < maxCols; j++)
            matrix2[i, j] = counter++;

    double[,] multiplyResult = new double[maxRows, maxCols];
    double[,] gpuMatrix1 = gpu.CopyToDevice(matrix1);
    double[,] gpuMatrix2 = gpu.CopyToDevice(matrix2);
    double[,] gpuMultiplyResult = gpu.Allocate(multiplyResult);

    gpu.Launch(16, 16).MatMulKernel(gpuMatrix1, gpuMatrix2, gpuMultiplyResult);
    gpu.CopyFromDevice(gpuMultiplyResult, multiplyResult);
    gpu.FreeAll();
}

 

The computation runs without any error, but the multiplyResult array is filled with zeros only. It's probably a typical beginner's problem, but I haven't found the reason yet.

Coordinator
May 3, 2012 at 10:31 AM

Hi there,

A bit short of time so I am unable to test this right now. Typically what I do in these situations is to start removing parts of the code, usually the Launch command, and replace it with a CopyOnDevice command. Check that you get back what you wrote to the device. If that works, start with an empty kernel method that just does a copy with the same launch parameters - it should give the same result as CopyOnDevice. If that also works, introduce your corner-turning operations and so on.

Let me know how it goes.

Cheers,

Nick

May 7, 2012 at 11:25 AM

Problem is solved.

I followed your suggestions. The main problem was that my graphics card doesn't support double precision. After fixing some small issues in the code above, I changed the data types from double to float and now it works.

Thank you Nick!

Aug 3, 2012 at 8:13 AM

matt500:

Would you be so kind as to release your current version of the matrix multiplication implementation with CUDAfy? I'm struggling to get it to work, regardless of whether I use double precision or not.

 

Best regards,

flip

Aug 6, 2012 at 9:15 AM

This is the kernel method I ended up with.

 

[Cudafy]
public static void MultiplyMatricesKernel(GThread thread, float[,] A, int widthA, float[,] B, float[,] C)
{
    // Block row and column
    int blockRow = thread.blockIdx.y;
    int blockCol = thread.blockIdx.x;

    int subRowStart = blockRow * BLOCK_SIZE;
    int subColStart = blockCol * BLOCK_SIZE;

    // Thread row and column within Csub
    int row = thread.threadIdx.y;
    int col = thread.threadIdx.x;

    // Each thread computes one element of Csub by accumulating results into Cvalue
    float Cvalue = 0;

    // Loop over all the sub-matrices of A and B that are required to compute Csub.
    // Multiply each pair of sub-matrices together and accumulate the results
    for (int m = 0; m < (widthA / BLOCK_SIZE); m++)
    {
        // Get sub-matrix Asub of A
        float[,] Asub = thread.AllocateShared<float>("Asub", BLOCK_SIZE, BLOCK_SIZE);

        Asub[row, col] = A[subRowStart + row, m * BLOCK_SIZE + col];

        // Get sub-matrix Bsub of B
        float[,] Bsub = thread.AllocateShared<float>("Bsub", BLOCK_SIZE, BLOCK_SIZE);

        Bsub[row, col] = B[m * BLOCK_SIZE + row, subColStart + col];

        // Synchronize to make sure the sub-matrices are loaded before starting the computation
        thread.SyncThreads();

        // Multiply Asub and Bsub together
        for (int e = 0; e < BLOCK_SIZE; e++)
            Cvalue += Asub[row, e] * Bsub[e, col];

        // Synchronize to make sure that the preceding computation is done before loading two new sub-matrices of A and B in the next iteration
        thread.SyncThreads();
    }

    // Write Csub to device memory. Each thread writes one element
    C[subRowStart + row, subColStart + col] = Cvalue;
}