
Did anyone succeed in using CUBLAS GEMM for matrix multiplication?
I only get a NotImplementedException.
It would really make me happy if anyone can give me a clue.
/Javerberg
************ Code ***********
public static void TestMatrixMultiplication()
{
    GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target);
    GPGPUBLAS blas = GPGPUBLAS.Create(gpu);

    int matrixAHeight = 3;
    int matrixAWidth = 3;
    float[] matrixA = new[] { 0.9f, 0.8f, 0.7f, 0.6f, 0.5f, 0.4f, 0.3f, 0.2f, 0.1f };
    int matrixBHeight = 3;
    int matrixBWidth = 3;
    float[] matrixB = new[] { 0.9f, 0.8f, 0.7f, 0.6f, 0.5f, 0.4f, 0.3f, 0.2f, 0.1f };
    int matrixCHeight = 3;
    int matrixCWidth = 3;
    float[] matrixC = new float[matrixCWidth * matrixCHeight];

    // Allocate the memory on the GPU.
    float[] devA = gpu.Allocate<float>(matrixAHeight * matrixAWidth);
    float[] devB = gpu.Allocate<float>(matrixBHeight * matrixBWidth);
    float[] devC = gpu.Allocate<float>(matrixCHeight * matrixCWidth);

    // Copy the arrays to the GPU.
    gpu.CopyToDevice(matrixA, devA);
    gpu.CopyToDevice(matrixB, devB);
    gpu.CopyToDevice(matrixC, devC);

    float alpha = 1.0f;
    float beta = 0.0f;
    int m = matrixAHeight; // number of rows of matrix op(A) and C
    int n = matrixBWidth;  // number of columns of matrix op(B) and C
    int k = matrixAWidth;  // number of columns of op(A) and rows of op(B)
    int lda = matrixAWidth; // leading dimension of the two-dimensional array used to store matrix A
    int ldb = matrixBWidth; // leading dimension of the two-dimensional array used to store matrix B
    int ldc = matrixCWidth; // leading dimension of the two-dimensional array used to store matrix C

    blas.GEMM(m, n, k, alpha, devA, devB, beta, devC,
        cublasOperation.N, cublasOperation.N, lda, ldb, ldc);

    gpu.CopyFromDevice(devC, matrixC);
    for (int i = 0; i < 9; i++)
    {
        Console.WriteLine(matrixC[i]);
    }
}
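For reference, here is a NumPy sketch (my own illustration, not CUDAfy code) of the result this call should produce once it runs. Note that CUBLAS treats the flat arrays as column-major, so `order='F'` is used when reshaping:

```python
import numpy as np

# The flat array from the post, interpreted the way CUBLAS does:
# column-major, 3x3 (order='F' fills column by column).
a = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1])
A = a.reshape(3, 3, order='F')
B = A.copy()  # matrixB is identical to matrixA in the example

# C = alpha * A @ B + beta * C with alpha = 1, beta = 0
C = A @ B

# Flatten back to the column-major array that CopyFromDevice would fill.
# Approximately: [1.50, 1.26, 1.02, 0.96, 0.81, 0.66, 0.42, 0.36, 0.30]
print(C.flatten(order='F'))
```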




Sorry for late reply.
What is the build target of your CUDAfy? 32-bit or 64-bit? The BLAS level 2, 3 and SPARSE routines in CUDAfy currently support only 64-bit; the 32-bit version is not supported yet, so try 64-bit. 32-bit support is a work in progress :)




Thanks a lot!
Changed to 64-bit, and it works!




Hi guys, I'm facing the same problem with the example above. When I call the method I get a NotImplementedException: the GEMM method is not implemented. What do you guys mean by the build target?




You need to use a 64-bit Windows OS. In the BUILD menu in VS 2012, click Configuration Manager and change the platform from x86 to x64.




Thanks for the speedy reply. I have changed it, but I am still having the same problem.




Check your active solution configuration in Configuration Manager (DEBUG or RELEASE). The platform setting is separate for each solution configuration, so you need to change it for both.




Both have been set to x64 and it's not working.




Which build of CUDAfy are you using? If it is the prebuilt binary (from the Downloads section), try building CUDAfy from source for the x64 platform. The source code can be downloaded from
here.




I have tried that and I still get the same error. Is it possible that CUDAfy is not seeing the CUBLAS library?




If the CUBLAS library were missing, the program would throw a different exception, not NotImplementedException.
The NotImplementedException in BLAS is thrown when the program is running in 32-bit mode. Are both CUDAfy and your project set to the x64 platform?




So I downloaded CUDAfy again and compiled it in x64 debug mode. I then copied your example, pasted it into the program file inside CudafyByExample, and called that method inside the try block of Main. All the other examples run perfectly. I must also add that I don't have a GPU card yet, so I'm using the emulator.



Feb 7, 2013 at 1:22 PM
Edited Feb 7, 2013 at 1:32 PM

Oh, I see. BLAS and SPARSE need an actual device because they call the native CUDA libraries. The CPU emulator is not implemented for them yet :)




Thanks for that. Do you know how much performance you actually lose by going through CUDAfy compared to coding CUDA directly in C++?




CUDAfy's BLAS and SPARSE routines call the native C++ functions via DllImport, so I think there is very little difference between C# and C++ in pure computation speed (not overall application speed).
However, .NET programs are compiled to binary code at run time (by the just-in-time compiler), so in general the first run of a C# program is slower than a native C++ program.
If you need actual performance test results, I recommend creating a new thread to get other users' opinions.




Hi,
In this same example, when I change the sizes of the matrices, more precisely when I change the lines above to
int matrixAHeight = 2;
int matrixAWidth = 2;
float[] matrixA = new[] { 0.9f, 0.8f, 0.7f, 0.6f };
int matrixBHeight = 2;
int matrixBWidth = 6;
float[] matrixB = new[] { 0.9f, 0.8f, 0.7f, 0.6f, 0.5f, 0.4f, 0.3f, 0.2f, 0.1f, 0.5f, 0.5f, 0.5f };
int matrixCHeight = 2;
int matrixCWidth = 6;
float[] matrixC = new float[matrixCWidth * matrixCHeight];
I get zeros and NaNs as output. Is there anything I did wrong?




The sample code's GEMM parameters are wrong.
The prototype of the CUDAfy.NET CUBLAS wrapper's GEMM is:
public abstract void GEMM(int m, int k, int n, float alpha, float[] A, float[] B, float beta, float[] C, cublasOperation transa = cublasOperation.N, cublasOperation transb = cublasOperation.N, int lda = 0, int ldb = 0, int ldc = 0);
The order of the matrix dimensions is M, K, N; you passed them as M, N, K. (This differs from other standard BLAS functions, sorry :))
You have to change the source code from
blas.GEMM(m, n, k, alpha, devA, devB, beta, devC,
    cublasOperation.N, cublasOperation.N, lda, ldb, ldc);
to
blas.GEMM(m, k, n, alpha, devA, devB, beta, devC,
    cublasOperation.N, cublasOperation.N);
(lda, ldb and ldc are generally calculated automatically, so you can omit these parameters.)
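A quick way to sanity-check the (m, k, n) ordering is to mirror the dimensions in NumPy (a sketch of my own, not CUDAfy code): for the 2×2 A and 2×6 B above, m = 2, k = 2, n = 6, and the result C must be m×n.

```python
import numpy as np

# Dimensions from the post: A is m x k (2x2), B is k x n (2x6).
m, k, n = 2, 2, 6

# CUBLAS stores matrices column-major, so reshape with order='F'.
A = np.array([0.9, 0.8, 0.7, 0.6]).reshape(m, k, order='F')
B = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4,
              0.3, 0.2, 0.1, 0.5, 0.5, 0.5]).reshape(k, n, order='F')

C = A @ B          # C is m x n, i.e. 2 x 6
print(C.shape)     # (2, 6) -- matches matrixC of length m * n = 12
```

If the dimensions are passed in the wrong order, the shapes no longer line up, which is consistent with the zeros and NaNs reported above.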



Apr 12, 2013 at 2:13 PM
Edited Apr 12, 2013 at 2:14 PM

Hi,
Thanks for the response. I have tried this now:
double[] check = new double[] { 1, 2, 3, 4 };
double[] checkDev = _gpu.Allocate(check);
_gpu.CopyToDevice(check, checkDev);
double[] numbers = new double[] { 1, 2, 3, 4, 1, 2, 3, 4 };
double[] numbersDev = _gpu.Allocate(numbers);
double[] resultCheckDev = _gpu.Allocate(numbers);
double[] res = new double[8];
_gpu.CopyToDevice(numbers, numbersDev);
blas.GEMM(2, 2, 4, 1, checkDev, numbersDev, 0, resultCheckDev, cublasOperation.N, cublasOperation.N);
_gpu.CopyFromDevice(resultCheckDev, res);
I am trying to compute
[1 2]   [1 2 3 4]
[3 4] x [1 2 3 4]
which should be
[3  6  9 12]
[7 14 21 28]
but I get something totally different from the code above.




CUBLAS's matrix format is column-major. This means the elements of a matrix are stored as "vertical" vectors (column by column), not "horizontal" vectors. So your double[] { 1, 2, 3, 4 } means the matrix
[1, 3]
[2, 4]
and { 1, 2, 3, 4, 1, 2, 3, 4 } means
[1, 3, 1, 3]
[2, 4, 2, 4]
so the result
[7, 15, 7, 15]
[10, 22, 10, 22]
is totally correct, not
[3, 6, 9, 12]
[7, 14, 21, 28].
You can check this article for the difference between row-major and column-major format:
http://en.wikipedia.org/wiki/Row-major_order#Column-major_order
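The column-major interpretation above can be reproduced with NumPy's `order='F'` (a sketch of my own for illustration, outside CUDAfy):

```python
import numpy as np

# Reshape the flat arrays column-major ('F' = Fortran order), as CUBLAS does.
A = np.array([1.0, 2.0, 3.0, 4.0]).reshape(2, 2, order='F')
B = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]).reshape(2, 4, order='F')

print(A)      # the matrix [1 3; 2 4], not [1 2; 3 4]
print(A @ B)  # the product [7 15 7 15; 10 22 10 22]
```

To get the row-major result the poster expected, the flat arrays would have to be laid out row by row instead, or the matrices passed transposed (cublasOperation.T).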

