Sparse matrix, Compressed column, ~1 billion non zeros

Feb 15, 2012 at 7:45 PM

I've been working with some rather large matrices and so optimizing matrix/vector calculations has become more important :)

For the past week, I've spent some time looking at MKL and similar BLAS libraries. The problem is that my code is in C# and my matrices are so big that they exceed the .NET 2 GB limit on object size. I've used the BigArray approach as a workaround, with good results. The problem with MKL and similar libraries is that I need to pass byte aligned arrays to them with P/Invoke, and doing this with Marshal.Copy means creating 2 copies of the same data (which I can't do because of the size). There might be another way, with GCHandle.Alloc, that requires unsafe code, but I haven't yet figured out whether it solves my problem: moving data from BigArray to unmanaged byte aligned arrays efficiently (fast, and with only 1 copy of the matrix in memory at any time).

My question is whether Cudafy can help with this. Basically I want to calculate A * A^t * v several thousand times, where A is a sparse matrix in compressed column format. I currently do this in 2 loops (A^t * v = x, then A * x = result vector), which I can parallelize very nicely using .NET 4.
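For readers following along, the two-loop computation described above can be sketched as below. This is a minimal single-threaded illustration in Python rather than the poster's C# code; the function name `csc_aat_times_v` and the array names `col_ptr`, `row_idx`, and `values` are assumptions chosen for the example, not names from the thread.

```python
def csc_aat_times_v(n_rows, col_ptr, row_idx, values, v):
    """Compute A * (A^t * v) for A stored in compressed column (CSC) form."""
    n_cols = len(col_ptr) - 1

    # Pass 1: x = A^t * v. Entry j of x is the dot product of column j with v.
    x = [0.0] * n_cols
    for j in range(n_cols):
        acc = 0.0
        for k in range(col_ptr[j], col_ptr[j + 1]):
            acc += values[k] * v[row_idx[k]]
        x[j] = acc

    # Pass 2: result = A * x. Column j scatters x[j] * A[:, j] into the result.
    result = [0.0] * n_rows
    for j in range(n_cols):
        xj = x[j]
        for k in range(col_ptr[j], col_ptr[j + 1]):
            result[row_idx[k]] += values[k] * xj
    return result


# Small example matrix (2 rows, 3 columns):
# A = [[1, 0, 2],
#      [0, 3, 0]]
example_col_ptr = [0, 1, 2, 3]
example_row_idx = [0, 1, 0]
example_values = [1.0, 3.0, 2.0]
example_result = csc_aat_times_v(
    2, example_col_ptr, example_row_idx, example_values, [1.0, 1.0]
)
```

Pass 1 writes each x[j] independently, so it parallelizes over columns without conflicts. Pass 2 scatters into shared result entries, so a parallel version needs per-thread accumulators or some other way to avoid concurrent writes to the same row.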

The matrices are about 8-14 GB in managed memory, stored as 1D arrays of doubles and ints (BigArray is a wrapper class that provides logical 1D array access via internal jagged arrays).

So the relevant questions are

- Can I avoid memory overhead when moving data to the GPU (i.e. is there only 1 copy of the data in main memory at any time)?

- If I have to break the matrix into blocks, is the overhead I would incur justified by the overall increase in performance?

- Does Cudafy/CUDA offer any support for this? The relevant CUSPARSE function looks like cusparse<t>csrmv.
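One point worth spelling out about the last question: csrmv expects compressed row (CSR) storage, while the matrix here is in compressed column form. The compressed column arrays of A are exactly the compressed row arrays of A^t, so a CSR matrix-vector routine run on them as-is computes A^t * v. The sketch below shows this with a plain Python CSR product; `csr_matvec` is a hypothetical stand-in written for this example, not the CUSPARSE or CUDAfy call.

```python
def csr_matvec(row_ptr, col_idx, values, x):
    """Plain CSR matrix-vector product: y[i] = sum over k of values[k] * x[col_idx[k]]."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y


# A = [[1, 0, 2],
#      [0, 3, 0]] stored in compressed column form.
csc_col_ptr = [0, 1, 2, 3]
csc_row_idx = [0, 1, 0]
csc_values = [1.0, 3.0, 2.0]

# Passing the CSC arrays where CSR arrays are expected treats the data as A^t,
# so this computes A^t * v (length 3) instead of A * x.
at_times_v = csr_matvec(csc_col_ptr, csc_row_idx, csc_values, [1.0, 1.0])
```

The second half of the product, A * x, would then need the routine's transpose mode on the same arrays, or a separate copy of the data in the other layout. Whether the CUDAfy wrapper exposes a transpose option is something to check against its documentation.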

Feb 20, 2012 at 3:18 AM
Edited Feb 20, 2012 at 3:20 AM

Hi aolney.

CUDAfy.NET has CUBLAS and CUSPARSE wrapper functions, so you can use CUSPARSE functions (like csrmv) within a CUDAfy context. All wrapper functions are used in much the same way as the original CUBLAS/CUSPARSE functions. Also, with CUDAfy you can avoid the overhead of transferring data back and forth between CPU and GPU. If your GPU has sufficient memory, you can perform all vector/matrix operations on the GPU and need only 2 data transfers (input data CPU->GPU and final result GPU->CPU).

But I don't know whether CUDAfy supports arrays larger than 2 GB.

Feb 20, 2012 at 2:20 PM

I'm pretty sure I need to break this down into multiple operations.

If my GPU has 1 GB, then I will need at least 8 operations, swapping blocks in and out, to complete one matrix multiply.
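As a rough illustration of that estimate, the sketch below splits a compressed column matrix into contiguous column ranges whose nonzeros fit a byte budget. The function name `split_columns_by_budget` and the figure of 12 bytes per nonzero (one 8-byte double plus one 4-byte int row index) are assumptions for this example. The budget would have to be set below the card's total memory, since the vectors and column pointers need room too, which is why 8 is a lower bound for an 8 GB matrix on a 1 GB card.

```python
def split_columns_by_budget(col_ptr, budget_bytes, bytes_per_nnz=12):
    """Return half-open (start_col, end_col) ranges whose nonzeros fit the budget."""
    max_nnz = budget_bytes // bytes_per_nnz
    n_cols = len(col_ptr) - 1
    blocks = []
    start = 0
    for j in range(n_cols):
        if col_ptr[j + 1] - col_ptr[j] > max_nnz:
            raise ValueError("column %d alone exceeds the budget" % j)
        # Close the current block if adding column j would exceed the budget.
        if col_ptr[j + 1] - col_ptr[start] > max_nnz:
            blocks.append((start, j))
            start = j
    if start < n_cols:
        blocks.append((start, n_cols))
    return blocks


# Four columns holding 2, 1, 3 and 1 nonzeros; the budget allows 3 per block.
demo_col_ptr = [0, 2, 3, 6, 7]
demo_blocks = split_columns_by_budget(demo_col_ptr, budget_bytes=36)
```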

Can you point me in the right direction for this kind of work? It seems most people have problems that fit on their card.

Feb 27, 2012 at 3:41 AM

Sorry for the late reply. I don't have any experience with very large matrix calculations :). But I think you can use block matrix multiplication: split your matrix into small block matrices and vectors, upload the data to GPU memory, perform the matrix operations, download the results to CPU memory, and merge the small vectors into a BigArray.
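For this particular product, splitting by columns works out neatly: if A = [A_1 ... A_B] is cut into column blocks, then A * A^t * v is the sum over blocks of A_b * (A_b^t * v). Each block can therefore be processed on its own and accumulated into a single result vector, so only one block plus the two vectors needs to be resident at a time. Below is a minimal CPU-only sketch in Python, assuming the block boundaries are given as half-open column ranges; the name `aat_times_v_blocked` is hypothetical, and the comment marks where a GPU version would upload the block.

```python
def aat_times_v_blocked(n_rows, col_ptr, row_idx, values, v, blocks):
    """Compute A * (A^t * v) one column block at a time (A in compressed column form)."""
    result = [0.0] * n_rows
    for start, end in blocks:
        # A GPU version would upload only this block's slice of row_idx/values
        # here; v and result would stay resident across blocks.
        for j in range(start, end):
            lo, hi = col_ptr[j], col_ptr[j + 1]
            # x_j = (column j of A) . v
            xj = 0.0
            for k in range(lo, hi):
                xj += values[k] * v[row_idx[k]]
            # result += x_j * (column j of A)
            for k in range(lo, hi):
                result[row_idx[k]] += values[k] * xj
    return result


# A = [[1, 0, 2],
#      [0, 3, 0]] split into column blocks {0, 1} and {2}.
blk_col_ptr = [0, 1, 2, 3]
blk_row_idx = [0, 1, 0]
blk_values = [1.0, 3.0, 2.0]
blk_result = aat_times_v_blocked(
    2, blk_col_ptr, blk_row_idx, blk_values, [1.0, 1.0], [(0, 2), (2, 3)]
)
```

Note that when the matrix does not fit on the card, every one of the several thousand products has to stream the whole matrix across the bus again. Whether that still beats the parallel CPU version depends on transfer bandwidth versus compute time and would have to be measured.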