I've been working with some rather large matrices and so optimizing matrix/vector calculations has become more important :)
For the past week I've been looking at MKL and similar BLAS libraries. The problem is that my code is in C# and my matrices are so big that they overwhelm the .NET 2 GB limit on object size. I've used the BigArray approach as a workaround, with good results. The remaining problem with MKL and similar libraries is that I need to send byte-aligned arrays to them via P/Invoke; doing this with Marshal.Copy means creating two copies of the same data, which I can't afford because of the size. There might be another way, using GCHandle.Alloc to pin the arrays in place, but I haven't yet figured out whether that solves my problem: moving data from BigArray to unmanaged byte-aligned arrays efficiently (fast, and with only one copy of the matrix in memory at any time).
My question is whether Cudafy can help with this. Basically I want to calculate A * A^t * v several thousand times, where A is a sparse matrix in compressed column (CSC) format. I currently do this in two loops (A^t * v = x, then A * x = result vector), which parallelize very nicely with .NET 4.
The matrices are about 8-14 GB in managed memory, stored as 1D arrays of doubles and ints (BigArray is a wrapper class that provides logical 1D array access via internal jagged arrays).
So the relevant questions are:

1. Can I avoid memory overhead when moving the data to the GPU (i.e. keep only one copy of the data in main memory at any time)?
2. If I have to break the matrix into blocks, is the overhead I would incur justified by the overall increase in performance?
3. Does Cudafy/CUDA offer any support for this? The relevant CUSPARSE function looks like cusparse<t>csrmv.
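One observation on the last point: the CSC arrays of A are exactly the CSR arrays of A^t, so a CSR routine like cusparse<t>csrmv can in principle compute both passes from a single copy of the matrix on the GPU — a non-transposed csrmv on the CSR view of A^t gives x = A^t * v, and a transposed csrmv on the same data gives A * x (though the transposed operation has historically been the slower path in cuSPARSE). A hedged sketch of the call sequence, using the legacy (pre-CUDA 11) csrmv API and assuming the handle, descriptor, and device arrays `d_val`/`d_rowptr`/`d_colind`/`d_v`/`d_x`/`d_result` already exist for an m x n matrix A:

```c
/* Sketch only, untested here; d_rowptr/d_colind/d_val are the CSC
 * arrays of A, reinterpreted as the CSR arrays of A^T (n x m). */
double one = 1.0, zero = 0.0;

/* x = A^T * v : non-transposed csrmv on the CSR view of A^T. */
cusparseDcsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
               n, m, nnz, &one, descr,
               d_val, d_rowptr, d_colind, d_v, &zero, d_x);

/* result = A * x : the same device data, transposed operation. */
cusparseDcsrmv(handle, CUSPARSE_OPERATION_TRANSPOSE,
               n, m, nnz, &one, descr,
               d_val, d_rowptr, d_colind, d_x, &zero, d_result);
```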
