
Interoperability Cusparse / Pinned memory

Apr 18, 2012 at 4:14 PM
Edited Apr 18, 2012 at 4:18 PM

Hi,

 

In the context of an image segmentation project, I implemented my own element-wise kernels and took advantage of sparse matrix-vector multiplication (using the CUSPARSE library). With standard non-pinned memory, everything worked fine. However, my application suffered from the cost of the matrix data transfer (~4,000,000 elements in double precision), so I wanted to take advantage of pinned memory, especially for this big transfer.

However, I got several errors when trying to use pinned memory.

 

First, I performed the following steps:

1) Allocate the memory on the GPU: 

 

double[] gpuGraph = this.gpu.Allocate<double>(graphLength);

2) Allocate page-locked host memory, used for streaming:

 

IntPtr gpuIntPtrGraph = this.gpu.HostAllocate<double>(graphLength);

3) Enable smart copy mode:

 

this.gpu.EnableSmartCopy();

4) Asynchronously copy the matrix data to the GPU:

 

this.gpu.CopyToDeviceAsync<double>(cpuGraph, 0, gpuGraph, 0, graphLength, streamId, gpuIntPtrGraph);

5) Then I need to allocate resources for the sparse representation on the GPU side in order to perform the matrix format conversion:

this.gpuNNZRows = gpu.Allocate<int>(iLength);

this.nnz = sparse.NNZ(iLength, iLength, gpuGraph, gpuNNZRows);

this.gpuVals = gpu.Allocate<double>(nnz);

this.gpuRows = gpu.Allocate<int>(iLength + 1);

this.gpuCols = gpu.Allocate<int>(nnz);

sparse.Dense2CSR(iLength, iLength, gpuGraph, gpuNNZRows, gpuVals, gpuRows, gpuCols);

However, it seems that no new resources can be allocated while smart copy mode is enabled; I got the error "CUDA.NET exception: ErrorInvalidContext" directly at the allocation of gpuNNZRows.

 

Second, I tried disabling smart copy mode just before the new allocations and re-enabling it afterwards:

this.gpu.DisableSmartCopy();

this.gpuNNZRows = gpu.Allocate<int>(iLength);

this.nnz = sparse.NNZ(iLength, iLength, gpuGraph, gpuNNZRows);

this.gpuVals = gpu.Allocate<double>(nnz);

this.gpuRows = gpu.Allocate<int>(iLength + 1);

this.gpuCols = gpu.Allocate<int>(nnz);

this.gpu.EnableSmartCopy();

sparse.Dense2CSR(iLength, iLength, gpuGraph, gpuNNZRows, gpuVals, gpuRows, gpuCols);

However, it seems that CUSPARSE cannot be called while smart copy mode is enabled; I got the error "Sparse error: AllocFailed" at Cudafy.Maths.SPARSE.CudaSPARSE.set_LastStatus(CUSPARSEStatus value).

 

Third, I then tried enabling smart copy mode right after the sparse.Dense2CSR call. Now that method completes successfully, but I got the error "SPARSE ERROR: MappingError" at Cudafy.Maths.SPARSE.CudaSPARSE.set_LastStatus(CUSPARSEStatus value) when calling sparse.csrmv (which performs the multiplication).

Finally, I tried enabling smart copy mode only during the CopyToDeviceAsync calls and disabling it afterwards. At this point, I got the error "CUDA.NET exception: ErrorInvalidContext" at Cudafy.Host.CudaGPU.OnCopyOnHostCompleted[T](IAsyncResult result). It seems that smart copy mode must remain enabled for the asynchronous copies to complete.

 

Therefore, I wonder about the interoperability between CUSPARSE and pinned memory in CUDAfy. So the question is: can pinned memory be used with CUSPARSE? If yes, how should I do it?

 

Thank you for your consideration,

 

Best Regards,

 

Sébastien Tourbier

Apr 20, 2012 at 12:08 AM

Hi Sébastien,

Smart copy will be deprecated in the next release. Its behavior is too erratic and its use too confusing. Can you try your code without it? It does mean copying the data manually into the host-allocated staging posts, and then you need to make sure this happens in parallel with the transfers to/from the device.

http://www.hybriddsp.com/Support/CudafyTutorials/FasterDataCopyingtoGPU.aspx

Nick

Apr 20, 2012 at 7:33 AM

Hi Nick,

As suggested, I tried my code without enabling smart copy mode, copying the data manually into the host-allocated staging posts instead, and it worked. In the end, my code looks like this:

int streamId = 1;

double[] gpuGraph = this.gpu.Allocate<double>(graphLength);

IntPtr gpuIntPtrGraph = this.gpu.HostAllocate<double>(graphLength);

gpuIntPtrGraph.Write<double>(cpuGraph);

this.gpu.CopyToDeviceAsync<double>(gpuIntPtrGraph, 0, gpuGraph, 0, graphLength, streamId);
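One thing the snippet above leaves implicit is waiting for the asynchronous copy to finish before CUSPARSE reads gpuGraph, and releasing the pinned buffer afterwards. A hedged sketch of those extra steps (SynchronizeStream, HostFree, and Free are CUDAfy host-API calls, used here on the assumption they behave as their names suggest):

```csharp
// Block until the copy queued on streamId has completed, so that
// sparse.NNZ / Dense2CSR below read fully transferred data.
this.gpu.SynchronizeStream(streamId);

// ... sparse.NNZ / sparse.Dense2CSR / sparse.csrmv as above ...

// Pinned (page-locked) memory is a scarce resource: release the
// staging buffer and the device array once they are no longer needed.
this.gpu.HostFree(gpuIntPtrGraph);
this.gpu.Free(gpuGraph);
```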

Thank you!

Seb

Apr 20, 2012 at 8:00 AM

And now to the business of putting this into a loop so you actually get an overall increase in performance! Remember to increment your streamId. Parallel Nsight is a good tool for checking what is going on here. Also, unless you have a Tesla or a Quadro 4000 or higher, you will not get overlapping of copy-to and copy-from operations - only these GPUs have dual copy engines.
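The looping idea above might be sketched like this - a sketch only, not tested code: the chunk layout and the offset overload of Write are my assumptions, and stream ids simply continue the convention from the earlier posts.

```csharp
// Sketch: split the transfer across several streams so that staging
// writes into pinned memory overlap with the async copies already
// queued on earlier streams.
int nStreams = 4;                          // hypothetical choice
int chunk = graphLength / nStreams;        // assumes it divides evenly

for (int s = 1; s <= nStreams; s++)        // stream ids start at 1
{
    int offset = (s - 1) * chunk;
    // Stage this chunk into the pinned buffer (offset overload of
    // Write is assumed here)...
    gpuIntPtrGraph.Write<double>(cpuGraph, offset, offset, chunk);
    // ...then queue its device copy on its own stream.
    gpu.CopyToDeviceAsync<double>(gpuIntPtrGraph, offset, gpuGraph, offset, chunk, s);
}

// Wait for all queued copies before using gpuGraph.
for (int s = 1; s <= nStreams; s++)
    gpu.SynchronizeStream(s);
```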

Nick