I am trying to use streams to run a few kernels concurrently. The problem I see with the current version of CUDAfy.NET is that when the host code adds requests to GPU streams, they are not submitted to the GPU until the SynchronizeStream() method is called. I confirmed
with Nsight performance analysis that this results in inefficient use of the GPU. I did some research online and learned about the cuStreamQuery function, which supposedly flushes the queued requests.
This function is currently not exposed by the CudaGPU class. Could it be added in the next version?
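
To illustrate what I mean, here is a rough sketch in native CUDA rather than CUDAfy.NET. It uses cudaStreamQuery, the runtime API counterpart of cuStreamQuery. The `scale` kernel, buffer sizes, and stream count are made up for the example. The flushing effect is something I have only read about (reportedly on Windows with the WDDM driver model) and, as far as I know, is not a documented guarantee, so please treat this as an assumption rather than confirmed behavior.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only for illustration.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    const int kStreams = 2;
    cudaStream_t streams[kStreams];
    float *buffers[kStreams];

    // One stream and one device buffer per concurrent kernel.
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buffers[s], n * sizeof(float));
        cudaMemsetAsync(buffers[s], 0, n * sizeof(float), streams[s]);
    }

    for (int s = 0; s < kStreams; ++s) {
        scale<<<(n + 255) / 256, 256, 0, streams[s]>>>(buffers[s], n, 2.0f);

        // Non-blocking status check. Assumption: as a side effect, this
        // nudges the driver to submit the queued work right away instead
        // of waiting for a later synchronize call.
        cudaError_t status = cudaStreamQuery(streams[s]);
        if (status != cudaSuccess && status != cudaErrorNotReady) {
            std::fprintf(stderr, "stream %d: %s\n", s, cudaGetErrorString(status));
            return 1;
        }
    }

    // Wait for all work to finish, then release resources.
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaFree(buffers[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```

cudaStreamQuery returns cudaSuccess when the stream is idle and cudaErrorNotReady while work is still pending, so neither value is treated as a failure here. What I am asking for is an equivalent call on CudaGPU that I could make after each launch.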