Recording byte locations with GPU

Oct 7, 2013 at 11:33 AM
I'm looking for a bit of advice on how to accurately record byte locations within a file using a multi-threaded kernel. Unfortunately, as a newbie to CUDAfy, I can't think of a method for recording the locations where the target is found that abides by the constraints of parallel execution.

Currently, my program loads the data to search through and the search target onto the GPU, and allocates memory for the result count and result locations. The result count works fine; however, I now need to record the location of each match as well.

I have thought about recording results into a temporary shared per-thread array (like my result count) and then collating the recorded locations at the end of each thread cycle. However, as far as I can tell, I would need to create a global array as large as the data I'm searching through to hold all possible results. Furthermore, since I don't believe the threads can maintain an accurate shared array index, the result array would be sparse, padded with 0s wherever the search target wasn't found. That would then require a single-threaded pass (most likely on the CPU) to compact the results into a contiguous array, which I'd love to avoid for the sake of performance.

Are there any methods, or tutorials, out there on recording per-thread output (like byte locations) into a single array? I'd also love to hear discussion on the best way to approach this problem :-)

Thanks!
Oct 7, 2013 at 5:49 PM
I see two ways to go about it:

1 - Divide your source buffer into many smaller buffers Bi; process each buffer Bi with a kernel launch (which will further subdivide it to accommodate the multiple threads/blocks); for each thread block, pre-allocate a fixed-size array as large as the maximum possible number of results for a single block, which each kernel run then fills; merge the results for buffer Bi with those of all previously processed buffers - possibly doing the merge on the CPU while the next kernel run is already underway.
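A rough CPU sketch of option 1, to make the shape concrete (function and variable names are my own invention; the real version would be a CUDAfy/CUDA kernel per buffer, with the per-block counter bumped via atomicAdd):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch of option 1: each "block" gets a fixed-size slot array sized for
// the worst case (every byte in its chunk matches) plus a per-block counter.
// On the GPU, threads within a block would reserve a slot index with
// atomicAdd on that counter; here a plain increment stands in for it.
// The host then merges only the first counts[b] entries of each slot array,
// so the unused gap entries never reach the caller.
std::vector<std::size_t> find_matches_per_block(
        const std::vector<unsigned char>& data,
        unsigned char target,
        std::size_t num_blocks) {
    std::size_t chunk = (data.size() + num_blocks - 1) / num_blocks;

    // Fixed-size per-block result slots, allocated up front.
    std::vector<std::vector<std::size_t>> slots(
        num_blocks, std::vector<std::size_t>(chunk));
    std::vector<std::size_t> counts(num_blocks, 0);

    // "Kernel": record each match at the block's next free slot.
    for (std::size_t b = 0; b < num_blocks; ++b)
        for (std::size_t i = b * chunk;
             i < std::min(data.size(), (b + 1) * chunk); ++i)
            if (data[i] == target)
                slots[b][counts[b]++] = i;  // atomicAdd(&counts[b], 1) on GPU

    // Host-side merge: copy back only the used prefix of each slot array.
    std::vector<std::size_t> merged;
    for (std::size_t b = 0; b < num_blocks; ++b)
        merged.insert(merged.end(), slots[b].begin(),
                      slots[b].begin() + counts[b]);
    return merged;
}
```

One caveat: on the GPU, atomicAdd gives no ordering guarantee, so the matches within a block will come back in arbitrary order; if you need sorted byte locations, sort each block's slice during the merge.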

or

2 - Run the algorithm once, returning only the number of matches found by each block; allocate a single global buffer in device memory large enough to hold the total number of matches reported by all blocks; then launch the kernels again, this time passing as an additional parameter the index at which each block may start storing its own results within the global buffer.
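And a CPU sketch of option 2, again with invented names (on the GPU this would be two kernel launches, with the prefix sum done on the host or via a scan primitive in between):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Sketch of option 2, the two-pass scheme:
//   Pass 1: each "block" counts the matches in its chunk of the data.
//   Host:   an exclusive prefix sum over those counts gives each block its
//           start offset in the global result buffer.
//   Pass 2: each block writes its match positions starting at its offset,
//           so the result buffer ends up exactly sized and gap-free.
std::vector<std::size_t> find_matches_two_pass(
        const std::vector<unsigned char>& data,
        unsigned char target,
        std::size_t num_blocks) {
    std::size_t chunk = (data.size() + num_blocks - 1) / num_blocks;

    // Pass 1: per-block match counts (first kernel launch on the GPU).
    std::vector<std::size_t> counts(num_blocks, 0);
    for (std::size_t b = 0; b < num_blocks; ++b)
        for (std::size_t i = b * chunk;
             i < std::min(data.size(), (b + 1) * chunk); ++i)
            if (data[i] == target) ++counts[b];

    // Host: exclusive prefix sum -> each block's write offset.
    std::vector<std::size_t> offsets(num_blocks, 0);
    std::exclusive_scan(counts.begin(), counts.end(), offsets.begin(),
                        std::size_t{0});

    // Allocate the exact-sized global buffer, then pass 2 (second launch):
    // block b writes at offsets[b], offsets[b] + 1, ...
    std::vector<std::size_t> results(offsets.back() + counts.back());
    for (std::size_t b = 0; b < num_blocks; ++b) {
        std::size_t out = offsets[b];
        for (std::size_t i = b * chunk;
             i < std::min(data.size(), (b + 1) * chunk); ++i)
            if (data[i] == target) results[out++] = i;
    }
    return results;
}
```

The trade-off versus option 1 is scanning the data twice in exchange for allocating only exactly as much result memory as you need, with no post-hoc compaction step.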

cheers