Running multiple OpenCL processors with multiple streams

Aug 9, 2016 at 6:03 PM
Edited Aug 9, 2016 at 6:24 PM
Hello,

I am trying to write a simulator. My first order of business is to be able to use all capable processors (I have two on the machine I am testing):

Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
Intel(R) HD Graphics 4600

What I have done, and what is not working, is create N threads for each processor. Each thread gets a reference to the CudafyModule, the GPGPU, and a stream ID, so that I end up with:

Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz #1
Intel(R) HD Graphics 4600 #1
Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz #2
Intel(R) HD Graphics 4600 #2
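
In other words, one worker thread per (device, stream) pair, roughly like this (a sketch; `SimWorker`, `devices`, and `module` are my own names, and the API is CUDAfy.NET's):

```csharp
using System.Threading;

// Spawn N worker threads per device; each worker owns one stream ID
// but shares that device's single GPGPU instance.
foreach (GPGPU gpu in devices)
{
    for (int streamID = 1; streamID <= N; streamID++)
    {
        int id = streamID; // capture a copy for the closure
        GPGPU device = gpu;
        new Thread(() => SimWorker(module, device, id)).Start();
    }
}
```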

Then I do:
        gpu.CreateStream(N);
        Foo[] s = new Foo[1];
        IntPtr h = gpu.HostAllocate<Foo>(1);
        Foo[] d = gpu.Allocate<Foo>(s);
        while (true)
        {
            //host -> pinned
            GPGPU.CopyOnHost(s, 0, h, 0, 1);
            //pinned -> device
            gpu.CopyToDeviceAsync(h, 0, d, 0, 1, N);
            //run
            gpu.Launch(grid, block).sim(d);
            //WAIT
            gpu.SynchronizeStream(N);
        }
        //(Release resources)

I get some really wonky behavior.

Sometimes I get an error that the module hasn't loaded; sometimes it works; sometimes the app just disappears without breaking into the debugger; sometimes it does break into the debugger; sometimes I get a "vshost32.exe has stopped working" dialog box... usually a few of these in combination.

Am I making a logic mistake here?
Aug 10, 2016 at 3:47 PM
Edited Aug 10, 2016 at 3:59 PM
I've simplified my code a fair bit, and have played around with some variables to ascertain what is happening.

First of all, everything seems to work (kind of) okay when I run a single stream on each GPU. Multiple GPUs using the same stream index is fine (as expected).

The 'kind of' is because I pass a structure: I set its value first, copy it to pinned host memory, then to the device, set it on the device, copy it back to pinned memory, and finally back to the host. Usually it ends up with the new value, but not always.

NOTE: I do see Resource Monitor throw up a bunch of Hard Faults/sec when the app is started, but it goes away completely if the app is left running. This likely has something to do with the pinned host memory causing havoc with memory paging. I'm not sure and not worried about it, but it seemed worth mentioning.

The code in question:
        gpu.CreateStream(streamID);
        Foo[] sF = new Foo[count];
        IntPtr hF = gpu.HostAllocate<Foo>(count);
        Foo[] dF = gpu.Allocate<Foo>(sF);

        sF[0].Bar = 8;

        int begin = sF[0].Bar;

        //host -> pinned
        GPGPU.CopyOnHost<Foo>(sF, 0, hF, 0, count);
        //pinned -> device
        gpu.CopyToDeviceAsync<Foo>(hF, 0, dF, 0, count, streamID);
        //run
        gpu.Launch().simulation(dF);
        //device -> pinned
        gpu.CopyFromDeviceAsync<Foo>(dF, 0, hF, 0, count, streamID);

        //WAIT
        gpu.SynchronizeStream(streamID);

        //pinned -> host
        GPGPU.CopyOnHost<Foo>(hF, 0, sF, 0, count);

        int end = sF[0].Bar;

        gpu.Free(dF);
        gpu.HostFree(hF);
        gpu.DestroyStream(streamID);
Sometimes end == begin, which should never be the case. This happens about 5-6% of the time.

All of this occurs on a host thread, which I have run once per GPU with no issue.

However, as soon as I start spawning 2 threads, where each is passed the same reference to the GPGPU, the application will stop.

Most of the time it just vanishes: no debugging, the app is simply terminated. Occasionally I get the vshost32.exe has stopped responding dialog, but rarely.

If I open multiple copies of the app, each running a single thread, it works as expected. So it's not a limitation of the GPU.

Any ideas how to resolve this?

Thanks,

-Phil

EDIT:
I have played with the code some more. It seems the launch of the kernel is the cause: you can copy the memory back and forth all you want, but launching the kernel messes things up, even if the kernel has no code.

I would also note that getting the GPGPU by device ID from CudafyHost returns the same object. So whether I pass the GPU to the thread, or pass the device ID and fetch it from CudafyHost, it's the same object with the module already loaded, and it still has issues.
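For what it's worth, this is roughly what I mean (a sketch against CUDAfy's CudafyHost; `deviceId` is whatever index you enumerated):

```csharp
// CudafyHost caches one GPGPU instance per device, so both calls
// below hand back the same object, loaded module and all.
GPGPU a = CudafyHost.GetDevice(eGPUType.OpenCL, deviceId);
GPGPU b = CudafyHost.GetDevice(eGPUType.OpenCL, deviceId);
// Object.ReferenceEquals(a, b) is true
```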
Aug 10, 2016 at 4:41 PM
Edited Aug 10, 2016 at 4:58 PM
I think I found the solution. I think writing about it helped me get through it.

The only difference between doing this on multiple threads or a single thread is that the async calls are being queued at the same time.

I modified my code to be:
            ...
            int begin = sF[0].Bar;

            //host -> pinned
            GPGPU.CopyOnHost<Foo>(sF, 0, hF, 0, count);
            lock (gpu)
            {
                //pinned -> device
                gpu.CopyToDeviceAsync<Foo>(hF, 0, dF, 0, count, streamID);
                //run
                gpu.Launch().simulation(dF);
                //device -> pinned
                gpu.CopyFromDeviceAsync<Foo>(dF, 0, hF, 0, count, streamID);
            }
            //WAIT
            gpu.SynchronizeStream(streamID);

            //pinned -> host
            GPGPU.CopyOnHost<Foo>(hF, 0, sF, 0, count);

            int end = sF[0].Bar;
            ...

I could have put the lock around just the launch, as that's where I am having the issue. But all three calls (copy to, launch, and copy from) are async calls.

So I can't queue the calls in parallel, but the queuing itself is quick, so I don't imagine there's a lot of overlap (once I beef up the simulation to take more than 0 ms to execute).

I haven't found the reason for the data being incorrect. In playing with some variables, I have some confusing results.

I run 10,000 runs and print the number of errors that occurred.

When I run a single GPU with a single thread, I get few or no errors (fewer than 10 per 10,000).

When I run two GPUs with a single thread, I get about 25, sometimes up to a hundred. Once every 6-10 'sets' of 10,000, a set will error for all 10,000 runs.
This seems to occur at a regular interval; when I time it, it's fairly consistent, and I don't understand why. I'm surprised the adjacent 'sets' don't show a large spike. It's as if, as soon as I start counting over, it knows to error all the time until the next time I start counting again.

Here is some output (Note: these are two different GPUs; I haven't distinguished between them when printing):

Stream #1 did 10000 runs with 37 errors.
Stream #1 did 10000 runs with 59 errors.
Stream #1 did 10000 runs with 31 errors.
Stream #1 did 10000 runs with 54 errors.
Stream #1 did 10000 runs with 21 errors.
Stream #1 did 10000 runs with 18 errors.
Stream #1 did 10000 runs with 10000 errors.
Stream #1 did 10000 runs with 25 errors.
Stream #1 did 10000 runs with 14 errors.
Stream #1 did 10000 runs with 12 errors.
Stream #1 did 10000 runs with 13 errors.
Stream #1 did 10000 runs with 16 errors.
Stream #1 did 10000 runs with 10000 errors.
Stream #1 did 10000 runs with 21 errors.
Stream #1 did 10000 runs with 9 errors.
Stream #1 did 10000 runs with 22 errors.
Stream #1 did 10000 runs with 16 errors.
Stream #1 did 10000 runs with 16 errors.
Stream #1 did 10000 runs with 18 errors.
Stream #1 did 10000 runs with 10000 errors.
Stream #1 did 10000 runs with 11 errors.
Stream #1 did 10000 runs with 9 errors.
Stream #1 did 10000 runs with 16 errors.
Stream #1 did 10000 runs with 17 errors.
Stream #1 did 10000 runs with 29 errors.
Stream #1 did 10000 runs with 21 errors.
Stream #1 did 10000 runs with 10000 errors.

After the structure is copied to the pinned memory, I tried setting sF[0].Bar = 2. That way, if sF were not being written back, it would be obvious.

It turns out I still get the value of 8, so sF[0].Bar is being set, and hF is being copied to sF. But is hF being copied from dF? Is dF correct after SynchronizeStream? This is hard to know. I can't set hF after CopyToDeviceAsync because it's an async call; I don't know when the simulation will run, or when the device is copying memory back to the pinned buffer. It's very confusing.
Aug 12, 2016 at 3:26 PM
I have found the answer to the issue; it seems obvious now.

I changed the launch to include the stream:
                gpu.Launch(gridsize,blocksize,streamID).simulation(dF);
Stupid little mistake, but it's fixed now.

The lock is still required (you can't queue the GPU in parallel, but you can wait for the queue in parallel). I now have 10 streams running on both GPGPUs with 0 errors.
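To summarize for anyone who lands here, the working per-thread pattern ends up looking roughly like this (a sketch; the `running` flag stands in for your own loop control, and `gridsize`/`blocksize` are your launch dimensions):

```csharp
// Per-thread worker: each thread owns its own stream ID and buffers,
// but shares the single GPGPU instance for its device.
gpu.CreateStream(streamID);
Foo[] sF = new Foo[count];
IntPtr hF = gpu.HostAllocate<Foo>(count);
Foo[] dF = gpu.Allocate<Foo>(sF);

while (running)
{
    //host -> pinned (safe outside the lock; touches only our buffers)
    GPGPU.CopyOnHost<Foo>(sF, 0, hF, 0, count);

    // Queuing work on the GPU is not thread-safe, so serialize
    // the three queuing calls...
    lock (gpu)
    {
        gpu.CopyToDeviceAsync<Foo>(hF, 0, dF, 0, count, streamID);
        gpu.Launch(gridsize, blocksize, streamID).simulation(dF); // stream-aware launch
        gpu.CopyFromDeviceAsync<Foo>(dF, 0, hF, 0, count, streamID);
    }

    // ...but waiting on our own stream can happen in parallel.
    gpu.SynchronizeStream(streamID);

    //pinned -> host
    GPGPU.CopyOnHost<Foo>(hF, 0, sF, 0, count);
}

gpu.Free(dF);
gpu.HostFree(hF);
gpu.DestroyStream(streamID);
```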