OpenCL Crash occurring when processing on multiple streams

Jul 12, 2013 at 4:46 PM
Hi Gurus of Cudafy,

I come forth with a somewhat annoying issue. I'm working with multiple streams to scan through a file for an array of multiple keywords. When Cuda is used as the target platform, it completes the 19 buffer segments of my test file without an issue and reports back the correct result. When OpenCL is used, it gets through 10 buffer segments and then hangs the GPU (violently).

I'm new to Cudafy and parallel computing, so it may just be a "schoolboy error" in one of the functions or methods I use, but I would be incredibly grateful if anyone knows why OpenCL is crashing when executing this code.

(To note, I realise the storage of the arrays on the GPU may raise some eyebrows. It's because the target and lookup arrays are jagged and it was the only method I could think of utilising them.)
                String path = txtFile.Text; // Target File
                uint[] results = new uint[1]; // Number of Matches Found
                results[0] = 0; // Reset results to 0

                searchTarget.Clear();  // Clear Search Targets
                GetFileType(cboFileType.SelectedItem.ToString()); // Get Search Targets
                Byte[][] target = new Byte[searchTarget.Count][]; // Translate Search Targets into Bytes
                for (int i = 0; i < target.Length; i++)
                    target[i] = GetBytes(searchTarget[i]);
                int[][] lookup = lookupCreate(target); // Create Lookup for Target

                int chunkSize = 314572800; // 300MB segment size
                byte[] buffer = new byte[chunkSize]; // Buffer creation using segment size
                double count = 0; // Number of bytes read
                int chunkCount = 0; // Number of segments processed

                int[] lastByte = new int[target.Length];
                for (int i = 0; i < lastByte.Length; i++)
                    lastByte[i] = target[i].Last(); // Last byte of target to be passed to GPU

                // Set GPU device as one selected in interface
                //CudafyModes.Target = eGPUType.Emulator; // Set Target to OpenCL
                GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);

                // Ensure if using Cuda, use 2.0 architecture for Atomics compatibility
                CudafyModule km;
                if (TestType == "CUDA")
                    km = CudafyTranslator.Cudafy(eArchitecture.sm_20);
                else
                    km = CudafyTranslator.Cudafy();

                // Find out maximum GPU Blocks and Supported Threads for each Block
                GPGPUProperties prop = gpu.GetDeviceProperties();
                int gpuBlocks = prop.WarpSize;
                int blockThreads = prop.MaxThreadsPerBlock;

                // Allocate the memory on the GPU for results
                uint[] dev_results = gpu.Allocate<uint>(results);
                gpu.Set(dev_results);   // Initialise and set results to 0

                // Copy target and lookup arrays to GPU for analysis
                byte[][] dev_target = new byte[target.Length][];
                int[][] dev_lookup = new int[target.Length][];
                for (int i = 0; i < target.Length; i++)
                {
                    dev_target[i] = gpu.CopyToDevice<byte>(target[i]);
                    dev_lookup[i] = gpu.CopyToDevice<int>(lookup[i]);
                }

                // Start stopwatch, open file defined by user
                double time = MeasureTime(() =>
                {
                    using (FileStream DDStream = new FileStream(path, FileMode.Open, FileAccess.Read))
                    {
                        int bytesRead;      // Number of bytes read into the buffer
                        while ((bytesRead = DDStream.Read(buffer, 0, buffer.Length)) > 0)   // Read into the buffer until end of file
                        {
                            chunkCount++;                                       // For each buffer used, increment count
                            byte[] dev_buffer = gpu.CopyToDevice<byte>(buffer); // Copy buffer contents to GPU for processing
                            count += buffer.Length;                             // Record the length of the bytes in the buffer
                            int size = buffer.Length;                           // Find length of buffer to pass to GPU Kernel
                            int blockSize = Math.Min(blockThreads, (int)Math.Ceiling(size / (float)blockThreads));  //Find the optimum size of the threads to handle the buffer
                            //gpu.Synchronize();                                  // GPU Synchronize

                            for (int i = 0; i < target.Length; i++)
                            {
                                for (int x = 1; x < (target[i].Length + 1); x++)   // Check to ensure the first target byte is not located at the end of the segment
                                {
                                    if (buffer[buffer.Length - x] == target[i][0]) // If it is:
                                    {
                                        txtOutput.Text = txtOutput.Text + " [!] Start of Target found at End of File Segment " + chunkCount + " - Rewinding by " + (target[i].Length - 1) + " bytes\r\n";
                                        DDStream.Position -= (target[i].Length - 1);   // Rewind the position the length of the target size
                                    }
                                }
                                gpu.Launch(gpuBlocks, blockSize, i + 1).Analyse(dev_buffer, size, dev_target[i], lastByte[i], dev_lookup[i], dev_results);  // Start the analysis of the buffer
                            }
                            for (int i = 0; i < target.Length; i++)
                                gpu.SynchronizeStream(i + 1);
                            gpu.Free(dev_buffer);                               // Clear the buffer from GPU memory
                            for (int i = 0; i < target.Length; i++)
                                gpu.DestroyStream(i + 1);
                            Array.Clear(buffer, 0, buffer.Length);              // Clear the buffer to be filled again
                        }
                        DDStream.Dispose();                                     // Close the file
                    }
                });

                gpu.CopyFromDevice(dev_results, results);                       // Copy results back from GPU

                gpu.FreeAll();                                                  // Free all GPU resources
Jul 12, 2013 at 4:51 PM
What do you mean by "hangs the GPU (violently)"? Do you mean Windows forces a reset of the display driver? That's usually because of a kernel timeout. OpenCL is slightly slower than CUDA, so that may be the cause.
Jul 12, 2013 at 5:02 PM
It either crashes the display driver to an unrecoverable state, or the driver recovers and an exception is thrown.

Is there any way to stop the kernel timing out when throwing multiple threads at it? When using a single target keyword, OpenCL seems to work fine. Issues only come about when I throw 5 keywords at it.

I do find it odd still that when debugging and stopping the program every time it clears the buffer array (Array.Clear(buffer, 0, buffer.Length);), it always runs fine until the 11th buffer is attempted. That hints to me that it may not be an unresponsive kernel.
Jul 12, 2013 at 5:40 PM
"Is there any way to stop the kernel timing out when throwing multiple threads at it? "

Yeah, look around this forum, you'll find plenty of instructions on how to proceed.
In short, you can
  • modify the registry
  • reduce the time your kernels run
  • get another GPU and set it as the primary display adapter in Windows (and also modify the registry, as in the first item).
Jul 15, 2013 at 1:03 PM
Edited Jul 15, 2013 at 1:04 PM
Hi pedritolo, thanks again for your help and advice so far,

The mystery continues, unfortunately. I don't believe the issue is caused by the kernel timing out. After extensive testing with an external graphics card using your advice, the function operating under OpenCL only crashes with arrays of size 4, 5 or 6. Any array size outside this range (< 4 or > 6) is fine and completes successfully. Rather than take the short route and only throw it arrays outside of these lengths, I would like to try to find out why it's acting differently for certain array sizes under OpenCL.

This is an example error encountered when throwing it an array of size 5:

Cloo.MemoryObjectAllocationFailureComputeException was unhandled
Message=OpenCL error code detected: MemoryObjectAllocationFailure.
   at Cloo.ComputeException.ThrowOnError(ComputeErrorCode errorCode)
   at Cloo.ComputeCommandQueue.Execute(ComputeKernel kernel, Int64[] globalWorkOffset, Int64[] globalWorkSize, Int64[] localWorkSize, ICollection`1 events)
   at Cudafy.Host.OpenCLDevice.DoLaunch(dim3 gridSize, dim3 blockSize, Int32 streamId, KernelMethodInfo gpuMI, Object[] arguments)
   at Cudafy.Host.GPGPU.LaunchAsync(dim3 gridSize, dim3 blockSize, Int32 streamId, String methodName, Object[] arguments)
   at Cudafy.Host.DynamicLauncher.TryInvokeMember(InvokeMemberBinder binder, Object[] args, Object& result)
   at CallSite.Target(Closure , CallSite , Object , Byte[] , Int32 , Byte[] , Int32 , Int32[] , UInt32[] )
   at OpenDD.TestInterface.<>c__DisplayClass19.<BetaDDCommand>b__16() in c:\Users\Ethan\Documents\Visual Studio 2012\Projects\OpenDD\OpenDD\TestInterface.cs:line 1130
   at OpenDD.TestInterface.MeasureTime(Action action) in c:\Users\Ethan\Documents\Visual Studio 2012\Projects\OpenDD\OpenDD\TestInterface.cs:line 442
   at OpenDD.TestInterface.BetaDDCommand() in c:\Users\Ethan\Documents\Visual Studio 2012\Projects\OpenDD\OpenDD\TestInterface.cs:line 1103
   at OpenDD.TestInterface.btnDDTest_Click(Object sender, EventArgs e) in c:\Users\Ethan\Documents\Visual Studio 2012\Projects\OpenDD\OpenDD\TestInterface.cs:line 224
   at System.Windows.Forms.Button.OnMouseUp(MouseEventArgs mevent)
   at System.Windows.Forms.Control.WmMouseUp(Message& m, MouseButtons button, Int32 clicks)
   at System.Windows.Forms.Control.WndProc(Message& m)
   at System.Windows.Forms.ButtonBase.WndProc(Message& m)
   at System.Windows.Forms.Button.WndProc(Message& m)
   at System.Windows.Forms.NativeWindow.DebuggableCallback(IntPtr hWnd, Int32 msg, IntPtr wparam, IntPtr lparam)
   at System.Windows.Forms.UnsafeNativeMethods.DispatchMessageW(MSG& msg)
   at System.Windows.Forms.Application.ComponentManager.System.Windows.Forms.UnsafeNativeMethods.IMsoComponentManager.FPushMessageLoop(IntPtr dwComponentID, Int32 reason, Int32 pvLoopData)
   at System.Windows.Forms.Application.ThreadContext.RunMessageLoopInner(Int32 reason, ApplicationContext context)
   at System.Windows.Forms.Application.ThreadContext.RunMessageLoop(Int32 reason, ApplicationContext context)
   at OpenDD.Program.Main() in c:\Users\Ethan\Documents\Visual Studio 2012\Projects\OpenDD\OpenDD\Program.cs:line 19
   at System.AppDomain._nExecuteAssembly(RuntimeAssembly assembly, String[] args)
   at Microsoft.VisualStudio.HostingProcess.HostProc.RunUsersAssembly()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.ThreadHelper.ThreadStart()
Jul 15, 2013 at 1:16 PM
Any chance you could share your code? Preferably a simplified version that still exhibits the error? A potential cuda/opencl mismatch needs to be looked into, imo.
Jul 15, 2013 at 2:32 PM
Edited Jul 15, 2013 at 2:40 PM
I tried to create a simplified version, however it's handling it fine and completing with an array of 4, 5 and 6.

Looking back at the code (original post), if I change the chunkSize to 400MB (419430400), it works fine. If I reduce it to anything less than 300MB, it produces the same error as shown above, but in a different chunk (relative to the chunkSize). It's strange. I can't deny that something isn't right, but without sending you the full source and a 5.5 GB test file, I'm afraid you wouldn't be able to reproduce it. However, is there anything that could be forcing OpenCL to "cache" data under a certain size?

Edit: It might not be relevant, however, changing the chunkSize to 400MB also significantly alters the time required to complete the analysis.
Jul 15, 2013 at 3:02 PM
And once you have your simplified version that works fine, can you start adding stuff (working towards rebuilding your original code) until you find that one thing that results in an exception?

You're saying that it works well with large buffers, but crashes with small buffers? Then couldn't it be a mem overflow error?

"Edit: It might not be relevant, however, changing the chunkSize to 400MB also significantly alters the time required to complete the analysis."

Changes in what way?
Jul 15, 2013 at 3:23 PM
It takes almost twice as long to complete the analysis 400 MB at a time rather than 300 MB at a time.

Memory overflow error would be unusual for the circumstance - the only difference of the data sizes being loaded is target (list of keywords) and lookup (needed for boyer-moore byte analysis). I can't understand why an overflow could appear when it can handle targets and lookups larger than the problematic range (4-6).

I've managed to recreate the error to the simplest scale, source code is below. Just change the Path variable in the code to a large file, any should do as long as it's larger than 4 GB.
Jul 15, 2013 at 3:23 PM
Edited Jul 15, 2013 at 5:24 PM
Edit: No longer needed
Jul 15, 2013 at 4:30 PM
So that I may reproduce the problem, would you mind terribly replacing the part where one loads a byte[] from a stream, and instead populate said buffer through a call to methods in the Random class? Make sure to specify the seed in the constructor, so I may replicate exactly your data sequence.
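
A minimal sketch of what is being asked for here: filling the buffer from a seeded generator instead of a file, so that both sides get the same byte sequence. The helper name `FillDeterministic`, the seed 12345 and the 1 MB length are illustrative assumptions, not values from this thread, and the sequence is only expected to match when both sides run the same .NET runtime.

```csharp
using System;
using System.Linq;

public static class DeterministicBuffer
{
    // Hypothetical helper: returns `length` pseudo-random bytes.
    // The same seed yields the same bytes on the same runtime.
    public static byte[] FillDeterministic(int length, int seed)
    {
        var buffer = new byte[length];
        var rng = new Random(seed);   // fixed seed => reproducible sequence
        rng.NextBytes(buffer);
        return buffer;
    }

    public static void Main()
    {
        // Assumed values, chosen only for illustration.
        byte[] first = FillDeterministic(1 << 20, 12345);
        byte[] second = FillDeterministic(1 << 20, 12345);
        Console.WriteLine("Buffers identical: " + first.SequenceEqual(second));
    }
}
```

The resulting array can be passed to CopyToDevice in place of the buffer read from the FileStream.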
Jul 15, 2013 at 5:23 PM
Ah ha, thanks again pedritolo, issue solved. When I generated the random byte test, it completed without issue, which made me suspect something with the input.

Looking back on the original code, when I was reading in from the file, I had a check to ensure that the target keyword was not detected at the end, if it was - the program would rewind the read position back a number of bytes for the next buffer to scan, which then scans the end of the last buffer to ensure the target isn't there.

It seems that rewinding the position by an odd number of bytes caused the issue, probably because it gave the next buffer an abnormal load, which in turn caused the GPU under OpenCL to trigger a memory violation/overflow. CUDA must have some protection against this, since it did not produce the same error.
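
For anyone hitting the same thing, here is a rough sketch of one alternative to rewinding the stream: carry the last (pattern length - 1) bytes over to the front of the next buffer, and hand the kernel the number of valid bytes rather than buffer.Length. This is not the code from the original post. `ChunkedScan`, `Scan` and `CountMatches` are hypothetical names, it handles a single keyword only, and `CountMatches` is a plain CPU loop standing in for the GPU kernel.

```csharp
using System;
using System.IO;
using System.Text;

public static class ChunkedScan
{
    // CPU stand-in for the kernel: only the first `valid` bytes are examined.
    static int CountMatches(byte[] buffer, int valid, byte[] pattern)
    {
        int count = 0;
        for (int i = 0; i + pattern.Length <= valid; i++)
        {
            int j = 0;
            while (j < pattern.Length && buffer[i + j] == pattern[j]) j++;
            if (j == pattern.Length) count++;
        }
        return count;
    }

    public static int Scan(Stream input, byte[] pattern, int chunkSize)
    {
        if (pattern.Length == 0) throw new ArgumentException("pattern must not be empty");
        int overlap = pattern.Length - 1;      // bytes carried into the next chunk
        if (chunkSize <= overlap) throw new ArgumentException("chunkSize too small");

        byte[] buffer = new byte[chunkSize];
        int carried = 0, total = 0;
        while (true)
        {
            int read = input.Read(buffer, carried, buffer.Length - carried);
            if (read <= 0) break;
            int valid = carried + read;        // pass this, not buffer.Length
            total += CountMatches(buffer, valid, pattern);

            // Keep the tail so a match straddling the boundary is still seen.
            carried = Math.Min(overlap, valid);
            if (carried > 0) Array.Copy(buffer, valid - carried, buffer, 0, carried);
        }
        return total;
    }

    public static void Main()
    {
        byte[] data = Encoding.ASCII.GetBytes("ABCABCxxABC");
        byte[] pattern = Encoding.ASCII.GetBytes("ABC");
        // A tiny chunk size forces matches to straddle chunk boundaries.
        Console.WriteLine(Scan(new MemoryStream(data), pattern, 4));
    }
}
```

Because the overlap is one byte shorter than the keyword, a match can never lie entirely inside the carried-over region, so nothing is counted twice. With several keywords of different lengths the overlap would need more care.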

Thanks once again Pedritolo :-)
Jul 15, 2013 at 5:38 PM
Ah so it was indeed a buffer overflow :)
I'm glad I could help
Jul 18, 2013 at 12:08 PM
Edited Jul 18, 2013 at 4:03 PM
It seems to me like I'm getting another issue with Nvidia hardware (ATI and Intel work well with the same kernel).

I've located the issue to be with the byte comparison when a lastByte of 0 is set. Is there any (safer) way of doing a byte comparison on GPUs if the byte searched for is equal to 0?

Edit: Clueless on why, but reversing the boyer-moore algorithm made it behave
Jun 5, 2014 at 5:15 AM
I'm also getting the MemoryObjectAllocationFailure, but only on AMD not Nvidia, and my AMD card has more memory than my Nvidia one.
When you say a buffer overflow - are you saying that the kernel is referring to an array element which is out of bounds?
Jun 6, 2014 at 6:57 PM
Yes, pretty much. Some devices/OSs don't/won't/can't check for it.
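
To illustrate the kind of guard meant here, below is a small CPU emulation of a kernel thread that clamps its range before indexing. It is only a sketch: `GuardedKernel`, `AnalyseThread` and `Run` are made-up names, `tid` and `stride` stand in for the thread index and total thread count, and a real kernel would need an atomic add for the result counter.

```csharp
using System;
using System.Text;

public static class GuardedKernel
{
    // Emulates one GPU thread. `size` is whatever the host claimed the
    // buffer length to be, which may be wrong.
    public static void AnalyseThread(int tid, int stride, byte[] buffer, int size,
                                     byte[] target, uint[] results)
    {
        if (target.Length == 0) return;

        // Clamp to the real allocation so no read can go past the end.
        int limit = Math.Min(size, buffer.Length) - target.Length;
        for (int start = tid; start <= limit; start += stride)
        {
            int j = 0;
            while (j < target.Length && buffer[start + j] == target[j]) j++;
            if (j == target.Length) results[0]++;   // atomic add on a real device
        }
    }

    // Runs the emulated threads one after another on the CPU.
    public static uint Run(byte[] buffer, int size, byte[] target, int threads)
    {
        var results = new uint[1];
        for (int tid = 0; tid < threads; tid++)
            AnalyseThread(tid, threads, buffer, size, target, results);
        return results[0];
    }

    public static void Main()
    {
        byte[] buffer = Encoding.ASCII.GetBytes("ABxxAB");
        byte[] target = Encoding.ASCII.GetBytes("AB");
        // The host deliberately passes a size larger than the buffer.
        Console.WriteLine(Run(buffer, 100, target, 4));
    }
}
```

In managed code an unguarded read past the end would throw an IndexOutOfRangeException; on a device the same read is undefined and may pass silently, corrupt memory or bring down the driver.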
Jun 6, 2014 at 11:22 PM
If I rerun the same program again I don't (necessarily) get the error again, and if I run the emulator it doesn't ever seem to be out of bounds.
Are there any other possible reasons for the error?
Jun 7, 2014 at 9:14 AM
Lack of contiguous available memory on the device, perhaps? One thing is the global memory capacity, the other is the maximum contiguous buffer size the device can allocate at one time, due to mem fragmentation.
My suggestion? Try a simple mem alloc example project, and then complexify it until it starts showing your unwanted behavior.
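
As a rough illustration of the difference: OpenCL reports the total global memory (CL_DEVICE_GLOBAL_MEM_SIZE) and the largest single allocation (CL_DEVICE_MAX_MEM_ALLOC_SIZE) as separate device properties, and the second is often only a fraction of the first. The sketch below just does the arithmetic. The 256 MiB limit is an assumed example figure, not a value measured on anyone's card in this thread, and `AllocationCheck` and its methods are hypothetical helpers.

```csharp
using System;

public static class AllocationCheck
{
    // True if a single buffer of `requestedBytes` fits under the device's
    // per-allocation limit (which the caller must query from the device).
    public static bool FitsSingleAllocation(long requestedBytes, long maxAllocBytes)
    {
        return requestedBytes > 0 && requestedBytes <= maxAllocBytes;
    }

    // Number of separate buffers needed to hold `totalBytes`.
    public static long ChunksNeeded(long totalBytes, long maxAllocBytes)
    {
        if (maxAllocBytes <= 0) throw new ArgumentOutOfRangeException("maxAllocBytes");
        return (totalBytes + maxAllocBytes - 1) / maxAllocBytes;   // ceiling division
    }

    public static void Main()
    {
        long assumedMaxAlloc = 256L * 1024 * 1024;     // ASSUMPTION: example limit only
        long floatArrayBytes = 3000000L * sizeof(float);
        long bigChunkBytes = 314572800L;               // a 300 MB buffer

        Console.WriteLine(FitsSingleAllocation(floatArrayBytes, assumedMaxAlloc));
        Console.WriteLine(FitsSingleAllocation(bigChunkBytes, assumedMaxAlloc));
        Console.WriteLine(ChunksNeeded(bigChunkBytes, assumedMaxAlloc));
    }
}
```

A size check like this cannot detect fragmentation, so an allocation that passes it can still fail on a device whose free memory is not contiguous.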
Jun 8, 2014 at 11:54 PM
Ok I've had a look at the memory sizes being used.
My code loops through this pseudo loop:
CopyToDevice(float[3000000]);    // i.e. a float array of length 3 million
I get the error occasionally on the 3rd call (6 million length). I will then get it for the next 1-2 loops, after which it will work again.

Doesn't this mean the size of memory being allocated is not the problem? Or am I missing something?
Jun 10, 2014 at 8:20 PM
What happens if you don't launch a kernel (i.e., remove the Launch() pseudocode)?
Jun 11, 2014 at 5:29 AM
it seems to run without an error
Jun 14, 2014 at 9:34 AM
Is your kernel doing anything "bad"?
Jun 14, 2014 at 9:18 PM
I'm pretty sure not.
It runs without error on NVidia.
It runs without error on emulator (for as long as I have let it run for).
Errors are inconsistent on AMD - will fail, then run again later.
Seems that the error occurs less frequently if I Sleep for 50ms or so after each FreeAll.
Jun 15, 2014 at 10:01 AM
ok. What then if you replace your current kernel with a new trivial (perhaps even empty) one?