
Max Values for GPU Launch?

Jul 10, 2013 at 9:40 PM
Edited Jul 10, 2013 at 9:43 PM
What are the max values that can be used for "gpu.Launch(X, Y)" aka - what are the maximum values for X and Y?

I'm seeing X = 2^31-1 and Y = 1024 if I understand everything correctly?

Are there any other ways to launch an incredibly large number of threads? I essentially have 61 million threads that I want to launch. What would be the best way to do this?

For reference, I'm currently using a CUDA 3.0 capable card, but can change to a 3.5 if absolutely necessary.
(I probably will later for performance reasons, but I was trying to get it all working before having to invest, as they're not cheap.)
Jul 10, 2013 at 10:31 PM
So, in an attempt to get past this restriction, I only launch 1 thread per thousand "threads" I want to run, and then have a for loop inside the kernel that iterates through the 1000 threads. Unfortunately this results in the error "ErrorLaunchTimeOut".
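As an aside, the usual way to cover far more work items than launched threads is a grid-stride loop rather than a fixed per-thread chunk. Here's a hedged host-side Python simulation (the names `TOTAL_WORK`, `GRID_SIZE`, `BLOCK_SIZE`, and `covered_counts` are mine, not CUDAfy API) showing that every work item gets visited exactly once:

```python
# Serial simulation of a grid-stride loop: each simulated thread starts at
# its global index and strides by the total number of launched threads, so
# a modest launch can cover an arbitrarily large work set.
TOTAL_WORK = 1000      # work items to cover (61 million in practice)
GRID_SIZE = 4          # blocks
BLOCK_SIZE = 32        # threads per block

def covered_counts(total_work, grid_size, block_size):
    stride = grid_size * block_size          # total launched threads
    counts = [0] * total_work
    for tid in range(stride):                # one pass per simulated thread
        for i in range(tid, total_work, stride):
            counts[i] += 1                   # the kernel body would go here
    return counts

counts = covered_counts(TOTAL_WORK, GRID_SIZE, BLOCK_SIZE)
print(all(c == 1 for c in counts))           # every item visited exactly once
```

Note that a grid-stride loop on its own doesn't avoid the watchdog timeout; each thread still does total_work / stride iterations, so the launch must still finish within the OS limit.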
Jul 10, 2013 at 10:49 PM
Hi

The theoretical limits for grid/block size are only a general guideline.

Choosing the # of threads in a block isn't a trivial matter; the resulting performance depends heavily on your particular GPU and algorithm.
Standard practice is to:
  • Use the occupancy calculator tool (an excel spreadsheet inside the cuda toolkit "tools" folder) to find the best # of threads for your specific problem. Make sure it's always a multiple of 32, for warp optimization.
  • Divide your problem into batches, where each kernel call handles one batch only and is fast enough to finish before the OS's timeout. Call the kernel many times, thereby processing all batches. Choose a grid size large enough to keep your GPU busy but small enough to avoid an OS timeout.
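To make the batching arithmetic concrete, here's a small Python sketch (the name `batch_ranges` and the example grid shape are mine, not part of CUDAfy) of splitting a large work set into contiguous batches, one kernel call per batch:

```python
def batch_ranges(n_items, batch_size):
    """Split n_items into contiguous [start, end) batches; one kernel call each."""
    n_batches = (n_items + batch_size - 1) // batch_size   # ceiling division
    ranges = []
    for b in range(n_batches):
        start = b * batch_size
        end = min(start + batch_size, n_items)             # last batch may be short
        ranges.append((start, end))
    return ranges

# e.g. 61 million work items in batches of 59392 (a 58-block x 1024-thread grid)
ranges = batch_ranges(61_000_000, 58 * 1024)
print(len(ranges), ranges[0], ranges[-1])
```

The last batch is usually short, which is why the kernel needs a bounds check like `if (vox < Vox.GetLength(0))` before touching memory.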
cheers
Jul 10, 2013 at 11:45 PM
I had played with the occupancy calculator, but hadn't thought of dividing it into batches. I'm sure this isn't a very efficient solution, but I guess it will do.

That fixed both errors (for now anyway - got another error: "UnknownError" that I get to debug now. Hopefully those two don't re-appear.)

Thanks!
Jul 11, 2013 at 11:10 AM
"I'm sure this isn't a very efficient solution, but I guess it will do. "

It's fine as long as you make sure each kernel runs for more than, say, 100ms. That's actually how advanced algos manage to copy memory to the device while a kernel runs.
Jul 12, 2013 at 4:07 PM
So maybe I'm overlooking something simple, but this approach isn't working for me either. I've commented out everything but this section:
for (int i = 0; i < 2/*LOOPS*/; i++)
{
    Console.Write(i);
    gpu.Launch(X, Y).Method(...Parameters...);
    Console.WriteLine(" - Complete with {0} threads", X * Y);
}
If I run this with "i < 1" or "i < 2" everything works fine and the answers are correct for the first few thousand threads (the ones covered by these two iterations), but if I bump it to "i < 5" it gives me an "ErrorLaunchTimeOut". Unfortunately LOOPS needs to be around 1000 to complete them all...

Any suggestions?
Jul 13, 2013 at 2:55 PM
Does every kernel launch take the same time? How long do they take?
Jul 14, 2013 at 8:37 AM
What are the max values that can be used for "gpu.Launch(X, Y)" aka - what are the maximum values for X and Y?

"Great Question"

One which is not answered in the docs for CUDAfy, or indeed in the excellent guide "CUDA by Example".

It's fundamental, and any errors result in totally wrong results, usually due to a bit of memory being unallocated or over-allocated. It seems to me, though, that Pedritolo1 is right:
If your data set is very large (it depends on the memory on your GPU), then batching is a good solution, but there are downsides to this too, as it restricts some functionality. I can't remember exactly what now; I think it was being unable to use 2D arrays. So beware.

For batching, though, try this snippet of code: http://searchcode.com/codesearch/view/10117501

Unfortunately, with batch sizes, X/Y dims, and the size of N, it is far from trivial to understand what the hell is going on :-)
Jul 15, 2013 at 11:28 AM
You can obtain these values by querying the device properties: gpu.GetDeviceProperties(...). See file enum_gpu.cs in CudafyByExample project.
                Console.WriteLine("Max threads per block:  {0}", prop.MaxThreadsPerBlock);
                Console.WriteLine("Max thread dimensions:  ({0}, {1}, {2})", prop.MaxThreadsSize.x, prop.MaxThreadsSize.y, prop.MaxThreadsSize.z);
                Console.WriteLine("Max grid dimensions:  ({0}, {1}, {2})", prop.MaxGridSize.x, prop.MaxGridSize.y, prop.MaxGridSize.z);
Jul 15, 2013 at 4:45 PM
pedritolo1 - they're very short. ~1-2 ms each. I know ideally I'd do 100 ms each Launch, but for code legibility I'd currently like to avoid mixing and matching solutions (aka doing 100 threads per launch to get it up to 100 ms each and then breaking it up into multiple launches). Regardless, this combined approach still does not get enough threads through to be a viable solution.

carlrees1234 - So you think this is a memory problem? I ruled that out fairly early (but I could very well be wrong): even though it's millions of threads, I only use a handful of variables that persist, and since it occurs whether I launch 100 threads or 1000 threads at a time, the local variables aren't likely to be the issue, to my understanding.

Is there any way that it would run fine for one loop but break on another if I was passing it too much memory? If I start at 0 it makes it to launch 5 before it errors, yet I can start at 3 and make it to 8, or even start at 25 and make it to 30...

NickKopp-
Max threads per block: 1024
Max thread dimensions: (1024, 1024, 1)
Max grid dimensions: (2147483647, 65535, 1)

I'm probably reading this wrong, but to my understanding it would say I could do gpu.Launch(2147483647, 1024).Method() before I ran into problems.
As this isn't the case, I'm probably misunderstanding something.

Thanks for all of the help!
Jul 15, 2013 at 5:33 PM
Could it be the case that the kernels that give the error are incorrectly running a near-endless loop, causing a timeout?
Jul 15, 2013 at 6:02 PM
Well, it's not any kernel in particular - I can start it at any kernel I want and it will run just fine for 4-5 kernels before it fails.

Internally, the method only has 1 loop and each thread does the exact same number of loops, so nothing changes between calls.

Maybe I'm misunderstanding your question?
Jul 15, 2013 at 6:41 PM
Well, at this point I'd suggest you share with us a simplified version of your code; perhaps someone will figure out the bug.

cheers
Jul 15, 2013 at 7:22 PM
Well I checked to see how much memory I'm using - Just less than a Gigabyte on a 2GB card... so that shouldn't be the problem.

Here's the loop calling it:
for (int i = 0; i < 2/*LOOPS*/; i++)
{
    Console.Write(i);
    gpu.Launch(Length / LOOPS, N).Method(VertOnDevice,TOnDevice, VoxOnDevice, CTOnDevice, i,  Length / LOOPS * N);
    Console.WriteLine(" - Complete with {0} threads", Length / LOOPS * N);
}
And the code: (cut down SEVERELY, yet still displays the problem)
[Cudafy]
public static void Method(GThread thread, float[,] Vert, int[,] T, float[,] Vox, int[] CT, int iteration, int threadsPerIteration)
{
    int v = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
    int vox = v + iteration * threadsPerIteration;
    if (vox < Vox.GetLength(0))
    {
        float minDistance = float.PositiveInfinity;
        for (int j = 0; j < T.GetLength(0); j++)
        {
            float Px = Vox[vox, 0];
            float Py = Vox[vox, 1];
            float Pz = Vox[vox, 2];

            float X = 0, Y = 0, Z = 0; //Cut out how these were set - most of the computation

            //Calculate distance squared - set to distance;
            float distance = (Px - X) * (Px - X) + (Py - Y) * (Py - Y) + (Pz - Z) * (Pz - Z);

            if (distance < minDistance)
            {
                minDistance = distance;
                CT[vox] = j; //Line of Interest discussed below
            }
        }
    }
}
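As a sanity check on the index arithmetic above, here's a hedged Python re-creation of the kernel's `vox` calculation (names mirror the C# snippet; `vox_indices` is mine), confirming that successive iterations tile the index space with no gaps or overlap:

```python
def vox_indices(iteration, blocks, threads_per_block):
    """Indices one batched launch touches, mirroring the kernel's
    v = blockIdx.x * blockDim.x + threadIdx.x;
    vox = v + iteration * threadsPerIteration;"""
    threads_per_iteration = blocks * threads_per_block
    return [b * threads_per_block + t + iteration * threads_per_iteration
            for b in range(blocks) for t in range(threads_per_block)]

# Three iterations of a 2-block x 4-thread launch should tile 0..23 exactly.
seen = []
for it in range(3):
    seen.extend(vox_indices(it, 2, 4))
print(sorted(seen) == list(range(24)))
```

So the addressing itself looks sound, which points the suspicion elsewhere (the write to CT, or the launch timing).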
Narrowing it down, if I comment out the "Line of Interest" it works, yet I know vox is within CT.GetLength(0). I even changed the section to:
if (distance < minDistance)
{
    minDistance = distance;
    if (vox < CT.GetLength(0))
    {
        CT[vox] = j;
    }
    else
    {
        Console.WriteLine("ERROR - Out of Range on Card");
    }
}
With no impact on the result (the error message was not output before it errored out on the card).
Jul 15, 2013 at 8:14 PM
If you found the offending line, then you're close.
I propose that, instead of storing values in CT, you save the individual indexes (vox) into a temporary output buffer to parse later on the host, or, if it's a small # of iterations, output them to the console. You can then scan that list for offending values.
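The host-side scan being suggested is just a range check over the recorded index buffer; a minimal Python sketch (the name `find_offenders` and the sample values are mine, purely hypothetical):

```python
def find_offenders(indices, buffer_len):
    """Scan a recorded per-thread index buffer for values that would
    write outside a buffer of length buffer_len."""
    return [i for i in indices if not (0 <= i < buffer_len)]

recorded = [10, 42, 61_869, 61_870, -3]   # hypothetical vox values copied back
print(find_offenders(recorded, 61_870))   # any hits here would corrupt memory
```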


Just an aside - you know that Console.WriteLine only works inside a kernel if you ran nvcc with the debug option, right?
Jul 15, 2013 at 8:55 PM
Well, I am using the line:
CudafyTranslator.GenerateDebug = true;
But just attempted to print something to check it was working and nothing appeared... any ideas?
And that's what you are referring to correct?

I already know every vox being used (when it runs successfully) - I can see which values of CT have been modified. All of these values are the expected indices (read: the indexes that were launched) exactly and the results are within the expected range (although I have no way to verify the exact end value until I get results from all indices). With that in mind, however, I have added an output line to print every index as it works on it:
if (distance < minDistance)
{
    minDistance = distance;
    Console.Write("Index - {0} set ", vox);
    CT[vox] = j;
    Console.WriteLine("COMPLETE");
}
But, same problem as mentioned above, nothing appears.
Jul 15, 2013 at 9:01 PM
You need compute >= 2.0 and "CudafyTranslator.GenerateDebug = true" needs to be placed before you Cudafy your code.
Jul 15, 2013 at 9:02 PM
And now that you have a very tight error test framework, you could debug it using visual nSight.
Jul 15, 2013 at 9:09 PM
Ok, so I forced eArchitecture.sm_30 as shown here:
CudafyModule km = CudafyTranslator.Cudafy(eArchitecture.sm_30);
GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);
gpu.LoadModule(km);

CudafyTranslator.GenerateDebug = true;
And this time it wrote the "COMPLETE" line, but not the "Index..." line - Guessing it has to be "WriteLine"?

Other interesting note - I added a similar "Index"/"COMPLETE" setup around the Launch call. This shows every launch occurring before the first CT write-out is complete... ?

Output shown here:
1 - Complete with 59392 threads
2 - Complete with 59392 threads
3 - Complete with 59392 threads
4 - Complete with 59392 threads
5 - Complete with 59392 threads
6 - Complete with 59392 threads
7 - Complete with 59392 threads
8 - Complete with 59392 threads
9 - Complete with 59392 threads
10 - Complete with 59392 threads
11 - Complete with 59392 threads
12 - Complete with 59392 threads
13 - Complete with 59392 threads
14 - Complete with 59392 threads
15 - Complete with 59392 threads
16 - Complete with 59392 threads
17 - Complete with 59392 threads
18 - Complete with 59392 threads
19 - Complete with 59392 threads
Starting Second Loop
Starting Copy
COMPLETE
COMPLETE
COMPLETE
...(COMPLETE repeated, 118 lines in total)...
COMPLETE
COMPLETE
Exception Cudafy.Host.CudafyHostException: CUDA.NET exception: ErrorUnknown.
Jul 15, 2013 at 9:11 PM
pedritolo1 wrote:
And now that you have a very tight error test framework, you could debug it using visual nSight.
And I actually haven't been able to get nSight to work with Cudafy - Don't know what I did wrong initially, but it worked with my CUDA code right off the bat. Spent about an hour trying to get it to work with Cudafy before I retreated to simply using print lines to figure things out.
Jul 15, 2013 at 9:45 PM
Playing with it some more I got:
...cut out lots of the same...
COMPLETE
COMPLETE
COMPLETE
COMPLETE
COMPLETE
COMPLETE
Index - {0} set Index - {0} set Index - {0} set Index - {0} set Index - {0} set Index - {0} set Index - {0} set Index - {0} set Index - {0} set Index
- {0} set Index - {0} set Index - {0} set Index - {0} set Index - {0} set Index - {0} set Index - {0} set Index - {0} set Index - {0} set Index - {0}
set Index - {0} set Index - {0} set Index - {0} set Index - {0} set Index - {0} set Index - {0} set Index - {0} set Index - {0} set Index - {0} set In
dex - {0} set Index - {0} set Index - {0} set Index - {0} set COMPLETE
COMPLETE
COMPLETE
COMPLETE
COMPLETE
COMPLETE
COMPLETE
COMPLETE
COMPLETE
COMPLETE
COMPLETE
COMPLETE
COMPLETE
COMPLETE
COMPLETE
COMPLETE
Exception Cudafy.Host.CudafyHostException: CUDA.NET exception: ErrorLaunchTimeOut.
I understand why it's printing all the sets and then all of the completes, but not sure why it's not replacing the {0} with the value.

Also, not sure what changed that allows it to work now with "write" as opposed to "writeLine"...?
Jul 15, 2013 at 9:58 PM
Edited Jul 15, 2013 at 10:00 PM
I think the kernel WriteLine has a different string formatting syntax; check out the cudafy examples to learn more.
You should look further into using Nsight; it's a very useful tool. There have been several recent posts here discussing how to set it up with cudafy, including troubleshooting and configs.
Also, don't assume that the order of the strings shown on the console matches execution order. You might also want to output more info, such as thread and block ids.
Jul 15, 2013 at 10:28 PM
I couldn't find an example of the print line in a cudafy kernel in the CudafyByExample project. I searched for "CudafyTranslator.GenerateDebug" and no results were returned... I also skimmed through it by hand but didn't see any. Where could I find one?
Jul 15, 2013 at 10:42 PM
The available example only uses a string literal. Anyway, the formatting is more or less the C syntax. Here's the relevant section from the CUDA manual:

B.16 FORMATTED OUTPUT
Formatted output is only supported by devices of compute capability 2.x and higher.
int printf(const char format[, arg, ...]);
prints formatted output from a kernel to a host-side output stream.
The in-kernel printf() function behaves in a similar way to the standard C-library
printf() function, and the user is referred to the host system’s manual pages for a
complete description of printf() behavior. In essence, the string passed in as format
is output to a stream on the host, with substitutions made from the argument list
wherever a format specifier is encountered. Supported format specifiers are listed below.
The printf() command is executed as any other device-side function: per-thread,
and in the context of the calling thread. From a multi-threaded kernel, this means that a
straightforward call to printf() will be executed by every thread, using that thread’s
data as specified. Multiple versions of the output string will then appear at the host
stream, once for each thread which encountered the printf().
It is up to the programmer to limit the output to a single thread if only a single output
string is desired (see Examples for an illustrative example).
Unlike the C-standard printf(), which returns the number of characters printed,
CUDA’s printf() returns the number of arguments parsed. If no arguments follow
the format string, 0 is returned. If the format string is NULL, -1 is returned. If an internal
error occurs, -2 is returned.

B.16.1 Format Specifiers
As for standard printf(), format specifiers take the form: %[flags][width][.precision][size]type
The following fields are supported (see widely-available documentation for a complete description of all behaviors):
‣ Flags: '#' ' ' '0' '+' '-'
‣ Width: '*' '0-9'
‣ Precision: '0-9'
‣ Size: 'h' 'l' 'll'
‣ Type: '%cdiouxXpeEfgGaAs'
Note that CUDA’s printf()will accept any combination of flag, width, precision,
size and type, whether or not overall they form a valid format specifier. In other
words, “%hd” will be accepted and printf will expect a double-precision variable in the
corresponding location in the argument list.
Jul 15, 2013 at 10:57 PM
Edited Jul 15, 2013 at 11:45 PM
"Interesting...

Other than the marked section, everything appears normal until the error..."

Edit: Removed posted error - found the mistake in my print line (both weren't printing out the same thing)

But, with that said - when I run it through more than 5 iterations (where it errors) it actually never passes the first iteration, but if I run it with 3, it finishes iteration 3...
Jul 15, 2013 at 10:59 PM
Note: It's also alternating between "ErrorUnknown" and "ErrorLaunchTimeOut" when I am not changing anything outside of these print lines...
Jul 16, 2013 at 8:15 AM
SET: Iteration - 1 Index - 61870, Block - 2, Thread - 503
SET: Iteration - 1 Index - 61871, Block - 2, Thread - 504
SET: Iteration - 1 Index - 61872, Block - 2, Thread – 505 //(end of 39 correct entries)
SET: Iteration - 61946 Index - 2, Block - 506, Thread -ead - 506 //What is with this section?
SET: Iteration - 61947 Index - 2, Block - 507, Thread -ead - 507

What results do you get when running

gpu.Launch(1, 1024) ?
Jul 16, 2013 at 3:29 PM
carlrees1234 - So for some reason it only prints iteration 5 (set to run 3-5), but other than that it's exactly what you would expect:
COMPLETE: Iteration - 5 Index - 297413, Block - 0, Thread - 453
COMPLETE: Iteration - 5 Index - 297414, Block - 0, Thread - 454
COMPLETE: Iteration - 5 Index - 297376, Block - 0, Thread - 416
COMPLETE: Iteration - 5 Index - 297377, Block - 0, Thread - 417
COMPLETE: Iteration - 5 Index - 297378, Block - 0, Thread - 418
COMPLETE: Iteration - 5 Index - 297379, Block - 0, Thread - 419
COMPLETE: Iteration - 5 Index - 297380, Block - 0, Thread - 420
COMPLETE: Iteration - 5 Index - 297381, Block - 0, Thread - 421
COMPLETE: Iteration - 5 Index - 297382, Block - 0, Thread - 422
COMPLETE: Iteration - 5 Index - 297383, Block - 0, Thread - 423
COMPLETE: Iteration - 5 Index - 297384, Block - 0, Thread - 424
COMPLETE: Iteration - 5 Index - 297385, Block - 0, Thread - 425
COMPLETE: Iteration - 5 Index - 297386, Block - 0, Thread - 426
COMPLETE: Iteration - 5 Index - 297387, Block - 0, Thread - 427
COMPLETE: Iteration - 5 Index - 297388, Block - 0, Thread - 428
COMPLETE: Iteration - 5 Index - 297389, Block - 0, Thread - 429
SET: Iteration - 5 Index - 297920, Block - 0, Thread - 960
SET: Iteration - 5 Index - 297921, Block - 0, Thread - 961
SET: Iteration - 5 Index - 297922, Block - 0, Thread - 962
SET: Iteration - 5 Index - 297923, Block - 0, Thread - 963
SET: Iteration - 5 Index - 297924, Block - 0, Thread - 964
SET: Iteration - 5 Index - 297925, Block - 0, Thread - 965
SET: Iteration - 5 Index - 297926, Block - 0, Thread - 966
SET: Iteration - 5 Index - 297927, Block - 0, Thread - 967
SET: Iteration - 5 Index - 297928, Block - 0, Thread - 968
SET: Iteration - 5 Index - 297929, Block - 0, Thread - 969
SET: Iteration - 5 Index - 297930, Block - 0, Thread - 970
SET: Iteration - 5 Index - 297931, Block - 0, Thread - 971
SET: Iteration - 5 Index - 297932, Block - 0, Thread - 972
SET: Iteration - 5 Index - 297933, Block - 0, Thread - 973
SET: Iteration - 5 Index - 297440, Block - 0, Thread - 480
SET: Iteration - 5 Index - 297441, Block - 0, Thread - 481
SET: Iteration - 5 Index - 297442, Block - 0, Thread - 482
SET: Iteration - 5 Index - 297443, Block - 0, Thread - 483
SET: Iteration - 5 Index - 297444, Block - 0, Thread - 484
SET: Iteration - 5 Index - 297445, Block - 0, Thread - 485
SET: Iteration - 5 Index - 297446, Block - 0, Thread - 486
SET: Iteration - 5 Index - 297447, Block - 0, Thread - 487
SET: Iteration - 5 Index - 297448, Block - 0, Thread - 488
SET: Iteration - 5 Index - 297408, Block - 0, Thread - 448
SET: Iteration - 5 Index - 297409, Block - 0, Thread - 449
SET: Iteration - 5 Index - 297410, Block - 0, Thread - 450
SET: Iteration - 5 Index - 297411, Block - 0, Thread - 451
All block 0, threads go 0-1024, indices are all within the range of expected values... Iteration is just always 5...? It should show some 3s, then 4s, then 5s, but it only ever gives 5...
Jul 16, 2013 at 5:15 PM
IDK what the original problem is, but I think I've devised a way around it so that I can avoid this calculation (or rather separate it into other portions of the process) with minimal change in efficiency.

Aka, I'm going to scrap this for now and try a new approach. Hopefully I don't need to come back to it.

Thanks for all of the help though!
Jul 16, 2013 at 7:42 PM
Can you show your translated code? There's always the possibility you've hit a translation bug.
And yes, you need to use C notation for WriteLine with arguments.