This project is read-only.

A couple of suggestions

Nov 16, 2012 at 4:16 PM

Hi, I'm glad to report that my original deeply-optimized C source could be ported near-perfectly to CUDAfy. Bloody amazing! Except for a few small details, as follows:

1 - The inability to set #pragma unroll directives in the C# kernel code. Unrolling is a great means of improving performance. To fix it, I propose the use of a Cudafy function as a placeholder, which would be replaced by the "#pragma unroll <x>" directive upon C code generation. Example:

Cudafy.PragmaUnroll(10); // replaced by "#pragma unroll 10" in the generated C code; does nothing under emulation
for (int i = 0; i < 30; i++)
{
    // do stuff here
}


2 - C++-style templates were really useful for expanding a kernel into multiple parameter-dependent versions. How about an attribute on a kernel function argument marking it as a template argument and specifying its value range? Example:

[Cudafy]
public static void MyTemplatedKernel(GThread thread, [CudafyTemplateParameter(1, 2, 3)] int MyTemplatedParameter, uint[] d_someBuffer)
{
   if (MyTemplatedParameter == 2)
      d_someBuffer[1] = 4;
   else 
      d_someBuffer[2] = MyTemplatedParameter;
   ...
}

The class CudafyTemplateParameterAttribute would have different constructors for the typical value types used for templates, such as:

public CudafyTemplateParameterAttribute(params int[] arg) { }
public CudafyTemplateParameterAttribute(params double[] arg) { }

The example above would result in the generation of 3 kernels in C code, one for each possible value of the templated argument. The LaunchKernel method would determine at runtime which kernel version to launch, and would throw an exception when given a value outside the accepted range (a rough host-side sketch follows the generated kernels below).

extern "C" __global__ void MyTemplatedKernel_1(uint* d_someBuffer)
{

if (1 == 2)
      d_someBuffer[1] = 4;
   else 
      d_someBuffer[2] = 1;
   ...
}

extern "C" __global__ void MyTemplatedKernel_2(uint* d_someBuffer)
{

if (2 == 2)
      d_someBuffer[1] = 4;
   else 
      d_someBuffer[2] = 2;
   ...
}

extern "C" __global__ void MyTemplatedKernel_3(uint* d_someBuffer)
{

if (3 == 3)
      d_someBuffer[1] = 4;
   else 
      d_someBuffer[2] = 3;
   ...
}

These would later be optimized by the PTX compiler, removing the spurious value comparisons, into something like this:

 

extern "C" __global__ void MyTemplatedKernel_1(uint* d_someBuffer)
{

    d_someBuffer[2] = 1;
   ...
}

extern "C" __global__ void MyTemplatedKernel_2(uint* d_someBuffer)
{

    d_someBuffer[1] = 4;
   ...
}

extern "C" __global__ void MyTemplatedKernel_3(uint* d_someBuffer)
{

    d_someBuffer[2] = 3;
   ...
}
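To make the LaunchKernel idea concrete, here is a rough host-side sketch of the dispatch I have in mind (illustrative only; gpu, grid and block are the usual CUDAfy launch plumbing, and the switch is what LaunchKernel would effectively do for you):

// What the proposed LaunchKernel could effectively expand to at run-time.
// d_someBuffer is the device array; myTemplatedParameter holds the run-time value.
switch (myTemplatedParameter)
{
    case 1: gpu.Launch(grid, block, "MyTemplatedKernel_1", d_someBuffer); break;
    case 2: gpu.Launch(grid, block, "MyTemplatedKernel_2", d_someBuffer); break;
    case 3: gpu.Launch(grid, block, "MyTemplatedKernel_3", d_someBuffer); break;
    default: throw new ArgumentOutOfRangeException("myTemplatedParameter");
}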

3 - I noticed that this C# instruction

myInt32Var << myOtherInt32Var

would be translated into C by something like this:

myInt32Var << (myOtherInt32Var & 31)

Here, the & 31 makes sense, but it forces the use of an extra register to hold the value 31 and an extra AND instruction. It shouldn't be necessary.
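(For context: the mask simply mirrors C# shift semantics - for 32-bit operands only the low five bits of the count are used - so the translator is preserving .NET behaviour. A tiny illustrative check:)

// C# masks a 32-bit shift count to its low 5 bits,
// which is exactly what the emitted "& 31" reproduces.
int x = 1;
int n = 33;
int y = x << n;   // y == 2, same as x << 1, because 33 & 31 == 1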

 

4 - This is only a nice-to-have, there's no pressing need for it: variable names in the translated C code should keep the original C# names. Otherwise it becomes really hard to tell what is what and who is who while debugging with Nsight.

That's it, for now!

Thanks again for making this amazing tool.


Nov 16, 2012 at 4:38 PM

A couple of thoughts on your suggestions:

#2) If I am understanding your intent correctly with #2, you could achieve this yourself by deriving your own "generic" subclass.

#3) I believe you are incorrect, and the "& 31" exists only in the C code and will be optimized out by nvcc. It is rare for a source code constant to not be optimized out by the compiler these days.

Nov 16, 2012 at 5:32 PM
pgeerkens wrote:

A couple of thoughts on your suggestions:

#2) If I am understanding your intent correctly with #2, you could achieve this yourself by deriving your own "generic" subclass.

#3) I believe you are incorrect, and the "& 31" exists only in the C code and will be optimized out by nvcc. It is rare for a source code constant to not be optimized out by the compiler these days.


Sorry, I don't understand how I'd go about it regarding your remark #2. Could you perhaps elaborate? I'm probably missing something obvious here.

As for #3, it would be cool if you're correct. I can't verify it atm, since my Nsight debugger died on me, preventing me from looking at the disassembly.

 

Nov 16, 2012 at 6:58 PM

Thanks for the suggestions.

1) nvcc automatically unrolls small loops so typically you would use unroll to prevent unrolling, if that makes sense.  

2) Don't really get this one, I will look again when more alert.

3) You can use the CudafyModuleViewer to examine the PTX. This is like assembly but with a bit of effort you can work out what's going on.

4) Local variable names (i.e. within methods) are not preserved by .NET in IL, so there is no way to get them back when cudafy disassembles your assembly.
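For illustration, this is roughly what the disassembler has to work with (the slot names below are just what a decompiler invents):

// C# source:
int myCounter = 0;
myCounter += 1;

// What survives in the IL - locals are numbered slots; the original name
// lives only in the PDB, if anywhere:
//   .locals init ([0] int32 V_0)
//   ldc.i4.0
//   stloc.0
//   ldloc.0
//   ldc.i4.1
//   add
//   stloc.0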

Cheers, Nick

Nov 16, 2012 at 8:05 PM

Hi Nick

1) Yes, it makes sense. But, you see, nvcc won't automatically unroll if the loop count isn't known beforehand, yet even in these cases you'll see a significant increase in performance when unrolling.

2) I was merely referring to function templates. The CUDA manual does a better job of explaining them than I do.



3) True, but PTX is only an intermediate representation; it can't really be relied upon as the final word.

4) I see your point. In light of that, I have no idea how that could be fixed, except waiting for Roslyn and rewriting much of your code generator.

cheers

Nov 16, 2012 at 8:50 PM
Edited Nov 16, 2012 at 8:51 PM

Hi Pedritolo1.

re #2: When I see a class or method being invoked in different ways controlled by a condition or switch, I often sub-class it so that the condition is executed only once, on instantiation of the object. Then each sibling object specializes on a single code path. Often the specialization is done using generics, which is the C# way of doing templates. An example I wrote up the other day is at the bottom of this thread:

http://social.msdn.microsoft.com/Forums/en-US/csharpgeneral/thread/6b7dbf4d-ac33-4761-bb73-6278eae4847d
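A minimal sketch of the pattern, with made-up names (not the code from that thread):

// The "template parameter" is captured in a type, so each closed generic
// type specializes on a single code path, chosen once.
public interface IPolicy { int Transform(int x); }
public struct DoubleIt : IPolicy { public int Transform(int x) { return 2 * x; } }
public struct Negate   : IPolicy { public int Transform(int x) { return -x; } }

public class Worker<TPolicy> where TPolicy : struct, IPolicy
{
    private TPolicy _policy = default(TPolicy);   // value-type policy: the JIT can inline the call
    public int Run(int x) { return _policy.Transform(x); }
}

// Usage - the choice of behaviour is made once, by picking the closed type:
//    var w = new Worker<DoubleIt>();
//    int y = w.Run(21);   // 42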

Does that explanation help?

 

Nov 16, 2012 at 10:03 PM

Thanks, pgeerkens, for your explanation. Now I see what you mean.

The thing is, I wasn't talking about handling branching conditions on the host side, but on the device - using templates to branch a single kernel implementation in the source into several compile-time, argument-specific kernels. You may find an example of it in the 6th optimization step of NVIDIA's document Optimizing Parallel Reduction in CUDA.

I'll also try to write down a simple example.

Take the following kernel:

__global__ void MyTemplatedKernel(uint* d_someBuffer, int someValue)
{
   if (someValue == 2)
      d_someBuffer[1] = 4;
   else 
      d_someBuffer[2] = someValue;
}

If you knew beforehand that "someValue" would only ever be 1, 2 or 3, then you gain either by writing 3 separate kernels, each with its own similar implementation (and suffering with code maintenance), or by leaving all that work to the nvcc compiler and using templates:

template <int someValue>
__global__ void MyTemplatedKernel(uint* d_someBuffer)
{
   if (someValue == 2)
      d_someBuffer[1] = 4;
   else 
      d_someBuffer[2] = someValue;
}


This templated kernel would be called host-side in this way:

// host side:
switch (someValue)
{
	case 1: MyTemplatedKernel<1> <<< grid, block >>> (d_someBuffer);
	break;
	case 2: MyTemplatedKernel<2> <<< grid, block >>> (d_someBuffer);
	break;
	case 3: MyTemplatedKernel<3> <<< grid, block >>> (d_someBuffer);
	break;
}

Ultimately, the (hidden) compiled kernels will be:

__global__ void MyTemplatedKernel_1(uint* d_someBuffer)
{
    d_someBuffer[2] = 1;
}

__global__ void MyTemplatedKernel_2(uint* d_someBuffer)
{
    d_someBuffer[1] = 4;
}

__global__ void MyTemplatedKernel_3(uint* d_someBuffer)
{
    d_someBuffer[2] = 3;
}

It's this functionality I was proposing to be added to CUDAfy.

cheers

Nov 16, 2012 at 10:20 PM

Very thought provoking, but not all C++ tools are accessible from C# code. I need to absorb that big example before I comment more.

Nov 17, 2012 at 12:18 AM

A very impressive performance improvement in the PowerPoint show, but remember that an extreme corner case is being examined, where the reduction is the ENTIRE algorithm. I manually unrolled the entire reduction for case 4C of my CUDA Tuning with CUDAfy tutorial as shown here:

#if !after
   for (uint i = (uint)thread.blockDim.x / 2u; i > 0; i >>=1) {
      if (thread.threadIdx.x < i) {
         if (answerLocal[thread.threadIdx.x].distance > answerLocal[(uint)thread.threadIdx.x+i].distance) {
            answerLocal[thread.threadIdx.x]   = answerLocal[(uint)thread.threadIdx.x+i];
         }
      }
      thread.SyncThreads();
   }
#else
   if (thread.blockDim.x>= 512) if (thread.threadIdx.x < 256) 
      { answerLocal[thread.threadIdx.x] = Reduce(thread, answerLocal, 256); thread.SyncThreads(); }
   if (thread.blockDim.x>= 256) if (thread.threadIdx.x < 128) 
      { answerLocal[thread.threadIdx.x] = Reduce(thread, answerLocal, 128); thread.SyncThreads(); }
   if (thread.blockDim.x>= 128) if (thread.threadIdx.x < 64) 
      { answerLocal[thread.threadIdx.x] = Reduce(thread, answerLocal, 64); thread.SyncThreads(); }
   if (thread.threadIdx.x < 32) {
      answerLocal[thread.threadIdx.x] = Reduce(thread, answerLocal, 32);
      answerLocal[thread.threadIdx.x] = Reduce(thread, answerLocal, 16);
      answerLocal[thread.threadIdx.x] = Reduce(thread, answerLocal,  8);
      answerLocal[thread.threadIdx.x] = Reduce(thread, answerLocal,  4);
      answerLocal[thread.threadIdx.x] = Reduce(thread, answerLocal,  2);
      answerLocal[thread.threadIdx.x] = Reduce(thread, answerLocal,  1);
   }
#endif
         if (thread.threadIdx.x == 0) {
            answer[thread.blockIdx.x]   = answerLocal[0];
         }
      }
      [Cudafy]
      public static AnswerStruct Reduce(GThread thread, AnswerStruct[] answerLocal, uint i) {
         if (answerLocal[thread.threadIdx.x].distance > answerLocal[(uint)thread.threadIdx.x+i].distance)
            return answerLocal[(uint)thread.threadIdx.x+i];
         else
            return answerLocal[(uint)thread.threadIdx.x];
      }

and the performance improvement on the GPU was a measly 0.25% as shown here:

x64 Release ( 256 threads_per * 96 blocks = 24576 threads)
Cities  12;   Permutations:  479001600:
---------------------------------------
:
: some elision here
:
... and now with disk cache populated.
           Total     Load      Run
              ms       ms       ms            distance
GpuTsp4c    2643 =     85 +   2558; solution: 111.3318: (8,0,7,4,9,5,2,1,10,11,6,3,) - 4_Before
GpuTsp4d    2643 =     91 +   2552; solution: 111.3318: (8,0,7,4,9,5,2,1,10,11,6,3,) - 4_After

Also, note that though the run time improved by 6 ms, the load time degraded by 6 ms, for a wash. This is completely consistent across numerous iterations of the test, as I unrolled the code and built this post. I think if you are building a production application where this degree of sophistication is necessary, one should probably build a specialty reduction kernel directly in C++, as you have wandered out of CUDAfy's sweet spot.

Nov 17, 2012 at 12:20 AM
Edited Nov 17, 2012 at 12:24 AM

P.S. In case you are wondering, one of the intermediate steps verified that the time cost of the new function is exactly zero, so it is clearly being unrolled by nvcc. Also, due to a good head start from John Huack, I started from Step #3.

Nov 17, 2012 at 7:00 PM

Hi pgeerkens, I'm glad you've been taking a look at what I wrote. It seems to me that you're going about this template thing the wrong way, but perhaps I'm mistaken. In your TSP (example 4c) implementation, from what I can tell, the reduction part wouldn't benefit a lot from any performance improvement, since the inner loop is all focused around your FindPathDistance call.

Allow me to present a different perspective - let's assume we'd like to take this inner loop and allow for a variable number of cities (presently it's a constant).

// known counter at compile-time. Loop will be unrolled.
for (int i = 1; i < _cities; i++)
{
   city = path[i, thread.threadIdx.x];
   var latLong = gpuLatLong[city];
   distance += latLong.Distance(prevLatLong);
   prevLatLong = latLong;
}

Normally, we'd pass that new variable as a parameter into the kernel calls. The result would be that you'd be using another extra register / stack slot to hold this variable. Besides, since the loops centered on it no longer have a compile-time-known count, your code won't benefit from loop unrolling (see my #1 remark, 1st post). You'd see a degradation in performance. Not to mention, of course, that you'd not be allowed to allocate a variable amount of shared memory into threadsPerGrid.

If, instead, you used a template argument for _cities (for which we don't yet have a CUDAfy implementation), the generated C file would contain a function more or less like this:

template <int _cities> __global__ void GpuFindPathDistance(long long permutation, void* gpuLatLong, etc)
{
// etc, same as before
}

which, when nvcc-compiled, would result in as many different kernels as necessary to accommodate the whole range of possible values of _cities.

The compiler would derive that range of values from each kernel call within the compiled host code. An example of what the CUDAfy'ed C code for the host side would look like is:

// host C code, automatically generated by CUDAfy
switch (cities)
{
   case 1 : GpuFindPathDistance <1> <<< grid, block >>> (permutation, etc); break;
   case 2 : GpuFindPathDistance <2> <<< grid, block >>> (permutation, etc); break;
   case 3 : GpuFindPathDistance <3> <<< grid, block >>> (permutation, etc); break;
// etc
}

cheers



Nov 17, 2012 at 7:05 PM

A remark - I am aware that CUDAfy doesn't generate host C code. But if it did, it would look like what I just wrote.

Nov 17, 2012 at 7:35 PM
Edited Nov 17, 2012 at 7:38 PM

Think about permutations for a moment - at 10 cities, about the break-even point for a GPU or CPU implementation, there are 3.6M permutations and so there is no trade-off on the kernel side. Each thread is overloaded with hundreds or thousands of distance calculations, and I simply create enough thread "ranks" to minimize read/write latency. I am compute bound at that point, and always will be, as that is the usual nature of NP-hard problems. Any attempt to "tweak" the kernel for the number of cities will degrade performance by increasing compile time, to result in exactly the same run-time kernel.

"Premature optimization is the root of all evil." - Donald Knuth

Choosing sensible algorithms is wise, but tweaking performance before you have measured an actual bottleneck is a mug's game, as your effort will rarely be rewarded. Unless you are tackling a bottleneck of at least 30% of run-time, you are doomed to at BEST a 30% reduction in run-time. To get a performance gain of 10-times one must be tackling a well-localized bottleneck of at least 91% of run-time, and those are infrequent at best.

Nov 17, 2012 at 7:44 PM

The only possible optimizations I see remaining in the TSP problem are:

  1. Switching to another algorithm, such as Branch-and-Bound, which defeats the purpose of a tutorial on leveraging the CUDA architecture; and
  2. Tackling the repeated integer divide and modulo still remaining in PathFromRoutePermutation by:
    • Switching to a lexical permutation ordering; and
    • Using the STL next_permutation algorithm to replace the expensive integer divide and modulo on every path-iteration by each thread except the first.
Nov 17, 2012 at 7:53 PM
Edited Nov 17, 2012 at 8:09 PM

Yes, I agree 100%, you have barely any read/write latency in your inner loop; it's basically register arithmetic. But I didn't propose to decrease the read/write latency.

What I was putting on the table was a possible solution for a hypothetical scenario: what if you had to rewrite your code to allow for a user-specified number of cities (as it's only natural). How would you go about it, and how would your decisions affect your code? I see the following scenarios:

1 - You'd pass the number of cities into the kernel as an additional argument. As a consequence, your inner loop would not be unrolled, resulting in additional computations and branching decisions. Performance would likely suffer, since it's already a pretty tight inner loop (I could be wrong, of course).

2 - You'd write as many different kernel implementations as there are possible values you think are likely for the "cities" argument ([Edit] no sane person would use more than, say, 15 cities maximum), and at the host do a switch/case and call whichever kernel is appropriate. No performance degradation here, but you'd have to compile several kernels (is that really an issue?). Alas, this method is terrible for code maintenance.

3 - Same as 2, but CUDAfy would do it all behind the scenes. No performance degradation, no code maintenance.

 

Nov 17, 2012 at 9:33 PM

If you don't believe me that your scenario is not merely hypothetical, but pointless and doomed to degrade performance, prove me wrong by implementing it. CITIES is simply unsuitable as a tuning parameter in this problem.

Nov 17, 2012 at 10:27 PM

I'm sorry if I somehow gave the impression that I was criticizing the performance of your algorithm. I wasn't, and I'm sure it's pretty close to its theoretical maximum.

That was not at all what my whole discussion has been about. I'm sorry if so far I've failed to get my message across. I tried, but perhaps not hard enough. Presently I don't have a lot of time to devote to this discussion, so I'm afraid I'll have to leave it at that. Thanks for your time and your contribution to this post.

Nov 18, 2012 at 2:52 PM

Gents,

Avoiding premature optimization and staying in a "sweet spot" has always been a concern, and it was never the idea to provide the complete CUDA interface via CUDAfy. I believe that many CUDA programmers get carried away trying to extract every last bit of performance, in the same way as overclockers keep pushing in some bizarre attempt to run SuperPi faster. Fair enough if it is on your own time, but enough to frighten employers.

One relatively simple thing that could be added to CUDAfy is a method to allow insertion of literal CUDA C code into methods marked for Cudafying: e.g. thread.InsertText("#pragma unroll 5");

Running kernels with InsertText calls through the emulator would need to throw an exception.

What do you think?

Nick

Nov 18, 2012 at 3:45 PM

I think this "thread.InsertText" would be an awesome idea.

I advise against throwing an exception on the emulator. Best to just ignore it. Or maybe make that an optional argument: "thread.InsertText(string sometext, bool ThrowOnEmulation = false)".

cheers

Nov 18, 2012 at 11:44 PM

pedritolo1: No apology needed; I was a bit prickly yesterday.

I think throwing an exception on the emulator is fine, just not on an actual GPU. Unless there is some way to embed C++ code in C# that I am unaware of, such embedding would likely just break any emulation.

Nick & pedritolo1:

Having done some research and discovered that STL and Thrust are C++ "template" libraries, I think Nick's idea is a great one. I can already picture some test cases that could be run through any such mechanism as extensions of the already posted TSP solutions.

If I can help with the coding, just point me to a place to start.

Nov 19, 2012 at 12:18 PM

Hi

I'm probably missing something here, but Thrust is purely host code, right? I mean, whichever templated expressions you put together using Thrust, they ultimately are C++ pieces of host code that internally call some of CUDA's device SDK. Given that CUDAfy never generates host code (only device code as an intermediate step to PTX), I don't see how one would go about it... Unless you're suggesting using CUDA's language parser to generate static DLLs resulting from the cl compilation of automatically-generated C++ code based on a corresponding C# "template"? Or are you instead proposing to do this directly in .NET's managed C++, somehow? Ok, now I'm confused.

Like I said, I probably lack the proper perspective on this; it has been more than 10 years since I've used a template library, and even then only grudgingly.

 

Nov 19, 2012 at 12:52 PM

Thrust is a device implementation of the STL - but like all template libraries, it is pure source code. No DLLs, just header files. Macros, essentially. I am not 100% sure how the implementation works, but I was stymied on how to invoke it from CUDAfy (without rewriting it in C#) until Nick suggested enabling a C++ pass-through mechanism.

Nov 19, 2012 at 1:57 PM

Okay, not sure how Thrust would work either, but code insertion has been added. It was not as trivial as first thought, but the solution is possibly more powerful, allowing for hybrid kernels. Example:

[Cudafy]
private static void AHybridMethod(GThread thread, int[] data, int[] results)
{
    thread.InsertCode("#pragma unroll 5");
    for (int h = 0; h < data.Length; h++)
        thread.InsertCode("{0}[{2}] = {1}[{2}];", results, data, h);
}
This results in the following CUDA C code:

extern "C" __global__ void AHybridMethod(int* data, int dataLen0, int* results, int resultsLen0)
{
	#pragma unroll 5
	for (int i = 0; i < dataLen0; i++)
	{
		results[i] = data[i];
	}
}
Note that it was important to pass variable h as a formatting parameter to the InsertCode method. Hard-coding it in the string would fail, since disassembly loses the names of local variables. There are some limitations - currently, putting something like data.Length as a parameter would fail, as does anything more complex than the above example. Should be good for a start though!

Nov 19, 2012 at 2:09 PM

It seems you went above and beyond the call of duty, sir. Well done.

Nov 19, 2012 at 2:17 PM

I'm nitpicking here, so please ignore me. But. How about you move the InsertCode method out of the GThread class, and instead make it a static method in some other helper class? For clarity, I mean. After all, the "thread" instance of GThread is supposed to reflect directly the state of the device upon run-time, and not to be used for compilation purposes. You see what I mean? When I use the "thread" object, I expect to be dealing with the GPU, not the compiler. Just a thought.
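Something along these lines, say (purely illustrative, names made up):

// Hypothetical alternative placement - a compile-time-only helper,
// so GThread stays purely about run-time GPU state.
public static class CudafyCodeGen
{
    // Emitted verbatim into the generated CUDA C; ignored (or throws) under emulation.
    public static void InsertCode(string text, params object[] args) { }
}

// Kernel code would then read:
//    CudafyCodeGen.InsertCode("#pragma unroll 5");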

 

Nov 19, 2012 at 2:50 PM

You know, your approach made me think about a solution to another problem. If you could do what you just did, then you should have the means to do the following.

Again, I speak of mapping the names of C# variables into C. It can be a big problem while debugging complex kernels; very soon you're lost trying to figure out what the hell array13 is, or what num27 corresponds to.

How about the following: an optional GThread method (or one on any other suitable class) where you get to map variable names. Something like this:

[Cudafy]
public static void test(GThread thread)
{
    int AuxVar = (thread.blockIdx.x * thread.blockDim.y + thread.threadIdx.y);
    GThread.MapVariables("AuxVar", AuxVar);


    // this variable isn't very important, so I won't bother mapping it.
    int SomeOtherAuxVar = thread.blockIdx.x * thread.blockDim.x;

    int[,] sd_p_pType = thread.AllocateShared<int>("sd_p_pType", 10, 20);

    for (int MyInnerLoopIterator = 0; MyInnerLoopIterator < 10; MyInnerLoopIterator++)
    {
        GThread.MapVariables("MyInnerLoopIterator", MyInnerLoopIterator);
        sd_p_pType[thread.threadIdx.y, MyInnerLoopIterator] = thread.threadIdx.x;
    }
}

Which would translate into:

extern "C" __global__ void test()
{
	int AuxVar = blockIdx.x * blockDim.y + threadIdx.y;
	int num1 = blockIdx.x * blockDim.x;
	__shared__ int sd_p_pType[10*20];

	int sd_p_pTypeLen0 = 10;
	int sd_p_pTypeLen1 = 20;
	for (int MyInnerLoopIterator = 0; MyInnerLoopIterator < 10; MyInnerLoopIterator++)
	{
		sd_p_pType[(threadIdx.y) * sd_p_pTypeLen1 + ( MyInnerLoopIterator)] = threadIdx.x;
	}
}


Notice how I didn't bother to map sd_p_pType, since thread.AllocateShared(...) was supposed to already do this (this isn't working now, I suppose it's a bug?). GThread.MapVariables should accept many pairs of arguments, to make things easier.

cheers

Nov 19, 2012 at 3:02 PM
Edited Nov 22, 2012 at 12:28 PM

Interestingly, if you were to implement my variable-mapping proposal, your previous InsertCode example could be simplified like this:

[Cudafy]
private static void AHybridMethod(GThread thread, int[] data, int[] results)
{
   thread.InsertCode("#pragma unroll 5");
   for (int h = 0; h < data.Length; h++)
   {
      GThread.MapVariables("h", h); // no need to map "data" or "results", since, being function arguments, their names can be infered from the IL disassembly
      thread.InsertCode("results[h] = data[h];");            
   }
}

Nov 19, 2012 at 8:31 PM

Struggling to know where to put the methods - GThread may not have been logical, but it was discoverable. Anyway, I have kept them in GThread but made them static. Alternative suggestions welcome.