Roslyn + NVRTC

Oct 13, 2015 at 5:56 PM
What would be involved in getting CUDAfy to use Roslyn to compile C# code and NVRTC to compile Cuda kernels? I'm asking because I need runtime compilation for my project, and I'm considering modifying CUDAfy to support doing just this.
Oct 14, 2015 at 10:48 PM
Hi
Just curious: AFAIK CUDAfy doesn't compile C#. It's the other way around, really. Does Roslyn offer decompilation functionality?
Oct 15, 2015 at 5:35 PM
Edited Oct 15, 2015 at 5:37 PM
Well, runtime compilation would require a C# compiler as the first step. The pipeline would be:

Compile C# (Roslyn) -> Decompile -> Emit Cuda code -> Compile kernel (NVRTC)

Of course, the last stage could also emit OpenCL code, since OpenCL has a runtime compiler as well.
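
To make the NVRTC end concrete, here's the rough shape I have in mind, calling the nvrtc* C API from C# via P/Invoke. Untested sketch: the entry points are the real NVRTC API, but the DLL name is versioned per toolkit (nvrtc64_75.dll for Cuda 7.5) and the marshalling is my own guess.

using System;
using System.Runtime.InteropServices;
using System.Text;

static class Nvrtc
{
    const string Dll = "nvrtc64_75"; // versioned per toolkit; adjust for your install

    [DllImport(Dll, CallingConvention = CallingConvention.Cdecl)]
    static extern int nvrtcCreateProgram(out IntPtr prog, string src, string name,
                                         int numHeaders, string[] headers, string[] includeNames);

    [DllImport(Dll, CallingConvention = CallingConvention.Cdecl)]
    static extern int nvrtcCompileProgram(IntPtr prog, int numOptions, string[] options);

    [DllImport(Dll, CallingConvention = CallingConvention.Cdecl)]
    static extern int nvrtcGetPTXSize(IntPtr prog, out UIntPtr size);

    [DllImport(Dll, CallingConvention = CallingConvention.Cdecl)]
    static extern int nvrtcGetPTX(IntPtr prog, byte[] ptx);

    [DllImport(Dll, CallingConvention = CallingConvention.Cdecl)]
    static extern int nvrtcDestroyProgram(ref IntPtr prog);

    // Takes the Cuda C emitted by the decompilation step and returns PTX.
    public static string CompileToPtx(string cudaSource, params string[] options)
    {
        IntPtr prog;
        Check(nvrtcCreateProgram(out prog, cudaSource, "cudafy_module.cu", 0, null, null));
        try
        {
            Check(nvrtcCompileProgram(prog, options.Length, options));
            UIntPtr size;
            Check(nvrtcGetPTXSize(prog, out size));
            var ptx = new byte[(int)size.ToUInt64()];
            Check(nvrtcGetPTX(prog, ptx));
            return Encoding.ASCII.GetString(ptx).TrimEnd('\0');
        }
        finally { nvrtcDestroyProgram(ref prog); }
    }

    static void Check(int result)
    {
        if (result != 0) throw new Exception("NVRTC error " + result);
    }
}

// e.g. var ptx = Nvrtc.CompileToPtx(generatedCudaC, "--gpu-architecture=compute_30");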

Roslyn also has an interesting feature: you can use it just to extract the Abstract Syntax Tree from your C#, without a full compilation, which suggests a more direct path (though it would take a much more significant reworking of CUDAfy):

Get AST from C# code (Roslyn) -> Emit LLVM -> Compile kernel (NVVM)

As I understand, SPIR is also based on LLVM, so you could probably get a cross-vendor OpenCL compilation this way without too much trouble.

This should be a bit faster, though the C# compiler is already pretty darn fast. I also suspect this method would be more robust.
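
For the AST extraction bit, it's honestly just a few lines with the Microsoft.CodeAnalysis.CSharp NuGet package. Minimal sketch (the Kernels class inside the string is just made-up sample input):

using System;
using System.Linq;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

class AstDump
{
    static void Main()
    {
        // Parse only -- no references, no IL emit, just the syntax tree.
        var tree = CSharpSyntaxTree.ParseText(@"
            public static class Kernels
            {
                public static void Add(float[] a, float[] b, float[] c, int i)
                {
                    c[i] = a[i] + b[i];
                }
            }");

        // Each method body is what an IR emitter would walk.
        foreach (var method in tree.GetRoot().DescendantNodes()
                                             .OfType<MethodDeclarationSyntax>())
        {
            Console.WriteLine(method.Identifier.Text);
            Console.WriteLine(method.Body);
        }
    }
}

For real type information you'd still create a CSharpCompilation and ask it for a semantic model, but you never have to emit an assembly.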
Oct 18, 2015 at 11:29 AM
Edited Oct 18, 2015 at 11:30 AM
IMHO the Roslyn->LLVM step would be extremely hard to write and to get working properly. LLVM's internals and configuration are a perfect nightmare. On the other hand, NVRTC seems like a nice tool to integrate into CUDAfy once it comes out of beta/preview and once CUDAfy supports Cuda Toolkit v8.0, or whichever version ends up shipping a stable NVRTC. I assume CUDAfy could retain the old command-line Cuda compilation alongside a new NVRTC-based implementation.
Anyhow, I'm just some guy talking. You'll have to bring this up with the people behind CUDAfy.
Oct 21, 2015 at 1:46 AM
Edited Oct 21, 2015 at 1:49 AM
NVVM, thankfully, spares you from having to mess with LLVM configuration at all: just emit NVVM IR (basically LLVM IR, with various platform-specific things added and removed) and call nvvmCompileProgram. The NVVM library is essentially a standalone NVVM-to-PTX optimizing compiler.
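
The whole call sequence is tiny. Something like this from C# (untested sketch: the nvvm* entry points are the real libNVVM C API, but the library name is versioned -- something like nvvm64_30_0.dll on Windows, libnvvm.so on Linux -- and the marshalling is my guess):

using System;
using System.Runtime.InteropServices;
using System.Text;

static class Nvvm
{
    const string Dll = "nvvm64_30_0"; // adjust to whatever your toolkit ships

    [DllImport(Dll, CallingConvention = CallingConvention.Cdecl)]
    static extern int nvvmCreateProgram(out IntPtr prog);

    [DllImport(Dll, CallingConvention = CallingConvention.Cdecl)]
    static extern int nvvmAddModuleToProgram(IntPtr prog, string buffer, UIntPtr size, string name);

    [DllImport(Dll, CallingConvention = CallingConvention.Cdecl)]
    static extern int nvvmCompileProgram(IntPtr prog, int numOptions, string[] options);

    [DllImport(Dll, CallingConvention = CallingConvention.Cdecl)]
    static extern int nvvmGetCompiledResultSize(IntPtr prog, out UIntPtr size);

    [DllImport(Dll, CallingConvention = CallingConvention.Cdecl)]
    static extern int nvvmGetCompiledResult(IntPtr prog, byte[] buffer);

    [DllImport(Dll, CallingConvention = CallingConvention.Cdecl)]
    static extern int nvvmDestroyProgram(ref IntPtr prog);

    // nvvmIr is the textual NVVM IR your AST-to-IR pass produced; returns PTX.
    public static string CompileToPtx(string nvvmIr)
    {
        IntPtr prog;
        if (nvvmCreateProgram(out prog) != 0)
            throw new Exception("nvvmCreateProgram failed");
        try
        {
            if (nvvmAddModuleToProgram(prog, nvvmIr, (UIntPtr)(ulong)nvvmIr.Length, "module") != 0 ||
                nvvmCompileProgram(prog, 0, null) != 0)
                throw new Exception("NVVM compile failed");

            UIntPtr size;
            nvvmGetCompiledResultSize(prog, out size);
            var ptx = new byte[(int)size.ToUInt64()];
            nvvmGetCompiledResult(prog, ptx);
            return Encoding.ASCII.GetString(ptx).TrimEnd('\0');
        }
        finally { nvvmDestroyProgram(ref prog); }
    }
}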

As for OpenCL, any device with the cl_khr_spir extension (read: everyone but Nvidia) should just be able to take a SPIR binary (another variant of LLVM IR), load it with clCreateProgramWithBinary, and build it with the "-x spir" option to get a runnable kernel. No messing with LLVM internals here either.

OpenCL SPIR is probably a bit more involved, since unlike NVVM, which accepts the plaintext "assembly language" form of the IR, it takes the binary-format IR AFAIK.
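
For what it's worth, the loading side from C# would look something like this (raw P/Invoke sketch, untested; clCreateProgramWithBinary and clBuildProgram are standard OpenCL 1.2 entry points, the context/device setup is omitted, and the marshalling is my guess):

using System;
using System.Runtime.InteropServices;

static class ClSpir
{
    [DllImport("OpenCL", CallingConvention = CallingConvention.Cdecl)]
    static extern IntPtr clCreateProgramWithBinary(
        IntPtr context, uint numDevices, IntPtr[] devices,
        UIntPtr[] lengths, IntPtr[] binaries, int[] binaryStatus, out int errcode);

    [DllImport("OpenCL", CallingConvention = CallingConvention.Cdecl)]
    static extern int clBuildProgram(
        IntPtr program, uint numDevices, IntPtr[] devices,
        string options, IntPtr pfnNotify, IntPtr userData);

    // context and device come from the usual clGetPlatformIDs / clGetDeviceIDs /
    // clCreateContext preamble, which is omitted here.
    public static IntPtr LoadSpir(IntPtr context, IntPtr device, byte[] spirBinary)
    {
        var pinned = GCHandle.Alloc(spirBinary, GCHandleType.Pinned);
        try
        {
            int err;
            IntPtr program = clCreateProgramWithBinary(
                context, 1, new[] { device },
                new[] { (UIntPtr)(ulong)spirBinary.Length },
                new[] { pinned.AddrOfPinnedObject() },
                new int[1], out err);
            if (err != 0) throw new Exception("clCreateProgramWithBinary: " + err);

            // "-x spir" tells the driver this is SPIR rather than a native device binary.
            err = clBuildProgram(program, 1, new[] { device }, "-x spir", IntPtr.Zero, IntPtr.Zero);
            if (err != 0) throw new Exception("clBuildProgram: " + err);
            return program;
        }
        finally { pinned.Free(); }
    }
}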

The IR itself shouldn't be especially hard to generate from an AST. It has an arcane syntax, with static single assignment and very explicit data and reference types (a lot of the reference-type specifics are emitted and consumed by the various optimization passes), and the DWARF debug annotations are outright cursed, but it's straightforward enough, since you avoid nonsense like the register allocation needed to output a machine binary.

The holy grail would, of course, be a full MSIL to NVVM or SPIR cross-compiler. That would require implementing, in software where necessary, various interesting things such as a GPU-resident garbage collector. Partial MSIL coverage would almost certainly leave various language features broken in difficult-to-predict ways. Wishful thinking!

Anyhow, do you know how to contact the people behind Cudafy?
Oct 22, 2015 at 12:15 AM
Edited Oct 22, 2015 at 12:18 AM
Very interesting reply, ty
Although there are certainly merits to the LLVM IR pathway (it would be nice and tidy!), which as you said would be adaptable to both OpenCL and Cuda (the devil is in the details, mind you), you'd lose a few things I consider very important:
  • The ability to look at the generated C before the Cuda/OpenCL compilation. This is great for troubleshooting and optimizing your algorithms.
  • Being able to debug that C code GPU-side using Nvidia's Nsight, directly from Visual Studio: placing breakpoints, inspecting threads and memory, etc. This is a fundamental tool for writing more complex kernels.
Of course, if the original CUDAfy compilation pathway were preserved and maintained, then I'd have no further objections.

You mention writing a GC engine on the GPU side. I fail to see the point. When writing kernels you care only about performance, which in most cases comes down to memory access, so one always needs complete control over memory management.
CUDAfy only supports a very small subset of the .NET Framework: only the basic C-like syntax of C# is preserved, and only the .NET functions that map directly onto GPU intrinsics are supported. Anything else should run CPU-side, since it won't be part of an inner loop.
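
To illustrate what I mean, a typical CUDAfy kernel is just C-flavoured C# along these lines (written from memory of the samples, so treat the exact API shape as approximate): a [Cudafy] static method taking a GThread, with the handful of supported Math calls mapping straight onto GPU intrinsics.

using System;
using Cudafy;

public static class Kernels
{
    [Cudafy]
    public static void Scale(GThread thread, float[] input, float[] output, int n)
    {
        int i = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
        if (i < n)
            output[i] = (float)Math.Sqrt(input[i]); // Math.Sqrt translates to the device sqrt intrinsic
    }
}
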
Anyhow, do you know how to contact the people behind Cudafy?
Message Nick, he usually posts on this forum.
https://www.codeplex.com/site/users/view/NickKopp
Oct 22, 2015 at 8:42 AM
Edited Oct 22, 2015 at 8:46 AM
My understanding of Nsight with NVVM is that the debugger just works, provided you give it the proper DWARF annotations in the generated NVVM IR and figure out how to point it at the right source file. This means it should be possible to step through the C# code directly, without any C intermediate. QuantAlea can debug F# running on the GPU this way, for instance.

As for optimizations, unless something goes very wrong, C# should optimize very similarly to C, since the languages are so similar. If anything, the decompilation step as it's currently implemented is the thing that could break this, if it hits a case where it can't regenerate some bit of code cleanly. Nsight looks like it has awesome profiling tools, by the way, at least as of Cuda 7.5.

Memory management on the GPU is a very interesting topic. OpenCL doesn't support it, but Cuda has for some time, since Fermi I believe. Basically, there are a number of problems that are parallel and have large problem sizes, but need dynamic memory allocation to work well. An example is constructing a BVH for ray tracing: it's a hefty amount of work (maybe you have 100 million triangles in the scene...), and it may have to be entirely redone every frame, depending on how many things are moving in the scene. Also, if you want anything close to real time, you'd better not have to transfer 3 GB of BVH to the GPU each frame when it's rebuilt! Thing is, building a BVH is essentially building a tree, so you either need malloc or you have to do it yourself on some flat buffer whose size you guessed before you started.
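
To make the malloc point concrete: since Fermi (compute 2.0) a kernel can literally call malloc/free against a device-side heap, so the leaf nodes of that BVH can be allocated as you go. Rough sketch below, written as the kind of Cuda C source string the runtime-compilation pipeline discussed above would build for an sm_20+ target; the Node layout is made up for the example.

static class BvhKernels
{
    // Cuda C source for the runtime pipeline above to compile (needs sm_20 or later).
    public const string LeafBuilderSource = @"
struct Node { float minX, minY, minZ, maxX, maxY, maxZ; Node *left, *right; };

extern ""C"" __global__ void makeLeaves(Node** leaves, const float* verts, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    Node* node = (Node*)malloc(sizeof(Node));   // device-side malloc, no preallocated pool needed
    node->minX = node->maxX = verts[3 * i + 0];
    node->minY = node->maxY = verts[3 * i + 1];
    node->minZ = node->maxZ = verts[3 * i + 2];
    node->left = node->right = 0;
    leaves[i] = node;
}";
}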

As for a full GC, well, I'll agree that at this point it's not needed or even wanted. However, I suspect that more and more of a program will eventually end up on the GPU as libraries begin to show up. There are some good reasons to move things over: latency between kernel launches is a big one, if you need small kernels with synchronization between them. The other big one is avoiding transferring data between host and device any more than absolutely necessary. It's often a win to run serial code on the GPU if the alternative is a data transfer or an execution bubble. GPUs aren't that much slower than CPUs on serial code; I'd suspect within an order of magnitude. Who knows if a full managed runtime will show up in GPU land in the next decade or two?