Out of Core

May 12, 2013 at 2:04 PM
Edited May 12, 2013 at 2:08 PM

How can we do Out of Core paging from system memory? It would be nice to have it for both OpenCL and CUDA.

That's how OptiX from NVIDIA works.
May 13, 2013 at 7:50 AM
"Out of core" is a buzz phrase that simply means the data set is too big to fit completely into GPU memory. It does not tell us how OptiX works.
How you implement this is algorithm-dependent. OptiX, for example, is targeted at ray tracing, and how it handles this will be different from a large matrix operation. CUDA, and therefore CUDAfy, is extremely general and lower level, so you would need to handle the paging yourself. The new Maxwell GPUs will be able to make use of unified addressing, which sounds like it will allow writing a kernel that accesses system memory as well as GPU memory. This is likely what you need. For now you will need to split up your processing into multiple copies to/from the GPU and multiple launches.
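To make the "multiple copies and multiple launches" idea concrete, here is a minimal sketch in plain Python. It is not CUDAfy or CUDA code: the list slice, the `kernel` call and the `extend` are only stand-ins for the host-to-device copy, the kernel launch and the device-to-host copy, and the names `chunk_ranges`, `process_out_of_core` and `square_kernel` are made up for illustration. It also assumes each chunk can be processed independently, which is not true for every algorithm.

```python
def chunk_ranges(total_items, max_items_per_chunk):
    """Yield (start, end) index pairs that cover total_items in order."""
    if max_items_per_chunk <= 0:
        raise ValueError("max_items_per_chunk must be positive")
    for start in range(0, total_items, max_items_per_chunk):
        yield start, min(start + max_items_per_chunk, total_items)


def process_out_of_core(host_data, device_capacity_items, kernel):
    """Run `kernel` over host_data one device-sized chunk at a time.

    Returns the combined results and the number of simulated launches.
    """
    results = []
    launches = 0
    for start, end in chunk_ranges(len(host_data), device_capacity_items):
        device_buffer = list(host_data[start:end])  # stand-in: host -> device copy
        device_output = kernel(device_buffer)       # stand-in: one kernel launch
        results.extend(device_output)               # stand-in: device -> host copy
        launches += 1
    return results, launches


def square_kernel(buffer):
    """Hypothetical per-element kernel: square every item in the chunk."""
    return [x * x for x in buffer]


if __name__ == "__main__":
    data = list(range(10))
    # Pretend the device can only hold 4 items at once.
    output, launches = process_out_of_core(data, 4, square_kernel)
    print(output, launches)
```

With a real GPU the same loop structure applies, but the chunk size would come from the free device memory, and copies and launches could be overlapped using streams.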
May 13, 2013 at 10:19 AM
Edited May 13, 2013 at 10:19 AM
Well, to be perfectly honest, we don't know how to implement this. That's why we asked in the first place.

This should be implemented in CUDAfy. For OpenCL and CUDA. That way developers can concentrate on writing efficient kernels instead of dealing with data transfers! :)
May 13, 2013 at 11:40 AM
Edited May 13, 2013 at 11:44 AM
"This should be implemented in CUDAfy. For OpenCL and CUDA"

Like Nick said, it really is algorithm-dependent. There's no one-size-fits-all approach here. Data needs to flow between host memory and the device, and the shape and timing of that flow depend completely on your algorithmic needs.

"We don't know how to implement this. That's why we asked in the first place."

If you describe your particular problem in detail, perhaps someone here can help you figure it out.

Mind you, there are now some cards on the market with "many" gigs of device memory. Look at the Quadro 6000, for example, with 6 GB of GDDR5, or AMD's equivalent line for servers (can't remember the brand name).

May 13, 2013 at 12:06 PM
Edited May 13, 2013 at 12:16 PM
We need to raytrace a mesh which is 10 GB in size. How can you raytrace that on a 3 GB GPU without an out-of-core algorithm?

I think it's impossible. Tesla GPUs have a unified address space, for example.
May 13, 2013 at 2:45 PM
Edited May 13, 2013 at 9:46 PM
I see, 10 GB is large.
Anyhow - unified address space is merely an abstraction over the mechanisms of device/host memory. It's basically a way to let a single 64-bit pointer value identify where the memory lives (local/shared, device, host, and which GPU). It allows the use of undifferentiated calls for functions that deal with pointer operations. Yes, some memory copy operations will be done in the background, but they won't be efficient, nor will you be allowed to allocate more than the device's max capacity, anyway.
Your buffers will still reside on the host initially. And since they don't fit on the device all at once, you will need to move them onto the device in blocks yourself.

If you want, I can give you some ideas on how to solve your problem, but bear in mind that I'm not specialized in graphical computing. Are you using an out-of-the-box raytracer (i.e., from a library/DLL), or are you writing your own from scratch? Is that 10 GB with the mesh's textures uncompressed, or with some kind of compression already in place?


Edit: I'm no longer so sure about this: "nor will you be allowed to alloc more than the device's max capacity, anyway"
May 13, 2013 at 2:54 PM
Edited May 13, 2013 at 3:14 PM
We were thinking of using Optix.NET (https://optixdotnet.codeplex.com) but I see that's in beta. :)

Regarding the mesh, it can be any untextured mesh. Maybe a simple OBJ.

It would be fantastic if it was joined with CUDAfy.
May 13, 2013 at 9:48 PM
Sorry, I guess I misunderstood your post. I thought you were writing your own ray tracer in CUDAfy and needed help with a large mesh.
May 14, 2013 at 3:49 AM
Sure, but can you post a general way to do it? It may be useful to others.
May 14, 2013 at 11:07 AM
Edited May 14, 2013 at 11:07 AM
I'd probably try to use some simple scheme for mesh/texture compression. Usually, textures for ray tracing are decompressed before tracing begins, since images are typically compressed with very complex algorithms that are unfeasible for queries of the type "give me the RGB at coords x,y". So I'd use some simple compression for each raster line in each texture, probably a repeat-counter byte followed by the actual RGB value, with a lookup table at the start of each scan line. This alone can reduce a texture 2- or 3-fold (depending on the texture, of course), while increasing computation times slightly. In your case it could be enough for the whole thing to fit in the GPU.
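As a rough illustration of the per-scan-line scheme described above, here is a sketch in plain Python. The layout is an assumption on my part: each run is a (count, rgb) pair with the count capped at 255 so it fits in one byte, and the "lookup table" is simply the starting x coordinate of every run, so a pixel query becomes a binary search instead of a full decode. The function names `encode_scanline` and `pixel_at` are made up for this example.

```python
from bisect import bisect_right


def encode_scanline(pixels):
    """Run-length encode one scan line.

    Returns (runs, starts): runs is a list of (count, rgb) with count <= 255,
    starts[i] is the x coordinate where run i begins (the lookup table).
    """
    runs = []
    for rgb in pixels:
        if runs and runs[-1][1] == rgb and runs[-1][0] < 255:
            runs[-1][0] += 1           # extend the current run
        else:
            runs.append([1, rgb])      # start a new run
    starts = []
    x = 0
    for count, _ in runs:
        starts.append(x)
        x += count
    return [(count, rgb) for count, rgb in runs], starts


def pixel_at(runs, starts, x):
    """Answer "give me the RGB at x" without decoding the whole line."""
    width = starts[-1] + runs[-1][0] if runs else 0
    if not 0 <= x < width:
        raise IndexError("x is outside the scan line")
    run_index = bisect_right(starts, x) - 1  # last run starting at or before x
    return runs[run_index][1]


if __name__ == "__main__":
    red, blue = (255, 0, 0), (0, 0, 255)
    line = [red] * 5 + [blue] * 3 + [red] * 2
    runs, starts = encode_scanline(line)
    print(runs, starts, pixel_at(runs, starts, 6))
```

How much this saves depends entirely on how many long runs the texture has; a noisy texture can even grow, because every run carries the extra counter byte and table entry.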
If you're not using textures, then a lot can still be accomplished. For instance, one can pre-process the mesh according to the current camera orientation, reducing the level of detail in areas far from the camera's focal point. Also, heuristics should be used in a pre-processing stage where procedural/computed surfaces and volumes replace "smooth" triangulated surfaces.
Once the pre-processing is done, you should build your hierarchy tree (BSP? k-d? spheres?), making sure that the higher levels of the tree are in main memory, and exploit spatial coherence among rays in the same group, which means that only part of the lower-level tree needs to be loaded for each group of rays. So, process each coherence group separately, since within each group the tree doesn't require further fetching from host memory. Each pass/iteration moves each ray along the tree, changing group memberships along the way. So there's a cache mechanism at work. Here's an example. Anyway, it's been a while since I've had anything to do with computer graphics, so I'm out of my depth.
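To show why grouping rays by coherence matters, here is a small sketch in plain Python. It is a toy model, not a ray tracer: each ray is reduced to the ID of the lower-level subtree it currently needs, and the device memory is modelled as an LRU cache that can hold only a few subtrees. The class `SubtreeCache` and the functions `group_by_subtree` and `count_loads` are hypothetical names, and LRU eviction is just one possible policy.

```python
from collections import OrderedDict, defaultdict


class SubtreeCache:
    """LRU cache standing in for the device memory that holds subtrees."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.resident = OrderedDict()
        self.loads = 0

    def fetch(self, subtree_id):
        if subtree_id in self.resident:
            self.resident.move_to_end(subtree_id)  # mark as recently used
            return
        self.loads += 1                            # stand-in: host -> device copy
        self.resident[subtree_id] = True
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)      # evict least recently used


def group_by_subtree(ray_subtrees):
    """Map each subtree ID to the list of ray indices that need it."""
    groups = defaultdict(list)
    for ray_id, subtree_id in enumerate(ray_subtrees):
        groups[subtree_id].append(ray_id)
    return dict(groups)


def count_loads(visit_order, capacity):
    """Count simulated subtree uploads for a given processing order."""
    cache = SubtreeCache(capacity)
    for subtree_id in visit_order:
        cache.fetch(subtree_id)
    return cache.loads


if __name__ == "__main__":
    ray_subtrees = [0, 1, 2, 0, 1, 2, 0, 1, 2]  # subtree needed by each ray
    groups = group_by_subtree(ray_subtrees)
    grouped_order = [sid for sid, rays in groups.items() for _ in rays]
    print(count_loads(ray_subtrees, 2), count_loads(grouped_order, 2))
```

In this toy case, processing rays in their original order thrashes the cache, while processing them group by group uploads each subtree once per pass. After every pass the rays would be regrouped according to the subtree they moved into.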
May 14, 2013 at 5:04 PM
That PDF is just theoretical stuff. What I meant was how to do it in practice: how to read and write system memory from a kernel. :)

I guess a virtual memory manager is needed.
May 15, 2013 at 1:04 PM
"how to read and write to system memory from kernel"
From the kernel? You don't. It's always the CPU that orchestrates everything, while the GPU is used only for "simple" inner loops. You weren't thinking of doing the whole raytracing algorithm inside the GPU, right?