Number of cores

Jan 31, 2012 at 8:11 PM
Edited Feb 2, 2012 at 3:25 PM

In my system, I have a GTX 590 with 1024 CUDA cores, which the system apparently sees as two separate devices.

With Cudafy, is it possible to address all 1024 CUDA cores as a single device? Or do I need to program each device as a separate entity?

I'm guessing that the 3.0 GB of CUDA memory is also split into two 1.5 GB chunks.




Feb 1, 2012 at 6:34 PM

I personally have no experience of the GTX 950 and could not find any mention of it on Google either. How does it show up in NVIDIA system information?

Feb 2, 2012 at 12:13 AM

Hi Nick,

How odd. I get more than 10 million hits for it on Google.

I opted for the EVGA version of the card.

Although it takes up only one slot, it presents to the system as two separate entities.

If I can help you adapt/test this board config, I'm happy to do so.

I also have an nVidia Quadro 450 card (I currently have 6 monitors on my system), which also presents as 2 graphics cards, each with its own CUDA cores, so it's a good testbed for checking CUDA core availability.


Feb 2, 2012 at 3:18 PM

Ah, GTX 590... I'd copied and pasted the number GTX 950 from your first message.

I guess if it shows up as two cards then as far as CUDA is concerned it is the same as having two cards.  Please let me know how it works out!


Feb 2, 2012 at 3:42 PM

Hi Nick

Apologies for the typo, which I've corrected above.

Aside from that, my initial concern remains. Is it possible for a base routine in Cudafy to first ascertain what resources are currently available (i.e. the number of devices, the number of cores on each, the amount of memory on each, etc.) in order to create a "pool" from which Cudafy can draw to optimize CUDA usage?

Seems logical that, before solving a problem, you need to know not only the best problem description, but also the resources available with which to solve that problem.

I've since discovered that the nVidia Quadro 450 cores apparently aren't usable in that theoretical 'pool'; it seems that only the two GTX devices are (and one can have, at this point, as many as 4 CUDA devices).

I would think that having such a base 'pooling' routine would be globally beneficial if it doesn't create any significant runtime overhead. At any rate, if you agree and don't have multiple cards, I'm happy to test out any code you care to throw my way.



Skype: merlinbrasil

Feb 2, 2012 at 6:25 PM

Check out the CudafyHost class. It has various methods that should help you. However, the 'pooling' is something you'd need to manage yourself, since CUDA and CUDAfy work on a per-device basis.
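
As a rough illustration of managing such a 'pool' yourself, here is a minimal host-side sketch in plain C++. It does not call CUDA or CUDAfy at all: `DeviceInfo`, `PoolSummary` and `summarizePool` are hypothetical names invented for this example, and the per-device figures would have to come from whatever device query your API provides.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical description of one CUDA device as the host sees it.
// None of these fields are filled in by CUDA or CUDAfy here; the caller
// is assumed to populate them from its own device query.
struct DeviceInfo {
    int id;
    std::string name;
    int cores;
    std::uint64_t memoryBytes;
    bool usable;  // false for a device you cannot or do not want to compute on
};

// Totals across the usable devices. This is bookkeeping only: work still
// has to be dispatched to each device separately.
struct PoolSummary {
    int deviceCount = 0;
    int totalCores = 0;
    std::uint64_t totalMemoryBytes = 0;
};

PoolSummary summarizePool(const std::vector<DeviceInfo>& devices) {
    PoolSummary summary;
    for (const DeviceInfo& device : devices) {
        assert(device.cores >= 0);      // sanity check on caller-supplied data
        if (!device.usable) continue;   // skip devices excluded from the pool
        ++summary.deviceCount;
        summary.totalCores += device.cores;
        summary.totalMemoryBytes += device.memoryBytes;
    }
    return summary;
}
```

Assuming the 1024 cores and 3.0 GB of the GTX 590 split evenly, feeding in its two halves (512 cores and 1.5 GB each) would report 2 devices, 1024 cores and 3 GB in total. The totals are only for planning; each allocation and kernel launch still targets a single device.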

Feb 8, 2012 at 8:24 PM

Hello Merlin and Hello Nick - thanks for the help earlier in the year - I've had to do some different work in the meantime but should be back on the cudafy trail again in a few weeks!


I've done some research on this, as I may buy a GTX 590, and it's two down-tuned 580s on the same board sharing memory. This gives the advantage of only having to transfer data over to the GPU memory once, but you have to treat it as if you had two separate GPU cards in the machine. There's a host of CUDA forum literature on how to do this, but I 'think' you can call each device and get it to run a kernel, etc.

I wonder if this is something to do with this line in many of the Cudafy examples? (I'm probably wrong though!) 

_gpu = CudafyHost.GetDevice(eGPUType


People run machines with 2, 3 and 4 GTX 590s in them, but I suspect you have to address each 'card' separately rather than pool the resources. I suspect this is quite easy really (e.g. dividing an array into two and sending each half to each GPU), but it would be nice if it were 'seamless', so that it was treated like one GPU. Mind you, if that were the case, then you could fill your PC case with GTX 580s for slightly less money and it would run faster (as they're not down-tuned).
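
The 'divide the array and send each part to each GPU' idea can be sketched on the host side as follows. This is plain C++ that only computes which slice of the array each device would receive; it makes no CUDA or CUDAfy calls, and `Chunk` and `splitAcrossDevices` are hypothetical names for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A contiguous slice of the input array assigned to one device.
struct Chunk {
    std::size_t offset;  // index of the first element in the slice
    std::size_t length;  // number of elements in the slice
};

// Split elementCount elements into deviceCount contiguous chunks whose
// lengths differ by at most one. Chunk i is intended for device i.
std::vector<Chunk> splitAcrossDevices(std::size_t elementCount,
                                      std::size_t deviceCount) {
    assert(deviceCount > 0);
    std::vector<Chunk> chunks;
    chunks.reserve(deviceCount);
    const std::size_t base = elementCount / deviceCount;
    const std::size_t extra = elementCount % deviceCount;
    std::size_t offset = 0;
    for (std::size_t i = 0; i < deviceCount; ++i) {
        // The first `extra` devices take one additional element each.
        const std::size_t length = base + (i < extra ? 1 : 0);
        chunks.push_back({offset, length});
        offset += length;
    }
    return chunks;
}
```

Each chunk would then be copied to, and processed on, the device with the matching index. For ten elements on two devices, the chunks cover elements 0 to 4 and 5 to 9.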


Feb 9, 2012 at 4:45 AM

An overload of GetDevice allows you to specify the id of the device you want (0, 1, 2, 3...).  Even CUDA does not seamlessly pool the resources.  It could be done and some higher level libraries do it, but there would be some clever optimization necessary.

Feb 9, 2012 at 11:33 AM

Just to clarify, you must disable SLI. SLI will make it look like you have one card, but in fact you'll only be using one of the GPUs with CUDA. As you have mentioned, you will have to write code that uses two separate devices. You should be able to do this with Cudafy (pass the device id ordinal to GetDevice).

Where things get "difficult" is that before CUDA 4 you could only have one device per thread, which meant you needed two threads. With CUDA 4.x they have relaxed this and made it a little easier to write multi-device and/or multi-threaded applications.

That said, Cudafy's dll linking at the moment is a little buggy when it comes to using the new API functions.

Bottom line: get your code working for one device and then make the transition to multi-device.

Feb 9, 2012 at 6:40 PM

Which functions are "buggy"? I believe it is some of the graphics interop related dllimports.

Feb 9, 2012 at 8:54 PM

There is quite a lot of API intermixing in the .NET dllimports: not just interop, but memcpy variants too.
That said, I don't know whether it's a problem for the context pushing/popping. You definitely need the new API for multiple devices on the same thread.

I'll start playing with multi-thread stuff soon myself.

Feb 10, 2012 at 5:42 AM

It's important to understand that CUDAfy.NET makes use of CUDA.NET and only seeks to update the parts of CUDA.NET that are required for its own functioning. That said, since there is evidently interest in this legacy API, keeping it up to date is useful. However, I'd rather make the desired CUDA.NET functionality available via CUDAfy.

Feb 10, 2012 at 12:31 PM
Edited Feb 10, 2012 at 12:33 PM

There are two issues here. One is maintaining the legacy functionality through CUDA.NET, and that is fine. But even a number of legacy functions are now on the _v2 API. The only reason to cater for different API versions is that someone isn't willing to update their driver.

The other issue is developing CUDAfy so that people don't need to use CUDA.NET at all. That's a great idea. However, you've just abstracted CUDA.NET and now, again, the only reason to test for different API calls is that someone hasn't updated their driver, i.e. the behaviour of CUDAfy (as used) will be consistent. When CUDA 5 comes out you can switch CUDAfy to use CUDA 5 functions/behaviour, and the code that people have developed with CUDAfy will not be affected. This way you are only coding a single behaviour, rather than all possible behaviours due to CUDA version.

Another point is that the API upgrade wasn't done in one CUDA update. Some functions went to Version 2 API with the CUDA 3.0 update. Then, in CUDA 4.0 more functions were switched to "_v2". You can't possibly cater for each CUDA version. The behaviours of certain functions change between CUDA versions (e.g. Parameter Passing and Context management in multi-threaded code) and you'd be writing certain code twice or even three times for each version case.

At the end of the day, if someone wants to code against CUDA they should download GASS.CUDA.NET for CUDA. If they want to use CUDAfy specifically with CUDA 3.0 then they can use an older release of CUDAfy that was built against CUDA 3.0. But the bleeding-edge version of CUDAfy should solely be built against the latest CUDA release.

That's my point of view anyway.


Feb 10, 2012 at 1:26 PM
Edited Feb 10, 2012 at 1:27 PM

My thought on that issue is that Fermi is about two years old, and you can get a Fermi card for $38. Not much you can do for laptops though. I feel like CUDAfy is kinda stuck in legacy mode, so I've started a new translator from scratch that is geared specifically for Compute 2.0+, which works on Fermi+. Maybe I'll update CUDA.NET or start a new one from scratch that supports CUDA 4.1. It shouldn't be hard, as I think you could just write a program into which you copy and paste nVidia's documentation, and which parses it and spits out the updated C# API.

Feb 10, 2012 at 2:18 PM

CUDA 4.1 support is in the source code on svn, and the current CUDAfy release will already support it if it is installed: only the maths libraries were not used. It is my intention to only support the latest CUDA release; however, there are a few worries, as JeffWayne correctly points out, regarding _v2 functions and what would happen to end user software.

Since we also only use Fermi cards here I am keen to drop earlier versions; however, the number of earlier architecture users is still massive. There has been a lot of work put into CUDAfy, so maybe it's worth all interested parties pulling together. There is no reason an alternative translator, or a replacement for CUDA.NET, cannot exist within the current framework.


Xer21, if you can provide more details about yourself - via email - then once we've got the branching and versioning on svn better organized I'll give contribution access.


Feb 10, 2012 at 9:43 PM


If I understand the licensing of CUDAfy correctly, for someone to use a modified version of the translator "source code" in a product they sold, they would have to pay Hybrid DSP for a commercial license (granted, the current fee is rather cheap). When I was making the changes to CUDAfy for Debug.WriteLine() etc., it seemed like every modification was going to be a hack for every feature. It also appeared to me that not much work had been done in the translator, so putting much effort into this and committing it directly to CUDAfy didn't make much sense to me, as it would then be LGPL under Hybrid DSP.

Am I correct?

So I started a new translator from scratch so it wouldn't be bound by LGPL licensing... I'm not sure if I'm going to finish it completely or release it under MIT or what because I'm really just doing a "simple" test that ended up needing a whole lot of translation work as CUDAfy didn't support what I was trying to do :(


Feb 11, 2012 at 12:41 AM

Also, if you're interested in contacting me, you can reach me on Skype @ xer-21


Sorry for hijacking your thread!

Feb 11, 2012 at 7:13 AM

The CudafyTranslator project is built on top of ILSpy.  The work involved was still fairly considerable because it was based on the earlier translator that used .NET Reflector.  I don't quite agree with your statement that every modification was a hack to add things like WriteLine.  The framework was already in place and you added this support with this framework.  What you've added is good and useful work.  What do you feel is the restriction of the current translator (I assume you are talking about the CudafyTranslator project and not the complete CUDAfy.NET) from a technical side?

The jagged arrays support is a grey area because I'd rather have it in a derived class of CudaGPU.  Again the framework exists.

Licensing - LGPL is only an issue when someone wants to use modified code without resubmitting the changes.  Whether LGPL is worth it in the long term, I do not know since revenue is basically non-existent and CUDAfy serves us better as a source of publicity and in use in our own projects.  

Feb 11, 2012 at 1:32 PM

You can message me on Skype if you'd like to talk about it in more detail. I'm talking strictly about the translator. The jagged arrays functionality was a rough draft that I abandoned, because it was a workaround for the amount of data I was trying to transfer to the GPU, which I should never have been doing in the first place.
What I think is a hack in the translator is the way the add-ons are added to the output. Basically, what I've done is make use of transforms to transform the object model before it goes to the visitor; then, in the visitor, I've modified the output syntax to be C/C++/CUDA, instead of injecting string-based CUDA code here. The way CUDAfy does it is fine, I think, for basic code, but the functionality I was looking for would be a mess in that format.