
Choosing a new GPU

Aug 29, 2013 at 5:14 PM
Hi all,

I am ready to move my cudafy project from a dev machine to a production machine. Of course we would like maximum performance, so I am looking to spec a new machine. The base spec will be an Intel box at 3.9GHz with 64GB of fast memory, an SSD etc., basically the best parts you can buy, running Win7.

For the GPUs I am thinking of running two dual-GPU cards per machine, i.e. 2x GTX 690. At this point I am not sure how to spec the optimal card. Looking at the various sites, the GTX 690 comes in a variety of options; the main differences seem to be speed, memory and bus size.

Can somebody please explain how these will affect the performance of my application? I imagine that memory is simply how much data the GPU can hold (i.e. the more memory, the more data can be processed between sends of new data to the GPU). For card speed I assume faster is better, but what sort of relationship is there between card speed and execution speed? For the bus size I need advice: the 512-bit cards are more expensive, but does a 512-bit card transfer data 1.3x faster than a 384-bit card, or have I missed the point?
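
For what it's worth, my back-of-envelope sums, assuming both cards run the same effective memory clock (say 6 GT/s) and that bus width governs on-card memory bandwidth (getting data to the card goes over PCIe either way):

    384-bit bus: 384/8 bytes x 6 GT/s = 288 GB/s theoretical
    512-bit bus: 512/8 bytes x 6 GT/s = 384 GB/s theoretical
    ratio: 512/384 = 1.33x

So the 1.3x figure would only hold if the memory clocks match, and only for on-card reads, not for host transfers.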

Regarding the application, we are running a very simple genetic algorithm on the GPU. In the most basic example, we transfer a lot of csv data (30000 rows of 10 columns per row) to the GPU, along with another array of values which act as multipliers for each thread. Each thread simply gets the multipliers specific to that thread, then loops through the data applying the thread-specific multipliers to each column of data for each row. The result is transferred back to the CPU, where the genetic algo magic happens before new data and multipliers are sent back to the GPU, and the loop continues forever. The whole cudafy code is about 40 lines long! For speed we use floats. Dev machine speed is about 120ms per loop using 128x32 threads (the fastest combination found through testing). We would like to decrease this loop time as much as possible, but also run the same code on the 4 GPUs in parallel.
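
To give a feel for the scale, the kernel is essentially this shape (a simplified sketch with made-up names, not our production code, which does some extra per-row work):

    [Cudafy]
    public static void ApplyMultipliers(GThread thread, float[,] data, float[,] mult,
                                        int numRows, int numCols, float[] results)
    {
        // one thread per set of multipliers
        int tid = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
        float acc = 0f;
        for (int row = 0; row < numRows; row++)
        {
            for (int col = 0; col < numCols; col++)
            {
                // apply this thread's multiplier for the column
                acc += data[row, col] * mult[tid, col];
            }
        }
        results[tid] = acc;
    }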

Moving to the production machine, we are trying to maximise performance. Using a GTX 650 Ti in the dev machine we have no real problems with memory size; we are really after faster execution. I am hoping that a multicore CPU with two dual-GPU cards will provide the best solution in a single machine. I would hope to see at least a 4x processing improvement over the dev environment, plus the additional speed from the faster cards.

So, in closing: how do I spec the best cards, and am I expecting too much improvement going from 1 GPU to 4? Does anybody have any real-world numbers comparing card speed vs. execution time? Will I run into any issues running 4 cudafy threads for the 4 GPUs?

Thanks in advance for any ideas or suggestions.
Aug 29, 2013 at 6:00 PM
I know this is a bit tangential to your question, but:
"then loops through the data applying the thread specic multipliers to each column of data for each row."
It looks like you're doing something akin to a matrix dot product. Have you considered using the highly optimized BLAS library instead of writing your own cudafy code?
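
Concretely, if each thread's per-row value is a weighted sum of the columns, then the whole batch of threads is one matrix product: perRow = data x transpose(mult). A plain CPU sketch of that shape (names hypothetical):

    // data[numRows, numCols], mult[numThreads, numCols]
    // perRow[numRows, numThreads]: this triple loop is exactly an SGEMM
    for (int r = 0; r < numRows; r++)
        for (int t = 0; t < numThreads; t++)
        {
            float sum = 0f;
            for (int c = 0; c < numCols; c++)
                sum += data[r, c] * mult[t, c];
            perRow[r, t] = sum;
        }

A tuned SGEMM will beat a hand-rolled loop like this, and I believe CUDAfy exposes CUBLAS through its maths library, so it may be worth a look.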
Aug 30, 2013 at 3:12 PM
Analyze where your bottlenecks are. Can you easily distribute over multiple GPUs? Are the data transfers significant? If so, go for PCIe 3.0. If overlapped copy to and from the device would help, then consider a higher-end Quadro or Tesla. Why? Because they have dual copy engines and can perform copy-to and copy-from in parallel, unlike GeForce.
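
From memory, the CUDAfy side of overlap looks roughly like this; treat the names as illustrative and check them against the current API. Note that async copies need pinned host memory:

    // pinned host buffers are required for async transfers
    IntPtr pinnedIn = gpu.HostAllocate<float>(n);
    IntPtr pinnedOut = gpu.HostAllocate<float>(n);
    gpu.CreateStream(1);
    // queue upload, kernel and download on one stream; a second stream
    // working on the next batch at the same time is what gives the overlap
    gpu.CopyToDeviceAsync(pinnedIn, 0, devIn, 0, n, 1);
    gpu.LaunchAsync(gridSize, blockSize, 1, "MyKernel", devIn, devOut);
    gpu.CopyFromDeviceAsync(devOut, 0, pinnedOut, 0, n, 1);
    gpu.SynchronizeStream(1);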
If you are using multiple GPUs and PCIe performance matters, go for socket LGA 2011 rather than the cheaper varieties or Haswell, since on those you only have 16 PCIe lanes total from the CPU, which must be shared between all the GPUs.
Do you transfer the text file to the GPU? I have also experimented with CSV parsing and writing, with interesting results. However, can you get the software that produces the CSV to produce binary instead?
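
To show the difference (illustrative, standard .NET IO; needs System.IO and System.Globalization):

    // CSV route: one float.Parse per cell, ~300,000 parses for your data set
    float[,] data = new float[numRows, numCols];
    using (var reader = new StreamReader("data.csv"))
    {
        for (int r = 0; r < numRows; r++)
        {
            string[] cells = reader.ReadLine().Split(',');
            for (int c = 0; c < numCols; c++)
                data[r, c] = float.Parse(cells[c], CultureInfo.InvariantCulture);
        }
    }

    // binary route: the same values as raw floats, one bulk copy, no parsing
    byte[] raw = File.ReadAllBytes("data.bin");
    Buffer.BlockCopy(raw, 0, data, 0, raw.Length);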
Put the time into analysis; it will stop you throwing a tight research budget at overkill hardware whose gains cannot easily be replicated. Use tools like NVIDIA Visual Profiler and the .NET profilers.
By the way, if you are publishing a paper, let us at CUDAfy/Hybrid DSP know; I can likely get you some publicity.
Aug 30, 2013 at 4:31 PM
Thanks for the replies.

pedritolo1: I think I may have oversimplified my example; there is some additional work performed on each row of data after the multiply, so I am not sure the BLAS library will help, although I will investigate it.

Nick, thanks for the details. For your comment "go for socket LGA 2011 and not the cheaper varieties or Haswell", are you suggesting to go for or avoid Haswell? Regarding the application, we read the csv data into an array of floats using the CPU, basically float[,] gpuData = new float[numOfRows, numOfCols], and then transfer that array to the GPU. I'm not sure if there is a faster way to get the data onto the GPU? Regarding overlapped IO, I am not sure it is important to us: we copy the data and the model vars to the GPU, run the kernel, call sync, copy the results off the GPU and process them on the CPU, then loop again and again... It will be trivial for us to parallelise the processing because at the moment we already have to run the data in batches. I hope that running on 4 GPUs, with each GPU driven by a separate CPU thread from the main c# app, will not be a problem?
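
The structure I have in mind is one managed thread per device, something like this (untested sketch using System.Threading; RunGaLoop is a placeholder for the allocate/copy/launch/copy loop described above):

    CudafyModule km = CudafyTranslator.Cudafy(); // translate once, load per device
    int deviceCount = CudafyHost.GetDeviceCount(eGPUType.Cuda);
    var workers = new List<Thread>();
    for (int i = 0; i < deviceCount; i++)
    {
        int deviceId = i; // capture a copy for the closure
        var worker = new Thread(() =>
        {
            // each thread owns its own device and context
            GPGPU gpu = CudafyHost.GetDevice(eGPUType.Cuda, deviceId);
            gpu.SetCurrentContext();
            gpu.LoadModule(km);
            RunGaLoop(gpu); // run the GA batches on this device
        });
        worker.Start();
        workers.Add(worker);
    }
    workers.ForEach(w => w.Join());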

We have no plans to release a paper, but I can create some project overview and implementation documentation for the site here once we have implemented the multiple-GPU solution, if you are interested.
Aug 30, 2013 at 7:50 PM
"go for socket LGA 2011 and not the cheaper varieties or Haswell" == "LGA 2011 && !(the cheaper varieties or Haswell)"
Haswell, I believe, also has only 16 PCIe lanes from the CPU.
Try to find out how much processing is involved in reading the csv in. If the csv layout never changes, you could upload the raw csv to the GPU and do the parsing there, but make sure the effort is worth it. If it is a commercial project I could consult on this part, if this stage really is a bottleneck, but somehow I do not think it is. Give Visual Profiler a go and keep an eye on Task Manager for the CPU load; if CPU load is significant, try one of the .NET profilers.

Any info on the project and organization you are willing to share would be welcome!