Training and evaluating Neural Networks on GPU.

Oct 16, 2012 at 6:10 AM


I have developed a Genetic Algorithm application that optimises input parameters for Neural Networks. The end goal is to develop a Neural Network that can predict future prices in the Forex market. I am using a Genetic Algorithm because each network can have a maximum of 50 inputs, although each input row in my data set has 350 inputs. I'm basically trying to work out which of the 350 possible inputs provide the most accurate predictions.

Training the networks is the time-consuming process. Currently I have 4 separate populations, each with 20 Genes (Networks). The populations run in their own threads, so that I train all populations simultaneously.

The process of decoding a Gene, gathering the information to create the network, removing outliers, normalizing and generating the training sets, doing the actual training, and then evaluating the network to get the fitness score of the Gene is done by calling a single method, RunEpoch().

The application works well, but it is very slow. Evaluating all the Genes in all four populations takes almost a whole day, largely because I am limited by the number of cores in the CPU. Whether I train the populations one at a time with four networks training simultaneously, or run four populations simultaneously with each training a single network at a time, it wouldn't make much difference.

My question is: how complex can a method be when run on a GPU? For instance, the RunEpoch method I use to evaluate a Gene instantiates numerous classes, which in turn instantiate other classes that run methods, and so on. Is it possible to cudafy the RunEpoch method and pass the training set as a parameter so that there is no I/O done by the GPU?

Working out a way of doing this would significantly improve performance for me, as I could train and evaluate the entire population in a single pass and simply process each population one after the other. I could even increase the population size from 20 to, say, 80 and make use of the higher number of cores on the GPU (112, I believe, on the GeForce 9800 GT); this would potentially decrease the number of generations I would need to achieve a good result.

I should also mention that each input for the network is a double, as is the output. I read somewhere that GPUs can only handle ints; I'm not sure if that is still the case.

Thanks in advance for any thoughts.

Oct 16, 2012 at 12:32 PM

Hi there,

Sounds interesting.  CUDAfy does not support classes, but it does support structs.  NVIDIA GPUs have always supported single-precision floating point, and for many years now double precision too. I do not believe your GPU will support double. If you are considering a new GPU and double is a must for you, then you may be better off with an older Fermi GPU (e.g. a 480 or 570) than the newer Kepler (e.g. a 660) - Kepler is not great with doubles.
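To give an idea of what the struct-based approach looks like, here is a rough, untested sketch. Everything in it is hypothetical - the Gene struct, the kernel name, and the flattened-array layout are illustrative, not taken from your code. The key idea is that classes must be flattened into structs and primitive arrays before they reach the GPU, with one thread evaluating one Gene:

```csharp
using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;

public struct Gene
{
    public int InputCount;  // which of the 350 inputs this network uses
    // fixed-size value-type fields only - no reference types
    // inside a cudafied struct
}

public class GpuEvaluator
{
    [Cudafy]
    public static void EvaluateGenes(GThread thread, Gene[] genes,
                                     float[] trainingSet, float[] fitness)
    {
        int tid = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
        if (tid >= genes.Length) return;
        // ... decode genes[tid], run training against trainingSet,
        //     then write the resulting fitness score ...
        fitness[tid] = 0.0f; // placeholder
    }

    public static float[] Run(Gene[] genes, float[] trainingSet)
    {
        CudafyModule km = CudafyTranslator.Cudafy();
        GPGPU gpu = CudafyHost.GetDevice(eGPUType.Cuda);
        gpu.LoadModule(km);

        Gene[] devGenes = gpu.CopyToDevice(genes);
        float[] devData = gpu.CopyToDevice(trainingSet); // copied once, reused
        float[] devFitness = gpu.Allocate<float>(genes.Length);

        // one thread per Gene - the whole population in a single pass
        gpu.Launch(1, genes.Length).EvaluateGenes(devGenes, devData, devFitness);

        float[] fitness = new float[genes.Length];
        gpu.CopyFromDevice(devFitness, fitness);
        gpu.FreeAll();
        return fitness;
    }
}
```

Note float rather than double throughout: the 9800 GT is a compute capability 1.1 device, which has no double-precision support.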

Another important issue is that on Windows, if a kernel takes longer than 1-2 seconds the display driver watchdog will time it out.  So you may need to split your algorithm up into many launches. Even if you disable the time-out, it is still good practice not to lock the GPU up for too long.  Remember also to minimize data transfer over PCIe.
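For example (host-side sketch only - the method and variable names are illustrative), splitting the training into one short launch per chunk of epochs keeps each kernel well inside the watchdog limit, while all intermediate state stays on the device and results cross PCIe only once:

```csharp
// Many short kernel launches instead of one long-running kernel.
for (int epoch = 0; epoch < totalEpochs; epoch += epochsPerLaunch)
{
    gpu.Launch(gridSize, blockSize).TrainChunk(devGenes, devData,
                                               epoch, epochsPerLaunch);
    gpu.Synchronize();  // each launch finishes well under the 1-2 s limit
}

// Copy over PCIe only once, after all launches are done.
gpu.CopyFromDevice(devFitness, fitness);
```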

Porting complexity is entirely dependent on your existing code and your understanding of CUDA.  CUDAfy makes things easier for .NET programmers but it does not shield you from needing to know the basics of CUDA.

I hope you will give it a go and keep us informed of your progress.

Best regards,