How to optimize threads?

May 24, 2012 at 1:01 PM


I have a lot of pictures (800 pictures/sec), each 1280x40 pixels.

And I have a single card with 400 cores.

My algorithm uses only 1280 threads (3 blocks). How can I use the remaining threads to process other pictures at almost the same time?

May 25, 2012 at 1:12 AM
Edited May 25, 2012 at 1:27 AM

May I write:

dim3 dimthread(512, 1);
dim3 dimBlock_a(size/512 + (size%512 == 0 ? 0 : 1), 1);
dim3 dimBlock_b(size/512 + (size%512 == 0 ? 0 : 1), 2);
kernel_a<<<dimBlock_a, dimthread>>>(...........);
kernel_b<<<dimBlock_b, dimthread>>>(.........);
where kernel_a handles the first picture, kernel_b handles the second picture, and so on?
Jun 2, 2012 at 5:59 AM

Are you writing a CUDAfy program? Is the code you have here pure CUDA code?

Anyway, this is an interesting issue that will only be truly addressed with some of the new Kepler GPUs. Currently, although you can use stream IDs for copying between host and GPU and for launching kernels, there are actually only three separate queues: copy to device, copy from device, and kernels (note that on GeForce and lower-end Quadro cards the two copy queues are also effectively merged, since they have only one copy engine). Therefore your kernels are launched sequentially. If the NVIDIA profiling tools show that there are still unused resources on your GPU (a single kernel may already saturate it), then you would need to rewrite your kernel to process two frames.
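As a rough illustration of a kernel that processes several frames in one launch, here is a minimal sketch in CUDA C. It assumes one thread per image column (1280 threads, so 3 blocks of 512 per frame) and uses the grid's y dimension as the frame index. The names findLaserLine and blocksFor, the "brightest pixel in each column" search, and the back-to-back frame layout in memory are all assumptions made for the example, not your actual algorithm.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the laser-line search: for each column of each
// frame, record the row with the highest intensity.
__global__ void findLaserLine(const unsigned char* frames, int* peakRow,
                              int width, int height, int numFrames)
{
    int col   = blockIdx.x * blockDim.x + threadIdx.x;  // column within a frame
    int frame = blockIdx.y;                             // one grid row per frame
    if (col >= width || frame >= numFrames) return;

    // Frames are assumed to be stored back to back, row-major.
    const unsigned char* img = frames + (size_t)frame * width * height;
    int best = 0;
    for (int row = 1; row < height; ++row) {
        if (img[row * width + col] > img[best * width + col]) best = row;
    }
    peakRow[frame * width + col] = best;
}

// Ceiling division: number of blocks needed to cover n items.
static int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main()
{
    const int width = 1280, height = 40, numFrames = 2;
    const size_t frameBytes = (size_t)width * height * numFrames;

    unsigned char* dFrames = 0;
    int* dPeakRow = 0;
    cudaMalloc((void**)&dFrames, frameBytes);
    cudaMalloc((void**)&dPeakRow, sizeof(int) * width * numFrames);
    cudaMemset(dFrames, 0, frameBytes);  // placeholder input data

    dim3 threads(512, 1);
    dim3 blocks(blocksFor(width, threads.x), numFrames);  // 3 x 2 blocks here
    findLaserLine<<<blocks, threads>>>(dFrames, dPeakRow, width, height, numFrames);

    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(dFrames);
    cudaFree(dPeakRow);
    return err == cudaSuccess ? 0 : 1;
}
```

Raising numFrames batches more pictures into the same launch, at the cost of waiting for that many frames to arrive before processing starts, which matters for a real-time pipeline.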

Jun 3, 2012 at 2:51 PM
Edited Jun 3, 2012 at 2:58 PM


Sorry, the code is pure CUDA C, not CUDAfy. I need to find the laser line in each picture (3D laser scanning), and it must run in real time. I use a GeForce GTX 570 (480 cores).

I think this graphics card should support more than 1280 threads (there are a lot of threads I never use, aren't there?)

So, is the best way only to use parallel programming, like OpenMP, TBB, etc.?

Here is a reference:
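One way to check this is to ask the CUDA runtime how many threads the card can keep resident. The core count is not the limit, because each multiprocessor holds many more threads than it has cores in order to hide memory latency. The following is a minimal sketch that assumes the card is device 0; the helper residentThreads is a made-up name for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Upper bound on threads that can be resident on the whole GPU at once.
static int residentThreads(int multiprocessors, int threadsPerMultiprocessor)
{
    return multiprocessors * threadsPerMultiprocessor;
}

int main()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0 assumed
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("device:                  %s\n", prop.name);
    printf("multiprocessors:         %d\n", prop.multiProcessorCount);
    printf("max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    printf("max resident threads:    %d\n",
           residentThreads(prop.multiProcessorCount,
                           prop.maxThreadsPerMultiProcessor));
    return 0;
}
```

If the last number is far above 1280, a single 1280-thread launch leaves most of the card idle unless other work runs alongside it.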