CUDAfy does not work with GTX480 on CUDA5.5?

Nov 28, 2013 at 8:49 AM
I have 2 GTX480 in my PC, and I installed CUDA5.5 in the default directory.

I'm using CUDAfy.net 1.25, which can't find CUDA5.5. That's OK: I modified the source code and rebuilt it, and it can find CUDA now.

I have code like this:
float time = 0, timeTotal = 0;
while (start < n)
{
     s_e_idx[0] = start;
     s_e_idx[1] = end;
     int[] dev_s_e_idx = gpu.CopyToDevice(s_e_idx);
     gpu.Set<int>(dev_caf);
     gpu.StartTimer();
     gpu.Launch(numBlocks, threadsPerBlock).CalculateKendallSEN(n, dev_lstAct, dev_lstFan, dev_lstPr, dev_caf, dev_cap, dev_cfp, dev_s_e_idx);
     time = gpu.StopTimer();
     timeTotal += time;
     Console.WriteLine("Process: {0}%, {1} ms this part, {2} ms in total.", Convert.ToDecimal(end) * 100 / n, time, timeTotal);
     start = end;
     end = start + span;
     if (end > n) end = n;
}
but an error was raised:

[Image: error screenshot]

If I comment out the timer code, a new error is raised:

[Image: error screenshot]

It seems the loop ran once, then the video driver crashed and reinitialized, and then the data copy failed.
Coordinator
Nov 29, 2013 at 2:22 PM
Without seeing the code in CalculateKendallSEN it is difficult to say. Chances are you are writing outside allocated memory, or your kernel ran for longer than the default Windows driver timeout, which is about 2 seconds.
Can you also say what you changed in the source code to get it working with CUDA 5.5?
You may also want to try version 1.26 that was released about 10 minutes ago.
Nick
Dec 16, 2013 at 5:01 AM
Here is the code of CalculateKendallSEN:
[Cudafy]
public static void CalculateKendallSEN(GThread thread, int elements, int[] act, int[] fan, int[] pr, int[] c_af, int[] c_ap, int[] c_fp, int[] s_e_idx)
{
    int i = thread.threadIdx.x + thread.blockIdx.x * thread.blockDim.x;
    if (i < s_e_idx[0] || i >= s_e_idx[1] || i >= act.Length) return;
    c_af[i] = 0;
    c_ap[i] = 0;
    c_fp[i] = 0;
    for (int j = i + 1; j < elements; j++)
    {
        if ((act[i] < act[j] && fan[i] < fan[j]) || (act[i] > act[j] && fan[i] > fan[j])) c_af[i]++; else c_af[i]--;
        if ((act[i] < act[j] && pr[i] < pr[j]) || (act[i] > act[j] && pr[i] > pr[j])) c_ap[i]++; else c_ap[i]--;
        if ((fan[i] < fan[j] && pr[i] < pr[j]) || (fan[i] > fan[j] && pr[i] > pr[j])) c_fp[i]++; else c_fp[i]--;
    }
}
The point is that the same code works well on my TPW520 with a Quadro 2000M.

To make it work with CUDA5.5, I just added an environment variable "NVSDKCOMPUTE_ROOT", and added the following function to Utilities.cs:
public static string CUDA_PATH()
{
    return Environment.GetEnvironmentVariable("NVSDKCOMPUTE_ROOT");
}
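For cross-checking the kernel's output on small inputs, a plain CPU version can help. The sketch below is not CUDAfy code: `KendallReference`, `CountRow` and `CountChunk` are hypothetical names, and it assumes the same counting rule as the kernel above (a concordant pair adds 1, anything else, including ties, subtracts 1).

```csharp
using System;

// Hypothetical CPU reference for one pair of arrays; not part of CUDAfy.
public static class KendallReference
{
    // Mirrors the kernel body for a single index i:
    // +1 for each j > i where a and b move in the same direction,
    // -1 otherwise (ties included).
    public static int CountRow(int[] a, int[] b, int i, int elements)
    {
        int count = 0; // local accumulator, returned once at the end
        for (int j = i + 1; j < elements; j++)
        {
            bool concordant = (a[i] < a[j] && b[i] < b[j]) || (a[i] > a[j] && b[i] > b[j]);
            if (concordant) count++; else count--;
        }
        return count;
    }

    // Fills result[i] for start <= i < end, like one chunk of the launch loop.
    // Indices outside the chunk are left at 0.
    public static int[] CountChunk(int[] a, int[] b, int start, int end)
    {
        int elements = Math.Min(a.Length, b.Length);
        int[] result = new int[elements];
        for (int i = Math.Max(start, 0); i < Math.Min(end, elements); i++)
        {
            result[i] = CountRow(a, b, i, elements);
        }
        return result;
    }
}
```

Running `CountChunk` on the same host arrays that were copied to the device, with a small n, gives values to compare against what comes back from the GPU for c_af, c_ap and c_fp.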
Jan 1, 2014 at 5:06 PM
I modified my code as follows.

However, an unknown exception was caught when cudafy tried to copy data from device to host, at "gpu.CopyFromDevice<int>(dev_caf, C_AF);".

So I can't get the calculation result...
while (start < n)
{
         now1 = DateTime.Now;
         gpu.Launch(numBlocks, threadsPerBlock).CalculateKendallSEN(n, dev_lstAct, dev_lstFan, dev_lstPr, dev_caf, dev_cap, dev_cfp, start, end);
         now2 = DateTime.Now;
         time = (now2 - now1).TotalMilliseconds;
         timeTotal += time;
         Console.WriteLine("Process: {0}%, {1} ms this part, {2} ms in total.", Convert.ToDecimal(end) * 100 / n, time, timeTotal);
         start = end;
         end = start + span;
         if (end > n) end = n;
}
now2 = DateTime.Now;
Console.WriteLine("Finished at {0}-{1}-{2} {3}:{4}:{5}:{6}.", now2.Year, now2.Month, now2.Day, now2.Hour, now2.Minute, now2.Second, now2.Millisecond);

gpu.CopyFromDevice<int>(dev_caf, C_AF);
gpu.CopyFromDevice<int>(dev_cap, C_AP);
gpu.CopyFromDevice<int>(dev_cfp, C_FP);
Coordinator
Jan 2, 2014 at 12:24 PM
An ErrorUnknown in the CopyFromDevice is still likely to be an error in the Launch. If you remove the Launch you will more than likely not get this error in the CopyFromDevice call. Try simplifying your kernel, remove all lines from it and then start adding them in again one by one.
Jan 3, 2014 at 2:13 PM
Hi Nick,
my kernel function is as follows:
[Cudafy]
public static void CalculateKendallSEN(GThread thread, int elements, int[] act, int[] fan, int[] pr, int[] c_af, int[] c_ap, int[] c_fp, int start_idx, int end_idx)
{
    int i = thread.threadIdx.x + thread.blockIdx.x * thread.blockDim.x;
    if (i < start_idx || i >= end_idx || i >= act.Length) return;
    for (int j = i + 1; j < elements; j++)
    {
        if ((act[i] < act[j] && fan[i] < fan[j]) || (act[i] > act[j] && fan[i] > fan[j])) c_af[i]++; else c_af[i]--;
        if ((act[i] < act[j] && pr[i] < pr[j]) || (act[i] > act[j] && pr[i] > pr[j])) c_ap[i]++; else c_ap[i]--;
        if ((fan[i] < fan[j] && pr[i] < pr[j]) || (fan[i] > fan[j] && pr[i] > pr[j])) c_fp[i]++; else c_fp[i]--;
    }
}
I tried your suggestion. When I added the lines as follows, it failed:
[Cudafy]
public static void CalculateKendallSEN(GThread thread, int elements, int[] act, int[] fan, int[] pr, int[] c_af, int[] c_ap, int[] c_fp, int start_idx, int end_idx)
{
    int i = thread.threadIdx.x + thread.blockIdx.x * thread.blockDim.x;
    if (i < start_idx || i >= end_idx || i >= act.Length) return;
    for (int j = i + 1; j < elements; j++)
    {
        if ((act[i] < act[j] && fan[i] < fan[j]) || (act[i] > act[j] && fan[i] > fan[j]))
            c_af[i] =c_af[i]+1;
    }
}
So what's wrong with it? Can the GTX480 not do addition???.....
Coordinator
Jan 4, 2014 at 5:37 PM
By far the most common cause of this kind of error is trying to access outside the bounds of the array. You check for i being between a certain range, however you then set j = i + 1 and increment to i + 1 + elements. Are you sure you should not check that i is less than fan.Length - 1 - elements and act.Length - 1 - elements?
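One way to rule out an out-of-bounds access before launching is a host-side check on the arguments. This is a minimal sketch, assuming the loop bounds shown in the kernel earlier in the thread (j runs from i + 1 up to elements - 1, so the largest index read is elements - 1, and i itself stays below end_idx). `LaunchCheck` and `ArgsInBounds` are hypothetical names, not CUDAfy API.

```csharp
using System;

// Hypothetical pre-launch sanity check; not part of CUDAfy.
public static class LaunchCheck
{
    // Returns true when every index the kernel can touch fits in every array.
    // arrayLengths: lengths of all input and output arrays passed to the kernel.
    public static bool ArgsInBounds(int elements, int start, int end, params int[] arrayLengths)
    {
        // The chunk [start, end) must be a valid sub-range of [0, elements).
        if (start < 0 || start > end || end > elements) return false;

        // The inner loop reads up to index elements - 1 of each array.
        foreach (int length in arrayLengths)
        {
            if (elements > length) return false;
        }
        return true;
    }
}
```

Calling this inside the launch loop with n, start, end and the host-side lengths of the arrays (for example `C_AF.Length`) would flag a mismatch between `elements` and the allocated sizes before the GPU sees it.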
Jan 6, 2014 at 1:17 PM
Thanks Nick.
Because I don't calculate over the whole array, I'm sure the bounds checking is OK. The same program works well on my W520 with a Q2000M.
But I'm using SLI on my 2 GTX480s; maybe that's the problem? I'll try...
Jan 10, 2014 at 4:58 PM
Hi guys, I've made some progress.

First, I removed the SLI bridge between my 2 GTX480 cards. It looks better: the error changed, but I didn't record it.

Now the code works with relatively small values of the variable "elements", such as 100000 or 200000. If I feed it 500000, it fails again with the same error, Timeout or unknown.

Then I implemented it in C++ AMP, and I found it is faster than cudafy.net. I think the reason is that cudafy makes the main thread monitor the GPU while it runs, because I can see one CPU thread is occupied during the run, and I can terminate the program with Ctrl+C. I can't do that with the C++ AMP program.

However, the C++ AMP program works with "elements" at 100000, but not at 200000 or more either...
When debugging, the error message "Microsoft C++ exception: Concurrency::accelerator_view_removed at memory location 0x0022DCE4." was raised when trying to copy data from device to host.

According to the blog here: http://blogs.msdn.com/b/nativeconcurrency/archive/2012/02/24/accelerator-view-removed-exception-of-c-amp.aspx
I caught the error; it says "failed to create constant buffer".

Does anyone have any idea about that?
Is it because of the GTX480's compute capability of 2.0?
Jan 10, 2014 at 8:04 PM
Well, I think we've got TDR...

Here is the blog post about it:
http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/06/handling-tdrs-in-c-amp.aspx
Jan 11, 2014 at 9:49 AM
Edited Jan 11, 2014 at 8:26 PM
Hi
I agree with Nick, I think you're either having a TDR or trying to access memory beyond a buffer's boundary. The SLI matter is not relevant for this sort of CUDA programming. If it's TDR, then it is relatively easy to fix; there are other posts about it on this forum. That AMP would turn out faster is, IMO, merely a result of an improper implementation in CUDA. And the main CPU thread blocking on the GPU execution has no bearing on the GPU's performance.
cheers

Edit: Besides, it's normal for hardware such as the GPU to mask, ignore or act unpredictably on software errors. In fact, the only reason your C# code detects boundary violations is because the compiler generates code to check it. So a buggy program may seem to work on some graphic cards and raise an exception on others.
Jan 12, 2014 at 4:10 PM
Edited Jan 12, 2014 at 4:10 PM
Hi guys, finally, I made it work!

According to this post:
http://msdn.microsoft.com/en-us/library/windows/hardware/ff569918(v=VS.85).aspx
I just added a value to the registry to disable TDR detection...
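
For reference, the TDR settings live under `HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers`: a `TdrLevel` of 0 turns detection off, and `TdrDelay` is the timeout in seconds (2 by default). The sketch below only reads the current values and does not change anything. `TdrInfo`, `Describe` and `ReadCurrent` are hypothetical names, and it assumes Windows and .NET's `Microsoft.Win32.Registry`. Both values are DWORDs that are normally absent unless someone has added them, in which case the defaults apply.

```csharp
using System;
using Microsoft.Win32;

// Hypothetical helper that reports the current TDR settings (read-only).
public static class TdrInfo
{
    const string Key = @"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers";

    // Turns the raw registry values into a short description.
    // A missing value (null) falls back to the documented default:
    // TdrLevel 3 (detect and recover), TdrDelay 2 seconds.
    public static string Describe(object tdrLevel, object tdrDelay)
    {
        int level = tdrLevel is int ? (int)tdrLevel : 3;
        int delay = tdrDelay is int ? (int)tdrDelay : 2;
        if (level == 0) return "TDR detection is disabled";
        return "TDR detection is enabled (TdrLevel " + level + "), timeout " + delay + " s";
    }

    // Reads both values from the registry; works only on Windows.
    public static string ReadCurrent()
    {
        object level = Registry.GetValue(Key, "TdrLevel", null);
        object delay = Registry.GetValue(Key, "TdrDelay", null);
        return Describe(level, delay);
    }
}
```

Printing `TdrInfo.ReadCurrent()` before a long kernel run shows whether the roughly 2 second watchdog mentioned earlier in the thread is still active. Raising `TdrDelay` instead of setting `TdrLevel` to 0 keeps some protection against a hung GPU.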