Error : Ensure that compiled architecture version is suitable for device

Jan 26, 2012 at 2:11 PM

This typically indicates that your GPU is of an earlier CUDA architecture than the one the module was compiled for.  By default CUDAfy compiles for 1.2.  You can change this by using an overload of the Cudafy method.
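For example, something like this (sm_11 is just an illustration here; pick whatever matches your card):

    // Compile for a specific architecture instead of the 1.2 default
    CudafyModule km = CudafyTranslator.Cudafy(ePlatform.Auto, eArchitecture.sm_11);
    GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target);
    gpu.LoadModule(km);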

Nick

>>>

Hi I get this :

[Cudafy.Host.CudafyHostException] = {"CUDA.NET exception: ErrorNoBinaryForGPU (Ensure that compiled architecture version is suitable for device)."}

  CudafyModule km = CudafyTranslator.Cudafy();

 

 GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target);

When this line executes :  gpu.LoadModule(km);

 

We have tried changing the project to build as AnyCPU, x86 and x64, but we get the same error.

 

Any pointers? We're on VS2010, Win7 64-bit, CUDA 4.0.

Jan 26, 2012 at 2:53 PM

Yes, this works when passing the architecture enum. I only have a 1.1-capable card and that was the issue.

Thanks Nick

Wayne

Nov 29, 2012 at 8:34 PM
Edited Nov 29, 2012 at 9:00 PM

Please explain, what does "1.1 capable card" mean?

  CudafyModule km = CudafyTranslator.Cudafy(ePlatform.x86, eArchitecture.sm_11); //tested 11 to 35 but gives error

GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target);

 gpu.LoadModule(km); // gives error

Visual Studio 2010 C# Express, CUDA v5.0, CUDAfy v1.12

What should I do?

Nov 29, 2012 at 9:10 PM

Hi uugan,

More modern GPUs have a higher compute capability. The higher the compute capability number, the more complex the operations your GPU can perform; for example, you can't use double-precision instructions or certain more complex mathematical instructions on a device with only compute capability 1.1.

Your error is probably due to your GPU being of a compute capability lower than 1.1.

A good way to find out is to call "gpu.GetArchitecture()" right before your call to "LoadModule".
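Roughly like this (just a diagnostic sketch; Capability and GetArchitecture are the calls mentioned above, printed as whatever they return):

    GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target);
    // Report what the device actually supports before trying to load anything
    Console.WriteLine("Device capability:   " + gpu.GetDeviceProperties(false).Capability);
    Console.WriteLine("Device architecture: " + gpu.GetArchitecture());
    CudafyModule km = CudafyTranslator.Cudafy(ePlatform.x86, eArchitecture.sm_11);
    gpu.LoadModule(km); // fails with ErrorNoBinaryForGPU if the module arch doesn't suit the device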

 

cheers

Dec 16, 2012 at 4:57 AM
Edited Dec 17, 2012 at 8:59 PM

Should you be able to compile for LOWER compute capabilities than your card is capable of?

I have a GTX 660 Ti, which is capable of 3.0, but when I try to compile for 1.1 I currently get the ErrorNoBinaryForGPU error. It used to work, though.


Dec 16, 2012 at 11:17 AM

You should be able to compile a module for any architecture; whether you can then load it onto your particular GPU is a separate matter. Otherwise you'd have a hard time generating the installation packages for each architecture combination.

Dec 16, 2012 at 9:00 PM
Edited Dec 16, 2012 at 9:05 PM

why's that?

Dec 16, 2012 at 11:06 PM

For the same reason you're able to compile a C# project for either x86 or x64 at will, regardless of your CPU's architecture: you must be able, from within a single dev environment, to produce deployment packages for diverse end clients.

That's why, in CUDAfy, the compilation stage and the module-loading stage are decoupled.
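As a rough sketch of what that decoupling looks like (Serialize appears later in this thread; the matching Deserialize call is assumed here):

    // On the build machine: translate and compile once per target architecture,
    // then ship the resulting .cdfy files with the installer.
    CudafyModule km30 = CudafyTranslator.Cudafy(ePlatform.Auto, eArchitecture.sm_30);
    km30.Serialize(@"C:\temp\kernels_sm30.cdfy");

    // On the end user's machine: load the pre-built module that matches their GPU.
    CudafyModule km = CudafyModule.Deserialize(@"C:\temp\kernels_sm30.cdfy"); // assumed counterpart of Serialize
    GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target);
    gpu.LoadModule(km); // this is the step that fails if the arch doesn't suit the device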

 

Dec 17, 2012 at 8:05 PM
Edited Dec 17, 2012 at 10:21 PM

Understood, although to use your CPU analogy, I can still run x86 code on an x64 machine.

I am confused though; my code DID use to work with 1.1 on this machine. I think it also used to run approximately twice as fast, and that's almost the only thing I've changed.

I say almost, because I also changed my project to be 64-bit rather than 32-bit. I'm going back to test it all more thoroughly now; I may be able to reproduce it.

Dec 17, 2012 at 10:16 PM
Edited Dec 17, 2012 at 10:18 PM

OK, I can reproduce it, and the results are very strange.

I get the same behavior for 32-bit and 64-bit, but the speed is different (see below).

Here's my test code:

 

using System;
using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;
using System.Diagnostics;

namespace GPUtester
{
	class Program
	{
		public static void Main(string[] args)
		{
			if(Environment.Is64BitProcess) Console.WriteLine("64 bit");
			else Console.WriteLine("32 bit");
			Tester tester = new Tester();
			tester.TestIt(eArchitecture.sm_11);
			
			Console.WriteLine();
			tester.TestIt(eArchitecture.sm_30);
			
			Console.Write("Press any key to continue . . . ");
			Console.ReadKey(true);
		}
		
	}
	
	class Tester
	{
		public void TestIt(eArchitecture eSetting)
		{
			GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target,0);
			gpu.UnloadModules();
			//GPGPU gpu = CudafyHost.GetDevice(eGPUType.Emulator,0);
			GPGPUProperties gpprop = gpu.GetDeviceProperties(false);
			Console.WriteLine("Device Cuda Capability = " + gpprop.Capability);
			Console.WriteLine("Cuda architecture = " + eSetting);
			CudafyModule km = CudafyTranslator.Cudafy(ePlatform.Auto, eSetting, typeof(KernelContainer));
			km.Serialize(@"C:\temp\trial.cdfy");
			gpu.LoadModule(km,false);
			//gpu.EnableMultithreading();
			gpu.FreeAll();
			
			KernelContainer kc = new KernelContainer();
			for(int i=0;i<10;i++) Console.WriteLine("\tTime " + i + " = " + kc.timeKernel(gpu) + "ms");
		}
		
	}
	
	class KernelContainer
	{
		[Cudafy]
		public static void Kernel(GThread thread, float[] vals, float[] rets)
		{
			int ThreadID,TotalThreads;

			ThreadID = thread.blockIdx.x*thread.blockDim.x + thread.threadIdx.x;
			TotalThreads = thread.blockDim.x * thread.gridDim.x;

			for(int i=ThreadID;i<vals.Length;i+=TotalThreads)
			{
				float v = (float)i;
				for(long s=0;s<1000;s++)
				{
					v = vals[i]*v + (1-vals[i])*(v+1);
					if(v>10) v = v/10;
				}
				rets[i] = v;
			}			
		}
		
		public long timeKernel(GPGPU gpu)
		{
			Stopwatch sw = new Stopwatch();
			
			float[] vals = new float[10000000];
			float[] rets = new float[vals.Length];
			for(int i=0;i<vals.Length;i++) vals[i] = 1 - i/vals.Length/2;
			
			float[] dev_vals, dev_rets;
			dev_vals = gpu.CopyToDevice(vals);
			dev_rets = gpu.Allocate<float>(rets);
			
			sw.Start();
			
			gpu.Launch(128, 8, "Kernel", dev_vals, dev_rets);
			gpu.Synchronize();
			
			sw.Stop();			
			
			return sw.ElapsedMilliseconds;
		}
	}
}

 

 

and here are the results:

32 bit:

 

32 bit
Device Cuda Capability = 3.0
Cuda architecture = sm_11
        Time 0 = 1301ms
        Time 1 = 1273ms
        Time 2 = 1282ms
        Time 3 = 1273ms
        Time 4 = 1273ms
        Time 5 = 1273ms
        Time 6 = 1273ms
        Time 7 = 1273ms
        Time 8 = 1273ms
        Time 9 = 1273ms

Device Cuda Capability = 3.0
Cuda architecture = sm_30
        Time 0 = 1903ms
        Time 1 = 1885ms
        Time 2 = 1885ms
        Time 3 = 1885ms
        Time 4 = 1885ms
        Time 5 = 1885ms
        Time 6 = 1885ms
        Time 7 = 1885ms
        Time 8 = 1885ms
        Time 9 = 1885ms
Press any key to continue . . .

 

64 bit:

64 bit
Device Cuda Capability = 3.0
Cuda architecture = sm_11
        Time 0 = 1508ms
        Time 1 = 1465ms
        Time 2 = 1465ms
        Time 3 = 1465ms
        Time 4 = 1465ms
        Time 5 = 1465ms
        Time 6 = 1465ms
        Time 7 = 1465ms
        Time 8 = 1465ms
        Time 9 = 1465ms

Device Cuda Capability = 3.0
Cuda architecture = sm_30
        Time 0 = 1886ms
        Time 1 = 1886ms
        Time 2 = 1886ms
        Time 3 = 1886ms
        Time 4 = 1886ms
        Time 5 = 1886ms
        Time 6 = 1886ms
        Time 7 = 1886ms
        Time 8 = 1886ms
        Time 9 = 1886ms
Press any key to continue . . .

a couple of points / questions here:

1) I don't know why it now works with 1.1, but didn't earlier. What I have done is rebuild and run the project on an old machine of mine (with an older gpu), and then rebuild and run the project again on my new machine (where all these test results come from). Why that made it work, I have no idea. Nevertheless it does now work, and I can't seem to make it fail anymore.

2) 1.1 is significantly faster than 3.0.

3) 32-bit 1.1 is faster than 64-bit 1.1, but for 3.0 the times are the same.

 

I'm hoping someone can shed some light on this; there are quite a few strange things here.

Dec 17, 2012 at 11:41 PM
Edited Dec 18, 2012 at 12:08 AM

Very interesting.

I'm not surprised by the difference between 32-bit and 64-bit, but I certainly am regarding 1.1 vs 3.0!

It probably all comes down to memory access. Your inner loop is extremely memory intensive. All those vals[i] reads, over and over again, will be handled differently depending on your architecture. In both cases the L1 cache will be used, of course, and the algorithms the GPU uses to manage L1 access may vary across architectures. It might even be that the compiler for 1.1 managed to hoist them out onto the outer loop, while the 3.0 compiler didn't. It would be interesting to instead test a very instruction-intensive inner loop. After all, your example would never be used in a real-life scenario; memory access would instead be staged in blocks onto shared memory and further cached in registers. So, right now you are profiling how different architectures handle inefficient code, which isn't a fair comparison.

Never mind the above considerations. I noticed that you're only launching kernels with 8 threads per block. You're throwing away 3/4 of a full warp, which is quite bad. Use full warps of 32 threads at the very least, but better still, use the CUDA occupancy calculator. Could you try your test again with the following code? Try different multiples of 32 for the totTh constant.

  [Cudafy]
            public static void Kernel(GThread thread, float[] vals, float[] rets)
            {
                int ThreadID, TotalThreads;

                ThreadID = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
                TotalThreads = thread.blockDim.x * thread.gridDim.x;
                for (int i = ThreadID; i < vals.Length; i += TotalThreads)
                {
                    float v = (float)i;
                    float v2 = vals[i]; 
                    for (long s = 0; s < 1000; s++)
                    {
                        v = v2 * v + (1 - v2) * (v + 1);
                        if (v > 10) v = v / 10;
                    }
                    rets[i] = v;
                }
            }

            public const int totTh = 64;
            public const int totblocks = 2048 / totTh;

            public long timeKernel(GPGPU gpu)
            {
                Stopwatch sw = new Stopwatch();

                float[] vals = new float[totblocks * totTh];
                float[] rets = new float[totblocks * totTh];
                for (int i = 0; i < vals.Length; i++) vals[i] = 1 - i / vals.Length / 2;

                float[] dev_vals, dev_rets;
                dev_vals = gpu.CopyToDevice(vals);
                dev_rets = gpu.Allocate<float>(rets);

                sw.Start();

                gpu.Launch(totblocks, totTh, Kernel, dev_vals, dev_rets);
                gpu.Synchronize();

                sw.Stop();

                return sw.ElapsedMilliseconds;
            }

 

Dec 18, 2012 at 12:46 AM

As an aside, why is 64-bit slower?

I've tried your code, although I had to change totblocks to be 1024 times bigger to get meaningful times. So the line looks like this:

		public const int totblocks = 2048 * 1024 / totTh;

With totTh=64, I get 25ms vs 28ms.

With totTh=256, I get 22ms vs 21ms, so basically the same, with 3.0 actually recording faster than 1.1.

So problem solved. 

Regarding your point on memory access - how should the kernel's memory loop be written to make it less memory intensive?

Coordinator
Dec 18, 2012 at 9:08 AM

Probably irrelevant in terms of the timings, but it may be worth considering: it looks like you have a memory leak through the use of CopyToDevice and Allocate on each call without freeing.

Dec 18, 2012 at 11:25 AM

you mean because I don't free the memory? Good point. 
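Presumably something like this at the end of timeKernel would plug it (Free assumed as the per-buffer counterpart of the FreeAll call used earlier):

    // Release the device buffers created by CopyToDevice and Allocate on this call
    gpu.Free(dev_vals);
    gpu.Free(dev_rets);
    // or, more bluntly: gpu.FreeAll();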

Dec 18, 2012 at 12:18 PM

In your case, and assuming you assumed (see what I did there?) the compiler would optimize away your redundant use of vals[i], you are already doing optimally coalesced memory access, so please disregard my remarks. In case you didn't do that deliberately, then you absolutely must read NVIDIA's performance guidelines. Writing code for a GPU is very different from writing it for a CPU, and CUDAfy can be misleading in how easy and seamless it makes it all look.
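For reference, the "staged in blocks onto shared memory" pattern mentioned earlier looks roughly like this in CUDAfy (a generic sketch only; this particular kernel doesn't need it, since each thread only ever reads its own vals[i]):

    [Cudafy]
    public static void StagedKernel(GThread thread, float[] vals, float[] rets)
    {
        // One shared-memory tile per block; 64 must match the threads-per-block launch size,
        // and the grid is assumed to cover vals.Length exactly.
        float[] tile = thread.AllocateShared<float>("tile", 64);

        int i = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;

        tile[thread.threadIdx.x] = vals[i]; // one coalesced global read per thread
        thread.SyncThreads();               // make the tile visible to the whole block

        float v = (float)i;
        float v2 = tile[thread.threadIdx.x]; // then cached again in a register
        for (long s = 0; s < 1000; s++)
        {
            v = v2 * v + (1 - v2) * (v + 1);
            if (v > 10) v = v / 10;
        }
        rets[i] = v;
    }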

"I had to change totblocks to be 1024 times bigger to get meaningful times."

Yeah, sorry about that, I wasn't able to test the code I sent you.

As for x64 vs x86, well for one thing pointer arithmetic is a bit slower in x64, since the pointers are twice as large. There are other issues, of course, but that should suffice for now.

cheers