Enhancement: Allowing Non-Constant Dimensions for SharedAlloc

Nov 12, 2012 at 9:08 PM

Of particular use in tuning exercises is the ability to run multiple tests with varying values for threads/block and blocks/grid. Until now this has required editing the source code for each test; no more. By making the following change in the depths of CheckForAllocSharedi in CUDAOutputVisitor.cs:

#if OldCode
         if (!(arg is PrimitiveExpression))
             throw new CudafyLanguageException(CudafyLanguageException.csSHARED_MEMORY_MUST_BE_CONSTANT);
         object o = (arg as PrimitiveExpression).Value;
#else
         object o;
         if (arg is PrimitiveExpression) {
            o = (arg as PrimitiveExpression).Value;
         } else if (arg is MemberReferenceExpression) {
            // Resolve the static property getter and evaluate it at translation time
            var method = GetGetMethod((MemberReferenceExpression)arg);
            if (method == null)   // TO-DO Revise the message string: csSHARED_MEMORY_MUST_BE_CONSTANT
               throw new CudafyLanguageException(CudafyLanguageException.csSHARED_MEMORY_MUST_BE_CONSTANT);
            o = new PrimitiveExpression(method.Invoke(method.DeclaringType,null));
         } else {
            throw new CudafyLanguageException(CudafyLanguageException.csSHARED_MEMORY_MUST_BE_CONSTANT);
         }
#endif
         if (ctr < argLen - 1)

And by adding this routine to the class:

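      private System.Reflection.MethodInfo GetGetMethod(MemberReferenceExpression member) {
         // Find the Cecil property definition annotated on the member reference
         var method   =(   from a in member.Annotations
                        where a is Mono.Cecil.PropertyDefinition
                        select ((Mono.Cecil.PropertyDefinition)a).GetMethod
                     ).FirstOrDefault();
         if (method == null) return null;
         // Locate the matching runtime type among the loaded assemblies
         var type      =(   from ass in AppDomain.CurrentDomain.GetAssemblies()
                        where ass.GetType(method.DeclaringType.FullName) != null
                        select ass.GetType(method.DeclaringType.FullName)
                     ).FirstOrDefault();
         if (type == null) return null;
         // Return the getter of the public static property with the same name
         return      (   from p in type.GetProperties(System.Reflection.BindingFlags.Static
                                                   | System.Reflection.BindingFlags.Public)
                        where p.Name == member.MemberName
                        select p.GetGetMethod()
                     ).FirstOrDefault();
      }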

It is now possible to use public static getter properties to initialize the dimensions of an array in Shared Memory.
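The reflection step at the heart of this change can be tried in isolation. The following is only a minimal sketch, not the code in CUDAOutputVisitor.cs: it leaves out Mono.Cecil entirely, and the names TuningConfig, StaticGetterDemo and FindStaticGetter are hypothetical ones made up for the illustration. It looks up a public static property by type name and property name, then invokes the getter with no target object, which is roughly what happens when the translator evaluates the dimension.

```csharp
using System;
using System.Linq;
using System.Reflection;

// Hypothetical stand-in for a class such as TypeTest that exposes a tuning property.
public static class TuningConfig {
    public static int ThreadsPerBlock { get; set; }
}

public static class StaticGetterDemo {
    // Returns the getter of the public static property named propertyName on the
    // type named typeFullName, or null when no loaded assembly has such a property.
    public static MethodInfo FindStaticGetter(string typeFullName, string propertyName) {
        var type = AppDomain.CurrentDomain.GetAssemblies()
                            .Select(asm => asm.GetType(typeFullName))
                            .FirstOrDefault(t => t != null);
        if (type == null) return null;
        return type.GetProperties(BindingFlags.Static | BindingFlags.Public)
                   .Where(p => p.Name == propertyName)
                   .Select(p => p.GetGetMethod())
                   .FirstOrDefault();
    }

    public static void Main() {
        TuningConfig.ThreadsPerBlock = 128;
        var getter = FindStaticGetter(typeof(TuningConfig).FullName, "ThreadsPerBlock");
        // A static getter has no instance, so null is passed as the target.
        object value = getter.Invoke(null, null);
        Console.WriteLine(value);   // prints 128
    }
}
```

Note that the value is read once, at the moment of the lookup; nothing here tracks later changes to the property.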

Nov 13, 2012 at 9:05 AM

Can you post some code that uses this?  


Nov 13, 2012 at 12:19 PM

Of course. I have a big test case, let me see how easily it can be cut down to a more reasonable size.

Nov 13, 2012 at 12:46 PM
using System;
using System.Diagnostics;
using System.Linq;
using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;

namespace CudafyTuningTsp {
   public class TypeTest  {
		// Count permutations
		static TypeTest() {
			ThreadsPerBlock	= 256;
			BlocksPerGrid		= 256;
		}

      protected const int _cities				= 12;	// Set this to control sample size
      protected static long _permutations		= 1;
		public static int		ThreadsPerBlock	{ get; set; }
		public static int		BlocksPerGrid		{ get; set; }

		private static void SetPermutations() {
			_permutations = 1L;
			for (int i = 2; i <= _cities; i++) { _permutations *= i; }
		}

		private static void Main() {
         Console.WriteLine("Compiling ...");
         RunTest(GetThreadInfo(), GetAnswer());
         ThreadsPerBlock /= 2;
         RunTest(GetThreadInfo(), GetAnswer());
         ThreadsPerBlock /= 2;
         RunTest(GetThreadInfo(), GetAnswer());
         BlocksPerGrid /= 2;
         RunTest(GetThreadInfo(), GetAnswer());

         Console.WriteLine("Done ... Press Enter to shutdown.");
         try { Console.Read(); } catch (InvalidOperationException) { }
      }

		private static string GetThreadInfo() {
			var target = string.Format(" ( {0,3} threads_per * {1,3} blocks input: ",
				ThreadsPerBlock, BlocksPerGrid );
			return target;
		}

		static void RunTest(string threadInfo, AnswerStruct answer) {
			Console.WriteLine(
				string.Format("{0} {1,3} threads * {2,3} blocks returned.",
					threadInfo, answer.pathNo, answer.distance));
		}

      [Cudafy]
      public struct AnswerStruct {
         public float distance;
         public long pathNo;
      }

      internal static AnswerStruct GetAnswer() {
         using (var gpu         = CudafyHost.GetDevice()) {
            // Re-Cudafy on every call: the shared array size is fixed at translation time
            gpu.LoadModule(CudafyTranslator.Cudafy());

            var answer         = new AnswerStruct[BlocksPerGrid];
            var gpuAnswer      = gpu.Allocate(answer);

            gpu.Launch(BlocksPerGrid, ThreadsPerBlock,
               GpuFindPathDistance,   gpuAnswer);

            // Bring the per-block results back to the host before scanning them
            gpu.CopyFromDevice(gpuAnswer, answer);

            var bestDistance      = float.MaxValue;
            var bestPermutation   = 0L;
            for (var i = 0; i < BlocksPerGrid; i++) {
               if (answer[i].distance < bestDistance) {
                  bestDistance      = answer[i].distance;
                  bestPermutation   = answer[i].pathNo;
               }
            }

            return new AnswerStruct {
               distance = bestDistance,
               pathNo   = bestPermutation
            };
         }
      }

      [Cudafy]
      public static void GpuFindPathDistance(GThread thread, AnswerStruct[] answer) {
         // The shared array is sized from a public static property rather than a constant
         var answerLocal      = thread.AllocateShared<AnswerStruct> ("ansL",   ThreadsPerBlock);

         var bestDistance     = thread.gridDim.x;
         var bestPermutation  = thread.blockDim.x;

         var sum = 0;
         for (int i=0; i < thread.blockDim.x; i++) sum += i * thread.threadIdx.x;

         answerLocal[thread.threadIdx.x].distance   = bestDistance;
         answerLocal[thread.threadIdx.x].pathNo     = bestPermutation;

         if (thread.threadIdx.x == 0) {
            answer[thread.blockIdx.x]               = answerLocal[0];
         }
      }
   }
}

Nov 13, 2012 at 1:10 PM

That's pretty impressive. However, and it may be a big however, what would happen if you did not re-Cudafy between the successive alterations of ThreadsPerBlock? You'd actually have a fairly dangerous situation there. Officially, CUDA does support dynamic shared memory; perhaps we need to look there?
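To make the concern concrete, here is a small sketch of why the property value behaves as a translation-time snapshot. It does not call CUDAfy at all: TranslateSharedDecl is a hypothetical stand-in for the translator, and the exact text of the generated declaration is an assumption made purely for illustration.

```csharp
using System;

public static class SnapshotDemo {
    public static int ThreadsPerBlock { get; set; }

    // Hypothetical stand-in for the translator: it reads the property once and
    // bakes the current value into the generated declaration as a literal.
    public static string TranslateSharedDecl() {
        return string.Format("__shared__ AnswerStruct ansL[{0}];", ThreadsPerBlock);
    }

    public static void Main() {
        ThreadsPerBlock = 256;
        string kernelSource = TranslateSharedDecl();   // translate once
        ThreadsPerBlock /= 2;                          // host side now uses 128
        Console.WriteLine(kernelSource);               // still declares 256 elements
        Console.WriteLine(TranslateSharedDecl());      // a fresh translation declares 128
    }
}
```

Once the property changes, the already-translated kernel and the host disagree about the array size until the kernel is translated again.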

Nov 13, 2012 at 1:40 PM

I actually ran that test (by accident); the first iteration runs fine of course, and the second croaks over promptly. I did not chase down details, once I recognized what I had forgotten, but my casual impression was that CUDA seemed to have identified the mismatch in structure sizes between host and device, and reported an inability to continue.

Personally, I believe the benefits outweigh the risks, given reasonable documentation. Could a compiler warning be issued to prevent 'unaware' usage? Is dynamic shared memory available below compute capability 3.0? I tried using thread.blockDim.x in an AllocateShared call and, as I recall, CUDA complained after I got it past CUDAfy. (I have only a 2.1 GPU.)