Cudafy Emulator result is different from Cudafy GPU result

Apr 28, 2013 at 1:57 AM
Hi, I am having the same issue. I am writing a Gaussian image filter; my code gives correct results on the emulator, but on the actual CUDA GPU the result is incorrect. This is my code:

Download Link to the cs file:
TEXT

I am testing this filter on small images because the image indexing on the GPU, using blocks and threads together, is not finished yet. For now I launch one block per pixel, so the grid dimensions equal the image size and the largest image I can test is 300x200 = 60000 blocks. Please try it on small images only.

I am really stuck with it, so thanks for any reply :)
Apr 29, 2013 at 12:43 PM
Edited Apr 29, 2013 at 12:45 PM
I'm not going to look over your source code if it's more than, say, 10 lines. If you can't be bothered to spend the time and effort to trim down your code to make it readable and explain carefully what seems to be wrong/missing/broken/different, why should I spend my time going through MediaFire to download and parse some huge block of code for your benefit?
I speak only for myself, of course. Maybe you'll find others in this forum who think otherwise. BTW, I'm in no way affiliated with CUDAfy.

P.S. sorry if I sound too harsh, I've just been having a bad day.
May 3, 2013 at 12:29 AM
Edited May 3, 2013 at 12:50 AM
The threads behave differently on the GPU than on the emulator, and I don't know why.
To be clearer, here are the original image, the result from the emulator (correct), and the result from CUDA (incorrect), respectively:

Image

Image

Image

Right-Click on each image to preview it or save it.
May 3, 2013 at 12:30 AM
You really do sound too harsh, but no problem.
BTW, this is my first time taking part in a forum discussion, and I am trying to be clear.
The code behind the MediaFire link is ready to run. I packaged it that way to save time, as you said, but sorry that I was not clear about that.
If you just want to look at the kernel code, no problem. It certainly cannot fit in 10 lines, but I will try to simplify it as much as I can.

Here is the code, simplified as much as I can, to show what it does in general:

        public static void Kernel(GThread tid, CudaPixel[,] buffer, double[,] mask, CudaPixel[,] filledMask, CudaPixel[,] result)
        {
            //int thread = tid.threadIdx.x + tid.blockIdx.x * tid.blockDim.x;
            int xLoc = tid.blockIdx.x;
            int yLoc = tid.blockIdx.y;

            if (xLoc < buffer.GetLength(0) && yLoc < buffer.GetLength(1))
            {
                CudaPixel sum = new CudaPixel();
                ///////////////fill mask\\\\\\\\\\\\\\\\\\\\\
                //get Dimensions of image1 and mask respectively
                int imgW = buffer.GetLength(0);
                int imgH = buffer.GetLength(1);
                int maskW = mask.GetLength(0);
                int maskH = mask.GetLength(1);
                //indexes used in loops
                int maskW_Ind = 0;//mask width index
                int maskH_Ind = 0;//mask height index
                int maskW_Ind_Loc = 0;//mask width index location on image1
                int maskH_Ind_Loc = 0;//mask height index location on image1
                for (int i = -maskH / 2; i <= maskH / 2; i++)
                {
                    for (int j = -maskW / 2; j <= maskW / 2; j++)
                    {
                        maskW_Ind = j + (maskW / 2);
                        maskH_Ind = i + (maskH / 2);

                        maskW_Ind_Loc = xLoc + j;
                        maskH_Ind_Loc = yLoc + i;

                        ////First Check Corners only\\\\
                        ////Then Check Edges only\\\\
                    }
                }
                ///////end of fill mask\\\\\\
                
                ///////get mask sum\\\\\\\\
                for (int i = 0; i < filledMask.GetLength(1); i++)
                {
                    for (int j = 0; j < filledMask.GetLength(0); j++)
                    {
                        //summing the mask
                    }
                }
                ////////end mask sum\\\\\\\\

                ///////CUT-OFF post process\\\\\\\
                //if (conditions) to limit the result value
                //////////////////\\\\\\\\\\\\\\\\\

                ////////assign the new pixel value\\\\\\\\\
                result[xLoc, yLoc].R = sum.R;
                result[xLoc, yLoc].G = sum.G;
                result[xLoc, yLoc].B = sum.B;
            }
        }
If you can help me, I would appreciate it a lot.
Thank you for your time, pedritolo1.
Coordinator
May 3, 2013 at 8:36 AM
Can you post the generated cu file? Does your code behave the same under both OpenCL and CUDA? You will need to use the latest V1.21 version for this.
Nick
May 3, 2013 at 11:34 AM
It surprises me that you don't use the tid.threadIdx anywhere. Why only use the block index?
Could you show us the code snippet containing the kernel launch, with the respective assignment of the blockdim and griddim arguments?
cheers, and sorry for snapping at you earlier.
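
For reference, here is a minimal sketch of the usual way to combine the block and thread indices, in the same form as the line that is commented out in the kernel above. It is plain C# that runs on the CPU and makes no CUDAfy calls; the class and method names (GridIndexing, BlocksFor, GlobalIndex, CoverageMap) are made up for illustration, and the 16x16 block size in the note below is only an assumption, not something taken from the posted code.

```csharp
using System;

// CPU-only sketch of the usual 2D CUDA index arithmetic. Nothing here calls CUDAfy;
// all names are hypothetical and exist only for this illustration.
public static class GridIndexing
{
    // Number of blocks needed to cover `size` elements when each block
    // holds `blockSize` threads along that axis (ceiling division).
    public static int BlocksFor(int size, int blockSize)
    {
        return (size + blockSize - 1) / blockSize;
    }

    // Global coordinate of a thread along one axis. In a kernel this would be
    // tid.blockIdx.x * tid.blockDim.x + tid.threadIdx.x (and likewise for y).
    public static int GlobalIndex(int blockIdx, int blockDim, int threadIdx)
    {
        return blockIdx * blockDim + threadIdx;
    }

    // Walks every (block, thread) pair of the grid on the CPU and counts how
    // often each pixel is visited. With the bounds check in place, every pixel
    // should be visited exactly once.
    public static int[,] CoverageMap(int width, int height, int blockDimX, int blockDimY)
    {
        int gridX = BlocksFor(width, blockDimX);
        int gridY = BlocksFor(height, blockDimY);
        var visits = new int[width, height];

        for (int by = 0; by < gridY; by++)
        for (int bx = 0; bx < gridX; bx++)
        for (int ty = 0; ty < blockDimY; ty++)
        for (int tx = 0; tx < blockDimX; tx++)
        {
            int x = GlobalIndex(bx, blockDimX, tx);
            int y = GlobalIndex(by, blockDimY, ty);

            // Threads in the last row or column of blocks can fall outside
            // the image, so they must be skipped.
            if (x < width && y < height)
            {
                visits[x, y]++;
            }
        }
        return visits;
    }
}
```

With an assumed 16x16 block, a 300x200 image needs BlocksFor(300, 16) x BlocksFor(200, 16) = 19 x 13 = 247 blocks instead of 60000, and the bounds check keeps the extra threads in the edge blocks from writing outside the image.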