Sunday, May 3, 2015

Mobile OpenCL local memory bank conflicts: why is a kernel that uses local memory slower than one that uses global memory?

I'm developing a face detection app on the Android platform using OpenCL. The face detection algorithm is based on the Viola-Jones algorithm. I am trying to write the kernel code for the cascade classification step. I placed the classifier data for stage 1 of the cascade in local memory (__local), because that data is used by all work-items.

However, the profiled kernel time without local memory (using global memory) is faster than the time with local memory.


__kernel void CASCADE(__read_only image2d_t input_image, __write_only image2d_t output_image, __constant float* classifierMem){

   // Stage-1 classifier values, shared by every work-item in the work-group.
   __local float localStage1[5];

   // Flattened index of this work-item within its 2D work-group.
   int cascadeLocalSize = get_local_size(0);
   int localIdx = get_local_id(1)*cascadeLocalSize + get_local_id(0);

   // The first five work-items each copy one value into local memory.
   // Note: idxNumValStageArray is not defined in this snippet.
   if(localIdx<5)
   {
      int stage1Idx = localIdx + idxNumValStageArray[0]+4;
      localStage1[localIdx] = classifierMem[stage1Idx];
   }

   // Wait until the copy is visible to the whole work-group.
   barrier(CLK_LOCAL_MEM_FENCE);

   int gx = get_global_id(0);
   int gy = get_global_id(1);

   int featureIndex = 0;

   float featureThres = localStage1[featureIndex++];
   float succVal = localStage1[featureIndex++];
   float failVal = localStage1[featureIndex++];
   float regionValue = localStage1[featureIndex++];
   float stageThres = localStage1[featureIndex];

   // Initialized so that += has a defined starting value.
   float featureValue = 0.0f;
   featureValue += (regionValue < featureThres)?failVal:succVal;
   if(featureValue < stageThres)
      write_imagef(output_image, (int2)(gx, gy), (float4)(0.1f));

}
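
As a side check on the cooperative load in the kernel above, the flattened index and the `localIdx<5` test can be reproduced on the host. The following is only a minimal sketch in plain C (not OpenCL C); `flat_local_index` and `count_loaders` are hypothetical helper names that do not appear in the original code, and the work-group sizes are assumed for illustration.

```c
#include <stddef.h>

/* Plain-C stand-in for
   get_local_id(1) * get_local_size(0) + get_local_id(0). */
static size_t flat_local_index(size_t lid_x, size_t lid_y, size_t local_size_x)
{
    return lid_y * local_size_x + lid_x;
}

/* Counts the work-items in one work-group that pass the
   `localIdx < n_values` test, i.e. the ones that copy a value
   into local memory before the barrier. */
static size_t count_loaders(size_t local_size_x, size_t local_size_y,
                            size_t n_values)
{
    size_t loaders = 0;
    for (size_t y = 0; y < local_size_y; ++y)
        for (size_t x = 0; x < local_size_x; ++x)
            if (flat_local_index(x, y, local_size_x) < n_values)
                ++loaders;
    return loaders;
}
```

For an assumed 8x8 work-group, `count_loaders(8, 8, 5)` is 5: five work-items perform the copy while the remaining 59 go straight to the barrier and wait. A work-group with fewer than five work-items, such as 2x2, would leave part of `localStage1` unwritten.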

Version without local memory (original version):

__kernel void CASCADE(__read_only image2d_t input_image, __write_only image2d_t output_image, __constant float* classifierMem){

   int gx = get_global_id(0);
   int gy = get_global_id(1);

   int featureIndex = 0;

   // Every work-item reads the classifier values directly from classifierMem.
   float featureThres = classifierMem[featureIndex++];
   float succVal = classifierMem[featureIndex++];
   float failVal = classifierMem[featureIndex++];
   float regionValue = classifierMem[featureIndex++];
   float stageThres = classifierMem[featureIndex];

   // Initialized so that += has a defined starting value.
   float featureValue = 0.0f;
   featureValue += (regionValue < featureThres)?failVal:succVal;
   if(featureValue < stageThres)
      write_imagef(output_image, (int2)(gx, gy), (float4)(0.1f));

}

Why is the version that uses local memory slower?
