
Rotating bitmaps. In code

rotation image-manipulation image-processing delphi

Marco van de Voort · Tech Community · 16 years ago

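Is there a faster way to rotate a large bitmap by 90 or 270 degrees than simply doing a nested loop with inverted coordinates?

The bitmaps are 8bpp, typically 2048x2400x8bpp.

Currently I do this by simply copying with the arguments inverted, roughly (pseudocode):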

for x = 0 to 2048-1
  for y = 0 to 2048-1
    dest[x][y]=src[y][x];
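
Taken literally, the pseudocode above is a plain transpose; a 90-degree rotation also mirrors one axis. A minimal C sketch of the naive clockwise rotation, assuming tightly packed 8bpp buffers (the function name and parameters are illustrative and not part of the original post):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Naive 90-degree clockwise rotation of an 8bpp image.
 * src is w pixels wide and h pixels high; dst is h wide and w high.
 * Assumption: both buffers are tightly packed (row pitch == width). */
void rotate90cw_naive(const uint8_t *src, uint8_t *dst, size_t w, size_t h)
{
    assert(src != NULL && dst != NULL && src != dst);
    for (size_t y = 0; y < h; y++) {
        for (size_t x = 0; x < w; x++) {
            /* Source pixel (x, y) lands in row x, column h-1-y of dst. */
            dst[x * h + (h - 1 - y)] = src[y * w + x];
        }
    }
}
```

The inner loop reads the source row sequentially but writes destination bytes that are a full row apart, which is the access pattern the rest of this thread tries to improve.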

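GDI is quite slow with large images, and GPU load/store times for textures (GF7 cards) are of the same magnitude as the current CPU time.

The target is Delphi, but this is more of an algorithmic question. SSE(2) vectorization is no problem; this is a big enough problem for me to code it in assembler.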

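Follow-up to Nils's answer

Image: 2048x2700 -> 2700x2048
Compiler: Turbo Explorer 2006 with optimization on.
Windows: power scheme set to "Always On".
Machine: Core2 6600 (2.4 GHz)

Time with stepsize 16: 10 ms

Time with stepsize 32+: 9 ms

The speed-up is worth it, thanks. Maybe during the summer months I'll torture myself with an SSE(2) version. However, I have already thought about how to tackle that, and I think I'll run out of SSE2 registers for a straightforward implementation: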

for n:=0 to 7 do
  begin
    load r0, <source+n*rowsize> 
    shift byte from r0 into r1
    shift byte from r0 into r2
    ..
    shift byte from r0 into r8
  end; 
store r1, <target>
store r2, <target+1*rowsize>
..
store r8, <target+7*rowsize>

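So 8x8 needs 9 registers, but 32-bit SSE only has 8. Anyway, that is something for the summer months :-)

Note that the pointer arithmetic is something I do out of instinct, but it may actually make sense: if your dimensions are not hardcoded, the compiler can't turn the mul into a shift. While muls and such are cheap nowadays, they also generate more register pressure.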

const stepsize = 32;
procedure rotatealign(Source: tbw8image; Target:tbw8image);

var stepsx,stepsy,restx,resty : Integer;
   RowPitchSource, RowPitchTarget : Integer;
   pSource, pTarget,ps1,ps2 : pchar;
   x,y,i,j: integer;
   rpstep : integer;
begin
  RowPitchSource := source.RowPitch;          // bytes to jump to next line. Can be negative (includes alignment)
  RowPitchTarget := target.RowPitch;
  rpstep:=RowPitchTarget*stepsize;
  stepsx:=source.ImageWidth div stepsize;
  stepsy:=source.ImageHeight div stepsize;
  // check if mod 16=0 here for both dimensions, if so -> SSE2.
  for y := 0 to stepsy - 1 do
    begin
      psource:=source.GetImagePointer(0,y*stepsize);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(target.imagewidth-(y+1)*stepsize,0);
      for x := 0 to stepsx - 1 do
        begin
          for i := 0 to stepsize - 1 do
            begin
              ps1:=@psource[rowpitchsource*i];   // ( 0,i)
              ps2:=@ptarget[stepsize-1-i];       //  (maxx-i,0);
              for j := 0 to stepsize - 1 do
               begin
                 ps2[0]:=ps1[j];
                 inc(ps2,RowPitchTarget);
               end;
            end;
          inc(psource,stepsize);
          inc(ptarget,rpstep);
        end;
    end;
  // 3 more areas to do, with dimensions
  // - stepsy*stepsize * restx        // right most column of restx width
  // - stepsx*stepsize * resty        // bottom row with resty height
  // - restx*resty                    // bottom-right rectangle.
  restx:=source.ImageWidth mod stepsize;   // typically zero because width is 
                                          // typically 1024 or 2048
  resty:=source.Imageheight mod stepsize;
  if restx>0 then
    begin
      // one loop less, since we know this fits in one line of  "blocks"
      psource:=source.GetImagePointer(source.ImageWidth-restx,0);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(Target.imagewidth-stepsize,Target.imageheight-restx);
      for y := 0 to stepsy - 1 do
        begin
          for i := 0 to stepsize - 1 do
            begin
              ps1:=@psource[rowpitchsource*i];   // ( 0,i)
              ps2:=@ptarget[stepsize-1-i];       //  (maxx-i,0);
              for j := 0 to restx - 1 do
               begin
                 ps2[0]:=ps1[j];
                 inc(ps2,RowPitchTarget);
               end;
            end;
         inc(psource,stepsize*RowPitchSource);
         dec(ptarget,stepsize);
       end;
    end;
  if resty>0 then
    begin
      // one loop less, since we know this fits in one line of  "blocks"
      psource:=source.GetImagePointer(0,source.ImageHeight-resty);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(0,0);
      for x := 0 to stepsx - 1 do
        begin
          for i := 0 to resty- 1 do
            begin
              ps1:=@psource[rowpitchsource*i];   // ( 0,i)
              ps2:=@ptarget[resty-1-i];       //  (maxx-i,0);
              for j := 0 to stepsize - 1 do
               begin
                 ps2[0]:=ps1[j];
                 inc(ps2,RowPitchTarget);
               end;
            end;
         inc(psource,stepsize);
         inc(ptarget,rpstep);
       end;
    end;
 if (resty>0) and (restx>0) then
    begin
      // another loop less, since only one block
      psource:=source.GetImagePointer(source.ImageWidth-restx,source.ImageHeight-resty);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(0,target.ImageHeight-restx);
      for i := 0 to resty- 1 do
        begin
          ps1:=@psource[rowpitchsource*i];   // ( 0,i)
          ps2:=@ptarget[resty-1-i];       //  (maxx-i,0);
          for j := 0 to restx - 1 do
            begin
              ps2[0]:=ps1[j];
              inc(ps2,RowPitchTarget);
            end;
       end;
    end;
end;

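Update 2: generics

Update 3: generics now work in XE10

Update 4

Code for 8x8 cubes of 8bpp images only, and a related SO question about shuffle bottlenecks, where Peter Cordes generously helped me out. This code still misses an opportunity and still needs another loop-tiling level to aggregate multiple 8x8 block iterations into pseudo-larger ones such as 64x64. Right now it works on whole lines again, which is wasteful.

4 answers | last activity 3 years ago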

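kot-da-vinci 8 years ago

Yes, there are faster ways to do this.

Your simple loop spends most of its time in cache misses. This happens because you touch a lot of data at very different places in a tight loop. Even worse: your memory locations are exactly a power of two apart. That is the size where the cache performs worst.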
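
To see why a power-of-two pitch hurts, it helps to look at how a set-associative cache picks a set. The sketch below assumes a 32 KiB, 8-way data cache with 64-byte lines, which gives 64 sets; that geometry and the function names are assumptions chosen for illustration, not something the answer states.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache geometry (illustrative only): 32 KiB, 8-way, 64-byte lines. */
enum { LINE_SIZE = 64, NUM_SETS = 32 * 1024 / (8 * LINE_SIZE) }; /* 64 sets */

/* Index of the cache set that a byte offset maps to. */
size_t cache_set(size_t offset)
{
    return (offset / LINE_SIZE) % NUM_SETS;
}

/* Count how many distinct sets are touched when walking down one column,
 * i.e. when reading `rows` bytes that are `pitch` bytes apart. */
size_t distinct_sets(size_t pitch, size_t rows)
{
    unsigned char seen[NUM_SETS] = {0};
    size_t count = 0;
    assert(pitch > 0);
    for (size_t r = 0; r < rows; r++) {
        size_t s = cache_set(r * pitch);
        if (!seen[s]) {
            seen[s] = 1;
            count++;
        }
    }
    return count;
}
```

Under these assumptions, `distinct_sets(2048, 2048)` is 2, so only 2 x 8 = 16 lines of a column can stay cached at once, while a pitch of 2112 bytes (`distinct_sets(2112, 2048)`) spreads the same walk over all 64 sets.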

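You can improve this rotation algorithm if you improve the locality of your memory accesses.

For example, something like this (not checked, and sorry for the C code; my Delphi skills aren't up to date):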

 // this is the outer-loop that breaks your image rotation
 // into chunks of 8x8 pixels each:
 for (int block_x = 0; block_x < 2048; block_x+=8)
 {
    for (int block_y = 0; block_y < 2048; block_y+=8)
    { 
       // this is the inner-loop that processes a block
       // of 8x8 pixels.
       for (int x= 0; x<8; x++)
         for (int y=0; y<8; y++)
            dest[x+block_x][y+block_y] = src[y+block_y][x+block_x];
    }
 }

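There are other ways as well. You could process the data in Hilbert order or Morton order. In theory that would be a bit faster still, but the code gets much more complex.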
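
As a rough illustration of what a Morton-order traversal looks like, the sketch below transposes a square image by decoding each linear index into interleaved x and y bits. It assumes a tightly packed n x n image with n a power of two; the function names are hypothetical, and the per-pixel decode is there for clarity rather than speed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Gather the even-numbered bits of v into the low 16 bits
 * (the inverse of bit interleaving). */
static uint32_t compact_bits(uint32_t v)
{
    v &= 0x55555555u;
    v = (v | (v >> 1)) & 0x33333333u;
    v = (v | (v >> 2)) & 0x0F0F0F0Fu;
    v = (v | (v >> 4)) & 0x00FF00FFu;
    v = (v | (v >> 8)) & 0x0000FFFFu;
    return v;
}

/* Transpose an n x n 8bpp image, visiting pixels in Morton (Z) order.
 * Assumption: n is a power of two and at most 32768, pitch == n. */
void transpose_morton(const uint8_t *src, uint8_t *dst, size_t n)
{
    assert(n > 0 && (n & (n - 1)) == 0 && n <= 32768);
    for (uint32_t i = 0; i < (uint32_t)(n * n); i++) {
        uint32_t x = compact_bits(i);      /* even bits of i form x */
        uint32_t y = compact_bits(i >> 1); /* odd bits of i form y  */
        dst[(size_t)x * n + y] = src[(size_t)y * n + x];
    }
}
```

Consecutive indices stay inside the same 2x2, 4x4, 8x8, ... square before moving on, so both the reads and the writes remain local at every scale without a fixed block size being chosen.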

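By the way, since you mentioned that SSE is an option for you: note that you can rotate an 8x8 block of bytes within SSE registers. It is a bit tricky to get working, but looking at SSE matrix-transpose code should get you started, as it is the same thing.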
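
One possible shape of such an in-register transpose, written with SSE2 intrinsics rather than assembler. This is a sketch under assumptions, not the code the answer has in mind: the function name and the pitch parameters are made up for the example, and it requires an x86 target with SSE2. It interleaves rows with unpack instructions in three rounds (bytes, 16-bit pairs, 32-bit groups) instead of moving bytes one at a time.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Transpose one 8x8 block of bytes. src_pitch and dst_pitch are the
 * distances in bytes between consecutive rows of each image. */
void transpose8x8_sse2(const uint8_t *src, ptrdiff_t src_pitch,
                       uint8_t *dst, ptrdiff_t dst_pitch)
{
    __m128i r[8];
    assert(src != NULL && dst != NULL);

    /* Load 8 rows of 8 bytes into the low halves of 8 registers. */
    for (int i = 0; i < 8; i++)
        r[i] = _mm_loadl_epi64((const __m128i *)(src + i * src_pitch));

    /* Round 1: interleave bytes of neighbouring rows. */
    __m128i t0 = _mm_unpacklo_epi8(r[0], r[1]);
    __m128i t1 = _mm_unpacklo_epi8(r[2], r[3]);
    __m128i t2 = _mm_unpacklo_epi8(r[4], r[5]);
    __m128i t3 = _mm_unpacklo_epi8(r[6], r[7]);

    /* Round 2: interleave 16-bit pairs, giving 4 rows per column. */
    __m128i u0 = _mm_unpacklo_epi16(t0, t1); /* columns 0..3, rows 0..3 */
    __m128i u1 = _mm_unpackhi_epi16(t0, t1); /* columns 4..7, rows 0..3 */
    __m128i u2 = _mm_unpacklo_epi16(t2, t3); /* columns 0..3, rows 4..7 */
    __m128i u3 = _mm_unpackhi_epi16(t2, t3); /* columns 4..7, rows 4..7 */

    /* Round 3: interleave 32-bit groups, giving two full columns each. */
    __m128i v[4];
    v[0] = _mm_unpacklo_epi32(u0, u2); /* columns 0 and 1 */
    v[1] = _mm_unpackhi_epi32(u0, u2); /* columns 2 and 3 */
    v[2] = _mm_unpacklo_epi32(u1, u3); /* columns 4 and 5 */
    v[3] = _mm_unpackhi_epi32(u1, u3); /* columns 6 and 7 */

    /* Each source column becomes one destination row. */
    for (int i = 0; i < 4; i++) {
        _mm_storel_epi64((__m128i *)(dst + (2 * i) * dst_pitch), v[i]);
        _mm_storel_epi64((__m128i *)(dst + (2 * i + 1) * dst_pitch),
                         _mm_srli_si128(v[i], 8));
    }
}
```

This produces a transpose; loading the source rows in reverse order (row 7 first) should turn it into a clockwise rotation of the block.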

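With a block size of 8x8 pixels the code runs about 5 times faster on my machine. With a block size of 16x16 it runs 10 times faster.

Experimenting with different block sizes seems like a good idea. Here is a (very simple) test program if you want to try it yourself: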

#include <stdio.h>
#include <windows.h>

char temp1[2048*2048];
char temp2[2048*2048];

void rotate1 (void)
{
  int x,y;
  for (y=0; y<2048; y++)
  for (x=0; x<2048; x++)
    temp2[2048*y+x] = temp1[2048*x+y];
}

void rotate2 (void)
{
  int x,y;
  int bx, by;

  for (by=0; by<2048; by+=8)
  for (bx=0; bx<2048; bx+=8)
  for (y=0; y<8; y++)
  for (x=0; x<8; x++)
    temp2[2048*(y+by)+x+bx] = temp1[2048*(x+bx)+y+by];
}

void rotate3 (void)
{
  int x,y;
  int bx, by;

  for (by=0; by<2048; by+=16)
  for (bx=0; bx<2048; bx+=16)
  for (y=0; y<16; y++)
  for (x=0; x<16; x++)
    temp2[2048*(y+by)+x+bx] = temp1[2048*(x+bx)+y+by];
}


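int main (int argc, char **args)
{
  int i;
  DWORD t1;   // GetTickCount returns a DWORD: milliseconds since system start

  // time 20 runs of each variant and print the elapsed milliseconds
  t1 = GetTickCount();
  for (i=0; i<20; i++) rotate1();
  printf ("%lu\n", (unsigned long)(GetTickCount()-t1));

  t1 = GetTickCount();
  for (i=0; i<20; i++) rotate2();
  printf ("%lu\n", (unsigned long)(GetTickCount()-t1));

  t1 = GetTickCount();
  for (i=0; i<20; i++) rotate3();
  printf ("%lu\n", (unsigned long)(GetTickCount()-t1));

  return 0;
}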
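
The test program above hard-codes both the 2048x2048 size and the block size. A sketch of the same tiling idea with run-time dimensions, row pitches, and block size follows; the function name and parameter list are assumptions made for this example. Edge tiles are simply clipped, so the width and height do not have to be multiples of the block size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate an 8bpp image 90 degrees clockwise, one tile at a time.
 * src is w x h with row pitch src_pitch; dst is h x w with row pitch
 * dst_pitch. Tiles on the right and bottom edges are clipped. */
void rotate90cw_tiled(const uint8_t *src, size_t src_pitch,
                      uint8_t *dst, size_t dst_pitch,
                      size_t w, size_t h, size_t block)
{
    assert(block > 0 && src_pitch >= w && dst_pitch >= h);
    for (size_t by = 0; by < h; by += block) {
        size_t y_end = (by + block < h) ? by + block : h;
        for (size_t bx = 0; bx < w; bx += block) {
            size_t x_end = (bx + block < w) ? bx + block : w;
            /* Copy one tile: source row y becomes destination column h-1-y. */
            for (size_t y = by; y < y_end; y++) {
                const uint8_t *s = src + y * src_pitch;
                for (size_t x = bx; x < x_end; x++)
                    dst[x * dst_pitch + (h - 1 - y)] = s[x];
            }
        }
    }
}
```

Calling it with `block` set to 8, 16, 32 and so on on the same image is one way to run the block-size experiment without recompiling.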

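lothar 16 years ago

If you can use C++, you may want to have a look at Eigen.

It is a C++ template library that uses the SSE (2 and later) and AltiVec instruction sets, with graceful fallback to non-vectorized code.

Explicit vectorization is performed for the SSE (2 and later) and AltiVec instruction sets, with graceful fallback to non-vectorized code. Expression templates allow these optimizations to be performed globally for whole expressions.

For large matrices, special attention is paid to cache-friendliness.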

Pete Kirkham 16 years ago

You can

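plinth 16 years ago

If the image isn't square, you can't do the operation in place. Even if you work with square images, the transform isn't conducive to in-place work.

If you want to make things a little faster, you can try to take advantage of the row strides, but I think the best you can do is read 4 bytes at a time from the source and then write them into four consecutive rows in the dest. That should cut some of your overhead, but I wouldn't expect more than a 5% improvement.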
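
A minimal sketch of that idea, assuming tightly packed buffers, a width that is a multiple of 4, and a little-endian CPU (so byte k of the loaded word is pixel x+k). The function name is hypothetical, and like the pseudocode in the question this is a transpose without the mirroring step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Transpose an 8bpp image, reading four source pixels with one 32-bit
 * load and writing them to four consecutive destination rows.
 * src is w x h (pitch w), dst is h x w (pitch h). */
void transpose_read4(const uint8_t *src, uint8_t *dst, size_t w, size_t h)
{
    assert(w % 4 == 0);
    for (size_t y = 0; y < h; y++) {
        for (size_t x = 0; x < w; x += 4) {
            uint32_t word;
            /* memcpy is the portable way to do an unaligned 4-byte read. */
            memcpy(&word, src + y * w + x, sizeof word);
            /* Little-endian assumption: bits 8k..8k+7 hold pixel x+k. */
            for (size_t k = 0; k < 4; k++)
                dst[(x + k) * h + y] = (uint8_t)(word >> (8 * k));
        }
    }
}
```

This only reduces the number of loads; the four writes still land in four different rows, which is consistent with the modest gain the answer expects.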