代码之家 › 专栏 › 技术社区 › Bwebb

性能:Mod和赋值vs条件和赋值

micro-optimization x86 assembly c

2

Bwebb · 技术社区 · 6 年前

我在ISR中有一个计数器(它是由50us的外部IRQ触发的)。计数器递增并环绕最大值(240)。

我有以下代码:

if(condition){
  counter++;
  counter %= MAX_VAL;
  doStuff(table[counter]);
}

我正在考虑另一种实现方式:

if(condition){
  //counter++;//probably I would increment before the comparison in production code
  if(++counter >= MAX_VAL){
    counter=0;
  }
  doStuff(table[counter]);
}

我知道人们建议不要像这样尝试优化,但这让我很好奇。在x86上什么更快?最大值的什么值可以证明第二次实现的合理性?

大约每50us调用一次,因此减少指令集不是一个坏主意。if(++计数器>=MAX_VAL)将被预测为false,因此在绝大多数情况下,它将删除0的赋值。出于我的目的,我更喜欢%=实现的一致性。

1 回复 | 直到 6 年前

1

4

Peter Cordes 6 年前

正如@RossRidge所说,在现代x86上为中断提供服务(可能至少有100个时钟周期,而且 许多的 如果这是现代操作系统的一部分(设置了熔毁+幽灵缓解)的话,会有更多的问题。

MAX_VAL 是2的幂, counter %= MAX_VAL 非常好,尤其是 counter and movzx 字节到dword,在Intel CPU上可以没有延迟。当然,它仍然有吞吐量成本: Can x86's MOV really be "free"? Why can't I reproduce this at all? )

255-240 用无害的东西输入,或重复的东西?

是一个 计数器%=最大值 将高效地编译为几次乘法、移位和加法(同样,对于未签名的应用程序,效率更高。) Why does GCC use multiplication by a strange number in implementing integer division?

不过,检查一下环绕效果就更好了。无分支支票(使用 cmov )具有比使用乘法逆的余数更低的延迟,并且吞吐量的uops成本更低。

// simple version that works exactly like your question
// further optimizations assume that counter isn't used by other code in the function,
// e.g. making it a pointer or incrementing it for the next iteration
void isr_countup(int condition) {
    static unsigned int counter = 0;

    if(condition){
      ++counter;
      counter = (counter>=MAX_VAL) ? 0 : counter;  // gcc uses cmov
      //if(counter >= MAX_VAL) counter = 0;        // gcc branches
      doStuff(table[counter]);
    }
}

我编译了很多版本 on the Godbolt compiler explorer 最近的gcc和clang。

(有关x86 asm短块吞吐量和延迟的静态性能分析的更多信息,请参阅 What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? x86 tag wiki ,尤其是 Agner Fog's guides .)

-fPIE 以防你把它用在你的果仁里。如果你能用 -fno-pie mov edi, [table + 4*rcx] ,假设您的目标是位置相关代码中的静态地址适合32位符号扩展常量(例如,在Linux内核中为true,但我不确定它们是否使用或者在加载内核时使用重定位执行内核ASLR。)

# clang7.0 -O3 -march=haswell -fPIE.
#  gcc's output is the same (with different registers), but uses `mov edx, 0` before the cmov for no reason, because it's also before a cmp that sets flags
isr_countup:                            # @isr_countup
    test    edi, edi
    je      .LBB1_1                  # if condition is false

    mov     eax, dword ptr [rip + isr_countup.counter]
    add     eax, 1                   # counter++
    xor     ecx, ecx
    cmp     eax, 239                 # set flags based on (counter , MAX_VAL-1)
    cmovbe  ecx, eax                 # ecx = (counter <= MAX_VAL-1) ? 0 : counter
    mov     dword ptr [rip + isr_countup.counter], ecx   # store the old counter
    lea     rax, [rip + table]
    mov     edi, dword ptr [rax + 4*rcx]        # index the table

    jmp     doStuff@PLT             # TAILCALL
.LBB1_1:
    ret

从加载旧计数器值开始的8条指令块总计为8个UOP(在AMD或Intel Broadwell及更高版本上,其中 cmov公司 只有1个uop)。来自的关键路径延迟 柜台 table[++counter % MAX_VAL] 准备就绪是1(add)+1(cmp)+1(cmov)+负载使用延迟。i、 e.额外循环3次。这是1的延迟 mul 说明。或者在旧的英特尔上多加一个周期是2个uops。

相比之下,对于带有gcc的块,使用模的版本是14个uop,包括3个uop mul r32

其他优化

使用旧值 ,并为下一次准备一个值(使计算脱离关键路径)
使用指针而不是计数器。节省两条指令,代价是使用8个字节而不是变量的1或4个缓存占用空间( uint8_t counter movzx公司

加载,将该逻辑从关键路径依赖链中移除,以便无序执行。

void isr_pointer_inc_after(int condition) {
    static int *position = table;

    if(condition){
        int tmp = *position;
        position++;
        position = (position >= table + MAX_VAL) ? table : position;
        doStuff(tmp);
    }
}

所以编译器无论如何都需要寄存器中的表地址。

# gcc8.2 -O3 -march=haswell -fPIE
isr_pointer_inc_after(int):
    test    edi, edi
    je      .L29

    mov     rax, QWORD PTR isr_pointer_inc_after(int)::position[rip]
    lea     rdx, table[rip+960]        # table+MAX_VAL
    mov     edi, DWORD PTR [rax]       # 
    add     rax, 4
    cmp     rax, rdx
    lea     rdx, -960[rdx]             # table, calculated relative to table+MAX_VAL
    cmovnb  rax, rdx
    mov     QWORD PTR isr_pointer_inc_after(int)::position[rip], rax

    jmp     doStuff(int)@PLT
.L29:
    ret

同样,8个UOP(假设 cmov公司 [rax] 在Sandybridge系列上,寻址模式(RAX来自负载)的延迟比索引寻址模式低1个周期。如果没有位移,它就不会受到中所述的惩罚 Is there a penalty when base+offset is in a different page than the base?

或者(用计数器)计数向下接近零:如果编译器可以使用减量设置的标志来检测值变为负值,则可能会保存一条指令。或者我们可以一直使用递减值,然后进行环绕:所以什么时候 柜台 table[--counter] ( table[0] ),但是商店 counter=MAX_VAL

如果你想要一个分支版本,你会希望它在carry标志上分支,因为 sub eax,1 / jc 宏可以融合成1个uops,但是 / js 无法对Sandybridge家族进行宏融合。 x86_64 - Assembly - loop conditions and out of order . 但如果没有小枝,那就好了。 cmovs cmovc (如果设置进位标志,则为mov)。

要让gcc使用dec或sub的标志结果而不同时执行 cdqe intptr_t 但那是愚蠢的;不如用个指针。对于一个未签名的计数器,gcc和clang都想做另一个 cmp eax, 239 (int)counter < 0 :

// Counts downward, table[] entries need to be reversed
void isr_branchless_dec_after(int condition) {
    static unsigned int counter = MAX_VAL-1;

    if(condition){
        int tmp = table[counter];
        --counter;
        counter = ((int)counter < 0) ? MAX_VAL-1 : counter;
        //counter = (counter >= MAX_VAL) ? MAX_VAL-1 : counter;
        //counter = (counter==0) ? MAX_VAL-1 : counter-1;
        doStuff(tmp);
    }
}

 # gcc8.2 -O3 -march=haswell -fPIE
isr_branchless_dec_after(int):
    test    edi, edi
    je      .L20

    mov     ecx, DWORD PTR isr_branchless_dec_after(int)::counter[rip]
    lea     rdx, table[rip]
    mov     rax, rcx                   # stupid compiler, this copy is unneeded
    mov     edi, DWORD PTR [rdx+rcx*4] # load the arg for doStuff
    mov     edx, 239                   # calculate the next counter value
    dec     eax
    cmovs   eax, edx
    mov     DWORD PTR isr_branchless_dec_after(int)::counter[rip], eax  # and store it

    jmp     doStuff(int)@PLT
.L20:
    ret

仍然是8个uops(应该是7个),但是在关键路径上没有额外的延迟。因此,所有额外的减量和换行指令都是用于无序执行的有趣的指令级并行。