代码之家 › 专栏 › 技术社区 › R zu

Xeon CPU(E5-2603)反向内存预取

intel x86 optimization caching performance

R zu · 技术社区 · 6 年前

Xeon CPU(E5-2603)中的反向内存预取是否与正向内存预取一样快?

我想实现一个算法,它需要对数据进行前向循环和后向循环。

因为每次迭代都需要上一次迭代的结果,所以我不能颠倒循环的顺序。

1 回复 | 直到 6 年前

Hadi Brais 6 年前

您可以运行实验来确定数据预取器是否能够处理前向顺序访问和后向顺序访问。我有一个Haswell的CPU,所以预取器可能不同于CPU中实现的预取器(Sandy Bridge)。

数组在正向顺序初始化,然后以相同的方式遍历。我把这种模式称为 forfor .
数组在正向顺序初始化,然后在反向顺序遍历(从最后一个元素到第一个元素)。我把这种模式称为 forback
数组在反向顺序初始化,然后以相同的方式遍历。我把这种模式称为 backback .

福尔 forfor1 和 forfor2 . 每个元素的平均延迟如下:

为1
对于2 :15个周期。
前卫 :35.8个循环。
:40.3个周期。

L1访问延迟对任何测量噪声都特别敏感。二级访问延迟应该是 12 cycles 福尔 ,大多数延迟是4个周期,这清楚地表明L1命中。在第二轮 福尔 前卫 后仰 ,我们可以看到大多数延迟是三级命中。这意味着L3预取器能够处理正向和反向遍历,但不能处理L1和L2预取器。

注意,我没有刷新数组两次遍历之间的缓存,因此第一次遍历可能会影响第二次遍历中测量的延迟。

这是我用来测量的代码。

/* compile with gcc at optimization level -O3 */
/* set the minimum and maximum CPU frequency for all cores using cpupower to get meaningful results */ 
/* run using "sudo nice -n -20 ./a.out" to minimize possible context switches, or at least use "taskset -c 0 ./a.out" */
/* make sure all cache prefetchers are enabled */
/* preferrably disable HT */
/* this code is Intel-specific */
/* see the note at the end of the answer */

#include <stdint.h>
#include <x86intrin.h>
#include <stdio.h>

// 2048 iterations.
#define LINES_SIZE 64
#define ITERATIONS 2048 * LINES_SIZE
// Forward
#define START 0
#define END ITERATIONS
// Backward
//#define START ITERATIONS - LINES_SIZE
//#define END 0
#if START < END
#define INCREMENT i = i + LINES_SIZE
#define COMP <
#else
#define INCREMENT i = i - LINES_SIZE
#define COMP >=
#endif

int main()
{
  int array[ ITERATIONS ];
  int latency[ ITERATIONS/LINES_SIZE ];
  uint64_t time1, time2, al, osl; /* initial values don't matter */

  // Perhaps necessary to prevents UB?
  for ( int i = 0; i < ITERATIONS; i = i + LINES_SIZE )
  {
     array[ i ] = i; 
  }

  printf( "address = %p \n", &array[ 0 ] ); /* guaranteed to be aligned within a single cache line */

  // Measure overhead.
  _mm_mfence();                      
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc */
  time1 = __rdtsc();                 /* set timer */
  _mm_lfence();                      /* serialize rdtsc with respect to trailing instructions + compiler barrier for rdtsc */
  /* no need for mfence because there are no stores in between */
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc */
  time2 = __rdtsc();
  _mm_lfence();                      /* serialize rdtsc with respect to trailing instructions */
  osl = time2 - time1;

  // Forward or backward traversal.
  for ( int i = START; i COMP END; INCREMENT )
  {

     _mm_mfence();                      /* this properly orders both clflush and rdtsc */
     _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc */
     time1 = __rdtsc();                 /* set timer */
     _mm_lfence();                      /* serialize rdtsc with respect to trailing instructions + compiler barrier for rdtsc */
     int temp = array[ i ];             /* access array[i] */
     _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc */
     time2 = __rdtsc();
     _mm_lfence();                      /* serialize rdtsc with respect to trailing instructions */
     al = time2 - time1;

     printf( "array[ %i ] = %i \n", i, temp );         /* prevent the compiler from optimizing the load */
     latency[i/64] = al - osl;

  }

  // Output measured latencies.
  for ( int i = 0; i < ITERATIONS/LINES_SIZE; ++i )
  {
     printf( "%i \n", latency[i] );
  }

  return 0;
}

这些实验的目的是测量单个访问延迟,以确定每个访问的缓存级别。但是,由于 LFENCE 指令,度量可以包括加载指令在管道的其他阶段所需的延迟。此外,编译器正在将一些ALU指令放在计时区域A中,因此测量可能会受到这些指令的影响(这可以通过在程序集中编写代码来避免)。这使得区分一级访问和二级访问变得很困难。例如,一些L1延迟测量报告为8个周期。尽管如此 前卫 后仰

如果我们有兴趣测量访问内存层次结构特定级别的平均延迟,那么使用指针跟踪可以提供更准确的结果。事实上,这是测量内存延迟的传统方法。

如果以硬件预取器(尤其是二级或三级)难以预测的模式访问大量数据,则软件预取可能非常有益。然而,获得正确的软件预取通常是困难的。此外,我得到的测量显示,L3预取器可以向前和向后预取。如果在内存访问和计算方面都有很好的并行性,那么OoO执行可以隐藏L3访问延迟的很大一部分。

:如果我没有使用输出重定向运算符>将所有输出重定向到文件,即所有输出都将打印在终端上,则所有测量的延迟都将接近三级命中延迟。原因是 printf ,它在每次迭代中都被调用,正在污染大部分L1和L2缓存。所以一定要使用>运算符。你也可以使用 (void) *((volatile int*)array + i) 而不是 int tmp = array[i] this 和 this