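According to Intel's optimization manual, the L1 data cache is 32 KiB and 8-way associative with 64-byte lines. I have written the following micro-benchmark to test memory read performance.
I hypothesize that if we access only blocks that fit in the 32 KiB cache, each memory access will be fast, but if we exceed the cache size, accesses will suddenly become slower. When skip is 1, the benchmark accesses each line in order.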
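#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Read one byte from each of nb blocks of bs bytes, spaced bs*skip bytes
   apart, repeat that trials times, and report the average time per read. */
void benchmark(int bs, int nb, int trials, int skip)
{
    printf("block size: %d, blocks: %d, skip: %d, trials: %d\n", bs, nb, skip, trials);
    printf("total data size: %d\n", nb*bs*skip);
    printf("accessed data size: %d\n", nb*bs);
    /* volatile keeps the compiler from optimizing the reads away at -O3 */
    uint8_t volatile data[nb*bs*skip];
    clock_t before = clock();
    for (int i = 0; i < trials; ++i) {
        for (int block = 0; block < nb; ++block) {
            data[block * bs * skip];  /* read only; the value is discarded */
        }
    }
    clock_t elapsed = clock() - before;
    double ns_per_access = (double)elapsed/CLOCKS_PER_SEC/nb/trials * 1000000000;
    printf("%f ns per memory access\n", ns_per_access);
}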
Here are some results with skip = 1:
~ ❯❯❯ ./bm -s 64 -b 128 -t 10000000 -k 1
block size: 64, blocks: 128, skip: 1, trials: 10000000
total data size: 8192
accessed data size: 8192
0.269054 ns per memory access
~ ❯❯❯ ./bm -s 64 -b 256 -t 10000000 -k 1
block size: 64, blocks: 256, skip: 1, trials: 10000000
total data size: 16384
accessed data size: 16384
0.278184 ns per memory access
~ ❯❯❯ ./bm -s 64 -b 512 -t 10000000 -k 1
block size: 64, blocks: 512, skip: 1, trials: 10000000
total data size: 32768
accessed data size: 32768
0.245591 ns per memory access
~ ❯❯❯ ./bm -s 64 -b 1024 -t 10000000 -k 1
block size: 64, blocks: 1024, skip: 1, trials: 10000000
total data size: 65536
accessed data size: 65536
0.582870 ns per memory access
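So far, so good: when everything fits in the L1 cache, the inner loop runs about 4 times per nanosecond, or a bit more than once per clock cycle. When we make the data too big, each access takes roughly twice as long (0.58 ns versus about 0.25 to 0.28 ns). This is consistent with my understanding of how the cache should work.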
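The "per nanosecond" and "per clock cycle" figures can be reproduced with a small conversion. This is only a sketch: the helper names are made up for illustration, and it assumes the core runs at the nominal 2.8 GHz from the CPU brand string, which may not hold if Turbo Boost raises the clock.

```c
#include <assert.h>

/* Nominal clock in GHz, i.e. cycles per nanosecond. Assumption: taken from
   the "@ 2.80GHz" brand string; the real clock under load may be higher. */
#define NOMINAL_GHZ 2.8

/* Hypothetical helper: accesses completed per nanosecond. */
double accesses_per_ns(double ns_per_access)
{
    assert(ns_per_access > 0.0);
    return 1.0 / ns_per_access;
}

/* Hypothetical helper: accesses completed per clock cycle.
   ns_per_access * ghz is the number of cycles one access takes. */
double accesses_per_cycle(double ns_per_access, double ghz)
{
    assert(ns_per_access > 0.0 && ghz > 0.0);
    return 1.0 / (ns_per_access * ghz);
}
```

For the 8 KiB run, accesses_per_ns(0.269054) is about 3.7 and accesses_per_cycle(0.269054, NOMINAL_GHZ) is about 1.3, which is where "about 4 per nanosecond" and "a bit more than once per cycle" come from.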
Now let's access every other block by setting skip to 2:
~ ❯❯❯ ./bm -s 64 -b 512 -t 10000000 -k 2
block size: 64, blocks: 512, skip: 2, trials: 10000000
total data size: 65536
accessed data size: 32768
0.582181 ns per memory access
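But if I set skip to 3, things are fast again. In fact, other values of skip show the same split between fast and slow runs, for example 7 and 12:
~ ❯❯❯ ./bm -s 64 -b 512 -t 10000000 -k 7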
block size: 64, blocks: 512, skip: 7, trials: 10000000
total data size: 229376
accessed data size: 32768
0.265338 ns per memory access
~ ❯❯❯ ./bm -s 64 -b 512 -t 10000000 -k 12
block size: 64, blocks: 512, skip: 12, trials: 10000000
total data size: 393216
accessed data size: 32768
0.616013 ns per memory access
Why is this?
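One way to start probing this is to count how many of the accessed lines land in the same cache set for a given skip. The sketch below is a model, not a measurement. It assumes the set index is simply (byte offset / 64) mod 64, where 64 sets follows from 32 KiB / (8 ways × 64 bytes), and that the array starts on a line boundary. The helper name max_lines_per_set is hypothetical, and the real hardware may index differently.

```c
#include <assert.h>
#include <stddef.h>

/* Cache geometry from the post: 32 KiB, 8-way, 64-byte lines. */
enum {
    LINE_SIZE  = 64,
    WAYS       = 8,
    CACHE_SIZE = 32 * 1024,
    NUM_SETS   = CACHE_SIZE / (LINE_SIZE * WAYS)   /* 64 sets */
};

/* Hypothetical helper: for the benchmark's access pattern (nb reads spaced
   bs*skip bytes apart, starting at offset 0), return how many distinct
   accesses fall into the most heavily used set.
   Assumption: set index = (offset / LINE_SIZE) % NUM_SETS. */
int max_lines_per_set(int bs, int nb, int skip)
{
    assert(bs > 0 && nb >= 0 && skip > 0);
    int count[NUM_SETS] = {0};
    int worst = 0;
    for (int block = 0; block < nb; ++block) {
        size_t offset = (size_t)block * bs * skip;
        int set = (int)((offset / LINE_SIZE) % NUM_SETS);
        if (++count[set] > worst)
            worst = count[set];
    }
    return worst;
}
```

Under this model, with 64-byte blocks and 512 blocks, skip values of 1 and 7 put at most 8 lines in any set, which equals the number of ways, while skip values of 2 and 12 put 16 and 32 lines in the busiest set. That lines up with the fast and slow runs above, though it does not prove that this is the actual cause.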
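For completeness: I'm on a mid-2015 MacBook Pro running macOS 10.13.4. My full CPU brand string is Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz. I compile with cc -O3 -o bm bm.c; the compiler is the one shipped with Xcode 9.4.1. I have omitted the main function; all it does is parse the command-line options and call benchmark.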