Why does one shared, busy CPU core hurt the overall CPU utilization of an OpenMP program?

openmp linux c++

Xiaoyong Guo · Tech Community · 1 year ago

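My system is a Linux machine with 12 cores, of which cores 2-11 are isolated. Cores 0 and 1 are kept at almost 100% utilization by other programs. All remaining cores are idle.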

First test run:

export GOMP_CPU_AFFINITY=2,3,4
export OMP_NUM_THREADS=3
taskset -c $GOMP_CPU_AFFINITY perf stat -d ./test_openmp

The output is:

 Performance counter stats for './test_openmp':

         47,654.74 msec task-clock:u              #    2.981 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
           115,358      page-faults:u             #    2.421 K/sec                  
   159,245,881,934      cycles:u                  #    3.342 GHz                    
   250,009,309,156      instructions:u            #    1.57  insn per cycle         
    20,002,132,172      branches:u                #  419.730 M/sec                  
           117,268      branch-misses:u           #    0.00% of all branches        
   110,002,614,320      L1-dcache-loads:u         #    2.308 G/sec                  
    10,796,435,741      L1-dcache-load-misses:u   #    9.81% of all L1-dcache accesses
                 0      LLC-loads:u               #    0.000 /sec                   
                 0      LLC-load-misses:u         #    0.00% of all LL-cache accesses

      15.986638336 seconds time elapsed

      47.175831000 seconds user
       0.414928000 seconds sys

Second test run:

export GOMP_CPU_AFFINITY=1,2,3,4
export OMP_NUM_THREADS=4

taskset -c $GOMP_CPU_AFFINITY perf stat -d ./test_openmp

The output is:

pid: 4118342

 Performance counter stats for './test_openmp':

         48,241.03 msec task-clock:u              #    1.072 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
           119,879      page-faults:u             #    2.485 K/sec                  
   161,605,704,451      cycles:u                  #    3.350 GHz                    
   250,011,376,400      instructions:u            #    1.55  insn per cycle         
    20,002,726,448      branches:u                #  414.641 M/sec                  
           118,657      branch-misses:u           #    0.00% of all branches        
   110,002,938,510      L1-dcache-loads:u         #    2.280 G/sec                  
    10,796,444,713      L1-dcache-load-misses:u   #    9.81% of all L1-dcache accesses
                 0      LLC-loads:u               #    0.000 /sec                   
                 0      LLC-load-misses:u         #    0.00% of all LL-cache accesses

      45.012033357 seconds time elapsed

      47.764469000 seconds user
       0.399934000 seconds sys

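My question: in the second run I gave the program one more core (core 1), so why does it take so much longer (15.98 s vs. 45.01 s), and why is the CPU utilization so low (2.98 vs. 1.07 CPUs utilized)?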

This is the test code I ran.

#include <iostream>
#include <cstdint>
#include <unistd.h>

constexpr int64_t N = 100000;
int m = N;
int n = N;

int main() {
  double* a = new double[N];
  double* c = new double[N];
  double* b = new double[N*N];

  std::cout << "pid: " << getpid() << std::endl;

#pragma omp parallel for default(none) shared(m,n,a,b,c)
  for (int i=0; i<m; i++) {
    double sum = 0.0;
    for (int j=0; j<n; j++)
      sum += b[i+j*N]*c[j];
    a[i] = sum;
  }

  return 0;
}

1 answer | up to 1 year ago

Joachim · 1 year ago

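If you do not specify a schedule for a worksharing loop, the schedule is implementation-defined. Most implementations choose static scheduling because it has the lowest runtime overhead for most workloads. Static scheduling assigns the same number of iterations to each thread. A thread that has to share its core with another busy program therefore needs longer for its equal share, while the other threads finish early and wait at the implicit barrier at the end of the loop.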
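To make the effect of an equal split concrete, here is a rough back-of-the-envelope model, not a measurement. It assumes each thread t gets a fixed fraction speed[t] of a core, and that a dynamic schedule balances perfectly with no overhead. The function names static_elapsed and dynamic_elapsed are made up for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Simplified model (an assumption, not measured behaviour):
//   total_work : loop cost in core-seconds
//   speed[t]   : fraction of a core thread t really gets (1.0 = a whole core)

// Static schedule: every thread receives total_work / T up front, so the
// loop ends only when the slowest thread has finished its fixed share.
double static_elapsed(double total_work, const std::vector<double>& speed) {
  assert(!speed.empty());
  const double share = total_work / static_cast<double>(speed.size());
  double slowest = 0.0;
  for (double s : speed) {
    assert(s > 0.0);
    slowest = std::max(slowest, share / s);
  }
  return slowest;
}

// Idealized dynamic schedule: threads keep taking chunks until none are
// left, so all of them finish together and their speeds simply add up.
double dynamic_elapsed(double total_work, const std::vector<double>& speed) {
  assert(!speed.empty());
  const double rate = std::accumulate(speed.begin(), speed.end(), 0.0);
  assert(rate > 0.0);
  return total_work / rate;
}
```

With illustrative inputs of 48 core-seconds of work and speeds {0.25, 1, 1, 1} (the 0.25 is an arbitrary guess for a thread on a contended core), the model gives 48 s for the static split and about 14.8 s for the idealized dynamic one.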

In your case, you specifically want to allow OpenMP to distribute the work unevenly across the threads. Try adding schedule(dynamic) to the parallel for directive.
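A minimal sketch of that change, applied to a loop with the same access pattern as the test program. The function name column_major_matvec and the use of std::vector are assumptions made for illustration and are not part of the original code. When no chunk size is given, schedule(dynamic) hands out one iteration at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: computes a[i] = sum over j of b[i + j*m] * c[j],
// the same strided access as the test program, with a dynamic schedule.
std::vector<double> column_major_matvec(const std::vector<double>& b,
                                        const std::vector<double>& c,
                                        int m, int n) {
  assert(m >= 0 && n >= 0);
  assert(b.size() == static_cast<std::size_t>(m) * static_cast<std::size_t>(n));
  assert(c.size() == static_cast<std::size_t>(n));

  std::vector<double> a(static_cast<std::size_t>(m), 0.0);

  // Each idle thread fetches the next unprocessed i, so a thread that is
  // slowed down by a busy core simply ends up handling fewer iterations.
#pragma omp parallel for schedule(dynamic)
  for (int i = 0; i < m; i++) {
    double sum = 0.0;
    for (int j = 0; j < n; j++) {
      const std::int64_t idx = i + static_cast<std::int64_t>(j) * m;
      sum += b[static_cast<std::size_t>(idx)] * c[static_cast<std::size_t>(j)];
    }
    a[static_cast<std::size_t>(i)] = sum;
  }
  return a;
}
```

Build with an OpenMP flag such as g++ -fopenmp. Without it the pragma is ignored and the loop runs serially with the same result.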

You can also choose schedule(runtime) and control the schedule for each execution by setting the OMP_SCHEDULE environment variable.
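As a sketch of the runtime variant: the loop below leaves the scheduling policy open at compile time. The helper weighted_sum is a hypothetical stand-in for the real kernel. If OMP_SCHEDULE is not set, the schedule that is used is again implementation-defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: dot product of x and w. schedule(runtime) defers the
// choice of schedule to the OMP_SCHEDULE environment variable.
double weighted_sum(const std::vector<double>& x, const std::vector<double>& w) {
  assert(x.size() == w.size());
  const std::int64_t n = static_cast<std::int64_t>(x.size());
  double total = 0.0;

  // reduction(+ : total) gives each thread a private partial sum and
  // combines the partial sums when the loop ends.
#pragma omp parallel for schedule(runtime) reduction(+ : total)
  for (std::int64_t i = 0; i < n; i++) {
    const std::size_t k = static_cast<std::size_t>(i);
    total += x[k] * w[k];
  }
  return total;
}
```

Assuming the test program is rebuilt with schedule(runtime) on its loop, OMP_SCHEDULE="dynamic,64" ./test_openmp would select dynamic scheduling with chunks of 64 iterations, and OMP_SCHEDULE=static ./test_openmp would switch back, both without recompiling. The chunk size 64 is an arbitrary illustration.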