CUDA reduction: race condition?

cuda c++

Wad · 7 years ago

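Consider the following code, and the accompanying explanatory image, which I took from a tutorial. Its purpose is to demonstrate a parallel reduction in CUDA.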

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <iostream>
#include <numeric>
using namespace std;

__global__ void sumSingleBlock(int* d)
{
  int tid = threadIdx.x;

  // Number of participating threads (tc) halves on each iteration
  for (int tc = blockDim.x, stepSize = 1; tc > 0; tc >>= 1, stepSize <<= 1)
  {
    // Thread must be allowed to write
    if (tid < tc)
    {
      // We need to do A + B, where B is the element following A, so first we 
      // need to find the position of element A and of element B      
      int posA = tid * stepSize * 2;
      int posB = posA + stepSize;

      // Update the value at posA by adding the value at posB to it
      d[posA] += d[posB];
    }
  }
}

int main()
{
  cudaError_t status;

  const int count = 8;
  const int size = count * sizeof(int);
  int* h = new int[count];
  for (int i = 0; i < count; ++i)
    h[i] = i+1;

  int* d;
  status = cudaMalloc(&d, size);

  status = cudaMemcpy(d,h,size, cudaMemcpyHostToDevice);

  sumSingleBlock<<<1,count/2>>>(d);

  int result;
  status = cudaMemcpy(&result,d,sizeof(int),cudaMemcpyDeviceToHost);

  cout << "Sum is " << result << endl;

  getchar();

  cudaFree(d);
  delete [] h;

  return 0;
}

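I understand the general principle of the reduction outlined in the image. What I don't understand is why there is no race condition in the addition (*):

Clearly, all four threads run the same number of loop iterations; they only do useful work when tid < tc. Thread #0 adds 1 and 2 and stores the result in element 0. Its second iteration then reads element 2. Meanwhile, thread #1's first iteration adds 3 and 4 and stores the result in element 2.

What if thread #0 starts iteration 2 before thread #1 has finished iteration 1? Thread #0 could then read 3 instead of 7, or possibly a torn value (?). There is no synchronization here, so is the code simply wrong?

(*) Note: I am not sure whether a race condition exists; I am simply trusting that the tutorial code is safe and correct.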

1 answer | latest 7 years ago

Wad · 7 years ago

The code is wrong: it needs a __syncthreads() call at the end of each loop iteration, as shown below.

__global__ void sumSingleBlock(int* d)
{
  int tid = threadIdx.x;

  // Number of participating threads (tc) halves on each iteration
  for (int tc = blockDim.x, stepSize = 1; tc > 0; tc >>= 1, stepSize <<= 1)
  {
    // Thread must be allowed to write
    if (tid < tc)
    {
      // We need to do A + B, where B is the element following A, so first we 
      // need to find the position of element A and of element B      
      int posA = tid * stepSize * 2;
      int posB = posA + stepSize;

      // Update the value at posA by adding the value at posB to it
      d[posA] += d[posB];
    }
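    // Barrier: every thread in the block must finish this step before any
    // thread reads the partial sums in the next step. It sits outside the
    // if so that all threads reach it.
    __syncthreads();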
  }
}
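
To see why the barrier matters, here is a minimal host-side sketch in plain C++ (not CUDA) that replays the kernel's indexing under two thread orderings. The function names step, reduceWithBarrier and reduceThreadZeroFirst are hypothetical and do not come from the tutorial. The second ordering, where thread #0 runs all of its iterations before any other thread starts, is an extreme case chosen for illustration: the CUDA programming model gives no ordering guarantee between threads of a block without synchronization, even though a four-thread launch that fits in a single warp may happen to run in lockstep on some hardware.

```cpp
#include <vector>

// Host-side simulation only: performs the work of one thread (tid) for one
// loop iteration, using the same index arithmetic as sumSingleBlock.
void step(std::vector<int>& d, int tid, int tc, int stepSize)
{
  if (tid < tc)
  {
    int posA = tid * stepSize * 2;
    int posB = posA + stepSize;
    d[posA] += d[posB];
  }
}

// Ordering that a barrier enforces: every thread completes iteration k
// before any thread begins iteration k + 1.
int reduceWithBarrier(std::vector<int> d, int threads)
{
  for (int tc = threads, stepSize = 1; tc > 0; tc >>= 1, stepSize <<= 1)
    for (int tid = 0; tid < threads; ++tid)
      step(d, tid, tc, stepSize);
  return d[0];
}

// One ordering that is possible without a barrier (assumed for illustration):
// each thread runs all of its iterations before the next thread starts, so
// thread 0 reads elements 2 and 4 before they hold partial sums.
int reduceThreadZeroFirst(std::vector<int> d, int threads)
{
  for (int tid = 0; tid < threads; ++tid)
    for (int tc = threads, stepSize = 1; tc > 0; tc >>= 1, stepSize <<= 1)
      step(d, tid, tc, stepSize);
  return d[0];
}
```

With the input 1 to 8 and four threads, reduceWithBarrier returns 36, while reduceThreadZeroFirst returns 11, because thread #0 adds the stale values 3 and 5 instead of the partial sums 7 and 26.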
