Avoiding atomics entirely is inefficient, but by first reducing within each warp you can cut the number of atomic operations dramatically.
If you have Ampere or better, you have the __reduce_XXX_sync functions, which do exactly what you want. Note that in actual testing, a warp-wide atomicOp on shared memory was in fact faster (12 cycles vs. 26 cycles), but you did ask for no atomics.
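For reference, a minimal sketch of the Ampere path (compute capability 8.0 and up); __reduce_min_sync is the real intrinsic, the wrapper name is mine:

__device__ [[nodiscard]] int warp_min(int value) {
    //hardware warp reduction: every lane receives the minimum across all 32 lanes
    return __reduce_min_sync(0xFFFFFFFFu, value);
}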
You can test predicates very efficiently with __ballot_sync, which evaluates 32 predicates in parallel, one per lane. With one more reduction step you can do 1024 tests in four instructions (see the code below). With some bit twiddling we can derive the first thread that found a match, and use that to derive the index of the match.
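The warp-level building block looks like this (a minimal sketch; the findmin code below assembles the 1024-wide version out of it):

//returns the lane index of the first matching thread in the warp, or -1 if none match
__device__ [[nodiscard]] int first_match_lane(bool predicate) {
    const auto mask = __ballot_sync(0xFFFFFFFFu, predicate); //bit i is set iff lane i's predicate holds
    return __ffs(mask) - 1; //__ffs is 1-based and returns 0 for an empty mask, hence the -1
}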
This code is obviously memory-bound; you could use async_memcpy (new in Ampere, available as a preview on Volta) to prefetch the memory ahead of time (beyond the scope of this answer, though). My testing indicates this would speed things up by about 30%.
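As a rough illustration of such prefetching (a sketch, under the same untested caveat as the rest of this answer; the API is cooperative_groups::memcpy_async from CUDA 11, the function and variable names are made up):

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__device__ void prefetch_tile(int* smem_tile, const int* gmem_src, int count) {
    auto block = cg::this_thread_block();
    //start an asynchronous copy into shared memory; hardware-accelerated on Ampere, emulated on older architectures
    cg::memcpy_async(block, smem_tile, gmem_src, sizeof(int) * count);
    //... overlap work on the current tile here ...
    cg::wait(block); //wait until the prefetched tile has landed
}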
Note that atomic operations on the GPU are very, very fast, only a handful of cycles, even atomics on global memory. This is unlike atomics on the CPU, which can take on the order of 1000 cycles. If you model atomics as fire-and-forget operations (i.e.: you do not use the return value), they return immediately.
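For example (a hypothetical counter, just to illustrate the difference):

__device__ unsigned counter;

__device__ void fire_and_forget() {
    atomicAdd(&counter, 1u); //return value discarded: the thread does not wait for the result
}

__device__ unsigned add_and_wait() {
    return atomicAdd(&counter, 1u); //return value used: the thread stalls until the old value arrives
}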
#include <cassert>
#include <cstdio>

static constexpr auto big = 0x0FFFFFFF; //sentinel, larger than any valid index
static constexpr auto small = 0;        //padding value, must not satisfy a >= threshold (assumes threshold > 0)

__device__ [[nodiscard]] int laneid() { return threadIdx.x % 32; }
__device__ [[nodiscard]] int warpid() { return threadIdx.x / 32; }

template <int blockcount, int blocksize>
__device__ int findmin(const int* data, int length, int threshold) {
    constexpr auto warpcount = blocksize / 32;
    __shared__ int minlocation[warpcount];
    __shared__ int anywheremask;
    //round the per-block share up to a multiple of blocksize, so that every thread
    //in the block runs the same number of loop iterations and the __syncthreads
    //inside the loop is safe.
    const auto perblock = (length + blockcount - 1) / blockcount;
    const auto stride = ((perblock + blocksize - 1) / blocksize) * blocksize;
    const auto start = stride * blockIdx.x;
    const auto end = start + stride; //allow processing past the end.
    __syncthreads();
    //from start to end
    for (auto i = start + threadIdx.x; i < end; i += blockDim.x) {
        //all threads will always be active inside this loop.
        const auto a = (i < length) ? data[i] : small; //do not read past the end though
        const auto foundmask = __ballot_sync(-1u, a >= threshold);
        //contains a 1 for every lane in the warp that found a match.
        minlocation[warpid()] = foundmask;
        __syncthreads();
        if (warpid() == 0) { //only in warp0, to reduce demand on shared memory.
            //guard against blocksize < 1024, where warpcount < 32 and the upper lanes have no slot
            const auto warpfound = (laneid() < warpcount) ? minlocation[laneid()] : 0;
            anywheremask = __ballot_sync(-1u, warpfound); //bit w is set iff warp w found a match
        }
        __syncthreads();
        if (anywheremask) {
            const auto warpfind = __ffs(anywheremask) - 1; //first warp with a match
            assert(uint32_t(warpfind) < 32);
            //do not worry about 1024 concurrent reads from shared memory, because this only happens once.
            const auto lanefind = __ffs(minlocation[warpfind]) - 1; //first matching lane in that warp
            assert(uint32_t(lanefind) < 32);
            //you may use an atomicOr here to signal that you've found something,
            //and test that value elsewhere to exit early, so other blocks also have an early out.
            return i - threadIdx.x + warpfind * 32 + lanefind;
        }
    }
    constexpr auto not_found = -1;
    return not_found;
}
template <int blockcount, int blocksize>
__global__ void findminmax(int* data, int length, int threshold, int* gmin, int* gmax) {
    assert(blockcount == gridDim.x && blocksize == blockDim.x);
    gmin[blockIdx.x] = findmin<blockcount, blocksize>(data, length, threshold);
    //now reduce the gmin array to find the minimal value (see the sketch after main below).
    //code for gmax works the same way, but traverses the data in the opposite direction.
    //do not! run findmin and findmax at the same time, because findmax needs the min position as a stop.
}

int main() {
    //init data: allocate and fill data, gmin and gmax with cudaMalloc etc. (omitted here)
    cudaDeviceSynchronize(); //or use cooperative groups
    findminmax<46, 1024><<<46, 1024>>>(data, len, t, gmin, gmax);
    cudaDeviceSynchronize();
}
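The follow-up reduction of the gmin array mentioned in the kernel could look like this (a deliberately naive single-thread sketch; with only 46 entries the launch overhead dominates anyway):

//reduce the per-block results to one global answer; not_found (-1) entries must be skipped.
__global__ void reduce_gmin(const int* gmin, int blockcount, int* result) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        auto best = big;
        for (int i = 0; i < blockcount; ++i) {
            if (gmin[i] >= 0) { best = min(best, gmin[i]); }
        }
        *result = (best == big) ? -1 : best;
    }
}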
I have not tested the code above, nor tried to compile it, but I hope you get the idea.
Pre-Ampere, you can easily emulate __reduce_XXX_sync yourself, like so:
__device__ [[nodiscard]] int reduce_min_sync(unsigned activemask, int data) {
    assert(activemask == 0xFFFFFFFFu); //adjust the code if not all threads are active
    //note the argument order: __shfl_down_sync(mask, var, delta)
    data = min(data, __shfl_down_sync(activemask, data, 16));
    data = min(data, __shfl_down_sync(activemask, data, 8));
    data = min(data, __shfl_down_sync(activemask, data, 4));
    data = min(data, __shfl_down_sync(activemask, data, 2));
    data = min(data, __shfl_down_sync(activemask, data, 1));
    return data; //only lane 0 holds the minimum of the whole warp
}
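Usage is a drop-in for the intrinsic, e.g. const auto warpmin = reduce_min_sync(0xFFFFFFFFu, myvalue); (myvalue being whatever each lane contributes). If every lane needs the result, broadcast lane 0's value with __shfl_sync afterwards, or use __shfl_xor_sync instead of __shfl_down_sync in the steps above to get a butterfly reduction that leaves the minimum in all lanes.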
However, the reduction is faster if you use atomics:
__device__ [[nodiscard]] int reduce_min_sync(unsigned activemask, int data) {
    const auto starttime = clock64(); //not the 32-bit clock(), that is slow!
    __shared__ int s_min; //note: shared by the whole block, give each warp its own slot if more than one warp calls this concurrently
    if (laneid() == 0) { s_min = big; }
    __syncwarp(activemask); //make the initialization visible before the atomics start
    atomicMin(&s_min, data);
    __syncwarp(activemask); //make all atomics visible before reading back
    const auto endtime = clock64();
    printf("time for atomic reduction = %i cycles\n", int(endtime - starttime));
    return s_min;
}
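Note one difference between the two versions: the atomic one hands every lane the final minimum, because all lanes read s_min back after the __syncwarp, whereas the shuffle version leaves the true minimum only in lane 0.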