Avoiding atomics entirely is inefficient, but by first reducing within each warp you can cut the number of atomic operations dramatically.
If you have Ampere or better, you have the __reduce_XXX_sync functions, which do exactly what you want. Note that in actual testing, a warp-wide atomicOp on shared memory was in fact faster (12 cycles vs. 26 cycles), but you did ask for no atomics.
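For reference, a minimal sketch of the Ampere path (compute capability 8.0 and up); __reduce_min_sync is the real intrinsic, the wrapper name is mine:

__device__ [[nodiscard]] int warp_min(int value) {
    //hardware warp reduction: every lane receives the minimum across all 32 lanes
    return __reduce_min_sync(0xFFFFFFFFu, value);
}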
You can test predicates very efficiently with __ballot_sync, which evaluates 32 predicates in parallel, one per lane. With one more reduction step you can do 1024 tests in four instructions (see the code below). With some bit twiddling we can derive the first thread that found a match, and use that to derive the index of the match.
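The warp-level building block looks like this (a minimal sketch; the findmin code below assembles the 1024-wide version out of it):

//returns the lane index of the first matching thread in the warp, or -1 if none match
__device__ [[nodiscard]] int first_match_lane(bool predicate) {
    const auto mask = __ballot_sync(0xFFFFFFFFu, predicate); //bit i is set iff lane i's predicate holds
    return __ffs(mask) - 1; //__ffs is 1-based and returns 0 for an empty mask, hence the -1
}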
This code is obviously memory-bound; you could use async_memcpy (new in Ampere, available as a preview on Volta) to prefetch the memory ahead of time (beyond the scope of this answer, though). My testing indicates this would speed things up by about 30%.
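As a rough illustration of such prefetching (a sketch, under the same untested caveat as the rest of this answer; the API is cooperative_groups::memcpy_async from CUDA 11, the function and variable names are made up):

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__device__ void prefetch_tile(int* smem_tile, const int* gmem_src, int count) {
    auto block = cg::this_thread_block();
    //start an asynchronous copy into shared memory; hardware-accelerated on Ampere, emulated on older architectures
    cg::memcpy_async(block, smem_tile, gmem_src, sizeof(int) * count);
    //... overlap work on the current tile here ...
    cg::wait(block); //wait until the prefetched tile has landed
}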
Note that atomic operations on the GPU are very, very fast, only a handful of cycles, even atomics on global memory. This is unlike atomics on the CPU, which can take on the order of 1000 cycles. If you model atomics as fire-and-forget operations (i.e.: you do not use the return value), they return immediately.
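For example (a hypothetical counter, just to illustrate the difference):

__device__ unsigned counter;

__device__ void fire_and_forget() {
    atomicAdd(&counter, 1u); //return value discarded: the thread does not wait for the result
}

__device__ unsigned add_and_wait() {
    return atomicAdd(&counter, 1u); //return value used: the thread stalls until the old value arrives
}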
#include <cassert>
#include <cstdio>

static constexpr auto big = 0x0FFFFFFF; //sentinel, larger than any valid index
static constexpr auto small = 0;        //padding value, must not satisfy a >= threshold (assumes threshold > 0)

__device__ [[nodiscard]] int laneid() { return threadIdx.x % 32; }
__device__ [[nodiscard]] int warpid() { return threadIdx.x / 32; }

template <int blockcount, int blocksize>
__device__ int findmin(const int* data, int length, int threshold) {
    constexpr auto warpcount = blocksize / 32;
    __shared__ int minlocation[warpcount];
    __shared__ int anywheremask;
    //round the per-block share up to a multiple of blocksize, so that every thread
    //in the block runs the same number of loop iterations and the __syncthreads
    //inside the loop is safe.
    const auto perblock = (length + blockcount - 1) / blockcount;
    const auto stride = ((perblock + blocksize - 1) / blocksize) * blocksize;
    const auto start = stride * blockIdx.x;
    const auto end = start + stride; //allow processing past the end.
    __syncthreads();
    //from start to end
    for (auto i = start + threadIdx.x; i < end; i += blockDim.x) {
        //all threads will always be active inside this loop.
        const auto a = (i < length) ? data[i] : small; //do not read past the end though
        const auto foundmask = __ballot_sync(-1u, a >= threshold);
        //contains a 1 for every lane in the warp that found a match.
        minlocation[warpid()] = foundmask;
        __syncthreads();
        if (warpid() == 0) { //only in warp0, to reduce demand on shared memory.
            //guard against blocksize < 1024, where warpcount < 32 and the upper lanes have no slot
            const auto warpfound = (laneid() < warpcount) ? minlocation[laneid()] : 0;
            anywheremask = __ballot_sync(-1u, warpfound); //bit w is set iff warp w found a match
        }
        __syncthreads();
        if (anywheremask) {
            const auto warpfind = __ffs(anywheremask) - 1; //first warp with a match
            assert(uint32_t(warpfind) < 32);
            //do not worry about 1024 concurrent reads from shared memory, because this only happens once.
            const auto lanefind = __ffs(minlocation[warpfind]) - 1; //first matching lane in that warp
            assert(uint32_t(lanefind) < 32);
            //you may use an atomicOr here to signal that you've found something,
            //and test that value elsewhere to exit early, so other blocks also have an early out.
            return i - threadIdx.x + warpfind * 32 + lanefind;
        }
    }
    constexpr auto not_found = -1;
    return not_found;
}
template <int blockcount, int blocksize>
__global__ void findminmax(int* data, int length, int threshold, int* gmin, int* gmax) {
    assert(blockcount == gridDim.x && blocksize == blockDim.x);
    gmin[blockIdx.x] = findmin<blockcount, blocksize>(data, length, threshold);
    //now reduce the gmin array to find the minimal value (see the sketch after main below).
    //code for gmax works the same way, but traverses the data in the opposite direction.
    //do not! run findmin and findmax at the same time, because findmax needs the min position as a stop.
}

int main() {
    //init data: allocate and fill data, gmin and gmax with cudaMalloc etc. (omitted here)
    cudaDeviceSynchronize(); //or use cooperative groups
    findminmax<46, 1024><<<46, 1024>>>(data, len, t, gmin, gmax);
    cudaDeviceSynchronize();
}
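The follow-up reduction of the gmin array mentioned in the kernel could look like this (a deliberately naive single-thread sketch; with only 46 entries the launch overhead dominates anyway):

//reduce the per-block results to one global answer; not_found (-1) entries must be skipped.
__global__ void reduce_gmin(const int* gmin, int blockcount, int* result) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        auto best = big;
        for (int i = 0; i < blockcount; ++i) {
            if (gmin[i] >= 0) { best = min(best, gmin[i]); }
        }
        *result = (best == big) ? -1 : best;
    }
}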
I have not tested the code above, nor tried to compile it, but I hope you get the idea.
Pre-Ampere, you can easily emulate __reduce_XXX_sync yourself, like so:
__device__ [[nodiscard]] int reduce_min_sync(unsigned activemask, int data) {
    assert(activemask == 0xFFFFFFFFu); //adjust the code if not all threads are active
    //note the argument order: __shfl_down_sync(mask, var, delta)
    data = min(data, __shfl_down_sync(activemask, data, 16));
    data = min(data, __shfl_down_sync(activemask, data, 8));
    data = min(data, __shfl_down_sync(activemask, data, 4));
    data = min(data, __shfl_down_sync(activemask, data, 2));
    data = min(data, __shfl_down_sync(activemask, data, 1));
    return data; //only lane 0 holds the minimum of the whole warp
}
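Usage is a drop-in for the intrinsic, e.g. const auto warpmin = reduce_min_sync(0xFFFFFFFFu, myvalue); (myvalue being whatever each lane contributes). If every lane needs the result, broadcast lane 0's value with __shfl_sync afterwards, or use __shfl_xor_sync instead of __shfl_down_sync in the steps above to get a butterfly reduction that leaves the minimum in all lanes.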
However, the reduction is faster if you use atomics:
__device__ [[nodiscard]] int reduce_min_sync(unsigned activemask, int data) {
    const auto starttime = clock64(); //not the 32-bit clock(), that is slow!
    __shared__ int s_min; //note: shared by the whole block, give each warp its own slot if more than one warp calls this concurrently
    if (laneid() == 0) { s_min = big; }
    __syncwarp(activemask); //make the initialization visible before the atomics start
    atomicMin(&s_min, data);
    __syncwarp(activemask); //make all atomics visible before reading back
    const auto endtime = clock64();
    printf("time for atomic reduction = %i cycles\n", int(endtime - starttime));
    return s_min;
}
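Note one difference between the two versions: the atomic one hands every lane the final minimum, because all lanes read s_min back after the __syncwarp, whereas the shuffle version leaves the true minimum only in lane 0.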