代码之家  ›  专栏  ›  技术社区  ›  interestedparty333

有什么好方法可以把库达马洛的数据归零?

  •  -1
  • interestedparty333  · 技术社区  · 6 年前

    有什么好方法可以把库达马洛的数据归零?假设使用 cudaMemset cudaMemsetAsync 从CPU导致与其他CUDAAPI调用的同步问题,迫使您执行其他操作。

    编辑1

    在下面的第一张图片中,您可以看到线程1291997440发出了 cudaMemcpyAsync 执行它需要一段时间。不知为什么,这个 Cudammcpyasync公司 好像挡住了 cudaMemsetAsync公司 ,如下图所示。注意,每个CPU线程都在自己的流中排队这些操作。有人在场外提到,使用内核而不是使用 cudaMemsetAsync公司 调用可能会更快地清除内存--这就是我追求这个问题的原因。

    Shows Thread 1291997440 issuing a DtoH async transfer

    Shows Thread 1308604160 waiting for that transfer to complete before actually executing the cudaMemsetAsync

    编辑2

    在这一点上,我已经改进了代码(通过减少HtoD和DtoH副本的大小)足够防止这个问题出现。上面的照片是前一天晚上拍的。如果注释是100%正确的,那么分析报告中一定还有一些我没有注意到的片段。在新版本的代码中,使用 cudaMemsetAsync公司 调用一个内核来清除内存。

    1 回复  |  直到 6 年前
        1
  •  2
  •   interestedparty333    6 年前

    这里我展示了用三种不同方法将~5.5GB数据归零的结果。代码是用 -O3 在一个 V100 有16GB的内存。

    方法A: cudaMemset

    为了建立一个基线,我使用 cudaMemset公司 . 这很快,但是 cudaMemsetAsync 如果 cudaMemcpy s在飞行中。

    结果:6毫秒

    方法B: memset

    打电话 清零 可能调用了CPU和GPU世界中最糟糕的一个。 清零 有这样的感觉,一旦你退出它,数据就如你所说的那样。当然,在其他的内核、种族状况等情况下,这是不正确的,但是,这就是为什么它如此缓慢的原因。

    结果:241 ms

    方法C:合并写入

    以一种联合的方式编写似乎是CPU和GPU世界中最好的。它的速度和CPU发出的速度一样快 cudaMemset公司 ,而且对于任何编写联合编写的程序员来说,这也很清楚,当然有竞争条件等。

    结果:6毫秒

    结论

    如果你不能使用 cudaMemset[Async] 从CPU,然后使用每个块32个或更多线程的合并写入。

    程序输出

    Starting timer for calling cudaMemset from CPU
    Stopping timer for calling cudaMemset from CPU took 0.006015s
    Starting timer for calling kernel<80,1> that uses memset
    Stopping timer for calling kernel<80,1> that uses memset took 0.393921s
    Starting timer for calling kernel<80,2> that uses memset
    Stopping timer for calling kernel<80,2> that uses memset took 0.300473s
    Starting timer for calling kernel<80,4> that uses memset
    Stopping timer for calling kernel<80,4> that uses memset took 0.269686s
    Starting timer for calling kernel<80,8> that uses memset
    Stopping timer for calling kernel<80,8> that uses memset took 0.241374s
    Starting timer for calling kernel<80,16> that uses memset
    Stopping timer for calling kernel<80,16> that uses memset took 0.645509s
    Starting timer for calling kernel<80,32> that uses memset
    Stopping timer for calling kernel<80,32> that uses memset took 0.611437s
    Starting timer for calling kernel<80,64> that uses memset
    Stopping timer for calling kernel<80,64> that uses memset took 0.611276s
    Starting timer for calling kernel<80,128> that uses memset
    Stopping timer for calling kernel<80,128> that uses memset took 0.459663s
    Starting timer for calling kernel<80,256> that uses memset
    Stopping timer for calling kernel<80,256> that uses memset took 0.308788s
    Starting timer for calling kernel<80,512> that uses memset
    Stopping timer for calling kernel<80,512> that uses memset took 0.595893s
    Starting timer for calling kernel<80,1024> that uses memset
    Stopping timer for calling kernel<80,1024> that uses memset took 2.552866s
    Starting timer for calling kernel<80,1> that performs coalesced writes
    Stopping timer for calling kernel<80,1> that performs coalesced writes took 0.136967s
    Starting timer for calling kernel<80,2> that performs coalesced writes
    Stopping timer for calling kernel<80,2> that performs coalesced writes took 0.068426s
    Starting timer for calling kernel<80,4> that performs coalesced writes
    Stopping timer for calling kernel<80,4> that performs coalesced writes took 0.039974s
    Starting timer for calling kernel<80,8> that performs coalesced writes
    Stopping timer for calling kernel<80,8> that performs coalesced writes took 0.017121s
    Starting timer for calling kernel<80,16> that performs coalesced writes
    Stopping timer for calling kernel<80,16> that performs coalesced writes took 0.008586s
    Starting timer for calling kernel<80,32> that performs coalesced writes
    Stopping timer for calling kernel<80,32> that performs coalesced writes took 0.006139s
    Starting timer for calling kernel<80,64> that performs coalesced writes
    Stopping timer for calling kernel<80,64> that performs coalesced writes took 0.006075s
    Starting timer for calling kernel<80,128> that performs coalesced writes
    Stopping timer for calling kernel<80,128> that performs coalesced writes took 0.006093s
    Starting timer for calling kernel<80,256> that performs coalesced writes
    Stopping timer for calling kernel<80,256> that performs coalesced writes took 0.006479s
    Starting timer for calling kernel<80,512> that performs coalesced writes
    Stopping timer for calling kernel<80,512> that performs coalesced writes took 0.006972s
    Starting timer for calling kernel<80,1024> that performs coalesced writes
    Stopping timer for calling kernel<80,1024> that performs coalesced writes took 0.007354s
    

    测试实现

    memset_计时.cu

    #include <iostream>
    #include <numeric>
    #include <stdlib.h>
    
    #include "timer.h"
    
    static void CheckCudaErrorAux (const char *, unsigned, const char *, cudaError_t);
    #define CUDA_CHECK_RETURN(value) CheckCudaErrorAux(__FILE__,__LINE__, #value, value)
    
    #define round_up(x, multiple) (((x + multiple - 1) / multiple) * multiple)
    
    const long COUNT = 80 << 24;
    const int GPU_CACHE_LINE_SIZE_IN_BYTES = 32;
    const long SIZE_OF_DATA = sizeof(int) * COUNT;
    
    __global__ void clear_scratch_space_kernel(int * data, int blocks, int threads) {
    
        // BOZO: change the code to just error out if we're any of the border cases below
        const int idx = blockIdx.x * threads + threadIdx.x;
        long size = sizeof(int) * COUNT;
        long size_of_typical_chunk = round_up(size / (blocks * threads), GPU_CACHE_LINE_SIZE_IN_BYTES);
        // Due to truncation, the threads at the end won't have anything to do.  This is a little sloppy but costs us
        // hardly anything in performance, so we do the simpler thing.
    
        long this_threads_offset = idx * size_of_typical_chunk;
        if (this_threads_offset > SIZE_OF_DATA) {
            return;
        }
    
        long size_of_this_threads_chunk;
        if (this_threads_offset + size_of_typical_chunk >= SIZE_OF_DATA) {
            // We are the last thread, so we do a partial write
            size_of_this_threads_chunk = SIZE_OF_DATA - this_threads_offset;
        } else {
            size_of_this_threads_chunk = size_of_typical_chunk;
        }
        void * starting_address = reinterpret_cast<void *>(reinterpret_cast<char *>(data) + this_threads_offset);
        memset((void *) starting_address, 0, size_of_this_threads_chunk);
    }
    
    __global__ void clear_scratch_space_with_coalesced_writes_kernel(int * data, int blocks, int threads) {
        if (COUNT % (blocks * threads) != 0) {
            printf("Adjust the SIZE_OF_DATA so it's divisible by the number of (blocks * threads)\n");
        }
    
        const long count_of_ints_in_each_blocks_chunk = COUNT / blocks;
    
        int block = blockIdx.x;
        int thread = threadIdx.x;
    
        const long rounds_needed = count_of_ints_in_each_blocks_chunk / threads;
    
        const long this_blocks_starting_offset = block * count_of_ints_in_each_blocks_chunk;
    
        //printf("Clearing %li ints starting at offset %li\n", count_of_ints_in_each_blocks_chunk, this_blocks_starting_offset);
    
        int * this_threads_base_pointer = &data[this_blocks_starting_offset + thread];
        for (int round = 0; round < rounds_needed; ++round) {
            *this_threads_base_pointer = 0;
            this_threads_base_pointer += threads;
        }
    }
    
    void set_gpu_data_to_ones(int * data_on_gpu) {
        cudaMemset(data_on_gpu, 1, SIZE_OF_DATA);
        CUDA_CHECK_RETURN(cudaDeviceSynchronize());
    }
    
    void check_gpu_data_is_zeroes(int * data_on_gpu, char * data_on_cpu) {
        cudaMemcpy(data_on_cpu, data_on_gpu, SIZE_OF_DATA, cudaMemcpyDeviceToHost);
        for (long i = 0; i < SIZE_OF_DATA; ++i) {
            if (data_on_cpu[i] != 0) {
                printf("Failed to zero-out byte offset %i in the data\n", i);
            }
        }
    }
    
    int main(void)
    {
        const long count = COUNT;
        int * data_on_gpu;
        char * data_on_cpu = (char *) malloc(SIZE_OF_DATA);
        if (data_on_cpu == NULL) {
            printf("Failed to allocate data on cpu");
        }
    
        CUDA_CHECK_RETURN(cudaMalloc(&data_on_gpu, sizeof(int) * count));
    
        {
            Timer memset_timer("calling cudaMemset from CPU");
            memset_timer.start();
            CUDA_CHECK_RETURN(cudaMemset(data_on_gpu, 0, SIZE_OF_DATA));
            CUDA_CHECK_RETURN(cudaDeviceSynchronize());
            memset_timer.stop_and_report();
        }
    
        for (int threads = 1; threads <= 1024; threads *= 2) {
    
            set_gpu_data_to_ones(data_on_gpu);
    
            char buffer[200];
            sprintf(buffer, "calling kernel<80,%i> that uses memset", threads);
            Timer memset_timer(buffer);
            memset_timer.start();
            clear_scratch_space_kernel<<<80, threads>>>(data_on_gpu, 80, threads);
            CUDA_CHECK_RETURN(cudaDeviceSynchronize());
            memset_timer.stop_and_report();
    
            check_gpu_data_is_zeroes(data_on_gpu, data_on_cpu);
        }
    
        for (int threads = 1; threads <= 1024; threads *= 2) {
    
            set_gpu_data_to_ones(data_on_gpu);
    
            char buffer[200];
            sprintf(buffer, "calling kernel<80,%i> that performs coalesced writes", threads);
            Timer memset_timer(buffer);
            memset_timer.start();
            clear_scratch_space_with_coalesced_writes_kernel<<<80, threads>>>(data_on_gpu, 80, threads);
            CUDA_CHECK_RETURN(cudaDeviceSynchronize());
            memset_timer.stop_and_report();
    
            check_gpu_data_is_zeroes(data_on_gpu, data_on_cpu);
        }
    
        free(data_on_cpu);
    }
    
    /**
     * Check the return value of the CUDA runtime API call and exit
     * the application if the call has failed.
     */
    static void CheckCudaErrorAux (const char *file, unsigned line, const char *statement, cudaError_t err)
    {
        if (err == cudaSuccess)
            return;
        std::cerr << statement<<" returned " << cudaGetErrorString(err) << "("<<err<< ") at "<<file<<":"<<line << std::endl;
        exit (1);
    }
    

    计时器.h

    #include <string>
    #include <chrono>
    
    class Timer {
    public:
        Timer(std::string name_, bool allow_output = true);
        virtual ~Timer();
    
        void start();
        void start_or_restart();
        void stop(bool force_no_output = false);
        void report(const int count = 0, bool preface_with_spaces = true);
        void stop_and_report(const int count = 0);
        double duration_in_seconds();
        long duration_in_microseconds();
    
    private:
        std::string name;
        // even though we call report, we still might suppress output since the output is often a type of debugging info
        bool allow_output;
        std::chrono::time_point<std::chrono::system_clock> start_time;
        std::chrono::time_point<std::chrono::system_clock> end_time;
        bool started_before = false;
        bool currently_rolling = false; // if timer (i.e., the clock) is currently rolling
        double duration = -1.0;
    };
    

    定时器.cpp

    #include <stdexcept>
    
    #include "timer.h"
    
    Timer::Timer(std::string name_, bool allow_output_) {
        name = name_;
        allow_output = allow_output_;
    }
    
    Timer::~Timer() {
    }
    
    void Timer::start() {
    #ifdef DEBUG
        if(started_before) {
            printf("Attempting to start same timer twice.  Exiting.\n");
            throw std::runtime_error("Attempting to start timer that was previously started");
        }
    #endif
    
        if (allow_output) {
            printf("Starting timer for %s\n", name.c_str());
        }
        start_time = std::chrono::system_clock::now();
        currently_rolling = true;
        started_before = true;
        duration = 0.0;
    }
    
    void Timer::start_or_restart() {
        if (currently_rolling) {
            throw std::runtime_error("Can't start or restart a timer that's already rolling.");
        }
        if (!started_before && allow_output) {
            printf("Starting timer for %s\n", name.c_str());
        }
        started_before = true;
        start_time = std::chrono::system_clock::now();
        currently_rolling = true;
        if (duration < 0.0) {
            duration = 0.0;
        }
    }
    
    void Timer::stop(bool force_no_output) {
        if (!force_no_output) { // Slight style violation: I prefer nested if's over && statements with two && operators
            if (allow_output && duration <= 0.0) {
                printf("Stopping timer for %s\n", name.c_str());
            }
        }
        end_time = std::chrono::system_clock::now();
        std::chrono::duration<double> elapsed_seconds = end_time - start_time;
        currently_rolling = false;
        duration += elapsed_seconds.count();
    }
    
    
    void Timer::stop_and_report(const int count) {
        stop(true);
        report(count, false);
    }
    
    double Timer::duration_in_seconds() {
        return duration;
    }
    
    long Timer::duration_in_microseconds() {
        return static_cast<long>(duration * 1000000);
    }
    
    void Timer::report(const int count, bool preface_with_spaces) {
        std::string preface;
        if (preface_with_spaces) {
            preface = "         ";
        } else {
            preface = "Stopping ";
        }
        if (allow_output) {
            if (!started_before) {
                printf("%stimer for %s was never started\n", preface.c_str(), name.c_str());
            } else if (count > 0) {
                double average = (duration / static_cast<double>(count)) * 1000.0;
                printf("%stimer for %s took %fs, %.3lfus each\n", preface.c_str(), name.c_str(), duration, average * 1000.0);
            } else {
                printf("%stimer for %s took %fs\n", preface.c_str(), name.c_str(), duration);
            }
        }
    }