Hmm, I don't know about best practices for latency, especially not on OpenCL. But I can offer a simple benchmark that measures the round-trip time.
I only did two things here:

- Use double buffering to keep the GPU busy
- Use pinned host memory and skip explicit copies
Nvidia GPUs can access pinned host memory directly. This does slow the kernel down and ties up compute resources while it waits for the data transfers, but it also avoids having to wait for, or synchronize with, copy operations. I have not compared it against a conventional copy setup with double buffering, though.
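For reference, such a conventional setup would stage the data through device memory with asynchronous copies on the same stream. Here is a minimal sketch of what one round trip of that variant could look like; the helper function and device buffers are my own naming, not part of the benchmark below:

#include <cuda_runtime.h>

__global__ void kernel(unsigned* out, const unsigned* in, int n);

/*
 * Hypothetical copy-based round trip: stage the data through device memory
 * with async copies instead of letting the kernel access pinned host memory.
 * host_in/host_out should still be pinned so the copies can run asynchronously.
 */
void copy_round_trip(unsigned* host_out, const unsigned* host_in,
                     unsigned* dev_out, unsigned* dev_in, int samples,
                     int gridsize, int blocksize,
                     cudaStream_t stream, cudaEvent_t done)
{
    cudaMemcpyAsync(dev_in, host_in, samples * sizeof(unsigned),
                    cudaMemcpyHostToDevice, stream);
    kernel<<<gridsize, blocksize, 0, stream>>>(dev_out, dev_in, samples);
    cudaMemcpyAsync(host_out, dev_out, samples * sizeof(unsigned),
                    cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(done, stream);
}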
On my hardware (an Nvidia T1200 laptop GPU running Linux), this setup does a round-trip transfer of 1024 samples in 15 us, consistently after the first one or two kernel calls. The same test on a desktop RTX 3090 gives a 25 us RTT.
Here is the code:
#include <cuda_runtime.h>
#include <algorithm>
// using std::fill_n
#include <cstdio>
// using std::printf
#include <chrono>
// using std::steady_clock
/**
 * Simple input = output kernel
 */
__global__ void kernel(unsigned* out, const unsigned* in, int n)
{
    const int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if(idx < n)
        out[idx] = in[idx];
}
/**
 * Creates a time stamp in microseconds
 *
 * No defined zero-time. Only useful for measuring relative time intervals
 */
unsigned current_time_us()
{
    using us = std::chrono::microseconds;
    return static_cast<unsigned>(std::chrono::duration_cast<us>(
            std::chrono::steady_clock::now().time_since_epoch()).count());
}
/** Fills the buffer with the current time stamp */
void fill_current_time(unsigned* buf, int n)
{
    std::fill_n(buf, n, current_time_us());
}
int main()
{
    int samples = 1024, repetitions = 100;
    int blocksize = 128;
    int gridsize = (samples + blocksize - 1) / blocksize;
    cudaStream_t stream;
    if(cudaStreamCreate(&stream))
        return 1;
    /*
     * We use pinned host memory that is directly accessible by the device and
     * the host for input and output transfer.
     * Two input and two output buffers for double-buffering
     */
    unsigned* transfer_bufs;
    if(cudaHostAlloc(&transfer_bufs, 4 * samples * sizeof(unsigned), 0))
        return 2;
    unsigned* input_bufs = transfer_bufs;
    unsigned* output_bufs = transfer_bufs + 2 * samples;
    /*
     * We use events for quick notification when a kernel is done without
     * having to synchronize the stream
     */
    cudaEvent_t output_avail[2];
    for(cudaEvent_t& event: output_avail)
        if(cudaEventCreate(&event))
            return 3;
    /*
     * Initial fill of the first double buffer
     */
    fill_current_time(input_bufs, samples);
    kernel<<<gridsize, blocksize, 0, stream>>>(
            output_bufs, input_bufs, samples);
    if(cudaEventRecord(output_avail[0], stream))
        return 4;
    for(int i = 1; i < repetitions; ++i) {
        int cur_buf = i & 1;
        int last_buf = cur_buf ^ 1;
        int cur_offset = samples * cur_buf;
        int last_offset = samples * last_buf;
        /*
         * Schedule the next computation
         */
        fill_current_time(input_bufs + cur_offset, samples);
        kernel<<<gridsize, blocksize, 0, stream>>>(
                output_bufs + cur_offset, input_bufs + cur_offset, samples);
        if(cudaEventRecord(output_avail[cur_buf], stream))
            return 5;
        /*
         * Wait for the previous computation
         */
        if(cudaEventSynchronize(output_avail[last_buf]))
            return 6;
        /*
         * Measure the time interval from filling the input buffer to
         * receiving it back in the output buffer
         */
        std::printf("RTT %u us\n", current_time_us() - output_bufs[last_offset]);
    }
    /*
     * Wait for the last computation. No need to check the results
     */
    if(cudaEventSynchronize(output_avail[(repetitions - 1) & 1]))
        return 7;
}
Output:
RTT 94 us
RTT 22 us
RTT 12 us
RTT 15 us
RTT 15 us
RTT 15 us
RTT 15 us
RTT 15 us
...
However, I should also note that keeping a full GPU busy with only 1024 samples sounds next to impossible. Even a single multiprocessor has more threads than that! So while the transfer latency is not an issue, actually making use of the compute resources will be, unless you increase the buffer size.
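To put numbers on that, you can query the device properties. The short check below is my own addition (not part of the benchmark) and just compares the 1024 samples against what one SM and the whole device can keep resident:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    if(cudaGetDeviceProperties(&prop, 0))
        return 1;
    // How many threads one multiprocessor and the whole device can keep resident
    std::printf("Threads per SM:   %d\n", prop.maxThreadsPerMultiProcessor);
    std::printf("SM count:         %d\n", prop.multiProcessorCount);
    std::printf("Resident threads: %d\n",
                prop.maxThreadsPerMultiProcessor * prop.multiProcessorCount);
    // 1024 samples = 1024 threads in the benchmark above: less than one SM's worth
}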
But I don't know, maybe you multiplex 32 input sources with 1024 samples each. Incidentally, in my test, increasing the transfer size 32-fold only raised the RTT to 60 us (240 us on the RTX 3090).
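If that is the case, the benchmark above only needs its sizes scaled up; a minimal sketch of the changed constants, with channels being my own hypothetical name for the number of input sources:

// Hypothetical scaling of the benchmark to 32 interleaved input sources.
// Only the buffer sizes change; the kernel, double buffering and timing stay the same.
int channels = 32;                              // assumed number of input sources
int samples_per_channel = 1024;
int samples = channels * samples_per_channel;   // 32768 values per transfer
int blocksize = 128;
int gridsize = (samples + blocksize - 1) / blocksize;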