
numba CUDA gives wrong results with += (GPU reduction needed?)

    ZHANG Juenjie · 6 years ago

    The code just sums all the values into a single result, but numba CUDA gives me a different result from numpy.

    numba code:

    import math
    def numba_example(number_of_maximum_loop, gs, ts, bs):
        from numba import cuda
        result = cuda.device_array([3,])

        @cuda.jit(device=True)
        def BesselJ0(x):
            return math.sqrt(2/math.pi/x)

        @cuda.jit
        def cuda_kernel(number_of_maximum_loop, result, gs, ts, bs):
            i = cuda.grid(1)
            if i < number_of_maximum_loop:
                result[0] += BesselJ0(i/100+gs)
                result[1] += BesselJ0(i/100+ts)
                result[2] += BesselJ0(i/100+bs)

        # Configure the blocks
        threadsperblock = 128
        blockspergrid = (number_of_maximum_loop + (threadsperblock - 1)) // threadsperblock

        # Start the kernel
        cuda_kernel[blockspergrid, threadsperblock](number_of_maximum_loop, result, gs, ts, bs)

        return result.copy_to_host()

    numba_example(1000, 20, 20, 20)
    

    Output:

    array([ 0.17770302,  0.34166728,  0.35132036])
    

    numpy code:

    import math
    def numpy_example(number_of_maximum_loop,gs,ts,bs):
        import numpy as np
        result = np.zeros([3,])
    
        def BesselJ0(x):
            return math.sqrt(2/math.pi/x)
    
        for i in range(number_of_maximum_loop):
            result[0] += BesselJ0(i/100+gs)
            result[1] += BesselJ0(i/100+ts)
            result[2] += BesselJ0(i/100+bs)
    
        return result
    
    numpy_example(1000,20,20,20) 
    

    array([ 160.40546935,  160.40546935,  160.40546935])
    

    I don't know where I went wrong. I think I may need a reduction, but it seems impossible to do it with a single CUDA kernel.

    1 Answer

    Robert Crovella · 6 years ago

    Yes, to sum data from multiple GPU threads into a single variable, a proper parallel reduction is needed. With a plain +=, many threads perform an unsynchronized read-modify-write on the same location, so updates from different threads are lost.

    Here is a simple example of how to do this from a single kernel, using atomics:

    $ cat t23.py
    import math
    def numba_example(number_of_maximum_loop, gs, ts, bs):
        from numba import cuda

        @cuda.jit(device=True)
        def BesselJ0(x):
            return math.sqrt(2/math.pi/x)

        @cuda.jit
        def cuda_kernel(number_of_maximum_loop, result, gs, ts, bs):
            i = cuda.grid(1)
            if i < number_of_maximum_loop:
                # atomic adds serialize colliding updates, so no sums are lost
                cuda.atomic.add(result, 0, BesselJ0(i/100+gs))
                cuda.atomic.add(result, 1, BesselJ0(i/100+ts))
                cuda.atomic.add(result, 2, BesselJ0(i/100+bs))

        # Configure the blocks
        threadsperblock = 128
        blockspergrid = (number_of_maximum_loop + (threadsperblock - 1)) // threadsperblock

        # Start the kernel on a zero-initialized result array
        init = [0.0, 0.0, 0.0]
        result = cuda.to_device(init)
        cuda_kernel[blockspergrid, threadsperblock](number_of_maximum_loop, result, gs, ts, bs)

        return result.copy_to_host()

    print(numba_example(1000, 20, 20, 20))
    $ python t23.py
    [ 162.04299487  162.04299487  162.04299487]
    $
    

    You could also do a proper reduction directly in numba with the reduce decorator, as described here, although I'm not sure that 3 reductions can be done in a single kernel that way.
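    A minimal sketch of that route, assuming a map-then-reduce split (the names map_kernel and reduce_example are mine, not from the linked docs, and each of the three sums gets its own reduction call rather than a single combined kernel):

    import math
    import numpy as np
    from numba import cuda

    @cuda.reduce
    def sum_reduce(a, b):
        # binary op that cuda.reduce turns into a GPU sum reduction
        return a + b

    @cuda.jit(device=True)
    def BesselJ0(x):
        return math.sqrt(2/math.pi/x)

    @cuda.jit
    def map_kernel(n, out, offset):
        # write the i-th term of one series into out[i]
        i = cuda.grid(1)
        if i < n:
            out[i] = BesselJ0(i/100 + offset)

    def reduce_example(n, gs, ts, bs):
        threadsperblock = 128
        blockspergrid = (n + threadsperblock - 1) // threadsperblock
        result = np.zeros(3)
        for k, offset in enumerate((gs, ts, bs)):
            vals = cuda.device_array(n)  # one term per loop index
            map_kernel[blockspergrid, threadsperblock](n, vals, offset)
            result[k] = sum_reduce(vals)  # library sum reduction on the GPU
        return result

    print(reduce_example(1000, 20, 20, 20))  # expect ~162.043 in every slot

    This trades the single-kernel property for simplicity: one small map kernel plus three library reductions.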

    Finally, you could write an ordinary CUDA parallel reduction in numba CUDA, as shown here. I don't think it should be difficult to extend it to perform 3 reductions in a single kernel.
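    As a rough illustration of that last approach (my sketch, not the code behind the link; reduce3_kernel and TPB are made-up names, and TPB must be a power of two), each block tree-reduces its tile of all three series in shared memory, and then one thread per block does a single atomic add per sum:

    import math
    import numpy as np
    from numba import cuda, float64

    TPB = 128  # threads per block; must match the launch and be a power of 2

    @cuda.jit(device=True)
    def BesselJ0(x):
        return math.sqrt(2/math.pi/x)

    @cuda.jit
    def reduce3_kernel(n, result, gs, ts, bs):
        sm = cuda.shared.array(shape=(3, TPB), dtype=float64)
        tx = cuda.threadIdx.x
        i = cuda.grid(1)
        # each thread loads one term of each series (0.0 past the end)
        if i < n:
            sm[0, tx] = BesselJ0(i/100 + gs)
            sm[1, tx] = BesselJ0(i/100 + ts)
            sm[2, tx] = BesselJ0(i/100 + bs)
        else:
            sm[0, tx] = 0.0
            sm[1, tx] = 0.0
            sm[2, tx] = 0.0
        cuda.syncthreads()
        # classic shared-memory tree reduction within the block
        s = TPB // 2
        while s > 0:
            if tx < s:
                sm[0, tx] += sm[0, tx + s]
                sm[1, tx] += sm[1, tx + s]
                sm[2, tx] += sm[2, tx + s]
            cuda.syncthreads()
            s //= 2
        # one atomic add per block and per sum, instead of one per thread
        if tx == 0:
            cuda.atomic.add(result, 0, sm[0, 0])
            cuda.atomic.add(result, 1, sm[1, 0])
            cuda.atomic.add(result, 2, sm[2, 0])

    def reduce3(n, gs, ts, bs):
        result = cuda.to_device(np.zeros(3))
        blockspergrid = (n + TPB - 1) // TPB
        reduce3_kernel[blockspergrid, TPB](n, result, gs, ts, bs)
        return result.copy_to_host()

    print(reduce3(1000, 20, 20, 20))  # expect ~162.043 in every slot

    For reference, the plain numpy loop below produces the same expected values: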

    $ cat t24.py
    import math
    def numpy_example(number_of_maximum_loop,gs,ts,bs):
        import numpy as np
        result = np.zeros([3,])
    
        def BesselJ0(x):
            return math.sqrt(2/math.pi/x)
    
        for i in range(number_of_maximum_loop):
            result[0] += BesselJ0(i/100+gs)
            result[1] += BesselJ0(i/100+ts)
            result[2] += BesselJ0(i/100+bs)
    
        return result
    
    print(numpy_example(1000,20,20,20))
    $ python t24.py
    [ 162.04299487  162.04299487  162.04299487]
    $
    