
Encapsulating vectorised functions for use with Pandas DataFrames

  • MatBailie  · 5 years ago

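    I've been refactoring some code, and using it to explore how to structure maintainable, flexible, concise code when using Pandas and NumPy. (Usually I only use them briefly; in my current role I should be aiming to become an expert.)

    One example I've come across is a function that is sometimes called on one column of values and sometimes on three columns of values. Vectorised code using NumPy encapsulates it wonderfully, but using it becomes a bit clunky.

    How should I write the following function "better"?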

    def project_unit_space_to_index_space(v, vertices_per_edge):
        return np.rint((v + 1) / 2 * (vertices_per_edge - 1)).astype(int)
    
    
    # Named unit_space rather than input, to avoid shadowing the built-in input()
    unit_space = np.concatenate(([df['x']], [df['y']], [df['z']]), axis=0)
    
    index_space = project_unit_space_to_index_space(unit_space, 42)
    
    magic_space = some_other_transformation_code(index_space, foo, bar)
    
    df['x_'], df['y_'], df['z_'] = magic_space
    

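    As written, the function accepts either one column of data or many columns of data, and still works correctly and quickly.

    The return type is the right shape to pass directly into another similarly structured function, which lets me chain functions neatly.

    Even assigning the results back to new columns in the DataFrame isn't "awful", though it is a bit clunky.

    Packaging the inputs into a single np.ndarray, however, really is clunky.


    I haven't found any style guides on this. They are full of advice on iterrows, lambda expressions and so on, but I found nothing on best practice for encapsulating this kind of logic.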



    Edit:

    %timeit test = project_unit_sphere_to_unit_cube(df[['x','y','z']].unstack().to_numpy())                      
    # 1.44 ms ± 57.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    %timeit test = project_unit_sphere_to_unit_cube(df[['x','y','z']].to_numpy().T)                              
    # 558 µs ± 6.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    %timeit test = project_unit_sphere_to_unit_cube(df[['x','y','z']].transpose().to_numpy())                    
    # 817 µs ± 18.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    %timeit test = project_unit_sphere_to_unit_cube(np.concatenate(([df['x']], [df['y']], [df['z']]), axis=0))   
    # 3.46 ms ± 42.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
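
    The four ways of packaging the columns timed above do not all produce the same shape, which matters when the result is unpacked into three columns later. A minimal sketch, assuming a small made-up DataFrame with `x`, `y` and `z` columns (the data is illustrative and not the DataFrame used for the timings):

```python
import numpy as np
import pandas as pd

# Illustrative data only; the DataFrame used for the timings is not shown in the post.
df = pd.DataFrame({
    'x': [0.0, 0.1, 0.2, 0.3],
    'y': [1.0, 1.1, 1.2, 1.3],
    'z': [2.0, 2.1, 2.2, 2.3],
})

via_unstack = df[['x', 'y', 'z']].unstack().to_numpy()
via_numpy_T = df[['x', 'y', 'z']].to_numpy().T
via_transpose = df[['x', 'y', 'z']].transpose().to_numpy()
via_concatenate = np.concatenate(([df['x']], [df['y']], [df['z']]), axis=0)

# unstack() on a frame with a plain index returns a Series, so the array is
# 1-D: all x values, then all y values, then all z values.
print(via_unstack.shape)      # (12,)

# The other three give one row per column of the DataFrame.
print(via_numpy_T.shape)      # (3, 4)
print(via_transpose.shape)    # (3, 4)
print(via_concatenate.shape)  # (3, 4)
```

    For a purely element-wise function the flattened form gives the same numbers, but it can no longer be unpacked directly into three result columns.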
    
  •   hpaulj    5 years ago
    In [101]: df = pd.DataFrame(np.arange(12).reshape(4,3))                         
    In [102]: df                                                                    
    Out[102]: 
       0   1   2
    0  0   1   2
    1  3   4   5
    2  6   7   8
    3  9  10  11
    

    You are producing an (n, m) array from n columns of the DataFrame:

    In [103]: np.concatenate([[df[0]],[df[1]],[df[2]]],0)                           
    Out[103]: 
    array([[ 0,  3,  6,  9],
           [ 1,  4,  7, 10],
           [ 2,  5,  8, 11]])
    

    In [104]: df.to_numpy().T                                                       
    Out[104]: 
    array([[ 0,  3,  6,  9],
           [ 1,  4,  7, 10],
           [ 2,  5,  8, 11]])
    

    The DataFrame has its own transpose:

    In [109]: df.transpose().to_numpy()                                             
    Out[109]: 
    array([[ 0,  3,  6,  9],
           [ 1,  4,  7, 10],
           [ 2,  5,  8, 11]])
    

    The calculation also works on the DataFrame itself, returning a DataFrame with the same shape and index:

    In [113]: np.rint((df+1)/2 *(42-1)).astype(int)                                 
    Out[113]: 
         0    1    2
    0   20   41   62
    1   82  102  123
    2  144  164  184
    3  205  226  246
    

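    Some numpy functions convert their inputs to numpy arrays and return an array. Others, by delegating the details to pandas methods, can work directly on a DataFrame and return a DataFrame.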
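
    To make that concrete for the function in the question, here is a minimal sketch that passes a sub-frame straight in and joins the result back. The sample values and the use of `add_suffix` and `join` are assumptions chosen for illustration, not something taken from the original code:

```python
import numpy as np
import pandas as pd

def project_unit_space_to_index_space(v, vertices_per_edge):
    # Same formula as in the question; v is expected to lie in [-1, 1].
    return np.rint((v + 1) / 2 * (vertices_per_edge - 1)).astype(int)

# Illustrative data only.
df = pd.DataFrame({'x': [-1.0, 0.5], 'y': [1.0, -1.0], 'z': [0.5, 1.0]})

# The arithmetic and np.rint are applied element-wise by pandas, so the
# result is a DataFrame with the same index and column labels as the input.
index_space = project_unit_space_to_index_space(df[['x', 'y', 'z']], 42)

# add_suffix renames the columns to x_, y_, z_ so they can be joined back.
df = df.join(index_space.add_suffix('_'))
print(df)
```

    This avoids building the (n, m) array by hand, at the cost of tying the call site to pandas.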

  •   MatBailie    5 years ago

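    @hpaulj made me aware of more functionality and more options, which helped me explore this problem further. That in turn helped me define my competing goals more clearly and start assigning priorities to them.

      • Utilising the results
      • 5% slower but nicer in every other respect is acceptable
      • 100% slower is probably not acceptable
    1. The implementation should be as data-type agnostic as possible

    This has led me to the following, for now: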

    def scale_unit_cube_to_unit_sphere(*values):
        """
        Scales all the inputs (on a row basis for array_like types) such that when
        treated as n-dimensional vectors, their scale is always 1.
    
        (Divides the vector represented by each row of inputs by that row's
         root-of-sum-of-squares, so as to normalise to a unit magnitude.)
    
        Examples - Scalar Inputs
        --------
    
        >>> scale_unit_cube_to_unit_sphere(1, 1, 1)
        [0.5773502691896258, 0.5773502691896258, 0.5773502691896258]
    
        Examples - Array Like Inputs
        --------
    
        >>> x = [ 1, 2, 3]
        >>> y = [ 1, 4, 3]
        >>> z = [ 1,-3,-1]
        >>> scale_unit_cube_to_unit_sphere(x, y, z)
        [array([0.57735027, 0.37139068, 0.6882472 ]),
         array([0.57735027, 0.74278135, 0.6882472 ]),
         array([ 0.57735027, -0.55708601, -0.22941573])]
    
        >>> a = np.array([x, y, z])
        >>> scale_unit_cube_to_unit_sphere(*a)
        [array([0.57735027, 0.37139068, 0.6882472 ]),
         array([0.57735027, 0.74278135, 0.6882472 ]),
         array([ 0.57735027, -0.55708601, -0.22941573])]
    
        >>> t = (x, y, z)
        >>> scale_unit_cube_to_unit_sphere(*t)
        [array([0.57735027, 0.37139068, 0.6882472 ]),
         array([0.57735027, 0.74278135, 0.6882472 ]),
         array([ 0.57735027, -0.55708601, -0.22941573])]
    
        >>> df = pd.DataFrame(data={'x':x,'y':y,'z':z})
        >>> scale_unit_cube_to_unit_sphere(df['x'], df['y'], df['z'])
        [0    0.577350
         1    0.371391
         2    0.688247
         dtype: float64,
         0    0.577350
         1    0.742781
         2    0.688247
         dtype: float64,
         0    0.577350
         1   -0.557086
         2   -0.229416
         dtype: float64]
    
        For all array_like inputs, the results can then be utilised in similar
        ways, such as writing them to an existing DataFrame as follows:
    
        >>> transform = scale_unit_cube_to_unit_sphere(df['x'], df['y'], df['z'])
        >>> df['i'], df['j'], df['k'] = transform
    
        """
        # Scale the position in space to be a unit vector, as on the surface of a sphere
        ################################################################################
    
        scaler = np.sqrt(sum([np.multiply(v, v) for v in values]))
        return [np.divide(v, scaler) for v in values]
    

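    As the docstring shows, this works with scalars, arrays, Series and so on, regardless of whether it is given one scalar, three scalars, n scalars, n arrays, etc.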
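
    As a quick check of the "any number of inputs" claim, the sketch below calls the same function body with two and with four components and confirms the results have unit magnitude. The input values are made up for illustration:

```python
import numpy as np

def scale_unit_cube_to_unit_sphere(*values):
    # Same body as the function above.
    scaler = np.sqrt(sum([np.multiply(v, v) for v in values]))
    return [np.divide(v, scaler) for v in values]

# Two array_like components: each position is treated as a 2-D vector.
i, j = scale_unit_cube_to_unit_sphere([3, 0, 1], [4, 2, 1])

# Four scalar components: one 4-D vector.
quad = scale_unit_cube_to_unit_sphere(1, 1, 1, 1)

# Every scaled vector should have magnitude 1 (up to floating-point rounding).
magnitudes = np.sqrt(i**2 + j**2)
print(magnitudes)
print(quad)
```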

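    They also work in "chains", as in the example below (the implementation of the functions is not relevant, only the pattern of chaining outputs into inputs)...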

    cube, ix = generate_index_cube(vertices_per_edge)
    
    df = pd.DataFrame(
             data  = {
                 'x': cube[0],
                 'y': cube[1],
                 'z': cube[2],
             },
             index = ix,
         )
    
    unit = scale_index_to_unit(vertices_per_edge, *cube)
    
    distortion = scale_unit_to_distortion(distortion_factor, *unit)
    
    df['a'], df['b'], df['c'] = distortion
    
    sphere = scale_unit_cube_to_unit_sphere(*distortion)
    
    df['i'], df['j'], df['k'] = sphere
    
    recovered_distortion = scale_unit_sphere_to_unit_cube(*sphere)
    
    df['a_'], df['b_'], df['c_'] = recovered_distortion
    
    recovered_cube = scale_unit_to_index(
                         vertices_per_edge,
                         *scale_distortion_to_unit(
                             distortion_factor,
                             *recovered_distortion,
                         ),
                     )
    
    df['x_'], df['y_'], df['z_'] = recovered_cube
    
    print(len(df[np.logical_not(np.isclose(df['a'], df['a_']))]))  # No Differences
    print(len(df[np.logical_not(np.isclose(df['b'], df['b_']))]))  # No Differences
    print(len(df[np.logical_not(np.isclose(df['c'], df['c_']))]))  # No Differences
    
    print(len(df[np.logical_not(np.isclose(df['x'], df['x_']))]))  # No Differences
    print(len(df[np.logical_not(np.isclose(df['y'], df['y_']))]))  # No Differences
    print(len(df[np.logical_not(np.isclose(df['z'], df['z_']))]))  # No Differences
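
    The repeated "assign three columns" and "count the differences" steps in the chain above could be wrapped in small helpers. The following is only a sketch: `assign_columns` and `count_mismatches` are hypothetical names introduced here, not part of the original code, and the sample data is made up.

```python
import numpy as np
import pandas as pd

def scale_unit_cube_to_unit_sphere(*values):
    # Same body as the function above.
    scaler = np.sqrt(sum([np.multiply(v, v) for v in values]))
    return [np.divide(v, scaler) for v in values]

def assign_columns(df, out_cols, results):
    """Hypothetical helper: return a copy of df with one new column per result."""
    return df.assign(**dict(zip(out_cols, results)))

def count_mismatches(df, col_a, col_b):
    """Hypothetical helper: number of rows where the two columns are not close."""
    return int((~np.isclose(df[col_a], df[col_b])).sum())

# Illustrative data only (the same values as in the docstring example).
df = pd.DataFrame({'x': [1, 2, 3], 'y': [1, 4, 3], 'z': [1, -3, -1]})

sphere = scale_unit_cube_to_unit_sphere(df['x'], df['y'], df['z'])
df = assign_columns(df, ['i', 'j', 'k'], sphere)

print(count_mismatches(df, 'i', 'i'))  # 0: a column always matches itself
print(count_mismatches(df, 'i', 'j'))  # 1: only the middle row differs
```

    Whether this reads better than the explicit tuple assignment is a matter of taste; it mainly keeps the column names and the comparison logic in one place each.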
    

    Please do critique or comment.