
Encapsulating vectorised functions for use with Pandas DataFrames

encapsulation vector numpy pandas python

MatBailie · Tech Community · 5 years ago

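I've been refactoring some code and using it to explore how to structure maintainable, flexible, concise code when working with Pandas and Numpy. (Normally I only use them lightly; in my current role I should be aiming to become an expert.)

One example I've come across is a function that is sometimes called on one column of values and sometimes on three columns of values. Vectorised code using Numpy encapsulates it perfectly. But using it gets a bit clunky.

How should I write the following function "better"?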

def project_unit_space_to_index_space(v, vertices_per_edge):
    return np.rint((v + 1) / 2 * (vertices_per_edge - 1)).astype(int)


input = np.concatenate(([df['x']], [df['y']], [df['z']]), axis=0)

index_space = project_unit_space_to_index_space(input, 42)

magic_space = some_other_transformation_code(index_space, foo, bar)

df['x_'], df['y_'], df['z_'] = magic_space

As written, the function accepts either one column of data or several columns, and still works correctly and quickly.
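
To make that claim concrete, here is a minimal sketch that calls the function from the question on one column and then on three. The sample DataFrame and the value 5 for `vertices_per_edge` are assumptions chosen for illustration, not data from the original post.

```python
import numpy as np
import pandas as pd


def project_unit_space_to_index_space(v, vertices_per_edge):
    # Same body as the function in the question.
    return np.rint((v + 1) / 2 * (vertices_per_edge - 1)).astype(int)


# Hypothetical sample data in the unit range [-1, 1].
df = pd.DataFrame({
    'x': [-1.0, 0.0, 1.0, 0.5],
    'y': [-0.5, 0.5, 1.0, -1.0],
    'z': [1.0, -1.0, 0.0, 0.5],
})

# One column: a 1-D array in, a 1-D array out.
one = project_unit_space_to_index_space(df['x'].to_numpy(), 5)

# Three columns: a (3, n) array in, a (3, n) array out.
three = project_unit_space_to_index_space(df[['x', 'y', 'z']].to_numpy().T, 5)

print(one.shape, three.shape)
```

The same expression broadcasts over either shape, which is why the function body needs no branching on the number of columns.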

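The return value has the right shape to be passed straight into another similarly structured function, which lets me chain functions neatly.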

Even assigning the results back to new columns in the DataFrame isn't "awful", though it is a bit clunky.

Packaging the inputs into a single np.ndarray, however, really is clunky.

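I haven't found any style guide that covers this. They are full of advice on iterrows, lambda expressions and so on, but I found nothing on best practice for encapsulating this kind of logic.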


Edit:

%timeit test = project_unit_sphere_to_unit_cube(df[['x','y','z']].unstack().to_numpy())
# 1.44 ms ± 57.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit test = project_unit_sphere_to_unit_cube(df[['x','y','z']].to_numpy().T)
# 558 µs ± 6.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit test = project_unit_sphere_to_unit_cube(df[['x','y','z']].transpose().to_numpy())
# 817 µs ± 18.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit test = project_unit_sphere_to_unit_cube(np.concatenate(([df['x']], [df['y']], [df['z']]), axis=0))
# 3.46 ms ± 42.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


hpaulj 5 years ago

In [101]: df = pd.DataFrame(np.arange(12).reshape(4,3))                         
In [102]: df                                                                    
Out[102]: 
   0   1   2
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11

You are building an (n, m) array from n columns of the DataFrame:

In [103]: np.concatenate([[df[0]],[df[1]],[df[2]]],0)                           
Out[103]: 
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])

In [104]: df.to_numpy().T                                                       
Out[104]: 
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])

The DataFrame has its own transpose:

In [109]: df.transpose().to_numpy()                                             
Out[109]: 
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])

The calculation also works on the DataFrame directly, returning a DataFrame with the same shape and index:

In [113]: np.rint((df+1)/2 *(42-1)).astype(int)                                 
Out[113]: 
     0    1    2
0   20   41   62
1   82  102  123
2  144  164  184
3  205  226  246

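Some numpy functions convert their inputs to numpy arrays and return an array. Others, by delegating the details to pandas methods, can work directly on the DataFrame and return a DataFrame.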
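
The difference between the two kinds of function can be seen side by side in a short sketch that reuses the 4x3 DataFrame from this answer. The variable names `as_frame` and `as_array` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3))

# A ufunc such as np.rint is handed back to pandas, so the result keeps
# the DataFrame's index and columns.
as_frame = np.rint((df + 1) / 2 * (42 - 1)).astype(int)

# np.concatenate converts its inputs to arrays, so the labels are lost.
as_array = np.concatenate([[df[0]], [df[1]], [df[2]]], axis=0)

print(type(as_frame).__name__, type(as_array).__name__)
```

If the labels matter downstream, staying with the DataFrame-preserving form avoids having to reattach the index afterwards.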

MatBailie 5 years ago

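@hpaulj showed me more of the available functionality and options, which helped me explore the problem further. That in turn helped me define my competing goals more clearly and start to prioritise them.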

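- Making use of the results
- 5% slower but nicer in every other respect is acceptable
- 100% slower is probably not acceptable
- The implementation should be as data-type agnostic as possible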

This has led me, for now, to the following:

import numpy as np


def scale_unit_cube_to_unit_sphere(*values):
    """
    Scales all the inputs (on a row basis for array_like types) such that when
    treated as n-dimensional vectors, their scale is always 1.

    (Divides the vector represented by each row of inputs by that row's
     root-of-sum-of-squares, so as to normalise to a unit magnitude.)

    Examples - Scalar Inputs
    --------

    >>> scale_unit_cube_to_unit_sphere(1, 1, 1)
    [0.5773502691896258, 0.5773502691896258, 0.5773502691896258]

    Examples - Array Like Inputs
    --------

    >>> x = [ 1, 2, 3]
    >>> y = [ 1, 4, 3]
    >>> z = [ 1,-3,-1]
    >>> scale_unit_cube_to_unit_sphere(x, y, z)
    [array([0.57735027, 0.37139068, 0.6882472 ]),
     array([0.57735027, 0.74278135, 0.6882472 ]),
     array([ 0.57735027, -0.55708601, -0.22941573])]

    >>> a = np.array([x, y, z])
    >>> scale_unit_cube_to_unit_sphere(*a)
    [array([0.57735027, 0.37139068, 0.6882472 ]),
     array([0.57735027, 0.74278135, 0.6882472 ]),
     array([ 0.57735027, -0.55708601, -0.22941573])]

    >>> t = (x, y, z)
    >>> scale_unit_cube_to_unit_sphere(*t)
    [array([0.57735027, 0.37139068, 0.6882472 ]),
     array([0.57735027, 0.74278135, 0.6882472 ]),
     array([ 0.57735027, -0.55708601, -0.22941573])]

    >>> df = pd.DataFrame(data={'x':x,'y':y,'z':z})
    >>> scale_unit_cube_to_unit_sphere(df['x'], df['y'], df['z'])
    [0    0.577350
     1    0.371391
     2    0.688247
     dtype: float64,
     0    0.577350
     1    0.742781
     2    0.688247
     dtype: float64,
     0    0.577350
     1   -0.557086
     2   -0.229416
     dtype: float64]

    For all array_like inputs, the results can then be utilised in similar
    ways, such as writing them to an existing DataFrame as follows:

    >>> transform = scale_unit_cube_to_unit_sphere(df['x'], df['y'], df['z'])
    >>> df['i'], df['j'], df['k'] = transform

    """
    # Scale the position in space to be a unit vector, as on the surface of a sphere
    ################################################################################

    scaler = np.sqrt(sum([np.multiply(v, v) for v in values]))
    return [np.divide(v, scaler) for v in values]

As per the docstring, this works for scalars, arrays, Series and so on, whether it is given one scalar, three scalars, n scalars, n arrays, etc.
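
As a quick sanity check of that behaviour, the sketch below repeats the function body so it runs on its own, then confirms that scalar inputs return scalars, Series inputs return Series, and every row ends up with magnitude 1. The names `sx`, `sy`, `sz` and `magnitude` are introduced here for illustration.

```python
import numpy as np
import pandas as pd


def scale_unit_cube_to_unit_sphere(*values):
    # Same body as the function above, repeated so this sketch runs alone.
    scaler = np.sqrt(sum([np.multiply(v, v) for v in values]))
    return [np.divide(v, scaler) for v in values]


# Scalars in, scalars out.
sx, sy, sz = scale_unit_cube_to_unit_sphere(1, 1, 1)

# Series in, Series out (index preserved).
df = pd.DataFrame({'x': [1, 2, 3], 'y': [1, 4, 3], 'z': [1, -3, -1]})
i, j, k = scale_unit_cube_to_unit_sphere(df['x'], df['y'], df['z'])

# Every row should now have magnitude 1.
magnitude = np.sqrt(i**2 + j**2 + k**2)
print(np.allclose(magnitude, 1.0))
```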

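They also work in a "chain", as in the example below (the implementations of the functions aren't relevant, only the pattern of chaining inputs to outputs):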

cube, ix = generate_index_cube(vertices_per_edge)

df = pd.DataFrame(
         data  = {
             'x': cube[0],
             'y': cube[1],
             'z': cube[2],
         },
         index = ix,
     )

unit = scale_index_to_unit(vertices_per_edge, *cube)

distortion = scale_unit_to_distortion(distortion_factor, *unit)

df['a'], df['b'], df['c'] = distortion

sphere = scale_unit_cube_to_unit_sphere(*distortion)

df['i'], df['j'], df['k'] = sphere

recovered_distortion = scale_unit_sphere_to_unit_cube(*sphere)

df['a_'], df['b_'], df['c_'] = recovered_distortion

recovered_cube = scale_unit_to_index(
                     vertices_per_edge,
                     *scale_distortion_to_unit(
                         distortion_factor,
                         *recovered_distortion,
                     ),
                 )

df['x_'], df['y_'], df['z_'] = recovered_cube

print(len(df[np.logical_not(np.isclose(df['a'], df['a_']))]))  # No Differences
print(len(df[np.logical_not(np.isclose(df['b'], df['b_']))]))  # No Differences
print(len(df[np.logical_not(np.isclose(df['c'], df['c_']))]))  # No Differences

print(len(df[np.logical_not(np.isclose(df['x'], df['x_']))]))  # No Differences
print(len(df[np.logical_not(np.isclose(df['y'], df['y_']))]))  # No Differences
print(len(df[np.logical_not(np.isclose(df['z'], df['z_']))]))  # No Differences
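
Since the helper functions in the chain are not shown, here is a reduced, self-contained sketch of the same pattern. The bodies of `scale_index_to_unit` and `scale_unit_to_index` below are hypothetical stand-ins, assumed to be a linear map onto [-1, 1] and its inverse; they are not the implementations used in this post.

```python
import numpy as np
import pandas as pd


def scale_index_to_unit(vertices_per_edge, *values):
    # Hypothetical stand-in: maps indices 0..n-1 onto the range [-1, 1].
    return [np.divide(v, vertices_per_edge - 1) * 2 - 1 for v in values]


def scale_unit_to_index(vertices_per_edge, *values):
    # Hypothetical inverse of the stand-in above.
    return [np.rint((v + 1) / 2 * (vertices_per_edge - 1)).astype(int)
            for v in values]


n = 5
df = pd.DataFrame({'x': [0, 1, 4], 'y': [2, 2, 0], 'z': [4, 3, 1]})

# Forward step: the returned list unpacks straight into new columns.
unit = scale_index_to_unit(n, df['x'], df['y'], df['z'])
df['u'], df['v'], df['w'] = unit

# Reverse step: the previous output is splatted into the next function.
df['x_'], df['y_'], df['z_'] = scale_unit_to_index(n, *unit)

# Count rows where the round trip did not recover the original value.
mismatches = sum(
    len(df[np.logical_not(np.isclose(df[c], df[c + '_']))]) for c in 'xyz'
)
print(mismatches)
```

Putting the fixed parameter first and the data last is what allows each result to be passed on with `*`, at the cost of every function returning a plain list rather than a labelled structure.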

Please do comment or critique.