
Convert a DataFrame to a rec array (and objects to strings)

numpy pandas arrays python

JohnE · 6 years ago

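I have a pandas DataFrame with a mix of data types (dtypes) that I want to convert to a numpy structured array (or record array, which is basically the same thing in this case). For a purely numeric DataFrame, to_records() is enough. I also need the object columns converted to strings rather than objects, so that I can use the numpy method tofile(), which writes numbers and strings to a binary file but will not write objects.

In short, I need to convert pandas columns with dtype=object to fields of a numpy structured array with a string or unicode dtype.

Here is an example. The code below would be sufficient if every column had a numeric (float or int) dtype.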

import pandas as pd
df=pd.DataFrame({'f_num': [1.,2.,3.], 'i_num':[1,2,3], 
                 'char': ['a','bb','ccc'], 'mixed':['a','bb',1]})

struct_arr=df.to_records(index=False)

print('struct_arr',struct_arr.dtype,'\n')

# struct_arr (numpy.record, [('f_num', '<f8'), ('i_num', '<i8'), 
#                            ('char', 'O'), ('mixed', 'O')])

However, because I want to end up with string dtypes, I need to add the following extra code:

lst=[]
for col in struct_arr.dtype.names:  # this was the only iterator I 
                                    # could find for the column labels
    dt=struct_arr[col].dtype

    if dt == 'O':   # this is 'O', meaning 'object'

        # it appears an explicit string length is required
        # so I calculate with pandas len & max methods
        dt = 'U' + str( df[col].astype(str).str.len().max() )
       
    lst.append((col,dt))

struct_arr = struct_arr.astype(lst)
        
print('struct_arr',struct_arr.dtype)

# struct_arr (numpy.record, [('f_num', '<f8'), ('i_num', '<i8'), 
#                            ('char', '<U3'), ('mixed', '<U2')])
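As a minimal sketch of why the string dtypes matter for tofile(), the snippet below tries to write the object-dtype record array, then writes the converted one and reads it back with np.fromfile. The target dtype is hard-coded from the output shown above; the temporary file name records.bin and the variable names are illustrative assumptions, not part of the original code.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({'f_num': [1., 2., 3.], 'i_num': [1, 2, 3],
                   'char': ['a', 'bb', 'ccc'], 'mixed': ['a', 'bb', 1]})

obj_rec = df.to_records(index=False)   # 'char' and 'mixed' are object fields

# Fixed-width dtype copied from the printed result above (assumed widths 3 and 2).
str_dtype = np.dtype([('f_num', '<f8'), ('i_num', '<i8'),
                      ('char', '<U3'), ('mixed', '<U2')])
str_rec = obj_rec.astype(str_dtype)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'records.bin')   # hypothetical file name

    # Object fields hold Python pointers, so numpy refuses a binary dump.
    try:
        obj_rec.tofile(path)
    except OSError as err:
        print('object dtype:', err)

    # Fixed-width fields are plain bytes and round-trip through the file.
    str_rec.tofile(path)
    back = np.fromfile(path, dtype=str_dtype)

print(back)
```

Reading the file back needs the same dtype that was used to write it, since tofile() stores raw bytes with no header.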

See also: How to change the dtype of certain columns of a numpy recarray?

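This seems to work, as the char and mixed dtypes are now <U3 and <U2 rather than 'O' or 'object'. I'm just checking whether there is a simpler or more elegant approach. But since pandas does not have a native string type the way numpy does, maybe there isn't?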

2 answers | up to 4 years ago

JohnE · 6 years ago

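Building on how to_records works (for speed), I came up with the following, which is cleaner code and also about 5x faster than my original code (tested by expanding the sample DataFrame above to 10,000 rows):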

import numpy as np

names = list(df.columns)
# get_values() was removed in pandas 1.0; to_numpy() is the replacement
arrays = [ df[col].to_numpy() for col in names ]

formats = [ array.dtype if array.dtype != 'O' 
            else f'{array.astype(str).dtype}' for array in arrays ] 

rec_array = np.rec.fromarrays( arrays, dtype={'names': names, 'formats': formats} )

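The code above outputs unicode rather than byte strings, which is probably better in general. In my case, though, I need byte strings because I am reading the binary file in Fortran, and strings seem to read in more easily. Hence, it may be better to replace the "formats" line above with this:

formats = [ array.dtype if array.dtype != 'O' 
            else array.astype(str).dtype.str.replace('<U','S') for array in arrays ]

For example, a dtype of <U4 becomes S4.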

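Put together as a self-contained sketch, the byte-string variant might look like the following. It assumes the sample DataFrame from the question and ASCII-only text. Instead of editing the dtype string, it derives the width from the unicode itemsize (numpy stores 4 bytes per unicode character), which is a swapped-in detail that avoids depending on the byte-order prefix.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'f_num': [1., 2., 3.], 'i_num': [1, 2, 3],
                   'char': ['a', 'bb', 'ccc'], 'mixed': ['a', 'bb', 1]})

names = list(df.columns)
arrays = [df[col].to_numpy() for col in names]

# Keep numeric dtypes; size object columns as fixed-width byte strings.
formats = [a.dtype if a.dtype != 'O'
           else f'S{a.astype(str).dtype.itemsize // 4}'
           for a in arrays]

rec_array = np.rec.fromarrays(arrays, dtype={'names': names, 'formats': formats})

print(rec_array.dtype)
print(rec_array['mixed'])
```

Each 'S' field occupies one byte per character, so a record here is a fixed 8 + 8 + 3 + 2 bytes, which is the kind of layout a fixed-length record reader on the Fortran side would need to match.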
jpp · 6 years ago

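As far as I know, there is no native functionality for this. For example, the maximum length of all values in a series is not stored anywhere.

However, you can implement your logic more efficiently with a list comprehension and f-strings: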

# assumption: arr is the record array from the question
arr = df.to_records(index=False)
data_types = [(col, arr[col].dtype if arr[col].dtype != 'O' else
               f'U{df[col].astype(str).str.len().max()}') for col in arr.dtype.names]
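One possible way to use the same width calculation end to end, assuming a pandas version (0.24 or later) in which to_records accepts the column_dtypes argument, is to pass the computed widths straight to to_records. This is only a sketch on the question's sample data; the dictionary name column_dtypes simply mirrors the parameter name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'f_num': [1., 2., 3.], 'i_num': [1, 2, 3],
                   'char': ['a', 'bb', 'ccc'], 'mixed': ['a', 'bb', 1]})

# For every column that numpy would see as object, measure the longest
# string form and request a unicode field of that width.
column_dtypes = {
    col: f'U{int(df[col].astype(str).str.len().max())}'
    for col in df.columns
    if df[col].to_numpy().dtype == object
}

# Columns not listed in the mapping keep their own dtype.
rec = df.to_records(index=False, column_dtypes=column_dtypes)

print(column_dtypes)
print(rec.dtype)
```

The maximum length still has to be computed by scanning each column, so this does not remove that cost; it only avoids the separate astype step on the record array.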