
In pandas, how to parse a column containing multiple attribute names and values into new columns

data-science dataframe pandas python


M Hossain · 7 years ago

I have a DataFrame with many columns. One of them, SourceTechAttributes, holds useful attribute names and attribute values, for example:

    df['SourceTechAttributes'][0]
    'DropFrame: True, Duration: 4874.1359333333333333333333333, FieldDominance: Upper Field First, FrameRate: 29.97, Height: 1080, MediaFormat: 912, NumberOfAudioChannels: 8, NumberOfAudioTracks: 8, ScanType: Interlaced, StartSmpte: 00:59:59;26, ViewportDisplayFormat: Anamorphic, Width: 1920'
0    DropFrame: True, Duration: 4874.13593333333333...
1    ActionType: CG, DropFrame: True, Duration: 129...
2    DropFrame: True, Duration: 4874.13593333333333...
3    DropFrame: True, Duration: 4874.13593333333333...
4    ActionType: CG, DropFrame: True, Duration: 129...
5    ActionType: CG, DropFrame: True, Duration: 129...
Name: SourceTechAttributes, dtype: object

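The keys and values in this column also change position from row to row. I want to parse the column and create seven new columns from its attributes.

I can do it one attribute at a time in pandas, for example: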

df['m']=df['SourceTechAttributes'][0].split(',')[0]

This gives the part before the first comma, for example:

df['m']
0        DropFrame: True
1        DropFrame: True
2        DropFrame: True
3        DropFrame: True

Then I split again on the colon, take the last part, and store it in a column named df['DropFrame']:

df['DropFrame']=df['m'][0].split(':')[1]
df['DropFrame']

0         True
1         True
2         True
3         True

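But this approach is wrong: it does not always give me what I want, because some rows have many attributes and values and others have only a few. Can anyone help me write a function that handles all of this so I can reach my goal? Thanks in advance.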

2 replies | most recent 7 years ago

DYZ 7 years ago

First, you need a function that takes a string, splits it on commas and colons, and converts it to a pandas Series via a dictionary:

import pandas as pd

def str2series(s):
    # Split on commas, then split each piece on its first ': ' only
    pieces = [x.split(': ', 1) for x in s.split(',')]
    return pd.Series({k.strip(): v.strip() for k, v in pieces})

Next, apply the function to the column:

new_df = df.SourceTechAttributes.apply(str2series)

The result is the DataFrame you are looking for. If needed, you can join it with the original DataFrame, since both share the same index:

df = df.join(new_df)
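
To make the steps above concrete, here is a minimal end-to-end sketch. The two-row sample is hypothetical, shortened from the strings in the question, and the `pd.to_numeric` conversion at the end is an addition that assumes you want numeric columns; the parsed values are otherwise all strings.

```python
import pandas as pd

def str2series(s):
    # Split on commas, then split each piece on its first ': ' only,
    # so values that contain colons (e.g. StartSmpte) stay in one piece
    pieces = [x.split(': ', 1) for x in s.split(',')]
    return pd.Series({k.strip(): v.strip() for k, v in pieces})

# Hypothetical sample rows, shortened from the question's data
df = pd.DataFrame({'SourceTechAttributes': [
    'DropFrame: True, FrameRate: 29.97, StartSmpte: 00:59:59;26, Width: 1920',
    'ActionType: CG, Width: 1280, DropFrame: False',
]})

# One Series per row; pandas aligns them by key into a DataFrame
new_df = df['SourceTechAttributes'].apply(str2series)
df = df.join(new_df)

# Parsed values are strings; convert numeric columns explicitly (assumption)
df['Width'] = pd.to_numeric(df['Width'])

print(df[['DropFrame', 'Width', 'ActionType']])
```

Attributes that a row does not have (here `ActionType` in the first row) come out as NaN, and the order of the keys inside each string does not matter because the columns are aligned by name.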

gyoza 7 years ago

Here are three steps:

# 1. create a list of "key: value" strings in each row
df['SourceTechAttributes'] = (df['SourceTechAttributes']
                              .apply(lambda x: str(x).split(",")))

# 2. create a dictionary in each row, splitting each pair on its first ":"
#    only, so values with colons or spaces (e.g. StartSmpte) stay intact
df['SourceTechAttributes'] = (df['SourceTechAttributes']
                              .apply(lambda x: dict([s.strip() for s in p.split(":", 1)]
                                                    for p in x if ":" in p)))

# 3. create new columns (.get returns None when a row lacks the attribute)
df['srcMediaFormat'] = (df['SourceTechAttributes']
                        .apply(lambda x: x.get('MediaFormat')))

I only created one new column, srcMediaFormat, as an example.
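
If you want every attribute as its own column rather than one at a time, the dictionary-per-row idea can be expanded in one go. The following is only a sketch under some assumptions: `parse_attributes` is a hypothetical helper name, the sample rows (including the value 913 and the missing row) are made up for illustration, and the `src` prefix simply mirrors the naming used above.

```python
import pandas as pd

def parse_attributes(text):
    """Hypothetical helper: turn 'k1: v1, k2: v2' into a dict."""
    result = {}
    for piece in str(text).split(','):
        # partition splits on the first colon only
        key, sep, value = piece.partition(':')
        if sep:  # skip pieces without a colon, e.g. text from missing rows
            result[key.strip()] = value.strip()
    return result

# Made-up sample rows; the last one simulates a missing value
df = pd.DataFrame({'SourceTechAttributes': [
    'MediaFormat: 912, FieldDominance: Upper Field First, Height: 1080',
    'ActionType: CG, MediaFormat: 913',
    None,
]})

# One dict per row, then one column per key, aligned on the original index
dicts = df['SourceTechAttributes'].apply(parse_attributes)
expanded = pd.DataFrame(dicts.tolist(), index=df.index).add_prefix('src')
df = df.join(expanded)

print(df.filter(like='src'))
```

Unlike the three steps above, this variant leaves the original SourceTechAttributes column untouched, which makes it easier to re-run the parsing if the format turns out to have more edge cases.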