
Joining dataframes based on a partial string match between columns

partial string-matching join pandas python

Sai Kumar · Tech Community · 7 years ago

I have a dataframe, and I want to check whether its movies also exist in another df.

after_h.sample(10, random_state=1)

             movie           year   ratings
108 Mechanic: Resurrection   2016     4.0
206 Warcraft                 2016     4.0
106 Max Steel                2016     3.5
107 Me Before You            2016     4.5

The other df:

              FILM                   Votes
0   Avengers: Age of Ultron (2015)   4170
1   Cinderella (2015)                 950
2   Ant-Man (2015)                   3000 
3   Do You Believe? (2015)            350
4   Max Steel (2016)                  560

Expected output:

    FILM              votes
0  Max Steel           560

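2 replies | as of 7 years ago

jpp 7 years ago

Given input dataframes df1 and df2, you can use Boolean indexing via pd.Series.isin. To align the format of the movie strings, first concatenate movie and year from df1: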

s = df1['movie'] + ' (' + df1['year'].astype(str) + ')'

res = df2[df2['FILM'].isin(s)]

print(res)

               FILM  Votes
4  Max Steel (2016)    560
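The isin approach above can be carried one step further to produce the layout shown in the question, where FILM holds the bare title. The following is a minimal sketch, assuming the sample data from the question; the names keys and res, and the regex used to strip the year, are illustrative choices and not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame(
    {
        "movie": ["Mechanic: Resurrection", "Warcraft", "Max Steel", "Me Before You"],
        "year": [2016, 2016, 2016, 2016],
        "ratings": [4.0, 4.0, 3.5, 4.5],
    },
    index=[108, 206, 106, 107],
)
df2 = pd.DataFrame(
    {
        "FILM": [
            "Avengers: Age of Ultron (2015)",
            "Cinderella (2015)",
            "Ant-Man (2015)",
            "Do You Believe? (2015)",
            "Max Steel (2016)",
        ],
        "Votes": [4170, 950, 3000, 350, 560],
    }
)

# Build "Title (Year)" keys from df1 so they line up with df2['FILM'].
keys = df1["movie"] + " (" + df1["year"].astype(str) + ")"

# Keep only the df2 rows whose FILM value is one of those keys.
res = df2[df2["FILM"].isin(keys)].copy()

# Strip the trailing " (YYYY)" so FILM shows the bare title (illustrative regex).
res["FILM"] = res["FILM"].str.replace(r"\s*\(\d{4}\)$", "", regex=True)

print(res)
```

Note that isin only finds exact matches of the concatenated string, so a title that differs in spacing or punctuation between the two frames would be missed.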

Gopal Chitalia 4 years ago

To get the rows with a partial match, use either FILM.str.startswith(title) or FILM.str.contains(title):

df1[df1['movie'].apply(lambda title: df2['FILM'].str.startswith(title)).any(axis=1)]

df1[df1['movie'].apply(lambda title: df2['FILM'].str.contains(title, regex=False)).any(axis=1)]

     movie      year      ratings
106  Max Steel  2016      3.5
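One caveat with str.contains is that it treats the title as a regular expression by default, so titles containing characters such as ? or ( can match more loosely than intended. The sketch below illustrates this; the second film title is a hypothetical value invented for the demonstration and does not appear in the question's data.

```python
import re

import pandas as pd

# The second entry is hypothetical, added only to show the false positive.
films = pd.Series(["Do You Believe? (2015)", "Do You Believ It (2015)"])
title = "Do You Believe?"

# As a regex, the trailing "e?" means "an optional e", so both rows match.
loose = films.str.contains(title)

# Escaping the title makes every character literal.
escaped = films.str.contains(re.escape(title))

# regex=False gives the same literal behaviour without escaping.
literal = films.str.contains(title, regex=False)

print(loose.tolist(), escaped.tolist(), literal.tolist())
```

str.startswith always compares literally, so this concern applies only to the str.contains variant.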

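Alternatively, you can use merge() if you convert the compound string column df2['FILM'] into its two component columns, movie_title (year).

# see code at bottom to recreate your dataframes
# split 'Title (Year)' into its two parts; the raw string avoids invalid-escape warnings
df2[['movie','year']] = df2.FILM.str.extract(r'([^\(]*) \(([0-9]*)\)')
# reorder columns and drop 'FILM' now we have its subfields 'movie','year'
df2 = df2[['movie','year','Votes']].copy()
# year is extracted as a string; convert it so it matches df1's integer year
df2['year'] = df2['year'].astype(int)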

df2.merge(df1)
       movie  year  Votes  ratings
0  Max Steel  2016    560      3.5

# code to recreate the dataframes
import pandas as pd
from io import StringIO  # pandas.compat.StringIO was removed from pandas

dat1 = """movie           year   ratings
108  Mechanic: Resurrection   2016     4.0
206  Warcraft                 2016     4.0
106  Max Steel                2016     3.5
107  Me Before You            2016     4.5"""

dat2 = """FILM                   Votes
0   Avengers: Age of Ultron (2015)   4170
1   Cinderella (2015)                 950
2   Ant-Man (2015)                   3000
3   Do You Believe? (2015)            350
4   Max Steel (2016)                  560"""

df1 = pd.read_csv(StringIO(dat1), sep=r'\s{2,}', engine='python', index_col=0)
df2 = pd.read_csv(StringIO(dat2), sep=r'\s{2,}', engine='python')
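As a variation on the merge() approach in this answer, a left merge keeps every row of df1 and marks which ones found a partner in df2. This is a sketch under the assumption that every FILM value has the form "Title (YYYY)"; the names parts, df2_split and merged, and the named-group regex, are illustrative and not from the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame(
    {
        "movie": ["Mechanic: Resurrection", "Warcraft", "Max Steel", "Me Before You"],
        "year": [2016, 2016, 2016, 2016],
        "ratings": [4.0, 4.0, 3.5, 4.5],
    },
    index=[108, 206, 106, 107],
)
df2 = pd.DataFrame(
    {
        "FILM": [
            "Avengers: Age of Ultron (2015)",
            "Cinderella (2015)",
            "Ant-Man (2015)",
            "Do You Believe? (2015)",
            "Max Steel (2016)",
        ],
        "Votes": [4170, 950, 3000, 350, 560],
    }
)

# Split "Title (YYYY)" into named columns; assumes every FILM has this shape.
parts = df2["FILM"].str.extract(r"^(?P<movie>.*?)\s*\((?P<year>\d{4})\)$")
parts["year"] = parts["year"].astype(int)
df2_split = pd.concat([parts, df2[["Votes"]]], axis=1)

# how="left" keeps all df1 rows; indicator=True adds a "_merge" column
# saying whether each row was found in both frames or only in df1.
merged = df1.merge(df2_split, on=["movie", "year"], how="left", indicator=True)

print(merged)
```

Rows without a match get NaN in Votes, which makes it easy to list the movies that are missing from df2. The merge resets the index, so df1's original labels (108, 206, ...) are not preserved.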

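Brendan Cody-Kenny 6 years ago

smci's option 1 was nearly there; the following worked for me:

df1['Votes'] = ''
# .any(axis=0) reduces the matching Votes values to a single boolean per title
df1['Votes'] = df1['movie'].apply(lambda title: df2[df2['FILM'].str.startswith(title)]['Votes'].any(axis=0))

This creates a Votes column in df1.

The lambda looks up df2 and selects all rows of df2 where FILM starts with the movie title.

It then selects the Votes column of the resulting subset of df2.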
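Because any() collapses the matched votes to a True/False flag, a small change is needed if the goal is to copy the vote count itself into df1. The following is a minimal sketch of that variation; the helper name first_votes is hypothetical, and taking the first matching row is an assumption about how duplicates should be handled.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame(
    {
        "movie": ["Mechanic: Resurrection", "Warcraft", "Max Steel", "Me Before You"],
        "year": [2016, 2016, 2016, 2016],
        "ratings": [4.0, 4.0, 3.5, 4.5],
    },
    index=[108, 206, 106, 107],
)
df2 = pd.DataFrame(
    {
        "FILM": [
            "Avengers: Age of Ultron (2015)",
            "Cinderella (2015)",
            "Ant-Man (2015)",
            "Do You Believe? (2015)",
            "Max Steel (2016)",
        ],
        "Votes": [4170, 950, 3000, 350, 560],
    }
)


def first_votes(title):
    """Return Votes from the first df2 row whose FILM starts with title, else NaN."""
    matches = df2.loc[df2["FILM"].str.startswith(title), "Votes"]
    return matches.iloc[0] if not matches.empty else float("nan")


# One lookup per df1 row; fine for small frames, slow for large ones.
df1["Votes"] = df1["movie"].apply(first_votes)

print(df1)
```

This scans df2 once for every row of df1, so for large frames the merge-based approach is likely the better fit.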
