
Joining dataframes based on partial string match between columns

  •  0
  • Sai Kumar  ·  7 years ago

    I have a dataframe, and I want to check whether its movies are present in another dataframe.

    after_h.sample(10, random_state=1)
    
                 movie           year   ratings
    108 Mechanic: Resurrection   2016     4.0
    206 Warcraft                 2016     4.0
    106 Max Steel                2016     3.5
    107 Me Before You            2016     4.5
    

    The other dataframe:

                  FILM                   Votes
    0   Avengers: Age of Ultron (2015)   4170
    1   Cinderella (2015)                 950
    2   Ant-Man (2015)                   3000 
    3   Do You Believe? (2015)            350
    4   Max Steel (2016)                  560 
    

    Desired output:

        FILM              votes
    0  Max Steel           560
    
    2 replies  |  as of 7 years ago
        1
  •  2
  •   jpp    7 years ago

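    Given input dataframes df1 and df2, you can use Boolean indexing via pd.Series.isin. To align the format of the movie strings, first concatenate movie and year from df1: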

    s = df1['movie'] + ' (' + df1['year'].astype(str) + ')'
    
    res = df2[df2['FILM'].isin(s)]
    
    print(res)
    
                   FILM  Votes
    4  Max Steel (2016)    560
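
Because `isin` compares whole strings exactly, stray whitespace or a difference in case would silently drop a match. The following is a minimal sketch of one way to guard against that, not part of the original answer. The `normalize` helper and the deliberately messy sample values are assumptions made for illustration.

```python
import pandas as pd

def normalize(s: pd.Series) -> pd.Series:
    # Hypothetical helper: trim surrounding whitespace and lowercase.
    return s.str.strip().str.lower()

# Toy data with deliberately inconsistent spacing and case (assumed values).
df1 = pd.DataFrame({"movie": ["Warcraft", "max steel "], "year": [2016, 2016]})
df2 = pd.DataFrame({"FILM": ["Cinderella (2015)", " Max Steel (2016)"],
                    "Votes": [950, 560]})

# Build "title (year)" keys in the same normalized form as df2['FILM'].
keys = normalize(df1["movie"]) + " (" + df1["year"].astype(str) + ")"

# Compare normalized strings, but select rows from the untouched df2.
res = df2[normalize(df2["FILM"]).isin(keys)]
print(res)
```

Only the comparison uses the normalized strings, so the rows returned from df2 keep their original text.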
    
        2
  •  6
  •   Gopal Chitalia    4 years ago

    1. To get the row indices for partial matches, use FILM.startswith(title) or FILM.contains(title). Either of:

      df1[ df1.movie.apply( lambda title: df2.FILM.str.startswith(title) ).any(axis=1) ]

      df1[ df1['movie'].apply(lambda title: df2['FILM'].str.contains(title, regex=False)).any(axis=1) ]

         movie      year      ratings
    106  Max Steel  2016      3.5
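
To see why reducing along axis 1 works: when the function passed to `Series.apply` returns a Series, the result is a DataFrame with one row per df1 title and one column per df2 row. The sketch below, on assumed toy data, makes that intermediate matrix explicit. It also escapes the title with `re.escape`, since `str.contains` treats its pattern as a regular expression by default.

```python
import re
import pandas as pd

# Toy stand-ins for df1 and df2 (assumed values, taken from the post).
df1 = pd.DataFrame({"movie": ["Warcraft", "Max Steel"]}, index=[206, 106])
df2 = pd.DataFrame({"FILM": ["Cinderella (2015)", "Max Steel (2016)"]}, index=[1, 4])

# One row per df1 title, one column per df2 row;
# True where that FILM string starts with that title.
matrix = df1["movie"].apply(lambda title: df2["FILM"].str.startswith(title))

# Reduce across the columns: does this title match any FILM at all?
matched = df1[matrix.any(axis=1)]

# Same idea with a substring test; escaping keeps characters such as
# '?' or '(' in a title from being read as regex syntax.
matrix_contains = df1["movie"].apply(
    lambda title: df2["FILM"].str.contains(re.escape(title))
)
print(matched)
```

Note that this compares every title against every FILM string, so the cost grows with len(df1) × len(df2).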
    
    2. Alternatively, you can use merge() if you convert the compound string column df2['FILM'] into its two component columns, movie_title and (year).

    # see code at bottom to recreate your dataframes
    df2[['movie','year']] = df2.FILM.str.extract(r'([^\(]*) \(([0-9]*)\)')
    df2['year'] = df2['year'].astype(int)
    # reorder columns and drop 'FILM' now we have its subfields 'movie','year'
    df2 = df2[['movie','year','Votes']]
    
    df2.merge(df1)
           movie  year  Votes  ratings
    0  Max Steel  2016    560      3.5
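
If the unmatched df1 rows should be kept as well, a left merge with `indicator=True` is one option. This is a sketch on assumed toy data, not part of the original answer. The named capture groups and the `df2_split` name are illustrative choices.

```python
import pandas as pd

# Toy stand-ins for df1 and df2 (assumed values, taken from the post).
df1 = pd.DataFrame({"movie": ["Warcraft", "Max Steel"],
                    "year": [2016, 2016],
                    "ratings": [4.0, 3.5]})
df2 = pd.DataFrame({"FILM": ["Cinderella (2015)", "Max Steel (2016)"],
                    "Votes": [950, 560]})

# Named groups make str.extract return columns called 'movie' and 'year'.
parts = df2["FILM"].str.extract(r"^(?P<movie>.*) \((?P<year>\d{4})\)$")
parts["year"] = parts["year"].astype(int)  # match df1['year'] dtype
df2_split = pd.concat([parts, df2[["Votes"]]], axis=1)

# how='left' keeps every df1 row; '_merge' records whether a match was found.
merged = df1.merge(df2_split, on=["movie", "year"], how="left", indicator=True)
print(merged)
```

Rows marked `left_only` in the `_merge` column are the df1 movies with no counterpart in df2, and their Votes value is NaN.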
    

    import pandas as pd
    from io import StringIO  # pandas.compat.StringIO was removed in pandas 1.0
    
    dat1 = """movie           year   ratings
    108  Mechanic: Resurrection   2016     4.0
    206  Warcraft                 2016     4.0
    106  Max Steel                2016     3.5
    107  Me Before You            2016     4.5"""
    
    dat2 = """FILM                   Votes
    0   Avengers: Age of Ultron (2015)   4170
    1   Cinderella (2015)                 950
    2   Ant-Man (2015)                   3000
    3   Do You Believe? (2015)            350
    4   Max Steel (2016)                  560"""
    
    df1 = pd.read_csv(StringIO(dat1), sep=r'\s{2,}', engine='python', index_col=0)
    df2 = pd.read_csv(StringIO(dat2), sep=r'\s{2,}', engine='python')
    
        3
  •  0
  •   Brendan Cody-Kenny    6 years ago

    smci's option 1 is nearly there; the following worked for me:

    df1['Votes'] = ''
    # Series.any() collapses the matched Votes to a single True/False per title
    df1['Votes']=df1['movie'].apply(lambda title: df2[df2['FILM'].str.startswith(title)]['Votes'].any(axis=0))
    

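    Create a Votes column in df1.

    The lambda looks up df2 and selects all rows of df2 whose FILM starts with the movie title.

    Select the Votes column of the resulting subset of df2.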
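
One caveat: `Series.any` reduces the selected Votes to a single boolean, so the snippet above records whether a nonzero vote count was found rather than the count itself. If the count is what is wanted, a sketch along the following lines may be closer. The `first_votes` helper and the toy data are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for df1 and df2 (assumed values, taken from the post).
df1 = pd.DataFrame({"movie": ["Warcraft", "Max Steel"]}, index=[206, 106])
df2 = pd.DataFrame({"FILM": ["Cinderella (2015)", "Max Steel (2016)"],
                    "Votes": [950, 560]})

def first_votes(title):
    # Hypothetical helper: Votes of the first df2 row whose FILM starts
    # with the title, or NaN when nothing matches.
    hits = df2.loc[df2["FILM"].str.startswith(title), "Votes"]
    return hits.iloc[0] if not hits.empty else float("nan")

df1["Votes"] = df1["movie"].apply(first_votes)
print(df1)
```

When several FILM strings share a prefix, only the first match is used. Whether that is acceptable depends on the data.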