代码之家  ›  专栏  ›  技术社区  ›  parvij

如何从数据帧中提取基表并放置Id而不是它?[副本]

  •  0
  • parvij  · 技术社区  · 6 年前

    我有两个数据帧 df1 df2 df1型 包含人的年龄信息,而 包含人的性别信息。不是所有人都在 df1型 df2型

    df1
         Name   Age 
    0     Tom    34
    1     Sara   18
    2     Eva    44
    3     Jack   27
    4     Laura  30
    
    df2
         Name      Sex 
    0     Tom       M
    1     Paul      M
    2     Eva       F
    3     Jack      M
    4     Michelle  F
    

    我想知道里面人的性别 和设置 NaN 如果我没有这个信息 df2型 df1 = pd.merge(df1, df2, on = 'Name', how = 'outer') 但我把一些人的信息 df2型 我不想要的。

    df1
         Name   Age     Sex
    0     Tom    34      M
    1     Sara   18     NaN
    2     Eva    44      F
    3     Jack   27      M
    4     Laura  30     NaN
    
    0 回复  |  直到 6 年前
        1
  •  8
  •   jezrael    6 年前

    Sample :

    df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
                        'Age': [34, 18, 44, 27, 30]})
    
    #print (df1)
    df3 = df1.copy()
    
    df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'], 
                        'Sex': ['M', 'M', 'F', 'M', 'F']})
    #print (df2)
    

    使用 map Series 创建人 set_index :

    df1['Sex'] = df1['Name'].map(df2.set_index('Name')['Sex'])
    print (df1)
        Name  Age  Sex
    0    Tom   34    M
    1   Sara   18  NaN
    2    Eva   44    F
    3   Jack   27    M
    4  Laura   30  NaN
    

    替代解决方案 merge

    df = df3.merge(df2[['Name','Sex']], on='Name', how='left')
    print (df)
        Name  Age  Sex
    0    Tom   34    M
    1   Sara   18  NaN
    2    Eva   44    F
    3   Jack   27    M
    4  Laura   30  NaN
    

    如果需要多列映射(例如。 Year Code )需要 合并 左连接:

    df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
                        'Year':[2000,2003,2003,2004,2007],
                        'Code':[1,2,3,4,4],
                        'Age': [34, 18, 44, 27, 30]})
    
    print (df1)
        Name  Year  Code  Age
    0    Tom  2000     1   34
    1   Sara  2003     2   18
    2    Eva  2003     3   44
    3   Jack  2004     4   27
    4  Laura  2007     4   30
    
    df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'], 
                        'Sex': ['M', 'M', 'F', 'M', 'F'],
                        'Year':[2001,2003,2003,2004,2007],
                        'Code':[1,2,3,5,3],
                        'Val':[21,34,23,44,67]})
    print (df2)
           Name Sex  Year  Code  Val
    0       Tom   M  2001     1   21
    1      Paul   M  2003     2   34
    2       Eva   F  2003     3   23
    3      Jack   M  2004     5   44
    4  Michelle   F  2007     3   67
    
    #merge by all columns
    df = df1.merge(df2, on=['Year','Code'], how='left')
    print (df)
      Name_x  Year  Code  Age Name_y  Sex   Val
    0    Tom  2000     1   34    NaN  NaN   NaN
    1   Sara  2003     2   18   Paul    M  34.0
    2    Eva  2003     3   44    Eva    F  23.0
    3   Jack  2004     4   27    NaN  NaN   NaN
    4  Laura  2007     4   30    NaN  NaN   NaN
    
    #specified columns - columns for join (Year, Code) need always + appended columns (Val)
    df = df1.merge(df2[['Year','Code', 'Val']], on=['Year','Code'], how='left')
    print (df)
        Name  Year  Code  Age   Val
    0    Tom  2000     1   34   NaN
    1   Sara  2003     2   18  34.0
    2    Eva  2003     3   44  23.0
    3   Jack  2004     4   27   NaN
    4  Laura  2007     4   30   NaN
    

    地图 这里的意思是按连接列复制 Name

    df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
                        'Age': [34, 18, 44, 27, 30]})
    
    print (df1)
        Name  Age
    0    Tom   34
    1   Sara   18
    2    Eva   44
    3   Jack   27
    4  Laura   30
    
    df3, df4 = df1.copy(), df1.copy()
    
    df2 = pd.DataFrame({'Name': ['Tom', 'Tom', 'Eva', 'Jack', 'Michelle'], 
                        'Val': [1,2,3,4,5]})
    print (df2)
           Name  Val
    0       Tom    1 <-duplicated name Tom
    1       Tom    2 <-duplicated name Tom
    2       Eva    3
    3      Jack    4
    4  Michelle    5
    
    s = df2.set_index('Name')['Val']
    df1['New'] = df1['Name'].map(s)
    print (df1)
    

    无效索引:仅用唯一值索引对象有效的重新索引

    解决方案通过 DataFrame.drop_duplicates dict 对于上次重复匹配:

    #default keep first value
    s = df2.drop_duplicates('Name').set_index('Name')['Val']
    print (s)
    Name
    Tom         1
    Eva         3
    Jack        4
    Michelle    5
    Name: Val, dtype: int64
    
    df1['New'] = df1['Name'].map(s)
    print (df1)
        Name  Age  New
    0    Tom   34  1.0
    1   Sara   18  NaN
    2    Eva   44  3.0
    3   Jack   27  4.0
    4  Laura   30  NaN
    
    #add parameter for keep last value 
    s = df2.drop_duplicates('Name', keep='last').set_index('Name')['Val']
    print (s)
    Name
    Tom         2
    Eva         3
    Jack        4
    Michelle    5
    Name: Val, dtype: int64
    
    df3['New'] = df3['Name'].map(s)
    print (df3)
        Name  Age  New
    0    Tom   34  2.0
    1   Sara   18  NaN
    2    Eva   44  3.0
    3   Jack   27  4.0
    4  Laura   30  NaN
    
    #map by dictionary
    d = dict(zip(df2['Name'], df2['Val']))
    print (d)
    {'Tom': 2, 'Eva': 3, 'Jack': 4, 'Michelle': 5}
    
    df4['New'] = df4['Name'].map(d)
    print (df4)
        Name  Age  New
    0    Tom   34  2.0
    1   Sara   18  NaN
    2    Eva   44  3.0
    3   Jack   27  4.0
    4  Laura   30  NaN
    
        2
  •  1
  •   Xiaoyu Lu    6 年前

    您也可以使用 join

    df1.set_index("Name").join(df2.set_index("Name"), how="left")
    

    编辑:添加 set_index("Name")