
Matching two DataFrames based on a condition in pandas

  • AAA  · 1 year ago

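    I have two DataFrames: dfname (which holds different versions of each player's name) and dfgoals (which holds information about players and the goals they scored). I want to return one row per player in the answer df, based on these conditions:

    (i) Check whether the name1 value exists in the actual_name column of dfgoals. If it does, return the first matching row; otherwise check the name2 value and return the first matching row.

    (ii) Along with the matched value (name1 or name2), also return the corresponding name column value from dfname.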

    import pandas as pd

    dfname = pd.DataFrame({
                         "name": ["ryan", "bill", "saka", "Henry","Rooney"], 
                         "name1": ["ryan 112", "Bill Matt Cdevaca", "Bukayo Saka", "Super Henry","Rooney"], 
                         # ryan has no second variant: use a real missing value, not the string "NaN"
                         "name2": [None, "XXVaca", "Bukayo", "Thierry","Rooney"]})
    
    
    
    dfgoals = pd.DataFrame({
                         "actual_name": ["ryan 112", "XXVaca", "Bukayo", "Thierry", "Ronaldo", "Messi"], 
                         "goals": [0, 2, 5, 10, 100, 200], 
                         "matches": [22, 100, 200, 300, 100, 90]})
    
    
    # Expected result; float("nan") marks real missing values instead of the string "NaN"
    answerdf = pd.DataFrame({
                         "actual_name": ["ryan 112", "XXVaca", "Bukayo", "Thierry", "Rooney"], 
                         "goals": [0, 2, 5, 10, float("nan")], 
                         "matches": [22, 100, 200, 300, float("nan")],
                        "name_from_dfname": ["ryan", "bill", "saka", "Henry", "Rooney"]})
    answerdf
    
    Rooney's values are NaN because his goals record is not available
    
    

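    This is what I have tried so far, but it does not check the name1/name2 values correctly. For example, it only gives me ryan's goals and not the other players' goals, because their names are written differently.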

    df = dfgoals
    values_to_check = ['ryan', 'Bill Matt Cdevaca', 'saka', 'henry', 'Rooney']
    
    filtered_rows = []
    
    # Iterate through the DataFrame rows to find matches and concatenate values
    for index, row in dfgoals.iterrows():
        matched_values = [value for value in values_to_check if value.lower() in row['actual_name'].lower()]
        if matched_values:
            row['concatenated_values'] = '|'.join(matched_values)
            filtered_rows.append(row)
    
    # Create a new DataFrame from the filtered rows
    result_df = pd.DataFrame(filtered_rows)
    result_df['concatenated_values'] = pd.Categorical(result_df['concatenated_values'], categories=values_to_check, ordered=True)
    
    # Sort the DataFrame based on the 'concatenated_values' column
    result_df.sort_values(by = "concatenated_values")
    
    1 Answer

  • Andrej Kesely · 1 year ago

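    You can try to .stack the dfname DataFrame and then index the result by the values from the name1 / name2 columns, so that every name variant points to its name. Then use .map: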

    x = dfname.set_index('name').stack().reset_index().dropna().drop_duplicates(subset=0).set_index(0)['name']
    dfgoals['name_from_dfname'] = dfgoals['actual_name'].map(x)
    

    Prints (matched rows only; Ronaldo and Messi get NaN in name_from_dfname and are not shown):

      actual_name  goals  matches name_from_dfname
    0    ryan 112      0       22             ryan
    1      XXVaca      2      100             bill
    2      Bukayo      5      200             saka
    3     Thierry     10      300            Henry
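
    To make the one-liner above easier to follow, here is a minimal sketch that splits it into steps and inspects the intermediate lookup. The labels source_col and variant and the variable names long, lookup and mapped are illustrative and not part of the original answer; the missing name2 for ryan is written as None here, which is an assumption about how the data is stored.

```python
import pandas as pd

dfname = pd.DataFrame({
    "name": ["ryan", "bill", "saka", "Henry", "Rooney"],
    "name1": ["ryan 112", "Bill Matt Cdevaca", "Bukayo Saka", "Super Henry", "Rooney"],
    "name2": [None, "XXVaca", "Bukayo", "Thierry", "Rooney"],  # None = no second variant (assumption)
})
dfgoals = pd.DataFrame({
    "actual_name": ["ryan 112", "XXVaca", "Bukayo", "Thierry", "Ronaldo", "Messi"],
    "goals": [0, 2, 5, 10, 100, 200],
    "matches": [22, 100, 200, 300, 100, 90],
})

# Step 1: long format, one row per (name, source column, variant string).
long = dfname.set_index("name").stack().reset_index()
long.columns = ["name", "source_col", "variant"]  # illustrative labels

# Step 2: drop missing variants and keep the first occurrence of each variant string.
# Series.map needs a uniquely indexed lookup, and "Rooney" appears in both name1 and name2.
long = long.dropna(subset=["variant"]).drop_duplicates(subset="variant")

# Step 3: the lookup maps each variant string to the short name.
lookup = long.set_index("variant")["name"]

# Step 4: exact-match lookup; names without a variant in dfname become NaN.
mapped = dfgoals["actual_name"].map(lookup)

print(lookup)
print(mapped)
```

    Because .map does an exact match on the index of lookup, variants such as "XXVaca" resolve to "bill", while "Ronaldo" and "Messi" have no entry and come back as NaN.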
    

    EDIT: To add the missing players:

    x = dfname.set_index('name').stack().reset_index().dropna().drop_duplicates(subset=0).set_index(0)['name']
    dfgoals['name_from_dfname'] = dfgoals['actual_name'].map(x)
    
    # add missing players
    missing_players = dfname.loc[~dfname['name'].isin(dfgoals['name_from_dfname']), ['name']].rename(columns={'name': 'actual_name'})
    missing_players['name_from_dfname'] = missing_players['actual_name']
    
    dfgoals = pd.concat([dfgoals, missing_players])
    print(dfgoals)
    

    Prints:

      actual_name  goals  matches name_from_dfname
    0    ryan 112    0.0     22.0             ryan
    1      XXVaca    2.0    100.0             bill
    2      Bukayo    5.0    200.0             saka
    3     Thierry   10.0    300.0            Henry
    4     Ronaldo  100.0    100.0              NaN
    5       Messi  200.0     90.0              NaN
    4      Rooney    NaN      NaN           Rooney
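
    The output above still contains Ronaldo and Messi, who are not in dfname. If the goal is exactly one row per player in dfname, with name1 checked before name2, one possible variation is to pick the matching key first and then do a left merge. This is only a sketch, not the method from the answer: the names goals_first, known, key and result are hypothetical, and falling back to name when neither variant matches is an assumption based on the Rooney row in the expected answerdf.

```python
import pandas as pd

dfname = pd.DataFrame({
    "name": ["ryan", "bill", "saka", "Henry", "Rooney"],
    "name1": ["ryan 112", "Bill Matt Cdevaca", "Bukayo Saka", "Super Henry", "Rooney"],
    "name2": [None, "XXVaca", "Bukayo", "Thierry", "Rooney"],  # None = no second variant (assumption)
})
dfgoals = pd.DataFrame({
    "actual_name": ["ryan 112", "XXVaca", "Bukayo", "Thierry", "Ronaldo", "Messi"],
    "goals": [0, 2, 5, 10, 100, 200],
    "matches": [22, 100, 200, 300, 100, 90],
})

# Keep only the first row per actual_name so each key matches at most one row.
goals_first = dfgoals.drop_duplicates(subset="actual_name")
known = set(goals_first["actual_name"])

# Priority: name1 if it exists in dfgoals, else name2 if it exists,
# else fall back to the short name (assumption, mirrors the Rooney row).
in1 = dfname["name1"].isin(known)
in2 = dfname["name2"].isin(known)
key = dfname["name1"].where(in1, dfname["name2"].where(in2, dfname["name"]))

# Left merge keeps one row per dfname player, in dfname order.
result = (
    pd.DataFrame({"actual_name": key, "name_from_dfname": dfname["name"]})
    .merge(goals_first, on="actual_name", how="left")
    [["actual_name", "goals", "matches", "name_from_dfname"]]
)
print(result)
```

    With the sample data this gives five rows: Ronaldo and Messi are absent, and Rooney is kept with NaN in goals and matches.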