
How to filter categorical data in Pandas

  •  0
  • Liu Hantao  · Tech Community  · 4 years ago

    Here is a sample of the data, followed by the value counts of the race column:

        sex     age         race        
        Male    0.204082    Hispanic    
        Male    0.122449    African-American    
        Female  0.163265    African-American    
        Male    0.081633    African-American    
        Male    0.530612    African-American
    
    African-American    2968
    Caucasian           1969
    Hispanic             502
    Other                294
    Asian                 26
    Native American       13
    Name: race, dtype: int64 
    

    I want to remove the Native American and Asian rows from the dataset. This is what I tried:

    df_train_val_scaled = df_train_val_scaled[df_train_val_scaled["race"] != "Native American" & df_train_val_scaled["race"] != "Asian"]
    

    This raised the following error:

    TypeError: Cannot perform 'rand_' with a dtyped [object] array and scalar of type [bool]
    

    So I tried the following instead:

    df_train_val_scaled = df_train_val_scaled[df_train_val_scaled["race"] not in ["Native American", "Asian"]]
    

    But that also raises an error:

    ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
    

    Thanks for your help.

        1
  •  2
  •   Ishita Thakkar    4 years ago

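    You can filter a DataFrame on the values of any column with isin(). It returns a boolean Series with one entry per row, and ~ inverts it.
    Indexing the DataFrame with that boolean Series returns a new DataFrame containing only the rows where the value is True.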

    import pandas as pd
    
    people = {
        'sex': ['Male', 'Male', 'Male', 'Female', 'Male'],
        'age': [0.204082, 0.163265, 0.204082, 0.214082, 0.204082],
        'race': ['Hispanic', 'African-American', 'Asian', 'Asian', 'Asian']
    }
    
    df = pd.DataFrame(people)
    
    filter_ = ~df['race'].isin(['African-American', 'Asian'])
    
    print(filter_)
    
    # 0     True
    # 1    False
    # 2    False
    # 3    False
    # 4    False
    # Name: race, dtype: bool
    
    df_filtered = df[filter_]
    print(df_filtered)
    
    #     sex       age      race
    # 0  Male  0.204082  Hispanic
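
    The first attempt in the question fails because of operator precedence: in Python, & binds more tightly than !=, so the expression evaluates "Native American" & df["race"] before any comparison happens. The second attempt fails because not in asks for a single truth value from a whole Series. If you prefer explicit comparisons over isin(), a minimal sketch of the parenthesized form follows. The sample frame and the names mask and df_kept are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical sample data, made up for illustration
df = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Male"],
    "race": ["Hispanic", "Asian", "Native American", "Caucasian"],
})

# Parenthesize each comparison so that & combines two boolean Series
mask = (df["race"] != "Native American") & (df["race"] != "Asian")
df_kept = df[mask]

# The isin() form selects the same rows
same = df_kept.equals(df[~df["race"].isin(["Native American", "Asian"])])

print(df_kept)
print(same)
```

    With more than two values to exclude, the isin() form stays on one line, while the comparison form needs one parenthesized term per value.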
    
        2
  •  1
  •   SultanOrazbayev    4 years ago

    The trick is to check whether each element is in the given list and then negate the result: ~df['race'].isin(['a', 'b', 'c']). Here is an example:

    from io import StringIO as sio
    
    data = sio("""
     sex     age         race        
        Male    0.204082    Hispanic    
        Male    0.122449    African-American    
        Female  0.163265    African-American    
        Male    0.081633    African-American    
        Male    0.530612    African-American
    """)
    
    import pandas as pd
    # Raw string so that \s is passed to the regex engine rather than treated as a string escape
    df = pd.read_csv(data, sep=r'\s+').astype({'race': 'category'})
    
    df_train_val_scaled = df[~df["race"].isin(["Native American", "Asian"])]
    df_train_val_scaled
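
    One thing to watch with a categorical column: filtering rows does not remove the dropped labels from the column's list of categories, so they can still show up with a count of 0 in value_counts(). A short sketch of cleaning that up, assuming the leftover categories are unwanted in later steps. The sample frame and the name kept are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data with a categorical race column
df = pd.DataFrame({
    "race": ["Hispanic", "Asian", "African-American", "Native American"],
}).astype({"race": "category"})

# .copy() so the column assignment below acts on an independent frame
kept = df[~df["race"].isin(["Native American", "Asian"])].copy()

# The filtered-out labels are still registered as categories
before = list(kept["race"].cat.categories)
print(before)

# Drop categories that no longer occur in the data
kept["race"] = kept["race"].cat.remove_unused_categories()
after = list(kept["race"].cat.categories)
print(after)
```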