代码之家  ›  专栏  ›  技术社区  ›  Hatshepsut

多列交叉表

  •  1
  • Hatshepsut  · 技术社区  · 7 年前

    我有一个带有名称、日期和位置的数据框。对于每个name day location triple,我想知道具有该name day的行中有多大比例的行具有该位置。

    在代码中,我从 df expected .

    import pandas as pd
    
    df = pd.DataFrame(
        [
            {"name": "Alice", "day": "friday", "location": "left"},
            {"name": "Alice", "day": "friday", "location": "right"},
            {"name": "Bob", "day": "monday", "location": "left"},
        ]
    )
    
    print(df)
    
    
    
    expected = pd.DataFrame(
        [
            {"name": "Alice", "day": "friday", "location": "left", "row_percent": 50.0},
            {"name": "Alice", "day": "friday", "location": "right", "row_percent": 50.0},
            {"name": "Bob", "day": "monday", "location": "left", "row_percent": 100.0},
        ]
    ).set_index(['name', 'day', ])
    print(expected)
    

    印刷的:

    In [13]: df                                                                                                                                                                                  
    Out[13]: 
          day location   name
    0  friday     left  Alice
    1  friday    right  Alice
    2  monday     left    Bob
    
    
    
    
    In [12]: expected                                                                                                                                                                            
    Out[12]: 
                 location  row_percent
    name  day                         
    Alice friday     left         50.0
          friday    right         50.0
    Bob   monday     left        100.0
    
    1 回复  |  直到 7 年前
        1
  •  9
  •   user3483203    7 年前

    使用 groupby value_counts :

    df.groupby(['name', 'day']).location.value_counts(normalize=True).mul(100)
    

    name   day     location
    Alice  friday  left         50.0
                   right        50.0
    Bob    monday  left        100.0
    Name: location, dtype: float64
    

    对您想要的输出进行更多的清洁:

    out = (df.groupby(['name', 'day']).location.value_counts(normalize=True).mul(100)
              .rename('row_percent').reset_index(2))
    

                 location  row_percent
    name  day
    Alice friday     left         50.0
          friday    right         50.0
    Bob   monday     left        100.0
    

    out == expected
    

                  location  row_percent
    name  day
    Alice friday      True         True
          friday      True         True
    Bob   monday      True         True