Pandas equivalent of SELECT COUNT(DISTINCT col1, col2) GROUP BY col3

  •  3
  •  Shayna  ·  6 years ago

    Generate the DataFrame:

    import pandas as pd

    people = ['shayna','shayna','shayna','shayna','john']
    dates = ['01-01-18','01-01-18','01-01-18','01-02-18','01-02-18']
    places = ['hospital', 'hospital', 'inpatient', 'hospital', 'hospital']
    d = {'Person':people,'Service_Date':dates, 'Site_Where_Served':places}
    df = pd.DataFrame(d)
    df
    
    Person   Service_Date   Site_Where_Served
    shayna   01-01-18       hospital 
    shayna   01-01-18       hospital 
    shayna   01-01-18       inpatient 
    shayna   01-02-18       hospital 
    john     01-02-18       hospital 
    

    What I would like to do is group by Site_Where_Served and count the unique (Person, Service_Date) pairs within each group.

    Expected output:

    Site_Where_Served    Site_Visit_Count
    hospital             3
    inpatient            1
    

    My attempt:

    df[['Person', 'Service_Date']].groupby(df['Site_Where_Served']).nunique().reset_index(name='Site_Visit_Count')
    

    But this errors on the index reset. Ignoring that for a moment, I realized it doesn't actually count unique pairs of 'Person' and 'Service_Date' anyway, since the output looks like this:

                       Person    Service_Date
    Site_Where_Served
    hospital              2           2 
    inpatient             1           1 
    
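For reference, the SQL `COUNT(DISTINCT Person, Service_Date) ... GROUP BY Site_Where_Served` the question describes can also be expressed as a single groupby-apply that dedupes the pair columns inside each group; a sketch on the question's own data (one possible approach, not the only idiom):

```python
import pandas as pd

people = ['shayna', 'shayna', 'shayna', 'shayna', 'john']
dates = ['01-01-18', '01-01-18', '01-01-18', '01-02-18', '01-02-18']
places = ['hospital', 'hospital', 'inpatient', 'hospital', 'hospital']
df = pd.DataFrame({'Person': people, 'Service_Date': dates,
                   'Site_Where_Served': places})

# For each site, drop duplicate (Person, Service_Date) pairs,
# then count what remains.
res = (df.groupby('Site_Where_Served')[['Person', 'Service_Date']]
         .apply(lambda g: len(g.drop_duplicates()))
         .reset_index(name='Site_Visit_Count'))
print(res)
#   Site_Where_Served  Site_Visit_Count
# 0          hospital                 3
# 1         inpatient                 1
```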
    4 Answers
        1
  •  4
  •   Community CDub    4 years ago

    drop_duplicates with groupby + count

    (df.drop_duplicates()
       .groupby('Site_Where_Served')
       .Site_Where_Served.count()
       .reset_index(name='Site_Visit_Count')
    )
    
      Site_Where_Served  Site_Visit_Count
    0          hospital                 3
    1         inpatient                 1
    

    Note that there is a subtle difference between count and size: the former does not count NaN entries.
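To make the count/size distinction concrete, a minimal illustration (the `key`/`val` frame here is hypothetical, not from the question):

```python
import pandas as pd
import numpy as np

# One NaN in group 'a': count() skips it, size() counts every row.
s = pd.DataFrame({'key': ['a', 'a', 'b'],
                  'val': [1.0, np.nan, 2.0]})

print(s.groupby('key')['val'].count())  # a -> 1, b -> 1  (NaN skipped)
print(s.groupby('key')['val'].size())   # a -> 2, b -> 1  (NaN included)
```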


    Tupleize the columns, then groupby + nunique

    This essentially just fixes your current attempt, but I wouldn't recommend it, as it is verbose and takes too many steps. First, tupleize the pair of columns, group by Site_Where_Served, then count:

    (df[['Person', 'Service_Date']]
       .apply(tuple, 1)
       .groupby(df.Site_Where_Served)
       .nunique()
       .reset_index(name='Site_Visit_Count')
    )
    
      Site_Where_Served  Site_Visit_Count
    0          hospital                 3
    1         inpatient                 1
    
        2
  •  3
  •   jpp    6 years ago

    A better idea, in my opinion, is to drop duplicates before using groupby.size:

    res = df.drop_duplicates()\
            .groupby('Site_Where_Served').size()\
            .reset_index(name='Site_Visit_Count')
    
    print(res)
    
      Site_Where_Served  Site_Visit_Count
    0          hospital                 3
    1         inpatient                 1
    
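One caveat worth noting: calling `drop_duplicates()` with no arguments dedupes on all columns, which works here only because the frame has exactly the three relevant columns. If the frame carried extra columns, you would pass `subset` explicitly; a sketch (the `Note` column is hypothetical, added only to show the pitfall):

```python
import pandas as pd

df = pd.DataFrame({
    'Person': ['shayna', 'shayna', 'shayna', 'shayna', 'john'],
    'Service_Date': ['01-01-18', '01-01-18', '01-01-18', '01-02-18', '01-02-18'],
    'Site_Where_Served': ['hospital', 'hospital', 'inpatient', 'hospital', 'hospital'],
    'Note': ['a', 'b', 'c', 'd', 'e'],  # hypothetical extra column
})

# Without subset=, no rows would be dropped (the Note column makes
# every row unique). Restrict the dedup to the columns that matter.
res = (df.drop_duplicates(subset=['Person', 'Service_Date', 'Site_Where_Served'])
         .groupby('Site_Where_Served').size()
         .reset_index(name='Site_Visit_Count'))
print(res)
#   Site_Where_Served  Site_Visit_Count
# 0          hospital                 3
# 1         inpatient                 1
```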
        3
  •  2
  •   cs95 abhishek58g    6 years ago

    Perhaps value_counts

    (df.drop_duplicates()
       .Site_Where_Served
       .value_counts()
       .to_frame('Site_Visit_Count')
       .rename_axis('Site_Where_Served')
       .reset_index()
    )
    
      Site_Where_Served  Site_Visit_Count
    0          hospital                 3
    1         inpatient                 1
    
        4
  •  1
  •   piRSquared    6 years ago

    Counter 1.

    from collections import Counter

    pd.Series(Counter(df.drop_duplicates().Site_Where_Served)) \
        .rename_axis('Site_Where_Served').reset_index(name='Site_Visit_Count')
    
      Site_Where_Served  Site_Visit_Count
    0          hospital                 3
    1         inpatient                 1
    

    Counter 2.

    from collections import Counter

    pd.DataFrame(
        list(Counter(t[2] for t in set(map(tuple, df.values))).items()),
        columns=['Site_Where_Served', 'Site_Visit_Count']
    )
    
      Site_Where_Served  Site_Visit_Count
    0          hospital                 3
    1         inpatient                 1