代码之家  ›  专栏  ›  技术社区  ›  sdbbs

如何识别Pandas groupby中的非连续范围?

  •  1
  • sdbbs  · 技术社区  · 3 年前

    numrun test

    import sys
    if sys.version_info[0] < 3:
        from StringIO import StringIO
    else:
        from io import StringIO
    
    import pandas as pd
    
    # to demonstrate problem, replace lines 8,9,10 below with:
    # 63182.8423443,0,ZZ,661,615,767
    # 63183.1940174,0,ZZ,661,615,767
    # 63183.6097631,0,ZZ,661,615,767
    TESTDATA = StringIO("""tstamp,numrun,test,valA,valB,valC
    63180.6208982,0,AA,661,615,767
    63180.9724180,0,AA,661,615,767
    63181.3240476,0,AA,661,615,767
    63181.7246253,0,ZZ,661,615,767
    63182.0752904,0,BB,661,615,767
    63182.4280107,0,BB,661,615,767
    63182.8423443,0,CC,661,615,767
    63183.1940174,0,CC,661,615,767
    63183.6097631,0,CC,661,615,767
    63183.9612234,1,AA,661,615,767
    63184.3441680,1,AA,661,615,767
    63184.6971598,1,AA,661,615,767
    63185.0964634,1,BB,661,615,767
    63185.4480095,1,BB,660,615,767
    63185.8035320,1,BB,661,615,767
    63186.1987603,1,BB,660,615,767
    63186.5500834,1,BB,661,615,767
    63186.9661803,1,BB,661,615,767
    63187.3186681,2,AA,660,615,767
    63187.7182817,2,AA,660,615,767
    63188.0696054,2,AA,660,615,767
    63188.4689450,2,ZZ,660,615,767
    63188.8204257,2,ZZ,660,615,767
    63189.1719920,2,ZZ,660,615,767
    63189.5240004,2,BB,660,615,767
    63189.9121130,2,BB,660,615,767
    63190.2593203,2,BB,660,615,767
    """)
    
    df_data = pd.read_csv(TESTDATA)
    #print(df_data.head())
    
    # use sort=False, else tstamp order gets mesed up!
    grouped = df_data.groupby(['numrun', 'test'], sort=False)
    for igx, (group_name, df_grouped) in enumerate(grouped):
      firstlastrow = df_grouped.iloc[[0, -1]]
      ttime_start = firstlastrow.iloc[0]["tstamp"]
      ttime_end = firstlastrow.iloc[1]["tstamp"]
      deltasec = ttime_end - ttime_start
      with pd.option_context('float_format', '{:f}'.format):
        print("{:02d}: {}: {:.2f} sec ({} - {})".format(igx+1, group_name, deltasec, ttime_start, ttime_end ))
    

    $ python3 /tmp/test.py
    01: (0, 'AA'): 0.70 sec (63180.6208982 - 63181.3240476)
    02: (0, 'ZZ'): 0.00 sec (63181.7246253 - 63181.7246253)
    03: (0, 'BB'): 0.35 sec (63182.0752904 - 63182.4280107)
    04: (0, 'CC'): 0.77 sec (63182.8423443 - 63183.6097631)
    05: (1, 'AA'): 0.74 sec (63183.9612234 - 63184.6971598)
    06: (1, 'BB'): 1.87 sec (63185.0964634 - 63186.9661803)
    07: (2, 'AA'): 0.75 sec (63187.3186681 - 63188.0696054)
    08: (2, 'ZZ'): 0.70 sec (63188.468945 - 63189.171992)
    09: (2, 'BB'): 0.74 sec (63189.5240004 - 63190.2593203)
    

    要模拟该情况,请替换中的线 TESTDATA 如报告中所述 to demonstrate problem ... 在代码中添加注释,然后再次运行脚本。我得到:

    $ python3 /tmp/test.py
    01: (0, 'AA'): 0.70 sec (63180.6208982 - 63181.3240476)
    02: (0, 'ZZ'): 1.89 sec (63181.7246253 - 63183.6097631)
    03: (0, 'BB'): 0.35 sec (63182.0752904 - 63182.4280107)
    04: (1, 'AA'): 0.74 sec (63183.9612234 - 63184.6971598)
    05: (1, 'BB'): 1.87 sec (63185.0964634 - 63186.9661803)
    06: (2, 'AA'): 0.75 sec (63187.3186681 - 63188.0696054)
    07: (2, 'ZZ'): 0.70 sec (63188.468945 - 63189.171992)
    08: (2, 'BB'): 0.74 sec (63189.5240004 - 63190.2593203)
    

    请注意,只标识了一个(0,'ZZ')范围,而数据中的更改又引入了一个(总共两个)。

    基本上,更改数据的期望输出如下所示:

    01: (0, 'AA'): 0.70 sec (63180.6208982 - 63181.3240476)
    02: (0, 'ZZ'): 0.00 sec (63181.7246253 - 63181.7246253)
    03: (0, 'BB'): 0.35 sec (63182.0752904 - 63182.4280107)
    04: (0, 'ZZ+1'): 0.77 sec (63182.8423443 - 63183.6097631)
    05: (1, 'AA'): 0.74 sec (63183.9612234 - 63184.6971598)
    06: (1, 'BB'): 1.87 sec (63185.0964634 - 63186.9661803)
    07: (2, 'AA'): 0.75 sec (63187.3186681 - 63188.0696054)
    08: (2, 'ZZ'): 0.70 sec (63188.468945 - 63189.171992)
    09: (2, 'BB'): 0.74 sec (63189.5240004 - 63190.2593203)
    
    1 回复  |  直到 3 年前
        1
  •  2
  •   jezrael    3 年前

    您可以按两列创建连续的组,并添加到 groupby

    g = df_data[['numrun', 'test']].ne(df_data[['numrun', 'test']].shift()).any(axis=1).cumsum()
    
    grouped = df_data.groupby(['numrun', 'test', g], sort=False)
    for igx, (group_name, df_grouped) in enumerate(grouped):
        #your code