代码之家  ›  专栏  ›  技术社区  ›  00__00__00

合并间隔和时间戳数据帧

  •  1
  • 00__00__00  · 技术社区  · 7 年前

    我有张有间隔的桌子

    dfa = pd.DataFrame({'Start': [0, 101, 666], 'Stop': [100, 200, 1000]})
    

    我有另一个包含时间戳和值的表

    dfb = pd.DataFrame({'Timestamp': [102, 145, 113], 'ValueA': [1, 2, 21],
    'ValueB': [1, 2, 21]})
    

    我需要创建一个与 dfa ,并添加了一列,其中包含 ValueA / ValueB ,对于中的所有行 dfb 用一个 Timestamp 包含于 Start Stop .

    所以这里如果把我的聚合定义为

    {'ValueA':[np.nanmean,np.nanmin],
    'ValueB':[np.nanmax]}
    

    我期望的输出是:

    ValueA   ValueA    ValueB
    nanmean  nanmin    nanmax    Start      Stop
    
    nan      nan       nan        0          100
    8        1         21         101        200
    nan      nan       nan        666        1000
    
    1 回复  |  直到 7 年前
        1
  •  1
  •   jezrael    7 年前

    merge assign

    d = {'ValueA':[np.nanmean,np.nanmin],
         'ValueB':[np.nanmax]}
    
    df = dfa.assign(A=1).merge(dfb.assign(A=1), on='A', how='outer')
    

    Start Stop

    df = (df[(df.Timestamp >= df.Start) & (df.Timestamp <= df.Stop)]
             .groupby(['Start','Stop']).agg(d))
    

    map join

    df.columns = df.columns.map('_'.join)
    print (df)
                ValueA_nanmean  ValueA_nanmin  ValueB_nanmax
    Start Stop                                              
    101   200                8              1             21
    

    join

    df = dfa.join(df, on=['Start','Stop'])
    print (df)
       Start  Stop  ValueA_nanmean  ValueA_nanmin  ValueB_nanmax
    0      0   100             NaN            NaN            NaN
    1    101   200             8.0            1.0           21.0
    2    666  1000             NaN            NaN            NaN
    

    cut

    d = {'ValueA':[np.nanmean,np.nanmin],
         'ValueB':[np.nanmax]}
    
    #if not default index create it
    dfa = dfa.reset_index(drop=True)
    print (dfa)
       Start  Stop
    0      0   100
    1    101   200
    2    666  1000
    
    #add to bins first value of Start
    bins = np.insert(dfa['Stop'].values, 0, dfa.loc[0, 'Start'])
    print (bins)
    [   0  100  200 1000]
    
    #binning
    dfb['id'] = pd.cut(dfb['Timestamp'], bins=bins, labels = dfa.index)
    print (dfb)
       Timestamp  ValueA  ValueB id
    0        102       1       1  1
    1        145       2       2  1
    2        113      21      21  1
    
    #aggregate and flatten
    df = dfb.groupby('id').agg(d)
    df.columns = df.columns.map('_'.join)
    
    #add to dfa
    df = pd.concat([dfa, df], axis=1)
    print (df)
       Start  Stop  ValueA_nanmean  ValueA_nanmin  ValueB_nanmax
    0      0   100             NaN            NaN            NaN
    1    101   200             8.0            1.0           21.0
    2    666  1000             NaN            NaN            NaN