
Pandas: Forcing a datetime timeline from one dataframe onto another

  • Aquiles Páez  · 6 years ago

    main_df:

        value    feed_id                created_at  
    0     0.0  1010077.0 2019-03-06 07:38:18-05:00   
    1     1.0  1010077.0 2019-03-06 07:39:26-05:00   
    2     1.0  1010077.0 2019-03-06 07:40:33-05:00   
    3     1.0  1010077.0 2019-03-06 07:41:41-05:00   
    4     1.0  1010077.0 2019-03-06 07:42:49-05:00   
    5     1.0  1010077.0 2019-03-06 07:43:56-05:00   
    

    aux_df:

           value    feed_id                created_at
    0  20.298492  1009408.0 2019-03-06 07:35:33-05:00
    1  20.315002  1009408.0 2019-03-06 07:36:34-05:00
    2  20.315002  1009408.0 2019-03-06 07:37:36-05:00
    3  20.359650  1009408.0 2019-03-06 07:38:36-05:00
    4  20.359650  1009408.0 2019-03-06 07:39:37-05:00
    5  20.383179  1009408.0 2019-03-06 07:40:38-05:00
    6  20.383179  1009408.0 2019-03-06 07:41:38-05:00
    7  20.449524  1009408.0 2019-03-06 07:42:39-05:00
    8  20.449524  1009408.0 2019-03-06 07:43:40-05:00
    9  20.521912  1009408.0 2019-03-06 07:44:41-05:00
    

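    In this case I want the following result (final_df): the "timeline" in the 'created_at' column of aux_df should be carried over in full to main_df, whether or not the two columns have values in common. To decide which rows are common, I compare the timestamps while ignoring the seconds part.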

           value    feed_id                created_at
    0        nan        nan 2019-03-06 07:35:33-05:00
    1        nan        nan 2019-03-06 07:36:34-05:00
    2        nan        nan 2019-03-06 07:37:36-05:00
    3        0.0  1010077.0 2019-03-06 07:38:36-05:00
    4        1.0  1010077.0 2019-03-06 07:39:37-05:00
    5        1.0  1010077.0 2019-03-06 07:40:38-05:00
    6        1.0  1010077.0 2019-03-06 07:41:38-05:00
    7        1.0  1010077.0 2019-03-06 07:42:39-05:00
    8        1.0  1010077.0 2019-03-06 07:43:40-05:00
    9        nan        nan 2019-03-06 07:44:41-05:00
    

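    What I tried:

    1. Create a new column called 'created_at_2' on both dataframes, using dt.round to round each timestamp to the minute, so that I can discard the seconds.
    2. Use merge.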

    main_df['created_at_2'] = main_df.created_at.dt.round('min')
    aux_df['created_at_2'] = aux_df.created_at.dt.round('min')

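    But this approach is not robust, as this example shows. When you round a timestamp such as 2019-03-06 07:40:33-05:00, you get minute 41 instead of minute 40. I need a column that is continuous minute by minute.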

    I can reformat the timestamps on the timeline with this:

    main_df.created_at.map(lambda t: t.strftime('%Y-%m-%d %H:%M'))
    aux_df.created_at.map(lambda t: t.strftime('%Y-%m-%d %H:%M'))
    final_df = pd.merge(main_df, aux_df, on=['created_at_2'], how='inner')
    

    Thanks in advance!

    1 answer

  • jezrael  · 6 years ago

    One idea is to use merge_asof, but the last row differs from the expected output:

    main_df['created_at'] = pd.to_datetime(main_df['created_at'])
    aux_df['created_at'] = pd.to_datetime(aux_df['created_at'])
    
    df = pd.merge_asof(aux_df[['created_at']], main_df, on=['created_at'])
    print (df)
                     created_at  value    feed_id
    0 2019-03-06 07:35:33-05:00    NaN        NaN
    1 2019-03-06 07:36:34-05:00    NaN        NaN
    2 2019-03-06 07:37:36-05:00    NaN        NaN
    3 2019-03-06 07:38:36-05:00    0.0  1010077.0
    4 2019-03-06 07:39:37-05:00    1.0  1010077.0
    5 2019-03-06 07:40:38-05:00    1.0  1010077.0
    6 2019-03-06 07:41:38-05:00    1.0  1010077.0
    7 2019-03-06 07:42:39-05:00    1.0  1010077.0
    8 2019-03-06 07:43:40-05:00    1.0  1010077.0
    9 2019-03-06 07:44:41-05:00    1.0  1010077.0
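
    As background on why the last row differs: by default `merge_asof` pairs each left row with the most recent right row at or before its key (`direction='backward'`), with no limit on how far back it looks. The sketch below re-types the sample data by hand and adds a helper column, `matched_at` (an illustrative name, not part of the original code), to show which main_df reading each aux_df timestamp picked up.

```python
import pandas as pd

# Sample data re-typed from the question.
main_df = pd.DataFrame({
    'value': [0.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    'feed_id': [1010077.0] * 6,
    'created_at': pd.to_datetime([
        '2019-03-06 07:38:18-05:00', '2019-03-06 07:39:26-05:00',
        '2019-03-06 07:40:33-05:00', '2019-03-06 07:41:41-05:00',
        '2019-03-06 07:42:49-05:00', '2019-03-06 07:43:56-05:00',
    ]),
})
# Only the timeline of aux_df is needed here.
aux_df = pd.DataFrame({
    'created_at': pd.to_datetime([
        '2019-03-06 07:35:33-05:00', '2019-03-06 07:36:34-05:00',
        '2019-03-06 07:37:36-05:00', '2019-03-06 07:38:36-05:00',
        '2019-03-06 07:39:37-05:00', '2019-03-06 07:40:38-05:00',
        '2019-03-06 07:41:38-05:00', '2019-03-06 07:42:39-05:00',
        '2019-03-06 07:43:40-05:00', '2019-03-06 07:44:41-05:00',
    ]),
})

# Hypothetical helper column: keep main_df's own timestamp visible
# after the merge, since the 'on' key comes from the left frame.
main_df['matched_at'] = main_df['created_at']

# Default direction='backward': last main_df row at or before each key.
df = pd.merge_asof(aux_df[['created_at']], main_df, on='created_at')
print(df[['created_at', 'matched_at', 'value']])
```

    In this data the final timestamp, 07:44:41, is paired with the 07:43:56 reading taken 45 seconds earlier, which is why it carries 1.0 rather than NaN. The match is by "latest earlier reading", not by "same minute", so 07:41:38 is paired with 07:40:33 as well.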
    

    Another is to use Series.dt.floor instead of round:

    main_df['created_at'] = pd.to_datetime(main_df['created_at'])
    aux_df['created_at'] = pd.to_datetime(aux_df['created_at'])
    main_df['created_at_2'] = main_df.created_at.dt.floor('min') 
    aux_df['created_at_2'] = aux_df.created_at.dt.floor('min') 
    
    df = pd.merge(aux_df[['created_at_2']], main_df, on=['created_at_2'], how='left')
    print (df)
                   created_at_2  value    feed_id                created_at
    0 2019-03-06 07:35:00-05:00    NaN        NaN                       NaT
    1 2019-03-06 07:36:00-05:00    NaN        NaN                       NaT
    2 2019-03-06 07:37:00-05:00    NaN        NaN                       NaT
    3 2019-03-06 07:38:00-05:00    0.0  1010077.0 2019-03-06 07:38:18-05:00
    4 2019-03-06 07:39:00-05:00    1.0  1010077.0 2019-03-06 07:39:26-05:00
    5 2019-03-06 07:40:00-05:00    1.0  1010077.0 2019-03-06 07:40:33-05:00
    6 2019-03-06 07:41:00-05:00    1.0  1010077.0 2019-03-06 07:41:41-05:00
    7 2019-03-06 07:42:00-05:00    1.0  1010077.0 2019-03-06 07:42:49-05:00
    8 2019-03-06 07:43:00-05:00    1.0  1010077.0 2019-03-06 07:43:56-05:00
    9 2019-03-06 07:44:00-05:00    NaN        NaN                       NaT
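
    To illustrate why `floor` works where `round` did not, and one possible way to get from the merged frame to the layout the question asks for, here is a minimal sketch. It re-types the sample data by hand. Restoring aux_df's original timestamps and dropping the helper column is an assumption about the desired last step; the answer above does not spell it out.

```python
import pandas as pd

ts = pd.Timestamp('2019-03-06 07:40:33-05:00')
print(ts.round('min'))  # 33 s is past the half minute, so this lands on 07:41
print(ts.floor('min'))  # floor always truncates, so this stays on 07:40

# Sample data re-typed from the question.
main_df = pd.DataFrame({
    'value': [0.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    'feed_id': [1010077.0] * 6,
    'created_at': pd.to_datetime([
        '2019-03-06 07:38:18-05:00', '2019-03-06 07:39:26-05:00',
        '2019-03-06 07:40:33-05:00', '2019-03-06 07:41:41-05:00',
        '2019-03-06 07:42:49-05:00', '2019-03-06 07:43:56-05:00',
    ]),
})
# Only the timeline of aux_df is needed here.
aux_df = pd.DataFrame({
    'created_at': pd.to_datetime([
        '2019-03-06 07:35:33-05:00', '2019-03-06 07:36:34-05:00',
        '2019-03-06 07:37:36-05:00', '2019-03-06 07:38:36-05:00',
        '2019-03-06 07:39:37-05:00', '2019-03-06 07:40:38-05:00',
        '2019-03-06 07:41:38-05:00', '2019-03-06 07:42:39-05:00',
        '2019-03-06 07:43:40-05:00', '2019-03-06 07:44:41-05:00',
    ]),
})

# Minute-level join key, truncated rather than rounded.
main_df['created_at_2'] = main_df['created_at'].dt.floor('min')
aux_df['created_at_2'] = aux_df['created_at'].dt.floor('min')

# Assumed final step: keep aux_df's timestamps, take value/feed_id
# from main_df where the minute matches, then drop the helper key.
final_df = (
    aux_df[['created_at', 'created_at_2']]
    .merge(main_df.drop(columns='created_at'), on='created_at_2', how='left')
    .drop(columns='created_at_2')
    [['value', 'feed_id', 'created_at']]
)
print(final_df)
```

    One caveat with this sketch: if main_df ever had two readings within the same minute, the left merge would produce two rows for that aux_df timestamp, so the result would no longer line up one-to-one with the aux_df timeline.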