Finding the latest state in a DataFrame

state-management dataframe python-3.x

Gerry · Tech Community · 3 years ago

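A system is queried at random intervals to fetch its historical states. Each fetch is stamped with update_time and appended to the DataFrame df. Each fetch returns data for a particular date range, which can be estimated as the minimum and maximum of the timestamp column. The most recent fetch provides the most reliable information about the system for the period it covers. From df, I want to remove every row that belongs to an earlier update_time and falls in a period already covered by a more recent update_time.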
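To make the "date range per fetch" idea concrete, here is a minimal sketch on a hypothetical four-row miniature of df (the rows and the name fetch_ranges are assumptions for illustration, not part of the real data). It estimates each fetch's range as the min and max of timestamp within each update_time:

```python
import pandas as pd

# Hypothetical miniature of df: two fetches whose date ranges overlap.
df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2021-11-13', '2021-10-02', '2021-11-03', '2021-10-01']),
    'update_time': pd.to_datetime(['2022-01-03', '2022-01-03', '2021-11-15', '2021-11-15']),
})

# One row per fetch: the earliest and latest timestamp that fetch returned.
fetch_ranges = df.groupby('update_time')['timestamp'].agg(['min', 'max'])
print(fetch_ranges)
```

Here the newer fetch covers 2021-10-02 through 2021-11-13, so of the older fetch's rows only the one dated 2021-10-01 falls outside that range.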

I am considering the algorithm below. It produces the expected result, but it is very slow for large DataFrames:

import numpy as np
import pandas as pd
from datetime import datetime as dt

# Sample DataFrame
df = pd.DataFrame({
    'id': range(10),
    'name': ['John', 'James', 'Harry', 'Lilia', 'Rachel', 'Harry', 'Lilia', 'Stu', 'Lilia', 'Tom'],
    'timestamp': [dt(2022,1,3), dt(2021,12,26), dt(2021,11,13), dt(2021,11,3), dt(2021,10,2),
                  dt(2021,11,13), dt(2021,11,3), dt(2021,10,1), dt(2021,11,3), dt(2021,10,3)],
    'update_time': [dt(2022,1,3,0,0,12), dt(2022,1,3,0,0,12), dt(2022,1,3,0,0,12), dt(2022,1,3,0,0,12), dt(2022,1,3,0,0,12),
                    dt(2021,11,15), dt(2021,11,15), dt(2021,11,15), dt(2021,11,10), dt(2021,11,10)],
})
# Get unique update times sorted in descending order (newest fetch first).
update_times = np.sort(df['update_time'].unique())[::-1]
# Work on a copy so the loop does not overwrite the original df.
remaining = df.copy()
# Holds the desired output
df_output = pd.DataFrame()
for update_time in update_times:
    # Keep the rows of this fetch that have not been covered by a newer fetch.
    df_temp = remaining[remaining['update_time'] == update_time]
    df_output = pd.concat([df_output, df_temp], axis=0)
    # Only rows older than everything kept so far can still be added.
    remaining = remaining[remaining['timestamp'] < df_output['timestamp'].min()]

>>> df
   id    name  timestamp         update_time
0   0    John 2022-01-03 2022-01-03 00:00:12
1   1   James 2021-12-26 2022-01-03 00:00:12
2   2   Harry 2021-11-13 2022-01-03 00:00:12
3   3   Lilia 2021-11-03 2022-01-03 00:00:12
4   4  Rachel 2021-10-02 2022-01-03 00:00:12
5   5   Harry 2021-11-13 2021-11-15 00:00:00
6   6   Lilia 2021-11-03 2021-11-15 00:00:00
7   7     Stu 2021-10-01 2021-11-15 00:00:00
8   8   Lilia 2021-11-03 2021-11-10 00:00:00
9   9     Tom 2021-10-03 2021-11-10 00:00:00
>>> df_output
   id    name  timestamp         update_time
0   0    John 2022-01-03 2022-01-03 00:00:12
1   1   James 2021-12-26 2022-01-03 00:00:12
2   2   Harry 2021-11-13 2022-01-03 00:00:12
3   3   Lilia 2021-11-03 2022-01-03 00:00:12
4   4  Rachel 2021-10-02 2022-01-03 00:00:12
7   7     Stu 2021-10-01 2021-11-15 00:00:00

Are there any smart suggestions for doing this faster?
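One possible direction, offered as an unbenchmarked sketch rather than a definitive answer: the loop amounts to keeping a row only if its timestamp is earlier than the earliest timestamp of every strictly newer fetch. That threshold can be computed once per update_time with groupby, cummin, and shift, then mapped back onto the rows. The function name keep_latest_state and the intermediate names are hypothetical, and the sketch assumes the timestamp and update_time columns contain no missing values.

```python
import pandas as pd
from datetime import datetime as dt

def keep_latest_state(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows whose timestamp is already covered by a newer fetch."""
    # Earliest timestamp returned by each fetch, newest fetch first.
    fetch_start = (
        df.groupby('update_time')['timestamp'].min()
          .sort_index(ascending=False)
    )
    # Earliest timestamp covered by any strictly newer fetch
    # (NaT for the newest fetch, which nothing supersedes).
    covered_from = fetch_start.cummin().shift(1)
    # Broadcast the per-fetch threshold back onto the rows.
    threshold = df['update_time'].map(covered_from)
    keep = threshold.isna() | (df['timestamp'] < threshold)
    return df[keep]

# Same sample data as in the question.
df = pd.DataFrame({
    'id': range(10),
    'name': ['John', 'James', 'Harry', 'Lilia', 'Rachel',
             'Harry', 'Lilia', 'Stu', 'Lilia', 'Tom'],
    'timestamp': [dt(2022, 1, 3), dt(2021, 12, 26), dt(2021, 11, 13),
                  dt(2021, 11, 3), dt(2021, 10, 2), dt(2021, 11, 13),
                  dt(2021, 11, 3), dt(2021, 10, 1), dt(2021, 11, 3),
                  dt(2021, 10, 3)],
    'update_time': [dt(2022, 1, 3, 0, 0, 12)] * 5
                   + [dt(2021, 11, 15)] * 3
                   + [dt(2021, 11, 10)] * 2,
})

result = keep_latest_state(df)
print(result['id'].tolist())  # [0, 1, 2, 3, 4, 7]
```

On the sample data this keeps ids 0 to 4 and 7, the same rows as df_output. It preserves the original row order instead of regrouping by update_time, and it should avoid the Python-level loop and the repeated pd.concat calls, although the actual gain would need to be measured on the real data.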
