I have a log table of changes. The symbol + denotes an addition and the symbol - denotes a deletion.
import pandas as pd
history = pd.DataFrame({
    "First": ["X", "X", "Y", "Y", "X", "X", "Y", "Z"],
    "Last": ["Y", "X", "Y", "Y", "X", "X", "Y", "A"],
    "Change": ["+", "+", "-", "+", "-", "+", "+", "-"],
    "Date": ["2022-05-01", "2024-05-01", "2024-06-01", "2024-06-01",
             "2024-05-03", "2024-05-02", "2024-06-02", "2024-06-01"],
})
history = history.sort_values(by=["Date", "Change"])
# sort needed to process the entries chronologically
This produces
First Last Change Date
0 X Y + 2022-05-01
1 X X + 2024-05-01
5 X X + 2024-05-02
4 X X - 2024-05-03
3 Y Y + 2024-06-01
2 Y Y - 2024-06-01
7 Z A - 2024-06-01
6 Y Y + 2024-06-02
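In the next step, I want to show only what is currently available.
- The last symbol for an item must be + for the item to be available.
- Available: +, ++, +-+, -+, --+, etc.
- Not available: -, ++-, +--, etc.
- An item is the combination of the First and Last columns.
- Entries are sorted by Date and Change.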
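The rule above can be checked against the listed examples with a small sketch. The helper name `is_available` is hypothetical and only illustrates the rule: an item's chronological change string counts as available exactly when it ends with +.

```python
def is_available(changes: str) -> bool:
    """Return True if the chronological change string ends with '+'."""
    return changes.endswith("+")


# Example sequences taken from the list above.
available_examples = ["+", "++", "+-+", "-+", "--+"]
unavailable_examples = ["-", "++-", "+--"]

assert all(is_available(c) for c in available_examples)
assert not any(is_available(c) for c in unavailable_examples)
```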
I implemented this logic with a loop, which is very slow. Roughly:
latest = {}
item_columns = [
    "First",
    "Last",
]
for _, row in history.iterrows():
    key = tuple(row[column] for column in item_columns)
    if row["Change"] == "+":
        latest[key] = row
    elif row["Change"] == "-" and key in latest:
        del latest[key]
available = pd.DataFrame(latest.keys(), columns=item_columns)
This produces the available items
First Last
0 X Y
1 Y Y
The problem is that the loop is slow for large tables; for example, the following takes 20 seconds:
latest = {}
item_columns = [
    "First",
    "Last",
]
duplicated = pd.concat([history.iloc[[1]]] * 50000, ignore_index=True)
history = pd.concat([history, duplicated], ignore_index=True)
for _, row in history.iterrows():
    key = tuple(row[column] for column in item_columns)
    if row["Change"] == "+":
        latest[key] = row
    elif row["Change"] == "-" and key in latest:
        del latest[key]
available = pd.DataFrame(latest.keys(), columns=item_columns)
Is there a way to speed this up?
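Since availability depends only on the last change recorded for each item, one possible direction is to drop the Python loop and keep just the final log entry per item. The following is a minimal sketch under that assumption, not a benchmarked solution. It rebuilds the small example table so that it runs on its own, and the name `last_entries` is introduced here for illustration.

```python
import pandas as pd

history = pd.DataFrame({
    "First": ["X", "X", "Y", "Y", "X", "X", "Y", "Z"],
    "Last": ["Y", "X", "Y", "Y", "X", "X", "Y", "A"],
    "Change": ["+", "+", "-", "+", "-", "+", "+", "-"],
    "Date": ["2022-05-01", "2024-05-01", "2024-06-01", "2024-06-01",
             "2024-05-03", "2024-05-02", "2024-06-02", "2024-06-01"],
})
item_columns = ["First", "Last"]

# Same chronological ordering as in the loop version.
history = history.sort_values(by=["Date", "Change"])

# Keep only the final log entry of each (First, Last) item.
last_entries = history.drop_duplicates(subset=item_columns, keep="last")

# An item is available only if that final entry is an addition.
available = (
    last_entries.loc[last_entries["Change"] == "+", item_columns]
    .reset_index(drop=True)
)
print(available)
```

On the example data this gives the same two items as the loop, (X, Y) and (Y, Y). On other data the row order may differ from the dictionary-based version, because rows here are ordered by each item's last entry rather than by dictionary insertion order.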