
Python: Grouping values from different columns into time buckets

binning grouping pandas python

thedude · Tech Community · 7 years ago

Suppose you have this DataFrame:

Name    Item    Date    value1  value2
Marc    bike    21-Dec-17   7   1000
Marc    bike    05-Jan-18   9   2000
Marc    bike    27-Jul-18   4   500
John    house   14-Dec-17   4   500
John    house   02-Feb-18   6   500
John    house   07-Feb-18   8   1000
John    house   16-Feb-18   2   1000
John    house   05-Dec-21   7   1000
John    house   27-Aug-25   8   500
John    car     17-Apr-18   4   500
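
For anyone who wants to reproduce this, the table above could be built roughly like this. It is only a sketch: it assumes the Date strings follow the day, abbreviated month, two-digit year pattern shown (`%d-%b-%y`) and that month abbreviations are parsed in an English locale.

```python
import pandas as pd

# Sample data copied from the table above
df = pd.DataFrame({
    "Name": ["Marc"] * 3 + ["John"] * 7,
    "Item": ["bike"] * 3 + ["house"] * 6 + ["car"],
    "Date": ["21-Dec-17", "05-Jan-18", "27-Jul-18", "14-Dec-17", "02-Feb-18",
             "07-Feb-18", "16-Feb-18", "05-Dec-21", "27-Aug-25", "17-Apr-18"],
    "value1": [7, 9, 4, 4, 6, 8, 2, 7, 8, 4],
    "value2": [1000, 2000, 500, 500, 500, 1000, 1000, 1000, 500, 500],
})

# Parse strings such as "21-Dec-17" into real timestamps
# (assumed format: day, abbreviated month, two-digit year)
df["Date"] = pd.to_datetime(df["Date"], format="%d-%b-%y")
```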

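I want to put value1 and value2 into monthly buckets (every third Wednesday of the month, for the next 48 months) for each Name/Item combination.

So each combination (Marc/bike, John/house, John/car, ...) gets 49 time buckets, each holding the sum of value1 and value2 for that month.

The solution for John/house would look like this: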

Name    Item    TimeBucket  value1  value2
John    house   20-Dec-17   4   500
John    house   17-Jan-18   0   0
John    house   21-Feb-18   16  2500
John    house   21-Mar-18   0   0
John    house   18-Apr-18   0   0
John    house   …           0   0
John    house   17-Nov-21   0   0
John    house   15-Dec-21   7   1000
John    house   rest        8   500

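I don't get along well with pandas. The only solution I can think of is iterating over the DataFrame row by row, but I would really like to avoid that. Is there an elegant way to do this?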

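1 reply | 7 years ago

Cornflex 7 years ago

The problem boils down to three steps:

1. How do you find the third Wednesday of each month?

This may not be the most elegant solution, but you can pick out the third Wednesday of each month by masking a pandas DatetimeIndex that contains every day in the time frame.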

import itertools
from datetime import datetime

import pandas as pd

# generate a DatetimeIndex for all days in the relevant time frame
start = datetime(2017, 12, 1)
end = datetime(2022, 1, 31)
days = pd.date_range(start, end, freq='D')

# keep only the third Wednesday of each month
third_wednesdays = []
for year, month in itertools.product(range(2017, 2023), range(1, 13)):
    # weekday == 2 is Wednesday (Monday is 0)
    mask = (days.weekday == 2) & \
        (days.year == year) & \
        (days.month == month)
    # months outside the time frame have no days; [2] is the third Wednesday
    if mask.sum() >= 3:
        third_wednesdays.append(days[mask][2])
bucket_lower_bounds = pd.DatetimeIndex(third_wednesdays)

Convert the resulting list to a DatetimeIndex so that you can use it as the lower bounds of the bins in step 2.
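
As a side note, pandas also has an offset alias for this kind of date, so the loop is not strictly needed. A minimal sketch, assuming the same start and end dates as above; `WOM-3WED` is the pandas "week of month" alias for the third Wednesday:

```python
import pandas as pd

# "WOM-3WED" = week-of-month offset, third Wednesday of every month
bucket_lower_bounds = pd.date_range("2017-12-01", "2022-01-31", freq="WOM-3WED")
```

For this range that should give 50 dates, from 2017-12-20 to 2022-01-19, which can be used as the bin edges in the same way as the list built by hand.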

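2. How do you bin the values of the DataFrame?

Once you have the list of buckets as a DatetimeIndex, you can simply use pandas' cut function to assign each date to a bucket. Convert the date column to integers before passing it to cut, then convert the result back to dates (this assumes df['Date'] already holds datetimes, as in the output below):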

time_buckets = pd.to_datetime(
    pd.cut(
        x = pd.to_numeric(df['Date']), 
        bins = pd.to_numeric(bucket_lower_bounds), 
        labels = bucket_lower_bounds[:-1]
    )
)

The series time_buckets can then be assigned to a new column of the DataFrame:

df['TimeBucket'] = time_buckets
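
If the round trip through integers feels awkward, the same kind of lookup can be sketched with NumPy's `searchsorted` instead. This is not the method above, just an alternative under stated assumptions: `bounds` and `dates` are illustrative names, and each lower bound is treated as inclusive (a date falling exactly on a third Wednesday lands in the bucket that starts that day), which differs from the default right-closed intervals of `pd.cut`.

```python
import numpy as np
import pandas as pd

# Illustrative inputs: third-Wednesday bounds and a few dates from the question
bounds = pd.date_range("2017-12-01", "2022-01-31", freq="WOM-3WED")
dates = pd.Series(pd.to_datetime(["2017-12-14", "2017-12-21", "2018-02-16", "2025-08-27"]))

# Index of the last bound that is <= each date (-1 means before the first bucket)
pos = np.searchsorted(bounds.values, dates.values, side="right") - 1

# Positions 0 .. len(bounds)-2 are real buckets; the last bound only closes the final one
valid = (pos >= 0) & (pos < len(bounds) - 1)

# Look up the lower bound for valid rows and leave NaT for everything else
lower = bounds.values[np.clip(pos, 0, len(bounds) - 2)]
time_buckets = pd.Series(lower, index=dates.index).where(valid)
```

Rows before the first bound or after the last one come out as NaT, just like with the cut-based version.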

The result should look something like this (note that NaT stands for the "rest" bucket):

    Name    Item    Date    value1  value2  TimeBucket
0   Marc    bike    2017-12-21  7   1000    2017-12-20
1   Marc    bike    2018-01-05  9   2000    2017-12-20
2   Marc    bike    2018-07-27  4   500     2018-07-18
3   John    house   2017-12-14  4   500     NaT
4   John    house   2018-02-02  6   500     2018-01-17
5   John    house   2018-02-07  8   1000    2018-01-17
6   John    house   2018-02-16  2   1000    2018-01-17
7   John    house   2021-12-05  7   1000    2021-11-17
8   John    house   2025-08-27  8   500     NaT
9   John    car     2018-04-17  4   500     2018-03-21

3. How do you aggregate the binned DataFrame?

Now it is as simple as using groupby to get the sums for each combination of Name, Item, and bucket:

# select the value columns explicitly so the Date column is not summed
df.groupby(['Name','Item','TimeBucket'])[['value1','value2']].sum()

Result:

Name    Item    TimeBucket  value1  value2
John    car     2018-03-21  4       500
        house   2018-01-17  16      2500
                2021-11-17  7       1000
Marc    bike    2017-12-20  16      3000
                2018-07-18  4       500
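
The question also asks for empty buckets to show up with zeros. One way this might be done is to reindex the aggregated result against every (Name, Item) pair crossed with every bucket. The sketch below uses a shortened bucket range and a small hand-built `agg` frame purely for illustration; `agg`, `pairs`, `full_index`, and `filled` are names made up for this example.

```python
import pandas as pd

# Shortened bucket range for illustration (five third Wednesdays)
buckets = pd.date_range("2017-12-01", "2018-04-30", freq="WOM-3WED")

# Hand-built stand-in for the groupby result: only buckets that contained data
agg = pd.DataFrame(
    {"value1": [4, 16], "value2": [500, 3000]},
    index=pd.MultiIndex.from_tuples(
        [("John", "car", pd.Timestamp("2018-03-21")),
         ("Marc", "bike", pd.Timestamp("2017-12-20"))],
        names=["Name", "Item", "TimeBucket"],
    ),
)

# Every (Name, Item) pair that actually occurs, crossed with every bucket
pairs = agg.index.droplevel("TimeBucket").unique()
full_index = pd.MultiIndex.from_tuples(
    [(name, item, bucket) for name, item in pairs for bucket in buckets],
    names=["Name", "Item", "TimeBucket"],
)

# Buckets without data are filled with 0 instead of being absent
filled = agg.reindex(full_index, fill_value=0)
```

Only pairs that occur in the data are expanded, so a combination such as Marc/house is not created.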

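Unfortunately, NaT values are excluded from groupby. If you also need to sum those rows, the easiest way is probably to make sure your list of buckets contains at least one bucket for every date in the input range.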
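
Another option, sketched here under assumptions rather than taken from the steps above, is to turn the bucket into a text label and give the NaT rows an explicit "rest" label before grouping. The small frame below is a made-up stand-in for the binned DataFrame. Note that rows before the first bucket and rows after the last one end up in the same "rest" group.

```python
import pandas as pd

# Made-up stand-in for the binned frame: NaT marks rows outside all buckets
binned = pd.DataFrame({
    "Name": ["John", "John", "John"],
    "Item": ["house", "house", "house"],
    "value1": [4, 6, 8],
    "value2": [500, 500, 500],
    "TimeBucket": pd.to_datetime([None, "2018-01-17", None]),
})

# Format the bucket as text; NaT becomes missing, which is then labeled "rest"
label = binned["TimeBucket"].dt.strftime("%Y-%m-%d").fillna("rest")

# Group on the text label so the "rest" rows are no longer dropped
result = (
    binned.assign(TimeBucket=label)
    .groupby(["Name", "Item", "TimeBucket"])[["value1", "value2"]]
    .sum()
)
```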

Edit: Step 2 requires pandas version >= 0.18.1.