代码之家  ›  专栏  ›  技术社区  ›  shockwave

Python:计算一段时间内数据帧中的累计量

  •  0
  • shockwave  · 技术社区  · 5 年前

    目标:计算自2020-01-01以来的累计收入。

    我有一个python字典,如下所示

    data = [{"game_id":"Racing","user_id":"ABC123","amt":5,"date":"2020-01-01"},
        {"game_id":"Racing","user_id":"ABC123","amt":1,"date":"2020-01-04"},
        {"game_id":"Racing","user_id":"CDE123","amt":1,"date":"2020-01-04"},
        {"game_id":"DH","user_id":"CDE123","amt":100,"date":"2020-01-03"},
        {"game_id":"DH","user_id":"CDE456","amt":10,"date":"2020-01-02"},
        {"game_id":"DH","user_id":"CDE789","amt":5,"date":"2020-01-02"},
        {"game_id":"DH","user_id":"CDE456","amt":1,"date":"2020-01-03"},
        {"game_id":"DH","user_id":"CDE456","amt":1,"date":"2020-01-03"}]
    

       game_id   user_id  amt  activity date
      'Racing', 'ABC123', 5,   '2020-01-01'
      'Racing', 'ABC123', 1,   '2020-01-04'
      'Racing', 'CDE123', 1,   '2020-01-04'
      'DH',     'CDE123', 100, '2020-01-03'
      'DH',     'CDE456', 10,  '2020-01-02'
      'DH', '    CDE789', 5,   '2020-01-02'
      'DH',     'CDE456', 1,   '2020-01-03'
      'DH',     'CDE456', 1,   '2020-01-03'
    

    年龄计算为交易日期和2020-01-01之间的差值。总付款人计数是每个游戏中付款人的数目。

    预期产量

    Game    Age    Cum_rev    Total_unique_payers_per_game
    Racing  0      5          2
    Racing  1      5          2
    Racing  2      5          2
    Racing  3      7          2
    DH      0      0          3
    DH      1      15         3
    DH      2      117        3
    DH      3      117        3
    

    0 回复  |  直到 5 年前
        1
  •  1
  •   rpanai    5 年前

    这里非常复杂的部分是填写日期。我用了一个应用程序,但我不确定这是最好的方式

    import pandas as pd
    
    data = [{"game_id":"Racing","user_id":"ABC123","amt":5,"date":"2020-01-01"},
            {"game_id":"Racing","user_id":"ABC123","amt":1,"date":"2020-01-04"},
            {"game_id":"Racing","user_id":"CDE123","amt":1,"date":"2020-01-04"},
            {"game_id":"DH","user_id":"CDE123","amt":100,"date":"2020-01-03"},
            {"game_id":"DH","user_id":"CDE456","amt":10,"date":"2020-01-02"},
            {"game_id":"DH","user_id":"CDE789","amt":5,"date":"2020-01-02"},
            {"game_id":"DH","user_id":"CDE456","amt":1,"date":"2020-01-03"},
            {"game_id":"DH","user_id":"CDE456","amt":1,"date":"2020-01-03"}]
    
    df = pd.DataFrame(data)
    # we want datetime not object
    df["date"] = df["date"].astype("M8[us]")
    
    # we will need to merge this at the end
    grp = df.groupby("game_id")['user_id']\
            .nunique()\
            .reset_index(name="Total_unique_payers_per_game")
    
    # sum amt per game_id date
    df = df.groupby(["game_id", "date"])["amt"].sum().reset_index()
    
    # dates from 2020-01-01 till the max date in df
    dates = pd.DataFrame({"date": pd.date_range("2020-01-01", df["date"].max())})
    
    # add missing dates
    def expand_dates(x):
        x = pd.merge(dates, x.drop("game_id", axis=1), how="left")
        x["amt"] = x["amt"].fillna(0)
        return x
    
    df = df.groupby("game_id")\
           .apply(expand_dates)\
           .reset_index().drop("level_1", axis=1)
    
    df["Cum_rev"] = df.groupby("game_id")['amt'].transform("cumsum")
    
    # this is equivalent as long as data is sorted
    # df["Cum_rev"] = df.groupby("game_id")['amt'].cumsum()
    
    # merge unique payers per game
    df = pd.merge(df, grp, how="left")
    
    # dates difference
    df["Age"] = "2020-01-01"
    df["Age"] = df["Age"].astype("M8[us]")
    df["Age"] = (df["date"]-df["Age"]).dt.days
    
    
    # then you can eventually filter
    df = df[["game_id", "Age", 
             "Cum_rev", "Total_unique_payers_per_game"]]\
           .rename(columns={"game_id":"Game"})