代码之家  ›  专栏  ›  技术社区  ›  bbbb

矢量化或在大型数据帧上使用多重处理[重复]

  •  1
  • bbbb  · 技术社区  · 1 年前

    我有一个奇怪的数据框架,其中有高达第30天的理论存款量,所以有30个三元组的列。我想把所有的数据收集到一列中,在那里我有所有的日期和存款,无论它们发生在该玩家的第1天、第2天还是第X天。我尝试了以下代码,它给出了正确的输出,但接管了>在大型数据集上运行120分钟:

    import pandas as pd
    
    # Sample DataFrame with Day 1 and Day 2 data
    data = {
        'Day_date_1': ['2024-01-01', '2024-01-02'],
        'Day_date_1_week_year': ['2024-W01', '2024-W01'],
        'Day1_deposit': [100, 200],
        'Day_date_2': ['2024-01-02', '2024-01-03'],
        'Day_date_2_week_year': ['2024-W01', '2024-W01'],
        'Day2_deposit': [150, 250],
        'PLAYER_ID': [1, 2],
    }
    
    
    date_columns = [f'Day_date_{i}' for i in range(1, 3)]  # Includes Day 1 and Day 2
    deposit_columns = [f'Day{i}_deposit' for i in range(1, 3)]  # Includes Day 1 and Day 2
    deposit_week_columns = [f'Day_date_{i}_week_year' for i in range(1, 3)]  # Includes Day 1 and Day 2
    
    # Empty DataFrame to store results
    result_df = pd.DataFrame()
    
    # Processing the DataFrame
    for date_col, week_col, value_col in zip(date_columns, deposit_week_columns, deposit_columns):
        temp_df = df[[date_col, week_col, value_col, "PLAYER_ID"]]
        temp_df.columns = ["deposit_date", 'deposit_week', "deposit_amount", "PLAYER_ID"]
        result_df = result_df.append(temp_df, ignore_index=True)
    
    result_df['deposit_date'] = pd.to_datetime(result_df['deposit_date'])
    result_deposit = result_df.groupby(['PLAYER_ID', 'deposit_date', 'deposit_week'])['deposit_amount'].mean().reset_index().rename(columns={'deposit_amount': 'mean_deposit_amount'})
    
    # Output the result
    print(result_deposit)
    
    

    有没有办法将循环矢量化或通过多处理加快处理速度?

    我想要的输出如下:

    PLAYER_ID   deposit_date    deposit_week    mean_deposit_amount
    1              2024-01-01     2024-W01            100.0
    1              2024-01-02     2024-W01            150.0
    2              2024-01-02     2024-W01            200.0
    2              2024-01-03     2024-W01            250.0
    
    
    1 回复  |  直到 1 年前
        1
  •  2
  •   mozway    1 年前

    我会用 pandas.wide_to_long 对列名进行一些预处理和后处理:

    out = (
     pd.wide_to_long(df.set_axis(df.columns.str.replace(r'_?(\d+)_(week_year|deposit)',
                                                        r'_\2_\1', regex=True),
                                 axis=1),
                     stubnames=['Day_date', 'Day_date_week_year', 'Day_deposit'],
                     i='PLAYER_ID', j='j',
                     sep='_')
       .reset_index('PLAYER_ID')
       .rename(columns={'Day_date': 'deposit_date',
                        'Day_date_week_year': 'deposit_week',
                        'Day_deposit': 'mean_deposit_amount',
                       })
       .sort_values(by=['PLAYER_ID', 'deposit_date'])
       .reset_index(drop=True)
    )
    

    输出:

       PLAYER_ID deposit_date deposit_week  mean_deposit_amount
    0          1   2024-01-01     2024-W01                  100
    1          1   2024-01-02     2024-W01                  150
    2          2   2024-01-02     2024-W01                  200
    3          2   2024-01-03     2024-W01                  250