
Pandas melt on multi-index columns

  •  0
  • KOB  · Tech community  · 5 years ago

    | a  | b  | 2018 | 2018 | 2019 | 2019 |
    |    |    | jan  | feb  | jan  | feb  |
    ---------------------------------------
    | a1 | b1 | 0    | 1    | 2    | 3    |
    | a1 | b2 | 4    | 5    | 6    | 7    |
    | a2 | b1 | 8    | 9    | 10   | 11   |
    | a2 | b2 | 12   | 13   | 14   | 15   |
    

    I want to read the table above into a pandas DataFrame and then melt it into the following format:

    | a  | b  | year | month | value |
    ----------------------------------
    | a1 | b1 | 2018 | jan   | 0     |
    | a1 | b1 | 2018 | feb   | 1     |
    | a1 | b1 | 2019 | jan   | 2     |
    | a1 | b1 | 2019 | feb   | 3     |
    | a1 | b2 | 2018 | jan   | 4     |
    | a1 | b2 | 2018 | feb   | 5     |
    | a1 | b2 | 2019 | jan   | 6     |
    | a1 | b2 | 2019 | feb   | 7     |
    | a2 | b1 | 2018 | jan   | 8     |
    | a2 | b1 | 2018 | feb   | 9     |
    | a2 | b1 | 2019 | jan   | 10    |
    | a2 | b1 | 2019 | feb   | 11    |
    | a2 | b2 | 2018 | jan   | 12    |
    | a2 | b2 | 2018 | feb   | 13    |
    | a2 | b2 | 2019 | jan   | 14    |
    | a2 | b2 | 2019 | feb   | 15    |
    

    How can I do this?

    1 reply  |  as of 5 years ago
        1
  •  1
  •   Quant Christo    5 years ago

    If it were a plain DataFrame with single-level column names, the following should work:

    import pandas as pd


    df = pd.DataFrame({
        'a': ['a1', 'a1', 'a2', 'a2'],
        'b': ['b1', 'b2', 'b1', 'b2'],
        '2018 jan': [0, 4, 8, 12],
        '2018 feb': [1, 5, 9, 13],
        '2019 jan': [2, 6, 10, 14],
        '2019 feb': [3, 7, 11, 15],
    })

    # Wide -> long: each 'YYYY mon' column becomes one row per (a, b) pair.
    df = df.melt(id_vars=['a', 'b'], var_name='date', value_name='value')
    # Split the 'YYYY mon' label into a [year, month] list, then pick each part.
    df['date'] = df['date'].str.split(' ')
    df['year'] = df['date'].str[0]
    df['month'] = df['date'].str[1]
    df.drop(columns='date', inplace=True)


    Output:

         a   b  value  year month
    0   a1  b1      0  2018   jan
    1   a1  b2      4  2018   jan
    2   a2  b1      8  2018   jan
    3   a2  b2     12  2018   jan
    4   a1  b1      1  2018   feb
    5   a1  b2      5  2018   feb
    6   a2  b1      9  2018   feb
    7   a2  b2     13  2018   feb
    8   a1  b1      2  2019   jan
    9   a1  b2      6  2019   jan
    10  a2  b1     10  2019   jan
    11  a2  b2     14  2019   jan
    12  a1  b1      3  2019   feb
    13  a1  b2      7  2019   feb
    14  a2  b1     11  2019   feb
    15  a2  b2     15  2019   feb
    

    If, as mentioned in the comments, the columns are a MultiIndex, you can first flatten them so that you have a plain DataFrame:

    df = pd.read_csv('file.csv', header=[0,1])
    df.columns = [' '.join(col).strip() for col in df.columns.values]
    df.rename(columns={'a Unnamed: 0_level_1': 'a', 'b Unnamed: 1_level_1': 'b'}, inplace=True)
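The flatten-then-melt route can be sketched end to end as follows. This is only a minimal sketch: `csv_text` is a hypothetical inline stand-in for `file.csv`, and instead of the explicit `rename` it drops the auto-generated `Unnamed: ...` labels while joining the header levels, which is a small variation on the answer above.

```python
import io

import pandas as pd

# Hypothetical inline stand-in for 'file.csv' with a two-row header.
csv_text = """a,b,2018,2018,2019,2019
,,jan,feb,jan,feb
a1,b1,0,1,2,3
a1,b2,4,5,6,7
a2,b1,8,9,10,11
a2,b2,12,13,14,15
"""

df = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

# Join the two header levels with a space, skipping the placeholder
# labels pandas generates for empty second-row cells.
df.columns = [
    " ".join(part for part in col if not part.startswith("Unnamed"))
    for col in df.columns
]

# Wide -> long, then split 'YYYY mon' into two separate columns.
long_df = df.melt(id_vars=["a", "b"], var_name="date", value_name="value")
long_df[["year", "month"]] = long_df["date"].str.split(" ", expand=True)
long_df = long_df.drop(columns="date")[["a", "b", "year", "month", "value"]]
```

Note that `year` holds strings here, because the values come from header labels; convert with `astype(int)` if numeric years are needed.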
    
        2
  •  0
  •   david.abekasis    5 years ago

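    @KOB My answer works more generally for any CSV file with a two-row header, where some columns are labelled only in the first row and others are labelled in both rows. For your question, this code places all the headers as requested.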

    df_multiidx = pd.read_csv('two_levels_header_file.csv', header=[0, 1])
    # Columns labelled only in the first row get an auto-generated 'Unnamed: ...' second level.
    id_vars = [idv for idv in df_multiidx.columns if 'Unnamed' in idv[1]]
    value_vars = [valv for valv in df_multiidx.columns if 'Unnamed' not in valv[1]]
    df_multiidx = df_multiidx.melt(id_vars=id_vars, value_vars=value_vars, var_name=['year', 'month'])
    # rename returns a new DataFrame, so assign the result back.
    df_multiidx = df_multiidx.rename(columns={col_ren: col_ren[0] for col_ren in id_vars})
    

    Output:

         a   b  year month  value
    0   a1  b1  2018   jan      0
    1   a1  b2  2018   jan      4
    2   a2  b1  2018   jan      8
    3   a2  b2  2018   jan     12
    4   a1  b1  2018   feb      1
    5   a1  b2  2018   feb      5
    6   a2  b1  2018   feb      9
    7   a2  b2  2018   feb     13
    8   a1  b1  2019   jan      2
    9   a1  b2  2019   jan      6
    10  a2  b1  2019   jan     10
    11  a2  b2  2019   jan     14
    12  a1  b1  2019   feb      3
    13  a1  b2  2019   feb      7
    14  a2  b1  2019   feb     11
    15  a2  b2  2019   feb     15
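
The melted rows come out grouped by column (all of `2018 jan` first, and so on), whereas the question asks for rows grouped by `a` and `b`. A possible way to get that order is sketched below. It assumes an inline `csv_text` as a hypothetical stand-in for the CSV file, and the ordered categorical for `month` is an addition of this sketch, not part of the answer above.

```python
import io

import pandas as pd

# Hypothetical inline stand-in for the two-row-header CSV file.
csv_text = """a,b,2018,2018,2019,2019
,,jan,feb,jan,feb
a1,b1,0,1,2,3
a1,b2,4,5,6,7
a2,b1,8,9,10,11
a2,b2,12,13,14,15
"""

wide = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

# Same split as in the answer: id columns have no second-row label.
id_vars = [col for col in wide.columns if "Unnamed" in col[1]]
value_vars = [col for col in wide.columns if "Unnamed" not in col[1]]

long_df = wide.melt(id_vars=id_vars, value_vars=value_vars, var_name=["year", "month"])
long_df = long_df.rename(columns={col: col[0] for col in id_vars})

# Keep months in header order (jan, feb) instead of alphabetical order.
month_order = list(dict.fromkeys(col[1] for col in value_vars))
long_df["month"] = pd.Categorical(long_df["month"], categories=month_order, ordered=True)

# Sort into the row order requested in the question.
long_df = long_df.sort_values(["a", "b", "year", "month"]).reset_index(drop=True)
```

With the sample data, the `value` column then runs from 0 to 15 in order, matching the target table in the question. The `month` column becomes a categorical as a side effect; cast it back with `astype(str)` if plain strings are preferred.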