
How do I split a pandas DataFrame cell into multiple rows?

  • David 54321  · Tech Community  · 5 years ago

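    I have a pandas DataFrame in which one column of text strings contains comma-separated values. I want to split each CSV field and create a new row per entry (assume the CSV is clean and only needs to be split on ','). For example, a should become b: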

    In [7]: a
    Out[7]: 
        var1  var2
    0  a,b,c     1
    1  d,e,f     2
    
    In [8]: b
    Out[8]: 
      var1  var2
    0    a     1
    1    b     1
    2    c     1
    3    d     2
    4    e     2
    5    f     2
    

    When used along an axis, the .apply method seems to accept only a single row as its return value, and I can't get .transform to work.

    Example data:

    from pandas import DataFrame
    import numpy as np
    a = DataFrame([{'var1': 'a,b,c', 'var2': 1},
                   {'var1': 'd,e,f', 'var2': 2}])
    b = DataFrame([{'var1': 'a', 'var2': 1},
                   {'var1': 'b', 'var2': 1},
                   {'var1': 'c', 'var2': 1},
                   {'var1': 'd', 'var2': 2},
                   {'var1': 'e', 'var2': 2},
                   {'var1': 'f', 'var2': 2}])
    

    I know this won't work, because we lose the DataFrame metadata by going through numpy, but it should give you a sense of what I am trying to do:

    def fun(row):
        letters = row['var1']
        letters = letters.split(',')
        out = np.array([row] * len(letters))
        out['var1'] = letters
    a['idx'] = range(a.shape[0])
    z = a.groupby('idx')
    z.transform(fun)
    
        1
  •   Mykola Zotko    4 years ago

    In [55]: pd.concat([pd.Series(row['var2'], row['var1'].split(','))
                        for _, row in a.iterrows()]).reset_index()
    Out[55]: 
      index  0
    0     a  1
    1     b  1
    2     c  1
    3     d  2
    4     e  2
    5     f  2
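
    The output above still carries the default column names (index and 0). A minimal sketch of the same idea that also names the columns to match the desired frame b, assuming the example frame a from the question; the intermediate name pieces is only illustrative:

```python
import pandas as pd

a = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1},
                  {'var1': 'd,e,f', 'var2': 2}])

# One Series per input row: the split letters become the index,
# and the scalar var2 value is broadcast across them.
pieces = [pd.Series(row['var2'], index=row['var1'].split(','))
          for _, row in a.iterrows()]

# Name the index 'var1', then turn it into a column; the values become 'var2'.
b = (pd.concat(pieces)
       .rename_axis('var1')
       .reset_index(name='var2'))
print(b)
```

    Because this goes through iterrows, it is easy to read but will likely be slow on large frames.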
    

        2
  •   Hamza usman ghani    4 years ago

    A more generic vectorized function that works with multiple normal columns and multiple list columns:

    import numpy as np
    import pandas as pd
    
    def explode(df, lst_cols, fill_value='', preserve_index=False):
        # make sure `lst_cols` is list-like
        if (lst_cols is not None
            and len(lst_cols) > 0
            and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))):
            lst_cols = [lst_cols]
        # all columns except `lst_cols`
        idx_cols = df.columns.difference(lst_cols)
        # calculate lengths of lists
        lens = df[lst_cols[0]].str.len()
        # preserve original index values    
        idx = np.repeat(df.index.values, lens)
        # create "exploded" DF
        res = (pd.DataFrame({
                    col:np.repeat(df[col].values, lens)
                    for col in idx_cols},
                    index=idx)
                 .assign(**{col:np.concatenate(df.loc[lens>0, col].values)
                                for col in lst_cols}))
        # append those rows that have empty lists
        if (lens == 0).any():
            # at least one list in cells is empty
            # DataFrame.append was removed in pandas 2.0; pd.concat is equivalent here
            res = (pd.concat([res, df.loc[lens==0, idx_cols]], sort=False)
                      .fillna(fill_value))
        # revert the original index order
        res = res.sort_index()
        # reset index if requested
        if not preserve_index:        
            res = res.reset_index(drop=True)
        return res
    

    Demo:

    Multiple list columns (all list columns must have the same number of elements in each row):

    In [134]: df
    Out[134]:
       aaa  myid        num          text
    0   10     1  [1, 2, 3]  [aa, bb, cc]
    1   11     2         []            []
    2   12     3     [1, 2]      [cc, dd]
    3   13     4         []            []
    
    In [135]: explode(df, ['num','text'], fill_value='')
    Out[135]:
       aaa  myid num text
    0   10     1   1   aa
    1   10     1   2   bb
    2   10     1   3   cc
    3   11     2
    4   12     3   1   cc
    5   12     3   2   dd
    6   13     4
    

    Preserving the original index values:

    In [136]: explode(df, ['num','text'], fill_value='', preserve_index=True)
    Out[136]:
       aaa  myid num text
    0   10     1   1   aa
    0   10     1   2   bb
    0   10     1   3   cc
    1   11     2
    2   12     3   1   cc
    2   12     3   2   dd
    3   13     4
    

    Setup:

    df = pd.DataFrame({
     'aaa': {0: 10, 1: 11, 2: 12, 3: 13},
     'myid': {0: 1, 1: 2, 2: 3, 3: 4},
     'num': {0: [1, 2, 3], 1: [], 2: [1, 2], 3: []},
     'text': {0: ['aa', 'bb', 'cc'], 1: [], 2: ['cc', 'dd'], 3: []}
    })
    

    CSV column:

    In [46]: df
    Out[46]:
            var1  var2 var3
    0      a,b,c     1   XX
    1  d,e,f,x,y     2   ZZ
    
    In [47]: explode(df.assign(var1=df.var1.str.split(',')), 'var1')
    Out[47]:
      var1  var2 var3
    0    a     1   XX
    1    b     1   XX
    2    c     1   XX
    3    d     2   ZZ
    4    e     2   ZZ
    5    f     2   ZZ
    6    x     2   ZZ
    7    y     2   ZZ
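
    For comparison, pandas 0.25 and later ship a built-in DataFrame.explode method, which is separate from the custom explode helper defined in this answer. A minimal sketch, assuming such a pandas version and the same CSV-column frame as in the demo; the name out is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'var1': ['a,b,c', 'd,e,f,x,y'],
                   'var2': [1, 2],
                   'var3': ['XX', 'ZZ']})

out = (df.assign(var1=df['var1'].str.split(','))  # CSV string -> list per cell
         .explode('var1')                         # one row per list element
         .reset_index(drop=True))                 # drop the repeated original index
print(out)
```

    Leaving out reset_index keeps the repeated original index values, which is comparable to preserve_index=True in the custom helper.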
    

    Using this small trick, a CSV-like column can be converted to a list column:

    In [48]: df.assign(var1=df.var1.str.split(','))
    Out[48]:
                  var1  var2 var3
    0        [a, b, c]     1   XX
    1  [d, e, f, x, y]     2   ZZ
    

    UPDATE: a generic vectorized approach (also works for multiple columns):

    In [177]: df
    Out[177]:
            var1  var2 var3
    0      a,b,c     1   XX
    1  d,e,f,x,y     2   ZZ
    

    Solution:

    In [178]: lst_col = 'var1' 
    
    In [179]: x = df.assign(**{lst_col:df[lst_col].str.split(',')})
    
    In [180]: x
    Out[180]:
                  var1  var2 var3
    0        [a, b, c]     1   XX
    1  [d, e, f, x, y]     2   ZZ
    

    In [181]: pd.DataFrame({
         ...:     col:np.repeat(x[col].values, x[lst_col].str.len())
         ...:     for col in x.columns.difference([lst_col])
         ...: }).assign(**{lst_col:np.concatenate(x[lst_col].values)})[x.columns.tolist()]
         ...:
    Out[181]:
      var1  var2 var3
    0    a     1   XX
    1    b     1   XX
    2    c     1   XX
    3    d     2   ZZ
    4    e     2   ZZ
    5    f     2   ZZ
    6    x     2   ZZ
    7    y     2   ZZ
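
    The vectorized solution rests on two NumPy primitives: np.repeat stretches each "normal" column by the length of the list in the same row, and np.concatenate flattens the lists themselves. A small standalone sketch of just that step; the names lists, lens, repeated and flat are illustrative and not part of the answer:

```python
import numpy as np

lists = [['a', 'b', 'c'], ['d', 'e']]   # one list per original row
lens = [len(x) for x in lists]          # [3, 2]
var2 = np.array([1, 2])                 # a "normal" column

repeated = np.repeat(var2, lens)        # each value repeated len(list) times
flat = np.concatenate(lists)            # all list elements in row order

print(repeated)
print(flat)
```

    Both arrays end up with the same length (the sum of the list lengths), which is what lets them be combined into one DataFrame.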
    

    OLD answer:

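    Based on @AFinkelstein's solution, I wanted to make it a bit more general, so that it can be applied to a DF with more than two columns while being about as fast as AFinkelstein's solution: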

    In [2]: df = pd.DataFrame(
       ...:    [{'var1': 'a,b,c', 'var2': 1, 'var3': 'XX'},
       ...:     {'var1': 'd,e,f,x,y', 'var2': 2, 'var3': 'ZZ'}]
       ...: )
    
    In [3]: df
    Out[3]:
            var1  var2 var3
    0      a,b,c     1   XX
    1  d,e,f,x,y     2   ZZ
    
    In [4]: (df.set_index(df.columns.drop('var1').tolist())
       ...:    .var1.str.split(',', expand=True)
       ...:    .stack()
       ...:    .reset_index()
       ...:    .rename(columns={0:'var1'})
       ...:    .loc[:, df.columns]
       ...: )
    Out[4]:
      var1  var2 var3
    0    a     1   XX
    1    b     1   XX
    2    c     1   XX
    3    d     2   ZZ
    4    e     2   ZZ
    5    f     2   ZZ
    6    x     2   ZZ
    7    y     2   ZZ
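
    With expand=True, shorter rows are padded with missing values, and the chain above relies on stack() dropping them. As far as I can tell, whether stack() drops them depends on the pandas version, so the following sketch adds an explicit dropna(). It is a variation on the approach above rather than the original author's code, and the names other_cols, wide and tall are illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    [{'var1': 'a,b,c', 'var2': 1, 'var3': 'XX'},
     {'var1': 'd,e,f,x,y', 'var2': 2, 'var3': 'ZZ'}]
)

other_cols = df.columns.drop('var1').tolist()

# One column per split position; the shorter row is padded with missing values.
wide = df.set_index(other_cols)['var1'].str.split(',', expand=True)

tall = (wide.stack()
            .dropna()                            # remove padding explicitly
            .reset_index(level=-1, drop=True)    # drop the split-position level
            .rename('var1')
            .reset_index()
            .loc[:, df.columns])                 # restore original column order
print(tall)
```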