
DataFrame: take the union of columns and keep the first non-NaN value

  •  3
  • Gerry  · Tech Community  · 6 years ago

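    df has thousands of columns and rows. For a subset of columns given in a specific order, say columns B, C, E, I want to fill the NaN values in B with the first non-NaN value found in the remaining columns (C, E), searching them in order. Finally, C and E are dropped.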

    A sample df can be constructed as follows:

    import numpy as np
    import pandas as pd
    df = pd.DataFrame(10*(2+np.random.randn(6, 5)), columns=list('ABCDE'))
    df.loc[1, 'B'] = np.nan
    df.loc[2, 'B'] = np.nan
    df.loc[5, 'B'] = np.nan
    df.loc[2, 'C'] = np.nan
    df.loc[5, 'C'] = np.nan
    df.loc[2, 'D'] = np.nan
    df.loc[2, 'E'] = np.nan
    df.loc[4, 'E'] = np.nan
    df
               A         B          C          D          E
    0  18.161033  6.453597  25.253036  18.542586  20.667311
    1  27.629402       NaN  40.654821  22.804547  23.633502
    2  15.459256       NaN        NaN        NaN        NaN
    3  19.115203  4.002131  14.167508  23.796780  29.557706
    4  27.180622       NaN  20.763618  15.923794        NaN
    5  17.917170       NaN        NaN  21.865184   9.867743

    The expected result is:
    

               A         B         D
    0  18.161033  6.453597 18.542586
    1  27.629402 40.654821 22.804547
    2  15.459256       NaN       NaN
    3  19.115203  4.002131 23.796780
    4  27.180622 20.763618 15.923794
    5  17.917170  9.867743 21.865184
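
To make the requirement concrete, here is a minimal sketch on a small fixed frame. The values are made up for illustration (they are not the random sample above), and chaining `Series.combine_first` is just one way to spell out "take B, else C, else E"; it is not necessarily the fastest option for thousands of columns.

```python
import numpy as np
import pandas as pd

# Small, fixed stand-in for the random sample (illustrative values only).
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [10.0, np.nan, np.nan, np.nan],
    "C": [20.0, 21.0, np.nan, np.nan],
    "D": [30.0, 31.0, np.nan, 33.0],
    "E": [40.0, 41.0, np.nan, 43.0],
})

# Take B where present, otherwise C, otherwise E (in that order).
filled_b = df["B"].combine_first(df["C"]).combine_first(df["E"])

# Replace B with the filled values and discard the helper columns.
result = df.assign(B=filled_b).drop(columns=["C", "E"])
print(result)
```

A row where B, C and E are all NaN stays NaN in B, which matches row 2 of the expected result.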
    
    2 replies  |  as of 6 years ago
        1
  •  2
  •   BENY    6 years ago

    Here is one way:

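    drop = ['C', 'E']
    fill = 'B'
    # Map each column to a group label: C and E are relabelled as B
    d = dict(zip(df.columns, [fill if x in drop else x for x in df.columns.tolist()]))
    # first() keeps the first non-null value within each group of columns
    df.groupby(d, axis=1).first()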
    Out[172]: 
               A          B          D
    0  14.472915  30.598602  24.528571
    1  22.010242  22.215140  15.412039
    2   5.383674        NaN        NaN
    3  38.265940  24.746673  35.367622
    4  22.730089  20.244289  27.570413
    5  31.216037  15.496690   9.746814
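
The grouping idea above relies on `groupby(..., axis=1)`, which is deprecated since pandas 2.1. A possible equivalent, sketched here on a small made-up frame, is to group the transposed frame and transpose back. This assumes all columns share one numeric dtype; with mixed dtypes the transpose would produce `object` columns.

```python
import numpy as np
import pandas as pd

# Illustrative data only, not the sample from the question.
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0],
    "B": [np.nan, np.nan, 12.0],
    "C": [20.0, np.nan, 22.0],
    "D": [30.0, np.nan, 32.0],
    "E": [40.0, 41.0, np.nan],
})

drop = ["C", "E"]
fill = "B"

# Map every column label to its group: C and E are folded into B.
mapping = {col: (fill if col in drop else col) for col in df.columns}

# Group the rows of the transposed frame by that mapping; first() keeps
# the first non-null value per group, then transpose back.
result = df.T.groupby(mapping).first().T
print(result)
```

Note that with this approach the search order inside a group follows the column order of the frame itself, not the order of the `drop` list.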
    
        2
  •  2
  •   cs95 abhishek58g    6 years ago

    IIUC, use bfill to backfill across the columns, then drop:

    df.assign(B=df[['B', 'C', 'E']].bfill(axis=1)['B']).drop(['C', 'E'], axis=1)
    
               A          B          D
    0  18.161033   6.453597  18.542586
    1  27.629402  40.654821  22.804547
    2  15.459256        NaN        NaN
    3  19.115203   4.002131  23.796780
    4  27.180622  20.763618  15.923794
    5  17.917170   9.867743  21.865184
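
To see why this works, it can help to look at the intermediate frame. The sketch below uses a small made-up frame containing only the three columns of interest: `bfill(axis=1)` fills each NaN with the next non-NaN value to its right in the same row, so the leftmost column ends up holding the first non-NaN value per row.

```python
import numpy as np
import pandas as pd

# Illustrative values only; columns are listed in the desired search order.
sub = pd.DataFrame({
    "B": [1.0, np.nan, np.nan, np.nan],
    "C": [2.0, 5.0, np.nan, np.nan],
    "E": [3.0, 6.0, 9.0, np.nan],
})

# Each NaN is replaced by the next non-NaN value to its right in the same row.
filled = sub.bfill(axis=1)
print(filled)

# The leftmost column now holds the first non-NaN value of each row.
first_valid = filled["B"]
print(first_valid)
```

The search order is therefore set by the order of the column list passed to `df[[...]]`, not by the column order of the original frame.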
    

    Here is a more generalized version of the above:

    to_drop = ['C', 'E']
    upd = 'B'
    
    df.update(df[[upd, *to_drop]].bfill(axis=1)[upd])  # in-place
    df.drop(to_drop, axis=1)                           # not in-place, need to assign
    
               A          B          D
    0  18.161033   6.453597  18.542586
    1  27.629402  40.654821  22.804547
    2  15.459256        NaN        NaN
    3  19.115203   4.002131  23.796780
    4  27.180622  20.763618  15.923794
    5  17.917170   9.867743  21.865184
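
If the same operation is needed for several column groups, the steps above could be wrapped in a small function. The name `coalesce_columns` and its parameters are hypothetical, chosen only for this sketch; unlike `df.update`, it returns a new frame and leaves the input untouched.

```python
import numpy as np
import pandas as pd

def coalesce_columns(df, target, sources):
    """Hypothetical helper: fill NaN in `target` from `sources`, searched
    in the given order, then drop `sources`. Returns a new DataFrame."""
    ordered = [target, *sources]
    # After a row-wise backfill, `target` holds the first non-NaN value.
    filled = df[ordered].bfill(axis=1)[target]
    return df.assign(**{target: filled}).drop(columns=list(sources))

# Illustrative data only.
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0],
    "B": [10.0, np.nan, np.nan],
    "C": [20.0, 21.0, np.nan],
    "E": [40.0, 41.0, 42.0],
})

out_ce = coalesce_columns(df, "B", ["C", "E"])  # search C before E
out_ec = coalesce_columns(df, "B", ["E", "C"])  # search E before C
print(out_ce)
print(out_ec)
```

Passing the sources in a different order changes which value wins when more than one source column is populated, as row 1 shows.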