
How do I get the column holding the third occurrence of a value in each row?

  •  1
  • Avi  · Tech Community  · 6 years ago

    I need to create a DataFrame with 10 float columns, and I need to make sure that each row has exactly 5 NaN values.

    The DataFrame I want to create:

    A    B    C    D    E    F     G    H     I    J
    1.0  NaN  2.0  NaN  NaN  NaN   NaN  5.0   6.0  7.0
    NaN  NaN  NaN  3.0  5.0  NaN   NaN  5.0   6.0  7.0
    1.0  2.0  3.0  5.0  8.0  NaN   NaN  NaN   NaN  NaN
    1.0  NaN  3.0  NaN  8.0  10.0  NaN  12.0  NaN  NaN
    

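    I want to create this kind of dataset, where each row has 5 NaN values and 5 valid values. I then want to return a Series holding, for each row, the label of the column in which the third NaN occurs.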

      Expected output:
      E (holds the 3rd NaN in the 1st row)
      C (holds the 3rd NaN in the 2nd row)
      H (holds the 3rd NaN in the 3rd row)
      G (holds the 3rd NaN in the 4th row)
    
    3 answers  |  as of 6 years ago
        1
  •  3
  •   BENY    6 years ago

    Use cumsum with argmax (the output below is for the DataFrame in the question):

    df.columns[np.argmax(df.isnull().cumsum(1).eq(3).values,1)]
    Out[788]: Index(['E', 'C', 'H', 'G'], dtype='object')
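To make the one-liner easier to follow, here is a step-by-step sketch on the DataFrame from the question. The intermediate names (`nan_count`, `is_third`, `pos`, `third_nan_col`) are illustrative only and do not come from the answer.

```python
import numpy as np
import pandas as pd

nan = np.nan
# The DataFrame from the question, typed in by hand
df = pd.DataFrame(
    [
        [1.0, nan, 2.0, nan, nan, nan, nan, 5.0, 6.0, 7.0],
        [nan, nan, nan, 3.0, 5.0, nan, nan, 5.0, 6.0, 7.0],
        [1.0, 2.0, 3.0, 5.0, 8.0, nan, nan, nan, nan, nan],
        [1.0, nan, 3.0, nan, 8.0, 10.0, nan, 12.0, nan, nan],
    ],
    columns=list("ABCDEFGHIJ"),
)

# Running count of NaNs in each row, left to right
nan_count = df.isnull().cumsum(axis=1)

# True from the cell holding the 3rd NaN up to (not including) the 4th NaN
is_third = nan_count.eq(3)

# argmax returns the position of the first True in each row
pos = np.argmax(is_third.values, axis=1)

third_nan_col = df.columns[pos]
print(list(third_nan_col))  # ['E', 'C', 'H', 'G']
```

The count stays at 3 for any valid values that follow the third NaN, so several cells in a row can be True. Taking the first True is what pins the result to the NaN itself.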
    

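    import numpy as np
    import pandas as pd

    # Random sample data: 4 rows of floats, then 5 randomly chosen cells per row set to NaN
    df=pd.DataFrame(np.random.randn(4, 10),columns=list('ABCDEFGHIJ'))
    for x in range(len(df)):
        df.iloc[x,np.random.choice(10, 5, replace=False)]=np.nan
    df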
    Out[783]: 
              A         B         C         D   E         F         G         H  \
    0  1.263644       NaN -0.427018       NaN NaN  0.160732  0.033323 -1.285068   
    1       NaN  2.713568 -0.964603  1.456543 NaN       NaN  0.201837  1.034501   
    2       NaN       NaN       NaN -0.262311 NaN  0.361472 -0.089562  0.478207   
    3       NaN  1.497916 -0.324090       NaN NaN       NaN  0.711363 -0.094587   
        I         J  
    0 NaN       NaN  
    1 NaN       NaN  
    2 NaN  0.944062  
    3 NaN -0.298129  
    
        2
  •  1
  •   Haleemur Ali    6 years ago

    Use isnull, cumsum along axis=1, and idxmax:

    (df.isnull().cumsum(axis=1) == 3).idxmax(axis=1)
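One caveat, which the question's data avoids because every row has 5 NaNs: on a row with fewer than three NaNs the comparison is False everywhere, and `idxmax` then returns the first column label rather than signalling "not found". A minimal sketch of a guard follows. The helper name `nth_nan_column` and the small `demo` frame are hypothetical and not part of the answer.

```python
import numpy as np
import pandas as pd

def nth_nan_column(df, n=3):
    """Label of the column holding the n-th NaN in each row, or NaN if the row has fewer."""
    hits = df.isnull().cumsum(axis=1).eq(n)
    # Keep the idxmax result only for rows where the n-th NaN actually exists
    return hits.idxmax(axis=1).where(hits.any(axis=1))

nan = np.nan
demo = pd.DataFrame(
    [
        [nan, 1.0, nan, nan],  # third NaN is in column D
        [nan, 2.0, 3.0, nan],  # only two NaNs in this row
    ],
    columns=list("ABCD"),
)
result = nth_nan_column(demo)
print(result)
```

Without the `where` guard, the second row would report `A`, which is wrong.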
    

    Test data can be generated with randn:

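    import string
    import numpy as np
    import pandas as pd
    from numpy.random import permutation, randn

    def get_matrix(rows, vals):
        # Each row: `vals` random floats plus `vals` NaNs, shuffled together
        return [permutation(np.append(randn(vals), [np.nan]*(vals))) for _ in range(rows)]

    # The labels must be passed as columns=; the second positional argument is the index
    df = pd.DataFrame(
        get_matrix(4,5), columns=list(string.ascii_uppercase[:2*5])
    )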
    
        3
  •  0
  •   YaOzI    6 years ago

    Timing the alternatives with %timeit:

    In [69]: df_cumsum = df.isna().cumsum(1) # The common base
    
    In [70]: %timeit df_cumsum == 3
    310 µs ± 7.89 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    In [71]: %timeit df_cumsum.eq(3) # WIN by slight advantage
    123 µs ± 2.06 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    In [72]: df_locate = df.isna().cumsum(1).eq(3) # To find the index
    
    In [73]: %timeit df_locate.idxmax(axis=1)
    206 µs ± 8.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)    
    
    In [74]: %timeit np.argmax(df_locate.values, 1) # WIN by enormous advantage
    9.63 µs ± 183 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
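For anyone repeating the comparison outside IPython, here is a rough sketch using the standard `timeit` module. The seed, the loop count of 1000, and the variable names are arbitrary assumptions, and absolute timings will vary with hardware and with pandas and NumPy versions. The script also checks that both approaches return the same labels.

```python
import timeit

import numpy as np
import pandas as pd

# Sample data: 4 x 10 random floats with exactly 5 NaNs per row
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((4, 10)), columns=list("ABCDEFGHIJ"))
for i in range(len(df)):
    df.iloc[i, rng.choice(10, 5, replace=False)] = np.nan

df_locate = df.isna().cumsum(axis=1).eq(3)

# Both approaches should pick the same column for every row
via_idxmax = df_locate.idxmax(axis=1)
via_argmax = df.columns[np.argmax(df_locate.values, axis=1)]
same = list(via_idxmax) == list(via_argmax)

# Total seconds for 1000 calls of each variant
t_idxmax = timeit.timeit(lambda: df_locate.idxmax(axis=1), number=1000)
t_argmax = timeit.timeit(lambda: np.argmax(df_locate.values, axis=1), number=1000)

print(f"same result: {same}")
print(f"idxmax: {t_idxmax:.4f} s, argmax: {t_argmax:.4f} s per 1000 calls")
```

The `argmax` variant returns integer positions, so it still needs the `df.columns[...]` lookup to produce labels. That lookup is not included in the timing above, matching the original measurement.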