
How do I get the column where a value occurs for the third time in a given row?

pandas python-3.x python

Avi · Tech Community · 6 years ago

I need to create a DataFrame with 10 columns (floats), and I need to make sure that each row has exactly 5 NaN values.

The DataFrame I want to create:

A    B    C    D    E    F     G    H     I    J
1.0  NaN  2.0  NaN  NaN  NaN   NaN  5.0   6.0  7.0
NaN  NaN  NaN  3.0  5.0  NaN   NaN  5.0   6.0  7.0
1.0  2.0  3.0  5.0  8.0  NaN   NaN  NaN   NaN  NaN
1.0  NaN  3.0  NaN  8.0  10.0  NaN  12.0  NaN  NaN

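I want to create this kind of dataset, where each row has 5 NaN values and 5 valid values. I then want to return a Series that holds, for each row, the name of the column containing the third NaN value in that row.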

  Expected output:
  E (holds the 3rd NaN value in the 1st row)
  C (holds the 3rd NaN value in the 2nd row)
  H (holds the 3rd NaN value in the 3rd row)
  G (holds the 3rd NaN value in the 4th row)

3 answers | latest 6 years ago

BENY 6 years ago

Use cumsum together with argmax:

df.columns[np.argmax(df.isnull().cumsum(1).eq(3).values,1)]
Out[788]: Index(['E', 'C', 'H', 'G'], dtype='object')
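
To see why this one-liner works, the intermediate steps can be traced on the example frame from the question. This is only a sketch: the variable names (running_nan_count, is_third, third_nan_col) are illustrative and do not come from the answer.

```python
import numpy as np
import pandas as pd

nan = np.nan
# The example frame from the question: 5 NaN values per row.
df = pd.DataFrame(
    [
        [1.0, nan, 2.0, nan, nan, nan, nan, 5.0, 6.0, 7.0],
        [nan, nan, nan, 3.0, 5.0, nan, nan, 5.0, 6.0, 7.0],
        [1.0, 2.0, 3.0, 5.0, 8.0, nan, nan, nan, nan, nan],
        [1.0, nan, 3.0, nan, 8.0, 10.0, nan, 12.0, nan, nan],
    ],
    columns=list("ABCDEFGHIJ"),
)

# Step 1: running count of NaN values along each row (left to right).
running_nan_count = df.isnull().cumsum(axis=1)

# Step 2: True wherever the running count equals 3. This is True from the
# third NaN up to (but not including) the fourth NaN.
is_third = running_nan_count.eq(3)

# Step 3: argmax returns the position of the first True in each row,
# which is the column of the third NaN.
third_nan_col = df.columns[np.argmax(is_third.values, axis=1)]
print(list(third_nan_col))  # ['E', 'C', 'H', 'G']
```

The running count stays at 3 across any valid values that follow the third NaN, so several columns can be True in step 2. Taking the first one is what pins the result to the NaN itself.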

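Sample data, with 5 NaN values placed at random positions in each row:

import numpy as np
import pandas as pd

df=pd.DataFrame(np.random.randn(4, 10),columns=list('ABCDEFGHIJ'))
# In every row, overwrite 5 distinct, randomly chosen columns with NaN
for x in range(len(df)):
    df.iloc[x,np.random.choice(10, 5, replace=False)]=np.nan
df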
Out[783]: 
          A         B         C         D   E         F         G         H  \
0  1.263644       NaN -0.427018       NaN NaN  0.160732  0.033323 -1.285068   
1       NaN  2.713568 -0.964603  1.456543 NaN       NaN  0.201837  1.034501   
2       NaN       NaN       NaN -0.262311 NaN  0.361472 -0.089562  0.478207   
3       NaN  1.497916 -0.324090       NaN NaN       NaN  0.711363 -0.094587   
    I         J  
0 NaN       NaN  
1 NaN       NaN  
2 NaN  0.944062  
3 NaN -0.298129
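
For larger frames the row-by-row loop could be replaced by a single NumPy operation. The following is a sketch under my own assumptions, not part of the answer: it uses np.random.default_rng with an arbitrary seed, and names such as nan_cols and df_random are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable
n_rows, n_cols, n_nan = 4, 10, 5

values = rng.standard_normal((n_rows, n_cols))

# Sorting independent random keys gives every row its own random permutation
# of the column positions; the first n_nan positions are the ones to blank out.
nan_cols = np.argsort(rng.random((n_rows, n_cols)), axis=1)[:, :n_nan]

# Write NaN into the chosen positions of each row in place.
np.put_along_axis(values, nan_cols, np.nan, axis=1)

df_random = pd.DataFrame(values, columns=list("ABCDEFGHIJ"))
print(df_random.isnull().sum(axis=1).tolist())  # [5, 5, 5, 5]
```

Because each row's positions come from a permutation, they are distinct, so every row ends up with exactly n_nan NaN values.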

Haleemur Ali 6 years ago

Use isnull, then cumsum along axis=1, then idxmax:

(df.isnull().cumsum(axis=1) == 3).idxmax(axis=1)
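
One caveat worth keeping in mind: the question guarantees 5 NaN values per row, but on data where a row has fewer than 3, idxmax (and argmax) falls back to the first column, because no entry in that row is True. A possible guard is sketched below; df_short, unguarded and guarded are hypothetical names used only for this illustration.

```python
import numpy as np
import pandas as pd

nan = np.nan
df_short = pd.DataFrame(
    [
        [nan, 1.0, nan, nan, 2.0],  # third NaN is in column D
        [1.0, nan, 2.0, nan, 3.0],  # only two NaN values in this row
    ],
    columns=list("ABCDE"),
)

is_third = df_short.isnull().cumsum(axis=1) == 3

# Without a guard, the second row reports "A" although it has no third NaN.
unguarded = is_third.idxmax(axis=1)

# Keep a label only for rows whose running count actually reaches 3;
# the other rows become NaN.
guarded = unguarded.where(is_third.any(axis=1))
print(unguarded.tolist())  # ['D', 'A']
```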

To generate the test matrix with randn:

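import string
import numpy as np
import pandas as pd
from numpy.random import permutation, randn

def get_matrix(rows, vals):
    # Each row: `vals` random floats plus `vals` NaN values, shuffled together
    return [permutation(np.append(randn(vals), [np.nan]*(vals))) for _ in range(rows)]

# The labels must be passed as columns=; the second positional argument
# of pd.DataFrame is the index, which would not match the 4 rows
df = pd.DataFrame(
    get_matrix(4,5), columns=list(string.ascii_uppercase[:2*5])
)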

YaOzI 6 years ago

Timing the variants above with %timeit:

In [69]: df_cumsum = df.isna().cumsum(1) # The common base

In [70]: %timeit df_cumsum == 3
310 µs ± 7.89 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [71]: %timeit df_cumsum.eq(3) # WIN by slight advantage
123 µs ± 2.06 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [72]: df_locate = df.isna().cumsum(1).eq(3) # To find the index

In [73]: %timeit df_locate.idxmax(axis=1)
206 µs ± 8.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [74]: %timeit np.argmax(df_locate.values, 1) # WIN by enormous advantage
9.63 µs ± 183 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
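
Outside IPython, a similar comparison could be run with the standard timeit module. This is a rough sketch rather than a benchmark: the seed, the repeat count n_calls and the variable names are my own choices, and absolute numbers will vary by machine and library version. It also checks that both lookups return the same labels before timing them.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
values = rng.standard_normal((4, 10))
for row in values:
    # Each row is a view into `values`, so this blanks 5 columns in place.
    row[rng.choice(10, 5, replace=False)] = np.nan
df = pd.DataFrame(values, columns=list("ABCDEFGHIJ"))

df_locate = df.isna().cumsum(axis=1).eq(3)

# Both lookups should agree on the column of the third NaN in every row.
via_idxmax = df_locate.idxmax(axis=1)
via_argmax = df.columns[np.argmax(df_locate.values, axis=1)]
same_labels = list(via_idxmax) == list(via_argmax)

n_calls = 200  # small count so the sketch finishes quickly
t_idxmax = timeit.timeit(lambda: df_locate.idxmax(axis=1), number=n_calls)
t_argmax = timeit.timeit(lambda: np.argmax(df_locate.values, axis=1), number=n_calls)

print(f"labels agree: {same_labels}")
print(f"idxmax: {t_idxmax / n_calls * 1e6:.1f} µs per call")
print(f"argmax: {t_argmax / n_calls * 1e6:.1f} µs per call")
```

Note that the argmax variant returns integer positions, so mapping them back through df.columns is still needed to get labels.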