
Pandas: which threshold applies to each row?

  •  0
  • sds Niraj Rajbhandari  · Tech community  · 7 years ago

    Given a column of scores, e.g.,

    scores = pd.DataFrame({"score":np.random.randn(10)})
    

    and a set of thresholds

    thresholds = pd.DataFrame({"threshold":[0.2,0.5,0.8]},index=[7,13,33])
    

    I want to find the applicable threshold for each score, e.g.:

          score   threshold
     0 -1.613293   NaN
     1 -1.357980   NaN
     2  0.325720     7
     3  0.116000   NaN
     4  1.423171    33
     5  0.282557     7
     6 -1.195269   NaN
     7  0.395739     7
     8  1.072041    33
     9  0.197853   NaN
    

    In other words, for each score s I want the threshold label t such that

    t = max(t: thresholds.threshold[t] < s)
    

    How do I do that?

    PS. Based on a deleted answer:

    pd.cut(scores.score, bins=[-np.inf]+list(thresholds.threshold)+[np.inf],
           labels=["low"]+list(thresholds.index))
    
    3 answers  |  as of 7 years ago
        1
  •  2
  •   user3483203    7 years ago

    Using pd.cut:

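    # Bin edges are the thresholds plus an open upper edge (np.inf), so scores
    # above the top threshold get the last label; scores at or below the lowest
    # threshold fall outside every bin and come back as NaN.
    scores['threshold'] = pd.cut(
                             scores.score,
                             bins=thresholds.threshold.values.tolist() + [np.inf],
                             labels=thresholds.index.values
                          )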
    
          score threshold
    0 -1.613293       NaN
    1 -1.357980       NaN
    2  0.325720       7.0
    3  0.116000       NaN
    4  1.423171      33.0
    5  0.282557       7.0
    6 -1.195269       NaN
    7  0.395739       7.0
    8  1.072041      33.0
    9  0.197853       NaN
    
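For anyone who wants to reproduce the table above without random data, here is a minimal self-contained sketch. It assumes the ten scores printed in the question and uses `np.inf` as the open upper bin edge; the names `labels` and `result` are only illustrative.

```python
import numpy as np
import pandas as pd

# Scores copied from the question's example output, so the run is reproducible.
scores = pd.DataFrame({"score": [-1.613293, -1.357980, 0.325720, 0.116000, 1.423171,
                                 0.282557, -1.195269, 0.395739, 1.072041, 0.197853]})
thresholds = pd.DataFrame({"threshold": [0.2, 0.5, 0.8]}, index=[7, 13, 33])

# pd.cut builds right-closed bins (lo, hi] by default, so a score lands in the
# bin whose lower edge is the largest threshold strictly below it.
labels = pd.cut(
    scores["score"],
    bins=thresholds["threshold"].tolist() + [np.inf],
    labels=thresholds.index.tolist(),
)

# The result is categorical; scores at or below 0.2 are NaN.
result = scores.assign(threshold=labels)
print(result)
```

Because the bins are right-closed, a score exactly equal to a threshold is assigned to the bin below it, which matches the strict `<` in the question.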

    Timing: the cut approach in this answer is much faster than apply with digitize:

    scores = pd.DataFrame({"score":np.random.randn(10)})
    scores = pd.concat([scores]*10000)
    
    %timeit pd.cut(scores.score,thresholds.threshold.values.tolist() + [np.nan],labels=thresholds.index.values)
    4.41 ms ± 39.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    indeces = [None,] + thresholds.index.tolist()
    
    %timeit scores["score"].apply(lambda x: indeces[np.digitize(x, thresholds["threshold"])])
    1.64 s ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    


        2
  •  1
  •   koPytok    7 years ago

    Using np.digitize:

    indeces = [None,] + thresholds.index.tolist()
    scores["score"].apply(
        lambda x: indeces[np.digitize(x, thresholds["threshold"])])
    
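The per-row `apply` above calls `np.digitize` once per score. As a sketch of a possible alternative (not part of the original answer), `np.digitize` can be called once on the whole column. This assumes the scores from the question and that `right=True` is wanted so that the comparison is strict; `positions` and `lookup` are illustrative names.

```python
import numpy as np
import pandas as pd

# Scores copied from the question's example output.
scores = pd.DataFrame({"score": [-1.613293, -1.357980, 0.325720, 0.116000, 1.423171,
                                 0.282557, -1.195269, 0.395739, 1.072041, 0.197853]})
thresholds = pd.DataFrame({"threshold": [0.2, 0.5, 0.8]}, index=[7, 13, 33])

# One vectorized call: positions[i] is the number of thresholds strictly below
# scores[i] (right=True means bins[k-1] < x <= bins[k]).
positions = np.digitize(
    scores["score"].to_numpy(),
    thresholds["threshold"].to_numpy(),
    right=True,
)

# Position 0 means "below every threshold"; a leading NaN maps it to missing.
lookup = np.array([np.nan] + thresholds.index.tolist())
scores["threshold"] = lookup[positions]
print(scores)
```

With the default `right=False`, a score exactly equal to a threshold would be treated as having reached it, so the choice of `right` only matters on ties.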
        3
  •  0
  •   Ben.T    7 years ago

    Using merge_asof:

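    # merge_asof needs both sides sorted on their keys; thresholds already is.
    (pd.merge_asof(scores.reset_index().sort_values('score'),
                   thresholds.reset_index(),
                   left_on='score', right_on='threshold', suffixes=('', '_'))
         # drop the threshold value and keep its index label instead
         .drop(columns='threshold').rename(columns={'index_': 'threshold'})
         .set_index('index').sort_index())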
    

    With your data, you get:

              score  threshold
    index                     
    0     -1.613293        NaN
    1     -1.357980        NaN
    2      0.325720        7.0
    3      0.116000        NaN
    4      1.423171       33.0
    5      0.282557        7.0
    6     -1.195269        NaN
    7      0.395739        7.0
    8      1.072041       33.0
    9      0.197853        NaN
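
By default `merge_asof` also matches a score that is exactly equal to a threshold. If the strict `<` from the question matters, one option is `allow_exact_matches=False`. The sketch below assumes the scores from the question; the column name `threshold_label` is a hypothetical helper, not something from the original answer.

```python
import pandas as pd

# Scores copied from the question's example output.
scores = pd.DataFrame({"score": [-1.613293, -1.357980, 0.325720, 0.116000, 1.423171,
                                 0.282557, -1.195269, 0.395739, 1.072041, 0.197853]})
thresholds = pd.DataFrame({"threshold": [0.2, 0.5, 0.8]}, index=[7, 13, 33])

# merge_asof requires both frames to be sorted on their join keys.
left = scores.reset_index().sort_values("score")
# "threshold_label" is a hypothetical name for the thresholds' index labels.
right = thresholds.rename_axis("threshold_label").reset_index()

merged = pd.merge_asof(
    left,
    right,
    left_on="score",
    right_on="threshold",
    direction="backward",        # look for the nearest threshold below the score
    allow_exact_matches=False,   # strictly below, not equal
)

# Keep the label rather than the threshold value, then restore the row order.
result = (
    merged.drop(columns="threshold")
          .rename(columns={"threshold_label": "threshold"})
          .set_index("index")
          .sort_index()
)
print(result)
```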