
How does caret calculate sensitivity and specificity in resamples?

  • zesla  ·  6 years ago

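    Recently, while using the caret package to fit a model, I noticed that the sensitivity and specificity reported in the resample element of the train object differ from the values I compute by hand for each fold.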

    library(caret)
    data("GermanCredit")
    form = as.formula('credit_risk~amount+savings+installment_rate+age+housing+number_credits')
    train.control <- trainControl(method="cv", 
                               number=5,
                               summaryFunction = twoClassSummary,
                               classProbs = TRUE,
                               savePredictions='all')
    rf = train(form, data=GermanCredit,  method = 'rf',
               metric = 'ROC', trControl=train.control)
    
    print(rf$resample)
    

    This gives:

    ROC         Sens        Spec        Resample
    0.6239881   0.9428571   0.13333333  Fold1   
    0.6603571   0.9714286   0.08333333  Fold2   
    0.6622619   0.9642857   0.06666667  Fold5   
    0.6502381   0.9928571   0.10000000  Fold4   
    0.7072619   0.9714286   0.16666667  Fold3
    

    As you can see, for Fold1 the sensitivity and specificity are 0.94 and 0.13, respectively.

    Now, if we take the held-out predictions for Fold1 and compute the metrics with confusionMatrix, we get the following:

    resamp.1 = rf$pred %>% filter(Resample=='Fold1')
    cm=confusionMatrix(resamp.1$pred, resamp.1$obs)
    print(cm) 
    
    Confusion Matrix and Statistics
    
              Reference
    Prediction good bad
          good  366 135
          bad    54  45
    
                   Accuracy : 0.685          
                     95% CI : (0.6462, 0.722)
        No Information Rate : 0.7            
        P-Value [Acc > NIR] : 0.8018         
    
                      Kappa : 0.1393         
     Mcnemar's Test P-Value : 5.915e-09      
    
                Sensitivity : 0.8714         
                Specificity : 0.2500         
             Pos Pred Value : 0.7305         
             Neg Pred Value : 0.4545         
                 Prevalence : 0.7000         
             Detection Rate : 0.6100         
       Detection Prevalence : 0.8350         
          Balanced Accuracy : 0.5607         
    
           'Positive' Class : good
    

    Am I doing something wrong, or is caret doing something different? Thanks.

    1 answer
  •   nadizan    6 years ago

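    Note that the data set loaded by data(GermanCredit) does not contain the variables used in form. Calling set.seed() would also make the example reproducible.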

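    However, the issue here is that you need to take mtry into account. train tunes over several mtry values, and with savePredictions='all' rf$pred keeps the held-out predictions for every one of them. That is why the confusion matrix in the question counts 600 predictions for a fold that holds only 200 observations. See the documentation and code here.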
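
The effect of pooling predictions across tuning values can be reproduced without fitting a model. The following is a minimal base-R sketch: the data frame `pred_all` and its values are made up and only mimic the layout of rf$pred (columns pred, obs, mtry, Resample), and `sens_spec` is a hypothetical helper, not a caret function.

```r
# Hypothetical stand-in for rf$pred: one fold, 10 held-out rows, 2 mtry values.
obs <- factor(c(rep("Bad", 4), rep("Good", 6)), levels = c("Bad", "Good"))
pred_all <- data.frame(
  obs  = rep(obs, times = 2),
  pred = factor(c(
    "Bad", "Good", "Good", "Good", rep("Good", 6),       # predictions at mtry = 2
    "Bad", "Bad", "Bad", "Good", "Bad", rep("Good", 5)   # predictions at mtry = 7
  ), levels = c("Bad", "Good")),
  mtry     = rep(c(2, 7), each = 10),
  Resample = "Fold1"
)

# Sensitivity and specificity, treating the first factor level as positive.
sens_spec <- function(d) {
  tab <- table(Prediction = d$pred, Reference = d$obs)
  c(Sens = tab[1, 1] / sum(tab[, 1]),
    Spec = tab[2, 2] / sum(tab[, 2]))
}

pooled   <- sens_spec(pred_all)                         # mixes both mtry values
filtered <- sens_spec(pred_all[pred_all$mtry == 2, ])   # a single model only

print(nrow(pred_all))  # twice the number of observations in the fold
print(pooled)
print(filtered)
```

The pooled values describe no single model, so they should not be expected to match any row of a per-fold summary.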

    I adjusted the formula to match the variables in GermanCredit so that the example runs as is:

    library(caret)
    library(dplyr)  # provides %>% and filter(), used below
    data("GermanCredit")
    form = as.formula('Class~Amount+SavingsAccountBonds.lt.100+SavingsAccountBonds.100.to.500+
                      SavingsAccountBonds.500.to.1000+SavingsAccountBonds.gt.1000+SavingsAccountBonds.Unknown+
                      InstallmentRatePercentage+Age+Housing.ForFree+Housing.Own+Housing.Rent+NumberExistingCredits')
    train.control <- trainControl(method="cv", 
                                  number=5,
                                  summaryFunction = twoClassSummary,
                                  classProbs = TRUE,
                                  savePredictions='all')
    
    set.seed(100)
    rf <- train(form, data=GermanCredit,  method = 'rf',
               metric = 'ROC', trControl=train.control)
    

    Printing rf shows that the value of mtry selected for the final model is mtry = 2.

    > rf
    Random Forest 
    
    1000 samples
      12 predictor
       2 classes: 'Bad', 'Good' 
    
    No pre-processing
    Resampling: Cross-Validated (5 fold) 
    Summary of sample sizes: 800, 800, 800, 800, 800 
    Resampling results across tuning parameters:
    
      mtry  ROC        Sens        Spec     
       2    0.6465714  0.06333333  0.9842857
       7    0.6413214  0.31333333  0.8571429
      12    0.6358214  0.31666667  0.8385714
    
    ROC was used to select the optimal model using the largest value.
    The final value used for the model was mtry = 2.
    

    So by filtering rf$pred to mtry == 2 as well as to the fold, you get the expected result.

    resamp.1 <- rf$pred %>% filter(Resample=='Fold1' & mtry == 2)
    cm <- confusionMatrix(resamp.1$pred, resamp.1$obs)
    print(cm) 
    Confusion Matrix and Statistics
    
              Reference
    Prediction Bad Good
          Bad    7    5
          Good  53  135
    
                   Accuracy : 0.71            
                     95% CI : (0.6418, 0.7718)
        No Information Rate : 0.7             
        P-Value [Acc > NIR] : 0.4123          
    
                      Kappa : 0.1049          
     Mcnemar's Test P-Value : 6.769e-10       
    
                Sensitivity : 0.1167          
                Specificity : 0.9643          
             Pos Pred Value : 0.5833          
             Neg Pred Value : 0.7181          
                 Prevalence : 0.3000          
             Detection Rate : 0.0350          
       Detection Prevalence : 0.0600          
          Balanced Accuracy : 0.5405          
    
           'Positive' Class : Bad  
    
     cm$byClass[1:2] == rf$resample[1,2:3]
      Sens Spec
      TRUE TRUE
    

    Edit:

    You can also verify this by inspecting rf$resampledCM, which holds the confusion-matrix cell counts for each mtry value and fold.
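
As a rough sketch of that check, the per-fold metrics can be recomputed from the cell counts alone. The data frame below is a hypothetical stand-in for one row of rf$resampledCM, filled with the Fold1, mtry = 2 counts from the confusion matrix above. The column names cell1 to cell4 and their column-by-column ordering are assumptions about the layout, so compare them against your own object before relying on this.

```r
# Hypothetical stand-in for one row of rf$resampledCM (Fold1, mtry = 2).
# Assumed layout: cell1..cell4 are the counts of table(pred, obs) read
# column by column, so cell1 and cell2 fall under observed "Bad".
resampled_cm <- data.frame(cell1 = 7, cell2 = 53, cell3 = 5, cell4 = 135,
                           mtry = 2, Resample = "Fold1")

fold_metrics <- with(resampled_cm, data.frame(
  Resample = Resample,
  mtry     = mtry,
  Sens     = cell1 / (cell1 + cell2),  # Bad predicted as Bad / all observed Bad
  Spec     = cell4 / (cell3 + cell4)   # Good predicted as Good / all observed Good
))

print(fold_metrics)
```

Under these assumptions the Sens and Spec values agree with the confusionMatrix output for Fold1 at mtry = 2.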