
R lime package for text data

  •  2
  • Lacri Mosa  · Tech Community  · 7 years ago

    I am exploring how to use R lime on a text dataset to explain black-box model predictions, and came across this example: https://cran.r-project.org/web/packages/lime/vignettes/Understanding_lime.html

    Dataset: https://drive.google.com/file/d/1-pzY7IQVyB_GmT5dT0yRx3hYzOFGrZSr/view?usp=sharing

    # Importing the dataset
    dataset_original = read.delim('Restaurant_Reviews.tsv', quote = '', stringsAsFactors = FALSE)
    
    # Cleaning the texts
    # install.packages('tm')
    # install.packages('SnowballC')
    library(tm)
    library(SnowballC)
    corpus = VCorpus(VectorSource(dataset_original$Review))
    corpus = tm_map(corpus, content_transformer(tolower))
    corpus = tm_map(corpus, removeNumbers)
    corpus = tm_map(corpus, removePunctuation)
    corpus = tm_map(corpus, removeWords, stopwords())
    corpus = tm_map(corpus, stemDocument)
    corpus = tm_map(corpus, stripWhitespace)
    
    # Creating the Bag of Words model
    dtm = DocumentTermMatrix(corpus)
    dtm = removeSparseTerms(dtm, 0.999)
    dataset = as.data.frame(as.matrix(dtm))
    dataset$Liked = dataset_original$Liked
    
    # Encoding the target feature as factor
    dataset$Liked = factor(dataset$Liked, levels = c(0, 1))
    
    # Splitting the dataset into the Training set and Test set
    # install.packages('caTools')
    library(caTools)
    set.seed(123)
    split = sample.split(dataset$Liked, SplitRatio = 0.8)
    training_set = subset(dataset, split == TRUE)
    test_set = subset(dataset, split == FALSE)
    
    library(caret)
    model <- train(Liked~., data=training_set, method="xgbTree")
    
    ######
    #LIME#
    ######
    library(lime)
    explainer <- lime(training_set, model)
    explanation <- explain(test_set[1:4,], explainer, n_labels = 1, n_features = 5)
    plot_features(explanation)
    

    What I get: https://www.dropbox.com/s/pf9dq0kba0d5flt/Udemy_NLP_Lime.jpeg?dl=0

    What I want (from a different dataset): https://www.dropbox.com/s/e1472i4yw1owmlc/DMT_A5_lime.jpeg?dl=0

    1 Answer  |  6 years ago

  •  1
  •   Sam S.    6 years ago

    I could not open the links you provided for the dataset and the output. However, I used the same link you provided, https://cran.r-project.org/web/packages/lime/vignettes/Understanding_lime.html. I used text2vec, since that is what the vignette uses, together with the xgboost package for classification; this works for me. To display more features, you may need to increase the value of n_features in the explain function, see https://www.rdocumentation.org/packages/lime/versions/0.4.0/topics/explain.

    library(lime)
    library(xgboost)  # the classifier
    library(text2vec) # used to build the BoW matrix
    
    # load data
    data(train_sentences, package = "lime")  # from lime 
    data(test_sentences, package = "lime")   # from lime
    
    # Tokenize data
    get_matrix <- function(text) {
      it <- text2vec::itoken(text, progressbar = FALSE)
    
      # use the following lines instead if you want to prune the vocabulary
      # vocab <- create_vocabulary(it, c(1L, 1L)) %>%
      #   prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
      # vectorizer <- vocab_vectorizer(vocab)
    
      # hash_vectorizer offers no vocabulary pruning, but it is very fast for big data
      vectorizer <- hash_vectorizer(hash_size = 2 ^ 10, ngram = c(1L, 1L))
      text2vec::create_dtm(it, vectorizer = vectorizer)
    }
    
    # BoW matrix generation
    # features should be the same for both dtm_train and dtm_test 
    dtm_train <- get_matrix(train_sentences$text)
    dtm_test  <- get_matrix(test_sentences$text) 
    
    # xgboost for classification
    param <- list(max_depth = 7,
                  eta = 0.1,
                  objective = "binary:logistic",
                  eval_metric = "error",
                  nthread = 1)
    
    xgb_model <- xgboost::xgb.train(
      param,
      xgb.DMatrix(dtm_train, label = train_sentences$class.text == "OWNX"),
      nrounds = 100
    )
    
    # prediction
    predictions <- predict(xgb_model, dtm_test) > 0.5
    test_labels <- test_sentences$class.text == "OWNX"
    
    # Accuracy
    print(mean(predictions == test_labels))
    
    # what are the most important words for the predictions.
    n_features <- 5 # number of features to display
    sentence_to_explain <- head(test_sentences[test_labels,]$text, 6)
    explainer <- lime::lime(sentence_to_explain, model = xgb_model,
                            preprocess = get_matrix)
    explanation <- lime::explain(sentence_to_explain, explainer, n_labels = 1,
                                 n_features = n_features)
    
    # inspect the explanation data frame (columns 2 to 9)
    explanation[, 2:9]
    
    # plot
    lime::plot_features(explanation)
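
    If you want more words shown per sentence, increase n_features in the explain call, for example (the value 10 below is arbitrary, just to illustrate):

    # ask lime for more features per case (10 is just an example value)
    explanation <- lime::explain(sentence_to_explain, explainer, n_labels = 1,
                                 n_features = 10)
    lime::plot_features(explanation)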
    

    In your code, the line below creates NAs when it is applied to the training dataset. Please check the following code.

    dataset$Liked = factor(dataset$Liked, levels = c(0, 1))
    

    Removing the levels argument, or changing levels to labels, worked for me.
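
    A minimal sketch of both variants (the label names "No"/"Yes" below are placeholders, not taken from your data):

    # option 1: let factor() infer the levels from the data
    dataset$Liked = factor(dataset$Liked)
    # option 2: map the 0/1 levels to labels (label names are arbitrary)
    dataset$Liked = factor(dataset$Liked, levels = c(0, 1), labels = c("No", "Yes"))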

    Please also check the structure of your data and make sure it is not a zero matrix because of those NAs, and that it is not too sparse. That can cause problems as well, because lime cannot find the top n features.
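
    For example, a quick sanity check along those lines (a sketch using the objects from your own script):

    # how many NAs did the factor conversion introduce?
    sum(is.na(dataset$Liked))
    # how sparse is the document-term matrix?
    dim(dtm)
    mean(as.matrix(dtm) == 0)  # proportion of zero entries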