
Converting topicmodels output to JSON

  •  0
  • Monica Muller  · 9 years ago

    I am using the following function to convert topicmodels output to JSON so that it can be used with LDAvis.

    topicmodels_json_ldavis <- function(fitted, corpus, doc_term){
         ## Required packages
         library(topicmodels)
         library(dplyr)
         library(stringi)
         library(tm)
         library(LDAvis)
    
         ## Find required quantities
         phi <- posterior(fitted)$terms %>% as.matrix
         theta <- posterior(fitted)$topics %>% as.matrix
         vocab <- colnames(phi)
         doc_length <- vector()
         for (i in 1:length(corpus)) {
              temp <- paste(corpus[[i]]$content, collapse = ' ')
              doc_length <- c(doc_length, stri_count(temp, regex = '\\S+'))
         }
         temp_frequency <- inspect(doc_term)
         freq_matrix <- data.frame(ST = colnames(temp_frequency),
                                   Freq = colSums(temp_frequency))
         rm(temp_frequency)
    
         ## Convert to json
         json_lda <- LDAvis::createJSON(phi = phi, theta = theta,
                                        vocab = vocab,
                                        doc.length = doc_length,
                                        term.frequency = freq_matrix$Freq)
    
         return(json_lda)
    }
    

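    But I get the following error:

    Error in LDAvis::createJSON(phi = phi, theta = theta, vocab = vocab, doc.length = doc_length, : Length of doc.length not equal to the number of rows in theta; both should be equal to the number of documents in the data.

    Here is my complete code: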

    data <- read.csv("textmining.csv")
    
    
    corpus <- Corpus(DataframeSource(data.frame(data$reasonforleaving))) 
    
    # Remove punctuations and numbers because they are generally uninformative.
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    # Convert all words to lowercase.
    corpus <- tm_map(corpus, content_transformer(tolower))
    # Remove stopwords such as "a", "the", etc.
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    # Use the SnowballC package to do stemming.
    library(SnowballC)
    corpus <- tm_map(corpus, stemDocument)
    
    
    # remove extra words
    toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
    corpus <- tm_map(corpus, toSpace, "still")
    corpus <- tm_map(corpus, toSpace, "also")
    
    # Remove excess white spaces between words.
    
    corpus <- tm_map(corpus, stripWhitespace)
    # Inspect the first document to see what it looks like.
    corpus[[1]]$content 
    
    dtm <- DocumentTermMatrix(corpus)
    
    # remove empty documents
    library(slam)
    dtm = dtm[row_sums(dtm)>0,]
    
    # Use topicmodels package to conduct LDA analysis.
    
    burnin <- 500
    iter <- 1000
    keep <- 30
    k <- 5
    
    result55 <- LDA(dtm, 5)
    ldaoutput = topicmodels_json_ldavis(result55,corpus, dtm)
    

    Do you know why I am getting this error?

    Thanks

    2 answers  |  as of 9 years ago
        1
  •  6
  •   Léo Joubert    8 years ago

    I ran into the same problem with the same code and found this function here :

    topicmodels2LDAvis <- function(x, ...){
        post <- topicmodels::posterior(x)
        if (ncol(post[["topics"]]) < 3) stop("The model must contain > 2 topics")
        mat <- x@wordassignments
        LDAvis::createJSON(
            phi = post[["terms"]], 
            theta = post[["topics"]],
            vocab = colnames(post[["terms"]]),
            doc.length = slam::row_sums(mat, na.rm = TRUE),
            term.frequency = slam::col_sums(mat, na.rm = TRUE)
        )
    }
    

    It is much simpler to use; just pass the LDA result as the argument:

    result55 <- LDA(dtm, 5)
    serVis(topicmodels2LDAvis(result55))
    
        2
  •  0
  •   Eugene    9 years ago

    Problem

    Your problem is with for (i in 1:length(corpus)) in:

     doc_length <- vector()
         for (i in 1:length(corpus)) {
              temp <- paste(corpus[[i]]$content, collapse = ' ')
              doc_length <- c(doc_length, stri_count(temp, regex = '\\S+'))
         }
    

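    Remember that you removed some "empty" documents from the DocumentTermMatrix with dtm = dtm[row_sums(dtm)>0,] , so the vector built by this loop over the full corpus is too long: it has one entry per original document, while theta has one row per document that was kept.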
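
    The mismatch can be reproduced without any text-mining packages. What follows is a minimal sketch in base R, assuming a toy count matrix (documents in rows, terms in columns) as a stand-in for the DocumentTermMatrix. The names counts, kept, n_docs_in_corpus and n_rows_theta are illustrative and do not come from the original code.

```r
# Toy document-term counts: 4 documents, 3 terms (hypothetical data).
counts <- matrix(
  c(2, 0, 1,
    0, 0, 0,   # document 2 is empty after preprocessing
    1, 3, 0,
    0, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("leav", "job", "pay"))
)

# Same idea as dtm[row_sums(dtm) > 0, ]: drop documents with no terms.
kept <- counts[rowSums(counts) > 0, , drop = FALSE]

# The model is fitted on the filtered matrix, so theta gets nrow(kept) rows,
# while a loop over the unfiltered corpus produces one length per original document.
n_docs_in_corpus <- nrow(counts)
n_rows_theta     <- nrow(kept)

# The two counts differ, which is the condition createJSON complains about.
lengths_match <- n_docs_in_corpus == n_rows_theta
print(c(corpus = n_docs_in_corpus, theta = n_rows_theta))
```

    In this sketch a single empty document is enough to make the two lengths disagree, so the same would be expected whenever the row filter removes anything.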

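    Suggestion

    You may want to keep a vector marking the empty documents, since it helps not only with generating the JSON but also with going back and forth between the filtered document set and the full one.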
    doc.length = colSums( as.matrix(tdm) > 0 )[!empty.docs]

    My suggestion assumes you have the full tdm, with the empty documents still in it.
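
    One way to apply this idea is sketched below in base R, assuming the documents are rows of a plain count matrix. The toy counts matrix and the names empty.docs, doc_length, term_frequency and original_index are illustrative. Document length is taken here as the total token count per document, a slightly different quantity from the expression above, which counts distinct terms per document.

```r
# Toy document-term counts: 4 documents, 3 terms (hypothetical data).
counts <- matrix(
  c(2, 0, 1,
    0, 0, 0,   # empty document
    1, 3, 0,
    0, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("leav", "job", "pay"))
)

# Logical vector marking which of the original documents are empty.
empty.docs <- rowSums(counts) == 0

# Quantities restricted to the documents the model would actually see.
doc_length     <- rowSums(counts)[!empty.docs]
term_frequency <- colSums(counts[!empty.docs, , drop = FALSE])

# Positions of the kept documents in the full set, for mapping results back.
original_index <- which(!empty.docs)

# Sanity check: one length per kept document.
stopifnot(length(doc_length) == sum(!empty.docs))
print(doc_length)
```

    Keeping empty.docs around means the same mask can be reused for anything else that has to line up with the fitted model, such as document metadata.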