
For loop when extracting keywords using udpipe in R

  • antecessor  ·  7 years ago

    Let's start with a reproducible example, a character matrix named key:

    key <- structure(c("Make Professional Maps with QGIS and Inkscape", 
    "Gain the skills to produce original, professional, and aesthetically pleasing maps using free software", 
    "English", "Inkscape 101 for Beginners - Design Vector Graphics", 
    "Learn how to create and design vector graphics for free!", "English", 
    "Design & Create Vector Graphics With Inkscape 2016", "The Beginners Guide to designing and creating Vector Graphics with Inkscape. No Experience needed!", 
    "English", "Design a Logo for Free in Inkscape", "Learn from an award winning, published logo design professional!", 
    "English", "Inkscape - Beginner to Pro", "If you want to have a decent learning curve, you are new to the program or even in design, this course is for you.", 
    "English", "Creating 2D Textures in Inkscape", "A guide to creating colorful and interesting textures in inkscape.", 
    "English", "Vector Art in Inkscape - Icon Design | Make Vector Graphics", 
    "Learn Icon Design by creating Vector Graphics using the .SVG and PNG format with the Free Software Inkscape!", 
    "English", "Inkscape and Bootstrap 3 -> Responsive Web Design!", 
    "Design responsive websites using Free tools Inkscape and Bootstrap 3! Mood Boards and Style Tiles to Mobile First!", 
    "English"), .Dim = c(3L, 8L), .Dimnames = list(c("Title", "Short_Description", 
    "Language"), c("1", "2", "4", "5", "6", "9", "13", "15")))
    

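    I want to extract keywords from each column. For this, I am using udpipe in R.

    I am trying to do it with a for loop.

    Before starting, we create the model using English as the reference language (see this link for more info):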

    library(udpipe)
    ud_model <- udpipe_download_model(language = "english")
    ud_model <- udpipe_load_model(ud_model$file_model)
    

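    Ideally, my final output would be a data frame with 8 columns and as many rows as there are extracted keywords.

    I have tried two approaches:

    Approach 1: using dplyr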

    library(dplyr)
    keywords <- list()
    for(i in ncol(keywords_en_t)){
      keywords[[i]] <- keywords_en_t %>%
        udpipe_annotate(ud_model,s)
        as.data.frame()
    }
    

    Approach 2:

    key <- list()
    stats <- list()
    for(i in ncol(keywords_en_t)){
        key[[i]] <- as.data.frame(udpipe_annotate(ud_model, x = keywords_en_t[,i]))
        stats[[i]] <- subset(key[[i]], upos %in% "NOUN")
        stats <- txt_freq(x = stats$lemma)
    }
    

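    Output

    In both cases I either get errors or the output is not what I expected.

    As mentioned above, my expected output is a data frame with 8 columns, with the keywords as rows.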

    1 answer
  •  1
  •   Community Mohan Dere    5 years ago

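    There are a few issues. Your loop iterates over ncol(...), which is a single number, instead of 1:ncol(...) or seq_along, so the body runs only once. udpipe_annotate expects a character vector. If you just supply key[, 8], you also supply the dimnames to udpipe_annotate, which might produce keywords you do not want. In approach 1 you call udpipe_annotate(ud_model, s), but s is never defined. In approach 2 you assign to stats[[i]] and then immediately overwrite stats.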
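    To see the looping issue in isolation, here is a minimal base R sketch. The matrix m is a made-up stand-in for key, not the data from the question; the two loops only record which column indices they visit.

```r
# Toy 2 x 3 character matrix; `m` is a hypothetical stand-in for `key`.
m <- matrix(letters[1:6], nrow = 2)

# for (i in ncol(m)) iterates over the single value 3, so the body runs once.
visited_wrong <- integer(0)
for (i in ncol(m)) {
  visited_wrong <- c(visited_wrong, i)
}

# seq_len(ncol(m)) expands to 1, 2, 3, so every column is visited.
visited_right <- integer(0)
for (i in seq_len(ncol(m))) {
  visited_right <- c(visited_right, i)
}

print(visited_wrong)  # only the last column index
print(visited_right)  # all column indices
```

    seq_len(ncol(m)) behaves like 1:ncol(m) here, but it also yields an empty sequence when there are zero columns, whereas 1:0 would count down.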

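    To fix these issues, I first convert the data to a data.frame. Next I run the loop to build a list of vectors containing the keywords. After that I turn the list into a data.frame of keywords. That last part of the code accounts for the vectors having different lengths.

    You might want to check how you obtain the data, because having 3 columns ("Title", "Short_Description", "Language") and many rows would be more logical / tidier.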
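    As a rough illustration of that tidier layout, the sketch below transposes a small made-up matrix shaped like key (fields in rows, one column per course). The object names wide and tidy and the placeholder strings are assumptions for illustration only.

```r
# Small hypothetical matrix laid out like `key`: fields in rows, courses in columns.
wide <- matrix(
  c("Title A", "Description A", "English",
    "Title B", "Description B", "English"),
  nrow = 3,
  dimnames = list(c("Title", "Short_Description", "Language"), c("1", "2"))
)

# t() swaps rows and columns, giving one row per course and one column per field.
tidy <- as.data.frame(t(wide), stringsAsFactors = FALSE)

print(tidy)
```

    With this shape, each text field such as tidy$Short_Description is a single character vector covering all courses.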

    # Transform key into a data.frame. Now it is a matrix. 
    key <- as.data.frame(key, stringsAsFactors = FALSE)
    
    library(udpipe)
    # prevent downloading ud model if it already exists in the working directory
    ud_model <- udpipe_download_model(language = "english", overwrite = FALSE)
    ud_model <- udpipe_load_model(ud_model$file_model)
    
    # prepare list with correct length
    keywords <- vector(mode = "list", length = ncol(key))
    
    for(i in 1:ncol(key)){
      temp <- as.data.frame(udpipe_annotate(ud_model, x = key[, i]))
      keywords[[i]] <- temp$lemma[temp$upos == "NOUN"]
    }
    
    #transform list of vectors to data.frame. 
    # Use sapply because vectors are of different lengths.
    keywords <- as.data.frame(sapply(keywords, '[', seq(max(lengths(keywords)))), stringsAsFactors = FALSE)
    
    keywords
    
            V1        V2         V3     V4       V5       V6     V7      V8
    1    skill beginners  beginners   logo learning       2d Design     web
    2      map    design      guide  award    curve  Texture format  design
    3 software    Vector experience   logo  program    guide   <NA>  design
    4     <NA>  graphics       <NA> design   design  texture   <NA> website
    5     <NA>    vector       <NA>   <NA>   course inkscape   <NA>    tool
    6     <NA>   graphic       <NA>   <NA>     <NA>     <NA>   <NA>    <NA>
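
    The padding step can also be tried on its own, without a udpipe model. This is a minimal sketch with made-up keyword vectors (kw is hypothetical, not real udpipe output); it relies on the base R behaviour that indexing past the end of a vector returns NA.

```r
# Hypothetical keyword vectors of unequal length (not real udpipe output).
kw <- list(c("skill", "map", "software"), c("logo", "award"), "web")

# The longest vector determines the number of rows.
n <- max(lengths(kw))

# Indexing past the end of a vector yields NA, so each vector is padded to length n.
padded <- sapply(kw, `[`, seq_len(n))

# sapply returns a character matrix here; convert it to a data.frame.
result <- as.data.frame(padded, stringsAsFactors = FALSE)
print(result)
```

    Each list element becomes one column (V1, V2, ...), with NA filling the rows that the shorter vectors do not reach.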