
Extracting text using regex in R

r
  •  1
  • pranav nerurkar  · 7 years ago

    I have read in a text file containing the following data and am trying to convert it into a data frame:

    Id:   1
    ASIN: 0827229534
      title: Patterns of Preaching: A Sermon Sampler
      group: Book
      salesrank: 396585
      similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X
      reviews: total: 2  downloaded: 2  avg rating: 5
    

    Desired data frame, with columns and sample values:

    Id | ASIN       | title                                   | group | similar    | avg rating
    1  | 0827229534 | Patterns of Preaching: A Sermon Sampler | Book  | 0804215715 | 5
    

    Code:

    text <- readLines("https://raw.githubusercontent.com/pranavn91/PhD/master/Expt/sample.txt")
    ids <- gsub('Id:\\s+', '', text)
    ASIN <- gsub('ASIN:\\s+', '', text)
    title <- gsub('title:\\s+', '', text)
    group <- gsub('group:\\s+', '', text)
    similar <- gsub('similar:\\s+', '', text)
    rating <- gsub('avg rating:\\s+', '', text)
    

    This does not work: every call returns the complete text file as output.
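
    The likely reason is that gsub only replaces the parts that match and returns every non-matching element unchanged, so each call gives back all lines of the file. A minimal sketch of one way around this, assuming the record layout shown in the question (the helper name extract_field is made up for illustration and is not part of any package):

```r
# Sample lines copied from the question (one record).
text <- c(
  "Id:   1",
  "ASIN: 0827229534",
  "  title: Patterns of Preaching: A Sermon Sampler",
  "  group: Book",
  "  salesrank: 396585",
  "  similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X",
  "  reviews: total: 2  downloaded: 2  avg rating: 5"
)

# gsub() leaves non-matching elements untouched, so the result still has 7 lines.
length(gsub("Id:\\s+", "", text))

# Hypothetical helper: keep only the lines that contain the label,
# then strip everything up to and including the label.
extract_field <- function(lines, label) {
  pattern <- paste0("^.*", label, ":\\s+")
  hits <- grep(pattern, lines, value = TRUE)
  sub(pattern, "", hits)
}

ids    <- extract_field(text, "Id")
ASIN   <- extract_field(text, "ASIN")
title  <- extract_field(text, "title")
rating <- extract_field(text, "avg rating")
```

    Filtering with grep before substituting is what keeps unrelated lines out of each result. The sketch would misfire if a title itself contained one of the labels followed by a colon.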

    5 Answers
        1
  •  3
  •   acylam    7 years ago

    Here is a different approach, using separate_rows and spread to reshape the text file into a data frame:

    text = readLines(path_to_textfile)
    
    library(dplyr)
    library(tidyr)
    
    data.frame(text = text) %>%
      separate_rows(text, sep = "(?<=\\d)\\s+(?=[a-z])") %>%
      extract(text, c("title", "value"), regex = "(?i)([a-z]+):(.+)") %>%
      filter(!title %in% c("reviews", "downloaded")) %>%
      group_by(title) %>%
      mutate(id = 1:n()) %>%
      spread(title, value) %>%
      select(-id)
    

    Result:

             ASIN group   Id rating salesrank
    1  0827229534  Book    1      5    396585
    2    12412441  Book    2     10   4225352
                                                             similar
    1  5  0804215715  156101074X  0687023955  0687074231  082721619X
    2                                         1241242 1412414 124124
                                         title
    1  Patterns of Preaching: A Sermon Sampler
    2                                Patterns2
    

    Data:

    Id:   1
    ASIN: 0827229534
      title: Patterns of Preaching: A Sermon Sampler
      group: Book
      salesrank: 396585
      similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X
      reviews: total: 2  downloaded: 2  avg rating: 5
    Id:   2
    ASIN: 12412441
      title: Patterns2
      group: Book
      salesrank: 4225352
      similar: 1241242 1412414 124124
      reviews: total: 2  downloaded: 2  avg rating: 10
    

    Note:

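    Leave an extra blank line at the end of the text file. Otherwise readLines complains about an incomplete final line when it reads the file.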
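
    As far as I know, what readLines reports for a missing final newline is a warning rather than a hard failure, and the lines are still returned. A small sketch that reproduces it with a temporary file (the file contents are just two lines from the sample data):

```r
# Write a two-line file WITHOUT a trailing newline.
path <- tempfile(fileext = ".txt")
cat("Id:   1\nASIN: 0827229534", file = path)

# Capture the condition instead of letting it print.
msg <- tryCatch(
  {
    readLines(path)
    "no warning"
  },
  warning = function(w) conditionMessage(w)
)

# The lines can still be read; warn = FALSE only silences the message.
lines <- readLines(path, warn = FALSE)
```

    If editing the file is not an option, warn = FALSE is an alternative to adding the blank line.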

        2
  •  2
  •   cirofdo    7 years ago

    Edit: corrected my answer.

    Using stringr:

    library(stringr)
    
    ids <- str_extract(text, 'Id:[ ]*\\S+')
    ASIN <- str_extract(text, 'ASIN:[ ]*\\S+')
    title <- str_extract(text, 'title:[ ]*\\S+')
    group <- str_extract(text, 'group:[ ]*\\S+')
    similar <- str_extract(text, 'similar:[ ]*\\S+')
    rating <- str_extract(text, 'avg rating:[ ]*\\S+')
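
    One caveat with these patterns: \\S+ stops at the first whitespace, so a multi-word title is cut after its first word. A sketch of the difference, using base R's regmatches and regexpr as a stand-in for str_extract so that it runs without extra packages (the sample line is from the question):

```r
line <- "  title: Patterns of Preaching: A Sermon Sampler"

# Same pattern as above: the value ends at the first space.
first_word <- regmatches(line, regexpr("title:[ ]*\\S+", line))

# Matching to the end of the line keeps the whole title.
whole <- regmatches(line, regexpr("title:[ ]*.+", line))

# Drop the label to keep just the value.
value <- sub("^title:[ ]*", "", whole)
```

    The same applies to the similar field, where only the first token after the label would be captured.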
    
        3
  •  2
  •   SeGa    7 years ago

    This is just a start. Since I am not an expert, I will leave the magic to others. :)

    You could define a rule for each field and apply it like this:

    ids <- do.call(rbind, regmatches(regexec(pattern = 'Id:\\s+', text = text), x = text))
    ASIN <- do.call(rbind, regmatches(regexec(pattern = 'ASIN:\\s+', text = text), x = text))
    title <- do.call(rbind, regmatches(regexec(pattern = 'title:\\s+', text = text), x = text))
    

    Or define one generic rule that should work for every line, like this:

    sapply(text,  FUN = function(x) {
      regmatches(x, regexec(text = x, pattern = "([^:]+)"))
      })
    
    sapply(text,  FUN = function(x) {
      regmatches(x, regexec(text = x, pattern = "(:.*)"))
    })
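
    The two generic patterns above can be combined into a single regex with two capture groups, so each line yields a name and a value in one pass. This is only a sketch of that idea on a few lines from the sample data, not a complete parser:

```r
text <- c(
  "Id:   1",
  "ASIN: 0827229534",
  "  title: Patterns of Preaching: A Sermon Sampler",
  "  group: Book"
)

# Group 1: everything before the first colon (the field name).
# Group 2: everything after it (the value), leading blanks excluded.
m <- regmatches(text, regexec("^\\s*([^:]+):\\s*(.*)$", text))

# Each list element is c(full match, name, value); stack them into a matrix.
parts <- do.call(rbind, m)
fields <- setNames(parts[, 3], parts[, 2])
```

    Because the split happens at the first colon, the colon inside the title is kept as part of the value. A line such as the reviews line, which packs several labels together, would come out as one field named reviews and would need further splitting.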
    
        4
  •  1
  •   Adam Sampson    7 years ago

    Using the tidyverse packages:

    library(tidyverse)
    
    text <- list(readLines("https://raw.githubusercontent.com/pranavn91/PhD/master/Expt/sample.txt"))
    
    out <- tibble(text = text)
    
    out <- out %>%
      rowwise() %>%
      mutate(ids = str_extract(text,"Id: .+") %>% na.omit() %>% str_remove("Id: ") %>% str_c(collapse = ", "),
             ASIN = str_extract(text,"ASIN: .+") %>% na.omit() %>% str_remove("ASIN: ") %>% str_c(collapse = ", "),
             title = str_extract(text,"title: .+") %>% na.omit() %>% str_remove("title: ") %>% str_c(collapse = ", "),
             group = str_extract(text,"group: .+") %>% na.omit() %>% str_remove("group: ") %>% str_c(collapse = ", "),
             similar = str_extract(text,"similar: .+") %>% na.omit() %>% str_remove("similar: ") %>% str_c(collapse = ", "),
             rating = str_extract(text,"avg rating: .+") %>% na.omit() %>% str_remove("avg rating: ") %>% str_c(collapse = ", ")
             ) %>%
      ungroup()
    

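    I put the text in a list because I assumed you want to build a data frame that holds several of the items you are looking for. If so, just add a new list element for each readLines call.

    Note that mutate treats each item in the list as one object, which is equivalent to using text[[1]]…

    If an item occurs more than once, you need to add %>% str_c(collapse = ", ") as I did; otherwise you can remove it.

    Update based on the new sample data:

    The new sample dataset poses some different challenges that my original answer did not address.

    First, the data is all in one file, whereas I had assumed it was spread over multiple files. You can either split everything into a list or split everything into a character vector. I chose the second option.

    Because I chose the second option, the code now has to extract data up to the point where \r is reached (written as \\r in R, because R handles escapes differently).

    Next, some fields are empty! A check has to be added to see whether the result is empty and, if so, fix the output. I use %>% ifelse(length(.)==0,NA,.) to do this.

    Note: if you add other fields to this search, such as categories:, the code will capture only the first line of text. It would need to be modified to capture multiple lines.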

    library(tidyverse)
    
    # Read text into a single long file.
    text <- read_file("https://raw.githubusercontent.com/pranavn91/PhD/master/Expt/sample.txt")
    
    # Separate each Id: into a character string in a vector
    # Use negative lookahead to capture groups that don't have Id: in them.
    # Use an or to also capture any non-words that don't have Id: in them.
    text <- str_extract_all(text,"Id: (((?!Id:).)|[^(Id:)])+") %>% 
      flatten()
    
    out <- tibble(text = text)
    
    out <- out %>%
      rowwise() %>%
      mutate(ids = str_extract(text,"Id: ((?!\\r).)+") %>% na.omit() %>% str_remove("Id: ") %>% str_c(collapse = ", ") %>% ifelse(length(.)==0,NA,.),
             ASIN = str_extract(text,"ASIN: ((?!\\r).)+") %>% na.omit() %>% str_remove("ASIN: ") %>% str_c(collapse = ", ") %>% ifelse(length(.)==0,NA,.),
             title = str_extract(text,"title: ((?!\\r).)+") %>% na.omit() %>% str_remove("title: ") %>% str_c(collapse = ", ") %>% ifelse(length(.)==0,NA,.),
             group = str_extract(text,"group: ((?!\\r).)+") %>% na.omit() %>% str_remove("group: ") %>% str_c(collapse = ", ") %>% ifelse(length(.)==0,NA,.),
             similar = str_extract(text,"similar: ((?!\\r).)+") %>% na.omit() %>% str_remove("similar: \\d") %>% str_c(collapse = ", ") %>% ifelse(length(.)==0,NA,.),
             rating = str_extract(text,"avg rating: ((?!\\r).)+") %>% na.omit() %>% str_remove("avg rating: ") %>% str_c(collapse = ", ") %>% ifelse(length(.)==0,NA,.)
      ) %>%
      ungroup()
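
    The two ideas in this update, splitting the file into one chunk per Id: and returning NA for a missing field, can also be sketched in base R. This is not the code above, just an illustration under the assumption that every record starts with a line beginning with Id:. The helper get_field is made up, and the second record is deliberately trimmed so that it lacks a title.

```r
text <- c(
  "Id:   1",
  "ASIN: 0827229534",
  "  title: Patterns of Preaching: A Sermon Sampler",
  "  group: Book",
  "Id:   2",
  "ASIN: 12412441",
  "  group: Book"            # this record has no title line
)

# A new record starts at every line beginning with "Id:"; the running
# count of those lines gives each line its record number.
record <- cumsum(grepl("^Id:", text))
records <- split(text, record)

# Hypothetical helper: value of one field in one record, or NA if absent.
get_field <- function(lines, label) {
  pattern <- paste0("^\\s*", label, ":\\s*")
  hit <- grep(pattern, lines, value = TRUE)
  if (length(hit) == 0) NA_character_ else sub(pattern, "", hit[1])
}

out <- data.frame(
  Id    = sapply(records, get_field, label = "Id"),
  ASIN  = sapply(records, get_field, label = "ASIN"),
  title = sapply(records, get_field, label = "title"),
  group = sapply(records, get_field, label = "group"),
  stringsAsFactors = FALSE
)
```

    Here out has two rows, and the title in the second row is NA because that record has no title: line.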
    
        5
  •  0
  •   PKumar    7 years ago

    I mostly use base R here (apart from zoo and tidyr). The code may be a bit long, but it gets the desired result.

    options(stringsAsFactors = F)
    text <- readLines("https://raw.githubusercontent.com/pranavn91/PhD/master/Expt/sample.txt") #Input file
    
    textdf <- data.frame(text, stringsAsFactors = F) #Reading it
    search_words <- c("Id","ASIN","title","group","salesrank","similar","avg rating") #search words as per OP
    textdf <- data.frame(text = textdf[grepl(paste0(search_words,collapse = "|"), textdf$text),]) #finding the words and filtering it
    textdf$key <- as.numeric(gsub("Id:\\s+(\\d+)","\\1",textdf$text)) # Making a key for each Id (non-Id lines become NA)
    View(textdf) # Optional: inspect the intermediate result
    
    textdf$key <- zoo::na.locf(textdf$key) #Propagating the key for same set of Ids
    textdf$text <- gsub( "(.*)(?=avg rating:\\s*\\d+)","", textdf$text, perl=T) #Removing text from before "avg rating" 
    textdf$text <- gsub("(similar:\\s*\\d+)(.*)","\\1", textdf$text, perl=T) #Removing text after "similar"
    textdf$text <- trimws(textdf$text) ##removing leading and trailing blanks
    textdf$text <- sub(":","+",textdf$text) #Replacing the first instance of : so that we can split with plus sign, since plus sign is very uncommon hence took it
    splits <- strsplit(textdf$text, "\\+")  #Splitting 
    max_len <- max(lengths(splits)) #checking for max length of items in the list
    all_lyst_eq_len <- lapply(splits, `length<-`, max_len) #equaling the list
    df_final <- data.frame(cbind(do.call('rbind', all_lyst_eq_len), textdf$key))# binding the data frame
    
    df_final <- df_final[!duplicated(df_final),] #Removing the duplicates, there is some dups in data
    df_f <- tidyr::spread(df_final, X1,X2) # Reshaping it(transposing)
    
    df_f[,c("Id","ASIN", "title", "group","similar",
                "avg rating")] #Final dataset 
    

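    Output:

    The text output was getting wrapped, so I have added a screenshot instead; my apologies to the community.

    The output is the same as what the OP asked for.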

    [screenshot of the output data frame]
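
    The key-propagation step in the code above (zoo::na.locf) carries the last seen Id forward over the NA entries. For readers without zoo, the same idea can be sketched in base R, assuming the first element is never NA (which holds here because the file starts with an Id: line). The key vector is a made-up stand-in for textdf$key:

```r
# Made-up stand-in for textdf$key: an Id on its own line, NA elsewhere.
key <- c(1, NA, NA, NA, 2, NA, NA)

# Index of the most recent non-NA position at or before each element.
last_seen <- cummax(ifelse(is.na(key), 0L, seq_along(key)))

# Look the value up at that position.
filled <- key[last_seen]
```

    After this, every line carries the Id of the record it belongs to, which is what the later reshaping step relies on.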