
Capturing text with a regex pattern in R

  •  1
  • Santosh  · Tech community  · 6 years ago

    Below is sample data from the object table

    +-----------+-------------------------------------------------------------------------------------------------+
    | Unique_Id |                                               Text                                              |
    +-----------+-------------------------------------------------------------------------------------------------+
    | Ax23z12   | Tool generated code 2015-8134 upon further validation, the tool confirmed the code as 2015-8134 |
    +-----------+-------------------------------------------------------------------------------------------------+
    

    Using the following code

    regmatches(table[1,2],gregexpr("2015-\\d{4}",table[1,2]))
    

    I was able to extract the output as

    [[1]]
    [1] "2015-8134" "2015-8134"
    

    However, the output I want is as follows

    +-----------+---------------------------------------------------------------------------+-----------+-----------+
    | Unique_Id |                                    Text                                   |  Column1  |  Column2  |
    +-----------+---------------------------------------------------------------------------+-----------+-----------+
    | Ax23z12   | Tool generated code 2015-8134 upon further validation, the tool confirmed | 2015-8134 | 2015-8134 |
    |           |   the code as 2015-8134                                                   |           |           |
    +-----------+---------------------------------------------------------------------------+-----------+-----------+
    
    

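    The data in the Text column contains this number multiple times (up to 7 times), so I am looking for a dynamic solution.

    Thank you very much.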

    3 replies  |  as of 6 years ago
        1
  •  3
  •   jazzurro    6 years ago

    Here is one way. I used the following sample data, called foo.

    #     id                                                                     text
    #  <int>                                                                    <chr>
    #1     1                Here is my code, 2015-8134. Here is your code, 2015-1111.
    #2     2 His code is 2016-8888, her code is 2016-7777, and your code is 2016-6666
    

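    I applied stri_extract_all_regex() to text, converted the result to a data frame, and combined it with the original data frame, foo, using bind_cols(). The last job was to revise the column names: I replaced X with Column using gsub().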

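    library(dplyr)
    library(stringi)
    
    # simplify = TRUE returns a character matrix with one row per element of foo$text
    out <- stri_extract_all_regex(str = foo$text, pattern = "\\d+-\\d+", simplify = TRUE) %>%
      data.frame(stringsAsFactors = FALSE) %>%
      bind_cols(foo, .)
    
    # data.frame() names the new columns X1, X2, ...; rename them to Column1, Column2, ...
    names(out) <- names(out) %>%
      gsub(pattern = "X", replacement = "Column")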
    
    #     id                                                                     text   Column1   Column2   Column3
    #  <int>                                                                    <chr>     <chr>     <chr>     <chr>
    #1     1                Here is my code, 2015-8134. Here is your code, 2015-1111. 2015-8134 2015-1111          
    #2     2 His code is 2016-8888, her code is 2016-7777, and your code is 2016-6666 2016-8888 2016-7777 2016-6666
    

    Data

    foo <- structure(list(id = 1:2, text = c("Here is my code, 2015-8134. Here is your code, 2015-1111.", 
    "His code is 2016-8888, her code is 2016-7777, and your code is 2016-6666"
    )), .Names = c("id", "text"), class = c("tbl_df", "tbl", "data.frame"
    ), row.names = c(NA, -2L))
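
    If stringi and dplyr are not available, roughly the same result can be sketched in base R. This is only an illustration and not part of the answer above: foo is rebuilt here as a plain data.frame, and the names hits, width and padded are made up for the example. Unlike simplify = TRUE, the padding below leaves NA rather than an empty string where a row has fewer matches.

```r
# Sample data mirroring foo above, rebuilt as a plain data.frame (assumption)
foo <- data.frame(
  id = 1:2,
  text = c("Here is my code, 2015-8134. Here is your code, 2015-1111.",
           "His code is 2016-8888, her code is 2016-7777, and your code is 2016-6666"),
  stringsAsFactors = FALSE
)

# One character vector of matches per row of foo
hits <- regmatches(foo$text, gregexpr("\\d+-\\d+", foo$text))

# Pad every vector to the length of the longest one;
# indexing past the end of a character vector yields NA
width <- max(lengths(hits))
padded <- do.call(rbind, lapply(hits, function(x) x[seq_len(width)]))
colnames(padded) <- paste0("Column", seq_len(width))

# Bind the padded matrix to the original columns
out <- data.frame(foo, padded, stringsAsFactors = FALSE)
print(out)
```

    Because width is computed from the data, the number of ColumnN columns follows the row with the most matches, which is what the question means by a dynamic solution.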
    
        2
  •  2
  •   Psidom    6 years ago

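    Here is another option using stringr and data.table:

    1) use str_match_all to extract all matches of the pattern;

    2) use transpose to turn the extracted matches into columns;

    3) construct a new data frame by combining the extracted columns with the original ones;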

    library(stringr)
    library(data.table)
    
    lst = transpose(str_match_all(df$Text, "2015-\\d{4}"))
    data.frame(df, setNames(lst, paste0("Column", seq_along(lst))))
    #  Unique_Id                                                                                            Text   Column1   Column2
    #1   Ax23z12 Tool generated code 2015-8134 upon further validation, the tool confirmed the code as 2015-8134 2015-8134 2015-8134
    #2   By56m22                                           Tool generated code 2015-8134 upon further validation 2015-8134      <NA>
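
    The pattern "2015-\\d{4}" only matches codes that start with 2015, while a looser pattern such as "\\d+-\\d+" matches any digits-hyphen-digits run. The sketch below compares the two, plus a third pattern that is purely an assumption about the code format (four digits, a hyphen, four digits). The sample string x and the names patterns and found are made up for illustration.

```r
# Hypothetical text with two codes and one unrelated "12-3" token
x <- "Tool generated code 2015-8134, ticket 12-3, confirmed as 2016-0042"

patterns <- c(
  year_fixed = "2015-\\d{4}",            # only codes starting with 2015
  loose      = "\\d+-\\d+",              # any digits-hyphen-digits run
  four_four  = "\\b\\d{4}-\\d{4}\\b"     # assumed format: 4 digits, hyphen, 4 digits
)

# Extract all matches of each pattern from x (PCRE syntax via perl = TRUE)
found <- lapply(patterns, function(p) {
  regmatches(x, gregexpr(p, x, perl = TRUE))[[1]]
})
print(found)
```

    Which pattern is appropriate depends on what the real Text column can contain, so it is worth checking against a few actual rows before fixing one.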
    
        3
  •  0
  •   CPak    6 years ago

    df[apply(df, 1, function(x) any(grepl("2015-\\d{4}", x))), ]
    

    See this reproducible example

    iris[apply(iris, 1, function(x) any(grepl("set", x))), ]
    
    #    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    # 1           5.1         3.5          1.4         0.2  setosa
    # 2           4.9         3.0          1.4         0.2  setosa
    # 3           4.7         3.2          1.3         0.2  setosa
    # 4           4.6         3.1          1.5         0.2  setosa
    # 5           5.0         3.6          1.4         0.2  setosa
    # 6           5.4         3.9          1.7         0.4  setosa
    # etc
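
    Note that the snippet above keeps or drops whole rows; it does not split the matches into columns. It also searches every column, because apply() first converts the data frame to a character matrix. When only the Text column needs to be searched, grepl() can be called on that column directly. A minimal sketch, assuming a data frame df shaped like the one in the question; the third row is hypothetical and added only so that something gets filtered out.

```r
# Data frame shaped like the question's table; the last row is hypothetical
df <- data.frame(
  Unique_Id = c("Ax23z12", "By56m22", "Cz00q01"),
  Text = c(
    "Tool generated code 2015-8134 upon further validation, the tool confirmed the code as 2015-8134",
    "Tool generated code 2015-8134 upon further validation",
    "No code was generated"
  ),
  stringsAsFactors = FALSE
)

# grepl() is vectorised: one TRUE/FALSE per row of the Text column
has_code <- grepl("2015-\\d{4}", df$Text)

# Keep only the rows whose Text contains at least one code
kept <- df[has_code, ]
print(kept)
```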