Check whether any of several values in a string falls within a numeric range in R

  •  2
  • Haakonkas  · 7 years ago

    I have the following dummy data frame:

    structure(list(ref = structure(1:7, .Label = c("a", "b", "c", 
    "d", "e", "f", "g"), class = "factor"), gene = structure(c(1L, 
    1L, 1L, 1L, 1L, 2L, 2L), .Label = c("gyrA", "parC"), class = "factor"), 
        result = structure(c(2L, 4L, 6L, 2L, 3L, 5L, 1L), .Label = c("S479T", 
        "S83L", "S83L, D678E, D741E", "S83L, D87G", "T765E", "V196A, M248V, E678D"
        ), class = "factor")), class = "data.frame", row.names = c(NA, 
    -7L))
    

    ref  gene  result
    a    gyrA  S83L
    b    gyrA  S83L, D87G
    c    gyrA  V196A, M248V, E678D
    d    gyrA  S83L
    e    gyrA  S83L, D678E, D741E
    f    parC  T765E
    g    parC  S479T
    

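    What I want to do is check whether the numeric values in the "result" column (the digits between the two letters of each entry) fall within a specific range, namely 67-106, but only when the "gene" column == gyrA. Every number in each cell of the "result" column needs to be checked. If any number in a cell is within the specified range, result_pos should be 1. This is what I have tried: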

    df %>%
       mutate(gyrA_pos = ifelse(gene == "gyrA", gsub("[[:alpha:]]", "", result), NA),
       result_pos = ifelse(gene == "gyrA" & gyrA_pos %in% as.character(seq(from = 67, to = 106)) == TRUE, 1, 0))
    

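    This works, but only for entries that contain a single value. I also find it tedious to have to create a column with the letters stripped out before matching. In the end I would like to get: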

    ref  gene  result                 result_pos
    a    gyrA  S83L                   1
    b    gyrA  S83L, D87G             1
    c    gyrA  V196A, M248V, E678D    0
    d    gyrA  S83L                   1
    e    gyrA  S83L, D678E, D741E     1
    f    parC  T765E                  NA
    g    parC  S479T                  NA
    
    2 answers
        1
  •  2
  •   Calum You    7 years ago

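    Here is one way. You can use str_extract_all to pull all the numbers out of result, then map with any to check whether any of those numbers is in the specified range. The last two steps just insert NA for the non-gyrA rows and convert to integer.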

    library(tidyverse)
    df <- structure(list(ref = structure(1:7, .Label = c("a", "b", "c", "d", "e", "f", "g"), class = "factor"), gene = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L), .Label = c("gyrA", "parC"), class = "factor"), result = structure(c(2L, 4L, 6L, 2L, 3L, 5L, 1L), .Label = c("S479T", "S83L", "S83L, D678E, D741E", "S83L, D87G", "T765E", "V196A, M248V, E678D"), class = "factor")), class = "data.frame", row.names = c(NA, -7L))
    
    df %>%
      mutate(
        result_pos = result %>%
          str_extract_all("\\d+") %>%
          map(as.integer) %>%
          map_lgl(~ any(.x >= 67L & .x <= 106L)),
        result_pos = if_else(gene != "gyrA", NA, result_pos),
        result_pos = as.integer(result_pos)
      )
    #>   ref gene              result result_pos
    #> 1   a gyrA                S83L          1
    #> 2   b gyrA          S83L, D87G          1
    #> 3   c gyrA V196A, M248V, E678D          0
    #> 4   d gyrA                S83L          1
    #> 5   e gyrA  S83L, D678E, D741E          1
    #> 6   f parC               T765E         NA
    #> 7   g parC               S479T         NA
    

    Created on 2018-09-04 by the reprex package (v0.2.0).
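
    To see what the extraction step changes, it can help to compare it with the letter-stripping from the question on a single multi-value cell. The following is only a small base R sketch: the variable names are illustrative, and regmatches with gregexpr stands in for str_extract_all.

```r
cell <- "S83L, D87G"

# Stripping letters keeps the ", " separator, so the whole cell stays one
# string and can never equal a single position such as "83".
stripped <- gsub("[[:alpha:]]", "", cell)
stripped
#> [1] "83, 87"
stripped %in% as.character(67:106)
#> [1] FALSE

# Extracting each run of digits yields one element per mutation instead.
positions <- as.integer(regmatches(cell, gregexpr("[0-9]+", cell))[[1]])
positions
#> [1] 83 87
any(positions >= 67L & positions <= 106L)
#> [1] TRUE
```

    Comparing numbers rather than strings also removes the need for the as.character(seq(...)) lookup.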

        2
  •  1
  •   markus    7 years ago

    Here is a data.table approach:

    library(data.table)
    setDT(DF) # DF is the question's data frame; setDT converts it by reference
    DF[, `:=`(result = as.character(result), # coerce result to character
              result_pos = NA_integer_)] # set result_pos to NA 
    DF[gene == 'gyrA', result_pos := {
      x <-
        lapply(strsplit(result, split = ","),
               gsub,
               pattern = "\\D+",
               replacement = "")
      as.integer(sapply(x, function(i)
        any(as.numeric(i) >= 67 & as.numeric(i) <= 106)))
    }][]
    #   ref gene              result result_pos
    #1:   a gyrA                S83L          1
    #2:   b gyrA          S83L, D87G          1
    #3:   c gyrA V196A, M248V, E678D          0
    #4:   d gyrA                S83L          1
    #5:   e gyrA  S83L, D678E, D741E          1
    #6:   f parC               T765E         NA
    #7:   g parC               S479T         NA
    

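    The idea is to strsplit result on "," for the rows where gene == 'gyrA', strip every non-digit character from each piece with gsub, and then check whether any of the remaining numbers lies between 67 and 106.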
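
    The same idea can also be sketched without any packages. The helper name any_pos_in_range and its lo/hi arguments below are hypothetical, and the data frame is rebuilt with plain character columns instead of the factors from the question; treat this as one possible sketch rather than a definitive implementation.

```r
# Hypothetical helper: TRUE where any number embedded in x lies in [lo, hi].
any_pos_in_range <- function(x, lo = 67, hi = 106) {
  x <- as.character(x)                          # also accepts factor input
  nums <- regmatches(x, gregexpr("[0-9]+", x))  # one character vector per cell
  vapply(nums, function(n) {
    n <- as.numeric(n)
    any(n >= lo & n <= hi)
  }, logical(1))
}

# Same data as in the question, with character columns (an assumption).
df <- data.frame(
  ref = letters[1:7],
  gene = c(rep("gyrA", 5), "parC", "parC"),
  result = c("S83L", "S83L, D87G", "V196A, M248V, E678D", "S83L",
             "S83L, D678E, D741E", "T765E", "S479T"),
  stringsAsFactors = FALSE
)

# 1/0 for gyrA rows, NA for every other gene.
df$result_pos <- ifelse(df$gene == "gyrA",
                        as.integer(any_pos_in_range(df$result)),
                        NA_integer_)
df
```

    Keeping the range check in its own function means the bounds can be changed per gene without touching the extraction logic.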