代码之家  ›  专栏  ›  技术社区  ›  Powege

如何用R从IUPAC符号中得到所有可能的序列

  •  0
  • Powege  · 技术社区  · 4 年前

    我有一个带IUPAC符号的DNA序列载体( https://www.bioinformatics.org/sms/iupac.html

    seq <- "AATCRVTAA"
    iuapc <- data.table(code = c("A", "C", "G", "T", "R", "Y", "S", "W", "K", "M", "B", "D", "H", "V", "N"),
                    base = c("A", "C", "G", "T", "AG", "CT", "GC", "AT", "GT", "AC", "CGT", "AGT", "ACT", "ACG", "ACGT"))
    

    其中,“R”和“V”是DNA核苷酸的模糊值,“R”表示“A”或“G”,“V”表示“A”、“C”或“G”。

    如何生成可以由上述模糊序列表示的所有不同序列组合?

    "AATCAATAA"
    "AATCACTAA"
    "AATCAGTAA"
    "AATCGATAA"
    "AATCGCTAA"
    "AATCGGTAA"
    

    序列的向量很大,所以性能很重要。任何帮助都将不胜感激!

    how to extend ambiguous dna sequence

    0 回复  |  直到 4 年前
        1
  •  1
  •   s_baldur    4 年前

    这里有一些非常原始的东西:

    library(data.table)
    library(magrittr)
    
    # Convert iuapc$base to list of vectors
    iuapc[, base := list(strsplit(base, ''))]
    setkey(iuapc, code)
    
    
    tstrsplit(seq, '') %>% 
      lapply(function(x) iuapc[x, base[[1]]]) %>% 
      do.call(CJ, .) %>% 
      .[, paste(.SD, collapse = ''), by = 1:nrow(.)] %>% 
      .[, V1]
    
    # [1] "AATCAATAA" "AATCACTAA" "AATCAGTAA" "AATCGATAA" "AATCGCTAA" "AATCGGTAA"
    
        2
  •  0
  •   Jon Spring    4 年前

    利用你今天早些时候的问题( https://stackoverflow.com/a/66274136/6851825

    library(tidyverse)
    
    tibble(seq) %>%
      separate_rows(seq, sep = '(?<=.)(?=.)') %>%
      left_join(iuapc, by = c("seq" = "code")) %>%
      pull(base) %>%
      str_split("") %>%
      expand.grid(stringsAsFactors = FALSE)
    
    #  Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9
    #1    A    A    T    C    A    A    T    A    A
    #2    A    A    T    C    G    A    T    A    A
    #3    A    A    T    C    A    C    T    A    A
    #4    A    A    T    C    G    C    T    A    A
    #5    A    A    T    C    A    G    T    A    A
    #6    A    A    T    C    G    G    T    A    A
    
        3
  •  0
  •   Powege    4 年前
    library(stringr)
    
    all.seq.iuapc <- function(seq, dictio_replace){
      seq <- toupper(seq)
      vec <- strsplit(seq, "")[[1]]
      vec2 <- str_replace_all(string = vec, pattern= dictio_replace)
      tmp <- expand.grid(strsplit(vec2, ""), stringsAsFactors = FALSE)
      strings <- apply(tmp, 1, paste0, collapse = "")
      return(strings)
    }
    
    dictio_replace= c("A" = "A",
                      "C" = "C",
                      "G" = "G",
                      "T" = "T",
                      "R" = "AG",        
                      "Y" = "CT",
                      "S" = "GC",
                      "W" = "AT",
                      "K" = "GT",
                      "M" = "AC",
                      "B" = "CGT",
                      "D" = "AGT",
                      "H" = "ACT",
                      "V" = "ACG",
                      "N" = "ACGT")