
How to perform separate() and mutate_each() with dplyr

  • knb  · 10 years ago

    I have data from an SQLite database that contains an entity which is not in first normal form. The strings in the "sample_attribute" column look like this:

     isolate: R4166 || age: 43.88 || biomaterial_provider: LIBD || sex: male || tissue: DLPFC || disease: control || race: AA || RIN: 8.7 || Fraction: total || BioSampleModel: Human
    

    My code so far:

    library(tidyr)
    library(dplyr)
    library(stringi)
    
    
    
    rs.df <- structure(list(run_accession = c("SRR1554537", "SRR2071348"), 
    platform_parameters = c("INSTRUMENT_MODEL: Illumina HiSeq 2000", 
    "INSTRUMENT_MODEL: Illumina HiSeq 2000"), sample_attribute = c("isolate: R3452 || age: -0.3836 || biomaterial_provider: LIBD || sex: female || tissue: DLPFC || disease: control || race: AA || RIN: 9.6 || Fraction: total || BioSampleModel: Human", "isolate: R3452 || age: -0.3836 || biomaterial_provider: LIBD || sex: female || tissue: DLPFC || disease: control || race: AA || RIN: 9.6 || Fraction: total || BioSampleModel: Human")), .Names = c("run_accession", "platform_parameters", "sample_attribute"
    ), row.names = c(NA, -2L), class = "data.frame")
    
    coln <- c("isolate", "age", "biomaterial_provider", "sex", "tissue", "disease", "race",
              "RIN", "Fraction", "BioSampleModel")
    
    rs.df <- rs.df %>%
            separate(sample_attribute, coln, sep = "\\|\\|")
    
    head(rs.df, 1)
    

    Intermediate result:

           sample_attribute
      run_accession                   platform_parameters         isolate          age
    1    SRR1554534 INSTRUMENT_MODEL: Illumina HiSeq 2000 isolate: DLPFC   age: 40.42 
              biomaterial_provider         sex          tissue            disease
    1  biomaterial_provider: LIBD   sex: male   tissue: DLPFC   disease: Control 
            race        RIN          Fraction         BioSampleModel
    1  race: AA   RIN: 8.4   Fraction: total   BioSampleModel: Human
    

    At the moment I continue with

    for (x in coln){
            rs.df[,x] <- stri_replace(rs.df[,x], regex = "^.+:\\s*", replacement = "")
    }
    

    but this is inflexible.

    Is there a way to extend the dplyr pipeline so that the for loop is replaced, as far as possible, by calls inside the %>% pipe?

    At the very least, for the columns in coln, I would like to strip the "key:" prefix right after the separate() call:

    rs.df <- rs.df %>%
            separate(sample_attribute, coln, sep = "\\|\\|") %>%
            mutate_each(... stri_replace...) #split pairs at ":", remove part before ":"
    

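    (Here the for loop solves my problem of separating and cleaning the strings. However, the SRAdb database may contain more columns like this, whose key: value pairs are separated by "||". How can I handle them in a more flexible way?)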

    1 answer  |  latest 10 years ago
  •   Community Mohan Dere    9 years ago

    See @docendo discimus's answer here: dplyr certain columns

    In your case:

    rs.df <- rs.df %>%
        separate(sample_attribute, coln, sep = "\\|\\|") %>%
        mutate_each_(funs(stri_replace(., regex="^.+:\\s*", replacement="")), coln)
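
Note that mutate_each_() and funs() are deprecated in current dplyr releases. The sketch below is one possible way to write the same cleanup with mutate(across()), assuming dplyr 1.0.0 or later and tidyr 1.0.0 or later. The small rs.df here is a shortened, made-up stand-in for the data in the question, and `cleaned` and `wide` are illustrative names. Part (b) is an assumption about what "more flexible" could mean: it derives the column names from the keys in the strings instead of from a hand-written coln.

```r
library(dplyr)
library(tidyr)
library(stringi)

# Shortened stand-in for the question's rs.df (illustrative values only).
rs.df <- data.frame(
  run_accession = c("SRR1554537", "SRR2071348"),
  sample_attribute = c(
    "isolate: R3452 || age: -0.3836 || sex: female",
    "isolate: R4166 || age: 43.88 || sex: male"
  ),
  stringsAsFactors = FALSE
)
coln <- c("isolate", "age", "sex")

# (a) Known column names: split on "||" (swallowing the surrounding spaces),
#     then drop everything up to and including the colon in each new column.
cleaned <- rs.df %>%
  separate(sample_attribute, coln, sep = "\\s*\\|\\|\\s*") %>%
  mutate(across(all_of(coln),
                ~ stri_replace_first_regex(.x, "^.+:\\s*", "")))

# (b) Unknown column names: one row per "key: value" pair, split each pair
#     at its first colon, then spread the keys back out into columns.
wide <- rs.df %>%
  separate_rows(sample_attribute, sep = "\\s*\\|\\|\\s*") %>%
  separate(sample_attribute, c("key", "value"),
           sep = ":\\s*", extra = "merge") %>%
  pivot_wider(names_from = key, values_from = value)

print(cleaned)
print(wide)
```

Variant (b) does not need coln at all, so it can be applied to any other column that holds "||"-separated key: value pairs by swapping in that column's name. All resulting columns are character; convert numeric ones such as age explicitly if needed.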