
How to join selected columns from multiple data frames in R

  •  0
  • Camila  · 11 months ago

    I have many data.frames (448) that all share the same column names (9 in total), like this:

           V1              V2      V4    ... V9
    ENSG00000000003.15   TSPAN6   7095
    ENSG00000000005.6     TNMD    4355
           .                .       .
           .                .       .
    

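    I want to create one more data.frame that keeps the first 2 columns (V1 and V2), which are identical in every data frame, and combines the V4 column (which differs in each data frame) from all of the data frames. The remaining columns should be excluded.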

    If possible, I would like to rename the V4 columns to "sample1", "sample2", and so on up to 448.

    So the final data frame should look like this:

           V1              V2     V4_1    V4_2 ... V4_448
    ENSG00000000003.15   TSPAN6   7095    3856       .
    ENSG00000000005.6     TNMD    4355    2976       .
           .                .       .      .         . 
           .                .       .      .         .
    

    This is what I have done so far:

    reader <- function(f){
      read.table(f, sep='\t', skip=6, header=FALSE)
    }
    
    files <- list.files(path, 
                        recursive=TRUE, full.names=TRUE)
    
    myfilelist <- lapply(files, reader)
    

    But I don't know how to combine only these selected columns.

    Here is the output of dput(lapply(myfilelist[1:2], head)):

    myfilelist <- list(structure(list(V1 = c("ENSG00000000003.15", "ENSG00000000005.6", 
    "ENSG00000000419.13", "ENSG00000000457.14", "ENSG00000000460.17", 
    "ENSG00000000938.13"), V2 = c("TSPAN6", "TNMD", "DPM1", "SCYL3", 
    "C1orf112", "FGR"), V3 = c("protein_coding", "protein_coding", 
    "protein_coding", "protein_coding", "protein_coding", "protein_coding"
    ), V4 = c(7094L, 2L, 4355L, 1149L, 372L, 585L), V5 = c(3573L, 
    1L, 2201L, 953L, 553L, 281L), V6 = c(3521L, 1L, 2154L, 883L, 
    579L, 308L), V7 = c(59.9764, 0.052, 138.3704, 6.4018, 2.3896, 
    6.6335), V8 = c(20.5827, 0.0178, 47.4859, 2.197, 0.8201, 2.2765
    ), V9 = c(22.2037, 0.0192, 51.2256, 2.37, 0.8847, 2.4558)), row.names = c(NA, 
    6L), class = "data.frame"), structure(list(V1 = c("ENSG00000000003.15", 
    "ENSG00000000005.6", "ENSG00000000419.13", "ENSG00000000457.14", 
    "ENSG00000000460.17", "ENSG00000000938.13"), V2 = c("TSPAN6", 
    "TNMD", "DPM1", "SCYL3", "C1orf112", "FGR"), V3 = c("protein_coding", 
    "protein_coding", "protein_coding", "protein_coding", "protein_coding", 
    "protein_coding"), V4 = c(2616L, 23L, 3746L, 1288L, 510L, 1578L
    ), V5 = c(1369L, 9L, 1876L, 1015L, 681L, 797L), V6 = c(1250L, 
    14L, 1871L, 984L, 693L, 782L), V7 = c(16.8063, 0.4541, 90.4417, 
    5.4531, 2.4895, 13.5969), V8 = c(4.8615, 0.1314, 26.1617, 1.5774, 
    0.7201, 3.9331), V9 = c(6.0158, 0.1625, 32.3733, 1.9519, 0.8911, 
    4.867)), row.names = c(NA, 6L), class = "data.frame"))
    
    1 reply  |  as of 11 months ago
        1
  •  1
  •   LMc    11 months ago

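    It's not clear to me how you want to join this list of data frames (are the first two columns identical in all of them?), but here is one option using a left join: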

    library(dplyr)
    library(purrr)
    
    imap(myfilelist, \(df, i) select(df, 1:2, "sample{i}" := 4) ) |>
      reduce(left_join, by = join_by(V1, V2))
    

    If the first two columns are identical in all of the data frames, you can simply bind the columns together:

    library(dplyr)
    library(purrr)
    
    bind_cols(myfilelist[[1]][1:2],
              imap(myfilelist, \(df, i) select(df, "sample{i}" := 4)) |> bind_cols())
    
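    Binding columns side by side is only safe when every data frame really has the same V1/V2 values in the same row order. A minimal base R sketch of such a check follows; the helper name keys_match and the toy list dfs are hypothetical stand-ins for illustration, not part of the question's data:

```r
# Toy stand-ins for myfilelist (hypothetical data, not from the question)
dfs <- list(
  data.frame(V1 = c("g1", "g2"), V2 = c("A", "B"), V4 = c(10L, 20L)),
  data.frame(V1 = c("g1", "g2"), V2 = c("A", "B"), V4 = c(30L, 40L))
)

# Hypothetical helper: TRUE only if the first two columns of every
# data frame are identical (same values, same order) to those of the first
keys_match <- function(lst) {
  ref <- lst[[1]][1:2]
  all(vapply(lst, function(df) identical(df[1:2], ref), logical(1)))
}

keys_match(dfs)
```

    If this returns FALSE for the real list, the join-based option is probably the safer choice, since it aligns rows by key rather than by position.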
        2
  •  0
  •   AkselA    11 months ago

    We can also do this quite easily in base R:

    bound <- data.frame(myfilelist[[1]][1:2], do.call(cbind, lapply(myfilelist, "[", 4)))
    colnames(bound)[-(1:2)] <- paste0("sample", seq(length(myfilelist)))
    bound
    #                   V1       V2 sample1 sample2
    # 1 ENSG00000000003.15   TSPAN6    7094    2616
    # 2  ENSG00000000005.6     TNMD       2      23
    # 3 ENSG00000000419.13     DPM1    4355    3746
    # 4 ENSG00000000457.14    SCYL3    1149    1288
    # 5 ENSG00000000460.17 C1orf112     372     510
    # 6 ENSG00000000938.13      FGR     585    1578
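    The cbind approach above relies on the rows being in the same order in every data frame. If that cannot be guaranteed, one possible base R alternative is to merge on the key columns instead. The sketch below uses a hypothetical toy list dfs in place of myfilelist and assumes V1 and V2 together identify a row:

```r
# Toy stand-ins for myfilelist (hypothetical data); note the second
# data frame lists its rows in a different order
dfs <- list(
  data.frame(V1 = c("g1", "g2"), V2 = c("A", "B"), V3 = "x", V4 = c(10L, 20L)),
  data.frame(V1 = c("g2", "g1"), V2 = c("B", "A"), V3 = "x", V4 = c(40L, 30L))
)

# Keep V1, V2 and V4 from each data frame, renaming V4 to sample<i>
picked <- lapply(seq_along(dfs), function(i) {
  out <- dfs[[i]][c("V1", "V2", "V4")]
  names(out)[3] <- paste0("sample", i)
  out
})

# Merge successively on the key columns; all.x = TRUE keeps every row
# of the left-hand table, like a left join
merged <- Reduce(
  function(x, y) merge(x, y, by = c("V1", "V2"), all.x = TRUE),
  picked
)
merged
```

    Note that merge() sorts the result by the key columns by default, so the row order may differ from the input files. With 448 data frames, repeated merging is also likely to be noticeably slower than a single cbind.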