
How to join selected columns from multiple data frames in R

merge join function dataframe r

Camila · Tech Community · 11 months ago

I have many data.frames (448) with the same column names (9 in total), like this:

       V1              V2      V4    ... V9
ENSG00000000003.15   TSPAN6   7095
ENSG00000000005.6     TNMD    4355
       .                .       .
       .                .       .

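I want to create one more data.frame that keeps the first two columns (V1 and V2), which are identical in every data frame, and combines the V4 column (which differs between data frames) from all of the data frames. The remaining columns should be excluded.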

If possible, I would like to rename the V4 columns to "sample1", "sample2", and so on, up to 448.

So the final data frame should look like this:

       V1              V2     V4_1    V4_2 ... V4_448
ENSG00000000003.15   TSPAN6   7095    3856       .
ENSG00000000005.6     TNMD    4355    2976       .
       .                .       .      .         . 
       .                .       .      .         .

This is what I have done so far:

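# Read one tab-separated file, skipping the first 6 lines; columns are named V1..V9
reader <- function(f){
  read.table(f, sep='\t', skip=6, header=FALSE)
}

# `path` is the directory that holds the files (defined elsewhere)
files <- list.files(path, 
                    recursive=TRUE, full.names=TRUE)

# One data.frame per file
myfilelist <- lapply(files, reader)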

But I don't know how to combine only these selected columns.

Here is the output of dput(lapply(myfilelist[1:2], head)):

myfilelist <- list(structure(list(V1 = c("ENSG00000000003.15", "ENSG00000000005.6", 
"ENSG00000000419.13", "ENSG00000000457.14", "ENSG00000000460.17", 
"ENSG00000000938.13"), V2 = c("TSPAN6", "TNMD", "DPM1", "SCYL3", 
"C1orf112", "FGR"), V3 = c("protein_coding", "protein_coding", 
"protein_coding", "protein_coding", "protein_coding", "protein_coding"
), V4 = c(7094L, 2L, 4355L, 1149L, 372L, 585L), V5 = c(3573L, 
1L, 2201L, 953L, 553L, 281L), V6 = c(3521L, 1L, 2154L, 883L, 
579L, 308L), V7 = c(59.9764, 0.052, 138.3704, 6.4018, 2.3896, 
6.6335), V8 = c(20.5827, 0.0178, 47.4859, 2.197, 0.8201, 2.2765
), V9 = c(22.2037, 0.0192, 51.2256, 2.37, 0.8847, 2.4558)), row.names = c(NA, 
6L), class = "data.frame"), structure(list(V1 = c("ENSG00000000003.15", 
"ENSG00000000005.6", "ENSG00000000419.13", "ENSG00000000457.14", 
"ENSG00000000460.17", "ENSG00000000938.13"), V2 = c("TSPAN6", 
"TNMD", "DPM1", "SCYL3", "C1orf112", "FGR"), V3 = c("protein_coding", 
"protein_coding", "protein_coding", "protein_coding", "protein_coding", 
"protein_coding"), V4 = c(2616L, 23L, 3746L, 1288L, 510L, 1578L
), V5 = c(1369L, 9L, 1876L, 1015L, 681L, 797L), V6 = c(1250L, 
14L, 1871L, 984L, 693L, 782L), V7 = c(16.8063, 0.4541, 90.4417, 
5.4531, 2.4895, 13.5969), V8 = c(4.8615, 0.1314, 26.1617, 1.5774, 
0.7201, 3.9331), V9 = c(6.0158, 0.1625, 32.3733, 1.9519, 0.8911, 
4.867)), row.names = c(NA, 6L), class = "data.frame"))


LMc · 11 months ago

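It isn't clear to me how you want to join this list of data frames (are the first two columns identical?), but here is one option using a left join: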

library(dplyr)
library(purrr)

imap(myfilelist, \(df, i) select(df, 1:2, "sample{i}" := 4) ) |>
  reduce(left_join, by = join_by(V1, V2))
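For readers who prefer to avoid extra packages, the same left-join idea can be sketched in base R with Reduce() and merge(). The toy list below is hypothetical stand-in data, not the asker's files, and the sketch assumes that V1 and V2 together identify each row uniquely.

```r
# Toy stand-ins for myfilelist (hypothetical values, rows deliberately in different order)
toy <- list(
  data.frame(V1 = c("g1", "g2", "g3"), V2 = c("A", "B", "C"),
             V3 = "x", V4 = c(10L, 20L, 30L)),
  data.frame(V1 = c("g3", "g1", "g2"), V2 = c("C", "A", "B"),
             V3 = "x", V4 = c(31L, 11L, 21L))
)

# Keep V1, V2 and V4 from each data frame, renaming V4 to sample<i>
trimmed <- lapply(seq_along(toy), function(i) {
  out <- toy[[i]][, c("V1", "V2", "V4")]
  names(out)[3] <- paste0("sample", i)
  out
})

# Fold the list with successive left joins on the two key columns
joined <- Reduce(
  function(x, y) merge(x, y, by = c("V1", "V2"), all.x = TRUE),
  trimmed
)
joined
```

Note that merge() sorts the result by the key columns by default, whereas left_join() keeps the row order of the left-hand table.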

If the first two columns are identical in all of the data frames, you can simply bind them together:

library(dplyr)
library(purrr)

bind_cols(myfilelist[[1]][1:2],
          imap(myfilelist, \(df, i) select(df, "sample{i}" := 4)) |> bind_cols())
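Binding columns side by side is only safe when the rows line up across all data frames. A minimal sketch of such a check, using a hypothetical helper keys_match() and toy data rather than the asker's files, might look like this:

```r
# Toy data (hypothetical): the third data frame has its rows in a different order
toy <- list(
  data.frame(V1 = c("g1", "g2"), V2 = c("A", "B"), V4 = c(1L, 2L)),
  data.frame(V1 = c("g1", "g2"), V2 = c("A", "B"), V4 = c(3L, 4L)),
  data.frame(V1 = c("g2", "g1"), V2 = c("B", "A"), V4 = c(5L, 6L))
)

# TRUE only if every data frame has exactly the same key columns,
# in the same row order, as the first one
keys_match <- function(dfs, cols = c("V1", "V2")) {
  ref <- dfs[[1]]
  all(vapply(dfs, function(d) {
    all(vapply(cols, function(k) identical(d[[k]], ref[[k]]), logical(1)))
  }, logical(1)))
}

keys_match(toy[1:2])  # TRUE: safe to bind columns
keys_match(toy)       # FALSE: use a join instead
```

If the check returns FALSE, the join-based option is the safer choice because it matches rows by V1 and V2 rather than by position.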

AkselA · 11 months ago

We can also do this quite easily in base R:

bound <- data.frame(myfilelist[[1]][1:2], do.call(cbind, lapply(myfilelist, "[", 4)))
colnames(bound)[-(1:2)] <- paste0("sample", seq_along(myfilelist))
bound
#                   V1       V2 sample1 sample2
# 1 ENSG00000000003.15   TSPAN6    7094    2616
# 2  ENSG00000000005.6     TNMD       2      23
# 3 ENSG00000000419.13     DPM1    4355    3746
# 4 ENSG00000000457.14    SCYL3    1149    1288
# 5 ENSG00000000460.17 C1orf112     372     510
# 6 ENSG00000000938.13      FGR     585    1578
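Since the unused columns are to be discarded anyway, one possible refinement is to skip them while reading, using the colClasses argument of read.table() ("NULL" drops a column, NA lets the type be guessed). The sketch below is an assumption-laden illustration: it writes two small hypothetical files with four columns to a temporary directory, so for the asker's nine-column files the colClasses vector would need nine entries.

```r
# Create two small tab-separated files with 6 leading lines to skip (hypothetical content)
tmp_dir <- tempfile("counts_")
dir.create(tmp_dir)
write_toy <- function(name, counts) {
  body <- paste(c("g1", "g2"), c("A", "B"), "protein_coding", counts, sep = "\t")
  writeLines(c(rep("header line", 6), body), file.path(tmp_dir, name))
}
write_toy("a.tsv", c(10, 20))
write_toy("b.tsv", c(30, 40))

files <- list.files(tmp_dir, full.names = TRUE)

# Keep columns 1, 2 and 4; drop column 3 at read time
reader <- function(f) {
  read.table(f, sep = "\t", skip = 6, header = FALSE,
             colClasses = c(NA, NA, "NULL", NA))
}
myfilelist <- lapply(files, reader)

# After dropping column 3, the counts are the third remaining column
bound <- data.frame(myfilelist[[1]][1:2],
                    do.call(cbind, lapply(myfilelist, "[", 3)))
colnames(bound)[-(1:2)] <- paste0("sample", seq_along(myfilelist))
bound
```

With 448 files this keeps less data in memory, at the cost of hard-coding the column layout in the reader.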