
Identify rows of a data frame in R by specific pairs of values in two columns

indexing matrix r

Max · Tech Community · 1 year ago

I want to identify all rows of a data frame (or matrix) whose values in columns 1 and 2 match a specific pair. For example, if I have the matrix

testmat=rbind(c(1,1),c(1,2),c(1,4),c(2,1),c(2,4),c(3,4),c(3,10))

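I want to find the rows that contain any of the following pairs, i.e. all rows with the combination 1,2 or 2,4 in the first and second columns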

of_interest = rbind(c(1,2),c(2,4))

The following approach does not work

which(testmat[,1] %in% of_interest[,1] & testmat[,2] %in% of_interest[,2])

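because, as expected, it returns every combination of 1 or 2 in the first column with 2 or 4 in the second column (i.e. rows 2, 3, 5 instead of the desired rows 2 and 5 only), so the row [1,4] is included even though it is not one of the pairs I am querying. There must be some simple way to use %in% to match specific pairs like this, but I have not found a working example.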

Note that I need the positions (row numbers) of the rows that match the condition.
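To illustrate why the attempt above over-matches, here is a minimal sketch that splits the condition into its two column tests. The names `col1_ok`, `col2_ok` and `independent_hits` are introduced only for this illustration.

```r
testmat <- rbind(c(1, 1), c(1, 2), c(1, 4), c(2, 1), c(2, 4), c(3, 4), c(3, 10))
of_interest <- rbind(c(1, 2), c(2, 4))

# Each column is tested on its own, so the pairing between columns is lost.
col1_ok <- testmat[, 1] %in% of_interest[, 1]  # first value is 1 or 2
col2_ok <- testmat[, 2] %in% of_interest[, 2]  # second value is 2 or 4
independent_hits <- which(col1_ok & col2_ok)

# Any row combining {1, 2} in column 1 with {2, 4} in column 2 passes,
# including (1, 4), which is not one of the requested pairs.
print(testmat[independent_hits, , drop = FALSE])
```

The test is satisfied by the cross product of the two value sets rather than by the two listed pairs, which is why row 3 slips through.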

4 replies | up to 1 year ago

SamR · 1 year ago

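Standard approach

I assume you are using which() because you want the positions, not just whether there is a match. You can cbind() the row numbers to testmat and then merge() the result with of_interest.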

merge(
    cbind(testmat, seq_len(nrow(testmat))),
    of_interest
) |> setNames(c("x", "y", "row_num"))

#   x y row_num
# 1 1 2       2
# 2 2 4       5
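Since the question also mentions data frames, the same idea can be sketched with named columns and an explicit `by` argument. The names `testdf`, `pairs_df` and `row_num` are assumptions made for this example, not part of the original answer.

```r
testdf <- data.frame(x = c(1, 1, 1, 2, 2, 3, 3), y = c(1, 2, 4, 1, 4, 4, 10))
pairs_df <- data.frame(x = c(1, 2), y = c(2, 4))

# Record the original positions first, because merge() may reorder rows.
testdf$row_num <- seq_len(nrow(testdf))

# Inner join on both columns keeps only rows whose (x, y) pair is requested.
matched <- merge(testdf, pairs_df, by = c("x", "y"))
row_positions <- sort(matched$row_num)
print(row_positions)
```

Sorting `row_num` at the end restores the original row order regardless of how merge() arranged the result.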

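`Rcpp` approach for very large matrices

You mentioned in your comment that you have 10e8 rows. This makes me think of two things:

Do not merge(), because this will coerce the matrix to a data frame, i.e. copy each column into its own contiguous vector in memory, which will be very expensive.
If of_interest is also large, you want to break out of the loop as soon as a match is found rather than continue iterating. See this question for the performance benefit.

Given this, I would avoid which() or any other approach that does not exit early. Here is some Rcpp code that should be much faster than merge() for large datasets: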

Rcpp::cppFunction("
IntegerVector get_row_position(NumericMatrix testmat, NumericMatrix of_interest) {
    const R_xlen_t nrow_testmat = testmat.nrow();
    const R_xlen_t nrow_of_interest = of_interest.nrow();

    IntegerVector result;

    // loop through the rows of testmat
    for (R_xlen_t i = 0; i < nrow_testmat; ++i) {
        NumericMatrix::Row test_row = testmat(i, _);

        for (R_xlen_t j = 0; j < nrow_of_interest; ++j) {
            NumericMatrix::Row interest_row = of_interest(j, _);

            if (is_true(all(test_row == interest_row))) {
                result.push_back(i + 1); // because of 1-indexing
                break; // leave inner loop early
            }
        }
    }
    return result;
}
")

get_row_position(testmat, of_interest)
# [1] 2 5

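I think accessing rows as sub-matrices is more idiomatic Rcpp code than a double for loop with matrix indexing, but I do not know which is faster, so if performance is your main concern, I would try various approaches and benchmark them.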
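As a rough sketch of what such a benchmark could look like, the snippet below times two base R approaches on simulated data and checks that they agree. The size `n`, the simulated matrix `big` and the helper names `via_paste` and `via_merge` are all hypothetical, and the timings depend on the machine. The compiled `get_row_position()` could be timed the same way.

```r
set.seed(1)
n <- 1e5  # hypothetical size, far smaller than the 10e8 rows mentioned above
vals <- as.numeric(1:10)
big <- cbind(sample(vals, n, replace = TRUE), sample(vals, n, replace = TRUE))
targets <- rbind(c(1, 2), c(2, 4))

# Approach 1: build one string key per row, then use %in%.
via_paste <- function(m, p) {
  which(paste(m[, 1], m[, 2], sep = "_") %in% paste(p[, 1], p[, 2], sep = "_"))
}

# Approach 2: attach row numbers and merge() on the two value columns.
via_merge <- function(m, p) {
  hits <- merge(cbind(m, seq_len(nrow(m))), p)
  sort(hits[[3]])  # third column holds the original row numbers
}

print(system.time(res_paste <- via_paste(big, targets)))
print(system.time(res_merge <- via_merge(big, targets)))

# Both approaches should report the same row positions.
stopifnot(isTRUE(all.equal(res_paste, res_merge)))
```

Checking that the candidates return identical positions before comparing timings guards against benchmarking an approach that is fast but wrong.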

ThomasIsCoding · 1 year ago

Here is one approach using which + asplit

> which(asplit(testmat, 1) %in% asplit(of_interest, 1))
[1] 2 5

This may be somewhat inefficient because of asplit, but it should work well for small datasets where speed is not a major concern.

SEAnalyst · 1 year ago

You can paste() the values in your example (testmat and of_interest) into a single key per row and then do one %in% evaluation. For example:

testmat_keys <- paste(testmat[, 1], testmat[, 2], sep = "_")
of_interest_keys <- paste(of_interest[, 1], of_interest[, 2], sep = "_")

which(testmat_keys %in% of_interest_keys) #returns [1] 2 5
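The key-building step can be wrapped so it works for any number of columns. This is only a sketch: `match_rows` and `make_keys` are hypothetical names, and the carriage-return separator is an assumption chosen because it is unlikely to occur in the data. Note that paste() converts doubles to text with a limited number of significant digits, so this suits integer-like or character values better than arbitrary floating-point values.

```r
# Hypothetical helper: positions of rows in `mat` that equal any row of `pairs`.
match_rows <- function(mat, pairs, sep = "\r") {
  stopifnot(ncol(mat) == ncol(pairs))
  # Paste all columns of each row into a single string key.
  make_keys <- function(m) do.call(paste, c(as.data.frame(m), sep = sep))
  which(make_keys(mat) %in% make_keys(pairs))
}

testmat <- rbind(c(1, 1), c(1, 2), c(1, 4), c(2, 1), c(2, 4), c(3, 4), c(3, 10))
of_interest <- rbind(c(1, 2), c(2, 4))

row_positions <- match_rows(testmat, of_interest)
print(row_positions)
```

Because the columns are passed through as.data.frame(), the same function accepts either a matrix or a data frame.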

If %in% is not fast enough for you, consider trying %fin% or fmatch() from fastmatch as a faster alternative to %in%.

#install.packages('fastmatch')   
library(fastmatch)

matches <- which(fmatch(testmat_keys, of_interest_keys, nomatch = 0) > 0)

Friede · 1 year ago

We can use row.names() + {ivs}.

Setup:

testmat = rbind(c(1,1), c(1,2), c(1,4), c(2,1), c(2,4), c(3,4), c(3,10))
row.names(testmat) = seq_len(nrow(testmat))

Index,

i = testmat[, 1] < testmat[, 2]

Compare,

library(ivs)
w = iv_overlaps(iv(testmat[i, 1], testmat[i, 2]), 
                iv_pairs(c(1,2), c(2,4)), 
                type = "equals")

Index again:

> names(i[i == TRUE][w]) # |> strtoi() # to return integers instead
[1] "2" "5"
