
Create a new variable by multiplying corresponding variables and summing

tidyverse r

Pete · 6 months ago

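I have a long list of variables, each of which I want to multiply by its corresponding variable and then sum the products. a_1 corresponds to b_1, a_2 to b_2, and so on. The desired output is computed as (a_1*b_1 + a_2*b_2 + ...).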

library(dplyr)
(df <- tibble(
  a_1 = sample(1:5),
  a_2 = sample(1:5),
  b_1 = sample(1:5),
  b_2 = sample(1:5),
  desired_output = (a_1*b_1 + a_2*b_2)
))

# A tibble: 5 × 5
    a_1   a_2   b_1   b_2 desired_output
  <int> <int> <int> <int>          <int>
1     4     5     1     3             19
2     1     2     2     5             12
3     2     1     4     2             10
4     5     3     5     1             28
5     3     4     3     4             25

I tried writing a function to do this but failed (I'm very new to writing functions!), e.g.:

df %>%
  mutate(desired_output = function(df) {
  for (i in 1:2) {
    y1 <- get(paste0(x,'$','a_',i))
    y2 <- get(paste0(x,'$','a_',i))
    z <- y1*y2 
  }
  return(z)
}

4 replies | latest 6 months ago

G. Grothendieck 6 months ago

1. Multiply two pick(...) calls and then use rowSums, as shown:

df %>%
 mutate(desired_output = rowSums(pick(starts_with("a")) * pick(starts_with("b"))))

giving

# A tibble: 5 × 5
    a_1   a_2   b_1   b_2 desired_output
  <int> <int> <int> <int>          <dbl>
1     4     5     1     3             19
2     1     2     2     5             12
3     2     1     4     2             10
4     5     3     5     1             28
5     3     4     3     4             25
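
One thing worth keeping in mind with the pick() approach: the two selections are multiplied column by column in positional order, so it relies on the a and b columns appearing in the same suffix order. A minimal base R sketch of a sanity check, assuming the a_<n> / b_<n> naming from the question (the names a_cols and b_cols are illustrative, not part of the answer):

```r
# Data from the question, in reproducible form
df <- data.frame(
  a_1 = c(4L, 1L, 2L, 5L, 3L),
  a_2 = c(5L, 2L, 1L, 3L, 4L),
  b_1 = c(1L, 2L, 4L, 5L, 3L),
  b_2 = c(3L, 5L, 2L, 1L, 4L)
)

# Illustrative helper names; anchoring on "a_" / "b_" avoids picking up
# unrelated columns that merely start with "a" or "b"
a_cols <- grep("^a_", names(df), value = TRUE)
b_cols <- grep("^b_", names(df), value = TRUE)

# The element-wise product pairs columns by position, so the suffixes
# must line up: a_1 with b_1, a_2 with b_2, ...
stopifnot(identical(sub("^a_", "", a_cols), sub("^b_", "", b_cols)))

desired_output <- rowSums(df[a_cols] * df[b_cols])
desired_output
```

If the check fails, reordering one of the name vectors before multiplying is probably the simplest fix.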

2. A straightforward translation of that to base R gives:

transform(df, desired_output = rowSums(
  df[startsWith(names(df), "a")] * df[startsWith(names(df), "b")]
))

3. An approach fairly close to the attempt in the question is:

df %>%
 mutate(desired_output = {
   tmp <- 0
   for(i in 1:2) tmp <- tmp + get(paste0("a_", i)) * get(paste0("b_", i))
   tmp
 })

3a. Or, using base R:

within(df, {
  desired_output <- 0
  for(i in 1:2) desired_output <- desired_output + 
    get(paste0("a_", i)) * get(paste0("b_", i))
  i <- NULL
})
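
As a rough illustration of why the loop variants work while the attempt in the question does not: get() looks up a single variable name, and a string such as "df$a_1" is an expression rather than a name. The sketch below uses a small made-up two-row data frame, not the question's data:

```r
# Small made-up data frame, for illustration only
df <- data.frame(a_1 = c(4L, 1L), b_1 = c(1L, 2L))

# get() resolves one variable name; "df$a_1" is not a name, so the
# lookup fails and the error handler returns a marker string instead
res <- tryCatch(get("df$a_1"), error = function(e) "not found")
res

# Evaluated inside the data, the bare column name is enough
inside <- with(df, get("a_1") * get("b_1"))
inside

# Outside the data, [[ ]] with a constructed name does the same job
outside <- df[[paste0("a_", 1)]] * df[[paste0("b_", 1)]]
outside
```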

Note:

Because the question uses random numbers, the input is not reproducible. Next time, please call set.seed(...) first.
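
A minimal sketch of what that buys you: resetting the seed replays the same random draws. The seed value 42 is an arbitrary choice for illustration.

```r
set.seed(42)          # any fixed integer works; 42 is arbitrary
first <- sample(1:5)

set.seed(42)          # resetting the seed replays the same random stream
second <- sample(1:5)

identical(first, second)  # TRUE
```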

This is the data used in the question, shown in reproducible form:

library(tibble)

df <- tibble(
  a_1 = c(4L, 1L, 2L, 5L, 3L),
  a_2 = c(5L, 2L, 1L, 3L, 4L),
  b_1 = c(1L, 2L, 4L, 5L, 3L),
  b_2 = c(3L, 5L, 2L, 1L, 4L)
)

jpsmith 6 months ago

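For a more general solution in base R, you can first identify the columns of interest by pattern (here the columns of interest, ccols, are identified as a non-digit, an underscore, and a digit, i.e. "\\D_\\d"), then use sapply inside rowSums to do the multiplication and the addition: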

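cols  <- grep("\\D_\\d", names(df), value = TRUE)
ccols <- unique(gsub("\\d", "", cols))  # prefixes, e.g. "a_" and "b_"
idx   <- unique(gsub("\\D", "", cols))  # numeric suffixes, e.g. "1" and "2"

# Iterate over the suffixes rather than over ccols, so that this still
# works when there are more than two a/b pairs
df$desired <- rowSums(
  sapply(idx, \(x) {
    df[[paste0(ccols[1], x)]] * df[[paste0(ccols[2], x)]]
  }))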

    a_1   a_2   b_1   b_2 desired
  <int> <int> <int> <int>   <dbl>
1     3     5     5     1      20
2     3     4     3     1      13
3     2     1     3     5      11
4     2     2     1     3       8
5     3     3     4     2      18
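
To make the pattern step more concrete, here is a small sketch of what grep() and gsub() return on a vector of column names. The extra "id" entry is a hypothetical column, added only to show that non-matching names are dropped:

```r
# "id" is a hypothetical extra column, not part of the question's data
nms <- c("a_1", "a_2", "b_1", "b_2", "id")

# Keep only names containing non-digit, underscore, digit
matched <- grep("\\D_\\d", nms, value = TRUE)
matched    # "a_1" "a_2" "b_1" "b_2"

# Remove the digits and de-duplicate to get the prefixes
prefixes <- unique(gsub("\\d", "", matched))
prefixes   # "a_" "b_"
```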

Note that if you are sure the columns are only "a_xx" and "b_xx", you can simply do this:

df$desired <- rowSums(
  sapply(1:2, \(x) {
    df[[paste0("a_", x)]] * df[[paste0("b_", x)]]
  }))

Data (with seed)

set.seed(123)
df <- tibble::tibble(
  a_1 = sample(1:5, 5, replace = TRUE),
  a_2 = sample(1:5, 5, replace = TRUE),
  b_1 = sample(1:5, 5, replace = TRUE),
  b_2 = sample(1:5, 5, replace = TRUE)
)

ThomasIsCoding 6 months ago

Here is a base R option using split.default + rowSums

transform(
  df,
  prodsum = with(
    split.default(df, sub("_.*", "", names(df))),
    rowSums(a * b)
  )
)

which gives

  a_1 a_2 b_1 b_2 prodsum
1   1   5   3   2      13
2   4   3   5   5      35
3   3   4   1   4      19
4   5   2   4   3      26
5   2   1   2   1       5
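
For readers who haven't met split.default() before, here is a step-by-step sketch of what the one-liner does. It uses the question's data rather than the seeded data from this answer, and the intermediate names prefix and parts are illustrative:

```r
# Data from the question, in reproducible form
df <- data.frame(
  a_1 = c(4L, 1L, 2L, 5L, 3L),
  a_2 = c(5L, 2L, 1L, 3L, 4L),
  b_1 = c(1L, 2L, 4L, 5L, 3L),
  b_2 = c(3L, 5L, 2L, 1L, 4L)
)

# Drop everything from the underscore on, leaving the prefix "a" or "b"
prefix <- sub("_.*", "", names(df))

# split.default() treats the data frame as a list of columns and groups
# them by prefix, giving one sub-data-frame per prefix
parts <- split.default(df, prefix)
names(parts)     # "a" "b"
names(parts$a)   # "a_1" "a_2"

# with() exposes the list elements a and b to the expression
prodsum <- with(parts, rowSums(a * b))
prodsum
```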

Data

set.seed(0)
(df <- tibble(
  a_1 = sample(1:5),
  a_2 = sample(1:5),
  b_1 = sample(1:5),
  b_2 = sample(1:5)
))

Eliot Dixon tmfmnk 6 months ago

One option is:

df %>%
 mutate(desired_output = rowSums(across(starts_with("a"), 
                                        ~ . * get(stringr::str_replace(cur_column(), "a_", "b_")))))

    a_1   a_2   b_1   b_2 desired_output
  <int> <int> <int> <int>          <dbl>
1     2     3     2     5             19
2     4     2     1     3             10
3     3     4     5     4             31
4     5     1     4     2             22
5     1     5     3     1              8

SamR 6 months ago

I find it more intuitive in these cases to put the data in long format, the tidyverse way. That means tidyr::pivot_longer(), a quick mutate() to create the result, then pivot_wider():

df |>
    mutate(rn = row_number()) |>
    tidyr::pivot_longer(
        cols = -rn,
        names_to = c(".value", "group"),
        names_sep = "_"
    ) |>
    mutate(result = sum(a * b), .by = rn) |>
    tidyr::pivot_wider(
        id_cols = c(rn, result),
        names_from = group,
        values_from = c(a, b),
        names_glue = "{.value}_{group}"
    ) |>
    select(c(names(df)), desired_output = result)

# # A tibble: 5 × 5
#     a_1   a_2   b_1   b_2 desired_output
#   <int> <int> <int> <int>          <int>
# 1     4     5     1     3             19
# 2     1     2     2     5             12
# 3     2     1     4     2             10
# 4     5     3     5     1             28
# 5     3     4     3     4             25

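Admittedly, this takes more lines of code than some of the other approaches, but a) the calculation itself, result = sum(a * b), is easier to understand in this form (at least to me), and b) you can often skip pivot_wider() and keep the data in long format for the next step of data manipulation, in which case this becomes shorter.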
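
A sketch of point b), assuming only the per-row total is needed downstream. It stops after pivot_longer() and uses summarise() instead of mutate() (a swap from the code above); the names long and result are illustrative:

```r
library(dplyr)
library(tidyr)

# Data from the question, in reproducible form
df <- tibble(
  a_1 = c(4L, 1L, 2L, 5L, 3L),
  a_2 = c(5L, 2L, 1L, 3L, 4L),
  b_1 = c(1L, 2L, 4L, 5L, 3L),
  b_2 = c(3L, 5L, 2L, 1L, 4L)
)

# One row per (rn, group) pair, with columns rn, group, a and b
long <- df |>
  mutate(rn = row_number()) |>
  pivot_longer(
    cols = -rn,
    names_to = c(".value", "group"),
    names_sep = "_"
  )

# Collapse to one row per original row; no pivot_wider() needed
result <- long |>
  summarise(desired_output = sum(a * b), .by = rn)

result
```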