代码之家  ›  专栏  ›  技术社区  ›  Monica Whatley

R help-多数据框列上的函数

  •  1
  • Monica Whatley  · 技术社区  · 7 年前

    我想用一个函数在数据框的四列上重复一组过程。最终,我需要一个包含所有输出的长数据帧。这是我的数据框:

    > sample_data
    # A tibble: 10 x 7
          REVENUEID AMOUNT  YEAR REPORT_CODE PAYMENT_METHOD INBOUND_CHANNEL  AMOUNT_CAT
              <chr>  <dbl> <chr>       <chr>          <chr>           <chr>      <fctr>
     1 rev-24985629     30  FY18           S          Check            Mail     [25,50)
     2 rev-22812413      1  FY16           Q          Other      Canvassing   [0.01,10)
     3 rev-23508794    100  FY17           Q    Credit card             Web   [100,250)
     4 rev-23506121    300  FY17           S    Credit card            Mail   [250,500)
     5 rev-23550444    100  FY17           S    Credit card             Web   [100,250)
     6 rev-21508672     25  FY14           J          Check            Mail     [25,50)
     7 rev-24981769    500  FY18           S    Credit card             Web [500,1e+03)
     8 rev-23503684     50  FY17           R          Check            Mail     [50,75)
     9 rev-24982087     25  FY18           R          Check            Mail     [25,50)
    10 rev-24979834     50  FY18           R    Credit card             Web     [50,75)
    

    这是我的代码:

    AMOUNT_CAT<- sample_data %>% group_by(AMOUNT_CAT,YEAR) %>% summarize(num=n(),total=sum(AMOUNT)) %>% rename(REPORT_VALUE=AMOUNT_CAT) %>% mutate(REPORT_CATEGORY="AMOUNT_CAT")
    INBOUND_CHANNEL<- sample_data %>% group_by(INBOUND_CHANNEL,YEAR) %>% summarize(num=n(),total=sum(AMOUNT)) %>% rename(REPORT_VALUE=INBOUND_CHANNEL) %>% mutate(REPORT_CATEGORY="INBOUND_CHANNEL")
    PAYMENT_METHOD<- sample_data %>% group_by(PAYMENT_METHOD,YEAR) %>% summarize(num=n(),total=sum(AMOUNT)) %>% rename(REPORT_VALUE=PAYMENT_METHOD) %>% mutate(REPORT_CATEGORY="PAYMENT_METHOD")
    REPORT_CODE<- sample_data %>% group_by(REPORT_CODE,YEAR) %>% summarize(num=n(),total=sum(AMOUNT)) %>% rename(REPORT_VALUE=REPORT_CODE) %>% mutate(REPORT_CATEGORY="REPORT_CODE")
    final_product<-bind_rows(REPORT_CODE,PAYMENT_METHOD,INBOUND_CHANNEL,AMOUNT_CAT)
    

    以下是该代码的最终产物:

           > final_product
            # A tibble: 27 x 5
            # Groups:   REPORT_VALUE [16]
               REPORT_CATEGORY REPORT_VALUE  YEAR   num total
    
                     <chr>        <chr> <chr> <int> <dbl>
         1     REPORT_CODE            J  FY14     1    25
         2     REPORT_CODE            Q  FY16     1     1
         3     REPORT_CODE            Q  FY17     1   100
         4     REPORT_CODE            R  FY17     1    50
         5     REPORT_CODE            R  FY18     2    75
         6     REPORT_CODE            S  FY17     2   400
         7     REPORT_CODE            S  FY18     2   530
         8  PAYMENT_METHOD        Check  FY14     1    25
         9  PAYMENT_METHOD        Check  FY17     1    50
        10  PAYMENT_METHOD        Check  FY18     2    55
        # ... with 17 more rows
    

    cat.list <- c("REPORT_CODE","PAYMENT_METHOD","INBOUND_CHANNEL","AMOUNT_CAT")
    repeat_procs <- lapply(cat.list, function(x) x <- sample_data %>% group_by(x,YEAR) %>% summarize(num=n(),total=sum(AMOUNT)) %>% rename(REPORT_VALUE=x) %>% mutate(REPORT_CATEGORY="x")
    

    有谁能建议我如何编写“更智能”的代码,使其不经常重复?

    谢谢

    3 回复  |  直到 7 年前
        1
  •  2
  •   acylam    7 年前

    您需要将字符串解析为符号( rlang::sym )并将其取消引用 group_by rename 如下所示。另一件需要注意的事情是 cat.list 已经是字符串向量,因此不需要在其周围添加双引号 x 在里面 mutate :

    library(dplyr)
    library(rlang)
    
    cat.list <- c("REPORT_CODE","PAYMENT_METHOD","INBOUND_CHANNEL","AMOUNT_CAT")
    repeat_procs <- lapply(cat.list, function(x){
      final_data <- sample_data %>% 
        group_by(!!sym(x), YEAR) %>% 
        summarize(num=n(),total=sum(AMOUNT)) %>% 
        rename(REPORT_VALUE=!!sym(x)) %>% 
        mutate(REPORT_CATEGORY=x)
    }) %>%
      bind_rows()
    

    结果:

    > repeat_procs
    # A tibble: 27 x 5
    # Groups:   REPORT_VALUE [16]
       REPORT_VALUE   YEAR   num total REPORT_CATEGORY
              <chr> <fctr> <int> <int>           <chr>
     1            J   FY14     1    25     REPORT_CODE
     2            Q   FY16     1     1     REPORT_CODE
     3            Q   FY17     1   100     REPORT_CODE
     4            R   FY17     1    50     REPORT_CODE
     5            R   FY18     2    75     REPORT_CODE
     6            S   FY17     2   400     REPORT_CODE
     7            S   FY18     2   530     REPORT_CODE
     8        Check   FY14     1    25  PAYMENT_METHOD
     9        Check   FY17     1    50  PAYMENT_METHOD
    10        Check   FY18     2    55  PAYMENT_METHOD
    # ... with 17 more rows
    
        2
  •  2
  •   yutannihilation    7 年前

    要获得更“智能”的代码,您需要将数据转换为 "tidy data" 分组汇总前形成。

    data_tidy <-
      tidyr::gather(sample_data, key = "REPORT_CATEGORY", value = "REPORT_VALUE", !! cat.list)
    
    data_tidy
    #>       REVENUEID AMOUNT YEAR REPORT_CATEGORY REPORT_VALUE
    #> 1  rev-24985629     30 FY18     REPORT_CODE            S
    #> 2  rev-22812413      1 FY16     REPORT_CODE            Q
    #> 3  rev-23508794    100 FY17     REPORT_CODE            Q
    #> 4  rev-23506121    300 FY17     REPORT_CODE            S
    #> 5  rev-23550444    100 FY17     REPORT_CODE            S
    #> 6  rev-21508672     25 FY14     REPORT_CODE            J
    #> 7  rev-24981769    500 FY18     REPORT_CODE            S
    #> 8  rev-23503684     50 FY17     REPORT_CODE            R
    #> 9  rev-24982087     25 FY18     REPORT_CODE            R
    #> 10 rev-24979834     50 FY18     REPORT_CODE            R
    #> 11 rev-24985629     30 FY18  PAYMENT_METHOD        Check
    #> 12 rev-22812413      1 FY16  PAYMENT_METHOD        Other
    #> 13 rev-23508794    100 FY17  PAYMENT_METHOD  Credit card
    #> 14 rev-23506121    300 FY17  PAYMENT_METHOD  Credit card
    #> 15 rev-23550444    100 FY17  PAYMENT_METHOD  Credit card
    #> 16 rev-21508672     25 FY14  PAYMENT_METHOD        Check
    #> 17 rev-24981769    500 FY18  PAYMENT_METHOD  Credit card
    #> 18 rev-23503684     50 FY17  PAYMENT_METHOD        Check
    #> 19 rev-24982087     25 FY18  PAYMENT_METHOD        Check
    #> 20 rev-24979834     50 FY18  PAYMENT_METHOD  Credit card
    #> 21 rev-24985629     30 FY18 INBOUND_CHANNEL         Mail
    #> 22 rev-22812413      1 FY16 INBOUND_CHANNEL   Canvassing
    #> 23 rev-23508794    100 FY17 INBOUND_CHANNEL          Web
    #> 24 rev-23506121    300 FY17 INBOUND_CHANNEL         Mail
    #> 25 rev-23550444    100 FY17 INBOUND_CHANNEL          Web
    #> 26 rev-21508672     25 FY14 INBOUND_CHANNEL         Mail
    #> 27 rev-24981769    500 FY18 INBOUND_CHANNEL          Web
    #> 28 rev-23503684     50 FY17 INBOUND_CHANNEL         Mail
    #> 29 rev-24982087     25 FY18 INBOUND_CHANNEL         Mail
    #> 30 rev-24979834     50 FY18 INBOUND_CHANNEL          Web
    #> 31 rev-24985629     30 FY18      AMOUNT_CAT      [25,50)
    #> 32 rev-22812413      1 FY16      AMOUNT_CAT    [0.01,10)
    #> 33 rev-23508794    100 FY17      AMOUNT_CAT    [100,250)
    #> 34 rev-23506121    300 FY17      AMOUNT_CAT    [250,500)
    #> 35 rev-23550444    100 FY17      AMOUNT_CAT    [100,250)
    #> 36 rev-21508672     25 FY14      AMOUNT_CAT      [25,50)
    #> 37 rev-24981769    500 FY18      AMOUNT_CAT  [500,1e+03)
    #> 38 rev-23503684     50 FY17      AMOUNT_CAT      [50,75)
    #> 39 rev-24982087     25 FY18      AMOUNT_CAT      [25,50)
    #> 40 rev-24979834     50 FY18      AMOUNT_CAT      [50,75)
    
    data_tidy %>%
      group_by(REPORT_CATEGORY, REPORT_VALUE, YEAR) %>%
      summarise(num = n(), total = sum(AMOUNT)) %>%
      ungroup()
    #> # A tibble: 27 x 5
    #>    REPORT_CATEGORY REPORT_VALUE  YEAR   num total
    #>              <chr>        <chr> <chr> <int> <int>
    #>  1      AMOUNT_CAT    [0.01,10)  FY16     1     1
    #>  2      AMOUNT_CAT    [100,250)  FY17     2   200
    #>  3      AMOUNT_CAT      [25,50)  FY14     1    25
    #>  4      AMOUNT_CAT      [25,50)  FY18     2    55
    #>  5      AMOUNT_CAT    [250,500)  FY17     1   300
    #>  6      AMOUNT_CAT      [50,75)  FY17     1    50
    #>  7      AMOUNT_CAT      [50,75)  FY18     1    50
    #>  8      AMOUNT_CAT  [500,1e+03)  FY18     1   500
    #>  9 INBOUND_CHANNEL   Canvassing  FY16     1     1
    #> 10 INBOUND_CHANNEL         Mail  FY14     1    25
    #> # ... with 17 more rows
    
        3
  •  0
  •   leerssej    7 年前

    A. purrr 增加了一种方法,使您的代码更加简洁 smarter .

    library(tidyverse)
    library(rlang) 
    cat.list <- c("REPORT_CODE","PAYMENT_METHOD","INBOUND_CHANNEL","AMOUNT_CAT")
    
    map_df(cat.list,
           function(report_cat) {
               sample_data %>%
                   group_by(!!sym(report_cat), YEAR) %>%
                   summarize(num=n(),total=sum(AMOUNT)) %>% 
                   rename(REPORT_VALUE = !!sym(report_cat)) %>% 
                   mutate(REPORT_CATEGORY = report_cat)
           }
          )
    

    正如哈德利所描述的那样 here (大约往下走了一半):

    map_df(x, f) 实际上与相同 do.call("rbind", lapply(x, f))

    完全公开,感谢@用户向我展示如何使用 sym(!!() 方法我把自己倒进了一个角落 Programming in Dplyr vignette构建了我认为最新的功能化方法 dplyr . dplyr 使用 var <- enquo(var) 然后 !!var 但是我找不到一种方法来处理在 cat.list map_df lapply . 谢谢你,用户教了我一个更好的方法来用tidyverse编码

    here

    这使我能够完成我先前开发的另一种量化方法:

    report <- function(report_cat){
        report_cat <- enquo(report_cat)
        sample_data %>%
        group_by(!!report_cat, YEAR) %>%
        summarize(num=n(),total=sum(AMOUNT)) %>% 
        rename(REPORT_VALUE = !!report_cat) %>% 
        mutate(REPORT_CATEGORY := as.character(quote(!!report_cat))[2])
    }
    report_named <- function(x) {do.call("report", list(as.name(x)))}
    map_df(cat.list, report_named)
    
    > map_df(cat.list, report_named)
        # A tibble: 27 x 5
        # Groups:   REPORT_VALUE [16]
           REPORT_VALUE  YEAR   num total REPORT_CATEGORY
                  <chr> <chr> <int> <int>           <chr>
         1            J  FY14     1    25     REPORT_CODE
         2            Q  FY16     1     1     REPORT_CODE
         3            Q  FY17     1   100     REPORT_CODE
         4            R  FY17     1    50     REPORT_CODE
         5            R  FY18     2    75     REPORT_CODE
         6            S  FY17     2   400     REPORT_CODE
         7            S  FY18     2   530     REPORT_CODE
         8        Check  FY14     1    25  PAYMENT_METHOD
         9        Check  FY17     1    50  PAYMENT_METHOD
        10        Check  FY18     2    55  PAYMENT_METHOD
        # ... with 17 more rows
    

    注:YutaniHilation的整洁解确实是IMHO的最优解 -我只是利用这个机会来扩大我对如何继续 拆分、应用、合并 包括的方法 dplyr 功能。