代码之家  ›  专栏  ›  技术社区  ›  Varun

R根据列名中的日期值重新排列dataframe中的列

  •  0
  • Varun  · 技术社区  · 5 年前

    _Score 列)以及该月所有医院的相应平均分数( _Average 列)。

    df = data.frame(Hospital=c(rep("Hospital A",10),rep("Hospital B",10),rep("Hospital C",10),rep("Hospital D",10)),
                    Question=c(rep("Q1",40)),
                    key=c(rep(c("2020-01-31_Average","2020-01-31_Score","2020-02-29_Average","2020-02-29_Score",
                          "2020-03-31_Average","2020-03-31_Score","2020-04-30_Average","2020-04-30_Score",
                          "2020-05-31_Average","2020-05-31_Score"),4)),
                    value=c(round(runif(40,0,1),2)))
    
    library(tidyr)
    df = df %>% spread(key,value)
    

    我想把这个数据帧转换成-

    1) 前两列, Hospital Question 保持不变

    (二) _得分 的列 最近三个月

    (三) _平均

    4) 理想情况下,需要将列从最旧重新排序到最新列(即按以下顺序: Month M-2_Score Month M-1_Score , Month M_Score , Month M_Average

    5) 计算列 Variance 最后,M分和M-1分的差值是多少

    结果表如下所示-

    #Final table
    df_transformed = df %>%
      select(1:2,8,10,12,11) %>%
      mutate(Variance=.[[5]]-.[[4]])
    

    对于如何更有效地使用列名中的日期值来实现这一点,我们将不胜感激。

    0 回复  |  直到 5 年前
        1
  •  1
  •   Ric S    5 年前

    如果数据集中的列已经按时间顺序排序,这是一个可能的解决方案

    # create vectors of variables: 3 last "_Score" and 1 last "_Average"
    score_vars <- tail(names(df)[grep("_Score", names(df))], 3)
    average_var <- tail(names(df)[grep("_Average", names(df))], 1)
    
    df %>% 
      select(Hospital, Question, !!score_vars, !!average_var) %>% 
      mutate(Variance = !!rlang::sym(score_vars[3]) - !!rlang::sym(score_vars[2]))
    

    输出

    # Hospital Question 2020-03-31_Score 2020-04-30_Score 2020-05-31_Score 2020-05-31_Average Variance
    # 1 Hospital A       Q1             0.28             0.69             0.31               0.94    -0.38
    # 2 Hospital B       Q1             0.19             0.41             0.27               0.91    -0.14
    # 3 Hospital C       Q1             0.53             0.03             0.25               0.05     0.22
    # 4 Hospital D       Q1             0.43             0.59             0.46               0.36    -0.13
    
        2
  •  0
  •   Georgery    5 年前

    我并没有真正得到问题4和问题5,但他们有点像“你能做我的家庭作业吗?”。对于问题1至3,请考虑以下内容:

    library(tidyverse)
    library(lubridate)
    
    df <- data.frame(Hospital=c(rep("Hospital A",10),rep("Hospital B",10),rep("Hospital C",10),rep("Hospital D",10)),
                    Question=c(rep("Q1",40)),
                    key=c(rep(c("2020-01-31_Average","2020-01-31_Score","2020-02-29_Average","2020-02-29_Score",
                          "2020-03-31_Average","2020-03-31_Score","2020-04-30_Average","2020-04-30_Score",
                          "2020-05-31_Average","2020-05-31_Score"),4)),
                    value=c(round(runif(40,0,1),2)))
    
    # take the dataframe
    df %>%
        # get month as a date and key separately
        mutate(month = str_replace(key, "_[[:alpha:]]*$", "") %>% ymd()
               , key = str_extract(key, "[[:alpha:]]*$")) %>%
        # filter Score for the last 3 and Average for the last 1 months
        filter(!(today() - month > months(3) & 
                     key == "Score")) %>%
        filter(!(today() - month > months(1) &
                     key == "Average"))
    
        3
  •  0
  •   Ronak Shah    5 年前

    我用了你的原版 df spread 步骤。

    library(dplyr)
    library(tidyr)
    
    df %>%
      #Bring date and key in separate columns
      separate(key, c('Date', 'key'), sep = '_') %>%
      #Convert date column to date class
      mutate(Date = as.Date(Date)) %>%
      #arrange data according with highest date first
      arrange(Hospital, key, desc(Date)) %>%
      #For each hospital and key
      group_by(Hospital, key) %>%
      #If it is a "score" column select top 3 values and 
      #for average column select only 1 value
      slice(if(first(key) == 'Score') 1:3 else 1) %>%
      select(-Question) %>%
      ungroup() %>%
      #Get the data in wide format
      pivot_wider(names_from = c(key, Date), values_from = value) %>%
      #Calculate variance column
      mutate(Variance = .[[3]] - .[[4]])
    
    # A tibble: 4 x 6
    #  Hospital   `Average_2020-05-31` `Score_2020-05-31` `Score_2020-04-30` `Score_2020-03-31` Variance
    #  <chr>                     <dbl>              <dbl>              <dbl>              <dbl>    <dbl>
    #1 Hospital A                 0.45               0.44               0.66               0.97    -0.22
    #2 Hospital B                 0.11               0.53               0.68               0.27    -0.15
    #3 Hospital C                 1                  0.18               0.56               0.41    -0.38
    #4 Hospital D                 0.31               0.83               0.6                0.79     0.23
    

    计算方差 .[[3]] - .[[4]] 将被修复,因为 "Hospital" 列是固定的,并且始终是第一列。 "Average" 纵队会在前面 "Score" Date ,我们知道最高日期会放在第一位,然后是第二位,以此类推。

    推荐文章