代码之家 › 专栏 › 技术社区 › Varun

R根据列名中的日期值重新排列dataframe中的列

data-wrangling dplyr r

0

Varun · 技术社区 · 5 年前

_Score 列)以及该月所有医院的相应平均分数( _Average 列)。

df = data.frame(Hospital=c(rep("Hospital A",10),rep("Hospital B",10),rep("Hospital C",10),rep("Hospital D",10)),
                Question=c(rep("Q1",40)),
                key=c(rep(c("2020-01-31_Average","2020-01-31_Score","2020-02-29_Average","2020-02-29_Score",
                      "2020-03-31_Average","2020-03-31_Score","2020-04-30_Average","2020-04-30_Score",
                      "2020-05-31_Average","2020-05-31_Score"),4)),
                value=c(round(runif(40,0,1),2)))

library(tidyr)
df = df %>% spread(key,value)

我想把这个数据帧转换成-

1) 前两列, Hospital 和 Question 保持不变

(二) _得分 的列 最近三个月

(三) _平均

4) 理想情况下,需要将列从最旧重新排序到最新列(即按以下顺序: Month M-2_Score Month M-1_Score , Month M_Score , Month M_Average

5) 计算列 Variance 最后,M分和M-1分的差值是多少

结果表如下所示-

#Final table
df_transformed = df %>%
  select(1:2,8,10,12,11) %>%
  mutate(Variance=.[[5]]-.[[4]])

对于如何更有效地使用列名中的日期值来实现这一点,我们将不胜感激。

0 回复 | 直到 5 年前

1

Ric S 5 年前

如果数据集中的列已经按时间顺序排序,这是一个可能的解决方案

# create vectors of variables: 3 last "_Score" and 1 last "_Average"
score_vars <- tail(names(df)[grep("_Score", names(df))], 3)
average_var <- tail(names(df)[grep("_Average", names(df))], 1)

df %>% 
  select(Hospital, Question, !!score_vars, !!average_var) %>% 
  mutate(Variance = !!rlang::sym(score_vars[3]) - !!rlang::sym(score_vars[2]))

输出

# Hospital Question 2020-03-31_Score 2020-04-30_Score 2020-05-31_Score 2020-05-31_Average Variance
# 1 Hospital A       Q1             0.28             0.69             0.31               0.94    -0.38
# 2 Hospital B       Q1             0.19             0.41             0.27               0.91    -0.14
# 3 Hospital C       Q1             0.53             0.03             0.25               0.05     0.22
# 4 Hospital D       Q1             0.43             0.59             0.46               0.36    -0.13

2

0

Georgery 5 年前

我并没有真正得到问题4和问题5,但他们有点像“你能做我的家庭作业吗?”。对于问题1至3,请考虑以下内容:

library(tidyverse)
library(lubridate)

df <- data.frame(Hospital=c(rep("Hospital A",10),rep("Hospital B",10),rep("Hospital C",10),rep("Hospital D",10)),
                Question=c(rep("Q1",40)),
                key=c(rep(c("2020-01-31_Average","2020-01-31_Score","2020-02-29_Average","2020-02-29_Score",
                      "2020-03-31_Average","2020-03-31_Score","2020-04-30_Average","2020-04-30_Score",
                      "2020-05-31_Average","2020-05-31_Score"),4)),
                value=c(round(runif(40,0,1),2)))

# take the dataframe
df %>%
    # get month as a date and key separately
    mutate(month = str_replace(key, "_[[:alpha:]]*$", "") %>% ymd()
           , key = str_extract(key, "[[:alpha:]]*$")) %>%
    # filter Score for the last 3 and Average for the last 1 months
    filter(!(today() - month > months(3) & 
                 key == "Score")) %>%
    filter(!(today() - month > months(1) &
                 key == "Average"))

3

0

Ronak Shah 5 年前

我用了你的原版 df 在 spread 步骤。

library(dplyr)
library(tidyr)

df %>%
  #Bring date and key in separate columns
  separate(key, c('Date', 'key'), sep = '_') %>%
  #Convert date column to date class
  mutate(Date = as.Date(Date)) %>%
  #arrange data according with highest date first
  arrange(Hospital, key, desc(Date)) %>%
  #For each hospital and key
  group_by(Hospital, key) %>%
  #If it is a "score" column select top 3 values and 
  #for average column select only 1 value
  slice(if(first(key) == 'Score') 1:3 else 1) %>%
  select(-Question) %>%
  ungroup() %>%
  #Get the data in wide format
  pivot_wider(names_from = c(key, Date), values_from = value) %>%
  #Calculate variance column
  mutate(Variance = .[[3]] - .[[4]])

# A tibble: 4 x 6
#  Hospital   `Average_2020-05-31` `Score_2020-05-31` `Score_2020-04-30` `Score_2020-03-31` Variance
#  <chr>                     <dbl>              <dbl>              <dbl>              <dbl>    <dbl>
#1 Hospital A                 0.45               0.44               0.66               0.97    -0.22
#2 Hospital B                 0.11               0.53               0.68               0.27    -0.15
#3 Hospital C                 1                  0.18               0.56               0.41    -0.38
#4 Hospital D                 0.31               0.83               0.6                0.79     0.23

计算方差 .[[3]] - .[[4]] 将被修复,因为 "Hospital" 列是固定的,并且始终是第一列。 "Average" 纵队会在前面 "Score" Date ,我们知道最高日期会放在第一位,然后是第二位,以此类推。