代码之家 › 专栏 › 技术社区 › nasa313

在R中,如何编写对数据帧列表进行t检验的函数

split dataframe list excel r

nasa313 · 技术社区 · 4 年前

使用 iris 以数据集为例,我想编写一个用户定义的函数

跑 pairwise t-test 在所有4列上 Species 每个数据拆分的列
将结果导出为csv文件的3个工作表

请参阅下面的尝试:

library(tidyr)
library(reshape) # for melting /stacking the data
library(multcomp) # for pairwise test 
library(xlsx) # export excel file with worksheet

options(scipen = 100)

# dataset
iris

data_stats <- function(data){
    # melt the dataframe
     df <- melt(data, id.vars=c('Species'),var='group')
    
# split the dataframe into three list of dataframe
    dfsplit<-split(df,df$column)
    
    #  pairwise t-test
    results <- pairwise.t.test(dfsplit$value, dfsplit$group,p.adjust.method = "BH")
    
    # export each result as a worksheet of an excel file
    write.xlsx(results, file="Results.xlsx", sheetName="versicolor_stats", row.names=FALSE)
    write.xlsx(results, file="Results.xlsx", sheetName="virginica_stats", append=TRUE, row.names=FALSE)
    write.xlsx(results, file="Results.xlsx", sheetName="setosa_stats", append=TRUE, row.names=FALSE)
    
}
# testing the code on iris data
data_stats(iris)

请评论并分享您的代码。谢谢

0 回复 | 直到 4 年前

akrun 4 年前

这里有一个选项 tidyverse -使用重塑为“长”格式 pivot_longer ,然后使用 group_modify 做 pairwise.t.test , tidy 输出和 unnest 这个 list 输出

library(dplyr)
library(tidyr)
library(broom)
ttest_out <- iris %>% 
  pivot_longer(cols = -Species) %>% 
  group_by(Species) %>%
  group_modify(~ .x %>%
     summarise(out = list(pairwise.t.test(value, name) %>% 
       tidy))) %>% 
   ungroup %>% 
   unnest(out)

-输出

ttest_out
# A tibble: 18 Ã 4
   Species    group1       group2         p.value
   <fct>      <chr>        <chr>            <dbl>
 1 setosa     Petal.Width  Petal.Length 1.77e- 54
 2 setosa     Sepal.Length Petal.Length 2.77e-132
 3 setosa     Sepal.Length Petal.Width  1.95e-156
 4 setosa     Sepal.Width  Petal.Length 1.61e- 86
 5 setosa     Sepal.Width  Petal.Width  1.13e-123
 6 setosa     Sepal.Width  Sepal.Length 4.88e- 71
 7 versicolor Petal.Width  Petal.Length 5.35e- 90
 8 versicolor Sepal.Length Petal.Length 3.78e- 52
 9 versicolor Sepal.Length Petal.Width  5.02e-125
10 versicolor Sepal.Width  Petal.Length 1.36e- 45
11 versicolor Sepal.Width  Petal.Width  3.46e- 44
12 versicolor Sepal.Width  Sepal.Length 1.25e- 95
13 virginica  Petal.Width  Petal.Length 1.39e- 90
14 virginica  Sepal.Length Petal.Length 6.67e- 22
15 virginica  Sepal.Length Petal.Width  3.47e-110
16 virginica  Sepal.Width  Petal.Length 2.35e- 68
17 virginica  Sepal.Width  Petal.Width  1.87e- 19
18 virginica  Sepal.Width  Sepal.Length 2.47e- 92

-1

dipetkov 4 年前

使现代化 :这个问题的统计部分(适用 pairwise.t.test 到iris数据集) answered previously 关于SO。以及 here 是另一种解决方案。

公认的解决方案运行一系列成对的t检验,并产生一列p值(正如问题所问),但这是一组意义不大的t检验。你可能会怀疑,从我们看到2.77e-132这样的p值,以及组1和组2是连续变量,而不是因子的水平。

这些测试评估的假设是,对于每个物种,萼片是否与花瓣相同,长度是否与宽度相同。成对t检验程序旨在比较单个连续变量(如萼片长度)在所有水平上的一个因子(如物种)。

首先,让我们申请 成对测试 到列 Sepal.Length ,这样我们以后可以检查我们是否得到了正确的p值。

library("broom")
library("tidyverse")

pairwise.t.test(iris$Sepal.Width, iris$Species)
#> 
#>  Pairwise comparisons using t tests with pooled SD 
#> 
#> data:  iris$Sepal.Width and iris$Species 
#> 
#>            setosa  versicolor
#> versicolor < 2e-16 -         
#> virginica  9.1e-10 0.0031    
#> 
#> P value adjustment method: holm

如果你看过虹膜数据集,你就会知道这些p值“有意义”:Virginica&Versicolor彼此比Setosa更相似。

因此,现在让我们以一种整洁的方式将测试应用于四个数字列。

t_pvals <- iris %>%
  pivot_longer(
    -Species,
    names_to = "Variable",
    values_to = "x"
  ) %>%
  # The trick to performing the right tests is to group the tibble by Variable,
  # not by Species because Species is the grouping variable for the t-tests.
  group_by(
    Variable
  ) %>%
  group_modify(
    ~ tidy(pairwise.t.test(.x$x, .x$Species))
  ) %>%
  ungroup()
t_pvals
#> # A tibble: 12 Ã 4
#>    Variable     group1     group2      p.value
#>    <chr>        <chr>      <chr>         <dbl>
#>  1 Petal.Length versicolor setosa     1.05e-68
#>  2 Petal.Length virginica  setosa     1.23e-90
#>  3 Petal.Length virginica  versicolor 1.81e-31
#>  4 Petal.Width  versicolor setosa     2.51e-57
#>  5 Petal.Width  virginica  setosa     2.39e-85
#>  6 Petal.Width  virginica  versicolor 8.82e-37
#>  7 Sepal.Length versicolor setosa     1.75e-15
#>  8 Sepal.Length virginica  setosa     6.64e-32
#>  9 Sepal.Length virginica  versicolor 2.77e- 9
#> 10 Sepal.Width  versicolor setosa     5.50e-17
#> 11 Sepal.Width  virginica  setosa     9.08e-10
#> 12 Sepal.Width  virginica  versicolor 3.15e- 3

分隔带宽度比较的p值位于底部。我们得到了我们期望的p值!

接下来,我们对p值进行格式化,使它们更容易显示在眼睛上。

t_pvals <- t_pvals %>%
  mutate(
    across(
      p.value, rstatix::p_format,
      accuracy = 0.05
    )
  )
t_pvals
#> # A tibble: 12 Ã 4
#>    Variable     group1     group2     p.value
#>    <chr>        <chr>      <chr>      <chr>  
#>  1 Petal.Length versicolor setosa     <0.05  
#>  2 Petal.Length virginica  setosa     <0.05  
#>  3 Petal.Length virginica  versicolor <0.05  
#>  4 Petal.Width  versicolor setosa     <0.05  
#>  5 Petal.Width  virginica  setosa     <0.05  
#>  6 Petal.Width  virginica  versicolor <0.05  
#>  7 Sepal.Length versicolor setosa     <0.05  
#>  8 Sepal.Length virginica  setosa     <0.05  
#>  9 Sepal.Length virginica  versicolor <0.05  
#> 10 Sepal.Width  versicolor setosa     <0.05  
#> 11 Sepal.Width  virginica  setosa     <0.05  
#> 12 Sepal.Width  virginica  versicolor <0.05

最后,我们将结果保存到一个文件中。

t_pvals %>%
  write_csv("pairwse-t-tests-on-iris-data.csv")