代码之家  ›  专栏  ›  技术社区  ›  Zheyuan Li

简单线性回归快速分组

  •  5
  • Zheyuan Li  · 技术社区  · 6 年前

    此问答源于 How to make group_by and lm fast? 在这里,OP试图对一个大的数据帧对每个组进行简单的线性回归。

    理论上,一系列的回归 y ~ x | g 相当于单一集合回归 y ~ x * g . 后者非常吸引人,因为不同组之间的统计测试是直接的。但在实践中,进行这种较大的回归计算并不容易。我对链接的Q&A评论包的回答 speedlm glm4

    大回归问题是困难的,特别是当有因素变量时。这或许可以解释为什么许多人放弃这个想法,而更喜欢按组分割数据,并按组拟合模型。我没有必要枚举按回归分组的方法(请参见 Linear Regression and group by in R ). 我在乎的是速度。

    对于简单线性回归 ,按组拆分数据,然后依赖标准模型拟合例程,如 lm 是个表演杀手。首先,对大型数据帧进行子集设置是低效的。其次,标准模型拟合例程遵循以下过程,这对有用的回归计算来说是一个巨大的开销。

    1. 将模型公式解析为“terms”对象(使用 terms.formula );
    2. 构造模型框架(使用 model.frame.default );
    3. 建立模型矩阵(使用 model.matrix.default ).

    对于简单的线性回归有聪明的计算技巧。正如我在 Fast pairwise simple linear regression between variables in a data frame 协方差法运算速度非常快。我们能通过简单的线性回归通过 group_by_simpleLM 功能?

    1 回复  |  直到 6 年前
        1
  •  6
  •   Zheyuan Li    6 年前

    我们必须通过编写编译后的代码来实现这一点。我会把这个交给Rcpp。请注意,我是一个C程序员,一直在使用R的常规C接口。Rcpp只是用来简化对列表、字符串和属性的处理,以及促进R中的即时测试。代码大部分是用C风格编写的。来自R的常规C接口的宏 REAL INTEGER 仍在使用。请参阅此答案底部的“group_by_simpleLM.cpp”。

    R包装函数 group_by_simpleLM 有四个参数:

    group_by_simpleLM <- function (dat, LHS, RHS, group) {
    ##.... [TRUNCATED]
    

    • dat
    • LHS 是一个字符向量,给出了 ~ . 支持多个LHS变量。
    • RHS 是一个字符向量,给出了 ~ . 在简单的线性回归中,仅允许一个非因子RHS变量。您可以为 右手侧 ,但函数将只保留第一个元素(带有警告)。如果在中找不到该变量 数据
    • group 提供分组变量名称的字符向量。它最好是 数据 ,否则函数使用 match(group, unique(group)) 对于一个快速的胁迫和警告发出。未使用水平的因素是无害的。 group_by_simpleLM_cpp 看到这个就把你们都还给我 NaN 对于这样的水平。 NULL 所以对所有数据进行一次回归。

    马术功能 小组讨论 返回矩阵的命名列表,以保存每个响应的回归结果。每个矩阵都是“宽”的 nlevels(group) 列和5行:

    • “阿尔法”(用于拦截);
    • “β”(用于斜率);
    • “r2”(代表R平方);
    • “df.resid”(用于剩余自由度);

    For a simple linear regression, these five statistics are sufficient to obtain other statistics .

    被退回。另一种特殊情况是一个组只有两个数据。这样拟合就完美了,斜率的标准误差为0。

    函数是 nlme::lmList(RHS ~ LHS | group, dat, pool = FALSE) 什么时候 group != NULL ,以及 lm(RHS ~ LHS, dat) group = NULL (甚至可能比 general_paired_simpleLM 因为它是用C)写的。

    注意:

    • 不处理加权回归,因为协方差方法在这种情况下无效。
    • 不检查 NA / / Inf / -Inf 数据

    实例

    library(Rcpp)
    sourceCpp("group_by_simpleLM.cpp")
    
    ## a toy dataset
    set.seed(0)
    dat <- data.frame(y1 = rnorm(10), y2 = rnorm(10), x = 1:5,
                      f = gl(2, 5, labels = letters[1:2]),
                      g = sample(gl(2, 5, labels = LETTERS[1:2])))
    

    回归分组法 nlme::lmList

    group_by_simpleLM(dat, c("y1", "y2"), "x", "f")
    #$y1
    #                    a          b
    #alpha     0.820107094 -2.7164723
    #beta     -0.009796302  0.8812007
    #beta.se   0.266690568  0.2090644
    #r2        0.000449565  0.8555330
    #df.resid  3.000000000  3.0000000
    #
    #$y2
    #                  a           b
    #alpha     0.1304709  0.06996587
    #beta     -0.1616069 -0.14685953
    #beta.se   0.2465047  0.24815024
    #r2        0.1253142  0.10454374
    #df.resid  3.0000000  3.00000000
    
    fit <- nlme::lmList(cbind(y1, y2) ~ x | f, data = dat, pool = FALSE)
    
    ## results for level "a"; use `fit[[2]]` to see results for level "b"
    lapply(summary(fit[[1]]), "[", c("coefficients", "r.squared"))
    #$`Response y1`
    #$`Response y1`$coefficients
    #                Estimate Std. Error     t value  Pr(>|t|)
    #(Intercept)  0.820107094  0.8845125  0.92718537 0.4222195
    #x           -0.009796302  0.2666906 -0.03673284 0.9730056
    #
    #$`Response y1`$r.squared
    #[1] 0.000449565
    #
    #
    #$`Response y2`
    #$`Response y2`$coefficients
    #              Estimate Std. Error    t value  Pr(>|t|)
    #(Intercept)  0.1304709  0.8175638  0.1595850 0.8833471
    #x           -0.1616069  0.2465047 -0.6555936 0.5588755
    #
    #$`Response y2`$r.squared
    #[1] 0.1253142
    

    不分解处理职缺

    ## with unused level "b"
    group_by_simpleLM(dat[1:5, ], "y1", "x", "f")
    #$y1
    #                    a   b
    #alpha     0.820107094 NaN
    #beta     -0.009796302 NaN
    #beta.se   0.266690568 NaN
    #r2        0.000449565 NaN
    #df.resid  3.000000000 NaN
    
    ## rank-deficient case for level "b"
    group_by_simpleLM(dat[1:6, ], "y1", "x", "f")
    #$y1
    #                    a        b
    #alpha     0.820107094 -1.53995
    #beta     -0.009796302      NaN
    #beta.se   0.266690568      NaN
    #r2        0.000449565      NaN
    #df.resid  3.000000000  0.00000
    

    多个分组变量

    一组一组 不能直接处理。但你可以用 interaction 首先创建一个单因子变量。

    dat$fg <- with(dat, interaction(f, g, drop = TRUE, sep = ":"))
    group_by_simpleLM(dat, c("y1", "y2"), "x", "fg")
    #$y1
    #                a:A        b:A        a:B        b:B
    #alpha     1.4750325 -2.7684583 -1.6393289 -1.8513669
    #beta     -0.2120782  0.9861509  0.7993313  0.4613999
    #beta.se   0.0000000  0.2098876  0.4946167  0.0000000
    #r2        1.0000000  0.9566642  0.7231188  1.0000000
    #df.resid  0.0000000  1.0000000  1.0000000  0.0000000
    #
    #$y2
    #                a:A         b:A        a:B        b:B
    #alpha     1.0292956 -0.22746944 -1.5096975 0.06876360
    #beta     -0.2657021 -0.20650690  0.2547738 0.09172993
    #beta.se   0.0000000  0.01945569  0.3483856 0.00000000
    #r2        1.0000000  0.99120195  0.3484482 1.00000000
    #df.resid  0.0000000  1.00000000  1.0000000 0.00000000
    
    fit <- nlme::lmList(cbind(y1, y2) ~ x | fg, data = dat, pool = FALSE)
    
    ## note that the first group a:A only has two values, so df.resid = 0
    ## my method returns 0 standard error for the slope
    ## but lm or lmList would return NaN
    lapply(summary(fit[[1]]), "[", c("coefficients", "r.squared"))
    #$`Response y1`
    #$`Response y1`$coefficients
    #              Estimate Std. Error t value Pr(>|t|)
    #(Intercept)  1.4750325        NaN     NaN      NaN
    #x           -0.2120782        NaN     NaN      NaN
    #
    #$`Response y1`$r.squared
    #[1] 1
    #
    #
    #$`Response y2`
    #$`Response y2`$coefficients
    #              Estimate Std. Error t value Pr(>|t|)
    #(Intercept)  1.0292956        NaN     NaN      NaN
    #x           -0.2657021        NaN     NaN      NaN
    #
    #$`Response y2`$r.squared
    #[1] 1
    

    不分组:一种快速的 lm

    group_by_simpleLM(dat, c("y1", "y2"), "x", NULL)
    #$y1
    #                ALL
    #alpha    -0.9481826
    #beta      0.4357022
    #beta.se   0.2408162
    #r2        0.2903691
    #df.resid  8.0000000
    #
    #$y2
    #                ALL
    #alpha     0.1002184
    #beta     -0.1542332
    #beta.se   0.1514935
    #r2        0.1147012
    #df.resid  8.0000000
    

    快速大型简单线性回归

    set.seed(0L)
    nSubj <- 200e3
    nr <- 1e6
    DF <- data.frame(subject = gl(nSubj, 5),
                     day = 3:7,
                     y1 = runif(nr), 
                     y2 = rpois(nr, 3), 
                     y3 = rnorm(nr), 
                     y4 = rnorm(nr, 1, 5))
    
    system.time(group_by_simpleLM(DF, paste0("y", 1:4), "day", "subject"))
    #   user  system elapsed 
    #  0.200   0.016   0.219 
    
    library(MatrixModels)
    system.time(glm4(y1 ~ 0 + subject + day:subject, data = DF, sparse = TRUE))
    #   user  system elapsed 
    #  9.012   0.172   9.266 
    

    一组一组 这四种反应几乎都是瞬间的,而 glm4 一次响应需要9秒!

    请注意 glm4型 一组一组 不会的。


    附录:“cpp小组”

    #include <Rcpp.h>
    using namespace Rcpp;
    
    // [[Rcpp::export]]
    List group_by_simpleLM_cpp (List Y, NumericVector x, IntegerVector group, CharacterVector group_levels, bool group_unsorted) {
    
      /* number of data and number of responses */
      int n = x.size(), k = Y.size(), n_groups = group_levels.size();
    
      /* set up result list */
      List result(k);
      List dimnames = List::create(CharacterVector::create("alpha", "beta", "beta.se", "r2", "df.resid"), group_levels);
      int j; for (j = 0; j < k; j++) {
        NumericMatrix mat(5, n_groups);
        mat.attr("dimnames") = dimnames;
        result[j] = mat;
        }
      result.attr("names") = Y.attr("names");
    
      /* set up a vector to hold sample size for each group */
      size_t *group_offset = (size_t *)calloc(n_groups + 1, sizeof(size_t));
    
      /*
        compute group offset: cumsum(group_offset)
        The offset is used in a different way when group is sorted or unsorted
        In the former case, it is the offset to real x, y values;
        In the latter case, it is the offset to ordering index indx
     */
      int *u = INTEGER(group), *u_end = u + n, i;
      if (n_groups > 1) {
        while (u < u_end) group_offset[*u++]++;
        for (i = 0; i < n_groups; i++) group_offset[i + 1] += group_offset[i];
        } else {
        group_offset[1] = n;
        group_unsorted = 0;
        }
    
      /* local variables & pointers */
      double *xi, *xi_end;    /* pointer to the 1st and the last x value */
      double *yi;             /* pointer to the first y value */
      int gi; double inv_gi;  /* sample size of the i-th group */
      double xi_mean, xi_var; /* mean & variance of x values in the i-th group */
      double yi_mean, yi_var; /* mean & variance of y values in the i-th group */
      double xiyi_cov;        /* covariance between x and y values in the i-th group */
      double beta, r2; int df_resi;
      double *matij;
    
      /* additional storage and variables when group is unsorted */
      int *indx; double *xb, *xbi, dtmp;
      if (group_unsorted) {
        indx = (int *)malloc(n * sizeof(int));
        xb = (double *)malloc(n * sizeof(double));  // buffer x for caching
        R_orderVector1(indx, n, group, TRUE, FALSE);  // Er, how is TRUE & FALSE recogonized as Rboolean?
        }
    
      /* loop through groups */
      for (i = 0; i < n_groups; i++) {
        /* set group size gi */
        gi = group_offset[i + 1] - group_offset[i];
        /* special case for a factor level with no data */
        if (gi == 0) {
          for (j = 0; j < k; j++) {
            /* matrix column for write-back */
            matij = REAL(result[j]) + i * 5;
            matij[0] = R_NaN; matij[1] = R_NaN; matij[2] = R_NaN;
            matij[3] = R_NaN; matij[4] = R_NaN;
            }
          continue;
          }
        /* rank-deficient case */
        if (gi == 1) {
          gi = group_offset[i];
          if (group_unsorted) gi = indx[gi];
          for (j = 0; j < k; j++) {
            /* matrix column for write-back */
            matij = REAL(result[j]) + i * 5;
            matij[0] = REAL(Y[j])[gi];
            matij[1] = R_NaN; matij[2] = R_NaN;
            matij[3] = R_NaN; matij[4] = 0.0;
            }
          continue;
          }
        /* general case where a regression line can be estimated */
        inv_gi = 1 / (double)gi;
        /* compute mean & variance of x values in this group */
        xi_mean = 0.0; xi_var = 0.0;
        if (group_unsorted) {
          /* use u, u_end and xbi */
          xi = REAL(x);
          u = indx + group_offset[i];  /* offset acts on index */
          u_end = u + gi;
          xbi = xb + group_offset[i];
          for (; u < u_end; xbi++, u++) {
            dtmp = xi[*u];
            xi_mean += dtmp;
            xi_var += dtmp * dtmp;
            *xbi = dtmp;
            }
          } else {
          /* use xi and xi_end */
          xi = REAL(x) + group_offset[i];  /* offset acts on values */
          xi_end = xi + gi;
          for (; xi < xi_end; xi++) {
            xi_mean += *xi;
            xi_var += (*xi) * (*xi);
            }
          }
        xi_mean = xi_mean * inv_gi;
        xi_var = xi_var * inv_gi - xi_mean * xi_mean;
        /* loop through responses doing simple linear regression */
        for (j = 0; j < k; j++) {
          /* compute mean & variance of y values, as well its covariance with x values */
          yi_mean = 0.0; yi_var = 0.0; xiyi_cov = 0.0;
          if (group_unsorted) {
            xbi = xb + group_offset[i];  /* use buffered x values */
            yi = REAL(Y[j]);
            u = indx + group_offset[i];  /* offset acts on index */
            for (; u < u_end; u++, xbi++) {
              dtmp = yi[*u];
              yi_mean += dtmp;
              yi_var += dtmp * dtmp;
              xiyi_cov += dtmp * (*xbi);
              } 
            } else {
            /* set xi and yi */
            xi = REAL(x) + group_offset[i];  /* offset acts on values */
            yi = REAL(Y[j]) + group_offset[i];  /* offset acts on values */
            for (; xi < xi_end; xi++, yi++) {
              yi_mean += *yi;
              yi_var += (*yi) * (*yi);
              xiyi_cov += (*yi) * (*xi);
              }
            }
          yi_mean = yi_mean * inv_gi;
          yi_var = yi_var * inv_gi - yi_mean * yi_mean;
          xiyi_cov = xiyi_cov * inv_gi - xi_mean * yi_mean;
          /* locate the right place to write back regression result */
          matij = REAL(result[j]) + i * 5 + 4;
          /* residual degree of freedom */
          df_resi = gi - 2; *matij-- = (double)df_resi;
          /* R-squared = squared correlation */
          r2 = (xiyi_cov * xiyi_cov) / (xi_var * yi_var); *matij-- = r2;
          /* standard error of regression slope */
          if (df_resi == 0) *matij-- = 0.0;
          else *matij-- = sqrt((1 - r2) * yi_var / (df_resi * xi_var));
          /* regression slope */
          beta = xiyi_cov / xi_var; *matij-- = beta;
          /* regression intercept */
          *matij = yi_mean - beta * xi_mean;
          }
        }
    
      if (group_unsorted) {
        free(indx);
        free(xb);
        }
      free(group_offset);
      return result;
      }
    
    /*** R
    group_by_simpleLM <- function (dat, LHS, RHS, group = NULL) {
    
      ## basic input validation
      if (missing(dat)) stop("no data provided to 'dat'!")
      if (!is.data.frame(dat)) stop("'dat' must be a data frame!")
    
      if (missing(LHS)) stop("no 'LHS' provided!")
      if (!is.character(LHS)) stop("'LHS' must be provided as a character vector of variable names!")
    
      if (missing(RHS)) stop("no 'RHS' provided!")
      if (!is.character(RHS)) stop("'RHS' must be provided as a character vector of variable names!")
    
      if (!is.null(group)) {
    
        ## grouping variable provided: a fast method of `nlme::lmList`
    
        if (!is.character(group)) stop("'group' must be provided as a character vector of variable names!")
    
        ## ensure that group has length 1, is available in the data frame and is a factor
        if (length(group) > 1L) {
          warning("only one grouping variable allowed for group-by simple linear regression; ignoring all but the 1st variable provided!")
          group <- group[1L]
          }
        grp <- dat[[group]]
        if (is.null(grp)) stop(sprintf("grouping variable '%s' not found in 'dat'!", group))
    
        if (is.factor(grp)) {
          grp_levels <- levels(grp)
          } else {
          warning("grouping variable is not provided as a factor; fast coercion is made!")
          grp_levels <- unique(grp)
          grp <- match(grp, grp_levels)
          grp_levels <- as.character(grp_levels)
          }
    
        grp_unsorted <- .Internal(is.unsorted(grp, FALSE))
    
        } else {
    
        ## no grouping; a fast method of `lm`
        grp <- 1L; grp_levels <- "ALL"; grp_unsorted <- FALSE
    
        }
    
      ## the RHS must has length 1, is available in the data frame and is numeric
      if (length(RHS) > 1L) {
        warning("only one RHS variable allowed for simple linear regression; ignoring all but the 1st variable provided!")
        RHS <- RHS[1L]
        }
      x <- dat[[RHS]]
      if (is.null(x)) stop(sprintf("RHS variable '%s' not found in 'dat'!", RHS))
      if (!is.numeric(x) || is.factor(x)) {
        stop("RHS variable must be 'numeric' for simple linear regression!")
        }
      x < as.numeric(x)  ## just in case that `x` is an integer
    
      ## check LHS variables
      nested <- match(RHS, LHS, nomatch = 0L)
      if (nested > 0L) {
        warning(sprintf("RHS variable '%s' found in LHS variables; removing it from LHS", RHS))
        LHS <- LHS[-nested]
        }
      if (length(LHS) == 0L) stop("no usable LHS variables found!")
      missed <- !(LHS %in% names(dat))
      if (any(missed)) {
        warning(sprintf("LHS variables '%s' not found in 'dat'; removing them from LHS", toString(LHS[missed])))
        LHS <- LHS[!missed]
        }
      if (length(LHS) == 0L) stop("no usable LHS variables found!")
      Y <- dat[LHS]
      invalid_LHS <- vapply(Y, is.factor, FALSE) | (!vapply(Y, is.numeric, FALSE))
      if (any(invalid_LHS)) {
        warning(sprintf("LHS variables '%s' are non-numeric or factors; removing them from LHS", toString(LHS[invalid_LHS])))
        Y <- Y[!invalid_LHS]
        }
      if (length(Y) == 0L) stop("no usable LHS variables found!")
      Y <- lapply(Y, as.numeric)  ## just in case that we have integer variables in Y
    
      ## check for exsitence of NA, NaN, Inf and -Inf and drop them?
    
      ## use Rcpp
      group_by_simpleLM_cpp(Y, x, grp, grp_levels, grp_unsorted)
      }
    */