
Sample by group without repetition using data.table

  •  4
  • Carlos Eduardo Lagosta  · 6 years ago

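    I will illustrate my question with a hypothetical scenario. Here is a table of musicians and the instruments they play, and a table with the composition of a band: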

    library(data.table)
    
    musicians <- data.table(
      instrument = rep(c('bass','drums','guitar'), each = 4),
      musician = c('Chas','John','Paul','Stuart','Andy','Paul','Peter','Ringo','George','John','Paul','Ringo')
    )
    
    band.comp <- data.table(
      instrument = c('bass','drums','guitar'),
      n = c(2,1,2)
    )
    

    To avoid arguments about who is best at which instrument, the band will be assembled by a random draw. Here is how I am doing it:

    musicians[band.comp, on = 'instrument'][, sample(musician, n), by = instrument]
    
       instrument     V1
    1:       bass   Paul
    2:       bass   Chas
    3:      drums   Andy
    4:     guitar   Paul
    5:     guitar George
    

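    The problem: because some musicians play more than one instrument, the same person can be drawn more than once (in the output above, Paul is picked for both bass and guitar).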
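
    To make the problem concrete, here is a small check that lists anyone drawn for more than one instrument. It is only a sketch: the table `draw` is the output above typed in by hand, with the `V1` column renamed to `musician` for readability, and `dups` is a name chosen for this example.

```r
library(data.table)

# The draw printed above, entered by hand (V1 renamed to musician)
draw <- data.table(
  instrument = c('bass', 'bass', 'drums', 'guitar', 'guitar'),
  musician   = c('Paul', 'Chas', 'Andy', 'Paul', 'George')
)

# Count how often each musician was drawn and keep those drawn more than once
dups <- draw[, .(times = .N, instruments = paste(instrument, collapse = ", ")),
             by = musician][times > 1]
print(dups)
```

    For this particular draw the only row returned is Paul, listed for bass and guitar.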

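    One could build a for loop that draws the musicians for each instrument subset in turn and then removes them from the rest of the table. But I would like to know how to do this with data.table, mainly because the real-life problem I need to solve with this logic involves databases with hundreds of thousands of rows, and also because I am trying to understand data.table syntax better.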
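
    For comparison, the loop described above might look roughly like the sketch below. The helper name `draw_once` and the retry at the end are assumptions made for this example: a draw made one instrument at a time can run out of candidates for a later instrument, so the sketch returns NULL in that case and simply tries again (which would loop forever if no valid band existed).

```r
library(data.table)

musicians <- data.table(
  instrument = rep(c('bass','drums','guitar'), each = 4),
  musician = c('Chas','John','Paul','Stuart','Andy','Paul','Peter','Ringo','George','John','Paul','Ringo')
)
band.comp <- data.table(
  instrument = c('bass','drums','guitar'),
  n = c(2,1,2)
)

# Hypothetical helper: one pass over the instruments, in band.comp order
draw_once <- function(mus, comp) {
  taken  <- character(0)                 # musicians already drawn
  picked <- vector("list", nrow(comp))
  for (k in seq_len(nrow(comp))) {
    inst <- comp$instrument[k]
    size <- comp$n[k]
    # candidates for this instrument who have not been drawn yet
    pool <- setdiff(mus[instrument == inst, musician], taken)
    if (length(pool) < size) return(NULL)  # dead end, caller may retry
    drawn <- pool[sample.int(length(pool), size)]
    taken <- c(taken, drawn)
    picked[[k]] <- data.table(instrument = inst, musician = drawn)
  }
  rbindlist(picked)
}

# Retry until a pass completes without running out of candidates
band <- NULL
while (is.null(band)) band <- draw_once(musicians, band.comp)
print(band)
```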

    For reference, I tried some tips from Andrew Brooks' blog, but could not come up with a solution.

    3 answers  |  latest 6 years ago
        1
  •  1
  •   chinsoon12    6 years ago

    I came across a related post: Randomly draw rows from dataframe based on unique values and column values. eddi's answer there fits this question well:

    #keep number of musicians per instrument in 1 data.table
    musicians[band.comp, n:=n, on=.(instrument)]
    
    #for storing the musician that has been sampled so far
    m <- c()
    
    musicians[, {
        #exclude sampled musician before sampling
        res <- .SD[!musician %chin% m][sample(.N, n[1L])]
        #<<- updates m outside j, so later groups see who has already been drawn
        m <<- c(m, res$musician)
        res
    }, by=.(instrument)]
    

    Sample output:

       instrument musician n
    1:       bass   Stuart 2
    2:       bass     Chas 2
    3:      drums     Paul 1
    4:     guitar     John 2
    5:     guitar    Ringo 2
    

    Or, more succinctly and with error handling:

    m <- c()
    musicians[
        band.comp, 
        on=.(instrument), 
        j={
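            #musicians for this instrument who have not been drawn yet
            s <- setdiff(musician, m)
            #i.n is band.comp's n; plain n would pick up musicians$n if that column exists
            if (length(s) < i.n) stop(paste("Not enough musicians playing", .BY))
            res <- sample(s, i.n)
            #<<- updates m outside j, so later groups see who has already been drawn
            m <<- c(m, res)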
            res
        }, 
        by=.EACHI]
    
        2
  •  5
  •   chinsoon12    5 years ago

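    This could be a solution: first pick one instrument for each musician, then sample the musicians for each instrument. However, once each musician is tied to a single instrument, the number requested for an instrument may exceed the number of musicians assigned to it, in which case you get an error (though with your real data this may not be a problem).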

    musicians[, .(instrument = sample(instrument, 1)), by = musician][band.comp, on = 'instrument'][, sample(musician, n), by = instrument]
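
    One way to avoid that error is to redo the instrument assignment until every instrument has enough musicians, and only then sample. The sketch below is an assumption layered on the one-liner above, not part of it; the names `assigned`, `counts` and `band` are chosen for this example, and the loop never ends if no valid assignment exists.

```r
library(data.table)

musicians <- data.table(
  instrument = rep(c('bass','drums','guitar'), each = 4),
  musician = c('Chas','John','Paul','Stuart','Andy','Paul','Peter','Ringo','George','John','Paul','Ringo')
)
band.comp <- data.table(
  instrument = c('bass','drums','guitar'),
  n = c(2,1,2)
)

repeat {
  # tie every musician to exactly one of the instruments they play
  assigned <- musicians[, .(instrument = sample(instrument, 1)), by = musician]
  # musicians available per instrument, in band.comp order (zero if none)
  counts <- table(factor(assigned$instrument, levels = band.comp$instrument))
  if (all(as.integer(counts) >= band.comp$n)) break
}

# every instrument now has enough candidates, so this draw cannot fail
band <- assigned[band.comp, on = 'instrument'][
  , .(musician = musician[sample.int(.N, n[1L])]), by = instrument]
print(band)
```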
    
        3
  •  3
  •   Frank    6 years ago

    You could expand the band into sum(band.comp$n) distinct positions and keep sampling until you find a feasible composition:

    roles = musicians[, 
      CJ(posn = 1:band.comp[.BY, on=.(instrument), x.n], musician = musician)
    , by=instrument]
    
    set.seed(1)
    while (TRUE){
      roles[sample(1:.N), keep := !duplicated(.SD, by="musician") & !duplicated(.SD, by=c("instrument", "posn"))][]
      if (sum(roles$keep) == sum(band.comp$n)) break
    }
    
    setorder(roles[keep == TRUE, !"keep"])[]
    
       instrument posn musician
    1:       bass    1   Stuart
    2:       bass    2     John
    3:      drums    1     Andy
    4:     guitar    1   George
    5:     guitar    2     Paul
    

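    Maybe you could use linear programming or bipartite graphs to answer whether a feasible composition exists, but it is not clear what "sampling" would mean in terms of a distribution over the feasible compositions.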
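
    As a rough illustration of the bipartite-graph idea, the sketch below treats each position in the band as one node and each musician as another, and looks for a matching with augmenting paths. The function `feasible_band` and its internals are hypothetical names written for this example, not something from data.table. It only answers whether some valid composition exists (returning one, or NULL); it is deterministic and does not sample.

```r
library(data.table)

musicians <- data.table(
  instrument = rep(c('bass','drums','guitar'), each = 4),
  musician = c('Chas','John','Paul','Stuart','Andy','Paul','Peter','Ringo','George','John','Paul','Ringo')
)
band.comp <- data.table(
  instrument = c('bass','drums','guitar'),
  n = c(2,1,2)
)

# Hypothetical helper: returns one valid composition, or NULL if none exists
feasible_band <- function(mus, comp) {
  slots   <- rep(comp$instrument, comp$n)          # one entry per position
  players <- split(mus$musician, mus$instrument)   # candidates per instrument
  holder  <- list()                                # holder[[name]] = slot index held

  # Try to fill slot s, re-seating already placed musicians if needed.
  # visited is an environment, so marks are shared across the recursion.
  try_fill <- function(s, visited) {
    for (p in players[[slots[s]]]) {
      if (isTRUE(visited[[p]])) next
      visited[[p]] <- TRUE
      if (is.null(holder[[p]]) || try_fill(holder[[p]], visited)) {
        holder[[p]] <<- s
        return(TRUE)
      }
    }
    FALSE
  }

  for (s in seq_along(slots)) {
    if (!try_fill(s, new.env())) return(NULL)      # slot s cannot be filled
  }
  data.table(instrument = slots[unlist(holder)], musician = names(holder))
}

res <- feasible_band(musicians, band.comp)
print(res)
```

    If more positions are requested than distinct musicians can cover, for example five guitar players, the function returns NULL.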