
Sort by sub-segments of a dataframe

  •  4
  • W Barker  · Tech Community  · 7 years ago

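    My team and I are working with thousands of URLs that share similar segments. Some URLs have one segment ("seg", plural "segs") at the position we are interested in; other, similar URLs have different segs at that position. We need to sort a dataframe of URLs and their associated unique segs so that it shows how often each unique seg occurs at the position of interest.

    Here is a simple example: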

     url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
     seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
     df <- data.frame(url,seg)
    

    We are looking for the following output:

    url freq seg 
     1   3    a   in other words, url #1 appears three times each with a seg = "a",
     2   2    b   in other words: url #2 appears twice each with a seg = "b",
     3   3    c   in other words: url #3 appears three times with a seg = "c", 
     3   2    x                                  two times with a seg = "x", and, 
     3   1    y                                  once with a seg = "y"
     4   1    d   etc.
    

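    I can get there with a loop and several small steps, but I am sure there is a more elegant way to do it. Here is my inelegant approach:

    Create a starting dataframe with one placeholder row and three columns (url, Freq, seg)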

     result <- data.frame(url=0, Freq=0, seg=0)
    

    Identify the unique URLs

     unique.df.url <- unique(df$url)
    

    Loop through the dataframe

     for (xx in unique.df.url) {
       url.seg <- df[df$url == xx, ] # xx is already a url value (not an index), so keep the rows for that url
       freq.df.url <- data.frame(table(url.seg))  # summarize the frequency distribution of the segs by url
       result <- rbind(result,freq.df.url)  # append a new data.frame onto the last one
     }
    

    Remove the rows with frequency = 0 from the dataframe

     result.freq <- result[which(result$Freq > 0), ]
    

    Sort the dataframe by URL

     result.order <- result.freq[order(result.freq$url), ]
    

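    This produces the desired result, but because it is so inelegant, I worry that once we run it at scale the time required will be prohibitive, or at least worrisome. Any suggestions?

    5 answers  |  as of 7 years ago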
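
    One way the loop above could be tidied up while staying in base R, offered as a sketch rather than a benchmarked fix: split the rows by url, tabulate the segs in each piece, and bind the pieces once at the end instead of growing the result inside the loop. The names pieces, piece and counts are illustrative and do not appear in the question.

```r
# Same toy data as in the question.
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url, seg)

# Split the rows by url and count the segs inside each piece.
pieces <- lapply(split(df, df$url), function(piece) {
  counts <- table(piece$seg)
  counts <- counts[counts > 0]  # drop unused levels (matters when seg is a factor)
  data.frame(url  = piece$url[1],
             freq = as.vector(counts),
             seg  = names(counts))
})

# Bind all pieces in one call, then sort by url and descending frequency.
result <- do.call(rbind, pieces)
result <- result[order(result$url, -result$freq), ]
rownames(result) <- NULL
result
```

    Binding once avoids repeatedly copying a growing dataframe, which is usually the slow part of the rbind-in-a-loop pattern.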
        1
  •  2
  •   moodymudskipper    7 years ago

    In base R, you can do this:

    aggregate(freq~seg+url,`$<-`(df,freq,1),sum)
    # or aggregate(freq~seg+url, data.frame(df,freq=1),sum)
    
    #   seg url freq
    # 1   a   1    3
    # 2   b   2    2
    # 3   c   3    3
    # 4   x   3    2
    # 5   y   3    1
    # 6   d   4    1
    

    The trick with $<- simply adds a column freq with the value 1 everywhere, without modifying the source table.

    Another possibility:
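
    To make the $<- trick concrete, here is a small sketch on the question's toy data. It shows that calling $<- as a function returns a modified copy and leaves df untouched. The names with_freq and counts are illustrative only.

```r
# Same toy data as in the question.
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url, seg)

# `$<-` called as a function returns a copy of df with the extra column.
with_freq <- `$<-`(df, freq, 1)

names(df)         # df itself still has only url and seg
names(with_freq)  # the copy has url, seg and freq

# Sum the 1s per (seg, url) pair, then reorder rows and columns
# to match the layout asked for in the question.
counts <- aggregate(freq ~ seg + url, with_freq, sum)
counts <- counts[order(counts$url, -counts$freq), c("url", "freq", "seg")]
counts
```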

    subset(as.data.frame(table(df[2:1])),Freq!=0)
    #    seg url Freq
    # 1    a   1    3
    # 8    b   2    2
    # 15   c   3    3
    # 17   x   3    2
    # 18   y   3    1
    # 22   d   4    1
    

    Here I use [2:1] to switch the order of the columns so that table orders the results in the required way.
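
    One thing that may matter at the scale the question mentions: table builds a cell for every seg/url combination, including the ones that never occur, which is why the subset step is needed. The sketch below just counts those cells on the toy data; tab, n_cells, n_empty and nonzero are illustrative names.

```r
# Same toy data as in the question.
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url, seg)

tab <- table(df[2:1])      # rows are segs, columns are urls
n_cells <- length(tab)     # one cell per seg/url combination
n_empty <- sum(tab == 0)   # combinations that never occur in df

# Only the non-empty cells survive the subset step.
nonzero <- subset(as.data.frame(tab), Freq != 0)

c(cells = n_cells, empty = n_empty, kept = nrow(nonzero))
```

    With many distinct urls and segs the full cross table can become large and mostly empty, so the aggregate variant may be worth comparing on real data.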

        2
  •  0
  •   AntoniosK    7 years ago
    url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
    seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
    df <- data.frame(url,seg)
    
    library(dplyr)
    
    df %>% count(url, seg) %>% arrange(url, desc(n))
    
    # # A tibble: 6 x 3
    #     url seg       n
    #   <dbl> <fct> <int>
    # 1     1 a         3
    # 2     2 b         2
    # 3     3 c         3
    # 4     3 x         2
    # 5     3 y         1
    # 6     4 d         1
    
        3
  •  0
  •   Pavel Paltsev    7 years ago

    Does the code below work better for you?

    library(dplyr)
    df %>% group_by(url, seg) %>% summarise(n()) 
    
        4
  •  0
  •   r.user.05apr    7 years ago

    Or a quick approach with paste:

    url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
    seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
    df <- data.frame(url,seg)
    
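    want <- tapply(url, INDEX = paste(url, seg, sep = "_"), length) # count rows per "url_seg" key
    want <- data.frame(do.call(rbind, strsplit(names(want), "_")), want) # split the key back into url and seg
    colnames(want) <- c("url", "seg", "freq")
    want$url <- as.numeric(as.character(want$url)) # strsplit returns text; convert back so urls sort numerically
    want <- want[order(want$url, -want$freq), ]
    rownames(want) <- NULL # needed?
    want <- want[ , c("url", "freq", "seg")] # needed?
    want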
    
        5
  •  0
  •   MKR    7 years ago

    One option is to use table, converted with as.data.frame, to get the data in the format the OP needs:

    library(tidyverse)
    table(df) %>% as.data.frame() %>% 
      filter(Freq > 0 ) %>%
      arrange(url, desc(Freq))
    
    
    #   url seg  Freq
    # 1   1   a     3
    # 2   2   b     2
    # 3   3   c     3
    # 4   3   x     2
    # 5   3   y     1
    # 6   4   d     1
    

    df %>% group_by(url, seg) %>%
      summarise(freq = n()) %>%
      arrange(url, desc(freq))
    
    # # A tibble: 6 x 3
    # # Groups: url [4]
    #    url seg      freq
    #   <dbl> <fctr> <int>
    # 1  1.00 a          3
    # 2  2.00 b          2
    # 3  3.00 c          3
    # 4  3.00 x          2
    # 5  3.00 y          1
    # 6  4.00 d          1