How do I correctly set the shape sizes for this data while showing all three variables in geom_point?

  •  2
  • MAPK  · Tech Community  · 7 years ago

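    I have the data and code below. I want the point sizes to accurately reflect the Size column for each nucleotide. If you check the summary statistics for the data, you can clearly see that T has the largest total size and A the second largest, but my plot does not show this correctly. What is wrong with my plotting code below?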

    #check some statistics:
    counts <- aggregate(Size~Nucleotides,all.data,length)
    names(counts)[2] <- 'counts'
    totalSize <- aggregate(Size~Nucleotides,all.data,sum)
    names(totalSize)[2] <- 'totalSize'
    merge(counts,totalSize)
    
    # Nucleotides counts totalSize
    # 1           A      6 24.700016
    # 2           C      6  3.001356
    # 3           G      6  5.155665
    # 4           T      6 37.471940
    

    Plotting code:

    p <- ggplot(all.data) +
      geom_point(aes(x=Pos, y = Size, color = bases,group = Samples, shape = Samples, size = Nucleotides))+
      # geom_point(aes(x=Pos, y = Size, color = bases,group = Samples, shape = Samples))+
      scale_shape_manual(values=1:nlevels(all.data$Samples)) +
      theme_bw() 
    p
    

    Data:

    all.data <- structure(list(Pos = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 
    2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), Nucleotides = structure(c(1L, 
    1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 1L, 1L, 1L, 2L, 2L, 
    2L, 3L, 3L, 3L, 4L, 4L, 4L), .Label = c("A", "C", "G", "T"), class = "factor"), 
        Size = c(0.80519048411246, 0.375977374812843, 10.6754283813009, 
        0.495757777408085, 0.615538180003327, 0.329396107136916, 
        0.835135584761271, 0.562302445516553, 1.11795042422226, 0.246215272001331, 
        0.339377807353186, 20.0931625353519, 1.06859576968273, 0.264394829612221, 
        11.510428907168, 0.554494712103408, 0.624265569917744, 0.381903642773208, 
        0.829905992949471, 0.631609870740306, 1.17876028202115, 0.334165687426557, 
        0.290099882491187, 16.1689189189189), Samples = structure(c(2L, 
        2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 
        1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Ago2_SsHV2L_1_CATGGC_L003_R1_001", 
        "Ago2_SsHV2L_2_CATTTT_L003_R1_001"), class = "factor"), bases = c("21", 
        "21", "21", "21", "21", "21", "21", "21", "21", "21", "21", 
        "21", "21", "21", "21", "21", "21", "21", "21", "21", "21", 
        "21", "21", "21")), .Names = c("Pos", "Nucleotides", "Size", 
    "Samples", "bases"), row.names = c("1.A", "2.A", "3.A", "1.C", 
    "2.C", "3.C", "1.G", "2.G", "3.G", "1.T", "2.T", "3.T", "1.A1", 
    "2.A1", "3.A1", "1.C1", "2.C1", "3.C1", "1.G1", "2.G1", "3.G1", 
    "1.T1", "2.T1", "3.T1"), reshapeLong = structure(list(varying = list(
        c("A", "C", "G", "T")), v.names = "Mismatches", idvar = "Pos", 
        timevar = "Nucleotides"), .Names = c("varying", "v.names", 
    "idvar", "timevar")), class = "data.frame")
    
    1 reply  |  latest 7 years ago
        1
  •  2
  •   Andrew Lavers    7 years ago

    This shows how to summarize the values and join them back to the original data frame so that they can be referenced directly in ggplot.

    #check some statistics:
    counts <- aggregate(Size~Nucleotides,all.data,length)
    names(counts)[2] <- 'counts'
    totalSize <- aggregate(Size~Nucleotides,all.data,sum)
    names(totalSize)[2] <- 'totalSize'
    
    ## compute the summary and join with detail dataframe
    summarized <- merge(counts, totalSize, sort = TRUE)
    merged <- merge(all.data, summarized, by ="Nucleotides")
    
    ## make a summarized label column example  "A 24.70"
    summarized$NucleotidesTotalSize <- paste(summarized$Nucleotides, format(round(summarized$totalSize,2), nsmall=2))
    
    library(ggplot2)
    p <- ggplot(merged) +
      geom_point(aes(x=Pos, y = Size, shape = Samples, size = totalSize, color = bases))+
      scale_shape_manual(values=1:nlevels(all.data$Samples)) +
      # use the summarized dataframe for labelling and breaks
      scale_size(name = "Nucleotides Total Size", breaks = summarized$totalSize, labels=summarized$NucleotidesTotalSize) +
      theme_bw() 
    
    print(p)
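
    The reason the original plot looked wrong is that `size = Nucleotides` maps a factor to size, so ggplot2 assigns sizes by level order (A, C, G, T) rather than by any total of Size. As a rough alternative to the two `aggregate()` calls plus `merge()`, base R's `ave()` can attach the per-nucleotide total to each row in one step. The sketch below uses a small made-up data frame (`toy`, with hypothetical values); only the column names follow the question, and the plot is built only if ggplot2 happens to be installed.

```r
# Toy data with hypothetical values; only the column names follow the question.
toy <- data.frame(
  Pos         = rep(1:3, times = 2),
  Nucleotides = factor(rep(c("A", "T"), each = 3)),
  Size        = c(1, 2, 3, 10, 20, 30)
)

# ave() returns one value per row: the sum of Size within that row's nucleotide.
toy$totalSize <- ave(toy$Size, toy$Nucleotides, FUN = sum)

# Build the plot only if ggplot2 is available; size is mapped to the numeric
# totalSize column instead of the Nucleotides factor.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(toy) +
    ggplot2::geom_point(ggplot2::aes(x = Pos, y = Size, size = totalSize)) +
    ggplot2::theme_bw()
}
```

    With the real data, the same `ave()` line applied to `all.data` should give the column that the `merged` data frame above obtains through `merge()`; the legend labelling with `scale_size()` would still need the summarized table.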
    
