代码之家  ›  专栏  ›  技术社区  ›  Laramie

T-SQL:一个更好的滑动分布函数/查询

  •  6
  • Laramie  · 技术社区  · 15 年前

    我需要一个类似于ntile()提供的T-SQL排序方法,只是每个tile的成员都在一个滑动分布中,这样排名较高的tile的成员就更少了。

    例如

    CREATE TABLE #Rank_Table(
    id int identity(1,1) not null,
    hits bigint not null default 0,
    PERCENTILE smallint null
    )
    --Slant the distribution of the data
    INSERT INTO #Rank_Table (hits)
    select CASE 
      when DATA > 9500 THEN DATA*30
      WHEN data > 8000  THEN DATA*5 
      WHEN data < 7000  THEN DATA/3 +1
      ELSE DATA
     END
    FROM
     (select top 10000 (ABS(CHECKSUM(NewId())) % 99 +1) * (ABS(CHECKSUM(NewId())) % 99 +1 ) DATA
     from master..spt_values t1
      cross JOIN master..spt_values t2) exponential
    
    Declare @hitsPerGroup as bigint
    Declare @numGroups as smallint
    set @numGroups=100
    
    select @hitsPerGroup=SUM(hits)/(@numGroups -1) FROM #Rank_Table 
    
    select @hitsPerGroup HITS_PER_GROUP
    
    --This is an even distribution
    SELECT  id,HITS, NTILE(@numGroups) Over (Order By HITS DESC) PERCENTILE 
    FROM #Rank_Table 
    GROUP by id, HITS
    
    --This is my best attempt, but it skips groups because of the erratic distribution
    select 
        T1.ID, 
        T1.hits, 
        T.RunningTotal/@hitsPerGroup + 1 TILE,
        T.RunningTotal
    FROM    #Rank_Table T1
            CROSS APPLY ( Select SUM(hits) RunningTotal FROM #Rank_Table where hits <= T1.hits) T
    order by T1.hits 
    
    DROP TABLE #Rank_Table
    

    在rank_表中,ntile(@numgroups)创建了@numgroups组的均匀分布。我需要的是@numgroups组,其中tile 1的成员最少,tile 2将有一个或多个tile 1,tile 3将有1个或多个tile 2…瓷砖100将是最多的。

    我正在使用SQL Server 2008。实际上,这将针对一个可能有数百万行的永久表运行,以便定期用百分位数从1到100更新百分位数列。

    我最好的尝试以上将跳过百分位数和表现不佳。一定有更好的办法。

    2 回复  |  直到 14 年前
        1
  •  1
  •   gbn    14 年前

    A better NTILE implementation ?牛传染性胃肠炎病毒

        2
  •  0
  •   Laramie    14 年前

    为了创建更线性的分布,我在数据表中添加了一个计算列,点击 HITS_SQRT AS (CONVERT([int],sqrt(HITS*4),(0))) PERSISTED .

    使用此列,您可以计算“按百分比点击”的目标数量。

    select @hitsPerGroup=SUM(HITS_SQRT)/(@numGroups -1)-@numGroups, @dataPoints=COUNT(*) FROM #Rank_Table 
    

    然后,该脚本创建一个临时表,该表的行_number()按点击数排序,并按降序迭代这些行,将其百分位数从100更新为1。一个连续的总命中数被保存,当 @hitsPerGroup 通过后,百分比从100降低到99、99降低到98等。

    然后,源数据表用它的百分比更新。有一个临时工作表的索引来加速更新。

    完整脚本使用 #Rank_Table 作为源数据表。

    --Create Test Data
    CREATE TABLE #Rank_Table(
    id int identity(1,1) not null,
    hits bigint not null default 0,
    PERCENTILE smallint NULL,
    HITS_SQRT  AS (CONVERT([int],sqrt(HITS*4),(0))) PERSISTED
    )
    --Slant the distribution of the data
    INSERT INTO #Rank_Table (hits)
    select CASE 
      when DATA > 9500 THEN DATA*30
      WHEN data > 8000  THEN DATA*5 
      WHEN data < 7000  THEN DATA/3 +1
      ELSE DATA
     END
    FROM
     (select top 10000 (ABS(CHECKSUM(NewId())) % 99 +1) * (ABS(CHECKSUM(NewId())) % 99 +1 ) DATA
     from master..spt_values t1
      cross JOIN master..spt_values t2) exponential
    
    --Create temp work table and variables to calculate percentiles
        Declare @hitsPerGroup as int
        Declare @numGroups as int
        Declare @dataPoints as int
        set @numGroups=100
    
        select @hitsPerGroup=SUM(HITS_SQRT)/(@numGroups -1)-@numGroups, @dataPoints=COUNT(*) FROM #Rank_Table 
    
        --show the number of hits that each group should have
        select @hitsPerGroup HITS_PER_GROUP
    
        --Use temp table for the calculation
        CREATE TABLE #tbl (
            row int,
            hits int,
            ID bigint,
            PERCENTILE smallint null
        )
        --add index to row
        CREATE CLUSTERED INDEX idxRow ON #tbl(row) 
    
        insert INTO #tbl
        select ROW_NUMBER() over (ORDER BY HITS), hits_SQRT, ID, null from #Rank_Table
    
        --Update each row with a running total.
        --lower the percentile by one when we cross a threshold for the maximum number of hits per group (@hitsPerGroup)
        DECLARE @row as int
        DEClare @runningTotal as int
        declare @percentile int
        set @row = 0
        set @runningTotal = 0
        set @percentile = @numGroups
    
        while @row <= @dataPoints
        BEGIN
            select @runningTotal=@runningTotal + hits from #tbl where row=@row
    
            if @runningTotal >= @hitsPerGroup
            BEGIN
    
                update #tbl
                set PERCENTILE=@percentile
                WHERE PERCENTILE is null and row <@row
    
                set @percentile = @percentile - 1
    
                set @runningTotal = 0
            END
    
            --change rows
            set @row = @row + 1
        END
    
        --get remaining
        update #tbl
        set PERCENTILE=@percentile
        WHERE PERCENTILE is null
    
        --update source data
        UPDATE m SET PERCENTILE = t.PERCENTILE
        FROM #tbl t
        inner join #Rank_Table m on t.ID=m.ID
    
    
    --Show the results
        SELECT PERCENTILE, COUNT(id) NUMBER_RECORDS, SUM(HITS) HITS_IN_PERCENTILE 
        FROM #Rank_Table 
        GROUP BY PERCENTILE
        ORDER BY PERCENTILE 
    
    --cleanup
        DROP TABLE #Rank_Table
        DROP TABLE #tbl
    

    它的性能不是一流的,但它达到了平滑滑动分布的目的。