
SQL query to calculate partial visit durations from a log table

  •  7
  • Rob Hunter  · Tech community  · 6 years ago

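    I have a table that logs the user ID, course, session ID, and request date each time a web page is loaded. I want to calculate the duration per userid for a given courseid. This is problematic because the time spans overlap.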

    The data provided here should result in a duration of 10 minutes for every user in course 1. I can't seem to get it right.

    CREATE TABLE PageLogSample (
        id INT NOT NULL PRIMARY KEY IDENTITY
    ,   userid INT
    ,   courseid INT
    ,   sessionid INT
    ,   requestdate DATETIME
    );
    
    TRUNCATE TABLE PageLogSample;
    
    INSERT INTO PageLogSample (userid, courseid, sessionid, requestdate)
    -- [0, 10] = 10 minutes
              SELECT 1, 1, 1, '00:00:00'
    UNION ALL SELECT 1, 1, 1, '00:10:00'
    -- [0, 12] - [3, 5] = 10 minutes
    -- or ... [0, 3] + [5, 12] = 10 minutes
    UNION ALL SELECT 2, 1, 2, '00:00:00'
    UNION ALL SELECT 2, 2, 2, '00:03:00'
    UNION ALL SELECT 2, 2, 2, '00:05:00'
    UNION ALL SELECT 2, 1, 2, '00:12:00'
    -- [0, 12] - [3, 5] = 10 minutes
    -- or ... [0, 3] + [5, 12] = 10 minutes
    UNION ALL SELECT 3, 1, 3, '00:00:00'
    UNION ALL SELECT 3, 2, 3, '00:03:00'
    UNION ALL SELECT 3, 2, 3, '00:05:00'
    UNION ALL SELECT 3, 1, 3, '00:12:00'
    UNION ALL SELECT 3, 2, 3, '00:15:00'
    -- [1, 13] - [3, 5] = 10 minutes
    -- or ... [1, 3] + [5, 13] = 10 minutes
    UNION ALL SELECT 4, 2, 4, '00:00:00'
    UNION ALL SELECT 4, 1, 4, '00:01:00'
    UNION ALL SELECT 4, 2, 4, '00:03:00'
    UNION ALL SELECT 4, 2, 4, '00:05:00'
    UNION ALL SELECT 4, 1, 4, '00:13:00'
    UNION ALL SELECT 4, 2, 4, '00:15:00'
    -- [0, 5] + [10, 15] = 10 minutes
    UNION ALL SELECT 5, 1, 5, '00:00:00'
    UNION ALL SELECT 5, 1, 5, '00:05:00'
    UNION ALL SELECT 5, 1, 6, '00:10:00'
    UNION ALL SELECT 5, 1, 6, '00:15:00'
    -- [0, 10] = 10 minutes (ignoring everything in between)
    UNION ALL SELECT 6, 1, 7, '00:00:00'
    UNION ALL SELECT 6, 1, 7, '00:03:00'
    UNION ALL SELECT 6, 1, 7, '00:05:00'
    UNION ALL SELECT 6, 1, 7, '00:07:00'
    UNION ALL SELECT 6, 1, 7, '00:10:00'
    -- [0, 11] - [5, 6] = 10 minutes
    -- or ... [0, 3] + [7, 11] = 7 minutes (good)
    -- or ... [0, 5] + [7, 11] = 9 minutes (better)
    UNION ALL SELECT 7, 1, 8, '00:00:00'
    UNION ALL SELECT 7, 1, 8, '00:03:00'
    UNION ALL SELECT 7, 2, 8, '00:05:00'
    UNION ALL SELECT 7, 2, 8, '00:06:00'
    UNION ALL SELECT 7, 1, 8, '00:07:00'
    UNION ALL SELECT 7, 1, 8, '00:11:00'
    -- [0, 1] + [2, 4] + [5, 7] + [8, 13] = 10
    UNION ALL SELECT 8, 1, 9, '00:00:00'
    UNION ALL SELECT 8, 2, 9, '00:01:00'
    UNION ALL SELECT 8, 1, 9, '00:02:00'
    UNION ALL SELECT 8, 1, 9, '00:03:00'
    UNION ALL SELECT 8, 2, 9, '00:04:00'
    UNION ALL SELECT 8, 1, 9, '00:05:00'
    UNION ALL SELECT 8, 1, 9, '00:06:00'
    UNION ALL SELECT 8, 2, 9, '00:07:00'
    UNION ALL SELECT 8, 1, 9, '00:08:00'
    UNION ALL SELECT 8, 1, 9, '00:13:00'
    ;
    

    First attempt: the naive approach. It goes wrong wherever the courses overlap.

    DECLARE @courseid INT;
    SET @courseid = 1;
    
    SELECT subquery.userid
    , COUNT(DISTINCT subquery.sessionid) AS sessioncount
    , SUM(subquery.duration) AS duration
    , CASE SUM(subquery.duration) 
        WHEN 10 THEN 'ok' 
        ELSE 'ERROR' 
    END
    FROM (
        SELECT userid
        , sessionid
        , DATEDIFF(MINUTE, MIN(requestdate), MAX(requestdate)) AS duration
        FROM PageLogSample
        WHERE courseid = @courseid
        GROUP BY userid
        , sessionid
    ) subquery
    GROUP BY subquery.userid
    ORDER BY subquery.userid;
    
    -- userid  sessioncount  duration   
    -- 1       1             10       ok
    -- 2       1             12       ERROR
    -- 3       1             12       ERROR
    -- 4       1             12       ERROR
    -- 5       2             10       ok
    

    Second attempt: avoid the overlaps. This only partially works.

    DECLARE @courseid INT;
    SET @courseid = 1;
    
    WITH cte (userid, courseid, sessionid, start, finish, duration)
    AS (
        SELECT userid
        , courseid
        , sessionid
        , MIN(requestdate)
        , MAX(requestdate)
        , DATEDIFF(MINUTE, MIN(requestdate), MAX(requestdate))
        FROM PageLogSample
        GROUP BY userid
        , courseid
        , sessionid
    )
    SELECT naive.userid
    , naive.sessioncount
    , naive.duration AS naiveduration
    , correction.duration AS correctionduration
    , naive.duration - ISNULL(correction.duration, 0) AS duration
    , CASE naive.duration - ISNULL(correction.duration, 0)
        WHEN 10 THEN 'ok' 
        ELSE 'ERROR' 
    END
    FROM (
        SELECT cte.userid
        , COUNT(DISTINCT cte.sessionid) AS sessioncount
        , SUM(cte.duration) AS duration
        FROM cte
        WHERE cte.courseid = @courseid
        GROUP BY cte.userid
    ) naive
    LEFT JOIN (
        SELECT errors.userid
        , SUM(errors.duration) AS duration
        FROM cte errors
        WHERE errors.courseid <> @courseid
        AND EXISTS (
            SELECT *
            FROM cte
            WHERE cte.userid = errors.userid -- only compare spans of the same user
            AND cte.start <= errors.start
            AND cte.finish >= errors.finish
            AND cte.courseid = @courseid
        )
        GROUP BY errors.userid
    ) correction
    ON naive.userid = correction.userid
    ;
    
    -- userid  sessioncount  naiveduration  correctionduration  duration
    -- 1       1             10             NULL                10        ok
    -- 2       1             12             2                   10        ok
    -- 3       1             12             NULL                12        ERROR
    -- 4       1             12             NULL                12        ERROR
    -- 5       2             10             NULL                10        ok
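    
    To see why the correction misses users 3 and 4, it can help to list the per-course spans that the EXISTS test compares. The query below is only an inspection sketch against the sample table; the aliases firstrequest and lastrequest are names chosen here for illustration.

    ```sql
    -- Per-course request spans for the two users the correction misses.
    SELECT userid
    , courseid
    , sessionid
    , MIN(requestdate) AS firstrequest
    , MAX(requestdate) AS lastrequest
    FROM PageLogSample
    WHERE userid IN (3, 4)
    GROUP BY userid
    , courseid
    , sessionid
    ORDER BY userid
    , courseid;
    ```

    With the sample data, user 3's course 2 span runs from 00:03 to 00:15 while the course 1 span runs from 00:00 to 00:12, and user 4's course 2 span (00:00 to 00:15) likewise extends past the course 1 span (00:01 to 00:13). Neither course 2 span is fully contained in a course 1 span, so the EXISTS condition finds nothing to subtract.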
    

    Update: Ed Harper's comment really made me reconsider my approach.

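    So here comes the third attempt. Here I first work out which rows represent entering a course and which rows represent someone leaving it. Then I take the sum of all end times and subtract the sum of all start times. I think this is more correct, but it is not perfect.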

    DECLARE @courseid INT;
    SET @courseid = 1;
    
    WITH numberedcte (rn, id, userid, courseid, sessionid, requestdate)
    AS (
        SELECT ROW_NUMBER() OVER (PARTITION BY sessionid, userid ORDER BY id)
        , id
        , userid
        , courseid
        , sessionid
        , requestdate
        FROM PageLogSample
    )
    , typedcte (rowtype, id, userid, courseid, sessionid, requestdate, nextrequestdate)
    AS (
        SELECT CASE
            WHEN previousrequest.courseid = nextrequest.courseid
                THEN 'between'
            WHEN previousrequest.courseid IS NULL
                OR nextrequest.courseid = numberedcte.courseid
                THEN 'begin'
            WHEN nextrequest.courseid IS NULL
                OR previousrequest.courseid = numberedcte.courseid
                THEN 'end'
            ELSE 'error?'
        END AS rowtype
        , numberedcte.id
        , numberedcte.userid
        , numberedcte.courseid
        , numberedcte.sessionid
        , numberedcte.requestdate
        , nextrequest.requestdate
        FROM numberedcte
        LEFT JOIN numberedcte previousrequest
            ON previousrequest.userid = numberedcte.userid
            AND previousrequest.sessionid = numberedcte.sessionid
            AND previousrequest.rn = numberedcte.rn - 1
        LEFT JOIN numberedcte nextrequest
            ON nextrequest.userid = numberedcte.userid
            AND nextrequest.sessionid = numberedcte.sessionid
            AND nextrequest.rn = numberedcte.rn + 1
        WHERE numberedcte.courseid = @courseid
        AND (
            nextrequest.courseid = @courseid
            OR previousrequest.courseid = @courseid
        )
    )
    , beginsum (userid, value)
    AS (
        SELECT userid, SUM(DATEPART(MINUTE, requestdate))
        FROM typedcte
        WHERE rowtype = 'begin'
        GROUP BY userid
    )
    , endsum (userid, value)
    AS (
        SELECT userid, SUM(DATEPART(MINUTE, ISNULL(nextrequestdate, requestdate)))
        FROM typedcte
        WHERE rowtype = 'end'
        GROUP BY userid
    )
    SELECT beginsum.userid
    , endsum.value - beginsum.value AS duration
    FROM beginsum
    INNER JOIN endsum
        ON beginsum.userid = endsum.userid
    ;
    

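    The only problem here is that I only get output for users 1 and 5 from the original sample data. The added user 6 also gives correct output. The added user 7 now gives me a satisfactory output. User 8 is almost perfect; I am missing the one minute from the first row to the second row.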

    -- userid  duration
    -- 1       10
    -- 5       10
    -- 6       10
    -- 7       9
    -- 8       9
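    
    One caveat about summing DATEPART(MINUTE, ...) values as above: DATEPART returns only the minute-of-hour component, so the subtraction is only meaningful while a begin/end pair stays within the same hour. The small sketch below uses made-up timestamps (not part of the sample data) to contrast it with DATEDIFF on the full values.

    ```sql
    -- Illustrative timestamps (not from the sample data) that cross an hour boundary.
    DECLARE @start DATETIME;
    DECLARE @finish DATETIME;
    SET @start = '20090601 00:55:00';
    SET @finish = '20090601 01:05:00';

    SELECT DATEPART(MINUTE, @finish) - DATEPART(MINUTE, @start) AS datepart_difference -- 5 - 55 = -50
    , DATEDIFF(MINUTE, @start, @finish) AS datediff_minutes; -- 10
    ```

    The sample data all falls within a single hour, which is why the issue does not show up in the results here.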
    

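    I feel like I am only inches away from the completely correct answer. The only durations still missing come from page requests that do not occur in a group. Can anybody help me find a way to capture those lone page views?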

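    Update: here is the fourth attempt. Here I assign a value to each request and sum them up. It does not give me exactly the output I was hoping for, but it looks like it may be good enough.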

    DECLARE @courseid INT;
    SET @courseid = 1;
    
    WITH numberedcte (rn, userid, courseid, sessionid, requestdate)
    AS (
        SELECT ROW_NUMBER() OVER (PARTITION BY sessionid, userid ORDER BY id)
        , userid
        , courseid
        , sessionid
        , requestdate
        FROM PageLogSample
    )
    , valuecte (value, userid, courseid, sessionid)
    AS (
        SELECT CASE
            --alone
            WHEN ( previousrequest.courseid IS NULL
                OR previousrequest.courseid <> numberedcte.courseid
                )
                AND nextrequest.courseid <> numberedcte.courseid
                THEN DATEDIFF(MINUTE, numberedcte.requestdate, nextrequest.requestdate)
            --between
            WHEN previousrequest.courseid = nextrequest.courseid
                THEN 0
            --begin
            WHEN previousrequest.courseid IS NULL
                OR nextrequest.courseid = numberedcte.courseid
                THEN -1 * DATEPART(MINUTE, numberedcte.requestdate)
            --ignored (end with no next request)
            WHEN nextrequest.courseid IS NULL
                AND previousrequest.courseid <> numberedcte.courseid
                THEN 0
            --end
            WHEN nextrequest.courseid IS NULL
                OR previousrequest.courseid = numberedcte.courseid
                THEN DATEPART(MINUTE, ISNULL(nextrequest.requestdate, numberedcte.requestdate))
            --impossible?
            ELSE 0
        END
        , numberedcte.userid
        , numberedcte.courseid
        , numberedcte.sessionid
        FROM numberedcte
        LEFT JOIN numberedcte previousrequest
            ON previousrequest.userid = numberedcte.userid
            AND previousrequest.sessionid = numberedcte.sessionid
            AND previousrequest.rn = numberedcte.rn - 1
        LEFT JOIN numberedcte nextrequest
            ON nextrequest.userid = numberedcte.userid
            AND nextrequest.sessionid = numberedcte.sessionid
            AND nextrequest.rn = numberedcte.rn + 1
        WHERE numberedcte.courseid = @courseid
    )
    SELECT userid
    , courseid
    , COUNT(DISTINCT sessionid) AS sessioncount
    , SUM(value) AS duration
    FROM valuecte
    GROUP BY userid
    , courseid
    ORDER BY userid
    ;
    

    As you can see, the results are not exactly what I expected.

    -- userid  courseid  sessioncount  duration
    -- 1       1         1             10
    -- 2       1         1              3
    -- 3       1         1              6
    -- 4       1         1              4
    -- 5       1         2             10
    -- 6       1         1             10
    -- 7       1         1              9
    -- 8       1         1             10
    

    Performance on a local copy of the real database is terrible. So if anybody has a more efficient way of writing this... shoot.

    Update: performance is improved. I added an index and it now works like a charm.
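
    The post does not say which index was added. Purely as an assumption, one plausible candidate is an index that matches the ROW_NUMBER() partitioning and the self-joins on userid, sessionid, and row order; the index name below is hypothetical.

    ```sql
    -- Hypothetical index (name and column choice are assumptions, not from the post):
    -- keyed on the partition/order columns, carrying the other columns the queries read.
    CREATE NONCLUSTERED INDEX IX_PageLogSample_user_session
        ON PageLogSample (userid, sessionid, id)
        INCLUDE (courseid, requestdate);
    ```

    Whether this helps depends on the real data volume and the execution plan, so it would need to be checked against the actual database.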

    5 answers  |  until 16 years ago
        1
  •  0
  •   armitage    16 years ago

    Sorry, but I think you have a data problem. Looking at the sample data provided, user 2 spends 12 minutes in CourseID 1 and 2 minutes in CourseID 2.

    Are you sure you provided the correct data?

        2
  •  0
  •   Ed Harper    16 years ago

    This is as close as I can get. It fails for user ID 4.

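    As I said in my comment, requestdate is sometimes the start of a course and sometimes the end of one, and I can't see a simple general rule for deriving which role it plays in a given row.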

    DECLARE @courseid INT;
    SET @courseid = 1;
    
    WITH orderCTE
    AS
    (
            SELECT *
    
                   ,ROW_NUMBER() OVER (PARTITION BY sessionid
                                       ORDER BY id
                                      ) AS rn
            FROM PageLogSample
            --order by rn
    )
    ,startendCTE
    AS
    (
            SELECT  CASE WHEN start1.rn = 1
                         THEN start1.courseid
                         ELSE end1.courseid
                     END courseid
                    ,start1.sessionid
                    ,start1.userid
                    ,DATEDIFF(mi,start1.requestdate,end1.requestdate) duration
            FROM orderCTE AS start1
            JOIN orderCTE AS end1
            ON end1.rn = start1.rn + 1
            AND end1.sessionid = start1.sessionid
    )
    SELECT courseid
           ,COUNT(1) sessionCount
           ,userid
           ,SUM(duration) totalDuration
    FROM startendCTE
    WHERE courseid = @courseid
    GROUP BY courseid
             ,userid;
    
        3
  •  0
  •   WesleyJohnson    16 years ago

    This is fairly messy, but it seems to work for CourseID 1. I didn't try any other courses, so you may want to test that! :D

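    The basic premise is that I get the duration between the first and last request for the target course ID, then subtract the duration of any sessions that do not belong to the specified course ID but whose request times fall within the min and max request times of the target course ID. I hope that makes sense.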

    The query could definitely be cleaned up, possibly with a CTE or something. Interesting problem. :)

    DECLARE @courseid INT;
    SET @courseid = 1;
    
    SELECT 
        TargetCourse.UserID, 
        COUNT(Distinct(TargetCourse.SessionID)) as SessionCount,
        SUM(TargetCourse.Duration - Coalesce(OtherCourses.Duration,0)) as Duration
    FROM
    (
        SELECT 
            TargetCourse.UserID, TargetCourse.SessionID, 
            MIN(TargetCourse.RequestDate) FirstRequest, MAX(TargetCourse.RequestDate) LastRequest, 
            DATEDIFF(MINUTE, MIN(TargetCourse.RequestDate), MAX(TargetCourse.RequestDate)) AS duration
        FROM 
            PageLogSample TargetCourse
        WHERE
            TargetCourse.CourseID = @courseid
        GROUP BY
            TargetCourse.UserID, TargetCourse.SessionID     
    ) as TargetCourse
    LEFT OUTER JOIN
    (
        SELECT 
            OtherCourses.UserID, OtherCourses.SessionID, 
            MIN(OtherCourses.RequestDate) AS FirstRequest, MAX(OtherCourses.RequestDate) AS LastRequest, 
            DATEDIFF(MINUTE, MIN(OtherCourses.RequestDate), MAX(OtherCourses.RequestDate)) AS duration
        FROM 
            PageLogSample OtherCourses
        WHERE
            OtherCourses.CourseID <> @courseid AND
            OtherCourses.RequestDate between
                (Select MIN(RequestDate) From PageLogSample T Where T.UserID = OtherCourses.UserID and T.CourseID = @courseid) AND
                (Select MAX(RequestDate) From PageLogSample T Where T.UserID = OtherCourses.UserID and T.CourseID = @courseid)
        GROUP BY
            OtherCourses.UserID, OtherCourses.SessionID 
    ) as OtherCourses ON
    OtherCourses.UserID = TargetCourse.UserID AND
    OtherCourses.FirstRequest BETWEEN TargetCourse.FirstRequest and TargetCourse.LastRequest
    Group By TargetCourse.UserID
    
        4
  •  0
  •   Kristof Neirynck    16 years ago

    Some more sample data, along with a logical assumption about the time each user spent in each course.

    INSERT INTO PageLogSample (userid, courseid, sessionid, requestdate)
    -- [0, 10] = 10 minutes
              SELECT 1, 1, 1, '00:00:00'
    UNION ALL SELECT 1, 1, 1, '00:10:00'
    -- [0, 3] = 3 minutes
    -- there is no way to know how long the user was on that last page
    UNION ALL SELECT 2, 1, 2, '00:00:00'
    UNION ALL SELECT 2, 2, 2, '00:03:00'
    UNION ALL SELECT 2, 2, 2, '00:05:00'
    UNION ALL SELECT 2, 1, 2, '00:12:00'
    -- [0, 3] + [12, 15] = 6 minutes
    -- the [5, 12] part was spent on a page of course 2
    UNION ALL SELECT 3, 1, 3, '00:00:00'
    UNION ALL SELECT 3, 2, 3, '00:03:00'
    UNION ALL SELECT 3, 2, 3, '00:05:00'
    UNION ALL SELECT 3, 1, 3, '00:12:00'
    UNION ALL SELECT 3, 2, 3, '00:15:00'
    -- [1, 3] + [13, 15] = 4 minutes
    UNION ALL SELECT 4, 2, 4, '00:00:00'
    UNION ALL SELECT 4, 1, 4, '00:01:00'
    UNION ALL SELECT 4, 2, 4, '00:03:00'
    UNION ALL SELECT 4, 2, 4, '00:05:00'
    UNION ALL SELECT 4, 1, 4, '00:13:00'
    UNION ALL SELECT 4, 2, 4, '00:15:00'
    -- [0, 5] + [10, 15] = 10 minutes
    UNION ALL SELECT 5, 1, 5, '00:00:00'
    UNION ALL SELECT 5, 1, 5, '00:05:00'
    UNION ALL SELECT 5, 1, 6, '00:10:00'
    UNION ALL SELECT 5, 1, 6, '00:15:00'
    -- [0, 10] = 10 minutes (ignoring everything in between)
    UNION ALL SELECT 6, 1, 7, '00:00:00'
    UNION ALL SELECT 6, 1, 7, '00:03:00'
    UNION ALL SELECT 6, 1, 7, '00:05:00'
    UNION ALL SELECT 6, 1, 7, '00:07:00'
    UNION ALL SELECT 6, 1, 7, '00:10:00'
    -- [0, 5] + [7, 11] = 9 minutes
    UNION ALL SELECT 7, 1, 8, '00:00:00'
    UNION ALL SELECT 7, 1, 8, '00:03:00'
    UNION ALL SELECT 7, 2, 8, '00:05:00'
    UNION ALL SELECT 7, 2, 8, '00:06:00'
    UNION ALL SELECT 7, 1, 8, '00:07:00'
    UNION ALL SELECT 7, 1, 8, '00:11:00'
    -- [0, 1] + [2, 4] + [5, 7] + [8, 13] = 10
    UNION ALL SELECT 8, 1, 9, '00:00:00'
    UNION ALL SELECT 8, 2, 9, '00:01:00'
    UNION ALL SELECT 8, 1, 9, '00:02:00'
    UNION ALL SELECT 8, 1, 9, '00:03:00'
    UNION ALL SELECT 8, 2, 9, '00:04:00'
    UNION ALL SELECT 8, 1, 9, '00:05:00'
    UNION ALL SELECT 8, 1, 9, '00:06:00'
    UNION ALL SELECT 8, 2, 9, '00:07:00'
    UNION ALL SELECT 8, 1, 9, '00:08:00'
    UNION ALL SELECT 8, 1, 9, '00:13:00'
    -- there is nothing we can say about either of these requests
    -- 0 minutes
    UNION ALL SELECT 9, 1, 10, '00:10:00'
    UNION ALL SELECT 9, 1, 11, '00:20:00'
    ;
    

    Now we query the data like this:

    DECLARE @courseid INT;
    SET @courseid = 1;
    
    WITH numberedcte (rn, userid, courseid, sessionid, requestdate)
    AS (
        SELECT ROW_NUMBER() OVER (PARTITION BY sessionid, userid ORDER BY id)
        , userid
        , courseid
        , sessionid
        , requestdate
        FROM PageLogSample
    )
    , valuecte (value, userid, courseid, sessionid)
    AS (
        SELECT CASE
            --alone in session
            WHEN previousrequest.courseid IS NULL
                AND nextrequest.courseid  IS NULL
                THEN 0
            --alone
            WHEN ( previousrequest.courseid IS NULL
                OR previousrequest.courseid <> numberedcte.courseid
                )
                AND nextrequest.courseid <> numberedcte.courseid
                THEN DATEDIFF(MINUTE, numberedcte.requestdate, nextrequest.requestdate)
            --between
            WHEN previousrequest.courseid = nextrequest.courseid
                THEN 0
            --begin
            WHEN previousrequest.courseid IS NULL
                OR nextrequest.courseid = numberedcte.courseid
                THEN -1 * DATEPART(MINUTE, numberedcte.requestdate)
            --ignored (end with no next request)
            WHEN nextrequest.courseid IS NULL
                AND previousrequest.courseid <> numberedcte.courseid
                THEN 0
            --end
            WHEN nextrequest.courseid IS NULL
                OR previousrequest.courseid = numberedcte.courseid
                THEN DATEPART(MINUTE, ISNULL(nextrequest.requestdate, numberedcte.requestdate))
            --impossible?
            ELSE 0
        END
        , numberedcte.userid
        , numberedcte.courseid
        , numberedcte.sessionid
        FROM numberedcte
        LEFT JOIN numberedcte previousrequest
            ON previousrequest.userid = numberedcte.userid
            AND previousrequest.sessionid = numberedcte.sessionid
            AND previousrequest.rn = numberedcte.rn - 1
        LEFT JOIN numberedcte nextrequest
            ON nextrequest.userid = numberedcte.userid
            AND nextrequest.sessionid = numberedcte.sessionid
            AND nextrequest.rn = numberedcte.rn + 1
        WHERE numberedcte.courseid = @courseid
    )
    SELECT userid
    , courseid
    , COUNT(DISTINCT sessionid) AS sessioncount
    , SUM(value) AS duration
    FROM valuecte
    GROUP BY userid
    , courseid
    ORDER BY userid
    ;
    

    This is the result I get, and I am happy with it. Note how the session count for user 9 stays correct.

    userid  courseid  sessioncount  duration
    1       1         1             10
    2       1         1              3
    3       1         1              6
    4       1         1              4
    5       1         2             10
    6       1         1             10
    7       1         1              9
    8       1         1             10
    9       1         2              0
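    
    The same rule (the time between a request and the next request in the same session counts toward the course of the earlier request, and the last request of a session counts for nothing) can also be sketched with the LEAD window function. This is an alternative formulation, not the query used above, and it assumes SQL Server 2012 or later, where LEAD is available; the CTE name gaps and the column name minutes are chosen here for illustration.

    ```sql
    DECLARE @courseid INT;
    SET @courseid = 1;

    WITH gaps AS (
        SELECT userid
        , courseid
        , sessionid
        -- minutes until the next request in the same session; NULL for the last request
        , DATEDIFF(MINUTE, requestdate,
              LEAD(requestdate) OVER (PARTITION BY userid, sessionid
                                      ORDER BY requestdate, id)) AS minutes
        FROM PageLogSample
    )
    SELECT userid
    , courseid
    , COUNT(DISTINCT sessionid) AS sessioncount
    , ISNULL(SUM(minutes), 0) AS duration -- sessions with a single request contribute 0
    FROM gaps
    WHERE courseid = @courseid
    GROUP BY userid
    , courseid
    ORDER BY userid;
    ```

    On the sample data in this answer it should reproduce the durations and session counts listed above, and because it uses DATEDIFF on full timestamps it does not depend on all requests falling within the same hour.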
    
        5
  •  -1
  •   Erwin Smout    16 years ago

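    "The data is correct, but it is hard to derive the relevant meaning from it."

    I have to answer that this is a contradiction in terms. Data whose meaning you do not know is not data.

    As for your original question:

    What you need is a DBMS that offers proper support for interval types. No SQL system is in that league. Apart from a few tutorial systems, my own DBMS (no need to push it further here, hence no link) is the only one I know of that offers the support this kind of problem really requires.

    If you are interested, search Google for "interval types", "packed normal form", and "temporal data", and you will eventually bump into it.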