Help identifying forum spammers with a SQL query?

  •  6
  • George Johnston  · Tech Community  · 15 years ago

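    I'd like a simple query that I can run against a database to find users who post to our forum faster than a given time threshold. Suppose I have the following table structure: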

    ThreadId | UserId | PostAuthor | PostDate |
    1          1000     Spammer      2010-11-14 02:52:50.093
    2          1000     Spammer      2010-11-14 02:53:06.893
    3          1000     Spammer      2010-11-14 02:53:22.130
    4          1000     Spammer      2010-11-14 02:53:37.073
    5          2000     RealUser     2010-11-14 02:53:52.383
    6          1000     Spammer      2010-11-14 02:54:07.430 
    

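    I'd like to set a threshold: for example, if 3 posts from the same user appear within a 1-minute window, that user is probably spamming the forum. The query should then return the user 'Spammer', together with the number of posts made within the allotted time.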

    In the example above, Spammer posted 4 messages within a 1-minute window, so the query result might look like this:

    UserId | PostAuthor | PostCount | DateStart               | DateEnd
    1000     Spammer      4           2010-11-14 02:52:50.093   2010-11-14 02:53:37.073
    

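    Suggestions on the format of the returned data are welcome. The format isn't that important to me; what matters is accurately identifying forum abusers.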

    5 answers  |  as of 15 years ago
        1
  •  1
  •   Stu    15 years ago

    This doesn't return everything you want in the output, but it's a start:

    (Reworded: give me every post that is followed by 2 or more other posts from the same user within one minute)

    Select 
      Spammer = PostAuthor,
      NumberOfPosts = (Select Count(*) 
                       From Posts As AllPosts 
                       Where AllPosts.UserID = Posts.UserID)
    From Posts
    Where 2 <= (Select Count(*)
                From Posts As OtherPosts
                Where OtherPosts.UserID = Posts.UserID
                  And OtherPosts.PostDate > Posts.PostDate
                  And OtherPosts.PostDate < DateAdd(Minute, 1, Posts.PostDate))
    
        2
  •  1
  •   Axn    15 years ago

    A self-join solution:

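    -- Each row is a window starting at T1.PostDate; the join includes the starting post itself.
    Select T1.UserId, T1.PostAuthor, T1.PostDate As DateStart, Max(T2.PostDate) As DateEnd, Count(*) As PostCount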
    from
      Posts T1 INNER JOIN Posts T2 
      ON T1.UserId = T2.UserId and 
         T2.PostDate between T1.PostDate and dateadd(minute, 1, T1.PostDate)
    group by T1.UserId, T1.PostAuthor, T1.PostDate
    having count(*) >= 3
    
        3
  •  0
  •   Sadhir    15 years ago

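    I tried my hand at this (I think it gives almost the same result as Stu's, minus the count). It identifies users who made 3 posts within 1 minute, returning one row per run of 3 consecutive posts (so if there are 5 posts, the user is repeated 3 times).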

    DECLARE @threshold INT;
    SET @threshold = 3;
    
    ;WITH postCTE as
    (
    SELECT 
      Userid,
      PostAuthor,
      PostDate,
      RowNumber = ROW_NUMBER() OVER (PARTITION by UserId ORDER BY PostDate ASC)
    FROM Posts
    )
    SELECT 
      p1.UserId, 
      p1.PostAuthor, 
      p1.PostDate AS StartTime, 
      p2.PostDate AS EndTime
    FROM postCTE p1
       JOIN postCTE p2 
         ON p1.UserId = p2.UserId 
         AND p1.Rownumber = p2.RowNumber - (@threshold - 1)
    WHERE DATEDIFF(MINUTE,p1.PostDate,p2.PostDate) <= 1
    

    It returns the following result set:

    UserId   PostAuthor  StartTime                EndTime
    1000    Spammer    2010-11-14 02:52:50.093  2010-11-14 02:53:22.130
    1000    Spammer    2010-11-14 02:53:06.893  2010-11-14 02:53:37.073
    1000    Spammer    2010-11-14 02:53:22.130  2010-11-14 02:54:07.430
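
    If one row per user is preferable to one row per run, the result above could be collapsed with a GROUP BY. This is only a sketch under assumptions: the table variable @Posts stands in for the real table, the output names BurstCount, FirstBurstStart and LastBurstEnd are made up for illustration, and the comparison is done in seconds rather than minutes.

```sql
-- Sample rows from the question, loaded into a hypothetical table variable.
DECLARE @Posts TABLE (
    ThreadId   INT,
    UserId     INT,
    PostAuthor VARCHAR(50),
    PostDate   DATETIME
);

INSERT INTO @Posts (ThreadId, UserId, PostAuthor, PostDate) VALUES
    (1, 1000, 'Spammer',  '2010-11-14T02:52:50.093'),
    (2, 1000, 'Spammer',  '2010-11-14T02:53:06.893'),
    (3, 1000, 'Spammer',  '2010-11-14T02:53:22.130'),
    (4, 1000, 'Spammer',  '2010-11-14T02:53:37.073'),
    (5, 2000, 'RealUser', '2010-11-14T02:53:52.383'),
    (6, 1000, 'Spammer',  '2010-11-14T02:54:07.430');

DECLARE @threshold INT;
SET @threshold = 3;

;WITH postCTE AS
(
    -- Number each user's posts in chronological order.
    SELECT
        UserId,
        PostAuthor,
        PostDate,
        ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY PostDate ASC) AS RowNumber
    FROM @Posts
),
bursts AS
(
    -- One row per run of @threshold consecutive posts spanning at most 60 seconds.
    SELECT
        p1.UserId,
        p1.PostAuthor,
        p1.PostDate AS StartTime,
        p2.PostDate AS EndTime
    FROM postCTE p1
    JOIN postCTE p2
      ON p1.UserId = p2.UserId
     AND p1.RowNumber = p2.RowNumber - (@threshold - 1)
    WHERE DATEDIFF(SECOND, p1.PostDate, p2.PostDate) <= 60
)
-- Collapse the overlapping runs into a single row per user.
SELECT
    UserId,
    PostAuthor,
    COUNT(*)       AS BurstCount,
    MIN(StartTime) AS FirstBurstStart,
    MAX(EndTime)   AS LastBurstEnd
FROM bursts
GROUP BY UserId, PostAuthor;
```

    Against the six sample rows this should return a single row for user 1000, with a BurstCount of 3 for the three overlapping runs listed above.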
    
        4
  •  0
  •   RC_Cleland    15 years ago

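    I believe Sadhir is on the right track. I made a few changes to the script. The first correction concerns the use of "minute" as the DATEDIFF unit: using minute will not correctly return the four records in George's example, so I changed "minute" to "second". I also formatted the output to show the number of posts recorded within the minute, computed from the difference between row numbers in the CTE. Although George didn't ask for it, I added a parameter to control how many days back to look in the table, since I don't think anyone wants to scan the whole table every time.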

    DECLARE @threshold INT; 
    SET @threshold = 3; 
    DECLARE @lookbackdays int;
    SET @lookbackdays = 2;
    
    ;WITH postCTE as 
    ( 
    SELECT  
        Userid, 
        PostAuthor, 
        PostDate, 
        RowNumber = ROW_NUMBER() OVER (ORDER BY UserId,PostDate ASC) 
    FROM 
        Post2Forum 
    WHERE 
        PostDate > GETDATE() - @lookbackdays
    ) 
    SELECT  
        p1.PostAuthor AS [PostAuthor],  
        p2.RowNumber - p1.RowNumber +1 AS [PostCount],
        p1.UserId,  
        p1.PostDate AS [DateStart],  
        p2.PostDate AS [DateEnd] 
    FROM 
        postCTE p1 
    INNER JOIN 
        postCTE p2  
        ON p1.UserId = p2.UserId  
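        -- Note: an offset of @threshold pairs rows that are @threshold apart,
        -- so each match is a run of @threshold + 1 posts (4 when @threshold = 3).
        AND p1.Rownumber = p2.RowNumber - (@threshold ) 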
    WHERE 
        DATEDIFF(second,p1.PostDate,p2.PostDate) <= 60
    

    The query results from my test:

    PostAuthor  PostCount  UserId  DateStart                DateEnd
    Spammer     4          1000    2010-11-14 02:52:50.093  2010-11-14 02:53:37.073
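
    The DATEDIFF point can be checked in isolation. DATEDIFF returns the number of datepart boundaries crossed between the two values rather than the elapsed time, so a filter such as DATEDIFF(MINUTE, ...) <= 1 can admit gaps of almost two minutes. A minimal sketch, using made-up timestamps (not rows from the question) chosen only to show the effect:

```sql
-- Hypothetical timestamps: 118 seconds apart, but only one minute boundary in between.
DECLARE @startTime DATETIME;
DECLARE @endTime   DATETIME;
SET @startTime = '2010-11-14T02:52:01';
SET @endTime   = '2010-11-14T02:53:59';

SELECT
    DATEDIFF(MINUTE, @startTime, @endTime) AS MinuteDiff,  -- 1: only the 02:53:00 boundary is crossed
    DATEDIFF(SECOND, @startTime, @endTime) AS SecondDiff;  -- 118: the gap is nearly two minutes
```

    Comparing in seconds, as the script above does, keeps the window close to an actual 60 seconds.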
    
        5
  •  -1
  •   Sean    15 years ago

    Not exactly what you're after, but it will more or less do the job...

    SELECT 
      UserId, 
      PostAuthor, 
      COUNT(*) AS [PostCount],
      YEAR(PostDate), 
      MONTH(PostDate), 
      DAY(PostDate), 
      DATEPART(hh, PostDate), 
      DATEPART(mi, PostDate)
    FROM LogTable
    GROUP BY 
      UserId, 
      PostAuthor, 
      YEAR(PostDate), 
      MONTH(PostDate), 
      DAY(PostDate), 
      DATEPART(hh, PostDate), 
      DATEPART(mi, PostDate)
    HAVING COUNT(*) >= 3
    ORDER BY 
      YEAR(PostDate), 
      MONTH(PostDate), 
      DAY(PostDate), 
      DATEPART(hh, PostDate), 
      DATEPART(mi, PostDate)
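
    One caveat with grouping on calendar minutes: a burst that straddles a minute boundary is split across two groups and can slip under the threshold. The sketch below uses three made-up posts (the table variable @LogSample and its rows are assumptions for illustration, not data from the question) that fall within three seconds of each other.

```sql
-- Hypothetical sample: three posts within 3 seconds, straddling the 02:53:00 boundary.
DECLARE @LogSample TABLE (
    UserId     INT,
    PostAuthor VARCHAR(50),
    PostDate   DATETIME
);

INSERT INTO @LogSample (UserId, PostAuthor, PostDate) VALUES
    (1000, 'Spammer', '2010-11-14T02:52:58'),
    (1000, 'Spammer', '2010-11-14T02:52:59'),
    (1000, 'Spammer', '2010-11-14T02:53:01');

-- Bucket by calendar minute: the 02:52 bucket holds 2 posts and the 02:53 bucket holds 1,
-- so HAVING COUNT(*) >= 3 returns no rows for this burst.
SELECT
    UserId,
    PostAuthor,
    DATEADD(MINUTE, DATEDIFF(MINUTE, 0, PostDate), 0) AS MinuteBucket,
    COUNT(*) AS PostCount
FROM @LogSample
GROUP BY
    UserId,
    PostAuthor,
    DATEADD(MINUTE, DATEDIFF(MINUTE, 0, PostDate), 0)
HAVING COUNT(*) >= 3;
```

    A window anchored on each post, as in the self-join and ROW_NUMBER answers, would count all three of these posts together.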