代码之家  ›  专栏  ›  技术社区  ›  mfcss

SQL:使用年/月/日分区查询某个时间段的数据

  •  0
  • mfcss  · 技术社区  · 2 年前

    我有一个由镶木地板文件组成的AWS S3数据湖,其结构如下:

    s3://bucket/device/table_x/year=2000/month=01/day=02/xyz.parquet
    

    我的目标是使用AWS Athena来查询数据,以便在Grafana仪表板中显示。我面临的挑战是,为了在任何时间段创建动态面板,同时利用我的分区,我需要找到一种方法,将我的数据限制在 WHERE 部分,但要以跨年、跨月、跨天的方式进行操作,则不必根据查询构建SQL语句。

    我现在最好的建议是下面的查询,它应该有效——但它很复杂。对于这样的陈述,是否有推荐的最佳实践?

    SELECT
        Count(a1) as AVG_a1                 
    FROM
        tbl_11111111_a
    WHERE
        (
            -- Same year, same month
            (year = 'START_YEAR' AND month = 'START_MONTH' AND day BETWEEN 'START_DAY' AND 'END_DAY')
            OR
            -- Same year, different months
            (year = 'START_YEAR' AND month = 'START_MONTH' AND day >= 'START_DAY')
            OR
            (year = 'START_YEAR' AND month > 'START_MONTH' AND month < 'END_MONTH' AND day BETWEEN '01' AND '31')
            OR
            (year = 'START_YEAR' AND month = 'END_MONTH' AND day <= 'END_DAY')
            OR
            -- Different years
            (year > 'START_YEAR' AND year < 'END_YEAR')
            OR
            (year = 'END_YEAR' AND month < 'END_MONTH' AND day BETWEEN '01' AND '31')
            OR
            (year = 'END_YEAR' AND month = 'END_MONTH' AND day <= 'END_DAY')
        )
        AND
        t BETWEEN TIMESTAMP 'START_YEAR-START_MONTH-START_DAY 00:00:00' AND TIMESTAMP 'END_YEAR-END_MONTH-END_DAY 00:00:00'
    
    1 回复  |  直到 2 年前
        1
  •  2
  •   mfcss    2 年前

    我最终使用了中描述的方法 this AWS example this blog article

    具体来说,我按照最初列出的方式设置了S3结构,只是去掉了hive表示法(这可以保留,但我认为在这种情况下,投影日期格式会变得更加混乱):

    s3://bucket/device/table_x/2000/01/02/xyz.parquet
    

    通过此操作,我将表属性设置为使用分区投影,设置如下:

     "projection.enabled" = "true",
     "projection.date_created.type" = "date",
     "projection.date_created.format" = "yyyy/MM/dd",
     "projection.date_created.range" = "2000/01/01,NOW",
     "projection.date_created.interval" = "1",
     "projection.date_created.interval.unit" = "DAYS",
     "storage.location.template" = "s3://bucket/device/table_x/${date_created}/"
    

    在我的表模式中,我将JSON更新为以下内容:

    [
      {
        "Name": "t",
        "Type": "timestamp",
        "Comment": ""
      },
      {
        "Name": "a1",
        "Type": "double",
        "Comment": ""
      },
      {
        "Name": "a2",
        "Type": "double",
        "Comment": ""
      },
      {
        "Name": "date_created",
        "Type": "string",
        "Comment": "",
        "PartitionKey": "Partition (0)"
      }
    ]
    

    在进行了这些更新之后,我现在可以使用pilcorow提出的更简单的分区结构来查询数据:

    SELECT
        Count(a1) as AVG_a1                 
    FROM
        tbl_11111111_a
    WHERE
        date_created BETWEEN '2000/01/01' AND '2000/01/02'
    

    这里的重点是,我能够保留子文件夹嵌套日期结构(我更喜欢用于非Athena数据采集目的),但仍然为我的Athena查询保留简单的SQL查询WHERE语句,同时使用高效的分区投影。

    还要注意的是,我相信pilcorow在分析我最初的年/月/日SQL查询效率低下时是正确的,因为我发现当切换到与单日5分钟间隔相关的查询的新结构时,速度提高了66%。

        2
  •  1
  •   pilcrow    2 年前

    我看到两种选择。

    首先理想地, 重新分配 使用单个“YYYYMMDD”属性分区,而不是嵌套分区。那是 a good practice for more natural querying ,例如,“…WHERE ymd_partition BETWEEN'START_ymd'AND'END_ymd'”,没有所有这些布尔扭曲。

    否则,继续目前的方法,它有一个可行的想法,但为了 效率

    我们希望根据输入范围应用三种场景中的一种(相同年/月、相同年、不同年);然而,正如所写的,我们根据所有记录评估每个场景的每个条件,而不考虑输入范围。(输入范围为 [2023-01-31, 2023-02-1] 例如,将检查 从一月份开始的所有镶木地板锉 由于倒数第二个OR条件,因此意味着我们依赖 t BETWEEN... 作为昂贵的后盾。)

    解决方案不会很好。

    WHERE
    
    -- apply only when input range has same year, same month
    -- 
    (start_yr = end_yr AND start_mo = end_mo
      AND "year" = start_yr
      AND "month" = start_mo
      AND "day" BETWEEN start_day AND end_day)
    
    OR
    -- apply only when input range same year, different months
    --
    (start_yr = end_yr AND start_mo != end_mo
      AND "year" = start_yr
      AND ( ("month" = start_mo AND "day" >= start_day)
              OR
            -- "day BETWEEN 1 AND 31" is superfluous here
            ("month" > start_mo AND "month" < end_mo)
              OR
            ("month" = end_mo AND "day" <= end_day) ) )
    
    OR
    -- apply only when input range has different years
    --
    (start_yr != end_yr
      AND
    ( ("year" = start_yr AND ( ("month" = start_mo AND "day" >= start_day)
                                  OR
                               ("month" > start_mo) ) )
         OR
      ("year" > start_yr AND "year" < end_yr)
         OR
      ("year" = end_yr AND ( ("month" < end_mo )
                                 OR
                             ("month" = end_mo AND "day" <= end_day) ) ) ) )
    

    我可能漏掉了一个括号。

    您可以添加 t BETWEEN 如果需要,逻辑会重新启动。

        3
  •  0
  •   Thom A    2 年前

    我现在最好的建议是下面的查询,它应该有效——但它很复杂。对于这样的陈述,是否有推荐的最佳实践?

    我想你也可以使用 CASE 表达式将数据筛选到由 起始年份 , START_MONTH , 开始日期 , 年末 , END_MONTH 结束日期 参数。

    SELECT
        Count(a1) as AVG_a1                 
    FROM
        tbl_11111111_a
    WHERE
        CASE
            WHEN year = START_YEAR AND month = START_MONTH AND day BETWEEN START_DAY AND END_DAY THEN 1
            WHEN year = START_YEAR AND month = START_MONTH AND day >= START_DAY THEN 1
            WHEN year = START_YEAR AND month > START_MONTH AND month < END_MONTH AND day BETWEEN '01' AND '31' THEN 1
            WHEN year = START_YEAR AND month = END_MONTH AND day <= END_DAY THEN 1
            WHEN year > START_YEAR AND year < END_YEAR THEN 1
            WHEN year = END_YEAR AND month < END_MONTH AND day BETWEEN '01' AND '31' THEN 1
            WHEN year = END_YEAR AND month = END_MONTH AND day <= END_DAY THEN 1
            ELSE 0
        END = 1
        AND
        t BETWEEN TIMESTAMP CONCAT(START_YEAR, '-', START_MONTH, '-', START_DAY, ' 00:00:00') AND TIMESTAMP CONCAT(END_YEAR, '-', END_MONTH, '-', END_DAY, ' 00:00:00')