
Partitioning an Athena table from a Glue CloudFormation template

ignorance · asked 7 years ago

    Using AWS::Glue::Table, you can set up an Athena table as described here. Athena supports partitioning data based on the folder structure in S3, and I would like to partition my Athena table from my Glue template.
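
    For context, the sample dataset used below is laid out in Hive-style partition folders, one folder per dt value. A quick listing (sketched here; only a few entries shown) looks like:

    $ aws s3 ls s3://elasticmapreduce/samples/hive-ads/tables/impressions/
                               PRE dt=2009-04-12-13-00/
                               PRE dt=2009-04-12-13-05/
                               ...

    Athena maps each dt=... folder to one partition of the table.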

    From the AWS Glue Table TableInput, it looks like I can use PartitionKeys to partition my data, but when I try the template below, Athena fails and cannot fetch any data.

    Resources:
      ...
    
      MyGlueTable:
        Type: AWS::Glue::Table
        Properties:
          DatabaseName: !Ref MyGlueDatabase
          CatalogId: !Ref AWS::AccountId
          TableInput:
            Name: my-glue-table
            Parameters: { "classification" : "json" }
            PartitionKeys:
              - {Name: dt, Type: string}
            StorageDescriptor:
              Location: "s3://elasticmapreduce/samples/hive-ads/tables/impressions/"
              InputFormat: "org.apache.hadoop.mapred.TextInputFormat"
              OutputFormat: "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
              SerdeInfo:
                Parameters: { "separatorChar" : "," }
                SerializationLibrary: "org.apache.hive.hcatalog.data.JsonSerDe"
              StoredAsSubDirectories: false
              Columns:
                - {Name: requestBeginTime, Type: string}
                - {Name: adId, Type: string}
                - {Name: impressionId, Type: string}
                - {Name: referrer, Type: string}
                - {Name: userAgent, Type: string}
                - {Name: userCookie, Type: string}
                - {Name: ip, Type: string}
                - {Name: number, Type: string}
                - {Name: processId, Type: string}
                - {Name: browserCookie, Type: string}
                - {Name: requestEndTime, Type: string}
                - {Name: timers, Type: "struct<modellookup:string,requesttime:string>"}
                - {Name: threadId, Type: string}
                - {Name: hostname, Type: string}
                - {Name: sessionId, Type: string}
    

    How can I partition my data in AWS Glue?

    1 Answer

    ignorance · answered 7 years ago

    Got it! This was quite painful, because I had to run a Glue Crawler, which created the table with its partitions correctly, and then use the CLI to extract the correct template parameters.
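
    The post doesn't show the extraction step; presumably it was something along the lines of this Glue CLI call, which returns the table definition the crawler produced (database and table names as in this example):

    aws glue get-table --database-name my_glue_database --name my_glue_table

    Translated back into CloudFormation properties, here is the template: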

    AWSTemplateFormatVersion: 2010-09-09
    Description: A partitioned Glue Table
    
    Resources:
      MyGlueDatabase:
        Type: AWS::Glue::Database
        Properties:
          DatabaseInput:
            Name: my_glue_database
            Description: "Glue beats tape"
          CatalogId: !Ref AWS::AccountId
    
      MyGlueTable:
        Type: AWS::Glue::Table
        Properties:
          DatabaseName: !Ref MyGlueDatabase
          CatalogId: !Ref AWS::AccountId
          TableInput:
            Name: my_glue_table
            TableType: EXTERNAL_TABLE
            Parameters:
              CrawlerSchemaDeserializerVersion: "1.0"
              CrawlerSchemaSerializerVersion: "1.0"
              classification: json
              compressionType: none
              typeOfData: file
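            # dt is the partition column; its values come from the dt=... folder names under Location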
            PartitionKeys:
              - {Name: dt, Type: string}
            StorageDescriptor:
              BucketColumns: []
              Columns:
              - {Name: number, Type: string}
              - {Name: referrer, Type: string}
              - {Name: processid, Type: string}
              - {Name: adid, Type: string}
              - {Name: browsercookie, Type: string}
              - {Name: usercookie, Type: string}
              - {Name: requestendtime, Type: string}
              - {Name: impressionid, Type: string}
              - {Name: useragent, Type: string}
              - {Name: timers, Type: 'struct<modelLookup:string,requestTime:string>'}
              - {Name: threadid, Type: string}
              - {Name: ip, Type: string}
              - {Name: modelid, Type: string}
              - {Name: hostname, Type: string}
              - {Name: sessionid, Type: string}
              - {Name: requestbegintime, Type: string}
              Compressed: false
              InputFormat: org.apache.hadoop.mapred.TextInputFormat
              Location: s3://elasticmapreduce/samples/hive-ads/tables/impressions/
              NumberOfBuckets: -1
              OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              Parameters: {CrawlerSchemaDeserializerVersion: '1.0', CrawlerSchemaSerializerVersion: '1.0',
                UPDATED_BY_CRAWLER: test, averageRecordSize: '644', classification: json,
                compressionType: none, objectCount: '241', recordCount: '1000109', sizeKey: '648533598',
                typeOfData: file}
              SerdeInfo:
                Parameters: {paths: 'adId,browserCookie,hostname,impressionId,ip,modelId,number,processId,referrer,requestBeginTime,requestEndTime,sessionId,threadId,timers,userAgent,userCookie'}
                SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
              SortColumns: []
              StoredAsSubDirectories: false
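
    To deploy it, a minimal command would be something like the following (the template file name and stack name are arbitrary):

    aws cloudformation deploy --template-file glue-table.yaml --stack-name my-glue-table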
    

    CloudFormation will deploy the table, and then you need to run:

    MSCK REPAIR TABLE my_glue_table;
    

    This adds all of the partitions, and you will see them in the output like so:

    Repair: Added partition to metastore my_glue_table:dt=2009-04-12-13-00
    Repair: Added partition to metastore my_glue_table:dt=2009-04-12-13-05
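
    As an alternative to MSCK REPAIR TABLE, which scans the entire Location, partitions can also be added individually with standard Athena DDL, for example:

    ALTER TABLE my_glue_table ADD PARTITION (dt = '2009-04-12-13-00')
    LOCATION 's3://elasticmapreduce/samples/hive-ads/tables/impressions/dt=2009-04-12-13-00/';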
    

    Then you can run SQL against a partition like this:

    SELECT * FROM "my_glue_database"."my_glue_table" WHERE dt = '2009-04-14-13-00' LIMIT 10;
    1   7663    cartoonnetwork.com  1178    SxRBJCmJBCLcfTS545t6qD1M8L64SC  nsdfvfvger  3VCLfFfF75BDgHgDoowHegOpkCivMJ  1239714024000   RTM6Vtrc1O3KX2FlUghUSiAQHiix8F  Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; GTB6; .NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0) {modellookup=0.3538, requesttime=0.7532}    15  37.215.88.35    bxxiuxduad  ec2-50-32-48-14.amazon.com  BIBIlA7dgXc2eWekUJ6hSXa7p6dQEx  1239714024000   2009-04-14-13-00
    2   17646   coursera.org    1255    Fskm4W6JKX6vf7UMaW55KObTJCtm1E  xftjotkexc  jH6DRWtkeH3tVg6c4mcLW36UW3LvqX  1239714027000   uQqO1fNoeM8KdesiVg86o4iK7FkqLt  Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322)  {modellookup=0.2986, requesttime=0.9616}    21  37.218.101.204  bxxiuxduad  ec2-50-32-48-14.amazon.com  OjgTQWOqHJopoWf9LpJ4We1UE7uJao  1239714026000   2009-04-14-13-00