Background

If you use Amazon Athena to query data stored in Amazon S3 (for example, log files, or large datasets used for analysis), you may find yourself wanting to version-control table definitions using AWS CloudFormation. Since Athena uses the AWS Glue data catalog underneath, you can create Glue tables to populate your Athena databases.

(You should note that, with a bit of effort, you can likely adapt this information to e.g. Terraform's Glue table resource.)

Simple example

Using the CloudFront Logs example table, here's a resource definition suitable for a CloudFormation YAML template:

CloudfrontLogsTable:
  Type: "AWS::Glue::Table"
  Properties:
    CatalogId: !Ref "AWS::AccountId"
    DatabaseName: default
    TableInput:
      Name: cloudfront_logs
      TableType: EXTERNAL_TABLE
      StorageDescriptor:
        Location: "s3://athena-examples-us-east-1/cloudfront/plaintext/"
        StoredAsSubDirectories: true

        InputFormat: org.apache.hadoop.mapred.TextInputFormat
        OutputFormat: IgnoreKeyTextOutputFormat

        SerdeInfo:
          SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
          Parameters:
            field.delim: "\t"
            serialization.format: "\t"

        Columns:
          - Name: Date
            Type: date
          - Name: Time
            Type: string
          - Name: Location
            Type: string
          - Name: Bytes
            Type: int
          - Name: RequestIP
            Type: string
          - Name: Method
            Type: string
          - Name: Host
            Type: string
          - Name: Uri
            Type: string
          - Name: Status
            Type: int
          - Name: Referrer
            Type: string
          - Name: ClientInfo
            Type: string

Defining partition projection

Partition projection is supported here too. For example, on a CloudTrail logs table with partition projection:

CloudtrailLogsTable:
  Properties:
    TableInput:
      Parameters:
        projection.enabled: "true"
        projection.timestamp.type: "date"
        projection.timestamp.format: "yyyy/MM/dd"
        projection.timestamp.interval: "1"
        projection.timestamp.interval.unit: "DAYS"
        projection.awsregion.type: "enum"
        projection.awsregion.values: "us-east-1,us-east-2"
        storage.location.template: !Sub "s3://my-cloudtrail-logs/AWSLogs/${AWS::AccountId}/CloudTrail/${!awsregion}/${!timestamp}"

      PartitionKeys:
        - Name: awsregion
          Type: string

        - Name: timestamp
          Type: string