Query S3 data using Athena with automatic schema update (Glue)

Learn how to retrieve schema information automatically using AWS Glue and query data from S3 using Athena

Written by Ishan Shekhar

  • Athena queries and the AWS S3 bucket should be in the same region.

  • The default option for the Funnel data export is compressed; using a compressed export format helps keep costs low when querying large amounts of data.

  • We recommend using TSV as the output file format, since it works well with the default delimited format that Athena uses.

    Read more about How to set up an export to AWS S3.

Get started with setting up a crawler

  1. Connect a new data source by pressing the Connect data source button. You should then see options for choosing the location of the data and how to read its metadata catalog. Select Query data in Amazon S3 for the location and AWS Glue data catalog for the metadata.

  2. The AWS Glue Data Catalog requires a crawler that retrieves schema information automatically. Select that option and press Connect to AWS Glue, which will open a new window.

  3. Set up a new crawler for your data (a scripted equivalent follows these steps):
    - Choose an appropriate name for your crawler, like 'funnel-data-s3-crawler'.
    - Select Data stores as the source type.
    - Choose S3 as the data store, and leave the Connection blank.
    - Select Specified path in my account, and paste the path where the data is located into the include path.
    - Select No when asked whether you want to add another data store.
    - Create a new IAM role with an appropriate name for the crawler to access the data files.
    - Choose a daily frequency for the crawler to run.
    - Add a database where the output data table will be created, like 'funnel-data'.
    - Review and submit. Then select the crawler you just created and press Run Crawler to execute it for the first time (this might take a few minutes).
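
If you prefer to script this setup, here is a minimal sketch using boto3 that creates and starts an equivalent crawler. The region, IAM role name, and bucket path are placeholder assumptions; substitute your own values.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # same region as your S3 bucket

# Create a crawler equivalent to the console steps above.
glue.create_crawler(
    Name="funnel-data-s3-crawler",
    Role="AWSGlueServiceRole-funnel-data",  # placeholder: the IAM role created for the crawler
    DatabaseName="funnel-data",             # database where the output tables are created
    Targets={"S3Targets": [{"Path": "s3://your-funnel-export-bucket/"}]},  # placeholder path
    Schedule="cron(0 4 * * ? *)",           # run daily at 04:00 UTC
)

# Run the crawler for the first time (this might take a few minutes).
glue.start_crawler(Name="funnel-data-s3-crawler")
```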

Start querying your data

If you go to Athena, you should see the database called 'funnel-data' under Data source: AwsDataCatalog. Start by selecting your database and writing your query.
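
You can also run queries programmatically. Below is a minimal boto3 sketch; the table name and the S3 output location for query results are placeholder assumptions.

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start the query; Athena writes results to the output location you choose.
execution = athena.start_query_execution(
    QueryString='SELECT * FROM "funnel-data"."your_table" LIMIT 10;',  # placeholder table
    QueryExecutionContext={"Database": "funnel-data", "Catalog": "AwsDataCatalog"},
    ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/"},  # placeholder
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Print the rows (first row is the header; paginate for larger results).
if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```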

Advanced configuration to create dynamic tables in Athena

If you have multiple S3 folders, each with its own schema, you can configure the crawler dynamically so that it recursively crawls the folders and creates a separate table for each one.

Setting up an S3 data share in Funnel

Example setup where you want to create a table in Athena called "googleanalytics".

Step 1: S3 bucket name

You need to give the name of the S3 bucket where the exported data will be stored.

Step 2: S3 bucket path (optional, but important)

Here you should give the name of the folder that you want to appear as a table in Athena. This folder will also contain metadata about your share, such as the SQL schema and a summary file from Funnel.

Step 3: File name template

Use this to create a file name template where each file covers one day, by selecting "Daily parts" from the Time series option. A common practice is to structure your S3 paths using key-value pairs for partitions; this automatically partitions your data by date when storing, which is a best practice.
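
As an illustration (the bucket, file names, and dates are hypothetical), daily parts with a date partition key could produce keys like:

```
s3://your-funnel-export-bucket/googleanalytics/dt=2024-01-01/funnel-export-2024-01-01.tsv.gz
s3://your-funnel-export-bucket/googleanalytics/dt=2024-01-02/funnel-export-2024-01-02.tsv.gz
```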

Setting up an AWS Glue crawler to catalog the CSV files in S3 as tables

By setting the crawler to create tables based on a specific folder level in your S3 bucket, it will automatically detect new data source folders and create corresponding tables without manual intervention.

Step 1: Use a Common Parent Path in the Crawler

  • Data Store Path: Set the crawler's data store path to the root of your S3 bucket where all data sources reside. Make sure to check the box Crawl all sub-folders.

    • Example: s3://athena-aws-glue-bucket-de/

  • Exclude files matching pattern: Exclude any folders or files that you do not wish to include by using matching patterns such as */*.sql and */*.json.

Step 2: Configure the Crawler's Output to Group by Folder Structure

Grouping Behavior When Crawling:

  • In the crawler configuration, under Output configuration (Advanced options), select the following (a scripted version is sketched after this list):

    • Option: "Create a single schema for each S3 path."

  • Table Level:

    • Set the "Table level" to 2, which tells the crawler to create a new table for each first-level subfolder under your specified S3 path.

    • This means folders like googleanalytics, googleads, and any new data source folders will each become a separate table.

  • Delimiter:

    • Use the default delimiter / to separate folder levels.
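
For reference, here is a minimal boto3 sketch of a crawler with this configuration. The grouping options are passed as a JSON Configuration string; the crawler name and IAM role are placeholders, and we assume the console option "Create a single schema for each S3 path" corresponds to the CombineCompatibleSchemas grouping policy.

```python
import json
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Grouping: combine compatible schemas per S3 path, and create one table
# per first-level subfolder (table level 2, counted from the bucket root).
configuration = json.dumps({
    "Version": 1.0,
    "Grouping": {
        "TableGroupingPolicy": "CombineCompatibleSchemas",
        "TableLevelConfiguration": 2,
    },
})

glue.create_crawler(
    Name="funnel-dynamic-crawler",                  # placeholder name
    Role="AWSGlueServiceRole-funnel-data",          # placeholder IAM role
    DatabaseName="funnel-data",
    Targets={"S3Targets": [{
        "Path": "s3://athena-aws-glue-bucket-de/",  # common parent path (bucket root)
        "Exclusions": ["*/*.sql", "*/*.json"],      # skip schema and summary metadata files
    }]},
    Configuration=configuration,
)
```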

Step 3: Ensure Folder Structure Consistency

  • Folder Hierarchy:

    • Your S3 folder structure should look like the example shown after this list.

  • Consistency:

    • All data source folders (googleanalytics, googleads, etc.) should be at the same level directly under the root path.
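
For example (the partition folders and dates are illustrative):

```
s3://athena-aws-glue-bucket-de/
├── googleanalytics/
│   ├── dt=2024-01-01/
│   └── dt=2024-01-02/
└── googleads/
    ├── dt=2024-01-01/
    └── dt=2024-01-02/
```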
