The Athena query instance and the AWS S3 bucket should be in the same region.
The default option for the Funnel data export is compressed; using a compressed export format helps keep the cost low when querying a large amount of data.
We recommend using TSV as the output file format, since it works well with the default delimited format that Athena uses.
Read more about How to set up an export to AWS S3.
Get started with setting up a crawler
Connect a new data source by pressing the Connect data source button. You should then see options for choosing the location of the data and how to read its metadata catalog. Select Query data in Amazon S3 for the location, and AWS Glue data catalog for the metadata.
The AWS Glue data catalog requires a crawler that retrieves the schema information automatically. Select that option and press Connect to AWS Glue, which will open a new window.
Set up a new crawler for your data
- Choose an appropriate name for your crawler like 'funnel-data-s3-crawler'.
- Select Data stores as the source type
- Choose S3 as a data store, and leave the Connection blank
- Select Specified path in my account, and paste the path where the data is located into the Include path
- Select No if you don't have any other data store
- Create a new IAM role with an appropriate name for the crawler to access the data files
- Choose a daily frequency for the crawler to run
- Add a database where the output data table will be created, like 'funnel-data'
- Review and submit. Now select the crawler you just created and press Run Crawler to execute it for the first time (this might take a few minutes); a scripted equivalent of these steps is sketched below
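If you prefer to script the crawler setup instead of using the console, a minimal boto3 sketch might look like the following. The crawler name and database name reuse the examples above; the IAM role, S3 path, and schedule are placeholders you would replace with your own values.

import boto3

glue = boto3.client("glue")  # assumes AWS credentials and region are already configured

glue.create_crawler(
    Name="funnel-data-s3-crawler",
    Role="arn:aws:iam::123456789012:role/funnel-data-crawler-role",  # placeholder IAM role
    DatabaseName="funnel-data",
    Targets={"S3Targets": [{"Path": "s3://your-export-bucket/your-export-folder/"}]},  # placeholder path
    Schedule="cron(0 6 * * ? *)",  # daily run, here at 06:00 UTC
)

# Run the crawler for the first time (this might take a few minutes)
glue.start_crawler(Name="funnel-data-s3-crawler")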
Start querying your data
If you go to Athena you should be able to see the database called 'funnel-data' under the Data source: AwsDataCatalog. Start by selecting your database and writing your query.
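As a minimal sketch, you can also run a query programmatically with boto3. The table name and the query results bucket below are placeholders; the database name comes from the crawler setup above.

import boto3

athena = boto3.client("athena")  # use the same region as your S3 bucket

response = athena.start_query_execution(
    QueryString='SELECT * FROM "funnel-data".your_table LIMIT 10',  # placeholder table name
    QueryExecutionContext={"Database": "funnel-data", "Catalog": "AwsDataCatalog"},
    ResultConfiguration={"OutputLocation": "s3://your-athena-query-results-bucket/"},  # placeholder bucket
)
print(response["QueryExecutionId"])  # pass this id to get_query_results once the query finishes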
Advanced configuration to create dynamic tables in Athena
If you have multiple S3 folders, each with its own schema, you can configure the crawler in a dynamic way so that it recursively crawls the folders and creates a separate table for each of them.
Setting up an S3 data share in Funnel
Example setup where you want to create a table in Athena called "googleanalytics"
Step 1: S3 bucket name
You need to give the name of the bucket where the exported data will be stored.
Step 2: S3 bucket path (optional, but important)
Here you should give the name of the folder that you want as a table in Athena. This folder will also contain metadata about your share, such as the SQL schema and summary file from Funnel.
Step 3: File name template
Use this to create a file name template where each file covers one day, by selecting "Daily parts" from the Time series option. A common practice is to structure your S3 paths using key-value pairs for partitions; this automatically partitions your data by date when it is stored, which is a best practice.
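For example, a daily export partitioned by date could produce paths like the following (the bucket name and file names are placeholders):

s3://your-export-bucket/googleanalytics/date=2024-01-01/googleanalytics-2024-01-01.tsv
s3://your-export-bucket/googleanalytics/date=2024-01-02/googleanalytics-2024-01-02.tsv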
Setting up an AWS Glue crawler to catalog the CSV files in S3 as tables
If you set the crawler to create tables based on a specific folder level in your S3 bucket, it will automatically detect new data source folders and create corresponding tables without manual intervention.
Step 1: Use a Common Parent Path in the Crawler
Data Store Path: Set the crawler's data store path to the root of your S3 bucket where all data sources reside. Make sure to check the box Crawl all sub-folders.
Example:
s3://athena-aws-glue-bucket-de/
Exclude files matching pattern: Exclude any folders or files that you do not wish to include by using matching patterns such as */*.sql and */*.json
Step 2: Configure the Crawler's Output to Group by Folder Structure
Grouping Behavior When Crawling:
In the crawler configuration, under Output configuration (Advanced), select:
Option: "Create a single schema for each S3 path."
Table Level:
Set the "Table level" to 2, which tells the crawler to create a new table for each first-level subfolder under your specified S3 path.
This means folders like googleanalytics, googleads, and any new data source folders will each become a separate table.
Delimiter:
Use the default delimiter / to separate folder levels.
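The same grouping behavior and exclude patterns can also be applied programmatically. Below is a minimal boto3 sketch, assuming the crawler from the earlier setup and the example bucket above (the crawler name is a placeholder).

import json
import boto3

glue = boto3.client("glue")  # assumes AWS credentials and region are configured

glue.update_crawler(
    Name="funnel-data-s3-crawler",  # placeholder crawler name
    Targets={"S3Targets": [{
        "Path": "s3://athena-aws-glue-bucket-de/",
        "Exclusions": ["*/*.sql", "*/*.json"],  # skip the schema and summary files from Funnel
    }]},
    # Table level 2 creates one table per first-level subfolder under the bucket
    Configuration=json.dumps({"Version": 1.0, "Grouping": {"TableLevelConfiguration": 2}}),
)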
Step 3: Ensure Folder Structure Consistency
Folder Hierarchy:
Consistency: All data source folders (googleanalytics, googleads, etc.) should be at the same level directly under the root path.
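For instance, a consistent layout could look like the following, with each data source folder sitting directly under the bucket root (folder names beyond googleanalytics and googleads are placeholders):

s3://athena-aws-glue-bucket-de/googleanalytics/
s3://athena-aws-glue-bucket-de/googleads/
s3://athena-aws-glue-bucket-de/<new-data-source>/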