Overview
Databricks is deployed in your Azure, Google Cloud, or AWS environment. We support all three and can work with data in their respective warehouses. The examples below are for when you set it up in AWS and store the files in S3.
We refer to the Databricks documentation for up-to-date instructions on how to set this up on their side.
Setting up the S3 data share
Make sure to select CSV as the Field format and include data/ at the beginning of the File name template so that all the CSV data files end up in their own folder. This example also assumes that you don't set Header format to none in Edit field format.
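Once the data share has run at least once, you can sanity-check the exported files directly from Databricks before creating any tables. The sketch below queries the CSV files in place using Spark SQL's path syntax, assuming the same example bucket and path used in the import steps further down; note that this syntax does not parse the header row, so it will show up as data. This is only meant to confirm that files are arriving.
-- Query the exported CSV files in place to verify that the share is producing
-- data in the expected S3 location (example bucket/path, replace with your own).
-- The header row is not parsed here, so it appears as the first row of data.
SELECT * FROM csv.`s3://funnel-s3-export-bucket/path/data` LIMIT 10;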
Importing data from S3 into Databricks
You can create a Delta table (the default table format in Databricks) like this:
CREATE TABLE IF NOT EXISTS example_table;
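If you would rather not rely on schema merging during the import, you can instead declare the schema up front. The column names and types below are hypothetical placeholders, not the actual fields of your Funnel export:
-- Alternative sketch: create the table with an explicit schema instead of
-- letting COPY INTO merge it from the files. These columns are hypothetical
-- examples; use the fields you actually export from Funnel.
CREATE TABLE IF NOT EXISTS example_table (
  report_date DATE,
  campaign STRING,
  cost DOUBLE
);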
You can then do a one-time import of the files like this:
COPY INTO example_table
FROM 's3://funnel-s3-export-bucket/path/data'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'escape' = '"')
COPY_OPTIONS ('mergeSchema' = 'true')
And preview it like this:
SELECT * FROM example_table LIMIT 20;
The data can now be found in the Databricks Data Explorer, but this was only a one-time import.
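That said, COPY INTO keeps track of which files it has already loaded, so re-running the exact same statement only appends files that have arrived since the last run. If you don't need a streaming pipeline, a simple alternative is to run the statement above on a schedule, for example as a scheduled Databricks job:
-- Safe to re-run: files that were already loaded are skipped, so only new
-- CSV files under data/ are appended on each run.
COPY INTO example_table
FROM 's3://funnel-s3-export-bucket/path/data'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'escape' = '"')
COPY_OPTIONS ('mergeSchema' = 'true')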
Setting up a pipeline from S3 into Databricks
We refer to the Databricks documentation for setting this up, but if you are using Auto Loader for S3 in SQL, it will look something like this:
CREATE STREAMING LIVE TABLE streaming_table
AS
SELECT
  *
FROM
  cloud_files(
    's3://funnel-s3-export-bucket/path/data',
    'csv',
    map(
      'header', 'true',
      'escape', '"',
      'cloudFiles.inferColumnTypes', 'true'
    )
  )
You can then use that notebook as the source code when creating the continuous pipeline.
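Once the pipeline has run, the streaming table can be queried like any other table in the target schema configured on the pipeline, for example:
-- Preview the streaming table after the pipeline has processed the files.
SELECT * FROM streaming_table LIMIT 20;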