Customers on our Enterprise plan can export data in the Apache Parquet format. But what exactly is this data format, when is it used, and what are its benefits?
What is the Parquet file format?
Apache Parquet is an open source, column-oriented file format and one of the fastest formats to read from. Column-oriented means that the values of each column are stored together, rather than the data being stored one row at a time. Parquet files are structured and include the schema of their columns, which makes them well suited for importing straight into a database or data warehouse.
When are Parquet files used?
File storage solutions are generally cheap, easy to work with, and often version controlled, so many organizations store their data there. From there, the data can be streamed into one or more tools, data lakes, databases, or data warehouses for processing.
You will get the most benefit from this format if you use a tool that is built on top of file storage, or if you frequently read data from file storage in general.
What are the benefits of using Parquet files?
The main benefits of Parquet are:
Less data is scanned at ingestion, because you only have to read the columns that you actually need. This leads to faster scans and lower query and memory costs.
The file format is compressed, which leads to lower storage costs.
Files include column indexes that make it possible to look up directly which internal page holds a given column value, which leads to faster and cheaper querying of data.
Files include additional statistics about the data values, their types, and their sizes.
A good example of these benefits is Amazon Athena, which queries your files directly in Amazon S3. Amazon itself reports massive speed improvements when switching from CSV to Parquet.
Difference between a CSV file and a Parquet file
One difference is that CSV is a row-based file format, whereas Apache Parquet is a columnar format.
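The difference between the two layouts can be illustrated with plain Python data structures. This is only a conceptual sketch with hypothetical records, not how either format is encoded on disk.

```python
# Hypothetical records; names and values are made up for illustration.
rows = [
    {"date": "2024-01-01", "campaign": "brand", "clicks": 120},
    {"date": "2024-01-02", "campaign": "search", "clicks": 95},
    {"date": "2024-01-03", "campaign": "brand", "clicks": 143},
]

# Row-based layout (CSV-like): the values of each record are stored together.
row_layout = [tuple(record.values()) for record in rows]

# Columnar layout (Parquet-like): the values of each column are stored together.
column_layout = {name: [record[name] for record in rows] for name in rows[0]}

# Summing clicks in the row layout walks through every full record...
total_from_rows = sum(record[2] for record in row_layout)

# ...while the columnar layout only touches the one list that is needed.
total_from_columns = sum(column_layout["clicks"])

print(total_from_rows, total_from_columns)
```

Both totals are the same, but the columnar version never has to look at the date or campaign values.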
Also, if you import CSV data into a database, you have to supply a separate schema file or set the schema manually in the database. In contrast, data types are built into the Parquet format by default.
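The CSV side of this can be sketched with Python's standard csv module. The column names and the schema mapping below are hypothetical. Every value read back from CSV is a string, so the types have to come from somewhere outside the file.

```python
import csv
import io

# Write one hypothetical record to an in-memory CSV file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["date", "clicks", "cost"])
writer.writerow(["2024-01-01", 120, 10.5])

# Read it back: CSV carries no type information, so every value is a string.
buffer.seek(0)
record = next(csv.DictReader(buffer))
print({name: type(value).__name__ for name, value in record.items()})

# A separate, hand-maintained schema (hypothetical) is needed to restore types.
schema = {"date": str, "clicks": int, "cost": float}
typed = {name: schema[name](value) for name, value in record.items()}
print(typed)
```

With Parquet, this extra mapping is unnecessary because the column types are stored in the file.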
Exporting Parquet data from Funnel
Our Parquet files are compressed by default. This leads to a 60% to 75% reduction in file size compared to uncompressed CSV files.