Apache Parquet

GitHub last commit GitHub

Apache Parquet is an open source data file format that was designed to improve performance when handling [[Column-oriented Database|column-oriented data]] in bulk. Apache Parquet is able to provide efficient compression and encoding schemes with enhanced performance due to its design. This makes it a common interchange format for both batch and interactive workloads, similar to other available columnar-storage file formats in Hadoop like RCFile and ORC.

Extension: .parquet

Apache Parquet Official Documentation

https://parquet.apache.org/docs/

Apache Parquet Advantages

Reduces IO operations.
Column-based format makes it more efficient in terms of storage space but also speeds up analytics queries.
Highly efficient data compression and decompression.
Support type-specific encoding.
Supports several data types and nested data structures.

Apache Parquet Disadvantages

Not human readable (binary).
More memory required to read data vs row-based format.
Can be slower to write than row-based file formats because of the metadata overhead.