Data Engineering Hub

Delta Lake


Delta Lake is an open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, and with APIs for Scala, Java, Rust, Ruby, and Python.

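Delta Lake is essentially a metadata layer on top of Parquet: the data lives in ordinary Parquet files, and a transaction log in the table's `_delta_log` directory records which files make up each version of the table.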

The file layout looks like:

![[Assets/delta-lake-file-format.png|500]]
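
To make the "metadata layer" idea concrete, here is a minimal sketch of how a reader could reconstruct a table snapshot by replaying the log. It is a toy model that uses only the Python standard library, not the Delta Lake implementation: it assumes commit files named as zero-padded 20-digit versions under `_delta_log`, each holding newline-delimited JSON actions, and it handles only `add` and `remove`. Real logs also carry other actions (such as `metaData`, `protocol`, and `commitInfo`) and Parquet checkpoints. The table name, file names, and the helpers `commit`, `snapshot`, and `demo` are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def commit(log_dir: Path, version: int, actions: list[dict]) -> None:
    """Write one commit file of newline-delimited JSON actions."""
    path = log_dir / f"{version:020d}.json"
    # Mode "x" fails if the file already exists. This stands in for the
    # atomic put-if-absent that optimistic concurrency control relies on.
    with open(path, "x") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")


def snapshot(log_dir: Path, as_of=None) -> set[str]:
    """Replay commits in order and return the live data files at a version."""
    live = set()
    # Zero-padded names sort lexicographically in version order.
    for commit_file in sorted(log_dir.glob("*.json")):
        version = int(commit_file.stem)
        if as_of is not None and version > as_of:
            break
        for line in commit_file.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


def demo() -> tuple[set[str], set[str]]:
    """Build a three-commit toy table and read it at two versions."""
    with tempfile.TemporaryDirectory() as tmp:
        log_dir = Path(tmp) / "my_table" / "_delta_log"
        log_dir.mkdir(parents=True)
        commit(log_dir, 0, [{"add": {"path": "part-0000.parquet"}}])
        commit(log_dir, 1, [{"add": {"path": "part-0001.parquet"}}])
        # A rewrite (e.g. compaction) removes old files and adds a new one.
        commit(log_dir, 2, [
            {"remove": {"path": "part-0000.parquet"}},
            {"remove": {"path": "part-0001.parquet"}},
            {"add": {"path": "part-0002.parquet"}},
        ])
        return snapshot(log_dir), snapshot(log_dir, as_of=1)
```

Calling `demo()` returns the latest snapshot and the snapshot as of version 1. Reading an older version only means stopping the replay early, which is the idea behind time travel; the removed Parquet files stay on disk until they are cleaned up.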

Delta Lake Official Documentation

https://docs.delta.io/latest/index.html

Delta Lake Advantages (over plain [[Apache Parquet|Parquet]])

  • ACID transactions with optimistic concurrency control.
  • Efficient streaming I/O.
  • Caching.
  • Time travel.
  • Data layout optimization, e.g. Z-ordering.
  • Schema enforcement & evolution.
  • Upserts via MERGE, plus UPDATE and DELETE statements.
  • Audit logging.
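
Several of the features listed above are exposed as SQL statements. The sketch below shows how an upsert, a time-travel read, and a look at the commit history might be issued from Python. It assumes `spark` is a SparkSession already configured with the Delta Lake extensions and that both tables exist; the function name and the `target`, `source`, and `key` parameters are hypothetical. Whether the `VERSION AS OF` syntax is available depends on the Spark and Delta Lake versions in use.

```python
def upsert_and_inspect(spark, target: str, source: str, key: str):
    """Upsert `source` into `target`, then read version 0 and the history.

    `spark` is assumed to be a Delta-enabled SparkSession; the table and
    column names are placeholders supplied by the caller.
    """
    # Upsert: update rows that match on the key, insert the rest.
    spark.sql(f"""
        MERGE INTO {target} AS t
        USING {source} AS s
        ON t.{key} = s.{key}
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)
    # Time travel: query the table as it was at version 0.
    first_version = spark.sql(f"SELECT * FROM {target} VERSION AS OF 0")
    # Audit log: one row per commit (version, timestamp, operation, ...).
    history = spark.sql(f"DESCRIBE HISTORY {target}")
    return first_version, history
```

A call such as `upsert_and_inspect(spark, "events", "updates", "id")` would return two DataFrames: the table contents at version 0 and the commit history.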

Delta Lake Disadvantages

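  • The same disadvantages as Parquet, since the data is still stored as Parquet files.
  • Regular maintenance is required to keep performance up, e.g. running OPTIMIZE to compact small files.
  • Advanced features come with a learning curve, e.g. VACUUM, which deletes data files the table no longer references and so limits how far back time travel can go.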
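
As a rough illustration of the maintenance commands mentioned above, the sketch below wraps OPTIMIZE and VACUUM in a small helper. It assumes `spark` is a Delta-enabled SparkSession; the function name, the table name, and the Z-order column are hypothetical. The default of 168 hours mirrors the usual seven-day retention period; whether a shorter value is accepted depends on the table and session configuration.

```python
def run_maintenance(spark, table: str, zorder_column: str, retain_hours: int = 168):
    """Compact a Delta table and clean up files it no longer references.

    `spark` is assumed to be a Delta-enabled SparkSession; `table` and
    `zorder_column` are placeholders supplied by the caller.
    """
    # Compact small files and co-locate rows by the chosen column.
    spark.sql(f"OPTIMIZE {table} ZORDER BY ({zorder_column})")
    # Delete unreferenced data files older than the retention window.
    # Versions that depend on those files can no longer be time-travelled to.
    spark.sql(f"VACUUM {table} RETAIN {retain_hours} HOURS")
```

Running OPTIMIZE before VACUUM is a deliberate choice here: compaction rewrites data into new files, and the old files only become candidates for deletion once they fall outside the retention window.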