What Is Delta Lake?

Delta Lake is an open-source storage layer for building a Lakehouse: it adds a structured, transactional layer on top of all types of data (including unstructured data) stored in a data lake. This layer enables features similar to those of relational databases, along with capabilities that go beyond what traditional relational databases offer.


What Is a Delta Table, and What Is Its Format?

In Spark, we can have two types of tables:

  1. Managed
  2. External (unmanaged)

When you drop a managed table, both the schema and the underlying files are deleted. When you drop an external (unmanaged) table, only the schema is deleted; the underlying files remain as they are.

When saving data to an external table, we usually save it in Delta format.

What Is Delta Format?

Delta format is built on Parquet: Delta Lake stores our data as versioned Parquet files in cloud storage. Alongside these versions, Delta Lake also maintains a transaction log that records every commit made to the table or blob-store directory, which is what provides ACID transactions.
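On disk, a Delta table is therefore just a directory of Parquet data files plus a `_delta_log` folder of JSON commit files. A hypothetical layout (the path and file names are illustrative):

```
/tmp/delta/events/
├── _delta_log/
│   ├── 00000000000000000000.json   <- commit 0
│   └── 00000000000000000001.json   <- commit 1
├── part-00000-<uuid>.snappy.parquet
└── part-00001-<uuid>.snappy.parquet
```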


PySpark

To save the data from our DataFrame into a Delta table:


The underlying files behind the created Delta table, with its Delta log

Here, the records from testDf were written into a Delta table. The data is saved in Parquet format along with a _delta_log folder, which maintains the transaction log for that particular Delta table.

So a Delta table is essentially versioned Parquet files plus a transaction log.
