Databricks Delta Live Tables

Introduction

In this blog, we are going to explore Databricks Delta Live Tables. This feature is in public preview and is used to build reliable, testable data processing pipelines that perform transformations on data without manually creating multiple Apache Spark tasks.


We can define transformations on top of the stored data, and Delta Live Tables takes care of administration, cluster management, and error handling. It helps data engineers simplify ETL data pipelines with pipeline development, testing, and in-built monitoring features.

Features of Delta Live Tables

Simplified Data Pipelines

With the help of Delta Live Tables, we can create end-to-end data pipelines to ingest, transform, and process data. We don't have to worry about manually connecting the dependent data sources, and both batch and streaming data sources are supported.


Supports In-built Automated Testing

Data pipelines created with Databricks Delta Live Tables have in-built automated testing features, which ensure that quality data is available to downstream users for Power BI reporting and machine learning purposes.

Proactive Monitoring

Databricks Delta Live Tables provides in-built monitoring to track executed operations and data lineage. It also makes it easy to recover from failures and speeds up operational tasks while working with data pipelines.

After understanding the overview of Databricks Delta Live Tables and its features, let's dive deeper into the actual implementation using Azure Databricks.

We will read records from the raw data store, use Delta Live Tables to clean the data, and create a new table.

Step 1 - Create an Azure Databricks Notebook

import dlt
from pyspark.sql.functions import expr

# Path to the raw JSON dataset
json_path = "/databricks-datasets/stream.json"

@dlt.table(
  comment="The raw clickstream dataset, ingested from /databricks-datasets."
)
def clickstream_raw():
  # Ingest the raw JSON file as-is into a live table named clickstream_raw
  return spark.read.json(json_path)

@dlt.table(
  comment="Clickstream data cleaned and prepared for analysis."
)
def clickstream_prepared():
  # Read from the raw table, cast the click count to an integer,
  # and rename the title column for downstream consumers
  return (
    dlt.read("clickstream_raw")
      .withColumn("click_count", expr("CAST(n AS INT)"))
      .withColumnRenamed("curr_title", "page_title")
      .select("page_title", "click_count")
  )

In this code, we define two Delta Live Tables: clickstream_raw, which ingests the raw JSON data, and clickstream_prepared, which cleans and prepares it for analysis.

Step 2 - Create Azure Databricks Pipeline

Now, go to Jobs in the Databricks workspace UI, click on the Pipelines tab, and create the pipeline.

Concepts of Delta Live Tables

Pipeline

The pipeline is the main component of Delta Live Tables; it links source datasets with destination datasets. We can use either SQL queries or Python code to define the pipeline.

Queries

We can apply data transformations in Delta Live Tables using queries, written in either SQL or Python.

Expectations

Data quality and all relevant control checks can be defined as expectations. We can define expectations that keep or drop records which fail a check, and it is also possible to stop the pipeline if certain expectations are violated, as sketched below.
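
The following is a minimal sketch building on the tables defined in Step 1; the expectation names and constraints are illustrative. The dlt decorators expect, expect_or_drop, and expect_or_fail cover the three behaviors: keep the record but log the violation, drop the record, or fail the pipeline update.

import dlt
from pyspark.sql.functions import col

@dlt.table(
  comment="Clickstream records that pass basic quality checks."
)
@dlt.expect("valid_click_count", "click_count >= 0")                # keep record, log violation
@dlt.expect_or_drop("valid_page_title", "page_title IS NOT NULL")   # drop records that fail
@dlt.expect_or_fail("non_empty_title", "page_title <> ''")          # stop the update on violation
def clickstream_validated():
  # Apply the expectations above to the prepared table from Step 1
  return dlt.read("clickstream_prepared").select(col("page_title"), col("click_count"))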

Pipeline Settings

It is also possible to set various configurations for the pipeline to create an end-to-end automated pipeline. Pipeline settings are defined in JSON format, and we can configure the items below (an illustrative settings file follows the list),

  • Libraries
  • Cloud Storage
  • Other Python Package dependency
  • Spark Cluster Configuration
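
As a rough sketch, a settings file for the pipeline above might look like the following; the pipeline name, notebook path, storage location, cluster size, and configuration key are illustrative placeholders.

{
  "name": "clickstream-pipeline",
  "storage": "/mnt/dlt/clickstream",
  "continuous": false,
  "libraries": [
    {
      "notebook": {
        "path": "/Users/<user>/clickstream_dlt_notebook"
      }
    }
  ],
  "clusters": [
    {
      "label": "default",
      "num_workers": 1
    }
  ],
  "configuration": {
    "pipeline.input_path": "/databricks-datasets/stream.json"
  }
}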

Datasets

There are two main types of pipeline datasets: views and tables.

Views - Views are similar to views in SQL and allow us to break complex queries into smaller, reusable pieces; they are not persisted to the target storage.

Tables - We can create either complete tables, which are fully recomputed on each update, or incremental tables, which process only new data as it arrives. A short sketch of both dataset types follows.
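
A minimal sketch of both dataset types, building on the tables defined earlier; the view name, the filter threshold, and the Auto Loader source directory are illustrative assumptions.

import dlt

@dlt.view(
  comment="Intermediate view used to break up a larger query; not persisted to storage."
)
def frequent_pages_view():
  # Views behave like SQL views: recomputed when referenced, not written out
  return dlt.read("clickstream_prepared").where("click_count > 100")

@dlt.table(
  comment="Incremental table that ingests only newly arrived clickstream files."
)
def clickstream_incremental():
  # Auto Loader (cloudFiles) reads the source directory as a stream, so each
  # pipeline update processes only files it has not seen before
  return (
    spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .load("/databricks-datasets/clickstream/")
  )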

Continuous and Triggered Pipeline

Delta Live Tables supports two modes of running a pipeline. A triggered pipeline updates the data once, based on the pipeline definition, and then stops. A continuous pipeline continuously ingests and transforms data based on the pipeline logic, which makes sure that downstream consumers always have fully up-to-date data available. The mode is controlled by the continuous flag in the pipeline settings.

Note
Delta Live Tables can be created in Python as well as SQL. Delta Live Tables currently only supports updating data in Delta tables. We can't write multiple queries in a pipeline that update the same table.

Conclusion

In this blog, we showed how efficiently we can create end-to-end ETL pipelines with the help of Databricks Delta Live Tables, including data quality and validation controls, automated testing, and monitoring.