What Is Photon In Databricks

Article

What is Photon?

Photon is a native vectorized engine developed in C++ to improve query performance dramatically.

All we have to do to benefit from Photon is turn it on during the cluster creation process.

How Photon works

While Photon is written in C++, it integrates directly in and with Databricks Runtime and Spark. This means that no code changes are required to use Photon.

We will see a quick “lifecycle of a query” to understand where Photon plugs.

What is Photon in Databricks

When we submit a query or command to the Spark driver, it is parsed, and the Catalyst optimizer does the analysis, planning, and optimization just as it would if there were no Photon involved.

The only difference is that with Photon, the runtime engine passes over the physical plan and determines which parts can run in Photon. Minor modifications may be made to the plan for Photon, for example, changing a sort merge join to hash join, but the overall structure of the plan, including join order, will remain the same. Since Photon does not yet support all features that Spark does, a single query can run partially in Photon and partially in Spark. This hybrid execution model is entirely transparent to the user.

The query plan is then broken up into atomic units of distributed execution called tasks run in threads on worker nodes, which operate on a specific data partition. It’s at this level that the Photon engine does its work. You can think of it as replacing Spark’s whole stage codegen with a native engine implementation. The Photon library is loaded into the JVM, and Spark and Photon communicate via JNI(Java_Native_Interface), passing data pointers to off-heap memory. Photon also integrates with Spark’s memory manager for coordinated spilling in mixed plans. Spark and Photon are configured to use off-heap memory and coordinate under memory pressure.

Advantages in Photon

Supports SQL and equivalent DataFrame operations against Delta and Parquet tables.
Accelerates queries that process a significant amount of data (100GB+) and include aggregations and joins.
Faster performance when data is accessed repeatedly from the disk cache.
More robust scan performance on tables with many columns and many small files.
Faster Delta and Parquet writing using UPDATE, DELETE, MERGE INTO, INSERT, and CREATE TABLE AS SELECT, especially for wide tables (hundreds to thousands of columns).
Replaces sort-merge joins with hash-joins.

Limitations in Photon

Structured Streaming: Photon supports stateless streaming with Delta, Parquet, and CSV. Kafka and Kinesis support are in Public Preview.
Does not support UDFs.
Does not support RDD APIs.
Not expected to improve short-running queries (<2 seconds), such as queries against small amounts of data.

Features not supported by Photon run the same way they would with Databricks Runtime; there is no performance advantage for those features.