Spark Web UI

After creating and executing our Spark code, we need to track how it executes and where it spends the most time so that we can optimize it.

Apache Spark provides a suite of web user interfaces (Web UI) to monitor the status of our Spark/PySpark application, the resource consumption of the Spark cluster, and the Spark configurations.

The Spark UI is separated into the following tabs.

  • Spark Jobs
  • Stages
  • Tasks
  • Storage
  • Environment
  • Executors
  • SQL

If we are running the Spark application locally, the Spark UI can be accessed at http://localhost:4040/.
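As a minimal sketch (assuming PySpark is installed locally and port 4040 is free; the application name is arbitrary), the following starts a local SparkSession whose Web UI is served on http://localhost:4040/:

```python
from pyspark.sql import SparkSession

# A local SparkSession serves the Web UI on port 4040 by default.
# spark.ui.port is optional and only needed if 4040 is already taken.
spark = (
    SparkSession.builder
    .master("local[2]")
    .appName("web-ui-demo")
    .config("spark.ui.port", "4040")
    .getOrCreate()
)

print(spark.sparkContext.uiWebUrl)   # e.g. http://<hostname>:4040

# The UI only lives as long as the application, so keep it running while browsing.
input("Press Enter to stop the application (and its UI)...")
spark.stop()
```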

1. Spark UI Jobs Tab

In the Jobs tab, we get the following information:

Number of Spark Jobs

The number of Spark jobs equals the number of actions in the application, and each Spark job should have at least one Stage.

Number of Stages

Each wide transformation results in a separate stage. In our example, Spark job 0 and Spark job 1 have a single stage each, but Spark job 3 shows two stages because the data is partitioned into two files. A small sketch of how this plays out is shown below.
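As a rough sketch of how actions map to jobs and wide transformations add stages (the file path and column names here are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[2]").appName("jobs-and-stages").getOrCreate()

# Hypothetical input file and columns; substitute any CSV available locally.
df = spark.read.option("header", True).csv("/tmp/sales.csv")

df.count()    # action 1 -> one Spark job
df.show(5)    # action 2 -> another Spark job

# groupBy/agg is a wide transformation, so the job triggered by this action
# gets an extra shuffle stage on top of the stage that reads the data.
df.groupBy("region").agg(F.sum("amount")).collect()   # action 3

# Each action above appears as a separate entry in the Jobs tab,
# and the Description column links to its stages and DAG.
```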

Description

The Description column links to the complete details of the associated Spark job, such as the job status, the DAG visualization, and the completed stages.

2. Spark UI Stages Tab

A stage is a physical unit of the execution plan. It is a set of parallel tasks, i.e., one task per partition.

The Stage(s) tab displays a summary page that shows the current state of all stages of all jobs in the Spark application.

There is also a visual representation of the directed acyclic graph (DAG) of the stages, where vertices represent the RDDs or DataFrames and the edges represent an operation to be applied.

Stages in Apache Spark fall into two categories:

  • ShuffleMapStage in Spark
  • ResultStage in Spark

ShuffleMapStage in Spark

ShuffleMapStage is an intermediate Spark stage in the physical execution of the DAG. It produces data for one or more following stages; we can consider a ShuffleMapStage as the input for the Spark stages that follow it in the DAG of stages.

ResultStage in Spark

A ResultStage is the stage that runs the function of a Spark action in the user program. It is considered the final stage in a Spark job: it applies the function to one or many partitions of the target RDD and computes the result of the action.
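As a hedged sketch (a tiny word count on an in-memory RDD, so the data is illustrative), the job below has a ShuffleMapStage for the reduceByKey shuffle and a ResultStage for the collect action:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("stage-types").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["spark", "ui", "spark", "stages", "ui", "spark"], numSlices=2)

# reduceByKey is a wide transformation: the map side of the shuffle runs as a
# ShuffleMapStage, and the collect() action runs in the final ResultStage.
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())

# The Stages tab now shows both stages of this job, plus the DAG visualization.
```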

3. Storage Tab in Spark

The Storage tab displays the persisted RDDs and DataFrames, if any, in the application. The summary page shows the storage levels, sizes, and partitions of all RDDs, and the details page shows the sizes and the executors used for all partitions of an RDD or DataFrame.
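As a small sketch (the DataFrame contents are made up), persisting a DataFrame and then triggering an action makes it appear in the Storage tab:

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.master("local[2]").appName("storage-tab").getOrCreate()

df = spark.range(0, 1_000_000).withColumnRenamed("id", "value")

# Persist with an explicit storage level; nothing shows up in the Storage tab
# until an action actually materializes the cached data.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()

# The Storage tab now lists this DataFrame with its storage level, size,
# and the number of cached partitions. unpersist() removes the entry again.
```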

4. Environment Tab in Spark

This Environment page has five parts. It is a useful place to check whether your properties have been set correctly; a short configuration sketch follows the list below.

  • Runtime Information: contains the runtime properties, such as the versions of Java and Scala.
  • Spark Properties: lists the application properties, such as ‘spark.app.name’ and ‘spark.driver.memory’.
  • Hadoop Properties: displays properties relative to Hadoop and YARN. Note: properties like ‘spark.hadoop.*’ are shown not in this part but in ‘Spark Properties’.
  • System Properties: shows more details about the JVM.
  • Classpath Entries: lists the classes loaded from different sources, which is very useful for resolving class conflicts.
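As a hedged sketch (the application name and property values are arbitrary), properties set on the SparkSession builder show up under ‘Spark Properties’ in the Environment tab:

```python
from pyspark.sql import SparkSession

# Properties set on the builder appear under "Spark Properties" in the
# Environment tab. spark.driver.memory usually has to be set before the
# driver JVM starts (e.g. via spark-submit --driver-memory), so it is not
# set from code here.
spark = (
    SparkSession.builder
    .master("local[2]")
    .appName("environment-tab-demo")              # shows up as spark.app.name
    .config("spark.sql.shuffle.partitions", "8")
    .getOrCreate()
)

# The same values can be read back and compared with what the UI reports.
for key in ("spark.app.name", "spark.sql.shuffle.partitions"):
    print(key, "=", spark.conf.get(key))
```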

5. Executors Tab in Spark

The Executors tab displays summary information about the executors created for the application, including memory and disk usage and task and shuffle information. The Storage Memory column shows the memory used and reserved for caching data.

The Executors tab provides resource information, such as the amount of memory, disk, and cores used by each executor, as well as performance information.
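As a hedged sketch (the memory and core values are arbitrary), the per-executor resources requested in the configuration are what the Executors tab reports for each executor; in local mode the single executor is embedded in the driver, so these settings mainly matter when running against a cluster manager such as YARN, Kubernetes, or standalone:

```python
from pyspark.sql import SparkSession

# Per-executor resources requested here are reported in the Executors tab,
# along with task, storage-memory, and shuffle statistics.
spark = (
    SparkSession.builder
    .master("local[2]")
    .appName("executors-tab-demo")
    .config("spark.executor.memory", "2g")
    .config("spark.executor.cores", "2")
    .getOrCreate()
)

# Running some shuffling work populates the task and shuffle columns.
df = spark.range(0, 1_000_000)
df.groupBy((df.id % 10).alias("bucket")).count().collect()
```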

6. SQL Tab in Spark

If the application executes Spark SQL queries, then the SQL tab displays information, such as the duration, Spark jobs, and physical and logical plans for the queries.
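As a small sketch (the table name and query are made up), running a SQL query through the SparkSession makes it appear in the SQL tab, where its duration, associated jobs, and plans can be inspected; explain() prints the same logical and physical plans to the console:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("sql-tab-demo").getOrCreate()

# Register a small in-memory table (hypothetical data) and query it with SQL.
spark.range(0, 100).selectExpr("id", "id % 3 AS bucket").createOrReplaceTempView("numbers")

result = spark.sql("SELECT bucket, COUNT(*) AS cnt FROM numbers GROUP BY bucket")
result.show()

# The SQL tab now lists this query with its duration, the jobs it triggered,
# and its logical and physical plans; explain(True) prints the plans as well.
result.explain(True)
```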

