Performance Optimization in Apache Spark

Even though Spark comes with an in-memory processing engine and we can spend more on hardware, we still need to optimize our Spark code to achieve better results.

We have many optimization techniques and will see some of them in the next few days.

1. Shuffle partition size

Based on our dataset size, number of cores, and memory, Spark shuffling can benefit or harm the jobs.

When we are dealing with a small amount of data, we should reduce the number of shuffle partitions. Otherwise, we end up with many partitions holding only a few records each, which means running many tasks with very little data to process.

On the other hand, having too much data with too few partitions results in a handful of long-running tasks, and sometimes we may also get out-of-memory errors.
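
As a minimal sketch (assuming a local SparkSession and an illustrative partition count), the spark.sql.shuffle.partitions setting controls how many partitions are produced by shuffles such as joins and aggregations; the default of 200 can be tuned up or down to match the data volume:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ShufflePartitionTuning")
  .master("local[*]")
  .getOrCreate()

// Reduce the number of shuffle partitions for a small dataset (default is 200)
spark.conf.set("spark.sql.shuffle.partitions", "50")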

2. Bucketing

Bucketing is commonly used to optimize the performance of a join query by avoiding shuffles of tables participating in the join.

df.write.bucketBy(n, "column").saveAsTable("bucketed_table")

The above code will create a table (bucketed_table) whose rows are distributed into "n" buckets based on the hash of the given column (add sortBy to also sort rows within each bucket).

Bucketing is most beneficial when the pre-shuffled bucketed tables are used more than once in queries.
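
Here is a minimal sketch (the table and column names are made up) where both sides of a join are bucketed on the join key, so the join can run without shuffling either table:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("BucketingExample")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Two small example DataFrames, both bucketed on the join key
val orders = Seq((1, "laptop"), (2, "phone"), (1, "mouse")).toDF("customer_id", "item")
val customers = Seq((1, "Alice"), (2, "Bob")).toDF("customer_id", "name")

orders.write.bucketBy(8, "customer_id").sortBy("customer_id").saveAsTable("bucketed_orders")
customers.write.bucketBy(8, "customer_id").sortBy("customer_id").saveAsTable("bucketed_customers")

// Joining the two bucketed tables on customer_id avoids shuffling both sides
val joined = spark.table("bucketed_orders")
  .join(spark.table("bucketed_customers"), "customer_id")
joined.show()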

3. Serialization

Serialization plays an important role in the performance of any distributed application. By default, Spark uses a Java serializer.

Spark can also use a 'Kryo' serializer for better performance.

The Kryo serializer uses a compact binary format and can serialize data up to 10x faster than the Java serializer.

To set the serializer properties,

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

4. API selection

Spark provides three types of APIs to work with - RDD, DataFrame, and Dataset (compared in the sketch after the list below).

  • RDD is used for low-level operations and offers the least optimization.
  • DataFrame is the best choice in most cases due to the Catalyst optimizer and its low garbage collection (GC) overhead.
  • Dataset is highly type-safe and uses encoders, which leverage Tungsten to serialize data in an efficient binary format.
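
The sketch below (with a made-up Person case class) shows the same filter expressed with each of the three APIs:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ApiSelection")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

case class Person(name: String, age: Int)
val people = Seq(Person("Alice", 34), Person("Bob", 19))

// RDD: low-level, functional transformations, no Catalyst optimization
val adultsRdd = spark.sparkContext.parallelize(people).filter(_.age >= 21)

// DataFrame: untyped rows, optimized by the Catalyst optimizer
val adultsDf = people.toDF().filter($"age" >= 21)

// Dataset: typed objects with encoders, also optimized by Catalyst
val adultsDs = people.toDS().filter(_.age >= 21)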

5. Avoid UDFs

UDFs are used to extend the framework's built-in functions; once created, a UDF can be reused across several DataFrames and SQL expressions.

But UDFs are a black box to Spark: it cannot apply any optimization to them, so we lose all the optimizations Spark performs on DataFrames/Datasets. Whenever possible, we should use the Spark SQL built-in functions, as these are designed with optimization in mind.
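
As a minimal sketch (the column and data are illustrative), the same transformation is written below first as a UDF and then with the built-in upper function; only the second version stays visible to the Catalyst optimizer:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf, upper}

val spark = SparkSession.builder()
  .appName("AvoidUdfs")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq("alice", "bob").toDF("name")

// UDF: a black box to the optimizer
val upperUdf = udf((s: String) => s.toUpperCase)
val withUdf = df.withColumn("name_upper", upperUdf(col("name")))

// Built-in function: Catalyst can optimize and generate code for this
val withBuiltin = df.withColumn("name_upper", upper(col("name")))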

6. Use serialized data formats

Most Spark jobs run as part of a pipeline where one Spark job writes data into a file, and another reads the data, processes it, and writes it to another file for the next Spark job to pick up.

For such use cases, we prefer writing the intermediate files in serialized, optimized formats like Avro and Parquet. Any transformation on these formats performs better than on text, CSV, or JSON.
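
For example, here is a minimal sketch (the paths and data are made up) of one job writing an intermediate Parquet file and a downstream job reading it back:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SerializedFormats")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val sales = Seq((1, 99.99), (2, 15.50)).toDF("order_id", "amount")

// First job: write the intermediate result as Parquet
sales.write.mode("overwrite").parquet("/tmp/intermediate/sales")

// Downstream job: read the Parquet data back and continue processing
val downstream = spark.read.parquet("/tmp/intermediate/sales")
  .filter($"amount" > 20.0)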