Understanding Apache Beam

Introduction

Apache Beam is an Apache project for data processing. The confusing part for beginners is why they should use Apache Beam at all, and how it differs from Apache Flink or Apache Spark, which are already used for the same requirements.

The beauty of Apache Beam is that it is a unified batch and stream processing system. Apache Beam provides a set of APIs with which we write our code once, and that code can then execute anywhere: on Apache Spark, Apache Flink, Google Cloud Dataflow, and so on.

This article explains what Apache Beam is, the internals of Beam, and the essentials of its API. Let’s explore.

Apache Beam Internals

Apache Beam currently supports three language SDKs: Java, Python, and Go, and more languages are likely to follow as the community grows quickly. Using any of these SDKs, users create pipelines. After a pipeline is created, Apache Beam internally converts it into a language-agnostic representation; this is done by the Runner APIs, which we will look at next. The generic language-agnostic representation will be discussed in the next article on Beam.

The Runner first validates that the pipeline follows the Apache Beam protocols, and then executes it on the target machine. This much information is sufficient for us to begin with.
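
To make this concrete, here is a minimal sketch of how a runner is chosen. The runner class names in the comments are the ones shipped with Beam; the key point is that the pipeline code itself does not change when the runner changes.

// The runner is just a pipeline option; the pipeline code stays the same.
// Examples: --runner=DirectRunner (local testing), --runner=FlinkRunner,
// --runner=SparkRunner, --runner=DataflowRunner (Google Cloud).
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
Pipeline pipeline = Pipeline.create(options);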

API

The Apache Beam API has four important objects:

  1. Pipeline
  2. PCollection
  3. PTransform
  4. Runner

The Pipeline class manages a Directed Acyclic Graph (DAG). In a DAG, each node represents a task, such as downloading an input file, normalizing it, or saving the output file. Each Pipeline is self-contained and isolated from the other Pipelines in Beam.

// Create a pipeline with default options.
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline pipeline = Pipeline.create(options);
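
For reference, the snippets in this article assume the following imports from the Beam Java SDK:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;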

Another important class in Apache Beam is PCollection. A PCollection holds the pipeline’s data; it is roughly analogous to a DataFrame, and it is typically created while reading data from a source. A PCollection can hold either a bounded or an unbounded number of elements, which is how Beam represents batch and streaming inputs with a single abstraction.

// Build a bounded PCollection from an in-memory list of elements.
PCollection<Integer> input =
    pipeline.apply(Create.of(1, 2, 3, 4));
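
Since a PCollection is usually created by reading from a source, here is a short sketch using TextIO; the file path is a made-up placeholder. A file read like this produces a bounded PCollection of lines.

import org.apache.beam.sdk.io.TextIO;

// Reading a file yields a bounded PCollection (hypothetical path).
PCollection<String> lines =
    pipeline.apply(TextIO.read().from("/path/to/input.txt"));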

Step three is transformation: any transformation applied to a PCollection results in a new PCollection object, because PCollections are immutable.

// Apply a ParDo transform; extract_even is assumed to be a
// DoFn<Integer, String> instance defined elsewhere.
PCollection<String> words = input.apply(
    ParDo.of(extract_even));
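
The snippet above does not show what extract_even is, so here is a minimal sketch of what such a DoFn could look like; the class name ExtractEvenFn and the even-number logic are illustrative assumptions, not code from the original.

// A DoFn that keeps even integers and emits them as strings.
static class ExtractEvenFn extends DoFn<Integer, String> {
  @ProcessElement
  public void processElement(@Element Integer n, OutputReceiver<String> out) {
    if (n % 2 == 0) {
      out.output(String.valueOf(n));
    }
  }
}

// Usage: PCollection<String> words = input.apply(ParDo.of(new ExtractEvenFn()));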

We will dig into transformation methods in a separate article very soon. The final step is pipeline execution:

pipeline.run();
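
One detail worth knowing: run() is asynchronous on most runners, so a common pattern is to block until the pipeline completes.

// Block until the pipeline finishes (useful in main() and in tests).
pipeline.run().waitUntilFinish();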

The below diagram showcases the Pipeline.

[Diagram: Apache Beam pipeline]

Summary

This article explained what Apache Beam is, the internals of Beam, the API, and other key details. The upcoming articles will go deeper and cover various examples, so stay tuned.