Understanding Apache Airflow

Introduction

This article briefly explains the orchestration tool Apache Airflow. To understand Airflow, it helps to first understand what a workflow is: a sequence of tasks that processes a set of data, or a series of steps that need to be executed. Wikipedia defines it as follows: “A workflow consists of an orchestrated and repeatable pattern of activity, enabled by the systematic organization of resources into processes that transform materials, provide services, or process information. It can be depicted as a sequence of operations, the work of a person or group, the work of an organization of staff, or one or more simple or complex mechanisms.”

In a nutshell, Apache Airflow is a platform for authoring, scheduling (you can let go of old cron jobs), and monitoring data and compute workflows. It was originally developed at Airbnb and is now under the Apache Software Foundation.

Features

Apache Airflow is open source, and it uses standard Python to create workflows. It ships with a collection of ready-to-use operators that can work with MySQL, Oracle, and other databases, as well as with the Azure, AWS, and Google Cloud platforms. Operators are explained in detail in the later part of the article.

It has an intuitive UI for monitoring and managing workflows: we can check the status of running tasks, and we have the flexibility to stop them or trigger them on demand.

DAGs

DAG stands for Directed Acyclic Graph: a graph with nodes and edges, where every edge is directed and there are no loops (cycles). In a nutshell, a DAG is a data pipeline, and each node in a DAG is a task such as “Download a file from S3” or “Query a MySQL database”.

In the DAG, A, B, and C represent tasks, and each of them is independent; this is a correct DAG, as there are no loops.

This is an incorrect DAG because it contains a loop: such a DAG would never complete and would keep running forever.
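To show how nodes and directed edges look in code, here is a minimal sketch, assuming the same Airflow 1.x-style imports used in the example later in this article; the DAG name and the use of DummyOperator for the placeholder tasks A, B, and C are illustrative choices, not part of the original example.

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago

# A tiny DAG whose three nodes are placeholder tasks A, B, and C
dag = DAG('simple_structure_dag',
          start_date=days_ago(1),
          schedule_interval=None)

a = DummyOperator(task_id='A', dag=dag)
b = DummyOperator(task_id='B', dag=dag)
c = DummyOperator(task_id='C', dag=dag)

# The >> operator draws the directed edges: A -> B -> C, with no loops
a >> b >> c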

Let’s understand DAGs with a realistic example. Say the organization receives a CSV file on S3 every day. The CSV file then needs to be aggregated and normalized (say, using Pandas), the resulting data uploaded to MongoDB, and, after processing, the source file uploaded to another S3 location.

Operator

As discussed, the nodes in a DAG are tasks such as download, aggregate, and upload. These tasks are executed as Operators: our DAG has four nodes, and each node/task will be executed as an Operator. Airflow offers various ready-to-use Operators that can be leveraged.

For downloading and uploading a file to S3, the existing “S3FileTransformOperator” will be used.

For executing a Python function (step 2 runs a Pandas program), Airflow provides the PythonOperator.

Similarly, for interacting with MongoDB or any other database, Airflow has existing operators and hooks.
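Putting these operators together for the CSV example above, here is a hedged sketch of what the four-task pipeline could look like. It assumes the same Airflow 1.x-style imports as the example later in this article; the bucket names, object keys, file paths, column name, and MongoDB URI are hypothetical placeholders; the download and MongoDB steps use plain boto3 and pymongo inside PythonOperators rather than Airflow's own hooks; and the final archival step uses S3FileTransformOperator with /bin/cp as a pass-through transform script.

from datetime import timedelta

import boto3
import pandas as pd
from pymongo import MongoClient

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.s3_file_transform_operator import S3FileTransformOperator
from airflow.utils.dates import days_ago

dag = DAG('daily_csv_pipeline',
          default_args={'start_date': days_ago(1)},
          schedule_interval=timedelta(days=1))

def download_csv():
    # Task 1: pull the daily CSV from S3 to local disk (placeholder bucket/key)
    boto3.client('s3').download_file('incoming-bucket', 'daily.csv', '/tmp/daily.csv')

def aggregate_csv():
    # Task 2: aggregate/normalize the CSV with Pandas (placeholder logic)
    df = pd.read_csv('/tmp/daily.csv')
    df.groupby('category', as_index=False).sum().to_csv('/tmp/daily_agg.csv', index=False)

def load_to_mongo():
    # Task 3: insert the processed rows into MongoDB (placeholder URI/collection)
    records = pd.read_csv('/tmp/daily_agg.csv').to_dict('records')
    MongoClient('mongodb://localhost:27017')['reports']['daily'].insert_many(records)

download = PythonOperator(task_id='download_csv', python_callable=download_csv, dag=dag)
aggregate = PythonOperator(task_id='aggregate_csv', python_callable=aggregate_csv, dag=dag)
load = PythonOperator(task_id='load_to_mongo', python_callable=load_to_mongo, dag=dag)

# Task 4: copy the source file to an archive location on S3;
# /bin/cp acts as a pass-through "transform"
archive = S3FileTransformOperator(
    task_id='archive_source',
    source_s3_key='s3://incoming-bucket/daily.csv',
    dest_s3_key='s3://archive-bucket/daily.csv',
    transform_script='/bin/cp',
    replace=True,
    dag=dag)

download >> aggregate >> load >> archive

Because the dependencies are declared with >>, Airflow will only archive the source file after the data has been aggregated and loaded.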

Writing a Simple DAG

from datetime import timedelta
from airflow.utils.dates import days_ago
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Default arguments applied to every task in the DAG
default_args = {
    'start_date': days_ago(1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

# The DAG itself: a name, a description, the default task arguments,
# and a daily schedule
dag = DAG('hello_world_dag',
          description='Simple Hello World DAG',
          default_args=default_args,
          schedule_interval=timedelta(days=1))

# The Python callable executed by the task below
# (named print_hello to avoid shadowing the built-in print)
def print_hello():
    return 'Hello world'

# A single task that runs the callable
task = PythonOperator(
    task_id='print_hello_world',
    python_callable=print_hello,
    dag=dag)

task  # only one task, so there are no dependencies to declare

On the Airflow UI, we can see that our DAG has been imported and is ready for execution.

The program is straightforward: the “DAG” section specifies our DAG’s creation properties, such as its name, start date, and retry parameters. Since we are using a simple Python function to print “Hello World”, a PythonOperator is used.
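As a quick sanity check before looking at the UI, the DAG file can also be loaded programmatically; this is just a sketch of one way to confirm that the DAG parses without import errors.

from airflow.models import DagBag

# Parse the DAGs folder and confirm our DAG was picked up without errors
dag_bag = DagBag()
print(dag_bag.import_errors)                 # expected: an empty dict
print(dag_bag.get_dag('hello_world_dag'))    # expected: <DAG: hello_world_dag>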

Summary

In this article, we have covered what Airflow is, its features, and its two major building blocks, the DAG and the Operator; this much knowledge is sufficient to start working with Airflow. Upcoming articles will provide more information on Airflow.