An Overview Of Azure Databricks Cluster Creation

In this article, we will cover the steps for creating Azure Databricks workspace and configuring a Spark cluster. Later, we will create a simple data table from an existing CSV file and query the data using SQL notebook.

Azure Databricks is an Apache Spark-based analytics platform optimized for Microsoft Azure. With a high-performance processing engine tuned for Azure, you can improve and scale your analytics globally, saving valuable time and money while driving new insights and innovation for your organization. Azure Databricks offers an end-to-end solution for your data and analytics workloads.

For those who are new to Apache Spark: it is a general-purpose framework for distributed computing that offers high performance for both batch and interactive processing. It exposes APIs for Java, Python, Scala, and R, and consists of Spark Core plus several related modules, such as Spark SQL, the module for working with structured data.
 
Other than Databricks, Hortonworks also provides a Spark platform on Azure, which is included in HDInsight. Cloudera is another vendor that provides a Spark platform independently.

Step 1 - Create the Azure Databricks workspace

Log in to the Azure portal.

Choose Create a resource and search for Azure Databricks in the filter box. Choose the workspace name, resource group, and location. If you don’t have a resource group, you can create a new one. Please note that, currently, Azure Databricks provides a 14-day trial of the premium tier.

 
It will take some time to create the workspace. After creation, you can go to the resource and launch the workspace. It will log you in using single sign-on.
 
 
 
Now, we are going to create a new Spark cluster. You can choose the driver type, the worker type, and the number of worker nodes. Please note that every Spark cluster must have a driver node, and you can choose as many worker nodes as your subscription and free available cores allow. Here, I chose the Standard_DS3_v2 worker type and the same driver type. One node has 4 cores, and I chose only 1 worker node for testing purposes.
 
 
It will take some time to set up the cluster. If you check the status, you can see the cluster creation status is Pending.
 
 
 
After some time, our cluster will be ready, and the status will show as Running.

 

If you check the resource list, you can see there are 17 new resources created for this cluster.
 
 
There is 1 virtual network, 2 virtual machines (1 each for worker and driver), 1 storage account (for meta information, this is a locked resource), 2 public IPs, 1 network security group, 4 network interfaces and 6 disks (3 each for worker and driver).

Step 2 - Create a table using an existing CSV file

We can go to the Data tab and click the Add Data button.
 
 
It will open a screen where we can browse for our existing CSV file on the local system. This is a very simple CSV file containing 3 columns (id, name, and age) and 5 rows of data.
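For reference, a CSV file matching this description might look like the following (the rows here are hypothetical, shown only to illustrate the id, name, and age columns):

```
id,name,age
1,John,35
2,Ann,28
3,Mary,42
4,Tom,31
5,Jane,25
```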
 
 
The uploaded file will be stored in the /FileStore/tables folder. Databricks uses its own file system, DBFS (Databricks File System).

Once our CSV is uploaded, we can choose the Create Table with UI button. You can also choose the Create Table in Notebook option. A notebook is a kind of editor where you can run your SQL, Python, R, and Scala code.
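If you take the notebook route instead, a table can be created directly with Spark SQL on top of the uploaded file. The sketch below assumes the file was uploaded as /FileStore/tables/people.csv and names the table people_csv; both names are hypothetical, so substitute your own:

```sql
-- Create a table from the uploaded CSV (path and table name are assumptions)
CREATE TABLE people_csv
USING CSV
OPTIONS (
  path "/FileStore/tables/people.csv",
  header "true",
  inferSchema "true"
);
```

The header and inferSchema options mirror the choices we make in the UI in the next step.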

 
After clicking the Create Table with UI button, a table schema editor screen will appear.

Here, you can change the table name, the column names, and even the column types. I chose the first row as header and infer schema options, since we are creating this table from an existing CSV file. Please note that the UI shows some sample data as well.

You can click the Create Table button.

Our table is created successfully, and it shows the sample data too.

Step 3 - Querying the data using SQL notebook.

We can query our data using a SQL notebook. Currently, there are 4 types of notebooks available in Azure Databricks: Python, Scala, R, and SQL. In this article, we use a SQL notebook.

Please select Workspace -> Create -> Notebook, and choose the type as SQL.
 
 
Give a valid name to your notebook and click the Create button.

 

We can execute any valid SQL statement in this notebook to query the data.
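For example, against a table built from our 3-column CSV (the table name people_csv is an assumption; substitute the name you chose when creating the table), typical queries look like this:

```sql
-- Show all rows
SELECT * FROM people_csv;

-- Filter and sort on a column
SELECT name, age
FROM people_csv
WHERE age > 30
ORDER BY age DESC;
```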

 
By default, the result is shown in a raw table format, but if needed, you can view the data visually too. There are various plot options available in the notebook. The report below shows the same data in bar chart format.
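A query that aggregates the data often plots better than raw rows. For instance, the following query (again using the hypothetical people_csv table name) produces one bar per name when rendered as a bar chart:

```sql
-- One value per name, suitable for a bar chart
SELECT name, MAX(age) AS age
FROM people_csv
GROUP BY name;
```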

 
Other plot options are shown below.


I hope you got a basic overview of Azure Databricks workspace creation, cluster configuration, table creation, and querying data using a SQL notebook. Please note that Spark is not meant for simple queries like these; the aim of a multi-node cluster is to process heavy data volumes with high performance. We will cover more features of Azure Databricks in upcoming tutorials.

Please feel free to give your valuable feedback and queries in the comments.