Querying Azure SQL Databases In Databricks Spark Cluster

Sarathlal Saseendran
1y
38.7k
0
2

Article

In this article, we will see all the steps for creating an Azure Databricks Spark Cluster and querying data from Azure SQL DB using the JDBC driver. Later we will save one table of data from SQL to a CSV file.

Step 1. Create Azure Databricks workspace

Microsoft Azure Databricks offers an intelligent, end-to-end solution for all your data and analytics challenges. Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics platform optimized for Azure. Designed in collaboration with the original founders of Apache Spark (MateiZaharia who created Apache Spark is the co-founder and Chief Technologist of Databricks), Azure Databricks combines the best of Databricks and Azure to help customers accelerate innovation with one-click setup, streamlined workflows, and an interactive workspace which enables collaboration between data scientists, data engineers, and business analysts.

I have already worked with Azure HDInsight which also contains the Spark Cluster provided by Hortonworks, but I am really impressed with the features of Databricks. It has a very powerful UI which gives users a feel-good experience. Both HDInsight & Databricks have many pros and cons that I will cover in a separate article later.

As of now, log in to the Azure Portal.

Choose "Create a new resource" select Analytics from Azure marketplace and choose Azure Databricks.

Azure marketplace

Please give a valid name to our Databricks service and then choose the resource group. If you don’t have a resource group, select the "Create New" option. You must choose a geographic location and pricing tier. Please note that currently, Databricks provides a 14-day trial period for users to verify this service. We can convert this plan to Premium later.

Databricks

After some time, your Databricks workspace will be created and we can launch the workspace now. Please note that this is the only workspace and we must create the Spark Cluster from this workspace.

Databricks workspace

Our workspace will be loaded shortly after the Single Sign-On mechanism.

Sign-On mechanism

We can click the "Clusters" button on the left menu of the dashboard and then click the "Create Cluster" button.

Clusters

We must give a valid cluster name and choose the number of worker systems we need.

Cluster name

Currently, Databricks provides various types of system configurations. I chose the default Standard_DS3_v2 and 1 worker node for testing purposes. This type of server has 4 cores and 14 GB RAM for each node. I selected the Driver type, the same as a worker. So, the driver will have the same configuration as the node system. Please note, that in a Spark cluster configuration, we have one driver system and may have multiple nodes. The driver controls all parallel operations.

Worker node

Databricks also provides an option to give some initial configurations to our cluster.

Initial configurations

I have not given any default configurations so click the "Create Cluster" button. After some time, our cluster will be ready for use.

Step 2. Create an Azure SQL Database

We are going to query the SQL database in our cluster. We can create our Azure SQL DB now.

SQL database

We create this database with the existing data source. I chose Sample (AdventureWorksLT) as the data source. We must select one database server for this database. If you don’t have a server, please create a new one.

New server

After choosing the server, you can configure the required price tier for our database.

Database

I opted for the Basic plan, so it will be minimal, however, enough for our testing purposes. After some time, our database will be created and we can use it in our Spark cluster.

Please note that Microsoft now provides a Query Editor (currently in preview mode only) for viewing and updating our data like our traditional SSMS.

Query Editor

We are going to read SalesLT.Customer table data in our Spark Cluster.

Step 3. Querying SQL data in Databricks Spark cluster

We can connect the SQL database using JDBC. We use Scala Notebook to query the database. Notebook is an editor where we can enter our Spark commands. Currently, Databricks supports Scala, Python, SQL, and Python languages in this notebook.

Please click the "Create Notebook" link in the dashboard and choose Scala as our scripting language.

Create Notebook

We can check the JDBC driver connectivity by the following command.

JDBC driver

Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver");

After entering the above command, please choose Shift + Enter to execute the command.

By default, Spark Cluster is configured with SQL Server JDBC driver. Our above command will execute successfully.

We can give the server name and database name to the below variable and execute the command.

Spark Cluster

val jdbcHostname = "{sql server host name}"
val jdbcPort = 1433
val jdbcDatabase = "{sql db name}"
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"

We can give the credentials of our database server to the below connectionProperties variable and this variable will be used later in our query execution.

ConnectionProperties

import java.util.Properties
val connectionProperties = new Properties()
connectionProperties.put("user", "{user name}")
connectionProperties.put("password", "{password}")

Check the connectivity to the SQL database using the following command.

SQL database

val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
connectionProperties.setProperty("Driver", driverClass)

Now, we are going to query the data from the SQL database. For that, we set a data frame variable customer using the below command.

Query the data

val customer = spark.read.jdbc(jdbcUrl, "SalesLT.Customer", connectionProperties)

Please note that our customer variable now holds all the data from SalesLT.Customer table.

We can list the table structure using the below command.

Table structure

customer.printSchema()

We can also list the table values using the below command.

Table values

customer.select("CustomerID", "Title", "FirstName", "LastName", "CompanyName").show()

Please note that in the above command, we supplied 5 column names explicitly and we got values for those columns only. You can use all the Spark commands to query the data.

Step 4. Saving the queried data as a CSV file

We can save the above-queried data as a CSV file easily. For that, we need to mount the storage account with DBFS (This is a custom filesystem developed by Databricks to handle the file operations). We must use the below command to mount our existing storage account in DBFS. (You must have an Azure blob storage account, otherwise please create a new one).

CSV file

dbutils.fs.mount(
  source = "wasbs://{container name}@{storage account name}.blob.core.windows.net/",
  mountPoint = "/mnt/{mount point name}",
  extraConfigs = Map("fs.azure.account.key.{storage account name}.blob.core.windows.net" -> "{storage account access key}")
)

The above command will create a DBFS mount point “mnt/Sarath”. We can use this mount point for our file system operations.

Please note that our Spark Job returned a true value. That means our command is successfully executed. If there is any error during execution, it will show the exception with details.

We can save our data frame using the below command.

Data frame

customer.coalesce(1)
       .write.format("com.databricks.spark.csv")
       .option("header", "true")
       .save("{your mount point location}.csv")

The above command will save all our data from the customer data frame to the customer.csv file. We can see the new file in Azure Storage Explorer.

Azure Storage Explorer

Please note that this CSV is saved as a multi-part file. Hence it has a folder also with the same name. This folder has many files as shown below.

Multi-part file

We can check the CSV file by getting the data from CSV using a data frame reader. There are various options available to read the data. We used the below method.

Frame reader

val customerFromCSV = spark.read.format("csv").option("header", "true").load("/mnt/sarath/customer.csv")
customerFromCSV.select("CustomerID", "Title", "FirstName", "LastName", "CompanyName").show()

In this article, we created a new Azure Databricks workspace and then configured a Spark cluster. After that, we created a new Azure SQL database and read the data from the SQL database in the Spark cluster using the JDBC driver and later, saved the data as a CSV file. We again checked the data from CSV and everything worked fine.

Hope you enjoyed this article. This is my third article on Azure Databricks. The other two are listed below.