Microsoft Azure Databricks offers an intelligent, end-to-end solution for data and analytics challenges. Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics platform optimized for Azure. It was designed in collaboration with the original creators of Apache Spark (Matei Zaharia, who created Apache Spark, is a co-founder and Chief Technologist of Databricks). Azure Databricks combines the best of Databricks and Azure to help customers accelerate innovation with one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.
I have already worked with Azure HDInsight, which also offers Spark clusters (based on the Hortonworks platform), but I am really impressed with the features of Databricks. It has a very powerful UI that gives users a pleasant experience. Both HDInsight and Databricks have many pros and cons, which I will cover in a separate article later.
To get started, log in to the Azure Portal.
Choose "Create a new resource", select the Analytics category in the Azure Marketplace, and choose Azure Databricks.
Give a valid name to the Databricks service and then choose a resource group. If you don't have a resource group, select the "Create New" option. You must also choose a geographic location and a pricing tier. Please note that Databricks currently provides a 14-day trial period for users to evaluate the service. We can convert this plan to Premium later.
After some time, your Databricks workspace will be created and we can launch it. Please note that the portal only creates the workspace; the Spark cluster itself must be created from within this workspace.
Our workspace will load shortly after a single sign-on (SSO) authentication step.
Click the "Clusters" button in the left menu of the dashboard, then click the "Create Cluster" button.
We must give the cluster a valid name and choose the number of worker nodes we need.
Currently, Databricks provides various types of system configurations. I chose the default Standard_DS3_v2 with 1 worker node for testing purposes. This VM type has 4 cores and 14 GB of RAM per node. I set the driver type to be the same as the worker type, so the driver has the same configuration as the worker nodes. Please note that in a Spark cluster configuration we have one driver system and one or more worker nodes; the driver coordinates all parallel operations.
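To make the sizing above concrete, here is a minimal sketch (plain Python, not tied to any Databricks API) that models the total resources of this topology: Standard_DS3_v2 nodes with 4 cores and 14 GB RAM each, one driver plus the chosen number of workers, with the driver sized the same as a worker.

```python
# Resource figures for Standard_DS3_v2 as described above.
CORES_PER_NODE = 4
RAM_GB_PER_NODE = 14

def total_cluster_resources(num_workers: int) -> dict:
    """Total cores/RAM for `num_workers` workers plus one driver
    of the same node type."""
    nodes = num_workers + 1  # workers + one driver
    return {
        "nodes": nodes,
        "cores": nodes * CORES_PER_NODE,
        "ram_gb": nodes * RAM_GB_PER_NODE,
    }

# 1 worker + 1 driver -> 2 nodes, 8 cores, 28 GB RAM in total
print(total_cluster_resources(1))
```

So even this smallest test cluster actually provisions two VMs, which is worth keeping in mind for cost estimates.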
Databricks also provides an option to supply initial Spark configuration settings for the cluster.
I did not supply any custom configuration, so simply click the "Create Cluster" button. After some time, our cluster will be ready for use.
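The same cluster can also be created programmatically through the Databricks REST API (Clusters API, `POST /api/2.0/clusters/create`) instead of the UI. The sketch below builds a payload matching the choices made above; the workspace URL and token are placeholders, and the `spark_version` string is only an example runtime label, so check your own workspace for the values actually available.

```python
import json
import urllib.request

# Placeholders -- substitute your own workspace URL and personal access token.
WORKSPACE_URL = "https://<your-region>.azuredatabricks.net"
TOKEN = "<personal-access-token>"

def build_cluster_spec(name: str, num_workers: int) -> dict:
    """Build a cluster-create payload mirroring the UI choices above."""
    return {
        "cluster_name": name,
        "spark_version": "4.0.x-scala2.11",        # example runtime label
        "node_type_id": "Standard_DS3_v2",         # 4 cores / 14 GB per node
        "driver_node_type_id": "Standard_DS3_v2",  # driver same as worker
        "num_workers": num_workers,
        "spark_conf": {},  # optional initial Spark configuration (left empty)
    }

def create_cluster(spec: dict) -> None:
    """POST the spec to the Clusters API (requires network access and a token)."""
    req = urllib.request.Request(
        WORKSPACE_URL + "/api/2.0/clusters/create",
        data=json.dumps(spec).encode(),
        headers={
            "Authorization": "Bearer " + TOKEN,
            "Content-Type": "application/json",
        },
    )
    urllib.request.urlopen(req)

if __name__ == "__main__":
    # Only build and print the payload here; calling create_cluster()
    # would provision real (billable) VMs in your workspace.
    print(json.dumps(build_cluster_spec("test-cluster", num_workers=1), indent=2))
```

This is handy once you move past manual experimentation and want cluster creation to be repeatable or scripted.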