Creating An HDInsight Hadoop Cluster On Linux

This article demonstrates how to create and deploy an Apache Hadoop cluster on Azure HDInsight using the Azure portal. Its purpose is to walk through the process of creating and running Hadoop clusters managed by HDInsight on Linux virtual machines. Once the cluster is running, most of the operations you perform on it are the same as those you would perform on a Hadoop cluster running on physical hardware.

Prerequisites
  • An Azure Subscription.
  • PuTTY

Overview of Hadoop

Apache Hadoop is one of the most popular and influential tools for analyzing big data. It is a framework that enables the distributed processing of large datasets across clusters of computers using simple programming models. It is now often combined with other open source frameworks, such as Apache Spark, Apache HBase, and Apache Storm, to increase capacity and performance.

Azure HDInsight is the Azure implementation of Hadoop, Spark, HBase, and Storm, together with tools such as Apache Pig and Apache Hive, and provides comprehensive, high-performance analytics. Hadoop clusters in HDInsight run on either Windows or Linux and integrate with popular business-intelligence tools such as Excel and SQL Server Analysis Services.

Follow the steps below to create an HDInsight cluster in the Azure portal.

Step 1

Sign in to the Azure portal.

Step 2

Click "+New" in the Azure portal. Then type "HDInsight" in the search bar and select it from the results.



Step 3

The HDInsight description page opens. Read the details and click "Create".

 
Step 4

In the "Cluster Name" box, enter a unique DNS name for the cluster and choose the subscription to use. Enter the cluster login name and the login password, then create a new resource group for the cluster. Finally, click "Cluster type" to open the "Cluster configuration" blade. (Note: Remember the password you enter here; you will need it later.)
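
The cluster name has to be a valid DNS name because it becomes part of the cluster's public address. The sketch below checks a candidate name before you type it into the portal. The naming rules it encodes (at most 59 characters, only letters, digits, and hyphens, no hyphen at either end) are an assumption to verify against the current Azure documentation, and the function names and the sample name "my-hadoop-01" are hypothetical. The check is purely local, so it cannot tell whether the name is already taken.

```python
import re

# Assumed naming rules (verify against the current Azure documentation):
# 1 to 59 characters, letters, digits and hyphens only, and no hyphen
# as the first or last character.
CLUSTER_NAME_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,57}[A-Za-z0-9])?")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if `name` satisfies the assumed naming rules."""
    return CLUSTER_NAME_RE.fullmatch(name) is not None


def cluster_url(name: str) -> str:
    """Build the HTTPS address that is derived from the cluster name."""
    if not is_valid_cluster_name(name):
        raise ValueError(f"invalid cluster name: {name!r}")
    return f"https://{name}.azurehdinsight.net"


if __name__ == "__main__":
    # "my-hadoop-01" is a made-up example name.
    print(cluster_url("my-hadoop-01"))
```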

 
Step 5

In the "Cluster configuration" blade, set the cluster type to "Hadoop" and the operating system to Linux, and choose the version you want. Then set the cluster tier to Standard and click "Select".

 
Step 6

In the "Storage" tab, set the primary storage type to Azure Storage and the selection method to "My subscriptions". Enter a unique storage account name, leave the other settings at their defaults, and click "Next".
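
Azure storage account names are more restrictive than cluster names: they must be 3 to 24 characters long and may contain only lowercase letters and digits. A minimal sketch of a local check is shown below; the function name and the sample names are hypothetical, and the check does not tell you whether the name is already in use by another account.

```python
import re


def storage_name_problems(name: str) -> list[str]:
    """Return a list of rule violations; an empty list means the name looks fine."""
    problems = []
    if not 3 <= len(name) <= 24:
        problems.append("length must be between 3 and 24 characters")
    if re.fullmatch(r"[a-z0-9]*", name) is None:
        problems.append("only lowercase letters and digits are allowed")
    return problems


if __name__ == "__main__":
    # Both names are made-up examples.
    for candidate in ("hadoopstore01", "My-Store"):
        issues = storage_name_problems(candidate)
        print(candidate, "->", issues if issues else "ok")
```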



Step 7

In this blade, click "Edit" next to the applications entry.

 

Step 8

In the "Applications" blade, select "StreamSets Data Collector for Hi" and then click "Legal terms" for StreamSets.

 

Then click "Create" on the "Applications" blade to save the settings.

 

Step 9

The "Cluster Size" blade opens. Accept the default configuration, which creates a cluster with two head nodes and four worker nodes, and click "Next".

 
Step 10

On the "Advanced Settings" blade, leave the default configuration and click "Next".


Step 11

Review your cluster settings and then click "Create" to start the creation of the cluster.
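
As a review aid, the choices made in Steps 4 to 10 can be written down as a small record and printed before you click "Create". This is only a sketch for checking your own notes against the portal's summary; it does not call any Azure API. The class, its field names, and the sample values passed in at the bottom are hypothetical, while the defaults mirror the settings chosen in this walkthrough.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ClusterSettings:
    # Values you choose yourself in Steps 4 and 6.
    cluster_name: str
    login_name: str
    resource_group: str
    storage_account: str
    # Defaults taken from the choices made in this walkthrough.
    cluster_type: str = "Hadoop"
    operating_system: str = "Linux"
    cluster_tier: str = "Standard"
    primary_storage: str = "Azure Storage"
    head_nodes: int = 2
    worker_nodes: int = 4

    @property
    def total_nodes(self) -> int:
        """Head nodes plus worker nodes."""
        return self.head_nodes + self.worker_nodes


def review(settings: ClusterSettings) -> str:
    """Format the settings as one 'name: value' line per field."""
    lines = [f"{key}: {value}" for key, value in asdict(settings).items()]
    lines.append(f"total_nodes: {settings.total_nodes}")
    return "\n".join(lines)


# All four values below are made-up placeholders.
settings = ClusterSettings(
    cluster_name="my-hadoop-01",
    login_name="admin",
    resource_group="hadoop-rg",
    storage_account="hadoopstore01",
)

if __name__ == "__main__":
    print(review(settings))
```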



Step 12

After the deployment succeeds, open your HDInsight cluster and click "Secure Shell (SSH)".

 

Step 13

Copy the SSH address shown for connecting to the Hadoop cluster.

 

Step 14

Download and install PuTTY, which is used to open an SSH connection to the cluster. Run PuTTY and paste the SSH address into the "Host Name" field, then click "Open" to establish the connection. When the PuTTY terminal opens, enter the password; you can then work on the cluster machine.
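
The value pasted into PuTTY's "Host Name" field has the form user@host. For Linux-based HDInsight clusters the SSH host is the cluster name followed by "-ssh.azurehdinsight.net". The sketch below builds that value and also splits a copied string into its user and host parts. The function names and the cluster name "my-hadoop-01" are hypothetical, and the user name "sshuser" is assumed to be the SSH user chosen when the cluster was created; substitute your own.

```python
def ssh_target(cluster_name: str, ssh_user: str = "sshuser") -> str:
    """Build the user@host value for PuTTY's "Host Name" field."""
    return f"{ssh_user}@{cluster_name}-ssh.azurehdinsight.net"


def split_ssh_string(text: str) -> tuple[str, str]:
    """Split a copied string such as 'ssh user@host' into (user, host)."""
    text = text.strip()
    if text.startswith("ssh "):
        # Drop the leading command name if the whole command was copied.
        text = text[len("ssh "):].strip()
    user, separator, host = text.partition("@")
    if not separator or not user or not host:
        raise ValueError(f"expected user@host, got: {text!r}")
    return user, host


if __name__ == "__main__":
    # "my-hadoop-01" is a made-up cluster name.
    target = ssh_target("my-hadoop-01")
    print(target)
    print(split_ssh_string("ssh " + target))
```

If the copied string starts with "ssh ", paste only the user@host part into PuTTY.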

 

Summary

I hope you now understand how to create an HDInsight Hadoop cluster on Linux in the Azure portal.