Running Hadoop on Linux using Azure HDInsight

Prerequisites

  1. An Azure subscription: See Get Azure free trial.
  2. PuTTY SSH client

For an in-depth introduction to Hadoop and Hive and their application using Azure HDInsight, read the following wikis:

  1. Big Data Analytics using Microsoft Azure: Introduction
  2. Big Data Analytics using Microsoft Azure: Hive
  3. Analyze Twitter data with Hive in Azure HDInsight

Introduction

Azure HDInsight deploys and provisions Apache Hadoop clusters in the cloud, providing a software framework designed to manage, analyze, and report on big data.

With the September 2015 release of HDInsight, customers can configure these clusters to run on either a Windows Server operating system or an Ubuntu-based Linux operating system.

HDInsight on Linux gives Hadoop ecosystem users broader support for running their workloads in HDInsight, and a greater choice of preferred tools and applications.

Both Linux and Windows clusters in HDInsight are built on the same standard Hadoop distribution and offer the same set of rich capabilities.

Creating a Linux cluster in HDInsight

  1. To create a new Linux cluster, from the new portal, click on Data+Analytics > HDInsight.



  2. Click on create new cluster

At this step you can choose between the Linux and Windows operating systems.

This demo uses Ubuntu.



After about 30 minutes, your cluster will be up and running.



Connecting to the cluster via an SSH Client

In this example, PuTTY is used to connect to the Hadoop cluster over SSH.

The first step to connect to the cluster is to get the Host Name and the login credentials.

To find the Host Name, click on Secure Shell in the Azure Portal.



This page lists the Host Name to use when connecting from a Windows or a Linux client.



The second step is to open PuTTY, enter the Host Name, and connect.



You will then be prompted for the credentials that were defined when the cluster was created, after which you are ready to go.

With a Linux cluster and SSH, all the commands you are used to when running Hadoop on premises are available, which makes the transition to the cloud transparent.
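For readers connecting from a Linux or macOS client rather than PuTTY, the connection step can be sketched as follows. This is a minimal sketch that assumes the usual HDInsight convention of an SSH endpoint named `<clustername>-ssh.azurehdinsight.net`; the cluster name `mycluster` and the user `sshuser` are hypothetical placeholders, and the Host Name shown in the portal always takes precedence.

```python
def ssh_host(cluster_name: str) -> str:
    """Return the SSH host name for a cluster (assumed naming convention)."""
    return f"{cluster_name}-ssh.azurehdinsight.net"


def ssh_command(cluster_name: str, user: str, port: int = 22) -> list[str]:
    """Build the argument list for the OpenSSH command-line client."""
    return ["ssh", "-p", str(port), f"{user}@{ssh_host(cluster_name)}"]


if __name__ == "__main__":
    # "mycluster" and "sshuser" are placeholders, not values from this article.
    print(" ".join(ssh_command("mycluster", "sshuser")))
```

The printed command can be pasted into a terminal; the password prompt that follows expects the credentials defined when the cluster was created.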



Example: Running Hive Queries via PuTTY on a Linux Hadoop Cluster on Azure HDInsight

The following example demonstrates how IIS logs can be analyzed using Hive Queries on Hadoop.

  1. Upload the file to Azure blob storage



  2. Create internal table rawlog

    This is a staging table used to cleanse the data before it is loaded into the cleanlog table at a later stage.



  3. Create table cleanlog. This is the table where the cleansed data will be stored and queried.



  4. View all tables



  5. Load data from the file into the staging table rawlog



  6. Move data from the staging table to the data table



  7. Run Hive queries, which generate MapReduce jobs, to analyze the data
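The Hive statements for these steps are not reproduced in this text, so the following is only a local Python sketch of the same staging pattern, not the Hive queries themselves. It assumes the IIS logs are in W3C extended format, where header lines start with `#` and fields are separated by spaces; the sample lines, the column selection, and the function names are made up for illustration.

```python
from collections import Counter

# Hypothetical sample in IIS W3C extended log format (not data from this article).
RAW_LOG = """\
#Software: Microsoft Internet Information Services 8.0
#Fields: date time cs-method cs-uri-stem sc-status
2015-09-01 10:00:01 GET /index.html 200
2015-09-01 10:00:02 GET /missing.html 404
2015-09-01 10:00:03 POST /api/order 200
"""

FIELDS = ["date", "time", "cs-method", "cs-uri-stem", "sc-status"]


def load_raw(text: str) -> list[str]:
    """Stage every non-empty line as-is, like loading the file into rawlog."""
    return [line for line in text.splitlines() if line.strip()]


def cleanse(raw_rows: list[str]) -> list[dict[str, str]]:
    """Drop '#' header lines and split fields, like moving rawlog into cleanlog."""
    clean = []
    for line in raw_rows:
        if line.startswith("#"):
            continue  # header lines carry metadata, not requests
        values = line.split(" ")
        if len(values) == len(FIELDS):
            clean.append(dict(zip(FIELDS, values)))
    return clean


def count_by_status(clean_rows: list[dict[str, str]]) -> Counter:
    """Count requests per HTTP status, similar to a GROUP BY query."""
    return Counter(row["sc-status"] for row in clean_rows)


if __name__ == "__main__":
    rawlog = load_raw(RAW_LOG)
    cleanlog = cleanse(rawlog)
    print(count_by_status(cleanlog))
```

The split into a raw stage and a clean stage mirrors the rawlog and cleanlog tables: the raw stage accepts the file unchanged, and filtering happens only when rows are moved to the table that is queried.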


Apache Ambari

The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters.

Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs. Ambari is now included on Linux-based HDInsight clusters, and is used to monitor the cluster and make configuration changes.
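Because the web UI is backed by a REST API, cluster status can also be read programmatically. The following minimal sketch only prepares the request and does not send it. It assumes the commonly documented HDInsight endpoint `https://<clustername>.azurehdinsight.net/api/v1/clusters/<clustername>` with HTTP basic authentication; the cluster name, user, and password are placeholders.

```python
import base64
import urllib.request


def ambari_cluster_url(cluster_name: str) -> str:
    """Return the assumed Ambari REST URL for the cluster resource."""
    return f"https://{cluster_name}.azurehdinsight.net/api/v1/clusters/{cluster_name}"


def build_request(cluster_name: str, user: str, password: str) -> urllib.request.Request:
    """Prepare a GET request with an HTTP basic authentication header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(ambari_cluster_url(cluster_name))
    request.add_header("Authorization", f"Basic {token}")
    return request


if __name__ == "__main__":
    # Placeholder values; replace with the cluster login defined at creation time.
    req = build_request("mycluster", "admin", "placeholder-password")
    print(req.full_url)
    # To query a live cluster (requires real credentials and network access):
    # with urllib.request.urlopen(req) as response:
    #     print(response.read().decode("utf-8"))
```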

To access Ambari, from your HDInsight Cluster page in the preview portal, click on Dashboard.



From this page, you can view the dashboard and the status of your HDInsight cluster. There are also links to other features such as Services, Hosts, Alerts, and Admin.

More details of the features available in Ambari are described in the Microsoft Azure documentation.

References