Understanding HDInsight In Microsoft Azure

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale out from a single server to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so it delivers a highly available service on top of a cluster of computers, each of which may be prone to failure.

The project includes these modules, listed here together with related Apache projects from the Hadoop ecosystem:

  • Hadoop Common: The common utilities that support the other Hadoop modules.

  • Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.

  • Hadoop YARN: A framework for job scheduling and cluster resource management.

  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

  • HBase: A scalable, distributed database that supports structured data storage for large tables.

  • Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.

  • Mahout: A scalable machine learning and data mining library.

  • Pig: A high-level data-flow language and execution framework for parallel computation.

  • Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
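To make the MapReduce entry in the list above more concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps in plain Python. It does not use Hadoop at all; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are illustrative choices, and on a real cluster the map and reduce steps would run in parallel on different nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    # On a cluster, each line (or block of lines) would be mapped on a
    # different node; here everything runs sequentially in one process.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    sample = ["big data on Azure", "big clusters process big data"]
    print(word_count(sample))
```

The point of the split is that the map and reduce functions only ever see one record or one key at a time, which is what lets a framework distribute them across many machines.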

In the Microsoft Azure portal, HDInsight is the name of the cloud-based Hadoop service; it is Microsoft's managed Big Data stack in the cloud. With Azure you can provision clusters running Storm, HBase, and Hive, which can process thousands of events per second, store petabytes of data, and give you a SQL-like interface for querying it.

The three V's of data: Volume, Variety, Velocity

The IT industry is full of data: the world's population is currently 7.2 billion, and there are 15 billion devices. Estimates suggest that by 2020 the number of devices will double to 30 billion, and each device will produce data. Suppose you have a large unstructured data set and want to run a Hive query on it to extract meaningful information. Azure HDInsight uses Azure Storage to store data: when you create an HDInsight cluster you specify a storage account, and a blob container in that account serves as the cluster's default file system in place of local HDFS. Hadoop offers a distributed platform to store and manage big data. With Hive, you can query and manage large amounts of unstructured data using a SQL-like query language. You can then export the output and import it into Microsoft Excel or any other BI tool.
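The query-then-export workflow described above can be tried out locally before a cluster exists. The sketch below is only a stand-in: it uses Python's built-in `sqlite3` module rather than Hive, and the table name `readings`, the sample rows, and the helper names are all hypothetical. The `GROUP BY` query has the same shape as one you might write in HiveQL, and the result is written as CSV, which Excel and most BI tools can import.

```python
import csv
import io
import sqlite3

# Hypothetical sample records standing in for parsed device logs.
ROWS = [
    ("sensor-a", 21.5),
    ("sensor-b", 19.0),
    ("sensor-a", 22.5),
]


def average_by_device(rows):
    """Run a SQL aggregation similar in shape to a HiveQL GROUP BY query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE readings (device TEXT, temperature REAL)")
        conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT device, AVG(temperature) FROM readings "
            "GROUP BY device ORDER BY device"
        )
        return cursor.fetchall()
    finally:
        conn.close()


def to_csv(result):
    """Serialize the query result as CSV text for Excel or a BI tool."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["device", "avg_temperature"])
    writer.writerows(result)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(average_by_device(ROWS)))
```

On an actual HDInsight cluster the data would live in the blob container and Hive would distribute the query across nodes; only the SQL-like style of the query carries over from this local sketch.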

Apache Spark cluster on HDInsight

First create a storage account in Azure, then create a Spark cluster; once it is running, you can run Spark SQL statements from notebooks. Jupyter notebooks are a popular way to write Spark SQL queries. By default, Jupyter notebooks come with a Python 2 kernel. HDInsight Spark clusters provide two additional kernels that you can use with the Jupyter notebook. These are:

  • PySpark (for applications written in Python)
  • Spark (for applications written in Scala)

Key benefits of using the PySpark kernel are:

  • You do not need to set the contexts for Spark, SQL, and Hive. These are set for you automatically.
  • You can use cell magics (such as %%sql or %%hive) to run your SQL or Hive queries directly, without any preceding code snippets.
  • The output of SQL or Hive queries is visualized automatically.

HBase is a distributed data store modeled on Google's Bigtable, designed to provide quick random access to huge amounts of structured data.

Apache Storm is a scalable, fault-tolerant, distributed, real-time computation system for processing streams of data. With Storm on Azure HDInsight, you can create a cloud-based Storm cluster that performs big data analytics in real time.

To create an HDInsight cluster, you should first create a storage account in the Azure portal.
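The storage account comes first because the cluster addresses its data through it. Files in a blob container are referenced with a `wasb://` URI (or `wasbs://` for encrypted access) of the form `wasb://<container>@<account>.blob.core.windows.net/<path>`. The small helper below sketches how such a URI is put together; the function name `wasb_uri` and the account, container, and path values in the example are hypothetical.

```python
def wasb_uri(account, container, path="", secure=False):
    """Build the URI used to address a blob from an HDInsight cluster.

    account   -- storage account name (hypothetical in the example below)
    container -- blob container that backs the cluster's file system
    path      -- path of the file or folder inside the container
    secure    -- use the TLS-protected wasbs scheme instead of wasb
    """
    if not account or not container:
        raise ValueError("account and container are both required")
    scheme = "wasbs" if secure else "wasb"
    return "{}://{}@{}.blob.core.windows.net/{}".format(
        scheme, container, account, path.lstrip("/")
    )


if __name__ == "__main__":
    print(wasb_uri("mystorage", "mycluster", "example/data/sample.log"))
```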

Create an HDInsight cluster

After the storage account has been created successfully, click the New button, choose Data Services, and then HDInsight. You will find a list of options: Hadoop, HBase, Storm, Spark, Linux, and Custom Create. Click Hadoop, then enter a unique cluster name (it becomes the prefix of *.azurehdinsight.net), and select a cluster size of 1, 2, or 4 nodes. The HTTP user name is fixed as admin; you need to enter a password and then select the storage account name.
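The choices made in the steps above can be summarized in a small data structure. This is not an Azure API, only a hypothetical `ClusterRequest` class that records the settings mentioned in the text (cluster name, node count, storage account, fixed `admin` user) and derives the cluster address from the name. The allowed node counts are taken from the sizes listed above and may differ in other versions of the portal.

```python
from dataclasses import dataclass

# Cluster sizes offered in the quick-create dialog described in the text.
ALLOWED_NODE_COUNTS = (1, 2, 4)


@dataclass(frozen=True)
class ClusterRequest:
    """Hypothetical record of the quick-create settings (not an Azure API)."""

    name: str
    nodes: int
    storage_account: str
    http_user: str = "admin"  # fixed by the portal in quick-create mode

    def __post_init__(self):
        if not self.name:
            raise ValueError("cluster name must not be empty")
        if self.nodes not in ALLOWED_NODE_COUNTS:
            raise ValueError(
                "nodes must be one of {}".format(ALLOWED_NODE_COUNTS)
            )

    @property
    def endpoint(self):
        """Address of the cluster, derived from its unique name."""
        return "https://{}.azurehdinsight.net".format(self.name)


if __name__ == "__main__":
    request = ClusterRequest(name="demo", nodes=2, storage_account="mystorage")
    print(request.endpoint)
```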
