Big Data Analytics Using Hive on Microsoft Azure

Apache Hive

Hadoop is an open-source implementation of MapReduce that is widely used to store and process large amounts of data in a distributed fashion.
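To make the programming model concrete, here is a minimal Python sketch of the map, shuffle, and reduce phases using the canonical word-count example. It only simulates the model in-process; it is not Hadoop code, and the input lines are invented for illustration:

```python
from itertools import groupby
from operator import itemgetter

# Map phase: emit (key, value) pairs -- here, (word, 1) for each word.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Shuffle: group the pairs by key; Reduce: combine each group's values.
def reduce_phase(pairs):
    pairs = sorted(pairs, key=itemgetter(0))
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

counts = reduce_phase(map_phase(["big data", "big hadoop"]))
# counts == {"big": 2, "data": 1, "hadoop": 1}
```

In Hadoop the map and reduce phases run in parallel across many nodes, and the shuffle moves data between them; the control flow, however, is the same.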

Apache Hive is a data warehousing solution built on top of Hadoop. It is powered by HiveQL, a declarative, SQL-like language that is compiled into MapReduce jobs, which are then executed on the underlying Hadoop cluster. Apache Hive also allows users to extend HiveQL and write queries that incorporate custom MapReduce code.

Schema on Read

Hive uses a metadata service that projects tabular schemas over HDFS folders.

This enables the contents of folders to be queried as tables, using SQL-like query semantics that are then converted into MapReduce jobs.

That is, the tables are just metadata describing the format in which the data will be consumed. Only when a query is executed is the data projected onto the schema and processed using MapReduce.
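The schema-on-read idea can be illustrated with a small Python sketch: the raw line stays untyped on disk, and the schema is applied only at query time. The field names and types below are invented for illustration:

```python
# Schema on read: the "table" is only this metadata; nothing is validated
# or converted until a record is actually read.
schema = [("log_date", str), ("status", int), ("time_taken", int)]

def project(raw_line):
    """Apply the tabular schema to one raw, space-delimited record."""
    fields = raw_line.split()
    return {name: cast(value) for (name, cast), value in zip(schema, fields)}

row = project("2013-06-01 200 45")
# row == {"log_date": "2013-06-01", "status": 200, "time_taken": 45}
```

Contrast this with a traditional database, which validates and stores data in the table's format at load time (schema on write).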

Creating Hive Tables

Use the CREATE TABLE HiveQL statement to create tables that project schemas onto the files to be used.

It defines a schema metadata to be projected onto data in a folder when the table is queried (not when it is created).

The following two types of tables can be created:

  1. Internal tables manage the lifetime of the underlying folders. If an internal table is dropped, the folder and all the files it contains are deleted as well.

    a. No location is specified when the table is created; the contents are then stored at the default location, /hive/warehouse/table1.
    b. Table stored at a custom location during its creation. Note that the table is still internal, so the folder is deleted when the table is dropped.
  2. External tables are managed independently of the folders.

    If the table is dropped, the table metadata is deleted but all the data remains in the Azure blob store.
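As a rough illustration of the difference, the following Python toy model mimics the two drop behaviors; the paths and table names are invented for the example:

```python
# Toy model of DROP TABLE semantics: an internal (managed) table owns its
# folder, an external table does not.
storage = {"/hive/warehouse/t1": ["row1"], "/data/ext": ["row1"]}
metastore = {"t1": ("/hive/warehouse/t1", "internal"),
             "t2": ("/data/ext", "external")}

def drop_table(name):
    path, kind = metastore.pop(name)   # table metadata is always removed
    if kind == "internal":
        storage.pop(path)              # data is deleted only for internal tables

drop_table("t1")
drop_table("t2")
# storage == {"/data/ext": ["row1"]} -- the external table's data survives
```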

Loading Data into Hive Tables

  1. Use the LOAD statement to move or copy files to the appropriate folder.
    LOAD DATA LOCAL INPATH '/data/source' INTO TABLE MyTable;
  2. Use the INSERT statement to insert data from one table to another.
    FROM StagingTable
    INSERT INTO TABLE MyTable
    SELECT Col1, Col2;

Querying Hive Tables

Query data using the SELECT statement.

Hive translates the query into MapReduce jobs and applies the table schema to the underlying data files.

    SELECT Col1, SUM(Col2) AS TotalCol2
    FROM MyTable
    WHERE Col1 >= '2013-06-01' AND Col1 <= '2013-06-30'
    GROUP BY Col1
    ORDER BY Col1;

Querying Hive from the Command Line

To query Hive from the command line, you first need to remote into the Azure HDInsight cluster. Follow this article for the procedure to establish the remote connection.

In the following example, two tables will be created: rawlog and cleanlog. rawlog will be a staging table into which data from a file will be loaded. cleanlog will contain the data from rawlog after it has been cleansed.

The Hive Variable

The SET command is used to view the %hive% environment variable on the cluster.

It is simply a shortcut for launching Hive without having to type the full path to the executable each time.

Open the Hive Environment

Hive queries can only be executed inside the Hive environment.
To open the Hive environment, launch the Hive command-line interface from the Hadoop command prompt (the %hive% variable described above makes this easier).

Once the Hive environment has been opened, the file to be processed needs to be uploaded.

The following is the format of the log file used in this example: an IIS log file, shown here opened in Notepad.

Create the Internal table

In this example, the internal table acts as a staging table used to cleanse the data before loading it into the cleanlog table.
    CREATE TABLE rawlog(
        log_date string, log_time string, c_ip string,
        cs_username string, s_ip string, s_port string,
        cs_method string, cs_uri_stem string,
        cs_uri_query string, sc_status string,
        sc_byte string, cs_byte string, time_taken int,
        user_agent string, referrer string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';

Since no file location is specified, the table's data will be stored at the default location, /hive/warehouse/rawlog.

Create the External table

This is the table where the cleansed data will be stored and queried.
    CREATE EXTERNAL TABLE cleanlog(
        log_date string, log_time string, c_ip string,
        cs_username string, s_ip string, s_port string,
        cs_method string, cs_uri_stem string,
        cs_uri_query string, sc_status string,
        sc_byte string, cs_byte string, time_taken int,
        user_agent string, referrer string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
    STORED AS TEXTFILE LOCATION '/data/cleanlog';

View the Hive Tables

To view the Hive tables, the SHOW TABLES command can be used.

Load data into the staging table

To load the raw data from the folder into the staging table, the LOAD DATA command should be used.
    LOAD DATA INPATH '/SourceData/' INTO TABLE rawlog;

After the loading completes, you may test the schema by doing a select from the table rawlog.
    SELECT * FROM rawlog;

Move data from the staging table to the data table

The following command loads data from the staging table into the data table, excluding rows that begin with #.
    FROM rawlog
    INSERT INTO TABLE cleanlog
    SELECT *
    WHERE SUBSTR(log_date, 1, 1) <> '#';
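The cleansing rule above can be sketched in Python (with invented sample rows) to show what the SUBSTR filter does: IIS log header lines begin with #, and only the data rows are kept:

```python
# Sample rows in IIS log format; lines starting with '#' are header lines.
raw_log = ["#Fields: date time c-ip",
           "2013-06-01 00:01:00 10.0.0.1",
           "#Version: 1.0",
           "2013-06-01 00:02:00 10.0.0.2"]

# Equivalent of SUBSTR(log_date, 1, 1) <> '#': keep rows whose first
# character is not '#'.
clean_log = [line for line in raw_log if line[:1] != "#"]
# clean_log contains only the two data rows
```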

Generate MapReduce Jobs and Analyze the Data

Now that the data is cleansed, it can be queried using HiveQL; this is where the MapReduce code is generated and applied. In the following example, the number of page hits per IP address per date is calculated.
    SELECT log_date AS date, c_ip, COUNT(*) AS page_hits
    FROM cleanlog
    GROUP BY log_date, c_ip
    ORDER BY log_date;
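Conceptually, this aggregation is a group-by count over (log_date, c_ip) pairs, which can be sketched in Python with invented sample data:

```python
from collections import Counter

# Each record is a (log_date, c_ip) pair extracted from cleanlog.
clean_log = [("2013-06-01", "10.0.0.1"),
             ("2013-06-01", "10.0.0.1"),
             ("2013-06-01", "10.0.0.2")]

# GROUP BY log_date, c_ip with COUNT(*): count occurrences of each pair.
page_hits = Counter(clean_log)

# ORDER BY log_date: sort the aggregated rows.
result = sorted((d, ip, n) for (d, ip), n in page_hits.items())
# result == [("2013-06-01", "10.0.0.1", 2), ("2013-06-01", "10.0.0.2", 1)]
```

In Hive, the counting runs as the map/reduce phases of a generated job rather than in memory, but the result is the same.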

Querying Hive from PowerShell

Azure PowerShell provides cmdlets that allow you to remotely run Hive queries on HDInsight. Internally, this is accomplished using REST calls to WebHCat (formerly called Templeton) running on the HDInsight cluster.
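As a sketch of what happens under the hood, the following Python helper builds the kind of request such a cmdlet would send. The templeton/v1/hive path follows WebHCat's documented REST API; the cluster name and user are placeholders, and no request is actually sent:

```python
# Build (but do not send) a WebHCat Hive job submission request.
# WebHCat expects a POST with an 'execute' form field holding the HiveQL.
def build_webhcat_request(cluster, user, hiveql):
    # Assumed HDInsight endpoint shape: https://<cluster>.azurehdinsight.net
    url = f"https://{cluster}.azurehdinsight.net/templeton/v1/hive"
    payload = {"execute": hiveql, "user.name": user}
    return url, payload

url, payload = build_webhcat_request("chervinehadoop", "admin",
                                     "SELECT * FROM webactivity;")
# url == "https://chervinehadoop.azurehdinsight.net/templeton/v1/hive"
```

An actual submission would POST this payload over HTTPS with the cluster's admin credentials, then poll the returned job ID for status, which is exactly what Start-AzureHDInsightJob and Wait-AzureHDInsightJob automate.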

PowerShell cmdlets for Hive

The following cmdlets are used when running Hive queries in a remote HDInsight cluster:
  1. Add-AzureAccount

    Authenticates Azure PowerShell to your Azure subscription.

  2. New-AzureHDInsightHiveJobDefinition

    Creates a new job definition using the specified HiveQL statements.

  3. Start-AzureHDInsightJob

    Sends the job definition to HDInsight, starts the job and returns a job object that can be used to check the status of the job.

  4. Wait-AzureHDInsightJob

    Uses the job object to check the status of the job. It will wait until the job completes or the wait time is exceeded.

  5. Get-AzureHDInsightJobOutput

    Used to retrieve the output of the job.

  6. Invoke-Hive

    Used to run HiveQL statements. This call blocks until the query completes and then returns the results.

  7. Use-AzureHDInsightCluster

    Sets the current cluster to use for the Invoke-Hive command.

Executing HiveQL Queries

In the following example, a new table, webactivity, will be created containing aggregated data from the cleanlog table (created in the previous example).

The HiveQL to be used is as follows:

  1. Drop the table webactivity if it exists as in the following:
    DROP TABLE webactivity;
  2. Create an external table webactivity as in the following:
    CREATE EXTERNAL TABLE webactivity(
        log_date STRING, page_hits INT, bytes_recvd INT,
        bytes_sent INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
    STORED AS TEXTFILE LOCATION '/data/webactivity';
  3. Aggregate data from the table cleanlog and insert into the table webactivity as in the following:
    FROM cleanlog
    INSERT INTO TABLE webactivity
    SELECT log_date, COUNT(*), SUM(sc_byte), SUM(cs_byte)
    GROUP BY log_date
    ORDER BY log_date;

Using AzureHDInsightHiveJobDefinition Cmdlet

The following are the PowerShell scripts to execute the queries explained above.

  1. Specify the cluster on which the queries will run.

    $clusterName = "chervinehadoop"

  2. The HiveQL Queries
    $hiveQL = "DROP TABLE webactivity;CREATE EXTERNAL TABLE webactivity(log_date STRING, page_hits INT, bytes_recvd INT, bytes_sent INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE LOCATION '/data/webactivity'; FROM cleanlog INSERT INTO TABLE webactivity SELECT log_date, COUNT(*), SUM(sc_byte), SUM(cs_byte) GROUP BY log_date ORDER BY log_date;"
  3. Define the hive job for the Azure HDInsight service.

    $jobDef = New-AzureHDInsightHiveJobDefinition -Query $hiveQL

  4. Start the Azure HDInsight job in the cluster chervinehadoop.

    $hiveJob = Start-AzureHDInsightJob –Cluster $clusterName –JobDefinition $jobDef

  5. Wait for the HDInsight job to complete or fail, and show its progress.

    Wait-AzureHDInsightJob -Job $hiveJob -WaitTimeoutInSeconds 3600

  6. Retrieve the log output for a job as in the following:

    Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $hiveJob.JobId -StandardError

Using the Invoke-Hive Cmdlet

Once the table has been created and populated, it can then be queried. The easiest way to execute Hive Queries is to use the Invoke-Hive cmdlet.

  1. Define the cluster and the query to be used as in the following:

    $clusterName = "chervinehadoop"
    $hiveQL = "SELECT * FROM webactivity;"
  2. Select the Azure HDInsight cluster to be used by the Invoke-Hive cmdlet for job submission as in the following:

    Use-AzureHDInsightCluster $clusterName

  3. Submit Hive queries to the HDInsight cluster.

    Invoke-Hive -Query $hiveQL

Querying Hive from the Azure Portal

So far, this article has focused on the technical side: how Hive works and how to automate the process using PowerShell. The command line and PowerShell are mostly aimed at developers and are not particularly user friendly.

Fortunately, Hive queries can also be executed from the portal, which makes them much easier to run. To use this feature, go to the HDInsight homepage on Azure and click on the Query Console.

Once you are logged in, you will be routed to the Microsoft Azure HDInsight Query Console Homepage. From here, you can view some sample queries from the gallery, write your own queries using the Hive editor, view the jobs execution history and browse the files in the cluster.

In this example, the same queries as in the PowerShell example shall be written but from the Azure Portal.

To do so, click on the Hive Editor and paste in the queries.

Once you click on submit, the processing will start and you will be able to monitor the status from the Job Session Pane.

When it completes, the file will be saved in the blob at the path specified.

Conclusion and Next Steps

This article explained the basics of Hive and then provided examples of using Hive from the command line, PowerShell, and the Azure portal.

The main advantage of Hive is that it is easy to use and makes developing Big Data solutions faster: HiveQL is very similar to SQL, yet it is translated into MapReduce code behind the scenes, transparently to the user.

Moreover, external tables make it possible to process data without moving it into Hive-managed storage, and the data is preserved even if the cluster is deleted, which is also a significant cost benefit.

See Also