Word Count Method In HDInsight

In this article, you will learn about the word count method in HDInsight.

Hadoop Configuration in Azure 

 
In this article, I will show how to create a cluster, set up Hadoop on HDInsight, and run a simple word count job using PowerShell commands.
 
Create a new resource group and set up a Hadoop cluster. Provide the basic cluster information and the login details.
 
 
Here, I selected Hadoop as the cluster type, running on Linux.
 

Resource Group

 
Now, create a new resource group to hold the cluster and the storage account set up in the previous step.
 
 

Cluster Dashboard

 
The cluster dashboard opens, where you can monitor all the resources. It shows HDFS data usage, data nodes, links, memory, network, CPU, and other resources, along with their utilization.
  
 
For unstructured data such as a text file, we use the Blob storage option. Once you upload a file, it is available in the Blob folder.
 

Storage Account

 
 
Using the Upload button, I uploaded the files needed for processing and analysis. I then created a folder in the storage account to hold the input file and the jar file that performs the word count. Hadoop job logic is typically written in Java or Python; here, I use a jar file written in Java to process the data.
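To make the jar file's role more concrete, here is a minimal Python sketch of what a word count MapReduce job does in general: the map step emits a (word, 1) pair for each token, the framework groups the pairs by word, and the reduce step sums them. This is only a rough analogue, not the code inside the jar. The function names and the sample lines are made up for illustration, and splitting on whitespace is an assumption about how the input is tokenized.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    return [(token, 1) for token in line.split()]


def shuffle(pairs):
    """Group the emitted counts by word, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word, counts):
    """Sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["positive 5", "negative 1", "positive 4"]  # made-up input lines
    for word, total in sorted(word_count(sample).items()):
        print(f"{word}\t{total}")
```

On a real cluster, the map and reduce steps run in parallel across the data nodes; the sketch runs them in one process only to show the logic.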
 
When I created my account, I received the subscription name, the cluster name, and the credentials for my cluster. I use these to access the cluster data from the PowerShell ISE; without them, the cluster data cannot be accessed.
 
As the assignment required, I applied the word count to the sentiment category and the rating. I selected those columns in a file named Sentiment and uploaded it to the storage account, in the data folder inside the example folder, so that it could be processed there. I was initially unsure about the output folder, but it is created automatically: the PowerShell script names the output directory, and the folder is generated in the data folder when the job runs.
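The article does not show how the columns were selected, so the following Python sketch is only one possible way to prepare the file before uploading it. It keeps two columns and writes them separated by spaces, so that a whitespace-based word count treats each value as its own word. The column names `sentiment` and `rating`, the `review` column, and the sample rows are hypothetical.

```python
import csv
import io


def extract_columns(csv_text, columns):
    """Return one space-separated line per row, keeping only the named columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        lines.append(" ".join(row[name].strip() for name in columns))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical sample data; the real Sentiment file is not shown in the article.
    sample = "review,sentiment,rating\nGreat phone,positive,5\nBroke fast,negative,1\n"
    print(extract_columns(sample, ["sentiment", "rating"]))
```

Without a step like this, a line such as `Great phone,positive,5` would be split on whitespace into `Great` and `phone,positive,5`, and the sentiment and rating values would not be counted separately.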
 
 

PowerShell Code Snippet 

  $subscriptionName = "Azure for Students"
  $clusterName = "cluster name given"
  $resourceGroupName = "muneeb"

  # Authenticate to your Azure subscription using your email (studentid@students.latrobe.edu.au) and password
  Connect-AzureRmAccount

  # Select the subscription that contains the HDInsight cluster
  Select-AzureRmSubscription -SubscriptionName $subscriptionName

  # Define the MapReduce job: the jar file, the class to run, and the input and output paths
  $mrJobDefinition = New-AzureRmHDInsightMapReduceJobDefinition `
                              -JarFile "wasb:///example/jars/hadoop-mapreduce-examples.jar" `
                              -ClassName "wordcount" `
                              -Arguments "wasb:///example/data/Sentiment.csv", "wasb:///example/data/output"

  # Prompt for the HDInsight HTTP credentials (username: wbuser), then submit the job and wait for completion
  $cred = Get-Credential -Message "Enter the HDInsight cluster HTTP user credential:"
  $mrJob = Start-AzureRmHDInsightJob `
                      -ResourceGroupName $resourceGroupName `
                      -ClusterName $clusterName `
                      -HttpCredential $cred `
                      -JobDefinition $mrJobDefinition

  Wait-AzureRmHDInsightJob `
      -ResourceGroupName $resourceGroupName `
      -ClusterName $clusterName `
      -HttpCredential $cred `
      -JobId $mrJob.JobId

When the script runs, the subscription name, cluster name, and resource group name are first stored in variables. The account is then connected using the credentials I was given to access the cluster. Next, the job reads the input file, named Sentiment, from storage, and the jar file processes it to produce the word count. The job then generates an output file, a text file that can be downloaded to view the word count. I ran the script in the PowerShell ISE. The script is only 29 lines long, but it packs in many concepts. The input file is uploaded to the storage account that the script reads from. Screenshots of the successful run follow. The output folder has been generated, and the output file in it can be downloaded.
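Once the output file is downloaded, it can be inspected with a few lines of code. The Python sketch below assumes the usual Hadoop text output layout, where each line holds a word, a tab, and its count. The helper names and the sample text are made up for illustration.

```python
def parse_counts(text):
    """Parse 'word<TAB>count' lines into a dictionary, skipping malformed lines."""
    counts = {}
    for line in text.splitlines():
        if "\t" not in line:
            continue
        word, _, count = line.rpartition("\t")
        counts[word] = int(count)
    return counts


def top_words(counts, n):
    """Return the n most frequent words, with ties broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


if __name__ == "__main__":
    sample = "negative\t1\npositive\t2\n5\t1\n"  # made-up output text
    for word, count in top_words(parse_counts(sample), 2):
        print(word, count)
```

To use it on a real result, read the downloaded file's contents into a string and pass it to `parse_counts`.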
 

Word Count Successful

 
 
I performed several steps to prepare my PowerShell environment. I installed the Azure CLI to get the commands I needed for the task. The RM commands were unavailable until I imported the RM modules from the NuGet gallery. At first, Azure could not be accessed because there were several user accounts, and I had to grant permission from PowerShell so that I could access the Azure account remotely. After that, the Azure account and all its features were accessible. Once I provided the user name and password, the whole process executed and a file containing the word count was generated.