Analyze Twitter Data With Hive in Azure HDInsight

Introduction

Social Media Analytics is one of the major uses of Big Data. This article shows how to use a Hive in Azure HDInsight to analyze Twitter feeds.

In this article, feeds about "Mauritius" shall be extracted and analyzed using Azure HDInsight.

create twitter application

Prerequisites

  1. Introduction to Big Data Analytics Using Microsoft Azure
  2. Big Data Analytics Using Hive on Microsoft Azure

Create a Twitter Application

For creating an application on Twitter, use the Twitter APIs, read data from Twitter and also post tweets if required.

To create your first Twitter Application:

  1. Go to https://dev.twitter.com/apps.
  2. Click on Create New App.
  3. Fill in the required details and create the app.

    create the App

Once the app is created, navigate to the Keys and Access tokens in your application, this is where the keys to read data from Twitter can be obtained.

application setting

Extract the Tweets

The tweets shall be extracted using a PowerShell script, uploaded to Azure Storage before being processed by Azure HDInsight.

The following is the procedure to extract the tweets and save them to an Azure BLOB storage.

Define the variables

In this step, all the variables used in the PowerShell Script is defined.

  1. # Enter the HDInsight cluster name  
  2. $clusterName = "chervinehadoop"   
  3.  
  4. # Enter the OAuth information for your Twitter application. These information in obtained in part 1  
  5. $oauth_consumer_key = "";  
  6. $oauth_consumer_secret = "";  
  7. $oauth_token = "";  
  8. $oauth_token_secret = "";  
  9.  
  10. # Path to save the Tweets on the Azure Blob Storage  
  11. $destBlobName = "Tweets/MRUTweets.txt"   
  12.  
  13. # This script gets the tweets containing these keywords.  
  14. $trackString = "Mauritius"   
  15. $track = [System.Uri]::EscapeDataString($trackString);  
  16. $lineMax = 24   

Connect to an Azure Account

This will open an interface to capture your login credentials.

Add-AzureAccount.

AzureAccount

If the authentication is successful, your ID and subscriptions should be displayed in PowerShell.

authentication

Create BLOB Storage

Create a BLOB Storage where the tweets shall be saved as in the following:

  1. $myCluster = Get-AzureHDInsightCluster -Name $clusterName  
  2. $storageAccountName =$myCluster.DefaultStorageAccount.StorageAccountName.Replace(".blob.core.windows.net""")  
  3. $containerName = $myCluster.DefaultStorageAccount.StorageContainerName  
  4.   
  5. $storageAccountKey = get-azurestoragekey $storageAccountName | %{$_.Primary}  
  6. $storageConnectionString ="DefaultEndpointsProtocol=https;AccountName=$storageAccountName;AccountKey=$storageAccountKey"  
  7.   
  8. $storageAccount =[Microsoft.WindowsAzure.Storage.CloudStorageAccount]::Parse($storageConnectionString)  
  9. $storageClient = $storageAccount.CreateCloudBlobClient();  
  10. $storageContainer = $storageClient.GetContainerReference($containerName)  
  11. $destBlob = $storageContainer.GetBlockBlobReference($destBlobName)  

OAuth Connection String

Build the OAuth Connection String to connect to Twitter as in the following:

  1. $oauth_nonce =[System.Convert]::ToBase64String([System.Text.Encoding]::ASCII.GetBytes([System.DateTime]::Now.Ticks.ToString()));  
  2. $ts = [System.DateTime]::UtcNow - [System.DateTime]::ParseExact("01/01/1970","dd/MM/yyyy", $null)  
  3. $oauth_timestamp = [System.Convert]::ToInt64($ts.TotalSeconds).ToString();  
  4.   
  5. $signature = "POST&";  
  6. $signature +=[System.Uri]::EscapeDataString("https://stream.twitter.com/1.1/statuses/filter.json ")+ "&";  
  7. $signature += [System.Uri]::EscapeDataString("oauth_consumer_key=" + $oauth_consumer_key +"&");  
  8. $signature += [System.Uri]::EscapeDataString("oauth_nonce=" + $oauth_nonce + "&");  
  9. $signature += [System.Uri]::EscapeDataString("oauth_signature_method=HMAC-SHA1&");  
  10. $signature += [System.Uri]::EscapeDataString("oauth_timestamp=" + $oauth_timestamp + "&");  
  11. $signature += [System.Uri]::EscapeDataString("oauth_token=" + $oauth_token + "&");  
  12. $signature += [System.Uri]::EscapeDataString("oauth_version=1.0&");  
  13. $signature += [System.Uri]::EscapeDataString("track=" + $track);  
  14.   
  15. $signature_key = [System.Uri]::EscapeDataString($oauth_consumer_secret) + "&" +[System.Uri]::EscapeDataString($oauth_token_secret);  
  16.   
  17. $hmacsha1 = new-object System.Security.Cryptography.HMACSHA1;  
  18. $hmacsha1.Key = [System.Text.Encoding]::ASCII.GetBytes($signature_key);  
  19. $oauth_signature =[System.Convert]::ToBase64String($hmacsha1.ComputeHash([System.Text.Encoding]::ASCII.GetBytes($signature)));  
  20.   
  21. $oauth_authorization = 'OAuth ';  
  22. $oauth_authorization += 'oauth_consumer_key="' +[System.Uri]::EscapeDataString($oauth_consumer_key) + '",';  
  23. $oauth_authorization += 'oauth_nonce="' + [System.Uri]::EscapeDataString($oauth_nonce) +'",';  
  24. $oauth_authorization += 'oauth_signature="' +[System.Uri]::EscapeDataString($oauth_signature) + '",';  
  25. $oauth_authorization += 'oauth_signature_method="HMAC-SHA1",'  
  26. $oauth_authorization += 'oauth_timestamp="' +[System.Uri]::EscapeDataString($oauth_timestamp) + '",'  
  27. $oauth_authorization += 'oauth_token="' + [System.Uri]::EscapeDataString($oauth_token) +'",';  
  28. $oauth_authorization += 'oauth_version="1.0"';  
  29.   
  30. $post_body = [System.Text.Encoding]::ASCII.GetBytes("track=" + $track);  

Read the tweets

Read the tweets as in the following:

  1. Write-Host "signature= " + $signature  
  2.   
  3. [System.Net.HttpWebRequest] $request =[System.Net.WebRequest]::Create("https://stream.twitter.com/1.1/statuses/filter.json ");  
  4. $request.Method = "POST";  
  5. $request.Headers.Add("Authorization", $oauth_authorization);  
  6. $request.ContentType = "application/x-www-form-urlencoded";  
  7. $body = $request.GetRequestStream();  
  8.   
  9. $body.write($post_body, 0, $post_body.length);  
  10. $body.flush();  
  11. $body.close();  
  12. $response = $request.GetResponse() ;  
  13.   
  14. $memStream = New-Object System.IO.MemoryStream  
  15. $writeStream = New-Object System.IO.StreamWriter $memStream  
  16.   
  17. $sReader = New-Object System.IO.StreamReader($response.GetResponseStream())  
  18.   
  19. $inrec = $sReader.ReadLine()  
  20. $count = 0  
  21. while (($inrec -ne $null) -and ($count -le $lineMax))  
  22. {  
  23.    if ($inrec -ne "")  
  24.    {  
  25.       Write-Host "`n`t $count tweets received." -ForegroundColor Yellow  
  26.   
  27.       $writeStream.WriteLine($inrec)  
  28.       $count ++  
  29.    }  
  30.   
  31.    $inrec=$sReader.ReadLine()  
  32. }  
Save the tweets to the BLOB storage as in the following:
  1. $writeStream.Flush()  
  2. $memStream.Seek(0, "Begin")  
  3. $destBlob.UploadFromStream($memStream)  
  4.   
  5. $sReader.close()  
At this stage, the file has been uploaded to the BLOB storage in the default container at the path Tweets/MRUTweets.txt.

MRUTweets

The file may also be downloaded to view its contents.

downloaded to view

Process the Tweets using Hive

The following procedure describes how to read the Tweets from the BLOB Storage and analyze them using Hive on HDInsight.

Tweets using Hive


  1. Open the Query Console

    From your Hadoop Cluster, click on Query Console.

    Query Console

    Fill in your credentials and go to the Hive Editor.

  2. Create Staging table RAW_TWEETS

    Load all the data from the file in the table RAW_TWEETS.
    1. DROP TABLE IF EXISTS RAW_TWEETS;  
    2.   
    3. --create the raw Tweets table on json formatted twitter data  
    4. CREATE EXTERNAL TABLE RAW_TWEETS(json_response STRING)  
    5. STORED AS TEXTFILE LOCATION'wasb://chervinehadoop@chervinestoragehadoop.blob.core.windows.net/Tweets/';  
    The location above should point to the same path where the tweets are saved on the BLOB Storage.

    The following are the contents of the table RAW_TWEETS.
    1. SELECT * FROM RAW_TWEETS;  
    job output

  3. Create table TWEETS

    This is where the processed (parsed) JSON Twitter data will be stored.
    1. DROP TABLE IF EXISTS TWEETS;  
    2. CREATE TABLE TWEETS(  
    3. id BIGINT,  
    4. created_at STRING,  
    5. created_at_date STRING,  
    6. created_at_year STRING,  
    7. created_at_month STRING,  
    8. created_at_day STRING,  
    9. created_at_time STRING,  
    10. in_reply_to_user_id_str STRING,  
    11. text STRING,  
    12. contributors STRING,  
    13. retweeted STRING,  
    14. truncated STRING,  
    15. coordinates STRING,  
    16. source STRING,  
    17. retweet_count INT,  
    18. url STRING,  
    19. hashtags array<STRING>,  
    20. user_mentions array<STRING>,  
    21. first_hashtag STRING,  
    22. first_user_mention STRING,  
    23. screen_name STRING,  
    24. name STRING,  
    25. followers_count INT,  
    26. listed_count INT,  
    27. friends_count INT,  
    28. lang STRING,  
    29. user_location STRING,  
    30. time_zone STRING,  
    31. profile_image_url STRING,  
    32. json_response STRING);  
  4. Load table TWEETS

    Parse the JSON tweets from the table RAW_TWEETS and store them into the table TWEETS.
    1. FROM RAW_TWEETS  
    2. INSERT OVERWRITE TABLE TWEETS  
    3. SELECT  
    4. CAST(get_json_object(json_response, '$.id_str'as BIGINT),  
    5. get_json_object(json_response, '$.created_at'),  
    6. CONCAT(SUBSTR (get_json_object(json_response, '$.created_at'),1,10),' ',  
    7. SUBSTR (get_json_object(json_response, '$.created_at'),27,4)),  
    8. SUBSTR (get_json_object(json_response, '$.created_at'),27,4),  
    9. CASE SUBSTR (get_json_object(json_response, '$.created_at'),5,3)  
    10. WHEN 'Jan' then '01'  
    11. WHEN 'Feb' then '02'  
    12. WHEN 'Mar' then '03'  
    13. WHEN 'Apr' then '04'  
    14. WHEN 'May' then '05'  
    15. WHEN 'Jun' then '06'  
    16. WHEN 'Jul' then '07'  
    17. WHEN 'Aug' then '08'  
    18. WHEN 'Sep' then '09'  
    19. WHEN 'Oct' then '10'  
    20. WHEN 'Nov' then '11'  
    21. WHEN 'Dec' then '12' end,  
    22. SUBSTR (get_json_object(json_response, '$.created_at'),9,2),  
    23. SUBSTR (get_json_object(json_response, '$.created_at'),12,8),  
    24. get_json_object(json_response, '$.in_reply_to_user_id_str'),  
    25. get_json_object(json_response, '$.text'),  
    26. get_json_object(json_response, '$.contributors'),  
    27. get_json_object(json_response, '$.retweeted'),  
    28. get_json_object(json_response, '$.truncated'),  
    29. get_json_object(json_response, '$.coordinates'),  
    30. get_json_object(json_response, '$.source'),  
    31. CAST (get_json_object(json_response, '$.retweet_count'as INT),  
    32. get_json_object(json_response, '$.entities.display_url'),  
    33. ARRAY(   
    34. TRIM(LOWER(get_json_object(json_response, '$.entities.hashtags[0].text'))),  
    35. TRIM(LOWER(get_json_object(json_response, '$.entities.hashtags[1].text'))),  
    36. TRIM(LOWER(get_json_object(json_response, '$.entities.hashtags[2].text'))),  
    37. TRIM(LOWER(get_json_object(json_response, '$.entities.hashtags[3].text'))),  
    38. TRIM(LOWER(get_json_object(json_response, '$.entities.hashtags[4].text')))),  
    39. ARRAY(  
    40. TRIM(LOWER(get_json_object(json_response,'$.entities.user_mentions[0].screen_name'))),  
    41. TRIM(LOWER(get_json_object(json_response,'$.entities.user_mentions[1].screen_name'))),  
    42. TRIM(LOWER(get_json_object(json_response,'$.entities.user_mentions[2].screen_name'))),  
    43. TRIM(LOWER(get_json_object(json_response,'$.entities.user_mentions[3].screen_name'))),  
    44. TRIM(LOWER(get_json_object(json_response,'$.entities.user_mentions[4].screen_name')))),  
    45. TRIM(LOWER(get_json_object(json_response, '$.entities.hashtags[0].text'))),  
    46. TRIM(LOWER(get_json_object(json_response,'$.entities.user_mentions[0].screen_name'))),  
    47. get_json_object(json_response, '$.user.screen_name'),  
    48. get_json_object(json_response, '$.user.name'),  
    49. CAST (get_json_object(json_response, '$.user.followers_count'as INT),  
    50. CAST (get_json_object(json_response, '$.user.listed_count'as INT),  
    51. CAST (get_json_object(json_response, '$.user.friends_count'as INT),  
    52. get_json_object(json_response, '$.user.lang'),  
    53. get_json_object(json_response, '$.user.location'),  
    54. get_json_object(json_response, '$.user.time_zone'),  
    55. get_json_object(json_response, '$.user.profile_image_url'),  
    56. json_response;  
    View the data into the TWEETS table:
    1. SELECT * FROM TWEETS  
    data into the TWEETS table

    The result can also be downloaded and viewed locally.

    result

  5. Create Summarized Hive table for Analysis

    1. View top users who are tweeting about Mauritius
      1. DROP TABLE IF EXISTS topusers;  
      2.   
      3. --create the topusers hive table by selecting from the HDISample_Tweets table  
      4. CREATE TABLE IF NOT EXISTS topusers(name STRING, screen_name STRING, tweet_count INT);  
      5.   
      6. INSERT OVERWRITE TABLE topusers  
      7. SELECT name, screen_name, count(1) as cc  
      8. FROM TWEETS  
      9. WHERE UPPER(text) LIKE '%MAURITIUS%'  
      10. GROUP BY name, screen_name;  
    2. View from which region people are tweeting about Mauritius
      1. DROP TABLE IF EXISTS topregion;  
      2.   
      3. --create the topusers hive table by selecting from the HDISample_Tweets table  
      4. CREATE TABLE IF NOT EXISTS topregion (region STRING, tweet_count INT);  
      5.   
      6. INSERT OVERWRITE TABLE topregion  
      7. select user_location, count(1)  
      8. from TWEETS  
      9. group by user_location  
      View the data in the topregion table.
      1. SELECT * FROM topregion  
      download file

Analyze the results in Microsoft Excel

  1. Install the Microsoft HDInsight Hive Driver

    The driver can be found at the following location: http://www.microsoft.com/en-us/download/details.aspx?id=40886

    Note: It was noticed that both the 32 and 64 bit versions of it work.

  2. Configure the Data Source

    Open the ODBC Data Source Administrator by clicking Start > Control Panel > Additional Options > Data Sources (ODBC). If you're prompted for an administrator password or confirmation, type the password or provide confirmation.

    Under User DSN, click add and select, Microsoft Hive ODBC Driver.

    User DSN

    Fill in the required information as in the following. Of course include your password.

    Fill in the required information

  3. View the result on Excel

    1. In Excel, go to the Data tab then from other sources, Microsoft Query.

      Microsoft Query

    2. Select your data source.

      Select your data source

    3. Add your required tables.

      Add your required tables
    4. View your data in Excel.

      View your data in Excel

Conclusion

This article focused on demonstrating how Twitter feeds can be analyzed using Hive and HDInsight. However, the Analysis does not end here. This can be enhanced to discover lots of information about the customers of a company.

Imagine a company extracting Twitter feeds about its products, retrieve the data into its data warehouse and link the Twitter data to its existing customer base. The possibilities of customer information here is of a really high scope, from discovering where customers have negative views on its products, hence doing more advertising to understanding which customers needs which products.

Social Media analytics is definitely crucial to the understanding of customer behaviors these days.

Moreover, having technologies like HDInsight where everything is managed by Azure and the user just focuses on the business aspects makes Social Media / Big data analytics much easier and affordable.

References

  1. Analyze Twitter data using Hive in HDInsight
  2. Using the ODBC Data Source Administrator