.Net Core With Apache Spark

  1. Introduction
  2. Install Dot Net Core
  3. Install Prerequisites
    • Install Java SDK
    • Install 7zip
  4. Install Apache Spark
  5. Install Dot Net for Apache Spark
  6. Create your first application using Apache Spark

Introduction

 
Nowadays we are dealing with huge amounts of data: IoT devices, mobile phones, home appliances, wearable devices, etc. are connected through the internet, and the volume, velocity, and variety of data are increasing day by day. At a certain point we need to analyze this data, represent it in a readable format, or use it to make important and bold business decisions. There are many tools and frameworks on the market for analyzing terabytes of data, and one of the most popular data analysis frameworks is Apache Spark.
 
What is Apache Spark?
 
Apache Spark is an open-source, general-purpose, distributed data analytics engine for large datasets. It can be used for big data processing and machine learning.
 
Why do we use it?
 
Apache Spark is a fast, robust, and scalable data processing engine for big data. In many cases it's faster than Hadoop. You can use it with Java, R, Python, SQL, and now with .NET.
 
Components of Apache Spark
 
Figure 1
 

Install Dot Net Core

  1. Open the link here.
  2. Download the SDK and install it
  3. Open the command prompt and type ‘dotnet’ to verify the successful .NET Core installation
Figure 2 
 

Install Prerequisites

 
A. Install Java SDK
  1. Open the link here
  2. Download and Install the Java SDK
  3. Set the environment variable
  4. Open the command prompt and type ‘java -version’ to verify the successful Java installation
Figure 3
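 
Step 3 above ("Set the environment variable") usually means pointing JAVA_HOME at the JDK install folder. A sketch, assuming a typical JDK 8 install path (the exact folder name depends on the version you downloaded); you should also add %JAVA_HOME%\bin to PATH via the System Properties dialog, since setx has a length limit that makes it risky for editing PATH directly:

```shell
:: Hypothetical install path; adjust to the JDK version and folder you installed.
:: setx writes to the persistent user environment, so open a new prompt afterwards.
setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_201"
```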
 
B. Install 7zip
  1. Open the link here.
  2. Download and install 7-Zip

Install Apache Spark

  1. Open the link here.
  2. Download the latest stable version of Apache Spark and extract the .tar file using 7-Zip
  3. Place the extracted folder in C:\bin
  4. Set the environment variables (setx writes to the persistent user environment, so open a new command prompt for them to take effect)
setx HADOOP_HOME C:\bin\spark-2.4.1-bin-hadoop2.7\
setx SPARK_HOME C:\bin\spark-2.4.1-bin-hadoop2.7\
 
Figure 4
 
To verify the successful installation, open the command prompt and run the following command:
 
%SPARK_HOME%\bin\spark-submit --version
 
Figure 5

Install Dot Net for Apache Spark

  1. Open the link here.
  2. Download the latest stable version of .NET for Apache Spark and extract the .tar file using 7-Zip
  3. Place the extracted folder in C:\bin
  4. Set the environment variable

    setx DOTNET_WORKER_DIR "C:\bin\Microsoft.Spark.Worker-0.6.0"
Also download and install WinUtils:
 
https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe
 
Once it is downloaded, copy it to C:\bin\spark-2.4.1-bin-hadoop2.7\bin
 
Create your first application using Apache Spark
 
Open the command prompt and type the following command to create a console application:
 
‘dotnet new console -o MyFirstApacheSparkApp’
 
 
Figure 6
 
Once the application is created successfully, type:
 
“cd MyFirstApacheSparkApp“
 
and hit enter. To use Apache Spark with .NET applications we need to install the Microsoft.Spark package:
 
“dotnet add package Microsoft.Spark“
 
Figure 7
 
Once the package installs successfully, open the project in Visual Studio Code.
 
Figure 8
 
Figure 9
 
Add a data file ‘data.txt’ to the application, containing the following text to process: “Betty Botter bought some butter, but the butter, it was bitter. If she put it in her batter, it would make her batter bitter, but a bit of better butter, that would make her batter better.”
 
Figure 10
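 
One way to create this file from the project folder (a hypothetical approach; any text editor works just as well — quoting shown for a POSIX shell, a plain echo works in cmd too):

```shell
# Create data.txt containing the sample tongue-twister
echo "Betty Botter bought some butter, but the butter, it was bitter. If she put it in her batter, it would make her batter bitter, but a bit of better butter, that would make her batter better." > data.txt
```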
 
Now open Program.cs and add the “Microsoft.Spark.Sql” namespace, which contains all the necessary classes.
using Microsoft.Spark.Sql;
Create a const variable to hold the data.txt file path.
private const string Paths = "data.txt";
The SparkSession class allows you to create a session; pass the app name when creating it so the session can be used further.
// Creating a Spark session here
Builder builder = SparkSession.Builder();
var spark = builder.AppName("spark_word_count").GetOrCreate();
In the next step we need to create a DataFrame that reads the data from the file path. This frame holds the data to process.
// Creating initial DataFrame here
DataFrame dataFrame = spark.Read().Text(Paths);
Now let’s write the code to count the words in the text.
// Count words
var words = dataFrame
    .Select(Functions.Split(Functions.Col("value"), " ").Alias("words"))
    .Select(Functions.Explode(Functions.Col("words")).Alias("word"))
    .GroupBy("word")
    .Count()
    .OrderBy(Functions.Col("count").Desc());
To show the result, use the Show() method.
// Show the results
words.Show();
After completing the operation we need to stop the Spark session; for this we use the Stop() method.
// Stop Spark session here
spark.Stop();
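 
Putting the snippets together, Program.cs looks like the following sketch (the namespace and class names are assumptions; only the Microsoft.Spark calls shown above are used):

```csharp
using Microsoft.Spark.Sql;

namespace MyFirstApacheSparkApp
{
    class Program
    {
        // Path of the text file to process
        private const string Paths = "data.txt";

        static void Main(string[] args)
        {
            // Create (or reuse) a Spark session named "spark_word_count"
            Builder builder = SparkSession.Builder();
            var spark = builder.AppName("spark_word_count").GetOrCreate();

            // Read the file; each line becomes a row in the "value" column
            DataFrame dataFrame = spark.Read().Text(Paths);

            // Split each line into words, explode into one row per word,
            // then group and count, most frequent first
            var words = dataFrame
                .Select(Functions.Split(Functions.Col("value"), " ").Alias("words"))
                .Select(Functions.Explode(Functions.Col("words")).Alias("word"))
                .GroupBy("word")
                .Count()
                .OrderBy(Functions.Col("count").Desc());

            // Print the resulting word/count table
            words.Show();

            // Stop the Spark session
            spark.Stop();
        }
    }
}
```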
Use the command below to build the project:
 
“dotnet build”
 
Figure 11
 
Run the command below to show the result. spark-submit takes several parameters here: the DotnetRunner class that launches the .NET application, --master local to run Spark on the local machine, the Microsoft.Spark bridge jar, and finally the dotnet command that starts our app.
 
%SPARK_HOME%\bin\spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local bin\Debug\netcoreapp3.1\microsoft-spark-2.4.x-0.8.0.jar dotnet bin\Debug\netcoreapp3.1\MyFirstApacheSparkApp.dll
 
Figure 12
 
Output
 
 Figure 13
 

Conclusion

 
Now Apache Spark can be used with .NET. Spark for .NET gives more flexibility to those who are more comfortable with C# and F#.
 
The aim of .NET for Spark is to make the Spark APIs accessible from applications written in .NET.