Install And Run Hadoop 3 On Windows For Beginners

The web is not the only source of massive volumes of information. Modern product tracking, logistics, and traceability systems, for example, which rely on the widespread identification of objects and RFID-based tracking, also generate enormous amounts of valuable data.
 
Large-scale analysis then enables much finer optimization. GPS tracking data, whether used for tighter control of field-staff costs or for new auto-insurance business models, can be detailed, cross-checked, and consolidated down to the individual record with these new tools.
 
Hadoop in a few words
 
In principle, Hadoop consists of:
  1. HDFS (Hadoop Distributed File System), an extremely powerful distributed data management system
  2. MapReduce, the distributed processing engine
  3. A collection of specific tools built on HDFS and MapReduce

Structure of Hadoop: construction and basic elements

 
When people talk about Hadoop, they usually mean the entire software ecosystem. In addition to the kernel components (Core Hadoop), there are many extensions with original names, such as Pig, Chukwa, Oozie, or ZooKeeper, which allow the framework to work with very large amounts of data. These projects, built on top of each other, are supported by the Apache Software Foundation.
 
The core, or Core Hadoop, is the fundamental foundation of the Hadoop ecosystem. In version 1, its components are the Hadoop Common base module, the Hadoop Distributed File System (HDFS), and the MapReduce Engine. Starting with version 2.3, the YARN cluster management system (also known as MapReduce 2.0) replaces the latter. YARN decouples the MapReduce algorithm from resource management; MapReduce now runs on top of YARN.
 
Hadoop Common
 
The Hadoop Common module provides a wide range of basic functions. These include the Java Archive (JAR) files required to start Hadoop, libraries for data serialization, and the Hadoop Common source code, together with documentation for the project and its subprojects.
 
Hadoop Distributed File System (HDFS)
 
HDFS is a highly available file system designed to store large amounts of data in a cluster of computers, and it is responsible for maintaining that data. Files are split into blocks of data and, without a classification scheme, distributed redundantly across different nodes. HDFS is thereby capable of handling several million files.
 
The length of the data blocks as well as their degree of redundancy can be configured.
 
The Hadoop cluster operates on the master / slave principle. The framework architecture is made up of master nodes, to which many slave nodes are subordinate. This principle is reflected in the construction of HDFS, which is based on a NameNode and various subordinate DataNodes. The NameNode manages the file system metadata, the directory structure, and the subordinate DataNodes. To minimize data loss, files are split into blocks and stored on different nodes. In the standard configuration, each block is kept in three copies.
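The block-splitting and replication idea can be sketched in a few lines of Python. This is a toy model, not Hadoop's actual placement policy: the node names, file size, and round-robin placement are invented for illustration (real HDFS also considers rack topology and node load).

```python
# Toy illustration (NOT Hadoop code): split a file into fixed-size
# blocks and replicate each block on several DataNodes.

def place_blocks(file_size, block_size, datanodes, replication=3):
    """Return a mapping: block index -> DataNodes holding a copy."""
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for i in range(num_blocks):
        # Simple round-robin placement across the available nodes.
        placement[i] = [datanodes[(i + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]          # hypothetical DataNodes
layout = place_blocks(file_size=350, block_size=128, datanodes=nodes)
# A 350-unit file with 128-unit blocks yields 3 blocks,
# each stored on 3 of the 4 DataNodes.
```

If any one node fails, every block it held still exists on two other nodes, which is the point of the redundancy described above.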
 
Each DataNode regularly sends a sign of life to the NameNode; this is called the heartbeat. If this signal stops arriving, the NameNode declares the slave "dead" and uses the copies of its blocks on other nodes to ensure that enough replicas remain available in the cluster. The NameNode thus plays an essential role in the framework. To prevent it from becoming a single point of failure, it is customary to provide the master node with a SecondaryNameNode. This records the changes made to the metadata, so that the central control instance can be restored.
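The heartbeat check can be illustrated with a small Python sketch. This is a simplified toy, not NameNode source code; the timeout value and node names are invented assumptions.

```python
# Toy sketch: a master declares a slave "dead" when its last
# heartbeat is older than a timeout (value chosen for illustration).

HEARTBEAT_TIMEOUT = 10.0  # seconds; real HDFS uses a longer window

def live_and_dead(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Split nodes into live and dead based on heartbeat age."""
    live, dead = [], []
    for node, ts in last_heartbeat.items():
        (live if now - ts <= timeout else dead).append(node)
    return live, dead

now = 100.0
beats = {"dn1": 95.0, "dn2": 99.0, "dn3": 80.0}  # dn3 silent for 20 s
live, dead = live_and_dead(beats, now)
# dn3 would be declared dead; its blocks get re-replicated elsewhere.
```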
 
With the transition from Hadoop 1 to Hadoop 2, HDFS gained several backup mechanisms: NameNode HA (High Availability) supplements the program with automatic failover, so that replacement components take over in the event of a NameNode failure. A snapshot copy function also allows the system to be restored to a previous state, and the Federation extension allows several NameNodes to operate within a single cluster.
 
MapReduce Engine
 
Another core component of Core Hadoop is Google's MapReduce algorithm, which is implemented in version 1 of the framework. The main duty of the MapReduce Engine is to manage cluster resources and steer the computation process (job scheduling / monitoring). Processing is based on the "map" and "reduce" phases, which allow data to be worked on directly where it is stored (data locality).
 
This speeds up computation and minimizes excessive consumption of network bandwidth. As part of the Map phase, complex calculation processes (jobs) are divided into units and distributed by the JobTracker on the master node across the various slave systems in the cluster. The TaskTrackers then ensure that the subtasks are processed in parallel. During the subsequent Reduce phase, the intermediate results are collected and combined into an overall result.
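The two phases can be illustrated with the classic word-count example in Python. This is an in-memory sketch, not a real distributed job: a real job runs map tasks near the data and shuffles the intermediate pairs between the phases.

```python
# Minimal in-memory sketch of the Map and Reduce phases.
from collections import defaultdict

def map_phase(lines):
    """Map: emit (word, 1) pairs from each input split."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    """Reduce: collect intermediate pairs and sum the counts per key."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

result = reduce_phase(map_phase(["big data", "big cluster"]))
# result: {"big": 2, "data": 1, "cluster": 1}
```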
 
While the master node generally hosts the NameNode and JobTracker components, a DataNode and a TaskTracker run on each subordinate slave. The following graphic shows the basic structure of Hadoop version 1, divided into the MapReduce Layer and the HDFS Layer.
 
[Diagram: Hadoop 1 architecture – MapReduce Layer and HDFS Layer]
 
With the release of version 2.3 of Hadoop, the MapReduce Engine was reworked. The result is the YARN / MapReduce 2.0 cluster management method, in which resource management and task management (Job Scheduling / Monitoring) are decoupled from MapReduce. The framework thus opens the door to new processing models and a wide range of Big Data applications on Hadoop.
 

YARN / MapReduce 2.0

 
With the introduction of the YARN ("Yet Another Resource Negotiator") module in version 2.3, the architecture of Hadoop was significantly changed. This is why we speak of a move from Hadoop 1 to Hadoop 2.
 
While users of Hadoop 1 had only MapReduce available as an application, decoupling resource management and task management from the data processing model has made it possible to integrate many Big Data applications into the framework. Under Hadoop 2, MapReduce is just one of many possible data processing options. YARN takes on the role of a distributed operating system that manages resources for Hadoop's Big Data applications.
 
The basic changes to the Hadoop architecture primarily concern the two MapReduce Engine trackers, which no longer exist as separate components in Hadoop version 2. Instead, the YARN module introduces three new entities: ResourceManager, NodeManager, and ApplicationMaster.
  • ResourceManager
    The global ResourceManager is the highest authority in the Hadoop architecture (the master), to which the various NodeManagers are subordinated as slaves. Its role is to manage the cluster, distribute resources to the subordinate NodeManagers, and orchestrate applications. The ResourceManager knows where the individual slave systems are located in the cluster and what resources they can provide. An important component of the ResourceManager is the ResourceScheduler, which determines how the available cluster resources are shared.

  • NodeManager
    A NodeManager runs on each node of the computer cluster. It occupies the slave position in the Hadoop 2 infrastructure and acts as a receiver of commands from the ResourceManager. When a NodeManager starts on a cluster node, it reports to the ResourceManager and then sends a periodic sign of life (heartbeat). Each NodeManager is responsible for the resources of its own node and makes part of them available to the cluster. The ResourceManager's ResourceScheduler directs how these resources are used in the cluster.

  • ApplicationMaster
    Each application running within the YARN system has its own ApplicationMaster, which negotiates resources from the ResourceManager and has the NodeManagers allocate them in the form of containers. The Big Data application is then executed and monitored inside these containers by the ApplicationMaster.
Here is a diagram showing the structure of Hadoop 2:
 
[Diagram: Hadoop 2 architecture – ResourceManager, NodeManagers, and ApplicationMasters]
 
When a Big Data application runs on Hadoop, three actors are involved:
  • a Client
  • a ResourceManager
  • One or more NodeManagers
In the first step, the client tasks the ResourceManager with starting the Big Data application in the Hadoop cluster. The ResourceManager then allocates a container. In other words: it reserves the cluster resources for the application and contacts a NodeManager. The NodeManager in question starts the container and runs the ApplicationMaster inside it, which is responsible for executing the application and monitoring it.
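The three-actor sequence above can be sketched as a short Python walk-through. The class and method names here are invented for illustration only; they are not the YARN API.

```python
# Illustrative sequence (hypothetical names, not the YARN API):
# client -> ResourceManager -> NodeManager -> ApplicationMaster.

events = []

class ResourceManager:
    def submit(self, app, node_manager):
        # Reserve cluster resources and contact a NodeManager.
        events.append(f"RM: allocate container for {app}")
        node_manager.start_container(app)

class NodeManager:
    def start_container(self, app):
        # Start the container; the ApplicationMaster runs inside it.
        events.append(f"NM: container started, "
                      f"launching ApplicationMaster for {app}")

rm, nm = ResourceManager(), NodeManager()
events.append("Client: submit wordcount to ResourceManager")
rm.submit("wordcount", nm)
# events now traces the client -> RM -> NM sequence in order.
```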
 

Advantages of Hadoop

 
Range of data sources
 
The data collected from various sources may be structured or unstructured. Sources can be social media, clickstream data, or even email conversations. It would take a long time to convert all the collected data into a single format; Apache Hadoop saves this time because it can extract valuable information from data in any form. It also performs various functions, such as data warehousing, fraud detection, and marketing campaign analysis.
 
Cost-effective
 
Previously, companies had to spend a considerable portion of their profits to store large amounts of data. In some cases, they even had to delete large sets of raw data to make room for new data, and valuable information could be lost. With Apache Hadoop, this problem is solved: it is a cost-effective solution for data storage. It helps in the long run because it stores all of the raw data generated by a business. If the business changes the direction of its processes in the future, it can easily refer back to the raw data and take the necessary action. This would not have been possible with the traditional approach, where raw data would have been deleted to keep costs down.
 
Speed
 
Every organization wants a platform that gets work done faster. Hadoop helps every business address its data storage challenges. It stores data on a distributed file system, and since the tools used for data processing are located on the same servers as the data, processing is performed at a faster rate as well. As a result, you can process terabytes of data in minutes using Apache Hadoop.
 
Multiple copies
 
Hadoop automatically duplicates the data stored in it, creating multiple copies. This ensures that, in the event of a failure, the data is not lost. Apache Hadoop treats the data stored by the company as important: it should not be lost unless the company chooses to discard it.
 

Disadvantages of Hadoop

 
Lack of preventive measures
 
When processing sensitive data collected by a company, the mandatory security measures must be in place. In Hadoop, security measures are disabled by default. The data analyst must be aware of this and take the necessary measures to secure the data.
 
Small data issues
 
There are big data platforms on the market that are not suited to small data, and Hadoop is one of them: only large companies generating big data can reap its full benefits, because Hadoop cannot operate efficiently in small data environments.
 
Risky operation
 
Java is one of the most popular programming languages in the world. It is also linked to various controversies, as cyber criminals can exploit frameworks built on Java. Hadoop is one such framework, written entirely in Java. The platform is therefore more vulnerable and can suffer damage.
 

Hadoop Deployment Methods

 
Standalone Mode
 
It is the default configuration mode of Hadoop. It doesn’t use HDFS; instead, it uses the local file system for both input and output. It is useful for debugging and testing.
 
Pseudo-Distributed Mode
 
Also called a single-node cluster, where both the NameNode and the DataNode reside on the same machine. All the daemons run on the same machine in this mode, producing a fully functioning cluster on a single machine.
 
Fully Distributed Mode
 
Hadoop runs on multiple nodes, with separate nodes for the master and slave daemons. The data is distributed among a cluster of machines, providing a production environment.
 

Hadoop Installation on Windows 10 

 
Prerequisites
 
To install Hadoop, you need Java version 1.8 installed on your system.
 
Check your Java version with this command at the command prompt:
  java -version
 
If java is not installed in your system, then visit this link.
 
 
After choosing the appropriate version for your machine's architecture, accept the license.
 
 
After downloading Java version 1.8, download Hadoop version 3.1 from this link.
 
Extract it to a folder.
 

Setup System Environment Variables

 
Open the Control Panel to edit the system environment variables.
 
 
Go to Environment Variables in System Properties.
 
 
Create a new user variable. Set the variable name to HADOOP_HOME and the variable value to the path of the bin folder where you extracted Hadoop.
 
Likewise, create a new user variable with the name JAVA_HOME and, as its value, the path of the bin folder in the Java directory.
 
 
Now we need to add the Hadoop bin directory and the Java bin directory to the system Path variable.
 
 
Edit Path in system variable
 
 
Click New and add the bin directory paths of both Hadoop and Java.
 
 
Configuration
 
Now we need to edit some files located in the etc\hadoop folder of the directory where we installed Hadoop:
  • core-site.xml
  • hadoop-env.cmd
  • hdfs-site.xml
  • mapred-site.xml
  • yarn-site.xml
Edit the file core-site.xml in the hadoop directory. Copy this XML property into the configuration section of the file:
  <configuration>
     <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
     </property>
  </configuration>
Edit mapred-site.xml and copy this property into the configuration:
  <configuration>
     <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
     </property>
  </configuration>
Create a folder ‘data’ in the hadoop directory
 
Create a folder with the name ‘datanode’ and a folder ‘namenode’ in this data directory
 
Edit the file hdfs-site.xml and add the property below in the configuration.
 
Note
The namenode and datanode paths in the values must be the paths of the namenode and datanode folders you just created.
  <configuration>
     <property>
        <name>dfs.replication</name>
        <value>1</value>
     </property>
     <property>
        <name>dfs.namenode.name.dir</name>
        <value>C:\Users\hp\Downloads\hadoop-3.1.0\hadoop-3.1.0\data\namenode</value>
     </property>
     <property>
        <name>dfs.datanode.data.dir</name>
        <value>C:\Users\hp\Downloads\hadoop-3.1.0\hadoop-3.1.0\data\datanode</value>
     </property>
  </configuration>
Edit the file yarn-site.xml and add the property below in the configuration:
  <configuration>
     <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
     </property>
     <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
     </property>
  </configuration>
Edit hadoop-env.cmd and replace %JAVA_HOME% with the path of the Java folder where your JDK 1.8 is installed.
 
 
Hadoop needs some Windows-specific files which are not included in the default Hadoop download.
 
To include those files, replace the bin folder in the hadoop directory with the bin folder provided in this github link.
 
Download it as a zip file, extract it, and copy the bin folder from it. If you want to keep the old bin folder, rename it to something like bin_old and paste the copied bin folder into that directory.
 
Now it's time to check whether Hadoop is successfully installed by running this command in cmd:
  hadoop version
 

Format the NameNode

 
After installing Hadoop, it is time to format the NameNode. This is done only once, when Hadoop is first installed (not every time you run the Hadoop file system).
 
Run this command:
  hdfs namenode -format
 
Now change the directory in cmd to the sbin folder of the hadoop directory with this command:
  cd C:\hadoop-3.1.0\sbin
Start the NameNode and DataNode with this command:
  start-dfs.cmd
Two more cmd windows will open for NameNode and DataNode.
 
Now start YARN with this command:
  start-yarn.cmd
Two more windows will open, one for the YARN resource manager and one for the YARN node manager.
 
Note
Make sure all four Apache Hadoop Distribution windows are up and running. If they are not, you will see an error or a shutdown message; in that case, you need to debug the error.
 
To access information about the resource manager's current jobs, and its successful and failed jobs, go to this link in your browser: http://localhost:8088/cluster
 
 
To check the details about HDFS (the NameNode and the DataNode), open this link in your browser: http://localhost:9870/
 
NB: for Hadoop versions prior to 3.0.0-alpha1, use http://localhost:50070/
 