Step-By-Step Installation Of A Pseudo-Distributed Hadoop Cluster

Why Hadoop 2.7.1 Cluster?
  • Apache released Hadoop 3.0.0 in December 2017. Instead of going with the newest version, we are using 2.7.1 because Hadoop 2.7.1 & 2.7.2 are the stable releases.
  • Most Hadoop development still happens on these versions.
  • For further details about this release, refer to the release notes - http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/releasenotes.html
Un-Tar the File

In a terminal, go to the location where the Hadoop package was downloaded, then execute the command below.
  • tar xzvf hadoop-2.7.1.tar.gz
    This command extracts all the files from the archive into a hadoop-2.7.1 folder.
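If you have not downloaded the package yet, here is a minimal sketch (assuming you fetch the release from the Apache archive; any mirror works equally well):

    cd ~/Downloads    # assumed download location
    wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
    tar xzvf hadoop-2.7.1.tar.gz
    ls hadoop-2.7.1   # should show bin, etc, sbin, share, ...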
Moving The Hadoop Folder to Our Location & Providing Ownership
  • sudo mv hadoop-2.7.1 /usr/local/hadoop
  • sudo chown -R raghav:hadoop /usr/local/hadoop - Gives the Hadoop user ownership of the contents. Use your own user and group here; note that the hadoop_store step later in this guide uses hduser:hadoop.
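If the dedicated user and group do not exist yet, a quick sketch to create them (Debian/Ubuntu-style commands; hduser and hadoop are the names assumed by the chown steps in this guide):

    sudo addgroup hadoop                   # dedicated group for the Hadoop daemons
    sudo adduser --ingroup hadoop hduser   # dedicated user placed in that group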
Setting Java Environment Variable
  • echo 'export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_71' >> /usr/local/hadoop/etc/hadoop/hadoop-env.sh
    This appends the JAVA_HOME variable to the Hadoop environment script so that Hadoop knows where Java is installed. (Java is already installed at that path on my machine; adjust the path to match your installation.)
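If you are unsure where your JDK lives, a small sketch to find the real Java path and confirm the variable was appended (output will vary by machine):

    readlink -f $(which java)   # e.g. /usr/lib/jvm/jdk1.8.0_71/jre/bin/java
    grep JAVA_HOME /usr/local/hadoop/etc/hadoop/hadoop-env.sh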
Creating Name Node, Data Node Directories
  • sudo mkdir -p /usr/local/hadoop_store/tmp
  • sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
  • sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
  • sudo mkdir -p /usr/local/hadoop_store/hdfs/secondarynamenode
  • sudo chown -R hduser:hadoop /usr/local/hadoop_store
We have created the directories for Hadoop temporary files, NameNode metadata, DataNode data, and Secondary NameNode metadata.
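The same layout can also be created in one step with shell brace expansion (equivalent to the four mkdir commands above):

    sudo mkdir -p /usr/local/hadoop_store/{tmp,hdfs/{namenode,datanode,secondarynamenode}}
    sudo chown -R hduser:hadoop /usr/local/hadoop_store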

Configurations

To get a single-node Hadoop cluster working properly, we have to modify the four XML configuration files listed below.
  • mapred-site.xml
  • core-site.xml
  • hdfs-site.xml
  • yarn-site.xml
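All four files live in Hadoop's configuration directory; a quick way to locate them (assuming the /usr/local/hadoop layout used above):

    ls /usr/local/hadoop/etc/hadoop/*site*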
Modifying mapred-site.xml

The mapred-site.xml file contains the configuration settings for the MapReduce daemon running on YARN.
  • sudo vi /usr/local/hadoop/etc/hadoop/mapred-site.xml - Since we moved the extracted Hadoop folder to /usr/local/hadoop, we provide the absolute path of the mapred-site.xml file. (If the file does not exist yet, see the note after the snippet below.)
  • Press the I key to enter insert mode, add the following contents to the file, then press Esc and type :wq to save and quit.
    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
    </configuration>
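Note that a fresh Hadoop 2.7.1 extract ships only a template for this file, so if mapred-site.xml does not exist yet, create it from the template first:

    cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml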

Modifying core-site.xml
  • sudo vi /usr/local/hadoop/etc/hadoop/core-site.xml
  • core-site.xml informs the Hadoop daemons where the NameNode runs in the cluster.
  • It contains the configuration settings for Hadoop Core, such as I/O settings that are common to HDFS & MapReduce.
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/hadoop_store/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:54130</value>
        <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
    </property>
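Once the file is saved, the value can be sanity-checked straight away (getconf only reads the configuration files, so no daemons need to be running yet):

    /usr/local/hadoop/bin/hdfs getconf -confKey fs.default.name
    # expected: hdfs://localhost:54130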

Modifying hdfs-site.xml

The hdfs-site.xml file contains the configuration settings for HDFS daemons.
  • NameNode
  • Secondary NameNode
  • DataNodes.
  • CheckPoint - The Secondary NameNode periodically merges the NameNode's edit log into its fsimage; this checkpoint is stored in the directory configured below. (DataNode liveness is handled by a separate heartbeat mechanism: at regular intervals each DataNode sends a heartbeat to the NameNode, and if those signals stop arriving the NameNode declares that DataNode dead.)
In hdfs-site.xml we need to specify the default block replication (the number of replicas kept for each block across the DataNodes).

The actual number of replications can also be specified when a file is created; the default is used if replication is not specified at create time.

In my cluster I have set the replication to 1; it can be modified based on your needs or requirements. Normally the replication factor should be greater than 1 in order to avoid data loss.
  • sudo vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml
  • Open hdfs-site.xml in the vi editor and add the contents shown below between the existing <configuration> </configuration> tags.
    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
            <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
        </property>
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
        </property>
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
        </property>
        <property>
            <name>dfs.namenode.checkpoint.dir</name>
            <value>file:/usr/local/hadoop_store/hdfs/secondarynamenode</value>
        </property>
        <property>
            <name>dfs.namenode.checkpoint.period</name>
            <value>3600</value>
        </property>
    </configuration>
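As with core-site.xml, these values can be sanity-checked without starting any daemons; note that the checkpoint period is in seconds, so 3600 means one checkpoint per hour:

    /usr/local/hadoop/bin/hdfs getconf -confKey dfs.replication
    /usr/local/hadoop/bin/hdfs getconf -confKey dfs.namenode.checkpoint.period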

Modifying the yarn-site.xml file:
  • sudo vi /usr/local/hadoop/etc/hadoop/yarn-site.xml
  • The yarn-site.xml file contains configuration information that overrides the default values for YARN parameters.
    <configuration>
        <!-- Site specific YARN configuration properties -->
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
        <property>
            <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
            <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
    </configuration>
Name Node Format
  • When we format the NameNode, it initializes (wipes) the metadata the NameNode keeps about the DataNodes. All references to existing data on the DataNodes are lost, and their storage can be reused for new data.
  • Normally the NameNode format is done only once, when the Hadoop cluster is first set up.
  • hadoop namenode -format - Command to execute the NameNode format.
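In Hadoop 2.x the hadoop namenode form still works but prints a deprecation warning; the preferred equivalent is:

    /usr/local/hadoop/bin/hdfs namenode -format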
Running All The Processes

The start-up commands differ based on the type of cluster we have configured and installed.

Single Node Cluster
  • start-all.sh (deprecated in Hadoop 2.x; it simply calls start-dfs.sh and start-yarn.sh for you)
To Run HDFS and YARN separately
  • start-yarn.sh (Resource Manager and Node manager)
  • start-dfs.sh (namenode, datanode and secondarynamenode)
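Once the daemons are up, a quick sketch to confirm that YARN can see its NodeManager (assumes the scripts above completed without errors):

    /usr/local/hadoop/bin/yarn node -list
    # a single-node cluster should report one RUNNING node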
MultiNode Cluster
  • hadoop-daemon.sh start namenode (on the master node)
  • hadoop-daemon.sh start secondarynamenode (on the master node)
  • hadoop-daemons.sh start datanode (runs on all slave nodes listed in the slaves file)
  • yarn-daemon.sh start resourcemanager (on the master node)
  • yarn-daemons.sh start nodemanager (runs on all slave nodes)
  • mr-jobhistory-daemon.sh start historyserver
How to Check All Daemons are Running?

jps (the Java Virtual Machine Process Status Tool) lists the running Java processes, which lets us verify that all of the daemons are running on the machine:
  1. NameNode
  2. Secondary NameNode
  3. DataNode
  4. ResourceManager
  5. NodeManager
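Illustrative output on a healthy single-node cluster (the process IDs will differ on your machine):

    $ jps
    4241 NameNode
    4369 DataNode
    4538 SecondaryNameNode
    4702 ResourceManager
    4820 NodeManager
    5103 Jps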
How to Browse the NameNode Web UI to Fetch Information About the NameNode & DataNodes?
  • http://localhost:50070/
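If YARN started successfully, the ResourceManager also exposes a web UI on its default Hadoop 2.x port:
  • http://localhost:8088/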