Introduction to the Hadoop Framework

Introduction

This blog introduces the Hadoop framework and its core concepts.

Hadoop

  • Hadoop is an open-source framework.
  • Hadoop is used to store and process large amounts of application data.
  • Hadoop can store any kind of data (structured, semi-structured, or unstructured), because its storage capacity scales across many machines.
  • Hadoop handles data in a distributed fashion, spreading storage and processing across clusters of commodity hardware.

Hadoop History

  • 2003 Google publishes the GFS (Google File System) paper.
  • 2004 Google publishes the MapReduce paper.
  • 2005 The Nutch project implements MapReduce and a distributed file system based on those papers.
  • 2006 Hadoop is split out of Nutch as an Apache subproject, led by Doug Cutting.
  • 2008 Hadoop becomes an Apache top-level project; Yahoo runs it in production at large scale.
  • 2013 Hadoop is used in many companies.

We need to understand Big Data before learning about the Hadoop Framework.

Big Data

  • Big Data is a collection of data so large that it cannot be stored or processed efficiently with traditional computing techniques. Big Data technologies make accurate analysis of such data possible, which is what makes the results valuable.
  • Big Data comes with several challenges: capturing the data, curation, storage, searching, sharing, transfer, analysis, and presentation.

Why Hadoop Is Important

  • Quickly store & process large amounts of data.
  • Computing power: the distributed model processes data on many nodes in parallel.
  • Fault tolerance: if a node fails, jobs are redirected to other nodes and multiple copies of the data remain available.
  • Flexibility & low cost: data does not have to be preprocessed before it is stored, and the framework runs on commodity hardware.
  • Scalability: the cluster grows simply by adding nodes.

Challenges in Hadoop

  • MapReduce programming is not a good match for every problem; iterative and interactive analytics are hard to express as single-pass map and reduce jobs.
  • Data security issues.
  • Hadoop is not easy to use.

Hadoop Data Gathering

Here, we will learn how to add our data to Hadoop.

  • Third-party vendor connectors (such as SAS/ACCESS or SAS Data Loader for Hadoop) are used to load data into Hadoop.
  • Apache Flume is used to stream log and event data into Hadoop.
  • Simple Java commands, or the hadoop fs shell, are used to transfer data from files into Hadoop.
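As a quick sketch of the last option, this is what loading a local file into HDFS looks like with the hadoop fs shell. It assumes a running Hadoop installation with the hadoop command on the PATH; the directory and file names are only examples.

```shell
# Create a target directory in HDFS (-p creates parent directories as needed).
hadoop fs -mkdir -p /data/logs

# Copy a local file into HDFS; HDFS splits it into blocks behind the scenes.
hadoop fs -put access.log /data/logs/

# Verify the upload by listing the directory.
hadoop fs -ls /data/logs
```

The same commands work whether the "cluster" is a single laptop in pseudo-distributed mode or thousands of nodes.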

Hadoop Components

  • HDFS - the Hadoop Distributed File System, for distributed storage.
  • MapReduce - the distributed data-processing model.
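To make the MapReduce model concrete, here is a minimal pure-Python sketch of its map, shuffle, and reduce phases, counting words across two "input splits". This only mimics the model; real Hadoop jobs are typically written in Java against the MapReduce API.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine the values for each key into a final result.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["Hadoop stores data", "Hadoop processes data"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

Because the map calls are independent and the reduce calls only see one key's values at a time, Hadoop can run both phases in parallel across many machines.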

Hadoop Architecture

  • HDFS operates on top of the existing file system of each node.
  • Files are stored as large, fixed-size blocks (128 MB by default in recent versions).
  • Reliability comes from replication: each block is stored on several DataNodes (three by default).
  • The NameNode stores the file-system metadata and manages access to the data.
  • DataNodes store the actual data blocks and serve read/write requests.
  • Data is not cached, because the datasets are typically too large for caching to help.
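The block-and-replica design can be sketched in a few lines of Python. This is a toy model with made-up node names and a tiny block size; real HDFS uses 128 MB blocks and rack-aware replica placement.

```python
BLOCK_SIZE = 8           # bytes here; real HDFS defaults to 128 MB
REPLICATION = 3          # the HDFS default replication factor
DATANODES = ["dn1", "dn2", "dn3", "dn4"]   # hypothetical node names

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # HDFS splits each file into fixed-size blocks (the last may be smaller).
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes=DATANODES, replication=REPLICATION):
    # Toy placement: round-robin each block's replicas over the DataNodes.
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

data = b"an example file stored in HDFS"
blocks = split_into_blocks(data)
# The block-to-nodes mapping is the kind of metadata a NameNode tracks;
# the blocks themselves would live on the DataNodes.
metadata = place_replicas(len(blocks))
print(len(blocks), metadata[0])  # 4 ['dn1', 'dn2', 'dn3']
```

If any single node in metadata[0] fails, two other replicas of that block survive, which is where HDFS's fault tolerance comes from.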