Everything You Need To Know About Hadoop

Amira Bedhiafi
4y
37.5k
0
6

Article

Today, Hadoop is the primary platform for big data. Used for the storage and processing of huge volumes of data, this software framework and its various components are used by a large number of companies for their Big Data projects. By browsing this file, you will know everything about Hadoop and how it works.

What is Hadoop ?

Hadoop is an open source software framework for storing data, and launching applications on clusters of standard machines. This solution offers massive storage space for all types of data, immense processing power and the ability to support a virtually unlimited amount of tasks. Based on Java, this framework is part of the Apache project, sponsored by the Apache Software Foundation.

Everything You Need To Know About Hadoop

Thanks to the MapReduce framework, it can process huge amounts of data. Rather than having to move data to a network to perform processing, MapReduce allows processing software to be moved directly to the data.

History of Hadoop

With the advent of the World Wide Web in the late 1990s and early 2000s, search engines and indexes were created to help locate relevant information within textual content. Initially, search results were returned by humans. Of course, when the number of pages increased to tens of millions, automation became necessary. Web crawlers were then created, mainly as university research projects. Search engines like Yahoo and AltaVista have also started to appear.

Among these search engines, the open source project Nutch was created by Doug Cutting and Mike Cafarella. Their goal was to deliver web search results faster by distributing data and calculations across different computers to accomplish multiple tasks simultaneously. At the same time, the Google search engine was in development. This project was based on the same concept of storing and processing data in a distributed and automated way to deliver search results faster.

In 2006, Cutting decided to join Yahoo and took with it the Nutch project and ideas based on Google's early work in distributed data processing and storage. The Nutch project was divided into several parts. The web crawlers kept the name Nutch, while distributed computing and processing became Hadoop.

Still curious about the meaning of Hadoop ? It was named after Cutting's son's yellow elephant plush. (Funny right ?)

In 2008, Yahoo offered Hadoop as an open source project. Today, the framework and its ecosystem of technologies are managed and maintained by the not-for-profit Apache Software Foundation, a global community of software developers and contributors.

After four years of development within the Open Source community, Hadoop 1.0 was offered to the public from November 2012 as part of the Apache project, sponsored by the Apache Software Foundation. Since then, the framework has continued to be developed and updated.

The second version of Hadoop 2 has improved resource management and scheduling. It features a high availability file system option, and supports Microsoft Windows and other components to extend the versatility of the framework for data processing and analysis. Hadoop is currently offered in version 2.6.5.

Organizations can deploy Hadoop components and compatible software packages to their local data center. Most big data projects, however, rely on short-term use of substantial IT resources. This type of usage is best suited for highly scalable public cloud services, such as Amazon Web Services, Google Cloud Platform, and Microsoft Azure.

Cloud service providers typically support Hadoop components through basic services, such as AWS Elastic Compute Cloud and Simple Storage Services. There are also services tailored for Hadoop tasks, such as AWS Elastic MapReduce, Google Cloud Dataproc, and Microsoft Azure HDInsight.

Why is Hadoop important?

The benefits Hadoop brings to businesses are numerous. Thanks to this software framework, it is possible to store and process vast amounts of data quickly. Faced with the increasing volume of data and its diversification, mainly linked to social networks and the Internet of Things, this is a significant advantage.

Likewise, Hadoop's distributed computing model enables big data to be processed quickly. The greater the number of compute nodes used, the greater the processing power. Data and processed applications are protected against hardware failures. If a node goes down, tasks are redirected directly to other nodes to ensure that distributed computing does not fail. Multiple copies of all data are stored automatically.

Unlike traditional relational databases, there is no need to pre-process the data before storing it. It is possible to store as much data as you want and decide how to use it later.

This groups together unstructured data like text, images, and videos.

The open source framework is therefore free and relies on standard machines to store large amounts of data. Finally, it is possible to adapt the system to support more data by simply adding nodes. The administration required is minimal.

What are the challenges of using Hadoop?

MapReduce programming is not suitable for all problems. It is suitable for simple inquiries and problems that can be divided into independent units. However, it is not effective for iterative and interactive analytical tasks. Nodes do not communicate with each other, and iterative algorithms require multiple phases of map-shuffling and sorting to complete their tasks. Many files are created between MapReduce phases, and this programming is not suitable for advanced analytical calculations.

There is a large talent gap. It is very difficult to find programmers sufficiently proficient in Java to be productive with MapReduce. This is one of the reasons distribution providers are looking to put SQL relational technologies at the top of Hadoop. It's easier to find programmers with SQL skills than MapReduce experts. In addition, administration appears to be both artistic and scientific, and requires little knowledge of operating systems, hardware, and Hadoop kernel settings.

Another challenge relates to the security concerns of fragmented data, although new tools and technologies are emerging. The Kerberos authentication protocol represents a step forward in securing framework environments

Finally, there is no intuitive and complete tool for data management on Hadoop. The same goes for cleaning, data governance and metadata. The most missing tools are those relating to data quality and standardization.

How is Hadoop used by businesses?

Today, beyond its initial functionality to search millions of web pages for relevant results, Hadoop is used by many companies as a big data platform. Here are the main uses of Hadoop in business today:

Low-cost storage and data archiving

The modest cost of standard machines makes this data processing platform very useful for storing and combining data. Transactional data, or data coming from social networks, machines, scientific data, click streams… low-cost storage makes it possible to keep information that is not particularly useful at the moment in case it is needed. would become so later.

A sandbox for discovery and analysis

Designed to process large volumes of data of different shapes, Hadoop is capable of handling analytical algorithms. Analytical tools can help the business operate more efficiently, discover new opportunities, and gain competitive advantages. Hadoop's sandbox approach offers opportunities for medium innovation with minimal investment.

Data Lakes

Data Lakes support storing data in the original format. The goal is to provide a raw, unrefined view of data for data scientists and analysts for discovery and analysis. This helps them to ask new or complex questions without constraints. Data Lakes are not, however, a replacement for Data Warehouses. How to secure and govern Data Lakes remains a big debate in the IT industry today. It may be necessary to develop logical data structures based on data federation techniques.

Complement the Data Warehouses

Currently, Hadoop sits alongside Data Warehouse environments. Likewise, some datasets are offloaded directly from Data Warehouses to Hadoop, and some new types of data go directly to Hadoop. The end goal of every business is to have the right platform to store and process data of different schemas and formats to support different use cases that can be integrated at different levels.

Hadoop and the Internet of Things

Connected objects need to know what to communicate and when to act. The essence of the Internet of Things is the continuous streaming of data. Hadoop is often used as a data store for billions of transactions. Massive storage and processing capabilities also allow the big data platform to be used as a sandbox for discovering and establishing patterns for prescriptive instruction. These guidelines can then be continuously improved, as Hadoop is constantly updated with new data.

Recommendations engine

One of the most popular uses of Hadoop is for building recommendation systems on the web. Many companies use the framework's analytical tools to provide these types of services. Facebook uses it to suggest people you might know, LinkedIn offers you jobs that you might be interested in, and Netflix, eBay, and Hulu recommend content. These recommendation systems analyze large amounts of data in real time to quickly predict consumer preferences before they have time to leave the web page.

A recommendation system is concerned with generating a user profile explicitly, by asking for information from the user, or implicitly, by observing their behavior. This profile is then compared to benchmark characteristics, based on observations from the entire user community. The system can thus offer relevant suggestions.

How does it work ?

There are currently four core modules included in the core Apache Foundation Hadoop framework. Hadoop Common gathers the libraries and utilities used by other modules.

The Hadoop Distributed File System (HDFS) is a scalable, Java-based system for storing data on many machines without prior organization.

The YARN (Yet Another Resource Negotiator) allows you to manage the resources for the processes carried out on the platform.

Finally, MapReduce is a software framework for parallel processing. It combines two stages. The Map step allows you to retrieve entries to divide them into smaller subproblems and distribute them to the other nodes. Subsequently, the master node combines the answers to all of the subproblems to produce a result.

Apart from these four main modules, various components based on the framework have achieved a high level of reputation among Apache projects:

Ambari is a web interface to manage, configure and test Hadoop services and components.
Cassandra is a distributed database system
Flume is software that collects, aggregates and moves large amounts of data streams under HDFS.
HBase is a distributed, non-relational database that runs on top of Hadoop. Its arrays can be used as input and output for MapReduce tasks.
HCatalog is a storage and table management tool that allows users to share and access data.
Hive is a data warehouse and SQL query language presenting data in tabular form. Programming Hive is similar to programming a database.
Oozie allows to plan the tasks of the framework.
Pig is a platform for manipulating data stored in HDFS, and includes a compiler for MapReduce programs as well as a high-level language called Pig Latin. It provides a way to perform data extractions, transformations and loads, as well as basic analysis without having to write MapReduce programs.
Solr is a scalable search tool including indexing, central configuration, and recovery.
Hadoop Spark is an open-source cluster computing framework with in-memory analytics tools
Sqoop is a connection and transfer mechanism for moving data between Hadoop and relational databases.
Zookeeper is an application to coordinate distributed treatments.

Commercial distributions

Open Source software is created and maintained by a network of developers from around the world. It is available for download free of charge. Just type on its search engine "Hadoop Download" to find free distributions. Anyone can contribute to their development and use it. However, more and more commercial versions of the framework (often referred to as “distros”) are available.

Distributed by software vendors, these paid versions offer a personalized Hadoop framework. Buyers also benefit from additional features related to security, governance, SQL, management / administration consoles, as well as training, documentation and other services. Some of the most popular distributions include Cloudera, Hortonworks, MapR, IBM BigInsights, and PivotalHD.

How do I get data entered into Hadoop?

There are several ways to enter data into Hadoop. You can use connectors from third-party vendors. You can use Sqoop to import structured data from a relational database to HDFS, Hive, and HBase. Flume allows data to be uploaded from logs to Hadoop continuously. Files can also be loaded to the system using simple Java commands. HDFS can also be mounted as a file system into which files can be copied. These are just a few of the many options available.

What future for Hadoop?

According to a recent study by Zion Research, the global Hadoop market was worth $ 5 billion in 2015, and could reach a value of $ 59 billion in 2021, with an annual growth of 51% between 2016 and 2021.

The increase in the volume of structured and unstructured data within large enterprises, and the willingness of enterprises to use this data, are the main factors driving the growth of the Distributed Computing Platform market. Healthcare, finance, manufacturing, biotech and defense industries need fast and efficient solutions to monitor data. The development and updates of the framework could open up new opportunities for the market. However, security and distribution issues could limit the use of the distributed processing platform.

Software, hardware and services are the three main segments of the Hadoop market. Of these three segments, the services segment dominated the market in 2015 and generated around 40% of total revenue. Services are expected to continue to dominate through 2020. The software segment is also expected to experience significant growth with the massive adoption of Hadoop by large enterprises.

The IT industry accounted for 32% of Hadoop’s total revenue in 2015. Next is the telecommunications industry, followed by government and retail. These latter sectors are expected to generate strong growth as companies increasingly adopt Hadoop solutions.

North America is the main Hadoop market in 2015, with 50% of the overall revenue generated by the framework. This trend is expected to continue in the years to come. Asia Pacific is the region in which it is experiencing the greatest growth thanks to the emergence of the telecommunications and computer industries in China and India. Europe is also expected to initiate strong growth.

Some of the main companies in the market include Amazon Web Services, Teradata Corporation, Cisco Systems, IBM Corporation, Cloudera, Inc., Datameer, Inc., Oracle Corporation, Hortonworks, Inc., VMware, OpenX.

Hadoop 2 Vs Hadoop 3

Available since 2012, Hadoop 2 has been expanded over the years. The latest version 2.90 has been available since November 17, 2017. In parallel, release 3.0.0 was released on December 13, 2017. This new version brings a lot of new features that should be presented.

The first difference comes from the management of containers. The third provides more agility with Docker’s packet isolation. This makes it possible to build applications quickly and deploy them in minutes. Thus, the Time to Market is shorter.

The cost of using Hadoop 3 is also lower. The second version requires more storage space. While the third version requires 9 blocks of storage, the second takes up to 18 blocks in total including copies. The latest release therefore makes it possible to reduce the storage load while maintaining the same quality of data backup. And who says less space occupied, says cost reduction.

Another major difference is that Hadoop 2 only supports a single namenode. This tool for managing the file system tree, file metadata and directories. The next version can handle more than one, which exponentially increases the size of the infrastructure. More namenodes also means more safety. If one of these managers is "down", another can take over.

With its intra-node storage balancing system, there is no longer a problem of imbalance in the use of hard drives with Hadoop 3. The division of labor is also easier. Unlike the second version, it is possible to prioritize applications and users.

Finally, this latest version of the HDFS system opens up new perspectives for designers of machine learning and deep learning algorithms. Indeed, it supports more hard disk, but especially graphics cards. Analyzes are thus improved by the computing power of these processors, which is particularly useful for developing artificial intelligence applications.