SQL Engines Boost Hadoop Query Processing For Big Data Users

Ritesh Mehta
8y
9.4k
0
5
25
Blog

In the Hadoop and NoSQL world, now the spotlight is on SQL-on-Hadoop engines. These days, a lot of different engines are available, which makes it difficult for businesses to choose. With SQL-on-Hadoop technologies it is possible to access stored Big Data in Hadoop by utilizing the familiar SQL language. Users could plug in almost any analytical or reporting tool to study and analyze the data. Before SQL-on-Hadoop, accessing big data was restricted to the few. With SQL-on-Hadoop today, one could use his one’s favorite tool. For a business, this opens up big data to a much bigger audience, which could boost the return on the big data investment.

Despite all the attention that Hadoop is getting, its use as a framework to support processing of big data has been limited to programmers with certain skills. Then enter SQL, the standard programming language for relational databases. The SQL tools integration is hastening Hadoop’s performance as well as opening the door to more data analysts and developers well-versed in SQL. As the ecosystem of Hadoop expands to include more technologies, wherein most are being developed rapidly, users should ensure a tight integration between the underlying Hadoop platform as well as associated tools.

Prospective users have a lot of options to take into consideration. Most of the tools available are still relatively new and are not yet fully mature. For instance, most Hadoop query engines do not support all functionality provided in relational SQL implementations. The fast development pace on Hadoop and related technologies is to help close the gap. Nevertheless, it also means that companies should keep up with frequent new releases to benefit from the added SQL-on-Hadoop features. Additionally SQL itself could be too much for a lot of business uses, as well as some analytics and BI professionals, to directly handle, which often prompt companies to put SQL-on-Hadoop tools under the covers of a self-service BI or front-end web user interfaces.

Organizations could choose from a wide array of SQL-on-Hadoop engines. The choice is vital since not all of the technologies are built equally. Outwardly, they all look the same, but internally, they’re very different. Choosing the right technology needs a detailed study. The key is to evaluate several requirements before choosing one of the engines available. One of the requirements is the SQL dialect. The richer the SQL dialect supported, the bigger the range of apps that could benefit from it. Additionally, the richer the dialect, the more query processing could be pushed to Hadoop and the less the reporting tools and applications have to do. Executing joins on big tables efficiently and fast is not easy all the time, particularly if the SQL-on-Hadoop engine doesn’t have an idea of where data is stored. Inefficient join processing could lead to huge amounts of I/O and could cause colossal data transport between nodes. Both could result in a very poor performance.

Initially, SQL was designed to process data that is highly structured. Every record in a table has the same columns set and every column holds a single atomic value. Not all big data in Hadoop has this traditional structure and its files may have nested data, schema-less data, variable data and self-describing data. A SQL-on-Hadoop engine should be able to translate all the data forms to flat relational data and should be able to optimize queries on the forms of data too.

Hadoop supports several ‘standard’ storage data formats, like Avro, Parquet, and ORCFile. The more SQL-on-Hadoop technologies utilize these formats, the more tools, as well as other SQL-on-Hadop engines, could read the same data. This minimizes the need to replicate data drastically. So, it is important to verify whether a proprietary storage format is used. To use SQL to execute complicated analytic functions, like Gaussian discriminative analysis and market basket analysis, it is necessary that they are supported by SQL or could be developed. The functions are called UDF’s or user-defined functions. Furthermore, it’s also important for SQL-on-Hadoop engine could distribute UDF’s execution over as many disks and modes as possible.

Not all data is stored in Hadoop, and most company data is still stored in other sources, such as SQL databases. A SQL-on-Hadoop engine should support distributed joins on stored data in all types of data sources. Simply put, it should support data federation. As different organizations compare and evaluate the technologies available, the assessment of the capabilities of the engine is a great starting point. Tool approaches and selection vary and each has a specialization.