Amazon Athena - The Data Analytics Tool

Amit Dua
5y
11.3k
0
3
25
Blog

Analytics plays a key role to gain clear business insights, and if the data you want to analyze is huge, then there are a number of parameters that need to be taken care of viz: cost, the expertise of the domain, maintenance, regular upgrades, problem of concurrent users, etc.

Redshift is a fully managed data warehouse solution by Amazon. It is an efficient solution to manage data, but if more disk capacity is required, we need to increase the size of our cluster by adding more nodes. Being horizontally scalable, we can end up to handle a gigantic redshift cluster. There can be a simpler approach to this as well.

What is Amazon Athena?

Amazon Athena is an interactive query service that makes it easy to analyze data directly from S3 using standard SQL.

Serverless- that means there is no infrastructure, no instances, no administration, no updates required. We talk to Athena using an end-point or directly by AWS console.
Zero Spin-up time — that means as soon as we log in, we are able to submit queries, or analyze results, or create tables.
All the upgrades that happen behind the scenes are transparent; we get the benefit of the upgrades without any downtime.

One can query the data directly from S3; we don’t need loading of data in any external data warehouse. We can query the data directly from where it lives. It supports a variety of data in raw format from S3, which can be a text CSV, JSON, weblogs, AWS service logs. For, eg. If we have a text file that can be parsed using a regular expression, then also we can define that regular expression as a part of creating a table query.

Since the data is streamed directly from S3, so we can also take advantage of S3 durability and availability.

Cost-effective - We pay only for S3, which is quite cheap, and externally we pay per query.

Per query cost - 5 dollars per Terabyte scan for S3.

Failed queries and DDL queries are free. If we cancel a query, while running, the charge will be on the amount of data scanned. At the same time, there are many optimization techniques to reduce this cost, like partitioning the data while creating tables, so that the query on the table will scan only a part of S3.

Also, we can prefer using columnar storage of data. If data is correctly stored, we need to scan fewer pieces of data. Hence cost reduces.

Presto (SQL on anything)

Athena uses Presto as a managed service. Presto is an in-memory distributed SQL engine, which came out of Facebook. It reads data from anywhere and actually processes data from where it lives; hence it can be connected to a variety of connectors including HDFS, S3, MongoDB, MySQL, Postgres, Redshift, SQL Server. It has the power to handle hundreds of concurrent queries on a single cluster. That means we don’t have to maintain complex clusters.

Presto decouples the data from its processing; no data is stored in Presto, so it reads it from elsewhere. e.g., S3. And since S3 storage is really cheap, it makes a lot of sense to use it as the storage for our Datalake. So, data resides in S3, we create metadata over it, which is stored in AWS Glue Catalog Metastore, all processing is in-memory, and the results are directly streamed to the console (or can also be customized to an output bucket), also we can download data in CSV format from the console directly.

Internally, it automatically parallelizes queries, so when compared to Hive, which provides a brilliant use of Java MapAndReduce in Hadoop, executing seven hive commands in parallel took hours, whereas Presto was able to perform exceptionally good. This was one of the reasons Facebook created and switched to Presto, which further powers-up Amazon Athena.

Athena Supports SQL

So it supports commands like creating a table, nested queries, multiple joins. We can also partition the data based on any column, not just date and time, but can also make a combination of several columns. It uses Hive QL for DDL, and Presto while querying the data.

AWS Athena Console

We create External tables like Hive in Athena (either automatically by AWS Glue crawler or manually by DDL statement). This is the soft linking of tables. If the table is dropped, the raw data remains intact. Amazon web services (AWS) itself provides ready to use queries in Athena console, which makes it much easier for beginners to get hands-on.

Athena can be used only to read the data, DML statements like update or delete cannot be taken up.

Not just the console, Athena can also be integrated with the AWS SDK.