How To Query Parquet File Stored In S3 Using S3 Select

In this article, you will see how to query Parquet file stored in Amazon S3 using S3 Select. AWS Lambda project (.NET Core – C#) and AWS SDK for .NET are used to query file from S3 using S3 Select.

Amazon S3 Select

With Amazon S3 Select, you can use simple structured query language (SQL) statements to filter the contents of an Amazon S3 object and retrieve just the subset of data that you need. By using Amazon S3 Select to filter this data, you can reduce the amount of data that Amazon S3 transfers, which reduces the cost and latency to retrieve this data.

Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format. It also works with objects that are compressed with GZIP or BZIP2 (for CSV and JSON objects only), and server-side encrypted objects. You can specify the format of the results as either CSV or JSON, and you can determine how the records in the result are delimited.

Refer this blog to learn more about how Amazon S3 Select Support in the AWS SDK for .NET.

Prerequisites

  1. Download and install Visual Studio 2022.
  2. Download and install AWS Toolkit for Visual Studio 2022.
  3. An Active AWS Account. (See AWS Account Page.)
  4. User with sufficient access to create AWS resources for this article.(See IAM role in AWS IAM Roles Documentation.)
  5. User should have programmatic access keys. (See IAM user and Access in the AWS IAM documentation.)
  6. Download and install Amazon command line interface (AWS CLI). Configure AWS CLI.
  7. Create S3 bucket in AWS Console and upload sample JSON.

Sample Parquet File

How to query Parquet file stored in S3 using S3 Select

Create AWS Lambda Project (.NET Core - C#) using Visual Studio

  1. Open Visual Studio 2022.
  2. Click Create a new project.
  3. Search and select AWS Lambda Project (.NET Core – C#) project template. Click Next.
  4. Enter the project name and click Create.
  5. Select Empty Function blueprint and click Finish.

Implementation

Add AWSSDK.S3 NuGet Package.

Open Function.cs and replace the code with the following.

using Amazon;
using Amazon.Lambda.Core;
using Amazon.S3;
using Amazon.S3.Model;
// Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
[assembly: LambdaSerializer(typeof(Amazon.Lambda.Serialization.SystemTextJson.DefaultLambdaJsonSerializer))]
namespace S3SelectJSONDemo;
public class Function {
    private AmazonS3Client s3Client;
    public Function() {
        s3Client = new AmazonS3Client(RegionEndpoint.EUWest2);
    }
    /// <summary>
    /// A simple function that takes a string and does a ToUpper
    /// </summary>
    /// <param name="input"></param>
    /// <param name="context"></param>
    /// <returns></returns>
    public async Task FunctionHandler(ILambdaContext context) {
        var response = await s3Client.SelectObjectContentAsync(new SelectObjectContentRequest() {
            Bucket = "sqlreplicaparquet",
                Key = "MarketDataOneLakh.parquet",
                ExpressionType = ExpressionType.SQL,
                Expression = "select Instrument,Quantity from S3Object s where s.Quantity > '100'",
                InputSerialization = new InputSerialization() {
                    Parquet = new ParquetInput()
                },
                OutputSerialization = new OutputSerialization() {
                    JSON = new JSONOutput()
                }
        });
        foreach(var ev in response.Payload) {
            Console.WriteLine($ "Received {ev.GetType().Name}!");
            if (ev is RecordsEvent records) {
                Console.WriteLine("The contents of the Records Event is...");
                using(var reader = new StreamReader(records.Payload, System.Text.Encoding.UTF8)) {
                    Console.Write(reader.ReadToEnd());
                }
            }
        }
    }
}

Testing

  1. Run Mock Lambda Test Tool (Hit F5) to run Lambda locally from Visual Studio.
  2. In Mock Lambda Test Tool, click Execute Function and view the result in Log Output.

Clean up Resources

Delete all the resources to ensure that you're not charged for any services you aren't using.

Summary

Thus, in this article, you saw how to query Parquet file stored in Amazon S3 using S3 Select.


Similar Articles