Create User-Defined Functions In Hive With C#

What are UDFs in Hive?

In Hive, users can define their own functions to cover requirements that the built-in functions do not meet. These are called user-defined functions (UDFs).

We can write such a function in almost any language and plug it into a Hive query with the TRANSFORM clause.

TRANSFORM lets us add our own mappers and reducers to process the data.
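To make the contract concrete: by default, Hive writes each input row to the script's standard input as one tab-separated line and reads tab-separated lines back from its standard output. Below is a minimal sketch of that contract in plain shell. The `to_upper` function is a hypothetical stand-in for a real script, and the `printf` call only simulates the rows Hive would stream.

```shell
#!/bin/sh
# Hypothetical stand-in for a TRANSFORM script: it reads tab-separated rows
# from stdin and writes tab-separated rows to stdout.
to_upper() {
    tr '[:lower:]' '[:upper:]'
}

# Simulate two rows arriving the way Hive would stream them to the script.
printf '1\tcat\t123\n2\tdog\t456\n' | to_upper
```

Any program that follows the same read-a-line, write-a-line pattern can take the place of `to_upper`, which is why the language it is written in does not matter to Hive.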

In this article, we will learn how to create a Hive UDF with C#.

Normal Hive Query

Before we get started, let's prepare some data for the later examples.

Create a new table named ods.t_test:

create external table if not exists ods.t_test(
    id int,
    name string,
    phone string
)
row format delimited fields terminated by '\t'
lines terminated by '\n'
stored as textfile;

Insert a small amount of data.

insert into `ods`.`t_test`(id, name, phone) 
values(1, 'cat', '123'), (2, 'dog', '456');

Here is a regular query against this table in Hive.

select id, name, phone 
from ods.t_test
limit 10;

Next, we write a custom UDF in C# that adds a prefix to every value returned by the Hive query.

Create C# Custom UDF

What we need to do is actually very simple:

  1. Read the result of the query from standard input
  2. Parse the result of the query
  3. Add the prefix to the parsed data (or apply any other operation you want)
  4. Write the new data to standard output

using System;
using System.Text;

public class Program
{
    public static void Main(string[] args)
    {
        // mode comes from the first argument ("raw" or "pre"); defaults to "raw"
        var p = "raw";
        if (args.Length > 0)
        { 
            p = args[0];
        }

        string line;
        try
        {
            // receiving each record passed in from Hive via stdin 
            while ((line = Console.ReadLine()) != null)
            {
                line = line.TrimEnd('\n');

                Handle(line, p);
            }
        }
        catch (Exception ex)
        {
            // report the failure on stderr; stdout is reserved for the rows returned to Hive
            Console.Error.WriteLine(ex);
        }
    }

    private static void Handle(string input, string type)
    {
        // dispatch to the handler for the requested mode
        if (type.Equals("raw", StringComparison.OrdinalIgnoreCase))
        {
            RawHandle(input);
        }
        else if (type.Equals("pre", StringComparison.OrdinalIgnoreCase))
        {
            AddPrefixHandle(input);
        }
        else
        {
            RawHandle(input);
        }
    }

    public static void AddPrefixHandle(string input)
    {
        // columns delimited by \t
        var field = input.Split('\t');

        var builder = new StringBuilder(512);

        for (int i = 0; i < field.Length; i++)
        {
            builder.Append($"pre-{field[i]}\t");
        }

        Console.WriteLine(builder.ToString().TrimEnd('\t'));
    }

    public static void RawHandle(string input)
    {
        // columns delimited by \t
        var field = input.Split('\t');

        // re-join as-is; string.Join keeps trailing empty columns that TrimEnd('\t') would drop
        Console.WriteLine(string.Join("\t", field));
    }
}

NOTE: We also added a RawHandle method that writes each row back unchanged. It gives us a baseline to compare against the output of our custom prefix method.

Next, we need to package the program into an executable file. Here we publish it as a self-contained single file, so the .NET runtime does not have to be installed on the machines that run it.

The following command demonstrates how to package a C# program into a single file.

dotnet publish HiveUDFDemo/HiveUDFDemo.csproj \
    -c Release \
    -r linux-x64 \
    -p:PublishSingleFile=true \
    -p:PublishTrimmed=true \
    -p:DebugType=None \
    -p:DebugSymbols=false \
    -p:EnableCompressionInSingleFile=true \
    --self-contained true
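
Since the program only talks to Hive through standard input and standard output, it can be worth checking it locally before uploading it. The following is a minimal sketch, assuming the published binary was copied to ./HiveUDFDemo. Both that path and the `smoke_test` helper are illustrative and not part of the original project.

```shell
#!/bin/sh
# Feed sample rows to a TRANSFORM executable the same way Hive would:
# one tab-separated row per line on stdin.
#   $1: path to the executable (assumed: the published HiveUDFDemo binary)
#   $2: mode argument passed to it, "raw" or "pre"
smoke_test() {
    printf '1\tcat\t123\n2\tdog\t456\n' | "$1" "$2"
}

# Assumed location; adjust to wherever dotnet publish wrote the file.
BIN=./HiveUDFDemo
if [ -x "$BIN" ]; then
    smoke_test "$BIN" pre
else
    echo "skipped: $BIN not found or not executable" >&2
fi
```

With the C# code above, the pre mode should print each row back with every column prefixed, for example pre-1, pre-cat and pre-123 separated by tabs, while the raw mode should print the rows unchanged.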

Usage of custom UDF

First, upload the executable file to HDFS so that the Hive command line interface (CLI) can find and use it.

hadoop fs -put HiveUDFDemo /tmp

Use add file to register the executable as a resource of the Hive session (it is shipped through the distributed cache), so that it can be executed directly.

add file hdfs:/tmp/HiveUDFDemo;

Below is an example of using the C# UDF with the TRANSFORM clause.

select transform(id,name,phone) 
    using 'HiveUDFDemo raw' 
    as (nid, nname, nphone) 
    from ods.t_test limit 10;

HiveUDFDemo raw is the command Hive runs to transform the query result.

Compared with the regular query, the execution process is a little more complicated and the output header changes to nid, nname, nphone, but the data stays the same.

Next, let's try HiveUDFDemo pre.

select transform(id,name,phone) 
    using 'HiveUDFDemo pre' 
    as (nid, nname, nphone) 
    from ods.t_test limit 10;

All values now carry the pre- prefix, so our UDF has taken effect and has been applied to the Hive query successfully.

The source code is available on my GitHub page.

Summary

This article showed how to create a Hive UDF in C# and how to integrate it into a Hive query.

I hope this will help you!
