Query Apache Hive With C# And ODBC

Introduction

The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax.

In this article, we will use C# to query Hive step by step.

Prerequisite

Before starting the specific content, we need to have a Hive server that can be connected.

Next, I will use version 2.1.1-cdh6.2.0 as a demonstration.

Query Apache Hive With C# And ODBC

Using beeline to test the connection and make a simple query.

beeline -u "jdbc:hive2://yourip:10000" -n yourname -p yourpwd
use ods;
select id, bucket from ods_xxxx_oss where dt = '2022-01-21' limit 10;

The following screenshot is the result of performing the above operations.

Query Apache Hive With C# And ODBC

How can we query the data from Hive with C# to get the above result?

We can use Open Database Connectivity(ODBC) to complete the above query.

For more information about ODBC, take a look at the following posts.

In next section, we will write some code to do it.

Code

Create a console application and install the following packages.

<ItemGroup>
   <PackageReference Include="Dapper" Version="2.0.123" />
   <PackageReference Include="System.Data.Odbc" Version="6.0.0" />
</ItemGroup>

The most important package is System.Data.OdbcDapper is used to simplify SQL operations.

The following code demonstrates how to connect to hive and query data from hive:

public static class HiveHelper {
    // dsn
    // private static string _dsn = "DSN=hivedsn;Schema=ods;UID=xxx;PWD=xxx;";
    // dsn-less
    private static string _dsn = "Driver=Hive;Host=xxx.xxx.xxx.xxx;Port=10000;Schema=ods;UID=xxx;PWD=xxx;";
    public static List < QueryResult > QueryRawData() {
        var sql = $ " select id, bucket from ods_xxx_oss where dt = '2022-01-21' limit 10;  ";
        try {
            using
            var conn = new OdbcConnection(_dsn);
            conn.Open();
            var res = conn.Query < QueryResult > (sql, commandTimeout: 600000000);
            return res.ToList();
        } catch (Exception ex) {
            throw;
        }
    }
}
public class QueryResult {
    [Column("ods_xxx_oss.id")]
    public string ? Id {
            get;
            set;
        }
        [Column("ods_xxx_oss.bucket")]
    public string ? Bucket {
        get;
        set;
    }
}

Here are some things to note,

  1. There are two ways for connection string, DSN or DSN-less
  2. Create database connection with OdbcConnection
  3. Column name mapping should contain the table name

Add CustomPropertyTypeMap and print the query result.

SqlMapper.SetTypeMap(typeof(QueryResult), new CustomPropertyTypeMap(
    typeof(QueryResult), (type, columnName) => type.GetProperties().FirstOrDefault(prop =>
    prop.GetCustomAttributes(false).OfType<ColumnAttribute>().Any(attr => attr.Name == columnName))));
var data = HiveHelper.QueryRawData();
foreach (var item in data)
{
    Console.WriteLine(item.Id + "\t\t" + item.Bucket);
}

As you can see, the code level is almost the same as the SQL code you usually write, the difference is in deployment.

Deployment

We will use docker to deploy this sample.

What we need are as follows:

  1. unixodbc
  2. hive odbc driver

So, we will install these two components and their dependencies.

This is a simple and non optimized dockerfile for development, but it is not recommended for production.

FROM mcr.microsoft.com/dotnet/runtime:6.0-bullseye-slim AS base

# install the dependencies
RUN sed -i s@/deb.debian.org/@/mirrors.aliyun.com/@g /etc/apt/sources.list\
    && apt-get update \
    && apt-get install -y --no-install-recommends \
        sasl2-bin \
        libsasl2-modules \
        unixodbc \
        alien \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* 

# install cloudera hive odbc
COPY ./ClouderaHiveODBC-2.6.11.1011-1.x86_64.rpm /tmp/ClouderaHiveODBC-2.6.11.1011-1.x86_64.rpm
# convert rpm to deb
RUN alien /tmp/ClouderaHiveODBC-2.6.11.1011-1.x86_64.rpm
RUN dpkg -i ./clouderahiveodbc_2.6.11.1011-2_amd64.deb && rm -rf /tmp/ClouderaHiveODBC-2.6.11.1011-1.x86_64.rpm ./clouderahiveodbc_2.6.11.1011-2_amd64.deb

# odbc configuration
COPY ./odbc.ini /etc/odbc.ini
COPY ./odbcinst.ini /etc/odbcinst.ini

FROM mcr.microsoft.com/dotnet/sdk:6.0-bullseye-slim AS build
WORKDIR /src
COPY . .
RUN dotnet restore "HiveDemo.sln"

WORKDIR /src/HiveDemo
RUN dotnet build "HiveDemo.csproj" -c Release -o /app/build

FROM build AS publish
RUN dotnet publish "HiveDemo.csproj" -c Release -o /app/publish

FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .
ENTRYPOINT ["dotnet", "HiveDemo.dll"]

There are two important ODBC configuration files, odbcinst.ini and odbc.ini.

Add hive ODBC driver information to odbcinst.ini

[Hive] 
Description = ODBC for Hive Cloudera
Driver = /opt/cloudera/hiveodbc/lib/64/libclouderahiveodbc64.so
HiveServerType=2

Hive is the driver name.

HiveServerType specify the version of hive server.

Add DSN information to odbc.ini

[hivedsn] 
Driver = Hive
Description = Cloudera Hive ODBC Driver DSN
Host = xxx.xxx.xxx.xxx 
Port = 10000 

Driver = Hive specify the driver is the above hive driver.

You also can set more configuration in odbc.ini, such as UID and password, etc.

Here is the directory structure of our example

Build and run this example.

docker build -t hivedemo:v1 .
docker run --rm hivedemo:v1

Query Apache Hive With C# And ODBC

We get the same result as the previous query through beeline.

Here is the source code you can find on my Github page.

Summary

This article introduces the combination of C# and ODBC to query data from Hive, and similarly, we can also insert, update and delete operations, as long as it complies with Hive SQL.

I hope this will help you!

Reference