How To Work With Avro Data Type In .NET Environment

This article shows an automated way of reading Avro data in .NET applications. It covers reading the Avro schema, generating C# models, and deserializing the data. It also contains practical examples of usage: JSON vs. Avro benchmarks and an Azure Blob Storage client.

What Avro is

Avro is a data format highly popular in the Big Data world, and its popularity is also growing in the .NET area. It has two key features:

  • Compression - the data is efficiently encoded and serialized to a compact byte representation. Applying an advanced compression algorithm on top reduces the size even further.
  • Clear model - the model of the data is delivered alongside the data itself, but in a different format: the schema is represented as well-known, human-readable JSON. This enables backward and forward compatibility, which is unique at this level of data compression.

The benefit of compression is most valuable when working with large collections of data. For concrete numbers, take a look at the benchmark section at the end of this article.

Read schema from Avro file

Moving to the main topic: our goal is to handle unknown Avro files that we are going to process in the near future. The first step is to read the schema (model) of the file.

We have multiple options. The easiest way is to open the file in a text editor, copy the header, and extract the schema from it manually. But I would like to show you how to do this in an automated way.

The library that helps with handling Avro files is called AvroConvert (https://github.com/AdrianStrugala/AvroConvert). Its interface is very similar to the Newtonsoft.Json library, which makes it very easy to use.

// AvroConvert comes from the AvroConvert NuGet package (namespace SolTechnology.Avro)
var avroBytes = File.ReadAllBytes("sample.avro");
var schema = AvroConvert.GetSchema(avroBytes);

That's it. The extracted schema looks as follows:

{
    "type": "record",
    "name": "User",
    "namespace": "GrandeBenchmark",
    "fields": [{
        "name": "Id",
        "type": "int"
    }, {
        "name": "IsActive",
        "type": "boolean"
    }, {
        "name": "Name",
        "type": "string"
    }, {
        "name": "Age",
        "type": "int"
    }, {
        "name": "Contact",
        "type": {
            "type": "record",
            "name": "Contact",
            "namespace": "GrandeBenchmark",
            "fields": [{
                "name": "Id",
                "type": "int"
            }, {
                "name": "Address",
                "type": "string"
            }, {
                "name": "HouseNumber",
                "type": "long"
            }, {
                "name": "City",
                "type": "string"
            }, {
                "name": "PostCode",
                "type": "string"
            }]
        }
    }, {
        "name": "Offerings",
        "type": {
            "type": "array",
            "items": {
                "type": "record",
                "name": "Offering",
                "namespace": "GrandeBenchmark",
                "fields": [{
                    "name": "Id",
                    "type": "int"
                }, {
                    "name": "ProductNumber",
                    "type": {
                        "type": "string",
                        "logicalType": "uuid"
                    }
                }, {
                    "name": "Price",
                    "type": "int"
                }, {
                    "name": "Currency",
                    "type": "string"
                }, {
                    "name": "Discount",
                    "type": "boolean"
                }]
            }
        }
    }]
}

Create C# model

The schema above comes from the User class used in the benchmark sample. What if the schema is more complex and contains a significant number of properties and fields? Again, the creation of the C# model can be done in various ways. The simplest is to write the classes manually. Another, more convenient way is to use an automated tool again. AvroConvert provides a GenerateModel feature, which is also exposed online.

The website https://avroconvertonline.azurewebsites.net/ generates the C# model from an Avro file on the fly.
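In code, the same feature is a single call. A minimal sketch (I'm assuming the GenerateModel overload that accepts the raw Avro bytes; consult the AvroConvert documentation for the exact signatures):

var avroBytes = File.ReadAllBytes("sample.avro");

// Produces C# class definitions matching the embedded schema
string csharpModel = AvroConvert.GenerateModel(avroBytes);
File.WriteAllText("UserModel.cs", csharpModel);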


Read data

When we know the structure of the data and have it modeled in our code, we are very close to the finish. The final step is to read the file itself. It's as simple as that:

var avroBytes = File.ReadAllBytes("sample.avro");
var result = AvroConvert.Deserialize<List<User>>(avroBytes);

The result is a list of Users, ready for processing.
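As a side note, when no typed model is needed at all, AvroConvert also offers a direct conversion to JSON. To my knowledge the call looks as follows, but treat it as an assumption and verify it against the library's documentation:

// Converts the Avro payload straight to a JSON string, no C# model required
var json = AvroConvert.Avro2Json(avroBytes);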

Real-life example - Azure Blob Storage client

Our task is done, but let's take a look at a real-life example. Why would we even bother implementing Avro serialization in our services? One possible scenario is that we are charged for the amount of data in our storage. This is especially true for external hosts like Microsoft Azure and its document databases. Let's minimize the amount of data (and the cost) of Blob Storage.

The following example shows an Azure blob client that reduces response time, data size, and the cost of the project. The writer and the reader are BlobContainerClient extension methods:

 Initialize blob container

// Requires the Azure.Storage.Blobs NuGet package
BlobContainerClient blobContainerClient = new BlobContainerClient("yourConnectionString", "containerName");
blobContainerClient.CreateIfNotExists();

 Blob writer

public static void WriteItemToBlob(this BlobContainerClient client, string blobName, object content)
{
    var blob = client.GetBlobClient(blobName);

    // Serialize the payload to compact Avro bytes before upload
    var serializedContent = AvroConvert.Serialize(content);

    blob.DeleteIfExists(DeleteSnapshotsOption.IncludeSnapshots);
    blob.Upload(new BinaryData(serializedContent));
}

Blob reader

public static T ReadItemFromBlob<T>(this BlobContainerClient client, string blobName)
{
    var blob = client.GetBlobClient(blobName);

    // Download the blob and deserialize the Avro bytes back into T
    var content = blob.DownloadContent();
    var result = AvroConvert.Deserialize<T>(content.Value.Content.ToArray());
    return result;
}
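Putting the two extensions together, a short usage sketch (the container client comes from the initialization step above; the blob name and the sample values are illustrative only):

var users = new List<User>
{
    new User
    {
        Id = 1,
        IsActive = true,
        Name = "Alice",
        Age = 30,
        Contact = new Contact { Id = 1, Address = "Main Street", HouseNumber = 7, City = "Gdansk", PostCode = "80-000" },
        Offerings = new List<Offering>()
    }
};

// Write the collection as a compact Avro blob, then read it back
blobContainerClient.WriteItemToBlob("users.avro", users);
var restored = blobContainerClient.ReadItemFromBlob<List<User>>("users.avro");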

Bonus: Avro and JSON compression benchmark

Benchmark model

public class User
{
    public int Id { get; set; }
    public bool IsActive { get; set; }
    public string Name { get; set; }
    public int Age { get; set; }
    public Contact Contact { get; set; }
    public List<Offering> Offerings { get; set; }
}

public class Offering
{
    public int Id { get; set; }
    public Guid ProductNumber { get; set; }
    public int Price { get; set; }
    public string Currency { get; set; }
    public bool Discount { get; set; }
}

public class Contact
{
    public int Id { get; set; }
    public string Address { get; set; }
    public long HouseNumber { get; set; }
    public string City { get; set; }
    public string PostCode { get; set; }
}

Example

using AutoFixture;

var fixture = new Fixture();
var data = fixture.CreateMany<User>(N); // N pseudo-random User records

Size of the data by number of records and serialization method, in kB:

Records    JSON        JSON Gzip    JSON Brotli    Avro        Avro Gzip    Avro Brotli
1          0.74        0.44         0.41           1.26        1.15         1.13
10         7.40        2.89         2.58           5.13        3.39         3.29
100        75.53       28.84        25.32          44.58       27.12        25.20
1000       759.94      286.56       262.59         440.76      261.87       245.29
10000      7920.17     3081.28      2654.37        4567.44     2800.21      2609.99
100000     80591.47    31417.43     27344.50       46130.41    28821.31     26625.07
1000000    807294.01   314172.52    274262.06      461301.09   289041.60    266187.34

[Chart: data size by number of records for each serialization method, relative to plain JSON]

The result is easier to grasp on the chart above, which shows the data from the table relative to the plain JSON size.

Conclusion

  1. Avro does not bring benefits when the dataset is a single item or a small collection (<10 records) - the embedded schema outweighs the compression gain.
  2. The more items in the collection, the bigger the benefit serialization brings.
  3. Applying an additional compression algorithm brings even more benefits, for both GZip and Brotli. On the other hand, the GZip and Brotli results are very similar for JSON and Avro. So what is Avro's advantage? Its built-in compression mechanism - the same sizes without a separate compression step in your code.

Results of BenchmarkDotNet for 1000 User records:

Method        Average time   File size
Json_Default  21.51 ms       760 kB
Json_Brotli   65.96 ms       262 kB
Avro_Default  14.15 ms       440 kB
Avro_Brotli   45.14 ms       245 kB

The table shows serialization and deserialization time. In general, Avro serialization is faster than JSON, and applying a compression algorithm amplifies the difference: in this case, Avro Brotli is about 32% faster on average than JSON Brotli. Moreover, Avro data can be compressed with one of several algorithms just by specifying an argument. Supported algorithms:

  • Deflate
  • Snappy
  • GZip
  • Brotli
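
For example, picking a codec is a single argument on Serialize. A minimal sketch, assuming the CodecType enum shipped with AvroConvert:

// Serialize with built-in Brotli compression; the codec is recorded in the
// Avro header, so deserialization needs no extra argument
var compressed = AvroConvert.Serialize(data, CodecType.Brotli);
var restored = AvroConvert.Deserialize<List<User>>(compressed);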

References

  • http://avro.apache.org/
  • https://cwiki.apache.org/confluence/display/AVRO/Index
  • https://github.com/AdrianStrugala/AvroConvert
  • https://www.c-sharpcorner.com/Blogs/avro-rest-api-as-the-evolution-of-json-based-communication-between-mic