Prefer Using Stream to byte[]

When working with files, APIs often accept both byte[] and Stream, and quite often people choose byte[] because it requires less ceremony or simply seems more intuitive.

You may think this is far-fetched, but I decided to write about it after reviewing and refactoring some real-world production code. So you may find this simple trick neglected in your codebase, along with some other simple things I’ve mentioned in my previous articles.

Example

Let’s look at an example as simple as calculating a file hash. Despite its simplicity, some people believe the only way to do it is to read the entire file into memory.

Experienced readers may have already foreseen the problem with such an approach. Let’s run some benchmarks on a 900 MB file to see how the problem manifests and how we can circumvent it.

The baseline will be the naive solution of calculating hash from byte[] source,

public static Guid ComputeHash(byte[] data)
{
    using HashAlgorithm algorithm = MD5.Create();
    // MD5 produces 16 bytes, which fit a Guid exactly
    byte[] bytes = algorithm.ComputeHash(data);
    return new Guid(bytes);
}
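
To call this baseline, the caller has to materialize the whole file as an array first. A minimal usage sketch (the file path is a placeholder):

```csharp
// Hypothetical usage: the entire file is read into memory before
// hashing even starts. With a 900 MB file this allocates a 900 MB array.
byte[] data = File.ReadAllBytes("video.mp4"); // placeholder path
Guid hash = ComputeHash(data);
```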

So, following the advice from the title of the article, we’ll add another method that accepts a Stream, converts it to a byte array, and calculates the hash.

public static async Task<Guid> ComputeHash(Stream stream, CancellationToken token)
{
    var contents = await ConvertToBytes(stream, token);
    return ComputeHash(contents);
}

private static async Task<byte[]> ConvertToBytes(Stream stream, CancellationToken token)
{
    using var ms = new MemoryStream();
    await stream.CopyToAsync(ms, token);
    return ms.ToArray();
}

However, calculating the hash from a byte[] is not the only option. ComputeHash also has an overload that accepts a Stream. Let’s use it.

public static Guid ComputeStream(Stream stream)
{
    using HashAlgorithm algorithm = MD5.Create();
    byte[] bytes = algorithm.ComputeHash(stream);
    // rewind so the caller can reuse the stream afterwards
    stream.Seek(0, SeekOrigin.Begin);
    return new Guid(bytes);
}
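
With this overload, the file never has to be fully loaded: we can hand the method a FileStream and let ComputeHash read it in internal-buffer-sized chunks. A usage sketch (the file path is a placeholder):

```csharp
// Hypothetical usage: the file is hashed chunk by chunk;
// only the stream's internal buffer is held in memory at a time.
using FileStream fs = File.OpenRead("video.mp4"); // placeholder path
Guid hash = ComputeStream(fs);
```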

The results are quite telling,

So what happened here? We tried to blindly follow the advice from the title, and it didn’t help: converting the Stream to a byte[] still loads the entire file into memory. The key takeaway from these figures is that Stream allows us to process files in chunks instead of naively loading them into memory. You may not notice this on small files, but as soon as you have to deal with large files, loading them into memory at once becomes quite costly.

Most of the .NET methods that work with byte[] already have a Stream counterpart, so using it shouldn’t be a problem. When you provide your own API, you should consider supplying a method that operates on a Stream in a robust chunk-by-chunk fashion.
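
One way to do this in your own API is to make the chunked Stream overload the single source of truth and have the byte[] overload delegate to it. A sketch with a hypothetical method (CountLineBreaks and the 4 KB buffer size are illustrative):

```csharp
// The byte[] overload simply wraps the array in a MemoryStream and
// delegates, so the chunk-by-chunk implementation lives in one place.
public static long CountLineBreaks(byte[] data)
{
    using var ms = new MemoryStream(data);
    return CountLineBreaks(ms);
}

public static long CountLineBreaks(Stream stream)
{
    var buffer = new byte[4096]; // illustrative chunk size
    long count = 0;
    int read;
    while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
    {
        for (int i = 0; i < read; i++)
        {
            if (buffer[i] == (byte)'\n')
                count++;
        }
    }
    return count;
}
```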

Let’s use the following code, which checks two streams for equality, as another example,
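
A chunk-by-chunk comparison along these lines might look as follows (a sketch; the method names are illustrative, and the 2 KB buffer size matches the description below):

```csharp
private const int BufferSize = 2048; // compare in 2 KB chunks

public static async Task<bool> IsEqual(Stream left, Stream right, CancellationToken token)
{
    // cheap early exit when both streams know their length
    if (left.CanSeek && right.CanSeek && left.Length != right.Length)
        return false;

    var leftBuffer = new byte[BufferSize];
    var rightBuffer = new byte[BufferSize];

    while (true)
    {
        int leftRead = await ReadChunk(left, leftBuffer, token);
        int rightRead = await ReadChunk(right, rightBuffer, token);

        if (leftRead != rightRead)
            return false;       // one stream ended before the other
        if (leftRead == 0)
            return true;        // both ended, all chunks matched
        if (!leftBuffer.AsSpan(0, leftRead).SequenceEqual(rightBuffer.AsSpan(0, rightRead)))
            return false;       // chunks differ, exit early
    }
}

// Fills the buffer unless the stream ends, so both streams advance in
// lock step even when a single ReadAsync returns fewer bytes than asked.
private static async Task<int> ReadChunk(Stream stream, byte[] buffer, CancellationToken token)
{
    int total = 0;
    int read;
    while (total < buffer.Length &&
           (read = await stream.ReadAsync(buffer, total, buffer.Length - total, token)) > 0)
    {
        total += read;
    }
    return total;
}
```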

Here, instead of loading two potentially big files into memory, we compare them in 2 KB chunks. As soon as a pair of chunks differs, we exit early.

Conclusion

Stream APIs allow chunk-by-chunk processing, which reduces memory consumption on big files. While at first glance the Stream API may seem to require more ceremony, it’s definitely a useful tool in one’s toolbox.