Indexing In-Memory Collections For Blazing Fast Access

Joel Champagne
4y
52.3k
0
3

Article

Introduction

The CodexMicroORM open source project on GitHub hosts a few useful .NET goodies that can help you write concise code with great performance. In this article, I’ll cover a collection type that supports multi-property indexing and as such, performs very well with equality lookups.

Motivation

It’s common (and relatively easy) to index large lists using a “side dictionary” to improve performance. Let’s consider this example where we start with a business entity that looks like this:

class Person {
public string Name {
get;
set;
}
public int ? Age {
get;
set;
}
}

… and use a test harness like this,

List < Person > people = new List < Person > ();
for (int i = 1; i <= 20000; ++i) {
people.Add(new Person() {
Name = $ "P{i}", Age = (i % 70) + 10
});
}
Stopwatch sw = new Stopwatch();
sw.Start();
int found = 0;
for (int i = 1; i <= 1000; ++i) {
found += (from p in people where p.Age == 50 select p).Count();
}
sw.Stop();
Console.WriteLine($ "{sw.Elapsed.TotalMilliseconds} microseconds per iteration");

On my machine, this takes about 627 microseconds per pass of the final loop which is counting people aged “50” out of a basic list we populated with 20,000 Person instances. The iterating is using a simple LINQ to Objects expression, which is fast – but we can do better!

Let’s make a change by adding an additional layer – a pass over the data to create an in-memory index using a Dictionary. That changes our final query considerably: finding those aged 50 becomes as simple as accessing the dictionary by key,

Stopwatch sw = new Stopwatch();
sw.Start();
Dictionary < int ? , ICollection < Person >> ageMap = new Dictionary < int ? , ICollection < Person >> ();
foreach(var p in people) {
var age = p.Age;
if (ageMap.ContainsKey(age)) {
ageMap[age].Add(p);
} else {
var list = new List < Person > ();
list.Add(p);
ageMap[age] = list;
}
}
int found = 0;
for (int i = 1; i <= 1000; ++i) {
found += ageMap[50].Count();
}
sw.Stop();
Console.WriteLine($ "{sw.Elapsed.TotalMilliseconds} microseconds per iteration");

Even including the time spent building this dictionary, this runs at 9.8 microseconds per iteration. That’s 64 times faster than our LINQ query! (It’s even more dramatic if we used “int” instead of “int?” – then it’s 368 times faster!) This is probably not a great surprise: dictionaries excel with reads. A question could be: why bother with the list at all? For my purposes here, I’m assuming you may not have control over the collection type: for example, you might be consuming it from an existing object model (such as the one you might be getting out of Entity Framework), or you might have an object that could be indexed in multiple ways and a more generalized collection makes sense.

The natural temptation might be to use indexes like this for any property where you might query as we did for “Age.” Too bad it increased our code size in the above example; there’s an easy solution for that, though.

ConcurrentIndexedList<T>

To streamline the use of indexes, the ConcurrentIndexedList<T> class comes to the rescue. First, let’s add a NuGet reference to CodexMicroORM.Core,

Now our above example can be simplified greatly,

ConcurrentIndexedList < Person > people = new ConcurrentIndexedList < Person > (nameof(Person.Age), nameof(Person.Name));
for (int i = 1; i <= 20000; ++i) {
people.Add(new Person() {
Name = $ "P{i}", Age = (i % 70) + 10
});
}
Stopwatch sw = new Stopwatch();
sw.Start();
int found = 0;
for (int i = 1; i <= 1000; ++i) {
found += people.GetAllByNameNoLock(nameof(Person.Age), 50).Count();
}
sw.Stop();
Console.WriteLine($ "{sw.Elapsed.TotalMilliseconds} microseconds per iteration");

In this case, we’ve indexed both Age and Name, as specified in the constructor. Because of that, we could do something like this,

found += people.GetAllByNameNoLock(nameof(Person.Age), 50).Union(people.GetAllByNameNoLock(nameof(Person.Name), "P15")).Count();

Notice here we’re leveraging LINQ to Objects where we’re looking for the union of results from two types of indexed queries: one by age, one by name. The performance on the age lookup isn’t quite as crazy as with a single raw dictionary, but it’s still 40 times better than the plain LINQ query.

We can even index null values, too, which is something that would have complicated our original hand-rolled index for something like Name which is a reference type. For example, this works as expected, if we supported null Names:

found += people.GetAllByNameNoLock(nameof(Person.Name), null).Count();

There’s one big caveat here: the current implementation of ConcurrentIndexedList<T> requires that your “T” in this case implement the ICEFIndexedListItem interface. I do this for a couple of reasons,

To avoid reflection. It does introduce boxing which does kill some performance, but not as much as would happen if we resorted to using System.Reflection.
Offers some value-add such as the ability to “unwrap” values from your objects. In the case of the framework itself, I want to index objects that are contained inside another object, namely a WeakReference object’s “Target” property. If I didn’t do this, I’d be indexing the WeakReference itself which isn’t useful to me.

The code change I used here to let my Person classwork includes,

class Person: ICEFIndexedListItem {
public string Name {
get;
set;
}
public int ? Age {
get;
set;
}
public object GetValue(string propName, bool unwrap) {
switch (propName) {
case nameof(Person.Name):
return Name;
case nameof(Person.Age):
return Age;
}
throw new ArgumentException("propName is not a valid property.");
}
}

If you’re unable to modify your business objects, a version of this could be constructed that does use reflection. I plan a second article to discuss my “optimized reflection helpers,” also available for free within the framework.

Added Benefits

Most operations on the collection are guarded by reader/writer locks, meaning multiple concurrent readers are allowed, but only one writer at a time can modify the collection. This resembles the behavior of the types in the System.Collections.Concurrent – which I actually used more extensively in the 0.2 version of the framework, but have migrated away for potentially large/unbounded collections based on observed memory and performance issues. In this case, the simplicity of basic dictionaries and slim reader/writer locks seems to do well. Thread-safety may not be useful in every case – but it opens the door to injecting further performance improvements in your apps through the use of the Parallel .NET framework class.

In the 0.5 release of the library, I’ve included the ability to set an initial capacity for the collection. As is true with any dictionary, it’s a good idea to set this if you know up-front the approximate maximum size expected for the collection. Setting it can avoid the resize operations which the collection would otherwise have to do when it hits a size limit (as it grows). If your collections are very large, resize operations can become very slow and memory intensive.

Conclusions

It’s true that LINQ to Objects is often “fast enough” for reasonably small data sets. ConcurrentIndexedList<T> is a general-purpose class to support cases where you either anticipate the possibility for a large data set or can’t anticipate the amount of data at all and want to “play it safe” based on known index requirements.

Other libraries exist that try to address the same problem. For example, i4o offers a way to “wrap” collections in a way that lets you use ordinary LINQ to Objects and have that in turn use indexes over the underlying data. This is a great concept, but in real-world benchmarks, I’ve seen it not always deliver in a truly generic way. (Some cases work well, some do not - and can perform worse.) There are different reasons for this, and it’s relevant to work I’m doing in CodexMicroORM, so will be likely covering deeper details in future articles.

I’ve included additional details about changes in CodexMicroORM release 0.5, found here. Feel free to contribute - pull requests are welcome.