Patterns for centralising reference data management

Introduction

I'll start by defining reference data: reference data is the data an application needs to run, excluding any data it collects from the user. For example, a stock trading application would have the current stock prices as its reference data. It doesn't collect this from the user - I'd be a millionaire if I could enter the price at which I want to trade my shares - but it is data that is required for the application to function.

Reference data in its own right isn't a problem; however, when you want to share the same data with multiple applications, maintaining that data can become a problem. This is especially true when you start to look into architectures like SOA or other high cohesion / low coupling paradigms, where every service or application is a completely self-contained unit that isn't dependent on any other system to run.

The purpose of this article is to collect a number of different approaches to centrally managing this reference data and to highlight their benefits and drawbacks. No one approach is by definition better than the others, and this article does not aim to recommend which approach should be implemented; rather, it highlights the merits and drawbacks of each approach so that a suitable one can be chosen for the requirements at hand. Feedback is greatly appreciated, be it additional patterns or changes to existing ones.

Problem statement

Applications are often built in isolation to ensure high cohesion and low coupling between them. The drawback of this approach is that little is shared between applications, including reference data, which is often common to multiple applications. Think of data such as marital statuses, post code (zip code for US readers) lookup data or current stock prices, to name but a few.

A typical scenario is shown in the image below. The arrows indicate the flow of reference data (all other data is ignored in this article to keep the images clean).

1.gif

Two applications live inside their own domains and each has its own datastore. These datastores contain data that could be shared between the two applications. The drawbacks of this approach include:

  • Duplication of data
  • Duplication of management effort
  • No single version of the truth (e.g. if the data is different, which is right?)

The patterns in this article are meant to offer some ideas on how to remove these drawbacks.

Patterns

Replicated reference data

A central datastore contains a maintained version of the reference data. This data is replicated into application-specific datastores either periodically or on request. The applications have no knowledge of the central datastore and only speak to their local database, which contains the reference data.

2.gif
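
To make the pattern concrete, here is a minimal sketch of what the replication step could look like, assuming hypothetical SQLite datastores (one central, one per application) and a hypothetical marital_status table; a real setup would more likely use the replication features of the database platform itself.

import sqlite3

# Hypothetical datastores: one central, one local copy per application.
CENTRAL_DB = "central_reference.db"
APP_DBS = ["app_one_local.db", "app_two_local.db"]

def replicate_reference_data():
    # Read the maintained reference data from the central datastore.
    central = sqlite3.connect(CENTRAL_DB)
    rows = central.execute("SELECT code, description FROM marital_status").fetchall()
    central.close()

    # Push a full copy into each application-specific datastore.
    for app_db in APP_DBS:
        local = sqlite3.connect(app_db)
        with local:  # single transaction per application datastore
            local.execute(
                "CREATE TABLE IF NOT EXISTS marital_status "
                "(code TEXT PRIMARY KEY, description TEXT)")
            # The local copy is replaced wholesale, so any local edits are
            # wiped out - one of the drawbacks listed below.
            local.execute("DELETE FROM marital_status")
            local.executemany(
                "INSERT INTO marital_status (code, description) VALUES (?, ?)",
                rows)
        local.close()

if __name__ == "__main__":
    replicate_reference_data()  # run periodically, e.g. from a scheduler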

Benefits of this pattern
  • Reference data is maintained in a single location
  • No dependency between the applications and the central datastore
Drawbacks of this pattern
  • No clear single version of the truth: updates can be made in an application-specific database and later wiped out by the replication
  • Replication causes a tight coupling in schema between source and destination
  • The application has no control over the refreshing of the data; some applications need more frequent updates of certain reference data than others

Centralised reference data

The reference data only exists in a central datastore. Each application using the data references the datastore directly, having full knowledge of the existence and schema of the reference datastore.

3.gif
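
A minimal sketch of the coupling this creates is shown below; the connection details and the marital_status table are hypothetical, but the point is that every application embeds the schema of the central datastore in its own queries.

import sqlite3

# Hypothetical shared central datastore that every application points at.
CENTRAL_DB = "central_reference.db"

def load_marital_statuses():
    # The application queries the central datastore directly, so it must know
    # the table and column names - the tight coupling listed in the drawbacks.
    conn = sqlite3.connect(CENTRAL_DB)
    try:
        return conn.execute(
            "SELECT code, description FROM marital_status").fetchall()
    finally:
        conn.close()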

Benefits of this pattern
  • Reference data is maintained in a single location
  • Single version of the truth
  • Applications always have the latest version of the reference data
Drawbacks of this pattern
  • Depending on the network topology, this can damage the performance of the applications
  • Tight coupling between the central datastore and the applications
  • Changes to the central datastore schema require changes to all dependent applications
  • If the central datastore is down, or there are network problems between the applications and the central datastore, the applications will experience downtime
  • Maintenance on the central datastore is limited to times when the applications can be taken down

Centralised service for reference data

The reference data only exists in a central datastore and a service is exposed through which the data can be retrieved. The applications only know of the service and do not know about the underlying storage mechanism(s).

4.gif
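
As a rough sketch, an application might retrieve the data over HTTP as below; the service URL and JSON shape are hypothetical, the point being that the application only knows the service contract, not the datastore behind it.

import json
import urllib.request

# Hypothetical endpoint exposed by the central reference data service.
SERVICE_URL = "http://reference.example.com/api/v1/marital-statuses"

def fetch_marital_statuses():
    # The application only depends on the service contract (URL and JSON
    # shape); the datastore behind the service can change without affecting it.
    with urllib.request.urlopen(SERVICE_URL, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))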

Benefits of this pattern
  • Reference data is maintained in a single location
  • Single version of the truth
  • Applications always have the latest version of the reference data
  • Changes to the central datastore (either in schema or even in datastore technology) are hidden behind the service interface, which can often remain unchanged
  • Changes to how the data is exposed can be handled through versioning of the service, maintaining backward compatibility
Drawbacks of this pattern
  • Depending on the network topology, this can damage the performance of the applications
  • Downtime on the central datastore or service, or network problems between these and the applications, will cause downtime on the applications
  • Maintenance on the central datastore and service is limited to times when the applications can be taken down

Centralised service with local caching

This is similar to the "Centralised service for reference data" pattern above; however, the applications cache the reference data locally so that the service needn't be called on every request, and failure of the service can be handled by falling back to the locally cached data. The cache can be implemented with a number of technologies, from a database to an in-memory cache; however, true fallback to locally cached data can only be guaranteed when the local cache is a persistent mechanism.

5.gif
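
A minimal sketch of this pattern is shown below, assuming a hypothetical service URL and a file-based local cache; the refresh rule (here a simple maximum cache age) would in practice follow the application's own business rules, as discussed in the notes further down.

import json
import os
import time
import urllib.request

SERVICE_URL = "http://reference.example.com/api/v1/marital-statuses"  # hypothetical service
CACHE_FILE = "marital_statuses.cache.json"                            # persistent local cache
MAX_CACHE_AGE_SECONDS = 60 * 60                                       # application-specific refresh rule

def cache_is_fresh():
    return (os.path.exists(CACHE_FILE)
            and time.time() - os.path.getmtime(CACHE_FILE) < MAX_CACHE_AGE_SECONDS)

def get_marital_statuses():
    # Serve from the local cache while it is fresh.
    if cache_is_fresh():
        with open(CACHE_FILE) as cache:
            return json.load(cache)
    try:
        # Refresh from the central service and persist the result locally.
        with urllib.request.urlopen(SERVICE_URL, timeout=5) as response:
            data = json.loads(response.read().decode("utf-8"))
        with open(CACHE_FILE, "w") as cache:
            json.dump(data, cache)
        return data
    except OSError:
        # Network problems or service downtime: fall back to the stale cache.
        with open(CACHE_FILE) as cache:
            return json.load(cache)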

Benefits of this pattern
  • Reference data is maintained in a single location
  • Single version of the truth (given that local caches are named in a way that makes it clear they are just that)
  • Applications always have access to the latest version of the reference data
  • Changes to the central datastore (either in schema or even in datastore technology) are hidden behind the service interface, which can often remain unchanged
  • Changes to how the data is exposed can be handled through versioning of the service, maintaining backward compatibility
  • Failure of the network, central service or central datastore does not impact the application
  • Maintenance of either the service or the central datastore can be done at any time
  • By throttling the number of requests to the central service (see the notes below), the performance impact can be managed on a per-application basis
  • Applications can follow their own business rules for refreshing their cache
Drawbacks of this pattern
  • More complex architecture

Notes on these patterns

What if the central service is unavailable?

If the central service is unavailable, the applications can use their local cache to continue functioning on the old reference data until the service is available again. There are some caveats here. If the reference data is held in an in-memory cache, it could expire, or a crash of the application could take the cache down with it, causing it to lose its data. Using a distributed cache for this type of application is recommended, and cache expiry can be prevented by using cache priorities and by managing the amount of memory available to the cache. To guarantee recoverability of the application while the service is down, however, the cache may need to be a persistent (typically disk-based) mechanism such as a database or a file.
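
As a rough illustration of that last point, a simple write-through from the in-memory cache to disk (file name hypothetical) is enough to let the application restart and keep running on the persisted copy while the service is down.

import json
import os

CACHE_FILE = "reference_cache.json"   # hypothetical persistent backing store
_memory_cache = {}                    # fast in-memory copy used per request

def store_reference_data(key, data):
    # Write-through: keep the in-memory copy for speed, but persist to disk so
    # the data survives an application restart while the service is unavailable.
    _memory_cache[key] = data
    with open(CACHE_FILE, "w") as f:
        json.dump(_memory_cache, f)

def load_reference_data_on_startup():
    # Rebuild the in-memory cache from disk; if the central service cannot be
    # reached at startup, the application can still run on this persisted copy.
    global _memory_cache
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            _memory_cache = json.load(f)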

Managing cache refreshing

The two things to consider when setting up cache refreshing are how often it needs to happen and what the trigger should be. The question of "how often" can range from "almost all the time" through "once per day" all the way up to "only when told to by operational staff". This may seem to answer the second consideration of what the trigger should be; however, time isn't the only trigger that can be used. For example, the cache could be refreshed every X requests on the reference data. The benefit of such an approach is that you can limit the number of requests that can be run against out-of-date information, which limits the potential revenue impact. A downside, however, is that during peak times the refreshes will be more frequent, causing a bigger performance impact.
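
A request-count trigger of this kind could look like the sketch below; the threshold is a hypothetical, application-specific business rule.

import threading

REFRESH_EVERY_N_REQUESTS = 1000   # hypothetical per-application business rule
_request_count = 0
_count_lock = threading.Lock()

def should_refresh_cache():
    # Count-based trigger: signal a refresh after every N requests that use the
    # reference data, rather than on a timer. Under peak load this fires more
    # often, which is the performance trade-off described above.
    global _request_count
    with _count_lock:
        _request_count += 1
        if _request_count >= REFRESH_EVERY_N_REQUESTS:
            _request_count = 0
            return True
        return False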

Sync vs. Async

If cache refreshes are done synchronously, a user has to wait for the cache refresh to happen during their request, and other users may back up waiting for that user's refresh to finish. Doing the cache refresh asynchronously, however, can easily take you into the territory of scheduled jobs and the like, which gets messy. My personal recommendation would be for the request that triggers a cache refresh to start an asynchronous process that updates the cache, while the request itself continues to run against the cached data. This doesn't hold up the user's request but still causes the cache to be updated according to the business rules set for the application. A benefit of this approach is that the cache is managed within the application itself, rather than by a separate job that runs elsewhere.
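
A sketch of that recommendation is below: when a request finds that a refresh is due (for example via a trigger like the request counter above), it starts the refresh on a background thread and carries on against the cached data. The refresh_cache body is left as a placeholder for a call to the central service.

import threading

_refresh_in_progress = threading.Event()
_cached_data = {"marital_statuses": []}   # hypothetical locally cached reference data

def refresh_cache():
    # Placeholder: call the central service and overwrite the local cache,
    # as in the earlier sketches.
    pass

def handle_request(refresh_due):
    # When a refresh is due, start it asynchronously (at most one at a time)
    # instead of making the user wait for it.
    if refresh_due and not _refresh_in_progress.is_set():
        _refresh_in_progress.set()

        def _do_refresh():
            try:
                refresh_cache()
            finally:
                _refresh_in_progress.clear()

        threading.Thread(target=_do_refresh, daemon=True).start()

    # The user's request continues against the (possibly slightly stale) cache.
    return _cached_data["marital_statuses"]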

