Handling Data Load Types In Staging Database

One of the fundamental aspects to consider when creating Data Warehouses is the consolidation of data from different sources, whether they are systems or different areas.

This difference between origins presents challenges,

Data standardization

Different sources treat information differently - standardizing codes, representations, situations and contents is necessary to represent them in a single view

Synchronization

There is a difference in the time of availability of information since systems have different periodicity — ERP data loads daily, Human Resources are monthly, Accounting closes on the 18th, and so on. In many situations it is necessary to collect data at different times until they are aligned in the same period and can be combined with integrity;

Useful universe

The selection of useful data for the generation of the DW — We are not going to replicate all the operational content, only the data necessary for the elaboration of the analyses. For this we must select the necessary records and fields for these.

In this article, we will discover two main staging patterns,

  • Full load 
  • Incremental / Delta load

Full load VS Incremental load: How can we decide?
 

Full load

With a full load, the entire target dataset is loaded, and is then completely replaced with the latest version. No additional information, such as technical timestamps for insertion, is required.

For example, a sales department uploads all of its sales through the ETL process in the data warehouse at the end of each day. 

Let’s suppose that 30 sales were made on a Monday, so that on Monday at 8pm a table of 30 records would be uploaded. 

Then, on Tuesday, another 15 sales were made which logically need to be added. So on Tuesday at 8pm (the same time of the automated job), we have a planned full load, Monday’s 30 records, as well as Tuesday’s 15 records, are uploaded – supposing that our system is well maintained enough to handle big data flow since the following example is simplified but the principle remains the same.

Incremental load

Also known as the delta load and remains mainly on the differences between the target and source data.

 There are 2 types of sub-types of incremental loads, depending on the volume of data loaded: 

  • streaming incremental load
  • batch incremental load

Let’s start understanding the incremental load first. Following the example in the previous section, the store that made 200 sales on Tuesday will load only the additional 200 records to the sales table destination, instead of reloading all records from scratch. The advantages of using such a pattern are saving time and resources, but we should be aware that it can sometimes increase complexity.

Incremental loading is often much faster than a full load. The major drawback when following such a pattern is maintainability: with an incremental load you can’t re-run the entire data workloads if you encounter an issue since tasks need to be loaded in a specific order, so technical errors will compound the issue as other data queues up.

Batch incremental load

Also called Event-Driven load. Mostly used in traditional data environments, this pattern focuses on extracting batches of data from a source system usually based on a schedule, transforming that data, then loading it to a repository such as a data warehouse or database(relational/non-relational).

Streaming incremental load

Are used mainly when we deal with data coming from real-time applications that require streaming ETL. Following such a pattern must respond to new data in real-time as the data is generated.

Incremental Load vs Full Load
  Full Load Incremental Load
Time More time Less time
Complexity Low High
Rows sync All rows in source data New and updated records only

In the next article, we will work on how to perform an Incremental Load using SSIS with a detailed example. SSIS Incremental Load means comparing the target table against the source data based on specific best practices.