Designing and implementing a Data Warehouse: Part 2

Article

Introduction

Before reading this article, Please read my previous article Designing and Implementing a Data Warehouse: Part 1.

Business Intelligence has become a buzzword in recent years as a support to decision-making. Today we can find several database systems, which include data warehousing, online analytical processing (OLAP), and data mining technologies. Data warehousing provides efficient storage, maintenance, and retrieval of data. OLAP is a service that provides a way to create ad hoc queries against the data warehouse in order to answer important business questions. Data mining is a disciple comprising of several algorithms for discovering knowledge in a large bulk of data.

In order to build a data warehouse solution, we need to model a consistent architecture where the operational data will fit well in an integrated and enterprise-wide view as well as to take into consideration a handful of implementation strategies to provide a high-quality application. The design and implementation of a data warehouse solution sometimes is a very complex challenge in theory and practice. In this article, I will cover the main principles and techniques to design and implement a data warehouse, providing my own experience in such an overwhelming challenge.

This is the second article of a series of articles that I want to write in order to share my knowledge and experience in this subject matter.

Getting started with the data warehouse solution

A data warehouse has many functions to perform, such as data extraction, data loading, data transformation, data storage, and provision of an interface. Figure 1 illustrates a high-level architecture of the data warehouse.

Data Warehouse Solution

Figure 1

There are several technologies and components to support the functions of a data warehouse. Figure 2 shows a detailed view of the architecture of a data warehouse.

Data warehouse

Figure 2

On the left side, you can see the Data Sources, which enable extracting data from operational systems and importing additional data from external sources. Then all the data from the Data Sources is integrated into the Data Stating component in order to remove inconsistencies, detect missing values and transform the data to be stored in a suitable format for easy access and analysis, in this case, structurally consistent with the star schema dimensional model. This step is very important in order to maintain data quality. Once the data is prepared in the Data Staging component of the solution, then later the data is loaded in the data warehouse storage medium, and the Data Storage component manages the storage of the data as well as keeps track of information related to the data, known as meta-data. It's remarkable to say that the data in the Data Storage is characterized as stable and de-normalized (following the star schema), representing snapshots at specified periods, although the data may change periodically. It's also a critical decision between using a relational database or a multidimensional database for the implementation of the data warehouse. Finally, the Information Delivery component enables an interface to access the data in order to be displayed.

Regarding to the Data Storage component, we have three physical mechanisms to store data in a dimensional database: ROLAP, MOLAP, and HOLAP. ROLAP stands for Relational OLAP; in this case, the dimensional model is mapped to a relational database engine such as Oracle and SQL Server. MOLAP stands for Multidimensional OLAP; in this case, the dimensional model is mapped to cubes, for example, in SQL Server Analysis Server. And finally, we have HOLAP, which stands for Hybrid OLAP a combination of ROLAP and MOLAP.

Building the data warehouse solution

This section is intended to take everything we have discussed so far and begin the process of putting it into practice by describing and demonstrating the full process of analyzing, designing, and building a dimensional model and the implementation of the underlying data warehouse. A data warehouse is typically developed in one subject area at a time, and in this section, we're going to develop a department-specific data warehouse or data mart for sales analysis. The result mode will be a star schema format to enable multidimensional analysis.

Designing the dimensional data model

Let's suppose that our client is a manufacturing company whose transactional database is AdventureWorks, shipped with Microsoft SQL Server 2005. This company is finding it difficult to give the sales manager meaningful information to answer several business questions to keep up with a changing and challenging environment, and the report execution is adversely affecting the performance of the transaction processing systems during peak reporting periods. The main business questions to answer are concerning sales volumes and shipped units by-products by customers by region and by time periods.

The first step is the development of the staging database that receives data from the underlying data sources (the data is almost stored in separate transactional systems), transforms the data, and loads the data through an integration process in order to build the multidimensional database.

Once the requirements are gathered, all that we need to do is to establish what the facts are (the transactional data) and what the dimensions are (reference data or master data).

The next diagram (in Figure 3) shows a simple star schema containing a central fact table and several dimension tables surrounding it. You can see that every dimension has a surrogate key generated within the data warehouse as the primary key of the dimension table as well as a business key that represents the different primary keys (for example, a product may have a product code in a data source different semantically and by its data type respect to the other data sources) of the underlying tables to be consolidated into the dimension. All the surrogate keys are declared as integer fields with the IDENTITY property turned on. In the case of the fact table, the primary key is usually the combination of all the dimension keys. However, depending on the semantics of the problem and the level of summarization, it does not matter whether we have multiple records. You can declare another attribute on the fact table as the primary key.

Dimensional Data Model

Figure 3

The dimension tables such as ProductDim, CustomerDim, RegionDim, and TimePeriodDim enable describing the SalesFact fact table. One interested area in the model is Customers, and the Sales Manager is only interested in the CustomerType attribute because this manufacturer does not sell directly to consumers but only to retailers. The simple SQL statement to load the data to this dimension from a relational schema is shown in Listing 1.

select CustomerID as CustomerBusinessID,CustomerType
from Sales.Customer;

Listing 1

In order for the Sales Manager to get more insight into the customers, the RegionDim table was introduced. This table records the fact that most companies divide up geographical areas into "sales territories", and these territories might not translate directly to a political association such as a state or province. The data for the RegionDim table may be extracted directly from a table in the relational schema using the SQL statement shown in Listing 2.

Select TerritoryID as RegionBusinessID, [Name]+'-'+CountryRegionCode as RegionName
from Sales.SalesTerritory;

Listing 2

The ProductDim table represents the dimension related to the area of products. Products can be grouped by into subcategories and categories, and each product can contain descriptive attributes useful for the final reports. This dimension is characterized by the product name, color, and subcategory name. The data for this dimension may be extracted directly from the Product and ProductSubcategory tables in the relational schema, as shown in Listing 3.

Select p.ProductID as ProductBusinessID, p.[Name] as ProductName, p.Color, s.[Name] as SubcategoryName
from Production.Product p inner join Production.ProductSubcategory s
on p.ProductSubcategoryID=s.ProductSubcategoryID;

Listing 3

And finally, we have to deal with the Time dimension represented by the TimePeriodDim table. It's remarkable to say that most star schema designs will have a time dimension because we often need to analyze the information from different time periods. For example, the Sales Manager wants to see the sales volumes in the current month or compare the value of the current period with the same period last year, and so on. This dimension includes information about the year, day, quarter, month, and week. In this way, we can access and summarize data in any of these periods, not a single period such as a year. In order to define the time dimension, we need to understand the level of detail required by the fact table. Sometimes we think that we don't need a date timetable because modern database systems include a lot of functions to process date and time data types. But if we include a date timetable containing a record for each day, we can semantically extend the logic associated with the date time data type by flagging days as holidays or other special days such as Christmas or the 4th of July. You can also accommodate the calendar to see fiscal years.

As you can see, the time dimension is a special dimension, and most of the time, we need to populate the dimension with calendar dates and date parts such as year, month, quarter, and week. In order to generate this dimension, you can come up with SQL business logic (see Listing 4).

declare @dtStartDate datetime;
declare @dtEndDate datetime;

set @dtStartDate = '1/1/2008';
set @dtEndDate = '1/1/2010';

while(@dtStartDate <= @dtEndDate)
begin
  select @dtStartDate, DATEPART(YYYY,@dtStartDate) as Calendar_Year,
         DATEPART(M,@dtStartDate) as Calendar_Month,
         DATEPART(Q,@dtStartDate) as Calendar_Quarter,
         DATEPART(WK,@dtStartDate) as Calendar_Week;
  set @dtStartDate = @dtStartDate + 1;

end;

Listing 4

The fact table SalesFact contains all the keys to the dimensions, such as ProductDim_ID, CustomerDim_ID, RegionDim_ID, and TimePeriodDim_ID, as well as the columns to hold the measures to be analyzed, such as SalesVolumes and ShippedUnits. The data in this table is the result of summarizing data from the SalesOrderHeader and SalesOrderDetail tables in the relational schema through the following SQL statement (see Listing 5).

select sd.ProductID as ProductDim_ID, sh.CustomerID as CustomerDim_ID, sh.TerritoryID as RegionDim_ID, OrderDate as TimeDim, TotalDue as SalesVolumes, sum(OrderQty) as ShippedUnits

from Sales.SalesOrderHeader sh inner join Sales.SalesOrderDetail sd
on sh.SalesOrderID=sd.SalesOrderID

group by sd.ProductID, sh.CustomerID, sh.TerritoryID, OrderDate, TotalDue;

Listing 5

In this case, the measures SalesVolumes and ShippedUnits are fully additive, meaning that the values of these attributes may be summed up by simple addition. When you cannot add up a certain measure attribute from the underlying instances and come up with the total, this kind of measure is semi-additive. It's remarkable to say that cubes are great at handling additive measures.

Sometimes you need to add columns to fact tables which are neither dimension keys nor numeric measures, to help the users drill down to a level of detail that includes some business entities and a full detail of each. This is commonly known as a degenerate dimension because it's basically a dimension with only one column, which is the business key.

Conclusion

In this second article, I've explained how to get started with a data warehouse solution.