Using Materialized Views Effectively

Materialized views have served me well in day-to-day life – both as an application developer and as a business intelligence developer. In this article, I’ll explain some of the “whys,” along with a specific example that uses the SQL MERGE statement to incrementally reconcile changes into a persistent, read-optimized data store.

Motivations and Mechanics

The link above offers some good detail on why and how we might use materialized views, but I can offer some practical examples, too.

  • We had a system with an event table; we could derive a lot of detail from events, but querying against them was complex, and the product of those queries was heavily used. Simply putting those queries into views was not ideal for performance (it was tried!). Instead, we used triggers on the event table to calculate and store the resulting query values. The rule was that the only way those values were updated was through event changes, and each update was efficient because it worked from the row-by-row feed (leveraging indexes). In keeping with the principle of a materialized view, we could have lost the persisted table and fully recovered it from the original view that served as the source for the drip-feed. These persisted values were used extensively by applications and reports alike, but this was not a “free” process: OLTP transactions bore the cost of maintaining the table, though the benefits far outweighed that cost. (A minimal sketch of this trigger-based approach appears just after this list.)
  • In a BI setting, we had a report that had become extremely slow. The reporting tool (out of our control) had constructed some ugly SQL, largely due to the natural complexity and multiple layers of views. The results were correct – but a correct report isn’t useful if it never completes! We identified a set of three attributes that, if persisted in an additional table, would hopefully act as a core filter when included in the report (based on what I expected the reporting tool and SQL Server would do). That set of attributes was persisted as a read-only materialized view, and the result was exactly as desired: the report became fast with no other changes.
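
For the first example, here’s a minimal sketch of what trigger-based maintenance can look like. The table and column names (dbo.Event, dbo.EventSummary) are hypothetical stand-ins, not the actual system described above; the point is simply that the trigger recomputes derived values only for the keys touched by each statement and upserts them into the persisted table.

  CREATE TRIGGER dbo.tr_Event_MaintainSummary ON dbo.Event
  AFTER INSERT
  AS
  BEGIN
      SET NOCOUNT ON;

      -- Recompute derived values only for the widgets affected by this statement,
      -- then upsert them into the persisted summary table.
      MERGE dbo.EventSummary AS a
      USING (
          SELECT e.WidgetID, MAX(e.EventDate) AS LastEventDate
          FROM dbo.Event e
          WHERE e.WidgetID IN (SELECT i.WidgetID FROM inserted i)
          GROUP BY e.WidgetID
      ) AS t
      ON (a.WidgetID = t.WidgetID)
      WHEN MATCHED THEN
          UPDATE SET LastEventDate = t.LastEventDate
      WHEN NOT MATCHED BY TARGET THEN
          INSERT (WidgetID, LastEventDate) VALUES (t.WidgetID, t.LastEventDate);
  END;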

What about alternatives such as indexed views? They’re a very real option – but only when the limitations of indexed views don’t become a problem. In both situations described above, the need for OUTER JOINs and subqueries ruled out indexed views. Materialized views have essentially no limitations on the complexity of their source queries, and because the target is a physical table, we can use all available tools such as partitioning, COLUMNSTORE indexes, and so on.

A consideration when using materialized views is latency. In my first example, using a trigger meant the materialized view was always in-sync with source table updates. In the second example, the refresh of the materialized view could occur as a final step of the process that was used to populate the source tables (effectively ETL). In many other cases, it’s not practical to keep your materialized view in-sync with your source in real-time. If that’s the case, you’ll need to make decisions about how current your view needs to be: is one minute good enough? An hour? A day? At that point, SQL Agent or your scheduler of choice can be used.

I’ve already alluded to a few ways to update a materialized view – one of which is triggers – but when you’re not using triggers, what are your options?

  1. TRUNCATE / INSERT. This tends to be straightforward, but it has limitations, including the fact that there will be a period when your table is empty while it’s being refreshed. For cases where you have no good time window to filter an incremental update, or where most of your data can change, this approach lets you rebuild your entire table from scratch.
  2. Using sp_rename (or ALTER SCHEMA). This approach is like TRUNCATE/INSERT, but you fill a work table, then swap out your “real” materialized view and swap in the work table to replace it (the swap itself done in a transaction). The advantage is that the period when the table appears empty can be eliminated. The problem tends to be blocking: a schema lock is required to do the swap, and although the swap itself can be fast, it can be blocked by readers and writers. (With a heavily used table, this can become a problem.) A sketch of the swap appears after this list.
  3. Using a MERGE statement. This tends to be a good choice, especially if only a small subset of your rows will change during each refresh.
  4. Using SSIS. This approach is particularly suited to BI solutions, most often when combining data from disparate sources. Ultimately, though, with something like materialized view population, the final step can still involve a MERGE statement, called through a SQL Task.
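
To illustrate option 2, here’s a rough sketch of the swap, assuming a work table that has just been fully repopulated; the table names (dbo.MyMatView, dbo.MyMatView_Work) are purely illustrative.

  -- dbo.MyMatView is the "live" materialized view; dbo.MyMatView_Work was just reloaded.
  BEGIN TRANSACTION;

  -- Requires schema modification locks, so concurrent readers/writers can block this.
  EXEC sp_rename 'dbo.MyMatView', 'MyMatView_Old';
  EXEC sp_rename 'dbo.MyMatView_Work', 'MyMatView';

  COMMIT TRANSACTION;

  -- The old copy can now become the work table for the next refresh.
  EXEC sp_rename 'dbo.MyMatView_Old', 'MyMatView_Work';
  TRUNCATE TABLE dbo.MyMatView_Work;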

A More Detailed Example

Let’s look at a contrived example that contains a single core event table and a code table:

  CREATE SCHEMA [Source] AUTHORIZATION dbo
  GO
  CREATE SCHEMA [Dest] AUTHORIZATION dbo
  GO

  CREATE TABLE [Source].[EventType] (
      EventTypeID tinyint NOT NULL IDENTITY PRIMARY KEY,
      EventTypeCode varchar(20) NOT NULL,
      EventTypeDesc varchar(100) NOT NULL);

  INSERT [Source].[EventType] (EventTypeCode, EventTypeDesc) VALUES ('ARRIVE', 'Widget Arrival');
  INSERT [Source].[EventType] (EventTypeCode, EventTypeDesc) VALUES ('CAN_ARRIVE', 'Cancel Widget Arrival');
  INSERT [Source].[EventType] (EventTypeCode, EventTypeDesc) VALUES ('LEAVE', 'Widget Depart');
  INSERT [Source].[EventType] (EventTypeCode, EventTypeDesc) VALUES ('CAN_LEAVE', 'Cancel Widget Depart');

  CREATE TABLE [Source].[Event] (
      WidgetID int NOT NULL,
      EventTypeID tinyint NOT NULL REFERENCES [Source].[EventType] (EventTypeID),
      TripID int NOT NULL,
      EventDate datetime NOT NULL,
      PRIMARY KEY (WidgetID, EventTypeID, TripID, EventDate));

  CREATE INDEX IDX_Event_Date ON [Source].[Event] (EventDate, EventTypeID) INCLUDE (WidgetID);
  CREATE INDEX IDX_Event_Widget ON [Source].[Event] (WidgetID, TripID, EventDate) INCLUDE (EventTypeID);
  CREATE INDEX IDX_Event_Trip ON [Source].[Event] (TripID, WidgetID);
The nature of this scenario is that we record arrival and departure dates for widgets (key: WidgetID), which can come and go multiple times based on a “trip” (key: TripID). If we incorrectly record an arrival or departure, we use a cancellation event that ties back to the widget and trip in question. This fits the idea of a log table that only allows insertion. (Truthfully, we could have modeled this with “cancelled” attributes as well, but let’s assume this is our standard.) A few hypothetical sample events are shown below.
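
As an illustration, here are some made-up events for a single widget: an arrival recorded in error, its cancellation, the corrected arrival, and a departure. The WidgetID and TripID values are arbitrary, and the EventTypeID values assume the IDENTITY column above assigned 1 through 4 in insertion order (matching the literals used in the view that follows).

  INSERT [Source].[Event] (WidgetID, EventTypeID, TripID, EventDate)
  VALUES (1001, 1, 500, '2018-09-01T08:00:00');   -- ARRIVE (recorded in error)
  INSERT [Source].[Event] (WidgetID, EventTypeID, TripID, EventDate)
  VALUES (1001, 2, 500, '2018-09-01T08:05:00');   -- CAN_ARRIVE (cancels the bad arrival)
  INSERT [Source].[Event] (WidgetID, EventTypeID, TripID, EventDate)
  VALUES (1001, 1, 500, '2018-09-01T09:30:00');   -- ARRIVE (the corrected arrival)
  INSERT [Source].[Event] (WidgetID, EventTypeID, TripID, EventDate)
  VALUES (1001, 3, 500, '2018-09-02T17:15:00');   -- LEAVE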

Let’s say our query of interest is this, placing it in a view for convenience,

  CREATE VIEW [Dest].[uv_WidgetLatestState]
  AS
  SELECT
      lw.WidgetID
      , la.LastTripID
      , lw.LastEventDate
      , la.ArrivalDate
      , (SELECT MAX(de.EventDate)
          FROM [Source].[Event] de
          WHERE de.EventTypeID = 3
          AND de.WidgetID = lw.WidgetID
          AND de.TripID = la.LastTripID
          AND NOT EXISTS
              (SELECT 0
              FROM [Source].[Event] dc
              WHERE lw.WidgetID = dc.WidgetID
              AND la.LastTripID = dc.TripID
              AND dc.EventTypeID = 4
              AND dc.EventDate > de.EventDate)) AS DepartureDate
  FROM
      (SELECT
          e.WidgetID
          , MAX(e.EventDate) AS LastEventDate
      FROM
          [Source].[Event] e
      GROUP BY
          e.WidgetID) lw
      LEFT OUTER JOIN
      (SELECT
          ae.WidgetID
          , ae.TripID AS LastTripID
          , ae.EventDate AS ArrivalDate
      FROM
          [Source].[Event] ae
      WHERE
          ae.EventTypeID = 1
      AND ae.EventDate =
          (SELECT MAX(la.EventDate)
          FROM [Source].[Event] la
          WHERE la.EventTypeID = 1
          AND la.WidgetID = ae.WidgetID
          AND NOT EXISTS
              (SELECT 0
              FROM [Source].[Event] ac
              WHERE la.WidgetID = ac.WidgetID
              AND la.TripID = ac.TripID
              AND ac.EventTypeID = 2
              AND ac.EventDate > la.EventDate))) AS la ON lw.WidgetID = la.WidgetID
This tells us the most recent arrival and departure date for a widget, along with the trip identifier for the last trip, based on event dates. (This supports the possibility that an arrival could get cancelled, although our standard will be to only store cases where we have a non-NULL ArrivalDate.) It’s obvious this type of query can be useful to manage the current business process for our hypothetical widgets - our events remain as detail to document history.
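
To show the shape of the output, querying the view for the hypothetical widget inserted earlier would return a single row; the values noted below are simply what that sample data implies, not output from the test harness.

  SELECT *
  FROM [Dest].[uv_WidgetLatestState]
  WHERE WidgetID = 1001;

  -- Expected for the sample events: WidgetID = 1001, LastTripID = 500,
  -- LastEventDate = 2018-09-02 17:15, ArrivalDate = 2018-09-01 09:30,
  -- DepartureDate = 2018-09-02 17:15.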

Setting aside the possibility that indexing the source tables alone might give acceptable performance, suppose it’s important that we persist the results of this query. Our materialized view table could be defined as,

  CREATE TABLE [Dest].[WidgetLatestState] (
      WidgetID int NOT NULL PRIMARY KEY,
      LastTripID int NOT NULL,
      LastEventDate datetime NOT NULL,
      ArrivalDate datetime NOT NULL,
      DepartureDate datetime NULL);

  CREATE INDEX IDX_WidgetLatestState_LastTrip ON [Dest].[WidgetLatestState] (LastTripID);
  CREATE INDEX IDX_WidgetLatestState_Arrival ON [Dest].[WidgetLatestState] (ArrivalDate);
  CREATE INDEX IDX_WidgetLatestState_Departure ON [Dest].[WidgetLatestState] (DepartureDate);
In this case, our key for the query is the WidgetID – all other values are derived details about a given widget. We’ve defined this as the primary key, and we’ve added a few additional non-clustered indexes on the target.

Let’s also assume we’re fine with having the target data updated on a schedule. We can do an effective “full refresh” using a single MERGE statement,

  MERGE [Dest].[WidgetLatestState] AS a
  USING (
      SELECT
          v.[WidgetID]
          , v.[LastTripID]
          , v.[LastEventDate]
          , v.[ArrivalDate]
          , v.[DepartureDate]
      FROM
          [Dest].[uv_WidgetLatestState] v
  ) AS t
  ON (
      a.[WidgetID] = t.[WidgetID]
  )
  WHEN MATCHED AND t.ArrivalDate IS NOT NULL THEN
      UPDATE
          SET LastTripID = t.LastTripID
          , LastEventDate = t.LastEventDate
          , ArrivalDate = t.ArrivalDate
          , DepartureDate = t.DepartureDate
  WHEN NOT MATCHED BY TARGET AND t.ArrivalDate IS NOT NULL THEN
      INSERT (
          WidgetID
          , LastTripID
          , LastEventDate
          , ArrivalDate
          , DepartureDate
      ) VALUES (
          t.[WidgetID]
          , t.[LastTripID]
          , t.[LastEventDate]
          , t.[ArrivalDate]
          , t.[DepartureDate]
      )
  WHEN MATCHED AND t.ArrivalDate IS NULL THEN
      DELETE;
My preference in this situation is to deal in three primary objects: the target table (the materialized view), a view that expresses the desired query, and a procedure that houses our MERGE (plus adds some logging). With the hard logic contained in the view, the MERGE is largely boilerplate, which makes it easy to templatize. A sketch of such a procedure follows.
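
Here is roughly what that wrapper procedure could look like; the procedure name and the dbo.ProcessLog logging table are assumptions for illustration, not part of the test harness.

  CREATE PROCEDURE [Dest].[up_WidgetLatestState_Merge]
  AS
  BEGIN
      SET NOCOUNT ON;

      DECLARE @rows int;
      DECLARE @start datetime2 = SYSDATETIME();

      MERGE [Dest].[WidgetLatestState] AS a
      USING (SELECT v.WidgetID, v.LastTripID, v.LastEventDate, v.ArrivalDate, v.DepartureDate
             FROM [Dest].[uv_WidgetLatestState] v) AS t
      ON (a.WidgetID = t.WidgetID)
      WHEN MATCHED AND t.ArrivalDate IS NOT NULL THEN
          UPDATE SET LastTripID = t.LastTripID
              , LastEventDate = t.LastEventDate
              , ArrivalDate = t.ArrivalDate
              , DepartureDate = t.DepartureDate
      WHEN NOT MATCHED BY TARGET AND t.ArrivalDate IS NOT NULL THEN
          INSERT (WidgetID, LastTripID, LastEventDate, ArrivalDate, DepartureDate)
          VALUES (t.WidgetID, t.LastTripID, t.LastEventDate, t.ArrivalDate, t.DepartureDate)
      WHEN MATCHED AND t.ArrivalDate IS NULL THEN
          DELETE;

      SET @rows = @@ROWCOUNT;

      -- Minimal logging against a hypothetical dbo.ProcessLog table.
      INSERT dbo.ProcessLog (ProcessName, RowsAffected, StartTime, EndTime)
      VALUES ('Dest.WidgetLatestState merge', @rows, @start, SYSDATETIME());
  END;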

I’ve created a test harness available on GitHub that demonstrates the performance of doing a TRUNCATE/INSERT using 2,150,000 source events. (Note: if you clone this repository to try it out, search for the “TODO” and change the file path of the data file to match your local directory for where you extract the Event20180903-1336.dat file.) The result of 4.15 seconds (3 trials, averaged) becomes a baseline on top of which we can add tweaks. The most obvious is to see what the performance is by doing a MERGE with no data changes, post-initial-population. This is 3.40 seconds, and checking @@ROWCOUNT after the operation, we see that it’s updated 196,291 records – which is basically everything in the materialized view. In this case, there was a mild performance gain over TRUNCATE/INSERT and we didn't spend any time with no data in the target table.

The first real tweak is to filter our first WHEN MATCHED clause to check for material changes on non-key columns, as accomplished by, 

  WHEN MATCHED
       AND t.ArrivalDate IS NOT NULL
       AND ((a.[LastTripID] <> CONVERT(int, t.[LastTripID]))
            OR (a.[LastEventDate] <> CONVERT(datetime, t.[LastEventDate]))
            OR (a.[ArrivalDate] <> CONVERT(datetime, t.[ArrivalDate]))
            OR (a.[DepartureDate] <> CONVERT(datetime, t.[DepartureDate]))
            OR (a.[DepartureDate] IS NULL AND t.[DepartureDate] IS NOT NULL)
            OR (a.[DepartureDate] IS NOT NULL AND t.[DepartureDate] IS NULL)) THEN
This costs something, namely the need to read and compare data, but saves on the need to write back rows where nothing changed. In the test harness, this shows as a net benefit with a result of 1.14 seconds, and @@ROWCOUNT shows zero rows affected.

Another tweak is to use a control date to limit the scope of change checks to a certain timeframe. That timeframe could be fixed (e.g. the last month) or be tied to when the process was last run. The latter obviously leads to the minimum necessary range but does require some extra infrastructure, including maintaining the control date in a persisted form. For storage of this date, I prefer to use a name/value pair table that can support multiple such needs within a single (BI) database. Doing it this way also requires that we have a way to identify “change dates” in the source data, something I discuss in another recent C# Corner article. (In my example here, the LastEventDate is assumed to be monotonically increasing.)
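
A minimal sketch of that kind of name/value control table, and of reading and saving the control date around the refresh, might look like the following; the table and key names are assumptions, not what the test harness uses.

  -- Hypothetical name/value control table (one row per setting).
  CREATE TABLE dbo.ProcessControl (
      ControlName varchar(100) NOT NULL PRIMARY KEY,
      ControlValue varchar(100) NOT NULL);

  -- Read the last-processed date; default far in the past on the first run.
  DECLARE @lastprocessed datetime;

  SELECT @lastprocessed = CONVERT(datetime, ControlValue)
  FROM dbo.ProcessControl
  WHERE ControlName = 'WidgetLatestState.LastProcessed';

  SET @lastprocessed = COALESCE(@lastprocessed, '19000101');

  -- ... run the MERGE here, filtering the source view on LastEventDate > @lastprocessed ...

  -- Persist the new high-water mark for the next run.
  DECLARE @newmark datetime;

  SELECT @newmark = MAX(LastEventDate) FROM [Dest].[WidgetLatestState];

  UPDATE dbo.ProcessControl
  SET ControlValue = CONVERT(varchar(30), @newmark, 126)
  WHERE ControlName = 'WidgetLatestState.LastProcessed';

  IF @@ROWCOUNT = 0
      INSERT dbo.ProcessControl (ControlName, ControlValue)
      VALUES ('WidgetLatestState.LastProcessed', CONVERT(varchar(30), @newmark, 126));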

Using a control date is something I demonstrate in the GitHub script, with an important change being to filter the query of the source view itself,

  USING (
      SELECT
          v.[WidgetID]
          , v.[LastTripID]
          , v.[LastEventDate]
          , v.[ArrivalDate]
          , v.[DepartureDate]
      FROM
          [Dest].[uv_WidgetLatestState] v
      WHERE
          v.LastEventDate > @lastprocessed
The remainder of the query already has the necessary bits to insert, update or delete depending on what the final state of the data is, as returned by the view.

The performance of this final version is 0.46 seconds - 89% better than our baseline! (That includes all the overhead of dealing with loading and storing the control date.) Doing a similar test with some added new events shows similar performance.

Our non-clustered indexes are there to support efficient lookup against the calculated fields (e.g. LastEventDate, DepartureDate). I also tried running a query against the source view as opposed to the materialized view as another benchmark,

  DECLARE @temp int;

  SELECT @temp = COUNT(*)
  FROM [Dest].[uv_WidgetLatestState] w
  WHERE w.DepartureDate IS NULL
  AND w.ArrivalDate IS NOT NULL;
Unsurprisingly, the materialized view’s performance is 94% better than using the source tables through the “non-materialized” view.

Conclusion

Materialized views can be a life-saver for both applications and BI solutions – but they should follow the design principle of being a read-only, fully-computed product of other data.

I showed how a simple MERGE statement supported the incremental population of a materialized view, with good performance. My example was deliberately simple, and you should evaluate your own workload based on what’s required and what’s acceptable, considering performance, complexity, data integrity, etc. Doing this for every query in your system is impractical, so pick and choose wisely.

I have another blog article that expands on some of what I’ve discussed here and offers some ideas on how to templatize the building of materialized views, significantly cutting down on the coding effort.

