Merging Data Using A Queuing Pattern

Joel Champagne
7y
9.4k
0
4

Article

In my recent article Using Materialized Views Effectively, I offered a few ways to populate data stores that can be critical performance-improvers for your apps and BI environments that use SQL Server. I showed how we could incrementally improve the loading process for these specialized tables - starting from a simple TRUNCATE / INSERT, to a MERGE statement, to a MERGE filtered based on “since the last load.” There’s another method that I’ve been using in production situations, with good success. In the remainder of this article, I’ll walk through both the implementation of this queued merge and its advantages, using a realistic test harness that I’ve made available on GitHub as one way to demonstrate and compare.

The Concept

In my previous article, I worked with imaginary “widgets” which are tracked in our SQL Server database using event records – widget arrival, departure, and cancellation events. I noted that perhaps our widget management application cares about every arrival and departure, but it cares most about the current state of each widget: when did it last arrive? When did it last depart? The basic schema and query to do this is shown here.

CREATE TABLE [Source].[EventType] (
EventTypeID tinyint NOT NULL IDENTITY PRIMARY KEY,
EventTypeCode varchar(20) NOT NULL,
EventTypeDesc varchar(100) NOT NULL)
GO
INSERT [Source].[EventType] (EventTypeCode, EventTypeDesc) VALUES ('ARRIVE', 'Widget Arrival');
INSERT [Source].[EventType] (EventTypeCode, EventTypeDesc) VALUES ('CAN_ARRIVE', 'Cancel Widget Arrival');
INSERT [Source].[EventType] (EventTypeCode, EventTypeDesc) VALUES ('LEAVE', 'Widget Depart');
INSERT [Source].[EventType] (EventTypeCode, EventTypeDesc) VALUES ('CAN_LEAVE', 'Cancel Widget Depart');
GO
CREATE TABLE [Source].[Event] (
WidgetID int NOT NULL,
EventTypeID tinyint NOT NULL REFERENCES [Source].[EventType] (EventTypeID),
TripID int NOT NULL,
EventDate datetime NOT NULL,
PRIMARY KEY (WidgetID, EventTypeID, EventDate, TripID))
GO
SELECT
lw.WidgetID
, la.LastTripID
, lw.LastEventDate
, la.ArrivalDate
, (SELECT MAX(de.EventDate)
FROM [Source].[Event] de
WHERE de.EventTypeID = 3
AND de.WidgetID = lw.WidgetID
AND de.TripID = la.LastTripID
AND NOT EXISTS
(SELECT 0
FROM [Source].[Event] dc
WHERE lw.WidgetID = dc.WidgetID
AND la.LastTripID = dc.TripID
AND dc.EventTypeID = 4
AND dc.EventDate > de.EventDate)) AS DepartureDate
FROM
(SELECT
e.WidgetID
, MAX(e.EventDate) AS LastEventDate
FROM
[Source].[Event] e
GROUP BY
e.WidgetID) lw
LEFT OUTER JOIN
(SELECT
ae.WidgetID
, ae.TripID AS LastTripID
, ae.EventDate AS ArrivalDate
FROM
[Source].[Event] ae
WHERE
ae.EventTypeID = 1
AND ae.EventDate =
(SELECT MAX(la.EventDate)
FROM [Source].[Event] la
WHERE la.EventTypeID = 1
AND la.WidgetID = ae.WidgetID
AND NOT EXISTS
(SELECT 0
FROM [Source].[Event] ac
WHERE la.WidgetID = ac.WidgetID
AND la.TripID = ac.TripID
AND ac.EventTypeID = 2
AND ac.EventDate > la.EventDate))) AS la ON lw.WidgetID = la.WidgetID
GO

The complexity looks high, and the case was made to persist the query results into a table, on a scheduled basis. (We’re using subqueries and outer joins so indexed views are out-of-bounds here.) The test harness I offered on GitHub can be used to show the performance of trying to merge updates to this table using different approaches. The fastest option was to use a MERGE statement that filters to look at only widgets where there could be a possible change, based on the fact we’ve got a “last updated date” available to tell when any change has occurred that might affect results.

Let’s consider a similar situation: You’ve got a query that’s even more complex – perhaps joining 15+ tables - and given the nature of these tables, the determination of the “latest change” is harder. You might have to do a CASE statement or use a user-defined function (UDF) to encapsulate the need to find the greatest modification date among many tables. When comparing a “last run date” with this calculation to determine the maximum date among many, there’s a good chance you’ll suffer nasty performance degradation. How can we avoid this?

If our source query is efficient when run for a single “primary entity” (a widget in my example), then we can take advantage of that fact by creating a work queue that holds a simple list of entities (by key) and then “work the queue” – either one-by-one or using parallelism. We could, for example, create a queue table that looks like this.

CREATE TABLE dbo.WidgetLatestStateQueue
(WidgetID int NOT NULL PRIMARY KEY)

If we knew that our source query might be affected by an update to a related table, say “WidgetJob”, we might include the population of this queue based on a check of changes within the WidgetJob table that happened since our “last run date”,

INSERT dbo.WidgetLatestStateQueue (WidgetID)
SELECT DISTINCT j.WidgetID
FROM dbo.WidgetJob j
WHERE j.LastUpdatedDate >= @since
AND NOT EXISTS (SELECT 0 FROM dbo.WidgetLatestStateQueue q WHERE q.WidgetID = j.WidgetID);

This pattern can be repeated to handle multiple tables. This approach obviously relies on having a “last updated date” as a standardized audit field used throughout your app (for more on that, see my article Auditing Data Changes in Microsoft SQL Server). Having a non-clustered index on that field is often important for the performance of this queue population step.

Now that we have a singular list that’s based on the primary entity key, we can perform a MERGE that’s targeted directly by key. Review your execution plan to see if this offers a good benefit – if your plan isn’t improved by using a key value, this approach is likely not going to bring joy. (In fact, the GitHub harness illustrates that performance gains aren’t a “sure thing”!)

Implementation Details

If we wanted to encapsulate both the queue population and queue processing into a stored procedure, we could end up scheduling the stored procedure using something like SQL Agent, and we’d be done - if we’re happy with the row-by-row processing of the queue. What might this type of stored procedure look like? I prefer to parameterize it with an optional start date, usually.

CREATE PROCEDURE dbo.up_WidgetLatestStateQueue_ProcessAll
@start datetime2 = NULL
AS
BEGIN
EXEC dbo.up_WidgetLatestStateQueue_Populate @start;
EXEC dbo.up_WidgetLatestStateQueue_Process;
END
GO
CREATE PROCEDURE dbo.up_WidgetLatestStateQueue_Populate
@start datetime2 = NULL
AS
BEGIN
IF @start IS NULL
SET @start = DATEADD(dd, -1, SYSDATETIME());
INSERT dbo.WidgetLatestStateQueue (WidgetID)
SELECT DISTINCT e.WidgetID
FROM
[Source].[Event] e
WHERE
e.EventDate >= @start
AND NOT EXISTS (SELECT 0 FROM dbo.WidgetLatestStateQueue q WHERE e.WidgetID = q.WidgetID);
END
GO

This affords a way to load historical data, such as we might want to do when we first introduce the materialized view and it’s empty. When the value is NULL, I might instead calculate the start date based on a trailing 24 hours if I know the job is running say every few hours. This might lead to some unnecessary reprocessing of records that would not cause a material change, but it also gives some leeway if the job fails for some reason: if it’s running again within a day, there’s no need to go back and explicitly reload from a start date. If the process is sufficiently fast, you’re trading some extra CPU and I/O for peace-of-mind, and you can adjust this type of “dial” to suit your preferences.

The procedure logic turns into two main steps: queue population and queue processing. Queue population can involve multiple INSERTS – all of which should check that the keys being inserted do not already exist in the queue. As a clustered primary key, our lookups in the queue will be efficient.

The queue processing step can be as simple as this,

CREATE PROCEDURE dbo.up_WidgetLatestStateQueue_Process
AS
BEGIN
DECLARE @WidgetID int;
SELECT TOP 1 @WidgetID = WidgetID FROM dbo.WidgetLatestStateQueue;
WHILE @WidgetID IS NOT NULL
BEGIN
EXEC dbo.up_WidgetLatestStateQueue_ByID @WidgetID;
DELETE dbo.WidgetLatestStateQueue WHERE WidgetID = @WidgetID;
SET @WidgetID = NULL;
SELECT TOP 1 @WidgetID = WidgetID FROM dbo.WidgetLatestStateQueue;
END
END
GO
CREATE PROCEDURE dbo.up_WidgetLatestStateQueue_ByID
@WidgetID int
AS
BEGIN
MERGE [Dest].[WidgetLatestState] AS a
USING (
SELECT
v.[WidgetID]
, v.[LastTripID]
, v.[LastEventDate]
, v.[ArrivalDate]
, v.[DepartureDate]
FROM
[Dest].[uv_WidgetLatestState] v
WHERE
v.WidgetID = @WidgetID
) AS T
ON
(
a.[WidgetID] = t.[WidgetID]
)
WHEN MATCHED
AND t.ArrivalDate IS NOT NULL
AND ((a.[LastTripID] <> CONVERT(int, t.[LastTripID]))
OR (a.[LastEventDate] <> CONVERT(datetime, t.[LastEventDate]))
OR (a.[ArrivalDate] <> CONVERT(datetime, t.[ArrivalDate]))
OR (a.[DepartureDate] <> CONVERT(datetime, t.[DepartureDate]) OR (a.[DepartureDate] IS NULL AND t.[DepartureDate] IS NOT NULL) OR (a.[DepartureDate] IS NOT NULL AND t.[DepartureDate] IS NULL))) THEN
UPDATE
SET LastTripID = t.LastTripID
, LastEventDate = t.LastEventDate
, ArrivalDate = t.ArrivalDate
, DepartureDate = t.DepartureDate
WHEN NOT MATCHED BY TARGET AND t.ArrivalDate IS NOT NULL THEN
INSERT (
WidgetID
, LastTripID
, LastEventDate
, ArrivalDate
, DepartureDate
) VALUES (
t.[WidgetID]
, t.[LastTripID]
, t.[LastEventDate]
, t.[ArrivalDate]
, t.[DepartureDate]
)
WHEN MATCHED AND t.ArrivalDate IS NULL THEN
DELETE;
END
GO

Notice we’re pulling an entity key from the queue into a local variable which becomes our main filter against our source query for our MERGE, and then once the MERGE has succeeded, we can DELETE the value from the queue safely and continue until there’s nothing left to do. This implementation simply loops until the queue table is empty.

Nifty Benefits

One of the cool benefits of this approach is you’ve effectively instrumented your merge process. You could run this,

SELECT COUNT(*) FROM dbo.WidgetLatestStateQueue

… and you’d know how many records are left to process. (It’s geeky fun to see the number spike, then bleed down to zero.) I like to additionally log the number of expected rows to process prior to the processing loop, so you could even get a “percent complete” if you wanted to.

Another benefit can be reduced contention. If you’re doing a massive, big-bang merge/load, you might end up doing some heavy locking whereas, with this approach, your locks are very finite. In the real world in a couple of cases, it was the contention that drove me to this solution more than performance – and with the change to row-by-row, problems were reduced.

A third benefit is that we can selectively control how much data to refresh - with no code changes. Say, for example, we make a database change that adds a new table to our “big join” – say WidgetJobAttribute (a child of WidgetJob) – we might add an INSERT for our queue based on the changes in this table (like for other tables involved), but also, perform a one-time load/update of data based on this INSERT (i.e. overall job attributes that relate to all widgets).

INSERT dbo.WidgetLatestStateQueue (WidgetID)
SELECT DISTINCT j.WidgetID
FROM dbo.WidgetJob j
JOIN dbo.WidgetJobAttribute a
ON j.WidgetJobID = a.WidgetJobID
WHERE NOT EXISTS (SELECT 0 FROM dbo.WidgetLatestStateQueue q WHERE q.WidgetID = j.WidgetID);

Despite the benefits, a side-effect of this approach is we’ll see data appearing and/or changing bit by bit. In other words, data consistency is sacrificed here, as is acknowledged in this paper: on materialized views. One way I’ve sometimes dealt with this concern is to build my queue at a higher level of granularity than my materialized view. For example, say we have a materialized view to support our WidgetJob entity (one row per job). Instead of queuing changes by WidgetJobID (the primary key of WidgetJob), we might queue based on WidgetID instead. When merging data in this case, we’d see all WidgetJob updates for a given widget, using a single MERGE invocation. There are other design tricks to mitigate concerns, but your requirements will drive what tricks can be used and whether queuing is even feasible or not. (Having a BISM as another layer on top of your database offers another nice way to mitigate this problem.)

Comparisons and the “Real-World”

To get some concrete performance figures, I’ve taken a real-world example and tested it in three different ways, under similar load. This is not my widgets' example – I wanted a bigger, complex data set (while solving a real production requirement at the same time!).

The first approach is using a multi-step MERGE for what I’d consider a “big-bang” update. I did this by populating key values into a table that serves as a candidate list of items to process, but instead of processing items individually, I join the “to process” table to the source query and MERGE (source query to target materialized view) using all ~280,000 rows in one step. The main difference compared to the fastest approach presented in my last article is this includes the tuning step of determining the list of candidate items to insert/update ahead of time to avoid a complex query against our “last updated date” over many tables. This works – but takes 108 minutes, which is our baseline for this exercise. Note, of these 108 minutes, five seconds are what’s expended in populating the candidate list – so that aspect is clearly not expensive.

Next, I tried switching to a “queue” and record-by-record processing, largely following what’s described in the previous sections. The performance here for the same 280,000 “keys” is 55.5 minutes - so clearly, better for this workload and server load. Of note: expect greater variability in performance with this approach, in practice, based on server load. Also, if your “big bang” is usually a small number of rows or relatively simple, the benefits of row-by-row diminishes significantly. I have at times been pushed to use row-by-row knowing that I needed to one-time load millions of rows of historical data, where “big bang” was proven to be nearly impossible, but row-by-row worked like a champ.

Finally, I took out the SQL WHILE loop to process queue entries and replaced it with a SSIS package step that looks like the one highlighted here.

This is a script task – so what’s the core script for this?

using(var conn = new SqlConnection(Dts.Variables["User::BI_ADONET_Connection"].Value.ToString())) {
conn.Open();
DataView keys = null;
while (keys == null || keys.Count > 0) {
using(var da = new SqlDataAdapter("SELECT TOP 8 JobID FROM dbo.JobSummaryQueue", conn)) {
DataTable dt = new DataTable();
da.Fill(dt);
keys = dt.DefaultView;
}
if (keys.Count > 0) {
Parallel.ForEach(keys.Cast < DataRowView > (), (kdrv) => {
using(var connInner = new SqlConnection(Dts.Variables["User::BI_ADONET_Connection"].Value.ToString())) {
connInner.Open();
using(var cmd = new SqlCommand("dbo.up_JobSummary_ByJobID", connInner)) {
cmd.CommandType = CommandType.StoredProcedure;
cmd.Parameters.AddWithValue("JobID", kdrv["JobID"]);
cmd.ExecuteNonQuery();
}
}
});
}
}
}

After passing in my connection string as a variable (settable via package configurations, for example), you can see how I’ve taken advantage of the .NET Parallel class to launch up to 8 concurrent threads (based on the “TOP 8”), each of which receives its own entity ID (jobs in this case) to process. The job ID is passed to a stored procedure which is doing our MERGE, filtered by JobID, and then deletes the input JobID from the queue.

There’s more overhead here since unlike a SQL WHILE loop that can execute all within SQL Server, this approach requires out-of-process communication using multiple connections. However, the timing does not lie: 12.3 minutes to process all 280,000 queue entries (89% improvement over baseline)!

The ideal level of parallelism is, once again, a big “it depends” – mostly on how much your environment can handle reasonably. Eight in my example is fine, but that’s because I have the capacity to support eight additional concurrent users – which is effectively what these requests look like to the system.

Should we care that much about speed in a largely-batch BI environment? I vote “yes”: I’ve seen this solve multiple types of problems by quietly shifting away from “real-time” data that was problematic to query, over to “near-real-time” BI data that’s much easier to query and everyone has remained happy about the timeliness of data since it can be refreshed on the scale of minutes instead of hours (or days).

The creation of this SSIS package is something that I discuss in more detail in another article, where we can leverage quite a bit of metadata and conventions-based naming to generate the required .dtsx file with no SSIS coding per se.