SQL Performance (3), Optimized DATA MODIFYING Query

Introduction

In this series of articles, we discuss SQL Server performance. I use the broader term SQL Performance for the series because the articles cover SQL statement (query) optimization as well as SQL Server performance issues, and will probably also cover stored procedure performance issues.

We will discuss INSERT, UPDATE, and DELETE together as Data Modifying Queries because they share similar characteristics. We will mainly discuss the following three aspects,

  • Index effect on Modifying Query
  • Updating Method (Loop, Bulk, or Batch) effect on Modifying Query
  • (De)Normalization effect on Modifying Query

Index effect on Modifying Query

Insert

The number of indexes on a table is the most dominant factor for insert performance. The more indexes a table has, the slower the execution becomes. The insert statement is the only operation that cannot directly benefit from indexing because it has no where clause.

Adding a new row to a table involves several steps. First, the database must find a place to store the row. For a regular heap table — which has no particular row order — the database can take any table block that has enough free space. This is a very simple and quick process, mostly executed in the main memory. All the database has to do afterward is add the new entry to the respective data block.

If there are indexes on the table, the database must make sure the new entry is also found via these indexes. For this reason, it has to add a new entry to each and every index on that table. The number of indexes is, therefore, a multiplier for the cost of an insert statement.
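
Because every additional index multiplies this maintenance work, it is worth knowing how many indexes a table carries before loading data into it. Here is a minimal sketch using the system catalog (tblverylargetable is the demo table created later in this article; substitute your own table name):

-- List the indexes that every INSERT into this table must also maintain
-- (index_id 0 is the heap itself, so it is excluded)
SELECT name, type_desc
FROM   sys.indexes
WHERE  object_id = OBJECT_ID('dbo.tblverylargetable')
       AND index_id > 0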

Insert Performance by Number of Indexes

Delete

Unlike the insert statement, the delete statement has a where clause that can use all of the same access methods as the where clause of a select statement, so it can benefit directly from indexes. In fact, the delete statement works like a select that is followed by an extra step to delete the identified rows.

The actual deletion of a row is a similar process to inserting a new one - especially the removal of the references from the indexes and the activities to keep the index trees in balance.
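
As a simple illustration (using the tblverylargetable table and the ix_col1 index that are created later in this article), the where clause of the following delete can be resolved through the index before the matching rows are removed from the table and from every index that references them:

-- The WHERE clause can use ix_col1 to locate the rows; each deleted row
-- must then also be removed from every index on the table
DELETE FROM tblverylargetable
WHERE  col1 = 'Dummy'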

The performance chart shown below is therefore very similar to the one shown for the insert.

 

Delete Performance by Number of Indexes

Update

An update statement must relocate the changed index entries to maintain the index order. For that, the database must remove the old entry and add the new one at the new location. The response time is basically the same as for the respective delete and insert statements together.

The update performance, just like insert and delete, also depends on the number of indexes on the table. The only difference is that update statements do not necessarily affect all columns because they often modify only a few selected columns. Consequently, an update statement does not necessarily affect all indexes on the table but only those that contain updated columns.
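
For example, with the demo indexes created later in this article (ix_col1 on col1, and ix_col2_col3 on col2 with col3 included), the following statement has to maintain ix_col2_col3 but can leave ix_col1 untouched, because col1 is not modified:

-- Only ix_col2_col3 contains col3, so it is the only nonclustered index
-- that must be maintained by this update
UPDATE tblverylargetable
SET    col3 = Replicate('X', 160)
WHERE  sno = 1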

Update Performance by Indexes and Column Count

Update Queries for Large Data Volume

Updating very large tables can be a time-consuming task, and sometimes it might take hours to finish.

Here are a few tips for optimizing updates on large data volumes in SQL Server.

  1. Removing the index on the column to be updated.
  2. Executing the update in smaller batches.
  3. Disabling Delete triggers.
  4. Replacing the Update statement with a Bulk-Insert operation.

That being said, in this section we will only discuss the first point above to optimize an update query; executing the update in smaller batches is covered in the next section.

The code below creates a dummy table, populates it with 200,000 rows, and adds the required indexes.

CREATE TABLE tblverylargetable   
  (   
     sno  INT IDENTITY,   
     col1 CHAR(800),   
     col2 CHAR(800),   
     col3 CHAR(800)   
  )   
GO   
DECLARE @i INT=0   
WHILE( @i < 200000 )   
  BEGIN  
      INSERT INTO tblverylargetable   
      VALUES     ('Dummy',   
                  Replicate('Dummy', 160),   
                  Replicate('Dummy', 160))   
      SET @i=@i + 1   
  END  
GO  
CREATE INDEX ix_col1   
  ON tblverylargetable(col1)   
GO   
CREATE INDEX ix_col2_col3   
  ON tblverylargetable(col2)   
  INCLUDE(col3) 

Consider the following update query which is to be optimized. It’s a very straightforward query to update a single column.

UPDATE tblverylargetable   
SET    col1 = 'D'  
WHERE  col1 = 'Dummy' 

The query takes 23 seconds to execute.

Let’s look at the execution plan of the query. In addition to updating the table itself, the index ix_col1 is also updated. The index update, Sort, and Index Seek operations together take 55% of the execution cost (Index Update 20%, Sort 34%, Index Seek 1%).
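
If you want to reproduce these observations in your own environment, one sketch is to enable Include Actual Execution Plan in SSMS and capture runtime statistics around the statement:

-- Show elapsed time and I/O for the update so different runs can be compared
SET STATISTICS TIME ON
SET STATISTICS IO ON

UPDATE tblverylargetable
SET    col1 = 'D'
WHERE  col1 = 'Dummy'

SET STATISTICS TIME OFF
SET STATISTICS IO OFF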

Removing the index on the column to be updated

The same query takes 4 seconds when there isn’t any index on col1. Thus, an update query runs much faster if the column to be updated is not an index key column. The index can always be created once the update completes.
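
A rough sketch of that workflow with the objects created above (timings will vary by environment):

-- Drop the index on the column being updated, run the update,
-- then recreate the index afterwards
DROP INDEX ix_col1 ON tblverylargetable
GO
UPDATE tblverylargetable
SET    col1 = 'D'
WHERE  col1 = 'Dummy'
GO
CREATE INDEX ix_col1
  ON tblverylargetable(col1)
GO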

Updating Method effect on Modifying Query

In SQL Server we use the UPDATE statement for modifying data. Updating data can be done in various ways such as row by row, in one big batch, or in several smaller batches.

In this part, we will compare the performance of UPDATEs performed with each of these methods.

Set up the SQL Server Test Environment

First, we'll set up a SQL Server database table in order to run UPDATE operations,

CREATE TABLE CustTransaction  
(  
   id INT PRIMARY KEY IDENTITY(1,1),  
   col1 NVARCHAR(500),  
   Col2 NVARCHAR(500),  
   Col3 NVARCHAR(500),  
   Col4 NVARCHAR(500),  
   COL5 NVARCHAR(500),  
   COL6 NVARCHAR(500)  
)   
GO  
  
-- INSERT 5.58''  
INSERT INTO CustTransaction  
SELECT   
'C1' + replicate ('0',200),  
'C2' + replicate ('0',200),  
'C3' + replicate ('0',200),  
'C4' + replicate ('0',200),  
'C5' + replicate ('0',200),  
'C6' + replicate ('0',200)  
GO 1000000  

This will create a table and insert 1 million rows.

Using this database, we will discuss updating a big SQL Server table and compare the following options:

  • Data Update row by row
  • Data Update in bulk
  • Data Update in small batches

We need to be aware of these different methods. Which one will be faster? Is there a performance impact from infrastructure, isolation level, concurrent processes that are running, etc.?

UPDATE Row By Row

I am going to update the table CustTransaction using a row-by-row method with a loop. I have prepared the sample query and executed it as follows.

-- UPDATE Row By Row 1'27''  
Declare @LoopStartPosition INT = 1,  
        @LoopEndPosition INT  
  
SELECT @LoopEndPosition = MAX(ID)   
FROM dbo.CustTransaction  
  
WHILE @LoopEndPosition >= @LoopStartPosition  
BEGIN  
    UPDATE dbo.CustTransaction  
    SET col1 = Col2,  
        Col2 = col1,  
        Col3 = Col4,  
        Col4 = Col3,  
        COL5 = COL6,  
        COL6 = COL5  
    WHERE id = @LoopStartPosition  
  
    SET @LoopStartPosition = @LoopStartPosition + 1  
END 

The CustTransaction table has 1 million rows, and this took 1 minute and 27 seconds to update all rows. Updating row by row is very slow, but it is less resource-intensive, and consistency is at the statement level. If you want all of the updates to succeed or fail together, you would need to wrap the loop in a transaction, because otherwise, if one update fails, the remaining updates still succeed and the table is left partially updated. Also, when updating row by row with a large volume of data inside a transaction, deadlocks become more likely because the iteration keeps the transaction open for a long time.
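
If you need that all-or-nothing behavior, a minimal sketch is to wrap the row-by-row loop shown above in an explicit transaction, accepting that the transaction then stays open for the whole run:

BEGIN TRY
    BEGIN TRANSACTION

    -- ... the row-by-row WHILE loop from above ...

    COMMIT TRANSACTION
END TRY
BEGIN CATCH
    -- Undo all row updates if any single update fails
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION
END CATCH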

UPDATE in Bulk

I am going to update the CustTransaction table with one command.

-- UPDATE in Bulk: 16''  
UPDATE dbo.CustTransaction  
SET col1 = Col2,  
    Col2 = col1,  
    Col3 = Col4,  
    Col4 = Col3,  
    COL5 = COL6,  
    COL6 = COL5 

This took around 16 seconds. It is faster than the row-by-row operation, but this approach is best used when updating a limited number of rows. A bulk update is an expensive operation in terms of query cost because it requires more resources for the single update operation, and it takes time for the whole update to be logged in the transaction log. Long-running updates can also cause blocking issues for other processes.

UPDATE in Batches

Another approach is to update the data in smaller batches. Instead of doing row-by-row updates or one big update, I have divided the data into batches of 10,000 rows and performed the update operation as shown below,

-- UPDATE in Batches: 10''  
Declare @batchSize INT = 10000,  
        @startPosition INT = 1,  
        @endPosition INT,  
        @incrementalPosition INT = 10000  
  
SELECT @endPosition = MAX(ID)   
FROM dbo.CustTransaction  
  
WHILE @endPosition >= @startPosition  
BEGIN  
    UPDATE dbo.CustTransaction  
    SET col1 = Col2,  
        Col2 = col1,  
        Col3 = Col4,  
        Col4 = Col3,  
        COL5 = COL6,  
        COL6 = COL5  
    WHERE id >= @startPosition  
    AND   id <= @incrementalPosition  
  
    SET @startPosition = @startPosition + @batchSize  
    SET @incrementalPosition = @startPosition + @batchSize - 1  
END 

This completed in 10 seconds, which is much faster than the other two approaches. In addition, each transaction is small, so the amount of data logged and the resources used at any one time are also kept small.

You may update data in batches for large volumes of data.  You can set your own batch size after analyzing your data and environment to see what makes the most sense. 

Another option for batching updates is to use the TOP clause to configure the batch size.
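
For illustration, here is a minimal sketch of that pattern, reusing the earlier tblverylargetable example; note that with TOP the where clause must also act as the "not yet processed" filter so that each iteration picks up a new batch:

-- UPDATE in batches using TOP; loop until no matching rows remain
DECLARE @batchSize INT = 10000

WHILE 1 = 1
BEGIN
    UPDATE TOP (@batchSize) tblverylargetable
    SET    col1 = 'D'
    WHERE  col1 = 'Dummy'

    IF @@ROWCOUNT = 0
        BREAK
END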

Things to Consider

  • For optimizing update operations you should try to minimize the transaction size.
  • Always make sure you use a WHERE clause unless you want to update the entire table.
  • Do large updates during off-peak usage times to minimize blocking of other processes.
  • If triggers are used on a table, it is helpful to disable them prior to doing large updates if possible (see the sketch after this list).
  • Make sure you have the correct indexes in place to optimize the update.
  • If you have indexes that are not being used for the table, removing them could help optimize the update.
  • Understand the Isolation Level because this could directly or indirectly impact query performance.
  • Make sure you understand the recovery model for the database.  Different recovery models can help reduce the use of the transaction log when updates are done in batches, but understand the differences before making changes.
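
For the trigger point above, here is a hedged sketch of disabling and re-enabling all triggers on the demo table around a large update; only do this when it is acceptable for the trigger logic to be skipped for those rows:

-- Disable all triggers on the table, run the large (batched) update,
-- then re-enable the triggers
DISABLE TRIGGER ALL ON dbo.CustTransaction
GO
-- ... run the large update here ...
GO
ENABLE TRIGGER ALL ON dbo.CustTransaction
GO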

(De)Normalization effect on Modifying Query

Normalization and denormalization are both database design methods. Normalization is a technique for minimizing insertion, deletion, and update anomalies by eliminating redundant data. Denormalization, on the other hand, is the inverse process, in which redundancy is deliberately added to the data to improve the performance of a specific application.

OLTP (OnLine Transaction Processing) and OLAP (OnLine Analytical Processing) are both online processing systems. OLTP is a transactional processing system, while OLAP is an analytical processing system. OLTP manages transaction-oriented applications, for example ATM systems; OLAP answers multidimensional analytical queries, for example for financial reporting and forecasting. The basic difference between OLTP and OLAP is that OLTP is an online database modifying system, whereas OLAP is an online database query answering system.

The point here is that denormalization is a technique of combining data or introducing redundancy in order to make queries faster and improve database performance. We may discuss the details in other articles dedicated specifically to Database Normalization and Denormalization.

Summary

This article mainly discussed the following three aspects of improving the performance of modifying queries,

  • Index effect on Modifying Query
    • The effect is huge; try removing indexes on the affected columns before large data modifications
  • Updating Method (Loop, Bulk, or Batch) effect on Modifying Query
    • The loop is very slow; avoid using it
    • Batch might perform better than Bulk, depending on data size and ...
  • (De)Normalization effect on Modifying Query
    • Denormalization introduces data redundancy to make queries faster for improving database performance 

