Building High Performance Back End (SQL Server)

Gaurav Jain
6y
44.8k
0
12

Article

I am sure the below screen looks familiar to you because most of us have built this screen in a project, wherein, we need to create a search feature for searching something from our project and bring the data from the back-end. Generally, from the front-end, we collect the parameters and pass this data to a stored procedure.

Remember, in this type of screen, if a user has provided ID (in below case, Service provider ID), it will bring only a single record from the back-end, while in all the other boxes, the user can provide First name or Last name; or both.

For example - If a user has entered only Tom as First Name, the Stored Procedure will bring all the service providers whose first name is Tom.

Building High Performance Back End (SQL Server)

We often use the below types of query for this.

WHERE ([C].[CustomerID] = @SProviderID OR @SProviderID IS NULL)
AND ([C].[lastname] LIKE @LastName OR @LastName IS NULL)
AND ([C].[firstname] LIKE @FirstName OR @FirstName IS NULL)

NOTE
This is a BAD design of stored procedure. This type of stored procedure will neither optimize well nor it will make the use of execution plan caching.

Whenever this type of stored procedure will get called for the first time, it builds the execution plan and caches it. It will try to use the same plan for another set of parameters and it will not fit for all the provided data; resulting in us getting horrible performance out of that stored procedure.

For example, if this SP gets called for service provider ID first, the SQL Server will return one row based on the provided data. It will save the execution plan in cache. Next time when you call the same SP with only First name, the SQL Server plan will estimate the result row size as 1 while it may return thousands of rows, which will kill the performance.

OPTION(RECOMPILE) is the quickest solution to fix this

To fix this type of stored procedure, OPTION(RECOMPILE) works wonderfully but how to use it is equally important, especially because you can’t use it blindly with all statements. If you use it blindly, the SQL Server will invest time every time to create an ideal plan first and you might end up using too much CPU. Definitely, we can reuse at least a few stable plans, if not all.

A Hybrid solution

In the below example, I have used OPTION(RECOMPILE) dynamically inside the stored procedure based on the incoming data.

If the SProviderID has been provided, do not recompile it. If anything else is provided, recompile it based on the length of data. By doing it that way, we are asking the SQL Server to create a new plan only when it’s needed.

CREATE PROC [dbo].[GetServiceProviderData]
(
@SProviderID BIGINT = NULL
, @LastName VARCHAR (30) = NULL
, @FirstName VARCHAR (30) = NULL
)
AS
IF (@SProviderID IS NULL
AND @LastName IS NULL
AND @FirstName IS NULL
BEGIN
RAISERROR
RETURN;
END;
DECLARE @ExecStr NVARCHAR (4000),
@Recompile BIT = 1;
SELECT @ExecStr =
N'SELECT * FROM [dbo].[Customers] AS [C] WHERE 1=1';
IF @SProviderID IS NOT NULL
SELECT @ExecStr = @ExecStr
+ N' AND [C].[CustomerID] = @SProviderID';
IF @LastName IS NOT NULL
SELECT @ExecStr = @ExecStr
+ N' AND [C].[LastName] LIKE @LName';
IF @FirstName IS NOT NULL
SELECT @ExecStr = @ExecStr
+ N' AND [C].[Firstname] LIKE @FName';
IF (@SProviderID IS NOT NULL)
SET @Recompile = 0
IF (PATINDEX('%[%_?]%', @LastName) = 4)
OR (PATINDEX('%[%_?]%', @FirstName) = 4)
SET @Recompile = 0
IF @Recompile = 1
BEGIN
SELECT @ExecStr = @ExecStr + N' OPTION(RECOMPILE)';
END;
EXEC [sp_executesql] @ExecStr
@CustID = @SProviderID
, @LName = @LastName
, @FName = @FirstName

Choose correct Index

We always think that indexes improve performance, which is correct most of the time but before choosing the index, we need to understand what type of index is needed in our case and on which column index is needed. Choosing the incorrect type of index or choosing the invalid columns for the index can negatively impact the performance.

Internal architecture of data storage

Before diving into indexes, let’s first look at how SQL Server stores data. SQL Server always stores the data in the form of pages (8KB = 8*1024 bytes).

There are different types of pages; some store data records, some store index records and others store metadata of various sorts. All of them have one thing in common, which is their structure. A page is always exactly 8 KB (8192 bytes) in size and contains two major sections, the header and the body. The header has a fixed size of 96 bytes and has the same contents and format, regardless of the page type. It contains information such as how much space is free in the body, how many records are stored in the body, the object to which the page belongs and, in an index, the pages that precede and succeed it.

Image:1 Typical structure of a Data page.

Heap and Indexes

A table that does not have a clustered index is referred to as a HEAP and a table that has a clustered index is referred to as a clustered table. A table can either have one heap or one clustered index.

In Heaps, Data is not stored in any order, Data pages are not linked, so sequential access needs to refer to the index allocation map (IAM) pages.

In clustered indexed table Data is stored in order based on the clustered index key, Data can be retrieved quickly based on the clustered index key, if the query uses the indexed columns.

Structure of data storage on a Heap table

Building High Performance Back End (SQL Server)
Structure of data storage on an Indexed table (B-tree)

As from the above clarification it is clear that having a clustered index on a large table will definitely help in query performance, if that indexed column is being used in where clause.

What all this means is that you should consider using a heap only when you’re working with very light tables or your table operations are limited to inserts and your queries are fairly basic (and you’re still using non-clustered indexes). Otherwise, stick with a well-designed clustered index.

Generally, in big size applications, we can’t only have heap tables and we need to have indexed tables, next questions come is what is the ideal candidate (column) for a indexed key.

What is Clustered key?

Clustered index is like the white pages of a phone book. The phone book stores every person’s name in alphabetical order, making it easy to look up certain individuals. Additionally, if we look someone up, we immediately have their address and phone number right their next to their name. This is exactly the very good feature of clustered indexes is, if you ever need to retrieve many or all columns from your table, a clustered index will usually be efficient because once it finds the indexed value you are searching on, it doesn’t need to go anywhere else to get the remaining data from that row. Whole data is stored in the data page.

Who is the right candidate for a Clustered/Indexed key

Some people think they can ONLY put their clustered index on their PK. That’s not true! Most of it can be much more beneficial to put your clustered index on something that isn’t your PK, like a different column that is getting more use than our PK. For example, you might have an identity column set as your PK, but every query against your table is filtering and sorting on a different column. Why store your table in PK order if you are always going to be filtering and returning data on a separate column? You should put that clustered index on that column not on PK, which you are most commonly going to use on filtering, ordering. Though remember insert, update can become a bit slow as SQL server need to find the appropriate place to keep that record in the data page.

In SQL server to get the maximum advantage of a clustered index, choose a column with a high degree of uniqueness, that can be utilized in range queries, that is often accessed sequentially, that is monotonic, incremental, and unique.

Remember, GUID also fulfills all the characters mentioned in the above paragraph but is it a good idea to have GUID as a clustered key. NO, why not?

Choosing GUID as clustering key is a horrible design - GUIDs may seem to be a natural choice for your primary key but is using GUID as clustering key, is a good idea? By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way, you can change it. You can make GUID as PK but not as a clustering key. You can create a clustering (ordering) key on a separate INT IDENTITY (1,1) column.

https://docs.microsoft.com/en-us/sql/relational-databases/indexes/create-clustered-indexes?view=sql-server-2017

GUID is unnecessarily wide (it’s 4 times wider than an int-based identity), estimate how much this costs on a bigger table and one with a few indexes and top of it fragmentation costs. Disk space is cheap but it will impact performance horribly.

Having an auto-increment Int or BIGINT column(It should be narrow, ever increasing, unique) will be a good choice for PK and columns; a high degree of uniqueness and being used in a range of queries should be the choice for clustering key (CK).

What is non-clustered Index and how does it work

Unlike a clustered index, the leaf nodes of a non-clustered index contain only the values from the indexed columns and row locators that point to the actual data rows, rather than containing the data rows themselves. This means that the query engine must take an additional hop in order to locate the actual data. Remember, if a table is a clustered table then row locator points to clustered index and from there it will go to actual data to fetch the data and if the table is a heap, it will point to the actual data row.

Unlike clustered index, you can contain up to 16 index keys in NC index and you are not restricted with this number in the included columns. In SQL Server, you can include up-to 1023 columns per each non-clustered index. But please remember having multiple columns as a key column is not recommended.

Don’t cover the index, the transactional table

Before designing any index on the table, please see what are the activities involved on that table. Is it only being used for reporting (select statement) or also being used as a transactional table (insert/update/delete) and how much transactional activity is being performed on that table. When you add a record to a transactional table, an INSERT operation occurs not only on the data page, but also in all associated indexes and same is true about the update and delete. if your table is either a heap or a clustered table and there are a lot of INSERTS, UPDATES and DELETES the data pages can become fragmented. This results in wasted space as well as additional data pages to read to satisfy the queries. If you have too many indexes, it can slow down the transactional activity on your table because your query will do very heavy I/O activity.

Use Fill Factor wisely on the transactional table

The fill-factor option is provided for fine-tuning index data storage and performance. When an index is created or rebuilt, the fill-factor determines the percentage of space on each leaf-level page to be filled/free with data. It reserves the free space for any transactional activity.

For example in the below, data pages will be 80% full and 20% empty.

CREATE INDEX IX_Employee_Name ON HumanResources.Employee
(Employee_Name) WITH (FILLFACTOR = 80);

A correctly chosen fill-factor value can reduce page splits by providing enough space for index expansion as data is added to the table or large size data has got updated on the existing page. Therefore, it reduces page split.

A few other performance/space killers

Avoid using MAX, if you know the maximum data width already

Sometime without realizing it, we tend to create the column as VARCHAR(MAX) even if we know that user is not going to give more than 100 characters in that field. Max is a safety net, but if you don't need it, don't put it up.

VARCHAR(MAX) or NVARCHAR(MAX) is considered as a 'large value type'. Large value types are usually stored 'out of row'. (only if the size exceeds 8000). It means that the data row will have a pointer to another location where the 'large value' is stored. (Copied from MSDN).

Avoid using NVARCHAR/NCHAR for all columns, if not needed

If your database is going to store multilingual data you should use the nvarchar datatype instead varchar. Also, nvarchar takes twice as much space as varchar data.

Constraint (e.g. Foreign Key) to boost performance

Most of us as a developer try to avoid putting the constraint in the backend table and believe that constraints are only to maintain data integrity. If the constraints are used wisely they contribute to performance.

Let’s try to see with a simple example.

I have created two simple tables and I am fetching data ONLY from OrderHeader1 table, but if you see the execution plan, OrderHeader1 table is being scanned and Customer1 table is being sought. So are both tables being read?

select OH.OrderDate,Oh.OrderNo
from OrderHeader1 OH
inner join Customer1 C
on oh.CustomerId=C.CustomerId

Building High Performance Back End (SQL Server)

Now, I simply put a Foreign Key constraint on OrderHeader1 table as below and run the same query again.

ALTER TABLE [dbo].[OrderHeader1] WITH NOCHECK ADD CONSTRAINT [fkCust] FOREIGN KEY([CustomerId])
REFERENCES [dbo].[Customer1] ([CustomerId])

Below is the execution plan. After applying FK contraint, only one table is being scanned. We can very well imagine the impact of it on performace when we have two heavily loaded tables.