Understanding Columnstore Indexes In SQL Server - Part One

In this article, you will learn about Columnstore indexes in SQL Server.

Recently, I reviewed filtered indexes. This time, let’s look at the Columnstore indexes. These indexes are very useful for data warehouse workloads and large tables. They can improve the query performance by a factor of 10 in some cases, so knowing and understanding how they work is important if you work in an environment with larger scaled data.
 
Now, I admit when these first were introduced in SQL Server 2012, I found them very intimidating (additionally, you couldn’t update them directly). For me, anytime you say Columnstore, my mind tends to set off alarms saying, "wait, stay away! This is too complicated". So, in this post, I am going to try and simplify the feature for you.
 
To do that, first, you need to understand some terminology and the difference between a Columnstore index and a Rowstore index (the normal kind we all use). Let’s start with the terminology.
 
Columnstore is simply the way the data is stored in the index. Instead of our normal Rowstore or b-tree indexes where the data is logically and physically organized and stored as a table with rows and column, the data in Columnstore indexes is physically stored in columns and logically organized in rows and columns. It is this difference in architecture that gives the Columnstore index a very high level of compression along with reducing your storage footprint and providing massive improvements in the read performance.
 
Now, the way the index works is by slicing the data into compressible segments. It takes a group of rows - a minimum of 102,400 rows with a max of 1 million rows, called a rowgroup - and takes that group of rows and changes them into Column segments. It’s these segments that are the basic unit of storage for a Columnstore index, as shown below. This to me is a little tricky to understand without a picture.
 
Understanding Columnstore Indexes In SQL Server
 
Imagine this is a table with 2.1 million rows and 6 columns. Which means we now have two rowgroups of 1 million rows each and a reminder of 100,00 rows, which is called a deltagroup. Since each rowgroup holds a minimum of 102,400 rows the delta rowgroup is used to store all index records remaining until it creates another rowgroup. You can have multiple delta rowgroups awaiting being moved to the Columnstore. Multiple delta groups are stored in the delta store and it is actually a B-tree index used in addition to the Columnstore. Ideally, your index will have rowgroups as close to 1 million rows as possible to reduce the overhead of scanning operations.
 
Now, to complicate things just one step further: There is a process that runs to move delta rowgroups from the delta store to the Columnstore index called a tuple-mover process. This process checks for closed groups, meaning a group that has a maximum of 1 million records and is ready to be compressed and added to the index. As illustrated in the picture, the Columnstore index now has 2 rowgroups that it will then divide into column segments for every column in a table. This creates 6 pillars of 1 million rows per rowgroup for a total of 12 column segments. Makes sense? It is these column segments that are compressed individually for storage on disk. The engine takes these pillars and uses them for very highly paralleled scans of the data. You can also force the tuple-mover process by doing a reorg on your Columnstore index.
 
To facilitate faster data access, only the Min and Max values for the row group are stored on the page header. In addition, query processing, as it relates to column store uses Batch mode allowing the engine to process multiple rows at one time. This also makes the engine able to process rows extremely fast in some cases, giving two to four times the performance of a single query process. For example, if you are doing an aggregation these happen very quickly as only the row being aggregated is read into memory and using the row groups the engine can batch process the groups of 1 million rows. In SQL Server 2019, batch mode is going to be introduced to some row store indexes and execution plans.
 
Now hopefully you have a basic understanding of what a Columnstore index is. In my next post, we will look at how to create one, load data, and what limitations using Columnstore indexes have. We will also get to see the index in action compared to a Rowstore index. Stay tuned.