Statistics And Cost Estimation Internals - Part One

Introduction

Incorrect cardinality and Cost Estimation may lead query optimizer to choose inefficient plans which can have a negative impact on the database performance. In this article, we are going to look deep inside Statistics - how statistics are obtained by the query, and manually calculate the same with examples and scenarios and how optimizer thinks. I am going to focus on Density in this article.

Queries that I am demonstrating in this article are prepared against the “AdventureWorks2012” database and I am using “Sales.SalesOrderDetail” table.

Cardinality Estimate

Estimated number of records that will be returned by filtering, joining predicates, and using group by clause.

Note - Auto Update and Auto Create statistics give you better performance.

Below are the outputs from sys.stats table.

object_idNamestats_idauto_createduser_createdno_recomputehas_filterfilter_definitionis_temporary
1509580416PK_Person_BusinessEntityID10000NULL0
1509580416IX_Person_LastName_FirstName_MiddleName20000NULL0
1509580416AK_Person_rowguid30000NULL0
1509580416_WA_Sys_00000002_59FA5E8041000NULL0

Creating and updating statistics

Statistics are created/udated in various ways automatically by the Query Optimizer. 

  • If Statistics is not created on particular column, then it will be automatically created by the Optimizer.

If you see in the above table “_WA_Sys_00000002_59FA5E80”, this statistics is created automatically by Optimizer.

_WA_Sys_00000002_59FA5E80 
  • WA means Washington (SQL Server developers sit here)
  • 00000002 - Column ID
  • 59FA5E80 - is the hexadecimal equivalent of the object_id

If Plan already exists in the Plan Cache and Statistics used by the Optimizer is outdated, then Optimizer discards the plan from plan cache, updates the statistics, and creates new plan.

  • When we create an Index, it will automatically create statistics against it.
  • Explicitly create statistics on particular column.

Different ways statistics get updated,

  1. Synchronous Update Statistics - Optimizer has to wait until the update statistics to complete before optimizing and executing query.
  2. Asynchronous Update Statistics - Optimizer need not wait to update statistics it will use existing statistics.

How to view Statistics?

Syntax of Statistics

  1. dbcc SHOW_STATISTICS({  
  2.     'table_name' | 'view_name'  
  3. }, target)[WITH {  
  4.     [NO_INFOMSGS] < option > [, n]  
  5. }] < option > :: = STAT_HEADER | DENSITY_VECTOR | HISTOGRAM  

Example of Statistics

Query

DBCC SHOW_STATISTICS('Sales.SalesOrderDetail','IX_SalesOrderDetail_ProductID')

Output

output

You can get all these 3 parts separately by executing below queries along with “with” syntax as below,

  1. DBCC SHOW_STATISTICS('Sales.SalesOrderDetail','IX_SalesOrderDetail_ProductID'with STAT_HEADER  
  2. DBCC SHOW_STATISTICS('Sales.SalesOrderDetail','IX_SalesOrderDetail_ProductID'with HISTOGRAM  
  3. DBCC SHOW_STATISTICS('Sales.SalesOrderDetail','IX_SalesOrderDetail_ProductID'with DENSITY_VECTOR  

When you look into show_statistics result you will get below as output,

  1. Stats Header
  2. Density 
  3. Histogram

Density

Density is used when query having Group by Clause or Equality Predicate with UNKNOWN values.

Scenario IDAll density Average LengthColumns
#10.0037593994ProductID
#28.2429E-068ProductID, SalesOrderID
#38.2429E-0612ProductID, SalesOrderID, SalesOrderDetailID

Density is calculated based on the 1/Number of Estimated Rows (i.e. count).

CountDensity Based on Manual CalculationReciprocal
2660.003759398265.99996
1213178.24287E-06121316.53
1213178.24287E-06121316.53

Below are the different scenarios and how statistics is getting calculated to qualify the number of rows required by the particular query.

Scenario 1 Group By

Query

SELECT ProductID FROM Sales.SalesOrderDetail GROUP BY ProductID

How statistics is getting used to retrieve Number of rows got selected for above query?

Formula used

1/Density value of column ProductID

I.e. 1/0.003759398 = 265.99996

 Number of rows
Group By ProductID265.9999644
Group By ProductID, SalesOrderID121316.9979

Scenario 2 Equality Operation

In Equality to get rows from statistics it will use the below formula,

Formula Used

Density Value of column * Count of rows

I.e. 8.24287E-06 * 121317 = 456.0790085

Equality OperationNumber of rows
ProductID = 970456.0790085

Scenario 3 Inequality Operation

Inequality Operation will use below formula,

Formula Used

30% * Count of rows

I.e. 30% * 121317 = 36395.1

Inequality OperationNumber of rows
ProductID <36395.1

Summary

In this article, I explained how to look into statistics, what are the different outputs that statistics have, how Query Optimizer uses statistics for Group By, Equality, and Inequality operation. Also, I explained in depth what are the formulas being used by the Optimizer internally to identify number of rows that qualify for particular queries, based on the different selectivity.