Statistics And Cost Estimation Internals - Part One

Vikrant More
7y
7.6k
0
5

Article

Introduction

Incorrect cardinality and Cost Estimation may lead query optimizer to choose inefficient plans which can have a negative impact on the database performance. In this article, we are going to look deep inside Statistics - how statistics are obtained by the query, and manually calculate the same with examples and scenarios and how optimizer thinks. I am going to focus on Density in this article.

Queries that I am demonstrating in this article are prepared against the “AdventureWorks2012” database and I am using “Sales.SalesOrderDetail” table.

Cardinality Estimate

Estimated number of records that will be returned by filtering, joining predicates, and using group by clause.

Note - Auto Update and Auto Create statistics give you better performance.

Below are the outputs from sys.stats table.

object_id	Name	stats_id	auto_created	user_created	no_recompute	has_filter	filter_definition	is_temporary
1509580416	PK_Person_BusinessEntityID	1	0	0	0	0	NULL	0
1509580416	IX_Person_LastName_FirstName_MiddleName	2	0	0	0	0	NULL	0
1509580416	AK_Person_rowguid	3	0	0	0	0	NULL	0
1509580416	_WA_Sys_00000002_59FA5E80	4	1	0	0	0	NULL	0

Creating and updating statistics

Statistics are created/udated in various ways automatically by the Query Optimizer.

If Statistics is not created on particular column, then it will be automatically created by the Optimizer.

If you see in the above table “_WA_Sys_00000002_59FA5E80”, this statistics is created automatically by Optimizer.

_WA_Sys_00000002_59FA5E80

WA means Washington (SQL Server developers sit here)
00000002 - Column ID
59FA5E80 - is the hexadecimal equivalent of the object_id

If Plan already exists in the Plan Cache and Statistics used by the Optimizer is outdated, then Optimizer discards the plan from plan cache, updates the statistics, and creates new plan.

When we create an Index, it will automatically create statistics against it.
Explicitly create statistics on particular column.

Different ways statistics get updated,

Synchronous Update Statistics - Optimizer has to wait until the update statistics to complete before optimizing and executing query.
Asynchronous Update Statistics - Optimizer need not wait to update statistics it will use existing statistics.

How to view Statistics?

Syntax of Statistics

dbcc SHOW_STATISTICS({
'table_name' | 'view_name'
}, target)[WITH {
[NO_INFOMSGS] < option > [, n]
}] < option > :: = STAT_HEADER | DENSITY_VECTOR | HISTOGRAM

Example of Statistics

Query

DBCC SHOW_STATISTICS('Sales.SalesOrderDetail','IX_SalesOrderDetail_ProductID')

Output

You can get all these 3 parts separately by executing below queries along with “with” syntax as below,

DBCC SHOW_STATISTICS('Sales.SalesOrderDetail','IX_SalesOrderDetail_ProductID') with STAT_HEADER
DBCC SHOW_STATISTICS('Sales.SalesOrderDetail','IX_SalesOrderDetail_ProductID') with HISTOGRAM
DBCC SHOW_STATISTICS('Sales.SalesOrderDetail','IX_SalesOrderDetail_ProductID') with DENSITY_VECTOR

When you look into show_statistics result you will get below as output,

Stats Header
Density
Histogram

Density

Density is used when query having Group by Clause or Equality Predicate with UNKNOWN values.

Scenario ID	All density	Average Length	Columns
#1	0.003759399	4	ProductID
#2	8.2429E-06	8	ProductID, SalesOrderID
#3	8.2429E-06	12	ProductID, SalesOrderID, SalesOrderDetailID

Density is calculated based on the 1/Number of Estimated Rows (i.e. count).

Count	Density Based on Manual Calculation	Reciprocal
266	0.003759398	265.99996
121317	8.24287E-06	121316.53
121317	8.24287E-06	121316.53

Below are the different scenarios and how statistics is getting calculated to qualify the number of rows required by the particular query.

Scenario 1 Group By

Query

SELECT ProductID FROM Sales.SalesOrderDetail GROUP BY ProductID

How statistics is getting used to retrieve Number of rows got selected for above query?

Formula used

1/Density value of column ProductID

I.e. 1/0.003759398 = 265.99996

	Number of rows
Group By ProductID	265.9999644
Group By ProductID, SalesOrderID	121316.9979

Scenario 2 Equality Operation

In Equality to get rows from statistics it will use the below formula,

Formula Used

Density Value of column * Count of rows

I.e. 8.24287E-06 * 121317 = 456.0790085

Equality Operation	Number of rows
ProductID = 970	456.0790085

Scenario 3 Inequality Operation

Inequality Operation will use below formula,

Formula Used

30% * Count of rows

I.e. 30% * 121317 = 36395.1

Inequality Operation	Number of rows
ProductID <	36395.1

Summary

In this article, I explained how to look into statistics, what are the different outputs that statistics have, how Query Optimizer uses statistics for Group By, Equality, and Inequality operation. Also, I explained in depth what are the formulas being used by the Optimizer internally to identify number of rows that qualify for particular queries, based on the different selectivity.