Statistics and how the optimizer uses them Interview Questions

Overview

Statistics are metadata objects that describe the distribution of values in one or more table or indexed-view columns. SQL Server's query optimizer uses statistics to estimate how many rows a query operator will process. Those cardinality estimates drive plan choices such as index seek versus scan, join order, join type, memory grant, aggregation strategy, and parallelism.

Good statistics do not guarantee a perfect plan, but poor or stale statistics are a common reason for bad plans. If the optimizer thinks a predicate will return 10 rows when it actually returns 10 million, it can choose a plan that looks cheap at compile time but performs badly at runtime.

This topic matters because many performance problems are not caused by missing indexes alone. They come from wrong row-count estimates, stale statistics after data changes, skewed data distributions, parameter sensitivity, correlated predicates, ascending keys, or predicates that make estimation hard.

For interviews, strong candidates can explain histograms, density, cardinality estimation, automatic statistics creation and updates, when manual updates help, and how to compare estimated rows with actual rows in execution plans.

Core Concepts

What Statistics Are

Statistics contain information about data distribution.

SQL Server can create statistics:

Automatically on columns used in query predicates.
Automatically for indexes.
Manually with CREATE STATISTICS.
As filtered statistics for a subset of rows.

Example:

Code

CREATE STATISTICS ST_Orders_Status_OrderDate
ON dbo.Orders (Status, OrderDate);

Statistics are not indexes. They do not store row locators or speed data access directly. They help the optimizer choose a plan.

Cardinality Estimation

Cardinality estimation predicts how many rows will flow through an operator.

Example:

Code

SELECT OrderId, CustomerId, TotalAmount
FROM dbo.Orders
WHERE Status = N'Cancelled';

If statistics show that only 0.5 percent of orders are cancelled, the optimizer may choose an index seek. If most orders are cancelled, it may choose a scan.

Cardinality estimates influence:

Access method.
Join order.
Join algorithm.
Memory grants.
Parallelism.
Sort and hash strategy.
Spool decisions.

Histogram

A histogram describes value distribution for the first key column of a statistics object. SQL Server histograms have up to 200 steps.

Example:

Code

DBCC SHOW_STATISTICS ('dbo.Orders', 'IX_Orders_OrderDate');

The histogram can help estimate predicates such as:

Code

WHERE OrderDate >= @StartDate
  AND OrderDate < @EndDate;

Important detail: in multi-column statistics, the histogram is only on the first column. Additional columns contribute density information, but not full histograms.

Density

Density describes uniqueness or correlation information for statistics columns. It helps estimate equality predicates and joins, especially for multi-column statistics.

Example:

Code

CREATE STATISTICS ST_Orders_Tenant_Status
ON dbo.Orders (TenantId, Status);

This can help when queries filter by both tenant and status:

Code

WHERE TenantId = @TenantId
  AND Status = N'Open';

However, statistics still may not fully capture complex correlation between columns. If one tenant has very different status distribution than another tenant, estimates may still be wrong.

Auto Create Statistics

When automatic statistics creation is enabled, SQL Server can create statistics on columns used in predicates to help optimize queries.

Example:

Code

SELECT *
FROM dbo.Orders
WHERE ExternalReference = @ExternalReference;

If no statistics exist on ExternalReference, SQL Server may create them automatically during compilation, depending on database settings and query conditions.

Auto-created statistics can help, but they are not a substitute for well-designed indexes. A statistic can tell the optimizer how many rows may qualify, but it does not provide a physical access path.

Auto Update Statistics

Statistics become stale when data changes through inserts, updates, deletes, or merges. With automatic statistics updates enabled, SQL Server determines when statistics are out of date and updates them when needed.

Example scenario:

A table had 1 million rows last month.
A bulk import added 5 million new rows.
Statistics still describe the old distribution.
Queries against new values get poor estimates.

Manual update:

Code

UPDATE STATISTICS dbo.Orders IX_Orders_OrderDate;

Or table-wide:

Code

UPDATE STATISTICS dbo.Orders;

Do not update statistics constantly without reason. Updating statistics can cause recompilation and consumes resources.

Synchronous Vs Asynchronous Statistics Updates

With synchronous statistics updates, a query may wait for stale statistics to update before compilation finishes. This can produce a better plan but increase compile latency.

With asynchronous statistics updates, a query can compile using existing stale statistics while SQL Server updates statistics in the background. Later queries can benefit from the updated statistics.

The choice is workload-dependent:

Synchronous favors plan quality for the current query.
Asynchronous favors avoiding compile-time waits.

Interview answer: know the trade-off and test in the workload rather than choosing by habit.

Statistics And Indexes

Creating an index also creates statistics on the index key. Rebuilding an index recreates the index and updates the associated index statistics as part of that process.

However:

Reorganizing an index does not update statistics the same way.
Auto-created column statistics are separate from index statistics.
Updating statistics does not rebuild indexes.
Rebuilding indexes is not a complete substitute for targeted statistics strategy.

Index and statistics maintenance overlap, but they are not identical.

Filtered Statistics

Filtered statistics describe a subset of rows.

Example:

Code

CREATE STATISTICS ST_Orders_Open_OrderDate
ON dbo.Orders (OrderDate)
WHERE Status = N'Open';

This can help the optimizer estimate queries that target the filtered subset:

Code

WHERE Status = N'Open'
  AND OrderDate >= @StartDate
  AND OrderDate < @EndDate;

Filtered statistics are useful when a subset has different distribution from the full table, such as active rows, unprocessed messages, or non-null optional columns.

Ascending Key Problem

Ascending keys such as identity columns and increasing timestamps can cause estimate problems when new values appear beyond the histogram's known range.

Example:

Code

SELECT *
FROM dbo.Orders
WHERE OrderDate >= DATEADD(hour, -1, SYSUTCDATETIME());

If many new rows arrived after the last statistics update, the histogram may not represent the newest values well. The optimizer can underestimate recent rows.

Possible mitigations:

Let auto update statistics run.
Manually update statistics after large loads.
Use appropriate maintenance jobs.
Consider filtered statistics for recent active subsets.
Monitor actual versus estimated rows.

Parameter Sensitivity

One query shape can need different plans for different parameter values.

Example:

Code

CREATE PROCEDURE dbo.GetOrdersByStatus
    @Status NVARCHAR(30)
AS
BEGIN
    SELECT *
    FROM dbo.Orders
    WHERE Status = @Status;
END;

If @Status = 'Cancelled' returns very few rows but @Status = 'Completed' returns millions, one cached plan may not be good for both.

Statistics help estimate distribution, but plan reuse can still create parameter-sensitive performance. This is related to parameter sniffing and parameter-sensitive plan behavior.

Estimated Rows Vs Actual Rows

Execution plans show estimated rows and actual rows.

Example plan symptom:

Code

Estimated rows: 1
Actual rows: 500000

This mismatch is a strong clue that the optimizer made decisions based on bad or incomplete cardinality estimates.

Possible causes:

Stale statistics.
Missing statistics.
Skewed data.
Correlated predicates.
Table variables or temporary structures with poor estimates.
Non-sargable predicates.
Parameter sensitivity.
Implicit conversions.

Statistics Inspection

Useful commands and tools:

Code

DBCC SHOW_STATISTICS ('dbo.Orders', 'IX_Orders_OrderDate');

Code

SELECT
    s.name,
    sp.last_updated,
    sp.rows,
    sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID(N'dbo.Orders');

Execution plan XML can also show statistics used during optimization through optimizer statistics usage information.

Common Mistakes

Common mistakes include:

Assuming indexes and statistics are the same thing.
Disabling auto update statistics without a strong maintenance replacement.
Updating statistics too frequently and causing unnecessary recompiles.
Rebuilding indexes and assuming all statistics are handled.
Ignoring stale statistics after bulk loads.
Ignoring ascending timestamp or identity patterns.
Looking only at operator cost instead of estimated versus actual rows.
Creating multi-column statistics with the wrong leading column.
Assuming statistics perfectly understand correlated predicates.
Blaming the optimizer before checking data distribution.

Best Practices

Best practices:

Keep automatic statistics creation and updates enabled unless there is a proven reason not to.
Check estimated versus actual rows in actual execution plans.
Update statistics after large data changes when auto thresholds are not enough.
Use filtered statistics for important skewed subsets.
Create multi-column statistics when correlated predicates matter.
Avoid non-sargable predicates that make estimation and index use harder.
Monitor statistics age and modification counts for critical tables.
Understand parameter sensitivity before forcing hints.
Use Query Store and actual plans to focus on real workload regressions.
Treat statistics maintenance as workload-specific, not a universal nightly ritual.

Statistics and how the optimizer uses them

Overview

Core Concepts

What Statistics Are

Cardinality Estimation

Histogram

Density

Auto Create Statistics

Auto Update Statistics

Synchronous Vs Asynchronous Statistics Updates

Statistics And Indexes

Filtered Statistics

Ascending Key Problem

Parameter Sensitivity

Estimated Rows Vs Actual Rows

Statistics Inspection

Common Mistakes

Best Practices

Interview Practice

Beginner Interview Practice

Intermediate Interview Practice

Advanced Interview Practice