DEV_NET_CORE
GET_STARTED
SQLCore querying and data retrieval

GROUP BY and aggregate functions

Overview

GROUP BY and aggregate functions are used to summarize rows in SQL. Instead of returning every detail row, a grouped query returns one row per group, such as total sales by customer, order count by status, average response time by day, or maximum invoice amount by region.

Aggregate functions calculate a single value from a set of rows. Common examples include COUNT, SUM, AVG, MIN, and MAX. GROUP BY defines which rows belong together before those aggregate calculations are returned.

This topic matters because many real applications need reporting, dashboards, billing summaries, analytics, validation queries, and operational metrics. It is also a common interview area because candidates often confuse WHERE and HAVING, select non-grouped columns incorrectly, misunderstand COUNT(*) versus COUNT(column), or forget how NULL affects aggregates.

The practical goal is to write grouped queries that are correct, readable, and efficient, while understanding how filtering, null handling, joins, indexing, and result shape affect the answer.

Core Concepts

Basic Aggregate Query

An aggregate function summarizes multiple rows into one value.

Code
SELECT COUNT(*) AS OrderCount
FROM dbo.Orders;

This returns one row with the total number of rows in Orders.

Other common aggregates:

Code
SELECT
    COUNT(*) AS OrderCount,
    SUM(TotalAmount) AS TotalRevenue,
    AVG(TotalAmount) AS AverageOrderValue,
    MIN(TotalAmount) AS SmallestOrder,
    MAX(TotalAmount) AS LargestOrder
FROM dbo.Orders;

Without GROUP BY, aggregate functions summarize the entire filtered result set.

GROUP BY Basics

GROUP BY divides rows into groups and returns one row per group.

Code
SELECT
    Status,
    COUNT(*) AS OrderCount,
    SUM(TotalAmount) AS TotalRevenue
FROM dbo.Orders
GROUP BY Status;

If the table contains Draft, Placed, Cancelled, and Shipped statuses, this query returns one row per status.

The grouping column defines the granularity of the result. If you add more grouping columns, the result becomes more detailed.

Code
SELECT
    CustomerId,
    Status,
    COUNT(*) AS OrderCount
FROM dbo.Orders
GROUP BY CustomerId, Status;

This returns one row per customer and status combination.

SELECT List Rule

In a grouped query, every expression in the SELECT list must be either:

  • Part of the GROUP BY.
  • An aggregate expression.
  • A constant or expression derived only from grouped columns.

Bad:

Code
SELECT
    CustomerId,
    OrderDate,
    COUNT(*) AS OrderCount
FROM dbo.Orders
GROUP BY CustomerId;

OrderDate is not grouped or aggregated. SQL Server cannot know which order date to return for each customer.

Better:

Code
SELECT
    CustomerId,
    MIN(OrderDate) AS FirstOrderDate,
    MAX(OrderDate) AS LastOrderDate,
    COUNT(*) AS OrderCount
FROM dbo.Orders
GROUP BY CustomerId;

Now every selected value has a clear meaning for the customer group.

WHERE vs HAVING

WHERE filters rows before grouping. HAVING filters groups after aggregation.

Code
SELECT
    CustomerId,
    COUNT(*) AS CompletedOrderCount
FROM dbo.Orders
WHERE Status = N'Completed'
GROUP BY CustomerId
HAVING COUNT(*) >= 5;

Execution conceptually works like this:

  • FROM finds the source rows.
  • WHERE removes non-completed orders.
  • GROUP BY groups remaining rows by customer.
  • COUNT(*) counts rows per customer.
  • HAVING keeps only groups with at least five completed orders.
  • SELECT returns the final grouped result.
  • ORDER BY sorts the final result if specified.

Use WHERE for row-level filters and HAVING for aggregate filters.

Bad:

Code
SELECT CustomerId, COUNT(*) AS OrderCount
FROM dbo.Orders
GROUP BY CustomerId
HAVING Status = N'Completed';

Status is neither grouped nor aggregated. The row-level status filter belongs in WHERE.

COUNT Star vs COUNT Column

COUNT(*) counts rows.

Code
SELECT COUNT(*) AS RowCount
FROM dbo.Customers;

COUNT(column) counts non-null values in that column.

Code
SELECT
    COUNT(*) AS CustomerRows,
    COUNT(MiddleName) AS CustomersWithMiddleName
FROM dbo.Customers;

If MiddleName is nullable, COUNT(MiddleName) can be smaller than COUNT(*).

This is a common interview trap. Use COUNT(*) when you mean rows. Use COUNT(column) when you intentionally mean rows where that column has a value.

SUM and AVG

SUM adds numeric values. AVG calculates the average of non-null values.

Code
SELECT
    CustomerId,
    SUM(TotalAmount) AS TotalSpent,
    AVG(TotalAmount) AS AverageOrderValue
FROM dbo.Orders
WHERE Status = N'Completed'
GROUP BY CustomerId;

Be careful with nullable values:

Code
SELECT
    AVG(DiscountAmount) AS AverageDiscountAmongDiscountedRows,
    AVG(COALESCE(DiscountAmount, 0)) AS AverageDiscountAcrossAllRows
FROM dbo.OrderLines;

These two averages answer different questions. The first ignores rows where DiscountAmount is NULL. The second treats missing discount as zero.

MIN and MAX

MIN and MAX return the smallest and largest values in a group.

Code
SELECT
    CustomerId,
    MIN(OrderDate) AS FirstOrderDate,
    MAX(OrderDate) AS LastOrderDate
FROM dbo.Orders
GROUP BY CustomerId;

They work with numbers, dates, and comparable strings. In most business queries, they are used for:

  • First or last date.
  • Lowest or highest value.
  • Earliest or latest event.
  • Lexicographically smallest or largest code.

Do not assume MAX(OrderId) means latest order unless the key ordering truly matches creation order and the business accepts that assumption. Prefer an actual timestamp when recency matters.

DISTINCT Aggregates

COUNT(DISTINCT column) counts unique non-null values.

Code
SELECT
    COUNT(*) AS OrderRows,
    COUNT(CustomerId) AS CustomerReferences,
    COUNT(DISTINCT CustomerId) AS DistinctCustomers
FROM dbo.Orders;

Use distinct aggregates when the question is about unique values, not rows.

Common examples:

  • Count distinct customers who ordered.
  • Count distinct products sold.
  • Count distinct login days per user.

Be careful with joins. A join can multiply rows and make COUNT(*) larger than expected. Sometimes COUNT(DISTINCT ...) is correct; other times the query should aggregate before joining.

Conditional Aggregation

Conditional aggregation calculates multiple related counts or totals in one grouped query.

Code
SELECT
    CustomerId,
    COUNT(*) AS TotalOrders,
    SUM(CASE WHEN Status = N'Completed' THEN 1 ELSE 0 END) AS CompletedOrders,
    SUM(CASE WHEN Status = N'Cancelled' THEN 1 ELSE 0 END) AS CancelledOrders,
    SUM(CASE WHEN Status = N'Completed' THEN TotalAmount ELSE 0 END) AS CompletedRevenue
FROM dbo.Orders
GROUP BY CustomerId;

This pattern is common for dashboards and reports.

For counts, you may also see:

Code
COUNT(CASE WHEN Status = N'Completed' THEN 1 END) AS CompletedOrders

This works because COUNT(expression) counts non-null expressions. The SUM(CASE...) version is often clearer when the result is numeric and explicit.

Grouping By Expressions

You can group by expressions, not only raw columns.

Code
SELECT
    CAST(OrderDate AS date) AS OrderDay,
    COUNT(*) AS OrderCount
FROM dbo.Orders
GROUP BY CAST(OrderDate AS date);

This returns one row per calendar day.

Be careful with performance. Applying functions to a column can make filtering less efficient if used in WHERE.

Less efficient filter:

Code
WHERE CAST(OrderDate AS date) = '2026-06-19'

Often better:

Code
WHERE OrderDate >= '20260619'
  AND OrderDate <  '20260620'

Then group by the date expression if needed for display.

Grouping NULL Values

When a grouping column contains NULL, SQL Server groups all NULL values together.

Code
SELECT
    Region,
    COUNT(*) AS CustomerCount
FROM dbo.Customers
GROUP BY Region;

Rows with Region IS NULL appear as one group where Region is NULL.

If you want a display label, use a separate expression:

Code
SELECT
    COALESCE(Region, N'Unknown') AS RegionName,
    COUNT(*) AS CustomerCount
FROM dbo.Customers
GROUP BY COALESCE(Region, N'Unknown');

Be cautious: replacing NULL with a label can merge true unknowns with a real value if that label also exists in the data.

Aggregates and NULL Values

Most aggregate functions ignore NULL values. COUNT(*) is the major exception because it counts rows.

Code
SELECT
    COUNT(*) AS RowsInGroup,
    COUNT(ShippedAtUtc) AS ShippedRows,
    MIN(ShippedAtUtc) AS FirstShippedAt,
    MAX(ShippedAtUtc) AS LastShippedAt
FROM dbo.Orders;

If ShippedAtUtc is nullable, COUNT(ShippedAtUtc) counts only shipped rows.

This can be useful, but it must be intentional.

HAVING Without GROUP BY

HAVING is usually used with GROUP BY, but SQL can also apply it to a single implicit group.

Code
SELECT COUNT(*) AS OrderCount
FROM dbo.Orders
HAVING COUNT(*) > 0;

This returns the aggregate row only if the aggregate condition is true. It is less common in application code but useful to understand for interviews.

ROLLUP, CUBE, and GROUPING SETS

SQL Server supports extended grouping options for subtotal and reporting queries.

ROLLUP produces hierarchical totals:

Code
SELECT
    Region,
    SalesPersonId,
    SUM(TotalAmount) AS Revenue
FROM dbo.Orders
GROUP BY ROLLUP (Region, SalesPersonId);

GROUPING SETS lets you specify exact grouping combinations:

Code
SELECT
    Region,
    Status,
    SUM(TotalAmount) AS Revenue
FROM dbo.Orders
GROUP BY GROUPING SETS
(
    (Region, Status),
    (Region),
    ()
);

These are useful for reports that need detail rows, subtotals, and grand totals in one result. They can be overkill for simple API queries.

GROUP BY Does Not Sort

GROUP BY defines groups. It does not guarantee output order.

Bad assumption:

Code
SELECT Status, COUNT(*) AS OrderCount
FROM dbo.Orders
GROUP BY Status;

If order matters, add ORDER BY:

Code
SELECT Status, COUNT(*) AS OrderCount
FROM dbo.Orders
GROUP BY Status
ORDER BY OrderCount DESC, Status;

This matters for deterministic reports and tests.

Join Multiplication

Aggregates after joins can be wrong when joins multiply rows.

Example:

Code
SELECT
    c.CustomerId,
    COUNT(*) AS OrderCount
FROM dbo.Customers AS c
JOIN dbo.Orders AS o
    ON o.CustomerId = c.CustomerId
JOIN dbo.OrderLines AS ol
    ON ol.OrderId = o.OrderId
GROUP BY c.CustomerId;

This counts order lines, not orders. If an order has three lines, it contributes three rows.

Better:

Code
SELECT
    c.CustomerId,
    COUNT(DISTINCT o.OrderId) AS OrderCount
FROM dbo.Customers AS c
JOIN dbo.Orders AS o
    ON o.CustomerId = c.CustomerId
JOIN dbo.OrderLines AS ol
    ON ol.OrderId = o.OrderId
GROUP BY c.CustomerId;

Or aggregate orders before joining to order lines if that better matches the report.

Performance Considerations

Grouped queries often benefit from indexes that match filters and grouping keys.

Example:

Code
CREATE INDEX IX_Orders_Status_CustomerId
ON dbo.Orders(Status, CustomerId)
INCLUDE (TotalAmount);

This can help a query like:

Code
SELECT
    CustomerId,
    SUM(TotalAmount) AS CompletedRevenue
FROM dbo.Orders
WHERE Status = N'Completed'
GROUP BY CustomerId;

Performance depends on data distribution, statistics, indexes, query shape, and result size. Avoid assuming all aggregation problems are solved by indexes. Sometimes the correct answer is a summary table, indexed view, filtered index, partitioning, or a separate reporting model.

Common Mistakes

Common mistakes include:

  • Selecting columns that are not grouped or aggregated.
  • Using HAVING for row filters that belong in WHERE.
  • Forgetting that COUNT(column) ignores NULL.
  • Assuming GROUP BY sorts results.
  • Counting multiplied join rows accidentally.
  • Using COUNT(DISTINCT ...) as a bandage instead of fixing the join shape.
  • Treating NULL and zero as the same in averages.
  • Grouping at the wrong granularity.
  • Using MAX(Id) as a proxy for latest date without proving it.
  • Returning subtotals from ROLLUP without labeling them clearly.

Best Practices

Best practices include:

  • Decide the granularity before writing the query.
  • Filter rows in WHERE before grouping.
  • Filter aggregate results in HAVING.
  • Use COUNT(*) for row count.
  • Use COUNT(column) only when non-null count is intended.
  • Use clear aggregate aliases.
  • Check joins for row multiplication.
  • Add ORDER BY when result order matters.
  • Test edge cases with no rows, null values, and duplicate joins.
  • Use indexes or summary structures based on measured workload.

Interview Practice

PreviousSELECT, WHERE, ORDER BY, TOP/OFFSET-FETCHNext UpNULL handling and common filtering mistakes