Overview
GROUP BY and aggregate functions are used to summarize rows in SQL. Instead of returning every detail row, a grouped query returns one row per group, such as total sales by customer, order count by status, average response time by day, or maximum invoice amount by region.
Aggregate functions calculate a single value from a set of rows. Common examples include COUNT, SUM, AVG, MIN, and MAX. GROUP BY defines which rows belong together before those aggregate calculations are returned.
This topic matters because many real applications need reporting, dashboards, billing summaries, analytics, validation queries, and operational metrics. It is also a common interview area because candidates often confuse WHERE and HAVING, select non-grouped columns incorrectly, misunderstand COUNT(*) versus COUNT(column), or forget how NULL affects aggregates.
The practical goal is to write grouped queries that are correct, readable, and efficient, while understanding how filtering, null handling, joins, indexing, and result shape affect the answer.
Core Concepts
Basic Aggregate Query
An aggregate function summarizes multiple rows into one value.
SELECT COUNT(*) AS OrderCount
FROM dbo.Orders;
This returns one row with the total number of rows in Orders.
Other common aggregates:
SELECT
COUNT(*) AS OrderCount,
SUM(TotalAmount) AS TotalRevenue,
AVG(TotalAmount) AS AverageOrderValue,
MIN(TotalAmount) AS SmallestOrder,
MAX(TotalAmount) AS LargestOrder
FROM dbo.Orders;
Without GROUP BY, aggregate functions summarize the entire filtered result set.
GROUP BY Basics
GROUP BY divides rows into groups and returns one row per group.
SELECT
Status,
COUNT(*) AS OrderCount,
SUM(TotalAmount) AS TotalRevenue
FROM dbo.Orders
GROUP BY Status;
If the table contains Draft, Placed, Cancelled, and Shipped statuses, this query returns one row per status.
The grouping column defines the granularity of the result. If you add more grouping columns, the result becomes more detailed.
SELECT
CustomerId,
Status,
COUNT(*) AS OrderCount
FROM dbo.Orders
GROUP BY CustomerId, Status;
This returns one row per customer and status combination.
SELECT List Rule
In a grouped query, every expression in the SELECT list must be either:
- Part of the
GROUP BY. - An aggregate expression.
- A constant or expression derived only from grouped columns.
Bad:
SELECT
CustomerId,
OrderDate,
COUNT(*) AS OrderCount
FROM dbo.Orders
GROUP BY CustomerId;
OrderDate is not grouped or aggregated. SQL Server cannot know which order date to return for each customer.
Better:
SELECT
CustomerId,
MIN(OrderDate) AS FirstOrderDate,
MAX(OrderDate) AS LastOrderDate,
COUNT(*) AS OrderCount
FROM dbo.Orders
GROUP BY CustomerId;
Now every selected value has a clear meaning for the customer group.
WHERE vs HAVING
WHERE filters rows before grouping. HAVING filters groups after aggregation.
SELECT
CustomerId,
COUNT(*) AS CompletedOrderCount
FROM dbo.Orders
WHERE Status = N'Completed'
GROUP BY CustomerId
HAVING COUNT(*) >= 5;
Execution conceptually works like this:
FROMfinds the source rows.WHEREremoves non-completed orders.GROUP BYgroups remaining rows by customer.COUNT(*)counts rows per customer.HAVINGkeeps only groups with at least five completed orders.SELECTreturns the final grouped result.ORDER BYsorts the final result if specified.
Use WHERE for row-level filters and HAVING for aggregate filters.
Bad:
SELECT CustomerId, COUNT(*) AS OrderCount
FROM dbo.Orders
GROUP BY CustomerId
HAVING Status = N'Completed';
Status is neither grouped nor aggregated. The row-level status filter belongs in WHERE.
COUNT Star vs COUNT Column
COUNT(*) counts rows.
SELECT COUNT(*) AS RowCount
FROM dbo.Customers;
COUNT(column) counts non-null values in that column.
SELECT
COUNT(*) AS CustomerRows,
COUNT(MiddleName) AS CustomersWithMiddleName
FROM dbo.Customers;
If MiddleName is nullable, COUNT(MiddleName) can be smaller than COUNT(*).
This is a common interview trap. Use COUNT(*) when you mean rows. Use COUNT(column) when you intentionally mean rows where that column has a value.
SUM and AVG
SUM adds numeric values. AVG calculates the average of non-null values.
SELECT
CustomerId,
SUM(TotalAmount) AS TotalSpent,
AVG(TotalAmount) AS AverageOrderValue
FROM dbo.Orders
WHERE Status = N'Completed'
GROUP BY CustomerId;
Be careful with nullable values:
SELECT
AVG(DiscountAmount) AS AverageDiscountAmongDiscountedRows,
AVG(COALESCE(DiscountAmount, 0)) AS AverageDiscountAcrossAllRows
FROM dbo.OrderLines;
These two averages answer different questions. The first ignores rows where DiscountAmount is NULL. The second treats missing discount as zero.
MIN and MAX
MIN and MAX return the smallest and largest values in a group.
SELECT
CustomerId,
MIN(OrderDate) AS FirstOrderDate,
MAX(OrderDate) AS LastOrderDate
FROM dbo.Orders
GROUP BY CustomerId;
They work with numbers, dates, and comparable strings. In most business queries, they are used for:
- First or last date.
- Lowest or highest value.
- Earliest or latest event.
- Lexicographically smallest or largest code.
Do not assume MAX(OrderId) means latest order unless the key ordering truly matches creation order and the business accepts that assumption. Prefer an actual timestamp when recency matters.
DISTINCT Aggregates
COUNT(DISTINCT column) counts unique non-null values.
SELECT
COUNT(*) AS OrderRows,
COUNT(CustomerId) AS CustomerReferences,
COUNT(DISTINCT CustomerId) AS DistinctCustomers
FROM dbo.Orders;
Use distinct aggregates when the question is about unique values, not rows.
Common examples:
- Count distinct customers who ordered.
- Count distinct products sold.
- Count distinct login days per user.
Be careful with joins. A join can multiply rows and make COUNT(*) larger than expected. Sometimes COUNT(DISTINCT ...) is correct; other times the query should aggregate before joining.
Conditional Aggregation
Conditional aggregation calculates multiple related counts or totals in one grouped query.
SELECT
CustomerId,
COUNT(*) AS TotalOrders,
SUM(CASE WHEN Status = N'Completed' THEN 1 ELSE 0 END) AS CompletedOrders,
SUM(CASE WHEN Status = N'Cancelled' THEN 1 ELSE 0 END) AS CancelledOrders,
SUM(CASE WHEN Status = N'Completed' THEN TotalAmount ELSE 0 END) AS CompletedRevenue
FROM dbo.Orders
GROUP BY CustomerId;
This pattern is common for dashboards and reports.
For counts, you may also see:
COUNT(CASE WHEN Status = N'Completed' THEN 1 END) AS CompletedOrders
This works because COUNT(expression) counts non-null expressions. The SUM(CASE...) version is often clearer when the result is numeric and explicit.
Grouping By Expressions
You can group by expressions, not only raw columns.
SELECT
CAST(OrderDate AS date) AS OrderDay,
COUNT(*) AS OrderCount
FROM dbo.Orders
GROUP BY CAST(OrderDate AS date);
This returns one row per calendar day.
Be careful with performance. Applying functions to a column can make filtering less efficient if used in WHERE.
Less efficient filter:
WHERE CAST(OrderDate AS date) = '2026-06-19'
Often better:
WHERE OrderDate >= '20260619'
AND OrderDate < '20260620'
Then group by the date expression if needed for display.
Grouping NULL Values
When a grouping column contains NULL, SQL Server groups all NULL values together.
SELECT
Region,
COUNT(*) AS CustomerCount
FROM dbo.Customers
GROUP BY Region;
Rows with Region IS NULL appear as one group where Region is NULL.
If you want a display label, use a separate expression:
SELECT
COALESCE(Region, N'Unknown') AS RegionName,
COUNT(*) AS CustomerCount
FROM dbo.Customers
GROUP BY COALESCE(Region, N'Unknown');
Be cautious: replacing NULL with a label can merge true unknowns with a real value if that label also exists in the data.
Aggregates and NULL Values
Most aggregate functions ignore NULL values. COUNT(*) is the major exception because it counts rows.
SELECT
COUNT(*) AS RowsInGroup,
COUNT(ShippedAtUtc) AS ShippedRows,
MIN(ShippedAtUtc) AS FirstShippedAt,
MAX(ShippedAtUtc) AS LastShippedAt
FROM dbo.Orders;
If ShippedAtUtc is nullable, COUNT(ShippedAtUtc) counts only shipped rows.
This can be useful, but it must be intentional.
HAVING Without GROUP BY
HAVING is usually used with GROUP BY, but SQL can also apply it to a single implicit group.
SELECT COUNT(*) AS OrderCount
FROM dbo.Orders
HAVING COUNT(*) > 0;
This returns the aggregate row only if the aggregate condition is true. It is less common in application code but useful to understand for interviews.
ROLLUP, CUBE, and GROUPING SETS
SQL Server supports extended grouping options for subtotal and reporting queries.
ROLLUP produces hierarchical totals:
SELECT
Region,
SalesPersonId,
SUM(TotalAmount) AS Revenue
FROM dbo.Orders
GROUP BY ROLLUP (Region, SalesPersonId);
GROUPING SETS lets you specify exact grouping combinations:
SELECT
Region,
Status,
SUM(TotalAmount) AS Revenue
FROM dbo.Orders
GROUP BY GROUPING SETS
(
(Region, Status),
(Region),
()
);
These are useful for reports that need detail rows, subtotals, and grand totals in one result. They can be overkill for simple API queries.
GROUP BY Does Not Sort
GROUP BY defines groups. It does not guarantee output order.
Bad assumption:
SELECT Status, COUNT(*) AS OrderCount
FROM dbo.Orders
GROUP BY Status;
If order matters, add ORDER BY:
SELECT Status, COUNT(*) AS OrderCount
FROM dbo.Orders
GROUP BY Status
ORDER BY OrderCount DESC, Status;
This matters for deterministic reports and tests.
Join Multiplication
Aggregates after joins can be wrong when joins multiply rows.
Example:
SELECT
c.CustomerId,
COUNT(*) AS OrderCount
FROM dbo.Customers AS c
JOIN dbo.Orders AS o
ON o.CustomerId = c.CustomerId
JOIN dbo.OrderLines AS ol
ON ol.OrderId = o.OrderId
GROUP BY c.CustomerId;
This counts order lines, not orders. If an order has three lines, it contributes three rows.
Better:
SELECT
c.CustomerId,
COUNT(DISTINCT o.OrderId) AS OrderCount
FROM dbo.Customers AS c
JOIN dbo.Orders AS o
ON o.CustomerId = c.CustomerId
JOIN dbo.OrderLines AS ol
ON ol.OrderId = o.OrderId
GROUP BY c.CustomerId;
Or aggregate orders before joining to order lines if that better matches the report.
Performance Considerations
Grouped queries often benefit from indexes that match filters and grouping keys.
Example:
CREATE INDEX IX_Orders_Status_CustomerId
ON dbo.Orders(Status, CustomerId)
INCLUDE (TotalAmount);
This can help a query like:
SELECT
CustomerId,
SUM(TotalAmount) AS CompletedRevenue
FROM dbo.Orders
WHERE Status = N'Completed'
GROUP BY CustomerId;
Performance depends on data distribution, statistics, indexes, query shape, and result size. Avoid assuming all aggregation problems are solved by indexes. Sometimes the correct answer is a summary table, indexed view, filtered index, partitioning, or a separate reporting model.
Common Mistakes
Common mistakes include:
- Selecting columns that are not grouped or aggregated.
- Using
HAVINGfor row filters that belong inWHERE. - Forgetting that
COUNT(column)ignoresNULL. - Assuming
GROUP BYsorts results. - Counting multiplied join rows accidentally.
- Using
COUNT(DISTINCT ...)as a bandage instead of fixing the join shape. - Treating
NULLand zero as the same in averages. - Grouping at the wrong granularity.
- Using
MAX(Id)as a proxy for latest date without proving it. - Returning subtotals from
ROLLUPwithout labeling them clearly.
Best Practices
Best practices include:
- Decide the granularity before writing the query.
- Filter rows in
WHEREbefore grouping. - Filter aggregate results in
HAVING. - Use
COUNT(*)for row count. - Use
COUNT(column)only when non-null count is intended. - Use clear aggregate aliases.
- Check joins for row multiplication.
- Add
ORDER BYwhen result order matters. - Test edge cases with no rows, null values, and duplicate joins.
- Use indexes or summary structures based on measured workload.