Normalization and when denormalization is justified Interview Questions

Overview

Normalization is the process of organizing relational tables so each fact is stored in the right place, relationships are explicit, and redundancy is reduced. A normalized schema protects data integrity by avoiding update anomalies, insert anomalies, delete anomalies, duplicated facts, and unclear dependencies.

Denormalization is the deliberate decision to store redundant or precomputed data after understanding the normalized model. It is usually done to improve read performance, simplify reporting queries, preserve historical snapshots, or support read-heavy application patterns. It is not the same as never normalizing. A denormalized design should have a clear reason, an owner, and a consistency strategy.

This topic matters because relational modeling questions are common in SQL interviews. Interviewers want to know whether a candidate can design tables that match business facts, explain normal forms in practical language, avoid duplication problems, and also recognize when strict normalization is not the best production choice.

The practical goal is balance: normalize the source-of-truth model enough to protect correctness, then denormalize intentionally where measurements, workload shape, reporting needs, or historical requirements justify the extra complexity.

Core Concepts

What Normalization Solves

Normalization reduces the risk that one real-world fact is stored in multiple places.

Bad design:

Code

CREATE TABLE Orders
(
    OrderId INT NOT NULL PRIMARY KEY,
    CustomerId INT NOT NULL,
    CustomerName NVARCHAR(200) NOT NULL,
    CustomerEmail NVARCHAR(320) NOT NULL,
    ProductId INT NOT NULL,
    ProductName NVARCHAR(200) NOT NULL,
    ProductPrice DECIMAL(19, 4) NOT NULL,
    Quantity INT NOT NULL
);

This table mixes order facts, customer facts, product facts, and line-item facts. If a customer email changes, many order rows may need updates. If a product name changes, old orders may be accidentally rewritten. If an order has multiple products, order-level data is duplicated.

Better normalized shape:

Code

CREATE TABLE Customers
(
    CustomerId INT NOT NULL PRIMARY KEY,
    Email NVARCHAR(320) NOT NULL UNIQUE,
    DisplayName NVARCHAR(200) NOT NULL
);

CREATE TABLE Products
(
    ProductId INT NOT NULL PRIMARY KEY,
    Name NVARCHAR(200) NOT NULL,
    CurrentPrice DECIMAL(19, 4) NOT NULL
);

CREATE TABLE Orders
(
    OrderId INT NOT NULL PRIMARY KEY,
    CustomerId INT NOT NULL,
    OrderedAtUtc DATETIME2 NOT NULL,
    CONSTRAINT FK_Orders_Customers
        FOREIGN KEY (CustomerId) REFERENCES Customers(CustomerId)
);

CREATE TABLE OrderLines
(
    OrderId INT NOT NULL,
    LineNumber INT NOT NULL,
    ProductId INT NOT NULL,
    Quantity INT NOT NULL,
    UnitPrice DECIMAL(19, 4) NOT NULL,
    CONSTRAINT PK_OrderLines PRIMARY KEY (OrderId, LineNumber),
    CONSTRAINT FK_OrderLines_Orders
        FOREIGN KEY (OrderId) REFERENCES Orders(OrderId),
    CONSTRAINT FK_OrderLines_Products
        FOREIGN KEY (ProductId) REFERENCES Products(ProductId)
);

Each table owns a clear kind of fact.

Data Anomalies

Normalization prevents common anomalies:

Update anomaly: the same fact must be updated in multiple rows.
Insert anomaly: a fact cannot be stored until unrelated data exists.
Delete anomaly: deleting one row accidentally removes the only copy of another fact.
Inconsistent dependency: a column depends on the wrong key or only part of a key.

Example update anomaly:

Code

-- ProductName is copied into many order rows.
UPDATE Orders
SET ProductName = N'Wireless Keyboard'
WHERE ProductId = 42;

If one row is missed, reports now disagree about the product name. Normalization stores current product name once in Products. Historical order display names can still be stored as a deliberate snapshot, but that is a different requirement.

First Normal Form

First normal form means rows and columns represent scalar values, and repeating groups are removed.

Bad:

Code

CREATE TABLE Customers
(
    CustomerId INT NOT NULL PRIMARY KEY,
    Name NVARCHAR(200) NOT NULL,
    Phone1 NVARCHAR(30) NULL,
    Phone2 NVARCHAR(30) NULL,
    Phone3 NVARCHAR(30) NULL
);

This design breaks down when a customer has four phone numbers, and it makes querying awkward.

Better:

Code

CREATE TABLE CustomerPhones
(
    CustomerId INT NOT NULL,
    PhoneNumber NVARCHAR(30) NOT NULL,
    PhoneType NVARCHAR(20) NOT NULL,
    CONSTRAINT PK_CustomerPhones PRIMARY KEY (CustomerId, PhoneNumber),
    CONSTRAINT FK_CustomerPhones_Customers
        FOREIGN KEY (CustomerId) REFERENCES Customers(CustomerId)
);

The one-to-many relationship is modeled explicitly.

Second Normal Form

Second normal form matters when a table has a composite key. Every non-key column should depend on the whole key, not only part of it.

Bad:

Code

CREATE TABLE Enrollment
(
    StudentId INT NOT NULL,
    CourseId INT NOT NULL,
    CourseName NVARCHAR(200) NOT NULL,
    EnrolledAtUtc DATETIME2 NOT NULL,
    CONSTRAINT PK_Enrollment PRIMARY KEY (StudentId, CourseId)
);

CourseName depends on CourseId, not on the whole (StudentId, CourseId) key. If many students enroll in the same course, the course name is duplicated.

Better:

Code

CREATE TABLE Courses
(
    CourseId INT NOT NULL PRIMARY KEY,
    CourseName NVARCHAR(200) NOT NULL
);

CREATE TABLE Enrollment
(
    StudentId INT NOT NULL,
    CourseId INT NOT NULL,
    EnrolledAtUtc DATETIME2 NOT NULL,
    CONSTRAINT PK_Enrollment PRIMARY KEY (StudentId, CourseId),
    CONSTRAINT FK_Enrollment_Courses
        FOREIGN KEY (CourseId) REFERENCES Courses(CourseId)
);

Third Normal Form

Third normal form means non-key columns should depend on the key, the whole key, and nothing but the key. Avoid transitive dependencies where one non-key column determines another non-key column.

Bad:

Code

CREATE TABLE Employees
(
    EmployeeId INT NOT NULL PRIMARY KEY,
    FullName NVARCHAR(200) NOT NULL,
    DepartmentId INT NOT NULL,
    DepartmentName NVARCHAR(200) NOT NULL
);

DepartmentName depends on DepartmentId, not directly on EmployeeId.

Better:

Code

CREATE TABLE Departments
(
    DepartmentId INT NOT NULL PRIMARY KEY,
    DepartmentName NVARCHAR(200) NOT NULL UNIQUE
);

CREATE TABLE Employees
(
    EmployeeId INT NOT NULL PRIMARY KEY,
    FullName NVARCHAR(200) NOT NULL,
    DepartmentId INT NOT NULL,
    CONSTRAINT FK_Employees_Departments
        FOREIGN KEY (DepartmentId) REFERENCES Departments(DepartmentId)
);

Third normal form is often the practical target for transactional schemas. Higher normal forms exist, but many interviews focus on 1NF, 2NF, 3NF, keys, dependencies, and anomalies.

Functional Dependencies

A functional dependency means one value determines another value.

Examples:

CustomerId determines CustomerEmail.
ProductId determines current product name.
PostalCode may determine city in some countries, but not reliably in every data model.
(OrderId, LineNumber) determines line quantity and unit price.

Normalization is mostly about placing columns where their dependencies belong. If a column depends on a different entity's key, it probably belongs in that entity's table.

Candidate Keys and Natural Facts

A candidate key is a column or set of columns that can uniquely identify a row. A table can have multiple candidate keys. One becomes the primary key; others are usually enforced with unique constraints.

Code

CREATE TABLE Users
(
    UserId BIGINT NOT NULL PRIMARY KEY,
    Email NVARCHAR(320) NOT NULL,
    ExternalIdentityProvider NVARCHAR(50) NOT NULL,
    ExternalSubject NVARCHAR(200) NOT NULL,

    CONSTRAINT UQ_Users_Email UNIQUE (Email),
    CONSTRAINT UQ_Users_ExternalIdentity
        UNIQUE (ExternalIdentityProvider, ExternalSubject)
);

Normalization is not just splitting tables. It also means enforcing the keys that make the model true.

Many-to-Many Relationships

A many-to-many relationship usually needs a junction table.

Code

CREATE TABLE Students
(
    StudentId INT NOT NULL PRIMARY KEY,
    FullName NVARCHAR(200) NOT NULL
);

CREATE TABLE Courses
(
    CourseId INT NOT NULL PRIMARY KEY,
    CourseName NVARCHAR(200) NOT NULL
);

CREATE TABLE StudentCourses
(
    StudentId INT NOT NULL,
    CourseId INT NOT NULL,
    EnrolledAtUtc DATETIME2 NOT NULL,
    Grade NVARCHAR(5) NULL,
    CONSTRAINT PK_StudentCourses PRIMARY KEY (StudentId, CourseId),
    CONSTRAINT FK_StudentCourses_Students
        FOREIGN KEY (StudentId) REFERENCES Students(StudentId),
    CONSTRAINT FK_StudentCourses_Courses
        FOREIGN KEY (CourseId) REFERENCES Courses(CourseId)
);

Do not store course IDs as comma-separated text in Students. That breaks referential integrity, indexing, filtering, and joins.

Denormalization

Denormalization is deliberately storing redundant, copied, prejoined, or precomputed data.

Common forms:

Copying display values into a transaction table.
Storing summary counts or totals.
Creating reporting tables.
Creating read models for application screens.
Using indexed views or materialized summaries.
Flattening dimensions for analytics.
Keeping historical snapshots of mutable reference data.

Example historical snapshot:

Code

CREATE TABLE OrderLines
(
    OrderId INT NOT NULL,
    LineNumber INT NOT NULL,
    ProductId INT NOT NULL,
    ProductNameAtPurchase NVARCHAR(200) NOT NULL,
    UnitPriceAtPurchase DECIMAL(19, 4) NOT NULL,
    Quantity INT NOT NULL,
    CONSTRAINT PK_OrderLines PRIMARY KEY (OrderId, LineNumber),
    CONSTRAINT FK_OrderLines_Products
        FOREIGN KEY (ProductId) REFERENCES Products(ProductId)
);

This is not careless duplication. It preserves what the customer actually bought at that time, even if the product is renamed or repriced later.

When Denormalization Is Justified

Denormalization can be justified when:

A measured read bottleneck remains after indexing and query tuning.
Reports repeatedly aggregate the same large data.
A screen needs a read model that avoids complex joins on every request.
Historical records must preserve past names, prices, tax rates, or addresses.
Data warehouse or analytics workloads favor star schemas or flattened dimensions.
A service boundary requires copying data owned by another service.
The system has many reads and relatively few writes.

Example summary table:

Code

CREATE TABLE ProductSalesDaily
(
    ProductId INT NOT NULL,
    SalesDate DATE NOT NULL,
    OrderCount INT NOT NULL,
    QuantitySold INT NOT NULL,
    Revenue DECIMAL(19, 4) NOT NULL,
    LastUpdatedAtUtc DATETIME2 NOT NULL,
    CONSTRAINT PK_ProductSalesDaily PRIMARY KEY (ProductId, SalesDate)
);

This table can speed up dashboards, but the team must decide how it is updated, rebuilt, corrected, and monitored.

Denormalization Alternatives

Before copying data into new tables, consider:

Proper primary and foreign keys.
Indexes on join and filter columns.
Covering indexes.
Filtered indexes.
Computed columns.
Indexed views where SQL Server requirements are acceptable.
Query rewrites.
Pagination and result limits.
Caching outside the database.
Archiving old data.

Denormalization should not be the first response to a slow query. Often the real problem is a missing index, non-sargable predicate, bad statistics, or a query that retrieves too much data.

Indexed Views as Controlled Redundancy

In SQL Server, an indexed view can persist a computed result set and let the optimizer use it for some queries. This is a controlled form of redundancy maintained by the database engine.

Simplified example:

Code

CREATE VIEW dbo.ProductRevenueByDay
WITH SCHEMABINDING
AS
SELECT
    ProductId,
    CAST(OrderDateUtc AS DATE) AS SalesDate,
    COUNT_BIG(*) AS RowCount,
    SUM(Quantity) AS QuantitySold
FROM dbo.OrderLines
GROUP BY ProductId, CAST(OrderDateUtc AS DATE);
GO

CREATE UNIQUE CLUSTERED INDEX IX_ProductRevenueByDay
ON dbo.ProductRevenueByDay(ProductId, SalesDate);

Indexed views have restrictions and write overhead. They are not a casual replacement for good schema design, but they can be useful when a repeated aggregate is expensive and the database can maintain the result correctly.

Consistency Cost

Every denormalized value introduces a consistency question:

What is the source of truth?
How is the copy updated?
Is it updated synchronously in the same transaction?
Can it be eventually consistent?
How are failures retried?
How can the copy be rebuilt?
How will stale or inconsistent values be detected?

If the team cannot answer these questions, denormalization is probably premature.

OLTP vs Reporting Models

Transactional OLTP schemas usually value consistency, normalized relationships, and safe writes. Reporting and analytical models often value read speed, simpler query shapes, and precomputed metrics.

It is common to have both:

A normalized transactional model for orders, customers, products, and payments.
A denormalized reporting model for daily sales, product performance, and customer cohorts.

The important design choice is not "normalize everything forever." It is "choose the right model for the workload and protect the source of truth."

Common Mistakes

Common mistakes include:

Treating normalization as splitting tables randomly.
Storing lists in comma-separated columns.
Using repeated columns such as Phone1, Phone2, Phone3.
Duplicating names and statuses everywhere without a reason.
Denormalizing before measuring performance.
Forgetting to enforce candidate keys.
Ignoring update, insert, and delete anomalies.
Creating summary tables without rebuild or correction logic.
Using index keys as a substitute for a real relational model.
Confusing historical snapshots with accidental duplication.

Best Practices

Best practices include:

Start with clear entities, relationships, keys, and dependencies.
Normalize transactional source-of-truth tables to reduce anomalies.
Use constraints to enforce important rules.
Add indexes before changing the logical model for performance.
Denormalize only for a named reason.
Keep the source of truth explicit.
Make denormalized data rebuildable where possible.
Document consistency expectations.
Test write paths that maintain redundant data.
Separate OLTP and reporting models when their workloads differ.

Normalization and when denormalization is justified

Overview

Core Concepts

What Normalization Solves

Data Anomalies

First Normal Form

Second Normal Form

Third Normal Form

Functional Dependencies

Candidate Keys and Natural Facts

Many-to-Many Relationships

Denormalization

When Denormalization Is Justified

Denormalization Alternatives

Indexed Views as Controlled Redundancy

Consistency Cost

OLTP vs Reporting Models

Common Mistakes

Best Practices

Interview Practice

Beginner Interview Practice

Intermediate Interview Practice

Advanced Interview Practice