High availability and disaster recovery basics Interview Questions

Overview

High availability and disaster recovery are related but different disciplines. High availability keeps a service running through expected component failures, such as a server or instance failure. Disaster recovery restores service after a larger failure, such as storage loss, data center outage, regional outage, severe corruption, ransomware, or operational mistakes.

This topic matters because production SQL systems need more than fast queries. They need continuity plans. The right design depends on RPO, RTO, cost, edition, licensing, operational skill, network latency, data loss tolerance, read-scale needs, and application failover behavior. Always On availability groups, failover cluster instances, log shipping, backups, replication, and cloud platform features all solve different parts of the problem.

For interviews, HA/DR basics test whether you can match business objectives to database architecture. Strong candidates can explain availability groups, failover modes, synchronous vs asynchronous replication, failover cluster instances, log shipping, backups, RPO/RTO, testing, and why replicas do not eliminate backup or restore planning.

Core Concepts

High Availability Vs Disaster Recovery

High availability focuses on keeping service available during common failures.

Examples:

SQL Server instance failure.
Operating system patching.
Server hardware failure.
Planned maintenance.
Local failover.

Disaster recovery focuses on recovering from major incidents.

Examples:

Data center loss.
Regional outage.
Storage corruption.
Ransomware.
Accidental destructive deployment.
Loss of primary environment.

HA is about continuity. DR is about recovery after a larger break.

RPO And RTO

RPO and RTO drive architecture.

Objective	HA/DR Meaning
RPO	How much committed data can be lost
RTO	How quickly service must be restored

If the business requires near-zero data loss and quick failover, synchronous replicas and automatic failover may be needed. If the business can tolerate delayed recovery and some data loss, log shipping or backup restore may be enough.

Always On Availability Groups

An availability group is a SQL Server HA/DR feature that replicates a set of user databases from a primary replica to one or more secondary replicas. The databases in an availability group fail over together.

Key concepts:

Primary replica accepts read-write workload.
Secondary replicas receive transaction log records.
Secondary replicas can be failover targets.
Some secondary replicas can support read-only workloads.
A listener can route client connections.
Backups may be offloaded to suitable secondary replicas.

Availability groups protect at the database group level, not the entire SQL Server instance.

Synchronous Vs Asynchronous Commit

Availability groups support synchronous and asynchronous commit modes.

Mode	Behavior	Trade-Off
Synchronous commit	Primary waits for secondary to harden log before commit completes	Lower data loss risk, higher transaction latency
Asynchronous commit	Primary commits without waiting for secondary acknowledgement	Lower latency, possible data loss if primary fails

Synchronous commit is common for local HA when latency is low. Asynchronous commit is common for distant DR replicas where latency would slow the primary workload too much.

Failover Types

Common failover types include:

Automatic failover.
Planned manual failover.
Forced failover with possible data loss.

Automatic failover requires the right configuration and a synchronized failover partner. Forced failover is a disaster action when the primary is unavailable and some data loss may be accepted.

Applications must also be designed to reconnect and retry safely after failover.

Availability Group Listener

An availability group listener is a stable network name that clients use to connect to the current primary replica, and sometimes to read-only replicas through routing configuration.

Without a listener or equivalent connection abstraction, applications may need connection string changes during failover. That increases RTO and operational risk.

Failover Cluster Instances

A Failover Cluster Instance provides instance-level high availability by running SQL Server as a clustered resource across nodes. It protects the SQL Server instance name and instance-level objects, but usually depends on shared storage.

Key distinction:

FCI protects an instance.
Availability group protects databases.

An FCI does not provide readable secondaries. Availability groups can provide readable secondaries and database-level replication.

Log Shipping

Log shipping backs up transaction logs on a primary database, copies them to one or more secondary servers, and restores them there. It is simpler than availability groups and can be useful for disaster recovery with a relaxed RPO/RTO.

Trade-offs:

Data can lag behind based on backup/copy/restore frequency.
Failover is usually manual.
Secondary database may be in restoring or standby mode.
Useful for low-cost DR and reporting scenarios when latency is acceptable.

Backups In HA/DR

Backups remain mandatory even with HA/DR replicas. Replication can copy bad changes, corruption scenarios, accidental drops, or malicious updates. Backups provide historical recovery points and independent protection.

Availability groups, FCIs, and log shipping improve availability and recovery options. Backups protect recoverability across time.

Readable Secondaries

Readable secondary replicas can offload read-only workloads such as reporting or some backups. This can reduce load on the primary, but it introduces design questions:

Is read latency acceptable?
Are queries safe against slightly stale data?
Is read-only routing configured?
Are statistics and plans appropriate on secondaries?
Will reporting workload affect redo or failover readiness?

Readable secondaries are not free capacity without operational trade-offs.

Quorum And Split-Brain

Clustered HA systems require quorum or consensus to decide which node can own resources. Quorum helps avoid split-brain, where two nodes both think they are primary.

Interview-level understanding: automatic failover depends not only on SQL Server health but also on cluster configuration, voting, witness design, and network behavior.

Testing HA/DR

HA/DR plans must be tested.

Testing should include:

Planned failover.
Unplanned failover simulation.
Application reconnect behavior.
Data loss measurement.
Restore from backups.
DR environment activation.
Runbook timing.
Monitoring and alerting.

Untested HA/DR architecture is a diagram, not a proven plan.

Common Mistakes

Common mistakes include:

Confusing HA with backup.
Assuming replicas prevent user error.
Not testing application reconnect behavior.
Ignoring RPO/RTO when choosing architecture.
Using synchronous replication across high-latency links without measuring write impact.
Forgetting SQL Agent jobs, logins, linked servers, and instance-level dependencies.
Not monitoring replica lag.
Not documenting forced failover data-loss decisions.
Failing to practice DR restore.

Best Practices

Best practices include:

Define RPO and RTO first.
Pick HA/DR technology based on business needs.
Keep backups independent and tested.
Use synchronous commit only where latency is acceptable.
Use asynchronous DR replicas when distance makes synchronous commit too expensive.
Use listeners or stable connection abstractions.
Test failover and application retries.
Monitor synchronization health and lag.
Document manual and forced failover procedures.
Include security, jobs, maintenance, and dependent services in the plan.

High availability and disaster recovery basics