Overview
Distributed systems fail in partial and time-dependent ways. A dependency can be briefly unavailable, remain unhealthy for minutes, slow down until callers exhaust their own resources, or receive a traffic burst beyond its safe processing rate.
Four complementary resilience patterns address different problems:
- Retry: repeat an operation when a failure is likely to be transient.
- Circuit breaker: temporarily stop calls that are likely to fail.
- Bulkhead: isolate resource pools so one failing workload cannot consume everything.
- Queue-based load leveling: buffer bursts and process work at a controlled rate.
These patterns are not interchangeable. Aggressive retries can amplify overload. A circuit breaker does not limit concurrency by itself. A bulkhead rejects excess work but does not durably preserve it. A queue preserves work but introduces added latency, eventual rather than immediate completion, possible duplicate delivery, and a backlog that must be operated.
Good resilience design starts with:
- A defined end-to-end deadline.
- Failure classification.
- Idempotency.
- Bounded concurrency and queues.
- Downstream capacity awareness.
- Graceful degradation.
- Telemetry for attempts, breaker state, saturation, and backlog age.
This topic matters in interviews because candidates must explain how patterns compose without creating retry storms, hidden latency, cascading failure, or unbounded queues.
Core Concepts
Transient, Persistent, and Permanent Failures
Classify failures before choosing a policy.
Transient
- Brief network interruption.
- Temporary throttling.
- A short leader election.
- Momentary dependency overload.
Persistent
- Dependency outage.
- Broken route or certificate.
- Exhausted capacity that will not recover quickly.
Permanent for the request
- Invalid input.
- Authentication or authorization failure.
- Missing resource.
- Business-rule rejection.
Retry transient failures. Fail fast or degrade for persistent faults. Do not retry permanent failures without changing the request or system state.
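The classification rule above can be sketched in code. This is a minimal illustration, not a definitive mapping: the FailureClassifier type and its status-code table are assumptions made for the example, and a real mapping should follow the dependency's documented behavior. A single response cannot reveal a persistent fault, so the sketch returns only Transient or Permanent; persistence shows up as repeated transient failures over time.

```csharp
using System;

public enum FailureKind { Transient, Persistent, Permanent }

public static class FailureClassifier
{
    // Hypothetical mapping from an HTTP failure status to a failure kind.
    // Persistent is never returned here: it can only be inferred from
    // repeated failures over time, not from one response.
    public static FailureKind Classify(int statusCode) => statusCode switch
    {
        408 or 429 or >= 500 => FailureKind.Transient,
        >= 400 => FailureKind.Permanent,
        _ => throw new ArgumentOutOfRangeException(
            nameof(statusCode), statusCode, "Expected a 4xx or 5xx status.")
    };

    // Only transient failures are retry candidates.
    public static bool ShouldRetry(FailureKind kind) => kind == FailureKind.Transient;
}
```

For example, Classify(503) yields Transient and Classify(404) yields Permanent, so only the first is a retry candidate.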
Retry Pattern
A retry repeats a failed operation after a deliberate delay:
attempt
-> transient failure
-> wait with backoff and jitter
-> retry
-> success or final failure
Common strategies:
- Immediate retry for rare transport glitches.
- Fixed delay.
- Linear backoff.
- Exponential backoff.
- Server-directed delay through Retry-After.
Jitter spreads retries from many clients so they do not synchronize into another traffic spike.
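One common way to combine exponential backoff with jitter is "full jitter": pick a uniformly random delay between zero and a capped exponential ceiling. The sketch below assumes that variant; the Backoff type, its parameter names, and the choice of full jitter are illustrative assumptions, not something the text prescribes.

```csharp
using System;

public static class Backoff
{
    // Full jitter: delay is uniform in [0, min(maxDelay, baseDelay * 2^(retryNumber - 1))).
    // retryNumber is 1 for the first retry.
    public static TimeSpan NextDelay(
        int retryNumber, TimeSpan baseDelay, TimeSpan maxDelay, Random random)
    {
        if (retryNumber < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(retryNumber));
        }

        double exponentialMs = baseDelay.TotalMilliseconds * Math.Pow(2, retryNumber - 1);
        double ceilingMs = Math.Min(maxDelay.TotalMilliseconds, exponentialMs);
        return TimeSpan.FromMilliseconds(random.NextDouble() * ceilingMs);
    }
}
```

With a 100 ms base delay, the third retry waits somewhere between 0 and 400 ms, so clients that failed at the same instant are unlikely to retry at the same instant.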
Retry Safety and Idempotency
A timeout does not prove that the remote operation failed:
server commits order
response is lost
client retries
Safe retry mechanisms include:
- Naturally idempotent operations.
- Client-generated operation IDs.
- HTTP idempotency keys.
- Conditional updates.
- Database uniqueness constraints.
- Provider-supported deduplication.
- Status lookup after an unknown outcome.
Blindly retrying POST, payment, email, or inventory operations can duplicate effects.
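A client-generated key lets the server recognize a retry of the same logical operation. The sketch below uses an in-memory dictionary as a stand-in for what would normally be a durable store with a uniqueness constraint; IdempotentOrderService and its members are hypothetical names chosen for the example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// In-memory stand-in for a durable store keyed by idempotency key (hypothetical).
public sealed class IdempotentOrderService
{
    private readonly ConcurrentDictionary<string, Lazy<string>> outcomes = new();
    private int ordersCreated;

    public int OrdersCreated => Volatile.Read(ref ordersCreated);

    // The first call for a key performs the side effect; later calls with the
    // same key return the recorded outcome without repeating it.
    public string PlaceOrder(string idempotencyKey, string item)
    {
        Lazy<string> outcome = outcomes.GetOrAdd(
            idempotencyKey,
            _ => new Lazy<string>(() => CreateOrder(item)));
        return outcome.Value;
    }

    private string CreateOrder(string item)
    {
        int number = Interlocked.Increment(ref ordersCreated);
        return $"order-{number}-{item}";
    }
}
```

If the response to the first PlaceOrder call is lost and the client retries with the same key, it receives the original order identifier and no second order is created.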
Retry Budgets and Deadlines
Retries must fit within one total time budget:
connection time + attempt durations + backoff delays + response processing <= total deadline
An interactive request might allow only one short retry. A background job may tolerate more attempts over minutes.
Pass cancellation and remaining deadline downstream. Do not start another attempt when too little time remains for it to complete usefully.
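The check for "too little time remains" can be made explicit before each attempt. This is a sketch under simple assumptions: the DeadlineBudget helper is hypothetical, and it treats the per-attempt timeout as the time an attempt needs to complete usefully.

```csharp
using System;

public static class DeadlineBudget
{
    // True when waiting for the planned delay and then running one full
    // attempt would still finish at or before the overall deadline.
    public static bool CanAttempt(
        DateTimeOffset now, DateTimeOffset deadline, TimeSpan delay, TimeSpan attemptTimeout)
        => now + delay + attemptTimeout <= deadline;
}
```

With 3 seconds left, a 0.5 second delay followed by a 2 second attempt fits; a 1.5 second delay followed by the same attempt does not, so the caller should return its final failure instead of starting it.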
Retry Amplification
Nested retries multiply:
gateway makes up to 3 attempts
service makes up to 3 attempts
database client makes up to 3 attempts
maximum attempts = 3 x 3 x 3 = 27
This can overwhelm a dependency during recovery.
Coordinate retry ownership:
- Prefer one layer with business context.
- Understand built-in SDK retries.
- Disable or reduce overlapping policies.
- Measure attempts, not only logical requests.
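The multiplication is worth computing for a real call chain. The helper below is an illustrative sketch (RetryAmplification is a hypothetical name); each argument is the total number of attempts a layer may make, counting the first call plus its retries.

```csharp
using System.Linq;

public static class RetryAmplification
{
    // Worst-case physical attempts reaching the innermost dependency for one
    // logical request: the product of the attempts allowed at each layer.
    public static long WorstCaseAttempts(params int[] attemptsPerLayer)
        => attemptsPerLayer.Aggregate(1L, (total, attempts) => total * attempts);
}
```

Three layers with three attempts each give 27. If each layer instead performs three retries after its first call, that is four attempts per layer and the worst case rises to 64.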
Circuit Breaker
A circuit breaker tracks recent outcomes and moves through:
Closed
- Calls are allowed.
- Outcomes are sampled to track the recent failure rate.
Open
- Calls fail immediately.
- The dependency receives recovery time.
Half-open
- A limited number of probes are allowed.
- Success closes the breaker.
- Failure reopens it.
The breaker protects callers from waiting on likely failures and reduces load on the unhealthy dependency.
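The three states can be sketched as a small state machine. This is a deliberately simplified illustration, not how any particular library implements it: SimpleCircuitBreaker is a hypothetical type, it opens on a consecutive-failure count rather than a sampled failure ratio, it admits a single half-open probe, and it takes an injected clock so the timing can be tested.

```csharp
using System;

public enum CircuitState { Closed, Open, HalfOpen }

public sealed class SimpleCircuitBreaker
{
    private readonly int failureThreshold;
    private readonly TimeSpan breakDuration;
    private readonly Func<DateTimeOffset> clock;
    private readonly object sync = new();
    private int consecutiveFailures;
    private DateTimeOffset openedAt;

    public SimpleCircuitBreaker(
        int failureThreshold, TimeSpan breakDuration, Func<DateTimeOffset> clock)
    {
        this.failureThreshold = failureThreshold;
        this.breakDuration = breakDuration;
        this.clock = clock;
    }

    public CircuitState State { get; private set; } = CircuitState.Closed;

    // Returns false when the caller should fail fast instead of calling.
    public bool TryAcquire()
    {
        lock (sync)
        {
            if (State == CircuitState.Open && clock() - openedAt >= breakDuration)
            {
                State = CircuitState.HalfOpen; // admit exactly one probe
                return true;
            }
            return State == CircuitState.Closed;
        }
    }

    public void RecordSuccess()
    {
        lock (sync)
        {
            consecutiveFailures = 0;
            State = CircuitState.Closed;
        }
    }

    public void RecordFailure()
    {
        lock (sync)
        {
            consecutiveFailures++;
            // A failed probe reopens immediately; otherwise wait for the threshold.
            if (State == CircuitState.HalfOpen || consecutiveFailures >= failureThreshold)
            {
                State = CircuitState.Open;
                openedAt = clock();
            }
        }
    }
}
```

A caller checks TryAcquire before each call and reports the outcome afterwards. A production breaker would also need a minimum-throughput rule and a way to recover if a probe never reports back.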
Breaker Configuration
Useful settings include:
- Failure ratio or count.
- Minimum throughput.
- Sampling duration.
- Break duration.
- Handled status codes and exceptions.
- Number of half-open probes.
A breaker should not open from one failure when traffic is too low to establish a meaningful failure rate. It should be scoped to the actual failure domain:
- Dependency.
- Endpoint.
- Region.
- Shard.
- Tenant or credential where quotas differ.
One global breaker can unnecessarily block healthy shards.
Retry and Circuit Breaker Composition
The usual intent is:
total timeout
-> retry policy
-> circuit breaker
-> attempt timeout
-> dependency
Exact pipeline ordering affects which calls count as breaker failures and how total time is bounded.
Rules:
- Stop retrying when the circuit is open.
- Count direct dependency failures, not every wrapper exception blindly.
- Respect 429 or 503 recovery hints.
- Keep attempts within the caller's deadline.
- Expose degraded behavior when the breaker opens.
Timeouts
Every remote operation needs:
- Connection timeout.
- Per-attempt timeout.
- Total operation timeout.
Timeouts should reflect measured latency and business deadlines. Too long permits resource buildup; too short creates false failures and retries.
Cancellation should propagate into HTTP calls, database operations, queue work, and downstream services where supported.
.NET HTTP Resilience
Modern .NET applications can use Microsoft.Extensions.Http.Resilience with IHttpClientFactory:
builder.Services
    .AddHttpClient<CatalogClient>(client =>
    {
        client.BaseAddress = new Uri("https://catalog.internal");
    })
    .AddStandardResilienceHandler(options =>
    {
        // Upper bound for the whole call, including every retry and delay.
        options.TotalRequestTimeout.Timeout = TimeSpan.FromSeconds(8);
        // Upper bound for each individual attempt.
        options.AttemptTimeout.Timeout = TimeSpan.FromSeconds(2);
        options.Retry.MaxRetryAttempts = 2;
        options.Retry.UseJitter = true;
    });
For unsafe HTTP methods, disable automatic retries unless the operation has an idempotency design:
builder.Services
    .AddHttpClient<PaymentClient>()
    .AddStandardResilienceHandler(options =>
    {
        options.Retry.DisableForUnsafeHttpMethods();
    });
Do not stack multiple uncoordinated resilience handlers. Configure policies per dependency and operation class.
Bulkhead Pattern
A bulkhead isolates capacity:
critical API pool -> 50 concurrent calls
reporting pool -> 10 concurrent calls
background export -> separate workers and queue
If reporting saturates its pool, critical API capacity remains available.
Bulkheads can isolate:
- Thread or task concurrency.
- Connection pools.
- Worker pools.
- Queues.
- Compute instances.
- Database pools.
- Tenants or workload classes.
The goal is controlled blast radius.
Semaphore and Queue Bulkheads
A concurrency limiter can permit a fixed number of operations:
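// At most 20 operations run at once; callers beyond that are rejected immediately.
private readonly SemaphoreSlim gate = new(initialCount: 20);

public async Task<T> ExecuteAsync<T>(
    Func<CancellationToken, Task<T>> operation,
    CancellationToken cancellationToken)
{
    // A zero wait means "take a slot now or fail fast"; nothing queues here.
    if (!await gate.WaitAsync(TimeSpan.Zero, cancellationToken))
    {
        // BulkheadRejectedException is assumed to be an application-defined exception type.
        throw new BulkheadRejectedException();
    }

    try
    {
        return await operation(cancellationToken);
    }
    finally
    {
        // Always return the slot, even when the operation throws or is cancelled.
        gate.Release();
    }
}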
A short bounded waiting queue can absorb minor variation. An unbounded queue hides overload and consumes memory while latency grows.
Bulkhead Partitioning
Partition by business priority and failure domain:
- Interactive versus background.
- Premium versus batch workloads.
- Read versus write.
- Dependency A versus dependency B.
- Large tenant versus shared pool.
Over-partitioning wastes capacity. Under-partitioning permits noisy neighbors. Measure utilization and rejected work to tune partitions.
Queue-Based Load Leveling
A durable queue separates arrival rate from processing rate:
bursty producers
-> durable queue
-> bounded consumers
-> constrained dependency
Benefits:
- Producers can complete intake quickly.
- Work survives temporary consumer outage.
- Consumers process at a safe rate.
- Capacity can scale independently.
Costs:
- Added latency.
- Duplicate and out-of-order delivery.
- Eventual completion.
- Poison messages.
- Backlog storage and retention.
- More difficult cancellation and user feedback.
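The shape of the pattern (bursty producer, bounded buffer, fixed number of consumers) can be sketched with System.Threading.Channels. This is only an in-process illustration: a channel is not durable, so it does not provide the "work survives consumer outage" benefit that a real broker does. LoadLevelingDemo, the capacity of 100, and the 1 ms delay standing in for the constrained dependency are all assumptions made for the example.

```csharp
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class LoadLevelingDemo
{
    // Writes a burst of work items and drains them with a fixed number of
    // consumers. Returns how many items were processed.
    public static async Task<int> RunAsync(int burstSize, int consumerCount)
    {
        var queue = Channel.CreateBounded<int>(new BoundedChannelOptions(capacity: 100)
        {
            // When the buffer is full, the producer waits: this is backpressure.
            FullMode = BoundedChannelFullMode.Wait
        });

        int processed = 0;
        var consumers = new Task[consumerCount];
        for (int i = 0; i < consumerCount; i++)
        {
            consumers[i] = Task.Run(async () =>
            {
                await foreach (int item in queue.Reader.ReadAllAsync())
                {
                    await Task.Delay(1); // stands in for the constrained dependency
                    Interlocked.Increment(ref processed);
                }
            });
        }

        for (int i = 0; i < burstSize; i++)
        {
            await queue.Writer.WriteAsync(i);
        }

        queue.Writer.Complete(); // lets consumers finish once the queue drains
        await Task.WhenAll(consumers);
        return processed;
    }
}
```

The consumer count bounds the load on the dependency regardless of how large the burst is, and the bounded capacity keeps the buffer from growing without limit.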
Queue Capacity Is Not Infinite
If average production exceeds consumption:
arrival rate > service rate
-> backlog grows
-> completion latency grows
-> retention or capacity is exhausted
Monitor oldest-message age, not only message count. Apply:
- Admission limits.
- Producer throttling.
- Priority queues.
- Load shedding.
- Safe consumer scaling.
- Dead-letter handling.
A queue delays overload; it does not create downstream capacity.
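A quick calculation shows how long a queue can absorb sustained overload. The helper below is an illustrative sketch (BacklogMath and its parameter names are hypothetical) that assumes constant average arrival and service rates.

```csharp
using System;

public static class BacklogMath
{
    // Seconds until the queue reaches capacity at constant rates.
    // Returns null when arrivals do not exceed the service rate.
    public static double? SecondsUntilFull(
        double arrivalPerSecond, double servicePerSecond,
        double currentDepth, double capacity)
    {
        double growthPerSecond = arrivalPerSecond - servicePerSecond;
        if (growthPerSecond <= 0)
        {
            return null; // backlog is stable or draining
        }

        return Math.Max(0, capacity - currentDepth) / growthPerSecond;
    }
}
```

At 120 messages per second arriving and 100 per second processed, an empty queue with room for 6,000 messages fills in 300 seconds. After that, something must throttle, shed, or reject work.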
Consumer Scaling and Downstream Protection
Scaling consumers based only on queue depth can overload the database.
Bound:
- Consumer instance count.
- Per-instance concurrency.
- Database connections.
- External request rate.
- Batch size.
Scale within the safe capacity of the slowest dependency. Use backpressure when downstream saturation appears.
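One way to express this bound is to cap the scale-out target by what the downstream can accept. The sketch below is an assumption-laden illustration: ConsumerScaling is a hypothetical helper, and it models downstream capacity as a single concurrency limit shared evenly across instances.

```csharp
using System;

public static class ConsumerScaling
{
    // Caps the instance count suggested by backlog so that total concurrency
    // (instances x per-instance concurrency) stays within the downstream limit.
    public static int SafeInstanceCount(
        int desiredByBacklog, int perInstanceConcurrency,
        int downstreamConcurrencyLimit, int maxInstances)
    {
        if (perInstanceConcurrency <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(perInstanceConcurrency));
        }

        int allowedByDownstream = downstreamConcurrencyLimit / perInstanceConcurrency;
        int capped = Math.Min(desiredByBacklog, Math.Min(allowedByDownstream, maxInstances));
        return Math.Max(0, capped);
    }
}
```

If backlog suggests 40 instances, each instance runs 8 concurrent operations, and the database tolerates 100 concurrent operations, the safe target is 12 instances rather than 40.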
Poison Messages
Retry malformed or permanently failing messages only a bounded number of times, then dead-letter them.
Operations need:
- Alerting.
- Failure reason.
- Safe inspection.
- Repair and replay.
- Ownership and retention.
A poison message should not block unrelated work unless strict ordering requires it.
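Bounded redelivery followed by dead-lettering can be sketched as follows. Real brokers usually track the delivery count and move the message themselves; this example simulates that in process so the control flow is visible. PoisonHandling, DeadLetter, and the list used as a dead-letter store are hypothetical names for illustration.

```csharp
using System;
using System.Collections.Generic;

public sealed record DeadLetter(string Body, int Attempts, string Reason);

public static class PoisonHandling
{
    // Tries the handler up to maxDeliveries times. On final failure the message
    // is recorded with its failure reason instead of being retried forever.
    public static bool ProcessWithDeadLetter(
        string body, Action<string> handler, int maxDeliveries, List<DeadLetter> deadLetters)
    {
        if (maxDeliveries < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxDeliveries));
        }

        string reason = "unknown";
        for (int delivery = 1; delivery <= maxDeliveries; delivery++)
        {
            try
            {
                handler(body);
                return true;
            }
            catch (Exception ex)
            {
                reason = ex.Message; // keep the last failure for inspection
            }
        }

        deadLetters.Add(new DeadLetter(body, maxDeliveries, reason));
        return false;
    }
}
```

Because the failing message ends up in the dead-letter store with its reason attached, the consumer can move on to unrelated work and operators have what they need for repair and replay.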
Graceful Degradation
When a dependency is unavailable:
- Return cached or stale data with clear semantics.
- Disable an optional feature.
- Queue nonurgent work.
- Return 503 Service Unavailable with retry guidance.
- Preserve critical flows.
Fallback data must be safe. Returning an empty list or default authorization decision can be more harmful than failing explicitly.
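"Stale data with clear semantics" can mean returning the last known good value together with an explicit staleness flag, and failing explicitly when no safe value exists. The sketch below assumes a single cached value and no concurrent callers; StaleCacheFallback and CatalogResult are hypothetical names.

```csharp
using System;

public sealed record CatalogResult(string Value, bool IsStale);

public sealed class StaleCacheFallback
{
    private string lastGood = "";
    private bool hasLastGood;

    // Returns fresh data when the fetch succeeds. When it fails, returns the
    // last good value marked as stale. With nothing cached, the original
    // exception propagates rather than inventing an empty default.
    public CatalogResult Get(Func<string> fetch)
    {
        try
        {
            lastGood = fetch();
            hasLastGood = true;
            return new CatalogResult(lastGood, IsStale: false);
        }
        catch (Exception) when (hasLastGood)
        {
            return new CatalogResult(lastGood, IsStale: true);
        }
    }
}
```

The IsStale flag lets the caller decide whether stale data is acceptable for the operation at hand, for example showing a cached catalog page but refusing to price an order from it.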
Pattern Comparison
The four patterns often compose, but every additional mechanism needs a defined deadline and telemetry.
Observability
Measure:
- Logical calls and physical attempts.
- Retry count and final outcome.
- Circuit state and transition count.
- Bulkhead utilization, queue depth, and rejection rate.
- Broker queue depth and oldest-message age.
- Dependency latency and throttling.
- End-to-end business completion.
Alert on sustained user impact and exhaustion, not on every successful retry.
Testing
Test:
- Transient failure followed by recovery.
- Persistent outage.
- Slow responses near timeout.
- 429 with Retry-After.
- Duplicate unsafe requests.
- Circuit half-open under concurrency.
- Bulkhead saturation.
- Queue backlog and poison messages.
- Consumer autoscaling against a constrained database.
- Process restart and message redelivery.
Use fault injection and load tests. Unit tests of policy configuration alone do not reveal cascading behavior.
Common Mistakes
Common failures include:
- Retrying nontransient errors.
- Retrying non-idempotent operations without a key.
- Layering SDK, HTTP, service, and workflow retries.
- Omitting total deadlines.
- Using a global circuit breaker for unrelated resources.
- Treating an open breaker as dependency recovery.
- Adding an unbounded bulkhead queue.
- Scaling consumers beyond downstream capacity.
- Assuming a queue guarantees exactly-once processing.
- Returning unsafe fallback data.
- Alerting on every retry rather than final impact.
Best-Practice Design Process
- Classify failures and business deadlines.
- Make operations idempotent before retrying.
- Coordinate retries across layers with backoff and jitter.
- Add per-attempt and total timeouts.
- Scope circuit breakers to failure domains.
- Isolate critical and noncritical capacity with bulkheads.
- Use durable queues only when delayed processing is acceptable.
- Bound queue, concurrency, and consumer scale.
- Design fallback and rejection behavior.
- Measure attempts, saturation, backlog age, and user impact.
- Test cascading failure and recovery under load.