Overview
Horizontal scaling increases capacity by adding service instances rather than only making one instance larger. It works best when any healthy instance can process any request and when shared bottlenecks do not prevent throughput from increasing.
A stateless service does not keep durable request or user state solely in the memory of one process. Durable state belongs in databases, object storage, distributed caches, brokers, or workflow stores. Instances can then start, stop, fail, and receive traffic without losing business state.
Horizontal scaling is not unlimited. More web instances can overwhelm:
- A database.
- A shared cache.
- An external API.
- A connection pool.
- A broker partition.
- A lock or serialized resource.
Backpressure prevents fast producers from overwhelming slower consumers. It uses bounded queues, concurrency limits, throttling, flow control, admission control, or load shedding so work does not grow without limit.
This topic matters in interviews because candidates must explain capacity bottlenecks, statelessness, autoscaling delays, graceful scale-in, consistency, and overload behavior rather than assuming that adding instances automatically increases throughput.
Core Concepts
Scale Up and Scale Out
Vertical scaling
- Add CPU, memory, or faster storage to one node.
- Simple but limited by machine size.
- Can require restart or migration.
- Keeps a larger failure domain.
Horizontal scaling
- Add more instances.
- Improves elasticity and potential availability.
- Requires traffic distribution and shared-state design.
- Introduces coordination and distributed-system concerns.
Most systems combine both.
Scalability and Elasticity
Scalability is the ability to increase capacity as resources increase.
Elasticity is the ability to add and remove capacity in response to demand.
Ideal:
2x instances -> close to 2x useful throughput
Real systems lose efficiency because of:
- Shared bottlenecks.
- Coordination.
- Lock contention.
- Uneven partitions.
- Network overhead.
- Cache warmup.
- Serial work.
Measure useful business throughput, not only CPU.
Stateless Service Meaning
Stateless does not mean the application has no state. It means an instance does not uniquely own durable state required by later requests.
Instance-local state can safely include:
- Immutable configuration.
- Reconstructable caches.
- Connection pools.
- Temporary request data.
- Metrics buffers.
State that should be externalized:
- User sessions when continuity is required.
- Shopping carts.
- Workflow progress.
- Idempotency records.
- Job ownership.
- Uploaded files.
- Distributed locks and leases.
Loss of one instance should not lose committed business state.
Session Affinity
Sticky sessions route a user repeatedly to one instance.
Problems:
- Uneven load.
- Difficult scale-in.
- Lost session on failure.
- Reduced failover.
- Hot users cannot spread across instances.
- Deployment complexity.
Prefer:
- Secure self-contained authentication tickets where appropriate.
- Distributed session storage.
- Durable workflow stores.
- Client-visible resource IDs.
Affinity can be pragmatic for legacy systems, but it is a constraint and should not be mistaken for horizontal scalability.
Shared Cryptographic Keys
Instances must share keys needed to validate data created by other instances, such as ASP.NET Core Data Protection keys for authentication cookies.
If every instance uses machine-local keys:
instance A issues cookie
request reaches instance B
instance B cannot decrypt cookie
Persist and protect shared keys appropriately, rotate them, and separate environments.
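A minimal sketch of a shared key ring, assuming an ASP.NET Core application in which every instance mounts the same directory. The path /var/shared/dp-keys and the application name orders-api are hypothetical, and a production setup would normally also encrypt the keys at rest.

```csharp
using System.IO;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.DataProtection;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// Every instance reads and writes the same key ring, so a cookie issued by
// instance A can be decrypted by instance B.
builder.Services.AddDataProtection()
    // Hypothetical shared location (mounted volume or network share).
    .PersistKeysToFileSystem(new DirectoryInfo("/var/shared/dp-keys"))
    // Hypothetical name; it must be identical on every instance of this app
    // and different for other apps and environments.
    .SetApplicationName("orders-api");

var app = builder.Build();
app.Run();
```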
Load Balancing
A load balancer distributes requests using:
- Round robin.
- Least connections.
- Weighted routing.
- Latency or health.
- Consistent hashing.
Health probes should remove instances that cannot safely receive traffic. Traffic distribution must consider long-lived connections, uneven request cost, and zone or region capacity.
Bottleneck Analysis
Adding application instances does not help if the database is saturated.
Measure:
- CPU and memory.
- Thread or event-loop saturation.
- Connection pools.
- Database CPU, locks, I/O, and query latency.
- Cache throughput.
- External quotas.
- Queue partitions.
- Network bandwidth.
- Serialized critical sections.
Use load tests and production telemetry to identify the limiting resource.
Amdahl's Law Intuition
If part of a workload is serialized, it limits total speedup.
95% parallel work
5% serialized work
Even unlimited workers cannot eliminate the serialized portion, so the speedup is capped at 1 / 0.05 = 20x no matter how many instances are added.
Reduce coordination, partition data, remove locks, or redesign the invariant rather than adding instances indefinitely.
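The 95% / 5% split can be worked through with Amdahl's formula, S(N) = 1 / ((1 - p) + p / N). The helper below is only an illustration; the class and method names are made up for this sketch.

```csharp
using System;

static class Amdahl
{
    // Speedup with `workers` instances when `parallelFraction` (p) of the
    // work can run in parallel: S(N) = 1 / ((1 - p) + p / N).
    public static double Speedup(double parallelFraction, int workers)
    {
        if (parallelFraction < 0 || parallelFraction > 1)
            throw new ArgumentOutOfRangeException(nameof(parallelFraction));
        if (workers < 1)
            throw new ArgumentOutOfRangeException(nameof(workers));

        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / workers);
    }

    // Upper bound as the worker count grows without limit: 1 / (1 - p).
    // Returns infinity when p is exactly 1 (no serialized work at all).
    public static double MaxSpeedup(double parallelFraction) =>
        1.0 / (1.0 - parallelFraction);
}
```

With p = 0.95, Amdahl.Speedup(0.95, 16) is roughly 9.1 and Amdahl.MaxSpeedup(0.95) is about 20, so going from 16 instances to any larger number can at best roughly double throughput.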
Data Partitioning
Partition data and work by:
- Tenant.
- Customer.
- Aggregate ID.
- Geography.
- Time.
- Hash of a stable key.
Good partition keys:
- Distribute load.
- Preserve required locality.
- Avoid one hot partition.
- Support routing and rebalancing.
Skewed tenants or popular keys can create hotspots even when average capacity appears healthy.
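As a rough illustration of "hash of a stable key", the sketch below maps a key to a partition with 32-bit FNV-1a. The class and method names are hypothetical. A hand-written hash is used because string.GetHashCode() is randomized per process in modern .NET and therefore gives different answers on different instances.

```csharp
using System;
using System.Text;

static class Partitioner
{
    // FNV-1a (32-bit): simple and stable across processes and machines.
    public static uint StableHash(string key)
    {
        const uint offsetBasis = 2166136261;
        const uint prime = 16777619;

        uint hash = offsetBasis;
        foreach (byte b in Encoding.UTF8.GetBytes(key))
        {
            hash ^= b;
            hash = unchecked(hash * prime); // wrap-around is intended
        }
        return hash;
    }

    // Maps a key to a partition index in [0, partitionCount).
    public static int PartitionFor(string key, int partitionCount)
    {
        if (partitionCount <= 0)
            throw new ArgumentOutOfRangeException(nameof(partitionCount));
        return (int)(StableHash(key) % (uint)partitionCount);
    }
}
```

Plain modulo placement moves most keys when the partition count changes, which is why consistent hashing or an explicit partition map is often preferred when rebalancing matters. A stable hash also does nothing about skew: one very large tenant still lands on one partition.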
Autoscaling Signals
Possible signals:
- CPU or memory.
- Request concurrency.
- Requests per second.
- Latency.
- Queue depth.
- Oldest-message age.
- Custom business workload.
- Dependency saturation.
CPU alone can be misleading:
- I/O-bound services may saturate connections with low CPU.
- A downstream dependency may be overloaded.
- High CPU may be efficient work rather than distress.
Use signals tied to workload and service objectives.
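One way to tie a signal to workload is to size a worker pool from backlog and a target drain time. The sketch below is an assumption-laden example, not a recommended policy: the names, the drain-time objective, and the min/max bounds are all hypothetical, and the per-worker rate has to come from measurement.

```csharp
using System;

static class ScalingPolicy
{
    // Workers needed to drain `backlog` messages within `targetDrain`,
    // given a measured per-worker processing rate, clamped to [min, max].
    public static int DesiredWorkers(
        long backlog,
        double messagesPerSecondPerWorker,
        TimeSpan targetDrain,
        int min,
        int max)
    {
        if (messagesPerSecondPerWorker <= 0)
            throw new ArgumentOutOfRangeException(nameof(messagesPerSecondPerWorker));
        if (targetDrain <= TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(nameof(targetDrain));

        // How many messages one worker can process inside the drain window.
        double perWorkerCapacity = messagesPerSecondPerWorker * targetDrain.TotalSeconds;
        double needed = Math.Ceiling(backlog / perWorkerCapacity);

        // Clamp before converting so a huge backlog cannot overflow int.
        return (int)Math.Clamp(needed, min, max);
    }
}
```

For example, 12,000 queued messages, 25 messages per second per worker, and a 60-second drain target give 8 workers. The max bound matters as much as the formula: it is where downstream limits such as database connections are encoded.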
Autoscaling Delay
Scale-out is not immediate:
detect demand
-> make decision
-> provision instance
-> start process
-> load configuration
-> warm caches and connections
-> pass readiness
-> receive traffic
Keep headroom for sudden bursts and use queues, admission control, or pre-scaling for predictable events.
Scale-In and Graceful Shutdown
Before terminating an instance:
- Stop sending new work.
- Mark it unready.
- Drain in-flight requests.
- Stop receiving new queue messages.
- Complete or safely abandon leased work.
- Flush critical telemetry.
- Respect a bounded shutdown timeout.
Jobs need leases, visibility timeouts, or checkpoints so another worker can resume.
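The drain steps above can be sketched with an in-process channel standing in for a real queue. DrainingWorker is a hypothetical type for illustration; a hosted service or broker client would provide its own lifecycle hooks, and a real broker would redeliver unacknowledged messages through leases or visibility timeouts.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

sealed class DrainingWorker<T>
{
    private readonly ChannelReader<T> _reader;
    private readonly Func<T, Task> _handle;
    private readonly CancellationTokenSource _stopping = new();
    private Task? _run;

    public DrainingWorker(ChannelReader<T> reader, Func<T, Task> handle)
    {
        _reader = reader;
        _handle = handle;
    }

    public void Start() => _run = RunAsync();

    private async Task RunAsync()
    {
        try
        {
            while (await _reader.WaitToReadAsync(_stopping.Token))
            {
                // After the stop signal no new item is taken.
                while (!_stopping.IsCancellationRequested && _reader.TryRead(out var item))
                {
                    // The stop token is deliberately not passed here, so the
                    // item already in flight is allowed to finish.
                    await _handle(item);
                }
            }
        }
        catch (OperationCanceledException)
        {
            // Stop was requested; unread items stay in the queue.
        }
    }

    // Signals stop and waits for in-flight work, but never longer than `timeout`.
    // Returns false if the worker did not drain in time.
    public async Task<bool> StopAsync(TimeSpan timeout)
    {
        _stopping.Cancel();
        if (_run is null) return true;
        var finished = await Task.WhenAny(_run, Task.Delay(timeout));
        return finished == _run;
    }
}
```

A false result from StopAsync is the case the bounded shutdown timeout exists for: the process exits anyway, so the work must be safe to resume elsewhere.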
Readiness, Liveness, and Startup
Startup
- Has initialization completed?
Readiness
- Can this instance safely receive traffic now?
Liveness
- Is the process making progress, or should it restart?
Do not make liveness fail merely because a shared dependency is temporarily down; restarting every instance at once can amplify the incident. Use readiness, or a degraded mode, to reflect whether the instance can still handle useful traffic.
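A minimal sketch of separate liveness and readiness endpoints, assuming ASP.NET Core health checks. The endpoint paths, tag names, and the warmup flag are hypothetical choices for this example.

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Hypothetical signal that is completed once warmup work has finished.
var warmup = new TaskCompletionSource();

builder.Services.AddHealthChecks()
    // Liveness: only "is this process responsive?". No dependency checks.
    .AddCheck("self", () => HealthCheckResult.Healthy(), tags: new[] { "live" })
    // Readiness: "can this instance take traffic right now?".
    .AddCheck("warmup",
        () => warmup.Task.IsCompleted
            ? HealthCheckResult.Healthy()
            : HealthCheckResult.Unhealthy("still warming up"),
        tags: new[] { "ready" });

var app = builder.Build();

app.MapHealthChecks("/healthz/live", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("live")
});
app.MapHealthChecks("/healthz/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready")
});

// Placeholder: a real service would complete this after caches and
// connections have been warmed.
warmup.SetResult();
app.Run();
```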
Backpressure
Backpressure communicates or enforces that producers must slow down because downstream capacity is constrained.
Mechanisms:
- Bounded channel.
- Semaphore or concurrency limiter.
- Broker prefetch limits.
- TCP flow control.
- Reactive demand signals.
- HTTP 429 Too Many Requests.
- Queue capacity.
- Consumer pause.
- Admission control.
Without backpressure, work accumulates in memory, threads, sockets, queues, or databases until latency and failure cascade.
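As one example of the "semaphore or concurrency limiter" mechanism, the sketch below admits work only while a slot is free and otherwise fails fast. ConcurrencyGate is a hypothetical helper, not a framework type.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

sealed class ConcurrencyGate
{
    private readonly SemaphoreSlim _slots;

    public ConcurrencyGate(int maxConcurrent) =>
        _slots = new SemaphoreSlim(maxConcurrent, maxConcurrent);

    // Runs `work` if a slot is free. Otherwise returns false immediately so
    // the caller can send an explicit overload response instead of queueing.
    public async Task<bool> TryRunAsync(Func<Task> work)
    {
        // Wait(0) never blocks: it takes a slot only if one is available now.
        if (!_slots.Wait(0)) return false;
        try
        {
            await work();
            return true;
        }
        finally
        {
            _slots.Release();
        }
    }
}
```

Failing fast keeps the limit visible to callers. Waiting for a slot is also a valid choice, but then the wait itself needs a bound, or the waiters become an unbounded queue.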
Bounded Channels in .NET
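using System.Threading.Channels;

// WorkItem stands for the application's own message type.
var channel = Channel.CreateBounded<WorkItem>(
    new BoundedChannelOptions(capacity: 1_000)
    {
        // When the channel is full, WriteAsync waits and TryWrite returns false.
        FullMode = BoundedChannelFullMode.Wait,
        // Multiple concurrent consumers and producers are expected.
        SingleReader = false,
        SingleWriter = false
    });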
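BoundedChannelFullMode choices:
- Wait: WriteAsync waits for space, which slows producers; TryWrite returns false, so the caller can reject new work.
- DropWrite: drop the item being written.
- DropNewest: drop the most recently queued item to make room.
- DropOldest: drop the oldest queued item to make room.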
Choose from business semantics. Dropping a payment request is different from dropping an outdated telemetry sample.
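The difference between rejecting and slowing a producer can be shown with the two write methods on a Wait-mode channel. BoundedQueue is a hypothetical wrapper used only to name the two behaviors.

```csharp
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

static class BoundedQueue
{
    public static Channel<T> Create<T>(int capacity) =>
        Channel.CreateBounded<T>(new BoundedChannelOptions(capacity)
        {
            FullMode = BoundedChannelFullMode.Wait
        });

    // Reject: returns false at once when the channel is full.
    public static bool TryEnqueue<T>(ChannelWriter<T> writer, T item) =>
        writer.TryWrite(item);

    // Slow down: completes only after a reader frees space,
    // or fails if the token is cancelled first.
    public static ValueTask EnqueueAsync<T>(
        ChannelWriter<T> writer, T item, CancellationToken ct) =>
        writer.WriteAsync(item, ct);
}
```

An HTTP endpoint would typically use the rejecting form and translate false into an overload response, while an internal pipeline stage would more often await the slow-down form so pressure propagates upstream.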
Load Shedding
Load shedding rejects lower-value work to preserve critical functions.
Examples:
- Reject optional recommendations.
- Reduce expensive response detail.
- Disable exports.
- Return cached data.
- Reject anonymous traffic before authenticated traffic.
- Sample low-value telemetry.
Use explicit priority and fairness rules. Avoid silently dropping committed business work.
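An explicit priority rule can be as small as a table of thresholds. The sketch below is one possible policy under made-up assumptions: the three priority classes and the 0.75 / 0.90 / 1.00 cut-offs are hypothetical, and "utilization" stands for whichever capacity measure the service trusts.

```csharp
enum RequestPriority
{
    Optional = 0,  // e.g. recommendations, exports
    Normal = 1,
    Critical = 2   // e.g. checkout, authentication
}

static class LoadShedder
{
    // Sheds the least important class first as utilization (0.0 to 1.0) rises.
    public static bool ShouldAdmit(RequestPriority priority, double utilization) =>
        priority switch
        {
            RequestPriority.Critical => utilization < 1.00,
            RequestPriority.Normal => utilization < 0.90,
            _ => utilization < 0.75
        };
}
```

This check belongs at admission time, before the request is accepted. Work that has already been committed needs a different path (retry, expiry, or compensation) rather than a silent drop.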
Throttling
Throttling limits work by:
- User.
- Tenant.
- API key.
- IP or network.
- Endpoint.
- Workload cost.
- Global capacity.
Return:
HTTP/1.1 429 Too Many Requests
Retry-After: 10
Clients should back off with jitter. Server limits must prevent one noisy tenant from consuming all capacity.
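A per-tenant token bucket is one common way to keep a noisy tenant inside its share. The sketch below is a simplified, in-memory version with hypothetical names; the injectable clock exists only to make it testable. .NET also ships rate limiters in System.Threading.RateLimiting, which may be preferable to hand-written code.

```csharp
using System;
using System.Collections.Concurrent;

sealed class TenantTokenBuckets
{
    private sealed class Bucket
    {
        public double Tokens;
        public DateTimeOffset LastRefill;
    }

    private readonly ConcurrentDictionary<string, Bucket> _buckets = new();
    private readonly double _capacity;
    private readonly double _refillPerSecond;
    private readonly Func<DateTimeOffset> _clock;

    public TenantTokenBuckets(
        double capacity, double refillPerSecond, Func<DateTimeOffset>? clock = null)
    {
        if (capacity < 1) throw new ArgumentOutOfRangeException(nameof(capacity));
        if (refillPerSecond <= 0) throw new ArgumentOutOfRangeException(nameof(refillPerSecond));
        _capacity = capacity;
        _refillPerSecond = refillPerSecond;
        _clock = clock ?? (() => DateTimeOffset.UtcNow);
    }

    // Returns true if the tenant may proceed. Otherwise `retryAfter` estimates
    // when one token will be available (usable for a Retry-After header).
    public bool TryAcquire(string tenantId, out TimeSpan retryAfter)
    {
        var now = _clock();
        var bucket = _buckets.GetOrAdd(
            tenantId, _ => new Bucket { Tokens = _capacity, LastRefill = now });

        lock (bucket)
        {
            // Refill in proportion to elapsed time, never above capacity.
            double elapsed = (now - bucket.LastRefill).TotalSeconds;
            if (elapsed > 0)
            {
                bucket.Tokens = Math.Min(_capacity, bucket.Tokens + elapsed * _refillPerSecond);
                bucket.LastRefill = now;
            }

            if (bucket.Tokens >= 1)
            {
                bucket.Tokens -= 1;
                retryAfter = TimeSpan.Zero;
                return true;
            }

            retryAfter = TimeSpan.FromSeconds((1 - bucket.Tokens) / _refillPerSecond);
            return false;
        }
    }
}
```

Because the buckets live in process memory, the effective limit is per instance and grows as instances are added. A limit that must hold globally needs a shared store or a gateway in front of the instances.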
Backpressure Versus Buffering
Buffering absorbs short bursts. Backpressure prevents unbounded accumulation.
bounded buffer has space -> accept
buffer near capacity -> slow or reject
buffer full -> explicit overload behavior
An unbounded queue is not backpressure.
Queue Backlog and Staleness
For time-sensitive work, old messages may no longer be useful.
Define:
- Business deadline.
- Maximum queue age.
- Cancellation.
- Expiration.
- Priority.
- What to do with stale work.
Processing obsolete work wastes capacity and delays current work.
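A maximum queue age can be enforced with a check before processing. The sketch assumes each message carries its enqueue time; QueuedMessage and Staleness are hypothetical names.

```csharp
using System;

readonly record struct QueuedMessage(string Id, DateTimeOffset EnqueuedAt);

static class Staleness
{
    // True when the message has waited longer than the allowed maximum age.
    public static bool IsStale(QueuedMessage message, DateTimeOffset now, TimeSpan maxAge) =>
        now - message.EnqueuedAt > maxAge;
}
```

A consumer would call IsStale before doing any expensive work and then apply whatever the policy says for stale items, such as expiring, dead-lettering, or counting them.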
Fan-Out and Downstream Multiplication
One request can generate many downstream calls:
1 request
-> 20 parallel service calls
-> each performs 5 database queries
Horizontal scaling at the edge multiplies downstream pressure.
Bound fan-out, batch requests, cache appropriately, and include downstream cost in admission decisions.
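Bounding fan-out can be sketched with Parallel.ForEachAsync (available since .NET 6), which caps how many downstream calls are in flight at once. The helper name and the fetch delegate are hypothetical stand-ins for a real downstream client.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

static class BoundedFanOut
{
    // Calls `fetch` once per id with at most `maxParallel` calls in flight.
    public static async Task<IReadOnlyDictionary<int, string>> FetchAllAsync(
        IEnumerable<int> ids,
        Func<int, CancellationToken, Task<string>> fetch,
        int maxParallel,
        CancellationToken ct = default)
    {
        var results = new ConcurrentDictionary<int, string>();
        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = maxParallel,
            CancellationToken = ct
        };

        await Parallel.ForEachAsync(ids, options, async (id, token) =>
        {
            results[id] = await fetch(id, token);
        });

        return results;
    }
}
```

The cap applies per request. With many concurrent requests the downstream still sees requests times maxParallel, so a shared limiter or batching is usually needed as well.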
Stateless Background Workers
Workers can scale horizontally when work ownership is external:
- Queue lease.
- Broker lock.
- Partition assignment.
- Database job claim.
Workers need:
- Idempotent processing.
- Checkpoints.
- Visibility timeout renewal.
- Graceful shutdown.
- Poison-message handling.
- Per-key ordering where required.
Distributed Coordination
Avoid global locks when possible. Prefer:
- Partition ownership.
- Optimistic concurrency.
- Idempotency.
- Commutative updates.
- Leases with expiry.
- Single-writer per key.
Distributed locks add failure and timeout semantics. A lock holder can pause or lose connectivity, so fencing tokens may be required to stop stale owners from writing.
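The fencing idea can be sketched from the resource's side: it remembers the highest token it has seen and refuses anything older. FencedResource is a hypothetical in-memory stand-in for a database row or storage object, and the sketch assumes the lock service hands out strictly increasing tokens.

```csharp
sealed class FencedResource
{
    private readonly object _sync = new();
    private long _highestToken = -1;

    public string? Value { get; private set; }

    // Accepts the write only if `fencingToken` is not older than the newest
    // token already seen; a stale lock holder is rejected.
    public bool TryWrite(long fencingToken, string value)
    {
        lock (_sync)
        {
            if (fencingToken < _highestToken) return false;
            _highestToken = fencingToken;
            Value = value;
            return true;
        }
    }
}
```

If a holder with token 33 pauses, its lease expires, and a new holder writes with token 34, the late write carrying 33 is refused instead of overwriting newer data.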
Deployment and Warmup
New instances can be slower because:
- JIT compilation.
- Empty caches.
- New connections.
- Model loading.
- DNS and certificate work.
Use readiness gates, controlled ramp-up, minimum replicas, prewarming, and connection reuse. Do not route full production traffic before initialization completes.
State and Multi-Region Scaling
Stateless compute can run in multiple regions, but state consistency remains difficult.
Decide:
- Active-active or active-passive.
- Data ownership.
- Replication delay.
- Conflict resolution.
- Session routing.
- Regional failover.
- Recovery point and recovery time objectives.
Compute scaling cannot make a single-region database globally available.
Observability
Measure:
- Throughput per instance and globally.
- Scaling efficiency.
- Instance count and startup time.
- CPU, memory, thread pool, connections.
- Request concurrency and latency.
- Rejection and throttling rates.
- Queue depth and oldest age.
- Partition skew.
- Database saturation.
- Graceful shutdown failures.
Correlate scaling decisions with user outcomes and downstream load.
Testing
Test:
- Sudden burst and sustained overload.
- Scale-out lag.
- Cache-cold new instances.
- Uneven tenant load.
- Dependency bottleneck.
- Queue full behavior.
- Instance termination with in-flight work.
- Zone failure.
- Scale-in during long jobs.
- Recovery after load drops.
Capacity tests should find the knee where latency rises sharply before catastrophic failure.
Common Mistakes
Common failures include:
- Calling a service stateless while keeping sessions in memory.
- Depending on sticky sessions.
- Adding web servers while the database is saturated.
- Scaling consumers only on queue depth, ignoring oldest-message age and downstream capacity.
- Using unbounded in-memory queues.
- Ignoring startup and scale-out delay.
- Failing liveness on every dependency outage.
- Dropping committed work without defined overload semantics.
- Allowing one tenant to consume global capacity.
- Terminating workers without drain or leases.
- Using distributed locks without expiry and fencing.
- Measuring instance count instead of useful throughput.
Best-Practice Design Process
- Define workload, SLOs, and capacity units.
- Externalize durable and session state.
- Ensure any instance can handle any request.
- Identify shared bottlenecks and partition where needed.
- Select autoscaling signals tied to workload.
- Maintain headroom for scaling delay.
- Bound concurrency, fan-out, and queues.
- Define throttling, fairness, and load shedding.
- Implement readiness and graceful drain.
- Test bursts, hotspots, dependency limits, and scale-in.
- Measure scaling efficiency and user impact.