
Capacity Planning and Identifying Likely Bottlenecks

Overview

Capacity planning is the process of estimating the resources a system needs to meet expected demand while satisfying performance, reliability, and cost targets. It helps answer questions such as:

  • How many users can the system support?
  • How many requests per second should the API handle?
  • How much CPU, memory, storage, network bandwidth, and database capacity are needed?
  • What happens during a traffic spike?
  • Which component is likely to become the bottleneck first?
  • When should the system scale up, scale out, cache, partition, queue, or redesign?
  • How much capacity is enough without wasting money?

Identifying bottlenecks means finding the component or resource that limits overall system throughput, latency, availability, or growth. A bottleneck can be a database, CPU, memory, network bandwidth, disk I/O, connection pool, thread pool, lock, external API, queue consumer, cache, storage account, file system, rate limit, or even a manual operational process.

Capacity planning matters because a system can be functionally correct but still fail when real traffic arrives. A login API may work perfectly for 10 users but fail when 10,000 users sign in at 9:00 AM. A checkout service may pass all unit tests but slow down because the database cannot handle concurrent writes. A reporting worker may work in development but fall behind when thousands of report jobs are queued.

Capacity planning is used in:

  • System design interviews.
  • Cloud architecture.
  • Web API design.
  • Database design.
  • Microservices and distributed systems.
  • E-commerce platforms.
  • SaaS applications.
  • Background job processing.
  • Data pipelines.
  • File upload/download systems.
  • Real-time messaging systems.
  • Incident prevention.
  • Cost optimization.
  • Launch readiness.
  • Seasonal traffic planning.
  • Disaster recovery planning.

This topic is important for interviews because capacity planning shows whether a candidate can think beyond features. Interviewers expect candidates to reason about scale, traffic patterns, latency, throughput, storage growth, bottlenecks, trade-offs, and observability. A strong candidate should be able to make rough estimates, identify likely bottlenecks, propose mitigation strategies, and explain how to validate assumptions with testing and monitoring.

A strong interview answer does not need perfect math. It needs a structured approach:

  1. Clarify requirements and traffic assumptions.
  2. Estimate demand.
  3. Translate demand into resource needs.
  4. Identify likely bottlenecks.
  5. Propose scaling and optimization strategies.
  6. Validate with load tests and production monitoring.
  7. Revisit the plan as traffic and architecture change.

Capacity planning is not a one-time spreadsheet. It is an ongoing process. Assumptions change, usage patterns change, features change, user behavior changes, dependencies change, and cloud limits change. Good systems are designed so capacity can be measured, adjusted, and improved continuously.

Core Concepts

What Capacity Planning Means

Capacity planning means predicting how much capacity a workload needs to meet its performance and reliability goals.

Capacity can include:

  • Compute capacity.
  • CPU cores.
  • Memory.
  • Thread pool capacity.
  • Database CPU.
  • Database connections.
  • Database IOPS.
  • Storage capacity.
  • Storage throughput.
  • Network bandwidth.
  • Cache memory.
  • Queue throughput.
  • Background worker concurrency.
  • External API rate limits.
  • Message broker partitions.
  • CDN capacity.
  • Human operational capacity.

A simple definition:

Code
Capacity planning = expected demand + performance target + resource model + safety margin + validation.

Example:

Code
Requirement:
The order API must handle 2,000 requests per second at p95 latency under 500 ms.

Capacity planning questions:
- How much CPU is needed per request?
- How many app instances are needed?
- How many database writes per second are required?
- How many database connections are needed?
- What is the peak-to-average traffic ratio?
- What happens if one instance fails?
- How much spare capacity is required?
- Which resource saturates first?

The goal is not only to avoid underprovisioning. It is also to avoid overprovisioning. Too little capacity causes latency, errors, outages, and failed launches. Too much capacity increases cloud cost and operational waste.

What a Bottleneck Is

A bottleneck is the limiting component that restricts system performance or scalability.

Example:

Code
The API servers can handle 10,000 requests per second.
The database can handle 2,000 writes per second.
The system performs one database write per request.

Likely bottleneck:
The database write capacity.

A bottleneck can appear in different forms:

| Bottleneck Type | Example |
| --- | --- |
| CPU | Application instances reach 95% CPU |
| Memory | Garbage collection increases latency |
| Disk I/O | Database writes wait on storage |
| Network | Large file downloads saturate bandwidth |
| Database locks | Concurrent updates block each other |
| Connection pool | Requests wait for database connections |
| Thread pool | Blocking calls consume worker threads |
| External API | Payment provider allows only 100 RPS |
| Queue consumers | Messages arrive faster than workers process them |
| Cache | Cache memory limit causes evictions |
| Rate limit | Cloud service throttles requests |
| Serialization | Large JSON payloads consume CPU |
| Hot partition | One tenant or key receives most traffic |
| Manual process | Human approval cannot keep up with volume |

A bottleneck is not always bad. Every system has a limiting factor. The problem is when the bottleneck prevents the system from meeting requirements or causes unacceptable cost, latency, or failure risk.
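
The API-versus-database example above can be sketched by converting each component's capacity into the request rate it can sustain and taking the minimum. This is a minimal sketch: the helper `FindBottleneck` and the component names are illustrative assumptions, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: each component has a raw capacity (operations/second)
// and a cost (operations consumed per API request). The component that
// sustains the fewest requests/second is the first limiting resource.
static (string Name, decimal MaxRequestsPerSecond) FindBottleneck(
    IEnumerable<(string Name, decimal OpsPerSecond, decimal OpsPerRequest)> components)
    => components
        .Select(c => (c.Name, MaxRequestsPerSecond: c.OpsPerSecond / c.OpsPerRequest))
        .OrderBy(c => c.MaxRequestsPerSecond)
        .First();

// Figures taken from the example above: one database write per request.
var candidates = new[]
{
    (Name: "API servers", OpsPerSecond: 10_000m, OpsPerRequest: 1m),
    (Name: "Database writes", OpsPerSecond: 2_000m, OpsPerRequest: 1m),
};

var limiting = FindBottleneck(candidates);
Console.WriteLine($"{limiting.Name} caps the system at {limiting.MaxRequestsPerSecond} requests/second");
```

If a request performed two writes instead of one, the same calculation would halve the sustainable request rate, which is why work per request matters as much as raw capacity.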

Capacity Planning vs Performance Optimization

Capacity planning and performance optimization are related but different.

| Concept | Focus | Example |
| --- | --- | --- |
| Capacity planning | How much resource is needed for expected demand | Need 6 API instances for launch traffic |
| Performance optimization | Make the system use resources more efficiently | Reduce database query from 500 ms to 50 ms |
| Scalability planning | How the system grows as demand increases | Add instances, shard database, partition queue |
| Bottleneck analysis | Find what limits performance | Database CPU reaches 90% before app CPU |
| Cost optimization | Meet demand at acceptable cost | Use autoscaling and reserved capacity |

Capacity planning asks:

Code
How much do we need?

Performance optimization asks:

Code
How can we do the same work with less time or fewer resources?

Both are needed. If a query is inefficient, simply adding more database capacity may be expensive and temporary. If a system is well optimized but traffic grows 20x, it still needs capacity planning.

Capacity Planning Inputs

Good capacity planning starts with inputs.

Important inputs include:

  • Number of users.
  • Active users.
  • Concurrent users.
  • Requests per second.
  • Reads per second.
  • Writes per second.
  • Peak traffic.
  • Average traffic.
  • Traffic growth rate.
  • Data size.
  • Data growth rate.
  • Request payload size.
  • Response payload size.
  • File size.
  • Job arrival rate.
  • Job processing time.
  • Latency targets.
  • Throughput targets.
  • Availability targets.
  • Consistency requirements.
  • Retention requirements.
  • Backup requirements.
  • External dependency limits.
  • Cloud service quotas.
  • Deployment topology.
  • Regional requirements.
  • Cost constraints.

Example input table:

Code
Registered users: 1,000,000
Daily active users: 100,000
Peak concurrent users: 10,000
Average API requests per active user per day: 50
Peak-to-average multiplier: 5x
Read/write ratio: 90/10
Average response size: 20 KB
Average request size: 2 KB
Target API p95 latency: 300 ms
Availability target: 99.9%

These inputs are rarely perfect. In interviews, make reasonable assumptions and state them clearly.

Demand Forecasting

Demand forecasting estimates future workload.

Common demand signals:

  • Historical traffic.
  • Product launch estimates.
  • Marketing campaign plans.
  • Seasonal patterns.
  • Business growth projections.
  • Customer onboarding plans.
  • Similar product benchmarks.
  • Market research.
  • Pilot data.
  • Load test results.
  • Sales forecasts.
  • User behavior analytics.

Existing systems can use historical data. New systems often require assumptions, prototypes, market estimates, and stakeholder input.

Example:

Code
Current traffic:
500 RPS average
2,500 RPS peak

Expected campaign:
3x normal peak

Planned capacity target:
2,500 * 3 = 7,500 RPS

Add safety margin:
7,500 * 1.3 = 9,750 RPS

The safety margin protects against forecasting error, uneven traffic, node failures, and unexpected usage patterns.
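
The campaign estimate above can be expressed as a small function so the multiplier and margin are easy to vary. This is a sketch under the example's assumptions; `PlannedCapacityRps` is a hypothetical name.

```csharp
using System;

// Planned capacity = current peak x campaign multiplier x (1 + safety margin).
// decimal is used so percentages such as 0.30 do not pick up binary rounding error.
static decimal PlannedCapacityRps(decimal currentPeakRps, decimal campaignMultiplier, decimal safetyMargin)
    => currentPeakRps * campaignMultiplier * (1 + safetyMargin);

// Inputs from the example: 2,500 RPS peak, 3x campaign, 30% margin.
decimal plannedRps = PlannedCapacityRps(currentPeakRps: 2_500m, campaignMultiplier: 3m, safetyMargin: 0.30m);
Console.WriteLine($"Plan for about {plannedRps:0} RPS");
```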

Peak vs Average Load

Average load can be misleading. Systems usually fail during peak load.

Example:

Code
Daily requests: 86,400,000
Seconds per day: 86,400

Average RPS:
86,400,000 / 86,400 = 1,000 RPS

But traffic may not be evenly distributed. If 40% of daily traffic occurs in a 2-hour window:

Code
Peak-window requests:
86,400,000 * 0.40 = 34,560,000

Peak-window seconds:
2 * 60 * 60 = 7,200

Peak-window average:
34,560,000 / 7,200 = 4,800 RPS

If there are minute-level bursts, actual peak may be even higher.

In interviews, always ask about:

  • Daily traffic.
  • Peak hour traffic.
  • Peak minute traffic.
  • Burst behavior.
  • Seasonal events.
  • Launch traffic.
  • Regional traffic patterns.

Capacity should be planned for expected peak plus safety margin, not only daily average.
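
The peak-window arithmetic above can be sketched as follows. The function name `PeakWindowRps` is an assumption for illustration; the inputs are the ones used in the example.

```csharp
using System;

// Average RPS inside a window that carries a given share of the day's traffic.
static decimal PeakWindowRps(decimal dailyRequests, decimal shareOfDailyTraffic, decimal windowHours)
    => dailyRequests * shareOfDailyTraffic / (windowHours * 3_600m);

decimal dailyAverageRps = 86_400_000m / 86_400m;
decimal peakWindowRps = PeakWindowRps(86_400_000m, 0.40m, 2m);

Console.WriteLine($"Daily average: {dailyAverageRps:0} RPS");
Console.WriteLine($"Peak window:   {peakWindowRps:0} RPS");
Console.WriteLine($"Peak-to-average ratio: {peakWindowRps / dailyAverageRps:0.0}x");
```

The ratio is still an average over two hours, so minute-level bursts inside the window need their own multiplier.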

Concurrency vs Throughput

Concurrency and throughput are related but not the same.

Throughput is work completed per unit of time.

Code
Requests per second
Messages per second
Transactions per second
Files per hour
Jobs per minute

Concurrency is how many operations are in progress at the same time.

Code
Concurrent users
Concurrent HTTP requests
Concurrent database queries
Concurrent file uploads
Concurrent background jobs

A useful relationship is:

Code
Concurrency ≈ Throughput × Average response time

Example:

Code
Throughput: 1,000 requests/second
Average response time: 200 ms = 0.2 seconds

Estimated concurrent in-flight requests:
1,000 * 0.2 = 200 concurrent requests

If response time increases to 2 seconds:

Code
1,000 * 2 = 2,000 concurrent requests

This shows why latency affects capacity. Slow dependencies increase concurrency, which increases memory, connection usage, thread pressure, and queue length.

Little's Law

Little's Law is a useful mental model for capacity planning.

Code
L = λ × W

Where:

Code
L = average number of items in the system
λ = arrival rate
W = average time an item spends in the system

Example for a queue:

Code
Arrival rate: 100 jobs/second
Average processing time: 2 seconds

Average jobs in processing:
100 * 2 = 200 jobs

If processing time grows to 10 seconds:

Code
100 * 10 = 1,000 jobs

This means slow processing increases the number of in-flight jobs, which may require more memory, more workers, more queue capacity, or a redesign.

In interviews, Little's Law helps explain why improving latency can reduce required capacity.
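
As a quick check of the figures above, Little's Law can be applied directly. This is a minimal sketch; `InFlight` is a hypothetical helper name.

```csharp
using System;

// Little's Law: L = λ × W
// (average items in the system = arrival rate × average time in the system).
static decimal InFlight(decimal arrivalRatePerSecond, decimal secondsInSystem)
    => arrivalRatePerSecond * secondsInSystem;

decimal fastApi = InFlight(1_000m, 0.2m);  // 1,000 req/s at 200 ms
decimal slowApi = InFlight(1_000m, 2m);    // same load at 2 s
decimal slowJobs = InFlight(100m, 10m);    // 100 jobs/s at 10 s each

Console.WriteLine($"API at 200 ms: {fastApi:0} in-flight requests");
Console.WriteLine($"API at 2 s:    {slowApi:0} in-flight requests");
Console.WriteLine($"Jobs at 10 s:  {slowJobs:0} in-flight jobs");
```

The arrival rate is identical in the first two cases; only latency changed, and the in-flight count grew tenfold.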

Basic Capacity Estimation Workflow

A practical capacity estimation workflow:

Code
1. Define target workload.
2. Estimate request rate and data volume.
3. Break requests into operations.
4. Estimate resource cost per operation.
5. Multiply by peak demand.
6. Add safety margin.
7. Identify the first limiting resource.
8. Propose mitigation.
9. Validate with load testing.
10. Monitor in production and adjust.

Example:

Code
Target:
5,000 peak RPS for product search.

Per request:
1 cache lookup
If cache miss: 1 search index query
Average response size: 30 KB
Cache hit rate: 90%

Estimated cache RPS:
5,000

Estimated search index RPS:
5,000 * 10% = 500

Estimated outbound bandwidth:
5,000 * 30 KB = 150,000 KB/s ≈ 150 MB/s

Likely bottlenecks:

Code
Cache throughput
Search index query latency
Network egress
API CPU for JSON serialization

This kind of rough math is often enough to guide system design discussions.
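
The product search estimate can be sketched in code so the cache hit rate is easy to vary. `EstimateSearchLoad` is a hypothetical helper, and the sketch assumes decimal units (1 MB = 1,000 KB) as the example does.

```csharp
using System;

// Returns the load that reaches the search index and the outbound bandwidth.
static (decimal IndexRps, decimal EgressMBps) EstimateSearchLoad(
    decimal peakRps, decimal cacheHitRate, decimal responseKb)
{
    decimal indexRps = peakRps * (1 - cacheHitRate);  // only cache misses reach the index
    decimal egressMBps = peakRps * responseKb / 1_000m;
    return (indexRps, egressMBps);
}

var atNinety = EstimateSearchLoad(5_000m, 0.90m, 30m);
var atEighty = EstimateSearchLoad(5_000m, 0.80m, 30m);

Console.WriteLine($"90% hit rate: {atNinety.IndexRps:0} index RPS, {atNinety.EgressMBps:0} MB/s egress");
Console.WriteLine($"80% hit rate: {atEighty.IndexRps:0} index RPS, {atEighty.EgressMBps:0} MB/s egress");
```

Dropping the hit rate from 90% to 80% doubles the load on the search index while egress stays the same, so the cache hit rate is worth monitoring as a capacity input in its own right.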

Back-of-the-Envelope Estimation

Back-of-the-envelope estimation is a quick approximation used in system design.

Example: estimating storage for messages

Code
Users: 1,000,000
Daily active users: 200,000
Messages per active user per day: 20
Average message size: 1 KB
Metadata overhead: 1 KB

Daily messages:
200,000 * 20 = 4,000,000 messages/day

Average stored size per message:
1 KB + 1 KB = 2 KB

Daily storage:
4,000,000 * 2 KB = 8,000,000 KB ≈ 8 GB/day

Yearly storage:
8 GB * 365 ≈ 2.9 TB/year

Add replication, indexes, backups, and overhead:

Code
2.9 TB raw * 3 replication * 1.5 index/metadata overhead ≈ 13 TB/year

The exact number may be wrong, but the estimate reveals important design implications.
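
The same storage estimate can be written as a function. This is a sketch with a hypothetical helper name, using decimal units (1 TB = 1,000,000,000 KB); it keeps full precision, so it yields about 13.1 TB where the rounded hand calculation gives 13 TB.

```csharp
using System;

// Yearly physical storage in TB for a message workload.
static decimal YearlyStorageTb(
    decimal dailyActiveUsers, decimal messagesPerUserPerDay, decimal kbPerMessage,
    decimal replicationFactor, decimal overheadFactor)
{
    decimal dailyKb = dailyActiveUsers * messagesPerUserPerDay * kbPerMessage;
    decimal yearlyRawTb = dailyKb * 365m / 1_000_000_000m;
    return yearlyRawTb * replicationFactor * overheadFactor;
}

// 200,000 DAU, 20 messages each, 2 KB stored per message, 3 replicas, 1.5x index overhead.
decimal yearlyTb = YearlyStorageTb(200_000m, 20m, 2m, 3m, 1.5m);
Console.WriteLine($"Plan for roughly {yearlyTb:0.0} TB per year");
```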

Estimating API Capacity

Example: API instance capacity

Code
Load test result:
One API instance handles 500 RPS at p95 latency under 300 ms.

Target peak:
4,000 RPS

Base instance count:
4,000 / 500 = 8 instances

Add 30% safety margin:
8 * 1.3 = 10.4

Rounded:
11 instances

N+1 failure tolerance:
12 instances

This estimate assumes the bottleneck is API compute. If the database saturates at 2,000 RPS, adding more API instances will not solve the problem.

Always compare instance capacity with downstream capacity.
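
A sketch of the instance calculation, including the downstream check, might look like this. `RequiredInstances` is a hypothetical helper; the 2,000 RPS database ceiling is the figure from the paragraph above.

```csharp
using System;

// Instances needed for a peak load, rounded up, plus spare instances for failure tolerance.
static int RequiredInstances(decimal peakRps, decimal rpsPerInstance, decimal safetyMargin, int spareInstances)
{
    decimal withMargin = peakRps / rpsPerInstance * (1 + safetyMargin);
    return (int)Math.Ceiling(withMargin) + spareInstances;
}

int instanceCount = RequiredInstances(peakRps: 4_000m, rpsPerInstance: 500m, safetyMargin: 0.30m, spareInstances: 1);

// The API tier is only useful up to what the shared database can absorb.
decimal apiTierRps = instanceCount * 500m;
decimal databaseCeilingRps = 2_000m;
decimal effectiveRps = Math.Min(apiTierRps, databaseCeilingRps);

Console.WriteLine($"Instances: {instanceCount}");
Console.WriteLine($"API tier capacity: {apiTierRps:0} RPS, effective capacity: {effectiveRps:0} RPS");
```

With these numbers the API tier could serve three times what the database allows, which signals that the next investment belongs in the data tier and not in more instances.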

Estimating Database Capacity

Database capacity often becomes the bottleneck because many API instances share one database.

Inputs:

Code
Peak API RPS: 3,000
Read/write ratio: 80/20
Queries per read request: 2
Queries per write request: 4

Estimate:

Code
Read requests:
3,000 * 80% = 2,400 RPS

Write requests:
3,000 * 20% = 600 RPS

Read queries:
2,400 * 2 = 4,800 queries/sec

Write queries:
600 * 4 = 2,400 queries/sec

Total database operations:
7,200 operations/sec
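
The query fan-out above can be sketched as a function so the read/write mix and queries per request can be varied. `DatabaseLoad` is a hypothetical name.

```csharp
using System;

// Splits API traffic into read and write requests, then multiplies by queries per request.
static (decimal ReadQps, decimal WriteQps) DatabaseLoad(
    decimal peakRps, decimal readShare, decimal queriesPerRead, decimal queriesPerWrite)
{
    decimal readQps = peakRps * readShare * queriesPerRead;
    decimal writeQps = peakRps * (1 - readShare) * queriesPerWrite;
    return (readQps, writeQps);
}

var load = DatabaseLoad(peakRps: 3_000m, readShare: 0.80m, queriesPerRead: 2m, queriesPerWrite: 4m);
decimal total = load.ReadQps + load.WriteQps;

Console.WriteLine($"Reads: {load.ReadQps:0}/s, writes: {load.WriteQps:0}/s, total: {total:0}/s");
Console.WriteLine($"Write share of database operations: {load.WriteQps / total:P0}");
```

Writes are 20% of API requests but a third of database operations here, and writes are usually the harder part to scale because read replicas and caches do not absorb them.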

Likely database bottlenecks:

  • CPU.
  • IOPS.
  • Lock contention.
  • Slow queries.
  • Missing indexes.
  • Connection pool exhaustion.
  • Transaction log throughput.
  • Hot rows.
  • Hot partitions.
  • Replication lag.
  • Storage size.
  • Backup/restore time.

Mitigations:

  • Add indexes.
  • Optimize queries.
  • Reduce round trips.
  • Use caching.
  • Use read replicas.
  • Use CQRS/read models.
  • Batch writes.
  • Partition data.
  • Shard by tenant or key.
  • Move long-running work to queues.
  • Denormalize carefully.
  • Use a more suitable database type for the access pattern.

Estimating Storage Capacity

Storage capacity planning includes more than raw data size.

Consider:

  • Raw data.
  • Indexes.
  • Metadata.
  • Replication.
  • Backups.
  • Logs.
  • Audit trails.
  • Soft deletes.
  • Versioning.
  • Temporary files.
  • Retention period.
  • Growth rate.
  • Compression.
  • Encryption overhead.
  • Data lifecycle policies.

Example:

Code
Uploads per day: 50,000
Average file size: 5 MB
Metadata per file: 2 KB
Retention: 365 days

Daily file storage:
50,000 * 5 MB = 250,000 MB ≈ 250 GB/day

Yearly raw file storage:
250 GB * 365 = 91,250 GB ≈ 91 TB/year

With replication:

Code
91 TB * 3 copies = 273 TB physical storage equivalent

Design implications:

  • Use object storage.
  • Use lifecycle tiers.
  • Add retention rules.
  • Avoid storing large files in relational database rows.
  • Plan backup and restore time.
  • Monitor storage growth and cost.

Estimating Network Bandwidth

Network can become a bottleneck for large responses, file downloads, video, images, or chat/media systems.

Example:

Code
Peak downloads: 2,000 downloads/second
Average file size: 2 MB

Bandwidth:
2,000 * 2 MB = 4,000 MB/s ≈ 4 GB/s ≈ 32 Gbps

Design implications:

  • Use CDN.
  • Use object storage direct download.
  • Use compression.
  • Use pagination.
  • Use response caching.
  • Avoid returning huge JSON payloads.
  • Use streaming.
  • Use regional deployment.
  • Consider egress cost.

For APIs:

Code
Peak RPS: 5,000
Average response size: 50 KB

Outbound bandwidth:
5,000 * 50 KB = 250,000 KB/s ≈ 250 MB/s ≈ 2 Gbps

Large responses can saturate network and increase serialization CPU.

Estimating Background Job Capacity

For background jobs, compare arrival rate with processing rate.

Example:

Code
Jobs arrive: 10,000 jobs/hour
Average processing time: 2 seconds/job
One worker processes:
3,600 seconds/hour / 2 = 1,800 jobs/hour

Workers needed:
10,000 / 1,800 = 5.56

Round up:
6 workers

Add roughly 30% safety margin:
6 * 1.3 = 7.8, rounded up to 8 workers

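If each job calls an external API, the provider's rate limit caps throughput no matter how many workers run. A limit of 100 requests/second allows 360,000 single-call jobs/hour, which is ample for this workload, but the limit becomes the bottleneck before worker CPU if each job makes many calls or the arrival rate grows.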
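
The worker estimate above can be sketched as follows. `WorkersNeeded` is a hypothetical helper, and the sketch assumes each worker processes one job at a time.

```csharp
using System;

// Workers needed to keep up with the arrival rate, with a safety margin, rounded up.
static int WorkersNeeded(decimal jobsPerHour, decimal secondsPerJob, decimal safetyMargin)
{
    decimal jobsPerWorkerPerHour = 3_600m / secondsPerJob;
    return (int)Math.Ceiling(jobsPerHour / jobsPerWorkerPerHour * (1 + safetyMargin));
}

Console.WriteLine($"Minimum workers: {WorkersNeeded(10_000m, 2m, 0m)}");
Console.WriteLine($"With 30% margin: {WorkersNeeded(10_000m, 2m, 0.30m)}");
```

For I/O-bound jobs a single worker process can often run several jobs concurrently, so the one-job-per-worker assumption gives an upper bound on the worker count.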

Important questions:

  • How many jobs arrive per second?
  • How long does each job take?
  • Can jobs run concurrently?
  • Are jobs CPU-bound or I/O-bound?
  • Is processing idempotent?
  • What is the retry policy?
  • What is acceptable queue delay?
  • What happens to poison messages?
  • How large can the queue become?
  • What is the worker scale-out strategy?

Queue Capacity and Backlog

A queue absorbs spikes, but it does not remove work. If arrival rate exceeds processing rate for long enough, backlog grows.

Formula:

Code
Backlog growth rate = arrival rate - processing rate

Example:

Code
Arrival rate: 500 jobs/minute
Processing rate: 300 jobs/minute

Backlog growth:
200 jobs/minute

After one hour:

Code
200 * 60 = 12,000 queued jobs

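If each job must complete within 10 minutes, the system is under-capacity: at 300 jobs/minute, a job that arrives behind a 12,000-job backlog waits about 40 minutes before processing even starts.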
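
The backlog arithmetic can be sketched in code, together with the time needed to drain a backlog once processing capacity is raised. The helper names are hypothetical, and the 700 jobs/minute scaled-up rate is an assumed figure for illustration.

```csharp
using System;

// Backlog accumulated while arrivals outpace processing (never negative).
static decimal BacklogAfter(decimal arrivalPerMinute, decimal processedPerMinute, decimal minutes)
    => Math.Max(0m, (arrivalPerMinute - processedPerMinute) * minutes);

// Minutes to clear a backlog; null when processing does not exceed arrivals.
static decimal? MinutesToDrain(decimal backlog, decimal arrivalPerMinute, decimal processedPerMinute)
    => processedPerMinute > arrivalPerMinute
        ? backlog / (processedPerMinute - arrivalPerMinute)
        : (decimal?)null;

decimal queued = BacklogAfter(500m, 300m, 60m);
decimal waitMinutes = queued / 300m;                         // delay for a job joining the queue now
decimal? drainMinutes = MinutesToDrain(queued, 500m, 700m);  // assumed scale-up to 700 jobs/minute

Console.WriteLine($"Backlog after one hour: {queued:0} jobs");
Console.WriteLine($"Wait for a newly queued job: {waitMinutes:0} minutes");
Console.WriteLine($"Time to drain at 700 jobs/minute: {drainMinutes:0} minutes");
```

Matching the arrival rate only stops the backlog from growing; clearing it requires processing capacity above the arrival rate for some time.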

Mitigations:

  • Increase worker count.
  • Improve job processing time.
  • Batch jobs.
  • Reduce unnecessary work.
  • Split heavy jobs.
  • Use priority queues.
  • Apply backpressure.
  • Use autoscaling based on queue length or queue age.
  • Add dead-letter handling.
  • Separate slow job types from fast job types.

Queue length alone can be misleading. Queue age is often more important.

Code
Queue length: 10,000 messages
Oldest message age: 5 seconds
Probably healthy for a high-throughput system.

Queue length: 100 messages
Oldest message age: 2 hours
Probably unhealthy for a low-throughput, urgent workflow.

Common Bottleneck Locations in Web Applications

Common bottlenecks in web applications include:

Client and Frontend

  • Large JavaScript bundles.
  • Slow rendering.
  • Too many API calls.
  • Large images.
  • No caching.
  • Layout shifts.
  • Blocking scripts.
  • Slow third-party scripts.

API Layer

  • CPU saturation.
  • Synchronous blocking calls.
  • Thread pool starvation.
  • Large JSON serialization.
  • Inefficient middleware.
  • Too many dependency calls per request.
  • Poor connection reuse.
  • Inefficient authorization checks.
  • Excessive logging.

Database

  • Missing indexes.
  • N+1 queries.
  • Table scans.
  • Lock contention.
  • Slow joins.
  • Large transactions.
  • Connection pool exhaustion.
  • Hot rows.
  • Hot partitions.
  • Transaction log bottleneck.

Cache

  • Low hit rate.
  • Hot keys.
  • Evictions.
  • Cache stampede.
  • Cache server memory pressure.
  • Network latency to cache.

External Dependencies

  • API rate limits.
  • Slow payment provider.
  • Identity provider latency.
  • Third-party outages.
  • DNS issues.
  • TLS handshake overhead.
  • Retry storms.

Infrastructure

  • Load balancer limits.
  • Network bandwidth.
  • Disk I/O.
  • Container CPU throttling.
  • Memory limits.
  • Autoscaling delay.
  • Cloud service quotas.
  • Regional capacity.

Bottleneck Symptoms

Symptoms of bottlenecks include:

| Symptom | Possible Cause |
| --- | --- |
| High p95/p99 latency | Slow dependency, queueing, CPU saturation |
| High CPU | Expensive computation, serialization, inefficient code |
| High memory | Leaks, large objects, cache growth, buffering |
| High GC time | Allocation-heavy code, large object heap pressure |
| Database CPU high | Slow queries, missing indexes, too many queries |
| Database lock waits | Long transactions, hot rows, contention |
| Connection timeouts | Connection pool exhaustion, network issues |
| Queue backlog increasing | Worker under-capacity or downstream bottleneck |
| Error rate rising under load | Resource exhaustion, timeouts, throttling |
| 429 responses | Rate limit exceeded |
| 503 responses | Overloaded service or dependency unavailable |
| Autoscaling does not help | Bottleneck is downstream or shared resource |
| One partition hot | Poor partition key or skewed traffic |
| Long deployment warmup | Cold starts, cache warmup, JIT, migrations |

A good capacity plan defines which signals will be monitored.

CPU Bottlenecks

CPU bottlenecks occur when processing demand exceeds available CPU.

Common causes:

  • Expensive algorithms.
  • Serialization and deserialization.
  • Encryption/compression.
  • Image/video processing.
  • Complex validation.
  • Regular expressions.
  • Excessive logging.
  • Busy waiting.
  • CPU-bound work inside web request threads.
  • Inefficient JSON transformations.
  • High garbage collection overhead.

Mitigations:

  • Optimize hot code paths.
  • Cache computed results.
  • Reduce payload size.
  • Move CPU-heavy work to background workers.
  • Scale out API instances.
  • Use better algorithms.
  • Use streaming.
  • Use compiled expressions carefully.
  • Profile before optimizing.
  • Use specialized services for media processing.

Interview signal:

Code
If CPU scales linearly with request rate and no downstream dependency is saturated, horizontal scaling may help.

But if CPU is high because each request performs unnecessary work, optimization may be cheaper than scaling.

Memory Bottlenecks

Memory bottlenecks occur when the application stores too much data or allocates too frequently.

Common causes:

  • Loading entire files into memory.
  • Loading huge result sets.
  • In-memory caches without limits.
  • Memory leaks through static references.
  • Large object heap pressure.
  • Too many concurrent requests.
  • Excessive buffering.
  • Large JSON payloads.
  • Unbounded queues.
  • Poor object lifecycle management.

Mitigations:

  • Stream large files.
  • Add pagination.
  • Limit request and response size.
  • Bound in-memory queues.
  • Set cache size limits.
  • Avoid static mutable collections.
  • Use memory profiling.
  • Reduce allocations.
  • Use pooling only when justified.
  • Scale out or scale up memory.
  • Move large storage to external systems.

Example bad pattern:

Code
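// Buffers the whole file in memory; usage grows with file size × concurrent downloads.
// System.IO.File is fully qualified because File(...) in a controller means ControllerBase.File.
var bytes = await System.IO.File.ReadAllBytesAsync(path);
return File(bytes, "application/octet-stream");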

Better for large files:

Code
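// Streams the file in chunks; the framework disposes the stream after the response is sent.
var stream = System.IO.File.OpenRead(path);
return File(stream, "application/octet-stream");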

Database Bottlenecks

Databases are common bottlenecks because they are shared stateful components.

Common causes:

  • Missing indexes.
  • N+1 queries.
  • Too many round trips.
  • Large result sets.
  • Unbounded queries.
  • Full table scans.
  • Lock contention.
  • Hot rows.
  • Long transactions.
  • Inefficient schema design.
  • Over-normalization for read-heavy workloads.
  • Under-normalization causing write anomalies.
  • Connection pool exhaustion.
  • Poor partition key.
  • Transaction log saturation.

Mitigations:

  • Add appropriate indexes.
  • Use query plans.
  • Avoid N+1 queries.
  • Use projections instead of loading full entities.
  • Add pagination.
  • Use read replicas.
  • Cache read-heavy data.
  • Split read and write models.
  • Use background processing.
  • Batch writes.
  • Use optimistic concurrency.
  • Partition or shard.
  • Reduce transaction scope.
  • Tune connection pool settings carefully.
  • Use the right database for the access pattern.

Example N+1 pattern:

Code
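// One query loads the orders...
var orders = await context.Orders.ToListAsync();

foreach (var order in orders)
{
    // ...then one extra query runs per order: N orders cause N + 1 round trips.
    var items = await context.OrderItems
        .Where(item => item.OrderId == order.Id)
        .ToListAsync();
}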

Better:

Code
var orders = await context.Orders
    .Include(order => order.Items)
    .ToListAsync();

Or better for API responses:

Code
var orders = await context.Orders
    .Select(order => new OrderDto
    {
        Id = order.Id,
        Total = order.Total,
        ItemCount = order.Items.Count
    })
    .ToListAsync();

Connection Pool Bottlenecks

Connection pools limit how many simultaneous connections can be used for a dependency.

Common pools:

  • Database connection pool.
  • HTTP connection pool.
  • Redis connection pool.
  • Message broker connections.
  • Thread pool.
  • Browser automation pool.
  • File handle pool.

Symptoms:

  • Requests wait for connections.
  • Timeout errors.
  • High latency under load.
  • Database CPU not high but app requests still slow.
  • Adding app instances overloads the database with connections.

Example:

Code
API instances: 20
Max DB connections per instance: 100

Potential database connections:
20 * 100 = 2,000

If the database can safely handle only 500 connections, scaling API instances without controlling connection use can make the problem worse.
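
One way to avoid this is to derive the per-instance pool size from the database limit and the maximum instance count, instead of choosing each independently. This is a sketch; `MaxPoolSizePerInstance` is a hypothetical helper, and the 50 connections reserved for migrations, monitoring, and admin access are an assumed figure.

```csharp
using System;

// Largest per-instance pool size that keeps the fleet within the database's connection limit.
static int MaxPoolSizePerInstance(int databaseConnectionLimit, int reservedConnections, int maxInstances)
    => (databaseConnectionLimit - reservedConnections) / maxInstances;

int unboundedTotal = 20 * 100;  // the example above: 20 instances x 100 connections
int safePoolSize = MaxPoolSizePerInstance(databaseConnectionLimit: 500, reservedConnections: 50, maxInstances: 20);

Console.WriteLine($"Potential connections without a budget: {unboundedTotal}");
Console.WriteLine($"Pool size per instance within the budget: {safePoolSize}");
```

The `maxInstances` value should be the autoscaling maximum, not the current instance count, because new instances open their own pools when they start.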

Mitigations:

  • Reduce unnecessary database calls.
  • Open connections late and release them promptly so they return to the pool.
  • Avoid long-running transactions.
  • Tune pool size with evidence.
  • Add backpressure.
  • Use read replicas.
  • Use queue-based processing.
  • Limit app instance count if database cannot support more connections.
  • Use multiplexing where supported.

Network Bottlenecks

Network bottlenecks appear when data transfer or network round trips dominate performance.

Causes:

  • Large payloads.
  • Too many chatty service calls.
  • Cross-region calls.
  • No compression.
  • Inefficient protocols.
  • Repeated TLS handshakes.
  • No connection reuse.
  • Large file downloads through API servers.
  • Slow DNS or external dependencies.

Mitigations:

  • Place services closer together.
  • Use CDN for static assets and downloads.
  • Compress responses.
  • Reduce payload size.
  • Use pagination.
  • Use binary formats when justified.
  • Use connection reuse.
  • Avoid unnecessary cross-region calls.
  • Batch requests.
  • Use async messaging.
  • Let clients upload/download directly to object storage when appropriate.

External Dependency Bottlenecks

External APIs often limit system capacity.

Examples:

Code
Payment provider: 200 requests/sec
Email provider: 10,000 emails/hour
Identity provider: token endpoint limit
Shipping API: p95 latency 2 seconds
Credit bureau: strict rate limit

If your system needs more capacity than the dependency allows, the external service becomes the bottleneck.

Mitigations:

  • Cache where valid.
  • Queue and process asynchronously.
  • Batch requests.
  • Enforce provider rate limits explicitly by throttling outgoing calls on your side.
  • Add backoff and circuit breakers.
  • Use idempotency keys.
  • Add fallback providers.
  • Degrade gracefully.
  • Negotiate higher limits.
  • Split traffic across regions/providers if contract allows.
  • Do not retry blindly during outages.

Capacity planning must include external dependency limits, not only internal resources.

Hot Partitions and Skew

A hot partition happens when traffic is unevenly concentrated on one partition, shard, tenant, key, or database row.

Examples:

  • One tenant generates 70% of traffic.
  • One viral post receives most reads.
  • One product has a flash sale.
  • All writes use today's date as partition key.
  • Sequential IDs concentrate writes.
  • All users update one global counter.
  • One queue partition receives most messages.

Symptoms:

  • Overall system capacity looks available, but one partition throttles.
  • Some tenants/users experience poor performance.
  • Scaling out does not help evenly.
  • One shard has high CPU while others are idle.

Mitigations:

  • Choose better partition keys.
  • Add key salting.
  • Use fan-out/fan-in carefully.
  • Cache hot items.
  • Split hot tenants.
  • Avoid global counters.
  • Use per-partition metrics.
  • Use load-aware routing.
  • Precompute popular content.
  • Use CDN for hot static content.

Autoscaling and Capacity

Autoscaling adjusts capacity based on metrics such as CPU, memory, request count, queue length, or custom signals.

Autoscaling helps with variable demand, but it is not instant.

Important considerations:

  • Scale-out delay.
  • Cold start time.
  • Warmup behavior.
  • Minimum instance count.
  • Maximum instance count.
  • Cooldown periods.
  • Metric delay.
  • Dependency capacity.
  • Cost.
  • Load balancing.
  • Stateful components.
  • Database connections created by new instances.

Example issue:

Code
Traffic spike lasts 3 minutes.
Autoscaling takes 5 minutes to add ready instances.
Result: autoscaling reacts too late.

Mitigations:

  • Keep minimum warm capacity.
  • Use scheduled scaling for predictable spikes.
  • Use faster scale-out triggers.
  • Use queue-based load leveling.
  • Pre-warm before launch or campaign.
  • Optimize startup time.
  • Use readiness probes.
  • Scale dependencies too.
  • Test autoscaling behavior under load.

Headroom and Safety Margin

Headroom is spare capacity above expected demand.

Example:

Code
Expected peak: 5,000 RPS
Planned capacity: 7,000 RPS
Headroom: 2,000 RPS = 40% above expected peak

Headroom protects against:

  • Forecasting error.
  • Traffic spikes.
  • Instance failure.
  • Slow dependency.
  • Deployment warmup.
  • Noisy neighbors.
  • Cloud service throttling.
  • Batch jobs overlapping with traffic.
  • Retry storms.
  • Seasonal load.

Too little headroom causes risk. Too much headroom increases cost.

A common interview approach is to state a safety margin, such as 20% to 50%, then explain that the actual margin depends on business criticality, cost, autoscaling speed, traffic volatility, and failure tolerance.
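
Headroom is also worth checking after a failure, not only in the healthy state. The sketch below assumes, purely for illustration, that the 7,000 RPS of planned capacity comes from 7 identical instances of 1,000 RPS each; `Headroom` is a hypothetical helper.

```csharp
using System;

// Headroom as a fraction of expected peak demand.
static decimal Headroom(decimal expectedPeakRps, decimal plannedCapacityRps)
    => (plannedCapacityRps - expectedPeakRps) / expectedPeakRps;

int instances = 7;                  // assumed fleet size
decimal rpsPerInstance = 1_000m;    // assumed per-instance capacity
decimal expectedPeak = 5_000m;

decimal healthy = Headroom(expectedPeak, instances * rpsPerInstance);
decimal oneDown = Headroom(expectedPeak, (instances - 1) * rpsPerInstance);

Console.WriteLine($"Headroom with all instances: {healthy:P0}");
Console.WriteLine($"Headroom with one instance lost: {oneDown:P0}");
```

Losing one of seven instances halves the headroom from 40% to 20%, so the margin has to be judged against the failure scenarios the system must tolerate.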

N+1 and Redundant Work as Bottlenecks

Many bottlenecks are not caused by raw traffic, but by inefficient work per request.

Example:

Code
One API request should need 1 database query.
Implementation performs 101 database queries.

At 100 RPS:

Code
Expected database queries: 100/sec
Actual database queries: 10,100/sec

This turns a moderate traffic system into a database bottleneck.

Other redundant work:

  • Calling the same external API multiple times per request.
  • Recalculating expensive data that could be cached.
  • Returning unused fields.
  • Serializing large object graphs.
  • Re-checking permissions repeatedly.
  • Loading full entities when projection is enough.
  • Repeating configuration or secret lookups.

Capacity planning should estimate work per request, not only request count.

Latency Percentiles

Average latency is not enough. Use percentiles.

Common percentiles:

Code
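p50: median; half of requests complete faster than this
p95: 95% of requests complete within this time; the slowest 5% take longer
p99: 99% of requests complete within this time; the slowest 1% take longer
p99.9: rare but severe tail latency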

Example:

Code
Average latency: 100 ms
p95 latency: 900 ms
p99 latency: 4 seconds

The average looks good, but 5% of requests take 900 ms or longer and 1% take 4 seconds or longer.

Capacity planning should define targets such as:

Code
Search p95 <= 300 ms under 2,000 RPS.
Checkout p95 <= 1 second under campaign traffic.
Payment confirmation p99 <= 3 seconds.

Tail latency often reveals bottlenecks caused by locks, garbage collection, slow queries, cold starts, retries, or overloaded dependencies.
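
To show how an average can hide the tail, the sketch below computes percentiles with the nearest-rank method over a made-up sample. The latency values are fabricated for illustration only, and `Percentile` is a hypothetical helper; monitoring systems typically use histograms or sketches and may interpolate differently.

```csharp
using System;
using System.Linq;

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
static double Percentile(double[] samplesMs, decimal percentile)
{
    if (samplesMs.Length == 0) throw new ArgumentException("No samples.", nameof(samplesMs));
    double[] sorted = samplesMs.OrderBy(x => x).ToArray();
    int rank = (int)Math.Ceiling(percentile * sorted.Length / 100m);  // decimal avoids rounding drift
    return sorted[Math.Max(rank, 1) - 1];
}

// Illustrative sample: 90 fast requests and a slow tail of 10.
double[] latenciesMs = Enumerable.Repeat(50.0, 90)
    .Concat(Enumerable.Repeat(400.0, 5))
    .Concat(Enumerable.Repeat(900.0, 4))
    .Concat(new[] { 4000.0 })
    .ToArray();

Console.WriteLine($"mean: {latenciesMs.Average()} ms");
Console.WriteLine($"p50:  {Percentile(latenciesMs, 50m)} ms");
Console.WriteLine($"p95:  {Percentile(latenciesMs, 95m)} ms");
Console.WriteLine($"p99:  {Percentile(latenciesMs, 99m)} ms");
```

In this sample the mean is pulled up by a handful of slow requests while the median stays at the fast value, and neither number alone describes what the slowest users see.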

Load Testing

Load testing validates whether the system can handle expected traffic.

Types of tests:

| Test Type | Purpose |
| --- | --- |
| Smoke test | Verify system works with minimal load |
| Load test | Validate expected normal and peak load |
| Stress test | Push beyond expected capacity to find breaking point |
| Spike test | Sudden traffic increase |
| Soak test | Long-running test for leaks and degradation |
| Scalability test | Measure behavior as instances/resources increase |
| Failover test | Validate capacity during instance/zone/dependency failure |

A good load test should define:

  • Target workload.
  • User behavior model.
  • Read/write mix.
  • Request distribution.
  • Data volume.
  • Ramp-up pattern.
  • Test duration.
  • Success criteria.
  • Environment similarity.
  • Monitoring metrics.
  • Failure thresholds.

Bad load test:

Code
Hit one endpoint repeatedly with unrealistic data.

Better load test:

Code
Simulate realistic user journeys with proper traffic mix, data volume, authentication, cache behavior, and peak traffic pattern.

Capacity Testing Metrics

Monitor metrics at every layer.

Application metrics:

  • Request rate.
  • Error rate.
  • Latency percentiles.
  • CPU.
  • Memory.
  • GC time.
  • Thread pool usage.
  • Active requests.
  • Dependency latency.
  • Retry count.
  • Timeout count.

Database metrics:

  • CPU.
  • IOPS.
  • Query duration.
  • Lock waits.
  • Deadlocks.
  • Connection count.
  • Buffer/cache hit rate.
  • Log write waits.
  • Slow queries.
  • Replication lag.
  • Index usage.

Queue metrics:

  • Arrival rate.
  • Processing rate.
  • Queue length.
  • Oldest message age.
  • Dead-letter count.
  • Retry count.
  • Consumer lag.

Cache metrics:

  • Hit rate.
  • Miss rate.
  • Eviction count.
  • Memory usage.
  • CPU.
  • Latency.
  • Hot keys.

Infrastructure metrics:

  • Load balancer status.
  • Network throughput.
  • Disk throughput.
  • Container restarts.
  • Autoscaling events.
  • Throttling.
  • Cloud service quotas.

Capacity planning without metrics is guessing.

Identifying the Bottleneck During Load Testing

A structured bottleneck investigation:

Code
1. Observe user-facing symptom:
   latency, errors, throughput plateau, queue backlog.

2. Check application resource metrics:
   CPU, memory, thread pool, GC, active requests.

3. Check dependency metrics:
   database, cache, queue, external API.

4. Look for saturation:
   high utilization, wait time, throttling, connection limits.

5. Compare throughput:
   Does adding app instances increase throughput?

6. Isolate:
   Test endpoints separately, disable optional dependencies, profile hot paths.

7. Validate:
   Fix or scale suspected bottleneck and retest.

Important rule:

Code
A saturated component is not always the root cause.

Example:

Code
Database CPU is high.
Root cause may be N+1 queries from application code.

Increasing the database size may help temporarily, but fixing the query pattern addresses the root cause.

Throughput Plateau

A throughput plateau occurs when increasing load no longer increases completed work.

Example:

Code
500 RPS load -> 500 RPS handled
1,000 RPS load -> 1,000 RPS handled
1,500 RPS load -> 1,200 RPS handled
2,000 RPS load -> 1,200 RPS handled with high latency

The system maximum is around 1,200 RPS in that environment. Additional requests queue, time out, or fail.
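
One way to spot the plateau in load test results is to look for the first step where handled throughput falls short of offered load. This is a sketch; the 5% tolerance is an arbitrary assumption.

```python
def find_plateau(results: list[tuple[float, float]], tolerance: float = 0.05):
    """Return the first (offered, handled) pair where the system falls behind.

    results: (offered_rps, handled_rps) pairs in increasing offered load.
    tolerance: fractional shortfall that still counts as keeping up.
    """
    for offered, handled in results:
        if handled < offered * (1 - tolerance):
            return offered, handled
    return None


# The figures from the example above.
results = [(500, 500), (1000, 1000), (1500, 1200), (2000, 1200)]

plateau = find_plateau(results)
if plateau is None:
    print("no plateau found in the tested range")
else:
    offered, handled = plateau
    print(f"falls behind at {offered} RPS offered; about {handled} RPS handled")
```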

Next step: find what saturates at 1,200 RPS.

Possible bottlenecks:

  • API CPU.
  • Database CPU.
  • Database connections.
  • Locks.
  • External dependency rate limit.
  • Network.
  • Thread pool.
  • Cache.

Scaling Up vs Scaling Out

Scaling up means using a larger resource.

Code
Bigger VM
More CPU
More memory
Higher database tier
More IOPS

Scaling out means adding more resources.

Code
More API instances
More workers
More queue consumers
More database replicas
More partitions
More shards

Comparison:

| Strategy | Benefits | Risks |
| --- | --- | --- |
| Scale up | Simple, fewer moving parts | Has upper limit, can be costly, still single point of failure |
| Scale out | Better availability and elasticity | Requires stateless design, load balancing, coordination, data partitioning |

Examples:

  • Stateless API servers usually scale out well.
  • A relational database often scales up first, then uses replicas, partitioning, or sharding.
  • Background workers usually scale out if jobs are independent.
  • A single hot row does not scale out easily without design changes.

Vertical Scaling Limits

Vertical scaling is limited by available instance sizes, cost, and single-resource failure risk.

Example:

Code
Database is scaled to largest available SKU.
CPU still reaches 95% at peak.

At this point, options include:

  • Query optimization.
  • Index tuning.
  • Caching.
  • Read replicas.
  • Data partitioning.
  • Sharding.
  • CQRS.
  • Archiving old data.
  • Moving analytical workload away from OLTP database.
  • Redesigning write path.
  • Using a different storage engine.

A mature answer should not assume "just use a bigger database" forever.

Horizontal Scaling Requirements

Horizontal scaling requires design support.

For API servers:

  • Stateless application instances.
  • Shared external session storage if sessions are needed.
  • Load balancing.
  • Health checks.
  • Configuration consistency.
  • Distributed cache if needed.
  • No local-only file storage for shared data.

For workers:

  • Idempotent processing.
  • Work partitioning.
  • Queue visibility timeout.
  • Duplicate handling.
  • Concurrency limits.
  • Dead-letter handling.
  • Distributed locks only when necessary.

For data:

  • Partition key design.
  • Shard routing.
  • Cross-shard query strategy.
  • Replication.
  • Data consistency model.
  • Rebalancing plan.

Scaling out a stateful system is harder than scaling out stateless compute.

Caching and Capacity

Caching can reduce load on expensive resources.

Use caching for:

  • Frequently read data.
  • Expensive computations.
  • Slow external API results.
  • Static or rarely changing data.
  • Search suggestions.
  • Product catalog.
  • Permissions or metadata with short TTL.
  • Configuration.

Caching benefits:

  • Lower latency.
  • Reduced database load.
  • Reduced external API calls.
  • Better peak handling.
  • Lower cost.

Caching risks:

  • Stale data.
  • Cache invalidation complexity.
  • Cache stampede.
  • Hot keys.
  • Memory pressure.
  • Inconsistent user experience.
  • Security issues if user-specific data is cached incorrectly.

Cache capacity planning includes:

  • Cache hit rate.
  • Memory size.
  • Eviction rate.
  • TTL.
  • Key count.
  • Value size.
  • Hot key distribution.
  • Cache server CPU/network.
  • Fallback load if cache fails.

Important question:

Code
If the cache is down, can the database handle the fallback traffic?

If not, the cache has become a critical dependency and needs reliability planning.
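
The fallback question can be checked with simple arithmetic: the database sees whatever read traffic misses the cache. In this sketch the traffic, hit rates, and database capacity are assumed values chosen for illustration.

```python
def db_read_rps(total_read_rps: float, cache_hit_rate: float) -> float:
    """Reads per second that miss the cache and reach the database."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return total_read_rps * (1.0 - cache_hit_rate)


# Assumed figures for illustration only.
total_read_rps = 10_000
db_capacity_rps = 3_000   # hypothetical sustainable database read rate

for label, hit_rate in [("normal", 0.95), ("degraded", 0.60), ("cache down", 0.0)]:
    load = db_read_rps(total_read_rps, hit_rate)
    status = "ok" if load <= db_capacity_rps else "OVER CAPACITY"
    print(f"{label}: {load:.0f} reads/sec reach the database ({status})")
```

Under these assumptions, a database that looks comfortable at a 95% hit rate is overloaded as soon as the hit rate drops substantially, well before the cache is fully down.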

Backpressure

Backpressure means slowing producers when consumers cannot keep up.

Without backpressure:

Code
API accepts unlimited work.
Queue grows without limit.
Memory grows.
Database grows.
Workers fall behind.
Eventually system fails.

With backpressure:

Code
When queue is full or dependency is saturated:
- reject requests with 429 or 503
- slow producers
- shed low-priority work
- degrade optional features
- apply rate limits

Backpressure protects the system from collapse.

Examples:

  • Limit concurrent background jobs.
  • Use bounded channels.
  • Return 429 when rate limit is exceeded.
  • Use queue max length.
  • Disable expensive optional work during overload.
  • Use circuit breakers for failing dependencies.
  • Apply per-tenant quotas.

Backpressure is a capacity planning tool because it defines what happens when demand exceeds capacity.
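
A minimal sketch of backpressure using a bounded in-process queue from the Python standard library. The `submit` function and the queue size of 3 are hypothetical; a real service would return an actual HTTP response and typically use an external queue.

```python
import queue

# Bounded queue: the size limit turns overload into an explicit signal.
jobs = queue.Queue(maxsize=3)


def submit(job_id: str) -> int:
    """Try to enqueue a job; return an HTTP-style status code."""
    try:
        jobs.put_nowait(job_id)
        return 202   # accepted for background processing
    except queue.Full:
        return 429   # reject so the producer slows down or retries later


# No consumer is running here, so the queue fills after three jobs.
statuses = [submit(f"job-{i}") for i in range(5)]
print(statuses)
```

Once the queue is full, further submissions are rejected instead of accumulating in memory, which is the behavior the unbounded case lacks.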

Load Shedding and Graceful Degradation

Load shedding means dropping or rejecting work to preserve critical functionality.

Graceful degradation means reducing functionality instead of failing completely.

Examples:

Code
Disable recommendations if recommendation service is slow.
Return cached product data if search is degraded.
Reject report generation requests during overload.
Serve static fallback page if personalization fails.
Prioritize checkout over analytics.

This is important because not all traffic has the same business value.

Capacity planning should identify:

  • Critical flows.
  • Optional flows.
  • Low-priority background work.
  • Work that can be delayed.
  • Work that can be rejected.
  • Work that must never be lost.
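
One possible way to express such priorities is a small admission rule that sheds lower-priority flows as utilization rises. The flow names, priority levels, and utilization thresholds below are assumptions for illustration, not recommended values.

```python
# Hypothetical priority classes: lower number means more critical.
PRIORITY = {"checkout": 0, "search": 1, "recommendations": 2, "report": 3}


def max_priority_allowed(utilization: float) -> int:
    """Map current utilization (0..1) to the least critical class still served."""
    if utilization < 0.70:
        return 3   # serve everything
    if utilization < 0.85:
        return 2   # shed report generation
    if utilization < 0.95:
        return 1   # also shed recommendations
    return 0       # only critical flows


def should_serve(flow: str, utilization: float) -> bool:
    """Decide whether a request for this flow is admitted right now."""
    return PRIORITY[flow] <= max_priority_allowed(utilization)


for util in (0.50, 0.80, 0.90, 0.99):
    served = [flow for flow in PRIORITY if should_serve(flow, util)]
    print(f"utilization {util:.0%}: serving {served}")
```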

Cost and Capacity Trade-Offs

Capacity planning is not only about performance. It is also about cost.

Overprovisioning:

Code
Pros:
- More headroom
- Lower outage risk during spikes
- Simpler planning

Cons:
- Higher cloud cost
- Waste during low usage
- May hide inefficiencies

Underprovisioning:

Code
Pros:
- Lower immediate cost

Cons:
- Higher latency
- Errors
- Outages
- Failed launches
- Poor user experience
- Emergency scaling

Autoscaling:

Code
Pros:
- Matches capacity to demand
- Reduces idle cost
- Handles variable traffic

Cons:
- Has scaling delay
- Needs good metrics
- Can cause dependency pressure
- May be hard for stateful systems

A strong architecture answer balances cost, user experience, reliability, and operational complexity.

Capacity Planning for Different System Types

Read-Heavy Systems

Examples:

  • Product catalog.
  • News feed.
  • Documentation site.
  • Public profile pages.
  • Search.

Likely bottlenecks:

  • Database reads.
  • Search index.
  • Cache.
  • Network bandwidth.
  • CDN.
  • Serialization.

Common strategies:

  • CDN.
  • Caching.
  • Read replicas.
  • Search indexes.
  • Denormalized read models.
  • Pagination.
  • Precomputation.
  • Compression.

Write-Heavy Systems

Examples:

  • Analytics ingestion.
  • Chat messages.
  • IoT telemetry.
  • Payment events.
  • Logging pipeline.

Likely bottlenecks:

  • Database writes.
  • Transaction log.
  • Partition hot spots.
  • Queue throughput.
  • Disk I/O.
  • Consumer lag.

Common strategies:

  • Queue ingestion.
  • Batching.
  • Partitioning.
  • Append-only storage.
  • Event streaming.
  • Sharding.
  • Asynchronous processing.
  • Backpressure.

Mixed Workloads

Examples:

  • E-commerce checkout.
  • Banking application.
  • SaaS dashboard.
  • Course management system.

Likely bottlenecks:

  • Database contention.
  • Mixed read/write pressure.
  • External dependencies.
  • Authorization checks.
  • Cache invalidation.

Common strategies:

  • Separate read and write models.
  • Cache read-heavy data.
  • Keep transactions short.
  • Use queues for side effects.
  • Use idempotency.
  • Prioritize critical paths.

Background Job Systems

Examples:

  • Report generation.
  • Email sending.
  • File processing.
  • Data import/export.

Likely bottlenecks:

  • Worker count.
  • Job processing time.
  • External APIs.
  • Queue backlog.
  • Storage throughput.
  • CPU for transformations.

Common strategies:

  • Queue-based architecture.
  • Worker autoscaling.
  • Batch processing.
  • Idempotency.
  • Dead-letter queues.
  • Priority queues.
  • Backpressure.
  • Job status tracking.
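
For worker count, a rough starting point is arrival rate multiplied by processing time, which gives the average number of jobs in progress. The sketch below adds a target utilization for headroom; the 70% default and the example rates are assumptions.

```python
import math


def workers_needed(jobs_per_second: float, seconds_per_job: float,
                   target_utilization: float = 0.7) -> int:
    """Estimate workers so the average busy fraction stays near the target."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    busy_workers = jobs_per_second * seconds_per_job   # average concurrent jobs
    return math.ceil(busy_workers / target_utilization)


def drain_time_seconds(backlog: int, workers: int, seconds_per_job: float,
                       jobs_per_second: float) -> float:
    """Time to clear a backlog while new jobs keep arriving; inf if it never clears."""
    service_rate = workers / seconds_per_job
    spare = service_rate - jobs_per_second
    return math.inf if spare <= 0 else backlog / spare


# Assumed workload: 20 jobs/sec arriving, 2 seconds of work per job.
workers = workers_needed(20, 2)
print(f"workers: {workers}")
print(f"drain 9,000 queued jobs in {drain_time_seconds(9000, workers, 2, 20):.0f} s")
```

This treats jobs as independent and uniform in duration; long-tailed job times or a shared downstream limit would change the result.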

Capacity Planning for Databases

Database planning should include:

  • Read/write ratio.
  • Query complexity.
  • Index strategy.
  • Transaction volume.
  • Data growth.
  • Retention.
  • Archive strategy.
  • Backup and restore time.
  • Connection limits.
  • Replication.
  • Sharding or partitioning.
  • Migration impact.
  • Maintenance windows.
  • Analytical workloads.

Common interview answer:

Code
I would avoid putting all read, write, reporting, and analytics load on the same OLTP database. I would separate heavy analytical queries into a read replica, data warehouse, or reporting pipeline if needed.

This shows awareness that different workloads stress databases differently.

Capacity Planning for File Upload Systems

Inputs:

  • Number of uploads per day.
  • Peak uploads per second.
  • Average file size.
  • Maximum file size.
  • Upload duration.
  • Download frequency.
  • Retention period.
  • Virus scanning time.
  • Metadata size.
  • Encryption requirements.
  • Regional storage.
  • Bandwidth and egress.

Likely bottlenecks:

  • API server memory if files are proxied.
  • Network bandwidth.
  • Object storage request limits.
  • Virus scanning worker capacity.
  • Metadata database.
  • Download bandwidth.
  • Storage growth.
  • CDN.

Common design:

Code
Client uploads directly to object storage using signed URL.
API stores metadata.
Object storage event triggers scanner.
Scanner updates file status.
Download uses signed URL after scan passes.

This avoids API servers becoming the file transfer bottleneck.

Capacity Planning for Real-Time Systems

Examples:

  • Chat.
  • Live notifications.
  • Trading dashboards.
  • Multiplayer games.
  • Collaboration tools.

Important metrics:

  • Concurrent connections.
  • Messages per second.
  • Fan-out count.
  • Connection memory.
  • Presence updates.
  • Delivery latency.
  • Regional latency.
  • Reconnect storms.
  • Message persistence.
  • Backpressure behavior.

Likely bottlenecks:

  • Connection count per server.
  • Message broker throughput.
  • Fan-out amplification.
  • Hot rooms/channels.
  • Network bandwidth.
  • Presence state store.
  • Database writes.

Example:

Code
1 message sent to a group with 10,000 members creates 10,000 delivery operations.

Fan-out can become the real bottleneck, not message creation.
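
The amplification can be estimated with a short calculation. The message rate and payload size below are assumed values; only the 10,000-member group comes from the example above.

```python
def fan_out_load(messages_per_second: float, avg_recipients: float,
                 payload_bytes: int) -> tuple[float, float]:
    """Return (deliveries per second, outbound megabytes per second)."""
    deliveries = messages_per_second * avg_recipients
    outbound_mb = deliveries * payload_bytes / 1_000_000
    return deliveries, outbound_mb


# Assumed workload: a 10,000-member channel receiving 5 messages per second,
# each about 500 bytes on the wire.
deliveries, outbound_mb = fan_out_load(5, 10_000, 500)
print(f"{deliveries:.0f} deliveries/sec, {outbound_mb:.1f} MB/sec outbound")
```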

Capacity Planning for Multi-Tenant Systems

Multi-tenant systems introduce uneven load.

Questions:

  • How many tenants?
  • What is average tenant size?
  • What is largest tenant size?
  • Can one tenant affect others?
  • Are there tenant quotas?
  • Is data shared or isolated?
  • Are noisy tenants throttled?
  • Can large tenants be moved to dedicated resources?
  • Are metrics available per tenant?

Likely bottlenecks:

  • Hot tenant.
  • Shared database.
  • Shared cache.
  • Shared queue.
  • Per-tenant reporting jobs.
  • Noisy neighbor effects.
  • Large tenant migrations.

Mitigations:

  • Per-tenant rate limits.
  • Tenant-level metrics.
  • Tenant partitioning.
  • Dedicated resources for large tenants.
  • Fair scheduling.
  • Quotas.
  • Bulkhead isolation.

Capacity Planning and Reliability

Capacity and reliability are connected.

A system running at 95% CPU during normal traffic has little room for:

  • Failover.
  • Traffic spikes.
  • Retry storms.
  • Deployment warmup.
  • Node failure.
  • Background jobs.
  • Garbage collection pauses.
  • Slow dependencies.

Reliability planning often requires spare capacity.

Example:

Code
System has 4 instances.
Each instance normally runs at 70% CPU.
If one instance fails, remaining 3 must handle the same load.

New average per remaining instance:
4 * 70% / 3 = 93.3%

This may be unsafe.

For N+1 capacity:

Code
The system should handle peak load even if one instance is unavailable.

This requirement directly increases required capacity.
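
The failover arithmetic above generalizes into a small sizing helper. This sketch assumes load is spread evenly across instances, and the 75% utilization ceiling is an arbitrary example threshold.

```python
import math


def utilization_after_failures(instances: int, utilization: float,
                               failed: int = 1) -> float:
    """Average utilization on the survivors if load is spread evenly."""
    survivors = instances - failed
    if survivors <= 0:
        raise ValueError("no instances left")
    return instances * utilization / survivors


def instances_for_n_plus_1(peak_load: float, capacity_per_instance: float,
                           max_utilization: float = 0.75) -> int:
    """Instances needed so peak load fits under max_utilization with one down."""
    survivors = math.ceil(peak_load / (capacity_per_instance * max_utilization))
    return survivors + 1


# The example above: 4 instances at 70% carry 2.8 instance-equivalents of load.
print(f"after one failure: {utilization_after_failures(4, 0.70):.1%}")
print(f"instances for N+1 at a 75% ceiling: {instances_for_n_plus_1(2.8, 1.0)}")
```

Under these assumptions the same load needs five instances rather than four to stay below the ceiling after losing one.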

Capacity Planning and Consistency

Consistency requirements affect capacity.

Strong consistency may require:

  • Synchronous writes.
  • Distributed transactions.
  • Locks.
  • Single leader.
  • Quorum writes.
  • Serial processing.
  • Lower concurrency.

Eventual consistency may allow:

  • Queues.
  • Async processing.
  • Read replicas.
  • Caches.
  • Denormalized views.
  • Higher availability.
  • Higher throughput.

Example:

Code
Inventory reservation must be strongly consistent.
Product recommendation updates can be eventually consistent.

Design implication:

Code
Use transactional inventory reservation for checkout.
Use async event processing for recommendations.

This avoids forcing the whole system to use the strictest consistency model.

Capacity Planning and Security

Security features can affect capacity.

Examples:

  • Password hashing consumes CPU.
  • Encryption/decryption consumes CPU.
  • TLS consumes CPU and network overhead.
  • Audit logging increases write volume.
  • Authorization checks add database/cache calls.
  • Rate limiting requires counters.
  • Malware scanning requires worker capacity.
  • Security monitoring increases log volume.

Capacity planning should include security workloads.

Example:

Code
If every login requires a strong password hash and peak login traffic is 2,000 attempts/second, CPU capacity for authentication must be planned carefully.

Security should not be removed to improve capacity. Instead, plan capacity for required security behavior.
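
A back-of-the-envelope sketch of the login example. The 100 ms cost per hash and the 60% utilization ceiling are assumptions; the real cost depends on the hashing algorithm, its work factor, and the hardware.

```python
import math


def cores_for_hashing(attempts_per_second: float, hash_ms: float,
                      max_utilization: float = 0.6) -> int:
    """Estimate CPU cores needed for password hashing alone."""
    cpu_seconds_per_second = attempts_per_second * hash_ms / 1000
    return math.ceil(cpu_seconds_per_second / max_utilization)


# 2,000 attempts/second from the example; 100 ms per hash is an assumed cost.
print(f"cores needed: {cores_for_hashing(2000, 100)}")
```

Even with generous rounding, a result in the hundreds of cores suggests the peak login rate deserves its own capacity line and protections such as rate limiting on failed attempts.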

Capacity Planning and Observability

Observability itself consumes capacity.

Logging every request with large payloads can increase:

  • CPU.
  • Disk.
  • Network.
  • Storage cost.
  • Log ingestion cost.
  • Query cost.
  • Privacy risk.

Good observability planning includes:

  • Log levels.
  • Sampling.
  • Structured logs.
  • Metrics.
  • Traces.
  • Retention periods.
  • Redaction.
  • Cost controls.
  • Alert thresholds.

Example:

Code
At 10,000 RPS, logging 5 KB per request creates:
10,000 * 5 KB = 50 MB/sec
50 MB/sec * 86,400 sec/day ≈ 4.3 TB/day

This can become a storage and cost bottleneck.
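
The same arithmetic as a reusable function, with an optional sampling rate to show how sampling changes the total. It uses decimal units (1 MB = 1,000 KB), and the 10% sampling rate is an illustrative assumption.

```python
def log_volume(rps: float, kb_per_request: float,
               sample_rate: float = 1.0) -> tuple[float, float]:
    """Return (MB per second, TB per day) of log data, in decimal units."""
    mb_per_sec = rps * kb_per_request * sample_rate / 1_000
    tb_per_day = mb_per_sec * 86_400 / 1_000_000
    return mb_per_sec, tb_per_day


full = log_volume(10_000, 5)
sampled = log_volume(10_000, 5, sample_rate=0.1)
print(f"full logging: {full[0]:.0f} MB/sec, {full[1]:.2f} TB/day")
print(f"10% sampling: {sampled[0]:.0f} MB/sec, {sampled[1]:.2f} TB/day")
```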

Identifying Bottlenecks Before Building

For a new system, use likely bottleneck analysis.

Steps:

Code
1. Identify main user flows.
2. Estimate traffic per flow.
3. Break each flow into component calls.
4. Identify shared resources.
5. Check limits of each resource.
6. Find resources where demand is close to limits.
7. Plan mitigation.

Example: checkout

Code
Flow:
Validate cart
Reserve inventory
Create payment
Create order
Send confirmation email
Update analytics

Likely bottlenecks:

Code
Inventory reservation database
Payment provider latency/rate limit
Order database writes
Email provider throughput
Analytics pipeline (must not block checkout)

Design decision:

Code
Keep inventory/payment/order synchronous.
Move email and analytics to background queue.
Use idempotency key for payment.
Monitor provider latency and error rate.

This identifies likely bottlenecks before writing code.
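
Steps 4 to 6 of this analysis can be sketched as a ranking of shared resources by estimated utilization. Every number below is an invented placeholder; real limits come from provider documentation or measurement.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    demand_per_checkout: float   # operations on this resource per checkout
    limit_per_second: float      # assumed sustainable operations per second


def rank_by_utilization(resources, checkouts_per_second):
    """Return (name, utilization) pairs, most heavily loaded first."""
    rows = [
        (r.name, checkouts_per_second * r.demand_per_checkout / r.limit_per_second)
        for r in resources
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Placeholder limits for the checkout flow; replace with real values.
resources = [
    Resource("inventory database", 2, 2_000),
    Resource("payment provider", 1, 150),
    Resource("order database", 3, 3_000),
    Resource("email provider", 1, 500),
]

for name, util in rank_by_utilization(resources, checkouts_per_second=100):
    flag = "  <- near limit" if util >= 0.6 else ""
    print(f"{name}: {util:.0%}{flag}")
```

With these placeholder limits the payment provider is the first resource to approach saturation, which is the kind of signal this analysis is meant to surface before any code exists.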

Bottleneck Mitigation Strategies

Common strategies:

| Bottleneck | Possible Mitigation |
| --- | --- |
| API CPU | Scale out, optimize code, cache, reduce serialization |
| API memory | Stream, paginate, limit payloads, bound queues |
| Database reads | Indexes, caching, read replicas, projections |
| Database writes | Batch, partition, queue, optimize transactions |
| Database locks | Short transactions, optimistic concurrency, redesign hot rows |
| Search latency | Search index, caching, precompute |
| External API | Queue, rate limit, fallback, cache, negotiate limits |
| Queue backlog | Add workers, optimize jobs, autoscale, priority queues |
| Network bandwidth | CDN, compression, direct object storage access |
| Hot partition | Better partition key, salting, split hot tenant |
| Cache stampede | Locking, request coalescing, stale-while-revalidate |
| Thread pool starvation | Avoid sync-over-async, use async I/O |
| Connection pool exhaustion | Reduce long operations, tune pool, limit concurrency |

A strong interview answer matches the mitigation to the bottleneck instead of applying generic solutions.

Common Capacity Planning Mistakes

Common mistakes include:

  • Planning for average load instead of peak load.
  • Ignoring burst traffic.
  • Ignoring external API limits.
  • Assuming autoscaling is instant.
  • Scaling API servers while the database is the bottleneck.
  • Not accounting for database connections per instance.
  • Ignoring background jobs and batch workloads.
  • Ignoring retries and retry storms.
  • Ignoring data growth and retention.
  • Ignoring backup and restore time.
  • Ignoring log volume and observability cost.
  • Using unrealistic load tests.
  • Testing only one endpoint instead of real user journeys.
  • Not monitoring p95 and p99 latency.
  • Ignoring queue age.
  • Ignoring hot partitions.
  • Assuming cache hit rate will always be high.
  • Not planning for cache failure.
  • Overprovisioning without fixing inefficient code.
  • Underprovisioning to save cost.
  • Not validating assumptions before launch.
  • Not updating capacity plans after product changes.

Best Practices

Start with measurable performance and reliability targets.

Estimate demand using historical data, business forecasts, pilots, and assumptions.

Plan for peak traffic, not only average traffic.

Include safety margin and failure scenarios.

Break user flows into component-level operations.

Identify shared resources and external dependency limits.

Estimate database, storage, network, cache, and queue needs separately.

Use back-of-the-envelope math to reveal likely risks.

Validate assumptions with realistic load testing.

Monitor application, database, queue, cache, and infrastructure metrics.

Use p95 and p99 latency, not only averages.

Track queue age, not only queue length.

Use autoscaling, but understand its delay and dependency impact.

Add backpressure and load shedding for overload protection.

Cache carefully and plan for cache failure.

Optimize inefficient code before blindly scaling expensive resources.

Use queues for burst smoothing and long-running work.

Make background jobs idempotent.

Avoid unbounded queues and unlimited concurrency.

Review capacity before launches, campaigns, migrations, seasonal peaks, and major feature releases.

Treat capacity planning as continuous, not one-time.

Interview Practice
