Overview
Throughput, latency, concurrency, availability, consistency, and cost targets are measurable quality requirements used to describe how a system should behave under real-world production conditions. They turn vague goals such as "make it fast", "support many users", "never go down", or "keep it affordable" into concrete engineering targets that influence architecture, infrastructure, database design, testing, monitoring, and operational decisions.
In system design and architecture interviews, these targets are important because they show whether a candidate can move beyond feature requirements and reason about real production behavior. A system is not only judged by whether it works functionally. It must also handle expected traffic, respond within acceptable time, survive failures, preserve the right level of data correctness, and stay within cost limits.
These targets are commonly used when designing APIs, microservices, background job systems, distributed databases, payment flows, reporting systems, file transfer platforms, messaging pipelines, and cloud-hosted applications. They influence decisions such as whether to use caching, queues, partitioning, read replicas, autoscaling, circuit breakers, retries, multi-region deployment, eventual consistency, or stronger transactional guarantees.
A strong interview answer usually starts by clarifying the business flow, user expectations, traffic shape, data criticality, failure tolerance, and budget constraints. Then the answer converts those requirements into measurable targets, explains trade-offs, and proposes how the system will be tested and monitored.
Core Concepts
Quality Targets and Requirement Decomposition
Requirement decomposition means breaking a high-level business goal into measurable technical targets.
A vague requirement:
The system must handle high traffic and be reliable.
A decomposed requirement:
Feature: Submit order
Traffic:
- Average: 500 requests per second
- Peak: 2,000 requests per second during campaign events
- Burst: 5,000 requests per second for up to 5 minutes
Latency:
- p95 response time below 300 ms
- p99 response time below 1 second
Availability:
- 99.95% monthly availability for order submission
- Graceful degradation for recommendation and analytics features
Consistency:
- Payment and inventory reservation must be strongly consistent
- Order history can be eventually consistent within 30 seconds
Cost:
- Monthly infrastructure budget below $20,000
- Unit cost below $0.002 per order request
This style of decomposition helps architects choose appropriate technologies and helps teams validate whether the system is meeting expectations.
Important terms:
- Functional requirement: What the system does, such as "create an order" or "upload a file".
- Non-functional requirement: How well the system performs, such as speed, reliability, security, scalability, or cost efficiency.
- SLI: Service Level Indicator, a measured metric such as request success rate or p95 latency.
- SLO: Service Level Objective, the internal target for an SLI.
- SLA: Service Level Agreement, an external promise that may include contractual consequences.
- Error budget: The allowed amount of unreliability over a period, often derived from the SLO.
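The relationship between an SLO and its error budget can be made concrete with a little arithmetic. The following is a minimal sketch, not a standard library API: the class name ErrorBudget and its methods are hypothetical, and it assumes the SLI is a simple request success ratio.

```csharp
using System;

public static class ErrorBudget
{
    // Fraction of requests allowed to fail, e.g. an SLO of 99.95 gives 0.0005.
    public static decimal AllowedFailureFraction(decimal sloPercent)
    {
        if (sloPercent <= 0m || sloPercent > 100m)
            throw new ArgumentOutOfRangeException(nameof(sloPercent));
        return 1m - sloPercent / 100m;
    }

    // Number of failed requests the SLO tolerates for a given request volume.
    public static long AllowedFailures(decimal sloPercent, long totalRequests) =>
        (long)Math.Floor(AllowedFailureFraction(sloPercent) * totalRequests);

    // Share of the budget still unspent; a negative value means the SLO is already missed.
    public static decimal RemainingFraction(decimal sloPercent, long totalRequests, long failedRequests)
    {
        long allowed = AllowedFailures(sloPercent, totalRequests);
        return allowed == 0 ? 0m : 1m - (decimal)failedRequests / allowed;
    }
}
```

For example, a 99.95% SLO over 10,000,000 requests tolerates 5,000 failures. After 1,250 failures, 75% of the budget remains.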
Throughput
Throughput measures how much work a system completes in a period of time.
Common throughput units:
- Requests per second
- Transactions per second
- Messages per second
- Jobs per minute
- Files processed per hour
- Database writes per second
- Events ingested per second
Throughput is not the same as the number of users. A system may have one million registered users but only a few thousand active users at the same time. Interviewers often expect candidates to distinguish between total users, daily active users, peak concurrent users, and actual request rate.
Example:
Assumptions:
- 1,000,000 daily active users
- Each user makes 20 API calls per day
- Traffic is concentrated into 8 active hours
- Peak traffic is 5 times the average
Average requests per second:
1,000,000 * 20 / (8 * 60 * 60) = about 694 RPS
Peak requests per second:
694 * 5 = about 3,470 RPS
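The estimate above can be captured in a small helper so the assumptions are easy to change. This is only a sketch: TrafficEstimate and its parameter names are hypothetical, and it assumes traffic is spread evenly across the active hours.

```csharp
using System;

public static class TrafficEstimate
{
    // Average requests per second during the active window of the day.
    public static double AverageRps(long dailyActiveUsers, int callsPerUserPerDay, double activeHoursPerDay)
    {
        if (activeHoursPerDay <= 0)
            throw new ArgumentOutOfRangeException(nameof(activeHoursPerDay));

        double requestsPerDay = (double)dailyActiveUsers * callsPerUserPerDay;
        double activeSeconds = activeHoursPerDay * 60 * 60;
        return requestsPerDay / activeSeconds;
    }

    // Peak rate expressed as a multiple of the average rate.
    public static double PeakRps(double averageRps, double peakFactor) =>
        averageRps * peakFactor;
}
```

Calling AverageRps(1_000_000, 20, 8) gives roughly 694 RPS, and PeakRps with a factor of 5 gives roughly 3,470 RPS, matching the hand calculation.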
Throughput matters because it drives capacity planning. It influences the number of application instances, database capacity, queue throughput, cache size, partitioning strategy, network bandwidth, and autoscaling rules.
Common throughput bottlenecks include:
- Database locks or slow queries
- Thread pool starvation
- Synchronous I/O
- Insufficient connection pool size
- Hot partitions
- Slow downstream services
- Excessive serialization or large payloads
- CPU-heavy transformations
- Unbounded retries during failures
Best practices:
- Separate average, peak, and burst throughput.
- Measure throughput per critical workflow, not only per application.
- Design for backpressure when demand exceeds capacity.
- Use queues for asynchronous workloads that do not need immediate completion.
- Avoid scaling only the web tier if the database or downstream dependency is the real bottleneck.
- Track throughput together with latency and error rate.
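One way to apply the backpressure advice above in .NET is a bounded channel from System.Threading.Channels. The sketch below is illustrative only: BoundedWorkQueue is a hypothetical wrapper, and the choice to reject work when the queue is full (rather than wait) is an assumption that depends on the workload.

```csharp
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public sealed class BoundedWorkQueue
{
    private readonly Channel<string> _channel;

    public BoundedWorkQueue(int capacity)
    {
        // In Wait mode, TryWrite returns false when the channel is full
        // and WriteAsync waits until a consumer frees a slot.
        _channel = Channel.CreateBounded<string>(new BoundedChannelOptions(capacity)
        {
            FullMode = BoundedChannelFullMode.Wait
        });
    }

    // Non-blocking enqueue: a false result lets the caller shed load,
    // for example by returning HTTP 429 to the client.
    public bool TryEnqueue(string workItem) => _channel.Writer.TryWrite(workItem);

    // Waiting enqueue: slows the producer down instead of rejecting work.
    public ValueTask EnqueueAsync(string workItem, CancellationToken cancellationToken) =>
        _channel.Writer.WriteAsync(workItem, cancellationToken);

    // Consumers pull items at the rate they can actually process them.
    public ValueTask<string> DequeueAsync(CancellationToken cancellationToken) =>
        _channel.Reader.ReadAsync(cancellationToken);
}
```

The capacity becomes an explicit design decision: it bounds memory use and limits how stale queued work can become before it is processed.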
Latency
Latency measures how long one operation takes from the user's or caller's perspective. In APIs, latency is often measured as request-response time. In event systems, it may mean end-to-end processing delay from event creation to completion.
Important latency metrics:
- Average latency: The mean response time. Useful, but can hide slow outliers.
- Median latency / p50: Half of requests are faster than this value.
- p95 latency: 95% of requests are faster than this value.
- p99 latency: 99% of requests are faster than this value.
- Tail latency: High-percentile latency such as p95, p99, or p99.9.
- End-to-end latency: Total time across client, network, API, dependencies, database, and response serialization.
Example latency target:
For the product search API:
- p50 below 80 ms
- p95 below 250 ms
- p99 below 800 ms
- timeout at 2 seconds
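To show why percentile targets like these reveal more than an average, here is a small sketch that computes percentiles from raw latency samples. It uses the nearest-rank method, which is one of several common percentile definitions. Monitoring tools often use interpolation or histograms instead, so their numbers can differ slightly. LatencyStats is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LatencyStats
{
    // Nearest-rank percentile: the smallest sample such that at least
    // the given percentage of samples are less than or equal to it.
    public static double Percentile(IReadOnlyCollection<double> samplesMs, double percentile)
    {
        if (samplesMs.Count == 0)
            throw new ArgumentException("At least one sample is required.", nameof(samplesMs));
        if (percentile <= 0 || percentile > 100)
            throw new ArgumentOutOfRangeException(nameof(percentile));

        double[] sorted = samplesMs.OrderBy(x => x).ToArray();
        int rank = (int)Math.Ceiling(percentile * sorted.Length / 100.0);
        return sorted[Math.Max(rank, 1) - 1];
    }

    // Arithmetic mean, included only for comparison with the percentiles.
    public static double Mean(IReadOnlyCollection<double> samplesMs) => samplesMs.Average();
}
```

With 98 samples at 100 ms and 2 samples at 2,000 ms, the mean is 138 ms and p95 is 100 ms, but p99 is 2,000 ms. Only the p99 shows that some users wait two seconds.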
Latency matters because users experience delay directly. A system can have high throughput but still feel slow if each request waits too long. High latency can also reduce throughput because resources remain occupied longer.
Common latency causes:
- Slow database queries
- Cold starts
- Large payloads
- Chatty service-to-service calls
- Blocking calls in async code
- Lock contention
- Cache misses
- Retry storms
- Cross-region network calls
- Garbage collection pressure
- Inefficient serialization
Best practices:
- Define percentile-based latency targets, not only averages.
- Track latency per endpoint or business flow.
- Use timeouts, cancellation, and circuit breakers.
- Avoid unnecessary sequential calls when independent calls can run concurrently.
- Use caching carefully for read-heavy workloads.
- Reduce payload size and avoid over-fetching.
- Treat p95 and p99 latency as first-class production metrics.
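Two of the practices above, timeouts with cancellation and running independent calls concurrently, can be combined in a few lines. This is a sketch under assumptions: ProductPage, getDetails, and getReviews are hypothetical stand-ins for two downstream calls that do not depend on each other.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ProductPage
{
    public static async Task<(string Details, string Reviews)> LoadAsync(
        Func<CancellationToken, Task<string>> getDetails,
        Func<CancellationToken, Task<string>> getReviews,
        TimeSpan timeout,
        CancellationToken callerToken)
    {
        // Cancel when either the caller gives up or the time budget runs out.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(callerToken);
        cts.CancelAfter(timeout);

        // Start both calls before awaiting either one, so total latency is
        // roughly the slower call rather than the sum of both.
        Task<string> detailsTask = getDetails(cts.Token);
        Task<string> reviewsTask = getReviews(cts.Token);

        // Throws OperationCanceledException if the calls honor the token
        // and the timeout elapses first.
        await Task.WhenAll(detailsTask, reviewsTask);
        return (await detailsTask, await reviewsTask);
    }
}
```

The timeout only takes effect if the downstream calls observe the token they are given, which is why it is passed to both delegates.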
Throughput vs Latency
Throughput and latency are related but different.
- Throughput asks: "How many operations can the system complete per second?"
- Latency asks: "How long does one operation take?"
A system can have:
- High throughput and low latency: ideal but requires efficient design.
- High throughput and high latency: system processes many requests, but users wait.
- Low throughput and low latency: system is fast for a small load but does not scale.
- Low throughput and high latency: system is both slow and capacity-limited.
Trade-off examples:
- Batching can improve throughput but increase latency for individual requests.
- Caching can reduce latency and improve throughput but may introduce stale data.
- Strong consistency can improve correctness but may increase latency.
- Adding replicas can improve read throughput but may complicate consistency.
- Increasing concurrency can improve throughput until the system becomes saturated, after which latency and error rates increase.
Interview habit:
Do not say "the system should be fast."
Say "the checkout API should keep p95 latency below 300 ms at 2,000 RPS with an error rate below 0.1%."
Concurrency
Concurrency describes how many operations are in progress at the same time. It is related to but not identical to throughput.
Common types of concurrency:
- Concurrent users
- Concurrent requests
- Concurrent database connections
- Concurrent background jobs
- Concurrent message consumers
- Concurrent file uploads
- Concurrent transactions
- Concurrent threads or tasks
A useful approximation, based on Little's Law, is:
Concurrency ≈ Throughput × Latency
Example:
If an API handles 1,000 requests per second
and average request latency is 200 ms:
Concurrency ≈ 1,000 × 0.2 = 200 active requests
This is useful for estimating connection pools, thread usage, memory pressure, queue consumers, and instance count.
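The same estimate can be extended into a rough instance count. The sketch below is illustrative: CapacityEstimate is a hypothetical helper, and both the per-instance concurrency limit and the 70% target utilization are assumptions that would normally come from load testing.

```csharp
using System;

public static class CapacityEstimate
{
    // Concurrency is approximately throughput multiplied by latency.
    public static double InFlightRequests(double requestsPerSecond, double averageLatencySeconds) =>
        requestsPerSecond * averageLatencySeconds;

    // Instances needed if each instance should run below its concurrency limit.
    public static int InstancesNeeded(
        double inFlightRequests,
        int maxConcurrentPerInstance,
        double targetUtilization = 0.7)
    {
        if (maxConcurrentPerInstance <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxConcurrentPerInstance));
        if (targetUtilization <= 0 || targetUtilization > 1)
            throw new ArgumentOutOfRangeException(nameof(targetUtilization));

        double usablePerInstance = maxConcurrentPerInstance * targetUtilization;
        return (int)Math.Ceiling(inFlightRequests / usablePerInstance);
    }
}
```

For 1,000 RPS at 200 ms this gives about 200 in-flight requests. If one instance can safely hold 50 concurrent requests and should stay near 70% utilization, the estimate is 6 instances.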
Concurrency mistakes:
- Confusing registered users with concurrent users.
- Confusing concurrent users with concurrent requests.
- Allowing unlimited parallel tasks.
- Increasing concurrency without checking database capacity.
- Forgetting connection pool limits.
- Using locks around slow I/O.
- Creating too many threads for I/O-bound workloads.
- Running CPU-bound work on the request path without throttling.
Best practices:
- Limit concurrency at the correct boundary.
- Use SemaphoreSlim, bounded channels, queue consumers, or rate limiters when needed.
- Use async I/O for I/O-bound work.
- Use worker pools for background processing.
- Monitor saturation metrics such as CPU, memory, thread pool queue length, queue depth, connection pool usage, and database waits.
- Apply backpressure instead of letting the system collapse under unlimited load.
Example C# concurrency limit:
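using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public sealed class ReportProcessor
{
    // At most 10 reports are processed at the same time.
    private readonly SemaphoreSlim _semaphore = new(initialCount: 10);

    public async Task ProcessAsync(IEnumerable<ReportJob> jobs, CancellationToken cancellationToken)
    {
        var tasks = jobs.Select(async job =>
        {
            // Wait for a free slot before starting the job.
            await _semaphore.WaitAsync(cancellationToken);
            try
            {
                await ProcessOneReportAsync(job, cancellationToken);
            }
            finally
            {
                // Always release the slot, even if the job fails or is cancelled.
                _semaphore.Release();
            }
        });

        await Task.WhenAll(tasks);
    }

    // Placeholder that simulates 200 ms of I/O-bound work per report.
    private static Task ProcessOneReportAsync(ReportJob job, CancellationToken cancellationToken)
    {
        return Task.Delay(TimeSpan.FromMilliseconds(200), cancellationToken);
    }
}

public sealed record ReportJob(Guid Id);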
The important point is not only that the code runs tasks concurrently, but that concurrency is intentionally bounded.
Availability
Availability measures whether a system is usable when users need it. It is usually expressed as a percentage over a time period.
Common availability examples (using an average month of about 730.5 hours):
99.0% monthly availability = about 7.3 hours downtime per month
99.9% monthly availability = about 43.8 minutes downtime per month
99.95% monthly availability = about 21.9 minutes downtime per month
99.99% monthly availability = about 4.4 minutes downtime per month
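Downtime figures like these follow directly from the availability percentage. The sketch below assumes an average month of 365.25 / 12 days (about 30.44 days). A 30-day month gives slightly smaller values, for example 43.2 minutes at 99.9%. The Availability class is a hypothetical helper.

```csharp
using System;

public static class Availability
{
    // Average calendar month: 365.25 days / 12, about 30.44 days.
    private static readonly TimeSpan AverageMonth = TimeSpan.FromDays(365.25 / 12);

    // Downtime allowed per month for a target such as 99.95.
    public static TimeSpan AllowedDowntimePerMonth(decimal availabilityPercent)
    {
        if (availabilityPercent < 0m || availabilityPercent > 100m)
            throw new ArgumentOutOfRangeException(nameof(availabilityPercent));

        decimal unavailableFraction = 1m - availabilityPercent / 100m;
        return TimeSpan.FromMinutes((double)unavailableFraction * AverageMonth.TotalMinutes);
    }
}
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which is one reason higher targets are disproportionately expensive to meet.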
Availability is not only about server uptime. A service may be "up" but unusable if:
- Error rates are high.
- Latency is extreme.
- Database writes fail.
- Authentication is broken.
- A critical dependency is unavailable.
- The UI loads but checkout cannot complete.
Related concepts:
- Reliability: The ability to perform correctly over time.
- Resiliency: The ability to recover from failures.
- Fault tolerance: The ability to continue operating despite component failures.
- RTO: Recovery Time Objective, how quickly the system must recover.
- RPO: Recovery Point Objective, how much data loss is acceptable.
- Graceful degradation: Keeping critical features available while non-critical features are disabled or reduced.
Availability design patterns:
- Health checks
- Load balancing
- Multiple application instances
- Database replication
- Zone redundancy
- Multi-region deployment
- Circuit breakers
- Retries with exponential backoff and jitter
- Timeouts
- Bulkheads
- Queues for temporary buffering
- Read-only fallback mode
- Cache fallback for non-critical reads
- Disaster recovery plans
Trade-offs:
- Higher availability usually increases cost and operational complexity.
- Multi-region systems improve resilience but complicate data consistency.
- Aggressive retries can improve availability during transient failures but can also overload dependencies.
- Graceful degradation requires product decisions about which features are critical.
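Retries with exponential backoff and jitter, listed among the patterns above, can be sketched as follows. This is not a production retry policy: the Retry class is hypothetical, it uses the "full jitter" variant (a random delay between zero and the exponential cap), and it retries every exception except cancellation. A real system would normally retry only errors known to be transient and combine retries with timeouts and a circuit breaker.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class Retry
{
    // Full jitter: a random delay between 0 and min(maxDelay, baseDelay * 2^attempt).
    public static TimeSpan BackoffWithJitter(int attempt, TimeSpan baseDelay, TimeSpan maxDelay, Random random)
    {
        double capMs = Math.Min(
            maxDelay.TotalMilliseconds,
            baseDelay.TotalMilliseconds * Math.Pow(2, attempt));
        return TimeSpan.FromMilliseconds(random.NextDouble() * capMs);
    }

    public static async Task<T> ExecuteAsync<T>(
        Func<CancellationToken, Task<T>> operation,
        int maxAttempts,
        TimeSpan baseDelay,
        TimeSpan maxDelay,
        CancellationToken cancellationToken)
    {
        var random = new Random();
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                return await operation(cancellationToken);
            }
            // Retry only while attempts remain; the final failure propagates to the caller.
            catch (Exception ex) when (attempt < maxAttempts - 1 && ex is not OperationCanceledException)
            {
                await Task.Delay(BackoffWithJitter(attempt, baseDelay, maxDelay, random), cancellationToken);
            }
        }
    }
}
```

The bounded attempt count and the capped delay are what keep retries from turning into the retry storm described in the trade-offs.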
Consistency
Consistency describes how correct and up to date data appears across reads, writes, replicas, caches, and distributed services.
Common consistency models:
- Strong consistency: A read returns the latest committed write.
- Eventual consistency: Replicas or read models become consistent after some delay.
- Read-your-writes consistency: A user sees their own updates immediately.
- Monotonic reads: A user does not see data move backward in time.
- Bounded staleness: Reads may be stale, but only within a known time or version limit.
- Session consistency: Consistency is preserved within a user session.
- Transactional consistency: A group of changes succeeds or fails as a unit.
Consistency matters because not all data has the same correctness requirements.
Examples:
Strong consistency usually needed:
- Payment capture
- Bank balance update
- Inventory reservation
- Password change
- Authorization policy update
Eventual consistency often acceptable:
- Analytics dashboard
- Search index
- Email notification status
- Recommendation list
- Activity feed
- Reporting read model
Common architecture examples:
- A write database is strongly consistent, while read replicas may lag.
- A cache improves read performance but can return stale data.
- A message-driven workflow improves resilience but introduces eventual consistency.
- A search index may lag behind the source database.
- A CQRS read model may be temporarily behind the write model.
Best practices:
- Define consistency requirements per business operation.
- Avoid applying strong consistency everywhere by default.
- Use transactions for local database invariants.
- Use idempotency keys for retry-safe commands.
- Use outbox patterns for reliable event publishing.
- Use version numbers, ETags, or concurrency tokens for conflict detection.
- Communicate eventual consistency clearly in the UI when needed.
- Design compensation workflows for distributed operations that cannot use one transaction.
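The idempotency key practice above can be illustrated with an in-memory sketch. IdempotentPaymentHandler and its members are hypothetical. A real implementation would persist the key and the result durably, ideally in the same transaction as the business change, instead of holding them in process memory.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public sealed class IdempotentPaymentHandler
{
    // Maps each idempotency key to the result of the first execution.
    private readonly ConcurrentDictionary<string, Lazy<string>> _results = new();
    private int _chargesExecuted;

    public int ChargesExecuted => _chargesExecuted;

    // A retried command with the same key returns the original result
    // instead of charging again.
    public string Charge(string idempotencyKey, decimal amount)
    {
        // Lazy<T> ensures the charge runs at most once per key, even if
        // two threads race on GetOrAdd for the same key.
        Lazy<string> result = _results.GetOrAdd(
            idempotencyKey,
            _ => new Lazy<string>(() => ExecuteCharge(amount)));
        return result.Value;
    }

    // Stand-in for the real side effect, such as calling a payment provider.
    private string ExecuteCharge(decimal amount)
    {
        Interlocked.Increment(ref _chargesExecuted);
        return $"payment-{Guid.NewGuid():N}";
    }
}
```

With this in place, a client that times out and retries with the same key cannot cause a double payment, which makes retries safe to use on the command path.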
Availability vs Consistency
In distributed systems, availability and consistency often compete during network partitions, replication lag, or dependency failures.
Example trade-off:
Scenario: Product inventory service is unavailable.
Option A:
Reject checkout to avoid selling unavailable inventory.
- Better consistency
- Lower availability
Option B:
Accept orders and reconcile inventory later.
- Better availability
- Weaker consistency
- Requires compensation if inventory is insufficient
The right answer depends on business rules. A banking system may reject operations rather than show stale balances. A social feed may accept eventual consistency to stay available and responsive.
Interviewers often test whether candidates can reason about this instead of blindly choosing "strong consistency" or "eventual consistency" everywhere.
Cost Targets
Cost targets define acceptable spending for building, running, scaling, and operating the system.
Cost can be expressed as:
- Monthly infrastructure budget
- Cost per request
- Cost per transaction
- Cost per customer
- Cost per GB stored
- Cost per GB transferred
- Cost per report generated
- Cost per message processed
- Engineering and operational cost
Example cost target:
The file processing platform must process 10 million files per month
with total cloud cost below $8,000 per month
and average processing cost below $0.0008 per file.
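A cost target like this is easy to check mechanically once monthly spend and volume are known. The sketch below uses a hypothetical UnitCost helper and treats both limits as strict "below" thresholds, as the example target is worded.

```csharp
using System;

public static class UnitCost
{
    // Average cost per processed unit (file, request, order, and so on).
    public static decimal PerUnit(decimal monthlyCost, long unitsPerMonth)
    {
        if (unitsPerMonth <= 0)
            throw new ArgumentOutOfRangeException(nameof(unitsPerMonth));
        return monthlyCost / unitsPerMonth;
    }

    // True only when both the total budget and the unit-cost target are met.
    public static bool MeetsTargets(
        decimal monthlyCost,
        long unitsPerMonth,
        decimal monthlyBudget,
        decimal maxUnitCost) =>
        monthlyCost < monthlyBudget && PerUnit(monthlyCost, unitsPerMonth) < maxUnitCost;
}
```

Note that $8,000 spread over 10 million files is exactly $0.0008 per file, so the two limits in the example describe the same boundary. Spending $7,500 for the same volume works out to $0.00075 per file and meets both.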
Cost matters because many architecture choices improve performance or availability by spending more money. Good architecture balances user experience, reliability, correctness, and budget.
Cost drivers:
- Always-on compute
- Over-provisioned instances
- Expensive database tiers
- Cross-region replication
- Data transfer between regions
- Large log volume
- Inefficient queries
- Unbounded queue processing
- High cache memory size
- Excessive retries
- Excessive storage retention
- Large payloads and frequent polling
Cost optimization techniques:
- Autoscaling based on realistic metrics
- Right-sizing compute and database resources
- Reserved capacity or savings plans for predictable workloads
- Serverless or consumption-based services for bursty workloads
- Caching to reduce expensive reads
- Data lifecycle policies
- Compression and efficient payload formats
- Tiered storage
- Queue-based load leveling
- Removing unused resources
- Monitoring unit cost over time
Important trade-off:
The cheapest system is not always the best system.
The most available system is not always worth the cost.
A good design justifies cost based on business value and risk.
Common System Trade-Offs
System design is often about choosing the best trade-off, not finding a perfect solution.
Examples:
- Adding a cache can improve latency, throughput, and cost, but introduces invalidation and stale data risks.
- Adding a queue can improve availability and absorb bursts, but increases end-to-end latency.
- Using read replicas can improve read throughput, but introduces replication lag.
- Using strong distributed transactions can improve consistency, but reduces scalability and increases latency.
- Deploying multi-region improves availability, but increases cost and consistency complexity.
- Limiting concurrency protects dependencies, but may reduce throughput under load.
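The cache trade-off in the first example can be made visible with a small time-to-live (TTL) cache. This is a sketch under assumptions: TtlCache is a hypothetical class, not a replacement for a real caching library, and the clock is injected so the staleness window can be tested.

```csharp
using System;
using System.Collections.Concurrent;

public sealed class TtlCache<TKey, TValue> where TKey : notnull
{
    private readonly ConcurrentDictionary<TKey, (TValue Value, DateTimeOffset ExpiresAt)> _entries = new();
    private readonly TimeSpan _ttl;
    private readonly Func<DateTimeOffset> _clock;

    public TtlCache(TimeSpan ttl, Func<DateTimeOffset> clock)
    {
        _ttl = ttl;
        _clock = clock;
    }

    // Returns the cached value while it is fresh; otherwise reloads from the source.
    // Readers may see data that is up to one TTL old.
    public TValue GetOrLoad(TKey key, Func<TKey, TValue> loadFromSource)
    {
        DateTimeOffset now = _clock();
        if (_entries.TryGetValue(key, out var entry) && entry.ExpiresAt > now)
            return entry.Value;

        TValue fresh = loadFromSource(key);
        _entries[key] = (fresh, now + _ttl);
        return fresh;
    }

    // Explicit invalidation for writes that must become visible immediately.
    public void Invalidate(TKey key) => _entries.TryRemove(key, out _);
}
```

The TTL is effectively a bounded-staleness setting: a longer TTL saves more reads from the source but widens the window in which callers can see old data.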
Architecture Decisions Driven by Targets
Different targets lead to different architecture choices.
For high throughput:
- Horizontal scaling
- Stateless application instances
- Partitioning and sharding
- Queue-based processing
- Efficient database indexes
- Batch processing
- Caching
- Avoiding unnecessary synchronous dependencies
For low latency:
- Fewer network hops
- Local or regional data placement
- Optimized queries
- Read models
- Caching
- Smaller payloads
- Parallel independent calls
- Avoiding cold starts for critical paths
For high concurrency:
- Async I/O
- Bounded concurrency
- Connection pool tuning
- Backpressure
- Bulkheads
- Avoiding shared locks
- Efficient memory usage
- Separating CPU-bound and I/O-bound workloads
For high availability:
- Redundancy
- Health checks
- Failover
- Multi-zone or multi-region deployment
- Graceful degradation
- Retry and circuit breaker policies
- Disaster recovery testing
- Operational runbooks
For stronger consistency:
- Database transactions
- Unique constraints
- Concurrency tokens
- Idempotency keys
- Single-writer patterns
- Sagas with compensation
- Outbox/inbox patterns
- Careful cache invalidation
For cost control:
- Autoscaling
- Right-sizing
- Serverless for bursty workloads
- Storage lifecycle policies
- Reducing log noise
- Efficient queries
- Avoiding over-replication
- Tracking unit cost
Measuring and Validating Targets
Targets are only useful if they can be measured.
Important validation techniques:
- Load testing
- Stress testing
- Spike testing
- Soak testing
- Chaos testing
- Synthetic monitoring
- Real-user monitoring
- Distributed tracing
- Metrics dashboards
- Alerting based on SLOs
- Cost monitoring
- Capacity reviews
Example measurement plan:
Checkout API targets:
- Throughput: 2,000 RPS for 30 minutes
- Latency: p95 below 300 ms and p99 below 1 second
- Error rate: below 0.1%
- Availability: 99.95% monthly SLO
- Consistency: no double payment and no negative inventory
- Cost: below $0.002 per successful checkout
Validation:
- Run load test before release
- Monitor p95/p99 latency per endpoint
- Track payment failure and duplicate charge metrics
- Alert when error budget burn rate is too high
- Review cost per checkout weekly
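The burn-rate alert in the validation list can be expressed as a ratio: the observed error rate divided by the error rate the SLO allows. A burn rate of 1 spends the budget exactly over the SLO period, and higher values spend it faster. The sketch below is illustrative: BurnRate is a hypothetical helper, and the alert threshold is an assumption that teams tune per service and time window.

```csharp
using System;

public static class BurnRate
{
    // Observed error rate relative to the error rate allowed by the SLO.
    public static decimal Calculate(long failedRequests, long totalRequests, decimal sloPercent)
    {
        if (sloPercent <= 0m || sloPercent >= 100m)
            throw new ArgumentOutOfRangeException(nameof(sloPercent));
        if (totalRequests <= 0)
            return 0m; // No traffic in the window means no budget was spent.

        decimal observedErrorRate = (decimal)failedRequests / totalRequests;
        decimal allowedErrorRate = 1m - sloPercent / 100m;
        return observedErrorRate / allowedErrorRate;
    }

    // The threshold is a tunable assumption, not a universal constant.
    public static bool ShouldAlert(long failedRequests, long totalRequests, decimal sloPercent, decimal threshold) =>
        Calculate(failedRequests, totalRequests, sloPercent) >= threshold;
}
```

For the 99.95% checkout SLO, 50 failures out of 10,000 requests in a window is an error rate of 0.5% against an allowed 0.05%, which is a burn rate of 10.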
Best practices:
- Measure business flows, not only infrastructure metrics.
- Use percentiles instead of averages for latency.
- Track saturation before failure happens.
- Validate failure behavior, not only happy-path load.
- Define alerts based on user impact.
- Keep targets visible in architecture documents and runbooks.
Common Mistakes
Common mistakes in interviews and real projects include:
- Saying "fast" instead of specifying latency targets.
- Saying "highly available" without an uptime target or failure model.
- Designing for total registered users instead of peak request rate.
- Ignoring p95 and p99 latency.
- Treating all data as requiring strong consistency.
- Using caching without an invalidation strategy.
- Adding retries without timeouts or circuit breakers.
- Allowing unlimited concurrency.
- Scaling the application tier while ignoring the database bottleneck.
- Ignoring cost until after the architecture is already too expensive.
- Over-engineering multi-region architecture for a low-risk internal tool.
- Under-engineering reliability for a revenue-critical workflow.
- Failing to define RTO and RPO.
- Not testing with production-like traffic.
- Treating SLOs as purely operational instead of architectural requirements.
Best Practices for Interviews
A strong interview answer should:
- Clarify the business-critical flows.
- Estimate traffic using assumptions.
- Separate average, peak, and burst load.
- Define latency using p95 and p99, not only average.
- Define availability targets per critical workflow.
- Identify which data needs strong consistency and which can be eventual.
- Discuss cost as a design constraint.
- Explain trade-offs clearly.
- Propose architecture choices that match the targets.
- Explain how the targets will be tested and monitored.
A practical answer pattern:
For this system, I would first identify the critical user journeys.
Then I would estimate average and peak throughput, expected concurrency, and latency targets.
For consistency, I would classify each operation as strong or eventual based on business risk.
For availability, I would define an SLO and decide what can degrade during failure.
For cost, I would define a monthly budget or unit-cost target.
After that, I would choose architecture patterns such as caching, queues, partitioning,
replication, autoscaling, and failover based on those targets.
Finally, I would validate the design using load tests, failure tests, observability, and cost monitoring.