Throughput, latency, concurrency, availability, consistency, and cost targets Interview Questions

Overview

Throughput, latency, concurrency, availability, consistency, and cost targets are measurable quality requirements used to describe how a system should behave under real-world production conditions. They turn vague goals such as "make it fast", "support many users", "never go down", or "keep it affordable" into concrete engineering targets that influence architecture, infrastructure, database design, testing, monitoring, and operational decisions.

In system design and architecture interviews, these targets are important because they show whether a candidate can move beyond feature requirements and reason about real production behavior. A system is judged not only by whether it works functionally. It must also handle expected traffic, respond within acceptable time, survive failures, preserve the right level of data correctness, and stay within cost limits.

These targets are commonly used when designing APIs, microservices, background job systems, distributed databases, payment flows, reporting systems, file transfer platforms, messaging pipelines, and cloud-hosted applications. They influence decisions such as whether to use caching, queues, partitioning, read replicas, autoscaling, circuit breakers, retries, multi-region deployment, eventual consistency, or stronger transactional guarantees.

A strong interview answer usually starts by clarifying the business flow, user expectations, traffic shape, data criticality, failure tolerance, and budget constraints. Then the answer converts those requirements into measurable targets, explains trade-offs, and proposes how the system will be tested and monitored.

Core Concepts

Quality Targets and Requirement Decomposition

Requirement decomposition means breaking a high-level business goal into measurable technical targets.

A vague requirement:

Code

The system must handle high traffic and be reliable.

A decomposed requirement:

Code

Feature: Submit order

Traffic:
- Average: 500 requests per second
- Peak: 2,000 requests per second during campaign events
- Burst: 5,000 requests per second for up to 5 minutes

Latency:
- p95 response time below 300 ms
- p99 response time below 1 second

Availability:
- 99.95% monthly availability for order submission
- Graceful degradation for recommendation and analytics features

Consistency:
- Payment and inventory reservation must be strongly consistent
- Order history can be eventually consistent within 30 seconds

Cost:
- Monthly infrastructure budget below $20,000
- Unit cost below $0.002 per order request

This style of decomposition helps architects choose appropriate technologies and helps teams validate whether the system is meeting expectations.

Important terms:

Functional requirement: What the system does, such as "create an order" or "upload a file".
Non-functional requirement: How well the system performs, such as speed, reliability, security, scalability, or cost efficiency.
SLI: Service Level Indicator, a measured metric such as request success rate or p95 latency.
SLO: Service Level Objective, the internal target for an SLI.
SLA: Service Level Agreement, an external promise that may include contractual consequences.
Error budget: The allowed amount of unreliability over a period, derived from the SLO as 1 minus the target. For example, a 99.9% SLO leaves a 0.1% error budget.
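
The relationship between an SLO and its error budget can be made concrete with a little arithmetic. The sketch below is illustrative only: the ErrorBudget class, its method names, and the request volume are assumptions, not part of any library. It shows how many failed requests a success-rate SLO allows in a window, and how fast a given error rate burns through that allowance.

```csharp
using System;

// Hypothetical numbers: a 99.9% success-rate SLO over a window of 100 million requests.
long allowed = ErrorBudget.AllowedFailures(slo: 0.999m, totalRequests: 100_000_000);
decimal burn = ErrorBudget.BurnRate(slo: 0.999m, observedErrorRate: 0.005m);
Console.WriteLine($"Allowed failures: {allowed}, burn rate: {burn}x");

public static class ErrorBudget
{
    // The budget is whatever the SLO leaves over: 1 - SLO.
    public static decimal BudgetFraction(decimal slo) => 1m - slo;

    // Number of requests that may fail in the window without breaching the SLO.
    public static long AllowedFailures(decimal slo, long totalRequests) =>
        (long)Math.Floor(BudgetFraction(slo) * totalRequests);

    // A burn rate of 1 spends the budget exactly over the window;
    // a burn rate of 5 spends it five times faster.
    public static decimal BurnRate(decimal slo, decimal observedErrorRate) =>
        observedErrorRate / BudgetFraction(slo);
}
```

With these inputs, the window tolerates 100,000 failed requests, and a sustained 0.5% error rate consumes the budget five times faster than the SLO allows. Decimal arithmetic is used here so that small fractions such as 0.001 are represented exactly.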

Throughput

Throughput measures how much work a system completes in a period of time.

Common throughput units:

Requests per second
Transactions per second
Messages per second
Jobs per minute
Files processed per hour
Database writes per second
Events ingested per second

Throughput is not the same as the number of users. A system may have one million registered users but only a few thousand active users at the same time. Interviewers often expect candidates to distinguish between total users, daily active users, peak concurrent users, and actual request rate.

Example:

Code

Assumptions:
- 1,000,000 daily active users
- Each user makes 20 API calls per day
- Traffic is concentrated into 8 active hours
- Peak traffic is 5 times the average

Average requests per second:
1,000,000 * 20 / (8 * 60 * 60) = about 694 RPS

Peak requests per second:
694 * 5 = about 3,470 RPS

Throughput matters because it drives capacity planning. It influences the number of application instances, database capacity, queue throughput, cache size, partitioning strategy, network bandwidth, and autoscaling rules.
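
The back-of-the-envelope estimate above can be written as a small helper so the assumptions are explicit and easy to change. This is a minimal sketch: TrafficEstimate and its parameter names are hypothetical, and the inputs are the same assumptions used in the example.

```csharp
using System;

var estimate = TrafficEstimate.FromDailyUsers(
    dailyActiveUsers: 1_000_000,
    callsPerUserPerDay: 20,
    activeHoursPerDay: 8,
    peakFactor: 5);

Console.WriteLine($"Average: {estimate.AverageRps:F0} RPS, peak: {estimate.PeakRps:F0} RPS");

public sealed record TrafficEstimate(double AverageRps, double PeakRps)
{
    public static TrafficEstimate FromDailyUsers(
        long dailyActiveUsers, double callsPerUserPerDay, double activeHoursPerDay, double peakFactor)
    {
        // Spread the daily call volume over the hours in which users are actually active.
        double activeSeconds = activeHoursPerDay * 3600;
        double average = dailyActiveUsers * callsPerUserPerDay / activeSeconds;

        // Peak is modeled as a simple multiple of the average.
        return new TrafficEstimate(average, average * peakFactor);
    }
}
```

The helper yields roughly 694 RPS on average and about 3,472 RPS at peak. The small difference from 3,470 comes from multiplying the unrounded average.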

Common throughput bottlenecks include:

Database locks or slow queries
Thread pool starvation
Synchronous I/O
Insufficient connection pool size
Hot partitions
Slow downstream services
Excessive serialization or large payloads
CPU-heavy transformations
Unbounded retries during failures

Best practices:

Separate average, peak, and burst throughput.
Measure throughput per critical workflow, not only per application.
Design for backpressure when demand exceeds capacity.
Use queues for asynchronous workloads that do not need immediate completion.
Avoid scaling only the web tier if the database or downstream dependency is the real bottleneck.
Track throughput together with latency and error rate.

Latency

Latency measures how long one operation takes from the user's or caller's perspective. In APIs, latency is often measured as request-response time. In event systems, it may mean end-to-end processing delay from event creation to completion.

Important latency metrics:

Average latency: The mean response time. Useful, but can hide slow outliers.
Median latency / p50: Half of requests are faster than this value.
p95 latency: 95% of requests are faster than this value.
p99 latency: 99% of requests are faster than this value.
Tail latency: High-percentile latency such as p95, p99, or p99.9.
End-to-end latency: Total time across client, network, API, dependencies, database, and response serialization.
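
To see why averages can hide slow outliers, it helps to compute the metrics side by side. The sketch below uses the nearest-rank method, which is one of several common percentile definitions; the Latency class and the sample data are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample: 98 fast requests and 2 slow outliers, in milliseconds.
double[] samples = Enumerable.Repeat(50.0, 98).Concat(new[] { 2000.0, 3000.0 }).ToArray();

Console.WriteLine($"mean = {samples.Average():F0} ms");
Console.WriteLine($"p50  = {Latency.Percentile(samples, 50):F0} ms");
Console.WriteLine($"p95  = {Latency.Percentile(samples, 95):F0} ms");
Console.WriteLine($"p99  = {Latency.Percentile(samples, 99):F0} ms");

public static class Latency
{
    // Nearest-rank percentile: the smallest sample such that at least
    // `percentile` percent of all samples are less than or equal to it.
    public static double Percentile(IEnumerable<double> samples, double percentile)
    {
        double[] sorted = samples.OrderBy(x => x).ToArray();
        if (sorted.Length == 0)
            throw new ArgumentException("At least one sample is required.", nameof(samples));
        if (percentile <= 0 || percentile > 100)
            throw new ArgumentOutOfRangeException(nameof(percentile));

        int rank = (int)Math.Ceiling(percentile * sorted.Length / 100.0);
        return sorted[rank - 1];
    }
}
```

For this sample the mean is 99 ms, p50 and p95 are both 50 ms, and p99 is 2000 ms. The mean suggests a uniformly mediocre service, while the percentiles show that most requests are fast and a small tail is very slow.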

Example latency target:

Code

For the product search API:
- p50 below 80 ms
- p95 below 250 ms
- p99 below 800 ms
- timeout at 2 seconds

Latency matters because users experience delay directly. A system can have high throughput but still feel slow if each request waits too long. High latency can also reduce throughput because resources remain occupied longer.

Common latency causes:

Slow database queries
Cold starts
Large payloads
Chatty service-to-service calls
Blocking calls in async code
Lock contention
Cache misses
Retry storms
Cross-region network calls
Garbage collection pressure
Inefficient serialization

Best practices:

Define percentile-based latency targets, not only averages.
Track latency per endpoint or business flow.
Use timeouts, cancellation, and circuit breakers.
Avoid unnecessary sequential calls when independent calls can run concurrently.
Use caching carefully for read-heavy workloads.
Reduce payload size and avoid over-fetching.
Treat p95 and p99 latency as first-class production metrics.
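
Two of the practices above, timeouts with cancellation and running independent calls concurrently, can be sketched together. In this example ProductPage and FakeCallAsync are hypothetical stand-ins for real downstream calls, and the delays are arbitrary.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// One overall deadline for the whole operation: the token is cancelled after 2 seconds.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(2));

var stopwatch = Stopwatch.StartNew();
string page = await ProductPage.LoadAsync(cts.Token);
Console.WriteLine($"Loaded '{page}' in about {stopwatch.ElapsedMilliseconds} ms");

public static class ProductPage
{
    public static async Task<string> LoadAsync(CancellationToken cancellationToken)
    {
        // Start both calls before awaiting either one, so they overlap.
        Task<string> details = FakeCallAsync("details", delayMs: 100, cancellationToken);
        Task<string> reviews = FakeCallAsync("reviews", delayMs: 150, cancellationToken);

        await Task.WhenAll(details, reviews);
        return $"{await details} + {await reviews}";
    }

    // Hypothetical downstream call, simulated with a delay that honors cancellation.
    private static async Task<string> FakeCallAsync(string name, int delayMs, CancellationToken cancellationToken)
    {
        await Task.Delay(delayMs, cancellationToken);
        return name;
    }
}
```

Because the two calls overlap, the total time is close to the slower call (about 150 ms) rather than the sum (about 250 ms). If the deadline expires first, the pending Task.Delay throws an OperationCanceledException instead of leaving the caller waiting.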

Throughput vs Latency

Throughput and latency are related but different.

Throughput asks: "How many operations can the system complete per second?"
Latency asks: "How long does one operation take?"

A system can have:

High throughput and low latency: ideal but requires efficient design.
High throughput and high latency: system processes many requests, but users wait.
Low throughput and low latency: system is fast for a small load but does not scale.
Low throughput and high latency: system is both slow and capacity-limited.

Trade-off examples:

Batching can improve throughput but increase latency for individual requests.
Caching can reduce latency and improve throughput but may introduce stale data.
Strong consistency can improve correctness but may increase latency.
Adding replicas can improve read throughput but may complicate consistency.
Increasing concurrency can improve throughput until the system becomes saturated, after which latency and error rates increase.

Interview habit:

Code

Do not say "the system should be fast."
Say "the checkout API should keep p95 latency below 300 ms at 2,000 RPS with an error rate below 0.1%."

Concurrency

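Concurrency describes how many operations are in progress at the same time. It is related to throughput but not identical to it: throughput counts operations completed per unit of time, while concurrency counts operations in flight at a given instant.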

Common types of concurrency:

Concurrent users
Concurrent requests
Concurrent database connections
Concurrent background jobs
Concurrent message consumers
Concurrent file uploads
Concurrent transactions
Concurrent threads or tasks

A useful approximation, known as Little's Law, is:

Code

Concurrency ≈ Throughput × Latency

Example:

Code

If an API handles 1,000 requests per second
and average request latency is 200 ms:

Concurrency ≈ 1,000 × 0.2 = 200 active requests

This is useful for estimating connection pools, thread usage, memory pressure, queue consumers, and instance count.
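
As a rough illustration of how the approximation above feeds into sizing, the sketch below turns throughput and latency into an in-flight request count and then into an instance count. The Capacity class, the per-instance limit of 50, and the 30% headroom are assumptions chosen for the example, not recommendations.

```csharp
using System;

int inFlight = Capacity.ConcurrentRequests(throughputPerSecond: 1_000, latencyMs: 200);
int instances = Capacity.InstancesNeeded(inFlight, perInstanceLimit: 50, headroomPercent: 30);
Console.WriteLine($"In flight: {inFlight}, instances: {instances}");

public static class Capacity
{
    // Concurrency ≈ throughput × latency, with latency converted from milliseconds to seconds.
    public static int ConcurrentRequests(int throughputPerSecond, int latencyMs) =>
        (int)Math.Ceiling(throughputPerSecond * (double)latencyMs / 1000.0);

    // Divide the in-flight count (plus headroom) by what one instance can safely hold.
    public static int InstancesNeeded(int concurrentRequests, int perInstanceLimit, int headroomPercent) =>
        (int)Math.Ceiling(concurrentRequests * (100.0 + headroomPercent) / 100.0 / perInstanceLimit);
}
```

With these inputs the estimate is 200 in-flight requests and 6 instances. The same calculation applies to connection pools or queue consumers by swapping in the relevant per-unit limit.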

Concurrency mistakes:

Confusing registered users with concurrent users.
Confusing concurrent users with concurrent requests.
Allowing unlimited parallel tasks.
Increasing concurrency without checking database capacity.
Forgetting connection pool limits.
Using locks around slow I/O.
Creating too many threads for I/O-bound workloads.
Running CPU-bound work on the request path without throttling.

Best practices:

Limit concurrency at the correct boundary.
Use SemaphoreSlim, bounded channels, queue consumers, or rate limiters when needed.
Use async I/O for I/O-bound work.
Use worker pools for background processing.
Monitor saturation metrics such as CPU, memory, thread pool queue length, queue depth, connection pool usage, and database waits.
Apply backpressure instead of letting the system collapse under unlimited load.

Example C# concurrency limit:

Code

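using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public sealed class ReportProcessor
{
    // At most 10 reports are processed at the same time.
    private readonly SemaphoreSlim _semaphore = new(initialCount: 10);

    public async Task ProcessAsync(IEnumerable<ReportJob> jobs, CancellationToken cancellationToken)
    {
        // Every job gets a task, but each task waits for a free slot before doing work.
        var tasks = jobs.Select(async job =>
        {
            // Acquire outside the try block so a cancelled wait never releases a slot it did not take.
            await _semaphore.WaitAsync(cancellationToken);

            try
            {
                await ProcessOneReportAsync(job, cancellationToken);
            }
            finally
            {
                // Always free the slot, even if processing fails or is cancelled.
                _semaphore.Release();
            }
        });

        await Task.WhenAll(tasks);
    }

    // Placeholder for the real work: simulates 200 ms of I/O.
    private static Task ProcessOneReportAsync(ReportJob job, CancellationToken cancellationToken)
    {
        return Task.Delay(TimeSpan.FromMilliseconds(200), cancellationToken);
    }
}

public sealed record ReportJob(Guid Id);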

The important point is not only that the code runs tasks concurrently, but that concurrency is intentionally bounded.
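
Backpressure, mentioned in the best practices above, can be sketched with a bounded channel from System.Threading.Channels. In this minimal example the producer is slowed to the consumer's pace once the buffer is full. BoundedPipeline, the capacity of 5, and the 5 ms of simulated work are assumptions for illustration.

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

int processed = await BoundedPipeline.RunAsync(itemCount: 20, capacity: 5);
Console.WriteLine($"Processed {processed} items");

public static class BoundedPipeline
{
    public static async Task<int> RunAsync(int itemCount, int capacity)
    {
        var channel = Channel.CreateBounded<int>(new BoundedChannelOptions(capacity)
        {
            // When the buffer is full, writers wait instead of dropping or growing the buffer.
            FullMode = BoundedChannelFullMode.Wait
        });

        // Slow consumer: drains the channel until the writer signals completion.
        Task<int> consumer = Task.Run(async () =>
        {
            int count = 0;
            await foreach (int item in channel.Reader.ReadAllAsync())
            {
                await Task.Delay(5); // simulated work per item
                count++;
            }
            return count;
        });

        // Fast producer: WriteAsync completes only when there is room in the buffer.
        for (int i = 0; i < itemCount; i++)
        {
            await channel.Writer.WriteAsync(i);
        }

        channel.Writer.Complete();
        return await consumer;
    }
}
```

All 20 items are processed, but never more than 5 are buffered at once, so memory use stays bounded even when the producer is much faster than the consumer.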

Availability

Availability measures whether a system is usable when users need it. It is usually expressed as a percentage over a time period.

Common availability examples:

Code

99.0% monthly availability   = about 7.3 hours downtime per month
99.9% monthly availability   = about 43.8 minutes downtime per month
99.95% monthly availability  = about 21.9 minutes downtime per month
99.99% monthly availability  = about 4.4 minutes downtime per month
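
Downtime allowances like these follow directly from the availability percentage and the length of the period. The sketch below assumes an average month of 730.5 hours (365.25 days divided by 12); the Downtime class is a hypothetical helper. Results land close to the figures above, and the exact values shift slightly with the month length assumed.

```csharp
using System;

foreach (double availability in new[] { 99.0, 99.9, 99.95, 99.99 })
{
    TimeSpan allowed = Downtime.AllowedPerMonth(availability);
    Console.WriteLine($"{availability}% allows {allowed.TotalMinutes:F1} minutes of downtime per month");
}

public static class Downtime
{
    // Assumption: an average month of 365.25 / 12 days, which is 730.5 hours.
    private static readonly TimeSpan AverageMonth = TimeSpan.FromHours(730.5);

    public static TimeSpan AllowedPerMonth(double availabilityPercent)
    {
        if (availabilityPercent < 0 || availabilityPercent > 100)
            throw new ArgumentOutOfRangeException(nameof(availabilityPercent));

        double unavailableFraction = (100.0 - availabilityPercent) / 100.0;
        return TimeSpan.FromMinutes(AverageMonth.TotalMinutes * unavailableFraction);
    }
}
```

Each additional "nine" cuts the allowance by a factor of ten, which is why the step from 99.9% to 99.99% usually requires redundancy and automated failover rather than faster manual recovery.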

Availability is not only about server uptime. A service may be "up" but unusable if:

Error rates are high.
Latency is extreme.
Database writes fail.
Authentication is broken.
A critical dependency is unavailable.
The UI loads but checkout cannot complete.

Related concepts:

Reliability: The ability to perform correctly over time.
Resiliency: The ability to recover from failures.
Fault tolerance: The ability to continue operating despite component failures.
RTO: Recovery Time Objective, how quickly the system must recover.
RPO: Recovery Point Objective, how much data loss is acceptable.
Graceful degradation: Keeping critical features available while non-critical features are disabled or reduced.

Availability design patterns:

Health checks
Load balancing
Multiple application instances
Database replication
Zone redundancy
Multi-region deployment
Circuit breakers
Retries with exponential backoff and jitter
Timeouts
Bulkheads
Queues for temporary buffering
Read-only fallback mode
Cache fallback for non-critical reads
Disaster recovery plans
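
Of the patterns listed above, retries with exponential backoff and jitter are easy to get subtly wrong, so a small sketch may help. The version below uses "full jitter" (a random delay between zero and an exponentially growing cap), which is one common variant. The Retry class and the choice to treat only TimeoutException as transient are assumptions for illustration, and Random.Shared requires .NET 6 or later.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

int calls = 0;
string result = await Retry.ExecuteAsync(
    operation: async ct =>
    {
        calls++;
        await Task.Delay(1, ct);
        if (calls < 3) throw new TimeoutException("simulated transient failure");
        return "ok";
    },
    maxAttempts: 5,
    baseDelay: TimeSpan.FromMilliseconds(20),
    maxDelay: TimeSpan.FromSeconds(1),
    cancellationToken: CancellationToken.None);

Console.WriteLine($"{result} after {calls} calls"); // prints "ok after 3 calls"

public static class Retry
{
    public static async Task<T> ExecuteAsync<T>(
        Func<CancellationToken, Task<T>> operation,
        int maxAttempts, TimeSpan baseDelay, TimeSpan maxDelay,
        CancellationToken cancellationToken)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation(cancellationToken);
            }
            catch (Exception ex) when (attempt < maxAttempts && IsTransient(ex))
            {
                // The cap doubles with each attempt and is limited by maxDelay.
                double capMs = Math.Min(
                    maxDelay.TotalMilliseconds,
                    baseDelay.TotalMilliseconds * Math.Pow(2, attempt - 1));

                // Full jitter: a uniformly random delay in [0, cap) spreads out retrying clients.
                double delayMs = Random.Shared.NextDouble() * capMs;
                await Task.Delay(TimeSpan.FromMilliseconds(delayMs), cancellationToken);
            }
        }
    }

    // Assumption for this sketch: only timeouts are worth retrying.
    private static bool IsTransient(Exception ex) => ex is TimeoutException;
}
```

The attempt limit matters as much as the backoff: once the last attempt fails, the exception propagates to the caller instead of being retried indefinitely, which keeps a struggling dependency from being flooded.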

Trade-offs:

Higher availability usually increases cost and operational complexity.
Multi-region systems improve resilience but complicate data consistency.
Aggressive retries can improve availability during transient failures but can also overload dependencies.
Graceful degradation requires product decisions about which features are critical.

Consistency

Consistency describes how correct and up to date data appears across reads, writes, replicas, caches, and distributed services.

Common consistency models:

Strong consistency: A read returns the latest committed write.
Eventual consistency: Replicas or read models become consistent after some delay.
Read-your-writes consistency: A user sees their own updates immediately.
Monotonic reads: A user does not see data move backward in time.
Bounded staleness: Reads may be stale, but only within a known time or version limit.
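Session consistency: Within a single client session, reads reflect that session's earlier writes and do not move backward in time; other sessions may still see older data.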
Transactional consistency: A group of changes succeeds or fails as a unit.

Consistency matters because not all data has the same correctness requirements.

Examples:

Code

Strong consistency usually needed:
- Payment capture
- Bank balance update
- Inventory reservation
- Password change
- Authorization policy update

Eventual consistency often acceptable:
- Analytics dashboard
- Search index
- Email notification status
- Recommendation list
- Activity feed
- Reporting read model

Common architecture examples:

A write database is strongly consistent, while read replicas may lag.
A cache improves read performance but can return stale data.
A message-driven workflow improves resilience but introduces eventual consistency.
A search index may lag behind the source database.
A CQRS read model may be temporarily behind the write model.

Best practices:

Define consistency requirements per business operation.
Avoid applying strong consistency everywhere by default.
Use transactions for local database invariants.
Use idempotency keys for retry-safe commands.
Use outbox patterns for reliable event publishing.
Use version numbers, ETags, or concurrency tokens for conflict detection.
Communicate eventual consistency clearly in the UI when needed.
Design compensation workflows for distributed operations that cannot use one transaction.
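
To illustrate the idempotency-key practice above, the sketch below makes a retried command return the original result instead of executing twice. It is an in-memory illustration only: PaymentService and PaymentResult are hypothetical, and a production system would typically persist the keys, for example behind a unique constraint in the database.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

var payments = new PaymentService();
string key = "order-123-payment"; // hypothetical client-generated idempotency key

PaymentResult first = payments.Charge(key, amountCents: 4_999);
PaymentResult retry = payments.Charge(key, amountCents: 4_999); // for example, after a client timeout

Console.WriteLine($"Same result: {first == retry}, charges executed: {payments.ChargesExecuted}");

public sealed record PaymentResult(Guid PaymentId, long AmountCents);

public sealed class PaymentService
{
    // In-memory stand-in for a durable idempotency store.
    private readonly ConcurrentDictionary<string, Lazy<PaymentResult>> _results = new();
    private int _chargesExecuted;

    public int ChargesExecuted => _chargesExecuted;

    public PaymentResult Charge(string idempotencyKey, long amountCents)
    {
        // GetOrAdd may invoke the factory more than once under contention, so the
        // side effect lives inside a Lazy, which runs it at most once per key.
        Lazy<PaymentResult> lazy = _results.GetOrAdd(idempotencyKey, _ => new Lazy<PaymentResult>(() =>
        {
            Interlocked.Increment(ref _chargesExecuted);
            return new PaymentResult(Guid.NewGuid(), amountCents);
        }));

        return lazy.Value;
    }
}
```

Running this prints "Same result: True, charges executed: 1": the retry receives the stored result and no second charge is made. A fuller implementation would also reject a reused key that arrives with a different payload.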

Availability vs Consistency

In distributed systems, availability and consistency often compete during network partitions, replication lag, or dependency failures.

Example trade-off:

Code

Scenario: Product inventory service is unavailable.

Option A:
Reject checkout to avoid selling unavailable inventory.
- Better consistency
- Lower availability

Option B:
Accept orders and reconcile inventory later.
- Better availability
- Weaker consistency
- Requires compensation if inventory is insufficient

The right answer depends on business rules. A banking system may reject operations rather than show stale balances. A social feed may accept eventual consistency to stay available and responsive.

Interviewers often test whether candidates can reason about this instead of blindly choosing "strong consistency" or "eventual consistency" everywhere.

Cost Targets

Cost targets define acceptable spending for building, running, scaling, and operating the system.

Cost can be expressed as:

Monthly infrastructure budget
Cost per request
Cost per transaction
Cost per customer
Cost per GB stored
Cost per GB transferred
Cost per report generated
Cost per message processed
Engineering and operational cost

Example cost target:

Code

The file processing platform must process 10 million files per month
with total cloud cost below $8,000 per month
and average processing cost below $0.0008 per file.

Cost matters because many architecture choices improve performance or availability by spending more money. Good architecture balances user experience, reliability, correctness, and budget.
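
A unit-cost target is simple to check once the monthly spend and volume are known. The sketch below uses the figures from the file processing example; the UnitCost class is a hypothetical helper, and the target is treated as an inclusive ceiling.

```csharp
using System;

decimal perFile = UnitCost.PerUnit(monthlyCost: 8_000m, unitsPerMonth: 10_000_000);
bool withinTarget = UnitCost.MeetsTarget(perFile, targetPerUnit: 0.0008m);
Console.WriteLine($"Cost per file: {perFile} USD, within target: {withinTarget}");

public static class UnitCost
{
    // Decimal is used because money and tiny per-unit fractions should not pick up binary rounding errors.
    public static decimal PerUnit(decimal monthlyCost, long unitsPerMonth)
    {
        if (unitsPerMonth <= 0)
            throw new ArgumentOutOfRangeException(nameof(unitsPerMonth));

        return monthlyCost / unitsPerMonth;
    }

    // Assumption: a cost exactly equal to the target still counts as meeting it.
    public static bool MeetsTarget(decimal actualPerUnit, decimal targetPerUnit) =>
        actualPerUnit <= targetPerUnit;
}
```

At the full $8,000 budget and 10 million files, the cost per file is exactly $0.0008, so the two targets in the example describe the same ceiling. Tracking this number over time shows whether cost grows faster or slower than volume.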

Cost drivers:

Always-on compute
Over-provisioned instances
Expensive database tiers
Cross-region replication
Data transfer between regions
Large log volume
Inefficient queries
Unbounded queue processing
High cache memory size
Excessive retries
Excessive storage retention
Large payloads and frequent polling

Cost optimization techniques:

Autoscaling based on realistic metrics
Right-sizing compute and database resources
Reserved capacity or savings plans for predictable workloads
Serverless or consumption-based services for bursty workloads
Caching to reduce expensive reads
Data lifecycle policies
Compression and efficient payload formats
Tiered storage
Queue-based load leveling
Removing unused resources
Monitoring unit cost over time

Important trade-off:

Code

The cheapest system is not always the best system.
The most available system is not always worth the cost.
A good design justifies cost based on business value and risk.

Common System Trade-Offs

System design is often about choosing the best trade-off, not finding a perfect solution.

Target	Improving It Usually Helps	But Can Hurt
Throughput	More completed work per second	Latency, cost, consistency, downstream stability
Latency	Better user experience	Cost, complexity, cache consistency
Concurrency	More simultaneous work	Memory, CPU, database connections, lock contention
Availability	Fewer user-visible outages	Cost, complexity, consistency
Consistency	More correct and predictable data	Latency, availability, scalability
Cost efficiency	Lower spend and better unit economics	Availability, performance headroom, operational simplicity

Examples:

Adding a cache can improve latency, throughput, and cost, but introduces invalidation and stale data risks.
Adding a queue can improve availability and absorb bursts, but increases end-to-end latency.
Using read replicas can improve read throughput, but introduces replication lag.
Using strong distributed transactions can improve consistency, but reduces scalability and increases latency.
Deploying multi-region improves availability, but increases cost and consistency complexity.
Limiting concurrency protects dependencies, but may reduce throughput under load.

Architecture Decisions Driven by Targets

Different targets lead to different architecture choices.

For high throughput:

Horizontal scaling
Stateless application instances
Partitioning and sharding
Queue-based processing
Efficient database indexes
Batch processing
Caching
Avoiding unnecessary synchronous dependencies

For low latency:

Fewer network hops
Local or regional data placement
Optimized queries
Read models
Caching
Smaller payloads
Parallel independent calls
Avoiding cold starts for critical paths

For high concurrency:

Async I/O
Bounded concurrency
Connection pool tuning
Backpressure
Bulkheads
Avoiding shared locks
Efficient memory usage
Separating CPU-bound and I/O-bound workloads

For high availability:

Redundancy
Health checks
Failover
Multi-zone or multi-region deployment
Graceful degradation
Retry and circuit breaker policies
Disaster recovery testing
Operational runbooks

For stronger consistency:

Database transactions
Unique constraints
Concurrency tokens
Idempotency keys
Single-writer patterns
Sagas with compensation
Outbox/inbox patterns
Careful cache invalidation

For cost control:

Autoscaling
Right-sizing
Serverless for bursty workloads
Storage lifecycle policies
Reducing log noise
Efficient queries
Avoiding over-replication
Tracking unit cost

Measuring and Validating Targets

Targets are only useful if they can be measured.

Important validation techniques:

Load testing
Stress testing
Spike testing
Soak testing
Chaos testing
Synthetic monitoring
Real-user monitoring
Distributed tracing
Metrics dashboards
Alerting based on SLOs
Cost monitoring
Capacity reviews

Example measurement plan:

Code

Checkout API targets:
- Throughput: 2,000 RPS for 30 minutes
- Latency: p95 below 300 ms and p99 below 1 second
- Error rate: below 0.1%
- Availability: 99.95% monthly SLO
- Consistency: no double payment and no negative inventory
- Cost: below $0.002 per successful checkout

Validation:
- Run load test before release
- Monitor p95/p99 latency per endpoint
- Track payment failure and duplicate charge metrics
- Alert when error budget burn rate is too high
- Review cost per checkout weekly

Best practices:

Measure business flows, not only infrastructure metrics.
Use percentiles instead of averages for latency.
Track saturation before failure happens.
Validate failure behavior, not only happy-path load.
Define alerts based on user impact.
Keep targets visible in architecture documents and runbooks.

Common Mistakes

Common mistakes in interviews and real projects include:

Saying "fast" instead of specifying latency targets.
Saying "highly available" without an uptime target or failure model.
Designing for total registered users instead of peak request rate.
Ignoring p95 and p99 latency.
Treating all data as requiring strong consistency.
Using caching without an invalidation strategy.
Adding retries without timeouts or circuit breakers.
Allowing unlimited concurrency.
Scaling the application tier while ignoring the database bottleneck.
Ignoring cost until after the architecture is already too expensive.
Over-engineering multi-region architecture for a low-risk internal tool.
Under-engineering reliability for a revenue-critical workflow.
Failing to define RTO and RPO.
Not testing with production-like traffic.
Treating SLOs as purely operational instead of architectural requirements.

Best Practices for Interviews

A strong interview answer should:

Clarify the business-critical flows.
Estimate traffic using assumptions.
Separate average, peak, and burst load.
Define latency using p95 and p99, not only average.
Define availability targets per critical workflow.
Identify which data needs strong consistency and which can be eventual.
Discuss cost as a design constraint.
Explain trade-offs clearly.
Propose architecture choices that match the targets.
Explain how the targets will be tested and monitored.

A practical answer pattern:

Code

For this system, I would first identify the critical user journeys.
Then I would estimate average and peak throughput, expected concurrency, and latency targets.
For consistency, I would classify each operation as strong or eventual based on business risk.
For availability, I would define an SLO and decide what can degrade during failure.
For cost, I would define a monthly budget or unit-cost target.
After that, I would choose architecture patterns such as caching, queues, partitioning,
replication, autoscaling, and failover based on those targets.
Finally, I would validate the design using load tests, failure tests, observability, and cost monitoring.
