Correlation IDs, traces, health checks, and alertable telemetry

Overview

Observability is the ability to understand a system's internal behavior from the signals it emits. In distributed systems, one user operation can cross HTTP services, message brokers, databases, caches, background workers, and third-party APIs.

Core signals include:

  • Traces: the path and timing of one distributed operation.
  • Metrics: aggregated numerical behavior over time.
  • Logs: discrete contextual records.
  • Health checks: current ability of an instance or subsystem to serve its intended role.

A correlation ID groups related activity under a business or request identifier. Distributed tracing adds structured parent-child relationships through trace and span IDs, commonly propagated using W3C Trace Context.

Telemetry is useful only when it supports:

  • Detection.
  • Diagnosis.
  • Capacity planning.
  • Security investigation.
  • SLO measurement.
  • Tested operational response.

Alertable telemetry must represent sustained user or business impact and have a clear owner and action. Paging on every exception, retry, or health-check transition creates noise and hides real incidents.

This topic matters in interviews because candidates must design end-to-end context propagation, distinguish health probes, control telemetry cost and cardinality, and define actionable alerts rather than merely saying to add logging.

Core Concepts

Observability Versus Monitoring

Monitoring checks known conditions:

Code
Is error rate above 2%?
Is queue age above 5 minutes?

Observability provides enough context to investigate unexpected behavior:

Code
Which dependency and tenant caused latency only for checkout requests?

Monitoring is built on observable signals. Dashboards without semantic context do not create observability.

Correlation ID

A correlation ID identifies related work across components:

Code
X-Correlation-ID: order-checkout-7ef2...

Useful scopes:

  • One inbound request.
  • One business workflow.
  • One saga.
  • One batch job.
  • One message chain.

Do not overload one identifier with every purpose. A business operation ID can remain stable for days, while a trace ID normally represents one execution path.

Correlation ID Safety

Incoming IDs are untrusted input.

Validate:

  • Length.
  • Character set.
  • Header count.
  • Format.

Generate a new value if invalid. Do not place secrets, email addresses, or other personal data in identifiers. Propagate only to trusted destinations and prevent attacker-controlled values from causing log injection or excessive cardinality.
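
The validation rules above can be sketched as a small helper. The allowed character set, the 8 to 64 length bounds, and the `CorrelationId.Accept` name are assumptions for illustration, not a standard; adjust them to your own ID format.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CorrelationId
{
    // Assumed policy: 8-64 characters from a conservative, log-safe set.
    // \A and \z anchor the whole string; unlike $, \z does not tolerate a
    // trailing newline, which matters for log-injection safety.
    private static readonly Regex Allowed =
        new(@"\A[A-Za-z0-9._-]{8,64}\z", RegexOptions.CultureInvariant);

    // Returns the caller's ID only when exactly one header value was sent
    // and it passes validation; otherwise generates a fresh 32-character ID.
    public static string Accept(IReadOnlyList<string?> headerValues)
    {
        if (headerValues.Count == 1
            && headerValues[0] is { } value
            && Allowed.IsMatch(value))
        {
            return value;
        }

        return Guid.NewGuid().ToString("N");
    }
}
```

Replacing an invalid value rather than rejecting the request keeps a malformed header from becoming a client-visible failure while still keeping untrusted text out of logs.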

Distributed Trace

A trace represents one operation as a graph of spans:

Code
HTTP POST /orders                trace
  -> validate request            span
  -> SQL insert                  span
  -> publish message             span
  -> inventory consumer          linked span
  -> inventory database update   span

Each span records:

  • Trace ID.
  • Span ID.
  • Parent or links.
  • Start time and duration.
  • Operation name.
  • Status.
  • Attributes.
  • Events.
  • Resource identity.

Use stable low-cardinality span names such as:

Code
GET /orders/{id}

not:

Code
GET /orders/839104

W3C Trace Context

W3C Trace Context standardizes:

Code
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: vendor=value

traceparent contains:

  • Version.
  • Trace ID.
  • Parent span ID.
  • Trace flags.

tracestate carries optional vendor-specific context.

Use standard propagation rather than inventing incompatible headers. A correlation ID can still exist for business lookup.
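
As a minimal sketch, the header can be parsed with `ActivityContext.TryParse` from `System.Diagnostics` rather than by hand. The `TraceParentExample` wrapper is a hypothetical name used only for illustration.

```csharp
using System;
using System.Diagnostics;

public static class TraceParentExample
{
    // Parses a traceparent (and optional tracestate) header with the
    // built-in W3C parser. Returns null when the header is missing or
    // malformed so the caller can start a new trace instead.
    public static ActivityContext? Parse(string? traceparent, string? tracestate = null) =>
        ActivityContext.TryParse(traceparent, tracestate, out var context)
            ? context
            : (ActivityContext?)null;
}
```

For the sample header above, the parsed context exposes trace ID `4bf92f3577b34da6a3ce929d0e0e4736`, span ID `00f067aa0ba902b7`, and the `Recorded` trace flag.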

Trace Context Across Messaging

For asynchronous messages:

  • Inject trace context into message properties.
  • Extract it at the consumer.
  • Create a consumer or processing span.
  • Use links when one batch combines multiple messages.
  • Preserve business message, correlation, and causation IDs.

Do not make a days-long business workflow one continuously open span. Use separate traces connected by business IDs or span links.
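
The inject and extract steps might look like the following sketch. It assumes message headers are exposed as a string dictionary; the source name `Ordering.Messaging`, the span name `order.process`, and the `MessageTracing` class are hypothetical. Real broker clients and OpenTelemetry instrumentation libraries often do this for you.

```csharp
using System.Collections.Generic;
using System.Diagnostics;

public static class MessageTracing
{
    // Hypothetical source name; it must also be registered with AddSource.
    private static readonly ActivitySource Source = new("Ordering.Messaging");

    // Producer side: copy the current W3C trace context into message headers.
    public static void Inject(IDictionary<string, string> headers)
    {
        var current = Activity.Current;
        if (current is null || current.IdFormat != ActivityIdFormat.W3C) return;

        headers["traceparent"] = current.Id!;
        if (!string.IsNullOrEmpty(current.TraceStateString))
            headers["tracestate"] = current.TraceStateString;
    }

    // Consumer side: restore the remote context and start a processing span.
    // StartActivity returns null when no listener samples the source.
    public static Activity? StartProcessing(IReadOnlyDictionary<string, string> headers)
    {
        headers.TryGetValue("traceparent", out var traceparent);
        headers.TryGetValue("tracestate", out var tracestate);

        return ActivityContext.TryParse(traceparent, tracestate, out var parent)
            ? Source.StartActivity("order.process", ActivityKind.Consumer, parent)
            : Source.StartActivity("order.process", ActivityKind.Consumer);
    }
}
```

When the header is missing or malformed, the consumer starts a fresh trace rather than failing the message. Batch consumers would pass the extracted contexts as span links instead of a single parent.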

OpenTelemetry

OpenTelemetry provides vendor-neutral APIs, SDKs, semantic conventions, instrumentation, and export protocols for:

  • Traces.
  • Metrics.
  • Logs.

Conceptual .NET setup:

Code
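using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

// Register tracing and metrics under one service identity.
builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource =>
        resource.AddService("ordering-api"))
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddEntityFrameworkCoreInstrumentation()
        .AddSource("Ordering"))   // name of the custom ActivitySource
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddMeter("Ordering"));   // name of the custom Meter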

Export through an OpenTelemetry Collector or directly to a compatible backend according to operational needs.

Custom Spans

Instrument business-relevant work not covered by automatic instrumentation:

Code
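using System.Diagnostics;

// The name must match the AddSource("Ordering") registration.
private static readonly ActivitySource ActivitySource =
    new("Ordering");

// StartActivity returns null when no listener samples the span,
// hence the null-conditional calls.
using var activity = ActivitySource.StartActivity("order.place");
activity?.SetTag("order.channel", command.Channel);
activity?.SetTag("tenant.tier", tenant.Tier);

await handler.Handle(command, cancellationToken);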

Avoid recording:

  • Full request bodies.
  • Secrets or tokens.
  • Personal information.
  • Unbounded object IDs as metric dimensions.

Trace attributes can have higher cardinality than metric labels, but still affect cost and privacy.

Metrics

Common metric types:

  • Counter.
  • Up-down counter.
  • Histogram.
  • Gauge or observable measurement.

Useful service metrics:

  • Request rate.
  • Error rate.
  • Latency histogram.
  • Active requests.
  • Queue depth and age.
  • Retry attempts.
  • Circuit state.
  • Cache hit ratio.
  • Database pool utilization.
  • Business completions and failures.

Prefer histograms and percentiles for latency rather than averages alone.
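
A minimal sketch of a counter and a latency histogram using `System.Diagnostics.Metrics`. The meter name `Ordering` matches the `AddMeter("Ordering")` registration in the conceptual setup; the instrument names, units, and the `OrderingMetrics` class are assumptions for illustration.

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class OrderingMetrics
{
    private static readonly Meter OrderingMeter = new("Ordering");

    // Monotonic count of placed orders (hypothetical instrument name).
    private static readonly Counter<long> OrdersPlaced =
        OrderingMeter.CreateCounter<long>("orders.placed", unit: "{order}");

    // Distribution of handling time, so the backend can derive percentiles.
    private static readonly Histogram<double> PlaceDuration =
        OrderingMeter.CreateHistogram<double>("orders.place.duration", unit: "ms");

    // channel is expected to be a small, known set of values (web, mobile, ...).
    public static void RecordPlaced(string channel, double elapsedMs)
    {
        var tag = new KeyValuePair<string, object?>("order.channel", channel);
        OrdersPlaced.Add(1, tag);
        PlaceDuration.Record(elapsedMs, tag);
    }
}
```

Recording raw durations into a histogram leaves the percentile calculation to the backend, which avoids baking an average into the application.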

Cardinality

Metric cardinality is the number of unique label combinations.

Dangerous labels:

  • User ID.
  • Request ID.
  • Order ID.
  • Raw URL.
  • Exception message.

Safe bounded labels:

  • Route template.
  • HTTP method.
  • Status class.
  • Region.
  • Dependency name.
  • Known operation type.

High cardinality increases memory, storage, query cost, and alert instability. Put per-request identifiers in traces or logs, not metric dimensions.
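
To make the growth concrete, the sketch below collapses status codes into a bounded label and computes the worst-case number of time series as the product of distinct values per label. The `MetricLabels` helper and its method names are hypothetical.

```csharp
using System.Linq;

public static class MetricLabels
{
    // Collapses a status code into a handful of classes ("2xx", "5xx", ...).
    // Anything outside the valid HTTP range becomes "other".
    public static string StatusClass(int statusCode) =>
        statusCode is >= 100 and <= 599 ? $"{statusCode / 100}xx" : "other";

    // Worst-case series count: the product of distinct values per label.
    public static long WorstCaseSeries(params int[] distinctValuesPerLabel) =>
        distinctValuesPerLabel.Aggregate(1L, (product, count) => product * count);
}
```

With 50 route templates, 5 methods, 5 status classes, and 4 regions, the worst case is 5,000 series. Adding a user ID label with 100,000 distinct values raises that ceiling to 500,000,000.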

Logs

Structured logs:

Code
logger.LogInformation(
    "Order {OrderId} accepted for tenant {TenantId}",
    order.Id,
    tenant.Id);

Benefits:

  • Searchable named properties.
  • Correlation with trace context.
  • Consistent redaction.
  • Better aggregation.

Log levels should reflect action:

  • Debug for development detail.
  • Information for normal state transitions.
  • Warning for recoverable abnormal conditions.
  • Error for failed operations requiring investigation.
  • Critical for severe service impact.

Do not log every successful retry as an error.

Logs, Metrics, and Traces Together

Use:

  • Metrics to detect and quantify.
  • Traces to locate latency and failure paths.
  • Logs for detailed events and state transitions.

Exemplars can connect a metric sample to representative traces. Include trace and span IDs in structured logs so an operator can pivot between signals.
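
One way to attach trace and span IDs to log entries is a logging scope, sketched below. The `TraceLogScope` helper and the `TraceId` / `SpanId` property names are assumptions; many logging providers and the OpenTelemetry log exporter can add these fields automatically, in which case this helper is unnecessary.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using Microsoft.Extensions.Logging;

public static class TraceLogScope
{
    // Opens a scope carrying the current trace and span IDs so every log
    // entry written inside it can be joined to the trace.
    // Returns null when there is no current Activity.
    public static IDisposable? Begin(ILogger logger)
    {
        var activity = Activity.Current;
        if (activity is null) return null;

        return logger.BeginScope(new Dictionary<string, object>
        {
            ["TraceId"] = activity.TraceId.ToString(),
            ["SpanId"] = activity.SpanId.ToString(),
        });
    }
}
```

A caller would wrap the work in `using (TraceLogScope.Begin(logger)) { ... }`; a `using` statement accepts a null resource, so the no-activity case needs no special handling.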

Sampling

Recording every trace can be too expensive.

Strategies:

  • Head sampling at trace start.
  • Tail sampling after outcome is known.
  • Probability sampling.
  • Always sample errors or high latency.
  • Different rates by operation or environment.

Sampling must preserve enough evidence for rare failures. Metrics should continue to represent all traffic even when traces are sampled.
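
The head and tail strategies can be contrasted in a short sketch. This is a simplified illustration, not the algorithm used by any particular SDK or collector; `SamplingSketch`, its parameters, and the choice to hash on the first 16 hex characters of the trace ID are assumptions.

```csharp
using System;
using System.Globalization;

public static class SamplingSketch
{
    // Head decision: deterministic per trace ID, so every service that sees
    // the same ID reaches the same verdict without coordination.
    public static bool HeadSample(string traceIdHex, double ratio)
    {
        if (ratio >= 1.0) return true;
        if (ratio <= 0.0) return false;

        // Map the first 64 bits of the trace ID onto [0, 1].
        ulong bucket = ulong.Parse(
            traceIdHex.AsSpan(0, 16), NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        return (double)bucket / ulong.MaxValue < ratio;
    }

    // Tail decision: made once the outcome is known. Failed or slow traces
    // are always kept; the rest fall back to the baseline ratio.
    public static bool TailKeep(
        string traceIdHex, bool failed, TimeSpan duration,
        TimeSpan slowThreshold, double baselineRatio) =>
        failed || duration >= slowThreshold || HeadSample(traceIdHex, baselineRatio);
}
```

The trace ID `4bf92f3577b34da6a3ce929d0e0e4736` maps to roughly 0.30, so it is kept at a ratio of 0.5 and dropped at 0.1 unless the tail rule sees a failure or a slow duration.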

Health Checks

A health check answers a narrow operational question.

ASP.NET Core:

Code
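using Microsoft.AspNetCore.Diagnostics.HealthChecks;

// Register the dependency check and tag it so only readiness runs it.
builder.Services.AddHealthChecks()
    .AddCheck<DatabaseReadinessCheck>(
        "database",
        tags: ["ready"]);

// Liveness: the predicate excludes every registered check, so the endpoint
// reports healthy whenever the process can answer HTTP.
app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = _ => false
});

// Readiness: run only the checks tagged "ready".
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready")
});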

Keep public responses minimal. Detailed dependency information can expose architecture and should be restricted.

Liveness, Readiness, and Startup

Liveness

  • Is the process alive and making progress?
  • Failure may trigger restart.
  • Should avoid broad dependency checks.

Readiness

  • Can this instance safely receive traffic?
  • Failure removes it from load balancing.

Startup

  • Has long initialization completed?
  • Prevents premature liveness failure during startup.

If every instance fails liveness because a shared database is down, the orchestrator can restart the entire fleet and make recovery worse.

Dependency Health

Checking a dependency can itself create load.

Use:

  • Cheap bounded probes.
  • Timeouts.
  • Separate readiness semantics.
  • Cached or scheduled checks when appropriate.
  • Degraded status for optional dependencies.

Do not use deep business transactions as high-frequency liveness probes.
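
A bounded readiness probe might look like the sketch below. It is one possible shape for the `DatabaseReadinessCheck` registered in the Health Checks section, not the original implementation. The injected probe delegate (for example a `SELECT 1` round trip) and the timeout are assumptions, and both would need to be resolvable from dependency injection for `AddCheck<T>` to construct the class.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public sealed class DatabaseReadinessCheck : IHealthCheck
{
    private readonly Func<CancellationToken, Task> _probe; // assumed cheap round trip
    private readonly TimeSpan _timeout;

    public DatabaseReadinessCheck(Func<CancellationToken, Task> probe, TimeSpan timeout)
    {
        _probe = probe;
        _timeout = timeout;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        // Bound the probe; this relies on the probe honoring the token.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        cts.CancelAfter(_timeout);

        try
        {
            await _probe(cts.Token);
            return HealthCheckResult.Healthy();
        }
        catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
        {
            // Our own timeout fired, not the caller's cancellation.
            return HealthCheckResult.Unhealthy("database probe timed out");
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Keep exception text out of the response; log it elsewhere.
            return HealthCheckResult.Unhealthy("database probe failed");
        }
    }
}
```

For an optional dependency, returning `HealthCheckResult.Degraded(...)` instead of `Unhealthy` keeps the instance in rotation while still surfacing the problem.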

Health Endpoint Security

Protect detailed endpoints:

  • Network restriction.
  • Authentication.
  • Separate management port.
  • Minimal response body.
  • No secrets or internal exception text.
  • No response caching.

A simple liveness endpoint often needs only an HTTP status.

Service-Level Indicators and Objectives

An SLI is a measured service behavior:

  • Successful request ratio.
  • Latency under threshold.
  • Fresh processing within deadline.

An SLO is a target:

Code
99.9% of checkout requests succeed within 500 ms over 30 days.

An error budget is the allowed unreliability implied by the SLO. Alerts should connect to meaningful consumption of that budget.
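
The arithmetic behind an error budget is small enough to sketch. The `ErrorBudget` helper is hypothetical; it uses `decimal` so that targets such as 0.999 are represented exactly. A "failure" here means any request that misses the SLO, including one that succeeds too slowly.

```csharp
public static class ErrorBudget
{
    // Failures the SLO tolerates for a given request volume in the window.
    public static decimal AllowedFailures(decimal sloTarget, long totalRequests) =>
        (1m - sloTarget) * totalRequests;

    // Fraction of the budget still unspent; negative once the SLO is breached.
    // Throws DivideByZeroException for a 100% target or zero traffic.
    public static decimal Remaining(decimal sloTarget, long totalRequests, long failedRequests) =>
        1m - failedRequests / AllowedFailures(sloTarget, totalRequests);
}
```

At a 99.9% target and 10,000,000 checkout requests in the 30-day window, the budget is 10,000 failing requests. After 2,500 failures, 75% of the budget remains.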

Alert Design

An actionable alert has:

  • Clear symptom.
  • User or business impact.
  • Threshold and evaluation window.
  • Owner.
  • Severity.
  • Runbook.
  • Useful context and dashboard links.
  • Deduplication and routing.

Prefer symptom-based alerts:

Code
checkout success rate below SLO
queue oldest age threatens deadline

over cause-only alerts:

Code
CPU above 80%
one exception occurred

Cause metrics remain valuable for diagnosis.

Multi-Window Alerting

A brief spike and a slow burn require different detection.

Use:

  • Short window for severe rapid impact.
  • Longer window for sustained degradation.
  • Error-budget burn rates where available.

Avoid static thresholds without traffic context. A 5% error rate at 2 requests per minute differs from the same rate at 20,000 requests per second.
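
A fast-burn paging rule can be sketched as follows. Burn rate is the observed error ratio divided by the error ratio the SLO allows. The 14.4 threshold is a commonly cited value for a 30-day SLO (it corresponds to spending 2% of the budget in one hour); it, the minimum request count, and the `BurnRateAlert` names are assumptions to tune, not fixed rules.

```csharp
public static class BurnRateAlert
{
    // 1.0 spends the budget exactly over the SLO period; higher spends it faster.
    public static double BurnRate(long failed, long total, double sloTarget) =>
        total == 0 ? 0.0 : ((double)failed / total) / (1.0 - sloTarget);

    // Page only when the long window shows significant burn AND the short
    // window confirms it is still happening, with enough traffic to trust.
    public static bool ShouldPage(
        (long Failed, long Total) longWindow,
        (long Failed, long Total) shortWindow,
        double sloTarget,
        double threshold = 14.4,
        long minimumRequests = 100)
    {
        if (longWindow.Total < minimumRequests) return false;

        return BurnRate(longWindow.Failed, longWindow.Total, sloTarget) >= threshold
            && BurnRate(shortWindow.Failed, shortWindow.Total, sloTarget) >= threshold;
    }
}
```

A slow-burn rule would reuse the same function with longer windows and a lower threshold, typically routed to a ticket rather than a page.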

Alert Fatigue

Reduce noise by:

  • Paging only for urgent actionable impact.
  • Sending lower-severity notifications to tickets or dashboards.
  • Grouping related alerts.
  • Suppressing dependent symptom storms.
  • Requiring sustained conditions.
  • Reviewing alerts after incidents.
  • Deleting alerts with no owner or action.

If operators routinely ignore an alert, the system is not safer because it exists.

Golden Signals and RED/USE

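The four golden signals are latency, traffic, errors, and saturation. RED and USE are narrower checklists built on the same idea.

For request-driven services, RED: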

  • Rate.
  • Errors.
  • Duration.

For resources, USE:

  • Utilization.
  • Saturation.
  • Errors.

Additional business signals are essential:

  • Orders completed.
  • Payments unresolved.
  • Messages past deadline.
  • Projection lag.

Infrastructure health can be green while the business workflow is broken.

Queue and Async Telemetry

Measure:

  • Arrival rate.
  • Completion rate.
  • Queue depth.
  • Oldest-message age.
  • Processing duration.
  • Attempts.
  • Dead-letter count.
  • End-to-end business latency.

Propagate message and business IDs. Acknowledgement success alone does not prove the intended side effect occurred.

Deployment Telemetry

Tag telemetry with bounded deployment context:

  • Service version.
  • Environment.
  • Region.
  • Deployment ring.

Compare errors and latency before and after deployment. Do not rely on per-instance dashboards as the only view; aggregate across instances and keep the ability to drill down.

Telemetry Pipeline Reliability

The telemetry system has limits too.

Design:

  • Bounded application buffers.
  • Nonblocking export.
  • Batch export.
  • Sampling.
  • Backpressure or dropping policy.
  • Collector redundancy.
  • Cost and retention.

Application availability should not normally depend on synchronous telemetry export.
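
A bounded, nonblocking buffer with an explicit dropping policy can be sketched with `System.Threading.Channels`. The `TelemetryBuffer<T>` class is hypothetical; OpenTelemetry's batch processors already provide similar behavior, so this is mainly useful for understanding the trade-off or for custom pipelines.

```csharp
using System.Collections.Generic;
using System.Threading.Channels;

public sealed class TelemetryBuffer<T>
{
    private readonly Channel<T> _channel;

    public TelemetryBuffer(int capacity) =>
        _channel = Channel.CreateBounded<T>(new BoundedChannelOptions(capacity)
        {
            // Dropping policy: when full, discard the oldest item.
            FullMode = BoundedChannelFullMode.DropOldest,
            SingleReader = true,
        });

    // Called on the request path: never blocks, even when the buffer is full.
    public bool TryEnqueue(T item) => _channel.Writer.TryWrite(item);

    // Called by a background exporter: drains up to maxBatch items
    // so they can be sent in one export call.
    public List<T> DrainBatch(int maxBatch)
    {
        var batch = new List<T>(maxBatch);
        while (batch.Count < maxBatch && _channel.Reader.TryRead(out var item))
            batch.Add(item);
        return batch;
    }
}
```

With a capacity of 3, enqueuing the values 1 through 5 and then draining yields 3, 4, 5. Counting the dropped items as a metric makes the loss itself observable.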

Privacy and Security

Telemetry can contain sensitive data.

Apply:

  • Data classification.
  • Redaction before export.
  • Access control.
  • Encryption.
  • Retention limits.
  • Audit.
  • Tenant separation.

Never log passwords, tokens, secrets, raw authorization headers, or unrestricted request bodies.
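
Redaction before export can be as simple as the sketch below. The deny-list, the `[REDACTED]` marker, and the `TelemetryRedaction` name are assumptions for illustration; an allow-list of known-safe headers is the stricter option when the header set is under your control.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TelemetryRedaction
{
    // Assumed deny-list; header names are compared case-insensitively.
    private static readonly HashSet<string> SensitiveHeaders =
        new(StringComparer.OrdinalIgnoreCase)
        {
            "Authorization", "Cookie", "Set-Cookie", "X-Api-Key",
        };

    // Returns a copy that is safe to attach to logs or spans.
    // Throws ArgumentException if the input repeats a header name.
    public static Dictionary<string, string> RedactHeaders(
        IEnumerable<KeyValuePair<string, string>> headers) =>
        headers.ToDictionary(
            h => h.Key,
            h => SensitiveHeaders.Contains(h.Key) ? "[REDACTED]" : h.Value,
            StringComparer.OrdinalIgnoreCase);
}
```

Applying the redaction at the point where telemetry is created, rather than in the backend, keeps sensitive values out of buffers, collectors, and retries as well.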

Testing Observability

Test:

  • Trace propagation across HTTP and messaging.
  • Missing or malformed incoming context.
  • Log and trace correlation.
  • Sampling behavior.
  • Health endpoint status.
  • Dependency timeout in readiness.
  • Alert rule with synthetic failures.
  • Runbook and notification routing.
  • Telemetry backend outage.
  • Redaction.

An untested alert is a hypothesis, not an operational control.

Common Mistakes

Common failures include:

  • Treating correlation ID and trace ID as identical for every workflow.
  • Creating custom trace headers instead of standards.
  • Losing context at message boundaries.
  • High-cardinality metric labels.
  • Logging secrets or request bodies.
  • Relying on averages alone for latency.
  • Making liveness depend on every shared service.
  • Exposing detailed health information publicly.
  • Alerting on every exception or retry.
  • Paging on causes without user impact.
  • Sampling away all rare failures.
  • Blocking application requests on telemetry export.
  • Having dashboards without ownership or runbooks.

Best-Practice Design Process

  1. Define critical user and business journeys.
  2. Define SLIs and SLOs.
  3. Instrument standard HTTP, database, broker, and runtime signals.
  4. Propagate W3C trace context across boundaries.
  5. Preserve business correlation and causation IDs.
  6. Add bounded custom spans, metrics, and structured logs.
  7. Separate liveness, readiness, and startup semantics.
  8. Control cardinality, sampling, privacy, cost, and retention.
  9. Alert on sustained actionable impact.
  10. Link alerts to dashboards, traces, logs, owners, and runbooks.
  11. Test telemetry and incident response under failure.

