Monitoring, tracing, and incident response on Azure

Metrics vs logs vs traces

Overview

Metrics, logs, and traces are the three core telemetry signals used to understand production systems. Azure Monitor brings these signals together across Azure resources, applications, infrastructure, and hybrid environments.

Each signal answers a different kind of question:

  • Metrics answer "What is happening over time?"
  • Logs answer "What happened in detail?"
  • Distributed traces answer "Where did this request go and where did time or failure occur?"

A strong monitoring design does not choose one signal and ignore the others. Metrics are excellent for dashboards and fast alerts. Logs are excellent for investigation and detailed evidence. Traces are excellent for following a request across services and dependencies.

For interviews, candidates should be able to explain signal differences, storage and query trade-offs, cardinality, sampling, correlation IDs, KQL, metric dimensions, alert design, and how to use telemetry during an incident.

Core Concepts

Observability Signals

Observability is the ability to understand a system's internal behavior from the telemetry it emits.

The main signals are:

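| Signal  | Shape               | Best for                                     |
|---------|---------------------|----------------------------------------------|
| Metrics | Numeric time series | Trends, alerting, dashboards, SLOs           |
| Logs    | Timestamped records | Investigation, audit, detailed diagnostics   |
| Traces  | Connected spans     | End-to-end request flow and latency analysis |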

These signals complement each other. A metric tells you that the error rate increased. A log tells you which errors occurred. A trace tells you which dependency caused the failed request.

Metrics

Metrics are numeric values collected at regular intervals. In Azure Monitor, native platform metrics are stored in a time-series database and are optimized for fast charting and alerting.

Examples:

  • CPU percentage.
  • Request count.
  • Failed request rate.
  • Queue length.
  • Database DTU or CPU usage.
  • Service Bus active message count.
  • App response time.

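Metrics are usually aggregated over a time window using functions such as average, minimum, maximum, count, or total (sum). Azure Monitor platform metrics expose these five aggregation types; percentiles are typically computed from logs or from histogram-style metrics instead.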
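To make window aggregation concrete, here is a minimal sketch that groups raw samples into fixed time windows and computes the common aggregations. The sample data and the `aggregate` function are illustrative assumptions, not an Azure Monitor API.

```python
from statistics import mean

def aggregate(samples, window_seconds):
    """Group (timestamp_seconds, value) samples into fixed windows and aggregate each."""
    buckets = {}
    for ts, value in samples:
        window_start = ts - (ts % window_seconds)
        buckets.setdefault(window_start, []).append(value)
    return {
        start: {
            "count": len(values),
            "total": sum(values),
            "average": mean(values),
            "minimum": min(values),
            "maximum": max(values),
        }
        for start, values in sorted(buckets.items())
    }

# Hypothetical response-time samples: (seconds since start, milliseconds).
samples = [(0, 120), (20, 80), (45, 100), (60, 300), (90, 500)]
result = aggregate(samples, window_seconds=60)
for start, stats in result.items():
    print(start, stats)
```

The second window averages 400 ms but peaks at 500 ms, which is why the choice of aggregation matters when an alert threshold is set.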

Metric Dimensions

Dimensions are name/value pairs that provide context for a metric.

Example:

```text
Metric: Request duration
Dimensions:
  route = /api/orders/{id}
  statusCode = 200
  region = eastus
```

Dimensions let you split and filter metrics, but every additional dimension increases cardinality. High-cardinality dimensions such as user ID, request ID, or full URL can make metrics expensive or impractical. Put high-cardinality details in logs or traces instead.
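The cost of a dimension is easier to see as arithmetic: in the worst case, the number of time series is the product of the distinct values of every dimension. The sketch below assumes hypothetical dimension values and a hypothetical `userId` dimension with 50,000 users.

```python
from math import prod

def worst_case_series(dimensions):
    """Upper bound on time series: product of distinct values per dimension."""
    return prod(len(values) for values in dimensions.values())

# Bounded dimensions (values are illustrative).
bounded = {
    "route": ["/api/orders/{id}", "/api/orders", "/api/health"],
    "statusCode": ["200", "400", "404", "500"],
    "region": ["eastus", "westeurope"],
}
# The same metric with a hypothetical userId dimension of 50,000 distinct users.
unbounded = dict(bounded, userId=range(50_000))

print(worst_case_series(bounded))    # 24
print(worst_case_series(unbounded))  # 1200000
```

One unbounded dimension turns 24 possible series into 1.2 million, which is the practical meaning of "expensive or impractical".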

Platform Metrics and Custom Metrics

Platform metrics are emitted by Azure resources without extra configuration. They are useful for infrastructure health and service-level alerting.

Custom metrics are emitted by applications or agents. They are useful for business and application health signals such as:

  • Orders submitted per minute.
  • Payment failures.
  • Cart checkout latency.
  • Background job backlog.
  • Cache hit rate.

Good custom metrics are stable, low-cardinality, and tied to user impact or system health.

Logs

Logs are timestamped records with structured fields and message text. Azure Monitor Logs stores log data in Log Analytics workspaces, where it can be queried with Kusto Query Language.

Examples:

  • Application log events.
  • Exceptions.
  • Dependency failures.
  • Audit records.
  • Azure Activity Log entries.
  • Diagnostic logs from Azure resources.
  • Container logs.

Logs are richer than metrics, but they usually cost more to ingest and query, and log-based alerts react more slowly than metric alerts. They are best for investigation, not for second-by-second heartbeat alerting.

Structured Logging

Structured logs store important values as fields rather than hiding everything in a string.

Less useful:

```text
Failed to process order 100187 for tenant 42
```

More useful:

```json
{
  "message": "Failed to process order",
  "orderId": "100187",
  "tenantId": "42",
  "operation": "ProcessOrder",
  "exceptionType": "TimeoutException"
}
```

Structured logs make KQL filtering, grouping, and dashboarding far easier.
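One way to emit a record of that shape is sketched below with Python's standard `logging` module. The `JsonFormatter` class, the `build_logger` helper, and the in-memory stream are assumptions made for illustration; a real application would more likely use its framework's logging provider or an exporter.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including fields passed via `extra`."""
    FIELDS = ("orderId", "tenantId", "operation", "exceptionType")

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        for name in self.FIELDS:
            if hasattr(record, name):
                payload[name] = getattr(record, name)
        return json.dumps(payload)

def build_logger(stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("orders")
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger

stream = io.StringIO()  # stand-in for a real log sink
logger = build_logger(stream)
logger.error(
    "Failed to process order",
    extra={
        "orderId": "100187",
        "tenantId": "42",
        "operation": "ProcessOrder",
        "exceptionType": "TimeoutException",
    },
)
print(stream.getvalue(), end="")
```

The message text stays constant while the variable parts travel as fields, so a query can group by `exceptionType` or filter by `tenantId` without parsing strings.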

Distributed Traces

Distributed traces represent one operation as a tree or graph of spans. A span is a timed unit of work, such as an HTTP request, database call, queue handler, or dependency call.

Traces answer:

  • Which services handled this request?
  • Which dependency was slow?
  • Where did the error first occur?
  • Did retries increase latency?
  • Did a queue message originate from a specific API request?

In Azure, Application Insights and OpenTelemetry are commonly used to collect request, dependency, exception, and span data.
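The span tree can be sketched as a small data structure. The example below is an illustrative model, not the Application Insights or OpenTelemetry data format; the span names and timings are hypothetical, and the self-time calculation assumes child spans do not overlap.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    start_ms: int
    end_ms: int
    children: list = field(default_factory=list)

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms

def self_time_ms(span):
    """Time spent in the span itself, excluding its children (assumes no overlap)."""
    return span.duration_ms - sum(child.duration_ms for child in span.children)

def slowest_by_self_time(root):
    """Walk the tree and return the span with the largest self time."""
    slowest = root
    stack = [root]
    while stack:
        span = stack.pop()
        if self_time_ms(span) > self_time_ms(slowest):
            slowest = span
        stack.extend(span.children)
    return slowest

# Hypothetical trace for one API request.
trace = Span("GET /api/orders/{id}", 0, 950, [
    Span("auth check", 5, 25),
    Span("orders-service call", 30, 930, [
        Span("SQL query", 40, 880),
        Span("cache write", 885, 900),
    ]),
])
print(slowest_by_self_time(trace).name)  # SQL query
```

The root span took 950 ms, but almost all of that time belongs to a grandchild span, which is the kind of answer a flat log line cannot give.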

Trace Context and Correlation

Correlation connects telemetry from the same operation. Without correlation, an incident becomes a scavenger hunt.

Common correlation identifiers include:

  • Trace ID.
  • Span ID.
  • Parent span ID.
  • Operation ID.
  • Correlation ID.
  • Request ID.
  • Message ID.

For asynchronous systems, propagate correlation IDs through messages, events, and background work. A queue consumer should log the incoming message ID and the original request correlation ID.
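A minimal sketch of propagation through a queue follows. It uses the W3C Trace Context `traceparent` layout (version, 32-hex-digit trace ID, 16-hex-digit span ID, flags); the message shape, the in-memory list standing in for a queue, and the function names are assumptions for illustration.

```python
import secrets

def new_traceparent():
    """Start a new trace: version 00, random trace ID, random span ID, sampled flag."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"

def child_traceparent(parent):
    """Keep the trace ID and mint a new span ID for the next unit of work."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

def enqueue(queue, body, traceparent):
    """Producer: attach the trace context to the message as metadata."""
    message = {"messageId": secrets.token_hex(8), "traceparent": traceparent, "body": body}
    queue.append(message)
    return message

def consume(queue):
    """Consumer: continue the producer's trace and report both identifiers."""
    message = queue.pop(0)
    context = child_traceparent(message["traceparent"])
    trace_id = context.split("-")[1]
    return {"messageId": message["messageId"], "traceId": trace_id, "traceparent": context}

queue = []
request_context = new_traceparent()           # created when the API request arrives
sent = enqueue(queue, {"orderId": "100187"}, request_context)
handled = consume(queue)                      # runs later, possibly in another process
print(handled["messageId"], handled["traceId"])
```

The consumer gets a new span ID but keeps the original trace ID, so its logs and spans can be joined back to the API request that produced the message.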

Metrics Versus Logs

Metrics are compact and optimized for aggregation. Logs are detailed and optimized for investigation.

Use metrics when:

  • You need fast dashboards.
  • You need low-latency alerting.
  • You need trends or SLO calculations.
  • The value is numeric and regularly sampled.

Use logs when:

  • You need details.
  • You need text or structured event fields.
  • You need audit evidence.
  • You need ad hoc investigation.

Do not log every request just to compute a basic request rate when a metric already gives the answer.

Logs Versus Traces

Logs are independent records. Traces connect related work into an end-to-end path.

Use logs when:

  • You need business details.
  • You need exception context.
  • You need audit records.
  • You need custom diagnostic messages.

Use traces when:

  • A request crosses services.
  • Latency must be broken down by dependency.
  • You need causal relationships.
  • You need a transaction timeline.

Good systems link logs to trace IDs so investigators can move between both views.
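One way to link logs to traces is to stamp every log record with the current trace ID automatically. The sketch below uses Python's standard `logging` and `contextvars` modules; the `current_trace_id` variable, the `TraceIdFilter` class, and the hard-coded trace ID are illustrative assumptions. In practice a tracing library would usually set the ID.

```python
import contextvars
import io
import logging

# Holds the trace ID for the operation currently being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every record so the formatter can print it."""
    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True

stream = io.StringIO()  # stand-in for a real log sink
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("checkout")
logger.handlers = [handler]
logger.setLevel(logging.INFO)
logger.propagate = False

# Normally set by middleware when a request or message arrives.
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.warning("Payment provider timed out")
print(stream.getvalue(), end="")
```

Because the ID is injected by a filter, application code does not need to pass it to every log call, and an investigator can pivot from a log line to the matching trace.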

Sampling

Sampling reduces telemetry volume by keeping only a subset of events or traces. It helps control cost and overhead, but it can hide rare failures if applied carelessly.

Sampling strategy should preserve:

  • Errors.
  • Slow requests.
  • Security-relevant events.
  • Business-critical operations.
  • Representative successful traffic.

Do not blindly sample away the exact data needed to explain production incidents.
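A sampling rule that follows the list above can be sketched as a single decision function. This is an illustrative assumption, not the algorithm used by Application Insights or OpenTelemetry: it always keeps errors and slow requests, and keeps a deterministic fraction of everything else by hashing the trace ID. The thresholds are hypothetical defaults.

```python
import hashlib

def keep_trace(trace_id, is_error, duration_ms, sample_rate=0.1, slow_ms=1000):
    """Always keep errors and slow requests; keep a deterministic fraction of the rest."""
    if is_error or duration_ms >= slow_ms:
        return True
    # Hash the trace ID so every service makes the same decision for the same trace.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    return bucket < sample_rate

print(keep_trace("trace-1", is_error=True, duration_ms=20))     # True
print(keep_trace("trace-2", is_error=False, duration_ms=2500))  # True

# Roughly 10% of fast, successful traces survive as a baseline.
kept = sum(keep_trace(f"trace-{i}", False, 50) for i in range(10_000))
print(kept)
```

Hashing instead of calling a random number generator means a trace is either kept in full or dropped in full across services, which avoids half-recorded request paths. Security-relevant or business-critical operations could be handled with an additional always-keep condition.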

Cardinality

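Cardinality is the number of distinct values for a field or dimension. High cardinality is dangerous for metrics and dashboards because each distinct combination of dimension values becomes its own time series to store, query, and chart.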

High-cardinality examples:

  • User ID.
  • Email address.
  • Full URL with query string.
  • Request ID.
  • Order ID.

These are useful in logs and traces, but usually poor metric dimensions. For metrics, prefer bounded values such as route template, region, status code family, dependency type, or operation name.
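The sketch below shows one way to turn unbounded values into bounded dimension values. The helper names and the URL are hypothetical, and the segment-matching rule is a simplification; most web frameworks expose the matched route template directly, which is preferable when available.

```python
import re
from urllib.parse import urlsplit

# Path segments that look like numeric IDs or UUID-style identifiers.
ID_SEGMENT = re.compile(r"\d+|[0-9a-fA-F-]{32,36}")

def route_template(url):
    """Drop the query string and replace ID-like path segments with {id}."""
    path = urlsplit(url).path
    segments = [
        "{id}" if ID_SEGMENT.fullmatch(segment) else segment
        for segment in path.split("/")
    ]
    return "/".join(segments)

def status_family(status_code):
    """Collapse individual status codes into 2xx, 4xx, 5xx, and so on."""
    return f"{status_code // 100}xx"

print(route_template("https://shop.example.com/api/orders/100187?expand=items"))
# /api/orders/{id}
print(status_family(503))  # 5xx
```

The order ID and query string remain available in logs and traces; only the metric dimension is reduced to a small, stable set of values.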

Latency, Ingestion Delay, and Freshness

Telemetry is not always available instantly. Platform metrics, logs, traces, and exported diagnostic data can have different ingestion delays.

Interview-worthy design points:

  • Use metric alerts for fast operational detection.
  • Use logs and traces for deeper diagnosis.
  • Avoid alert rules that assume every log arrives immediately.
  • Make dashboards show the time range and query freshness clearly.
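As a sketch of the last two points, a log-based rule can evaluate a window that ends slightly in the past, and a dashboard can state how old its newest data is. The delay and window values below are hypothetical; actual ingestion delay varies by data source and should be measured.

```python
from datetime import datetime, timedelta, timezone

def evaluation_window(now, window, ingestion_delay):
    """Return (start, end) of a query window that ends `ingestion_delay` before `now`."""
    end = now - ingestion_delay
    return end - window, end

def freshness_label(now, newest_record_time):
    """Text a dashboard tile could show so readers know how stale the data is."""
    age = now - newest_record_time
    return f"data as of {newest_record_time:%H:%M:%S} UTC ({int(age.total_seconds())}s old)"

now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
start, end = evaluation_window(
    now, window=timedelta(minutes=5), ingestion_delay=timedelta(minutes=3)
)
print(start.time(), end.time())                           # 11:52:00 11:57:00
print(freshness_label(now, now - timedelta(seconds=90)))  # data as of 11:58:30 UTC (90s old)
```

Shifting the window trades a few minutes of detection time for fewer false alerts caused by records that have not arrived yet.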

SLI, SLO, and Error Budget

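A service-level indicator (SLI) is a measurement such as availability, latency, or error rate. A service-level objective (SLO) is the target for that measurement. An error budget is the amount of unreliability the SLO still permits, for example 0.1% of requests under a 99.9% objective.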

Examples:

```text
SLI: Percentage of successful checkout requests
SLO: 99.9% success over 30 days

SLI: p95 API latency
SLO: p95 under 500 ms for 99% of 5-minute windows
```

Metrics usually power SLIs. Logs and traces explain why SLOs are missed.
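The error budget for a request-based SLO is simple arithmetic: the allowed failure fraction is 1 minus the SLO target. The sketch below assumes hypothetical request counts and uses exact fractions to avoid floating-point rounding in the target.

```python
from fractions import Fraction

def error_budget(slo_target, total_requests, failed_requests):
    """Return (allowed failures, failures remaining, fraction of budget consumed)."""
    target = Fraction(slo_target)            # e.g. "0.999" parsed exactly
    allowed = total_requests * (1 - target)  # failures the SLO permits
    remaining = allowed - failed_requests
    consumed = Fraction(failed_requests) / allowed
    return float(allowed), float(remaining), float(consumed)

# Hypothetical 30-day totals for the checkout SLI with a 99.9% SLO.
allowed, remaining, consumed = error_budget(
    "0.999", total_requests=2_000_000, failed_requests=500
)
print(allowed, remaining, consumed)  # 2000.0 1500.0 0.25
```

With 2,000,000 requests, a 99.9% objective permits 2,000 failures; 500 failures consume a quarter of the budget. A target of exactly 100% leaves no budget, and this sketch does not handle that case.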

Azure Monitor Data Stores

Azure Monitor uses different stores for different telemetry:

  • Azure Monitor Metrics stores numeric time-series data.
  • Log Analytics workspaces store logs and trace data queried with KQL.
  • Azure Monitor workspaces store Prometheus and OpenTelemetry metrics queried with PromQL.

The names are similar, but the resource types and query languages differ. A good candidate will not blur them together.

Practical Incident Workflow

A typical investigation flow:

  1. Metric alert fires for error rate or latency.
  2. Dashboard confirms scope and impact.
  3. Logs identify the dominant exception, route, tenant, or deployment version.
  4. Traces show the failing dependency or slow span.
  5. Deployment, configuration, and dependency telemetry identify the likely cause.
  6. Incident notes record actions and evidence.
  7. Post-incident work adds missing telemetry or better alerts.

Metrics detect. Logs explain. Traces connect.
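Step 3 often amounts to grouping error records by a field and looking for a dominant value. The sketch below does this in memory over hypothetical structured log records; in Azure the same grouping would normally be a KQL query against a Log Analytics workspace.

```python
from collections import Counter

def dominant(records, field):
    """Return (most common value of `field`, its share), or None if the field is absent."""
    counts = Counter(record[field] for record in records if field in record)
    if not counts:
        return None
    value, count = counts.most_common(1)[0]
    return value, count / sum(counts.values())

# Hypothetical error records pulled for the alert window.
records = [
    {"exceptionType": "TimeoutException", "route": "/api/orders/{id}", "version": "2.4.1"},
    {"exceptionType": "TimeoutException", "route": "/api/orders/{id}", "version": "2.4.1"},
    {"exceptionType": "TimeoutException", "route": "/api/checkout", "version": "2.4.1"},
    {"exceptionType": "ValidationException", "route": "/api/checkout", "version": "2.4.0"},
]

print(dominant(records, "exceptionType"))  # ('TimeoutException', 0.75)
print(dominant(records, "version"))        # ('2.4.1', 0.75)
```

A single exception type and deployment version accounting for most errors points the investigation at a recent release before any trace is opened.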

Common Mistakes

  • Logging unstructured strings that cannot be queried reliably.
  • Using high-cardinality metric dimensions.
  • Alerting on noisy logs without grouping or suppression.
  • Forgetting correlation IDs across queues and events.
  • Sampling away all successful requests and losing baseline behavior.
  • Treating traces as a substitute for business audit logs.
  • Treating logs as a substitute for fast metric alerts.
  • Creating dashboards with dozens of charts and no user-impact signal.
  • Monitoring infrastructure but not business outcomes.
  • Storing secrets, tokens, or personal data in telemetry.

Best Practices

  • Define user-impact SLIs first.
  • Use metrics for fast detection and trends.
  • Use logs for detailed diagnosis and audit.
  • Use traces for cross-service request flow.
  • Emit structured logs.
  • Propagate trace context and correlation IDs.
  • Keep metric dimensions bounded.
  • Use sampling deliberately.
  • Redact sensitive data before ingestion.
  • Build dashboards around symptoms, saturation, errors, and latency.
  • Review telemetry gaps after incidents.
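Redaction before ingestion can be sketched as a function applied to every event before it leaves the process. The key names and patterns below are illustrative assumptions and are not exhaustive; real systems need rules that match their own data.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
BEARER = re.compile(r"bearer\s+[A-Za-z0-9._~+/=-]+", re.IGNORECASE)
SENSITIVE_KEYS = {"password", "token", "authorization", "apikey"}

def redact(event):
    """Return a copy of the event with sensitive keys and patterns masked."""
    clean = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"          # never emit the raw value
        elif isinstance(value, str):
            value = BEARER.sub("Bearer [REDACTED]", value)
            clean[key] = EMAIL.sub("[EMAIL]", value)
        else:
            clean[key] = value
    return clean

event = {
    "message": "Login failed for ana@example.com",
    "token": "abc123",
    "detail": "Header was Bearer abc.def-ghi",
    "status": 401,
}
print(redact(event))
```

Running redaction in the application, rather than after the data reaches the workspace, keeps secrets out of every downstream copy of the telemetry.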

