Alert rules, action groups, and incident response workflows

Overview

Azure Monitor alerts notify teams when telemetry indicates a problem. Alert rules define the condition. Action groups define who is notified and which automated actions run. Incident response workflows define what people do after the alert fires.

Good alerting is not about maximizing alert count. Good alerting detects user-impacting issues early, routes them to the right owner, provides useful context, avoids duplicate noise, and leads to a practiced response.

For interviews, candidates should explain metric alerts, log alerts, activity log alerts, service health alerts, action groups, alert processing rules, severity, dynamic thresholds, suppression, runbooks, incident lifecycle, post-incident review, and alert quality.

Core Concepts

Alert Rule

An alert rule defines:

  • Target resource or scope.
  • Signal type.
  • Condition.
  • Evaluation frequency.
  • Lookback window.
  • Severity.
  • Description.
  • Action group.
  • Optional dimensions or splitting.

An alert rule should answer one operational question. If one rule tries to detect everything, it usually becomes noisy and hard to route.
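
As a rough illustration, the fields listed above can be modeled as a small record with a few sanity checks. This is a minimal sketch, not the Azure resource schema: the `AlertRule` class, its field names, and the checks in `problems` are all hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical in-memory model of the fields an alert rule defines."""
    name: str
    scope: str                       # target resource ID or scope
    signal_type: str                 # for example "metric" or "log"
    condition: str                   # human-readable condition
    evaluation_frequency: timedelta  # how often the rule is evaluated
    lookback_window: timedelta       # how much data each evaluation sees
    severity: int                    # 0 (most urgent) to 4
    description: str
    action_group: str
    dimensions: tuple = ()           # optional dimensions to split on

    def problems(self) -> list:
        """Return basic consistency problems; an empty list means none found."""
        found = []
        if not 0 <= self.severity <= 4:
            found.append("severity must be between 0 and 4")
        if self.lookback_window < self.evaluation_frequency:
            # A window shorter than the frequency leaves unevaluated gaps.
            found.append("lookback window is shorter than evaluation frequency")
        if not self.description.strip():
            found.append("description is empty")
        return found
```

A check like this could run in a pull-request pipeline before rules are deployed, so that rules with an empty description or an inconsistent window are caught early.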

Signal Types

Common Azure Monitor alert signals include:

  • Metric alerts.
  • Log search alerts.
  • Activity log alerts.
  • Resource health alerts.
  • Service health alerts.
  • Prometheus alerts.
  • Smart detection or anomaly-based alerts in supported scenarios.

Choose the alert type based on the signal. Metric alerts are usually best for fast numeric conditions. Log alerts are better for complex queries or event patterns.

Metric Alerts

Metric alerts evaluate time-series metrics. They are often used for:

  • CPU or memory saturation.
  • Request failure rate.
  • Response time.
  • Queue length.
  • Service Bus dead-letter count.
  • Database DTU or CPU.
  • Availability metric.

Metric alerts are usually fast, efficient, and well-suited for operational symptoms.

Static and Dynamic Thresholds

A static threshold uses a fixed value:

Code
Alert when p95 latency > 750 ms for 10 minutes

A dynamic threshold learns normal behavior and detects deviations. It can reduce manual tuning for metrics with predictable patterns, but it still needs review.

Use static thresholds when the acceptable limit is known. Use dynamic thresholds when normal behavior varies by time or workload and anomaly detection is useful.
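
The difference between the two approaches can be sketched in a few lines. This is a simplified illustration under stated assumptions: the dynamic check below uses a mean plus standard deviation baseline as a stand-in, which is not the model Azure Monitor uses, and the function names and the `sensitivity` parameter are hypothetical.

```python
from statistics import mean, stdev


def breaches_static(samples, limit):
    """Fire only if every sample in the lookback window exceeds the fixed limit."""
    return bool(samples) and all(value > limit for value in samples)


def breaches_dynamic(history, current, sensitivity=3.0):
    """Fire if `current` is more than `sensitivity` standard deviations
    above the mean of the historical samples."""
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    upper_bound = mean(history) + sensitivity * stdev(history)
    return current > upper_bound


# Static: p95 latency samples (ms) against a known 750 ms limit.
latency_alert = breaches_static([800, 900, 760], limit=750)

# Dynamic: the same value can be normal or abnormal depending on history.
queue_alert = breaches_dynamic([100, 110, 90, 105, 95], current=400)
```

The static check needs a limit someone has agreed on. The dynamic check needs enough history to be meaningful, which is one reason it still needs review.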

Dimension Splitting

Metric dimensions can split alerts by values such as region, route, status code, or instance.

Dimension splitting is powerful but can create alert storms. Use it when routing or diagnosis benefits from separate alert instances.

Example:

Code
Alert per region when availability drops below target.
Do not alert per user ID or request ID.
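
The per-region example can be sketched as a grouping step that yields one alert instance per dimension value. This is an illustrative model only; the sample shape, the `alerts_by_dimension` function, and the averaging rule are assumptions.

```python
from collections import defaultdict


def alerts_by_dimension(samples, dimension, target):
    """Group availability samples by one dimension and return the
    dimension values whose average availability is below target."""
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample[dimension]].append(sample["availability"])
    return sorted(
        value
        for value, readings in grouped.items()
        if sum(readings) / len(readings) < target
    )


samples = [
    {"region": "westeurope", "availability": 99.95},
    {"region": "westeurope", "availability": 99.99},
    {"region": "eastus", "availability": 98.0},
    {"region": "eastus", "availability": 97.5},
]
failing_regions = alerts_by_dimension(samples, "region", target=99.9)
```

The number of alert instances is bounded by the number of distinct values of the dimension. A handful of regions is manageable; a high-cardinality dimension such as a request ID would produce one instance per failing request.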

Log Search Alerts

Log search alerts use KQL to evaluate conditions against log data, for example in a Log Analytics workspace or an Application Insights resource.

Example:

Code
requests
| where timestamp > ago(10m)
| summarize
    total = count(),
    failed = countif(success == false)
| extend failureRate = 100.0 * failed / total
| where total > 100 and failureRate > 5

Use log alerts when the condition requires joins, custom grouping, ratios, or event detail that metrics do not provide.
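
The decision the query above makes can be restated as a small function, which makes the role of the minimum-volume guard easier to see. This is a sketch of the logic only, with hypothetical names; the default values mirror the thresholds in the query.

```python
def should_fire(total, failed, min_requests=100, max_failure_rate=5.0):
    """Mirror the query's final filter: enough traffic AND a high failure rate."""
    if total <= min_requests:
        # Low traffic: a few failures would swing the ratio wildly,
        # and total == 0 would make the ratio undefined.
        return False
    failure_rate = 100.0 * failed / total
    return failure_rate > max_failure_rate
```

Without the volume guard, 5 failures out of 10 requests during a quiet period would look like a 50% failure rate and fire the alert.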

Activity Log Alerts

Activity log alerts detect management-plane events such as:

  • Resource deletion.
  • Role assignment changes.
  • Policy assignment changes.
  • Service health notifications.
  • Autoscale operations.

These alerts are useful for security, governance, and platform operations. They do not replace application health alerts.

Service Health and Resource Health

Service Health alerts notify teams about Azure service incidents, planned maintenance, and health advisories that may affect subscriptions or regions.

Resource Health alerts focus on the health of specific Azure resources.

Use both where appropriate:

  • Service Health explains platform-wide Azure conditions.
  • Resource Health identifies a resource-level availability issue.
  • Application alerts still detect whether users are affected.

Severity

Severity should indicate urgency and expected response, not emotional intensity.

Example severity model:

Severity  Meaning
Sev0      Widespread critical outage or data-loss risk
Sev1      Major user impact requiring immediate response
Sev2      Partial impact or degraded service
Sev3      Non-urgent issue needing follow-up
Sev4      Informational or ticket-only

If every alert is critical, no alert is critical.
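
One way to keep severity meaningful is to tie each level to an explicit expected response. The sketch below assumes the example model above; the `RESPONSE` policy and the function names are hypothetical and would need to match your own escalation model.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV0 = 0  # widespread critical outage or data-loss risk
    SEV1 = 1  # major user impact requiring immediate response
    SEV2 = 2  # partial impact or degraded service
    SEV3 = 3  # non-urgent issue needing follow-up
    SEV4 = 4  # informational or ticket-only


# Hypothetical routing policy, one expected response per level.
RESPONSE = {
    Severity.SEV0: "page on-call and open an incident bridge",
    Severity.SEV1: "page on-call",
    Severity.SEV2: "notify team channel",
    Severity.SEV3: "create ticket",
    Severity.SEV4: "create ticket",
}


def parse_severity(label):
    """Parse a label such as 'Sev2' into a Severity value."""
    text = label.strip().lower()
    if not text.startswith("sev"):
        raise ValueError(f"unrecognized severity label: {label!r}")
    return Severity(int(text[3:]))


def response_for(label):
    """Look up the expected response for a severity label."""
    return RESPONSE[parse_severity(label)]
```

Because only Sev0 and Sev1 page anyone in this policy, promoting an alert to Sev1 becomes a deliberate decision to wake someone up.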

Action Groups

An action group defines notifications and automated actions for alerts.

Notification examples:

  • Email.
  • SMS.
  • Voice.
  • Azure mobile app push.

Automation examples:

  • Webhook.
  • Secure webhook.
  • Azure Function.
  • Logic App.
  • Automation runbook.
  • Event Hub.
  • ITSM connector.

Action groups are reusable across alert rules. They should map to service ownership and escalation paths.

Common Alert Schema

The common alert schema standardizes alert payloads across alert types. It is useful when routing alerts into webhooks, Logic Apps, ITSM systems, or incident-management platforms.

Use the common schema when:

  • A shared automation endpoint handles multiple alert types.
  • You want consistent fields for severity, resource, condition, and context.
  • You need maintainable downstream parsing.
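
A shared endpoint can then read the same few fields regardless of alert type. The sketch below parses what I understand to be the `essentials` section of the common alert schema; verify the field names against the current schema documentation before relying on them. The `AlertSummary` type, the `summarize_alert` function, and the sample values are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertSummary:
    rule: str
    severity: str
    condition: str
    signal_type: str
    targets: tuple


def summarize_alert(body):
    """Extract routing fields from a common alert schema payload (JSON text)."""
    payload = json.loads(body)
    if payload.get("schemaId") != "azureMonitorCommonAlertSchema":
        raise ValueError("payload does not use the common alert schema")
    essentials = payload["data"]["essentials"]
    return AlertSummary(
        rule=essentials["alertRule"],
        severity=essentials["severity"],
        condition=essentials["monitorCondition"],
        signal_type=essentials["signalType"],
        targets=tuple(essentials.get("alertTargetIDs", [])),
    )


# Trimmed, made-up payload for illustration only.
sample_body = json.dumps({
    "schemaId": "azureMonitorCommonAlertSchema",
    "data": {
        "essentials": {
            "alertRule": "checkout-error-rate",
            "severity": "Sev1",
            "monitorCondition": "Fired",
            "signalType": "Log",
            "alertTargetIDs": ["/subscriptions/example/resourcegroups/rg-shop"],
        },
        "alertContext": {},
    },
})
summary = summarize_alert(sample_body)
```

Rejecting payloads with an unexpected `schemaId` keeps the handler from silently misreading alerts from a rule that was not opted in to the common schema.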

Testing Action Groups

Action groups should be tested before production incidents.

Test:

  • Email and SMS recipients.
  • Webhook authentication.
  • Logic App or Function behavior.
  • Incident-management integration.
  • Common alert schema parsing.
  • Escalation paths.

An untested action group is basically a polite wish.

Alert Processing Rules

Alert processing rules modify alert behavior after an alert fires. They can suppress notifications or apply action groups based on scope, conditions, and schedule.

Common uses:

  • Suppress alerts during planned maintenance.
  • Route a set of alerts to a temporary team.
  • Add an action group across many rules.
  • Reduce noise during known platform events.

Suppression should be visible, time-bound, and documented. Silent permanent suppression is dangerous.
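
The "time-bound and documented" requirement can be enforced in code or in a deployment check. The sketch below is a simplified model of a suppression rule, not the alert processing rule API; the `Suppression` type and both functions are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Suppression:
    scope: str               # resource ID prefix the suppression covers
    reason: str              # change record or other documentation
    start: datetime
    end: Optional[datetime]  # None models an open-ended suppression


def in_scope(resource_id, scope):
    """Case-insensitive check that resource_id is the scope or sits under it."""
    rid = resource_id.lower()
    prefix = scope.lower().rstrip("/")
    return rid == prefix or rid.startswith(prefix + "/")


def is_suppressed(rule, resource_id, now):
    """Return True if notifications for resource_id are suppressed at `now`."""
    if rule.end is None:
        raise ValueError("suppression without an end time is not allowed")
    if not rule.reason.strip():
        raise ValueError("suppression must document a reason")
    return in_scope(resource_id, rule.scope) and rule.start <= now < rule.end


window = Suppression(
    scope="/subscriptions/s1/resourceGroups/rg-app",
    reason="Planned maintenance (hypothetical change record CHG-1)",
    start=datetime(2024, 1, 10, 22, 0, tzinfo=timezone.utc),
    end=datetime(2024, 1, 11, 2, 0, tzinfo=timezone.utc),
)
```

Treating a missing end time as an error, rather than as "suppress forever", makes silent permanent suppression impossible to create by accident.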

Maintenance Windows

During planned maintenance, suppress notifications with an alert processing rule or disable the specific alert rules, and give either approach a defined end time. Do not rely on people remembering to re-enable alerts.

Good maintenance handling includes:

  • Change record.
  • Affected resources.
  • Suppression start and end.
  • Expected symptoms.
  • Owner.
  • Rollback plan.

Incident Response Workflow

A basic incident workflow:

  1. Alert fires.
  2. On-call acknowledges.
  3. Triage confirms impact and severity.
  4. Incident lead coordinates response.
  5. Engineers mitigate user impact.
  6. Communications keep stakeholders informed.
  7. Root cause investigation continues after mitigation.
  8. Post-incident review creates follow-up work.
  9. Alerts and runbooks are improved.

The alert starts the workflow. It does not replace the workflow.
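
The workflow above can be modeled as a small state machine so that tooling rejects skipped steps, such as closing an incident that was never triaged. This is a simplified, linear sketch; the stage names and the `advance` function are hypothetical, and real incident tools allow more transitions.

```python
from enum import Enum


class Stage(Enum):
    FIRED = "fired"
    ACKNOWLEDGED = "acknowledged"
    TRIAGED = "triaged"
    MITIGATING = "mitigating"
    MITIGATED = "mitigated"
    REVIEWED = "reviewed"


# Each stage lists the stages it may move to next.
ALLOWED = {
    Stage.FIRED: {Stage.ACKNOWLEDGED},
    Stage.ACKNOWLEDGED: {Stage.TRIAGED},
    Stage.TRIAGED: {Stage.MITIGATING},
    Stage.MITIGATING: {Stage.MITIGATED},
    Stage.MITIGATED: {Stage.REVIEWED},
    Stage.REVIEWED: set(),
}


def advance(current, target):
    """Move an incident to `target`, rejecting transitions that skip a step."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

Note that `MITIGATED` is not the final stage: the incident is only done once the post-incident review has produced follow-up work.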

Runbooks

A runbook explains what to do when an alert fires.

A useful runbook includes:

  • What the alert means.
  • Likely causes.
  • First queries or dashboards to check.
  • Mitigation steps.
  • Escalation contacts.
  • Rollback or failover instructions.
  • Customer communication guidance.
  • Links to related alerts and known issues.

Put the runbook link in the alert description or downstream incident ticket.

Alert Quality

Track alert quality over time.

Useful measures:

  • True positive rate.
  • Time to acknowledge.
  • Time to mitigate.
  • Alerts per incident.
  • Alerts closed as noise.
  • Duplicate alerts.
  • Alerts without runbooks.
  • Incidents found by users before monitoring.

Bad alerts train teams to ignore the system. Good alerts build trust.
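
Several of these measures can be computed from a simple export of past alerts. The sketch below assumes a hypothetical `AlertRecord` shape; the field names and the report keys are illustrative, and the share of actionable alerts stands in for the true positive rate.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class AlertRecord:
    rule: str
    actionable: bool                 # False when closed as noise
    seconds_to_ack: Optional[float]  # None if never acknowledged
    has_runbook: bool


def quality_report(records):
    """Summarize alert quality for a list of AlertRecord values."""
    total = len(records)
    if total == 0:
        return {}
    acked = [r.seconds_to_ack for r in records if r.seconds_to_ack is not None]
    return {
        "actionable_share": sum(r.actionable for r in records) / total,
        "noise_count": sum(not r.actionable for r in records),
        "median_seconds_to_ack": median(acked) if acked else None,
        "rules_without_runbook": sorted(
            {r.rule for r in records if not r.has_runbook}
        ),
    }
```

Reviewing a report like this on a regular schedule gives the team a concrete list of rules to tune, merge, or delete.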

Symptom Versus Cause Alerts

Prefer symptom alerts for paging:

  • Checkout error rate.
  • API p95 latency.
  • Availability test failure.
  • Queue age exceeding SLO.

Cause alerts are still useful, but often as supporting context:

  • CPU high.
  • Thread pool starvation.
  • Database DTU high.
  • Dependency throttling.

Page on user impact. Use cause alerts to diagnose.

Multi-Stage Alerting

Some systems use different alert routes for different urgency levels.

Example:

  • Warning dashboard when error rate is elevated.
  • Ticket when degradation lasts 15 minutes.
  • Page when SLO burn rate is high or availability is affected.

This reduces noise while still preserving visibility.
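
The staged example above can be sketched as a single routing decision. The 15-minute ticket threshold comes from the example; the other default thresholds, the parameter names, and the `route` function are assumptions for illustration.

```python
def route(
    error_rate_pct,
    degraded_minutes,
    burn_rate,
    availability_affected,
    *,
    warn_error_rate=1.0,      # assumed threshold for "elevated"
    ticket_after_minutes=15,  # from the example above
    page_burn_rate=10.0,      # assumed threshold for "high" burn rate
):
    """Pick the most urgent route that applies: page, ticket, dashboard, or none."""
    if availability_affected or burn_rate >= page_burn_rate:
        return "page"
    if degraded_minutes >= ticket_after_minutes:
        return "ticket"
    if error_rate_pct >= warn_error_rate:
        return "dashboard"
    return "none"
```

Checking the most urgent condition first means a signal that qualifies for several stages is routed once, at the highest applicable level, instead of producing a dashboard warning, a ticket, and a page for the same problem.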

Automation Actions

Automation can help, but it must be safe.

Good automation:

  • Adds context to incidents.
  • Restarts a known safe component.
  • Scales out within limits.
  • Opens a ticket.
  • Captures diagnostics.
  • Starts a runbook that requires approval for risky steps.

Bad automation can make incidents worse by repeatedly restarting healthy services or hiding symptoms before evidence is captured.
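
A common safeguard against the restart loop described above is to cap how often automation may act and to skip the action when the target already looks healthy. The sketch below is one possible guard; the `RestartGuard` class and its parameters are hypothetical.

```python
from collections import deque


class RestartGuard:
    """Allow at most `max_restarts` automated restarts per sliding window."""

    def __init__(self, max_restarts, window_seconds):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._history = deque()  # timestamps of restarts that were allowed

    def allow(self, now, healthy):
        """Return True if an automated restart may run at time `now` (seconds)."""
        if healthy:
            return False  # never restart a component that reports healthy
        # Drop restarts that have aged out of the window.
        while self._history and now - self._history[0] >= self.window_seconds:
            self._history.popleft()
        if len(self._history) >= self.max_restarts:
            return False  # limit reached: escalate to a human instead
        self._history.append(now)
        return True
```

When `allow` returns False because the limit was reached, the safer follow-up is usually to capture diagnostics and page a person rather than to retry.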

Security and Access

Alerting systems can trigger powerful actions. Secure them.

Consider:

  • Who can edit alert rules.
  • Who can edit action groups.
  • Whether webhook endpoints are authenticated.
  • Whether automation identities have least privilege.
  • Whether alert payloads contain sensitive data.
  • Whether incident channels expose customer data.

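Treat alert rules, action groups, and their automation as part of the control plane: a change to them can trigger, redirect, or silence real actions.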

Common Mistakes

  • Paging on every exception.
  • Paging on infrastructure metrics that have no user impact.
  • No action group testing.
  • No runbook.
  • Everyone receives every alert.
  • Severity has no meaning.
  • Suppression rules without expiration.
  • Alert thresholds copied from another system.
  • No ownership for dashboards or alerts.
  • Treating alert creation as the end of incident preparedness.

Best Practices

  • Page on user-impacting symptoms.
  • Use metric alerts for fast numeric signals.
  • Use log alerts for complex KQL conditions.
  • Keep alert descriptions actionable.
  • Attach the right action group and runbook.
  • Test notifications and automation.
  • Use alert processing rules for planned maintenance.
  • Review noisy alerts regularly.
  • Automate alert rules and action groups as infrastructure as code.
  • Include alerts in post-incident improvement work.
