Overview
Azure Monitor alerts notify teams when telemetry indicates a problem. Alert rules define the condition. Action groups define who is notified and which automated actions run. Incident response workflows define what people do after the alert fires.
Good alerting is not about maximizing alert count. Good alerting detects user-impacting issues early, routes them to the right owner, provides useful context, avoids duplicate noise, and leads to a practiced response.
For interviews, candidates should explain metric alerts, log alerts, activity log alerts, service health alerts, action groups, alert processing rules, severity, dynamic thresholds, suppression, runbooks, incident lifecycle, post-incident review, and alert quality.
Core Concepts
Alert Rule
An alert rule defines:
- Target resource or scope.
- Signal type.
- Condition.
- Evaluation frequency.
- Lookback window.
- Severity.
- Description.
- Action group.
- Optional dimensions or splitting.
An alert rule should answer one operational question. If one rule tries to detect everything, it usually becomes noisy and hard to route.
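The fields listed above can be pictured as a small record. The following Python sketch is illustrative only: `AlertRule`, its field names, and the example values are hypothetical and do not mirror the Azure SDK or ARM schema. The check that the lookback window is at least as long as the evaluation frequency is a sanity rule assumed for this example.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical record mirroring the fields an alert rule defines."""
    name: str
    scope: str                       # target resource ID or broader scope
    signal_type: str                 # e.g. "metric", "log", "activity_log"
    condition: str                   # human-readable statement of the condition
    evaluation_frequency: timedelta  # how often the rule is evaluated
    lookback_window: timedelta       # how much data each evaluation considers
    severity: int                    # 0 (most urgent) to 4 (least urgent)
    description: str
    action_group: str
    dimensions: tuple[str, ...] = ()  # optional splitting dimensions

    def problems(self) -> list[str]:
        """Return review findings; an empty list means nothing was flagged."""
        found = []
        if not 0 <= self.severity <= 4:
            found.append("severity must be between 0 and 4")
        if self.lookback_window < self.evaluation_frequency:
            found.append("lookback window is shorter than the evaluation frequency")
        if not self.description.strip():
            found.append("description is empty")
        if not self.action_group:
            found.append("no action group attached")
        return found


rule = AlertRule(
    name="checkout-p95-latency",
    scope="example-app-resource-id",
    signal_type="metric",
    condition="p95 latency > 750 ms",
    evaluation_frequency=timedelta(minutes=1),
    lookback_window=timedelta(minutes=10),
    severity=1,
    description="Checkout latency is above target. Runbook: see team wiki.",
    action_group="checkout-oncall",
)
print(rule.problems())
```

A record like this answers one operational question and names one owner, which is what makes it easy to route.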
Signal Types
Common Azure Monitor alert signals include:
- Metric alerts.
- Log search alerts.
- Activity log alerts.
- Resource health alerts.
- Service health alerts.
- Prometheus alerts.
- Smart detection or anomaly-based alerts in supported scenarios.
Choose the alert type based on the signal. Metric alerts are usually best for fast numeric conditions. Log alerts are better for complex queries or event patterns.
Metric Alerts
Metric alerts evaluate time-series metrics. They are often used for:
- CPU or memory saturation.
- Request failure rate.
- Response time.
- Queue length.
- Service Bus dead-letter count.
- Database DTU or CPU.
- Availability metric.
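Metric alerts are usually fast and inexpensive to evaluate because they read pre-aggregated time series rather than running a query, which makes them well suited to operational symptoms.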
Static and Dynamic Thresholds
A static threshold uses a fixed value:
Alert when p95 latency > 750 ms for 10 minutes
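A dynamic threshold learns a metric's normal behavior from its history and alerts on deviations from that pattern. It can reduce manual tuning for metrics with predictable patterns, but the sensitivity setting and the alerts it produces still need periodic review.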
Use static thresholds when the acceptable limit is known. Use dynamic thresholds when normal behavior varies by time or workload and anomaly detection is useful.
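The difference between the two threshold styles can be sketched in a few lines of Python. The static check mirrors the "p95 latency > 750 ms for 10 minutes" example, assuming one sample per minute. The second function is a deliberately simplified stand-in for a learned baseline (mean plus or minus k standard deviations); it is not the algorithm Azure Monitor uses, and the function names and the default `k` are assumptions.

```python
from statistics import mean, stdev


def static_breach(samples_ms, threshold_ms=750.0, required_points=10):
    """True when the last `required_points` samples all exceed the threshold.

    With one p95 sample per minute, 10 points approximates
    "p95 latency > 750 ms for 10 minutes".
    """
    recent = samples_ms[-required_points:]
    return len(recent) == required_points and all(v > threshold_ms for v in recent)


def band_breach(history, latest, k=3.0):
    """True when `latest` is more than k standard deviations from the mean.

    Simplified stand-in for a learned baseline, not Azure's algorithm.
    """
    if len(history) < 2:
        return False  # not enough data to estimate spread
    centre, spread = mean(history), stdev(history)
    return abs(latest - centre) > k * spread


# A single slow minute does not trip the static rule; ten in a row do.
print(static_breach([800.0] * 9 + [700.0]))
print(static_breach([800.0] * 10))
# The band reacts to a value that is unusual relative to recent history.
print(band_breach([100, 102, 98, 101, 99], 150))
```

The static rule needs a known acceptable limit; the band needs only history, which is why it suits metrics whose normal level shifts with time or workload.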
Dimension Splitting
Metric dimensions can split alerts by values such as region, route, status code, or instance.
Dimension splitting is powerful but can create alert storms. Use it when routing or diagnosis benefits from separate alert instances.
Example:
Alert per region when availability drops below target.
Do not alert per user ID or request ID.
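The reason behind this guidance is cardinality: each combination of dimension values can become its own alert instance. The Python sketch below estimates that upper bound; the function names and the limit of 50 are assumptions chosen for illustration, not Azure limits.

```python
from math import prod


def max_alert_instances(dimensions: dict[str, list[str]]) -> int:
    """Upper bound on alert instances: one per combination of dimension values."""
    return prod(len(set(values)) for values in dimensions.values())


def is_safe_split(dimensions: dict[str, list[str]], limit: int = 50) -> bool:
    """True when splitting by these dimensions stays within the chosen limit."""
    return max_alert_instances(dimensions) <= limit


by_region = {"region": ["westeurope", "eastus", "southeastasia"]}
by_user = {"user_id": [str(i) for i in range(10_000)]}

print(max_alert_instances(by_region))  # 3
print(is_safe_split(by_user))          # False
```

Splitting by region yields a handful of routable alerts; splitting by user ID yields one potential alert per user.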
Log Search Alerts
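Log search alerts run a KQL query on a schedule against data in a Log Analytics workspace or an Application Insights resource. The example below uses the Application Insights requests table.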
Example:
requests
| where timestamp > ago(10m)
| summarize
    total = count(),
    failed = countif(success == false)
| extend failureRate = 100.0 * failed / total
| where total > 100 and failureRate > 5
Use log alerts when the condition requires joins, custom grouping, ratios, or event detail that metrics do not provide.
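The `total > 100` clause in the query above matters as much as the ratio itself: without a minimum volume, a few failures during a quiet period would look like a large failure rate. A minimal Python mirror of the same arithmetic, with a hypothetical function name and a list of booleans standing in for the request rows:

```python
def failure_rate_alert(results, min_requests=100, max_failure_pct=5.0):
    """Mirror of the query above; `results` holds True for each successful request."""
    total = len(results)
    if total <= min_requests:
        return False  # too little traffic for the ratio to be meaningful
    failed = sum(1 for ok in results if not ok)
    return 100.0 * failed / total > max_failure_pct


# 12 failures out of 200 requests is 6 percent, above the 5 percent limit.
print(failure_rate_alert([True] * 188 + [False] * 12))
# 50 failures out of 50 requests is ignored because volume is too low.
print(failure_rate_alert([False] * 50))
```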
Activity Log Alerts
Activity log alerts detect management-plane events such as:
- Resource deletion.
- Role assignment changes.
- Policy assignment changes.
- Service health notifications.
- Autoscale operations.
These alerts are useful for security, governance, and platform operations. They do not replace application health alerts.
Service Health and Resource Health
Service Health alerts notify teams about Azure service incidents, planned maintenance, and health advisories that may affect subscriptions or regions.
Resource Health alerts focus on the health of specific Azure resources.
Use both where appropriate:
- Service Health explains platform-wide Azure conditions.
- Resource Health identifies a resource-level availability issue.
- Application alerts still detect whether users are affected.
Severity
Severity should indicate urgency and expected response, not emotional intensity.
Example severity model:
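One way to make such a model concrete is a lookup from severity to expected response. In the Python sketch below, the level names follow Azure Monitor's Sev 0 to Sev 4 scale, while the response text for each level is a hypothetical placeholder that each team would replace with its own expectations.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Azure Monitor severity levels, 0 being the most urgent."""
    CRITICAL = 0
    ERROR = 1
    WARNING = 2
    INFORMATIONAL = 3
    VERBOSE = 4


# Hypothetical response expectations; each team should define its own.
EXPECTED_RESPONSE = {
    Severity.CRITICAL: "page on-call immediately",
    Severity.ERROR: "page during business hours, ticket otherwise",
    Severity.WARNING: "create a ticket for the owning team",
    Severity.INFORMATIONAL: "show on dashboard",
    Severity.VERBOSE: "log only",
}


def expected_response(sev: int) -> str:
    """Translate a numeric severity into the agreed response."""
    return EXPECTED_RESPONSE[Severity(sev)]


print(expected_response(0))
```

Writing the expected response next to each level makes it obvious when a rule's severity does not match how people actually react to it.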
If every alert is critical, no alert is critical.
Action Groups
An action group defines notifications and automated actions for alerts.
Notification examples:
- Email.
- SMS.
- Voice.
- Azure mobile app push.
Automation examples:
- Webhook.
- Secure webhook.
- Azure Function.
- Logic App.
- Automation runbook.
- Event Hub.
- ITSM connector.
Action groups are reusable across alert rules. They should map to service ownership and escalation paths.
Common Alert Schema
The common alert schema standardizes alert payloads across alert types. It is useful when routing alerts into webhooks, Logic Apps, ITSM systems, or incident-management platforms.
Use the common schema when:
- A shared automation endpoint handles multiple alert types.
- You want consistent fields for severity, resource, condition, and context.
- You need maintainable downstream parsing.
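As a rough illustration of downstream parsing, the Python sketch below reads a few fields from the `essentials` section of a common alert schema payload. The sample payload is abbreviated and its values are invented for the example; `summarize_alert` is a hypothetical helper, and a real handler would also validate fields that this sketch assumes are present.

```python
import json

# Abbreviated, illustrative payload; real payloads carry more fields.
SAMPLE_PAYLOAD = """
{
  "schemaId": "azureMonitorCommonAlertSchema",
  "data": {
    "essentials": {
      "alertRule": "checkout-error-rate",
      "severity": "Sev2",
      "signalType": "Metric",
      "monitorCondition": "Fired",
      "alertTargetIDs": ["/subscriptions/example-sub/resourcegroups/example-rg/providers/microsoft.web/sites/example-app"],
      "firedDateTime": "2024-01-01T10:00:00Z"
    },
    "alertContext": {}
  }
}
"""


def summarize_alert(payload: dict) -> dict:
    """Extract routing-relevant fields from a common alert schema payload."""
    if payload.get("schemaId") != "azureMonitorCommonAlertSchema":
        raise ValueError("not a common alert schema payload")
    essentials = payload["data"]["essentials"]
    return {
        "rule": essentials["alertRule"],
        "severity": int(essentials["severity"].removeprefix("Sev")),
        "state": essentials["monitorCondition"],
        "signal_type": essentials["signalType"],
        "targets": essentials.get("alertTargetIDs", []),
        "fired_at": essentials.get("firedDateTime"),
    }


summary = summarize_alert(json.loads(SAMPLE_PAYLOAD))
print(summary["rule"], summary["severity"], summary["state"])
```

Because every alert type shares the `essentials` block, one handler like this can serve metric, log, and activity log alerts without per-type parsing.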
Testing Action Groups
Action groups should be tested before production incidents.
Test:
- Email and SMS recipients.
- Webhook authentication.
- Logic App or Function behavior.
- Incident-management integration.
- Common alert schema parsing.
- Escalation paths.
An untested action group is basically a polite wish.
Alert Processing Rules
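Alert processing rules act on fired alerts rather than on the alert rules themselves. They can suppress action groups (and therefore notifications) or apply additional action groups based on scope, filters, and schedule. A suppressed alert is still recorded; only its actions are withheld.
Common uses:
- Suppress notifications during planned maintenance.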
- Route a set of alerts to a temporary team.
- Add an action group across many rules.
- Reduce noise during known platform events.
Suppression should be visible, time-bound, and documented. Silent permanent suppression is dangerous.
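The "visible, time-bound, and documented" requirement can be enforced in tooling rather than left to convention. The Python sketch below is a hypothetical model, not the Azure alert processing rule schema: the `Suppression` class, its fields, and the example change reference are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Suppression:
    """Hypothetical suppression record: visible, time-bound, documented."""
    scope_prefix: str  # resource ID prefix the suppression covers
    start: datetime
    end: datetime      # required, so no suppression is open-ended
    reason: str        # e.g. a change record reference
    owner: str

    def __post_init__(self):
        if self.end <= self.start:
            raise ValueError("suppression must end after it starts")
        if not self.reason.strip() or not self.owner.strip():
            raise ValueError("suppression needs a documented reason and owner")

    def applies_to(self, resource_id: str, at: datetime) -> bool:
        """True when notifications for this resource are suppressed at `at`."""
        in_scope = resource_id.lower().startswith(self.scope_prefix.lower())
        return in_scope and self.start <= at < self.end


window = Suppression(
    scope_prefix="/subscriptions/example-sub/resourceGroups/example-rg",
    start=datetime(2024, 1, 1, 22, 0, tzinfo=timezone.utc),
    end=datetime(2024, 1, 2, 2, 0, tzinfo=timezone.utc),
    reason="CHG-0000 planned database upgrade",
    owner="platform-team",
)
resource = "/subscriptions/example-sub/resourcegroups/example-rg/providers/example"
print(window.applies_to(resource, datetime(2024, 1, 1, 23, 0, tzinfo=timezone.utc)))
```

Making the end time a required field means a permanent suppression cannot be created by accident.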
Maintenance Windows
During planned maintenance, use alert processing rules or explicit alert disablement with a defined end time. Do not rely on people remembering to re-enable alerts.
Good maintenance handling includes:
- Change record.
- Affected resources.
- Suppression start and end.
- Expected symptoms.
- Owner.
- Rollback plan.
Incident Response Workflow
A basic incident workflow:
- Alert fires.
- On-call acknowledges.
- Triage confirms impact and severity.
- Incident lead coordinates response.
- Engineers mitigate user impact.
- Communications keep stakeholders informed.
- Root cause investigation continues after mitigation.
- Post-incident review creates follow-up work.
- Alerts and runbooks are improved.
The alert starts the workflow. It does not replace the workflow.
Runbooks
A runbook explains what to do when an alert fires.
A useful runbook includes:
- What the alert means.
- Likely causes.
- First queries or dashboards to check.
- Mitigation steps.
- Escalation contacts.
- Rollback or failover instructions.
- Customer communication guidance.
- Links to related alerts and known issues.
Put the runbook link in the alert description or downstream incident ticket.
Alert Quality
Track alert quality over time.
Useful measures:
- True positive rate.
- Time to acknowledge.
- Time to mitigate.
- Alerts per incident.
- Alerts closed as noise.
- Duplicate alerts.
- Alerts without runbooks.
- Incidents found by users before monitoring.
Bad alerts train teams to ignore the system. Good alerts build trust.
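Several of these measures can be computed from a simple export of alert history. The Python sketch below assumes a hypothetical `AlertRecord` shape and treats an alert as a true positive when it was linked to an incident; both are assumptions for illustration, and teams may define the measures differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class AlertRecord:
    """Hypothetical row from an alert history export."""
    fired: datetime
    acknowledged: Optional[datetime]  # None when nobody acknowledged
    incident_id: Optional[str]        # None when closed as noise
    has_runbook: bool


def quality_report(alerts: list[AlertRecord]) -> dict:
    """Summarize alert quality for a review period."""
    total = len(alerts)
    actionable = [a for a in alerts if a.incident_id is not None]
    incidents = {a.incident_id for a in actionable}
    ack_seconds = [
        (a.acknowledged - a.fired).total_seconds()
        for a in alerts
        if a.acknowledged is not None
    ]
    return {
        "true_positive_rate": len(actionable) / total if total else 0.0,
        "mean_seconds_to_acknowledge": (
            sum(ack_seconds) / len(ack_seconds) if ack_seconds else None
        ),
        "alerts_per_incident": len(actionable) / len(incidents) if incidents else 0.0,
        "closed_as_noise": total - len(actionable),
        "without_runbook": sum(1 for a in alerts if not a.has_runbook),
    }


t0 = datetime(2024, 1, 1, 10, 0, 0)
history = [
    AlertRecord(t0, t0 + timedelta(seconds=60), "INC-1", True),
    AlertRecord(t0, t0 + timedelta(seconds=120), "INC-1", True),
    AlertRecord(t0, t0 + timedelta(seconds=180), "INC-2", False),
    AlertRecord(t0, None, None, False),
]
print(quality_report(history))
```

A rising alerts-per-incident figure points to duplicates, and a rising noise count points to thresholds that need tuning.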
Symptom Versus Cause Alerts
Prefer symptom alerts for paging:
- Checkout error rate.
- API p95 latency.
- Availability test failure.
- Queue age exceeding SLO.
Cause alerts are still useful, but often as supporting context:
- CPU high.
- Thread pool starvation.
- Database DTU high.
- Dependency throttling.
Page on user impact. Use cause alerts to diagnose.
Multi-Stage Alerting
Some systems use different alert routes for different urgency levels.
Example:
- Warning dashboard when error rate is elevated.
- Ticket when degradation lasts 15 minutes.
- Page when SLO burn rate is high or availability is affected.
This reduces noise while still preserving visibility.
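The three stages in the example can be expressed as a small routing function. This Python sketch is illustrative: the function name, the 1 percent "elevated" error rate, and the burn-rate threshold of 10 are assumed values, and only the 15-minute duration comes from the example above.

```python
def route(error_rate_pct, degraded_minutes, burn_rate, availability_affected,
          elevated_error_pct=1.0, page_burn_rate=10.0):
    """Return "page", "ticket", "dashboard", or "none" for the current state."""
    # Highest urgency first: user-facing availability or fast SLO burn.
    if availability_affected or burn_rate >= page_burn_rate:
        return "page"
    if error_rate_pct >= elevated_error_pct:
        # Sustained degradation becomes a ticket; short blips stay on the dashboard.
        if degraded_minutes >= 15:
            return "ticket"
        return "dashboard"
    return "none"


print(route(error_rate_pct=2.0, degraded_minutes=5, burn_rate=1.0, availability_affected=False))
print(route(error_rate_pct=2.0, degraded_minutes=20, burn_rate=1.0, availability_affected=False))
print(route(error_rate_pct=2.0, degraded_minutes=20, burn_rate=12.0, availability_affected=False))
```

Checking the paging conditions first ensures a severe problem is never downgraded to a ticket just because it also matches a lower stage.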
Automation Actions
Automation can help, but it must be safe.
Good automation:
- Adds context to incidents.
- Restarts a known safe component.
- Scales out within limits.
- Opens a ticket.
- Captures diagnostics.
- Starts a runbook that requires approval for risky steps.
Bad automation can make incidents worse by repeatedly restarting healthy services or hiding symptoms before evidence is captured.
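A guard in front of an automated restart addresses both failure modes named above. The Python sketch below uses a hypothetical `RestartGuard` class; the limit of two restarts per 30 minutes is an assumed policy, and the health and diagnostics flags would come from whatever checks the automation already performs.

```python
from datetime import datetime, timedelta


class RestartGuard:
    """Allow an automated restart only a limited number of times per window."""

    def __init__(self, max_restarts=2, window=timedelta(minutes=30)):
        self.max_restarts = max_restarts
        self.window = window
        self._history = []  # timestamps of restarts this guard allowed

    def allow(self, now: datetime, component_healthy: bool,
              diagnostics_captured: bool) -> bool:
        """Decide whether the automation may restart the component now."""
        # Never restart a healthy service, and never destroy evidence.
        if component_healthy or not diagnostics_captured:
            return False
        # Forget restarts that have aged out of the window.
        self._history = [t for t in self._history if now - t < self.window]
        if len(self._history) >= self.max_restarts:
            return False  # escalate to a human instead of looping
        self._history.append(now)
        return True


guard = RestartGuard()
t0 = datetime(2024, 1, 1, 10, 0)
print(guard.allow(t0, component_healthy=False, diagnostics_captured=True))
print(guard.allow(t0 + timedelta(minutes=5), component_healthy=False, diagnostics_captured=True))
print(guard.allow(t0 + timedelta(minutes=10), component_healthy=False, diagnostics_captured=True))
```

When the guard refuses, the automation should hand off to on-call rather than retry.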
Security and Access
Alerting systems can trigger powerful actions. Secure them.
Consider:
- Who can edit alert rules.
- Who can edit action groups.
- Whether webhook endpoints are authenticated.
- Whether automation identities have least privilege.
- Whether alert payloads contain sensitive data.
- Whether incident channels expose customer data.
Treat alerting configuration as part of the control plane: anyone who can change it can trigger or silence automated actions.
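For the question of sensitive data in alert payloads, one option is to redact known fields before a payload is forwarded to a shared incident channel. The Python sketch below is a minimal illustration: `SENSITIVE_KEYS` is a hypothetical list that each team would define for itself, and key-name matching will not catch sensitive values stored under other names or inside free text.

```python
# Hypothetical key names to redact; adapt to the data your alerts carry.
SENSITIVE_KEYS = {"email", "customer_id", "authorization", "connection_string"}


def redact(value):
    """Return a copy of `value` with sensitive keys replaced, at any depth."""
    if isinstance(value, dict):
        return {
            key: "[redacted]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


payload = {
    "data": {
        "essentials": {"alertRule": "checkout-error-rate"},
        "alertContext": {
            "Email": "user@example.com",
            "rows": [{"customer_id": "42", "status": 500}],
        },
    }
}
print(redact(payload))
```

The function builds a new structure rather than editing in place, so the original payload stays available to systems that are allowed to see it.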
Common Mistakes
- Paging on every exception.
- Alerting on infrastructure metrics without user impact.
- No action group testing.
- No runbook.
- Everyone receives every alert.
- Severity has no meaning.
- Suppression rules without expiration.
- Alert thresholds copied from another system.
- No ownership for dashboards or alerts.
- Treating alert creation as the end of incident preparedness.
Best Practices
- Page on user-impacting symptoms.
- Use metric alerts for fast numeric signals.
- Use log alerts for complex KQL conditions.
- Keep alert descriptions actionable.
- Attach the right action group and runbook.
- Test notifications and automation.
- Use alert processing rules for planned maintenance.
- Review noisy alerts regularly.
- Automate alert rules and action groups as infrastructure as code.
- Include alerts in post-incident improvement work.