Monitoring, tracing, and incident response on Azure

Log Analytics queries, dashboards, and availability tests

Overview

Log Analytics is the query experience for Azure Monitor Logs. It lets teams analyze telemetry stored in Log Analytics workspaces using Kusto Query Language. Dashboards, workbooks, and availability tests turn that telemetry into operational views and proactive checks.

This subtopic covers three practical skills:

  • Writing useful KQL queries for troubleshooting and reporting.
  • Designing dashboards and workbooks that show system health clearly.
  • Using Application Insights availability tests to monitor endpoints from outside the application.

For interviews, candidates should be able to write basic KQL, explain query scope and time filters, choose between dashboards and workbooks, design useful availability checks, avoid noisy visualizations, and connect synthetic test failures back to logs and traces.

Core Concepts

Log Analytics Workspace

A Log Analytics workspace is a data store for Azure Monitor Logs. It contains tables for application requests, exceptions, dependencies, availability results, Azure resource logs, activity logs, and container logs, as well as custom tables.

Workspace design affects:

  • Query scope.
  • Access control.
  • Retention.
  • Cost.
  • Cross-resource investigation.
  • Dashboard and alert design.

For most application teams, the important skill is understanding which tables contain the data they need and how to query them safely.

Kusto Query Language

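Kusto Query Language, or KQL, is used to query Azure Monitor Logs. A KQL query usually starts with a table name followed by a pipeline of operators separated by the pipe character (|), each one transforming the output of the previous step.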

Basic shape:

Code
requests
| where timestamp > ago(1h)
| where success == false
| summarize failures = count() by operation_Name
| order by failures desc

Common operators:

  • where filters rows.
  • project selects columns.
  • extend creates calculated columns.
  • summarize groups and aggregates.
  • order by sorts results.
  • join combines tables.
  • render visualizes results.
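
As a rough illustration of how several of these operators combine, the sketch below lists the slowest failed requests. It assumes the classic Application Insights `requests` schema used elsewhere in this section, where `duration` is recorded in milliseconds; the `durationSeconds` column name is made up for the example.

```kusto
// Slowest failed requests in the last hour, with duration shown in seconds.
requests
| where timestamp > ago(1h)                    // filter rows by time
| where success == false                       // keep only failures
| extend durationSeconds = duration / 1000.0   // calculated column (assumes duration is in ms)
| project timestamp, operation_Name, resultCode, durationSeconds
| order by durationSeconds desc
| take 20                                      // keep the result set small
```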

Query Scope

Query scope defines what data the query can see. You can query:

  • One resource.
  • One workspace.
  • A resource group scope.
  • Multiple workspaces or resources when configured and authorized.

A common mistake is assuming missing results mean no problem exists. The query might simply be scoped to the wrong resource or time range.
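
When the data lives in more than one Application Insights resource, a cross-resource query can widen the scope explicitly. The sketch below assumes two hypothetical resources named `contoso-api` and `contoso-web` that the caller is authorized to read; substitute real resource names or IDs.

```kusto
// Failed requests across two Application Insights resources.
// 'contoso-api' and 'contoso-web' are placeholder names, not real resources.
union app('contoso-api').requests, app('contoso-web').requests
| where timestamp > ago(1h)
| where success == false
| summarize failures = count() by appName, operation_Name
| order by failures desc
```

Grouping by `appName` shows which resource each failure came from, which helps confirm that the query is looking at the intended scope.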

Time Filtering

Always bound production queries by time.

Code
exceptions
| where timestamp between (ago(24h) .. now())
| summarize count() by type
| order by count_ desc

Time filters improve performance, reduce cost, and make results easier to interpret. Dashboards and alerts should use time windows that match the operational question.
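
For trend questions, one possible approach is to combine the time filter with `bin()` so that the window and the bucket size both match the question being asked. The 15-minute bucket below is an arbitrary choice for illustration.

```kusto
// Exceptions per 15-minute bucket over the last 24 hours, split by type.
exceptions
| where timestamp > ago(24h)
| summarize exceptionCount = count() by bin(timestamp, 15m), type
| render timechart
```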

Request Failure Query

Find failing routes:

Code
requests
| where timestamp > ago(1h)
| where success == false
| summarize failures = count(), users = dcount(user_Id) by operation_Name, resultCode
| order by failures desc

This is a practical incident query because it groups failures by operation and status code rather than dumping raw rows.

Latency Query

Find slow operations:

Code
requests
| where timestamp > ago(6h)
| summarize
    p50 = percentile(duration, 50),
    p95 = percentile(duration, 95),
    p99 = percentile(duration, 99),
    requestCount = count()
  by operation_Name
| order by p95 desc

Percentiles are often more useful than average latency because averages hide tail behavior.
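
A small self-contained sketch can make the difference concrete. It uses made-up sample data (nine requests at 100 ms and one at 5000 ms) generated with `range`, so it does not depend on any real table.

```kusto
// Hypothetical sample: nine requests take 100 ms and one takes 5000 ms.
range i from 1 to 10 step 1
| extend duration = iff(i == 10, 5000.0, 100.0)
| summarize
    avgMs = avg(duration),
    p50Ms = percentile(duration, 50),
    p99Ms = percentile(duration, 99)
```

The average works out to 590 ms, a latency that no request in the sample actually experienced. The median should stay close to the typical 100 ms, while the high percentile surfaces the slow outlier.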

Dependency Failure Query

Identify failing dependencies:

Code
dependencies
| where timestamp > ago(1h)
| where success == false
| summarize failures = count() by type, target, name, resultCode
| order by failures desc

This helps determine whether an API incident is caused by internal code, SQL, storage, Redis, Service Bus, or an external HTTP dependency.

Exception Query

Find the most frequent exceptions:

Code
exceptions
| where timestamp > ago(24h)
| summarize occurrences = count(), impactedOperations = dcount(operation_Id)
    by type, outerMessage
| order by occurrences desc

Group by exception type and message with care. Unless logging is designed well, messages often embed high-cardinality values such as IDs, paths, or timestamps, which splits one underlying problem across many rows.
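
One way to reduce message cardinality at query time is to replace variable tokens with placeholders before grouping. The patterns below are illustrative assumptions (GUID-like tokens and digit runs) and would need tuning for real messages.

```kusto
exceptions
| where timestamp > ago(24h)
// Replace GUID-like tokens first, then any remaining digit runs.
| extend normalizedMessage = replace_regex(
    outerMessage,
    @"[0-9a-fA-F]{8}-([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}",
    "<guid>")
| extend normalizedMessage = replace_regex(normalizedMessage, @"\d+", "<n>")
| summarize occurrences = count() by type, normalizedMessage
| order by occurrences desc
```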

Trace Correlation Query

Follow one operation:

Code
union requests, dependencies, exceptions, traces
// Keep the query time bounded; widen the window if the operation is older.
| where timestamp > ago(24h)
| where operation_Id == "replace-with-operation-id"
| project timestamp, itemType, operation_Name, message, name, target, resultCode, success
| order by timestamp asc

Correlation queries are powerful during incidents because they reconstruct the operation timeline.

Availability Results Query

Analyze synthetic tests:

Code
availabilityResults
| where timestamp > ago(24h)
| summarize
    availability = 100.0 * countif(success == true) / count(),
    avgDuration = avg(duration),
    failures = countif(success == false)
  by name, location
| order by availability asc

This helps distinguish a single regional test problem from a global endpoint failure.

Joins

KQL joins connect related tables.

Example: failed requests with matching exceptions:

Code
requests
| where timestamp > ago(1h)
| where success == false
| project operation_Id, requestName = operation_Name, resultCode, requestDuration = duration
| join kind=leftouter (
    exceptions
    | where timestamp > ago(1h)
    | project operation_Id, exceptionType = type, outerMessage
) on operation_Id
| order by requestDuration desc

Join carefully. Large joins can be expensive and slow. Filter both sides first.
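
A further option, sketched below, is to aggregate the right-hand side before joining so that each request matches at most one row. The `lookback`, `failedRequests`, and `exceptionSummary` names are illustrative, not part of any schema.

```kusto
let lookback = 1h;
// Left side: failed requests, filtered and trimmed to the needed columns.
let failedRequests = requests
    | where timestamp > ago(lookback)
    | where success == false
    | project operation_Id, requestName = operation_Name, resultCode;
// Right side: one summary row per operation instead of one row per exception.
let exceptionSummary = exceptions
    | where timestamp > ago(lookback)
    | summarize exceptionCount = count(), exceptionTypes = make_set(type) by operation_Id;
failedRequests
| join kind=leftouter exceptionSummary on operation_Id
| project requestName, resultCode, exceptionCount, exceptionTypes
| order by exceptionCount desc
```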

Saved Queries and Query Packs

Saved queries and query packs help teams reuse tested KQL instead of rebuilding investigation queries during an incident.

Good saved queries:

  • Have clear names.
  • Include comments for parameters.
  • Use safe time ranges.
  • Return summarized results first.
  • Link to runbooks or dashboards when useful.

Query reuse is part of operational maturity.
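
As a rough template for what such a saved query might look like, the sketch below puts the name, purpose, and parameters in comments and keeps the tunable values in `let` statements. The parameter names and default values are assumptions for illustration.

```kusto
// Name: API - failed requests by operation (summary)
// Purpose: first-look incident query; returns one row per operation and result code.
// Parameters: adjust 'lookback' and 'minFailures' before running.
// Runbook: add a link to the team's runbook here, if one exists.
let lookback = 1h;
let minFailures = 5;
requests
| where timestamp > ago(lookback)
| where success == false
| summarize failures = count() by operation_Name, resultCode
| where failures >= minFailures
| order by failures desc
```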

Dashboards

Dashboards present important metrics and query results at a glance.

A useful production dashboard shows:

  • Availability.
  • Error rate.
  • Request rate.
  • Latency percentiles.
  • Saturation or backlog.
  • Dependency failures.
  • Recent deployments.
  • Active alerts.

Avoid dashboards that contain many charts but no decision support. A dashboard should answer "Is the system healthy?" and "Where should I look next?"

Workbooks

Azure Workbooks are interactive reports for Azure Monitor data. They can combine text, parameters, metrics, logs, charts, grids, and links.

Use workbooks when:

  • You need an interactive troubleshooting view.
  • Users should choose time range, service, region, or operation.
  • You want documentation and charts in one place.
  • You need a reusable operational report.

Dashboards are better for at-a-glance monitoring. Workbooks are better for guided investigation.

Grafana and Azure Dashboards

Azure Monitor data can also be visualized in managed Grafana and Azure dashboards. Choose based on audience and existing operational tooling.

Consider:

  • Who owns the dashboard?
  • Which data sources are required?
  • Does the team already use Grafana?
  • Are Prometheus metrics involved?
  • Is access control aligned with the data shown?
  • Can the dashboard be deployed as code?

Availability Tests

Application Insights availability tests are recurring synthetic checks against HTTP or HTTPS endpoints. They measure availability and response time from external locations.

Use them to monitor:

  • Homepage or health endpoint reachability.
  • Login or lightweight API flow.
  • Critical public APIs.
  • External dependency endpoints.
  • TLS certificate validity.

Availability tests do not require application code changes, but the endpoint must be reachable from the test locations unless a supported private testing approach is used.

Standard Availability Tests

Standard tests are the current availability-test option for single-request checks. They can validate status code, response time, and content match, support custom HTTP methods, headers, and request bodies, and check both TLS certificate validity and remaining certificate lifetime (the proactive lifetime check).

Use Standard tests rather than classic URL ping tests. Classic URL ping tests are scheduled for retirement on September 30, 2026.

Test Locations and Thresholds

Run availability tests from multiple locations. A single failed location may indicate a regional network path issue rather than an application outage.

Good alert design considers:

  • Number of locations.
  • Failure threshold.
  • Test frequency.
  • Expected maintenance windows.
  • Endpoint timeout.
  • Whether retries are enabled.

For public user-facing apps, at least five locations is a common starting point.
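
The location and threshold considerations above can be expressed as a query. The sketch below flags tests that failed from at least two distinct locations in the last 15 minutes; both the threshold and the window are assumed values, and the `success == false` comparison follows the convention of the other queries in this section.

```kusto
let minFailedLocations = 2;   // assumed threshold; tune per test
availabilityResults
| where timestamp > ago(15m)
| summarize
    failedLocations = dcountif(location, success == false),
    totalLocations = dcount(location)
  by name
| where failedLocations >= minFailedLocations
| order by failedLocations desc
```

Requiring more than one failing location is a simple way to avoid paging on a single regional network path issue.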

Synthetic Versus Real User Monitoring

Availability tests are synthetic. They test configured paths from test agents. Real user monitoring observes actual browser or client behavior.

Use both when possible:

  • Synthetic checks detect known critical path failures proactively.
  • Real user telemetry shows what actual users experience.
  • Server telemetry explains backend causes.

Synthetic success does not prove every real user path is healthy.
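
To compare the two views from server telemetry, one possible sketch splits request failure rates by traffic type. It assumes synthetic traffic is tagged in the `operation_SyntheticSource` column of the classic `requests` schema, which may not hold for every SDK or test configuration.

```kusto
requests
| where timestamp > ago(1h)
// Assumption: a non-empty operation_SyntheticSource marks synthetic traffic.
| extend trafficType = iff(isnotempty(operation_SyntheticSource), "synthetic", "real")
| summarize total = count(), failures = countif(success == false) by trafficType, operation_Name
| extend failureRatePercent = round(100.0 * failures / total, 2)
| order by failureRatePercent desc
```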

Dashboard Design Example

A practical API dashboard can include:

Code
Top row:
  Availability, request rate, failed request rate, p95 latency

Second row:
  Dependency failures by target
  Exceptions by type
  Queue backlog or oldest message age

Third row:
  Failed requests by operation
  Availability test failures by location
  Recent deployments

This layout starts with user impact, then guides diagnosis.

Alert Query Design

KQL alert queries should be:

  • Time bounded.
  • Summarized to a small result set.
  • Stable under normal traffic variation.
  • Focused on user impact or actionable symptoms.
  • Tested against historical incident windows.

Avoid alerts that fire for every individual exception. Alert on rates, ratios, burn rate, or grouped failure conditions.
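
A minimal sketch of a ratio-based alert query follows. It returns a row only when the failure percentage exceeds a limit and there is enough traffic for the ratio to be meaningful; the 15-minute window and both thresholds are assumed values to tune against historical incidents.

```kusto
let minRequests = 50;          // assumed minimum traffic before the ratio is trusted
let maxFailurePercent = 5.0;   // assumed failure-rate limit
requests
| where timestamp > ago(15m)
| summarize total = count(), failures = countif(success == false)
| extend failurePercent = 100.0 * failures / total
| where total >= minRequests and failurePercent > maxFailurePercent
```

The minimum-traffic guard keeps a handful of failures during a quiet period from triggering the alert.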

Common Mistakes

  • Querying the wrong scope.
  • Forgetting time filters.
  • Building dashboards that show vanity metrics.
  • Using average latency instead of percentiles.
  • Alerting on raw exception count without traffic context.
  • Not saving useful incident queries.
  • Using availability tests that only check a static page.
  • Running tests from one location only.
  • Ignoring synthetic test failures because "real users look fine."
  • Putting sensitive data in logs that dashboards expose broadly.

Best Practices

  • Learn the key tables for your app.
  • Keep KQL queries time bounded and summarized.
  • Use percentiles for latency.
  • Build dashboards around symptoms, not implementation trivia.
  • Use workbooks for guided troubleshooting.
  • Create Standard availability tests for critical endpoints.
  • Configure availability alerts with action groups.
  • Save tested incident queries.
  • Review dashboards after incidents.
  • Manage workspace access and retention deliberately.
