Overview
Log Analytics is the query experience for Azure Monitor Logs. It lets teams analyze telemetry stored in Log Analytics workspaces using Kusto Query Language. Dashboards, workbooks, and availability tests turn that telemetry into operational views and proactive checks.
This subtopic covers three practical skills:
- Writing useful KQL queries for troubleshooting and reporting.
- Designing dashboards and workbooks that show system health clearly.
- Using Application Insights availability tests to monitor endpoints from outside the application.
For interviews, candidates should be able to write basic KQL, explain query scope and time filters, choose between dashboards and workbooks, design useful availability checks, avoid noisy visualizations, and connect synthetic test failures back to logs and traces.
Core Concepts
Log Analytics Workspace
A Log Analytics workspace is a data store for Azure Monitor Logs. It contains tables for application request telemetry, exceptions, dependencies, availability results, Azure resource logs, activity logs, and container logs, as well as custom tables.
Workspace design affects:
- Query scope.
- Access control.
- Retention.
- Cost.
- Cross-resource investigation.
- Dashboard and alert design.
For most application teams, the important skill is understanding which tables contain the data they need and how to query them safely.
Kusto Query Language
Kusto Query Language, or KQL, is used to query Azure Monitor Logs. A KQL query is usually a pipeline of operators.
Basic shape:
requests
| where timestamp > ago(1h)
| where success == false
| summarize failures = count() by operation_Name
| order by failures desc
Common operators:
- where filters rows.
- project selects columns.
- extend creates calculated columns.
- summarize groups and aggregates.
- order by sorts results.
- join combines tables.
- render visualizes results.
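As a rough illustration, the sketch below chains most of these operators in one query. It assumes the Application Insights requests table, where duration is recorded in milliseconds; durationSeconds and p95Seconds are names chosen only for this example.

```kusto
requests
| where timestamp > ago(1h)                              // where: keep only the last hour
| extend durationSeconds = duration / 1000.0             // extend: add a calculated column
| project operation_Name, resultCode, durationSeconds    // project: keep only the needed columns
| summarize p95Seconds = percentile(durationSeconds, 95) by operation_Name   // summarize: aggregate per operation
| order by p95Seconds desc                               // order by: slowest operations first
| render barchart                                        // render: draw the result as a chart
```

Each operator receives the output of the previous one, so filtering early keeps the later steps cheap.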
Query Scope
Query scope defines what data the query can see. You can query:
- One resource.
- One workspace.
- A resource group scope.
- Multiple workspaces or resources when configured and authorized.
A common mistake is assuming missing results mean no problem exists. The query might simply be scoped to the wrong resource or time range.
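One way to widen the scope explicitly is the app() expression, which reads a table from a named Application Insights resource. The sketch below is illustrative only: the resource names contoso-api-prod and contoso-api-dr are hypothetical, and the query returns data only if the caller is authorized on both resources.

```kusto
// Hypothetical resource names; replace with resources you can access.
union
    app("contoso-api-prod").requests,
    app("contoso-api-dr").requests
| where timestamp > ago(1h)
| where success == false
| summarize failures = count() by appName, operation_Name
| order by failures desc
```

Grouping by appName makes it visible which resource each row came from, which helps confirm that the scope is the one you intended.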
Time Filtering
Always bound production queries by time.
exceptions
| where timestamp between (ago(24h) .. now())
| summarize count() by type
| order by count_ desc
Time filters improve performance, reduce cost, and make results easier to interpret. Dashboards and alerts should use time windows that match the operational question.
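A time bound pairs naturally with bin(), which groups rows into fixed intervals for charting. The following is a minimal sketch; the 6-hour window and 5-minute bin size are arbitrary choices for illustration and should follow the operational question being asked.

```kusto
requests
| where timestamp > ago(6h)                                // bound the query to a known window
| summarize requestCount = count() by bin(timestamp, 5m)   // one row per 5-minute interval
| render timechart
```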
Request Failure Query
Find failing routes:
requests
| where timestamp > ago(1h)
| where success == false
| summarize failures = count(), users = dcount(user_Id) by operation_Name, resultCode
| order by failures desc
This is a practical incident query because it groups failures by operation and status code rather than dumping raw rows.
Latency Query
Find slow operations:
requests
| where timestamp > ago(6h)
| summarize
    p50 = percentile(duration, 50),
    p95 = percentile(duration, 95),
    p99 = percentile(duration, 99),
    requestCount = count()
    by operation_Name
| order by p95 desc
Percentiles are often more useful than average latency because averages hide tail behavior.
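The effect is easy to reproduce with a small self-contained example. The sketch below uses datatable with made-up data: nine requests at 100 ms and one at 5000 ms for a hypothetical GET /orders operation.

```kusto
// Hypothetical sample data: durations in milliseconds.
datatable(operation_Name: string, duration: real)
[
    "GET /orders", 100.0,
    "GET /orders", 100.0,
    "GET /orders", 100.0,
    "GET /orders", 100.0,
    "GET /orders", 100.0,
    "GET /orders", 100.0,
    "GET /orders", 100.0,
    "GET /orders", 100.0,
    "GET /orders", 100.0,
    "GET /orders", 5000.0
]
| summarize
    avgDuration = avg(duration),
    p50 = percentile(duration, 50),
    p99 = percentile(duration, 99)
    by operation_Name
```

The average of these values is 590 ms, a latency that no request actually had. The median should stay at the fast value while the high percentile surfaces the slow outlier.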
Dependency Failure Query
Identify failing dependencies:
dependencies
| where timestamp > ago(1h)
| where success == false
| summarize failures = count() by type, target, name, resultCode
| order by failures desc
This helps determine whether an API incident is caused by internal code, SQL, storage, Redis, Service Bus, or an external HTTP dependency.
Exception Query
Find the most frequent exceptions:
exceptions
| where timestamp > ago(24h)
| summarize occurrences = count(), impactedOperations = dcount(operation_Id)
    by type, outerMessage
| order by occurrences desc
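Group by exception type and message with care. Messages often embed IDs, paths, or other high-cardinality values, which splits one underlying failure into many rows unless the logging is designed to keep messages stable.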
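When messages do contain variable values, one possible workaround is to normalize them at query time. The sketch below assumes the variable parts are numeric and replaces every run of digits with a placeholder using replace_regex; normalizedMessage is a name chosen for this example, and real messages may need a different pattern.

```kusto
exceptions
| where timestamp > ago(24h)
// Assumption: variable parts of the message are numeric (IDs, counts, ports).
| extend normalizedMessage = replace_regex(outerMessage, @"\d+", "#")
| summarize occurrences = count() by type, normalizedMessage
| order by occurrences desc
```

This is a query-side patch. Emitting stable message templates from the application remains the more durable fix.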
Trace Correlation Query
Follow one operation:
union requests, dependencies, exceptions, traces
| where operation_Id == "replace-with-operation-id"
| project timestamp, itemType, operation_Name, message, name, target, resultCode, success
| order by timestamp asc
Correlation queries are powerful during incidents because they reconstruct the operation timeline.
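A correlation query needs an operation_Id to start from. One way to find candidates, sketched here under the assumption that recent slow failures are the most interesting, is to list the slowest failed requests from the last hour:

```kusto
requests
| where timestamp > ago(1h)
| where success == false
| top 5 by duration desc      // the five slowest failed requests
| project timestamp, operation_Id, operation_Name, resultCode, duration
```

Any operation_Id from the result can then be used as the filter value in a correlation query.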
Availability Results Query
Analyze synthetic tests:
availabilityResults
| where timestamp > ago(24h)
| summarize
    availability = 100.0 * countif(success == true) / count(),
    avgDuration = avg(duration),
    failures = countif(success == false)
    by name, location
| order by availability asc
This helps distinguish a single regional test problem from a global endpoint failure.
Joins
KQL joins connect related tables.
Example: failed requests with matching exceptions:
requests
| where timestamp > ago(1h)
| where success == false
| project operation_Id, requestName = operation_Name, resultCode, requestDuration = duration
| join kind=leftouter (
    exceptions
    | where timestamp > ago(1h)
    | project operation_Id, exceptionType = type, outerMessage
    ) on operation_Id
| order by requestDuration desc
Join carefully. Large joins can be expensive and slow. Filter both sides first.
Saved Queries and Query Packs
Saved queries and query packs help teams reuse tested KQL instead of rebuilding investigation queries during an incident.
Good saved queries:
- Have clear names.
- Include comments for parameters.
- Use safe time ranges.
- Return summarized results first.
- Link to runbooks or dashboards when useful.
Query reuse is part of operational maturity.
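A saved query that follows these guidelines might look like the sketch below. The header comment format, the parameter names lookback and minFailures, and their default values are illustrative assumptions, not a required convention.

```kusto
// Name: Failed requests by operation (incident triage)
// Parameters:
//   lookback    - how far back to search; keep it short during incidents
//   minFailures - hide operations with fewer failures than this
let lookback = 1h;
let minFailures = 5;
requests
| where timestamp > ago(lookback)
| where success == false
| summarize failures = count() by operation_Name, resultCode
| where failures >= minFailures
| order by failures desc
```

Declaring parameters with let at the top means a responder changes one line instead of editing the body of the query under pressure.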
Dashboards
Dashboards present important metrics and query results at a glance.
A useful production dashboard shows:
- Availability.
- Error rate.
- Request rate.
- Latency percentiles.
- Saturation or backlog.
- Dependency failures.
- Recent deployments.
- Active alerts.
Avoid dashboards that contain many charts but no decision support. A dashboard should answer "Is the system healthy?" and "Where should I look next?"
Workbooks
Azure Workbooks are interactive reports for Azure Monitor data. They can combine text, parameters, metrics, logs, charts, grids, and links.
Use workbooks when:
- You need an interactive troubleshooting view.
- Users should choose time range, service, region, or operation.
- You want documentation and charts in one place.
- You need a reusable operational report.
Dashboards are better for at-a-glance monitoring. Workbooks are better for guided investigation.
Grafana and Azure Dashboards
Azure Monitor data can also be visualized in managed Grafana and Azure dashboards. Choose based on audience and existing operational tooling.
Consider:
- Who owns the dashboard?
- Which data sources are required?
- Does the team already use Grafana?
- Are Prometheus metrics involved?
- Is access control aligned with the data shown?
- Can the dashboard be deployed as code?
Availability Tests
Application Insights availability tests are recurring synthetic checks against HTTP or HTTPS endpoints. They measure availability and response time from external locations.
Use them to monitor:
- Homepage or health endpoint reachability.
- Login or lightweight API flow.
- Critical public APIs.
- External dependency endpoints.
- TLS certificate validity.
Availability tests do not require application code changes, but the endpoint must be reachable from the test locations unless a supported private testing approach is used.
Standard Availability Tests
Standard tests are the current availability-test option for single-request checks. They can validate status code, response time, and content match, let you set the HTTP method, headers, and request body, and check TLS certificate validity and remaining certificate lifetime.
Use Standard tests rather than classic URL ping tests, which are scheduled for retirement on September 30, 2026.
Test Locations and Thresholds
Run availability tests from multiple locations. A single failed location may indicate a regional network path issue rather than an application outage.
Good alert design considers:
- Number of locations.
- Failure threshold.
- Test frequency.
- Expected maintenance windows.
- Endpoint timeout.
- Whether retries are enabled.
For public user-facing apps, at least five locations is a common starting point.
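The location threshold can be expressed directly in a query. The sketch below counts how many distinct locations reported a failure for each test in the last 15 minutes; the threshold of two locations and the window length are hypothetical starting values, and the success == false comparison follows the convention used in the availability results query earlier in this section.

```kusto
let minFailedLocations = 2;   // hypothetical threshold: ignore single-location failures
availabilityResults
| where timestamp > ago(15m)
| summarize
    failedLocations = dcountif(location, success == false),
    testedLocations = dcount(location)
    by name
| where failedLocations >= minFailedLocations
| order by failedLocations desc
```

Comparing failedLocations with testedLocations shows at a glance whether a failure is isolated to a few regions or affects every location that ran the test.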
Synthetic Versus Real User Monitoring
Availability tests are synthetic. They test configured paths from test agents. Real user monitoring observes actual browser or client behavior.
Use both when possible:
- Synthetic checks detect known critical path failures proactively.
- Real user telemetry shows what actual users experience.
- Server telemetry explains backend causes.
Synthetic success does not prove every real user path is healthy.
Dashboard Design Example
A practical API dashboard can include:
Top row:
    Availability, request rate, failed request rate, p95 latency
Second row:
    Dependency failures by target
    Exceptions by type
    Queue backlog or oldest message age
Third row:
    Failed requests by operation
    Availability test failures by location
    Recent deployments
This layout starts with user impact, then guides diagnosis.
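Several of the top-row tiles can be fed by a single summarized query. The following sketch covers request volume, failed request rate, and p95 latency; the one-hour window and the column names are assumptions for illustration, and availability would come from a separate query against availabilityResults.

```kusto
requests
| where timestamp > ago(1h)
| summarize
    requestCount = count(),
    failedRatePercent = round(100.0 * countif(success == false) / count(), 2),
    p95DurationMs = percentile(duration, 95)
```

Returning one row with a few named values keeps the tile cheap to refresh and easy to read.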
Alert Query Design
KQL alert queries should be:
- Time bounded.
- Summarized to a small result set.
- Stable under normal traffic variation.
- Focused on user impact or actionable symptoms.
- Tested against historical incident windows.
Avoid alerts that fire for every individual exception. Alert on rates, ratios, burn rate, or grouped failure conditions.
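A rate-based alert query along these lines could look like the sketch below. The lookback window, the minimum request count, and the 5 percent failure threshold are hypothetical values that should be tuned against historical incident windows.

```kusto
let lookback = 15m;
let minRequests = 50;              // skip operations with too little traffic to judge
let maxFailureRatePercent = 5.0;   // hypothetical threshold
requests
| where timestamp > ago(lookback)
| summarize total = count(), failures = countif(success == false) by operation_Name
| where total >= minRequests
| extend failureRatePercent = 100.0 * failures / total
| where failureRatePercent > maxFailureRatePercent
| project operation_Name, total, failures, failureRatePercent
```

The query returns rows only when an operation with meaningful traffic exceeds the failure ratio, so the alert can fire on any returned row rather than on individual exceptions.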
Common Mistakes
- Querying the wrong scope.
- Forgetting time filters.
- Building dashboards that show vanity metrics.
- Using average latency instead of percentiles.
- Alerting on raw exception count without traffic context.
- Not saving useful incident queries.
- Using availability tests that only check a static page.
- Running tests from one location only.
- Ignoring synthetic test failures because "real users look fine."
- Putting sensitive data in logs that dashboards expose broadly.
Best Practices
- Learn the key tables for your app.
- Keep KQL queries time bounded and summarized.
- Use percentiles for latency.
- Build dashboards around symptoms, not implementation trivia.
- Use workbooks for guided troubleshooting.
- Create Standard availability tests for critical endpoints.
- Configure availability alerts with action groups.
- Save tested incident queries.
- Review dashboards after incidents.
- Manage workspace access and retention deliberately.