Overview
Durable Functions extends Azure Functions with stateful, fault-tolerant workflows defined in code. The Durable runtime records workflow history, checkpoints progress, replays orchestrator code, schedules work, and recovers from process restarts.
The main building blocks are:
- Client functions: Start and manage workflow instances.
- Orchestrator functions: Describe workflow control flow.
- Activity functions: Perform side effects and units of work.
- Durable timers: Persist delays and timeouts without holding compute.
- External events: Deliver asynchronous signals to running workflows.
- Durable entities: Manage small pieces of addressable state through serialized operations.
Durable Functions is useful for:
- Long-running business processes.
- Human approval.
- Multi-step integrations.
- Retryable workflows.
- Fan-out/fan-in processing.
- Monitoring loops.
- Stateful coordination.
- Compensation after partial completion.
It is not a replacement for every message consumer or database transaction. The value is durable coordination across time, failures, and multiple function executions.
For interviews, candidates should explain:
- Event sourcing and orchestrator replay.
- Deterministic orchestrator constraints.
- Why side effects belong in activities.
- Activity at-least-once behavior.
- Durable timers versus thread sleeps.
- External event correlation and timeouts.
- Entity identity and serialized operations.
- Task hubs and storage providers.
- Instance management, observability, versioning, and cleanup.
Core Concepts
Durable Application Roles
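Each role uses a specialized Durable Functions trigger or binding. In the .NET isolated worker model, client functions take a [DurableClient] input binding, while orchestrator, activity, and entity functions use [OrchestrationTrigger], [ActivityTrigger], and [EntityTrigger].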
Starting an Orchestration
A non-orchestrator function can use DurableTaskClient:
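using System.Net;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Http;
using Microsoft.DurableTask.Client;

public sealed class StartOrderWorkflow
{
    [Function("StartOrderWorkflow")]
    public async Task<HttpResponseData> Run(
        [HttpTrigger(
            AuthorizationLevel.Function,
            "post",
            Route = "orders/workflows")]
        HttpRequestData request,
        [DurableClient] DurableTaskClient client)
    {
        // A JSON null body deserializes to null; reject it instead of
        // starting a workflow that has no input.
        var input =
            await request.ReadFromJsonAsync<OrderWorkflowInput>();
        if (input is null)
        {
            return request.CreateResponse(HttpStatusCode.BadRequest);
        }

        // Schedule the orchestration and capture its instance ID.
        string instanceId =
            await client.ScheduleNewOrchestrationInstanceAsync(
                nameof(OrderOrchestrator),
                input);

        // Build the response that carries the management URLs.
        return await client.CreateCheckStatusResponseAsync(
            request,
            instanceId);
    }
}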
The HTTP response normally returns 202 Accepted and management URLs for status, events, termination, and cleanup.
Instance IDs
Every orchestration has an instance ID. It is used to:
- Query status.
- Raise external events.
- Terminate or suspend the workflow.
- Correlate telemetry.
- Prevent duplicate workflow starts.
A caller can supply a meaningful instance ID such as an order ID when one active workflow per business object is required. Validate instance state and handle start races; an ID alone does not automatically implement every idempotency requirement.
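The start check described above can be sketched as follows. This is a minimal sketch, assuming the isolated-worker DurableTaskClient; the helper name StartOnceAsync and the "order-" prefix are hypothetical, and exact overloads depend on the installed Durable packages. The check-then-start sequence narrows the race between two callers but does not close it, so duplicate start attempts still need handling.

```csharp
using System.Threading.Tasks;
using Microsoft.DurableTask.Client;

public static class OrderWorkflowStarter
{
    // Hypothetical helper: starts the workflow only when no active
    // instance already exists for the given order.
    public static async Task<string> StartOnceAsync(
        DurableTaskClient client,
        string orderId,
        object input)
    {
        // Hypothetical convention: derive the instance ID from the order ID.
        string instanceId = $"order-{orderId}";

        OrchestrationMetadata? existing =
            await client.GetInstanceAsync(instanceId);

        // Treat pending, running, and suspended instances as already started.
        if (existing is not null &&
            existing.RuntimeStatus is OrchestrationRuntimeStatus.Pending
                or OrchestrationRuntimeStatus.Running
                or OrchestrationRuntimeStatus.Suspended)
        {
            return instanceId;
        }

        // Two callers can both pass the check above, so this call is not
        // a complete idempotency guarantee on its own.
        return await client.ScheduleNewOrchestrationInstanceAsync(
            "OrderOrchestrator",
            input,
            new StartOrchestrationOptions { InstanceId = instanceId });
    }
}
```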
Orchestrator Functions
An isolated-worker orchestrator receives TaskOrchestrationContext:
[Function(nameof(OrderOrchestrator))]
public static async Task<OrderWorkflowResult> OrderOrchestrator(
    [OrchestrationTrigger]
    TaskOrchestrationContext context)
{
    var input =
        context.GetInput<OrderWorkflowInput>()
        ?? throw new InvalidOperationException(
            "Workflow input is required.");

    await context.CallActivityAsync(
        nameof(ReserveInventory),
        input);
    await context.CallActivityAsync(
        nameof(CapturePayment),
        input);

    return new OrderWorkflowResult("Completed");
}
The code looks like ordinary asynchronous C#, but its awaited tasks are durable tasks controlled by the orchestration runtime.
Event Sourcing and Replay
The runtime persists an event history containing scheduled activities, results, timers, events, and decisions. When an orchestrator resumes:
- The orchestrator starts from the beginning.
- The runtime replays recorded results.
- The code reconstructs local state deterministically.
- Execution proceeds when it reaches new work.
The orchestrator is not one thread or process that remains alive for days. It can unload while waiting and resume on another worker.
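The replay steps above can be illustrated with a toy model in plain C#. It is not the Durable runtime and uses none of its APIs: every type and name here is hypothetical, and new work runs inline instead of suspending the orchestrator. It shows only that recorded results are returned without re-running work, and that a mismatch between code and history is detectable.

```csharp
using System;
using System.Collections.Generic;

// Toy model of replay; it is not the Durable runtime.
public sealed class ToyReplayContext
{
    private readonly List<(string Name, string Result)> history;
    private int position; // index of the next history entry to replay

    public ToyReplayContext(List<(string Name, string Result)> history)
    {
        this.history = history;
    }

    // Counts activities that actually ran during this execution.
    public int ExecutedActivities { get; private set; }

    public string CallActivity(string name, Func<string> activity)
    {
        if (position < history.Count)
        {
            // Replay: the recorded call must match what the code asks for now.
            var recorded = history[position];
            if (recorded.Name != name)
            {
                throw new InvalidOperationException(
                    $"Non-deterministic replay: expected {recorded.Name}, got {name}.");
            }

            position++;
            return recorded.Result;
        }

        // New work: run it once and append the result to the history.
        string result = activity();
        ExecutedActivities++;
        history.Add((name, result));
        position++;
        return result;
    }
}

public static class ToyOrchestrator
{
    // Always runs from the top; local variables are rebuilt from history.
    public static string Run(ToyReplayContext context)
    {
        string reservation =
            context.CallActivity("ReserveInventory", () => "reserved");
        string payment =
            context.CallActivity("CapturePayment", () => "captured");
        return reservation + "+" + payment;
    }
}
```

With a history that already holds the ReserveInventory result, Run returns "reserved+captured" and ExecutedActivities is 1: the first call is replayed and only CapturePayment runs.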
Determinism
During replay, orchestrator code must make the same decisions from the same history.
Do not directly use:
- DateTime.UtcNow or DateTime.Now.
- Guid.NewGuid().
- Random values.
- HTTP calls.
- Database queries.
- File access.
- Environment values that can change.
- Threads, delays, or arbitrary asynchronous APIs.
- Input or output bindings.
Use orchestration APIs:
DateTime now = context.CurrentDateTimeUtc;
Guid id = context.NewGuid();
Move nondeterministic work and external I/O into activities.
Replay-Safe Logging
Normal log calls in an orchestrator can repeat during replay. Use a replay-safe logger:
ILogger logger =
    context.CreateReplaySafeLogger(
        nameof(OrderOrchestrator));

logger.LogInformation(
    "Coordinating order {OrderId}",
    input.OrderId);
Activity logs are not replayed in the same way, although an activity itself can execute more than once.
Orchestrator Dependency Injection
Do not inject services that perform I/O or return changing values into orchestrator logic. Replay can call the orchestrator repeatedly, and injected behavior can break determinism.
Use dependency injection freely in:
- Client functions.
- Activity functions.
- Entity implementations where supported and appropriate.
Keep orchestrators focused on durable context operations and deterministic transformation.
Activity Functions
Activities perform actual work:
public sealed class ReserveInventory
{
    private readonly InventoryClient inventory;

    public ReserveInventory(
        InventoryClient inventory)
    {
        this.inventory = inventory;
    }

    [Function(nameof(ReserveInventory))]
    public Task Run(
        [ActivityTrigger]
        OrderWorkflowInput input,
        CancellationToken cancellationToken)
    {
        return inventory.ReserveAsync(
            input.OrderId,
            input.Items,
            cancellationToken);
    }
}
Activities can:
- Call APIs.
- Query or update databases.
- Use SDK clients.
- Generate random values.
- Read current time.
- Send notifications.
Activity At-Least-Once Execution
An activity can execute more than once. For example, the activity might finish its side effect but the worker could fail before the result is durably recorded.
Make activities idempotent:
- Use a unique business operation ID.
- Use upserts or conditional writes.
- Record completed operations.
- Reuse previous results.
- Make external APIs accept idempotency keys.
The orchestrator's durable history prevents normal repeated scheduling during replay, but it cannot guarantee exactly-once external side effects inside an activity.
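The "record completed operations and reuse previous results" approach can be sketched in plain C#. This is a minimal illustration under stated assumptions: PaymentLedger is a hypothetical in-memory stand-in for a durable store with conditional writes, and the operation ID format is invented for the example. A real activity would persist the record outside the worker process.

```csharp
using System.Collections.Generic;

// Hypothetical in-memory stand-in for a durable store; a real activity
// would persist completed operations outside the worker process.
public sealed class PaymentLedger
{
    private readonly object gate = new();
    private readonly Dictionary<string, string> receipts = new();

    // Number of charges actually sent to the (imaginary) payment provider.
    public int ChargesSent { get; private set; }

    // operationId is a unique business operation ID, e.g. "order-42/capture".
    public string CaptureOnce(string operationId, decimal amount)
    {
        lock (gate)
        {
            // A repeated execution finds the earlier result and reuses it.
            if (receipts.TryGetValue(operationId, out var previous))
            {
                return previous;
            }

            // First execution: perform the side effect and record it.
            ChargesSent++;
            string receipt = $"receipt-{ChargesSent}";
            receipts[operationId] = receipt;
            return receipt;
        }
    }
}
```

Calling CaptureOnce twice with the same operation ID returns the same receipt and sends one charge, which is the behavior a re-executed activity needs.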
Activity Granularity
Activities should represent meaningful, retryable units:
- ReserveInventory.
- CapturePayment.
- SendConfirmation.
- GenerateReport.
Too-small activities create storage and scheduling overhead. Too-large activities have poor checkpoint granularity and repeat more work after failure.
Choose boundaries around:
- One external side effect.
- One independent retry policy.
- One compensation boundary.
- One result useful to later workflow decisions.
Activity Retry Policies
Use durable retry APIs for transient failures:
var retry =
    TaskOptions.FromRetryPolicy(
        new RetryPolicy(
            maxNumberOfAttempts: 4,
            firstRetryInterval:
                TimeSpan.FromSeconds(5)));

await context.CallActivityAsync(
    nameof(CapturePayment),
    input,
    retry);
The exact overloads depend on current Durable packages. Retry only transient failures. Validation, authorization, and permanent business rejection should not be retried blindly.
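One way to avoid retrying permanent failures is a retry handler that inspects the last failure. The sketch below assumes the isolated-worker TaskOptions.FromRetryHandler API and assumes, purely for illustration, that the activity signals a permanent rejection by throwing ArgumentException. The function and activity names are hypothetical, and exception-type matching can behave differently across package versions.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public static class PaymentRetry
{
    [Function(nameof(CaptureWithSelectiveRetry))]
    public static async Task CaptureWithSelectiveRetry(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        string orderId = context.GetInput<string>()
            ?? throw new InvalidOperationException("Order ID is required.");

        // The handler runs inside the orchestrator, so it must stay
        // deterministic: no I/O, clocks, or random values.
        TaskOptions options = TaskOptions.FromRetryHandler(retryContext =>
        {
            // Assumption: ArgumentException marks a permanent rejection.
            if (retryContext.LastFailure.IsCausedBy<ArgumentException>())
            {
                return false;
            }

            // Give up after four attempts in total.
            return retryContext.LastAttemptNumber < 4;
        });

        await context.CallActivityAsync("CapturePayment", orderId, options);
    }
}
```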
Durable Timers
Use durable timers instead of Task.Delay:
DateTime due =
    context.CurrentDateTimeUtc
        .AddHours(24);

await context.CreateTimer(
    due,
    CancellationToken.None);
While waiting:
- The orchestrator state is persisted.
- No thread remains blocked.
- The app can scale to zero.
- A timer message reactivates the workflow.
Durable Timeouts
Race work against a timer:
using var cancellation =
    new CancellationTokenSource();

Task<bool> approval =
    context.WaitForExternalEvent<bool>(
        "Approval");
Task timeout =
    context.CreateTimer(
        context.CurrentDateTimeUtc.AddDays(3),
        cancellation.Token);

Task winner =
    await Task.WhenAny(
        approval,
        timeout);

if (winner == approval)
{
    cancellation.Cancel();
    return await approval;
}

return false;
Cancel unused timers. An orchestration does not complete while an outstanding durable timer remains incomplete or uncanceled.
Timer Cancellation Does Not Cancel Activities
Winning a Task.WhenAny race and ignoring an activity result does not terminate the activity. It might continue running and produce side effects.
Design cancelable business operations explicitly:
- Pass a cancellation command through a supported channel.
- Check operation state before committing.
- Use compensation when a side effect completed.
- Avoid assuming orchestration task cancellation stops remote work.
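The points above can be sketched as an orchestrator that races an activity against a deadline and then compensates explicitly. This is a sketch under stated assumptions: "CapturePayment" and "CancelCapture" are hypothetical activity names, and CancelCapture is assumed to mark the operation as cancelled and to refund the charge if the capture already committed. The capture activity would in turn need to check that state before committing.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public static class CaptureDeadline
{
    [Function(nameof(CaptureWithDeadline))]
    public static async Task<bool> CaptureWithDeadline(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        string orderId = context.GetInput<string>()
            ?? throw new InvalidOperationException("Order ID is required.");

        using var cancellation = new CancellationTokenSource();
        Task capture = context.CallActivityAsync("CapturePayment", orderId);
        Task deadline = context.CreateTimer(
            context.CurrentDateTimeUtc.AddMinutes(5),
            cancellation.Token);

        Task winner = await Task.WhenAny(capture, deadline);
        if (winner == capture)
        {
            cancellation.Cancel();
            await capture; // surfaces an activity failure, if any
            return true;
        }

        // The timer won, but CapturePayment may still be running.
        // Compensate through a separate, idempotent activity.
        await context.CallActivityAsync("CancelCapture", orderId);
        return false;
    }
}
```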
External Events
External events let a running workflow receive a one-way asynchronous signal:
ApprovalDecision decision =
    await context
        .WaitForExternalEvent<ApprovalDecision>(
            "Approval");
A client raises the named event for a specific instance:
await client.RaiseEventAsync(
    instanceId,
    "Approval",
    decision);
Events support:
- Human approval.
- Webhook callback.
- Device signal.
- Payment confirmation.
- External process completion.
External Event Design
Validate:
- Caller identity and authorization.
- Instance ID ownership.
- Event name.
- Payload schema and version.
- Current workflow state.
- Duplicate and late events.
External events are not synchronous request-response operations. The sender should receive acceptance, then query workflow status or receive a separate completion notification.
Event Buffering and Ordering
The Durable runtime can buffer an external event until the orchestrator waits for it. However:
- Duplicate events can exist.
- Different event names can race.
- Business-level ordering assumptions need explicit design.
- Events sent after instance completion cannot continue the completed workflow.
Include an event ID and validate state transitions inside deterministic workflow logic.
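Duplicate and late events can be filtered with a small piece of in-memory logic that an orchestrator rebuilds on every replay. The sketch below is plain C# with hypothetical names (ApprovalTracker, OrderStage); it performs no I/O, so it stays deterministic when driven by replayed events.

```csharp
using System.Collections.Generic;

public enum OrderStage
{
    AwaitingApproval,
    Approved,
    Rejected
}

// Hypothetical helper: tracks seen event IDs and the current stage.
public sealed class ApprovalTracker
{
    private readonly HashSet<string> seenEventIds = new();

    public OrderStage Stage { get; private set; } = OrderStage.AwaitingApproval;

    // Returns true only when the event changes the workflow state.
    public bool TryApply(string eventId, bool approved)
    {
        // Duplicate delivery of an event that was already processed.
        if (!seenEventIds.Add(eventId))
        {
            return false;
        }

        // Late event: a decision was already made.
        if (Stage != OrderStage.AwaitingApproval)
        {
            return false;
        }

        Stage = approved ? OrderStage.Approved : OrderStage.Rejected;
        return true;
    }
}
```

Applying event "e1" as an approval moves the stage to Approved; a second "e1" or a later "e2" rejection is ignored and the stage stays Approved.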
Durable Entities
A durable entity is identified by:
- Entity name.
- Entity key.
Example identities:
InventoryItem / product-123
RateLimiter / tenant-42
ApprovalCounter / workflow-981
Each operation for one entity instance executes serially, avoiding concurrent updates to that entity's state.
Entity Operations
Entities expose named operations such as:
- Add.
- Remove.
- Set.
- Reset.
- Acquire.
- Release.
Clients and orchestrators can signal operations. Orchestrators can also call entities when they need a returned result.
Entities are useful for:
- Counters.
- Small aggregations.
- Coordination flags.
- Shopping-cart-like state.
- Lightweight distributed semaphores.
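A counter is the simplest case. The following sketch assumes the class-based entity model of the .NET isolated worker, where public methods on a TaskEntity<TState> subclass become named operations. The ApprovalCounter name and its operations are illustrative, and the dispatch pattern should be checked against the installed Durable packages.

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask.Entities;

// Illustrative counter entity; State holds the persisted integer value.
public class ApprovalCounter : TaskEntity<int>
{
    // Each public method is exposed as an operation with the same name.
    public void Add(int amount) => this.State += amount;

    public void Reset() => this.State = 0;

    public int Get() => this.State;

    // Entry point that routes incoming operations to the methods above.
    [Function(nameof(ApprovalCounter))]
    public static Task Run(
        [EntityTrigger] TaskEntityDispatcher dispatcher)
        => dispatcher.DispatchAsync<ApprovalCounter>();
}
```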
Entity State Design
Keep entity state:
- Small.
- Serializable.
- Focused on one identity.
- Quick to update.
- Free from large object graphs.
An entity is not a replacement for a relational database, analytics store, or large document repository. High contention on one entity key serializes all operations and can become a hotspot.
Entity Signals Versus Calls
A signal is one-way and does not wait for a result. A call waits for the entity operation result and is available from orchestration contexts in supported models.
Use signals for commands where eventual processing is sufficient. Use calls when workflow decisions need the resulting state.
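The difference can be sketched from an orchestrator. This assumes the isolated-worker context.Entities API and an entity named ApprovalCounter that exposes Add and Get operations; both the entity and the RecordApproval function are hypothetical.

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;
using Microsoft.DurableTask.Entities;

public static class CounterWorkflow
{
    [Function(nameof(RecordApproval))]
    public static async Task<int> RecordApproval(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        // Entity identity is the pair of entity name and entity key.
        var entityId = new EntityInstanceId("ApprovalCounter", "workflow-981");

        // Signal: one-way; the orchestrator does not wait for a result.
        await context.Entities.SignalEntityAsync(entityId, "Add", 1);

        // Call: waits for the operation and returns its result, which
        // the workflow can then use in later decisions.
        return await context.Entities.CallEntityAsync<int>(entityId, "Get");
    }
}
```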
Entities Versus Orchestrators
An orchestration can coordinate several entities, but excessive entity calls increase history and storage work.
Task Hubs
A task hub isolates Durable runtime state for a set of orchestrations and entities. It includes queues, history, control state, and instance metadata in the selected backend.
Use distinct task hubs for:
- Separate applications sharing a storage resource.
- Deployment slots when both slots may run.
- Test and production.
- Versioned side-by-side deployments when required.
Two incompatible applications using the same task hub can process each other's messages and corrupt workflow behavior.
Storage Providers
Durable Functions supports pluggable backends. Current Microsoft guidance identifies Durable Task Scheduler as the recommended backend for new Durable Functions scenarios where its availability and requirements fit. Azure Storage remains a common and established provider, and other supported providers have different performance and operational characteristics.
Evaluate:
- Regional availability.
- Throughput and latency.
- Networking.
- Cost.
- Operational visibility.
- Migration support.
- Required features and quotas.
Do not change a backend as if it were a transparent connection-string switch.
Instance Management
Durable clients can:
- Start an instance.
- Query status.
- Raise events.
- Suspend and resume.
- Terminate.
- Purge history.
Termination stops future orchestration progress but does not undo completed side effects or necessarily stop activities already running. Compensation must be explicit.
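A terminate-then-purge sequence might look like the sketch below. It assumes the isolated-worker DurableTaskClient; the helper name is hypothetical and overloads vary between package versions. It does not undo side effects or stop in-flight activities, so any compensation still has to be triggered separately.

```csharp
using System.Threading.Tasks;
using Microsoft.DurableTask.Client;

public static class WorkflowAdmin
{
    // Hypothetical helper: stops an instance and removes its history.
    public static async Task StopAndCleanUpAsync(
        DurableTaskClient client,
        string instanceId)
    {
        // Request termination; the request is processed asynchronously.
        await client.TerminateInstanceAsync(instanceId);

        // Wait until the instance reaches a terminal state before purging.
        await client.WaitForInstanceCompletionAsync(instanceId);

        // Remove the stored history and metadata for this instance.
        await client.PurgeInstanceAsync(instanceId);
    }
}
```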
Custom Status
Orchestrators can expose compact custom status:
context.SetCustomStatus(
    new
    {
        Stage = "AwaitingApproval",
        input.OrderId
    });
Use it for progress visible to clients. Keep it small because it is persisted and returned by status APIs.
History Growth
Every activity, timer, event, and decision adds history. Very long or frequently looping orchestrations can accumulate large histories and replay cost.
Use:
- Sub-orchestrations.
- Sensible activity granularity.
- ContinueAsNew for eternal loops.
- Instance completion and restart when business semantics allow it.
- History purge according to retention policy.
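An eternal loop that keeps its history short can be sketched with ContinueAsNew. The sketch assumes the isolated-worker TaskOrchestrationContext, a hypothetical "RunCleanup" activity, and an orchestration started with an integer input of 0.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public static class CleanupLoop
{
    [Function(nameof(PeriodicCleanup))]
    public static async Task PeriodicCleanup(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        // Small carried-forward state: the iteration counter.
        int iteration = context.GetInput<int>();

        // "RunCleanup" is a hypothetical activity name.
        await context.CallActivityAsync("RunCleanup", iteration);

        // Durable wait; no compute is held while the timer is pending.
        await context.CreateTimer(
            context.CurrentDateTimeUtc.AddHours(1),
            CancellationToken.None);

        // Restart the instance with a fresh history instead of looping
        // inside one ever-growing history.
        context.ContinueAsNew(iteration + 1);
    }
}
```

Each generation records one activity and one timer, so replay cost stays constant no matter how long the loop runs.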
Versioning
Running instances replay old history against deployed orchestrator code. Incompatible code changes can make replay diverge.
Safe strategies include:
- Keep changes replay-compatible.
- Deploy a new orchestrator name.
- Route new instances to a new version.
- Let old instances drain.
- Use supported orchestration versioning features where appropriate.
Never casually reorder, remove, or conditionally change historical activity calls in a long-running orchestrator.
Observability
Monitor:
- Instance runtime status.
- Orchestration and activity failures.
- Activity duration and retries.
- External event wait age.
- Timer backlog.
- Task-hub queue depth and latency.
- Storage-provider throttling.
- History size and replay duration.
- Entity operation backlog and hot keys.
Correlate through instance ID and business ID. Use replay-safe orchestrator logs.
Common Mistakes
- Performing HTTP or database I/O in an orchestrator.
- Using nondeterministic time, GUID, or random APIs.
- Assuming activities execute exactly once.
- Injecting changing services into orchestrator decisions.
- Using Task.Delay instead of a durable timer.
- Forgetting to cancel a losing timeout timer.
- Assuming timeout races cancel running activities.
- Trusting external events without authorization.
- Storing large state in entities.
- Creating a hot entity key.
- Sharing a task hub across incompatible apps or slots.
- Changing orchestrator control flow without a versioning plan.
- Keeping infinite history without ContinueAsNew or purge.
Practical Best Practices
- Keep orchestrators deterministic and side-effect free.
- Put external work in small idempotent activities.
- Use durable timers for all orchestration waits.
- Pair external events with deadlines and authorization.
- Use meaningful instance IDs and correlation.
- Keep entity state small and operations focused.
- Separate task hubs across incompatible deployments.
- Monitor backend health and history growth.
- Version long-running workflows deliberately.
- Purge completed history according to retention policy.
- Test restart, duplicate activity execution, late events, and replay.
- Verify current storage-provider and SDK guidance before new deployments.