Overview
Azure messaging systems provide reliable delivery, but reliable does not mean exactly once. Producers can retry after uncertain sends. Brokers can redeliver after consumer crashes. Handlers can fail after partially completing work. Endpoints can time out after accepting an event.
This makes duplicate detection, retry strategy, and poison-message handling essential parts of production design.
In Azure, the common tools are:
- Service Bus duplicate detection for repeated sends with the same MessageId.
- Service Bus peek-lock, delivery count, lock renewal, max delivery count, and dead-letter queues.
- Event Grid retry policy and optional dead-lettering.
- Application-level idempotency for all durable side effects.
- Backoff, retry classification, and poison-message runbooks.
For interviews, strong candidates explain that duplicate detection is useful but limited. It does not remove the need for idempotent consumers, stable business keys, outbox and inbox patterns, DLQ monitoring, replay tools, and careful separation of transient failures from permanent data defects.
Core Concepts
Why Duplicates Happen
Duplicates happen because distributed systems often have uncertain outcomes.
Examples:
- A producer sends a message, Service Bus accepts it, but the acknowledgment is lost.
- A consumer updates a database, then crashes before completing the message.
- A message lock expires during long processing.
- A webhook processes an Event Grid event but returns a timeout.
- A retry policy repeats a request after a transient network failure.
- Two application instances submit the same business command.
The safest assumption is at-least-once delivery. Design every handler as if a duplicate will eventually arrive.
Duplicate Detection in Service Bus
Service Bus duplicate detection can discard duplicate sends during a configured time window. The application sets a stable MessageId. If another message with the same identifier arrives within the window, Service Bus accepts the send request but drops the duplicate.
Example:
var message = new ServiceBusMessage(BinaryData.FromObjectAsJson(order))
{
    MessageId = $"order-{order.Id}-submit",
    Subject = "SubmitOrder",
    ContentType = "application/json"
};
await sender.SendMessageAsync(message, cancellationToken);
The key is predictable repeatability. A random GUID generated on every retry defeats broker-side duplicate detection.
Duplicate Detection Window
The duplicate detection window is the time range during which Service Bus remembers message IDs. A longer window catches more duplicate sends but increases broker work and can reduce throughput.
Choose a window based on:
- Expected producer retry duration.
- Recovery time after producer crash.
- Business tolerance for duplicate sends.
- Throughput requirements.
- Cost of application-level deduplication.
Do not use an arbitrarily long window as a substitute for consumer idempotency.
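The window is a property of the queue or topic, so it is normally set when the entity is created. The following is a minimal sketch using the Azure.Messaging.ServiceBus.Administration package. The queue name "orders", the 10-minute window, and the other values are illustrative assumptions, not recommendations.

```csharp
using System;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus.Administration;

public static class QueueSetup
{
    // Assumption: connectionString is supplied by the caller; "orders" is a placeholder name.
    public static async Task CreateOrdersQueueAsync(string connectionString)
    {
        var admin = new ServiceBusAdministrationClient(connectionString);

        var options = new CreateQueueOptions("orders")
        {
            // Turn on broker-side duplicate detection for this queue.
            RequiresDuplicateDetection = true,
            // Remember MessageId values for 10 minutes (illustrative value).
            DuplicateDetectionHistoryTimeWindow = TimeSpan.FromMinutes(10),
            // Illustrative values for the redelivery settings discussed later.
            MaxDeliveryCount = 10,
            LockDuration = TimeSpan.FromSeconds(60)
        };

        await admin.CreateQueueAsync(options);
    }
}
```

In practice the same settings are often applied through infrastructure templates rather than application code; the sketch only shows which knobs exist.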
What Service Bus Duplicate Detection Does Not Solve
Duplicate detection applies when messages are sent to the broker. It does not guarantee exactly-once processing by consumers.
It does not prevent:
- Redelivery after lock expiration.
- Duplicate side effects after settlement failure.
- Duplicate business operations caused by different MessageId values.
- Event Grid duplicate delivery.
- Duplicates outside the configured window.
- Duplicates across separate entities unless the same feature and keying strategy apply.
Consumers still need idempotency.
Idempotent Consumer
An idempotent consumer can process the same message more than once without corrupting the final result.
Common techniques:
- Store processed message IDs in an inbox table.
- Use a business key with a unique constraint.
- Make state transitions conditional.
- Use natural idempotency keys in downstream APIs.
- Ignore duplicate commands that already reached a terminal state.
- Complete duplicate messages after confirming the work is already done.
Example:
CREATE TABLE ProcessedMessage (
    ConsumerName nvarchar(100) NOT NULL,
    MessageId nvarchar(200) NOT NULL,
    ProcessedAtUtc datetime2 NOT NULL,
    PRIMARY KEY (ConsumerName, MessageId)
);
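Conditional state transitions, one of the techniques listed above, can be sketched without any storage at all. The Order and OrderStatus types here are hypothetical and exist only for illustration.

```csharp
using System;

public enum OrderStatus { Draft, Submitted, Shipped }

public sealed class Order
{
    public OrderStatus Status { get; private set; } = OrderStatus.Draft;

    // Counts how many times the side effect actually ran.
    public int SubmitCount { get; private set; }

    // Returns true when the command changed state, false when it was a duplicate.
    public bool Submit()
    {
        // Conditional transition: only Draft -> Submitted has an effect.
        if (Status != OrderStatus.Draft)
        {
            return false;
        }

        Status = OrderStatus.Submitted;
        SubmitCount++;
        return true;
    }
}
```

Against a relational database the same idea is usually an UPDATE whose WHERE clause includes the expected current status, followed by a check of the affected row count.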
Inbox Pattern
An inbox table records messages a consumer has already processed. The consumer inserts the message ID in the same database transaction as the business update.
Processing flow:
- Receive message.
- Begin database transaction.
- Try to insert inbox record.
- If the insert fails because the record already exists, treat the message as a duplicate: roll back, skip the business changes, and complete the broker message.
- Apply business changes.
- Commit transaction.
- Complete the broker message.
This protects against redelivery after consumer crash or settlement failure.
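The processing flow above can be sketched with an in-memory stand-in for the database. InMemoryStore and PaymentHandler are hypothetical names; a real consumer would rely on a database transaction and the unique key on the inbox table rather than on a HashSet.

```csharp
using System;
using System.Collections.Generic;

// In-memory stand-in for a database that holds both the inbox and the business data.
public sealed class InMemoryStore
{
    private readonly HashSet<(string Consumer, string MessageId)> _inbox = new();

    public Dictionary<string, decimal> Balances { get; } = new();

    // Applies the inbox insert and the business change as one unit.
    // Returns false, changing nothing, when the message was already processed.
    public bool TryProcess(string consumer, string messageId,
        Action<Dictionary<string, decimal>> businessChange)
    {
        var key = (consumer, messageId);
        if (_inbox.Contains(key))
        {
            return false; // duplicate: the inbox record already exists
        }

        // Work on a copy so an exception leaves the store untouched (rollback).
        var working = new Dictionary<string, decimal>(Balances);
        businessChange(working);

        // "Commit": publish the business change and the inbox record together.
        Balances.Clear();
        foreach (var pair in working)
        {
            Balances[pair.Key] = pair.Value;
        }
        _inbox.Add(key);
        return true;
    }
}

public static class PaymentHandler
{
    // Returns true when the payment was applied, false when it was a duplicate.
    // Either way the caller should complete the broker message afterwards.
    public static bool Handle(InMemoryStore store, string messageId, string account, decimal amount)
    {
        return store.TryProcess("payments", messageId, balances =>
        {
            balances.TryGetValue(account, out var current);
            balances[account] = current + amount;
        });
    }
}
```

Because the inbox record and the balance change are committed together, a crash between commit and message completion only causes a redelivery that the inbox check then rejects.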
Outbox Pattern
The outbox pattern handles reliable publishing. The service writes its business change and an outgoing message record in the same database transaction. A dispatcher later publishes the message and marks the outbox record dispatched.
This avoids:
- Database commit succeeded but message was never sent.
- Message was sent but database transaction rolled back.
- Retrying publication with a different message ID.
The outbox dispatcher should use stable MessageId values and tolerate publish retries.
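A minimal sketch of such a dispatcher follows. OutboxRecord, OutboxDispatcher, and the publish delegate are hypothetical; in a real service the records would be rows in the outbox table and the delegate would wrap a Service Bus sender.

```csharp
using System;
using System.Collections.Generic;

public sealed class OutboxRecord
{
    public OutboxRecord(string messageId, string payload)
    {
        MessageId = messageId;
        Payload = payload;
    }

    public string MessageId { get; }   // written once, reused on every publish attempt
    public string Payload { get; }
    public bool Dispatched { get; set; }
}

public sealed class OutboxDispatcher
{
    // (messageId, payload) -> true when the broker accepted the message.
    private readonly Func<string, string, bool> _publish;

    public OutboxDispatcher(Func<string, string, bool> publish) => _publish = publish;

    // Publishes every pending record; failed records stay pending for the next pass.
    public int DispatchPending(IEnumerable<OutboxRecord> outbox)
    {
        var sent = 0;
        foreach (var record in outbox)
        {
            if (record.Dispatched)
            {
                continue;
            }

            if (_publish(record.MessageId, record.Payload))
            {
                record.Dispatched = true;
                sent++;
            }
        }
        return sent;
    }
}
```

If the dispatcher crashes after publishing but before marking the record, the next pass republishes with the same MessageId, which broker-side duplicate detection or an idempotent consumer can absorb.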
Retry Classification
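Not every failure should be retried. Transient failures, such as timeouts, throttling, and brief dependency outages, are worth retrying. Permanent failures, such as malformed payloads or failed validation, are not: retrying invalid data only burns capacity and delays good messages.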
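One way to make the classification explicit is a single function that maps exceptions to a decision. The mapping below is an illustrative assumption, not a rule from any SDK; each system needs its own list, and FailureKind and FailureClassifier are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text.Json;

public enum FailureKind { Transient, Permanent }

public static class FailureClassifier
{
    // Assumed mapping for illustration; tune it per dependency and per message type.
    public static FailureKind Classify(Exception error) => error switch
    {
        TimeoutException => FailureKind.Transient,    // dependency was slow
        IOException => FailureKind.Transient,         // network or storage hiccup
        JsonException => FailureKind.Permanent,       // malformed payload
        FormatException => FailureKind.Permanent,     // unparseable field
        ArgumentException => FailureKind.Permanent,   // failed validation
        _ => FailureKind.Transient                    // unknown: retry, bounded by max delivery count
    };
}
```

A handler would typically abandon or rethrow on Transient and explicitly dead-letter on Permanent.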
Retry Backoff
Retries should use backoff rather than tight loops. Backoff protects the broker, consumer, and downstream dependency.
Good retry behavior:
- Short retry for brief transient errors.
- Exponential or progressive delay.
- Jitter to avoid synchronized retries.
- Maximum attempts or deadline.
- Observability for every category of retry.
- Stop retrying when the error is permanent.
Service Bus SDK retries handle client-side transient operations. Message redelivery handles consumer failures. Application code still decides whether business work should be retried or dead-lettered.
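For retries that application code controls, exponential backoff with jitter can be sketched as a pure function. This uses the "full jitter" variant, where the delay is drawn uniformly between zero and a capped exponential ceiling; the Backoff class and its parameters are hypothetical.

```csharp
using System;

public static class Backoff
{
    // attempt is 1-based: the ceiling is baseDelay, 2x, 4x, ... capped at maxDelay.
    public static TimeSpan NextDelay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay, Random random)
    {
        if (attempt < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(attempt));
        }

        // Clamp the exponent so very high attempt numbers cannot overflow.
        var exponent = Math.Min(attempt - 1, 30);
        var ceilingMs = Math.Min(
            maxDelay.TotalMilliseconds,
            baseDelay.TotalMilliseconds * Math.Pow(2, exponent));

        // Full jitter: spread callers across the whole interval to avoid synchronized retries.
        return TimeSpan.FromMilliseconds(random.NextDouble() * ceilingMs);
    }
}
```

Passing the Random instance in keeps the function deterministic under test; the maximum attempt count or deadline is enforced by the caller.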
Peek-Lock Redelivery
With Service Bus peek-lock, a consumer receives a locked message. If processing succeeds, it completes the message. If the consumer abandons the message, crashes, or lets the lock expire, Service Bus can redeliver it.
The delivery count increases when the message is abandoned or lock expiration makes it available again. When delivery count exceeds max delivery count, Service Bus moves the message to the DLQ.
This is the core mechanism for poison-message protection.
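The interaction between abandon, delivery count, and the DLQ can be illustrated with a toy model. ToyQueue is a hypothetical in-memory class, not the Service Bus implementation, and it ignores locks and timing entirely.

```csharp
using System;
using System.Collections.Generic;

// Toy model of redelivery and dead-lettering; real brokers also track locks and expiry.
public sealed class ToyQueue
{
    private readonly int _maxDeliveryCount;
    private readonly Queue<(string Body, int DeliveryCount)> _active = new();

    public ToyQueue(int maxDeliveryCount) => _maxDeliveryCount = maxDeliveryCount;

    public List<string> DeadLetter { get; } = new();

    public void Send(string body) => _active.Enqueue((body, 0));

    // Delivers messages until each one is completed or dead-lettered.
    // The handler returns true to complete and false to abandon.
    public void Drain(Func<string, bool> handler)
    {
        while (_active.Count > 0)
        {
            var (body, count) = _active.Dequeue();
            if (handler(body))
            {
                continue; // completed: the message leaves the queue
            }

            count++; // abandoned: the failed delivery is counted
            if (count >= _maxDeliveryCount)
            {
                DeadLetter.Add(body); // retries exhausted
            }
            else
            {
                _active.Enqueue((body, count)); // available for redelivery
            }
        }
    }
}
```

With a limit of 3, a message that always fails is handed to the handler three times and then lands in the dead-letter list, while healthy messages behind it still get processed.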
Lock Renewal
Long-running handlers can renew message locks, or use automatic lock renewal through SDK processor options.
Lock renewal helps avoid duplicate concurrent processing, but it should not hide bad design. If one message takes many minutes, consider:
- Splitting work into smaller messages.
- Using Durable Functions or a workflow engine.
- Using a database work item with explicit checkpoints.
- Sending a command that starts work and another message for the next step.
Poison Message
A poison message is a message that repeatedly fails and cannot be processed successfully without intervention.
Common causes:
- Malformed JSON.
- Unknown schema version.
- Missing required business data.
- Referential integrity failure.
- Authorization or tenant mismatch.
- A handler bug triggered by a specific payload.
- A downstream API permanently rejects the request.
Poison messages should eventually leave the active queue so they do not block healthy work.
Max Delivery Count
Max delivery count limits how many times Service Bus will deliver a message before moving it to the DLQ. The default of 10 is often suitable as a starting point, but real systems should choose intentionally.
Consider:
- Typical transient outage duration.
- Processing time.
- Downstream rate limits.
- Message value.
- How quickly operations needs to see the failure.
- Whether retries are also happening inside the handler.
Too low can dead-letter messages during brief outages. Too high can retry poison messages for too long.
Explicit Dead-Lettering
Do not wait for max delivery count when retrying cannot help. A consumer can explicitly dead-letter a message with a reason and description.
await receiver.DeadLetterMessageAsync(
    message,
    deadLetterReason: "UnsupportedSchemaVersion",
    deadLetterErrorDescription: "Schema version 99 is not supported.",
    cancellationToken: cancellationToken);
Use structured reason codes so DLQ tooling can group failures.
Event Grid Retries
Event Grid retries failed event deliveries. Push handlers must return success only after they have durably accepted the event. A handler that performs side effects and then times out can receive the event again.
Event Grid does not guarantee ordering, and duplicate events are possible. The subscriber should use event ID or business key for idempotency.
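A subscriber that tolerates both duplicates and reordering might look like the sketch below. IncomingEvent is a hypothetical, trimmed-down event shape, and the "newest event time wins" rule is an assumption that only suits events carrying full state rather than deltas.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical event shape with only the fields this sketch needs.
public sealed record IncomingEvent(string Id, string Subject, DateTimeOffset EventTime, string State);

public sealed class EventProjection
{
    private readonly HashSet<string> _seenEventIds = new();
    private readonly Dictionary<string, (DateTimeOffset EventTime, string State)> _latest = new();

    // Returns true when the event changed the projection.
    public bool Apply(IncomingEvent evt)
    {
        // Duplicate delivery: this event ID was already handled.
        if (!_seenEventIds.Add(evt.Id))
        {
            return false;
        }

        // Out-of-order delivery: keep the newer state for this subject.
        if (_latest.TryGetValue(evt.Subject, out var current) && current.EventTime >= evt.EventTime)
        {
            return false;
        }

        _latest[evt.Subject] = (evt.EventTime, evt.State);
        return true;
    }

    public string StateOf(string subject) =>
        _latest.TryGetValue(subject, out var current) ? current.State : "(unknown)";
}
```

In production the seen-ID set and the latest state would live in durable storage and be updated in one transaction, as in the inbox pattern.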
Event Grid Dead-Lettering
Event Grid can dead-letter events that cannot be delivered within configured retry limits or time-to-live. Dead-lettering is not enabled by default, so configure it on important subscriptions.
Use Event Grid DLQ data to decide:
- Whether the endpoint was misconfigured.
- Whether a schema changed unexpectedly.
- Whether the destination was unavailable.
- Whether replay is safe.
If the event triggers high-value work, routing Event Grid to Service Bus can provide richer queue-based handling.
Retry Storms
A retry storm happens when many failed operations retry at once and make the outage worse.
Avoid retry storms by:
- Using jittered backoff.
- Bounding concurrency.
- Pausing consumers when dependencies are unhealthy.
- Using circuit breakers.
- Respecting rate-limit responses.
- Separating retryable from poison failures.
- Monitoring queue age and dependency errors.
Retries are a recovery tool, but uncontrolled retries are a traffic amplifier.
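A circuit breaker is one of the simpler defenses listed above. The sketch below is deliberately minimal: it opens after a run of consecutive failures and allows traffic again after a cool-down, without a separate half-open probing state. CircuitBreaker and its parameters are hypothetical.

```csharp
using System;

public sealed class CircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _openDuration;
    private int _consecutiveFailures;
    private DateTimeOffset _openUntil = DateTimeOffset.MinValue;

    public CircuitBreaker(int failureThreshold, TimeSpan openDuration)
    {
        _failureThreshold = failureThreshold;
        _openDuration = openDuration;
    }

    // Callers pass the current time so the breaker stays deterministic under test.
    public bool AllowRequest(DateTimeOffset now) => now >= _openUntil;

    public void RecordSuccess() => _consecutiveFailures = 0;

    public void RecordFailure(DateTimeOffset now)
    {
        _consecutiveFailures++;
        if (_consecutiveFailures >= _failureThreshold)
        {
            // Open the circuit: reject calls until the cool-down has passed.
            _openUntil = now + _openDuration;
            _consecutiveFailures = 0;
        }
    }
}
```

While the breaker is open, a consumer is usually better off pausing message processing than receiving and abandoning messages, since each abandon consumes delivery count.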
Replay and Resubmission
Replay must be designed, not improvised.
A safe replay tool should:
- Require authorization.
- Preserve or intentionally replace message IDs.
- Show dead-letter reason and payload.
- Allow correction when data is fixable.
- Rate-limit resubmitted messages.
- Record who replayed what and why.
- Avoid replaying messages into consumers that are still broken.
Replaying without idempotency can create duplicate side effects.
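Several items from the checklist above (authorization, preserved message IDs, rate limiting, and an audit trail) fit into a small sketch. ReplayTool, DeadLetteredMessage, ReplayAudit, and the resubmit delegate are all hypothetical; a real tool would read from the DLQ and send through a Service Bus sender.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record DeadLetteredMessage(string MessageId, string Reason, string Payload);

public sealed record ReplayAudit(string MessageId, string Operator, string Justification, DateTimeOffset ReplayedAtUtc);

public sealed class ReplayTool
{
    private readonly HashSet<string> _authorizedOperators;
    private readonly Action<string, string> _resubmit; // (messageId, payload)

    public ReplayTool(IEnumerable<string> authorizedOperators, Action<string, string> resubmit)
    {
        _authorizedOperators = new HashSet<string>(authorizedOperators);
        _resubmit = resubmit;
    }

    public List<ReplayAudit> AuditLog { get; } = new();

    // Replays at most batchSize messages per call as a crude rate limit.
    public int Replay(IEnumerable<DeadLetteredMessage> messages, string operatorName,
        string justification, int batchSize)
    {
        if (!_authorizedOperators.Contains(operatorName))
        {
            throw new UnauthorizedAccessException($"{operatorName} may not replay messages.");
        }

        var replayed = 0;
        foreach (var message in messages.Take(batchSize))
        {
            // Preserve the original MessageId so idempotent consumers can recognize it.
            _resubmit(message.MessageId, message.Payload);
            AuditLog.Add(new ReplayAudit(message.MessageId, operatorName, justification, DateTimeOffset.UtcNow));
            replayed++;
        }
        return replayed;
    }
}
```

Preserving the ID is a deliberate choice: if the target queue has duplicate detection enabled and the original send is still inside the window, a resubmission with the same MessageId may be dropped, which is one reason a tool might intentionally replace IDs instead.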
Observability
Track:
- Duplicate-detection drops where visible.
- Message delivery count.
- Lock lost errors.
- Abandon count.
- DLQ count and oldest DLQ age.
- Retry attempts by error category.
- Handler latency.
- Poison-message reason codes.
- Event Grid delivery failures and dead-letter count.
- Dependency health during retries.
Without this data, teams usually discover problems only after customers report missing work.
Common Mistakes
- Generating a new MessageId on every retry.
- Assuming duplicate detection gives exactly-once processing.
- Completing messages before durable side effects succeed.
- Retrying validation errors repeatedly.
- Dead-lettering transient failures immediately.
- Ignoring DLQ messages.
- Replaying DLQ messages without fixing the cause.
- Letting retry loops overwhelm downstream systems.
- Omitting idempotency keys in database writes and external API calls.
- Treating Event Grid as ordered and exactly once.
Best Practices
- Use stable message IDs derived from business context.
- Keep duplicate detection windows as small as practical.
- Make every consumer idempotent.
- Use inbox and outbox patterns for important workflows.
- Classify errors before retrying.
- Use jittered backoff and bounded concurrency.
- Dead-letter permanent failures with structured reasons.
- Monitor DLQ count, age, and reason distribution.
- Build safe replay tools.
- Test crash-after-side-effect and settlement-failure scenarios.