EverydayUser storyConsultative playbook

Our last DR test failed and the auditor wants evidence within 90 days

The annual DR test missed RTO by hours and the regulator has asked for evidence of a credible recovery posture within the quarter. Backup is patchy across workload tiers, replication is uneven, and the runbook has not been updated in two years.

Trigger: Failed DR test; regulator follow-up clock running.
Good outcome: Backup + Azure Site Recovery baseline live, RTO/RPO documented per tier, fresh runbook validated through a tabletop and a partial failover.

Diagnostic discovery

Signals this story fits

Observable cues that confirm the conversation belongs here.

·Annual DR test missed RTO or RPO targets
·Regulator or auditor follow-up running on a fixed clock
·Backup coverage uneven across workload tiers
·DR runbook over 12 months out of date
·No regular partial-failover or tabletop cadence

Questions to ask

Open-ended, SPIN-style — each one has a reason it matters.

1.What are your RTO and RPO targets per workload tier?
WhyEstablishes the baseline. Many customers cannot answer immediately, which is itself the finding.
Listen for: “varies by workload” · “we have not formalised that”
2.When was the last full DR test, and what was the result?
WhySurfaces the actual posture rather than the aspirational one.
3.Which workloads have automated backup vs ad-hoc?
WhyTags the coverage gap. Often the answer is uneven across tiers.
4.Are you tested against partial failure (zonal, regional) and full failure (workload)?
WhyMost DR programmes test one failure mode and ignore the others.
5.Who owns the runbook today and how often is it rehearsed?
WhyWithout an owner and cadence, the runbook is fiction.
6.What does the regulator want to see specifically — evidence of testing, RTO compliance, or both?
WhySharpens the deliverable.

Baseline → target architecture

TOGAF-style gap framing — what we typically see today, and what the proposed end state looks like. The gap between them is the engagement.

Baseline architecture

Patchy backup with some workload tiers excluded. DR scoped to a handful of tier-1 systems. Runbook 12–24 months out of date. No automated failover; manual orchestration. Audit response reactive.

Typical concerns

·Production workloads outside the backup baseline
·No tabletop or partial-failover cadence
·Runbook owners scattered or absent
·Multi-region resilience aspirational, not validated
·Cyber-insurance renewal flagging DR maturity gap

Capability gaps

·Tier-based RTO/RPO targets
·Azure Backup with policy enforcement
·Azure Site Recovery for cross-region replication
·Quarterly tabletop + annual full failover
·Continuous attestation of DR readiness

Target architecture

Tier-based RTO/RPO targets published and enforced via Azure Policy. Azure Backup tenant-wide with retention policies per tier. Azure Site Recovery for replication of the tier-1 workloads. Runbook owned by the platform team with quarterly tabletops and an annual full failover. Defender for Cloud surfaces backup-coverage gaps continuously.

Key capabilities

Tier-based RTO/RPO design
Tenant-wide backup with retention policy
ASR-based cross-region replication
Quarterly tabletop + annual full failover
Continuous backup-coverage attestation

Enabling SKUs

Resolved in the ‘Recommended cards’ section below.

Architecture decisions

Each decision is offered as explicit options with trade-offs — Hohpe's “selling options” principle. A safe default is noted where one exists.

Decision 1.Backup retention — 30 days vs 1 year vs long-term retention (LTR)
30-day baseline
When it fitsOperational recovery only; compliance retention handled separately.
Trade-offsCannot recover older corruption or accidental deletion.
1-year
When it fitsStandard compliance retention; covers most accidental-deletion scenarios.
Trade-offsStorage cost rises materially.
LTR (3+ years)
When it fitsRegulated workloads; auditor demands multi-year recovery.
Trade-offsSignificant storage cost; archive-tier discipline required.
Default recommendationTiered: 30-day operational + 1-year compliance + LTR for regulated workloads only.
Decision 2.ASR pairing — same-region zonal vs cross-region
Same-region zonal
When it fitsLow-latency replication; protects against zonal failure only.
Trade-offsDoes not protect against regional outage.
Cross-region (paired region)
When it fitsRegulator demands regional resilience; multi-region DR posture.
Trade-offsHigher replication cost; failover may carry data-sovereignty implications.
Default recommendationCross-region for tier-1 workloads; same-region zonal for tier-2.
Decision 3.Failover automation — automated vs manual orchestration
Automated
When it fitsMature operations; clear failover criteria; appetite for unattended failover.
Trade-offsRisk of unintended failover from a transient signal.
Manual orchestration
When it fitsOperations team prefers human-in-the-loop; failover blast radius significant.
Trade-offsRTO bound by human response time.
Default recommendationManual orchestration for first 12 months; automate the well-understood scenarios as confidence grows.

Low-risk trial — proof of value

6-week DR foundation + tabletop + partial failover

6 weeks

Tier-based RTO/RPO targets published and approved. Azure Backup policies applied tenant-wide with retention per tier. ASR replication for one tier-1 workload. Runbook rewritten and signed off. Tabletop exercise run with executive presence. One partial failover executed and documented.

Success criteria

RTO/RPO targets published per workload tier
Backup coverage above 95% on the production estate
ASR replication validated for at least one tier-1 workload
Tabletop exercise completed with executive sign-off
Partial failover executed within the documented RTO

InvestmentAzure Backup + ASR consumption ~€2–4k/month for the trial scope. Advisory engagement separate. Regulator follow-up satisfied with the documented programme.

Proof metrics

·Backup coverage above 95% within 60 days
·RTO and RPO achieved on the trial workload
·Tabletop cadence operating quarterly
·Audit-readiness report passes regulator review

Recommended cards

The SKUs and capabilities most likely to be part of the solution, with the editorial rationale for each in the context of this story. Add the ones that fit your situation.

SKUMicrosoftSaaS

Microsoft Defender for Cloud

Cloud-native security posture management (CSPM) and workload protection (CWPP) across Azure, AWS, and GCP. The platform-team's continuous-assessment substrate — secure score, regulatory compliance dashboards, and per-workload threat protection in one product.

In: Azure consumption

Cloud Foundation

Why for this story

Surfaces backup-coverage gaps continuously and produces the secure-score baseline auditors reference for resilience.

SKUMicrosoftAzure

Azure Monitor

Azure's first-party observability platform — metrics, logs, traces, and alerts across Azure resources, hybrid workloads (via Azure Arc), and other clouds. The substrate platform teams use to know what's actually happening across the estate.