Solution Atlas
Specialised · User story · Consultative playbook

AKS is becoming the standard runtime and we need governance before it sprawls

Engineering teams are converging on Kubernetes as the application runtime. The platform team wants a paved-road AKS pattern with central security, networking, and policy before per-team clusters multiply beyond what platform engineering can support.

Trigger
Multiple AKS clusters spinning up without standards; the security team is concerned.
Good outcome
Central AKS platform with policy guardrails, namespace governance, and Defender for Containers baseline.
Diagnostic discovery

Signals this story fits

Observable cues that confirm the conversation belongs here.

  • Multiple AKS clusters spinning up per team without a central pattern
  • No central image registry policy or supply-chain scanning
  • Defender for Containers not deployed
  • Cluster networking varies — some public, some private, some mixed
  • Cost spikes from idle clusters; no attribution

Questions to ask

Open-ended, SPIN-style — each one has a reason it matters.

  1. How many AKS clusters are live today, and who provisioned each?

    Why: Surfaces the sprawl. Most customers cannot give a confident answer.

    Listen for: “varies” · “each team has its own” · “we are not sure”

  2. What is your cluster-creation pattern — Terraform, Bicep, console, scripts?

    Why: Determines whether a paved-road template would be adopted easily.

  3. How does cluster networking integrate with your landing zone or hub?

    Why: Networking is the most-divergent dimension across teams; surfaces the standardisation opportunity.

  4. What is your image registry and supply-chain posture?

    Why: ACR with vulnerability scanning is the canonical answer; surfaces whether it is in place.

  5. Who owns runtime SLOs for production AKS clusters today?

    Why: Often nobody; the platform-as-product question lives here.

  6. What is your Defender for Containers coverage?

    Why: Gauges container security posture; usually absent or partial.

Baseline → target architecture

TOGAF-style gap framing — what we typically see today, and what the proposed end state looks like. The gap between them is the engagement.

Baseline architecture

Per-team AKS clusters provisioned with inconsistent configuration. No central image registry policy. Cluster networking varies (public clusters, private clusters, mixed). Defender for Containers not in scope. Per-cluster cost attribution is manual. No paved-road template.

Typical concerns

  • Inconsistent security posture across clusters
  • Image supply chain unaudited
  • Network configuration drift
  • Production clusters without SLO owners
  • Cost growth without attribution

Capability gaps

  • Paved-road AKS template
  • Azure Policy at management-group level
  • ACR with vulnerability scanning
  • Defender for Containers tenant-wide
  • Cluster-cost attribution

Target architecture

Paved-road AKS pattern delivered via a central Bicep template — opinionated network integration with the landing-zone hub, ACR as the standard registry with vulnerability scanning, Azure Policy at the management-group level, Defender for Containers tenant-wide. Cost attribution per cluster via tag taxonomy. Platform team owns the runtime SLO baseline; project teams operate within it.
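
To make the paved road concrete, here is a minimal Bicep sketch of the kind of cluster such a template would stamp out — private API server, Azure Policy add-on for in-cluster guardrails, hub-integrated subnet, and cost-attribution tags. All names, parameters, sizes, and the API version are illustrative placeholders, not the delivered template; a production version would add node-pool sizing, workload identity, and diagnostics on top of this skeleton.

```bicep
// Minimal paved-road cluster sketch — illustrative only; names, sizes,
// tag keys, and the API version are placeholders, not the real template.
param clusterName string
param location string = resourceGroup().location
param spokeSubnetId string   // subnet in a spoke VNet peered to the landing-zone hub
param team string            // assumed cost-attribution tag keys
param costCentre string

resource aks 'Microsoft.ContainerService/managedClusters@2024-02-01' = {
  name: clusterName
  location: location
  identity: {
    type: 'SystemAssigned'
  }
  tags: {
    team: team
    costCentre: costCentre
  }
  properties: {
    dnsPrefix: clusterName
    apiServerAccessProfile: {
      enablePrivateCluster: true    // private cluster by default (Decision 2 below)
    }
    agentPoolProfiles: [
      {
        name: 'system'
        mode: 'System'
        count: 3
        vmSize: 'Standard_D4s_v5'
        vnetSubnetID: spokeSubnetId // opinionated hub/spoke network integration
      }
    ]
    networkProfile: {
      networkPlugin: 'azure'
      networkPolicy: 'azure'
    }
    addonProfiles: {
      azurepolicy: {
        enabled: true               // Azure Policy guardrails inside the cluster
      }
    }
  }
}
```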

Key capabilities

  • Paved-road AKS template
  • Central image registry with vulnerability scanning
  • Tenant-wide container security posture
  • Policy-enforced cluster configuration
  • Cluster-cost attribution

Enabling SKUs

Resolved in the ‘Recommended cards’ section below.

Architecture decisions

Each decision is offered as explicit options with trade-offs — Hohpe's “selling options” principle. A safe default is noted where one exists.

  1. Decision 1: Cluster pattern — per-team clusters vs central multi-tenant clusters

    Per-team clusters (default)

    When it fits: Team autonomy prioritised; blast-radius isolation per team.

    Trade-offs: Cluster count grows with team count; cost overhead per cluster.

    Central multi-tenant clusters

    When it fits: Mature platform team; smaller engineering org; tight cost discipline.

    Trade-offs: Tenant isolation depends on namespaces + policy; a weaker boundary than per-cluster isolation.

    Default recommendation: Per-team clusters with a strong paved-road template. Move to multi-tenant only when the platform team has explicit capacity.

  2. Decision 2: Networking — public AKS endpoint vs private cluster

    Public endpoint with authorized IP ranges

    When it fits: Lower complexity; engineering convenience.

    Trade-offs: Larger attack surface; auditors may demand private clusters.

    Private cluster (no public endpoint)

    When it fits: Regulated or auditor-pressured estates; consistent with landing-zone Zero Trust.

    Trade-offs: Engineering complexity; a jump host or VPN is required for kubectl.

    Default recommendation: Private cluster as the default; public endpoint only for explicit, justified exceptions.

  3. Decision 3: Image registry — ACR vs Harbor vs external (ECR, GHCR)

    ACR (Azure Container Registry)

    When it fits: Azure-native estates; native Defender integration; consistent with the rest of the estate.

    Trade-offs: Multi-cloud workloads carry separate ACR or replication overhead.

    Harbor (self-hosted)

    When it fits: Multi-cloud workloads; existing Harbor investment.

    Trade-offs: Self-hosting overhead; less native integration.

    GHCR (GitHub Container Registry)

    When it fits: GitHub-native pipelines; a small or developer-led estate.

    Trade-offs: Less mature than ACR for production-scale scanning.

    Default recommendation: ACR for Azure-resident workloads — a minimal provisioning sketch follows this list; consider Harbor or GHCR only for explicit multi-cloud or developer-experience reasons.
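
As a companion to the ACR default, a minimal provisioning sketch — Premium SKU (needed for private endpoints and geo-replication), admin user disabled, public network access off, matching the private-cluster posture. Names and the API version are placeholders; note that image vulnerability scanning is switched on through Defender for Containers at subscription scope, not on the registry resource itself.

```bicep
// Illustrative ACR baseline — names and API version are placeholders.
// Vulnerability scanning comes from Defender for Containers at the
// subscription level, not from a registry property.
param registryName string
param location string = resourceGroup().location

resource acr 'Microsoft.ContainerRegistry/registries@2023-07-01' = {
  name: registryName
  location: location
  sku: {
    name: 'Premium'                  // Premium enables private endpoints and geo-replication
  }
  properties: {
    adminUserEnabled: false          // force Entra ID / managed-identity auth
    publicNetworkAccess: 'Disabled'  // consistent with the private-cluster default
  }
}
```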

Low-risk trial — proof of value

6-week AKS paved-road pattern + onboard 3 existing clusters

Duration: 6 weeks

Paved-road Bicep template authored with policy guardrails. ACR provisioned with vulnerability scanning. Defender for Containers enabled tenant-wide. Three existing clusters refactored or rebuilt against the paved-road pattern. Cost attribution via tag taxonomy validated end-to-end.
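
One likely shape for the “Defender for Containers enabled tenant-wide” step: the plan is set per subscription via the Microsoft.Security/pricings resource, so tenant-wide coverage means repeating the assignment (or driving it through Azure Policy) across all subscriptions. A minimal sketch, with the API version as a placeholder:

```bicep
// Enables the Defender for Containers plan for one subscription.
// Tenant-wide coverage = this resource in every subscription,
// typically rolled out via an Azure Policy DeployIfNotExists effect.
targetScope = 'subscription'

resource defenderForContainers 'Microsoft.Security/pricings@2023-01-01' = {
  name: 'Containers'            // plan name is fixed by the provider
  properties: {
    pricingTier: 'Standard'     // 'Free' would switch the plan off
  }
}
```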

Success criteria

  • Paved-road template published with documentation
  • Three clusters onboarded with policy compliance above 90%
  • Defender for Containers producing actionable signal
  • Image vulnerability scanning live with triage cadence

Investment: Defender for Containers per-vCore; ACR consumption. Estimated ~€2–3k/month for the trial scope. Existing clusters untouched until refactored against the template.

Proof metrics

  • Cluster policy-compliance score above 90% on paved-road clusters (see the assignment sketch below)
  • Image vulnerability scanning hit rate (legitimate findings, not noise)
  • Cost attribution per cluster operational
  • Platform-team time saved on new-cluster provisioning
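
The compliance score presumes guardrails assigned at management-group scope, as the target architecture describes. A minimal assignment sketch follows; policySetDefinitionId is a placeholder for whichever built-in or custom AKS initiative the paved road standardises on.

```bicep
// Management-group-scoped assignment of an AKS baseline initiative.
// policySetDefinitionId is a placeholder — substitute the chosen
// built-in or custom initiative.
targetScope = 'managementGroup'

param policySetDefinitionId string

resource aksBaseline 'Microsoft.Authorization/policyAssignments@2022-06-01' = {
  name: 'aks-paved-road-baseline'
  properties: {
    displayName: 'AKS paved-road baseline'
    policyDefinitionId: policySetDefinitionId
    enforcementMode: 'Default'   // use 'DoNotEnforce' for an audit-only rollout
  }
}
```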

Recommended cards

The SKUs and capabilities most likely to be part of the solution, with the editorial rationale for each in the context of this story. Add the ones that fit your situation.
