Solution Atlas
Specialised · User story · Consultative playbook

“We have 30 ML models in production and no idea which ones are drifting”

A bank's ML team has deployed dozens of models with no shared lifecycle discipline. There is no central registry, retraining is ad-hoc, and drift detection lives in someone's personal notebook. The regulator has asked how the bank knows the models are still fit for purpose.

Trigger
Regulator review; lack of model lineage flagged.
Good outcome
Central registry, drift detection automated, retraining cadence governed, MLOps as a discipline.
Diagnostic discovery

Signals this story fits

Observable cues that confirm the conversation belongs here.

  • Dozens of ML models in production with no central registry
  • Drift detection ad-hoc or absent
  • Regulator or auditor requesting model lineage
  • Retraining cadence informal — depends on who notices an issue
  • Mosaic AI / GenAI roadmap pending without MLOps foundation

Questions to ask

Open-ended, SPIN-style — each one has a reason it matters.

  1. How many models are in production today, and where do they live?

    Why: Surfaces the sprawl scope. Often the customer cannot give an exact number.

  2. Where's your model registry?

    Why: "In someone's notebook" is the most common honest answer. Confirms the maturity gap.

  3. How do you detect drift today?

    Why: Tests whether drift detection is automated, manual, or absent.

    Listen for: “nobody notices” · “we check quarterly” · “each team owns it”

  4. What's the retraining cadence — and who triggers it?

    Why: Surfaces the governance gap. Manual cadence rarely scales beyond a few models.

  5. What does the regulator want to see specifically — lineage, attestation, model cards?

    Why: Sharpens the deliverable. Different regulators want different artefacts.

  6. What's your data substrate — Databricks, Fabric, mixed?

    Why: Determines whether MLflow on Databricks is native or a federation play.

Baseline → target architecture

TOGAF-style gap framing — what we typically see today, and what the proposed end state looks like. The gap between them is the engagement.

Baseline architecture

Models scattered across Databricks workspaces, notebooks, and custom services. MLflow used informally without a central registry. Drift detection ad-hoc. Retraining manual and reactive. No documented lineage from training data to deployed model.

Typical concerns

  • No defensible answer to "is this model still fit for purpose?"
  • Model performance degrading silently
  • Retraining triggered only when something breaks
  • No model cards or attestation for the regulator
  • GenAI workloads adding to the sprawl

Capability gaps

  • Central model registry
  • Automated drift detection
  • Retraining cadence with governance
  • Model cards and lineage
  • Responsible AI gates wired into the lifecycle

Target architecture

Unity Catalog + MLflow on Databricks as the central registry. Lineage from training data through deployed endpoints. Drift detection automated, with alerts routed to the SOC or platform-team queue. Retraining governed by a documented cadence runbook. Mosaic AI for the GenAI lifecycle. Foundry for non-Spark workloads. Fabric provides the data substrate where training data lives.
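
To make the registry piece concrete, here is a minimal sketch of registering a trained model under a three-level Unity Catalog name via MLflow. It assumes a Databricks workspace with Unity Catalog enabled; the catalog, schema, and model names are placeholders, and the toy dataset stands in for a real training pipeline.

```python
import mlflow
from mlflow.models import infer_signature
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Register models in Unity Catalog rather than a per-workspace registry,
# so every model has one governed home. Names below are placeholders.
mlflow.set_registry_uri("databricks-uc")

# Toy data stands in for the real training pipeline.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

with mlflow.start_run() as run:
    model = LogisticRegression(max_iter=1000).fit(X, y)
    # Unity Catalog requires a model signature at registration time.
    signature = infer_signature(X, model.predict(X))
    mlflow.sklearn.log_model(model, artifact_path="model", signature=signature)

# A three-level catalog.schema.model name ties the registered version back
# to the run, and through the run to the training data, for lineage.
mlflow.register_model(
    model_uri=f"runs:/{run.info.run_id}/model",
    name="risk.credit_models.pd_scorecard",
)
```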

Key capabilities

  • Central model registry
  • Lineage from data to deployed model
  • Automated drift detection
  • Retraining cadence runbook
  • Model cards and attestation artefacts

Enabling SKUs

Resolved in the ‘Recommended cards’ section below.

Architecture decisions

Each decision is offered as explicit options with trade-offs — Hohpe's “selling options” principle. A safe default is noted where one exists.

  1. Registry location — MLflow on Databricks vs Foundry-hosted

    MLflow on Databricks (Unity Catalog)

    When it fits: Data and ML workloads on Databricks; lakehouse-native lineage required.

    Trade-offs: Tight Databricks coupling.

    Foundry-hosted

    When it fits: Pro-code AI workloads on Azure; non-Spark training pipelines.

    Trade-offs: Less native lineage if data lives in Databricks.

    Default recommendation: MLflow on Databricks if training data is in Databricks; Foundry-hosted for non-Spark workloads.

  2. Drift tooling — built-in vs third-party

    Built-in (Databricks Lakehouse Monitoring / Foundry built-in)

    When it fits: Standard drift patterns; matches the platform.

    Trade-offs: Less control over custom drift signals.

    Third-party (e.g. Arize, Fiddler, Evidently)

    When it fits: Specialised drift requirements; bias detection central to compliance.

    Trade-offs: Additional tooling and procurement.

    Default recommendation: Built-in to start; layer third-party tooling only if specific gaps appear. A minimal drift-metric sketch follows this list.

  3. Retraining trigger — manual approval vs automated

    Manual approval

    When it fits: Regulated workloads; significant business impact per model change.

    Trade-offs: Slower cadence; risk of staleness.

    Automated trigger on drift threshold

    When it fits: Mature MLOps; clear drift definitions per workload.

    Trade-offs: Risk of unintended retraining if the drift signal is noisy.

    Default recommendation: Manual approval gate for regulated workloads; automated for the well-understood ones. A trigger-gate sketch also follows below.
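
Whichever option wins Decision 2, the underlying check is usually some distribution-shift statistic. As a tooling-agnostic illustration, the sketch below computes a population stability index (PSI) over model scores; the thresholds and the alert-routing comment are rule-of-thumb assumptions, not any vendor's defaults.

```python
import numpy as np

def population_stability_index(reference: np.ndarray,
                               current: np.ndarray,
                               bins: int = 10) -> float:
    """PSI between training-time reference scores and live scores.

    Rule of thumb (an assumption, tune per workload):
    < 0.1 stable, 0.1-0.25 monitor, > 0.25 drifted.
    """
    # Bin edges come from the reference distribution only, so every
    # comparison is against training-time behaviour.
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf

    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)

    eps = 1e-6  # avoids log(0) on empty bins
    return float(np.sum((cur_pct - ref_pct)
                        * np.log((cur_pct + eps) / (ref_pct + eps))))

# Illustrative only: synthetic scores with a deliberate mean shift.
rng = np.random.default_rng(0)
psi = population_stability_index(rng.normal(0.0, 1.0, 10_000),
                                 rng.normal(0.3, 1.0, 10_000))
if psi > 0.25:
    print(f"Drift alert: PSI={psi:.3f}")  # route to SOC / platform queue
```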
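
For Decision 3, the manual-versus-automated split can be encoded as a per-model policy. This sketch is hypothetical: RetrainingPolicy, should_retrain, the threshold, and the model names are invented for illustration, with the real values coming from the cadence runbook.

```python
from dataclasses import dataclass

@dataclass
class RetrainingPolicy:
    """Per-model policy resolved from the cadence runbook (hypothetical)."""
    model_name: str
    drift_threshold: float
    regulated: bool  # regulated workloads keep a manual approval gate

def should_retrain(policy: RetrainingPolicy,
                   psi: float,
                   approved_by: str | None = None) -> bool:
    if psi < policy.drift_threshold:
        return False  # no material drift: stay on the normal cadence
    if policy.regulated:
        # Manual gate: drift alone never triggers retraining on a
        # regulated workload; a named approver must sign off first.
        return approved_by is not None
    return True  # well-understood workload: automate on threshold breach

# The regulated model waits for sign-off; the other retrains automatically.
scorecard = RetrainingPolicy("risk.credit_models.pd_scorecard", 0.25, regulated=True)
churn = RetrainingPolicy("cust.models.churn_propensity", 0.25, regulated=False)
print(should_retrain(scorecard, psi=0.31))                     # False
print(should_retrain(scorecard, psi=0.31, approved_by="mrm"))  # True
print(should_retrain(churn, psi=0.31))                         # True
```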

Low-risk trial — proof of value

60-day MLOps foundation — three production models under governance

8 weeks

Unity Catalog + MLflow registry stood up. Three production models registered with lineage. Drift detection automated on those three models. Retraining cadence runbook authored. First model card produced for one regulator-relevant model.

Success criteria

  • Three production models in the central registry with lineage
  • Drift alerts firing and actioned within SLA
  • Retraining cadence runbook validated against one model retraining cycle
  • First model card complete and audit-defensible

Investment: Databricks DBU consumption only; commitment decisions deferred. No new model development during the trial — the focus is governance.

Proof metrics

  • Model coverage in registry above 30% at trial end
  • Drift alerts produced for at least one model
  • Retraining time-to-deploy measured
  • Model card produced and validated against regulator requirement
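
On the model-card metric: there is no single regulator-mandated schema, so the skeleton below is only a plausible minimal structure with illustrative placeholder values. Confirm the required fields against what the regulator actually asks for (see question 5 above).

```python
import json
from datetime import date

# Minimal model card skeleton. Fields and values are illustrative
# placeholders, not a regulator-mandated schema.
model_card = {
    "model": "risk.credit_models.pd_scorecard",
    "version": 3,
    "owner": "credit-risk-ml@bank.example",
    "intended_use": "Probability-of-default scoring for retail lending",
    "training_data": {
        "source": "risk.features.pd_training",  # lakehouse table (placeholder)
        "snapshot_date": "2024-12-31",
    },
    "evaluation": {"auc": 0.81, "psi_at_review": 0.04},
    "limitations": "Not validated for SME or corporate exposures",
    "last_reviewed": str(date.today()),
}
print(json.dumps(model_card, indent=2))
```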

Recommended cards

The SKUs and capabilities most likely to be part of the solution, with the editorial rationale for each in the context of this story.
