Solution Atlas
Specialised · User story · Consultative playbook

“We have 30 ML models in production and no idea which ones are drifting”

A bank's ML team has deployed dozens of models with no shared lifecycle discipline. There is no central registry, retraining is ad-hoc, and drift detection lives in someone's personal notebook. The regulator has asked how the bank knows the models are still fit for purpose.

Trigger
Regulator review; lack of model lineage flagged.
Good outcome
Central registry, drift detection automated, retraining cadence governed, MLOps as a discipline.
Diagnostic discovery

Signals this story fits

Observable cues that confirm the conversation belongs here.

  • Dozens of ML models in production with no central registry
  • Drift detection ad-hoc or absent
  • Regulator or auditor requesting model lineage
  • Retraining cadence informal — depends on who notices an issue
  • Mosaic AI / GenAI roadmap pending without MLOps foundation

Questions to ask

Open-ended, SPIN-style — each one has a reason it matters.

  1. How many models are in production today, and where do they live?

    Why: Surfaces the sprawl scope. Often the customer cannot give an exact number.

  2. Where's your model registry?

    Why: "In someone's notebook" is the most common honest answer. Confirms the maturity gap.

  3. How do you detect drift today?

    Why: Tests whether drift detection is automated, manual, or absent.

    Listen for: “nobody notices” · “we check quarterly” · “each team owns it”

  4. What's the retraining cadence — and who triggers it?

    Why: Surfaces the governance gap. Manual cadence rarely scales beyond a few models.

  5. What does the regulator want to see specifically — lineage, attestation, model cards?

    Why: Sharpens the deliverable. Different regulators want different artefacts.

  6. What's your data substrate — Databricks, Fabric, mixed?

    Why: Determines whether MLflow on Databricks is native or a federation play.

Baseline → target architecture

TOGAF-style gap framing — what we typically see today, and what the proposed end state looks like. The gap between them is the engagement.

Baseline architecture

Models scattered across Databricks workspaces, notebooks, and custom services. MLflow used informally without a central registry. Drift detection ad-hoc. Retraining manual and reactive. No documented lineage from training data to deployed model.

Typical concerns

  • No defensible answer to "is this model still fit for purpose?"
  • Model performance degrading silently
  • Retraining triggered only when something breaks
  • No model cards or attestation for the regulator
  • GenAI workloads adding to the sprawl

Capability gaps

  • Central model registry
  • Automated drift detection
  • Retraining cadence with governance
  • Model cards and lineage
  • Responsible AI gates wired into the lifecycle

Target architecture

Unity Catalog + MLflow on Databricks as the central registry. Lineage from training data through deployed endpoints. Drift detection automated, with alerts routed to the SOC or platform-team queue. Retraining governed by a documented cadence runbook. Mosaic AI for the GenAI lifecycle. Foundry for non-Spark workloads. Fabric provides the data substrate where training data lives.
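
To make the registry piece concrete, here is a minimal sketch of registering a trained model under a three-level Unity Catalog name via MLflow. It assumes a Databricks workspace with Unity Catalog enabled; the catalog, schema, and model names are placeholders, and the toy dataset stands in for a real training pipeline.

```python
import mlflow
from mlflow.models import infer_signature
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Register models in Unity Catalog rather than a per-workspace registry,
# so every model has one governed home. Names below are placeholders.
mlflow.set_registry_uri("databricks-uc")

# Toy data stands in for the real training pipeline.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

with mlflow.start_run() as run:
    model = LogisticRegression(max_iter=1000).fit(X, y)
    # Unity Catalog requires a model signature at registration time.
    signature = infer_signature(X, model.predict(X))
    mlflow.sklearn.log_model(model, artifact_path="model", signature=signature)

# A three-level catalog.schema.model name ties the registered version back
# to the run, and through the run to the training data, for lineage.
mlflow.register_model(
    model_uri=f"runs:/{run.info.run_id}/model",
    name="risk.credit_models.pd_scorecard",
)
```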

Key capabilities

  • Central model registry
  • Lineage from data to deployed model
  • Automated drift detection
  • Retraining cadence runbook
  • Model cards and attestation artefacts

Enabling SKUs

Resolved in the ‘Recommended cards’ section below.

Architecture decisions

Each decision is offered as explicit options with trade-offs — Hohpe's “selling options” principle. A safe default is noted where one exists.

  1. Registry location — MLflow on Databricks vs Foundry-hosted

    MLflow on Databricks (Unity Catalog)

    When it fits: Data and ML workloads on Databricks; lakehouse-native lineage required.

    Trade-offs: Tight Databricks coupling.

    Foundry-hosted

    When it fits: Pro-code AI workloads on Azure; non-Spark training pipelines.

    Trade-offs: Less native lineage if data lives in Databricks.

    Default recommendation: MLflow on Databricks if training data is in Databricks; Foundry-hosted for non-Spark workloads.

  2. Drift tooling — built-in vs third-party

    Built-in (Databricks Lakehouse Monitoring / Foundry built-in)

    When it fits: Standard drift patterns; matches the platform.

    Trade-offs: Less control over custom drift signals.

    Third-party (e.g. Arize, Fiddler, Evidently)

    When it fits: Specialised drift requirements; bias detection central to compliance.

    Trade-offs: Additional tooling and procurement.

    Default recommendation: Built-in to start; layer third-party tooling only if specific gaps appear. A minimal drift-metric sketch follows this list.

  3. Retraining trigger — manual approval vs automated

    Manual approval

    When it fits: Regulated workloads; significant business impact per model change.

    Trade-offs: Slower cadence; risk of staleness.

    Automated trigger on drift threshold

    When it fits: Mature MLOps; clear drift definitions per workload.

    Trade-offs: Risk of unintended retraining if the drift signal is noisy.

    Default recommendation: Manual approval gate for regulated workloads; automated for the well-understood ones. A trigger-gate sketch also follows below.
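
Whichever option wins Decision 2, the underlying check is usually some distribution-shift statistic. As a tooling-agnostic illustration, the sketch below computes a population stability index (PSI) over model scores; the thresholds and the alert-routing comment are rule-of-thumb assumptions, not any vendor's defaults.

```python
import numpy as np

def population_stability_index(reference: np.ndarray,
                               current: np.ndarray,
                               bins: int = 10) -> float:
    """PSI between training-time reference scores and live scores.

    Rule of thumb (an assumption, tune per workload):
    < 0.1 stable, 0.1-0.25 monitor, > 0.25 drifted.
    """
    # Bin edges come from the reference distribution only, so every
    # comparison is against training-time behaviour.
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf

    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)

    eps = 1e-6  # avoids log(0) on empty bins
    return float(np.sum((cur_pct - ref_pct)
                        * np.log((cur_pct + eps) / (ref_pct + eps))))

# Illustrative only: synthetic scores with a deliberate mean shift.
rng = np.random.default_rng(0)
psi = population_stability_index(rng.normal(0.0, 1.0, 10_000),
                                 rng.normal(0.3, 1.0, 10_000))
if psi > 0.25:
    print(f"Drift alert: PSI={psi:.3f}")  # route to SOC / platform queue
```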
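
For Decision 3, the manual-versus-automated split can be encoded as a per-model policy. This sketch is hypothetical: RetrainingPolicy, should_retrain, the threshold, and the model names are invented for illustration, with the real values coming from the cadence runbook.

```python
from dataclasses import dataclass

@dataclass
class RetrainingPolicy:
    """Per-model policy resolved from the cadence runbook (hypothetical)."""
    model_name: str
    drift_threshold: float
    regulated: bool  # regulated workloads keep a manual approval gate

def should_retrain(policy: RetrainingPolicy,
                   psi: float,
                   approved_by: str | None = None) -> bool:
    if psi < policy.drift_threshold:
        return False  # no material drift: stay on the normal cadence
    if policy.regulated:
        # Manual gate: drift alone never triggers retraining on a
        # regulated workload; a named approver must sign off first.
        return approved_by is not None
    return True  # well-understood workload: automate on threshold breach

# The regulated model waits for sign-off; the other retrains automatically.
scorecard = RetrainingPolicy("risk.credit_models.pd_scorecard", 0.25, regulated=True)
churn = RetrainingPolicy("cust.models.churn_propensity", 0.25, regulated=False)
print(should_retrain(scorecard, psi=0.31))                     # False
print(should_retrain(scorecard, psi=0.31, approved_by="mrm"))  # True
print(should_retrain(churn, psi=0.31))                         # True
```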

Low-risk trial — proof of value

60-day MLOps foundation — three production models under governance

8 weeks

Unity Catalog + MLflow registry stood up. Three production models registered with lineage. Drift detection automated on those three models. Retraining cadence runbook authored. First model card produced for one regulator-relevant model.

Success criteria

  • Three production models in the central registry with lineage
  • Drift alerts firing and actioned within SLA
  • Retraining cadence runbook validated against one model retraining cycle
  • First model card complete and audit-defensible

Investment: Databricks DBU consumption only; commitment decisions deferred. No new model development during the trial — the focus is governance.

Proof metrics

  • Model coverage in registry above 30% at trial end
  • Drift alerts produced for at least one model
  • Retraining time-to-deploy measured
  • Model card produced and validated against regulator requirement
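
On the model-card metric: there is no single regulator-mandated schema, so the skeleton below is only a plausible minimal structure with illustrative placeholder values. Confirm the required fields against what the regulator actually asks for (see question 5 above).

```python
import json
from datetime import date

# Minimal model card skeleton. Fields and values are illustrative
# placeholders, not a regulator-mandated schema.
model_card = {
    "model": "risk.credit_models.pd_scorecard",
    "version": 3,
    "owner": "credit-risk-ml@bank.example",
    "intended_use": "Probability-of-default scoring for retail lending",
    "training_data": {
        "source": "risk.features.pd_training",  # lakehouse table (placeholder)
        "snapshot_date": "2024-12-31",
    },
    "evaluation": {"auc": 0.81, "psi_at_review": 0.04},
    "limitations": "Not validated for SME or corporate exposures",
    "last_reviewed": str(date.today()),
}
print(json.dumps(model_card, indent=2))
```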

Recommended cards

The SKUs and capabilities most likely to be part of the solution, with the editorial rationale for each in the context of this story.
