Solution Atlas
Specialised · User story · Consultative playbook

Our ML team has outgrown notebooks and needs a proper lakehouse

A retail bank's ML team is running models in fragmented Databricks workspaces with no central governance. They need Unity Catalog, MLflow lifecycle management, and a path to production GenAI on the same data substrate.

Trigger
ML production incidents; regulator wants lineage and access auditing.
Good outcome
Unity Catalog tenant-wide, DBCU commitment sized to forecast, MLflow + Mosaic AI in production.
Diagnostic discovery

Signals this story fits

Observable cues that confirm the conversation belongs here.

  • ML team running on multiple fragmented Databricks workspaces
  • Regulator flagged model lineage and governance gaps
  • MLflow not in production; models managed informally
  • Mosaic AI or GenAI on the strategic roadmap
  • DBU costs surprising; no DBCU commitment

Questions to ask

Open-ended, SPIN-style — each one has a reason it matters.

  1. How many Databricks workspaces are live today, and who owns each?

    Why: Surfaces sprawl. Multi-workspace estates without Unity Catalog are the canonical baseline for this story.

  2. Is your governance plane workspace-scoped or Unity Catalog?

    Why: Unity Catalog is the prerequisite for tenant-wide governance, lineage, and fine-grained access.

  3. Where does your ML lifecycle live today: notebooks, MLflow, ad-hoc services?

    Why: Determines the maturity of the model registry and the retraining cadence.

  4. Have you priced Databricks Mosaic AI against Azure AI Foundry for the GenAI roadmap?

    Why: The Foundry vs Mosaic AI decision is about workload fit, not vendor preference. This surfaces whether the customer has done the analysis.

  5. What DBCU commitment have you made, if any?

    Why: A DBCU commitment typically yields a 20–40% discount on stable workloads; without one, the customer is paying retail. (A worked cost sketch follows this list.)

  6. What is your storage substrate: ADLS Gen2 only, or mixed?

    Why: Delta Lake lives on ADLS; a mixed substrate complicates the Unity Catalog rollout.
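
To make the retail-versus-committed gap from question 5 concrete, here is a minimal back-of-envelope sketch. The consumption volume, pay-as-you-go rate, and discount level are placeholder assumptions, not quoted prices.

    # Illustrative DBCU economics -- all figures are assumptions, not quotes.
    monthly_dbus = 120_000        # forecast stable consumption (assumed)
    payg_rate = 0.55              # $/DBU pay-as-you-go (placeholder)
    committed_discount = 0.30     # mid-point of the 20-40% band (assumed)

    retail_cost = monthly_dbus * payg_rate
    committed_cost = retail_cost * (1 - committed_discount)

    print(f"retail:    ${retail_cost:,.0f}/month")
    print(f"committed: ${committed_cost:,.0f}/month")
    print(f"saving:    ${retail_cost - committed_cost:,.0f}/month")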

Baseline → target architecture

TOGAF-style gap framing — what we typically see today, and what the proposed end state looks like. The gap between them is the engagement.

Baseline architecture

Multiple Databricks workspaces created per team. Workspace-scoped Hive metastore. MLflow used informally. No central model registry. DBU spend on pay-as-you-go rates. ADLS Gen2 as the storage substrate. Lineage informal.

Typical concerns

  • Fragmented governance across workspaces
  • Models in production without lineage or owner
  • DBU cost surprises from spot misuse and idle clusters
  • No drift detection
  • No defensible answer to "is this model still fit for purpose?"

Capability gaps

  • Unity Catalog as tenant-wide governance
  • MLflow as central model registry
  • Drift detection and retraining cadence
  • DBCU commitment discipline
  • Foundry vs Mosaic AI workload-fit decision

Target architecture

Unity Catalog rolled out tenant-wide as the governance plane. ADLS Gen2 as the storage substrate beneath Delta Lake. DBCU committed to forecast workloads. MLflow as the central model registry with drift detection automated. Mosaic AI for lakehouse-native GenAI. Foundry for non-Spark workloads and the broader Microsoft AI surface. Purview Data Governance federates Unity Catalog with Fabric and Snowflake.
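
As a sketch of what "Unity Catalog as the governance plane" means in practice, the following could run from a Databricks notebook (where `spark` is already in scope). The location, credential, catalog, schema, and group names are illustrative placeholders, not prescribed conventions.

    # Hedged sketch of the Unity Catalog target state; every name below
    # (location, credential, catalog, schema, groups) is a placeholder.
    spark.sql("""
        CREATE EXTERNAL LOCATION IF NOT EXISTS lakehouse_prod
        URL 'abfss://lakehouse@examplebank.dfs.core.windows.net/prod'
        WITH (STORAGE CREDENTIAL bank_adls_credential)
    """)

    spark.sql("CREATE CATALOG IF NOT EXISTS ml_prod")
    spark.sql("CREATE SCHEMA IF NOT EXISTS ml_prod.features")

    # Fine-grained, auditable grants replace workspace-scoped Hive permissions.
    spark.sql("GRANT USE CATALOG ON CATALOG ml_prod TO `ml-engineers`")
    spark.sql("GRANT SELECT ON SCHEMA ml_prod.features TO `risk-analysts`")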

Key capabilities

  • Unity Catalog tenant-wide
  • Central model registry + lineage
  • Drift detection and retraining cadence
  • DBCU commitment discipline
  • Mosaic AI / Foundry workload-fit

Enabling SKUs

Resolved in the ‘Recommended cards’ section below.

Architecture decisions

Each decision is offered as explicit options with trade-offs — Hohpe's “selling options” principle. A safe default is noted where one exists.

  1. GenAI platform: Mosaic AI on Databricks vs Azure AI Foundry

    Mosaic AI

    When it fits: Data and ML workloads already on Databricks; need governance + lineage on the same plane.

    Trade-offs: Smaller GenAI ecosystem than Azure OpenAI.

    Azure AI Foundry

    When it fits: Azure-native estate; non-Spark workloads; broader model catalogue needed.

    Trade-offs: Two governance planes if Databricks is also in use.

    Default recommendation: Mosaic AI where the data is already in Databricks; Foundry for the broader Microsoft AI surface.

  2. Governance plane: Unity Catalog primary vs Purview Data Governance primary

    Unity Catalog primary

    When it fits: Databricks-dominant estate; lineage primarily within Databricks.

    Trade-offs: Cross-platform federation requires Purview anyway.

    Purview Data Governance primary

    When it fits: Multi-platform estate (Databricks + Fabric + Snowflake).

    Trade-offs: Less native depth than Unity Catalog within Databricks itself.

    Default recommendation: Unity Catalog native, with Purview federating across platforms.

  3. DBCU commitment level: 1-year vs 3-year vs none

    3-year DBCU

    When it fits: Stable workload pattern; large estate; willingness to commit.

    Trade-offs: Larger commitment with less flexibility.

    1-year DBCU

    When it fits: Moderate predictability; growth phase.

    Trade-offs: Lower discount tier.

    None (pay-as-you-go)

    When it fits: Volatile workloads; small estate.

    Trade-offs: No discount; surprise bills.

    Default recommendation: 1-year DBCU sized to 70% of the forecast steady state, with pay-as-you-go for the variable tier. (A sizing sketch follows this list.)
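
One way to read "sized to 70% of forecast steady-state" as a procedure. The forecast figures are invented, and approximating the steady-state floor with a low percentile is one reasonable interpretation, not the only one.

    # Hedged sketch of sizing the 1-year commitment from a monthly DBU
    # forecast. All numbers are invented; "steady state" is approximated
    # as a low percentile so the commitment is consumed even in quiet months.
    forecast = [80_000, 95_000, 110_000, 90_000, 130_000, 105_000,
                98_000, 115_000, 88_000, 120_000, 102_000, 92_000]

    steady_state = sorted(forecast)[int(len(forecast) * 0.2)]  # ~20th percentile
    commitment = int(steady_state * 0.70)                      # the 70% rule

    print(f"steady-state floor: {steady_state:,} DBUs/month")
    print(f"commit to:          {commitment:,} DBUs/month (rest pay-as-you-go)")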

Low-risk trial — proof of value

8-week Unity Catalog rollout + MLflow + Mosaic AI POC

8 weeks

Unity Catalog enabled tenant-wide with the first workspace migrated. MLflow central registry stood up. One production model brought under lineage and drift detection. One Mosaic AI POC against a GenAI use case grounded on Unity Catalog data.
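
A minimal sketch of the central-registry step, assuming a Databricks runtime with MLflow and scikit-learn available. The toy model and the three-level model name are placeholders; Unity Catalog's registry requires a model signature, hence the infer_signature call.

    # Hedged sketch: register a model in the Unity Catalog model registry.
    import mlflow
    from mlflow.models import infer_signature
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Point MLflow at the Unity Catalog registry, not the workspace registry.
    mlflow.set_registry_uri("databricks-uc")

    X, y = make_classification(n_samples=500, n_features=8, random_state=0)
    model = LogisticRegression().fit(X, y)

    with mlflow.start_run():
        mlflow.sklearn.log_model(
            model,
            "model",
            signature=infer_signature(X, model.predict(X)),
            registered_model_name="ml_prod.models.credit_risk",  # catalog.schema.model
        )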

Success criteria

  • Unity Catalog live with one workspace fully migrated
  • Central MLflow registry with one production model registered
  • Drift detection alerts produced for the trial model (see the drift sketch after this list)
  • Mosaic AI POC produces a working RAG endpoint against catalogued data
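
One concrete way to generate those drift alerts is a population stability index (PSI) check between the training baseline and recent scoring traffic. The metric choice, the synthetic data, and the 0.2 threshold below are illustrative, not prescribed.

    # Hedged sketch of a PSI-based drift check; data and threshold are invented.
    import numpy as np

    def psi(baseline, current, bins=10):
        """Population stability index over quantile bins of the baseline."""
        edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
        edges[0], edges[-1] = -np.inf, np.inf
        b = np.histogram(baseline, edges)[0] / len(baseline)
        c = np.histogram(current, edges)[0] / len(current)
        b, c = np.clip(b, 1e-6, None), np.clip(c, 1e-6, None)
        return float(np.sum((c - b) * np.log(c / b)))

    rng = np.random.default_rng(0)
    train_scores = rng.normal(0.0, 1.0, 10_000)   # training-time baseline
    live_scores = rng.normal(0.3, 1.1, 2_000)     # recent traffic, shifted

    value = psi(train_scores, live_scores)
    if value > 0.2:                               # common rule-of-thumb cutoff
        print(f"drift alert: PSI = {value:.3f}")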

Investment: DBU consumption only; commitment decisions deferred to month 3. Mosaic AI on existing DBU rates.

Proof metrics

  • Unity Catalog adoption % at trial end
  • Time-to-model-deployment for registered models
  • Drift alert quality (signal vs noise)
  • GenAI POC response quality and latency (see the retrieval sketch below)
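
To exercise the POC's retrieval path for the quality and latency metrics, something like the following could be used, assuming the databricks-vectorsearch client and a Delta Sync index with managed embeddings over a Unity Catalog table. The endpoint name, index name, columns, and query are placeholders.

    # Hedged sketch: query the POC's vector index; all names are placeholders.
    import time
    from databricks.vector_search.client import VectorSearchClient

    client = VectorSearchClient()  # auth picked up from the environment

    index = client.get_index(
        endpoint_name="lakehouse-vs-endpoint",       # assumed endpoint
        index_name="ml_prod.rag.policy_docs_index",  # assumed UC index
    )

    start = time.perf_counter()
    hits = index.similarity_search(
        query_text="What is the retention policy for closed accounts?",
        columns=["doc_id", "chunk_text"],
        num_results=5,
    )
    latency_ms = (time.perf_counter() - start) * 1000

    print(f"latency: {latency_ms:.0f} ms")
    for row in hits["result"]["data_array"]:
        print(row)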

Recommended cards

The SKUs and capabilities most likely to be part of the solution, with the editorial rationale for each in the context of this story. Add the ones that fit your situation.
