📖 Lakehouse Glossary (Click to expand)

Plain-English first; nuts-and-bolts second. Click a card to open; click again to close.

🏛️ Data Lakehouse
A warehouse built inside your lake: store everything, analyse sensibly.
Architecture that blends lake scalability with warehouse features (ACID, indexing, governance) to serve BI & ML from one platform.
🌊 Data Lake
One big storage bay for every kind of file—tidy comes later.
Object storage (ADLS/S3/GCS) for structured, semi-structured, and unstructured data at massive scale.
📦 Object Storage
Buckets in the cloud with labels.
Flat storage of objects (data + metadata + ID) over HTTP APIs; durable and highly scalable.
🧪 ACID Transactions
No half-saved changes—either done or not.
Atomicity, Consistency, Isolation, Durability for reliable reads/writes in distributed systems.
🗂️ Medallion Architecture (Bronze/Silver/Gold)
Bronze = raw, Silver = cleaned, Gold = board-ready.
Layered refinement: append-only raw → conformed/quality-checked → curated marts (dimensional/semantic) for analytics.
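
A minimal PySpark sketch of the three hops, assuming a Delta-enabled Spark session; the paths and column names (order_id, amount, region) are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-demo").getOrCreate()

# Bronze: land raw events append-only, preserving source fidelity.
raw = spark.read.json("/lake/landing/orders/")  # hypothetical landing path
raw.write.format("delta").mode("append").save("/lake/bronze/orders")

# Silver: deduplicate, cast types, and drop rows that fail basic checks.
bronze = spark.read.format("delta").load("/lake/bronze/orders")
silver = (bronze
          .dropDuplicates(["order_id"])
          .withColumn("amount", F.col("amount").cast("decimal(10,2)"))
          .filter(F.col("order_id").isNotNull()))
silver.write.format("delta").mode("overwrite").save("/lake/silver/orders")

# Gold: curated aggregate ready for BI.
gold = silver.groupBy("region").agg(F.sum("amount").alias("revenue"))
gold.write.format("delta").mode("overwrite").save("/lake/gold/revenue_by_region")
```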
📚 Catalog / Unity Catalog
The index telling you what’s where and who can touch it.
Central metadata & governance (e.g., Databricks Unity Catalog) for tables, files, models, lineage, and permissions.
🧱 Table Format Layer
The rulebook that makes a lake act like a warehouse.
Open formats (Delta Lake, Apache Iceberg, Apache Hudi) add ACID, snapshots, schema control, and time travel on object storage.
🔺 Delta Lake
Version control and reliable updates for your lake.
Transaction log on Parquet enabling ACID, schema enforcement/evolution, and efficient upserts/deletes.
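
A sketch of a Delta upsert via the delta-spark Python API; the table path and customer_id key are hypothetical:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()
updates = spark.read.parquet("/lake/staging/customer_updates")  # hypothetical batch

target = DeltaTable.forPath(spark, "/lake/silver/customers")

# One ACID commit: update matching rows, insert the rest.
(target.alias("t")
 .merge(updates.alias("u"), "t.customer_id = u.customer_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```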
🧊 Apache Iceberg
Big-table plumbing for huge lakes.
Hidden partitioning, manifest lists, snapshot isolation; engine-agnostic (Spark/Trino/Flink) with fast scans and metadata ops.
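
A sketch using Spark SQL from Python, assuming the session was launched with the Iceberg runtime and a catalog named demo; table and column names are made up:

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg jar is on the classpath and a catalog named `demo`
# is configured (spark.sql.catalog.demo = ...SparkCatalog).
spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE demo.sales.events (
        id BIGINT,
        ts TIMESTAMP,
        payload STRING)
    USING iceberg
    PARTITIONED BY (days(ts))   -- hidden partitioning: queries just filter on ts
""")

# Snapshot metadata is itself queryable as a table.
spark.sql("SELECT snapshot_id, committed_at FROM demo.sales.events.snapshots").show()
```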
🦒 Apache Hudi
Change-friendly tables without the drama.
Incremental processing with Copy-On-Write / Merge-On-Read table types, record-level upserts, and CDC pipelines.
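
A hedged sketch of a Merge-On-Read upsert with the Hudi Spark datasource; the paths, trip_id record key, and updated_at precombine field are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes the Hudi bundle jar is on the classpath
changes = spark.read.parquet("/lake/staging/trip_changes")  # hypothetical change batch

hudi_opts = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "trip_id",      # dedupe key
    "hoodie.datasource.write.precombine.field": "updated_at",  # latest version wins
}
changes.write.format("hudi").options(**hudi_opts).mode("append").save("/lake/hudi/trips")
```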
🕰️ Time Travel
Roll back to “before it went weird”.
Query historical snapshots by version/timestamp for audits, debugging, and reproducibility.
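
A quick sketch against a hypothetical Delta table path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the table as of an earlier version number...
v3 = spark.read.format("delta").option("versionAsOf", 3).load("/lake/silver/orders")

# ...or as of a wall-clock timestamp.
before = (spark.read.format("delta")
          .option("timestampAsOf", "2024-06-01 00:00:00")
          .load("/lake/silver/orders"))

# The same works in SQL.
spark.sql("SELECT * FROM delta.`/lake/silver/orders` VERSION AS OF 3").show()
```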
🧩 Schema Enforcement
If the form’s wrong, it gets bounced.
Validate incoming data against expected types/columns; reject or quarantine incompatible payloads.
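
A sketch of the reject-or-quarantine pattern with Delta, which raises an AnalysisException on a schema mismatch; the paths are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.getOrCreate()
incoming = spark.read.json("/lake/landing/orders_batch.json")  # hypothetical batch

try:
    # Delta refuses appends whose schema doesn't match the table's.
    incoming.write.format("delta").mode("append").save("/lake/silver/orders")
except AnalysisException as err:
    # Quarantine the incompatible batch for inspection rather than losing it.
    incoming.write.mode("append").parquet("/lake/quarantine/orders")
    print(f"Schema mismatch, batch quarantined: {err}")
```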
🔧 Schema Evolution
Add a column without breaking everyone else.
Controlled schema changes with compatibility checks; table metadata records version history.
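
Two common Delta routes, sketched against a hypothetical table: an explicit DDL change, or opt-in merging when appending a wider DataFrame:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Explicit, reviewable change via DDL...
spark.sql("ALTER TABLE delta.`/lake/silver/orders` ADD COLUMNS (coupon_code STRING)")

# ...or opt-in automatic evolution when the incoming data has extra columns.
wider = spark.read.parquet("/lake/staging/orders_with_coupons")  # hypothetical
(wider.write.format("delta")
 .option("mergeSchema", "true")
 .mode("append")
 .save("/lake/silver/orders"))
```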
🗄️ Parquet (Columnar)
Packs data tight; reads only what you ask for.
Columnar file format with predicate pushdown and encoding for fast analytics and compression.
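
A small PySpark read showing pushdown in action; the path and columns are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Only `region` and `amount` are decoded, and row-group statistics let the
# reader skip chunks where no row can satisfy amount > 100.
df = (spark.read.parquet("/lake/silver/orders")  # hypothetical path
      .select("region", "amount")
      .filter(F.col("amount") > 100))
df.explain()  # look for PushedFilters in the scan node
```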
🧭 Partition Pruning
Skip to the right chapter, not the whole book.
Engines scan only relevant partitions based on filters (date/region/customer), reducing I/O and cost.
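
A sketch: write partitioned by date, then filter so the engine reads one directory instead of all of them (names are illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
orders = spark.read.format("delta").load("/lake/silver/orders")  # hypothetical

# Lay the table out by date so filters can skip whole directories.
(orders.write.format("delta")
 .partitionBy("order_date")
 .mode("overwrite")
 .save("/lake/gold/orders_by_date"))

# This query touches only the 2024-06-01 partition, not the whole table.
day = (spark.read.format("delta").load("/lake/gold/orders_by_date")
       .filter(F.col("order_date") == "2024-06-01"))
day.explain()  # PartitionFilters in the plan confirm the pruning
```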
✍️ Append-Only Landing
Write new pages; never erase the old ones.
Immutable raw zone preserving source fidelity—ideal for audit, replay, and CDC reconciliation.
🚚 Batch ETL / ELT
The nightly “big shop”.
Scheduled bulk loads; ETL transforms pre-load, ELT transforms in-lake/in-warehouse post-load.
⚡ Streaming Ingestion
Data arrives continuously, not in one big lump.
Event pipelines (Kafka/Event Hubs/Kinesis) for low-latency processing and near-real-time analytics.
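
A Structured Streaming sketch from Kafka into a bronze Delta table, assuming the spark-sql-kafka connector is on the classpath; the broker, topic, and paths are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
          .option("subscribe", "orders")
          .load()
          .select(F.col("value").cast("string").alias("payload"),
                  F.col("timestamp")))

(stream.writeStream.format("delta")
 .option("checkpointLocation", "/lake/checkpoints/orders")  # required for fault-tolerant restarts
 .outputMode("append")
 .start("/lake/bronze/orders"))
```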
🔁 Change Data Capture (CDC)
Only the changes, thanks.
Ingest inserts/updates/deletes from sources to keep downstream tables in sync without full reloads.
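
A sketch applying a change feed with Delta's merge; the op flag ('I'/'U'/'D') and table names are hypothetical conventions:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Hypothetical change feed with an `op` column: 'I'nsert, 'U'pdate, or 'D'elete.
cdc = spark.read.parquet("/lake/staging/customers_cdc")
target = DeltaTable.forPath(spark, "/lake/silver/customers")

(target.alias("t")
 .merge(cdc.alias("c"), "t.customer_id = c.customer_id")
 .whenMatchedDelete(condition="c.op = 'D'")
 .whenMatchedUpdateAll(condition="c.op = 'U'")
 .whenNotMatchedInsertAll(condition="c.op = 'I'")
 .execute())
```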
⚙️ Apache Spark
The workhorse behind big data jobs.
Distributed compute for SQL, streaming, and ML; core engine for Delta/Iceberg/Hudi operations.
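
A tiny taste of the API: the same engine runs DataFrame code and SQL over the same plan (sample data is made up):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("spark-demo").getOrCreate()

df = spark.createDataFrame([("UK", 120.0), ("DE", 80.0), ("UK", 40.0)],
                           ["region", "amount"])
df.groupBy("region").agg(F.sum("amount").alias("revenue")).show()

# The same aggregation, expressed as SQL over a temp view.
df.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS revenue FROM sales GROUP BY region").show()
```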
💻 Databricks
Spark with the safety rails: collab, governance, and ops.
Managed lakehouse platform with notebooks, clusters, Unity Catalog, Delta Lake, and MLflow integration.
🛰️ Trino / Presto
SQL that talks to many systems without moving data.
MPP query engines federating SQL across object storage and other sources via connectors and cost-based optimisation (CBO).
☁️ Serverless SQL
Run queries without babysitting clusters.
On-demand autoscaled SQL compute with per-query billing against lakehouse tables/external data.
📈 Business Intelligence (BI)
Dashboards that answer “how are we doing?”
Visual analytics using governed models, aggregates, and semantic layers for decision-making.
🧠 Semantic Layer
Data that speaks business, not engineer.
Logical model (metrics/dimensions/rules) mapping physical tables to business terms (dbt/LookML/Power BI models).
🧮 DAX
Excel on protein shakes.
Power BI formula language for measures, time intelligence, and model calculations.
📦 dbt
SQL models with version control and tests.
Modular transformations, tests, and docs; integrates with lakehouse engines for ELT best practice.
📊 Power BI / Looker
Charts and dashboards the CFO will actually open.
BI tools for modelling, visualising, and sharing governed analytics on top of lakehouse datasets.
📝 Notebooks
Code, notes, and outputs in one place.
Interactive docs (Python/SQL/Scala) for exploration, data prep, and ML experimentation.
🍱 Feature Store
Pre-chopped ML ingredients you can reuse.
Curated, versioned features with offline/online stores for consistent training/inference.
📔 MLflow
Keeps receipts on your ML runs.
Tracks experiments, packages models, and manages deployment with a central model registry.
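
A minimal tracking sketch with made-up parameter and metric values:

```python
import mlflow

# Each run records parameters, metrics, and artifacts against the tracking server.
with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("rmse", 0.42)
    # Models can also be logged and promoted through the registry, e.g.:
    # mlflow.sklearn.log_model(model, "model", registered_model_name="churn")
```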
🔐 RBAC / ABAC
Who you are vs. attributes you have.
Role-based access grants permissions through roles/groups; attribute-based access evaluates context (dept/geo/data tags) for fine-grained control.
🧬 Lineage
Your data’s family tree.
Provenance showing sources, transformations, and dependencies across pipelines and models.
🕵️ PII Controls
Keep the sensitive bits covered.
Discovery, masking, tokenisation, and consent handling for personally identifiable information.
🧾 Auditing
Who looked, who changed, and when.
Immutable access/change logs for compliance (GDPR/ISO/PCI/HIPAA) and investigations.
🔗 Data Sharing (e.g., Delta Sharing)
Share data without emailing copies around.
Open protocols to grant read access across orgs/platforms directly against governed tables in object storage.
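
A consumer-side sketch with the delta-sharing Python client; the profile file and share coordinates are hypothetical:

```python
import delta_sharing

profile = "config.share"                      # credentials file issued by the provider
table_url = profile + "#retail.sales.orders"  # hypothetical <share>.<schema>.<table>

# Read a governed table across org boundaries without copying it around.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```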
🗓️ Airflow / Azure Data Factory
The schedulers keeping the lights on.
Pipeline orchestration for dependencies, retries, parameterisation, and event-driven runs.
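
A minimal Airflow (2.x) sketch with retries baked in; the DAG id, schedule, and task body are illustrative:

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def load_orders():
    print("extract + load step goes here")  # placeholder task body

with DAG(
    dag_id="nightly_orders",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",  # 02:00 daily (use schedule_interval on Airflow < 2.4)
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
    PythonOperator(task_id="load_orders", python_callable=load_orders)
```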
🚀 CI/CD
Ship small, ship often, don’t break things.
Automated build/test/deploy for SQL models, notebooks, and infra (IaC) with environment promotion.
👀 Observability
Find issues before users do.
Metrics, logs, traces, data quality checks, SLAs/SLOs across pipelines, queries, and spend.
💷 Cost Management
Know where the money’s going—and why.
Budgets, tags, auto-stop, workload isolation, right-sizing to control storage/compute/egress/concurrency costs.
🏗️ ADLS / S3 / GCS
Microsoft, Amazon, and Google’s big buckets.
Azure Data Lake Storage, Amazon Simple Storage Service, and Google Cloud Storage underpin the lake layer.
🛰️ Kafka / Event Hubs
Conveyor belts for events.
Distributed streaming backbones for pub/sub ingestion, configurable delivery guarantees (up to exactly-once), and scalable consumer groups.
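
A producer-side sketch with the confluent-kafka client; the broker and topic are hypothetical:

```python
from confluent_kafka import Producer

p = Producer({"bootstrap.servers": "broker:9092"})  # hypothetical broker

# Publish an event; consumers in a group divide the topic's partitions between them.
p.produce("orders", key="order-42", value=b'{"amount": 120.0}')
p.flush()  # block until delivery succeeds or fails
```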
⭐ Star Schema
Facts in the middle, lookups round the edge.
Dimensional model with central fact table and denormalised dimensions for fast BI queries.
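
A sketch of the classic star join in Spark SQL; all table and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Fact in the middle, dimensions joined round the edge.
spark.sql("""
    SELECT d.calendar_month,
           p.category,
           SUM(f.amount) AS revenue
    FROM   fact_sales f
    JOIN   dim_date    d ON f.date_key    = d.date_key
    JOIN   dim_product p ON f.product_key = p.product_key
    GROUP  BY d.calendar_month, p.category
""").show()
```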
❄️ Snowflake Schema
Star schema with extra tidy cupboards.
Further-normalised dimensions to reduce redundancy; trades some query simplicity for consistency and reuse.