📖 Lakehouse Glossary

Plain-English first; nuts-and-bolts second.

A
🧪 ACID Transactions
No half-saved changes—either done or not.
Atomicity, Consistency, Isolation, Durability for reliable reads/writes in distributed systems.
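To make the "either done or not" idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It runs on a single local database rather than a distributed lakehouse, and the `accounts` table and `transfer` helper are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit is undone too.
        return False


ok = transfer(conn, 1, 2, 50)       # within balance
failed = transfer(conn, 1, 2, 500)  # would overdraw account 1
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
```

The second transfer credits account 2 before the debit fails, yet the rollback leaves no half-saved change behind.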
🏗️ ADLS / S3 / GCS
Microsoft, Amazon, and Google’s big buckets.
Azure Data Lake Storage, Amazon Simple Storage Service, and Google Cloud Storage underpin the lake layer.
🗓️ Airflow / Azure Data Factory
The schedulers keeping the lights on.
Pipeline orchestration for dependencies, retries, parameterisation, and event-driven runs.
✍️ Append-Only Landing
Write new pages; never erase the old ones.
Immutable raw zone preserving source fidelity—ideal for audit, replay, and CDC reconciliation.
🧊 Apache Iceberg
Big-table plumbing for huge lakes.
Hidden partitioning, manifest lists, snapshot isolation; engine-agnostic (Spark/Trino/Flink).
🦒 Apache Hudi
Change-friendly tables without the drama.
Incremental processing with Copy-on-Write and Merge-on-Read table types, built for upserts and CDC pipelines.
⚙️ Apache Spark
The workhorse behind big data jobs.
Distributed compute for SQL, streaming, and ML.
🧾 Auditing
Who looked, who changed, and when.
Immutable access/change logs for compliance and investigations.
B
🚚 Batch ETL / ELT
The nightly “big shop”.
Scheduled bulk loads; ETL transforms pre-load, ELT post-load.
📈 Business Intelligence (BI)
Dashboards that answer “how are we doing?”
Visual analytics using governed models and semantic layers.
C
🔁 Change Data Capture (CDC)
Only the changes, thanks.
Ingest inserts, updates, and deletes efficiently.
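As a rough sketch of what "only the changes" means in practice, the snippet below replays a change feed onto a keyed table. The event shape (`op`, `key`, `row`) is an assumption made for this example and is not the format of any particular CDC tool.

```python
def apply_changes(table, changes):
    """Apply CDC events in order to a dict keyed by primary key."""
    result = dict(table)  # copy so the input table is left untouched
    for change in changes:
        op, key = change["op"], change["key"]
        if op in ("insert", "update"):
            result[key] = change["row"]
        elif op == "delete":
            result.pop(key, None)  # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown op: {op}")
    return result


customers = {1: {"name": "Ada"}, 2: {"name": "Bob"}}
feed = [
    {"op": "update", "key": 1, "row": {"name": "Ada L."}},
    {"op": "delete", "key": 2},
    {"op": "insert", "key": 3, "row": {"name": "Cy"}},
]
current = apply_changes(customers, feed)
```

Only three small events travel over the wire, yet the target ends up in step with the source.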
📚 Catalog / Unity Catalog
The index telling you what’s where and who can touch it.
Central metadata, lineage, and access governance.
🚀 CI/CD
Ship small, ship often, don’t break things.
Automated build, test, and deployment pipelines.
💷 Cost Management
Know where the money’s going—and why.
Budgets, tagging, auto-stop, and workload isolation.
D
🌊 Data Lake
One big storage bay for every kind of file.
Object storage for structured and unstructured data.
🏛️ Data Lakehouse
A warehouse built inside your lake.
Blends lake scalability with warehouse reliability.
🔗 Data Sharing (e.g., Delta Sharing)
Share data without emailing copies around.
Secure cross-org read access on governed tables.
💻 Databricks
Spark with the safety rails.
Managed lakehouse with notebooks, governance, and ML.
🔺 Delta Lake
Version control for your data.
Transaction log on Parquet enabling ACID and time travel.
📦 dbt
SQL models with tests and version control.
Modular ELT transformations and documentation.
🧮 DAX
Excel on protein shakes.
Power BI expression language.
F
🍱 Feature Store
Reusable ML ingredients.
Versioned features for training and inference.
K
🛰️ Kafka / Event Hubs
Conveyor belts for events.
Distributed streaming platforms.
L
🧬 Lineage
Your data’s family tree.
Provenance across sources and transformations.
M
📔 MLflow
Keeps receipts on your ML runs.
Experiment tracking and model registry.
🗂️ Medallion Architecture (Bronze/Silver/Gold)
Raw → cleaned → board-ready.
Layered refinement for analytics and ML.
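The three layers can be sketched in plain Python, assuming a toy orders feed. The column names and cleaning rules here are illustrative only; a real pipeline would typically run on Spark or SQL.

```python
# Bronze: raw rows exactly as they landed, strings and all.
bronze = [
    {"order_id": "1", "amount": "10.50", "country": "gb"},
    {"order_id": "2", "amount": "bad", "country": "GB"},
    {"order_id": "3", "amount": "4.50", "country": "fr"},
    {"order_id": "4", "amount": "2.25", "country": "gb"},
]


def to_silver(bronze_rows):
    """Silver: typed, standardised rows; unparseable rows are dropped."""
    silver = []
    for row in bronze_rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount is not numeric
        silver.append(
            {
                "order_id": int(row["order_id"]),
                "amount": amount,
                "country": row["country"].upper(),
            }
        )
    return silver


def to_gold(silver_rows):
    """Gold: business-level aggregate, here revenue per country."""
    totals = {}
    for row in silver_rows:
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount"]
    return totals


silver = to_silver(bronze)
gold = to_gold(silver)
```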
N
📝 Notebooks
Code and commentary together.
Interactive Python/SQL/Scala documents.
O
👀 Observability
Find issues before users do.
Metrics, logs, and data-quality signals.
📦 Object Storage
Buckets in the cloud.
Flat, key-addressed namespace of objects accessed via HTTP APIs rather than a file-system hierarchy.
P
🗄️ Parquet (Columnar)
Reads only what you ask for.
Columnar analytics file format.
🧭 Partition Pruning
Skip irrelevant data.
Query engines scan only needed partitions.
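A minimal sketch of the idea, assuming data partitioned by an ISO date key. The `partitions` layout and `scan` function are hypothetical; real engines make the same decision from partition metadata before opening any file.

```python
# Each key stands for one partition directory, e.g. date=2024-01-01/
partitions = {
    "2024-01-01": [{"order_id": 1, "amount": 10}],
    "2024-01-02": [{"order_id": 2, "amount": 25}],
    "2024-01-03": [{"order_id": 3, "amount": 40}],
}


def scan(partitions, date_from, date_to):
    """Read only partitions whose key falls within [date_from, date_to]."""
    scanned, rows = [], []
    for key in sorted(partitions):
        # ISO-formatted dates order correctly when compared as strings.
        if date_from <= key <= date_to:
            scanned.append(key)
            rows.extend(partitions[key])
    return scanned, rows


scanned, rows = scan(partitions, "2024-01-02", "2024-01-03")
```

The first partition is never read, which is where the saving comes from on large tables.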
🕵️ PII Controls
Keep sensitive data covered.
Masking, tokenisation, and consent enforcement.
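Masking and tokenisation can be sketched with the standard library alone. The key below is a placeholder, and keyed hashing is only one way to produce tokens (vault-based tokenisation is another), so treat this as an illustration rather than a production control.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-only-key"  # placeholder; a real key belongs in a secret store


def mask_email(email):
    """Keep the first character and the domain; hide the rest."""
    local, _, domain = email.partition("@")
    return local[:1] + "***@" + domain


def tokenise(value, key=SECRET_KEY):
    """Return a stable keyed token so joins still work without the raw value."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


masked = mask_email("ada@example.com")
token_a = tokenise("ada@example.com")
token_b = tokenise("ada@example.com")
```

Masking is for display; the token is for joining and counting without exposing the original value.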
📊 Power BI / Looker
Dashboards people actually use.
BI tools atop governed semantic models.
R
🔐 RBAC / ABAC
Who you are vs. what you have.
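Role-based access control grants permissions by assigned role; attribute-based access control evaluates attributes of the user, resource, and context.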
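The difference between the two models shows up clearly in a small sketch. The roles, attributes, and policy rules below are invented for illustration and do not reflect any specific platform's policy language.

```python
ROLE_PERMISSIONS = {
    "analyst": {"read"},
    "engineer": {"read", "write"},
}


def rbac_allows(user, action):
    """RBAC: the decision depends only on the user's role."""
    return action in ROLE_PERMISSIONS.get(user["role"], set())


def abac_allows(user, resource, action):
    """ABAC: the decision depends on attributes of user and resource.

    Hypothetical policy: reads only, same region, and PII needs clearance.
    """
    if action != "read":
        return False
    if user["region"] != resource["region"]:
        return False
    return (not resource["contains_pii"]) or user["pii_cleared"]


analyst = {"role": "analyst", "region": "EU", "pii_cleared": False}
sales_eu = {"region": "EU", "contains_pii": False}
customers_eu = {"region": "EU", "contains_pii": True}

can_write = rbac_allows(analyst, "write")
can_read_sales = abac_allows(analyst, sales_eu, "read")
can_read_customers = abac_allows(analyst, customers_eu, "read")
```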
S
🧩 Schema Enforcement
Bad data gets bounced.
Rejects incompatible schema writes.
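A minimal sketch of a write path that bounces bad rows, assuming a hypothetical two-column schema. Table formats apply the same kind of check at commit time, against the table's declared schema.

```python
EXPECTED_SCHEMA = {"id": int, "email": str}  # illustrative schema


def validate(row, schema=EXPECTED_SCHEMA):
    """Raise if the row's columns or types do not match the schema."""
    if set(row) != set(schema):
        raise ValueError(f"columns {sorted(row)} do not match {sorted(schema)}")
    for column, expected_type in schema.items():
        if not isinstance(row[column], expected_type):
            raise TypeError(f"{column} must be {expected_type.__name__}")
    return row


def append(table, row):
    """Append only rows that pass validation."""
    table.append(validate(row))


table = []
append(table, {"id": 1, "email": "ada@example.com"})

rejected = []
for bad_row in ({"id": "2", "email": "bob@example.com"}, {"id": 3}):
    try:
        append(table, bad_row)
    except (TypeError, ValueError) as exc:
        rejected.append(type(exc).__name__)
```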
🔧 Schema Evolution
Change structure safely.
Controlled schema updates with history.
🧠 Semantic Layer
Business meaning on top of data.
Metrics and dimensions mapped to tables.
☁️ Serverless SQL
Query without managing servers.
Autoscaled, per-query SQL compute.
❄️ Snowflake Schema
A tidier star schema.
Star schema whose dimensions are normalised into sub-dimension tables, cutting redundancy at the cost of extra joins.
Star Schema
Facts in the middle.
Central fact table joined to denormalised dimension tables; optimised for BI queries.
Streaming Ingestion
Data arrives continuously.
Low-latency event pipelines.
T
🧱 Table Format Layer
The rulebook for lake tables.
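Metadata and transaction layer (Delta Lake, Iceberg, Hudi) that turns files in object storage into tables with ACID commits, schema management, and time travel.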
🕰️ Time Travel
Query the past.
Snapshot access by version or timestamp.
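Lookup by version or by timestamp can be sketched with a toy in-memory table. The `VersionedTable` class is hypothetical and stores a full copy per commit for simplicity; real table formats reconstruct snapshots from a transaction log or manifests instead.

```python
from bisect import bisect_right


class VersionedTable:
    """Toy table that keeps every committed snapshot."""

    def __init__(self):
        self._snapshots = []  # list of (commit_ts, rows); index is the version

    def commit(self, commit_ts, rows):
        """Store a new snapshot and return its version number."""
        if self._snapshots and commit_ts < self._snapshots[-1][0]:
            raise ValueError("commit timestamps must not go backwards")
        self._snapshots.append((commit_ts, list(rows)))
        return len(self._snapshots) - 1

    def as_of_version(self, version):
        """Return the rows exactly as committed at `version`."""
        if not 0 <= version < len(self._snapshots):
            raise LookupError(f"no such version: {version}")
        return list(self._snapshots[version][1])

    def as_of_timestamp(self, ts):
        """Return the latest snapshot committed at or before `ts`."""
        timestamps = [commit_ts for commit_ts, _ in self._snapshots]
        index = bisect_right(timestamps, ts) - 1
        if index < 0:
            raise LookupError(f"no snapshot at or before {ts}")
        return list(self._snapshots[index][1])


orders = VersionedTable()
v0 = orders.commit(100, ["order-1"])
v1 = orders.commit(200, ["order-1", "order-2"])

first = orders.as_of_version(v0)
between = orders.as_of_timestamp(150)
latest = orders.as_of_timestamp(200)
```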
🛰️ Trino / Presto
SQL across many systems.
Federated MPP query engines.