Lakehouse Glossary | IainToolin

🧪 ACID Transactions

No half-saved changes—either done or not.

Atomicity, Consistency, Isolation, Durability for reliable reads/writes in distributed systems.
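
To make atomicity concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than a lakehouse engine. The accounts table and the transfer helper are hypothetical; the point is only that a failed second statement undoes the first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move amount between accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # This debit violates the CHECK constraint when funds are short.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

# Overdrawing fails, and the credit already applied to 'b' is rolled back.
ok = transfer(conn, "a", "b", 150)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the failed transfer, both balances are unchanged: the credit to 'b' never becomes visible on its own.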

🏗️ ADLS / S3 / GCS

Microsoft, Amazon, and Google’s big buckets.

Azure Data Lake Storage, Amazon Simple Storage Service, and Google Cloud Storage underpin the lake layer.

🗓️ Airflow / Azure Data Factory

The schedulers keeping the lights on.

Pipeline orchestration for dependencies, retries, parameterisation, and event-driven runs.
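
As a rough illustration of dependency ordering and retries, the sketch below runs a small task graph with Python's standard graphlib module. It is not the Airflow or Data Factory API; the task names and the run_pipeline helper are hypothetical.

```python
from graphlib import TopologicalSorter

def run_pipeline(tasks, deps, max_retries=2):
    """Run callables in dependency order, retrying each failed task.

    tasks: name -> zero-argument callable
    deps:  name -> set of task names that must finish first
    Returns the task names in the order they completed.
    """
    completed = []
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                break
            except Exception:
                if attempt == max_retries:
                    raise  # retries exhausted; fail the whole run
        completed.append(name)
    return completed

# Hypothetical three-step pipeline; "transform" fails once, then succeeds.
calls = {"transform": 0}

def flaky_transform():
    calls["transform"] += 1
    if calls["transform"] == 1:
        raise RuntimeError("transient failure")

order = run_pipeline(
    tasks={
        "extract": lambda: None,
        "transform": flaky_transform,
        "load": lambda: None,
    },
    deps={"extract": set(), "transform": {"extract"}, "load": {"transform"}},
)
```

Real orchestrators add scheduling, parameters, and event triggers on top of this ordering-plus-retry core.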

✍️ Append-Only Landing

Write new pages; never erase the old ones.

Immutable raw zone preserving source fidelity—ideal for audit, replay, and CDC reconciliation.
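
One way to picture an append-only landing zone is a folder where each batch becomes a new file and existing files are never reopened for writing. The sketch below assumes a local temporary directory standing in for a raw-zone prefix; the land and replay helpers and the file naming are hypothetical.

```python
import json
import tempfile
from pathlib import Path

landing = Path(tempfile.mkdtemp())  # stand-in for a raw-zone prefix

def land(batch_id, records):
    """Write one batch as a new file; mode 'x' refuses to overwrite."""
    path = landing / f"batch_{batch_id:06d}.jsonl"
    with open(path, "x", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return path

def replay():
    """Re-read every landed record in batch order."""
    out = []
    for path in sorted(landing.glob("batch_*.jsonl")):
        with open(path, encoding="utf-8") as f:
            out.extend(json.loads(line) for line in f)
    return out

land(1, [{"id": 1, "status": "new"}])
land(2, [{"id": 1, "status": "paid"}])  # a correction arrives as a new batch
history = replay()
```

Because nothing is overwritten, the full history stays available for audit or for rebuilding downstream tables.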

🧊 Apache Iceberg

Big-table plumbing for huge lakes.

Hidden partitioning, manifest lists, snapshot isolation; engine-agnostic (Spark/Trino/Flink).

🦒 Apache Hudi

Change-friendly tables without the drama.

Upserts and incremental processing via Copy-On-Write and Merge-On-Read table types; well suited to CDC pipelines.

⚙️ Apache Spark

The workhorse behind big data jobs.

Distributed compute for SQL, streaming, and ML.

🧾 Auditing

Who looked, who changed, and when.

Immutable access/change logs for compliance and investigations.

🚚 Batch ETL / ELT

The nightly “big shop”.

Scheduled bulk loads; ETL transforms pre-load, ELT post-load.

📈 Business Intelligence (BI)

Dashboards that answer “how are we doing?”

Visual analytics using governed models and semantic layers.

🔁 Change Data Capture (CDC)

Only the changes, thanks.

Ingest inserts, updates, and deletes efficiently.
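
Applying a change feed can be sketched as replaying events against a keyed target. The event shape used here (op, key, row) is an assumption for illustration, not the format of any particular CDC tool.

```python
def apply_changes(target, events):
    """Apply insert/update/delete events, in order, to a dict keyed by primary key."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            target[key] = event["row"]
        elif op == "delete":
            target.pop(key, None)  # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return target

# Hypothetical change feed for a customers table.
events = [
    {"op": "insert", "key": 1, "row": {"name": "Ada"}},
    {"op": "insert", "key": 2, "row": {"name": "Bob"}},
    {"op": "update", "key": 1, "row": {"name": "Ada L."}},
    {"op": "delete", "key": 2},
]
customers = apply_changes({}, events)
```

Only four small events are processed, rather than reloading the whole source table.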

📚 Catalog / Unity Catalog

The index telling you what’s where and who can touch it.

Central metadata, lineage, and access governance.

🚀 CI/CD

Ship small, ship often, don’t break things.

Automated build, test, and deployment pipelines.

💷 Cost Management

Know where the money’s going—and why.

Budgets, tagging, auto-stop, and workload isolation.

🌊 Data Lake

One big storage bay for every kind of file.

Object storage for structured and unstructured data.

🏛️ Data Lakehouse

A warehouse built inside your lake.

Blends lake scalability with warehouse reliability.

🔗 Data Sharing (e.g., Delta Sharing)

Share data without emailing copies around.

Secure cross-org read access on governed tables.

💻 Databricks

Spark with the safety rails.

Managed lakehouse with notebooks, governance, and ML.

🔺 Delta Lake

Version control for your data.

Transaction log on Parquet enabling ACID and time travel.
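
The idea of a transaction log over data files can be sketched as a list of commits that add or remove files, with a snapshot rebuilt by replaying commits up to a chosen version. This is a toy model loosely inspired by that idea, not Delta Lake's actual log format; the file names are placeholders.

```python
def snapshot(log, version):
    """Return the set of live data files as of `version` (0-based commit index)."""
    if not 0 <= version < len(log):
        raise ValueError(f"unknown version {version}")
    files = set()
    for commit in log[: version + 1]:
        files -= set(commit.get("remove", []))
        files |= set(commit.get("add", []))
    return files

# Hypothetical commit history.
log = [
    {"add": ["part-0.parquet"]},
    {"add": ["part-1.parquet"]},
    {"remove": ["part-0.parquet"], "add": ["part-2.parquet"]},  # rewrite of part-0
]

latest = snapshot(log, 2)
as_of_v1 = snapshot(log, 1)  # "time travel" to the earlier version
```

Readers only ever see the files of a whole commit, which is what makes each version a consistent snapshot.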

📦 dbt

SQL models with tests and version control.

Modular ELT transformations and documentation.

🧮 DAX

Excel on protein shakes.

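Data Analysis Expressions: the formula language for measures and calculated columns in Power BI, Analysis Services, and Power Pivot.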

🍱 Feature Store

Reusable ML ingredients.

Versioned features for training and inference.

🛰️ Kafka / Event Hubs

Conveyor belts for events.

Distributed, partitioned event logs for publish/subscribe streaming.

🧬 Lineage

Your data’s family tree.

Provenance across sources and transformations.

📔 MLflow

Keeps receipts on your ML runs.

Experiment tracking and model registry.

🗂️ Medallion Architecture (Bronze/Silver/Gold)

Raw → cleaned → board-ready.

Layered refinement for analytics and ML.
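
The three layers can be pictured as successive functions over the same data. The sketch below assumes a tiny in-memory sales feed; the column names and cleaning rules are illustrative only.

```python
from collections import defaultdict

def to_silver(bronze):
    """Clean raw rows: parse amounts, drop unparseable rows, de-duplicate by id."""
    seen, silver = set(), []
    for row in bronze:
        try:
            amount = float(row["amount"])
        except (KeyError, TypeError, ValueError):
            continue  # bad rows stay in bronze but do not reach silver
        if row["id"] in seen:
            continue
        seen.add(row["id"])
        silver.append(
            {"id": row["id"], "region": row["region"].strip().upper(), "amount": amount}
        )
    return silver

def to_gold(silver):
    """Aggregate cleaned rows into a reporting table: total amount per region."""
    totals = defaultdict(float)
    for row in silver:
        totals[row["region"]] += row["amount"]
    return dict(totals)

# Bronze keeps the source as received, including duplicates and bad values.
bronze = [
    {"id": 1, "region": " uk ", "amount": "10.5"},
    {"id": 1, "region": " uk ", "amount": "10.5"},  # duplicate delivery
    {"id": 2, "region": "us", "amount": "n/a"},     # unparseable amount
    {"id": 3, "region": "UK", "amount": "4.5"},
]
silver = to_silver(bronze)
gold = to_gold(silver)
```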

📝 Notebooks

Code and commentary together.

Interactive Python/SQL/Scala documents.

👀 Observability

Find issues before users do.

Metrics, logs, and data-quality signals.

📦 Object Storage

Buckets in the cloud.

Flat-namespace storage of objects (data plus metadata), addressed by key and accessed via HTTP APIs.

🗄️ Parquet (Columnar)

Reads only what you ask for.

Columnar file format for analytics; engines read only the columns a query needs.
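
Why a columnar layout helps can be shown without any Parquet library: store one list per column, and an aggregate over a single column never touches the others. This is a conceptual sketch only; real Parquet files also use row groups, encodings, compression, and statistics, none of which are modelled here.

```python
rows = [
    {"id": 1, "country": "UK", "amount": 10.0},
    {"id": 2, "country": "US", "amount": 7.5},
    {"id": 3, "country": "UK", "amount": 2.5},
]

def to_columns(rows):
    """Pivot row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

# The aggregate reads the "amount" list only; "id" and "country" are untouched.
total = sum(columns["amount"])
```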

🧭 Partition Pruning

Skip irrelevant data.

Query engines scan only needed partitions.
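
A minimal sketch of pruning, assuming a table partitioned by an ISO date string: the filter is checked against the partition key first, so non-matching partitions are never opened. The partitions dict and the query helper are hypothetical.

```python
partitions = {
    "2024-01-01": [{"order": 1, "amount": 5}],
    "2024-01-02": [{"order": 2, "amount": 7}, {"order": 3, "amount": 1}],
    "2024-01-03": [{"order": 4, "amount": 9}],
}

def query(partitions, start, end):
    """Return rows with start <= date <= end, plus the partitions actually read."""
    scanned, rows = [], []
    for date in sorted(partitions):
        if start <= date <= end:  # decided from the partition key alone
            scanned.append(date)
            rows.extend(partitions[date])
    return rows, scanned

rows, scanned = query(partitions, "2024-01-02", "2024-01-03")
```

Here the 2024-01-01 partition is skipped entirely; on a large table that skipped data is where the time and cost savings come from.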

🕵️ PII Controls

Keep sensitive data covered.

Masking, tokenisation, and consent enforcement.
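
Masking and tokenisation can be sketched with the standard library. The example assumes a keyed hash (HMAC-SHA256) as the tokenisation method, which is one common approach; vault-based reversible tokenisation and consent checks are not shown. The key and helper names are hypothetical.

```python
import hashlib
import hmac

# Assumption: a real key would come from a secret store, not source code.
SECRET_KEY = b"demo-key-not-for-production"

def mask_email(email):
    """Keep the first character and the domain; star out the rest of the local part."""
    local, _, domain = email.partition("@")
    return local[:1] + "*" * (len(local) - 1) + "@" + domain

def tokenise(value):
    """Deterministic keyed token: equal inputs give equal tokens, so joins still work."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

masked = mask_email("alice@example.com")
token = tokenise("alice@example.com")
```

Masking suits display to people; tokens suit joins and counts where the raw value is not needed.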

📊 Power BI / Looker

Dashboards people actually use.

BI tools atop governed semantic models.

🔐 RBAC / ABAC

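The role you hold vs. the attributes that apply.

Role-based access control grants permissions through assigned roles; attribute-based access control evaluates attributes of the user, the resource, and the context.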
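
The difference between the two models can be sketched as two small checks. The roles, attributes, and policy rule below are invented for illustration and do not reflect any specific product's policy language.

```python
ROLE_PERMISSIONS = {
    "analyst": {"read"},
    "engineer": {"read", "write"},
}

def rbac_allows(user, action):
    """RBAC: allow if any of the user's roles grants the action."""
    return any(action in ROLE_PERMISSIONS.get(role, set()) for role in user["roles"])

def abac_allows(user, resource, action):
    """ABAC: allow reads only when user and resource attributes satisfy the policy."""
    if action != "read":
        return False
    same_region = user["region"] == resource["region"]
    cleared = user["clearance"] >= resource["sensitivity"]
    return same_region and cleared

user = {"roles": ["analyst"], "region": "EU", "clearance": 2}
table = {"region": "EU", "sensitivity": 3}

can_read_by_role = rbac_allows(user, "read")
can_read_by_attributes = abac_allows(user, table, "read")
```

The analyst role alone would grant the read, while the attribute policy refuses it because the table is more sensitive than the user's clearance.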

🧩 Schema Enforcement

Bad data gets bounced.

Rejects writes whose schema is incompatible with the table's.
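
Enforcement on write can be sketched as a validation step that runs before anything is appended. The schema, the SchemaMismatch exception, and the write helper are hypothetical; table formats apply their own, richer rules.

```python
SCHEMA = {"id": int, "name": str, "amount": float}

class SchemaMismatch(ValueError):
    """Raised when a record does not match the table schema."""

def validate(record, schema=SCHEMA):
    """Return the record unchanged, or raise if columns or types differ."""
    if set(record) != set(schema):
        raise SchemaMismatch(f"columns {sorted(record)} != {sorted(schema)}")
    for column, expected in schema.items():
        if not isinstance(record[column], expected):
            raise SchemaMismatch(f"{column!r} must be {expected.__name__}")
    return record

def write(table, record):
    """Append only records that pass validation."""
    table.append(validate(record))

table = []
write(table, {"id": 1, "name": "widget", "amount": 9.99})

try:
    write(table, {"id": "2", "name": "gadget", "amount": 5.0})  # id has the wrong type
    rejected = False
except SchemaMismatch:
    rejected = True
```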

🔧 Schema Evolution

Change structure safely.

Controlled schema changes (for example, added columns) tracked in table history.

🧠 Semantic Layer

Business meaning on top of data.

Metrics and dimensions mapped to tables.

☁️ Serverless SQL

Query without managing servers.

Autoscaled, per-query SQL compute.

❄️ Snowflake Schema

A tidier star schema.

Star schema variant whose dimensions are normalised into sub-tables, cutting redundancy at the cost of extra joins.

⭐ Star Schema

Facts in the middle.

Dimensional model with a central fact table joined to denormalised dimension tables, optimised for BI queries.
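
A small star schema can be tried out with Python's built-in sqlite3 module. The table names, columns, and rows below are invented for illustration: one fact table in the middle, two dimensions around it, and a typical BI aggregate that joins them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales  (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
INSERT INTO dim_date    VALUES (10, 2023), (11, 2024);
INSERT INTO fact_sales  VALUES (1, 10, 5.0), (1, 11, 7.0), (2, 11, 3.0), (2, 11, 4.0);
""")

# Sales by category and year: one join per dimension, then group.
result = conn.execute("""
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d    ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
""").fetchall()
```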

⚡ Streaming Ingestion

Data arrives continuously.

Low-latency event pipelines.

🧱 Table Format Layer

The rulebook for lake tables.

Open table formats (Delta Lake, Iceberg, Hudi) that add transactions, schema management, and snapshots on top of data files.

🕰️ Time Travel

Query the past.

Snapshot access by version or timestamp.

🛰️ Trino / Presto

SQL across many systems.

Federated MPP query engines.
