📖 Lakehouse Glossary (Click to expand)

Plain-English first; nuts-and-bolts second. Click a card to open; click again to close.

🏛️ Data Lakehouse
A warehouse built inside your lake: store everything, analyse sensibly.
Architecture that blends lake scalability with warehouse features (ACID, indexing, governance) to serve BI & ML from one platform.
🌊 Data Lake
One big storage bay for every kind of file—tidy comes later.
Object storage (ADLS/S3/GCS) for structured, semi-structured, and unstructured data at massive scale.
📦 Object Storage
Buckets in the cloud with labels.
Flat storage of objects (data + metadata + ID) over HTTP APIs; durable and highly scalable.
🧪 ACID Transactions
No half-saved changes—either done or not.
Atomicity, Consistency, Isolation, Durability for reliable reads/writes in distributed systems.
🗂️ Medallion Architecture (Bronze/Silver/Gold)
Bronze = raw, Silver = cleaned, Gold = board-ready.
Layered refinement: append-only raw → conformed/quality-checked → curated marts (dimensional/semantic) for analytics.
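
A minimal PySpark sketch of the three hops, assuming a Delta-enabled Spark session; the paths and column names (order_id, amount, region) are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-demo").getOrCreate()

# Bronze: land raw events append-only, preserving source fidelity.
raw = spark.read.json("/lake/landing/orders/")  # hypothetical landing path
raw.write.format("delta").mode("append").save("/lake/bronze/orders")

# Silver: deduplicate, cast types, and drop rows that fail basic checks.
bronze = spark.read.format("delta").load("/lake/bronze/orders")
silver = (bronze
          .dropDuplicates(["order_id"])
          .withColumn("amount", F.col("amount").cast("decimal(10,2)"))
          .filter(F.col("order_id").isNotNull()))
silver.write.format("delta").mode("overwrite").save("/lake/silver/orders")

# Gold: curated aggregate ready for BI.
gold = silver.groupBy("region").agg(F.sum("amount").alias("revenue"))
gold.write.format("delta").mode("overwrite").save("/lake/gold/revenue_by_region")
```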
📚 Catalog / Unity Catalog
The index telling you what’s where and who can touch it.
Central metadata & governance (e.g., Databricks Unity Catalog) for tables, files, models, lineage, and permissions.
🧱 Table Format Layer
The rulebook that makes a lake act like a warehouse.
Open formats (Delta Lake, Apache Iceberg, Apache Hudi) add ACID, snapshots, schema control, and time travel on object storage.
🔺 Delta Lake
Version control and reliable updates for your lake.
Transaction log on Parquet enabling ACID, schema enforcement/evolution, and efficient upserts/deletes.
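
A sketch of a Delta upsert via the delta-spark Python API; the table path and customer_id key are hypothetical:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()
updates = spark.read.parquet("/lake/staging/customer_updates")  # hypothetical batch

target = DeltaTable.forPath(spark, "/lake/silver/customers")

# One ACID commit: update matching rows, insert the rest.
(target.alias("t")
 .merge(updates.alias("u"), "t.customer_id = u.customer_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```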
🧊 Apache Iceberg
Big-table plumbing for huge lakes.
Hidden partitioning, manifest lists, snapshot isolation; engine-agnostic (Spark/Trino/Flink) with fast scans and metadata ops.
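
A sketch using Spark SQL from Python, assuming the session was launched with the Iceberg runtime and a catalog named demo; table and column names are made up:

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg jar is on the classpath and a catalog named `demo`
# is configured (spark.sql.catalog.demo = ...SparkCatalog).
spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE demo.sales.events (
        id BIGINT,
        ts TIMESTAMP,
        payload STRING)
    USING iceberg
    PARTITIONED BY (days(ts))   -- hidden partitioning: queries just filter on ts
""")

# Snapshot metadata is itself queryable as a table.
spark.sql("SELECT snapshot_id, committed_at FROM demo.sales.events.snapshots").show()
```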
🦒 Apache Hudi
Change-friendly tables without the drama.
Incremental processing with Copy-On-Write / Merge-On-Read table types, record-level upserts, and CDC pipelines.
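
A hedged sketch of a Merge-On-Read upsert with the Hudi Spark datasource; the paths, trip_id record key, and updated_at precombine field are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes the Hudi bundle jar is on the classpath
changes = spark.read.parquet("/lake/staging/trip_changes")  # hypothetical change batch

hudi_opts = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "trip_id",      # dedupe key
    "hoodie.datasource.write.precombine.field": "updated_at",  # latest version wins
}
changes.write.format("hudi").options(**hudi_opts).mode("append").save("/lake/hudi/trips")
```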
🕰️ Time Travel
Roll back to “before it went weird”.
Query historical snapshots by version/timestamp for audits, debugging, and reproducibility.
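
A quick sketch against a hypothetical Delta table path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the table as of an earlier version number...
v3 = spark.read.format("delta").option("versionAsOf", 3).load("/lake/silver/orders")

# ...or as of a wall-clock timestamp.
before = (spark.read.format("delta")
          .option("timestampAsOf", "2024-06-01 00:00:00")
          .load("/lake/silver/orders"))

# The same works in SQL.
spark.sql("SELECT * FROM delta.`/lake/silver/orders` VERSION AS OF 3").show()
```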
🧩 Schema Enforcement
If the form’s wrong, it gets bounced.
Validate incoming data against expected types/columns; reject or quarantine incompatible payloads.
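
A sketch of the reject-or-quarantine pattern with Delta, which raises an AnalysisException on a schema mismatch; the paths are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.getOrCreate()
incoming = spark.read.json("/lake/landing/orders_batch.json")  # hypothetical batch

try:
    # Delta refuses appends whose schema doesn't match the table's.
    incoming.write.format("delta").mode("append").save("/lake/silver/orders")
except AnalysisException as err:
    # Quarantine the incompatible batch for inspection rather than losing it.
    incoming.write.mode("append").parquet("/lake/quarantine/orders")
    print(f"Schema mismatch, batch quarantined: {err}")
```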
🔧 Schema Evolution
Add a column without breaking everyone else.
Controlled schema changes with compatibility checks; table metadata records version history.
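
Two common Delta routes, sketched against a hypothetical table: an explicit DDL change, or opt-in merging when appending a wider DataFrame:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Explicit, reviewable change via DDL...
spark.sql("ALTER TABLE delta.`/lake/silver/orders` ADD COLUMNS (coupon_code STRING)")

# ...or opt-in automatic evolution when the incoming data has extra columns.
wider = spark.read.parquet("/lake/staging/orders_with_coupons")  # hypothetical
(wider.write.format("delta")
 .option("mergeSchema", "true")
 .mode("append")
 .save("/lake/silver/orders"))
```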
🗄️ Parquet (Columnar)
Packs data tight; reads only what you ask for.
Columnar file format with predicate pushdown and encoding for fast analytics and compression.
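
A small PySpark read showing pushdown in action; the path and columns are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Only `region` and `amount` are decoded, and row-group statistics let the
# reader skip chunks where no row can satisfy amount > 100.
df = (spark.read.parquet("/lake/silver/orders")  # hypothetical path
      .select("region", "amount")
      .filter(F.col("amount") > 100))
df.explain()  # look for PushedFilters in the scan node
```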
🧭 Partition Pruning
Skip to the right chapter, not the whole book.
Engines scan only relevant partitions based on filters (date/region/customer), reducing I/O and cost.
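
A sketch: write partitioned by date, then filter so the engine reads one directory instead of all of them (names are illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
orders = spark.read.format("delta").load("/lake/silver/orders")  # hypothetical

# Lay the table out by date so filters can skip whole directories.
(orders.write.format("delta")
 .partitionBy("order_date")
 .mode("overwrite")
 .save("/lake/gold/orders_by_date"))

# This query touches only the 2024-06-01 partition, not the whole table.
day = (spark.read.format("delta").load("/lake/gold/orders_by_date")
       .filter(F.col("order_date") == "2024-06-01"))
day.explain()  # PartitionFilters in the plan confirm the pruning
```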
✍️ Append-Only Landing
Write new pages; never erase the old ones.
Immutable raw zone preserving source fidelity—ideal for audit, replay, and CDC reconciliation.
🚚 Batch ETL / ELT
The nightly “big shop”.
Scheduled bulk loads; ETL transforms pre-load, ELT transforms in-lake/in-warehouse post-load.
⚡ Streaming Ingestion
Data arrives continuously, not in one big lump.
Event pipelines (Kafka/Event Hubs/Kinesis) for low-latency processing and near-real-time analytics.
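
A Structured Streaming sketch from Kafka into a bronze Delta table, assuming the spark-sql-kafka connector is on the classpath; the broker, topic, and paths are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
          .option("subscribe", "orders")
          .load()
          .select(F.col("value").cast("string").alias("payload"),
                  F.col("timestamp")))

(stream.writeStream.format("delta")
 .option("checkpointLocation", "/lake/checkpoints/orders")  # required for fault-tolerant restarts
 .outputMode("append")
 .start("/lake/bronze/orders"))
```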
🔁 Change Data Capture (CDC)
Only the changes, thanks.
Ingest inserts/updates/deletes from sources to keep downstream tables in sync without full reloads.
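
A sketch applying a change feed with Delta's merge; the op flag ('I'/'U'/'D') and table names are hypothetical conventions:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Hypothetical change feed with an `op` column: 'I'nsert, 'U'pdate, or 'D'elete.
cdc = spark.read.parquet("/lake/staging/customers_cdc")
target = DeltaTable.forPath(spark, "/lake/silver/customers")

(target.alias("t")
 .merge(cdc.alias("c"), "t.customer_id = c.customer_id")
 .whenMatchedDelete(condition="c.op = 'D'")
 .whenMatchedUpdateAll(condition="c.op = 'U'")
 .whenNotMatchedInsertAll(condition="c.op = 'I'")
 .execute())
```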
⚙️ Apache Spark
The workhorse behind big data jobs.
Distributed compute for SQL, streaming, and ML; core engine for Delta/Iceberg/Hudi operations.
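
A tiny taste of the API: the same engine runs DataFrame code and SQL over the same plan (sample data is made up):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("spark-demo").getOrCreate()

df = spark.createDataFrame([("UK", 120.0), ("DE", 80.0), ("UK", 40.0)],
                           ["region", "amount"])
df.groupBy("region").agg(F.sum("amount").alias("revenue")).show()

# The same aggregation, expressed as SQL over a temp view.
df.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS revenue FROM sales GROUP BY region").show()
```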
💻 Databricks
Spark with the safety rails: collab, governance, and ops.
Managed lakehouse platform with notebooks, clusters, Unity Catalog, Delta Lake, and MLflow integration.
🛰️ Trino / Presto
SQL that talks to many systems without moving data.
MPP query engines federating SQL across object storage and other sources via connectors and cost-based optimisation (CBO).
☁️ Serverless SQL
Run queries without babysitting clusters.
On-demand autoscaled SQL compute with per-query billing against lakehouse tables/external data.
📈 Business Intelligence (BI)
Dashboards that answer “how are we doing?”
Visual analytics using governed models, aggregates, and semantic layers for decision-making.
🧠 Semantic Layer
Data that speaks business, not engineer.
Logical model (metrics/dimensions/rules) mapping physical tables to business terms (dbt/LookML/Power BI models).
🧮 DAX
Excel on protein shakes.
Power BI formula language for measures, time intelligence, and model calculations.
📦 dbt
SQL models with version control and tests.
Modular transformations, tests, and docs; integrates with lakehouse engines for ELT best practice.
📊 Power BI / Looker
Charts and dashboards the CFO will actually open.
BI tools for modelling, visualising, and sharing governed analytics on top of lakehouse datasets.
📝 Notebooks
Code, notes, and outputs in one place.
Interactive docs (Python/SQL/Scala) for exploration, data prep, and ML experimentation.
🍱 Feature Store
Pre-chopped ML ingredients you can reuse.
Curated, versioned features with offline/online stores for consistent training/inference.
📔 MLflow
Keeps receipts on your ML runs.
Tracks experiments, packages models, and manages deployment with a central model registry.
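
A minimal tracking sketch with made-up parameter and metric values:

```python
import mlflow

# Each run records parameters, metrics, and artifacts against the tracking server.
with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("rmse", 0.42)
    # Models can also be logged and promoted through the registry, e.g.:
    # mlflow.sklearn.log_model(model, "model", registered_model_name="churn")
```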
🔐 RBAC / ABAC
Who you are vs. attributes you have.
Role-based access grants permissions through roles/groups; attribute-based access evaluates context (dept/geo/data tags) for fine-grained control.
🧬 Lineage
Your data’s family tree.
Provenance showing sources, transformations, and dependencies across pipelines and models.
🕵️ PII Controls
Keep the sensitive bits covered.
Discovery, masking, tokenisation, and consent handling for personally identifiable information.
🧾 Auditing
Who looked, who changed, and when.
Immutable access/change logs for compliance (GDPR/ISO/PCI/HIPAA) and investigations.
🔗 Data Sharing (e.g., Delta Sharing)
Share data without emailing copies around.
Open protocols to grant read access across orgs/platforms directly against governed tables in object storage.
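
A consumer-side sketch with the delta-sharing Python client; the profile file and share coordinates are hypothetical:

```python
import delta_sharing

profile = "config.share"                      # credentials file issued by the provider
table_url = profile + "#retail.sales.orders"  # hypothetical <share>.<schema>.<table>

# Read a governed table across org boundaries without copying it around.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```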
🗓️ Airflow / Azure Data Factory
The schedulers keeping the lights on.
Pipeline orchestration for dependencies, retries, parameterisation, and event-driven runs.
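
A minimal Airflow (2.x) sketch with retries baked in; the DAG id, schedule, and task body are illustrative:

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def load_orders():
    print("extract + load step goes here")  # placeholder task body

with DAG(
    dag_id="nightly_orders",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",  # 02:00 daily (use schedule_interval on Airflow < 2.4)
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
    PythonOperator(task_id="load_orders", python_callable=load_orders)
```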
🚀 CI/CD
Ship small, ship often, don’t break things.
Automated build/test/deploy for SQL models, notebooks, and infra (IaC) with environment promotion.
👀 Observability
Find issues before users do.
Metrics, logs, traces, data quality checks, SLAs/SLOs across pipelines, queries, and spend.
💷 Cost Management
Know where the money’s going—and why.
Budgets, tags, auto-stop, workload isolation, right-sizing to control storage/compute/egress/concurrency costs.
🏗️ ADLS / S3 / GCS
Microsoft, Amazon, and Google’s big buckets.
Azure Data Lake Storage, Amazon Simple Storage Service, and Google Cloud Storage underpin the lake layer.
🛰️ Kafka / Event Hubs
Conveyor belts for events.
Distributed streaming backbones for pub/sub ingestion, configurable delivery guarantees (up to exactly-once), and scalable consumer groups.
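
A producer-side sketch with the confluent-kafka client; the broker and topic are hypothetical:

```python
from confluent_kafka import Producer

p = Producer({"bootstrap.servers": "broker:9092"})  # hypothetical broker

# Publish an event; consumers in a group divide the topic's partitions between them.
p.produce("orders", key="order-42", value=b'{"amount": 120.0}')
p.flush()  # block until delivery succeeds or fails
```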
⭐ Star Schema
Facts in the middle, lookups round the edge.
Dimensional model with central fact table and denormalised dimensions for fast BI queries.
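
A sketch of the classic star join in Spark SQL; all table and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Fact in the middle, dimensions joined round the edge.
spark.sql("""
    SELECT d.calendar_month,
           p.category,
           SUM(f.amount) AS revenue
    FROM   fact_sales f
    JOIN   dim_date    d ON f.date_key    = d.date_key
    JOIN   dim_product p ON f.product_key = p.product_key
    GROUP  BY d.calendar_month, p.category
""").show()
```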
❄️ Snowflake Schema
Star schema with extra tidy cupboards.
Further-normalised dimensions to reduce redundancy; trades some query simplicity for consistency and reuse.