Glossary v1.4

Glossary

Term | 🔍 Non-Technical Description | 🛠️ Technical Description | 🗃️ Seen In / Tools
🔐 ACID | A reliable bank transfer: all or nothing. | Atomicity, Consistency, Isolation, Durability. | PostgreSQL, Oracle
🧪 AI/ML Scenario Blueprint | Design your model’s storyline. | Scenario framing with goals, data, stakeholders. | Whiteboard decks, ML canvas
🌀 AI/ML Cloud Blueprint | End-to-end cloud ML pattern. | Includes ingestion, training, inference, monitoring. | Databricks, SageMaker
🤖 Applied AI/ML | Useful models, not just hype. | Deployed, monitored models solving real tasks. | MLflow, dashboards
🧘 BASE | Looser than ACID, good enough for now. | Basically Available, Soft state, Eventual consistency. | Cassandra, Couchbase
🏢 BSS | Billing, CRM, and commercial bits of telecoms. | Handles billing, orders, customer relationships. | Salesforce, Amdocs
⚖️ CAP Theorem | You can’t have all three: C, A, P. | Only two of Consistency, Availability, Partition Tolerance possible. | DynamoDB, HBase
🔄 CRUD Mapping | Who can do what to which data? | Create, Read, Update, Delete lifecycle mapped to business ops. | APIs, Integration specs
📘 Data-Centric Thinking | Design around trustworthy data. | Data-first design with governance, semantics, CRUD. | Star schemas, canonical models
🧳 Data Passport | A metadata ID card for your dataset. | Classification, ownership, retention, lineage. | Unity Catalog, Collibra
🔍 Data Quality Gates | Stop dirty data before it hurts. | Validation rules before model input or output (see the sketch below this table). | Great Expectations, Airflow
🧭 Data Blueprint | Data’s playbook: what, why, and how. | Entity models, flows, CRUD, governance overlays. | DFDs, semantic maps
📐 Feature Engineering Blueprint | Prepping inputs for model cooking. | Transforms, feature store logic, encoding patterns. | MLflow, Delta Lake
👥 Federated Governance | Central rules, local control. | Global policies with domain-led stewardship. | Data Mesh, Unity Catalog
🏥 Healthcare Scenario | AI/ML for triage, diagnostics, operations. | Domain-specific models with clinical framing. | NHS dashboards
🔁 Iterative Delivery | Work in slices. Rinse and repeat. | Sprint-based delivery in architecture and data. | Scrum, Kanban
🔍 Lineage | Where data came from, who touched it. | End-to-end trace of transformations and movement. | Unity Catalog, Informatica
🗂️ Master Data | Core entities reused everywhere. | High-value shared business objects. | SAP MDG, Oracle DRM
📊 Model Monitoring | Keep your model honest and fresh. | Track drift, bias, performance. Trigger retrain. | MLflow, Prometheus
🧱 Databricks | Lakehouse Swiss Army knife. | Platform for streaming, batch, ML, SQL. | Unity Catalog, AutoLoader
❄️ Snowflake | Elastic cloud data warehouse. | Shared-data architecture with compute scale-out. | Time Travel, Streams
🐙 Kraken | The digital engine room behind smart energy retailers: fast, flexible, and surprisingly human for a platform. | Cloud-native SaaS by Kraken Technologies for energy providers, covering billing, CX, and smart grid ops with scalable API-first architecture. | Octopus Energy, E.ON Next, Kraken Flex
🛠️ OSS | The engineers of the telco stack. | Manages provisioning, monitoring, repair workflows. | ServiceNow, Netcool
🏛️ ERP | Business HQ: finance, HR, logistics. | Integrated suite for core enterprise functions. | SAP, Oracle, Dynamics
📋 RAID Log | Your project stress list: risks, issues, etc. | Tracks Risks, Assumptions, Issues, Dependencies. | Excel, PMO packs
🧾 Universal Journal | SAP’s all-in-one accounting ledger. | Table ACDOCA capturing all line items in S/4HANA. | SAP S/4 Finance
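
To make the “Data Quality Gates” entry above concrete, here is a minimal sketch of a pre-model validation gate written in plain pandas. The column names, rules, and thresholds are illustrative assumptions, not taken from any specific project or tool.

```python
# Minimal data-quality gate: validate a batch before it reaches a model.
# Column names and thresholds below are illustrative only.
import pandas as pd

def quality_gate(df: pd.DataFrame) -> list[str]:
    """Return a list of rule violations; an empty list means the batch passes."""
    failures = []
    if df["customer_id"].isnull().any():
        failures.append("customer_id contains nulls")
    if df["customer_id"].duplicated().any():
        failures.append("customer_id is not unique")
    if not df["monthly_spend"].between(0, 100_000).all():
        failures.append("monthly_spend outside expected range")
    return failures

batch = pd.DataFrame({"customer_id": [1, 2, 2], "monthly_spend": [120.0, 95.5, -4.0]})
problems = quality_gate(batch)
if problems:
    # In a pipeline this would quarantine the batch or fail the task.
    print("Gate failed:", problems)
```

In practice the same idea is usually expressed declaratively (e.g., in Great Expectations or dbt tests) so the rules live alongside the pipeline rather than in ad-hoc code.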

White Paper Glossary

📖 Lakehouse Glossary

Plain-English first; nuts-and-bolts second.

🏛️ Data Lakehouse
A warehouse built inside your lake: store everything, analyse sensibly.
Architecture that blends lake scalability with warehouse features (ACID, indexing, governance) to serve BI & ML from one platform.
🌊 Data Lake
One big storage bay for every kind of file—tidy comes later.
Object storage (ADLS/S3/GCS) for structured, semi-structured, and unstructured data at massive scale.
📦 Object Storage
Buckets in the cloud with labels.
Flat storage of objects (data + metadata + ID) over HTTP APIs; durable and highly scalable.
🧪 ACID Transactions
No half-saved changes—either done or not.
Atomicity, Consistency, Isolation, Durability for reliable reads/writes in distributed systems.
🗂️ Medallion Architecture (Bronze/Silver/Gold)
Bronze = raw, Silver = cleaned, Gold = board-ready.
Layered refinement: append-only raw → conformed/quality-checked → curated marts (dimensional/semantic) for analytics.
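
A minimal PySpark sketch of the Bronze → Silver → Gold flow described above. The paths, column names, and cleaning rules are illustrative assumptions, not a prescribed layout.

```python
# Sketch of Bronze -> Silver -> Gold refinement with PySpark and Delta tables.
# Paths, columns, and rules are illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: land the raw files as-is (append-only).
bronze = spark.read.json("/lake/raw/orders/")
bronze.write.format("delta").mode("append").save("/lake/bronze/orders")

# Silver: deduplicate, enforce types, apply basic quality rules.
silver = (
    spark.read.format("delta").load("/lake/bronze/orders")
    .dropDuplicates(["order_id"])
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .filter(F.col("amount") >= 0)
)
silver.write.format("delta").mode("overwrite").save("/lake/silver/orders")

# Gold: aggregate into a board-ready mart.
gold = silver.groupBy("region").agg(F.sum("amount").alias("total_sales"))
gold.write.format("delta").mode("overwrite").save("/lake/gold/sales_by_region")
```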
📚 Catalog / Unity Catalog
The index telling you what’s where and who can touch it.
Central metadata & governance (e.g., Databricks Unity Catalog) for tables, files, models, lineage, and permissions.
🧱 Table Format Layer
The rulebook that makes a lake act like a warehouse.
Open formats (Delta Lake, Apache Iceberg, Apache Hudi) add ACID, snapshots, schema control, and time travel on object storage.
🔺 Delta Lake
Version control and reliable updates for your lake.
Transaction log on Parquet enabling ACID, schema enforcement/evolution, and efficient upserts/deletes.
🧊 Apache Iceberg
Big-table plumbing for huge lakes.
Hidden partitioning, manifest lists, snapshot isolation; engine-agnostic (Spark/Trino/Flink) with fast scans and metadata ops.
🦒 Apache Hudi
Change-friendly tables without the drama.
Incremental processing with Copy-On-Write / Merge-On-Read, upserts, and CDC pipelines.
🕰️ Time Travel
Roll back to “before it went weird”.
Query historical snapshots by version/timestamp for audits, debugging, and reproducibility.
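
A short sketch of time travel against a Delta table; the table path and version number are illustrative.

```python
# Reading an earlier snapshot of a Delta table by version number.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Current state of the table.
current = spark.read.format("delta").load("/lake/silver/orders")

# The same table as it looked at version 12 (or use "timestampAsOf" with a timestamp).
before = (
    spark.read.format("delta")
    .option("versionAsOf", 12)
    .load("/lake/silver/orders")
)

# Equivalent SQL, if the table is registered in the catalog:
#   SELECT * FROM orders VERSION AS OF 12
```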
🧩 Schema Enforcement
If the form’s wrong, it gets bounced.
Validate incoming data against expected types/columns; reject or quarantine incompatible payloads.
🔧 Schema Evolution
Add a column without breaking everyone else.
Controlled schema changes with compatibility checks; table metadata records version history.
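
A sketch covering both entries above: Delta tables reject a write whose schema does not match (enforcement) unless the change is explicitly allowed (evolution). The path and the new column are illustrative.

```python
# Schema enforcement vs. schema evolution on a Delta table.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
orders = spark.read.format("delta").load("/lake/silver/orders")

# Enforcement: this append fails if "loyalty_tier" is not already in the table schema.
with_new_col = orders.withColumn("loyalty_tier", F.lit("bronze"))
# with_new_col.write.format("delta").mode("append").save("/lake/silver/orders")  # raises AnalysisException

# Evolution: explicitly allow the new column to be added to the table schema.
(
    with_new_col.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/lake/silver/orders")
)
```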
🗄️ Parquet (Columnar)
Packs data tight; reads only what you ask for.
Columnar file format with predicate pushdown and encoding for fast analytics and compression.
🧭 Partition Pruning
Skip to the right chapter, not the whole book.
Engines scan only relevant partitions based on filters (date/region/customer), reducing I/O and cost.
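
A small PySpark illustration of partition pruning: lay the data out by a partition column, then filter on it so only matching partitions are read. Paths and columns are illustrative.

```python
# Write a table partitioned by date, then filter on that date so the engine
# reads only the matching partition directories.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.read.format("delta").load("/lake/bronze/events")

# Lay the data out by event_date so each day lands in its own partition.
(
    events.write.format("delta")
    .partitionBy("event_date")
    .mode("overwrite")
    .save("/lake/silver/events")
)

# This filter lets the engine prune to a single day's files instead of a full scan.
one_day = (
    spark.read.format("delta")
    .load("/lake/silver/events")
    .filter(F.col("event_date") == "2024-06-01")
)
```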
✍️ Append-Only Landing
Write new pages; never erase the old ones.
Immutable raw zone preserving source fidelity—ideal for audit, replay, and CDC reconciliation.
🚚 Batch ETL / ELT
The nightly “big shop”.
Scheduled bulk loads; ETL transforms pre-load, ELT transforms in-lake/in-warehouse post-load.
Streaming Ingestion
Data arrives continuously, not in one big lump.
Event pipelines (Kafka/Event Hubs/Kinesis) for low-latency processing and near-real-time analytics.
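
A minimal Structured Streaming sketch of the pattern above: read continuously from Kafka and append to a Delta table. The broker address, topic, and paths are illustrative.

```python
# Streaming ingestion: Kafka -> Delta with Spark Structured Streaming.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load()
    .selectExpr("CAST(value AS STRING) AS payload", "timestamp")
)

query = (
    stream.writeStream.format("delta")
    .option("checkpointLocation", "/lake/_checkpoints/orders")  # enables exactly-once restarts
    .start("/lake/bronze/orders_stream")
)
```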
🔁 Change Data Capture (CDC)
Only the changes, thanks.
Ingest inserts/updates/deletes from sources to keep downstream tables in sync without full reloads.
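
A sketch of applying a CDC batch with a Delta MERGE. The table paths, key column, and the `op` change-flag column are illustrative assumptions about how the source marks inserts, updates, and deletes.

```python
# Apply a CDC batch (inserts/updates/deletes) to a Delta table with MERGE.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

target = DeltaTable.forPath(spark, "/lake/silver/customers")
changes = spark.read.format("delta").load("/lake/bronze/customer_changes")

(
    target.alias("t")
    .merge(changes.alias("c"), "t.customer_id = c.customer_id")
    .whenMatchedDelete(condition="c.op = 'D'")
    .whenMatchedUpdateAll(condition="c.op = 'U'")
    .whenNotMatchedInsertAll(condition="c.op = 'I'")
    .execute()
)
```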
⚙️ Apache Spark
The workhorse behind big data jobs.
Distributed compute for SQL, streaming, and ML; core engine for Delta/Iceberg/Hudi operations.
💻 Databricks
Spark with the safety rails: collab, governance, and ops.
Managed lakehouse platform with notebooks, clusters, Unity Catalog, Delta Lake, and MLflow integration.
🛰️ Trino / Presto
SQL that talks to many systems without moving data.
MPP query engines federating SQL across object storage and other sources via connectors and cost-based optimisation (CBO).
☁️ Serverless SQL
Run queries without babysitting clusters.
On-demand autoscaled SQL compute with per-query billing against lakehouse tables/external data.
📈 Business Intelligence (BI)
Dashboards that answer “how are we doing?”
Visual analytics using governed models, aggregates, and semantic layers for decision-making.
🧠 Semantic Layer
Data that speaks business, not engineer.
Logical model (metrics/dimensions/rules) mapping physical tables to business terms (dbt/LookML/Power BI models).
🧮 DAX
Excel on protein shakes.
Power BI formula language for measures, time intelligence, and model calculations.
📦 dbt
SQL models with version control and tests.
Modular transformations, tests, and docs; integrates with lakehouse engines for ELT best practice.
📊 Power BI / Looker
Charts and dashboards the CFO will actually open.
BI tools for modelling, visualising, and sharing governed analytics on top of lakehouse datasets.
📝 Notebooks
Code, notes, and outputs in one place.
Interactive docs (Python/SQL/Scala) for exploration, data prep, and ML experimentation.
🍱 Feature Store
Pre-chopped ML ingredients you can reuse.
Curated, versioned features with offline/online stores for consistent training/inference.
📔 MLflow
Keeps receipts on your ML runs.
Tracks experiments, packages models, and manages deployment with a central model registry.
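
A minimal MLflow tracking sketch; the experiment name and the scikit-learn model are illustrative stand-ins for whatever is actually being trained.

```python
# Log a run to MLflow: parameters, a metric, and the trained model artifact.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

mlflow.set_experiment("churn-baseline")
with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.log_param("max_iter", 200)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")  # stored in the artifact store / model registry
```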
🔐 RBAC / ABAC
Who you are vs. attributes you have.
Role-based access grants permissions by assigned role; attribute-based access evaluates context (dept/geo/data tags) for fine-grained control.
🧬 Lineage
Your data’s family tree.
Provenance showing sources, transformations, and dependencies across pipelines and models.
🕵️ PII Controls
Keep the sensitive bits covered.
Discovery, masking, tokenisation, and consent handling for personally identifiable information.
🧾 Auditing
Who looked, who changed, and when.
Immutable access/change logs for compliance (GDPR/ISO/PCI/HIPAA) and investigations.
🔗 Data Sharing (e.g., Delta Sharing)
Share data without emailing copies around.
Open protocols to grant read access across orgs/platforms directly against governed tables in object storage.
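
A short sketch using the open Delta Sharing Python connector. The profile file and the share/schema/table names are illustrative; the profile is issued by the data provider and holds the endpoint and token.

```python
# Read a shared table with the Delta Sharing client, straight into pandas,
# without copying files out of the provider's storage. Names are illustrative.
import delta_sharing

table_url = "config.share#sales_share.gold.sales_by_region"
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```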
🗓️ Airflow / Azure Data Factory
The schedulers keeping the lights on.
Pipeline orchestration for dependencies, retries, parameterisation, and event-driven runs.
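
A minimal Airflow DAG sketch; the DAG id, schedule, and task body are illustrative placeholders for a real pipeline.

```python
# One scheduled task per day, with retries handled by the scheduler.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def load_orders():
    # Placeholder for the real ingestion/transformation step.
    print("loading orders...")

with DAG(
    dag_id="daily_orders",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="load_orders", python_callable=load_orders, retries=2)
```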
🚀 CI/CD
Ship small, ship often, don’t break things.
Automated build/test/deploy for SQL models, notebooks, and infra (IaC) with environment promotion.
👀 Observability
Find issues before users do.
Metrics, logs, traces, data quality checks, SLAs/SLOs across pipelines, queries, and spend.
💷 Cost Management
Know where the money’s going—and why.
Budgets, tags, auto-stop, workload isolation, right-sizing to control storage/compute/egress/concurrency costs.
🏗️ ADLS / S3 / GCS
Microsoft, Amazon, and Google’s big buckets.
Azure Data Lake Storage, Amazon Simple Storage Service, and Google Cloud Storage underpin the lake layer.
🛰️ Kafka / Event Hubs
Conveyor belts for events.
Distributed streaming backbones for pub/sub ingestion, exactly-once semantics, and scalable consumer groups.
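
A small kafka-python sketch of the pub/sub pattern described above; the broker address, topic, and payload are illustrative.

```python
# Publish and consume events with kafka-python.
import json
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("meter-readings", {"meter_id": "M-42", "kwh": 3.2})
producer.flush()

consumer = KafkaConsumer(
    "meter-readings",
    bootstrap_servers="broker:9092",
    group_id="billing",  # consumers in the same group share the topic's partitions
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)
    break
```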
Star Schema
Facts in the middle, lookups round the edge.
Dimensional model with central fact table and denormalised dimensions for fast BI queries.
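
A sketch of a typical star-schema query: one fact table joined to denormalised dimensions and aggregated. Table and column names are illustrative.

```python
# Star-schema query: fact table joined to its dimensions via Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    SELECT d.calendar_month,
           p.category,
           SUM(f.amount) AS total_sales
    FROM   fact_sales f
    JOIN   dim_date    d ON f.date_key    = d.date_key
    JOIN   dim_product p ON f.product_key = p.product_key
    GROUP BY d.calendar_month, p.category
""").show()
```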
❄️ Snowflake Schema
Star schema with extra tidy cupboards.
Further-normalised dimensions to reduce redundancy; trades some query simplicity for governance and reuse.
📚 Glossary — Plain vs Technical
Term | Non-Technical (Plain English) | Technical
Lakehouse | A warehouse that can swim — stores anything and still behaves sensibly. | Unified architecture on object storage with an ACID table layer and decoupled compute.
Delta-style Table | Gives files a memory and manners. | Transaction log + Parquet files enabling ACID, time travel, and schema evolution.
Medallion (Bronze/Silver/Gold) | Raw → cleaned → ready-to-serve — quality increases each step. | Layered physical/logical zones with conformance and business semantics applied progressively.
External Table | Let the warehouse read the lake’s menu without moving the kitchen. | Metadata in the warehouse referencing files in external storage; governed via catalog policies.
Data Contract | “Don’t break my columns and we’ll stay friends.” | Versioned schema + SLAs between producer and consumer with automated validation (see the sketch below this table).
Expectation Tests | Common-sense checks that shout when data looks odd. | Declarative assertions (nulls, ranges, uniqueness) acting as quality gates in pipelines.
ACID | “No lost updates, no wobbly reads.” | Atomicity, Consistency, Isolation, Durability guarantees for table operations.
CDC | Only ship what changed, not the whole warehouse. | Change Data Capture via logs or timestamps to upsert/merge downstream tables efficiently.
Time Travel | Rewind the table to yesterday’s state. | Query/restore historical table versions using commit IDs or timestamps.
Catalog & Lineage | The phone book and family tree of your data. | Authoritative metadata, ownership, tags, and flow tracing from sources to products.
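
To illustrate the “Data Contract” and “Expectation Tests” entries in the table above, here is a minimal sketch of a contract check in plain Python. The contract contents, column names, and dtypes are illustrative assumptions; real implementations typically use a schema registry or a declarative testing tool.

```python
# Sketch of a versioned data contract: the producer's batch must match the agreed
# schema before it is published downstream. Contract contents are illustrative.
import pandas as pd

CONTRACT = {
    "version": "1.2",
    "columns": {"customer_id": "int64", "region": "object", "monthly_spend": "float64"},
}

def validate_against_contract(df: pd.DataFrame, contract: dict) -> list[str]:
    issues = []
    for name, dtype in contract["columns"].items():
        if name not in df.columns:
            issues.append(f"missing column: {name}")
        elif str(df[name].dtype) != dtype:
            issues.append(f"{name}: expected {dtype}, got {df[name].dtype}")
    return issues

batch = pd.DataFrame({"customer_id": [1, 2], "region": ["NE", "SW"], "monthly_spend": [42.0, 58.5]})
assert validate_against_contract(batch, CONTRACT) == []  # contract holds for this batch
```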