Glossary v1.4

Glossary

Term | 🔍 Non-Technical Description | 🛠️ Technical Description | 🗃️ Seen In / Tools
🔐 ACID | A reliable bank transfer: all or nothing. | Atomicity, Consistency, Isolation, Durability. | PostgreSQL, Oracle
🧪 AI/ML Scenario Blueprint | Design your model’s storyline. | Scenario framing with goals, data, stakeholders. | Whiteboard decks, ML canvas
🌀 AI/ML Cloud Blueprint | End-to-end cloud ML pattern. | Includes ingestion, training, inference, monitoring. | Databricks, SageMaker
🤖 Applied AI/ML | Useful models, not just hype. | Deployed, monitored models solving real tasks. | MLflow, dashboards
🧘 BASE | Looser than ACID, good enough for now. | Basically Available, Soft state, Eventual consistency. | Cassandra, Couchbase
🏢 BSS | Billing, CRM, and commercial bits of telecoms. | Handles billing, orders, customer relationships. | Salesforce, Amdocs
⚖️ CAP Theorem | You can’t have all three: C, A, P. | Only two of Consistency, Availability, Partition Tolerance possible. | DynamoDB, HBase
🔄 CRUD Mapping | Who can do what to which data? | Create, Read, Update, Delete lifecycle mapped to business ops. | APIs, Integration specs
📘 Data-Centric Thinking | Design around trustworthy data. | Data-first design with governance, semantics, CRUD. | Star schemas, canonical models
🧳 Data Passport | A metadata ID card for your dataset. | Classification, ownership, retention, lineage. | Unity Catalog, Collibra
🔍 Data Quality Gates | Stop dirty data before it hurts. | Validation rules before model input or output (see the sketch below this table). | Great Expectations, Airflow
🧭 Data Blueprint | Data’s playbook: what, why, and how. | Entity models, flows, CRUD, governance overlays. | DFDs, semantic maps
📐 Feature Engineering Blueprint | Prepping inputs for model cooking. | Transforms, feature store logic, encoding patterns. | MLflow, Delta Lake
👥 Federated Governance | Central rules, local control. | Global policies with domain-led stewardship. | Data Mesh, Unity Catalog
🏥 Healthcare Scenario | AI/ML for triage, diagnostics, operations. | Domain-specific models with clinical framing. | NHS dashboards
🔁 Iterative Delivery | Work in slices. Rinse and repeat. | Sprint-based delivery in architecture and data. | Scrum, Kanban
🔍 Lineage | Where data came from, who touched it. | End-to-end trace of transformations and movement. | Unity Catalog, Informatica
🗂️ Master Data | Core entities reused everywhere. | High-value shared business objects. | SAP MDG, Oracle DRM
📊 Model Monitoring | Keep your model honest and fresh. | Track drift, bias, performance. Trigger retrain. | MLflow, Prometheus
🧱 Databricks | Lakehouse Swiss Army knife. | Platform for streaming, batch, ML, SQL. | Unity Catalog, AutoLoader
❄️ Snowflake | Elastic cloud data warehouse. | Shared-data architecture with compute scale-out. | Time Travel, Streams
🐙 Kraken | The digital engine room behind smart energy retailers: fast, flexible, and surprisingly human for a platform. | Cloud-native SaaS by Kraken Technologies for energy providers, covering billing, CX, and smart grid ops with scalable API-first architecture. | Octopus Energy, E.ON Next, Kraken Flex
🛠️ OSS | The engineers of the telco stack. | Manages provisioning, monitoring, repair workflows. | ServiceNow, Netcool
🏛️ ERP | Business HQ: finance, HR, logistics. | Integrated suite for core enterprise functions. | SAP, Oracle, Dynamics
📋 RAID Log | Your project stress list: risks, issues, etc. | Tracks Risks, Assumptions, Issues, Dependencies. | Excel, PMO packs
🧾 Universal Journal | SAP’s all-in-one accounting ledger. | Table ACDOCA capturing all line items in S/4HANA. | SAP S/4 Finance
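
To make the “Data Quality Gates” entry above concrete, here is a minimal sketch of a pre-model validation gate written in plain pandas. The column names, rules, and thresholds are illustrative assumptions, not taken from any specific project or tool.

```python
# Minimal data-quality gate: validate a batch before it reaches a model.
# Column names and thresholds below are illustrative only.
import pandas as pd

def quality_gate(df: pd.DataFrame) -> list[str]:
    """Return a list of rule violations; an empty list means the batch passes."""
    failures = []
    if df["customer_id"].isnull().any():
        failures.append("customer_id contains nulls")
    if df["customer_id"].duplicated().any():
        failures.append("customer_id is not unique")
    if not df["monthly_spend"].between(0, 100_000).all():
        failures.append("monthly_spend outside expected range")
    return failures

batch = pd.DataFrame({"customer_id": [1, 2, 2], "monthly_spend": [120.0, 95.5, -4.0]})
problems = quality_gate(batch)
if problems:
    # In a pipeline this would quarantine the batch or fail the task.
    print("Gate failed:", problems)
```

In practice the same idea is usually expressed declaratively (e.g., in Great Expectations or dbt tests) so the rules live alongside the pipeline rather than in ad-hoc code.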

White Paper Glossary

📖 Lakehouse Glossary

Plain-English first; nuts-and-bolts second.

🏛️ Data Lakehouse
A warehouse built inside your lake: store everything, analyse sensibly.
Architecture that blends lake scalability with warehouse features (ACID, indexing, governance) to serve BI & ML from one platform.
🌊 Data Lake
One big storage bay for every kind of file—tidy comes later.
Object storage (ADLS/S3/GCS) for structured, semi-structured, and unstructured data at massive scale.
📦 Object Storage
Buckets in the cloud with labels.
Flat storage of objects (data + metadata + ID) over HTTP APIs; durable and highly scalable.
🧪 ACID Transactions
No half-saved changes—either done or not.
Atomicity, Consistency, Isolation, Durability for reliable reads/writes in distributed systems.
🗂️ Medallion Architecture (Bronze/Silver/Gold)
Bronze = raw, Silver = cleaned, Gold = board-ready.
Layered refinement: append-only raw → conformed/quality-checked → curated marts (dimensional/semantic) for analytics.
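
A minimal PySpark sketch of the Bronze → Silver → Gold flow described above. The paths, column names, and cleaning rules are illustrative assumptions, not a prescribed layout.

```python
# Sketch of Bronze -> Silver -> Gold refinement with PySpark and Delta tables.
# Paths, columns, and rules are illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: land the raw files as-is (append-only).
bronze = spark.read.json("/lake/raw/orders/")
bronze.write.format("delta").mode("append").save("/lake/bronze/orders")

# Silver: deduplicate, enforce types, apply basic quality rules.
silver = (
    spark.read.format("delta").load("/lake/bronze/orders")
    .dropDuplicates(["order_id"])
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .filter(F.col("amount") >= 0)
)
silver.write.format("delta").mode("overwrite").save("/lake/silver/orders")

# Gold: aggregate into a board-ready mart.
gold = silver.groupBy("region").agg(F.sum("amount").alias("total_sales"))
gold.write.format("delta").mode("overwrite").save("/lake/gold/sales_by_region")
```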
📚 Catalog / Unity Catalog
The index telling you what’s where and who can touch it.
Central metadata & governance (e.g., Databricks Unity Catalog) for tables, files, models, lineage, and permissions.
🧱 Table Format Layer
The rulebook that makes a lake act like a warehouse.
Open formats (Delta Lake, Apache Iceberg, Apache Hudi) add ACID, snapshots, schema control, and time travel on object storage.
🔺 Delta Lake
Version control and reliable updates for your lake.
Transaction log on Parquet enabling ACID, schema enforcement/evolution, and efficient upserts/deletes.
🧊 Apache Iceberg
Big-table plumbing for huge lakes.
Hidden partitioning, manifest lists, snapshot isolation; engine-agnostic (Spark/Trino/Flink) with fast scans and metadata ops.
🦒 Apache Hudi
Change-friendly tables without the drama.
Incremental processing with Copy-On-Write / Merge-On-Read, upserts, and CDC pipelines.
🕰️ Time Travel
Roll back to “before it went weird”.
Query historical snapshots by version/timestamp for audits, debugging, and reproducibility.
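
A short sketch of time travel against a Delta table; the table path and version number are illustrative.

```python
# Reading an earlier snapshot of a Delta table by version number.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Current state of the table.
current = spark.read.format("delta").load("/lake/silver/orders")

# The same table as it looked at version 12 (or use "timestampAsOf" with a timestamp).
before = (
    spark.read.format("delta")
    .option("versionAsOf", 12)
    .load("/lake/silver/orders")
)

# Equivalent SQL, if the table is registered in the catalog:
#   SELECT * FROM orders VERSION AS OF 12
```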
🧩 Schema Enforcement
If the form’s wrong, it gets bounced.
Validate incoming data against expected types/columns; reject or quarantine incompatible payloads.
🔧 Schema Evolution
Add a column without breaking everyone else.
Controlled schema changes with compatibility checks; table metadata records version history.
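
A sketch covering both entries above: Delta tables reject a write whose schema does not match (enforcement) unless the change is explicitly allowed (evolution). The path and the new column are illustrative.

```python
# Schema enforcement vs. schema evolution on a Delta table.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
orders = spark.read.format("delta").load("/lake/silver/orders")

# Enforcement: this append fails if "loyalty_tier" is not already in the table schema.
with_new_col = orders.withColumn("loyalty_tier", F.lit("bronze"))
# with_new_col.write.format("delta").mode("append").save("/lake/silver/orders")  # raises AnalysisException

# Evolution: explicitly allow the new column to be added to the table schema.
(
    with_new_col.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/lake/silver/orders")
)
```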
🗄️ Parquet (Columnar)
Packs data tight; reads only what you ask for.
Columnar file format with predicate pushdown and encoding for fast analytics and compression.
🧭 Partition Pruning
Skip to the right chapter, not the whole book.
Engines scan only relevant partitions based on filters (date/region/customer), reducing I/O and cost.
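
A small PySpark illustration of partition pruning: lay the data out by a partition column, then filter on it so only matching partitions are read. Paths and columns are illustrative.

```python
# Write a table partitioned by date, then filter on that date so the engine
# reads only the matching partition directories.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.read.format("delta").load("/lake/bronze/events")

# Lay the data out by event_date so each day lands in its own partition.
(
    events.write.format("delta")
    .partitionBy("event_date")
    .mode("overwrite")
    .save("/lake/silver/events")
)

# This filter lets the engine prune to a single day's files instead of a full scan.
one_day = (
    spark.read.format("delta")
    .load("/lake/silver/events")
    .filter(F.col("event_date") == "2024-06-01")
)
```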
✍️ Append-Only Landing
Write new pages; never erase the old ones.
Immutable raw zone preserving source fidelity—ideal for audit, replay, and CDC reconciliation.
🚚 Batch ETL / ELT
The nightly “big shop”.
Scheduled bulk loads; ETL transforms pre-load, ELT transforms in-lake/in-warehouse post-load.
Streaming Ingestion
Data arrives continuously, not in one big lump.
Event pipelines (Kafka/Event Hubs/Kinesis) for low-latency processing and near-real-time analytics.
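
A minimal Structured Streaming sketch of the pattern above: read continuously from Kafka and append to a Delta table. The broker address, topic, and paths are illustrative.

```python
# Streaming ingestion: Kafka -> Delta with Spark Structured Streaming.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load()
    .selectExpr("CAST(value AS STRING) AS payload", "timestamp")
)

query = (
    stream.writeStream.format("delta")
    .option("checkpointLocation", "/lake/_checkpoints/orders")  # enables exactly-once restarts
    .start("/lake/bronze/orders_stream")
)
```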
🔁 Change Data Capture (CDC)
Only the changes, thanks.
Ingest inserts/updates/deletes from sources to keep downstream tables in sync without full reloads.
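
A sketch of applying a CDC batch with a Delta MERGE. The table paths, key column, and the `op` change-flag column are illustrative assumptions about how the source marks inserts, updates, and deletes.

```python
# Apply a CDC batch (inserts/updates/deletes) to a Delta table with MERGE.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

target = DeltaTable.forPath(spark, "/lake/silver/customers")
changes = spark.read.format("delta").load("/lake/bronze/customer_changes")

(
    target.alias("t")
    .merge(changes.alias("c"), "t.customer_id = c.customer_id")
    .whenMatchedDelete(condition="c.op = 'D'")
    .whenMatchedUpdateAll(condition="c.op = 'U'")
    .whenNotMatchedInsertAll(condition="c.op = 'I'")
    .execute()
)
```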
⚙️ Apache Spark
The workhorse behind big data jobs.
Distributed compute for SQL, streaming, and ML; core engine for Delta/Iceberg/Hudi operations.
💻 Databricks
Spark with the safety rails: collab, governance, and ops.
Managed lakehouse platform with notebooks, clusters, Unity Catalog, Delta Lake, and MLflow integration.
🛰️ Trino / Presto
SQL that talks to many systems without moving data.
MPP query engines federating SQL across object storage and other sources via connectors and cost-based optimisation (CBO).
☁️ Serverless SQL
Run queries without babysitting clusters.
On-demand autoscaled SQL compute with per-query billing against lakehouse tables/external data.
📈 Business Intelligence (BI)
Dashboards that answer “how are we doing?”
Visual analytics using governed models, aggregates, and semantic layers for decision-making.
🧠 Semantic Layer
Data that speaks business, not engineer.
Logical model (metrics/dimensions/rules) mapping physical tables to business terms (dbt/LookML/Power BI models).
🧮 DAX
Excel on protein shakes.
Power BI formula language for measures, time intelligence, and model calculations.
📦 dbt
SQL models with version control and tests.
Modular transformations, tests, and docs; integrates with lakehouse engines for ELT best practice.
📊 Power BI / Looker
Charts and dashboards the CFO will actually open.
BI tools for modelling, visualising, and sharing governed analytics on top of lakehouse datasets.
📝 Notebooks
Code, notes, and outputs in one place.
Interactive docs (Python/SQL/Scala) for exploration, data prep, and ML experimentation.
🍱 Feature Store
Pre-chopped ML ingredients you can reuse.
Curated, versioned features with offline/online stores for consistent training/inference.
📔 MLflow
Keeps receipts on your ML runs.
Tracks experiments, packages models, and manages deployment with a central model registry.
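
A minimal MLflow tracking sketch; the experiment name and the scikit-learn model are illustrative stand-ins for whatever is actually being trained.

```python
# Log a run to MLflow: parameters, a metric, and the trained model artifact.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

mlflow.set_experiment("churn-baseline")
with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.log_param("max_iter", 200)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")  # stored in the artifact store / model registry
```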
🔐 RBAC / ABAC
Who you are vs. attributes you have.
Role-based access grants permissions by assigned role; attribute-based access evaluates context (dept/geo/data tags) for fine-grained control.
🧬 Lineage
Your data’s family tree.
Provenance showing sources, transformations, and dependencies across pipelines and models.
🕵️ PII Controls
Keep the sensitive bits covered.
Discovery, masking, tokenisation, and consent handling for personally identifiable information.
🧾 Auditing
Who looked, who changed, and when.
Immutable access/change logs for compliance (GDPR/ISO/PCI/HIPAA) and investigations.
🔗 Data Sharing (e.g., Delta Sharing)
Share data without emailing copies around.
Open protocols to grant read access across orgs/platforms directly against governed tables in object storage.
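
A short sketch using the open Delta Sharing Python connector. The profile file and the share/schema/table names are illustrative; the profile is issued by the data provider and holds the endpoint and token.

```python
# Read a shared table with the Delta Sharing client, straight into pandas,
# without copying files out of the provider's storage. Names are illustrative.
import delta_sharing

table_url = "config.share#sales_share.gold.sales_by_region"
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```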
🗓️ Airflow / Azure Data Factory
The schedulers keeping the lights on.
Pipeline orchestration for dependencies, retries, parameterisation, and event-driven runs.
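
A minimal Airflow DAG sketch; the DAG id, schedule, and task body are illustrative placeholders for a real pipeline.

```python
# One scheduled task per day, with retries handled by the scheduler.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def load_orders():
    # Placeholder for the real ingestion/transformation step.
    print("loading orders...")

with DAG(
    dag_id="daily_orders",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="load_orders", python_callable=load_orders, retries=2)
```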
🚀 CI/CD
Ship small, ship often, don’t break things.
Automated build/test/deploy for SQL models, notebooks, and infra (IaC) with environment promotion.
👀 Observability
Find issues before users do.
Metrics, logs, traces, data quality checks, SLAs/SLOs across pipelines, queries, and spend.
💷 Cost Management
Know where the money’s going—and why.
Budgets, tags, auto-stop, workload isolation, right-sizing to control storage/compute/egress/concurrency costs.
🏗️ ADLS / S3 / GCS
Microsoft, Amazon, and Google’s big buckets.
Azure Data Lake Storage, Amazon Simple Storage Service, and Google Cloud Storage underpin the lake layer.
🛰️ Kafka / Event Hubs
Conveyor belts for events.
Distributed streaming backbones for pub/sub ingestion, exactly-once semantics, and scalable consumer groups.
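
A small kafka-python sketch of the pub/sub pattern described above; the broker address, topic, and payload are illustrative.

```python
# Publish and consume events with kafka-python.
import json
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("meter-readings", {"meter_id": "M-42", "kwh": 3.2})
producer.flush()

consumer = KafkaConsumer(
    "meter-readings",
    bootstrap_servers="broker:9092",
    group_id="billing",  # consumers in the same group share the topic's partitions
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)
    break
```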
Star Schema
Facts in the middle, lookups round the edge.
Dimensional model with central fact table and denormalised dimensions for fast BI queries.
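
A sketch of a typical star-schema query: one fact table joined to denormalised dimensions and aggregated. Table and column names are illustrative.

```python
# Star-schema query: fact table joined to its dimensions via Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    SELECT d.calendar_month,
           p.category,
           SUM(f.amount) AS total_sales
    FROM   fact_sales f
    JOIN   dim_date    d ON f.date_key    = d.date_key
    JOIN   dim_product p ON f.product_key = p.product_key
    GROUP BY d.calendar_month, p.category
""").show()
```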
❄️ Snowflake Schema
Star schema with extra tidy cupboards.
Further-normalised dimensions to reduce redundancy; trades some query simplicity for governance and reuse.
📚 Glossary — Plain vs Technical
Term | Non-Technical (Plain English) | Technical
Lakehouse | A warehouse that can swim — stores anything and still behaves sensibly. | Unified architecture on object storage with an ACID table layer and decoupled compute.
Delta-style Table | Gives files a memory and manners. | Transaction log + Parquet files enabling ACID, time travel, and schema evolution.
Medallion (Bronze/Silver/Gold) | Raw → cleaned → ready-to-serve — quality increases each step. | Layered physical/logical zones with conformance and business semantics applied progressively.
External Table | Let the warehouse read the lake’s menu without moving the kitchen. | Metadata in the warehouse referencing files in external storage; governed via catalog policies.
Data Contract | “Don’t break my columns and we’ll stay friends.” | Versioned schema + SLAs between producer and consumer with automated validation (see the sketch below this table).
Expectation Tests | Common-sense checks that shout when data looks odd. | Declarative assertions (nulls, ranges, uniqueness) acting as quality gates in pipelines.
ACID | “No lost updates, no wobbly reads.” | Atomicity, Consistency, Isolation, Durability guarantees for table operations.
CDC | Only ship what changed, not the whole warehouse. | Change Data Capture via logs or timestamps to upsert/merge downstream tables efficiently.
Time Travel | Rewind the table to yesterday’s state. | Query/restore historical table versions using commit IDs or timestamps.
Catalog & Lineage | The phone book and family tree of your data. | Authoritative metadata, ownership, tags, and flow tracing from sources to products.
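
To illustrate the “Data Contract” and “Expectation Tests” entries in the table above, here is a minimal sketch of a contract check in plain Python. The contract contents, column names, and dtypes are illustrative assumptions; real implementations typically use a schema registry or a declarative testing tool.

```python
# Sketch of a versioned data contract: the producer's batch must match the agreed
# schema before it is published downstream. Contract contents are illustrative.
import pandas as pd

CONTRACT = {
    "version": "1.2",
    "columns": {"customer_id": "int64", "region": "object", "monthly_spend": "float64"},
}

def validate_against_contract(df: pd.DataFrame, contract: dict) -> list[str]:
    issues = []
    for name, dtype in contract["columns"].items():
        if name not in df.columns:
            issues.append(f"missing column: {name}")
        elif str(df[name].dtype) != dtype:
            issues.append(f"{name}: expected {dtype}, got {df[name].dtype}")
    return issues

batch = pd.DataFrame({"customer_id": [1, 2], "region": ["NE", "SW"], "monthly_spend": [42.0, 58.5]})
assert validate_against_contract(batch, CONTRACT) == []  # contract holds for this batch
```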