Databricks Components Overview

| Component | Non-Technical Description 📘 | Technical Description ⚙️ | Use Case / Scenario 🎯 |
|---|---|---|---|
| Workspace | Your collaborative project space | Hosts notebooks, jobs, repos, and ML experiments | Organising analytics or data science projects |
| Clusters | Computing engine (like your personal AI but scalable) | Spark-based distributed compute environment (autoscaling or manual) | Run jobs, notebooks, ML models |
| Jobs | Automated tasks or scheduled workflows | DAG**-based execution of notebooks, scripts, or JARs | Nightly ETL jobs, ML training pipelines |
| Notebooks | Interactive workspace for code and output | Supports Python, SQL, Scala, R, Markdown | Exploratory data analysis, prototyping |
| SQL Editor | GUI to query data tables | Uses Databricks SQL for a BI-friendly interface | Business users querying curated tables |
| Delta Lake | Like a spreadsheet that remembers everything | ACID-compliant storage layer over Parquet | Reliable data lake tables with versioning |
| Unity Catalog | Your data’s filing cabinet and bouncer | Central metadata & access control layer with RBAC | Secure multi-tenant access across clouds |
| Lakehouse Platform | The Databricks "big idea" — warehouse + lake | Combines data lake scalability with DB-like performance | Unified platform for batch, stream, ML, and BI |
| MLflow | Your model's history, packaging, and delivery | Open-source lifecycle management for ML models | Experiment tracking, model registry, deployment |
| Repos | Built-in Git versioning | Git-backed source control for notebooks & jobs | Code collaboration and CI/CD |
| Data Explorer | Browse your tables like folders | Visual UI to inspect catalogs, schemas, tables | Data discovery and governance checks |
| Dashboards | Shareable reports and visuals | BI dashboards powered by SQL or notebooks | Stakeholder insights and KPIs |

** A DAG is a Directed Acyclic Graph — a way of structuring work so each task only moves forward and never loops back on itself. In data pipelines it describes the exact order tasks must run in, what depends on what, and what can run in parallel. Each node is a step (extract, transform, conceal, load) and each arrow shows a dependency. Because it’s “acyclic,” you never get circular logic or infinite loops. Orchestration tools like Databricks Jobs, Airflow, ADF, and Prefect use DAGs to guarantee predictable, safe, dependency-driven execution.

What “Directed” Means
The arrows have a one-way direction: one task finishes before the next starts. The flow only moves forward.
What “Acyclic” Means
No loops allowed. You can’t return to a previous step (A → B → A). This avoids infinite loops and “chasing your tail” in a pipeline.
What “Graph” Means
A network of tasks. Each task is a point (node) and each line (edge) between points shows a dependency between those tasks.
Non-Technical Description
A DAG is a tidy roadmap showing which tasks must happen first, what happens next, and how everything connects — always moving forward, never looping back.
Technical Description
A DAG is a graph G = (V, E) with nodes (tasks) and directed edges (dependencies). It contains no cycles and supports topological sorting to guarantee deterministic execution. This is the model used by orchestration tools such as Airflow, Databricks Jobs, ADF, and Prefect.
Scenario – Dataverse Migration
Example pipeline: Extract → Stage → Conceal PII → Map GUIDs → Load → Audit. Each step waits for the step it depends on, so the migration runs in the correct order, with no circular logic and clear sequencing.
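The migration steps above can be sketched as a DAG using Python's standard-library `graphlib` (illustrative only — this is not a Databricks API; real orchestration would use Databricks Jobs/Workflows). The topological sort guarantees the dependency-driven order, and a back-edge immediately surfaces as a cycle error:

```python
from graphlib import TopologicalSorter, CycleError

# Each key lists its predecessors: the steps that must finish first.
dag = {
    "Stage": {"Extract"},
    "Conceal PII": {"Stage"},
    "Map GUIDs": {"Conceal PII"},
    "Load": {"Map GUIDs"},
    "Audit": {"Load"},
}

# Topological sort: deterministic, dependency-respecting execution order.
order = list(TopologicalSorter(dag).static_order())
print(order)  # Extract first, Audit last

# "Acyclic" enforced: adding Extract -> Audit back-edge creates a cycle.
bad = dict(dag)
bad["Extract"] = {"Audit"}
try:
    list(TopologicalSorter(bad).static_order())
except CycleError:
    print("cycle detected")
```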
Y9 Analogy
A line of dominoes falling forward. You can branch or merge paths, but they always fall in one direction and never loop back to the start.

Databricks on Azure vs AWS vs GCP

Databricks is the same core platform everywhere (Delta Lake, Unity Catalog, MLflow, notebooks). What changes is the cloud foundation underneath it: identity, networking, storage, and integration.

1. Identity & Access Integration

| Area | Azure | AWS | GCP |
|---|---|---|---|
| Primary Identity | Entra ID | IAM | GCP IAM |
| Service Identity | Managed Identities | IAM Roles + STS | Service Accounts |
| Secrets & Keys | Key Vault | Secrets Manager / KMS | Secret Manager / KMS |

2. Storage Layer & Delta Access

| Area | Azure | AWS | GCP |
|---|---|---|---|
| Primary Storage | ADLS Gen2 | S3 | GCS |
| Access Method | Entra OAuth / RBAC | IAM Role Assumption | Service Account Key / OAuth |
| Delta Lake | Identical | Identical | Identical |

3. Networking & Security Controls

| Area | Azure | AWS | GCP |
|---|---|---|---|
| Network Model | VNet Injection / Private Endpoints | VPC / PrivateLink | VPC / Private Service Connect |
| Outbound Control | NSGs, Route Tables | Security Groups, NACLs | VPC Firewall |

4. Compute & Cluster Management

| Area | Azure | AWS | GCP |
|---|---|---|---|
| Underlying VMs | D/E/F/M Series | EC2 Instance Families | Compute Engine VMs |
| Photon Engine | Yes | Yes | Yes |

5. Integration Ecosystem

| Integration Area | Azure | AWS | GCP |
|---|---|---|---|
| Ingestion | ADF, Event Hubs | Kinesis, Glue | Pub/Sub, Dataflow |
| Governance | Purview | Glue Data Catalog | Data Catalog |
| BI Layer | Power BI | QuickSight | Looker / BigQuery |
6. Summary (What Really Matters)

The Databricks platform is identical everywhere. The differences are the plumbing underneath:

  • Identity and access control
  • Storage access (ADLS vs S3 vs GCS)
  • Networking models
  • Integration ecosystem

Code moves easily across clouds. Architecture does not.
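As a small illustration of that point, the only cloud-specific part of a Delta table read is the storage URI scheme; the bucket/container names below are hypothetical, and the Spark call itself (shown in a comment) is identical on every cloud:

```python
# Hypothetical names; only the URI scheme is cloud-specific.
SCHEMES = {
    "azure": "abfss://{container}@{account}.dfs.core.windows.net/{path}",
    "aws":   "s3://{bucket}/{path}",
    "gcp":   "gs://{bucket}/{path}",
}

def delta_path(cloud: str, path: str, **kw) -> str:
    """Build the cloud-specific storage URI for the same Delta table."""
    return SCHEMES[cloud].format(path=path, **kw)

print(delta_path("aws", "silver/trips", bucket="lake"))
# On every cloud the Spark call is the same:
#   spark.read.format("delta").load(delta_path(...))
# What differs is the architecture behind the path: identity, networking,
# and how the workspace is granted access to that storage.
```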

| Idea | Summary | What It Demonstrates (Technical) |
|---|---|---|
| 1. Streaming-First Lakehouse Pipeline | Simulate real-time trip ingestion using Auto Loader and build Bronze → Silver → Gold tables. | Auto Loader, schema evolution, Delta streaming tables, dedupe handling, late-arrival rules, Workflows. |
| 2. SCD Type 2 Taxi Zones | Treat taxi zones as a slowly changing dimension and link historic trips to the correct zone version. | Delta MERGE, SCD Type 2 patterns, surrogate keys, temporal joins, gold star schema modelling. |
| 3. Predictive Model: Fare / Duration / Tip | Train ML models predicting fare, duration, or tip likelihood using Feature Store and MLflow. | Feature engineering, MLflow tracking, model registry, batch scoring, online/offline parity. |
| 4. Geospatial Analysis with H3 | Convert pickup/drop-off points into H3 cells to identify hotspots and time-of-day patterns. | H3 indexing, spatial joins, heatmaps, optimised Delta tables, DBSQL visualisation. |
| 5. Data Quality & Delta Live Tables | Apply validation rules (distance, fare, timestamps) and route invalid trips to quarantine. | DLT expectations, pipeline monitoring, auto-lineage, DQ dashboards. |
| 6. Performance & Scaling Patterns | Use the dataset to teach partitioning, file compaction, AQE and cluster autoscaling. | Repartition/coalesce, OPTIMIZE + Z-Order, AQE, Spark performance patterns. |
| 7. Lineage, Audit & Run Headers | Add GUIDs, batch metadata, and row-level lineage to show enterprise-grade governance. | Run-header tables, load-history logs, source→target traceability, audit-friendly Delta design. |
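Idea 2's SCD Type 2 pattern can be sketched without Spark: when a tracked attribute changes, the current version of the row is closed and a new version opened. Column and zone names are illustrative; in Databricks this logic would be a single Delta `MERGE` statement rather than a Python loop:

```python
from datetime import date

# Current dimension table: one open version per zone (is_current=True).
zones = [
    {"zone_id": 1, "name": "Midtown", "valid_from": date(2020, 1, 1),
     "valid_to": None, "is_current": True},
]

def scd2_upsert(table, key, new_name, as_of):
    """Close the current version if the attribute changed, then open a new one."""
    for row in table:
        if row["zone_id"] == key and row["is_current"]:
            if row["name"] == new_name:
                return  # no change: nothing to do
            row["valid_to"] = as_of      # close the old version
            row["is_current"] = False
    table.append({"zone_id": key, "name": new_name, "valid_from": as_of,
                  "valid_to": None, "is_current": True})

scd2_upsert(zones, 1, "Midtown Center", date(2023, 6, 1))
# zones now holds two versions: the closed historic row and the current row,
# so historic trips can join to whichever version was valid at trip time.
```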
| Mathematical Area | What the Mathematics Is | Where It Appears in Databricks | What It Enables |
|---|---|---|---|
| Linear Algebra | The maths of vectors and matrices, and how they combine through operations like dot products and matrix multiplication. | Vectorised operations in Spark MLlib; feature vectors; matrix factorisation; embeddings. | Regression, classification, PCA, recommendation models, dimensionality reduction. |
| Calculus | The study of change; derivatives measure how fast something changes, and optimisation uses them to minimise a loss function. | Gradient descent in ML training; loss minimisation; optimisation loops. | Training ML models (logistic regression, neural nets), tuning learning rates and convergence. |
| Probability & Statistics | How uncertainty behaves — distributions, sampling, variance, likelihoods, inference. | Sampling, distributions, hypothesis testing within MLlib & pandas-on-Spark. | Confidence intervals, anomaly detection, probabilistic models, forecasting. |
| Numerical Methods | Algorithms that approximate solutions when exact equations can’t be solved analytically. | Iterative solvers in optimisation; approximate algorithms; floating-point handling. | Large-scale model fitting, stable computation over big data, efficient approximations. |
| Graph Theory | The maths of nodes and edges — networks, dependencies, paths, and relationships. | GraphFrames, DAG scheduling in Spark, lineage graphs in Delta Live Tables. | Parallel execution planning, dependency resolution, PageRank analysis, relationship modelling. |
| Set Theory & Relational Algebra | The rules governing sets and operations on them (joins, unions, intersections). Forms the basis of SQL. | Spark SQL joins, filters, projections, aggregations; Delta ACID semantics. | Accurate transforms, dedupe, CDC logic, SCD modelling, consistent dataframes. |
| Optimization Theory | The study of choosing the “best” option given constraints — typically minimising or maximising some cost. | Cluster autoscaling, query optimisers, Catalyst optimizer, cost-based planning. | Efficient SQL plans, partition pruning, Z-order optimisation, reduced compute cost. |
| Time-Series Mathematics | The maths of sequences indexed by time — trends, seasonality, lagged effects, correlations. | Window functions, lag/lead, frequency resampling, streaming micro-batching. | Forecasting, trend analysis, late-arrival handling, watermarking, real-time pipelines. |
| Geometry & Geospatial Mathematics | Distances, angles, coordinate systems, shapes, and spatial relationships. | H3 indexing, distance calculations, coordinate transforms. | Hotspot maps, route optimisation, spatial clustering. |
| Information Theory | The study of information, randomness, entropy, and coding. Key to hashing and pseudonymisation. | Hashing, tokenisation, entropy considerations in PII masking libraries. | Secure PII concealment, dedupe hashing, confidentiality-preserving data pipelines. |
| Finite-State Machines | Systems with a limited number of states and defined transitions between them. | Workflow orchestration, state transitions, notebook task dependencies. | Reliable pipelines, retries, fault-tolerant job execution, sub-DAG logic. |
| Approximation & Sketching Algorithms | Lightweight maths for “good enough” answers when exact calculations are too expensive. | HyperLogLog, approximate quantiles, bloom filters in Spark SQL. | Fast distinct counts, join elimination, memory-efficient profiling. |
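The Information Theory row maps directly onto a common pattern: keyed hashing (HMAC) for pseudonymising a PII column. The sketch below uses only the Python standard library; the key is a placeholder, not a real secret — in Databricks it would come from a secret scope (Key Vault / Secrets Manager / Secret Manager per cloud):

```python
import hashlib
import hmac

# Placeholder key for illustration only; in practice, fetch from a secret store.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymise(value: str) -> str:
    """Deterministic, keyed, irreversible token for a PII value."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

t1 = pseudonymise("jane.doe@example.com")
t2 = pseudonymise("jane.doe@example.com")
assert t1 == t2       # deterministic: the token still supports joins and dedupe
assert len(t1) == 64  # SHA-256 hex digest
# Irreversible without the key: the raw email never needs to leave Bronze.
```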