Main Cloud Databases — Quick Reference
A concise map of leading cloud databases with plain-English descriptions, technical notes, common use cases, and data-type support: Str (structured), Sst (semi-structured), and Uns (unstructured).
Relational & Analytics (SQL)
Service | Non-technical description | Technical description | Typical use case | Str | Sst | Uns |
---|---|---|---|---|---|---|
AWS RDS MySQL / PostgreSQL / SQL Server / Oracle / MariaDB | Managed versions of the classic SQL databases. | PaaS relational engines with backups, patching, Multi-AZ HA. | OLTP apps; software that expects a SQL DB. | ✅ | ⚠️ (JSON in Postgres/MySQL) | ❌ |
Amazon Aurora MySQL / PostgreSQL-compatible | Cloud-optimised SQL with higher performance/HA. | Distributed storage (6-way replication), read replicas, serverless autoscaling. | High-throughput OLTP; microservices backends. | ✅ | ⚠️ (JSON columns) | ❌ |
Azure SQL Database | Managed SQL Server in Azure. | PaaS SQL with elastic pools, serverless compute, built-in HA. | Line-of-business apps; reporting stores. | ✅ | ⚠️ (JSON) | ❌ |
Azure Database for PostgreSQL / MySQL | Managed PostgreSQL/MySQL. | HA “flexible server”; PostgreSQL extensions. | Modern app stacks needing managed OSS SQL. | ✅ | ⚠️ (JSON/JSONB; JSON) | ❌ |
GCP Cloud SQL PostgreSQL / MySQL / SQL Server | Google’s managed SQL trio. | Managed instances, replicas, automated ops. | Web/mobile backends; small–mid OLTP. | ✅ | ⚠️ | ❌ |
Google Cloud Spanner | “SQL that scales globally.” | Horizontally scalable, strongly consistent distributed SQL; ANSI SQL, transactions, JSON type. | Global OLTP (fintech, gaming, SaaS). | ✅ | ⚠️ (JSON) | ❌ |
Amazon Redshift | AWS data warehouse. | MPP columnar SQL engine; SUPER type for semi-structured. | Enterprise BI; ELT at scale. | ✅ | ✅ | ❌ |
Azure Synapse Dedicated SQL Pool | Azure data warehouse. | MPP columnar SQL; PolyBase/Copy for ingestion. | Enterprise BI on Azure. | ✅ | ⚠️ (OPENJSON/PolyBase) | ❌ |
Google BigQuery | Serverless analytics warehouse. | Columnar, ANSI SQL, massive parallelism; native JSON/ARRAY; external tables. | Ad-hoc analytics, ELT, ML-on-SQL. | ✅ | ✅ | ⚠️ (via external tables/GCS) |
Snowflake Multi-cloud | Cloud data platform for analytics. | Elastic compute/storage; VARIANT for semi-structured; stages/external access. | Unified warehouse + data sharing. | ✅ | ✅ (JSON/Parquet/Avro/XML) | ⚠️ (files via stages) |
NoSQL (Key-value, Document, Wide-column)
Service | Non-technical description | Technical description | Typical use case | Str | Sst | Uns |
---|---|---|---|---|---|---|
Amazon DynamoDB | Serverless key-value/document store that never blinks. | Partitioned KV/JSON docs, single-digit-ms latency, auto scaling, streams. | High-scale apps, session/carts, IoT. | ⚠️ | ✅ | ❌ (binaries ≤400KB) |
Azure Cosmos DB | Global multi-model NoSQL. | APIs: Core (SQL/doc), Mongo, Cassandra, Gremlin, Table; multi-region, tunable consistency. | Low-latency global apps, catalogs. | ⚠️ | ✅ | ❌ |
Google Firestore | Serverless document database. | Hierarchical JSON docs; ACID transactions (single- and multi-document); real-time listeners. | Mobile/web apps, profiles, settings. | ⚠️ | ✅ | ❌ |
Google Bigtable | Massive time-series/wide-column store. | Sparse, distributed HBase-compatible database. | IoT/telemetry, ad tech, TS data. | ⚠️ | ✅ | ❌ |
Amazon Keyspaces (for Apache Cassandra) | Managed Cassandra. | Serverless Cassandra-compatible wide-column store. | Time-series, high-write workloads. | ⚠️ | ✅ | ❌ |
MongoDB Atlas (multi-cloud) | Managed MongoDB. | Document model, flexible schema, single-document atomicity plus multi-document ACID transactions; GridFS for files. | Content/user data, catalogs. | ⚠️ | ✅ | ⚠️ (via GridFS) |
Graph, Search & Log/Telemetry (Specialised)
Service | Non-technical description | Technical description | Typical use case | Str | Sst | Uns |
---|---|---|---|---|---|---|
Amazon Neptune | Managed graph database. | Property graph (Gremlin) & RDF (SPARQL), ACID. | Knowledge graphs, recommendations. | ⚠️ | ✅ | ❌ |
Cosmos DB (Gremlin API) | Graph on Cosmos. | Gremlin traversal on a distributed store. | Social networks, network topology. | ⚠️ | ✅ | ❌ |
Amazon OpenSearch Service | Managed search & analytics. | Lucene-based inverted index; JSON docs; full-text search + aggregations. | Log analytics, app/site search. | ⚠️ | ✅ | ✅ (indexes text; binaries in S3) |
Azure Data Explorer (Kusto) | Fast log/time-series analytics. | Columnar engine with KQL; semi-structured ingestion. | Telemetry, security analytics. | ⚠️ | ✅ | ❌ |
Tip: store large binaries (images, PDFs, media) in cloud object storage (e.g., S3, Azure Blob, Google Cloud Storage) and reference them from your database.
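The tip above can be sketched in a few lines. This is a minimal illustration, assuming an in-memory SQLite database as a stand-in for any of the relational services listed; the table name, column names, and the object URI are hypothetical, and no bucket is contacted.

```python
import sqlite3

# Hypothetical object URI: the binary itself would live in object storage.
EXAMPLE_URI = "s3://example-bucket/invoices/2024/inv-0001.pdf"


def create_schema(conn):
    """Create a table that stores a pointer to the binary, not the binary."""
    conn.execute(
        """
        CREATE TABLE documents (
            id           INTEGER PRIMARY KEY,
            title        TEXT NOT NULL,
            object_uri   TEXT NOT NULL UNIQUE,  -- reference into object storage
            content_type TEXT NOT NULL,
            size_bytes   INTEGER NOT NULL
        )
        """
    )


def register_document(conn, title, object_uri, content_type, size_bytes):
    """Record metadata for an object that was uploaded elsewhere."""
    cur = conn.execute(
        "INSERT INTO documents (title, object_uri, content_type, size_bytes) "
        "VALUES (?, ?, ?, ?)",
        (title, object_uri, content_type, size_bytes),
    )
    return cur.lastrowid


def lookup_uri(conn, doc_id):
    """Return the object URI for a document id, or None if it is unknown."""
    row = conn.execute(
        "SELECT object_uri FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
create_schema(conn)
doc_id = register_document(conn, "Invoice 0001", EXAMPLE_URI, "application/pdf", 482_133)
print(lookup_uri(conn, doc_id))
```

The database row stays small and queryable, while the application fetches the file from object storage using the stored URI when it is actually needed.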
Main Cloud Data Pipelines — Quick Reference
A practical map of batch/ELT, streaming/CDC, and orchestration options. Each row includes a plain-English summary, a technical note, a common use case, and data-type support.
Batch / ELT Pipelines
Service | Non-technical description | Technical description | Typical use case | Str | Sst | Uns |
---|---|---|---|---|---|---|
AWS Glue | Managed ETL to load and transform data. | Serverless Spark jobs, crawlers, Data Catalog, notebooks; job bookmarks. | Batch ingest + transforms to S3/Redshift/Lakehouse. | ✅ | ✅ (JSON/Parquet/Avro) | ⚠️ (UDFs/pass-through) |
Azure Data Factory incl. Synapse Pipelines | Azure’s GUI pipelines for copy and transform. | 100+ connectors, Mapping Data Flows (Spark), triggers, CI/CD. | Lift-and-shift ETL/ELT to ADLS/Synapse/Snowflake. | ✅ | ✅ | ⚠️ (copy/metadata) |
Google Cloud Dataflow | Google’s managed data processing service. | Fully managed Apache Beam runners (batch & streaming), autoscaling workers. | ELT/ETL to BigQuery/Cloud Storage. | ✅ | ✅ | ✅ (custom DoFns) |
Google Cloud Dataproc | Managed Spark/Hadoop clusters. | Ephemeral or long-running Spark, Hive, Hadoop; JARs and notebooks. | Modernise legacy ETL; Spark jobs at scale. | ✅ | ✅ | ⚠️ |
Databricks Delta Live Tables (DLT) | Declarative pipelines on a lakehouse. | Managed Spark/Delta; expectations (DQ), CDC, lineage; Bronze→Silver→Gold. | Lakehouse ELT with quality rules. | ✅ | ✅ | ⚠️ (file extract/ML) |
Snowflake Snowpipe / Tasks | Continuous or batch loads into Snowflake. | Event-driven ingest from object storage; Streams/Tasks for ELT orchestration. | Near-real-time file ingest and transforms in-warehouse. | ✅ | ✅ (VARIANT) | ⚠️ (files via stages) |
Fivetran | Point-and-click SaaS ELT. | Managed connectors for DBs/SaaS; auto schema evolution to DW/Lake. | Rapid onboarding to Snowflake/BigQuery/Redshift. | ✅ | ✅ | ❌ |
Matillion | Visual ELT for cloud warehouses. | Push-down SQL to Snowflake/Redshift/BigQuery; orchestration and components. | Team-owned ELT with versioning. | ✅ | ✅ | ❌ |
Informatica IICS | Enterprise integration in the cloud. | Mappings, CDC, data quality, MDM tie-ins, governance. | Regulated/complex estates and hybrid integration. | ✅ | ✅ | ⚠️ (adapters/custom) |
Airbyte Managed or OSS | Open-source ELT connectors. | Connector SDK, CDC support, sync to lakes/warehouses. | Cost-effective ELT and custom sources. | ✅ | ✅ | ❌ |
Streaming / CDC (Near-real-time)
Service | Non-technical description | Technical description | Typical use case | Str | Sst | Uns |
---|---|---|---|---|---|---|
Amazon Kinesis Streams & Firehose | Real-time pipes on AWS. | Streams for ingestion; Firehose buffers and delivers to S3/Redshift/OpenSearch. | Clickstreams, IoT telemetry, app events. | ✅ | ✅ | ⚠️ (binary pass-through) |
Amazon MSK Managed Kafka | Kafka as a managed service. | Managed brokers, IAM/VPC, Kafka Connect integrations. | High-throughput event backbone. | ✅ | ✅ | ⚠️ |
AWS DMS (Database Migration Service) | Continuous DB replication. | CDC from relational sources to S3/Kinesis/Redshift/other DBs. | Legacy→cloud replication and cutovers. | ✅ | ⚠️ | ❌ |
Azure Event Hubs | Azure’s big event pipe. | Partitioned event ingestion with low latency; Kafka-compatible endpoint. | Telemetry, logs, stream ingestion. | ✅ | ✅ | ⚠️ |
Azure Stream Analytics | SQL-like stream processing. | Windowing joins/aggregations; outputs to ADLS/SQL/Power BI. | Real-time dashboards and anomaly detection. | ✅ | ✅ | ❌ |
Google Pub/Sub | Google’s global event bus. | Exactly-once delivery option, push/pull, ordering keys; integrates with Dataflow. | Event ingestion for Dataflow/BigQuery. | ✅ | ✅ | ⚠️ |
Google Datastream | Serverless CDC on GCP. | Change data capture from DBs to BigQuery/Cloud Storage. | Low-ops CDC pipelines and cutovers. | ✅ | ⚠️ | ❌ |
Confluent Cloud Kafka + Connect + ksqlDB | Fully managed Kafka across clouds. | Kafka core with managed Connect, Schema Registry, ksqlDB. | Cross-cloud streaming backbone. | ✅ | ✅ | ⚠️ |
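The windowed aggregations mentioned for stream processors (for example the Azure Stream Analytics row) can be illustrated with a tumbling-window count. This is a toy sketch in plain Python, not any service's API: it assumes events arrive as (epoch_seconds, key) pairs and ignores late data and watermarks, which real engines have to handle.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) in fixed, non-overlapping windows.

    events: iterable of (epoch_seconds, key) tuples (hypothetical shape).
    """
    counts = Counter()
    for ts, key in events:
        # Each timestamp belongs to exactly one window: floor to the window size.
        window_start = (ts // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Illustrative events only.
events = [
    (0, "login"), (12, "click"), (59, "click"),
    (60, "click"), (61, "login"), (125, "click"),
]
result = tumbling_window_counts(events, 60)
for (window_start, key), n in sorted(result.items()):
    print(window_start, key, n)
```

Because the windows do not overlap, every event is counted once; sliding or hopping windows would instead assign an event to several windows.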
Orchestration / Workflow (Run the pipelines)
Service | Non-technical description | Technical description | Typical use case | Str | Sst | Uns |
---|---|---|---|---|---|---|
AWS MWAA Managed Airflow | Airflow without the ops. | Managed schedulers/workers; DAGs as code; AWS integrations. | Complex dependency DAGs on AWS. | ✅ | ✅ | ⚠️ (move/process files) |
Google Cloud Composer Managed Airflow | GCP’s managed Airflow. | GKE-based Airflow with GCP hooks and integrations. | Orchestrate Dataflow/BigQuery/Dataproc. | ✅ | ✅ | ⚠️ |
Azure Data Factory as orchestrator | Triggers and pipelines to coordinate jobs. | Time- or event-based triggers, dependencies, retries, self-hosted runtimes. | End-to-end Azure data workflows. | ✅ | ✅ | ⚠️ |
AWS Step Functions | Serverless workflow engine. | State machines for ETL, retries, parallelism; integrates with Glue/Lambda. | Glue/Spark jobs + Lambdas orchestration. | ✅ | ✅ | ⚠️ |
Databricks Jobs | Schedule and run lakehouse jobs. | Task graphs, cluster policies, DLT integration, Git ops. | Operationalise notebooks/SQL/ML on Databricks. | ✅ | ✅ | ⚠️ |
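What the orchestrators in this table have in common is running tasks in dependency order with retries. The sketch below shows that idea using only the Python standard library (`graphlib`, Python 3.9+); it is not Airflow, Step Functions, or Data Factory code, and the task names and the `run_dag` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def run_dag(dependencies, tasks, max_retries=2):
    """Run tasks so that every task starts after its predecessors succeed.

    dependencies: dict mapping task name -> set of predecessor names.
    tasks: dict mapping task name -> zero-argument callable.
    Returns the list of task names in the order they completed.
    """
    completed = []
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(1, max_retries + 2):
            try:
                tasks[name]()
                completed.append(name)
                break
            except Exception:
                # Re-raise once the retry budget is exhausted.
                if attempt == max_retries + 1:
                    raise
    return completed


attempts = {"transform": 0}


def ingest():
    pass


def transform():
    # Fails on the first call to simulate a transient error.
    attempts["transform"] += 1
    if attempts["transform"] == 1:
        raise RuntimeError("transient failure")


def load():
    pass


dependencies = {"ingest": set(), "transform": {"ingest"}, "load": {"transform"}}
order = run_dag(dependencies, {"ingest": ingest, "transform": transform, "load": load})
print(order, attempts)
```

Managed orchestrators add scheduling, backoff between retries, parallel execution of independent branches, and run history on top of this basic ordering.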
Tip: Large binaries (images, PDFs, media) belong in cloud object storage (S3, Azure Blob, Google Cloud Storage); your pipelines can reference or transform them as needed.
Main Cloud Compute Options — Quick Reference
A practical map of compute families (VMs, containers, serverless, batch/HPC, big data/stream, ML). Each row includes a plain-English summary, a technical note, a common use case, and data-type handling.
Virtual Machines (VMs)
Service | Non-technical description | Technical description | Typical use case | Str | Sst | Uns |
---|---|---|---|---|---|---|
AWS EC2 | Raw virtual servers you configure as you like. | Wide instance families (CPU/GPU/ARM), Auto Scaling, Spot, placement groups, custom AMIs. | Legacy apps, custom stacks, high control/observability. | ✅ | ✅ | ✅ |
Azure Virtual Machines | Microsoft’s managed virtual servers. | VM Scale Sets, Hybrid benefits, proximity placement, Windows/Linux images. | Windows-heavy estates, hybrid lift-and-shift. | ✅ | ✅ | ✅ |
Google Compute Engine | Google’s on-demand VMs. | Custom machine types, preemptible VMs, live migration, sole-tenant nodes. | Custom runtimes, cost-tuned fleets, HPC baselines. | ✅ | ✅ | ✅ |
Containers & Kubernetes
Service | Non-technical description | Technical description | Typical use case | Str | Sst | Uns |
---|---|---|---|---|---|---|
Managed Kubernetes (EKS / AKS / GKE) | Kubernetes clusters without the control-plane pain. | Managed API servers, node pools, autoscaling, add-ons (Ingress, CSI, CNI), GPU pools. | Microservices, data/ML platforms, multi-tenant apps. | ✅ | ✅ | ✅ |
Serverless containers (Cloud Run / Azure Container Apps) | Run a container from zero to scale without managing servers. | Per-request autoscale to zero, HTTP/async triggers, revisions, simple networking. | APIs, event workers, lightweight ETL/ML inference. | ✅ | ✅ | ⚠️ (short-lived; external storage) |
Amazon ECS (Elastic Container Service) / AWS Fargate | Orchestrated containers on AWS; Fargate removes servers. | Task/Service model, service discovery, IAM, capacity providers; Fargate = serverless execution. | Batch jobs, APIs, back-office workers. | ✅ | ✅ | ✅ |
Serverless Functions (FaaS)
Service | Non-technical description | Technical description | Typical use case | Str | Sst | Uns |
---|---|---|---|---|---|---|
AWS Lambda | Run code on events; no servers to manage. | Event triggers (API, S3, Kafka), ephemeral runtime, concurrency scaling, extensions. | Event processing, light ETL, API backends. | ✅ | ✅ | ⚠️ (timeout/memory limits) |
Azure Functions | Event-driven functions on Azure. | Bindings (HTTP/Queue/Blob/Cosmos), Durable Functions, consumption/premium plans. | Workflows, integrations, reactive tasks. | ✅ | ✅ | ⚠️ |
Google Cloud Functions | Functions as a service on GCP. | Gen 2 on Cloud Run; triggers via Pub/Sub/Storage/HTTP, autoscale. | Event glue, small transforms, webhooks. | ✅ | ✅ | ⚠️ |
PaaS App Platforms
Service | Non-technical description | Technical description | Typical use case | Str | Sst | Uns |
---|---|---|---|---|---|---|
Azure App Service | Deploy web apps and APIs without servers. | Managed runtime (Windows/Linux), slots, autoscale, VNet integration. | Corporate web apps, APIs, portals. | ✅ | ✅ | ⚠️ (external storage/CDN) |
Google App Engine | Google’s original PaaS for apps. | Standard/Flexible environments, autoscale, built-in logging, services/versions. | Multi-service web apps, rapid prototypes. | ✅ | ✅ | ⚠️ |
AWS Elastic Beanstalk | Upload app → platform handles the rest. | Managed provisioning of EC2/ALB/ASG, health checks, rolling updates. | 12-factor apps, quick lifts to AWS PaaS. | ✅ | ✅ | ⚠️ |
Batch & HPC
Service | Non-technical description | Technical description | Typical use case | Str | Sst | Uns |
---|---|---|---|---|---|---|
AWS Batch | Queue up container jobs at scale, pay per compute. | Managed job queues/compute envs, Spot integration, array jobs, GPU support. | Rendering, science/engineering, nightly crunches. | ✅ | ✅ | ✅ |
Azure Batch | Large-scale scheduled compute on Azure. | Pool/Job/Task model, auto-scale pools, low-priority VMs, container support. | Simulation, ETL batches, media processing. | ✅ | ✅ | ✅ |
Google Cloud Batch | Fully managed batch job service. | Autoscaled fleets, preemptible VMs, GPU/TPU options, regional queues. | Video/ML preprocessing, parameter sweeps. | ✅ | ✅ | ✅ |
Big Data & Stream Processing
Service | Non-technical description | Technical description | Typical use case | Str | Sst | Uns |
---|---|---|---|---|---|---|
Amazon EMR | Managed Spark/Hadoop for big data. | EMR on EC2/EKS, autoscaling, Spot, HDFS/S3 storage, many runtimes (Spark/Hive/Presto). | ETL at scale, lake processing, ML feature builds. | ✅ | ✅ | ✅ |
Google Dataproc | Spark/Hadoop on GCP with fast startup. | Ephemeral clusters, autoscaling, component gateway; GCS/BQ integrations. | Budget-friendly Spark jobs, modernised ETL. | ✅ | ✅ | ✅ |
Azure Synapse Spark | Apache Spark inside Synapse. | Serverless/pooled Spark, notebooks, Delta/Parquet, integrated pipelines. | Lakehouse transforms, notebooks, data exploration. | ✅ | ✅ | ✅ |
Databricks Multi-cloud | Lakehouse compute for data + AI. | Managed Spark/Photon, Delta Lake, DLT, MLflow; jobs & clusters. | ELT/ML/AI platforms, collaborative notebooks. | ✅ | ✅ | ✅ |
Stream processing (Kinesis Data Analytics / Dataflow / Stream Analytics) | Real-time compute over event streams. | Apache Flink (KDA), Apache Beam (Dataflow), SQL windows (ASA); stateful operators. | Real-time ETL, anomaly detection, dashboards. | ✅ | ✅ | ⚠️ (typically references objects) |
ML / AI Managed Compute
Service | Non-technical description | Technical description | Typical use case | Str | Sst | Uns |
---|---|---|---|---|---|---|
AWS SageMaker | Managed ML training and hosting. | Studio, training jobs, endpoints, pipelines; GPU/CPU; auto scaling; data connectors. | Model training, batch/real-time inference. | ✅ | ✅ | ✅ (images/audio/text via libs) |
Azure Machine Learning | Azure’s end-to-end ML platform. | Designer/SDK, pipelines, managed endpoints, AutoML, AKS/ACI integration. | Enterprise ML ops, governed deployment. | ✅ | ✅ | ✅ |
Google Vertex AI | Unified ML/AI workbench on GCP. | Workbench, AutoML/Custom, pipelines, endpoints, TPU/GPU support. | Vision/NLP/tabular ML, scalable inference. | ✅ | ✅ | ✅ |
Note: “Support” here means the compute is a good fit to process that data type. Large binaries usually live in object storage (S3/Blob/GCS) and are processed from there.
Main Cloud Storage Options — Quick Reference
Core storage services across clouds. Each row includes a plain-English summary, a technical note, a common use case, and whether it suits Str (Structured), Sst (Semi-structured), and Uns (Unstructured) data.
Object Storage (Data Lakes)
Service | Non-technical description | Technical description | Typical use case | Str | Sst | Uns |
---|---|---|---|---|---|---|
Amazon S3 | Durable, low-cost storage for any files. | Object store; 11 nines (99.999999999%) durability; lifecycle to Standard-IA/Glacier; events; S3 Select. | Data lakes, backups, analytics staging, media. | ✅ (CSV/Parquet files) | ✅ (JSON/Avro/Parquet) | ✅ (images/video/PDFs) |
Azure Blob Storage incl. ADLS Gen2 | Azure’s universal bucket for files. | Hot/Cool/Archive tiers; ADLS Gen2 adds hierarchical namespace & ACLs. | Lakes on ADLS, archival, ML/analytics landing zones. | ✅ | ✅ | ✅ |
Google Cloud Storage (GCS) | Google’s object storage for any data. | Multi/dual-region; lifecycle; notifications; Autoclass tiering. | Data lakes, media libraries, BQ external tables. | ✅ | ✅ | ✅ |
Microsoft OneLake Fabric | Unified data lake for Fabric workspaces. | ADLS Gen2 under the hood; shortcuts; item-level governance. | Enterprise lake tightly integrated with Fabric/Power BI. | ✅ | ✅ | ✅ |
File Storage (NFS / SMB)
Service | Non-technical description | Technical description | Typical use case | Str | Sst | Uns |
---|---|---|---|---|---|---|
Amazon EFS | Shared Linux file system. | NFSv4.1, elastic scale, multi-AZ, burst/perf modes. | Lifted apps needing POSIX; containers; user home dirs. | ✅ (files) | ✅ | ✅ |
Amazon FSx (Windows File Server / Lustre / NetApp ONTAP) | Managed Windows or high-speed file systems. | SMB (Windows), POSIX high-throughput (Lustre), ONTAP features (snap/clone). | Windows shares, HPC scratch, enterprise NAS offload. | ✅ | ✅ | ✅ |
Azure Files + Azure NetApp Files | Managed SMB/NFS shares. | AD integration, performance tiers; NetApp = ultra-low latency. | Shared drives, VDI profiles, SAP app shares. | ✅ | ✅ | ✅ |
Google Filestore | Managed NFS for GCP. | POSIX-compliant NFS with zonal/regional tiers. | GKE shared volumes, content repos, media workflows. | ✅ | ✅ | ✅ |
Block Storage (Attach to VMs/Containers)
Service | Non-technical description | Technical description | Typical use case | Str | Sst | Uns |
---|---|---|---|---|---|---|
AWS EBS | Disks for EC2. | SSD/HDD tiers; provisioned IOPS; snapshots; encryption; Multi-Attach (io2). | Databases, low-latency apps, boot volumes. | ✅ (DB files) | ✅ | ✅ |
Azure Managed Disks | Disks for Azure VMs. | Premium/Ultra SSD; ZRS options; snapshots; shared disks. | SAP/app servers; high-IOPS databases. | ✅ | ✅ | ✅ |
Google Persistent Disk + Local SSD | Disks for Compute Engine/GKE. | Balanced/SSD/HDD; regional PD; snapshots; Local SSD for ultra-low latency. | Relational DBs, stateful services, caches. | ✅ | ✅ | ✅ |
Lakehouse Table Formats (on Object Storage)
Service/Format | Non-technical description | Technical description | Typical use case | Str | Sst | Uns |
---|---|---|---|---|---|---|
Delta Lake Databricks/Open | Tables on your lake with ACID. | Transaction log + Parquet; time travel; upserts/merges; schema evolution. | Reliable lakehouse ELT; CDC merges; analytics. | ✅ | ✅ (JSON in cols) | ⚠️ (store files alongside) |
Apache Iceberg | Open table format for big lakes. | Snapshot isolation; hidden partitioning; multi-engine (Spark/Trino/…). | Engine-agnostic lakehouse tables & governance. | ✅ | ✅ | ⚠️ |
Apache Hudi | Incremental tables on a lake. | COW/MOR storage; record-level updates; indexes; Spark/Presto/Trino. | Near-real-time upserts; incremental pipelines. | ✅ | ✅ | ⚠️ |
Archival & Cold Storage
Service | Non-technical description | Technical description | Typical use case | Str | Sst | Uns |
---|---|---|---|---|---|---|
Amazon S3 Glacier (Instant Retrieval / Flexible Retrieval / Deep Archive) | Very cheap, slower access storage. | Archive storage classes with retrieval ranging from milliseconds (Instant Retrieval) to hours (Deep Archive); vaults; lifecycle policies. | Long-term retention, compliance copies, DR. | ✅ (files) | ✅ | ✅ |
Azure Archive Storage | Cold tier for blobs. | Low-cost tier with rehydration; immutable options (legal hold/WORM). | Backups, legal/regulated archives, rarely accessed data. | ✅ | ✅ | ✅ |
Google Coldline / Archive | Low-cost long-term storage. | GCS classes with retrieval SLA/cost trade-offs; lifecycle rules. | Compliance, backups, cold datasets. | ✅ | ✅ | ✅ |
Hybrid & Edge Storage (Gateways)
Service | Non-technical description | Technical description | Typical use case | Str | Sst | Uns |
---|---|---|---|---|---|---|
AWS Storage Gateway | On-prem file/tape/block that writes to S3. | File/Volume/Tape gateways; caching; secure sync to S3/Glacier. | Hybrid backup, archive to cloud, DR staging. | ✅ | ✅ | ✅ |
Azure Data Box / StorSimple | Edge appliances for bulk/hybrid storage. | Offline/online transfer; tiering to Blob; import/export devices. | Datacentre migrations; large dataset seeding. | ✅ | ✅ | ✅ |
Google Transfer Appliance / Storage Transfer | Move very large datasets into GCP. | Rugged appliances; scheduled transfers/sync from S3/HTTP/POSIX. | One-off or recurring lake ingestion at scale. | ✅ | ✅ | ✅ |
Notes: Object and file stores hold any data as files (great for Sst/Uns). Block storage shines for low-latency databases and app disks. Lakehouse table formats (Delta/Iceberg/Hudi) sit on object storage to bring ACID/table semantics to files.
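To make the "transaction log + Parquet" idea concrete, here is a toy model of how a table format can turn a pile of files into a versioned table. It is a deliberately simplified sketch, not the actual Delta Lake, Iceberg, or Hudi protocol: the log is a Python list, each commit only lists added and removed file names, and all file names are hypothetical.

```python
def snapshot(log, version=None):
    """Replay commits 0..version and return the set of live data files.

    log: list of commits; each commit is a dict with optional
         "add" and "remove" lists of file names.
    version: commit index to read as of (defaults to the latest).
    """
    if version is None:
        version = len(log) - 1
    live = set()
    for commit in log[: version + 1]:
        live -= set(commit.get("remove", []))
        live |= set(commit.get("add", []))
    return live


log = [
    {"add": ["part-000.parquet"]},                                   # version 0: initial load
    {"add": ["part-001.parquet"]},                                   # version 1: append
    {"remove": ["part-000.parquet"], "add": ["part-002.parquet"]},   # version 2: upsert rewrote a file
]

print(sorted(snapshot(log)))      # current table contents
print(sorted(snapshot(log, 1)))   # "time travel" to version 1
```

Readers only ever see files named by a committed log entry, which is what gives the table atomic commits; old data files stay on object storage until cleaned up, which is what makes reading an earlier version possible.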
🧱 Databricks Components Overview
Component | Non-Technical Description 📘 | Technical Description ⚙️ | Use Case/Scenario 🎯 |
---|---|---|---|
Workspace | Your collaborative project space | Hosts notebooks, jobs, repos, and ML experiments | Organising analytics or data science projects |
Clusters | Computing engine (like your personal AI but scalable) | Spark-based distributed compute environment (autoscaling or manual) | Run jobs, notebooks, ML models |
Jobs | Automated tasks or scheduled workflows | DAG-based execution of notebooks, scripts, or JARs | Nightly ETL jobs, ML training pipelines |
Notebooks | Interactive workspace for code and output | Supports Python, SQL, Scala, R, Markdown | Exploratory data analysis, prototyping |
SQL Editor | GUI to query data tables | Uses Databricks SQL for BI-friendly interface | Business users querying curated tables |
Delta Lake | Like a spreadsheet that remembers everything | ACID-compliant storage layer over Parquet | Reliable data lake tables with versioning |
Unity Catalog | Your data’s filing cabinet and bouncer | Central metadata & access control layer with RBAC | Secure multi-tenant access across clouds |
Lakehouse Platform | The Databricks "big idea" — warehouse + lake | Combines data lake scalability with DB-like performance | Unified platform for batch, stream, ML, and BI |
MLflow | Your model's history, packaging, and delivery | Open-source lifecycle management for ML models | Experiment tracking, model registry, deployment |
Repos | Built-in Git versioning | Git-backed source control for notebooks & jobs | Code collaboration and CI/CD |
Data Explorer | Browse your tables like folders | Visual UI to inspect catalog, schemas, tables | Data discovery and governance check |
Dashboards | Shareable reports and visuals | BI dashboard powered by SQL or notebooks | Stakeholder insights and KPIs |
Snowflake Components – Field Guide
Non-technical and technical descriptions, real-world scenarios, and common gotchas you actually meet on projects.
Category | Component | Non-technical description | Technical description | Scenario | Common gotchas |
---|---|---|---|---|---|
Compute | Virtual Warehouses | On/off “engines” that run your queries and loads. Pick a size, pay while it’s on, pause when idle. | MPP compute clusters with independent caches; scale up (bigger) or out (multi-cluster). Auto-suspend/resume; credits billed per-second (min 1 min). | Month-end: scale out to clear queues; auto-suspend overnight. | Leaving warehouses running; oversizing by habit; long auto-suspend (wasted minutes); each warehouse has its own warm cache. |
Pipelines | Snowpipe | Auto-loads new files as they land. Think “continual drip-feed” rather than big batches. | Event/REST-triggered micro-batch COPY from a stage; near-real-time ingestion; charges per file/processing. | Landing CSVs from cloud storage every few minutes into a raw table. | Millions of tiny files; incorrect file format options; duplicate handling; object store permissions; expecting true streaming latency. |
Pipelines | Snowpipe Streaming (low-latency) | Push rows directly into tables without creating files first. | Record-based ingest via SDK/connectors; bypasses staged files; designed for sub-minute latency. | Clickstream or IoT events needing fast availability for dashboards. | Event ordering/duplicates; commit semantics; schema evolution; monitoring cost of continuous ingest. |
Change data | Streams | A change tracker: “what changed since last time?” | CDC over base tables/views (insert/update/delete) with consumption offsets; pairs well with Tasks/Dynamic Tables. | Incremental processing from raw → curated without re-scanning full tables. | Offset advances only when the stream is consumed in a committed DML statement; unconsumed streams going stale past the retention window; DDL changes breaking downstream expectations. |
Orchestration | Tasks | Built-in scheduler for jobs. Like cron, but in Snowflake. | Time- or dependency-triggered DAGs; run SQL/SP on a chosen warehouse or serverless; track history. | Nightly dimension refresh after raw ingest completes. | Tasks left suspended; wrong warehouse size; timezone surprises; missing privileges. |
Transformation | Dynamic Tables | “Always-fresh” derived tables maintained for you. | Declarative objects with a defining query + freshness target; incremental maintenance and dependency tracking. | Keep a curated customer table in sync from multiple sources without hand-built DAGs. | Assuming they’re free—serverless work still bills; deep chains can hide cost/latency; not a cure-all for messy logic. |
Storage | Databases / Schemas / Tables / Views / MVs | Your folders and sheets: organise data, expose it as tables or views; MVs pre-compute results. | Tables (permanent/transient/temp), optional clustering; Views (secure/regular); Materialized Views with refresh costs/limits. | Speed up a gnarly join with a targeted MV; secure a view for consumers. | Using transient in prod (recovery limits); MVs on non-deterministic queries; over-clustering tiny tables. |
Lakehouse | External & Iceberg Tables | Query data where it lives in your cloud data lake, no copy required. | External tables over files in S3/GCS/Azure; Iceberg tables integrate open-table formats/catalogs; partition pruning depends on layout/metadata. | Blend Parquet in the lake with internal Snowflake tables for analytics. | Stale metadata; path/partition mismatches; small-file performance; storage IAM misconfig. |
Ingest I/O | Stages (internal/external) | Landing zones for files you load from or unload to. | Named internal or external stages with credentials/encryption; directory tables for discovery. | Partners drop files to an external stage you manage. | Credential leakage in URLs; wrong region; case-sensitive paths; forgetting stage privileges. |
Ingest I/O | File Formats | Reusable “how to read this file” settings. | Parsing options for CSV/JSON/Avro/Parquet/ORC/XML; referenced by COPY/Snowpipe/External Tables. | Standardise CSV quirks (nulls, quotes) across ingest jobs. | NULL vs empty strings; date/locale mismatches; hidden BOM/encoding issues. |
Governance | Masking Policies | Hide sensitive values dynamically based on who’s asking. | Policy-based column masking evaluated at query time; role/context aware; auditable. | Show last-4 of cards to support, full value to finance. | Applying to complex types; downstream tools assuming unmasked types; forgetting UNMASK-level access for admins. |
Governance | Row Access Policies | Only the rows you’re allowed to see, nothing more. | Predicate functions enforce row-level security on tables/views. | Country managers see only their region’s data by role. | Over-complex predicates; surprises when combined with filters; diagnosing “missing rows”. |
Cost control | Resource Monitors | Spend tripwires to stop runaway compute. | Credit thresholds with actions (notify/suspend) scoped to warehouses. | Cap a dev warehouse to prevent accidental 24/7 spend. | Only covers warehouses; serverless usage needs separate observation/alerts; period/timezone misunderstandings. |
Metadata | Information Schema & Account Usage | System tables you can query for lineage-ish insight, usage, and health. | Database-scoped Information Schema and account-wide views with ingestion latency; ideal for monitoring/reporting. | Build a usage dashboard showing query cost by team. | Data freshness lag; privilege gaps; mixing object names from different namespaces. |
Dev & Apps | UDF / UDTF / UDAF | Custom functions when SQL alone won’t cut it. | Extend with SQL/JS/Java/Python; UDTF returns tables; sandboxed execution. | Custom normaliser for messy product codes as a UDF. | Performance of row-by-row logic; library limits; cold-start penalties for some runtimes. |
Dev & Apps | Stored Procedures | Procedural scripts with variables and control flow. | Run in JS/SQL/Java/Python; can call SQL, manage transactions, and orchestrate tasks. | Automate schema rollout and grant routines for new projects. | Long-running work timing out; debugging ergonomics; executing on the wrong warehouse. |
Dev & Apps | Snowpark APIs | Write code (DataFrames) that runs close to the data. | Python/Scala/Java APIs with pushdown; UDF/UDTF authoring; package management via curated channels. | Data-prep pipelines in Python without leaving Snowflake. | Accidental client-side collects; serialization limits; dependency/version pinning. |
Dev & Apps | Snowpark Container Services | Run your containers next to your data—ML services, custom apps. | Managed container runtime integrated with Snowflake auth/networking; supports services and batch jobs. | Serve an in-house ML model via a low-latency API within your Snowflake account. | Oversized images; egress/network rules; security approvals; cost of always-on services. |
Collaboration | Secure Data Sharing | Share live data without copying or FTP drama. | Provider/Consumer accounts with shared objects; no data duplication; governed access. | Give a supplier read-only access to sales without exporting files. | Schema changes breaking consumers; accidentally sharing sensitive columns; region/cloud compatibility. |
Collaboration | Listings & Marketplace | An “app store” for data and apps—public or private listings. | Package data/apps with terms and versioning; distribute across orgs/regions/clouds. | Monetise an industry dataset to partners via private listings. | Legal/contracting lag; unclear update cadence; consumer entitlements drifting from expectations. |
Tip: pair this with a one-page “operating rules” note—suspend defaults, file-size targets, freshness SLAs, and a short naming convention.
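The warehouse billing rule in the table (per-second billing with a one-minute minimum) is easy to underestimate for short, frequent runs. The sketch below works through the arithmetic. It assumes the standard credits-per-hour ladder (X-Small = 1, doubling with each size) and treats each run as a fresh resume; check the current Snowflake pricing documentation before relying on these numbers. The `billed_credits` helper is hypothetical.

```python
# Assumed credits per hour for standard warehouses (verify against current pricing).
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8}


def billed_credits(size, run_seconds, clusters=1):
    """Estimate credits for one warehouse run after a resume.

    Per-second billing with a 60-second minimum, as described in the table.
    clusters is a simplification: every cluster is assumed to run the whole time.
    """
    billed_seconds = max(60, run_seconds)
    return CREDITS_PER_HOUR[size] * clusters * billed_seconds / 3600


# A 20-second query on a Medium warehouse is billed as a full minute.
short_run = billed_credits("M", 20)
full_minute = billed_credits("M", 60)
print(short_run, full_minute)

# One hour on X-Small, and 30 minutes on a 2-cluster Large.
print(billed_credits("XS", 3600))
print(billed_credits("L", 1800, clusters=2))
```

This is why a long auto-suspend and habitual oversizing both show up in the gotchas column: idle seconds and larger sizes multiply directly into credits.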
Dropdowns (Details) — Always Closed on Page Load
This page ensures <details> elements (dropdowns) are closed whenever the page is opened or refreshed, and also when restored from the back/forward cache.
Example Section A (starts open in markup)
Although this section has the open attribute in the HTML, the script will close it on load.
Example Section B
Example Section C
Top 20 Cloud Databases — Comparison
Purpose, plain-English & technical descriptions, use cases, pros/cons, and a scoring heat‑map (1–5: higher is better).
Database | Purpose | Non‑Technical Description | Technical Description | Primary Use Cases | Pros | Cons | Scale | Latency | Cost Predictability | Ecosystem | Ops Effort |
---|---|---|---|---|---|---|---|---|---|---|---|
Amazon DynamoDB | Serverless key‑value/document store | Massively scalable app data store for simple key lookups and flexible documents. | Fully managed NoSQL; consistent hashing partitions; adaptive capacity; streams; global tables. | High‑traffic web/mobile backends, IoT, gaming sessions, shopping carts. | Serverless scale; global tables; strong SLAs. | Query patterns limited; hot partition pitfalls; complex cost tuning. | 5 | 4 | 3 | 4 | 5 |
Google Cloud Spanner | Globally distributed relational DB | SQL database that scales across regions while keeping transactions consistent. | Distributed SQL with MVCC and TrueTime for external consistency; ANSI SQL; strong schemas. | Financial systems, inventory, multi‑region SaaS needing strong consistency. | Horizontal scale + SQL + transactions; multi‑region. | Higher cost; needs careful schema; tuning differs from single‑node RDBMS. | 5 | 4 | 3 | 4 | 4 |
Azure Cosmos DB | Multi‑model NoSQL (key‑value/doc/graph/column) | Low‑latency database with global distribution and flexible data models. | APIs for Core (SQL), MongoDB, Cassandra, Gremlin, Table; RU‑based throughput; multi‑region. | Global apps, personalization, IoT, event stores. | Global distribution; multi‑API; low latency at p99. | RU sizing can be tricky; cross‑partition queries can be costly. | 5 | 4 | 3 | 4 | 4 |
Amazon Aurora (MySQL/Postgres) | High‑performance relational (managed) | MySQL/Postgres‑compatible engine with better performance and failover. | Decoupled compute/storage; 6‑way replication; read replicas; serverless v2 options. | OLTP apps, SaaS backends, migrations from on‑prem RDBMS. | Drop‑in compatibility; strong HA; autoscaling options. | Region‑bound scaling; heavy writes may need sharding. | 4 | 4 | 4 | 5 | 4 |
Amazon Redshift | Cloud data warehouse | Columnar SQL warehouse for analytics at scale. | MPP, columnar storage, RA3 managed storage, Spectrum external tables; materialized views. | Enterprise BI, ELT analytics, semi‑structured via SUPER. | Mature ecosystem; performance features; concurrency scaling. | Cluster sizing decisions; Spectrum governance; workload mgmt. | 4 | 3 | 4 | 5 | 4 |
Google BigQuery | Serverless data warehouse | Analytics engine where you just run SQL and pay per query—no clusters to manage. | Dremel‑based columnar engine; separation of storage/compute; BI Engine caches. | Ad‑hoc analytics, ELT at scale, ML‑in‑warehouse, log analytics. | Near‑zero ops; great price/perf for bursty workloads. | Cost predictability lower with ad‑hoc users; quotas/limits. | 5 | 2 | 3 | 5 | 5 |
Snowflake | Cloud data platform/warehouse | Elastic SQL warehouse with easy scaling and cross‑cloud support. | Decoupled compute/storage; virtual warehouses; time travel; data sharing; native apps. | Enterprise analytics, data sharing, multi‑tenant analytics products. | Strong ecosystem; easy scaling; governance features. | Credit creep if unmanaged; proprietary features increase lock‑in. | 5 | 3 | 3 | 5 | 5 |
Databricks SQL (Delta Lakehouse) | Lakehouse SQL engine | SQL over Delta Lake on object storage—warehouse performance with lake flexibility. | Photon engine; Delta tables with ACID; Unity Catalog; serverless SQL warehouse. | BI on data lake, ELT at scale, medallion architectures. | Open formats; strong with streaming + ML adjacent. | Tuning required for small, chatty queries; cost‑by‑concurrency. | 5 | 3 | 3 | 5 | 4 |
Azure SQL Database | Managed relational (SQL Server) | SQL Server as a service—familiar T‑SQL with built‑in HA and backups. | Single database/elastic pools; Hyperscale; automatic tuning; AAD integration. | Line‑of‑business apps, SaaS multi‑tenant, reporting stores. | Rich SQL features; easy Azure integration; predictable tiers. | Vertical scaling limits vs distributed SQL; DTU confusion for newcomers. | 4 | 4 | 4 | 5 | 5 |
Google AlloyDB for PostgreSQL | High‑performance Postgres | Postgres‑compatible with faster analytics and OLTP, fully managed. | Disaggregated storage; columnar engine for analytics; automatic failover. | OLTP plus HTAP‑ish patterns, modern app backends. | Great Postgres perf; minimal ops; analytics boosts. | GCP‑only; migration from other engines needed. | 4 | 4 | 4 | 4 | 5 |
MongoDB Atlas | Managed document database | Flexible JSON document store with global clusters and rich developer tooling. | Replica sets, sharding, multi‑cloud; Atlas Search; triggers; Realm/App Services. | Content, catalogs, user profiles, event data. | Developer‑friendly; flexible schema; strong tools. | Cross‑document transactions limited; joins not native; costs scale with usage. | 4 | 4 | 3 | 5 | 5 |
DataStax Astra DB (Cassandra) | Managed wide‑column (Cassandra) | Cassandra as a service for write‑heavy, always‑on workloads. | Masterless ring; tunable consistency; Stargate APIs (CQL/REST/GraphQL). | IoT telemetry, messaging, time‑series with high ingest. | Linearly scalable writes; global availability; APIs. | Query flexibility limited; model by access pattern. | 5 | 4 | 3 | 4 | 5 |
Google Cloud Bigtable | Managed wide‑column (HBase‑like) | Single‑digit ms key‑value at petabyte scale—great for time‑series and personalization. | Sparse, distributed row store; SSD/HDD nodes; GC policies. | Time‑series, ad tech, personalization features, IoT. | Huge scale; predictable low latency if modeled right. | Single index (row key) mindset; no joins/aggregations. | 5 | 4 | 3 | 4 | 4 |
Redis Enterprise Cloud | In‑memory data store | Super‑fast cache/DB for sub‑millisecond reads/writes, with JSON and search options. | Redis with clustering, persistence, modules (JSON, Search, Bloom, TimeSeries). | Caching, session stores, leaderboards, real‑time features. | Ultra‑low latency; versatile modules; enterprise HA. | Memory cost; persistence/consistency trade‑offs. | 4 | 5 | 3 | 5 | 5 |
Elastic Cloud (Elasticsearch) | Search & analytics engine | Free‑text search and log analytics with dashboards (Kibana). | Inverted indexes, distributed shards/replicas; aggregations; ILM; vector search. | Search‑heavy apps, observability, security analytics. | Great search features; broad ecosystem; Kibana visualisation. | Query costs for wide scans; ops complexity for hot‑warm tiers. | 4 | 3 | 3 | 5 | 4 |
InfluxDB Cloud | Time‑series database | Purpose‑built for metrics and events over time with a fluent query language. | TSM/TSI storage, downsampling/retention; Flux/SQL interfaces; serverless options. | IoT metrics, monitoring, SRE/DevOps telemetry. | Time‑series ergonomics; tasks for downsampling. | Flux is niche (SQL emerging); cardinality pitfalls. | 4 | 4 | 3 | 4 | 5 |
Timescale Cloud (TimescaleDB) | Time‑series on PostgreSQL | Postgres with time‑series extensions—nice when you want SQL + time‑series. | Hypertables, compression, continuous aggregates; full SQL & Postgres ecosystem. | Industrial IoT, finance ticks, app metrics needing joins/SQL. | Full SQL power; easy analytics joins; good compression. | Not for ultra‑high ingest vs Bigtable/Cassandra tiers. | 4 | 4 | 4 | 5 | 5 |
CockroachDB (Managed/Dedicated) | Distributed SQL (Postgres wire) | Resilient SQL that scales horizontally with strong consistency. | Raft consensus, range‑based data distribution; Postgres wire compatibility. | Resilient SaaS backends, geo‑partitioned data, OLTP scale‑out. | Survivable regions; SQL + transactions; online scale. | Hot ranges if keys skew; some Postgres features differ. | 5 | 4 | 4 | 4 | 4 |
PlanetScale | Serverless MySQL (Vitess) | MySQL that can shard/scale underneath without changing app code. | Vitess control plane; branching; online schema changes; connection pooling. | Prod MySQL for SaaS; branching for safe changes; scale‑out reads. | Zero‑downtime schema changes; developer workflow wins. | MySQL‑only; some features limited by Vitess layer. | 4 | 4 | 4 | 4 | 5 |
Neo4j AuraDB | Managed graph database | Stores data as nodes and relationships—great for connected queries. | Property graph model; Cypher query language; native graph engines. | Fraud rings, recommendations, knowledge graphs, network analysis. | Expressive relationship queries; fast traversals. | Not ideal for big aggregations; different modeling mindset. | 3 | 3 | 4 | 4 | 5 |
Amazon RDS (PostgreSQL/MySQL) | Managed relational (classic) | Familiar relational databases without the patching and backups. | Managed instances, Multi‑AZ, read replicas, storage autoscaling. | Traditional apps, quick lifts from on‑prem, dependable OLTP. | Mature, predictable; wide community knowledge. | Instance‑bound scaling; manual sharding for big growth. | 4 | 4 | 4 | 5 | 4 |
Azure Database for PostgreSQL | Managed Postgres (Flexible Server) | PostgreSQL in Azure with managed HA and scaling. | Flexible Server, autoscale, zone‑redundant HA, PostgreSQL extension support. | Modern app backends needing Postgres features and Azure integration. | Strong AAD/Key Vault integration; familiar tooling. | Vertical scaling limits vs distributed SQL; regional. | 4 | 4 | 4 | 4 | 5 |
Google Cloud SQL | Managed MySQL/Postgres/SQL Server | Managed relational instances with backups and replicas handled for you. | HA configurations, read replicas, Private Service Connect. | Standard OLTP apps on GCP, quick migrations. | Straightforward; integrates with GCP IAM & VPC. | Instance‑bound; not for huge scale‑out by itself. | 3 | 4 | 4 | 4 | 5 |
Azure Synapse Dedicated SQL Pool | MPP warehouse (Azure) | Azure’s classic MPP warehouse for large-scale BI with predictable capacity. | Distributed compute, PolyBase, materialized views; workload isolation. | Enterprise BI, predictable SLAs, integrated Azure stack. | Predictable capacity; strong Azure integration. | Cluster ops vs serverless engines; less elastic than lakehouse. | 4 | 3 | 4 | 4 | 4 |
Amazon OpenSearch Service | Managed search/observability | Search and log analytics compatible with Elasticsearch APIs. | Shard/replica management, UltraWarm/Cold storage, OpenSearch Dashboards. | Search features, observability stacks on AWS. | Good AWS integration; familiar APIs. | Ops tuning for tiers/ILM; query costs for wide scans. | 4 | 3 | 3 | 4 | 4 |
Azure Cosmos DB for MongoDB | Mongo API on Cosmos | Mongo‑compatible API on Cosmos for global distribution with low latency. | RU/s throughput model; multi‑master write regions; Mongo wire protocol compatibility. | Global user data, catalogs, content stores needing Mongo semantics. | Global replication; automatic indexing; serverless option. | RU planning; cross‑partition costs; feature parity varies by version. | 5 | 4 | 3 | 4 | 4 |
Azure Database for MySQL | Managed MySQL (Flexible Server) | MySQL managed in Azure with HA and scaling, minus the babysitting. | Flexible Server, zone‑redundant HA, server parameter tuning, Azure Monitor integration. | Web apps, CMS, e‑commerce stacks on Azure. | Familiar engine; Azure security integrations. | Vertical scaling limits; regional. | 3 | 4 | 4 | 4 | 5 |
Firestore (Google Cloud) | Serverless document DB | Serverless JSON store with offline sync for mobile/web apps. | Hierarchical documents/collections, real‑time listeners, strong security rules. | Mobile/web backends, real‑time presence/chat, small‑team apps. | Near‑zero ops; great SDKs; realtime updates. | Query constraints; cost spikes with chatty patterns. | 4 | 4 | 3 | 4 | 5 |
SingleStoreDB Cloud | Distributed SQL + vectors | Fast SQL store for mixed OLTP/OLAP with vector search features. | Shared‑nothing distributed engine, columnstore + rowstore, pipelines. | Real‑time analytics, operational reporting, AI features with vectors. | Strong mixed‑workload performance; HTAP‑style design. | Vendor‑specific features; sizing still matters. | 4 | 4 | 3 | 4 | 4 |
TiDB Cloud | Distributed MySQL‑compatible | MySQL‑compatible database that scales horizontally with strong consistency. | TiKV (key‑value store) + TiDB SQL layer; Raft consensus; HTAP with TiFlash. | Scale‑out OLTP with MySQL compatibility; HTAP patterns. | Horizontal scale + SQL; HTAP via TiFlash. | Operational tuning for balance; ecosystem smaller than MySQL/Aurora. | 5 | 4 | 3 | 3 | 4 |
Scoring (1–5, higher is better): Scale = maximum horizontal/elastic capacity; Latency = suitability for low‑latency workloads; Cost Predictability = ease of forecasting monthly costs; Ecosystem = breadth of connectors and tooling; Ops Effort = how easy it is to run (5 = least operational effort).
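As a rough illustration of how the five scores might be combined into a single ranking, the sketch below computes a weighted average per database. The scores are copied from a handful of rows in the table above; the `Scores` class, the function names, and the `LATENCY_FIRST` weights are hypothetical and only meant to show the arithmetic, not a recommended weighting.

```python
from dataclasses import dataclass

# The five scoring columns from the comparison table.
CRITERIA = ("scale", "latency", "cost_predictability", "ecosystem", "ops_effort")


@dataclass(frozen=True)
class Scores:
    name: str
    scale: int
    latency: int
    cost_predictability: int
    ecosystem: int
    ops_effort: int  # 5 = easiest to run


# A small subset of rows, with scores copied from the table.
TABLE = [
    Scores("Amazon DynamoDB", 5, 4, 3, 4, 5),
    Scores("Google Cloud Spanner", 5, 4, 3, 4, 4),
    Scores("Google BigQuery", 5, 2, 3, 5, 5),
    Scores("Redis Enterprise Cloud", 4, 5, 3, 5, 5),
    Scores("Amazon RDS (PostgreSQL/MySQL)", 4, 4, 4, 5, 4),
]


def weighted_score(row: Scores, weights: dict) -> float:
    """Weighted average of a row's scores; the result stays on the 1-5 scale."""
    unknown = set(weights) - set(CRITERIA)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(getattr(row, c) * w for c, w in weights.items()) / total_weight


def rank(rows: list, weights: dict) -> list:
    """Rows sorted from best to worst weighted score (ties keep table order)."""
    return sorted(rows, key=lambda r: weighted_score(r, weights), reverse=True)


# Hypothetical weighting for a latency-sensitive workload: latency counts triple.
LATENCY_FIRST = {
    "scale": 1,
    "latency": 3,
    "cost_predictability": 1,
    "ecosystem": 1,
    "ops_effort": 1,
}

if __name__ == "__main__":
    for row in rank(TABLE, LATENCY_FIRST):
        print(f"{weighted_score(row, LATENCY_FIRST):.2f}  {row.name}")
```

Changing the weights changes the order, which is the point: the heat-map scores are inputs to a decision, and a single "best" database only exists relative to a stated set of priorities.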
- Global OLTP with strong consistency: Cloud Spanner, CockroachDB
- Write‑heavy time‑series/telemetry: DataStax Astra DB (Cassandra), Bigtable, InfluxDB
- Elastic analytics (serverless): BigQuery, Snowflake, Databricks SQL
- Relational with low ops in Azure: Azure SQL Database, Azure Database for PostgreSQL (the GCP analogue is AlloyDB for PostgreSQL)
- Document‑first apps: MongoDB Atlas, Cosmos DB
- Search/Observability: Elastic Cloud, OpenSearch
- In‑memory latency: Redis Enterprise Cloud
- MySQL at scale without drama: PlanetScale, TiDB Cloud