Data in Context
In the enterprise world, data isn’t just a by-product—it’s the connective tissue linking BSS, OSS, and ERP. BSS is usually where data first appears, whether it's a customer signing up, an order being placed, or a billing event triggered. OSS kicks in to provision the service, monitor its health, and keep things ticking, while ERP quietly reads from both and handles the heavy lifting: revenue, workforce, stock, and compliance. Between them, they touch every part of the CRUD lifecycle—create, read, update, delete—often in that exact order. But here’s the kicker: without clear data lineage, you’re just hoping everything adds up. Lineage tells you who changed what, when, and why—a breadcrumb trail that makes the difference between confidence and chaos when someone asks, “Where did this number come from?”
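To make that concrete, here is a minimal sketch (Python, with hypothetical field names) of what a lineage breadcrumb might capture: nothing more than who, what, when, why, and which system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageEvent:
    """One breadcrumb in the lineage trail: who changed what, when, and why."""
    entity: str         # e.g. "invoice_totals" -- the data object touched
    action: str         # one of the CRUD verbs: create / read / update / delete
    actor: str          # the user or service making the change
    reason: str         # why it happened -- the piece audits usually lack
    source_system: str  # BSS, OSS, or ERP
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# A billing change captured at source, so "Where did this number come from?"
# has an answer instead of a shrug.
event = LineageEvent(
    entity="invoice_totals",
    action="update",
    actor="billing-service",
    reason="rerate after plan change",
    source_system="BSS",
)
print(event)
```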
And that’s the point. Understanding the data is how you understand an organisation from soup to nuts. If the data’s been captured, any question can be answered—either directly at the level of detail captured, or by combining it, aggregating it, or inferring something new. That’s where AI starts earning its keep. But if the data hasn’t been captured, the question can’t be answered—not without some serious enhancement or guesswork. In short, data isn’t just a reflection of the business—it is the business. And knowing its lineage is how you stay in control of the story it tells.
Which brings us to security and control. If you want to manage data properly—never mind securing it—you need to know where it lives, where it’s working, and where it’s on the move. That means understanding data at rest (in databases, files, and lakes), in use (in memory, being computed), and in flight (moving between systems or modules). You need to know where it’s persisted, where it’s just hanging around in RAM, and where it’s crossing boundaries—especially if those boundaries span clouds, partners, or regulatory zones. Otherwise, you’re not really managing data—you’re crossing your fingers and calling it governance.
In fact, it might be more accurate to see this as a continuum: at one end, a data-centric (DC) approach, where decisions, systems, and safeguards are built on a clear understanding of the data’s lifecycle and flow; at the other, a narrative-driven (ND) approach, where assumptions stand in for facts and structure gives way to improvisation. Most organisations wobble somewhere in between—but only the DC end of the spectrum holds up under scrutiny, scale, or regulation.
Positioning Statement:
🗄️ "Think of a star schema like the well-organised filing cabinet. Not one of those bottomless pits full of dog-eared manila folders, but one where the files are sorted by topic, clearly labelled, and easy to grab. You only get value from it when you can actually find what you’re looking for — without rummaging around like a Victorian detective.
That’s the point — good structure makes useful answers possible. ✅
It’s not about picking sides between Data-Centric (DC) and Narrative-Driven (ND) methods like they’re rival soap opera characters. They’re not opposites — they’re more like dance partners.
1) First you organise the data (Star Schema 🧩)**
2) Then you put it to work (Procedural Logic ⚙️).
** Semi-structured data lends itself to flattening for integration, while metadata from unstructured sources can be harnessed to introduce structure and improve discoverability (see the flattening sketch after the TL;DR).
TL;DR: 🧭 Star schema gives you the organisation; procedure gives you the motion. Moving before you’re organised means you won’t arrive at your destination effectively or efficiently.
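As the footnote above suggests, flattening semi-structured data is often the first organising move. A minimal sketch, assuming pandas and an invented order payload:

```python
import pandas as pd

# A semi-structured order record, as it might arrive from a BSS API (illustrative shape).
orders = [
    {
        "order_id": 1001,
        "customer": {"id": "C42", "segment": "SME"},
        "lines": [
            {"sku": "BROADBAND-100", "qty": 1},
            {"sku": "ROUTER-STD", "qty": 2},
        ],
    }
]

# Flatten: one row per order line, with nested customer fields promoted to columns.
flat = pd.json_normalize(
    orders,
    record_path="lines",
    meta=["order_id", ["customer", "id"], ["customer", "segment"]],
)
print(flat)
# sku, qty, order_id, customer.id, customer.segment -- ready to join into a star schema
```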
Useful definitions for platforms involved with data lineage
What It Stands For | What It Actually Does | Where You’ll See It in the Wild |
---|---|---|
📊 Business Support System (BSS) | Handles the bits the customer sees — orders, billing, product setup, CRM. Basically, everything that makes money. | Taking an order online, billing it, changing a phone plan — all BSS territory. |
⚙️ Operations Support System (OSS) | Deals with the plumbing — provisioning, fault management, network operations. Keeps the lights on. | Think service activation, monitoring outages, and network diagnostics. If it’s behind the scenes and blinking, it’s OSS. |
🧱 Enterprise Resource Planning (ERP) | The digital HQ — finance, HR, supply chain, procurement. One giant database trying to keep the company from eating itself. | Payroll, expense approvals, stock levels, invoice runs. Usually backed by SAP, Oracle, or similar enterprise beast. |
🔄 Enterprise Application Integration (EAI) | The glue. Connects systems that were never meant to speak to each other, but now have to. | Syncing data between ERP and CRM, integrating legacy systems, handling API chaos. If systems are chatting nicely, thank EAI. |
Data states
📦 State | 🧠 What It Means | 🔐 Why It Matters |
---|---|---|
🛏️ Data at Rest | Stored data that’s not moving—sitting in a database, data lake, disk, or archive. | Must be encrypted and access-controlled. It’s a sitting duck if not locked down. |
⚙️ Data in Use | Data that’s being processed—held in memory, calculated, edited, or queried. | Vulnerable in RAM. Needs runtime protections like masking or secure enclaves. |
🚚 Data in Flight | Data moving between systems—across networks, APIs, or between components. | Needs strong encryption in transit (TLS, VPN). Otherwise, it’s open to interception. |
Quick Summary:
- At Rest – The data’s asleep. That doesn’t mean it’s safe—someone could still walk off with the hard drive.
- In Use – The data’s awake and working—visible in memory, spreadsheets, or dashboards. Sensitive? Handle with gloves.
- In Flight – The data’s on the move. You’d be mad to send it unbuckled. Strap it in with TLS or secure tunnels.
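For illustration, a hedged sketch of what protecting each state can look like in Python, with Fernet (from the `cryptography` package) standing in for at-rest encryption and the standard `ssl` module providing the TLS seatbelt. Key management and secure enclaves are deliberately out of scope here.

```python
import ssl
from cryptography.fernet import Fernet  # pip install cryptography

# --- Data at rest: encrypt before it hits the disk, decrypt on the way back. ---
key = Fernet.generate_key()   # in practice this lives in a key vault, not the code
box = Fernet(key)
ciphertext = box.encrypt(b"customer_id=C42, plan=BROADBAND-100")
open("record.bin", "wb").write(ciphertext)   # the sitting duck is now locked down

# --- Data in use: decrypt only for the moment it is needed, then let it go. ---
plaintext = box.decrypt(open("record.bin", "rb").read())
# ... compute on `plaintext` here; avoid logging it or caching it longer than needed ...

# --- Data in flight: never send it unbuckled. Wrap the socket in TLS. ---
context = ssl.create_default_context()   # verifies certificates by default
# with socket.create_connection((host, 443)) as sock:
#     with context.wrap_socket(sock, server_hostname=host) as tls_sock:
#         tls_sock.sendall(ciphertext)
```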
ETL & ELT Tools – Comparison Table
Tool | Style | Strengths | Watch Outs | Best Fit |
---|---|---|---|---|
Informatica | ETL | Enterprise-grade; strong governance, lineage, scheduling | Heavyweight; steep learning curve; pricey | Legacy systems, regulated sectors, hybrid cloud |
Azure Data Factory | ELT | Works well with Azure stack; flexible hybrid pipelines | UI can be awkward; lacks native data profiling | Azure-first orgs, cloud migrations, MS tech shops |
Matillion | ELT | Great for Snowflake/Redshift; visual, fast to deploy | Debugging is fiddly; less useful beyond cloud DWs | Snowflake/BigQuery-focused data teams |
Fivetran | ELT | Low-maintenance; fast SaaS ingestion | Limited transform layer; can get costly at scale | MVPs, quick BI enablement, SaaS integrations |
Talend | ETL | Open-source option; scripting-friendly | Java-heavy; needs tuning for performance | Custom ETL, open-source-friendly orgs |
dbt | ELT | Top-tier SQL transformations; Git/versioned | Only handles 'T' in ELT; no ingestion/load | Modelling in the warehouse or lakehouse |
Apache NiFi | Stream ETL | Real-time flows; visual routing; flexible control | Not built for BI; steeper learning curve | IoT, event streaming, on-prem/cloud hybrids |
Databricks Workflows | ELT | Unified batch/stream/ML; built for scale | Can be overkill for simple pipelines | AI/ML pipelines, streaming-first use cases |
Airbyte | ELT | Open-source; flexible ingestion connectors | Still maturing; more dev hands-on | Teams wanting custom pipelines, no lock-in |
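To make the ELT shape concrete, here is a small illustrative sketch using Python's built-in sqlite3 as a stand-in warehouse: load the raw data first, then transform it in place with SQL (the step dbt would own in a real stack). Table and column names are invented.

```python
import sqlite3

# sqlite3 stands in for the warehouse; in a real ELT stack the "L" would be
# Fivetran/Airbyte-style ingestion and the "T" a dbt-style SQL model.
con = sqlite3.connect(":memory:")

# L -- load the raw data as-is, warts and all.
con.execute("CREATE TABLE raw_orders (order_id INTEGER, amount TEXT, region TEXT)")
con.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "100.50", "EMEA"), (2, "75.00", "EMEA"), (3, "210.25", "APAC")],
)

# T -- transform inside the warehouse with plain SQL.
con.execute("""
    CREATE TABLE orders_clean AS
    SELECT order_id, CAST(amount AS REAL) AS amount, UPPER(region) AS region
    FROM raw_orders
""")

for row in con.execute("SELECT region, SUM(amount) FROM orders_clean GROUP BY region"):
    print(row)   # ('APAC', 210.25), ('EMEA', 175.5)
```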
Star Schema Structure
Every transaction (stuff that gets done) becomes meaningful when seen through the right dimensions (perspectives, because marketing and sales always agree with accounts!). That’s why I use star schema thinking on every assignment—clarity, context, and perspectives built in to understand and question the facts!
Aspect | Kimball's Motivation (I like Kimball) |
---|---|
User-Centric Design | Edgar F. ("Ted") Codd invented the relational model and formalised both relational algebra and relational calculus—laying the foundation for modern data systems. Codd himself later coined the term OLAP (Online Analytical Processing), and his influence reached deep into analytical processing. Ralph Kimball, together with Margy Ross and the Kimball Group, focused on making relational databases more effective for analytical (OLAP-style) workloads. They championed dimensional modelling—especially the star schema—which intentionally trades strict normalisation for performance, usability, and alignment with how business users naturally think. My conclusion is that their bottom-up methodology brought analytics closer to real-world decision-making. In short, Kimball, Ross, and their team translated complex data into something usable, scalable, and intuitive—and did so brilliantly. So why not build your data understanding on a foundation of star schema thinking? |
Simplicity and Ease of Use (increased autonomy) | The star schema gives a clear, intuitive structure that mirrors user understanding, enabling self-service analysis with less reliance on technical intermediaries. |
Query Performance | Optimized for performance, star schemas respond quickly to users’ queries, supporting real-time insights and fluid exploration of business questions. |
Data Consistency | Conformed dimensions ensure users see consistent definitions and metrics across reports, building trust and reducing confusion. |
Flexibility | A single star schema can support diverse reporting needs and dashboards, adapting to users’ evolving analytical questions. |
Scalability | Business processes can be added incrementally, allowing data warehouse growth to follow user demand and organizational priorities. |
Separation of Concerns | Keeping facts and dimensions distinct reflects how users separate events from descriptive context, aiding cognitive clarity. |
Historical Tracking | Slowly Changing Dimensions (SCDs) model how context changes over time, crucial for trend, audit and ownership analysis. |
Agility in Development | Star schemas are easy to sketch, build and refine, keeping pace with users’ shifting needs in agile business environments. Even Excel can host a star schema, which is useful for early proof-of-concept work. |
Star schemas can be used in conjunction with other data modelling techniques | Star schemas complement, rather than replace, other data modelling approaches. |
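Here is a small, self-contained sketch of star schema thinking in pandas: a fact table of purchase requisitions joined out to illustrative dimensions (all names and values are invented for the example).

```python
import pandas as pd

# Dimensions: the descriptive context (who, where).
dim_employee = pd.DataFrame({
    "employee_key": [1, 2],
    "name": ["Asha", "Bram"],
    "department": ["Procurement", "Finance"],
})
dim_company = pd.DataFrame({
    "company_key": [10, 11],
    "company_code": ["112233", "334455"],
})

# Fact: the measurable events, carrying only keys and measures.
fact_requisition = pd.DataFrame({
    "employee_key": [1, 1, 2],
    "company_key": [10, 11, 10],
    "amount": [250.0, 90.0, 1200.0],
})

# The star in action: keyed joins out to the dimensions, then slice any way you like.
star = (fact_requisition
        .merge(dim_employee, on="employee_key")
        .merge(dim_company, on="company_key"))
print(star.groupby("department")["amount"].sum())   # facts, seen through a dimension
```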
Aspect | Best Practice | Concern if Absent |
---|---|---|
Use of Standard Models | Star Schemas, Snowflake Schemas, 3NF, Data Vault, etc. are well-documented, proven, and widely adopted. | Invented models may lack scalability, interoperability, and clarity—making governance and analytics harder. |
Transparency | Clear, auditable structures allow for validation, collaboration, and easy onboarding. | Custom diagrams may obscure assumptions, hide poor design choices, or create consultant dependency. |
Alignment with Tools | BI tools (like Power BI, Tableau, Looker) are optimized for star schema-type models. | Non-standard models often require complex, inefficient queries and diminish performance. |
Client Empowerment | Good models enable in-house teams to maintain, enhance, and use data confidently. | Vague or unique diagrams can make the client overly reliant on the consultancy. |
Star schema is effective across all stages of a project:
STAGE | ROLE OF STAR SCHEMA | BENEFITS |
---|---|---|
Requirements Elicitation | Helps identify key business entities (e.g., Employees, Companies, Requisitions) and measurable events (facts) | Clarifies scope early by focusing on facts and dimensions that align with user needs |
Process Modelling | Maps business processes (e.g., purchase request approval) to fact-dimension relationships | Visualizes how data flows through systems; supports stakeholder understanding |
Solution Design | Guides the logical and physical design of data tables using a central fact table with surrounding dimensions | Promotes consistency, modularity, and scalability of the solution |
Solution Build | Drives the implementation of the fact and dimension tables specified in the design, keeping the build faithful to the dimensional blueprint | Promotes consistency, modularity, and scalability of the solution |
Integration Testing | Enables validation of data consistency and correctness through test cases across joined fact and dimension tables | Easier to test due to predictable structures and deterministic joins |
User Acceptance Testing (UAT) | Users validate that facts and dimensions reflect the business logic and expected outputs | Increases confidence and traceability, as data is easy to verify by business users |
Analytics | Supports fast, aggregated queries for trends, anomalies, and patterns using metrics from the fact table | Enables powerful, performant analytics over large datasets |
Business Intelligence (BI) | Star schema is the foundation for semantic models in tools like Power BI, Tableau, or Looker | Provides intuitive structure for drag-and-drop exploration and dashboarding |
Reporting | Powers tabular and visual reports (e.g., PR totals by department, employee, date) | Reports remain performant and readable due to dimensional clarity and optimized joins |
Historic Data Selection (a function of owner and time) | A star schema is ideal for repeatable, transparent reporting, especially in self-service BI tools like Power BI. Its structured fact-and-dimension model supports reusable filters, easy slicing across dimensions, and systematic handling of changes over time. | Procedural logic is better suited for one-off data extractions or complex filtering that’s hard to model. While it offers precise control, it lacks transparency and reusability, making it less efficient for evolving or repeatable reporting needs. |
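Since historic selection is a function of owner and time, here is a minimal type-2 Slowly Changing Dimension sketch in pandas. Each change closes one dimension row and opens another, so point-in-time questions stay answerable (data and names are illustrative).

```python
import pandas as pd

# A type-2 SCD: each change closes one row and opens another, so
# "who owned this cost centre on a given date?" always has an answer.
dim_cost_center = pd.DataFrame({
    "cost_center": ["CC-100", "CC-100"],
    "owner": ["Asha", "Bram"],
    "valid_from": pd.to_datetime(["2023-01-01", "2024-06-01"]),
    "valid_to": pd.to_datetime(["2024-05-31", "2099-12-31"]),  # open-ended current row
})

def owner_as_of(dim: pd.DataFrame, cost_center: str, as_of: str) -> str:
    """Select the historic row by owner and time."""
    ts = pd.Timestamp(as_of)
    row = dim[(dim["cost_center"] == cost_center)
              & (dim["valid_from"] <= ts) & (ts <= dim["valid_to"])]
    return row["owner"].iloc[0]

print(owner_as_of(dim_cost_center, "CC-100", "2023-09-15"))  # Asha
print(owner_as_of(dim_cost_center, "CC-100", "2024-07-01"))  # Bram
```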
Prepping Star Schema Data for AI
Right then — if you want to get anything useful out of AI and machine learning, the first job is clearing up the mess. That’s where the star schema earns its keep.
AI doesn’t come with built-in knowledge of your business. It has no clue what a “customer” is, doesn’t care what a “product” does, and doesn’t naturally link a sales figure to anything meaningful. All it sees is columns of numbers and the vague hope of a pattern. So, it’s on us to lay the data out properly.
Think of a star schema as a neat table setting before the guests arrive. In the middle: the main course (facts like transactions, sales amounts, etc.). Around it: all the trimmings — customers, products, dates — the context that makes the numbers make sense.
Why bother?
- It’s tidier to work with — no tangled joins or cryptic table mazes.
- It’s quicker for AI to process — because you’ve done the heavy lifting up front.
- And crucially, the patterns it finds actually mean something — not just statistical gibberish.
We’re not trying to teach the machine what a “dimension” is. We just feed it clean, well-structured data it can chew through and learn from — no need for a data science sermon.
Bottom line: we do the groundwork with a star schema so the AI can get on with being clever. Simple as that.
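A minimal sketch of that groundwork in pandas: flatten the star into one wide feature table, encode the descriptive columns, and hand the model clean numbers (all data invented for the example).

```python
import pandas as pd

# The "table setting": facts in the middle, dimensions as the trimmings.
fact_sales = pd.DataFrame({
    "customer_key": [1, 2, 1],
    "product_key": [10, 10, 11],
    "amount": [25.0, 40.0, 12.5],
})
dim_customer = pd.DataFrame({"customer_key": [1, 2], "segment": ["SME", "Enterprise"]})
dim_product = pd.DataFrame({"product_key": [10, 11], "category": ["Broadband", "Mobile"]})

# Flatten the star into one wide table: the clean, joined view a model can chew through.
features = (fact_sales
            .merge(dim_customer, on="customer_key")
            .merge(dim_product, on="product_key"))

# Encode the descriptive columns numerically; the model never needs to know
# what a "dimension" is. It just gets well-laid-out numbers.
X = pd.get_dummies(features[["segment", "category"]])
y = features["amount"]
print(X.head())
```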
Data centricity upstages procedural centricity
Aspect | Option 1: Star Schema Structure | Option 2: Procedural Logic |
---|---|---|
Definition | A data model where a central fact table (e.g., Purchase Requisition) is surrounded by dimension tables (e.g., Employee, Company Code, Cost Center). | A set of step-by-step procedural rules written in code (e.g., SQL stored procedures, Python logic) to determine entitlement dynamically. |
Lookup or Legwork | The algorithm stays efficient even when more than one occurrence is possible. Fact table: Purchase Requisition. Dimensions: Employee, Company, Cost_Center, Material, Plant. A keyed lookup indicates whether the selection criteria have been met. | The algorithm becomes inefficient when more than one occurrence is possible: IF employee.company_code IN (112233, 334455) AND employee.company_code NOT IN (445566, 667788) THEN allow entitlement ELSE reject. |
Reusability | High – data model is reusable for different queries and reporting | Low – logic must be rewritten or refactored for other use cases |
Performance | Optimized with joins and indexed tables; faster for large-scale analytics | Slower for large datasets; logic must scan and evaluate for each transaction |
Transparency | Easy to understand data relationships; business users can query | Logic is often hidden in code; requires technical knowledge |
Auditability | Clear lineage and logs via tables | Harder to trace logic changes over time |
Scalability | Scales better with large enterprise systems | Procedural logic becomes complex and error prone as data volume grows |
Entitlement Flexibility | Entitlement rules can be modelled through relationships in dimensions (e.g., linking only active employees in Newco) | Requires logic to handle every exception and state manually |
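To see the difference in miniature, here is an illustrative Python sketch of both options, reusing the company codes from the table: entitlement as a keyed lookup against a dimension versus the same rule hard-coded procedurally.

```python
import pandas as pd

# Option 1 -- data-centric: entitlement lives in a dimension, checked by a keyed lookup.
dim_entitled_company = pd.DataFrame({"company_code": ["112233", "334455"]})
requisitions = pd.DataFrame({
    "req_id": [1, 2, 3],
    "company_code": ["112233", "445566", "334455"],
})
entitled = requisitions.merge(dim_entitled_company, on="company_code", how="inner")
print(entitled["req_id"].tolist())   # [1, 3] -- one join, reusable for any query

# Option 2 -- procedural: the same rule buried in code, re-evaluated row by row.
def allow(company_code: str) -> bool:
    if company_code in ("112233", "334455") and company_code not in ("445566", "667788"):
        return True
    return False

print([r for r, c in zip(requisitions["req_id"], requisitions["company_code"]) if allow(c)])
```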