Data & Infrastructure

Data Lineage

Definition

Data lineage is the documented record of a dataset's origins, the transformations it has undergone, and all the downstream systems and processes that consume it. It answers the questions: where did this data come from, what happened to it along the way, and who or what depends on it now? A complete lineage graph traces a data element from its source system — a database transaction, a sensor reading, an API call — through each pipeline stage, transformation, join, and aggregation, all the way to the dashboards, models, or applications that use it. Lineage can be captured at the column level (which fields fed into this derived metric) or at the dataset level (which upstream tables this report depends on).
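As a rough illustration of the two granularities described above, lineage can be modeled as a directed graph whose edges point from a source to whatever is derived from it. The sketch below is a minimal example, not a reference to any particular lineage tool: the table and column names (`orders`, `fx_rates`, `daily_sales`, `revenue_dashboard`) and the helper functions are all hypothetical.

```python
from collections import defaultdict

# Hypothetical column-level lineage: (source column, derived column).
# Each name is written as "dataset.column".
COLUMN_EDGES = [
    ("orders.amount", "daily_sales.revenue"),
    ("orders.order_date", "daily_sales.day"),
    ("fx_rates.rate", "daily_sales.revenue"),
    ("daily_sales.revenue", "revenue_dashboard.total_revenue"),
]


def build_upstream(edges):
    """Map each derived node to the set of nodes it was directly built from."""
    upstream = defaultdict(set)
    for source, target in edges:
        upstream[target].add(source)
    return upstream


def trace_origins(node, upstream):
    """Return every node that feeds into `node`, directly or transitively."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in upstream.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def to_dataset_level(edges):
    """Collapse column-level edges into dataset-level edges."""
    dataset_edges = set()
    for source, target in edges:
        source_table = source.split(".")[0]
        target_table = target.split(".")[0]
        if source_table != target_table:
            dataset_edges.add((source_table, target_table))
    return sorted(dataset_edges)


column_upstream = build_upstream(COLUMN_EDGES)

# Column level: which fields fed into this derived metric?
origins = trace_origins("revenue_dashboard.total_revenue", column_upstream)

# Dataset level: which upstream tables does each dataset depend on?
dataset_edges = to_dataset_level(COLUMN_EDGES)
```

In this toy graph, `origins` contains the revenue column and the two columns it was computed from, but not `orders.order_date`, which only feeds the `day` field. Collapsing to dataset level loses that distinction: the dashboard simply depends on `daily_sales`, which depends on `orders` and `fx_rates`.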

In AI and enterprise commerce contexts, data lineage is essential for three distinct purposes. First, it supports debugging: when a model's predictions degrade or a KPI moves unexpectedly, lineage makes it possible to trace the anomaly back to a specific upstream change. Second, it enables regulatory compliance — financial services, healthcare, and increasingly retail organizations must be able to demonstrate exactly how a data-driven decision was reached, which requires a traceable audit trail from raw input to output. Third, it accelerates impact analysis: when a source schema changes or a vendor API is deprecated, lineage maps show every dependent pipeline and model that will break, allowing teams to prioritize remediation before outages occur.
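Two of these purposes, debugging and impact analysis, reduce to walking the same graph in opposite directions. The following is a minimal sketch under assumed names: the pipeline nodes (`vendor_api.products`, `warehouse.catalog`, and so on) and the functions `impacted_by` and `suspect_changes` are hypothetical and do not come from any specific lineage product.

```python
from collections import defaultdict, deque

# Hypothetical dataset-level lineage: (upstream node, downstream node).
EDGES = [
    ("vendor_api.products", "staging.products"),
    ("staging.products", "warehouse.catalog"),
    ("warehouse.catalog", "model.recommendations"),
    ("warehouse.catalog", "dashboard.inventory_kpi"),
    ("crm.customers", "model.recommendations"),
]


def build_index(edges):
    """Build lookup tables for walking the graph in both directions."""
    downstream, upstream = defaultdict(set), defaultdict(set)
    for source, target in edges:
        downstream[source].add(target)
        upstream[target].add(source)
    return downstream, upstream


def reachable(start, neighbors):
    """Breadth-first walk returning every node reachable from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def impacted_by(changed_node, downstream):
    """Impact analysis: everything that depends on a changed source."""
    return sorted(reachable(changed_node, downstream))


def suspect_changes(anomalous_node, recently_changed, upstream):
    """Debugging: recent changes that sit upstream of an anomalous output."""
    return sorted(reachable(anomalous_node, upstream) & set(recently_changed))


downstream, upstream = build_index(EDGES)

# A vendor API is being deprecated: which pipelines and models are affected?
impacted = impacted_by("vendor_api.products", downstream)

# A KPI moved unexpectedly: which recent changes could explain it?
suspects = suspect_changes(
    "dashboard.inventory_kpi",
    ["staging.products", "crm.customers"],
    upstream,
)
```

Here the change to `crm.customers` is ruled out as a cause of the KPI anomaly because it is not upstream of the dashboard, which leaves `staging.products` as the only suspect. A real system would also attach metadata to each edge, such as the transformation applied and when it ran, which is what an audit trail for compliance would draw on.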

Related terms: AI-Ready Data, Big data, Customer Data Platform (CDP), Data mining

Source

AI Best Practices for Commerce - Glossary

Last updated: May 12, 2026