In my time working with data, I’ve run into different misconceptions about what data lineage is and what it can and can’t do. Because every article explaining the concept comes with some sort of diagram, for example, it tends to create the perception that data lineage is a diagram or just a feature. While it’s a great way to visualize lineage, a diagram is nowhere near all there is to it. And lineage as a concept is way more than a feature.
So what is data lineage?
A metadata dataset that can be powered by automation
When you execute a SQL command in your Snowflake instance to change the name of a table, it will run it and register it in a logging table. This log, which can be accessed later, contains information like:
- Who ran that command;
- The date and time it happened;
- The command itself.
This is true for every action performed in basically any data warehouse, transformation, and BI tool. And so every day these logs are filled with mountains of information about what’s happening with your data. Data about your data — metadata.
This metadata includes all transformations the data underwent along the way. We can analyze these logs to understand how the data was transformed, what changed, and why.
There is potentially big business value in properly understanding and operationalizing this metadata. But this is no simple task. That’s because the metadata inside these logs alone isn’t of much use: thousands of out-of-context lines of SQL commands, timestamps, and user IDs aren’t going to do a whole lot for your bottom line. Reverse engineering that metadata and creating a way of understanding, recording, and visualizing data as it flows from data sources to consumption — now that is where we are starting to talk!
Some people think of data lineage as just another optional feature. At Alvin, we see it as a powerful, living dataset, constantly going through transformations as data and people flow in your environment. As such, it contains the big picture of your data flow and can answer many key questions, such as:
- How can we be more proactive about pipeline errors and stop breaking things?
- How can we check for assets that are not being used and get rid of them?
- Where is this column coming from? What are its downstream dependencies?
- How can we test and fail fast?
- How can we ensure proper data compliance?
- How can we win back precious engineering time spent firefighting?
- How can we improve data discoverability and the onboarding experience for new joiners?
Those are just some benefits of implementing proper data lineage. Depending on your company size and the business problems you’re solving, there could be even more.
Now that we’ve established the what and the why, let’s look at the how. Here are some of the most common techniques you can use to implement data lineage in your organization.
Manually implemented lineage
Let’s get this one out of the way first. Implementing data lineage manually is the hardest and least practical way to do it.
It involves conducting interviews with business users, data scientists, BI analysts, data stewards, and others who interact with the data, to learn how it flows through different systems and undergoes modifications. The information gathered through these interviews is then used to map out the transformations and data flows.
Because all of this is done almost 100% by humans, it’s a highly error-prone and time-consuming process, making it slow and expensive compared to other ways of implementing data lineage.
Pattern-based lineage
This method uses algorithms to analyze patterns in metadata across tables, columns, and reports. By identifying patterns and relationships, this technique can track the movement and transformation of data. For example, if two tables contain a column with a similar name and data values, pattern-based lineage can link these columns in a data lineage map.
One of the main advantages of pattern-based lineage is that it works on any system or technology and doesn’t require knowledge of programming languages. If you have a small number of datasets (and a tight budget) with straightforward connections, this might be enough for you.
In more complex data relationships, such as those created by data processing algorithms, pattern-based lineage mapping can be difficult to apply.
Lineage through data tagging or self-contained data lineage
In a self-contained data environment with storage, processing, and metadata management, built-in data lineage is effective. Tagging data sets (manually or automatically) with metadata is usually an easy task in those environments. Tools like Databricks, Airflow, and dbt come with lineage and tagging features that work just fine.
Until you have to deal with multiple systems.
Most companies have two or three different systems involved in the BI environment which encompasses ingestion, processing, querying, and reporting. If you want end-to-end data lineage, you’ll need a different approach.
Data lineage by parsing
This is a powerful technique that follows the data footprint as it moves through various systems by reading and understanding the algorithms used to process, transform, and transmit the data. This is by far the best way of implementing end-to-end lineage across multiple systems, but doing it in-house requires a lot of knowledge and hours of work. It’s definitely doable for certain specific use cases (and if you have well-defined processes and strong SQL practitioners in your team, then this will reduce the number of edge cases you need to solve for).
But — and I may be biased because I work with a lineage provider — having seen the technical complexity that goes into building out data lineage, I’d probably never recommend doing it in-house.
Implementing data lineage for specific use cases and platforms is one thing, but building a tool that can automatically generate column-level lineage across multiple platforms and vendors is a different challenge. Storing and connecting lineage data is a graph problem, and parsing and traversing query structures can be time-consuming, especially for larger queries with correlated subqueries and unqualified references. An automated data lineage tool can bring all the benefits I mentioned above with minimal effort and time commitment.