In my wanderings around many different data communities and social channels, it doesn’t take long before I end up in a conversation about efficiency, quality, cost, or discoverability.
And data lineage is often suggested as the answer to many different problems in these discussions. But if data lineage is the answer, what is the question? What do you really get with well-implemented data lineage running in your company?
Recently, we had a LinkedIn Live called The ROI of Data Lineage with two folks who implemented data lineage in their companies: Boyan Vasilev, Data Science Manager at Unity Technologies, and Michael O Toole, Senior Data Warehouse Architect at Kry.
Let's take a look at what answers they found with data lineage.
The impact of data lineage at Unity: observability, optimization, discovery
After implementing data lineage at Unity, Boyan says they got some pretty interesting results.
But what were the questions they were trying to address? Here are a few:
How can we be more proactive about pipeline errors?
Errors don’t necessarily mean that something will instantly break: sometimes a process quietly stops running as it should, with no warning at all. It might take weeks or even months until someone realizes that the metric they have been checking every day to make important decisions is wrong. That can have huge consequences (inaccurate financial reporting, anyone?).
With the metadata generated by data lineage, Boyan and his team were able to build a notification and alerting system that warns them right away when something doesn't go as expected, so they can act. Which takes us to the next question.
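Boyan didn't share implementation details, but the shape of such a system is simple enough to sketch. Here is a minimal, hypothetical Python version: every asset in the lineage metadata carries a last-update timestamp, its downstream consumers, and an owner, and anything that blows past its freshness SLA raises an alert. All of the names here (LINEAGE, the assets, the SLA) are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical lineage metadata: asset -> last update time, downstream assets, owner.
# In a real system this would come from your lineage/metadata store.
LINEAGE = {
    "raw.events": {
        "last_updated": datetime(2024, 1, 1, tzinfo=timezone.utc),
        "downstream": ["mart.daily_revenue"],
        "owner": "data-eng",
    },
    "mart.daily_revenue": {
        "last_updated": datetime(2024, 1, 2, tzinfo=timezone.utc),
        "downstream": [],
        "owner": "analytics",
    },
}

FRESHNESS_SLA = timedelta(hours=24)

def stale_assets(now=None):
    """Flag assets that missed their freshness SLA, plus the downstream owners to warn."""
    now = now or datetime.now(timezone.utc)
    for asset, meta in LINEAGE.items():
        if now - meta["last_updated"] > FRESHNESS_SLA:
            impacted = {LINEAGE[d]["owner"] for d in meta["downstream"] if d in LINEAGE}
            yield asset, meta["owner"], impacted

for asset, owner, impacted in stale_assets():
    print(f"ALERT: {asset} (owner: {owner}) is stale; also warn: {impacted or 'nobody'}")
```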
How can we quickly diagnose the root cause of errors?
Even when you have found a pipeline error, solving it can be hard. The challenge grows with complexity: the number of data sources, processing steps, and systems you integrate with.
With lineage, you get a clear view of where in the pipeline the error occurred. And knowing the downstream dependencies of an asset means you can be proactive in informing stakeholders. Boyan says:
If a job fails for whatever reason, we know that a specific node in our system is out of date. And if that is out of date, everything downstream is too.
In other words: whenever a problem happens, they can trace where it is and who is impacted, and take action before disaster strikes.
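The blast-radius part of that quote maps directly onto a graph traversal. A minimal sketch (not Unity's actual system) using the networkx library: lineage edges form a directed graph, and everything impacted by a failed node is one descendants() call away. The asset names are hypothetical.

```python
import networkx as nx

# Hypothetical lineage edges: upstream asset -> downstream asset.
G = nx.DiGraph([
    ("raw.events", "staging.events_clean"),
    ("staging.events_clean", "mart.daily_revenue"),
    ("mart.daily_revenue", "dashboard.finance_kpis"),
])

failed = "staging.events_clean"
impacted = nx.descendants(G, failed)  # everything downstream is now out of date too
print(f"{failed} failed; also out of date: {sorted(impacted)}")
```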
How can we optimize the usage of our data assets?
In an ideal world, you would create and maintain exactly the assets you need to keep your pipelines running. But we know that in the real world, with a team of 40+ people (in Unity’s case), that’s not always going to happen. Redundant tables, models, and dashboards can be a source of confusion for the team and drive unnecessary cost.
The metadata provided by lineage gives Boyan and his team a clear view of unused assets, so they can clean them up. And not only that: they can also see what is heavily used, which tells them how users are consuming data and where there are opportunities to optimize their pipelines.
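Again, as an illustration rather than Unity's code, the "what can we clean up?" query is easy to express over the same kind of graph: an asset with no downstream consumers in the lineage and no recent direct usage is a deletion candidate. The usage numbers here are invented stand-ins for warehouse query logs.

```python
import networkx as nx

G = nx.DiGraph([
    ("raw.events", "staging.events_clean"),
    ("staging.events_clean", "mart.daily_revenue"),
    ("raw.events", "staging.events_legacy"),  # nothing depends on this any more
])

# Hypothetical usage stats: asset -> direct queries in the last 90 days.
query_counts = {"mart.daily_revenue": 1200, "staging.events_legacy": 0}

candidates = [
    asset for asset in G.nodes
    if G.out_degree(asset) == 0            # no downstream assets depend on it
    and query_counts.get(asset, 0) == 0    # and nobody queries it directly
]
print("Cleanup candidates:", candidates)   # ['staging.events_legacy']
```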
How can we increase collaboration and improve the data discovery experience?
Questions that might be easy to solve can still take a lot of your team’s time when the answer lives in a colleague’s head. To Boyan and the wider Unity team, that back-and-forth is not necessary: lineage (and the associated metadata) gives each one of them enough understanding of their data without having to constantly ping each other.
The impact of data lineage at Kry: compliance, testing and onboarding improvements
At Kry, the results after the implementation were clear and impressive.
Again, data lineage was the solution. But what were the questions they were trying to answer in the first place?
How do you ensure compliance?
As a MedTech company, Kry has major data concerns around access, compliance, and security. How could they safely and ethically store data so that it is used only for its allowed purpose, while still giving engineers and analysts an easy (although guarded) access pattern?
In the context of a data warehouse where data flows from many sources and may be combined and merged in many different ways, understanding the flow of data was essential to categorizing it, and the categorization was a prerequisite for driving access patterns.
For example, an email address coming from a user table in a source system could be marked as PII (Personally Identifiable Information). Knowing this, only analysts who were allowed access to PII data should be able to select from that column. For source columns, this classification was often provided by the producers via data contracts. They decided to implement lineage so they could:
- Automatically “inherit” data access classifications from source systems (sketched after this list). E.g. if a column used CONCAT(date, email) to create a time series ID, they would mark this column as PII because it used or inherited from a PII column.
- Identify and prevent the interaction of data that was legally not allowed to be mixed. Using a user's login_at and social_id together might not be permitted, as they had been collected with consent from the users for purposes that did not overlap.
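Kry's implementation is internal, but the inheritance rule from the first bullet is straightforward to sketch over a column-level lineage graph: a column is PII if it, or any of its upstream ancestors, is tagged PII. The column names and tags below are hypothetical.

```python
import networkx as nx

# Hypothetical column-level lineage: source column -> derived column.
G = nx.DiGraph([
    ("users.email", "events.time_series_id"),        # e.g. CONCAT(date, email)
    ("users.signup_date", "events.time_series_id"),
    ("events.time_series_id", "mart.sessions.ts_id"),
])

# Classifications set at the source, e.g. via data contracts.
source_tags = {"users.email": "PII"}

def classification(column):
    """A column inherits PII if it, or any upstream ancestor, is tagged PII."""
    upstream = nx.ancestors(G, column) | {column}
    return "PII" if any(source_tags.get(c) == "PII" for c in upstream) else "non-PII"

print(classification("mart.sessions.ts_id"))  # PII, inherited via time_series_id
```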
Using lineage to drive classification meant:
- The hundreds of thousands of columns they had did not need to be manually documented, because everything was automated through lineage.
- Analysts could trace the “why” of a classification, e.g. “why can I not use this column?” -> “because a long way back in the pipeline it comes from region-restricted French data” (see the path-tracing sketch after this list).
- There was a clear pattern for changing classifications: a user could manually set a new classification (which was logged and associated with a reason). E.g. this column might have come from a user's email, but it is now highly aggregated and no longer needs this access level.
- Table layout and design could concentrate on usability and performance without having to consider security. The lineage itself ensured clean, clear column-level access for each column, so they did not have to build separate tables just to store more sensitive data.
- Changes could be programmatically flagged if they would have a large or unusual effect on security or access in the data warehouse as a whole.
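The “why can I not use this column?” tracing from the second bullet above also falls out of the graph: the explanation is simply a lineage path from the restricted source down to the column in question. A hypothetical sketch:

```python
import networkx as nx

# Hypothetical column-level lineage edges.
G = nx.DiGraph([
    ("fr.users.email", "staging.contacts.contact_key"),
    ("staging.contacts.contact_key", "mart.sessions.owner_key"),
])

# Hypothetical source-level restrictions.
restricted = {"fr.users.email": "region-restricted French data"}

def explain(column):
    """Answer 'why can I not use this column?' with the offending upstream path."""
    for source, reason in restricted.items():
        if source in G and column in G and nx.has_path(G, source, column):
            path = " -> ".join(nx.shortest_path(G, source, column))
            return f"{column} is restricted because it comes from {reason}: {path}"
    return f"{column} has no restricted ancestors"

print(explain("mart.sessions.owner_key"))
```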
With lineage implemented, it was possible to guarantee compliance at Kry.
How to test data changes and stop breaking things?
One of the most common problems teams face is cleaning up after breaking changes. It's not always easy to understand what the downstream consequences will be if you change the name of a column, for example, especially in complex environments with lots of different people transforming data.
Anticipating breaking changes with regression testing, rather than manual mapping, is achievable with data lineage. At Kry, Michael and his team can test whether running a script will break something before actually running it, because they can analyze the downstream dependencies of the assets involved using the metadata provided by lineage.
Testing manually is repetitive (and sometimes painful), but not doing it at all is accepting the risk of adding errors to your system that can break dashboards and cause metrics to be inaccurate.
Data lineage allows you to quickly test, fail fast and fix your mistakes at speed, saving a lot of engineering time.
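Kry's tooling isn't public, but here is one simplified way such a pre-flight check could look: parse the script with the open-source sqlglot library, then compare the columns it references against what actually exists upstream. The catalog, table, and script below are all hypothetical.

```python
import sqlglot
from sqlglot import exp

# Hypothetical warehouse catalog: table -> known columns.
# Suppose orders.amount was just renamed to orders.total_amount.
CATALOG = {"orders": {"order_id", "total_amount"}}

def missing_columns(sql, dialect="redshift"):
    """Report column references in a script that no longer exist upstream."""
    tree = sqlglot.parse_one(sql, read=dialect)
    return [
        f"{col.table}.{col.name}"
        for col in tree.find_all(exp.Column)
        if col.table in CATALOG and col.name not in CATALOG[col.table]
    ]

script = "SELECT orders.amount FROM orders"
print(missing_columns(script))  # ['orders.amount'] -> running this would break
```

The same idea runs in reverse for downstream checks: given the columns a script drops or renames, a descendants query over the lineage graph (as in the Unity example earlier) lists every model and dashboard the change would break.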
How to improve the onboarding of new joiners?
It takes time for new hires to properly understand a codebase, and it's not uncommon for them to make mistakes from lack of knowledge. With the self-documenting pipelines that lineage provides, knowledge transfer of the domain no longer depends only on people.
Michael says there was far less failure in this process, and the results were measurable: the number of merged commits from new hires after a month increased fourfold.
What level of lineage do you need?
Well, that depends very much on the question you need to answer. Tools like dbt and Airflow come with some basic lineage by design, which is way better than nothing! But these tools are limited in granularity (table-level rather than column-level), and their lineage stops at the boundaries of that specific tool. So, if the question is, ‘will I break any dashboards if I rename this field?’, then model-level lineage ain’t gonna cut it.
So, after reading this you realise you need a lineage solution? Then the next question you'll probably be asking yourself is: should I build, or should I buy?
The Great Debate: Build vs Buy
As I work for a lineage provider, I’m a tad biased. But having built out data lineage for the past 3 years and seen the technical complexity first-hand, we’d probably never recommend building it in-house! That said, for certain specific use cases, it is totally doable. Also, if you have well-defined processes and strong SQL practitioners in your team, that will reduce the number of edge cases you need to solve for.
Of course, it is one thing to implement data lineage in a company with support for specific use cases and platforms, but a completely different one to build a generic tool that can plug into any data environment and automatically generate column-level lineage from multiple platforms and vendors. So, mainly for catharsis, I’m going to list a few challenges we ran into over the years:
- Storing and connecting lineage data ends up being a “graph” problem, which is well outside the domain of data engineers, and indeed most engineers in general. Parsing out lineage might be possible, but using it requires efficient graph storage and traversal.
- Variations of query syntax across SQL dialects. People write the same operation in many ways (SELECT INTO vs. CREATE TABLE AS in Redshift, for example; a small taste of this is sketched after the list). These seem like corner cases, but lineage only makes sense when it is complete.
- Parsing and traversing query structures often works at a small scale. But when we're talking about larger queries, well… let's just say that performance optimization of parsing is really time-consuming.
- Correlated subqueries and unqualified references can pose a real challenge in larger queries. Questions like “to which column does this refer?” and “which subqueries are accessible in this scope?” represent a big jump in difficulty.
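To give a small taste of the dialect problem from the list above, here is what the very first step can look like using the open-source sqlglot parser: the same logical operation written in two dialect-specific syntaxes, normalized into the same table and column references. Everything past this first step (scope resolution, correlated subqueries, graph storage) is where the real difficulty starts. The queries are hypothetical.

```python
import sqlglot
from sqlglot import exp

# The same logical operation in two different syntaxes.
QUERIES = {
    "redshift": "INSERT INTO daily_rev SELECT user_id, amount FROM orders",
    "snowflake": "CREATE TABLE daily_rev AS SELECT user_id, amount FROM orders",
}

for dialect, sql in QUERIES.items():
    tree = sqlglot.parse_one(sql, read=dialect)
    tables = sorted({t.name for t in tree.find_all(exp.Table)})
    columns = sorted({c.name for c in tree.find_all(exp.Column)})
    # Both dialects normalize to the same references.
    print(f"{dialect}: tables={tables} columns={columns}")
```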
Let’s say it’s been a journey, and a never-ending one at that, but we love it really, and our customers benefit from all the blood, sweat, and tears. So if you are looking for something out of the box, you can try it out for free.
Closing thoughts: signs that you need data lineage
Data lineage is (probably) not something that you just wake up one day and think, “omg, it would be awesome to have column-level lineage.”
Instead, it should come from 1) the need to answer some of the following questions, and 2) having nowhere else to turn for the answer:
- How can we be more proactive about pipeline errors and stop introducing breaking changes?
- How can I check for assets that are not being used and get rid of them?
- Where is this column coming from? What are its downstream dependencies?
- How can we quickly test and fail fast?
- How can we ensure proper data compliance?
- How can we win back precious engineering time spent firefighting?
- How can we improve data discoverability and the onboarding experience for new joiners?