When I was invited to the Data Engineering Podcast, hosted by Tobias Macey, I got pretty excited. It’s a podcast I listen to regularly myself, and one I find particularly interesting for its deep dives into data engineering topics and the guests it brings in.
It would be quite impossible to give a worthy recap of the episode, so I’d recommend giving it a listen.
But for all the busy people, here’s the TL;DR: five selected takeaways from the interview, each with a quick intro and the relevant transcript excerpt from the podcast.
Different roles have different ideas of lineage
Data lineage has many different dimensions: what may be essential to a data engineer may be irrelevant to someone on the data governance team.
"Multiple people have multiple ideas of lineage and how it should be presented; that’s what we want to cater to.
We are now most focused on the technical layer and on providing not just a visual interface but also a more operational aspect. When it comes to the other systems, we think the most value is in this cross-system lineage. So you connect Tableau or Looker and your data warehouse. And then what we do is we go into the APIs and the metadata in those systems, we look for the connections that are set up there, and then we go into “okay, what are the connections we have already from the warehouse?”
And then we match those up. So this enables us to do this cross-system lineage, I think quite elegantly, because you connect the systems in minutes, and then, depending on the amount of metadata and queries, the lineage is automatically generated within a few hours."
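To make the matching step concrete, here is a toy sketch in Python. This is purely an illustration, not our actual implementation: the metadata shapes, names, and normalization are hypothetical, and real matching has to deal with far messier identifiers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    source: str  # upstream asset, e.g. a warehouse table
    target: str  # downstream asset, e.g. a BI dashboard

def normalize(table_ref: str) -> str:
    """Normalize table references so metadata from different systems
    can be compared (real matching also handles quoting, aliases, etc.)."""
    return table_ref.strip().strip('"').lower()

def cross_system_lineage(bi_connections: dict[str, list[str]],
                         warehouse_tables: set[str]) -> list[Edge]:
    """Match tables referenced in BI-tool metadata (dashboard -> tables it
    reads from) against tables already known from warehouse lineage."""
    known = {normalize(t) for t in warehouse_tables}
    edges = []
    for dashboard, tables in bi_connections.items():
        for table in tables:
            if normalize(table) in known:
                edges.append(Edge(source=normalize(table), target=dashboard))
    return edges

# One Looker dashboard reading from a warehouse table we already track:
bi = {"looker.dashboard.revenue": ["ANALYTICS.ORDERS"]}
wh = {"analytics.orders", "analytics.customers"}
print(cross_system_lineage(bi, wh))
# [Edge(source='analytics.orders', target='looker.dashboard.revenue')]
```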
Boring tech is good tech
We always like to talk about how the data space is exciting, but do we really want this excitement at work?
"On the platform side, we definitely believe that boring technology is great technology because you understand it. While a lot of people might think that we use some graph database or whatnot, we actually rely mainly on Postgres and OpenSearch.
On the customer side, to us, a day where a client doesn’t have to do anything special and still gets the results they want is kind of the holy grail. That there is this plug-and-play magic to the lineage.
A boring day in data is a good day."
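To illustrate why “boring” relational technology goes a long way here, below is a toy example of walking a lineage graph with a recursive query. It uses SQLite from Python’s standard library so the snippet runs anywhere; the same CTE syntax works in Postgres. The schema is a made-up simplification, not our actual data model.

```python
import sqlite3

# Toy edges table; the same recursive CTE syntax works in Postgres.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lineage_edges (upstream TEXT, downstream TEXT);
    INSERT INTO lineage_edges VALUES
        ('raw.orders',        'staging.orders'),
        ('staging.orders',    'analytics.revenue'),
        ('analytics.revenue', 'looker.dashboard.revenue');
""")

# Everything downstream of a table is an ordinary recursive query;
# no graph database required.
rows = conn.execute("""
    WITH RECURSIVE impacted(asset) AS (
        SELECT downstream FROM lineage_edges WHERE upstream = ?
        UNION
        SELECT e.downstream
        FROM lineage_edges e
        JOIN impacted i ON e.upstream = i.asset
    )
    SELECT asset FROM impacted
""", ("raw.orders",)).fetchall()

print(sorted(r[0] for r in rows))
# ['analytics.revenue', 'looker.dashboard.revenue', 'staging.orders']
```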
No tool can replace good culture and processes
Need to solve a problem? Just throw a tool at it… right? Tools can be great enablers, but they need good culture and processes built around them, or else you are just transferring the pain elsewhere.
"Everyone nowadays is able to write their models, everyone is able to be a data engineer. But I think a bit of the problem that we see is that there are so many things happening, so many tables and pipelines and data. It’s impossible to know what’s useful and what’s not useful.
I’m definitely not saying dbt is the problem, but it comes with its own set of challenges. It’s part of the reason the problem exists, because you can’t replace good processes and good culture with tools alone."
Data lineage cannot be seen as a feature of a data catalog. It’s too hard, and too fundamental.
Looking at existing tooling, data lineage is often presented as a feature (sometimes a neglected one). But data lineage is essentially a dataset, and as such, accuracy is everything. There are more applications than we can fit on our roadmap, but most of them rely on a very high level of accuracy. We know very well how hard that is to achieve, which is why we’ve chosen to focus on it.
"The term data lineage is a little bit overused, and it is losing its meaning a little bit. As an industry, we need to give value back to it. And I think how we do that is it’s no different to what I’ve been trying to preach; that you need to show how it can be used and why it’s useful. That’s why we as an industry have a long way to go. I think lineage is unfortunately a little bit just put in a corner as a feature that ticks a checkbox and okay, there’s a diagram there, great, we have lineage.
So I think really focusing on that is how it can be used and how it is useful and how it can drive value, save time, and increase data quality. It’s like a pillar of data observability.
I guess it’s because there has been so much funding into the space, so much investment, everyone saying that “yeah, we have column-level lineage, we have everything”. It has kind of set the expectations so high with potential customers that you read a proof of concept or an RFP from a company and they expect to have absolutely everything and I think that that’s probably not feasible from one single product."
Data tooling should respect existing workflows, rather than trying to reinvent the wheel.
It’s important to recognize that data teams have existing tooling and workflows. If you’re offering them another tool, you can’t expect them to throw away what they already have to accommodate it.
"There’s too much inward focus in a lot of the companies that are trying to enter the space and trying to be category-defining in some way. And a lot of that is probably fueled by investors, then everyone’s kind of fighting a little bit to do that.
I’m probably throwing stones in a glass house. But still, I think it’s important, like I said, to stay humble and try to be focused on the use cases and the processes of the data engineers. Our approach is really to latch onto whatever they’re using. And dbt is an obvious one there when it comes to just market reach and the ability to provide value to many data engineers. So what we have here is this impact analysis, or regression testing, as we also like to call it. But the thinking is that a lot of companies already use GitHub or GitHub Actions to run data quality tests (in dbt), basically data quality tests on the changes that they’re making. What we see is that this a perfect area to also integrate our tooling."
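To show the core of that kind of impact analysis, here is a toy Python sketch: given the dbt models touched in a pull request, walk the lineage graph downstream to see what the change could affect. The graph, model names, and CI wiring are invented for illustration; they are not our product’s API.

```python
from collections import deque

# Made-up lineage graph in a dbt-manifest style:
# each model maps to the models/dashboards it feeds.
LINEAGE = {
    "model.shop.stg_orders": ["model.shop.orders"],
    "model.shop.orders":     ["model.shop.revenue", "dashboard.finance.weekly"],
    "model.shop.revenue":    ["dashboard.finance.weekly"],
}

def impacted_assets(changed_models: set[str]) -> set[str]:
    """Breadth-first walk from the models changed in a PR, collecting
    every downstream asset the change could affect."""
    seen: set[str] = set()
    queue = deque(changed_models)
    while queue:
        node = queue.popleft()
        for child in LINEAGE.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# In CI you would derive changed_models from the PR diff
# (e.g. `git diff --name-only`) and warn or fail when critical
# dashboards show up in the impact set.
print(sorted(impacted_assets({"model.shop.stg_orders"})))
# ['dashboard.finance.weekly', 'model.shop.orders', 'model.shop.revenue']
```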
If you want to try Alvin, request access here and a crew member will reach out to set up a demo and walk you through the onboarding :)