Most of the data companies work with is related to their products and customers. Tens of thousands of terabytes of data about customers, products, payments — you know the drill. This data is crucial and by now most businesses have a reasonably good handle on how to leverage it to bring value. But there’s another kind of data that tends to just lie around and get (unfairly) neglected: metadata.
Not taking full advantage of your metadata is a massive missed opportunity, because as I’m about to demonstrate in my TED talk (okay, this article), it can solve a number of business pains and deliver serious results.
Before digging into how metadata can create value, let’s start by understanding the two levels of metadata: basic and derived.
Basic Metadata
Simply put, metadata is data about data. Depending on the system, there is likely a lot of metadata that can be harvested to provide extra context on your data assets, such as table names, descriptions, row count, and so on.
This basic metadata can be leveraged in data catalogs, where it is gathered, organized and presented to data users, unlocking features like search and discoverability across your data assets.
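To make that concrete, here is a minimal sketch of harvesting basic metadata with SQLAlchemy’s inspector. The connection string, the `analytics` schema and the naive row count are placeholders for illustration; a real catalog ingestion would collect far more.

```python
# Minimal sketch: harvesting basic metadata (names, descriptions, columns, row counts).
from sqlalchemy import create_engine, inspect, text

engine = create_engine("postgresql://user:password@host/warehouse")  # placeholder DSN
inspector = inspect(engine)

catalog = []
for table_name in inspector.get_table_names(schema="analytics"):
    columns = inspector.get_columns(table_name, schema="analytics")
    comment = inspector.get_table_comment(table_name, schema="analytics")
    with engine.connect() as conn:  # naive COUNT(*): fine for a sketch, slow at scale
        row_count = conn.execute(
            text(f'SELECT COUNT(*) FROM analytics."{table_name}"')
        ).scalar()
    catalog.append({
        "table": table_name,
        "description": comment.get("text"),
        "columns": [c["name"] for c in columns],
        "row_count": row_count,
    })
```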
Data warehouses also keep logs that record every command that has been run. These can be queried later (see the sketch just after this list) and contain details like:
- Who ran the command;
- The date and time it happened;
- The command itself.
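As one illustration, BigQuery exposes this history through its INFORMATION_SCHEMA.JOBS views; a minimal sketch of pulling the who/when/what fields with the official Python client might look like the following. The project id and region are placeholders, and other warehouses have their own equivalents.

```python
# Sketch: pulling who/when/what from BigQuery's query history.
# Assumes the google-cloud-bigquery client and the right IAM permissions.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project id

sql = """
    SELECT
      user_email,      -- who ran the command
      creation_time,   -- when it happened
      query            -- the command itself
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE job_type = 'QUERY'
      AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    ORDER BY creation_time DESC
"""

for row in client.query(sql).result():
    print(row.user_email, row.creation_time, row.query[:80])
```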
These logs are useful on their own, but we can go one step further. By analyzing them, we can understand how the data was transformed, what changed, and why. This is the process of turning basic metadata into derived metadata.
Derived Metadata
When we analyze our metadata, that’s when things really start to get interesting! For example, by parsing SQL we can derive insights that power a range of operational use cases: we can infer connections between entities, calculate usage statistics, predict the impact of schema changes, and understand how data flows from producers to consumers.
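Here is a hedged sketch of that idea using sqlglot: parse a query pulled from the logs, extract the tables it reads, and emit lineage edges towards the consumer that ran it. The query and the consumer id are made up for illustration; a real pipeline would also resolve columns and write targets.

```python
# Sketch: inferring lineage edges by parsing a query found in the warehouse logs.
import sqlglot
from sqlglot import exp

logged_query = """
    SELECT o.order_date, SUM(p.amount) AS revenue
    FROM raw.orders AS o
    JOIN raw.payments AS p ON p.order_id = o.id
    GROUP BY o.order_date
"""
consumer = "looker:revenue_dashboard"  # hypothetical consumer id from BI-tool metadata

parsed = sqlglot.parse_one(logged_query)
source_tables = {
    f"{t.db}.{t.name}" if t.db else t.name
    for t in parsed.find_all(exp.Table)
}

# Each (source table -> consumer) pair is one edge of a lineage graph.
edges = [(table, consumer) for table in sorted(source_tables)]
print(edges)  # [('raw.orders', 'looker:revenue_dashboard'), ('raw.payments', ...)]
```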
A lineage graph, built by parsing and analyzing metadata from different sources, is a good example of derived metadata. This is what we at Alvin call operationalizing metadata.
What problems can metadata solve?
To use metadata to its fullest potential, you need to collect the right metadata from all of your data sources. Even though most of it resides in your data warehouse, looking at the metadata generated by your warehouse alone will not give you the full picture.
Before analyzing metadata, teams should gather all the information from their business intelligence, transformation, warehousing, and orchestration tools into one central location.
With this high-value metadata, you can address many different business problems:
No more breaking things — get proactive about pipeline errors
It can be hard to predict what’s going to happen when you make infrastructure changes. Sometimes the most innocent SQL query can impact parts of your system that you didn’t even know existed.
If you have lineage provided by derived metadata, you know where your data comes from and where it goes. And once you know that, you can run impact analysis and regression testing before actually making changes, avoiding breakages and major incidents.
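A minimal sketch of that kind of impact analysis, assuming the lineage edges have already been derived (they are hard-coded below, and networkx stands in for whatever graph store you actually use):

```python
# Sketch: estimating the blast radius of a change from a lineage graph.
import networkx as nx

lineage = nx.DiGraph()
lineage.add_edges_from([
    ("raw.orders", "staging.orders_clean"),
    ("staging.orders_clean", "reporting.daily_revenue"),
    ("reporting.daily_revenue", "looker:revenue_dashboard"),
    ("raw.payments", "reporting.daily_revenue"),
])

def impacted_assets(graph: nx.DiGraph, changed_table: str) -> set[str]:
    """Everything downstream of the table you are about to change."""
    return nx.descendants(graph, changed_table)

# Before dropping a column from raw.orders, check what could break.
print(impacted_assets(lineage, "raw.orders"))
# {'staging.orders_clean', 'reporting.daily_revenue', 'looker:revenue_dashboard'}
```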
Optimization of data usage and cost
Without constant attention, unused tables and dashboards can pile up fast: think of all the single-use tables for ad-hoc analysis you’ve seen in your life. It can get tricky to figure out if all the pipelines you have running at any given time are really necessary.
So you could join the companies out there that are spending thousands of dollars on compute and storage for things that aren’t even being used… Or you could optimize it.
Once again, metadata to the rescue. With usage and lineage metadata, you can begin to understand how your data is being used and what it’s costing you, find what is not being used, and do periodic clean-ups and other kinds of optimizations.
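As a toy example of the idea, compare the tables registered in your catalog against the tables actually referenced in recent query logs; the tables and queries below are invented, and a real version would also factor in storage and compute costs.

```python
# Sketch: flagging tables that exist in the catalog but have not been read recently.
import sqlglot
from sqlglot import exp

catalog_tables = {"raw.orders", "raw.payments", "scratch.tmp_analysis_2021"}
recent_queries = [
    "SELECT * FROM raw.orders WHERE order_date > '2023-01-01'",
    "SELECT order_id, amount FROM raw.payments",
]

recently_read = set()
for query in recent_queries:
    for table in sqlglot.parse_one(query).find_all(exp.Table):
        recently_read.add(f"{table.db}.{table.name}" if table.db else table.name)

unused = catalog_tables - recently_read
print(unused)  # {'scratch.tmp_analysis_2021'} -- a clean-up candidate
```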
PII tracing and compliance
Much of the work needed to comply with privacy regulations, such as GDPR, CCPA and LGPD, falls on engineers’ shoulders. And just like any other kind of data, sensitive data flows from its source to many different tables, dashboards, notebooks, spreadsheets, machine learning models and more.
Manually (and accurately) keeping track of where PII ends up is a monumental task; it can be handled far more smoothly, and with less cruelty to engineers, by tracking the flow of data with lineage.
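A small sketch of the idea: tag the known PII sources and walk the lineage graph downstream to see every asset they feed. The edges and tags are hard-coded for illustration; in practice both come from derived metadata.

```python
# Sketch: tracing where PII ends up by following lineage downstream from tagged sources.
import networkx as nx

lineage = nx.DiGraph([
    ("raw.customers", "staging.customers_clean"),
    ("staging.customers_clean", "reporting.churn_model_features"),
    ("staging.customers_clean", "sheets:marketing_export"),
    ("raw.orders", "reporting.daily_revenue"),
])

pii_sources = {"raw.customers"}  # tables known to contain personal data

pii_exposure = set()
for source in pii_sources:
    pii_exposure |= nx.descendants(lineage, source)

print(pii_exposure)
# {'staging.customers_clean', 'reporting.churn_model_features', 'sheets:marketing_export'}
```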
Data trust
How is this field calculated? Should I use the VALUE or the PRICE field? Where does this data come from? These are common questions that data consumers ask engineers, usually when working with new datasets.
Lineage can make answering these questions self-serve, building trust in the data while reducing the burden on data engineers.
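A tiny sketch of how that self-serve answer can come straight from the lineage graph, this time walking upstream instead of downstream (again with hard-coded, made-up edges):

```python
# Sketch: answering "where does this data come from?" by walking lineage upstream.
import networkx as nx

lineage = nx.DiGraph([
    ("raw.orders", "staging.orders_clean"),
    ("raw.payments", "staging.payments_clean"),
    ("staging.orders_clean", "reporting.daily_revenue"),
    ("staging.payments_clean", "reporting.daily_revenue"),
])

def upstream_sources(graph: nx.DiGraph, asset: str) -> set[str]:
    """Every asset that feeds into the one a consumer is asking about."""
    return nx.ancestors(graph, asset)

print(upstream_sources(lineage, "reporting.daily_revenue"))
# {'raw.orders', 'raw.payments', 'staging.orders_clean', 'staging.payments_clean'}
```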