We’ve all been there – drowning in an ever-increasing amount of tables, struggling to understand what this table is used for - who added it, who is using it and what for: the list goes on. It’s incredibly easy to add new pipelines to create more data, but ever harder for data teams to clean up and curate the data. Storage insights leverages and unites all available metadata about tables to allow fast decision-making without all the detective work.
After talking to our customers, we realized that the problem of removing unused or duplicate data is an increasingly time-consuming and error-prone process. Unused and duplicate data is problematic for a range of reasons:
- Compliance
- Cost
- No source of truth
- Lack of trust
- Overhead in communication
- …and the list goes on.
Deleting a table is a lot more involved than just issuing the statement, it’s about understanding where the table “sits” in the overall pipeline (is it a source, staging or metric table), who “owns” it, who is using it, and what for, and if the table is connected to any dashboard, processes or systems outside of the warehouse. Once this is done, the process of engaging with potential stakeholders starts, and after that again, the actual work can be done. Phew..
It became clear to us that this process can be streamlined by bringing all of this information together. In true Alvin spirit, we collected, aggregated and summarized all the data in a neat list for our users.
Getting hands on:
First, a little secret: Alvin is running on top of BigQuery. This means many things, but first and foremost - we are actively dogfooding our own product to solve the same problems as our customers, for ourselves. What you will see here is screenshots of our own production environment and how we have used the product to cut our storage cost and remove unused tables. We have, as the chart indicates, actively used our product to reduce our storage and processing costs by over 50%.
The first thing I’m doing is to hover over the badge in the “Analytical users” column of an arbitrarily selected table and this shows me the analytical users of this table. This is important; analytical (i.e SELECT) usage is very different than if the table is used as part of a pipeline. This very quickly gives you an idea of the users who are actually querying the table.
Furthermore, we have analytical usage count and downstream usage count of the table. Showing downstream analytical usage is also important, as it's not uncommon to have tables that are important as pipeline sources or intermediary tables. Deleting such tables would break pipelines and thus tables and dashboards further down the pipeline.
The “Compute cost” column indicates the cost of queries building this table over the time period selected.
Powerful grouping and filtering capabilities
We’re not done yet, however. The tool offers powerful slicing and dicing capabilities. A commonly requested feature is to understand which schemas or even databases/projects have low utilization, and with a simple click you can aggregate all the stats across multiple dimensions, like schema, database or even owner/team/domain. You can also filter across these dimensions to drill down into a specific team, domain or database/project.
To make life even easier, the most common workflows have predefined filters in the sidebar, so you can just jump straight in!
- Source tables
- Intermediary pipeline tables
- Pipeline output tables
- “Island” Tables that are not part of any pipeline but used for analytical purposes
- And lastly, unused tables.
Next steps
As always, we are continuously improving our product through feedback, and with this feature there are already plans to enable automatic deletion of tables based on user-defined criteria.
It’s hard to really cover the feature in full detail, with all the things you can do to order, group and filter the data. So feel free to sign up at Alvin to test out our demo environment, and if you like what you see you can connect your own environment for free to give your data warehouse an overdue spring clean! Or if you would just like to share feedback, you can drop me a message on Linkedin.