Data quality is on almost every data practitioner's lips these days. Low-quality data tends to drive poor results, no results at all, or outright catastrophes such as investing heavily in the wrong initiatives or breaking data laws and regulations.
But "quality" is a nuanced characteristic of data, and achieving it involves several processes: ingestion, parsing, profiling, cleansing, and more. The parameters that define quality can also vary a lot depending on a company's niche, stage, and size.
Recently we held a live event at Alvin where I had the pleasure of chatting with Sarah Floris, Mark Freeman and Martin Sahlen about how to define what data quality means to companies, and how to navigate the hundreds of techniques and tools for achieving it.
In this article, I'll bring you the points of view of our three esteemed panellists.
A spectrum of needs
When Mark first took up the cause of advocating for data quality in his organization, he faced a significant hurdle: the lack of a clear-cut definition. As a result, whenever the term "data quality" came up, particularly from an engineering standpoint, it was often misconstrued as a pursuit of flawlessness for its own sake. But that wasn't Mark's intention at all.
From Mark’s perspective, the concept of data quality encompasses a broad spectrum of requirements across various domains. The fundamental question he poses is: What state should the data be in to yield value or mitigate risks?
With this in mind, Mark has approached data quality with a specific focus on risk mitigation because of its profound impact. For instance, he recognizes the importance of data governance, which entails meeting regulatory obligations and upholding specific standards; that alone is a powerful motivator for emphasizing data quality. Similarly, with machine learning models, factors at every stage, such as nulls or drift, can significantly degrade performance, so ensuring data quality becomes paramount.
To steer the conversation away from perceiving data quality solely as the presence of clean and easily manageable data, Mark proposes adopting a perspective centered on well-defined standards. This involves establishing guidelines and striving to meet those standards, aligning with the organization’s business model, and contributing to revenue generation or risk mitigation.
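To make this idea concrete, here is a minimal sketch of what codifying a standard as explicit, testable rules might look like. The table, column names, and thresholds below are hypothetical illustrations, not something Mark prescribed:

```python
# Hypothetical illustration: a data quality "standard" expressed as explicit,
# testable rules rather than a vague pursuit of perfectly clean data.
import pandas as pd

# Assumed threshold agreed with the business: at most 1% of customer rows
# may be missing an email address.
MAX_NULL_EMAIL_RATIO = 0.01

def check_customer_standard(df: pd.DataFrame) -> list[str]:
    """Return a list of violated rules; an empty list means the data meets the standard."""
    violations = []

    # Rule 1: email completeness stays within the agreed threshold.
    null_ratio = df["email"].isna().mean()
    if null_ratio > MAX_NULL_EMAIL_RATIO:
        violations.append(
            f"email null ratio {null_ratio:.2%} exceeds the {MAX_NULL_EMAIL_RATIO:.0%} standard"
        )

    # Rule 2: no customer can have signed up in the future.
    future_signups = (pd.to_datetime(df["signup_date"]) > pd.Timestamp.now()).sum()
    if future_signups > 0:
        violations.append(f"{future_signups} rows have a signup_date in the future")

    return violations
```

The specific checks matter less than the fact that the thresholds are agreed with the business, so "quality" maps to revenue or risk rather than to an implicit notion of "clean."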
A team effort
Sarah embarked on her career as a data consultant, initially working as a team of one. That experience quickly shed light on the direct impact of data quality on outcomes.
At first, Sarah didn’t fully grasp the software engineering perspective on data quality. She viewed it as an integral part of her job and a top priority. However, she soon realized that data quality is an ongoing process that involves understanding the unique requirements of the data and its intended purpose.
Initially, Sarah believed that all data had to be impeccably clean. But upon closer examination, she came to a significant realization: certain areas of the business didn't require immediate data quality checks. This newfound understanding allowed her to better prioritize her time as a data scientist and machine learning engineer.
Sarah recognizes that data quality is not solely the responsibility of a single data engineer. It requires a collaborative effort to decide on the most effective approach and to identify who should own each aspect of data quality. While engineering teams typically handle tasks like data ingestion and load frequency, they may not be directly involved in scrutinizing the data itself, such as counting nulls. That responsibility may fall to data analysts, data scientists, or simply whoever on the team has the bandwidth.
In essence, Sarah perceives data quality as a combination of teamwork and alignment with the specific needs of the business.
Quality is different for every company
According to Martin, there are parallels to be drawn between software development and data management. To define quality in both realms, he emphasizes the importance of stepping back and carefully delineating the specific use case. The primary goal is to identify desired outcomes and achievements. This systematic approach helps stakeholders grasp the essence of the concept and its potential benefits.
While this methodology may seem abstract, Martin underscores the importance of considering technical practicalities and drawing on past experience. In a previous role, he encountered datasets that initially seemed similar but turned out to be quite different, and he saw first-hand how costly assumptions can be in such situations.
Having been involved in both buying and selling data, Martin knows how profoundly data quality affects business decisions and revenue. But he is clear that there is no one-size-fits-all approach: each case demands a highly specific, tailored strategy, which makes defining the parameters that determine data quality a crucial step. One common pitfall is focusing solely on null checks, which on their own fall far short of a comprehensive data quality solution.
To truly grasp the concept, Martin insists on a deeper exploration that considers the characteristics of the data source. Is it real-time or batch? What is the inherent nature of the data? For example, clickstream data involves high frequency, while order data in a fulfillment system possesses unique characteristics.
Critical questions arise: Can there be instances of missing or duplicate data? Are there systems that generate data on a monthly basis? Reflecting on these aspects helps navigate the realm of data quality more effectively.
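As a purely illustrative sketch, and assuming a pandas DataFrame for a batch order feed with hypothetical order_id and created_at columns, checks that answer these questions can stay very lightweight:

```python
# Hypothetical checks for a batch order feed: duplicates, missing keys, freshness.
import pandas as pd

def profile_orders(df: pd.DataFrame,
                   key: str = "order_id",
                   ts_col: str = "created_at",
                   max_staleness_hours: int = 24) -> dict:
    """Summarize duplicate, missing, and stale data for a single table."""
    ts = pd.to_datetime(df[ts_col])
    hours_since_last = (pd.Timestamp.now() - ts.max()) / pd.Timedelta(hours=1)
    return {
        "duplicate_keys": int(df[key].duplicated().sum()),   # can data be duplicated?
        "missing_keys": int(df[key].isna().sum()),            # can data be missing?
        "hours_since_last_record": round(hours_since_last, 1),
        # A monthly feed would need a very different threshold than clickstream data.
        "is_stale": hours_since_last > max_staleness_hours,
    }
```

The thresholds here are stand-ins; the point is that the right answers depend on the nature of the source, exactly as Martin argues.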
Of course, helping data teams answer these questions is what drove Martin to build Alvin. If you want to see how data lineage and observability can help you nail your data quality strategy, you can try Alvin for free.