4 Approaches to Data Quality: Which is the Best?

If you're unsure about the best approach to ensure data quality in your organization, this guide offers a breakdown of four key strategies—manual data quality, automated data quality, data observability, and data quality testing—each with its own benefits and outcomes. Explore which method best aligns with your business needs, team capabilities, and operational scale. Whether your goal is reliability, efficiency, or precision, this guide will help you select the right approach to build trust in your data.


Introduction

In collaboration with Soda, Nicola Askham, known as The Data Governance Coach, has authored a comprehensive three-part series that addresses critical topics including data governance, AI, and data quality. Nicola specializes in helping organizations enhance their data management practices. Over the past twenty years, she has assisted numerous corporations in reducing costs and inefficiencies through her dedicated coaching, consulting, and training initiatives.

Four Approaches to Data Quality

Being a Data Governance enthusiast, I do love a good definition and I often start my blogs with one. Usually, I don't have to bother with a definition when writing about data quality, as everyone understands the terminology. But when it comes to understanding the best approach for getting control of data quality, are you comfortable with the difference between manual data quality, automated data quality, data observability, and data quality testing? They all exist to make sure that consumers in your organization can trust the data they are using, but each has its own way of working and its own outcomes.

1. Manual Data Quality

Manual data quality refers to a traditional approach in which business users use business rules to define what makes data quality "good enough" for their use. A data quality analyst then translates these rules into code that measures data quality, typically using SQL. It is what we have been doing for years, and is sometimes referred to as operational data quality. However, this process is never as straightforward as I’ve described it; it usually involves running these rules against the data and identifying exceptions, which business users review. This often leads to an iterative cycle of refining the rules to better meet business needs.

The main drawbacks of manual data quality are its time-consuming and labor-intensive nature. Translating business rules into code and running these checks requires significant human effort and expertise. The iterative process of refining rules can be slow, especially as exceptions are identified and reviewed by business users. This approach can also be prone to human error and may struggle to keep pace with the evolving needs of the business. Additionally, it may not scale well with larger datasets or more complex data environments.
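To make the workflow concrete, here is a minimal sketch of a manual check. The business rule, table, and data are all hypothetical; the point is the pattern of translating a business rule into SQL and surfacing exceptions for review (shown here with Python's built-in sqlite3 so it runs anywhere).

```python
import sqlite3

# Hypothetical business rule: "every customer must have a non-empty email
# address". A data quality analyst translates that rule into SQL and
# surfaces the exceptions for business users to review.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "ada@example.com"), (2, ""), (3, None)],
)

# The manual check: list the rows that violate the rule.
exceptions = conn.execute(
    "SELECT id FROM customers WHERE email IS NULL OR email = ''"
).fetchall()

print(f"{len(exceptions)} rows fail the email rule: {exceptions}")
# -> 2 rows fail the email rule: [(2,), (3,)]
```

In practice the business users would review those two exceptions, and the rule (or the data) would be refined in the iterative cycle described above.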

2. Automated Data Quality

Automated data quality leverages artificial intelligence to enhance efficiency in defining and enforcing data quality rules. Contrary to the belief that AI can entirely replace human involvement in data quality, AI tools assist users by converting business requirements into executable checks without requiring manual coding. This “no-code” approach allows business users to articulate their data quality needs in natural language, which AI then translates into actionable rules. This significantly accelerates the process of setting up data quality tests and reduces the overhead for engineering involvement. One concern is the reliance on the quality of training data and algorithms—poorly trained AI models can produce inaccurate or biased results. The "black box" nature of some AI and ML systems adds to this issue, as their decision-making process isn't always transparent, making outcomes harder to trust. While AI handles routine tasks well, it often misses context-specific nuances that require human judgment. Over-reliance on automation can also reduce human oversight, leading to complacency in monitoring data quality.

3. Data Observability

Data observability is often confused with data quality, and the two terms are often used interchangeably, which can be misleading. Originating from the fields of software engineering and DevOps, data observability focuses on the reliability of data over time. It uses metrics, logs, and lineage to detect anomalies, schema changes, and shifts in the volume or types of data. It functions in production environments, flagging issues as they occur. It's important to note that data observability is primarily reactive rather than preventative when it comes to data quality. By the time an issue is observed, the damage may already be done. While it helps identify changes and anomalies that might affect data reliability, it doesn't prevent issues from entering production.

4. Data Quality Testing

Data quality testing, particularly at the left-most part of the data pipeline, aims to be preventative. It involves validating data before it flows into production environments, stopping breaking changes before they impact downstream systems. Data contracts between teams can define expected data behaviors, ensuring reliable data quality and preventing costly errors. Data quality testing is essential for ensuring the quality and accuracy of data, but it can be resource-intensive and complex: test cases must be created and maintained as data sources evolve, and the approach may not scale with growing data demands.
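A minimal sketch of this "shift-left" idea: validate a batch against a simple data contract before it is allowed into production. The contract format, field names, and sample data here are all illustrative, not any specific standard or tool.

```python
# Hypothetical data contract: the expected field names and types agreed
# between the producing and consuming teams.
contract = {
    "order_id": int,
    "amount": float,
    "currency": str,
}

def validate(rows, contract):
    """Return a list of (row_index, field, reason) violations."""
    violations = []
    for i, row in enumerate(rows):
        for field, expected_type in contract.items():
            if field not in row:
                violations.append((i, field, "missing"))
            elif not isinstance(row[field], expected_type):
                violations.append((i, field, "wrong type"))
    return violations

batch = [
    {"order_id": 1, "amount": 19.99, "currency": "EUR"},
    {"order_id": 2, "amount": "free", "currency": "EUR"},  # breaks the contract
]
problems = validate(batch, contract)
if problems:
    print("Reject batch:", problems)  # stop the breaking change upstream
```

Because the check runs before the data reaches the consumption layer, the bad record is rejected at the handover point rather than discovered downstream.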

Which Approach is the Best?

Which one is right for you? Is one better than the other? Well, it depends! As with many things, it depends on the needs of your business, the capabilities of your team, the scale at which you operate, the nature of your datasets, and your time and budget.

There are a lot of big changes happening in data quality at the moment, so I thought it would be great to talk to Maarten Masschelein, CEO and Co-founder of Soda, and ask him to explain it all. I’ve always loved talking to Maarten because he also has a background in data governance, as one of the first employees at Collibra. It was there that he started to see that many companies were struggling to operationalize both data governance and data quality, which gave him the idea for Soda. As part of our discussion, we set ourselves the exercise of coming up with a good analogy. We think this one, about a chef and their produce, captures the distinctions between the various data quality practices by comparing them to different aspects of managing a kitchen and cooking.

Imagine you're running a kitchen, and you're sourcing produce for various dishes. You sort the ingredients into different buckets according to quality: some are of prime quality, others are a bit bruised or nearing their expiration date. This is like manual or operational data quality—categorizing and setting rules based on what you know is required for different dishes (or data processes). You might have a rule that the freshest tomatoes go into a salad, while the slightly older ones get cooked down into a sauce where no one will notice the difference. This process is iterative, as the chef might adjust their standards based on the results, sometimes refining their selection criteria to get the best outcome.

“Manual data quality typically refers to the process where you involve the end users or consumers of your data and you collect the requirements, what do they expect of the data? Because we want to make sure the data is fit for purpose.” 

Then there's data observability, which is keeping an eye on the whole kitchen operation. You're constantly monitoring how ingredients are used, how long they’ve been sitting out, and whether the storage conditions are maintaining their quality over time. It's about making sure that everything stays reliable and consistent, even as conditions change. 

“Data observability is in the realm of reliability which is a measure of quality over time. It doesn’t necessarily tell us if the data is good for a specific purpose, but ensures that any changes are detected and flagged for review, maintaining the overall health and stability of the data environment.”

Data quality testing is like taste-testing different parts of a dish as you prepare it. You might try a spoonful of sauce to make sure the seasoning is right or cut into a piece of meat to check if it’s cooked properly. These tests help catch issues early so that you can correct them before the dish is complete.

“The complementary thing to do is to add testing, which is why data contracts is such a big deal, as people try to define the handover points between teams and software engineering. You want to go test early because if you let your data flow through the pipelines into the consumption layer, any problems are going to cost you a lot of money.”

Finally, automated data quality, especially when reliant on AI, is like having a smart kitchen assistant that suggests what to do with produce based on its condition. It might say, “These tomatoes are getting soft—let’s make them into a sauce,” or “This fish isn’t as fresh as we’d like; let’s marinate it and use it in a stew where the texture won’t be as noticeable.” This automated approach helps ensure that every ingredient is used efficiently and appropriately, without relying solely on a chef’s constant attention.

“The automated part typically refers to the process of running a number of standard checks from your data. The potential scope is much wider than what constitutes automated data quality because it introduces automation via machine learning or GenAI to more efficiently establish and maintain good data quality.”

We did touch on the transformative potential of generative AI for data quality management - we couldn’t not! Soda AI can help automate or enhance your approach to data quality in a few ways.

  • Automation of checks: GenAI can automate the creation of data quality checks by converting natural language requirements into executable rules. This is exemplified by tools like Soda’s Ask AI assistant, which translates business language directly into enforceable data quality checks.
  • Enhanced detection: By understanding the context and semantics of the data, ML-driven checks can detect anomalies and quality issues in the data more efficiently.
  • Debugging and Root Cause Analysis: GenAI aids in identifying and understanding issues within the data. It can analyze records to pinpoint formatting errors or suggest corrections based on the context.

Outside of Soda, you can use AI for data correction. Beyond detecting issues, GenAI can propose fixes, such as filling in missing or incorrect data by leveraging an extensive knowledge base.

Is there a one-size-fits-all approach? I don’t think so. A combination of approaches is required for the kitchen to run smoothly. Manual and automated quality checks ensure that the produce meets the necessary standards before it’s used in recipes, just as data quality checks ensure data is fit for business use. Data observability maintains the ongoing reliability of the ingredients, ensuring that any unexpected changes are caught early. This is important for ingredients that might degrade over time or require specific storage conditions.

Maarten mentioned that many of Soda’s users begin by implementing the anomaly dashboard for observability before advancing to data quality testing and operational data quality. However, some choose to focus solely on the most critical data with operational data quality and testing. Does everyone fully embrace automation? With caution. Maarten believes that “the more automated we can make this with a human in the loop, the happier everyone will be. When everyone can easily get involved in data quality, we’ll all have access to reliable data products.”

🎧 You can listen to the conversation with Maarten in full here.

What’s Next? 

Good luck!