In the era of big data, decision-making is all about inferring a future state by understanding the past and present. But when your data doesn’t properly capture the reality of your business, it won’t serve as a reliable basis for any predictive model. Rather than driving smart business decisions, data that’s not validated for quality and reliability can be worthless, or even damaging, to your company.
Unlike buggy code, which causes software to break, low-quality data can remain undetected for a long time. But when it creates issues, the firefighters (a.k.a. the data engineering team) are called to the rescue. At Soda, we often see data engineers who spend too much of their time patching up existing data pipelines and debugging data issues when their expertise would be far better used designing and optimizing the company’s overall data infrastructure, or building new data products.
It is for these firefighters that Soda exists. Data quality and reliability checks help businesses detect data-related issues long before they have a negative impact. In this guide, we share a few simple, effective checks that you can implement today to help your business run more smoothly and efficiently. Plus, we’ll share our thoughts on some longer-term solutions that will help you place good data at the heart of your business model.
Data quality checks formulate your expectations of the tables in your database or of the columns within a table. You could, for example, specify that your datasets shouldn’t be empty or that a certain column shouldn’t contain duplicate values. The Soda Checks Language (SodaCL) is a concise, readable language built expressly for data quality and reliability. Data quality expectations can be defined in Soda in a number of ways. Data engineers and technical users can write SodaCL checks directly in a checks.yml file, use check suggestions in the Soda Library CLI to generate a basic set of data quality checks, or add SodaCL checks to a programmatic invocation of Soda Library. Non-technical and business users, such as data analysts or data scientists, can use a simple user interface: dropdown menus and pre-populated fields make it easy to specify data quality rules with no-code checks. In addition, you can provide natural language instructions to SodaGPT, the first AI co-pilot for data quality, to receive fully-formed, syntax-correct checks.
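To make this concrete, here is a minimal sketch of a checks.yml file; the dataset name dim_customer and the column names are placeholders, and the exact syntax may vary slightly between SodaCL versions:

```yaml
# checks.yml: a minimal SodaCL example (dataset and column names are placeholders)
checks for dim_customer:
  - row_count > 0                     # the dataset should never be empty
  - duplicate_count(customer_id) = 0  # customer IDs should be unique
  - missing_count(email) = 0          # every record should have an email address
```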
To compare the expectations outlined in your checks file with your actual data, Soda runs a scan against your datasets to extract metadata and gauge data quality. The results of the scan alert you to irregularities in your data. Depending on the type of alert and the relevance of the affected data, you may take different measures to address the issues, such as fixing the source of the problem or attaching a warning to the data before handing it over to another team. For a detailed introduction to Soda, have a look at our guide to implementing data quality checks.
Proactively checking data in order to prevent downstream impact introduces an element of foresight into the processes and workflows that rely on (good-quality) data. This approach is very different from the reactive approach that we’ve seen in many companies. In a reactive workflow, when a problem occurs, the data engineer has to go in ASAP and write ad-hoc checks and fixes. Too often, this means that they are inundated with tickets, resulting in the notorious data engineering bottleneck and frustration across the team.
We’ve also seen data engineers routinely repeat the same manual reliability checks — for instance, at ingestion or after a transformation. They usually know that this situation is far from ideal but don’t have the time or resources to look for alternatives.
Here’s the good news: if you’re a data engineer looking to automate your data quality procedures, you don’t need to reinvent the wheel. As experts in the space, we’ve identified some checks that will make your life easier from day one and require almost no domain knowledge. If any of these checks sound an alarm during a scan, then there’s a high likelihood that something is off.
Simple but effective, a row count check lets you make sure that your datasets aren’t empty — an important prerequisite for any downstream task. Row count checks can also alert you to unusual spikes in the volume of your data. When a transformed dataset suddenly contains many more rows than expected, it could point to a bug in your analytics code, such as an outer join being incorrectly used to join two tables instead of an inner join.
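As a sketch in SodaCL, both variants could look like the following; the dataset name and thresholds are placeholders to adapt to your own data:

```yaml
checks for orders:
  - row_count > 0                      # fail if the dataset is empty
  - row_count between 1000 and 50000   # flag unexpected drops or spikes in volume
```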
A schema describes the columns in your dataset. Although dataset schemas may change during the early stages of your business – columns added or removed, or changes to column ordering – they should stabilize at some point. Add a schema evolution check to automatically monitor changes to your schema and notify you when anything happens. Run two scans to start seeing results: the first captures a baseline measurement, the second runs a comparison against it.
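A schema evolution check might be configured as sketched below; the dataset name is a placeholder and the configuration keys may differ slightly between SodaCL versions:

```yaml
checks for orders:
  - schema:
      warn:
        when schema changes: any   # columns added, removed, reordered, or retyped
```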
At a time when new data points are produced and transmitted in a continuous flow, it is particularly important to keep an eye on the timeliness of data. To that end, you can use SodaCL to implement a freshness check on a date or timestamp column. For instance, you could use it to configure an alert if the youngest data in a dataset is older than a day. When triggered, it alerts you to roadblocks in your larger data ecosystem. Perhaps a third-party supplier accidentally sent a file with old data? Or maybe a pipeline didn’t run correctly? With a freshness check, you’ll know.
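For example, assuming a placeholder dataset orders with a timestamp column created_at, the check from that scenario might look like this:

```yaml
checks for orders:
  - freshness(created_at) < 1d   # fail if the most recent row is more than a day old
```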
Duplicate values can greatly distort datasets. Apply a duplicate check to make sure a column contains only unique values. You may, for instance, apply it to both order_id and account_number to make sure that orders are not falsely duplicated.
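A sketch of that duplicate check, with placeholder dataset and column names, might read:

```yaml
checks for orders:
  - duplicate_count(order_id) = 0        # each order ID should appear only once
  - duplicate_count(account_number) = 0  # account numbers should not repeat either
  # multiple columns can also be combined, e.g. duplicate_count(order_id, account_number) = 0
```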
Did someone accidentally enter a date incorrectly? Should a column of order numbers contain a certain number of characters? Wouldn’t you like to know if either of those things has happened? Use a validity check to issue warnings when data in your dataset is invalid or unexpected.
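A validity check pairs a threshold with a definition of what counts as valid; the column names, expected length, and allowed values below are placeholder assumptions:

```yaml
checks for orders:
  - invalid_count(order_number) = 0:
      valid length: 10                             # expected number of characters
  - invalid_count(status) = 0:
      valid values: [pending, shipped, delivered]  # anything else counts as invalid
```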
A report on forecasted revenue will not yield very accurate predictions if a monthly payments column is missing values. Use a missing check to find the NULLs and make sure the data that your teams are working with is complete.
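Sketched in SodaCL with placeholder names, a missing check might look like this:

```yaml
checks for payments:
  - missing_count(monthly_payment) = 0   # fail if any monthly payment value is NULL
  # alternatively, tolerate a small share of missing values:
  # - missing_percent(monthly_payment) < 5%
```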
We never get tired of repeating it: automating your data quality checks will bring your company nothing but positive results. Data engineers can go back to doing their actual jobs and hopefully be relieved of the pressure associated with undetected data quality issues. No more data engineer-related bottlenecks!
Of course, unreliable data is not just a constant source of stress for the data engineer. It also results in an environment in which you never really know how much you can trust your data-informed decisions. After all, even the cleverest machine learning model will only be as good as the data it’s trained on. Further, having automated data quality checks in place also increases the potential for self-service analytics, which we’ll go into in another guide.
Data quality isn’t an absolute: whether data counts as good or bad depends very much on what you want it to achieve. For example, the same dataset can have different quality requirements depending on whether it’s used for reports that only a few people read, or for making strategic decisions for a whole department.
When everyone in your company is clear about what they expect from the data they use, you get better-informed conversations about data. Here are two more ways you can guide your teams toward an environment of trusted data.
Regular, automated quality checks are an important foundation for any data-driven business. But they can only provide true value when someone is responsible for addressing the alerts raised during a scan. That’s why every dataset should have a data owner, a person who is ultimately accountable for the quality of that data. When there’s an issue or someone further downstream requires a change, the data owner is their contact person.
Note that data owners are not typically data engineers. That’s because a data engineer’s expertise lies in managing the data rather than understanding the content and context of the data itself. A data owner brings domain expertise to the table with their intimate knowledge of what the data represents and the processes that generate it. Data owners and engineers work closely together to bring high-quality data to everyone on the team who needs it.
Teams often want their data-based products to be 100% accurate but are unaware of how unrealistic that expectation is. In reality, data that is truly interesting can also be very messy! Real-life data always has missing values, outliers, and other noise. A good way for your company to respond to your data’s inherent variability is by quantifying the reliability of the data as a “health score.”
Let’s imagine for a moment that one of the datasets used in a periodically updated dashboard fails its freshness check. By introducing a health score, you can still update your dashboard despite the stale data, but signal to viewers that it is slightly less reliable than previous iterations. The users of your data can then decide whether to wait for more reliable data or work with what they already have.
Getting a grip on data quality can feel like an insurmountable challenge, but not anymore! By introducing procedures dedicated to data quality and reliability into your workflow, you can enable data engineers to put their expertise to its best use. Plus, everyone in your company is rewarded with better-quality, trustworthy data to work with.
Start a free trial of Soda to implement foundational data quality checks today and avoid the pain of not knowing, or finding out too late, that a data quality issue has had a downstream impact. If you’d prefer to talk directly to us, schedule a meeting.
Good luck!