In the ever-evolving world of data management, quality is the linchpin keeping everything sound. Maintaining high-quality data is imperative whether you're developing intricate machine learning models or crafting insightful dashboards for pivotal decision-making.
But the path to achieving this objective can seem like an intricate maze, filled with myriad metrics and elements to track and validate. How do you identify what needs to be tested? How can you make your efforts yield your expected outcome? Where should you even start?
Automate the Basics
When it comes to data quality coverage, a few simple checks can make a world of difference. Regular updates of your data are crucial, so consider adhering to established Service Level Agreements (SLAs) for consistent updates. Pay special attention to columns with human-entered input to confirm they align with a predetermined format.
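To make the freshness idea concrete, here is a minimal Python sketch of an SLA-style freshness check. It is an illustration only and not how Soda implements freshness; the function name is_fresh, the thresholds, and the sample timestamps are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(latest_update: datetime, now: datetime, max_age: timedelta) -> bool:
    """Return True if the newest record is no older than the agreed SLA."""
    return now - latest_update <= max_age


# Hypothetical values: the newest row arrived 6 hours before the check runs.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
latest = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)

print(is_fresh(latest, now, max_age=timedelta(days=1)))   # True
print(is_fresh(latest, now, max_age=timedelta(hours=1)))  # False
```

Passing the current time in as an argument, rather than reading the clock inside the function, keeps the check deterministic and easy to test.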
Routine procedures, like checking for duplicate entries or null values in your important columns, can safeguard the integrity of your data. As obvious as it might seem, it’s not uncommon to find that even those basic quality checks are missing from key data assets. The reason? Where teams don’t have strict data entry or data quality processes in place, people simply forget to check and optimistically hope that it will all be okay. Spoiler alert: it won’t.
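As a rough illustration of what these routine checks amount to, the Python sketch below counts missing and duplicated values in a small in-memory dataset. The helper names count_missing and count_duplicated_values, the choice to treat empty strings as missing, and the sample rows are all assumptions for this example, not Soda's implementation.

```python
from collections import Counter


def count_missing(rows, column):
    """Count rows where the column is None or an empty string."""
    return sum(1 for row in rows if row.get(column) in (None, ""))


def count_duplicated_values(rows, column):
    """Count distinct non-missing values that appear more than once."""
    values = (row.get(column) for row in rows)
    counts = Counter(v for v in values if v not in (None, ""))
    return sum(1 for n in counts.values() if n > 1)


# Hypothetical sample data with one missing email and one repeated ID.
rows = [
    {"customer_id": 1, "email_address": "ana@example.com"},
    {"customer_id": 2, "email_address": None},
    {"customer_id": 2, "email_address": "bo@example.com"},
]

print(count_missing(rows, "email_address"))          # 1
print(count_duplicated_values(rows, "customer_id"))  # 1
```

A passing dataset would report zero for both counts on its important columns.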
Establishing basic data quality coverage shouldn’t be left to chance, or your team’s maturity; it should be systematic and automatic. To this end, and drawing from our experiences working in data teams, we’ve designed the new check suggestions functionality based on what a mature data team would build in-house.
Let the Automation Guide You, Not Blind You
At Soda, we believe in the power of declarative data quality testing. That’s why we developed Soda Checks Language (SodaCL) and why we have focused heavily on data testing via explicit, user-declared rules.
But even with this powerful, intuitive language, we don’t expect you to face the world of data quality coverage alone. No, we believe the right degree of automation – call it conversational automation – can help you and your team follow best practices to get from zero to “whew” in just a few minutes. What is the single Soda Library command that does this for you? soda suggest
This powerful feature takes the guesswork out of establishing basic data quality checks. It paves the way for you to easily kickstart the data quality process by profiling your data, then recommending relevant checks. Rather than starting from scratch by asking yourself, “What checks do I need, here?” you can run soda suggest and answer Soda’s yes/no or multiple-choice questions in the command line to produce a solid, production-ready file full of checks, ready to run a scan.
Without exaggeration, five minutes is all it takes. Surely, it’s worth it to validate that your dataset contains complete, valid, fresh, anomaly-free data!
The Magic. Watch it Happen.
Let’s take a look at some of the key elements of the end-to-end check suggestion flow.
Select Your Suggestions
We know that you may not always want suggestions for all check types, so we’ve got you covered. One of the first questions in the check suggestion flow asks you to select the checks for which you’d like suggestions. You can select one or two if that’s all you need. In the screenshot below, we select everything because, honestly, the whole process is quite fast. And why wouldn’t you want more coverage?
Smart Suggestions for Freshness
Helpful Validity Check Suggestion
Another very helpful check suggestion is for format validity. We all know that string columns can end up being a bit of a catch-all; people store all sorts of data in varchar columns. A format validity check allows you to assert that columns, especially those populated by user input, follow an expected, valid format such as date or currency.
However, because SodaCL supports 40+ validity formats, it can be really time-consuming to go through each of your dataset’s string columns to figure out which pattern or format each column should match.
Check suggestions eliminates the guesswork by profiling the columns containing strings and suggesting the most suitable valid format. In the example below, the check suggestion algorithm correctly detects that the “email_address” column should be formatted as an email semantic type. Bravo!
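The general idea of profiling a string column and proposing a format can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Soda's actual algorithm: the three candidate patterns, the suggest_format helper, and the 90% match threshold are all hypothetical.

```python
import re

# Hypothetical candidate formats; a real tool would support many more.
CANDIDATE_FORMATS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "date_iso": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "integer": re.compile(r"[+-]?\d+"),
}


def suggest_format(values, min_match_ratio=0.9):
    """Return the best-matching format name, or None if no format fits well."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return None
    best_name, best_ratio = None, 0.0
    for name, pattern in CANDIDATE_FORMATS.items():
        matches = sum(1 for v in non_empty if pattern.fullmatch(v))
        ratio = matches / len(non_empty)
        if ratio > best_ratio:
            best_name, best_ratio = name, ratio
    return best_name if best_ratio >= min_match_ratio else None


email_column = ["ana@example.com", "bo@example.org", "cy@example.net"]
print(suggest_format(email_column))            # email
print(suggest_format(["n/a", "42", "hello"]))  # None
```

Requiring a high match ratio rather than a perfect one lets the suggestion survive a few bad values, which are exactly what the resulting validity check is meant to catch.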
Production-Ready File of Checks
Once you've completed the check suggestion flow, Soda prepares a production-ready checks YAML file, complete with a prompt that asks you if you want to use it to run a scan right away. (Yes, you do!)
Soda shows you a pretty summary of the checks it suggests and stores the file locally on your system, so you can take this file and put it anywhere you need it. Add these checks to your data pipeline in production to catch data issues before they have a downstream impact. Or, add them to your CI/CD pipeline to find post-transformation data quality issues before merging into production.
The beauty of this functionality lies in its flexibility; you can plug the checks as-is into your Airflow DAG, or easily modify or expand them according to your needs. You can customize your checks, adjust the thresholds, incorporate filters, and so on.
But Wait, There’s More!
You’ve seen some of our favorite highlights, but check suggestions does quite a bit more. It guides you through the steps to prepare checks for:
- schema changes
- row counts, and anomaly detection on row counts
- missing values, with checks that automatically look for null values
- duplicate values
Check out the exhaustive list of everything check suggestions does in the Soda documentation.
What’s Next?
Ensuring robust data quality shouldn’t be a daunting task. In launching check suggestions, Soda has transformed this stultifying task into a simple, guided experience. This powerful feature, paired with our enhanced Soda Library, offers a new level of automation that helps your team systematically and intuitively establish basic data quality coverage.
Our journey doesn’t end here. We have ambitious plans for extending the capabilities of check suggestions to include a greater number of checks, and smarter, threshold-based checks to make them more precise and adaptable.
Moreover, our vision includes business-oriented users who don’t regularly use command-line tools. We're designing an even more user-friendly way to present check suggestions in Soda Cloud.
We enthusiastically encourage you to try check suggestions the next time you need to add data quality coverage to a dataset. If you are new to Soda, take advantage of the 45-day free trial to experience the benefits of automated, intelligent data quality checks. Do yourself a favor and take just a few minutes of your day to eliminate the most basic of data quality headaches by implementing the most basic of data quality checks.
As always, we look forward to your feedback and suggestions; join us in the Soda Community on Slack and let us know what you think! We’re eager to evolve our products, even as our objective to simplify and systemize data quality checks remains unchanged.
Dive into the world of automated data quality checks with Soda to stop bad data from disrupting good business.