Managing Data Quality in Databricks


Trust the data you use in Databricks

Whatever your organization uses Databricks to achieve, none of it is truly reliable without good-quality data. Every decision that relies on pipeline output, every machine-learning model that trains on data, and every downstream report is only as sound as the data that feeds it. Checking data quality is a baseline requirement; without it, the data that matters most to data engineers and any downstream data consumers is unreliable at best, and silently misleading at worst.

When approaching this challenge, the first step is to establish and maintain a basic level of data quality testing on the data you ingest, transform, and analyze in your pipelines and notebooks. Using Soda Library, Soda’s enterprise framework for data testing, data engineers can invoke Soda checks for data quality directly from within a Databricks Notebook, as in the following simple example.

Soda efficiently gathers the business rules for data quality that you defined, then runs aggregated SQL queries on your data to surface any data that falls outside expected parameters for completeness, timeliness, validity, and more.  

In Databricks, invoke Soda as frequently as you need to check for data quality early and often:

  • Check for data quality after ingestion to validate that data remains intact and as expected.
  • Check again after referencing data and joining tables to confirm data is complete and valid.
  • Check again before exporting to an external source, or to “circuit break” your pipeline so poor-quality data flows no further downstream.

Leverage the Soda Cloud platform to level up

After using Soda Library to establish a baseline for good-quality data in Databricks, level up your data quality game by leveraging Soda Cloud. A best-in-class data quality management platform, Soda Cloud offers visibility into historical measurements, tracks data quality trends over time, and enables data engineers to set up granular alert notifications when the most important data triggers red-flag warnings.

Take a huge leap forward with Soda Cloud, which uses machine learning to recognize patterns in your data and automatically detect anomalies and changes in your datasets’ schemas. Define more advanced business rules for data quality that:

  • group results by category like country or sales region
  • reconcile data between data sources
  • monitor the evolution of data quality by category
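Grouped checks, for example, can look like the following SodaCL sketch; the `orders` dataset, its columns, and the threshold are hypothetical, and the shape follows SodaCL’s group-by checks:

```yaml
checks for orders:
  - group by:
      group_limit: 10
      query: |
        SELECT country, AVG(order_total) AS avg_order_total
        FROM orders
        GROUP BY country
      fields:
        - country
      checks:
        - avg_order_total > 100
```

Each group, here each country, gets its own check result, so quality can be tracked per category rather than only per dataset.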

Without changing an invocation of Soda from within a Databricks Notebook, log in to an out-of-the-box Soda Cloud account to gain immediate insight and observability into the state of your data’s health. 

While Databricks’ built-in data quality dashboard does offer some insight into the state of data quality, the bare-bones presentation may not be suitable for everyone in the organization to consume. Without insider knowledge, it can be difficult for a non-engineer to parse the information to get what they need out of Databricks directly. So where Databricks steps aside, Soda steps up with more features to foster data quality collaboration.

Expose everything you do in Databricks to the business

With the basic data quality building blocks in place, organizations are looking towards a future in which everyone who accesses, uses, references, analyzes, or learns from data in Databricks can easily do so. For example, though data analysts and scientists are experts in analyzing data to make well-informed strategic decisions for the business, they may not be versed in data pipeline management, writing Python or SQL, or have the skill set to navigate the data plane. 

By integrating Soda in notebooks and connecting to Soda Cloud, data engineers can expose everything they do for data quality in Databricks to the business at large. Data consumers who may never have had the opportunity or know-how to gauge the health of the data they use can log in to Soda to get self-serve access to all the data quality checks that data engineers execute in Databricks.

United in Soda Cloud, data producers and data consumers can define agreed-upon data quality expectations. Colleagues can easily connect across domain teams to share and use data that is relevant, timely, and trustworthy for insights, analytics, and machine learning. Soda facilitates the collaboration between data producers and data consumers in their efforts to manage the quality of business-critical data. 

Using out-of-the-box guided workflows and AI-assisted data test creation, users of all skill levels are empowered to access the data that matters most, when and how they need it. Soda delivers the power of data-informed decisions to far more people than ever before.

An evolution of data quality management

While it’s certainly true that when it comes to data quality, an ounce of prevention is worth a pound of cure, we believe that Soda takes the adage further: an ounce of participation is worth a pound of access requests! Setting up basic data quality standards, then expanding the landscape to include consumer participation in data quality management, relieves the bottlenecks that traditional, closed, centralized data engineering teams suffered in responding to dozens, if not hundreds, of requests to access, and demands to fix, data sources.

Watch Soda in Action: Pipeline Testing in Databricks

Customer Story: Abercrombie & Fitch

If this sounds like a project your team is ready to tackle, we’re eager to help you get started!