Implement Data Quality Checks in a Databricks Pipeline with Soda: Step-by-Step Tutorial

Learn how to quickly set up and run data quality checks within Databricks pipelines using Soda. In this hands-on tutorial, we’ll guide you through every step, helping you proactively prevent data issues in your data pipelines.

Key Takeaways

Install and configure Soda in Databricks
Run data quality checks on large datasets
Create custom checks using SodaCL and SQL
Use Soda Cloud for monitoring, alerts, and analysis
Integrate Soda with Slack and other tools for timely alerts

Requirements and Links

Written Instructions

Run Soda Data Quality Checks in Databricks Notebooks

This tutorial guides you through the steps to quickly set up and run Soda data quality checks within a Databricks notebook. Learn how to install Soda packages, configure Soda Cloud, and execute data quality checks.

Step 1: Set Up Soda Cloud


Step 2: Import the Example Notebook

Import the notebook provided in this tutorial to your Databricks workspace.


Step 3: Configure Soda Cloud Attributes

Attributes in Soda Cloud act as metadata for your checks or datasets and are essential for filtering and reporting. Create an attribute named Check Explanation to describe the purpose and function of each check:
Label = Check Explanation
Attribute resource type = Check
Type = Text


Step 4: Generate API Keys

Generate a set of API keys from your Soda Cloud environment to connect Soda Library to Soda Cloud:

In your Soda Cloud account, navigate to your avatar > Profile > API Keys, then click the “+” symbol, and generate new keys.

Copy these keys to a temporary file; you will paste them into the Databricks notebook shortly.


Step 5: Install the Soda Spark DF Package

In the Databricks notebook, install the Soda Spark DF package. You can install it on a per-notebook basis or add it to your cluster to make it available whenever the cluster starts up.
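As a rough sketch, a per-notebook install is usually a single %pip cell, followed by a quick check that the package can be imported. The index URL and package name in the comment are assumptions based on Soda's published installation instructions, so confirm them at docs.soda.io. The helper name soda_is_installed is hypothetical.

```python
# Run this in its own notebook cell before anything else (assumed command;
# verify the index URL and package name in the Soda documentation):
#
#   %pip install -i https://pypi.cloud.soda.io soda-spark-df

import importlib.util


def soda_is_installed() -> bool:
    """Return True when the `soda` package is importable in this environment."""
    return importlib.util.find_spec("soda") is not None


if not soda_is_installed():
    print("Soda is not installed yet; run the %pip cell above first.")
```

Installing the package as a cluster library instead removes the need for the %pip cell in every notebook.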

Step 6: Configure API Keys and Host Information

Paste the API keys you generated into the notebook.

Set the host according to your region: cloud.us.soda.io for the US region or cloud.soda.io for the EU region.
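One way to assemble this configuration is sketched below. It assumes the configuration is passed to Soda as a YAML string with a soda_cloud section. The function name build_soda_cloud_config and the placeholder key values are hypothetical. In a shared workspace, consider reading the keys from a Databricks secret scope rather than pasting them in plain text.

```python
# Hosts as stated in this tutorial: US and EU regions.
SODA_CLOUD_HOSTS = {"us": "cloud.us.soda.io", "eu": "cloud.soda.io"}


def build_soda_cloud_config(api_key_id: str, api_key_secret: str, region: str = "eu") -> str:
    """Return a Soda Cloud configuration as a YAML string (hypothetical helper)."""
    key = region.lower()
    if key not in SODA_CLOUD_HOSTS:
        raise ValueError(f"Unknown region {region!r}; expected one of {sorted(SODA_CLOUD_HOSTS)}")
    return (
        "soda_cloud:\n"
        f"  host: {SODA_CLOUD_HOSTS[key]}\n"
        f"  api_key_id: {api_key_id}\n"
        f"  api_key_secret: {api_key_secret}\n"
    )


# Placeholder values; replace them with the keys generated in Step 4.
config = build_soda_cloud_config("<your-api-key-id>", "<your-api-key-secret>", region="us")
print(config)
```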

Step 7: Define a Dataset and Checks

Register your DataFrame as a temporary view. The view name you choose is the user-defined dataset name that appears in the Soda Cloud user interface.

Define the Data Source Name and Scan Definition Name to help track your checks in Soda Cloud.

This example uses the New York City Taxi dataset (available in Databricks) as sample data against which to run checks.
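The setup described in this step might look like the sketch below. It assumes spark is the notebook's Spark session and df is a DataFrame that already holds the taxi data. The function name prepare_scan and the default view, data source, and scan definition names are hypothetical. The Scan methods used here are from the Soda Python API as best understood, so check them against the Soda documentation.

```python
def prepare_scan(
    spark,
    df,
    view_name: str = "nyc_taxi_trips",
    data_source_name: str = "spark_df",
    scan_definition_name: str = "databricks_notebook_scan",
):
    """Register `df` as a temp view and return a Scan bound to the Spark session."""
    # Imported lazily so this cell can be defined before Soda is installed.
    from soda.scan import Scan

    # The view name becomes the dataset name shown in Soda Cloud.
    df.createOrReplaceTempView(view_name)

    scan = Scan()
    scan.set_data_source_name(data_source_name)
    scan.set_scan_definition_name(scan_definition_name)
    # Attach the Spark session so Soda can query the temp view.
    scan.add_spark_session(spark, data_source_name=data_source_name)
    return scan
```

Keeping the scan definition name stable between runs lets Soda Cloud group successive results for the same checks.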

Step 8: Write Data Quality Checks

The following describes the checks to execute during a Soda scan:

Ensure there are no missing values for the Passenger Count.
Validate that fewer than 5% of Fare Amount values are null.
Check that Trip Cost is a positive number when passengers and trip distance are recorded.
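The three checks above could be expressed in SodaCL roughly as follows, held in a Python string so that they can be passed to a scan. The dataset name nyc_taxi_trips and the column names passenger_count, fare_amount, trip_distance, and total_amount are assumptions; adjust them to match your view and schema.

```python
# Assumed dataset (temp view) name; it must match the view registered for the scan.
DATASET = "nyc_taxi_trips"

checks = f"""
checks for {DATASET}:
  # No missing values for the passenger count.
  - missing_count(passenger_count) = 0:
      name: Passenger count is never missing
  # Fewer than 5% of fare amounts are null.
  - missing_percent(fare_amount) < 5%:
      name: Fewer than 5% of fare amounts are missing
  # Rows that match the fail condition are reported as failed rows.
  - failed rows:
      name: Trip cost is positive when passengers and distance are recorded
      fail condition: passenger_count > 0 and trip_distance > 0 and total_amount <= 0
"""

print(checks)
```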


In addition to preparing SodaCL checks, you can use SQL queries to define custom checks. For example:

Ensure that Pickup Dates and Times are valid.
Sample about 100 records and send them to Soda Cloud for further analysis.
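A SQL-based check of this kind might be sketched as a failed rows check with a fail query. The view name nyc_taxi_trips and the columns tpep_pickup_datetime and tpep_dropoff_datetime are assumptions, as is the specific definition of a valid pickup time used here (not after the dropoff and not in the future).

```python
sql_checks = """
checks for nyc_taxi_trips:
  - failed rows:
      name: Pickup date and time are valid
      # Cap the number of failed-row samples sent to Soda Cloud.
      samples limit: 100
      # Any row returned by this query counts as a failed row.
      fail query: |
        SELECT *
        FROM nyc_taxi_trips
        WHERE tpep_pickup_datetime > tpep_dropoff_datetime
           OR tpep_pickup_datetime > current_timestamp()
"""

print(sql_checks)
```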

Step 9: Execute the Data Quality Checks

Once the checks are configured:

Run the notebook; Soda begins scanning the dataset.
View the results, which are output in both the notebook and Soda Cloud. For failed checks, you can choose to send failed row samples to Soda Cloud or route them to a local table for further analysis.
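In code, executing a scan could look like the sketch below. It assumes scan is an already configured Soda Scan object and that the two arguments are YAML strings holding the Soda Cloud configuration and the SodaCL checks. The function name run_scan is hypothetical, and the Scan method names should be confirmed against the Soda documentation.

```python
def run_scan(scan, config_yaml: str, checks_yaml: str) -> bool:
    """Execute a Soda scan and return True when no check failed (hypothetical helper)."""
    scan.add_configuration_yaml_str(config_yaml)  # Soda Cloud host and API keys
    scan.add_sodacl_yaml_str(checks_yaml)         # the checks to evaluate
    scan.execute()                                # runs the checks against the dataset
    print(scan.get_logs_text())                   # show results in the notebook output
    return not scan.has_check_fails()
```

Returning a boolean keeps the decision about what to do with a failed scan in the calling notebook cell.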

Step 10: Set Up Alerts

Soda integrates with communication tools such as Slack and Microsoft Teams so that you can receive alert notifications when data quality checks warn or fail. Set up the integration under your avatar > Organization Settings > Integrations. For example, when a check fails, Soda can send a notification to a specific Slack channel with a link to Soda Cloud for investigation.

Step 11: Analyze the Results

View the data quality check results in Soda Cloud, including detailed breakdowns of passing and failing checks.

Re-run checks on filtered data to focus on specific time periods or subsets of data.


Step 12: Advanced Configurations

This basic setup demonstrates how to quickly get started with Soda and Databricks. You can explore more advanced configurations that involve:

Integrating Soda checks into your pipelines.
Scheduling periodic checks.
Setting up more complex alerting mechanisms.
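As one illustration of integrating checks into a pipeline, a scan can act as a gate that stops downstream steps when data quality is poor. The sketch below assumes scan is a Soda Scan object that has already been executed. DataQualityError and enforce_quality_gate are hypothetical names, and the Scan methods used should be verified against the Soda documentation.

```python
class DataQualityError(RuntimeError):
    """Raised when a Soda scan reports errors or failed checks."""


def enforce_quality_gate(scan) -> None:
    """Raise DataQualityError if the executed scan errored or any check failed."""
    if scan.has_error_logs():
        raise DataQualityError("Soda scan errored:\n" + scan.get_error_logs_text())
    if scan.has_check_fails():
        raise DataQualityError("Soda checks failed:\n" + scan.get_checks_fail_text())
```

In a scheduled Databricks job, an uncaught exception fails the notebook task, which prevents later tasks that depend on it from running on bad data.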

For more detailed instructions, visit the Soda documentation at docs.soda.io.
