Run Soda Data Quality Checks in Databricks Notebooks
This tutorial guides you through the steps to quickly set up and run Soda data quality checks within a Databricks notebook. Learn how to install Soda packages, configure Soda Cloud, and execute data quality checks.
Step 1: Set Up Soda Cloud
Log in to your Soda Cloud account. If you don’t have an account, visit soda.io and start a free trial.
Step 2: Import the Example Notebook
Import the notebook provided in this tutorial to your Databricks workspace.
Step 3: Configure Soda Cloud Attributes
Attributes in Soda Cloud act as metadata for your checks or datasets and are essential for filtering and reporting. Create an attribute named Check Explanation to describe the purpose and function of each check:
Label = Check Explanation
Attribute resource type = Check
Type = Text
Step 4: Generate API Keys
Generate a set of API keys from your Soda Cloud environment to connect Soda Library to Soda Cloud:
In your Soda Cloud account, click your avatar and go to Profile > API Keys, then click the “+” symbol to generate new keys.
Copy these keys to a temporary file. You will paste them into the Databricks notebook shortly.
Step 5: Install the Soda Spark DF Package
In the Databricks notebook, install the Soda Spark DF package. You can install it on a per-notebook basis or add it to your cluster to make it available whenever the cluster starts up.
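As a rough sketch of the per-notebook option, the install can be written as a `%pip` cell or as an equivalent pip command assembled in Python. The package name `soda-spark-df` and the index URL `https://pypi.cloud.soda.io` are assumptions about how Soda Library is distributed, and `build_pip_command` is a hypothetical helper. Confirm both values against your Soda account before running.

```python
import sys

# Assumed values: adjust to match the package and index your Soda plan uses.
SODA_PACKAGE = "soda-spark-df"
SODA_INDEX_URL = "https://pypi.cloud.soda.io"


def build_pip_command(package=SODA_PACKAGE, index_url=SODA_INDEX_URL):
    """Return the pip command that installs the Soda Spark DF package."""
    return [sys.executable, "-m", "pip", "install", "-i", index_url, package]


# In a notebook cell, the same install is usually written as a magic command:
#   %pip install -i https://pypi.cloud.soda.io soda-spark-df
print(" ".join(build_pip_command()))
```

To run the install from Python instead of a magic command, pass the list to `subprocess.check_call`.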
Step 6: Configure API Keys and Host Information
Paste the API keys you generated into the notebook.
Set the host according to your region: cloud.us.soda.io for the US region or cloud.soda.io for the EU region.
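The connection settings can be sketched as a small YAML string built in the notebook. The `soda_cloud` block with `host`, `api_key_id`, and `api_key_secret` reflects Soda's configuration format as commonly documented. The helper name `build_soda_cloud_config` and the placeholder key values are hypothetical.

```python
# Hosts named in this tutorial, keyed by region.
SODA_HOSTS = {"us": "cloud.us.soda.io", "eu": "cloud.soda.io"}


def build_soda_cloud_config(region, api_key_id, api_key_secret):
    """Return a Soda Cloud configuration as a YAML string."""
    host = SODA_HOSTS[region.lower()]
    return (
        "soda_cloud:\n"
        f"  host: {host}\n"
        f"  api_key_id: {api_key_id}\n"
        f"  api_key_secret: {api_key_secret}\n"
    )


# Placeholder keys for illustration only; paste your own values.
config_yaml = build_soda_cloud_config("eu", "YOUR_API_KEY_ID", "YOUR_API_KEY_SECRET")
print(config_yaml)
```

Keeping the host lookup in one dictionary means switching regions is a one-word change.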
Step 7: Define a Dataset and Checks
Register your DataFrame as a view (the DF view). The view name you choose is the dataset name that appears in the Soda Cloud user interface.
Define the Data Source Name and Scan Definition Name to help track your checks in Soda Cloud.
This example uses the New York City Taxi dataset (available in Databricks) as sample data against which to run checks.
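A minimal sketch of this step follows, assuming the sample data is reachable as the table `samples.nyctaxi.trips` in your workspace. The view name, data source name, and scan definition name are hypothetical values chosen for illustration, as is the helper `register_dataset`.

```python
# Names chosen for illustration; all three are hypothetical.
DF_VIEW_NAME = "nyc_taxi"  # dataset name shown in Soda Cloud
DATA_SOURCE_NAME = "databricks_notebook"  # used to track checks in Soda Cloud
SCAN_DEFINITION_NAME = "nyc_taxi_notebook_scan"  # used to track checks in Soda Cloud

# Assumed location of the sample data; confirm it exists in your workspace.
SAMPLE_TABLE = "samples.nyctaxi.trips"


def register_dataset(spark, table_name=SAMPLE_TABLE, view_name=DF_VIEW_NAME):
    """Load a table into a DataFrame and expose it as a temporary view."""
    df = spark.table(table_name)
    df.createOrReplaceTempView(view_name)
    return df


# In a Databricks notebook, `spark` is already defined:
#   df = register_dataset(spark)
```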
Step 8: Write Data Quality Checks
Write the checks for Soda to execute during a scan using SodaCL, the Soda Checks Language.
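As an illustration, a handful of SodaCL checks can be held in a Python string. This sketch assumes a view named `nyc_taxi` with columns `fare_amount` and `trip_distance`, and assumes that the Check Explanation attribute is addressed in SodaCL by the key `check_explanation`. Verify the column names and the attribute name in your own environment. The helper `build_checks` is hypothetical.

```python
def build_checks(dataset="nyc_taxi"):
    """Return SodaCL checks for the given dataset as a YAML string."""
    return f"""
checks for {dataset}:
  - row_count > 0:
      name: Dataset contains rows
      attributes:
        check_explanation: Confirms that the view is not empty.
  - missing_count(fare_amount) = 0:
      name: Fare amount is always present
      attributes:
        check_explanation: Every trip should record a fare.
  - invalid_count(trip_distance) = 0:
      valid min: 0
      name: Trip distance is not negative
      attributes:
        check_explanation: Negative distances indicate bad data.
"""


checks_yaml = build_checks()
print(checks_yaml)
```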
In addition to SodaCL checks, you can define custom checks with SQL queries.
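One way to express a SQL-based check is a SodaCL failed rows check, in which any row returned by the query counts as a failure. The sketch below assumes a view named `nyc_taxi` with columns `tpep_pickup_datetime` and `tpep_dropoff_datetime`. The check itself and the helper `build_sql_check` are hypothetical examples.

```python
def build_sql_check(dataset="nyc_taxi"):
    """Return a SodaCL failed rows check that uses a custom SQL query."""
    return f"""
checks for {dataset}:
  - failed rows:
      name: Dropoff is not before pickup
      fail query: |
        SELECT *
        FROM {dataset}
        WHERE tpep_dropoff_datetime < tpep_pickup_datetime
"""


sql_check_yaml = build_sql_check()
print(sql_check_yaml)
```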
Step 9: Execute the Data Quality Checks
Once the checks are configured, run the scan from the notebook and review the scan output.
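A minimal sketch of the scan itself, using the programmatic `Scan` interface from `soda.scan`. The wrapper `run_soda_scan` is a hypothetical helper. It accepts the scan object as a parameter so that the function can be defined before the Soda package is installed.

```python
def run_soda_scan(scan, spark, data_source_name, scan_definition_name,
                  config_yaml, checks_yaml):
    """Configure a Soda Scan object, execute it, and return its exit code."""
    scan.set_scan_definition_name(scan_definition_name)
    scan.set_data_source_name(data_source_name)
    # Attach the notebook's Spark session so Soda can query the temporary view.
    scan.add_spark_session(spark, data_source_name=data_source_name)
    scan.add_configuration_yaml_str(config_yaml)
    scan.add_sodacl_yaml_str(checks_yaml)
    exit_code = scan.execute()
    print(scan.get_logs_text())
    return exit_code


# In the notebook, after installing the package (argument values are placeholders):
#   from soda.scan import Scan
#   run_soda_scan(Scan(), spark, "databricks_notebook",
#                 "nyc_taxi_notebook_scan", config_yaml, checks_yaml)
```

The printed logs show each check's outcome in the notebook. The same results are sent to Soda Cloud when the configuration includes valid API keys.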
Step 10: Set Up Alerts
Soda integrates with communication tools like Slack and Microsoft Teams so you can configure alert notifications when data quality checks warn or fail. Set up the integration in your avatar > Organization Settings > Integrations. For example, when checks fail, Soda sends a notification to a specific Slack channel and includes a link to Soda Cloud for investigation.
Step 11: Analyze the Results
View the data quality check results in Soda Cloud, including detailed breakdowns of passing and failing checks.
Re-run checks on filtered data to focus on specific time periods or subsets of data.
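Filtering can be sketched with a SodaCL dataset filter, which limits a group of checks to rows that match a `where` clause. The view name, column names, filter name, and date range below are assumptions for illustration, and `build_filtered_checks` is a hypothetical helper.

```python
def build_filtered_checks(dataset, filter_name, where_clause):
    """Return SodaCL that applies checks only to rows matching a filter."""
    return (
        f"filter {dataset} [{filter_name}]:\n"
        f"  where: {where_clause}\n"
        "\n"
        f"checks for {dataset} [{filter_name}]:\n"
        "  - row_count > 0\n"
        "  - missing_count(fare_amount) = 0\n"
    )


# Hypothetical filter: restrict the scan to trips picked up in January 2016.
filtered_checks = build_filtered_checks(
    "nyc_taxi",
    "january",
    "tpep_pickup_datetime >= '2016-01-01' AND tpep_pickup_datetime < '2016-02-01'",
)
print(filtered_checks)
```

Passing the resulting string to a new scan runs the same checks against only the filtered rows.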
Step 12: Advanced Configurations
This basic setup demonstrates how to get started quickly with Soda and Databricks. From here, you can explore more advanced configurations.
For more exhaustive instructions, visit the Soda documentation at docs.soda.io.