Financial institutions—including banks, insurers, mortgage lenders, investors, and creditors—rely heavily on accurate, complete, and timely information to drive critical business processes and make informed decisions.
Poor data quality can disrupt critical processes such as financial reporting, loan approvals, fraud detection, and credit risk assessment, undermining an institution's credibility and jeopardizing customer trust.
As a result, maintaining high data quality in the financial sector is not only a good practice, it is also a regulatory requirement. The Basel Committee on Banking Supervision's standard number 239 (BCBS 239) emphasizes the importance of banks improving their risk data aggregation capabilities and internal risk reporting practices. Compliance with BCBS 239 ensures that financial institutions can report risks accurately and quickly, enhancing the financial system's stability.
Therefore, implementing robust data quality tests early in the data pipeline, also known as "shift-left" testing, enables organizations to detect and address issues proactively, minimizing downstream consequences.
In this blog post, we will go over how to integrate Soda into a financial data pipeline to ensure data quality, compliance, and accurate risk reporting. We will cover how to set up Soda checks, automate validations, and keep an audit trail to ensure transaction accuracy, completeness, and timeliness.
The Power of Automation for the Financial Sector
Traditional manual checks are time-consuming, vulnerable to human error, and unsuitable for modern, large-scale financial processes. With thousands of transactions per second, financial institutions require a scalable, automated solution to ensure data integrity and regulatory compliance.
Without automation, institutions will face:
- ❌ delayed risk reporting that increases vulnerability to fraudulent activities;
- ❌ inconsistent data validation that can lead to inaccurate financial insights;
- ❌ high operational costs due to the extensive resource requirements of manual checks.
To mitigate these risks, organizations need to incorporate robust data quality checks into their data pipelines. Soda offers a powerful, automated solution for data quality management. Its built-in testing and observability tools are capable of scanning financial data for anomalies, missing values, and inconsistencies in real time.
This ensures organizations can:
- ✅ enforce BCBS 239 compliance with automated checks for accuracy, completeness, and timeliness;
- ✅ prevent poor-quality data from reaching downstream systems, reducing costly errors;
- ✅ set up proactive alerts to address issues before they escalate.
Rather than manually reviewing transaction logs, a financial institution can set up Soda checks to detect duplicate transactions before they affect customers' accounts, monitor missing or delayed transactions to avoid reporting discrepancies, and flag unusual patterns, allowing fraud detection teams to act faster.
In the following sections, we'll look at how this works in practice.
Setting Up Soda for Automated Checks
Automating data quality checks is critical in the financial industry because data accuracy has a direct impact on decision-making, regulatory compliance, and customer trust.
Soda allows you to proactively monitor your data and identify potential issues early in the pipeline. It can seamlessly integrate into your existing workflows while providing comprehensive visibility into the health of your data.
By following this step-by-step guide, you'll discover how to quickly deploy and integrate Soda into your financial data pipeline, streamlining compliance efforts and reducing manual oversight.
Choosing the Right Deployment Model
To evaluate the quality of your data with Soda, you first choose a deployment model that lets you connect to your data sources. You then define the necessary data quality checks and run scans to execute them.
This tutorial focuses on Soda Library, a Python tool that gives you direct control over the data quality checks within your pipeline. You'll be able to check the results in your command-line interface (CLI) as well as in your Soda Cloud account.
You can also connect your data sources directly to Soda Cloud, using a SaaS-style setup in which you manage data quality without any coding. For more detailed guidance on selecting the appropriate deployment model, refer to Soda's setup guide.
Step-by-Step Setup with Soda Library
You can integrate Soda with several different data sources, but for this tutorial, we'll show you how to set up Soda Library to automate your data quality checks using MySQL.
By following these steps, you will be able to configure connections to your database, define critical data quality checks, and begin monitoring data quality in real time.
The technical goals of this tutorial are:
1. Install Soda Library using your CLI.
2. Configure the YAML files to:
- Connect to your data sources to run quality scans.
- Link your setup to Soda Cloud to validate your license and see data quality metrics.
- Create and customize data quality checks based on your requirements.
After you've completed this setup, you'll be able to integrate Soda's data quality checks into your pipeline, ensuring that your financial data remains reliable and compliant.
This is a simple setup in which you install Soda Library locally and connect it to Soda Cloud via API keys. Let's get started.
1. Account and API Key Creation
Step 1: Join Soda Cloud
To begin, sign up for a Soda Cloud account for a free 45-day trial.
Why create an account? Because Soda Library needs to communicate with a Soda Cloud account via API keys to validate your license or free trial.
Step 2: Generate API keys
Once you've created your account, generate your API keys in Soda Cloud:
- Navigate to your avatar > Profile, then access the API keys tab.
- Click the plus icon to generate new API keys.
- Copy the soda_cloud configuration syntax and keep it in a safe place. You'll need it when configuring the connection file.
2. Soda to Data Source Configuration
Now, let’s connect Soda Library to MySQL. Here’s how you can do it.
Step 1: Install dependencies
Open your preferred integrated development environment (IDE) and ensure that your Python version is ≤ 3.10 and pip is ≥ 21.0.
Step 2: Set up project directory
Create a new directory for your Soda project on your local environment, and navigate to it via your CLI. Your project structure should look like this:
your_project/
├── soda-env/
├── configuration.yml
└── checks.yml
Step 3: Create virtual environment
It's best practice to install Soda Library within a virtual environment. To create and activate one on Windows, run the following commands (on macOS or Linux, use python3.10 -m venv soda-env followed by source soda-env/bin/activate):
py -3.10 -m venv soda-env
.\soda-env\Scripts\Activate
Step 4: Install Soda Library for MySQL
Since we’re using MySQL, install the required package:
pip install soda-mysql
If you're using a different data source, be sure to install the corresponding package. You can find a list of available packages in our documentation.
Step 5: Set up configuration file
In your Soda project directory, create a file named configuration.yml. Paste the following connection details into the file, replacing the placeholders with your actual connection information:
data_source your_database_name:
  type: mysql
  host: 127.0.0.1 # or the host name of your MySQL server
  username: # mysql_username
  password: # mysql_password
  database: # your_database_name

soda_cloud:
  host: cloud.soda.io
  api_key_id: # a4ac173d---c29
  api_key_secret: # V13xLLEDG---flTw
Note that this is the file where you paste the soda_cloud configuration syntax containing your API keys.
Make sure that the credentials and host information are accurate and that the user has the necessary permissions to access the database. For more detailed instructions, consult Soda's documentation on connecting to MySQL.
You can also use system variables to pass sensitive information (passwords and API keys) if you prefer, for example password: ${MYSQL_PASSWORD}. For more information about that, check our documentation on How to Install Soda Library.
Step 6: Test connection
Save the configuration.yml file, then run the following command to test whether Soda can successfully connect to your database:
soda test-connection -d your_database_name -c configuration.yml
If the connection is successful, you should see an output similar to:
Soda Core 3.5.0
Successfully connected to 'your_database_name'.
Connection 'your_database_name' is valid.
Once this step is complete, Soda Library can securely connect to both your database and Soda Cloud, allowing you to implement automated data quality checks on any schema inside your database.
Use case: Financial Data Quality Monitoring
With Soda Library installed and connected to your data source, we're ready to define data quality checks and run automated scans. But first, let's go over some key principles to understand which checks are most relevant to financial data quality management.
Key Principles for Financial DQ Management
Adhering to a few key principles helps ensure that your financial data is accurate, consistent, and secure. Here are the most important ones for financial institutions to keep in mind:
- Accuracy and Consistency: Reliable financial decisions require data that is both accurate and consistent. Minor errors or inconsistencies can result in expensive mistakes.
- Completeness and Relevance: Incomplete or missing data can have a significant impact on risk assessments and overall business operations.
- Timeliness and Freshness: Financial data should be current. Delays in updating your data can lead to incorrect financial assessments and possible regulatory penalties.
- Standardization and Governance: Clear guidelines and standard practices help to avoid inconsistencies and confusion. With proper governance, you can ensure data integrity across departments and systems.
- Data Security and Compliance: Financial data is sensitive, and its protection is critical. Implementing role-based access control, encryption, and audit trails ensures that only authorized users have access to the data, protecting it against breaches, fraud, or misuse.
- Automation and Continuous Monitoring: Automation speeds up data quality checks, allowing you to detect anomalies, duplicates, and inconsistencies in real time. Continuous monitoring ensures that data quality issues are quickly identified and addressed, resulting in a more transparent and efficient financial system.
By following these principles, you can not only meet regulatory requirements but also build trust in your financial data, allowing you to make better decisions and operate in a more secure environment.
In the next steps, we'll walk you through how to create and configure those checks for monitoring data quality in real time. But before that, we'll create a sample dataset.
1. Sample Dataset Creation
To effectively monitor financial data quality, we'll create a transactions table that simulates real-world financial data. The table will include intentional data quality issues to demonstrate how to detect and address them using Soda checks.
Step 1: Set Up the Database and Table
Begin by creating a new database named soda_trial and defining the transactions table with appropriate columns:
CREATE DATABASE soda_trial;
USE soda_trial;
CREATE TABLE transactions (
    transaction_id INT PRIMARY KEY,
    account_number VARCHAR(20),
    transaction_date DATE,
    amount DECIMAL(15,2),
    currency CHAR(3),
    transaction_type VARCHAR(10)
);
Step 2: Populate the Table with Sample Data
Next, insert sample records into the table, including intentional data quality issues to test the effectiveness of Soda checks:
INSERT INTO transactions (transaction_id, account_number,
    transaction_date, amount, currency, transaction_type)
VALUES
    (1, 'ACC1234567', '2025-03-25', 1000.00, 'USD', 'Credit'),
    (2, 'ACC1234567', '2025-03-25', 1000.00, 'USD', 'Credit'), -- duplicate of row 1 (same account, amount, type)
    (3, 'ACC1234568', '2025-03-26', -500.00, 'USD', 'Debit'),  -- negative amount
    (4, 'ACC1234569', '2025-03-27', 750.00, NULL, 'Credit'),   -- missing currency
    (5, 'ACC1234570', '2025-03-28', 200.00, 'GBP', 'Trans'),   -- non-standard transaction type
    (6, 'AC1234571', '2025-03-29', 300.00, 'EUR', 'Debit');    -- malformed account number
2. Soda Checks Setup
Soda scans are designed to perform data quality checks on your data source, helping to identify invalid, missing, or unexpected data. With the sample dataset in place, we can now define data quality checks using SodaCL (Soda Checks Language) to identify and address the issues we introduced.
Step 1: Understanding Key Concepts
First, let's make sure we understand some important concepts:
- SodaCL: a YAML-based language that includes over 25 built-in metrics you can use to write checks; you also have the option of writing your own SQL queries or expressions.
- Soda check: a test that Soda executes during a scan to determine whether a metric meets the threshold you set.
- Metric: a property or measurement of the data within your dataset.
- Threshold: the value or range that a metric is compared against in a check. For example, in the check row_count > 0, row_count is the metric and > 0 is the threshold.
- Soda scan: runs multiple checks against one or more datasets in the data source you connected to Soda.
For a comprehensive list of SodaCL metrics and checks, refer to SodaCL Documentation.
Step 2: Designing Effective Data Quality Checks
Now, let's put some of the principles above to work. We've selected the dimensions most critical to financial data, along with their associated issues, so we can create our checks using SodaCL.
Accuracy and Consistency
- Duplicate Transactions: Duplicate entries of transaction IDs can inflate financial metrics and distort analyses.
- Invalid Transaction Amounts: Transaction amounts falling outside expected ranges may indicate data entry errors or fraud.
Completeness and Relevance
- Missing Transaction Dates or Account Numbers: Incomplete records limit the ability to accurately track and reconcile transactions, compromising financial reporting and complicating audits.
Timeliness and Freshness
- Outdated Transactions: Delayed updates can lead to inaccurate financial assessments and potential regulatory penalties.
Standardization and Governance
- Non-Standard Transaction Types or Currency Codes: Inconsistent data formats can lead to misinterpretation and processing errors.
Automation and Continuous Monitoring
- Lack of Regular Data Quality Checks: Without automated monitoring, data issues may go undetected until they cause significant problems.
Implementing Soda checks enables proactive identification and resolution of these and other data quality issues, ensuring reliable financial data for informed decision-making. So, let's see that in practice.
Step 3: Implementing the Checks
Soda prepares a scan using checks and data source connection configurations, which it then runs against datasets to extract metadata and evaluate data quality.
For that, go to your project directory and create a checks.yml file to define the checks for our dataset. We'll focus on several key data quality dimensions relevant to financial data.
Below is a fully commented checks file you can apply to our test database. It defines a set of data quality checks for our transactions table:
- It verifies that transactions have unique IDs, amounts are valid and formatted correctly, and account numbers follow a specific pattern.
- Completeness checks ensure no missing values in key fields, while freshness checks confirm that transaction data is updated daily.
- Standardization rules enforce valid currency codes and transaction types.
- Compliance measures include row count limits, and schema validation detects unauthorized changes in required columns and data types.
These rules support automated and continuous monitoring of data quality.
checks for transactions:  # Remember to choose your table here
  # Accuracy and Consistency
  - duplicate_count(transaction_id) = 0  # Detect duplicate transaction_id entries
  - duplicate_count(account_number, amount, transaction_type):
      warn: when > 0  # Ensure there are no duplicate transactions
  - invalid_count(amount) = 0:
      valid min: 0.01  # Ensure amount values are within a valid range
      valid format: decimal  # Ensure formatting is consistent
  - invalid_count(account_number) = 0:
      valid regex: '^ACC.*'
  # Completeness
  # Verify no missing values
  - missing_count(transaction_id) = 0
  - missing_count(account_number) = 0
  - missing_count(transaction_date) = 0
  - missing_count(amount) = 0
  - missing_count(currency) = 0
  - missing_count(transaction_type) = 0
  # Timeliness and Freshness
  - freshness(transaction_date) < 1d  # Assert that data is refreshed daily
  # Standardization and Governance
  - invalid_count(currency) = 0:
      valid values: ['USD', 'EUR', 'GBP']  # Add all acceptable currency codes
      invalid regex: '[^A-Z]{3}'  # Ensure formatting is consistent
  - invalid_count(transaction_type) = 0:
      valid values: ['Credit', 'Debit', 'Transfer']
  # Data Security and Compliance
  # Monitor that the table does not exceed a specified number of rows
  # to manage data volume and compliance.
  - row_count < 1000000
  # Automation and Continuous Monitoring
  # Define the expected schema to detect unauthorized changes
  - schema:
      fail:
        when required column missing:
          - transaction_id
          - account_number
          - transaction_date
          - amount
          - currency
          - transaction_type
        when wrong column type:
          transaction_id: int
          account_number: varchar
          transaction_date: date
          amount: decimal
          currency: char
          transaction_type: varchar
If you need to scan different datasets, remember to create a different check file for each (in this case, one per table in your MySQL database). You can pass several check files to a single scan, for example: soda scan -d soda_trial -c configuration.yml checks_transactions.yml checks_accounts.yml.
3. Integrating Soda into Your Pipeline
To automate data quality checks within your ETL (Extract, Transform, Load) workflows, incorporate a step that executes Soda scans after data ingestion. This integration ensures that data quality is assessed before further processing, allowing for early detection and issue resolution.
Use the following command to run a Soda scan on your dataset:
soda scan -d soda_trial -c configuration.yml checks.yml
In this command:
- -d soda_trial specifies the data source name as defined in your configuration file.
- -c configuration.yml points to your Soda configuration file containing connection details.
- checks.yml is the file where you've defined your data quality checks using SodaCL.
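If your pipeline is orchestrated in Python, you can run the same scan programmatically with Soda Library's Scan API instead of shelling out to the CLI. Below is a minimal sketch of a post-ingestion quality gate; it assumes the configuration.yml and checks.yml files from the previous steps sit in the working directory, and the scan definition name is an arbitrary label of our choosing:

from soda.scan import Scan

def run_quality_gate() -> None:
    """Run Soda checks right after ingestion, before data moves downstream."""
    scan = Scan()
    scan.set_data_source_name("soda_trial")  # must match configuration.yml
    scan.set_scan_definition_name("transactions_quality_gate")  # label shown in Soda Cloud
    scan.add_configuration_yaml_file("configuration.yml")
    scan.add_sodacl_yaml_file("checks.yml")

    scan.execute()  # runs every check defined in checks.yml
    print(scan.get_logs_text())  # prints the same summary as the CLI

    # Raise an exception if any check failed, so the scheduler
    # marks this pipeline step as failed and skips downstream jobs.
    scan.assert_no_checks_fail()

if __name__ == "__main__":
    run_quality_gate()

Because assert_no_checks_fail() raises an exception when checks fail, any orchestrator that treats exceptions as task failures will automatically halt the downstream steps of your pipeline.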
4. Reviewing the Scan Results
By integrating Soda scans into your pipeline, you establish a proactive approach to data quality, ensuring that issues are identified and addressed promptly, thereby maintaining the integrity of your data throughout its life cycle.
Following a scan, each check results in one of three default states:
- pass: the data meets the specified quality thresholds.
- fail: the data does not meet the specified quality thresholds.
- error: there is an issue with the check's syntax or execution.
Extra state:
- warn: a configurable state that alerts you to potential issues without marking the check as a full failure. See more in Add alert configurations. Here is the warning example from our code:

- duplicate_count(account_number, amount, transaction_type):
    warn: when > 0
When checks fail, they reveal low-quality data and provide results that assist you in investigating and resolving quality issues.
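If you run scans programmatically, as in the sketch above, you can also act on these states in code. The example below replaces the blanket assert with a triage step that logs warned checks and blocks only on failures. Note that the keys we read from the results dictionary (checks, name, outcome) reflect our understanding of Soda Library's scan-results payload, so verify them against your version:

from soda.scan import Scan

scan = Scan()
scan.set_data_source_name("soda_trial")
scan.set_scan_definition_name("transactions_quality_gate")
scan.add_configuration_yaml_file("configuration.yml")
scan.add_sodacl_yaml_file("checks.yml")
scan.execute()

# Surface warned checks for investigation, but only block on failures.
for check in scan.get_scan_results().get("checks", []):
    outcome = check.get("outcome")  # "pass", "fail", or "warn" (assumed values)
    if outcome in ("warn", "fail"):
        print(f"{outcome.upper()}: {check.get('name')}")

if scan.has_check_fails():  # True when at least one check failed
    raise RuntimeError("Soda checks failed; halting downstream processing.")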
You can review the outcomes of your Soda scans through two primary channels:
1. Command-Line Interface (CLI): Upon executing a scan, Soda provides immediate feedback in the terminal, displaying the results of each check. This real-time insight allows for quick assessments and immediate action if necessary.
Soda Core 3.5.0
Sending failed row samples to Soda Cloud
Sending failed row samples to Soda Cloud
Sending failed row samples to Soda Cloud
Sending failed row samples to Soda Cloud
Sending failed row samples to Soda Cloud
Scan summary:
10/15 checks PASSED:
    transactions in soda_trial
      row_count < 1000000 [PASSED]
      Schema Check [PASSED]
      duplicate_count(transaction_id) = 0 [PASSED]
      missing_count(transaction_id) = 0 [PASSED]
      missing_count(amount) = 0 [PASSED]
      missing_count(account_number) = 0 [PASSED]
      missing_count(transaction_date) = 0 [PASSED]
      freshness(transaction_date) < 1d [PASSED]
      invalid_count(currency) = 0 [PASSED]
      missing_count(transaction_type) = 0 [PASSED]
1/15 checks WARNED:
    transactions in soda_trial
      duplicate_count(account_number, amount, transaction_type) warn when > 0 [WARNED]
        check_value: 1
4/15 checks FAILED:
    transactions in soda_trial
      invalid_count(amount) = 0 [FAILED]
        check_value: 1
      invalid_count(account_number) = 0 [FAILED]
        check_value: 1
      missing_count(currency) = 0 [FAILED]
        check_value: 1
      invalid_count(transaction_type) = 0 [FAILED]
        check_value: 1
Oops! 4 failures. 1 warning. 0 errors. 10 pass.
Sending results to Soda Cloud
Schema checks are primarily intended to validate the structural aspects of your data, such as column presence, absence, and position, as well as the data types assigned to them. However, schema checks do not enforce specific data formats or constraints, such as ensuring that numeric values adhere to a specific decimal precision; validity checks like the ones defined above cover those cases.
2. Soda Cloud: For a more detailed and collaborative review, within Soda Cloud, you can:
- Visualize scan results through intuitive dashboards.
- Monitor data quality trends over time.
- Receive alerts for failed checks or anomalies.
- Collaborate with team members to address data quality issues.
Monitoring and Alerting for Data Quality
Soda enables real-time monitoring and alerting, keeping stakeholders informed about data quality issues. Configure Soda to send alerts to Slack, Jira, or MS Teams when checks fail.
To enable alerts, define notification channels in your Soda configuration and specify the conditions under which they should be triggered. This proactive approach ensures that data quality issues are resolved quickly, preserving the integrity of your data assets.
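Beyond Soda Cloud's built-in notification channels, you can also raise a pipeline-side alert directly from a scan. The sketch below posts a summary message to an MS Teams incoming webhook; the TEAMS_WEBHOOK_URL environment variable is a placeholder for a webhook you create yourself in Teams, and the simple text payload is the minimal format incoming webhooks accept:

import json
import os
import urllib.request

from soda.scan import Scan

def notify_teams(message: str) -> None:
    """Post a plain-text message to an MS Teams incoming webhook (placeholder URL)."""
    url = os.environ["TEAMS_WEBHOOK_URL"]  # hypothetical variable; set it to your webhook
    request = urllib.request.Request(
        url,
        data=json.dumps({"text": message}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)

scan = Scan()
scan.set_data_source_name("soda_trial")
scan.set_scan_definition_name("transactions_quality_gate")
scan.add_configuration_yaml_file("configuration.yml")
scan.add_sodacl_yaml_file("checks.yml")
scan.execute()

# Notify the team whenever any check warned or failed in this scan.
if scan.has_checks_warn_or_fail():
    notify_teams("Soda scan on soda_trial reported warned or failed checks.")

This complements, rather than replaces, the built-in MS Teams integration described below.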
Connection to MS Teams
If you have permission, you can integrate your Microsoft Teams workspace into Soda Cloud so it can interact with individuals and channels.
Use the Microsoft Teams integration to:
- Send alerts to Microsoft Teams for check results (warn or fail notifications).
- Create a dedicated Teams channel for investigating failed checks and collaborating on incident resolutions.
- Track Soda Discussions for real-time data quality collaboration with your team.
To set up the integration:
1. Access Soda Cloud Integrations: Log in to your Soda Cloud account, navigate to your avatar > Organization Settings, and select the Integrations tab. Click the + icon in the upper right to add a new integration.
2. Select MS Teams: In the Add Integration dialog box, choose Microsoft Teams. You will then be guided through a workflow setup.
3. Create a Workflow: Follow the guided integration workflow that instructs you to log in to your MS Teams account. You’ll need to create a workflow within Teams (see Microsoft’s documentation for “Creating a workflow from a channel in Teams”) using the provided template to post to a channel when a webhook request is received.
4. Complete the Integration: Once the workflow is successfully created, copy the generated URL and return to Soda Cloud to finish the guided steps. During this process, you’ll configure the integration scopes:
- Alert Notification Scope: Enable Soda Cloud to send alert notifications (for warn and fail check results) directly to your chosen MS Teams channel. This allows users to select MS Teams as the destination for individual or grouped check alerts.
- Incident Scope: Set up notifications for when a new incident is created in Soda Cloud. This scope will display an external link to your MS Teams channel in the Incident Details, directing your team to the appropriate space for incident resolution.
- Discussions Scope: Configure Soda Cloud to post to a specific Teams channel whenever a discussion is initiated or modified. This facilitates ongoing data quality collaboration within your organization.
With the integration in place, Soda Cloud can automatically send notifications to your MS Teams channels, ensuring that alerts, incident updates, and collaborative discussions are easily shared with your team, allowing for rapid response and efficient issue resolution.
Go to our documentation to learn more about how to Integrate with MS Teams, as well as how to Organize results, set alerts, and investigate issues.
Conclusion - Best Practices for Maintaining Data Quality
A well-architected financial data pipeline, combined with automated validation and monitoring, ensures that data is consistently accurate, complete, and trustworthy. The beauty of Soda Library is its ability to integrate directly into your pipeline, allowing you to automate data quality checks at all stages—from ingestion to processing.
By following this guide, you can create an efficient, scalable solution that allows your team to ensure data quality in real time while adhering to regulatory standards such as BCBS 239.
Looking ahead, keep an eye out for upcoming blog posts that will explore enhancements to your pipeline, such as:
- Empowering Business Users: Learn how to enable business users to propose and enforce their own business rules directly in your pipeline.
- Automated Metric Monitoring: Discover how to set up automated metric monitoring to identify hidden issues at scale, ensuring that no data quality issues slip through the cracks.
Feel free to explore more of Soda's capabilities on our website and in our documentation. If you'd like to dive deeper, you can always request a demo for a face-to-face discussion on how Soda can transform your data quality processes.
References
- https://www.alation.com/blog/bcbs-239-guide-compliance-best-practices-2025
- https://atlan.com/know/data-governance/bcbs-239-data-governance
- https://www.netsuite.com/portal/resource/articles/financial-management/data-challenges-financial-services