Data Quality Performance Considerations: Optimize Cost & Scale

Tom Baeyens
CTO and Co-founder at Soda

Published Aug 1, 2023

Performance is often overlooked when embarking on data quality improvements. Non-performant approaches typically backfire once the initiative gains traction and the data warehouse cost starts ramping up.

What is “performance” in data quality?

In the context of data quality, performance refers to how efficiently, cost-effectively, and scalably checks execute — in terms of compute, latency, resource usage, and warehouse cost. Poor performance leads to high bills, bottlenecks, and reduced trust in your quality system.

This blog outlines Soda’s performance measures, and takes you through the considerations we have put into building our product. Without performance optimization, you're forced to choose between comprehensive data quality and cost control. Soda's approach eliminates this trade-off through four key strategies outlined below.

1. Provide full configurability

Our philosophy is to give engineers full control over how data quality checks operate, combined with sensible defaults. Every aspect of data quality management can be managed in YAML configuration files by engineers: which data is scanned automatically, how it is scanned, which specific checks are executed, on which filters, and so on.

Example checks.yaml:

checks for dim_customer:
  - invalid_count(email_address) = 0:
          valid format: email
          name: Ensure values are formatted as email addresses
  - missing_count(last_name) = 0:
          name: Ensure there are no null values in the Last Name column
  - duplicate_count(phone) = 0:
          name: No duplicate phone numbers
  - freshness(date_first_purchase) < 7d:
          name: Data in this dataset is less than 7 days old
  - schema:
          warn:
            when schema changes: any
          name: Warn on any schema change

Managing data quality as code gives engineers that control. The principle of sensible defaults ensures that minimal configuration is needed in the most common cases, while configuration options let engineers tweak check semantics and behavior and so remain in full control of how and where checks are executed. Black-box solutions cannot provide the control needed to keep costs in check once data quality is scaled out across an entire data team.

An added benefit is a better signal-to-noise ratio: without advanced configuration controls to tweak, checks can become costly or lead to alert fatigue.
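
As a small sketch of what those check-level options look like (the thresholds, sample limit, and column name here are illustrative choices, not recommendations):

checks for dim_customer:
  - missing_count(last_name):
      name: Last names should rarely be missing
      warn: when > 0         # alert, but do not fail the scan
      fail: when > 100       # fail only when the problem is significant
      samples limit: 20      # cap collected failed-row samples to limit overhead

Splitting warn and fail thresholds like this keeps minor issues from paging the team, while the samples limit bounds how many failing rows are fetched for diagnosis.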

2. Check only what matters

To reduce data quality costs, you need to ensure checks are only executed on the slices of data that matter. If an Airflow job produces a daily batch of new data, then it makes sense to run all, or almost all, of the checks on that new data only, not on the full historic dataset.
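
As a minimal sketch of that pattern using SodaCL's dataset filter syntax (the dataset, column, and partition names below are illustrative):

filter dim_customer [daily]:
  where: TIMESTAMP '${ts_start}' <= load_ts AND load_ts < TIMESTAMP '${ts_end}'

checks for dim_customer [daily]:
  - row_count > 0
  - missing_count(last_name) = 0

The ${ts_start} and ${ts_end} variables would be supplied at scan time, for example by the Airflow job that produced the batch, so every check only touches the newly arrived rows.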

Beyond time partitioning, Soda's generic filter mechanism can further trim the data being verified by running checks only on other slices that matter, such as a specific customer or region. The example below evaluates an average-discount check per sales territory:

checks for fact_internet_sales:
  - group by:
      group_limit: 10
      name: average discount
      query: |
        SELECT sales_territory_key, AVG(discount_amount) as average_discount
        FROM fact_internet_sales
        GROUP BY sales_territory_key
      fields:
        - sales_territory_key
      checks:
        - average_discount:
            fail: when > 40
            name: Average discount per sales territory

3. Group metrics in single queries

Soda guides users to run as many checks as possible in one scan execution, because within a scan each metric is computed only once, and as many metrics as possible are combined into a single query. Let's walk through an example.

There are several checks on the dataset dim_customers. They may originate from different stakeholders:

  • There should be between 10,000 and 30,000 customers

  • The average in employee_count should not change more than 15% day over day

  • There should be no invalid values in category (valid values: high, medium, low), treating NULL and 'N/A' as missing

In SodaCL (Soda Checks Language), these translate to:

checks for dim_customers:
   - row_count between 10000 and 30000
   - change percent for avg(employee_count) between -15 and +15
   - invalid_count(category) = 0:
       valid values: ['high', 'medium', 'low']
       missing values: ['N/A']

Read the complete SodaCL configuration reference.

Each of these checks will be based on metrics, and each metric will be translated by Soda into a SQL expression. For the above checks, the metrics will be:

Metric                     SQL expression
row_count                  COUNT(*)
avg(employee_count)        AVG(employee_count)
invalid_count(category)    COUNT(CASE WHEN NOT (category IS NULL OR category IN ('N/A'))
                                 AND NOT (category IN ('high', 'medium', 'low')) THEN 1 END)

The straightforward approach would be to run an individual query for each metric. Because Soda works with check files and scans, it can instead compute all the metrics for a single dataset in one query. This saves many passes over the data in the SQL engine, and thereby much of the indirect cost that data quality can otherwise generate in the data warehouse.

The resulting single query becomes:

SELECT
   COUNT(*) AS row_count,
   AVG(employee_count) AS avg_employee_count,
   COUNT(CASE WHEN NOT (category IS NULL OR category IN ('N/A'))
              AND NOT (category IN ('high', 'medium', 'low'))
              THEN 1 END) AS invalid_count_category
FROM dim_customers

As the number of checks and metrics increases, so do the cost savings. To make sure this scales (SQL engines have a query length limit), we've built in a cut-off point of 50 metrics per query.
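
To illustrate what that cut-off means in practice (a sketch only, not Soda's literal generated SQL): when a dataset accumulates more metrics than fit in one statement, they are spread over a few aggregation queries, each still computing many metrics in a single pass.

-- Sketch: one batch of metrics for dim_customers, computed in a single pass
SELECT
   COUNT(*) AS row_count,
   AVG(employee_count) AS avg_employee_count
FROM dim_customers;

-- Sketch: a further batch of metrics for the same dataset in a second pass
SELECT
   COUNT(CASE WHEN NOT (category IS NULL OR category IN ('N/A'))
              AND NOT (category IN ('high', 'medium', 'low'))
              THEN 1 END) AS invalid_count_category
FROM dim_customers;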

Read more about User-defined checks using SQL configuration.

4. Leverage compute engine specific features

When translating metrics to queries, we also make use of SQL-engine-specific optimizations. For example, Snowflake has a query result cache that stores query results for a while, so that when an identical query is executed again it is served from the cache instead of being recomputed. We leverage this when computing data profiling.
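
As a rough illustration of that Snowflake behaviour (a sketch using the same illustrative table as above, not Soda's generated profiling SQL): when the result cache is enabled and the underlying data has not changed, re-running a byte-identical statement is served from cache instead of being recomputed.

-- Snowflake's result cache is on by default; the session parameter makes it explicit.
ALTER SESSION SET USE_CACHED_RESULT = TRUE;

-- First execution computes the aggregates on the warehouse.
SELECT COUNT(*) AS row_count, AVG(employee_count) AS avg_employee_count
FROM dim_customers;

-- Re-running the identical statement on unchanged data is typically answered
-- from the result cache, at no additional compute cost.
SELECT COUNT(*) AS row_count, AVG(employee_count) AS avg_employee_count
FROM dim_customers;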

Conclusion

Soda is designed and implemented to optimize performance while giving engineers the necessary control. That control lets them tweak the configuration whenever the sensible defaults are not sufficient.

With Soda's configuration-first approach, you maintain granular control over:

  • What to scan (specific columns, tables, or partitions)

  • When to scan (schedule frequency and triggers)

  • How to scan (sampling rates, query timeouts)

  • Where to scan (compute warehouse selection)

This configurability means you can balance data quality coverage with warehouse cost optimization based on your specific needs.

Limiting the checked data to only what's needed reduces cost and increases speed. Merging as many check metrics as possible into single queries delivers significant cost savings, especially when data quality is scaled out across a broader team.

Together these measures add up and help you keep your data warehouse bill under control.

Read our docs to see How Soda works.

Start trusting your data. Today.

Soda fixes data. End-to-end. From detection to resolution. All automated with AI.