Performance is often overlooked when embarking on data quality improvements. Non-performant approaches typically backfire once the initiative gains traction and the data warehouse cost starts ramping up. This blog outlines Soda’s performance measures and takes you through the considerations that went into building our product. Below are the four key strategies we’ve implemented to keep your warehouse cost as low as possible.
Provide full configurability
Our philosophy is to give engineers full control over how data quality checks operate, combined with sensible defaults. All aspects of data quality management can be managed by engineers in YAML configuration files. These aspects include the technical details of which data is scanned automatically, how the data is scanned, which specific checks are executed, which filters are applied, and so on.
Managing data quality as code gives engineers that control. The principle of sensible defaults ensures minimal configuration is needed in the most common cases. At the same time, configuration options let engineers tweak check semantics and behavior, so they remain in full control of how and where checks are executed. Black-box solutions cannot provide the control needed to keep costs in check once data quality is scaled out over an entire data team.
An added benefit of this control is a better signal-to-noise ratio. Without advanced configuration controls to tweak, data quality checks can become costly or result in alert fatigue.
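As an illustration, the sketch below shows how a check's default semantics can be tweaked in a SodaCL checks file; the dataset and column names are hypothetical, and the comments describe SodaCL's standard missing-value handling as we understand it.

```yaml
checks for dim_customers:
  # Sensible default: only NULL counts as missing
  - missing_count(email) = 0
  # Tweaked semantics: also treat empty strings and 'N/A' as missing
  - missing_count(phone) = 0:
      missing values: ['', 'N/A']
```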
Check only what matters
To reduce costs, you need to ensure checks are executed only on the slices of data that matter. If an Airflow job produces a daily batch of new data, it makes sense to have all, or almost all, of the checks run on that new data only, and not on the full historical dataset.
Apart from time partitioning, Soda's generic filter mechanism can further trim the amount of data being verified by restricting checks to any other grouping, such as a specific customer or region.
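For example, a time-based dataset filter in SodaCL could look roughly like the sketch below; the filter name, column, and date expression are illustrative, and the SQL in the where clause depends on your warehouse dialect.

```yaml
# Only scan the last day's slice of dim_customers
filter dim_customers [daily]:
  where: created_at >= DATEADD(day, -1, CURRENT_DATE)

checks for dim_customers [daily]:
  - row_count > 0
```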
Group metrics in single queries
Soda guides users to run as many checks as possible in one scan execution. That's because, within a single scan, each metric is computed only once, and as many metrics as possible are combined into a single query. Let's walk through an example.
There are several checks on the dataset dim_customers. They may originate from different stakeholders:
- There should be between 10,000 and 30,000 customers
- The average in employee_count should not change more than 15% day over day
- No invalid categories: only high, medium, and low are valid, ignoring the missing values NULL and 'N/A'
In SodaCL, these could be expressed roughly as follows; this is a sketch that assumes the categories live in a column named category, and the exact change-over-time syntax can differ per SodaCL version:
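```yaml
checks for dim_customers:
  # Between 10,000 and 30,000 customers
  - row_count between 10000 and 30000
  # Day-over-day change of the average employee_count stays within 15%
  - change percent for avg(employee_count) between -15 and 15
  # Only high, medium and low are valid; NULL and 'N/A' count as missing
  - invalid_count(category) = 0:
      valid values: [high, medium, low]
      missing values: [N/A]
```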
Each of these checks is based on metrics, and each metric is translated by Soda into a SQL expression. For the above checks, the metrics are row_count, avg(employee_count), and invalid_count(category).
The straightforward approach would be to run an individual query for each of these metrics. Because Soda works with check files and a scan, it can compute all the metrics for a single dataset in a single query. This saves many passes over the data in the SQL engine, and thereby a lot of the indirect cost that data quality can generate in the data warehouse.
The resulting single query then looks roughly like the sketch below; the exact SQL that Soda generates varies per warehouse:
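```sql
SELECT
  COUNT(*)            AS row_count,
  AVG(employee_count) AS avg_employee_count,
  SUM(CASE
        WHEN category IS NOT NULL
         AND category <> 'N/A'
         AND category NOT IN ('high', 'medium', 'low')
        THEN 1 ELSE 0
      END)            AS invalid_count_category
FROM dim_customers;
-- The day-over-day change check reuses avg_employee_count and compares it
-- with the previously recorded measurement, so it needs no extra query.
```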
As the number of checks and metrics increases, so do the cost savings. To make sure this scales (SQL engines have a query length limit), we’ve built in a cut-off of 50 metrics per query.
Leverage compute engine specific features
When translating metrics to queries, we also make use of engine-specific optimizations. For example, Snowflake has a query result cache that stores query results for a period of time, so that if the same query is executed again, the result is served from the cache instead of being recomputed. We leverage this when computing data profiling.
Conclusion
Soda is designed and implemented to optimize warehouse usage while giving engineers the necessary control. That control lets them tweak behavior whenever the sensible defaults are not sufficient. Limiting the checked data to only what is needed reduces cost and increases speed. Merging as many metrics as possible into single queries provides significant cost savings, especially when scaling data quality out over a broader team. Together, these measures add up and help you keep your data warehouse bill under control.