Data Integrity Testing: 7 Tests from Simple to Advanced
Mar 5, 2026

Santiago Viquez
DevRel at Soda

Data is now at the center of almost every decision, product, and customer interaction. However, that data is only useful if people can trust it. When numbers are wrong, missing, or inconsistent across systems, dashboards lose credibility, teams argue about which version is correct, and customers may even receive incorrect messages or bills.
This is exactly the problem that data integrity testing is meant to solve. It provides a structured way to check data accuracy, consistency, completeness, and reliability as it moves through your ingestion pipelines, transformation layers, warehouses, and downstream applications.
While basic data validation checks confirm that individual values meet expected formats or ranges, data integrity testing operates at a system level. It ensures that metrics don’t drift silently, that relationships between tables remain intact, and that migrations or refactors don’t corrupt historical results.
In this guide, we will take a look at:
What data integrity testing is and why it is essential for business analytics
Seven data integrity checks that range from simple to more advanced
How to think about simple vs advanced data integrity tests
How data quality tools, such as Soda, can help you automate, scale, and operationalize these checks
Best practices and ways to improve data integrity over time
What is Data Integrity Testing?
Data integrity describes whether your data remains trustworthy as it flows through systems, transformations, and time.
At a high level, data integrity covers several core properties:
Accuracy – values correctly reflect the source of truth or agreed system of record
Consistency – related datasets and systems do not contradict each other
Completeness – expected records and critical fields are present
Relational integrity – foreign keys and relationships between tables remain valid
Stability over time – migrations, reprocessing, and refactors do not silently alter historical results
Data integrity testing is the practice of checking those properties in a continuous way. Instead of discovering problems only when someone complains about a strange chart, you proactively run automated data integrity checks and receive alerts when something does not look right.
Common data integrity failures include:
Data corruption, such as truncated strings, encoding errors, or incorrect data types introduced during ingestion or transformation
Data loss, where batches fail silently, rows are dropped in joins, or key fields suddenly become null
Data inconsistency, when two systems report different values for what should be the same metric
Regression after change, where a migration, backfill, or logic update unintentionally alters historical results
Strong data integrity testing acts as a system-level trust mechanism. It ensures that your data platform continues to behave predictably even as it evolves. It gives you an early warning system that protects both analytics teams and business stakeholders from making decisions based on unreliable data.
Simple vs Advanced Data Integrity Tests
Not every test requires the same level of effort. It is helpful to think about data integrity testing as a progression.
At the simple end, you have basic row count checks, completeness checks on critical fields, and uniqueness checks on primary keys. These are easy to implement, and they already catch many common issues like missing batches or duplicate records. They are an excellent starting point for teams that are just beginning to formalize their data integrity checks.
As your data environment grows more complex, you will naturally move into more advanced territory. Cross-system consistency testing, detailed referential integrity checks, business rule validation, and durability tests around major changes all require a deeper understanding of your data models, processes, and stakeholders. The payoff is higher confidence that your data is both technically sound and aligned with business reality.
At this stage, many organizations adopt specialized data integrity tools to help them scale. Platforms like Soda allow data teams to define checks as configuration, run them automatically, and review results in a central place rather than scattering them across ad hoc scripts and dashboards.
7 Tests for Ensuring Data Integrity
There are many ways to test data, but most integrity work can be grouped into a set of seven core data integrity checks. Together, they provide a strong foundation that you can then extend in more advanced scenarios.
1. Accuracy Test
An accuracy test answers a simple question: Does the data in your analytics environment correctly match the original source or the official system of record?
In practice, you might:
Compare a random sample of customer records between your CRM and your warehouse
Check that the total number of orders or the total revenue per day is the same in your source and destination
Validate that calculated columns, such as order_total, match the sum of line items plus tax and discounts
In large systems, accuracy tests are often implemented as reconciliation checks with defined tolerances rather than exact row-by-row equality. These tests often combine record-level comparisons for a small sample with aggregate comparisons for each batch. If these numbers drift, you know that something in your ingestion, transformation, or loading process is introducing errors.
Example: Reconciliation vs Source
Use a reconciliation with aggregate_diff to compare an aggregate with tolerance:

reconciliation:
  source:
    dataset: prod_raw/postgres/public/orders_source
  checks:
    - aggregate_diff:
        name: "Total revenue matches source within 0.1%"
        function: sum
        column: order_total
        threshold:
          metric: percent
          must_be_less_than: 0.1
Practical note: Accuracy is usually tolerance-based, not exact row-by-row.
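For teams without a checks platform, the same tolerance logic can be sketched in plain Python. The function name, totals, and the 0.1% default threshold are illustrative assumptions, not part of the Soda configuration above:

```python
# Hedged sketch: tolerance-based aggregate reconciliation in plain Python.
# Numbers and the 0.1% default threshold are illustrative assumptions.

def within_tolerance(source_total: float, destination_total: float,
                     max_percent_diff: float = 0.1) -> bool:
    """Return True when the destination aggregate stays within the
    allowed percentage drift of the source aggregate."""
    if source_total == 0:
        return destination_total == 0
    percent_diff = abs(destination_total - source_total) / abs(source_total) * 100
    return percent_diff <= max_percent_diff

# Totals computed separately against source and warehouse:
assert within_tolerance(100_000.00, 100_050.00)      # 0.05% drift: passes
assert not within_tolerance(100_000.00, 101_000.00)  # 1% drift: fails
```

In practice, you would feed this function the two SUM(order_total) results from your source and destination queries.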
2. Consistency Test
Consistency tests look across systems, tables, or layers to confirm that they tell the same story. Data can be accurate inside each table, but still inconsistent when you compare related datasets.
Imagine a situation where your finance team reports one revenue number while your analytics dashboards show another. Or maybe a customer appears as "active" in your subscription table but as "closed" in your support system. These inconsistencies erode trust and waste time as teams try to reconcile numbers.
To run consistency checks, you typically join datasets on shared keys and compare critical attributes or metrics. For example, you can align customer records from two systems and verify that key fields match, or compare revenue totals from your finance mart to those used in a product analytics dashboard. Consistency tests help establish a single, reliable version of the truth.
Example: Cross-System Attribute Consistency
Use failed_rows check to flag mismatched fields after a join:
checks:
  - failed_rows:
      name: "Customer status consistent across billing and marketing"
      query: |
        SELECT b.customer_id
        FROM billing_customers b
        JOIN marketing_customers m USING (customer_id)
        WHERE b.status <> m.status
      threshold:
        must_be: 0
3. Completeness Test
Completeness tests focus on whether all expected data is present. They address two related questions:
Are all the rows that should be there actually there?
Are key fields within those rows populated to an acceptable level?
For example, after each ingestion run, you can check that the number of orders ingested into your warehouse matches the number of orders in the source system for that day. You can also monitor the percentage of null values in critical fields such as customer_id, order_date, or status. A sudden drop in row counts or a spike in null values is often the first visible sign of an upstream failure or schema change.
Completeness tests are usually among the first checks teams implement, because they are simple to define and they immediately catch some of the most disruptive data integrity problems.
Example: Row Count + Nullability
Use row_count + missing checks with thresholds:
checks:
  - row_count:
      threshold:
        must_be_greater_than: 0
columns:
  - name: customer_id
    checks:
      - missing:
          name: "customer_id missing rate < 0.1%"
          threshold:
            metric: percent
            must_be_less_than: 0.1
4. Uniqueness Test
Uniqueness tests make sure that identifiers that are supposed to represent a single entity really are unique. Duplicate records can distort metrics, lead to double communications, and cause serious operational issues.
Typical examples include:
Ensuring that customer_id is unique in the customer table
Making sure that a combination such as (customer_id, date) appears at most once in a daily activity table
Checking that there are no duplicate order records after a complex join or reprocessing step
You can implement uniqueness tests by comparing the total row count with the count of distinct values for the key or key combination. If these numbers differ, you know that duplicates exist and you can investigate where they were introduced.
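That count-versus-distinct comparison can be sketched in plain SQL, here run through Python's built-in sqlite3 module against an in-memory table. Table and column names are illustrative assumptions:

```python
# Hedged sketch: detect duplicates by comparing total rows with distinct
# key combinations. Table name, columns, and data are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_activity (customer_id INT, date TEXT)")
conn.executemany(
    "INSERT INTO daily_activity VALUES (?, ?)",
    [(1, "2026-03-01"), (2, "2026-03-01"), (1, "2026-03-01")],  # duplicate pair
)

total, distinct = conn.execute(
    """
    SELECT COUNT(*),
           COUNT(DISTINCT customer_id || '|' || date)
    FROM daily_activity
    """
).fetchone()

duplicates = total - distinct
print(duplicates)  # 1: customer 1 appears twice on the same day
```

If `duplicates` is greater than zero, the key combination is not unique and the load or join that produced it needs investigation.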
Example: PK or Composite Key
Use dataset-level duplicate check for composite keys:
checks:
  - duplicate:
      name: "order_id + line_number must be unique"
      columns: [order_id, line_number]
      threshold:
        must_be: 0
5. Referential Integrity Test
Referential integrity tests deal with relationships between tables. Whenever one table references another, that relationship needs to hold up in your analytics environment.
For instance, every order.customer_id should exist in the customers table, and every line_item.order_id should have a matching order row. If a foreign key points to a primary key that no longer exists, you have an "orphaned" record. These orphans can cause missing values in reports, incorrect aggregations, and confusion when teams try to trace how a number was calculated.
To test referential integrity, you can run queries that search for foreign keys with no match in the parent table. In some warehouses, these relationships can be enforced at the database level. In others, they are implemented as logical data integrity checks that run on a schedule.
Example: Foreign Keys Exist
Use invalid check with reference data:
columns:
  - name: customer_id
    checks:
      - invalid:
          name: "customer_id must exist in customers"
          valid_reference_data:
            dataset: warehouse/public/dim_customers
            column: customer_id
Practical note: For huge tables, validate recent partitions or incrementals.
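The orphan-hunting query described above can be sketched as a LEFT JOIN anti-join, shown here against an in-memory SQLite database. Table names and values are illustrative assumptions:

```python
# Hedged sketch: find child rows whose foreign key has no parent row.
# Tables and data are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INT PRIMARY KEY)")
conn.execute("CREATE TABLE orders (order_id INT, customer_id INT)")
conn.executemany("INSERT INTO customers VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(100, 1), (101, 2), (102, 99)])  # 99 has no parent row

orphans = conn.execute(
    """
    SELECT o.order_id
    FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.customer_id
    WHERE c.customer_id IS NULL
    """
).fetchall()

print(orphans)  # [(102,)] -- order 102 references a missing customer
```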
6. Validation of Business Rules
Technical checks are important, but they are not enough on their own. Data also needs to reflect the way your business works. Business rule validation confirms that data complies with the rules that matter to your organization.
A few examples:
An order total should never be negative, and discounts should not exceed 100 percent
A subscription end date must be later than its start date
Transactions for a particular region should only use allowed currencies
These rules might be defined in documentation or captured in application logic. Bringing them into your data integrity testing strategy helps you catch issues that pure technical checks would miss. It also makes test results easier for business stakeholders to understand because each failed check can be traced back to a specific rule that they recognize.
In practical terms, you translate business rules into expressions or queries and configure them as part of your regular data integrity checks. Over time, this library of rules becomes a valuable asset because it codifies a shared understanding of how your business data should behave.
Example: Domain Logic
Use failed_rows check with an expression:
checks:
  - failed_rows:
      name: "Discount cannot exceed gross amount"
      expression: discount_amount > (quantity * unit_price)
      threshold:
        must_be: 0
7. Durability Test
Durability tests are an important part of maintaining data integrity. Their purpose is to make sure that your data remains correct and trustworthy after significant changes, such as migrations, backfills, performance optimizations, or major refactors of your pipelines.
Before you migrate a large table from one warehouse to another, for example, you can capture key metrics such as revenue, active users, or churn for several historical periods. After the migration, you compute the same metrics and compare them. Any unexpected difference signals a problem that must be corrected before stakeholders rely on the new system.
The same idea applies when you modify transformation logic. Durability testing confirms that important metrics either stay the same or change only in ways that you expect and can explain. This gives your team confidence to improve and evolve your data platform without sacrificing trust in its outputs.
Example: Regression After Change
Use a reconciliation check with row_count_diff to compare "before" vs "after" datasets:

reconciliation:
  source:
    dataset: baseline/postgres/public/orders_pre_migration
  checks:
    - row_count_diff:
        name: "Post-migration row count matches baseline"
        threshold:
          must_be: 0
Practical note: Run durability checks on agreed historical windows (e.g., last 90 days) and key KPIs.
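The capture-then-compare workflow from this section can be sketched as a small snapshot diff. Metric names and values are illustrative assumptions:

```python
# Hedged sketch: capture key metrics before a migration, recompute after,
# and surface any drift. Metric names and values are illustrative.

def compare_snapshots(before: dict, after: dict) -> list:
    """Return (metric, before_value, after_value) for every metric
    whose value changed between the two snapshots."""
    return [
        (metric, before[metric], after.get(metric))
        for metric in before
        if before[metric] != after.get(metric)
    ]

before = {"revenue_2025_q4": 1_250_000.0, "active_users_2025_q4": 48_210}
after = {"revenue_2025_q4": 1_250_000.0, "active_users_2025_q4": 48_190}

drift = compare_snapshots(before, after)
print(drift)  # [('active_users_2025_q4', 48210, 48190)]
```

Any entry in `drift` is a difference to explain or fix before stakeholders rely on the migrated system.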
Tools for Data Integrity Testing
You can implement data integrity testing with pure SQL and scheduling, but that approach often becomes difficult to maintain. As the number of datasets, checks, and contributors grows, it becomes easy to lose track of what is being tested, how often, and who is responsible for fixing issues.
Dedicated data integrity tools address this by providing:
A standard way to describe checks, including accuracy, completeness, uniqueness, referential integrity, business rules, and durability
A scheduler that runs checks at the right points in your pipelines or on a defined cadence
Alerting that sends failures to the right teams through channels such as email, chat tools, or incident systems
A history of check results so that you can see trends in your data health over time
Solutions like Soda integrate with modern data stacks, connect to data warehouses and lakes, support checks as code, and make it easier for data engineers, analysts, and business owners to collaborate on data integrity testing. Instead of writing one-off scripts, teams can reuse and extend a shared library of checks across tables and projects.
When you evaluate data integrity tools, look for support for both basic and advanced data integrity checks, easy integration with your existing orchestration and version control, and clear visibility for non-technical stakeholders who need to understand the health of the data products they rely on.
Best Practices for Data Integrity Testing
To get the most from data integrity testing, it helps to follow a few core practices that we explore more deeply in our article on data integrity best practices.
🟢 First, group your tests into clear suites. You might organize them by domain, for example, billing, customer, or product data, or by data product, such as marketing funnel models or financial reporting. This makes coverage easier to comprehend and clarifies which team is accountable for each group of checks.
🟢 Second, treat failed checks as signals that deserve attention, not as background noise. Assign owners for key datasets, agree on what a timely response looks like, and align the severity of the alerts with business impact. A failed check on a critical revenue table should trigger a different response than a minor anomaly in a low-priority dataset.
🟢 Third, integrate data integrity checks into your data pipeline rather than running them entirely after the fact. Tests that run after ingestion or transformation steps can block or pause downstream jobs when they detect serious problems. This prevents broken data from flowing into dashboards and machine learning models, which is a key part of strong data integrity testing.
🟢 Finally, avoid creating hundreds of checks that are technically precise but not meaningful. Each test should exist for a reason, and that reason should be understood by both technical and business stakeholders. This clarity makes maintenance and iteration much easier.
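The third practice, blocking downstream jobs when a serious check fails, can be sketched as a small gating step. The exception class and check names are illustrative assumptions, not a specific orchestrator's API:

```python
# Hedged sketch: a pipeline step that halts downstream jobs when a
# critical data integrity check fails. Check names are illustrative.

class DataIntegrityError(Exception):
    """Raised to stop downstream jobs after a failed critical check."""

def gate_pipeline(check_results: dict) -> None:
    """Raise if any check failed, so the orchestrator marks this task
    failed and dependent dashboards or model refreshes do not run."""
    failures = [name for name, passed in check_results.items() if not passed]
    if failures:
        raise DataIntegrityError(f"Failed checks: {failures}")

# All checks pass: downstream jobs may proceed
gate_pipeline({"orders_row_count": True, "orders_customer_id_not_null": True})
```

In a real scheduler such as Airflow or Dagster, the raised exception is what marks the task as failed and pauses the dependency graph.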
How to Improve Data Integrity Over Time
Data integrity is not something you set up once and then forget. New sources, new products, and new questions will always create new risks and requirements. The most successful teams treat integrity as a continuous loop of monitoring, learning, and improvement.
A simple but powerful habit is to review every major data incident and ask, "What test would have caught this earlier?" Once you have an answer, add or refine a check so that the same problem does not happen again. Over time, your integrity suite becomes a direct reflection of the real issues you have faced and resolved.
You can also embed integrity thinking into how you design new pipelines and data products. When you plan a new model, define not only what the output should look like, but also which data integrity checks will protect it, who will own those checks, and how you will be alerted if they fail.
As your program matures, you can shift from simple checks to more advanced data integrity techniques such as monitoring distributions, detecting anomalies, and running comprehensive durability tests around large changes. Data integrity tools make it easier to scale without overwhelming your team with manual effort.
Frequently Asked Questions
What are the most common data integrity errors?
Common problems include missing batches of data after ingestion failures, unexpected increases in null values in key fields, duplicate rows created during reprocessing or joins, orphaned foreign keys where child rows no longer have matching parents, and unexplained shifts in important metrics after changes to pipelines or platforms. The seven types of data integrity checks in this article are designed to detect exactly these patterns.
How often should data integrity tests be run?
For batch pipelines, most teams perform data integrity tests with every run. Daily jobs have daily checks, hourly jobs have hourly checks, and so on. For streaming or near-real-time systems, data consistency checks often operate on short time windows or frequent snapshots. During high-risk events such as migrations and major refactors, durability tests and comparison checks are usually run more intensively to ensure data remains accurate.
Can data integrity tests be automated?
Yes, and they should be automated for any non-trivial data stack. Manual spot checks are useful when you are exploring data quality, but they cannot provide reliable protection at scale and are prone to human error. Automation means defining checks in a reusable format, running them through your orchestration layer or a dedicated data integrity platform such as Soda, and routing alerts directly to the people who can fix issues. With the right approach, you can maintain strong and even advanced data integrity while your data landscape grows and evolves.
Data is now at the center of almost every decision, product, and customer interaction. However, that data is only useful if people can trust it. When numbers are wrong, missing, or inconsistent across systems, dashboards lose credibility, teams argue about which version is correct, and customers may even receive incorrect messages or bills.
This is exactly the problem that data integrity testing is meant to solve. It provides a structured way to check data accuracy, consistency, completeness, and reliability as it moves through your ingestion pipelines, transformation layers, warehouses, and downstream applications.
While basic data validation checks confirm that individual values meet expected formats or ranges, data integrity testing operates at a system level. It ensures that metrics don’t drift silently, that relationships between tables remain intact, and that migrations or refactors don’t corrupt historical results.
In this guide, we will take a look at:
What data integrity testing is and why it is essential for business analytics
Seven data integrity checks that range from simple to more advanced
How to think about simple vs advanced data integrity tests
How data quality tools, such as Soda, can help you automate, scale, and operationalize these checks
Best practices and ways to improve data integrity over time
What is Data Integrity Testing?
Data integrity describes whether your data remains trustworthy as it flows through systems, transformations, and time.
At a high level, data integrity covers several core properties:
Accuracy – values correctly reflect the source of truth or agreed system of record
Consistency – related datasets and systems do not contradict each other
Completeness – expected records and critical fields are present
Relational integrity – foreign keys and relationships between tables remain valid
Stability over time – migrations, reprocessing, and refactors do not silently alter historical results
Data integrity testing is the practice of checking those properties in a continuous way. Instead of discovering problems only when someone complains about a strange chart, you proactively run automated data integrity checks and receive alerts when something does not look right.
Common data integrity failures include:
Data corruption, such as truncated strings, encoding errors, or incorrect data types introduced during ingestion or transformation
Data loss, where batches fail silently, rows are dropped in joins, or key fields suddenly become null
Data inconsistency, when two systems report different values for what should be the same metric
Regression after change, where a migration, backfill, or logic update unintentionally alters historical results
Strong data integrity testing acts as a system-level trust mechanism. It ensures that your data platform continues to behave predictably even as it evolves. It gives you an early warning system that protects both analytics teams and business stakeholders from making decisions based on unreliable data.
Simple vs Advanced Data Integrity Tests
Not every test requires the same level of effort. It is helpful to think about data integrity testing as a progression.
At the simple end, you have basic row count checks, completeness checks on critical fields, and uniqueness checks on primary keys. These are easy to implement, and they already catch many common issues like missing batches or duplicate records. They are an excellent starting point for teams that are just beginning to formalize their data integrity checks.
As your data environment grows more complex, you will naturally move into more advanced territory. Cross-system consistency testing, detailed referential integrity checks, business rule validation, and durability tests around major changes all require a deeper understanding of your data models, processes, and stakeholders. The payoff is higher confidence that your data is both technically sound and aligned with business reality.
At this stage, many organizations adopt specialized data integrity tools to help them scale. Platforms like Soda allow data teams to define checks as configuration, run them automatically, and review results in a central place rather than scattering them across ad hoc scripts and dashboards.
7 Tests for Ensuring Data Integrity
There are many ways to test data, but most integrity work can be grouped into a set of seven core data integrity checks. Together, they provide a strong foundation that you can then extend in more advanced scenarios.
1. Accuracy Test
An accuracy test answers a simple question: Does the data in your analytics environment correctly match the original source or the official system of record?
In practice, you might:
Compare a random sample of customer records between your CRM and your warehouse
Check that the total number of orders or the total revenue per day is the same in your source and destination
Validate that calculated columns, such as
order_total, match the sum of line items plus tax and discounts
In large systems, accuracy tests are often implemented as reconciliation checks with defined tolerances rather than exact row-by-row equality. These tests often combine record-level comparisons for a small sample with aggregate comparisons for each batch. If these numbers drift, you know that something in your ingestion, transformation, or loading process is introducing errors.
Example: Reconciliation vs Source
Use a reconciliation with aggregate_diffto compare an aggregate with tolerance:
reconciliation: source: dataset: prod_raw/postgres/public/orders_source checks: - aggregate_diff: name: "Total revenue matches source within 0.1%" function: sum column: order_total threshold: metric: percent must_be_less_than: 0.1
Practical note: Accuracy is usually tolerance-based, not exact row-by-row.
2. Consistency Test
Consistency tests look across systems, tables, or layers to confirm that they tell the same story. Data can be accurate inside each table, but still inconsistent when you compare related datasets.
Imagine a situation where your finance team reports one revenue number while your analytics dashboards show another. Or maybe a customer appears as "active" in your subscription table but as "closed" in your support system. These inconsistencies erode trust and waste time as teams try to reconcile numbers.
To run consistency checks, you typically join datasets on shared keys and compare critical attributes or metrics. For example, you can align customer records from two systems and verify that key fields match, or compare revenue totals from your finance mart to those used in a product analytics dashboard. Consistency tests help establish a single, reliable version of the truth.
Example: Cross-System Attribute Consistency
Use failed_rows check to flag mismatched fields after a join:
checks: - failed_rows: name: "Customer status consistent across billing and marketing" query: | SELECT b.customer_id FROM billing_customers b JOIN marketing_customers m USING (customer_id) WHERE b.status <> m.status threshold: must_be: 0
3. Completeness Test
Completeness tests focus on whether all expected data is present. They address two related questions:
Are all the rows that should be there actually there?
Are key fields within those rows populated to an acceptable level?
For example, after each ingestion run, you can check that the number of orders ingested into your warehouse matches the number of orders in the source system for that day. You can also monitor the percentage of null values in critical fields such as customer_id, order_date, or status. A sudden drop in row counts or a spike in null values is often the first visible sign of an upstream failure or schema change.
Completeness tests are usually among the first checks teams implement, because they are simple to define and they immediately catch some of the most disruptive data integrity problems.
Example: Row Count + Nullability
Use row_count + missing checks with thresholds:
checks: - row_count: threshold: must_be_greater_than: 0 columns: - name: customer_id checks: - missing: name: "customer_id missing rate < 0.1%" threshold: metric: percent must_be_less_than: 0.1
4. Uniqueness Test
Uniqueness tests make sure that identifiers that are supposed to represent a single entity really are unique. Duplicate records can distort metrics, lead to double communications, and cause serious operational issues.
Typical examples include:
Ensuring that
customer_idis unique in the customer tableMaking sure that a combination such as (
customer_id,date) appears at most once in a daily activity tableChecking that there are no duplicate order records after a complex join or reprocessing step
You can implement uniqueness tests by comparing the total row count with the count of distinct values for the key or key combination. If these numbers differ, you know that duplicates exist and you can investigate where they were introduced.
Example: PK or Composite Key
Use dataset-level duplicate check for composite keys:
checks: - duplicate: name: "order_id + line_number must be unique" columns: [order_id, line_number] threshold: must_be: 0
5. Referential Integrity Test
Referential integrity tests deal with relationships between tables. Whenever one table references another, that relationship needs to hold up in your analytics environment.
For instance, every order.customer_id should exist in the customers table, and every line_item.order_id should have a matching order row. If a foreign key points to a primary key that no longer exists, you have an "orphaned" record. These orphans can cause missing values in reports, incorrect aggregations, and confusion when teams try to trace how a number was calculated.
To test referential integrity, you can run queries that search for foreign keys with no match in the parent table. In some warehouses, these relationships can be enforced at the database level. In others, they are implemented as logical data integrity checks that run on a schedule.
Example: Foreign Keys Exist
Use invalid check with reference data:
columns: - name: customer_id checks: - invalid: name: "customer_id must exist in customers" valid_reference_data: dataset: warehouse/public/dim_customers column
Practical note: For huge tables, validate recent partitions or incrementals.
6. Validation of Business Rules
Technical checks are important, but they are not enough on their own. Data also needs to reflect the way your business works. Business rule validation confirms that data complies with the rules that matter to your organization.
A few examples:
An order total should never be negative, and discounts should not exceed 100 percent
A subscription end date must be later than its start date
Transactions for a particular region should only use allowed currencies
These rules might be defined in documentation or captured in application logic. Bringing them into your data integrity testing strategy helps you catch issues that pure technical checks would miss. It also makes test results easier for business stakeholders to understand because each failed check can be traced back to a specific rule that they recognize.
In practical terms, you translate business rules into expressions or queries and configure them as part of your regular data integrity checks. Over time, this library of rules becomes a valuable asset because it codifies a shared understanding of how your business data should behave.
Example: Domain Logic
Use failed_rows check with an expression:
checks: - failed_rows: name: "Discount cannot exceed gross amount" expression: discount_amount > (quantity * unit_price) threshold: must_be: 0
7. Durability Test
Durability tests are an important part of maintaining data integrity. Their purpose is to make sure that your data remains correct and trustworthy after significant changes, such as migrations, backfills, performance optimizations, or major refactors of your pipelines.
Before you migrate a large table from one warehouse to another, for example, you can capture key metrics such as revenue, active users, or churn for several historical periods. After the migration, you compute the same metrics and compare them. Any unexpected difference signals a problem that must be corrected before stakeholders rely on the new system.
The same idea applies when you modify transformation logic. Durability testing confirms that important metrics either stay the same or change only in ways that you expect and can explain. This gives your team confidence to improve and evolve your data platform without sacrificing trust in its outputs.
Example: Regression After Change
Use a reconciliation check with row_count_diffto compare "before" vs "after" datasets:
reconciliation: source: dataset: baseline/postgres/public/orders_pre_migration checks: - row_count_diff: name: "Post-migration row count matches baseline" threshold: must_be: 0
Practical note: Run durability checks on agreed historical windows (e.g., last 90 days) and key KPIs.
Tools for Data Integrity Testing
You can implement data integrity testing with pure SQL and scheduling, but that approach often becomes difficult to maintain. As the number of datasets, checks, and contributors grows, it becomes easy to lose track of what is being tested, how often, and who is responsible for fixing issues.
Dedicated data integrity tools address this by providing:
A standard way to describe checks, including accuracy, completeness, uniqueness, referential integrity, business rules, and durability
A scheduler that runs checks at the right points in your pipelines or on a defined cadence
Alerting that sends failures to the right teams through channels such as email, chat tools, or incident systems
A history of check results so that you can see trends in your data health over time
Solutions like Soda integrate with modern data stacks, connect to data warehouses and lakes, support checks as code, and make it easier for data engineers, analysts, and business owners to collaborate on data integrity testing. Instead of writing one-off scripts, teams can reuse and extend a shared library of checks across tables and projects.
When you evaluate data integrity tools, look for support for both basic and advanced data integrity checks, easy integration with your existing orchestration and version control, and clear visibility for non-technical stakeholders who need to understand the health of the data products they rely on.
Best Practices for Data Integrity Testing
To get the most from data integrity testing, it helps to follow a few core practices that we explore more deeply in our article on data integrity best practices.
🟢 First, group your tests into clear suites. You might organize them by domain, for example, billing, customer, or product data, or by data product, such as marketing funnel models or financial reporting. This makes coverage easier to comprehend and clarifies which team is accountable for each group of checks.
🟢 Second, treat failed checks as signals that deserve attention, not as background noise. Assign owners for key datasets, agree on what a timely response looks like, and align the severity of the alerts with business impact. A failed check on a critical revenue table should trigger a different response than a minor anomaly in a low-priority dataset.
🟢 Third, integrate data integrity checks into your data pipeline rather than running them entirely after the fact. Tests that run after ingestion or transformation steps can block or pause downstream jobs when they detect serious problems. This prevents broken data from flowing into dashboards and machine learning models, which is a key part of strong data integrity testing.
🟢 Finally, avoid creating hundreds of checks that are technically precise but not meaningful. Each test should exist for a reason, and that reason should be understood by both technical and business stakeholders. This clarity makes maintenance and iteration much easier.
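The third practice above, running checks inside the pipeline so that failures can block downstream jobs, can be sketched as a simple gate. The check functions below are hypothetical stand-ins; in a real pipeline they would be replaced by your framework's checks and the exception would be surfaced by your orchestrator:

```python
# Pipeline gate sketch: run integrity checks after a step and raise
# on failure so the orchestrator stops downstream jobs.

def check_row_count(rows):
    """Completeness: at least one row must be present."""
    return len(rows) > 0

def check_no_null_ids(rows):
    """Completeness: customer_id must be populated on every row."""
    return all(r.get("customer_id") is not None for r in rows)

def run_integrity_gate(rows):
    checks = {
        "row_count": check_row_count,
        "no_null_ids": check_no_null_ids,
    }
    failures = [name for name, check in checks.items() if not check(rows)]
    if failures:
        # A real pipeline would also alert the dataset owners here.
        raise RuntimeError(f"Integrity checks failed: {failures}")
    return True

rows = [{"customer_id": 1}, {"customer_id": 2}]
run_integrity_gate(rows)  # passes; raises RuntimeError on bad data
```

Raising an exception is the simplest blocking mechanism: most orchestrators treat a failed task as a reason not to run its downstream dependencies.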
How to Improve Data Integrity Over Time
Data integrity is not something you set up once and then forget. New sources, new products, and new questions will always create new risks and requirements. The most successful teams treat integrity as a continuous loop of monitoring, learning, and improvement.
A simple but powerful habit is to review every major data incident and ask, "What test would have caught this earlier?" Once you have an answer, add or refine a check so that the same problem does not happen again. Over time, your integrity suite becomes a direct reflection of the real issues you have faced and resolved.
You can also embed integrity thinking into how you design new pipelines and data products. When you plan a new model, define not only what the output should look like, but also which data integrity checks will protect it, who will own those checks, and how you will be alerted if they fail.
As your program matures, you can shift from simple checks to more advanced data integrity techniques such as monitoring distributions, detecting anomalies, and running comprehensive durability tests around large changes. Data integrity tools make it easier to scale without overwhelming your team with manual effort.
Frequently Asked Questions
What are the most common data integrity errors?
Common problems include missing batches of data after ingestion failures, unexpected increases in null values in key fields, duplicate rows created during reprocessing or joins, orphaned foreign keys where child rows no longer have matching parents, and unexplained shifts in important metrics after changes to pipelines or platforms. The seven types of data integrity checks in this article are designed to detect precisely these patterns.
How often should data integrity tests be run?
For batch pipelines, most teams perform data integrity tests with every run. Daily jobs have daily checks, hourly jobs have hourly checks, and so on. For streaming or near-real-time systems, data consistency checks often operate on short time windows or frequent snapshots. During high-risk events such as migrations and major refactors, durability tests and comparison checks are usually run more intensively to ensure data remains accurate.
Can data integrity tests be automated?
Yes, and they should be automated for any non-trivial data stack. Manual spot checks are useful when you are exploring data quality, but they cannot provide reliable protection at scale and are prone to human error. Automation means defining checks in a reusable format, running them through your orchestration layer or a dedicated data integrity platform such as Soda, and routing alerts directly to the people who can fix issues. With the right approach, you can maintain strong and even advanced data integrity while your data landscape grows and evolves.
Trusted by the world’s leading enterprises
Real stories from companies using Soda to keep their data reliable, accurate, and ready for action.
At the end of the day, we don’t want to be in there managing the checks, updating the checks, adding the checks. We just want to go and observe what’s happening, and that’s what Soda is enabling right now.

Sid Srivastava
Director of Data Governance, Quality and MLOps
Investing in data quality is key for cross-functional teams to make accurate, complete decisions with fewer risks and greater returns, using initiatives such as product thinking, data governance, and self-service platforms.

Mario Konschake
Director of Product-Data Platform
Soda has integrated seamlessly into our technology stack and given us the confidence to find, analyze, implement, and resolve data issues through a simple self-serve capability.

Sutaraj Dutta
Data Engineering Manager
Our goal was to deliver high-quality datasets in near real-time, ensuring dashboards reflect live data as it flows in. But beyond solving technical challenges, we wanted to spark a cultural shift - empowering the entire organization to make decisions grounded in accurate, timely data.

Gu Xie
Head of Data Engineering
4.4 of 5
Start trusting your data. Today.
Find, understand, and fix any data quality issue in seconds.
From table to record-level.