How to Test Data Pipelines: A Practical Framework

Fabiana Ferraz

Technical Writer at Soda

There is a particular kind of horror in realizing your smoke detector only works once the house is already on fire. By that point, the damage is already done.

Data pipelines suffer from a similar problem, but with one key difference. When a house burns down, you know right away. When a pipeline silently drops 3% of your rows, a vendor changes their API response format, or a transformation starts double-counting a key metric, everything continues to run as if nothing is wrong. There are no errors and no alerts. Instead, a dashboard looks slightly off three weeks later, after bad data has already moved through your entire stack and influenced decisions that cannot be reversed.

Most data teams respond to this reactively. They add tests after something breaks, during a post-mortem, almost as a reflex. The issue is not a lack of awareness that testing matters. It is that when teams try to build coverage deliberately, the process quickly becomes unclear: where should you start, and what should you add next?

This piece answers that question with a stage-by-stage decision process you can apply to your pipeline today. Whether you are starting from scratch or closing gaps in a production pipeline, here is what you will walk away with:

  • A risk-first triage method to decide which datasets and stages to test first.

  • A stage-by-stage breakdown of common failures across ingestion, transformation, serving, and performance, along with the minimum checks that catch them.

  • A repeatable decision rule for what to add next as your coverage grows.

  • A practical distinction between what testing covers and what observability handles, and why both are necessary.

Start With the Risk Map, Not the Test List: Three Questions to Find Your Starting Point

The instinct when building test coverage is to open your pipeline and start writing checks. Resist it.

Without a clear sense of which data actually matters and what it costs when it breaks, you end up with coverage that is wide but shallow. Lots of checks on datasets nobody depends on, and nothing protecting the revenue report your CFO pulls every Monday morning.

Before writing a single check, run a fast triage. These three questions will tell you where your first checks belong:

  1. Which datasets feed your most critical outputs? Think dashboards that drive business decisions, ML models in production, customer-facing features, or regulatory reports. These are your highest-priority targets, the places where a silent failure turns into someone else’s urgent problem.

  2. What does failure actually cost here? A broken internal analytics table is annoying. A broken fraud detection feed is an incident. Rank by consequence, not by how easy the dataset is to test.

  3. Where have things broken before? Past failure points are the most reliable signal for where future failures will happen. If a specific source has a history of schema changes or late-arriving data, that is where your first checks belong.

The answers give you a risk map. You will have a short, ordered list of the datasets and pipeline stages where a silent failure would hurt most. Everything you build from here starts there.

Try this now: Name three datasets that would trigger an incident if they silently broke today. If you cannot name them immediately, that is useful information. It means your team does not yet have a shared picture of what "critical" means, and getting that alignment is the first step.

One more thing worth stating: complete test coverage is not the goal, and chasing it is a trap. The goal is to make sure the failures that matter get caught by your checks, not by your users. A small set of well-placed, enforced checks on your highest-risk data will do more for your pipeline's reliability than broad coverage that nobody acts on.

Stage-by-Stage: Where to Test and What to Catch

A pipeline is a sequence of handoffs, each with its own failure modes. Schema drift at ingestion looks nothing like a join that silently drops rows mid-transformation. Testing them the same way, or at the same point, means one of them reliably slips through.

What follows is a stage-by-stage breakdown of where things actually go wrong, the minimum check that catches it, and what adequate coverage looks like at each point. Use it as a build order, not a checklist. Work through your highest-risk pipeline stages first, then expand.

Stage 1: Ingestion (when the data first arrives)

What breaks here:

This is where the outside world meets your pipeline, which makes it the least predictable stage. Vendors change API response formats without warning. Source systems rename columns. A batch job runs twice and delivers duplicate records. Data arrives late, or not at all, and nothing throws an error.

The minimum check:

  • Schema validation on arrival. Confirm that expected columns exist, types have not changed, and no new fields have appeared that your downstream models aren't ready for

  • Freshness check. Verify that data arrived within the expected window; a missing delivery is a failure even if it produces no error

What good coverage looks like:

Every new data source gets a schema test and a freshness check before anything downstream runs. Schema changes trigger an alert rather than silently propagating.
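A minimal sketch of these two ingestion checks in Python. The column names, 24-hour window, and in-memory row format are illustrative assumptions, not a prescribed implementation:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical contract for an incoming feed; adjust to your own source.
EXPECTED_COLUMNS = {"order_id", "amount", "created_at"}

def check_schema(rows):
    """Fail fast on missing, renamed, or unexpected columns.

    `rows` is assumed to be a list of dicts, one per record.
    """
    if not rows:
        raise ValueError("no rows delivered")
    actual = set(rows[0].keys())
    missing = EXPECTED_COLUMNS - actual
    unexpected = actual - EXPECTED_COLUMNS
    if missing or unexpected:
        raise ValueError(
            f"schema drift: missing={sorted(missing)}, unexpected={sorted(unexpected)}"
        )

def check_freshness(last_delivery, max_age=timedelta(hours=24), now=None):
    """A missing delivery is a failure even if it raised no error."""
    now = now or datetime.now(timezone.utc)
    if now - last_delivery > max_age:
        raise ValueError(f"stale data: last delivery at {last_delivery.isoformat()}")
```

Run both before anything downstream executes; a raised exception at this point is far cheaper than silent drift propagating through every later stage.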

Stage 2: Transformation (where logic errors hide)

What breaks here:

Transformation failures are the hardest to catch because they do not crash your pipeline. They just produce wrong answers. A join that drops rows when a key does not match. An aggregation that double-counts when a filter condition is slightly off. A business rule applied inconsistently across regions because someone updated one branch of the logic and not the other. The pipeline finishes successfully, but the numbers are wrong.

The minimum check:

  • Unit tests on transformation logic. Test individual transformations with known inputs and expected outputs. Run them on every code change, not just at deployment

  • Row count reconciliation between input and output. If 1 million records enter a transformation and 970,000 leave it, that difference needs an explanation

What good coverage looks like:

Core transformation logic is tested automatically on every code change before anything reaches production. Row count parity checks run as part of every pipeline execution.
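As an illustration, here is what those two checks might look like in plain Python, using a hypothetical currency-normalization step (the function, column names, and rates are invented for the example; the tests follow the pytest convention but run standalone):

```python
def normalize_amounts(rows, rates):
    """Convert amounts to USD. Rows with unknown currencies are dropped,
    which is exactly the kind of silent loss reconciliation catches."""
    return [
        {**row, "amount_usd": round(row["amount"] * rates[row["currency"]], 2)}
        for row in rows
        if row["currency"] in rates
    ]

def reconcile_counts(n_in, n_out, max_drop_rate=0.0):
    """If 1,000,000 rows enter and 970,000 leave, that needs an explanation."""
    drop_rate = (n_in - n_out) / n_in
    if drop_rate > max_drop_rate:
        raise ValueError(f"{drop_rate:.1%} of rows lost in transformation")

def test_known_input_expected_output():
    out = normalize_amounts([{"amount": 10.0, "currency": "EUR"}], {"EUR": 1.10})
    assert out[0]["amount_usd"] == 11.0

def test_reconciliation_flags_dropped_rows():
    rows = [{"amount": 10.0, "currency": "EUR"}, {"amount": 5.0, "currency": "XYZ"}]
    out = normalize_amounts(rows, {"EUR": 1.10})
    try:
        reconcile_counts(len(rows), len(out))
    except ValueError:
        return  # the 50% drop was caught, as intended
    raise AssertionError("reconciliation should have flagged the dropped row")
```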

Stage 3: Pre-serving (the last gate before your data reaches users)

What breaks here:

Even if your pipeline ran without errors and your transformations are logically correct, the output data can still be untrustworthy. Required fields that are null. Duplicate records that inflate metrics. Values that fall outside any plausible range, such as a transaction amount of $0, an age of 847, or a customer ID that does not exist in your master table. None of these crashes anything, but all of them corrupt the outputs your users are making decisions from.

The minimum check:

  • Null checks on required fields

  • Duplicate detection on unique identifiers

  • Range and referential integrity checks on values that have known valid boundaries

  • Critically: these checks should block promotion. If they fail, the data does not move forward

What good coverage looks like:

A quality scan runs as a blocking gate inside the pipeline, after transformation and before serving. Failure stops the pipeline and alerts the team rather than letting bad data reach downstream consumers.
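A minimal blocking gate along these lines might look as follows; the column names, unique key, and valid ranges are illustrative assumptions:

```python
def quality_gate(rows, required, unique_key, ranges):
    """Pre-serving checks: nulls, duplicates, and range violations.

    Raises ValueError on any failure so the pipeline stops and the
    data does not move forward.
    """
    errors = []
    for col in required:
        nulls = sum(1 for row in rows if row.get(col) is None)
        if nulls:
            errors.append(f"{nulls} nulls in required column '{col}'")
    keys = [row[unique_key] for row in rows]
    dupes = len(keys) - len(set(keys))
    if dupes:
        errors.append(f"{dupes} duplicate values in '{unique_key}'")
    for col, (lo, hi) in ranges.items():
        bad = sum(
            1 for row in rows
            if row.get(col) is not None and not lo <= row[col] <= hi
        )
        if bad:
            errors.append(f"{bad} out-of-range values in '{col}'")
    if errors:
        # Blocking behavior: a failed check halts promotion.
        raise ValueError("quality gate failed: " + "; ".join(errors))
```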

Stage 4: Under load (what staging never told you)

What breaks here:

A pipeline that handles 100,000 rows cleanly in staging can behave very differently with 10 million rows and three concurrent jobs running against the same warehouse. Jobs that completed in minutes now breach their SLA. Memory pressure causes silent truncation. A query that was merely slow in development causes full resource exhaustion in production. You find out when a stakeholder notices the dashboard has not refreshed.

The minimum check:

  • Benchmark at realistic volumes before go-live: not on trimmed development datasets, but on anonymized or sampled production data at the scale you actually expect

  • Set alert thresholds for job duration, processing rate, and throughput before you need them, not after the first breach

What good coverage looks like:

Performance baselines are established before a pipeline goes to production and monitored continuously afterward. Degradation trends surface as alerts, not as SLA misses.
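A lightweight way to wire a duration threshold around a job, sketched in Python; the SLA value and alert hook are placeholders for whatever your orchestrator provides (Airflow, for instance, has its own SLA miss handling):

```python
import time

def run_with_sla(job, sla_seconds, alert):
    """Time a pipeline job and fire an alert on an SLA breach.

    `job` is any zero-argument callable; `alert` receives a message string.
    The job's result is returned either way: this monitors, it does not block.
    """
    start = time.monotonic()
    result = job()
    elapsed = time.monotonic() - start
    if elapsed > sla_seconds:
        alert(f"job breached SLA: {elapsed:.2f}s > {sla_seconds}s")
    return result
```

The point of setting `sla_seconds` before go-live is that the threshold encodes the baseline you benchmarked, so degradation surfaces as a trend rather than a surprise.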

Test Matrix for Data Pipelines

The four stages above tell you where to focus and what to check. The matrix below shows how those stages translate into specific test types, when each runs, and who owns it. Use it as a reference when building out or reviewing your coverage.

| Test Type | What It Catches | Pipeline Stage | When It Runs | Typical Tooling | Owned By |
|---|---|---|---|---|---|
| Unit testing | Broken transformation logic, regressions | Pre-ingestion / model build | On code change (CI) | pytest, dbt | Data Engineer |
| Schema validation | Unexpected column additions, type changes, renames | Ingestion / source layer | On arrival or model run | dbt, Soda | Data Engineer |
| Integration testing | Mismatches between pipeline stages, broken handoffs | Between layers (staging → mart) | On pipeline run | dbt tests | Data Engineer |
| Data quality validation | Nulls, duplicates, out-of-range values, referential integrity violations | Post-load / pre-serving | Scheduled or triggered post-run | Soda | Data Engineer / Steward |
| Reconciliation testing | Volume drops, row count drift, aggregation mismatches vs source | Cross-system / end-to-end | After full pipeline run | Soda, custom SQL | Data Engineer |
| Performance monitoring | SLA misses, slow jobs, resource spikes | Orchestration layer | Continuous / per-run | Airflow, SparkUI | Platform / Infra |

Without a matrix, testing becomes inconsistent and ad hoc. Standardizing pipeline testing is a foundational step toward building trusted data products that downstream consumers can rely on. (We discuss this in: 5 Best Practices to Build Trusted Data Products.)

Key Tools for Data Pipeline Testing

Knowing what to test is half the problem. The other half is making sure those checks actually run, every time, without someone manually triggering them. Here is how the test types covered earlier map to specific tools.

Unit Testing – pytest and dbt

  • Validate transformation logic

  • Catch regressions on code change

  • Run in CI before deployment

dbt tests validate structural and value constraints at model build time, making them a natural fit for unit-level checks on transformation logic.

Integration testing: dbt tests

  • Verify data across multiple pipeline stages

  • Ensure outputs match expectations

  • Catch inconsistencies before production

dbt's built-in test framework can also serve as a lightweight integration layer, validating that data moving between models meets defined expectations at each handoff point.

Performance testing: Spark profiling and Airflow SLA monitoring

  • Monitor job durations

  • Detect slowdowns before SLA impact

  • Optimize pipeline efficiency

SparkUI profiling surfaces bottlenecks in processing jobs, while Airflow's SLA miss detection flags delays at the orchestration level, giving teams early warning before downstream consumers are affected.

Data pipeline validation: Soda

  • Define “good” data with human-readable checks

  • Automatically alert teams on failures

  • Stop pipelines to prevent bad data from reaching production

SodaCL checks run inside pipelines and integrate with orchestration and warehouse tools like Airflow, Dagster, Prefect, Snowflake, and Databricks.
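For illustration, a small SodaCL file covering the check types above might look like this; the `orders` dataset and its column names are assumptions for the example:

```yaml
checks for orders:
  - row_count > 0                    # something actually arrived
  - missing_count(customer_id) = 0   # required field is populated
  - duplicate_count(order_id) = 0    # unique identifier stays unique
  - freshness(created_at) < 1d       # delivered within the expected window
```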

The right tooling makes testing a default behavior rather than a deliberate extra step, and that is the only way it scales across a growing pipeline.

Check our Customer Stories to see how real companies prevent errors at the source rather than repairing damage downstream.

4 Questions That Tell You What to Add Next

The teams that maintain reliable pipelines over time are not the ones that built the most comprehensive test suite on day one, but the ones with a clear, repeatable process for knowing what to add next.

Work through these four questions in order, against your highest-risk pipeline stages first.

1. Is there a stage with no checks at all?

If yes, this is your immediate priority, ahead of everything else. A stage with zero coverage is a stage where any failure is invisible. Go back to the stage-by-stage breakdown above, identify the minimum viable check for that stage, and add it. One check that runs automatically is worth more than ten checks you are planning to write.

2. Do your checks run but not block?

Checks that generate alerts but do not stop bad data from moving forward are better than nothing, but not by much. If a quality scan fails and the pipeline keeps running, downstream consumers still get bad data, they just get it with a notification attached. Identify your most critical checks and promote at least one to a blocking quality gate. If it fails, the pipeline stops.

3. Are you only catching structural problems, not logic problems?

Schema validation and null checks are the foundation, but they will not tell you that your revenue calculation is double-counting or that a join is silently dropping valid records. If your coverage is entirely structural, your next addition is transformation unit tests: known inputs, expected outputs, run automatically on every code change.

4. Have you tested at production volume?

Everything above can be in place and your pipeline can still fail under real conditions if you have only ever run it against development-scale data. Before your next significant volume increase — a new data source, a new customer segment, a step-change in row counts — run a performance benchmark at realistic scale and set alert thresholds before you need them.

Work through these four questions regularly: at the start of a new pipeline build, after a significant schema change, after an incident, or as part of a quarterly pipeline review. Each time, you are looking for the highest-risk gap, not the easiest win.

Building a Complete Data Reliability Strategy

Even with solid coverage in place, there is one gap that testing alone cannot close.

Testing covers what you know. Observability covers what you don't.

Testing enforces expectations you have already defined. When you write a check that says "this column should never be null" or "row count at the destination should match the row count at the source," you are encoding a known requirement and verifying it automatically. Testing is precise, deterministic, and proactive. It catches the failures you anticipated.

Observability detects anomalies you did not anticipate. It monitors the statistical behavior of your data over time and surfaces deviations that fall outside normal patterns — a sudden drop in record volume, a metric distribution that shifts overnight, a high-cardinality column that suddenly shows only three distinct values. These are the failures nobody thought to write a check for, because nobody expected them.
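A toy version of that volume monitoring, using a simple z-score over recent daily row counts; the threshold of 3 standard deviations is a common but arbitrary default, and real observability tools model seasonality and trend far more carefully:

```python
import statistics

def volume_is_anomalous(recent_counts, todays_count, z_threshold=3.0):
    """Flag today's row count if it falls far outside the recent baseline."""
    mean = statistics.mean(recent_counts)
    stdev = statistics.stdev(recent_counts)
    if stdev == 0:
        # A perfectly flat history: any deviation at all is suspicious.
        return todays_count != mean
    return abs(todays_count - mean) / stdev > z_threshold
```

No one wrote a check saying "volume must not drop 60% overnight", yet this kind of baseline comparison surfaces exactly that.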

The failure modes are genuinely different, which means the tools and responses are different too. A test failure tells you exactly what broke and where. An observability alert tells you something changed, and an investigation is required.

A mature data reliability strategy needs both. Testing gives you precision and enforcement. Observability gives you the speed and breadth to catch what you never thought to define. Together, they cover the full surface area of your pipeline.

If you want to go deeper on how testing and observability fit together, check Four Approaches to Data Quality: Manual vs Automated, Observability vs Testing.


Trusted by the world’s leading enterprises

Real stories from companies using Soda to keep their data reliable, accurate, and ready for action.

At the end of the day, we don’t want to be in there managing the checks, updating the checks, adding the checks. We just want to go and observe what’s happening, and that’s what Soda is enabling right now.

Sid Srivastava

Director of Data Governance, Quality and MLOps

Investing in data quality is key for cross-functional teams to make accurate, complete decisions with fewer risks and greater returns, using initiatives such as product thinking, data governance, and self-service platforms.

Mario Konschake

Director of Product-Data Platform

Soda has integrated seamlessly into our technology stack and given us the confidence to find, analyze, implement, and resolve data issues through a simple self-serve capability.

Sutaraj Dutta

Data Engineering Manager

Our goal was to deliver high-quality datasets in near real-time, ensuring dashboards reflect live data as it flows in. But beyond solving technical challenges, we wanted to spark a cultural shift - empowering the entire organization to make decisions grounded in accurate, timely data.

Gu Xie

Head of Data Engineering

