Introducing Soda Cleanse: From Detection to Remediation

Maarten Masschelein

CEO and Founder at Soda

Lauren De bruyn

Product Engineer at Soda

Grzegorz Kaczan

Lead Frontend Engineer at Soda

Data management tools solved automated detection. They never solved automated remediation. And detection without remediation is just debt on a longer timeline.

We've helped teams detect data issues automatically at scale for years. It's high time we helped fix them, too.

Today, we're introducing Soda Cleanse: an agentic data cleansing capability that extends Soda's detection capabilities into automated remediation.

Specialized AI agents analyze the failures Soda finds, generate targeted fix proposals, and route them to data stewards for approval. Nothing changes in your data without a human sign-off.

The steward governs. The agent does the janitorial work.

Soda Cleanse is an add-on to Soda Cloud and is available today in Private Preview.

Request access →

How Soda Cleanse Works

Soda Cleanse is contract-driven, agent-specialized, and human-approved. Those three properties distinguish it from generic AI data cleaning approaches, and from the ad-hoc scripts most teams fall back on today.

The Inbox is Soda Cleanse's center of gravity. It's where data stewards and record owners triage, decide, and move on. It's not a dashboard and not a chatbot. It's a workflow application with an audit trail.

The Issues Inbox. Each row shows the record, affected column, suggested fix, and assigned steward. Expanding a row reveals the AI Reasoning panel (a plain-English explanation with a confidence score) alongside the Row Context fields the agent used to determine the fix.

Cleanse is built on top of Soda Cloud and the Diagnostics Warehouse (DWH). Every failed row from every failing check in a Soda deployment pools in the DWH automatically, with a consistent schema and full scan history. Cleanse plugs into that pool directly. No custom ingestion pipeline, no per-source wiring.

Three things happen on a loop:

1. Ingest

Every failed row from every failing check lands in the Inbox automatically, with its check context, scan history, and prior decisions attached. One row, one Issue — no matter how many checks fired against it.
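The "one row, one Issue" rule can be pictured as a simple grouping step. The sketch below is illustrative only: the `Issue` class, the `group_into_issues` function, and the record fields (`dataset`, `row_key`, `check`) are hypothetical names, not Soda's schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    """One Issue per failing row, listing every check that fired on it."""
    dataset: str
    row_key: str
    checks: list = field(default_factory=list)


def group_into_issues(failed_rows):
    """Collapse failed-row records into one Issue per (dataset, row_key)."""
    issues = {}
    for rec in failed_rows:
        key = (rec["dataset"], rec["row_key"])
        issue = issues.setdefault(key, Issue(rec["dataset"], rec["row_key"]))
        if rec["check"] not in issue.checks:
            issue.checks.append(rec["check"])
    return list(issues.values())


# Hypothetical failed-row records: row 42 failed two checks, row 77 failed one.
failed_rows = [
    {"dataset": "customers", "row_key": "42", "check": "country_is_valid"},
    {"dataset": "customers", "row_key": "42", "check": "email_not_null"},
    {"dataset": "customers", "row_key": "77", "check": "email_not_null"},
]
issues = group_into_issues(failed_rows)
```

Here three failed-row records become two Issues, and the Issue for row 42 carries both check names.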

2. Propose

The data contract declares how each failure should be fixed, picking the right tool for the problem: a safe default, a lookup, an AI-assisted suggestion, or a human call. AI is the last resort, not the first.
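One way to read "AI is the last resort" is as an ordered fallback chain: try the cheapest deterministic strategy first and escalate only when it declines. The sketch below assumes that reading. Every name in it (`Proposal`, `try_safe_default`, `try_lookup`, `try_ai_suggestion`, `propose`) is hypothetical, and the AI step is a stub rather than a real agent call.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    value: object
    strategy: str


def try_safe_default(column, value):
    # Assumed contract-declared defaults for missing values.
    defaults = {"newsletter_opt_in": False}
    if value is None and column in defaults:
        return Proposal(defaults[column], "safe_default")
    return None


def try_lookup(column, value):
    # Assumed reference table mapping known bad values to good ones.
    table = {"country": {"U.S.A.": "USA", "United States": "USA"}}
    fixed = table.get(column, {}).get(value)
    return Proposal(fixed, "lookup") if fixed is not None else None


def try_ai_suggestion(column, value):
    # Stub: a real agent call would go here. It always declines in this sketch.
    return None


def propose(column, value):
    """Return the first proposal from the chain, else hand off to a human."""
    for strategy in (try_safe_default, try_lookup, try_ai_suggestion):
        proposal = strategy(column, value)
        if proposal is not None:
            return proposal
    return Proposal(None, "human")
```

In this sketch `propose("country", "U.S.A.")` resolves through the lookup, while a value no strategy recognizes falls through to the `"human"` outcome.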

3. Apply

Approved fixes reach the source through an audited writer, or stay in a staging table if your team isn't ready to grant write access yet. Nothing changes in your data without a steward signing off, and every decision lands in the audit trail.
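The approval gate and the choice between staging and source can be sketched as follows. This is a minimal illustration using in-memory dictionaries as stand-ins for tables. The function `apply_fix`, its parameters, and the audit entry fields are assumptions made for the example, not the product's writer interface.

```python
from datetime import datetime, timezone


def apply_fix(fix, approved_by, write_to_source, source, staging, audit_log):
    """Apply an approved fix to source or staging and record the decision."""
    if not approved_by:
        # No steward sign-off means no write, anywhere.
        raise PermissionError("a steward must approve the fix before it is applied")

    target_name = "source" if write_to_source else "staging"
    target = source if write_to_source else staging

    # Write only the corrected column for the affected row.
    row = target.setdefault(fix["row_key"], {})
    row[fix["column"]] = fix["value"]

    audit_log.append({
        "row_key": fix["row_key"],
        "column": fix["column"],
        "value": fix["value"],
        "approved_by": approved_by,
        "target": target_name,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return target_name


# Hypothetical data: the team has not granted write access to source yet.
source = {"42": {"country": "U.S.A."}}
staging = {}
audit_log = []
fix = {"row_key": "42", "column": "country", "value": "USA"}
written_to = apply_fix(fix, "steward@example.com", False, source, staging, audit_log)
```

With `write_to_source` set to `False`, the corrected value lands in `staging` and the source row is left untouched, while the audit entry still records who approved it.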

Soda Cleanse closes the loop on data quality. Contracts define correct. Agents fix what isn't. Stewards govern the outcome.

💡 For a deeper look at the evolution from manual scripts to agentic cleansing, read our Guide to Modern Data Cleansing.

What contract-driven cleansing gets you

The remediation strategies Cleanse runs live inside the same data contract the Soda customer is already maintaining. A customer who adds a new check gets the remediation slot for free: they fill it in when they're ready, and Cleanse picks it up automatically.

Four outcomes fall out of that:

  1. One artifact, one workflow. Detection and remediation live inside the same data contract, so they can't drift apart. Every fix traces back to the rule that validates the data.

  2. A safe on-ramp for risk-averse buyers. No team needs to grant production write access on day one to get value out of Cleanse. That conversation can wait until you're ready for it.

  3. Interpretable-first golden record selection. Advanced, interpretable heuristics pick the most likely correct record, and LLMs step in only to resolve ambiguous cases. Every merge decision is traceable back to the signals that drove it — defensible in an audit, without the black-box tradeoff.

  4. Governed today, autonomous tomorrow. Every decision — proposal, approval, write, rejection — lands in an immutable audit log. When an auditor asks "who changed this record and why," the answer is already there. And the architecture that supervises today's workflow is the same one that scales toward fuller autonomy on your terms.
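To make the idea of an immutable audit log concrete, here is one common technique: a hash-chained, append-only log in which each entry commits to the one before it, so any later edit is detectable. This is a general pattern offered as an assumption. The post does not say how Soda implements its audit log, and the names `append_entry` and `verify` are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _digest(event, prev_hash):
    payload = json.dumps({"event": event, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log, event):
    """Append an event whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"event": event, "prev": prev_hash, "hash": _digest(event, prev_hash)}
    log.append(entry)
    return entry


def verify(log):
    """Return True only if no entry was altered, removed, or reordered."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(entry["event"], prev_hash):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_entry(log, {"action": "proposed", "row_key": "42", "by": "agent"})
append_entry(log, {"action": "approved", "row_key": "42", "by": "steward"})
intact = verify(log)
```

Changing any field of an earlier entry after the fact makes `verify` return `False`, which is the property an auditor cares about.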

Soda Cloud finds it. Cleanse fixes it. One contract, one workflow, one audit trail.

Soda Cleanse requires Soda 4.0 and the Diagnostics Warehouse (Enterprise plan).

If you're not yet on Soda 4.0, talk to our team to understand how to get started.

What Soda Cleanse Can Fix

Soda Cleanse ships with specialized agents for four failure types. Each is built for the reasoning that type requires.

Entity normalization

Variant names for the same entity ("USA", "U.S.A.", "United States", "United States of America") break joins and inflate counts. The normalization agent derives the canonical form from surrounding data and contract context, then proposes it for approval.
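For intuition, a deterministic baseline for this kind of normalization is an alias table keyed on a punctuation-free, case-folded form of the value. The sketch below shows only that baseline. It is not the agent's method, and the alias table, the canonical spelling chosen, and the function names are assumptions for the example.

```python
import re

# Assumed alias table: normalized key -> canonical form chosen by the contract.
ALIASES = {
    "usa": "United States",
    "us": "United States",
    "unitedstates": "United States",
    "unitedstatesofamerica": "United States",
}


def normalize_key(raw):
    """Case-fold and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", raw.casefold())


def canonical_country(raw):
    """Return the canonical form, or None when the variant is unknown."""
    return ALIASES.get(normalize_key(raw))


variants = ["USA", "U.S.A.", "United States", "United States of America"]
canonical = {canonical_country(v) for v in variants}
```

All four variants from the example above collapse to a single canonical value, and an unknown spelling returns `None` so it can be escalated instead of guessed.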

Imputation

Missing values slip through schemas because NULL is often technically valid. The imputation agent reasons from the contract's definition of the field and the surrounding data to propose a fitting value, which the steward can accept, edit, or reject.
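A simple, interpretable stand-in for "reasoning from surrounding data" is to propose the most common value among rows that share a context field, and to report how strongly the peers agree. The sketch below assumes that approach. The function `propose_imputation` and the `support` figure are hypothetical, and `support` is not the product's confidence score.

```python
from collections import Counter


def propose_imputation(rows, target, context, row):
    """Propose the most common `target` value among rows sharing `context`."""
    peers = [
        r[target]
        for r in rows
        if r.get(target) is not None and r.get(context) == row.get(context)
    ]
    if not peers:
        return None  # nothing to reason from: leave it to the steward
    value, count = Counter(peers).most_common(1)[0]
    return {"value": value, "support": count / len(peers)}


# Hypothetical rows: three Ghent records with a country, one Lyon record.
rows = [
    {"city": "Ghent", "country": "BE"},
    {"city": "Ghent", "country": "BE"},
    {"city": "Ghent", "country": "NL"},
    {"city": "Lyon", "country": "FR"},
]
proposal = propose_imputation(rows, "country", "city", {"city": "Ghent", "country": None})
```

A low `support` value is a natural signal to route the row to a steward without a pre-filled suggestion.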

Deduplication

Duplicates are resolved with advanced, interpretable heuristics that identify the most likely correct record, with LLMs stepping in only to resolve ambiguous cases. Merge candidates surface in the Inbox with full evidence of why each record was picked.
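As a toy illustration of an interpretable heuristic, the sketch below ranks duplicate candidates by how many fields are filled in, keeps the per-record scores as evidence, and flags a tie as ambiguous so it can be escalated. The scoring rule and the names `completeness` and `pick_golden` are assumptions for the example. The heuristics Cleanse actually uses are not described in this post.

```python
def completeness(record, fields):
    """Count how many of the given fields carry a usable value."""
    return sum(1 for f in fields if record.get(f) not in (None, ""))


def pick_golden(records, fields):
    """Pick the most complete record, or flag the group as ambiguous on a tie."""
    scored = sorted(
        ((completeness(r, fields), r) for r in records),
        key=lambda pair: pair[0],
        reverse=True,
    )
    best_score, best = scored[0]
    ambiguous = len(scored) > 1 and scored[1][0] == best_score
    return {
        "golden": None if ambiguous else best,
        "ambiguous": ambiguous,
        "evidence": {r["id"]: s for s, r in scored},
    }


# Hypothetical duplicate group for the same customer.
fields = ["name", "email", "phone"]
records = [
    {"id": "a", "name": "Ann Lee", "email": "ann@example.com", "phone": None},
    {"id": "b", "name": "Ann Lee", "email": "ann@example.com", "phone": "555-0100"},
]
result = pick_golden(records, fields)
```

The `evidence` mapping is what makes the decision traceable: it records the score each candidate received, not just the winner.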

Reconciliation

When the same entity carries different values across sources, or drifts from a trusted reference dataset, the reconciliation agent identifies the mismatch and proposes a correction consistent with the contract's source of truth.
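At its simplest, reconciliation against a trusted reference is a keyed comparison that turns each mismatch into a proposal. The sketch below shows that comparison only. The function `reconcile` and the proposal fields are hypothetical names, and the reference dataset is assumed to be the contract's source of truth.

```python
def reconcile(source_rows, reference_rows, key, column):
    """Propose the reference value wherever the source disagrees with it."""
    reference = {r[key]: r[column] for r in reference_rows}
    proposals = []
    for row in source_rows:
        expected = reference.get(row[key])
        if expected is not None and row[column] != expected:
            proposals.append({
                "key": row[key],
                "column": column,
                "current": row[column],
                "proposed": expected,
            })
    return proposals


# Hypothetical data: a CRM export checked against a trusted billing reference.
crm = [
    {"customer_id": 1, "vat_number": "BE0123456789"},
    {"customer_id": 2, "vat_number": "BE0000000000"},
    {"customer_id": 3, "vat_number": "BE0555555555"},
]
billing = [
    {"customer_id": 1, "vat_number": "BE0123456789"},
    {"customer_id": 2, "vat_number": "BE0987654321"},
]
proposals = reconcile(crm, billing, "customer_id", "vat_number")
```

Rows with no counterpart in the reference (customer 3 here) produce no proposal, since there is nothing trusted to correct them against.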

How To Get Started

Soda Cleanse is in Private Preview. Access is limited intentionally: we want to work closely with early teams to make sure the agents produce proposals worth approving, and that the steward workflow fits how governance actually runs.

Already on Soda 4.0 with data contracts in place?

  • If the Diagnostics Warehouse is already running, no new infrastructure is required

  • Soda Cleanse runs in Kubernetes alongside your existing Soda runner

  • Teams in early access have reached their first agent proposal in days

  • Request access →

Not yet on Soda 4.0?

  • Install Soda Core 4.0 and explore the documentation.

  • If you are on v3.0, reach out to your customer engineer to plan how to migrate to v4.0.

  • Get one data contract in place for one dataset.

  • Pilot Cleanse on that dataset, then expand from there.

Either way, start narrow: pick the failure type causing the most manual cleanup work on your team, pick one dataset, and let the agent run. Expansion follows naturally from there.

What's Next

Observability and Contracts were the first half: knowing when something is wrong.

Cleanse is the second: resolving it on the same platform, under the same contract, with no exports, tickets, or handoffs in between.

Stewards stop fixing records and start governing the process — approving what the agents propose, rejecting what doesn't hold up, and letting the audit trail do the rest.

Private Preview is the beginning. We're using early access to make sure agents produce proposals that are accurate enough to approve quickly, and to understand how different governance setups affect the steward review workflow.

As agents accumulate approval history specific to your organization's data standards, the review queue gets shorter. The goal isn't just automated remediation. It's remediation that gets smarter over time, without requiring your team to maintain it.

An MCP endpoint is on the roadmap. It will let external agents participate in the remediation workflow directly: driving triage, proposing fixes, and eventually closing the loop without a human for the fix types that have earned it.

If you're already working with Soda and have thoughts on which failure types matter most for your pipelines, join the conversation in Soda Community Slack. That feedback shapes what we build next.

Frequently asked questions

How long does it take to deploy Soda Cleanse?

For teams already on Soda 4.0 (https://soda.io/blog/introducing-soda-4.0) with data contracts in place, activation can be immediate with no new infrastructure required. For teams starting from scratch, deploying a first contract and running a pilot on one dataset typically takes days to weeks.

Does Soda Cleanse require Soda 4.0?

Yes. Soda Cleanse imports failed records from the Diagnostics Warehouse — a Soda 4.0 Enterprise feature.

What types of data issues can agentic cleansing fix?

Specialized agents handle normalization, imputation, deduplication, and reconciliation. Each failure type gets a purpose-built agent, not a general-purpose model.

Does my data leave my environment?

No. Soda Cleanse is deployed inside your own environment. Neither the application nor the data leaves your boundary.

How is Soda Cleanse different from rule-based remediation platforms?

Rule-based platforms keep detection and remediation in separate systems, so the two can drift apart, and such platforms can take months to deploy and configure. Soda Cleanse is contract-driven: fix strategies are declared inside the same data contract that validates your data, so every fix traces back to the rule it satisfies, and teams reach their first fix in days.

Trusted by the world’s leading enterprises

Real stories from companies using Soda to keep their data reliable, accurate, and ready for action.

At the end of the day, we don’t want to be in there managing the checks, updating the checks, adding the checks. We just want to go and observe what’s happening, and that’s what Soda is enabling right now.

Sid Srivastava

Director of Data Governance, Quality and MLOps

Investing in data quality is key for cross-functional teams to make accurate, complete decisions with fewer risks and greater returns, using initiatives such as product thinking, data governance, and self-service platforms.

Mario Konschake

Director of Product-Data Platform

Soda has integrated seamlessly into our technology stack and given us the confidence to find, analyze, implement, and resolve data issues through a simple self-serve capability.

Sutaraj Dutta

Data Engineering Manager

Our goal was to deliver high-quality datasets in near real-time, ensuring dashboards reflect live data as it flows in. But beyond solving technical challenges, we wanted to spark a cultural shift - empowering the entire organization to make decisions grounded in accurate, timely data.

Gu Xie

Head of Data Engineering

4.4 of 5

Your data has problems.
Now they fix themselves.

Automated data quality, remediation, and management.

One platform, agents that do the work, you approve.
