Data Contract Examples: 4 Templates You Can Use Today

March 18, 2026

Fabiana Ferraz

Technical Writer at Soda

Data contracts formalize expectations between data producers and data consumers, ensuring consistent structure, measurable quality, and clear governance as data moves across systems. Instead of relying on assumptions, teams define explicit, testable standards for schema, transformations, ownership, and service levels.

The impact on business decisions can be costly and widespread. Gartner estimates poor data quality costs organizations an average of $12.9 million per year. Clear contracts reduce that risk by shifting quality from reactive troubleshooting to proactive control.

This guide provides four practical data contract templates you can adapt to your environment.

Each template is designed to be customized for common, real-world use cases, helping you operationalize data governance.

What is a Data Contract?

A data contract is a formal agreement between the people who produce data and the people who consume it. It defines what a dataset should look like, which columns must exist, what types they should be, what values are valid, how fresh the data needs to be, and who is accountable for maintaining data quality.

The keyword is enforceable. A data contract that lives in a Confluence doc and never gets checked is just documentation. A true data contract runs as executable checks every time new data arrives. If the data meets the contract, it moves forward. If it doesn't, it stops.
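To make the gate idea concrete, here is a minimal sketch of contract-as-gate semantics in plain Python. The check names and contract structure are illustrative assumptions, not Soda's actual API; a real engine evaluates far richer rules.

```python
# Illustrative sketch only: a contract as a dict of per-column rules,
# evaluated as a gate. Not Soda's API.

def verify(rows, contract):
    """Return a list of failed check names for a batch of rows."""
    failures = []
    for column, rules in contract.items():
        values = [row.get(column) for row in rows]
        if rules.get("required") and any(v is None for v in values):
            failures.append(f"{column}: missing values")
        if rules.get("unique") and len(set(values)) != len(values):
            failures.append(f"{column}: duplicate values")
    return failures

contract = {"customer_id": {"required": True, "unique": True}}
good = [{"customer_id": "a1"}, {"customer_id": "a2"}]
bad = [{"customer_id": "a1"}, {"customer_id": None}]

assert verify(good, contract) == []   # data moves forward
assert verify(bad, contract) != []    # pipeline stops here
```

The point is the control flow: verification returns a pass/fail result that the pipeline acts on, rather than a report someone reads later.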

Why Data Contracts Are Essential for Data Teams

There's a reason data contracts have become a recurring topic in data engineering communities over the last two years. The old approach to data quality, which fixes issues only after they appear in production, does not scale.

An analysis of over 1,000 data pipelines found that 72% of data quality issues are discovered only after they've already affected business decisions. By the time someone notices the dashboard looks off, the invalid data has already powered a report, trained a model, or informed a decision.

Data contracts shift quality checks upstream. Teams define expectations in advance and validate data against them as it moves through the pipeline. These are formal agreements between producers, who generate and manage data, and consumers, who rely on it, and they work like APIs in software: data must follow an agreed format and meet defined quality standards.

The practical result is clearer ownership, because when you create data contracts, you explicitly define who is responsible and speed up resolution when something breaks. Teams experience fewer surprise failures, since checks run continuously rather than only after someone reports an issue. And governance becomes defensible, with data quality rules defined as version-controlled code instead of informal assumptions that live in someone’s head.

Learn how data contracts turn data standards into enforceable rules, closing the gap between governance and execution, in our Definitive Guide to Data Contracts.

4 Data Contract Templates You Can Use Today

Below are four ready-to-use templates based on common scenarios: shared datasets, transformations, schema stability, and data integrity. Each follows a production-ready YAML structure that you can adapt to your environment. For more starting points, browse Soda's template library.

Template 1: Basic Data Contract Template

If your team is writing its first data contract, start here. It covers the essentials: schema validation, a row count check, and a completeness (missing) check. Use this for any shared dataset where multiple teams need a lightweight, reliable baseline.

dataset: datasource/db/public/customers

checks:
  - schema: null
  - row_count:
      threshold:
        must_be_greater_than: 0

columns:
  - name: customer_id
    data_type: VARCHAR
    checks:
      - missing:
          threshold:
            must_be: 0
      - duplicate: null
  - name: email
    data_type: VARCHAR
    checks:
      - missing:
          threshold:
            metric: percent
            must_be_less_than: 1
  - name: created_at
    data_type: TIMESTAMP
    checks:
      - missing:
          threshold:
            must_be: 0

What this covers: Schema presence, row count, required timestamps, checks to prevent duplicate or missing IDs, and a low-tolerance rule for missing emails provide a solid baseline for any customer-facing dataset.
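To show what these checks mean when evaluated, here is a hedged Python sketch of Template 1's semantics applied to sample rows. The threshold interpretation is my reading of the YAML, not Soda's engine.

```python
# Illustrative semantics of Template 1's checks, evaluated over sample rows.

def missing_count(values):
    return sum(1 for v in values if v is None)

def missing_percent(values):
    return 100.0 * missing_count(values) / len(values)

def has_duplicates(values):
    return len(values) != len(set(values))

rows = [
    {"customer_id": "c1", "email": "a@x.com", "created_at": "2026-03-01"},
    {"customer_id": "c2", "email": None,      "created_at": "2026-03-02"},
]
ids = [r["customer_id"] for r in rows]
emails = [r["email"] for r in rows]

assert len(rows) > 0              # row_count must_be_greater_than: 0
assert missing_count(ids) == 0    # customer_id missing must_be: 0
assert not has_duplicates(ids)    # customer_id duplicate check
# email missing percent is 50 in this sample, so the < 1% threshold would fail:
assert missing_percent(emails) >= 1
```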

Template 2: Transformation Data Contract

Transformations are one of the most common places where things go quietly wrong. A column gets renamed, a join drops rows, or a recalculation shifts a metric's range, and nobody finds out until a stakeholder asks a question. This contract template wraps a transformation layer and validates output from a dbt model, Spark job, or SQL transformation before it moves downstream.

dataset: datasource/analytics/public/orders_transformed

checks:
  - schema: null
  - row_count:
      threshold:
        must_be_greater_than: 1000
  - freshness:
      column: transformed_at
      threshold:
        unit: hour
        must_be_less_than: 4

columns:
  - name: order_id
    data_type: VARCHAR
    checks:
      - missing:
          threshold:
            must_be: 0
      - duplicate: null
  - name: order_total
    data_type: NUMERIC
    checks:
      - invalid:
          valid_min: 0
          valid_max: 50000
      - missing:
          threshold:
            must_be: 0
  - name: order_status
    data_type: VARCHAR
    checks:
      - invalid:
          valid_values: ['pending', 'confirmed', 'shipped', 'delivered', 'cancelled']

What this covers: Freshness of the transformed output, minimum row count as a sanity check, no missing or duplicate order IDs, valid order total range, and a list of valid values for order status.
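The freshness check is the least intuitive rule here, so a small sketch of its intent: the newest `transformed_at` timestamp must be under four hours old. This is a simplified interpretation for illustration, not Soda's exact freshness metric.

```python
# Simplified freshness semantics: newest timestamp must be younger than
# the allowed age. Illustrative only.
from datetime import datetime, timedelta, timezone

def freshness_ok(timestamps, max_age_hours):
    newest = max(timestamps)
    age = datetime.now(timezone.utc) - newest
    return age < timedelta(hours=max_age_hours)

now = datetime.now(timezone.utc)
fresh = [now - timedelta(hours=1), now - timedelta(hours=2)]
stale = [now - timedelta(hours=6)]

assert freshness_ok(fresh, 4) is True
assert freshness_ok(stale, 4) is False
```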

Template 3: Schema Validation Contract

Schema drift is one of the most common causes of pipeline failures. A data producer team renames a field or changes a data type, and suddenly, a downstream consumer is reading NULL where they expected a string. This data contract example of schema enforcement is especially useful in environments where multiple producers write to shared datasets.

dataset: datasource/db/public/product_catalog

checks:
  - schema:
      allow_extra_columns: false
      allow_other_column_order: false

columns:
  - name: product_id
    data_type: VARCHAR
    optional: false
  - name: product_name
    data_type: VARCHAR
    optional: false
  - name: category
    data_type: VARCHAR
    optional: false
  - name: price
    data_type: NUMERIC
    optional: false
  - name: stock_quantity
    data_type: INTEGER
    optional: false
  - name: last_updated
    data_type: TIMESTAMP
    optional: true

What this covers: Required columns and data types are strictly enforced, while optional columns are clearly marked. Any unexpected schema change fails verification, giving downstream consumers a stability guarantee.

Template 4: Data Integrity Contract

A dataset can pass a schema check and still contain values that make no business sense, such as negative quantities, future-dated records, or impossible ranges. This contract template focuses on data integrity, ensuring the values themselves are trustworthy, not just the structure around them. Valid values checks are particularly important for any data feeding financial reporting or compliance workflows.

dataset: datasource/finance/public/transactions

checks:
  - row_count:
      threshold:
        must_be_between:
          greater_than: 10000
          less_than: 5000000
  - freshness:
      column: transaction_date
      threshold:
        unit: hour
        must_be_less_than: 2

columns:
  - name: transaction_id
    data_type: VARCHAR
    checks:
      - missing:
          threshold:
            must_be: 0
      - duplicate: null
  - name: amount
    data_type: NUMERIC
    checks:
      - invalid:
          valid_min: 0.01
          valid_max: 999999.99
      - missing:
          threshold:
            must_be: 0
  - name: currency_code
    data_type: VARCHAR
    checks:
      - invalid:
          valid_values: ['USD', 'EUR', 'GBP', 'CAD', 'AUD']
      - missing:
          threshold:
            must_be: 0
  - name: transaction_date
    data_type: TIMESTAMP
    checks:
      - missing:
          threshold:
            must_be: 0

What this covers: Amount ranges that reflect real-world transaction limits, a controlled currency list, freshness requirements, and zero tolerance for missing or duplicate transaction IDs.

How to Customize Data Contract Templates

The templates above are intentionally generic. The value of a contract template comes from how well it reflects your actual data. A few things that make the biggest difference:

  • Start with the dataset you trust the least. Most teams have at least one pipeline that everyone quietly worries about. That's where a data contract will have the most immediate impact. If you're not sure where to begin, creating a data contract for your highest-traffic shared dataset is usually the right first move.

  • Calibrate thresholds against real data. Before setting a range or a valid_min, run a quick historical query to understand what normal actually looks like. Too-tight thresholds create constant false alarms; too-loose thresholds won't catch real problems.

  • Use version control from day one. Store contracts in the same Git repository as your pipeline code. Any change goes through a pull request and stays visible to everyone who depends on that dataset.
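The calibration step above can be sketched as a small helper: derive a valid range from observed history with some headroom, rather than guessing. The padding factor is an assumption to tune per dataset, and the sample values are hypothetical.

```python
# Calibrate valid_min / valid_max from historical values with headroom,
# instead of inventing thresholds. Padding factor is a tunable assumption.

def suggest_bounds(history, pad=0.20):
    """Suggest a valid range with `pad` headroom around the observed spread."""
    lo, hi = min(history), max(history)
    spread = hi - lo
    return (lo - pad * spread, hi + pad * spread)

# Sample historical order totals pulled from a quick query (hypothetical).
order_totals = [12.0, 48.5, 99.9, 150.0, 320.0]
valid_min, valid_max = suggest_bounds(order_totals)
assert valid_min < min(order_totals) < max(order_totals) < valid_max
```

Start wide, observe false-alarm rates, then tighten; the reverse order trains people to ignore alerts.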

Best Practices for Managing Data Contracts

Most data teams don't fail at data contracts because they missed a policy. They fail because contracts weren't maintained, weren't enforced, or weren't owned by anyone. Following solid data contract best practices from the start makes the difference between contracts that protect your pipelines and contracts that collect dust.

  • Treat contracts as living documents. Datasets evolve, with new columns added and business logic changing. Build a regular review into your team's workflow, even once a quarter, to make sure contracts still reflect reality. The goal is safe change, not no change.

  • Enforce contracts in your CI/CD pipeline. Run contract verification automatically on every pull request and every new data load. Soda's documentation recommends verifying contracts on new data as soon as it is produced, so unvalidated data has minimal exposure to downstream systems. If a check fails in CI, the data doesn't move forward.

  • Write contracts collaboratively. The producer team knows the data; the consumer team knows what they need from it. A contract written by just one side tends to miss something important — the conversation that happens while writing it is often as valuable as the data contract implementation itself.

  • Connect contracts to your broader data stack. Good data contracts management means linking contracts to your data catalog for discoverability and publishing results to a tool like Soda Cloud, giving your team a central view of what is passing, what is failing, and what needs attention across all datasets and not just the ones someone happens to be monitoring.
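As a rough sketch, CI enforcement might look like the following GitHub Actions fragment. The `soda contract verify` invocation, its flags, and the contract path are placeholders to adapt to your setup, not exact CLI documentation:

```yaml
# Hypothetical CI workflow: verify contracts on every pull request.
# Command name, flags, and paths are placeholders; consult Soda's CLI
# docs for the exact invocation in your version.
name: verify-data-contracts
on: [pull_request]
jobs:
  contracts:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install soda-core  # package name may differ per data source
      - run: soda contract verify --contract contracts/customers.yml
```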

How Data Contracts Fit into Data Governance

Data contracts are one component of a broader governance strategy, but they operate where governance often breaks down: the execution layer. Many frameworks define policies, ownership models, and data classifications. Those structures are valuable, but policies alone do not prevent a broken pipeline from delivering inaccurate data downstream.

Most organizations already define data standards. The challenge is that these standards often live in static documents or catalog descriptions. They describe what data should look like, but they are not enforced in pipeline execution. Data contracts — specifically, formal data contract agreements encoded as executable checks — close that gap by making expectations testable, not just documented.

In practice, contracts are most effective when integrated with the rest of the data stack. Linking them to your data catalog improves discoverability. Connecting them to orchestration ensures failures trigger alerts. Publishing results to Soda Cloud provides a centralized view of contract health across datasets.

Testing and observability reinforce this model. Testing validates contract rules during development and execution, while observability monitors adherence in production. Together, they provide coverage across the full data lifecycle.

Ready to Put These Templates to Work?

The four templates above provide a practical starting point for common scenarios, including shared datasets, transformation validation, schema stability, and data integrity. Each can be adapted to your environment and integrated into existing pipelines.

Moving from static YAML definitions to enforced, runtime contracts requires automation. With automated checks, alerting, and a shared view of data quality across teams, contracts become operational controls rather than documentation. That’s what Soda is built for: turning defined expectations into continuously enforced standards across every pipeline your team runs.

Frequently Asked Questions

How do I enforce data contract compliance?

Make the contract executable. Tools like Soda let you define your contract in YAML and run automated verification against your actual data as part of your pipeline. If the data meets the contract, it moves forward. If it doesn't, the pipeline stops and an alert goes out. The goal is to remove manual oversight from the equation, because manual oversight tends to work right up until the moment it doesn't.

What if data contracts change over time?

They will change, and that is expected. Data contracts should evolve alongside the datasets they govern. The key is to treat contracts the same way you treat pipeline code: store them in version control, update them through pull requests, and involve the teams who depend on the dataset in the review process. This keeps expectations aligned with how the data is actually produced and consumed. What you want to avoid is a contract that drifts out of sync with reality. When that happens, it creates false confidence instead of reliable guarantees.

How do data contracts work in an automated pipeline?

A data contract operates at defined checkpoints in the data flow, typically after ingestion and again after transformation. Each time data moves through these stages, contract verification runs automatically against the defined expectations, including schema, completeness, value ranges, and freshness. If a check fails, the pipeline can stop, route the data to a quarantine layer, or trigger an alert to the owning team. This ensures issues are identified and addressed close to the source, before inaccurate data reaches dashboards, models, or business decisions downstream.
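The checkpoint behavior described above, stop, quarantine, or alert, can be sketched as a small control-flow function. The function and callback names are illustrative, not a specific framework's API.

```python
# Illustrative checkpoint: on failure, route the batch to quarantine and
# alert the owning team instead of promoting it downstream.

def run_checkpoint(batch, checks, promote, quarantine, alert):
    failures = [name for name, check in checks.items() if not check(batch)]
    if failures:
        quarantine(batch)
        alert(failures)
        return False
    promote(batch)
    return True

promoted, quarantined, alerts = [], [], []
checks = {
    "non_empty": lambda b: len(b) > 0,
    "no_null_ids": lambda b: all(r.get("id") is not None for r in b),
}

ok = run_checkpoint([{"id": 1}], checks,
                    promoted.append, quarantined.append, alerts.append)
bad = run_checkpoint([{"id": None}], checks,
                     promoted.append, quarantined.append, alerts.append)

assert ok and not bad
assert len(promoted) == 1 and len(quarantined) == 1
```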

Data contracts formalize expectations between data producers and data consumers, ensuring consistent structure, measurable quality, and clear governance as data moves across systems. Instead of relying on assumptions, teams define explicit, testable standards for schema, transformations, ownership, and service levels.

The impact on business decisions can be costly and widespread. Gartner estimates poor data quality costs organizations an average of $12.9 million per year. Clear contracts reduce that risk by shifting quality from reactive troubleshooting to proactive control.

This guide provides five practical data contract templates you can adapt to your environment:

Each template is designed to be customized for common, real-world use cases, helping you operationalize data governance.

What is a Data Contract?

A data contract is a formal agreement between the people who produce data and the people who consume it. It defines what a dataset should look like, which columns must exist, what types they should be, what values are valid, how fresh the data needs to be, and who is accountable for maintaining data quality.

The keyword is enforceable. A data contract that lives in a Confluence doc and never gets checked is just documentation. A true data contract runs as executable checks every time new data arrives. If the data meets the contract, it moves forward. If it doesn't, it stops.

Why Data Contracts Are Essential for Data Teams

There's a reason data contracts have become a recurring topic in data engineering communities over the last two years. The old approach to data quality, which fixes issues only after they appear in production, does not scale.

An analysis of over 1,000 data pipelines found that 72% of data quality issues are discovered only after they've already affected business decisions. By the time someone notices the dashboard looks off, the invalid data has already powered a report, trained a model, or informed a decision.

Data contracts shift quality checks upstream. Teams define expectations in advance and validate data against them as it moves through the pipeline. These are formal data contract agreements between producers, who generate and manage data, and consumers, who rely on it, working like APIs in software to ensure data follows a fixed format and data quality rules and standards.

The practical result is clearer ownership, because when you create data contracts, you explicitly define who is responsible and speed up resolution when something breaks. Teams experience fewer surprise failures, since checks run continuously rather than only after someone reports an issue. And governance becomes defensible, with data quality rules defined as version-controlled code instead of informal assumptions that live in someone’s head.

Learn how data contracts turn data standards into enforceable rules, closing the gap between governance and execution in our Definitive Guide to Data Contracts”.

4 Data Contract Templates You Can Use Today

Below are four ready-to-use templates based on common scenarios: shared datasets, transformations, schema stability, and data integrity. Each follows a production-ready YAML structure that you can adapt to your environment. For more starting points, browse Soda's template library.

Template 1: Basic Data Contract Template

If your team is writing its first data contract example, start here. It covers the essential fields: data schema validation, row count, and a completeness (missing) check. Use this for any shared dataset where multiple teams need a lightweight, reliable baseline.

dataset: datasource/db/public/customers

checks:
  - schema: null
  - row_count:
      threshold:
        must_be_greater_than: 0

columns:
  - name: customer_id
    data_type: VARCHAR
    checks:
      - missing:
          threshold:
            must_be: 0
      - duplicate: null
  - name: email
    data_type: VARCHAR
    checks:
      - missing:
          threshold:
            metric: percent
            must_be_less_than: 1
  - name: created_at
    data_type: TIMESTAMP
    checks:
      - missing:
          threshold:
            must_be: 0

What this covers: Schema presence, row count, required timestamps, checks to prevent duplicate or missing IDs, and a low-tolerance rule for missing emails provide a solid baseline for any customer-facing dataset.

Template 2: Transformation Data Contract

Transformations are one of the most common places where things go quietly wrong. A column gets renamed, a join drops rows, or a recalculation shifts a metric's range, and nobody finds out until a stakeholder asks a question. This contract template wraps a transformation layer and validates output from a dbt model, Spark job, or SQL transformation before it moves downstream.

dataset: datasource/analytics/public/orders_transformed

checks:
  - schema: null
  - row_count:
      threshold:
        must_be_greater_than: 1000
  - freshness:
      column: transformed_at
      threshold:
        unit: hour
        must_be_less_than: 4

columns:
  - name: order_id
    data_type: VARCHAR
    checks:
      - missing:
          threshold:
            must_be: 0
      - duplicate: null
  - name: order_total
    data_type: NUMERIC
    checks:
      - invalid:
          valid_min: 0
          valid_max: 50000
      - missing:
          threshold:
            must_be: 0
  - name: order_status
    data_type: VARCHAR
    checks:
      - invalid:
          valid_values: ['pending', 'confirmed', 'shipped', 'delivered', 'cancelled'

What this covers: Freshness of the transformed output, minimum row count as a sanity check, no missing or duplicate order IDs, valid order total range, and a list of valid values for order status.

Template 3: Schema Validation Contract

Schema drift is one of the most common causes of pipeline failures. A data producer team renames a field or changes a data type, and suddenly, a downstream consumer is reading NULL where they expected a string. This data contract example of schema enforcement is especially useful in environments where multiple producers write to shared datasets.

dataset: datasource/db/public/product_catalog

checks:
  - schema:
      allow_extra_columns: false
      allow_other_column_order: false

columns:
  - name: product_id
    data_type: VARCHAR
    optional: false
  - name: product_name
    data_type: VARCHAR
    optional: false
  - name: category
    data_type: VARCHAR
    optional: false
  - name: price
    data_type: NUMERIC
    optional: false
  - name: stock_quantity
    data_type: INTEGER
    optional: false
  - name: last_updated
    data_type: TIMESTAMP
    optional: true

What this covers: Required columns and data types are strictly enforced, while optional columns are clearly marked. Any schema change triggers a warning, providing a stability guarantee for downstream consumers.

Template 4: Data Integrity Contract

A dataset can pass a schema check and still contain values that make no business sense, such as negative quantities, future-dated records, or impossible ranges. This contract template focuses on data integrity, ensuring the values themselves are trustworthy, not just the structure around them. Valid values checks are particularly important for any data feeding financial reporting or compliance workflows.

dataset: datasource/finance/public/transactions

checks:
  - row_count:
      threshold:
        must_be_between:
          greater_than: 10000
          less_than: 5000000
  - freshness:
      column: transaction_date
      threshold:
        unit: hour
        must_be_less_than: 2

columns:
  - name: transaction_id
    data_type: VARCHAR
    checks:
      - missing:
          threshold:
            must_be: 0
      - duplicate: null
  - name: amount
    data_type: NUMERIC
    checks:
      - invalid:
          valid_min: 0.01
          valid_max: 999999.99
      - missing:
          threshold:
            must_be: 0
  - name: currency_code
    data_type: VARCHAR
    checks:
      - invalid:
          valid_values: ['USD', 'EUR', 'GBP', 'CAD', 'AUD']
      - missing:
          threshold:
            must_be: 0
  - name: transaction_date
    data_type: TIMESTAMP
    checks:
      - missing:
          threshold:
            must_be: 0

What this covers: Amount ranges that reflect real-world transaction limits, a controlled currency list, freshness requirements, and zero tolerance for missing or duplicate transaction IDs.

How to Customize Data Contract Templates

The templates above are intentionally generic. The value of a contract template comes from how well it reflects your actual data. A few things that make the biggest difference:

  • Start with the dataset you trust the least. Most teams have at least one pipeline that everyone quietly worries about. That's where a data contract will have the most immediate impact. If you're not sure where to begin, creating a data contract for your highest-traffic shared dataset is usually the right first move.

  • Calibrate thresholds against real data. Before setting a range or a valid_min, run a quick historical query to understand what normal actually looks like. Too-tight thresholds create constant false alarms; too-loose thresholds won't catch real problems.

  • Use version control from day one. Store contracts in the same Git repository as your pipeline code. Any change goes through a pull request and stays visible to everyone who depends on that dataset.

Best Practices for Managing Data Contracts

Most data scientists and analysts don't fail at data contracts because they missed a policy. They fail because contracts weren't maintained, weren't enforced, or weren't owned by anyone. Following solid data contract best practices from the start makes the difference between contracts that protect your pipelines and contracts that collect dust.

  • Treat contracts as living documents. Datasets evolve, with new columns added and business logic changing. Build a regular review into your team's workflow, even once a quarter, to make sure contracts still reflect reality. The goal is safe change, not no change.

  • Enforce contracts in your CI/CD pipeline. Run contract verification automatically on every pull request and every new data load. Soda's documentation recommends verifying contracts on new data as soon as it is produced, to limit its exposure to downstream systems before it's been validated. If a check fails in CI, the data doesn't move forward.

  • Write contracts collaboratively. The producer team knows the data; the consumer team knows what they need from it. A contract written by just one side tends to miss something important — the conversation that happens while writing it is often as valuable as the data contract implementation itself.

  • Connect contracts to your broader data stack. Good data contracts management means linking contracts to your data catalog for discoverability and publishing results to a tool like Soda Cloud, giving your team a central view of what is passing, what is failing, and what needs attention across all datasets and not just the ones someone happens to be monitoring.

How Data Contracts Fit into Data Governance

Data contracts are one component of a broader governance strategy, but they operate where governance often breaks down: the execution layer. Many frameworks define policies, ownership models, and data classifications. Those structures are valuable, but policies alone do not prevent a broken pipeline from delivering inaccurate data downstream.

Most organizations already define data standards. The challenge is that these standards often live in static documents or catalog descriptions. They describe what data should look like, but they are not enforced in pipeline execution. Data contracts — specifically, formal data contract agreements encoded as executable checks — close that gap by making expectations testable, not just documented.

In practice, contracts are most effective when integrated with the rest of the data stack. Linking them to your data catalog improves discoverability. Connecting them to orchestration ensures failures trigger alerts. Publishing results to Soda Cloud provides a centralized view of contract health across datasets.

Testing and observability reinforce this model. Testing validates contract rules during development and execution, while observability monitors adherence in production. Together, they provide coverage across the full data lifecycle.

Ready to Put These Templates to Work?

The four templates above provide a practical starting point for common scenarios, including shared datasets, transformation validation, schema stability, and data integrity. Each can be adapted to your environment and integrated into existing pipelines.

Moving from static YAML definitions to enforced, runtime contracts requires automation. With automated checks, alerting, and a shared view of data quality across teams, contracts become operational controls rather than documentation. That’s what Soda is built for: turning defined expectations into continuously enforced standards across every pipeline your team runs.

Frequently Asked Questions

How do I enforce data contract compliance?

Make the contract executable. Tools like Soda let you define your contract in YAML and run automated verification against your actual data as part of your pipeline. If the data meets the contract, it moves forward. If it doesn't, the pipeline stops and an alert goes out. The goal is to remove manual oversight from the equation, because manual oversight tends to work right up until the moment it doesn't.

What if data contracts change over time?

They will change, and that is expected. Data contracts should evolve alongside the datasets they govern. The key is to treat contracts the same way you treat pipeline code: store them in version control, update them through pull requests, and involve the teams who depend on the dataset in the review process. This keeps expectations aligned with how the data is actually produced and consumed. What you want to avoid is a contract that drifts out of sync with reality. When that happens, it creates false confidence instead of reliable guarantees.

How do data contracts work in an automated pipeline?

A data contract operates at defined checkpoints in the data flow, typically after ingestion and again after transformation. Each time data moves through these stages, contract verification runs automatically against the defined expectations, including schema, completeness, value ranges, and freshness. If a check fails, the pipeline can stop, route the data to a quarantine layer, or trigger an alert to the owning team. This ensures issues are identified and addressed close to the source, before inaccurate data reaches dashboards, models, or business decisions downstream.

Data contracts formalize expectations between data producers and data consumers, ensuring consistent structure, measurable quality, and clear governance as data moves across systems. Instead of relying on assumptions, teams define explicit, testable standards for schema, transformations, ownership, and service levels.

The impact on business decisions can be costly and widespread. Gartner estimates poor data quality costs organizations an average of $12.9 million per year. Clear contracts reduce that risk by shifting quality from reactive troubleshooting to proactive control.

This guide provides five practical data contract templates you can adapt to your environment:

Each template is designed to be customized for common, real-world use cases, helping you operationalize data governance.

What is a Data Contract?

A data contract is a formal agreement between the people who produce data and the people who consume it. It defines what a dataset should look like, which columns must exist, what types they should be, what values are valid, how fresh the data needs to be, and who is accountable for maintaining data quality.

The keyword is enforceable. A data contract that lives in a Confluence doc and never gets checked is just documentation. A true data contract runs as executable checks every time new data arrives. If the data meets the contract, it moves forward. If it doesn't, it stops.

Why Data Contracts Are Essential for Data Teams

There's a reason data contracts have become a recurring topic in data engineering communities over the last two years. The old approach to data quality, which fixes issues only after they appear in production, does not scale.

An analysis of over 1,000 data pipelines found that 72% of data quality issues are discovered only after they've already affected business decisions. By the time someone notices the dashboard looks off, the invalid data has already powered a report, trained a model, or informed a decision.

Data contracts shift quality checks upstream. Teams define expectations in advance and validate data against them as it moves through the pipeline. These are formal data contract agreements between producers, who generate and manage data, and consumers, who rely on it, working like APIs in software to ensure data follows a fixed format and data quality rules and standards.

The practical result is clearer ownership: a data contract explicitly names who is responsible, which speeds up resolution when something breaks. Teams see fewer surprise failures, since checks run continuously rather than only after someone reports an issue. And governance becomes defensible, with data quality rules defined as version-controlled code instead of informal assumptions that live in someone's head.

Learn how data contracts turn data standards into enforceable rules, closing the gap between governance and execution, in our Definitive Guide to Data Contracts.

4 Data Contract Templates You Can Use Today

Below are four ready-to-use templates based on common scenarios: shared datasets, transformations, schema stability, and data integrity. Each follows a production-ready YAML structure that you can adapt to your environment. For more starting points, browse Soda's template library.

Template 1: Basic Data Contract Template

If your team is writing its first data contract example, start here. It covers the essential fields: data schema validation, row count, and a completeness (missing) check. Use this for any shared dataset where multiple teams need a lightweight, reliable baseline.

dataset: datasource/db/public/customers

checks:
  - schema: null
  - row_count:
      threshold:
        must_be_greater_than: 0

columns:
  - name: customer_id
    data_type: VARCHAR
    checks:
      - missing:
          threshold:
            must_be: 0
      - duplicate: null
  - name: email
    data_type: VARCHAR
    checks:
      - missing:
          threshold:
            metric: percent
            must_be_less_than: 1
  - name: created_at
    data_type: TIMESTAMP
    checks:
      - missing:
          threshold:
            must_be: 0

What this covers: Schema presence, row count, required timestamps, checks to prevent duplicate or missing IDs, and a low-tolerance rule for missing emails provide a solid baseline for any customer-facing dataset.

Template 2: Transformation Data Contract

Transformations are one of the most common places where things go quietly wrong. A column gets renamed, a join drops rows, or a recalculation shifts a metric's range, and nobody finds out until a stakeholder asks a question. This contract template wraps a transformation layer and validates output from a dbt model, Spark job, or SQL transformation before it moves downstream.

dataset: datasource/analytics/public/orders_transformed

checks:
  - schema: null
  - row_count:
      threshold:
        must_be_greater_than: 1000
  - freshness:
      column: transformed_at
      threshold:
        unit: hour
        must_be_less_than: 4

columns:
  - name: order_id
    data_type: VARCHAR
    checks:
      - missing:
          threshold:
            must_be: 0
      - duplicate: null
  - name: order_total
    data_type: NUMERIC
    checks:
      - invalid:
          valid_min: 0
          valid_max: 50000
      - missing:
          threshold:
            must_be: 0
  - name: order_status
    data_type: VARCHAR
    checks:
      - invalid:
          valid_values: ['pending', 'confirmed', 'shipped', 'delivered', 'cancelled']

What this covers: Freshness of the transformed output, minimum row count as a sanity check, no missing or duplicate order IDs, valid order total range, and a list of valid values for order status.

Template 3: Schema Validation Contract

Schema drift is one of the most common causes of pipeline failures. A data producer team renames a field or changes a data type, and suddenly, a downstream consumer is reading NULL where they expected a string. This data contract example of schema enforcement is especially useful in environments where multiple producers write to shared datasets.

dataset: datasource/db/public/product_catalog

checks:
  - schema:
      allow_extra_columns: false
      allow_other_column_order: false

columns:
  - name: product_id
    data_type: VARCHAR
    optional: false
  - name: product_name
    data_type: VARCHAR
    optional: false
  - name: category
    data_type: VARCHAR
    optional: false
  - name: price
    data_type: NUMERIC
    optional: false
  - name: stock_quantity
    data_type: INTEGER
    optional: false
  - name: last_updated
    data_type: TIMESTAMP
    optional: true

What this covers: Required columns and data types are strictly enforced, while optional columns are clearly marked. Any unexpected schema change fails the check, giving downstream consumers a stability guarantee.

Template 4: Data Integrity Contract

A dataset can pass a schema check and still contain values that make no business sense, such as negative quantities, future-dated records, or impossible ranges. This contract template focuses on data integrity, ensuring the values themselves are trustworthy, not just the structure around them. Valid values checks are particularly important for any data feeding financial reporting or compliance workflows.

dataset: datasource/finance/public/transactions

checks:
  - row_count:
      threshold:
        must_be_between:
          greater_than: 10000
          less_than: 5000000
  - freshness:
      column: transaction_date
      threshold:
        unit: hour
        must_be_less_than: 2

columns:
  - name: transaction_id
    data_type: VARCHAR
    checks:
      - missing:
          threshold:
            must_be: 0
      - duplicate: null
  - name: amount
    data_type: NUMERIC
    checks:
      - invalid:
          valid_min: 0.01
          valid_max: 999999.99
      - missing:
          threshold:
            must_be: 0
  - name: currency_code
    data_type: VARCHAR
    checks:
      - invalid:
          valid_values: ['USD', 'EUR', 'GBP', 'CAD', 'AUD']
      - missing:
          threshold:
            must_be: 0
  - name: transaction_date
    data_type: TIMESTAMP
    checks:
      - missing:
          threshold:
            must_be: 0

What this covers: Amount ranges that reflect real-world transaction limits, a controlled currency list, freshness requirements, and zero tolerance for missing or duplicate transaction IDs.

How to Customize Data Contract Templates

The templates above are intentionally generic. The value of a contract template comes from how well it reflects your actual data. A few things that make the biggest difference:

  • Start with the dataset you trust the least. Most teams have at least one pipeline that everyone quietly worries about. That's where a data contract will have the most immediate impact. If you're not sure where to begin, creating a data contract for your highest-traffic shared dataset is usually the right first move.

  • Calibrate thresholds against real data. Before setting a range or a valid_min, run a quick historical query to understand what normal actually looks like. Too-tight thresholds create constant false alarms; too-loose thresholds won't catch real problems.

  • Use version control from day one. Store contracts in the same Git repository as your pipeline code. Any change goes through a pull request and stays visible to everyone who depends on that dataset.
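One lightweight way to calibrate thresholds is to profile the column before writing the contract. The sketch below is illustrative, not part of any template above: the table and column names are hypothetical, and it uses an in-memory SQLite database purely so the example is self-contained. In practice, you would run a similar query against your warehouse and use the observed percentiles (with some headroom) as `valid_min` and `valid_max`.

```python
import sqlite3

def profile_column(conn, table, column):
    """Return min, max, and approximate 1st/99th percentiles for a numeric column."""
    rows = [r[0] for r in conn.execute(
        f"SELECT {column} FROM {table} WHERE {column} IS NOT NULL ORDER BY {column}"
    )]
    if not rows:
        raise ValueError(f"no non-null values in {table}.{column}")
    # Nearest-rank percentile over the sorted values.
    pct = lambda q: rows[min(len(rows) - 1, int(q * len(rows)))]
    return {"min": rows[0], "max": rows[-1], "p01": pct(0.01), "p99": pct(0.99)}

# Demo with synthetic order totals; point this at your own warehouse instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_total NUMERIC)")
conn.executemany("INSERT INTO orders VALUES (?)", [(x,) for x in range(1, 1001)])
stats = profile_column(conn, "orders", "order_total")
print(stats)  # use p01/p99 (with headroom) to set valid_min/valid_max
```

Running the profile before writing the contract keeps thresholds grounded in what "normal" actually looks like, rather than in a guess that triggers false alarms or misses real problems.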

Best Practices for Managing Data Contracts

Most data scientists and analysts don't fail at data contracts because they missed a policy. They fail because contracts weren't maintained, weren't enforced, or weren't owned by anyone. Following solid data contract best practices from the start makes the difference between contracts that protect your pipelines and contracts that collect dust.

  • Treat contracts as living documents. Datasets evolve, with new columns added and business logic changing. Build a regular review into your team's workflow, even once a quarter, to make sure contracts still reflect reality. The goal is safe change, not no change.

  • Enforce contracts in your CI/CD pipeline. Run contract verification automatically on every pull request and every new data load. Soda's documentation recommends verifying contracts on new data as soon as it is produced, limiting downstream systems' exposure to data that hasn't been validated. If a check fails in CI, the data doesn't move forward.

  • Write contracts collaboratively. The producer team knows the data; the consumer team knows what they need from it. A contract written by just one side tends to miss something important — the conversation that happens while writing it is often as valuable as the data contract implementation itself.

  • Connect contracts to your broader data stack. Good data contract management means linking contracts to your data catalog for discoverability and publishing results to a tool like Soda Cloud, giving your team a central view of what is passing, what is failing, and what needs attention across all datasets, not just the ones someone happens to be monitoring.
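The CI/CD enforcement practice above can be wired up as a workflow that verifies every contract file on each pull request. The fragment below is a sketch only: the package name, `soda contract verify` invocation, flags, and file paths are assumptions, so check Soda's documentation for the exact CLI syntax in your version.

```yaml
# .github/workflows/verify-contracts.yml (illustrative; adapt names and commands)
name: Verify data contracts
on: [pull_request]

jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install soda-core   # package name is an assumption; see Soda docs
      # Hypothetical invocation: verify each contract against its datasource.
      - run: |
          for contract in contracts/*.yml; do
            soda contract verify -d my_datasource "$contract"  # command syntax assumed
          done
```

Because the verification step exits non-zero on a failed check, the pull request cannot merge until the contract and the data agree, which is exactly the "if a check fails in CI, the data doesn't move forward" behavior described above.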

How Data Contracts Fit into Data Governance

Data contracts are one component of a broader governance strategy, but they operate where governance often breaks down: the execution layer. Many frameworks define policies, ownership models, and data classifications. Those structures are valuable, but policies alone do not prevent a broken pipeline from delivering inaccurate data downstream.

Most organizations already define data standards. The challenge is that these standards often live in static documents or catalog descriptions. They describe what data should look like, but they are not enforced in pipeline execution. Data contracts — specifically, formal data contract agreements encoded as executable checks — close that gap by making expectations testable, not just documented.

In practice, contracts are most effective when integrated with the rest of the data stack. Linking them to your data catalog improves discoverability. Connecting them to orchestration ensures failures trigger alerts. Publishing results to Soda Cloud provides a centralized view of contract health across datasets.

Testing and observability reinforce this model. Testing validates contract rules during development and execution, while observability monitors adherence in production. Together, they provide coverage across the full data lifecycle.

Ready to Put These Templates to Work?

The four templates above provide a practical starting point for common scenarios, including shared datasets, transformation validation, schema stability, and data integrity. Each can be adapted to your environment and integrated into existing pipelines.

Moving from static YAML definitions to enforced, runtime contracts requires automation. With automated checks, alerting, and a shared view of data quality across teams, contracts become operational controls rather than documentation. That’s what Soda is built for: turning defined expectations into continuously enforced standards across every pipeline your team runs.

Frequently Asked Questions

How do I enforce data contract compliance?

Make the contract executable. Tools like Soda let you define your contract in YAML and run automated verification against your actual data as part of your pipeline. If the data meets the contract, it moves forward. If it doesn't, the pipeline stops and an alert goes out. The goal is to remove manual oversight from the equation, because manual oversight tends to work right up until the moment it doesn't.

What if data contracts change over time?

They will change, and that is expected. Data contracts should evolve alongside the datasets they govern. The key is to treat contracts the same way you treat pipeline code: store them in version control, update them through pull requests, and involve the teams who depend on the dataset in the review process. This keeps expectations aligned with how the data is actually produced and consumed. What you want to avoid is a contract that drifts out of sync with reality. When that happens, it creates false confidence instead of reliable guarantees.

How do data contracts work in an automated pipeline?

A data contract operates at defined checkpoints in the data flow, typically after ingestion and again after transformation. Each time data moves through these stages, contract verification runs automatically against the defined expectations, including schema, completeness, value ranges, and freshness. If a check fails, the pipeline can stop, route the data to a quarantine layer, or trigger an alert to the owning team. This ensures issues are identified and addressed close to the source, before inaccurate data reaches dashboards, models, or business decisions downstream.
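As a generic illustration of that checkpoint pattern (this is not Soda's API; the expectation rules and sample records below are made up for the sketch), a gate function can validate a batch against a few contract rules and route failing rows to a quarantine set instead of letting them continue downstream:

```python
def contract_gate(records, required, valid_range):
    """Split a batch into rows that pass the contract and rows to quarantine.

    required: column names that must be present and non-null.
    valid_range: {column: (min, max)} bounds for numeric columns.
    """
    passed, quarantined = [], []
    for row in records:
        ok = all(row.get(col) is not None for col in required)
        ok = ok and all(lo <= row[col] <= hi
                        for col, (lo, hi) in valid_range.items()
                        if row.get(col) is not None)
        (passed if ok else quarantined).append(row)
    return passed, quarantined

batch = [
    {"transaction_id": "t1", "amount": 42.50},
    {"transaction_id": None, "amount": 10.00},   # missing ID -> quarantine
    {"transaction_id": "t3", "amount": -5.00},   # below valid_min -> quarantine
]
passed, quarantined = contract_gate(
    batch,
    required=["transaction_id", "amount"],
    valid_range={"amount": (0.01, 999999.99)},
)
print(len(passed), len(quarantined))  # 1 2
```

In a real pipeline, a tool like Soda runs this role for you with far richer checks; the point of the sketch is the control flow: validate at the checkpoint, pass what conforms, and quarantine or alert on what doesn't.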

Trusted by the world’s leading enterprises

Real stories from companies using Soda to keep their data reliable, accurate, and ready for action.

At the end of the day, we don’t want to be in there managing the checks, updating the checks, adding the checks. We just want to go and observe what’s happening, and that’s what Soda is enabling right now.

Sid Srivastava

Director of Data Governance, Quality and MLOps

Investing in data quality is key for cross-functional teams to make accurate, complete decisions with fewer risks and greater returns, using initiatives such as product thinking, data governance, and self-service platforms.

Mario Konschake

Director of Product-Data Platform

Soda has integrated seamlessly into our technology stack and given us the confidence to find, analyze, implement, and resolve data issues through a simple self-serve capability.

Sutaraj Dutta

Data Engineering Manager

Our goal was to deliver high-quality datasets in near real-time, ensuring dashboards reflect live data as it flows in. But beyond solving technical challenges, we wanted to spark a cultural shift - empowering the entire organization to make decisions grounded in accurate, timely data.

Gu Xie

Head of Data Engineering

4.4 out of 5

Start trusting your data. Today.

Find, understand, and fix any data quality issue in seconds.
From the table level to the record level.

Adopted by
