Implementing Data Quality in a Data Mesh: A Case Study at HelloFresh

This guide explores HelloFresh's journey to implement data quality in a data mesh, decentralize data ownership, and increase usage and trust.

data mesh (noun): An organizational pattern for analytical data management that serves to establish and support a data-informed organization. It allows everyone in an organization to leverage and contribute to data insights, adding their perspectives and domain expertise. It focuses on data enablement rather than centralized control, removing bottlenecks and fostering cross-domain collaboration to build and maintain a single source of truth.

In the beginning

In 2021, the journey from a centralized data organization to a data mesh had only just begun for HelloFresh.

Less than a decade ago, HelloFresh, like so many other companies, had a small, specialized team that focused on warehousing data and producing reports for analysts and executives. They fashioned themselves a siloed existence, growing separate from the rest of the organization as the only ones who understood their data, how to manage it, and how much it could be trusted. 

But as the volume of data and demand for access to it increased, the team struggled to keep up. 

Their specialized data management skills and processes kept most of the organization from accessing or using data. As the team struggled to keep up with demand, the quality of the data suffered, and the lack of reliable, quick access to the data was starting to stifle innovation.

In early 2020, they decided to stop the constant firefighting that their Data Engineers were doing due to a lack of data quality standardization and uncertain ownership of data in the organization. To unlock analytical data at scale, the central, specialized team decided to pivot from data warehousing to a data mesh construct in which they built the tools and programs that would enable everyone to help themselves to data they could trust. The journey had begun.

Their ultimate goal is to build data products that have a purpose and that people can trust, high-quality assets that the rest of the organization can quickly discover, understand, and securely access. As part of their phased approach to implementing this new organizational mindset, they elected to use Soda to tackle data quality testing and monitoring. 

This case study on HelloFresh seeks to illustrate an example of where Soda sits in an organization, how it fits into the data mesh landscape, and the value it provides in expanding internal adoption and ownership of data quality.

An early version of data mesh

There are four main elements that make up the early version of data mesh at HelloFresh: 

  • domain-oriented, decentralized data teams
  • data products
  • federated and incentivized data governance
  • a self-serve data platform

Data domain teams

In an effort to decentralize data ownership, the data warehousing group started by building teams of data specialists that operate autonomously and are not part of a centralized “data ownership” construct. These Data Domain Teams, as they came to be called, are composed of Data Engineers, Data Analysts, Data Scientists, and Data Product Managers. These teams take ownership of the data in their domain and provide the support and services that come with that responsibility. They build and maintain data products while adhering to compliance standards and ensuring their data products are accessible to everyone in the organization via a self-serve data platform.

Data Domain Teams do not report to a centralized body that controls access to the data or dictates mandates about data products. However, they do have access to federated data governance standards and data quality and management tools to facilitate their objectives and meet the needs of their customers, the consumers of data in the organization. 

The data warehousing group took these necessary steps to pivot towards the Data Domain Team model:

  1. Maintain the existing data warehouse to meet established commitments.
  2. Hire people to build a new cloud data infrastructure.
  3. Enable the people asking for reports to serve themselves via a data platform.
  4. Slowly decommission the data warehouses, disassemble old commitments, and set up new Service Level Objectives (SLOs).

Data products

In another conscious shift in mindset, HelloFresh decided to treat their data as a product, rather than an operational by-product. Data is a proper thing, a tool people throughout the organization use to gauge success and make business decisions. It deserves the attention and strategy one would apply to a product. 

A data product, then, is essentially raw data that has been transformed into something usable by data consumers. It might comprise a single dataset, or several datasets, and a dashboard. Each data product has a Data Product Manager, and the support of a product and engineering team. 

As the owner of a data product, a Data Product Manager does three things:

  1. identifies the internal customers, the consumers of the data, the stakeholders
  2. understands the needs of the customers and what they use the data for
  3. productizes the data and applies SLOs to meet data quality commitments

In more formal contexts, a Service Level Objective (SLO) or a Service Level Agreement (SLA) can be a threatening tool that people use to control something, or to squarely lay blame when something goes wrong. However, in the context of data products at HelloFresh, SLOs are more like tools for creating alignment between the data product owners and data consumers so as to establish trust in the data. They act as an enforceable contract that focuses on explicitly describing what the data product provides and making sure that the data is fit for purpose.

As an example, an SLO could be as simple as “Make sure this data is up-to-date every morning at 8:00 AM” so that a team can run a daily report on errors in the system. These kinds of SLOs promote clarity and visibility, and are not intended for use as heavy-handed contracts.
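
To make the 8:00 AM example concrete, here is a minimal sketch of how such a freshness SLO might be expressed as a check. It is an illustration only, not HelloFresh's implementation or Soda's API: the function name `meets_freshness_slo`, the 24-hour window, and the use of naive (timezone-free) timestamps are all assumptions made for this example.

```python
from datetime import date, datetime, time, timedelta


def meets_freshness_slo(
    last_updated: datetime,
    check_day: date,
    deadline: time = time(8, 0),            # assumed SLO deadline: 8:00 AM
    max_age: timedelta = timedelta(hours=24),  # assumed freshness window
) -> bool:
    """Return True if the data was refreshed in the window ending at the deadline.

    Hypothetical rule: the most recent update must fall within `max_age`
    before the deadline on `check_day`, and no later than the deadline itself.
    """
    cutoff = datetime.combine(check_day, deadline)
    return cutoff - max_age <= last_updated <= cutoff


if __name__ == "__main__":
    day = date(2021, 6, 1)
    # Refreshed at 06:30 on the day of the report: SLO met.
    print(meets_freshness_slo(datetime(2021, 6, 1, 6, 30), day))
    # Refreshed at 09:15, after the deadline: SLO missed.
    print(meets_freshness_slo(datetime(2021, 6, 1, 9, 15), day))
```

A check like this states the expectation in a form that both the data product owner and the data consumer can read, which is the alignment role the SLO is meant to play.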

Soda and data products

The product and engineering teams that build and maintain data products are the primary purchasers and users of Soda products. These teams use Soda to execute data quality tests on their data products and share proof of data product quality with colleagues. 

Federated data governance

Data governance, another weighty term, is something HelloFresh wanted to lighten and properly incentivize. 

There’s a notion that data governance is more about control, about being a gatekeeper of data, restricting access or acting as an obstacle or bottleneck to getting reliable data. Instead, HelloFresh’s approach focuses on developing a framework of data quality standards that apply to all data products, helping teams understand the framework, and providing support and clarity about expectations involving the data products. The centralized Data Governance Team uses policies and guidelines to explicitly set standards for what good data looks like, and incentivizes Data Domain Teams to certify their data products. 

The Data Product Certification program is a voluntary exercise that Data Domain Teams can pursue. It is somewhat gamified with different data quality levels that they can strive to reach (bronze, silver, and gold) and the badges to match on Slack. Teams can request that their data be certified according to the data-quality standards set by the Data Governance Team, thus sharing the ownership of ensuring good data quality. 

The goal is to incentivize Data Domain Teams to do the right thing with their data, something HelloFresh refers to as “governance by convenience”. They strive to make the right way to test or validate their data quality the easiest way. The program rewards teams for investing in both data quality and the collective commitment to creating data visibility.

Soda and data governance

HelloFresh’s Data Product Certification process is still manual. The manual process involves measuring the data product’s quality, security, and self-description, among other things, and Soda is the tool they use to test for quality. 

In the future, the certification process will be automated, again leveraging Soda products for data quality testing. Starting early, however, has enabled them to begin building a data product framework and given them the opportunity to learn what works and what doesn’t. It’s new territory, but they have found that data quality is at its best when ownership is shared.

Self-serve data platform

The final hurdle in decentralizing data ownership involved making data discoverable, interoperable, secure, trustworthy, and self-describing for data consumers within the organization. When you have Data Domain Teams incentivized to certify the quality of their data products, and good data quality tooling, your data consumers are empowered to help themselves.

In a data mesh, HelloFresh discovered, a self-serve data platform must respect the alignment between autonomous, decentralized operations teams and their need for quick and reliable access to data. Such a platform had to use tools that could integrate well with the existing data landscape and data infrastructure so teams wouldn’t have to struggle with the overhead of implementing yet another data ingestion pipeline tool. 

Ideally offering built-in data governance and interfaces for global monitoring of policies, the tools that comprise the self-serve data platform make it easy for teams to set up data quality tests or checks, and encourage data quality to be monitored throughout the whole organization. Using such a platform is all about democratizing access to data so everyone has the data they need to do their jobs.

Soda and the self-serve data platform

From HelloFresh’s perspective, it was critical that the tool they chose for their self-serve data platform could provide an accessible user interface. 

The people on the Data Domain Teams are generally data analysts or data scientists who are very comfortable writing SQL queries but less comfortable with, or capable of, writing code to test for data quality. Soda’s approach, a command-line tool that facilitates the use of SQL queries via easy-to-use YAML files and out-of-the-box metrics, made it a logical choice for the team to adopt.
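
The general idea of declaring data quality checks as SQL plus a threshold, rather than writing test code, can be sketched as follows. This is not Soda's file format or API: the `CHECKS` structure, the `run_checks` function, and the `orders` table are hypothetical names invented for illustration, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3

# Hypothetical declarative checks: each has a name, a SQL query that returns
# a single number, and optional "min" / "max" bounds on that number.
CHECKS = [
    {"name": "orders_not_empty",
     "sql": "SELECT COUNT(*) FROM orders",
     "min": 1},
    {"name": "no_missing_customer",
     "sql": "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL",
     "max": 0},
]


def run_checks(conn: sqlite3.Connection, checks: list) -> dict:
    """Run each SQL check and map its name to True (pass) or False (fail)."""
    results = {}
    for check in checks:
        value = conn.execute(check["sql"]).fetchone()[0]
        lower = check.get("min", float("-inf"))
        upper = check.get("max", float("inf"))
        results[check["name"]] = lower <= value <= upper
    return results


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory table with one row missing its customer_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO orders (customer_id) VALUES (?)",
        [(10,), (11,), (None,)],
    )
    return conn


if __name__ == "__main__":
    print(run_checks(demo_connection(), CHECKS))
```

The point of the design is that someone fluent in SQL only edits the list of checks; the runner that executes and reports them stays the same.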

In truth, HelloFresh felt that implementing data quality checks is something that they could have done in-house, but the work involved in building and maintaining a user interface and global (internal) monitoring functionality would have required a significant product investment, so it was worth leveraging the product of third-party, data-quality experts.

Ready for data analytics at scale

As the construct of a bottlenecked data governance framework recedes in the rearview mirror, HelloFresh is looking ahead to a future state in which they embrace data analytics at scale. And though the shift towards data mesh involves more than just “changing the tools we use”, the technological choices they have made are quickly proving to be accelerants as they work to achieve their goals in democratizing data. 

From facilitating the creation of robust data products and aligning data quality governance, to providing self-serve data quality checks within the organization, Soda’s technology continues to be an instrumental part of their efforts. Even years later, HelloFresh’s journey is ongoing, but these first steps feed tremendous optimism for a system that is poised to seize upon the business value and efficiency that shared data ownership has to offer.

  1. Discover Soda’s data quality platform to implement data quality into your data mesh
  2. Watch the Video: Data Mesh at HelloFresh - Data Mesh Learning Meetup

*Disclaimer: This guide was created in 2021. Please note that figures and statistics may have changed since its publication.