
How Data Teams Manage Quality and AI

Discover how data teams are navigating the evolving landscape of data quality management in our latest guide, How Data Teams Manage Quality and AI. Based on insights from 287 practitioners across major data and AI conferences, this comprehensive analysis explores current tooling, the impact of GenAI on workflows, and the capabilities needed to ensure reliable, high-quality data. Whether you're curious about automation, pipeline testing, or the future of AI in data quality, this guide offers insights to enhance your approach.


The Survey

We surveyed 287 data practitioners across multiple regions, roles, and perspectives to explore how software engineering principles, Generative AI (GenAI), and organizational requirements are shaping data quality management.

The survey respondents were attendees of prominent data and AI conferences, including Big Data LDN, Microsoft Day, Gartner Data & AI Summit, Dataversity DGIQ, and the Databricks Data & AI Summit and World Tour in New York.

What best describes your approach to managing data quality?

  • 56% of participants are using data observability tooling specifically to improve data quality. This underscores the market understanding that observability is key to maintaining reliable, trustworthy data in production.
  • 41% are focused on pipeline testing as their primary approach to ensuring data quality, highlighting the importance of testing frameworks.
  • 5% reported using other methods, suggesting that some teams are still experimenting with alternative (and mostly in-house) approaches to managing data quality.

Note: Percentages exceed 100% because participants could select multiple approaches, reflecting the best practice of combining methods to manage data quality.

Do you think GenAI will disrupt the data quality workflow?

  • A plurality of respondents (42%) believe that AI will slightly disrupt data quality management, suggesting that while it is expected to introduce efficiencies, many still foresee the need for human oversight.
  • 29% remain uncertain about AI's role, indicating that while there's excitement about the potential, many data teams are still navigating how AI will fit into their workflows.
  • A notable 23% predict that AI will entirely disrupt data quality management, signifying that a substantial number of teams are preparing for major shifts driven by AI technologies.
  • Only 6% believe AI will have no impact on data quality management, indicating that very few data teams expect business-as-usual in the face of AI advancements.

The key takeaway is that GenAI is set to play a significant, though varied, role in reshaping how teams manage data quality. While the majority expect it to bring moderate changes, a substantial portion are gearing up for a more transformative impact. Regardless of the level of disruption, one thing is clear: maintaining reliable, good-quality data remains a critical component in the use of data, AI, and machine learning.

How do you think GenAI will disrupt the data quality tooling space?

  1. Automation and AI-powered rules creation: many respondents foresee automation playing a central role in their work, particularly in generating data quality checks automatically and simplifying rule definition, along with automating anomaly detection and other forms of data quality testing. Some expect GenAI to replace static components with learning components, meaning that AI will continuously evolve to refine quality checks based on new data. Further, respondents suspect that GenAI will potentially automate follow-ups with data owners to resolve issues, reducing the manual effort involved.
  2. Self-learning and continuous improvement: several responses point out that GenAI will continuously learn from new data and improve quality over time. They think that it will become capable of predicting potential data issues and identifying patterns in the data that humans might overlook. GenAI could even auto-solve quality issues, becoming smarter and requiring less human intervention as it evolves.
  3. Enhanced efficiency and speed: respondents mention that GenAI will make data quality checks easier to implement and speed up the identification of issues. This would improve the turnaround time for both detecting and resolving data quality problems, while allowing for faster bug detection and better adaptability to changes.
  4. Reduced human effort: GenAI will act as a copilot for data engineers, helping them with tasks like code building, automated observation, and issue description. By automating many of the manual processes, it will offload work from data teams, allowing them to focus on more strategic tasks.
  5. Data governance and human oversight: there is recognition that GenAI will need to address data governance. Automated checks and remediation will still require human approval to ensure governance standards are met. Some respondents express concern that GenAI could introduce bias or novel bugs, emphasizing the need for extra checks to mitigate these risks.
  6. Real-time data quality monitoring: GenAI is expected to excel at real-time data quality monitoring, catching deviations or anomalies almost instantly.
  7. Application in unstructured data and user interfaces: several respondents foresee GenAI making a significant impact on unstructured data quality by helping to categorize, standardize, and even clean free-text data, which is often more challenging to manage. It may also inject data quality monitoring directly into user interfaces, simplifying the user experience.
  8. Challenges and cautions: some challenges and risks mentioned include the potential for GenAI to enable the proliferation of bad data if applied without proper controls. Concerns about bias being introduced into the data or decision-making processes require additional safeguards. Some feel that while GenAI will make a significant impact, it will still require human-driven decisions at critical stages.

In summary, the responses suggest that GenAI will significantly disrupt the data quality tooling space by:

  • Automating many manual processes, from generating quality checks to detecting and resolving issues.
  • Enhancing efficiency with faster, real-time data quality insights.
  • Learning continuously from data to improve over time.
  • Expanding into unstructured data and bringing data quality control closer to users through intuitive interfaces.

However, the need for governance, human oversight, and bias control remains a critical element in ensuring any advancements are applied responsibly and effectively.

Which capabilities are needed to enhance data quality within your stack?

Data teams were clear on the top product capabilities that can significantly enhance data quality. The responses span a wide variety of capabilities that respondents feel are needed to leverage GenAI effectively for data quality management. Here's a categorized breakdown:

A significant number of responses highlight the need for Automation and Observability:

  • Automation: automated checks, automated anomaly detection, and AI-powered prevention were recurring themes.
  • Observability: tools for data observability to monitor data across the pipeline were frequently requested.
  • Real-time monitoring: real-time quality checks and validation of data in motion.
  • Automated detection: outlier detection and automated error detection were also highly desired.
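As an illustration of the automated outlier detection respondents asked for, a minimal sketch using a z-score over a pipeline metric (here, a hypothetical daily row count) might look like the following; production observability tools use far richer models than this.

```python
import statistics

# Illustrative z-score outlier detection over a pipeline metric;
# the metric and threshold are assumptions for the example.

def detect_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]

daily_row_counts = [1000, 1020, 990, 1010, 5000]  # hypothetical loads
anomalies = detect_outliers(daily_row_counts, threshold=1.5)
```

The appeal of automating this is that thresholds no longer need to be hand-tuned per dataset; a learned baseline can flag the suspicious load without anyone writing a rule for it.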

Many respondents recognized the need for Integration and Connectivity:

  • Integration with existing tools: seamless integration with platforms like Azure Data Factory, Kafka, Purview, Databricks, and Master Data Management tools.
  • Multiple connectors: the ability to connect easily to a wide variety of data sources, especially in multi-cloud and hybrid environments.
  • Integration in pipelines: real-time checks and validation within processing pipelines.

Several responses focused on the need for Data Governance and Lineage:

  • Data governance: AI tools that help improve governance, surface issues, and support better rule adherence.
  • Data lineage: the ability to track data lineage and ensure transparency at every layer of the data lifecycle.

Many respondents indicated the importance of Anomaly Detection and Data Profiling:

  • Anomaly detection: the ability to identify irregularities in data quickly.
  • Data profiling: tools that profile data automatically and ensure adherence to data quality rules and expectations.

Respondents are looking for Ease of Use and Flexibility:

  • Plug-and-play: quick adaptability and flexibility of AI tools with minimal manual intervention.
  • Easy integration: simplicity in setup with fast insights.
  • User-friendly interfaces: interfaces that display data quality insights clearly and help users easily manage and fix issues.

Some respondents pointed to the need for Data Cleaning and Quality Control:

  • Data cleaning: AI-powered tools that can automatically clean data and detect missing or inconsistent fields.
  • Quality control: automated data quality checks that happen both at data ingestion and throughout the data lifecycle.
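A minimal sketch of the kind of missing-field profiling such cleaning tools perform, so effort can be targeted at the worst columns; the field names and rows are invented for illustration.

```python
# Illustrative missing-field profiler: count empty or absent values
# per required field across a batch of records.

def profile_missing(rows, required):
    """Return {field: count of missing/blank values} across rows."""
    counts = {f: 0 for f in required}
    for r in rows:
        for f in required:
            v = r.get(f)
            if v is None or (isinstance(v, str) and not v.strip()):
                counts[f] += 1
    return counts

missing = profile_missing(
    [
        {"id": 1, "country": "BE", "email": "a@x.io"},
        {"id": 2, "country": "", "email": None},
        {"id": 3, "country": "NL"},
    ],
    required=["country", "email"],
)
```

Run at ingestion and again later in the lifecycle, a profile like this gives a simple before/after measure of whether cleaning is actually working.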

A few responses highlighted Governance and Rule Standardization:

  • Standardized libraries: libraries to help enforce quality control standards across environments, especially for pipelines.
  • Governance and reporting: insights that motivate decision-making and governance tools that ensure data quality compliance.

There's recognition for the need for Education and Culture:

  • Promote data literacy: educating teams about AI tools and ensuring they understand data quality processes.
  • Fostering a data quality culture: emphasizing the importance of embedding data quality into the organizational culture.

The importance of Testing and Validation was underlined by requests for:

  • Automatic testing: tools that run data tests automatically, such as unit tests and integration tests for data pipelines.
  • CI/CD integration: the ability to integrate data quality testing into Continuous Integration/Continuous Delivery (CI/CD) workflows.
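As a sketch of what a data unit test in a CI/CD step might look like: a batch of records is checked against a schema and a simple business rule, and the pipeline fails if violations are found. The schema and rule here are hypothetical, and the function could just as easily run under a test runner such as pytest.

```python
# Hypothetical data unit test suitable for a CI step: check each
# record against an expected schema and a business rule.

EXPECTED_COLUMNS = {"order_id", "amount", "currency"}

def check_batch(batch):
    """Return (row_index, reason) pairs for every rule violation."""
    violations = []
    for i, row in enumerate(batch):
        if set(row) != EXPECTED_COLUMNS:
            violations.append((i, "schema mismatch"))
        elif row["amount"] < 0:
            violations.append((i, "negative amount"))
    return violations

violations = check_batch([
    {"order_id": 1, "amount": 9.99, "currency": "EUR"},
    {"order_id": 2, "amount": -5.00, "currency": "EUR"},
])
```

Wiring a check like this into CI means a schema change or bad rule is caught on the pull request, before the pipeline ever ships.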

Lastly, a few responses focused on AI-specific Needs, including:

  • Predictive AI: tools that predict and apply data quality rules based on the specific business domain.
  • Synthetic data: generation of synthetic bad data for testing expectations in data quality systems.
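The synthetic-bad-data idea can be sketched in a few lines: deliberately corrupt a fraction of healthy records, then confirm the quality checks catch them. The corruption strategy (blanking an email field), the rate, and the seed are arbitrary choices for the example.

```python
import random

# Illustrative synthetic-bad-data generator: corrupt a fraction of
# healthy records so quality checks can be exercised.

def inject_bad_data(rows, rate=0.3, seed=42):
    """Return a copy of `rows` with roughly `rate` of records corrupted."""
    rng = random.Random(seed)
    corrupted = []
    for r in rows:
        r = dict(r)  # copy so the originals stay intact
        if rng.random() < rate:
            r["email"] = None  # simulate a missing value
        corrupted.append(r)
    return corrupted

clean = [{"id": i, "email": f"u{i}@x.io"} for i in range(10)]
dirty = inject_bad_data(clean, rate=0.5)
```

Feeding `dirty` through a data quality suite and asserting that the injected defects are flagged is a cheap way to test the tests themselves.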

TL;DR

To summarize, the answers show a desire for holistic, automated, integrated, AI-driven data quality management tools that can work seamlessly across platforms. Respondents are particularly focused on real-time monitoring, automated anomaly detection, and strong data governance tools. Integration with existing data ecosystems and ease of use are also key factors for driving AI adoption in data quality processes.

Respondents who are unsure about the disruptive role of GenAI still expressed a variety of needs for enhancing data quality. Key areas of interest include automated checks, data governance, observability, and tools that are easy to use and integrate with existing workflows. Many, however, remain unsure of the specifics, reflecting a mix of early-stage exploration and a lack of familiarity with advanced data quality management solutions.

Respondents who don't see GenAI as a major disruptor in the data quality space expressed some interest in low-code platforms, accessibility improvements, and better tools for unstructured data. For the most part, they either feel their current tools are sufficient or are unsure of their needs in this area. Some are looking for incremental improvements in documentation and business process alignment rather than a full, AI-driven transformation.

Conclusion - And Our Survey Says...

Our survey reveals a dynamic and evolving landscape in data quality management, influenced by advances in testing, automation, GenAI, individual skill sets, and organizational priorities. While automation and observability are now central to most teams’ strategies, testing emerged as a critical approach for ensuring data reliability, with pipeline testing standing out as a key practice for early prevention.

The emphasis on testing reflects a growing understanding that catching and resolving issues upstream—during development or ingestion—can significantly reduce downstream errors and operational inefficiencies. Not to be ignored, many practitioners are still exploring their options or relying on manual processes. Generative AI is positioned as a potential game-changer, with most respondents anticipating it to either slightly or significantly disrupt workflows. Its ability to automate rule creation, enhance anomaly detection, and support real-time monitoring highlights a promising future, though concerns around governance, bias, and oversight persist.

Organizations also expressed a clear demand for integrated, easy-to-use tools that align with existing workflows, emphasizing real-time capabilities, governance, and adaptability across platforms. While some teams are ready for transformative AI-driven solutions, others are focused on incremental improvements or remain uncertain about how to best incorporate these technologies.

Ultimately, the findings underscore that while technology continues to shape how data quality is managed, reliable and trustworthy data remains at the heart of every data-driven initiative. The future will likely see a blend of innovative AI capabilities, robust testing strategies, and careful human oversight to ensure that data quality keeps pace with the increasing complexity and scale of today's data ecosystems.

LFG!

Ready to take your data quality management to the next level? Explore Soda’s powerful tools for pipeline testing, automation, and observability. Start with our free resources or join us for a hands-on webinar to see Soda in action.
