Soda, the data quality company, today launched SodaGPT, the first Generative AI (GenAI) powered tool for data quality, enabling a no-code, self-serve approach that lets users of all backgrounds express and define data quality expectations in natural language. Available in preview from today, SodaGPT combines the domain-specific language capabilities of SodaCL with the natural language processing power of GenAI to provide a single platform where data consumers and data engineers can work together to produce data that can be trusted and used by everyone.
SodaGPT uses Soda’s own proprietary generative pre-trained transformer technology, based on the open-source Falcon-7b model, to translate natural-language queries written in English into production-ready data quality tests in SodaCL, the human-readable, domain-specific language for data quality. The tool gives data consumers and domain experts a simple way to become involved in data quality management by expressing and defining their own data quality expectations, ensuring that data is fit for purpose and lessening the load on data engineers who spend their time fighting data issues.
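As an illustration only, a plain-English request such as "the orders dataset should never be empty, every order needs a customer, and order IDs must be unique" could correspond to SodaCL checks like the ones below. The dataset name (orders), the column names, and the request itself are hypothetical, and the checks SodaGPT actually generates for a given prompt may differ.

```yaml
# Hypothetical SodaCL checks for an assumed dataset named "orders".
# Column names (customer_id, order_id) are illustrative assumptions.
checks for orders:
  # "The orders dataset should never be empty"
  - row_count > 0
  # "Every order needs a customer"
  - missing_count(customer_id) = 0
  # "Order IDs must be unique"
  - duplicate_count(order_id) = 0
```

Because each check reads close to the sentence it came from, a data engineer can review it against the original request before it is added to a pipeline.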
“SodaGPT is a huge step forward for the democratization of data, providing a no-code, GenAI tool that ensures everyone can get involved in data quality testing and, as a result, make data-informed decisions,” said Maarten Masschelein, CEO, Soda. “LLMs are one of the many exciting trends reshaping our world and transforming the way we work with information systems, and they have the power to transform how we extract value from data. With SodaGPT, we are ripping up the antiquated approach to data quality checks built exclusively for a technology audience that can read and write in SQL, simplifying the process for data consumers in order to free up data engineers to focus on building new data products.”
The introduction of a new self-serve ‘contribution’ model empowers data consumers to express, contribute and then collaborate on data quality expectations that meet their own business requirements. Natural-language ‘contributions’ made using SodaGPT are automatically translated into SodaCL, facilitating seamless collaboration between data consumers, who can now define data quality expectations in their own words, and data engineers, who provide critical human oversight to ensure that checks are correctly defined before being embedded into the data pipeline.
SodaGPT ‘shifts left’ the management of data quality, enabling data to be tested as early and as often as possible in the development lifecycle to avoid issues that might impact data products or wreak havoc on the business down the line. Soda research recently found that 60% of data engineers are still spending almost half their time dealing with data issues. By helping teams build a more robust, reliable data pipeline in which problems are caught before they enter production, SodaGPT lets data consumers be more productive with data they can trust, while data engineers spend less time reactively fixing problems and more time proactively adding value back into the business.
In line with Soda’s security-first approach to software development, SodaGPT has been trained entirely on the open-source Falcon model to produce SodaCL from natural language input, with no dependency on OpenAI. This means that proprietary data shared with the model through prompts never leaves Soda’s SOC-II Type 2 accredited platform, guaranteeing the same high level of internal controls, systems, and privacy and protection policies as all other Soda products.
SodaGPT is available in preview to all registered users of Soda Cloud from today. For more information on SodaGPT, please visit Soda Documentation.