The First Generative AI for Data Quality
Today marks the release of SodaGPT, the first generative AI for data quality that enables a no-code approach to express data quality checks.
This new tool marks the beginning of true, self-serve data quality management for everyone involved in data. SodaGPT combines the expressiveness of the Soda Checks Language (SodaCL) with the natural language processing power of generative AI to shorten the time-to-value for implementing data quality checks. SodaGPT is available in preview: log in to, or sign up for, a Soda Cloud account and click “Ask SodaGPT” to check it out!
Soda has always been committed to making it easy for coders and analysts alike to participate in improving data quality in an organization. The non-coders among us are often best placed to decide what needs to be tested for data to be trustworthy and fit-for-purpose – after all, they are the ones who make decisions based on data, and they have built the domain expertise to formulate those requirements. Yet most solutions keep those users from doing so, because of the barriers of having to learn a new tool and a new language.
SodaGPT squarely addresses those barriers. Using our own, proprietary generative pre-trained transformer technology based on the open-source Falcon-7b model, SodaGPT translates plain-English input into production-ready data quality checks written in SodaCL. This new feature provides a simple way for data consumers to become truly involved in data quality management. It also decreases the load on data engineers, who no longer have to translate requirements or expectations manually in order to implement data quality checks in data pipelines.
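To give a sense of what this translation looks like, here is a minimal sketch of the kind of SodaCL that a plain-English request could map to. The request, the `orders` dataset, and the `order_id` column are hypothetical, and the checks SodaGPT actually produces for a given input may differ.

```yaml
# Hypothetical request: "Make sure the orders table is not empty
# and that order_id is never missing or duplicated."
# The dataset and column names are illustrative assumptions.
checks for orders:
  - row_count > 0
  - missing_count(order_id) = 0
  - duplicate_count(order_id) = 0
```

Because checks like these are plain YAML, they can be reviewed and versioned alongside the rest of a pipeline's code.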
SodaGPT, the MVP in your Data Mesh
Data mesh is all about applying product and software engineering principles to data. In the context of data product thinking, we measure quality through customer satisfaction. The more regular users of a product, the better. The happier the customer, the less likely they are to decide to use an alternative product.
SodaGPT revolutionizes the implementation of data quality checks by allowing users to hit the ground running, no coding expertise required. It elevates the low-code essence of SodaCL into an entirely code-free experience, while still harnessing all the advantages of SodaCL, a domain-specific language for data quality testing that is here to stay.
When designing SodaCL, we had a clear vision: organizations that aspire to an efficient data mesh, or a close derivative, need to think of data as a product and therefore understand end-user/data consumer requirements. They need to be able to manage these requirements as code alongside other data product code such as transformation, retention, and access, and they must support governance concepts in a computational way. In order to make a data mesh accessible, Soda aims to make self-serve tools available in the data platform layer.
With SodaGPT, we've catapulted this concept to new heights, making self-service within data mesh not just a possibility, but a reality.
A Note on Privacy and Security
We understand that expressing data quality requirements potentially exposes a certain level of sensitive information. Picture a company named EcoWings, which is secretly developing a drone in the shape of a hummingbird that helps in pollination. This drone is equipped with AI technology that detects flowering plants and delivers pollen to them, mimicking the natural pollination process. A user formulating the following natural-language input would expose a significant amount of intellectual property:
“Can you help me make sure that the drone_sensor_2 column is never above 2 when the nectar_level is below 2 and when flower_diameter is < 1cm. This applies to the pollination_drones_measurements table”
To safeguard your data, we have developed SodaGPT as an entirely homegrown solution. SodaGPT uses our own, proprietary generative pre-trained transformer technology based on the Falcon-7b model, an open-source Large Language Model (LLM); it does not depend on an LLM belonging to OpenAI (the company that built ChatGPT). This means that your data, whether included in input or output, never leaves the Soda platform, and is fully covered by our SOC2 Type 2 certification, as well as by our settings for localization of your data.
What’s Next?
Today’s preview release of SodaGPT is only the first step in helping users write SodaCL; it is far from having reached its full potential for accuracy. The Soda team is busy refining and training the model, pushing it to support more of SodaCL’s built-in checks, increasing its capacity for check output, and extending its capability to handle custom, user-defined checks that involve highly specific SQL queries.
Expect SodaGPT to get better every week, and expect to see it pop up in more parts of the Soda product – and even your data catalog! – soon. Stay tuned!