The New Open Source Software (OSS) for Metric Collection, Data Testing and Data Monitoring with SQL Accessible Data
We've built Soda SQL from the ground up to do three things well on SQL-accessible data:
- Metric collection
- Data testing
- Data monitoring
Soda SQL helps Data Engineers to maintain high-quality, trusted data pipelines in production.
As data teams operationalize data products and features, they’ve become critically aware of the need to test data, because bad data leads to incorrect decisions. As a result, a number of data teams have implemented their own data testing frameworks. However, these homegrown solutions quickly become unwieldy as datasets and data teams grow. That’s why in late 2018 we set out to create a solution that would enable data teams to monitor their critical decision flows: the Soda Data Monitoring Platform.
For the next two years, we were crazy busy working with customers: designing the platform, and testing it with users. In 2019, we formed a Customer Advisory Board, consisting of data & analytics engineers, data product managers, business intelligence teams, data governance teams, data scientists, as well as heads of data (and analytics) from some of the most data-intensive companies out there.
We’ve been blessed with their partnership and the guidance of our investors. Both have provided us with invaluable insights and feedback as we road tested our product. Fast forward to February 2021, and we finally felt that we had an important component ready for public release. Soda SQL solidifies our commitment to equip data engineers with tools for testing, monitoring and profiling data.
Soda SQL is the first part of Soda’s strategy to provide freely-available open source data management tools to engineers working in data-intensive environments, where data quality is paramount.
Let Soda SQL Do Your Data Testing
The goal is for Soda’s developer tools to solve a need that tens of thousands of data engineers have worldwide: monitoring data quality. We see that data engineers are constantly looking to bring more software engineering principles into the data engineering workflow. One of those principles is Test-Driven Development (TDD), and this is what Soda is championing in the data engineering world.
In software, as in so many other areas, what you don’t know can hurt you. At Soda, we refer to these unknown things as silent data issues. Even with data engineers in the front line protecting against them, silent data issues can wreak havoc on the data that is passed to your users downstream.
The first line of defense is to check data as it lands in your data platform, as well as on every downstream data table that’s being created. We call this data testing. Soda SQL creates data tests for you, and lets you easily add more using Python expressions and SQL. Once these tests are defined, Soda immediately starts protecting you against silent data issues.
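To make the idea concrete, here is a minimal sketch of metric collection followed by data testing. It is not Soda SQL's actual API: the `orders` table, the metric names, and the `collect_metrics` and `run_tests` helpers are all assumptions made for this illustration, and it uses Python's built-in `sqlite3` module as a stand-in for your warehouse.

```python
import sqlite3

# Hypothetical in-memory table standing in for data that just landed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (2, None), (3, 7.5), (4, 12.0)],
)


def collect_metrics(conn):
    """Collect a few metrics for the orders table with a single SQL query."""
    row_count, missing_count = conn.execute(
        "SELECT COUNT(*), "
        "SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) "
        "FROM orders"
    ).fetchone()
    return {"row_count": row_count, "missing_count": missing_count}


def run_tests(metrics, tests):
    """Evaluate each test expression with the metrics as the only names in scope.

    eval() is used purely for illustration; only run expressions you trust.
    """
    return {
        test: bool(eval(test, {"__builtins__": {}}, dict(metrics)))
        for test in tests
    }


metrics = collect_metrics(conn)
results = run_tests(metrics, ["row_count > 0", "missing_count == 0"])

for test, passed in results.items():
    print("PASS" if passed else "FAIL", test)
```

In this sketch the second test fails because one `amount` is missing, which is exactly the kind of silent issue a test should surface before the data moves downstream.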
Soda SQL Defends Against Silent Data Issues
Soda SQL works hand-in-glove with data engineering workflows. As an engineer, you get full control and visibility. You define how Soda SQL works by using industry standard YAML configuration files. These files can be checked into version control and let you control and audit the tests that are executed and the metrics that are used to evaluate the results.
When new data is processed, Soda SQL will scan it through a set of efficient queries. Soda is built on the belief that data quality starts with metric collection, data testing, and data monitoring. We think that Soda SQL can be a great start to creating data observability at scale in your organization!
How Does It Work?
Soda SQL is a simple command line tool that enables you to test and monitor data through metric collection. The tool generates a folder structure with files for each of your datasets. Each file contains one or more Soda Scan configurations. The default configuration is based on the initial scan of the data, and contains smart suggestions. When a column is found to contain only unique values, for example, Soda SQL automatically suggests including a duplicate metric and test for that column.
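The suggestion step could work roughly like the sketch below. This is an assumption about the general approach, not a description of Soda SQL's internals: the `suggest_tests` helper, the `customers` table, and the `duplicate_count == 0` test label are hypothetical names chosen for the example.

```python
import sqlite3

# Hypothetical table used for the initial scan.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "BE"), (2, "US"), (3, "BE")],
)


def suggest_tests(conn, table, columns):
    """Suggest a duplicate test for every column whose values are all unique.

    Table and column names are interpolated into the SQL, so they must come
    from a trusted source such as the database catalog.
    """
    suggestions = {}
    for column in columns:
        total, distinct = conn.execute(
            f'SELECT COUNT("{column}"), COUNT(DISTINCT "{column}") '
            f'FROM "{table}"'
        ).fetchone()
        # Unique so far: propose a test that keeps it that way.
        if total > 0 and total == distinct:
            suggestions[column] = ["duplicate_count == 0"]
    return suggestions


suggestions = suggest_tests(conn, "customers", ["id", "country"])
print(suggestions)
```

Here only `id` gets a suggested test, because `country` already contains repeated values in the scanned data.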
Once you’re happy with the datasets and tests, you can add them to any modern data orchestration tool.
So Why Call It Soda SQL?
As our name suggests, we went all in on SQL. Unashamed. Without any reservations. After the hype of NoSQL (which should have been called NoTransaction, BTW), there is a clear trend back towards SQL across data stacks, the data landscape and data platforms. Another advantage of a SQL approach is that it lets you leave your data in place. You don't need to load or move your data around to test and monitor it: Soda SQL runs where your data lives. And lastly, SQL brings a lot of flexibility. It allows us to, for example, split data testing from broad metric collection. When data teams are processing data, every second counts. Less critical monitoring flows can therefore run in parallel, without blocking the critical path.
Soda SQL and the Soda Data Monitoring Platform
Now let’s talk about the Soda Data Monitoring Platform. We believe that there are a lot of valuable services that we can provide on top of Soda SQL to help data teams, seamlessly integrating with their tools of choice for data discovery and incident management.
The Soda Data Monitoring Platform provides real-time insights into your metrics, test results, and datasets. Think of Soda SQL as the engine and the platform as a slick UI where you can see and collaborate on what’s happening, and create monitors in a no-code environment.
We’re currently building a free trial cloud service that will store metrics over time and enable Soda Insights. Insights is a proactive detection service that tells you which data is worth fixing. Subscribe to our newsletter to stay up to date on these exciting developments!
The first goal of our open source strategy is to help organisations achieve observability through metric collection across the data stack. Next up is support for streaming and dataframes. Each open source project will be built to natively support these technologies, so that it is easy to set up and gives you full control over the performance impact.
My background in open source has taught me the value of community, and how fun it is to build something together. I invite you to join us. Head over to our Soda SQL project on GitHub and check it out for yourself. (And give us a star, please!)