At Soda, we've dedicated the past several years to mastering the art of data testing—a meticulous process in which users write checks explicitly and declaratively. This approach has always been at the core of our “prevention is better than cure” philosophy, and we've strived to equip both coders and non-coders alike with the tools needed to ensure good-quality data from the get-go.
Thanks to features we’ve recently released, getting started with Soda has become much easier. For example, we launched the Soda-hosted agent, which eliminates the need to deploy Soda Library on your own infrastructure, and we made no-code checks available in Soda Cloud, empowering non-coders to create checks and collaborate directly with anyone else in their organization. Whether you’re integrating these tools into complex, bespoke data pipelines with our shift-left API, or just beginning your data quality management journey in Soda Cloud, we’ve made sure it’s smooth sailing, and that you can seamlessly transition between flavors of Soda.
We’ve also been at the forefront of using Generative Artificial Intelligence (GenAI) to help users establish data quality coverage. From SodaGPT, our SodaCL generator, to our latest AI assistants for regular expressions and SQL queries, we’re excited about the path we’re taking. It’s bound to lead to an awesome, GenAI-first user experience with lots of help and automation. Curious about how deep GenAI integration goes in Soda? Keep your eyes peeled for the announcement of a few new, and extremely cool, features powered by GenAI.
Today, we're excited to share the next step in our journey towards more automation and more AI on the Soda platform by expanding our focus to encompass data observability. Even though issue prevention is the ultimate end goal for data quality testing in the pipeline, we acknowledge that the initial steps towards establishing good data quality can be daunting. With our newest features, we aim to automate the basics of data monitoring—like tracking volume, freshness, schema changes, and other vital metrics—using AI and machine learning to detect anomalies. This not only aids in building a comprehensive view of your data's behavior over time but also eliminates the initial, overwhelming "blank page" feeling we all dread!
We’re automating the mundane to ensure that foundational monitoring is in place, freeing you up to focus on crafting more proactive, high-quality data coverage. Think of it this way: Soda takes care of the basics so you can dive deeper without having to worry about missing out on any critical data metrics. You’ve got bigger things to tackle, and Soda makes sure you’re fully equipped to do just that.
Onboard your data sources and Soda takes care of the rest
Observability in Soda takes the form of Anomaly Dashboards in Soda Cloud. With some basic data source configuration, you can quickly onboard your datasets and begin monitoring them for anomalies. All you need to do is use SodaCL include and exclude patterns to specify which datasets in a data source Soda should profile. Soda then partitions and profiles your data, and uses the results to create automated anomaly detection checks for each dataset. Soda executes these checks once per day and warns you when it detects anything anomalous.
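To give you a feel for it, the include/exclude patterns look roughly like the SodaCL sketch below. The dataset name patterns are made up for illustration, so check the Soda docs for the exact syntax your version supports:

```yaml
# Illustrative sketch of a SodaCL dataset discovery/profiling configuration.
# "dim_%" and "staging_%" are hypothetical naming patterns.
discover datasets:
  datasets:
    - include dim_%       # onboard datasets whose names start with "dim_"
    - exclude staging_%   # skip staging tables
```

The `%` wildcard lets you onboard whole families of datasets at once instead of listing each one by name.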
An Anomaly Dashboard automatically sets up three dataset-level anomaly detection checks. It monitors row volume changes, data freshness, and schema evolution. After the machine-learning model trains for a short while to learn how your data evolves over time, Soda displays any anomalies in those metrics in each dataset’s Anomaly Dashboard.
The dashboard also tracks three column-level metrics: the percentage of missing values, the percentage of duplicates, and, for numeric columns, the calculated average. Again, when any of those metrics change in an unexpected way Soda surfaces the anomalies in the dashboard.
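As a rough sketch, writing equivalent checks by hand in SodaCL would look something like the following. The dataset and column names here are made up, and the exact metric names may vary by Soda version; the point is simply what the dashboard automates for you:

```yaml
# Hand-written sketch of checks the Anomaly Dashboard sets up automatically.
# "dim_customer", "email", and "amount" are illustrative names.
checks for dim_customer:
  - anomaly detection for row_count                  # volume changes
  - anomaly detection for missing_percent(email)     # missing values
  - anomaly detection for duplicate_percent(email)   # duplicates
  - anomaly detection for avg(amount)                # average of a numeric column
```

With the Anomaly Dashboard, you get this coverage without writing a line of SodaCL.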
We could keep describing the feature in words, but at this point, it’s probably best to watch me explain what it all looks like.
Sensible & Scalable
It’s time to talk a bit about what happens behind the scenes. Most likely, your first questions at this point are something along the lines of “How will this solution cope with my very, very large datasets?” or “How can I make sure all this doesn’t hang during processing?” You’re right to ask those questions; they were top-of-mind for us during design and development. Let’s talk about scalability and sensibility.
Datasets, especially large ones, generally contain at least one column that indicates when a record was created, loaded, or updated. This is good news for Soda, because it can use that column to partition your dataset for profiling. When Soda profiles your dataset for the first time, it automatically detects the most suitable timestamp column to partition the data, so the machine-learning algorithm can train on a sample of data rather than the whole lot. If Soda cannot detect such a column, you can manually identify one for it to use. Failing that, Soda profiles a random one-million-row sample of your dataset.
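Conceptually, manually pointing Soda at a partition column looks something like the sketch below. This is a hypothetical configuration shape, not exact Soda syntax (in practice this is typically set per dataset in Soda Cloud), and every name in it is illustrative:

```yaml
# Hypothetical sketch only — the keys and values below are illustrative,
# not guaranteed Soda syntax. It shows the idea: tell Soda which timestamp
# column to partition on when auto-detection can't find one.
dataset: fact_orders
partition:
  column: loaded_at   # timestamp column Soda uses to partition for profiling
```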
Be one of the first to activate Anomaly Dashboards for Observability
As a first step, we have released anomaly dashboards in private preview so we can leverage the feedback from our earliest adopters. Hey, we’re treading new ground here and we want to make sure we get it right!