This episode of Club Soda features the data team of FirstParty, recorded live at a speakeasy in downtown New York, where data teams and practitioners gathered.
The topic is data products, and the conversation centers on FirstParty's mission to provide businesses with the capabilities to maximize the value of their data assets.
Your host Maarten Masschelein is in conversation with Jolie McDonnell (Data Scientist), Ben Sgro (VP Engineering), and Tommy Dodge (Director of Analytics).
Data community, welcome to this episode of Club Soda, featuring the data team of FirstParty. Let me set the scene. We're in a speakeasy in downtown New York. And we've gathered data teams and practitioners who are building data platforms, implementing a new technology stack, streamlining data operations and striving to close the gap between data producers and data consumers.
The theme, the topic and the conversation starter is data products. It centers on FirstParty's mission to provide businesses with the capabilities to maximize the value of their data assets, because they believe that the data market is inefficient and that many data owners are sitting on assets with significant untapped value. Without further ado, let's get the conversation started. Introducing your host, Maarten, and his guests, Jolie, Ben and Tommy.
All righty. So the panel is all about how we unlock value with data through data products. That's going to be very much the focus. It's a very interesting topic. Shall we maybe get started with introductions?
So my name is Maarten. I'm one of the co-founders of Soda. I've been in the data management space for about 10 years. And it was really interesting about three, four years ago when you saw the move towards ownership of data by the CTO, the chief technology officer. Before that, it was mostly the chief data officer, which was more of a change management type person, but now we're going full throttle: the CTO owns data, and we're bringing software engineering principles and product thinking into data. So it's really a very cool and fun place to be active in today. I'll ask you guys to introduce yourselves as well, and maybe in that, think about what you love about data and what excites you today.
It is, yeah, I know you're the co-founder. That's awesome.
Yeah. There you go.
Yeah, so I'm Tommy Dodge. I'm the director of analytics at FirstParty. I've been in the data monetization space for the majority of my career. On the supply side, I worked at a vendor selling a variety of raw feeds for different end use cases, and also directly for companies that are trying to monetize their data assets. And then on the buy side, I spent some time working at a large hedge fund in a data sourcing role, basically evaluating hundreds of data sets to see if they could be used for trading strategies.
Yeah, overall, I love this space. I think it's very entrepreneurial, a lot of opportunity, more data sets coming online in the supply side, more buyer use cases. I think hedge funds were an early mover, but there's a lot more buyers in the market now. Corporates, insurance companies, PE firms, real estate companies. It's just growing, so it's just a really exciting space to be in.
Yeah, absolutely.
Yeah.
Jolie?
Hi, I'm Jolie. I'm a data scientist at FirstParty. I recently graduated from Johns Hopkins University with a Master's in Data Science. FirstParty is my first real experience outside of college, so it's been really awesome getting to work at FirstParty and watch the company grow in number of clients and data sets and just technical infrastructure overall. So I'm really excited to be here and really excited to talk about Soda, how I use it in my day-to-day, and just talk about the cool features.
I'm Ben Sgro, I'm the VP of engineering at FirstParty. I've had a long career. To name a few of the companies I've been at: I spent some time at Equinox doing data engineering and science, and InfoSec, white hat work, doing vulnerability scanning. I was at Masterworks working on their platform. At FirstParty, yeah, we're doing some really interesting stuff. Soda's been paramount to the quality of the work that we do and what we deliver to our clients. As Tommy mentioned, the opportunity here is huge and we have a zero defect policy, so Soda's our first and foremost line of defense for the products we deliver to our customers. So happy to get into that. Yeah.
Awesome. Let's start with... because I'm sure not everyone is intimately familiar with your business, with what you guys do exactly. Tommy, if you could start with that?
Sure, yeah, good question. So what is FirstParty? I'd say we're a software-enabled services business. We focus on helping folks that are sitting on valuable raw data assets; we help them monetize those assets and basically maximize the value that can be extracted from them. It's a pretty hands-on process. And look, data quality is paramount. To give an example of a typical client, and I was talking with someone about this before we started: a perfect client for us is someone that has a large, mature business with strong penetration in a given segment, but monetizing those data assets is not their core business. So they would come to us, we'd help them evaluate the opportunity in the market, prioritize what to build, and then offer expertise and tools on how to build it most effectively to generate a positive ROI.
Very nice. And Ben, maybe you could take this one, the data team, how's that structured within FirstParty?
Yeah, so we're not big enough yet to have silos the way traditional engineering orgs do as they get larger. We do have some silos forming that we're working on, but we have data engineers, analysts, scientists, and traditional software engineers, and we all kind of work on a little bit of everything. So Jolie is primarily on the data science side, but she's obviously helping on the data engineering side too, on the reliability and robustness of the pipelines with Soda. And Tommy, just today we had a meeting called ruthless automation. So trying to figure out-
What's it like?
... what are some core things that we do manually that... We're a small company with 25 people, so it's like, where can we scale via technology? And where can we be force multipliers by automating processes? So we're always thinking about that. We're also doing a lot of really cool research around LLMs, as everybody's familiar with and probably working on; right now we're fine-tuning Mistral's 7 billion parameter model. So that's been really cool.
Yeah, the team's great: multidisciplinary, super skilled people, very collaborative environment. I've worked at a lot of different places, and a big draw for me at FirstParty was just the sheer awesome community of everybody there, and kind of the no-jerk rule. None of those types of problems that you especially see at larger companies, not a lot of politics to deal with. Just overall a very great place to be.
That's very fun to work in such an environment.
Sure, yeah.
So we're going to talk a little about data products. So we should probably do some foundational work and all align on what it means before we go deeper. So Tommy, could you give us, I don't know, something close to the definition of-
I'll try. Yeah. I'm sure a lot of folks have different definitions of data products, but for us at FirstParty, it's really just the final deliverable that's going out the door to serve an end user for a variety of different business use cases. And that could be in various formats. Some end users may want a more direct raw feed, others might want something that's more aggregated. And someone might want literally written commentary on the data to consume, not looking at any visuals at all. Right?
So a data product is really the end result of various transformations, cleansing, and enrichments: a final deliverable that then goes out the door to serve a business use case.
Something of value to the end user?
There you go, that was a really [inaudible 00:08:31].
For a certain use case or a... so that makes total sense.
Yeah, exactly.
So maybe we can continue and talk a bit about how all of this data work has evolved at FirstParty over the years. You guys are a relatively new organization, new team, new company.
For sure. Yes.
So how did that start? Was it part of the DNA from the start, or did it grow over time? I'm very curious to learn a bit more about it.
For sure. Yeah, so we are a relatively young firm; we started about two and a half years ago. I'd say for most of the first year we were more of a hands-on consulting shop with less software. So we were working mostly with static data assets that were sent from our clients to us. And we were evaluating the market opportunity, coming up with proofs of concept, doing different data validation exercises, and just really helping them hone in on the value that was out there that could be extracted for the different use cases.
That was going really well, but the natural next step after that work is, "Okay, great, now I have a proof of concept, it's been validated. I'm ready to go to market. I want to produce a live product." And so we're like, "Oh, shit, we got to-
Become more accountable now.
... come up with a way to support that at scale." And as Ben alluded to, we started taking on more and more of that work: live data pipeline support, generating these data products on a daily cadence for a variety of rigorous end users. A lot of these folks are in the investment finance space, consuming these data assets, and if you make a mistake, it's a problem that has a commercial impact. So data quality was paramount, and that's actually when we started our journey with Soda, when we started taking on more and more of that live pipeline work.
Yeah, what is the hardest thing about data for you? Sort of a curveball question now, hopefully. Because there's so many aspects to it, right? You need to be able to procure it, or get or create it yourself, then transform it, make it available, do quality control, access management, think of retention, all of these things, so much work. What was the hardest, I guess, in all of that? Or maybe there's something else here? Someone else?
Sure, sure. So definitely a lot of our clients don't even know what to do with the data. They're coming to us as the experts. So for instance, we got a deliverable the other night, and it's a large amount of data. It's in Parquet files. Okay, so that's good. But we bring the data in and there's data structures: there's one column that has 300 values in it, packed as key-value pairs, but it's not JSON, so you can't really easily extract it. There's bools, floats and timestamps all in there. It's like, "Oh geez, okay."
And we don't really want to go back to them and say anything, because we're the data experts; we need to figure it out. So part of it is just getting that data and understanding it. Some of the clients we work with have clickstream data, so you're talking tens of billions of rows. Okay, Redshift can handle that, S3 can handle that, but obviously it slows down our analysis, and we're having to pay for compute and storage of these larger data instances. And then we get into problem solving. We have certain patterns that work well for certain types of data, but once you start getting into lower latency, really large data, things like Spark or distributed computing, that gets more complex, more expertise required there.
We're mostly an AWS shop. We also work with Snowflake and other providers. But yeah, just the top-of-funnel stuff can be difficult. Then, not even to mention, as we begin transforming and applying ML and having these pipelines, Soda at every point is checking those things, making sure it all makes sense, that it's within our thresholds. So there's just layers on layers of complexity that we have to manage. And for a small team, we need tools; we can't do it manually, it doesn't scale.
Tools, and I guess the variety and the skills that you have as a team as well. People that just want to go, "Oh, something new. Let's pick it up. Let's go learn and figure it out." And if you have no experience with it, so be it [inaudible 00:13:00].
Yeah, yeah.
But yeah, the variety, that's crazy, because ultimately you have all of these customers and there's huge variety in how they work or think about data, the formats, the types of data. I guess with the big data era we had the three or four Vs of data, or five or six Vs, I don't know: volume, veracity... That variety is always a complexity we have to deal with.
Yeah.
That makes sense. Jolie, for you, when you think about overcoming some of these challenges, what's kind of important for you? You're a data engineer, data scientist, so you are very much dealing with the problems firsthand, I guess you are?
Yeah. I'm in the data. So I guess kind of touching on what Ben said, finding a data issue on your own is like finding a needle in a haystack. I feel like back then we maybe had the time, or not as many data sets from our clients to work on, so we could go in and run manual QA checks. So I would say in the past we were checking the data manually at the point of delivery, at the very end of our data pipelines. But over the course of our pipelines, we have various transformations that are taking place, sometimes 20 to 30 steps in an Airflow DAG.
And at every step in the pipeline we're making a transformation to the data, creating a new table. Oftentimes it's through a complex query or computation, or even sometimes an entire ML model. And even though we know that the queries might be running, if the Airflow tasks aren't failing, we have no idea if the queries ran as intended. So it really becomes important to start to think from an analytics perspective: how do we ensure that what we want to happen in the data is happening? And furthermore, when a data issue arises, where did that data issue happen? How can we get to the bottom of it as quickly as possible without manually QAing at the very end, seeing an anomaly, and then having to backtrack through all those steps?
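What Jolie describes maps naturally onto pairing each transformation task with a check task. Here's a minimal sketch of that pattern, using Soda Core's Python API inside an Airflow DAG; the DAG, task, and file names are hypothetical, not FirstParty's actual pipeline:

```python
# Sketch: a Soda scan after each transformation, so a bad step fails
# the DAG right where the issue occurred instead of at delivery time.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from soda.scan import Scan


def soda_check(checks_file: str) -> None:
    scan = Scan()
    scan.set_data_source_name("warehouse")  # named in configuration.yml
    scan.add_configuration_yaml_file("configuration.yml")
    scan.add_sodacl_yaml_file(checks_file)
    scan.execute()
    scan.assert_no_checks_fail()  # raises on failure, failing this task


with DAG("pos_pipeline", start_date=datetime(2024, 1, 1), schedule="@daily") as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=lambda: None)
    check_raw = PythonOperator(
        task_id="check_raw",
        python_callable=soda_check,
        op_kwargs={"checks_file": "checks/raw.yml"},
    )
    transform = PythonOperator(task_id="transform", python_callable=lambda: None)
    check_transformed = PythonOperator(
        task_id="check_transformed",
        python_callable=soda_check,
        op_kwargs={"checks_file": "checks/transformed.yml"},
    )

    ingest >> check_raw >> transform >> check_transformed
```

With 20 to 30 real transformation steps, each one would get its own checks file, so a failure pinpoints the exact table that went wrong.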
Yes, it usually always happens at the worst times. I guess when I was-
On Friday nights.
Yeah.
Oh my God.
We had no deploy Fridays.
Yeah.
It's funny.
For good reasons.
Yeah, yeah.
I had it myself as well. I was responsible for revenue operations at a software company before, and we had a lot of end-of-month, end-of-quarter cycles. We tried not to get into Fridays and Friday nights, but it just happened so often. And that team actually created a very good bond through that, but we were also, of course, every now and then quite frustrated that we were spending time at our computers figuring out what went wrong instead of in a bar.
Yeah. And we have a couple of clients where we went through some major performance improvements, where it was weeks on weeks and weekends. And as much as it sucks to go through that, I think there's a certain amount of camaraderie and team building that comes out of being shoulder to shoulder solving these problems, setting your alarm for 4:00 AM to go check a job and everybody jumping on Zoom and doing this stuff. It sounds sick, but that's fun. So we went through that a little bit earlier on, which was enjoyable to a degree.
Some deal with the ones that are still here, our audience.
That's it.
Yeah, and we learned our lesson.
Absolutely.
Oh yeah, we learned a lot.
I feel like we learned our lesson, we did it once, it was great.
Won't make that mistake again. Yeah.
So I think we've already touched on how data quality has evolved through a lot of the first questions and answers. I guess we all know it's important, but let's maybe talk about the approach. How are you guys dealing with that? We've talked about having some checks, ideally not at the end but a little bit earlier. But how are you conceptually thinking about that problem? Not sure, Ben, if you want to take that?
Yeah, sure. So when we start a pipeline, we architect that pipeline, figure out what the tooling is going to be, and then Tommy and Jolie will do a lot of analysis to figure out, like she mentioned, the steps in the DAG, a directed acyclic graph, so everybody knows. So in Airflow we have all these steps, and at each step we transform to a table, like Jolie said. It could be the output of a query, a computation, or an ML model. We'll add checks at every step, and we start at the top of the funnel. So as soon as data lands from the client, we'll check for counts, duplicates. And actually, just the last couple of days, we have a client that's been dropping tons of duplicate records.
So normally, without checks in place, we might not find them until later on in the pipeline. And then it's not very surgical: how do we go back and figure out where the problem occurred? How do we alert the client? Is our code failing? We don't know. So with Soda, I mean, it's been awesome in the respect that we catch these things the day they drop that data. We do a little bit of analysis and we can go back to them and say, "Hey, look, there's dupes in your data, here are the queries. You need to take a look at this."
So that builds a ton of trust from our customers and our clients towards us. And we don't have PagerDuty. Some companies do; I mean, we have pipelines that are up 24/7, but we don't have PagerDuty or any kind of rotation like that. And we'll talk more about our alerting system, but we get these alerts, we have SLAs that we have to maintain, and we adhere to those, but we're not waking up at 4:00 AM to answer Soda alerts. It's not that bad.
So we'll look at the pipeline, and Tommy and Jolie will figure out, "Okay, these are the data types, these are the transformations, here are the Soda Core checks we're going to build." And we have quite a depth of checks already, so we can reuse a lot of that. And having the Soda Core checks be in YAML makes them pretty easy to write. It's a nice language; you can write them pretty quickly. And we'll build those out and we'll test them and we'll iterate through them.
And even once that pipeline is live, for instance with the Z-score values that we have, as data flows in and out and counts change over time, those things may get triggered. We may ask the client, "Hey, should we adjust this? Do you intend the data to come in at this low volume? At this high volume?" So we're not really setting these and forgetting them; they don't require that much maintenance, but we are spending time making sure that they're tuned correctly for the client and for the pipeline. Yeah.
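For readers who haven't seen SodaCL, here's a minimal sketch of the kind of top-of-funnel checks Ben describes: counts, duplicates, and schema guards. Table and column names are hypothetical, and real thresholds would be tuned per client:

```yaml
# checks/raw.yml - run against the raw landing table on every refresh
checks for raw_pos_transactions:
  - row_count > 0                          # data actually landed
  - row_count between 1000000 and 5000000  # rough volume guardrail
  - duplicate_count(transaction_id) = 0    # catches the dupe drops above
  - missing_count(store_id) = 0
  - schema:
      fail:
        when required column missing: [transaction_id, store_id, sold_at]
```

Z-score-style distribution checks like the ones Ben mentions aren't shown here; those can be layered on top with user-defined metrics once scan history exists.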
What's crazy and super nice is that you're very proactive towards your end customer through that. Like you are-
It looks great for us. We are-
... "How do you know what's duplicates in our data? Do you know?"
Yeah, it's huge. Yeah.
That's great. And the other concept that I picked up on is that you try to shift left. Shift left is the notion of doing this as early and as often as possible. Not only as early as when data comes in, or each time there's a refresh from the customer, but also throughout the design phase of the data products, where you're thinking from a product management perspective about what we're aiming to deliver. You do it as early as possible. And that is, I think, a very important concept, something we can all take from software engineering, for sure, because that's where it started.
Jolie, you stumbled upon Soda, I guess, or Soda Core, I don't know exactly when? It was probably six to nine months, a year ago?
Yeah, I would say January-ish, about a year ago.
So how did you find it? What's-
And why?
So I wasn't involved in the stumbling upon Soda Core, but I do know the story of why we landed on Soda. I think you might've been more involved in the business [inaudible 00:21:07].
Yeah, well, when I came into the organization, like Jolie and Tommy mentioned, we were doing a lot of manual checks. One of the reasons I'm in the organization is to scale the organization, and manual processes, as we talked about with ruthless automation, are not going to scale. So I looked at competitors. What I loved about Soda was that at Equinox we had actually built something called HAMBot, which was basically an open-source version, very similar to Soda, pre-Soda obviously, where we had kind of a DSL, a domain-specific language. We could express assertions that we wanted to run against tables, and we would write those and deploy them in the pipeline as well.
So that was the model I already had in my mind of the right kind of data quality tooling to expect.
Just checked in a way.
So when I saw Soda, I was like, "Yeah, this is really up there." And there are other ones that do similar stuff, but we looked at Soda and we really liked it. And there's tooling that Jolie will talk about, some of the advanced stuff like SLAs and dashboarding that we haven't even gotten too deep into yet, that Soda offers and that we want to get into for our clients. But yeah, I looked at it, made the case to the executives. It was kind of a no-brainer that we needed this. They supported it and we started rolling it out.
In some of our initial rollout, we probably didn't do our due diligence to really understand the depth and quality and control that Soda allowed. So there were some things that we ended up having to retrofit and refactor, but that was, again, a good learning experience for Jolie. She ramped up on Soda and we started solving problems the right way. So yeah, I mean, it's been a great experience. Yeah.
So in a way, you created a business case internally for that. What were the key metrics, or what were you mostly... Was it the SLAs, or?
Yeah, so SLAs are a big one for us. Our co-founder and chief data officer, Alex Schwartz, has a zero-defect policy that we try to implement as best we can. And to Tommy's point, you only get to make a first impression once. When our data product goes out there, if a hedge fund comes in and uses it and they find an error, they may not come back to use our product. And that's huge; that's a really big problem. So you have to be super precise and controlled in those pipelines, making sure that those outputs are a hundred percent accurate. So the SLAs, the accuracy, the zero defect, those are all big metrics and key criteria for a data quality framework.
That makes sense. And when you implemented first, what was the first use case that you guys focused on? Was there any particular first use case?
Jolie, do you know about that one? You can talk about that one.
Was it the first pipeline, for the point of sale-
Yeah, point of-
... customer? Yeah.
Yeah, yeah.
Yeah. So the first pipeline that we added checks for was for our point of sale customer, because we had SLAs to uphold. So we knew that we had to make sure the data was of the utmost quality for our client. That's the main use case that we touched on at first. We had Soda Core checks throughout the pipeline, and we documented them nicely and sent it over to the client.
Yeah, along those lines. But it's funny. For our first use case using Soda, we did this great Lucid diagram with all the tables and transformations and, of course, all the Soda checks. And they were like, "Holy shit, this is solid-
You guys are good.
... oh shit." They were kind of blown away. Their head of data was like, "Dang, this is sweet."
That's great.
And they were like, "Thanks, guys. Let's keep doing this for the other ones. Good job on that. That's really brilliant."
It was really nice. And then I also wanted to touch on, just from an engineering perspective, other really good benefits of choosing Soda over other options. One is that you can connect to multiple data sources.
Oh yeah.
It's very flexible.
We've spent quite some time on that.
Yeah.
And like 4:30, sweating.
Friday nights.
Oh, yeah.
Yeah. Because our clients already use multiple data sources.
Right, for you it's-
Snowflake, Redshift, so we needed that type of flexibility. And we don't know what our future clients will house their data in.
Yeah, JSON files, Parquet files, they're partitioned differently, all of that. Well-
And you never know. So that was super helpful. I think also, as Ben mentioned, the YAML format was a really huge benefit to us. We are a very diverse team; we come from a variety of skillsets, and sometimes just having easy-to-read YAML checks to understand what's going wrong in the data, across the company and across the business, is really imperative. And the last thing is the notification integrations. We needed to know in real time when something went wrong in the data pipelines, so that was really important to us as well.
My co-founder, Tom, will be very happy about this. YAML and readable, and he was obsessing over it for quite some time when we developed it.
That's great.
Yeah, he wanted to make it as human-readable and -writable as possible. Because if you do all of this stuff in SQL, for example, even if you already know SQL, it gets extremely verbose. A lot of people cannot read it, cannot really understand it, and there's always the mental load too when you get your alerts. We wanted to make it as simple as possible and easy for engineers to adopt, and that was the play with Soda Core as well. So I'm happy that it hit home, that you guys liked it.
Along those lines, we've integrated Soda alerts into our Slack channels. So we actually include the client in the shared Slack channel. So they see the alerts too. So it's-
Not all of them.
Yeah, normally you shy away from that, right?
Yeah.
Because you expose your internals so much, in a way, to the customer. A lot of people are very nervous about it, let's put it that way.
Right. But yeah, I was going to say, it's nice when they see their green checks that have passed; that's great. But then when there's an issue, which happens to everybody, we can have that kind of direct dialogue with the client and be like, "Hey, no worries. This has exceeded the threshold, but it doesn't have downstream impact on the output." And like you said, it's not in a verbose programming language, it's in plain English that all stakeholders can understand. So it's a great collaborative environment.
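Soda Cloud ships alert integrations out of the box, but for a self-managed Soda Core setup like the one described here, a thin wrapper is enough to get failed checks into a shared channel. A hedged sketch; the webhook URL, file names, and result-dict keys are assumptions to verify against your soda-core version:

```python
# Sketch: run a Soda Core scan, then post any failed checks to Slack.
import requests
from soda.scan import Scan

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # hypothetical


def scan_and_notify(checks_file: str) -> None:
    scan = Scan()
    scan.set_data_source_name("warehouse")
    scan.add_configuration_yaml_file("configuration.yml")
    scan.add_sodacl_yaml_file(checks_file)
    scan.execute()

    results = scan.get_scan_results()
    failed = [c for c in results["checks"] if c.get("outcome") == "fail"]
    if failed:
        lines = [f":red_circle: {c.get('name')} on {c.get('table')}" for c in failed]
        requests.post(SLACK_WEBHOOK_URL, json={"text": "\n".join(lines)}, timeout=10)


scan_and_notify("checks/transformed.yml")
```

Posting into a channel the client shares, as FirstParty does, turns the same message into both an internal alert and a customer notification.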
So what was the pre-Soda world? What was that like?
That was my [inaudible 00:27:30]-
... yes. It was ugly. It was Friday nights. Yeah, we had a habit of doing the main deliverables on Friday, and it was painful. But yeah, I mean, look, we did a decent job; we weren't atrocious. It was kind of a threefold process. We would be rigorous on the code and logic that was checked in: it had to get peer reviewed, everything had to be sound there. We had monitoring on our Airflow jobs to make sure that all the various pipeline steps ran successfully, and if not, alerts would be triggered to shared channels. So we had all that visibility.
And then finally there were reviews of the output file that was sent directly to the client, via an S3 bucket or an internal production table in Redshift. We had some Excel and notebook [inaudible 00:28:17] to kind of QA that. So that was a decent setup, but it did leave you exposed, where if there is an issue in the output file, you have to work backwards. And in some of these pipelines that Ben and Jolie touched on, there are 20 or 30 different steps of data transformation and enrichment and mapping. It's a lot to work backwards through.
So with Soda now, it's incredible, because I think we have hundreds of checks in some of these pipelines. At every step of the way, there's no doubt in our mind that the step ran successfully and the output is good, and then we can proceed. Whereas if just the output file is messed up, it's a wild goose chase and we've got to figure it out.
And how did you create the operating model around it? So when do you guys start thinking about the checks? When do you involve the customer? Who's involved with that internally?
Yeah, I think it's a collaborative effort. Our process first starts with ingesting the raw data directly from the client. And we kind of know what the target output data products will be, so we know what fields are important and what's necessary to be able to produce the desired output. And then as we build these proofs of concept and different data products, we have various transformations that happen within the pipeline.
And I'd say we probably finish the pipeline first and figure out the transformations. Then we go back and figure out what checks are sufficient to catch anything unexpected at each step of the way. I think that's a normal process: figure out the end product, build Lucid charts to diagram all of the steps of the pipeline, and figure out how to implement the checks to make sure that nothing breaks.
Yeah.
And going off of what Tommy was saying, once we finally have a productionized table, we'll put down proactive Soda Core checks. As the engineer, you know what you want out of your table: where are nulls allowed, where are they not allowed? And then based off of that, you know that you have a set of eyes on that table 24/7, whenever it gets changed.
Yeah, it's an extra pair of eyes.
Exactly. It's not you, which is nice. So that's like-
Just for some peace of mind.
Yeah, it's definitely way more peaceful. So that's kind of the first stage: the pipeline's built, and then you have initial checks on the pipeline. And then moving forward, as data issues arise, it's, "Oh, how could we get closer to that issue? Where else can we put checks down?" And that's how the pipelines evolve, and then ideally it takes less time every time, and there's a slow decrease of data issues and issues with the clients and things like that.
What does your internal tech stack look like? Do you standardize it for all your clients, standardize it all into one, or do you follow to some extent the technology that the customer has?
Yeah, typically the customer's tech for the most part has no impact on the decisions that we make and what we use to solve problems. We're mostly an AWS shop. We use Redshift for our data warehouse and those types of big data transformations, and Spark to speed up particular jobs. We're pretty much a Python shop as well; Python's great for data science and data analytics, and our engineering is also heavy Python.
And then depending on the type of problem: Lambdas, SQS for queuing, and we play with SageMaker for ML stuff. We also just write a lot of ML ourselves. It's all pretty standard tooling. We have Snowflake as an ingestion and exit point for data, because some of our clients want to either deliver or pick up there. Yeah, pretty standard stack all around: Airflow for DAGs, and, yeah, Postgres, DynamoDB, all the very standard stuff.
Everything under the sun.
Yeah, all the AWS tools. Yeah.
Awesome. That's cool. And if we think about the efficiency gains that some of this tooling can give you, do you have examples in mind of where it has helped you the most? I think it's always nice to talk about use cases, so I'm not sure if-
You mean like tooling within AWS that's been an improvement, or improvements particularly from Soda?
Well, particularly for, I guess, data quality control. Any certain kind of improvements where you'd say, "Well, we used to do it like X, but now forget about that whole process, this is how we do it"?
Well, I think a lot of the manual stuff we don't have to do now; there's so much that's covered in Soda that it frees us up. And again, as I was mentioning, it enables us to multiply our time to do other things. Ideally we want to be concentrating on building IP for the company, not manually checking data every day or something. So a lot of those processes have changed or matured, right?
Yeah, big time. I mean, before Soda, all of these pipeline steps would run, then we'd kind of be waiting for the output to hit, and then we'd be crossing our fingers once it was there. It's like, "All right, let's check it out. Hopefully it looks good, or else the weekend's shot." So that scenario has totally gone away, thank goodness, since we literally catch issues at the top of the funnel. If a client sends us wrong data, like that issue that you raised, we catch it right away and we don't run the pipeline, because the data at the top of the funnel is incorrect. I think that's it.
I mean, that's a huge one, right? If you're catching stuff at the end, at the final landing of that data, you've already run up all that money on the pipeline, you can't even use the data, and you don't even know where to fix it.
Yeah, [...]
Soda allows us to be super surgical, right? We roughly know exactly where it failed. And Jolie worked on an integration, so our Slack notifications go directly into Jira with a description and everything. So any engineer on our team, even if they're not really well versed in that pipeline, could go pick that up and diagnose it to a point where they can at least answer back in the client channel, "Hey, we saw this, looks like dupe rows, you need to check this out. We'll keep you posted if we find anything more." It doesn't require a lot. But that's a huge trust signal and benefit to our clients, that we can be that surgical about their issues and that fast to respond.
I think that's at least one of the things that I've taken away: in order to become surgical, you pretty much need to have checks on your data everywhere, and as early as possible. That in a way gives you the exact, "Okay, this is the first point of failure, let's analyze." You can start communicating about it and everyone's happy. Well, not until they fix it, generally.
We didn't really talk that much about the very end of the data products. What are the end benefits? The cool thing about data products is that in every industry you see companies figure out new ways to become excellent, better than their competition; they use data in some innovative way to become truly much more competitive. Do you guys have examples of that within your customer base? [inaudible 00:36:13].
Sure, yeah.
Yeah. Just to give a concrete example of a client success story and how the final data product was used by the market to generate those valuable insights you're talking about. I was even thinking about it before: it's the point of sale customer that we built the first pipeline for. Maybe we shouldn't tell them it was the first one we built, but it's okay. Anyway, this is a large, established company that had about 25,000 point of sale systems in various convenience stores across the US. And their core business is delivering value to the owner of the convenience store: helping them get money from brands whose products they're selling at their store, creating loyalty programs, basically helping the owner of the store. That's their core business. But as a result of their success in that core business, they were collecting a ton of really interesting data from those point of sale systems.
It's literally item-level detail on what people are buying, from a large number of locations, millions of individuals across the US. And before we worked with them, that data was just sitting internally, not being used by the end market. So we had an extensive process with them where we dove deep on what they were collecting, the market opportunity, and the different products that could help them meet that market opportunity.
And to your question, who's the end user of that? In this case it was various hedge funds that were looking to get interesting insights into brands and companies that are tracked in this data set and weren't tracked elsewhere, because credit card and debit card data is very common in the hedge fund space, but this point of sale detail wasn't. Also, this data was mostly Midwest-skewed, an area that just didn't have a lot of visibility elsewhere. So this was a really interesting data set that provided deep granularity on the performance of various brands that were being sold at convenience stores. Think tobacco, hard seltzer, all the stuff that's bad for you, basically, but makes a lot of money. Really interesting detail on that.
They have information that helps them create, I guess they call it, alpha, or?
I guess, yeah, that works, yeah.
They know more than everyone else. Doesn't mean anything.
Yeah, you get an early read on a consumer brand. And we could also track individual shoppers' preferences. You could see brand switching, brand preferences for different regions, different states, or even zip codes; city versus rural buying behavior is really different. So it's a really cool data set. We were able to unlock value that was just sitting in their internal systems and that was valuable to a bunch of end consumers.
Nice. So Jolie, the future of FirstParty and Soda?
I know this. I know what it is.
Where does it need to go? And feel free, I'll take notes in the meantime. Anything we need to improve?
I guess first and foremost, in an immediate sense, we will continue to implement Soda Core checks as new data issues arise, to get as close to the source of these problems as possible. Moving forward, some more interesting stuff: we're really excited to start exploring providing our clients with SLA dashboard views.
Long nights.
So that hopefully they can gain-
Just to bring that even closer to-
Yeah, we'll show them everything. But yeah, just to give them more transparency into their data and what we've been doing for them. And finally, I think Soda Core was in some ways an afterthought for our first pipeline build, and it was a little better integrated in our second one. But as we move forward with new pipeline builds and new clients, it's going to be core, pun intended, core to our pipeline-
The way you were.
... building process, because it's really important to have those checks on every time a production level table is built.
Yeah, even earlier then: as you define what the scope of a data product is, what it needs to deliver, et cetera, you're actually building up a lot of knowledge as a team about that thing. Then you might go on to the next thing afterwards, et cetera. And as you're doing it, that level of detail is not only a description, it's something executable that you can start tracking [inaudible 00:40:42]. That is the best possible time to capture it as well.
Yeah.
All right. Ash has been like, "More drinks, more food, that's coming." So one last thing, because you guys, I think, are at the forefront of building data products, right? For your customers, that's really your bread and butter. So what would you share with everyone else that maybe doesn't have that many data products for their customers, that doesn't have that direct feedback yet, that's a bit earlier in their journey? What advice would you give? I'll keep it open to whoever wants to answer.
I think part of the expertise of this company and the founders, from their prior companies, is that they have the networks, they know the people, they understand these markets, and they know how to monetize, productionize, and do all this stuff. So even if you're not sure whether there's value in your data, that's our job. And like I said, with the 300 key-value pairs packed in that thing, we'll figure it out. So just send us your data, we'll look at it, we'll analyze it, we have experts, and we can figure out if there's a monetization strategy in the data product.
Yeah.
And yeah, for me, Ben touched on it, but prioritization. I've been at shops where people can get really distracted: there's cool stuff that you could do, or you get a one-off request, and you can start going off and building it. But I think it's really important, as you go down the route of building data products, especially if you're trying to sell to external audiences, to really get some market intel and create an evidence-based strategy on what you're doing and why. That's a lot of the value add that I think we provide, because it's really easy to get distracted and scattered as you build for different markets. It's important to keep that in mind, or else you burn precious resources and have a negative ROI, which is not good.
And then, yeah, the other thing is just that quality is paramount, right? You can do all this work, and if you deliver a data product that has a glaring error, it really can ruin the relationship and will just diminish the trust that the end user has. And they won't want to work with your data, because they can't trust it.
Yeah. The quality checks create trust. You share them with your customers, and already from that point on it creates a, "Oh, those guys, they're thinking about it." Just that alone helps. Very cool. Maybe we can open... I had a couple of gifts: Belgian chocolates, obviously.
Oh, yeah?
Oh, yeah?
They're nice, but I forgot them in my bag. I'll go get them in a little bit. But I don't know, are there any questions from anyone? Let's open it up. Let's start here.
Hi, I'm Ben. I'm a data quality engineer. My question's addressed to the gentleman at the end, and I'm really sorry I didn't catch your name.
Ben.
I'm Ben as well.
Ben. Awesome. I had a question about your company's zero defect policy. So I've heard of this type of policy before, but your company is a little bit unusual in that primarily what your product is is transformed data from a third party, where you don't have control over the third-party data. So how is your zero defect policy defined? Is it a zero introduced defect policy, or is there some other metric?
Yeah. Do you want to answer that, or?
Yeah, actually, that's a great question. I think, yeah, that's right, it's important to make the distinction with the data that's coming directly from our clients. Obviously we can't control how they're collecting their data and then transferring it to us. So when we say zero defect, yeah, it applies to our processes: it makes sure that we don't introduce any errors as a result of our transformations, our mapping, our aggregations, whatever logic we're applying to the raw data, because that's really in the realm of what we can control. If the client sends us erroneous data at the top of the funnel, we'll alert them immediately and we won't actually execute the pipeline. But we can't commit to saying that data coming from our client that we haven't touched yet is going to be zero defect. So it's really internal, for us, in the realm of what we can control.
And to that point, another type of data we work with is clickstream data. So we'll take a large volume of clickstream data, and Jolie and Albert have worked on this, where we'll try to determine purchase costs for public companies, to see, "Okay, is somebody purchasing at Chewy or Amazon?" And this value shows some correlation that we can tie to some EDGAR filing KPI for that company. That's going to be put into an alpha generation strategy, as you were talking about. That has to be zero defect. That cannot be wrong.
So the data that we pull down, the statistics or ML that we apply: yes, there's going to be variance in things that are probabilistic by nature, but really, that stuff has to be very accurate. Because if it's wrong and somebody puts it into a strategy... we lose that customer, right? They're not... yeah.
How are you implementing those third-party tie-outs? Is it with Soda reconciliation checks?
Say again? Sorry.
So you described a third-party tie-out, where you're tying a value derived from your own data to a third-party source. Are you using Soda's reconciliation checks to perform that?
Yes, we use that one. It's great.
Yeah.
So you're a Soda Cloud customer?
We're actually so... Reconciliation checks were one of those things that came pretty late. We have all these different check types, but reconciliation checks were kind of the hardest one, because they need to connect to different data sources and then reconcile things. So yeah.
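For context, a reconciliation check compares a dataset across two data sources, for example a table as it lands in Redshift versus the copy shared out through Snowflake. A rough sketch of what that can look like in SodaCL, with hypothetical dataset and data source names (the exact syntax is worth confirming against Soda's reconciliation docs):

```yaml
# recon.yml - tie out the same dataset across two data sources
reconciliation pos_orders:
  label: Redshift to Snowflake tie-out
  datasets:
    source:
      dataset: pos_orders
      datasource: redshift_dw
    target:
      dataset: pos_orders
      datasource: snowflake_share
  checks:
    - row_count diff = 0                  # same number of rows on both sides
    - duplicate_count(order_id) diff = 0  # no dupes introduced in transit
```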
So since this is a tech event, I feel obliged to ask the GenAI question.
Great.
Yeah. There you go.
Let's go.
So my theory about governance of data quality is that it's probably going to become much more important, because GenAI is prompting a lot of companies to start to invest not just in language models, but in larger workflows that start to do very cool things with words and with data, or I should say facts from tables. And I'm curious if you're seeing any renewed interest from your clients in data observability because of projects that have GenAI attached to them?
Yeah. So we have a client where we're doing entity recognition and entity resolution through LLMs, right? So trying to take something like a transaction stream. Normally it'd be solved with a recurrent neural network, a training set, and a reference table; we're trying to move away from those hardcoded reference tables and RegEx towards models that can determine it from a transaction string. You have "A and F": can you map that entity to Abercrombie & Fitch? And so on and so forth. But these things are probabilistic by nature, so it may map differently. There may be things there.
So being able to enforce accuracy there is something that we're trying to figure out. We have a client that wants us to return the top N results, say the top five, but the LLM, if you think about how an LLM works, is probabilistic: it's predicting a probability for the next sequence of-
Right, words.
... words or characters, right?
Yeah.
So we can tell the accuracy there, but how do we derive the accuracy of the five results returned from that? That's kind of a black box with the LLM; there isn't an accuracy for the individual results. So we're exploring that for a client right now, trying to figure out, "We can return one thing that's very accurate; how do we return five things with degrees of accuracy from the LLM?" So yeah, it's those types of things we're starting to look into.
Okay, very interesting. So does that involve fine-tuning the LLM on domain-specific data?
And we are doing that, yeah. Right now we're fine-tuning Mistral, the 7 billion parameter model that came out, on our data. And now we're actually trying to figure out what GPU platform we're going to run that on, because it's going to require a certain amount of VRAM. It probably won't fit on one GPU, probably a few GPUs, right? So we're trying to figure that out right now.
Very cool. Thank you.
Yeah, sure thing.
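Ben's sizing instinct checks out on the back of an envelope. A rough sketch, assuming full fine-tuning in bf16 with Adam (fp32 master weights plus two fp32 moment tensors per parameter); adapter methods like LoRA need far less:

```python
# Back-of-envelope VRAM for full fine-tuning a 7B-parameter model.
params = 7e9

weights_bf16 = params * 2   # ~14 GB of bf16 weights
grads_bf16 = params * 2     # ~14 GB of bf16 gradients
adam_fp32 = params * 4 * 3  # ~84 GB: fp32 master copy + m + v

total_gb = (weights_bf16 + grads_bf16 + adam_fp32) / 1e9
print(f"~{total_gb:.0f} GB before activations")  # ~112 GB, so multiple GPUs
```

Even before counting activation memory, that is well past a single 80 GB card, which is why the plan above is to shard across a few GPUs.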
[Inaudible 00:49:35] and this might be a more silly question, because actually it'd be great to hear [inaudible 00:49:46]. Because at the end of our pipeline [inaudible 00:49:49] were saying [inaudible 00:49:51] to primarily provide insights in different languages. So at that level of checking, at that section within the pipeline, what are your thoughts on a roadmap for checks on the content itself, checking it against the context of whatever is passing through, so it never goes out wrong from wherever it's run? Is that something that you're looking into? Are those types of checks involved?
The simple answer is, first and foremost, not specifically. No, we're not looking at or thinking about that particular check type right now. However, we've built it in such a way that we want to become the biggest library of checks. And we're always looking at customers, paying or not paying, it doesn't really matter, who come to us and say, "Well, this is what we want to check, because it's a recurring pattern," and we will build it, because it's so easy.
We've made it super extensible from that perspective. We do of course want to make sure, because we're introducing language constructs, that they make sense for other customers as well, that the use case is repeatable, and then we go out and do it. And honestly, if I look at reconciliation checks, they were quite complicated, but we did it with only two people in the team for about a week or so, maybe two weeks, and that was it. So all the other check types are usually way simpler and faster. So just ping any one of us in the community, or email, or whatever; we'll write a little feature spec that kind of outlines it, we'll share it with you and our customers, and then we'll just go at it. So we don't have anything right now, but we'd love to extend it in that direction.
The only other thing worth mentioning is that there's a Soda-FirstParty channel that some of you are in, where we can respond directly. And there's the Soda community. I know Jolie's been in there asking questions, "Hey, we're trying to solve this." When we were doing our Z-scores, there were a lot of ways we could evolve that, and we were asking the community. The community's been great and the support's been awesome.
So awesome.
You have a story about features, right?
Yeah. So this is kind of a good answer to your question. Back then I had a question about running checks over a range of variables, because we wanted to run our checks over all the days that our pipeline was updating for. So I messaged in the channel, "Hey, can you help me run checks concurrently? How do I do that?" And they said, "Oh, just try it. I don't know if it would work or not. So go for it." And then from there, over the course of multiple weeks, I would run into an issue, ping the channel, "Hey, this didn't work, what do you think about this?" And we got a nice dialogue going back and forth. And it ultimately led to the release of a new feature to help maybe [inaudible 00:52:48].
I've heard that was the same day even. Well the thing is like-
We didn't name it Jolie, but well [...]
No, correct code name Julius.
It'll be named after me. But I think with Soda Core, where there's a will, there's a way. Sometimes you can do a workaround on a Soda Core check to still get all of the amazing value out of the tool for your use case. And then if you really run into a brick wall, the team will be there to solve it, and they'll literally write a check for you if [inaudible 00:53:17].
Yeah, we love it. We want even more engagement in the community than we have today. I think it will come; we have a few thousand people in there already, not necessarily all active, but if we look, almost every day there are questions there. And we pride ourselves on answering the same day at least, and ideally getting to a solution the same day as well. So yeah, we'd love to, or we can catch up, we can kind of work out how to get that done...
But now, so what would you say is the definition of business value?
I think in our case, what matters to us is that what we produce drives some incremental revenue for the client that we're delivering it to. Because if they're not successful in launching that product, in increasing revenue for their-
I think it's just revenue.
... company. Like yeah.
No deploy Fridays, a zero defect policy, and preemptive quality checks: we can all learn a lot from a data company whose core product is built on data. Good data rules their world. There'll be more from Club Soda. Stay connected, visit Soda.io and discover our community, our product, and the potential for data you can confidently build your business on.