Let’s meet Paco Nathan, the player/coach affectionately known as the evil mad scientist. Paco has more than 40 years of tech industry experience with expertise in data science, natural language, and cloud computing.
In this episode, you'll hear how his early career choices have become industry milestones, how he used data while serving in the US Army, and which tech game-changers he's worked with. Paco has built up a practice as a hands-on manager; he loves leading smaller teams and eschews job titles.
Welcome to the Soda Podcast. We're talking about the Data Dream Team with Jesse Anderson. There's a new approach needed to align how the organization, the team, the people are structured and organized around data. New roles, shifted accountability, breaking silos, and forging new channels of collaboration. The lineup of guests is fantastic. We're excited for everyone to listen, learn, and like. Without further ado, here's your host, Jesse Anderson.
My guest today is Paco Nathan. Paco Nathan is known as a player/coach. We'll ask what that means. He's done all kinds of things. He's a good friend of mine. I'm actually going to learn even more about Paco today, some things and some stories that I haven't heard about before.
But one of the things that I think is important and interesting for this conversation was in 2015 he created a series of training videos for O'Reilly called Building Data Science Teams. That was kind of the catalyst for my journey as well and I think a bit of his journey. So let's get into it, Paco. What sorts of things should we know about you? What makes you a full-stack technology person?
Thank you, Jesse. I'm delighted to be here. Thank you for the kind words. Let's see. Full-stack developer? Well, a lot of it is that I had training in some different areas early on. That ranged from working with math and very advanced math applications in data analytics all the way over to machine learning, which was really early at the time, and then all the way over into circuit design and working with hardware and distributed systems. So I had kind of a diverse background starting out, and I think it just led to a lot more development of interdisciplinary work.
For instance, I made a fundamental mistake about 40 years ago by studying, against all advice, artificial intelligence in grad school. Everybody told me that it was a non-starter, that it really wouldn't lead anywhere. Then I made the further mistake of doing a deep dive into an area called neural networks. I spent several years working there, including on hardware accelerators for neural networks. This was in the '80s and '90s. So, frankly, I got a pretty good grounding in some interesting areas, but during the AI winter nobody cared, so I went off to be a network engineer. I did work in building cross compilers for a while. I did a lot of work with embedded systems during that period. And I think at the end of it, coming into the early '00s, I had a more balanced background, so to speak, on how to leverage the resources.
That's pretty impressive. Most of the time when I say I'm a full-stack developer, it means I did more than one thing: I did front-end and back-end. When Paco says full stack, that means full, down to designing your computer for you all the way to the chips.
So you also talk about yourself as a player/coach. Tell me more about why you call yourself a player/coach.
Certainly. You were mentioning Building Data Science Teams; I want to give a shout-out to a mutual friend of ours, my friend DJ Patil. I think he'd actually written an article called Building Data Science Teams just before I shot that video.
If you go back to the early days of big data and a lot of the industry adoption of having data science teams and data engineering teams, there were certainly elements of what I would call more old-school approaches, where the executives would expect to get a manager in place for a data team who was really just a manager. They went out and delegated, and they hired the right PhDs in data science. It was a fairly old-school kind of notion. My view of that was abject horror, because I felt like the 1960s had called and they wanted their dinosaurs back. Realistically, in this space there weren't any PhDs in data science. That just didn't exist, and I probably wouldn't have hired them anyway.
Instead, I really built up a practice as a very hands-on manager, typically leading smaller teams. I think my high-water mark was maybe 25, but usually it was a much smaller team. I liked working with people who could wear a number of hats. I strongly agree with, say, Eric Colson on that point of having more generalists. So I always kept very hands-on myself as well. I think that when you're really going to bring value for data analytics, you've got to be able to cross that divide between what the business needs and what the data's telling you.
That's interesting. Did you happen to watch the Netflix show called The Playbook?
I've heard of that. No, I haven't seen it yet.
You haven't? Okay. In there it's about these interviews with these coaches. It's super interesting because sometimes the coaches were players, and sometimes the coaches had never played. It was interesting seeing the players... They also interviewed the players that played for them. Sometimes they'd say, "Hey, this is great that the coach has actually played and can understand it." I see that especially in data teams where they could say, "Hey, you're not just dictating, you’ve actually been in the trenches." Is that what you're seeing?
Yeah, very much so. For me, it's very poignant in the sense that executives often would want to delegate responsibility, in some cases delegate blame. I won't name names, but I was at a firm that invented the term "lean startup," and was also really early in continuous deployment. The product managers had the idea there that they were going to delegate responsibility for A/B testing to a data science team. And if the product was crap, frankly, and the A/B testing results came back saying, "Really, nobody likes this," they would blame the data scientists because they were the ones delivering the report. I was like, "No, actually I think the product manager deserves some blame in this case."
Being more of a hands-on leader in data science, I could go to the math and I could say, "Hey, wait. I can tell you what's going on here. Don't shoot the messenger and blame the analyst. There's something fundamentally wrong in the business approach. The data's telling us, and here's why." I think that's where the player/coach side comes in. But there's also the mentoring: over a number of years I really felt it was important to invest time in mentoring certain people, and I kept up those relationships. It really helps over time. I mean, some of them are now engineering managers at Apple and Google and whatnot, or already have their own startups. Some have even moved into exec roles. It's great to be able to see the world partly through their eyes. Because of that relationship over time, it definitely opens me up to emerging priorities that I wouldn't have seen in my own experience. So I think that mentoring is a two-way street, and it's also part of the player/coach role.
Switching gears a bit, you learned about the value of data pretty early on in your career. Would you mind telling that story that you've never told before? In fact, I've never heard this story before either.
Yeah, that was a fun one. I was in the US Army, in a unit that really liked doing parades. Parade competition was kind of big there. I had a lot of background... Well, I had a lot of access, let's say, to computing. This was pretty early in my career in computing, but I definitely had mainframe access and a UNIX box, and I was on MILNET. I was on ARPANET early on. So next month I'm coming up on celebrating my 40th year of using email for work, which is kind of a milestone.
Long story short, I landed in a unit. The company that I was in would always come in last in parade competitions, like the regimental parade contest. The problem was that this particular part of the Army had a long tradition of parades; they'd been around for over 150 years. During that time, their idea of learning and feedback was that you reward the winners and you punish the losers. I was in a unit that was getting punished and not rewarded. Oddly enough, one of my colleagues in that same unit was the first person I ever knew who owned a Portapak. This was back before camcorders. He owned a Portapak, and oddly enough, he'd also broken his leg, so he couldn't march.
I was on guard duty one night, and I was looking at this stack of judges' scores that had come in. It was basically all of our failures, the data about our consistent failures. Everybody was just upset, grumbling about it. But I was staring at it. I was working on some Fortran code for statistics at the time, and I thought, "Wow, what if I just enter all that data into a file? I'll run analysis of variance and check which are the parade turns where our unit is consistently getting low scores." Part of this was looking into the data and what it was showing. What we were able to do was identify where we had those pain points and then send out the guy with the Portapak to videotape. Then we would go back, show the videos to the squad-level leaders, and find out which individuals were really making the mistakes that were causing the bad scores.
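A minimal sketch of the kind of analysis being described, assuming hypothetical judges' scores grouped by parade movement; the movement names and numbers are made up for illustration, and the original work was in Fortran, not Python:

```python
# Illustrative sketch only: hypothetical judges' scores, grouped by parade movement,
# to find where a unit consistently scores low (one-way analysis of variance).
import pandas as pd
from scipy import stats

scores = pd.DataFrame({
    "movement": ["column_left", "column_left", "counter_march", "counter_march",
                 "eyes_right", "eyes_right", "halt", "halt"],
    "score":    [6.5, 7.0, 4.0, 3.5, 8.0, 8.5, 7.5, 7.0],
})

# Mean score per parade movement highlights the weak spots.
print(scores.groupby("movement")["score"].mean().sort_values())

# One-way ANOVA: do the movements differ significantly in score?
groups = [g["score"].values for _, g in scores.groupby("movement")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```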
Anyway, long story short, we won the regimental competition that year in parade. The brigadier general in charge came in, WTAF. He was like, "How? How did this happen?" We had a CO who was, I think it would be polite to say, a little clueless in some areas, but he was our commanding officer. He just flippantly said, "Well, there's a couple of guys using computers and videos to train soldiers." At which point the brigadier general just flipped, because this was the early '80s, and this was actually a really big research priority back then. If you've ever studied some of the history of where GPUs come from, the US military was really investing in video feedback and computer-augmented early virtual reality. When the general heard about this, basically the next thing that happened was we had people flying in from the Pentagon to do a code review.
I go back sometimes. I've been back to that unit. These days they do combat training in these VR caves. The teams go into VR. I like to think that we had at least a small role in the history of moving the needle there, of moving away from this idea of reward and punishment at a very high aggregate level down to very fine-grained detail of individuals getting immediate feedback through data.
That's an awesome story. Are you worried that somebody from North Korea's going to hear that idea and learn how to march better?
No, I'm worried that the Navy will hear that and learn how to march better. Have you ever seen the Army/Navy games? That is very telling.
You don't need to march very well if you're on the sea, I guess.
Yeah, exactly. Where's the boat?
You've got to get your sea legs about you. Now, coming forward from the '80s, what were some of your significant milestones and game-changing moments that happened in the more recent past?
I was fortunate to get to be a guinea pig for AWS. Just the timing: I was technically a co-founder of a startup that got funded at the same time that AWS launched, so we went all in. We were one of the early companies doing 100% cloud architecture. I got to work with the AWS team closely. Also, there was a paper coming out of Berkeley a couple of years into this. The paper was released in early 2009, so it had two years of cloud history. It was Dave Patterson, Ion Stoica, and it's called Above the Clouds. I got to critique that based on industry experience. Then I gave a guest lecture at Berkeley about it. There's a video of it where Dave is basically doing a thesis defense on me, and I'm sweating it out for an hour. Anyway, long story short, there were some really interesting people in the audience. Matei Zaharia was a first-year PhD then. Most of the founding team for Databricks was in the audience. It was really great to bring in this industry perspective, having a couple of early years of cloud experience at scale, and then talk with people who would end up also being real game-changers.
That led into the mid-2010s. I was involved at Databricks during its hypergrowth, as a community evangelist for Apache Spark. We saw this rise of open source, whether we're talking about Spark or pandas or Jupyter. All of these open-source projects led to a lot more advanced tooling that gets leveraged in business. I thought that was a real game-changer, because what we saw was that IT no longer held dominion over the resources. Instead, the resources were much more opened up to the line-of-business units. Literally, our early customers at Databricks were line-of-business managers doing an end run around some of the latency and laggardness of IT.
Then getting into the late 2010s, the other milestone: I had seen neural networks evolving since the mid-'80s. For what it's worth, at one point I worked on a team with John Hopfield, so early, early neural network stuff. But by the 2010s, of course, people were taking deep learning seriously. By the late 2010s it was a fact of life. One of the things I really liked, say circa 2018, was where Nvidia and others began taking a more holistic view of business applications for machine learning and where the typical bottlenecks are. When people think of deep learning, they probably talk about using GPUs, and they think about resolving a CPU bottleneck. That's not the case. In the larger business applications of deep learning, you're typically bandwidth limited, and you have to be thinking about data transfer rates and the network. So one of the things coming out of the late 2010s was a much more holistic view of performance analysis and how hardware and software have to evolve together.
Now that you've got the cloud and the infrastructure, you start to switch your ideas and thoughts to your team. What's your ideal sketch or layout for your team, for your data team I should say?
I do tend to like smaller teams, and I do tend to like people who wear multiple hats, again, people who have some background in business but also a background in math and a lot of coding. I really, really don't like titles; I sort of eschew titles. And I really don't like highly specialized roles. I mean, I can see it in a large enterprise; of course, there's a need for some. But when I'm running a data team, I want people to be a bit more open to pairing with experts in other fields. So I tend to like having a small team that has expertise at the systems level as well as serious software engineering expertise as well as the analytics and the math that's driving it, and, of course, staying close in with the people who have the domain expertise in the business. That does lead to a lot more hands-on work, but I just feel like that's the proper way, and I see results from it.
The other thing, too, is I've almost used a graduate seminar approach. There's a lot of argument about, do we use a service bureau model? Do we use hub and spoke? Do we use embedded teams? How do we deploy the data experts in an organization? I've been engaged in some of these arguments for large organizations where they're grappling with it: Facebook and Google and others. My view is I really like having almost a graduate seminar approach where the different data science teams get together maybe once a week for lunch. People put their ideas or what they're facing as challenges up on the board. Other people comment or shoot it down or whatever. Maybe you can have stakeholders coming in, but they have a non-speaking role. It's kind of a safe place for the data people to sort through issues just like in a graduate seminar. I found that that works really well in industry.
What you're saying reminds me of what Eric Colson says they did at Stitch Fix. It was a luncheon where they talked about those sorts of things.
Yeah.
A question about something you mentioned: getting too specialized. What's too specialized? Do you think data scientist is too specialized, or do you mean something like, "I only specialize in graph manipulation for squirrels"?
Right, exactly. I would run screaming if I encountered people like that. I remember interviewing circa 2008 for a data team that we had to staff up very rapidly. The company pumped a lot of money into recruiting. But we would get these people who were like, "I did my PhD on this obscure area of support vector machines," and it was sort of like graph computing for squirrels. Yeah, I'd run screaming in the opposite direction.
Frankly, I have never used the term "data scientist" to describe myself, and I would get very nervous to do so because I just don't like the term. I don't like the term "data engineer," although I see where both these are necessary, but I just really eschew titles. So I would tend to want to see more generalist people. Obviously, there are some people who have a PhD in stats. They haven't really spent the time to pick up software engineering or had the opportunity to pick it up yet, but maybe they want to. You're not going to throw them at building out your Kubernetes cluster. You're going to give people roles that play to their strengths but also help them to grow in their role and learn new areas. The short answer is even data scientist would be a little bit too specific for me.
That's interesting. You're killing me a little bit, Paco. Here I wrote a book about this. Maybe you need to write the rebuttal for me.
Again, this is a personal bias. I really do like smaller teams. Definitely riffing off what you're presenting wonderfully in Data Teams, I can see that when you get into regulated environments like finance, you have to build out a large team, and there's a lot more reason for having more specialized roles. I can definitely see it when there's a lot of compliance and people have to go through a lot of certification to be in their role, whether they're in health care or finance or some of these other areas that absolutely require it. I get that.
The thing that I don't want to see, let me rephrase this a bit, is somebody who is a "principal senior machine learning engineer in training" or some ridiculously long title like that. It doesn't really mean anything, because the fact is that this field is evolving so quickly that you hire people for a specific role in 2019, and by 2022 we're not sure we even need that role. The needs have changed. I would rather see people who are going to evolve in their roles, be a bit more fluid, and pair up with people who have drastically different expertise.
Speaking of roles on a data team, you had a very interesting way of not really looking at roles but figuring out where you're weak. You've actually talked about this; that was where I first saw it, in Building Data Science Teams. Would you talk more about that?
Certainly. We came upon the idea while trying to express to business stakeholders why they should invest in data teams. It's a kind of visual way of doing gap analysis. It's pretty simple. For a given team, write out the top four or five kinds of needs that you have, and I mean at a gross level. For business needs, I don't mind titles there; I like titles. Say that you really need some strong work in infrastructure as code, building up clusters for a given data project. Say that you really need some expertise in time series.
So chart out the top components of need that you have and use those as, say, the columns. Then look at the people and what kinds of strengths they have and use those as the rows. So literally build up a matrix, and then put in some color blobs for the individuals to show where they have their strengths. Once you put that together, you can start to see what's really missing. What's the gap analysis? Do you need to get additional training for people? Do you need to bring other people in who have different types of expertise? That kind of visual gap analysis I've found to be really effective. Constructing a matrix like that is actually a fun exercise for taking a real objective view of your team.
For those of you who are listening, if you're drawing a blank on what this looks like, think of a heat map for people and skills.
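A minimal sketch of that gap-analysis matrix as a heat map, with hypothetical people, need areas, and strength ratings; none of the names or numbers come from the episode:

```python
# Illustrative sketch: a people-by-needs matrix rendered as a heat map for gap analysis.
# Names, need areas, and ratings are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

needs = ["infrastructure as code", "time series", "software engineering", "domain expertise"]
team = {
    "Alice": [3, 0, 2, 1],
    "Bob":   [0, 3, 1, 0],
    "Chen":  [1, 1, 3, 2],
}
matrix = pd.DataFrame(team, index=needs).T  # rows = people, columns = needs

fig, ax = plt.subplots()
im = ax.imshow(matrix.values, cmap="YlGn")
ax.set_xticks(range(len(matrix.columns)))
ax.set_xticklabels(matrix.columns, rotation=45, ha="right")
ax.set_yticks(range(len(matrix.index)))
ax.set_yticklabels(matrix.index)
fig.colorbar(im, label="strength (0 = none, 3 = strong)")
plt.tight_layout()
plt.show()  # columns where every row is pale are the gaps
```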
Oh, yeah.
This is so effective. I talk about it in both my Data Teams book and my Data Engineering Teams book. Please do this. You may find that you're completely weak, or conversely, way too heavy in certain areas. What sorts of challenges do you think you're hitting frequently when building a data team?
Well, from my perspective, as I mentioned, I tend to prefer smaller teams with people who wear multiple hats and probably have more senior background and experience in different fields. I guess, as I said, my bias would be toward smaller teams that are much more interdisciplinary. I do run into a lot of executives who are convinced they must go out and build a really large team and have people in much more specialist roles. There's definitely a tension there. Maybe it's a preference. I see the value for the kinds of areas that I work in. Again, the kinds of areas I work in tend to be more heavy on the math applications and some of the edge cases in machine learning that have good ROI in industry, but they're not going to be something where you're really able to hire a really large staff.
There are teams that I've worked with where we're talking about the behemoths, the 10,000 people, 100,000 people, and that's where we get into some of the issues of actually having to deal with large teams. Have you experienced that at all?
To some extent, yes. Having supported that, having to work within that context, yes, definitely. I recognize the need for it especially like you were saying, when there are regulated environments, for instance, if you want to do data engineering and data science in health care. Ben Lorica and I have been doing these industry surveys and industry reports. One thing that was really, really startling to me is how much of a priority there is on certification if you want to be working on a data team in health care. So you need people who have more specialist backgrounds and more certification if you're in finance, if you're in health care, I can imagine in pharma and other areas like that.
Here's where I would both agree and disagree with you. Given a choice between, let's say, 10 or even 100 beginners versus a five-person experienced team, I will choose the five-person experienced team, because the five-person experienced team will perform significantly better. I think this is where execs get out of whack. They look at the cost per person and think, "Oh, I can't do that," when the reality is, no, you're going to get higher ROI.
You have to do it. I mean there's a lot of room for people coming in at entry level certainly. But if you have a team that's dealing with large risks and large rewards, yeah, working with the smaller, more expert team is always going to be better ROI.
We're going to be talking to Jordan Morrow about data literacy. Would you mind sharing your thoughts on data literacy and some of the data democratization efforts we've been seeing?
Yeah, definitely. I have some mixed feelings on that. On the one hand, it's very important. I definitely would like to see literacy in general for working with data. I'm a little bit skeptical of some of the product claims that I see about data democratization. I definitely don't think that everybody in the world should be a data scientist. I think that's ludicrous. I also don't think it's necessarily a good idea that everybody in a company can walk up, access any of the data, run a query, and then start making an argument with it. Obviously, there are a lot of privacy and security concerns about that, but there's also the point of not just running a data query but of, say, decision science. Cassie Kozyrkov is doing some excellent talks along these lines.
Data itself is not the end goal. Making the decision in a complex business context, that's the end goal. If you just have data democratization as your headline of how you're going to solve the problem, it doesn't go all the way to the last mile. It doesn't go to, how do we support decisions being made all day, every day and getting feedback and making better decisions as an organization? Instead, what I've seen, unfortunately, is this anti-pattern where a lot of the data democratization was leading toward anybody who wants to jump up and become a squeaky wheel and make an argument because they've got a query that proves their argument. That doesn't really help in an organization. So I really have mixed feelings about it.
What I would say is that some of the better voices on this... Definitely, we talked about Eric Colson out of Stitch Fix. He's done a lot of writing at HBR and elsewhere. Along with that, Eric had come out of Netflix, as I understand. Another person out of Netflix that I really like is Michelle Ufford. Both of them have done incredible writing about the leadership of data teams.
The other thing I'd point to, and I really like it, is the Cynefin framework by Dave Snowden out of IBM. That was written in, what, 1999? It's the source of what Rumsfeld quoted in the early '00s with the unknown unknowns. But understanding, do you have a complex context? Are you in crisis management? Do you have a simple context? Understanding that at a leadership level but also at an analyst level is super important, because you don't want to overcomplicate your analysis, and you don't want to oversimplify your context. Both of those lead to disaster. That's where I think some of the root problems with data literacy come in: you get people who say, "I've got some data, but I'm going to oversimplify the problem."
Let me make this a little bit more concrete. I was in an organization where there were product managers that needed information on pricing for a product. They would run the data and aggregate it and calculate an average. They would say, "Okay, our price point for this product is $10." But if you dug into the data, you could clearly see there was a bimodal distribution. If you just visualize it, you could see there's a big group of people that want a $5 product and a big group of people that want a $20 product. If you give them a $10 product, nobody's going to buy it. In fact, that's what happened in this case.
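A minimal sketch of that pricing story, assuming made-up willingness-to-pay numbers, just to show how an average can hide a bimodal distribution:

```python
# Illustrative sketch: an average can hide a bimodal willingness-to-pay distribution.
# The numbers are invented to mirror the $5 / $20 story.
import numpy as np

rng = np.random.default_rng(42)
budget_buyers  = rng.normal(loc=5.0,  scale=1.0, size=670)   # group that wants a ~$5 product
premium_buyers = rng.normal(loc=20.0, scale=2.0, size=330)   # group that wants a ~$20 product
willingness_to_pay = np.concatenate([budget_buyers, premium_buyers])

mean_price = willingness_to_pay.mean()
print(f"mean price point: ${mean_price:.2f}")  # roughly $10, like the story

# Hardly anyone actually wants a product near that averaged price point.
near_mean = np.mean(np.abs(willingness_to_pay - mean_price) < 2.0)
print(f"buyers within $2 of the mean: {near_mean:.1%}")

# A histogram (np.histogram or matplotlib) would show two clear peaks
# and almost nobody in the middle, which is why the $10 product fails.
```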
So part of the problem is that the literacy is not just running a SQL query. Stats is a hard thing. Even for people who work in statistics, stats is a weird, hard domain, and there are a lot of competing views. You really have to know the analytic tools you're using. What do they really mean? How do you really evaluate the edge cases for them? If you don't have a lot of hands-on experience with the edge cases in statistics, you're probably going to misuse them. The same thing with machine learning: if you don't know how to evaluate models, that's a very arcane subject. What is coming out of information gain? What is coming out of the AUC?
I think on the data literacy side, yeah, I would actually go back to the ground and say people should spend more time learning about learning and learning about decision making. Certainly, there's great work on pedagogy. How is it that, say, novices learn a complex topic versus how experts learn the same topic? It's very salient for data, because you're perpetually faced with this challenge: "There's some problem in the business, we're not exactly sure what's happening, and we believe we have some data sources that we can use to zero in on it." So you're perpetually faced with that problem of having to go rapidly from being a novice to an expert in a field, and understanding, again from learning science, from pedagogy, from cognitive psychology, how people function when they're faced with those challenges. I think that's the crucial part. To me, it's a lot more important for people in that kind of role than, say, the nuances of SQL queries.
Most of the time when I hear citizen data science, I think of the South Park joke of, "Step one, hire a citizen data scientist. Step two, question mark, question mark, question mark. Step three, profit."
Exactly.
It turns out step two is... there's a lot more steps, and it's a lot more difficult than that.
I will shout out that I've seen a citizen data science program be really effective inside DOD, especially for intelligence analysts, where when they use the word "analyst," they mean somebody who's an expert in Syria and speaks Arabic and can read the literature coming out. Their citizen data science is a practice in DOD that some friends of mine helped set up. It involves something like 10,000 people daily, where the people who have the domain expertise are learning how to use the data tools and running analyses and modeling themselves, but they're getting coaching and mentoring from people who are data engineering and data science experts. I see practices like that where it's more of a disciplined environment, and the military can pull that off. Some enterprises, like in finance or regulated environments, can pull it off. In your typical Silicon Valley company, it would probably be a shouting match and a free-for-all without quite that discipline.
Speaking of Silicon Valley, we're seeing some of the technology vendors saying, "Hey, our products are easy. We've made data science easy. We've made data engineering easy." What's your take on that?
No, there's nothing easy about it. Especially when you look at fast data... Let's put it this way. Suppose you happen to have a large manufacturing business and you're buying and selling products all over the world, there's a global supply chain, and there's new competitors coming on the radar every day, and there's new regulations all over the world. For that matter, you also have plants with real-time processes happening, things that could go awry or explode or leak or whatever, enormous amounts of data. If you look into an environment like that, from a traditional view, well, maybe they've got SAP. They've got a data lake. They're doing some reporting. But there's really a lot more data under the hood that nobody's bothered to touch.
When we look at the business environment, we're here in 2021 now; if we look 10 years out, we're going to have to account for much finer granularity, pulling in those real-time or near-real-time data streams and doing a lot more streaming analytics. The challenge goes up orders of magnitude. It's not just the data management, but also the analysis: what does this mean? That requires much more sophisticated types of analysis. You know me, I work a lot in graphs. There are a lot of areas for leveraging graph algorithms for disambiguation but also for measuring uncertainty, the kinds of things that just weren't traditionally there in a relational data warehouse.
I think that if anybody's saying that they have done a low-code solution to automate data science and just put it in a black box, they're looking in the rear view mirror. They're looking at the requirements that are already five years, 10 years out of date. They're not looking at what does industry need five years from now. Frankly, we aren't going to have enough people to do it, and we aren't going to have enough compute power to do it unless there's some breakthroughs in physics. I don't really buy the data scientist in a box thing.
Thinking about technologies, what are you excited about for this future, the future that you were just talking about?
Cool. I would say that there are some areas I've been working on that I really like. I've been working with Anyscale and doing tutorials for Ray, and this notion of composable futures is very exciting in terms of what people are able to accomplish at scale, whether you look at QuantumBlack, the team that helped design the winning entry for the America's Cup, and the massive simulations and industrial control problems they had to do there, or whether you look at, say, Ant Financial and what they're doing with Ray and others in manufacturing as well.
I really like what's going on with Ray, whether you're in Python or Java or C++. This idea of composable futures for distributed systems has a lot of legs to it, and I think we're barely scratching the surface. We see that being used by Google with their people at DeepMind. They've got JAX out, which is basically composable futures on NumPy array calculations, if you will, to do deep learning. So it's a total re-architecture of deep learning that fits more with very good distributed systems practice and also very good mathematical properties in terms of functional programming. So I like that area.
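A minimal sketch of what composable futures look like in Ray, assuming a recent Ray release installed via pip; the task names and data are illustrative, not from any of the projects mentioned:

```python
# Illustrative sketch of "composable futures" with Ray: remote tasks return object
# references (futures) that can be passed straight into other remote tasks, so the
# dependency graph composes without blocking on intermediate results.
import ray

ray.init()

@ray.remote
def load_partition(i: int) -> list[int]:
    # Stand-in for loading one shard of data.
    return list(range(i * 10, i * 10 + 10))

@ray.remote
def summarize(partition: list[int]) -> int:
    # Ray resolves the future passed in before this runs.
    return sum(partition)

@ray.remote
def combine(*totals: int) -> int:
    return sum(totals)

# Futures compose: each summarize() consumes a future from load_partition(),
# and combine() consumes all of the summarize() futures.
partition_refs = [load_partition.remote(i) for i in range(4)]
total_refs = [summarize.remote(ref) for ref in partition_refs]
result = ray.get(combine.remote(*total_refs))
print(result)  # 780

ray.shutdown()
```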
Another two areas. One is I'm looking at newer kinds of cluster schedulers. I like the work coming out of, I believe, Los Alamos and Stanford on Legate and Legion. I think there's a lot of room ahead for more intelligent, more workflow-aware, more hardware-aware cluster schedulers. Another area, just to round it out, is I really like working with uncertainty. So when you look at tools like probabilistic soft logic, doing probabilistic graphs to measure uncertainty in data, data quality checks, hypothesis testing, if you will, leading toward more causality analysis, that's a super interesting area. There are also some great applications of reinforcement learning there.
Again, a lot of this stuff, if you try to describe it to today's data teams, they think you're an alien. They've never heard of it. They don't know why it's important. Again, this is why I just stress that this field evolves so quickly. I've got a book out: Dean Wampler and I did a book called Hardware > Software > Process. We did that recently, just a couple of months ago, working with the team leads for open source machine learning at Nvidia. I do think it's about understanding what's happening, how fast the hardware's evolving, and how much the software, again, like cluster schedulers, et cetera, has to be aware of this to meet the needs that we have for data challenges. That's where I'm at and where I think a lot of the game will be played over the next several years.
Now going back in time, what do you think are some of the lessons we could take from those early days of managing data teams all the way to today?
Looking at the early days, one of the things that I ran into very early on, and we touched on it earlier, was this idea of never getting yourself into the situation of explaining the business. You're working with stakeholders who know their business. Make them explain the business. A case in point: your team is charged with troubleshooting some part of the business that's proving very challenging. You pull the data, you do your analysis, you build your visualizations, and you go to present to exec staff. Sort of a data team 101 mistake is that some data scientist walks up, shows the charts, and starts explaining them to a bunch of vice presidents. Don't ever do that.
Instead, make a really good infographic, have the analytics and the data to back it up, drop the graphic in the middle of the table, and let the executives argue over it, because they know their business and they're the ones who should be arguing, not you. What I learned early on was to work with the stakeholders. Make them do their work. Don't try to do their work for them. Don't take the blame for their successes or failures, because it's very easy to do that. It's just a common anti-pattern that I see in organizations.
What keeps you up at night?
I think I was joking about a fine balance of melatonin and caffeine. I really like to get up in the middle of the night and write code, write unit tests, and outline a talk or something. I do enjoy working at night.
Things that really worry me: I've got to be honest, I was around when Mesos was making a run for it. I definitely saw the wars between certain types of cluster schedulers, whether we're talking YARN or Mesos or what was going on at Google. Eventually we had Kubernetes come out, and that's great. There's a lot of great infrastructure built on Kubernetes. But I do think we've kind of backed ourselves into a corner in some ways, because we have data infrastructure that's really not all that aware of what's going on inside of workflows and not all that aware of the nuances of what could happen in the hardware.
Certainly we make this case; there's a lot of great material in, like I say, the mini-book that Dean and I just did. I would like to see a world where you can run a workflow and have a scheduler that understands memory objects and how they should persist when you're trying to execute something. I would like to be in a situation where I can run a cluster of servers in the cloud that have GPUs inside them and be running a trillion-node graph. Technically speaking, it's possible with the hardware. Technically speaking, the software management layers get in the way. I know the business needs for fast data, for large-scale data, for very expensive computation, like I say, graph algorithms, disambiguation, and things like that.
I know that the business appetite for that is increasing, and the hardware is coming along. Certainly when you look at the multi-node, multi-GPU architectures coming out of Nvidia, it's super interesting, but that Kubernetes layer is kind of getting in the way. It's not really aware enough to prevent some of the problems we see. If you pull a cluster trace at a large cloud practice, you'll find that so much of the work goes into moving memory from one place to another. Maybe 25% of the cluster will be spending its resources just moving data in ridiculous ways. I do believe that we need smarter software at that layer. That really keeps me up at night. If I had my way, we would be running some clusters in the polar craters of the moon. It's just a great environment. We can talk about that more. I look forward to what can happen with cloud as it evolves, let's put it that way.
You had one other about chip supply that I'd never even heard of before.
God, yeah. If you look at Moore's law, it's really interesting. It held for a long time. It was really fascinating just how much advance could be based on this perpetual, almost exponential increase in price performance with chips, but it ran out. There are a lot of tricks with physics, but eventually Intel hit the wall. Roughly, Intel was able to deliver Moore's law kinds of effects up until the late 2010s. By 2017, 2018, they started making promises about new chips that they were never able to deliver. TSMC has eclipsed them. Frankly, a lot of the vendors, I believe even IBM and AMD now, are just outsourcing to TSMC. So the whole landscape has changed.
I'll reference a podcast from my friend Mark Pesce, who does The Next Billion Seconds down in Australia. He has a really good series called Geopolichips. The thing is that building a fab these days costs $10 or maybe $15 billion. I was working on hardware when it cost a billion dollars, and that was seen as absurd. But these days it's a huge capital outlay to build a fab. Then you see things like Intel making a misstep where there's leakage across multiple processes running on the chip because of how the cache is structured, and an entire generation of fabs gets wiped out because of security concerns. You also see a lot of geopolitical espionage and cyber-threats and whatnot, which means that these ever-increasing investments in hardware are also becoming very high risk. There is a global shortage right now. You can't manufacture enough cars for demand because cars rely so heavily on embedded systems, and they just can't get enough of the right chips.
I do think that the hardware game is a really interesting area. I also see it as a major constraint for how enterprises are approaching their data challenges, because you're waking up to the fact that companies like BASF or Siemens actually have larger problems in graph computing, for instance, than, say, Google has. Everybody talks about the Google Knowledge Graph. It's small compared to the industry use cases. So I think the hardware game is really one of the areas, geopolitically, that's going to be problematic. Most of the chips in the world right now are coming out of Taiwan, or at least from companies headquartered there, even though there are a lot of fabs elsewhere. And they're not many miles off the coast of a country that doesn't necessarily like them. So this is going to get interesting.
Well, Paco, thank you so much. I really appreciate you sharing your pearls of wisdom and your extensive experience and knowledge on the subject with us. Is there anything that you'd like to plug before we head out?
Thank you, Jesse. I just want to say I really enjoy your podcasts. I enjoy your writings and appreciate getting to work with you on projects. I've been doing a lot of work in graph lately. One thing I'll plug is we find that in graph technologies about half the people will run screaming in the opposite direction. They don't like it. They don't want to do it. The other half are like, "Wow, this really accelerates what I was working on." So I would say give it a try. I've been involved a lot in a couple of conferences, Knowledge Graph Conference and also Connected Data World. There's a lot going on in that space, and they have a lot of enterprise applications, so definitely check it out.
Well, good. Well, thank you again, Paco, and thank you so much for contributing to this podcast and, in fact, the Data Teams book. I appreciate it.
Thank you, Jesse. Thank you very much.
With that, I'd like to thank you for listening. Tune in for the next episode where we're going to be talking even more about our data dream teams.
Another great story, another perspective shared on data, and the tools, technologies, methodologies, and people that use it every day. I loved it. It was informative, refreshing, and just the right dose of inspiration. Remember to check dreamteam.soda.io for additional resources and more great episodes. We’ll meet you back here soon at the Soda Podcast.