The 4 Levels of a Data Engineer

Published May 13, 2026

Tom Baeyens

CTO and Co-founder at Soda

Every data engineer I know falls into one of four groups: Level 1, 2, or 3, or a small, important fourth group that the industry doesn't know how to talk about.

Level 1 is SQL and basic ETL.
Level 2 is distributed systems and cloud warehouses.
Level 3 is streaming, orchestration, and big data architecture.
Level 4 is when you look at a request and say, "we don't need to build a pipeline for this."

That last level is the one most data engineers never reach. Not because they can't, but because the industry doesn't reward it.

We promote based on stack depth. We measure productivity in pipelines shipped. We celebrate complexity as if it's the same as capability.

None of which produces what the business actually needs: data it can trust.

This is about what each level looks like, why Level 4 is the destination, and why data quality is what gets you there.

The 4 Levels

Each level is a real expansion in what you can do. None of them are wrong. The progression matters because where you stop tells you something about how your team thinks about data engineering, and what it's optimizing for.

Level 1: SQL and basic ETL pipelines

At Level 1, you can:

write SQL that does what you mean
build a pipeline that moves data from system A to system B and transforms it on the way

The transformations are mostly correct. The pipelines are mostly linear.

This is real work. A surprising amount of data engineering is still SQL and Python glue, and an engineer who can do that well is producing value from day one.

The limitation is that everything you build at Level 1 lives in your head. There's no architectural model for the system as a whole. When something breaks, you debug it by remembering what you did last week.

Level 2: Distributed systems and cloud data warehouses

Level 2 is where the size of the data starts to shape the work.

You are:

partitioning tables in Snowflake, BigQuery, or Databricks
thinking about how cost and compute trade off against each other
learning that a query that runs fine on a million rows can take an hour on a billion

The system stops being a single machine and starts being a cluster, and that changes how you reason about failure.

You also start writing code that other engineers will read. You're not the only owner anymore. You document. You name things consistently. You build pipelines that someone else can debug without pinging you.

Level 3: Streaming, orchestration, and big data architecture

Level 3 is when you stop thinking in pipelines and start thinking in systems. Airflow, Dagster, Kafka, Spark — these are tools, but the real shift is architectural.

You're reasoning about:

dependencies between jobs, and what happens when one is delayed
retries, backfills, and failure recovery that doesn't corrupt data
schema evolution across producers and consumers
how a change in one place ripples through everything downstream
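
One way to picture "retries and backfills that don't corrupt data" is an idempotent partition load: each run replaces a whole day's rows inside a single transaction, so rerunning it after a failure leaves the same rows rather than duplicates. The sketch below uses Python's built-in sqlite3 module. The table daily_orders, the functions load_partition and row_count, and the sample rows are hypothetical names chosen for illustration, not taken from any real system.

```python
import sqlite3


def load_partition(conn, day, rows):
    """Replace all rows for one day atomically (delete, then insert)."""
    # "with conn" commits if the block succeeds and rolls back if it raises,
    # so a failed run never leaves a half-written partition behind.
    with conn:
        conn.execute("DELETE FROM daily_orders WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_orders (day, order_id, amount) VALUES (?, ?, ?)",
            [(day, order_id, amount) for order_id, amount in rows],
        )


def row_count(conn, day):
    """Count the rows currently stored for one day."""
    cur = conn.execute("SELECT COUNT(*) FROM daily_orders WHERE day = ?", (day,))
    return cur.fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_orders (day TEXT, order_id INTEGER, amount REAL)")

rows = [(1, 9.5), (2, 20.0)]  # hypothetical (order_id, amount) pairs
load_partition(conn, "2026-05-01", rows)
load_partition(conn, "2026-05-01", rows)  # a retry or backfill of the same day
print(row_count(conn, "2026-05-01"))
```

Running the load twice leaves two rows for that day, not four. An append-only insert would have doubled the partition on the retry.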

This is the level most senior data engineers reach and stop at. It's a comfortable place. You're building things that didn't exist before. The tooling is sophisticated. The work feels important. And it is.

But Level 3 has a trap inside it. The deeper your stack gets, the easier it is to confuse "I built a complex system" with "I built the right system."

Level 4: "We don't need to build a pipeline for this"

Level 4 is the maturity to look at a request and recognize that more infrastructure isn't always the answer.

Sometimes:

a query is fine
a spreadsheet is fine
the right move is a direct connection between two tools that already exist
the request itself is wrong, and you push back instead of building

At Level 4, restraint is the skill. You're not less technical than a Level 3 engineer. You've added one more capability on top of all the others: the judgment to know when not to apply them.

The Over-Engineering Trap

It's worth naming the specific failure mode that keeps engineers at Level 3: complexity becomes the deliverable.

You're building things. You're using sophisticated tools. Your stack looks impressive in a diagram. But the team isn't getting better answers from its data than it was a year ago. That's the trap.

A team of five doesn't need a lakehouse to run a weekly report. A startup with one analyst doesn't need Databricks. A dashboard that updates daily doesn't need a streaming pipeline. These statements are obvious in isolation and very hard to act on when your career has been built on saying yes to scope.

The argument against this is usually that every team needs a modern data stack to compete. They don't. They need data they can trust. The "modern data stack" became shorthand for a specific set of tools, but the underlying need is reliability, not toolchain. A team running Postgres and a careful set of data contracts is more capable than a team with a full lakehouse and silent failures.
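
To make "a careful set of data contracts" concrete: a contract can be as small as a list of required columns, their expected types, and whether nulls are allowed, checked before a batch is published. The following is a minimal sketch in plain Python. It is not Soda's contract format or any library's API, and ColumnRule, check_contract, and the orders columns are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnRule:
    name: str
    type_: type
    nullable: bool = False


def check_contract(rows, rules):
    """Return a list of readable violations; an empty list means the batch passes."""
    violations = []
    for i, row in enumerate(rows):
        for rule in rules:
            if rule.name not in row:
                violations.append(f"row {i}: missing column '{rule.name}'")
                continue
            value = row[rule.name]
            if value is None:
                if not rule.nullable:
                    violations.append(f"row {i}: '{rule.name}' is null")
            elif not isinstance(value, rule.type_):
                violations.append(
                    f"row {i}: '{rule.name}' expected {rule.type_.__name__}, "
                    f"got {type(value).__name__}"
                )
    return violations


# Hypothetical contract for an orders table.
ORDERS_CONTRACT = [
    ColumnRule("order_id", int),
    ColumnRule("amount", float),
    ColumnRule("coupon", str, nullable=True),
]

batch = [
    {"order_id": 1, "amount": 9.5, "coupon": None},
    {"order_id": 2, "amount": "20.00", "coupon": "SPRING"},  # type changed upstream
    {"order_id": None, "amount": 5.0},  # null key, and the coupon column was dropped
]

for problem in check_contract(batch, ORDERS_CONTRACT):
    print(problem)
```

The point of the design is that the check returns every violation instead of stopping at the first, so whoever owns the upstream source sees the full picture in one run.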

Level 4 engineers learn to ask: "What's the smallest system that solves this problem?" Then they build that, and stop.

Why Level 4 Matters: The Quality Thesis

If Level 4 is about knowing when not to build, why does that matter?

Because every pipeline you build is a liability as well as an asset. It needs to be maintained. It can fail silently. It produces data that someone, somewhere, is making decisions from.

When you build less, you have more capacity to make what you build correct.

This is where the conversation has to shift. For a decade, data engineering has measured itself by what it ships: pipelines, dashboards, modeled data, real-time feeds. The implicit assumption was that if the data was moving, it was working. The last three years have made it clear that's not true.

The future of data engineering isn't deeper infrastructure. It's reliable data.

A pipeline that succeeds while producing wrong data is worse than no pipeline. A dashboard that's always up but quietly stale is more dangerous than a dashboard that's down, because nobody thinks to question it.

Most data quality incidents I've seen weren't caused by broken pipelines. They were caused by pipelines that did exactly what they were told to do, on data that had silently changed.
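
That failure, a pipeline that runs green on data that has silently changed, is the kind of thing a small profile comparison can catch. The sketch below compares a batch against a baseline on row count and null rate. The function names, the thresholds (a 50% row-count drop, a 10-point rise in null rate), and the sample data are all illustrative assumptions, not recommended values.

```python
def profile(rows, column):
    """Summarize a batch: row count and the share of nulls in one column."""
    total = len(rows)
    nulls = sum(1 for row in rows if row.get(column) is None)
    return {"rows": total, "null_rate": nulls / total if total else 0.0}


def detect_silent_change(baseline, current, max_row_drop=0.5, max_null_increase=0.10):
    """Return warnings when the current profile drifts too far from the baseline."""
    warnings = []
    if baseline["rows"] and current["rows"] < baseline["rows"] * (1 - max_row_drop):
        warnings.append(
            f"row count fell from {baseline['rows']} to {current['rows']}"
        )
    if current["null_rate"] - baseline["null_rate"] > max_null_increase:
        warnings.append(
            f"null rate rose from {baseline['null_rate']:.0%} "
            f"to {current['null_rate']:.0%}"
        )
    return warnings


yesterday = [{"email": f"user{i}@example.com"} for i in range(100)]
# Today the source sends nulls for 40 of 100 emails; the load itself still "succeeds".
today = [
    {"email": None if i % 5 < 2 else f"user{i}@example.com"} for i in range(100)
]

baseline = profile(yesterday, "email")
current = profile(today, "email")
for warning in detect_silent_change(baseline, current):
    print(warning)
```

Nothing in this example would fail a job scheduler: the row count is unchanged and every insert succeeds. Only the comparison against yesterday's profile reveals that the column changed.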

The Level 4 engineer treats this as the actual problem. They don't ask "how do I move this data faster?" They ask "how do I make sure the data is correct when it gets there?" Those are different questions, and they lead to different architectures.

The Level 4 mindset and data quality aren't separate ideas. They're the same.

Restraint about what to build creates the space to make what you do build trustworthy. That's the destination.

Which Level Are You?

Let’s take an honest look at how you spend your week.

[ ] If most of your time is in SQL and you can move data between two systems reliably, you're at Level 1. That's a real foundation, and the rest builds on it.

[ ] If you're partitioning, optimizing, debugging queries that touch a billion rows, and writing code that other engineers maintain, you're at Level 2.

[ ] If you're designing orchestrated systems with dependencies, retries, schema evolution, and downstream impact analysis, you're at Level 3. Most senior data engineers are here. It's a great place to be.

[ ] If your default response to a new request is "let's see if we can solve this without building anything new," and you're right often enough that the business trusts that answer, you're at Level 4.

If you're still at Level 1 or 2, none of this is a waiting room. Data quality isn't a Level 4 concern that gets bolted on later; it's the discipline that makes every level work better.

The harder problem is when your org rewards Level 3 work and ignores Level 4. Restraint is invisible — you don't get credit for the pipeline you didn't build. The way through is to make the cost of overengineering visible: incident counts, maintenance hours, stale dashboards, AI projects that stall on data quality. Once leaders see the cost, Level 4 thinking becomes career-positive.

Wherever you are, the way up is usually less about adding tools and more about caring about quality. Reliable data is the deliverable. Everything else is plumbing.

If you want to read more on what quality-driven thinking looks like in practice, we've got a whole library of pieces written for data engineers at every level, including deeper dives on contracts, observability, and the architectural patterns that let you build less while delivering more.
