Data Engineers Are Software Engineers
By Alex
Data engineering has come a long way, and most people now appreciate how complex and demanding the work actually is. But in some corners there still lingers a picture of the data engineer as someone who writes complex SQL, connects a few no-code tools together, and keeps the dashboards running. I don’t think it’s a fair picture, and I’ve spent enough time in this field to feel it’s worth saying out loud why it falls short.
Data engineers build software. They design and operate distributed systems, write production-grade code, manage infrastructure, and build the platforms that analysts, scientists, and decision-makers depend on. The discipline grew out of a genuine need: how do you move, store, and transform data reliably at scale? And the answer has always been roughly the same as for any other complex system: with care, with craft, and with proper engineering discipline.
The craft is the point
Andrew Hunt and David Thomas open The Pragmatic Programmer with something that sounds simple but sticks: care about your craft. Not just whether the thing works, but whether it is built well. They write that it is not enough to be good: you need to be consistently good. I find that applies as directly to a data pipeline as to any other piece of software.
A pipeline that works today but nobody understands, that has no tests and falls over whenever the source schema changes, is not really finished. It is a problem deferred. The best data engineers I’ve worked alongside build things that are maintainable, observable, and honest about their own limits. When those standards slip (usually because the work is treated as something other than real engineering) you end up with the broken, untrusted data infrastructure that teams spend years trying to dig out of.
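What "honest about its own limits" can look like in practice is a schema guard: validate incoming records against the columns and types the pipeline expects, and fail loudly instead of silently loading corrupted data when the source changes. This is a minimal sketch; the names (`EXPECTED_SCHEMA`, `validate_batch`) are illustrative, not from any particular tool.

```python
# Minimal schema guard: check a batch of rows against the expected
# columns and types before loading, returning readable violations.
EXPECTED_SCHEMA = {
    "order_id": int,
    "customer_id": int,
    "amount": float,
}

def validate_batch(rows: list[dict]) -> list[str]:
    """Return human-readable schema violations (empty list means OK)."""
    errors = []
    for i, row in enumerate(rows):
        missing = EXPECTED_SCHEMA.keys() - row.keys()
        if missing:
            errors.append(f"row {i}: missing columns {sorted(missing)}")
            continue
        for col, expected_type in EXPECTED_SCHEMA.items():
            if not isinstance(row[col], expected_type):
                errors.append(
                    f"row {i}: {col} is {type(row[col]).__name__}, "
                    f"expected {expected_type.__name__}"
                )
    return errors

# A column dropped upstream surfaces as an explicit error at load time,
# not as silently wrong data weeks later.
bad = [{"order_id": 1, "amount": 9.99}]  # customer_id missing upstream
print(validate_batch(bad))
```

Tools like dbt tests or Great Expectations give you richer versions of this idea; the point is that the check exists at all, and that it runs before anything downstream depends on the batch.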
DRY (don’t repeat yourself) is another principle from the same book that’s easy to violate in data work when craft isn’t treated as important. Transformation logic duplicated across fifteen SQL files, business rules hardcoded in the pipeline and then again in the BI layer, metric definitions that diverge silently between teams. These aren’t really data problems. They’re software engineering problems, and they have software engineering solutions.
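The software engineering solution is usually the obvious one: define the rule once and make everything else call it. A sketch of what that might look like for a metric definition, with hypothetical names (`is_active_customer`, `ACTIVE_WINDOW_DAYS`) standing in for whatever your business rule actually is:

```python
# One canonical definition of "active customer" instead of re-deriving it
# in the pipeline, the BI layer, and an analyst's notebook.
from datetime import date, timedelta

ACTIVE_WINDOW_DAYS = 30  # the single place the business rule lives

def is_active_customer(last_order: date, today: date) -> bool:
    """Canonical metric: the customer ordered within the active window."""
    return (today - last_order) <= timedelta(days=ACTIVE_WINDOW_DAYS)

# Both the nightly pipeline and the BI export call the same function,
# so the metric cannot silently diverge between teams.
today = date(2024, 6, 30)
print(is_active_customer(date(2024, 6, 15), today))  # within the window
print(is_active_customer(date(2024, 1, 1), today))   # well outside it
```

In a SQL-centric stack the same move looks like a shared dbt model or macro rather than a Python function, but the principle is identical: one definition, many consumers.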
A platform others build on top of
Something that took me a while to put into words clearly: a data engineer’s primary customer is often not the person looking at the dashboard. It’s the analyst, the scientist, or the product team who needs reliable, well-structured data to do their own work. That makes data engineering a kind of platform engineering: you’re building foundations that other people build on.
That framing changes how you think about quality. A bug in a dashboard is visible quickly. A bug in the underlying pipeline might corrupt weeks of data before anyone notices, and the downstream effects (wrong decisions, models trained on dirty data) can be genuinely hard to trace. When your reliability is someone else’s dependency, the infrastructure side of the work matters more than it might look from the outside. Orchestration, monitoring, alerting, lineage: none of that is overhead. They’re a meaningful part of what makes the system trustworthy.
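As a concrete example of monitoring as part of the product: a freshness check that compares each table's last load against an agreed SLA, so staleness becomes an alert rather than a surprise for downstream teams. The table names and thresholds here are assumptions for illustration.

```python
# Minimal freshness check: flag tables whose most recent load
# breaches a per-table SLA.
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = {                       # hypothetical per-table SLAs
    "orders": timedelta(hours=1),
    "daily_revenue": timedelta(hours=26),
}

def stale_tables(last_loaded: dict[str, datetime], now: datetime) -> list[str]:
    """Return tables whose last successful load is older than its SLA."""
    return [
        table
        for table, sla in FRESHNESS_SLA.items()
        if now - last_loaded[table] > sla
    ]

now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
loads = {
    "orders": now - timedelta(minutes=30),       # fresh
    "daily_revenue": now - timedelta(hours=30),  # breached: page someone
}
print(stale_tables(loads, now))
```

In production this check would read load timestamps from the warehouse's metadata and feed an alerting system, but even this shape makes reliability something you measure rather than assume.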
It’s not just SQL
SQL is elegant and I use it constantly. But reducing data engineering to SQL is a bit like saying backend engineers just write JSON. It misses everything that makes the work interesting and hard.
Modern data engineering involves Python for orchestration logic, framework-level thinking in tools like dbt and Spark, infrastructure-as-code in Terraform or Pulumi, CI/CD pipelines for deploying data assets, and distributed systems reasoning at scale. Version control, code review, testing strategies, architectural trade-offs. The same things any software team grapples with.
The Pragmatic Programmer has a concept called tracer bullets: building a thin but complete path through a system end to end, so you learn where the real complexity lives before over-engineering the parts that turn out not to matter. It’s one of the most useful mental models I bring into data projects. Before you build the perfect pipeline, build the one that actually runs. Understand the shape of the problem first, then refine.
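A tracer-bullet pipeline can be almost embarrassingly small: extract, transform, load, each as the thinnest thing that runs end to end. No retries, no partitioning, no orchestrator yet; the point is to learn where the real complexity lives. Everything below is an illustrative stand-in, not a recommended architecture.

```python
# Tracer bullet: the thinnest complete path through a pipeline.
def extract() -> list[dict]:
    # Stand-in for the real source (API, CDC stream, object storage...)
    return [{"sku": "A-1", "qty": 2, "unit_price": 4.50},
            {"sku": "B-7", "qty": 1, "unit_price": 12.00}]

def transform(rows: list[dict]) -> list[dict]:
    # The one business rule we actually understand today
    return [{**r, "revenue": r["qty"] * r["unit_price"]} for r in rows]

def load(rows: list[dict], sink: list) -> None:
    # Stand-in for the warehouse write
    sink.extend(rows)

warehouse: list[dict] = []
load(transform(extract()), warehouse)
print([r["revenue"] for r in warehouse])
```

Once this runs, each stand-in gets replaced with the real thing, and you discover whether the hard part is the flaky source, the transformation logic, or the write path, instead of guessing up front.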
What this means in practice
This isn’t really an argument about titles or status. It’s more that the expectations you bring to data engineering work shape the quality of what gets built. If you’re hiring for the role or building a data function, it’s worth asking not just “can this person write SQL?” but whether they think in terms of maintainability, failure modes, and the people downstream who depend on what they build.
The data discipline has matured a lot in recent years. The tooling is better and the patterns are more established. Bringing genuine software engineering craft to the work, not as an afterthought but as a baseline, is in my experience what separates data infrastructure that compounds in value over time from the kind that quietly turns into a swamp.
Building a data platform?
Free discovery call. Tell me where your stack is today and where you need it to go.