Case Study · 2025

Data Platform Modernisation

Rebuilt a legacy Redshift warehouse as a production lakehouse on Apache Iceberg - cutting warehousing costs from £2.8k to £390 a month, dropping worst-case query times from 5 hours to under 1 hour, and delivering end to end in 8 weeks.

Client · Series A AI Learning Platform
Industry · EdTech · Cambridge, UK
Year · 2025
£2.4k Monthly cost saved
8 weeks End-to-end delivery
5hrs → 1hr Worst-case query time
Stack · Python · Apache Iceberg · AWS S3 · AWS Glue · AWS Athena · AWS Redshift · AWS Step Functions · Docker · Great Expectations · dbt

The Problem

The platform had grown fast - but the data infrastructure hadn’t grown with it.

The core issue wasn’t the data. The data was there. The problem was that the existing team had made infrastructure decisions without a deep understanding of the fundamental differences between OLTP and OLAP systems. The Redshift cluster had been sized and configured for transactional workloads - wrong node types, wrong number of nodes for the actual data volume. The result was a warehouse that was simultaneously over-provisioned in the wrong areas and under-powered where it mattered. At £2.8k/month and climbing with no clear ceiling, it was the most expensive part of the stack - and the least reliable.

On top of that, the data itself was hard to use. Not because it was missing, but because it had never been modelled properly. Tables reflected how the source systems stored data, not how the business needed to consume it. There was no semantic layer, no naming conventions, no documentation. Analysts were writing complex SQL to answer questions that should have been simple - and waiting up to 5 hours for results. That’s not a performance problem. That’s a business problem.

There was also no governance to speak of. No RBAC, no access controls tied to roles, no audit trail. Enterprise clients were asking compliance questions the team couldn’t answer - not because the data wasn’t there, but because nobody could prove who had access to what.

The team was capable. They just needed proper infrastructure built by someone who understood both the technical and architectural side of data systems from the ground up.

Our Approach

The first week was about understanding before touching anything.

We mapped every inbound data source, every query pattern, and every downstream consumer - dashboards, scheduled reports, product features pulling directly from Redshift. The goal was to understand what the business actually needed, not just what was running.

From there we agreed a phased migration: run the new stack in parallel, validate outputs against the legacy system row by row, then cut over without downtime. No big-bang migrations. No surprises for the teams relying on this data every day.
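
The row-by-row validation can be sketched roughly like this - a minimal, illustrative version in plain Python (the names, key column, and row shapes are hypothetical, not the production code), comparing a legacy extract against the new stack's output by fingerprinting each row:

```python
import hashlib

def row_fingerprint(row: dict) -> str:
    """Stable fingerprint of a row, independent of column order."""
    canonical = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(canonical.encode()).hexdigest()

def diff_extracts(legacy_rows, new_rows, key="id"):
    """Compare two extracts row by row; report keys missing, extra, or changed."""
    legacy = {r[key]: row_fingerprint(r) for r in legacy_rows}
    new = {r[key]: row_fingerprint(r) for r in new_rows}
    return {
        "missing": sorted(set(legacy) - set(new)),
        "extra": sorted(set(new) - set(legacy)),
        "mismatched": sorted(k for k in set(legacy) & set(new)
                             if legacy[k] != new[k]),
    }

legacy = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
new = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]
print(diff_extracts(legacy, new))  # {'missing': [], 'extra': [], 'mismatched': [2]}
```

An empty report on every table is the signal that the cutover is safe.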

What We Built

The new platform is built around a clear principle: one source of truth, fully owned, fully documented, no black boxes.

Raw data lands in S3 and is stored in Apache Iceberg format - giving us ACID compliance, schema evolution, and time travel out of the box. Athena sits on top for ad-hoc exploration and occasional analytical queries, paying only for what we scan. Nothing is proprietary, nothing is always-on compute.
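
Time travel in practice: Athena can query an Iceberg table as it existed at a past instant using a `FOR TIMESTAMP AS OF` clause. A small sketch of building such a statement (table name and columns are placeholders; in production the SQL would be submitted via the Athena API rather than printed):

```python
from datetime import datetime, timezone

def time_travel_query(table: str, as_of: datetime, columns: str = "*") -> str:
    """Build an Athena SQL statement that reads an Iceberg table
    as it existed at a point in time (Iceberg time travel)."""
    ts = as_of.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"SELECT {columns} FROM {table} "
        f"FOR TIMESTAMP AS OF TIMESTAMP '{ts} UTC'"
    )

print(time_travel_query("analytics.events",
                        datetime(2025, 3, 1, tzinfo=timezone.utc)))
```

This is what makes "what did this dashboard show last Tuesday?" a one-line query rather than a restore-from-backup exercise.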

AWS Glue handles ingestion - jobs, crawlers, and the Glue Catalog as the central metadata layer. AWS Step Functions orchestrates the pipeline end to end, giving us a visual execution graph, built-in retry logic, and clean failure handling without managing any infrastructure. A backfill mechanism is built into the orchestration layer - historical data can be reprocessed for any time window without manual intervention, making schema changes and pipeline fixes safe to deploy without data gaps.
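
The backfill mechanism boils down to turning a time window into per-partition work items and handing them to the orchestrator. A minimal sketch of the input-building side (the payload shape and field names are illustrative; the real payload would be passed to Step Functions via `start_execution`):

```python
from datetime import date, timedelta
import json

def backfill_input(start: date, end: date) -> str:
    """Build a Step Functions execution input for a backfill:
    one partition date per day in the window, inclusive."""
    days = (end - start).days + 1
    partitions = [(start + timedelta(days=i)).isoformat() for i in range(days)]
    return json.dumps({"mode": "backfill", "partitions": partitions})

# In production this payload would be submitted with boto3, e.g.:
#   sfn.start_execution(stateMachineArn=..., input=backfill_input(start, end))
print(backfill_input(date(2025, 3, 1), date(2025, 3, 3)))
```

Because Iceberg writes are transactional, reprocessing a partition replaces it atomically - readers never see a half-rewritten day.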

dbt runs on Docker on ECS, managing all transformations and owning the semantic layer. We started with Inmon methodology - building normalised, integration-focused data structures first to establish a clean, trusted core. As the business matured and stakeholders developed a clearer picture of what they actually needed, we introduced Kimball-style dimensional models on top. The result is a hybrid architecture that reflects how the business actually evolved rather than a rigid upfront design that would have needed rebuilding six months in.

The entire pipeline is infrastructure-as-code using CDK. Every change goes through a proper CI/CD pipeline - no manual deployments, no configuration drift, no “it works on my machine.” The infrastructure is reproducible from scratch in a single command.

Quality checks run at every stage using Great Expectations as the primary framework, extended with custom jobs where GX couldn’t accommodate specific business rules. Failures halt the pipeline before bad data reaches downstream consumers. No silent failures, no stale dashboards.
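
The shape of those custom jobs is simple: each check returns a failure message or nothing, and any failure halts the run. A stand-in sketch in plain Python (this illustrates the pattern, not Great Expectations' own API, and the rules shown are generic examples):

```python
def check_non_null(rows, column):
    """Fail if any row has a null in the given column."""
    bad = [i for i, r in enumerate(rows) if r.get(column) is None]
    return f"{column} has {len(bad)} null(s)" if bad else None

def check_unique(rows, column):
    """Fail if the given column contains duplicate values."""
    seen, dupes = set(), set()
    for r in rows:
        v = r[column]
        (dupes if v in seen else seen).add(v)
    return f"{column} has duplicates: {sorted(dupes)}" if dupes else None

def run_checks(rows, checks):
    """Run every check; raise to halt the pipeline if any fail."""
    failures = [msg for check in checks if (msg := check(rows))]
    if failures:
        raise ValueError("quality gate failed: " + "; ".join(failures))

rows = [{"id": 1}, {"id": 2}, {"id": 2}]
try:
    run_checks(rows, [lambda r: check_non_null(r, "id"),
                      lambda r: check_unique(r, "id")])
except ValueError as e:
    print(e)  # the run stops here instead of publishing bad data downstream
```

Raising rather than logging is the whole point: a loud failure upstream is cheaper than a quiet discrepancy in a stakeholder's dashboard.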

Everything is documented in Markdown files living inside the codebase itself - alongside the code that implements it. Column descriptions, pipeline logic, architectural decisions, data contracts. Not in a Confluence page that goes stale the moment someone forgets to update it. The documentation ships with every deployment and is always in sync with what’s actually running.

The result is a stack the team can maintain without us. Every component is open source or a managed AWS service - no vendor lock-in, no proprietary tooling, no dependency on AzCoding to keep the lights on.

The Results

The numbers came in ahead of schedule.

Warehousing costs dropped from £2.8k/month to around £390 - a saving of £2.4k/month, or roughly £29k over the course of a year. The saving came from three compounding factors: right-sizing the Redshift cluster to its actual workload patterns, offloading ad-hoc queries to Athena's pay-per-query model, and the columnar Parquet files managed by Iceberg drastically reducing the data scanned per query.
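
The arithmetic behind those headline numbers:

```python
before, after = 2800, 390            # monthly warehousing cost, GBP
monthly = before - after             # monthly saving
annual = monthly * 12                # annualised saving
reduction = round(100 * (1 - after / before))  # percentage reduction
print(monthly, annual, reduction)    # 2410 28920 86
```

An 86% reduction in the single largest line item of the stack.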

Query performance improved across the board. Worst-case queries that had been running for 5 hours came down to under 1 hour. The majority of analytical workloads - previously 30 minutes or more - dropped to a few minutes. The fastest queries were already sub-minute and stayed that way. The combination of Iceberg’s file pruning, Athena’s parallel execution, and properly modelled dbt transformations meant queries were hitting clean, well-partitioned data rather than fighting through an unstructured schema.

The data team’s day-to-day changed fundamentally. Before, the majority of their time went on incident response - debugging pipelines, investigating data quality issues, explaining discrepancies to stakeholders. After, that time went on building new data products. The quality checks, the backfill mechanism, and the CI/CD pipeline eliminated the class of problems that had been consuming the team.

The full engagement - discovery, architecture, build, parallel validation, cutover, and handover - was completed in 8 weeks. The client's engineering team took full ownership of the platform on day one of handover. The documentation, the CI/CD pipeline, and the modular dbt project meant they could ship changes independently from week one - no dependency on AzCoding to keep things running.

"Alex is one of those rare people you can hand a complex problem to and trust completely to get on with it. He rebuilt our entire data platform from a costly Redshift warehouse to a sleek Apache Iceberg lakehouse in just eight weeks. Our warehousing costs dropped from £2.8k to under £400 a month, and queries that used to take five hours now finish in under one. He's sharp, pragmatic, and genuinely easy to work with. If you get the chance to work with him, take it."

- Dr. Christopher Pedder, ex-CDO

Start a similar project

Tell me where your data stack is today and where you need it to go. Free discovery call, no commitment.

Get in touch