

Fast Wrong Is Worse Than Slow Right

How I redesigned a daily full-recompute pipeline into a shared incremental one

Data Engineering
Performance Optimization
Snowflake
The most dangerous optimization I tested was also the fastest—until I discovered it was silently corrupting data.
Author

Farrukh Nauman

Published

April 7, 2026

Modified

April 7, 2026

Keywords

data pipeline optimization, incremental processing, data correctness, Snowflake, benchmark design

The most dangerous optimization I tested was also the fastest. On paper, it looked great. It avoided full recomputes, cut runtime dramatically, and made the benchmark tables look easy. It was also wrong—and that changed everything about how I approached the problem.

The Problem

A critical daily analytics pipeline had been running for years. Every morning, it recomputed four downstream tables from a ~30 billion row source. The job took about 90 minutes on a large Snowpark-optimized warehouse and consumed roughly 4,500 Snowflake credits annually. At 3 billion rows of test data, 98% of the runtime was spent on two full table scans—not the output writes, not the aggregations, not the window functions. Just scanning the same massive dataset twice.

The pipeline was expensive, but it was correct. That matters when downstream teams depend on your output for operational decisions.

The Temptation: Incremental Processing

If the bottleneck is scanning 30 billion rows, why scan all of them every day? Most of that data hasn’t changed; only the rows that arrived since yesterday need processing.

The idea was straightforward: build an incremental system with two phases. Phase A would run once to seed a persistent activity table from historical data. Phase B would run daily—scan only new rows since a watermark, aggregate them, append to the cache, and recompute the four small output tables from the cached activity layer (~6 million rows, not 30 billion).
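As a sketch, the two phases might look like this in plain Python. This is the append-based variant described here; the class, table, and field names are hypothetical, not the production code:

```python
from datetime import date

class IncrementalPipeline:
    """Toy model of the two-phase flow: seed once, then process only new rows."""

    def __init__(self):
        self.activity_cache = {}  # activity_key -> aggregated metrics
        self.watermark = None     # latest event date already processed

    def phase_a_seed(self, source_rows):
        """One-time seed: aggregate all historical rows into the cache."""
        for row in source_rows:
            self._append(row)
            if self.watermark is None or row["event_date"] > self.watermark:
                self.watermark = row["event_date"]

    def phase_b_daily(self, source_rows):
        """Daily run: scan only rows newer than the watermark, then advance it."""
        new_rows = [r for r in source_rows if r["event_date"] > self.watermark]
        for row in new_rows:
            self._append(row)
        if new_rows:
            self.watermark = max(r["event_date"] for r in new_rows)
        return len(new_rows)  # rows actually scanned, not the full table

    def _append(self, row):
        agg = self.activity_cache.setdefault(
            row["activity_key"], {"events": 0, "total": 0})
        agg["events"] += 1
        agg["total"] += row["value"]
```

The daily cost is proportional to the new rows, not the 30 billion in the source, which is where the speedup comes from.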

I built this “Dual-Incremental” approach with separate tables for the normal and filtered processing paths. The benchmarks were stunning:

| Dataset | Full Recompute | Dual-Incremental Daily | Speedup |
|---|---|---|---|
| 300M rows | 136 seconds | 17 seconds | 8x |
| 3B rows | 350 seconds | 28 seconds | 12.6x |

At production scale, this would cut a 90-minute job to roughly 7 minutes. The annual credit savings would be substantial. I was ready to write the deployment plan.

Then I ran the correctness validation.

Why I Didn’t Trust the First Benchmark

Here’s what a typical “optimization test” looks like in practice:

  1. Take a small dataset from production
  2. Run the old code once → 50 seconds
  3. Run the new code once → 15 seconds
  4. Conclude: “3x faster!” → Open PR → Merge → Move on

This is exactly what I refused to do. The Dual-Incremental approach would have passed that test with flying colors—and would have corrupted production data for months before anyone noticed.

Instead, I built a multi-dimensional benchmark that tested along three axes:

| Axis | What I Tested | Why It Matters |
|---|---|---|
| Data Scale | 30M → 300M → 3B rows (1/1000x to 1/10x production) | Small data hides bugs. The Dual-Incremental bug showed 0 errors at 30M rows but 29 phantom + 8 missing rows at 3B. |
| Warehouse Size | Small through Large, including 2 variants of Medium | Large warehouses parallelize scans so effectively they hide algorithmic costs. On a small warehouse, the incremental advantage was 4.0x; on large, only 1.2x. |
| Correctness | Row-level EXCEPT validation on every single run | Aggregate checksums can match while individual rows are wrong. Row-level diffs caught the structural errors that checksums missed. |

That meant 36 benchmark combinations (24 completed; the 3B runs timed out on smaller warehouses), producing timing entries and validation records, all stored in database tables for reproducible analysis.
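The matrix itself is easy to reconstruct as a sketch. The specific scale, warehouse, and approach labels below are assumptions, not the production names:

```python
from itertools import product

# Hypothetical reconstruction of the benchmark matrix:
# 3 data scales x 4 warehouse sizes x 3 approaches = 36 combinations.
scales = ["30M", "300M", "3B"]
warehouses = ["SMALL", "MEDIUM", "MEDIUM_V2", "LARGE"]  # 2 Medium variants
approaches = ["full_recompute", "dual_incremental", "shared_incremental"]

matrix = list(product(scales, warehouses, approaches))
print(len(matrix))  # 36
```

Each cell of the matrix gets a timing entry plus a row-level validation record, so a regression on any axis shows up as a concrete row in a table rather than an anecdote.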

The bug that killed the Dual-Incremental approach only appeared at scale; the next section walks through it in detail.

The Bug That Scales With Data

Consider an activity (device X, activity #123) that has telemetry rows both before and after the watermark cutoff. During Phase A, the pipeline aggregates rows A and B from historical data. During Phase B, it discovers new rows C and D for the same activity.

The problem: the append-based architecture creates a second activity row from rows C and D. It cannot go back and update the Phase A row. The correct result should be a single row aggregating all four rows. Instead, we get phantom rows in weekly and monthly tables, missing keys when split data fails filter conditions, and wrong aggregate values computed on partial data.
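A toy reproduction of the split, assuming a simple count-and-sum aggregation (the keys and values are made up):

```python
# Rows A and B arrive before the watermark; C and D, for the SAME
# activity, arrive after it.
historical = [("X-123", 10), ("X-123", 20)]   # rows A, B (Phase A)
new_rows   = [("X-123", 30), ("X-123", 40)]   # rows C, D (Phase B)

def aggregate(rows):
    """Collapse telemetry rows into one (events, total) row per activity."""
    out = {}
    for key, value in rows:
        events, total = out.get(key, (0, 0))
        out[key] = (events + 1, total + value)
    return list(out.items())

# Append-based Phase B aggregates the new rows independently and appends.
cache = aggregate(historical) + aggregate(new_rows)
print(cache)
# [('X-123', (2, 30)), ('X-123', (2, 70))] -- two rows for one activity

# The correct result is a single row over all four source rows.
print(aggregate(historical + new_rows))
# [('X-123', (4, 100))]
```

Both cached rows are individually plausible, which is exactly why the bug survives casual inspection: nothing looks wrong until you compare against ground truth.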

This “split-activity” bug doesn’t appear at small scale. At 30 million rows—fast to iterate on—everything looked clean. At 300 million rows, there were 3 phantom rows per warehouse. At 3 billion rows, 29 phantom rows and 8 missing rows. At production scale with ~580 million activities, this would produce hundreds of incorrect rows.

The fastest optimization I had found was unusable. The bug was fundamental to the append architecture, not a tuning issue. No amount of watermark tweaking could fix it.

The Real Fix: Shared-Incremental with MERGE

Two key insights led to the working solution.

First: Use MERGE instead of APPEND. When new data arrives for an existing activity, we need to re-read the activity’s full history from the source, recompute the aggregation from scratch, and update the existing row. SQL MERGE handles exactly this: match on the activity key, update columns when matched, insert when not matched.
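Snowflake’s MERGE can’t run locally, but the same match/update/insert semantics can be sketched with SQLite’s upsert. The table, columns, and values here are illustrative, not the production schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE activity (activity_key TEXT PRIMARY KEY,
                           events INTEGER, total INTEGER);
    -- Stale Phase A row, built from historical rows A and B only.
    INSERT INTO activity VALUES ('X-123', 2, 30);
""")

# Phase B re-read the FULL history of each touched activity and
# recomputed its aggregation from scratch before writing it back.
recomputed = [("X-123", 4, 100), ("Y-456", 1, 7)]
conn.executemany("""
    INSERT INTO activity (activity_key, events, total)
    VALUES (?, ?, ?)
    ON CONFLICT (activity_key)
    DO UPDATE SET events = excluded.events, total = excluded.total
""", recomputed)

print(conn.execute("SELECT * FROM activity ORDER BY activity_key").fetchall())
# [('X-123', 4, 100), ('Y-456', 1, 7)] -- one row per activity, no phantoms
```

The matched row is overwritten with the recomputed values rather than supplemented by a second row, which is the property the append architecture lacked.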

Second: One shared table instead of two. Both processing paths were scanning the same source rows—the only difference was a derived column calculation. By storing the raw value once and deriving at output time, we halved the Phase A scan cost.

The Shared-Incremental approach identifies which activity keys were touched by new data, re-reads their complete history, recomputes aggregations, and MERGEs updates back into the activity table. LAG window functions require a second MERGE pass because updating one row can change the LAG value for the next row in the same device partition.
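The LAG ripple effect can be shown in a few lines of plain Python; the field names are assumed, and a real implementation would do this per device partition in SQL:

```python
# One device partition, ordered by activity start. The first row was just
# updated by the MERGE, so the next row's previous-value column is stale.
partition = [
    {"key": 1, "total": 30},
    {"key": 2, "total": 50},
]

def recompute_lag(rows):
    """Second pass: refresh the LAG(total) column across the partition."""
    prev = None
    for row in rows:
        row["prev_total"] = prev  # value of `total` on the preceding row
        prev = row["total"]
    return rows

recompute_lag(partition)
print(partition[1]["prev_total"])  # 30 -- picked up the merged value
```

This is why the second MERGE pass can’t be skipped: a single-row update invalidates a derived column on a row that the first pass never touched.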

The Validation That Saved Us

Here’s what the numbers looked like after I built proper ground-truth validation:

| Approach | Extra Keys | Missing Keys | Verdict |
|---|---|---|---|
| Dual-Incremental | 29 (at 3B scale) | 8 (at 3B scale) | INCORRECT |
| Shared-Incremental | 0 | 0 | Correct |

The Shared-Incremental approach was not as fast as the broken design—roughly 5–8x faster than baseline instead of 12.6x—but it was correct. Zero structural validation errors across all 24 completed benchmark runs.

Production Impact

| Metric | Baseline (Full Recompute) | Shared-Incremental | Improvement |
|---|---|---|---|
| Daily runtime | ~92 minutes | ~12–19 minutes (est.) | 5–8x faster |
| Daily credits | ~12.3 | ~1.6–2.5 | 80–87% savings |
| Annual credits | ~4,475 | ~580–920 | ~3,500–3,900 saved |
| One-time seed cost | — | ~30–45 minutes | Paid back in 2–3 days |

The speedup increases with data size because Phase B cost is bounded by touched activities, not total row count. The bottleneck wasn’t Snowpark syntax. It was query shape.

The Benchmark Design That Saved Us

The validation methodology is what made the difference between shipping a 12.6x speedup that corrupts data and a 5–8x speedup that preserves correctness.

I tested across three data scales, four warehouse sizes, and three architectural approaches. Every run included row-level EXCEPT validation against ground-truth tables. I disabled result caching. I recorded timings to database tables for reproducible analysis.

Small data hides bugs. Small data also hides scaling problems. At 30 million rows, all approaches looked similar because fixed overhead dominated. At 3 billion rows, the real computation costs separated. Large warehouses hide problems too—on the biggest warehouse, the full scan was so parallelized that the incremental advantage looked smaller than it actually was.

Why EXCEPT Validation, Not Checksums

Aggregate checksums can match while individual rows are wrong. A checksum of 1,000,000 rows that sum to 42 tells you nothing about whether those are the right 1,000,000 rows.

Row-level EXCEPT queries compare every column of every row between the candidate output and the known-correct ground truth:

```sql
-- Rows in candidate but not in ground truth (phantoms)
SELECT * FROM candidate EXCEPT SELECT * FROM ground_truth;

-- Rows in ground truth but not in candidate (missing)
SELECT * FROM ground_truth EXCEPT SELECT * FROM candidate;
```

This caught the structural errors that aggregate checks would have missed entirely.
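The same two queries can be demonstrated end to end against an in-memory SQLite database, with toy tables where the checksum agrees but the rows do not:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ground_truth (k TEXT, v INTEGER);
    CREATE TABLE candidate   (k TEXT, v INTEGER);
    INSERT INTO ground_truth VALUES ('a', 10), ('b', 32);
    INSERT INTO candidate    VALUES ('a', 12), ('b', 30);  -- same SUM(v)!
""")

sums = conn.execute("""
    SELECT (SELECT SUM(v) FROM ground_truth), (SELECT SUM(v) FROM candidate)
""").fetchone()
print(sums)  # (42, 42) -- the checksum is satisfied

phantoms = conn.execute(
    "SELECT * FROM candidate EXCEPT SELECT * FROM ground_truth ORDER BY k"
).fetchall()
missing = conn.execute(
    "SELECT * FROM ground_truth EXCEPT SELECT * FROM candidate ORDER BY k"
).fetchall()
print(len(phantoms), len(missing))  # 2 2 -- every row is actually wrong
```

The checksum passes while every single row is wrong, which is the failure mode the row-level diff exists to catch.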

The Real Lesson

The question stopped being “how do I make this pipeline faster?” and became “what part of the current design is fundamentally wasting work, and how do I change that without breaking correctness?”

Fast wrong is worse than slow right. The useful pattern was: fix the query shape, benchmark across multiple scales, validate with row-level EXCEPT, and only then talk about speedup.

The one-line Snowpark cache_result() fix, which materializes an intermediate dataframe so it is not recomputed, gave a quick 20% improvement with zero risk. The architectural redesign took more work but delivered a 5–8x speedup with correctness intact. The tempting incremental approach that looked fastest in benchmarks would have silently corrupted production data for months.

That’s the optimization work that actually matters—not just making things fast, but proving they’re right before they go live.

How I Built This

This optimization wasn’t a solo coding effort. I worked with two AI coding agents in parallel, using each for what it does best.

For planning and architectural decisions, I used GPT 5.4 Pro inside Codex CLI. It excelled at reasoning through the tradeoffs between different incremental approaches, identifying the split-activity bug conceptually before we saw it in the data, and helping design the MERGE-based correction strategy.

For implementation and iterative debugging, I used Opus 4.6 inside Cortex Code CLI. It handled the heavy lifting: writing the Snowpark dataframes, building the parameterized benchmark notebook, constructing the EXCEPT validation queries, and iterating through the 36 benchmark combinations. When a validation query returned unexpected diffs, it traced through the logic and fixed the root cause without losing context across a long session.

The pattern that worked: GPT 5.4 Pro for “what should we do?” and Opus 4.6 for “let’s make it happen.” I was the decision-maker—choosing which approach to pursue, interpreting the benchmark results, and calling off the Dual-Incremental approach when the data showed it was wrong. The agents compressed the distance between deciding and executing.

Copyright 2026, Farrukh Nauman