Fast Wrong Is Worse Than Slow Right
How I redesigned a daily full-recompute pipeline into a shared incremental one
data pipeline optimization, incremental processing, data correctness, Snowflake, benchmark design
The most dangerous optimization I tested was also the fastest. On paper, it looked great. It avoided full recomputes, cut runtime dramatically, and made the benchmark tables look easy. It was also wrong—and that changed everything about how I approached the problem.
The Problem
A critical daily analytics pipeline had been running for years. Every morning, it recomputed four downstream tables from a ~30 billion row source. The job took about 90 minutes on a large Snowpark-optimized warehouse and consumed roughly 4,500 Snowflake credits annually. At 3 billion rows of test data, 98% of the runtime was spent on two full table scans—not the output writes, not the aggregations, not the window functions. Just scanning the same massive dataset twice.
The pipeline was expensive, but it was correct. That matters when downstream teams depend on your output for operational decisions.
The Temptation: Incremental Processing
If the bottleneck is scanning 30 billion rows, why scan them every day? Most of that data hasn’t changed; only the rows that arrived since yesterday are new.
The idea was straightforward: build an incremental system with two phases. Phase A would run once to seed a persistent activity table from historical data. Phase B would run daily—scan only new rows since a watermark, aggregate them, append to the cache, and recompute the four small output tables from the cached activity layer (~6 million rows, not 30 billion).
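The watermark mechanics behind Phase B can be sketched in a few lines of plain Python (hypothetical schema and field names; the real pipeline used Snowpark dataframes against Snowflake tables):

```python
# Minimal sketch of watermark-based incremental scanning (hypothetical
# schema; the real pipeline ran Snowpark against a ~30B-row source table).

def rows_since(source, watermark_ts):
    """Phase B only scans rows that arrived after the last run's watermark."""
    return [r for r in source if r["ts"] > watermark_ts]

source = [
    {"ts": 1, "device": "X", "value": 10},  # historical, aggregated in Phase A
    {"ts": 2, "device": "X", "value": 20},  # historical
    {"ts": 3, "device": "Y", "value": 5},   # arrived since yesterday's run
]

new_rows = rows_since(source, watermark_ts=2)   # only the ts=3 row
new_watermark = max(r["ts"] for r in source)    # advance for tomorrow's run
```

In Snowflake terms, the watermark predicate becomes a pruned range scan over new partitions rather than a full table scan.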
I built this “Dual-Incremental” approach with separate tables for the normal and filtered processing paths. The benchmarks were stunning:
| Dataset | Full Recompute | Dual-Incremental Daily | Speedup |
|---|---|---|---|
| 300M rows | 136 seconds | 17 seconds | 8x |
| 3B rows | 350 seconds | 28 seconds | 12.6x |
At production scale, this would cut a 90-minute job to roughly 7 minutes. The annual credit savings would be substantial. I was ready to write the deployment plan.
Then I ran the correctness validation.
Why I Didn’t Trust the First Benchmark
Here’s what a typical “optimization test” looks like in practice:
- Take a small dataset from production
- Run the old code once → 50 seconds
- Run the new code once → 15 seconds
- Conclude: “3x faster!” → Open PR → Merge → Move on
This is exactly what I refused to do. The Dual-Incremental approach would have passed that test with flying colors—and would have corrupted production data for months before anyone noticed.
Instead, I built a multi-dimensional benchmark that tested along three axes:
| Axis | What I Tested | Why It Matters |
|---|---|---|
| Data Scale | 30M → 300M → 3B rows (1/1000x to 1/10x production) | Small data hides bugs. The Dual-Incremental bug showed 0 errors at 30M rows but 29 phantom + 8 missing rows at 3B. |
| Warehouse Size | Small through Large, including two Medium variants | Large warehouses parallelize scans so effectively that they hide algorithmic costs. On a small warehouse the incremental advantage was 4.0x; on large, only 1.2x. |
| Correctness | Row-level EXCEPT validation on every single run | Aggregate checksums can match while individual rows are wrong. Row-level diffs caught the structural errors that checksums missed. |
That meant 36 benchmark combinations (24 completed—3B timed out on smaller warehouses), producing timing entries and validation records all stored in database tables for reproducible analysis.
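The grid itself is trivial to enumerate; a sketch with illustrative labels (the exact warehouse names are assumptions, not the production ones):

```python
from itertools import product

# Illustrative labels: 3 data scales x 4 warehouse sizes x 3 approaches = 36.
scales = ["30M", "300M", "3B"]
warehouses = ["S", "M1", "M2", "L"]  # assumed names for Small..Large incl. two Medium variants
approaches = ["full-recompute", "dual-incremental", "shared-incremental"]

combos = list(product(scales, warehouses, approaches))
# 36 combinations; each run also writes timing and validation rows to tables.
```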
The Bug That Scales With Data
Consider an activity (device X, activity #123) that has telemetry rows both before and after the watermark cutoff. During Phase A, the pipeline aggregates rows A and B from historical data. During Phase B, it discovers new rows C and D for the same activity.
The problem: the append-based architecture creates a second activity row from rows C and D. It cannot go back and update the Phase A row. The correct result should be a single row aggregating all four rows. Instead, we get phantom rows in weekly and monthly tables, missing keys when split data fails filter conditions, and wrong aggregate values computed on partial data.
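The failure mode is easy to reproduce in miniature (hypothetical schema; rows A and B are historical, C and D arrive after the watermark):

```python
# Sketch of the split-activity bug (hypothetical schema).

def aggregate(rows):
    """One activity row per (device, activity) key: event count + value sum."""
    out = {}
    for r in rows:
        key = (r["device"], r["activity"])
        agg = out.setdefault(key, {"events": 0, "total": 0})
        agg["events"] += 1
        agg["total"] += r["value"]
    return out

historical = [{"device": "X", "activity": 123, "value": v} for v in (1, 2)]  # A, B
new_rows   = [{"device": "X", "activity": 123, "value": v} for v in (3, 4)]  # C, D

# Correct result: one activity row aggregating all four telemetry rows.
truth = aggregate(historical + new_rows)

# Append-based Dual-Incremental: the Phase A row plus a second, phantom row
# built only from C and D -- the append cannot update the Phase A row in place.
cache = list(aggregate(historical).items()) + list(aggregate(new_rows).items())
# cache now holds TWO rows for ("X", 123), each aggregating partial data.
```

Any downstream table that groups or filters on these activity rows now sees one phantom key and two wrong aggregates where there should be one correct row.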
This “split-activity” bug doesn’t appear at small scale. At 30 million rows—fast to iterate on—everything looked clean. At 300 million rows, there were 3 phantom rows per warehouse. At 3 billion rows, 29 phantom rows and 8 missing rows. At production scale with ~580 million activities, this would produce hundreds of incorrect rows.
The fastest optimization I had found was unusable. The bug was fundamental to the append architecture, not a tuning issue. No amount of watermark tweaking could fix it.
What the Validation Revealed
Here’s what the numbers looked like after I built proper ground-truth validation:
| Approach | Extra Keys | Missing Keys | Verdict |
|---|---|---|---|
| Dual-Incremental | 29 (at 3B scale) | 8 (at 3B scale) | INCORRECT |
| Shared-Incremental | 0 | 0 | Correct |
The Shared-Incremental approach was not as fast as the broken design—roughly 5–8x faster than baseline instead of 12.6x—but it was correct. Zero structural validation errors across all 24 completed benchmark runs.
Production Impact
| Metric | Baseline (Full Recompute) | Shared-Incremental | Improvement |
|---|---|---|---|
| Daily runtime | ~92 minutes | ~12–19 minutes (est.) | 5–8x faster |
| Daily credits | ~12.3 | ~1.6–2.5 | 80–87% savings |
| Annual credits | ~4,475 | ~580–920 | ~3,500–3,900 saved |
| One-time seed cost | — | ~30–45 minutes | Paid back in 2–3 days |
The speedup increases with data size because Phase B cost is bounded by touched activities, not total row count. The bottleneck wasn’t Snowpark syntax. It was query shape.
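The shared approach avoids the split-activity bug by updating the existing activity row by key instead of appending a second one, in the spirit of the MERGE-based correction mentioned later. A minimal dict-based sketch (hypothetical schema; in Snowflake this would be a MERGE into the shared activity table):

```python
# Sketch of the shared, merge-style update (hypothetical schema): Phase B
# folds new rows into the EXISTING activity row rather than appending, so
# daily cost is bounded by touched activities, not total row count.

def merge_new_rows(cache, new_rows):
    touched = set()
    for r in new_rows:
        key = (r["device"], r["activity"])
        agg = cache.setdefault(key, {"events": 0, "total": 0})
        agg["events"] += 1
        agg["total"] += r["value"]
        touched.add(key)
    return touched  # only these keys need their output rows recomputed

cache = {("X", 123): {"events": 2, "total": 3}}  # seeded by Phase A
touched = merge_new_rows(cache, [
    {"device": "X", "activity": 123, "value": 3},
    {"device": "X", "activity": 123, "value": 4},
])
# cache[("X", 123)] is now {"events": 4, "total": 10} -- one row, no phantom.
```

This only works for aggregates that compose incrementally (counts, sums, max); that constraint is part of why the design change is architectural rather than a tuning knob.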
The Benchmark Design That Saved Us
The validation methodology is what made the difference between shipping a 12.6x speedup that corrupts data and a 5–8x speedup that preserves correctness.
I tested across three data scales, four warehouse sizes, and three architectural approaches. Every run included row-level EXCEPT validation against ground-truth tables. I disabled result caching. I recorded timings to database tables for reproducible analysis.
Small data hides bugs. Small data also hides scaling problems. At 30 million rows, all approaches looked similar because fixed overhead dominated. At 3 billion rows, the real computation costs separated. Large warehouses hide problems too—on the biggest warehouse, the full scan was so parallelized that the incremental advantage looked smaller than it actually was.
Why EXCEPT Validation, Not Checksums
Aggregate checksums can match while individual rows are wrong. A checksum of 1,000,000 rows that sum to 42 tells you nothing about whether those are the right 1,000,000 rows.
Row-level EXCEPT queries compare every column of every row between the candidate output and the known-correct ground truth:
```sql
-- Rows in candidate but not in ground truth (phantoms)
SELECT * FROM candidate EXCEPT SELECT * FROM ground_truth;

-- Rows in ground truth but not in candidate (missing)
SELECT * FROM ground_truth EXCEPT SELECT * FROM candidate;
```

This caught the structural errors that aggregate checks would have missed entirely.
The Real Lesson
The question stopped being “how do I make this pipeline faster?” and became “what part of the current design is fundamentally wasting work, and how do I change that without breaking correctness?”
Fast wrong is worse than slow right. The useful pattern was: fix the query shape, benchmark across multiple scales, validate with row-level EXCEPT, and only then talk about speedup.
The one-line cache_result() fix, which materializes an intermediate result so it isn’t recomputed on a later pass, gave a quick 20% improvement with zero risk. The architectural redesign took more work but delivered a 5–8x speedup with correctness intact. The tempting incremental approach that looked fastest in benchmarks would have silently corrupted production data for months.
That’s the optimization work that actually matters—not just making things fast, but proving they’re right before they go live.
How I Built This
This optimization wasn’t a solo coding effort. I worked with two AI coding agents in parallel, using each for what it does best.
For planning and architectural decisions, I used GPT 5.4 Pro inside Codex CLI. It excelled at reasoning through the tradeoffs between different incremental approaches, identifying the split-activity bug conceptually before we saw it in the data, and helping design the MERGE-based correction strategy.
For implementation and iterative debugging, I used Opus 4.6 inside Cortex Code CLI. It handled the heavy lifting: writing the Snowpark dataframes, building the parameterized benchmark notebook, constructing the EXCEPT validation queries, and iterating through the 36 benchmark combinations. When a validation query returned unexpected diffs, it traced through the logic and fixed the root cause without losing context across a long session.
The pattern that worked: GPT 5.4 Pro for “what should we do?” and Opus 4.6 for “let’s make it happen.” I was the decision-maker—choosing which approach to pursue, interpreting the benchmark results, and calling off the Dual-Incremental approach when the data showed it was wrong. The agents compressed the distance between deciding and executing.