The Compound Error Argument Has a Compound Error


There's an argument floating around right now about why AI agents can't work at scale. The math goes like this: if an LLM is 95% accurate per step, then after 30 steps you get 0.95³⁰, which is roughly 21% overall reliability. The curve is exponential and it only goes one direction.
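The arithmetic itself is easy to reproduce:

```python
# The compound error model: n independent steps, each 95% reliable.
per_step = 0.95
steps = 30
reliability = per_step ** steps
print(f"{reliability:.1%}")  # prints "21.5%"
```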

The first place I heard someone lay it out clearly was Meredith Whittaker's talk at 39C3 in December 2025¹. I watched it on YouTube. She called it "the mathematics of failure" and was upfront about being generous with her numbers: "there's no such thing as an AI model that has 95% accuracy even on narrow benchmarks, but we're going to be generous."

Whittaker does important work on privacy and AI accountability, and the broader talk is worth watching. But this specific framing has spread well past that context. The assumptions underneath it deserve a closer look.

The math is correct. The architecture it describes doesn't match any serious deployment I've seen.

The talk also pointed to the CMU AgentCompany benchmark. Researchers built a simulated corporate environment with 175 realistic tasks, ran the best available models through them, and the models failed 70% of the time. A 30% success rate on tasks designed to mirror real office work².

Those results are bad. But the failure mode they expose is more specific than "agents can't work at scale."

What the Argument Assumes

The 0.95ⁿ formula treats each step as an independent coin flip. Error at step 14 has no relationship to error at step 15. Errors are binary: a step either fails or it doesn't. And the system has no feedback mechanism. A failure at step 8 propagates invisibly through the remaining steps until whatever broken output falls out the end.

That's a valid model for a communication channel with no error correction. Software doesn't work that way.

Grounding Stops a Lot of Errors Cold

The first assumption that breaks down is the invisible propagation. Agents don't just produce prose and hand it forward. They call APIs, write code, run test suites, query databases, execute shell commands. These tools have their own validation.

If an agent writes broken SQL, the database throws an error. The agent gets that error back in context and can retry. If it generates code that doesn't compile, the compiler says so. If a network request returns 400, the agent knows the call failed.
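A minimal sketch of that retry-on-feedback structure, with hypothetical stand-ins for the tool call and the model call (the structure is the point, not the names):

```python
# Hedged sketch: one grounded agent step. `run_tool` is whatever
# validates the output in practice (a database, compiler, or HTTP
# client); `llm_fix` stands in for a model call that sees the error.
def grounded_step(attempt, llm_fix, run_tool, max_attempts=3):
    for _ in range(max_attempts):
        try:
            return run_tool(attempt)              # environment validates
        except Exception as err:
            attempt = llm_fix(attempt, str(err))  # error re-enters context
    raise RuntimeError("step failed after retries; surface it, don't cascade")
```

The key property is the `except` branch: a broken result becomes input to the next attempt instead of propagating downstream.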

I build pipelines using Claude. The multi-agent system that maintains this site runs file writes, git operations, PHP artisan commands, and test suites on every substantive change. The test runner either passes or it doesn't. That output goes back into the loop. A step that produces a broken result doesn't silently cascade because the grounding layer catches it before the next step starts.

The compound error formula has no slot for "the environment gave feedback." That gap matters.

Agents Iterate

The second assumption is that each step is a one-shot attempt. In practice, agentic systems are loops. A sub-task fails, the agent gets the failure reason back, and it tries again with a corrected approach.

This is how I'd want a junior developer to work. They run the tests, the tests fail, they look at the output, they fix the problem. We don't evaluate a developer's output quality by assuming every attempt is independent and multiplying the failure rates together. We evaluate whether they can get to a working result.

Error rate per step is not the same as task completion rate over a loop. The 0.95ⁿ formula conflates them.
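The difference is easy to put numbers on. Assuming retries are independent (a simplification, since real retries share context, but it shows the shape):

```python
# Per-step error rate vs. task completion rate over a retry loop.
p_fail, retries, steps = 0.05, 3, 30
per_step_success = 1 - p_fail ** retries   # fails only if all 3 tries fail
pipeline_success = per_step_success ** steps
print(f"{pipeline_success:.1%}")           # ~99.6%, not 21%
```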

Errors Are Not Uniform

The formula treats every step as having the same 5% failure rate and the same failure weight. Neither is true.

Some steps are trivial. Formatting a string, appending to a list, reading a file. The failure probability for these is well under 5%, and a failure is usually immediately obvious.

Some steps are high-stakes and genuinely hard. Designing the right database schema for a new feature. Making the right judgment call when requirements are ambiguous. These deserve either human review or automated validation against a spec, which is why pipelines have checkpoints and review gates.

The pipeline I run has a spec reviewer that checks whether the implementation actually achieved what the issue asked for. It has a separate test validator whose job is to audit the test suite itself: did the testing agent write hollow assertions, leave TODOs, or fake coverage? The system doesn't trust its own output. It has dedicated agents whose only job is to challenge other agents' work.

Treating a 30-step pipeline as 30 independent coin flips ignores that the steps aren't interchangeable. The failure distribution isn't uniform.
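To see how much the uniformity assumption matters, compare a uniform 5% rate against a hypothetical split (the distribution is made up for illustration, not measured) where most steps are trivial and a handful are hard:

```python
import math

# 25 trivial steps at 0.1% failure, 5 hard steps at 5% failure.
rates = [0.001] * 25 + [0.05] * 5
mixed = math.prod(1 - r for r in rates)   # ~75%
uniform = 0.95 ** 30                      # ~21%
```

Same thirty steps, very different curve. And the five hard steps are exactly where you'd put the review gates.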

Checkpoints Break the Chain

The compound error formula requires an unbroken chain from step 1 to step N. Any checkpoint that verifies output before the next phase starts resets the accumulation. That checkpoint can be a human, an automated review, or a test suite. It just needs to actually work.
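To put rough numbers on that, assume a checkpoint every five steps that sends a failed segment back for another attempt, up to three tries. The model is crude, but it shows what resetting the accumulation does:

```python
per_step, steps, segment, attempts = 0.95, 30, 5, 3
unchecked = per_step ** steps                   # ~21%: the original curve
seg_pass = per_step ** segment                  # ~77% per attempt
seg_reliable = 1 - (1 - seg_pass) ** attempts   # ~98.8% with retries
checked = seg_reliable ** (steps // segment)    # ~93% end to end
```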

Human review is the one people reach for first, partly because it catches errors and partly because it assigns accountability when something goes wrong. Both are real reasons to use it. But I'd be careful about treating it as a complete solution.

I spent years in high-stakes manufacturing. A big part of what the quality team did was shore up problems that humans introduced or missed. Humans are reliable reviewers when they're knowledgeable in the domain, have the full context, and are actively engaged. In practice, those three conditions aren't always true at the same time. A human checkpoint gives you accountability. It doesn't guarantee the error actually gets caught.

Automated review is more consistent. The pipeline I use runs a spec reviewer that checks whether the output matches the original requirements. It runs a code reviewer that checks for patterns and standards violations. It runs a test validator that audits the test suite for cheating: hollow assertions, TODO placeholders, mock abuse. These gates run the same way every time. They don't get tired, distracted, or unfamiliar with the codebase.

The best approach is usually both. Automated gates handle the stuff humans are bad at being consistent about. Humans are phenomenal at catching issues nobody was looking for.

In the pipelines I build, the automated checks catch structural problems: did the output parse, did the LLM return valid JSON, did the result fit the schema. What I catch when I skim the output is different. A summary that's technically accurate but misleading. A categorization that's wrong in a way no validator would flag. Things no one thought to write a check for.
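A sketch of what one of those structural gates looks like, with hypothetical field names:

```python
import json

def validate_output(raw: str) -> dict:
    """Structural checks only: did it parse, does it fit the shape?"""
    data = json.loads(raw)  # valid JSON at all?
    missing = {"summary", "category"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not isinstance(data["summary"], str):
        raise TypeError("summary must be a string")
    return data
```

A gate like this catches malformed output every time. It says nothing about whether the summary is misleading, which is the part a human skim catches.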

Where the Argument Is Actually Valid

The compound error concern is real in a specific regime: fully autonomous agents, no human checkpoints, no tool grounding, no feedback loops, long horizons, ambiguous tasks. A system that generates 30 steps of unverified action and hands you the final result is going to accumulate errors.

The AgentCompany benchmark largely tested that regime. I went through the paper. The agents did get environmental feedback: error messages, command output, browser state. They ran in multi-step loops and could observe when something failed. So this wasn't a fully blind setup. But there were no review gates, no spec validation, no permission checks on destructive actions, no human checkpoints, and no second attempts. Each agent got one run per task.

It's the equivalent of handing a new employee a task description, full access to company tools, and no supervision, then reviewing their work afterward. The employee can see when a command fails. Nobody's checking in, and nobody catches it when they go off track.

One of the benchmark agents was told to send a message to an employee it couldn't find in the database. Instead of reporting the failure, it renamed a different employee to match the query, then sent the message.

Given the setup, this was predictable. An agent with no confirmation gate on destructive writes, no permissions check on record modifications, and no review before committing changes to a shared database will do things like that. The agent could see the environment. It just didn't have the architecture to catch and correct its own mistakes.
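The missing piece is small. A confirmation gate on destructive writes can be as simple as refusing to dispatch certain actions without approval (all names here are illustrative, not from the benchmark):

```python
# Hedged sketch: an agent proposes tool actions by name; mutations to
# shared records are refused unless explicitly approved.
DESTRUCTIVE = {"rename_record", "delete_record", "update_record"}

def execute(action, args, registry, approved=False):
    """Run a tool action, but gate destructive ones behind approval."""
    if action in DESTRUCTIVE and not approved:
        raise PermissionError(f"{action!r} requires review before commit")
    return registry[action](**args)
```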

A well-designed pipeline has a gate there. The benchmark agents didn't. The benchmark exposed an architectural gap, and the actual finding is more specific than "agents fail 70% of the time."

Every team I'm aware of that shipped autonomous agents at scale ran into exactly this. The failures clustered around missing gates, not around model capability. The approach works. Building the engineering layer around it is the actual job. Add grounding, add checkpoints, add test coverage, and design task decomposition so that errors surface early. Build systems that route around error propagation and the 0.95ⁿ curve stops describing your system.

The Analogy That Actually Fits

TCP doesn't assume every packet arrives. It assumes packets get dropped, corrupted, and delivered out of order. The protocol has acknowledgment, retransmission, checksums, and sequence numbers. The internet works because the engineering layer treats unreliability as a given and builds around it.

Software engineering has been building reliable systems from unreliable components since the field existed. Compilers catch bugs. Test suites catch regressions. Code review catches design errors. Type systems catch a whole class of mistakes before the program runs.

The compound error argument applies clean math to a system with no error correction. Real systems have error correction.

What I Actually Ship

The pipeline that builds this site runs hundreds of steps per feature. Every one of those steps is a place where the compound error formula says reliability should decay. It doesn't, because the pipeline is built around the assumption that individual steps will fail.

Before any code gets written, a research agent gathers context. An evaluation agent assesses the approach and flags risks. If the evaluation finds blocking issues, the pipeline stops. The formula has no slot for a system that decides not to proceed.

A planning agent breaks the work into tasks and assigns each to a specialist. It pulls findings from previous runs to inform those assignments. Then implementation starts. Each task goes through its own cycle: an agent writes code, a different agent reviews it against the spec, and failures loop back with feedback for another attempt. Each cycle has a maximum iteration count. If it can't converge, the pipeline stops and flags it for human attention.

Once the code passes spec review, the test suite runs. Failures go to a fixing agent. Once tests pass, a separate validator audits the tests themselves: did the testing agent write hollow assertions, leave TODOs, fake coverage? The system checks its own work, then checks the checks.

After every run, a retrospective agent analyzes the full session: where errors clustered, which agent selections worked, which stages took the most iterations. It writes findings to a persistent memory file. The next run reads those findings and adjusts. Errors get corrected within a step, within a run, and across runs.

I review the pull request before merging. By that point, the mechanical verification is done. My review is about judgment: does the approach make sense, is the scope right, is this something I'd want to own.

Individual model calls fail. The system is designed for that. The failure rate on shipped features is nowhere near the roughly 79% the formula predicts. The 0.95ⁿ curve doesn't describe this pipeline, because this pipeline isn't what the formula assumes.