AI Is Making Your Team Slower — The Math Your CEO Won’t Show You

AI Engineering
Software Engineering
Engineering Leadership
Every company measuring AI productivity is counting the wrong thing. When you measure both sides — output volume and downstream cost — the numbers tell a different story.
Author: B. Talvinder

Published: March 18, 2026

Every company measuring AI productivity is counting the wrong thing.

They’re measuring output volume: PRs merged, lines written, tickets closed. They’re not measuring the cost of what ships: the review burden, the debugging time, the incidents caused by code nobody understood before it hit production.

When you count both sides, the math doesn’t work the way your CEO’s slide deck says it does.

The Evidence Is Piling Up

This week, The Pragmatic Engineer catalogued what’s actually happening inside companies that went all-in on AI coding agents. The findings aren’t theoretical.

Amazon’s retail engineering team saw a spike in outages traced directly to AI agents. The fix? Requiring senior engineer sign-off on all AI-assisted changes from junior developers. That’s not a productivity gain. That’s adding a bottleneck to compensate for unreliable output.

Anthropic — the company that builds Claude — ships over 80% of its production code with AI. Their flagship website degraded so badly that paying customers noticed before anyone internally did. The irony writes itself.

Meta and Uber are tracking AI token usage in performance reviews. Engineers who don’t use AI tools enough look unproductive. Engineers who use them indiscriminately look great on paper — until the bugs ship.

The Three Taxes You’re Not Counting

Here’s the falsifiable claim: teams that measure AI productivity only by output volume will see their incident rate and mean-time-to-resolve increase by 30% or more within 12 months, compared to teams that gate AI output with validation layers.

The mechanism has three parts.

The Review Tax

Every AI-generated PR still needs human review. But AI-generated code is harder to review than human-written code, because the reviewer can’t infer intent from the author’s history.

With human code, you know the developer’s context: what they were trying to solve, what trade-offs they considered, what they tested. With AI code, you’re reverse-engineering intent from output. That’s slower, not faster.

Amazon learned this the hard way. Junior engineers using AI agents shipped code that looked correct — clean formatting, reasonable variable names, passing tests — but had subtle logical errors that only surfaced in production. Reviewers couldn’t distinguish “AI wrote this well” from “AI wrote this plausibly.”

The Refactoring Freeze

Dax Reed, who built OpenCode, points out something every experienced engineer recognises: AI agents discourage refactoring. When code is cheap to generate, nobody wants to clean it up. Why spend an afternoon restructuring a module when the agent writes a new one in ten minutes?

The result is an expanding codebase where nothing gets simplified, patterns don’t converge, and cognitive load increases week over week.

This is the velocity trap. Short-term speed, long-term slowdown. Sentry’s CTO observed the same pattern: AI removes the barrier to getting started, which sounds great until you realise that “getting started” was never the bottleneck. The bottleneck was maintaining, debugging, and evolving what you built. AI makes the first part trivially easy and the second part measurably harder.

The Incentive Poison

When companies tie AI token usage to performance reviews, they’re telling engineers: “Use the tool, regardless of whether it helps.”

This is the corporate equivalent of measuring developer productivity by lines of code written. It rewards volume, punishes judgment, and guarantees that the engineers who are most careful about code quality look the least productive.

Engineers who know the AI output is mediocre ship it anyway, because slowing down to rewrite it makes their metrics look bad. The codebase degrades. The team slows down. The metrics still look great, because the metrics are measuring the wrong thing.

What This Looks Like Up Close

I’ve seen this pattern firsthand while building multi-agent systems at Ostronaut. We generate training content — presentations, videos, quizzes. Early on, the agents were fast. They produced a complete training module in minutes. The output looked good. Formatting was clean. Structure was reasonable.

It was also wrong about 15-20% of the time. Not obviously wrong — subtly wrong. A slide deck where the concept progression didn’t build properly. A quiz where the distractors were too close to the correct answer. A video script that repeated a key point in slightly different words, creating confusion instead of reinforcement.

We didn’t fix this with better prompts. We fixed it by building a validation layer — automated checks that ran after every generation step, before anything reached a human reviewer. Content validation caught conceptual errors. Design validation caught structural problems. Integration validation caught mismatches between components.

That validation layer was harder to build than the generation layer. It took longer. It required more engineering judgment. And it’s the only reason the system works reliably.
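To make the shape of that gate concrete, here’s a minimal sketch in Python. The check functions are illustrative stand-ins, not our actual validators: every generated artifact runs through a chain of checks, and anything that fails goes back for regeneration instead of landing on a human reviewer’s desk.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ValidationResult:
    check: str
    passed: bool
    detail: str = ""

# Toy stand-in for content validation: each slide must introduce
# at least one concept the deck hasn't covered yet.
def check_concept_progression(artifact: dict) -> ValidationResult:
    seen: set[str] = set()
    for i, slide in enumerate(artifact.get("slides", [])):
        new = set(slide.get("concepts", [])) - seen
        if not new:
            return ValidationResult("concept_progression", False,
                                    f"slide {i} adds no new concept")
        seen |= new
    return ValidationResult("concept_progression", True)

# Toy stand-in for the "repeated key point" failure mode.
def check_no_repeated_points(artifact: dict) -> ValidationResult:
    points = artifact.get("key_points", [])
    dupes = {p for p in points if points.count(p) > 1}
    if dupes:
        return ValidationResult("no_repeated_points", False,
                                f"repeated: {sorted(dupes)}")
    return ValidationResult("no_repeated_points", True)

CHECKS: list[Callable[[dict], ValidationResult]] = [
    check_concept_progression,
    check_no_repeated_points,
]

def validate(artifact: dict) -> list[ValidationResult]:
    """Run every check; return only the failures."""
    return [r for check in CHECKS if not (r := check(artifact)).passed]
```

The calling code is the important part of the design: if `validate()` returns failures, the pipeline regenerates; only a clean artifact reaches human review. The gate is automated, cheap to run on every generation step, and it encodes the team’s definition of “good” rather than the model’s.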

The companies in Gergely’s article skipped this step. They deployed AI agents without validation gates, measured the output volume, and declared victory. Then the incidents started.

Why Better Models Won’t Save You

I used to think the answer was better models. If GPT-4 produces code that’s 80% reliable, GPT-5 will be 95% reliable, and eventually you won’t need validation.

That was wrong for two reasons.

First, the remaining failures are the expensive ones. The bugs that survive better models are the subtle, context-dependent bugs that cause production incidents. Better models don’t make validation cheaper — they make it more necessary, because what gets through is harder to catch.

Second, the validation layer isn’t just catching bugs. It’s encoding team knowledge. Our quality checks embed years of domain expertise — what makes a good slide progression, what makes a quiz effective, what makes a video script clear. That knowledge doesn’t exist in the model. It exists in the team. The validation layer is how you transfer institutional knowledge into the AI pipeline.

Companies that skip this aren’t just accepting more bugs. They’re disconnecting their AI pipeline from their institutional knowledge.

What to Measure Instead

| What Leadership Measures    | What Actually Happens             |
| --------------------------- | --------------------------------- |
| PRs merged per week (+52%)  | Review time per PR (+40%)         |
| Lines of code written (3x)  | Lines nobody understands (3x)     |
| Time to first commit (-60%) | Time to resolve incidents (+35%)  |
| Token usage per engineer    | Refactoring frequency (-70%)      |

If you’re measuring AI impact, stop counting PRs. Start counting:

  1. Incident rate per AI-assisted commit versus human-only commits
  2. Review time per PR — is it actually decreasing, or are reviewers rubber-stamping?
  3. Refactoring frequency — is your team still simplifying code, or just adding to it?
  4. Mean-time-to-resolve for bugs in AI-generated code versus human-written
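The first metric is the easiest to start with. Assuming you can tag commits as AI-assisted (e.g. from tooling metadata) and link incidents back to the commit that caused them via postmortems, the comparison is a few lines of Python. The data here is hypothetical and only illustrates the shape of the query:

```python
from dataclasses import dataclass

@dataclass
class Commit:
    sha: str
    ai_assisted: bool       # tagged from tooling metadata
    caused_incident: bool   # linked back from incident postmortems

def incident_rate(commits: list[Commit], ai: bool) -> float:
    """Incidents per commit for one cohort (AI-assisted or human-only)."""
    cohort = [c for c in commits if c.ai_assisted == ai]
    if not cohort:
        return 0.0
    return sum(c.caused_incident for c in cohort) / len(cohort)

# Hypothetical commit history, not real numbers.
history = [
    Commit("a1", ai_assisted=True,  caused_incident=True),
    Commit("a2", ai_assisted=True,  caused_incident=False),
    Commit("a3", ai_assisted=True,  caused_incident=True),
    Commit("h1", ai_assisted=False, caused_incident=False),
    Commit("h2", ai_assisted=False, caused_incident=True),
    Commit("h3", ai_assisted=False, caused_incident=False),
    Commit("h4", ai_assisted=False, caused_incident=False),
]

ai_rate = incident_rate(history, ai=True)       # 2 of 3 commits
human_rate = incident_rate(history, ai=False)   # 1 of 4 commits
```

The hard part isn’t the arithmetic — it’s the two joins it assumes: reliably tagging which commits were AI-assisted, and tracing incidents back to a causing commit. If your postmortem process doesn’t capture that link, that’s the first thing to fix.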

The companies that will win with AI coding agents are not the ones that deploy them fastest. They’re the ones that build the validation layer first and measure what matters — not how fast code is written, but how fast correct code ships and stays correct in production.

Speed without verification isn’t velocity. It’s technical debt with a marketing budget.