Why Consensus Voting Fails for Agent Truthfulness

Agentic Systems · Production AI · Multi-Agent Architecture
Pass@k works for code generation because test suites provide external verification. For factual accuracy, consensus amplifies correlated errors instead of cancelling them.
Author: B. Talvinder
Published: March 19, 2026

Pass@k is the most popular reliability pattern in production agent systems right now. Run the same task k times, take a majority vote on the output, ship the consensus answer. It works beautifully for code generation — a function either passes the test suite or it doesn’t. The objective verification is external to the agents.

For factual accuracy, the pattern collapses. And most teams deploying it haven’t figured out why yet.

The failure is structural, not probabilistic. Consensus voting assumes that errors are independent and randomly distributed. If Agent A hallucinates, Agent B probably won’t hallucinate the same thing. With enough agents, truth wins by majority. This assumption holds for coding tasks because the test suite is the arbiter. It does not hold for factual claims because there is no test suite for truth.

Three failure modes

Correlated hallucination. LLMs trained on similar data hallucinate in similar ways. Ask three instances of the same frontier model whether a specific paper exists, and if the title sounds plausible, all three will confidently confirm it. The errors aren’t independent — they’re correlated by training distribution. Majority vote amplifies the shared bias instead of cancelling it.

This is not a theoretical concern. A recent formal analysis showed that Pass@k reliability for factual tasks degrades rather than improves as k increases, precisely because the errors are correlated enough to violate the independence assumption that majority voting relies on. More agents, worse answers.
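A toy simulation (not the cited analysis) makes the effect concrete. Model each agent as correct with probability 0.7, but let a fraction of trials share a single common draw — the "same training distribution" case. Under independence, majority voting helps; under heavy correlation, five agents are barely better than one:

```python
import random

def simulate(k, p_correct, rho, trials=20000):
    """Estimate majority-vote accuracy for k agents. With probability
    rho, all agents copy one shared draw (fully correlated errors);
    otherwise each agent draws independently."""
    wins = 0
    for _ in range(trials):
        if random.random() < rho:
            # Correlated case: every agent gives the same answer.
            votes = [random.random() < p_correct] * k
        else:
            votes = [random.random() < p_correct for _ in range(k)]
        if sum(votes) > k / 2:
            wins += 1
    return wins / trials

random.seed(0)
single = simulate(1, 0.7, 0.0)   # one agent, baseline ~0.70
indep  = simulate(5, 0.7, 0.0)   # five independent agents: voting helps
corr   = simulate(5, 0.7, 0.9)   # 90% shared draws: gain mostly vanishes
```

The independent ensemble climbs well above the single agent; the correlated ensemble collapses back toward the baseline, which is exactly the structural failure described above.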

The popularity trap. Consensus selects for the most common answer, not the most accurate one. In domains where the popular understanding is wrong — emerging science, contrarian market analysis, novel technical approaches — consensus voting systematically suppresses correct minority positions.

Three agents asked whether a particular drug interaction is dangerous will converge on whatever the training data’s majority position is. If the latest research contradicts the common understanding, the consensus will be confidently, democratically wrong.

Strategic ambiguity. When agents are optimized for agreement (as many multi-agent debate frameworks encourage), they learn to hedge toward safe, middle-ground positions. Not because the middle ground is true, but because it minimizes disagreement. The agents aren’t lying — they’re conflict-averse. The output reads as measured and reasonable. It’s also systematically biased toward conventional wisdom.

Why this matters now

The “just run it three times” pattern is spreading fast. Every agentic framework has a retry-and-vote mechanism. LangChain, CrewAI, AutoGen — all support multi-agent voting as a reliability strategy. The assumption that consensus equals reliability is baked into the tooling.

Production systems using this pattern for anything beyond code generation are carrying unquantified risk. Customer-facing chatbots, research assistants, medical information systems, financial analysis tools — all domains where correlated hallucination is more dangerous than a single wrong answer, because the consensus gives the appearance of validation.

What actually works

The fix is not more agents or better prompts. It’s structural.

Separate generation from verification. The agent that produces the answer must not be the same agent (or same architecture) that verifies it. Verification requires a different model, different training data, or — ideally — a non-LLM check against a ground-truth source. At Ostronaut, our validation agents use rule-based scoring with deterministic rubrics, not LLM-as-judge. The quality gate is independent of the generation pipeline.
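A minimal sketch of what a deterministic, rule-based gate can look like. The rubric names, checks, and weights here are illustrative, not Ostronaut's actual rules — the point is that every check is a predicate a human can audit, with no LLM in the loop:

```python
import re

# Each rubric entry: (name, deterministic predicate, weight).
# All values below are illustrative placeholders.
RUBRIC = [
    ("has_citation",  lambda t: bool(re.search(r"\[\d+\]|\(\d{4}\)", t)), 3),
    ("no_meta_talk",  lambda t: "as an ai" not in t.lower(),              1),
    ("within_length", lambda t: 50 <= len(t.split()) <= 800,              1),
]

def score(text: str) -> float:
    """Fraction of rubric weight earned — deterministic, reproducible."""
    total = sum(w for _, _, w in RUBRIC)
    earned = sum(w for _, check, w in RUBRIC if check(text))
    return earned / total

def passes_gate(text: str, threshold: float = 0.8) -> bool:
    return score(text) >= threshold
```

Because the gate shares nothing with the generation pipeline — no model, no prompt, no training data — a correlated hallucination cannot sail through on the strength of its own confidence.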

Adversarial framing over cooperative framing. Multi-agent debate works better when agents are explicitly tasked with finding flaws in each other’s outputs rather than converging on agreement. The incentive must be to disprove, not to confirm. This is the opposite of how most consensus systems are designed.

Confidence-weighted routing. Instead of majority vote, weight each agent’s contribution by its calibrated confidence on that specific task type. An agent that is well-calibrated on medical queries but poorly calibrated on legal queries should have different voting weights in each domain. This requires per-domain calibration data, which most teams don’t collect.
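A hedged sketch of that weighting, assuming per-domain accuracy has been measured offline on labeled data (the agent names and calibration numbers below are illustrative):

```python
from collections import defaultdict

# Calibrated accuracy per (agent, domain) pair, measured on a held-out
# labeled set. All values are illustrative placeholders.
CALIBRATION = {
    ("agent_a", "medical"): 0.92, ("agent_a", "legal"): 0.55,
    ("agent_b", "medical"): 0.60, ("agent_b", "legal"): 0.88,
}

def weighted_vote(answers: dict[str, str], domain: str) -> str:
    """answers maps agent name -> its answer. Each vote is weighted by
    the agent's calibrated accuracy in this domain, not counted equally.
    Unknown pairs fall back to a neutral 0.5 weight."""
    scores: dict[str, float] = defaultdict(float)
    for agent, answer in answers.items():
        scores[answer] += CALIBRATION.get((agent, domain), 0.5)
    return max(scores, key=scores.get)
```

The same pair of disagreeing agents can resolve differently per domain — the well-calibrated medical agent wins medical disputes, the well-calibrated legal agent wins legal ones — which a flat majority vote cannot express.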

External anchoring. For factual claims, the gold standard is retrieval-augmented verification — check the claim against a curated, trustworthy source. Not RAG for generation (which has its own problems), but RAG specifically for post-generation verification. The verification retrieval corpus should be smaller and higher-quality than the generation corpus.
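A minimal sketch of post-generation anchoring. A production system would use embedding retrieval over the curated corpus; plain token overlap keeps the sketch self-contained, and the corpus entries are made-up examples:

```python
# Small, curated, high-quality corpus used only for verification,
# separate from any generation-time retrieval. Entries are illustrative.
CURATED_CORPUS = [
    "warfarin interacts with aspirin and increases bleeding risk",
    "the 2024 guidance lists no interaction between drug x and drug y",
]

def support_score(claim: str, corpus=CURATED_CORPUS) -> float:
    """Best token-overlap ratio between the claim and any corpus doc —
    a stand-in for a real retrieval similarity score."""
    claim_tokens = set(claim.lower().split())
    best = 0.0
    for doc in corpus:
        doc_tokens = set(doc.lower().split())
        overlap = len(claim_tokens & doc_tokens) / max(len(claim_tokens), 1)
        best = max(best, overlap)
    return best

def is_anchored(claim: str, threshold: float = 0.6) -> bool:
    """A claim ships only if the verification corpus supports it."""
    return support_score(claim) >= threshold
```

The asymmetry matters: the generation corpus can be broad and noisy, but the verification corpus stays small enough that every document in it has been vetted.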

The pattern that misled us

The success of ensemble methods in machine learning created an intuition that more models = more reliability. In classical ML, this is largely true — bagging and boosting work because the base models have uncorrelated errors on well-defined features.

LLMs break this assumption. The base models share training data, architecture families, and optimization objectives. Their errors are correlated by construction. Treating them as independent voters is a category error borrowed from a domain where the independence assumption actually held.

I made this mistake early. When we built the multi-agent system, I assumed that running the content generation through multiple agents and selecting the best output would improve reliability. It didn’t. The agents agreed on the wrong things more often than they disagreed on the right things. We got reliability only after we separated the generation and verification functions entirely and made the verification independent of the generation architecture.

The open question

If consensus doesn’t work for truthfulness, what’s the right reliability primitive for multi-agent systems operating on factual domains?

Adversarial verification is better than consensus, but it’s expensive — you’re paying for agents whose job is to destroy, not create. External anchoring works but requires maintaining a ground-truth corpus, which is itself a maintenance burden that scales with domain breadth.

The field is converging on hybrid approaches — consensus for subjective quality, external verification for factual claims, adversarial debate for reasoning chains. But nobody has a clean, general-purpose pattern yet.

The teams that figure this out first will have a genuine architectural advantage. Not because their models are better, but because their reliability infrastructure is honest about what consensus can and cannot verify.