We Were Running AI Agents Before ‘Agentic’ Became a Buzzword
In early 2024, we deployed a multi-agent system for Ostronaut before anyone called it “agentic AI.” We called it “the pipeline.” By late 2024, every vendor deck had “agentic” in the title. The architecture didn’t change. The vocabulary did.
Here’s the pattern that experience revealed: Agent Debt. The hidden complexity that accumulates when you treat agents as black boxes instead of understanding their failure modes. It isn’t technical debt. It’s operational blindness. You don’t see it until an agent hallucinates in production, burns through your API budget, or produces output so confidently wrong that users trust it.
Building without frameworks meant hitting every orchestration failure, every context bleed, every runaway cost directly. That’s what taught us what actually matters.
The Architecture We Built
Ostronaut generates corporate training content — presentations, videos, quizzes, games — from unstructured input. A client uploads a PDF. The system outputs interactive learning formats.
We built agents in four functional groups because the problem naturally decomposed that way:
| Agent Type | Responsibility |
|---|---|
| Planner agents | Break input into learning objectives, decide format mix |
| Structure agents | Design slide sequences, video scripts, quiz flows |
| Content agents | Generate text, voiceovers, visual descriptions |
| Validation agents | Check quality gates, flag hallucinations, verify completeness |
The planner-worker pattern: one planner agent analyzes the input and creates a generation plan. Worker agents execute tasks from that plan. Validation agents run post-generation checks.
This wasn’t novel architecture. It was obvious once you tried to build the thing. But in early 2024, there was no CrewAI to handle orchestration. No LangGraph to manage state. We wrote the coordination logic ourselves.
What that meant in practice:
Context management was manual. Each agent needed the right slice of information: not too much (cost), not too little (hallucination). We built a context router that decided what each agent could see based on its task. It broke constantly. An agent would reference information from a previous step that wasn’t in its context window. Output would be incoherent.
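A context router of this kind reduces to an allow-list per agent type. The sketch below is a simplification with made-up policy keys; the real router was task-aware, not just type-aware:

```python
# Hypothetical policy: each agent type declares which context keys it
# may see. Everything else is withheld to save tokens and to stop one
# step's output from bleeding into an unrelated step.
CONTEXT_POLICY = {
    "planner":   {"source_summary", "client_brief"},
    "structure": {"learning_objectives", "format"},
    "content":   {"section_outline", "tone_guide"},
    "validator": {"source_summary", "generated_output"},
}

def route_context(agent_type: str, full_context: dict) -> dict:
    allowed = CONTEXT_POLICY[agent_type]
    sliced = {k: v for k, v in full_context.items() if k in allowed}
    # Fail loudly when a required key is absent, instead of letting the
    # agent hallucinate around the gap -- this was our most common bug.
    missing = allowed - sliced.keys()
    if missing:
        raise KeyError(f"{agent_type} context missing: {sorted(missing)}")
    return sliced
```

The `KeyError` branch is the lesson from the breakage described above: a silent gap in context produces incoherent output downstream, so missing context should fail at routing time.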
Tool-calling was brittle. Agents needed to invoke APIs for image generation, video rendering, database writes. Early LLM tool-calling was unreliable. An agent would call the wrong API, pass malformed parameters, or retry indefinitely on failure. We added a validation layer that parsed tool calls before execution. That caught 30% of bad calls.
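The validation layer amounted to schema-checking each model-emitted call before execution. A minimal sketch, with invented tool names and a deliberately simple type-only schema:

```python
import json

# Hypothetical tool registry: parameter name -> expected Python type.
TOOL_SCHEMAS = {
    "generate_image": {"prompt": str, "width": int, "height": int},
    "write_record":   {"table": str, "data": dict},
}

def validate_tool_call(raw: str) -> tuple[str, dict]:
    """Parse and check a model-emitted tool call before it touches a real API."""
    call = json.loads(raw)  # malformed JSON fails here, not at the API
    name, args = call.get("name"), call.get("arguments", {})
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        raise ValueError(f"unknown tool: {name!r}")
    for param, expected in schema.items():
        if not isinstance(args.get(param), expected):
            raise ValueError(f"{name}: bad or missing parameter {param!r}")
    return name, args
```

Anything that fails here goes back to the agent for a retry with the error message appended, rather than hitting the downstream API with garbage.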
Cost control was reactive. We didn’t know what “normal” token usage looked like for a multi-agent pipeline. First month in production, we burned through our OpenAI budget in 2 weeks. The problem: redundant context. Multiple agents were processing the same source material because we hadn’t optimized context sharing. We added a caching layer. Cost dropped 40%.
The Quality Crisis
Month 4, we hit the ceiling.
A healthcare client used Ostronaut to generate training for a clinical health program. The system produced a quiz. One question asked: “What is the recommended daily caloric deficit for healthy weight loss?” The agent-generated answer: “1000-1200 calories.”
That’s dangerously high for most people. The correct range is 500-750 calories.
The agent didn’t hallucinate randomly. It pulled from a source document that mentioned 1000-1200 as an upper bound for specific cases. The agent extracted the number without the qualifier. The validation agent didn’t flag it because it checked for factual consistency with the source, not medical safety.
We caught it in QA. But it revealed the core problem: agents optimize for coherence, not correctness. They will confidently generate plausible-but-wrong output if your validation layer doesn’t encode domain constraints.
This is the failure mode that no prompt tuning fixes. You can instruct the model to “be accurate” as many times as you want. It will still extract numbers from context and strip their qualifiers, because that’s what extracting the salient point looks like to the model.
What we changed:
Built domain-specific validation gates. For healthcare content, we added rules: flag any caloric recommendation above X, flag any medication dosage, flag any symptom-diagnosis claim. Not LLM-based validation. Rule-based checks that ran before content went to the client.
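A gate like this is deterministic pattern matching, not an LLM. The sketch below is illustrative only: the threshold is a placeholder, not clinical guidance, and the real rule set was far larger:

```python
import re

# Placeholder threshold -- illustrative, NOT medical advice.
MAX_SAFE_DEFICIT_KCAL = 750

def flag_health_content(text: str) -> list[str]:
    """Deterministic safety checks run before healthcare content ships."""
    flags = []
    # Flag any caloric figure above the threshold, including ranges
    # like "1000-1200 calories" (the failure mode described above).
    for m in re.finditer(r"(\d{3,4})(?:\s*-\s*\d{3,4})?\s*calor\w*", text, re.I):
        if int(m.group(1)) > MAX_SAFE_DEFICIT_KCAL:
            flags.append(f"caloric figure above threshold: {m.group(0)}")
    # Any dosage mention routes to human review, no exceptions.
    if re.search(r"\b\d+\s*mg\b", text, re.I):
        flags.append("medication dosage mentioned: route to human review")
    return flags
```

Note what this gate would have caught in the incident above: "1000-1200 calories" trips the threshold rule regardless of how confident or coherent the surrounding text is.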
Added confidence scoring. Each agent outputs a confidence score for its generation. Low-confidence outputs go to human review. The scoring isn’t sophisticated (token probability and context match), but it works. 15% of generations now route to human QA. That’s acceptable.
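The routing logic is the simple part; the sketch below mirrors the "token probability and context match" idea but is not our exact formula, and the threshold is an assumed value, tuned in practice against labeled QA outcomes:

```python
from dataclasses import dataclass

@dataclass
class Generation:
    text: str
    avg_token_logprob: float  # from the model API; 0 is maximally confident
    context_overlap: float    # fraction of key terms grounded in source, 0..1

CONFIDENCE_THRESHOLD = 0.6    # assumed value; tune against real QA labels

def confidence(gen: Generation) -> float:
    """Crude blend of model certainty and source grounding."""
    prob = min(1.0, max(0.0, 1.0 + gen.avg_token_logprob))
    return 0.5 * prob + 0.5 * gen.context_overlap

def route(gen: Generation) -> str:
    return "ship" if confidence(gen) >= CONFIDENCE_THRESHOLD else "human_review"
```

The equal weighting is arbitrary; what matters is that both signals are cheap to compute and neither requires a second LLM call.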
Switched to template + generative hybrid. For high-risk content types (medical, financial, legal), we don’t generate from scratch. We use templates with generative fill-ins. Reduces creative output, increases safety. Clients accepted the trade-off.
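The hybrid works by letting the template fix structure and safety-critical wording while the model only fills bounded slots. A minimal sketch with an invented quiz template:

```python
import string

# Hypothetical template: the fixed text carries the structure and the
# mandatory source attribution; the model only supplies slot values.
QUIZ_TEMPLATE = string.Template(
    "Q: According to the source material, $question\n"
    "A: $answer (source: $citation)"
)

def fill_template(llm_fill: dict) -> str:
    """Render the template; refuse to ship if any slot is missing."""
    required = {"question", "answer", "citation"}
    missing = required - llm_fill.keys()
    if missing:
        raise ValueError(f"missing slots: {sorted(missing)}")
    return QUIZ_TEMPLATE.substitute(llm_fill)
```

The constraint is the feature: the model cannot omit the citation or restructure the answer, because those parts are not generated at all.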
What We Got Wrong
Universal reasoning engine. We initially tried to build one planner agent that could handle all content types. A presentation has different structural constraints than a video. A quiz has different validation rules than a game. We split the planner into format-specific planners. That added agents but improved output quality significantly.
LLM-as-judge for validation. Early on, we used an LLM to validate other LLMs’ output. “Does this quiz question make sense? Is this slide coherent?” That’s circular. The validator had the same failure modes as the generator. We moved to rule-based validation for anything safety-critical. LLMs still validate style and tone. They don’t validate facts. This failure mode is documented in more detail in why LLM-as-judge stacks fail for Indian markets — the underlying issue is the same regardless of geography.
Centralized orchestration. We built one orchestrator that managed all agents. It became a bottleneck. Every new feature required changing the orchestrator. We should have built federated orchestration, where each agent cluster (planner, worker, validator) manages its own coordination. We haven’t refactored this yet. It’s still painful.
Then vs. Now
If we built Ostronaut today with 2025 tooling, here’s what would be easier:
| What We Built by Hand | What Exists Now |
|---|---|
| Context routing logic | LangGraph state management |
| Tool-call validation layer | Built-in tool schemas in GPT-4 |
| Agent orchestration | CrewAI, n8n workflows |
| Retry and error handling | Framework-level retry policies |
What’s still hard:
Domain-specific validation. No framework gives you medical safety checks or financial compliance rules. You build that yourself.
Cost optimization. Frameworks don’t tell you which agents are burning tokens unnecessarily. You need observability and profiling. This is the same problem Indian SaaS companies are well-positioned to solve — twenty years of optimizing for constrained infrastructure builds exactly this instinct.
Failure mode discovery. Agents fail in creative ways. A framework might handle retries, but it won’t tell you why an agent is producing inconsistent output. You learn that by watching production traffic.
The real difference: In 2024, we had to understand agent internals to build anything reliable. In 2025, you can deploy agents without understanding them. That’s progress. But it creates Agent Debt.
The Falsifiable Claim
Teams that deploy agent systems without understanding planner-worker coordination, context boundaries, and validation layers will hit a quality ceiling within 3-6 months that no amount of prompt tuning will fix.
The ceiling shows up as:
- Inconsistent output quality (works 80% of the time, fails unpredictably)
- Cost spirals (agents making redundant API calls, over-generating)
- User trust erosion (one bad generation destroys the confidence earned by ten good ones)
This isn’t a prediction. It’s a pattern I’ve watched repeat across every team that reached out after deploying agents without validation gates. The vendors selling “agentic platforms” are solving orchestration and deployment. They’re not solving validation, cost control, or failure mode discovery. Those are still your problem.
This dynamic connects to something broader happening in the shift from software to agentware — as the abstraction layer rises, the hidden complexity doesn’t disappear. It concentrates at the failure modes the frameworks don’t cover.
The Question Worth Asking
If you’re deploying agents today, ask this: Can you explain why an agent made a specific decision?
Not “what did it output?” but “why did it choose this approach over alternatives?”
If the answer is “the LLM decided,” you have Agent Debt. You’re trusting a black box. That works until it doesn’t.
The teams that will build reliable agent systems aren’t the ones using the fanciest frameworks. They’re the ones who understand what happens when context bleeds between agents, when a planner makes a bad decomposition, when a validator misses a hallucination.
We learned that by building without frameworks. You can learn it faster now — but only if you look under the hood.