The OS-Paged Context Engine
Every production agent system I’ve worked on has the same failure mode. Context rot. Stale artefacts silently served to the model. No audit trail for what was included or excluded. Token budgets blown with no graceful recovery. Multi-agent context bleeding across scopes.
The standard fix is “use RAG.” RAG solves retrieval. It doesn’t solve lifecycle.
The counter-argument I hear most: context windows are getting larger. Claude does 200K tokens. Gemini does 1M. Just dump everything in. The math doesn’t hold. At $15 per million input tokens, stuffing 847 artefacts (~200K tokens) into every call costs $3 per inference. At 100 calls per day per agent, that’s $9,000/month for a single agent. And you still can’t audit what the model saw, still can’t catch stale data, still can’t prevent hallucinations from compounding into memory.
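The arithmetic is worth making explicit. A minimal sketch, using the figures above; the prices are the illustrative ones from the text, not a quote from any provider:

```python
# Back-of-envelope cost of "just dump everything in", per the figures above.
PRICE_PER_M_INPUT = 15.00    # dollars per million input tokens (illustrative)
TOKENS_PER_CALL = 200_000    # ~847 artefacts stuffed into every prompt
CALLS_PER_DAY = 100
DAYS_PER_MONTH = 30

cost_per_call = TOKENS_PER_CALL / 1_000_000 * PRICE_PER_M_INPUT
monthly = cost_per_call * CALLS_PER_DAY * DAYS_PER_MONTH
print(f"${cost_per_call:.2f} per call, ${monthly:,.0f}/month per agent")
# → $3.00 per call, $9,000/month per agent
```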
Context has no lifecycle. That’s the root cause. I went looking for prior art in constrained computing, where managing scarce resources under real-time pressure has been solved for decades.
Same Query, Two Outcomes
A support agent is handling a billing escalation. The context store has 847 artefacts: ticket history, knowledge base articles, past chat transcripts, agent notes, CRM records.
The query is the same. The model is the same. The only difference is what sits between the store and the prompt.
Without lifecycle management (standard RAG): the agent runs a semantic search, takes the top-K matches, stuffs them in.
- A refund policy from six months ago loads because it’s semantically close. The policy was updated two weeks ago. The agent cites the old $200 limit to a customer whose refund should be $400 under the current policy.
- An agent’s internal note (unreviewed, unvalidated) loads as context. The model treats a scratchpad draft as a confirmed resolution.
- Token budget blows out at 140%. The API silently truncates the prompt, dropping the most recent ticket update.
- The agent’s response gets written to memory. The outdated policy is now a “fact.” Next session, it compounds.
With the OS-Paged Context Engine: the same 847 artefacts enter a four-stage pipeline.
- Triage: of the 847 artefacts, 312 expire on TTL. The internal note scores below the provenance threshold (SCRATCHPAD rank). The stale policy is BLACK-tagged, triage shorthand for excluded. After the cheap recency and provenance passes, 20 artefacts survive for semantic scoring.
- Paging: a knowledge base article that did survive has a dirty bit set (source updated 2 weeks ago). Re-fetched with current policy before the model sees it.
- Assembly: 31,200 tokens against a 40,000 budget. No truncation.
- Validation: response scores 0.88 confidence. Committed to memory. Below 0.7, it would have been flagged for review and not persisted.
| Failure Mode | Standard RAG | OS-Paged Engine |
|---|---|---|
| Stale artefact loaded | Serves 6-month-old policy as current | TTL expires it. Dirty bit catches mid-session staleness. |
| Unvalidated note treated as fact | Loads if semantically close | SCRATCHPAD provenance rank filters it in triage |
| Token budget overflow | Silent API truncation | Graceful degradation through four tiers |
| Hallucination persisted to memory | Written back without checks | Commit gate: low confidence triggers rollback |
| Audit trail | None | Immutable manifest: trace ID, artefact list, tier, commit status |
Every one of these is a lifecycle failure, not a retrieval failure.
The Fix: Four Borrowed Techniques
I built a four-stage pipeline. Each stage borrows one technique from a domain that solved this class of problem decades ago. No framework lock-in. Single Python file. Works with any LLM API.
I’m calling it the OS-Paged Context Engine, because the core insight is that your context window is RAM, your long-term memory is disk, and you need an operating system between them.
Stage 1: Triage Scoring
The failure it catches: embedding 1,000 artefacts per call at ~1ms each = 1 second of latency before inference starts.
Borrowed from: the ER START triage protocol, 1983. You don’t need a full diagnosis to prioritise correctly. Score every candidate on two cheap signals first, and defer the expensive one:
R (Recency) is a timestamp diff. O(1). P (Provenance) is an enum rank: human-verified > RAG chunk > tool output > agent scratchpad. O(1). S (Semantic) is cosine similarity against the query. Computed only for artefacts that survive R+P filtering.
| Source Type | Score Bias | Triage Outcome |
|---|---|---|
| Human-verified memory | Provenance-heavy (P=0.5) | Highest priority, loaded first |
| RAG chunk (recent) | Balanced (R=0.4, S=0.4) | High — recency and relevance both count |
| Tool output | Recency-heavy (R=0.5) | Medium — freshness matters most |
| Agent scratchpad | Semantic-heavy (S=0.5) | Low — must be highly relevant to survive |
| Expired artefact | TTL=0 | Excluded before scoring even starts |
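The two-pass flow can be sketched as follows. This is an illustration, not the library's API: the names (`two_pass_triage`, `PROVENANCE_RANK`), the decay constant, and the fixed 0.5/0.5 weights are assumptions for brevity; the table above biases weights per source type.

```python
import math
from dataclasses import dataclass

# Provenance ranks from the text: human-verified > RAG chunk > tool output > scratchpad.
PROVENANCE_RANK = {
    "human_verified": 1.0,
    "rag_chunk": 0.7,
    "tool_output": 0.5,
    "scratchpad": 0.2,
}

@dataclass
class Artefact:
    id: str
    source_type: str
    created_at: float   # unix timestamp
    ttl: float          # seconds to live

def recency(art, now):
    # O(1) timestamp diff, decayed so fresher artefacts score higher.
    return math.exp(-(now - art.created_at) / 3600.0)

def two_pass_triage(artefacts, semantic_score, now, keep=20):
    # Pass 1: TTL expiry, then a cheap R+P sort. No embeddings touched yet.
    alive = [a for a in artefacts if now - a.created_at < a.ttl]
    alive.sort(
        key=lambda a: 0.5 * recency(a, now) + 0.5 * PROVENANCE_RANK[a.source_type],
        reverse=True,
    )
    survivors = alive[:keep]
    # Pass 2: the expensive semantic signal, computed only for survivors.
    return sorted(survivors, key=semantic_score, reverse=True)
```

The point of the structure is that the embedding call (`semantic_score`) runs on at most `keep` artefacts, not the full store.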
Stage 2: Paged Context Store
The failure it catches: serving stale context because nobody checked whether the source changed since it was loaded.
Borrowed from: OS Virtual Memory, 1962. The page table decided what lived in fast memory, evicted least-recently-used pages, and tracked modifications via dirty bit.
LRU eviction: when the window is full, evict what was accessed longest ago. Dirty bit: if the source changed since the artefact was loaded, flag it dirty and re-fetch before use.
```python
def access(self, artefact_id):
    art = self._lru[artefact_id]
    current_hash = hash(self._long_term[artefact_id].content)
    if current_hash != art._source_hash:
        art._dirty = True                    # source changed → force re-fetch
    self._lru.move_to_end(artefact_id)       # promote to MRU
    return art
```

RAG retrieves once and serves forever. A paged store tracks whether the source has changed.
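The eviction half can be sketched the same way. A minimal illustration, assuming an `OrderedDict`-backed window as in the access snippet; the names (`LRUWindow`, `Page`, `load_page`) are hypothetical, not the library's API:

```python
from collections import OrderedDict
from dataclasses import dataclass

@dataclass
class Page:
    id: str
    tokens: int

class LRUWindow:
    """Sketch of the eviction side of a paged context store."""
    def __init__(self, capacity_tokens):
        self._lru = OrderedDict()   # id -> Page, most-recently-used at the end
        self._capacity = capacity_tokens
        self._used = 0

    def load_page(self, page):
        # Evict from the LRU end until the incoming page fits the token budget.
        while self._lru and self._used + page.tokens > self._capacity:
            _, evicted = self._lru.popitem(last=False)
            self._used -= evicted.tokens   # evicted, not deleted: still on "disk"
        self._lru[page.id] = page
        self._used += page.tokens
```

Eviction drops an artefact from the window, never from long-term storage; a later access simply pages it back in.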
Stage 3: Speculative Assembly
The failure it catches: hallucinations compounding across sessions because agent-generated context is written to memory without validation.
Borrowed from: CPU Reorder Buffer, Intel P6, 1995. Execute speculatively, hold results in a buffer, commit only when confirmed valid. Wrong? Rollback.
Assemble context optimistically. Start inference. If confidence exceeds threshold, commit to memory. If not, flag for human review. Do not write to long-term store. Without this gate, session one’s hallucination becomes session two’s “memory” becomes session three’s “fact.”
```python
# After model responds:
if evaluator_confidence >= 0.7:
    manifest.committed = True             # safe to write to long-term store
else:
    manifest.flagged_for_review = True    # hold: do not persist
```

At Ostronaut, we saw exactly this: unvalidated agent-generated context compounding into confidently wrong output downstream. The commit gate cut that class of failure by roughly half.
Here’s the falsifiable claim: any multi-agent system without a commit/rollback gate on context writes will compound hallucinations across sessions within 30 days of production use.
Stage 4: Graceful Degradation
The failure it catches: token budget overflows that crash the API call or silently truncate critical context.
Borrowed from: Radio Programme Stack, 1930s. Dead air could never happen. When content overran, drop to the next segment. The broadcast always continued.
| Tier | Triggers at | Strategy | Example |
|---|---|---|---|
| 1 (Full) | < 80% budget | All triage winners | Happy path. Everything fits. |
| 2 (Summarised) | 80-95% | Compress memories, truncate RAG | Chat transcripts become 200-token summaries. |
| 3 (Core only) | 95-110% | Human-verified facts + system prompt | Only ground truth. Scratchpad and RAG dropped. |
| 4 (Minimal) | > 110% | System prompt only. Human review flag. | Emergency. Escalate. |
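The tier selection itself is a small decision function. A sketch using the thresholds from the table; the boundary handling at exactly 80%, 95%, and 110% is my choice, and the name `select_tier` is illustrative:

```python
def select_tier(token_count, budget):
    """Map assembled context size to a degradation tier (thresholds from the table)."""
    ratio = token_count / budget
    if ratio < 0.80:
        return 1    # full: all triage winners fit
    if ratio <= 0.95:
        return 2    # summarised: compress memories, truncate RAG
    if ratio <= 1.10:
        return 3    # core only: human-verified facts + system prompt
    return 4        # minimal: system prompt only, flag for human review
```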
The Composed Pipeline
```python
async def assemble_context(query, store, budget, scope):
    candidates = await store.get_candidates(scope=scope)
    scored = triage.score(candidates, query=query, top_k=50)
    loaded = store.load_page(scored, token_budget=budget)
    manifest = speculator.assemble(loaded, budget=budget)
    if manifest.token_count > budget:
        manifest = fallback_stack.degrade(manifest, budget=budget)
    return manifest
```

Every call produces an immutable manifest. When the compliance team asks “why did the agent say that?” you hand them the manifest.
I argued previously that context is infrastructure, not a feature. This is the implementation pattern.
What I Got Wrong
The first version didn’t have the two-pass triage. Every artefact got embedded on every call. At 1ms per embedding multiplied by 1,000 artefacts, that’s a full second of latency before inference starts. Adding R+P pre-filtering dropped that to roughly 20 embeddings per call. The two-pass approach seems obvious in retrospect. It’s literally how ER triage works. But the RAG literature doesn’t teach you to pre-filter before embedding.
The other mistake: not implementing the dirty bit from day one. We had artefacts in the context window from external tools that had returned fresh data hours ago. The model was reasoning about stale state. Adding dirty bit tracking on access (not just on write) was a one-line fix that eliminated an entire class of silent failures.
The third mistake is in the commit gate itself. The code checks evaluator_confidence >= 0.7, but who computes that score? If the model self-evaluates, you’re trusting the same system that may have hallucinated to judge whether it hallucinated. LLM confidence self-assessment is poorly calibrated. The honest answer: the library deliberately does not compute confidence. The caller must supply it via an external evaluator, a rule-based checker, or human-in-the-loop for high-stakes domains. The commit gate is necessary. What sits behind it is not yet solved.
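One shape a caller-supplied evaluator can take is a rule-based checker. A hypothetical sketch, not part of the library: the function name, the fact lists, and the billing-scenario strings below are all invented for illustration.

```python
def rule_based_confidence(response, required_facts, forbidden_claims):
    """Hypothetical external evaluator: check the response against known
    ground truth instead of asking the model to grade itself."""
    text = response.lower()
    if any(claim.lower() in text for claim in forbidden_claims):
        return 0.0    # hard fail on a known-bad claim (e.g. a superseded policy)
    if not required_facts:
        return 0.0    # nothing to verify against: never auto-commit
    hits = sum(1 for fact in required_facts if fact.lower() in text)
    return hits / len(required_facts)

# Feed the score into the commit gate in place of model self-assessment:
confidence = rule_based_confidence(
    "Your refund is $400 under the current policy.",
    required_facts=["$400", "current policy"],
    forbidden_claims=["$200 limit"],
)
```

String matching is crude, but it has the one property self-assessment lacks: the judge is independent of the system being judged.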
When This Pattern Is Overkill
Not every agent needs lifecycle management. If your agent doesn’t write to its own memory and doesn’t persist across sessions, standard RAG is sufficient. Single-session chatbots, prototypes with fewer than 100 artefacts, read-only Q&A over a fixed corpus: the overhead of triage, paging, and commit gates exceeds the benefit. This pattern pays off when context has a lifecycle. If it doesn’t, skip it.
What’s Still Open
What remains genuinely unresolved is governance at scale. When an agent has six months of context about a customer, who owns it? What happens under GDPR deletion requests? Do you tombstone or purge? If you purge, does the agent’s behaviour change in ways that affect other customers? I’m working through that question next.
The full library is a single Python file, zero dependencies, open for anyone building production agents. The techniques are borrowed. The composition is yours to steal.