The AI industry has spent the past two years building better language models, better retrieval systems and better agent frameworks.
Yet every production AI application still rebuilds the same infrastructure. Every engineering team writes custom code to decide:
These decisions are currently embedded in application code. They should be infrastructure.
Context Runtime is a provider-agnostic optimization layer that automatically determines the optimal context execution plan before any model is called.
Applications describe intent. The runtime determines the initial execution strategy. Over time, it continuously improves those decisions using measured execution outcomes. Cost-based planning gets you the first execution; learning gets you the thousandth.
Since v1.1, the reference implementation has grown a body of runnable, measured evidence for that learning loop — including a result we did not expect on the messiest, most realistic input of all: a single user's mix of very different data. Those results are in the Evidence section, and every number is produced by an example, not asserted.
Current AI systems are built around static pipelines — User → Prompt → RAG → Model → Answer. Every serious application eventually extends that pipeline: hybrid search, BM25, GraphRAG, conversation memory, prompt caching, summarization, agent routing, verification, model selection, policy enforcement.
Every feature introduces another branch of application logic. As systems become more capable, developers spend less time building products and more time maintaining context pipelines. This is rapidly becoming the dominant engineering cost of production AI.
The static pipeline — and the sprawl it grows. Every capability bolted onto User → Prompt → RAG → Model → Answer (hybrid search, BM25, GraphRAG, conversation memory, prompt caching, summarization, agent routing, verification, model selection, policy enforcement) becomes another branch of application code — the maintenance burden, not the product.
Today's leading AI companies optimize different parts of the stack:
These approaches are complementary. None provides a unified runtime responsible for deciding how context should flow through an AI system — and learning to route it better over time.
Context management should become an operating-system service rather than application code. Applications should no longer decide retrieval strategy, chunk size, reranking, prompt assembly, model routing, verification, compression or caching.
Applications should express intent. The runtime should determine the optimal execution plan — and refine that plan as it observes what actually works.
This is exactly what modern database query planners did for relational databases. Developers write SQL; the database determines how to execute it. Context Runtime applies the same principle to AI — and, like adaptive query optimizers, it uses execution feedback and statistics to get better with use.
The same abstraction databases have had since 1979. You write intent; the planner decides execution.
| Relational database | | AI system |
|---|---|---|
| SQL | ≈ | Goal |
| Query planner | ≈ | Context Runtime |
| Execution plan | ≈ | Execution graph |
| Result | ≈ | Verified result |
Instead of manually assembling prompts, applications submit goals. The runtime evaluates multiple strategies while respecting application constraints — latency, cost, security, token budget, verification requirements and provider capabilities. Only then does execution begin.
And execution is not the end of the loop. Each run produces a measured outcome — a reward — that updates the runtime's cost model and strategy selection, so the next plan for a similar intent is better than the last.
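To make the planning step concrete, here is a minimal sketch of constraint-aware plan selection. It is an illustration, not the reference implementation's API: the `Plan` and `Constraints` types, the strategy names, the estimates and the `cost_weight` trade-off are all hypothetical. Candidates that violate a hard constraint are discarded, and the rest are ranked by estimated quality minus weighted cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """A candidate strategy with hypothetical cost-model estimates."""
    name: str
    est_latency_ms: float
    est_tokens: int
    est_cost_usd: float
    est_quality: float  # assumed 0..1 estimate of answer quality


@dataclass(frozen=True)
class Constraints:
    """Hard limits supplied by the application (illustrative subset)."""
    max_latency_ms: float
    max_tokens: int
    max_cost_usd: float


def feasible(plan: Plan, c: Constraints) -> bool:
    # A plan is only eligible if it respects every hard constraint.
    return (plan.est_latency_ms <= c.max_latency_ms
            and plan.est_tokens <= c.max_tokens
            and plan.est_cost_usd <= c.max_cost_usd)


def score(plan: Plan, cost_weight: float = 10.0) -> float:
    # Assumed objective: estimated quality minus a weighted cost penalty.
    return plan.est_quality - cost_weight * plan.est_cost_usd


def choose_plan(candidates: list[Plan], c: Constraints) -> Plan:
    viable = [p for p in candidates if feasible(p, c)]
    if not viable:
        raise ValueError("no candidate plan satisfies the constraints")
    return max(viable, key=score)


# Made-up candidates; a real cost model would produce these estimates.
candidates = [
    Plan("bm25_only", 120, 1500, 0.002, 0.70),
    Plan("hybrid_rerank", 450, 3000, 0.006, 0.85),
    Plan("graph_rag_verify", 1800, 9000, 0.030, 0.92),
]
best = choose_plan(candidates, Constraints(1000, 4000, 0.01))
print(best.name)  # hybrid_rerank
```

The highest-quality candidate loses here because it breaks the latency and token limits. Filtering before scoring is one simple way to keep constraints hard rather than letting them be traded off against quality.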
Context Runtime — plan, execute, learn.
Goal (the application's intent)
│
Intent analysis
│
Candidate plans (several strategies)
│
Cost optimization (score plans vs constraints)
    constraints: latency • token budget • cost • verification • security • provider capabilities
│
Execution graph
│
Execution (pluggable scheduler)
│
Reward (measured outcome)
│
Bandit learning + cost-model update
│
Updated planner ──▶ (better next plan)
Intent in, a cost-optimized execution graph out — verified before a single token is generated, and improved after every run.
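The learning half of the loop can be sketched with a textbook epsilon-greedy bandit. The source says only "bandit learning", so the specific algorithm, the strategy names and the simulated rewards below are assumptions chosen for illustration. Each strategy is an arm, each execution returns a measured reward, and the running mean per arm steers the next selection.

```python
import random


class EpsilonGreedy:
    """Epsilon-greedy bandit over named strategies (one arm per strategy)."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {arm: 0 for arm in arms}
        self.values = {arm: 0.0 for arm in arms}  # running mean reward

    def select(self):
        # Try every arm once before trusting any estimate.
        untried = [arm for arm, n in self.counts.items() if n == 0]
        if untried:
            return untried[0]
        # Explore with probability epsilon, otherwise exploit the best arm.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.counts))
        return self.best()

    def update(self, arm, reward):
        # Incremental mean: new_mean = old_mean + (reward - old_mean) / n
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

    def best(self):
        return max(self.values, key=self.values.get)


# Hypothetical "true" mean reward per strategy, used only to simulate runs.
TRUE_MEAN = {"bm25_only": 0.40, "hybrid_rerank": 0.70, "graph_rag_verify": 0.60}


def execute(strategy, rng):
    # Stand-in for a real execution: true mean plus small bounded noise.
    return TRUE_MEAN[strategy] + rng.uniform(-0.04, 0.04)


bandit = EpsilonGreedy(TRUE_MEAN)
env = random.Random(1)
for _ in range(1000):
    strategy = bandit.select()
    bandit.update(strategy, execute(strategy, env))

print(bandit.best())  # hybrid_rerank
```

In a real runtime the reward would come from measured outcomes rather than a simulator, and the estimates would presumably be kept per intent class rather than globally.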
Context Runtime separates planning from execution. The planner produces an Execution Graph; a pluggable scheduler executes that graph. Foundation models and retrieval systems become interchangeable plugins, verification becomes a runtime policy — and the application remains unchanged.
The runtime emits a backend-independent Execution Graph. The reference implementation includes an in-process scheduler and defines interfaces for distributed execution engines such as Dagster, Ray and Spark — so the same plan can run locally today and on distributed backends tomorrow without changing the planner or the application.
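One way to picture the separation between graph and scheduler is the sketch below. The `ExecutionGraph`, `Scheduler` and `InProcessScheduler` names are hypothetical and do not claim to match the reference implementation's interfaces. The graph is plain data (steps plus dependencies), and a scheduler is anything that can run it. The in-process version walks the steps in topological order using the standard library's `graphlib`.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Callable, Protocol


@dataclass
class ExecutionGraph:
    """Backend-independent plan: named steps and their dependencies."""
    steps: dict[str, Callable[[dict], object]] = field(default_factory=dict)
    deps: dict[str, set[str]] = field(default_factory=dict)

    def add(self, name, fn, after=()):
        # fn receives a dict of its upstream steps' results.
        self.steps[name] = fn
        self.deps[name] = set(after)


class Scheduler(Protocol):
    """Anything that can execute a graph (hypothetical interface)."""
    def run(self, graph: ExecutionGraph) -> dict: ...


class InProcessScheduler:
    """Runs steps sequentially in dependency order, in one process."""

    def run(self, graph: ExecutionGraph) -> dict:
        results = {}
        # static_order() yields each step after all of its dependencies.
        for name in TopologicalSorter(graph.deps).static_order():
            upstream = {dep: results[dep] for dep in graph.deps[name]}
            results[name] = graph.steps[name](upstream)
        return results


# Toy plan: retrieve -> assemble -> verify.
graph = ExecutionGraph()
graph.add("retrieve", lambda up: ["doc-a", "doc-b"])
graph.add("assemble", lambda up: " | ".join(up["retrieve"]), after=["retrieve"])
graph.add("verify", lambda up: len(up["assemble"]) > 0, after=["assemble"])

results = InProcessScheduler().run(graph)
print(results["assemble"])  # doc-a | doc-b
```

Because the graph carries no execution logic of its own, a distributed backend would only need to provide another object with the same `run` method. The planner and the application would not change.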
Data lives behind the same seam. A store is a plugin: the reference implementation ships an in-process store, a DuckDB backend (a persistent, embedded full-text index — one file per source or per user's local data) and a Postgres backend (tsvector), all interchangeable under the planner. Where information lives becomes a runtime decision, not application code.
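The store seam can be sketched the same way. The `Store` protocol and the naive term-overlap scoring below are assumptions made for illustration. They are not the reference implementation's interface, and they do not use DuckDB or Postgres. The point is only that calling code depends on the protocol, so a persistent backend could be swapped in without touching it.

```python
from typing import Protocol


class Store(Protocol):
    """Hypothetical minimal store interface the planner could target."""
    def add(self, doc_id: str, text: str) -> None: ...
    def search(self, query: str, k: int = 5) -> list[str]: ...


class InProcessStore:
    """Toy in-memory store ranking documents by shared query terms."""

    def __init__(self):
        self._docs: dict[str, set[str]] = {}

    def add(self, doc_id: str, text: str) -> None:
        self._docs[doc_id] = set(text.lower().split())

    def search(self, query: str, k: int = 5) -> list[str]:
        terms = set(query.lower().split())
        # Score = number of query terms present in the document.
        hits = [(len(terms & words), doc_id)
                for doc_id, words in self._docs.items()
                if terms & words]
        # Highest score first; ties broken by doc id for determinism.
        hits.sort(key=lambda hit: (-hit[0], hit[1]))
        return [doc_id for _, doc_id in hits[:k]]


def gather_context(store: Store, question: str) -> list[str]:
    # Depends only on the protocol, not on any concrete backend.
    return store.search(question, k=3)


store = InProcessStore()
store.add("fin-1", "balance sheet liability statement")
store.add("med-1", "fluid balance chronic patient statement")
print(gather_context(store, "fluid balance"))  # ['med-1', 'fin-1']
```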
Architecture — planning separated from execution.
Application (expresses intent — submits a Goal · stays unchanged)
│
┌────────────── CONTEXT RUNTIME · THE PLANNER ──────────────┐
│ Provider-independent cost-based planning + outcome-driven learning │
│ Execution graph + verification (verification = a runtime policy)   │
└────────────────────────────────────────────────────────────┘
│
EXECUTION BACKENDS & INTERCHANGEABLE PLUGINS
[ in-process ] [ Dagster ] [ Ray ] [ Spark ] [ Models ] [ Retrieval ] [ Stores ]
Plan once, run anywhere: the planner is provider-independent; models, retrieval, stores and execution engines are swappable underneath it.
Retrieval is not a single fixed strategy but a set of runtime primitives the planner selects and cost-models per request:
The loop above is not aspirational. The reference implementation ships runnable benchmarks; every figure below is produced by an example program, not asserted.
Heterogeneous data: the case real users actually have. A person's local files are a mix of domains, and when they are indexed as one flat store, ambiguous terms collide: "discharge" (hospital vs. debt), "statement" (patient vs. financial), "balance" (fluid vs. sheet), as well as "chronic" and "liability". Measured on real financial 10-K filings plus a medical corpus:
| Retrieval strategy | Answer recalled | Cross-domain noise (top-5) |
|---|---|---|
| Flat mixed index | yes | 2.5 wrong-domain docs / query |
| Sharded + rank fusion | yes | 2.9 — fusion alone makes it worse |
| Sharded + learned routing | yes | 0.0 |
The surprising part: the correct answer is recalled in every case — this is not a "the model can't find it" problem. The failure is context pollution — a medical question drags financial pages into the window, and naïve fusion adds more. The runtime learns which domain each question belongs to and routes to it, driving cross-domain noise to zero while preserving recall, in both directions. This is the self-optimizing loop applied to a single user's own heterogeneous data — an edge no fixed pipeline has.
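A toy example shows why fusing shards can pollute the context window while routing does not. The source does not say which fusion method was measured, so reciprocal rank fusion (RRF) is used here as a common stand-in. The document ids and rankings are invented, and the routed domain is hard-coded where a learned router would sit. The sketch does not reproduce the measured figures above.

```python
from collections import defaultdict


def rrf(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def wrong_domain(top, domain):
    """Count results whose id does not carry the expected domain prefix."""
    return sum(1 for doc_id in top if not doc_id.startswith(domain + ":"))


# Hypothetical per-shard rankings for a medical question about "discharge".
shards = {
    "med": ["med:discharge-summary", "med:fluid-balance", "med:chronic-care",
            "med:patient-statement", "med:triage"],
    "fin": ["fin:debt-discharge", "fin:balance-sheet", "fin:liability-note",
            "fin:financial-statement", "fin:10k-risk"],
}

# Fusion treats both shards as equally relevant and interleaves them.
fused_top5 = rrf(list(shards.values()))[:5]

# Routing reads only the shard the router picked (assumed: "med").
routed_top5 = shards["med"][:5]

print(wrong_domain(fused_top5, "med"))   # 3
print(wrong_domain(routed_top5, "med"))  # 0
```

In both cases the correct medical document is present. Only the amount of wrong-domain material around it differs, which mirrors the recall-versus-pollution distinction made above.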
Agent memory that learns what to recall. With conversation memory exposed as three indices, the runtime learns — per question type — which one to read. Against a baseline that always reads all three, the learned policy improves reward (recall value minus read-cost) by +2.9, discovering on its own that follow-ups want recency, factual questions want semantic recall, and entity questions want the entity index.
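The reward used here, recall value minus read-cost, can be worked through in a small sketch. Everything numeric below is invented: the recall values, the unit read cost, and the assumption that reading several indices yields the best single recall value among them. The sketch uses a simple explore-then-commit scheme, and its improvement figure is not the measured +2.9.

```python
from collections import defaultdict

READ_COST = 1.0  # hypothetical cost of reading one memory index
INDICES = ("recency", "semantic", "entity")

# Hypothetical recall value of each index for each question type.
RECALL = {
    "follow_up": {"recency": 4.0, "semantic": 1.0, "entity": 0.5},
    "factual":   {"recency": 0.5, "semantic": 4.0, "entity": 1.0},
    "entity":    {"recency": 0.5, "semantic": 1.0, "entity": 4.0},
}


def reward(question_type, indices_read):
    # Assumed reward: best recall among the indices read, minus read cost.
    recall_value = max(RECALL[question_type][i] for i in indices_read)
    return recall_value - READ_COST * len(indices_read)


def learn_policy(rounds_per_arm=3):
    """Explore every (question type, index) pair, then commit to the best."""
    totals = defaultdict(float)
    for question_type in RECALL:
        for index in INDICES:
            for _ in range(rounds_per_arm):
                totals[(question_type, index)] += reward(question_type, [index])
    return {q: max(INDICES, key=lambda i: totals[(q, i)]) for q in RECALL}


policy = learn_policy()
print(policy)
# {'follow_up': 'recency', 'factual': 'semantic', 'entity': 'entity'}

baseline = reward("factual", INDICES)            # read all three: 4.0 - 3.0
learned = reward("factual", [policy["factual"]])  # read one:       4.0 - 1.0
print(learned - baseline)  # 2.0
```

With these made-up numbers the gain comes entirely from reads that were avoided: recall is identical, but two of the three index reads are skipped.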
The learning generalizes. Eleven reference applications each expose one context decision and one measured outcome; on every one, the learned policy beats a fixed full-context baseline — the same plan-execute-learn loop, eleven different domains.
A public, no-login demo runs this live: point it at a corpus and watch the runtime route and improve in real time.
Today's AI applications are tightly coupled to providers and frameworks. Tomorrow's applications will depend on infrastructure that automatically determines:
Context Runtime becomes that infrastructure layer.
The project begins as an AGPL-licensed reference implementation. Its purpose is to validate the architecture and establish an open specification for context management. The initial implementation focuses on:
The goal is not another agent framework. The goal is defining a new infrastructure abstraction. Multiple implementations of that abstraction are possible: Python and Go reference implementations exist today, at feature parity — the same planner, cost model, learning loop and retrieval seams in both, which is the strongest evidence that this is an abstraction, not a library.
As AI workloads become larger and increasingly distributed, context planning itself becomes a distributed-systems problem. The long-term roadmap covers distributed planning and execution across Kubernetes, Dagster, Spark, Ray and future execution backends. This enables:
The planner remains provider-independent while execution engines continue to evolve independently.
Every production AI system already performs context planning. Most simply do it manually. The industry has standardized foundation models. It has standardized tool calling. It has standardized retrieval. It has not standardized context optimization.
That missing layer represents an opportunity to define a new category of AI infrastructure.
Compilers removed the need to write assembly. Operating systems removed the need to manage hardware directly. Database query planners removed the need to handcraft execution strategies — and adaptive optimizers later removed the need to hand-tune them as data changed.
Context Runtime brings the same arc to AI. Applications describe intent. The runtime determines the optimal execution plan, and learns from every execution. Everything else becomes implementation.
Context Runtime is not another AI framework. It is the infrastructure layer that makes AI systems simpler, cheaper, more reliable, provider-independent — and self-improving.