In our previous post, we explained why agents need retrieval infrastructure built for their workloads. Here's more on how we're building it.
Successful coding agent implementations build feedback loops. Give the agent access to a compiler and a way to test the code, and it can verify its own output. This is why the CLI form factor has overtaken IDE integrations: CLI-based coding agents run inside your development container with access to everything needed to build, test, and integrate. The verifiability of code makes it perfect for reinforcement learning (RL), and model companies have invested heavily because it enables recursive improvement: coding agents that can code better coding agents.
And we may soon see all agents become coding agents.
We're applying the same feedback loop principles to Hornet. By making our API surface verifiable (configuration, queries, indexing, deployment), agents can not only use Hornet but learn to configure and optimize it. Retrieval is treated like code: configurations are the source files, API validation is the compiler, behavioral metrics are the tests, and deployments are versioned rollouts that can be verified and safely reverted. This accelerates every developer trying to build retrieval applications.
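Here is that loop in miniature. The function names, fields, and thresholds below are placeholders rather than Hornet's actual client API; the rest of this post unpacks each step.

```python
# Placeholder stand-ins, not Hornet's real API.
def validate(config: dict) -> list[str]:
    return []                       # "the compiler": syntactic + semantic checks

def evaluate(config: dict) -> dict:
    return {"recall@10": 0.86}      # "the tests": behavioral metrics

def deploy(config: dict) -> str:
    return "v42"                    # versioned rollout, revertible if it misbehaves

config = {"fields": {"title": {"type": "text"}}}    # "the source files"

errors = validate(config)
if errors:
    raise ValueError(errors)        # the agent reads this and edits the config
if evaluate(config)["recall@10"] < 0.80:
    raise RuntimeError("behavioral regression; don't ship")
print("deployed", deploy(config))
```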
The challenge
Hornet's API surface isn't in any LLM's pre-training data. We tried the obvious approaches first: injecting documentation into the prompt, relying on in-context learning, hoping frontier models would figure it out. None of it worked well enough. Launching a new engine with a new API under these conditions in 2026 feels near impossible, but the feedback loop that makes coding agents work so well pointed us toward a solution.
Our approach: Verifiable APIs
We make Hornet's entire API surface verifiable, which lets agents learn how to use the engine directly. An agent might not produce a valid configuration or query on the first try, but because the API is verifiable, it can observe the failure and the associated error response, then correct its output until it succeeds. This is the core of the agentic feedback loop. Hornet's schema-first design makes structure explicit before data enters the system.
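A minimal sketch of that loop, with local stand-ins for the validator, the error format, and the agent's correction step (the real client calls and error shape will differ):

```python
# Stand-in validator: in this toy rule, every text field must declare an analyzer.
def validate(config: dict) -> dict:
    errors = [
        f"field '{name}': text fields must declare an 'analyzer'"
        for name, spec in config["fields"].items()
        if spec.get("type") == "text" and "analyzer" not in spec
    ]
    return {"valid": not errors, "errors": errors}

# Stand-in for the agent reading the errors and editing its own configuration.
def agent_fix(config: dict, errors: list[str]) -> dict:
    for spec in config["fields"].values():
        if spec.get("type") == "text":
            spec.setdefault("analyzer", "standard")
    return config

config = {"fields": {"title": {"type": "text"}, "year": {"type": "int"}}}
for attempt in range(1, 6):
    result = validate(config)
    if result["valid"]:
        print(f"accepted after {attempt} attempt(s)")
        break
    config = agent_fix(config, result["errors"])    # observe failure, correct, retry
```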
We also want the API surface to look like code, aligning with how frontier model companies already post-train models with RL. Concretely, this means that a large part of Hornet's API surface is just a structured file system. A coding agent writes, edits, and reads these Hornet configuration files just as it does when it creates a Next.js app.
Fully verifiable areas include:
- Configuration, documents, and collection schemas
- Queries, scoring, and document operations
- Deployments, including safe changes to a production deployment
With this, an agent can configure, deploy, and use Hornet end to end.
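As a sketch of what that file system can look like, here is an agent-style scaffold; the directory layout, file names, and schema fields are illustrative assumptions, not Hornet's actual on-disk format:

```python
import json
from pathlib import Path

app = Path("hornet-app")
(app / "collections").mkdir(parents=True, exist_ok=True)
(app / "queries").mkdir(exist_ok=True)

# A collection schema the agent writes, edits, and re-reads like source code.
(app / "collections" / "articles.json").write_text(json.dumps({
    "fields": {
        "title": {"type": "text", "analyzer": "standard"},
        "body": {"type": "text", "analyzer": "standard"},
        "published_at": {"type": "date"},
    },
}, indent=2))

# A named query the agent can tune over time.
(app / "queries" / "recent_articles.json").write_text(json.dumps({
    "match": {"body": "{user_input}"},
    "boost": {"field": "published_at", "decay": "30d"},
}, indent=2))

# Deploying this directory and issuing queries would then go through Hornet's
# API; those calls are omitted here since we're only showing the file layout.
```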
Three levels of verification
Syntactic validation
The simplest form of validation. Hornet's APIs are defined by an OpenAPI specification, so an agent's document and query schemas can be checked for syntactic correctness, just like checking whether code compiles. Frontier LLMs of 2026 are excellent at producing syntactically correct code and configuration.
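As a rough illustration of what that check buys the agent, here is a validation pass using the open-source jsonschema package against a hand-written schema fragment; the request shape is a made-up stand-in for Hornet's published spec:

```python
from jsonschema import Draft202012Validator  # pip install jsonschema

# Illustrative request schema for creating a collection (not Hornet's real spec).
create_collection_schema = {
    "type": "object",
    "required": ["name", "fields"],
    "properties": {
        "name": {"type": "string"},
        "fields": {
            "type": "object",
            "additionalProperties": {
                "type": "object",
                "required": ["type"],
                "properties": {"type": {"enum": ["text", "int", "date", "vector"]}},
            },
        },
    },
}

request_body = {"name": "articles", "fields": {"title": {"type": "txt"}}}  # typo: "txt"

validator = Draft202012Validator(create_collection_schema)
for error in validator.iter_errors(request_body):
    # The agent gets a precise pointer to what "doesn't compile".
    print(f"{list(error.path)}: {error.message}")
```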
Semantic validation
Deeper validation across multiple settings and their combinations. For example, some settings can't be used together. Syntax checks alone can't catch this, but a Hornet configuration model can. We model which combinations are allowed, and when an agent produces an invalid configuration, Hornet returns a concrete, detailed error message so the agent can self-correct. Even without any additional RL tuning, frontier models handle Hornet's API surface smoothly because this feedback loop mimics the familiar coding domain.
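A toy example of such a cross-setting rule and the kind of feedback it produces; the specific constraint is invented for illustration, not an actual Hornet rule:

```python
def semantic_errors(field: dict) -> list[str]:
    """Check combinations of settings that are individually valid but clash together."""
    errors = []
    if field.get("exact_match") and field.get("analyzer") == "stemming":
        errors.append(
            "exact_match=true cannot be combined with analyzer='stemming': "
            "stemming rewrites tokens, so exact matches would silently fail. "
            "Either set exact_match=false or switch to analyzer='keyword'."
        )
    return errors

field = {"type": "text", "analyzer": "stemming", "exact_match": True}
for message in semantic_errors(field):
    print(message)   # concrete, actionable feedback the agent can act on directly
```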
Behavioral validation
The hardest type of validation. Here we check whether the engine behaves as expected: Do the right documents appear? Are they ranked correctly? Does the query perform well? Is the resource footprint acceptable?
This level requires modeling expected behavior across queries, rankings, and resource usage. It's hard because "correct" is often subjective, but by making quality metrics observable and comparable, agents can not only query Hornet but also improve relevance, tune recall/latency tradeoffs, and safely roll out production changes with validation overrides.
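One sketch of what a behavioral check can look like: score a configuration against a small set of labeled queries and report a comparable number. The metric (recall@10) and the judgment data are illustrative; Hornet's own evaluation surface may expose different metrics.

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / max(len(relevant_ids), 1)

# Labeled judgments: for each query, the documents that should come back.
judgments = {
    "refund policy 2025": {"doc-812", "doc-907"},
    "cancel subscription": {"doc-114"},
}

# Results produced by the candidate configuration (normally gathered from live queries).
results = {
    "refund policy 2025": ["doc-907", "doc-333", "doc-812"],
    "cancel subscription": ["doc-114", "doc-662"],
}

scores = [recall_at_k(results[query], relevant) for query, relevant in judgments.items()]
print(f"recall@10 = {sum(scores) / len(scores):.2f}")   # comparable across configurations
```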
The retrieval gap
Most organizations struggle with building great retrieval for AI: complex engines, steep learning curves, and heavy operational overhead. Not to mention that few teams have experience building retrieval systems where relevance is king.
We built Hornet for this.
By making configurations, queries, and documents verifiable, developers and agents build production-ready retrieval through guided feedback loops. Hornet also verifies changes between application versions, making production updates safe.
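One way such a between-version check could work, assuming hypothetical metric names and a simple regression threshold:

```python
TOLERANCE = 0.02   # tolerate small noise, block real regressions

def safe_to_promote(live: dict, candidate: dict) -> tuple[bool, list[str]]:
    """Compare candidate metrics against the live deployment's metrics."""
    regressions = [
        f"{name}: {live[name]:.3f} -> {candidate[name]:.3f}"
        for name in live
        if candidate[name] < live[name] - TOLERANCE
    ]
    return (not regressions, regressions)

live = {"recall@10": 0.84, "ndcg@10": 0.71}
candidate = {"recall@10": 0.86, "ndcg@10": 0.64}   # better recall, worse ranking

ok, regressions = safe_to_promote(live, candidate)
print("promote" if ok else f"hold rollout: {regressions}")
```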
Self-improving agents
Hornet lets agents optimize their own context retrieval. Better context means better reasoning, which means better outcomes, and better outcomes give an agent the signal it needs to tune retrieval further. The feedback loop becomes self-reinforcing.
Consider a customer support agent that notices its retrieval keeps missing recent policy updates. With verifiable APIs, the agent can adjust its query configuration, test against known-good results, and deploy the fix. No human intervention required.
When retrieval infrastructure can learn from agent behavior and agents can improve their own context supply, we get systems that were previously impossible to build. These systems can adapt to new queries, new documents, and new contexts, making them more robust and reliable. This is Agentic Retrieval.
