What seems like a very long time ago, I studied neuroscience. At my university the Mosers had recently won a Nobel Prize, and naturally every department scrambled to associate itself with their success. Mine, AI, at the time still the fringe-science department, also vectored in that direction. Computational neuroscience, but still.
One thing that stuck with me was what we, as humans, do not attend to. As our species evolved by rummaging through savannas and jungles and forests, we were bombarded with perception signals. Very little of that was relevant for spotting that lion or finding those berries. But the ability to focus on what matters and drown out the rest was a survival necessity. This spotlight attention mechanism suppresses the irrelevant to focus on the salient.
Which brings me to my current work: search and retrieval. For the longest time, search has been a list of more or less relevant items we skim through to find what we need. When presented with such a list, humans are remarkably adept at filtering out the irrelevant. The mental cost of skipping to the next result is low.
AIs, however, have evolved along a different path.
For an LLM, retrieval is not skimming through a list of results, it is context injection. The retrieved documents are not presented for review but are ingested in parallel and actively shape the model's probabilistic reasoning. Unlike the human brain, which can compartmentalize and discard noise, the Transformer architecture processes input through a mechanism of self-attention where every token competes for probability mass. There is no "ignore".
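A toy illustration of that last point: softmax attention normalizes scores into a probability distribution, so every token keeps a nonzero slice of attention. The scores below are invented; the only point is that the mass taken by a convincing distractor is mass taken from the real evidence, and nothing ever drops to exactly zero.

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Numerically stable softmax: raw scores -> probability distribution."""
    z = scores - scores.max()
    e = np.exp(z)
    return e / e.sum()

# Invented attention scores for five retrieved chunks:
# one salient passage, one plausible-but-wrong distractor, three noise docs.
scores = np.array([4.0, 3.6, 0.5, 0.3, 0.1])
weights = softmax(scores)

print(weights.round(3))  # [0.573 0.384 0.017 0.014 0.012]
print(weights.sum())     # ~1.0 -- probability mass is conserved, so the
                         # distractor's share comes out of the salient
                         # passage's; no weight is ever exactly zero.
```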
The real threat is the distractor. Text that is plausible, semantically adjacent, high-confidence, and wrong. Distractors look like evidence. They survive ranking. And the moment they enter the context window, reliability degrades rapidly. Irrelevance becomes destabilizing.
This is what I mean by Mutually Assured Distraction (MAD): a dynamic where locally rational optimizations in retrieval and reasoning lead to global system instability. Better retrievers produce more convincing distractors; better reasoners trust those distractors more deeply. Both sides improve; both sides lose.
Most current retrieval pipelines are measuring the wrong variables and are thus optimizing for fragility. A more defensive approach to retrieval, one based on utility-based metrics, verifiable infrastructure, and distraction-aware ranking, can help. But without it, autonomous agents face a hard reliability ceiling.
Lost in the noise
Recent evidence for this comes from the study "Lost in the Noise". Its authors stress-tested models across many tasks by injecting different kinds of noise: random documents, irrelevant chat history, and hard negative distractors. The findings on hard negatives are particularly brutal. In specific tasks such as multi-hop reasoning and tool use, accuracy dropped by up to 80%.
The danger is plausible irrelevance. A document that "looks right" but is wrong survives the initial phase of retrieval because it scores high on signals such as similarity. By the time the model attempts to reason over the data, the distractor has already been integrated into the context as a valid premise.
Here is the uncomfortable part for retrieval. As retrievers improve, their mistakes get better too. Stronger retrievers create better distractors.
Counterintuitively, "The Power of Noise" finds that padding context with random documents can actually improve accuracy. The hypothesis is that random noise increases attention entropy, preventing the model from latching onto any single source. Distractors, the plausibly irrelevant, do the opposite. They look credible, so attention sharpens around the wrong evidence.
What I find unsettling is the phenomenon of "inverse scaling under noise". This challenges the prevailing "scale is all you need" orthodoxy, which posits that larger models and longer reasoning chains (chain-of-thought) universally improve performance. In clean, sanitized benchmarks, allowing a model to generate intermediate reasoning steps significantly boosts accuracy. The model uses the extra tokens to break down the problem and verify its logic.
However, this study reveals that this relationship inverts in the presence of distractors. When the context is polluted with hard negatives, giving the model more time to think (generating a longer chain of thought) actually lowers accuracy. This "inverse scaling" suggests that reasoning capacity is a liability if the input context is not rigorously sanitized. It forces a strategic pivot: we cannot rely on the model's intelligence to filter out bad context after the fact. We must ensure the context is clean before it enters the reasoning loop.
You cannot reason your way out of bad context, and you cannot compute your way out of distraction.
Agentic collapse
The risks of distraction are compounded when we move from transactional RAG (a single query-response cycle) to agentic workflows (loops of reasoning-acting-observing). This transition creates the conditions for Agentic Collapse. In an agentic loop, the output of step t becomes part of the context for step t+1. If a distractor causes a minor error in step t (e.g., a slightly incorrect tool parameter or a misguided sub-goal), that error is re-injected into the context window for the next iteration, which creates a feedback loop of error propagation.
This compounds exponentially. A system with a 90% success rate per step will have a roughly 53% success rate after just 6 steps. However, with the active interference of distractors, the degradation is often faster because the model over-trusts its own previous outputs. This is not just a failure of accuracy; it is emergent misalignment. The agent may pursue hallucinated sub-goals, generate vast amounts of useless data (context rot), or loop indefinitely.
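The arithmetic is easy to check. A minimal sketch, assuming (optimistically) that step failures are independent:

```python
# Probability that an agentic loop completes cleanly when every step
# succeeds independently with probability p_step.
def chain_success(p_step: float, n_steps: int) -> float:
    return p_step ** n_steps

print(round(chain_success(0.90, 6), 2))   # 0.53 -- six steps at 90% each
print(round(chain_success(0.90, 20), 2))  # 0.12 -- longer loops collapse fast
```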
In a transactional RAG system, a distractor ruins one answer. In an agentic system, a distractor can ruin the entire workflow and potentially trigger runaway resource consumption.
Adversarial distractors
Everything described so far assumes benign inputs with retrieval systems accidentally surfacing the wrong evidence. But the same architectural vulnerability that enables accidental distraction also enables deliberate exploitation. Prompt injection attacks embed malicious instructions in untrusted content that gets ingested as context, and because the model cannot reliably distinguish trusted instructions from attacker-controlled input, a single well-crafted string can hijack behavior.
The recent explosion of the OpenClaw ecosystem offers a case study in what this looks like at scale. Prompt injection is framed with directness: "someone messaged the bot and the bot did what they asked". Real-world demonstrations have shown prompt injection leading to data exfiltration and misuse of connected accounts when agents have email or messaging permissions. Traditional trust boundaries collapse because agentic assistants operate with authorized permissions but remain vulnerable to attacker-controlled context.
This is MAD with an adversary applying pressure. Retrieval is a primary attack surface, and defensive retrieval is not only about precision but also about treating context as a security boundary.
Measuring the wrong thing
As I see it, the root cause of Mutually Assured Distraction is the misalignment between what we optimize for and what actually matters. The industry has historically relied on Information Retrieval metrics like nDCG (normalized Discounted Cumulative Gain), MAP (Mean Average Precision), and MRR (Mean Reciprocal Rank). These metrics were designed for the era of human search and are fundamentally flawed when applied to agentic retrieval.
The argument is made clearly in "Redefining Retrieval Evaluation in the Era of LLMs", which identifies two critically invalid assumptions:
- Monotonic decay. Traditional IR metrics assume that the value of a document decreases smoothly with its rank position. A document at rank 1 is much more valuable than at rank 5. This models a human scanning a list from top to bottom.
- Harmless zeroes. Traditional metrics treat irrelevant documents as having a utility of 0: a "miss" but not a penalty.
For language models, both assumptions break. They ingest the whole retrieved bundle at once, and they show positional biases that are not the same as human scan order. They tend to over-attend to early and late tokens, a primacy and recency effect rather than a simple top-to-bottom scan, which makes it rational to float critical evidence to the edges of context.
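For illustration, here is one naive way to act on that bias, assuming you already have passages ranked best-to-worst by some utility estimate (the helper and its ordering heuristic are mine, not from any cited paper):

```python
from typing import List

def edge_order(ranked_passages: List[str]) -> List[str]:
    """Alternate the strongest passages between the front and the back of
    the context, leaving the weakest candidates buried in the middle where
    primacy/recency-biased attention dwells the least."""
    front, back = [], []
    for i, passage in enumerate(ranked_passages):
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]

print(edge_order(["p1", "p2", "p3", "p4", "p5"]))
# ['p1', 'p3', 'p5', 'p4', 'p2'] -- the two best passages sit at the edges
```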
More importantly, "irrelevant" is not one thing. Some passages are actively harmful. A retriever can score high on nDCG while consistently injecting distractors that sabotage downstream reasoning. The metric says retrieval is improving while the system gets worse.
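To make that concrete, here is the standard DCG/nDCG computation over two hypothetical top-4 result lists: one where the non-relevant documents are harmless random misses, and one where they are active distractors. Because relevance is floored at zero, the metric cannot tell them apart:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with the usual log2 position discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    ideal = sorted(relevances, reverse=True)
    return dcg(relevances) / dcg(ideal)

# Graded relevance for a top-4 list (3 = perfect match, 0 = not relevant).
harmless_misses    = [3, 0, 2, 0]  # zeroes are random, easily skipped docs
active_distractors = [3, 0, 2, 0]  # zeroes are plausible-but-wrong evidence

print(ndcg(harmless_misses) == ndcg(active_distractors))  # True -- identical
# score, even though the second list can sabotage downstream reasoning.
```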
This is the incentive misalignment at the heart of MAD. If your metric cannot penalize harm, the locally rational strategy is to push recall as high as possible. But in retrieval, higher recall almost always comes at the cost of lower precision. You accept more borderline candidates to avoid missing anything. Those borderline candidates are the perfect distractors. We optimize for recall and produce fragility.
To address this, Trappolini et al. introduced UDCG (Utility and Distraction-aware Cumulative Gain). This metric represents a shift from measuring "relevance" to measuring "utility." A key innovation is that it assigns negative utility scores to distractors. Concretely, utility is derived from model behavior: if a passage makes the model answer correctly, it is positive; if it makes the model answer when it should abstain, it is negative.
This realigns the local retrieval objective with system-level reliability, and it tracks end-to-end outcomes better than classic IR metrics: in their experiments, the authors report up to a 36% improvement in correlation with end-to-end answer accuracy compared to traditional metrics.
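As a rough sketch of the idea (not the paper's exact formulation), the aggregation could look something like this, where each passage's utility label comes from observed model behavior:

```python
from typing import Callable, List

def utility_gain(passages: List[str], utility: Callable[[str], float]) -> float:
    """Toy utility-and-distraction-aware score, NOT the exact UDCG formula.

    `utility(p)` is assumed to be derived from model behavior:
      +1  the passage leads the model to a correct answer
       0  the passage has no measurable effect
      -1  the passage makes the model answer when it should abstain,
          or pushes it toward a wrong answer
    """
    return sum(utility(p) for p in passages) / max(len(passages), 1)

# Hypothetical labels for a retrieved bundle of four passages.
labels = {"doc_a": 1, "doc_b": 0, "doc_c": -1, "doc_d": 1}
print(utility_gain(list(labels), labels.get))  # 0.25 -- the distractor drags
# the score down, which a relevance-only metric never does.
```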
UDCG exposes that precision is more valuable than recall in the agentic era, because a miss costs less than a distractor.
Low-k, high risk
If distractors are so dangerous, one obvious response is to retrieve less. In loops where errors compound and inverse scaling under noise applies, low-k becomes a primary stabilization strategy. It limits how much distracting context can enter the model, keeps the attention budget concentrated, and reduces the surface area for a single near-miss to seed a long failure chain.
But low-k comes with a hard requirement. When you only retrieve three to five items, a single distractor can become the premise the model latches onto. So with low-k your retriever must be extremely precise, including in the hard case where the wrong document is semantically adjacent and confident-sounding. Similarity alone is not a strong enough signal; you need scoring that can distinguish "useful" from "harmful" and be willing to return less rather than pad the window with borderline candidates.
One mitigation is to stop treating retrieval as "always return k" and start treating it as a sufficiency process. Dynamic-k retrieval pulls documents sequentially and stops as soon as a sufficiency threshold is met, when the model has enough evidence to answer without guessing. If the next candidate doesn’t increase confidence (or increases contradiction risk), it doesn’t get injected. The objective becomes to include only what improves the answer, not everything that looks vaguely related.
And when sufficiency is not met, the system should abstain. Abstention is a control signal. In an agentic loop, "insufficient evidence" should trigger a retry with a different query, a different retrieval strategy, or a different tool, instead of forcing the model to reason over weak context.
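A minimal sketch of that loop, with abstention as an explicit outcome rather than an error. The `is_sufficient` and `is_distracting` hooks stand in for whatever sufficiency and harm scoring you trust; the shape of the control flow is the point:

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

@dataclass
class RetrievalResult:
    context: List[str] = field(default_factory=list)
    abstained: bool = False  # abstention is a control signal, not a failure

def dynamic_k_retrieve(
    candidates: Iterable[str],                         # retriever output, best-first
    is_sufficient: Callable[[List[str]], bool],        # "enough evidence to answer?"
    is_distracting: Callable[[str, List[str]], bool],  # "would this derail the answer?"
    max_k: int = 5,
) -> RetrievalResult:
    """Pull candidates one at a time, stop at sufficiency, skip likely
    distractors, and abstain rather than pad the window with borderline docs."""
    context: List[str] = []
    for candidate in candidates:
        if len(context) >= max_k:
            break
        if is_distracting(candidate, context):
            continue  # don't inject what doesn't improve the answer
        context.append(candidate)
        if is_sufficient(context):
            return RetrievalResult(context=context)
    # Sufficiency never reached: abstain so the agent can retry with a
    # different query, different filters, or a different tool.
    return RetrievalResult(context=context, abstained=True)
```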
Breaking the cycle
In the world of clean benchmarks, we could retrieve more, fill the context window, and let the model think longer until an answer is reached. In the world of hard negative distractors, we can't rely on that. When junk is plausible, reasoning suffers.
This is the MAD dynamic: a systemic instability where improvements in retrieval recall and model reasoning capabilities paradoxically lead to higher failure rates, driven by hard negative distractors.
We can break the cycle with more defensive retrieval, where we treat context like an interface with failure modes. This changes three things:
- What you measure: move from relevance to utility, and explicitly assign negative utility to distractors that make the model confidently wrong.
- What you inject: treat every extra chunk as a liability unless it increases sufficiency, and prefer dynamic-k over "always return k".
- What you do under uncertainty: make abstention a first-class outcome, and use it as a control signal to retry with a different query, different filters, or a different tool.
If thinking longer on bad data wastes compute and degrades performance, then the cost-efficiency of AI systems is directly tied to the purity of their retrieval. Clean context becomes a premium asset. Data providers and retrieval engines may begin to compete not on the size of their index, but on the purity and verifiability of the context they can deliver. Reliability will be the primary bottleneck.
This is exactly why we are building Hornet: to give teams the tools to achieve the relevance they actually need, not the relevance that looks good on the wrong metrics.
