Why Simple Observation Masking Beats LLM Summarisation

Efficient context management is a cornerstone for building capable, cost-effective AI agents — especially those powered by Large Language Models (LLMs) and operating in complex, multi-step environments like software engineering. As these agents interact with tools, process feedback, and reason over many steps, their context window quickly fills up, driving up computational costs and introducing challenges like the “lost in the middle” problem, where critical information becomes inaccessible or diluted.
Traditionally, developers have used LLM-based summarisation to control context size, but recent research demonstrates that a much simpler approach — simple observation masking — can match or even outperform summarisation while being far more efficient.
Why Context Management Matters for LLM Agents
LLM agents, particularly in domains like software engineering, generate large context histories:
- Each agent step adds outputs from tools, observations, and reasoning.
- Left unmanaged, context grows rapidly, doubling inference costs and making it harder for the agent to focus on what matters.
Key problems with unmanaged or poorly managed context:
- High computational and inference cost, due to token bloat and the quadratic scaling of transformer attention (see the back-of-envelope sketch below)
- Context window overflow, leading to truncation or omission of important details
- Performance degradation as LLMs struggle with long, mostly-irrelevant context (the “lost in the middle” effect)
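To see why cost balloons, consider a rough back-of-the-envelope calculation. Most agent frameworks resend the entire history on every step, so while the history itself grows linearly, the total tokens billed across a trajectory grow roughly quadratically. The numbers below are illustrative assumptions, not measurements:

```python
# Rough illustration: total prompt tokens when the full history is
# resent on every agent step. All numbers are illustrative assumptions.
TOKENS_PER_STEP = 2_000  # assumed tokens added per step (tool output + reasoning)
STEPS = 20               # assumed trajectory length

history_tokens = 0
total_prompt_tokens = 0
for step in range(STEPS):
    history_tokens += TOKENS_PER_STEP       # history grows linearly...
    total_prompt_tokens += history_tokens   # ...but billed tokens grow quadratically

print(f"History at final step: {history_tokens:,} tokens")             # 40,000
print(f"Total prompt tokens across the run: {total_prompt_tokens:,}")  # 420,000
```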
LLM Summarisation: A Traditional, Costly Solution
LLM summarisation compresses past context into concise summaries using additional LLM calls. While this reduces history size, it introduces new issues:
- Added cost: Summarisation calls can account for up to 7% of total inference cost for strong models (sketched below).
- Trajectory elongation: Summaries can smooth over failure signals, causing agents to persist in unproductive loops.
- Complexity: Requires careful prompt tuning and extra engineering effort.
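For contrast, here is a minimal sketch of how summarisation-based compaction is typically wired in. The `count_tokens` and `call_llm` callables and both thresholds are hypothetical stand-ins, not names from the research:

```python
SUMMARY_TRIGGER_TOKENS = 50_000  # illustrative threshold
KEEP_RECENT_TURNS = 5            # illustrative: keep the last few turns verbatim

def compress_history(history, count_tokens, call_llm):
    """Sketch of LLM-based compaction: once the history passes a token
    threshold, replace older turns with a model-written summary.
    `count_tokens` and `call_llm` are placeholders for your tokenizer
    and model client; this is not a reference implementation."""
    if count_tokens(history) < SUMMARY_TRIGGER_TOKENS:
        return history
    older, recent = history[:-KEEP_RECENT_TURNS], history[-KEEP_RECENT_TURNS:]
    # The extra model call below is exactly where the added cost comes from.
    summary = call_llm(
        "Summarise this agent trajectory, preserving key facts and any "
        "unresolved errors:\n" + "\n".join(str(turn) for turn in older)
    )
    return [{'type': 'summary', 'content': summary}] + recent
```

Note that the summary itself is also where failure signals can get smoothed away, which is what drives the trajectory elongation described above.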
Simple Observation Masking: A Minimal, Effective Alternative
Observation masking simplifies context management by replacing older tool observations with a placeholder (e.g., “Previous 8 lines omitted for brevity”) while keeping the most recent M observations in full. The agent’s reasoning and actions are fully preserved; only distant observations are condensed.
Why does it work?
- It halves computational cost compared to keeping the full history.
- It matches or slightly exceeds the solve rates of LLM summarisation, even with state-of-the-art models like Qwen3-Coder 480B.
- It prevents trajectory elongation: failures stay visible in the history, so agents are less likely to get stuck in loops.
- It simplifies engineering: no additional LLM calls or summary logic are needed.
Key finding: Simplicity wins. Observation masking is often as effective — or better — than complex summarisation for agent context management, at a fraction of the cost.
Practical Example: Observation Masking in a Diagnostic Agent
Scenario:
An AI agent is tasked with diagnosing and fixing network latency on a server. It:
- Breaks the task into subtasks (check status, ping, traceroute, review logs)
- Records each tool output in its context history
The Problem:
After 10 diagnostic steps, the context fills with thousands of lines of logs and outputs. The LLM’s context window is nearly full, degrading performance.
The Solution:
Apply observation masking — retain only the last 3 observations in full; mask earlier ones.
```python
MAX_CONTEXT_OBSERVATIONS = 3  # keep only the last 3 observations in full

def mask_old_observations(history):
    """Replace the content of all but the most recent observations
    with a placeholder. Mutates `history` in place; action and
    reasoning entries are left untouched."""
    observations = [item for item in history if item['type'] == 'observation']
    if len(observations) > MAX_CONTEXT_OBSERVATIONS:
        for obs in observations[:-MAX_CONTEXT_OBSERVATIONS]:
            obs['content'] = "<Observation masked for brevity>"
    return history

# Before building the LLM prompt:
agent_history = mask_old_observations(agent_history)
```

Prompt Before Masking:
```
1. Ran 'ping server X' -> Output: [full ping output]
2. Ran 'traceroute server X' -> Output: [full traceroute output]
...
10. Ran 'cat /var/log/syslog' -> Output: [full syslog]
```

Prompt After Masking:
```
1. Ran 'ping server X' -> Output: <Observation masked for brevity>
2. Ran 'traceroute server X' -> Output: <Observation masked for brevity>
...
8. Ran 'netstat -tulnp' -> Output: [full output]
9. Ran 'cat /etc/resolv.conf' -> Output: [full output]
10. Ran 'cat /var/log/syslog' -> Output: [full output]
```

Result:
- The agent stays focused on the latest, most relevant information.
- No summary generation is needed, saving compute and prompt tokens.
- Empirical studies show this approach can improve downstream task performance compared to both full-history and summarised context.
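To make the behaviour concrete, here is a toy run of the `mask_old_observations` helper above. The history format (dicts with `type` and `content` keys) matches what that function assumes; the tool outputs are invented for illustration:

```python
agent_history = [
    {'type': 'action',      'content': "Ran 'ping server X'"},
    {'type': 'observation', 'content': '64 bytes from server X: icmp_seq=1 ttl=60 ...'},
    {'type': 'action',      'content': "Ran 'traceroute server X'"},
    {'type': 'observation', 'content': '1  gateway (10.0.0.1)  0.5 ms ...'},
    {'type': 'action',      'content': "Ran 'netstat -tulnp'"},
    {'type': 'observation', 'content': 'tcp 0 0 0.0.0.0:22 ... LISTEN'},
    {'type': 'action',      'content': "Ran 'cat /etc/resolv.conf'"},
    {'type': 'observation', 'content': 'nameserver 10.0.0.53'},
    {'type': 'action',      'content': "Ran 'cat /var/log/syslog'"},
    {'type': 'observation', 'content': 'Oct 12 09:14:01 server CRON[1234]: ...'},
]

for item in mask_old_observations(agent_history):
    print(f"{item['type']:>11}: {item['content']}")
# The first two observations print as "<Observation masked for brevity>";
# the last three observations and every action remain intact.
```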
When to Use Masking vs. Summarisation
- Masking: Best when past observations are rarely revisited, or when the context window is the limiting factor.
- Summarisation: Useful when a compact record of prior steps is essential for reasoning, but should be used sparingly.
- Hybrid: Some systems combine masking for raw data with summarisation for high-level reasoning (see the sketch after this list).
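As a hedged sketch of that hybrid idea: mask raw tool output aggressively, but leave a one-line digest in place of each masked observation so a high-level trail survives. The `one_line_digest` hook below is hypothetical; it could be a cheap heuristic, as defaulted here, or a small-model call:

```python
def hybrid_mask(history, keep_last=3, one_line_digest=None):
    """Sketch of a hybrid scheme: recent observations stay verbatim, older
    ones are reduced to a one-line digest rather than removed outright.
    This illustrates the idea only; it is not a scheme from the research."""
    if one_line_digest is None:
        # Default heuristic digest: first line of the output, truncated.
        one_line_digest = lambda text: text.splitlines()[0][:80] if text else ''
    observations = [item for item in history if item['type'] == 'observation']
    for obs in observations[:-keep_last]:
        obs['content'] = f"<masked; digest: {one_line_digest(obs['content'])}>"
    return history
```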
Practical Implications
- Simplicity wins: Simple observation masking is highly efficient, low-maintenance, and robust across model scales and tasks.
- Cost savings: Removing unnecessary tokens directly lowers latency and compute costs.
- Interpretability: Masking preserves the full reasoning and action trace, aiding debugging and transparency.
Conclusion
Agent context management isn’t about maximising history but about curating what matters most. For many real-world workflows, simple observation masking is more efficient and effective than complex summarisation, delivering lower costs, better transparency, and robust agent performance. As LLM-based agents proliferate, this insight will help teams build scalable, economical, and trustworthy AI systems.