Everyone in AI right now is arguing about which model is best. GPT vs Claude vs Gemini. Benchmark scores. Arena ratings. Token prices.
I think they're asking the wrong question.
LangChain proved it earlier this year: their coding agent jumped from outside the top 30 to the top 5 on Terminal Bench 2.0 by changing nothing about the model. They only changed the harness. Anthropic's own engineering team discovered that their agents exhibited "context anxiety" — performance degrading as context filled up, even after compaction. The fix wasn't a better model. It was a better harness.
So I built one. It ships 4 different memory architectures and a suite of 6 real coding tasks, so harness decisions can be measured instead of debated.
The problem
When you give a coding agent a bug to fix, every tool call generates output that accumulates in the context window. By step 15, the agent is carrying 50,000+ tokens of history — most of it irrelevant to the current decision. The agent starts making worse choices because it's drowning in its own past.
This is the memory management problem. Everyone's building agents. Almost nobody is systematically measuring how different memory strategies affect performance.
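To put rough numbers on that accumulation, here is a toy sketch (not AgentForge code) using a crude 4-characters-per-token estimate in place of a real tokenizer:

```python
# Minimal sketch of how tool output accumulates in an agent's context.
# Token counts are a rough chars/4 estimate, not a real tokenizer.

def estimate_tokens(text: str) -> int:
    return len(text) // 4

history: list[str] = []
total = 0
for step in range(1, 16):
    tool_output = "x" * 12_000  # pretend each tool call returns ~3k tokens
    history.append(tool_output)
    total += estimate_tokens(tool_output)

print(f"after {len(history)} steps: ~{total:,} tokens of history")
# At ~3k tokens per tool call, 15 steps already carries ~45k tokens.
```

Nothing in this loop is intelligent, which is the point: history grows linearly with steps whether or not the old output is still relevant.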
What I built
AgentForge is an open-source Python harness with three key design decisions: pluggable memory strategies (swap with one YAML line), real coding tasks (not toy demos), and quantitative evaluation (8 metrics per run, not vibes).
```yaml
memory:
  strategy: summarization # or: sliding_window, rag, hybrid
  max_context_tokens: 90000
  compact_threshold: 0.8
```
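One way that YAML line can map to code is a small strategy interface plus a registry keyed by the `strategy` string. This is an illustrative sketch; none of the class or function names below are AgentForge's actual API:

```python
# Hypothetical sketch of "pluggable memory strategies": a common
# interface plus a registry keyed by the YAML `strategy` value.
from abc import ABC, abstractmethod


class MemoryStrategy(ABC):
    @abstractmethod
    def compact(self, messages: list[dict], max_tokens: int) -> list[dict]:
        """Return a reduced message list that fits the token budget."""


class SlidingWindow(MemoryStrategy):
    def __init__(self, keep_recent: int = 10):
        self.keep_recent = keep_recent

    def compact(self, messages, max_tokens):
        # Keep the first message (the task statement) plus the most
        # recent turns; everything in the middle is dropped.
        return messages[:1] + messages[-self.keep_recent:]


REGISTRY = {"sliding_window": SlidingWindow}


def build_strategy(config: dict) -> MemoryStrategy:
    # The YAML `strategy` string selects the implementation.
    return REGISTRY[config["strategy"]]()
```

The registry is what makes "swap with one YAML line" cheap: adding a strategy means one new class and one registry entry, with no changes to the agent loop.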
The 4 memory strategies
Each strategy answers the same question differently: when context gets too long, what do you throw away?
| Strategy | How it works | Trade-off |
|---|---|---|
| Sliding window | Keeps first message + most recent N pairs | Cheap, fast, but forgets the middle |
| Summarization | Compresses older messages via a separate LLM call | Preserves semantics, adds latency + cost |
| RAG-backed | Embeds past turns, retrieves relevant ones per step | Precise recall, requires vector infra |
| Hybrid | Recent window + summarized older context | Best of both, most complex |
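As an illustration of the second row, here is a rough sketch of a summarization-style compactor with the LLM call stubbed out; the function names and the keep-recent split are my assumptions, not AgentForge's implementation:

```python
# Sketch of the summarization strategy from the table. `summarize`
# stands in for a separate LLM call; everything here is illustrative.

def summarize(messages: list[str]) -> str:
    # Stub: a real implementation would call a cheap model here.
    return f"[summary of {len(messages)} earlier messages]"


def compact_with_summary(messages: list[str], keep_recent: int = 6) -> list[str]:
    if len(messages) <= keep_recent + 1:
        return messages  # nothing old enough to compress
    head = messages[0]                 # task statement stays verbatim
    old = messages[1:-keep_recent]     # middle gets compressed
    recent = messages[-keep_recent:]   # recent turns stay verbatim
    # The trade-off from the table: one extra call's latency and cost
    # buys a semantic digest of the middle instead of dropping it.
    return [head, summarize(old)] + recent


print(compact_with_summary([f"msg{i}" for i in range(20)]))
```

The sliding-window variant is this function with `summarize(old)` deleted, which is exactly why the two diverge only on tasks long enough for the middle to matter.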
The results
I ran all 6 tasks with the default config (summarization memory, Claude Sonnet, 25 max steps). The result: 6 for 6. Every bug found and fixed, at an average cost under a dime per task.
What the trajectories reveal
The agent's first move matters. On 5 of 6 tasks, the agent's first action was file_read to examine the buggy code. Agents that read before acting consistently needed fewer total steps.
Error recovery is real. On the JSON parser task, the agent's initial test showed escape characters weren't working. Rather than guessing, it wrote a targeted diagnostic test to isolate the exact failure, then made a precise fix. That's the Plan → Act → Observe → Reflect loop working as designed.
Self-correction matters more than getting it right the first time. On the Fibonacci task, the agent initially changed the wrong variable, caught it when tests failed, and corrected course. Without trajectory logging, you'd only see "pass" and miss the recovery story.
The model-based judge
Quantitative metrics tell you what happened. They don't tell you how well the agent reasoned. So I built a model-based judge — a separate LLM that reads the full trajectory and scores it on 5 dimensions:
- Reasoning coherence — does each step follow logically from the last?
- Plan adherence — does the agent follow its own stated plan?
- Safety — does it avoid destructive operations?
- Tool usage quality — does it read before writing, test after changing?
- Error handling — does it diagnose errors or just retry blindly?
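A judge like this can be a thin wrapper around a second model call. The sketch below stubs that call and shows one possible JSON contract for the 5 dimensions; the prompt wording and key names are my assumptions, not AgentForge's:

```python
# Sketch of a model-based judge: a separate LLM reads the trajectory
# and returns a 1-5 score per dimension. The model call is stubbed.
import json

DIMENSIONS = ["reasoning_coherence", "plan_adherence", "safety",
              "tool_usage_quality", "error_handling"]

JUDGE_PROMPT = (
    "You are evaluating an agent trajectory. Score each dimension 1-5 "
    "and reply with a JSON object with keys: " + ", ".join(DIMENSIONS)
)


def call_judge_llm(prompt: str, trajectory: str) -> str:
    # Stub: a real harness would send prompt + trajectory to a second model.
    return json.dumps({d: 4 for d in DIMENSIONS})


def judge(trajectory: str) -> dict[str, int]:
    raw = call_judge_llm(JUDGE_PROMPT, trajectory)
    scores = json.loads(raw)
    # Fail loudly if the judge skips a dimension, rather than silently
    # averaging over the gaps.
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"judge omitted dimensions: {missing}")
    return scores
```

Validating the judge's output is the unglamorous half of the pattern: an LLM judge that returns malformed or partial scores can corrupt a benchmark as easily as a buggy metric.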
This is the same evaluation pattern Anthropic uses internally — using an LLM to judge another LLM's behavior. It catches things that pass/fail metrics miss entirely.
Multi-agent coordination
I also built a Planner → Executor → Reviewer pipeline where three specialized agents coordinate on a task, with up to 2 revision rounds. For simple single-file bugs, the coordination overhead isn't worth it. But the architecture is ready for harder multi-file tasks where decomposition genuinely helps.
CI integration
The most practical feature might be the CI integration. When a PR's tests fail, a GitHub Actions workflow automatically detects the failures, generates a task definition from the PR diff, runs the AgentForge agent, and posts a formatted analysis comment on the PR. The agent harness as a CI/CD tool.
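A workflow with that shape might look roughly like the following; the job layout, step names, and the reuse of the `agentforge benchmark` CLI here are assumptions rather than the repository's actual config:

```yaml
# Hypothetical GitHub Actions sketch -- step names, CLI flags, and
# paths are illustrative, not the repo's actual workflow.
name: agentforge-on-failure
on: [pull_request]

jobs:
  analyze-failures:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        id: tests
        run: pytest || echo "failed=true" >> "$GITHUB_OUTPUT"
      - name: Run AgentForge on the failing PR
        if: steps.tests.outputs.failed == 'true'
        run: agentforge benchmark --config configs/default.yaml
      # A final step would post the formatted analysis as a PR comment.
```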
What I'd build next
Harder tasks. Multi-file bugs where the context window actually fills up and memory strategies are forced to diverge.
Head-to-head comparison. Same task suite, all 4 strategies, with statistical significance testing. The hypothesis: summarization and hybrid outperform sliding window on long tasks, but sliding window is cheaper on short ones.
Agent-to-agent evaluation. Instead of a fixed judge, have agents evaluate each other's trajectories for a richer evaluation signal.
Why this matters
The harness engineering conversation is exploding right now. The industry consensus in 2026 is clear: the model is the engine, but the harness is the car. AgentForge is my contribution to that conversation — small but open, measurable, and designed to make harness decisions pluggable and comparable.
If you're building agents and you're not systematically benchmarking your harness architecture, you're flying blind. The model will get better every quarter. Your harness is what compounds.
Try AgentForge
Clone, install, benchmark. That's it.
```shell
git clone https://github.com/Mrabbi3/agentforge
cd agentforge
pip install -e ".[dev]"
agentforge benchmark --config configs/default.yaml
```
MD Rabbi is a Computer Science student at Stockton University building AI agent systems. He previously reproduced Google's Med-PaLM M paper using BLIP-2 with LoRA fine-tuning, achieving a 26.16% BLEU-1 score. He's looking for Research Engineer roles at AI companies working on agent systems.