Everyone in AI right now is arguing about which model is best. GPT vs Claude vs Gemini. Benchmark scores. Arena ratings. Token prices.
I think they're asking the wrong question.
LangChain proved it earlier this year: their coding agent jumped from outside the top 30 to the top 5 on Terminal Bench 2.0 by changing nothing about the model. They only changed the harness. Anthropic's own engineering team discovered that their agents exhibited "context anxiety" — performance degrading as context filled up, even after compaction. The fix wasn't a better model. It was a better harness.
So I built one. It ships 4 different memory architectures and a suite of 6 real coding tasks, so harness decisions can be measured instead of debated.
The problem
When you give a coding agent a bug to fix, every tool call generates output that accumulates in the context window. By step 15, the agent is carrying 50,000+ tokens of history — most of it irrelevant to the current decision. The agent starts making worse choices because it's drowning in its own past.
This is the memory management problem. Everyone's building agents. Almost nobody is systematically measuring how different memory strategies affect performance.
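To put rough numbers on that accumulation, here is a toy sketch (not AgentForge code) using a crude 4-characters-per-token estimate in place of a real tokenizer:

```python
# Minimal sketch of how tool output accumulates in an agent's context.
# Token counts are a rough chars/4 estimate, not a real tokenizer.

def estimate_tokens(text: str) -> int:
    return len(text) // 4

history: list[str] = []
total = 0
for step in range(1, 16):
    tool_output = "x" * 12_000  # pretend each tool call returns ~3k tokens
    history.append(tool_output)
    total += estimate_tokens(tool_output)

print(f"after {len(history)} steps: ~{total:,} tokens of history")
# At ~3k tokens per tool call, 15 steps already carries ~45k tokens.
```

Nothing in this loop is intelligent, which is the point: history grows linearly with steps whether or not the old output is still relevant.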
What I built
AgentForge is an open-source Python harness with three key design decisions: pluggable memory strategies (swap with one YAML line), real coding tasks (not toy demos), and quantitative evaluation (8 metrics per run, not vibes).
```yaml
memory:
  strategy: summarization # or: sliding_window, rag, hybrid
  max_context_tokens: 90000
  compact_threshold: 0.8
```
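One way that YAML line can map to code is a small strategy interface plus a registry keyed by the `strategy` string. This is an illustrative sketch; none of the class or function names below are AgentForge's actual API:

```python
# Hypothetical sketch of "pluggable memory strategies": a common
# interface plus a registry keyed by the YAML `strategy` value.
from abc import ABC, abstractmethod


class MemoryStrategy(ABC):
    @abstractmethod
    def compact(self, messages: list[dict], max_tokens: int) -> list[dict]:
        """Return a reduced message list that fits the token budget."""


class SlidingWindow(MemoryStrategy):
    def __init__(self, keep_recent: int = 10):
        self.keep_recent = keep_recent

    def compact(self, messages, max_tokens):
        # Keep the first message (the task statement) plus the most
        # recent turns; everything in the middle is dropped.
        return messages[:1] + messages[-self.keep_recent:]


REGISTRY = {"sliding_window": SlidingWindow}


def build_strategy(config: dict) -> MemoryStrategy:
    # The YAML `strategy` string selects the implementation.
    return REGISTRY[config["strategy"]]()
```

The registry is what makes "swap with one YAML line" cheap: adding a strategy means one new class and one registry entry, with no changes to the agent loop.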
The 4 memory strategies
Each strategy answers the same question differently: when context gets too long, what do you throw away?
| Strategy | How it works | Trade-off |
|---|---|---|
| Sliding window | Keeps first message + most recent N pairs | Cheap, fast, but forgets the middle |
| Summarization | Compresses older messages via a separate LLM call | Preserves semantics, adds latency + cost |
| RAG-backed | Embeds past turns, retrieves relevant ones per step | Precise recall, requires vector infra |
| Hybrid | Recent window + summarized older context | Best of both, most complex |
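As an illustration of the second row, here is a rough sketch of a summarization-style compactor with the LLM call stubbed out; the function names and the keep-recent split are my assumptions, not AgentForge's implementation:

```python
# Sketch of the summarization strategy from the table. `summarize`
# stands in for a separate LLM call; everything here is illustrative.

def summarize(messages: list[str]) -> str:
    # Stub: a real implementation would call a cheap model here.
    return f"[summary of {len(messages)} earlier messages]"


def compact_with_summary(messages: list[str], keep_recent: int = 6) -> list[str]:
    if len(messages) <= keep_recent + 1:
        return messages  # nothing old enough to compress
    head = messages[0]                 # task statement stays verbatim
    old = messages[1:-keep_recent]     # middle gets compressed
    recent = messages[-keep_recent:]   # recent turns stay verbatim
    # The trade-off from the table: one extra call's latency and cost
    # buys a semantic digest of the middle instead of dropping it.
    return [head, summarize(old)] + recent


print(compact_with_summary([f"msg{i}" for i in range(20)]))
```

The sliding-window variant is this function with `summarize(old)` deleted, which is exactly why the two diverge only on tasks long enough for the middle to matter.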
The results
I ran all 6 tasks with the default config (summarization memory, Claude Sonnet, 25 max steps). The result: 6 for 6. Every bug found and fixed, at an average cost under a dime per task.
What the trajectories reveal
The agent's first move matters. On 5 of 6 tasks, the agent's first action was file_read to examine the buggy code. Agents that read before acting consistently needed fewer total steps.
Error recovery is real. On the JSON parser task, the agent's initial test showed escape characters weren't working. Rather than guessing, it wrote a targeted diagnostic test to isolate the exact failure, then made a precise fix. That's the Plan → Act → Observe → Reflect loop working as designed.
Self-correction matters more than getting it right the first time. On the Fibonacci task, the agent initially changed the wrong variable, caught it when tests failed, and corrected course. Without trajectory logging, you'd only see "pass" and miss the recovery story.
The model-based judge
Quantitative metrics tell you what happened. They don't tell you how well the agent reasoned. So I built a model-based judge — a separate LLM that reads the full trajectory and scores it on 5 dimensions:
- Reasoning coherence — does each step follow logically from the last?
- Plan adherence — does the agent follow its own stated plan?
- Safety — does it avoid destructive operations?
- Tool usage quality — does it read before writing, test after changing?
- Error handling — does it diagnose errors or just retry blindly?
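A judge like this can be a thin wrapper around a second model call. The sketch below stubs that call and shows one possible JSON contract for the 5 dimensions; the prompt wording and key names are my assumptions, not AgentForge's:

```python
# Sketch of a model-based judge: a separate LLM reads the trajectory
# and returns a 1-5 score per dimension. The model call is stubbed.
import json

DIMENSIONS = ["reasoning_coherence", "plan_adherence", "safety",
              "tool_usage_quality", "error_handling"]

JUDGE_PROMPT = (
    "You are evaluating an agent trajectory. Score each dimension 1-5 "
    "and reply with a JSON object with keys: " + ", ".join(DIMENSIONS)
)


def call_judge_llm(prompt: str, trajectory: str) -> str:
    # Stub: a real harness would send prompt + trajectory to a second model.
    return json.dumps({d: 4 for d in DIMENSIONS})


def judge(trajectory: str) -> dict[str, int]:
    raw = call_judge_llm(JUDGE_PROMPT, trajectory)
    scores = json.loads(raw)
    # Fail loudly if the judge skips a dimension, rather than silently
    # averaging over the gaps.
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"judge omitted dimensions: {missing}")
    return scores
```

Validating the judge's output is the unglamorous half of the pattern: an LLM judge that returns malformed or partial scores can corrupt a benchmark as easily as a buggy metric.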
This is the same evaluation pattern Anthropic uses internally — using an LLM to judge another LLM's behavior. It catches things that pass/fail metrics miss entirely.
Multi-agent coordination
I also built a Planner → Executor → Reviewer pipeline where three specialized agents coordinate on a task, with up to 2 revision rounds. For simple single-file bugs, the coordination overhead isn't worth it. But the architecture is ready for harder multi-file tasks where decomposition genuinely helps.
CI integration
The most practical feature might be the CI integration. When a PR's tests fail, a GitHub Actions workflow automatically detects the failures, generates a task definition from the PR diff, runs the AgentForge agent, and posts a formatted analysis comment on the PR. The agent harness as a CI/CD tool.
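A workflow with that shape might look roughly like the following; the job layout, step names, and the reuse of the `agentforge benchmark` CLI here are assumptions rather than the repository's actual config:

```yaml
# Hypothetical GitHub Actions sketch -- step names, CLI flags, and
# paths are illustrative, not the repo's actual workflow.
name: agentforge-on-failure
on: [pull_request]

jobs:
  analyze-failures:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        id: tests
        run: pytest || echo "failed=true" >> "$GITHUB_OUTPUT"
      - name: Run AgentForge on the failing PR
        if: steps.tests.outputs.failed == 'true'
        run: agentforge benchmark --config configs/default.yaml
      # A final step would post the formatted analysis as a PR comment.
```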
What I'd build next
Harder tasks. Multi-file bugs where the context window actually fills up and memory strategies are forced to diverge.
Head-to-head comparison. Same task suite, all 4 strategies, with statistical significance testing. The hypothesis: summarization and hybrid outperform sliding window on long tasks, but sliding window is cheaper on short ones.
Agent-to-agent evaluation. Instead of a fixed judge, have agents evaluate each other's trajectories for a richer evaluation signal.
Why this matters
The harness engineering conversation is exploding right now. The industry consensus in 2026 is clear: the model is the engine, but the harness is the car. AgentForge is my contribution to that conversation — small but open, measurable, and designed to make harness decisions pluggable and comparable.
If you're building agents and you're not systematically benchmarking your harness architecture, you're flying blind. The model will get better every quarter. Your harness is what compounds.
Try AgentForge
Clone, install, benchmark. That's it.
```shell
git clone https://github.com/Mrabbi3/agentforge
cd agentforge
pip install -e ".[dev]"
agentforge benchmark --config configs/default.yaml
```
MD Rabbi is a Computer Science student at Stockton University building AI agent systems. He previously reproduced Google's Med-PaLM M paper using BLIP-2 with LoRA fine-tuning, achieving a 26.16% BLEU-1 score. He's looking for Research Engineer roles at AI companies working on agent systems.