Open Source · v0.1.0

Benchmark your AI agent's memory strategy

An open-source harness that lets you plug in different memory architectures — sliding window, summarization, RAG, hybrid — and measure what actually matters on real coding tasks.

$ agentforge benchmark --config default.yaml
Running 6 tasks...
Fix Fibonacci Bug ........... PASS  7 steps
Fix Merge Sort .............. PASS  5 steps
Fix BFS Shortest Path ....... PASS  8 steps
Fix Rate Limiter ............ PASS  5 steps
Fix LRU Cache ............... PASS  4 steps
Fix JSON Parser ............. PASS  6 steps
6/6 passed · $0.07/task · 30s avg

Pass rate: 100% · Avg steps: 5.8 · Cost per task: $0.07 · Avg duration: 30s

What's inside

Agent Loop

Plan, Act, Observe, Reflect. Tool dispatch with configurable step limits and early stopping.
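The loop described above can be sketched roughly as follows. This is a minimal illustration, not AgentForge's actual API: `run_agent`, `Step`, and the `"name:arg"` action format are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    action: str
    observation: str

def run_agent(task: str,
              plan: Callable[[str, list], str],
              tools: dict[str, Callable[[str], str]],
              max_steps: int = 10) -> list:
    """Plan -> Act -> Observe loop; Reflect happens because each
    plan() call sees the trajectory accumulated so far."""
    trajectory = []
    for _ in range(max_steps):              # configurable step limit
        action = plan(task, trajectory)     # Plan
        if action == "done":                # early stopping
            break
        name, _, arg = action.partition(":")
        observation = tools[name](arg)      # Act (tool dispatch) + Observe
        trajectory.append(Step(action, observation))
    return trajectory
```

The key design point is that the step limit bounds cost on runaway trajectories, while the `"done"` signal lets well-behaved agents stop early.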

8 Metrics

Pass rate, tool efficiency, cost tracking, context utilization, error recovery, step count, and duration.
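Aggregating per-task results into the summary numbers shown in the demo could look like the sketch below; `TaskResult` and the field names are illustrative, not the harness's real types.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    passed: bool
    steps: int
    cost_usd: float
    duration_s: float

def summarize(results: list) -> dict:
    """Roll per-task results up into benchmark-level metrics."""
    n = len(results)
    return {
        "pass_rate": sum(r.passed for r in results) / n,
        "avg_steps": sum(r.steps for r in results) / n,
        "cost_per_task": sum(r.cost_usd for r in results) / n,
        "avg_duration_s": sum(r.duration_s for r in results) / n,
    }
```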

4 Memory Strategies

Sliding window, summarization, RAG, and hybrid. Switch with one config line.
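Two of the four strategies can be sketched behind a registry keyed by the config value, which is what makes switching a one-line change. The names and the stubbed summarizer below are illustrative, not AgentForge's real interface.

```python
def sliding_window(history: list, window: int = 8) -> list:
    # Keep only the most recent `window` messages in context.
    return history[-window:]

def summarization(history: list, keep_recent: int = 4) -> list:
    # Placeholder: a real implementation would ask an LLM to compress
    # the older messages instead of stubbing a summary string.
    older = len(history) - keep_recent
    if older <= 0:
        return history
    return [f"[summary of {older} earlier messages]"] + history[-keep_recent:]

# Registry keyed by the string a config file would supply.
STRATEGIES = {"sliding_window": sliding_window, "summarization": summarization}

def build_context(strategy: str, history: list) -> list:
    return STRATEGIES[strategy](history)
```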

Model-Based Judge

LLM-as-judge scoring of trajectories on reasoning coherence, plan adherence, and safety.
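Aggregating per-dimension judge scores might look like this sketch, where `judge_fn` stands in for a real model call that returns a 0–10 score for one rubric dimension; everything here is illustrative.

```python
# Rubric dimensions taken from the feature description above.
DIMENSIONS = ("reasoning_coherence", "plan_adherence", "safety")

def score_trajectory(trajectory: str, judge_fn) -> dict:
    """Ask the judge for each dimension, then average into an overall grade."""
    scores = {d: judge_fn(trajectory, d) for d in DIMENSIONS}
    scores["overall"] = sum(scores.values()) / len(DIMENSIONS)
    return scores
```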

Multi-Agent Pipeline

Planner, Executor, Reviewer. Orchestrated revision loops with message passing.
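The revision loop can be sketched with each role as a plain callable; the feedback string is the message passed from reviewer back to planner. This is an illustration of the pattern, not the pipeline's real code.

```python
def pipeline(task, planner, executor, reviewer, max_revisions=3):
    """Planner -> Executor -> Reviewer, looping until the reviewer
    approves or the revision budget is spent."""
    result, feedback = None, None
    for _ in range(max_revisions):
        plan = planner(task, feedback)       # planner sees reviewer feedback
        result = executor(plan)
        approved, feedback = reviewer(result)
        if approved:                         # reviewer closes the loop
            break
    return result                            # best effort if budget exhausted
```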

CI Integration

GitHub Actions auto-diagnoses failing PRs and posts agent analysis as comments.
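A workflow wiring this up could resemble the sketch below. It is hypothetical: the `agentforge diagnose` subcommand and its flags are assumptions, not documented CLI, so treat this as a shape rather than copy-paste config.

```yaml
# Hypothetical PR-diagnosis workflow; `agentforge diagnose` is assumed,
# not AgentForge's documented CLI.
name: agent-diagnosis
on: pull_request
jobs:
  diagnose:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write        # needed to post the analysis comment
    steps:
      - uses: actions/checkout@v4
      - run: pip install -e ".[dev]"
      - run: agentforge diagnose --pr ${{ github.event.pull_request.number }}
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```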

Start benchmarking in 2 minutes

Clone, install, run. Your first results in under 60 seconds.
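A config in the spirit of `default.yaml` might look like the sketch below; every field name here is illustrative, since the actual schema isn't shown on this page.

```yaml
# Illustrative config sketch; field names are assumptions, not the
# real default.yaml schema.
memory:
  strategy: sliding_window    # or: summarization, rag, hybrid
  window: 8
agent:
  max_steps: 10
  early_stop: true
```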

View on GitHub
pip install -e ".[dev]" && agentforge benchmark