An open-source harness that lets you plug in different memory architectures — sliding window, summarization, RAG, hybrid — and measure what actually matters on real coding tasks.
Pass rate
Avg steps
Per task
Avg duration
Plan, Act, Observe, Reflect. Tool dispatch with configurable step limits and early stopping.
Pass rate, tool efficiency, cost tracking, context utilization, error recovery, steps, duration.
Sliding window, summarization, RAG, and hybrid. Switch with one config line.
LLM-as-judge scoring trajectories on reasoning coherence, plan adherence, safety.
Planner, Executor, Reviewer. Orchestrated revision loops with message passing.
GitHub Actions auto-diagnoses failing PRs and posts agent analysis as comments.
Clone, install, run. Your first results in under 60 seconds.
View on GitHub