From Analysis to Automation: Streamlining Agent Evaluation with GitHub Copilot

In the fast-paced world of software engineering, automation often emerges from a blend of inspiration, frustration, and occasional laziness. Engineers build tools to eliminate repetitive tasks, freeing themselves for more creative work — only to end up maintaining those very systems. As an AI researcher on the Copilot Applied Science team, I took this cycle to a new level by automating my own intellectual toil. The result is a tool that empowers my entire team to do the same, transforming how we analyze and improve coding agents.

The Challenge: Decoding Agent Trajectories

A core part of my role involves evaluating coding agent performance against standardized benchmarks like TerminalBench2 and SWEBench-Pro. These evaluations generate what we call trajectories: detailed logs of an agent’s thought processes and actions as it solves each task. Each trajectory is a JSON file containing hundreds of lines, and a single benchmark run might include dozens of tasks. Multiply that by multiple runs per day, and you’re staring at hundreds of thousands of lines to analyze manually.

[Image: From Analysis to Automation: Streamlining Agent Evaluation with GitHub Copilot. Source: github.blog]

This is clearly an impossible task for a human alone. My usual approach was to enlist GitHub Copilot to surface patterns in the trajectories, reducing the data I needed to examine from hundreds of thousands of lines to a few hundred. But I soon realized I was repeating the same loop: analyze, find patterns, investigate. The engineer in me said, “I want to automate that.” That’s when eval-agents was born.

The Solution: Eval-Agents

Eval-agents is a framework designed to automate the intellectual work of analyzing agent trajectories. Instead of manually sifting through JSON files, researchers can deploy agents that perform the analysis themselves. The system leverages GitHub Copilot not just as a coding assistant but as a core component in the automation pipeline.

The key insight was that the analysis loop itself could be captured and automated. By creating specialized agents that understand the structure of trajectories, we could offload the pattern recognition and hypothesis generation to code. This freed up human researchers to focus on higher-level interpretation and decision-making.

How It Works

At its heart, eval-agents defines a set of reusable components that handle common analysis tasks. These include:

Trajectory Parsers — Extract structured data from raw JSON logs.
Pattern Detectors — Identify recurring behaviors (e.g., agents getting stuck on certain steps).
Summary Generators — Produce human-readable reports from detected patterns.
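The three component types above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the eval-agents implementation: the trajectory schema (a `task_id` plus a `steps` list whose entries carry `thought` and `action` fields) and the function names `parse_trajectory`, `detect_stuck_runs`, and `summarize` are all hypothetical, chosen only to make the roles concrete.

```python
import json
from itertools import groupby

# Hypothetical trajectory; the real schema is not described in this article.
RAW_TRAJECTORY = json.dumps({
    "task_id": "example-task",
    "steps": [
        {"thought": "inspect the repo", "action": "ls"},
        {"thought": "run the tests", "action": "pytest"},
        {"thought": "try again", "action": "pytest"},
        {"thought": "try once more", "action": "pytest"},
    ],
})


def parse_trajectory(raw: str) -> tuple[str, list[str]]:
    """Trajectory parser: return the task id and the ordered list of actions."""
    data = json.loads(raw)
    return data["task_id"], [step["action"] for step in data["steps"]]


def detect_stuck_runs(actions: list[str], min_run: int = 3) -> list[tuple[str, int]]:
    """Pattern detector: find actions repeated at least `min_run` times in a row."""
    runs = []
    for action, group in groupby(actions):
        length = len(list(group))
        if length >= min_run:
            runs.append((action, length))
    return runs


def summarize(task_id: str, runs: list[tuple[str, int]]) -> str:
    """Summary generator: turn detected runs into a short plain-text report."""
    if not runs:
        return f"{task_id}: no stuck runs detected"
    details = "; ".join(f"'{action}' repeated {n} times in a row" for action, n in runs)
    return f"{task_id}: {details}"


task_id, actions = parse_trajectory(RAW_TRAJECTORY)
print(summarize(task_id, detect_stuck_runs(actions)))
```

Run against the sample data, this prints `example-task: 'pytest' repeated 3 times in a row`. Treating "stuck" as the same action repeated consecutively is only one possible heuristic; a real detector would likely look at richer signals.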

Agents are composed of these building blocks, making it easy to create new analysis workflows. The entire system is designed for collaboration — agents can be shared, modified, and improved by the whole team.
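One way to picture this kind of composition is an agent defined as an ordered chain of small steps, where a teammate can extend an existing agent without editing it. The sketch below is an assumption-laden illustration: the `AnalysisAgent` class, its `extended` method, the step functions, and the `exit_code` field are hypothetical names, not part of eval-agents.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any, Callable

# A step is any function that takes the previous step's output.
Step = Callable[[Any], Any]


@dataclass(frozen=True)
class AnalysisAgent:
    """Hypothetical agent: a named, immutable chain of analysis steps."""
    name: str
    steps: tuple[Step, ...]

    def run(self, data: Any) -> Any:
        # Feed the input through each step in order.
        return reduce(lambda value, step: step(value), self.steps, data)

    def extended(self, name: str, *extra: Step) -> "AnalysisAgent":
        # Build a new agent from this one, leaving the original untouched.
        return AnalysisAgent(name, self.steps + extra)


def extract_exit_codes(trajectory: dict) -> list[int]:
    return [step["exit_code"] for step in trajectory["steps"]]


def count_failures(exit_codes: list[int]) -> int:
    return sum(1 for code in exit_codes if code != 0)


def format_report(failures: int) -> str:
    return f"{failures} failing command(s)"


# One researcher shares a base agent; another extends it with a report step.
failure_counter = AnalysisAgent("failure-counter", (extract_exit_codes, count_failures))
failure_report = failure_counter.extended("failure-report", format_report)

trajectory = {"steps": [{"exit_code": 0}, {"exit_code": 1}, {"exit_code": 2}]}
print(failure_counter.run(trajectory))
print(failure_report.run(trajectory))
```

Keeping agents immutable and building variants with `extended` is a design choice made for this sketch: it means a shared agent cannot be changed out from under the people already using it.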

Key Design Principles

From the outset, I had three guiding goals for eval-agents:

Make agents easy to share and use — Leveraging GitHub’s strengths in collaboration and version control.
Make it easy to author new agents — Lower the barrier for researchers to contribute their own analysis logic.
Make coding agents the primary vehicle for contributions — Encourage a culture where improvements are delivered as code, not documentation or manual processes.

These principles align with skills I honed as a maintainer of the GitHub CLI. By designing for reuse and simplicity, eval-agents accelerates the entire team’s work.

Impact and Future Directions

Since deploying eval-agents, the Copilot Applied Science team has seen a significant reduction in the time spent on routine analysis. Researchers can now run multiple benchmark evaluations in parallel, automatically generate reports, and quickly spot performance regressions or improvements. The tool has become an integral part of our development cycle.
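As a rough illustration of what spotting regressions or improvements between runs can look like, the sketch below compares per-task pass/fail results from two benchmark runs. The result format (a mapping from task id to a boolean), the function name `compare_runs`, and the sample data are assumptions made for the example; the article does not describe how eval-agents stores or compares results.

```python
def compare_runs(
    baseline: dict[str, bool], candidate: dict[str, bool]
) -> dict[str, list[str]]:
    """Compare per-task pass/fail results from two benchmark runs."""
    shared = baseline.keys() & candidate.keys()
    return {
        # Tasks that passed before and fail now.
        "regressions": sorted(t for t in shared if baseline[t] and not candidate[t]),
        # Tasks that failed before and pass now.
        "improvements": sorted(t for t in shared if not baseline[t] and candidate[t]),
        # Tasks present in the baseline but absent from the candidate run.
        "missing": sorted(baseline.keys() - candidate.keys()),
    }


# Hypothetical results: task id -> whether the agent solved the task.
baseline = {"task-a": True, "task-b": False, "task-c": True, "task-d": True}
candidate = {"task-a": False, "task-b": True, "task-c": True}

report = compare_runs(baseline, candidate)
print(report)
```

With the sample data, `task-a` is reported as a regression, `task-b` as an improvement, and `task-d` as missing from the candidate run.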

Looking ahead, we plan to expand eval-agents to support more types of analysis, integrate with other evaluation frameworks, and provide even richer visualizations. Our ultimate goal is to create a self-improving system where agents not only analyze but also suggest optimizations to the underlying models and prompts.

This project demonstrates that automation isn’t limited to menial tasks — with the right tools, even intellectual toil can be handed off to machines. By combining the power of GitHub Copilot with thoughtful agent design, we’ve unlocked a new level of productivity for AI research.
