Evals System Architecture
Overview
The evals system is a distributed evaluation platform that runs AI coding tasks in isolated VS Code environments. It solves two critical problems in AI evaluation:
- Dependency Management: Eliminates the complexity of setting up multiple programming language environments by packaging everything into pre-configured containers
- Resource Isolation: Prevents memory exhaustion and state contamination by running each task in a fresh, isolated container instead of sequentially in a single VS Code instance
The architecture consists of three main components: a Next.js web application for management, a controller container that orchestrates evaluation runs, and multiple runner containers that execute individual tasks.
Problems Solved
Simplified Setup and Deployment
Traditional AI evaluation setups require complex dependency management across multiple programming languages, development tools, and VS Code extensions. The evals system eliminates this friction by:
- One-Command Deployment: A single `docker compose up` command starts the entire evaluation infrastructure
- Pre-configured Environments: Runner containers include all necessary language runtimes, tools, and VS Code extensions
- Dependency Isolation: No host system contamination or version conflicts between different language requirements
- Reproducible Environments: Identical evaluation conditions across different machines and deployments
Resource Management and Isolation
Running multiple AI evaluation tasks sequentially in a single VS Code instance creates several problems:
- Memory Accumulation: VS Code instances gradually consume more memory with each task, eventually leading to crashes
- State Contamination: Previous tasks can leave behind files, settings, or processes that affect subsequent evaluations
- Resource Contention: Multiple tasks competing for the same VS Code instance create bottlenecks and inconsistent performance
- Failure Propagation: A single problematic task can crash the entire evaluation session
The containerized approach solves these issues by:
- Fresh Environments: Each task starts with a clean VS Code instance and workspace
- Memory Reset: Container termination automatically reclaims all memory and resources
- Parallel Execution: Multiple tasks can run simultaneously without interference
- Fault Isolation: Individual task failures don't affect other running evaluations
Architecture Components
```mermaid
graph TB
    Web[Admin Web App] <--> Redis[(Redis<br/>Pub/Sub & Registration)]
    Web <--> DB[(PostgreSQL<br/>Runs & Tasks)]
    Web --> Controller[Run Controller / p-queue]
    Controller <--> DB
    Controller --> Runner1[Task Runner 1]
    Controller --> Runner2[...]
    Controller --> RunnerN[Task Runner N]
    Runner1 <--> Redis
    Runner2 <--> Redis
    RunnerN <--> Redis
```
Core Components
Next.js Web Application
The web application serves as the primary interface for creating and monitoring evaluation runs. It provides:
- Run Management: Create evaluation runs with configurable parameters (model, concurrency, exercise selection)
- Real-time Monitoring: Live progress tracking via Server-Sent Events
- Results Dashboard: View task completion status, metrics, and outcomes
- Container Orchestration: Spawns controller containers for new runs
Controller Container
A specialized instance of the `evals-runner` container that acts as the run orchestrator. The controller:
- In-Memory Task Queue: Uses the `p-queue` npm package to manage task distribution with configurable concurrency limits
- Git Workspace Setup: Prepares exercise repositories and manages version control
- Runner Coordination: Spawns and monitors individual task runner containers
- Heartbeat Monitoring: Maintains a Redis heartbeat to track controller health (sketched after this list)
- Result Aggregation: Collects task results and finalizes run metrics
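
The heartbeat pattern might look roughly like the sketch below, which refreshes a short-lived Redis key on an interval. It assumes the `ioredis` client; the key name, TTL, and interval are illustrative, not the project's actual values.

```typescript
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

// Hypothetical key name and timings -- assumptions for illustration only.
export function startHeartbeat(runId: string, intervalMs = 10_000): () => void {
  const key = `evals:controller:${runId}:heartbeat`;

  const timer = setInterval(() => {
    // Refresh a key that expires in 30s; if the controller dies, the key
    // lapses and monitors can flag the run as stale.
    redis.set(key, Date.now().toString(), "EX", 30).catch(console.error);
  }, intervalMs);

  return () => clearInterval(timer); // call during graceful shutdown
}
```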
Runner Containers
Individual containers that execute single evaluation tasks. Each runner:
- Isolated Environment: Fresh VS Code instance with pre-installed language tools and extensions
- Task Execution: Runs AI agent with evaluation prompt in VS Code environment
- IPC Communication: Connects to VS Code via Unix socket for real-time interaction
- Unit Testing: Validates task completion using language-specific test suites
- Metrics Collection: Tracks token usage, costs, tool usage, and execution time
Supporting Infrastructure
- Redis: Provides pub/sub messaging for real-time events and runner registration tracking (not used for task queuing)
- PostgreSQL: Stores run configurations, task definitions, execution metrics, and results
- Docker: Container orchestration for isolation and scalability
Execution Flow
1. Run Initialization
The web application creates an evaluation run with specified parameters (a sketch of the configuration shape follows the list):
- Suite Type: Full evaluation (all exercises) or partial (selected exercises)
- Model Configuration: AI model selection and settings
- Concurrency: Number of parallel task executions (1-25)
- Exercise Selection: Programming language and specific coding challenges
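
A minimal sketch of what such a run configuration might look like in TypeScript; every field name is an assumption derived from the list above, not the actual schema.

```typescript
// Field names are assumptions based on the parameters described above.
interface RunConfig {
  suite: "full" | "partial";          // full = all exercises, partial = a subset
  model: string;                      // AI model id, e.g. selected via OpenRouter
  concurrency: number;                // 1-25 parallel task executions
  exercises?: string[];               // required when suite is "partial"
  settings?: Record<string, unknown>; // optional agent setting overrides
  systemPrompt?: string;              // optional custom system prompt
}
```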
2. Controller Deployment
The web application spawns a controller container that:
- Loads Run Configuration: Retrieves run parameters and associated tasks from database
- Prepares Workspace: Sets up git repository with exercise code and test suites
- Establishes Monitoring: Starts Redis heartbeat and event publishing
- Creates Task Queue: Initializes concurrent task processing with specified limits
3. Task Distribution
The controller distributes tasks across runner containers using an in-memory queue:
- p-queue Management: Uses the `p-queue` npm package to manage task concurrency in memory (see the sketch after this list)
- Container Spawning: Creates isolated runner containers for each task
- Resource Management: Enforces concurrency limits to prevent resource exhaustion
- Task Assignment: Each runner receives a single task with full context
- Progress Tracking: Monitors runner registration and task status via Redis pub/sub
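
The distribution pattern might look roughly like the following sketch, where `p-queue` gates how many runner containers exist at once. The image name, CLI arguments, and resource limits are assumptions; only the use of `p-queue` for in-memory concurrency control is taken from the text.

```typescript
import PQueue from "p-queue";
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const dockerRun = promisify(execFile);

interface Task { id: string; language: string; exercise: string }

export async function distributeTasks(tasks: Task[], concurrency: number) {
  const queue = new PQueue({ concurrency }); // e.g. 1-25

  for (const task of tasks) {
    void queue.add(async () => {
      try {
        // One isolated runner container per task; --rm removes it on exit.
        // Image name, flags, and limits are hypothetical.
        await dockerRun("docker", [
          "run", "--rm",
          "--memory", "4g", "--cpus", "2",
          "evals-runner",
          "task", "--task-id", task.id,
        ]);
      } catch (err) {
        // Fault isolation: one failed task doesn't stop the run.
        console.error(`task ${task.id} failed`, err);
      }
    });
  }

  await queue.onIdle(); // resolves once every task has finished
}
```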
4. Task Execution
Individual runners execute evaluation tasks:
- Environment Setup: Launches VS Code with Roo extension in isolated container
- Prompt Delivery: Sends the evaluation prompt to the AI agent via IPC (sketched after this list)
- Code Generation: AI agent writes code using available tools and context
- Real-time Events: Publishes progress updates, token usage, and completion status
- Validation: Runs language-specific unit tests to verify task completion
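
A hedged sketch of the Unix-socket side of prompt delivery: only the transport is taken from the text above; the socket path, message framing, and `startTask` message type are assumptions.

```typescript
import net from "node:net";

export function sendPrompt(socketPath: string, prompt: string): Promise<void> {
  return new Promise((resolve, reject) => {
    const socket = net.createConnection(socketPath, () => {
      // Newline-delimited JSON is a common framing choice; assumed here.
      socket.write(JSON.stringify({ type: "startTask", prompt }) + "\n");
      socket.end(); // fire-and-forget: progress events stream back via Redis
      resolve();
    });
    socket.on("error", reject);
  });
}
```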
5. Result Collection
The system aggregates and reports results:
- Event Streaming: Real-time progress updates flow from runners through Redis to the web interface (sketched after this list)
- Metrics Aggregation: Controller collects execution metrics, costs, and success rates
- Run Completion: Final results stored in database with comprehensive analytics
- Cleanup: Containers terminated and resources released
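
The Redis-to-browser leg of that event streaming could be bridged roughly as below. The endpoint, channel naming, and payloads are assumptions; only the Redis pub/sub to Server-Sent Events flow comes from the text.

```typescript
import Redis from "ioredis";
import http from "node:http";

http.createServer((req, res) => {
  const runId = new URL(req.url ?? "/", "http://localhost")
    .searchParams.get("run");

  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
  });

  // A subscribed ioredis connection can't issue other commands,
  // so each SSE client gets a dedicated subscriber.
  const sub = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");
  sub.subscribe(`evals:run:${runId}`); // assumed channel name

  sub.on("message", (_channel, message) => {
    res.write(`data: ${message}\n\n`); // forward each run event as an SSE frame
  });

  req.on("close", () => void sub.quit()); // clean up on client disconnect
}).listen(3001);
```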
Technical Implementation
CLI System
The evaluation system is driven by a command-line interface that can operate in two modes:
- Run Mode: Orchestrates complete evaluation runs with multiple tasks
- Task Mode: Executes individual tasks within runner containers
The CLI automatically detects its execution environment and adapts behavior accordingly, using containerized task execution when running within Docker.
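
One common way to implement that detection is checking for Docker's sentinel file; the actual CLI may use a different signal, and the environment-variable fallback here is hypothetical.

```typescript
import { existsSync } from "node:fs";

export function isInsideContainer(): boolean {
  // Docker creates /.dockerenv at the container root; the env var
  // fallback is a hypothetical explicit override.
  return existsSync("/.dockerenv") || process.env.EVALS_CONTAINER === "1";
}

// Mode selection as described above: "run" orchestrates, "task" executes one task.
const mode = process.argv[2] === "task" ? "task" : "run";
console.log(`mode=${mode}, containerized=${isInsideContainer()}`);
```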
Container Architecture
Both controller and runner containers use the same base image but serve different purposes:
Runner Container Features
- Multi-language Support: Pre-installed runtimes for Go, Java, JavaScript, Python, and Rust
- Development Tools: VS Code with language-specific extensions and Roo Code extension
- Containerization: Docker-in-Docker capability for nested container execution
- Exercise Repository: Git clone of evaluation exercises with test suites
Container Isolation
Each task executes in complete isolation with:
- Fresh VS Code Instance: Clean environment with no shared state
- Dedicated Workspace: Task-specific directory with relevant exercise files
- Resource Limits: Controlled CPU and memory allocation
- Network Isolation: Containers communicate only through Redis pub/sub
Communication Architecture
The system uses multiple communication channels:
IPC (Inter-Process Communication)
- Unix Sockets: Direct communication between CLI and VS Code extension
- Event Streaming: Real-time task progress and AI agent interactions
- Command Interface: Task lifecycle management (start, cancel, close)
Redis Pub/Sub
- Event Broadcasting: Task events published to run-specific channels (publishing sketched after this list)
- Runner Registration: Active runner tracking per evaluation run
- Heartbeat Monitoring: Controller health and availability status
- Not Used for Queuing: Task queue management is handled in-memory by the controller using `p-queue`
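
Publishing a task event to a run-specific channel might look like this sketch; the channel name and payload shape are assumed.

```typescript
import Redis from "ioredis";

const pub = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

export async function publishTaskEvent(
  runId: string,
  event: { type: string; taskId: string; [key: string]: unknown },
): Promise<void> {
  // Run-specific channel so subscribers only see their own run's events.
  await pub.publish(`evals:run:${runId}`, JSON.stringify(event));
}
```

A runner might call, for example, `publishTaskEvent(runId, { type: "taskCompleted", taskId, passed: true })` after validation; the event type and fields are illustrative.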
HTTP/SSE
- Web Interface: REST API for run management and configuration
- Real-time Updates: Server-Sent Events for live progress monitoring (a client-side sketch follows this list)
- Result Retrieval: Task metrics and completion status
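
On the browser side, consuming that stream reduces to a few lines; the endpoint path and run id are assumptions.

```typescript
const runId = "run-42"; // hypothetical run id
const source = new EventSource(`/api/runs/${runId}/events`); // assumed endpoint

source.onmessage = (e: MessageEvent) => {
  const event = JSON.parse(e.data);
  // e.g. update a progress bar or task table from the payload
  console.log(event.type, event.taskId);
};
```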
Task Lifecycle Management
Each evaluation task follows a structured lifecycle (expressed as a state type after the list):
1. Initialization: Container startup and VS Code launch
2. Connection: IPC socket establishment and extension activation
3. Prompt Delivery: Evaluation challenge sent to AI agent
4. Execution: AI agent writes code using available tools
5. Validation: Unit test execution to verify correctness
6. Cleanup: Container termination and resource cleanup
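
```typescript
// Illustrative phase names; not the codebase's actual identifiers.
type TaskPhase =
  | "initializing" // container startup, VS Code launch
  | "connecting"   // IPC socket establishment, extension activation
  | "prompting"    // evaluation challenge delivered to the agent
  | "executing"    // agent writes code with available tools
  | "validating"   // unit tests verify correctness
  | "cleanup";     // container termination, resource release
```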
Error Handling and Timeouts
The system implements comprehensive error handling:
- Task Timeouts: 30-minute maximum execution time per task (a timeout wrapper is sketched after this list)
- Process Cleanup: Automatic termination of hung processes
- Container Recovery: Failed containers are cleaned up and resources released
- Graceful Degradation: Individual task failures don't affect other tasks in the run
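
A generic wrapper illustrating the 30-minute cap; the real system may enforce timeouts differently (for example at the container level), so treat this as a sketch.

```typescript
const TASK_TIMEOUT_MS = 30 * 60 * 1000; // 30-minute cap from the list above

export async function withTimeout<T>(
  work: Promise<T>,
  ms: number = TASK_TIMEOUT_MS,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`task timed out after ${ms} ms`)),
      ms,
    );
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer); // don't keep the event loop alive after completion
  }
}
```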
Metrics and Monitoring
Comprehensive tracking of evaluation performance (a sample record shape follows the list):
- Token Usage: Input/output tokens and context size tracking
- Cost Analysis: API costs per task and aggregated run costs
- Tool Usage: Frequency and success rates of different AI tools
- Execution Time: Task duration and queue wait times
- Success Rates: Pass/fail statistics across languages and exercises
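
A plausible per-task record matching the list above; every field name is an assumption, not the actual schema.

```typescript
// Field names are assumptions matching the metrics listed above.
interface TaskMetrics {
  tokensIn: number;                  // input tokens consumed
  tokensOut: number;                 // output tokens generated
  contextTokens: number;             // context window size used
  costUsd: number;                   // API cost for this task
  toolUsage: Record<string, number>; // tool name -> invocation count
  queueWaitMs: number;               // time spent waiting in the queue
  durationMs: number;                // execution time
  passed: boolean;                   // unit tests passed
}
```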
Configuration and Customization
Run Configuration
Evaluation runs support extensive customization:
- Model Selection: Choose from available AI models via OpenRouter integration
- Concurrency Control: 1-25 parallel task executions based on resource availability
- Exercise Selection: Full suite (all exercises) or partial (selected exercises)
- Custom Settings: Override default AI agent configuration and behavior
- System Prompts: Optional custom prompts for specialized evaluation scenarios
Exercise Management
The system uses a separate Git repository containing:
- Language-specific Exercises: Coding challenges organized by programming language
- Test Suites: Automated validation for each exercise
- Prompt Templates: Standardized evaluation instructions per language
- Workspace Configuration: Language-specific development environment setup
Scalability Considerations
The architecture supports horizontal scaling:
- Container Orchestration: Multiple controller instances can run simultaneously
- Resource Management: Configurable concurrency prevents resource exhaustion
- Database Optimization: Efficient task querying and result storage
- Redis Clustering: Pub/sub system can scale with message volume
Operational Characteristics
Performance
- Task Isolation: Complete environment isolation prevents interference between tasks
- Parallel Execution: Configurable concurrency maximizes resource utilization
- Efficient Communication: Unix sockets and Redis provide low-latency messaging
- Resource Cleanup: Automatic container termination prevents resource leaks
Reliability
- Fault Tolerance: Individual task failures don't impact other tasks
- Timeout Management: Prevents hung tasks from consuming resources indefinitely
- Health Monitoring: Controller heartbeat and runner registration tracking
- Graceful Shutdown: Proper cleanup of containers and database connections
Observability
- Real-time Monitoring: Live progress tracking through web interface
- Comprehensive Logging: Detailed execution logs for debugging and analysis
- Metrics Collection: Performance and cost analytics for optimization
- Event Auditing: Complete task lifecycle tracking for accountability
This architecture provides a robust, scalable platform for evaluating AI coding capabilities across multiple programming languages while maintaining strict isolation and comprehensive monitoring.