# Evals System Architecture
## Overview
The evals system is a distributed evaluation platform that runs AI coding tasks in isolated VS Code environments. It solves two critical problems in AI evaluation:
1. **Dependency Management**: Eliminates the complexity of setting up multiple programming language environments by packaging everything into pre-configured containers
2. **Resource Isolation**: Prevents memory exhaustion and state contamination by running each task in a fresh, isolated container instead of sequentially in a single VS Code instance
The architecture consists of three main components: a Next.js web application for management, a controller container that orchestrates evaluation runs, and multiple runner containers that execute individual tasks.
## Problems Solved
### Simplified Setup and Deployment
Traditional AI evaluation setups require complex dependency management across multiple programming languages, development tools, and VS Code extensions. The evals system eliminates this friction by:
- **One-Command Deployment**: Single `docker compose up` command starts the entire evaluation infrastructure
- **Pre-configured Environments**: Runner containers include all necessary language runtimes, tools, and VS Code extensions
- **Dependency Isolation**: No host system contamination or version conflicts between different language requirements
- **Reproducible Environments**: Identical evaluation conditions across different machines and deployments
### Resource Management and Isolation
Running multiple AI evaluation tasks sequentially in a single VS Code instance creates several problems:
- **Memory Accumulation**: VS Code instances gradually consume more memory with each task, eventually leading to crashes
- **State Contamination**: Previous tasks can leave behind files, settings, or processes that affect subsequent evaluations
- **Resource Contention**: Multiple tasks competing for the same VS Code instance create bottlenecks and inconsistent performance
- **Failure Propagation**: A single problematic task can crash the entire evaluation session
The containerized approach solves these issues by:
- **Fresh Environments**: Each task starts with a clean VS Code instance and workspace
- **Memory Reset**: Container termination automatically reclaims all memory and resources
- **Parallel Execution**: Multiple tasks can run simultaneously without interference
- **Fault Isolation**: Individual task failures don't affect other running evaluations
## Architecture Components
```mermaid
graph TB
    Web[Admin Web App] <--> Redis[("Redis<br/>Pub/Sub & Registration")]
    Web <--> DB[("PostgreSQL<br/>Runs & Tasks")]
    Web --> Controller[Run Controller / PQueue]
    Controller <--> DB
    Controller --> Runner1[Task Runner 1]
    Controller --> Runner2["..."]
    Controller --> RunnerN[Task Runner N]
    Runner1 <--> Redis
    Runner2 <--> Redis
    RunnerN <--> Redis
```
### Core Components
#### Next.js Web Application
The web application serves as the primary interface for creating and monitoring evaluation runs. It provides:
- **Run Management**: Create evaluation runs with configurable parameters (model, concurrency, exercise selection)
- **Real-time Monitoring**: Live progress tracking via Server-Sent Events
- **Results Dashboard**: View task completion status, metrics, and outcomes
- **Container Orchestration**: Spawns controller containers for new runs
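The exact invocation isn't documented here, but a minimal sketch of how the web application might spawn a controller container could look like the following. The `evals-runner` image name comes from the component description below; the container name, `RUN_ID` variable, and flags are illustrative assumptions:

```typescript
import { execFile } from "node:child_process"
import { promisify } from "node:util"

const execFileAsync = promisify(execFile)

// Hypothetical helper: start a detached controller container for a new run.
// Image name, environment variables, and flags are assumptions, not the
// project's actual invocation.
async function spawnController(runId: number): Promise<string> {
  const { stdout } = await execFileAsync("docker", [
    "run",
    "--detach",                             // run in the background
    "--rm",                                 // remove the container on exit
    "--name", `evals-controller-${runId}`,
    "--env", `RUN_ID=${runId}`,             // tell the CLI which run to orchestrate
    "evals-runner",                         // same base image as the task runners
  ])
  return stdout.trim()                      // container ID for later cleanup
}
```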
#### Controller Container
A specialized instance of the `evals-runner` container that acts as the run orchestrator. The controller:
- **In-Memory Task Queue**: Uses the `p-queue` npm package to manage task distribution with configurable concurrency limits
- **Git Workspace Setup**: Prepares exercise repositories and manages version control
- **Runner Coordination**: Spawns and monitors individual task runner containers
- **Heartbeat Monitoring**: Maintains a Redis heartbeat to track controller health (sketched after this list)
- **Result Aggregation**: Collects task results and finalizes run metrics
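The heartbeat is typically a short-lived key refreshed on an interval, so a crashed controller is detected when the key expires. A minimal sketch using `ioredis`; the key format and intervals are assumptions:

```typescript
import Redis from "ioredis"

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379")

// Hypothetical heartbeat: refresh a short-lived key every few seconds.
// If the controller dies, the key expires and monitors can notice.
function startHeartbeat(runId: number): NodeJS.Timeout {
  const key = `evals:controller:${runId}:heartbeat`   // assumed key format
  return setInterval(() => {
    // SET with a 15-second expiry; a missed refresh lets the key lapse.
    void redis.set(key, Date.now().toString(), "EX", 15)
  }, 5_000)
}
```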
#### Runner Containers
Individual containers that execute single evaluation tasks. Each runner:
- **Isolated Environment**: Fresh VS Code instance with pre-installed language tools and extensions
- **Task Execution**: Runs the AI agent with the evaluation prompt in the VS Code environment
- **IPC Communication**: Connects to VS Code via a Unix socket for real-time interaction
- **Unit Testing**: Validates task completion using language-specific test suites
- **Metrics Collection**: Tracks token usage, costs, tool usage, and execution time
#### Supporting Infrastructure
- **Redis**: Provides pub/sub messaging for real-time events and runner registration tracking (not used for task queuing)
- **PostgreSQL**: Stores run configurations, task definitions, execution metrics, and results
- **Docker**: Container orchestration for isolation and scalability
## Execution Flow
### 1. Run Initialization
The web application creates an evaluation run with specified parameters:
- **Suite Type**: Full evaluation (all exercises) or partial (selected exercises)
- **Model Configuration**: AI model selection and settings
- **Concurrency**: Number of parallel task executions (1-25)
- **Exercise Selection**: Programming language and specific coding challenges
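The concrete schema lives in PostgreSQL, but the parameters above can be pictured roughly as the following shape (field names and types are illustrative assumptions):

```typescript
// Illustrative run configuration; actual field names and types may differ.
interface RunConfig {
  suite: "full" | "partial"   // all exercises, or a selected subset
  model: string               // AI model identifier (selected via OpenRouter)
  concurrency: number         // 1-25 parallel task executions
  exercises?: string[]        // specific exercises when suite is "partial"
  systemPrompt?: string       // optional custom prompt for specialized scenarios
}
```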
### 2. Controller Deployment
The web application spawns a controller container that:
- **Loads Run Configuration**: Retrieves run parameters and associated tasks from database
- **Prepares Workspace**: Sets up git repository with exercise code and test suites
- **Establishes Monitoring**: Starts Redis heartbeat and event publishing
- **Creates Task Queue**: Initializes concurrent task processing with specified limits
### 3. Task Distribution
The controller distributes tasks across runner containers using an in-memory queue:
- **p-queue Management**: Uses the `p-queue` npm package to manage task concurrency in memory (see the sketch after this list)
- **Container Spawning**: Creates isolated runner containers for each task
- **Resource Management**: Enforces concurrency limits to prevent resource exhaustion
- **Task Assignment**: Each runner receives a single task with full context
- **Progress Tracking**: Monitors runner registration and task status via Redis pub/sub
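Since `p-queue` is named as the queuing mechanism, the distribution loop plausibly resembles the sketch below; the `Task` shape and `runTaskInContainer` helper are assumptions introduced for illustration:

```typescript
import PQueue from "p-queue"

interface Task {
  id: number
  language: string
  exercise: string
}

// Hypothetical helper that spawns an isolated runner container and resolves
// once that container has finished (whether the task passed or failed).
declare function runTaskInContainer(task: Task): Promise<void>

async function distributeTasks(tasks: Task[], concurrency: number): Promise<void> {
  // The queue lives entirely in the controller's memory; Redis is not involved.
  const queue = new PQueue({ concurrency })

  for (const task of tasks) {
    void queue.add(async () => {
      try {
        await runTaskInContainer(task)
      } catch (error) {
        // A single failed task must not take down the rest of the run.
        console.error(`Task ${task.id} failed:`, error)
      }
    })
  }

  await queue.onIdle()   // resolves once every queued task has been processed
}
```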
### 4. Task Execution
Individual runners execute evaluation tasks:
- **Environment Setup**: Launches VS Code with Roo extension in isolated container
- **Prompt Delivery**: Sends evaluation prompt to AI agent via IPC
- **Code Generation**: AI agent writes code using available tools and context
- **Real-time Events**: Publishes progress updates, token usage, and completion status
- **Validation**: Runs language-specific unit tests to verify task completion
### 5. Result Collection
The system aggregates and reports results:
- **Event Streaming**: Real-time progress updates flow from runners through Redis to web interface
- **Metrics Aggregation**: Controller collects execution metrics, costs, and success rates
- **Run Completion**: Final results stored in database with comprehensive analytics
- **Cleanup**: Containers terminated and resources released
## Technical Implementation
### CLI System
The evaluation system is driven by a command-line interface that can operate in two modes:
- **Run Mode**: Orchestrates complete evaluation runs with multiple tasks
- **Task Mode**: Executes individual tasks within runner containers
The CLI automatically detects its execution environment and adapts behavior accordingly, using containerized task execution when running within Docker.
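The detection heuristic isn't spelled out here; one common approach, shown purely as an assumption, is to look for Docker-specific markers on the filesystem or an environment variable baked into the image:

```typescript
import { existsSync } from "node:fs"

// Assumed heuristic: Docker creates /.dockerenv inside containers, and the
// runner image could additionally set an explicit flag. Neither detail is
// confirmed by this document.
function isRunningInDocker(): boolean {
  return existsSync("/.dockerenv") || process.env.EVALS_CONTAINER === "1"
}

// Mode selection between orchestrating a run and executing a single task;
// the flag name is illustrative.
type CliMode = "run" | "task"

function selectMode(argv: string[]): CliMode {
  return argv.includes("--task-id") ? "task" : "run"
}
```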
### Container Architecture
Both controller and runner containers use the same base image but serve different purposes:
#### Runner Container Features
- **Multi-language Support**: Pre-installed runtimes for Go, Java, JavaScript, Python, and Rust
- **Development Tools**: VS Code with language-specific extensions and Roo Code extension
- **Containerization**: Docker-in-Docker capability for nested container execution
- **Exercise Repository**: Git clone of evaluation exercises with test suites
#### Container Isolation
Each task executes in complete isolation with:
- **Fresh VS Code Instance**: Clean environment with no shared state
- **Dedicated Workspace**: Task-specific directory with relevant exercise files
- **Resource Limits**: Controlled CPU and memory allocation
- **Network Isolation**: Containers communicate only through Redis pub/sub
### Communication Architecture
The system uses multiple communication channels:
#### IPC (Inter-Process Communication)
- **Unix Sockets**: Direct communication between CLI and VS Code extension
- **Event Streaming**: Real-time task progress and AI agent interactions
- **Command Interface**: Task lifecycle management (start, cancel, close)
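Node's `net` module is the usual way to speak over a Unix socket. Below is a minimal sketch of the CLI side; the socket path and newline-delimited JSON framing are assumptions, since the real protocol between the CLI and the VS Code extension isn't specified here:

```typescript
import { createConnection } from "node:net"
import { createInterface } from "node:readline"

// Assumed socket path; the extension would create this when it activates.
const socket = createConnection("/tmp/evals-task.sock", () => {
  // Illustrative command message asking the extension to start the task.
  socket.write(JSON.stringify({ type: "startTask", prompt: "Solve the exercise in this workspace" }) + "\n")
})

// readline buffers partial chunks, yielding one complete line per message.
const lines = createInterface({ input: socket })
lines.on("line", (line) => {
  const event = JSON.parse(line)   // e.g. progress, token usage, completion
  console.log("event from extension:", event.type)
})
```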
#### Redis Pub/Sub
- **Event Broadcasting**: Task events published to run-specific channels
- **Runner Registration**: Active runner tracking per evaluation run
- **Heartbeat Monitoring**: Controller health and availability status
- **Not Used for Queuing**: Task queue management is handled in-memory by the controller using `p-queue`
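Broadcasting over run-specific channels could look like the following `ioredis` sketch (channel naming is an assumption; note the two connections, because a Redis connection in subscriber mode can't also publish):

```typescript
import Redis from "ioredis"

const pub = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379")
const sub = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379")

// Runner side: publish a task event to the run's channel (name is assumed).
async function publishTaskEvent(runId: number, event: object): Promise<void> {
  await pub.publish(`evals:run:${runId}:events`, JSON.stringify(event))
}

// Controller/web side: forward every event on that channel to a callback.
async function subscribeToRun(runId: number, onEvent: (event: unknown) => void): Promise<void> {
  await sub.subscribe(`evals:run:${runId}:events`)
  sub.on("message", (_channel, message) => onEvent(JSON.parse(message)))
}
```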
#### HTTP/SSE
- **Web Interface**: REST API for run management and configuration
- **Real-time Updates**: Server-Sent Events for live progress monitoring
- **Result Retrieval**: Task metrics and completion status
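In a Next.js App Router project, relaying those Redis events to the browser as Server-Sent Events could be sketched like this; the route path, channel name, and handler signature are assumptions:

```typescript
// app/api/runs/[id]/stream/route.ts — illustrative location, not the actual file.
import Redis from "ioredis"

export async function GET(_request: Request, { params }: { params: { id: string } }) {
  const sub = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379")
  await sub.subscribe(`evals:run:${params.id}:events`)   // assumed channel name

  const encoder = new TextEncoder()
  const stream = new ReadableStream<Uint8Array>({
    start(controller) {
      // Re-emit each Redis message as an SSE frame.
      sub.on("message", (_channel, message) => {
        controller.enqueue(encoder.encode(`data: ${message}\n\n`))
      })
    },
    cancel() {
      void sub.quit()   // client disconnected; release the Redis connection
    },
  })

  return new Response(stream, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
      Connection: "keep-alive",
    },
  })
}
```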
### Task Lifecycle Management
Each evaluation task follows a structured lifecycle:
1. **Initialization**: Container startup and VS Code launch
2. **Connection**: IPC socket establishment and extension activation
3. **Prompt Delivery**: Evaluation challenge sent to AI agent
4. **Execution**: AI agent writes code using available tools
5. **Validation**: Unit test execution to verify correctness
6. **Cleanup**: Container termination and resource cleanup
### Error Handling and Timeouts
The system implements comprehensive error handling:
- **Task Timeouts**: 30-minute maximum execution time per task (enforcement sketched after this list)
- **Process Cleanup**: Automatic termination of hung processes
- **Container Recovery**: Failed containers are cleaned up and resources released
- **Graceful Degradation**: Individual task failures don't affect other tasks in the run
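A ceiling like the 30-minute limit is typically enforced by racing the task against a timer and tearing the container down if the timer wins. A minimal sketch; the `removeContainer` helper is an assumption:

```typescript
const TASK_TIMEOUT_MS = 30 * 60 * 1000   // 30-minute maximum per task

// Hypothetical cleanup helper that force-removes the task's container.
declare function removeContainer(containerId: string): Promise<void>

async function runWithTimeout<T>(work: Promise<T>, containerId: string): Promise<T> {
  let timer: NodeJS.Timeout | undefined
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("Task timed out")), TASK_TIMEOUT_MS)
  })
  try {
    // Whichever settles first wins: the task itself or the timeout.
    return await Promise.race([work, timeout])
  } catch (error) {
    await removeContainer(containerId)   // reclaim resources from hung or failed tasks
    throw error
  } finally {
    clearTimeout(timer)
  }
}
```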
### Metrics and Monitoring
Comprehensive tracking of evaluation performance:
- **Token Usage**: Input/output tokens and context size tracking
- **Cost Analysis**: API costs per task and aggregated run costs
- **Tool Usage**: Frequency and success rates of different AI tools
- **Execution Time**: Task duration and queue wait times
- **Success Rates**: Pass/fail statistics across languages and exercises
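Taken together, the per-task record could be modeled along these lines (field names are illustrative, not the actual PostgreSQL schema):

```typescript
// Illustrative per-task metrics record; real column names and types may differ.
interface TaskMetrics {
  tokensIn: number                    // prompt tokens consumed
  tokensOut: number                   // completion tokens generated
  contextTokens: number               // final context window size
  cost: number                        // API cost in USD
  durationMs: number                  // wall-clock execution time
  queueWaitMs: number                 // time spent waiting in the controller's queue
  toolUsage: Record<string, number>   // invocation count per tool name
  passed: boolean                     // did the exercise's unit tests pass?
}
```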
## Configuration and Customization
### Run Configuration
Evaluation runs support extensive customization:
- **Model Selection**: Choose from available AI models via OpenRouter integration
- **Concurrency Control**: 1-25 parallel task executions based on resource availability
- **Exercise Selection**: Full suite (all exercises) or partial (selected exercises)
- **Custom Settings**: Override default AI agent configuration and behavior
- **System Prompts**: Optional custom prompts for specialized evaluation scenarios
### Exercise Management
The system uses a separate Git repository containing:
- **Language-specific Exercises**: Coding challenges organized by programming language
- **Test Suites**: Automated validation for each exercise
- **Prompt Templates**: Standardized evaluation instructions per language
- **Workspace Configuration**: Language-specific development environment setup
### Scalability Considerations
The architecture supports horizontal scaling:
- **Container Orchestration**: Multiple controller instances can run simultaneously
- **Resource Management**: Configurable concurrency prevents resource exhaustion
- **Database Optimization**: Efficient task querying and result storage
- **Redis Clustering**: Pub/sub system can scale with message volume
## Operational Characteristics
### Performance
- **Task Isolation**: Complete environment isolation prevents interference between tasks
- **Parallel Execution**: Configurable concurrency maximizes resource utilization
- **Efficient Communication**: Unix sockets and Redis provide low-latency messaging
- **Resource Cleanup**: Automatic container termination prevents resource leaks
### Reliability
- **Fault Tolerance**: Individual task failures don't impact other tasks
- **Timeout Management**: Prevents hung tasks from consuming resources indefinitely
- **Health Monitoring**: Controller heartbeat and runner registration tracking
- **Graceful Shutdown**: Proper cleanup of containers and database connections
### Observability
- **Real-time Monitoring**: Live progress tracking through web interface
- **Comprehensive Logging**: Detailed execution logs for debugging and analysis
- **Metrics Collection**: Performance and cost analytics for optimization
- **Event Auditing**: Complete task lifecycle tracking for accountability
This architecture provides a robust, scalable platform for evaluating AI coding capabilities across multiple programming languages while maintaining strict isolation and comprehensive monitoring.