
update e2e evals to use cline cli (#6977)

* remove unimplemented tests and create foundation for running cline in cli for exercism

* running version for python language

* remove unused code and reorder benchmark adapter

* remove optional helper functions from BenchmarkAdapter

* unskipping tests for java and javascript

* updating db schema

* updating output to match schema

* functional tests for all languages

* clean up unused commit and stored result

* nits

* small changes to wording

* adding to the test outputs

* using stdin for cline task send

* adding results dir to gitignore

* updating readme

* small nits for readme
Toshii, 2 months ago, commit 978a8a0aa6

+ 3 - 3
evals/.gitignore

@@ -1,6 +1,6 @@
 repositories
-
-results/evals.db
+temp-files
+results
 
 diff-edits/cases/
 diff-edits/results/
@@ -21,4 +21,4 @@ diff_editing/test_outputs/
 # Python bytecode cache
 *__pycache__/
 
-diff-edits/cases.zip
+diff-edits/cases.zip

+ 32 - 70
evals/README.md

@@ -15,48 +15,32 @@ The Cline Evaluation System allows you to:
 
 The evaluation system consists of two main components:
 
-1. **Test Server**: Enhanced HTTP server in `src/services/test/TestServer.ts` that provides detailed task results
-2. **CLI Tool**: Command-line interface in `evals/cli/` for orchestrating evaluations
-3. **Diff Edit Benchmark**: Separate command using the CLI tool that runs a comprehensive diff editing benchmark suite on real world cases, along with a streamlit dashboard displaying the results. For more details, see the [Diff Edit Benchmark README](./diff-edits/README.md). Make sure you add a `evals/diff-edits/cases` folder with all the conversation jsons. 
+1. **CLI Tool**: Command-line interface in `evals/cli/` for orchestrating evaluations
+2. **Diff Edit Benchmark**: Separate command using the CLI tool that runs a comprehensive diff editing benchmark suite on real world cases, along with a streamlit dashboard displaying the results. For more details, see the Diff Edit Benchmark [README](./diff-edits/README.md). Make sure you add a `evals/diff-edits/cases` folder with all the conversation jsons.
 
 ## Directory Structure
 
 ```
-cline-repo/
-├── src/
-│   ├── services/
-│   │   ├── test/
-│   │   │   ├── TestServer.ts         # Enhanced HTTP server for task execution
-│   │   │   ├── GitHelper.ts          # Git utilities for file tracking
-│   │   │   └── ...
-│   │   └── ...
-│   └── ...
-├── evals/                            # Main directory for evaluation system
-│   ├── cli/                          # CLI tool for orchestrating evaluations
-│   │   ├── src/
-│   │   │   ├── index.ts              # CLI entry point
-│   │   │   ├── commands/             # CLI commands (setup, run, report)
-│   │   │   ├── adapters/             # Benchmark adapters
-│   │   │   ├── db/                   # Database management
-│   │   │   └── utils/                # Utility functions
-│   │   ├── package.json
-│   │   └── tsconfig.json
-│   ├── diff-edits/                   # Diff editing evaluation suite
-│   │   ├── cases/                    # Test case JSON files
-│   │   ├── results/                  # Evaluation results
-│   │   ├── diff-apply/               # Diff application logic
-│   │   ├── parsing/                  # Assistant message parsing
-│   │   └── prompts/                  # System prompts
-│   ├── repositories/                 # Cloned benchmark repositories
-│   │   ├── exercism/                 # Modified Exercism (from pashpashpash/evals)
-│   │   ├── swe-bench/                # SWE-Bench repository
-│   │   ├── swelancer/                # SWELancer repository
-│   │   └── multi-swe/                # Multi-SWE-Bench repository
-│   ├── results/                      # Evaluation results storage
-│   │   ├── runs/                     # Individual run results
-│   │   └── reports/                  # Generated reports
-│   └── README.md                     # This file
-└── ...
+evals/                            # Main directory for evaluation system
+├── cli/                          # CLI tool for orchestrating evaluations
+│   └── src/
+│       ├── index.ts              # CLI entry point
+│       ├── commands/             # CLI commands (setup, run, report)
+│       ├── adapters/             # Benchmark adapters
+│       ├── db/                   # Database management
+│       └── utils/                # Utility functions
+├── diff-edits/                   # Diff editing evaluation suite
+│   ├── cases/                    # Test case JSON files
+│   ├── results/                  # Evaluation results
+│   ├── diff-apply/               # Diff application logic
+│   ├── parsing/                  # Assistant message parsing
+│   └── prompts/                  # System prompts
+├── repositories/                 # Cloned benchmark repositories
+│   └── exercism/                 # Exercism (Aider Polyglot)
+├── results/                      # Evaluation results storage
+│   ├── runs/                     # Individual run results
+│   └── reports/                  # Generated reports
+└── README.md                     # This file
 ```
 
 ## Getting Started
@@ -67,25 +51,14 @@ cline-repo/
 - VSCode with Cline extension installed
 - Git
 
-### Activation Mechanism
-
-The evaluation system uses an `evals.env` file approach to activate test mode in the Cline extension. When an evaluation is run:
-
-1. The CLI creates an `evals.env` file in the workspace directory
-2. The Cline extension activates due to the `workspaceContains:evals.env` activation event
-3. The extension detects this file and automatically enters test mode
-4. After evaluation completes, the file is automatically removed
-
-This approach eliminates the need for environment variables during the build process and allows for targeted activation only when needed for evaluations. The extension remains dormant during normal use, only activating when an evals.env file is present. For more details, see [Evals Env Activation](./docs/evals-env-activation.md).
-
 ### Installation
 
 1. Build the CLI tool:
 
 ```bash
-cd evals/cli
+cd evals
 npm install
-npm run build
+npm run build:cli
 ```
 
 ### Usage
@@ -106,13 +79,14 @@ node dist/index.js setup --benchmarks exercism
 #### Running Evaluations
 
 ```bash
-node dist/index.js run --model claude-3-opus-20240229 --benchmark exercism
+node dist/index.js run --benchmark exercism --count 10
 ```
 
 Options:
-- `--model`: The model to evaluate (default: claude-3-opus-20240229)
-- `--benchmark`: Specific benchmark to run (default: all)
-- `--count`: Number of tasks to run (default: all)
+- `--benchmark`: Specific benchmark to run (default: exercism)
+- `--count`: Number of tasks to run (default: all available tasks)
+
+**Note:** Model selection is currently configured through the Cline CLI itself, not through evaluation flags.
 
 #### Generating Reports
 
@@ -124,24 +98,11 @@ Options:
 - `--format`: Report format (json, markdown) (default: markdown)
 - `--output`: Output path for the report
 
-#### Managing Test Mode Activation
-
-The CLI provides a command to manually manage the evals.env file for test mode activation:
-
-```bash
-node dist/index.js evals-env create  # Create evals.env file in current directory
-node dist/index.js evals-env remove  # Remove evals.env file from current directory
-node dist/index.js evals-env check   # Check if evals.env file exists in current directory
-```
-
-Options:
-- `--directory`: Specify a directory other than the current one
-
 ## Benchmarks
 
 ### Exercism
 
-Modified Exercism exercises from the [pashpashpash/evals](https://github.com/pashpashpash/evals) repository. These are small, focused programming exercises in various languages.
+Modified Exercism exercises from the [polyglot-benchmark](https://github.com/Aider-AI/polyglot-benchmark) repository. These are small, focused programming exercises in various languages.
 
 ### SWE-Bench (Coming Soon)
 
@@ -350,7 +311,8 @@ The evaluation system collects the following metrics:
 - **Duration**: Time taken to complete tasks
 - **Tool Usage**: Number of tool calls and failures
 - **Success Rate**: Percentage of tasks completed successfully
-- **Functional Correctness**: Percentage of tests passed
+- **Test Success Rate**: Percentage of tests passed
+- **Functional Correctness**: Ratio of tests passed to total tests
 
 ## Reports
 

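The metrics section above distinguishes a boolean success rate from a per-test ratio. A minimal sketch of how those numbers could be derived from raw pass/fail counts (illustrative names only, not the adapter's actual API):

```typescript
// Illustrative sketch: derive the metrics listed in the README from raw counts.
interface VerificationMetrics {
	testsPassed: number
	testsFailed: number
	testsTotal: number
	functionalCorrectness: number // ratio of tests passed to total tests
}

function computeMetrics(testsPassed: number, testsFailed: number): VerificationMetrics {
	const testsTotal = testsPassed + testsFailed
	return {
		testsPassed,
		testsFailed,
		testsTotal,
		// Guard against division by zero when no tests were detected in the output
		functionalCorrectness: testsTotal > 0 ? testsPassed / testsTotal : 0,
	}
}
```

For example, 8 passed and 2 failed yields a functional correctness of 0.8 while the task-level success flag would still be false.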
+ 465 - 37
evals/cli/src/adapters/exercism.ts

@@ -1,6 +1,7 @@
 import * as path from "path"
 import * as fs from "fs"
 import execa from "execa"
+import chalk from "chalk"
 import { BenchmarkAdapter, Task, VerificationResult } from "./types"
 
 const EVALS_DIR = path.resolve(__dirname, "../../../")
@@ -20,8 +21,12 @@ export class ExercismAdapter implements BenchmarkAdapter {
 
 		if (!fs.existsSync(exercismDir)) {
 			console.log(`Cloning Exercism repository to ${exercismDir}...`)
-			await execa("git", ["clone", "https://github.com/pashpashpash/evals.git", exercismDir])
+			await execa("git", ["clone", "https://github.com/Aider-AI/polyglot-benchmark.git", exercismDir])
 			console.log("Exercism repository cloned successfully")
+			
+			// Unskip all JavaScript and Java tests after cloning
+			this.unskipAllJavaScriptTests(exercismDir)
+			this.unskipAllJavaTests(exercismDir)
 		} else {
 			console.log(`Exercism repository already exists at ${exercismDir}`)
 
@@ -29,6 +34,10 @@ export class ExercismAdapter implements BenchmarkAdapter {
 			console.log("Pulling latest changes...")
 			await execa("git", ["pull"], { cwd: exercismDir })
 			console.log("Repository updated successfully")
+			
+			// Unskip tests again after pulling
+			this.unskipAllJavaScriptTests(exercismDir)
+			this.unskipAllJavaTests(exercismDir)
 		}
 	}
 
@@ -51,7 +60,7 @@ export class ExercismAdapter implements BenchmarkAdapter {
 			.filter((dir) => !dir.startsWith(".") && !["node_modules", ".git"].includes(dir))
 
 		for (const language of languages) {
-			const languageDir = path.join(exercisesDir, language)
+			const languageDir = path.join(exercisesDir, language, "exercises", "practice")
 
 			// Read exercise directories
 			const exercises = fs.readdirSync(languageDir).filter((dir) => fs.statSync(path.join(languageDir, dir)).isDirectory())
@@ -61,7 +70,7 @@ export class ExercismAdapter implements BenchmarkAdapter {
 
 				// Read instructions
 				let description = ""
-				const instructionsPath = path.join(exerciseDir, "docs", "instructions.md")
+				const instructionsPath = path.join(exerciseDir, ".docs", "instructions.md")
 				if (fs.existsSync(instructionsPath)) {
 					description = fs.readFileSync(instructionsPath, "utf-8")
 				}
@@ -69,20 +78,23 @@ export class ExercismAdapter implements BenchmarkAdapter {
 				// Determine test commands based on language
 				let testCommands: string[] = []
 				switch (language) {
+					case "cpp":
+						testCommands = ["cmake -DEXERCISM_RUN_ALL_TESTS=1 .", "make"]
+						break
 					case "javascript":
-						testCommands = ["npm install", "npm test"]
+						testCommands = ["npm install", "npm test -- --testNamePattern=."]
 						break
 					case "python":
-						testCommands = ["python -m pytest -o markers=task *_test.py"]
+						testCommands = ["python3 -m pytest -o markers=task *_test.py"]
 						break
 					case "go":
-						testCommands = ["go test"]
+						testCommands = ["GOWORK=off go test -v"]
 						break
 					case "java":
 						testCommands = ["./gradlew test"]
 						break
 					case "rust":
-						testCommands = ["cargo test"]
+						testCommands = ["cargo test -- --include-ignored"]
 						break
 					default:
 						testCommands = []
@@ -118,53 +130,116 @@ export class ExercismAdapter implements BenchmarkAdapter {
 			throw new Error(`Task ${taskId} not found`)
 		}
 
-		// Check if Git repository is already initialized
-		const gitDirExists = fs.existsSync(path.join(task.workspacePath, ".git"))
+		// Create temp directory outside workspace for hiding files
+		const tempDir = path.join(EVALS_DIR, "temp-files", task.id)
+		fs.mkdirSync(tempDir, { recursive: true })
 
-		try {
-			// Initialize Git repository if needed
-			if (!gitDirExists) {
-				await execa("git", ["init"], { cwd: task.workspacePath })
-			}
+		// Read config.json to get solution and test files
+		const configPath = path.join(task.workspacePath, ".meta", "config.json")
+		let config: any = { files: { solution: [], test: [] } }
+		
+		if (fs.existsSync(configPath)) {
+			config = JSON.parse(fs.readFileSync(configPath, "utf-8"))
+		}
 
-			// Create a dummy file to ensure there's something to commit
-			const dummyFilePath = path.join(task.workspacePath, ".eval-timestamp")
-			fs.writeFileSync(dummyFilePath, new Date().toISOString())
+		// Build enhanced description with instructions
+		let description = ""
+		const instructionsPath = path.join(task.workspacePath, ".docs", "instructions.md")
+		const appendPath = path.join(task.workspacePath, ".docs", "instructions.append.md")
 
-			// Add all files and commit
-			await execa("git", ["add", "."], { cwd: task.workspacePath })
+		if (fs.existsSync(instructionsPath)) {
+			description = fs.readFileSync(instructionsPath, "utf-8")
+		}
 
-			try {
-				await execa("git", ["commit", "-m", "Initial commit"], { cwd: task.workspacePath })
-			} catch (error: any) {
-				// If commit fails because there are no changes, that's okay
-				if (!error.stderr?.includes("nothing to commit")) {
-					throw error
+		if (fs.existsSync(appendPath)) {
+			description += "\n\n" + fs.readFileSync(appendPath, "utf-8")
+		}
+
+		// Add solution files constraint to description
+		const solutionFiles = config.files.solution || []
+		const fileList = solutionFiles.join(", ")
+		description += `\n\nUse the above instructions to modify the supplied files: ${fileList}. Don't change the names of existing functions or classes, as they may be referenced from other code like unit tests, etc. Only use standard libraries, don't suggest installing any packages.`
+		description += " You should ignore all test or test related files in this directory. The final test file has been removed and will be used to evaluate your work after your implementation is complete. Think deeply about the problem prior to working on the implementation. Consider all edge cases and test your solution prior to finalizing."
+
+		// Move test files to temp directory
+		if (config.files.test) {
+			config.files.test.forEach((testFile: string) => {
+				const src = path.join(task.workspacePath, testFile)
+				if (fs.existsSync(src)) {
+					const dest = path.join(tempDir, testFile)
+					fs.mkdirSync(path.dirname(dest), { recursive: true })
+					fs.renameSync(src, dest)
+				}
+			})
+		}
+
+		// Move all dot directories (except .git) to temp directory
+		const items = fs.readdirSync(task.workspacePath)
+		items.forEach((item) => {
+			if (item.startsWith(".") && item !== ".git") {
+				const src = path.join(task.workspacePath, item)
+				const stat = fs.statSync(src)
+				if (stat.isDirectory()) {
+					const dest = path.join(tempDir, item)
+					fs.renameSync(src, dest)
 				}
 			}
-		} catch (error: any) {
-			console.warn(`Warning: Git operations failed: ${error.message}`)
-			console.warn("Continuing without Git initialization")
+		})
+
+		return {
+			...task,
+			description,
+			metadata: {
+				...task.metadata,
+				solutionFiles,
+				tempDir,
+				config,
+			},
 		}
+	}
+
+	/**
+	 * Cleanup after task execution (restores hidden files from temp directory)
+	 * @param task The task that was executed
+	 */
+	async cleanupTask(task: Task): Promise<void> {
+		const tempDir = path.join(EVALS_DIR, "temp-files", task.id)
+
+		if (fs.existsSync(tempDir)) {
+			const items = fs.readdirSync(tempDir)
+			items.forEach((item) => {
+				const src = path.join(tempDir, item)
+				const dest = path.join(task.workspacePath, item)
+				// Only move if destination doesn't exist (keeps newer test artifacts like .pytest_cache)
+				if (!fs.existsSync(dest)) {
+					fs.renameSync(src, dest)
+				}
+			})
 
-		return task
+			// Clean up temp directory
+			fs.rmSync(tempDir, { recursive: true, force: true })
+		}
 	}
 
 	/**
-	 * Verify the result of a task execution
+	 * Verify the result of a task execution by running tests
 	 * @param task The task that was executed
-	 * @param result The result of the task execution
 	 */
-	async verifyResult(task: Task, result: any): Promise<VerificationResult> {
+	async verifyResult(task: Task): Promise<VerificationResult> {
 		// Run verification commands
 		let success = true
 		let output = ""
 
 		for (const command of task.verificationCommands) {
 			try {
-				const [cmd, ...args] = command.split(" ")
-				const { stdout } = await execa(cmd, args, { cwd: task.workspacePath })
+				const { stdout, stderr } = await execa(command, {
+					cwd: task.workspacePath,
+					shell: true,
+				})
 				output += stdout + "\n"
+				if (stderr) {
+					output += stderr + "\n"
+				}
 			} catch (error: any) {
 				success = false
 				if (error.stdout) {
@@ -176,13 +251,92 @@ export class ExercismAdapter implements BenchmarkAdapter {
 			}
 		}
 
-		// Parse test results
-		const testsPassed = (output.match(/PASS/g) || []).length
-		const testsFailed = (output.match(/FAIL/g) || []).length
+		// Log the raw output
+		// console.log("\n=== TEST OUTPUT START ===")
+		// console.log(output)
+		// console.log("=== TEST OUTPUT END ===\n")
+
+		// Parse test results based on language
+		const language = task.metadata.language
+		let testsPassed = 0
+		let testsFailed = 0
+
+		switch (language) {
+			case "python":
+				const pyPassMatch = output.match(/(\d+) passed/)
+				const pyFailMatch = output.match(/(\d+) failed/)
+				testsPassed = pyPassMatch ? parseInt(pyPassMatch[1]) : 0
+				testsFailed = pyFailMatch ? parseInt(pyFailMatch[1]) : 0
+				break
+
+			case "javascript":
+				const jestMatch = output.match(/Tests:\s+(?:\d+ skipped,\s+)?(\d+) passed(?:,\s+(\d+) failed)?/)
+				if (jestMatch) {
+					testsPassed = parseInt(jestMatch[1])
+					testsFailed = jestMatch[2] ? parseInt(jestMatch[2]) : 0
+				} else {
+					// Fallback to counting test suites
+					testsPassed = (output.match(/PASS/g) || []).length
+					testsFailed = (output.match(/FAIL/g) || []).length
+				}
+				break
+
+			case "go":
+				// This incorrectly counts the parent, but minor and doesn't affect final boolean metric
+				testsPassed = (output.match(/--- PASS:/g) || []).length
+				testsFailed = (output.match(/--- FAIL:/g) || []).length
+				break
+
+			case "rust":
+				// Rust runs multiple test suites (unit, integration, doc tests)
+				// Sum results across all test result lines
+				const resultLines = output.match(/test result:.*?(\d+) passed; (\d+) failed/g)
+				if (resultLines) {
+					testsPassed = 0
+					testsFailed = 0
+					for (const line of resultLines) {
+						const match = line.match(/(\d+) passed; (\d+) failed/)
+						if (match) {
+							testsPassed += parseInt(match[1])
+							testsFailed += parseInt(match[2])
+						}
+					}
+				}
+				break
+
+			case "java":
+				testsPassed = (output.match(/PASSED/g) || []).length
+				testsFailed = (output.match(/FAILED/g) || []).length
+				break
+
+			case "cpp":
+				const cppAllPassedMatch = output.match(/All tests passed \(.*?(\d+) test cases?\)/)
+				const cppTestCasesMatch = output.match(/test cases?: (\d+) \| (\d+) passed/)
+				const cppFailedMatch = output.match(/(\d+) failed/)
+				
+				if (cppAllPassedMatch) {
+					// All tests passed - extract total test cases
+					testsPassed = parseInt(cppAllPassedMatch[1])
+					testsFailed = 0
+				} else if (cppTestCasesMatch) {
+					// Mixed results - extract passed count and calculate failed
+					const totalTests = parseInt(cppTestCasesMatch[1])
+					testsPassed = parseInt(cppTestCasesMatch[2])
+					testsFailed = cppFailedMatch ? parseInt(cppFailedMatch[1]) : (totalTests - testsPassed)
+				}
+				break
+
+			default:
+				// Fallback to generic PASS/FAIL counting
+				testsPassed = (output.match(/PASS/g) || []).length
+				testsFailed = (output.match(/FAIL/g) || []).length
+		}
+
 		const testsTotal = testsPassed + testsFailed
 
 		return {
 			success,
+			rawOutput: output,
 			metrics: {
 				testsPassed,
 				testsFailed,
@@ -191,4 +345,278 @@ export class ExercismAdapter implements BenchmarkAdapter {
 			},
 		}
 	}
+
+	/**
+	 * Hide test files by moving them to temp directory
+	 * @param task The task to hide test files for
+	 */
+	private hideTestFiles(task: Task): void {
+		const tempDir = task.metadata.tempDir
+		const config = task.metadata.config
+
+		if (config?.files?.test) {
+			config.files.test.forEach((testFile: string) => {
+				const src = path.join(task.workspacePath, testFile)
+				if (fs.existsSync(src)) {
+					const dest = path.join(tempDir, testFile)
+					fs.mkdirSync(path.dirname(dest), { recursive: true })
+					fs.renameSync(src, dest)
+				}
+			})
+		}
+
+		// Hide dot directories again (except .git)
+		const items = fs.readdirSync(task.workspacePath)
+		items.forEach((item) => {
+			if (item.startsWith(".") && item !== ".git") {
+				const src = path.join(task.workspacePath, item)
+				if (fs.existsSync(src)) {
+					const stat = fs.statSync(src)
+					if (stat.isDirectory()) {
+						const dest = path.join(tempDir, item)
+						if (!fs.existsSync(dest)) {
+							fs.renameSync(src, dest)
+						}
+					}
+				}
+			}
+		})
+	}
+
+	/**
+	 * Restore test files by moving them from temp directory
+	 * @param task The task to restore test files for
+	 */
+	private restoreTestFiles(task: Task): void {
+		const tempDir = task.metadata.tempDir
+		const config = task.metadata.config
+
+		if (config?.files?.test) {
+			config.files.test.forEach((testFile: string) => {
+				const src = path.join(tempDir, testFile)
+				if (fs.existsSync(src)) {
+					const dest = path.join(task.workspacePath, testFile)
+					fs.mkdirSync(path.dirname(dest), { recursive: true })
+					fs.renameSync(src, dest)
+				}
+			})
+		}
+
+		// Restore dot directories (except .git)
+		if (fs.existsSync(tempDir)) {
+			const items = fs.readdirSync(tempDir)
+			items.forEach((item) => {
+				if (item.startsWith(".") && item !== ".git") {
+					const src = path.join(tempDir, item)
+					const dest = path.join(task.workspacePath, item)
+					if (fs.existsSync(src) && !fs.existsSync(dest)) {
+						fs.renameSync(src, dest)
+					}
+				}
+			})
+		}
+	}
+
+	/**
+	 * Builds retry message with test errors and fix instructions
+	 * @param testOutput The raw test output showing errors
+	 * @param solutionFiles List of solution files to fix
+	 * @returns Formatted retry message
+	 */
+	private buildRetryMessage(testOutput: string, solutionFiles: string[]): string {
+		const fileList = solutionFiles.join(", ")
+		return `${testOutput}\n\nSee the testing errors above. The tests are correct, don't try and change them. Fix the code in ${fileList} to resolve the errors.`
+	}
+
+	/**
+	 * Unskip all JavaScript tests in the repository by replacing xtest with test
+	 * @param repoPath Path to the exercism repository
+	 */
+	private unskipAllJavaScriptTests(repoPath: string): void {
+		const jsDir = path.join(repoPath, "javascript", "exercises", "practice")
+		
+		if (!fs.existsSync(jsDir)) {
+			console.log("JavaScript exercises directory not found, skipping test unskipping")
+			return
+		}
+		
+		// Walk through all exercise directories
+		const exercises = fs.readdirSync(jsDir).filter(dir => {
+			const fullPath = path.join(jsDir, dir)
+			return fs.statSync(fullPath).isDirectory()
+		})
+
+		let filesModified = 0
+		for (const exercise of exercises) {
+			const exerciseDir = path.join(jsDir, exercise)
+			
+			// Find all .spec.js files
+			const files = fs.readdirSync(exerciseDir).filter(file => file.endsWith('.spec.js'))
+			
+			for (const file of files) {
+				const filePath = path.join(exerciseDir, file)
+				let content = fs.readFileSync(filePath, 'utf-8')
+				const originalContent = content
+				
+				// Replace xtest with test to unskip tests
+				content = content.replace(/xtest\(/g, 'test(')
+				
+				if (content !== originalContent) {
+					fs.writeFileSync(filePath, content)
+					filesModified++
+				}
+			}
+		}
+		
+		console.log(`Unskipped tests in ${filesModified} JavaScript test files`)
+	}
+
+	/**
+	 * Unskip all Java tests in the repository by removing @Disabled annotations
+	 * @param repoPath Path to the exercism repository
+	 */
+	private unskipAllJavaTests(repoPath: string): void {
+		const javaDir = path.join(repoPath, "java", "exercises", "practice")
+		
+		if (!fs.existsSync(javaDir)) {
+			console.log("Java exercises directory not found, skipping test unskipping")
+			return
+		}
+		
+		// Walk through all exercise directories
+		const exercises = fs.readdirSync(javaDir).filter(dir => {
+			const fullPath = path.join(javaDir, dir)
+			return fs.statSync(fullPath).isDirectory()
+		})
+
+		let filesModified = 0
+		for (const exercise of exercises) {
+			const testDir = path.join(javaDir, exercise, "src", "test", "java")
+			
+			if (!fs.existsSync(testDir)) {
+				continue
+			}
+			
+			// Find all .java test files
+			const files = fs.readdirSync(testDir).filter(file => file.endsWith('.java'))
+			
+			for (const file of files) {
+				const filePath = path.join(testDir, file)
+				let content = fs.readFileSync(filePath, 'utf-8')
+				const originalContent = content
+				
+				// Remove @Disabled("Remove to run test") annotations
+				content = content.replace(/@Disabled\("Remove to run test"\)\s*\n/g, '')
+				
+				if (content !== originalContent) {
+					fs.writeFileSync(filePath, content)
+					filesModified++
+				}
+			}
+		}
+		
+		console.log(`Unskipped tests in ${filesModified} Java test files`)
+	}
+
+	/**
+	 * Runs a Cline task with automatic retry on test failure
+	 * Creates a new Cline instance, runs the task, verifies with tests,
+	 * and retries once if tests fail
+	 * @param task The task to execute
+	 * @returns The final verification result, or null
+	 */
+	async runTask(task: Task): Promise<VerificationResult | null> {
+		const startTime = Date.now()
+		let instanceAddress: string | null = null
+		let attempts = 0
+		let finalVerification: VerificationResult | null = null
+
+		try {
+			// Step 1: Start a new Cline instance in the working directory
+			const instanceResult = await execa("cline", ["instance", "new"], {
+				cwd: task.workspacePath,
+				stdin: "ignore",
+			})
+
+			// Step 2: Parse the instance address from output
+			const addressMatch = instanceResult.stdout.match(/Address:\s*([\d.]+:\d+)/)
+			if (!addressMatch) {
+				throw new Error("Failed to parse instance address from output")
+			}
+			instanceAddress = addressMatch[1]
+
+			// Step 3: Create the initial task on this specific instance
+			await execa("cline", ["task", "new", "--yolo", "--address", instanceAddress, task.description], {
+				cwd: task.workspacePath,
+				stdin: "ignore",
+			})
+
+			// Step 4: Wait for initial implementation to complete
+			console.log(chalk.blue(`Waiting for first attempt to complete...`))
+			await execa("cline", ["task", "view", "--follow-complete", "--address", instanceAddress], {
+				cwd: task.workspacePath,
+				stdin: "ignore",
+			})
+
+			// Step 5: Run first test attempt
+			console.log(chalk.blue(`Running tests (attempt 1)...`))
+			this.restoreTestFiles(task)
+			attempts = 1
+			const firstVerification = await this.verifyResult(task)
+			finalVerification = firstVerification
+
+			// Step 6: Retry if tests failed
+			if (!firstVerification.success) {
+				console.log(chalk.blue(`Tests failed on first attempt. Retrying...`))
+
+				// Hide test files again for retry
+				this.hideTestFiles(task)
+
+				attempts = 2
+				const solutionFiles = task.metadata.solutionFiles || []
+				const retryMessage = this.buildRetryMessage(firstVerification.rawOutput || "", solutionFiles)
+
+				// Send retry task message
+				await execa("cline", ["task", "send", "--yolo", "--address", instanceAddress], {
+					cwd: task.workspacePath,
+					input: retryMessage,
+				})
+
+				// Follow retry until complete
+				await execa("cline", ["task", "view", "--follow-complete", "--address", instanceAddress], {
+					cwd: task.workspacePath,
+					stdin: "ignore",
+				})
+
+				// Run second test attempt (final)
+				console.log(chalk.blue(`Running tests (attempt 2)...`))
+				this.restoreTestFiles(task)
+				const secondVerification = await this.verifyResult(task)
+				finalVerification = secondVerification
+			}
+
+			const duration = Date.now() - startTime
+			console.log(
+				chalk.green(`Task completed in ${(duration / 1000).toFixed(1)}s after ${attempts} attempt${attempts > 1 ? "s" : ""}`),
+			)
+
+			return finalVerification
+		} catch (error: any) {
+			const duration = Date.now() - startTime
+			console.error(chalk.red(`Task failed after ${(duration / 1000).toFixed(1)}s: ${error.message}`))
+
+			return finalVerification
+		} finally {
+			// Step 7: Always clean up the instance, even if task failed
+			if (instanceAddress) {
+				try {
+					await execa("cline", ["instance", "kill", instanceAddress], {
+						stdin: "ignore",
+					})
+				} catch (cleanupError: any) {
+					console.error(chalk.yellow(`Warning: Failed to kill instance ${instanceAddress}: ${cleanupError.message}`))
+				}
+			}
+		}
+	}
 }
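The language switch in `verifyResult` above is the core of the new scoring logic: each test runner's stdout has a different shape, so counts are extracted per language. A standalone sketch reproducing two of the branches plus the generic fallback (function name and the sample outputs below are illustrative):

```typescript
// Standalone sketch of the per-language test-output parsing in verifyResult.
// Only the python and rust branches are reproduced; others follow the same pattern.
function parseTestCounts(language: string, output: string): { passed: number; failed: number } {
	switch (language) {
		case "python": {
			// pytest summary line, e.g. "===== 3 passed, 1 failed in 0.12s ====="
			const pass = output.match(/(\d+) passed/)
			const fail = output.match(/(\d+) failed/)
			return { passed: pass ? parseInt(pass[1]) : 0, failed: fail ? parseInt(fail[1]) : 0 }
		}
		case "rust": {
			// cargo may emit several "test result:" lines (unit, integration, doc tests);
			// sum the counts across all of them
			let passed = 0
			let failed = 0
			const lines = output.match(/test result:.*?(\d+) passed; (\d+) failed/g) || []
			for (const line of lines) {
				const m = line.match(/(\d+) passed; (\d+) failed/)
				if (m) {
					passed += parseInt(m[1])
					failed += parseInt(m[2])
				}
			}
			return { passed, failed }
		}
		default:
			// Generic fallback mirroring the adapter's default branch
			return {
				passed: (output.match(/PASS/g) || []).length,
				failed: (output.match(/FAIL/g) || []).length,
			}
	}
}
```

Note the generic fallback can overcount when a runner prints both suite-level and test-level markers, which is why the adapter special-cases each language.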

+ 0 - 9
evals/cli/src/adapters/index.ts

@@ -1,18 +1,9 @@
 import { BenchmarkAdapter } from "./types"
 import { ExercismAdapter } from "./exercism"
-import { SWEBenchAdapter } from "./swe-bench"
-import { SWELancerAdapter } from "./swelancer"
-import { MultiSWEAdapter } from "./multi-swe"
 
 // Registry of all available adapters
 const adapters: Record<string, BenchmarkAdapter> = {
-	// Exercism is the primary adapter with real implementation
 	exercism: new ExercismAdapter(),
-
-	// Dummy adapters for testing
-	"swe-bench": new SWEBenchAdapter(),
-	swelancer: new SWELancerAdapter(),
-	"multi-swe": new MultiSWEAdapter(),
 }
 
 /**

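After this change the registry holds a single real adapter. The pattern it keeps is a plain name-to-adapter map with a lookup that fails loudly for removed or unknown benchmarks; a minimal sketch under that assumption (the interface here is pared down to one method for illustration):

```typescript
// Minimal sketch of the adapter-registry pattern retained in index.ts.
interface BenchmarkAdapter {
	name: string
	setup(): Promise<void>
}

const adapters: Record<string, BenchmarkAdapter> = {
	exercism: { name: "exercism", setup: async () => {} },
}

function getAdapter(name: string): BenchmarkAdapter {
	const adapter = adapters[name]
	if (!adapter) {
		// Surfacing the available names makes a typo (or a removed benchmark
		// like "swe-bench") an immediate, explicit failure
		throw new Error(`Unknown benchmark: ${name}. Available: ${Object.keys(adapters).join(", ")}`)
	}
	return adapter
}
```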
+ 0 - 192
evals/cli/src/adapters/multi-swe.ts

@@ -1,192 +0,0 @@
-import * as path from "path"
-import * as fs from "fs"
-import execa from "execa"
-import { BenchmarkAdapter, Task, VerificationResult } from "./types"
-
-const EVALS_DIR = path.resolve(__dirname, "../../../")
-
-/**
- * Dummy adapter for the Multi-SWE-Bench benchmark
- */
-export class MultiSWEAdapter implements BenchmarkAdapter {
-	name = "multi-swe"
-
-	/**
-	 * Set up the Multi-SWE-Bench benchmark repository (dummy implementation)
-	 */
-	async setup(): Promise<void> {
-		console.log("Multi-SWE-Bench dummy setup completed")
-
-		// Create repositories directory if it doesn't exist
-		const repoDir = path.join(EVALS_DIR, "repositories", "multi-swe")
-		if (!fs.existsSync(repoDir)) {
-			fs.mkdirSync(repoDir, { recursive: true })
-			console.log(`Created dummy Multi-SWE-Bench directory at ${repoDir}`)
-		}
-	}
-
-	/**
-	 * List all available tasks in the Multi-SWE-Bench benchmark (dummy implementation)
-	 */
-	async listTasks(): Promise<Task[]> {
-		return [
-			{
-				id: "multi-swe-task-1",
-				name: "Multi-Language API Integration",
-				description:
-					"Implement a system that integrates a Python backend with a TypeScript frontend and a Rust processing service.",
-				workspacePath: path.join(EVALS_DIR, "repositories", "multi-swe"),
-				setupCommands: [],
-				verificationCommands: [],
-				metadata: {
-					languages: ["python", "typescript", "rust"],
-					complexity: "high",
-					type: "multi-swe",
-				},
-			},
-			{
-				id: "multi-swe-task-2",
-				name: "Cross-Platform Mobile App",
-				description: "Create a cross-platform mobile app using React Native with native modules in Swift and Kotlin.",
-				workspacePath: path.join(EVALS_DIR, "repositories", "multi-swe"),
-				setupCommands: [],
-				verificationCommands: [],
-				metadata: {
-					languages: ["javascript", "swift", "kotlin"],
-					complexity: "medium",
-					type: "multi-swe",
-				},
-			},
-			{
-				id: "multi-swe-task-3",
-				name: "Microservice Architecture",
-				description: "Design and implement a microservice architecture with services written in Go, Node.js, and Java.",
-				workspacePath: path.join(EVALS_DIR, "repositories", "multi-swe"),
-				setupCommands: [],
-				verificationCommands: [],
-				metadata: {
-					languages: ["go", "javascript", "java"],
-					complexity: "high",
-					type: "multi-swe",
-				},
-			},
-		]
-	}
-
-	/**
-	 * Prepare a specific task for execution (dummy implementation)
-	 * @param taskId The ID of the task to prepare
-	 */
-	async prepareTask(taskId: string): Promise<Task> {
-		const tasks = await this.listTasks()
-		const task = tasks.find((t) => t.id === taskId)
-
-		if (!task) {
-			throw new Error(`Task ${taskId} not found`)
-		}
-
-		// Create a dummy workspace for the task
-		const taskDir = path.join(task.workspacePath, taskId)
-		if (!fs.existsSync(taskDir)) {
-			fs.mkdirSync(taskDir, { recursive: true })
-
-			// Create a dummy file for the task
-			fs.writeFileSync(
-				path.join(taskDir, "README.md"),
-				`# ${task.name}\n\n${task.description}\n\nThis is a dummy task for testing purposes.`,
-			)
-
-			// Create additional dummy files based on task type
-			if (task.id === "multi-swe-task-1") {
-				// Python backend
-				fs.mkdirSync(path.join(taskDir, "backend"), { recursive: true })
-				fs.writeFileSync(
-					path.join(taskDir, "backend", "app.py"),
-					`# TODO: Implement Python backend\nfrom flask import Flask\n\napp = Flask(__name__)\n\n@app.route('/')\ndef hello():\n    return "Hello, World!"\n`,
-				)
-
-				// TypeScript frontend
-				fs.mkdirSync(path.join(taskDir, "frontend"), { recursive: true })
-				fs.writeFileSync(
-					path.join(taskDir, "frontend", "app.ts"),
-					`// TODO: Implement TypeScript frontend\nconsole.log('Frontend starting...');\n`,
-				)
-
-				// Rust processing service
-				fs.mkdirSync(path.join(taskDir, "processor"), { recursive: true })
-				fs.writeFileSync(
-					path.join(taskDir, "processor", "main.rs"),
-					`// TODO: Implement Rust processing service\nfn main() {\n    println!("Processor starting...");\n}\n`,
-				)
-			} else if (task.id === "multi-swe-task-2") {
-				// React Native app
-				fs.mkdirSync(path.join(taskDir, "app"), { recursive: true })
-				fs.writeFileSync(
-					path.join(taskDir, "app", "App.js"),
-					`// TODO: Implement React Native app\nimport React from 'react';\nimport { View, Text } from 'react-native';\n\nexport default function App() {\n  return (\n    <View>\n      <Text>Hello, World!</Text>\n    </View>\n  );\n}\n`,
-				)
-
-				// Swift native module
-				fs.mkdirSync(path.join(taskDir, "ios"), { recursive: true })
-				fs.writeFileSync(
-					path.join(taskDir, "ios", "NativeModule.swift"),
-					`// TODO: Implement Swift native module\nimport Foundation\n\n@objc(NativeModule)\nclass NativeModule: NSObject {\n  @objc\n  func hello() -> String {\n    return "Hello from Swift"\n  }\n}\n`,
-				)
-
-				// Kotlin native module
-				fs.mkdirSync(path.join(taskDir, "android"), { recursive: true })
-				fs.writeFileSync(
-					path.join(taskDir, "android", "NativeModule.kt"),
-					`// TODO: Implement Kotlin native module\npackage com.example.app\n\nclass NativeModule {\n  fun hello(): String {\n    return "Hello from Kotlin"\n  }\n}\n`,
-				)
-			} else if (task.id === "multi-swe-task-3") {
-				// Go service
-				fs.mkdirSync(path.join(taskDir, "service-go"), { recursive: true })
-				fs.writeFileSync(
-					path.join(taskDir, "service-go", "main.go"),
-					`// TODO: Implement Go service\npackage main\n\nimport "fmt"\n\nfunc main() {\n\tfmt.Println("Go service starting...")\n}\n`,
-				)
-
-				// Node.js service
-				fs.mkdirSync(path.join(taskDir, "service-node"), { recursive: true })
-				fs.writeFileSync(
-					path.join(taskDir, "service-node", "server.js"),
-					`// TODO: Implement Node.js service\nconsole.log('Node.js service starting...');\n`,
-				)
-
-				// Java service
-				fs.mkdirSync(path.join(taskDir, "service-java"), { recursive: true })
-				fs.writeFileSync(
-					path.join(taskDir, "service-java", "Main.java"),
-					`// TODO: Implement Java service\npublic class Main {\n    public static void main(String[] args) {\n        System.out.println("Java service starting...");\n    }\n}\n`,
-				)
-			}
-		}
-
-		// Update the task's workspace path to the task-specific directory
-		return {
-			...task,
-			workspacePath: taskDir,
-		}
-	}
-
-	/**
-	 * Verify the result of a task execution (dummy implementation)
-	 * @param task The task that was executed
-	 * @param result The result of the task execution
-	 */
-	async verifyResult(task: Task, result: any): Promise<VerificationResult> {
-		// Always return success for dummy implementation
-		return {
-			success: true,
-			metrics: {
-				testsPassed: 1,
-				testsFailed: 0,
-				testsTotal: 1,
-				functionalCorrectness: 1.0,
-				crossLanguageIntegration: 0.9, // Dummy metric specific to Multi-SWE
-				architectureQuality: 0.85, // Dummy metric specific to Multi-SWE
-			},
-		}
-	}
-}

+ 0 - 125
evals/cli/src/adapters/swe-bench.ts

@@ -1,125 +0,0 @@
-import * as path from "path"
-import * as fs from "fs"
-import execa from "execa"
-import { BenchmarkAdapter, Task, VerificationResult } from "./types"
-
-const EVALS_DIR = path.resolve(__dirname, "../../../")
-
-/**
- * Dummy adapter for the SWE-Bench benchmark
- */
-export class SWEBenchAdapter implements BenchmarkAdapter {
-	name = "swe-bench"
-
-	/**
-	 * Set up the SWE-Bench benchmark repository (dummy implementation)
-	 */
-	async setup(): Promise<void> {
-		console.log("SWE-Bench dummy setup completed")
-
-		// Create repositories directory if it doesn't exist
-		const repoDir = path.join(EVALS_DIR, "repositories", "swe-bench")
-		if (!fs.existsSync(repoDir)) {
-			fs.mkdirSync(repoDir, { recursive: true })
-			console.log(`Created dummy SWE-Bench directory at ${repoDir}`)
-		}
-	}
-
-	/**
-	 * List all available tasks in the SWE-Bench benchmark (dummy implementation)
-	 */
-	async listTasks(): Promise<Task[]> {
-		return [
-			{
-				id: "swe-bench-task-1",
-				name: "Fix React Component Bug",
-				description: "Fix a bug in a React component where the state is not properly updated.",
-				workspacePath: path.join(EVALS_DIR, "repositories", "swe-bench"),
-				setupCommands: [],
-				verificationCommands: [],
-				metadata: {
-					repository: "facebook/react",
-					issue: "#12345",
-					type: "swe-bench",
-				},
-			},
-			{
-				id: "swe-bench-task-2",
-				name: "Optimize Database Query",
-				description: "Optimize a slow database query in a Django application.",
-				workspacePath: path.join(EVALS_DIR, "repositories", "swe-bench"),
-				setupCommands: [],
-				verificationCommands: [],
-				metadata: {
-					repository: "django/django",
-					issue: "#6789",
-					type: "swe-bench",
-				},
-			},
-			{
-				id: "swe-bench-task-3",
-				name: "Fix Memory Leak",
-				description: "Fix a memory leak in a Node.js application.",
-				workspacePath: path.join(EVALS_DIR, "repositories", "swe-bench"),
-				setupCommands: [],
-				verificationCommands: [],
-				metadata: {
-					repository: "nodejs/node",
-					issue: "#9876",
-					type: "swe-bench",
-				},
-			},
-		]
-	}
-
-	/**
-	 * Prepare a specific task for execution (dummy implementation)
-	 * @param taskId The ID of the task to prepare
-	 */
-	async prepareTask(taskId: string): Promise<Task> {
-		const tasks = await this.listTasks()
-		const task = tasks.find((t) => t.id === taskId)
-
-		if (!task) {
-			throw new Error(`Task ${taskId} not found`)
-		}
-
-		// Create a dummy workspace for the task
-		const taskDir = path.join(task.workspacePath, taskId)
-		if (!fs.existsSync(taskDir)) {
-			fs.mkdirSync(taskDir, { recursive: true })
-
-			// Create a dummy file for the task
-			fs.writeFileSync(
-				path.join(taskDir, "README.md"),
-				`# ${task.name}\n\n${task.description}\n\nThis is a dummy task for testing purposes.`,
-			)
-		}
-
-		// Update the task's workspace path to the task-specific directory
-		return {
-			...task,
-			workspacePath: taskDir,
-		}
-	}
-
-	/**
-	 * Verify the result of a task execution (dummy implementation)
-	 * @param task The task that was executed
-	 * @param result The result of the task execution
-	 */
-	async verifyResult(task: Task, result: any): Promise<VerificationResult> {
-		// Always return success for dummy implementation
-		return {
-			success: true,
-			metrics: {
-				testsPassed: 1,
-				testsFailed: 0,
-				testsTotal: 1,
-				functionalCorrectness: 1.0,
-				performanceImprovement: 0.25, // Dummy metric specific to SWE-Bench
-				codeQuality: 0.9, // Dummy metric specific to SWE-Bench
-			},
-		}
-	}
-}

+ 0 - 143
evals/cli/src/adapters/swelancer.ts

@@ -1,143 +0,0 @@
-import * as path from "path"
-import * as fs from "fs"
-import execa from "execa"
-import { BenchmarkAdapter, Task, VerificationResult } from "./types"
-
-const EVALS_DIR = path.resolve(__dirname, "../../../")
-
-/**
- * Dummy adapter for the SWELancer benchmark
- */
-export class SWELancerAdapter implements BenchmarkAdapter {
-	name = "swelancer"
-
-	/**
-	 * Set up the SWELancer benchmark repository (dummy implementation)
-	 */
-	async setup(): Promise<void> {
-		console.log("SWELancer dummy setup completed")
-
-		// Create repositories directory if it doesn't exist
-		const repoDir = path.join(EVALS_DIR, "repositories", "swelancer")
-		if (!fs.existsSync(repoDir)) {
-			fs.mkdirSync(repoDir, { recursive: true })
-			console.log(`Created dummy SWELancer directory at ${repoDir}`)
-		}
-	}
-
-	/**
-	 * List all available tasks in the SWELancer benchmark (dummy implementation)
-	 */
-	async listTasks(): Promise<Task[]> {
-		return [
-			{
-				id: "swelancer-task-1",
-				name: "Create Landing Page",
-				description: "Create a responsive landing page for a new product using HTML, CSS, and JavaScript.",
-				workspacePath: path.join(EVALS_DIR, "repositories", "swelancer"),
-				setupCommands: [],
-				verificationCommands: [],
-				metadata: {
-					client: "TechStartup Inc.",
-					difficulty: "medium",
-					type: "swelancer",
-				},
-			},
-			{
-				id: "swelancer-task-2",
-				name: "Build REST API",
-				description: "Create a RESTful API for a blog application using Node.js and Express.",
-				workspacePath: path.join(EVALS_DIR, "repositories", "swelancer"),
-				setupCommands: [],
-				verificationCommands: [],
-				metadata: {
-					client: "BlogCo",
-					difficulty: "hard",
-					type: "swelancer",
-				},
-			},
-			{
-				id: "swelancer-task-3",
-				name: "Fix CSS Layout Issues",
-				description: "Fix layout issues in a responsive website across different screen sizes.",
-				workspacePath: path.join(EVALS_DIR, "repositories", "swelancer"),
-				setupCommands: [],
-				verificationCommands: [],
-				metadata: {
-					client: "DesignAgency",
-					difficulty: "easy",
-					type: "swelancer",
-				},
-			},
-		]
-	}
-
-	/**
-	 * Prepare a specific task for execution (dummy implementation)
-	 * @param taskId The ID of the task to prepare
-	 */
-	async prepareTask(taskId: string): Promise<Task> {
-		const tasks = await this.listTasks()
-		const task = tasks.find((t) => t.id === taskId)
-
-		if (!task) {
-			throw new Error(`Task ${taskId} not found`)
-		}
-
-		// Create a dummy workspace for the task
-		const taskDir = path.join(task.workspacePath, taskId)
-		if (!fs.existsSync(taskDir)) {
-			fs.mkdirSync(taskDir, { recursive: true })
-
-			// Create a dummy file for the task
-			fs.writeFileSync(
-				path.join(taskDir, "README.md"),
-				`# ${task.name}\n\n${task.description}\n\nThis is a dummy task for testing purposes.`,
-			)
-
-			// Create additional dummy files based on task type
-			if (task.id === "swelancer-task-1") {
-				fs.writeFileSync(
-					path.join(taskDir, "index.html"),
-					`<!DOCTYPE html>\n<html>\n<head>\n  <title>Landing Page</title>\n</head>\n<body>\n  <!-- TODO: Implement landing page -->\n</body>\n</html>`,
-				)
-			} else if (task.id === "swelancer-task-2") {
-				fs.writeFileSync(
-					path.join(taskDir, "server.js"),
-					`// TODO: Implement REST API\nconsole.log('Server starting...');`,
-				)
-			} else if (task.id === "swelancer-task-3") {
-				fs.writeFileSync(
-					path.join(taskDir, "styles.css"),
-					`/* TODO: Fix layout issues */\nbody {\n  margin: 0;\n  padding: 0;\n}`,
-				)
-			}
-		}
-
-		// Update the task's workspace path to the task-specific directory
-		return {
-			...task,
-			workspacePath: taskDir,
-		}
-	}
-
-	/**
-	 * Verify the result of a task execution (dummy implementation)
-	 * @param task The task that was executed
-	 * @param result The result of the task execution
-	 */
-	async verifyResult(task: Task, result: any): Promise<VerificationResult> {
-		// Always return success for dummy implementation
-		return {
-			success: true,
-			metrics: {
-				testsPassed: 1,
-				testsFailed: 0,
-				testsTotal: 1,
-				functionalCorrectness: 1.0,
-				clientSatisfaction: 0.95, // Dummy metric specific to SWELancer
-				timeEfficiency: 0.85, // Dummy metric specific to SWELancer
-			},
-		}
-	}
-}

+ 4 - 1
evals/cli/src/adapters/types.ts

@@ -17,6 +17,7 @@ export interface Task {
 export interface VerificationResult {
 	success: boolean
 	metrics: Record<string, any>
+	rawOutput?: string
 }
 
 /**
@@ -27,5 +28,7 @@ export interface BenchmarkAdapter {
 	setup(): Promise<void>
 	listTasks(): Promise<Task[]>
 	prepareTask(taskId: string): Promise<Task>
-	verifyResult(task: Task, result: any): Promise<VerificationResult>
+	cleanupTask(task: Task): Promise<void>
+	verifyResult(task: Task): Promise<VerificationResult>
+	runTask(task: Task): Promise<VerificationResult | null>
 }

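For reference, a minimal adapter conforming to the updated `BenchmarkAdapter` contract might look like the sketch below: `verifyResult` no longer takes a `result` argument, and the new `runTask` may either return a final `VerificationResult` itself or return `null` to let the caller fall back to `verifyResult` (as `run.ts` does). The type definitions are inlined here for self-containment, and the `NoopAdapter` with its task content is purely hypothetical.

```typescript
// Inlined copies of the shapes from evals/cli/src/adapters/types.ts
interface Task {
	id: string
	name: string
	description: string
	workspacePath: string
	setupCommands: string[]
	verificationCommands: string[]
	metadata: Record<string, any>
}

interface VerificationResult {
	success: boolean
	metrics: Record<string, any>
	rawOutput?: string
}

interface BenchmarkAdapter {
	name: string
	setup(): Promise<void>
	listTasks(): Promise<Task[]>
	prepareTask(taskId: string): Promise<Task>
	cleanupTask(task: Task): Promise<void>
	verifyResult(task: Task): Promise<VerificationResult>
	runTask(task: Task): Promise<VerificationResult | null>
}

// Hypothetical no-op adapter illustrating the contract
class NoopAdapter implements BenchmarkAdapter {
	name = "noop"

	async setup(): Promise<void> {}

	async listTasks(): Promise<Task[]> {
		return [
			{
				id: "noop-1",
				name: "No-op task",
				description: "Does nothing; exists to exercise the interface.",
				workspacePath: "/tmp/noop",
				setupCommands: [],
				verificationCommands: [],
				metadata: {},
			},
		]
	}

	async prepareTask(taskId: string): Promise<Task> {
		const task = (await this.listTasks()).find((t) => t.id === taskId)
		if (!task) {
			throw new Error(`Task ${taskId} not found`)
		}
		return task
	}

	async cleanupTask(_task: Task): Promise<void> {}

	async verifyResult(_task: Task): Promise<VerificationResult> {
		return { success: true, metrics: { testsPassed: 1, testsFailed: 0, testsTotal: 1 } }
	}

	// Returning the verification directly; returning null would defer to verifyResult
	async runTask(task: Task): Promise<VerificationResult | null> {
		return this.verifyResult(task)
	}
}
```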
+ 0 - 53
evals/cli/src/commands/evals-env.ts

@@ -1,53 +0,0 @@
-import * as path from "path"
-import chalk from "chalk"
-import { createEvalsEnvFile, removeEvalsEnvFile, checkEvalsEnvFile } from "../utils/evals-env"
-
-interface EvalsEnvOptions {
-	action: "create" | "remove" | "check"
-	directory?: string
-}
-
-/**
- * Handler for the evals-env command
- * @param options Command options
- */
-export async function evalsEnvHandler(options: EvalsEnvOptions): Promise<void> {
-	// Determine the directory to use - default to repository root instead of current directory
-	const currentDir = process.cwd()
-	const repoRoot = path.resolve(currentDir, "..", "..") // Navigate up from evals/cli to root
-	const directory = options.directory || repoRoot
-
-	console.log(chalk.blue(`Working with directory: ${directory}`))
-
-	// Perform the requested action
-	switch (options.action) {
-		case "create":
-			console.log(chalk.blue("Creating evals.env file..."))
-			createEvalsEnvFile(directory)
-			console.log(chalk.green("The Cline extension should now detect this file and enter test mode."))
-			console.log(chalk.yellow("Note: You may need to reload VSCode for the changes to take effect."))
-			break
-
-		case "remove":
-			console.log(chalk.blue("Removing evals.env file..."))
-			removeEvalsEnvFile(directory)
-			console.log(chalk.green("The Cline extension should now exit test mode."))
-			console.log(chalk.yellow("Note: You may need to reload VSCode for the changes to take effect."))
-			break
-
-		case "check":
-			console.log(chalk.blue("Checking for evals.env file..."))
-			const exists = checkEvalsEnvFile(directory)
-			if (exists) {
-				console.log(chalk.green("The Cline extension should be in test mode."))
-			} else {
-				console.log(chalk.yellow("The Cline extension should not be in test mode."))
-			}
-			break
-
-		default:
-			console.error(chalk.red(`Unknown action: ${options.action}`))
-			console.log(chalk.yellow("Valid actions are: create, remove, check"))
-			break
-	}
-}

+ 41 - 55
evals/cli/src/commands/report.ts

@@ -34,7 +34,6 @@ export async function reportHandler(options: ReportOptions): Promise<void> {
 		// Generate summary report
 		const summary = {
 			runs: runs.length,
-			models: [...new Set(runs.map((run) => run.model))],
 			benchmarks: [...new Set(runs.map((run) => run.benchmark))],
 			tasks: 0,
 			successRate: 0,
@@ -45,6 +44,10 @@ export async function reportHandler(options: ReportOptions): Promise<void> {
 			totalToolFailures: 0,
 			toolSuccessRate: 0,
 			toolUsage: {} as Record<string, { calls: number; failures: number }>,
+			totalTests: 0,
+			totalTestsPassed: 0,
+			totalTestsFailed: 0,
+			testSuccessRate: 0,
 		}
 
 		let totalTasks = 0
@@ -54,6 +57,9 @@ export async function reportHandler(options: ReportOptions): Promise<void> {
 		let totalDuration = 0
 		let totalToolCalls = 0
 		let totalToolFailures = 0
+		let totalTests = 0
+		let totalTestsPassed = 0
+		let totalTestsFailed = 0
 
 		for (const run of runs) {
 			const tasks = db.getRunTasks(run.id)
@@ -73,6 +79,14 @@ export async function reportHandler(options: ReportOptions): Promise<void> {
 				totalCost += metrics.find((m) => m.name === "cost")?.value || 0
 				totalDuration += metrics.find((m) => m.name === "duration")?.value || 0
 
+				// Collect test metrics
+				const testsPassed = metrics.find((m) => m.name === "testsPassed")?.value || 0
+				const testsFailed = metrics.find((m) => m.name === "testsFailed")?.value || 0
+				const testsTotal = metrics.find((m) => m.name === "testsTotal")?.value || 0
+				totalTestsPassed += testsPassed
+				totalTestsFailed += testsFailed
+				totalTests += testsTotal
+
 				// Collect tool call metrics
 				totalToolCalls += task.total_tool_calls || 0
 				totalToolFailures += task.total_tool_failures || 0
@@ -99,6 +113,12 @@ export async function reportHandler(options: ReportOptions): Promise<void> {
 		summary.totalToolFailures = totalToolFailures
 		summary.toolSuccessRate = totalToolCalls > 0 ? 1 - totalToolFailures / totalToolCalls : 1.0
 
+		// Calculate test metrics
+		summary.totalTests = totalTests
+		summary.totalTestsPassed = totalTestsPassed
+		summary.totalTestsFailed = totalTestsFailed
+		summary.testSuccessRate = totalTests > 0 ? totalTestsPassed / totalTests : 0
+
 		summary.tasks = totalTasks
 		summary.successRate = totalTasks > 0 ? successfulTasks / totalTasks : 0
 		summary.averageTokens = totalTasks > 0 ? totalTokens / totalTasks : 0
@@ -112,12 +132,15 @@ export async function reportHandler(options: ReportOptions): Promise<void> {
 			const benchmarkRuns = runs.filter((run) => run.benchmark === benchmark)
 			const benchmarkSummary = {
 				runs: benchmarkRuns.length,
-				models: [...new Set(benchmarkRuns.map((run) => run.model))],
 				tasks: 0,
 				successRate: 0,
 				averageTokens: 0,
 				averageCost: 0,
 				averageDuration: 0,
+				totalTests: 0,
+				totalTestsPassed: 0,
+				totalTestsFailed: 0,
+				testSuccessRate: 0,
 			}
 
 			let benchmarkTasks = 0
@@ -125,6 +148,9 @@ export async function reportHandler(options: ReportOptions): Promise<void> {
 			let benchmarkTotalTokens = 0
 			let benchmarkTotalCost = 0
 			let benchmarkTotalDuration = 0
+			let benchmarkTotalTests = 0
+			let benchmarkTotalTestsPassed = 0
+			let benchmarkTotalTestsFailed = 0
 
 			for (const run of benchmarkRuns) {
 				const tasks = db.getRunTasks(run.id)
@@ -143,6 +169,14 @@ export async function reportHandler(options: ReportOptions): Promise<void> {
 
 					benchmarkTotalCost += metrics.find((m) => m.name === "cost")?.value || 0
 					benchmarkTotalDuration += metrics.find((m) => m.name === "duration")?.value || 0
+
+					// Collect test metrics
+					const testsPassed = metrics.find((m) => m.name === "testsPassed")?.value || 0
+					const testsFailed = metrics.find((m) => m.name === "testsFailed")?.value || 0
+					const testsTotal = metrics.find((m) => m.name === "testsTotal")?.value || 0
+					benchmarkTotalTestsPassed += testsPassed
+					benchmarkTotalTestsFailed += testsFailed
+					benchmarkTotalTests += testsTotal
 				}
 			}
 
@@ -151,60 +185,14 @@ export async function reportHandler(options: ReportOptions): Promise<void> {
 			benchmarkSummary.averageTokens = benchmarkTasks > 0 ? benchmarkTotalTokens / benchmarkTasks : 0
 			benchmarkSummary.averageCost = benchmarkTasks > 0 ? benchmarkTotalCost / benchmarkTasks : 0
 			benchmarkSummary.averageDuration = benchmarkTasks > 0 ? benchmarkTotalDuration / benchmarkTasks : 0
+			benchmarkSummary.totalTests = benchmarkTotalTests
+			benchmarkSummary.totalTestsPassed = benchmarkTotalTestsPassed
+			benchmarkSummary.totalTestsFailed = benchmarkTotalTestsFailed
+			benchmarkSummary.testSuccessRate = benchmarkTotalTests > 0 ? benchmarkTotalTestsPassed / benchmarkTotalTests : 0
 
 			benchmarkReports[benchmark] = benchmarkSummary
 		}
 
-		// Generate model-specific reports
-		const modelReports: Record<string, any> = {}
-
-		for (const model of summary.models) {
-			const modelRuns = runs.filter((run) => run.model === model)
-			const modelSummary = {
-				runs: modelRuns.length,
-				benchmarks: [...new Set(modelRuns.map((run) => run.benchmark))],
-				tasks: 0,
-				successRate: 0,
-				averageTokens: 0,
-				averageCost: 0,
-				averageDuration: 0,
-			}
-
-			let modelTasks = 0
-			let modelSuccessfulTasks = 0
-			let modelTotalTokens = 0
-			let modelTotalCost = 0
-			let modelTotalDuration = 0
-
-			for (const run of modelRuns) {
-				const tasks = db.getRunTasks(run.id)
-				modelTasks += tasks.length
-
-				for (const task of tasks) {
-					if (task.success) {
-						modelSuccessfulTasks++
-					}
-
-					const metrics = db.getTaskMetrics(task.id)
-
-					const tokensIn = metrics.find((m) => m.name === "tokensIn")?.value || 0
-					const tokensOut = metrics.find((m) => m.name === "tokensOut")?.value || 0
-					modelTotalTokens += tokensIn + tokensOut
-
-					modelTotalCost += metrics.find((m) => m.name === "cost")?.value || 0
-					modelTotalDuration += metrics.find((m) => m.name === "duration")?.value || 0
-				}
-			}
-
-			modelSummary.tasks = modelTasks
-			modelSummary.successRate = modelTasks > 0 ? modelSuccessfulTasks / modelTasks : 0
-			modelSummary.averageTokens = modelTasks > 0 ? modelTotalTokens / modelTasks : 0
-			modelSummary.averageCost = modelTasks > 0 ? modelTotalCost / modelTasks : 0
-			modelSummary.averageDuration = modelTasks > 0 ? modelTotalDuration / modelTasks : 0
-
-			modelReports[model] = modelSummary
-		}
-
 		// Save reports
 		const reportDir = path.join(path.resolve(__dirname, "../../../"), "results", "reports")
 		fs.mkdirSync(reportDir, { recursive: true })
@@ -217,14 +205,12 @@ export async function reportHandler(options: ReportOptions): Promise<void> {
 
 			fs.writeFileSync(path.join(reportDir, `benchmarks-${timestamp}.json`), JSON.stringify(benchmarkReports, null, 2))
 
-			fs.writeFileSync(path.join(reportDir, `models-${timestamp}.json`), JSON.stringify(modelReports, null, 2))
-
 			spinner.succeed(`JSON reports generated in ${reportDir}`)
 		} else {
 			// Generate markdown report
 			const outputPath = options.output || path.join(reportDir, `report-${timestamp}.md`)
 
-			generateMarkdownReport(summary, benchmarkReports, modelReports, outputPath)
+			generateMarkdownReport(summary, benchmarkReports, outputPath)
 
 			spinner.succeed(`Markdown report generated at ${outputPath}`)
 		}

+ 34 - 49
evals/cli/src/commands/run.ts

@@ -1,18 +1,13 @@
-import * as path from "path"
 import { v4 as uuidv4 } from "uuid"
 import chalk from "chalk"
 import ora from "ora"
 import { getAdapter } from "../adapters"
 import { ResultsDatabase } from "../db"
-import { spawnVSCode, cleanupVSCode } from "../utils/vscode"
-import { sendTaskToServer } from "../utils/task"
 import { storeTaskResult } from "../utils/results"
 
 interface RunOptions {
 	benchmark?: string
-	model: string
 	count?: number
-	apiKey?: string
 }
 
 /**
@@ -21,12 +16,10 @@ interface RunOptions {
  */
 export async function runHandler(options: RunOptions): Promise<void> {
 	// Determine which benchmarks to run
-	const benchmarks = options.benchmark ? [options.benchmark] : ["exercism"] // Default to exercism for now
-	const model = options.model
+	const benchmarks = options.benchmark ? [options.benchmark] : ["exercism"] // Default to exercism
 	const count = options.count || Infinity
 
-	console.log(chalk.blue(`Running evaluations for model: ${model}`))
-	console.log(chalk.blue(`Benchmarks: ${benchmarks.join(", ")}`))
+	console.log(chalk.blue(`Running evaluations for the following benchmarks: ${benchmarks.join(", ")}`))
 
 	// Create a run for each benchmark
 	for (const benchmark of benchmarks) {
@@ -36,7 +29,7 @@ export async function runHandler(options: RunOptions): Promise<void> {
 		console.log(chalk.green(`\nStarting run for benchmark: ${benchmark}`))
 
 		// Create run in database
-		db.createRun(runId, model, benchmark)
+		db.createRun(runId, benchmark)
 
 		// Get adapter for this benchmark
 		try {
@@ -63,58 +56,51 @@ export async function runHandler(options: RunOptions): Promise<void> {
 				const preparedTask = await adapter.prepareTask(task.id)
 				prepareSpinner.succeed("Task prepared")
 
-				// Spawn VSCode
-				console.log("Spawning VSCode...")
-				await spawnVSCode(preparedTask.workspacePath)
+				let cleanedUp = false
 
-				// Send task to server
-				const sendSpinner = ora("Sending task to server...").start()
 				try {
-					const result = await sendTaskToServer(preparedTask.description, options.apiKey)
-					sendSpinner.succeed("Task completed")
+					// Run task using adapter's execution strategy
+					const finalVerification = await adapter.runTask(preparedTask)
 
-					// Verify result
-					const verifySpinner = ora("Verifying result...").start()
-					const verification = await adapter.verifyResult(preparedTask, result)
+					// Cleanup task
+					const cleanupSpinner = ora("Cleaning up task...").start()
+					await adapter.cleanupTask(preparedTask)
+					cleanedUp = true
+					cleanupSpinner.succeed("Cleanup complete")
+
+					// Use final verification from runTask
+					const verification = finalVerification || (await adapter.verifyResult(preparedTask))
 
 					if (verification.success) {
-						verifySpinner.succeed(
-							`Verification successful: ${verification.metrics.testsPassed}/${verification.metrics.testsTotal} tests passed`,
+						console.log(
+							chalk.green(
+								`Tests passed: ${verification.metrics.testsPassed}/${verification.metrics.testsTotal}`,
+							),
 						)
 					} else {
-						verifySpinner.fail(
-							`Verification failed: ${verification.metrics.testsPassed}/${verification.metrics.testsTotal} tests passed`,
+						console.log(
+							chalk.red(
+								`Tests failed: ${verification.metrics.testsPassed}/${verification.metrics.testsTotal}`,
+							),
 						)
 					}
 
 					// Store result
 					const storeSpinner = ora("Storing result...").start()
-					await storeTaskResult(runId, preparedTask, result, verification)
+					await storeTaskResult(runId, preparedTask, {}, verification)
 					storeSpinner.succeed("Result stored")
-
-					console.log(chalk.green(`Task completed. Success: ${verification.success}`))
-
-					// Clean up VS Code and temporary files
-					const cleanupSpinner = ora("Cleaning up...").start()
-					try {
-						await cleanupVSCode(preparedTask.workspacePath)
-						cleanupSpinner.succeed("Cleanup completed")
-					} catch (cleanupError: any) {
-						cleanupSpinner.fail(`Cleanup failed: ${cleanupError.message}`)
-						console.error(chalk.yellow(cleanupError.stack))
-					}
 				} catch (error: any) {
-					sendSpinner.fail(`Task failed: ${error.message}`)
-					console.error(chalk.red(error.stack))
-
-					// Clean up VS Code and temporary files even if the task failed
-					const cleanupSpinner = ora("Cleaning up...").start()
-					try {
-						await cleanupVSCode(preparedTask.workspacePath)
-						cleanupSpinner.succeed("Cleanup completed")
-					} catch (cleanupError: any) {
-						cleanupSpinner.fail(`Cleanup failed: ${cleanupError.message}`)
-						console.error(chalk.yellow(cleanupError.stack))
+					console.error(chalk.red(`Task failed: ${error.message}`))
+				} finally {
+					// Ensure cleanup always happens
+					if (!cleanedUp) {
+						try {
+							const finalCleanupSpinner = ora("Performing cleanup...").start()
+							await adapter.cleanupTask(preparedTask)
+							finalCleanupSpinner.succeed("Cleanup complete")
+						} catch (cleanupError: any) {
+							console.error(chalk.red(`Cleanup failed: ${cleanupError.message}`))
+						}
 					}
 				}
 			}
@@ -125,7 +111,6 @@ export async function runHandler(options: RunOptions): Promise<void> {
 			console.log(chalk.green(`\nRun complete for benchmark: ${benchmark}`))
 		} catch (error: any) {
 			console.error(chalk.red(`Error running benchmark ${benchmark}: ${error.message}`))
-			console.error(error.stack)
 		}
 	}
 

+ 4 - 5
evals/cli/src/db/index.ts

@@ -34,16 +34,15 @@ export class ResultsDatabase {
 	/**
 	 * Create a new evaluation run
 	 * @param id Run ID
-	 * @param model Model name
 	 * @param benchmark Benchmark name
 	 */
-	createRun(id: string, model: string, benchmark: string): void {
+	createRun(id: string, benchmark: string): void {
 		const stmt = this.db.prepare(`
-      INSERT INTO runs (id, timestamp, model, benchmark)
-      VALUES (?, ?, ?, ?)
+      INSERT INTO runs (id, timestamp, benchmark)
+      VALUES (?, ?, ?)
     `)
 
-		stmt.run(id, Date.now(), model, benchmark)
+		stmt.run(id, Date.now(), benchmark)
 	}
 
 	/**

+ 0 - 1
evals/cli/src/db/schema.ts

@@ -5,7 +5,6 @@ export const SCHEMA = `
 CREATE TABLE IF NOT EXISTS runs (
   id TEXT PRIMARY KEY,
   timestamp INTEGER NOT NULL,
-  model TEXT NOT NULL,
   benchmark TEXT NOT NULL,
   completed INTEGER NOT NULL DEFAULT 0
 );

+ 2 - 20
evals/cli/src/index.ts

@@ -4,7 +4,6 @@ import chalk from "chalk"
 import { setupHandler } from "./commands/setup"
 import { runHandler } from "./commands/run"
 import { reportHandler } from "./commands/report"
-import { evalsEnvHandler } from "./commands/evals-env"
 import { runDiffEvalHandler } from "./commands/runDiffEval"
 
 // Create the CLI program
@@ -20,7 +19,7 @@ program
 	.option(
 		"-b, --benchmarks <benchmarks>",
 		"Comma-separated list of benchmarks to set up",
-		"exercism,swe-bench,swelancer,multi-swe",
+		"exercism",
 	)
 	.action(async (options) => {
 		try {
@@ -36,9 +35,7 @@ program
 	.command("run")
 	.description("Run evaluations")
 	.option("-b, --benchmark <benchmark>", "Specific benchmark to run")
-	.option("-m, --model <model>", "Model to evaluate", "claude-3-opus-20240229")
 	.option("-c, --count <count>", "Number of tasks to run", parseInt)
-	.option("-k, --api-key <apiKey>", "Cline API key to use for evaluations")
 	.action(async (options) => {
 		try {
 			await runHandler(options)
@@ -63,21 +60,6 @@ program
 		}
 	})
 
-// Evals-env command
-program
-	.command("evals-env")
-	.description("Manage evals.env files for test mode activation")
-	.argument("<action>", "Action to perform: create, remove, or check")
-	.option("-d, --directory <directory>", "Directory to create/remove/check evals.env file in (defaults to current directory)")
-	.action(async (action, options) => {
-		try {
-			await evalsEnvHandler({ action, ...options })
-		} catch (error) {
-			console.error(chalk.red(`Error managing evals.env file: ${error instanceof Error ? error.message : String(error)}`))
-			process.exit(1)
-		}
-	})
-
 // Run-diff-eval command
 program
 	.command("run-diff-eval")
@@ -90,7 +72,7 @@ program
 	.option("--max-attempts-per-case <number>", "Maximum total attempts per test case (default: 10x valid attempts)")
 	.option("--max-cases <number>", "Maximum number of test cases to run (limits total cases loaded)")
 	.option("--parsing-function <name>", "The parsing function to use", "parseAssistantMessageV2")
-	.option("--diff-edit-function <name>", "The diff editing function to use", "constructNewFileContentV2")
+	.option("--diff-edit-function <name>", "The diff editing function to use", "diff-06-26-25")
 	.option("--thinking-budget <tokens>", "Set the thinking tokens budget", "0")
 	.option("--provider <provider>", "API provider to use (openrouter, openai)", "openrouter")
 	.option("--parallel", "Run tests in parallel", false)

+ 0 - 79
evals/cli/src/utils/evals-env.ts

@@ -1,79 +0,0 @@
-import * as fs from "fs"
-import * as path from "path"
-import chalk from "chalk"
-
-/**
- * Creates an evals.env file in the specified directory
- * @param directory The directory where the evals.env file should be created
- * @returns True if the file was created, false if it already exists
- */
-export function createEvalsEnvFile(directory: string): boolean {
-	const evalsEnvPath = path.join(directory, "evals.env")
-
-	// Check if the file already exists
-	if (fs.existsSync(evalsEnvPath)) {
-		console.log(chalk.yellow(`evals.env file already exists at ${evalsEnvPath}`))
-		return false
-	}
-
-	// Create the file
-	try {
-		const content = `# This file activates Cline test mode
-# Created at: ${new Date().toISOString()}
-# 
-# This file is automatically detected by the Cline extension
-# and enables test mode for automated evaluations.
-#
-# Delete this file to deactivate test mode.
-`
-		fs.writeFileSync(evalsEnvPath, content)
-		console.log(chalk.green(`Created evals.env file at ${evalsEnvPath}`))
-		return true
-	} catch (error) {
-		console.error(chalk.red(`Error creating evals.env file: ${error}`))
-		return false
-	}
-}
-
-/**
- * Removes an evals.env file from the specified directory
- * @param directory The directory where the evals.env file should be removed
- * @returns True if the file was removed, false if it doesn't exist
- */
-export function removeEvalsEnvFile(directory: string): boolean {
-	const evalsEnvPath = path.join(directory, "evals.env")
-
-	// Check if the file exists
-	if (!fs.existsSync(evalsEnvPath)) {
-		console.log(chalk.yellow(`No evals.env file found at ${evalsEnvPath}`))
-		return false
-	}
-
-	// Remove the file
-	try {
-		fs.unlinkSync(evalsEnvPath)
-		console.log(chalk.green(`Removed evals.env file from ${evalsEnvPath}`))
-		return true
-	} catch (error) {
-		console.error(chalk.red(`Error removing evals.env file: ${error}`))
-		return false
-	}
-}
-
-/**
- * Checks if an evals.env file exists in the specified directory
- * @param directory The directory to check for an evals.env file
- * @returns True if the file exists, false otherwise
- */
-export function checkEvalsEnvFile(directory: string): boolean {
-	const evalsEnvPath = path.join(directory, "evals.env")
-	const exists = fs.existsSync(evalsEnvPath)
-
-	if (exists) {
-		console.log(chalk.green(`evals.env file found at ${evalsEnvPath}`))
-	} else {
-		console.log(chalk.yellow(`No evals.env file found at ${evalsEnvPath}`))
-	}
-
-	return exists
-}

+ 0 - 131
evals/cli/src/utils/extensions.ts

@@ -1,131 +0,0 @@
-import execa from "execa"
-import * as fs from "fs"
-import * as path from "path"
-import * as os from "os"
-
-/**
- * List of VSCode extensions to install for evaluation environments
- * These extensions provide language support and other useful features
- */
-export const REQUIRED_EXTENSIONS = [
-	"golang.go", // Go language support
-	"dbaeumer.vscode-eslint", // ESLint support
-	"redhat.java", // Java support
-	"ms-python.python", // Python support
-	"rust-lang.rust-analyzer", // Rust support
-	"ms-vscode.cpptools", // C/C++ support
-]
-
-/**
- * Install required VSCode extensions in the specified extensions directory
- * @param extensionsDir The directory where extensions should be installed
- * @returns Promise that resolves when all extensions are installed
- */
-export async function installRequiredExtensions(extensionsDir: string): Promise<void> {
-	console.log("Installing required VSCode extensions...")
-
-	// Create the extensions directory if it doesn't exist
-	if (!fs.existsSync(extensionsDir)) {
-		fs.mkdirSync(extensionsDir, { recursive: true })
-	}
-
-	// Install each extension
-	for (const extension of REQUIRED_EXTENSIONS) {
-		try {
-			console.log(`Installing extension: ${extension}...`)
-			await execa("code", ["--extensions-dir", extensionsDir, "--install-extension", extension, "--force"])
-			console.log(`✅ Extension ${extension} installed successfully`)
-		} catch (error: any) {
-			console.warn(`⚠️ Failed to install extension ${extension}: ${error.message}`)
-			// Continue with other extensions even if one fails
-		}
-	}
-
-	console.log("✅ All required extensions installed")
-}
-
-/**
- * Check if a VSCode extension is installed in the specified directory
- * @param extensionsDir The directory to check for installed extensions
- * @param extensionId The ID of the extension to check
- * @returns True if the extension is installed, false otherwise
- */
-export function isExtensionInstalled(extensionsDir: string, extensionId: string): boolean {
-	// Extensions are installed in directories named publisher.name-version
-	// We need to check if any directory starts with the extensionId
-	const extensionPrefix = extensionId.toLowerCase() + "-"
-
-	try {
-		const files = fs.readdirSync(extensionsDir)
-		return files.some((file) => {
-			const lowerCaseFile = file.toLowerCase()
-			return lowerCaseFile === extensionId.toLowerCase() || lowerCaseFile.startsWith(extensionPrefix)
-		})
-	} catch (error) {
-		return false
-	}
-}
-
-/**
- * Get the path to the VSCode settings file in the specified user data directory
- * @param userDataDir The VSCode user data directory
- * @returns The path to the settings.json file
- */
-export function getSettingsPath(userDataDir: string): string {
-	const settingsDir = path.join(userDataDir, "User")
-	fs.mkdirSync(settingsDir, { recursive: true })
-	return path.join(settingsDir, "settings.json")
-}
-
-/**
- * Configure extension settings in the VSCode user data directory
- * @param userDataDir The VSCode user data directory
- */
-export function configureExtensionSettings(userDataDir: string): void {
-	const settingsPath = getSettingsPath(userDataDir)
-
-	// Read existing settings if they exist
-	let settings = {}
-	if (fs.existsSync(settingsPath)) {
-		try {
-			settings = JSON.parse(fs.readFileSync(settingsPath, "utf8"))
-		} catch (error) {
-			console.warn(`Error reading settings file: ${error}`)
-		}
-	}
-
-	// Add or update extension-specific settings
-	const updatedSettings = {
-		...settings,
-		// Go extension settings
-		"go.toolsManagement.autoUpdate": false,
-		"go.survey.prompt": false,
-
-		// ESLint settings
-		"eslint.enable": true,
-		"eslint.run": "onSave",
-
-		// Java settings
-		"java.configuration.checkProjectSettingsExclusions": false,
-		"java.configure.checkForOutdatedExtensions": false,
-		"java.help.firstView": false,
-
-		// Python settings
-		"python.experiments.enabled": false,
-		"python.showStartPage": false,
-
-		// Rust settings
-		"rust-analyzer.checkOnSave.command": "check",
-
-		// C/C++ settings
-		"C_Cpp.intelliSenseEngine": "default",
-
-		// General extension settings
-		"extensions.autoUpdate": false,
-		"extensions.ignoreRecommendations": true,
-	}
-
-	// Write updated settings
-	fs.writeFileSync(settingsPath, JSON.stringify(updatedSettings, null, 2))
-	console.log("✅ Extension settings configured")
-}

+ 10 - 35
evals/cli/src/utils/markdown.ts

@@ -1,17 +1,14 @@
 import * as fs from "fs"
-import * as path from "path"
 
 /**
  * Generate a markdown report from evaluation results
  * @param summary Overall summary
  * @param benchmarkReports Benchmark-specific reports
- * @param modelReports Model-specific reports
  * @param outputPath Output file path
  */
 export function generateMarkdownReport(
 	summary: any,
 	benchmarkReports: Record<string, any>,
-	modelReports: Record<string, any>,
 	outputPath: string,
 ): void {
 	let markdown = `# Cline Evaluation Report\n\n`
@@ -19,10 +16,13 @@ export function generateMarkdownReport(
 	// Generate summary section
 	markdown += `## Summary\n\n`
 	markdown += `- **Total Runs:** ${summary.runs}\n`
-	markdown += `- **Models:** ${summary.models.join(", ")}\n`
 	markdown += `- **Benchmarks:** ${summary.benchmarks.join(", ")}\n`
 	markdown += `- **Total Tasks:** ${summary.tasks}\n`
-	markdown += `- **Success Rate:** ${(summary.successRate * 100).toFixed(2)}%\n`
+	markdown += `- **Task Success Rate:** ${(summary.successRate * 100).toFixed(2)}%\n`
+	markdown += `- **Total Tests:** ${summary.totalTests}\n`
+	markdown += `- **Tests Passed:** ${summary.totalTestsPassed}\n`
+	markdown += `- **Tests Failed:** ${summary.totalTestsFailed}\n`
+	markdown += `- **Test Success Rate:** ${(summary.testSuccessRate * 100).toFixed(2)}%\n`
 	markdown += `- **Average Tokens:** ${Math.round(summary.averageTokens)}\n`
 	markdown += `- **Average Cost:** $${summary.averageCost.toFixed(4)}\n`
 	markdown += `- **Average Duration:** ${(summary.averageDuration / 1000).toFixed(2)}s\n`
@@ -48,23 +48,12 @@ export function generateMarkdownReport(
 	for (const [benchmark, report] of Object.entries(benchmarkReports)) {
 		markdown += `### ${benchmark}\n\n`
 		markdown += `- **Runs:** ${report.runs}\n`
-		markdown += `- **Models:** ${report.models.join(", ")}\n`
 		markdown += `- **Tasks:** ${report.tasks}\n`
-		markdown += `- **Success Rate:** ${(report.successRate * 100).toFixed(2)}%\n`
-		markdown += `- **Average Tokens:** ${Math.round(report.averageTokens)}\n`
-		markdown += `- **Average Cost:** $${report.averageCost.toFixed(4)}\n`
-		markdown += `- **Average Duration:** ${(report.averageDuration / 1000).toFixed(2)}s\n\n`
-	}
-
-	// Generate model results section
-	markdown += `## Model Results\n\n`
-
-	for (const [model, report] of Object.entries(modelReports)) {
-		markdown += `### ${model}\n\n`
-		markdown += `- **Runs:** ${report.runs}\n`
-		markdown += `- **Benchmarks:** ${report.benchmarks.join(", ")}\n`
-		markdown += `- **Tasks:** ${report.tasks}\n`
-		markdown += `- **Success Rate:** ${(report.successRate * 100).toFixed(2)}%\n`
+		markdown += `- **Task Success Rate:** ${(report.successRate * 100).toFixed(2)}%\n`
+		markdown += `- **Total Tests:** ${report.totalTests}\n`
+		markdown += `- **Tests Passed:** ${report.totalTestsPassed}\n`
+		markdown += `- **Tests Failed:** ${report.totalTestsFailed}\n`
+		markdown += `- **Test Success Rate:** ${(report.testSuccessRate * 100).toFixed(2)}%\n`
 		markdown += `- **Average Tokens:** ${Math.round(report.averageTokens)}\n`
 		markdown += `- **Average Cost:** $${report.averageCost.toFixed(4)}\n`
 		markdown += `- **Average Duration:** ${(report.averageDuration / 1000).toFixed(2)}s\n\n`
@@ -87,20 +76,6 @@ export function generateMarkdownReport(
 
 	markdown += "```\n\n"
 
-	// Success rate by model chart
-	markdown += `### Success Rate by Model\n\n`
-	markdown += "```mermaid\n"
-	markdown += "graph TD\n"
-	markdown += "  title[Success Rate by Model]\n"
-	markdown += "  style title fill:none,stroke:none\n\n"
-
-	for (const [model, report] of Object.entries(modelReports)) {
-		const successRate = (report.successRate * 100).toFixed(2)
-		markdown += `  ${model.replace(/[-\.]/g, "_")}[${model}: ${successRate}%]\n`
-	}
-
-	markdown += "```\n\n"
-
 	// Add timestamp
 	markdown += `\n\n---\n\nReport generated on ${new Date().toISOString()}\n`
 

+ 0 - 52
evals/cli/src/utils/task.ts

@@ -1,52 +0,0 @@
-import fetch from "node-fetch"
-import chalk from "chalk"
-
-/**
- * Send a task to the Cline test server
- * @param task The task description to send
- * @param apiKey Optional Cline API key to use for the task
- * @returns The result of the task execution
- */
-export async function sendTaskToServer(task: string, apiKey?: string): Promise<any> {
-	const SERVER_URL = "http://localhost:9876/task"
-
-	try {
-		console.log(chalk.blue(`Sending task to server: ${task.substring(0, 100)}${task.length > 100 ? "..." : ""}`))
-
-		const response = await fetch(SERVER_URL, {
-			method: "POST",
-			headers: {
-				"Content-Type": "application/json",
-			},
-			body: JSON.stringify({
-				task,
-				apiKey,
-			}),
-		})
-
-		if (!response.ok) {
-			const errorText = await response.text()
-			throw new Error(`Server responded with status ${response.status}: ${errorText}`)
-		}
-
-		const result = await response.json()
-
-		if (!result.success) {
-			throw new Error(`Task execution failed: ${result.error || "Unknown error"}`)
-		}
-
-		if (result.timeout) {
-			throw new Error("Task execution timed out")
-		}
-
-		return result
-	} catch (error: any) {
-		if (error.code === "ECONNREFUSED") {
-			throw new Error(
-				"Could not connect to the test server. Make sure VSCode is running with the Cline extension and the test server is active.",
-			)
-		}
-
-		throw error
-	}
-}
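With the HTTP round-trip to `localhost:9876` removed, tasks go to the Cline CLI instead, and per the commit notes the prompt is piped over stdin ("using stdin for cline task send"). A hedged sketch of that pattern — the exact `cline` subcommand and flags are not shown in this diff, so the binary and arguments are left as parameters:

```typescript
import { execFileSync } from "node:child_process"

// Pipe a (possibly multi-line) task prompt to a CLI over stdin rather than
// passing it as an argument, which avoids shell-escaping issues and argv
// length limits. `bin` and `args` are parameterized because the actual
// cline invocation is not part of this hunk.
export function sendTaskViaStdin(bin: string, args: string[], task: string, cwd?: string): string {
	return execFileSync(bin, args, { input: task, cwd, encoding: "utf8" })
}
```

Against the real CLI this would look roughly like `sendTaskViaStdin("cline", ["task", "send", ...], prompt, workspaceDir)`; the subcommand is an assumption inferred from the commit message, not from this diff.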

+ 0 - 598
evals/cli/src/utils/vscode.ts

@@ -1,598 +0,0 @@
-import execa from "execa"
-import * as path from "path"
-import * as fs from "fs"
-import fetch from "node-fetch"
-import * as os from "os"
-import { installRequiredExtensions, configureExtensionSettings } from "./extensions"
-
-// Store temporary directories for cleanup
-interface VSCodeResources {
-	tempUserDataDir: string
-	tempExtensionsDir: string
-	vscodePid?: number
-}
-
-// Global map to track resources for each workspace
-const workspaceResources = new Map<string, VSCodeResources>()
-
-/**
- * Spawn a VSCode instance with the Cline extension
- * @param workspacePath The workspace path to open
- * @param vsixPath Optional path to a VSIX file to install
- * @returns The resources created for this VS Code instance
- */
-export async function spawnVSCode(workspacePath: string, vsixPath?: string): Promise<VSCodeResources> {
-	// Ensure the workspace path exists
-	if (!fs.existsSync(workspacePath)) {
-		throw new Error(`Workspace path does not exist: ${workspacePath}`)
-	}
-
-	// If no VSIX path is provided, build one with IS_TEST=true
-	if (!vsixPath) {
-		try {
-			// Build the VSIX (no longer need to set IS_TEST=true as we'll use evals.env file)
-			console.log("Building VSIX...")
-			const clineRoot = path.resolve(process.cwd(), "..", "..")
-			await execa("npx", ["vsce", "package"], {
-				cwd: clineRoot,
-				stdio: "inherit",
-			})
-
-			// Find the generated VSIX file(s)
-			const files = fs.readdirSync(clineRoot)
-			const vsixFiles = files.filter((file) => file.endsWith(".vsix"))
-
-			if (vsixFiles.length > 0) {
-				// Get file stats to find the most recent one
-				const vsixFilesWithStats = vsixFiles.map((file) => {
-					const filePath = path.join(clineRoot, file)
-					return {
-						file,
-						path: filePath,
-						mtime: fs.statSync(filePath).mtime,
-					}
-				})
-
-				// Sort by modification time (most recent first)
-				vsixFilesWithStats.sort((a, b) => b.mtime.getTime() - a.mtime.getTime())
-
-				// Use the most recent VSIX
-				vsixPath = vsixFilesWithStats[0].path
-				console.log(`Using most recent VSIX: ${vsixPath} (modified ${vsixFilesWithStats[0].mtime.toISOString()})`)
-
-				// Log all found VSIX files for debugging
-				if (vsixFiles.length > 1) {
-					console.log(`Found ${vsixFiles.length} VSIX files:`)
-					vsixFilesWithStats.forEach((f) => {
-						console.log(`  - ${f.file} (modified ${f.mtime.toISOString()})`)
-					})
-				}
-			} else {
-				console.warn("Could not find generated VSIX file")
-			}
-		} catch (error) {
-			console.warn("Failed to build test VSIX:", error)
-		}
-	}
-
-	// Create a temporary user data directory for this VS Code instance
-	const tempUserDataDir = path.join(os.tmpdir(), `vscode-cline-eval-${Date.now()}`)
-	fs.mkdirSync(tempUserDataDir, { recursive: true })
-	console.log(`Created temporary user data directory: ${tempUserDataDir}`)
-
-	// Create a temporary extensions directory to ensure no other extensions are loaded
-	const tempExtensionsDir = path.join(os.tmpdir(), `vscode-cline-eval-ext-${Date.now()}`)
-	fs.mkdirSync(tempExtensionsDir, { recursive: true })
-	console.log(`Created temporary extensions directory: ${tempExtensionsDir}`)
-
-	// Create evals.env file in the workspace to trigger test mode
-	console.log(`Creating evals.env file in workspace: ${workspacePath}`)
-	const evalsEnvPath = path.join(workspacePath, "evals.env")
-	fs.writeFileSync(
-		evalsEnvPath,
-		`# This file activates Cline test mode
-# Created at: ${new Date().toISOString()}
-# 
-# This file is automatically detected by the Cline extension
-# and enables test mode for automated evaluations.
-#
-# Delete this file to deactivate test mode.
-`,
-	)
-
-	// Create settings.json in the temporary user data directory to disable workspace trust
-	// and configure Cline to auto-open on startup
-	const settingsDir = path.join(tempUserDataDir, "User")
-	fs.mkdirSync(settingsDir, { recursive: true })
-	const settingsPath = path.join(settingsDir, "settings.json")
-	const settings = {
-		// Disable workspace trust
-		"security.workspace.trust.enabled": false,
-		"security.workspace.trust.startupPrompt": "never",
-		"security.workspace.trust.banner": "never",
-		"security.workspace.trust.emptyWindow": true,
-
-		// Configure startup behavior
-		"workbench.startupEditor": "none",
-
-		// Auto-open Cline on startup
-		"cline.autoOpenOnStartup": true,
-
-		// Show the activity bar and sidebar
-		"workbench.activityBar.visible": true,
-		"workbench.sideBar.visible": true,
-		"workbench.view.extension.saoudrizwan.claude-dev-ActivityBar.visible": true,
-		"workbench.view.alwaysShowHeaderActions": true,
-		"workbench.editor.openSideBySideDirection": "right",
-
-		// Disable GitLens from opening automatically
-		"gitlens.views.repositories.autoReveal": false,
-		"gitlens.views.fileHistory.autoReveal": false,
-		"gitlens.views.lineHistory.autoReveal": false,
-		"gitlens.views.compare.autoReveal": false,
-		"gitlens.views.search.autoReveal": false,
-		"gitlens.showWelcomeOnInstall": false,
-		"gitlens.showWhatsNewAfterUpgrades": false,
-
-		// Disable other extensions that might compete for startup focus
-		"extensions.autoUpdate": false,
-	}
-	fs.writeFileSync(settingsPath, JSON.stringify(settings, null, 2))
-	console.log(`Created settings.json to disable workspace trust and auto-open Cline`)
-
-	// Create keybindings.json to automatically open Cline on startup
-	const keybindingsPath = path.join(settingsDir, "keybindings.json")
-	const keybindings = [
-		{
-			key: "alt+c",
-			command: "workbench.view.extension.saoudrizwan.claude-dev-ActivityBar",
-			when: "viewContainer.workbench.view.extension.saoudrizwan.claude-dev-ActivityBar.enabled",
-		},
-	]
-	fs.writeFileSync(keybindingsPath, JSON.stringify(keybindings, null, 2))
-	console.log(`Created keybindings.json to help with Cline activation`)
-
-	// Build the command arguments with custom user data directory
-	const args = [
-		// Use a custom user data directory to isolate this instance
-		"--user-data-dir",
-		tempUserDataDir,
-		// Use a custom extensions directory to ensure only our extension is loaded
-		"--extensions-dir",
-		tempExtensionsDir,
-		// Disable workspace trust
-		"--disable-workspace-trust",
-		"-n",
-		workspacePath,
-		// Force the extension to be activated on startup
-		"--start-up-extension",
-		"saoudrizwan.claude-dev",
-		// Run a command on startup to open Cline
-		"--command",
-		"workbench.view.extension.saoudrizwan.claude-dev-ActivityBar",
-		// Additional flags to help with extension activation
-		"--disable-gpu=false",
-		"--max-memory=4096",
-	]
-
-	// Create a startup script to run commands after VS Code launches
-	const startupScriptPath = path.join(settingsDir, "startup.js")
-	const startupScript = `
-		// This script will be executed when VS Code starts
-		setTimeout(() => {
-			// Try to open Cline in the sidebar
-			require('vscode').commands.executeCommand('workbench.view.extension.saoudrizwan.claude-dev-ActivityBar');
-		}, 5000);
-	`
-	fs.writeFileSync(startupScriptPath, startupScript)
-	console.log(`Created startup script to activate Cline`)
-
-	// If a VSIX is provided, install it
-	if (vsixPath) {
-		if (!fs.existsSync(vsixPath)) {
-			throw new Error(`VSIX file does not exist: ${vsixPath}`)
-		}
-		args.unshift("--install-extension", vsixPath)
-	}
-
-	// Install required extensions
-	console.log("Installing required VSCode extensions...")
-	await installRequiredExtensions(tempExtensionsDir)
-
-	// Configure extension settings
-	console.log("Configuring extension settings...")
-	configureExtensionSettings(tempUserDataDir)
-
-	// Execute the command
-	try {
-		// We don't need to install extensions globally anymore since we're using a custom user data directory
-		// The VSIX will be installed in the isolated environment if provided in the args
-
-		// Launch VS Code
-		console.log("Launching VS Code...")
-		await execa("code", args, {
-			stdio: "inherit",
-		})
-
-		// Wait longer for VSCode to initialize and extension to load
-		console.log("Waiting for VS Code to initialize...")
-		await new Promise((resolve) => setTimeout(resolve, 30000))
-
-		// Create a JavaScript file that will be loaded as a VS Code extension
-		const extensionDir = path.join(tempExtensionsDir, "cline-activator")
-		fs.mkdirSync(extensionDir, { recursive: true })
-
-		// Create package.json for the extension
-		const packageJsonPath = path.join(extensionDir, "package.json")
-		const packageJson = {
-			name: "cline-activator",
-			displayName: "Cline Activator",
-			description: "Activates Cline and starts the test server",
-			version: "0.0.1",
-			engines: {
-				vscode: "^1.60.0",
-			},
-			main: "./extension.js",
-			activationEvents: ["*"],
-			contributes: {
-				commands: [
-					{
-						command: "cline-activator.activate",
-						title: "Activate Cline",
-					},
-				],
-			},
-		}
-		fs.writeFileSync(packageJsonPath, JSON.stringify(packageJson, null, 2))
-
-		// Create extension.js
-		const extensionJsPath = path.join(extensionDir, "extension.js")
-		const extensionJs = `
-			const vscode = require('vscode');
-			
-			/**
-			 * @param {vscode.ExtensionContext} context
-			 */
-			function activate(context) {
-				console.log('Cline Activator is now active!');
-				
-				// Register the command to activate Cline
-				let disposable = vscode.commands.registerCommand('cline-activator.activate', async function () {
-					try {
-						// Make sure the Cline extension is activated
-						const extension = vscode.extensions.getExtension('saoudrizwan.claude-dev');
-						if (!extension) {
-							console.error('Cline extension not found');
-							return;
-						}
-						
-						if (!extension.isActive) {
-							console.log('Activating Cline extension...');
-							await extension.activate();
-						}
-						
-						// Show the Cline sidebar
-						console.log('Opening Cline sidebar...');
-						await vscode.commands.executeCommand('workbench.view.extension.saoudrizwan.claude-dev-ActivityBar');
-						
-						// Wait a moment for the sidebar to initialize
-						await new Promise(resolve => setTimeout(resolve, 2000));
-						
-						// Create the test server if it doesn't exist
-						console.log('Creating test server...');
-						
-						// Get the visible webview instance
-						const clineRootPath = '${path.resolve(process.cwd(), "..", "..")}';
-						const visibleWebview = require(path.join(clineRootPath, 'src', 'core', 'webview')).WebviewProvider.getVisibleInstance();
-						if (visibleWebview) {
-							require(path.join(clineRootPath, 'src', 'services', 'test', 'TestServer')).createTestServer(visibleWebview);
-							console.log('Test server created successfully');
-						} else {
-							console.error('No visible webview instance found');
-						}
-					} catch (error) {
-						console.error('Error activating Cline:', error);
-					}
-				});
-				
-				context.subscriptions.push(disposable);
-				
-				// Automatically run the command after a delay
-				setTimeout(() => {
-					vscode.commands.executeCommand('cline-activator.activate');
-				}, 5000);
-			}
-			
-			function deactivate() {}
-			
-			module.exports = {
-				activate,
-				deactivate
-			}
-		`
-		fs.writeFileSync(extensionJsPath, extensionJs)
-		console.log(`Created Cline Activator extension`)
-
-		// Try multiple approaches to activate the extension
-		let serverStarted = false
-
-		// Create an activation script to run in VS Code
-		const activationScriptPath = path.join(settingsDir, "activate-cline.js")
-		const activationScript = `
-			// This script will be executed to activate Cline and start the test server
-			const vscode = require('vscode');
-			
-			// Execute the cline-activator.activate command
-			vscode.commands.executeCommand('cline-activator.activate');
-		`
-		fs.writeFileSync(activationScriptPath, activationScript)
-		console.log(`Created activation script to run in VS Code`)
-
-		// Execute the activation script
-		try {
-			console.log("Executing activation script to start Cline and test server...")
-			await execa(
-				"code",
-				[
-					"--user-data-dir",
-					tempUserDataDir,
-					"--extensions-dir",
-					tempExtensionsDir,
-					"--folder-uri",
-					`file://${workspacePath}`,
-					"--execute",
-					activationScriptPath,
-				],
-				{
-					stdio: "inherit",
-				},
-			)
-
-			// Wait for the test server to start
-			console.log("Waiting for test server to start...")
-			for (let i = 0; i < 30; i++) {
-				try {
-					// Try to connect to the test server
-					const response = await fetch("http://localhost:9876/task", {
-						method: "OPTIONS",
-						headers: {
-							"Content-Type": "application/json",
-						},
-					})
-
-					if (response.status === 204) {
-						console.log("Test server is running!")
-						serverStarted = true
-						break
-					}
-				} catch (error) {
-					// Server not started yet, wait and try again
-					await new Promise((resolve) => setTimeout(resolve, 1000))
-				}
-			}
-		} catch (error) {
-			console.warn("Failed to execute activation script:", error)
-		}
-
-		if (!serverStarted) {
-			console.warn("Test server did not start after multiple attempts")
-			console.log("You may need to manually open the Cline extension in VS Code")
-		}
-
-		// Store the resources for this workspace
-		const resources: VSCodeResources = {
-			tempUserDataDir,
-			tempExtensionsDir,
-		}
-
-		// Store in the global map
-		workspaceResources.set(workspacePath, resources)
-
-		// Return the resources
-		return resources
-	} catch (error: any) {
-		throw new Error(`Failed to spawn VSCode: ${error.message}`)
-	}
-}
-
-/**
- * Clean up VS Code resources and shut down the test server
- * @param workspacePath The workspace path to clean up resources for
- */
-export async function cleanupVSCode(workspacePath: string): Promise<void> {
-	console.log(`Cleaning up VS Code resources for workspace: ${workspacePath}`)
-
-	// Get the resources for this workspace
-	const resources = workspaceResources.get(workspacePath)
-	if (!resources) {
-		console.log(`No resources found for workspace: ${workspacePath}`)
-		return
-	}
-
-	// Try to shut down the test server
-	try {
-		console.log("Shutting down test server...")
-		await fetch("http://localhost:9876/shutdown", {
-			method: "POST",
-			headers: {
-				"Content-Type": "application/json",
-			},
-		}).catch(() => {
-			// Ignore errors, the server might already be down
-		})
-	} catch (error) {
-		console.warn(`Error shutting down test server: ${error}`)
-	}
-
-	// Try to gracefully close VS Code instead of killing it
-	try {
-		console.log("Attempting to gracefully close VS Code...")
-
-		// Create a settings file that will disable the crash reporter and the exit confirmation dialog
-		const settingsDir = path.join(resources.tempUserDataDir, "User")
-		const settingsPath = path.join(settingsDir, "settings.json")
-
-		// Read existing settings if they exist
-		let settings = {}
-		if (fs.existsSync(settingsPath)) {
-			try {
-				settings = JSON.parse(fs.readFileSync(settingsPath, "utf8"))
-			} catch (error) {
-				console.warn(`Error reading settings file: ${error}`)
-			}
-		}
-
-		// Update settings to disable crash reporter and exit confirmation
-		settings = {
-			...settings,
-			"window.confirmBeforeClose": "never",
-			"telemetry.enableCrashReporter": false,
-			"window.restoreWindows": "none",
-			"window.newWindowDimensions": "default",
-		}
-
-		// Write updated settings
-		fs.writeFileSync(settingsPath, JSON.stringify(settings, null, 2))
-
-		// On macOS, use AppleScript to quit VS Code gracefully
-		if (process.platform === "darwin") {
-			try {
-				// First try AppleScript to quit VS Code gracefully
-				await execa("osascript", ["-e", 'tell application "Visual Studio Code" to quit'])
-
-				// Wait a moment for VS Code to close
-				await new Promise((resolve) => setTimeout(resolve, 2000))
-			} catch (appleScriptError) {
-				console.warn(`Error using AppleScript to quit VS Code: ${appleScriptError}`)
-			}
-		} else if (process.platform === "win32") {
-			// On Windows, try to use taskkill without /F first
-			try {
-				await execa("taskkill", ["/IM", "code.exe"])
-
-				// Wait a moment for VS Code to close
-				await new Promise((resolve) => setTimeout(resolve, 2000))
-			} catch (taskkillError) {
-				console.warn(`Error using taskkill to quit VS Code: ${taskkillError}`)
-			}
-		} else {
-			// On Linux, try to use SIGTERM first
-			try {
-				// Find VS Code processes
-				const { stdout } = await execa("ps", ["aux"])
-				const lines = stdout.split("\n")
-
-				for (const line of lines) {
-					if (line.includes(resources.tempUserDataDir)) {
-						const parts = line.trim().split(/\s+/)
-						const pid = parseInt(parts[1])
-
-						if (pid && !isNaN(pid)) {
-							console.log(`Sending SIGTERM to VS Code process with PID: ${pid}`)
-							try {
-								// Use SIGTERM instead of SIGKILL for a graceful shutdown
-								process.kill(pid, "SIGTERM")
-							} catch (killError) {
-								console.warn(`Failed to terminate process ${pid}: ${killError}`)
-							}
-						}
-					}
-				}
-
-				// Wait a moment for VS Code to close
-				await new Promise((resolve) => setTimeout(resolve, 2000))
-			} catch (psError) {
-				console.warn(`Error listing processes: ${psError}`)
-			}
-		}
-
-		// If graceful methods failed, fall back to forceful termination as a last resort
-		// Check if VS Code is still running with the temp user data dir
-		let vsCodeStillRunning = false
-
-		if (process.platform !== "win32") {
-			try {
-				const { stdout } = await execa("ps", ["aux"])
-				vsCodeStillRunning = stdout.split("\n").some((line) => line.includes(resources.tempUserDataDir))
-			} catch (error) {
-				console.warn(`Error checking if VS Code is still running: ${error}`)
-			}
-		} else {
-			try {
-				const { stdout } = await execa("tasklist", ["/FI", `IMAGENAME eq code.exe`])
-				vsCodeStillRunning = stdout.includes("code.exe")
-			} catch (error) {
-				console.warn(`Error checking if VS Code is still running: ${error}`)
-			}
-		}
-
-		// If VS Code is still running, use forceful termination as a last resort
-		if (vsCodeStillRunning) {
-			console.log("Graceful shutdown failed, falling back to forceful termination...")
-
-			if (process.platform === "win32") {
-				try {
-					await execa("taskkill", ["/IM", "code.exe", "/F"])
-				} catch (error) {
-					console.warn(`Error forcefully terminating VS Code: ${error}`)
-				}
-			} else {
-				try {
-					const { stdout } = await execa("ps", ["aux"])
-					const lines = stdout.split("\n")
-
-					for (const line of lines) {
-						if (line.includes(resources.tempUserDataDir)) {
-							const parts = line.trim().split(/\s+/)
-							const pid = parseInt(parts[1])
-
-							if (pid && !isNaN(pid)) {
-								console.log(`Forcefully killing VS Code process with PID: ${pid}`)
-								try {
-									process.kill(pid, "SIGKILL")
-								} catch (killError) {
-									console.warn(`Failed to kill process ${pid}: ${killError}`)
-								}
-							}
-						}
-					}
-				} catch (error) {
-					console.warn(`Error forcefully terminating VS Code: ${error}`)
-				}
-			}
-		}
-	} catch (error) {
-		console.warn(`Error closing VS Code: ${error}`)
-	}
-
-	// Clean up temporary directories and evals.env file
-	try {
-		console.log(`Removing temporary user data directory: ${resources.tempUserDataDir}`)
-		fs.rmSync(resources.tempUserDataDir, { recursive: true, force: true })
-	} catch (error) {
-		console.warn(`Error removing temporary user data directory: ${error}`)
-	}
-
-	try {
-		console.log(`Removing temporary extensions directory: ${resources.tempExtensionsDir}`)
-		fs.rmSync(resources.tempExtensionsDir, { recursive: true, force: true })
-	} catch (error) {
-		console.warn(`Error removing temporary extensions directory: ${error}`)
-	}
-
-	// Remove the evals.env file
-	try {
-		const evalsEnvPath = path.join(workspacePath, "evals.env")
-		if (fs.existsSync(evalsEnvPath)) {
-			console.log(`Removing evals.env file: ${evalsEnvPath}`)
-			fs.unlinkSync(evalsEnvPath)
-		}
-	} catch (error) {
-		console.warn(`Error removing evals.env file: ${error}`)
-	}
-
-	// Remove from the global map
-	workspaceResources.delete(workspacePath)
-
-	console.log("Cleanup completed")
-}

+ 1422 - 0
evals/package-lock.json

@@ -12,6 +12,7 @@
         "axios": "^1.12.0",
         "better-sqlite3": "^11.10.0",
         "chalk": "5.6.2",
+        "cline": "^1.0.1",
         "commander": "^9.4.1",
         "dotenv": "^16.5.0",
         "execa": "^5.1.1",
@@ -331,6 +332,905 @@
         "url": "https://github.com/sponsors/sindresorhus"
       }
     },
+    "node_modules/cline": {
+      "version": "1.0.1",
+      "resolved": "https://registry.npmjs.org/cline/-/cline-1.0.1.tgz",
+      "integrity": "sha512-2ON8BaRqNINpl4l3FeS9fOA47fq96GNUvYZ/Kfm6IaFsOHAE3DHIg0FDZVKYJn7VeHQcupPcsuJZr7ziONBbUw==",
+      "bundleDependencies": [
+        "@grpc/grpc-js",
+        "@grpc/reflection",
+        "better-sqlite3",
+        "grpc-health-check",
+        "open",
+        "vscode-uri"
+      ],
+      "cpu": [
+        "x64",
+        "arm64"
+      ],
+      "hasInstallScript": true,
+      "license": "Apache-2.0",
+      "os": [
+        "darwin",
+        "linux"
+      ],
+      "dependencies": {
+        "@grpc/grpc-js": "^1.13.3",
+        "@grpc/reflection": "^1.0.4",
+        "better-sqlite3": "^12.2.0",
+        "grpc-health-check": "^2.0.2",
+        "open": "^10.1.2",
+        "vscode-uri": "^3.1.0"
+      },
+      "bin": {
+        "cline": "bin/cline",
+        "cline-host": "bin/cline-host"
+      },
+      "engines": {
+        "node": ">=20.0.0"
+      }
+    },
+    "node_modules/cline/node_modules/@grpc/grpc-js": {
+      "version": "1.13.3",
+      "inBundle": true,
+      "license": "Apache-2.0",
+      "dependencies": {
+        "@grpc/proto-loader": "^0.7.13",
+        "@js-sdsl/ordered-map": "^4.4.2"
+      },
+      "engines": {
+        "node": ">=12.10.0"
+      }
+    },
+    "node_modules/cline/node_modules/@grpc/proto-loader": {
+      "version": "0.7.15",
+      "inBundle": true,
+      "license": "Apache-2.0",
+      "dependencies": {
+        "lodash.camelcase": "^4.3.0",
+        "long": "^5.0.0",
+        "protobufjs": "^7.2.5",
+        "yargs": "^17.7.2"
+      },
+      "bin": {
+        "proto-loader-gen-types": "build/bin/proto-loader-gen-types.js"
+      },
+      "engines": {
+        "node": ">=6"
+      }
+    },
+    "node_modules/cline/node_modules/@grpc/reflection": {
+      "version": "1.0.4",
+      "inBundle": true,
+      "license": "Apache-2.0",
+      "dependencies": {
+        "@grpc/proto-loader": "^0.7.13",
+        "protobufjs": "^7.2.5"
+      },
+      "peerDependencies": {
+        "@grpc/grpc-js": "^1.8.21"
+      }
+    },
+    "node_modules/cline/node_modules/@js-sdsl/ordered-map": {
+      "version": "4.4.2",
+      "inBundle": true,
+      "license": "MIT",
+      "funding": {
+        "type": "opencollective",
+        "url": "https://opencollective.com/js-sdsl"
+      }
+    },
+    "node_modules/cline/node_modules/@protobufjs/aspromise": {
+      "version": "1.1.2",
+      "inBundle": true,
+      "license": "BSD-3-Clause"
+    },
+    "node_modules/cline/node_modules/@protobufjs/base64": {
+      "version": "1.1.2",
+      "inBundle": true,
+      "license": "BSD-3-Clause"
+    },
+    "node_modules/cline/node_modules/@protobufjs/codegen": {
+      "version": "2.0.4",
+      "inBundle": true,
+      "license": "BSD-3-Clause"
+    },
+    "node_modules/cline/node_modules/@protobufjs/eventemitter": {
+      "version": "1.1.0",
+      "inBundle": true,
+      "license": "BSD-3-Clause"
+    },
+    "node_modules/cline/node_modules/@protobufjs/fetch": {
+      "version": "1.1.0",
+      "inBundle": true,
+      "license": "BSD-3-Clause",
+      "dependencies": {
+        "@protobufjs/aspromise": "^1.1.1",
+        "@protobufjs/inquire": "^1.1.0"
+      }
+    },
+    "node_modules/cline/node_modules/@protobufjs/float": {
+      "version": "1.0.2",
+      "inBundle": true,
+      "license": "BSD-3-Clause"
+    },
+    "node_modules/cline/node_modules/@protobufjs/inquire": {
+      "version": "1.1.0",
+      "inBundle": true,
+      "license": "BSD-3-Clause"
+    },
+    "node_modules/cline/node_modules/@protobufjs/path": {
+      "version": "1.1.2",
+      "inBundle": true,
+      "license": "BSD-3-Clause"
+    },
+    "node_modules/cline/node_modules/@protobufjs/pool": {
+      "version": "1.1.0",
+      "inBundle": true,
+      "license": "BSD-3-Clause"
+    },
+    "node_modules/cline/node_modules/@protobufjs/utf8": {
+      "version": "1.1.0",
+      "inBundle": true,
+      "license": "BSD-3-Clause"
+    },
+    "node_modules/cline/node_modules/@types/node": {
+      "version": "22.15.18",
+      "inBundle": true,
+      "license": "MIT",
+      "dependencies": {
+        "undici-types": "~6.21.0"
+      }
+    },
+    "node_modules/cline/node_modules/ansi-regex": {
+      "version": "5.0.1",
+      "inBundle": true,
+      "license": "MIT",
+      "engines": {
+        "node": ">=8"
+      }
+    },
+    "node_modules/cline/node_modules/ansi-styles": {
+      "version": "4.3.0",
+      "inBundle": true,
+      "license": "MIT",
+      "dependencies": {
+        "color-convert": "^2.0.1"
+      },
+      "engines": {
+        "node": ">=8"
+      },
+      "funding": {
+        "url": "https://github.com/chalk/ansi-styles?sponsor=1"
+      }
+    },
+    "node_modules/cline/node_modules/base64-js": {
+      "version": "1.5.1",
+      "funding": [
+        {
+          "type": "github",
+          "url": "https://github.com/sponsors/feross"
+        },
+        {
+          "type": "patreon",
+          "url": "https://www.patreon.com/feross"
+        },
+        {
+          "type": "consulting",
+          "url": "https://feross.org/support"
+        }
+      ],
+      "inBundle": true,
+      "license": "MIT"
+    },
+    "node_modules/cline/node_modules/better-sqlite3": {
+      "version": "12.2.0",
+      "hasInstallScript": true,
+      "inBundle": true,
+      "license": "MIT",
+      "dependencies": {
+        "bindings": "^1.5.0",
+        "prebuild-install": "^7.1.1"
+      },
+      "engines": {
+        "node": "20.x || 22.x || 23.x || 24.x"
+      }
+    },
+    "node_modules/cline/node_modules/bindings": {
+      "version": "1.5.0",
+      "inBundle": true,
+      "license": "MIT",
+      "dependencies": {
+        "file-uri-to-path": "1.0.0"
+      }
+    },
+    "node_modules/cline/node_modules/bl": {
+      "version": "4.1.0",
+      "inBundle": true,
+      "license": "MIT",
+      "dependencies": {
+        "buffer": "^5.5.0",
+        "inherits": "^2.0.4",
+        "readable-stream": "^3.4.0"
+      }
+    },
+    "node_modules/cline/node_modules/buffer": {
+      "version": "5.7.1",
+      "funding": [
+        {
+          "type": "github",
+          "url": "https://github.com/sponsors/feross"
+        },
+        {
+          "type": "patreon",
+          "url": "https://www.patreon.com/feross"
+        },
+        {
+          "type": "consulting",
+          "url": "https://feross.org/support"
+        }
+      ],
+      "inBundle": true,
+      "license": "MIT",
+      "dependencies": {
+        "base64-js": "^1.3.1",
+        "ieee754": "^1.1.13"
+      }
+    },
+    "node_modules/cline/node_modules/bundle-name": {
+      "version": "4.1.0",
+      "inBundle": true,
+      "license": "MIT",
+      "dependencies": {
+        "run-applescript": "^7.0.0"
+      },
+      "engines": {
+        "node": ">=18"
+      },
+      "funding": {
+        "url": "https://github.com/sponsors/sindresorhus"
+      }
+    },
+    "node_modules/cline/node_modules/chownr": {
+      "version": "1.1.4",
+      "inBundle": true,
+      "license": "ISC"
+    },
+    "node_modules/cline/node_modules/cliui": {
+      "version": "8.0.1",
+      "inBundle": true,
+      "license": "ISC",
+      "dependencies": {
+        "string-width": "^4.2.0",
+        "strip-ansi": "^6.0.1",
+        "wrap-ansi": "^7.0.0"
+      },
+      "engines": {
+        "node": ">=12"
+      }
+    },
+    "node_modules/cline/node_modules/color-convert": {
+      "version": "2.0.1",
+      "inBundle": true,
+      "license": "MIT",
+      "dependencies": {
+        "color-name": "~1.1.4"
+      },
+      "engines": {
+        "node": ">=7.0.0"
+      }
+    },
+    "node_modules/cline/node_modules/color-name": {
+      "version": "1.1.4",
+      "inBundle": true,
+      "license": "MIT"
+    },
+    "node_modules/cline/node_modules/decompress-response": {
+      "version": "6.0.0",
+      "inBundle": true,
+      "license": "MIT",
+      "dependencies": {
+        "mimic-response": "^3.1.0"
+      },
+      "engines": {
+        "node": ">=10"
+      },
+      "funding": {
+        "url": "https://github.com/sponsors/sindresorhus"
+      }
+    },
+    "node_modules/cline/node_modules/deep-extend": {
+      "version": "0.6.0",
+      "inBundle": true,
+      "license": "MIT",
+      "engines": {
+        "node": ">=4.0.0"
+      }
+    },
+    "node_modules/cline/node_modules/default-browser": {
+      "version": "5.2.1",
+      "inBundle": true,
+      "license": "MIT",
+      "dependencies": {
+        "bundle-name": "^4.1.0",
+        "default-browser-id": "^5.0.0"
+      },
+      "engines": {
+        "node": ">=18"
+      },
+      "funding": {
+        "url": "https://github.com/sponsors/sindresorhus"
+      }
+    },
+    "node_modules/cline/node_modules/default-browser-id": {
+      "version": "5.0.0",
+      "inBundle": true,
+      "license": "MIT",
+      "engines": {
+        "node": ">=18"
+      },
+      "funding": {
+        "url": "https://github.com/sponsors/sindresorhus"
+      }
+    },
+    "node_modules/cline/node_modules/define-lazy-prop": {
+      "version": "3.0.0",
+      "inBundle": true,
+      "license": "MIT",
+      "engines": {
+        "node": ">=12"
+      },
+      "funding": {
+        "url": "https://github.com/sponsors/sindresorhus"
+      }
+    },
+    "node_modules/cline/node_modules/detect-libc": {
+      "version": "2.0.4",
+      "inBundle": true,
+      "license": "Apache-2.0",
+      "engines": {
+        "node": ">=8"
+      }
+    },
+    "node_modules/cline/node_modules/emoji-regex": {
+      "version": "8.0.0",
+      "inBundle": true,
+      "license": "MIT"
+    },
+    "node_modules/cline/node_modules/end-of-stream": {
+      "version": "1.4.5",
+      "inBundle": true,
+      "license": "MIT",
+      "dependencies": {
+        "once": "^1.4.0"
+      }
+    },
+    "node_modules/cline/node_modules/escalade": {
+      "version": "3.2.0",
+      "inBundle": true,
+      "license": "MIT",
+      "engines": {
+        "node": ">=6"
+      }
+    },
+    "node_modules/cline/node_modules/expand-template": {
+      "version": "2.0.3",
+      "inBundle": true,
+      "license": "(MIT OR WTFPL)",
+      "engines": {
+        "node": ">=6"
+      }
+    },
+    "node_modules/cline/node_modules/file-uri-to-path": {
+      "version": "1.0.0",
+      "inBundle": true,
+      "license": "MIT"
+    },
+    "node_modules/cline/node_modules/fs-constants": {
+      "version": "1.0.0",
+      "inBundle": true,
+      "license": "MIT"
+    },
+    "node_modules/cline/node_modules/get-caller-file": {
+      "version": "2.0.5",
+      "inBundle": true,
+      "license": "ISC",
+      "engines": {
+        "node": "6.* || 8.* || >= 10.*"
+      }
+    },
+    "node_modules/cline/node_modules/github-from-package": {
+      "version": "0.0.0",
+      "inBundle": true,
+      "license": "MIT"
+    },
+    "node_modules/cline/node_modules/grpc-health-check": {
+      "version": "2.0.2",
+      "inBundle": true,
+      "license": "Apache-2.0",
+      "dependencies": {
+        "@grpc/proto-loader": "^0.7.13"
+      }
+    },
+    "node_modules/cline/node_modules/ieee754": {
+      "version": "1.2.1",
+      "funding": [
+        {
+          "type": "github",
+          "url": "https://github.com/sponsors/feross"
+        },
+        {
+          "type": "patreon",
+          "url": "https://www.patreon.com/feross"
+        },
+        {
+          "type": "consulting",
+          "url": "https://feross.org/support"
+        }
+      ],
+      "inBundle": true,
+      "license": "BSD-3-Clause"
+    },
+    "node_modules/cline/node_modules/inherits": {
+      "version": "2.0.4",
+      "inBundle": true,
+      "license": "ISC"
+    },
+    "node_modules/cline/node_modules/ini": {
+      "version": "1.3.8",
+      "inBundle": true,
+      "license": "ISC"
+    },
+    "node_modules/cline/node_modules/is-docker": {
+      "version": "3.0.0",
+      "inBundle": true,
+      "license": "MIT",
+      "bin": {
+        "is-docker": "cli.js"
+      },
+      "engines": {
+        "node": "^12.20.0 || ^14.13.1 || >=16.0.0"
+      },
+      "funding": {
+        "url": "https://github.com/sponsors/sindresorhus"
+      }
+    },
+    "node_modules/cline/node_modules/is-fullwidth-code-point": {
+      "version": "3.0.0",
+      "inBundle": true,
+      "license": "MIT",
+      "engines": {
+        "node": ">=8"
+      }
+    },
+    "node_modules/cline/node_modules/is-inside-container": {
+      "version": "1.0.0",
+      "inBundle": true,
+      "license": "MIT",
+      "dependencies": {
+        "is-docker": "^3.0.0"
+      },
+      "bin": {
+        "is-inside-container": "cli.js"
+      },
+      "engines": {
+        "node": ">=14.16"
+      },
+      "funding": {
+        "url": "https://github.com/sponsors/sindresorhus"
+      }
+    },
+    "node_modules/cline/node_modules/is-wsl": {
+      "version": "3.1.0",
+      "inBundle": true,
+      "license": "MIT",
+      "dependencies": {
+        "is-inside-container": "^1.0.0"
+      },
+      "engines": {
+        "node": ">=16"
+      },
+      "funding": {
+        "url": "https://github.com/sponsors/sindresorhus"
+      }
+    },
+    "node_modules/cline/node_modules/lodash.camelcase": {
+      "version": "4.3.0",
+      "inBundle": true,
+      "license": "MIT"
+    },
+    "node_modules/cline/node_modules/long": {
+      "version": "5.3.2",
+      "inBundle": true,
+      "license": "Apache-2.0"
+    },
+    "node_modules/cline/node_modules/mimic-response": {
+      "version": "3.1.0",
+      "inBundle": true,
+      "license": "MIT",
+      "engines": {
+        "node": ">=10"
+      },
+      "funding": {
+        "url": "https://github.com/sponsors/sindresorhus"
+      }
+    },
+    "node_modules/cline/node_modules/minimist": {
+      "version": "1.2.8",
+      "inBundle": true,
+      "license": "MIT",
+      "funding": {
+        "url": "https://github.com/sponsors/ljharb"
+      }
+    },
+    "node_modules/cline/node_modules/mkdirp-classic": {
+      "version": "0.5.3",
+      "inBundle": true,
+      "license": "MIT"
+    },
+    "node_modules/cline/node_modules/napi-build-utils": {
+      "version": "2.0.0",
+      "inBundle": true,
+      "license": "MIT"
+    },
+    "node_modules/cline/node_modules/node-abi": {
+      "version": "3.77.0",
+      "inBundle": true,
+      "license": "MIT",
+      "dependencies": {
+        "semver": "^7.3.5"
+      },
+      "engines": {
+        "node": ">=10"
+      }
+    },
+    "node_modules/cline/node_modules/once": {
+      "version": "1.4.0",
+      "inBundle": true,
+      "license": "ISC",
+      "dependencies": {
+        "wrappy": "1"
+      }
+    },
+    "node_modules/cline/node_modules/open": {
+      "version": "10.1.2",
+      "inBundle": true,
+      "license": "MIT",
+      "dependencies": {
+        "default-browser": "^5.2.1",
+        "define-lazy-prop": "^3.0.0",
+        "is-inside-container": "^1.0.0",
+        "is-wsl": "^3.1.0"
+      },
+      "engines": {
+        "node": ">=18"
+      },
+      "funding": {
+        "url": "https://github.com/sponsors/sindresorhus"
+      }
+    },
+    "node_modules/cline/node_modules/prebuild-install": {
+      "version": "7.1.3",
+      "inBundle": true,
+      "license": "MIT",
+      "dependencies": {
+        "detect-libc": "^2.0.0",
+        "expand-template": "^2.0.3",
+        "github-from-package": "0.0.0",
+        "minimist": "^1.2.3",
+        "mkdirp-classic": "^0.5.3",
+        "napi-build-utils": "^2.0.0",
+        "node-abi": "^3.3.0",
+        "pump": "^3.0.0",
+        "rc": "^1.2.7",
+        "simple-get": "^4.0.0",
+        "tar-fs": "^2.0.0",
+        "tunnel-agent": "^0.6.0"
+      },
+      "bin": {
+        "prebuild-install": "bin.js"
+      },
+      "engines": {
+        "node": ">=10"
+      }
+    },
+    "node_modules/cline/node_modules/protobufjs": {
+      "version": "7.5.2",
+      "hasInstallScript": true,
+      "inBundle": true,
+      "license": "BSD-3-Clause",
+      "dependencies": {
+        "@protobufjs/aspromise": "^1.1.2",
+        "@protobufjs/base64": "^1.1.2",
+        "@protobufjs/codegen": "^2.0.4",
+        "@protobufjs/eventemitter": "^1.1.0",
+        "@protobufjs/fetch": "^1.1.0",
+        "@protobufjs/float": "^1.0.2",
+        "@protobufjs/inquire": "^1.1.0",
+        "@protobufjs/path": "^1.1.2",
+        "@protobufjs/pool": "^1.1.0",
+        "@protobufjs/utf8": "^1.1.0",
+        "@types/node": ">=13.7.0",
+        "long": "^5.0.0"
+      },
+      "engines": {
+        "node": ">=12.0.0"
+      }
+    },
+    "node_modules/cline/node_modules/pump": {
+      "version": "3.0.3",
+      "inBundle": true,
+      "license": "MIT",
+      "dependencies": {
+        "end-of-stream": "^1.1.0",
+        "once": "^1.3.1"
+      }
+    },
+    "node_modules/cline/node_modules/rc": {
+      "version": "1.2.8",
+      "inBundle": true,
+      "license": "(BSD-2-Clause OR MIT OR Apache-2.0)",
+      "dependencies": {
+        "deep-extend": "^0.6.0",
+        "ini": "~1.3.0",
+        "minimist": "^1.2.0",
+        "strip-json-comments": "~2.0.1"
+      },
+      "bin": {
+        "rc": "cli.js"
+      }
+    },
+    "node_modules/cline/node_modules/readable-stream": {
+      "version": "3.6.2",
+      "inBundle": true,
+      "license": "MIT",
+      "dependencies": {
+        "inherits": "^2.0.3",
+        "string_decoder": "^1.1.1",
+        "util-deprecate": "^1.0.1"
+      },
+      "engines": {
+        "node": ">= 6"
+      }
+    },
+    "node_modules/cline/node_modules/require-directory": {
+      "version": "2.1.1",
+      "inBundle": true,
+      "license": "MIT",
+      "engines": {
+        "node": ">=0.10.0"
+      }
+    },
+    "node_modules/cline/node_modules/run-applescript": {
+      "version": "7.0.0",
+      "inBundle": true,
+      "license": "MIT",
+      "engines": {
+        "node": ">=18"
+      },
+      "funding": {
+        "url": "https://github.com/sponsors/sindresorhus"
+      }
+    },
+    "node_modules/cline/node_modules/safe-buffer": {
+      "version": "5.2.1",
+      "funding": [
+        {
+          "type": "github",
+          "url": "https://github.com/sponsors/feross"
+        },
+        {
+          "type": "patreon",
+          "url": "https://www.patreon.com/feross"
+        },
+        {
+          "type": "consulting",
+          "url": "https://feross.org/support"
+        }
+      ],
+      "inBundle": true,
+      "license": "MIT"
+    },
+    "node_modules/cline/node_modules/semver": {
+      "version": "7.7.2",
+      "inBundle": true,
+      "license": "ISC",
+      "bin": {
+        "semver": "bin/semver.js"
+      },
+      "engines": {
+        "node": ">=10"
+      }
+    },
+    "node_modules/cline/node_modules/simple-concat": {
+      "version": "1.0.1",
+      "funding": [
+        {
+          "type": "github",
+          "url": "https://github.com/sponsors/feross"
+        },
+        {
+          "type": "patreon",
+          "url": "https://www.patreon.com/feross"
+        },
+        {
+          "type": "consulting",
+          "url": "https://feross.org/support"
+        }
+      ],
+      "inBundle": true,
+      "license": "MIT"
+    },
+    "node_modules/cline/node_modules/simple-get": {
+      "version": "4.0.1",
+      "funding": [
+        {
+          "type": "github",
+          "url": "https://github.com/sponsors/feross"
+        },
+        {
+          "type": "patreon",
+          "url": "https://www.patreon.com/feross"
+        },
+        {
+          "type": "consulting",
+          "url": "https://feross.org/support"
+        }
+      ],
+      "inBundle": true,
+      "license": "MIT",
+      "dependencies": {
+        "decompress-response": "^6.0.0",
+        "once": "^1.3.1",
+        "simple-concat": "^1.0.0"
+      }
+    },
+    "node_modules/cline/node_modules/string_decoder": {
+      "version": "1.3.0",
+      "inBundle": true,
+      "license": "MIT",
+      "dependencies": {
+        "safe-buffer": "~5.2.0"
+      }
+    },
+    "node_modules/cline/node_modules/string-width": {
+      "version": "4.2.3",
+      "inBundle": true,
+      "license": "MIT",
+      "dependencies": {
+        "emoji-regex": "^8.0.0",
+        "is-fullwidth-code-point": "^3.0.0",
+        "strip-ansi": "^6.0.1"
+      },
+      "engines": {
+        "node": ">=8"
+      }
+    },
+    "node_modules/cline/node_modules/strip-ansi": {
+      "version": "6.0.1",
+      "inBundle": true,
+      "license": "MIT",
+      "dependencies": {
+        "ansi-regex": "^5.0.1"
+      },
+      "engines": {
+        "node": ">=8"
+      }
+    },
+    "node_modules/cline/node_modules/strip-json-comments": {
+      "version": "2.0.1",
+      "inBundle": true,
+      "license": "MIT",
+      "engines": {
+        "node": ">=0.10.0"
+      }
+    },
+    "node_modules/cline/node_modules/tar-fs": {
+      "version": "2.1.3",
+      "inBundle": true,
+      "license": "MIT",
+      "dependencies": {
+        "chownr": "^1.1.1",
+        "mkdirp-classic": "^0.5.2",
+        "pump": "^3.0.0",
+        "tar-stream": "^2.1.4"
+      }
+    },
+    "node_modules/cline/node_modules/tar-stream": {
+      "version": "2.2.0",
+      "inBundle": true,
+      "license": "MIT",
+      "dependencies": {
+        "bl": "^4.0.3",
+        "end-of-stream": "^1.4.1",
+        "fs-constants": "^1.0.0",
+        "inherits": "^2.0.3",
+        "readable-stream": "^3.1.1"
+      },
+      "engines": {
+        "node": ">=6"
+      }
+    },
+    "node_modules/cline/node_modules/tunnel-agent": {
+      "version": "0.6.0",
+      "inBundle": true,
+      "license": "Apache-2.0",
+      "dependencies": {
+        "safe-buffer": "^5.0.1"
+      },
+      "engines": {
+        "node": "*"
+      }
+    },
+    "node_modules/cline/node_modules/undici-types": {
+      "version": "6.21.0",
+      "inBundle": true,
+      "license": "MIT"
+    },
+    "node_modules/cline/node_modules/util-deprecate": {
+      "version": "1.0.2",
+      "inBundle": true,
+      "license": "MIT"
+    },
+    "node_modules/cline/node_modules/vscode-uri": {
+      "version": "3.1.0",
+      "inBundle": true,
+      "license": "MIT"
+    },
+    "node_modules/cline/node_modules/wrap-ansi": {
+      "version": "7.0.0",
+      "inBundle": true,
+      "license": "MIT",
+      "dependencies": {
+        "ansi-styles": "^4.0.0",
+        "string-width": "^4.1.0",
+        "strip-ansi": "^6.0.0"
+      },
+      "engines": {
+        "node": ">=10"
+      },
+      "funding": {
+        "url": "https://github.com/chalk/wrap-ansi?sponsor=1"
+      }
+    },
+    "node_modules/cline/node_modules/wrappy": {
+      "version": "1.0.2",
+      "inBundle": true,
+      "license": "ISC"
+    },
+    "node_modules/cline/node_modules/y18n": {
+      "version": "5.0.8",
+      "inBundle": true,
+      "license": "ISC",
+      "engines": {
+        "node": ">=10"
+      }
+    },
+    "node_modules/cline/node_modules/yargs": {
+      "version": "17.7.2",
+      "inBundle": true,
+      "license": "MIT",
+      "dependencies": {
+        "cliui": "^8.0.1",
+        "escalade": "^3.1.1",
+        "get-caller-file": "^2.0.5",
+        "require-directory": "^2.1.1",
+        "string-width": "^4.2.3",
+        "y18n": "^5.0.5",
+        "yargs-parser": "^21.1.1"
+      },
+      "engines": {
+        "node": ">=12"
+      }
+    },
+    "node_modules/cline/node_modules/yargs-parser": {
+      "version": "21.1.1",
+      "inBundle": true,
+      "license": "ISC",
+      "engines": {
+        "node": ">=12"
+      }
+    },
     "node_modules/cliui": {
       "version": "8.0.1",
       "resolved": "https://registry.npmjs.org/cliui/-/cliui-8.0.1.tgz",
@@ -1750,6 +2650,528 @@
       "resolved": "https://registry.npmjs.org/cli-spinners/-/cli-spinners-2.9.2.tgz",
       "integrity": "sha512-ywqV+5MmyL4E7ybXgKys4DugZbX0FC6LnwrhjuykIjnK9k8OQacQ7axGKnjDXWNhns0xot3bZI5h55H8yo9cJg=="
     },
+    "cline": {
+      "version": "1.0.1",
+      "resolved": "https://registry.npmjs.org/cline/-/cline-1.0.1.tgz",
+      "integrity": "sha512-2ON8BaRqNINpl4l3FeS9fOA47fq96GNUvYZ/Kfm6IaFsOHAE3DHIg0FDZVKYJn7VeHQcupPcsuJZr7ziONBbUw==",
+      "requires": {
+        "@grpc/grpc-js": "^1.13.3",
+        "@grpc/reflection": "^1.0.4",
+        "better-sqlite3": "^12.2.0",
+        "grpc-health-check": "^2.0.2",
+        "open": "^10.1.2",
+        "vscode-uri": "^3.1.0"
+      },
+      "dependencies": {
+        "@grpc/grpc-js": {
+          "version": "1.13.3",
+          "bundled": true,
+          "requires": {
+            "@grpc/proto-loader": "^0.7.13",
+            "@js-sdsl/ordered-map": "^4.4.2"
+          }
+        },
+        "@grpc/proto-loader": {
+          "version": "0.7.15",
+          "bundled": true,
+          "requires": {
+            "lodash.camelcase": "^4.3.0",
+            "long": "^5.0.0",
+            "protobufjs": "^7.2.5",
+            "yargs": "^17.7.2"
+          }
+        },
+        "@grpc/reflection": {
+          "version": "1.0.4",
+          "bundled": true,
+          "requires": {
+            "@grpc/proto-loader": "^0.7.13",
+            "protobufjs": "^7.2.5"
+          }
+        },
+        "@js-sdsl/ordered-map": {
+          "version": "4.4.2",
+          "bundled": true
+        },
+        "@protobufjs/aspromise": {
+          "version": "1.1.2",
+          "bundled": true
+        },
+        "@protobufjs/base64": {
+          "version": "1.1.2",
+          "bundled": true
+        },
+        "@protobufjs/codegen": {
+          "version": "2.0.4",
+          "bundled": true
+        },
+        "@protobufjs/eventemitter": {
+          "version": "1.1.0",
+          "bundled": true
+        },
+        "@protobufjs/fetch": {
+          "version": "1.1.0",
+          "bundled": true,
+          "requires": {
+            "@protobufjs/aspromise": "^1.1.1",
+            "@protobufjs/inquire": "^1.1.0"
+          }
+        },
+        "@protobufjs/float": {
+          "version": "1.0.2",
+          "bundled": true
+        },
+        "@protobufjs/inquire": {
+          "version": "1.1.0",
+          "bundled": true
+        },
+        "@protobufjs/path": {
+          "version": "1.1.2",
+          "bundled": true
+        },
+        "@protobufjs/pool": {
+          "version": "1.1.0",
+          "bundled": true
+        },
+        "@protobufjs/utf8": {
+          "version": "1.1.0",
+          "bundled": true
+        },
+        "@types/node": {
+          "version": "22.15.18",
+          "bundled": true,
+          "requires": {
+            "undici-types": "~6.21.0"
+          }
+        },
+        "ansi-regex": {
+          "version": "5.0.1",
+          "bundled": true
+        },
+        "ansi-styles": {
+          "version": "4.3.0",
+          "bundled": true,
+          "requires": {
+            "color-convert": "^2.0.1"
+          }
+        },
+        "base64-js": {
+          "version": "1.5.1",
+          "bundled": true
+        },
+        "better-sqlite3": {
+          "version": "12.2.0",
+          "bundled": true,
+          "requires": {
+            "bindings": "^1.5.0",
+            "prebuild-install": "^7.1.1"
+          }
+        },
+        "bindings": {
+          "version": "1.5.0",
+          "bundled": true,
+          "requires": {
+            "file-uri-to-path": "1.0.0"
+          }
+        },
+        "bl": {
+          "version": "4.1.0",
+          "bundled": true,
+          "requires": {
+            "buffer": "^5.5.0",
+            "inherits": "^2.0.4",
+            "readable-stream": "^3.4.0"
+          }
+        },
+        "buffer": {
+          "version": "5.7.1",
+          "bundled": true,
+          "requires": {
+            "base64-js": "^1.3.1",
+            "ieee754": "^1.1.13"
+          }
+        },
+        "bundle-name": {
+          "version": "4.1.0",
+          "bundled": true,
+          "requires": {
+            "run-applescript": "^7.0.0"
+          }
+        },
+        "chownr": {
+          "version": "1.1.4",
+          "bundled": true
+        },
+        "cliui": {
+          "version": "8.0.1",
+          "bundled": true,
+          "requires": {
+            "string-width": "^4.2.0",
+            "strip-ansi": "^6.0.1",
+            "wrap-ansi": "^7.0.0"
+          }
+        },
+        "color-convert": {
+          "version": "2.0.1",
+          "bundled": true,
+          "requires": {
+            "color-name": "~1.1.4"
+          }
+        },
+        "color-name": {
+          "version": "1.1.4",
+          "bundled": true
+        },
+        "decompress-response": {
+          "version": "6.0.0",
+          "bundled": true,
+          "requires": {
+            "mimic-response": "^3.1.0"
+          }
+        },
+        "deep-extend": {
+          "version": "0.6.0",
+          "bundled": true
+        },
+        "default-browser": {
+          "version": "5.2.1",
+          "bundled": true,
+          "requires": {
+            "bundle-name": "^4.1.0",
+            "default-browser-id": "^5.0.0"
+          }
+        },
+        "default-browser-id": {
+          "version": "5.0.0",
+          "bundled": true
+        },
+        "define-lazy-prop": {
+          "version": "3.0.0",
+          "bundled": true
+        },
+        "detect-libc": {
+          "version": "2.0.4",
+          "bundled": true
+        },
+        "emoji-regex": {
+          "version": "8.0.0",
+          "bundled": true
+        },
+        "end-of-stream": {
+          "version": "1.4.5",
+          "bundled": true,
+          "requires": {
+            "once": "^1.4.0"
+          }
+        },
+        "escalade": {
+          "version": "3.2.0",
+          "bundled": true
+        },
+        "expand-template": {
+          "version": "2.0.3",
+          "bundled": true
+        },
+        "file-uri-to-path": {
+          "version": "1.0.0",
+          "bundled": true
+        },
+        "fs-constants": {
+          "version": "1.0.0",
+          "bundled": true
+        },
+        "get-caller-file": {
+          "version": "2.0.5",
+          "bundled": true
+        },
+        "github-from-package": {
+          "version": "0.0.0",
+          "bundled": true
+        },
+        "grpc-health-check": {
+          "version": "2.0.2",
+          "bundled": true,
+          "requires": {
+            "@grpc/proto-loader": "^0.7.13"
+          }
+        },
+        "ieee754": {
+          "version": "1.2.1",
+          "bundled": true
+        },
+        "inherits": {
+          "version": "2.0.4",
+          "bundled": true
+        },
+        "ini": {
+          "version": "1.3.8",
+          "bundled": true
+        },
+        "is-docker": {
+          "version": "3.0.0",
+          "bundled": true
+        },
+        "is-fullwidth-code-point": {
+          "version": "3.0.0",
+          "bundled": true
+        },
+        "is-inside-container": {
+          "version": "1.0.0",
+          "bundled": true,
+          "requires": {
+            "is-docker": "^3.0.0"
+          }
+        },
+        "is-wsl": {
+          "version": "3.1.0",
+          "bundled": true,
+          "requires": {
+            "is-inside-container": "^1.0.0"
+          }
+        },
+        "lodash.camelcase": {
+          "version": "4.3.0",
+          "bundled": true
+        },
+        "long": {
+          "version": "5.3.2",
+          "bundled": true
+        },
+        "mimic-response": {
+          "version": "3.1.0",
+          "bundled": true
+        },
+        "minimist": {
+          "version": "1.2.8",
+          "bundled": true
+        },
+        "mkdirp-classic": {
+          "version": "0.5.3",
+          "bundled": true
+        },
+        "napi-build-utils": {
+          "version": "2.0.0",
+          "bundled": true
+        },
+        "node-abi": {
+          "version": "3.77.0",
+          "bundled": true,
+          "requires": {
+            "semver": "^7.3.5"
+          }
+        },
+        "once": {
+          "version": "1.4.0",
+          "bundled": true,
+          "requires": {
+            "wrappy": "1"
+          }
+        },
+        "open": {
+          "version": "10.1.2",
+          "bundled": true,
+          "requires": {
+            "default-browser": "^5.2.1",
+            "define-lazy-prop": "^3.0.0",
+            "is-inside-container": "^1.0.0",
+            "is-wsl": "^3.1.0"
+          }
+        },
+        "prebuild-install": {
+          "version": "7.1.3",
+          "bundled": true,
+          "requires": {
+            "detect-libc": "^2.0.0",
+            "expand-template": "^2.0.3",
+            "github-from-package": "0.0.0",
+            "minimist": "^1.2.3",
+            "mkdirp-classic": "^0.5.3",
+            "napi-build-utils": "^2.0.0",
+            "node-abi": "^3.3.0",
+            "pump": "^3.0.0",
+            "rc": "^1.2.7",
+            "simple-get": "^4.0.0",
+            "tar-fs": "^2.0.0",
+            "tunnel-agent": "^0.6.0"
+          }
+        },
+        "protobufjs": {
+          "version": "7.5.2",
+          "bundled": true,
+          "requires": {
+            "@protobufjs/aspromise": "^1.1.2",
+            "@protobufjs/base64": "^1.1.2",
+            "@protobufjs/codegen": "^2.0.4",
+            "@protobufjs/eventemitter": "^1.1.0",
+            "@protobufjs/fetch": "^1.1.0",
+            "@protobufjs/float": "^1.0.2",
+            "@protobufjs/inquire": "^1.1.0",
+            "@protobufjs/path": "^1.1.2",
+            "@protobufjs/pool": "^1.1.0",
+            "@protobufjs/utf8": "^1.1.0",
+            "@types/node": ">=13.7.0",
+            "long": "^5.0.0"
+          }
+        },
+        "pump": {
+          "version": "3.0.3",
+          "bundled": true,
+          "requires": {
+            "end-of-stream": "^1.1.0",
+            "once": "^1.3.1"
+          }
+        },
+        "rc": {
+          "version": "1.2.8",
+          "bundled": true,
+          "requires": {
+            "deep-extend": "^0.6.0",
+            "ini": "~1.3.0",
+            "minimist": "^1.2.0",
+            "strip-json-comments": "~2.0.1"
+          }
+        },
+        "readable-stream": {
+          "version": "3.6.2",
+          "bundled": true,
+          "requires": {
+            "inherits": "^2.0.3",
+            "string_decoder": "^1.1.1",
+            "util-deprecate": "^1.0.1"
+          }
+        },
+        "require-directory": {
+          "version": "2.1.1",
+          "bundled": true
+        },
+        "run-applescript": {
+          "version": "7.0.0",
+          "bundled": true
+        },
+        "safe-buffer": {
+          "version": "5.2.1",
+          "bundled": true
+        },
+        "semver": {
+          "version": "7.7.2",
+          "bundled": true
+        },
+        "simple-concat": {
+          "version": "1.0.1",
+          "bundled": true
+        },
+        "simple-get": {
+          "version": "4.0.1",
+          "bundled": true,
+          "requires": {
+            "decompress-response": "^6.0.0",
+            "once": "^1.3.1",
+            "simple-concat": "^1.0.0"
+          }
+        },
+        "string_decoder": {
+          "version": "1.3.0",
+          "bundled": true,
+          "requires": {
+            "safe-buffer": "~5.2.0"
+          }
+        },
+        "string-width": {
+          "version": "4.2.3",
+          "bundled": true,
+          "requires": {
+            "emoji-regex": "^8.0.0",
+            "is-fullwidth-code-point": "^3.0.0",
+            "strip-ansi": "^6.0.1"
+          }
+        },
+        "strip-ansi": {
+          "version": "6.0.1",
+          "bundled": true,
+          "requires": {
+            "ansi-regex": "^5.0.1"
+          }
+        },
+        "strip-json-comments": {
+          "version": "2.0.1",
+          "bundled": true
+        },
+        "tar-fs": {
+          "version": "2.1.3",
+          "bundled": true,
+          "requires": {
+            "chownr": "^1.1.1",
+            "mkdirp-classic": "^0.5.2",
+            "pump": "^3.0.0",
+            "tar-stream": "^2.1.4"
+          }
+        },
+        "tar-stream": {
+          "version": "2.2.0",
+          "bundled": true,
+          "requires": {
+            "bl": "^4.0.3",
+            "end-of-stream": "^1.4.1",
+            "fs-constants": "^1.0.0",
+            "inherits": "^2.0.3",
+            "readable-stream": "^3.1.1"
+          }
+        },
+        "tunnel-agent": {
+          "version": "0.6.0",
+          "bundled": true,
+          "requires": {
+            "safe-buffer": "^5.0.1"
+          }
+        },
+        "undici-types": {
+          "version": "6.21.0",
+          "bundled": true
+        },
+        "util-deprecate": {
+          "version": "1.0.2",
+          "bundled": true
+        },
+        "vscode-uri": {
+          "version": "3.1.0",
+          "bundled": true
+        },
+        "wrap-ansi": {
+          "version": "7.0.0",
+          "bundled": true,
+          "requires": {
+            "ansi-styles": "^4.0.0",
+            "string-width": "^4.1.0",
+            "strip-ansi": "^6.0.0"
+          }
+        },
+        "wrappy": {
+          "version": "1.0.2",
+          "bundled": true
+        },
+        "y18n": {
+          "version": "5.0.8",
+          "bundled": true
+        },
+        "yargs": {
+          "version": "17.7.2",
+          "bundled": true,
+          "requires": {
+            "cliui": "^8.0.1",
+            "escalade": "^3.1.1",
+            "get-caller-file": "^2.0.5",
+            "require-directory": "^2.1.1",
+            "string-width": "^4.2.3",
+            "y18n": "^5.0.5",
+            "yargs-parser": "^21.1.1"
+          }
+        },
+        "yargs-parser": {
+          "version": "21.1.1",
+          "bundled": true
+        }
+      }
+    },
     "cliui": {
       "version": "8.0.1",
       "resolved": "https://registry.npmjs.org/cliui/-/cliui-8.0.1.tgz",

+ 2 - 1
evals/package.json

@@ -30,7 +30,8 @@
     "sqlite": "^4.1.2",
     "tiktoken": "^1.0.21",
     "uuid": "^9.0.0",
-    "yargs": "^17.6.2"
+    "yargs": "^17.6.2",
+    "cline": "^1.0.1"
   },
   "devDependencies": {
     "@types/better-sqlite3": "^7.6.3",