This directory contains the evaluation system for benchmarking Cline against various coding evaluation frameworks.
The Cline Evaluation System lets you run benchmark tasks against Cline, collect results, and generate comparison reports across models.

The evaluation system consists of two main components:

- evals/cli/ – the CLI tool for orchestrating evaluations
- evals/diff-edits/cases/ – the folder containing all of the conversation JSONs used as diff-edit test cases

evals/ # Main directory for evaluation system
├── cli/ # CLI tool for orchestrating evaluations
│ └── src/
│ ├── index.ts # CLI entry point
│ ├── commands/ # CLI commands (setup, run, report)
│ ├── adapters/ # Benchmark adapters
│ ├── db/ # Database management
│ └── utils/ # Utility functions
├── diff-edits/ # Diff editing evaluation suite
│ ├── cases/ # Test case JSON files
│ ├── results/ # Evaluation results
│ ├── diff-apply/ # Diff application logic
│ ├── parsing/ # Assistant message parsing
│ └── prompts/ # System prompts
├── repositories/ # Cloned benchmark repositories
│ └── exercism/ # Exercism (Aider Polyglot)
├── results/ # Evaluation results storage
│ ├── runs/ # Individual run results
│ └── reports/ # Generated reports
└── README.md # This file
Build the CLI tool:
cd evals
npm install
npm run build:cli
Then set up the benchmark repositories:
cd evals/cli
node dist/index.js setup
This clones and sets up all benchmark repositories. You can also limit setup to specific benchmarks:
node dist/index.js setup --benchmarks exercism
Run an evaluation against a benchmark:
node dist/index.js run --benchmark exercism --count 10
Options:
- --benchmark: Specific benchmark to run (default: exercism)
- --count: Number of tasks to run (default: all available tasks)

Note: Model selection is currently configured through the Cline CLI itself, not through evaluation flags.
Generate a report of the results:
node dist/index.js report
Options:
- --format: Report format (json, markdown) (default: markdown)
- --output: Output path for the report

The following benchmarks are supported:

- Exercism (Aider Polyglot): Modified Exercism exercises from the polyglot-benchmark repository. These are small, focused programming exercises in various languages.
- SWE-bench: Real-world software engineering tasks from the SWE-bench repository.
- SWELancer: Freelance-style programming tasks from the SWELancer benchmark.
- Multi-SWE-Bench: Multi-file software engineering tasks from the Multi-SWE-Bench repository.
The Cline Evaluation System includes a specialized suite for evaluating how well models can make precise edits to files using the replace_in_file tool.
Diff edit evaluations test a model's ability to produce replace_in_file tool calls that apply cleanly to the original file contents.
diff-edits/
├── cases/ # Test case JSON files
├── results/ # Evaluation results
├── ClineWrapper.ts # Wrapper for model interaction
├── TestRunner.ts # Main test execution logic
├── types.ts # Type definitions
├── diff-apply/ # Diff application logic
├── parsing/ # Assistant message parsing
└── prompts/ # System prompts
Test cases are defined as JSON files in the diff-edits/cases/ directory. Each test case should include:
{
"test_id": "example_test_1",
"messages": [
{
"role": "user",
"text": "Please fix the bug in this code...",
"images": []
},
{
"role": "assistant",
"text": "I'll help you fix that bug..."
}
],
"file_contents": "// Original file content here\nfunction example() {\n // Code with bug\n}",
"file_path": "src/example.js",
"system_prompt_details": {
"mcp_string": "",
"cwd_value": "/path/to/working/directory",
"browser_use": false,
"width": 900,
"height": 600,
"os_value": "macOS",
"shell_value": "/bin/zsh",
"home_value": "/Users/username",
"user_custom_instructions": ""
},
"original_diff_edit_tool_call_message": ""
}
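
For reference, the test case shape above can be written as a TypeScript interface. The sketch below is derived only from the JSON example; the authoritative type definitions live in diff-edits/types.ts, and the interface names used here (DiffEditTestCase, TestMessage, SystemPromptDetails) are illustrative placeholders.

```typescript
// Illustrative sketch of the test case shape shown above.
// The authoritative definitions live in diff-edits/types.ts; names here are placeholders.

interface TestMessage {
  role: "user" | "assistant";
  text: string;
  images?: string[]; // present on the user message in the example above
}

interface SystemPromptDetails {
  mcp_string: string;
  cwd_value: string;
  browser_use: boolean;
  width: number;
  height: number;
  os_value: string;
  shell_value: string;
  home_value: string;
  user_custom_instructions: string;
}

interface DiffEditTestCase {
  test_id: string;
  messages: TestMessage[];
  file_contents: string; // original contents of the file to be edited
  file_path: string;     // path of that file within the workspace
  system_prompt_details: SystemPromptDetails;
  // Pre-recorded tool call text; assumption: this is what --replay runs consume
  original_diff_edit_tool_call_message: string;
}
```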
Run a diff edit evaluation against a single model:
cd evals/cli
node dist/index.js run-diff-eval --model-ids "anthropic/claude-3-5-sonnet-20241022"
Compare multiple models in a single evaluation run:
# Compare Claude and Grok models
node dist/index.js run-diff-eval \
--model-ids "anthropic/claude-3-5-sonnet-20241022,x-ai/grok-beta" \
--max-cases 10 \
--valid-attempts-per-case 3 \
--verbose
# Compare multiple Claude variants
node dist/index.js run-diff-eval \
--model-ids "anthropic/claude-3-5-sonnet-20241022,anthropic/claude-3-5-haiku-20241022,anthropic/claude-3-opus-20240229" \
--max-cases 5 \
--valid-attempts-per-case 2 \
--parallel
Options:

- --model-ids: Comma-separated list of model IDs to evaluate (required)
- --system-prompt-name: System prompt to use (default: "basicSystemPrompt")
- --valid-attempts-per-case: Number of attempts per test case per model (default: 1)
- --max-cases: Maximum number of test cases to run (default: all available)
- --parsing-function: Function to parse assistant messages (default: "parseAssistantMessageV2")
- --diff-edit-function: Function to apply diffs (default: "constructNewFileContentV2")
- --test-path: Path to test cases (default: diff-edits/cases)
- --thinking-budget: Tokens allocated for thinking (default: 0)
- --parallel: Run tests in parallel (flag)
- --replay: Use pre-recorded LLM output (flag)
- --verbose: Enable detailed logging (flag)

More examples:

# Quick test with 2 models, 4 cases, 2 attempts each
node dist/index.js run-diff-eval \
--model-ids "anthropic/claude-3-5-sonnet-20241022,x-ai/grok-beta" \
--max-cases 4 \
--valid-attempts-per-case 2 \
--verbose
# Comprehensive evaluation with parallel execution
node dist/index.js run-diff-eval \
--model-ids "anthropic/claude-3-5-sonnet-20241022,anthropic/claude-3-5-haiku-20241022" \
--system-prompt-name claude4SystemPrompt \
--valid-attempts-per-case 5 \
--max-cases 20 \
--parallel \
--verbose
All evaluation results are automatically stored in a SQLite database (diff-edits/evals.db) for advanced analytics and comparison across models and runs.
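
If you want to poke at the database directly, any SQLite client will do. The snippet below is a minimal sketch using the better-sqlite3 package (an external dependency, not part of the evals tooling) to open diff-edits/evals.db and list its tables; it discovers table names at runtime rather than assuming the schema.

```typescript
// Minimal sketch: inspect the evaluation database with better-sqlite3
// (an external package, not required by the evals CLI itself).
import Database from "better-sqlite3";

const db = new Database("diff-edits/evals.db", { readonly: true });

// List the tables the CLI has created, without assuming their names.
const tables = db
  .prepare("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
  .all() as { name: string }[];

for (const { name } of tables) {
  const { count } = db
    .prepare(`SELECT COUNT(*) AS count FROM "${name}"`)
    .get() as { count: number };
  console.log(`${name}: ${count} rows`);
}

db.close();
```

Run it from the evals/ directory so the relative path to diff-edits/evals.db resolves.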
Launch the Streamlit dashboard to visualize and analyze evaluation results:
cd diff-edits/dashboard
streamlit run app.py
The dashboard provides interactive views for exploring and comparing evaluation results across models and runs.
# Run a quick evaluation
node cli/dist/index.js run-diff-eval \
--model-ids "anthropic/claude-3-5-sonnet-20241022,x-ai/grok-beta" \
--max-cases 4 \
--valid-attempts-per-case 2 \
--verbose
# Launch dashboard to view results
cd diff-edits/dashboard && streamlit run app.py
For backward compatibility, results are also saved as JSON files in the diff-edits/results/ directory.
The evaluation system collects per-task metrics for each evaluation run.
Reports are generated in Markdown or JSON format and summarize the collected metrics.
To add a new benchmark:

- Implement the BenchmarkAdapter interface from evals/cli/src/adapters/ (see the sketch below)
- Register the new adapter in evals/cli/src/adapters/index.ts

To add new metrics:

- Update the database schema in evals/cli/src/db/schema.ts
- Update the results processing in evals/cli/src/utils/results.ts
- Update the report generation in evals/cli/src/commands/report.ts
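
As a rough starting point for the first step, the sketch below shows the general shape a new adapter might take. The interface and method names here (setup, listTasks) are assumptions for illustration only; the real contract is defined by the BenchmarkAdapter interface in evals/cli/src/adapters/.

```typescript
// Hypothetical sketch only. The real BenchmarkAdapter interface lives in
// evals/cli/src/adapters/ and its actual shape may differ from this guess.
interface BenchmarkAdapterSketch {
  name: string;
  setup(): Promise<void>;
  listTasks(): Promise<string[]>;
}

// A new benchmark implements the interface and is then registered in
// evals/cli/src/adapters/index.ts so the CLI can discover it.
export class MyBenchmarkAdapter implements BenchmarkAdapterSketch {
  // Name used to select the benchmark from the CLI (e.g. --benchmark my-benchmark).
  readonly name = "my-benchmark";

  async setup(): Promise<void> {
    // Clone or download the benchmark materials into evals/repositories/.
  }

  async listTasks(): Promise<string[]> {
    // Return identifiers for the tasks this benchmark exposes.
    return [];
  }
}
```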