Full end-to-end tests using real-world tasks from cline-bench.
These tests run Cline against production-grade coding problems derived from actual user sessions. Each task:
Python 3.13 with uv
# macOS
brew install python@3.13
pip install uv
Harbor (benchmark execution framework)
uv tool install harbor
Docker (for local execution)
# Verify Docker is running
docker info
API Keys
export ANTHROPIC_API_KEY=sk-ant-...
# or
export API_KEY=sk-ant-... # Generic fallback
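The key lookup implied by the two exports above can be sketched as a small helper. This is a minimal sketch, assuming the provider-specific variable takes precedence over the generic `API_KEY` fallback; `resolveApiKey` and the `Env` type are hypothetical names for illustration, not functions taken from run-cline-bench.ts.

```typescript
// Hypothetical helper, not taken from run-cline-bench.ts.
// Assumption: ANTHROPIC_API_KEY wins over the generic API_KEY fallback.
type Env = Record<string, string | undefined>;

function resolveApiKey(env: Env): string {
  // Treat empty or whitespace-only values as unset.
  const key = env.ANTHROPIC_API_KEY?.trim() || env.API_KEY?.trim();
  if (!key) {
    throw new Error("Set ANTHROPIC_API_KEY (or the generic API_KEY) before running the evals");
  }
  return key;
}
```

In a Node script this would typically be called as `resolveApiKey(process.env)`, so a missing key fails early rather than partway through a long benchmark run.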
# Run all tasks with default settings (Anthropic, Docker)
npx tsx evals/e2e/run-cline-bench.ts
# Run specific task
npx tsx evals/e2e/run-cline-bench.ts --tasks discord
# Use different provider/model
npx tsx evals/e2e/run-cline-bench.ts --provider openai --model gpt-4o
# Run on Daytona cloud (faster, parallel)
export DAYTONA_API_KEY=dtn_...
npx tsx evals/e2e/run-cline-bench.ts --env daytona
# Output to JSON
npx tsx evals/e2e/run-cline-bench.ts --output results.json
| Option | Default | Description |
|---|---|---|
| --env | docker | Execution environment: docker or daytona |
| --provider | anthropic | Provider: anthropic, openai, openrouter, gemini |
| --model | claude-sonnet-4-20250514 | Model ID |
| --tasks | all | Task filter pattern |
| --trials | 1 | Number of trials per task |
| --output | - | Write JSON results to file |
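To show how the defaults in the table combine with flags given on the command line, here is a hand-rolled parser. It is illustrative only: `BenchOptions`, `DEFAULTS`, and `parseBenchArgs` are hypothetical names, and the real run-cline-bench.ts may parse and validate its flags differently.

```typescript
// Illustrative only: mirrors the options table, not the actual script.
interface BenchOptions {
  env: string;
  provider: string;
  model: string;
  tasks: string;
  trials: number;
  output?: string; // no default: results are only written when set
}

const DEFAULTS: BenchOptions = {
  env: "docker",
  provider: "anthropic",
  model: "claude-sonnet-4-20250514",
  tasks: "all",
  trials: 1,
};

function parseBenchArgs(argv: string[]): BenchOptions {
  const opts: BenchOptions = { ...DEFAULTS };
  // Every option in the table takes exactly one value, so step in pairs.
  for (let i = 0; i < argv.length; i += 2) {
    const flag = argv[i];
    const value = argv[i + 1];
    if (value === undefined) throw new Error(`Missing value for ${flag}`);
    switch (flag) {
      case "--env":
        if (value !== "docker" && value !== "daytona") {
          throw new Error(`Unknown environment: ${value}`);
        }
        opts.env = value;
        break;
      case "--provider": opts.provider = value; break;
      case "--model": opts.model = value; break;
      case "--tasks": opts.tasks = value; break;
      case "--trials": {
        const n = Number(value);
        if (!Number.isInteger(n) || n < 1) {
          throw new Error(`--trials must be a positive integer, got ${value}`);
        }
        opts.trials = n;
        break;
      }
      case "--output": opts.output = value; break;
      default:
        throw new Error(`Unknown option: ${flag}`);
    }
  }
  return opts;
}
```

Called with an empty array, the function returns the defaults from the table; anything passed on the command line overrides only the matching field.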
Current tasks from cline-bench (12 total):
These tests run nightly (not on every PR) due to:
See .github/workflows/nightly-evals.yml for CI configuration.
Harbor writes results to the evals/cline-bench/jobs/ directory:
jobs/
└── 2025-01-25__10-00-00/
    ├── result.json              # Aggregate results
    └── <task-id>__<hash>/
        ├── result.json          # Trial result
        ├── agent/cline.txt      # Conversation log
        └── verifier/reward.txt  # 1 (pass) or 0 (fail)
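The layout above is enough to compute a pass rate by hand. The sketch below assumes trial directories are named `<task-id>__<hash>` and that `reward.txt` contains only `1` or `0`, as shown in the tree; `TrialReward`, `parseTrial`, and `passRate` are hypothetical names, and Harbor's own `result.json` remains the authoritative summary.

```typescript
// Sketch under assumptions: directory names look like <task-id>__<hash>
// and reward.txt holds 1 (pass) or 0 (fail). Not part of Harbor or Cline.
interface TrialReward {
  taskId: string;
  passed: boolean;
}

function parseTrial(dirName: string, rewardText: string): TrialReward {
  // Split on the last "__" so task ids containing "__" stay intact.
  const sep = dirName.lastIndexOf("__");
  if (sep <= 0) throw new Error(`Unexpected trial directory name: ${dirName}`);

  const raw = rewardText.trim();
  const reward = Number(raw);
  if (raw === "" || (reward !== 0 && reward !== 1)) {
    throw new Error(`Unexpected reward value: "${raw}"`);
  }
  return { taskId: dirName.slice(0, sep), passed: reward === 1 };
}

function passRate(trials: TrialReward[]): number {
  // An empty run is reported as 0 rather than NaN.
  if (trials.length === 0) return 0;
  return trials.filter((t) => t.passed).length / trials.length;
}
```

Both functions take plain strings, so the file reading (for example with `node:fs`) can stay in the caller and the parsing can be checked without a real jobs/ directory.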
source .venv/bin/activate # If using venv
uv tool install harbor
# Start Docker daemon
docker info # Should show Docker info
Some tasks (Qt WASM, Android) can take 20-30 minutes. If running locally, ensure Docker has sufficient resources (8GB+ RAM).