A layered testing system for measuring Cline's performance at different levels.
```
evals/
├── smoke-tests/          # Quick provider validation (minutes)
│   ├── run-smoke-tests.ts
│   └── scenarios/        # 5 curated test scenarios
│
├── e2e/                  # Full E2E with cline-bench (hours)
│   └── run-cline-bench.ts
│
├── cline-bench/          # Real-world tasks (git submodule)
│   └── tasks/            # 12 production bug fixes
│
├── analysis/             # Metrics and reporting framework
│   ├── src/
│   │   ├── metrics.ts    # pass@k, pass^k calculations
│   │   ├── classifier.ts # Failure pattern matching
│   │   └── reporters/    # Markdown, JSON output
│   └── patterns/
│       └── cline-failures.yaml
│
└── baselines/            # Performance baselines for regression detection
```
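The failure classifier matches run output against patterns from `patterns/cline-failures.yaml`. A minimal sketch of that idea follows; both pattern names and regexes are invented for illustration, and the real schema lives in the YAML file and `classifier.ts`:

```typescript
// Illustrative only: the real patterns live in evals/analysis/patterns/cline-failures.yaml
// and the real matching logic in evals/analysis/src/classifier.ts.
interface FailurePattern {
	name: string
	regex: RegExp
}

// Hypothetical patterns; not taken from the actual YAML file.
const patterns: FailurePattern[] = [
	{ name: "rate_limited", regex: /429|rate limit/i },
	{ name: "tool_call_malformed", regex: /invalid tool (call|input)/i },
]

// Return the names of all known failure patterns that appear in a run's log.
function classifyFailure(log: string): string[] {
	return patterns.filter((p) => p.regex.test(log)).map((p) => p.name)
}

console.log(classifyFailure("HTTP 429: rate limit exceeded")) // ["rate_limited"]
```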
## Unit Tests

Location: `src/core/api/transform/__tests__/`

Tests API transform logic without LLM calls:

- Provider format conversions

```bash
npm run test:unit -- --grep "Thinking\|Tool Call"
```
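As a rough sketch of the kind of test that belongs there, assuming a mocha-style `describe`/`it` runner (implied by the `--grep` flag); `toAnthropicMessages` and the message shapes are hypothetical stand-ins, not the actual transform API:

```typescript
// Hypothetical sketch; the real tests and transform helpers in
// src/core/api/transform/ use different names and shapes.
import { strict as assert } from "node:assert"

interface GenericMessage {
	role: "user" | "assistant"
	text: string
}

// Stand-in for a provider format conversion under test.
function toAnthropicMessages(messages: GenericMessage[]) {
	return messages.map((m) => ({
		role: m.role,
		content: [{ type: "text", text: m.text }],
	}))
}

describe("Format transforms (sketch)", () => {
	it("wraps plain text into content blocks", () => {
		const [msg] = toAnthropicMessages([{ role: "user", text: "hello" }])
		assert.equal(msg.content[0].text, "hello")
	})
})
```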
## Smoke Tests

Location: `evals/smoke-tests/`

Quick validation across providers with real LLM calls:

- Runs via the cline CLI with `-s` flags

```bash
# Set API key (Cline provider)
export CLINE_API_KEY=sk-...

# Run smoke tests
npm run eval:smoke

# Run a specific scenario
npm run eval:smoke -- --scenario 01-create-file

# Run with a specific model (overrides per-scenario models)
npm run eval:smoke -- --model anthropic/claude-sonnet-4.5
```
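Each scenario is a `config.json` plus a `template/` directory of starting files (see the table at the end of this README). The shape below is a sketch only; every field name is an assumption rather than the real schema consumed by `run-smoke-tests.ts`:

```typescript
// Hypothetical scenario shape; the actual schema is defined by
// evals/smoke-tests/run-smoke-tests.ts and may differ.
interface SmokeScenario {
	name: string // e.g. "01-create-file"
	prompt: string // task given to Cline
	model?: string // per-scenario default, overridable with --model
	// Assertions checked against the workspace after the run.
	expect: Array<{ file: string; contains: string }>
}

const createFile: SmokeScenario = {
	name: "01-create-file",
	prompt: "Create hello.txt containing the text 'hello world'.",
	expect: [{ file: "hello.txt", contains: "hello world" }],
}

console.log(JSON.stringify(createFile, null, 2))
```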
## End-to-End Tests

Location: `evals/e2e/` + `evals/cline-bench/`

Full agent tests on production-grade tasks:

- Nightly CI runs

```bash
# Prerequisites: Python 3.13, Harbor, Docker
npm run eval:e2e

# Specific task
npm run eval:e2e -- --tasks discord

# Different provider
npm run eval:e2e -- --provider openai --model gpt-4o
```
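For reference, a minimal sketch of how the flags documented above could be consumed; this is not the actual argument handling in `run-cline-bench.ts`, and the `cline` default for `--provider` is an assumption:

```typescript
// Sketch of the documented CLI surface (--tasks, --provider, --model);
// the real runner may parse flags differently or accept more options.
import { parseArgs } from "node:util"

const { values } = parseArgs({
	options: {
		tasks: { type: "string" }, // e.g. --tasks discord
		provider: { type: "string" }, // e.g. --provider openai
		model: { type: "string" }, // e.g. --model gpt-4o
	},
})

const provider = values.provider ?? "cline" // assumed default
const model = values.model
const tasks = values.tasks?.split(",") ?? [] // assumption: empty means all tasks

console.log({ provider, model, tasks })
```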
## Metrics

The framework calculates:

| Metric | Formula | Interpretation |
|---|---|---|
| pass@k | P(≥1 of k passes) | Solution finding capability |
| pass^k | P(all k pass) | Reliability |
| Flakiness | Entropy of pass rate | Consistency |
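The real calculations live in `analysis/src/metrics.ts`. As a rough sketch of the formulas above, assuming independent attempts and an empirical per-task pass rate `p = passes / trials` (the actual implementation may use different estimators):

```typescript
// Sketch of the metric formulas; see evals/analysis/src/metrics.ts for the real code.

// Probability that at least one of k attempts passes (solution finding).
function passAtK(p: number, k: number): number {
	return 1 - Math.pow(1 - p, k)
}

// Probability that all k attempts pass (reliability).
function passCaretK(p: number, k: number): number {
	return Math.pow(p, k)
}

// Binary entropy of the pass rate: 0 for always-pass or always-fail,
// maximal (1 bit) at a 50% pass rate, i.e. maximally flaky.
function flakiness(p: number): number {
	if (p === 0 || p === 1) return 0
	return -(p * Math.log2(p) + (1 - p) * Math.log2(1 - p))
}

// Example: 2 passes out of 3 trials.
const p = 2 / 3
console.log(passAtK(p, 3).toFixed(2)) // ≈ 0.96
console.log(passCaretK(p, 3).toFixed(2)) // ≈ 0.30
console.log(flakiness(p).toFixed(2)) // ≈ 0.92
```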
With 3 trials, each task is classified as:

- **pass** (reliable)
- **fail** (broken)
- **flaky** (needs investigation)

## Running the Tests

```bash
# Run all fast tests
npm run test:unit
npm run eval:smoke

# Run E2E (requires setup)
cd evals/cline-bench
# Follow README.md for Harbor setup
npm run eval:e2e
```
## Adding Tests

| To add... | Where | Command |
|---|---|---|
| Smoke test scenario | `evals/smoke-tests/scenarios/<name>/config.json` + `template/` directory with starting files | `npm run eval:smoke -- --scenario <name>` |
| Unit test | `src/core/api/transform/__tests__/` | `npm run test:unit -- --grep "YourTest"` |
| E2E task | Contribute to cline/cline-bench | |
- Use the `native_tool_call_enabled` setting to test Claude 4 with native tools.