This document provides an overview of the SQLite database schema used for the diff edit evaluation suite. The database is designed to capture every aspect of the evaluation runs in a structured way, allowing for detailed, multi-dimensional analysis and ensuring full reproducibility of our findings.
The database is composed of several interconnected tables that work together to provide a comprehensive picture of each evaluation. The core of the model revolves around runs, cases, and results.
runsA run represents a single, top-level execution of the evaluation script (e.g., one invocation of npm run diff-eval). It serves as the main container for a complete benchmark session.
run_id: A unique identifier for the entire run.description: A human-readable summary of the run's configuration (e.g., which models were tested, how many cases, etc.).system_prompt_hash: A foreign key that links this run to the specific system prompt that was used, ensuring we can track performance changes based on prompt modifications.casesA case represents a single test scenario that is presented to a model. It corresponds to one of the JSON files in the cases/ directory and links that static definition to a specific benchmark run.
case_id: A unique identifier for the case within a specific run.run_id: A foreign key linking back to the parent run.task_id: The original, persistent identifier for the test case (typically from the JSON filename).file_hash: A foreign key linking to the original, un-edited file content for this case.resultsThis is the most granular and important table in the database. A result represents the outcome of a single attempt by a specific model on a specific case.
result_id: The primary key for the result.run_id, case_id, model_id, processing_functions_hash: A set of foreign keys that precisely situate this result within the context of a specific run, case, model, and set of helper functions.succeeded: A boolean indicating if the generated diff was applied successfully.error_enum: A numeric code representing the specific type of error if the attempt failed (e.g., 1 for no_tool_calls, 7 for wrong_file_edited).num_edits, num_lines_deleted, num_lines_added: Quantitative metrics about the structure of the generated diff.time_to_first_token_ms, time_to_first_edit_ms, time_round_trip_ms: High-precision timing data to measure model latency.cost_usd, completion_tokens: Cost and token usage metrics for efficiency analysis.raw_model_output, file_edited_hash, parsed_tool_call_json: The rich, qualitative data. This includes the model's full, raw response and the parsed tool calls, which are invaluable for debugging and understanding the model's reasoning.The following tables store versioned, deduplicated content to ensure data integrity and efficiency.
system_promptshash: A unique hash of the prompt's content, which acts as the primary key. This prevents duplicate storage of the same prompt.name: A human-readable name for the prompt (e.g., basicSystemPrompt, claude4SystemPrompt).content: The full text of the system prompt.processing_functionshash: A unique hash of the function combination name.name: A human-readable name (e.g., parseV2-diffV2).parsing_function: The name of the function used to parse the model's output.diff_edit_function: The name of the function used to apply the diff.fileshash: A content-based hash of the file, ensuring that identical files are only stored once.filepath: The original path of the file.content: The full content of the file.This relational schema provides a powerful foundation for sophisticated analysis. It moves beyond simple pass/fail metrics and allows us to explore the nuanced interactions between models, prompts, and the code they operate on. With this database, we can answer critical questions like:
Ultimately, this data model enables us to move from simply measuring performance to truly understanding it, providing the insights needed to build more capable and reliable AI engineering systems.
To see the most up-to-date and detailed schema for the database, you can use the sqlite3 command-line tool. From the evals/diff-edits directory, run the following command:
sqlite3 evals.db .schema
This will print the complete CREATE TABLE statements for all tables in the database, providing a definitive reference for the database structure.