database.md 5.9 KB

Diff Edit Evaluation Database Schema

This document provides an overview of the SQLite database schema used for the diff edit evaluation suite. The database is designed to capture every aspect of the evaluation runs in a structured way, allowing for detailed, multi-dimensional analysis and ensuring full reproducibility of our findings.

Data Model Overview

The database is composed of several interconnected tables that work together to provide a comprehensive picture of each evaluation. The core of the model revolves around runs, cases, and results.

runs

A run represents a single, top-level execution of the evaluation script (e.g., one invocation of npm run diff-eval). It serves as the main container for a complete benchmark session.

  • Purpose: To group all the results from a single benchmark execution, allowing for high-level comparison between different runs over time.
  • Key Columns:
    • run_id: A unique identifier for the entire run.
    • description: A human-readable summary of the run's configuration (e.g., which models were tested, how many cases, etc.).
    • system_prompt_hash: A foreign key that links this run to the specific system prompt that was used, ensuring we can track performance changes based on prompt modifications.

cases

A case represents a single test scenario that is presented to a model. It corresponds to one of the JSON files in the cases/ directory and links that static definition to a specific benchmark run.

  • Purpose: To track the individual test scenarios within a given run.
  • Key Columns:
    • case_id: A unique identifier for the case within a specific run.
    • run_id: A foreign key linking back to the parent run.
    • task_id: The original, persistent identifier for the test case (typically from the JSON filename).
    • file_hash: A foreign key linking to the original, un-edited file content for this case.

results

This is the most granular and important table in the database. A result represents the outcome of a single attempt by a specific model on a specific case.

  • Purpose: To store the detailed outcome of every single model attempt, providing the raw data for all quantitative and qualitative analysis.
  • Key Columns:
    • result_id: The primary key for the result.
    • run_id, case_id, model_id, processing_functions_hash: A set of foreign keys that precisely situate this result within the context of a specific run, case, model, and set of helper functions.
    • succeeded: A boolean indicating if the generated diff was applied successfully.
    • error_enum: A numeric code representing the specific type of error if the attempt failed (e.g., 1 for no_tool_calls, 7 for wrong_file_edited).
    • num_edits, num_lines_deleted, num_lines_added: Quantitative metrics about the structure of the generated diff.
    • time_to_first_token_ms, time_to_first_edit_ms, time_round_trip_ms: High-precision timing data to measure model latency.
    • cost_usd, completion_tokens: Cost and token usage metrics for efficiency analysis.
    • raw_model_output, file_edited_hash, parsed_tool_call_json: The rich, qualitative data. This includes the model's full, raw response and the parsed tool calls, which are invaluable for debugging and understanding the model's reasoning.

Supporting Tables

The following tables store versioned, deduplicated content to ensure data integrity and efficiency.

system_prompts

  • Purpose: Stores the versioned content of the system prompts used in evaluations.
  • Key Columns:
    • hash: A unique hash of the prompt's content, which acts as the primary key. This prevents duplicate storage of the same prompt.
    • name: A human-readable name for the prompt (e.g., basicSystemPrompt, claude4SystemPrompt).
    • content: The full text of the system prompt.

processing_functions

  • Purpose: Stores the versioned combinations of parsing and diff-editing functions.
  • Key Columns:
    • hash: A unique hash of the function combination name.
    • name: A human-readable name (e.g., parseV2-diffV2).
    • parsing_function: The name of the function used to parse the model's output.
    • diff_edit_function: The name of the function used to apply the diff.

files

  • Purpose: Stores the content of all files involved in the tests, including the original source files and the diffs generated by the models.
  • Key Columns:
    • hash: A content-based hash of the file, ensuring that identical files are only stored once.
    • filepath: The original path of the file.
    • content: The full content of the file.

The Bigger Picture

This relational schema provides a powerful foundation for sophisticated analysis. It moves beyond simple pass/fail metrics and allows us to explore the nuanced interactions between models, prompts, and the code they operate on. With this database, we can answer critical questions like:

  • "How does prompt engineering affect not just success rate, but also latency and cost?"
  • "Are certain models more prone to specific types of errors (e.g., hallucinating file paths vs. failing to call a tool)?"
  • "Which of our internal diffing algorithms is the most robust against a wide range of model-generated edits?"

Ultimately, this data model enables us to move from simply measuring performance to truly understanding it, providing the insights needed to build more capable and reliable AI engineering systems.


Viewing the Full Schema

To see the most up-to-date and detailed schema for the database, you can use the sqlite3 command-line tool. From the evals/diff-edits directory, run the following command:

sqlite3 evals.db .schema

This will print the complete CREATE TABLE statements for all tables in the database, providing a definitive reference for the database structure.