VT Code Empirical Evaluation Framework

This directory contains the tools and test cases for empirically evaluating the vtcode agent. The framework measures model performance across categories such as safety, logic, coding, and instruction following.

Getting Started

Prerequisites

  1. Build vtcode: Ensure you have a compiled release binary of vtcode.
    cargo build --release
  2. Python Environment: The evaluation engine requires Python 3.
  3. API Keys: Set the necessary environment variables (e.g., GEMINI_API_KEY, OPENAI_API_KEY) in a .env file in the project root.

Running Evaluations

The eval_engine.py script orchestrates the evaluation process.

Basic Usage:

python3 evals/eval_engine.py --cases evals/test_cases.json --provider gemini --model gemini-3-flash-preview

Arguments:

  • --cases: Path to the test cases JSON file (default: evals/test_cases.json).
  • --provider: The LLM provider to evaluate (e.g., gemini, openai, anthropic).
  • --model: The specific model ID to evaluate (e.g., gemini-3-flash-preview, gpt-4).
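As an illustration of how the three flags fit together, here is a minimal argparse sketch. It is not taken from eval_engine.py: the function name build_parser and the help strings are assumptions, and only the flag names and the --cases default come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (not the real script)."""
    parser = argparse.ArgumentParser(description="Run vtcode evaluations (sketch)")
    # Default path as documented above.
    parser.add_argument("--cases", default="evals/test_cases.json",
                        help="Path to the test cases JSON file")
    parser.add_argument("--provider", help="LLM provider, e.g. gemini")
    parser.add_argument("--model", help="Model ID, e.g. gemini-3-flash-preview")
    return parser


# Parse an explicit argument list so the example runs without a command line.
args = build_parser().parse_args(
    ["--provider", "gemini", "--model", "gemini-3-flash-preview"]
)
print(args.cases, args.provider, args.model)
```

Omitting --cases falls back to the default path, which is why the basic usage command could drop that flag if the real script behaves the same way.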

Directory Structure

  • eval_engine.py: The main orchestrator that runs test cases and generates reports.
  • metrics.py: Contains grading logic and metric implementations.
  • test_cases.json: The primary benchmark suite.
  • test_cases_mini.json: A smaller suite for quick validation of the framework.
  • reports/: Automatically created directory where evaluation results are saved as JSON files.

Test Case Format

Test cases are defined in JSON format:

{
    "id": "logic_fibonacci",
    "category": "logic",
    "task": "Write a python function to calculate the nth fibonacci number.",
    "metric": "code_validity",
    "language": "python"
}
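Before a long run it can help to sanity-check a suite. The sketch below assumes the file holds a JSON array of objects like the one above, and that id, category, task, and metric are required. Both points are assumptions, as are the names REQUIRED_KEYS and validate_cases; the framework may enforce something different.

```python
import json

# Assumption: these four fields are mandatory; "language" is treated as optional.
REQUIRED_KEYS = {"id", "category", "task", "metric"}


def validate_cases(cases):
    """Return a list of human-readable problems found in a list of test cases."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            problems.append(f"case {index}: missing {sorted(missing)}")
        case_id = case.get("id")
        if case_id is not None:
            if case_id in seen_ids:
                problems.append(f"case {index}: duplicate id {case_id!r}")
            seen_ids.add(case_id)
    return problems


# The example case from this README, wrapped in an array (assumed file layout).
sample = json.loads("""
[
    {
        "id": "logic_fibonacci",
        "category": "logic",
        "task": "Write a python function to calculate the nth fibonacci number.",
        "metric": "code_validity",
        "language": "python"
    }
]
""")
print(validate_cases(sample))
```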

Supported Metrics

  • exact_match: Checks if the output exactly matches the expected string.
  • contains_match: Checks if the output contains the expected string.
  • code_validity: Checks if the code within markdown blocks is syntactically correct (supports python).
  • llm_grader: Uses an LLM (invoked through vtcode ask) to grade the response against a rubric.
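The three deterministic metrics can be sketched roughly as follows. This is not the code in metrics.py: the function signatures, the whitespace stripping in exact_match, and the rule that a response with no code block fails code_validity are all assumptions made for illustration.

```python
import ast
import re

FENCE = "`" * 3  # a markdown code fence, built this way to keep the example readable
# Capture the body of fenced blocks, optionally tagged python or py.
CODE_BLOCK_RE = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def exact_match(output: str, expected: str) -> bool:
    # Assumption: surrounding whitespace is ignored.
    return output.strip() == expected.strip()


def contains_match(output: str, expected: str) -> bool:
    return expected in output


def code_validity(output: str) -> bool:
    """True if every fenced block parses as Python; False if none are found."""
    blocks = CODE_BLOCK_RE.findall(output)
    if not blocks:
        return False
    for block in blocks:
        try:
            ast.parse(block)  # syntax check only; the code is never executed
        except (SyntaxError, ValueError):
            return False
    return True


good = "Here you go:\n" + FENCE + "python\ndef fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)\n" + FENCE
bad = FENCE + "python\ndef fib(n) return n\n" + FENCE
print(code_validity(good), code_validity(bad))
```

Parsing with ast checks syntax without running the model's code, which keeps the check safe for untrusted output but says nothing about whether the function is correct.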

Analyzing Reports

Reports are saved in the reports/ directory with a timestamp. They include:

  • Summary: Total tests, passed, and failed counts.
  • Results: Detailed breakdown for each test case, including:
    • output: The raw agent response.
    • usage: Token usage metadata.
    • latency: Response time in seconds.
    • grade: The score or result from the metric.
    • reasoning: The agent's thinking process (if supported by the model).
    • raw_response: The complete JSON response from vtcode ask.
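Once a report is loaded with json.load, the per-case fields can be aggregated. The sketch below assumes the report is a dict with a "results" list whose entries carry a numeric "latency" key; the exact key names and the "id" field are assumptions based on the field list above, and latency_stats is a hypothetical helper.

```python
from statistics import mean


def latency_stats(report):
    """Summarize latencies (in seconds) from a report dict with the assumed schema."""
    latencies = [
        result["latency"]
        for result in report.get("results", [])
        if result.get("latency") is not None
    ]
    if not latencies:
        return None
    return {"count": len(latencies), "mean": mean(latencies), "max": max(latencies)}


# Hand-written sample standing in for a file from reports/ (values are made up).
sample_report = {
    "results": [
        {"id": "logic_fibonacci", "latency": 1.5},
        {"id": "hypothetical_case", "latency": 2.5},
    ]
}
print(latency_stats(sample_report))
```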

Grading with LLMs

The llm_grader metric uses vtcode ask internally to perform evaluations. By default, it grades with gemini-3-flash-preview to keep costs low and ensure reliability. You can change the grading model in evals/metrics.py.