Evaluation
The evaluation system helps you measure and improve your agent’s accuracy by running it against a suite of test cases with known expected outcomes.
Creating Test Cases
Test cases are stored in the sql_agent_test_cases table. You can seed them using the built-in seeder (--seed flag) or create your own:
    use Knobik\SqlAgent\Models\TestCase;

    TestCase::create([
        'name' => 'Count active users',
        'category' => 'basic',
        'question' => 'How many active users are there?',
        'expected_values' => ['count' => 42],
        // Single quotes around the SQL string literal keep the query portable across database drivers.
        'golden_sql' => "SELECT COUNT(*) as count FROM users WHERE status = 'active'",
        'golden_result' => [['count' => 42]],
    ]);

| Field | Description |
|---|---|
| name | A descriptive name for the test case |
| category | Grouping category (e.g., basic, aggregation, complex) |
| question | The natural language question to ask the agent |
| expected_values | Key-value pairs to match against results (supports dot notation) |
| golden_sql | The known-good SQL query for comparison |
| golden_result | The expected full result set |
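
To make the dot-notation support of expected_values more concrete, here is a minimal sketch of how a nested key such as user.name could be resolved against a result row. The helpers getByDotPath, valuesEqual, and matchesExpected are hypothetical names used for illustration only; they are not part of the package, and its actual matching logic may differ. The sketch assumes PHP 8.0 or later.

```php
<?php

declare(strict_types=1);

/**
 * Resolve a dot-notation path (e.g. "user.name") against a nested array.
 * Hypothetical helper for illustration; not part of the package.
 */
function getByDotPath(array $data, string $path, mixed $default = null): mixed
{
    $current = $data;
    foreach (explode('.', $path) as $segment) {
        if (!is_array($current) || !array_key_exists($segment, $current)) {
            return $default;
        }
        $current = $current[$segment];
    }
    return $current;
}

/**
 * Compare two values, treating numeric strings and numbers as equal
 * (database drivers often return "42" where the test case stores 42).
 */
function valuesEqual(mixed $actual, mixed $expected): bool
{
    if (is_numeric($actual) && is_numeric($expected)) {
        return $actual == $expected;
    }
    return $actual === $expected;
}

/**
 * Return true when every expected key/value pair is present in the result.
 * Assumption: all pairs must match for the test case to pass.
 */
function matchesExpected(array $expected, array $result): bool
{
    $missing = new stdClass(); // sentinel that distinguishes "absent" from null
    foreach ($expected as $path => $value) {
        $actual = getByDotPath($result, (string) $path, $missing);
        if ($actual === $missing || !valuesEqual($actual, $value)) {
            return false;
        }
    }
    return true;
}

$row = ['count' => '42', 'user' => ['name' => 'Ada']];

var_dump(matchesExpected(['count' => 42], $row));        // bool(true)
var_dump(matchesExpected(['user.name' => 'Ada'], $row)); // bool(true)
var_dump(matchesExpected(['user.email' => 'x'], $row));  // bool(false)
```

In a Laravel application you would more likely reach for the framework's data_get helper than a hand-written resolver; the standalone version is shown here only so the example runs without the framework.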
Running Evaluations
    # Run all test cases
    php artisan sql-agent:eval

    # Run with LLM grading
    php artisan sql-agent:eval --llm-grader

    # Run a specific category
    php artisan sql-agent:eval --category=aggregation

    # Generate an HTML report
    php artisan sql-agent:eval --html=storage/eval-report.html

    # Seed built-in test cases first
    php artisan sql-agent:eval --seed

Evaluation Modes
Three evaluation modes are available:
| Mode | Description |
|---|---|
| String Matching (default) | Checks if expected values appear in the response |
| LLM Grading (--llm-grader) | Uses an LLM to semantically evaluate whether the response is correct |
| Golden SQL (--golden-sql) | Runs the golden SQL and compares its results against the agent’s results |
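
The difference between string matching and golden SQL comparison can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's implementation: the function names responseContainsExpected, normalizeRows, and resultSetsMatch are hypothetical, the string check is assumed to be case-insensitive, and the result comparison is assumed to ignore row order and the difference between numeric strings and numbers. The sketch assumes PHP 8.0 or later and flat rows of scalar values.

```php
<?php

declare(strict_types=1);

/**
 * String matching: every expected value must appear somewhere in the
 * agent's text response. Case-insensitivity is an assumption.
 */
function responseContainsExpected(string $response, array $expectedValues): bool
{
    foreach ($expectedValues as $value) {
        if (stripos($response, (string) $value) === false) {
            return false;
        }
    }
    return true;
}

/**
 * Turn each row into a canonical string (sorted keys, values cast to
 * strings) and sort the rows, so comparison ignores row and column order.
 */
function normalizeRows(array $rows): array
{
    $normalized = array_map(function (array $row): string {
        ksort($row);
        return json_encode(array_map('strval', $row));
    }, $rows);
    sort($normalized);
    return $normalized;
}

/**
 * Golden SQL comparison: the rows returned by the golden query must equal
 * the rows returned by the agent's query after normalization.
 */
function resultSetsMatch(array $goldenRows, array $agentRows): bool
{
    return normalizeRows($goldenRows) === normalizeRows($agentRows);
}

// String matching only inspects the response text.
var_dump(responseContainsExpected('There are 42 active users.', ['count' => 42])); // bool(true)

// Golden SQL comparison inspects the actual result rows.
var_dump(resultSetsMatch([['count' => 42]], [['count' => '42']])); // bool(true)
var_dump(resultSetsMatch([['count' => 42]], [['count' => 41]]));   // bool(false)
```

String matching is cheap but can pass a response that mentions the right number for the wrong reason, whereas comparing result sets checks what the generated query actually returned.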