Evaluation

The evaluation system helps you measure and improve your agent’s accuracy by running it against a suite of test cases with known expected outcomes.

Test cases are stored in the sql_agent_test_cases table. You can seed them using the built-in seeder (--seed flag) or create your own:

use Knobik\SqlAgent\Models\TestCase;

TestCase::create([
    'name' => 'Count active users',
    'category' => 'basic',
    'question' => 'How many active users are there?',
    'expected_values' => ['count' => 42],
    'golden_sql' => 'SELECT COUNT(*) as count FROM users WHERE status = "active"',
    'golden_result' => [['count' => 42]],
]);
| Field | Description |
| --- | --- |
| name | A descriptive name for the test case |
| category | Grouping category (e.g., basic, aggregation, complex) |
| question | The natural language question to ask the agent |
| expected_values | Key-value pairs to match against results (supports dot notation) |
| golden_sql | The known-good SQL query for comparison |
| golden_result | The expected full result set |
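The dot-notation support in expected_values can be pictured with a short sketch. This is illustrative only: get_by_path and matches_expected are invented names for this example, not part of the package, and the package's actual matcher may behave differently.

```python
def get_by_path(data, path):
    """Resolve a dot-notation path like 'user.orders.0.total' in nested dicts/lists."""
    current = data
    for key in path.split("."):
        if isinstance(current, list):
            current = current[int(key)]  # numeric segments index into lists
        else:
            current = current[key]
    return current

def matches_expected(result, expected_values):
    """True if every expected path/value pair is present in the result."""
    try:
        return all(get_by_path(result, path) == value
                   for path, value in expected_values.items())
    except (KeyError, IndexError, ValueError):
        return False  # a missing path means the expectation is not met

result = {"user": {"name": "Ada", "orders": [{"total": 42}]}}
print(matches_expected(result, {"user.orders.0.total": 42}))  # True
```

The idea is simply that a flat key such as "user.orders.0.total" can reach into nested structures, so a test case can pin down one value without spelling out the whole result set.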
# Run all test cases
php artisan sql-agent:eval
# Run with LLM grading
php artisan sql-agent:eval --llm-grader
# Run a specific category
php artisan sql-agent:eval --category=aggregation
# Generate an HTML report
php artisan sql-agent:eval --html=storage/eval-report.html
# Seed built-in test cases first
php artisan sql-agent:eval --seed

Three evaluation modes are available:

| Mode | Description |
| --- | --- |
| String Matching (default) | Checks whether the expected values appear in the response |
| LLM Grading (--llm-grader) | Uses an LLM to semantically evaluate whether the response is correct |
| Golden SQL (--golden-sql) | Runs the golden SQL and compares its results against the agent's results |
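The golden SQL mode boils down to a result-set comparison. A minimal sketch of one plausible approach, assuming rows are compared as an order-insensitive multiset (the package's actual comparison rules may be stricter or looser, e.g. around column types or ordering):

```python
def results_match(golden, actual):
    """Compare two result sets (lists of row dicts) ignoring row order."""
    def normalize(rows):
        # Each row becomes a sorted list of (column, value) pairs,
        # and the rows themselves are sorted so order does not matter.
        return sorted(sorted(row.items()) for row in rows)
    return normalize(golden) == normalize(actual)

golden = [{"count": 42}]
agent = [{"count": 42}]
print(results_match(golden, agent))  # True
```

Under this sketch, two queries that return the same rows in a different order still count as matching, which is usually what you want when SQL has no ORDER BY.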