Eval Harness — Rule 6

Fixture-based regression tests for generated artifacts.

Why this exists

"Evals are the test suite for your prompts. You would never ship code without tests; don't ship prompts without evals." — Anthropic Engineering

The validation gate (tools/validate-generation.mjs) checks existence and structural compliance. The eval harness checks semantic correctness: are the right patterns present in the generated code? Do the generated files actually follow the rules in prompts/?

Together they enforce:

  • Gate: "file exists, field names present, auth seams wired"
  • Evals: "DTO has class-validator decorators, FK uses ReferenceInput, date uses DateInput, guard is present"

Usage

# Run all evals
npm run eval:generation

# Run evals for one entity
node tools/eval/run-evals.mjs --entity equipment

# Verbose output (show each file being checked)
node tools/eval/run-evals.mjs --verbose

Fixture format

Each fixture lives in tools/eval/fixtures/<entity>/:

fixtures/
  equipment/
    meta.json                  ← what this fixture tests
    backend.assertions.json    ← patterns the NestJS files must satisfy
    frontend.assertions.json   ← patterns the React Admin files must satisfy
  repair-order/
    meta.json
    backend.assertions.json
    frontend.assertions.json

meta.json

{
  "entity": "Equipment",
  "kebab": "equipment",
  "resource": "equipment",
  "description": "...",
  "tests": ["dto-decorator-coverage", "auth-guards", ...]
}

*.assertions.json

Each file entry supports:

Key                    Type       Meaning
path                   string     Relative path from repo root
must_contain           string[]   Each string must appear as a literal substring
must_not_contain       string[]   Each string must NOT appear
must_match_regex       string[]   Each pattern must match (multiline dot-all)
must_not_match_regex   string[]   Each pattern must NOT match
comment                string     Human-readable explanation of what is being tested
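
To make the keys above concrete, here is a minimal sketch of how one file entry could be checked against a file's content. It is not the code in tools/eval/run-evals.mjs. The function name checkEntry, the sample path, the sample DTO text, and the choice of the `ms` regex flags (one reading of "multiline dot-all") are all assumptions made for illustration.

```javascript
// Hypothetical sketch, not the actual run-evals.mjs logic.
// Returns a list of failure messages; an empty list means the entry passes.
function checkEntry(entry, content) {
  const failures = [];
  for (const s of entry.must_contain ?? []) {
    if (!content.includes(s)) failures.push(`missing literal: ${s}`);
  }
  for (const s of entry.must_not_contain ?? []) {
    if (content.includes(s)) failures.push(`forbidden literal: ${s}`);
  }
  // Assumption: "multiline dot-all" maps to the JavaScript flags "m" and "s".
  for (const p of entry.must_match_regex ?? []) {
    if (!new RegExp(p, 'ms').test(content)) failures.push(`no match: /${p}/`);
  }
  for (const p of entry.must_not_match_regex ?? []) {
    if (new RegExp(p, 'ms').test(content)) failures.push(`forbidden match: /${p}/`);
  }
  return failures;
}

// Sample entry; the path is a placeholder, not a real path in this repo.
const sampleEntry = {
  path: '<path-to-generated>/create-equipment.dto.ts',
  must_contain: ["from 'class-validator'", '@IsString('],
  must_not_contain: [': any'],
  must_match_regex: ['@IsOptional\\(\\)\\s+\\w+\\?:'],
  must_not_match_regex: ['name\\?:'],
  comment: 'Create DTO is validated and "name" stays required',
};

// Stand-in for the content of a generated file.
const sampleContent = [
  "import { IsString, IsOptional } from 'class-validator';",
  'export class CreateEquipmentDto {',
  '  @IsString()',
  '  name!: string;',
  '  @IsOptional()',
  '  note?: string;',
  '}',
].join('\n');

console.log(checkEntry(sampleEntry, sampleContent)); // [] because every assertion holds
```

Collecting every failure instead of stopping at the first one keeps the report useful when a single regeneration breaks several rules at once.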

Eval-driven development workflow

The core principle, drawn from Anthropic and Google guidance, is to write the eval before changing the generated output:

  1. Write the failing eval first. When you change a prompt or add a rule, add an assertion that captures the new expectation before re-generating.
  2. Run evals: npm run eval:generation → see failures.
  3. Re-generate the affected entity (following the generation workflow in AGENTS.md).
  4. Run evals again: all pass → the change is verified.
  5. Commit both the updated fixture and the regenerated artifacts together.

A passing eval after a prompt change indicates that the regenerated output follows the new rule. An eval that fails before you have changed any prompt tells you exactly which existing contract was broken.

Adding a new entity fixture

When adding a new entity to domain/toir.api.dsl and generating its backend + frontend:

  1. Create tools/eval/fixtures/<kebab>/meta.json
  2. Create tools/eval/fixtures/<kebab>/backend.assertions.json with at minimum:
    • controller: @Controller(...), @UseGuards(, JwtAuthGuard, HTTP methods
    • create_dto: from 'class-validator', required fields with !:, @IsString(, @IsOptional(
    • update_dto: from 'class-validator', fields with ?:, @IsOptional(
  3. Create tools/eval/fixtures/<kebab>/frontend.assertions.json with at minimum:
    • create: ReferenceInput for FK fields, NumberInput for numeric, DateInput for date, SelectInput for enum
    • show: ReferenceField for FK fields, DateField for date
  4. Run npm run eval:generation to verify the fixture catches real issues.
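
The backend checklist above can be turned into a starting skeleton. The sketch below is hypothetical: the helper name backendSkeleton, the top-level shape (an object keyed by controller, create_dto, update_dto), the file names, and the `<path-to-generated>` placeholder are assumptions, so replace them with the real layout used by existing fixtures such as equipment.

```javascript
// Hypothetical helper: builds a starting point for backend.assertions.json.
// Entry names follow the checklist; paths are placeholders, not real repo paths.
function backendSkeleton(kebab) {
  const placeholder = (file) => `<path-to-generated>/${kebab}/${file}`;
  return {
    controller: {
      path: placeholder(`${kebab}.controller.ts`),
      // '@Get(' and '@Post(' stand in for "HTTP methods"; extend as needed.
      must_contain: ['@Controller(', '@UseGuards(', 'JwtAuthGuard', '@Get(', '@Post('],
      comment: 'Controller is routed and guarded',
    },
    create_dto: {
      path: placeholder(`create-${kebab}.dto.ts`),
      must_contain: ["from 'class-validator'", '@IsString(', '@IsOptional('],
      must_match_regex: ['\\w+!:'], // at least one required field declared with !:
      comment: 'Create DTO uses class-validator and has required fields',
    },
    update_dto: {
      path: placeholder(`update-${kebab}.dto.ts`),
      must_contain: ["from 'class-validator'", '@IsOptional('],
      must_match_regex: ['\\w+\\?:'], // at least one optional field declared with ?:
      comment: 'Update DTO fields are optional',
    },
  };
}

// Print the skeleton so it can be pasted into the fixture and then tightened.
console.log(JSON.stringify(backendSkeleton('repair-order'), null, 2));
```

The skeleton only covers the minimum from the checklist. Entity-specific assertions, such as exact field names, still need to be added by hand.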

Integration with git hooks

The pre-commit hook (installed by npm run install-hooks) runs both:

  1. node tools/validate-generation.mjs --artifacts-only — existence gate
  2. npm run eval:generation — semantic eval gate

Both must pass before a commit is accepted.
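
The sequencing of the two gates can be sketched in Node as follows. This is an illustration only: the installed hook may be a plain shell script, and the names GATES, spawnRunner, and runGates, as well as the stop-at-first-failure behaviour, are assumptions. Only the two commands come from the list above.

```javascript
// Hypothetical sketch of the hook's sequencing, not the installed hook itself.
const GATES = [
  { name: 'existence gate', cmd: 'node', args: ['tools/validate-generation.mjs', '--artifacts-only'] },
  { name: 'semantic eval gate', cmd: 'npm', args: ['run', 'eval:generation'] },
];

// Default runner: spawns the command and resolves to its exit status.
async function spawnRunner(cmd, args) {
  const { spawnSync } = await import('node:child_process');
  return spawnSync(cmd, args, { stdio: 'inherit' }).status;
}

// Runs the gates in order and stops at the first one that does not exit with 0.
// The runner is injectable so the sequencing can be exercised without spawning.
async function runGates(run = spawnRunner) {
  for (const gate of GATES) {
    const status = await run(gate.cmd, gate.args);
    if (status !== 0) return { ok: false, failed: gate.name };
  }
  return { ok: true, failed: null };
}

// Usage from a hook script (commented out so this block has no side effects):
// runGates().then((result) => process.exit(result.ok ? 0 : 1));
```

Running the cheaper existence gate first means a missing file is reported before the semantic evals spend time on it.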