(llm-first): context budget, validation, and eval harness, orchestration general-prompt

MaKarin
2026-04-03 14:17:21 +03:00
parent 79c9589658
commit c42a88dff6
189 changed files with 15538 additions and 9109 deletions

tools/eval/README.md Normal file

@@ -0,0 +1,106 @@
# Eval Harness — Rule 6
Fixture-based regression tests for generated artifacts.
## Why this exists
> "Evals are the test suite for your prompts. You would never ship code without tests;
> don't ship prompts without evals." — Anthropic Engineering
The validation gate (`tools/validate-generation.mjs`) checks **existence** and **structural compliance**.
The eval harness checks **semantic correctness**: are the right patterns present in the generated code?
Do the generated files actually follow the rules in `prompts/`?
Together they enforce:
- Gate: "file exists, field names present, auth seams wired"
- Evals: "DTO has class-validator decorators, FK uses ReferenceInput, date uses DateInput, guard is present"
## Usage
```bash
# Run all evals
npm run eval:generation
# Run evals for one entity
node tools/eval/run-evals.mjs --entity equipment
# Verbose output (show each file being checked)
node tools/eval/run-evals.mjs --verbose
```
## Fixture format
Each fixture lives in `tools/eval/fixtures/<entity>/`:
```
fixtures/
  equipment/
    meta.json                  ← what this fixture tests
    backend.assertions.json    ← patterns the NestJS files must satisfy
    frontend.assertions.json   ← patterns the React Admin files must satisfy
  repair-order/
    meta.json
    backend.assertions.json
    frontend.assertions.json
```
### `meta.json`
```json
{
  "entity": "Equipment",
  "kebab": "equipment",
  "resource": "equipment",
  "description": "...",
  "tests": ["dto-decorator-coverage", "auth-guards", ...]
}
```
### `*.assertions.json`
Each file entry supports:
| Key | Type | Meaning |
|-----|------|---------|
| `path` | string | Relative path from repo root |
| `must_contain` | string[] | Each string must appear as a literal substring |
| `must_not_contain` | string[] | Each string must NOT appear |
| `must_match_regex` | string[] | Each pattern must match (multiline and dot-all flags enabled) |
| `must_not_match_regex` | string[] | Each pattern must NOT match |
| `comment` | string | Human-readable explanation of what is being tested |
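
To make the four assertion keys concrete, here is a minimal sketch of how one file entry could be checked against a file's text. It is not the code in `tools/eval/run-evals.mjs`, which may behave differently. The function name `checkEntry`, the sample entry, the sample path, and the choice of the `m` and `s` RegExp flags for "multiline dot-all" are all assumptions made for illustration.

```javascript
// Sketch only: the real logic lives in tools/eval/run-evals.mjs and may differ.
// Assumption: "multiline dot-all" corresponds to the RegExp flags "m" and "s".
const REGEX_FLAGS = "ms";

/**
 * Check one file entry against the text of the file it points at.
 * Returns a list of human-readable failure messages (empty means pass).
 */
function checkEntry(entry, content) {
  const failures = [];
  for (const needle of entry.must_contain ?? []) {
    if (!content.includes(needle)) failures.push(`missing literal: ${needle}`);
  }
  for (const needle of entry.must_not_contain ?? []) {
    if (content.includes(needle)) failures.push(`forbidden literal: ${needle}`);
  }
  for (const pattern of entry.must_match_regex ?? []) {
    if (!new RegExp(pattern, REGEX_FLAGS).test(content)) {
      failures.push(`no match: ${pattern}`);
    }
  }
  for (const pattern of entry.must_not_match_regex ?? []) {
    if (new RegExp(pattern, REGEX_FLAGS).test(content)) {
      failures.push(`forbidden match: ${pattern}`);
    }
  }
  return failures;
}

// Hypothetical entry and file content, for illustration only.
const entry = {
  path: "path/to/create-equipment.dto.ts",
  must_contain: ["from 'class-validator'", "@IsString("],
  must_not_contain: [": any"],
  must_match_regex: ["@IsOptional\\(\\)\\s+description\\?:"],
  comment: "Create DTO uses class-validator decorators",
};
const content = `import { IsOptional, IsString } from 'class-validator';
export class CreateEquipmentDto {
  @IsString()
  name!: string;
  @IsOptional()
  description?: string;
}`;

console.log(checkEntry(entry, content)); // [] (no failures)
```

Returning a list of failures rather than a boolean lets a runner report every broken expectation for a file in one pass.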
## Eval-driven development workflow
The workflow below follows the eval-first principle advocated by Anthropic and Google:
1. **Write the failing eval first.** When you change a prompt or add a rule, add an assertion that captures the new expectation *before* re-generating.
2. **Run evals**: `npm run eval:generation` → see failures.
3. **Re-generate** the affected entity (following the generation workflow in `AGENTS.md`).
4. **Run evals again**: all pass → the change is verified.
5. **Commit both** the updated fixture and the regenerated artifacts together.
A passing eval after a prompt change confirms the LLM followed the new rule.
A failing eval before a prompt change tells you exactly which prior contract was broken.
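
The fail-first, pass-after cycle described above can be illustrated with a tiny sketch. The rule, the `passes` helper, and both file snippets are hypothetical; they only stand in for a real assertion and a real generated file.

```javascript
// Hypothetical new rule: "date fields must use DateInput".
// Step 1: the assertion is written before anything is regenerated.
const assertion = { must_contain: ["<DateInput"], must_not_contain: [] };

// Returns true when every required literal is present and no forbidden one is.
function passes(assertion, content) {
  return (
    assertion.must_contain.every((s) => content.includes(s)) &&
    assertion.must_not_contain.every((s) => !content.includes(s))
  );
}

// Made-up snippets standing in for the file before and after regeneration.
const beforeRegeneration = '<TextInput source="commissionedAt" />';
const afterRegeneration = '<DateInput source="commissionedAt" />';

console.log(passes(assertion, beforeRegeneration)); // false: the eval fails first (step 2)
console.log(passes(assertion, afterRegeneration)); // true: the regenerated file satisfies it (step 4)
```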
## Adding a new entity fixture
When adding a new entity to `domain/toir.api.dsl` and generating its backend + frontend:
1. Create `tools/eval/fixtures/<kebab>/meta.json`
2. Create `tools/eval/fixtures/<kebab>/backend.assertions.json` with at minimum:
   - controller: `@Controller(...)`, `@UseGuards(`, `JwtAuthGuard`, HTTP methods
   - create_dto: `from 'class-validator'`, required fields with `!:`, `@IsString(`, `@IsOptional(`
   - update_dto: `from 'class-validator'`, fields with `?:`, `@IsOptional(`
3. Create `tools/eval/fixtures/<kebab>/frontend.assertions.json` with at minimum:
   - create: `ReferenceInput` for FK fields, `NumberInput` for numeric, `DateInput` for date, `SelectInput` for enum
   - show: `ReferenceField` for FK fields, `DateField` for date
4. Run `npm run eval:generation` to verify the fixture catches real issues.
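
As a starting point for step 2, the minimum backend assertions could be generated rather than typed by hand. The sketch below is an assumption-heavy illustration: the helper name `minimalBackendAssertions`, the top-level shape keyed by `controller`, `create_dto`, and `update_dto`, the placeholder paths, and the choice of `@Get(` and `@Post(` to stand for "HTTP methods" are not taken from the harness. Compare against an existing fixture such as `equipment` before relying on it.

```javascript
// Sketch: build the minimum backend assertions listed above for one entity.
// Assumptions: the top-level keys and the caller-supplied paths are
// illustrative; match them to the fixtures already in the repo.
function minimalBackendAssertions(paths) {
  return {
    controller: {
      path: paths.controller,
      // "@Get(" and "@Post(" are example HTTP-method decorators (assumption).
      must_contain: ["@Controller(", "@UseGuards(", "JwtAuthGuard", "@Get(", "@Post("],
      comment: "Controller is routed, guarded, and exposes HTTP methods",
    },
    create_dto: {
      path: paths.createDto,
      must_contain: ["from 'class-validator'", "!:", "@IsString(", "@IsOptional("],
      comment: "Create DTO validates required and optional fields",
    },
    update_dto: {
      path: paths.updateDto,
      must_contain: ["from 'class-validator'", "?:", "@IsOptional("],
      comment: "Update DTO makes every field optional",
    },
  };
}

// Placeholder paths: replace with the real locations of the generated files.
const fixture = minimalBackendAssertions({
  controller: "<path to controller>",
  createDto: "<path to create DTO>",
  updateDto: "<path to update DTO>",
});

console.log(JSON.stringify(fixture, null, 2));
```

Entity-specific expectations (particular field names, enum values, FK references) still need to be added by hand on top of this skeleton.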
## Integration with git hooks
The pre-commit hook (installed by `npm run install-hooks`) runs both:
1. `node tools/validate-generation.mjs --artifacts-only` — existence gate
2. `npm run eval:generation` — semantic eval gate
Both must pass before a commit is accepted.
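
The control flow of such a hook can be sketched as follows. This is not the hook that `npm run install-hooks` installs; the `runGates` function and the injected `run` callback are hypothetical, and stopping at the first failing gate is an assumption about the desired behaviour. Only the two commands come from the list above.

```javascript
// Sketch of the hook's control flow, not the installed hook itself.
// `run` is injected so the logic can be exercised without spawning processes.
const GATES = [
  { name: "existence gate", command: "node tools/validate-generation.mjs --artifacts-only" },
  { name: "semantic eval gate", command: "npm run eval:generation" },
];

/**
 * Run each gate in order and stop at the first failure.
 * `run(command)` must return the command's exit status (0 means success).
 * Returns 0 if every gate passed, otherwise the failing gate's exit status.
 */
function runGates(gates, run) {
  for (const gate of gates) {
    const status = run(gate.command);
    if (status !== 0) {
      console.error(`${gate.name} failed (exit ${status}); commit rejected`);
      return status;
    }
  }
  return 0;
}

// Stubbed runner for illustration: pretend the eval gate fails.
const fakeRun = (command) => (command.startsWith("npm") ? 1 : 0);
console.log(runGates(GATES, fakeRun)); // 1
```

In a real script, `run` could wrap `spawnSync` from `node:child_process` (for example with `shell: true` and `stdio: "inherit"`) and return its `status`, and the script would pass the result of `runGates` to `process.exit` so that git rejects the commit on a non-zero status.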