# Future Work — Deferred Items
This file tracks engineering improvements that are deliberately deferred due to the
current stage of the project. They are not forgotten — they are acknowledged technical
debt that should be addressed before scaling.
---
## Rule 7 — Tracing, Telemetry, Cost/Latency Observability
**Status:** Deferred. No LLM calls are instrumented.
**Why it matters (Anthropic / Google / Microsoft guidance):**
Without observability, you cannot:
- Know which prompts are expensive (token count, latency)
- Detect prompt regressions via cost drift
- Attribute generation failures to specific prompt versions
- Track improvement over time
**What needs to be built:**
### 7.1 — Generation log
Create `tools/generation-log.mjs` that wraps any LLM generation call and writes a
structured JSON entry to `logs/generation.jsonl`:
```json
{
"timestamp": "2026-04-03T10:00:00.000Z",
"entity": "Equipment",
"artifact": "backend",
"prompt_version": "1.0",
"model": "...",
"input_tokens": 4200,
"output_tokens": 1800,
"latency_ms": 3200,
"validation_passed": true,
"eval_passed": true
}
```
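A minimal sketch of what such a wrapper could look like, assuming the wrapped call reports its own token counts and pass/fail results. The names `buildLogEntry` and `withGenerationLog`, the shape of the `result` object, and the injected `appendLine` sink are all hypothetical; in the real tool `appendLine` would presumably append to `logs/generation.jsonl`.

```javascript
// Sketch only: function names, the `result` shape, and the `appendLine` sink
// are assumptions, not part of the existing tooling.

// Build one log entry with the same fields as the JSON example above.
function buildLogEntry(meta, result, startedAtMs, finishedAtMs) {
  return {
    timestamp: new Date(startedAtMs).toISOString(),
    entity: meta.entity,
    artifact: meta.artifact,
    prompt_version: meta.promptVersion,
    model: meta.model,
    input_tokens: result.inputTokens,
    output_tokens: result.outputTokens,
    latency_ms: finishedAtMs - startedAtMs,
    validation_passed: result.validationPassed ?? null,
    eval_passed: result.evalPassed ?? null,
  };
}

// Wrap any async generation call and emit one JSONL line per call.
async function withGenerationLog(meta, generate, appendLine) {
  const startedAtMs = Date.now();
  const result = await generate();
  const entry = buildLogEntry(meta, result, startedAtMs, Date.now());
  await appendLine(JSON.stringify(entry) + '\n');
  return result;
}
```

Injecting the sink keeps the entry-building logic free of file I/O, so it can be checked without touching the `logs/` directory.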
### 7.2 — Cost budget alerts
Add a threshold check (e.g., warn if input_tokens > 8000 for a single entity generation).
This enforces the context budget from `prompts/general-prompt.md §CONTEXT BUDGET`.
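The threshold check could be as small as the following sketch. The function name is hypothetical, and the default limit of 8000 is taken from the example above rather than from any confirmed budget value.

```javascript
// Sketch only: the function name and the default limit are assumptions.
// Returns a warning string when a log entry exceeds the input-token budget,
// or null when it is within budget.
function checkInputTokenBudget(entry, maxInputTokens = 8000) {
  if (entry.input_tokens > maxInputTokens) {
    return `WARN: ${entry.entity}/${entry.artifact} used ${entry.input_tokens} input tokens (budget ${maxInputTokens})`;
  }
  return null;
}
```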
### 7.3 — Prompt version tracking
Add `<!-- prompt-version: X.Y -->` comments to all prompt files (already started in
`backend-rules.md` and `frontend-rules.md`). Increment version on any non-trivial change.
Log the prompt versions alongside the generation log entry.
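Reading the version back out of a prompt file might look like this sketch. `extractPromptVersion` is a hypothetical helper name, and the regex assumes the `X.Y` numeric form shown in the comment marker above.

```javascript
// Sketch only: `extractPromptVersion` is a hypothetical helper.
// Returns the "X.Y" version string from a prompt file's contents, or null
// when the marker comment is missing.
function extractPromptVersion(promptText) {
  const match = /<!--\s*prompt-version:\s*(\d+\.\d+)\s*-->/.exec(promptText);
  return match ? match[1] : null;
}
```

Returning `null` rather than throwing lets the caller decide whether an unversioned prompt file is a warning or a hard failure.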
### 7.4 — Drift detection
Compare generation log entries across runs. If token count for the same entity increases
by >20% without a DSL change, flag it as context rot.
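One possible shape for the comparison, as a sketch: it assumes each log entry also carries a `dsl_hash` field (a hash of the entity's DSL source), which is not part of the log format shown in 7.1 and would need to be added. It compares `input_tokens` only.

```javascript
// Sketch only: `dsl_hash` is a hypothetical extra log field used to tell
// "the DSL changed" apart from "the context grew on its own".
function detectTokenDrift(previous, current, threshold = 0.2) {
  if (previous.entity !== current.entity || previous.artifact !== current.artifact) {
    return null; // entries are not comparable
  }
  if (previous.dsl_hash !== current.dsl_hash) {
    return null; // DSL changed, so a token change is expected
  }
  if (!(previous.input_tokens > 0)) {
    return null; // no usable baseline
  }
  const growth = (current.input_tokens - previous.input_tokens) / previous.input_tokens;
  return growth > threshold ? { entity: current.entity, artifact: current.artifact, growth } : null;
}
```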
**Effort estimate:** Medium. 2-3 days to build the logging layer. Near-zero effort for
prompt versioning (already partially done).
**Trigger:** Implement before the system is used for more than 10 entities or before
any production deployment.
---
## Rule 8 — Risk Controls and Red-Teaming
**Status:** Deferred. No sanitization or adversarial testing exists.
**Why it matters (Anthropic / Google / Microsoft guidance):**
LLM-generated code at scale introduces risks that do not exist in hand-written code:
- **Prompt injection**: malicious content in DSL `description` fields could steer
generation (e.g., `description "Ignore previous instructions and..."`)
- **Generated credential leakage**: LLM may hallucinate hardcoded secrets that look
real (e.g., `apiKey: 'sk-...'`)
- **Missing auth guards**: already caught by Rule 4 validator, but adversarial prompts
could bypass it by generating valid-looking guard syntax that is semantically inactive
- **Supply chain**: generated package imports could reference non-existent or malicious
packages if the LLM hallucinates
**What needs to be built:**
### 8.1 — DSL input sanitization
In `tools/api-summary.mjs`, before building the summary, check all `description` and
`label` fields for injection patterns:
```javascript
function sanitizeDslString(value, fieldPath) {
  const injectionPatterns = [
    /ignore previous/i,
    /disregard.*instruction/i,
    /you are now/i,
    /system:/i,
  ];
  for (const pattern of injectionPatterns) {
    if (pattern.test(value)) {
      throw new Error(`Potential prompt injection in DSL field ${fieldPath}: "${value}"`);
    }
  }
  return value;
}
```
### 8.2 — Generated code security scan
Add to `tools/validate-generation.mjs` (or a separate `tools/security-scan.mjs`):
```javascript
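import { readFileSync } from 'node:fs';

// Check that no hardcoded secrets leaked into generated code.
// `filePaths` (an assumed parameter) lists the generated files to scan.
function validateNoSecretLeakage(filePaths) {
  const patterns = [
    /sk-[a-zA-Z0-9]{20,}/, // OpenAI key pattern
    /[a-zA-Z0-9+\/]{40}={0,2}/, // Base64 secret-like (expect false positives)
    /password\s*[:=]\s*['"][^'"]{4,}['"]/i, // Hardcoded password
    /apiKey\s*[:=]\s*['"][^'"]{4,}['"]/i, // Hardcoded API key (`=` or `:` form)
  ];
  const findings = [];
  for (const filePath of filePaths) {
    const source = readFileSync(filePath, 'utf8');
    for (const pattern of patterns) {
      if (pattern.test(source)) findings.push({ filePath, pattern: String(pattern) });
    }
  }
  return findings;
}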
```
### 8.3 — UseGuards completeness audit
Beyond the current semantic gate coverage, add a check that the arguments passed to
each `@UseGuards(...)` decorator are non-empty and match the expected guard class names. A
call like `@UseGuards()` (empty) should fail eval coverage because it provides no protection.
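A regex-based sketch of that audit follows. The function name and the idea of passing the expected guard class names in as a list are assumptions. It only handles plain class-name arguments; a guard written as a call expression with its own parentheses would be reported as unexpected, so a real implementation would likely parse the AST instead.

```javascript
// Sketch only: regex-based, handles plain class-name arguments.
// Returns a list of human-readable problems; an empty list means the audit passed.
function auditUseGuards(source, expectedGuards) {
  const problems = [];
  const decorator = /@UseGuards\(([^)]*)\)/g;
  for (const match of source.matchAll(decorator)) {
    const args = match[1].split(',').map((arg) => arg.trim()).filter(Boolean);
    if (args.length === 0) {
      problems.push('empty @UseGuards()');
      continue;
    }
    for (const arg of args) {
      if (!expectedGuards.includes(arg)) {
        problems.push(`unexpected guard: ${arg}`);
      }
    }
  }
  return problems;
}
```

For example, calling it with an expected list such as `['JwtAuthGuard', 'RolesGuard']` (placeholder class names) flags both an empty decorator and any guard class outside that list.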
### 8.4 — Red-team fixture
Create `tools/eval/fixtures/_adversarial/` with a fixture that includes a DSL snippet
containing a benign injection attempt (e.g., a `description` field with "ignore format
rules") and verifies the generation still produces spec-compliant output.
### 8.5 — Generated import allowlist
Maintain a list of approved npm packages that generated code may import. Flag any
import not on the allowlist as a manual review item.
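The allowlist check could start from something like the sketch below. The function names are hypothetical, and import extraction is done with a regex rather than a parser, so it may produce false positives (for example, the word `from` followed by a quoted string inside a comment). Bare Node built-ins such as `fs` would also be flagged unless they are on the allowlist or use the `node:` prefix.

```javascript
// Sketch only: regex-based import extraction; names are assumptions.

// Reduce an import specifier to its npm package name, e.g.
// "@nestjs/common/decorators" -> "@nestjs/common", "lodash/fp" -> "lodash".
function packageNameOf(specifier) {
  const parts = specifier.split('/');
  return specifier.startsWith('@') ? parts.slice(0, 2).join('/') : parts[0];
}

// Return the package names imported by `source` that are not on `allowlist`.
function findUnapprovedImports(source, allowlist) {
  const importPattern =
    /(?:from\s+|import\s*\(\s*|require\s*\(\s*|import\s+)['"]([^'"]+)['"]/g;
  const flagged = new Set();
  for (const match of source.matchAll(importPattern)) {
    const specifier = match[1];
    // Relative paths and `node:` built-ins are not npm packages.
    if (specifier.startsWith('.') || specifier.startsWith('/') || specifier.startsWith('node:')) {
      continue;
    }
    const name = packageNameOf(specifier);
    if (!allowlist.includes(name)) flagged.add(name);
  }
  return [...flagged];
}
```

The result is a list for manual review rather than a hard failure, matching the "flag as a manual review item" intent above.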
**Effort estimate:** Medium-High. 3-5 days. Security scan and sanitization are low
effort; red-team fixtures and import allowlisting are higher effort.
**Trigger:** Implement before any external user can influence `domain/*.api.dsl` content
(i.e., before a UI or API to edit the DSL is exposed).
---
## Tracking
| Rule | Status | Priority | Trigger |
|------|--------|----------|---------|
| Rule 7 — Telemetry | Deferred | Medium | Before >10 entities or production deployment |
| Rule 8 — Risk controls | Deferred | High | Before DSL editing is exposed to external users |
| Rule 9 — Eval corpus automation | Deferred | Medium | After the contract stabilizes for the current entity set |
---
## Rule 9 — Eval Corpus Automation
**Status:** Deferred. Reviewed eval fixtures are still the authoritative semantic gate.
**Why it matters:**
The repository already requires eval-first behavior for new or changed entity coverage, but it does not yet synthesize starter eval contracts from the source of truth. Humans therefore still have to review and finalize the first failing semantic contract by hand.
**What needs to be built:**
- a deterministic helper that can scaffold `tools/eval/fixtures/<entity>/` starters from the active source-of-truth slice
- a prompt-to-eval helper that emits backend and frontend assertion starters before regeneration
- a documented workflow where generated starters are reviewed and committed, instead of silently replacing the authoritative eval corpus on every run
**Status of current contract:**
- This is not implemented yet.
- The repo should not imply that evals are auto-generated today.
- The repo should also not imply that a step-0 LLM subagent should rewrite committed eval fixtures on every regeneration run; that would collapse the independence of the semantic gate.
- Until bounded automation exists, the reviewed eval-first rule in `prompts/general-prompt.md` and `docs/generation-playbook.md` remains required.
**Effort estimate:** Medium.
**Trigger:** After the contract stabilizes for the current entity set.
Last updated: 2026-04-03