# Future Work: Deferred Items

This file tracks engineering improvements that are deliberately deferred due to the current stage of the project. They are not forgotten; they are acknowledged technical debt that should be addressed before scaling.

---

## Rule 7: Tracing, Telemetry, Cost/Latency Observability

**Status:** Deferred. No LLM calls are instrumented.

**Why it matters (Anthropic / Google / Microsoft guidance):**
Without observability, you cannot:

- Know which prompts are expensive (token count, latency)
- Detect prompt regressions via cost drift
- Attribute generation failures to specific prompt versions
- Track improvement over time

**What needs to be built:**

### 7.1 Generation log

Create `tools/generation-log.mjs` that wraps any LLM generation call and writes a structured JSON entry to `logs/generation.jsonl`:

```json
{
  "timestamp": "2026-04-03T10:00:00.000Z",
  "entity": "Equipment",
  "artifact": "backend",
  "prompt_version": "1.0",
  "model": "...",
  "input_tokens": 4200,
  "output_tokens": 1800,
  "latency_ms": 3200,
  "validation_passed": true,
  "eval_passed": true
}
```

### 7.2 Cost budget alerts

Add a threshold check (e.g., warn if `input_tokens` > 8000 for a single entity generation). This enforces the context budget from `prompts/general-prompt.md §CONTEXT BUDGET`.

### 7.3 Prompt version tracking

Add version comments to all prompt files (already started in `backend-rules.md` and `frontend-rules.md`). Increment the version on any non-trivial change. Log the prompt versions alongside the generation log entry.

### 7.4 Drift detection

Compare generation log entries across runs. If the token count for the same entity increases by >20% without a DSL change, flag it as context rot.

**Effort estimate:** Medium. 2–3 days to build the logging layer. Zero effort for prompt versioning (already partially done).

**Trigger:** Implement before the system is used for more than 10 entities or before any production deployment.

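The budget warning in 7.2 and the drift check in 7.4 could both run over parsed `logs/generation.jsonl` entries. The following is a minimal sketch rather than a settled design. The function names (`parseJsonl`, `checkBudget`, `detectDrift`) are hypothetical. It assumes that "token count" means `input_tokens`, and that each log entry also carries a `dsl_hash` field for detecting DSL changes. That field is not part of the 7.1 entry format and would need to be added.

```javascript
// Sketch only: names and the `dsl_hash` field are assumptions, not existing repo code.
const INPUT_TOKEN_BUDGET = 8000; // example threshold from 7.2
const DRIFT_RATIO = 0.2; // 20% growth threshold from 7.4

// Parse the text of a .jsonl file into an array of entries, skipping blank lines.
function parseJsonl(text) {
  return text
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line));
}

// Return a warning string if the entry exceeds the input token budget, else null.
function checkBudget(entry, budget = INPUT_TOKEN_BUDGET) {
  if (entry.input_tokens > budget) {
    return `WARN ${entry.entity}/${entry.artifact}: input_tokens ${entry.input_tokens} exceeds budget ${budget}`;
  }
  return null;
}

// Compare two runs. Flag entries whose input_tokens grew by more than `ratio`
// while the (assumed) dsl_hash stayed the same, i.e. growth without a DSL change.
function detectDrift(previousRun, currentRun, ratio = DRIFT_RATIO) {
  const previousByKey = new Map(
    previousRun.map((entry) => [`${entry.entity}/${entry.artifact}`, entry])
  );
  const flagged = [];
  for (const entry of currentRun) {
    const key = `${entry.entity}/${entry.artifact}`;
    const before = previousByKey.get(key);
    // Skip entries with no usable baseline.
    if (!before || !(before.input_tokens > 0)) continue;
    // Skip entries whose DSL changed: token growth is expected there.
    if (before.dsl_hash !== entry.dsl_hash) continue;
    const growth = (entry.input_tokens - before.input_tokens) / before.input_tokens;
    if (growth > ratio) {
      flagged.push({ key, before: before.input_tokens, after: entry.input_tokens, growth });
    }
  }
  return flagged;
}
```

Keying on `entity/artifact` keeps backend and frontend generations for the same entity from being compared against each other. Whether drift should also consider `output_tokens` is left open here.
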

---

## Rule 8: Risk Controls and Red-Teaming

**Status:** Deferred. No sanitization or adversarial testing exists.

**Why it matters (Anthropic / Google / Microsoft guidance):**
LLM-generated code at scale introduces risks that do not exist in hand-written code:

- **Prompt injection**: malicious content in DSL `description` fields could steer generation (e.g., `description "Ignore previous instructions and..."`)
- **Generated credential leakage**: the LLM may hallucinate hardcoded secrets that look real (e.g., `apiKey: 'sk-...'`)
- **Missing auth guards**: already caught by the Rule 4 validator, but adversarial prompts could bypass it by generating valid-looking guard syntax that is semantically inactive
- **Supply chain**: generated package imports could reference non-existent or malicious packages if the LLM hallucinates

**What needs to be built:**

### 8.1 DSL input sanitization

In `tools/api-summary.mjs`, before building the summary, check all `description` and `label` fields for injection patterns:

```javascript
// Reject DSL strings that look like prompt-injection attempts.
function sanitizeDslString(value, fieldPath) {
  const injectionPatterns = [
    /ignore previous/i,
    /disregard.*instruction/i,
    /you are now/i,
    /system:/i,
  ];
  for (const pattern of injectionPatterns) {
    if (pattern.test(value)) {
      throw new Error(`Potential prompt injection in DSL field ${fieldPath}: "${value}"`);
    }
  }
  return value;
}
```

### 8.2 Generated code security scan

Add to `tools/validate-generation.mjs` (or a separate `tools/security-scan.mjs`):

```javascript
// Check no hardcoded secrets leaked into generated code.
// `files` is an assumed input shape: an array of { path, content } objects.
function validateNoSecretLeakage(files) {
  const patterns = [
    /sk-[a-zA-Z0-9]{20,}/,                    // OpenAI key pattern
    /[a-zA-Z0-9+\/]{40}={0,2}/,               // Base64 secret-like (broad; expect false positives)
    /password\s*[:=]\s*['"][^'"]{4,}['"]/i,   // Hardcoded password (assignment or object property)
    /apiKey\s*[:=]\s*['"][^'"]{4,}['"]/i,     // Hardcoded API key (assignment or object property)
  ];
  // Run against all generated files and collect every match for review.
  const findings = [];
  for (const { path, content } of files) {
    for (const pattern of patterns) {
      if (pattern.test(content)) findings.push({ path, pattern: String(pattern) });
    }
  }
  return findings;
}
```

### 8.3 UseGuards completeness audit

Beyond the current semantic gate coverage, add a check that the `@UseGuards(...)` decorator arguments are non-empty and match the expected guard class names. An empty call like `@UseGuards()` should fail eval coverage because it provides no protection.

### 8.4 Red-team fixture

Create `tools/eval/fixtures/_adversarial/` with a fixture that includes a DSL snippet containing a benign injection attempt (e.g., a `description` field with "ignore format rules") and verifies that generation still produces spec-compliant output.

### 8.5 Generated import allowlist

Maintain a list of approved npm packages that generated code may import. Flag any import not on the allowlist as a manual review item.

**Effort estimate:** Medium-High. 3–5 days. Security scan and sanitization are low effort; red-team fixtures and import allowlisting are higher effort.

**Trigger:** Implement before any external user can influence `domain/*.api.dsl` content (i.e., before a UI or API to edit the DSL is exposed).

---

## Tracking

| Rule | Status | Priority | Trigger |
|------|--------|----------|---------|
| Rule 7: Telemetry | Deferred | Medium | Before >10 entities or production deployment |
| Rule 8: Risk controls | Deferred | High | Before DSL editing is exposed to external users |
| Rule 9: Eval corpus automation | Deferred | Medium | After the contract stabilizes for the current entity set |

---

## Rule 9: Eval Corpus Automation

**Status:** Deferred. Reviewed eval fixtures are still the authoritative semantic gate.

**Why it matters:**
The repository already requires eval-first behavior for new or changed entity coverage, but it does not yet synthesize starter eval contracts from the source of truth. That means humans still have to review and finalize the first failing semantic contract.

**What needs to be built:**

- a deterministic helper that can scaffold `tools/eval/fixtures//` starters from the active source-of-truth slice
- a prompt-to-eval helper that emits backend and frontend assertion starters before regeneration
- a documented workflow where generated starters are reviewed and committed, instead of silently replacing the authoritative eval corpus on every run

**Status of current contract:**

- This is not implemented yet.
- The repo should not imply that evals are auto-generated today.
- The repo should also not imply that a step-0 LLM subagent should rewrite committed eval fixtures on every regeneration run; that would collapse the independence of the semantic gate.
- Until bounded automation exists, the reviewed eval-first rule in `prompts/general-prompt.md` and `docs/generation-playbook.md` remains required.

**Effort estimate:** Medium.

**Trigger:** After the contract stabilizes for the current entity set.

Last updated: 2026-04-03