Future Work — Deferred Items
This file tracks engineering improvements that are deliberately deferred due to the current stage of the project. They are not forgotten — they are acknowledged technical debt that should be addressed before scaling.
Rule 7 — Tracing, Telemetry, Cost/Latency Observability
Status: Deferred. No LLM calls are instrumented.
Why it matters (Anthropic / Google / Microsoft guidance): Without observability, you cannot:
- Know which prompts are expensive (token count, latency)
- Detect prompt regressions via cost drift
- Attribute generation failures to specific prompt versions
- Track improvement over time
What needs to be built:
7.1 — Generation log
Create tools/generation-log.mjs that wraps any LLM generation call and writes a
structured JSON entry to logs/generation.jsonl:
{
  "timestamp": "2026-04-03T10:00:00.000Z",
  "entity": "Equipment",
  "artifact": "backend",
  "prompt_version": "1.0",
  "model": "...",
  "input_tokens": 4200,
  "output_tokens": 1800,
  "latency_ms": 3200,
  "validation_passed": true,
  "eval_passed": true
}
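A minimal sketch of what tools/generation-log.mjs could export. The names withGenerationLog and appendToGenerationLog, the meta argument, and the result.usage shape are assumptions for illustration, not an existing API. validation_passed and eval_passed are left null on the assumption that later pipeline steps fill them in.

```javascript
// Wraps one LLM generation call and emits a single JSONL entry.
// `generate` is any async function returning { usage: { input_tokens, output_tokens } }
// (hypothetical shape; adapt to the real client).
async function withGenerationLog(meta, generate, writeLine) {
  const started = Date.now();
  const result = await generate();
  const entry = {
    timestamp: new Date(started).toISOString(),
    entity: meta.entity,
    artifact: meta.artifact,
    prompt_version: meta.promptVersion,
    model: meta.model,
    input_tokens: result.usage.input_tokens,
    output_tokens: result.usage.output_tokens,
    latency_ms: Date.now() - started,
    validation_passed: null, // assumed to be set by the validation step
    eval_passed: null, // assumed to be set by the eval step
  };
  await writeLine(JSON.stringify(entry) + "\n");
  return { result, entry };
}

// Possible default sink: append one line to logs/generation.jsonl.
async function appendToGenerationLog(line) {
  const { appendFile, mkdir } = await import("node:fs/promises");
  await mkdir("logs", { recursive: true });
  await appendFile("logs/generation.jsonl", line, "utf8");
}
```

Passing the sink as a parameter keeps the wrapper testable without touching the filesystem. A generate() call that throws is not logged in this sketch.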
7.2 — Cost budget alerts
Add a threshold check (e.g., warn if input_tokens > 8000 for a single entity generation).
This enforces the context budget from prompts/general-prompt.md §CONTEXT BUDGET.
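The threshold check could be as small as the sketch below. checkTokenBudget and the default constant are hypothetical names; the 8000 figure is the example from above, and the real limit should come from the CONTEXT BUDGET section.

```javascript
// Example limit only; the authoritative value lives in prompts/general-prompt.md.
const DEFAULT_INPUT_TOKEN_BUDGET = 8000;

// Returns a warning string when a generation log entry exceeds the budget,
// or null when it is within budget.
function checkTokenBudget(entry, budget = DEFAULT_INPUT_TOKEN_BUDGET) {
  if (entry.input_tokens <= budget) return null;
  return (
    `Budget warning: ${entry.entity}/${entry.artifact} used ` +
    `${entry.input_tokens} input tokens (budget ${budget})`
  );
}
```

Returning a message instead of throwing keeps this a warning, so a generation run is not aborted by a budget overrun.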
7.3 — Prompt version tracking
Add <!-- prompt-version: X.Y --> comments to all prompt files (already started in
backend-rules.md and frontend-rules.md). Increment version on any non-trivial change.
Log the prompt versions alongside the generation log entry.
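Reading the version back out of a prompt file might look like this. readPromptVersion is a hypothetical helper, and it assumes versions are always written as two dot-separated numbers, matching the X.Y form above.

```javascript
// Extracts the version from a <!-- prompt-version: X.Y --> comment.
// Returns the version string, or null when the comment is missing.
function readPromptVersion(promptText) {
  const match = /<!--\s*prompt-version:\s*(\d+\.\d+)\s*-->/.exec(promptText);
  return match ? match[1] : null;
}
```

A null result can be treated as a validation failure so that unversioned prompt files never reach the generation log.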
7.4 — Drift detection
Compare generation log entries across runs. If token count for the same entity increases by >20% without a DSL change, flag it as context rot.
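A sketch of the comparison, under two assumptions: "token count" means input_tokens, and each log entry carries a dsl_hash field (not part of the 7.1 schema as written) so that runs with a changed DSL can be excluded. detectTokenDrift is a hypothetical name.

```javascript
// Compares two runs of generation log entries and flags entries whose
// input token count grew by more than `threshold` with an unchanged DSL.
function detectTokenDrift(previousRun, currentRun, threshold = 0.2) {
  const keyOf = (e) => `${e.entity}/${e.artifact}`;
  const baseline = new Map(previousRun.map((e) => [keyOf(e), e]));
  const flagged = [];
  for (const entry of currentRun) {
    const before = baseline.get(keyOf(entry));
    if (!before || before.input_tokens === 0) continue; // nothing to compare
    if (before.dsl_hash !== entry.dsl_hash) continue; // DSL changed: growth is expected
    const increase = (entry.input_tokens - before.input_tokens) / before.input_tokens;
    if (increase > threshold) {
      flagged.push({
        key: keyOf(entry),
        before: before.input_tokens,
        after: entry.input_tokens,
        increase,
      });
    }
  }
  return flagged;
}
```

Entries are keyed by entity and artifact so that a backend prompt is never compared against a frontend one.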
Effort estimate: Medium. 2–3 days to build the logging layer. Little additional effort for prompt versioning (already partially done).
Trigger: Implement before the system is used for more than 10 entities or before any production deployment.
Rule 8 — Risk Controls and Red-Teaming
Status: Deferred. No sanitization or adversarial testing exists.
Why it matters (Anthropic / Google / Microsoft guidance): LLM-generated code at scale introduces risks that do not exist in hand-written code:
- Prompt injection: malicious content in DSL description fields could steer generation (e.g., description "Ignore previous instructions and...")
- Generated credential leakage: LLM may hallucinate hardcoded secrets that look real (e.g., apiKey: 'sk-...')
- Missing auth guards: already caught by Rule 4 validator, but adversarial prompts could bypass it by generating valid-looking guard syntax that is semantically inactive
- Supply chain: generated package imports could reference non-existent or malicious packages if the LLM hallucinates
What needs to be built:
8.1 — DSL input sanitization
In tools/api-summary.mjs, before building the summary, check all description and
label fields for injection patterns:
function sanitizeDslString(value, fieldPath) {
  const injectionPatterns = [
    /ignore previous/i,
    /disregard.*instruction/i,
    /you are now/i,
    /system:/i,
  ];
  for (const pattern of injectionPatterns) {
    if (pattern.test(value)) {
      throw new Error(`Potential prompt injection in DSL field ${fieldPath}: "${value}"`);
    }
  }
  return value;
}
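Applying the check to every description and label field could be done with a small recursive walk. This sketch assumes the parsed DSL is a plain tree of objects and arrays, which the source does not confirm. sanitizeDslFields is a hypothetical name, and the sanitizer is passed in so that the function above can be supplied as-is.

```javascript
// Walks a parsed DSL tree and runs `sanitize(value, path)` on every string
// stored under a "description" or "label" key. Throws if the sanitizer throws.
function sanitizeDslFields(node, sanitize, path = "$") {
  if (Array.isArray(node)) {
    node.forEach((child, i) => sanitizeDslFields(child, sanitize, `${path}[${i}]`));
    return;
  }
  if (node === null || typeof node !== "object") return;
  for (const [key, value] of Object.entries(node)) {
    const childPath = `${path}.${key}`;
    if ((key === "description" || key === "label") && typeof value === "string") {
      sanitize(value, childPath);
    } else {
      sanitizeDslFields(value, sanitize, childPath);
    }
  }
}
```

Called as sanitizeDslFields(parsedDsl, sanitizeDslString), the error message then names the exact field path that triggered the check.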
8.2 — Generated code security scan
Add to tools/validate-generation.mjs (or a separate tools/security-scan.mjs):
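// Check that no hardcoded secrets leaked into generated code.
// `files` is an assumed input shape: an array of { path, content } objects.
function validateNoSecretLeakage(files) {
  const patterns = [
    /sk-[a-zA-Z0-9]{20,}/, // OpenAI key pattern
    /[a-zA-Z0-9+\/]{40}={0,2}/, // Base64 secret-like (prone to false positives)
    /password\s*[:=]\s*['"][^'"]{4,}['"]/i, // Hardcoded password (assignment or object property)
    /apiKey\s*[:=]\s*['"][^'"]{4,}['"]/i, // Hardcoded API key (assignment or object property)
  ];
  // Run against all generated files and collect every match.
  const findings = [];
  for (const { path, content } of files) {
    for (const pattern of patterns) {
      if (pattern.test(content)) findings.push({ path, pattern: String(pattern) });
    }
  }
  return findings;
}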
8.3 — UseGuards completeness audit
Beyond the current semantic gate coverage, add a check that the arguments passed
to @UseGuards are non-empty and match the expected guard class names. An empty
call such as @UseGuards() should fail eval coverage because it provides no protection.
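A regex-based sketch of this audit follows. auditUseGuards is a hypothetical name and the list of expected guard classes is supplied by the caller. The pattern only handles plain class-name arguments; an argument that itself contains parentheses (for example a guard instance) is reported as unexpected, so a production version would more likely inspect the TypeScript AST.

```javascript
// Scans generated source text for @UseGuards decorators and reports
// empty calls and guard names that are not in `expectedGuards`.
function auditUseGuards(source, expectedGuards) {
  const findings = [];
  for (const match of source.matchAll(/@UseGuards\(([^)]*)\)/g)) {
    const args = match[1]
      .split(",")
      .map((arg) => arg.trim())
      .filter(Boolean);
    if (args.length === 0) {
      findings.push({ index: match.index, problem: "empty @UseGuards()" });
    }
    for (const arg of args) {
      if (!expectedGuards.includes(arg)) {
        findings.push({ index: match.index, problem: `unexpected guard: ${arg}` });
      }
    }
  }
  return findings;
}
```

An empty findings array means every decorator found names at least one expected guard; it does not prove that every route has a decorator, which remains the job of the Rule 4 validator.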
8.4 — Red-team fixture
Create tools/eval/fixtures/_adversarial/ with a fixture that includes a DSL snippet
containing a benign injection attempt (e.g., a description field with "ignore format
rules") and verifies the generation still produces spec-compliant output.
8.5 — Generated import allowlist
Maintain a list of approved npm packages that generated code may import. Flag any import not on the allowlist as a manual review item.
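One way to sketch the allowlist check is shown below. findUnapprovedImports is a hypothetical name. The regexes cover static import and require() forms only, so dynamic import() calls are not detected. Relative paths and node:-prefixed builtins are skipped on the assumption that only third-party packages need approval.

```javascript
// Returns the sorted list of package names imported by `source`
// that do not appear in `allowlist`.
function findUnapprovedImports(source, allowlist) {
  const patterns = [
    /\bimport\s+(?:[^'"]*?\s+from\s+)?['"]([^'"]+)['"]/g, // import x from 'pkg' / import 'pkg'
    /\brequire\(\s*['"]([^'"]+)['"]\s*\)/g, // require('pkg')
  ];
  const unapproved = new Set();
  for (const pattern of patterns) {
    for (const match of source.matchAll(pattern)) {
      const specifier = match[1];
      if (specifier.startsWith(".") || specifier.startsWith("/")) continue; // local file
      if (specifier.startsWith("node:")) continue; // Node.js builtin
      // Scoped packages keep two path segments (@scope/name), others keep one.
      const parts = specifier.split("/");
      const name = specifier.startsWith("@") ? parts.slice(0, 2).join("/") : parts[0];
      if (!allowlist.includes(name)) unapproved.add(name);
    }
  }
  return [...unapproved].sort();
}
```

Each returned name becomes a manual review item; a hallucinated package simply never appears on the allowlist and is therefore always flagged.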
Effort estimate: Medium-High. 3–5 days. Security scan and sanitization are low effort; red-team fixtures and import allowlisting are higher effort.
Trigger: Implement before any external user can influence domain/*.api.dsl content
(i.e., before a UI or API to edit the DSL is exposed).
Tracking
| Rule | Status | Priority | Trigger |
|---|---|---|---|
| Rule 7 — Telemetry | Deferred | Medium | Before >10 entities or production deployment |
| Rule 8 — Risk controls | Deferred | High | Before DSL editing is exposed to external users |
| Rule 9 — Eval corpus automation | Deferred | Medium | After the contract stabilizes for the current entity set |
Rule 9 — Eval Corpus Automation
Status: Deferred. Reviewed eval fixtures are still the authoritative semantic gate.
Why it matters: The repository already requires eval-first behavior for new or changed entity coverage, but the repo does not yet synthesize starter eval contracts from source-of-truth. That means humans still have to review and finalize the first failing semantic contract.
What needs to be built:
- a deterministic helper that can scaffold tools/eval/fixtures/<entity>/ starters from the active source-of-truth slice
- a prompt-to-eval helper that emits backend and frontend assertion starters before regeneration
- a documented workflow where generated starters are reviewed and committed, instead of silently replacing the authoritative eval corpus on every run
Status of current contract:
- This is not implemented yet.
- The repo should not imply that evals are auto-generated today.
- The repo should also not imply that a step-0 LLM subagent should rewrite committed eval fixtures on every regeneration run; that would collapse the independence of the semantic gate.
- Until bounded automation exists, the reviewed eval-first rule in prompts/general-prompt.md and docs/generation-playbook.md remains required.
Effort estimate: Medium.
Trigger: After the contract stabilizes for the current entity set.
Last updated: 2026-04-03