Future Work — Deferred Items

This file tracks engineering improvements that are deliberately deferred due to the current stage of the project. They are not forgotten — they are acknowledged technical debt that should be addressed before scaling.


Rule 7 — Tracing, Telemetry, Cost/Latency Observability

Status: Deferred. No LLM calls are instrumented.

Why it matters (Anthropic / Google / Microsoft guidance): Without observability, you cannot:

  • Know which prompts are expensive (token count, latency)
  • Detect prompt regressions via cost drift
  • Attribute generation failures to specific prompt versions
  • Track improvement over time

What needs to be built:

7.1 — Generation log

Create tools/generation-log.mjs that wraps any LLM generation call and writes a structured JSON entry to logs/generation.jsonl:

{
  "timestamp": "2026-04-03T10:00:00.000Z",
  "entity": "Equipment",
  "artifact": "backend",
  "prompt_version": "1.0",
  "model": "...",
  "input_tokens": 4200,
  "output_tokens": 1800,
  "latency_ms": 3200,
  "validation_passed": true,
  "eval_passed": true
}
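A minimal sketch of how such a wrapper might look, assuming the fields from the example entry above. The function names (`buildLogEntry`, `withGenerationLog`), the shape of `meta` and `result`, and the injected `generate` and `appendLine` callbacks are hypothetical and not part of the existing tooling.

```javascript
// Hypothetical sketch of tools/generation-log.mjs. Only the logged field
// names come from the example entry; everything else is an assumption.

// Build one log entry from call metadata, the call result, and two timestamps.
function buildLogEntry(meta, result, startedAtMs, finishedAtMs) {
  return {
    timestamp: new Date(startedAtMs).toISOString(),
    entity: meta.entity,
    artifact: meta.artifact,
    prompt_version: meta.promptVersion,
    model: meta.model,
    input_tokens: result.inputTokens,
    output_tokens: result.outputTokens,
    latency_ms: finishedAtMs - startedAtMs,
    validation_passed: result.validationPassed,
    eval_passed: result.evalPassed,
  };
}

// Wrap an async generation call. `generate` performs the LLM call and
// `appendLine` writes one JSONL line; both are injected so the wrapper
// does not depend on a specific LLM client or log destination.
async function withGenerationLog(meta, generate, appendLine) {
  const startedAtMs = Date.now();
  const result = await generate();
  const entry = buildLogEntry(meta, result, startedAtMs, Date.now());
  await appendLine(JSON.stringify(entry) + '\n');
  return result;
}
```

In the real tool, `appendLine` could for example be `(line) => appendFile('logs/generation.jsonl', line)` using `appendFile` from `node:fs/promises`.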

7.2 — Cost budget alerts

Add a threshold check (e.g., warn if input_tokens > 8000 for a single entity generation). This enforces the context budget from prompts/general-prompt.md §CONTEXT BUDGET.
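The threshold check could be as small as the sketch below. It assumes log entries shaped like the 7.1 example; the function name `checkTokenBudget` and the warning format are illustrative, and 8000 is only the example value mentioned above.

```javascript
// Example threshold taken from the text above; the real budget is defined
// in prompts/general-prompt.md and may differ.
const DEFAULT_MAX_INPUT_TOKENS = 8000;

// Return a warning string when an entry exceeds the input-token budget,
// or null when it is within budget. (Hypothetical helper.)
function checkTokenBudget(entry, maxInputTokens = DEFAULT_MAX_INPUT_TOKENS) {
  if (entry.input_tokens > maxInputTokens) {
    return `WARN ${entry.entity}/${entry.artifact}: input_tokens ` +
      `${entry.input_tokens} exceeds budget ${maxInputTokens}`;
  }
  return null;
}
```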

7.3 — Prompt version tracking

Add <!-- prompt-version: X.Y --> comments to all prompt files (already started in backend-rules.md and frontend-rules.md). Increment the version on any non-trivial change, and record the version of each prompt used in the corresponding generation log entry.
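Reading the version back out of a prompt file might look like the following sketch. It assumes the comment format shown above with a numeric X.Y version; `extractPromptVersion` is a hypothetical helper name.

```javascript
// Extract the X.Y version from a <!-- prompt-version: X.Y --> comment.
// Returns null when the prompt text carries no version comment.
function extractPromptVersion(promptText) {
  const match = /<!--\s*prompt-version:\s*(\d+\.\d+)\s*-->/.exec(promptText);
  return match ? match[1] : null;
}
```

Returning null instead of throwing lets the caller decide whether an unversioned prompt is a warning or a hard failure.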

7.4 — Drift detection

Compare generation log entries across runs. If token count for the same entity increases by >20% without a DSL change, flag it as context rot.
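One possible shape for this comparison is sketched below. It uses `input_tokens` as the token count and assumes each log entry also carries a `dsl_hash` field so that a DSL change can be detected; that field is not part of the 7.1 example and is an assumption, as is the name `detectDrift`.

```javascript
// Compare two runs of log entries and flag entity/artifact pairs whose
// input token count grew by more than `threshold` while the DSL stayed
// the same. `dsl_hash` is a hypothetical field, not in the 7.1 schema.
function detectDrift(previous, current, threshold = 0.2) {
  const key = (e) => `${e.entity}/${e.artifact}`;
  const baseline = new Map(previous.map((e) => [key(e), e]));
  const flagged = [];
  for (const entry of current) {
    const before = baseline.get(key(entry));
    // Skip new entities and entities whose DSL changed between runs.
    if (!before || before.dsl_hash !== entry.dsl_hash) continue;
    const growth = (entry.input_tokens - before.input_tokens) / before.input_tokens;
    if (growth > threshold) flagged.push({ key: key(entry), growth });
  }
  return flagged;
}
```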

Effort estimate: Medium. 2-3 days to build the logging layer. Little additional effort for prompt versioning (already partially done).

Trigger: Implement before the system is used for more than 10 entities or before any production deployment.


Rule 8 — Risk Controls and Red-Teaming

Status: Deferred. No sanitization or adversarial testing exists.

Why it matters (Anthropic / Google / Microsoft guidance): LLM-generated code at scale introduces risks that do not exist in hand-written code:

  • Prompt injection: malicious content in DSL description fields could steer generation (e.g., description "Ignore previous instructions and...")
  • Generated credential leakage: LLM may hallucinate hardcoded secrets that look real (e.g., apiKey: 'sk-...')
  • Missing auth guards: already caught by the Rule 4 validator, but adversarial input could steer the model into generating valid-looking guard syntax that is semantically inactive and so passes the validator
  • Supply chain: generated package imports could reference non-existent or malicious packages if the LLM hallucinates

What needs to be built:

8.1 — DSL input sanitization

In tools/api-summary.mjs, before building the summary, check all description and label fields for injection patterns:

function sanitizeDslString(value, fieldPath) {
  const injectionPatterns = [
    /ignore previous/i,
    /disregard.*instruction/i,
    /you are now/i,
    /system:/i,
  ];
  for (const pattern of injectionPatterns) {
    if (pattern.test(value)) {
      throw new Error(`Potential prompt injection in DSL field ${fieldPath}: "${value}"`);
    }
  }
  return value;
}

8.2 — Generated code security scan

Add to tools/validate-generation.mjs (or a separate tools/security-scan.mjs):

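import { readFileSync } from 'node:fs';

// Check that no hardcoded secrets leaked into generated code.
// `filePaths` (an assumed parameter) lists the generated files to scan.
function validateNoSecretLeakage(filePaths) {
  const patterns = [
    /sk-[a-zA-Z0-9]{20,}/,                   // OpenAI key pattern
    /[a-zA-Z0-9+\/]{40}={0,2}/,              // Base64 secret-like (broad; expect false positives)
    /password\s*[:=]\s*['"][^'"]{4,}['"]/,   // Hardcoded password (assignment or object property)
    /apiKey\s*[:=]\s*['"][^'"]{4,}['"]/,     // Hardcoded API key, including apiKey: '...'
  ];
  const findings = [];
  for (const file of filePaths) {
    const source = readFileSync(file, 'utf8');
    for (const pattern of patterns) {
      if (pattern.test(source)) findings.push({ file, pattern: String(pattern) });
    }
  }
  return findings; // An empty array means no pattern matched.
}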

8.3 — UseGuards completeness audit

Beyond the current semantic gate coverage, verify that the arguments passed to the @UseGuards() decorator are non-empty and match the expected guard class names. An empty call such as @UseGuards() should fail eval coverage because it provides no protection.
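A regex-based sketch of this audit follows. It assumes guards are passed as bare class identifiers (for example `@UseGuards(JwtAuthGuard)`); nested calls such as `AuthGuard('jwt')` would need an AST-based check instead. The function name `auditUseGuards` and the finding format are hypothetical.

```javascript
// Scan source text for @UseGuards(...) decorators and report empty calls
// and guard names that are not in the expected list.
// Assumption: arguments are plain identifiers without nested parentheses.
function auditUseGuards(source, expectedGuards) {
  const findings = [];
  const decorator = /@UseGuards\(([^)]*)\)/g;
  for (const match of source.matchAll(decorator)) {
    const args = match[1].split(',').map((a) => a.trim()).filter(Boolean);
    if (args.length === 0) {
      findings.push({ index: match.index, problem: 'empty @UseGuards()' });
      continue;
    }
    for (const arg of args) {
      if (!expectedGuards.includes(arg)) {
        findings.push({ index: match.index, problem: `unexpected guard: ${arg}` });
      }
    }
  }
  return findings;
}
```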

8.4 — Red-team fixture

Create tools/eval/fixtures/_adversarial/ with a fixture that includes a DSL snippet containing a benign injection attempt (e.g., a description field with "ignore format rules") and verifies the generation still produces spec-compliant output.

8.5 — Generated import allowlist

Maintain a list of approved npm packages that generated code may import. Flag any import not on the allowlist as a manual review item.
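The allowlist check could be sketched as below. It is regex-based and assumes imports use string-literal specifiers; relative and absolute paths are skipped, and subpath imports are reduced to their package name. The helper names and the example package names in the usage note are illustrative only.

```javascript
// Reduce an import specifier to its npm package name:
// '@scope/pkg/sub' -> '@scope/pkg', 'pkg/sub' -> 'pkg'.
function packageName(specifier) {
  const parts = specifier.split('/');
  return specifier.startsWith('@') ? parts.slice(0, 2).join('/') : parts[0];
}

// Return the sorted, de-duplicated package names imported by `source`
// that are not on `allowlist`. Covers `import ... from`, side-effect
// imports, `export ... from`, dynamic import(), and require().
function findDisallowedImports(source, allowlist) {
  const specifierPattern =
    /(?:from\s+|require\(\s*|import\s*\(\s*|import\s+)['"]([^'"]+)['"]/g;
  const disallowed = new Set();
  for (const match of source.matchAll(specifierPattern)) {
    const specifier = match[1];
    if (specifier.startsWith('.') || specifier.startsWith('/')) continue; // local file
    const name = packageName(specifier);
    if (!allowlist.includes(name)) disallowed.add(name);
  }
  return [...disallowed].sort();
}
```

For example, with an allowlist of `['@nestjs/common', 'lodash']`, a generated file that imports a misspelled or unknown package would have that package name returned for manual review.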

Effort estimate: Medium-High. 3-5 days. Security scan and sanitization are low effort; red-team fixtures and import allowlisting are higher effort.

Trigger: Implement before any external user can influence domain/*.api.dsl content (i.e., before a UI or API to edit the DSL is exposed).


Tracking

| Rule | Status | Priority | Trigger |
| --- | --- | --- | --- |
| Rule 7 — Telemetry | Deferred | Medium | Before >10 entities or production deployment |
| Rule 8 — Risk controls | Deferred | High | Before DSL editing is exposed to external users |
| Rule 9 — Eval corpus automation | Deferred | Medium | After the contract stabilizes for the current entity set |

Rule 9 — Eval Corpus Automation

Status: Deferred. Reviewed eval fixtures are still the authoritative semantic gate.

Why it matters: The repository already requires eval-first behavior for new or changed entity coverage, but it does not yet synthesize starter eval contracts from the source of truth. Humans therefore still have to write, review, and finalize the first failing semantic contract by hand.

What needs to be built:

  • a deterministic helper that can scaffold tools/eval/fixtures/<entity>/ starters from the active source-of-truth slice
  • a prompt-to-eval helper that emits backend and frontend assertion starters before regeneration
  • a documented workflow where generated starters are reviewed and committed, instead of silently replacing the authoritative eval corpus on every run

Status of current contract:

  • This is not implemented yet.
  • The repo should not imply that evals are auto-generated today.
  • The repo should also not imply that a step-0 LLM subagent should rewrite committed eval fixtures on every regeneration run; that would collapse the independence of the semantic gate.
  • Until bounded automation exists, the reviewed eval-first rule in prompts/general-prompt.md and docs/generation-playbook.md remains required.

Effort estimate: Medium.

Trigger: After the contract stabilizes for the current entity set.

Last updated: 2026-04-03