Wave B Roadmap — State-of-the-Art Systems

Status: Shipped on main (June 2026)
Audience: Engineering, security review, portfolio narrative
Last updated: June 2026

This document captures four systems-level upgrades from the follow-up evaluation. They build on what shipped in Wave A (structured GnomadError payloads, elevation hardening, App.tsx decomposition) and address residual alpha gaps called out in CODE_REVIEW.md and SECURITY_MODEL.md.


Current baseline

| Area | Today |
|---|---|
| HITL | B1 shipped: HMAC tokens via hitl_token.rs; boolean bypass rejected |
| Local LLM | External Ollama HTTP; in-process GGUF for planner + optional local chat (embedded-llm build) |
| YOLO! | Broader FS via agent_settings; optional sandboxed shell (B4) in YOLO + experimental flag |
| Terminal UX | xterm.js live stream + replay on command cards; summary cards for simple runs |

flowchart LR
  subgraph today [Wave A]
    UI[Sudo Gate UI]
    IPC["invoke(hitl_approved: true)"]
    Rust[shell_session + privilege]
    UI --> IPC --> Rust
  end
  subgraph waveB [Wave B target]
    Token[HMAC approval token]
    Rust2[Verify token + command hash]
    UI2[Sudo Gate UI]
    UI2 --> Token --> Rust2
  end
  today -.-> waveB

| Order | Initiative | Why first |
|---|---|---|
| B1 | Cryptographic HITL tokens | Closes real IPC bypass class; small Rust surface; unblocks enterprise narrative |
| B2 | In-process local LLM | ✓ B2a planner + B2b local chat shipped |
| B3 | True terminal (Xterm.js) | ✓ Live stream + replay in chat |
| B4 | Micro-sandboxing for YOLO | ✓ Experimental sandbox-exec / bwrap |

B1 — Cryptographic HITL approval tokens

Problem

Any client that can call shell_session_run or agent_execute_tool with hitl_approved: true may bypass the UI gate. Safety heuristics still run, but approval is not cryptographically bound to a specific command, time window, or session.

Design

  1. check_command_safety (or a dedicated request_hitl_token command) returns:

    • requires_hitl_approval, danger_reason (unchanged)
    • approval_nonce + approval_token when HITL is required
  2. Token payload (signed, not encrypted — local app only):

    v1 | command_sha256 | nonce | issued_at_unix | expires_at_unix | scope
    
    • command_sha256: SHA-256 of normalized command string (trim, NFC)
    • scope: shell_run | elevated | path_once (future)
    • TTL: e.g. 60–120 seconds, single-use (nonce stored in memory until consumed or expired)
  3. Signing: HMAC-SHA256 with a per-install secret generated on first launch and stored in OS keychain (gnomad-hitl-secret), not in frontend.

  4. Execution: shell_session_run / elevated path accepts optional approval_token instead of bare hitl_approved. Rust:

    • Verifies HMAC + expiry + command hash match + nonce not reused
    • Re-runs check_command_safety (defense in depth)
    • On success, burns nonce
  5. Frontend: After Sudo Gate approve, pass approval_token from the pending safety response — never a raw boolean.

  6. Elevation: execute_elevated_command requires token with scope=elevated and matching command hash.
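
The verification order in step 4 can be sketched as follows. This is a minimal illustration rather than the contents of hitl_token.rs: the names TokenPayload, NonceStore, TokenError, and verify_and_burn are hypothetical, the HMAC-SHA256 check is represented by a caller-supplied closure (mac_is_valid) because it needs a crypto crate, and the command hash is assumed to arrive as a precomputed hex string. The defense-in-depth re-run of check_command_safety is left out.

```rust
use std::collections::HashMap;

/// Decoded payload: v1 | command_sha256 | nonce | issued_at | expires_at | scope.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TokenPayload {
    pub command_sha256: String, // lowercase hex, computed by the caller
    pub nonce: String,
    pub issued_at_unix: u64,
    pub expires_at_unix: u64,
    pub scope: String, // "shell_run" or "elevated"
}

#[derive(Debug, PartialEq, Eq)]
pub enum TokenError {
    BadSignature,
    Malformed,
    Expired,
    CommandMismatch,
    ScopeMismatch,
    NonceReused,
}

impl TokenPayload {
    /// Serialize the fields in roadmap order, separated by '|'.
    pub fn encode(&self) -> String {
        format!(
            "v1|{}|{}|{}|{}|{}",
            self.command_sha256, self.nonce, self.issued_at_unix, self.expires_at_unix, self.scope
        )
    }

    /// Parse an encoded payload, rejecting unknown versions and bad field counts.
    pub fn parse(encoded: &str) -> Result<Self, TokenError> {
        let parts: Vec<&str> = encoded.split('|').collect();
        if parts.len() != 6 || parts[0] != "v1" {
            return Err(TokenError::Malformed);
        }
        Ok(TokenPayload {
            command_sha256: parts[1].to_string(),
            nonce: parts[2].to_string(),
            issued_at_unix: parts[3].parse().map_err(|_| TokenError::Malformed)?,
            expires_at_unix: parts[4].parse().map_err(|_| TokenError::Malformed)?,
            scope: parts[5].to_string(),
        })
    }
}

/// Issued, not-yet-consumed nonces, kept in memory only (nonce -> expiry).
#[derive(Default)]
pub struct NonceStore {
    pending: HashMap<String, u64>,
}

impl NonceStore {
    pub fn issue(&mut self, payload: &TokenPayload) {
        self.pending.insert(payload.nonce.clone(), payload.expires_at_unix);
    }
}

/// Checks run in this order: MAC, expiry, command hash, scope, nonce.
/// `mac_is_valid` is a stand-in for the real HMAC-SHA256 comparison.
pub fn verify_and_burn(
    store: &mut NonceStore,
    encoded: &str,
    mac_is_valid: impl Fn(&str) -> bool,
    expected_command_sha256: &str,
    expected_scope: &str,
    now_unix: u64,
) -> Result<TokenPayload, TokenError> {
    if !mac_is_valid(encoded) {
        return Err(TokenError::BadSignature);
    }
    let payload = TokenPayload::parse(encoded)?;
    if now_unix >= payload.expires_at_unix {
        return Err(TokenError::Expired);
    }
    if payload.command_sha256 != expected_command_sha256 {
        return Err(TokenError::CommandMismatch);
    }
    if payload.scope != expected_scope {
        return Err(TokenError::ScopeMismatch);
    }
    // Removing the entry is the burn: a second presentation finds nothing.
    if store.pending.remove(&payload.nonce).is_none() {
        return Err(TokenError::NonceReused);
    }
    Ok(payload)
}
```

The nonce is burned last so that a token failing an earlier check does not consume the pending approval. A production version would also sweep expired nonces from the store and compare MACs in constant time.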

Files (indicative)

| Layer | Files |
|---|---|
| Rust | New hitl_token.rs; privilege.rs, shell_session.rs, agent_runtime.rs, lib.rs |
| TS | shellSession.ts, agentRuntime.ts, useAgentExecution.ts, agentLoop.ts |
| Docs | SECURITY_MODEL.md, QA_CHECKLIST.md (IPC bypass test case) |

Acceptance criteria

Effort & risk

Effort: ~3–5 days
Risk: Low–medium (API migration; keep deprecated bool one release behind feature flag)

B2 — In-process native local LLM (Candle or llama.cpp)

Problem

Ollama adds an extra daemon, version skew, and an install step. The portfolio story, “local works out of the box,” requires an embedded inference path for small models (e.g. 1B Qwen-Coder for planner / tag extraction).

Options

| Backend | Pros | Cons |
|---|---|---|
| llama-cpp-2 (Rust bindings) | Mature GGUF ecosystem; matches existing GGUF settings field | Binary size, CPU/GPU feature matrix per platform |
| Candle (Hugging Face) | Pure Rust, good for custom models | Heavier integration for chat templates; GPU paths vary |

Recommendation: llama-cpp-2 for v1 in-process path (aligns with stored GGUF path + planner use case); keep Ollama as optional “bring your own models.”

Design

  1. local_inference module in src-tauri:

    • load_model(path: PathBuf, n_ctx, n_threads) — lazy singleton
    • complete(prompt, max_tokens, stop) → string
    • plan_command(prose) -> Result<String, GnomadError> — used by command planner
  2. Model delivery:

    • Phase B2a: User-selected GGUF on disk (Settings)
    • Phase B2b (optional): Bundled tiny model in app resources (size cap ~500MB–1GB for alpha)
  3. Frontend: Provider mode local-embedded vs local-ollama; same chat UI, different backend invoke.

  4. Build: Feature flag embedded-llm; CI builds without it on constrained runners; macOS/Windows/Linux matrix docs in BUILD_PLATFORMS.md.
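
The lazy-singleton shape of local_inference in step 1 might look like the sketch below. It uses only the standard library: LocalModel is a stub with no llama.cpp binding behind it, complete merely echoes the prompt so the control flow can be exercised, and the GnomadError variants shown are hypothetical stand-ins for the project's real error type.

```rust
use std::path::{Path, PathBuf};
use std::sync::OnceLock;

/// Hypothetical stand-in for the project's structured error type.
#[derive(Debug, PartialEq, Eq)]
pub enum GnomadError {
    ModelNotLoaded,
    EmptyPrompt,
}

/// Stub model handle; a real build would own an inference context here.
#[derive(Debug)]
pub struct LocalModel {
    pub path: PathBuf,
    pub n_ctx: u32,
    pub n_threads: u32,
}

static MODEL: OnceLock<LocalModel> = OnceLock::new();

/// Lazy singleton: the first call initializes the model; later calls
/// return the same instance and ignore their arguments.
pub fn load_model(path: &Path, n_ctx: u32, n_threads: u32) -> &'static LocalModel {
    MODEL.get_or_init(|| LocalModel {
        path: path.to_path_buf(),
        n_ctx,
        n_threads,
    })
}

impl LocalModel {
    /// Placeholder completion: echoes up to `max_tokens` whitespace-separated
    /// words of the prompt and cuts the text at the first stop sequence found.
    pub fn complete(&self, prompt: &str, max_tokens: usize, stop: &[&str]) -> String {
        let words: Vec<&str> = prompt.split_whitespace().take(max_tokens).collect();
        let mut text = words.join(" ");
        for s in stop {
            if let Some(idx) = text.find(*s) {
                text.truncate(idx);
            }
        }
        text
    }
}

/// Planner entry point: fails fast if no model has been loaded yet.
pub fn plan_command(prose: &str) -> Result<String, GnomadError> {
    if prose.trim().is_empty() {
        return Err(GnomadError::EmptyPrompt);
    }
    let model = MODEL.get().ok_or(GnomadError::ModelNotLoaded)?;
    Ok(model.complete(prose, 64, &["\n"]))
}
```

A OnceLock cannot be reset, so this shape fits only if the GGUF path is fixed for the life of the process. If the user can switch models in Settings without restarting, a Mutex around an Option would be the more likely choice.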

Acceptance criteria

Effort & risk

Effort: ~2–4 weeks (bindings, threading, packaging)
Risk: Medium–high (artifact size, Metal/CUDA/CPU fallbacks, licensing of bundled weights)

B3 — True terminal emulation (Xterm.js)

Problem

PTY output is reduced to stdout/stderr strings for cards. ANSI colors, progress bars, TUI apps, and interactive prompts render lossily or confusingly, and the stall/timeout heuristics fight full-screen TUIs.

Design

  1. Rust (minimal change): Already emits PTY chunks via events — add optional base64 or raw UTF-8 frame event shell-pty-output with session id (if not already sufficient).

  2. Frontend:

    • Add @xterm/xterm + fit addon
    • TerminalPanel component: embedded in chat when user expands a run, or docked below composer
    • Wire subscribeShellOutput → term.write(data)
    • Input: optional term.onData → new shell_session_write command for interactive sessions (separate from one-shot shell_run)
  3. Modes:

    • Compact (default): Keep ShellCommandBlock summary for simple commands
    • Live: Open xterm when command tagged interactive or user clicks “Show terminal”
  4. Security: xterm does not bypass gates; interactive mode still requires HITL token for flagged commands.
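
One detail worth sketching for the raw UTF-8 frame option in step 1: PTY reads can split a multi-byte character across two chunks, so the emitter needs to hold back an incomplete tail before building a shell-pty-output frame. The Utf8Chunker and PtyFrame types below are hypothetical illustrations, not existing project code, and the frame shape (session id plus text) is an assumption.

```rust
/// Hypothetical frame carried by a `shell-pty-output` event.
#[derive(Debug, PartialEq, Eq)]
pub struct PtyFrame {
    pub session_id: String,
    pub data: String,
}

/// Buffers raw PTY bytes and releases only complete UTF-8 text.
#[derive(Default)]
pub struct Utf8Chunker {
    pending: Vec<u8>,
}

impl Utf8Chunker {
    pub fn new() -> Self {
        Self::default()
    }

    /// Append a chunk and return all text that is complete so far.
    /// Invalid bytes become U+FFFD; an unfinished trailing sequence
    /// stays buffered until the next chunk arrives.
    pub fn push(&mut self, chunk: &[u8]) -> String {
        self.pending.extend_from_slice(chunk);
        let mut out = String::new();
        let mut start = 0;
        loop {
            match std::str::from_utf8(&self.pending[start..]) {
                Ok(text) => {
                    out.push_str(text);
                    start = self.pending.len();
                    break;
                }
                Err(err) => {
                    let valid_end = start + err.valid_up_to();
                    // The prefix up to `valid_end` was just validated.
                    out.push_str(&String::from_utf8_lossy(&self.pending[start..valid_end]));
                    match err.error_len() {
                        // Definitely invalid bytes: replace and skip them.
                        Some(bad) => {
                            out.push('\u{FFFD}');
                            start = valid_end + bad;
                        }
                        // Incomplete sequence at the end: wait for more input.
                        None => {
                            start = valid_end;
                            break;
                        }
                    }
                }
            }
        }
        self.pending.drain(..start);
        out
    }

    /// Build a frame for a session, or None if no complete text is ready.
    pub fn frame(&mut self, session_id: &str, chunk: &[u8]) -> Option<PtyFrame> {
        let data = self.push(chunk);
        if data.is_empty() {
            return None;
        }
        Some(PtyFrame {
            session_id: session_id.to_string(),
            data,
        })
    }
}
```

The base64 option sidesteps this entirely by shipping bytes unmodified and letting the frontend decode them, at the cost of a larger payload per event.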

Acceptance criteria

Effort & risk

Effort: ~1–2 weeks
Risk: Medium (bundle size, focus/keyboard in Tauri webview, accessibility)

B4 — Micro-sandboxing for YOLO! mode

Problem

YOLO expands filesystem reach, and a shell on the host PTY can still exfiltrate data, pivot, or cause damage outside the workspace if the model is tricked. Goal: when YOLO is on, contain shell side effects without breaking normal Standard mode.

Design (platform-specific)

| OS | Mechanism | Notes |
|---|---|---|
| macOS | sandbox-exec with dynamic profile per session | Profile allows: workspace R/W, temp dir, deny network optional, deny ~/.ssh etc. Fragile across macOS versions; needs version matrix |
| Linux | bubblewrap / user namespaces | Mount minimal FS; require optional dep (LINUX_PACKAGES.md) |
| Windows | Workspace-scoped init (yolo-shell-init.cmd) | TEMP + cwd scoped to workspace; network not blocked; full AppContainer deferred |

Principle: the sandbox wraps the shell execution path only, since agent_fs is already path-gated. Optionally, route the YOLO shell through a bwrap helper binary shipped with the app.
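
For the Linux row, the bwrap invocation could be assembled roughly as below. This is a sketch under stated assumptions, not the shipped helper: bwrap_args is a hypothetical name, the mount list assumes a merged-/usr distribution where /bin and /lib are symlinks into /usr, and a real profile would likely need more (for example /etc entries for DNS and certificates when the network is shared).

```rust
use std::path::Path;

/// Append string literals or borrowed strings to the argument vector.
fn push(args: &mut Vec<String>, items: &[&str]) {
    args.extend(items.iter().map(|s| s.to_string()));
}

/// Build the argument list for `bwrap` so that `shell_cmd` runs with a
/// read-only system tree and a writable workspace only.
/// Assumes a merged-/usr layout; adjust the binds for other distributions.
pub fn bwrap_args(workspace: &Path, allow_network: bool, shell_cmd: &str) -> Vec<String> {
    let ws_owned = workspace.to_string_lossy().into_owned();
    let ws = ws_owned.as_str();
    let mut args = Vec::new();

    // Drop all namespaces we can, and die if the parent app exits.
    push(&mut args, &["--die-with-parent", "--unshare-all"]);
    if allow_network {
        // Only valid together with --unshare-all: keep the host network.
        push(&mut args, &["--share-net"]);
    }

    // Minimal read-only system view.
    push(&mut args, &["--ro-bind", "/usr", "/usr"]);
    push(&mut args, &["--symlink", "usr/bin", "/bin"]);
    push(&mut args, &["--symlink", "usr/lib", "/lib"]);
    push(&mut args, &["--symlink", "usr/lib64", "/lib64"]);
    push(&mut args, &["--proc", "/proc", "--dev", "/dev", "--tmpfs", "/tmp"]);

    // The workspace is the only writable host path; start the shell there.
    push(&mut args, &["--bind", ws, ws, "--chdir", ws]);

    // Everything after the options is the command to run inside the sandbox.
    push(&mut args, &["/bin/sh", "-c", shell_cmd]);
    args
}
```

The result can be handed to std::process::Command::new("bwrap").args(...). Because the home directory is never bound, paths such as ~/.ssh are simply absent inside the sandbox rather than explicitly denied.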

UX

Acceptance criteria

Effort & risk

Effort: ~4–8 weeks across OSes
Risk: High (support burden, false sense of security if profiles wrong; enterprise may demand third-party audit)

Cross-cutting dependencies

flowchart TB
  B1[B1 HITL tokens]
  B2[B2 Embedded LLM]
  B3[B3 Xterm.js]
  B4[B4 Sandbox YOLO]
  B1 --> B4
  B3 --> B4
  B1 --> B3
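
Read as "must land before", the edges in the diagram imply a build order. The small topological sort below (Kahn's algorithm, with ties broken alphabetically) is an illustration only; topo_order is a hypothetical helper, not project code.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Return the nodes in an order where every edge (from, to) has `from`
/// before `to`. Returns None on a cycle or an edge to an unknown node.
pub fn topo_order(nodes: &[&str], edges: &[(&str, &str)]) -> Option<Vec<String>> {
    // Count incoming edges per node.
    let mut indegree: BTreeMap<&str, usize> = nodes.iter().map(|n| (*n, 0)).collect();
    for (_, to) in edges {
        *indegree.get_mut(to)? += 1;
    }

    // Nodes with no unmet dependencies, kept sorted for a stable result.
    let mut ready: BTreeSet<&str> = indegree
        .iter()
        .filter(|(_, degree)| **degree == 0)
        .map(|(node, _)| *node)
        .collect();

    let mut order = Vec::new();
    while let Some(next) = ready.pop_first() {
        order.push(next.to_string());
        for (from, to) in edges {
            if *from == next {
                let degree = indegree.get_mut(to)?;
                *degree -= 1;
                if *degree == 0 {
                    ready.insert(*to);
                }
            }
        }
    }

    // Any node left unplaced means the graph had a cycle.
    if order.len() == nodes.len() {
        Some(order)
    } else {
        None
    }
}
```

With the four initiatives and the three edges above, the order comes out as B1, B2, B3, B4. B2 has no edges, so its position is set only by the alphabetical tie-break and it can be scheduled independently.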

Mapping to product versions

| Version | Wave B items |
|---|---|
| v0.2 beta | B1 (HITL tokens) + Wave B error migration (llm, command_planner, chat_history) |
| v0.3 | B3 (xterm) + B2a (GGUF in-process planner) |
| v0.4+ | B2b (optional bundled model), B4 (sandbox experimental per OS) |

See also ROADMAP.md.


Explicit non-goals (Wave B)