Wave B Roadmap — State-of-the-Art Systems

Status: Shipped on main (June 2026)
Audience: Engineering, security review, portfolio narrative
Last updated: June 2026

This document captures four systems-level upgrades from the follow-up evaluation. They build on what shipped in Wave A (structured GnomadError payloads, elevation hardening, App.tsx decomposition) and address residual alpha gaps called out in CODE_REVIEW.md and SECURITY_MODEL.md.

Current baseline

| Area | Today |
| --- | --- |
| HITL | B1 shipped: HMAC tokens via `hitl_token.rs`; boolean bypass rejected |
| Local LLM | External Ollama HTTP; in-process GGUF for planner + optional local chat (`embedded-llm` build) |
| YOLO! | Broader FS via `agent_settings`; optional sandboxed shell (B4) in YOLO + experimental flag |
| Terminal UX | xterm.js live stream + replay on command cards; summary cards for simple runs |

flowchart LR
  subgraph today [Wave A]
    UI[Sudo Gate UI]
    IPC["invoke(hitl_approved: true)"]
    Rust[shell_session + privilege]
    UI --> IPC --> Rust
  end
  subgraph waveB [Wave B target]
    Token[HMAC approval token]
    Rust2[Verify token + command hash]
    UI2[Sudo Gate UI]
    UI2 --> Token --> Rust2
  end
  today -.-> waveB

Recommended delivery order

| Order | Initiative | Rationale / status |
| --- | --- | --- |
| B1 | Cryptographic HITL tokens | ✓ Shipped. Closes a real IPC bypass class; small Rust surface; unblocks enterprise narrative |
| B2 | In-process local LLM | ✓ B2a planner + B2b local chat shipped |
| B3 | True terminal (Xterm.js) | ✓ Live stream + replay in chat |
| B4 | Micro-sandboxing for YOLO | ✓ Experimental sandbox-exec / bwrap |

B1 — Cryptographic HITL approval tokens

Problem

Any client that can call `shell_session_run` or `agent_execute_tool` with `hitl_approved: true` can bypass the UI gate. Safety heuristics still run, but the approval is not cryptographically bound to a specific command, time window, or session.

Design

`check_command_safety` (or a dedicated `request_hitl_token` command) returns:
- `requires_hitl_approval`, `danger_reason` (unchanged)
- `approval_nonce` + `approval_token` when HITL is required

Token payload (signed, not encrypted; local app only):
```
v1 | command_sha256 | nonce | issued_at_unix | expires_at_unix | scope
```
- `command_sha256`: SHA-256 of the normalized command string (trim, NFC)
- `scope`: `shell_run` | `elevated` | `path_once` (future)
- TTL: e.g. 60–120 seconds, single-use (nonce stored in memory until consumed or expired)

Signing: HMAC-SHA256 with a per-install secret generated on first launch and stored in the OS keychain (`gnomad-hitl-secret`), never in the frontend.

Execution: `shell_session_run` and the elevated path accept an optional `approval_token` instead of a bare `hitl_approved`. Rust:
- Verifies HMAC + expiry + command hash match + nonce not reused
- Re-runs `check_command_safety` (defense in depth)
- On success, burns the nonce

Frontend: after Sudo Gate approval, pass the `approval_token` from the pending safety response, never a raw boolean.

Elevation: `execute_elevated_command` requires a token with `scope=elevated` and a matching command hash.
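
The issue/verify flow above can be sketched as follows. This is an illustration of the logic only, written in TypeScript against Node's `node:crypto` for brevity; in the design the secret and the verifier live in Rust (`hitl_token.rs`) and never reach the frontend. All function and type names here (`issueToken`, `verifyToken`, `Verdict`, `liveNonces`) are hypothetical, the 90-second TTL is one value from the suggested 60–120 second range, and the re-run of `check_command_safety` is omitted.

```typescript
import { createHash, createHmac, randomBytes, timingSafeEqual } from "node:crypto";

// Hypothetical names; scopes mirror the design (path_once is future work).
type Scope = "shell_run" | "elevated";
type Verdict = "ok" | "bad_format" | "bad_mac" | "expired" | "wrong_command" | "wrong_scope" | "replayed";

// Normalize as described in the design (trim, NFC), then hash.
function commandSha256(command: string): string {
  return createHash("sha256").update(command.trim().normalize("NFC"), "utf8").digest("hex");
}

function sign(secret: Buffer, payload: string): string {
  return createHmac("sha256", secret).update(payload, "utf8").digest("hex");
}

// Issued, not-yet-consumed nonces mapped to their expiry (unix seconds).
const liveNonces = new Map<string, number>();

function issueToken(secret: Buffer, command: string, scope: Scope, nowUnix: number, ttlSecs = 90): string {
  const nonce = randomBytes(16).toString("hex");
  const expiresAt = nowUnix + ttlSecs;
  // v1 | command_sha256 | nonce | issued_at_unix | expires_at_unix | scope
  const payload = ["v1", commandSha256(command), nonce, nowUnix, expiresAt, scope].join("|");
  liveNonces.set(nonce, expiresAt);
  return `${payload}|${sign(secret, payload)}`;
}

function verifyToken(secret: Buffer, token: string, command: string, scope: Scope, nowUnix: number): Verdict {
  const parts = token.split("|");
  if (parts.length !== 7 || parts[0] !== "v1") return "bad_format";
  const [, hash, nonce, , expiresAt, tokenScope, mac] = parts;

  // Check the MAC first, in constant time, before trusting any field.
  const expected = Buffer.from(sign(secret, parts.slice(0, 6).join("|")), "hex");
  const given = Buffer.from(mac, "hex");
  if (given.length !== expected.length || !timingSafeEqual(given, expected)) return "bad_mac";

  if (nowUnix >= Number(expiresAt)) {
    liveNonces.delete(nonce); // expired nonces are dropped, not kept forever
    return "expired";
  }
  if (hash !== commandSha256(command)) return "wrong_command";
  if (tokenScope !== scope) return "wrong_scope";
  // Burn the nonce on success; a second use finds nothing to delete.
  if (!liveNonces.delete(nonce)) return "replayed";
  return "ok";
}
```

Because the command hash is part of the signed payload, a token approved for one command cannot be replayed against another, and the in-memory nonce map is what makes each token single-use.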

Files (indicative)

| Layer | Files |
| --- | --- |
| Rust | New `hitl_token.rs`; `privilege.rs`, `shell_session.rs`, `agent_runtime.rs`, `lib.rs` |
| TS | `shellSession.ts`, `agentRuntime.ts`, `useAgentExecution.ts`, `agentLoop.ts` |
| Docs | `SECURITY_MODEL.md`, `QA_CHECKLIST.md` (IPC bypass test case) |

Acceptance criteria

- `hitl_approved: true` without a valid token → `safety_blocked` JSON payload
- Token for command A rejected when executing command B
- Reused token rejected (expiry is tested manually)
- Unit tests: sign/verify, wrong hash, replay, boolean bypass
- Manual: approve in UI → success; devtools invoke with boolean only → fail

Effort & risk


Effort	~3–5 days
Risk	Low–medium (API migration; keep the deprecated boolean for one release, behind a feature flag)

B2 — In-process native local LLM (Candle or llama.cpp)

Problem

Ollama adds an extra daemon, version skew, and an install step. Portfolio story: “local works out of the box” requires an embedded inference path for small models (e.g. 1B Qwen-Coder for planner / tag extraction).

Options

| Backend | Pros | Cons |
| --- | --- | --- |
| llama-cpp-2 (Rust bindings) | Mature GGUF ecosystem; matches existing GGUF settings field | Binary size, CPU/GPU feature matrix per platform |
| Candle (Hugging Face) | Pure Rust, good for custom models | Heavier integration for chat templates; GPU paths vary |

Recommendation: llama-cpp-2 for v1 in-process path (aligns with stored GGUF path + planner use case); keep Ollama as optional “bring your own models.”

Design

`local_inference` module in `src-tauri`:
- `load_model(path: PathBuf, n_ctx, n_threads)`: lazy singleton
- `complete(prompt, max_tokens, stop)` → string
- `plan_command(prose) -> Result<String, GnomadError>`: used by the command planner

Model delivery:
- Phase B2a: User-selected GGUF on disk (Settings)
- Phase B2b (optional): Bundled tiny model in app resources (size cap ~500MB–1GB for alpha)

Frontend: Provider mode `local-embedded` vs `local-ollama`; same chat UI, different backend invoke.

Build: Feature flag `embedded-llm`; CI builds without it on constrained runners; macOS/Windows/Linux matrix docs in `BUILD_PLATFORMS.md`.
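
The "same chat UI, different backend invoke" split might look like the sketch below on the frontend side. Only the two provider modes come from the design; the command names (`local_inference_complete`, `ollama_complete`), the settings shape, and the error shape are assumptions made for illustration.

```typescript
// The two modes are from the design; everything else here is hypothetical.
type ProviderMode = "local-embedded" | "local-ollama";

interface LocalLlmSettings {
  mode: ProviderMode;
  ggufPath?: string; // user-selected GGUF on disk (Phase B2a)
}

type PlannerCall =
  | { ok: true; command: string; args: { prompt: string; modelPath?: string } }
  | { ok: false; error: { kind: "llm"; message: string } };

// Decide which backend command to invoke, without performing the call.
function planPlannerCall(settings: LocalLlmSettings, prompt: string): PlannerCall {
  if (settings.mode === "local-ollama") {
    // Ollama path stays as it is today.
    return { ok: true, command: "ollama_complete", args: { prompt } };
  }
  if (!settings.ggufPath) {
    // Surface a structured error instead of failing inside the model loader.
    return { ok: false, error: { kind: "llm", message: "No GGUF model configured for embedded inference" } };
  }
  return { ok: true, command: "local_inference_complete", args: { prompt, modelPath: settings.ggufPath } };
}
```

Keeping the decision in a pure function makes the "Ollama path unchanged" regression easy to check without a running model.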

Acceptance criteria

- Planner works with no Ollama process when a GGUF is configured
- Graceful `llm` error payload if the model is missing or fails to load
- Document RAM/CPU expectations (e.g. 1B Q4 ≈ 1GB RAM); see `BUILD.md`
- Ollama path unchanged (regression)
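
As a rough check on the "1B Q4 ≈ 1GB RAM" figure, weight memory is parameter count times bits per weight. The sketch below assumes about 4.5 bits per weight, which is what a 4-bit block format with per-block scales works out to; the function name and the overhead parameter are hypothetical, and the gap between the weights and the quoted 1GB is attributed to KV cache and runtime buffers, which vary with context size.

```typescript
// Approximate resident bytes for a quantized model.
// overheadBytes (KV cache, scratch buffers) is an assumption supplied by the caller.
function estimateModelBytes(paramCount: number, bitsPerWeight: number, overheadBytes = 0): number {
  return (paramCount * bitsPerWeight) / 8 + overheadBytes;
}

// 1B parameters at an assumed 4.5 bits/weight: weights alone.
const weightBytes = estimateModelBytes(1e9, 4.5);
console.log((weightBytes / 1e9).toFixed(2), "GB of weights before cache and buffers");
```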

Effort & risk


Effort	~2–4 weeks (bindings, threading, packaging)
Risk	Medium–high (artifact size, Metal/CUDA/CPU fallbacks, licensing of bundled weights)

B3 — True terminal emulation (Xterm.js)

Problem

PTY output is reduced to stdout/stderr strings for cards. ANSI colors, progress bars, TUI apps, and interactive prompts are lossy or confusing (stall/timeout heuristics fight full-screen TUIs).

Design

Rust (minimal change): Already emits PTY chunks via events. Add an optional base64 or raw UTF-8 frame event `shell-pty-output` with session id (if not already sufficient).

Frontend:
- Add `@xterm/xterm` + fit addon
- `TerminalPanel` component: embedded in chat when the user expands a run, or docked below the composer
- Wire `subscribeShellOutput` → `term.write(data)`
- Input: optional `term.onData` → new `shell_session_write` command for interactive sessions (separate from one-shot `shell_run`)

Modes:
- Compact (default): Keep `ShellCommandBlock` summary for simple commands
- Live: Open xterm when a command is tagged interactive or the user clicks “Show terminal”

Security: xterm does not bypass gates; interactive mode still requires a HITL token for flagged commands.
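
The frontend wiring can be sketched without pulling in the real package by typing only the two terminal members it needs; xterm.js's `Terminal` provides `write` (accepting a string or `Uint8Array`) and `onData` (returning a disposable). The frame payload shape, the `attachPtyStream` name, and the `writeToSession` callback standing in for `shell_session_write` are assumptions, as is the choice of base64 framing.

```typescript
// Structural subset of the xterm.js Terminal used by this sketch.
interface TerminalLike {
  write(data: string | Uint8Array): void;
  onData(listener: (data: string) => void): { dispose(): void };
}

// Hypothetical shape of a shell-pty-output frame.
type PtyFrame = { sessionId: string; dataBase64: string };
type Unsubscribe = () => void;

function attachPtyStream(
  term: TerminalLike,
  sessionId: string,
  subscribeShellOutput: (onFrame: (frame: PtyFrame) => void) => Unsubscribe,
  writeToSession?: (sessionId: string, data: string) => void, // stands in for shell_session_write
): Unsubscribe {
  const unsubscribe = subscribeShellOutput((frame) => {
    if (frame.sessionId !== sessionId) return; // ignore other sessions' output
    // Pass raw bytes through so ANSI sequences and split UTF-8 survive intact.
    term.write(Uint8Array.from(atob(frame.dataBase64), (ch) => ch.charCodeAt(0)));
  });
  // Keystrokes are forwarded only when the session is interactive.
  const input = writeToSession ? term.onData((data) => writeToSession(sessionId, data)) : undefined;
  return () => {
    unsubscribe();
    input?.dispose();
  };
}
```

Returning a single teardown function lets `TerminalPanel` detach both directions when a run card collapses or the session ends.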

Acceptance criteria

- `ls --color=auto` and `npm install` progress render in live mode / replay
- Stop button sends interrupt
- No regression for cloud agent tool loop (summary cards still work)
- Cross-platform: macOS, Windows, Linux windowed + panel

Effort & risk


Effort	~1–2 weeks
Risk	Medium (bundle size, focus/keyboard in Tauri webview, accessibility)

B4 — Micro-sandboxing for YOLO! mode

Problem

YOLO expands filesystem reach; shell on host PTY can still exfiltrate, pivot, or damage outside workspace if the model is tricked. Goal: when YOLO is on, contain shell side effects without breaking normal Standard mode.

Design (platform-specific)

| OS | Mechanism | Notes |
| --- | --- | --- |
| macOS | `sandbox-exec` with dynamic profile per session | Profile allows: workspace R/W, temp dir, deny network optional, deny `~/.ssh` etc. Fragile across macOS versions; needs a version matrix |
| Linux | bubblewrap / user namespaces | Mount minimal FS; require optional dep (`LINUX_PACKAGES.md`) |
| Windows | Workspace-scoped init (`yolo-shell-init.cmd`) | TEMP + cwd scoped to workspace; network not blocked; full AppContainer deferred |

Principle: The sandbox wraps the shell execution path only, because `agent_fs` is already path-gated. Optionally, route the YOLO shell through a bwrap helper binary shipped with the app.
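
For the Linux row, the "mount minimal FS" idea can be sketched as an argument builder for bubblewrap. The flags used (`--ro-bind`, `--bind`, `--symlink`, `--proc`, `--dev`, `--tmpfs`, `--chdir`, `--unshare-pid`, `--unshare-net`, `--die-with-parent`) are real bwrap options, but the specific mount set is an assumption: it presumes a merged-`/usr` distro and would need per-distro testing. The function and option names are hypothetical.

```typescript
interface BwrapOptions {
  workspace: string;     // absolute path, mounted read-write
  allowNetwork: boolean; // false adds a private network namespace
  shell: string;         // e.g. "/bin/sh"
}

// Build the argument vector passed to the bwrap helper (not including "bwrap" itself).
function buildBwrapArgs(opts: BwrapOptions): string[] {
  const argv = [
    "--die-with-parent",          // sandbox dies if the app exits
    "--unshare-pid",
    "--ro-bind", "/usr", "/usr",  // system binaries and libraries, read-only
    "--symlink", "usr/bin", "/bin",
    "--symlink", "usr/lib", "/lib",
    "--symlink", "usr/lib64", "/lib64",
    "--proc", "/proc",
    "--dev", "/dev",
    "--tmpfs", "/tmp",            // private temp dir
    "--bind", opts.workspace, opts.workspace,
    "--chdir", opts.workspace,
  ];
  if (!opts.allowNetwork) argv.push("--unshare-net");
  argv.push(opts.shell);
  return argv;
}
```

Nothing outside `/usr`, the workspace, and the synthetic mounts is visible inside the namespace, so paths such as `~/.ssh` simply do not exist for the sandboxed shell.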

UX

- Settings → YOLO: sub-option “Sandbox shell (experimental)” with platform badge
- Audit log records `sandboxed: true` and profile hash
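
One way to produce the audit fields is to hash the exact profile text that was applied, so a later reader can tell which profile version governed a run. The entry shape and function name below are hypothetical, and SHA-256 is an assumed choice of hash; only the `sandboxed` flag and the idea of a profile hash come from the text above.

```typescript
import { createHash } from "node:crypto";

// Hypothetical audit entry shape.
interface ShellAuditEntry {
  timestamp: string;
  command: string;
  sandboxed: boolean;
  profileHash: string | null; // hex SHA-256 of the applied profile, null on host runs
}

function shellAuditEntry(command: string, profileText: string | null, at: Date): ShellAuditEntry {
  return {
    timestamp: at.toISOString(),
    command,
    sandboxed: profileText !== null,
    profileHash:
      profileText === null ? null : createHash("sha256").update(profileText, "utf8").digest("hex"),
  };
}
```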

Acceptance criteria

- In sandboxed YOLO: reads outside the workspace are blocked; workspace writes are allowed (profile-dependent)
- Escape attempts documented in test notes (not pen-test complete)
- Clear fallback when the sandbox helper is missing: disable the feature and surface an error rather than silently running on the host
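
The fallback criterion amounts to a fail-closed launch decision, sketched below. The type and field names are hypothetical; the point illustrated is that a requested sandbox with no helper available yields a refusal, never a host run.

```typescript
type ShellLaunch =
  | { kind: "host"; argv: string[] }
  | { kind: "sandboxed"; argv: string[] }
  | { kind: "refused"; reason: string };

interface LaunchRequest {
  yoloEnabled: boolean;
  sandboxRequested: boolean;  // the experimental sub-option in Settings
  helperPath: string | null;  // null when sandbox-exec / bwrap was not found
  shell: string;
}

function planShellLaunch(req: LaunchRequest): ShellLaunch {
  // Standard mode, or YOLO without the sandbox option: unchanged host behaviour.
  if (!req.yoloEnabled || !req.sandboxRequested) return { kind: "host", argv: [req.shell] };
  // Sandbox was asked for but cannot be provided: fail closed.
  if (req.helperPath === null) {
    return { kind: "refused", reason: "Sandbox helper not found; sandboxed shell is disabled" };
  }
  return { kind: "sandboxed", argv: [req.helperPath, req.shell] };
}
```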

Effort & risk


Effort	~4–8 weeks across OSes
Risk	High — support burden, false sense of security if profiles wrong; enterprise may demand third-party audit

Cross-cutting dependencies

flowchart TB
  B1[B1 HITL tokens]
  B2[B2 Embedded LLM]
  B3[B3 Xterm.js]
  B4[B4 Sandbox YOLO]
  B1 --> B4
  B3 --> B4
  B1 --> B3

- B4 should assume B1 so sandboxed runs cannot skip gates via IPC.
- B3 interactive PTY should require tokens for elevated/interactive flows.
- B2 is largely orthogonal; it improves the planner without widening the shell attack surface.

Mapping to product versions

| Version | Wave B items |
| --- | --- |
| v0.2 beta | B1 (HITL tokens) + Wave B error migration (`llm`, `command_planner`, `chat_history`) |
| v0.3 | B3 (xterm) + B2a (GGUF in-process planner) |
| v0.4+ | B2b (optional bundled model), B4 (sandbox experimental per OS) |

Explicit non-goals (Wave B)

- Full VM per command (Docker/Podman required): too heavy for the default install
- Remote attestation / hardware security modules
- Replacing cloud providers with on-device 70B models
- Autonomous background agents without a user message

Related docs

- `SECURITY_MODEL.md`: threat model + future HITL token section
- `AGENT_CAPABILITIES_PROPOSAL.md`: agent architecture
- `CODE_REVIEW.md`: Wave A completion status
- `CROSS_PLATFORM_CHECKLIST.md`: smoke tests when each B item lands