Continuity completion plan

Execution roadmap for the remaining ActiveGraph continuity stages (2a → 5, excluding 4a) — seam, sketch, acceptance, and tests per stage.

A crisp execution roadmap for the remaining continuity stages, not a re-explanation of the underlying design. Each stage points back to its detail-bearing plan (activegraph-continuity, developer-features, nixos-sandbox-fleet) and gives only what's needed to cut a PR: seam, sketch, acceptance, tests.

Status: ready to execute — covers every remaining stage in the master table except Stage 4a (the NixOS sandbox-host + microVM fleet, reserved for a dedicated infra push); Stages 4b and 5d are included in degenerate, branch-based form.

Scope at a glance

ID	Stage	Size	Depends on	Why now / status
2a	Provenance → graph dual-write	XS (~50 lines, 1 file)	1b ✅	✅ shipped (`af443555`)
2b	Behavior registry / dispatcher	M (~150–300 lines, 3–5 files)	1b ✅	✅ shipped (`1367e19b`)
2c	Agent triggers (agents-as-behaviors)	S (~50–100 lines glue)	0 ✅, 2b	✅ shipped (`84db920e`)
5a	Trace-as-product UI (`graph_events` viewer)	M (~200–400 lines)	1b ✅	✅ shipped (`079634c1`)
5b	Semantic memory injection at prompt render	S (~80–150 lines)	2a	✅ shipped (`5ac4719d`)
5c	Propose/accept workflow for agent graph edits	M (~250–400 lines)	2a + 2b	✅ shipped (`56686c02`)
4b-*	`workspace_snapshot` graph node + capture hook (metadata only)	XS (~40 lines)	1b ✅	✅ shipped (`d2b036e8`)
5d-*	Branch-based self-improvement loop (no warm fs)	M (~200–300 lines)	3 ✅ + 2a + 2b	Fork → diff → score → promote; bytes-side waits on 4a

Excluded: 4a (NixOS sandbox-host module + microVM fleet + nixos-anywhere + disko + ZFS + microvm.nix + runner closure + attic), reserved for a separate dedicated push. Anything in 4b/5d that needs real filesystem snapshots is deferred with it.

Total estimate (excl. 4a): ~6 mergeable PRs, ~1000–1700 lines of net code, plus tests + migrations + view templates. Cleanest path to ship all of it: 2a → 2b → 2c → 5b → 4b-* → 5c → 5d-*, with 5a landing in parallel anywhere along the way.

Execution order + parallelization

Sequential spine (cheapest to richest):
  2a → 2b → 2c → 5b → 5c → 5d-*

Off-spine, parallelizable:
  5a (trace UI)   — needs only 1b ✅; can start day 1
  4b-* (snapshot node) — needs only 1b ✅; can land anytime

5a and 4b-* are deliberately off the critical path: each is a small, self-contained deliverable that doesn't block or get blocked by the spine. Both are good "parallel agent" hand-offs (same pattern as fork-agent-run-plan.md was for Stage 3).

Stage 0b — Ambient repository guidance (out-of-band)

Shipped (commit f3797fa2). Added Application/RepoGuidance.hs (new); export bump in Application/Repositories.hs; threaded into both prompt-PR paths (Application/Jobs/PromptPullRequest/Run.hs — runOnAwsTask, runOnLocalRunner, and runCodexCommand); env-var roundtrip via Application/IsolatedExecution/Runner.hs (GITOKU_PROMPT_PULL_REQUEST_REPO_GUIDANCE). Documented in AGENTS.md + CLAUDE.md. Build green (677 → 678 modules).

Goal: every prompt-PR run automatically sees the branch's own agent guidance (AGENTS.md / CLAUDE.md / README.md / .cursor/rules / .windsurfrules / .github/copilot-instructions.md / .gitoku/instructions.md, plus any other top-level *.md) without the user having to wire each file through a [agent.*] instructions_file in gitoku.toml. Stage 0 covered per-agent manifests; this covers the ambient "how does this codebase think about itself" layer.

Prompt order (final stack as of this stage):

branch context → mode → repo guidance (Stage 0b) → agent instructions (Stage 0) → semantic memory (Stage 5b) → user task

Each later stage layers a more specific signal on top of the prior: repo guidance answers "how does this codebase think about itself?"; agent instructions answers "what's your specific role on this run?"; semantic memory answers "what was decided / failed before?".
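The layering order can be made concrete with a small pure function. This is a minimal sketch, not the code in PromptPullRequest/Run.hs: the PromptLayer type and renderPromptStack are hypothetical names, and the blank-line separator is an assumption.

```haskell
import Data.List (intercalate)

-- Layers in the order listed above, least specific first.
-- (Hypothetical type; the real code composes these sections inline.)
data PromptLayer
    = BranchContext
    | Mode
    | RepoGuidance        -- Stage 0b
    | AgentInstructions   -- Stage 0
    | SemanticMemory      -- Stage 5b
    | UserTask
    deriving (Eq, Ord, Show, Enum, Bounded)

-- Render the stack in fixed layer order, skipping absent or empty layers.
-- The order of the input list does not matter.
renderPromptStack :: [(PromptLayer, String)] -> String
renderPromptStack sections =
    intercalate "\n\n"
        [ body
        | layer <- [minBound .. maxBound]
        , Just body <- [lookup layer sections]
        , not (null body)
        ]
```

Deriving the order from the constructor order keeps the stack definition in one place, so a later stage that adds a layer only touches the type.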

Caps: 8 KB per file, 32 KB total. Allowlisted files (the seven above) are never sacrificed under budget pressure; other markdown is appended alphabetically until the cap is reached, then dropped. Excluded prefixes (docs/, vendor/, node_modules/, dist-newstyle/, etc.) prevent generated trees from crowding out real guidance.
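The budget rule above can be sketched as a pure selection function. This is an illustration under stated assumptions, not Application/RepoGuidance.hs itself: selectGuidance and GuidanceFile are hypothetical names, oversized files are assumed to be truncated to the per-file cap, and "never sacrificed" is read as "allowlisted files are always included, and only the remaining budget is shared among other markdown".

```haskell
import qualified Data.ByteString.Char8 as BC
import Data.List (partition, sortOn)

perFileCap, totalCap :: Int
perFileCap = 8 * 1024
totalCap = 32 * 1024

-- The seven allowlisted guidance files named in the goal above.
allowlist :: [FilePath]
allowlist =
    [ "AGENTS.md", "CLAUDE.md", "README.md", ".cursor/rules", ".windsurfrules"
    , ".github/copilot-instructions.md", ".gitoku/instructions.md"
    ]

type GuidanceFile = (FilePath, BC.ByteString)

-- Allowlisted files first (always kept), then other files alphabetically
-- until the next one would exceed the total cap; the rest are dropped.
selectGuidance :: [GuidanceFile] -> [GuidanceFile]
selectGuidance files = pinned ++ fill (totalCap - used) (sortOn fst others)
  where
    -- Assumption: the per-file cap truncates rather than excludes.
    capped = [(path, BC.take perFileCap body) | (path, body) <- files]
    (pinned, others) = partition ((`elem` allowlist) . fst) capped
    used = sum (map (BC.length . snd) pinned)
    fill _ [] = []
    fill budget ((path, body) : rest)
        | BC.length body <= budget = (path, body) : fill (budget - BC.length body) rest
        | otherwise = []  -- cap reached: drop this file and everything after it
```

Excluded prefixes (docs/, vendor/, and so on) are assumed to be filtered out before this function sees the file list.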

Why out-of-band: this wasn't in the original 2026-05-31 plan; user flagged the gap mid-plan ("many projects already include markdown for agents which should be fed automatically on first prompt call of any task"), so it shipped between Stages 5c and 5d-*.

Stage 0c — Unify the AI surface: context layering + provider abstraction (planned, not started)

Stage 0c is a two-pillar plan. The two pillars are orthogonal — either ships independently — but together they unify gitoku's AI story end-to-end:

Pillar 1 — prompt context unification. Every AI surface (not just prompt-PR) sees the same layered context (ambient repo guidance + bounded semantic memory). What we send is unified.
Pillar 2 — LLM provider abstraction. Every AI surface can be routed to any supported provider (Codex CLI, Claude Code CLI, OpenAI direct, OpenRouter, Anthropic direct), with one tool-call shape across the wire. Where we send it is unified.

Both pillars converge on the same end-state: a repo's gitoku.toml [agent.reviewer] can declare e.g. runner = "claude-code", chat-provider = "openrouter", model = "anthropic/claude-opus-4-7", and all five AI surfaces route correctly with the right prompt context layered in.

Pillar 1 — Prompt context unification

Goal: every AI surface in gitoku (not just prompt-PR) sees the same layered prompt context — ambient repo guidance (Stage 0b) and bounded semantic memory (Stage 5b) — so a repo's AGENTS.md / CLAUDE.md improves all AI behavior, not just prompt-PR runs.

Why this is a real gap. Today four AI surfaces ship their own ad-hoc system prompts that ignore AGENTS.md, CLAUDE.md, semantic memory, and per-repo conventions:

Surface	LLM client	Prompt-builder seam
PR Review	Codex (isolated runner)	`Application/Jobs/PullRequestReview/Run.hs:381` `renderCodexReviewPrompt`
Conflict Resolution	Codex (isolated runner)	`Application/Jobs/PullRequestConflictResolution/Run.hs:417` `renderConflictResolutionPrompt`
Diff-thread AI reply	OpenAI streaming chat (`IHP.OpenAI`)	`Application/Jobs/PullRequestDiffAiResponse/Request.hs` `buildPullRequestDiffAiResponseCompletionRequest`
PR Form suggestion	OpenAI tool-call (`IHP.OpenAI`)	`Application/Jobs/PullRequestFormSuggestion/Request.hs:90` `renderPullRequestFormSuggestionSystemMessage`

So a repo can tell the prompt-PR agent "we don't use try/catch in this codebase" via AGENTS.md, but the diff-reply AI on the same repo will happily suggest try/catch. This breaks the "AGENTS.md is gitoku's universal agent context" promise that Stage 0b established.

Two surface families, two shapes:

Codex isolated-runner surfaces (Review + ConflictResolution). Same shape as prompt-PR: env-var roundtrip via Application/IsolatedExecution/Runner.hs, prepend repoGuidanceSection <> semanticMemorySection in the renderCodexXPrompt helper. Mechanically identical to the Stage 0b work on PromptPullRequest.
OpenAI chat surfaces (DiffAiResponse + FormSuggestion). Different shape — GPT.systemMessage / GPT.userMessage rather than a single rendered text block. Inject guidance + memory as a second leading GPT.systemMessage before the existing one. These are sync / interactive, so they get tighter byte caps.

Shared seam to add first

A new module Application/PromptContextBlock.hs that exposes a uniform loader. Today the composition logic is inline in PromptPullRequest/Run.hs; lift it so every surface uses one entry point with per-surface budget knobs.

data SurfaceContextOptions = SurfaceContextOptions
    { repoGuidanceTotalCap :: Int            -- override 0b's 32 KB default
    , repoGuidancePerFileCap :: Int          -- override 0b's 8 KB default
    , repoGuidanceAllowlistOnly :: Bool      -- skip non-allowlisted *.md
    , semanticMemoryNodeLimit :: Maybe Int   -- Nothing disables 5b for this surface
    }

loadSurfaceContext :: (?context :: ctx, ConfigProvider ctx)
    => Repository -> Text -> SurfaceContextOptions -> IO Text

Refactor PromptPullRequest/Run.hs's existing inline composition to call loadSurfaceContext with prompt-PR's existing budgets. This is a behavior-preserving refactor; ship it as 0c-prompt-0 before any surface work.

Pillar 1 sub-stages (parallelizable)

0c-prompt-0 — `PromptContextBlock` extraction (refactor, no behavior change)

Scope: new module + cut-over of PromptPullRequest/Run.hs to use it. Existing semantic-memory and repo-guidance behavior unchanged.

Done when: build green, prompt-PR runs produce byte-identical prompts to before (snapshot test on a fixture repo).

0c-prompt-1 — PR Review picks up the prompt stack

Seams:

renderCodexReviewPrompt (Application/Jobs/PullRequestReview/Run.hs:381) — prepend repoGuidanceSection <> semanticMemorySection.
AWS task path + local-runner path in the same module — call loadSurfaceContext before assembling env vars (mirror the two call sites in PromptPullRequest/Run.hs).
PullRequestReviewRunnerEnvironment in Application/IsolatedExecution/Runner.hs — add repoGuidance :: Text, semanticMemory :: Text fields and the env-var reads (GITOKU_PULL_REQUEST_REVIEW_REPO_GUIDANCE, GITOKU_PULL_REQUEST_REVIEW_SEMANTIC_MEMORY).
The Runner's renderCodexPrompt for review (sibling of the PromptPullRequest one) re-injects the two sections.

Budgets: identical to prompt-PR (32 KB guidance, full semantic memory). Review is offline and can be heavy.

Branch to load guidance against: pullRequest.compareBranch (the branch being reviewed — its AGENTS.md is what's relevant for this change).

Done when: a PR review run on a repo with AGENTS.md visibly cites repo conventions in its review comments. Add a fixture-repo manual check; no automated test for prompt content.

0c-prompt-2 — Conflict Resolution picks up the prompt stack

Seam: renderConflictResolutionPrompt (Application/Jobs/PullRequestConflictResolution/Run.hs:417). Same env-var + Runner-side wiring as 0c-prompt-1, just for the conflict resolution surface.

Budgets: 16 KB guidance (already a large diff in the prompt; leave headroom), full semantic memory enabled — "how were similar conflicts resolved before" is exactly what memory is for.

Branch: pullRequest.baseBranch (the merge target — that's where the project conventions the resolution must respect live).

Done when: conflict resolution on a repo with code-style rules in AGENTS.md preserves them (e.g. import ordering, brace style).

0c-prompt-3 — Diff-thread AI reply picks up the prompt stack

Seam: buildPullRequestDiffAiResponseCompletionRequest (Application/Jobs/PullRequestDiffAiResponse/Request.hs). Not Codex — uses IHP.OpenAI.

Shape change: instead of a single text prompt, prepend an additional GPT.systemMessage containing the bounded block:

-- loadSurfaceContext runs in IO, so bind its result rather than let-binding the action:
contextBlock <- loadSurfaceContext repository branch opts
let messages =
        [ GPT.systemMessage contextBlock      -- new (Stage 0c-prompt-3)
        , GPT.systemMessage existingSystem    -- existing
        , GPT.userMessage existingUser        -- existing
        ]

Budgets: 4 KB total guidance, semantic memory top-3 nodes only. This is a sync interactive surface — latency matters, and the diff itself is already in the user message.

Branch: pullRequest.compareBranch.

Done when: a diff-thread reply on a repo with AGENTS.md follows repo-specific advice (e.g. "always include a test plan in your reply" → reply now does).

Shipped (commit 6eddeaa9 + UI footer follow-up). Backend wired through Pull Request Diff AI Response runner; visible badge appended INSIDE the reply comment body itself (appendContextBytesFooter in Application/Jobs/PullRequestDiffAiResponse/Run.hs) so the bytes-spent line travels with the artifact wherever the comment is read.

0c-prompt-4 — PR form-suggestion picks up the prompt stack

Seam: renderPullRequestFormSuggestionSystemMessage (Application/Jobs/PullRequestFormSuggestion/Request.hs:90).

Shape change: append the guidance block to the existing system message string, or pass it as a second GPT.systemMessage as in 0c-prompt-3; pick whichever keeps the existing template-handling logic cleanest.

Budgets: 2 KB, allowlist-only (AGENTS.md, CLAUDE.md, README.md, PULL_REQUEST_TEMPLATE.md if present). Semantic memory disabled — this is one-shot metadata generation, not a knowledge-recall task.

Branch: pullRequest.compareBranch.

Done when: a generated PR title/description matches the conventions in the repo's PULL_REQUEST_TEMPLATE.md + tone in AGENTS.md.

Shipped (commit 6eddeaa9 + UI footer follow-up). Backend wired through buildPullRequestFormSuggestionCompletionRequest; visible footer rendered in the create-PR form (renderPullRequestSuggestionContextFooter in Web/View/Repositories/Show.hs). The badge is UI-only — the suggested title and description text stay clean of gitoku metadata since they become the user's PR description.

Pillar 2 — LLM provider abstraction

Goal: every AI surface can target Codex, Claude Code, OpenAI, OpenRouter, or Anthropic-direct — selected per-agent in gitoku.toml, with sensible per-surface defaults — and tool-call definitions are authored once but routed to every provider's native shape at the wire.

Why this is the natural completion of 0b + Pillar 1. Today the AI provider is hardcoded per surface: Codex CLI for the three agentic surfaces, OpenAI direct (gpt-5-mini) for the two HTTP surfaces. That makes the prompt-PR / review / conflict surfaces "only as good as Codex," and the diff-reply / form surfaces "only as good as gpt-5-mini." A repo that prefers Claude for review and GPT for cheap form-gen has no way to express that, and BYOLLM (an enterprise-asked feature; see activegraph plan §9) is impossible.

What's already in the codebase:

Application.CodexCredentials — per-user Codex auth JSON, consumed by all three Codex CLI surfaces.
Application.ClaudeCodeCredentials — per-user Claude Code auth env JSON. UI exists (settings → Claude tab); no consumer wires it into a runner yet. Scaffolded ready for 0c-provider-A.
claude-code-nix flake input + overlay (host nixos-config has pkgs.claude-code available). Same shape as pkgs.codex.
IHP.OpenAI — gitoku's chat client. Used by diff-reply + form suggestion with hardcoded model + key.

What's missing:

A runner :: { codex | claude-code } field on the agent / per-run config and a CLI dispatcher in the isolated runner.
A chat-provider :: { openai-direct | openrouter | anthropic-direct } field on the chat-completion surfaces and a three-way client dispatcher.
An Anthropic Messages API adapter that translates from gitoku's internal IHP.OpenAI.CompletionRequest shape (including tools) to the Anthropic wire format, so the surfaces stay authored against one shape.

Pillar 2 sub-stages (parallelizable across A and B; sequential within each track)

0c-provider-A — CLI-runner provider abstraction (Codex ↔ Claude Code)

Two CLI-based agentic loops, same role, different binaries. Both take a prompt + workspace and produce a git commit. Differences are which binary, which auth env, which model flag, which subprocess invocation shape.

0c-provider-A1 — Introduce AgentRunnerKind type + selection plumbing.

New data AgentRunnerKind = CodexCliRunner | ClaudeCodeCliRunner in a new Application/AgentRunners.hs.
gitoku.toml [agent.*] schema gains optional runner = "codex" | "claude-code" (default = codex for backward compat).
prompt_pull_request_jobs.agent_runner TEXT column + forward-only migration; sibling columns on pull_request_review_jobs and pull_request_conflict_resolution_jobs.
Parser + persisted value validated against the enum.
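The enum plus its validation can be sketched as follows. A minimal sketch: the constructor names come from the bullet above, while renderAgentRunnerKind and parseAgentRunnerKind are hypothetical helper names, and rejecting unknown values (rather than defaulting them) is an assumption about the intended validation.

```haskell
data AgentRunnerKind = CodexCliRunner | ClaudeCodeCliRunner
    deriving (Eq, Show, Enum, Bounded)

-- Spelling used in gitoku.toml and in the agent_runner TEXT column.
renderAgentRunnerKind :: AgentRunnerKind -> String
renderAgentRunnerKind CodexCliRunner = "codex"
renderAgentRunnerKind ClaudeCodeCliRunner = "claude-code"

-- An absent key falls back to Codex for backward compatibility;
-- an unrecognized value is an error instead of a silent default.
parseAgentRunnerKind :: Maybe String -> Either String AgentRunnerKind
parseAgentRunnerKind Nothing = Right CodexCliRunner
parseAgentRunnerKind (Just raw) =
    case [kind | kind <- [minBound .. maxBound], renderAgentRunnerKind kind == raw] of
        (kind : _) -> Right kind
        [] -> Left ("unknown runner: " ++ raw)
```

Using one render function for both the TOML parser and the persisted column keeps the two spellings from drifting apart.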

0c-provider-A2 — Wire the Claude Code CLI as a peer to codex in the isolated runner.

Application/IsolatedExecution/Runner.hs's runPromptPullRequest (and the two sibling review / conflict entrypoints) dispatch on AgentRunnerKind before calling the CLI subprocess.
New env-var contract: GITOKU_AGENT_RUNNER (= codex or claude-code), GITOKU_CLAUDE_CODE_AUTH_ENV_B64. The Codex runner's existing env vars are unchanged.
Subprocess invocation (stable flag surface, verified against claude v2.1.158):
```
ANTHROPIC_API_KEY=… claude --bare --print \
    --dangerously-skip-permissions \
    --append-system-prompt-file <guidance.md> \
    --model <model> \
    --output-format json \
    --settings '{}' \
    <prompt>
```
- --bare isolates the run from ~/.claude/settings.json, hooks, plugin sync, OAuth, and keychain reads; forces auth to ANTHROPIC_API_KEY only (perfect for our subprocess contract). Without it, the runner would inherit any host-side Claude Code config that happened to be installed.
- --dangerously-skip-permissions skips per-edit approval — safe because we control the sandbox. Catastrophic shell commands still prompt (which would hang us), so the --bare --settings '{}' pair also suppresses the prompt-source that would emit them.
- --append-system-prompt-file reads the file at start — cleaner than passing 32 KB of guidance on argv. The file is the rendered Pillar-1 context block written to a tmpfs path by the runner before exec.
- --output-format json is deterministic and stable; gives us is_error, terminal_reason, total_cost_usd, usage, and session_id. Per-token cost surfacing is a nice-to-have for the trace UI.
- cwd is inherited; no --cwd flag needed — the runner chdirs into the cloned workspace before exec, same as it does for Codex.
The auth env is sourced from loadUserClaudeCodeAuthEnvironmentForUse (already exists) — Stage 0c-provider-A2 is the first real consumer.
Both runners write the same trailing JSON commit-summary line to stdout so the runner's existing log parser stays unchanged.
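The invocation above can be built as an argument list instead of a shell string, which sidesteps quoting of the prompt. A minimal sketch: ClaudeInvocation and claudeArgs are hypothetical names, and the flags are copied from the invocation listed above rather than independently verified here.

```haskell
-- Inputs the runner already has at exec time (hypothetical record).
data ClaudeInvocation = ClaudeInvocation
    { guidanceFile :: FilePath  -- rendered Pillar-1 context block on tmpfs
    , model :: String
    , prompt :: String
    }

-- argv for the claude binary, in the same order as the invocation above.
-- ANTHROPIC_API_KEY is passed through the environment, not through argv.
claudeArgs :: ClaudeInvocation -> [String]
claudeArgs inv =
    [ "--bare"
    , "--print"
    , "--dangerously-skip-permissions"
    , "--append-system-prompt-file", guidanceFile inv
    , "--model", model inv
    , "--output-format", "json"
    , "--settings", "{}"
    , prompt inv
    ]
```

With the process package this list would be passed to something like `proc "claude" (claudeArgs inv)`, so no shell is involved and the prompt text needs no escaping.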

0c-provider-A3 — Selection UI + per-surface defaults.

The settings → Agents page (already exists for [agent.*]) gains a runner radio selector per agent.
Per-surface defaults: leave existing surfaces on Codex. Allow a server-wide GITOKU_DEFAULT_AGENT_RUNNER env var for the managed-host preference (lets git.lazare.ai default to Claude Code without per-repo configuration).

Done when: a [agent.reviewer] in gitoku.toml with runner = "claude-code" runs a PR-review via the claude CLI subprocess, writes the same commit + summary shape, and shows up in the trace UI with the runner kind labeled.

0c-provider-B — HTTP chat-provider abstraction (OpenAI ↔ OpenRouter ↔ Anthropic)

Three HTTP backends, one internal API shape. IHP.OpenAI's CompletionRequest becomes gitoku's lingua franca; OpenRouter takes it verbatim (OpenAI-compatible), Anthropic gets a wire adapter.

0c-provider-B1 — Introduce ChatProvider config.

New data ChatProviderKind = OpenAiDirect | OpenRouter | AnthropicDirect in a new Application/ChatProviders.hs.
data ChatProviderConfig = ChatProviderConfig { kind, baseUrl, authHeader, defaultModel }.
Config/Config.hs reads one API-key env var per provider, three in total (GITOKU_OPENAI_API_KEY, GITOKU_OPENROUTER_API_KEY, GITOKU_ANTHROPIC_API_KEY), and surfaces a Map ChatProviderKind ChatProviderConfig.
gitoku.toml [agent.*] gains optional chat-provider + model keys (so a repo can declare "this agent uses Claude Opus via OpenRouter"). Per-surface defaults if unspecified.
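A sketch of how the provider map might be assembled from the environment. Assumptions, none of which the plan fixes: chatProvidersFromEnv, envVarFor, and configFor are hypothetical names; the base URLs are the providers' public API roots; a missing or empty key means the provider is simply absent from the map; defaultModel is left out because the per-surface defaults are decided in B4.

```haskell
import qualified Data.Map.Strict as Map

data ChatProviderKind = OpenAiDirect | OpenRouter | AnthropicDirect
    deriving (Eq, Ord, Show, Enum, Bounded)

data ChatProviderConfig = ChatProviderConfig
    { kind :: ChatProviderKind
    , baseUrl :: String
    , authHeader :: (String, String)  -- header name, header value
    }
    deriving (Eq, Show)

envVarFor :: ChatProviderKind -> String
envVarFor OpenAiDirect = "GITOKU_OPENAI_API_KEY"
envVarFor OpenRouter = "GITOKU_OPENROUTER_API_KEY"
envVarFor AnthropicDirect = "GITOKU_ANTHROPIC_API_KEY"

-- OpenAI and OpenRouter use bearer auth; Anthropic uses an x-api-key header.
configFor :: ChatProviderKind -> String -> ChatProviderConfig
configFor OpenAiDirect key =
    ChatProviderConfig OpenAiDirect "https://api.openai.com/v1" ("Authorization", "Bearer " ++ key)
configFor OpenRouter key =
    ChatProviderConfig OpenRouter "https://openrouter.ai/api/v1" ("Authorization", "Bearer " ++ key)
configFor AnthropicDirect key =
    ChatProviderConfig AnthropicDirect "https://api.anthropic.com" ("x-api-key", key)

-- Takes the environment as a plain association list (the shape returned by
-- System.Environment.getEnvironment) so the function stays pure and testable.
chatProvidersFromEnv :: [(String, String)] -> Map.Map ChatProviderKind ChatProviderConfig
chatProvidersFromEnv env =
    Map.fromList
        [ (provider, configFor provider key)
        | provider <- [minBound .. maxBound]
        , Just key <- [lookup (envVarFor provider) env]
        , not (null key)
        ]
```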

0c-provider-B2 — Refactor existing OpenAI usage to read from ChatProvider.

Diff-reply + form-suggestion stop calling GPT.defaultConfig (cs openAIApiKey) directly; instead resolve ChatProviderConfig per surface, then call GPT.defaultConfig { baseUrl, authHeader }.
For OpenAiDirect and OpenRouter this is enough — both speak the OpenAI Chat Completions wire format.
Behavior-preserving: with no env changes, behaves exactly as today (defaults to OpenAiDirect).

0c-provider-B3 — Anthropic Messages API adapter (covers Pillar 2's "tool calls" ask).

New Application/Llm/AnthropicAdapter.hs.
Translates GPT.CompletionRequest → Anthropic Messages POST body: messages → messages (role mapping, system message split out into top-level system field per Anthropic spec); tools = [GPT.Function] → Anthropic tools = [{ name, description, input_schema }] (mechanical rename; parameters JsonSchema ≡ input_schema); tool_choice → tool_choice.
Translates Anthropic streaming events (content_block_delta with tool_use block) back into the OpenAI-style chunk shape (GPT.CompletionChunk with tool_calls delta) so the existing surface consumers (extractPullRequestDiffAiResponseOutput, the form-suggestion stream consumer) need no changes.
Streaming tool-call reassembly is mechanical: Anthropic streams input_json_delta strings; concatenate them per tool block until the closing event, then emit the equivalent OpenAI tool_calls[i].function.arguments final chunk.
This sub-stage is the tool-call harmonization the user asked for: a tool authored once as GPT.Function works across OpenAI, OpenRouter, and Anthropic.
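The streaming reassembly step can be sketched over a simplified event model. Assumptions: StreamEvent, ToolCall, and reassembleToolCalls are hypothetical stand-ins (the real adapter would decode Anthropic's content_block_start / content_block_delta / content_block_stop events and emit GPT.CompletionChunk values), and only tool_use blocks are tracked.

```haskell
import qualified Data.Map.Strict as Map

-- Simplified view of the stream events that matter for tool calls.
data StreamEvent
    = ToolUseStart Int String     -- block index, tool name
    | InputJsonDelta Int String   -- block index, partial JSON fragment
    | BlockStop Int               -- block index
    deriving (Eq, Show)

-- OpenAI-style final tool call: function name plus the complete arguments string.
data ToolCall = ToolCall { toolName :: String, toolArguments :: String }
    deriving (Eq, Show)

-- In-flight tool blocks: name plus fragments collected so far, newest first.
type Pending = Map.Map Int (String, [String])

step :: (Pending, [ToolCall]) -> StreamEvent -> (Pending, [ToolCall])
step (pending, done) event = case event of
    ToolUseStart index name ->
        (Map.insert index (name, []) pending, done)
    InputJsonDelta index fragment ->
        (Map.adjust (\(name, fragments) -> (name, fragment : fragments)) index pending, done)
    BlockStop index -> case Map.lookup index pending of
        Just (name, fragments) ->
            (Map.delete index pending, ToolCall name (concat (reverse fragments)) : done)
        Nothing ->
            (pending, done)  -- a non-tool block closed; nothing to emit

-- Concatenate fragments per block and emit one call per closed tool block,
-- in the order the blocks closed.
reassembleToolCalls :: [StreamEvent] -> [ToolCall]
reassembleToolCalls = reverse . snd . foldl step (Map.empty, [])
```

Keying the pending state by block index means interleaved text blocks are ignored without special-casing them.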

0c-provider-B4 — Per-surface defaults + override.

Same chat-provider + model fields from B1 are consulted by diff-reply + form-suggestion when assembling a request.
Per-surface server defaults configurable via env (GITOKU_DEFAULT_CHAT_PROVIDER_FORM_SUGGESTION etc.).
Cheap default for form suggestion (e.g. openai-direct + gpt-5-mini as today). Higher-quality default for diff-reply if the operator has a budget for it.

Done when: a [agent.formgen] in gitoku.toml with chat-provider = "anthropic-direct" + model = "claude-haiku-4-5-20251001" causes a PR form-suggestion request to hit api.anthropic.com, returns a tool-call result, and the existing form-suggestion consumer renders it identically to the OpenAI path.

Pillar 2 cross-cutting

Why no new tool-call abstraction layer. The OpenAI tool-call shape ({ name, description, parameters: JsonSchema }) is the common shape all three providers can be normalized to: OpenAI and OpenRouter accept it verbatim, and Anthropic differs only in field name and location. Inventing a gitoku-native tool type would add an abstraction with one user. Keep GPT.Function; adapt at the wire.
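To show how small the wire-level difference is, here is a sketch of the request-side mapping over plain records. Everything here is a stand-in: Message, OpenAiTool, AnthropicTool, AnthropicRequest, and toAnthropic are hypothetical, the JSON schema is kept as an opaque string, and joining several system messages with a blank line is an assumption about how multiple system messages would be merged.

```haskell
import Data.List (intercalate)

data Role = System | User | Assistant
    deriving (Eq, Show)

data Message = Message { role :: Role, content :: String }
    deriving (Eq, Show)

-- OpenAI-style tool definition (stand-in for GPT.Function).
data OpenAiTool = OpenAiTool
    { name :: String, description :: String, parameters :: String }
    deriving (Eq, Show)

-- Same data, Anthropic field names: parameters becomes input_schema.
data AnthropicTool = AnthropicTool
    { toolName :: String, toolDescription :: String, inputSchema :: String }
    deriving (Eq, Show)

data AnthropicRequest = AnthropicRequest
    { system :: Maybe String   -- top-level field, not a message
    , messages :: [Message]    -- user/assistant turns only
    , tools :: [AnthropicTool]
    }
    deriving (Eq, Show)

toAnthropic :: [Message] -> [OpenAiTool] -> AnthropicRequest
toAnthropic msgs openAiTools = AnthropicRequest
    { system = if null systemTexts then Nothing else Just (intercalate "\n\n" systemTexts)
    , messages = [m | m <- msgs, role m /= System]
    , tools = [AnthropicTool (name t) (description t) (parameters t) | t <- openAiTools]
    }
  where
    systemTexts = [content m | m <- msgs, role m == System]
```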

OpenRouter trade-off. OpenRouter is a third-party single point of failure and takes margin on every call. The plan supports it as a peer not a default — it's the right choice for BYOLLM operators who want "any model, one bill," but the managed gitoku host should prefer direct API integrations for cost + reliability.

Codex CLI is a peer, not a backend. Codex CLI is itself an agentic loop with its own model choice, prompt-routing, and tool execution loop inside the subprocess. It can't be replaced by "Codex chat completions" without losing the agentic behavior. That's why Pillar 2 splits along CLI-runner vs HTTP-chat lines — they really are different abstraction levels.

Migrations:

prompt_pull_request_jobs.agent_runner TEXT (+ siblings on review / conflict tables) — A1.
prompt_pull_request_jobs.chat_provider TEXT + chat_model TEXT if we want to record the resolved choice on the run — optional, defer to observability needs.

Risks:

Claude Code CLI surface — actually low risk. The headless flag set we depend on (--bare, --print, --dangerously-skip-permissions, --append-system-prompt-file, --output-format json, --model) has been stable for 12+ months. Recent churn (v2.1.155–158) is all in adjacent areas (worktrees, Bedrock auto-mode, prompt-cache hints) — not our surface. Real residual risks worth gating:
- No exit-code discipline. Failures surface in JSON (is_error == true, terminal_reason != "completed") rather than a non-zero exit. We must parse the JSON before declaring success — the existing Codex success/failure detector needs a Claude-shaped sibling.
- No native hang detection. If something inside the loop waits for input (missing git credential, network probe), the subprocess hangs indefinitely. Wrap in timeout (or use the existing codexTimeoutSeconds knob renamed to agentTimeoutSeconds) with a generous default; same discipline already applies to Codex.
- Sandbox bleed via .claude/settings.json in the cloned repo. The repo we just cloned for the run may ship a .claude/settings.json that would otherwise be auto-loaded. --bare --settings '{}' suppresses both host-side and repo-side settings, so the only configuration is what we pass on the CLI. Mandatory in our subprocess contract — don't drop it as a "small optimization."
- Version probing. Pin via claude-code-nix (already in the flake), and add a claude --version probe in ensureWorkflowCommandAvailable so a Nix-side downgrade that silently lost a flag fails fast with a useful error.
Anthropic wire-shape drift. Adapter is mechanical but brittle. Add a unit test on the request-translation function that round-trips a fixture GPT.CompletionRequest and asserts the JSON output against a checked-in golden file. No live API call needed.
Credential confusion. Three API keys + two CLI auth envs is a lot of surface for users to manage. The settings UI should group them into a single "AI providers" tab with a clear per-provider connect/disconnect flow (extends the existing Codex / Claude tabs).
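The first two Claude Code risks (no exit-code discipline, no hang detection) combine into one small success detector. A minimal sketch: ClaudeResult, RunOutcome, classify, and runWithDeadline are hypothetical names, the JSON is assumed to be decoded already into the two fields named above, and the action passed in stands for "run the subprocess and decode its output".

```haskell
import System.Timeout (timeout)

-- The two fields of the CLI's JSON result that decide success.
data ClaudeResult = ClaudeResult
    { isError :: Bool
    , terminalReason :: String
    }

data RunOutcome = Succeeded | Failed String | TimedOut
    deriving (Eq, Show)

-- Success is decided by the JSON body, never by the exit code.
classify :: Maybe ClaudeResult -> RunOutcome
classify Nothing = TimedOut
classify (Just result)
    | isError result = Failed "is_error was true"
    | terminalReason result /= "completed" =
        Failed ("terminal_reason: " ++ terminalReason result)
    | otherwise = Succeeded

-- Run the action under a wall-clock limit given in seconds.
-- System.Timeout.timeout takes microseconds and yields Nothing on expiry.
runWithDeadline :: Int -> IO ClaudeResult -> IO RunOutcome
runWithDeadline seconds run = classify <$> timeout (seconds * 1000000) run
```

Note that timeout only interrupts the Haskell thread; a real runner would also have to terminate the child process when the deadline fires.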

Cross-cutting

Cap defaults (final stack after 0c):

Surface	repo guidance	semantic memory	rationale
prompt-PR	32 KB	enabled (full)	offline, agentic, large context
PR Review	32 KB	enabled (full)	offline, can be heavy
Conflict Resolution	16 KB	enabled (full)	offline, but big diff in prompt
Diff-thread reply	4 KB	top-3 nodes	sync, latency-sensitive
Form suggestion	2 KB allowlist-only	disabled	one-shot metadata
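The cap table maps directly onto SurfaceContextOptions values. A sketch under stated assumptions: the Surface type and optionsFor are hypothetical, the per-file cap is assumed to be the smaller of 8 KB and the surface's total cap, and fullMemoryLimit is a placeholder because the plan does not state prompt-PR's existing node limit.

```haskell
data SurfaceContextOptions = SurfaceContextOptions
    { repoGuidanceTotalCap :: Int
    , repoGuidancePerFileCap :: Int
    , repoGuidanceAllowlistOnly :: Bool
    , semanticMemoryNodeLimit :: Maybe Int  -- Nothing disables semantic memory
    }
    deriving (Eq, Show)

data Surface = PromptPr | PrReview | ConflictResolution | DiffThreadReply | FormSuggestion
    deriving (Eq, Show, Enum, Bounded)

kb :: Int -> Int
kb n = n * 1024

-- Placeholder for "enabled (full)"; the real limit is not given in this plan.
fullMemoryLimit :: Int
fullMemoryLimit = 20

-- One row of the cap table per surface.
optionsFor :: Surface -> SurfaceContextOptions
optionsFor surface = case surface of
    PromptPr -> opts (kb 32) False (Just fullMemoryLimit)
    PrReview -> opts (kb 32) False (Just fullMemoryLimit)
    ConflictResolution -> opts (kb 16) False (Just fullMemoryLimit)
    DiffThreadReply -> opts (kb 4) False (Just 3)
    FormSuggestion -> opts (kb 2) True Nothing
  where
    opts total allowlistOnly memory = SurfaceContextOptions
        { repoGuidanceTotalCap = total
        , repoGuidancePerFileCap = min (kb 8) total  -- assumption, see lead-in
        , repoGuidanceAllowlistOnly = allowlistOnly
        , semanticMemoryNodeLimit = memory
        }
```

Keeping the table in one function makes later tuning (widening caps after observing the byte-count columns) a one-line change per surface.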

Per-agent instructions (Stage 0) deferred for these surfaces. None of the four currently invokes a named [agent.*] — there's no [review] / [reply] / [form] agent in gitoku.toml. Adding one is a natural Stage 0d (cosmetic — same pattern as 0 already established), but it's orthogonal to 0c and not in this plan's scope.

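Migrations: none for prompt rendering itself, since all behavior changes are prompt-rendering code paths. The only schema change is the byte-count columns described under Observability.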

Observability: every surface should write the injected byte counts (guidance_bytes, memory_bytes) to its job row, so we can see budget impact and tune defaults. Add the columns in a single migration as part of 0c-prompt-0.

Testing convention: pure unit tests for loadSurfaceContext with a fixture allowlist + temp-directory fake repo. Per repo rule, no mock DB. The integration is checked manually with the fixture-repo done-when criteria above.

Risks:

Silent latency regression on sync surfaces (0c-prompt-3 / 0c-prompt-4). Mitigation: hard byte cap enforced in loadSurfaceContext, plus the job-row byte-count column lets us spot drift.
Prompt bloat hurts answer quality. Big prompts are not always better — for the small interactive surfaces especially, the guidance + memory injection should help, not distract. The cap table above is conservative; widen by observing the byte-count column after launch.
Cache misses on OpenAI surfaces. Prepending content invalidates any prefix caching the upstream provider does. The 4 KB cap on 0c-prompt-3 keeps this tolerable; revisit if measured latency suffers.

Done (whole 0c):

Pillar 1. Every AI surface in the codebase, when it renders its prompt, loads from Application.PromptContextBlock. A repo's AGENTS.md is now the universal gitoku agent context — improving it improves every AI behavior on the repo, not just prompt-PR runs. Stage 0b's promise is honored end-to-end.

Pillar 2. Every AI surface routes through a runner (CLI surfaces) or chat-provider (HTTP surfaces) dispatcher. A repo's gitoku.toml [agent.*] can express provider + model per agent, defaulting sensibly per surface. Tool definitions are authored once as GPT.Function and reach all three HTTP providers (OpenAI, OpenRouter, Anthropic) via the wire-level adapter. Codex and Claude Code are first-class CLI peers. BYOLLM is now mechanically possible — an enterprise self-hosting gitoku in their own cloud can wire their own credentials for any supported provider without touching the gitoku source.

The two pillars together replace today's Codex-only-or-OpenAI-only surface monoculture with a coherent "pick the right tool for the job, declared in the repo" story — the model layer of the open-formats / managed-ops business positioning in activegraph plan §9.

Stage 2a — Provenance → graph dual-write

Shipped (commit af443555). Implemented in Application/GraphProvenance.hs (new), wired at Application/Jobs/PromptPullRequest/Run.hs:~146 inside a withTransaction. Application/Graph.hs gained a FileRangeNode vocabulary entry; Application/GraphBackfill.hs was refactored to delegate to the same projection so historical and live data converge. Pure unit tests for the idempotency key (fileRangeSourceId) live in Test/PromptPullRequestsSpec.hs. Build green (672 → 674 modules).

Goal: every prompt-PR success that already writes commit_ai_contexts also writes an evidence node + derived_from edges into the graph, so the graph layer starts accruing real activity on day 1.

Detail reference: developer-features-implementation.md §4 ("Per-line AI provenance"), "The seam (generalize to the graph)" paragraph.

Seam (single point): Application/Jobs/PromptPullRequest/Run.hs:146 — where replaceCommitAiContexts is called. Emit the graph node + edges in the same transaction, right next to the existing call.

Sketch:

-- After the existing replaceCommitAiContexts call (line ~146):
-- For each commitAiContext row written, project it into the graph:
forM_ writtenContexts \ctx -> do
    evidenceNode <- Graph.upsertNode repository GraphNodeEvidence
        ("commit_ai_context:" <> UUID.toText ctx.id)
        (encodeEvidencePayload ctx)
    -- Each (file, lineSpan) in ctx.modifiedLineSpansJson becomes an edge:
    forM_ (parsedSpans ctx.modifiedLineSpansJson) \(filePath, lineRange) -> do
        targetNode <- Graph.upsertNode repository GraphNodeFileRange
            ("file_range:" <> commitSha <> ":" <> filePath <> ":" <> tshow lineRange)
            (encodeFileRangePayload commitSha filePath lineRange)
        _ <- Graph.upsertEdge repository GraphRelationDerivedFrom
            evidenceNode targetNode mempty
        pure ()

Graph.upsertNode / Graph.upsertEdge are the storage helpers added in Stage 1b (Application/Graph.hs). Both are idempotent: nodes are keyed on (repository_id, source_kind, source_id) and edges on (src_node_id, dst_node_id, relation).
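The idempotency claim can be checked against an in-memory model of the projection. This is a toy, not Application/Graph.hs: the Graph record, the String-typed keys, and projectContext are hypothetical, and edges are keyed on node keys here rather than on database node ids.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

type NodeKey = (Int, String, String)       -- (repository_id, source_kind, source_id)
type EdgeKey = (NodeKey, NodeKey, String)  -- (source node, target node, relation)

data Graph = Graph
    { nodes :: Map.Map NodeKey String  -- key to payload
    , edges :: Set.Set EdgeKey
    }

emptyGraph :: Graph
emptyGraph = Graph Map.empty Set.empty

-- Upserts overwrite by key, so repeating one never adds a second row.
upsertNode :: NodeKey -> String -> Graph -> Graph
upsertNode key payload graph = graph { nodes = Map.insert key payload (nodes graph) }

upsertEdge :: EdgeKey -> Graph -> Graph
upsertEdge key graph = graph { edges = Set.insert key (edges graph) }

-- Project one commit_ai_context with its modified spans:
-- one evidence node, one file-range node per span, one derived_from edge per span.
projectContext :: Int -> String -> String -> [(FilePath, (Int, Int))] -> Graph -> Graph
projectContext repo contextId commitSha spans graph0 = foldl addSpan withEvidence spans
  where
    evidenceKey = (repo, "evidence", "commit_ai_context:" ++ contextId)
    withEvidence = upsertNode evidenceKey "" graph0
    addSpan graph (path, range) =
        let targetKey =
                ( repo
                , "file_range"
                , "file_range:" ++ commitSha ++ ":" ++ path ++ ":" ++ show range
                )
        in upsertEdge (evidenceKey, targetKey, "derived_from") (upsertNode targetKey "" graph)
```

Projecting the same context twice leaves the node and edge counts unchanged, which is the property the acceptance criteria rely on.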

Acceptance:

A fresh prompt-PR run leaves N rows in commit_ai_contexts AND N+M rows in graph_nodes (1 evidence node per context + 1 file-range node per span) AND M rows in graph_edges (derived_from per span).
Re-running the same prompt-PR job (or re-applying its commit) is idempotent — no duplicate graph nodes/edges.
Application.GraphBackfill (Stage 1b) already covers the historical projection; this PR only adds the live-write side.

Tests: add to Test/PromptPullRequestsSpec.hs — a unit test that runs the dual-write helper against a fixture context and asserts the resulting graph state. Hspec, ORM only (no sqlQuery).

Risk: transaction boundary. The graph emit must be in the same transaction as replaceCommitAiContexts so a graph-write failure doesn't leave the provenance row orphaned. Application.CommitAiContexts.storeCommitAiContextBatch already uses withTransaction; add the graph emit inside the same block.

Diff size: ~50–80 lines in Application/Jobs/PromptPullRequest/Run.hs (or extract to Application/Graph/Provenance.hs if the inline block grows past 30 lines), ~30 lines of test.

Stage 2b — Behavior registry / dispatcher

Shipped (commit 1367e19b). Application/Behaviors.hs (new) holds the RepositoryChangeEvent ADT, dispatchEvent, registeredBehaviors list, and the four behavior functions (branchWorkflowBehavior, pullRequestWorkflowBehavior, pullRequestReviewBehavior, pullRequestConflictBehavior) — each one the verbatim inline block from the old trigger function lifted into a named top-level binding. BranchUpdated joined the GraphEventType vocabulary in Application/Graph.hs. Application/RepositoryWorkflowTriggers.hs collapsed to a thin dispatcher wrapper that pre-fetches open PRs once and dispatches one event per changed branch. Pure unit tests in Test/BehaviorsSpec.hs (wired into Test/Main.hs). Build green.

Goal: replace the hardcoded body of syncRepositoryWorkflowTriggers with a generic dispatcher. Same observable behavior on day 1; reactive substrate is now data-driven.

Detail reference: activegraph-continuity-plan.md Phase 2; mirror in developer-features-implementation.md §3 ("The seam" paragraph).

Seam (single function body swap): Application/RepositoryWorkflowTriggers.hs:71-96 — syncRepositoryWorkflowTriggers. Its current body has four inlined enqueue* calls (branch workflow runs, PR workflow runs, PR review jobs, PR conflict-resolution jobs).

Sketch:

-- New module Application/Behaviors.hs:
data Behavior = Behavior
    { behaviorName :: Text
    , behaviorMatches :: GraphEvent -> Repository -> IO Bool
    , behaviorRun     :: GraphEvent -> Repository -> IO ()
    }

registeredBehaviors :: [Behavior]
registeredBehaviors =
    [ branchWorkflowBehavior          -- replaces the :77 inline
    , pullRequestWorkflowBehavior     -- replaces :88
    , pullRequestReviewBehavior       -- replaces :89
    , pullRequestConflictBehavior     -- replaces :94
    ]

dispatch :: Repository -> GraphEvent -> IO ()
dispatch repo event = do
    matching <- filterM (\b -> behaviorMatches b event repo) registeredBehaviors
    forM_ matching \b -> behaviorRun b event repo

-- syncRepositoryWorkflowTriggers becomes:
syncRepositoryWorkflowTriggers repo changes = do
    forM_ changes \change -> do
        event <- Graph.recordEvent repo Nothing Nothing
                    (graphEventTypeFromChange change) actor before after Nothing
        dispatch repo event

The four existing inlined enqueues lift verbatim into the four behaviorRun bodies. Each behavior's behaviorMatches predicate captures what the inline case/when checks did before. enqueuePullRequestReviewJob's own gating (open + AI-reviews-enabled + dedup) stays inside it unchanged.

Acceptance:

Every existing reactive path (push → branch workflow runs; PR compare-branch moves → review jobs + workflow runs; base-branch moves → conflict-resolution) continues firing with identical timing and gating.
Each fire now leaves a row in graph_events (the bus also emits an audit trail).
Adding a new reactive behavior is a one-line registration plus a matcher + runner — no edits to syncRepositoryWorkflowTriggers.

Tests: add to Test/WorkflowsSpec.hs (or new Test/BehaviorsSpec.hs).

A unit test per behavior: feed a fixture GraphEvent, assert the right enqueue primitive was called.
A regression test: a synthetic push event leaves the same job-table rows it did before (compare against a fixture).
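
The per-behavior unit test described above is easiest to picture against a pure model of the dispatcher. The following is a minimal sketch under stated assumptions: Event and Job are simplified stand-ins for the real ADTs, behaviorRun returns the jobs it would enqueue instead of running in IO, and the registry is passed as an argument so a test can substitute fixtures.

```haskell
-- Simplified stand-ins for the plan's types (assumptions, not the real ADTs).
data Event = BranchUpdated String | PullRequestCompareMoved Int
    deriving (Eq, Show)

data Job = BranchWorkflowRun String | PullRequestReview Int
    deriving (Eq, Show)

data Behavior = Behavior
    { behaviorName    :: String
    , behaviorMatches :: Event -> Bool
    , behaviorRun     :: Event -> [Job]   -- jobs the behavior would enqueue
    }

branchWorkflowBehavior :: Behavior
branchWorkflowBehavior = Behavior
    { behaviorName = "branch-workflow"
    , behaviorMatches = \event -> case event of
        BranchUpdated _ -> True
        _               -> False
    , behaviorRun = \event -> case event of
        BranchUpdated branch -> [BranchWorkflowRun branch]
        _                    -> []
    }

pullRequestReviewBehavior :: Behavior
pullRequestReviewBehavior = Behavior
    { behaviorName = "pull-request-review"
    , behaviorMatches = \event -> case event of
        PullRequestCompareMoved _ -> True
        _                         -> False
    , behaviorRun = \event -> case event of
        PullRequestCompareMoved pr -> [PullRequestReview pr]
        _                          -> []
    }

-- Run every matching behavior, in registration order.
dispatch :: [Behavior] -> Event -> [Job]
dispatch behaviors event =
    concatMap (\b -> behaviorRun b event) (filter (\b -> behaviorMatches b event) behaviors)
```

With this shape, "assert the right enqueue primitive was called" reduces to comparing the returned job list against a fixture; the real IO version would need a recording stub for the enqueue functions instead.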

Risk: behavioral parity. The four enqueue calls have subtle preconditions (e.g. only fire for moved branches, not new ones). Each matcher must reproduce its inline equivalent exactly. Strategy: extract the inline conditions into named predicates before swapping the dispatcher, so the diff is "rename + relocate" not "rewrite". Land that prep as its own small commit if helpful.

Diff size: ~150–250 lines net (new Application/Behaviors.hs ~100, trimmed Application/RepositoryWorkflowTriggers.hs ~–30, new test file ~80). One new test fixture file.

Stage 2c — Agent triggers (agents-as-behaviors)

Shipped (commit 84db920e). agentTriggerBehavior added as the 5th entry in registeredBehaviors in Application/Behaviors.hs. It reads gitoku.toml from the changed branch, filters via the new agentMatchesEvent helper in Application/Workflows.hs (sibling of workflowMatchesEvent), and fires one prompt-PR launch per match. The launch path in Application/PromptPullRequests.hs was split: the new launchPromptPullRequestJobCore runs in any context (worker or controller) and skips reviewer notifications; the existing launchPromptPullRequestJob is now a thin wrapper that adds the notification step where a ControllerContext is available. Triggered runs are attributed to the organization owner.

Goal: the events / branches declarations on per-repo [agent.*] manifests (shipped in Stage 0) become behavior configs. A push that matches an agent's events list routes through the 2b dispatcher and enqueues a prompt-PR job for that agent.

Detail reference: developer-features-implementation.md §5 "Graph + reactive wiring" paragraph; activegraph-continuity-plan.md Phase 2.

Seam: Stage 0 already loads agent manifests via loadAgentManifestForBranch (parallel to loadWorkflowManifestForBranch:209 in Application/Workflows.hs). Add an agent behavior to the 2b registry:

agentBehavior :: GitokuAgent -> Behavior
agentBehavior agent = Behavior
    { behaviorName = "agent:" <> agent.key
    , behaviorMatches = \event repo -> pure (agentMatchesEvent agent event)
    , behaviorRun = \event repo -> enqueueAgentPromptJob agent event repo
    }

agentMatchesEvent reuses the existing workflowMatchesEvent shape (it's the same vocabulary). enqueueAgentPromptJob is a thin wrapper over the prompt-PR job builder (already shipped in Stage 0), preloading the agent's prompt/model/instructions.

The registry registers all agents at dispatch time (per-repo lookup), so adding a new [agent.*] to gitoku.toml is a no-restart change.
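
As an illustration of the matching rule, here is a minimal sketch of what agentMatchesEvent could look like. The Agent and Event records are trimmed-down, hypothetical shapes, and treating an empty branches list as "any branch" is an assumption made for the example, not something the manifest format is confirmed to do.

```haskell
-- Hypothetical, trimmed-down view of an [agent.*] manifest entry.
data Agent = Agent
    { agentKey      :: String
    , agentEvents   :: [String]   -- e.g. ["pull_request"]
    , agentBranches :: [String]   -- e.g. ["main"]; empty means "any branch" (assumption)
    }

-- Hypothetical event shape: an event name plus the branch it concerns.
data Event = Event
    { eventName   :: String
    , eventBranch :: String
    }

-- An agent wakes only when the event name is listed and the branch filter passes.
agentMatchesEvent :: Agent -> Event -> Bool
agentMatchesEvent agent event =
    eventName event `elem` agentEvents agent
        && (null (agentBranches agent) || eventBranch event `elem` agentBranches agent)
```

Under this rule an agent with an empty events list never matches, which lines up with the expectation that removing the events line from gitoku.toml stops the wake.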

Acceptance:

An agent declared with events = ["pull_request"] and a branches = ["main"] filter wakes when a PR targeting main is opened.
The resulting prompt-PR job records the originating agent_key (Stage 0 already added that column).
Disabling the trigger (removing the events line in gitoku.toml) stops the wake immediately on next sync.

Tests: Playwright in test_playwright/repositories.spec.ts — extend the existing "named agents" spec with: configure events = ["pull_request"], open a PR, assert a new prompt-PR job appears within N seconds.

Diff size: ~50–100 lines glue (new agentBehavior + reuse of existing helpers).

Stage 2d — Thread-aware commit AI context capture (planned, not started)

Goal: each commit's stored AI context = the full session thread up to that commit's timestamp, not just a heuristic window of "recent prompts." That makes the diff between two consecutive commits' contexts a faithful record of what was discussed between those commits — which is the model agents and humans both intuitively expect from a "rewind a session" view.

Why it matters now. Stage 2a (provenance dual-write) projects every commit_ai_contexts row into an evidence node in the graph; Stage 5b (semantic memory injection) reads those evidence nodes back as recallable context on future prompts. Both stages assume the captured contexts faithfully represent the work that produced the commit. Today they don't: rgh ships only "recent prompts within a client-side heuristic window" per commit, so the projection has lossy input. Stage 2d fixes the input.

Current state (verified against production data, 2026-06-03)

Each commit_ai_contexts row has prompt, thinking, session_id, source_family, capture_timestamp, capture_index. rgh's discoverExactCommitCapture (rgh/src/Rgh/Capture.hs:377) walks recent Codex / Claude Code / Cursor session files, filters by exactPatchCaptureWithinTimestamp against the pre-commit timestamp, and posts the surviving prompts to POST /CreateApiCommitAiContextBatch. The current "window" is implicit — controlled by which session files are considered "recent" and what apply-patch evidence they carry. Sample from commit_ai_contexts on the deployed instance: same session yields 4 prompts on one commit, 1 prompt on another an hour later, even though dozens of messages were exchanged in between.

Pillars

2d-rgh — Full-session capture client-side

New entry point in rgh/src/Rgh/Capture.hs alongside the existing discoverExactCommitCapture:

discoverFullSessionCommitCapture
    :: FilePath        -- repositoryRoot
    -> UTCTime         -- preCommitTimestamp (upper bound)
    -> Maybe UTCTime   -- previousCommitTimestamp (lower bound, optional)
    -> IO CommitCapture

Behavior:

For every agent session that was active in the window (e.g. that touched any file in the staged diff), read the WHOLE session log up to preCommitTimestamp.
Bound by previousCommitTimestamp when known (from git log -1 --format=%cI HEAD~1): only include prompts AFTER the prior commit, so each commit's slice is delta-shaped rather than cumulative. Fall back to cumulative when there's no prior commit (initial commit, root of a branch).
Bypass the exactPatchCaptureWithinTimestamp filter — that heuristic is what currently drops prompts.
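
The delta-shaped window described in the bullets above boils down to a half-open interval filter. This is a sketch only: Prompt and deltaSlice are hypothetical names, the timestamp type is left abstract (the real code would use UTCTime), and session discovery and log parsing are out of scope here.

```haskell
-- One captured prompt; the timestamp type is left abstract (UTCTime in practice).
data Prompt t = Prompt
    { promptAt   :: t
    , promptBody :: String
    } deriving (Eq, Show)

-- Keep prompts in the window (previousCommit, preCommit].
-- With no previous commit, fall back to everything up to preCommit,
-- which is the cumulative case.
deltaSlice :: Ord t => Maybe t -> t -> [Prompt t] -> [Prompt t]
deltaSlice previousCommit preCommit = filter inWindow
  where
    inWindow p =
        promptAt p <= preCommit
            && maybe True (promptAt p >) previousCommit
```

The lower bound is exclusive so a prompt stamped exactly at the prior commit is not counted twice across consecutive commits.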

Wire it through the post-commit hook (rgh/src/Rgh/Hooks.hs) so the default capture is the full delta-shaped slice. The existing exact-patch-aware capture stays as a fallback / opt-in mode for users on tight quotas.

2d-server — Storage + dedup

The existing schema already supports multiple prompts per commit ((repository_id, commit_sha, capture_index) UNIQUE). For v1, no schema change: rgh sends N prompts per commit, server stores N rows. Storage grows linearly with session length, which is the price of fidelity.

Add a small optimization: when computing the graph projection in Application/GraphProvenance.hs, dedupe evidence nodes by (prompt_hash, session_id) so the same prompt that appears in multiple commits' captures (under the delta model that shouldn't happen, but under the fallback cumulative model it can) only becomes one graph node with multiple derived_from edges. Add a promptHash :: Text column to commit_ai_contexts for the dedup key; SHA-256 of the normalized prompt body, computed server-side at insert time.
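
A minimal sketch of the dedup step, assuming the hash has already been computed and stored. Context and dedupeEvidence are hypothetical names, and the map value stands in for the set of derived_from edges; hashing and prompt normalization are deliberately left out.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical slice of a commit_ai_contexts row.
data Context = Context
    { ctxPromptHash :: String
    , ctxSessionId  :: String
    , ctxCommitSha  :: String
    }

-- One evidence node per (prompt_hash, session_id), carrying every commit
-- that captured it (one derived_from edge per commit), in input order.
dedupeEvidence :: [Context] -> Map.Map (String, String) [String]
dedupeEvidence contexts =
    Map.fromListWith
        (flip (++))   -- fromListWith passes (new, old); flip keeps input order
        [ ((ctxPromptHash c, ctxSessionId c), [ctxCommitSha c]) | c <- contexts ]
```

Each key in the result corresponds to one graph node, and the length of its list is the number of edges that node needs.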

Migration: forward-only ALTER TABLE commit_ai_contexts ADD COLUMN IF NOT EXISTS prompt_hash TEXT NOT NULL DEFAULT '' (IF NOT EXISTS per the migration convention in this plan), plus a backfill of existing rows in the same migration (small dataset on day 1).

2d-ui — Session timeline view

The existing per-commit view (Commit.hs renderCommitAiContextCards) keeps working — under the delta model, each commit now shows only the prompts exchanged for that commit, sorted chronologically.

Add a NEW view: session timeline. Stitches contexts across commits in a single chronological scroll, anchored at each commit boundary. Path: /lazare/<repo>/sessions/<session_id>. Reuses the existing card renderer; adds dividers labelled "→ commit <short-sha> (<title>)" between groups of prompts.

This is what makes the delta capture model legible: per-commit view = "what was discussed here", session view = "what was the agent thinking end-to-end". Both come for free once the underlying capture is delta-shaped.
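
One way the stitching could work, as a sketch: sort the session's contexts chronologically, then close each run of same-commit prompts with a divider. The Context and TimelineItem types are hypothetical, ctxOrder stands in for the (capture_timestamp, capture_index) ordering, and placing the divider after its group is an assumption read from the "→ commit" label.

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- Hypothetical row shape: capture order, owning commit, prompt text.
data Context = Context
    { ctxOrder  :: Int      -- stands in for (capture_timestamp, capture_index)
    , ctxCommit :: String   -- short SHA
    , ctxPrompt :: String
    }

data TimelineItem = PromptCard String | CommitDivider String
    deriving (Eq, Show)

-- Sort chronologically, then close each run of same-commit prompts
-- with a divider naming the commit that captured them.
sessionTimeline :: [Context] -> [TimelineItem]
sessionTimeline contexts = concatMap renderGroup groups
  where
    groups = groupBy ((==) `on` ctxCommit) (sortOn ctxOrder contexts)
    renderGroup [] = []
    renderGroup grp@(c : _) =
        map (PromptCard . ctxPrompt) grp ++ [CommitDivider (ctxCommit c)]
```

The view layer would then map PromptCard to the existing card renderer and CommitDivider to the labelled separator.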

Cross-cutting

Storage budget: A 4-hour active session might produce 80–200 prompts. Even at 200 rows per commit-burst, the table grows by ~2 MB per commit at most (200 rows × 10 KB average prompt body). Acceptable on the self-hosted side; worth a per-org budget knob (out of scope for v1) on the multi-tenant side.

Privacy: full-session capture means previously-implicit private chatter (debugging logs, throwaway thoughts) now lands in the repo's gitoku DB. Document in AGENTS.md so users know the capture model; offer a per-session opt-out flag on rgh capture for the quasi-private case.

Acceptance:

After 2d-rgh ships: a commit made during an active 30-message Claude Code session uploads ~30 rows (the delta since the prior commit), not 1.
After 2d-ui ships: git.lazare.ai/lazare/gitoku/sessions/<id> reconstructs the full session chronologically across all commits it spanned.
After 2d-server's dedup ships: any (prompt_hash, session_id) pair appears in at most ONE evidence node in the graph, with edges to every commit that captured it.

Tests: unit test in rgh/test/Test/CaptureSpec.hs for discoverFullSessionCommitCapture against a fixture session log; golden snapshot of the resulting CommitCapture. Server-side, a Hspec test that POSTs a multi-prompt batch and asserts row count matches input.

Diff size: ~200–400 lines rgh-side (capture + hook wiring + test), ~100–150 lines server-side (migration + dedup helper), ~100 lines view (session timeline page + sidebar entry). One small migration.

Stage 5a — Trace-as-product UI

Shipped (commit 079634c1). Added a new Activity tab to every repository page. Application/PageTabs.hs got RepositoryActivityTab (text code + parser), Web/View/Repositories/Show.hs got the pane id, tab nav link, RepositoryActivityView record, its View instance, and renderRepositoryActivityPageContent — a table with one row per graph_events row (when / event_type / actor / subject / payload), JSON payloads tucked behind a <details> for scan compactness, zero JS. Web/Controller/Repositories.hs got the dispatch entry + buildRepositoryActivityView, which fetches via Graph.fetchRecentEvents (cap: 100 rows, no pagination in v1). 676 modules total; full forced rebuild passed.

Goal: a viewer for the graph_events log, scoped per repository. Independent of every other stage; only needs Stage 1b ✅.

Why early: ships the first visible payoff of the graph layer without waiting on 2a/2b. After 2a lands, the same UI shows provenance events too.

Seam: new controller + view. Suggested path:

Web/Controller/RepositoryGraph.hs — new RepositoryGraphEventsAction that fetches the most recent N events for a repo, joined with their optional node/edge subjects.
Web/View/Repositories/GraphEvents.hs — table with columns time / event_type / actor / subject (node or edge) / payload preview.
Route + tab entry in the existing Repositories page tabs (Web/Controller/Repositories.hs near RepositoryTasksTab).

Sketch: uses Graph.fetchRecentEvents (the helper added in Stage 1b at Application/Graph.hs) directly. No new SQL.

Acceptance:

A repo with any history shows recent events (PR opened, review fired, etc.) ordered by created_at DESC.
Clicking a row expands the before_json / after_json payload.
Autorefresh works (use the same autoRefresh wrapper as the Tasks tab).

Tests: Playwright spec — open the new tab on a fixture repo, assert expected events render.

Diff size: ~200–400 lines (one controller + one view + one route entry + one tab + Playwright). Mostly view code.

Off-critical-path note: This is the cleanest "hand to a parallel agent" work in the plan. Could be done while 2a/2b are in flight.

Stage 5b — Semantic memory injection

Shipped (commit 5ac4719d). New Application/GraphMemory.hs exposes loadSemanticMemoryBlock :: Repository -> Int -> IO Text, which fetches the most recent N memory-flavored graph nodes (claim / decision / failure / evidence — the ones Stage 2a has been writing) and renders them as a bounded text block (default 20 nodes, 200 chars/line summary cap). Two prompt-render seams updated to insert the block: the in-process path in Application/Jobs/PromptPullRequest/Run.hs (its own renderCodexPrompt now takes the memory Text) and the subprocess path via a new GITOKU_PROMPT_PULL_REQUEST_SEMANTIC_MEMORY env var threaded through promptRunnerEnvironmentVariables → PromptRunnerEnvironment → Application/IsolatedExecution/Runner.hs's renderCodexPrompt. The new semanticMemorySection helper is shared between both paths. Empty memory → empty section so first-run prompts don't carry a stray header. No FTS yet — recent-first only.

Goal: at prompt-render time, look up the relevant graph subgraph for this repo + task and inject it into the Codex prompt as "what was already tried/decided/rejected". Stops re-litigation across runs.

Detail reference: developer-features-implementation.md §2 "Semantic memory" paragraph.

Seam: Application/IsolatedExecution/Runner.hs — renderCodexPrompt (approximate line 1465; verify at implementation time). Stage 0 already prepends agent instructions here; this stage adds a second prepend block for the subgraph.

Sketch:

-- In renderCodexPrompt, after the agent-instructions prepend:
memoryBlock <- Graph.readSemanticMemory repository
                  promptPullRequestJob.baseBranch
                  promptPullRequestJob.prompt
let prompt = agentInstructionsBlock
          <> renderMemoryBlock memoryBlock
          <> userTaskBlock

Graph.readSemanticMemory is a new function in Application/Graph.hs (call it Application/Graph/Memory.hs if it grows past ~80 lines). It queries graph_nodes filtered by:

repository_id = repo.id
node_type IN ('claim', 'decision', 'failure', 'evidence')
(optional) FTS match against the prompt text
ordered by created_at DESC
limited to N rows (config-driven; default 20).

The resulting subgraph is rendered as a structured block at the top of the prompt (Markdown table or list). Format follows the same shape Codex already expects from agent instructions.

Acceptance:

A second run of the same prompt on the same repo includes the first run's evidence nodes in the prompt.
The injected block is bounded (no more than N entries, no more than K characters total).
With no relevant memory, the block degenerates to nothing (not a header with empty body).
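
The bounding and empty-case rules above can be sketched as a pure renderer. This is an illustration, not the shipped semanticMemorySection: renderBoundedMemory and the header text are hypothetical, and summaries are plain strings here where the real code works with graph nodes and Text.

```haskell
import Data.List (intercalate)

-- Render at most maxNodes summaries, each cut to maxChars characters.
-- An empty input yields an empty string, so no stray header is emitted.
renderBoundedMemory :: Int -> Int -> [String] -> String
renderBoundedMemory maxNodes maxChars summaries
    | null kept = ""
    | otherwise = intercalate "\n" (header : map bullet kept) ++ "\n\n"
  where
    kept = take maxNodes summaries
    header = "Previously recorded context:"   -- hypothetical header text
    bullet summary = "- " ++ take maxChars summary
```

With the defaults quoted in this stage, a call would look like renderBoundedMemory 20 200 summaries. The total size is then bounded by roughly maxNodes × maxChars plus the header.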

Tests: unit test for readSemanticMemory (Hspec, against a fixture graph). Manual smoke test for prompt rendering.

Risk: prompt token budget. Memory blocks must be hard-capped or they crowd out the user's task. Cap is a config knob.

Diff size: ~80–150 lines (new memory reader + prompt-render integration + test fixture).

Stage 5c — Propose/accept workflow for agent graph edits

Shipped (commit 56686c02). No schema migration: proposals live as ProposalNode rows in graph_nodes, with status flipping between pending → accepted | rejected. Application/Graph.hs got the new node type plus Proposes and AcceptedFrom relations. Application/GraphProposals.hs (new) is the state machine: submitProposal (agent-facing, idempotent via optional source key), acceptProposal (creates the real target node + an AcceptedFrom edge back to the proposal, all in one transaction), rejectProposal, fetchPendingProposals. Web/Types.hs got ShowRepositoryActivityAction + RepositoryGraphProposalAcceptAction + RepositoryGraphProposalRejectAction (POST). Handlers in Web/Controller/Repositories.hs are gated on canWriteRepository and redirect back to the Activity tab. The Activity view now threads pendingProposals and renders a "Pending proposals" section with Accept/Reject buttons above the events table (buttons gated on canWrite).

Goal: agents can propose graph edits (new claims, decisions, edge updates) that land in a pending state; humans accept/reject from the trace UI (5a). Stops the graph from being write-only.

Detail reference: activegraph-continuity-plan.md Phase 3.

Seam: new node type proposal and edge type proposes. A proposal node carries the would-be node/edge payload; on acceptance, the dispatcher creates the real node/edge and links it via accepted_from. On rejection, the proposal stays for audit (with status = rejected).

Sketch (data model):

-- Reuse graph_nodes — no new table:
--   node_type = 'proposal'
--   status    = 'pending' | 'accepted' | 'rejected'
--   payload   = JSONB { proposed_node_type, proposed_payload, ... }

-- New edges:
--   relation = 'proposes'        -- proposal → would-be target
--   relation = 'accepted_from'   -- new real node ← proposal

Endpoints:

POST /graph/proposals (used by agents) — creates a proposal row.
POST /graph/proposals/:id/accept (used by humans via 5a UI) — flips status and creates the real node/edge.
POST /graph/proposals/:id/reject — flips status only.

Acceptance:

An agent run can emit a proposal during its trace.
The proposal shows up in the 5a UI with accept/reject buttons (gated on canWriteRepository).
Acceptance creates the real graph node + an accepted_from edge in one transaction; rejection leaves the proposal with status = 'rejected'.
Re-running the same proposal (idempotency key) is a no-op.
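
The state transitions and the idempotency rule in the acceptance list can be modelled in a few lines. This is a sketch under assumptions: the Map stands in for proposal rows in graph_nodes, the String key plays the role of the idempotency key, and submit, resolve, and decide are hypothetical names, not the functions in Application/GraphProposals.hs.

```haskell
import qualified Data.Map.Strict as Map

data ProposalStatus = Pending | Accepted | Rejected
    deriving (Eq, Show)

data Decision = Accept | Reject
    deriving (Eq, Show)

-- Proposals keyed by idempotency key (stand-in for rows in graph_nodes).
type Proposals = Map.Map String ProposalStatus

-- Only a pending proposal can change state; a decided one is left as-is.
decide :: Decision -> ProposalStatus -> ProposalStatus
decide Accept Pending = Accepted
decide Reject Pending = Rejected
decide _ settled = settled

-- Submitting a key that already exists keeps the existing row untouched.
submit :: String -> Proposals -> Proposals
submit key = Map.insertWith (\_new existing -> existing) key Pending

-- Apply a human decision to one proposal; unknown keys are ignored.
resolve :: Decision -> String -> Proposals -> Proposals
resolve decision = Map.adjust (decide decision)
```

A retry of the same submission after acceptance leaves the proposal accepted, and a late reject on an accepted proposal changes nothing, so both replays are no-ops.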

Tests: unit tests for the accept/reject state transitions; Playwright for the UI flow.

Diff size: ~250–400 lines (controller + service + UI integration in the 5a view + tests).

Stage 4b-* — `workspace_snapshot` graph node + capture hook (metadata only)

Shipped (commit d2b036e8). WorkspaceSnapshotNode added to the GraphNodeType vocabulary; recordWorkspaceSnapshot + workspaceSnapshotSourceId added to Application/GraphProvenance.hs; wired into the same withTransaction as the 2a dual-write at Application/Jobs/PromptPullRequest/Run.hs:~146. Backend field reflects the actual sandbox mode (sandboxAgentRunMode); fs_ref stays null until Stage 4a populates it. Pure unit tests for the source-id stability in Test/PromptPullRequestsSpec.hs.

Goal: add the graph-side model for workspace snapshots and a hook at prompt-PR success time, so the data shape is in place before Stage 4a provides real ZFS bytes.

Detail reference: nixos-sandbox-fleet-plan.md §4 "Snapshot capture and fork" — but only the metadata half.

Seam: Application/Jobs/PromptPullRequest/Run.hs:146 (same location as 2a). After a successful run, emit a workspace_snapshot node:

-- After 2a's dual-write, in the same transaction:
_ <- Graph.upsertNode repository GraphNodeWorkspaceSnapshot
        ("snapshot:" <> commitSha)
        (Aeson.object
            [ "commit_sha" .= commitSha
            , "branch"     .= compareBranch
            , "backend"    .= ("local-runner" :: Text)
            , "fs_ref"     .= Aeson.Null   -- populated by 4a later
            ])

When 4a ships, runOnLocalRunner / runOnMicroVm populate fs_ref with the actual ZFS dataset path. Until then, fs_ref stays null and "snapshots" are conceptually branches only.

Acceptance:

Every successful prompt-PR leaves a workspace_snapshot node with fs_ref = null.
The 5a trace UI can list snapshots (filter node_type = 'workspace_snapshot').
Future Stage 4a / 4b work updates fs_ref in the same row (idempotent upsert on (repository_id, source_kind, source_id)).

Tests: extend the 2a test fixture — assert the snapshot node is created alongside the provenance evidence nodes.

Diff size: ~40 lines added to the same seam as 2a.

Stage 5d-* — Branch-based self-improvement loop (no warm fs)

Goal: the gitoku-bootstrapping-gitoku story: run the same prompt 3 times via the fork mechanism (Stage 3 ✅), score the resulting diffs, and promote the best as the merged PR. Branch-only — the warm-fs version waits on Stage 4b-full.

Detail reference: activegraph-continuity-plan.md Phase 4 ("self-improvement"); developer-features-implementation.md §1's headline (subject to the warm-fs caveat).

Seam: new job type SelfImprovePromptJob that:

Creates the original prompt-PR run (existing flow).
On success, forks it K times via forkPromptPullRequestJob (Stage 3 ✅) with the same prompt + a "variation seed" injected in the agent instructions.
Waits for all K forks to finish.
Runs a scoring step: diff each result against the base, compute a score (test pass count, line count, agent's own confidence, etc.).
Emits a score graph node per fork (new edge scored_as).
Promotes the highest-scoring fork by updating its PR to "ready for review" and closing the losers as superseded.
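
The final promotion step in the list above could be sketched as a pure function over already-computed scores. Everything here is an assumption for illustration: pull requests are identified by an Int, scores are plain Doubles from whatever rubric is chosen, and promote and Outcome are hypothetical names.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

data Outcome = ReadyForReview | SupersededBy Int
    deriving (Eq, Show)

-- Given (pull request id, score) pairs, mark the best-scoring PR ready for
-- review and every other PR as superseded by it.
promote :: [(Int, Double)] -> [(Int, Outcome)]
promote [] = []
promote scored = [ (pr, outcome pr) | (pr, _) <- scored ]
  where
    (winner, _) = maximumBy (comparing snd) scored
    outcome pr
        | pr == winner = ReadyForReview
        | otherwise = SupersededBy winner
```

The result always contains exactly one ReadyForReview entry for a non-empty input with distinct PR ids, which matches the "exactly one PR is ready-for-review" acceptance criterion. A real implementation would also need an explicit tie-breaking policy.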

Acceptance:

A self-improve invocation creates K+1 PRs (1 original + K forks).
After scoring, exactly one PR is ready-for-review; the rest are closed with a superseded_by link.
The trace UI shows the scoring decisions (graph edges between forks and the winner).

What's missing without 4a/4b-full: the forks all start from a branch, not a snapshotted workspace. They re-clone and re-install dependencies per fork — slow. With 4a/4b-full, each fork boots from a ZFS clone of the original run's filesystem in seconds.

Tests: complex; suggest manual smoke + a Playwright e2e on a tiny synthetic repo.

Diff size: ~200–300 lines (new job runner + scoring service + UI hooks).

Cross-cutting

Migrations

Every stage that touches the schema adds a forward-only, <unixtime>-<desc>.sql migration under Application/Migration/. Pattern matches the Stage 0 / Stage 3 migrations (1780225150-add-prompt-agent-fields.sql, 1780227009-add-prompt-forked-from-job.sql). Always use IF NOT EXISTS so re-application is safe. Schema.sql kept in sync in the same commit.

Testing convention

Unit work → Hspec under Test/.
Browser-visible flows → Playwright under test_playwright/.
Per AGENTS.md, the canonical compile check is devenv up, not cabal build exe:App. Focused supplemental specs are fine; the test suite is currently uncompilable due to unrelated pre-existing blockers in Test/GitCliSpec.hs etc. — that's a separate cleanup task and shouldn't block this plan's PRs.

Observability

Each new dispatch path should log structured events the same way the existing Application.RepositoryWorkflowTriggers does. The graph_events table doubles as the audit log; no separate logging infrastructure needed.

Formatter

Every PR runs direnv exec . ./scripts/format-haskell.sh --staged before commit (or trusts the managed pre-commit hook if installed). Expect the diff to expand on first format pass of a file that hasn't been touched in a while — that's fourmolu reformatting the whole file, not the change itself.

Definition of done (whole plan)

When all stages in this plan land, the master table in doc/README.md should read:

0   ✅  1a  ✅  1b  ✅  2a  ✅  2b  ✅  2c  ✅
3   ✅  4a  ⬜  4b-* ✅  5a  ✅  5b  ✅  5c  ✅  5d-* ✅

i.e. every stage except 4a complete. The remaining "real ZFS" work in 4b / 5d hard-depends on 4a; both are tracked as partial / metadata only in the doc until 4a ships.

User-visible state at that point:

Reactive auto-review is generic (data-driven behaviors).
Per-repo agents wake on events (push, PR, etc.) automatically.
Every prompt-PR commit accrues evidence + snapshot metadata in the graph.
Trace UI shows the running event log.
Subsequent runs of the same prompt see the prior runs' evidence in their prompt (semantic memory).
Agents can propose graph edits; humans accept/reject.
Self-improve runs the same prompt K ways and promotes the best (slow: cold-clone per fork until 4a/4b-full).

Everything except "fast self-improve via warm fs" works without 4a.

Open questions to confirm at implementation time

Graph node-type vocabulary. evidence, proposal, workspace_snapshot, score, decision, claim, failure, file_range — finalise the minimal set before 2a lands so the rest don't have to migrate.
Idempotency keys for proposals. A natural key like (agent_run_id, proposal_index) avoids dupe proposals on retry. Confirm the agent runner can supply both.
Memory subgraph FTS vs vector. Stage 5b's readSemanticMemory is spec'd as BM25/keyword (matches what gitnexus does), although the shipped v1 is recent-first only, with no FTS yet. If we want semantic-vector retrieval here, the graph schema doesn't need to change but the reader does.
Scoring rubric for 5d. Tests-pass count, diff size, agent confidence, external judge model — pick one for v1 and ship; the rest are pluggable.
Self-improve agent count. K = 3 by default; configurable per-repo under [gitoku.selfimprove] in gitoku.toml.
