Continuity completion plan
Execution roadmap for the remaining ActiveGraph continuity stages (2a → 5, excluding 4a) — seam, sketch, acceptance, and tests per stage.
A crisp execution roadmap for the remaining continuity stages, not a re-explanation of the underlying design. Each stage points back to its detail-bearing plan (activegraph-continuity, developer-features, nixos-sandbox-fleet) and gives only what's needed to cut a PR: seam, sketch, acceptance, tests.
Status: ready to execute — covers every remaining stage in the master table except Stage 4a (the NixOS sandbox-host + microVM fleet, reserved for a dedicated infra push); Stages 4b and 5d are included in degenerate, branch-based form.
Scope at a glance
| ID | Stage | Size | Depends on | Why now |
|---|---|---|---|---|
| 2a | Provenance → graph dual-write | XS (~50 lines, 1 file) | 1b ✅ | ✅ shipped (af443555) |
| 2b | Behavior registry / dispatcher | M (~150–300 lines, 3–5 files) | 1b ✅ | ✅ shipped (1367e19b) |
| 2c | Agent triggers (agents-as-behaviors) | S (~50–100 lines glue) | 0 ✅, 2b | ✅ shipped (84db920e) |
| 5a | Trace-as-product UI (graph_events viewer) | M (~200–400 lines) | 1b ✅ | ✅ shipped (079634c1) |
| 5b | Semantic memory injection at prompt render | S (~80–150 lines) | 2a | ✅ shipped (5ac4719d) |
| 5c | Propose/accept workflow for agent graph edits | M (~250–400 lines) | 2a + 2b | ✅ shipped (56686c02) |
| 4b-* | workspace_snapshot graph node + capture hook (metadata only) | XS (~40 lines) | 1b ✅ | ✅ shipped (d2b036e8) |
| 5d-* | Branch-based self-improvement loop (no warm fs) | M (~200–300 lines) | 3 ✅ + 2a + 2b | Fork → diff → score → promote; bytes-side waits on 4a |
Excluded: 4a (NixOS sandbox-host module + microVM fleet + nixos-anywhere + disko + ZFS + microvm.nix + runner closure + attic), reserved for a separate dedicated push. Anything in 4b/5d that needs real filesystem snapshots is deferred with it.
Total estimate (excl. 4a): ~6 mergeable PRs, ~1000–1700 lines of net code, plus tests + migrations + view templates. Cleanest path to ship all of it: 2a → 2b → 2c → 5b → 4b-* → 5c → 5d-*, with 5a running in parallel anywhere along the way.
Execution order + parallelization
Sequential spine (cheapest to richest):
2a → 2b → 2c → 5b → 5c → 5d-*
Off-spine, parallelizable:
5a (trace UI): needs only 1b ✅; can start day 1
4b-* (snapshot node): needs only 1b ✅; can land anytime
5a and 4b-* are deliberately off the critical path: each is a small, self-contained deliverable that doesn't block or get blocked by the spine. Both are good "parallel agent" hand-offs (same pattern as fork-agent-run-plan.md was for Stage 3).
Stage 0b — Ambient repository guidance (out-of-band)
Shipped (commit f3797fa2). Added Application/RepoGuidance.hs (new); export bump in Application/Repositories.hs; threaded into both prompt-PR paths (Application/Jobs/PromptPullRequest/Run.hs: runOnAwsTask, runOnLocalRunner, and runCodexCommand); env-var roundtrip via Application/IsolatedExecution/Runner.hs (GITOKU_PROMPT_PULL_REQUEST_REPO_GUIDANCE). Documented in AGENTS.md + CLAUDE.md. Build green (677 → 678 modules).
Goal: every prompt-PR run automatically sees the branch's own agent
guidance (AGENTS.md / CLAUDE.md / README.md / .cursor/rules /
.windsurfrules / .github/copilot-instructions.md /
.gitoku/instructions.md, plus any other top-level *.md) without the user
having to wire each file through an [agent.*] instructions_file in
gitoku.toml. Stage 0 covered per-agent manifests; this covers the
ambient "how does this codebase think about itself" layer.
Prompt order (final stack as of this stage):
branch context → mode → repo guidance (Stage 0b) → agent instructions (Stage 0) → semantic memory (Stage 5b) → user task
Each later stage layers a more specific signal on top of the prior: repo guidance answers "how does this codebase think about itself?"; agent instructions answer "what's your specific role on this run?"; semantic memory answers "what was decided / failed before?".
Caps: 8 KB per file, 32 KB total. Allowlisted files (the seven
above) are never sacrificed under budget pressure; other markdown is
appended alphabetically until the cap is reached, then dropped.
Excluded prefixes (docs/, vendor/, node_modules/, dist-newstyle/,
etc.) prevent generated trees from crowding out real guidance.
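The cap rules above can be modelled as a pure selection function. This is a minimal sketch, not the code in Application/RepoGuidance.hs: GuidanceFile and selectGuidance are hypothetical names, String length stands in for byte size (exact only for ASCII), and letting allowlisted files exceed the total cap on their own is an assumption about what "never sacrificed" means.

```haskell
import Data.List (partition, sortOn)

-- Hypothetical stand-in for one guidance file: path plus contents.
data GuidanceFile = GuidanceFile
  { filePath :: FilePath
  , contents :: String
  } deriving (Eq, Show)

perFileCap, totalCap :: Int
perFileCap = 8 * 1024
totalCap   = 32 * 1024

-- The seven allowlisted names from the goal paragraph.
allowlist :: [FilePath]
allowlist =
  [ "AGENTS.md", "CLAUDE.md", "README.md", ".cursor/rules"
  , ".windsurfrules", ".github/copilot-instructions.md"
  , ".gitoku/instructions.md"
  ]

-- Truncate every file to the per-file cap, keep all allowlisted files,
-- then append the remaining files alphabetically while they still fit.
selectGuidance :: [GuidanceFile] -> [GuidanceFile]
selectGuidance files =
  pinned ++ go (totalCap - used pinned) (sortOn filePath others)
  where
    capped = [f { contents = take perFileCap (contents f) } | f <- files]
    (pinned, others) = partition ((`elem` allowlist) . filePath) capped
    used = sum . map (length . contents)
    go _ [] = []
    go budget (f : rest)
      | size <= budget = f : go (budget - size) rest
      | otherwise      = []  -- cap reached: drop this file and all later ones
      where size = length (contents f)
```

Stopping at the first file that does not fit (rather than skipping it and trying smaller ones) is one reading of "appended alphabetically until the cap is reached, then dropped"; it keeps the result a stable alphabetical prefix.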
Why out-of-band: this wasn't in the original 2026-05-31 plan; user flagged the gap mid-plan ("many projects already include markdown for agents which should be fed automatically on first prompt call of any task"), so it shipped between Stages 5c and 5d-*.
Stage 0c — Unify the AI surface: context layering + provider abstraction (planned, not started)
Stage 0c is a two-pillar plan. The two pillars are orthogonal — either ships independently — but together they unify gitoku's AI story end-to-end:
- Pillar 1 — prompt context unification. Every AI surface (not just prompt-PR) sees the same layered context (ambient repo guidance + bounded semantic memory). What we send is unified.
- Pillar 2 — LLM provider abstraction. Every AI surface can be routed to any supported provider (Codex CLI, Claude Code CLI, OpenAI direct, OpenRouter, Anthropic direct), with one tool-call shape across the wire. Where we send it is unified.
Both pillars converge on the same end-state: a repo's
gitoku.toml [agent.reviewer] can declare e.g. runner = "claude-code", chat-provider = "openrouter", model = "anthropic/claude-opus-4-7", and all five AI surfaces route
correctly with the right prompt context layered in.
Pillar 1 — Prompt context unification
Goal: every AI surface in gitoku (not just prompt-PR) sees the same
layered prompt context — ambient repo guidance (Stage 0b) and bounded
semantic memory (Stage 5b) — so a repo's AGENTS.md / CLAUDE.md
improves all AI behavior, not just prompt-PR runs.
Why this is a real gap. Today four AI surfaces ship their own
ad-hoc system prompts that ignore AGENTS.md, CLAUDE.md, semantic
memory, and per-repo conventions:
| Surface | LLM client | Prompt-builder seam |
|---|---|---|
| PR Review | Codex (isolated runner) | Application/Jobs/PullRequestReview/Run.hs:381 renderCodexReviewPrompt |
| Conflict Resolution | Codex (isolated runner) | Application/Jobs/PullRequestConflictResolution/Run.hs:417 renderConflictResolutionPrompt |
| Diff-thread AI reply | OpenAI streaming chat (IHP.OpenAI) | Application/Jobs/PullRequestDiffAiResponse/Request.hs buildPullRequestDiffAiResponseCompletionRequest |
| PR Form suggestion | OpenAI tool-call (IHP.OpenAI) | Application/Jobs/PullRequestFormSuggestion/Request.hs:90 renderPullRequestFormSuggestionSystemMessage |
So a repo can tell the prompt-PR agent "we don't use try/catch in this
codebase" via AGENTS.md, but the diff-reply AI on the same repo will
happily suggest try/catch. This breaks the "AGENTS.md is gitoku's
universal agent context" promise that Stage 0b established.
Two surface families, two shapes:
- Codex isolated-runner surfaces (Review + ConflictResolution). Same shape as prompt-PR: env-var roundtrip via Application/IsolatedExecution/Runner.hs, prepend repoGuidanceSection <> semanticMemorySection in the renderCodexXPrompt helper. Mechanically identical to the Stage 0b work on PromptPullRequest.
- OpenAI chat surfaces (DiffAiResponse + FormSuggestion). Different shape: GPT.systemMessage / GPT.userMessage rather than a single rendered text block. Inject guidance + memory as a second leading GPT.systemMessage before the existing one. These are sync / interactive, so they get tighter byte caps.
Shared seam to add first
A new module Application/PromptContextBlock.hs that exposes a
uniform loader. Today the composition logic is inline in
PromptPullRequest/Run.hs; lift it so every surface uses one entry
point with per-surface budget knobs.

    data SurfaceContextOptions = SurfaceContextOptions
      { repoGuidanceTotalCap      :: Int        -- override 0b's 32 KB default
      , repoGuidancePerFileCap    :: Int        -- override 0b's 8 KB default
      , repoGuidanceAllowlistOnly :: Bool       -- skip non-allowlisted *.md
      , semanticMemoryNodeLimit   :: Maybe Int  -- Nothing disables 5b for this surface
      }

    loadSurfaceContext :: (?context :: ctx, ConfigProvider ctx)
      => Repository -> Text -> SurfaceContextOptions -> IO Text

Refactor PromptPullRequest/Run.hs's existing inline composition to call loadSurfaceContext with prompt-PR's existing budgets. This is a behavior-preserving refactor; ship it as 0c-0 before any surface work.
Pillar 1 sub-stages (parallelizable)
0c-prompt-0 — PromptContextBlock extraction (refactor, no behavior change)
Scope: new module + cut-over of PromptPullRequest/Run.hs to use
it. Existing semantic-memory and repo-guidance behavior unchanged.
Done when: build green, prompt-PR runs produce byte-identical prompts to before (snapshot test on a fixture repo).
0c-prompt-1 — PR Review picks up the prompt stack
Seams:
- renderCodexReviewPrompt (Application/Jobs/PullRequestReview/Run.hs:381): prepend repoGuidanceSection <> semanticMemorySection.
- AWS task path + local-runner path in the same module: call loadSurfaceContext before assembling env vars (mirror the two call sites in PromptPullRequest/Run.hs).
- PullRequestReviewRunnerEnvironment in Application/IsolatedExecution/Runner.hs: add repoGuidance :: Text, semanticMemory :: Text fields and the env-var reads (GITOKU_PULL_REQUEST_REVIEW_REPO_GUIDANCE, GITOKU_PULL_REQUEST_REVIEW_SEMANTIC_MEMORY).
- The Runner's renderCodexPrompt for review (sibling of the PromptPullRequest one) re-injects the two sections.
Budgets: identical to prompt-PR (32 KB guidance, full semantic memory). Review is offline and can be heavy.
Branch to load guidance against: pullRequest.compareBranch
(the branch being reviewed — its AGENTS.md is what's relevant for
this change).
Done when: a PR review run on a repo with AGENTS.md visibly
cites repo conventions in its review comments. Add a fixture-repo
manual check; no automated test for prompt content.
0c-prompt-2 — Conflict Resolution picks up the prompt stack
Seam: renderConflictResolutionPrompt
(Application/Jobs/PullRequestConflictResolution/Run.hs:417).
Same env-var + Runner-side wiring as 0c-1, just for the conflict
resolution surface.
Budgets: 16 KB guidance (already a large diff in the prompt; leave headroom), full semantic memory enabled — "how were similar conflicts resolved before" is exactly what memory is for.
Branch: pullRequest.baseBranch (the merge target — that's
where the project conventions the resolution must respect live).
Done when: conflict resolution on a repo with code-style rules
in AGENTS.md preserves them (e.g. import ordering, brace style).
0c-prompt-3 — Diff-thread AI reply picks up the prompt stack
Seam: buildPullRequestDiffAiResponseCompletionRequest
(Application/Jobs/PullRequestDiffAiResponse/Request.hs). Not
Codex — uses IHP.OpenAI.
Shape change: instead of a single text prompt, prepend an
additional GPT.systemMessage containing the bounded block:

    -- loadSurfaceContext returns IO Text, so bind its result with <-
    contextBlock <- loadSurfaceContext repository branch opts
    let messages =
          [ GPT.systemMessage contextBlock    -- new (Stage 0c-3)
          , GPT.systemMessage existingSystem  -- existing
          , GPT.userMessage existingUser      -- existing
          ]

Budgets: 4 KB total guidance, semantic memory top-3 nodes only. This is a sync interactive surface: latency matters, and the diff itself is already in the user message.
Branch: pullRequest.compareBranch.
Done when: a diff-thread reply on a repo with AGENTS.md
follows repo-specific advice (e.g. "always include a test plan in
your reply" → reply now does).
Shipped (commit 6eddeaa9 + UI footer follow-up). Backend wired through Pull Request Diff AI Response runner; visible badge appended INSIDE the reply comment body itself (appendContextBytesFooter in Application/Jobs/PullRequestDiffAiResponse/Run.hs) so the bytes-spent line travels with the artifact wherever the comment is read.
0c-prompt-4 — PR form-suggestion picks up the prompt stack
Seam: renderPullRequestFormSuggestionSystemMessage
(Application/Jobs/PullRequestFormSuggestion/Request.hs:90).
Shape change: append the guidance block to the existing system
message string (or pass as a second GPT.systemMessage like 0c-3 —
pick whichever keeps the existing template-handling logic cleanest).
Budgets: 2 KB, allowlist-only (AGENTS.md, CLAUDE.md,
README.md, PULL_REQUEST_TEMPLATE.md if present). Semantic memory
disabled — this is one-shot metadata generation, not a
knowledge-recall task.
Branch: pullRequest.compareBranch.
Done when: a generated PR title/description matches the
conventions in the repo's PULL_REQUEST_TEMPLATE.md + tone in
AGENTS.md.
Shipped (commit 6eddeaa9 + UI footer follow-up). Backend wired through buildPullRequestFormSuggestionCompletionRequest; visible footer rendered in the create-PR form (renderPullRequestSuggestionContextFooter in Web/View/Repositories/Show.hs). The badge is UI-only: the suggested title and description text stay clean of gitoku metadata since they become the user's PR description.
Pillar 2 — LLM provider abstraction
Goal: every AI surface can target Codex, Claude Code, OpenAI,
OpenRouter, or Anthropic-direct — selected per-agent in gitoku.toml,
with sensible per-surface defaults — and tool-call definitions are
authored once but routed to every provider's native shape at the
wire.
Why this is the natural completion of 0b + Pillar 1. Today the
AI provider is hardcoded per surface: Codex CLI for the three
agentic surfaces, OpenAI direct (gpt-5-mini) for the two HTTP
surfaces. That makes the prompt-PR / review / conflict surfaces
"only as good as Codex," and the diff-reply / form surfaces "only
as good as gpt-5-mini." A repo that prefers Claude for review and
GPT for cheap form-gen has no way to express that, and BYOLLM (an
enterprise-asked feature; see activegraph plan §9)
is impossible.
What's already in the codebase:
- Application.CodexCredentials: per-user Codex auth JSON, consumed by all three Codex CLI surfaces.
- Application.ClaudeCodeCredentials: per-user Claude Code auth env JSON. UI exists (settings → Claude tab); no consumer wires it into a runner yet. Scaffolded ready for 0c-provider-A.
- claude-code-nix flake input + overlay (host nixos-config has pkgs.claude-code available). Same shape as pkgs.codex.
- IHP.OpenAI: gitoku's chat client. Used by diff-reply + form suggestion with hardcoded model + key.
What's missing:
- A runner :: { codex | claude-code } field on the agent / per-run config and a CLI dispatcher in the isolated runner.
- A chat-provider :: { openai-direct | openrouter | anthropic-direct } field on the chat-completion surfaces and a three-way client dispatcher.
- An Anthropic Messages API adapter that translates from gitoku's internal IHP.OpenAI.CompletionRequest shape (including tools) to the Anthropic wire format, so the surfaces stay authored against one shape.
Pillar 2 sub-stages (parallelizable across A and B; sequential within each track)
0c-provider-A — CLI-runner provider abstraction (Codex ↔ Claude Code)
Two CLI-based agentic loops, same role, different binaries. Both take a prompt + workspace and produce a git commit. Differences are which binary, which auth env, which model flag, which subprocess invocation shape.
0c-provider-A1 — Introduce AgentRunnerKind type + selection plumbing.
- New data AgentRunnerKind = CodexCliRunner | ClaudeCodeCliRunner in a new Application/AgentRunners.hs.
- gitoku.toml [agent.*] schema gains optional runner = "codex" | "claude-code" (default = codex for backward compat).
- prompt_pull_request_jobs.agent_runner TEXT column + forward-only migration; sibling columns on pull_request_review_jobs and pull_request_conflict_resolution_jobs.
- Parser + persisted value validated against the enum.
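The parser and enum validation in the list above could look roughly like the following. It is a sketch under assumptions: renderAgentRunnerKind and parseAgentRunnerKind are hypothetical names, String stands in for Text, and rejecting unknown values (instead of silently defaulting) is a design choice the plan does not spell out.

```haskell
-- The two runner kinds named in 0c-provider-A1.
data AgentRunnerKind = CodexCliRunner | ClaudeCodeCliRunner
  deriving (Eq, Show, Enum, Bounded)

-- Spelling used in gitoku.toml and in the agent_runner column.
renderAgentRunnerKind :: AgentRunnerKind -> String
renderAgentRunnerKind CodexCliRunner      = "codex"
renderAgentRunnerKind ClaudeCodeCliRunner = "claude-code"

-- Parse the optional `runner` value. A missing key falls back to Codex
-- for backward compatibility; an unknown value is an error, not a default.
parseAgentRunnerKind :: Maybe String -> Either String AgentRunnerKind
parseAgentRunnerKind Nothing = Right CodexCliRunner
parseAgentRunnerKind (Just raw) =
  case [k | k <- [minBound .. maxBound], renderAgentRunnerKind k == raw] of
    (k : _) -> Right k
    []      -> Left ("unknown agent runner: " ++ show raw)
```

Deriving the parser from the renderer keeps the TOML spelling and the persisted column value in one place, so a third runner kind only needs a new constructor and one render clause.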
0c-provider-A2 — Wire the Claude Code CLI as a peer to codex in the isolated runner.
- Application/IsolatedExecution/Runner.hs's runPromptPullRequest (and the two sibling review / conflict entrypoints) dispatch on AgentRunnerKind before calling the CLI subprocess.
- New env-var contract: GITOKU_AGENT_RUNNER (= codex or claude-code), GITOKU_CLAUDE_CODE_AUTH_ENV_B64. The Codex runner's existing env vars are unchanged.
- Subprocess invocation (stable flag surface, verified against claude v2.1.158):

        ANTHROPIC_API_KEY=… claude --bare --print \
          --dangerously-skip-permissions \
          --append-system-prompt-file <guidance.md> \
          --model <model> \
          --output-format json \
          --settings '{}' \
          <prompt>

    - --bare isolates the run from ~/.claude/settings.json, hooks, plugin sync, OAuth, and keychain reads; forces auth to ANTHROPIC_API_KEY only (perfect for our subprocess contract). Without it, the runner would inherit any host-side Claude Code config that happened to be installed.
    - --dangerously-skip-permissions skips per-edit approval, which is safe because we control the sandbox. Catastrophic shell commands still prompt (which would hang us), so the --bare --settings '{}' pair also suppresses the prompt-source that would emit them.
    - --append-system-prompt-file reads the file at start, which is cleaner than passing 32 KB of guidance on argv. The file is the rendered Pillar-1 context block written to a tmpfs path by the runner before exec.
    - --output-format json is deterministic and stable; gives us is_error, terminal_reason, total_cost_usd, usage, and session_id. Per-token cost surfacing is a nice-to-have for the trace UI.
    - cwd is inherited; no --cwd flag needed: the runner chdirs into the cloned workspace before exec, same as it does for Codex.
- The auth env is sourced from loadUserClaudeCodeAuthEnvironmentForUse (already exists); Stage 0c-provider-A2 is the first real consumer.
- Both runners write the same trailing JSON commit-summary line to stdout so the runner's existing log parser stays unchanged.
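On the Haskell side, the invocation above would most naturally be built as an argument vector rather than a shell string. The sketch below only restates the flag list from this section (the flags are not re-verified here); ClaudeInvocation and claudeArguments are hypothetical names.

```haskell
-- Hypothetical inputs for one headless Claude Code run.
data ClaudeInvocation = ClaudeInvocation
  { guidanceFile :: FilePath  -- rendered Pillar-1 context block on tmpfs
  , modelName    :: String
  , promptText   :: String
  } deriving (Eq, Show)

-- Argument vector for the `claude` binary, in the order listed above.
-- ANTHROPIC_API_KEY is expected in the process environment, never on argv.
claudeArguments :: ClaudeInvocation -> [String]
claudeArguments inv =
  [ "--bare", "--print"
  , "--dangerously-skip-permissions"
  , "--append-system-prompt-file", guidanceFile inv
  , "--model", modelName inv
  , "--output-format", "json"
  , "--settings", "{}"
  , promptText inv
  ]
```

Passing a list to the process API means the prompt and the '{}' settings value need no shell quoting, and the prompt can contain arbitrary characters without escaping.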
0c-provider-A3 — Selection UI + per-surface defaults.
- The settings → Agents page (already exists for [agent.*]) gains a runner radio selector per agent.
- Per-surface defaults: leave existing surfaces on Codex. Allow a server-wide GITOKU_DEFAULT_AGENT_RUNNER env var for the managed-host preference (lets git.lazare.ai default to Claude Code without per-repo configuration).
Done when: a [agent.reviewer] in gitoku.toml with runner = "claude-code" runs a PR-review via the claude CLI subprocess,
writes the same commit + summary shape, and shows up in the trace
UI with the runner kind labeled.
0c-provider-B — HTTP chat-provider abstraction (OpenAI ↔ OpenRouter ↔ Anthropic)
Three HTTP backends, one internal API shape. IHP.OpenAI's
CompletionRequest becomes gitoku's lingua franca; OpenRouter takes
it verbatim (OpenAI-compatible), Anthropic gets a wire adapter.
0c-provider-B1 — Introduce ChatProvider config.
- New data ChatProviderKind = OpenAiDirect | OpenRouter | AnthropicDirect in a new Application/ChatProviders.hs.
- data ChatProviderConfig = ChatProviderConfig { kind, baseUrl, authHeader, defaultModel }.
- Config/Config.hs reads three env vars, one per provider (GITOKU_OPENAI_API_KEY, GITOKU_OPENROUTER_API_KEY, GITOKU_ANTHROPIC_API_KEY), and surfaces a Map ChatProviderKind ChatProviderConfig.
- gitoku.toml [agent.*] gains optional chat-provider + model keys (so a repo can declare "this agent uses Claude Opus via OpenRouter"). Per-surface defaults if unspecified.
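A possible shape for the provider map described in B1, as a pure function over an environment lookup so it can be tested without touching the real environment. The field types, the function names (apiKeyVariable, configFor, chatProviders), and the choice to omit providers whose key is missing are assumptions; the base URLs and header names are the providers' public API conventions, and gpt-5-mini is the default this plan already mentions for OpenAI.

```haskell
import qualified Data.Map.Strict as Map
import Data.Maybe (mapMaybe)

data ChatProviderKind = OpenAiDirect | OpenRouter | AnthropicDirect
  deriving (Eq, Ord, Show, Enum, Bounded)

-- Field types are assumed; the plan only names the fields.
data ChatProviderConfig = ChatProviderConfig
  { kind         :: ChatProviderKind
  , baseUrl      :: String
  , authHeader   :: (String, String)  -- header name and value
  , defaultModel :: Maybe String
  } deriving (Eq, Show)

apiKeyVariable :: ChatProviderKind -> String
apiKeyVariable OpenAiDirect    = "GITOKU_OPENAI_API_KEY"
apiKeyVariable OpenRouter      = "GITOKU_OPENROUTER_API_KEY"
apiKeyVariable AnthropicDirect = "GITOKU_ANTHROPIC_API_KEY"

-- Build a config only when the provider's API key is present.
configFor :: (String -> Maybe String) -> ChatProviderKind -> Maybe ChatProviderConfig
configFor lookupVar providerKind = do
  key <- lookupVar (apiKeyVariable providerKind)
  pure $ case providerKind of
    OpenAiDirect ->
      ChatProviderConfig providerKind "https://api.openai.com/v1"
        ("Authorization", "Bearer " ++ key) (Just "gpt-5-mini")
    OpenRouter ->
      ChatProviderConfig providerKind "https://openrouter.ai/api/v1"
        ("Authorization", "Bearer " ++ key) Nothing
    AnthropicDirect ->
      ChatProviderConfig providerKind "https://api.anthropic.com"
        ("x-api-key", key) Nothing

chatProviders :: (String -> Maybe String) -> Map.Map ChatProviderKind ChatProviderConfig
chatProviders lookupVar =
  Map.fromList
    [ (kind cfg, cfg)
    | cfg <- mapMaybe (configFor lookupVar) [minBound .. maxBound]
    ]
```

In Config/Config.hs the lookup argument would presumably be fed from the process environment read once at startup; a provider absent from the map then reads naturally as "not configured on this host".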
0c-provider-B2 — Refactor existing OpenAI usage to read from ChatProvider.
- Diff-reply + form-suggestion stop calling GPT.defaultConfig (cs openAIApiKey) directly; instead resolve ChatProviderConfig per surface, then call GPT.defaultConfig { baseUrl, authHeader }.
- For OpenAiDirect and OpenRouter this is enough: both speak the OpenAI Chat Completions wire format.
- Behavior-preserving: with no env changes, behaves exactly as today (defaults to OpenAiDirect).
0c-provider-B3 — Anthropic Messages API adapter (covers Pillar 2's "tool calls" ask).
- New Application/Llm/AnthropicAdapter.hs.
- Translates GPT.CompletionRequest → Anthropic Messages POST body:
    - messages → messages (role mapping, system message split out into top-level system field per Anthropic spec);
    - tools = [GPT.Function] → Anthropic tools = [{ name, description, input_schema }] (mechanical rename; parametersJsonSchema ≡ input_schema);
    - tool_choice → tool_choice.
- Translates Anthropic streaming events (content_block_delta with tool_use block) back into the OpenAI-style chunk shape (GPT.CompletionChunk with tool_calls delta) so the existing surface consumers (extractPullRequestDiffAiResponseOutput, the form-suggestion stream consumer) need no changes.
- Streaming tool-call reassembly is mechanical: Anthropic streams input_json_delta strings; concatenate them per tool block until the closing event, then emit the equivalent OpenAI tool_calls[i].function.arguments final chunk.
- This sub-stage is the tool-call harmonization the user asked for: a tool authored once as GPT.Function works across OpenAI, OpenRouter, and Anthropic.
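The translation and reassembly steps above can be sketched with simplified stand-in types. None of these types are the real IHP.OpenAI or Anthropic payloads: JSON values are plain strings, StreamEvent is a hypothetical reduction of the content_block_start / content_block_delta / content_block_stop events to the parts that matter for tool calls, and all names are illustrative.

```haskell
import Data.List (intercalate, partition)
import qualified Data.Map.Strict as Map

data Role = System | User | Assistant deriving (Eq, Show)
data Message = Message { role :: Role, content :: String } deriving (Eq, Show)

data OpenAiTool = OpenAiTool
  { toolName :: String, toolDescription :: String, parameters :: String }
  deriving (Eq, Show)

data AnthropicTool = AnthropicTool
  { name :: String, description :: String, inputSchema :: String }
  deriving (Eq, Show)

-- Anthropic takes system text as a top-level field, so split it out.
splitSystem :: [Message] -> (Maybe String, [Message])
splitSystem msgs =
  case partition ((== System) . role) msgs of
    ([], rest)      -> (Nothing, rest)
    (systems, rest) -> (Just (intercalate "\n\n" (map content systems)), rest)

-- Mechanical rename: parameters becomes input_schema.
toAnthropicTool :: OpenAiTool -> AnthropicTool
toAnthropicTool t = AnthropicTool (toolName t) (toolDescription t) (parameters t)

data StreamEvent
  = ToolBlockStart Int String  -- block index, tool name
  | InputJsonDelta Int String  -- block index, partial JSON fragment
  | BlockStop Int
  deriving (Eq, Show)

-- Concatenate fragments per block; emit (tool name, arguments) on stop.
reassembleToolCalls :: [StreamEvent] -> [(String, String)]
reassembleToolCalls = go Map.empty
  where
    go _ [] = []
    go open (event : rest) = case event of
      ToolBlockStart i n -> go (Map.insert i (n, []) open) rest
      InputJsonDelta i fragment ->
        go (Map.adjust (fmap (fragment :)) i open) rest
      BlockStop i -> case Map.lookup i open of
        Just (n, fragments) ->
          (n, concat (reverse fragments)) : go (Map.delete i open) rest
        Nothing -> go open rest  -- stop for a non-tool block: ignore
```

Keying the open blocks by index lets interleaved tool blocks reassemble independently. Joining multiple system messages with a blank line is an assumption that happens to suit the Pillar 1 design, where the context block arrives as a second leading system message.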
0c-provider-B4 — Per-surface defaults + override.
- Same chat-provider + model fields from B1 are consulted by diff-reply + form-suggestion when assembling a request.
- Per-surface server defaults configurable via env (GITOKU_DEFAULT_CHAT_PROVIDER_FORM_SUGGESTION etc.).
- Cheap default for form suggestion (e.g. openai-direct + gpt-5-mini as today). Higher-quality default for diff-reply if the operator has a budget for it.
Done when: a [agent.formgen] in gitoku.toml with
chat-provider = "anthropic-direct" + model = "claude-haiku-4-5-20251001" causes a PR form-suggestion request
to hit api.anthropic.com, returns a tool-call result, and the
existing form-suggestion consumer renders it identically to the
OpenAI path.
Pillar 2 cross-cutting
Why no new tool-call abstraction layer. The OpenAI tool-call
shape ({ name, description, parameters: JsonSchema }) is the
strict superset all three providers can be normalized to (OpenAI +
OpenRouter accept verbatim; Anthropic differs by field name +
location only). Inventing a gitoku-native tool type would add an
abstraction with one user. Keep GPT.Function; adapt at the wire.
OpenRouter trade-off. OpenRouter is a third-party single point of failure and takes margin on every call. The plan supports it as a peer not a default — it's the right choice for BYOLLM operators who want "any model, one bill," but the managed gitoku host should prefer direct API integrations for cost + reliability.
Codex CLI is a peer, not a backend. Codex CLI is itself an agentic loop with its own model choice, prompt-routing, and tool execution loop inside the subprocess. It can't be replaced by "Codex chat completions" without losing the agentic behavior. That's why Pillar 2 splits along CLI-runner vs HTTP-chat lines — they really are different abstraction levels.
Migrations:
- prompt_pull_request_jobs.agent_runner TEXT (+ siblings on review / conflict tables): A1.
- prompt_pull_request_jobs.chat_provider TEXT + chat_model TEXT if we want to record the resolved choice on the run: optional, defer to observability needs.
Risks:
- Claude Code CLI surface: actually low risk. The headless flag set we depend on (--bare, --print, --dangerously-skip-permissions, --append-system-prompt-file, --output-format json, --model) has been stable for 12+ months. Recent churn (v2.1.155–158) is all in adjacent areas (worktrees, Bedrock auto-mode, prompt-cache hints), not our surface. Real residual risks worth gating:
    - No exit-code discipline. Failures surface in JSON (is_error == true, terminal_reason != "completed") rather than a non-zero exit. We must parse the JSON before declaring success; the existing Codex success/failure detector needs a Claude-shaped sibling.
    - No native hang detection. If something inside the loop waits for input (missing git credential, network probe), the subprocess hangs indefinitely. Wrap in timeout (or use the existing codexTimeoutSeconds knob renamed to agentTimeoutSeconds) with a generous default; same discipline already applies to Codex.
    - Sandbox bleed via .claude/settings.json in the cloned repo. The repo we just cloned for the run may ship a .claude/settings.json that would otherwise be auto-loaded. --bare --settings '{}' suppresses both host-side and repo-side settings, so the only configuration is what we pass on the CLI. Mandatory in our subprocess contract; don't drop it as a "small optimization."
    - Version probing. Pin via claude-code-nix (already in the flake), and add a claude --version probe in ensureWorkflowCommandAvailable so a Nix-side downgrade that silently lost a flag fails fast with a useful error.
- Anthropic wire-shape drift. Adapter is mechanical but brittle. Add a unit test on the request-translation function that round-trips a fixture GPT.CompletionRequest and asserts the JSON output against a checked-in golden file. No live API call needed.
- Credential confusion. Three API keys + two CLI auth envs is a lot of surface for users to manage. The settings UI should group them into a single "AI providers" tab with a clear per-provider connect/disconnect flow (extends the existing Codex / Claude tabs).
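The "Claude-shaped sibling" of the success detector mentioned under the exit-code risk might look like this. It is a sketch: ClaudeResult is a hypothetical record assumed to be decoded from the --output-format json result beforehand (for example with aeson), and treating a non-zero exit as a failure even when the JSON looks clean is a conservative assumption.

```haskell
-- Fields of interest from the JSON result, assumed already decoded.
data ClaudeResult = ClaudeResult
  { isError        :: Bool
  , terminalReason :: String
  } deriving (Eq, Show)

data RunOutcome = RunSucceeded | RunFailed String
  deriving (Eq, Show)

-- The exit code alone is not trusted: a zero exit with is_error set
-- still fails. `Nothing` models output that was not parseable JSON.
classifyClaudeRun :: Int -> Maybe ClaudeResult -> RunOutcome
classifyClaudeRun exitCode parsed =
  case parsed of
    Nothing ->
      RunFailed ("unparseable output (exit code " ++ show exitCode ++ ")")
    Just result
      | isError result -> RunFailed "is_error was true"
      | terminalReason result /= "completed" ->
          RunFailed ("terminal_reason: " ++ terminalReason result)
      | exitCode /= 0 -> RunFailed ("exit code " ++ show exitCode)
      | otherwise -> RunSucceeded
```

A timeout kill would typically arrive here as unparseable output, so the hang-detection wrapper and this classifier compose without a special case.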
Cross-cutting
Cap defaults (final stack after 0c):
| Surface | repo guidance | semantic memory | rationale |
|---|---|---|---|
| prompt-PR | 32 KB | enabled (full) | offline, agentic, large context |
| PR Review | 32 KB | enabled (full) | offline, can be heavy |
| Conflict Resolution | 16 KB | enabled (full) | offline, but big diff in prompt |
| Diff-thread reply | 4 KB | top-3 nodes | sync, latency-sensitive |
| Form suggestion | 2 KB allowlist-only | disabled | one-shot metadata |
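The cap table maps directly onto SurfaceContextOptions values. In this sketch the Surface type and surfaceOptions are hypothetical names; the node limit for "enabled (full)" is taken as a parameter because the table gives no number, and clamping the per-file cap to the smaller of 8 KB and the surface total is an assumption.

```haskell
data SurfaceContextOptions = SurfaceContextOptions
  { repoGuidanceTotalCap      :: Int
  , repoGuidancePerFileCap    :: Int
  , repoGuidanceAllowlistOnly :: Bool
  , semanticMemoryNodeLimit   :: Maybe Int
  } deriving (Eq, Show)

data Surface
  = PromptPr | PrReview | ConflictResolution | DiffThreadReply | FormSuggestion
  deriving (Eq, Show, Enum, Bounded)

kb :: Int -> Int
kb n = n * 1024

-- First argument: the node limit that "enabled (full)" should mean.
surfaceOptions :: Int -> Surface -> SurfaceContextOptions
surfaceOptions fullLimit surface = case surface of
  PromptPr           -> options (kb 32) False (Just fullLimit)
  PrReview           -> options (kb 32) False (Just fullLimit)
  ConflictResolution -> options (kb 16) False (Just fullLimit)
  DiffThreadReply    -> options (kb 4)  False (Just 3)
  FormSuggestion     -> options (kb 2)  True  Nothing
  where
    options total allowlistOnly memory = SurfaceContextOptions
      { repoGuidanceTotalCap      = total
      , repoGuidancePerFileCap    = min (kb 8) total  -- assumed clamp
      , repoGuidanceAllowlistOnly = allowlistOnly
      , semanticMemoryNodeLimit   = memory
      }
```

Keeping the presets in one total function over Surface means adding a sixth surface fails to compile (with incomplete-pattern warnings enabled) until someone picks its budget.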
Per-agent instructions (Stage 0) deferred for these surfaces.
None of the four currently invokes a named [agent.*] — there's no
[review] / [reply] / [form] agent in gitoku.toml. Adding one
is a natural Stage 0d (cosmetic — same pattern as 0 already
established), but it's orthogonal to 0c and not in this plan's
scope.
Migrations: none for the prompt-rendering code paths themselves. The only schema change is the byte-count columns described under Observability.
Observability: every surface should write the injected byte
counts (guidance_bytes, memory_bytes) to its job row, so we can
see budget impact and tune defaults. Add the columns in a single
migration as part of 0c-0.
Testing convention: pure unit tests for loadSurfaceContext with
a fixture allowlist + temp-directory fake repo. Per repo rule, no
mock DB. The integration is checked manually with the fixture-repo
done-when criteria above.
Risks:
- Silent latency regression on sync surfaces (0c-3/0c-4). Mitigation: hard byte cap enforced in loadSurfaceContext, plus the job-row byte-count column lets us spot drift.
- Prompt bloat hurts answer quality. Big prompts are not always better; for the small interactive surfaces especially, the guidance + memory injection should help, not distract. The cap table above is conservative; widen by observing the byte-count column after launch.
- Cache misses on OpenAI surfaces. Prepending content invalidates any prefix caching the upstream provider does. The 4 KB cap on 0c-3 keeps this tolerable; revisit if measured latency suffers.
Done (whole 0c):
Pillar 1. Every AI surface in the codebase, when it renders its
prompt, loads from Application.PromptContextBlock. A repo's
AGENTS.md is now the universal gitoku agent context — improving
it improves every AI behavior on the repo, not just prompt-PR runs.
Stage 0b's promise is honored end-to-end.
Pillar 2. Every AI surface routes through a runner (CLI
surfaces) or chat-provider (HTTP surfaces) dispatcher. A repo's
gitoku.toml [agent.*] can express provider + model per agent,
defaulting sensibly per surface. Tool definitions are authored once
as GPT.Function and reach all three HTTP providers (OpenAI,
OpenRouter, Anthropic) via the wire-level adapter. Codex and Claude
Code are first-class CLI peers. BYOLLM is now mechanically
possible — an enterprise self-hosting gitoku in their own cloud
can wire their own credentials for any supported provider without
touching the gitoku source.
The two pillars together replace today's Codex-only-or-OpenAI-only
surface monoculture with a coherent "pick the right tool for the
job, declared in the repo" story — the model layer of the
open-formats / managed-ops business positioning in
activegraph plan §9.
Stage 2a — Provenance → graph dual-write
Shipped (commit af443555). Implemented in Application/GraphProvenance.hs (new), wired at Application/Jobs/PromptPullRequest/Run.hs:~146 inside a withTransaction. Application/Graph.hs gained a FileRangeNode vocabulary entry; Application/GraphBackfill.hs was refactored to delegate to the same projection so historical and live data converge. Pure unit tests for the idempotency key (fileRangeSourceId) live in Test/PromptPullRequestsSpec.hs. Build green (672 → 674 modules).
Goal: every prompt-PR success that already writes commit_ai_contexts
also writes an evidence node + derived_from edges into the graph, so
the graph layer starts accruing real activity on day 1.
Detail reference: developer-features-implementation.md §4 ("Per-line AI
provenance"), "The seam (generalize to the graph)" paragraph.
Seam (single point): Application/Jobs/PromptPullRequest/Run.hs:146 —
where replaceCommitAiContexts is called. Emit the graph node + edges in
the same transaction, right next to the existing call.
Sketch:

    -- After the existing replaceCommitAiContexts call (line ~146):
    -- For each commitAiContext row written, project it into the graph:
    forM_ writtenContexts \ctx -> do
        evidenceNode <- Graph.upsertNode repository GraphNodeEvidence
            ("commit_ai_context:" <> UUID.toText ctx.id)
            (encodeEvidencePayload ctx)
        -- Each (file, lineSpan) in ctx.modifiedLineSpansJson becomes an edge:
        forM_ (parsedSpans ctx.modifiedLineSpansJson) \(filePath, lineRange) -> do
            targetNode <- Graph.upsertNode repository GraphNodeFileRange
                ("file_range:" <> commitSha <> ":" <> filePath <> ":" <> tshow lineRange)
                (encodeFileRangePayload commitSha filePath lineRange)
            _ <- Graph.upsertEdge repository GraphRelationDerivedFrom
                evidenceNode targetNode mempty
            pure ()

Graph.upsertNode / Graph.upsertEdge are the storage helpers added in Stage 1b (Application/Graph.hs). Both are idempotent on (repository_id, source_kind, source_id) / (src_node_id, dst_node_id, relation).
Acceptance:
- A fresh prompt-PR run leaves N rows in commit_ai_contexts AND N+M rows in graph_nodes (1 evidence node per context + 1 file-range node per span) AND M rows in graph_edges (derived_from per span).
- Re-running the same prompt-PR job (or re-applying its commit) is idempotent: no duplicate graph nodes/edges.
- Application.GraphBackfill (Stage 1b) already covers the historical projection; this PR only adds the live-write side.
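The N+M count and the idempotency claim can be checked against a small in-memory model of the projection. This is not the shipped Application/GraphProvenance.hs: the Context and Graph types are simplified stand-ins, sets keyed by source id play the role of the unique constraints, and the "start-end" rendering of a line range inside the source id is an assumed format.

```haskell
import qualified Data.Set as Set

-- Simplified stand-in for a commit_ai_contexts row; ranges are (start, end).
data Context = Context
  { contextId :: String
  , spans     :: [(FilePath, (Int, Int))]
  }

data Graph = Graph
  { nodes :: Set.Set String            -- keyed by source id
  , edges :: Set.Set (String, String)  -- (evidence id, file-range id)
  } deriving (Eq, Show)

emptyGraph :: Graph
emptyGraph = Graph Set.empty Set.empty

-- Assumed key format; the real one is whatever fileRangeSourceId produces.
fileRangeSourceId :: String -> FilePath -> (Int, Int) -> String
fileRangeSourceId commitSha path (start, end) =
  "file_range:" ++ commitSha ++ ":" ++ path ++ ":" ++ show start ++ "-" ++ show end

-- One evidence node per context, one file-range node and edge per span.
projectContext :: String -> Graph -> Context -> Graph
projectContext commitSha graph ctx = foldl addSpan withEvidence (spans ctx)
  where
    evidenceId = "commit_ai_context:" ++ contextId ctx
    withEvidence = graph { nodes = Set.insert evidenceId (nodes graph) }
    addSpan g (path, range) =
      let target = fileRangeSourceId commitSha path range
      in g { nodes = Set.insert target (nodes g)
           , edges = Set.insert (evidenceId, target) (edges g)
           }
```

One consequence the model makes visible: if two contexts in the same commit touch an identical span, they share a single file-range node, so graph_nodes can come out below N+M while graph_edges still has M rows. A test fixture asserting exact counts should use distinct spans or account for that.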
Tests: add to Test/PromptPullRequestsSpec.hs — a unit test that runs
the dual-write helper against a fixture context and asserts the resulting
graph state. Hspec, ORM only (no sqlQuery).
Risk: transaction boundary. The graph emit must be in the same
transaction as replaceCommitAiContexts so a graph-write failure doesn't
leave the provenance row orphaned. Application.CommitAiContexts.storeCommitAiContextBatch
already uses withTransaction; add the graph emit inside the same block.
Diff size: ~50–80 lines in Application/Jobs/PromptPullRequest/Run.hs
(or extract to Application/Graph/Provenance.hs if the inline block grows
past 30 lines), ~30 lines of test.
Stage 2b — Behavior registry / dispatcher
Shipped (commit 1367e19b). Application/Behaviors.hs (new) holds the RepositoryChangeEvent ADT, dispatchEvent, registeredBehaviors list, and the four behavior functions (branchWorkflowBehavior, pullRequestWorkflowBehavior, pullRequestReviewBehavior, pullRequestConflictBehavior), each one the verbatim inline block from the old trigger function lifted into a named top-level binding. BranchUpdated joined the GraphEventType vocabulary in Application/Graph.hs. Application/RepositoryWorkflowTriggers.hs collapsed to a thin dispatcher wrapper that pre-fetches open PRs once and dispatches one event per changed branch. Pure unit tests in Test/BehaviorsSpec.hs (wired into Test/Main.hs). Build green.
Goal: replace the hardcoded body of syncRepositoryWorkflowTriggers
with a generic dispatcher. Same observable behavior on day 1; reactive
substrate is now data-driven.
Detail reference: activegraph-continuity-plan.md Phase 2; mirror in
developer-features-implementation.md §3 ("The seam" paragraph).
Seam (single function body swap):
Application/RepositoryWorkflowTriggers.hs:71-96 — syncRepositoryWorkflowTriggers.
Its current body has four inlined enqueue* calls (branch workflow runs,
PR workflow runs, PR review jobs, PR conflict-resolution jobs).
Sketch:

    -- New module Application/Behaviors.hs:
    data Behavior = Behavior
        { behaviorName    :: Text
        , behaviorMatches :: GraphEvent -> Repository -> IO Bool
        , behaviorRun     :: GraphEvent -> Repository -> IO ()
        }

    registeredBehaviors :: [Behavior]
    registeredBehaviors =
        [ branchWorkflowBehavior       -- replaces the :77 inline
        , pullRequestWorkflowBehavior  -- replaces :88
        , pullRequestReviewBehavior    -- replaces :89
        , pullRequestConflictBehavior  -- replaces :94
        ]

    dispatch :: Repository -> GraphEvent -> IO ()
    dispatch repo event = do
        matching <- filterM (\b -> behaviorMatches b event repo) registeredBehaviors
        forM_ matching \b -> behaviorRun b event repo

    -- syncRepositoryWorkflowTriggers becomes
    -- (actor, before, and after are not bound in this sketch):
    syncRepositoryWorkflowTriggers repo changes = do
        forM_ changes \change -> do
            event <- Graph.recordEvent repo Nothing Nothing
                (graphEventTypeFromChange change) actor before after Nothing
            dispatch repo event

The four existing inlined enqueues lift verbatim into the four behaviorRun
bodies. Each behavior's behaviorMatches predicate captures what the
inline case/when checks did before. enqueuePullRequestReviewJob's
own gating (open + AI-reviews-enabled + dedup) stays inside it unchanged.
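The dispatcher contract can be exercised without a database. The sketch below is a reduced model, not the shipped module: Event is a hypothetical two-constructor stand-in for GraphEvent, the Repository argument is dropped, and the recording helper is a test double that appends a name to an IORef instead of enqueueing a job.

```haskell
import Control.Monad (filterM, forM_)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- Hypothetical, simplified event type; the plan's sketch uses GraphEvent.
data Event = BranchUpdated String | PullRequestUpdated Int
  deriving (Eq, Show)

data Behavior = Behavior
  { behaviorName    :: String
  , behaviorMatches :: Event -> IO Bool
  , behaviorRun     :: Event -> IO ()
  }

-- Run every behavior whose matcher accepts the event, in registration order.
dispatch :: [Behavior] -> Event -> IO ()
dispatch behaviors event = do
  matching <- filterM (\b -> behaviorMatches b event) behaviors
  forM_ matching (\b -> behaviorRun b event)

-- Test double: records its own name instead of enqueueing a job.
recording :: IORef [String] -> String -> (Event -> Bool) -> Behavior
recording logRef name matches = Behavior
  { behaviorName    = name
  , behaviorMatches = pure . matches
  , behaviorRun     = \_ -> modifyIORef' logRef (++ [name])
  }

isBranch, isPullRequest :: Event -> Bool
isBranch (BranchUpdated _) = True
isBranch _                 = False
isPullRequest (PullRequestUpdated _) = True
isPullRequest _                      = False
```

Registering a new behavior in this model is one more list entry; dispatch itself never changes, which is the property the acceptance criteria ask for.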
Acceptance:
- Every existing reactive path (push → branch workflow runs; PR compare-branch moves → review jobs + workflow runs; base-branch moves → conflict-resolution) continues firing with identical timing and gating.
- Each fire now leaves a row in graph_events (the bus also emits an audit trail).
- Adding a new reactive behavior is a one-line registration plus a matcher + runner, with no edits to syncRepositoryWorkflowTriggers.
Tests: add to Test/WorkflowsSpec.hs (or new Test/BehaviorsSpec.hs).
- A unit test per behavior: feed a fixture GraphEvent, assert the right enqueue primitive was called.
- A regression test: a synthetic push event leaves the same job-table rows it did before (compare against a fixture).
Risk: behavioral parity. The four enqueue calls have subtle preconditions (e.g. only fire for moved branches, not new ones). Each matcher must reproduce its inline equivalent exactly. Strategy: extract the inline conditions into named predicates before swapping the dispatcher, so the diff is "rename + relocate" not "rewrite". Land that prep as its own small commit if helpful.
Diff size: ~150–250 lines net (new Application/Behaviors.hs ~100,
trimmed Application/RepositoryWorkflowTriggers.hs ~–30, new test file
~80). One new test fixture file.
Stage 2c — Agent triggers (agents-as-behaviors)
Shipped (commit 84db920e). agentTriggerBehavior added as the 5th entry in registeredBehaviors in Application/Behaviors.hs. It reads gitoku.toml from the changed branch, filters via the new agentMatchesEvent helper in Application/Workflows.hs (sibling of workflowMatchesEvent), and fires one prompt-PR launch per match. The launch path in Application/PromptPullRequests.hs was split: the new launchPromptPullRequestJobCore runs in any context (worker or controller) and skips reviewer notifications; the existing launchPromptPullRequestJob is now a thin wrapper that adds the notification step where a ControllerContext is available. Triggered runs are attributed to the organization owner.
Goal: the events / branches declarations on per-repo [agent.*]
manifests (shipped in Stage 0) become behavior configs. A push that matches
an agent's events list routes through the 2b dispatcher and enqueues a
prompt-PR job for that agent.
Detail reference: developer-features-implementation.md §5 "Graph +
reactive wiring" paragraph; activegraph-continuity-plan.md Phase 2.
Seam: Stage 0 already loads agent manifests via loadAgentManifestForBranch
(parallel to loadWorkflowManifestForBranch:209 in Application/Workflows.hs).
Add an agent behavior to the 2b registry:

    agentBehavior :: GitokuAgent -> Behavior
    agentBehavior agent = Behavior
        { behaviorName    = "agent:" <> agent.key
        , behaviorMatches = \event _repo -> pure (agentMatchesEvent agent event)
        , behaviorRun     = \event repo -> enqueueAgentPromptJob agent event repo
        }

agentMatchesEvent reuses the existing workflowMatchesEvent shape (it's
the same vocabulary). enqueueAgentPromptJob is a thin wrapper over the
prompt-PR job builder (already shipped in Stage 0), preloading the agent's
prompt/model/instructions.
The registry registers all agents at dispatch time (per-repo lookup), so
adding a new [agent.*] to gitoku.toml is a no-restart change.
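A minimal sketch of the matching rule, assuming an agent wakes when the event kind is in its events list and, if a branches filter is present, the event's branch is in that list. The Agent and Event records are hypothetical simplifications of GitokuAgent and the graph event, and treating an empty branches list as "no filter" is an assumption, not something the plan specifies.

```haskell
-- Hypothetical, simplified shapes; the real types live in the codebase.
data Agent = Agent
  { agentKey      :: String
  , agentEvents   :: [String]  -- e.g. ["push", "pull_request"]
  , agentBranches :: [String]  -- empty list = no branch filter (assumption)
  }

data Event = Event
  { eventKind   :: String  -- "push", "pull_request", ...
  , eventBranch :: String  -- pushed branch, or the PR's target branch
  }

-- An agent wakes only if it subscribed to the event kind and, when a branch
-- filter is present, the event's branch is listed in it.
agentMatchesEvent :: Agent -> Event -> Bool
agentMatchesEvent agent event =
  eventKind event `elem` agentEvents agent
    && (null (agentBranches agent) || eventBranch event `elem` agentBranches agent)

-- Per-dispatch lookup: keys of the manifest's agents that should run.
matchingAgents :: [Agent] -> Event -> [String]
matchingAgents agents event =
  [agentKey a | a <- agents, agentMatchesEvent a event]
```

Because matchingAgents takes the agent list as an argument on every call, re-reading the manifest per dispatch is enough to pick up a newly added or disabled agent.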
Acceptance:
- An agent declared with events = ["pull_request"] and a branches = ["main"] filter wakes when a PR targeting main is opened.
- The resulting prompt-PR job records the originating agent_key (Stage 0 already added that column).
- Disabling the trigger (removing the events line in gitoku.toml) stops the wake immediately on next sync.
Tests: Playwright in test_playwright/repositories.spec.ts — extend
the existing "named agents" spec with: configure events = ["pull_request"],
open a PR, assert a new prompt-PR job appears within N seconds.
Diff size: ~50–100 lines glue (new agentBehavior + reuse of existing
helpers).
Stage 2d — Thread-aware commit AI context capture (planned, not started)
Goal: each commit's stored AI context = the full session thread up to that commit's timestamp, not just a heuristic window of "recent prompts." That makes the diff between two consecutive commits' contexts a faithful record of what was discussed between those commits — which is the model agents and humans both intuitively expect from a "rewind a session" view.
Why it matters now. Stage 2a (provenance dual-write) projects every
commit_ai_contexts row into an evidence node in the graph; Stage 5b
(semantic memory injection) reads those evidence nodes back as
recallable context on future prompts. Both stages assume the captured
contexts faithfully represent the work that produced the commit. Today
they don't: rgh ships only "recent prompts within a client-side
heuristic window" per commit, so the projection has lossy input. Stage
2d fixes the input.
Current state (verified against production data, 2026-06-03)
Each commit_ai_contexts row has prompt, thinking, session_id,
source_family, capture_timestamp, capture_index. rgh's
discoverExactCommitCapture
(rgh/src/Rgh/Capture.hs:377) walks recent Codex / Claude Code / Cursor
session files, filters by exactPatchCaptureWithinTimestamp against
the pre-commit timestamp, and posts the surviving prompts to
POST /CreateApiCommitAiContextBatch. The current "window" is
implicit — controlled by which session files are considered "recent"
and what apply-patch evidence they carry. Sample from
commit_ai_contexts on the deployed instance: same session yields 4
prompts on one commit, 1 prompt on another an hour later, even
though dozens of messages were exchanged in between.
Pillars
2d-rgh — Full-session capture client-side
New entry point in rgh/src/Rgh/Capture.hs alongside the existing
discoverExactCommitCapture:

    discoverFullSessionCommitCapture
        :: FilePath        -- repositoryRoot
        -> UTCTime         -- preCommitTimestamp (upper bound)
        -> Maybe UTCTime   -- previousCommitTimestamp (lower bound, optional)
        -> IO CommitCapture

Behavior:
- For every agent session that was active in the window (e.g. that touched any file in the staged diff), read the WHOLE session log up to preCommitTimestamp.
- Bound by previousCommitTimestamp when known (from git log -1 --format=%cI HEAD~1): only include prompts AFTER the prior commit, so each commit's slice is delta-shaped rather than cumulative. Fall back to cumulative when there's no prior commit (initial commit, root of a branch).
- Bypass the exactPatchCaptureWithinTimestamp filter; that heuristic is what currently drops prompts.
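The delta window described in the list above reduces to a half-open interval filter. The sketch below is illustrative only: Message is a hypothetical record, timestamps are plain epoch seconds rather than UTCTime, and the boundary choice (strictly after the previous commit, up to and including the pre-commit timestamp) is an assumption read off the wording above.

```haskell
-- Hypothetical simplification: timestamps are epoch seconds.
data Message = Message
  { sentAt :: Integer
  , body   :: String
  } deriving (Eq, Show)

-- Keep messages in (previousCommit, preCommit]. With no previous commit the
-- lower bound is dropped and the slice is cumulative.
deltaSlice :: Maybe Integer -> Integer -> [Message] -> [Message]
deltaSlice previousCommit preCommit = filter inWindow
  where
    afterPrevious m = maybe True (\lower -> sentAt m > lower) previousCommit
    inWindow m      = afterPrevious m && sentAt m <= preCommit
```

Two consecutive commits then partition the session: a message lands in exactly one commit's slice, which is what makes the per-commit contexts diffable.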
Wire it through the post-commit hook
(rgh/src/Rgh/Hooks.hs) so the default capture is the full
delta-shaped slice. The existing exact-patch-aware capture stays as a
fallback / opt-in mode for users on tight quotas.
2d-server — Storage + dedup
The existing schema already supports multiple prompts per commit
((repository_id, commit_sha, capture_index) UNIQUE). For v1, no
schema change: rgh sends N prompts per commit, server stores N rows.
Storage grows linearly with session length, which is the price of
fidelity.
Add a small optimization: when computing the graph projection in
Application/GraphProvenance.hs, dedupe evidence nodes by
(prompt_hash, session_id) so the same prompt that appears in
multiple commits' captures (under the delta model that shouldn't
happen, but under the fallback cumulative model it can) only becomes
one graph node with multiple derived_from edges. Add a
promptHash :: Text column to commit_ai_contexts for the dedup key;
SHA-256 of the normalized prompt body, computed server-side at insert
time.
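The dedup rule can be modelled as a map from the dedup key to the commits that captured the prompt: one key means one evidence node, and each commit in the value list means one derived_from edge. In this sketch the normalized prompt text itself stands in for the SHA-256 prompt_hash (no hashing library is pulled in), and the normalization (collapse whitespace runs) is an assumption, since the plan does not define it.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical, simplified capture row.
data Capture = Capture
  { commitSha :: String
  , sessionId :: String
  , prompt    :: String
  }

-- Assumed normalization: trim and collapse runs of whitespace.
normalize :: String -> String
normalize = unwords . words

-- Stand-in for (prompt_hash, session_id); a real implementation would hash
-- the normalized body with SHA-256 instead of using the text directly.
promptKey :: Capture -> (String, String)
promptKey c = (normalize (prompt c), sessionId c)

-- One map entry per evidence node; the value lists every commit that
-- captured the prompt (one derived_from edge each), in input order.
dedupeEvidence :: [Capture] -> Map.Map (String, String) [String]
dedupeEvidence = foldr add Map.empty
  where
    add c = Map.insertWith (++) (promptKey c) [commitSha c]
```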
Migration: forward-only ALTER TABLE commit_ai_contexts ADD COLUMN IF NOT EXISTS prompt_hash TEXT NOT NULL DEFAULT '' (IF NOT EXISTS per the migration convention under Cross-cutting), plus a backfill of existing rows in the
same migration (small dataset on day 1).
2d-ui — Session timeline view
The existing per-commit view (Commit.hs renderCommitAiContextCards)
keeps working — under the delta model, each commit now shows only the
prompts exchanged for that commit, sorted chronologically.
Add a NEW view: session timeline. Stitches contexts across
commits in a single chronological scroll, anchored at each commit
boundary. Path: /lazare/<repo>/sessions/<session_id>. Reuses the
existing card renderer; adds dividers labelled "→ commit
<short-sha> (<title>)" between groups of prompts.
This is what makes the delta capture model legible: per-commit view = "what was discussed here", session view = "what was the agent thinking end-to-end". Both come for free once the underlying capture is delta-shaped.
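A sketch of the stitching step, under two assumptions: contexts for one commit are contiguous in time (which the delta model implies), and each divider is placed after the prompts that led to its commit. Context and TimelineItem are hypothetical names; the real view would reuse the existing card renderer rather than these placeholder constructors.

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- Hypothetical, simplified view of a commit_ai_contexts row.
data Context = Context
  { captureTimestamp :: Integer
  , commitShortSha   :: String
  , commitTitle      :: String
  , promptText       :: String
  }

data TimelineItem = Divider String | Card String
  deriving (Eq, Show)

-- Sort chronologically, group adjacent contexts by commit, and close each
-- group with a divider naming the commit those prompts produced.
sessionTimeline :: [Context] -> [TimelineItem]
sessionTimeline contexts = concatMap renderGroup groups
  where
    ordered = sortOn captureTimestamp contexts
    groups  = groupBy ((==) `on` commitShortSha) ordered
    renderGroup [] = []
    renderGroup grp@(c : _) =
      map (Card . promptText) grp
        ++ [Divider ("→ commit " ++ commitShortSha c ++ " (" ++ commitTitle c ++ ")")]
```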
Cross-cutting
Storage budget: A 4-hour active session might produce 80–200 prompts. Even at 200 rows per commit-burst, the table grows by ~2 MB per commit at most (200 rows × 10 KB avg prompt body). Acceptable on the self-hosted side; worth a per-org budget knob (out of scope for v1) on the multi-tenant side.
Privacy: full-session capture means previously-implicit
private chatter (debugging logs, throwaway thoughts) now lands in the
repo's gitoku DB. Document in AGENTS.md so users know the capture
model; offer a per-session opt-out flag on rgh capture for the
quasi-private case.
Acceptance:
- After 2d-rgh ships: a commit made during an active 30-message Claude Code session uploads ~30 rows (the delta since the prior commit), not 1.
- After 2d-ui ships: git.lazare.ai/lazare/gitoku/sessions/<id> reconstructs the full session chronologically across all commits it spanned.
- After 2d-server's dedup ships: any prompt-hash appears in at most ONE evidence node in the graph, with edges to every commit that captured it.
Tests: unit test in rgh/test/Test/CaptureSpec.hs for
discoverFullSessionCommitCapture against a fixture session log;
golden snapshot of the resulting CommitCapture. Server-side, an
Hspec test that POSTs a multi-prompt batch and asserts row count
matches input.
Diff size: ~200–400 lines rgh-side (capture + hook wiring + test), ~100–150 lines server-side (migration + dedup helper), ~100 lines view (session timeline page + sidebar entry). One small migration.
Stage 5a — Trace-as-product UI
Shipped (commit 079634c1). Added a new Activity tab to every repository page. Application/PageTabs.hs got RepositoryActivityTab (text code + parser), Web/View/Repositories/Show.hs got the pane id, tab nav link, RepositoryActivityView record, its View instance, and renderRepositoryActivityPageContent: a table with one row per graph_events row (when / event_type / actor / subject / payload), JSON payloads tucked behind a <details> for scan compactness, zero JS. Web/Controller/Repositories.hs got the dispatch entry + buildRepositoryActivityView, which fetches via Graph.fetchRecentEvents (cap: 100 rows, no pagination in v1). 676 modules total; full forced rebuild passed.
Goal: a viewer for the graph_events log, scoped per repository.
Independent of every other stage; only needs Stage 1b ✅.
Why early: ships the first visible payoff of the graph layer without waiting on 2a/2b. After 2a lands, the same UI shows provenance events too.
Seam: new controller + view. Suggested path:
- Web/Controller/RepositoryGraph.hs: new RepositoryGraphEventsAction that fetches the most recent N events for a repo, joined with their optional node/edge subjects.
- Web/View/Repositories/GraphEvents.hs: table with columns time / event_type / actor / subject (node or edge) / payload preview.
- Route + tab entry in the existing Repositories page tabs (Web/Controller/Repositories.hs near RepositoryTasksTab).
Sketch: uses Graph.fetchRecentEvents (the helper added in Stage 1b
at Application/Graph.hs) directly. No new SQL.
Acceptance:
- A repo with any history shows recent events (PR opened, review fired, etc.) ordered by created_at DESC.
- Clicking a row expands the before_json / after_json payload.
- Autorefresh works (use the same autoRefresh wrapper as the Tasks tab).
Tests: Playwright spec — open the new tab on a fixture repo, assert expected events render.
Diff size: ~200–400 lines (one controller + one view + one route entry
+ one tab + Playwright). Mostly view code.
Off-critical-path note: This is the cleanest "hand to a parallel agent" work in the plan. Could be done while 2a/2b are in flight.
Stage 5b — Semantic memory injection
Shipped (commit 5ac4719d). New Application/GraphMemory.hs exposes loadSemanticMemoryBlock :: Repository -> Int -> IO Text, which fetches the most recent N memory-flavored graph nodes (claim / decision / failure / evidence, the ones Stage 2a has been writing) and renders them as a bounded text block (default 20 nodes, 200 chars/line summary cap). Two prompt-render seams updated to insert the block: the in-process path in Application/Jobs/PromptPullRequest/Run.hs (its own renderCodexPrompt now takes the memory Text) and the subprocess path via a new GITOKU_PROMPT_PULL_REQUEST_SEMANTIC_MEMORY env var threaded through promptRunnerEnvironmentVariables → PromptRunnerEnvironment → Application/IsolatedExecution/Runner.hs's renderCodexPrompt. The new semanticMemorySection helper is shared between both paths. Empty memory → empty section so first-run prompts don't carry a stray header. No FTS yet: recent-first only.
Goal: at prompt-render time, look up the relevant graph subgraph for this repo + task and inject it into the Codex prompt as "what was already tried/decided/rejected". Stops re-litigation across runs.
Detail reference: developer-features-implementation.md §2 "Semantic
memory" paragraph.
Seam: Application/IsolatedExecution/Runner.hs — renderCodexPrompt
(approximate line 1465; verify at implementation time). Stage 0 already
prepends agent instructions here; this stage adds a second prepend block
for the subgraph.
Sketch:

    -- In renderCodexPrompt, after the agent-instructions prepend:
    memoryBlock <- Graph.readSemanticMemory repository
        promptPullRequestJob.baseBranch
        promptPullRequestJob.prompt
    let prompt = agentInstructionsBlock
            <> renderMemoryBlock memoryBlock
            <> userTaskBlock

Graph.readSemanticMemory is a new function in Application/Graph.hs
(call it Application/Graph/Memory.hs if it grows past ~80 lines). It
queries graph_nodes filtered by:
- repository_id = repo.id
- node_type IN ('claim', 'decision', 'failure', 'evidence')
- (optional) FTS match against the prompt text
- ordered by created_at DESC
- limited to N rows (config-driven; default 20).
The resulting subgraph is rendered as a structured block at the top of the prompt (Markdown table or list). Format follows the same shape Codex already expects from agent instructions.
Acceptance:
- A second run of the same prompt on the same repo includes the first run's evidence nodes in the prompt.
- The injected block is bounded (no more than N entries, no more than K characters total).
- With no relevant memory, the block degenerates to nothing (not a header with empty body).
Tests: unit test for readSemanticMemory (Hspec, against a fixture
graph). Manual smoke test for prompt rendering.
Risk: prompt token budget. Memory blocks must be hard-capped or they crowd out the user's task. Cap is a config knob.
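The selection and capping rules can be stated as two pure functions. This is a sketch under assumptions: MemoryNode is a hypothetical projection of a graph_nodes row, the cap is applied per entry (node count and characters per summary) rather than to the block's total size, and the header text is invented for illustration.

```haskell
import Data.List (sortOn)
import Data.Ord (Down (..))

-- Hypothetical projection of a graph_nodes row.
data MemoryNode = MemoryNode
  { nodeType  :: String
  , createdAt :: Integer
  , summary   :: String
  }

memoryTypes :: [String]
memoryTypes = ["claim", "decision", "failure", "evidence"]

-- Newest first, at most maxNodes entries, each summary cut to maxChars.
selectMemory :: Int -> Int -> [MemoryNode] -> [String]
selectMemory maxNodes maxChars =
  map render . take maxNodes . sortOn (Down . createdAt) . filter relevant
  where
    relevant n = nodeType n `elem` memoryTypes
    render n   = "- [" ++ nodeType n ++ "] " ++ take maxChars (summary n)

-- Empty memory renders as the empty string, so the prompt carries no
-- header with an empty body. The header wording is hypothetical.
renderMemoryBlock :: [String] -> String
renderMemoryBlock []      = ""
renderMemoryBlock entries = unlines ("## Prior attempts and decisions" : entries)
```

Keeping both caps as arguments matches the plan's intent that the budget is a config knob rather than a constant.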
Diff size: ~80–150 lines (new memory reader + prompt-render integration
+ test fixture).
Stage 5c — Propose/accept workflow for agent graph edits
Shipped (commit 56686c02). No schema migration: proposals live as ProposalNode rows in graph_nodes, with status flipping between pending → accepted | rejected. Application/Graph.hs got the new node type plus Proposes and AcceptedFrom relations. Application/GraphProposals.hs (new) is the state machine: submitProposal (agent-facing, idempotent via optional source key), acceptProposal (creates the real target node + an AcceptedFrom edge back to the proposal, all in one transaction), rejectProposal, fetchPendingProposals. Web/Types.hs got ShowRepositoryActivityAction + RepositoryGraphProposalAcceptAction / RepositoryGraphProposalRejectAction (POST). Handlers in Web/Controller/Repositories.hs are gated on canWriteRepository and redirect back to the Activity tab. The Activity view now threads pendingProposals and renders a "Pending proposals" section with Accept/Reject buttons above the events table (buttons gated on canWrite).
Goal: agents can propose graph edits (new claims, decisions, edge updates) that land in a pending state; humans accept/reject from the trace UI (5a). Stops the graph from being write-only.
Detail reference: activegraph-continuity-plan.md Phase 3.
Seam: new node type proposal and edge type proposes. A proposal
node carries the would-be node/edge payload; on acceptance, the dispatcher
creates the real node/edge and links it via accepted_from. On rejection,
the proposal stays for audit (with status = rejected).
Sketch (data model):

    -- Reuse graph_nodes, no new table:
    --   node_type = 'proposal'
    --   status    = 'pending' | 'accepted' | 'rejected'
    --   payload   = JSONB { proposed_node_type, proposed_payload, ... }
    -- New edges:
    --   relation = 'proposes'       -- proposal → would-be target
    --   relation = 'accepted_from'  -- new real node ← proposal

Endpoints:
- POST /graph/proposals (used by agents): creates a proposal row.
- POST /graph/proposals/:id/accept (used by humans via 5a UI): flips status and creates the real node/edge.
- POST /graph/proposals/:id/reject: flips status only.
Acceptance:
- An agent run can emit a proposal during its trace.
- The proposal shows up in the 5a UI with accept/reject buttons (gated on canWriteRepository).
- Acceptance creates the real graph node + an accepted_from edge in one transaction; rejection leaves the proposal with status = 'rejected'.
- Re-running the same proposal (idempotency key) is a no-op.
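The state transitions in the criteria above can be modelled purely. This sketch is not the shipped Application/GraphProposals.hs: it keeps proposals in a Map keyed by a hypothetical idempotency key, omits the creation of the real target node and its accepted_from edge, and assumes that only pending proposals may be decided.

```haskell
import qualified Data.Map.Strict as Map

data Status = Pending | Accepted | Rejected
  deriving (Eq, Show)

data Proposal = Proposal
  { status  :: Status
  , payload :: String
  } deriving (Eq, Show)

-- Keyed by idempotency key (hypothetical format, e.g. "run1:0").
type Proposals = Map.Map String Proposal

-- Submitting under an existing key is a no-op: the stored proposal wins.
submitProposal :: String -> String -> Proposals -> Proposals
submitProposal key body =
  Map.insertWith (\_new old -> old) key (Proposal Pending body)

-- Only pending proposals can be decided; rejected ones stay for audit.
decide :: Status -> String -> Proposals -> Either String Proposals
decide newStatus key proposals =
  case Map.lookup key proposals of
    Nothing -> Left "unknown proposal"
    Just p
      | status p == Pending ->
          Right (Map.insert key (p { status = newStatus }) proposals)
      | otherwise -> Left "proposal already decided"

acceptProposal, rejectProposal :: String -> Proposals -> Either String Proposals
acceptProposal = decide Accepted
rejectProposal = decide Rejected
```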
Tests: unit tests for the accept/reject state transitions; Playwright for the UI flow.
Diff size: ~250–400 lines (controller + service + UI integration in the 5a view + tests).
Stage 4b-* — workspace_snapshot graph node + capture hook (metadata only)
Shipped (commit d2b036e8). WorkspaceSnapshotNode added to the GraphNodeType vocabulary; recordWorkspaceSnapshot + workspaceSnapshotSourceId added to Application/GraphProvenance.hs; wired into the same withTransaction as the 2a dual-write at Application/Jobs/PromptPullRequest/Run.hs:~146. Backend field reflects the actual sandbox mode (sandboxAgentRunMode); fs_ref stays null until Stage 4a populates it. Pure unit tests for the source-id stability in Test/PromptPullRequestsSpec.hs.
Goal: add the graph-side model for workspace snapshots and a hook at prompt-PR success time, so the data shape is in place before Stage 4a provides real ZFS bytes.
Detail reference: nixos-sandbox-fleet-plan.md §4 "Snapshot capture
and fork" — but only the metadata half.
Seam: Application/Jobs/PromptPullRequest/Run.hs:146 (same location
as 2a). After a successful run, emit a workspace_snapshot node:

    -- After 2a's dual-write, in the same transaction:
    _ <- Graph.upsertNode repository GraphNodeWorkspaceSnapshot
        ("snapshot:" <> commitSha)
        (Aeson.object
            [ "commit_sha" .= commitSha
            , "branch"     .= compareBranch
            , "backend"    .= ("local-runner" :: Text)
            , "fs_ref"     .= Aeson.Null  -- populated by 4a later
            ])

When 4a ships, runOnLocalRunner / runOnMicroVm populate fs_ref with
the actual ZFS dataset path. Until then, fs_ref stays null and
"snapshots" are conceptually branches only.
Acceptance:
- Every successful prompt-PR leaves a workspace_snapshot node with fs_ref = null.
- The 5a trace UI can list snapshots (filter node_type = 'workspace_snapshot').
- Future Stage 4a / 4b work updates fs_ref in the same row (idempotent upsert on (repository_id, source_kind, source_id)).
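The "same row" guarantee follows from keying the upsert on (repository_id, source_kind, source_id). A minimal model, with assumptions labelled: Snapshot is a hypothetical record, the source_kind value "workspace_snapshot" is a guess at what the real key holds, and the dataset path used in the usage note is invented.

```haskell
import qualified Data.Map.Strict as Map

data Snapshot = Snapshot
  { commitSha :: String
  , backend   :: String
  , fsRef     :: Maybe String  -- Nothing until Stage 4a supplies a dataset path
  } deriving (Eq, Show)

-- (repository_id, source_kind, source_id)
type Key = (String, String, String)

-- Assumed key layout; the source id mirrors the "snapshot:<sha>" sketch.
snapshotKey :: String -> String -> Key
snapshotKey repoId sha = (repoId, "workspace_snapshot", "snapshot:" ++ sha)

-- Upsert: the first write creates the row, later writes replace it in place.
upsertSnapshot :: String -> Snapshot -> Map.Map Key Snapshot -> Map.Map Key Snapshot
upsertSnapshot repoId snap =
  Map.insert (snapshotKey repoId (commitSha snap)) snap
```

Writing the metadata-only snapshot first and a later one with fsRef = Just "tank/ws/abc123" for the same commit leaves a single row whose fs_ref is populated.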
Tests: extend the 2a test fixture — assert the snapshot node is created alongside the provenance evidence nodes.
Diff size: ~40 lines added to the same seam as 2a.
Stage 5d-* — Branch-based self-improvement loop (no warm fs)
Goal: the gitoku-bootstrapping-gitoku story: run the same prompt 3 times via the fork mechanism (Stage 3 ✅), score the resulting diffs, and promote the best as the merged PR. Branch-only — the warm-fs version waits on Stage 4b-full.
Detail reference: activegraph-continuity-plan.md Phase 4
("self-improvement"); developer-features-implementation.md §1's headline
(but warm-fs caveat).
Seam: new job type SelfImprovePromptJob that:
- Creates the original prompt-PR run (existing flow).
- On success, forks it K times via forkPromptPullRequestJob (Stage 3 ✅) with the same prompt + a "variation seed" injected in the agent instructions.
- Waits for all K forks to finish.
- Runs a scoring step: diff each result against the base, compute a score (test pass count, line count, agent's own confidence, etc.).
- Emits a score graph node per fork (new edge scored_as).
- Promotes the highest-scoring fork by updating its PR to "ready for review" and closing the losers as superseded.
Acceptance:
- A self-improve invocation creates K+1 PRs (1 original + K forks).
- After scoring, exactly one PR is ready-for-review; the rest are closed with a superseded_by link.
- The trace UI shows the scoring decisions (graph edges between forks and the winner).
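The promotion rule in the criteria above can be sketched as a pure function over the candidate PRs. The rubric here is an assumption picked for illustration (more passing tests wins, ties broken by the smaller diff); the plan leaves the actual rubric open. Fork and Outcome are hypothetical types, PR numbers are assumed unique, and whether the original run competes with its forks is also an assumption.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Hypothetical summary of one candidate PR after its run finished.
data Fork = Fork
  { forkPr      :: Int
  , testsPassed :: Int
  , diffLines   :: Int
  } deriving (Eq, Show)

data Outcome = ReadyForReview | SupersededBy Int
  deriving (Eq, Show)

-- Assumed rubric: more passing tests wins; ties go to the smaller diff.
score :: Fork -> (Int, Int)
score f = (testsPassed f, negate (diffLines f))

-- Exactly one candidate is promoted; every other one links to the winner.
promote :: [Fork] -> [(Int, Outcome)]
promote []    = []
promote forks = map outcome forks
  where
    winner = forkPr (maximumBy (comparing score) forks)
    outcome f
      | forkPr f == winner = (forkPr f, ReadyForReview)
      | otherwise          = (forkPr f, SupersededBy winner)
```

Swapping in a different rubric only means replacing score, which fits the plan's note that the remaining rubrics stay pluggable.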
What's missing without 4a/4b-full: the forks all start from a branch, not a snapshotted workspace. They re-clone and re-install dependencies per fork — slow. With 4a/4b-full, each fork boots from a ZFS clone of the original run's filesystem in seconds.
Tests: complex; suggest manual smoke + a Playwright e2e on a tiny synthetic repo.
Diff size: ~200–300 lines (new job runner + scoring service + UI hooks).
Cross-cutting
Migrations
Every stage that touches the schema adds a forward-only,
<unixtime>-<desc>.sql migration under Application/Migration/. Pattern
matches the Stage 0 / Stage 3 migrations (1780225150-add-prompt-agent-fields.sql,
1780227009-add-prompt-forked-from-job.sql). Always use IF NOT EXISTS so
re-application is safe. Schema.sql kept in sync in the same commit.
Testing convention
- Unit work → Hspec under Test/.
- Browser-visible flows → Playwright under test_playwright/.
- Per AGENTS.md, the canonical compile check is devenv up, not cabal build exe:App. Focused supplemental specs are fine; the test suite is currently uncompilable due to unrelated pre-existing blockers in Test/GitCliSpec.hs etc. That's a separate cleanup task and shouldn't block this plan's PRs.
Observability
Each new dispatch path should log structured events the same way the
existing Application.RepositoryWorkflowTriggers does. The graph_events
table doubles as the audit log; no separate logging infrastructure needed.
Formatter
Every PR runs direnv exec . ./scripts/format-haskell.sh --staged before
commit (or trusts the managed pre-commit hook if installed). Expect the
diff to expand on first format pass of a file that hasn't been touched in
a while — that's fourmolu reformatting the whole file, not the change
itself.
Definition of done (whole plan)
When all stages above land, the master table in doc/README.md should read:

    0 ✅  1a ✅  1b ✅  2a ✅  2b ✅  2c ✅
    3 ✅  4a ⬜  4b-* ✅  5a ✅  5b ✅  5c ✅  5d-* ✅

i.e. every stage except 4a complete. The remaining "real ZFS" work in 4b / 5d hard-depends on 4a; both are tracked as partial / metadata only in the doc until 4a ships.
User-visible state at that point:
- Reactive auto-review is generic (data-driven behaviors).
- Per-repo agents wake on events (push, PR, etc.) automatically.
- Every prompt-PR commit accrues evidence + snapshot metadata in the graph.
- Trace UI shows the running event log.
- Subsequent runs of the same prompt see the prior runs' evidence in their prompt (semantic memory).
- Agents can propose graph edits; humans accept/reject.
- Self-improve runs the same prompt K ways and promotes the best (slow: cold-clone per fork until 4a/4b-full).
Everything except "fast self-improve via warm fs" works without 4a.
Open questions to confirm at implementation time
- Graph node-type vocabulary. evidence, proposal, workspace_snapshot, score, decision, claim, failure, file_range: finalise the minimal set before 2a lands so the rest don't have to migrate.
- Idempotency keys for proposals. A natural key like (agent_run_id, proposal_index) avoids dupe proposals on retry. Confirm the agent runner can supply both.
- Memory subgraph FTS vs vector. Stage 5b's readSemanticMemory is spec'd as BM25/keyword today (matches what gitnexus does). If we want semantic-vector retrieval here, the graph schema doesn't need to change but the reader does.
- Scoring rubric for 5d. Tests-pass count, diff size, agent confidence, external judge model: pick one for v1 and ship; the rest are pluggable.
- Self-improve agent count. K = 3 by default; configurable per-repo under [gitoku.selfimprove] in gitoku.toml.