Architecture
This page is the long-form companion to the rest of the docs. Read Getting Started first; come here when the operator needs to know why something is the way it is, or how a moving part actually works.
Goal in one paragraph
Section titled “Goal in one paragraph”A single Claude Code process per pod, listening on a Matrix room, with
full git + shell + kubectl access on a persistent workspace, configured
purely by files in agents/<name>/. Multiple agents = multiple
StatefulSets of the same image with different AGENT_NAME. No real
credentials ever live in a pod — the egress credential firewall
(iron-proxy) swaps them in at the network boundary.
The runtime
Section titled “The runtime”StatefulSet/<agent> (one per agent: infrabot, devbot, …)├── init container: scripts/setup.sh│ • install iron-proxy MITM CA into system trust store│ • copy stub credentials into ~/.claude/.credentials.json│ • concatenate agents/_shared/CLAUDE.md + agents/<name>/CLAUDE.md│ → ~/.claude/CLAUDE.md│ • copy settings.json, mcp.json, subagents/ into ~/.claude/│ • mark first-run onboarding complete + pre-trust workspace repos│ • install Matrix channel plugin (`claude plugin install matrix@…`)│ • write Matrix channel .env + access.json│ • configure git/gh credentials (split: gh uses proxy token; git uses real)│ • clone every AGENT_REPOS entry into /workspace/<basename>│└── main container: scripts/entrypoint.sh • startup jitter (0–45 s) — desynchronises devbot/infrabot restarts • tmux session "main" │ ├── pane 0 — scripts/claude-loop.sh │ │ • restore stub creds before every start │ │ • claude --dangerously-load-development-channels │ │ plugin:matrix@claude-code-channel-matrix │ │ --remote-control "${AGENT_NAME}" │ │ --permission-mode bypassPermissions │ │ • exponential backoff with jitter on crash │ │ • --continue if a prior session dir exists │ └── pane 1 — plain bash shell at /workspace/${PRIMARY_REPO} • dispatch() — auto-accept theme/Bypass/dev-channels first-run prompts • watchdog loop: • every 10 s: re-scan pane 0 for any prompts that reappear after a crash • every 1–3 h: inject organic keepalive prompt if pane 0 is idle 30 s+The claude-loop.sh wrapper exists because Claude Code mutates
.credentials.json mid-flight (the OAuth refresh path overwrites the
file with the upstream response, which strips subscriptionType).
Restoring the stub before every start makes the iron-proxy swap reliable
across restarts.
One image, many agents
Section titled “One image, many agents”The image is parametric on AGENT_NAME. Every agent runs
ghcr.io/sherodtaylor/agent-smith:vX.Y.Z with a different env var.
setup.sh reads agents/${AGENT_NAME}/ at startup and assembles
~/.claude/ from these sources:
| Source file | Assembled to | Purpose |
|---|---|---|
agents/_shared/CLAUDE.md + agents/<name>/CLAUDE.md (concatenated) | ~/.claude/CLAUDE.md | base rules + persona |
agents/_shared/settings.json | ~/.claude/settings.json | plugins, permissions, hooks |
agents/_shared/.credentials.json | ~/.claude/.credentials.json | stub OAuth (iron-proxy swaps in real tokens) |
agents/<name>/mcp.json | ~/.claude/.mcp.json | per-agent MCP servers |
agents/<name>/subagents/*.md | ~/.claude/agents/*.md | persona-specific subagents |
Adding a new agent is a directory + a StatefulSet referencing the
same image. No image rebuild. See
Agents for the full procedure.
Matrix as the channel
Section titled “Matrix as the channel”The agent’s input path is the
claude-code-channel-matrix
plugin. settings.json registers the marketplace; setup.sh
materialises the plugin with claude plugin install. Per-agent Matrix
credentials and the sender allowlist are written to
~/.claude/channels/matrix/.
Every permitted message in a joined room becomes a Claude Code prompt for the agent — no separate listener, no message queue, no per-room wiring. The 👀 reaction the bot posts on acknowledgement comes from the same plugin.
The plugin runs inside the same claude process as the rest of the
session. There is no second process to keep alive, no IPC, no bridge.
Cross-agent collaboration
Section titled “Cross-agent collaboration”Two channels do different jobs.
Matrix — the wake signal. Bots respond to:
- their own name (plain text,
@name, full Matrix ID, or Matrix display-name link from clients like Element); - any message from
@sherod:lab.sherodtaylor.dev; - a threaded reply to one of their own messages.
NATS does not wake an agent — only Matrix does.
NATS JetStream — durable structured event log. Bots publish to
swarm.events.{pr_opened, pr_merged, incident, task_done} after
meaningful actions. Bots read from NATS only when explicitly asked. The
audit room #audit is the human-readable mirror.
The cross-agent PR review loop:
Author bot opens PR ├── publishes swarm.events.pr_opened └── posts in #dev mentioning every teammate (full Matrix IDs) │ ▼ Mentioned bot wakes, acknowledges in same thread │ ▼ gh pr diff … → code-review skill (--comment) posts inline findings │ ▼ Reviewer posts one-liner: "Reviewed #N — N findings, N blocking." │ ▼ Author addresses comments. The check-pr-comments.sh Stop hook rewakes the author whenever new comments arrive on their PRs.The PR-comment Stop hook
Section titled “The PR-comment Stop hook”scripts/check-pr-comments.sh is wired into settings.json as a Stop
hook with asyncRewake: true. After every Claude turn:
- List the author’s open PRs in each known repo.
- Count issue-level comments + unresolved inline review threads per PR.
- Compare to a per-PR counter persisted at
~/.pr-comment-state.json. - If any PR’s count went up, exit
2with a structuredhookSpecificOutputthat names which PRs.
Exit 2 = rewake. The agent sees the rewake message, reads the new comments, addresses them, and posts a one-liner. The counter prevents infinite rewake on already-seen comments.
Security — iron-proxy
Section titled “Security — iron-proxy”iron-proxy is the egress credential firewall. Cluster-internal name
10.43.100.100. All HTTPS egress from agent pods is routed through it
via dnsPolicy: None + dnsConfig.nameservers: [<iron-proxy>].
In-cluster names (*.cluster.local) pass through to CoreDNS, so NATS
and Matrix still resolve normally.
agent pod iron-proxy internet ───────── ────────── ──────── git/gh/curl/claude │ Authorization: Bearer proxy-token-github │ Authorization: Bearer access-token-stub ▼ resolve api.github.com ─────► iron-proxy MITM (private CA in pod's trust store) │ │ match host → look up real credential │ rewrite Authorization header ▼ forward to upstream ───────────► api.github.com api.anthropic.comProperties this gives the operator:
- A leaked pod token is worthless outside the cluster (it’s literally
proxy-token-github). - Token rotation is iron-proxy’s job. Agents never refresh OAuth — the
pod’s
~/.claude/.credentials.jsonis permanently the stub. - Default-deny domain allowlist means a misbehaving agent can’t exfiltrate to an attacker-controlled host even if it tried.
- The iron-proxy CA is distributed via ExternalSecret.
setup.shinstalls it intoupdate-ca-certificates; the Dockerfile setsNODE_EXTRA_CA_CERTSso the Node-basedclaudeCLI trusts it too.
Why split GITHUB_TOKEN from GIT_GITHUB_TOKEN?
ghand the GitHub REST API use plain-textAuthorization: Bearer <token>headers — iron-proxy can string-match and swapproxy-token-github.gitHTTPS uses Basic Auth (Authorization: Basic <base64(user:pass)>), which is opaque to a plain-text match. Sosetup.shwrites the real PAT into~/.git-credentialsviaGIT_GITHUB_TOKEN, and routesgh/API calls through iron-proxy viaGITHUB_TOKEN=proxy-token-github.
This is a known wart; an iron-proxy that can decode and rewrite Basic Auth would let the second token disappear.
Why Claude Code CLI (not the Agent SDK)
Section titled “Why Claude Code CLI (not the Agent SDK)”The interactive CLI is the only option that is long-lived, subscription-billed, and MCP-capable at the same time:
- Claude Agent SDK is billed as Anthropic API consumption, separate from a Pro/Max subscription. Multi-agent always-on Matrix bots would turn a flat subscription into a metered bill.
claude -pis on subscription quota but single-shot — every invocation is a cold start, no prompt cache, no MCP handshake, no Matrix connection.- opencode et al. are good interactive tools but the operator brings their own model — they don’t get the Claude subscription itself.
So: long-lived Claude Code CLI per agent, Matrix channel plugin for
input, --remote-control for direct drive-in from a laptop, MCP for
everything else. The stub-credential isolation in iron-proxy is what
makes putting a Claude subscription inside a pod safe.
Release pipeline
Section titled “Release pipeline”git tag vX.Y.Z + push docker.yml fires │ │ │ ▼ │ job: build (always) │ ┌──────────────────┐ │ │ Buildx multi- │ │ │ stage build │ │ │ │ │ │ docker/metadata-│ │ │ action computes │ │ │ tags from ref: │ │ │ • vX.Y.Z │ │ │ • vX.Y │ │ │ • vX │ │ │ • latest │ │ └────────┬─────────┘ │ ▼ │ push to GHCR │ │ │ ▼ │ job: chart (only on vX.Y.Z) │ ┌──────────────────┐ │ │ helm package │ │ │ --version X.Y.Z│ │ │ helm push │ │ │ oci://ghcr.io/ │ │ │ sherodtaylor/ │ │ │ charts │ │ │ attach .tgz to │ │ │ GitHub Release │ │ └──────────────────┘ ▼GitHub Release (created manually, see release runbook) │ ▼Consumer: bump version on HelmRelease in sherodtaylor/homelab ▼Flux reconciles → image rollRules baked into the metadata-action config:
- Push to
main→:main(moving) +:sha-<short>(immutable). No:latest. - Tag
vX.Y.Z→:vX.Y.Z,:vX.Y,:vX,:latest, plus the Helm chart job runs and publishes the OCI chart at the matching version.
A downstream consumer that pins :latest only ever gets versioned
releases, never a mid-flight refactor.
Helm chart shape
Section titled “Helm chart shape”One chart = one agent. The chart renders:
ServiceAccount(optional)ClusterRole+ClusterRoleBinding(read-only defaults; overridable for mutating agents)StatefulSetwith two PVCs (/rootfor~/.claude/,/workspace/for cloned repos)- Optional iron-proxy DNS routing (
dnsPolicy: None, nameserver atironProxy.clusterIp)
The chart does not manage the underlying Secret. The consumer
provides one (manually, ExternalSecrets, sealed-secrets) with these
keys:
| Key | Used by |
|---|---|
MATRIX_ACCESS_TOKEN | Matrix channel plugin |
GITHUB_TOKEN | gh, GitHub API (proxy token; iron-proxy swaps) |
GIT_GITHUB_TOKEN (optional) | git HTTPS Basic Auth (real PAT) |
IRON_PROXY_CA_CRT | system trust store + NODE_EXTRA_CA_CERTS |
MATRIX_HOMESERVER_URL, MATRIX_BOT_USER_ID | Matrix plugin |
tmux pipe-pane -o 'cat >> /proc/1/fd/1' is configured on both panes
inside entrypoint.sh, so PID 1 sees everything either pane prints.
kubectl logs returns the full stream. VictoriaLogs in-cluster captures
it via the DaemonSet.
A known cost: this includes Claude Code’s JSONL transcript noise when it prints assistant messages. The trade-off is paid because losing pane output across a restart would be worse.
Things that look weird but are load-bearing
Section titled “Things that look weird but are load-bearing”| Thing | Why |
|---|---|
Stub credentials committed to the repo (access-token-stub, refresh-token-stub) | They’re literal placeholders, not secrets. Their presence is what iron-proxy string-matches on. |
claude-loop.sh restoring .credentials.json before every start | Claude Code’s OAuth refresh overwrites the file mid-flight, stripping subscriptionType. Restoring the stub each start keeps the iron-proxy swap working. |
Two GitHub tokens (GITHUB_TOKEN proxy + GIT_GITHUB_TOKEN real) | git HTTPS uses Basic Auth which iron-proxy can’t plain-text swap. |
Startup jitter at the top of entrypoint.sh | Without it devbot and infrabot restart in lockstep, hammering Anthropic and GitHub simultaneously. |
dispatch() matching prompt text by string | There’s no headless config flag for the theme picker / Bypass warning / dev-channels consent. The string match is the workaround. |
| The keepalive prompt injector | Empty Matrix days plus a long idle pane look like a stuck process to outside observers (and sometimes to Claude itself). Random low-frequency activity prevents that. |
If the operator is about to “fix” one of these, read the related runbook first.