Roadmap

Status: draft for review
Author: DevBot (with Sherod)
Last updated: 2026-05-25


agent-smith is an autonomous engineering crew that ships real work against real infrastructure. Not a chatbot, not a code-completion sidecar — a swarm of AI engineers who operate as peers, coordinate with each other, and learn from what they ship. Humans set direction; agents execute, coordinate, recover from their own mistakes, and surface what they couldn’t.

The shape we’re aiming at: the operator opens Matrix on Monday morning and the crew has already triaged the weekend’s CI failures, opened three Dependabot PRs, fixed the one with a clean bump, asked for a steer on the two that weren’t, and posted a one-line summary in #audit. The operator doesn’t relay information between bots; the bots share state. They don’t piece together what happened from three browser tabs; one timeline shows the whole run. They don’t audit token usage by hand; budgets are enforced at the edge. They trust what the bots changed because they can inspect what was done, not because they were watching live.

Right now agent-smith is a long way from that. It is two bots that faithfully complete the tasks they’re handed, with no shared memory, no unified observability, no capability scoping, and no ability to originate work. Every conversation is the first conversation. v1 closes that gap.


Five promises. Every feature in this roadmap traces back to one of them. If a feature can’t be tied to a promise, it doesn’t belong in v1.

| # | Promise | What it means in practice |
|---|---------|---------------------------|
| P1 | Coordination is real | DevBot and InfraBot share understanding without Sherod as relay. Decisions, incidents, and patterns are durable and queryable across agents. |
| P2 | Work is observable | Any past run can be reconstructed end-to-end from a single timeline — Matrix message → tool calls → file edits → NATS events → log lines → outcome. No three-tab archaeology. |
| P3 | Boundaries are enforced | Each agent has a documented capability scope (what tools, which secrets, which repos). Boundary violations are detected, not just hoped against. |
| P4 | Memory compounds | Agents recall their own past decisions and each other’s. Knowledge accumulates over months; agents don’t get repeatedly stuck on the same problem. |
| P5 | Work originates | Bots don’t only respond to pings — they pick up stale PRs, dep bumps, CI rot, and incident triage on their own cadence. The crew has work even when Sherod is asleep. |

Items we’ve thought through, have a position on, and intentionally placed beyond v1. Each has a clear trigger that would pull it forward.

  • Model portability — a harness-agnostic agent loop. Today we run on Claude Code, and the leverage of CLAUDE.md + MCP + the plugin marketplace + the Matrix channel plugin is exactly why the crew works. A future agent layer could abstract the LLM so local models (Hermes, Llama) can run cheap sub-agent dispatch (classification, embeddings, low-stakes tool calls) while Claude stays on the main loop. Pull forward when: a second model becomes load-bearing for cost or latency reasons we can measure.
  • Decentralized agent discovery (DID / registry). A team.yaml is enough while a single operator runs all agents. A future where agents span trust boundaries — multiple operators, cross-org collaboration, agents that join the swarm without being pre-provisioned — wants DIDs, a registry, or a federated MCP catalog. Pull forward when: the crew grows past ~5 agents OR an agent runs outside this trust boundary.
  • Syscall-level observability via eBPF. iron-proxy controls egress and VictoriaLogs captures stdout/stderr, which is enough today because bots only ingest input from the Matrix allowlist. A future posture that accepts untrusted input (third-party MCP servers, webhook triggers, user-submitted scripts) wants per-process syscall audit, file-write tracking, and fork-chain visibility. Pull forward when: bots take untrusted input, OR an incident shows application-layer logs were not enough to reconstruct what happened.
  • Additional human-interface channels (Slack, Discord, IRC, WhatsApp, …). We’re building a framework, not a Matrix-only tool. Matrix happens to be the channel we shipped first because the Claude Code channel-plugin pattern already had a Matrix implementation, but the agent loop itself is channel-agnostic: CLAUDE.md persona, MCP tooling, the dispatch event surface, the capability scopes (P3), and the run-id correlation (P2) all apply unchanged regardless of where the human message enters. A second channel proves the abstraction. Pull forward when: a user (or the framework’s adopter) needs a non-Matrix channel, OR a second channel plugin lands in the Claude Code marketplace that we can wire in without writing one ourselves. Likely first add: Slack — most-requested in similar tools, has a mature plugin ecosystem, and the per-channel allowlist model maps cleanly to Slack workspace channels.

Honest gap analysis. Where are we against each promise right now?

| Promise | Current state | Gap |
|---------|---------------|-----|
| P1 — Coordination | NATS event log exists but is opaque (no UI, queried only on request); #audit room is unstructured prose. | No typed shared store, no read-on-startup convention. Agents start every conversation from zero. |
| P2 — Observability | Three observability tiers: NATS (structured, no UI), VictoriaLogs (text), Matrix (semantic, unstructured). None talk to each other. | No run_id correlation. Diagnosing a misbehaving run means correlating by hand. |
| P3 — Boundaries | Matrix allowlist gates who can trigger a bot. Cluster RBAC partially scoped via per-agent ServiceAccounts. | Nothing gates what a bot can do once triggered. DevBot can call any tool InfraBot can. No detection of cross-boundary calls. |
| P4 — Memory | Claude Code’s per-project auto-memory works for an individual agent. | No cross-agent memory. No KB. Agents repeatedly re-discover the same context. |
| P5 — Origination | Zero. Bots are 100% reactive. | No cron, no event triggers beyond Matrix, no concept of “work the crew has noticed and is doing.” |

v1 themes and the features that serve them

Themes are the work. Features are how we deliver each theme. Sequencing is at the bottom.

Theme A — Make the crew coherent (P1 + P4)

The single biggest leverage point. Today every Matrix conversation starts from zero because there’s nowhere for either agent to look up what they or their teammate already decided. Fix this and a large share (our guess: about half) of the “wait, what was the context for…” questions should disappear.

  • Agent Memory: cross-agent typed KV. NATS-backed records with strict schemas (decision, incident, pattern, runbook). An mcp-memory Go binary exposes write_record, read_records(type, agent?, since?), find_records(query). Both bots wire it into agents/_shared/mcp.json. Solves P1 directly and is the substrate for P4.
  • Native Knowledge Base (read-only MCP). Vector DB (Qdrant or pgvector) over past PRs, #audit history, and docs/. The KB is what makes memory compound rather than just accumulate: retrieval, not just storage. Ships in v1.1, building on the Agent Memory store from v1.0.
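
To make the Agent Memory interface concrete, here is a minimal Go sketch of the three operations named above. It is only an illustration under stated assumptions: the Record fields, the in-memory Store (standing in for the NATS-backed store), and the naive substring search behind FindRecords are all hypothetical, not the mcp-memory design.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// RecordType names the four schemas listed in the roadmap.
type RecordType string

const (
	Decision RecordType = "decision"
	Incident RecordType = "incident"
	Pattern  RecordType = "pattern"
	Runbook  RecordType = "runbook"
)

var validTypes = map[RecordType]bool{Decision: true, Incident: true, Pattern: true, Runbook: true}

// Record is a hypothetical shape; every field name here is an assumption.
type Record struct {
	Type    RecordType
	Agent   string
	Summary string
	At      time.Time
}

// Store is an in-memory stand-in for the NATS-backed store.
type Store struct{ records []Record }

// WriteRecord rejects anything that does not fit the schema.
func (s *Store) WriteRecord(r Record) error {
	if !validTypes[r.Type] {
		return fmt.Errorf("unknown record type %q", r.Type)
	}
	if r.Agent == "" || r.Summary == "" {
		return errors.New("agent and summary are required")
	}
	if r.At.IsZero() {
		r.At = time.Now().UTC()
	}
	s.records = append(s.records, r)
	return nil
}

// ReadRecords filters by type. An empty agent or a zero since means "any".
func (s *Store) ReadRecords(t RecordType, agent string, since time.Time) []Record {
	var out []Record
	for _, r := range s.records {
		if r.Type != t || (agent != "" && r.Agent != agent) {
			continue
		}
		if !since.IsZero() && r.At.Before(since) {
			continue
		}
		out = append(out, r)
	}
	return out
}

// FindRecords does a case-insensitive substring match on the summary.
func (s *Store) FindRecords(query string) []Record {
	q := strings.ToLower(query)
	var out []Record
	for _, r := range s.records {
		if strings.Contains(strings.ToLower(r.Summary), q) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	s := &Store{}
	_ = s.WriteRecord(Record{Type: Decision, Agent: "devbot", Summary: "Pin base image by digest"})
	_ = s.WriteRecord(Record{Type: Incident, Agent: "infrabot", Summary: "Registry pull failures"})
	fmt.Println(len(s.ReadRecords(Decision, "", time.Time{})), len(s.FindRecords("registry")))
}
```

Rejecting unknown types at write time is what keeps the store "typed": a reader can rely on every record matching one of the four schemas.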

Theme B — Make every run inspectable (P2)

Today an off-the-rails run is undebuggable after the fact. This blocks ephemeral agents (you can’t run short-lived bots if you can’t review what they did) and erodes trust in the crew over time.

  • Orchestration Hub. A run_id UUID is generated per Matrix message that wakes an agent and threaded through every NATS event, every log line (echo "[run=$RUN_ID] ..."), and every Matrix reply (footer). A Grafana dashboard joins NATS + VictoriaLogs + Matrix on run_id and surfaces a single timeline. Days of work on existing infra; mandatory prerequisite for Theme D.
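
As a rough illustration of the run_id idea, the Go sketch below generates a version 4 UUID, tags a log line in the `[run=...]` format quoted above, and joins entries from the three sources into one timeline. The Entry type, the footer wording, and the Timeline helper are assumptions made for the example; in the plan above the real join happens in Grafana, not in application code.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"sort"
	"time"
)

// NewRunID returns a random (version 4) UUID built from crypto/rand.
func NewRunID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set version 4
	b[8] = (b[8] & 0x3f) | 0x80 // set RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

// LogLine prefixes a message the same way the shell example in the text does.
func LogLine(runID, msg string) string {
	return fmt.Sprintf("[run=%s] %s", runID, msg)
}

// ReplyWithFooter appends the run id to a Matrix reply (footer wording is an assumption).
func ReplyWithFooter(reply, runID string) string {
	return reply + "\n\nrun: " + runID
}

// Entry is a hypothetical normalized row from NATS, VictoriaLogs, or Matrix.
type Entry struct {
	Source string
	RunID  string
	At     time.Time
	Text   string
}

// Timeline keeps only the entries for one run and orders them by time.
func Timeline(runID string, entries []Entry) []Entry {
	var out []Entry
	for _, e := range entries {
		if e.RunID == runID {
			out = append(out, e)
		}
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].At.Before(out[j].At) })
	return out
}

func main() {
	id, err := NewRunID()
	if err != nil {
		panic(err)
	}
	t0 := time.Now()
	entries := []Entry{
		{"victorialogs", id, t0.Add(2 * time.Second), LogLine(id, "running tests")},
		{"matrix", id, t0, "please fix the CI failure"},
		{"nats", "some-other-run", t0.Add(time.Second), "unrelated event"},
		{"nats", id, t0.Add(time.Second), "tool call started"},
	}
	for _, e := range Timeline(id, entries) {
		fmt.Printf("%-12s %s\n", e.Source, e.Text)
	}
}
```

The point of the sketch is that once every source carries the same identifier, reconstructing a run is a filter plus a sort; without it, the correlation has to be done by hand.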

Theme C — Enforce boundaries (P3)

The current model is “trust both agents fully.” It works because there are two agents and Sherod operates both. The moment we add ephemeral agents (Theme D) or take outside input (eventually), it stops working.

  • Per-agent capability scopes. agents/<name>/capabilities.yaml enumerates allowed tools, allowed Matrix rooms, and accessible secret keys. Enforced at two layers (Claude permissions.deny and k8s RBAC), with an audit-log event emitted on every boundary check so violations are detected as well as blocked.
  • Cost / budget controls. Daily token budget per agent enforced at iron-proxy; hard-kill when exceeded; alert to #audit. Becomes critical when Theme D ships — without budgets, a buggy ephemeral job can rack up real money before anyone notices.
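
The two controls above can be pictured with a small Go sketch. It assumes capabilities.yaml has already been parsed into a Capabilities struct, and that budget accounting happens in-process; in the roadmap the budget is enforced at iron-proxy, so Budget here is only a stand-in. The type names, field names, and the tool names "git_push" and "kubectl_apply" are all hypothetical.

```go
package main

import "fmt"

// Capabilities mirrors what a capabilities.yaml might enumerate (assumed shape).
type Capabilities struct {
	Agent   string
	Tools   map[string]bool
	Rooms   map[string]bool
	Secrets map[string]bool
}

// AuditEvent is recorded for every boundary check, allowed or not.
type AuditEvent struct {
	Agent   string
	Kind    string
	Name    string
	Allowed bool
}

// Enforcer checks requests against one agent's scope and keeps an audit trail.
type Enforcer struct {
	caps  Capabilities
	Audit []AuditEvent
}

// Check reports whether the agent may use the named tool, room, or secret.
// Unknown kinds and unlisted names are denied by default.
func (e *Enforcer) Check(kind, name string) bool {
	var set map[string]bool
	switch kind {
	case "tool":
		set = e.caps.Tools
	case "room":
		set = e.caps.Rooms
	case "secret":
		set = e.caps.Secrets
	}
	allowed := set[name] // reading a nil map is safe and yields false
	e.Audit = append(e.Audit, AuditEvent{e.caps.Agent, kind, name, allowed})
	return allowed
}

// Budget tracks one agent's token usage for the current day.
type Budget struct {
	DailyLimit int
	used       int
}

// Spend records usage and reports whether the agent is still within budget.
// A false result is where a hard-kill and an #audit alert would be triggered.
func (b *Budget) Spend(tokens int) bool {
	b.used += tokens
	return b.used <= b.DailyLimit
}

func main() {
	e := &Enforcer{caps: Capabilities{
		Agent: "devbot",
		Tools: map[string]bool{"git_push": true},
		Rooms: map[string]bool{"#audit": true},
	}}
	fmt.Println(e.Check("tool", "git_push"))      // inside the scope
	fmt.Println(e.Check("tool", "kubectl_apply")) // outside the scope
	fmt.Println(len(e.Audit), "audit events recorded")

	b := &Budget{DailyLimit: 1000}
	fmt.Println(b.Spend(600), b.Spend(600))
}
```

Deny-by-default matters here: a tool that nobody remembered to list is refused and audited, which is the behavior the gap analysis says is missing today.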

Theme D — Make the crew autonomous (P5 + scale via P3)

The end state of v1. Bots that originate work on their own cadence, scale out via short-lived task-scoped runs, and don’t need a human to start them. Blocked on every prior theme.

  • Ephemeral Agents. K8s Jobs (not StatefulSets) triggered by NATS events or cron. First implementation: pr-reviewer, which wakes on swarm.events.pr_opened, runs the code-review skill, posts inline comments, and exits. Strictly depends on Themes A (shared memory, because an ephemeral run has no persistence of its own), B (debuggability), and C (scoped credentials per run).
  • Proactive work origination. Start with one concrete loop — weekly stale-PR sweep — and add more once the pattern works. Driven by the existing schedule skill. Each new loop is a small lift once the infrastructure is in place.
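
Both bullets reduce to small decision functions, sketched below in Go. Dispatch maps an event subject to the job that should be launched, and StalePRs picks the pull requests a weekly sweep would act on. This is a sketch only: JobSpec, the trigger table, the PullRequest fields, and the 14-day threshold used in main are assumptions, and actually creating a K8s Job or subscribing to NATS is deliberately left out.

```go
package main

import (
	"fmt"
	"time"
)

// JobSpec describes one short-lived, task-scoped run (hypothetical shape).
type JobSpec struct {
	Name  string // agent type, e.g. "pr-reviewer"
	Skill string // skill the run executes
	RunID string // correlation id carried through the run
}

// triggers maps event subjects to the job they wake up.
// Only the pr_opened entry comes from the roadmap.
var triggers = map[string]JobSpec{
	"swarm.events.pr_opened": {Name: "pr-reviewer", Skill: "code-review"},
}

// Dispatch returns the job to launch for a subject, or false if none is registered.
func Dispatch(subject, runID string) (JobSpec, bool) {
	spec, ok := triggers[subject]
	if !ok {
		return JobSpec{}, false
	}
	spec.RunID = runID // spec is a copy, so the table itself is not modified
	return spec, true
}

// PullRequest carries only the fields the sweep needs.
type PullRequest struct {
	Number    int
	UpdatedAt time.Time
}

// StalePRs returns the numbers of PRs not updated within maxAge.
func StalePRs(prs []PullRequest, now time.Time, maxAge time.Duration) []int {
	var stale []int
	for _, pr := range prs {
		if now.Sub(pr.UpdatedAt) > maxAge {
			stale = append(stale, pr.Number)
		}
	}
	return stale
}

func main() {
	if spec, ok := Dispatch("swarm.events.pr_opened", "run-123"); ok {
		fmt.Printf("launch job %s (skill=%s, run=%s)\n", spec.Name, spec.Skill, spec.RunID)
	}
	now := time.Now()
	prs := []PullRequest{
		{Number: 1, UpdatedAt: now.Add(-30 * 24 * time.Hour)},
		{Number: 2, UpdatedAt: now.Add(-2 * 24 * time.Hour)},
	}
	fmt.Println(StalePRs(prs, now, 14*24*time.Hour))
}
```

Keeping the trigger table as data is one way to make "each new loop is a small lift" true in practice: adding an ephemeral agent type becomes a new map entry rather than new control flow.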

Sequencing

v1.0: Themes A (memory only) + B + C (the foundation)
  • Agent Memory (cross-agent typed KV)
  • Orchestration Hub (run_id correlation + Grafana)
  • Per-agent capability scopes
  • team.yaml replaces hardcoded agent list (cheap drop-in)

v1.1: Theme A complete + Theme D pilot
  • Native KB (read-only MCP), pairs with memory
  • Cost / budget controls
  • First ephemeral agent: pr-reviewer

v1.2: Theme D scaled
  • First proactive workflow: stale-PR sweep
  • Second ephemeral agent type (TBD with Sherod)

v2.x: Future considerations, pulled forward as their triggers fire
  • Model portability, federated discovery, syscall observability, additional human-interface channels

Real decisions where I don’t want to commit without you weighing in.

  1. Capability-scope enforcement layers. Cluster RBAC + Claude permissions.deny is already two layers. Do we want a third — the bot itself rejecting calls before they hit either — for defense in depth? Or is two enough?
  2. KB substrate. Qdrant (own Helm release) vs. pgvector (reuse existing Postgres). Preference?
  3. Ephemeral agent egress. iron-proxy currently issues stable credentials per agent name. Ephemerals need a credential per run. Does iron-proxy grow a session-broker, or do we sidecar something new?
  4. NATS stream retention. The memory stream needs different retention from swarm.events.*. Comfortable with a new stream + retention policy, or want to overload an existing one?
  5. Origination cadence. How proactive do you actually want the crew? “Bot opens a PR every night if it can find a clean dep bump” is a very different posture from “bot prepares a list of candidate work for Sherod’s morning review.” This shapes Theme D.

Appendix — items from the original list mapped to themes

For traceability, here’s where each of your original v1 candidates landed.

| Original idea | Theme | Verdict |
|---------------|-------|---------|
| Agent Memory | A | In v1.0 — biggest leverage item, ships first |
| Orchestration Hub for debugging | B | In v1.0 — mandatory prerequisite for ephemerals |
| Native KB integration | A | In v1.1 — pairs with memory; read-only MCP first |
| Ephemeral Agents | D | In v1.1 — pilot with pr-reviewer, then expand |
| Harness Agnostic (decoupled from Claude) | | Future consideration — pull forward when a second model becomes load-bearing for cost or latency we can measure |
| Decentralized agent discovery (DID / registry) | | Future consideration — pull forward when the crew grows past ~5 agents or spans trust boundaries; team.yaml carries us until then |
| Native eBPF for networking/security | | Future consideration — pull forward when bots take untrusted input, or when an incident shows application-layer logs were not enough |

Net adds (not on original list, surfaced by promise analysis):

  • Per-agent capability scopes (P3)
  • Cost/budget controls (P3 + safety for Theme D)
  • Proactive work origination as a first-class theme (P5)