Matrix Multi-Agent Routing Design

Goal

Move the Matrix surface from a single hardcoded upstream agent to a user-selectable multi-agent model, while preserving the existing room-based UX and the current PlatformClient boundary.

The result should be:

  • one Matrix bot can work with multiple upstream agents
  • users can choose an agent from the full configured list
  • each chat is bound to exactly one agent
  • switching the selected agent does not silently retarget an existing chat

Core Decision

The selected routing model is:

user.selected_agent_id + room.agent_id + room.platform_chat_id

This means:

  • the user has one current selected agent
  • each Matrix working room stores the agent it is bound to
  • each Matrix working room stores its own platform_chat_id
  • a room never changes agent implicitly
  • the shared PlatformClient protocol remains unchanged
  • Matrix multi-agent routing is implemented by a single routing facade that delegates to per-agent real clients

Why This Decision

The current Matrix adapter already separates:

  • user-facing room organization
  • local chat labels such as C1, C2, C3
  • platform-facing conversation identity via platform_chat_id

Adding multi-agent support should preserve that shape instead of replacing it.

If routing depended only on the current user selection, then an old room could start talking to a different agent after a switch. That would make room history and backend context hard to reason about. Binding an agent to the room keeps the conversation model explicit.

Scope

This design covers:

  • agent selection by the user inside the Matrix surface
  • durable storage of the selected agent
  • durable storage of the room-bound agent
  • routing normal messages and context commands to the correct upstream agent
  • behavior when a room becomes stale after an agent switch

This design does not cover:

  • per-agent workspace isolation
  • platform-side agent lifecycle or memory persistence
  • per-user allowlists for available agents
  • Telegram or other surfaces

Configuration Model

Agent registry

Available agents are defined in a local config file, agents.yaml, which is loaded once at bot startup.

Example:

agents:
  - id: agent-1
    label: Analyst
  - id: agent-2
    label: Research
  - id: agent-3
    label: Ops

Rules:

  • every entry must have a stable id
  • every entry must have a user-visible label
  • all configured agents are selectable by all users
  • config changes apply only after bot restart

Startup validation

If the agent config is missing, empty, or invalid, the Matrix bot must fail fast on startup with a clear operator error.
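As a rough illustration, the validation rules above could be applied to the already-parsed config like this. It is a minimal sketch, not the project's loader: the names `AgentConfigError` and `build_agent_registry` are hypothetical, the input is assumed to be the dict a YAML parser would return for the example above, and rejecting duplicate ids is an assumption drawn from the "stable id" rule.

```python
from types import MappingProxyType
from typing import Any, Mapping


class AgentConfigError(Exception):
    """Hypothetical error raised at startup for a missing, empty, or invalid registry."""


def build_agent_registry(raw: Any) -> Mapping[str, str]:
    """Validate parsed agent config and return a read-only id -> label mapping."""
    if not isinstance(raw, dict) or not isinstance(raw.get("agents"), list):
        raise AgentConfigError("agent config must contain an 'agents' list")
    if not raw["agents"]:
        raise AgentConfigError("agent config defines no agents")

    registry: dict[str, str] = {}
    for index, entry in enumerate(raw["agents"]):
        if not isinstance(entry, dict):
            raise AgentConfigError(f"agents[{index}] must be a mapping")
        agent_id, label = entry.get("id"), entry.get("label")
        if not isinstance(agent_id, str) or not agent_id.strip():
            raise AgentConfigError(f"agents[{index}] is missing a non-empty id")
        if not isinstance(label, str) or not label.strip():
            raise AgentConfigError(f"agents[{index}] is missing a non-empty label")
        # Assumption: duplicate ids are treated as invalid config.
        if agent_id in registry:
            raise AgentConfigError(f"duplicate agent id: {agent_id}")
        registry[agent_id] = label

    # MappingProxyType gives runtime code a read-only view of the registry.
    return MappingProxyType(registry)
```

Letting the exception propagate out of startup code would give the fail-fast behavior; the bot would not fall back to any default agent.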

Durable State Model

User-level state

User metadata keeps the current selected agent.

Example matrix_user:* shape:

{
  "space_id": "!space:example.org",
  "next_chat_index": 4,
  "selected_agent_id": "agent-2"
}

Meaning:

  • selected_agent_id controls future chat creation and activation of an unbound room
  • selected_agent_id does not rewrite already bound rooms

Room-level state

Room metadata stores the agent bound to that chat.

Example matrix_room:* shape:

{
  "room_type": "chat",
  "chat_id": "C3",
  "display_name": "Чат 3",
  "matrix_user_id": "@alice:example.org",
  "space_id": "!space:example.org",
  "platform_chat_id": "42",
  "agent_id": "agent-2"
}

Rules:

  • one room binds to exactly one agent_id
  • one room binds to exactly one current platform_chat_id
  • once a room becomes stale after an agent switch, it never becomes active again

Runtime Semantics

!start

!start remains lightweight:

  • if no agent is selected, the bot explains that an agent must be selected before normal messaging
  • if an agent is already selected, the bot reports the current selection and reminds the user that !new creates a new room under that agent

!agent

Introduce an agent-selection command.

Behavior:

  • !agent shows the available agent list
  • selecting an agent from that list stores selected_agent_id in user metadata
  • after a successful switch, the bot tells the user that existing chats bound to another agent are stale and that !new is required for continued work

The exact UI can be text-first for MVP. A richer UI can be added later without changing the state model.

Normal message without selected agent

If the user has not selected an agent yet:

  • do not call the platform
  • return the available agent list
  • ask the user to choose one first

This is an intentional one-time routing handshake, not an accidental fallback. In a multi-agent deployment, the surface must not silently guess which agent an unbound user should talk to.

Selecting an agent inside an unbound chat

If the current room has never been bound to any agent:

  • store the new selected_agent_id for the user
  • bind the current room to that same agent_id
  • allow the room to become the active working chat immediately

This avoids forcing !new for the user's first usable chat.
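The selection behavior described so far might look roughly like the following. This is a sketch under assumptions: `UserMeta`, `RoomMeta`, and `select_agent` are hypothetical names, the metadata objects stand in for the stored matrix_user:* and matrix_room:* records, and the reply strings are placeholders. It only compares ids and does not model how a stale room stays stale once the user returns to an earlier agent.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class UserMeta:
    """Hypothetical in-memory view of the user-level metadata."""
    selected_agent_id: Optional[str] = None


@dataclass
class RoomMeta:
    """Hypothetical in-memory view of the room-level metadata."""
    agent_id: Optional[str] = None


def select_agent(user: UserMeta, room: RoomMeta, agent_id: str,
                 registry: Mapping[str, str]) -> str:
    """Apply an agent selection and return the reply text for the user."""
    if agent_id not in registry:
        # Unknown ids never change stored state.
        return "Unknown agent. Available: " + ", ".join(registry)

    user.selected_agent_id = agent_id

    if room.agent_id is None:
        # Never-bound room: bind it so it becomes the working chat immediately.
        room.agent_id = agent_id
        return f"Selected {registry[agent_id]}. This chat is now active."

    if room.agent_id != agent_id:
        # Bound rooms are never retargeted; the user continues in a new room.
        return (f"Selected {registry[agent_id]}. "
                "This chat is bound to another agent; use !new.")

    return f"Selected {registry[agent_id]}."
```

The caller would persist both metadata objects after the function returns.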

!new

!new creates a new working room under the current selected agent.

Behavior:

  1. require selected_agent_id
  2. create the new Matrix room
  3. allocate a new platform_chat_id
  4. store agent_id = selected_agent_id in the new room metadata
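The four steps above can be sketched as a single function. Everything named here is hypothetical: `handle_new`, `NewRoom`, and `AgentNotSelected` are illustrative, and the two callables stand in for whatever the runtime actually uses to create a Matrix room and allocate a platform_chat_id.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class NewRoom:
    """Hypothetical record of what would be written to the new room's metadata."""
    room_id: str
    platform_chat_id: str
    agent_id: str


class AgentNotSelected(Exception):
    """Hypothetical error for !new used before an agent has been selected."""


def handle_new(selected_agent_id: Optional[str],
               create_room: Callable[[], str],
               allocate_platform_chat_id: Callable[[], str]) -> NewRoom:
    # Step 1: require a selection before touching Matrix or the platform.
    if not selected_agent_id:
        raise AgentNotSelected("select an agent with !agent before using !new")
    # Step 2: create the new Matrix room.
    room_id = create_room()
    # Step 3: allocate a new platform_chat_id for that room.
    platform_chat_id = allocate_platform_chat_id()
    # Step 4: bind the room to the agent that is selected right now.
    return NewRoom(room_id, platform_chat_id, agent_id=selected_agent_id)
```

Checking the selection first means a user without an agent never ends up with an empty, unbound room.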

Normal message in an unbound room with selected agent

If a room exists but has no agent_id yet and the user already has selected_agent_id:

  • bind the room to selected_agent_id
  • ensure it has platform_chat_id
  • continue normal message dispatch

Normal message in a bound room

If the room already has an agent_id and it matches the user's current selected_agent_id:

  • route the message to that agent_id
  • use the room's platform_chat_id

Stale room after agent switch

If the room's bound agent_id differs from the user's current selected_agent_id:

  • do not call the platform
  • treat the room as stale
  • return a short message telling the user that this chat belongs to the old agent and that they must use !new

Returning to a previously selected agent

If the user later selects an old agent again:

  • previously stale rooms do not become valid again
  • the user must still create a fresh room via !new

Routing and Component Changes

Agent registry loader

Add a small loader responsible for:

  • reading agents.yaml
  • validating ids and labels
  • exposing a read-only registry to runtime code

The runtime should not parse YAML ad hoc during message handling.

Matrix runtime pre-check

Before dispatching a normal message, the Matrix runtime must resolve:

  • whether the user has selected_agent_id
  • whether the current room already has agent_id
  • whether the room can be bound now
  • whether the room is stale

This pre-check happens before handing the message to the existing dispatcher path.
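One way to picture the pre-check is as a pure function from stored state to an outcome. The sketch below is illustrative only: `Outcome` and `precheck` are hypothetical names, and the function covers just the id comparison described in this document. It does not model how a room remains stale after the user re-selects an earlier agent, since the state shapes above do not show a field for that.

```python
from enum import Enum
from typing import Collection, Optional


class Outcome(Enum):
    NEED_SELECTION = "need_selection"  # prompt with the agent list, no platform call
    BIND_ROOM = "bind_room"            # bind the room to the selection, then dispatch
    ROUTE = "route"                    # dispatch to the room's bound agent
    STALE = "stale"                    # reject and point the user to !new


def precheck(selected_agent_id: Optional[str],
             room_agent_id: Optional[str],
             known_agent_ids: Collection[str]) -> Outcome:
    """Decide what to do with a normal message before dispatching it."""
    # A missing selection and a selection removed from config are handled alike.
    if selected_agent_id is None or selected_agent_id not in known_agent_ids:
        return Outcome.NEED_SELECTION
    # Unbound (or legacy-unbound) room with a valid selection: bind on first use.
    if room_agent_id is None:
        return Outcome.BIND_ROOM
    # Bound to a different agent than the current selection: stale.
    if room_agent_id != selected_agent_id:
        return Outcome.STALE
    return Outcome.ROUTE
```

Keeping the decision free of I/O makes each branch easy to cover with a small unit test.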

Routed platform client

The selected implementation keeps the shared PlatformClient protocol unchanged.

The Matrix runtime owns one routing-aware facade, for example RoutedPlatformClient, that implements PlatformClient and delegates to agent-specific real clients.

Responsibilities:

  • resolve the current room binding from local Matrix metadata
  • translate a local Matrix logical chat id into the room's platform_chat_id
  • choose the correct per-agent delegate for the room's bound agent_id
  • keep get_or_create_user, get_settings, and update_settings behavior stable for the rest of the runtime

This keeps the multi-agent logic inside the Matrix integration boundary instead of pushing agent selection into the shared protocol.

Real platform bridge delegates

The current real backend path hardcodes a single runtime-level agent_id. That must be replaced with per-agent delegates hidden behind the routing facade.

The selected design is:

  • RealPlatformClient remains the low-level direct-agent delegate for one configured agent_id
  • the routing facade holds or creates one RealPlatformClient delegate per configured agent
  • send_message(...) and stream_message(...) on the facade resolve the room target and forward the call to the matching delegate
  • for each request, the delegate creates a fresh upstream AgentApi for its configured agent_id
  • no long-lived AgentApi instances are cached by user

This preserves the current fresh-connection-per-request behavior while avoiding a protocol break for Telegram or other surfaces.
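A minimal sketch of the facade-and-delegate shape follows. It is not the real PlatformClient protocol: the `send_message(chat_id, text)` signature, `AgentDelegate`, `RoomBinding`, and `RoutingFacade` are all assumptions made for illustration, and only the message path is shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Protocol


class AgentDelegate(Protocol):
    """Assumed shape of a per-agent client; the real signature may differ."""

    def send_message(self, platform_chat_id: str, text: str) -> str: ...


@dataclass(frozen=True)
class RoomBinding:
    """Hypothetical view of the routing fields stored in room metadata."""
    agent_id: str
    platform_chat_id: str


class RoutingFacade:
    """Resolves a room's binding and forwards the call to that agent's delegate."""

    def __init__(self,
                 agent_ids: Iterable[str],
                 make_delegate: Callable[[str], AgentDelegate],
                 lookup_binding: Callable[[str], RoomBinding]) -> None:
        # One delegate per configured agent, created up front.
        self._delegates: Dict[str, AgentDelegate] = {
            agent_id: make_delegate(agent_id) for agent_id in agent_ids
        }
        self._lookup_binding = lookup_binding

    def send_message(self, chat_id: str, text: str) -> str:
        # Translate the local chat id into the room's agent and platform chat id.
        binding = self._lookup_binding(chat_id)
        # Raises KeyError if the room is bound to an agent that is not configured.
        delegate = self._delegates[binding.agent_id]
        return delegate.send_message(binding.platform_chat_id, text)
```

In this sketch the facade caches only the lightweight delegates; each delegate would still open its own upstream connection per request, in line with the rule that no long-lived AgentApi instances are kept.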

Error Handling

Missing or invalid selected agent

If selected_agent_id is absent:

  • ask the user to select an agent

If selected_agent_id points to an agent that no longer exists in config:

  • treat the selection as invalid
  • ask the user to select again

Missing room binding

If the room has no agent_id:

  • bind it only when the user has a valid current selection
  • otherwise return the selection prompt

Stale room

If the room is stale:

  • do not attempt fallback routing
  • do not silently rewrite room metadata
  • instruct the user to run !new

Invalid config

If the bot cannot load a valid agent registry:

  • fail at startup
  • do not start in degraded single-agent mode

Testing Expectations

Tests for this design should cover:

  • config parsing and startup validation
  • selecting an agent persists selected_agent_id
  • selecting an agent inside an unbound room activates that room
  • !new binds the new room to the selected agent
  • messages in a bound room use that room's agent_id
  • stale rooms reject normal messaging with a clear !new instruction
  • returning to the same agent later does not revive stale rooms

Migration Notes

Existing rooms without agent_id

Existing rooms may have platform_chat_id but no agent_id.

For this MVP, treat those rooms as legacy-unbound rooms:

  • if the user has a valid selected agent, the room may be bound on first use
  • if no agent is selected, the room prompts for selection first

No automatic migration across agents is introduced.

Existing users without selected_agent_id

Existing users upgraded from the single-agent model may have working rooms but no stored selected_agent_id.

For this MVP, that is handled explicitly:

  • normal messaging is paused until the user selects an agent
  • the first valid selection can bind an unbound room immediately
  • the surface does not auto-assign a default agent in a multi-agent config

This is intentional. Once more than one agent exists, silent migration would be ambiguous and could route a user to the wrong backend target.