docs: add matrix multi-agent and restart state specs

2026-04-24 12:28:53 +03:00 · 2026-04-24 12:28:53 +03:00 · 59fbb52c20
commit 59fbb52c20
parent 76230392fa
2 changed files with 546 additions and 0 deletions
--- a/docs/superpowers/specs/2026-04-24-matrix-multi-agent-routing-design.md
+++ b/docs/superpowers/specs/2026-04-24-matrix-multi-agent-routing-design.md
@@ -0,0 +1,302 @@
+# Matrix Multi-Agent Routing Design
+
+## Goal
+
+Move the Matrix surface from a single hardcoded upstream agent to a user-selectable multi-agent model, while preserving the existing room-based UX and the current `PlatformClient` boundary.
+
+The result should be:
+
+- one Matrix bot can work with multiple upstream agents
+- users can choose an agent from the full configured list
+- each chat is bound to exactly one agent
+- switching the selected agent does not silently retarget an existing chat
+
+## Core Decision
+
+The selected routing model is:
+
+`user.selected_agent_id + room.agent_id + room.platform_chat_id`
+
+This means:
+
+- the user has one current selected agent
+- each Matrix working room stores the agent it is bound to
+- each Matrix working room stores its own `platform_chat_id`
+- a room never changes agent implicitly
+
+## Why This Decision
+
+The current Matrix adapter already separates:
+
+- user-facing room organization
+- local chat labels such as `C1`, `C2`, `C3`
+- platform-facing conversation identity via `platform_chat_id`
+
+Adding multi-agent support should preserve that shape instead of replacing it.
+
+If routing depended only on the current user selection, then an old room could start talking to a different agent after a switch. That would make room history and backend context hard to reason about. Binding an agent to the room keeps the conversation model explicit.
+
+## Scope
+
+This design covers:
+
+- agent selection by the user inside the Matrix surface
+- durable storage of the selected agent
+- durable storage of the room-bound agent
+- routing normal messages and context commands to the correct upstream agent
+- behavior when a room becomes stale after an agent switch
+
+This design does not cover:
+
+- per-agent workspace isolation
+- platform-side agent lifecycle or memory persistence
+- per-user allowlists for available agents
+- Telegram or other surfaces
+
+## Configuration Model
+
+### Agent registry
+
+Available agents are defined in a local config file loaded once at bot startup.
+
+Example:
+
+```yaml
+agents:
+  - id: agent-1
+    label: Analyst
+  - id: agent-2
+    label: Research
+  - id: agent-3
+    label: Ops
+```
+
+Rules:
+
+- every entry must have a stable `id`
+- every entry must have a user-visible `label`
+- all configured agents are selectable by all users
+- config changes apply only after bot restart
+
+### Startup validation
+
+If the agent config is missing, empty, or invalid, the Matrix bot must fail fast on startup with a clear operator error.
+
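The validation rules above can be sketched as follows. This is a minimal illustration, not the project's loader: it assumes the YAML file has already been parsed into plain Python data (for example with a YAML library's safe loader), and the names `Agent`, `AgentConfigError`, and `build_registry` are hypothetical. Rejecting duplicate ids is also an assumption, read into the requirement that ids be stable.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Any, Dict, Mapping


class AgentConfigError(Exception):
    """Raised at startup when the agent registry cannot be built."""


@dataclass(frozen=True)
class Agent:
    id: str
    label: str


def build_registry(raw: Any) -> Mapping[str, Agent]:
    """Validate already-parsed config data and return a read-only registry."""
    if not isinstance(raw, dict) or not isinstance(raw.get("agents"), list):
        raise AgentConfigError("agent config must contain an 'agents' list")
    if not raw["agents"]:
        raise AgentConfigError("agent config lists no agents")

    registry: Dict[str, Agent] = {}
    for index, entry in enumerate(raw["agents"]):
        if not isinstance(entry, dict):
            raise AgentConfigError(f"agents[{index}] must be a mapping")
        agent_id, label = entry.get("id"), entry.get("label")
        if not isinstance(agent_id, str) or not agent_id.strip():
            raise AgentConfigError(f"agents[{index}] has no usable 'id'")
        if not isinstance(label, str) or not label.strip():
            raise AgentConfigError(f"agents[{index}] has no usable 'label'")
        if agent_id in registry:  # assumption: ids must be unique
            raise AgentConfigError(f"duplicate agent id: {agent_id!r}")
        registry[agent_id] = Agent(id=agent_id, label=label)

    # MappingProxyType gives runtime code a read-only view of the registry.
    return MappingProxyType(registry)
```

Letting `AgentConfigError` propagate out of the bot's entry point would give the fail-fast behavior described here, with the message serving as the operator-facing error.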
+## Durable State Model
+
+### User-level state
+
+User metadata keeps the current selected agent.
+
+Example `matrix_user:*` shape:
+
+```json
+{
+  "space_id": "!space:example.org",
+  "next_chat_index": 4,
+  "selected_agent_id": "agent-2"
+}
+```
+
+Meaning:
+
+- `selected_agent_id` controls future chat creation and activation of an unbound room
+- `selected_agent_id` does not rewrite already bound rooms
+
+### Room-level state
+
+Room metadata stores the agent bound to that chat.
+
+Example `matrix_room:*` shape:
+
+```json
+{
+  "room_type": "chat",
+  "chat_id": "C3",
+  "display_name": "Chat 3",
+  "matrix_user_id": "@alice:example.org",
+  "space_id": "!space:example.org",
+  "platform_chat_id": "42",
+  "agent_id": "agent-2"
+}
+```
+
+Rules:
+
+- one room binds to exactly one `agent_id`
+- one room binds to exactly one current `platform_chat_id`
+- once a room becomes stale after an agent switch, it never becomes active again
+
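As an illustration of the two record shapes, the sketch below models them as dataclasses with a JSON round trip. The class and function names (`UserMeta`, `RoomMeta`, `dump`, `load_room`) are hypothetical, and treating `agent_id` and `platform_chat_id` as optional is an assumption made so that a room stored without them still loads.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class UserMeta:
    space_id: str
    next_chat_index: int
    selected_agent_id: Optional[str] = None  # None until the user picks an agent


@dataclass
class RoomMeta:
    room_type: str
    chat_id: str
    display_name: str
    matrix_user_id: str
    space_id: str
    platform_chat_id: Optional[str] = None
    agent_id: Optional[str] = None  # None marks a room that is not bound yet


def dump(meta) -> str:
    """Serialize a metadata record to the JSON stored under its key."""
    return json.dumps(asdict(meta), ensure_ascii=False)


def load_room(raw: str) -> RoomMeta:
    """Parse stored room JSON; absent optional fields fall back to None."""
    return RoomMeta(**json.loads(raw))
```

With this shape, a room record written before `agent_id` existed loads with `agent_id` set to `None`, which is how the rest of this design recognizes an unbound room.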
+## Runtime Semantics
+
+### `!start`
+
+`!start` remains lightweight:
+
+- if no agent is selected, the bot explains that an agent must be selected before normal messaging
+- if an agent is already selected, the bot reports the current selection and reminds the user that `!new` creates a new room under that agent
+
+### `!agent`
+
+Introduce an agent-selection command.
+
+Behavior:
+
+- `!agent` shows the available agent list
+- agent selection stores `selected_agent_id` in user metadata
+- after a successful switch, the bot tells the user that existing chats bound to another agent are stale and that `!new` is required for continued work
+
+The exact UI can be text-first for MVP. A richer UI can be added later without changing the state model.
+
+### Normal message without selected agent
+
+If the user has not selected an agent yet:
+
+- do not call the platform
+- return the available agent list
+- ask the user to choose one first
+
+### Selecting an agent inside an unbound chat
+
+If the current room has never been bound to any agent:
+
+- store the new `selected_agent_id` for the user
+- bind the current room to that same `agent_id`
+- allow the room to become the active working chat immediately
+
+This avoids forcing `!new` for the user's first usable chat.
+
+### `!new`
+
+`!new` creates a new working room under the current selected agent.
+
+Behavior:
+
+1. require `selected_agent_id`
+2. create the new Matrix room
+3. allocate a new `platform_chat_id`
+4. store `agent_id = selected_agent_id` in the new room metadata
+
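The four steps above might look like this in code. It is a sketch under stated assumptions: `create_matrix_room` and `allocate_platform_chat_id` are hypothetical callables standing in for the real Matrix and platform calls, metadata is held in plain dicts, and advancing `next_chat_index` after room creation is inferred from the field name rather than specified here.

```python
from typing import Callable, Dict


class NoAgentSelected(Exception):
    """Raised when !new is used before any agent has been selected."""


def handle_new(
    user_meta: dict,
    rooms: Dict[str, dict],
    create_matrix_room: Callable[[str], str],
    allocate_platform_chat_id: Callable[[], str],
) -> str:
    # Step 1: require selected_agent_id before touching Matrix or the platform.
    agent_id = user_meta.get("selected_agent_id")
    if not agent_id:
        raise NoAgentSelected("select an agent with !agent before using !new")

    # Step 2: create the new Matrix room under the next local chat label.
    chat_id = f"C{user_meta['next_chat_index']}"
    room_id = create_matrix_room(chat_id)

    # Step 3: allocate a platform_chat_id for this room only.
    platform_chat_id = allocate_platform_chat_id()

    # Step 4: bind the room to the agent selected at creation time.
    rooms[room_id] = {
        "room_type": "chat",
        "chat_id": chat_id,
        "platform_chat_id": platform_chat_id,
        "agent_id": agent_id,
    }
    user_meta["next_chat_index"] += 1  # assumption: index advances per new chat
    return room_id
```

Because the binding is copied into the room record, a later change to `selected_agent_id` leaves this room pointing at the agent it was created under.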
+### Normal message in an unbound room with selected agent
+
+If a room exists but has no `agent_id` yet and the user already has `selected_agent_id`:
+
+- bind the room to `selected_agent_id`
+- ensure it has `platform_chat_id`
+- continue normal message dispatch
+
+### Normal message in a bound room
+
+If the room already has `agent_id` and it matches the current selected agent:
+
+- route the message to that `agent_id`
+- use the room's `platform_chat_id`
+
+### Stale room after agent switch
+
+If the room's bound `agent_id` differs from the user's current `selected_agent_id`:
+
+- do not call the platform
+- treat the room as stale
+- return a short message telling the user that this chat belongs to the old agent and that they must use `!new`
+
+### Returning to a previously selected agent
+
+If the user later selects an old agent again:
+
+- previously stale rooms do not become valid again
+- the user must still create a fresh room via `!new`
+
+## Routing and Component Changes
+
+### Agent registry loader
+
+Add a small loader responsible for:
+
+- reading `agents.yaml`
+- validating ids and labels
+- exposing a read-only registry to runtime code
+
+The runtime should not parse YAML ad hoc during message handling.
+
+### Matrix runtime pre-check
+
+Before dispatching a normal message, the Matrix runtime must resolve:
+
+- whether the user has `selected_agent_id`
+- whether the current room already has `agent_id`
+- whether the room can be bound now
+- whether the room is stale
+
+This pre-check happens before handing the message to the existing dispatcher path.
+
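One way to express this pre-check is a pure function that maps the persisted ids to a single decision. The sketch below is illustrative only: `Decision` and `precheck` are hypothetical names, and `room_marked_stale` is an assumed flag. This design says a stale room never becomes active again but does not name a field for recording that, so the flag stands in for whatever mechanism is chosen.

```python
from enum import Enum
from typing import Collection, Optional, Tuple


class Decision(Enum):
    PROMPT_SELECT = "prompt_select"    # no valid selection: show the agent list
    BIND_AND_ROUTE = "bind_and_route"  # unbound room adopts the selected agent
    ROUTE = "route"                    # bound room matches the selection
    STALE = "stale"                    # bound to another agent: require !new


def precheck(
    selected_agent_id: Optional[str],
    room_agent_id: Optional[str],
    known_agent_ids: Collection[str],
    room_marked_stale: bool = False,  # assumption: not a field named in this spec
) -> Tuple[Decision, Optional[str]]:
    """Return the decision and, when routing is allowed, the target agent id."""
    if selected_agent_id is None or selected_agent_id not in known_agent_ids:
        return Decision.PROMPT_SELECT, None
    if room_agent_id is None:
        return Decision.BIND_AND_ROUTE, selected_agent_id
    if room_marked_stale or room_agent_id != selected_agent_id:
        return Decision.STALE, None
    return Decision.ROUTE, room_agent_id
```

Only `BIND_AND_ROUTE` and `ROUTE` return an agent id, so the dispatcher path never has a target to call the platform with when the user must select an agent or run `!new`.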
+### Real platform bridge
+
+The current real backend path hardcodes a single runtime-level `agent_id`.
+That must be replaced with per-request routing.
+
+The selected design is:
+
+- the runtime resolves the target `agent_id`
+- the platform bridge creates a fresh upstream `AgentApi` for that `agent_id`
+- no long-lived `AgentApi` instances are cached by user
+
+This preserves the current fresh-connection-per-request behavior.
+
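A sketch of per-request routing is shown below. The `AgentApi` class here is a stand-in written for this example; the real upstream client's constructor and methods may differ, and `PlatformBridge`, `send`, and `close` are hypothetical names.

```python
import asyncio
from typing import Callable


class AgentApi:
    """Stand-in for the upstream client; the real API is assumed, not known."""

    def __init__(self, agent_id: str) -> None:
        self.agent_id = agent_id
        self.closed = False

    async def send(self, platform_chat_id: str, text: str) -> str:
        # Echo the routing target so the example is observable without a backend.
        return f"[{self.agent_id}/{platform_chat_id}] {text}"

    async def close(self) -> None:
        self.closed = True


class PlatformBridge:
    def __init__(self, api_factory: Callable[[str], AgentApi] = AgentApi) -> None:
        self._api_factory = api_factory

    async def send_message(self, agent_id: str, platform_chat_id: str, text: str) -> str:
        # A fresh client per request: nothing is cached per user or per agent.
        api = self._api_factory(agent_id)
        try:
            return await api.send(platform_chat_id, text)
        finally:
            await api.close()


if __name__ == "__main__":
    bridge = PlatformBridge()
    print(asyncio.run(bridge.send_message("agent-2", "42", "hello")))
```

Passing the factory in keeps the bridge free of any runtime-level `agent_id`; the target arrives with each request, already resolved by the runtime.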
+## Error Handling
+
+### Missing or invalid selected agent
+
+If `selected_agent_id` is absent:
+
+- ask the user to select an agent
+
+If `selected_agent_id` points to an agent that no longer exists in config:
+
+- treat the selection as invalid
+- ask the user to select again
+
+### Missing room binding
+
+If the room has no `agent_id`:
+
+- bind it only when the user has a valid current selection
+- otherwise return the selection prompt
+
+### Stale room
+
+If the room is stale:
+
+- do not attempt fallback routing
+- do not silently rewrite room metadata
+- instruct the user to run `!new`
+
+### Invalid config
+
+If the bot cannot load a valid agent registry:
+
+- fail at startup
+- do not start in degraded single-agent mode
+
+## Testing Expectations
+
+Tests for this design should prove:
+
+- a valid agent config parses, and a missing, empty, or invalid one fails startup
+- selecting an agent persists `selected_agent_id`
+- selecting an agent inside an unbound room activates that room
+- `!new` binds the new room to the selected agent
+- messages in a bound room use that room's `agent_id`
+- stale rooms reject normal messaging with a clear `!new` instruction
+- returning to the same agent later does not revive stale rooms
+
+## Migration Notes
+
+Existing rooms may have `platform_chat_id` but no `agent_id`.
+
+For this MVP, treat those rooms as legacy-unbound rooms:
+
+- if the user has a valid selected agent, the room may be bound on first use
+- if no agent is selected, the room prompts for selection first
+
+No automatic migration across agents is introduced.
--- a/docs/superpowers/specs/2026-04-24-matrix-surface-restart-state-persistence-design.md
+++ b/docs/superpowers/specs/2026-04-24-matrix-surface-restart-state-persistence-design.md
@@ -0,0 +1,244 @@
+# Matrix Surface Restart State Persistence Design
+
+## Goal
+
+Make the Matrix surface survive a normal restart or container recreate without losing the minimal state required to keep working as a bot.
+
+The result should be:
+
+- after restart, the bot can still answer messages and execute commands
+- the bot remembers the selected agent for each user
+- the bot remembers which agent and `platform_chat_id` each room is bound to
+- temporary UX flows may be lost without being treated as a bug
+
+## Core Decision
+
+The selected persistence model is:
+
+`durable surface state only`
+
+This means:
+
+- persist only the state needed for routing and normal command handling
+- do not persist temporary UI and wizard state
+- require persistent local storage for the surface
+- do not attempt recovery if that storage is lost
+
+## Why This Decision
+
+The Matrix surface already has two different classes of state:
+
+- stable local state that defines how rooms and users are routed
+- temporary UX state that exists only to complete short-lived interactions
+
+Trying to make all temporary UX state survive restart would add complexity and edge cases without improving the core requirement: the bot should still function normally after restart.
+
+The chosen design keeps persistence aligned with what the surface actually owns:
+
+- Matrix-side metadata and routing state are durable
+- agent conversation memory is the platform's responsibility
+- lost local volumes are treated as environment reset, not as an auto-recovery scenario
+
+## Scope
+
+This design covers:
+
+- which Matrix surface data must persist across restart
+- where that data lives
+- how restart behavior interacts with multi-agent routing
+- what state is intentionally non-durable
+
+This design does not cover:
+
+- platform-side persistence of agent memory
+- workspace isolation between multiple agents
+- automatic reconstruction after total local volume loss
+- persistence of temporary UX flows
+
+## Persistence Boundary
+
+### Durable state
+
+The Matrix surface must persist:
+
+- `matrix_user:*` records, including `selected_agent_id`
+- `matrix_room:*` records, including the room-bound `agent_id` and `platform_chat_id`
+- `chat:*` records
+
+This is the minimal state required so that, after restart, the surface can:
+
+- identify the user
+- identify the room
+- determine which agent should receive a message
+- determine which `platform_chat_id` should be used
+
+### Non-durable state
+
+The Matrix surface does not need to persist:
+
+- staged attachments
+- pending `!load` selection
+- pending `!yes` / `!no` confirmation
+- any temporary service UI step
+- live `AgentApi` instances or connection objects
+
+After restart, those flows may be lost. The bot only needs to remain operational.
+
+## Storage Model
+
+### Surface durable storage
+
+The Matrix surface must use persistent storage for:
+
+- `lambda_matrix.db`
+- `matrix_store`
+
+`lambda_matrix.db` stores the local key-value state used by the surface.
+`matrix_store` stores Matrix client state needed by `nio`.
+
+These paths must be backed by persistent container storage in normal deployments.
+
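To make the durability requirement concrete, here is a minimal key-value store sketch. It assumes `lambda_matrix.db` is a SQLite file, which this document does not state; the `DurableKV` class, its table layout, and its method names are hypothetical.

```python
import json
import sqlite3
from typing import Any, Optional


class DurableKV:
    """Tiny JSON-valued key-value store backed by a SQLite file."""

    def __init__(self, path: str) -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )
        self._conn.commit()

    def put(self, key: str, value: Any) -> None:
        # The connection context manager commits on success and rolls back on error.
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
                (key, json.dumps(value, ensure_ascii=False)),
            )

    def get(self, key: str) -> Optional[Any]:
        row = self._conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        return None if row is None else json.loads(row[0])

    def close(self) -> None:
        self._conn.close()
```

If the file sits on a persistent volume, reopening it after a restart returns the same `matrix_user:*` and `matrix_room:*` records; if the volume is gone, `get` returns `None` for every key, which matches the clean-environment behavior this design accepts.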
+### Shared `/workspace`
+
+The current local runtime also uses `/workspace`, but workspace behavior is outside the scope of this design.
+
+For this document, the only requirement is:
+
+- do not make restart persistence depend on solving per-agent workspace isolation first
+
+## Restart Assumptions
+
+This design assumes:
+
+- normal restart or redeploy with persistent local volumes still present
+
+This design does not assume:
+
+- automatic recovery after deleting or losing those volumes
+
+If the relevant volumes are lost, the environment is treated as reset.
+
+## Data Model Requirements
+
+### User metadata
+
+User metadata remains the durable location for user-level routing state.
+
+Example:
+
+```json
+{
+  "space_id": "!space:example.org",
+  "next_chat_index": 4,
+  "selected_agent_id": "agent-2"
+}
+```
+
+### Room metadata
+
+Room metadata remains the durable location for room-level routing state.
+
+Example:
+
+```json
+{
+  "room_type": "chat",
+  "chat_id": "C3",
+  "display_name": "Chat 3",
+  "matrix_user_id": "@alice:example.org",
+  "space_id": "!space:example.org",
+  "platform_chat_id": "42",
+  "agent_id": "agent-2"
+}
+```
+
+## Runtime Semantics After Restart
+
+After restart, the Matrix surface must:
+
+1. load the durable Matrix store
+2. load the durable surface key-value state
+3. load the agent registry config
+4. resume normal room routing using persisted `selected_agent_id`, `agent_id`, and `platform_chat_id`
+
+Expected behavior:
+
+- a user with a valid previously selected agent does not need to reselect it
+- a room previously bound to an agent remains bound to that agent
+- normal messages and commands continue to work
+
+### Lost temporary UX state
+
+If the bot restarts during a transient UX flow:
+
+- staged attachments may disappear
+- pending `!load` selections may disappear
+- pending confirmations may disappear
+
+This is acceptable and should not block normal operation after restart.
+
+## Interaction With Multi-Agent Routing
+
+The multi-agent design introduces new durable state that must survive restart:
+
+- `selected_agent_id` on the user
+- `agent_id` on the room
+
+Restart persistence and multi-agent routing therefore belong together.
+
+Without durable storage for those fields, a restart would make room routing ambiguous.
+
+## Failure Handling
+
+### Missing durable surface store
+
+If the durable store paths are missing because the environment was reset:
+
+- do not attempt to reconstruct the previous working state; that is out of scope for this design
+- treat startup as a clean environment
+- allow normal onboarding flows to begin again
+
+### Invalid durable references
+
+If persisted `selected_agent_id` or room `agent_id` references an agent no longer present in config:
+
+- do not crash
+- treat the selection or room binding as invalid
+- ask the user to select a valid agent again
+
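The handling of invalid references can be sketched as a check that returns a prompt instead of raising. The function name and the message wording are hypothetical; the sketch deliberately leaves the stored metadata untouched.

```python
from typing import Collection, Optional


def check_persisted_refs(
    selected_agent_id: Optional[str],
    room_agent_id: Optional[str],
    known_agent_ids: Collection[str],
) -> Optional[str]:
    """Return a user-facing prompt if a persisted agent id is invalid, else None."""
    if selected_agent_id is not None and selected_agent_id not in known_agent_ids:
        return (
            "Your previously selected agent is no longer available. "
            "Choose an agent with !agent."
        )
    if room_agent_id is not None and room_agent_id not in known_agent_ids:
        return (
            "This chat is bound to an agent that is no longer available. "
            "Choose an agent with !agent."
        )
    return None  # both references are absent or still valid
```

Running a check like this on each message, rather than once at startup, would keep a single bad record from blocking the whole bot.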
+### Platform conversation memory
+
+If the upstream platform loses agent memory across restart:
+
+- that is outside the surface persistence boundary
+- the surface must still route correctly
+- platform memory persistence remains a platform responsibility
+
+## Testing Expectations
+
+Tests for this design should prove:
+
+- `selected_agent_id` survives restart through durable local storage
+- room `agent_id` and `platform_chat_id` survive restart through durable local storage
+- the bot can route messages correctly after restart without user reconfiguration
+- missing temporary UX state does not break normal messaging and command handling
+- invalid persisted agent references degrade into reselection prompts rather than crashes
+
+## Operational Notes
+
+For the Matrix surface to survive restart in the intended way, deployment must persist:
+
+- `lambda_matrix.db`
+- `matrix_store`
+
+This is a deployment requirement, not an optional optimization.
+
+The design intentionally stops there. It does not require:
+
+- hot reload of agent config
+- recovery after total local state loss
+- persistence of temporary UX flows
+- a solved multi-agent workspace story