docs: add matrix multi-agent and restart state specs

2026-04-24 12:28:53 +03:00 · 2026-04-24 12:28:53 +03:00 · 59fbb52c20
commit 59fbb52c20
parent 76230392fa
2 changed files with 546 additions and 0 deletions
--- a/docs/superpowers/specs/2026-04-24-matrix-multi-agent-routing-design.md
+++ b/docs/superpowers/specs/2026-04-24-matrix-multi-agent-routing-design.md
@@ -0,0 +1,302 @@
 # Matrix Multi-Agent Routing Design
 ## Goal
 Move the Matrix surface from a single hardcoded upstream agent to a user-selectable multi-agent model, while preserving the existing room-based UX and the current `PlatformClient` boundary.
 The result should be:
 - one Matrix bot can work with multiple upstream agents
 - users can choose an agent from the full configured list
 - each chat is bound to exactly one agent
 - switching the selected agent does not silently retarget an existing chat
 ## Core Decision
 The selected routing model is:
 `user.selected_agent_id + room.agent_id + room.platform_chat_id`
 This means:
 - the user has one current selected agent
 - each Matrix working room stores the agent it is bound to
 - each Matrix working room stores its own `platform_chat_id`
 - a room never changes agent implicitly
 ## Why This Decision
 The current Matrix adapter already separates:
 - user-facing room organization
 - local chat labels such as `C1`, `C2`, `C3`
 - platform-facing conversation identity via `platform_chat_id`
 Adding multi-agent support should preserve that shape instead of replacing it.
 If routing depended only on the current user selection, then an old room could start talking to a different agent after a switch. That would make room history and backend context hard to reason about. Binding an agent to the room keeps the conversation model explicit.
 ## Scope
 This design covers:
 - agent selection by the user inside the Matrix surface
 - durable storage of the selected agent
 - durable storage of the room-bound agent
 - routing normal messages and context commands to the correct upstream agent
 - behavior when a room becomes stale after an agent switch
 This design does not cover:
 - per-agent workspace isolation
 - platform-side agent lifecycle or memory persistence
 - per-user allowlists for available agents
 - Telegram or other surfaces
 ## Configuration Model
 ### Agent registry
 Available agents are defined in a local config file (`agents.yaml`) that is loaded once at bot startup.
 Example:
 ```yaml
 agents:
  - id: agent-1
    label: Analyst
  - id: agent-2
    label: Research
  - id: agent-3
    label: Ops
 ```
 Rules:
 - every entry must have a stable `id`
 - every entry must have a user-visible `label`
 - all configured agents are selectable by all users
 - config changes apply only after bot restart
 ### Startup validation
 If the agent config is missing, empty, or invalid, the Matrix bot must fail fast on startup with a clear operator error.
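The registry rules and the fail-fast requirement could be sketched as below. This is a minimal illustration, not the project's loader: the names `AgentConfigError`, `Agent`, and `build_registry` are hypothetical, the function takes already-parsed config data so it does not depend on a particular YAML library, and rejecting duplicate ids is an assumption derived from the "stable `id`" rule.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Any, Dict, Mapping


class AgentConfigError(Exception):
    """Raised at startup when the agent registry is missing, empty, or invalid."""


@dataclass(frozen=True)
class Agent:
    id: str
    label: str


def build_registry(raw: Any) -> Mapping[str, Agent]:
    """Validate parsed config data and return a read-only id -> Agent mapping."""
    if not isinstance(raw, dict) or not isinstance(raw.get("agents"), list):
        raise AgentConfigError("agent config must contain an 'agents' list")
    if not raw["agents"]:
        raise AgentConfigError("agent config lists no agents")

    registry: Dict[str, Agent] = {}
    for index, entry in enumerate(raw["agents"]):
        if not isinstance(entry, dict):
            raise AgentConfigError(f"agents[{index}] must be a mapping")
        agent_id, label = entry.get("id"), entry.get("label")
        if not isinstance(agent_id, str) or not agent_id.strip():
            raise AgentConfigError(f"agents[{index}] has no usable 'id'")
        if not isinstance(label, str) or not label.strip():
            raise AgentConfigError(f"agents[{index}] has no usable 'label'")
        if agent_id in registry:
            # Assumption: a stable id must also be unique.
            raise AgentConfigError(f"duplicate agent id: {agent_id}")
        registry[agent_id] = Agent(id=agent_id, label=label)

    # MappingProxyType gives runtime code a read-only view of the registry.
    return MappingProxyType(registry)
```

Startup code would call `build_registry` once on the parsed file and let `AgentConfigError` abort the process, so message handlers only ever see a validated, immutable mapping.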
 ## Durable State Model
 ### User-level state
 User metadata keeps the current selected agent.
 Example `matrix_user:*` shape:
 ```json
 {
  "space_id": "!space:example.org",
  "next_chat_index": 4,
  "selected_agent_id": "agent-2"
 }
 ```
 Meaning:
 - `selected_agent_id` controls future chat creation and activation of an unbound room
 - `selected_agent_id` does not rewrite already bound rooms
 ### Room-level state
 Room metadata stores the agent bound to that chat.
 Example `matrix_room:*` shape:
 ```json
 {
  "room_type": "chat",
  "chat_id": "C3",
  "display_name": "Chat 3",
  "matrix_user_id": "@alice:example.org",
  "space_id": "!space:example.org",
  "platform_chat_id": "42",
  "agent_id": "agent-2"
 }
 ```
 Rules:
 - one room binds to exactly one `agent_id`
 - one room binds to exactly one current `platform_chat_id`
 - once a room becomes stale after an agent switch, it never becomes active again
 ## Runtime Semantics
 ### `!start`
 `!start` remains lightweight:
 - if no agent is selected, the bot explains that an agent must be selected before normal messaging
 - if an agent is already selected, the bot reports the current selection and reminds the user that `!new` creates a new room under that agent
 ### `!agent`
 Introduce an agent-selection command.
 Behavior:
 - `!agent` shows the available agent list
 - agent selection stores `selected_agent_id` in user metadata
 - after a successful switch, the bot tells the user that existing chats bound to another agent are stale and that `!new` is required for continued work
 The exact UI can be text-first for MVP. A richer UI can be added later without changing the state model.
 ### Normal message without selected agent
 If the user has not selected an agent yet:
 - do not call the platform
 - return the available agent list
 - ask the user to choose one first
 ### Selecting an agent inside an unbound chat
 If the current room has never been bound to any agent:
 - store the new `selected_agent_id` for the user
 - bind the current room to that same `agent_id`
 - allow the room to become the active working chat immediately
 This avoids forcing `!new` for the user's first usable chat.
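A minimal sketch of this selection step, assuming user and room metadata are plain dictionaries shaped like the examples above. The function name `select_agent` and the outcome strings are hypothetical; only the field names come from this document.

```python
from typing import Any, Dict, Iterable


def select_agent(
    user_meta: Dict[str, Any],
    room_meta: Dict[str, Any],
    agent_id: str,
    known_agent_ids: Iterable[str],
) -> str:
    """Apply an agent selection made inside a room and report the outcome."""
    if agent_id not in set(known_agent_ids):
        return "unknown_agent"  # nothing is stored for an unknown id

    # The selection is always stored on the user.
    user_meta["selected_agent_id"] = agent_id

    if room_meta.get("agent_id") is None:
        # Unbound room: bind it so it becomes the working chat immediately.
        room_meta["agent_id"] = agent_id
        return "bound_current_room"

    # A room that is already bound is never rebound by a selection.
    return "room_unchanged"
```

The caller would persist both dictionaries and, on `room_unchanged`, send the reminder that chats bound to another agent need `!new`.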
 ### `!new`
 `!new` creates a new working room under the current selected agent.
 Behavior:
 1. require `selected_agent_id`
 2. create the new Matrix room
 3. allocate a new `platform_chat_id`
 4. store `agent_id = selected_agent_id` in the new room metadata
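The four steps above can be sketched as one function. This is an illustration under assumptions: `create_chat`, `create_room`, and `allocate_platform_chat_id` are hypothetical names for the real Matrix and platform calls, and deriving the chat label from `next_chat_index` is inferred from the metadata examples rather than stated by this spec.

```python
from typing import Any, Callable, Dict, Optional, Tuple


def create_chat(
    matrix_user_id: str,
    user_meta: Dict[str, Any],
    create_room: Callable[[str], str],
    allocate_platform_chat_id: Callable[[], str],
) -> Optional[Tuple[str, Dict[str, Any]]]:
    """Handle `!new`: return (room_id, room_meta), or None if no agent is selected."""
    agent_id = user_meta.get("selected_agent_id")
    if agent_id is None:
        return None  # step 1: caller shows the selection prompt instead

    index = user_meta.get("next_chat_index", 1)
    chat_id = f"C{index}"  # assumption: labels follow next_chat_index

    room_id = create_room(chat_id)                  # step 2
    platform_chat_id = allocate_platform_chat_id()  # step 3
    room_meta = {
        "room_type": "chat",
        "chat_id": chat_id,
        "matrix_user_id": matrix_user_id,
        "space_id": user_meta.get("space_id"),
        "platform_chat_id": platform_chat_id,
        "agent_id": agent_id,                       # step 4
    }
    user_meta["next_chat_index"] = index + 1
    return room_id, room_meta
```

Binding `agent_id` at creation time means the new room never passes through the unbound state.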
 ### Normal message in an unbound room with selected agent
 If a room exists but has no `agent_id` yet and the user already has `selected_agent_id`:
 - bind the room to `selected_agent_id`
 - ensure it has `platform_chat_id`
 - continue normal message dispatch
 ### Normal message in a bound room
 If the room already has `agent_id` and it matches the current selected agent:
 - route the message to that `agent_id`
 - use the room's `platform_chat_id`
 ### Stale room after agent switch
 If the room's bound `agent_id` differs from the user's current `selected_agent_id`:
 - do not call the platform
 - treat the room as stale
 - return a short message telling the user that this chat belongs to the old agent and that they must use `!new`
 ### Returning to a previously selected agent
 If the user later selects an old agent again:
 - previously stale rooms do not become valid again
 - the user must still create a fresh room via `!new`
 ## Routing and Component Changes
 ### Agent registry loader
 Add a small loader responsible for:
 - reading `agents.yaml`
 - validating ids and labels
 - exposing a read-only registry to runtime code
 The runtime should not parse YAML ad hoc during message handling.
 ### Matrix runtime pre-check
 Before dispatching a normal message, the Matrix runtime must resolve:
 - whether the user has `selected_agent_id`
 - whether the current room already has `agent_id`
 - whether the room can be bound now
 - whether the room is stale
 This pre-check happens before handing the message to the existing dispatcher path.
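One way to express the pre-check is a pure function that maps the persisted state to a routing decision. The sketch below is illustrative: `Action`, `Decision`, and `precheck` are hypothetical names. The `room_marked_stale` flag is also an assumption; this design requires that a stale room is never revived, but it does not say how that fact is recorded, and comparing ids alone would let a room match again after the user returns to its agent.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Collection, Optional


class Action(Enum):
    PROMPT_SELECTION = "prompt_selection"    # no valid selection: show agent list
    BIND_AND_DISPATCH = "bind_and_dispatch"  # unbound room: bind, then dispatch
    DISPATCH = "dispatch"                    # bound room matching the selection
    REJECT_STALE = "reject_stale"            # tell the user to run !new


@dataclass(frozen=True)
class Decision:
    action: Action
    agent_id: Optional[str] = None  # set only when a message will be dispatched


def precheck(
    selected_agent_id: Optional[str],
    room_agent_id: Optional[str],
    known_agent_ids: Collection[str],
    room_marked_stale: bool = False,  # hypothetical durable marker
) -> Decision:
    """Decide how to handle a normal message before it reaches the dispatcher."""
    if selected_agent_id is None or selected_agent_id not in known_agent_ids:
        return Decision(Action.PROMPT_SELECTION)
    if room_agent_id is None:
        return Decision(Action.BIND_AND_DISPATCH, selected_agent_id)
    if room_marked_stale or room_agent_id != selected_agent_id:
        return Decision(Action.REJECT_STALE)
    return Decision(Action.DISPATCH, room_agent_id)
```

Because the function has no side effects, the runtime can persist any binding after the decision is made, and every branch can be unit-tested without a Matrix client.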
 ### Real platform bridge
 The current real backend path hardcodes a single runtime-level `agent_id`.
 That must be replaced with per-request routing.
 The selected design is:
 - the runtime resolves the target `agent_id`
 - the platform bridge creates a fresh upstream `AgentApi` for that `agent_id`
 - no long-lived `AgentApi` instances are cached by user
 This preserves the current fresh-connection-per-request behavior.
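A minimal sketch of per-request routing in the bridge. The real `AgentApi` interface is not described in this document, so the sketch uses a hypothetical `AgentClient` protocol with a single synchronous `send` method and a factory that builds a client for a given `agent_id`; the actual bridge may well be asynchronous.

```python
from typing import Callable, Protocol


class AgentClient(Protocol):
    """Hypothetical stand-in for the upstream AgentApi."""

    def send(self, platform_chat_id: str, text: str) -> str: ...


class PlatformBridge:
    def __init__(self, client_factory: Callable[[str], AgentClient]) -> None:
        # The bridge keeps only the factory, never a client instance.
        self._client_factory = client_factory

    def dispatch(self, agent_id: str, platform_chat_id: str, text: str) -> str:
        # A fresh client is created for every request and dropped afterwards.
        client = self._client_factory(agent_id)
        return client.send(platform_chat_id, text)
```

Injecting the factory keeps the bridge free of any runtime-level `agent_id` and lets tests substitute a fake client.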
 ## Error Handling
 ### Missing or invalid selected agent
 If `selected_agent_id` is absent:
 - ask the user to select an agent
 If `selected_agent_id` points to an agent that no longer exists in config:
 - treat the selection as invalid
 - ask the user to select again
 ### Missing room binding
 If the room has no `agent_id`:
 - bind it only when the user has a valid current selection
 - otherwise return the selection prompt
 ### Stale room
 If the room is stale:
 - do not attempt fallback routing
 - do not silently rewrite room metadata
 - instruct the user to run `!new`
 ### Invalid config
 If the bot cannot load a valid agent registry:
 - fail at startup
 - do not start in degraded single-agent mode
 ## Testing Expectations
 Tests for this design should prove:
 - the agent config is parsed and validated at startup, and a missing, empty, or invalid config fails fast
 - selecting an agent persists `selected_agent_id`
 - selecting an agent inside an unbound room activates that room
 - `!new` binds the new room to the selected agent
 - messages in a bound room use that room's `agent_id`
 - stale rooms reject normal messaging with a clear `!new` instruction
 - returning to the same agent later does not revive stale rooms
 ## Migration Notes
 Existing rooms may have `platform_chat_id` but no `agent_id`.
 For this MVP, treat those rooms as legacy-unbound rooms:
 - if the user has a valid selected agent, the room may be bound on first use
 - if no agent is selected, the room prompts for selection first
 No automatic migration across agents is introduced.
--- a/docs/superpowers/specs/2026-04-24-matrix-surface-restart-state-persistence-design.md
+++ b/docs/superpowers/specs/2026-04-24-matrix-surface-restart-state-persistence-design.md
@@ -0,0 +1,244 @@
 # Matrix Surface Restart State Persistence Design
 ## Goal
 Make the Matrix surface survive a normal restart or container recreate without losing the minimal state required to keep working as a bot.
 The result should be:
 - after restart, the bot can still answer messages and execute commands
 - the bot remembers the selected agent for each user
 - the bot remembers which agent and `platform_chat_id` each room is bound to
 - temporary UX flows may be lost without being treated as a bug
 ## Core Decision
 The selected persistence model is:
 `durable surface state only`
 This means:
 - persist only the state needed for routing and normal command handling
 - do not persist temporary UI and wizard state
 - require persistent local storage for the surface
 - do not attempt recovery if that persistent storage is lost
 ## Why This Decision
 The Matrix surface already has two different classes of state:
 - stable local state that defines how rooms and users are routed
 - temporary UX state that exists only to complete short-lived interactions
 Trying to make all temporary UX state survive restart would add complexity and edge cases without improving the core requirement: the bot should still function normally after restart.
 The chosen design keeps persistence aligned with what the surface actually owns:
 - Matrix-side metadata and routing state are durable
 - agent conversation memory is the platform's responsibility
 - lost local volumes are treated as environment reset, not as an auto-recovery scenario
 ## Scope
 This design covers:
 - which Matrix surface data must persist across restart
 - where that data lives
 - how restart behavior interacts with multi-agent routing
 - what state is intentionally non-durable
 This design does not cover:
 - platform-side persistence of agent memory
 - workspace isolation between multiple agents
 - automatic reconstruction after total local volume loss
 - persistence of temporary UX flows
 ## Persistence Boundary
 ### Durable state
 The Matrix surface must persist:
 - `matrix_user:*`
 - `matrix_room:*`
 - `chat:*`
 - `selected_agent_id`
 - room-bound `agent_id`
 - room-bound `platform_chat_id`
 This is the minimal state required so that, after restart, the surface can:
 - identify the user
 - identify the room
 - determine which agent should receive a message
 - determine which `platform_chat_id` should be used
 ### Non-durable state
 The Matrix surface does not need to persist:
 - staged attachments
 - pending `!load` selection
 - pending `!yes/!no` confirmation
 - any temporary service UI step
 - live `AgentApi` instances or connection objects
 After restart, those flows may be lost. The bot only needs to remain operational.
 ## Storage Model
 ### Surface durable storage
 The Matrix surface must use persistent storage for:
 - `lambda_matrix.db`
 - `matrix_store`
 `lambda_matrix.db` stores the local key-value state used by the surface.
 `matrix_store` stores Matrix client state needed by `nio`.
 These paths must be backed by persistent container storage in normal deployments.
 ### Shared `/workspace`
 The current local runtime also uses `/workspace`, but workspace behavior is outside the scope of this design.
 For this document, the only requirement is:
 - do not make restart persistence depend on solving per-agent workspace isolation first
 ## Restart Assumptions
 This design assumes:
 - normal restart or redeploy with persistent local volumes still present
 This design does not assume:
 - automatic recovery after deleting or losing those volumes
 If the relevant volumes are lost, the environment is treated as reset.
 ## Data Model Requirements
 ### User metadata
 User metadata remains the durable location for user-level routing state.
 Example:
 ```json
 {
  "space_id": "!space:example.org",
  "next_chat_index": 4,
  "selected_agent_id": "agent-2"
 }
 ```
 ### Room metadata
 Room metadata remains the durable location for room-level routing state.
 Example:
 ```json
 {
  "room_type": "chat",
  "chat_id": "C3",
  "display_name": "Chat 3",
  "matrix_user_id": "@alice:example.org",
  "space_id": "!space:example.org",
  "platform_chat_id": "42",
  "agent_id": "agent-2"
 }
 ```
 ## Runtime Semantics After Restart
 After restart, the Matrix surface must:
 1. load the durable Matrix store
 2. load the durable surface key-value state
 3. load the agent registry config
 4. resume normal room routing using persisted `selected_agent_id`, `agent_id`, and `platform_chat_id`
 Expected behavior:
 - a user with a valid previously selected agent does not need to reselect it
 - a room previously bound to an agent remains bound to that agent
 - normal messages and commands continue to work
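The restart behavior can be illustrated with a small file-backed key-value store. This is a sketch under assumptions: the schema of `lambda_matrix.db` is not described here, so the example uses its own SQLite table, and the `KeyValueStore` class and the exact key suffixes are hypothetical.

```python
import json
import sqlite3
from typing import Any, Optional


class KeyValueStore:
    """File-backed JSON key-value store (illustrative schema)."""

    def __init__(self, path: str) -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )
        self._conn.commit()

    def put(self, key: str, value: Any) -> None:
        # Commit on every write so a restart cannot lose routing state.
        self._conn.execute(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self._conn.commit()

    def get(self, key: str) -> Optional[Any]:
        row = self._conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def close(self) -> None:
        self._conn.close()
```

Opening a second `KeyValueStore` on the same path after closing the first simulates a restart: `get("matrix_user:...")` returns the stored `selected_agent_id`, and `get("matrix_room:...")` returns the bound `agent_id` and `platform_chat_id`, provided the file lives on persistent storage.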
 ### Lost temporary UX state
 If the bot restarts during a transient UX flow:
 - staged attachments may disappear
 - pending `!load` selections may disappear
 - pending confirmations may disappear
 This is acceptable and should not block normal operation after restart.
 ## Interaction With Multi-Agent Routing
 The multi-agent design introduces new durable state that must survive restart:
 - `selected_agent_id` on the user
 - `agent_id` on the room
 Restart persistence and multi-agent routing therefore belong together.
 Without durable storage for those fields, a restart would make room routing ambiguous.
 ## Failure Handling
 ### Missing durable surface store
 If the durable store paths are missing because the environment was reset:
 - do not attempt to reconstruct the previous working state
 - treat startup as a clean environment
 - allow normal onboarding flows to begin again
 ### Invalid durable references
 If persisted `selected_agent_id` or room `agent_id` references an agent no longer present in config:
 - do not crash
 - treat the selection or room binding as invalid
 - ask the user to select a valid agent again
 ### Platform conversation memory
 If the upstream platform loses agent memory across restart:
 - that is outside the surface persistence boundary
 - the surface must still route correctly
 - platform memory persistence remains a platform responsibility
 ## Testing Expectations
 Tests for this design should prove:
 - `selected_agent_id` survives restart through durable local storage
 - room `agent_id` and `platform_chat_id` survive restart through durable local storage
 - the bot can route messages correctly after restart without user reconfiguration
 - missing temporary UX state does not break normal messaging and command handling
 - invalid persisted agent references degrade into reselection prompts rather than crashes
 ## Operational Notes
 For the Matrix surface to survive restart in the intended way, deployment must persist:
 - `lambda_matrix.db`
 - `matrix_store`
 This is a deployment requirement, not an optional optimization.
 The design intentionally stops there. It does not require:
 - hot reload of agent config
 - recovery after total local state loss
 - persistence of temporary UX flows
 - a solved multi-agent workspace story