From 59fbb52c20d4082a91ea386040bdeabffeb6f633 Mon Sep 17 00:00:00 2001
From: Mikhail Putilovskij
Date: Fri, 24 Apr 2026 12:28:53 +0300
Subject: [PATCH] docs: add matrix multi-agent and restart state specs

---
 ...04-24-matrix-multi-agent-routing-design.md | 302 ++++++++++++++++++
 ...urface-restart-state-persistence-design.md | 244 ++++++++++++++
 2 files changed, 546 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-04-24-matrix-multi-agent-routing-design.md
 create mode 100644 docs/superpowers/specs/2026-04-24-matrix-surface-restart-state-persistence-design.md

diff --git a/docs/superpowers/specs/2026-04-24-matrix-multi-agent-routing-design.md b/docs/superpowers/specs/2026-04-24-matrix-multi-agent-routing-design.md
new file mode 100644
index 0000000..18ce603
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-24-matrix-multi-agent-routing-design.md
@@ -0,0 +1,302 @@
+# Matrix Multi-Agent Routing Design
+
+## Goal
+
+Move the Matrix surface from a single hardcoded upstream agent to a user-selectable multi-agent model, while preserving the existing room-based UX and the current `PlatformClient` boundary.
+
+The result should be:
+
+- one Matrix bot can work with multiple upstream agents
+- users can choose an agent from the full configured list
+- each chat is bound to exactly one agent
+- switching the selected agent does not silently retarget an existing chat
+
+## Core Decision
+
+The selected routing model is:
+
+`user.selected_agent_id + room.agent_id + room.platform_chat_id`
+
+This means:
+
+- the user has one current selected agent
+- each Matrix working room stores the agent it is bound to
+- each Matrix working room stores its own `platform_chat_id`
+- a room never changes agent implicitly
+
+## Why This Decision
+
+The current Matrix adapter already separates:
+
+- user-facing room organization
+- local chat labels such as `C1`, `C2`, `C3`
+- platform-facing conversation identity via `platform_chat_id`
+
+Adding multi-agent support should preserve that shape instead of replacing it.
+
+If routing depended only on the current user selection, then an old room could start talking to a different agent after a switch. That would make room history and backend context hard to reason about. Binding an agent to the room keeps the conversation model explicit.
+
+## Scope
+
+This design covers:
+
+- agent selection by the user inside the Matrix surface
+- durable storage of the selected agent
+- durable storage of the room-bound agent
+- routing normal messages and context commands to the correct upstream agent
+- behavior when a room becomes stale after an agent switch
+
+This design does not cover:
+
+- per-agent workspace isolation
+- platform-side agent lifecycle or memory persistence
+- per-user allowlists for available agents
+- Telegram or other surfaces
+
+## Configuration Model
+
+### Agent registry
+
+Available agents are defined in a local config file loaded once at bot startup.
+
+Example:
+
+```yaml
+agents:
+  - id: agent-1
+    label: Analyst
+  - id: agent-2
+    label: Research
+  - id: agent-3
+    label: Ops
+```
+
+Rules:
+
+- every entry must have a stable `id`
+- every entry must have a user-visible `label`
+- all configured agents are selectable by all users
+- config changes apply only after bot restart
+
+### Startup validation
+
+If the agent config is missing, empty, or invalid, the Matrix bot must fail fast on startup with a clear operator error.
+
+## Durable State Model
+
+### User-level state
+
+User metadata keeps the current selected agent.
+
+Example `matrix_user:*` shape:
+
+```json
+{
+  "space_id": "!space:example.org",
+  "next_chat_index": 4,
+  "selected_agent_id": "agent-2"
+}
+```
+
+Meaning:
+
+- `selected_agent_id` controls future chat creation and activation of an unbound room
+- `selected_agent_id` does not rewrite already bound rooms
+
+### Room-level state
+
+Room metadata stores the agent bound to that chat.
+
+Example `matrix_room:*` shape:
+
+```json
+{
+  "room_type": "chat",
+  "chat_id": "C3",
+  "display_name": "Чат 3",
+  "matrix_user_id": "@alice:example.org",
+  "space_id": "!space:example.org",
+  "platform_chat_id": "42",
+  "agent_id": "agent-2"
+}
+```
+
+Rules:
+
+- one room binds to exactly one `agent_id`
+- one room binds to exactly one current `platform_chat_id`
+- once a room becomes stale after an agent switch, it never becomes active again
+
+## Runtime Semantics
+
+### `!start`
+
+`!start` remains lightweight:
+
+- if no agent is selected, the bot explains that an agent must be selected before normal messaging
+- if an agent is already selected, the bot reports the current selection and reminds the user that `!new` creates a new room under that agent
+
+### `!agent`
+
+Introduce an agent-selection command.
+
+Behavior:
+
+- `!agent` shows the available agent list
+- agent selection stores `selected_agent_id` in user metadata
+- after a successful switch, the bot tells the user that existing chats bound to another agent are stale and that `!new` is required for continued work
+
+The exact UI can be text-first for MVP. A richer UI can be added later without changing the state model.
+
+### Normal message without selected agent
+
+If the user has not selected an agent yet:
+
+- do not call the platform
+- return the available agent list
+- ask the user to choose one first
+
+### Selecting an agent inside an unbound chat
+
+If the current room has never been bound to any agent:
+
+- store the new `selected_agent_id` for the user
+- bind the current room to that same `agent_id`
+- allow the room to become the active working chat immediately
+
+This avoids forcing `!new` for the user's first usable chat.
+
+### `!new`
+
+`!new` creates a new working room under the current selected agent.
+
+Behavior:
+
+1. require `selected_agent_id`
+2. create the new Matrix room
+3. allocate a new `platform_chat_id`
+4. store `agent_id = selected_agent_id` in the new room metadata
+
+### Normal message in an unbound room with selected agent
+
+If a room exists but has no `agent_id` yet and the user already has `selected_agent_id`:
+
+- bind the room to `selected_agent_id`
+- ensure it has `platform_chat_id`
+- continue normal message dispatch
+
+### Normal message in a bound room
+
+If the room already has `agent_id` and it matches the current selected agent:
+
+- route the message to that `agent_id`
+- use the room's `platform_chat_id`
+
+### Stale room after agent switch
+
+If the room's bound `agent_id` differs from the user's current `selected_agent_id`:
+
+- do not call the platform
+- treat the room as stale
+- return a short message telling the user that this chat belongs to the old agent and that they must use `!new`
+
+### Returning to a previously selected agent
+
+If the user later selects an old agent again:
+
+- previously stale rooms do not become valid again
+- the user must still create a fresh room via `!new`
+
+## Routing and Component Changes
+
+### Agent registry loader
+
+Add a small loader responsible for:
+
+- reading `agents.yaml`
+- validating ids and labels
+- exposing a read-only registry to runtime code
+
+The runtime should not parse YAML ad hoc during message handling.
+
+### Matrix runtime pre-check
+
+Before dispatching a normal message, the Matrix runtime must resolve:
+
+- whether the user has `selected_agent_id`
+- whether the current room already has `agent_id`
+- whether the room can be bound now
+- whether the room is stale
+
+This pre-check happens before handing the message to the existing dispatcher path.
+
+### Real platform bridge
+
+The current real backend path hardcodes a single runtime-level `agent_id`.
+That must be replaced with per-request routing.
+
+The selected design is:
+
+- the runtime resolves the target `agent_id`
+- the platform bridge creates a fresh upstream `AgentApi` for that `agent_id`
+- no long-lived `AgentApi` instances are cached by user
+
+This preserves the current fresh-connection-per-request behavior.
+
+## Error Handling
+
+### Missing or invalid selected agent
+
+If `selected_agent_id` is absent:
+
+- ask the user to select an agent
+
+If `selected_agent_id` points to an agent that no longer exists in config:
+
+- treat the selection as invalid
+- ask the user to select again
+
+### Missing room binding
+
+If the room has no `agent_id`:
+
+- bind it only when the user has a valid current selection
+- otherwise return the selection prompt
+
+### Stale room
+
+If the room is stale:
+
+- do not attempt fallback routing
+- do not silently rewrite room metadata
+- instruct the user to run `!new`
+
+### Invalid config
+
+If the bot cannot load a valid agent registry:
+
+- fail at startup
+- do not start in degraded single-agent mode
+
+## Testing Expectations
+
+Tests for this design should prove:
+
+- config parsing and startup validation
+- selecting an agent persists `selected_agent_id`
+- selecting an agent inside an unbound room activates that room
+- `!new` binds the new room to the selected agent
+- messages in a bound room use that room's `agent_id`
+- stale rooms reject normal messaging with a clear `!new` instruction
+- returning to the same agent later does not revive stale rooms
+
+## Migration Notes
+
+Existing rooms may have `platform_chat_id` but no `agent_id`.
+
+For this MVP, treat those rooms as legacy-unbound rooms:
+
+- if the user has a valid selected agent, the room may be bound on first use
+- if no agent is selected, the room prompts for selection first
+
+No automatic migration across agents is introduced.
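
The routing rules in this spec, including the legacy-unbound rooms from the migration notes, can be sketched as one pure decision function that runs before dispatch. This is a minimal sketch under stated assumptions, not the actual runtime code: `Decision`, `PreCheck`, and `pre_check` are hypothetical names, and the metadata dicts only mirror the `matrix_user:*` and `matrix_room:*` example shapes shown in the spec.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Set


class Decision(Enum):
    PROMPT_SELECTION = "prompt_selection"  # no valid selection: show the agent list
    BIND_AND_ROUTE = "bind_and_route"      # unbound room: bind to the selection, then dispatch
    ROUTE = "route"                        # bound room that matches the selection
    STALE = "stale"                        # bound to another agent: instruct the user to run !new


@dataclass(frozen=True)
class PreCheck:
    decision: Decision
    agent_id: Optional[str] = None  # target agent when the message may be dispatched


def pre_check(user_meta: dict, room_meta: dict, known_agent_ids: Set[str]) -> PreCheck:
    """Hypothetical pre-check: decide what to do with a normal message.

    The function only reads metadata. Persisting a new room binding is left
    to the caller, so a stale room is never rewritten here.
    """
    selected = user_meta.get("selected_agent_id")
    # A missing selection and a selection that is no longer in config are
    # both answered with the selection prompt.
    if selected is None or selected not in known_agent_ids:
        return PreCheck(Decision.PROMPT_SELECTION)

    bound = room_meta.get("agent_id")
    if bound is None:
        # Covers fresh rooms and legacy rooms that have platform_chat_id only.
        return PreCheck(Decision.BIND_AND_ROUTE, selected)
    if bound == selected:
        return PreCheck(Decision.ROUTE, bound)
    return PreCheck(Decision.STALE)
```

Note that this sketch compares ids only. The rule that a stale room never becomes active again, even after the user reselects the old agent, would need additional persisted state that the spec does not name, so it is deliberately not modeled here.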
diff --git a/docs/superpowers/specs/2026-04-24-matrix-surface-restart-state-persistence-design.md b/docs/superpowers/specs/2026-04-24-matrix-surface-restart-state-persistence-design.md
new file mode 100644
index 0000000..e9c235e
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-24-matrix-surface-restart-state-persistence-design.md
@@ -0,0 +1,244 @@
+# Matrix Surface Restart State Persistence Design
+
+## Goal
+
+Make the Matrix surface survive a normal restart or container recreation without losing the minimal state required to keep working as a bot.
+
+The result should be:
+
+- after restart, the bot can still answer messages and execute commands
+- the bot remembers the selected agent for each user
+- the bot remembers which agent and `platform_chat_id` each room is bound to
+- temporary UX flows may be lost without being treated as a bug
+
+## Core Decision
+
+The selected persistence model is:
+
+`durable surface state only`
+
+This means:
+
+- persist only the state needed for routing and normal command handling
+- do not persist temporary UI and wizard state
+- require persistent local storage for the surface
+- do not attempt recovery if the volumes backing that storage are lost
+
+## Why This Decision
+
+The Matrix surface already has two different classes of state:
+
+- stable local state that defines how rooms and users are routed
+- temporary UX state that exists only to complete short-lived interactions
+
+Trying to make all temporary UX state survive restart would add complexity and edge cases without improving the core requirement: the bot should still function normally after restart.
+
+The chosen design keeps persistence aligned with what the surface actually owns:
+
+- Matrix-side metadata and routing state are durable
+- agent conversation memory is the platform's responsibility
+- lost local volumes are treated as environment reset, not as an auto-recovery scenario
+
+## Scope
+
+This design covers:
+
+- which Matrix surface data must persist across restart
+- where that data lives
+- how restart behavior interacts with multi-agent routing
+- what state is intentionally non-durable
+
+This design does not cover:
+
+- platform-side persistence of agent memory
+- workspace isolation between multiple agents
+- automatic reconstruction after total local volume loss
+- persistence of temporary UX flows
+
+## Persistence Boundary
+
+### Durable state
+
+The Matrix surface must persist:
+
+- `matrix_user:*`
+- `matrix_room:*`
+- `chat:*`
+- `selected_agent_id`
+- room-bound `agent_id`
+- room-bound `platform_chat_id`
+
+This is the minimal state required so that, after restart, the surface can:
+
+- identify the user
+- identify the room
+- determine which agent should receive a message
+- determine which `platform_chat_id` should be used
+
+### Non-durable state
+
+The Matrix surface does not need to persist:
+
+- staged attachments
+- pending `!load` selection
+- pending `!yes/!no` confirmation
+- any temporary service UI step
+- live `AgentApi` instances or connection objects
+
+After restart, those flows may be lost. The bot only needs to remain operational.
+
+## Storage Model
+
+### Surface durable storage
+
+The Matrix surface must use persistent storage for:
+
+- `lambda_matrix.db`
+- `matrix_store`
+
+`lambda_matrix.db` stores the local key-value state used by the surface.
+`matrix_store` stores Matrix client state needed by `nio`.
+
+These paths must be backed by persistent container storage in normal deployments.
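
To illustrate what "local key-value state" that survives a restart can look like, here is a minimal sketch. It assumes `lambda_matrix.db` is a SQLite file, which the spec does not state; the `kv` table, the `KeyValueStore` class, and `simulate_restart` are hypothetical names introduced only for this example.

```python
import json
import sqlite3
from typing import Optional


class KeyValueStore:
    """Hypothetical durable store; the real schema of lambda_matrix.db is not specified."""

    def __init__(self, path: str) -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )
        self._conn.commit()

    def put(self, key: str, value: dict) -> None:
        # Commit on every write so the value is on disk before a restart.
        self._conn.execute(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self._conn.commit()

    def get(self, key: str) -> Optional[dict]:
        row = self._conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def close(self) -> None:
        self._conn.close()


def simulate_restart(path: str) -> Optional[dict]:
    """Write user metadata, close the store, reopen it, and read the value back."""
    key = "matrix_user:@alice:example.org"
    store = KeyValueStore(path)
    store.put(key, {"selected_agent_id": "agent-2"})
    store.close()  # stands in for the process stopping

    reopened = KeyValueStore(path)  # stands in for the process starting again
    try:
        return reopened.get(key)
    finally:
        reopened.close()
```

The value only survives if `path` points at storage that outlives the container, which is the deployment requirement this section describes.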
+
+### Shared `/workspace`
+
+The current local runtime also uses `/workspace`, but workspace behavior is outside the scope of this design.
+
+For this document, the only requirement is:
+
+- do not make restart persistence depend on solving per-agent workspace isolation first
+
+## Restart Assumptions
+
+This design assumes:
+
+- normal restart or redeploy with persistent local volumes still present
+
+This design does not assume:
+
+- automatic recovery after deleting or losing those volumes
+
+If the relevant volumes are lost, the environment is treated as reset.
+
+## Data Model Requirements
+
+### User metadata
+
+User metadata remains the durable location for user-level routing state.
+
+Example:
+
+```json
+{
+  "space_id": "!space:example.org",
+  "next_chat_index": 4,
+  "selected_agent_id": "agent-2"
+}
+```
+
+### Room metadata
+
+Room metadata remains the durable location for room-level routing state.
+
+Example:
+
+```json
+{
+  "room_type": "chat",
+  "chat_id": "C3",
+  "display_name": "Чат 3",
+  "matrix_user_id": "@alice:example.org",
+  "space_id": "!space:example.org",
+  "platform_chat_id": "42",
+  "agent_id": "agent-2"
+}
+```
+
+## Runtime Semantics After Restart
+
+After restart, the Matrix surface must:
+
+1. load the durable Matrix store
+2. load the durable surface key-value state
+3. load the agent registry config
+4. resume normal room routing using persisted `selected_agent_id`, `agent_id`, and `platform_chat_id`
+
+Expected behavior:
+
+- a user with a valid previously selected agent does not need to reselect it
+- a room previously bound to an agent remains bound to that agent
+- normal messages and commands continue to work
+
+### Lost temporary UX state
+
+If the bot restarts during a transient UX flow:
+
+- staged attachments may disappear
+- pending `!load` selections may disappear
+- pending confirmations may disappear
+
+This is acceptable and should not block normal operation after restart.
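
Step 3 of the restart sequence in this section loads the agent registry config. A minimal sketch of the validation that step might run is shown below. It works on already-parsed config data so that it does not depend on a particular YAML library. `AgentConfigError` and `build_registry` are hypothetical names, and rejecting duplicate ids is an assumption derived from the "stable `id`" rule rather than something the specs state.

```python
from types import MappingProxyType
from typing import Any, Dict, Mapping


class AgentConfigError(Exception):
    """Raised at startup when the agent registry is missing, empty, or invalid."""


def build_registry(config: Any) -> Mapping[str, str]:
    """Validate parsed config data and return a read-only id -> label mapping."""
    agents = config.get("agents") if isinstance(config, dict) else None
    if not isinstance(agents, list) or not agents:
        raise AgentConfigError("agent config must contain a non-empty 'agents' list")

    registry: Dict[str, str] = {}
    for index, entry in enumerate(agents):
        if not isinstance(entry, dict):
            raise AgentConfigError(f"agents[{index}] must be a mapping")
        agent_id, label = entry.get("id"), entry.get("label")
        if not isinstance(agent_id, str) or not agent_id.strip():
            raise AgentConfigError(f"agents[{index}] needs a non-empty string 'id'")
        if not isinstance(label, str) or not label.strip():
            raise AgentConfigError(f"agents[{index}] needs a non-empty string 'label'")
        if agent_id in registry:
            # Assumption: ids must be unique for room bindings to stay unambiguous.
            raise AgentConfigError(f"duplicate agent id: {agent_id!r}")
        registry[agent_id] = label

    # MappingProxyType gives runtime code a read-only view of the registry.
    return MappingProxyType(registry)
```

A caller would let `AgentConfigError` propagate at startup so the bot stops with a clear operator error. Persisted ids that are missing from the returned mapping are a separate case: per the failure-handling rules they lead to a reselection prompt, not a crash.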
+
+## Interaction With Multi-Agent Routing
+
+The multi-agent design introduces new durable state that must survive restart:
+
+- `selected_agent_id` on the user
+- `agent_id` on the room
+
+Restart persistence and multi-agent routing therefore belong together.
+
+Without durable storage for those fields, a restart would make room routing ambiguous.
+
+## Failure Handling
+
+### Missing durable surface store
+
+If the durable store paths are missing because the environment was reset:
+
+- do not attempt to reconstruct a full working state from scratch in this design
+- treat startup as a clean environment
+- allow normal onboarding flows to begin again
+
+### Invalid durable references
+
+If persisted `selected_agent_id` or room `agent_id` references an agent no longer present in config:
+
+- do not crash
+- treat the selection or room binding as invalid
+- ask the user to select a valid agent again
+
+### Platform conversation memory
+
+If the upstream platform loses agent memory across restart:
+
+- that is outside the surface persistence boundary
+- the surface must still route correctly
+- platform memory persistence remains a platform responsibility
+
+## Testing Expectations
+
+Tests for this design should prove:
+
+- `selected_agent_id` survives restart through durable local storage
+- room `agent_id` and `platform_chat_id` survive restart through durable local storage
+- the bot can route messages correctly after restart without user reconfiguration
+- missing temporary UX state does not break normal messaging and command handling
+- invalid persisted agent references degrade into reselection prompts rather than crashes
+
+## Operational Notes
+
+For the Matrix surface to survive restart in the intended way, deployment must persist:
+
+- `lambda_matrix.db`
+- `matrix_store`
+
+This is a deployment requirement, not an optional optimization.
+
+The design intentionally stops there. It does not require:
+
+- hot reload of agent config
+- recovery after total local state loss
+- persistence of temporary UX flows
+- a solved multi-agent workspace story
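
As a rough illustration of the deployment requirement above and of the "Missing durable surface store" rule, a startup check could classify the environment before any state is loaded. This is only a sketch: it assumes `lambda_matrix.db` is a regular file and `matrix_store` is a directory, and `missing_durable_paths` and `startup_mode` are hypothetical helpers. The spec does not say what to do when only one of the two paths is present, so that case is reported separately and its policy is left open.

```python
from pathlib import Path
from typing import List


def missing_durable_paths(db_path: str, store_dir: str) -> List[str]:
    """Return the durable paths that are absent; an empty list means state can be resumed."""
    missing = []
    if not Path(db_path).is_file():
        missing.append(db_path)
    if not Path(store_dir).is_dir():
        missing.append(store_dir)
    return missing


def startup_mode(db_path: str, store_dir: str) -> str:
    """Classify startup as 'resume', 'clean', or 'partial'."""
    missing = missing_durable_paths(db_path, store_dir)
    if not missing:
        return "resume"   # both paths present: load persisted routing state
    if len(missing) == 2:
        return "clean"    # environment reset: normal onboarding starts again
    return "partial"      # one path missing: handling is not defined by the spec
```

Logging the result at startup would make a lost volume visible to operators without adding any recovery logic, which stays out of scope for this design.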