docs: add matrix multi-agent and restart state specs
parent
76230392fa
commit
59fbb52c20
2 changed files with 546 additions and 0 deletions
|
|
@ -0,0 +1,302 @@
|
|||
# Matrix Multi-Agent Routing Design
|
||||
|
||||
## Goal
|
||||
|
||||
Move the Matrix surface from a single hardcoded upstream agent to a user-selectable multi-agent model, while preserving the existing room-based UX and the current `PlatformClient` boundary.
|
||||
|
||||
The result should be:
|
||||
|
||||
- one Matrix bot can work with multiple upstream agents
|
||||
- users can choose an agent from the full configured list
|
||||
- each chat is bound to exactly one agent
|
||||
- switching the selected agent does not silently retarget an existing chat
|
||||
|
||||
## Core Decision
|
||||
|
||||
The selected routing model is:
|
||||
|
||||
`user.selected_agent_id + room.agent_id + room.platform_chat_id`
|
||||
|
||||
This means:
|
||||
|
||||
- the user has exactly one currently selected agent
|
||||
- each Matrix working room stores the agent it is bound to
|
||||
- each Matrix working room stores its own `platform_chat_id`
|
||||
- a room never changes agent implicitly
|
||||
|
||||
## Why This Decision
|
||||
|
||||
The current Matrix adapter already separates:
|
||||
|
||||
- user-facing room organization
|
||||
- local chat labels such as `C1`, `C2`, `C3`
|
||||
- platform-facing conversation identity via `platform_chat_id`
|
||||
|
||||
Adding multi-agent support should preserve that shape instead of replacing it.
|
||||
|
||||
If routing depended only on the current user selection, then an old room could start talking to a different agent after a switch. That would make room history and backend context hard to reason about. Binding an agent to the room keeps the conversation model explicit.
|
||||
|
||||
## Scope
|
||||
|
||||
This design covers:
|
||||
|
||||
- agent selection by the user inside the Matrix surface
|
||||
- durable storage of the selected agent
|
||||
- durable storage of the room-bound agent
|
||||
- routing normal messages and context commands to the correct upstream agent
|
||||
- behavior when a room becomes stale after an agent switch
|
||||
|
||||
This design does not cover:
|
||||
|
||||
- per-agent workspace isolation
|
||||
- platform-side agent lifecycle or memory persistence
|
||||
- per-user allowlists for available agents
|
||||
- Telegram or other surfaces
|
||||
|
||||
## Configuration Model
|
||||
|
||||
### Agent registry
|
||||
|
||||
Available agents are defined in a local config file loaded once at bot startup.
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
agents:
  - id: agent-1
    label: Analyst
  - id: agent-2
    label: Research
  - id: agent-3
    label: Ops
```
|
||||
|
||||
Rules:
|
||||
|
||||
- every entry must have a stable `id`
|
||||
- every entry must have a user-visible `label`
|
||||
- all configured agents are selectable by all users
|
||||
- config changes apply only after bot restart
|
||||
|
||||
### Startup validation
|
||||
|
||||
If the agent config is missing, empty, or invalid, the Matrix bot must fail fast on startup with a clear operator error.
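The fail-fast rule can be sketched as a validation step over the already-parsed config data. This is a minimal sketch, not the actual loader: `AgentConfigError`, `AgentEntry`, and `build_registry` are hypothetical names, reading and parsing the YAML file is left out, and rejecting duplicate ids is an assumption derived from the "stable `id`" rule.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Any, Dict, Mapping


class AgentConfigError(Exception):
    """Raised at startup when the agent config is missing, empty, or invalid."""


@dataclass(frozen=True)
class AgentEntry:
    id: str
    label: str


def build_registry(raw: Any) -> Mapping[str, AgentEntry]:
    """Validate parsed config data and return a read-only id -> entry mapping."""
    if not isinstance(raw, dict) or not isinstance(raw.get("agents"), list):
        raise AgentConfigError("agent config must contain an 'agents' list")
    if not raw["agents"]:
        raise AgentConfigError("agent config defines no agents")

    registry: Dict[str, AgentEntry] = {}
    for index, item in enumerate(raw["agents"]):
        if not isinstance(item, dict):
            raise AgentConfigError(f"agents[{index}] must be a mapping")
        agent_id, label = item.get("id"), item.get("label")
        if not isinstance(agent_id, str) or not agent_id.strip():
            raise AgentConfigError(f"agents[{index}] has no valid 'id'")
        if not isinstance(label, str) or not label.strip():
            raise AgentConfigError(f"agents[{index}] has no valid 'label'")
        # Assumption: a "stable id" is also unique within the file.
        if agent_id in registry:
            raise AgentConfigError(f"duplicate agent id: {agent_id!r}")
        registry[agent_id] = AgentEntry(id=agent_id, label=label)

    # MappingProxyType gives runtime code a read-only view of the registry.
    return MappingProxyType(registry)
```

A startup routine would call `build_registry` once and let `AgentConfigError` terminate the process with its message, so the operator sees which entry is wrong.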
|
||||
|
||||
## Durable State Model
|
||||
|
||||
### User-level state
|
||||
|
||||
User metadata keeps the current selected agent.
|
||||
|
||||
Example `matrix_user:*` shape:
|
||||
|
||||
```json
{
  "space_id": "!space:example.org",
  "next_chat_index": 4,
  "selected_agent_id": "agent-2"
}
```
|
||||
|
||||
Meaning:
|
||||
|
||||
- `selected_agent_id` controls future chat creation and activation of an unbound room
|
||||
- `selected_agent_id` does not rewrite already bound rooms
|
||||
|
||||
### Room-level state
|
||||
|
||||
Room metadata stores the agent bound to that chat.
|
||||
|
||||
Example `matrix_room:*` shape:
|
||||
|
||||
```json
{
  "room_type": "chat",
  "chat_id": "C3",
  "display_name": "Чат 3",
  "matrix_user_id": "@alice:example.org",
  "space_id": "!space:example.org",
  "platform_chat_id": "42",
  "agent_id": "agent-2"
}
```
|
||||
|
||||
Rules:
|
||||
|
||||
- one room binds to exactly one `agent_id`
|
||||
- one room binds to exactly one current `platform_chat_id`
|
||||
- once a room becomes stale after an agent switch, it never becomes active again
|
||||
|
||||
## Runtime Semantics
|
||||
|
||||
### `!start`
|
||||
|
||||
`!start` remains lightweight:
|
||||
|
||||
- if no agent is selected, the bot explains that an agent must be selected before normal messaging
|
||||
- if an agent is already selected, the bot reports the current selection and reminds the user that `!new` creates a new room under that agent
|
||||
|
||||
### `!agent`
|
||||
|
||||
Introduce an agent-selection command.
|
||||
|
||||
Behavior:
|
||||
|
||||
- `!agent` shows the available agent list
|
||||
- agent selection stores `selected_agent_id` in user metadata
|
||||
- after a successful switch, the bot tells the user that existing chats bound to another agent are stale and that `!new` is required for continued work
|
||||
|
||||
The exact UI can be text-first for MVP. A richer UI can be added later without changing the state model.
|
||||
|
||||
### Normal message without selected agent
|
||||
|
||||
If the user has not selected an agent yet:
|
||||
|
||||
- do not call the platform
|
||||
- return the available agent list
|
||||
- ask the user to choose one first
|
||||
|
||||
### Selecting an agent inside an unbound chat
|
||||
|
||||
If the current room has never been bound to any agent:
|
||||
|
||||
- store the new `selected_agent_id` for the user
|
||||
- bind the current room to that same `agent_id`
|
||||
- allow the room to become the active working chat immediately
|
||||
|
||||
This avoids forcing `!new` for the user's first usable chat.
|
||||
|
||||
### `!new`
|
||||
|
||||
`!new` creates a new working room under the currently selected agent.
|
||||
|
||||
Behavior:
|
||||
|
||||
1. require `selected_agent_id`
|
||||
2. create the new Matrix room
|
||||
3. allocate a new `platform_chat_id`
|
||||
4. store `agent_id = selected_agent_id` in the new room metadata
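The four steps above can be sketched as one handler. This is an illustrative sketch under stated assumptions: `handle_new`, `NoAgentSelected`, and the two injected callables are hypothetical names, and the real room creation and `platform_chat_id` allocation calls are not shown in this design.

```python
from typing import Callable, Dict, Tuple


class NoAgentSelected(Exception):
    """Raised when !new is used before the user has selected an agent."""


def handle_new(
    user_meta: Dict[str, object],
    create_matrix_room: Callable[[], str],
    allocate_platform_chat_id: Callable[[], str],
) -> Tuple[str, Dict[str, object]]:
    """Create a working room bound to the user's selected agent."""
    # Step 1: require selected_agent_id before touching Matrix or the platform.
    selected = user_meta.get("selected_agent_id")
    if not selected:
        raise NoAgentSelected("select an agent before running !new")

    # Step 2: create the new Matrix room (hypothetical injected callable).
    room_id = create_matrix_room()

    # Step 3: allocate a new platform_chat_id (hypothetical injected callable).
    platform_chat_id = allocate_platform_chat_id()

    # Step 4: bind the room to the agent selected at creation time.
    room_meta: Dict[str, object] = {
        "room_type": "chat",
        "platform_chat_id": platform_chat_id,
        "agent_id": selected,
    }
    return room_id, room_meta
```

The sketch omits chat labels and `next_chat_index` handling. The caller is expected to persist `room_meta` under the room's `matrix_room:*` key.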
|
||||
|
||||
### Normal message in an unbound room with selected agent
|
||||
|
||||
If a room exists but has no `agent_id` yet and the user already has `selected_agent_id`:
|
||||
|
||||
- bind the room to `selected_agent_id`
|
||||
- ensure it has `platform_chat_id`
|
||||
- continue normal message dispatch
|
||||
|
||||
### Normal message in a bound room
|
||||
|
||||
If the room already has an `agent_id` and it matches the currently selected agent:
|
||||
|
||||
- route the message to that `agent_id`
|
||||
- use the room's `platform_chat_id`
|
||||
|
||||
### Stale room after agent switch
|
||||
|
||||
If the room's bound `agent_id` differs from the user's current `selected_agent_id`:
|
||||
|
||||
- do not call the platform
|
||||
- treat the room as stale
|
||||
- return a short message telling the user that this chat is bound to a different agent than the one currently selected and that they must use `!new`
|
||||
|
||||
### Returning to a previously selected agent
|
||||
|
||||
If the user later selects an old agent again:
|
||||
|
||||
- previously stale rooms do not become valid again
|
||||
- the user must still create a fresh room via `!new`
|
||||
|
||||
## Routing and Component Changes
|
||||
|
||||
### Agent registry loader
|
||||
|
||||
Add a small loader responsible for:
|
||||
|
||||
- reading `agents.yaml`
|
||||
- validating ids and labels
|
||||
- exposing a read-only registry to runtime code
|
||||
|
||||
The runtime should not parse YAML ad hoc during message handling.
|
||||
|
||||
### Matrix runtime pre-check
|
||||
|
||||
Before dispatching a normal message, the Matrix runtime must resolve:
|
||||
|
||||
- whether the user has `selected_agent_id`
|
||||
- whether the current room already has `agent_id`
|
||||
- whether the room can be bound now
|
||||
- whether the room is stale
|
||||
|
||||
This pre-check happens before handing the message to the existing dispatcher path.
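The pre-check can be sketched as a pure function that maps user and room state to one decision. This is a sketch, not the runtime's actual code: `RouteDecision`, `RoomState`, and `precheck` are hypothetical names. The `stale` flag is also an assumption. The design requires that a stale room never becomes active again, but does not say how staleness is recorded, so the sketch assumes a marker set when the user switches agents.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Collection, Optional


class RouteDecision(Enum):
    PROMPT_SELECT = "prompt_select"    # no valid selection: show the agent list
    BIND_AND_DISPATCH = "bind"         # unbound room: bind it, then dispatch
    DISPATCH = "dispatch"              # bound room that matches the selection
    STALE = "stale"                    # bound to another agent: require !new


@dataclass(frozen=True)
class RoomState:
    agent_id: Optional[str] = None
    stale: bool = False  # hypothetical marker, not defined in the design


def precheck(
    selected_agent_id: Optional[str],
    room: RoomState,
    known_agent_ids: Collection[str],
) -> RouteDecision:
    """Decide how to handle a normal message before calling the platform."""
    # A missing selection and a selection removed from config are treated alike.
    if selected_agent_id is None or selected_agent_id not in known_agent_ids:
        return RouteDecision.PROMPT_SELECT
    # Unbound rooms, including legacy rooms without agent_id, can be bound now.
    if room.agent_id is None:
        return RouteDecision.BIND_AND_DISPATCH
    # A room marked stale stays stale even if the old agent is selected again.
    if room.stale or room.agent_id != selected_agent_id:
        return RouteDecision.STALE
    return RouteDecision.DISPATCH
```

Keeping the decision free of I/O makes each branch in the runtime semantics straightforward to unit test. Only `BIND_AND_DISPATCH` and `DISPATCH` lead to a platform call.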
|
||||
|
||||
### Real platform bridge
|
||||
|
||||
The current real backend path hardcodes a single runtime-level `agent_id`.
|
||||
That must be replaced with per-request routing.
|
||||
|
||||
The selected design is:
|
||||
|
||||
- the runtime resolves the target `agent_id`
|
||||
- the platform bridge creates a fresh upstream `AgentApi` for that `agent_id`
|
||||
- no long-lived `AgentApi` instances are cached by user
|
||||
|
||||
This preserves the current fresh-connection-per-request behavior.
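Per-request routing can be sketched with an injected factory. The real `AgentApi` constructor and its methods are not shown in this document, so the sketch uses a stand-in protocol: `AgentClient`, its `send` method, and `PlatformBridge.send_message` are hypothetical names. The real bridge is likely asynchronous; the sketch is synchronous for brevity.

```python
from typing import Callable, Protocol


class AgentClient(Protocol):
    """Stand-in for the upstream client; the real AgentApi interface may differ."""

    def send(self, platform_chat_id: str, text: str) -> str: ...


class PlatformBridge:
    def __init__(self, client_factory: Callable[[str], AgentClient]) -> None:
        # The factory builds a client for one agent_id; nothing is cached here.
        self._client_factory = client_factory

    def send_message(self, agent_id: str, platform_chat_id: str, text: str) -> str:
        # A fresh client per request keeps the fresh-connection-per-request behavior.
        client = self._client_factory(agent_id)
        return client.send(platform_chat_id, text)
```

Because the bridge receives `agent_id` as an argument, it holds no runtime-level agent setting, and the caller decides the target for each message.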
|
||||
|
||||
## Error Handling
|
||||
|
||||
### Missing or invalid selected agent
|
||||
|
||||
If `selected_agent_id` is absent:
|
||||
|
||||
- ask the user to select an agent
|
||||
|
||||
If `selected_agent_id` points to an agent that no longer exists in config:
|
||||
|
||||
- treat the selection as invalid
|
||||
- ask the user to select again
|
||||
|
||||
### Missing room binding
|
||||
|
||||
If the room has no `agent_id`:
|
||||
|
||||
- bind it only when the user has a valid current selection
|
||||
- otherwise return the selection prompt
|
||||
|
||||
### Stale room
|
||||
|
||||
If the room is stale:
|
||||
|
||||
- do not attempt fallback routing
|
||||
- do not silently rewrite room metadata
|
||||
- instruct the user to run `!new`
|
||||
|
||||
### Invalid config
|
||||
|
||||
If the bot cannot load a valid agent registry:
|
||||
|
||||
- fail at startup
|
||||
- do not start in degraded single-agent mode
|
||||
|
||||
## Testing Expectations
|
||||
|
||||
Tests for this design should prove:
|
||||
|
||||
- config parsing and startup validation
|
||||
- selecting an agent persists `selected_agent_id`
|
||||
- selecting an agent inside an unbound room activates that room
|
||||
- `!new` binds the new room to the selected agent
|
||||
- messages in a bound room use that room's `agent_id`
|
||||
- stale rooms reject normal messaging with a clear `!new` instruction
|
||||
- returning to the same agent later does not revive stale rooms
|
||||
|
||||
## Migration Notes
|
||||
|
||||
Existing rooms may have `platform_chat_id` but no `agent_id`.
|
||||
|
||||
For this MVP, treat those rooms as legacy-unbound rooms:
|
||||
|
||||
- if the user has a valid selected agent, the room may be bound on first use
|
||||
- if no agent is selected, the room prompts for selection first
|
||||
|
||||
No automatic migration across agents is introduced.
|
||||
|
|
@ -0,0 +1,244 @@
|
|||
# Matrix Surface Restart State Persistence Design
|
||||
|
||||
## Goal
|
||||
|
||||
Make the Matrix surface survive a normal restart or container recreate without losing the minimal state required to keep working as a bot.
|
||||
|
||||
The result should be:
|
||||
|
||||
- after restart, the bot can still answer messages and execute commands
|
||||
- the bot remembers the selected agent for each user
|
||||
- the bot remembers which agent and `platform_chat_id` each room is bound to
|
||||
- temporary UX flows may be lost without being treated as a bug
|
||||
|
||||
## Core Decision
|
||||
|
||||
The selected persistence model is:
|
||||
|
||||
`durable surface state only`
|
||||
|
||||
This means:
|
||||
|
||||
- persist only the state needed for routing and normal command handling
|
||||
- do not persist temporary UI and wizard state
|
||||
- require persistent local storage for the surface
|
||||
- do not attempt recovery if that storage is lost
|
||||
|
||||
## Why This Decision
|
||||
|
||||
The Matrix surface already has two different classes of state:
|
||||
|
||||
- stable local state that defines how rooms and users are routed
|
||||
- temporary UX state that exists only to complete short-lived interactions
|
||||
|
||||
Trying to make all temporary UX state survive restart would add complexity and edge cases without improving the core requirement: the bot should still function normally after restart.
|
||||
|
||||
The chosen design keeps persistence aligned with what the surface actually owns:
|
||||
|
||||
- Matrix-side metadata and routing state are durable
|
||||
- agent conversation memory is the platform's responsibility
|
||||
- lost local volumes are treated as environment reset, not as an auto-recovery scenario
|
||||
|
||||
## Scope
|
||||
|
||||
This design covers:
|
||||
|
||||
- which Matrix surface data must persist across restart
|
||||
- where that data lives
|
||||
- how restart behavior interacts with multi-agent routing
|
||||
- what state is intentionally non-durable
|
||||
|
||||
This design does not cover:
|
||||
|
||||
- platform-side persistence of agent memory
|
||||
- workspace isolation between multiple agents
|
||||
- automatic reconstruction after total local volume loss
|
||||
- persistence of temporary UX flows
|
||||
|
||||
## Persistence Boundary
|
||||
|
||||
### Durable state
|
||||
|
||||
The Matrix surface must persist:
|
||||
|
||||
- `matrix_user:*`
|
||||
- `matrix_room:*`
|
||||
- `chat:*`
|
||||
- `selected_agent_id`
|
||||
- room-bound `agent_id`
|
||||
- room-bound `platform_chat_id`
|
||||
|
||||
This is the minimal state required so that, after restart, the surface can:
|
||||
|
||||
- identify the user
|
||||
- identify the room
|
||||
- determine which agent should receive a message
|
||||
- determine which `platform_chat_id` should be used
|
||||
|
||||
### Non-durable state
|
||||
|
||||
The Matrix surface does not need to persist:
|
||||
|
||||
- staged attachments
|
||||
- pending `!load` selection
|
||||
- pending `!yes/!no` confirmation
|
||||
- any temporary service UI step
|
||||
- live `AgentApi` instances or connection objects
|
||||
|
||||
After restart, those flows may be lost. The bot only needs to remain operational.
|
||||
|
||||
## Storage Model
|
||||
|
||||
### Surface durable storage
|
||||
|
||||
The Matrix surface must use persistent storage for:
|
||||
|
||||
- `lambda_matrix.db`
|
||||
- `matrix_store`
|
||||
|
||||
`lambda_matrix.db` stores the local key-value state used by the surface.
|
||||
`matrix_store` stores Matrix client state needed by `nio`.
|
||||
|
||||
These paths must be backed by persistent container storage in normal deployments.
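A file-backed key-value store that survives a process restart can be sketched as follows. The document does not state the on-disk format of `lambda_matrix.db`; the sketch assumes SQLite with JSON values, and `KeyValueStore` with its `kv` table is a hypothetical name, not the surface's actual storage class.

```python
import json
import sqlite3
from typing import Any, Dict, Optional


class KeyValueStore:
    """Minimal durable key-value store (assumed SQLite backing)."""

    def __init__(self, path: str) -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )
        self._conn.commit()

    def put(self, key: str, value: Dict[str, Any]) -> None:
        # INSERT OR REPLACE overwrites any previous value for the same key.
        self._conn.execute(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self._conn.commit()

    def get(self, key: str) -> Optional[Dict[str, Any]]:
        row = self._conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def close(self) -> None:
        self._conn.close()
```

Opening the store twice on the same path simulates a restart: a value written before `close()` is readable from the second instance as long as the file sits on persistent storage.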
|
||||
|
||||
### Shared `/workspace`
|
||||
|
||||
The current local runtime also uses `/workspace`, but workspace behavior is outside the scope of this design.
|
||||
|
||||
For this document, the only requirement is:
|
||||
|
||||
- do not make restart persistence depend on solving per-agent workspace isolation first
|
||||
|
||||
## Restart Assumptions
|
||||
|
||||
This design assumes:
|
||||
|
||||
- normal restart or redeploy with persistent local volumes still present
|
||||
|
||||
This design does not assume:
|
||||
|
||||
- automatic recovery after deleting or losing those volumes
|
||||
|
||||
If the relevant volumes are lost, the environment is treated as reset.
|
||||
|
||||
## Data Model Requirements
|
||||
|
||||
### User metadata
|
||||
|
||||
User metadata remains the durable location for user-level routing state.
|
||||
|
||||
Example:
|
||||
|
||||
```json
{
  "space_id": "!space:example.org",
  "next_chat_index": 4,
  "selected_agent_id": "agent-2"
}
```
|
||||
|
||||
### Room metadata
|
||||
|
||||
Room metadata remains the durable location for room-level routing state.
|
||||
|
||||
Example:
|
||||
|
||||
```json
{
  "room_type": "chat",
  "chat_id": "C3",
  "display_name": "Чат 3",
  "matrix_user_id": "@alice:example.org",
  "space_id": "!space:example.org",
  "platform_chat_id": "42",
  "agent_id": "agent-2"
}
```
|
||||
|
||||
## Runtime Semantics After Restart
|
||||
|
||||
After restart, the Matrix surface must:
|
||||
|
||||
1. load the durable Matrix store
|
||||
2. load the durable surface key-value state
|
||||
3. load the agent registry config
|
||||
4. resume normal room routing using persisted `selected_agent_id`, `agent_id`, and `platform_chat_id`
|
||||
|
||||
Expected behavior:
|
||||
|
||||
- a user with a valid previously selected agent does not need to reselect it
|
||||
- a room previously bound to an agent remains bound to that agent
|
||||
- normal messages and commands continue to work
|
||||
|
||||
### Lost temporary UX state
|
||||
|
||||
If the bot restarts during a transient UX flow:
|
||||
|
||||
- staged attachments may disappear
|
||||
- pending `!load` selections may disappear
|
||||
- pending confirmations may disappear
|
||||
|
||||
This is acceptable and should not block normal operation after restart.
|
||||
|
||||
## Interaction With Multi-Agent Routing
|
||||
|
||||
The multi-agent design introduces new durable state that must survive restart:
|
||||
|
||||
- `selected_agent_id` on the user
|
||||
- `agent_id` on the room
|
||||
|
||||
Restart persistence and multi-agent routing therefore belong together.
|
||||
|
||||
Without durable storage for those fields, a restart would make room routing ambiguous.
|
||||
|
||||
## Failure Handling
|
||||
|
||||
### Missing durable surface store
|
||||
|
||||
If the durable store paths are missing because the environment was reset:
|
||||
|
||||
- do not attempt to reconstruct a full working state from scratch in this design
|
||||
- treat startup as a clean environment
|
||||
- allow normal onboarding flows to begin again
|
||||
|
||||
### Invalid durable references
|
||||
|
||||
If a persisted `selected_agent_id` or a room's `agent_id` references an agent that is no longer present in the config:
|
||||
|
||||
- do not crash
|
||||
- treat the selection or room binding as invalid
|
||||
- ask the user to select a valid agent again
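The non-crashing check can be sketched as a function that reports which persisted fields point at unknown agents. `find_invalid_references` is a hypothetical helper, and the registry is represented here only by a collection of known ids.

```python
from typing import Collection, List, Optional


def find_invalid_references(
    selected_agent_id: Optional[str],
    room_agent_id: Optional[str],
    known_agent_ids: Collection[str],
) -> List[str]:
    """Return the names of persisted fields that reference unknown agents."""
    invalid: List[str] = []
    # Absent values are not invalid; they are handled by the selection prompt.
    if selected_agent_id is not None and selected_agent_id not in known_agent_ids:
        invalid.append("selected_agent_id")
    if room_agent_id is not None and room_agent_id not in known_agent_ids:
        invalid.append("agent_id")
    return invalid
```

A non-empty result would lead to the reselection prompt rather than an exception, which matches the "do not crash" rule.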
|
||||
|
||||
### Platform conversation memory
|
||||
|
||||
If the upstream platform loses agent memory across restart:
|
||||
|
||||
- that is outside the surface persistence boundary
|
||||
- the surface must still route correctly
|
||||
- platform memory persistence remains a platform responsibility
|
||||
|
||||
## Testing Expectations
|
||||
|
||||
Tests for this design should prove:
|
||||
|
||||
- `selected_agent_id` survives restart through durable local storage
|
||||
- room `agent_id` and `platform_chat_id` survive restart through durable local storage
|
||||
- the bot can route messages correctly after restart without user reconfiguration
|
||||
- missing temporary UX state does not break normal messaging and command handling
|
||||
- invalid persisted agent references degrade into reselection prompts rather than crashes
|
||||
|
||||
## Operational Notes
|
||||
|
||||
For the Matrix surface to survive restart in the intended way, deployment must persist:
|
||||
|
||||
- `lambda_matrix.db`
|
||||
- `matrix_store`
|
||||
|
||||
This is a deployment requirement, not an optional optimization.
|
||||
|
||||
The design intentionally stops there. It does not require:
|
||||
|
||||
- hot reload of agent config
|
||||
- recovery after total local state loss
|
||||
- persistence of temporary UX flows
|
||||
- a solved multi-agent workspace story
|
||||