# Matrix Multi-Agent Routing Design

## Goal

Move the Matrix surface from a single hardcoded upstream agent to a user-selectable multi-agent model, while preserving the existing room-based UX and the current `PlatformClient` boundary.

The result should be:

- one Matrix bot can work with multiple upstream agents
- users can choose an agent from the full configured list
- each chat is bound to exactly one agent
- switching the selected agent does not silently retarget an existing chat

## Core Decision

The selected routing model is:

`user.selected_agent_id + room.agent_id + room.platform_chat_id`

This means:

- the user has one current selected agent
- each Matrix working room stores the agent it is bound to
- each Matrix working room stores its own `platform_chat_id`
- a room never changes agent implicitly
- the shared `PlatformClient` protocol remains unchanged
- Matrix multi-agent routing is implemented by a single routing facade that delegates to per-agent real clients

## Why This Decision

The current Matrix adapter already separates:

- user-facing room organization
- local chat labels such as `C1`, `C2`, `C3`
- platform-facing conversation identity via `platform_chat_id`

Adding multi-agent support should preserve that shape instead of replacing it.

If routing depended only on the current user selection, then an old room could start talking to a different agent after a switch. That would make room history and backend context hard to reason about. Binding an agent to the room keeps the conversation model explicit.

## Scope

This design covers:

- agent selection by the user inside the Matrix surface
- durable storage of the selected agent
- durable storage of the room-bound agent
- routing normal messages and context commands to the correct upstream agent
- behavior when a room becomes stale after an agent switch

This design does not cover:

- per-agent workspace isolation
- platform-side agent lifecycle or memory persistence
- per-user allowlists for available agents
- Telegram or other surfaces

## Configuration Model

### Agent registry

Available agents are defined in a local config file loaded once at bot startup.

Example:

```yaml
agents:
  - id: agent-1
    label: Analyst
  - id: agent-2
    label: Research
  - id: agent-3
    label: Ops
```

Rules:

- every entry must have a stable `id`
- every entry must have a user-visible `label`
- all configured agents are selectable by all users
- config changes apply only after bot restart

### Startup validation

If the agent config is missing, empty, or invalid, the Matrix bot must fail fast on startup with a clear operator error.

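The registry rules and the fail-fast requirement can be sketched as a pure validation step over already-parsed config data. This is a minimal sketch, not the bot's actual loader: `AgentEntry`, `AgentConfigError`, and `build_registry` are hypothetical names, and the input is assumed to be the mapping a YAML parser would return for the example above. Rejecting duplicate ids is also an assumption; the rules only ask for stable ids.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Any, Dict, Mapping


class AgentConfigError(ValueError):
    """Raised at startup when the agent config is missing, empty, or invalid."""


@dataclass(frozen=True)
class AgentEntry:
    id: str
    label: str


def build_registry(raw: Any) -> Mapping[str, AgentEntry]:
    """Validate parsed config data and return a read-only id -> entry mapping."""
    if not isinstance(raw, dict) or not isinstance(raw.get("agents"), list):
        raise AgentConfigError("agent config must contain an 'agents' list")
    if not raw["agents"]:
        raise AgentConfigError("agent config defines no agents")

    entries: Dict[str, AgentEntry] = {}
    for index, item in enumerate(raw["agents"]):
        if not isinstance(item, dict):
            raise AgentConfigError(f"agents[{index}] must be a mapping")
        agent_id, label = item.get("id"), item.get("label")
        if not isinstance(agent_id, str) or not agent_id.strip():
            raise AgentConfigError(f"agents[{index}] needs a non-empty 'id'")
        if not isinstance(label, str) or not label.strip():
            raise AgentConfigError(f"agents[{index}] needs a non-empty 'label'")
        if agent_id in entries:  # assumption: ids must also be unique
            raise AgentConfigError(f"duplicate agent id: {agent_id!r}")
        entries[agent_id] = AgentEntry(id=agent_id, label=label)

    # MappingProxyType gives runtime code a read-only view of the registry.
    return MappingProxyType(entries)
```

Calling this once during startup and letting `AgentConfigError` abort the process would give the operator-facing failure described above, with no degraded mode to fall back to.
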
## Durable State Model

### User-level state

User metadata keeps the current selected agent.

Example `matrix_user:*` shape:

```json
{
  "space_id": "!space:example.org",
  "next_chat_index": 4,
  "selected_agent_id": "agent-2"
}
```

Meaning:

- `selected_agent_id` controls future chat creation and activation of an unbound room
- `selected_agent_id` does not rewrite already bound rooms

### Room-level state

Room metadata stores the agent bound to that chat.

Example `matrix_room:*` shape:

```json
{
  "room_type": "chat",
  "chat_id": "C3",
  "display_name": "Чат 3",
  "matrix_user_id": "@alice:example.org",
  "space_id": "!space:example.org",
  "platform_chat_id": "42",
  "agent_id": "agent-2"
}
```

Rules:

- one room binds to exactly one `agent_id`
- one room binds to exactly one current `platform_chat_id`
- once a room becomes stale after an agent switch, it never becomes active again

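The two metadata shapes above could be mirrored in code roughly as follows. This is an illustrative sketch only: `UserState`, `RoomState`, and the two parse helpers are hypothetical names, and the sketch assumes the stored records contain exactly the keys shown in the examples. The routing fields are optional so that records written before multi-agent support still load.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UserState:
    space_id: str
    next_chat_index: int
    # Absent until the user has selected an agent.
    selected_agent_id: Optional[str] = None


@dataclass(frozen=True)
class RoomState:
    room_type: str
    chat_id: str
    display_name: str
    matrix_user_id: str
    space_id: str
    platform_chat_id: Optional[str] = None
    # Absent for rooms that have never been bound to an agent.
    agent_id: Optional[str] = None


def parse_user_state(raw: str) -> UserState:
    """Build a UserState from a stored JSON document."""
    return UserState(**json.loads(raw))


def parse_room_state(raw: str) -> RoomState:
    """Build a RoomState from a stored JSON document."""
    return RoomState(**json.loads(raw))
```

Keeping `selected_agent_id` on the user record and `agent_id` on the room record as separate fields is what lets a selection change leave bound rooms untouched.
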
## Runtime Semantics

### `!start`

`!start` remains lightweight:

- if no agent is selected, the bot explains that an agent must be selected before normal messaging
- if an agent is already selected, the bot reports the current selection and reminds the user that `!new` creates a new room under that agent

### `!agent`

Introduce an agent-selection command.

Behavior:

- `!agent` shows the available agent list
- agent selection stores `selected_agent_id` in user metadata
- after a successful switch, the bot tells the user that existing chats bound to another agent are stale and that `!new` is required for continued work

The exact UI can be text-first for MVP. A richer UI can be added later without changing the state model.

### Normal message without selected agent

If the user has not selected an agent yet:

- do not call the platform
- return the available agent list
- ask the user to choose one first

This is an intentional one-time routing handshake, not an accidental fallback.
In a multi-agent deployment, the surface must not silently guess which agent an unbound user should talk to.

### Selecting an agent inside an unbound chat

If the current room has never been bound to any agent:

- store the new `selected_agent_id` for the user
- bind the current room to that same `agent_id`
- allow the room to become the active working chat immediately

This avoids forcing `!new` for the user's first usable chat.

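The selection rules above (store the selection, bind the room only if it has never been bound) can be sketched as a small function over the two metadata records. This is a hedged illustration: `select_agent` and its return strings are hypothetical, the records are plain dicts that the caller is assumed to persist afterwards, and `room_meta=None` is an assumed way to represent a command issued outside a working room.

```python
from typing import Collection, Dict, Optional


def select_agent(
    user_meta: Dict[str, object],
    room_meta: Optional[Dict[str, object]],
    agent_id: str,
    known_agent_ids: Collection[str],
) -> str:
    """Apply an agent selection and return a short status for the reply text."""
    if agent_id not in known_agent_ids:
        # Nothing is stored for an agent that is not in the registry.
        return "unknown_agent"

    user_meta["selected_agent_id"] = agent_id

    # Only a room that has never been bound picks up the new selection.
    if room_meta is not None and room_meta.get("agent_id") is None:
        room_meta["agent_id"] = agent_id
        return "selected_and_bound"

    # An already bound room keeps its agent_id.
    return "selected"
```

With this shape, a second selection made in the same room updates only the user record, so the room's binding never changes implicitly.
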
### `!new`

`!new` creates a new working room under the current selected agent.

Behavior:

1. require `selected_agent_id`
2. create the new Matrix room
3. allocate a new `platform_chat_id`
4. store `agent_id = selected_agent_id` in the new room metadata

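The four `!new` steps can be sketched as follows. This is a minimal sketch under stated assumptions: `handle_new` and `NoAgentSelected` are hypothetical names, room creation and chat-id allocation are injected as callables because their real implementations are not described here, and deriving the local label from `next_chat_index` is inferred from the metadata examples rather than confirmed.

```python
from typing import Any, Callable, Dict, Tuple


class NoAgentSelected(Exception):
    """Raised when !new is used before any agent has been selected."""


def handle_new(
    user_meta: Dict[str, Any],
    create_matrix_room: Callable[[str], str],
    allocate_platform_chat_id: Callable[[], str],
) -> Tuple[str, Dict[str, Any]]:
    """Create a working room bound to the user's selected agent."""
    # Step 1: require selected_agent_id.
    selected = user_meta.get("selected_agent_id")
    if not selected:
        raise NoAgentSelected("select an agent before running !new")

    # Assumption: the local chat label comes from next_chat_index.
    chat_id = f"C{user_meta['next_chat_index']}"

    # Step 2: create the new Matrix room (hypothetical injected helper).
    room_id = create_matrix_room(chat_id)

    # Steps 3 and 4: allocate platform_chat_id and store the agent binding.
    room_meta = {
        "room_type": "chat",
        "chat_id": chat_id,
        "space_id": user_meta["space_id"],
        "platform_chat_id": allocate_platform_chat_id(),
        "agent_id": selected,
    }
    user_meta["next_chat_index"] += 1
    return room_id, room_meta
```

The binding is written at creation time, so a later `!agent` switch has nothing to rewrite in this room.
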
### Normal message in an unbound room with selected agent

If a room exists but has no `agent_id` yet and the user already has `selected_agent_id`:

- bind the room to `selected_agent_id`
- ensure it has `platform_chat_id`
- continue normal message dispatch

### Normal message in a bound room

If the room already has `agent_id` and it matches the current selected agent:

- route the message to that `agent_id`
- use the room's `platform_chat_id`

### Stale room after agent switch

If the room's bound `agent_id` differs from the user's current `selected_agent_id`:

- do not call the platform
- treat the room as stale
- return a short message telling the user that this chat belongs to the old agent and that they must use `!new`

### Returning to a previously selected agent

If the user later selects an old agent again:

- previously stale rooms do not become valid again
- the user must still create a fresh room via `!new`

## Routing and Component Changes

### Agent registry loader

Add a small loader responsible for:

- reading `agents.yaml`
- validating ids and labels
- exposing a read-only registry to runtime code

The runtime should not parse YAML ad hoc during message handling.

### Matrix runtime pre-check

Before dispatching a normal message, the Matrix runtime must resolve:

- whether the user has `selected_agent_id`
- whether the current room already has `agent_id`
- whether the room can be bound now
- whether the room is stale

This pre-check happens before handing the message to the existing dispatcher path.

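The pre-check can be read as a small decision function over the user selection and the room binding. The sketch below is illustrative: `PreCheck` and `pre_check` are hypothetical names, and the function compares ids only. On its own it would therefore not enforce the rule that a room which has gone stale stays stale after the user switches back; that needs additional durable state, which this sketch leaves out.

```python
from enum import Enum
from typing import Collection, Optional


class PreCheck(Enum):
    PROMPT_SELECTION = "prompt_selection"      # no usable selection yet
    BIND_THEN_DISPATCH = "bind_then_dispatch"  # unbound room, valid selection
    DISPATCH = "dispatch"                      # bound room matches selection
    STALE = "stale"                            # bound room, different agent


def pre_check(
    selected_agent_id: Optional[str],
    room_agent_id: Optional[str],
    known_agent_ids: Collection[str],
) -> PreCheck:
    """Decide what to do with a normal message before any platform call."""
    # A missing selection and a selection removed from config both
    # lead to the selection prompt.
    if selected_agent_id is None or selected_agent_id not in known_agent_ids:
        return PreCheck.PROMPT_SELECTION

    # A room without agent_id can be bound to the current selection.
    if room_agent_id is None:
        return PreCheck.BIND_THEN_DISPATCH

    # A bound room is never retargeted; a mismatch means the room is stale.
    if room_agent_id != selected_agent_id:
        return PreCheck.STALE

    return PreCheck.DISPATCH
```

Only `DISPATCH` and `BIND_THEN_DISPATCH` would continue into the existing dispatcher path; the other two outcomes reply to the user without calling the platform.
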
### Routed platform client

The selected implementation keeps the shared `PlatformClient` protocol unchanged.

The Matrix runtime owns one routing-aware facade, for example `RoutedPlatformClient`, that implements `PlatformClient` and delegates to agent-specific real clients.

Responsibilities:

- resolve the current room binding from local Matrix metadata
- translate a local Matrix logical chat id into the room's `platform_chat_id`
- choose the correct per-agent delegate for the room's bound `agent_id`
- keep `get_or_create_user`, `get_settings`, and `update_settings` behavior stable for the rest of the runtime

This keeps the multi-agent logic inside the Matrix integration boundary instead of pushing agent selection into the shared protocol.

### Real platform bridge delegates

The current real backend path hardcodes a single runtime-level `agent_id`.
That must be replaced with per-agent delegates hidden behind the routing facade.

The selected design is:

- `RealPlatformClient` remains the low-level direct-agent delegate for one configured `agent_id`
- the routing facade holds or creates one `RealPlatformClient` delegate per configured agent
- `send_message(...)` and `stream_message(...)` on the facade resolve the room target and forward the call to the matching delegate
- the delegate creates a fresh upstream `AgentApi` for its configured `agent_id`
- no long-lived `AgentApi` instances are cached by user

This preserves the current fresh-connection-per-request behavior while avoiding a protocol break for Telegram or other surfaces.

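The facade-and-delegate arrangement can be sketched as below. This is a simplified illustration, not the real `PlatformClient` protocol: the `send_message(chat_id, text)` signature, the synchronous style, the `MessageClient` protocol, and the `room_lookup` and `delegate_factory` parameters are all assumptions made for the example. Only message routing is shown; `stream_message`, `get_or_create_user`, `get_settings`, and `update_settings` are omitted.

```python
from typing import Callable, Dict, Mapping, Protocol


class MessageClient(Protocol):
    """Assumed minimal slice of the client interface used by this sketch."""

    def send_message(self, chat_id: str, text: str) -> str: ...


class RoutedPlatformClient:
    """Routes each call to the delegate for the room's bound agent."""

    def __init__(
        self,
        room_lookup: Callable[[str], Mapping[str, str]],
        delegate_factory: Callable[[str], MessageClient],
    ) -> None:
        # room_lookup maps a local chat id to that room's stored metadata.
        self._room_lookup = room_lookup
        # delegate_factory builds one low-level client for a given agent_id.
        self._delegate_factory = delegate_factory
        self._delegates: Dict[str, MessageClient] = {}

    def _delegate_for(self, agent_id: str) -> MessageClient:
        # One delegate per agent, created on first use and then reused.
        if agent_id not in self._delegates:
            self._delegates[agent_id] = self._delegate_factory(agent_id)
        return self._delegates[agent_id]

    def send_message(self, chat_id: str, text: str) -> str:
        room = self._room_lookup(chat_id)
        delegate = self._delegate_for(room["agent_id"])
        # The delegate sees the platform-facing id, not the local label.
        return delegate.send_message(room["platform_chat_id"], text)
```

Delegates are keyed by `agent_id` rather than by user, so nothing agent-specific is cached per user; whether each delegate opens a fresh upstream connection per request is left to the delegate itself.
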
## Error Handling

### Missing or invalid selected agent

If `selected_agent_id` is absent:

- ask the user to select an agent

If `selected_agent_id` points to an agent that no longer exists in config:

- treat the selection as invalid
- ask the user to select again

### Missing room binding

If the room has no `agent_id`:

- bind it only when the user has a valid current selection
- otherwise return the selection prompt

### Stale room

If the room is stale:

- do not attempt fallback routing
- do not silently rewrite room metadata
- instruct the user to run `!new`

### Invalid config

If the bot cannot load a valid agent registry:

- fail at startup
- do not start in degraded single-agent mode

## Testing Expectations

Tests for this design should prove:

- config parsing and startup validation
- selecting an agent persists `selected_agent_id`
- selecting an agent inside an unbound room activates that room
- `!new` binds the new room to the selected agent
- messages in a bound room use that room's `agent_id`
- stale rooms reject normal messaging with a clear `!new` instruction
- returning to the same agent later does not revive stale rooms

## Migration Notes

Existing rooms may have `platform_chat_id` but no `agent_id`.

For this MVP, treat those rooms as legacy-unbound rooms:

- if the user has a valid selected agent, the room may be bound on first use
- if no agent is selected, the room prompts for selection first

No automatic migration across agents is introduced.

### Existing users without `selected_agent_id`

Existing users upgraded from the single-agent model may have working rooms but no stored `selected_agent_id`.

For this MVP, that is handled explicitly:

- normal messaging is paused until the user selects an agent
- the first valid selection can bind an unbound room immediately
- the surface does not auto-assign a default agent in a multi-agent config

This is intentional. Once more than one agent exists, silent migration would be ambiguous and could route a user to the wrong backend target.