# Matrix Surface Restart State Persistence Design

## Goal

Make the Matrix surface survive a normal restart or container recreate without losing the minimal state required to keep working as a bot.

The result should be:

- after restart, the bot can still answer messages and execute commands
- the bot remembers the selected agent for each user
- the bot remembers which agent and `platform_chat_id` each room is bound to
- temporary UX flows may be lost without being treated as a bug

## Core Decision

The selected persistence model is:

`durable surface state only`

This means:

- persist only the state needed for routing and normal command handling
- do not persist temporary UI and wizard state
- require persistent local storage for the surface
- do not attempt recovery if the volumes backing that storage are lost

## Why This Decision

The Matrix surface already has two different classes of state:

- stable local state that defines how rooms and users are routed
- temporary UX state that exists only to complete short-lived interactions

Trying to make all temporary UX state survive restart would add complexity and edge cases without improving the core requirement: the bot should still function normally after restart.

The chosen design keeps persistence aligned with what the surface actually owns:

- Matrix-side metadata and routing state are durable
- agent conversation memory is the platform's responsibility
- lost local volumes are treated as environment reset, not as an auto-recovery scenario

## Scope

This design covers:

- which Matrix surface data must persist across restart
- where that data lives
- how restart behavior interacts with multi-agent routing
- what state is intentionally non-durable

This design does not cover:

- platform-side persistence of agent memory
- workspace isolation between multiple agents
- automatic reconstruction after total local volume loss
- persistence of temporary UX flows

## Persistence Boundary

### Durable state

The Matrix surface must persist:

- `matrix_user:*`
- `matrix_room:*`
- `chat:*`
- `PLATFORM_CHAT_SEQ_KEY`
- user-level `selected_agent_id`
- room-bound `agent_id`
- room-bound `platform_chat_id`

This is the minimal state required so that, after restart, the surface can:

- identify the user
- identify the room
- determine which agent should receive a message
- determine which `platform_chat_id` should be used
- continue allocating new `platform_chat_id` values without reusing an already issued sequence number
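
As a rough illustration of what persisting these keys could look like, the sketch below stores JSON values in a single SQLite table. It is a minimal sketch under assumptions: the document does not say that `lambda_matrix.db` is SQLite, and the table name `kv`, the class `SurfaceStore`, and the helpers `user_key` and `room_key` are hypothetical names introduced here. Only the `matrix_user:` and `matrix_room:` prefixes come from the list above.

```python
import json
import sqlite3
from typing import Optional


def user_key(matrix_user_id: str) -> str:
    # Prefix taken from the durable-state list; the suffix layout is an assumption.
    return f"matrix_user:{matrix_user_id}"


def room_key(room_id: str) -> str:
    # Prefix taken from the durable-state list; the suffix layout is an assumption.
    return f"matrix_room:{room_id}"


class SurfaceStore:
    """Hypothetical JSON-over-SQLite key-value store for durable surface state."""

    def __init__(self, path: str) -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )
        self._conn.commit()

    def put(self, key: str, value: dict) -> None:
        # The connection context manager commits on success and rolls back on error.
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
                (key, json.dumps(value)),
            )

    def get(self, key: str) -> Optional[dict]:
        row = self._conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def close(self) -> None:
        self._conn.close()
```

Because every write is committed before `put` returns, a second `SurfaceStore` opened on the same file after a restart should see the same user and room metadata, provided the file lives on persistent storage.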

### Non-durable state

The Matrix surface does not need to persist:

- staged attachments
- pending `!load` selection
- pending `!yes/!no` confirmation
- any temporary service UI step
- live `AgentApi` instances or connection objects

After restart, those flows may be lost. The bot only needs to remain operational.

## Storage Model

### Surface durable storage

The Matrix surface must use persistent storage for:

- `lambda_matrix.db`
- `matrix_store`

`lambda_matrix.db` stores the local key-value state used by the surface.
`matrix_store` stores Matrix client state needed by `nio`.

These paths must be backed by persistent container storage in normal deployments.

### Shared `/workspace`

The current local runtime also uses `/workspace`, but workspace behavior is outside the scope of this design.

For this document, the only requirement is:

- do not make restart persistence depend on solving per-agent workspace isolation first

## Restart Assumptions

This design assumes:

- normal restart or redeploy with persistent local volumes still present

This design does not assume:

- automatic recovery after deleting or losing those volumes

If the relevant volumes are lost, the environment is treated as reset.

## Data Model Requirements

### User metadata

User metadata remains the durable location for user-level routing state.

Example:

```json
{
  "space_id": "!space:example.org",
  "next_chat_index": 4,
  "selected_agent_id": "agent-2"
}
```

### Room metadata

Room metadata remains the durable location for room-level routing state.

Example:

```json
{
  "room_type": "chat",
  "chat_id": "C3",
  "display_name": "Чат 3",
  "matrix_user_id": "@alice:example.org",
  "space_id": "!space:example.org",
  "platform_chat_id": "42",
  "agent_id": "agent-2"
}
```
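
The room record above could be loaded into a typed structure along the following lines. This is only a sketch: the class name `RoomMeta` and its `from_json` helper are hypothetical, and treating `platform_chat_id` and `agent_id` as optional is an assumption made here so that rooms without a binding can still be parsed. The field names are the ones shown in the example.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class RoomMeta:
    """Hypothetical typed view of a persisted room metadata record."""

    room_type: str
    chat_id: str
    display_name: str
    matrix_user_id: str
    space_id: str
    # Assumed optional: a room may exist before it is bound to an agent.
    platform_chat_id: Optional[str] = None
    agent_id: Optional[str] = None

    @classmethod
    def from_json(cls, raw: str) -> "RoomMeta":
        data = json.loads(raw)
        # Ignore unknown keys so that extra fields in stored records do not break loading.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})
```

Dropping unknown keys is one possible design choice; it keeps older code able to read records written by newer code, at the cost of silently ignoring fields it does not understand.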

### Platform chat sequence

The global `PLATFORM_CHAT_SEQ_KEY` remains part of durable surface state.

Its purpose is:

- allocate monotonically increasing `platform_chat_id` values
- avoid reusing a previously issued platform chat identifier during normal restart or redeploy

This sequence must be stored in the same durable surface store as the room and user metadata.
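
One way to picture the allocation is a counter row that is read, incremented, and committed before the new identifier is handed out. The sketch below assumes an SQLite-backed store with a `kv` table. The string value assigned to `PLATFORM_CHAT_SEQ_KEY` and the functions `open_store` and `allocate_platform_chat_id` are hypothetical, since the document names only the constant.

```python
import sqlite3

# Assumed value; the document names the constant but not its contents.
PLATFORM_CHAT_SEQ_KEY = "platform_chat_seq"


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) the hypothetical key-value table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def allocate_platform_chat_id(conn: sqlite3.Connection) -> str:
    """Increment the persisted counter and return the new value as a string."""
    with conn:  # commits on success, rolls back if an exception is raised
        row = conn.execute(
            "SELECT value FROM kv WHERE key = ?", (PLATFORM_CHAT_SEQ_KEY,)
        ).fetchone()
        next_value = (int(row[0]) if row else 0) + 1
        conn.execute(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
            (PLATFORM_CHAT_SEQ_KEY, str(next_value)),
        )
    return str(next_value)
```

The sketch targets a single surface process. If several processes shared the same file, the read and the write would need to run inside one explicit write transaction to stay safe.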

## Runtime Semantics After Restart

After restart, the Matrix surface must:

1. load the durable Matrix store
2. load the durable surface key-value state
3. load the agent registry config
4. resume normal room routing using persisted `selected_agent_id`, `agent_id`, and `platform_chat_id`

Expected behavior:

- a user with a valid previously selected agent does not need to reselect it
- a room previously bound to an agent remains bound to that agent
- normal messages and commands continue to work
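
Step 4 can be pictured as a pure lookup over the persisted room metadata. The following is a minimal sketch, assuming room metadata is available as a dictionary with the `agent_id` and `platform_chat_id` fields shown under Data Model Requirements. `Route` and `resolve_route` are hypothetical names, and checking the agent against the registry is left to the rules under Failure Handling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    """Where a message from a given room should be sent (hypothetical type)."""

    agent_id: str
    platform_chat_id: str


def resolve_route(room_meta: dict) -> Optional[Route]:
    """Build a route from persisted room metadata, or None if the room is unbound."""
    agent_id = room_meta.get("agent_id")
    platform_chat_id = room_meta.get("platform_chat_id")
    if agent_id is None or platform_chat_id is None:
        return None
    return Route(agent_id=agent_id, platform_chat_id=platform_chat_id)
```

Because the function depends only on persisted fields, it should give the same answer before and after a restart, which is the property the expected behavior above relies on.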

### Lost temporary UX state

If the bot restarts during a transient UX flow:

- staged attachments may disappear
- pending `!load` selections may disappear
- pending confirmations may disappear

This is acceptable and should not block normal operation after restart.

## Interaction With Multi-Agent Routing

The multi-agent design introduces new durable state that must survive restart:

- `selected_agent_id` on the user
- `agent_id` on the room
- `PLATFORM_CHAT_SEQ_KEY` in the surface store

Restart persistence and multi-agent routing therefore belong together.

Without durable storage for those fields, a restart would make room routing ambiguous.

## Failure Handling

### Missing durable surface store

If the durable store paths are missing because the environment was reset:

- do not attempt to reconstruct a full working state from scratch in this design
- treat startup as a clean environment
- allow normal onboarding flows to begin again
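
A startup check for this case might look like the sketch below. It assumes `lambda_matrix.db` is a file and `matrix_store` is a directory, and the function names `is_clean_environment` and `prepare_storage` are hypothetical. The document does not say how a partial loss (one path present, the other missing) should be handled, so the sketch reports a clean environment only when both are absent.

```python
from pathlib import Path


def is_clean_environment(db_path: Path, store_dir: Path) -> bool:
    """Return True when neither durable storage location exists on disk."""
    return not db_path.exists() and not store_dir.exists()


def prepare_storage(db_path: Path, store_dir: Path) -> bool:
    """Create missing storage locations and report whether this is a clean start."""
    clean = is_clean_environment(db_path, store_dir)
    # Creating the directories is safe on both clean and resumed starts.
    db_path.parent.mkdir(parents=True, exist_ok=True)
    store_dir.mkdir(parents=True, exist_ok=True)
    return clean
```

A caller could log the returned flag at startup so that an unexpected clean start, for example after a misconfigured volume mount, is visible to operators.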

### Invalid durable references

If persisted `selected_agent_id` or room `agent_id` references an agent no longer present in config:

- do not crash
- treat the selection or room binding as invalid
- ask the user to select a valid agent again
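
The three rules above can be expressed as a small check that never raises. This is a sketch under assumptions: `check_agent_reference` is a hypothetical helper, and the agent registry is represented simply as a set of known agent IDs loaded from config.

```python
from typing import Optional, Set, Tuple


def check_agent_reference(
    agent_id: Optional[str], known_agents: Set[str]
) -> Tuple[Optional[str], bool]:
    """Return (usable_agent_id, needs_reselection) for a persisted reference.

    A missing or unknown agent ID is treated as invalid instead of raising,
    so the caller can prompt the user to pick an agent again.
    """
    if agent_id is not None and agent_id in known_agents:
        return agent_id, False
    return None, True
```

The same helper could be applied to both the user's `selected_agent_id` and a room's `agent_id`, since both are validated against the same config.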

### Platform conversation memory

If the upstream platform loses agent memory across restart:

- that is outside the surface persistence boundary
- the surface must still route correctly
- platform memory persistence remains a platform responsibility

## Testing Expectations

Tests for this design should prove:

- `selected_agent_id` survives restart through durable local storage
- room `agent_id` and `platform_chat_id` survive restart through durable local storage
- the bot can route messages correctly after restart without user reconfiguration
- missing temporary UX state does not break normal messaging and command handling
- invalid persisted agent references degrade into reselection prompts rather than crashes

## Operational Notes

For the Matrix surface to survive restart in the intended way, deployment must persist:

- `lambda_matrix.db`
- `matrix_store`

This is a deployment requirement, not an optional optimization.

The design intentionally stops there. It does not require:

- hot reload of agent config
- recovery after total local state loss
- persistence of temporary UX flows
- a solved multi-agent workspace story