# Matrix Surface Restart State Persistence Design
## Goal
Make the Matrix surface survive a normal restart or container recreate without losing the minimal state required to keep working as a bot.
The result should be:
- after restart, the bot can still answer messages and execute commands
- the bot remembers the selected agent for each user
- the bot remembers which agent and `platform_chat_id` each room is bound to
- temporary UX flows may be lost without being treated as a bug
## Core Decision
The selected persistence model is:
`durable surface state only`
This means:
- persist only the state needed for routing and normal command handling
- do not persist temporary UI and wizard state
- require persistent local storage for the surface
- do not attempt recovery if that storage is lost
## Why This Decision
The Matrix surface already has two different classes of state:
- stable local state that defines how rooms and users are routed
- temporary UX state that exists only to complete short-lived interactions
Trying to make all temporary UX state survive restart would add complexity and edge cases without improving the core requirement: the bot should still function normally after restart.
The chosen design keeps persistence aligned with what the surface actually owns:
- Matrix-side metadata and routing state are durable
- agent conversation memory is the platform's responsibility
- lost local volumes are treated as environment reset, not as an auto-recovery scenario
## Scope
This design covers:
- which Matrix surface data must persist across restart
- where that data lives
- how restart behavior interacts with multi-agent routing
- what state is intentionally non-durable
This design does not cover:
- platform-side persistence of agent memory
- workspace isolation between multiple agents
- automatic reconstruction after total local volume loss
- persistence of temporary UX flows
## Persistence Boundary
### Durable state
The Matrix surface must persist:
- `matrix_user:*`
- `matrix_room:*`
- `chat:*`
- `PLATFORM_CHAT_SEQ_KEY`
- `selected_agent_id`
- room-bound `agent_id`
- room-bound `platform_chat_id`
This is the minimal state required so that, after restart, the surface can:
- identify the user
- identify the room
- determine which agent should receive a message
- determine which `platform_chat_id` should be used
- continue allocating new `platform_chat_id` values without reusing an already issued sequence number
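A minimal sketch of how this state could be written and read back across a restart, assuming the surface store is a SQLite file holding JSON values in a single key-value table. The `SurfaceStore` class, the `kv` table, and the concrete key layout after the `matrix_user:` prefix are illustrative assumptions, not the surface's actual schema.

```python
import json
import sqlite3
from typing import Any, Optional


class SurfaceStore:
    """Hypothetical durable key-value store; the schema is an assumption."""

    def __init__(self, path: str) -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )
        self._conn.commit()

    def put(self, key: str, value: Any) -> None:
        # Values are serialized as JSON so user and room metadata keep their shape.
        self._conn.execute(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self._conn.commit()

    def get(self, key: str) -> Optional[Any]:
        row = self._conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        return None if row is None else json.loads(row[0])

    def close(self) -> None:
        self._conn.close()
```

Reopening the same file in a new process and calling `get` with the same key returns the previously stored metadata, which is the property a restart depends on.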
### Non-durable state
The Matrix surface does not need to persist:
- staged attachments
- pending `!load` selection
- pending `!yes/!no` confirmation
- any temporary service UI step
- live `AgentApi` instances or connection objects
After restart, those flows may be lost. The bot only needs to remain operational.
## Storage Model
### Surface durable storage
The Matrix surface must use persistent storage for:
- `lambda_matrix.db`
- `matrix_store`
`lambda_matrix.db` stores the local key-value state used by the surface.
`matrix_store` stores Matrix client state needed by `nio`.
These paths must be backed by persistent container storage in normal deployments.
### Shared `/workspace`
The current local runtime also uses `/workspace`, but workspace behavior is outside the scope of this design.
For this document, the only requirement is:
- do not make restart persistence depend on solving per-agent workspace isolation first
## Restart Assumptions
This design assumes:
- normal restart or redeploy with persistent local volumes still present
This design does not assume:
- automatic recovery after deleting or losing those volumes
If the relevant volumes are lost, the environment is treated as reset.
## Data Model Requirements
### User metadata
User metadata remains the durable location for user-level routing state.
Example:
```json
{
  "space_id": "!space:example.org",
  "next_chat_index": 4,
  "selected_agent_id": "agent-2"
}
```
### Room metadata
Room metadata remains the durable location for room-level routing state.
Example:
```json
{
  "room_type": "chat",
  "chat_id": "C3",
  "display_name": "Чат 3",
  "matrix_user_id": "@alice:example.org",
  "space_id": "!space:example.org",
  "platform_chat_id": "42",
  "agent_id": "agent-2"
}
```
### Platform chat sequence
The global `PLATFORM_CHAT_SEQ_KEY` remains part of durable surface state.
Its purpose is:
- allocate monotonically increasing `platform_chat_id` values
- avoid reusing a previously issued platform chat identifier during normal restart or redeploy
This sequence must be stored in the same durable surface store as the room and user metadata.
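The allocation rule above can be sketched as a read-increment-write inside one SQLite transaction. This is an illustration under assumptions: the string value behind `PLATFORM_CHAT_SEQ_KEY`, the `kv` table, and the helper names `open_store` and `allocate_platform_chat_id` are hypothetical, and the sketch assumes a single surface process writes to the store.

```python
import sqlite3

# Assumption: the real value of PLATFORM_CHAT_SEQ_KEY is not given in this document.
PLATFORM_CHAT_SEQ_KEY = "platform_chat_seq"


def open_store(path: str) -> sqlite3.Connection:
    """Open the durable store and make sure the key-value table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def allocate_platform_chat_id(conn: sqlite3.Connection) -> str:
    """Return the next platform_chat_id and persist the new counter value."""
    with conn:  # commits on success, rolls back if an exception is raised
        row = conn.execute(
            "SELECT value FROM kv WHERE key = ?", (PLATFORM_CHAT_SEQ_KEY,)
        ).fetchone()
        next_value = (int(row[0]) if row else 0) + 1
        conn.execute(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
            (PLATFORM_CHAT_SEQ_KEY, str(next_value)),
        )
    return str(next_value)
```

Because the counter is committed before the identifier is returned, a restart between two allocations continues from the last issued value instead of starting over.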
## Runtime Semantics After Restart
After restart, the Matrix surface must:
1. load the durable Matrix store
2. load the durable surface key-value state
3. load the agent registry config
4. resume normal room routing using persisted `selected_agent_id`, `agent_id`, and `platform_chat_id`
Expected behavior:
- a user with a valid previously selected agent does not need to reselect it
- a room previously bound to an agent remains bound to that agent
- normal messages and commands continue to work
### Lost temporary UX state
If the bot restarts during a transient UX flow:
- staged attachments may disappear
- pending `!load` selections may disappear
- pending confirmations may disappear
This is acceptable and should not block normal operation after restart.
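One way to keep lost transient state from blocking the bot is to treat a missing pending entry as an ordinary case rather than an error. The sketch below assumes pending confirmations live in a plain in-memory dict keyed by room ID; the function name and the reply texts are hypothetical.

```python
from typing import Dict

# In-memory only: this mapping is intentionally empty after every restart.
# It maps a room ID to a description of the action awaiting confirmation.
PendingConfirmations = Dict[str, str]


def handle_confirmation(
    pending: PendingConfirmations, room_id: str, answer: str
) -> str:
    """Handle a `!yes` or `!no` command and return the reply text."""
    action = pending.pop(room_id, None)
    if action is None:
        # The pending state was lost (for example by a restart) or never existed.
        return "Nothing is waiting for confirmation. Please repeat the original command."
    if answer == "!yes":
        return f"Confirmed: {action}"
    return f"Cancelled: {action}"
```

After a restart the dict is empty, so a stray `!yes` produces a short hint and the room keeps accepting normal messages and commands.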
## Interaction With Multi-Agent Routing
The multi-agent design introduces new durable state that must survive restart:
- `selected_agent_id` on the user
- `agent_id` on the room
- `PLATFORM_CHAT_SEQ_KEY` in the surface store
Restart persistence and multi-agent routing therefore belong together.
Without durable storage for those fields, a restart would make room routing ambiguous.
## Failure Handling
### Missing durable surface store
If the durable store paths are missing because the environment was reset:
- do not attempt to reconstruct the previous working state; that is out of scope for this design
- treat startup as a clean environment
- allow normal onboarding flows to begin again
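A startup check for this case might look like the sketch below. It assumes `lambda_matrix.db` is a single file and `matrix_store` is a directory; the function name is hypothetical. How a partially present store should be handled is not specified here, so the sketch reports a clean environment only when both locations are empty or absent.

```python
from pathlib import Path


def is_clean_environment(db_path: Path, matrix_store_dir: Path) -> bool:
    """Return True when neither durable location holds any data."""
    db_present = db_path.is_file() and db_path.stat().st_size > 0
    store_present = matrix_store_dir.is_dir() and any(matrix_store_dir.iterdir())
    return not (db_present or store_present)
```

When the check returns `True`, the surface can skip any resume logic and start onboarding as it would on a first deployment.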
### Invalid durable references
If persisted `selected_agent_id` or room `agent_id` references an agent no longer present in config:
- do not crash
- treat the selection or room binding as invalid
- ask the user to select a valid agent again
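The validation described above can be sketched as a lookup against the set of configured agent IDs. The function names, the shape of the registry (a plain collection of IDs), and the convention that `None` means "prompt for reselection" are assumptions made for illustration.

```python
from typing import Collection, Optional, Tuple


def resolve_agent(
    persisted_agent_id: Optional[str], configured_agents: Collection[str]
) -> Optional[str]:
    """Return the persisted agent ID only if it still exists in config."""
    if persisted_agent_id is not None and persisted_agent_id in configured_agents:
        return persisted_agent_id
    return None


def route_target(
    room_meta: dict, configured_agents: Collection[str]
) -> Optional[Tuple[str, str]]:
    """Return (agent_id, platform_chat_id), or None if the user must reselect."""
    agent_id = resolve_agent(room_meta.get("agent_id"), configured_agents)
    if agent_id is None:
        # Invalid or missing binding: the caller should prompt, not crash.
        return None
    return agent_id, room_meta["platform_chat_id"]
```

For example, a room bound to `agent-2` routes normally while `agent-2` is configured, and yields `None` once that agent is removed from the registry.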
### Platform conversation memory
If the upstream platform loses agent memory across restart:
- that is outside the surface persistence boundary
- the surface must still route correctly
- platform memory persistence remains a platform responsibility
## Testing Expectations
Tests for this design should prove:
- `selected_agent_id` survives restart through durable local storage
- room `agent_id` and `platform_chat_id` survive restart through durable local storage
- the bot can route messages correctly after restart without user reconfiguration
- missing temporary UX state does not break normal messaging and command handling
- invalid persisted agent references degrade into reselection prompts rather than crashes
## Operational Notes
For the Matrix surface to survive restart in the intended way, deployment must persist:
- `lambda_matrix.db`
- `matrix_store`
This is a deployment requirement, not an optional optimization.
The design intentionally stops there. It does not require:
- hot reload of agent config
- recovery after total local state loss
- persistence of temporary UX flows
- a solved multi-agent workspace story