Matrix Surface Restart State Persistence Design
Goal
Make the Matrix surface survive a normal restart or container recreate without losing the minimal state required to keep working as a bot.
The result should be:
- after restart, the bot can still answer messages and execute commands
- the bot remembers the selected agent for each user
- the bot remembers which agent and platform_chat_id each room is bound to
- temporary UX flows may be lost without being treated as a bug
Core Decision
The selected persistence model is:
durable surface state only
This means:
- persist only the state needed for routing and normal command handling
- do not persist temporary UI and wizard state
- require persistent local storage for the surface
- do not attempt recovery if that persistent storage is lost
Why This Decision
The Matrix surface already has two different classes of state:
- stable local state that defines how rooms and users are routed
- temporary UX state that exists only to complete short-lived interactions
Trying to make all temporary UX state survive restart would add complexity and edge cases without improving the core requirement: the bot should still function normally after restart.
The chosen design keeps persistence aligned with what the surface actually owns:
- Matrix-side metadata and routing state are durable
- agent conversation memory is the platform's responsibility
- lost local volumes are treated as environment reset, not as an auto-recovery scenario
Scope
This design covers:
- which Matrix surface data must persist across restart
- where that data lives
- how restart behavior interacts with multi-agent routing
- what state is intentionally non-durable
This design does not cover:
- platform-side persistence of agent memory
- workspace isolation between multiple agents
- automatic reconstruction after total local volume loss
- persistence of temporary UX flows
Persistence Boundary
Durable state
The Matrix surface must persist:
- matrix_user:*
- matrix_room:*
- chat:*
- selected_agent_id
- room-bound agent_id
- room-bound platform_chat_id
This is the minimal state required so that, after restart, the surface can:
- identify the user
- identify the room
- determine which agent should receive a message
- determine which platform_chat_id should be used
Non-durable state
The Matrix surface does not need to persist:
- staged attachments
- pending !load selection
- pending !yes/!no confirmation
- any temporary service UI step
- live AgentApi instances or connection objects
After restart, those flows may be lost. The bot only needs to remain operational.
Storage Model
Surface durable storage
The Matrix surface must use persistent storage for:
- lambda_matrix.db
- matrix_store
lambda_matrix.db stores the local key-value state used by the surface.
matrix_store stores Matrix client state needed by nio.
These paths must be backed by persistent container storage in normal deployments.
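As an illustration of the durable key-value layer, here is a minimal sketch. It assumes that lambda_matrix.db is a SQLite file holding JSON values, which this document does not state. The class name DurableKV, the table name kv, and the method names are hypothetical and are not taken from the actual codebase.

```python
import json
import sqlite3
from typing import Any, Optional


class DurableKV:
    """Hypothetical JSON key-value store backed by a single SQLite file."""

    def __init__(self, path: str) -> None:
        # Assumption: `path` points into a persistent volume.
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )
        self._conn.commit()

    def put(self, key: str, value: Any) -> None:
        # Commit on every write so a restart never loses acknowledged state.
        self._conn.execute(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self._conn.commit()

    def get(self, key: str) -> Optional[Any]:
        row = self._conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def close(self) -> None:
        self._conn.close()
```

Opening the same file again after a restart returns whatever was last written with put, which is the only property the surface needs from this layer.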
Shared /workspace
The current local runtime also uses /workspace, but workspace behavior is outside the scope of this design.
For this document, the only requirement is:
- do not make restart persistence depend on solving per-agent workspace isolation first
Restart Assumptions
This design assumes:
- normal restart or redeploy with persistent local volumes still present
This design does not assume:
- automatic recovery after deleting or losing those volumes
If the relevant volumes are lost, the environment is treated as reset.
Data Model Requirements
User metadata
User metadata remains the durable location for user-level routing state.
Example:
{
"space_id": "!space:example.org",
"next_chat_index": 4,
"selected_agent_id": "agent-2"
}
Room metadata
Room metadata remains the durable location for room-level routing state.
Example:
{
"room_type": "chat",
"chat_id": "C3",
"display_name": "Chat 3",
"matrix_user_id": "@alice:example.org",
"space_id": "!space:example.org",
"platform_chat_id": "42",
"agent_id": "agent-2"
}
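The two records above could be given typed shapes along the following lines. This is a sketch only: the class names UserMeta and RoomMeta and the helper functions are hypothetical, and treating selected_agent_id, agent_id, and platform_chat_id as optional is an assumption made so that records written before an agent was bound still load.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class UserMeta:
    space_id: str
    next_chat_index: int
    # Assumption: absent until the user has selected an agent.
    selected_agent_id: Optional[str] = None


@dataclass
class RoomMeta:
    room_type: str
    chat_id: str
    display_name: str
    matrix_user_id: str
    space_id: str
    # Assumption: absent until the room has been bound.
    platform_chat_id: Optional[str] = None
    agent_id: Optional[str] = None


def room_from_json(raw: str) -> RoomMeta:
    # Raises TypeError on unknown keys, which surfaces schema drift early.
    return RoomMeta(**json.loads(raw))


def room_to_json(room: RoomMeta) -> str:
    return json.dumps(asdict(room), ensure_ascii=False)
```

Keeping the routing fields on these records, rather than in a separate in-memory structure, means that whatever persists the record also persists the routing decision.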
Runtime Semantics After Restart
After restart, the Matrix surface must:
- load the durable Matrix store
- load the durable surface key-value state
- load the agent registry config
- resume normal room routing using persisted selected_agent_id, agent_id, and platform_chat_id
Expected behavior:
- a user with a valid previously selected agent does not need to reselect it
- a room previously bound to an agent remains bound to that agent
- normal messages and commands continue to work
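A minimal sketch of routing from persisted room state follows. It assumes room records are stored under keys of the form matrix_room:<room_id>, which is a guess at the suffix of the matrix_room:* pattern. The names Route and resolve_route are hypothetical, and a plain mapping stands in for the real key-value store.

```python
from typing import Mapping, NamedTuple, Optional


class Route(NamedTuple):
    agent_id: str
    platform_chat_id: str


def resolve_route(store: Mapping[str, dict], room_id: str) -> Optional[Route]:
    """Return the persisted route for a room, or None if the room is unbound."""
    room = store.get(f"matrix_room:{room_id}")  # assumed key layout
    if not room:
        return None
    agent_id = room.get("agent_id")
    platform_chat_id = room.get("platform_chat_id")
    if not agent_id or not platform_chat_id:
        # The room is known but has no complete binding yet.
        return None
    return Route(agent_id, platform_chat_id)
```

Because the function reads only persisted fields, it gives the same answer before and after a restart as long as the store is intact.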
Lost temporary UX state
If the bot restarts during a transient UX flow:
- staged attachments may disappear
- pending !load selections may disappear
- pending confirmations may disappear
This is acceptable and should not block normal operation after restart.
Interaction With Multi-Agent Routing
The multi-agent design introduces new durable state that must survive restart:
- selected_agent_id on the user
- agent_id on the room
Restart persistence and multi-agent routing therefore belong together.
Without durable storage for those fields, a restart would make room routing ambiguous.
Failure Handling
Missing durable surface store
If the durable store paths are missing because the environment was reset:
- do not attempt to reconstruct a full working state from scratch in this design
- treat startup as a clean environment
- allow normal onboarding flows to begin again
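One way to detect this case at startup is sketched below. The function names are hypothetical, and two choices here are assumptions rather than something this design specifies: lambda_matrix.db is taken to be a file and matrix_store a directory, and a partially missing store is treated the same as a fully missing one.

```python
from pathlib import Path
from typing import List


def missing_durable_paths(db_path: str, store_dir: str) -> List[str]:
    """List the durable paths that are absent at startup."""
    missing = []
    if not Path(db_path).is_file():
        missing.append(db_path)
    if not Path(store_dir).is_dir():
        missing.append(store_dir)
    return missing


def startup_mode(db_path: str, store_dir: str) -> str:
    # Assumption: any missing path means the environment was reset.
    return "clean" if missing_durable_paths(db_path, store_dir) else "resume"
```

Logging the result of missing_durable_paths at startup makes an accidental volume loss visible instead of silently looking like a first run.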
Invalid durable references
If a persisted selected_agent_id or room agent_id references an agent that is no longer present in the config:
- do not crash
- treat the selection or room binding as invalid
- ask the user to select a valid agent again
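The degradation described above can be sketched as a small validation step. The function names and the prompt text are hypothetical, and the set of configured agent IDs is assumed to come from the agent registry config.

```python
from typing import Optional, Set


def resolve_selected_agent(user_meta: dict, known_agents: Set[str]) -> Optional[str]:
    """Return the persisted agent ID if it is still configured, else None."""
    agent_id = user_meta.get("selected_agent_id")
    if agent_id in known_agents:
        return agent_id
    # A stale or missing reference is not an error; the caller asks again.
    return None


def reply_for(user_meta: dict, known_agents: Set[str]) -> str:
    agent_id = resolve_selected_agent(user_meta, known_agents)
    if agent_id is None:
        return "Please select an agent."  # hypothetical prompt text
    return f"routing to {agent_id}"
```

The same check applies to a room's agent_id; in both cases the invalid value produces a prompt rather than an exception.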
Platform conversation memory
If the upstream platform loses agent memory across restart:
- that is outside the surface persistence boundary
- the surface must still route correctly
- platform memory persistence remains a platform responsibility
Testing Expectations
Tests for this design should prove:
- selected_agent_id survives restart through durable local storage
- room agent_id and platform_chat_id survive restart through durable local storage
- the bot can route messages correctly after restart without user reconfiguration
- missing temporary UX state does not break normal messaging and command handling
- invalid persisted agent references degrade into reselection prompts rather than crashes
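A restart test might take roughly the following shape. This is a sketch under assumptions: it uses a throwaway SQLite file as a stand-in for lambda_matrix.db, the save and load helpers are hypothetical, and a "restart" is simulated by opening a fresh connection so that nothing held in memory is reused.

```python
import json
import os
import sqlite3
import tempfile
import unittest
from contextlib import closing

_SCHEMA = "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)"


def save(path, key, value):
    # Each call opens and closes its own connection, like a separate process.
    with closing(sqlite3.connect(path)) as conn:
        conn.execute(_SCHEMA)
        conn.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, json.dumps(value)))
        conn.commit()


def load(path, key):
    with closing(sqlite3.connect(path)) as conn:
        conn.execute(_SCHEMA)
        row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    return json.loads(row[0]) if row else None


class RestartPersistenceTest(unittest.TestCase):
    def test_room_binding_survives_restart(self):
        with tempfile.TemporaryDirectory() as tmp:
            db = os.path.join(tmp, "lambda_matrix.db")
            save(db, "matrix_room:!room:example.org",
                 {"agent_id": "agent-2", "platform_chat_id": "42"})
            # Simulated restart: only the file on disk carries over.
            room = load(db, "matrix_room:!room:example.org")
            self.assertEqual(room["agent_id"], "agent-2")
            self.assertEqual(room["platform_chat_id"], "42")

    def test_unknown_room_has_no_state(self):
        with tempfile.TemporaryDirectory() as tmp:
            db = os.path.join(tmp, "lambda_matrix.db")
            self.assertIsNone(load(db, "matrix_room:!other:example.org"))
```

A real suite would exercise the surface's own storage and startup code rather than these stand-ins; the sketch only shows the write, restart, read pattern.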
Operational Notes
For the Matrix surface to survive restart in the intended way, deployment must persist:
- lambda_matrix.db
- matrix_store
This is a deployment requirement, not an optional optimization.
The design intentionally stops there. It does not require:
- hot reload of agent config
- recovery after total local state loss
- persistence of temporary UX flows
- a solved multi-agent workspace story