surfaces/docs/superpowers/specs/2026-04-24-matrix-surface-restart-state-persistence-design.md

Matrix Surface Restart State Persistence Design

Goal

Make the Matrix surface survive a normal restart or container recreation without losing the minimal state it needs to keep working as a bot.

The result should be:

  • after restart, the bot can still answer messages and execute commands
  • the bot remembers the selected agent for each user
  • the bot remembers which agent and platform_chat_id each room is bound to
  • temporary UX flows may be lost without being treated as a bug

Core Decision

The selected persistence model is:

durable surface state only

This means:

  • persist only the state needed for routing and normal command handling
  • do not persist temporary UI and wizard state
  • require persistent local storage for the surface
  • do not attempt recovery if that storage is lost

Why This Decision

The Matrix surface already has two different classes of state:

  • stable local state that defines how rooms and users are routed
  • temporary UX state that exists only to complete short-lived interactions

Making all temporary UX state survive restart would add complexity and edge cases without serving the core requirement: the bot should still function normally after restart.

The chosen design keeps persistence aligned with what the surface actually owns:

  • Matrix-side metadata and routing state are durable
  • agent conversation memory is the platform's responsibility
  • lost local volumes are treated as environment reset, not as an auto-recovery scenario

Scope

This design covers:

  • which Matrix surface data must persist across restart
  • where that data lives
  • how restart behavior interacts with multi-agent routing
  • what state is intentionally non-durable

This design does not cover:

  • platform-side persistence of agent memory
  • workspace isolation between multiple agents
  • automatic reconstruction after total local volume loss
  • persistence of temporary UX flows

Persistence Boundary

Durable state

The Matrix surface must persist:

  • matrix_user:*
  • matrix_room:*
  • chat:*
  • selected_agent_id, stored in user metadata
  • room-bound agent_id, stored in room metadata
  • room-bound platform_chat_id, stored in room metadata

This is the minimal state required so that, after restart, the surface can:

  • identify the user
  • identify the room
  • determine which agent should receive a message
  • determine which platform_chat_id should be used
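
The lookup this state enables can be sketched as follows. This is a minimal illustration, not the surface's actual code: the key layout `<prefix>:<matrix id>` is an assumption (the design only names the `matrix_user:*` and `matrix_room:*` prefixes), and `Route` and `resolve_route` are hypothetical names. It also assumes that messages in a room are routed by the room's own binding.

```python
from typing import Mapping, NamedTuple, Optional


class Route(NamedTuple):
    """Where a message should be sent on the platform side."""
    agent_id: str
    platform_chat_id: str


def resolve_route(
    kv: Mapping[str, dict], matrix_user_id: str, room_id: str
) -> Optional[Route]:
    """Return the routing target for a message, or None if state is missing."""
    # Assumed key layout; only the prefixes themselves come from the design.
    user = kv.get(f"matrix_user:{matrix_user_id}")
    room = kv.get(f"matrix_room:{room_id}")
    if user is None or room is None:
        return None  # unknown user or room, so nothing can be routed

    agent_id = room.get("agent_id")
    platform_chat_id = room.get("platform_chat_id")
    if not agent_id or not platform_chat_id:
        return None  # the room exists but is not bound to an agent chat
    return Route(agent_id, platform_chat_id)


kv = {
    "matrix_user:@alice:example.org": {"selected_agent_id": "agent-2"},
    "matrix_room:!room:example.org": {
        "agent_id": "agent-2",
        "platform_chat_id": "42",
    },
}
print(resolve_route(kv, "@alice:example.org", "!room:example.org"))
```

Because the function reads only persisted records, it behaves the same before and after a restart as long as the key-value state is intact.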

Non-durable state

The Matrix surface does not need to persist:

  • staged attachments
  • pending !load selection
  • pending !yes/!no confirmation
  • any temporary service UI step
  • live AgentApi instances or connection objects

After restart, those flows may be lost. The bot only needs to remain operational.

Storage Model

Surface durable storage

The Matrix surface must use persistent storage for:

  • lambda_matrix.db
  • matrix_store

lambda_matrix.db stores the local key-value state used by the surface. matrix_store stores Matrix client state needed by nio.

These paths must be backed by persistent container storage in normal deployments.
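
To illustrate what "persistent" means for the key-value state, here is a small sketch of a store that keeps its data across a close and reopen. It assumes lambda_matrix.db is a SQLite file holding JSON values. The design does not state the file format, so treat `DurableKV`, its table layout, and `survives_reopen` as hypothetical.

```python
import json
import sqlite3
import tempfile
from pathlib import Path
from typing import Optional


class DurableKV:
    """Tiny JSON-valued key-value store backed by a single SQLite file."""

    def __init__(self, path: Path) -> None:
        self._conn = sqlite3.connect(str(path))
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )
        self._conn.commit()

    def put(self, key: str, value: dict) -> None:
        # The connection context manager commits on success, rolls back on error.
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
                (key, json.dumps(value)),
            )

    def get(self, key: str) -> Optional[dict]:
        row = self._conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def close(self) -> None:
        self._conn.close()


def survives_reopen(directory: Path) -> Optional[dict]:
    """Write a record, close the store, reopen it, and read the record back."""
    db_path = directory / "lambda_matrix.db"
    first = DurableKV(db_path)
    first.put("matrix_user:@alice:example.org", {"selected_agent_id": "agent-2"})
    first.close()  # stands in for the process exiting

    second = DurableKV(db_path)  # stands in for the process starting again
    try:
        return second.get("matrix_user:@alice:example.org")
    finally:
        second.close()


with tempfile.TemporaryDirectory() as tmp:
    print(survives_reopen(Path(tmp)))
```

The round trip only succeeds because the file outlives the first connection. In a container, that holds only when the directory containing the file is on a volume that survives recreation.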

Shared /workspace

The current local runtime also uses /workspace, but workspace behavior is outside the scope of this design.

For this document, the only requirement is:

  • do not make restart persistence depend on solving per-agent workspace isolation first

Restart Assumptions

This design assumes:

  • normal restart or redeploy with persistent local volumes still present

This design does not assume:

  • automatic recovery after deleting or losing those volumes

If the relevant volumes are lost, the environment is treated as reset.

Data Model Requirements

User metadata

User metadata remains the durable location for user-level routing state.

Example:

{
  "space_id": "!space:example.org",
  "next_chat_index": 4,
  "selected_agent_id": "agent-2"
}

Room metadata

Room metadata remains the durable location for room-level routing state.

Example:

{
  "room_type": "chat",
  "chat_id": "C3",
  "display_name": "Чат 3",
  "matrix_user_id": "@alice:example.org",
  "space_id": "!space:example.org",
  "platform_chat_id": "42",
  "agent_id": "agent-2"
}
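
One way to read the two records above in code is as typed structures. The sketch below is illustrative only: the class names are hypothetical, and treating selected_agent_id, platform_chat_id, and agent_id as optional is an assumption, since the design does not say which fields may be absent.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UserMeta:
    space_id: str
    next_chat_index: int
    # Assumed optional: a user may not have picked an agent yet.
    selected_agent_id: Optional[str] = None


@dataclass(frozen=True)
class RoomMeta:
    room_type: str
    chat_id: str
    display_name: str
    matrix_user_id: str
    space_id: str
    # Assumed optional: a room may exist before it is bound.
    platform_chat_id: Optional[str] = None
    agent_id: Optional[str] = None

    @property
    def is_routable(self) -> bool:
        """True when both routing fields are present."""
        return bool(self.platform_chat_id and self.agent_id)


user = UserMeta(**json.loads(
    '{"space_id": "!space:example.org", "next_chat_index": 4,'
    ' "selected_agent_id": "agent-2"}'
))
room = RoomMeta(**json.loads(
    '{"room_type": "chat", "chat_id": "C3", "display_name": "Chat 3",'
    ' "matrix_user_id": "@alice:example.org", "space_id": "!space:example.org",'
    ' "platform_chat_id": "42", "agent_id": "agent-2"}'
))
print(user.selected_agent_id, room.is_routable)
```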

Runtime Semantics After Restart

After restart, the Matrix surface must:

  1. load the durable Matrix store
  2. load the durable surface key-value state
  3. load the agent registry config
  4. resume normal room routing using persisted selected_agent_id, agent_id, and platform_chat_id

Expected behavior:

  • a user with a valid previously selected agent does not need to reselect it
  • a room previously bound to an agent remains bound to that agent
  • normal messages and commands continue to work

Lost temporary UX state

If the bot restarts during a transient UX flow:

  • staged attachments may disappear
  • pending !load selections may disappear
  • pending confirmations may disappear

This is acceptable and should not block normal operation after restart.
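
The restart sequence and the treatment of transient state can be sketched together. The loader callables, `SurfaceState`, and `restore_surface` are hypothetical names used for illustration. The sketch only shows the ordering from the numbered steps above and that transient UX state starts empty rather than being restored.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Mapping


@dataclass
class SurfaceState:
    kv: Mapping[str, dict]      # durable surface key-value state
    agents: Mapping[str, dict]  # agent registry loaded from config
    # Transient UX state always starts empty; it is never read from disk.
    pending_confirmations: Dict[str, str] = field(default_factory=dict)
    staged_attachments: Dict[str, List[str]] = field(default_factory=dict)


def restore_surface(
    load_matrix_store: Callable[[], None],
    load_kv: Callable[[], Mapping[str, dict]],
    load_agent_registry: Callable[[], Mapping[str, dict]],
) -> SurfaceState:
    """Run the restart steps in order and return the restored state."""
    load_matrix_store()             # 1. durable Matrix store
    kv = load_kv()                  # 2. durable surface key-value state
    agents = load_agent_registry()  # 3. agent registry config
    # 4. routing resumes from the persisted fields held in kv
    return SurfaceState(kv=kv, agents=agents)


calls: List[str] = []
state = restore_surface(
    load_matrix_store=lambda: calls.append("matrix_store"),
    load_kv=lambda: calls.append("kv") or {
        "matrix_room:!room:example.org": {"agent_id": "agent-2", "platform_chat_id": "42"}
    },
    load_agent_registry=lambda: calls.append("agents") or {"agent-2": {}},
)
print(calls, state.pending_confirmations)
```

Keeping the transient fields out of the loaders makes the persistence boundary visible in the type itself: anything not produced by a loader is, by construction, lost on restart.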

Interaction With Multi-Agent Routing

The multi-agent design introduces new durable state that must survive restart:

  • selected_agent_id on the user
  • agent_id on the room

Restart persistence and multi-agent routing therefore belong together.

Without durable storage for those fields, a restart would make room routing ambiguous.

Failure Handling

Missing durable surface store

If the durable store paths are missing because the environment was reset:

  • do not attempt to reconstruct a working state from scratch (reconstruction is out of scope for this design)
  • treat startup as a clean environment
  • allow normal onboarding flows to begin again
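
A startup check along these lines could distinguish a normal restart from a reset environment. This is a sketch under assumptions: `StartupMode` and `detect_startup_mode` are hypothetical, lambda_matrix.db is assumed to be a file, and matrix_store is assumed to be a directory. The design does not say what should happen when only one of the two is present, so the sketch reports that case separately instead of choosing a behavior.

```python
import tempfile
from enum import Enum
from pathlib import Path


class StartupMode(Enum):
    RESUME = "resume"    # both durable stores are present
    CLEAN = "clean"      # neither is present: treat as a reset environment
    PARTIAL = "partial"  # only one is present: not specified by the design


def _non_empty_dir(path: Path) -> bool:
    """True if path is a directory containing at least one entry."""
    return path.is_dir() and any(path.iterdir())


def detect_startup_mode(db_path: Path, store_dir: Path) -> StartupMode:
    has_db = db_path.is_file()
    # An empty directory (for example a freshly mounted volume) counts as absent.
    has_store = _non_empty_dir(store_dir)
    if has_db and has_store:
        return StartupMode.RESUME
    if not has_db and not has_store:
        return StartupMode.CLEAN
    return StartupMode.PARTIAL


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    print(detect_startup_mode(root / "lambda_matrix.db", root / "matrix_store").name)
```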

Invalid durable references

If persisted selected_agent_id or room agent_id references an agent no longer present in config:

  • do not crash
  • treat the selection or room binding as invalid
  • ask the user to select a valid agent again
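
The degradation described above can be sketched as a small validation step that runs before routing. The function name, the registry shape (a mapping keyed by agent id), and the prompt text are all hypothetical.

```python
from typing import Mapping, Optional, Tuple

RESELECT_PROMPT = "Please select an available agent to continue."


def check_agent_ref(
    agent_id: Optional[str], registry: Mapping[str, dict]
) -> Tuple[Optional[str], Optional[str]]:
    """Return (valid_agent_id, prompt).

    Exactly one element of the pair is None: either the reference is valid
    and can be used for routing, or the user must be asked to reselect.
    """
    if agent_id is not None and agent_id in registry:
        return agent_id, None
    # Missing or stale reference: never raise, fall back to a prompt.
    return None, RESELECT_PROMPT


registry = {"agent-1": {}, "agent-2": {}}
print(check_agent_ref("agent-2", registry))
print(check_agent_ref("agent-9", registry))
```

Returning a prompt instead of raising keeps a stale reference from turning into a startup or message-handling crash, which is the behavior this section asks for.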

Platform conversation memory

If the upstream platform loses agent memory across restart:

  • that is outside the surface persistence boundary
  • the surface must still route correctly
  • platform memory persistence remains a platform responsibility

Testing Expectations

Tests for this design should prove:

  • selected_agent_id survives restart through durable local storage
  • room agent_id and platform_chat_id survive restart through durable local storage
  • the bot can route messages correctly after restart without user reconfiguration
  • missing temporary UX state does not break normal messaging and command handling
  • invalid persisted agent references degrade into reselection prompts rather than crashes

Operational Notes

For the Matrix surface to survive restart in the intended way, deployment must persist:

  • lambda_matrix.db
  • matrix_store

This is a deployment requirement, not an optional optimization.

The design intentionally stops there. It does not require:

  • hot reload of agent config
  • recovery after total local state loss
  • persistence of temporary UX flows
  • a solved multi-agent workspace story