Phase 01.1: Matrix restart reconciliation and dev reset workflow - Research
Researched: 2026-04-03
Domain: Matrix adapter restart reconciliation, local state recovery, dev reset workflow
Confidence: HIGH
<user_constraints>
User Constraints (from CONTEXT.md)
Locked Decisions
- D-01: The local SQLite store must no longer be treated as the single source of truth for the Matrix runtime in the dev workflow.
- D-02: On startup, the bot must try to restore the minimum required local state from the existing Matrix rooms / Space rather than require a full reset.
- D-03: Reconciliation must restore at least the matrix_user:*, matrix_room:*, and missing chat:{user}:{chat_id} records when the server-side rooms already exist.
- D-04: Reconciliation must not create new Space/rooms when the task is specifically to restore local state after a restart.
- D-05: A normal bot restart must be the primary path for development; deleting lambda_matrix.db and matrix_store must not be required to verify the workflow.
- D-06: If local state is incomplete, the bot must either restore it or log a clear reason, rather than fail on commands such as !rename.
- D-07: Inconsistency between room_meta and ChatManager must be detected and resolved automatically on startup or on first access.
- D-08: A dedicated dev-only reset tool/script is needed for controlled QA, instead of a manual set of shell commands.
- D-09: The reset workflow must support at least a local-only reset: deleting lambda_matrix.db and matrix_store, with clear instructions on what to do with the server-side Matrix rooms.
- D-10: If full server-side cleanup is not automated in this phase, the tool must explicitly print which manual steps are required in the Matrix client.
Claude's Discretion
- Exact call site of reconciliation in the startup flow
- Internal structure of the helper module (bootstrap.py, reconcile.py, or similar)
- Format of the dev reset script and the level of automation for server-side cleanup
- Details of debug logging and a dry-run mode, if they help without inflating scope
Deferred Ideas (OUT OF SCOPE)
- Full production-grade migration of historical Matrix state across schema versions
- Automatic server-side deletion/leave for all Matrix rooms and Space during reset, if it requires broader admin semantics
- Any Phase 2 SDK integration work
</user_constraints>
Summary
Phase 01.1 should be planned as a bootstrap/recovery phase, not as another chat-feature phase. The current Matrix adapter has no startup reconciliation path: adapter/matrix/bot.py logs in and goes directly to sync_forever(), while routing and command handlers assume matrix_room:*, matrix_user:*, and chat:* keys already exist. That means local DB loss currently produces logical corruption, not just missing cache.
The safe standard approach is: perform a first sync that hydrates joined-room state, inspect the bot's current joined rooms and room state from the homeserver, rebuild the minimal local metadata needed for command routing, and only then enter the long-running sync loop. Reconciliation should be non-destructive and idempotent: if local keys already exist and match server state, leave them alone; if they are missing, recreate them; if they conflict, prefer the server room topology for Matrix-specific metadata and recreate missing ChatManager rows from that.
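The three-way rule above (match: leave alone, missing: recreate, conflict: prefer the server) can be sketched as a pure function over one metadata record. This is a minimal illustration, not the repo's implementation: the function name `reconcile_record`, the action labels, and the idea of merging server fields over local ones are all assumptions.

```python
from typing import Optional


def reconcile_record(local: Optional[dict], server: dict) -> tuple[dict, str]:
    """Decide what to store for one key (hypothetical helper).

    Returns (record, action), where action is "restored", "unchanged",
    or "conflict_repaired".
    """
    if local is None:
        # Missing locally: recreate the record from server-derived data.
        return dict(server), "restored"
    if all(local.get(key) == value for key, value in server.items()):
        # Every server-derived field already matches: leave the record alone.
        return local, "unchanged"
    # Conflict: server topology wins for the fields it provides, while
    # local-only fields (for example an "archived" flag) are preserved.
    return {**local, **server}, "conflict_repaired"
```

Running the function a second time on its own output yields "unchanged", which is the idempotency property the paragraph asks for.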
For reset, separate two workflows explicitly. local-only reset is the default and should be automated. Optional server-side cleanup may leave/forget rooms for the bot account, but it cannot promise global deletion of Matrix rooms for all members; if that is not automated, the tool must print the exact manual steps for the Matrix client.
Primary recommendation: Add a startup reconcile_matrix_state() step before sync_forever(), and ship a dev-only reset CLI with local-only, server-leave-forget, and dry-run modes.
Project Constraints (from CLAUDE.md)
- Do not treat missing Lambda SDK as a blocker.
- Keep all platform calls behind platform/interface.py.
- Current runtime implementation is platform/mock.py; recommendations must work with that.
- Prefer architecture changes in adapters and core without coupling to future SDK internals.
- Use pytest-based verification.
- Do not recommend committing .env.
- Respect dependency order: core/ first, then platform/, then adapters.
Standard Stack
Core
| Library | Version | Purpose | Why Standard |
|---|---|---|---|
| Python | 3.14.3 installed | Runtime for bot and scripts | Already available locally; codebase targets >=3.11. |
| matrix-nio | 0.25.2, published 2024-10-04 | Matrix client, sync, room membership/state APIs | Already installed; exposes the exact bootstrap/reset APIs this phase needs. |
| SQLiteStore (repo) | local | Adapter/core KV persistence | Existing persistence contract for matrix_user:*, matrix_room:*, and chat:*. |
| Matrix Client-Server API | spec latest | Authoritative room membership/state semantics | Needed to reason about restart recovery and leave/forget behavior correctly. |
Supporting
| Library | Version | Purpose | When to Use |
|---|---|---|---|
| pytest | 9.0.2, published 2025-12-06 | Test runner | For targeted adapter/bootstrap regression tests. |
| pytest-asyncio | 1.3.0, published 2025-11-10 | Async test execution | For async reconciliation/reset flows. |
| structlog | 25.5.0, published 2025-10-27 | Diagnostics | For reconciliation summaries and conflict logging. |
| python-dotenv | 1.2.2, published 2026-03-01 | Env loading | Already used by adapter/matrix/bot.py for Matrix config. |
Alternatives Considered
| Instead of | Could Use | Tradeoff |
|---|---|---|
| Startup reconciliation from joined rooms + state | Force developers to wipe local DB and recreate rooms | Simpler code, but directly violates D-01, D-02, D-05. |
| Non-destructive local rebuild | Full auto-recreate of Space/rooms on missing local state | Easier to implement, but causes duplicate Matrix rooms and breaks D-04. |
| Dev reset script | README-only manual ritual | Lower code cost, but not repeatable and fails D-08..D-10. |
Installation:
uv sync
Version verification: Verified via installed environment and PyPI metadata on 2026-04-03:
- matrix-nio 0.25.2 - 2024-10-04
- pytest 9.0.2 - 2025-12-06
- pytest-asyncio 1.3.0 - 2025-11-10
- structlog 25.5.0 - 2025-10-27
- python-dotenv 1.2.2 - 2026-03-01
Architecture Patterns
Recommended Project Structure
adapter/matrix/
├── bot.py             # startup flow calls reconciliation before sync loop
├── reconcile.py       # bootstrap/rebuild logic from Matrix server state
├── reset.py           # dev-only reset CLI / entrypoint
├── room_router.py     # room_id -> chat_id with recovery hook
├── store.py           # metadata helpers, prefix scans, derived counters
└── handlers/
    ├── auth.py        # first-time provisioning only
    └── chat.py        # uses recovered state, no provisioning fallback
Pattern 1: Two-Phase Startup Bootstrap
What: Split startup into login -> initial sync/full_state -> reconcile -> steady-state sync_forever.
When to use: Always for Matrix bot startup when local DB may be missing or stale.
Example:
# Source: matrix-nio AsyncClient docs/source + repo startup flow
client = AsyncClient(...)
runtime = build_runtime(store=SQLiteStore(db_path), client=client)
await login_or_restore_session(client)
await client.sync(timeout=0, full_state=True)
report = await reconcile_matrix_state(client, runtime.store, runtime.chat_mgr)
logger.info("matrix_reconcile_complete", **report)
await client.sync_forever(timeout=30000)
Pattern 2: Rebuild Local Metadata From Joined Rooms
What: Enumerate joined rooms, inspect local hydrated room objects or room state, and recreate missing matrix_room:*, matrix_user:*, and chat:* records.
When to use: On startup and optionally on unregistered:{room_id} fallback at runtime.
Example:
# Source: matrix-nio AsyncClient.joined_rooms/room_get_state + repo store contracts
joined = await client.joined_rooms()
for room_id in joined.rooms:
    state = await client.room_get_state(room_id)
    # detect: space room vs chat room, owner user, child relationship, display name
    # rebuild matrix_room:{room_id}
    # rebuild chat:{matrix_user_id}:{chat_id} if absent
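The "detect" step in the comments above could look roughly like the following. It is a sketch that assumes the state arrives as a list of raw event dicts with "type", "state_key", and "content" keys (the shape of a Matrix state response); the helper name `classify_room` and the returned dict layout are hypothetical. It relies only on Matrix state (m.room.create type, m.space.child, m.space.parent), with the room name kept as a display label rather than an identifier.

```python
from typing import Optional


def classify_room(state_events: list[dict]) -> dict:
    """Classify a room as "space" or "room" from raw state event dicts."""
    kind = "room"
    children: list[str] = []
    parent: Optional[str] = None
    name: Optional[str] = None
    for event in state_events:
        event_type = event.get("type")
        content = event.get("content") or {}
        if event_type == "m.room.create" and content.get("type") == "m.space":
            kind = "space"
        elif event_type == "m.space.child" and content.get("via"):
            # A child event without "via" is treated as a removed link.
            children.append(event.get("state_key", ""))
        elif event_type == "m.space.parent" and content.get("via"):
            parent = event.get("state_key")
        elif event_type == "m.room.name":
            # Display label only; never used for matching.
            name = content.get("name")
    return {"kind": kind, "children": children, "parent": parent, "name": name}
```

Rooms that come back as neither a Space nor a child of the recovered Space (for example a leftover DM) are candidates for the skipped list rather than for repair.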
Pattern 3: Non-Destructive Reconciliation Report
What: Return a structured report: scanned rooms, restored rooms, restored chats, conflicts, skipped rooms.
When to use: Every reconciliation run, including dry-run.
Example:
{
    "joined_rooms": 4,
    "restored_user_meta": 1,
    "restored_room_meta": 3,
    "restored_chat_rows": 3,
    "conflicts": [],
    "skipped_rooms": ["!dm:example.org"],
}
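One way to keep the report shape above stable across the normal and dry-run paths is a small dataclass. The class name `ReconcileReport` and the `as_log_fields` method are illustrative assumptions; only the field names come from the example.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ReconcileReport:
    """Counters and lists collected during one reconciliation run."""

    joined_rooms: int = 0
    restored_user_meta: int = 0
    restored_room_meta: int = 0
    restored_chat_rows: int = 0
    conflicts: list[str] = field(default_factory=list)
    skipped_rooms: list[str] = field(default_factory=list)

    def as_log_fields(self) -> dict:
        # Plain dict so it can be splatted into a structured log call.
        return asdict(self)
```

A call such as `logger.info("matrix_reconcile_complete", **report.as_log_fields())` would then emit the same keys as the dict shown in the example.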
Pattern 4: Reset Modes Are Explicit
What: Separate local-only, server-leave-forget, and dry-run.
When to use: For dev/QA only. Never mix destructive server cleanup into normal startup.
Example:
uv run python -m adapter.matrix.reset --mode local-only
uv run python -m adapter.matrix.reset --mode server-leave-forget --dry-run
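A possible shape for the CLI behind those two commands is sketched below with argparse. Only `--mode` and `--dry-run` appear in the commands above; the `--db-path` and `--store-path` flags, the function names, and the default path values are assumptions for illustration.

```python
import argparse
import shutil
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="adapter.matrix.reset")
    parser.add_argument(
        "--mode",
        choices=["local-only", "server-leave-forget"],
        default="local-only",
    )
    parser.add_argument("--dry-run", action="store_true")
    # Hypothetical flags; defaults assume the file names used in this repo.
    parser.add_argument("--db-path", default="lambda_matrix.db")
    parser.add_argument("--store-path", default="matrix_store")
    return parser


def reset_local(db_path: Path, store_path: Path, dry_run: bool) -> list[str]:
    """Remove the local DB file and store directory; return a log of actions."""
    actions: list[str] = []
    for path in (db_path, store_path):
        if not path.exists():
            actions.append(f"skip (missing): {path}")
            continue
        actions.append(f"{'would remove' if dry_run else 'removed'}: {path}")
        if dry_run:
            continue
        if path.is_dir():
            shutil.rmtree(path)
        else:
            path.unlink()
    return actions
```

Keeping `reset_local` free of any Matrix client calls makes the default mode safe to run without credentials, and the returned action list doubles as the dry-run output.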
Anti-Patterns to Avoid
- Provisioning during reconciliation: Do not create a new Space or new rooms while trying to recover missing local state.
- Treating next_chat_index as primary truth: Derive it from recovered chat_id values after scan; do not trust a missing or stale counter.
- Routing unknown rooms straight through: unregistered:{room_id} is a signal to reconcile, not a stable runtime identity.
- Destructive reset by default: Startup must never leave/forget rooms automatically.
- Blindly trusting local surface_ref: If chat:* and matrix_room:* disagree, rebuild from Matrix room metadata and repair the chat row.
Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---|---|---|---|
| Room discovery | Custom DB-only reconstruction heuristics | AsyncClient.joined_rooms() plus synced room state | Server already knows which rooms the bot joined. |
| Space membership detection | Naming-convention parsing of room names | Matrix state: m.room.create.type, m.space.child, m.space.parent | Names are mutable and non-authoritative. |
| Room cleanup semantics | Custom “delete room” assumptions | room_leave() + room_forget() semantics | Client API supports leave/forget, not guaranteed global deletion. |
| Chat ID recovery | Hardcoded C1/C2/... reset | Rebuild from existing matrix_room:*/server state and compute next index | Prevents collisions after partial DB loss. |
| Diagnostic output | Ad hoc print() strings | Structured reconciliation/reset report via structlog | Easier manual QA and failure triage. |
Key insight: The homeserver already persists the bot’s room graph. This phase should rehydrate local cache from that graph, not attempt to replace it with a second custom truth model.
Common Pitfalls
Pitfall 1: Joining the sync loop before reconciliation
What goes wrong: Commands arrive while local metadata is still missing, producing unregistered:{room_id} routing or ChatManager misses.
Why it happens: Current main() enters sync_forever() immediately after login.
How to avoid: Perform initial sync and reconciliation first.
Warning signs: unregistered_room logs immediately after restart; ValueError("Chat ... not found") on !rename or !archive.
Pitfall 2: Recovering room metadata but not chat rows
What goes wrong: Room routing works, but ChatManager.rename/archive/list_active still fails because chat:{user}:{chat_id} rows were not recreated.
Why it happens: Matrix adapter metadata and core chat metadata live in different keyspaces.
How to avoid: Reconciliation must repair both stores in one pass.
Warning signs: matrix_room:* exists but chat:* keys do not.
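The warning sign above can be checked mechanically. The sketch below assumes the store can be viewed as a mapping from key to dict and that each matrix_room:* value carries `matrix_user_id` and `chat_id` fields; both the value layout and the helper name are assumptions, so adapt them to the real SQLiteStore contract.

```python
def find_rooms_missing_chat_rows(store: dict[str, dict]) -> list[tuple[str, str]]:
    """Return (room_id, chat_id) pairs whose chat:{user}:{chat_id} row is absent."""
    missing: list[tuple[str, str]] = []
    for key, meta in store.items():
        if not key.startswith("matrix_room:"):
            continue
        user_id = meta.get("matrix_user_id")
        chat_id = meta.get("chat_id")
        if not user_id or not chat_id:
            # For example the Space itself, which has no chat_id.
            continue
        if f"chat:{user_id}:{chat_id}" not in store:
            missing.append((key.removeprefix("matrix_room:"), chat_id))
    return missing
```

A non-empty result after reconciliation would indicate that only one of the two keyspaces was repaired.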
Pitfall 3: Trusting stale next_chat_index
What goes wrong: New chats reuse existing C IDs after local recovery.
Why it happens: next_chat_id() increments a persisted counter that may be absent or behind.
How to avoid: After scan, set next_chat_index = max(recovered_chat_numbers) + 1.
Warning signs: New room gets C1 even though Space already contains prior rooms.
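The counter rule from this pitfall is small enough to show directly. The sketch assumes chat IDs follow the C1, C2, ... pattern mentioned in this document and ignores anything that does not match; the function name is hypothetical.

```python
import re
from typing import Iterable

_CHAT_ID = re.compile(r"^C(\d+)$")


def next_chat_index(recovered_chat_ids: Iterable[str]) -> int:
    """Return max(recovered numeric suffixes) + 1, or 1 if none were recovered."""
    numbers = [
        int(match.group(1))
        for chat_id in recovered_chat_ids
        if (match := _CHAT_ID.match(chat_id))
    ]
    return max(numbers, default=0) + 1
```

Deriving the value from the scan on every reconciliation run means a missing or stale persisted counter can never cause a reused ID.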
Pitfall 4: Assuming room names identify chat rooms safely
What goes wrong: Reconciliation binds the wrong room because a user renamed a room or Space.
Why it happens: Names are user-facing labels, not stable identifiers.
How to avoid: Prefer room state and existing chat_id metadata; use display names only as fallback.
Warning signs: Duplicate “Чат 1” names or renamed rooms break matching.
Pitfall 5: Over-promising full cleanup
What goes wrong: Reset script claims a “clean slate” but rooms still exist in Element or for other members.
Why it happens: Leaving/forgetting affects the bot account’s membership/history, not necessarily global room deletion.
How to avoid: Name the mode accurately and print the manual client steps when needed.
Warning signs: QA reruns still show old rooms in the user’s client.
Code Examples
Verified patterns from official sources and the installed library surface:
Initial Sync Before Reconcile
# Source: matrix-nio AsyncClient.sync/sync_forever
await client.sync(timeout=0, full_state=True)
report = await reconcile_matrix_state(client, store, chat_mgr)
await client.sync_forever(timeout=30000)
Space Child Link Creation
# Source: Matrix client-server API state event + current auth/new-chat flow
await client.room_put_state(
    room_id=space_id,
    event_type="m.space.child",
    content={"via": [homeserver]},
    state_key=chat_room_id,
)
Bot-Side Leave/Forget Cleanup
# Source: matrix-nio AsyncClient.room_leave / room_forget
for room_id in room_ids:
    await client.room_leave(room_id)
    await client.room_forget(room_id)
Router Recovery Trigger
# Source: repo room_router contract
chat_id = await resolve_chat_id(store, room_id, matrix_user_id)
if chat_id.startswith("unregistered:"):
    await reconcile_single_room(client, store, chat_mgr, room_id, matrix_user_id)
State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|---|---|---|---|
| Local adapter DB treated as the operational truth | Rebuildable local cache from server room graph | Mature Matrix client practice; supported by current Matrix CS API and matrix-nio | Restart no longer requires destructive local reset. |
| Manual room cleanup in client after experiments | Scripted leave/forget plus explicit manual instructions | Current matrix-nio 0.25.x API surface | QA becomes repeatable and auditable. |
| Immediate steady-state sync after login | Initial sync/full-state bootstrap before long polling | Supported by current AsyncClient.sync() / sync_forever() behavior | Reconciliation can run before any user traffic is handled. |
Deprecated/outdated:
- README.md Matrix manual QA instruction rm -f lambda_matrix.db as the primary restart flow: outdated for this phase.
- DM-first Matrix recovery assumptions in docs/matrix-prototype.md: outdated relative to Phase 1 Space+rooms decisions.
Open Questions
1. How exactly should reconciliation identify the owning Matrix user for a recovered room when local matrix_room:* is gone?
   - What we know: the bot can enumerate joined rooms and fetch room state; current healthy metadata stores matrix_user_id and space_id.
   - What's unclear: whether Phase 1-created rooms also expose enough server-side structure to recover owner deterministically without existing local metadata in every case.
   - Recommendation: Plan a proof test against a real homeserver/client. If room-state-only ownership is ambiguous, persist a tiny bot-authored marker state event going forward, but keep that addition narrowly scoped.
2. Should runtime recovery happen only on startup, or also lazily on first unknown room access?
   - What we know: startup repair satisfies D-02/D-07 for common restart loss; room_router already surfaces unknown rooms cleanly.
   - What's unclear: whether partial DB corruption during runtime is common enough to justify lazy single-room repair in Phase 01.1.
   - Recommendation: Make startup reconciliation required, lazy room repair optional if it stays small.
3. How much of server cleanup should Phase 01.1 automate?
   - What we know: room_leave() and room_forget() are available; global room deletion is not what the client API guarantees.
   - What's unclear: whether automating bot-side leave/forget is worth the extra risk for this urgent phase.
   - Recommendation: Keep local-only mandatory. Make server cleanup optional and clearly labeled experimental/dev-only if included.
Environment Availability
| Dependency | Required By | Available | Version | Fallback |
|---|---|---|---|---|
| Python | Runtime, scripts, tests | ✓ | 3.14.3 | — |
| uv | Standard install/run workflow | ✓ | 0.9.30 | python -m + existing venv |
| pytest | Automated verification | ✓ | 9.0.2 | uv run pytest |
| Matrix homeserver credentials | Real restart/reset manual QA | ✗ in current shell | — | Manual-only after .env is configured |
| Matrix bot local DB/store paths | Reset workflow | ✓ | defaults in code | Can override with MATRIX_DB_PATH / MATRIX_STORE_PATH |
Missing dependencies with no fallback:
- Live Matrix credentials for real manual reconciliation/reset QA.
Missing dependencies with fallback:
- None for repository-only implementation and tests.
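For the reset workflow, the path overrides listed in the table above could be resolved in one place. This sketch assumes the in-code defaults are lambda_matrix.db and matrix_store (the names used in the locked decisions); the real defaults live in the adapter code and should be checked there. The function name is hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_reset_paths(env: Mapping[str, str] = os.environ) -> tuple[Path, Path]:
    """Return (db_path, store_path), honoring MATRIX_DB_PATH / MATRIX_STORE_PATH."""
    # Default values are assumptions; verify against the adapter's own defaults.
    db_path = Path(env.get("MATRIX_DB_PATH", "lambda_matrix.db"))
    store_path = Path(env.get("MATRIX_STORE_PATH", "matrix_store"))
    return db_path, store_path
```

Passing the environment as a parameter keeps the function easy to test without mutating the process environment.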
Validation Architecture
Test Framework
| Property | Value |
|---|---|
| Framework | pytest 9.0.2 + pytest-asyncio 1.3.0 |
| Config file | pyproject.toml |
| Quick run command | pytest tests/adapter/matrix -v |
| Full suite command | pytest tests/ -v |
Phase Requirements → Test Map
| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|---|---|---|---|---|
| PH01.1-BOOT | Startup rebuilds missing matrix_user:*, matrix_room:*, and chat:* from existing rooms without creating new rooms | unit/integration | pytest tests/adapter/matrix/test_reconcile.py -v | ❌ Wave 0 |
| PH01.1-ROUTER | Unknown room fallback can trigger repair or yields diagnosable warning without crashing commands | unit | pytest tests/adapter/matrix/test_room_router_reconcile.py -v | ❌ Wave 0 |
| PH01.1-COUNTER | Reconciliation resets next_chat_index to recovered max + 1 | unit | pytest tests/adapter/matrix/test_reconcile.py -k next_chat_index -v | ❌ Wave 0 |
| PH01.1-RESET | Dev reset local-only removes local DB/store paths and prints next steps | unit/smoke | pytest tests/adapter/matrix/test_reset.py -v | ❌ Wave 0 |
| PH01.1-NONDESTRUCTIVE | Reconciliation never calls room creation APIs | unit | pytest tests/adapter/matrix/test_reconcile.py -k no_create -v | ❌ Wave 0 |
Sampling Rate
- Per task commit: pytest tests/adapter/matrix -v
- Per wave merge: pytest tests/ -v
- Phase gate: Full suite green before /gsd:verify-work
Wave 0 Gaps
- tests/adapter/matrix/test_reconcile.py - startup reconciliation scenarios
- tests/adapter/matrix/test_reset.py - CLI/script reset modes and output
- tests/adapter/matrix/test_room_router_reconcile.py - lazy recovery or warning behavior
- Integration fixture for a fake AsyncClient response surface matching joined_rooms() and room_get_state()
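The fake client fixture listed above might start out like this. It is a stand-in that mimics only the attributes reconciliation is expected to read (`rooms` on the joined-rooms response, `events` on the state response); the class names are hypothetical, and a `room_create` that fails loudly is one possible way to back the non-destructive requirement.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class FakeJoinedRooms:
    rooms: list[str]


@dataclass
class FakeRoomState:
    room_id: str
    events: list[dict]


@dataclass
class FakeAsyncClient:
    """Minimal stand-in for the client surface used during reconciliation."""

    state_by_room: dict[str, list[dict]] = field(default_factory=dict)
    calls: list[str] = field(default_factory=list)

    async def joined_rooms(self) -> FakeJoinedRooms:
        self.calls.append("joined_rooms")
        return FakeJoinedRooms(rooms=list(self.state_by_room))

    async def room_get_state(self, room_id: str) -> FakeRoomState:
        self.calls.append(f"room_get_state:{room_id}")
        return FakeRoomState(room_id=room_id, events=self.state_by_room[room_id])

    async def room_create(self, *args, **kwargs):
        # Reconciliation must never provision rooms, so record and fail loudly.
        self.calls.append("room_create")
        raise AssertionError("reconciliation must not create rooms")
```

A test can then assert on `client.calls` to confirm that a reconciliation run only read server state.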
Sources
Primary (HIGH confidence)
- Matrix Client-Server API - room state, leave, forget, joined rooms, Spaces semantics: https://spec.matrix.org/latest/client-server-api/index.html
- matrix-nio installed 0.25.2 API surface verified locally on 2026-04-03 via AsyncClient.sync, sync_forever, joined_rooms, room_get_state, room_leave, room_forget
- Repo code: adapter/matrix/bot.py, adapter/matrix/store.py, adapter/matrix/room_router.py, adapter/matrix/handlers/auth.py, core/chat.py
- PyPI release metadata: https://pypi.org/project/matrix-nio/ , https://pypi.org/project/pytest/ , https://pypi.org/project/pytest-asyncio/ , https://pypi.org/project/structlog/ , https://pypi.org/project/python-dotenv/
Secondary (MEDIUM confidence)
- README.md - current manual reset habit and run commands
- docs/matrix-prototype.md - original Matrix UX intent, noting outdated DM/reaction sections
- 01-CONTEXT.md - locked Phase 1 Matrix decisions
- 01-VERIFICATION.md - what has already been verified and what still needs human Matrix QA
Tertiary (LOW confidence)
- None
Metadata
Confidence breakdown:
- Standard stack: HIGH - verified against installed environment, PyPI metadata, and official Matrix spec
- Architecture: HIGH - directly grounded in current repo flow plus current matrix-nio/Matrix capabilities
- Pitfalls: HIGH - derived from concrete gaps in current startup/store/router code
Research date: 2026-04-03
Valid until: 2026-05-03