# Phase 01.1: Matrix restart reconciliation and dev reset workflow - Research

**Researched:** 2026-04-03
**Domain:** Matrix adapter restart reconciliation, local state recovery, dev reset workflow
**Confidence:** HIGH

## User Constraints (from CONTEXT.md)

### Locked Decisions

- **D-01:** The local SQLite store must no longer be treated as the sole source of truth for the Matrix runtime in the dev workflow.
- **D-02:** On startup, the bot must try to restore the minimal required local state from the already existing Matrix rooms / Space instead of requiring a full reset.
- **D-03:** Reconciliation must restore at least `matrix_user:*`, `matrix_room:*`, and missing `chat:{user}:{chat_id}` records when the server-side rooms already exist.
- **D-04:** Reconciliation must not create new Space/rooms when the task is specifically to restore local state after a restart.
- **D-05:** An ordinary bot restart must be the primary development path; deleting `lambda_matrix.db` and `matrix_store` must not be required to verify the workflow.
- **D-06:** If local state is incomplete, the bot must either restore it or log a clear reason, rather than failing on commands such as `!rename`.
- **D-07:** Inconsistency between `room_meta` and `ChatManager` must be detected and repaired automatically on startup or on first access.
- **D-08:** A separate dev-only reset tool/script is needed for controlled QA, instead of a manual sequence of shell commands.
- **D-09:** The reset workflow must at least support a `local-only` reset: deleting `lambda_matrix.db` and `matrix_store`, with clear instructions on what to do with the server-side Matrix rooms.
- **D-10:** If full server-side cleanup is not automated in this phase, the tool must explicitly print which manual steps are required in the Matrix client.
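As a rough illustration of the three key families named in D-03, the sketch below builds the expected keys for a user and reports which ones are absent locally. The helper names and the exact key formats (in particular, what follows `matrix_user:` and `matrix_room:`) are assumptions for illustration; the repo's store helpers may differ.

```python
from typing import Iterable


# Hypothetical key builders: the real store may format these keys differently.
def matrix_user_key(matrix_user_id: str) -> str:
    return f"matrix_user:{matrix_user_id}"


def matrix_room_key(room_id: str) -> str:
    return f"matrix_room:{room_id}"


def chat_key(matrix_user_id: str, chat_id: str) -> str:
    return f"chat:{matrix_user_id}:{chat_id}"


def missing_keys(
    existing: Iterable[str], matrix_user_id: str, rooms: dict[str, str]
) -> list[str]:
    """Return the keys D-03 expects that are absent from the local store.

    `rooms` maps a server-side room_id to the chat_id it should be bound to.
    """
    present = set(existing)
    expected = [matrix_user_key(matrix_user_id)]
    for room_id, chat_id in rooms.items():
        expected.append(matrix_room_key(room_id))
        expected.append(chat_key(matrix_user_id, chat_id))
    return [key for key in expected if key not in present]
```

A non-empty result is exactly the situation D-06 describes: the bot should repair those keys or log why it cannot, instead of failing later inside a command handler.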

### Claude's Discretion

- The exact place where reconciliation is called in the startup flow
- The internal structure of the helper module (`bootstrap.py`, `reconcile.py`, or similar)
- The format of the dev reset script and the level of automation for server-side cleanup
- Details of debug logging and dry-run mode, if they help without inflating scope

### Deferred Ideas (OUT OF SCOPE)

- Full production-grade migration of historical Matrix state across schema versions
- Automatic server-side deletion/leave for all Matrix rooms and Space during reset, if it requires broader admin semantics
- Any Phase 2 SDK integration work

## Summary

Phase 01.1 should be planned as a bootstrap/recovery phase, not as another chat-feature phase. The current Matrix adapter has no startup reconciliation path: `adapter/matrix/bot.py` logs in and goes directly to `sync_forever()`, while routing and command handlers assume `matrix_room:*`, `matrix_user:*`, and `chat:*` keys already exist. That means local DB loss currently produces logical corruption, not just a missing cache.

The safe standard approach is: perform a first sync that hydrates joined-room state, inspect the bot's current joined rooms and room state from the homeserver, rebuild the minimal local metadata needed for command routing, and only then enter the long-running sync loop. Reconciliation should be non-destructive and idempotent: if local keys already exist and match server state, leave them alone; if they are missing, recreate them; if they conflict, prefer the server room topology for Matrix-specific metadata and recreate missing `ChatManager` rows from that.

For reset, separate two workflows explicitly. `local-only` reset is the default and should be automated. Optional server-side cleanup may leave/forget rooms for the bot account, but it cannot promise global deletion of Matrix rooms for all members; if that is not automated, the tool must print the exact manual steps for the Matrix client.
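
The keep / recreate / repair rule from the summary can be sketched as a small pure function. This is a minimal illustration, not the repo's implementation: `Action`, `plan_repair`, and `apply_repair` are hypothetical names, and records are modeled as plain dicts in which every field present in `server` is treated as server-authoritative.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    KEEP = "keep"          # local record matches the server, nothing to do
    RECREATE = "recreate"  # local record is missing, rebuild it from the server
    REPAIR = "repair"      # local record conflicts, server-derived fields win


def plan_repair(local: Optional[dict], server: dict) -> Action:
    """Decide what to do with one local record given server-derived metadata."""
    if local is None:
        return Action.RECREATE
    # Only fields present in `server` are compared; extra local fields are ignored.
    for field, value in server.items():
        if local.get(field) != value:
            return Action.REPAIR
    return Action.KEEP


def apply_repair(local: Optional[dict], server: dict) -> tuple[Action, dict]:
    """Return the chosen action and the record that should be stored."""
    action = plan_repair(local, server)
    if action is Action.KEEP:
        return action, local
    merged = dict(local or {})
    merged.update(server)  # server topology overrides conflicting local values
    return action, merged
```

Running `apply_repair` a second time on its own output yields `Action.KEEP`, which is the idempotence property the summary asks for.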

**Primary recommendation:** Add a startup `reconcile_matrix_state()` step before `sync_forever()`, and ship a dev-only reset CLI with `local-only`, `server-leave-forget`, and `dry-run` modes.

## Project Constraints (from CLAUDE.md)

- Do not treat missing Lambda SDK as a blocker.
- Keep all platform calls behind `platform/interface.py`.
- Current runtime implementation is `platform/mock.py`; recommendations must work with that.
- Prefer architecture changes in adapters and core without coupling to future SDK internals.
- Use pytest-based verification.
- Do not recommend committing `.env`.
- Respect dependency order: `core/` first, then `platform/`, then adapters.

## Standard Stack

### Core

| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| Python | 3.14.3 installed | Runtime for bot and scripts | Already available locally; codebase targets `>=3.11`. |
| `matrix-nio` | 0.25.2, published 2024-10-04 | Matrix client, sync, room membership/state APIs | Already installed; exposes the exact bootstrap/reset APIs this phase needs. |
| `SQLiteStore` (repo) | local | Adapter/core KV persistence | Existing persistence contract for `matrix_user:*`, `matrix_room:*`, and `chat:*`. |
| Matrix Client-Server API | spec latest | Authoritative room membership/state semantics | Needed to reason about restart recovery and leave/forget behavior correctly. |

### Supporting

| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| `pytest` | 9.0.2, published 2025-12-06 | Test runner | For targeted adapter/bootstrap regression tests. |
| `pytest-asyncio` | 1.3.0, published 2025-11-10 | Async test execution | For async reconciliation/reset flows. |
| `structlog` | 25.5.0, published 2025-10-27 | Diagnostics | For reconciliation summaries and conflict logging. |
| `python-dotenv` | 1.2.2, published 2026-03-01 | Env loading | Already used by `adapter/matrix/bot.py` for Matrix config. |

### Alternatives Considered

| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| Startup reconciliation from joined rooms + state | Force developers to wipe local DB and recreate rooms | Simpler code, but directly violates D-01, D-02, D-05. |
| Non-destructive local rebuild | Full auto-recreate of Space/rooms on missing local state | Easier to implement, but causes duplicate Matrix rooms and breaks D-04. |
| Dev reset script | README-only manual ritual | Lower code cost, but not repeatable and fails D-08..D-10. |

**Installation:**

```bash
uv sync
```

**Version verification:** Verified via installed environment and PyPI metadata on 2026-04-03:

- `matrix-nio` `0.25.2` - 2024-10-04
- `pytest` `9.0.2` - 2025-12-06
- `pytest-asyncio` `1.3.0` - 2025-11-10
- `structlog` `25.5.0` - 2025-10-27
- `python-dotenv` `1.2.2` - 2026-03-01

## Architecture Patterns

### Recommended Project Structure

```text
adapter/matrix/
├── bot.py           # startup flow calls reconciliation before sync loop
├── reconcile.py     # bootstrap/rebuild logic from Matrix server state
├── reset.py         # dev-only reset CLI / entrypoint
├── room_router.py   # room_id -> chat_id with recovery hook
├── store.py         # metadata helpers, prefix scans, derived counters
└── handlers/
    ├── auth.py      # first-time provisioning only
    └── chat.py      # uses recovered state, no provisioning fallback
```

### Pattern 1: Two-Phase Startup Bootstrap

**What:** Split startup into `login -> initial sync/full_state -> reconcile -> steady-state sync_forever`.
**When to use:** Always for Matrix bot startup when local DB may be missing or stale.
**Example:**

```python
# Source: matrix-nio AsyncClient docs/source + repo startup flow
client = AsyncClient(...)
runtime = build_runtime(store=SQLiteStore(db_path), client=client)
await login_or_restore_session(client)
await client.sync(timeout=0, full_state=True)
report = await reconcile_matrix_state(client, runtime.store, runtime.chat_mgr)
logger.info("matrix_reconcile_complete", **report)
await client.sync_forever(timeout=30000)
```

### Pattern 2: Rebuild Local Metadata From Joined Rooms

**What:** Enumerate joined rooms, inspect local hydrated room objects or room state, and recreate missing `matrix_room:*`, `matrix_user:*`, and `chat:*` records.
**When to use:** On startup and optionally on `unregistered:{room_id}` fallback at runtime.
**Example:**

```python
# Source: matrix-nio AsyncClient.joined_rooms/room_get_state + repo store contracts
joined = await client.joined_rooms()
for room_id in joined.rooms:
    state = await client.room_get_state(room_id)
    # detect: space room vs chat room, owner user, child relationship, display name
    # rebuild matrix_room:{room_id}
    # rebuild chat:{matrix_user_id}:{chat_id} if absent
```

### Pattern 3: Non-Destructive Reconciliation Report

**What:** Return a structured report: scanned rooms, restored rooms, restored chats, conflicts, skipped rooms.
**When to use:** Every reconciliation run, including dry-run.
**Example:**

```python
{
    "joined_rooms": 4,
    "restored_user_meta": 1,
    "restored_room_meta": 3,
    "restored_chat_rows": 3,
    "conflicts": [],
    "skipped_rooms": ["!dm:example.org"],
}
```

### Pattern 4: Reset Modes Are Explicit

**What:** Separate `local-only`, `server-leave-forget`, and `dry-run`.
**When to use:** For dev/QA only. Never mix destructive server cleanup into normal startup.
**Example:**

```bash
uv run python -m adapter.matrix.reset --mode local-only
uv run python -m adapter.matrix.reset --mode server-leave-forget --dry-run
```

### Anti-Patterns to Avoid

- **Provisioning during reconciliation:** Do not create a new Space or new rooms while trying to recover missing local state.
- **Treating `next_chat_index` as primary truth:** Derive it from recovered `chat_id` values after scan; do not trust a missing or stale counter.
- **Routing unknown rooms straight through:** `unregistered:{room_id}` is a signal to reconcile, not a stable runtime identity.
- **Destructive reset by default:** Startup must never leave/forget rooms automatically.
- **Blindly trusting local `surface_ref`:** If `chat:*` and `matrix_room:*` disagree, rebuild from Matrix room metadata and repair the chat row.

## Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| Room discovery | Custom DB-only reconstruction heuristics | `AsyncClient.joined_rooms()` plus synced room state | Server already knows which rooms the bot joined. |
| Space membership detection | Naming-convention parsing of room names | Matrix state: `m.room.create.type`, `m.space.child`, `m.space.parent` | Names are mutable and non-authoritative. |
| Room cleanup semantics | Custom “delete room” assumptions | `room_leave()` + `room_forget()` semantics | Client API supports leave/forget, not guaranteed global deletion. |
| Chat ID recovery | Hardcoded `C1/C2/...` reset | Rebuild from existing `matrix_room:*`/server state and compute next index | Prevents collisions after partial DB loss. |
| Diagnostic output | Ad hoc `print()` strings | Structured reconciliation/reset report via `structlog` | Easier manual QA and failure triage. |

**Key insight:** The homeserver already persists the bot’s room graph. This phase should rehydrate local cache from that graph, not attempt to replace it with a second custom truth model.

## Common Pitfalls

### Pitfall 1: Entering the sync loop before reconciliation

**What goes wrong:** Commands arrive while local metadata is still missing, producing `unregistered:{room_id}` routing or `ChatManager` misses.
**Why it happens:** Current `main()` enters `sync_forever()` immediately after login.
**How to avoid:** Perform initial sync and reconciliation first.
**Warning signs:** `unregistered_room` logs immediately after restart; `ValueError("Chat ... not found")` on `!rename` or `!archive`.

### Pitfall 2: Recovering room metadata but not chat rows

**What goes wrong:** Room routing works, but `ChatManager.rename/archive/list_active` still fails because `chat:{user}:{chat_id}` rows were not recreated.
**Why it happens:** Matrix adapter metadata and core chat metadata live in different keyspaces.
**How to avoid:** Reconciliation must repair both stores in one pass.
**Warning signs:** `matrix_room:*` exists but `chat:*` keys do not.

### Pitfall 3: Trusting stale `next_chat_index`

**What goes wrong:** New chats reuse existing `C` IDs after local recovery.
**Why it happens:** `next_chat_id()` increments a persisted counter that may be absent or behind.
**How to avoid:** After scan, set `next_chat_index = max(recovered_chat_numbers) + 1`.
**Warning signs:** New room gets `C1` even though Space already contains prior rooms.

### Pitfall 4: Assuming room names identify chat rooms safely

**What goes wrong:** Reconciliation binds the wrong room because a user renamed a room or Space.
**Why it happens:** Names are user-facing labels, not stable identifiers.
**How to avoid:** Prefer room state and existing `chat_id` metadata; use display names only as fallback.
**Warning signs:** Duplicate “Чат 1” names or renamed rooms break matching.

### Pitfall 5: Over-promising full cleanup

**What goes wrong:** Reset script claims a “clean slate” but rooms still exist in Element or for other members.
**Why it happens:** Leaving/forgetting affects the bot account’s membership/history, not necessarily global room deletion.
**How to avoid:** Name the mode accurately and print the manual client steps when needed.
**Warning signs:** QA reruns still show old rooms in the user’s client.
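
The counter rule from Pitfall 3 can be sketched as follows. This is a minimal illustration that assumes chat IDs have the form `C<number>` (as in `C1`, `C2`) and that the first chat is `C1`; the function name mirrors the `next_chat_index` counter but is otherwise hypothetical.

```python
import re
from typing import Iterable

# Assumed chat ID shape: a capital "C" followed by decimal digits, e.g. "C12".
_CHAT_ID = re.compile(r"^C(\d+)$")


def next_chat_index(recovered_chat_ids: Iterable[str]) -> int:
    """Derive the next free index from recovered chat IDs instead of a stored counter."""
    numbers = [
        int(match.group(1))
        for chat_id in recovered_chat_ids
        if (match := _CHAT_ID.match(chat_id))
    ]
    # IDs that do not match the pattern (e.g. "unregistered:...") are ignored.
    return max(numbers, default=0) + 1
```

Deriving the value from the scan result on every reconciliation run means a missing or stale persisted counter can never cause an ID collision.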

## Code Examples

Verified patterns from official sources and the installed library surface:

### Initial Sync Before Reconcile

```python
# Source: matrix-nio AsyncClient.sync/sync_forever
await client.sync(timeout=0, full_state=True)
report = await reconcile_matrix_state(client, store, chat_mgr)
await client.sync_forever(timeout=30000)
```

### Space Child Link Creation

```python
# Source: Matrix client-server API state event + current auth/new-chat flow
await client.room_put_state(
    room_id=space_id,
    event_type="m.space.child",
    content={"via": [homeserver]},
    state_key=chat_room_id,
)
```

### Bot-Side Leave/Forget Cleanup

```python
# Source: matrix-nio AsyncClient.room_leave / room_forget
for room_id in room_ids:
    await client.room_leave(room_id)
    await client.room_forget(room_id)
```

### Router Recovery Trigger

```python
# Source: repo room_router contract
chat_id = await resolve_chat_id(store, room_id, matrix_user_id)
if chat_id.startswith("unregistered:"):
    await reconcile_single_room(client, store, chat_mgr, room_id, matrix_user_id)
```

## State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| Local adapter DB treated as the operational truth | Rebuildable local cache from server room graph | Mature Matrix client practice; supported by current Matrix CS API and `matrix-nio` | Restart no longer requires destructive local reset. |
| Manual room cleanup in client after experiments | Scripted leave/forget plus explicit manual instructions | Current `matrix-nio` 0.25.x API surface | QA becomes repeatable and auditable. |
| Immediate steady-state sync after login | Initial sync/full-state bootstrap before long polling | Supported by current `AsyncClient.sync()` / `sync_forever()` behavior | Reconciliation can run before any user traffic is handled. |

**Deprecated/outdated:**

- `README.md` Matrix manual QA instruction `rm -f lambda_matrix.db` as the primary restart flow: outdated for this phase.
- DM-first Matrix recovery assumptions in `docs/matrix-prototype.md`: outdated relative to Phase 1 Space+rooms decisions.

## Open Questions

1. **How exactly should reconciliation identify the owning Matrix user for a recovered room when local `matrix_room:*` is gone?**
   - What we know: the bot can enumerate joined rooms and fetch room state; current healthy metadata stores `matrix_user_id` and `space_id`.
   - What's unclear: whether Phase 1-created rooms also expose enough server-side structure to recover owner deterministically without existing local metadata in every case.
   - Recommendation: Plan a proof test against a real homeserver/client. If room-state-only ownership is ambiguous, persist a tiny bot-authored marker state event going forward, but keep that addition narrowly scoped.

2. **Should runtime recovery happen only on startup, or also lazily on first unknown room access?**
   - What we know: startup repair satisfies D-02/D-07 for common restart loss; `room_router` already surfaces unknown rooms cleanly.
   - What's unclear: whether partial DB corruption during runtime is common enough to justify lazy single-room repair in Phase 01.1.
   - Recommendation: Make startup reconciliation required, lazy room repair optional if it stays small.

3. **How much of server cleanup should Phase 01.1 automate?**
   - What we know: `room_leave()` and `room_forget()` are available; global room deletion is not what the client API guarantees.
   - What's unclear: whether automating bot-side leave/forget is worth the extra risk for this urgent phase.
   - Recommendation: Keep `local-only` mandatory. Make server cleanup optional and clearly labeled experimental/dev-only if included.
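
For question 3, the mandatory `local-only` path is small enough to sketch in full. The following is an illustrative sketch under stated assumptions, not the planned `adapter/matrix/reset.py`: the default paths `lambda_matrix.db` and `matrix_store` are assumed from D-05 and D-09, the wording of the manual-steps message is invented, and only the `local-only` mode is wired in.

```python
import argparse
import os
import shutil
from pathlib import Path

# Hypothetical wording; the real tool should list the exact client steps (D-10).
MANUAL_STEPS = (
    "Server-side rooms were NOT touched. To start from a clean slate, leave the "
    "old Space and its rooms manually in your Matrix client."
)


def reset_local(db_path: Path, store_path: Path, dry_run: bool = False) -> list[str]:
    """Remove the local DB file and store directory; return the affected paths."""
    affected = []
    for path in (db_path, store_path):
        if not path.exists():
            continue
        affected.append(str(path))
        if dry_run:
            continue  # report only, delete nothing
        if path.is_dir():
            shutil.rmtree(path)
        else:
            path.unlink()
    return affected


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Dev-only Matrix reset (sketch)")
    parser.add_argument("--mode", choices=["local-only"], default="local-only")
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args(argv)
    # Assumed defaults; both can be overridden through the environment.
    db_path = Path(os.environ.get("MATRIX_DB_PATH", "lambda_matrix.db"))
    store_path = Path(os.environ.get("MATRIX_STORE_PATH", "matrix_store"))
    verb = "would remove" if args.dry_run else "removed"
    for item in reset_local(db_path, store_path, dry_run=args.dry_run):
        print(f"{verb}: {item}")
    print(MANUAL_STEPS)
    return 0
```

Calling `main(["--dry-run"])` prints what would be deleted without touching anything, which makes the script safe to try before a real reset. Keeping `reset_local` separate from argument parsing also lets the reset tests run against temporary directories.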

## Environment Availability

| Dependency | Required By | Available | Version | Fallback |
|------------|------------|-----------|---------|----------|
| Python | Runtime, scripts, tests | ✓ | 3.14.3 | — |
| `uv` | Standard install/run workflow | ✓ | 0.9.30 | `python -m` + existing venv |
| `pytest` | Automated verification | ✓ | 9.0.2 | `uv run pytest` |
| Matrix homeserver credentials | Real restart/reset manual QA | ✗ in current shell | — | Manual-only after `.env` is configured |
| Matrix bot local DB/store paths | Reset workflow | ✓ | defaults in code | Can override with `MATRIX_DB_PATH` / `MATRIX_STORE_PATH` |

**Missing dependencies with no fallback:**

- Live Matrix credentials for real manual reconciliation/reset QA.

**Missing dependencies with fallback:**

- None for repository-only implementation and tests.

## Validation Architecture

### Test Framework

| Property | Value |
|----------|-------|
| Framework | `pytest 9.0.2` + `pytest-asyncio 1.3.0` |
| Config file | `pyproject.toml` |
| Quick run command | `pytest tests/adapter/matrix -v` |
| Full suite command | `pytest tests/ -v` |

### Phase Requirements → Test Map

| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|--------|----------|-----------|-------------------|-------------|
| PH01.1-BOOT | Startup rebuilds missing `matrix_user:*`, `matrix_room:*`, and `chat:*` from existing rooms without creating new rooms | unit/integration | `pytest tests/adapter/matrix/test_reconcile.py -v` | ❌ Wave 0 |
| PH01.1-ROUTER | Unknown room fallback can trigger repair or yields diagnosable warning without crashing commands | unit | `pytest tests/adapter/matrix/test_room_router_reconcile.py -v` | ❌ Wave 0 |
| PH01.1-COUNTER | Reconciliation resets `next_chat_index` to recovered max + 1 | unit | `pytest tests/adapter/matrix/test_reconcile.py -k next_chat_index -v` | ❌ Wave 0 |
| PH01.1-RESET | Dev reset `local-only` removes local DB/store paths and prints next steps | unit/smoke | `pytest tests/adapter/matrix/test_reset.py -v` | ❌ Wave 0 |
| PH01.1-NONDESTRUCTIVE | Reconciliation never calls room creation APIs | unit | `pytest tests/adapter/matrix/test_reconcile.py -k no_create -v` | ❌ Wave 0 |

### Sampling Rate

- **Per task commit:** `pytest tests/adapter/matrix -v`
- **Per wave merge:** `pytest tests/ -v`
- **Phase gate:** Full suite green before `/gsd:verify-work`

### Wave 0 Gaps

- [ ] `tests/adapter/matrix/test_reconcile.py` - startup reconciliation scenarios
- [ ] `tests/adapter/matrix/test_reset.py` - CLI/script reset modes and output
- [ ] `tests/adapter/matrix/test_room_router_reconcile.py` - lazy recovery or warning behavior
- [ ] Integration fixture for a fake `AsyncClient` response surface matching `joined_rooms()` and `room_get_state()`

## Sources

### Primary (HIGH confidence)

- Matrix Client-Server API - room state, leave, forget, joined rooms, Spaces semantics: https://spec.matrix.org/latest/client-server-api/index.html
- `matrix-nio` installed 0.25.2 API surface verified locally on 2026-04-03 via `AsyncClient.sync`, `sync_forever`, `joined_rooms`, `room_get_state`, `room_leave`, `room_forget`
- Repo code: [adapter/matrix/bot.py](/Users/a/MAI/sem2/lambda/surfaces-bot/adapter/matrix/bot.py), [adapter/matrix/store.py](/Users/a/MAI/sem2/lambda/surfaces-bot/adapter/matrix/store.py), [adapter/matrix/room_router.py](/Users/a/MAI/sem2/lambda/surfaces-bot/adapter/matrix/room_router.py), [adapter/matrix/handlers/auth.py](/Users/a/MAI/sem2/lambda/surfaces-bot/adapter/matrix/handlers/auth.py), [core/chat.py](/Users/a/MAI/sem2/lambda/surfaces-bot/core/chat.py)
- PyPI release metadata: https://pypi.org/project/matrix-nio/ , https://pypi.org/project/pytest/ , https://pypi.org/project/pytest-asyncio/ , https://pypi.org/project/structlog/ , https://pypi.org/project/python-dotenv/

### Secondary (MEDIUM confidence)

- [README.md](/Users/a/MAI/sem2/lambda/surfaces-bot/README.md) - current manual reset habit and run commands
- [docs/matrix-prototype.md](/Users/a/MAI/sem2/lambda/surfaces-bot/docs/matrix-prototype.md) - original Matrix UX intent, noting outdated DM/reaction sections
- [01-CONTEXT.md](/Users/a/MAI/sem2/lambda/surfaces-bot/.planning/phases/01-matrix-qa-polish/01-CONTEXT.md) - locked Phase 1 Matrix decisions
- [01-VERIFICATION.md](/Users/a/MAI/sem2/lambda/surfaces-bot/.planning/phases/01-matrix-qa-polish/01-VERIFICATION.md) - what has already been verified and what still needs human Matrix QA

### Tertiary (LOW confidence)

- None

## Metadata

**Confidence breakdown:**

- Standard stack: HIGH - verified against installed environment, PyPI metadata, and official Matrix spec
- Architecture: HIGH - directly grounded in current repo flow plus current `matrix-nio`/Matrix capabilities
- Pitfalls: HIGH - derived from concrete gaps in current startup/store/router code

**Research date:** 2026-04-03
**Valid until:** 2026-05-03