diff --git a/.planning/phases/01.1-matrix-restart-reconciliation-and-dev-reset-workflow/01.1-RESEARCH.md b/.planning/phases/01.1-matrix-restart-reconciliation-and-dev-reset-workflow/01.1-RESEARCH.md
new file mode 100644
index 0000000..792031d
--- /dev/null
+++ b/.planning/phases/01.1-matrix-restart-reconciliation-and-dev-reset-workflow/01.1-RESEARCH.md
@@ -0,0 +1,350 @@
+# Phase 01.1: Matrix restart reconciliation and dev reset workflow - Research
+
+**Researched:** 2026-04-03
+**Domain:** Matrix adapter restart reconciliation, local state recovery, dev reset workflow
+**Confidence:** HIGH
+
+
+## User Constraints (from CONTEXT.md)
+
+### Locked Decisions
+- **D-01:** The local SQLite store must no longer be treated as the single source of truth for the Matrix runtime in the dev workflow.
+- **D-02:** On startup, the bot must try to restore the minimum required local state from the already existing Matrix rooms / Space rather than require a full reset.
+- **D-03:** Reconciliation must restore at least the `matrix_user:*`, `matrix_room:*`, and missing `chat:{user}:{chat_id}` records when the server-side rooms already exist.
+- **D-04:** Reconciliation must not create new Space/rooms when the task is specifically to restore local state after a restart.
+- **D-05:** An ordinary bot restart must be the primary path during development; deleting `lambda_matrix.db` and `matrix_store` must not be required to verify the workflow.
+- **D-06:** If local state is incomplete, the bot must either restore it or log a clear reason, rather than crash on commands such as `!rename`.
+- **D-07:** Inconsistency between `room_meta` and `ChatManager` must be detected and resolved automatically on startup or on first access.
+- **D-08:** A dedicated dev-only reset tool/script is needed for controlled QA, instead of a hand-typed set of shell commands.
+- **D-09:** The reset workflow must at least support a `local-only` reset: deleting `lambda_matrix.db` and `matrix_store`, with clear instructions on what to do with the server-side Matrix rooms.
+- **D-10:** If full server-side cleanup is not automated in this phase, the tool must explicitly print which manual steps are required in the Matrix client.
+
+### Claude's Discretion
+- The exact place in the startup flow where reconciliation is invoked
+- Internal structure of the helper module (`bootstrap.py`, `reconcile.py`, or similar)
+- Format of the dev reset script and the level of automation for server-side cleanup
+- Details of debug logging and a dry-run mode, if they help without inflating scope
+
+### Deferred Ideas (OUT OF SCOPE)
+- Full production-grade migration of historical Matrix state across schema versions
+- Automatic server-side deletion/leave for all Matrix rooms and Space during reset, if it requires broader admin semantics
+- Any Phase 2 SDK integration work
+
+
+## Summary
+
+Phase 01.1 should be planned as a bootstrap/recovery phase, not as another chat-feature phase. The current Matrix adapter has no startup reconciliation path: `adapter/matrix/bot.py` logs in and goes directly to `sync_forever()`, while routing and command handlers assume `matrix_room:*`, `matrix_user:*`, and `chat:*` keys already exist. That means local DB loss currently produces logical corruption, not just missing cache.
+
+The safe standard approach is: perform a first sync that hydrates joined-room state, inspect the bot's current joined rooms and room state from the homeserver, rebuild the minimal local metadata needed for command routing, and only then enter the long-running sync loop. Reconciliation should be non-destructive and idempotent: if local keys already exist and match server state, leave them alone; if they are missing, recreate them; if they conflict, prefer the server room topology for Matrix-specific metadata and recreate missing `ChatManager` rows from that.
+
+For reset, separate two workflows explicitly. `local-only` reset is the default and should be automated. Optional server-side cleanup may leave/forget rooms for the bot account, but it cannot promise global deletion of Matrix rooms for all members; if that is not automated, the tool must print the exact manual steps for the Matrix client.
+
+**Primary recommendation:** Add a startup `reconcile_matrix_state()` step before `sync_forever()`, and ship a dev-only reset CLI with `local-only`, `server-leave-forget`, and `dry-run` modes.
+
+## Project Constraints (from CLAUDE.md)
+
+- Do not treat missing Lambda SDK as a blocker.
+- Keep all platform calls behind `platform/interface.py`.
+- Current runtime implementation is `platform/mock.py`; recommendations must work with that.
+- Prefer architecture changes in adapters and core without coupling to future SDK internals.
+- Use pytest-based verification.
+- Do not recommend committing `.env`.
+- Respect dependency order: `core/` first, then `platform/`, then adapters.
+
+## Standard Stack
+
+### Core
+| Library | Version | Purpose | Why Standard |
+|---------|---------|---------|--------------|
+| Python | 3.14.3 installed | Runtime for bot and scripts | Already available locally; codebase targets `>=3.11`. |
+| `matrix-nio` | 0.25.2, published 2024-10-04 | Matrix client, sync, room membership/state APIs | Already installed; exposes the exact bootstrap/reset APIs this phase needs. |
+| `SQLiteStore` (repo) | local | Adapter/core KV persistence | Existing persistence contract for `matrix_user:*`, `matrix_room:*`, and `chat:*`. |
+| Matrix Client-Server API | spec latest | Authoritative room membership/state semantics | Needed to reason about restart recovery and leave/forget behavior correctly. |
+
+### Supporting
+| Library | Version | Purpose | When to Use |
+|---------|---------|---------|-------------|
+| `pytest` | 9.0.2, published 2025-12-06 | Test runner | For targeted adapter/bootstrap regression tests. |
+| `pytest-asyncio` | 1.3.0, published 2025-11-10 | Async test execution | For async reconciliation/reset flows. |
+| `structlog` | 25.5.0, published 2025-10-27 | Diagnostics | For reconciliation summaries and conflict logging. |
+| `python-dotenv` | 1.2.2, published 2026-03-01 | Env loading | Already used by `adapter/matrix/bot.py` for Matrix config. |
+
+### Alternatives Considered
+| Instead of | Could Use | Tradeoff |
+|------------|-----------|----------|
+| Startup reconciliation from joined rooms + state | Force developers to wipe local DB and recreate rooms | Simpler code, but directly violates D-01, D-02, D-05. |
+| Non-destructive local rebuild | Full auto-recreate of Space/rooms on missing local state | Easier to implement, but causes duplicate Matrix rooms and breaks D-04. |
+| Dev reset script | README-only manual ritual | Lower code cost, but not repeatable and fails D-08..D-10. |
+
+**Installation:**
+```bash
+uv sync
+```
+
+**Version verification:** Verified via installed environment and PyPI metadata on 2026-04-03:
+- `matrix-nio` `0.25.2` - 2024-10-04
+- `pytest` `9.0.2` - 2025-12-06
+- `pytest-asyncio` `1.3.0` - 2025-11-10
+- `structlog` `25.5.0` - 2025-10-27
+- `python-dotenv` `1.2.2` - 2026-03-01
+
+## Architecture Patterns
+
+### Recommended Project Structure
+```text
+adapter/matrix/
+├── bot.py             # startup flow calls reconciliation before sync loop
+├── reconcile.py       # bootstrap/rebuild logic from Matrix server state
+├── reset.py           # dev-only reset CLI / entrypoint
+├── room_router.py     # room_id -> chat_id with recovery hook
+├── store.py           # metadata helpers, prefix scans, derived counters
+└── handlers/
+    ├── auth.py        # first-time provisioning only
+    └── chat.py        # uses recovered state, no provisioning fallback
+```
+
+### Pattern 1: Two-Phase Startup Bootstrap
+**What:** Split startup into `login -> initial sync/full_state -> reconcile -> steady-state sync_forever`.
+**When to use:** Always for Matrix bot startup when local DB may be missing or stale.
+**Example:**
+```python
+# Source: matrix-nio AsyncClient docs/source + repo startup flow
+client = AsyncClient(...)
+runtime = build_runtime(store=SQLiteStore(db_path), client=client)
+
+await login_or_restore_session(client)
+await client.sync(timeout=0, full_state=True)
+report = await reconcile_matrix_state(client, runtime.store, runtime.chat_mgr)
+logger.info("matrix_reconcile_complete", **report)
+await client.sync_forever(timeout=30000)
+```
+
+### Pattern 2: Rebuild Local Metadata From Joined Rooms
+**What:** Enumerate joined rooms, inspect local hydrated room objects or room state, and recreate missing `matrix_room:*`, `matrix_user:*`, and `chat:*` records.
+**When to use:** On startup and optionally on `unregistered:{room_id}` fallback at runtime.
+**Example:**
+```python
+# Source: matrix-nio AsyncClient.joined_rooms/room_get_state + repo store contracts
+joined = await client.joined_rooms()
+for room_id in joined.rooms:
+    state = await client.room_get_state(room_id)
+    # detect: space room vs chat room, owner user, child relationship, display name
+    # rebuild matrix_room:{room_id}
+    # rebuild chat:{matrix_user_id}:{chat_id} if absent
+```
+
+### Pattern 3: Non-Destructive Reconciliation Report
+**What:** Return a structured report: scanned rooms, restored rooms, restored chats, conflicts, skipped rooms.
+**When to use:** Every reconciliation run, including dry-run.
+**Example:**
+```python
+{
+    "joined_rooms": 4,
+    "restored_user_meta": 1,
+    "restored_room_meta": 3,
+    "restored_chat_rows": 3,
+    "conflicts": [],
+    "skipped_rooms": ["!dm:example.org"],
+}
+```
+
+### Pattern 4: Reset Modes Are Explicit
+**What:** Separate `local-only`, `server-leave-forget`, and `dry-run`.
+**When to use:** For dev/QA only. Never mix destructive server cleanup into normal startup.
+**Example:**
+```bash
+uv run python -m adapter.matrix.reset --mode local-only
+uv run python -m adapter.matrix.reset --mode server-leave-forget --dry-run
+```
+
+### Anti-Patterns to Avoid
+- **Provisioning during reconciliation:** Do not create a new Space or new rooms while trying to recover missing local state.
+- **Treating `next_chat_index` as primary truth:** Derive it from recovered `chat_id` values after scan; do not trust a missing or stale counter.
+- **Routing unknown rooms straight through:** `unregistered:{room_id}` is a signal to reconcile, not a stable runtime identity.
+- **Destructive reset by default:** Startup must never leave/forget rooms automatically.
+- **Blindly trusting local `surface_ref`:** If `chat:*` and `matrix_room:*` disagree, rebuild from Matrix room metadata and repair the chat row.
+
+## Don't Hand-Roll
+
+| Problem | Don't Build | Use Instead | Why |
+|---------|-------------|-------------|-----|
+| Room discovery | Custom DB-only reconstruction heuristics | `AsyncClient.joined_rooms()` plus synced room state | Server already knows which rooms the bot joined. |
+| Space membership detection | Naming-convention parsing of room names | Matrix state: `m.room.create.type`, `m.space.child`, `m.space.parent` | Names are mutable and non-authoritative. |
+| Room cleanup semantics | Custom “delete room” assumptions | `room_leave()` + `room_forget()` semantics | Client API supports leave/forget, not guaranteed global deletion. |
+| Chat ID recovery | Hardcoded `C1/C2/...` reset | Rebuild from existing `matrix_room:*`/server state and compute next index | Prevents collisions after partial DB loss. |
+| Diagnostic output | Ad hoc `print()` strings | Structured reconciliation/reset report via `structlog` | Easier manual QA and failure triage. |
+
+**Key insight:** The homeserver already persists the bot’s room graph. This phase should rehydrate local cache from that graph, not attempt to replace it with a second custom truth model.
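
The "Use Instead" column above can be sketched with three small pure helpers. This is a minimal illustration, not the repo's implementation: the helper names are hypothetical, the state events are assumed to be raw event dictionaries (for example the `events` list of a `room_get_state()` response), and chat IDs are assumed to follow the `C1`, `C2`, ... shape mentioned in this document.

```python
import re
from typing import Iterable, List

# Assumption: chat IDs look like "C1", "C2", ... (shape inferred from this document).
CHAT_ID_RE = re.compile(r"^C(\d+)$")


def is_space(state_events: Iterable[dict]) -> bool:
    """Return True when the room's m.room.create event declares type m.space."""
    for event in state_events:
        if event.get("type") == "m.room.create":
            return event.get("content", {}).get("type") == "m.space"
    return False


def space_children(state_events: Iterable[dict]) -> List[str]:
    """Return child room IDs from m.space.child events.

    Children whose content has no `via` list are skipped, because the
    Matrix spec treats such links as removed.
    """
    return [
        event["state_key"]
        for event in state_events
        if event.get("type") == "m.space.child"
        and event.get("content", {}).get("via")
    ]


def next_chat_index(recovered_chat_ids: Iterable[str]) -> int:
    """Derive the next counter value from recovered IDs instead of a stored counter."""
    numbers = [
        int(match.group(1))
        for match in map(CHAT_ID_RE.match, recovered_chat_ids)
        if match
    ]
    return max(numbers, default=0) + 1
```

Keeping these helpers free of I/O means they can be unit-tested without a fake `AsyncClient`; only the code that fetches the state needs one.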
+
+## Common Pitfalls
+
+### Pitfall 1: Joining the sync loop before reconciliation
+**What goes wrong:** Commands arrive while local metadata is still missing, producing `unregistered:{room_id}` routing or `ChatManager` misses.
+**Why it happens:** Current `main()` enters `sync_forever()` immediately after login.
+**How to avoid:** Perform initial sync and reconciliation first.
+**Warning signs:** `unregistered_room` logs immediately after restart; `ValueError("Chat ... not found")` on `!rename` or `!archive`.
+
+### Pitfall 2: Recovering room metadata but not chat rows
+**What goes wrong:** Room routing works, but `ChatManager.rename/archive/list_active` still fails because `chat:{user}:{chat_id}` rows were not recreated.
+**Why it happens:** Matrix adapter metadata and core chat metadata live in different keyspaces.
+**How to avoid:** Reconciliation must repair both stores in one pass.
+**Warning signs:** `matrix_room:*` exists but `chat:*` keys do not.
+
+### Pitfall 3: Trusting stale `next_chat_index`
+**What goes wrong:** New chats reuse existing `C` IDs after local recovery.
+**Why it happens:** `next_chat_id()` increments a persisted counter that may be absent or behind.
+**How to avoid:** After scan, set `next_chat_index = max(recovered_chat_numbers) + 1`.
+**Warning signs:** New room gets `C1` even though Space already contains prior rooms.
+
+### Pitfall 4: Assuming room names identify chat rooms safely
+**What goes wrong:** Reconciliation binds the wrong room because a user renamed a room or Space.
+**Why it happens:** Names are user-facing labels, not stable identifiers.
+**How to avoid:** Prefer room state and existing `chat_id` metadata; use display names only as fallback.
+**Warning signs:** Duplicate “Чат 1” names or renamed rooms break matching.
+
+### Pitfall 5: Over-promising full cleanup
+**What goes wrong:** Reset script claims a “clean slate” but rooms still exist in Element or for other members.
+**Why it happens:** Leaving/forgetting affects the bot account’s membership/history, not necessarily global room deletion.
+**How to avoid:** Name the mode accurately and print the manual client steps when needed.
+**Warning signs:** QA reruns still show old rooms in the user’s client.
+
+## Code Examples
+
+Verified patterns from official sources and the installed library surface:
+
+### Initial Sync Before Reconcile
+```python
+# Source: matrix-nio AsyncClient.sync/sync_forever
+await client.sync(timeout=0, full_state=True)
+report = await reconcile_matrix_state(client, store, chat_mgr)
+await client.sync_forever(timeout=30000)
+```
+
+### Space Child Link Creation
+```python
+# Source: Matrix client-server API state event + current auth/new-chat flow
+await client.room_put_state(
+    room_id=space_id,
+    event_type="m.space.child",
+    content={"via": [homeserver]},
+    state_key=chat_room_id,
+)
+```
+
+### Bot-Side Leave/Forget Cleanup
+```python
+# Source: matrix-nio AsyncClient.room_leave / room_forget
+for room_id in room_ids:
+    await client.room_leave(room_id)
+    await client.room_forget(room_id)
+```
+
+### Router Recovery Trigger
+```python
+# Source: repo room_router contract
+chat_id = await resolve_chat_id(store, room_id, matrix_user_id)
+if chat_id.startswith("unregistered:"):
+    await reconcile_single_room(client, store, chat_mgr, room_id, matrix_user_id)
+```
+
+## State of the Art
+
+| Old Approach | Current Approach | When Changed | Impact |
+|--------------|------------------|--------------|--------|
+| Local adapter DB treated as the operational truth | Rebuildable local cache from server room graph | Mature Matrix client practice; supported by current Matrix CS API and `matrix-nio` | Restart no longer requires destructive local reset. |
+| Manual room cleanup in client after experiments | Scripted leave/forget plus explicit manual instructions | Current `matrix-nio` 0.25.x API surface | QA becomes repeatable and auditable. |
+| Immediate steady-state sync after login | Initial sync/full-state bootstrap before long polling | Supported by current `AsyncClient.sync()` / `sync_forever()` behavior | Reconciliation can run before any user traffic is handled. |
+
+**Deprecated/outdated:**
+- `README.md` Matrix manual QA instruction `rm -f lambda_matrix.db` as the primary restart flow: outdated for this phase.
+- DM-first Matrix recovery assumptions in `docs/matrix-prototype.md`: outdated relative to Phase 1 Space+rooms decisions.
+
+## Open Questions
+
+1. **How exactly should reconciliation identify the owning Matrix user for a recovered room when local `matrix_room:*` is gone?**
+   - What we know: the bot can enumerate joined rooms and fetch room state; current healthy metadata stores `matrix_user_id` and `space_id`.
+   - What's unclear: whether Phase 1-created rooms also expose enough server-side structure to recover owner deterministically without existing local metadata in every case.
+   - Recommendation: Plan a proof test against a real homeserver/client. If room-state-only ownership is ambiguous, persist a tiny bot-authored marker state event going forward, but keep that addition narrowly scoped.
+
+2. **Should runtime recovery happen only on startup, or also lazily on first unknown room access?**
+   - What we know: startup repair satisfies D-02/D-07 for common restart loss; `room_router` already surfaces unknown rooms cleanly.
+   - What's unclear: whether partial DB corruption during runtime is common enough to justify lazy single-room repair in Phase 01.1.
+   - Recommendation: Make startup reconciliation required, lazy room repair optional if it stays small.
+
+3. **How much of server cleanup should Phase 01.1 automate?**
+   - What we know: `room_leave()` and `room_forget()` are available; global room deletion is not what the client API guarantees.
+   - What's unclear: whether automating bot-side leave/forget is worth the extra risk for this urgent phase.
+   - Recommendation: Keep `local-only` mandatory. Make server cleanup optional and clearly labeled experimental/dev-only if included.
+
+## Environment Availability
+
+| Dependency | Required By | Available | Version | Fallback |
+|------------|------------|-----------|---------|----------|
+| Python | Runtime, scripts, tests | ✓ | 3.14.3 | - |
+| `uv` | Standard install/run workflow | ✓ | 0.9.30 | `python -m` + existing venv |
+| `pytest` | Automated verification | ✓ | 9.0.2 | `uv run pytest` |
+| Matrix homeserver credentials | Real restart/reset manual QA | ✗ in current shell | - | Manual-only after `.env` is configured |
+| Matrix bot local DB/store paths | Reset workflow | ✓ | defaults in code | Can override with `MATRIX_DB_PATH` / `MATRIX_STORE_PATH` |
+
+**Missing dependencies with no fallback:**
+- Live Matrix credentials for real manual reconciliation/reset QA.
+
+**Missing dependencies with fallback:**
+- None for repository-only implementation and tests.
+
+## Validation Architecture
+
+### Test Framework
+| Property | Value |
+|----------|-------|
+| Framework | `pytest 9.0.2` + `pytest-asyncio 1.3.0` |
+| Config file | `pyproject.toml` |
+| Quick run command | `pytest tests/adapter/matrix -v` |
+| Full suite command | `pytest tests/ -v` |
+
+### Phase Requirements → Test Map
+| Req ID | Behavior | Test Type | Automated Command | File Exists? |
+|--------|----------|-----------|-------------------|-------------|
+| PH01.1-BOOT | Startup rebuilds missing `matrix_user:*`, `matrix_room:*`, and `chat:*` from existing rooms without creating new rooms | unit/integration | `pytest tests/adapter/matrix/test_reconcile.py -v` | ❌ Wave 0 |
+| PH01.1-ROUTER | Unknown room fallback can trigger repair or yields diagnosable warning without crashing commands | unit | `pytest tests/adapter/matrix/test_room_router_reconcile.py -v` | ❌ Wave 0 |
+| PH01.1-COUNTER | Reconciliation resets `next_chat_index` to recovered max + 1 | unit | `pytest tests/adapter/matrix/test_reconcile.py -k next_chat_index -v` | ❌ Wave 0 |
+| PH01.1-RESET | Dev reset `local-only` removes local DB/store paths and prints next steps | unit/smoke | `pytest tests/adapter/matrix/test_reset.py -v` | ❌ Wave 0 |
+| PH01.1-NONDESTRUCTIVE | Reconciliation never calls room creation APIs | unit | `pytest tests/adapter/matrix/test_reconcile.py -k no_create -v` | ❌ Wave 0 |
+
+### Sampling Rate
+- **Per task commit:** `pytest tests/adapter/matrix -v`
+- **Per wave merge:** `pytest tests/ -v`
+- **Phase gate:** Full suite green before `/gsd:verify-work`
+
+### Wave 0 Gaps
+- [ ] `tests/adapter/matrix/test_reconcile.py` - startup reconciliation scenarios
+- [ ] `tests/adapter/matrix/test_reset.py` - CLI/script reset modes and output
+- [ ] `tests/adapter/matrix/test_room_router_reconcile.py` - lazy recovery or warning behavior
+- [ ] Integration fixture for a fake `AsyncClient` response surface matching `joined_rooms()` and `room_get_state()`
+
+## Sources
+
+### Primary (HIGH confidence)
+- Matrix Client-Server API - room state, leave, forget, joined rooms, Spaces semantics: https://spec.matrix.org/latest/client-server-api/index.html
+- `matrix-nio` installed 0.25.2 API surface verified locally on 2026-04-03 via `AsyncClient.sync`, `sync_forever`, `joined_rooms`, `room_get_state`, `room_leave`, `room_forget`
+- Repo code: [adapter/matrix/bot.py](/Users/a/MAI/sem2/lambda/surfaces-bot/adapter/matrix/bot.py), [adapter/matrix/store.py](/Users/a/MAI/sem2/lambda/surfaces-bot/adapter/matrix/store.py), [adapter/matrix/room_router.py](/Users/a/MAI/sem2/lambda/surfaces-bot/adapter/matrix/room_router.py), [adapter/matrix/handlers/auth.py](/Users/a/MAI/sem2/lambda/surfaces-bot/adapter/matrix/handlers/auth.py), [core/chat.py](/Users/a/MAI/sem2/lambda/surfaces-bot/core/chat.py)
+- PyPI release metadata: https://pypi.org/project/matrix-nio/ , https://pypi.org/project/pytest/ , https://pypi.org/project/pytest-asyncio/ , https://pypi.org/project/structlog/ , https://pypi.org/project/python-dotenv/
+
+### Secondary (MEDIUM confidence)
+- [README.md](/Users/a/MAI/sem2/lambda/surfaces-bot/README.md) - current manual reset habit and run commands
+- [docs/matrix-prototype.md](/Users/a/MAI/sem2/lambda/surfaces-bot/docs/matrix-prototype.md) - original Matrix UX intent, noting outdated DM/reaction sections
+- [01-CONTEXT.md](/Users/a/MAI/sem2/lambda/surfaces-bot/.planning/phases/01-matrix-qa-polish/01-CONTEXT.md) - locked Phase 1 Matrix decisions
+- [01-VERIFICATION.md](/Users/a/MAI/sem2/lambda/surfaces-bot/.planning/phases/01-matrix-qa-polish/01-VERIFICATION.md) - what has already been verified and what still needs human Matrix QA
+
+### Tertiary (LOW confidence)
+- None
+
+## Metadata
+
+**Confidence breakdown:**
+- Standard stack: HIGH - verified against installed environment, PyPI metadata, and official Matrix spec
+- Architecture: HIGH - directly grounded in current repo flow plus current `matrix-nio`/Matrix capabilities
+- Pitfalls: HIGH - derived from concrete gaps in current startup/store/router code
+
+**Research date:** 2026-04-03
+**Valid until:** 2026-05-03