surfaces/.planning/phases/01.1-matrix-restart-reconciliation-and-dev-reset-workflow/01.1-RESEARCH.md

350 lines
22 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Phase 01.1: Matrix restart reconciliation and dev reset workflow - Research
**Researched:** 2026-04-03
**Domain:** Matrix adapter restart reconciliation, local state recovery, dev reset workflow
**Confidence:** HIGH
<user_constraints>
## User Constraints (from CONTEXT.md)
### Locked Decisions
- **D-01:** Локальный SQLite store больше не должен считаться единственной точкой истины для Matrix runtime в dev workflow.
- **D-02:** При старте бот должен пытаться восстановить минимально необходимое локальное состояние из уже существующих Matrix rooms / Space, а не требовать full reset.
- **D-03:** Reconciliation должен восстанавливать как минимум `matrix_user:*`, `matrix_room:*` и missing `chat:{user}:{chat_id}` записи, если серверные комнаты уже существуют.
- **D-04:** Reconciliation не должен создавать новые Space/rooms, если задача — именно восстановление локального state после рестарта.
- **D-05:** Обычный restart бота должен быть основным путём для разработки; удаление `lambda_matrix.db` и `matrix_store` не должно быть обязательным для проверки workflow.
- **D-06:** Если local state неполон, бот должен либо восстановить его, либо логировать понятную причину, а не падать на командах вроде `!rename`.
- **D-07:** Несогласованность между `room_meta` и `ChatManager` должна обнаруживаться и устраняться автоматически на startup или при первом обращении.
- **D-08:** Нужен отдельный dev-only reset tool/script для controlled QA, вместо ручного набора shell-команд.
- **D-09:** Reset workflow должен как минимум поддерживать `local-only` reset: удаление `lambda_matrix.db` и `matrix_store` с понятной инструкцией, что делать с server-side Matrix rooms.
- **D-10:** Если full server-side cleanup не автоматизируется в этой фазе, tool должен явно печатать, какие ручные шаги обязательны в Matrix client.
### Claude's Discretion
- Точное место вызова reconciliation в startup flow
- Внутренняя структура helper-модуля (`bootstrap.py`, `reconcile.py` или аналог)
- Формат dev reset script и уровень автоматизации server-side cleanup
- Детали debug-logging и dry-run режима, если они помогают без раздувания scope
### Deferred Ideas (OUT OF SCOPE)
- Full production-grade migration of historical Matrix state across schema versions
- Automatic server-side deletion/leave for all Matrix rooms and Space during reset, if it requires broader admin semantics
- Any Phase 2 SDK integration work
</user_constraints>
## Summary
Phase 01.1 should be planned as a bootstrap/recovery phase, not as another chat-feature phase. The current Matrix adapter has no startup reconciliation path: `adapter/matrix/bot.py` logs in and goes directly to `sync_forever()`, while routing and command handlers assume `matrix_room:*`, `matrix_user:*`, and `chat:*` keys already exist. That means local DB loss currently produces logical corruption, not just missing cache.
The safe standard approach is: perform a first sync that hydrates joined-room state, inspect the bot's current joined rooms and room state from the homeserver, rebuild the minimal local metadata needed for command routing, and only then enter the long-running sync loop. Reconciliation should be non-destructive and idempotent: if local keys already exist and match server state, leave them alone; if they are missing, recreate them; if they conflict, prefer the server room topology for Matrix-specific metadata and recreate missing `ChatManager` rows from that.
For reset, separate two workflows explicitly. `local-only` reset is the default and should be automated. Optional server-side cleanup may leave/forget rooms for the bot account, but it cannot promise global deletion of Matrix rooms for all members; if that is not automated, the tool must print the exact manual steps for the Matrix client.
**Primary recommendation:** Add a startup `reconcile_matrix_state()` step before `sync_forever()`, and ship a dev-only reset CLI with `local-only`, `server-leave-forget`, and `dry-run` modes.
## Project Constraints (from CLAUDE.md)
- Do not treat missing Lambda SDK as a blocker.
- Keep all platform calls behind `platform/interface.py`.
- Current runtime implementation is `platform/mock.py`; recommendations must work with that.
- Prefer architecture changes in adapters and core without coupling to future SDK internals.
- Use pytest-based verification.
- Do not recommend committing `.env`.
- Respect dependency order: `core/` first, then `platform/`, then adapters.
## Standard Stack
### Core
| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| Python | 3.14.3 installed | Runtime for bot and scripts | Already available locally; codebase targets `>=3.11`. |
| `matrix-nio` | 0.25.2, published 2024-10-04 | Matrix client, sync, room membership/state APIs | Already installed; exposes the exact bootstrap/reset APIs this phase needs. |
| `SQLiteStore` (repo) | local | Adapter/core KV persistence | Existing persistence contract for `matrix_user:*`, `matrix_room:*`, and `chat:*`. |
| Matrix Client-Server API | spec latest | Authoritative room membership/state semantics | Needed to reason about restart recovery and leave/forget behavior correctly. |
### Supporting
| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| `pytest` | 9.0.2, published 2025-12-06 | Test runner | For targeted adapter/bootstrap regression tests. |
| `pytest-asyncio` | 1.3.0, published 2025-11-10 | Async test execution | For async reconciliation/reset flows. |
| `structlog` | 25.5.0, published 2025-10-27 | Diagnostics | For reconciliation summaries and conflict logging. |
| `python-dotenv` | 1.2.2, published 2026-03-01 | Env loading | Already used by `adapter/matrix/bot.py` for Matrix config. |
### Alternatives Considered
| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| Startup reconciliation from joined rooms + state | Force developers to wipe local DB and recreate rooms | Simpler code, but directly violates D-01, D-02, D-05. |
| Non-destructive local rebuild | Full auto-recreate of Space/rooms on missing local state | Easier to implement, but causes duplicate Matrix rooms and breaks D-04. |
| Dev reset script | README-only manual ritual | Lower code cost, but not repeatable and fails D-08..D-10. |
**Installation:**
```bash
uv sync
```
**Version verification:** Verified via installed environment and PyPI metadata on 2026-04-03:
- `matrix-nio` `0.25.2` - 2024-10-04
- `pytest` `9.0.2` - 2025-12-06
- `pytest-asyncio` `1.3.0` - 2025-11-10
- `structlog` `25.5.0` - 2025-10-27
- `python-dotenv` `1.2.2` - 2026-03-01
## Architecture Patterns
### Recommended Project Structure
```text
adapter/matrix/
├── bot.py # startup flow calls reconciliation before sync loop
├── reconcile.py # bootstrap/rebuild logic from Matrix server state
├── reset.py # dev-only reset CLI / entrypoint
├── room_router.py # room_id -> chat_id with recovery hook
├── store.py # metadata helpers, prefix scans, derived counters
└── handlers/
├── auth.py # first-time provisioning only
└── chat.py # uses recovered state, no provisioning fallback
```
### Pattern 1: Two-Phase Startup Bootstrap
**What:** Split startup into `login -> initial sync/full_state -> reconcile -> steady-state sync_forever`.
**When to use:** Always for Matrix bot startup when local DB may be missing or stale.
**Example:**
```python
# Source: matrix-nio AsyncClient docs/source + repo startup flow
client = AsyncClient(...)
runtime = build_runtime(store=SQLiteStore(db_path), client=client)
await login_or_restore_session(client)
await client.sync(timeout=0, full_state=True)
report = await reconcile_matrix_state(client, runtime.store, runtime.chat_mgr)
logger.info("matrix_reconcile_complete", **report)
await client.sync_forever(timeout=30000)
```
### Pattern 2: Rebuild Local Metadata From Joined Rooms
**What:** Enumerate joined rooms, inspect local hydrated room objects or room state, and recreate missing `matrix_room:*`, `matrix_user:*`, and `chat:*` records.
**When to use:** On startup and optionally on `unregistered:{room_id}` fallback at runtime.
**Example:**
```python
# Source: matrix-nio AsyncClient.joined_rooms/room_get_state + repo store contracts
joined = await client.joined_rooms()
for room_id in joined.rooms:
state = await client.room_get_state(room_id)
# detect: space room vs chat room, owner user, child relationship, display name
# rebuild matrix_room:{room_id}
# rebuild chat:{matrix_user_id}:{chat_id} if absent
```
### Pattern 3: Non-Destructive Reconciliation Report
**What:** Return a structured report: scanned rooms, restored rooms, restored chats, conflicts, skipped rooms.
**When to use:** Every reconciliation run, including dry-run.
**Example:**
```python
{
"joined_rooms": 4,
"restored_user_meta": 1,
"restored_room_meta": 3,
"restored_chat_rows": 3,
"conflicts": [],
"skipped_rooms": ["!dm:example.org"],
}
```
### Pattern 4: Reset Modes Are Explicit
**What:** Separate `local-only`, `server-leave-forget`, and `dry-run`.
**When to use:** For dev/QA only. Never mix destructive server cleanup into normal startup.
**Example:**
```bash
uv run python -m adapter.matrix.reset --mode local-only
uv run python -m adapter.matrix.reset --mode server-leave-forget --dry-run
```
### Anti-Patterns to Avoid
- **Provisioning during reconciliation:** Do not create a new Space or new rooms while trying to recover missing local state.
- **Treating `next_chat_index` as primary truth:** Derive it from recovered `chat_id` values after scan; do not trust a missing or stale counter.
- **Routing unknown rooms straight through:** `unregistered:{room_id}` is a signal to reconcile, not a stable runtime identity.
- **Destructive reset by default:** Startup must never leave/forget rooms automatically.
- **Blindly trusting local `surface_ref`:** If `chat:*` and `matrix_room:*` disagree, rebuild from Matrix room metadata and repair the chat row.
## Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| Room discovery | Custom DB-only reconstruction heuristics | `AsyncClient.joined_rooms()` plus synced room state | Server already knows which rooms the bot joined. |
| Space membership detection | Naming-convention parsing of room names | Matrix state: `m.room.create.type`, `m.space.child`, `m.space.parent` | Names are mutable and non-authoritative. |
| Room cleanup semantics | Custom “delete room” assumptions | `room_leave()` + `room_forget()` semantics | Client API supports leave/forget, not guaranteed global deletion. |
| Chat ID recovery | Hardcoded `C1/C2/...` reset | Rebuild from existing `matrix_room:*`/server state and compute next index | Prevents collisions after partial DB loss. |
| Diagnostic output | Ad hoc `print()` strings | Structured reconciliation/reset report via `structlog` | Easier manual QA and failure triage. |
**Key insight:** The homeserver already persists the bots room graph. This phase should rehydrate local cache from that graph, not attempt to replace it with a second custom truth model.
## Common Pitfalls
### Pitfall 1: Joining the sync loop before reconciliation
**What goes wrong:** Commands arrive while local metadata is still missing, producing `unregistered:{room_id}` routing or `ChatManager` misses.
**Why it happens:** Current `main()` enters `sync_forever()` immediately after login.
**How to avoid:** Perform initial sync and reconciliation first.
**Warning signs:** `unregistered_room` logs immediately after restart; `ValueError("Chat ... not found")` on `!rename` or `!archive`.
### Pitfall 2: Recovering room metadata but not chat rows
**What goes wrong:** Room routing works, but `ChatManager.rename/archive/list_active` still fails because `chat:{user}:{chat_id}` rows were not recreated.
**Why it happens:** Matrix adapter metadata and core chat metadata live in different keyspaces.
**How to avoid:** Reconciliation must repair both stores in one pass.
**Warning signs:** `matrix_room:*` exists but `chat:*` keys do not.
### Pitfall 3: Trusting stale `next_chat_index`
**What goes wrong:** New chats reuse existing `C` IDs after local recovery.
**Why it happens:** `next_chat_id()` increments a persisted counter that may be absent or behind.
**How to avoid:** After scan, set `next_chat_index = max(recovered_chat_numbers) + 1`.
**Warning signs:** New room gets `C1` even though Space already contains prior rooms.
### Pitfall 4: Assuming room names identify chat rooms safely
**What goes wrong:** Reconciliation binds the wrong room because a user renamed a room or Space.
**Why it happens:** Names are user-facing labels, not stable identifiers.
**How to avoid:** Prefer room state and existing `chat_id` metadata; use display names only as fallback.
**Warning signs:** Duplicate “Чат 1” names or renamed rooms break matching.
### Pitfall 5: Over-promising full cleanup
**What goes wrong:** Reset script claims a “clean slate” but rooms still exist in Element or for other members.
**Why it happens:** Leaving/forgetting affects the bot accounts membership/history, not necessarily global room deletion.
**How to avoid:** Name the mode accurately and print the manual client steps when needed.
**Warning signs:** QA reruns still show old rooms in the users client.
## Code Examples
Verified patterns from official sources and the installed library surface:
### Initial Sync Before Reconcile
```python
# Source: matrix-nio AsyncClient.sync/sync_forever
await client.sync(timeout=0, full_state=True)
report = await reconcile_matrix_state(client, store, chat_mgr)
await client.sync_forever(timeout=30000)
```
### Space Child Link Creation
```python
# Source: Matrix client-server API state event + current auth/new-chat flow
await client.room_put_state(
room_id=space_id,
event_type="m.space.child",
content={"via": [homeserver]},
state_key=chat_room_id,
)
```
### Bot-Side Leave/Forget Cleanup
```python
# Source: matrix-nio AsyncClient.room_leave / room_forget
for room_id in room_ids:
await client.room_leave(room_id)
await client.room_forget(room_id)
```
### Router Recovery Trigger
```python
# Source: repo room_router contract
chat_id = await resolve_chat_id(store, room_id, matrix_user_id)
if chat_id.startswith("unregistered:"):
await reconcile_single_room(client, store, chat_mgr, room_id, matrix_user_id)
```
## State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| Local adapter DB treated as the operational truth | Rebuildable local cache from server room graph | Mature Matrix client practice; supported by current Matrix CS API and `matrix-nio` | Restart no longer requires destructive local reset. |
| Manual room cleanup in client after experiments | Scripted leave/forget plus explicit manual instructions | Current `matrix-nio` 0.25.x API surface | QA becomes repeatable and auditable. |
| Immediate steady-state sync after login | Initial sync/full-state bootstrap before long polling | Supported by current `AsyncClient.sync()` / `sync_forever()` behavior | Reconciliation can run before any user traffic is handled. |
**Deprecated/outdated:**
- `README.md` Matrix manual QA instruction `rm -f lambda_matrix.db` as the primary restart flow: outdated for this phase.
- DM-first Matrix recovery assumptions in `docs/matrix-prototype.md`: outdated relative to Phase 1 Space+rooms decisions.
## Open Questions
1. **How exactly should reconciliation identify the owning Matrix user for a recovered room when local `matrix_room:*` is gone?**
- What we know: the bot can enumerate joined rooms and fetch room state; current healthy metadata stores `matrix_user_id` and `space_id`.
- What's unclear: whether Phase 1-created rooms also expose enough server-side structure to recover owner deterministically without existing local metadata in every case.
- Recommendation: Plan a proof test against a real homeserver/client. If room-state-only ownership is ambiguous, persist a tiny bot-authored marker state event going forward, but keep that addition narrowly scoped.
2. **Should runtime recovery happen only on startup, or also lazily on first unknown room access?**
- What we know: startup repair satisfies D-02/D-07 for common restart loss; `room_router` already surfaces unknown rooms cleanly.
- What's unclear: whether partial DB corruption during runtime is common enough to justify lazy single-room repair in Phase 01.1.
- Recommendation: Make startup reconciliation required, lazy room repair optional if it stays small.
3. **How much of server cleanup should Phase 01.1 automate?**
- What we know: `room_leave()` and `room_forget()` are available; global room deletion is not what the client API guarantees.
- What's unclear: whether automating bot-side leave/forget is worth the extra risk for this urgent phase.
- Recommendation: Keep `local-only` mandatory. Make server cleanup optional and clearly labeled experimental/dev-only if included.
## Environment Availability
| Dependency | Required By | Available | Version | Fallback |
|------------|------------|-----------|---------|----------|
| Python | Runtime, scripts, tests | ✓ | 3.14.3 | — |
| `uv` | Standard install/run workflow | ✓ | 0.9.30 | `python -m` + existing venv |
| `pytest` | Automated verification | ✓ | 9.0.2 | `uv run pytest` |
| Matrix homeserver credentials | Real restart/reset manual QA | ✗ in current shell | — | Manual-only after `.env` is configured |
| Matrix bot local DB/store paths | Reset workflow | ✓ | defaults in code | Can override with `MATRIX_DB_PATH` / `MATRIX_STORE_PATH` |
**Missing dependencies with no fallback:**
- Live Matrix credentials for real manual reconciliation/reset QA.
**Missing dependencies with fallback:**
- None for repository-only implementation and tests.
## Validation Architecture
### Test Framework
| Property | Value |
|----------|-------|
| Framework | `pytest 9.0.2` + `pytest-asyncio 1.3.0` |
| Config file | `pyproject.toml` |
| Quick run command | `pytest tests/adapter/matrix -v` |
| Full suite command | `pytest tests/ -v` |
### Phase Requirements → Test Map
| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|--------|----------|-----------|-------------------|-------------|
| PH01.1-BOOT | Startup rebuilds missing `matrix_user:*`, `matrix_room:*`, and `chat:*` from existing rooms without creating new rooms | unit/integration | `pytest tests/adapter/matrix/test_reconcile.py -v` | ❌ Wave 0 |
| PH01.1-ROUTER | Unknown room fallback can trigger repair or yields diagnosable warning without crashing commands | unit | `pytest tests/adapter/matrix/test_room_router_reconcile.py -v` | ❌ Wave 0 |
| PH01.1-COUNTER | Reconciliation resets `next_chat_index` to recovered max + 1 | unit | `pytest tests/adapter/matrix/test_reconcile.py -k next_chat_index -v` | ❌ Wave 0 |
| PH01.1-RESET | Dev reset `local-only` removes local DB/store paths and prints next steps | unit/smoke | `pytest tests/adapter/matrix/test_reset.py -v` | ❌ Wave 0 |
| PH01.1-NONDESTRUCTIVE | Reconciliation never calls room creation APIs | unit | `pytest tests/adapter/matrix/test_reconcile.py -k no_create -v` | ❌ Wave 0 |
### Sampling Rate
- **Per task commit:** `pytest tests/adapter/matrix -v`
- **Per wave merge:** `pytest tests/ -v`
- **Phase gate:** Full suite green before `/gsd:verify-work`
### Wave 0 Gaps
- [ ] `tests/adapter/matrix/test_reconcile.py` - startup reconciliation scenarios
- [ ] `tests/adapter/matrix/test_reset.py` - CLI/script reset modes and output
- [ ] `tests/adapter/matrix/test_room_router_reconcile.py` - lazy recovery or warning behavior
- [ ] Integration fixture for a fake `AsyncClient` response surface matching `joined_rooms()` and `room_get_state()`
## Sources
### Primary (HIGH confidence)
- Matrix Client-Server API - room state, leave, forget, joined rooms, Spaces semantics: https://spec.matrix.org/latest/client-server-api/index.html
- `matrix-nio` installed 0.25.2 API surface verified locally on 2026-04-03 via `AsyncClient.sync`, `sync_forever`, `joined_rooms`, `room_get_state`, `room_leave`, `room_forget`
- Repo code: [adapter/matrix/bot.py](/Users/a/MAI/sem2/lambda/surfaces-bot/adapter/matrix/bot.py), [adapter/matrix/store.py](/Users/a/MAI/sem2/lambda/surfaces-bot/adapter/matrix/store.py), [adapter/matrix/room_router.py](/Users/a/MAI/sem2/lambda/surfaces-bot/adapter/matrix/room_router.py), [adapter/matrix/handlers/auth.py](/Users/a/MAI/sem2/lambda/surfaces-bot/adapter/matrix/handlers/auth.py), [core/chat.py](/Users/a/MAI/sem2/lambda/surfaces-bot/core/chat.py)
- PyPI release metadata: https://pypi.org/project/matrix-nio/ , https://pypi.org/project/pytest/ , https://pypi.org/project/pytest-asyncio/ , https://pypi.org/project/structlog/ , https://pypi.org/project/python-dotenv/
### Secondary (MEDIUM confidence)
- [README.md](/Users/a/MAI/sem2/lambda/surfaces-bot/README.md) - current manual reset habit and run commands
- [docs/matrix-prototype.md](/Users/a/MAI/sem2/lambda/surfaces-bot/docs/matrix-prototype.md) - original Matrix UX intent, noting outdated DM/reaction sections
- [01-CONTEXT.md](/Users/a/MAI/sem2/lambda/surfaces-bot/.planning/phases/01-matrix-qa-polish/01-CONTEXT.md) - locked Phase 1 Matrix decisions
- [01-VERIFICATION.md](/Users/a/MAI/sem2/lambda/surfaces-bot/.planning/phases/01-matrix-qa-polish/01-VERIFICATION.md) - what has already been verified and what still needs human Matrix QA
### Tertiary (LOW confidence)
- None
## Metadata
**Confidence breakdown:**
- Standard stack: HIGH - verified against installed environment, PyPI metadata, and official Matrix spec
- Architecture: HIGH - directly grounded in current repo flow plus current `matrix-nio`/Matrix capabilities
- Pitfalls: HIGH - derived from concrete gaps in current startup/store/router code
**Research date:** 2026-04-03
**Valid until:** 2026-05-03