surfaces/.planning/phases/01.1-matrix-restart-reconciliation-and-dev-reset-workflow/01.1-RESEARCH.md


Phase 01.1: Matrix restart reconciliation and dev reset workflow - Research

Researched: 2026-04-03
Domain: Matrix adapter restart reconciliation, local state recovery, dev reset workflow
Confidence: HIGH

<user_constraints>

User Constraints (from CONTEXT.md)

Locked Decisions

  • D-01: The local SQLite store must no longer be treated as the single source of truth for the Matrix runtime in the dev workflow.
  • D-02: On startup, the bot must try to restore the minimally required local state from the existing Matrix rooms / Space rather than require a full reset.
  • D-03: Reconciliation must restore at least the matrix_user:*, matrix_room:*, and missing chat:{user}:{chat_id} records when the server-side rooms already exist.
  • D-04: Reconciliation must not create new Space/rooms when the task is specifically to restore local state after a restart.
  • D-05: A plain bot restart must be the primary path during development; deleting lambda_matrix.db and matrix_store must not be required to verify the workflow.
  • D-06: If local state is incomplete, the bot must either restore it or log a clear reason, rather than fail on commands such as !rename.
  • D-07: Inconsistency between room_meta and ChatManager must be detected and resolved automatically at startup or on first access.
  • D-08: A separate dev-only reset tool/script is needed for controlled QA, instead of a manual set of shell commands.
  • D-09: The reset workflow must at minimum support a local-only reset: deleting lambda_matrix.db and matrix_store, with clear instructions on what to do with the server-side Matrix rooms.
  • D-10: If full server-side cleanup is not automated in this phase, the tool must explicitly print which manual steps are required in the Matrix client.

Claude's Discretion

  • The exact place where reconciliation is called in the startup flow
  • The internal structure of the helper module (bootstrap.py, reconcile.py, or similar)
  • The format of the dev reset script and the level of automation of server-side cleanup
  • Details of debug logging and a dry-run mode, if they help without inflating scope

Deferred Ideas (OUT OF SCOPE)

  • Full production-grade migration of historical Matrix state across schema versions
  • Automatic server-side deletion/leave for all Matrix rooms and Space during reset, if it requires broader admin semantics
  • Any Phase 2 SDK integration work

</user_constraints>

Summary

Phase 01.1 should be planned as a bootstrap/recovery phase, not as another chat-feature phase. The current Matrix adapter has no startup reconciliation path: adapter/matrix/bot.py logs in and goes directly to sync_forever(), while routing and command handlers assume that the matrix_room:*, matrix_user:*, and chat:* keys already exist. As a result, losing the local DB currently produces logical corruption, not just a missing cache.

The safe standard approach is: perform a first sync that hydrates joined-room state, inspect the bot's current joined rooms and room state from the homeserver, rebuild the minimal local metadata needed for command routing, and only then enter the long-running sync loop. Reconciliation should be non-destructive and idempotent: if local keys already exist and match server state, leave them alone; if they are missing, recreate them; if they conflict, prefer the server room topology for Matrix-specific metadata and recreate missing ChatManager rows from that.
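
The non-destructive, idempotent rule described above can be sketched as a single per-key decision. This is only an illustration under stated assumptions: a plain dict stands in for the KV store, `server_meta` is assumed to have been derived from Matrix room state already, and `reconcile_key` and its action names are hypothetical, not part of the repo.

```python
from typing import Any, Dict


def reconcile_key(store: Dict[str, Dict[str, Any]], key: str,
                  server_meta: Dict[str, Any]) -> str:
    """Apply one reconciliation decision and return the action taken."""
    local = store.get(key)
    if local is None:
        # Missing locally: recreate from the server-derived metadata.
        store[key] = dict(server_meta)
        return "restored"
    if all(local.get(field) == value for field, value in server_meta.items()):
        # Already consistent: leave the local record untouched.
        return "unchanged"
    # Conflict: server-derived fields win, other local fields are kept.
    store[key] = {**local, **server_meta}
    return "conflict_repaired"


# Hypothetical data: one stale local record, one record missing entirely.
store = {"matrix_room:!a:example.org": {"chat_id": "C1", "space_id": "!old:example.org"}}
server_view = {
    "matrix_room:!a:example.org": {"chat_id": "C1", "space_id": "!s:example.org"},
    "matrix_room:!b:example.org": {"chat_id": "C2", "space_id": "!s:example.org"},
}
first_pass = [reconcile_key(store, k, v) for k, v in server_view.items()]
second_pass = [reconcile_key(store, k, v) for k, v in server_view.items()]
```

Running the same pass twice is the idempotency check: the second pass should report no changes.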

For reset, separate two workflows explicitly. local-only reset is the default and should be automated. Optional server-side cleanup may leave/forget rooms for the bot account, but it cannot promise global deletion of Matrix rooms for all members; if that is not automated, the tool must print the exact manual steps for the Matrix client.

Primary recommendation: Add a startup reconcile_matrix_state() step before sync_forever(), and ship a dev-only reset CLI with local-only, server-leave-forget, and dry-run modes.

Project Constraints (from CLAUDE.md)

  • Do not treat missing Lambda SDK as a blocker.
  • Keep all platform calls behind platform/interface.py.
  • Current runtime implementation is platform/mock.py; recommendations must work with that.
  • Prefer architecture changes in adapters and core without coupling to future SDK internals.
  • Use pytest-based verification.
  • Do not recommend committing .env.
  • Respect dependency order: core/ first, then platform/, then adapters.

Standard Stack

Core

| Library | Version | Purpose | Why Standard |
| --- | --- | --- | --- |
| Python | 3.14.3 installed | Runtime for bot and scripts | Already available locally; codebase targets >=3.11. |
| matrix-nio | 0.25.2, published 2024-10-04 | Matrix client, sync, room membership/state APIs | Already installed; exposes the exact bootstrap/reset APIs this phase needs. |
| SQLiteStore (repo) | local | Adapter/core KV persistence | Existing persistence contract for matrix_user:*, matrix_room:*, and chat:*. |
| Matrix Client-Server API spec | latest | Authoritative room membership/state semantics | Needed to reason about restart recovery and leave/forget behavior correctly. |

Supporting

| Library | Version | Purpose | When to Use |
| --- | --- | --- | --- |
| pytest | 9.0.2, published 2025-12-06 | Test runner | For targeted adapter/bootstrap regression tests. |
| pytest-asyncio | 1.3.0, published 2025-11-10 | Async test execution | For async reconciliation/reset flows. |
| structlog | 25.5.0, published 2025-10-27 | Diagnostics | For reconciliation summaries and conflict logging. |
| python-dotenv | 1.2.2, published 2026-03-01 | Env loading | Already used by adapter/matrix/bot.py for Matrix config. |

Alternatives Considered

| Instead of | Could Use | Tradeoff |
| --- | --- | --- |
| Startup reconciliation from joined rooms + state | Force developers to wipe local DB and recreate rooms | Simpler code, but directly violates D-01, D-02, D-05. |
| Non-destructive local rebuild | Full auto-recreate of Space/rooms on missing local state | Easier to implement, but causes duplicate Matrix rooms and breaks D-04. |
| Dev reset script | README-only manual ritual | Lower code cost, but not repeatable and fails D-08..D-10. |

Installation:

uv sync

Version verification: Verified via installed environment and PyPI metadata on 2026-04-03:

  • matrix-nio 0.25.2 - 2024-10-04
  • pytest 9.0.2 - 2025-12-06
  • pytest-asyncio 1.3.0 - 2025-11-10
  • structlog 25.5.0 - 2025-10-27
  • python-dotenv 1.2.2 - 2026-03-01

Architecture Patterns

adapter/matrix/
├── bot.py                 # startup flow calls reconciliation before sync loop
├── reconcile.py           # bootstrap/rebuild logic from Matrix server state
├── reset.py               # dev-only reset CLI / entrypoint
├── room_router.py         # room_id -> chat_id with recovery hook
├── store.py               # metadata helpers, prefix scans, derived counters
└── handlers/
    ├── auth.py            # first-time provisioning only
    └── chat.py            # uses recovered state, no provisioning fallback

Pattern 1: Two-Phase Startup Bootstrap

What: Split startup into login -> initial sync/full_state -> reconcile -> steady-state sync_forever.
When to use: Always for Matrix bot startup when local DB may be missing or stale.
Example:

# Source: matrix-nio AsyncClient docs/source + repo startup flow
client = AsyncClient(...)
runtime = build_runtime(store=SQLiteStore(db_path), client=client)

await login_or_restore_session(client)
await client.sync(timeout=0, full_state=True)
report = await reconcile_matrix_state(client, runtime.store, runtime.chat_mgr)
logger.info("matrix_reconcile_complete", **report)
await client.sync_forever(timeout=30000)

Pattern 2: Rebuild Local Metadata From Joined Rooms

What: Enumerate joined rooms, inspect local hydrated room objects or room state, and recreate missing matrix_room:*, matrix_user:*, and chat:* records.
When to use: On startup and optionally on unregistered:{room_id} fallback at runtime.
Example:

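# Source: matrix-nio AsyncClient.joined_rooms/room_get_state + repo store contracts
from nio import JoinedRoomsResponse, RoomGetStateResponse

joined = await client.joined_rooms()
if not isinstance(joined, JoinedRoomsResponse):
    # matrix-nio returns an error response object instead of raising
    raise RuntimeError(f"joined_rooms failed: {joined}")
for room_id in joined.rooms:
    state = await client.room_get_state(room_id)
    if not isinstance(state, RoomGetStateResponse):
        continue  # record the room as skipped in the report
    # state.events: detect space room vs chat room, owner user, child relationship, display name
    # rebuild matrix_room:{room_id}
    # rebuild chat:{matrix_user_id}:{chat_id} if absent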

Pattern 3: Non-Destructive Reconciliation Report

What: Return a structured report: scanned rooms, restored rooms, restored chats, conflicts, skipped rooms.
When to use: Every reconciliation run, including dry-run.
Example:

{
    "joined_rooms": 4,
    "restored_user_meta": 1,
    "restored_room_meta": 3,
    "restored_chat_rows": 3,
    "conflicts": [],
    "skipped_rooms": ["!dm:example.org"],
}
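
One way to keep that report shape stable across runs is a small dataclass whose fields mirror the keys above. This is a sketch only; the class name `ReconcileReport` and the method `as_log_fields` are assumptions, not existing repo code.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class ReconcileReport:
    """Counters and lists collected during one reconciliation run."""
    joined_rooms: int = 0
    restored_user_meta: int = 0
    restored_room_meta: int = 0
    restored_chat_rows: int = 0
    conflicts: List[str] = field(default_factory=list)
    skipped_rooms: List[str] = field(default_factory=list)

    def as_log_fields(self) -> Dict[str, Any]:
        # Plain dict, suitable for logger.info("...", **fields).
        return asdict(self)


report = ReconcileReport(joined_rooms=4, restored_user_meta=1)
report.restored_room_meta += 3
report.restored_chat_rows += 3
report.skipped_rooms.append("!dm:example.org")
fields = report.as_log_fields()
```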

Pattern 4: Reset Modes Are Explicit

What: Separate local-only, server-leave-forget, and dry-run.
When to use: For dev/QA only. Never mix destructive server cleanup into normal startup.
Example:

uv run python -m adapter.matrix.reset --mode local-only
uv run python -m adapter.matrix.reset --mode server-leave-forget --dry-run
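
A minimal sketch of how such a CLI might parse these modes and perform the local-only reset. The `--mode` and `--dry-run` flags come from the commands above; the `--db-path` and `--store-path` flags, the function names, and the wording of the printed manual step are assumptions made for illustration. The server-leave-forget mode is deliberately left out of this sketch.

```python
import argparse
import os
import shutil
import tempfile
from pathlib import Path
from typing import List, Optional

MANUAL_STEPS = "manual step: leave the old rooms/Space in your Matrix client if needed"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="adapter.matrix.reset")
    parser.add_argument("--mode", choices=["local-only", "server-leave-forget"],
                        default="local-only")
    parser.add_argument("--dry-run", action="store_true")
    # Hypothetical flags; defaults follow the env overrides named in this document.
    parser.add_argument("--db-path", type=Path,
                        default=Path(os.environ.get("MATRIX_DB_PATH", "lambda_matrix.db")))
    parser.add_argument("--store-path", type=Path,
                        default=Path(os.environ.get("MATRIX_STORE_PATH", "matrix_store")))
    return parser


def remove_local_state(paths: List[Path], dry_run: bool) -> List[str]:
    """Delete local files/directories, or only describe what would be deleted."""
    actions = []
    for path in paths:
        if not path.exists():
            actions.append(f"skipped (missing): {path}")
            continue
        actions.append(f"{'would remove' if dry_run else 'removed'}: {path}")
        if not dry_run:
            shutil.rmtree(path) if path.is_dir() else path.unlink()
    return actions


def main(argv: Optional[List[str]] = None) -> List[str]:
    args = build_parser().parse_args(argv)
    if args.mode != "local-only":
        return [f"{args.mode}: not covered by this sketch"]
    actions = remove_local_state([args.db_path, args.store_path], args.dry_run)
    actions.append(MANUAL_STEPS)
    return actions


# Demo against throwaway paths so nothing real is deleted.
with tempfile.TemporaryDirectory() as tmp:
    db, store_dir = Path(tmp, "lambda_matrix.db"), Path(tmp, "matrix_store")
    db.write_text("")
    store_dir.mkdir()
    argv = ["--db-path", str(db), "--store-path", str(store_dir)]
    dry = main(argv + ["--dry-run"])
    survived_dry_run = db.exists() and store_dir.exists()
    real = main(argv)
    removed = not db.exists() and not store_dir.exists()
```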

Anti-Patterns to Avoid

  • Provisioning during reconciliation: Do not create a new Space or new rooms while trying to recover missing local state.
  • Treating next_chat_index as primary truth: Derive it from recovered chat_id values after scan; do not trust a missing or stale counter.
  • Routing unknown rooms straight through: unregistered:{room_id} is a signal to reconcile, not a stable runtime identity.
  • Destructive reset by default: Startup must never leave/forget rooms automatically.
  • Blindly trusting local surface_ref: If chat:* and matrix_room:* disagree, rebuild from Matrix room metadata and repair the chat row.

Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
| --- | --- | --- | --- |
| Room discovery | Custom DB-only reconstruction heuristics | AsyncClient.joined_rooms() plus synced room state | Server already knows which rooms the bot joined. |
| Space membership detection | Naming-convention parsing of room names | Matrix state: m.room.create.type, m.space.child, m.space.parent | Names are mutable and non-authoritative. |
| Room cleanup semantics | Custom “delete room” assumptions | room_leave() + room_forget() semantics | Client API supports leave/forget, not guaranteed global deletion. |
| Chat ID recovery | Hardcoded C1/C2/... reset | Rebuild from existing matrix_room:*/server state and compute next index | Prevents collisions after partial DB loss. |
| Diagnostic output | Ad hoc print() strings | Structured reconciliation/reset report via structlog | Easier manual QA and failure triage. |

Key insight: The homeserver already persists the bot's room graph. This phase should rehydrate the local cache from that graph, not attempt to replace it with a second custom truth model.

Common Pitfalls

Pitfall 1: Joining the sync loop before reconciliation

What goes wrong: Commands arrive while local metadata is still missing, producing unregistered:{room_id} routing or ChatManager misses.
Why it happens: Current main() enters sync_forever() immediately after login.
How to avoid: Perform initial sync and reconciliation first.
Warning signs: unregistered_room logs immediately after restart; ValueError("Chat ... not found") on !rename or !archive.

Pitfall 2: Recovering room metadata but not chat rows

What goes wrong: Room routing works, but ChatManager.rename/archive/list_active still fails because chat:{user}:{chat_id} rows were not recreated.
Why it happens: Matrix adapter metadata and core chat metadata live in different keyspaces.
How to avoid: Reconciliation must repair both stores in one pass.
Warning signs: matrix_room:* exists but chat:* keys do not.
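
The warning sign for this pitfall can be checked mechanically. The sketch below assumes a dict-like view of the store and assumes each matrix_room:* record carries `matrix_user_id` and `chat_id` fields; both the field names and the helper `find_missing_chat_rows` are illustrative, not confirmed repo contracts.

```python
from typing import Any, Dict, List


def find_missing_chat_rows(store: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return chat:* keys that matrix_room:* records imply but the store lacks."""
    missing = []
    for key, meta in store.items():
        if not key.startswith("matrix_room:"):
            continue
        user_id, chat_id = meta.get("matrix_user_id"), meta.get("chat_id")
        if not user_id or not chat_id:
            continue  # e.g. a Space room record without a chat binding
        chat_key = f"chat:{user_id}:{chat_id}"
        if chat_key not in store:
            missing.append(chat_key)
    return sorted(missing)


# Hypothetical store after partial recovery: C2 has room metadata but no chat row.
store = {
    "matrix_room:!a:example.org": {"matrix_user_id": "@u:example.org", "chat_id": "C1"},
    "matrix_room:!b:example.org": {"matrix_user_id": "@u:example.org", "chat_id": "C2"},
    "matrix_room:!s:example.org": {"matrix_user_id": "@u:example.org"},
    "chat:@u:example.org:C1": {"name": "Chat 1"},
}
missing = find_missing_chat_rows(store)
```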

Pitfall 3: Trusting stale next_chat_index

What goes wrong: New chats reuse existing C IDs after local recovery.
Why it happens: next_chat_id() increments a persisted counter that may be absent or behind.
How to avoid: After scan, set next_chat_index = max(recovered_chat_numbers) + 1.
Warning signs: New room gets C1 even though Space already contains prior rooms.
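
Deriving the counter from recovered IDs might look like the sketch below. It assumes chat IDs have the form `C<number>` (as in C1/C2 elsewhere in this document) and that an empty recovery should start at 1; the helper name `derive_next_chat_index` is hypothetical.

```python
import re
from typing import Iterable

_CHAT_ID = re.compile(r"^C(\d+)$")


def derive_next_chat_index(chat_ids: Iterable[str]) -> int:
    """Return max recovered chat number + 1, ignoring IDs that do not match C<number>."""
    numbers = []
    for chat_id in chat_ids:
        match = _CHAT_ID.match(chat_id)
        if match:
            numbers.append(int(match.group(1)))
    # default=0 covers the case where nothing was recovered.
    return max(numbers, default=0) + 1


after_recovery = derive_next_chat_index(["C1", "C3", "C2"])
after_empty_scan = derive_next_chat_index([])
```

Using the maximum rather than the count avoids reusing an ID when earlier rooms were archived or left, which would leave gaps in the sequence.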

Pitfall 4: Assuming room names identify chat rooms safely

What goes wrong: Reconciliation binds the wrong room because a user renamed a room or Space.
Why it happens: Names are user-facing labels, not stable identifiers.
How to avoid: Prefer room state and existing chat_id metadata; use display names only as fallback.
Warning signs: Duplicate “Чат 1” ("Chat 1") names or renamed rooms break matching.

Pitfall 5: Over-promising full cleanup

What goes wrong: Reset script claims a “clean slate” but rooms still exist in Element or for other members.
Why it happens: Leaving/forgetting affects the bot account's membership/history, not necessarily global room deletion.
How to avoid: Name the mode accurately and print the manual client steps when needed.
Warning signs: QA reruns still show old rooms in the user's client.

Code Examples

Verified patterns from official sources and the installed library surface:

Initial Sync Before Reconcile

# Source: matrix-nio AsyncClient.sync/sync_forever
await client.sync(timeout=0, full_state=True)
report = await reconcile_matrix_state(client, store, chat_mgr)
await client.sync_forever(timeout=30000)

# Source: Matrix client-server API state event + current auth/new-chat flow
await client.room_put_state(
    room_id=space_id,
    event_type="m.space.child",
    content={"via": [homeserver]},
    state_key=chat_room_id,
)

Bot-Side Leave/Forget Cleanup

# Source: matrix-nio AsyncClient.room_leave / room_forget
for room_id in room_ids:
    await client.room_leave(room_id)
    await client.room_forget(room_id)
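
A slightly more defensive variant of the loop above, sketched with a dry-run switch and per-room results. It assumes the client follows the matrix-nio convention of returning an error response object rather than raising; the name-based `_is_error` check is a stand-in so the sketch runs without matrix-nio installed (real code would use isinstance against the nio error classes). `leave_and_forget`, `FakeClient`, and the result strings are hypothetical.

```python
import asyncio
from typing import Dict, Iterable


def _is_error(response: object) -> bool:
    # Stand-in for isinstance(response, nio.RoomLeaveError / nio.RoomForgetError).
    return type(response).__name__.endswith("Error")


async def leave_and_forget(client, room_ids: Iterable[str],
                           dry_run: bool = True) -> Dict[str, str]:
    """Leave and forget each room for the bot account; report one outcome per room."""
    results = {}
    for room_id in room_ids:
        if dry_run:
            results[room_id] = "would leave+forget"
            continue
        if _is_error(await client.room_leave(room_id)):
            results[room_id] = "leave failed"
            continue  # do not attempt to forget a room we could not leave
        forgot = await client.room_forget(room_id)
        results[room_id] = "forget failed" if _is_error(forgot) else "left+forgotten"
    return results


class FakeOk:
    pass


class FakeLeaveError:
    pass


class FakeClient:
    """Test double exposing only the two calls used above."""

    def __init__(self, fail_leave=()):
        self.fail_leave = set(fail_leave)
        self.calls = []

    async def room_leave(self, room_id):
        self.calls.append(("leave", room_id))
        return FakeLeaveError() if room_id in self.fail_leave else FakeOk()

    async def room_forget(self, room_id):
        self.calls.append(("forget", room_id))
        return FakeOk()


client = FakeClient(fail_leave={"!b:example.org"})
rooms = ["!a:example.org", "!b:example.org"]
planned = asyncio.run(leave_and_forget(client, rooms, dry_run=True))
calls_after_dry_run = list(client.calls)
done = asyncio.run(leave_and_forget(client, rooms, dry_run=False))
```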

Router Recovery Trigger

# Source: repo room_router contract
chat_id = await resolve_chat_id(store, room_id, matrix_user_id)
if chat_id.startswith("unregistered:"):
    await reconcile_single_room(client, store, chat_mgr, room_id, matrix_user_id)

State of the Art

| Old Approach | Current Approach | When Changed | Impact |
| --- | --- | --- | --- |
| Local adapter DB treated as the operational truth | Rebuildable local cache from server room graph | Mature Matrix client practice; supported by current Matrix CS API and matrix-nio | Restart no longer requires destructive local reset. |
| Manual room cleanup in client after experiments | Scripted leave/forget plus explicit manual instructions | Current matrix-nio 0.25.x API surface | QA becomes repeatable and auditable. |
| Immediate steady-state sync after login | Initial sync/full-state bootstrap before long polling | Supported by current AsyncClient.sync() / sync_forever() behavior | Reconciliation can run before any user traffic is handled. |

Deprecated/outdated:

  • README.md Matrix manual QA instruction rm -f lambda_matrix.db as the primary restart flow: outdated for this phase.
  • DM-first Matrix recovery assumptions in docs/matrix-prototype.md: outdated relative to Phase 1 Space+rooms decisions.

Open Questions

  1. How exactly should reconciliation identify the owning Matrix user for a recovered room when local matrix_room:* is gone?

    • What we know: the bot can enumerate joined rooms and fetch room state; current healthy metadata stores matrix_user_id and space_id.
    • What's unclear: whether Phase 1-created rooms also expose enough server-side structure to recover owner deterministically without existing local metadata in every case.
    • Recommendation: Plan a proof test against a real homeserver/client. If room-state-only ownership is ambiguous, persist a tiny bot-authored marker state event going forward, but keep that addition narrowly scoped.
  2. Should runtime recovery happen only on startup, or also lazily on first unknown room access?

    • What we know: startup repair satisfies D-02/D-07 for common restart loss; room_router already surfaces unknown rooms cleanly.
    • What's unclear: whether partial DB corruption during runtime is common enough to justify lazy single-room repair in Phase 01.1.
    • Recommendation: Make startup reconciliation required, lazy room repair optional if it stays small.
  3. How much of server cleanup should Phase 01.1 automate?

    • What we know: room_leave() and room_forget() are available; global room deletion is not what the client API guarantees.
    • What's unclear: whether automating bot-side leave/forget is worth the extra risk for this urgent phase.
    • Recommendation: Keep local-only mandatory. Make server cleanup optional and clearly labeled experimental/dev-only if included.
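
For the first open question, owner recovery could be prototyped along these lines before the proof test against a real homeserver. Everything here is an assumption: the marker event type `org.example.lambda.chat_meta` is a made-up placeholder, the `matrix_user_id` content field is illustrative, and the fallback (exactly one joined non-bot member) is only one possible heuristic. Only `m.room.member` with `membership: join` is standard Matrix state.

```python
from typing import Any, Dict, List, Optional

MARKER_TYPE = "org.example.lambda.chat_meta"  # hypothetical bot-authored state event


def infer_owner(events: List[Dict[str, Any]], bot_user_id: str,
                marker_type: str = MARKER_TYPE) -> Optional[str]:
    """Return the owning user ID, or None when ownership is ambiguous."""
    joined = []
    for event in events:
        content = event.get("content") or {}
        if (event.get("type") == marker_type
                and event.get("sender") == bot_user_id
                and content.get("matrix_user_id")):
            # A marker written by the bot itself takes priority.
            return content["matrix_user_id"]
        if event.get("type") == "m.room.member" and content.get("membership") == "join":
            joined.append(event.get("state_key"))
    others = [user for user in joined if user and user != bot_user_id]
    # Fallback: unambiguous only if exactly one non-bot member is joined.
    return others[0] if len(others) == 1 else None


def member(user_id: str) -> Dict[str, Any]:
    return {"type": "m.room.member", "state_key": user_id, "sender": user_id,
            "content": {"membership": "join"}}


BOT = "@bot:example.org"
two_party = [member(BOT), member("@alice:example.org")]
three_party = two_party + [member("@bob:example.org")]
with_marker = three_party + [{"type": MARKER_TYPE, "state_key": "", "sender": BOT,
                              "content": {"matrix_user_id": "@alice:example.org"}}]
owners = [infer_owner(e, BOT) for e in (two_party, three_party, with_marker)]
```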

Environment Availability

| Dependency | Required By | Available | Version | Fallback |
| --- | --- | --- | --- | --- |
| Python | Runtime, scripts, tests | | 3.14.3 | |
| uv | Standard install/run workflow | | 0.9.30 | python -m + existing venv |
| pytest | Automated verification | | 9.0.2 | uv run pytest |
| Matrix homeserver credentials | Real restart/reset manual QA | ✗ in current shell | | Manual-only after .env is configured |
| Matrix bot local DB/store paths | Reset workflow | defaults in code | | Can override with MATRIX_DB_PATH / MATRIX_STORE_PATH |

Missing dependencies with no fallback:

  • Live Matrix credentials for real manual reconciliation/reset QA.

Missing dependencies with fallback:

  • None for repository-only implementation and tests.

Validation Architecture

Test Framework

| Property | Value |
| --- | --- |
| Framework | pytest 9.0.2 + pytest-asyncio 1.3.0 |
| Config file | pyproject.toml |
| Quick run command | pytest tests/adapter/matrix -v |
| Full suite command | pytest tests/ -v |

Phase Requirements → Test Map

| Req ID | Behavior | Test Type | Automated Command | File Exists? |
| --- | --- | --- | --- | --- |
| PH01.1-BOOT | Startup rebuilds missing matrix_user:*, matrix_room:*, and chat:* from existing rooms without creating new rooms | unit/integration | pytest tests/adapter/matrix/test_reconcile.py -v | Wave 0 |
| PH01.1-ROUTER | Unknown room fallback can trigger repair or yields diagnosable warning without crashing commands | unit | pytest tests/adapter/matrix/test_room_router_reconcile.py -v | Wave 0 |
| PH01.1-COUNTER | Reconciliation resets next_chat_index to recovered max + 1 | unit | pytest tests/adapter/matrix/test_reconcile.py -k next_chat_index -v | Wave 0 |
| PH01.1-RESET | Dev reset local-only removes local DB/store paths and prints next steps | unit/smoke | pytest tests/adapter/matrix/test_reset.py -v | Wave 0 |
| PH01.1-NONDESTRUCTIVE | Reconciliation never calls room creation APIs | unit | pytest tests/adapter/matrix/test_reconcile.py -k no_create -v | Wave 0 |

Sampling Rate

  • Per task commit: pytest tests/adapter/matrix -v
  • Per wave merge: pytest tests/ -v
  • Phase gate: Full suite green before /gsd:verify-work

Wave 0 Gaps

  • tests/adapter/matrix/test_reconcile.py - startup reconciliation scenarios
  • tests/adapter/matrix/test_reset.py - CLI/script reset modes and output
  • tests/adapter/matrix/test_room_router_reconcile.py - lazy recovery or warning behavior
  • Integration fixture for a fake AsyncClient response surface matching joined_rooms() and room_get_state()
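
The fake AsyncClient fixture listed above could start from a sketch like the following. The class and field names are hypothetical; the fake only mirrors the `rooms` and `events` attributes that the joined_rooms() and room_get_state() responses are assumed to expose, and its `room_create` fails loudly so a test can assert the PH01.1-NONDESTRUCTIVE requirement.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class FakeJoinedRooms:
    rooms: List[str]


@dataclass
class FakeRoomState:
    room_id: str
    events: List[Dict[str, Any]]


class FakeMatrixClient:
    """Test double that serves canned room state and forbids room creation."""

    def __init__(self, state_by_room: Dict[str, List[Dict[str, Any]]]):
        self._state = state_by_room
        self.create_attempts = 0

    async def joined_rooms(self) -> FakeJoinedRooms:
        return FakeJoinedRooms(rooms=list(self._state))

    async def room_get_state(self, room_id: str) -> FakeRoomState:
        return FakeRoomState(room_id=room_id, events=self._state[room_id])

    async def room_create(self, *args: Any, **kwargs: Any) -> None:
        self.create_attempts += 1
        raise AssertionError("reconciliation must not create rooms")


async def count_state_events(client: FakeMatrixClient) -> Dict[str, int]:
    """Tiny consumer used here only to exercise the fake."""
    joined = await client.joined_rooms()
    counts = {}
    for room_id in joined.rooms:
        state = await client.room_get_state(room_id)
        counts[room_id] = len(state.events)
    return counts


fake = FakeMatrixClient({
    "!s:example.org": [{"type": "m.room.create", "state_key": "",
                        "content": {"type": "m.space"}}],
    "!a:example.org": [],
})
counts = asyncio.run(count_state_events(fake))
```

If the code under test checks response types with isinstance, the fake would need to return real matrix-nio response objects instead of these stand-ins.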

Sources

Primary (HIGH confidence)

Secondary (MEDIUM confidence)

Tertiary (LOW confidence)

  • None

Metadata

Confidence breakdown:

  • Standard stack: HIGH - verified against installed environment, PyPI metadata, and official Matrix spec
  • Architecture: HIGH - directly grounded in current repo flow plus current matrix-nio/Matrix capabilities
  • Pitfalls: HIGH - derived from concrete gaps in current startup/store/router code

Research date: 2026-04-03
Valid until: 2026-05-03