# Phase 05: MVP Deployment - Research

**Researched:** 2026-04-28
**Domain:** Matrix bot production deployment, restart reconciliation, per-room context isolation, shared-volume file transfer
**Confidence:** HIGH

## Project Constraints (from CLAUDE.md)

- All platform calls must stay behind `platform/interface.py` (`PlatformClient` protocol).
- Current platform implementation is a mock / replaceable adapter; architecture must not depend on unfinished upstream SDK.
- Keep architecture decisions inside this repo and document contracts locally.
- Prefer async, adapter/core separation, and do not bypass the existing `core/` and `adapter/` layering.
- Use `uv sync` for dependency installation.
- Use `pytest tests/ -v` and adapter-specific pytest slices for verification.
- Never commit `.env`.
- Dependency order remains fixed: `core/` first, `platform/` second, adapters after that.

## Summary

Phase 05 should not introduce a new stack. The established implementation path is to harden the existing `matrix-nio + SQLiteStore + RoutedPlatformClient + shared workspace volume` design so production restart behavior matches the current Space+rooms UX. The main architectural rule is: Matrix topology is authoritative for room existence, while local SQLite metadata is authoritative only after reconciliation has rebuilt it.

The production-safe approach is to bind every working Matrix room to its own durable `platform_chat_id`, rotate only that identifier for `!clear`, and make restart recovery idempotent. Reconciliation should rebuild `user_meta`, `room_meta`, `ChatManager` entries, and missing routing fields from Matrix Space membership and room state before `sync_forever()` begins processing live traffic. Unknown rooms must be reconciled first, not silently converted into new chats.

For files, keep the current shared-volume contract and relative `workspace_path` transport. Do not build HTTP file shims or embed file payloads in bot-side state.
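
The relative `workspace_path` contract above can be illustrated with a minimal sketch. The helper name `resolve_workspace_path` is hypothetical and not taken from the repo; it only assumes that each container knows its own mount root and receives a POSIX-style relative path, so the same reference resolves correctly on both sides of the shared volume.

```python
from pathlib import Path, PurePosixPath


def resolve_workspace_path(workspace_root: Path, workspace_path: str) -> Path:
    """Map a relative workspace_path onto a container's mount root (hypothetical helper)."""
    relative = PurePosixPath(workspace_path)
    # The contract is relative-only: refuse empty, absolute, or parent-escaping paths.
    if not relative.parts or relative.is_absolute() or ".." in relative.parts:
        raise ValueError(f"workspace_path must be relative and contained: {workspace_path!r}")
    return workspace_root.joinpath(*relative.parts)


# The bot and the agent may mount the volume at different roots,
# but they exchange only the relative part.
shared_ref = "surfaces/matrix/alice/room1/inbox/report.pdf"
bot_view = resolve_workspace_path(Path("/workspace"), shared_ref)
agent_view = resolve_workspace_path(Path("/agents"), shared_ref)
```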
For deployment artifacts, split runtime intent explicitly: `docker-compose.prod.yml` is a bot-only handoff contract, while `docker-compose.fullstack.yml` is the internal E2E harness that brings up platform services and shared volumes together.

**Primary recommendation:** Implement Phase 05 as a reconciliation-and-deploy hardening pass on the current Matrix stack, with Matrix Space state as source of truth and per-room `platform_chat_id` as the routing key.

## Standard Stack

### Core

| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| `matrix-nio` | 0.25.2 | Async Matrix client, Spaces, media upload/download, token login, sync loop | Already in repo; official docs confirm support for Spaces, token login, `room_put_state`, `upload`, `download`, and `sync_forever` |
| `sqlite3` / `SQLiteStore` | stdlib / repo-local | Durable bot metadata (`room_meta`, `user_meta`, routing state) | Small, local, restart-safe KV layer already used by runtime and tests |
| `PyYAML` | 6.0.3 | Agent registry / deployment config parsing | Current repo standard for `config/matrix-agents.yaml`-style artifacts |
| `httpx` | 0.28.1 | Async HTTP for auxiliary platform calls | Already used; fits async runtime and current codebase |
| Docker Compose | v2 spec; local install `v2.40.3` | Prod/fullstack topology, shared named volumes, health-gated startup | Officially supports multi-file overlays, named volumes, and `service_healthy` gating |

### Supporting

| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| `structlog` | 25.5.0 | Structured runtime logging | Use for reconciliation summaries, routing mismatches, and deploy diagnostics |
| `pydantic` | 2.13.3 | Typed config / payload validation | Use for any new deployment config or reconciliation report structures |
| `python-dotenv` | 1.2.2 | Local env loading | Keep for local and compose-driven runtime config |
| `pytest` | 9.0.3 | Test runner | Full phase verification and regression slices |
| `pytest-asyncio` | 1.3.0 | Async test execution | Required for reconciliation/runtime tests |

### Alternatives Considered

| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| `matrix-nio` | Synapse Admin / raw Matrix HTTP calls | Worse fit; repo already depends on nio abstractions and tests |
| repo-local `SQLiteStore` | Redis/Postgres | Unnecessary operational scope increase for MVP deployment |
| shared volume file flow | custom file proxy / presigned URLs | More moving parts, more auth/cleanup edge cases, no need for MVP |
| split compose files | one overloaded compose file with profiles | Harder operator handoff; less explicit prod vs internal-test intent |

**Installation:**

```bash
uv sync
```

**Version verification:** Verified on 2026-04-28 from PyPI and local environment.

| Package | Verified Version | Publish Date | Source |
|---------|------------------|--------------|--------|
| `matrix-nio` | 0.25.2 | 2024-10-04 | PyPI |
| `httpx` | 0.28.1 | 2024-12-06 | PyPI |
| `structlog` | 25.5.0 | 2025-10-27 | PyPI |
| `pydantic` | 2.13.3 | 2026-04-20 | PyPI |
| `aiohttp` | 3.13.5 | 2026-03-31 | PyPI |
| `PyYAML` | 6.0.3 | 2025-09-25 | PyPI |
| `python-dotenv` | 1.2.2 | 2026-03-01 | PyPI |
| `pytest` | 9.0.3 | 2026-04-07 | PyPI |
| `pytest-asyncio` | 1.3.0 | 2025-11-10 | PyPI |

## Architecture Patterns

### Recommended Project Structure

```text
adapter/matrix/
├── bot.py                    # startup, sync bootstrap, live callbacks
├── reconciliation.py         # new: restart recovery from Matrix state
├── files.py                  # shared-volume path building / materialization
├── routed_platform.py        # room -> agent_id + platform_chat_id routing
├── store.py                  # room_meta/user_meta helpers and counters
└── handlers/
    ├── auth.py               # Space + first room provisioning
    ├── chat.py               # !new / !archive / !rename
    └── context_commands.py   # !save / !load / !clear / !context
deploy/
├── docker-compose.prod.yml       # bot-only handoff
└── docker-compose.fullstack.yml  # internal E2E stack
```

### Pattern 1: Matrix Space State Is Canonical, SQLite Is Rebuildable

**What:** Treat Matrix Space membership and child-room state as the source of truth for room topology; use local SQLite metadata as a cached routing index that reconciliation can rebuild.
**When to use:** Startup, DB loss, stale local metadata, and any deployment where rooms may outlive the bot process.
**Example:**

```python
# Source: repo pattern from adapter/matrix/store.py + Matrix Space state
room_meta = {
    "room_type": "chat",
    "chat_id": "C7",
    "display_name": "Research",
    "matrix_user_id": "@alice:example.org",
    "space_id": "!space:example.org",
    "agent_id": "agent-1",
    "platform_chat_id": "42",
}
await set_room_meta(store, room_id, room_meta)
await chat_mgr.get_or_create(
    user_id=room_meta["matrix_user_id"],
    chat_id=room_meta["chat_id"],
    platform="matrix",
    surface_ref=room_id,
    name=room_meta["display_name"],
)
```

### Pattern 2: Per-Room `platform_chat_id` Is the Only Real Context Boundary

**What:** Route every working Matrix room to its own durable `platform_chat_id`.
**When to use:** Normal messaging, `!save`, `!load`, `!context`, `!clear`, restart restoration.
**Example:**

```python
# Source: adapter/matrix/routed_platform.py + adapter/matrix/handlers/context_commands.py
old_chat_id = room_meta["platform_chat_id"]
new_chat_id = await next_platform_chat_id(store)
await set_platform_chat_id(store, room_id, new_chat_id)

disconnect = getattr(platform, "disconnect_chat", None)
if callable(disconnect):
    await disconnect(old_chat_id)
```

### Pattern 3: `!clear` Means Chat-ID Rotation, Not Global Wipe

**What:** Implement real clear by rotating only the current room's `platform_chat_id` and disconnecting the old upstream chat session.
**When to use:** User-triggered context reset for one room.
**Example:**

```python
# Source: adapter/matrix/handlers/context_commands.py
room_id = await _resolve_room_id(event, chat_mgr)
old_chat_id = (room_meta or {}).get("platform_chat_id") or room_id
new_chat_id = await next_platform_chat_id(store)
await set_platform_chat_id(store, room_id, new_chat_id)
```

### Pattern 4: Shared-Volume File Handoff Uses Relative Workspace Paths

**What:** Persist incoming Matrix media into a room-scoped path under the shared workspace, and pass only relative paths to the agent.
**When to use:** User uploads, staged attachments, agent-emitted files.
**Example:**

```python
# Source: adapter/matrix/files.py
relative_path = (
    Path("surfaces")
    / "matrix"
    / safe_user
    / safe_room
    / "inbox"
    / f"{stamp}-{safe_name}"
)
return Attachment(
    type=attachment.type,
    url=attachment.url,
    filename=filename,
    mime_type=attachment.mime_type,
    workspace_path=relative_path.as_posix(),
)
```

### Pattern 5: Compose Split By Operational Intent

**What:** Keep one compose artifact for operator handoff and one for internal full-stack testing.
**When to use:** Deployment packaging.
**Example:**

```yaml
# docker-compose.prod.yml
services:
  matrix-bot:
    image: surfaces-bot:latest
    env_file: .env
    volumes:
      - agents:/agents

# docker-compose.fullstack.yml
services:
  matrix-bot:
    extends:
      file: docker-compose.prod.yml
      service: matrix-bot
  platform-agent: ...
volumes:
  agents:
```

### Anti-Patterns to Avoid

- **Lazy bootstrap as restart strategy:** `_bootstrap_unregistered_room()` is acceptable for first-contact repair, not as the primary restart recovery path in production.
- **Per-user context identity:** a user-level or DM-level chat id breaks Space+rooms isolation and makes `!clear` incorrect.
- **Global reset endpoint semantics:** `!clear` must not wipe other rooms or all agent state for a user.
- **Absolute attachment paths in platform payloads:** keep agent attachment references relative to its workspace contract.
- **Sleep-based service readiness:** use Compose healthchecks and dependency conditions, not shell `sleep`.

## Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| Matrix room/Space protocol | Raw custom HTTP wrappers for state events | `matrix-nio` `room_create`, `room_put_state`, `space_get_hierarchy`, `sync_forever`, `upload`, `download` | Official support already exists and repo tests are built around nio |
| Restart topology discovery | Ad hoc timeline scraping | Full-state sync plus room state / Space child reconciliation | Timeline replay is noisy and brittle; state is the stable source |
| File transfer bus | Base64 blobs or custom bot-side file API | Shared `/agents/` volume with relative `workspace_path` | Lower operational complexity and already matches upstream agent contract |
| Compose startup sequencing | Shell loops / sleeps | `healthcheck` + `depends_on: condition: service_healthy` | Official Compose behavior is deterministic and observable |
| Context reset | Deleting all SQLite rows or resetting the whole user | Rotate current room `platform_chat_id` and drop that room's live agent connection | Preserves other rooms and matches user expectation |

**Key insight:** The deceptively hard problems in this phase are already solved by the current stack: Matrix room state, nio media handling, named volumes, and service health gating. Custom alternatives add more failure modes than value.

## Common Pitfalls

### Pitfall 1: Unknown room after restart creates a duplicate working chat

**What goes wrong:** The bot treats an existing room as unregistered and provisions a fresh room/tree.
**Why it happens:** Local SQLite metadata is missing, but Matrix topology still exists.
**How to avoid:** Run reconciliation before live sync callbacks; only allow lazy bootstrap for genuinely new first-contact rooms.
**Warning signs:** New `Чат N` ("Chat N") rooms appear after restart without a matching user action.
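
The "reconcile first, bootstrap last" rule from Pitfall 1 can be sketched as a pure planning step that runs before live sync callbacks are registered. This is only an illustration under stated assumptions: `RoomAction` and `plan_reconciliation` are hypothetical names, and the three inputs (joined rooms, Space children, local metadata) are assumed to have been collected already from a full-state sync and the SQLite store.

```python
from enum import Enum


class RoomAction(Enum):
    KEEP = "keep"            # local metadata already present
    RESTORE = "restore"      # a Space lists the room as a child; rebuild its metadata
    BOOTSTRAP = "bootstrap"  # no Space claims it; treat as genuine first contact


def plan_reconciliation(
    joined_rooms: set[str],
    space_children: dict[str, set[str]],
    local_meta: dict[str, dict],
) -> dict[str, RoomAction]:
    """Decide, per joined room, what startup should do before live traffic (hypothetical)."""
    space_ids = set(space_children)  # Spaces are containers, not working chats
    claimed = set().union(*space_children.values())
    plan: dict[str, RoomAction] = {}
    for room_id in sorted(joined_rooms - space_ids):
        if room_id in local_meta:
            plan[room_id] = RoomAction.KEEP
        elif room_id in claimed:
            plan[room_id] = RoomAction.RESTORE
        else:
            plan[room_id] = RoomAction.BOOTSTRAP
    return plan


# Example: the local DB lost one room that Matrix still lists under the Space.
plan = plan_reconciliation(
    joined_rooms={"!space:example.org", "!a:example.org", "!b:example.org", "!c:example.org"},
    space_children={"!space:example.org": {"!a:example.org", "!b:example.org"}},
    local_meta={"!a:example.org": {"platform_chat_id": "42"}},
)
```

In this sketch only a `BOOTSTRAP` result would be eligible for lazy first-contact provisioning; a `RESTORE` result is exactly the case that would otherwise produce a duplicate chat tree.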
### Pitfall 2: `!clear` resets the wrong scope

**What goes wrong:** Clearing one room also clears another room, or does nothing because the upstream session key did not change.
**Why it happens:** Context is keyed by user or local `chat_id` instead of durable room-local `platform_chat_id`.
**How to avoid:** Always resolve room -> `platform_chat_id`, rotate it, and disconnect only the old upstream chat.
**Warning signs:** Two rooms share response history or `!context` reports the same platform context id.

### Pitfall 3: Space child linkage is incomplete

**What goes wrong:** Rooms exist but do not appear correctly under the user's Space.
**Why it happens:** Missing or malformed `m.space.child` state, especially missing `via` data.
**How to avoid:** Persist `space_id`, write `m.space.child` with `state_key=room_id`, and reconcile child links on startup.
**Warning signs:** Element shows the room outside the Space, or not at all in the hierarchy.

### Pitfall 4: Shared volume works locally but fails in deployment

**What goes wrong:** Agent-generated files cannot be read by the bot, or bot-downloaded files are unreadable by the agent.
**Why it happens:** Mount mismatch, wrong root (`/workspace` vs `/agents`), or container user/group permissions.
**How to avoid:** Standardize one shared root, keep relative workspace paths, and align container permissions with Compose volume configuration.
**Warning signs:** Attachment paths exist in metadata but not on disk inside the other container.

### Pitfall 5: Compose `depends_on` starts too early

**What goes wrong:** Bot starts before dependent services are actually ready.
**Why it happens:** Short-form `depends_on` only waits for container start, not health.
**How to avoid:** Use healthchecks and long-form `depends_on` with `service_healthy` in the full-stack compose file.
**Warning signs:** First requests fail after fresh `docker compose up`, then succeed on retry.
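
Pitfall 5 lends itself to an automated check. The sketch below assumes the compose file has already been parsed into a plain dictionary (for example with PyYAML's `yaml.safe_load`); the function name `ungated_dependencies` is hypothetical and is not part of the repo or of Docker tooling. It reports every dependency edge that would start on container start rather than on health.

```python
def ungated_dependencies(compose: dict) -> list[str]:
    """List 'service -> dependency' edges that do not wait for service_healthy."""
    problems: list[str] = []
    for name, service in (compose.get("services") or {}).items():
        depends_on = (service or {}).get("depends_on") or {}
        if isinstance(depends_on, list):
            # Short form: ordering on container start only, no health condition.
            problems.extend(f"{name} -> {dep}" for dep in depends_on)
            continue
        for dep, spec in depends_on.items():
            # Long form: flag any condition other than service_healthy.
            if (spec or {}).get("condition") != "service_healthy":
                problems.append(f"{name} -> {dep}")
    return problems


short_form = {
    "services": {
        "matrix-bot": {"depends_on": ["platform-agent"]},
        "platform-agent": {},
    }
}
long_form = {
    "services": {
        "matrix-bot": {"depends_on": {"platform-agent": {"condition": "service_healthy"}}},
        "platform-agent": {},
    }
}
short_form_problems = ungated_dependencies(short_form)
long_form_problems = ungated_dependencies(long_form)
```

A check of this shape could sit next to the `docker compose ... config` smoke command, since `config` validates syntax but does not by itself enforce health gating.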
## Code Examples

Verified patterns from official sources and current repo:

### Create a Space with `matrix-nio`

```python
# Source: matrix-nio API docs
from nio import RoomVisibility

space_resp = await client.room_create(
    name=f"Lambda — {display_name}",
    visibility=RoomVisibility.private,
    invite=[matrix_user_id],
    space=True,
)
```

### Add a child room to a Space

```python
# Source: current repo pattern + Matrix spec
await client.room_put_state(
    room_id=space_id,
    event_type="m.space.child",
    content={"via": [homeserver]},
    state_key=chat_room_id,
)
```

### Persist room-scoped attachment paths

```python
# Source: adapter/matrix/files.py
relative_path, absolute_path = build_workspace_attachment_path(
    workspace_root=workspace_root,
    matrix_user_id=matrix_user_id,
    room_id=room_id,
    filename=filename,
)
absolute_path.parent.mkdir(parents=True, exist_ok=True)
absolute_path.write_bytes(body)
```

### Health-gated startup in Compose

```yaml
# Source: Docker Compose docs
services:
  matrix-bot:
    depends_on:
      platform-agent:
        condition: service_healthy
  platform-agent:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 10s
      timeout: 5s
      retries: 5
```

## State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| Per-user or single shared platform context | Per-room `platform_chat_id` | Repo direction corrected on 2026-04-28 | Enables true room isolation and correct `!clear` |
| Single overloaded compose runtime | Separate prod handoff and full-stack E2E compose files | Current Phase 05 scope | Reduces operator ambiguity |
| Unknown room auto-bootstrap as recovery | Explicit reconciliation before live traffic | Recommended for Phase 05 | Prevents duplicate chat trees after restart |
| File payloads treated as transport concern | Shared-volume relative path contract | Already present in repo | Keeps bot/platform contract simple and durable |

**Deprecated/outdated:**

- Single-chat / DM-first deployment direction: explicitly discarded in Phase 05 reset.
- Global reset semantics for Matrix context commands: does not match Space+rooms UX.
- Using only local store as truth for restart recovery: unsafe once deployed rooms outlive the process.

## Open Questions

1. **What exact Matrix state should reconciliation trust for `chat_id` labels?**
   - What we know: `room_meta.chat_id` is local and not derivable from Matrix protocol by default.
   - What's unclear: whether chat labels should be reconstructed from room names, stored custom state, or cached local metadata when present.
   - Recommendation: persist `chat_id` in local SQLite, but make reconciliation able to regenerate a stable fallback label and avoid blocking routing if the label is missing.

2. **What readiness probe exists for `platform-agent` in the full-stack compose?**
   - What we know: Compose health gating is the right pattern.
   - What's unclear: whether upstream agent image already exposes a reliable health endpoint.
   - Recommendation: inspect upstream container and add a bot-facing probe before finalizing `docker-compose.fullstack.yml`.

3. **Should prod mount root remain `/workspace` or be renamed to `/agents` externally?**
   - What we know: current code defaults to `SURFACES_WORKSPACE_DIR=/workspace`, while deployment docs describe shared `/agents/`.
   - What's unclear: whether external handoff wants a host path named `/agents` while containers still use `/workspace`.
   - Recommendation: keep one in-container canonical path and let host-side naming vary only in Compose mounts.
## Environment Availability

| Dependency | Required By | Available | Version | Fallback |
|------------|------------|-----------|---------|----------|
| Python | bot runtime | ✓ | 3.14.3 | n/a |
| `uv` | dependency install | ✓ | 0.9.30 | `pip` |
| `pytest` | validation | ✓ | 9.0.2 installed | `python -m pytest` |
| Docker Engine | deployment packaging / E2E compose | ✓ | 29.1.3 | none |
| Docker Compose | split runtime orchestration | ✓ | 2.40.3 | none |

**Missing dependencies with no fallback:**

- None

**Missing dependencies with fallback:**

- None

## Validation Architecture

### Test Framework

| Property | Value |
|----------|-------|
| Framework | `pytest` + `pytest-asyncio` |
| Config file | `pyproject.toml` |
| Quick run command | `pytest tests/adapter/matrix/test_restart_persistence.py -v` |
| Full suite command | `pytest tests/ -v` |

### Phase Requirements → Test Map

| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|--------|----------|-----------|-------------------|-------------|
| PH05-01 | Space+rooms onboarding remains primary UX | integration | `pytest tests/adapter/matrix/test_invite_space.py tests/adapter/matrix/test_chat_space.py -v` | ✅ |
| PH05-02 | Per-room `platform_chat_id` isolates routing and powers real clear | integration | `pytest tests/adapter/matrix/test_routed_platform.py tests/adapter/matrix/test_context_commands.py -v` | ✅ |
| PH05-03 | Restart reconciliation restores routing metadata | integration | `pytest tests/adapter/matrix/test_restart_persistence.py -v` | ❌ new reconciliation tests needed |
| PH05-04 | Shared-volume file transfer is room-safe | integration | `pytest tests/adapter/matrix/test_files.py tests/platform/test_real.py -v` | ✅ partial |
| PH05-05 | Split prod/fullstack compose artifacts stay coherent | smoke | `docker compose -f docker-compose.prod.yml config && docker compose -f docker-compose.fullstack.yml config` | ❌ Wave 0 |

### Sampling Rate

- **Per task commit:** `pytest tests/adapter/matrix/test_restart_persistence.py -v`
- **Per wave merge:** `pytest tests/adapter/matrix/ -v`
- **Phase gate:** `pytest tests/ -v` plus both compose files passing `docker compose ... config`

### Wave 0 Gaps

- [ ] `tests/adapter/matrix/test_reconciliation.py`: startup recovery of user/room metadata from Matrix state
- [ ] `tests/adapter/matrix/test_context_commands.py` additions: `!clear` command contract and room-local rotation semantics
- [ ] `tests/adapter/matrix/test_compose_artifacts.py` or equivalent smoke command documentation: split compose validation
- [ ] `tests/adapter/matrix/test_files.py` additions: cross-room attachment path isolation and shared-root consistency

## Sources

### Primary (HIGH confidence)

- Local repo code and tests:
  - `adapter/matrix/bot.py`
  - `adapter/matrix/store.py`
  - `adapter/matrix/files.py`
  - `adapter/matrix/routed_platform.py`
  - `adapter/matrix/handlers/auth.py`
  - `adapter/matrix/handlers/context_commands.py`
  - `tests/adapter/matrix/test_restart_persistence.py`
  - `tests/adapter/matrix/test_files.py`
  - `tests/platform/test_real.py`
- Matrix-nio API docs: https://matrix-nio.readthedocs.io/en/latest/nio.html
- Matrix-nio async client docs: https://matrix-nio.readthedocs.io/en/latest/_modules/nio/client/async_client.html
- Matrix-nio PyPI release page: https://pypi.org/project/matrix-nio/
- Matrix spec Spaces / hierarchy: https://spec.matrix.org/v1.18/server-server-api/
- Matrix spec changelog note on `via` for `m.space.child`: https://spec.matrix.org/v1.16/changelog/v1.9/
- Docker Compose CLI reference: https://docs.docker.com/reference/cli/docker/compose/
- Docker Compose services reference: https://docs.docker.com/reference/compose-file/services/

### Secondary (MEDIUM confidence)

- `docs/deploy-architecture.md`: repo-local deployment contract clarified on 2026-04-27
- `docs/research/matrix-spaces.md`: prior internal research aligned with spec, but not treated as primary
- `README.md` runtime notes for current Matrix backend and shared workspace behavior

### Tertiary (LOW confidence)

- None

## Metadata

**Confidence breakdown:**

- Standard stack: HIGH - current repo stack verified against official docs and package registries
- Architecture: HIGH - recommendations align with existing runtime boundaries and official Matrix / Compose behavior
- Pitfalls: HIGH - derived from current code paths, existing tests, and official protocol/runtime semantics

**Research date:** 2026-04-28
**Valid until:** 2026-05-28