surfaces/.planning/phases/05-mvp-deployment/05-RESEARCH.md

Phase 05: MVP Deployment - Research

Researched: 2026-04-28
Domain: Matrix bot production deployment, restart reconciliation, per-room context isolation, shared-volume file transfer
Confidence: HIGH

Project Constraints (from CLAUDE.md)

  • All platform calls must stay behind platform/interface.py (PlatformClient protocol).
  • Current platform implementation is a mock / replaceable adapter; architecture must not depend on unfinished upstream SDK.
  • Keep architecture decisions inside this repo and document contracts locally.
  • Prefer async, adapter/core separation, and do not bypass the existing core/ and adapter/ layering.
  • Use uv sync for dependency installation.
  • Use pytest tests/ -v and adapter-specific pytest slices for verification.
  • Never commit .env.
  • Dependency order remains fixed: core/ first, platform/ second, adapters after that.

Summary

Phase 05 should not introduce a new stack. The established implementation path is to harden the existing matrix-nio + SQLiteStore + RoutedPlatformClient + shared workspace volume design so production restart behavior matches the current Space+rooms UX. The main architectural rule is: Matrix topology is authoritative for room existence, while local SQLite metadata is authoritative only after reconciliation has rebuilt it.

The production-safe approach is to bind every working Matrix room to its own durable platform_chat_id, rotate only that identifier for !clear, and make restart recovery idempotent. Reconciliation should rebuild user_meta, room_meta, ChatManager entries, and missing routing fields from Matrix Space membership and room state before sync_forever() begins processing live traffic. Unknown rooms must be reconciled first, not silently converted into new chats.

For files, keep the current shared-volume contract and relative workspace_path transport. Do not build HTTP file shims or embed file payloads in bot-side state. For deployment artifacts, split runtime intent explicitly: docker-compose.prod.yml is a bot-only handoff contract, while docker-compose.fullstack.yml is the internal E2E harness that brings up platform services and shared volumes together.

Primary recommendation: Implement Phase 05 as a reconciliation-and-deploy hardening pass on the current Matrix stack, with Matrix Space state as source of truth and per-room platform_chat_id as the routing key.
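
The startup ordering described above can be sketched as follows. This is a minimal illustration, not the repo's bot.py: FakeClient is a hypothetical stand-in that only mirrors the names of the matrix-nio methods (sync, add_event_callback, sync_forever) with simplified signatures, and reconcile and on_message are placeholder hooks.

```python
import asyncio
from typing import Awaitable, Callable


class FakeClient:
    """Hypothetical stand-in for the Matrix client; records call order only."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    async def sync(self, full_state: bool = False) -> None:
        self.calls.append(f"sync(full_state={full_state})")

    def add_event_callback(self, callback: Callable[..., Awaitable[None]]) -> None:
        self.calls.append("add_event_callback")

    async def sync_forever(self) -> None:
        self.calls.append("sync_forever")


async def reconcile(client: FakeClient) -> None:
    # Placeholder: rebuild user_meta / room_meta from Space state here.
    client.calls.append("reconcile")


async def on_message(room, event) -> None:
    # Placeholder live handler; never invoked in this sketch.
    pass


async def start_bot(client: FakeClient) -> None:
    # 1. Pull full room state once, with no live handlers attached yet.
    await client.sync(full_state=True)
    # 2. Rebuild local metadata from that state.
    await reconcile(client)
    # 3. Only now attach live handlers and enter the long-running loop.
    client.add_event_callback(on_message)
    await client.sync_forever()


client = FakeClient()
asyncio.run(start_bot(client))
print(client.calls)
```

The point of the ordering is that no live callback can observe a room before reconciliation has had a chance to restore its routing metadata.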

Standard Stack

Core

| Library | Version | Purpose | Why Standard |
| --- | --- | --- | --- |
| matrix-nio | 0.25.2 | Async Matrix client, Spaces, media upload/download, token login, sync loop | Already in repo; official docs confirm support for Spaces, token login, room_put_state, upload, download, and sync_forever |
| sqlite3 / SQLiteStore | stdlib / repo-local | Durable bot metadata (room_meta, user_meta, routing state) | Small, local, restart-safe KV layer already used by runtime and tests |
| PyYAML | 6.0.3 | Agent registry / deployment config parsing | Current repo standard for config/matrix-agents.yaml-style artifacts |
| httpx | 0.28.1 | Async HTTP for auxiliary platform calls | Already used; fits async runtime and current codebase |
| Docker Compose | v2 spec; local install v2.40.3 | Prod/fullstack topology, shared named volumes, health-gated startup | Officially supports multi-file overlays, named volumes, and service_healthy gating |

Supporting

| Library | Version | Purpose | When to Use |
| --- | --- | --- | --- |
| structlog | 25.5.0 | Structured runtime logging | Use for reconciliation summaries, routing mismatches, and deploy diagnostics |
| pydantic | 2.13.3 | Typed config / payload validation | Use for any new deployment config or reconciliation report structures |
| python-dotenv | 1.2.2 | Local env loading | Keep for local and compose-driven runtime config |
| pytest | 9.0.3 | Test runner | Full phase verification and regression slices |
| pytest-asyncio | 1.3.0 | Async test execution | Required for reconciliation/runtime tests |

Alternatives Considered

| Instead of | Could Use | Tradeoff |
| --- | --- | --- |
| matrix-nio | Synapse Admin / raw Matrix HTTP calls | Worse fit; repo already depends on nio abstractions and tests |
| repo-local SQLiteStore | Redis/Postgres | Unnecessary operational scope increase for MVP deployment |
| shared volume file flow | custom file proxy / presigned URLs | More moving parts, more auth/cleanup edge cases, no need for MVP |
| split compose files | one overloaded compose file with profiles | Harder operator handoff; less explicit prod vs internal-test intent |

Installation:

uv sync

Version verification: Verified on 2026-04-28 from PyPI and local environment.

| Package | Verified Version | Publish Date | Source |
| --- | --- | --- | --- |
| matrix-nio | 0.25.2 | 2024-10-04 | PyPI |
| httpx | 0.28.1 | 2024-12-06 | PyPI |
| structlog | 25.5.0 | 2025-10-27 | PyPI |
| pydantic | 2.13.3 | 2026-04-20 | PyPI |
| aiohttp | 3.13.5 | 2026-03-31 | PyPI |
| PyYAML | 6.0.3 | 2025-09-25 | PyPI |
| python-dotenv | 1.2.2 | 2026-03-01 | PyPI |
| pytest | 9.0.3 | 2026-04-07 | PyPI |
| pytest-asyncio | 1.3.0 | 2025-11-10 | PyPI |

Architecture Patterns

adapter/matrix/
├── bot.py                    # startup, sync bootstrap, live callbacks
├── reconciliation.py         # new: restart recovery from Matrix state
├── files.py                  # shared-volume path building / materialization
├── routed_platform.py        # room -> agent_id + platform_chat_id routing
├── store.py                  # room_meta/user_meta helpers and counters
└── handlers/
    ├── auth.py               # Space + first room provisioning
    ├── chat.py               # !new / !archive / !rename
    └── context_commands.py   # !save / !load / !clear / !context

deploy/
├── docker-compose.prod.yml       # bot-only handoff
└── docker-compose.fullstack.yml  # internal E2E stack

Pattern 1: Matrix Space State Is Canonical, SQLite Is Rebuildable

What: Treat Matrix Space membership and child-room state as the source of truth for room topology; use local SQLite metadata as a cached routing index that reconciliation can rebuild.
When to use: Startup, DB loss, stale local metadata, and any deployment where rooms may outlive the bot process.
Example:

# Source: repo pattern from adapter/matrix/store.py + Matrix Space state
room_meta = {
    "room_type": "chat",
    "chat_id": "C7",
    "display_name": "Research",
    "matrix_user_id": "@alice:example.org",
    "space_id": "!space:example.org",
    "agent_id": "agent-1",
    "platform_chat_id": "42",
}
await set_room_meta(store, room_id, room_meta)
await chat_mgr.get_or_create(
    user_id=room_meta["matrix_user_id"],
    chat_id=room_meta["chat_id"],
    platform="matrix",
    surface_ref=room_id,
    name=room_meta["display_name"],
)
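
To make the "rebuildable" property concrete, here is a minimal, self-contained sketch of an idempotent rebuild. SpaceChild and rebuild_room_meta are hypothetical names introduced for illustration; the real reconciliation would read children from Matrix Space state and write through set_room_meta. The field names follow the room_meta example above.

```python
import itertools
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SpaceChild:
    """Hypothetical snapshot of one child room read from Matrix Space state."""
    room_id: str
    space_id: str
    owner: str
    name: str


def rebuild_room_meta(
    children: list[SpaceChild],
    existing: dict[str, dict],
    next_chat_id: Callable[[], str],
) -> dict[str, dict]:
    """Return room_meta for every Space child, reusing stored values when present."""
    rebuilt: dict[str, dict] = {}
    for child in children:
        meta = dict(existing.get(child.room_id, {}))
        meta.setdefault("room_type", "chat")
        # Matrix state wins for topology fields.
        meta["space_id"] = child.space_id
        meta["matrix_user_id"] = child.owner
        meta["display_name"] = child.name
        # Routing fields are only filled in when missing, never rotated here.
        if "platform_chat_id" not in meta:
            meta["platform_chat_id"] = next_chat_id()
        rebuilt[child.room_id] = meta
    return rebuilt


counter = itertools.count(1)
allocate = lambda: str(next(counter))
children = [
    SpaceChild("!a:example.org", "!space:example.org", "@alice:example.org", "Research"),
    SpaceChild("!b:example.org", "!space:example.org", "@alice:example.org", "Notes"),
]
first = rebuild_room_meta(children, {}, allocate)
second = rebuild_room_meta(children, first, allocate)  # second run changes nothing
```

Running the rebuild twice against its own output is a cheap way to check idempotency in a test.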

Pattern 2: Per-Room platform_chat_id Is the Only Real Context Boundary

What: Route every working Matrix room to its own durable platform_chat_id.
When to use: Normal messaging, !save, !load, !context, !clear, restart restoration.
Example:

# Source: adapter/matrix/routed_platform.py + adapter/matrix/handlers/context_commands.py
old_chat_id = room_meta["platform_chat_id"]
new_chat_id = await next_platform_chat_id(store)
await set_platform_chat_id(store, room_id, new_chat_id)

disconnect = getattr(platform, "disconnect_chat", None)
if callable(disconnect):
    await disconnect(old_chat_id)

Pattern 3: !clear Means Chat-ID Rotation, Not Global Wipe

What: Implement real clear by rotating only the current room's platform_chat_id and disconnecting the old upstream chat session.
When to use: User-triggered context reset for one room.
Example:

# Source: adapter/matrix/handlers/context_commands.py
room_id = await _resolve_room_id(event, chat_mgr)
old_chat_id = (room_meta or {}).get("platform_chat_id") or room_id
new_chat_id = await next_platform_chat_id(store)
await set_platform_chat_id(store, room_id, new_chat_id)
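
The room-local scope of the rotation can be illustrated with an in-memory sketch. MemoryStore and clear_room are hypothetical stand-ins for SQLiteStore and the !clear handler; the counter-based id allocation is an assumption, not the repo's next_platform_chat_id implementation.

```python
import itertools


class MemoryStore:
    """Hypothetical in-memory replacement for the SQLite-backed store."""

    def __init__(self) -> None:
        self.room_meta: dict[str, dict] = {}
        self._counter = itertools.count(1)

    def next_platform_chat_id(self) -> str:
        return str(next(self._counter))


def clear_room(store: MemoryStore, room_id: str, disconnected: list[str]) -> tuple[str, str]:
    """Rotate one room's platform_chat_id and record the old id for disconnect."""
    meta = store.room_meta[room_id]
    old_chat_id = meta["platform_chat_id"]
    meta["platform_chat_id"] = store.next_platform_chat_id()
    # Only the old upstream chat of this room is dropped.
    disconnected.append(old_chat_id)
    return old_chat_id, meta["platform_chat_id"]


store = MemoryStore()
for room_id in ("!a:example.org", "!b:example.org"):
    store.room_meta[room_id] = {"platform_chat_id": store.next_platform_chat_id()}

disconnected: list[str] = []
old_id, new_id = clear_room(store, "!a:example.org", disconnected)
```

After the call, the second room still routes to its original id, which is the property a !clear regression test would want to pin down.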

Pattern 4: Shared-Volume File Handoff Uses Relative Workspace Paths

What: Persist incoming Matrix media into a room-scoped path under the shared workspace, and pass only relative paths to the agent.
When to use: User uploads, staged attachments, agent-emitted files.
Example:

# Source: adapter/matrix/files.py
relative_path = (
    Path("surfaces") / "matrix" / safe_user / safe_room / "inbox" / f"{stamp}-{safe_name}"
)
return Attachment(
    type=attachment.type,
    url=attachment.url,
    filename=filename,
    mime_type=attachment.mime_type,
    workspace_path=relative_path.as_posix(),
)
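
The excerpt above relies on safe_user, safe_room, and safe_name already being sanitized. A minimal sketch of one way to derive them follows; safe_segment and sketch_attachment_path are hypothetical helpers, and the character whitelist is an assumption rather than the rule used in adapter/matrix/files.py.

```python
import re
from pathlib import PurePosixPath

# Assumption: anything outside this whitelist is collapsed to "_".
_UNSAFE = re.compile(r"[^A-Za-z0-9._-]+")


def safe_segment(value: str) -> str:
    """Turn an arbitrary string into a single harmless path segment."""
    cleaned = _UNSAFE.sub("_", value).strip("._")
    return cleaned or "unnamed"


def sketch_attachment_path(
    matrix_user_id: str, room_id: str, filename: str, stamp: str
) -> PurePosixPath:
    """Build a room-scoped path that is always relative to the workspace root."""
    return PurePosixPath(
        "surfaces",
        "matrix",
        safe_segment(matrix_user_id),
        safe_segment(room_id),
        "inbox",
        f"{stamp}-{safe_segment(filename)}",
    )


path = sketch_attachment_path(
    "@alice:example.org", "!room:example.org", "../../etc/passwd", "20260428T120000"
)
other = sketch_attachment_path(
    "@alice:example.org", "!other:example.org", "report.pdf", "20260428T120000"
)
print(path.as_posix())
```

Because every segment is sanitized independently, a hostile filename cannot introduce extra path components, and two rooms never share an inbox directory.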

Pattern 5: Compose Split By Operational Intent

What: Keep one compose artifact for operator handoff and one for internal full-stack testing.
When to use: Deployment packaging.
Example:

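# docker-compose.prod.yml
services:
  matrix-bot:
    image: surfaces-bot:latest
    env_file: .env
    volumes:
      - agents:/agents
# Declare the named volume here too (or mark it external);
# otherwise `docker compose config` rejects the undefined volume.
volumes:
  agents: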

# docker-compose.fullstack.yml
services:
  matrix-bot:
    extends:
      file: docker-compose.prod.yml
      service: matrix-bot
  platform-agent:
    ...
volumes:
  agents:

Anti-Patterns to Avoid

  • Lazy bootstrap as restart strategy: _bootstrap_unregistered_room() is acceptable for first-contact repair, not as the primary restart recovery path in production.
  • Per-user context identity: a user-level or DM-level chat id breaks Space+rooms isolation and makes !clear incorrect.
  • Global reset endpoint semantics: !clear must not wipe other rooms or all agent state for a user.
  • Absolute attachment paths in platform payloads: keep agent attachment references relative to its workspace contract.
  • Sleep-based service readiness: use Compose healthchecks and dependency conditions, not shell sleep.

Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
| --- | --- | --- | --- |
| Matrix room/Space protocol | Raw custom HTTP wrappers for state events | matrix-nio room_create, room_put_state, space_get_hierarchy, sync_forever, upload, download | Official support already exists and repo tests are built around nio |
| Restart topology discovery | Ad hoc timeline scraping | Full-state sync plus room state / Space child reconciliation | Timeline replay is noisy and brittle; state is the stable source |
| File transfer bus | Base64 blobs or custom bot-side file API | Shared /agents/ volume with relative workspace_path | Lower operational complexity and already matches upstream agent contract |
| Compose startup sequencing | Shell loops / sleeps | healthcheck + depends_on: condition: service_healthy | Official Compose behavior is deterministic and observable |
| Context reset | Deleting all SQLite rows or resetting the whole user | Rotate current room platform_chat_id and drop that room's live agent connection | Preserves other rooms and matches user expectation |

Key insight: The deceptively hard problems in this phase are already solved by the current stack: Matrix room state, nio media handling, named volumes, and service health gating. Custom alternatives add more failure modes than value.

Common Pitfalls

Pitfall 1: Unknown room after restart creates a duplicate working chat

What goes wrong: The bot treats an existing room as unregistered and provisions a fresh room/tree.
Why it happens: Local SQLite metadata is missing, but Matrix topology still exists.
How to avoid: Run reconciliation before live sync callbacks; only allow lazy bootstrap for genuinely new first-contact rooms.
Warning signs: New "Чат N" ("Chat N") rooms appear after restart without a matching user action.
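
The decision this pitfall calls for can be written as a small pure function. classify_room and its three labels are hypothetical names used only for this sketch; the inputs are assumed to be the set of rooms with local room_meta and the set of child rooms found in Matrix Space state.

```python
def classify_room(room_id: str, known_rooms: set[str], space_children: set[str]) -> str:
    """Decide how to treat a room that just produced an event."""
    if room_id in known_rooms:
        return "route"        # local metadata exists, use it
    if room_id in space_children:
        return "reconcile"    # Matrix knows the room, the local store lost it
    return "bootstrap"        # genuinely new first-contact room


known = {"!a:example.org"}
children = {"!a:example.org", "!b:example.org"}
decisions = {
    room: classify_room(room, known, children)
    for room in ("!a:example.org", "!b:example.org", "!c:example.org")
}
```

Only the last branch may provision anything new; the middle branch restores metadata for a room that already exists.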

Pitfall 2: !clear resets the wrong scope

What goes wrong: Clearing one room also clears another room, or does nothing because the upstream session key did not change.
Why it happens: Context is keyed by user or local chat_id instead of durable room-local platform_chat_id.
How to avoid: Always resolve room -> platform_chat_id, rotate it, and disconnect only the old upstream chat.
Warning signs: Two rooms share response history or !context reports the same platform context id.

Pitfall 3: Space child linkage is incomplete

What goes wrong: Rooms exist but do not appear correctly under the user's Space.
Why it happens: Missing or malformed m.space.child state, especially missing via data.
How to avoid: Persist space_id, write m.space.child with state_key=room_id, and reconcile child links on startup.
Warning signs: Element shows the room outside the Space, or not at all in the hierarchy.
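
A startup check for this pitfall could look like the sketch below. It assumes the Space's m.space.child events have already been collected into a mapping from state_key (the child room id) to event content; find_broken_children and is_valid_space_child are hypothetical helper names.

```python
def is_valid_space_child(content: dict) -> bool:
    """A child link only counts when `via` is a non-empty list of server names."""
    via = content.get("via")
    return (
        isinstance(via, list)
        and len(via) > 0
        and all(isinstance(server, str) and server for server in via)
    )


def find_broken_children(
    space_state: dict[str, dict], expected_rooms: set[str]
) -> dict[str, str]:
    """Report rooms that should be Space children but are missing or malformed."""
    problems: dict[str, str] = {}
    for room_id in sorted(expected_rooms):
        content = space_state.get(room_id)
        if content is None:
            problems[room_id] = "missing m.space.child event"
        elif not is_valid_space_child(content):
            problems[room_id] = "empty or malformed via"
    return problems


space_state = {
    "!a:example.org": {"via": ["example.org"]},
    "!b:example.org": {},  # link was written without via data
}
expected = {"!a:example.org", "!b:example.org", "!c:example.org"}
problems = find_broken_children(space_state, expected)
```

Each reported room can then be repaired by rewriting its m.space.child state with a proper via list.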

Pitfall 4: Shared volume works locally but fails in deployment

What goes wrong: Agent-generated files cannot be read by the bot, or bot-downloaded files are unreadable by the agent.
Why it happens: Mount mismatch, wrong root (/workspace vs /agents), or container user/group permissions.
How to avoid: Standardize one shared root, keep relative workspace paths, and align container permissions with Compose volume configuration.
Warning signs: Attachment paths exist in metadata but not on disk inside the other container.
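
On the consuming side, each container can resolve the relative workspace_path against its own mount root. The sketch below shows one defensive way to do that; resolve_workspace_path is a hypothetical helper, and the /workspace and /agents roots in the usage lines are taken from the mismatch described above, not from a confirmed deployment layout.

```python
from pathlib import Path, PurePosixPath


def resolve_workspace_path(workspace_root: Path, workspace_path: str) -> Path:
    """Join a relative workspace_path onto this container's mount root."""
    relative = PurePosixPath(workspace_path)
    if not relative.parts:
        raise ValueError("workspace_path is empty")
    if relative.is_absolute() or ".." in relative.parts:
        raise ValueError(f"workspace_path must stay inside the root: {workspace_path!r}")
    return workspace_root.joinpath(*relative.parts)


rel = "surfaces/matrix/alice_example.org/room_example.org/inbox/f.txt"
# The same relative path resolves under whichever root a container mounts.
bot_side = resolve_workspace_path(Path("/workspace"), rel)
agent_side = resolve_workspace_path(Path("/agents"), rel)
```

Checking that the resolved file actually exists on both sides is a quick way to detect a mount mismatch during a full-stack smoke run.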

Pitfall 5: Compose depends_on starts too early

What goes wrong: Bot starts before dependent services are actually ready.
Why it happens: Short-form depends_on only waits for container start, not health.
How to avoid: Use healthchecks and long-form depends_on with service_healthy in the full-stack compose file.
Warning signs: First requests fail after fresh docker compose up, then succeed on retry.

Code Examples

Verified patterns from official sources and current repo:

Create a Space with matrix-nio

# Source: matrix-nio API docs
space_resp = await client.room_create(
    name=f"Lambda — {display_name}",
    visibility=RoomVisibility.private,
    invite=[matrix_user_id],
    space=True,
)

Add a child room to a Space

# Source: current repo pattern + Matrix spec
await client.room_put_state(
    room_id=space_id,
    event_type="m.space.child",
    content={"via": [homeserver]},
    state_key=chat_room_id,
)

Persist room-scoped attachment paths

# Source: adapter/matrix/files.py
relative_path, absolute_path = build_workspace_attachment_path(
    workspace_root=workspace_root,
    matrix_user_id=matrix_user_id,
    room_id=room_id,
    filename=filename,
)
absolute_path.parent.mkdir(parents=True, exist_ok=True)
absolute_path.write_bytes(body)

Health-gated startup in Compose

# Source: Docker Compose docs
services:
  matrix-bot:
    depends_on:
      platform-agent:
        condition: service_healthy

  platform-agent:
    healthcheck:
      # Assumes curl is installed in the image and the agent serves /health (see Open Questions)
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 10s
      timeout: 5s
      retries: 5

State of the Art

| Old Approach | Current Approach | When Changed | Impact |
| --- | --- | --- | --- |
| Per-user or single shared platform context | Per-room platform_chat_id | Repo direction corrected on 2026-04-28 | Enables true room isolation and correct !clear |
| Single overloaded compose runtime | Separate prod handoff and full-stack E2E compose files | Current Phase 05 scope | Reduces operator ambiguity |
| Unknown room auto-bootstrap as recovery | Explicit reconciliation before live traffic | Recommended for Phase 05 | Prevents duplicate chat trees after restart |
| File payloads treated as transport concern | Shared-volume relative path contract | Already present in repo | Keeps bot/platform contract simple and durable |

Deprecated/outdated:

  • Single-chat / DM-first deployment direction: explicitly discarded in Phase 05 reset.
  • Global reset semantics for Matrix context commands: does not match Space+rooms UX.
  • Using only local store as truth for restart recovery: unsafe once deployed rooms outlive the process.

Open Questions

  1. What exact Matrix state should reconciliation trust for chat_id labels?

    • What we know: room_meta.chat_id is local and not derivable from Matrix protocol by default.
    • What's unclear: whether chat labels should be reconstructed from room names, stored custom state, or cached local metadata when present.
    • Recommendation: persist chat_id in local SQLite, but make reconciliation able to regenerate a stable fallback label and avoid blocking routing if the label is missing.
  2. What readiness probe exists for platform-agent in the full-stack compose?

    • What we know: Compose health gating is the right pattern.
    • What's unclear: whether upstream agent image already exposes a reliable health endpoint.
    • Recommendation: inspect upstream container and add a bot-facing probe before finalizing docker-compose.fullstack.yml.
  3. Should prod mount root remain /workspace or be renamed to /agents externally?

    • What we know: current code defaults to SURFACES_WORKSPACE_DIR=/workspace, while deployment docs describe shared /agents/.
    • What's unclear: whether external handoff wants a host path named /agents while containers still use /workspace.
    • Recommendation: keep one in-container canonical path and let host-side naming vary only in Compose mounts.
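
For the first open question, one possible shape of a "stable fallback label" is a short digest of the room id, so that the same room always regenerates the same label even when SQLite is empty. The R- prefix and the 8-character length are assumptions made for this sketch only; fallback_chat_label is a hypothetical helper.

```python
import hashlib


def fallback_chat_label(room_id: str) -> str:
    """Derive a deterministic placeholder chat_id from the Matrix room id."""
    digest = hashlib.sha256(room_id.encode("utf-8")).hexdigest()
    return f"R-{digest[:8]}"


label_a = fallback_chat_label("!a:example.org")
label_a_again = fallback_chat_label("!a:example.org")
label_b = fallback_chat_label("!b:example.org")
```

A label like this is only a placeholder: routing keeps using platform_chat_id, so a missing or regenerated label never blocks message delivery.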

Environment Availability

| Dependency | Required By | Available | Version | Fallback |
| --- | --- | --- | --- | --- |
| Python | bot runtime | | 3.14.3 | |
| uv | dependency install | | 0.9.30 | pip |
| pytest | validation | | 9.0.2 installed | python -m pytest |
| Docker Engine | deployment packaging / E2E compose | | 29.1.3 | none |
| Docker Compose | split runtime orchestration | | 2.40.3 | none |

Missing dependencies with no fallback:

  • None

Missing dependencies with fallback:

  • None

Validation Architecture

Test Framework

| Property | Value |
| --- | --- |
| Framework | pytest + pytest-asyncio |
| Config file | pyproject.toml |
| Quick run command | pytest tests/adapter/matrix/test_restart_persistence.py -v |
| Full suite command | pytest tests/ -v |

Phase Requirements → Test Map

| Req ID | Behavior | Test Type | Automated Command | File Exists? |
| --- | --- | --- | --- | --- |
| PH05-01 | Space+rooms onboarding remains primary UX | integration | pytest tests/adapter/matrix/test_invite_space.py tests/adapter/matrix/test_chat_space.py -v | |
| PH05-02 | Per-room platform_chat_id isolates routing and powers real clear | integration | pytest tests/adapter/matrix/test_routed_platform.py tests/adapter/matrix/test_context_commands.py -v | |
| PH05-03 | Restart reconciliation restores routing metadata | integration | pytest tests/adapter/matrix/test_restart_persistence.py -v | new reconciliation tests needed |
| PH05-04 | Shared-volume file transfer is room-safe | integration | pytest tests/adapter/matrix/test_files.py tests/platform/test_real.py -v | partial |
| PH05-05 | Split prod/fullstack compose artifacts stay coherent | smoke | docker compose -f docker-compose.prod.yml config && docker compose -f docker-compose.fullstack.yml config | Wave 0 |

Sampling Rate

  • Per task commit: pytest tests/adapter/matrix/test_restart_persistence.py -v
  • Per wave merge: pytest tests/adapter/matrix/ -v
  • Phase gate: pytest tests/ -v plus both compose files passing docker compose ... config

Wave 0 Gaps

  • tests/adapter/matrix/test_reconciliation.py — startup recovery of user/room metadata from Matrix state
  • tests/adapter/matrix/test_context_commands.py additions — !clear command contract and room-local rotation semantics
  • tests/adapter/matrix/test_compose_artifacts.py or equivalent smoke command documentation — split compose validation
  • tests/adapter/matrix/test_files.py additions — cross-room attachment path isolation and shared-root consistency

Sources

Primary (HIGH confidence)

Secondary (MEDIUM confidence)

  • docs/deploy-architecture.md — repo-local deployment contract clarified on 2026-04-27
  • docs/research/matrix-spaces.md — prior internal research aligned with spec, but not treated as primary
  • README.md runtime notes for current Matrix backend and shared workspace behavior

Tertiary (LOW confidence)

  • None

Metadata

Confidence breakdown:

  • Standard stack: HIGH - current repo stack verified against official docs and package registries
  • Architecture: HIGH - recommendations align with existing runtime boundaries and official Matrix / Compose behavior
  • Pitfalls: HIGH - derived from current code paths, existing tests, and official protocol/runtime semantics

Research date: 2026-04-28
Valid until: 2026-05-28