Phase 05: MVP Deployment - Research
Researched: 2026-04-28
Domain: Matrix bot production deployment, restart reconciliation, per-room context isolation, shared-volume file transfer
Confidence: HIGH
Project Constraints (from CLAUDE.md)
- All platform calls must stay behind platform/interface.py (PlatformClient protocol).
- Current platform implementation is a mock / replaceable adapter; architecture must not depend on unfinished upstream SDK.
- Keep architecture decisions inside this repo and document contracts locally.
- Prefer async, adapter/core separation, and do not bypass the existing core/ and adapter/ layering.
- Use uv sync for dependency installation.
- Use pytest tests/ -v and adapter-specific pytest slices for verification.
- Never commit .env.
- Dependency order remains fixed: core/ first, platform/ second, adapters after that.
Summary
Phase 05 should not introduce a new stack. The established implementation path is to harden the existing matrix-nio + SQLiteStore + RoutedPlatformClient + shared workspace volume design so production restart behavior matches the current Space+rooms UX. The main architectural rule is: Matrix topology is authoritative for room existence, while local SQLite metadata is authoritative only after reconciliation has rebuilt it.
The production-safe approach is to bind every working Matrix room to its own durable platform_chat_id, rotate only that identifier for !clear, and make restart recovery idempotent. Reconciliation should rebuild user_meta, room_meta, ChatManager entries, and missing routing fields from Matrix Space membership and room state before sync_forever() begins processing live traffic. Unknown rooms must be reconciled first, not silently converted into new chats.
For files, keep the current shared-volume contract and relative workspace_path transport. Do not build HTTP file shims or embed file payloads in bot-side state. For deployment artifacts, split runtime intent explicitly: docker-compose.prod.yml is a bot-only handoff contract, while docker-compose.fullstack.yml is the internal E2E harness that brings up platform services and shared volumes together.
Primary recommendation: Implement Phase 05 as a reconciliation-and-deploy hardening pass on the current Matrix stack, with Matrix Space state as source of truth and per-room platform_chat_id as the routing key.
Standard Stack
Core
| Library | Version | Purpose | Why Standard |
|---|---|---|---|
| matrix-nio | 0.25.2 | Async Matrix client, Spaces, media upload/download, token login, sync loop | Already in repo; official docs confirm support for Spaces, token login, room_put_state, upload, download, and sync_forever |
| sqlite3 / SQLiteStore | stdlib / repo-local | Durable bot metadata (room_meta, user_meta, routing state) | Small, local, restart-safe KV layer already used by runtime and tests |
| PyYAML | 6.0.3 | Agent registry / deployment config parsing | Current repo standard for config/matrix-agents.yaml-style artifacts |
| httpx | 0.28.1 | Async HTTP for auxiliary platform calls | Already used; fits async runtime and current codebase |
| Docker Compose | v2 spec; local install v2.40.3 | Prod/fullstack topology, shared named volumes, health-gated startup | Officially supports multi-file overlays, named volumes, and service_healthy gating |
Supporting
| Library | Version | Purpose | When to Use |
|---|---|---|---|
| structlog | 25.5.0 | Structured runtime logging | Use for reconciliation summaries, routing mismatches, and deploy diagnostics |
| pydantic | 2.13.3 | Typed config / payload validation | Use for any new deployment config or reconciliation report structures |
| python-dotenv | 1.2.2 | Local env loading | Keep for local and compose-driven runtime config |
| pytest | 9.0.3 | Test runner | Full phase verification and regression slices |
| pytest-asyncio | 1.3.0 | Async test execution | Required for reconciliation/runtime tests |
Alternatives Considered
| Instead of | Could Use | Tradeoff |
|---|---|---|
| matrix-nio | Synapse Admin / raw Matrix HTTP calls | Worse fit; repo already depends on nio abstractions and tests |
| repo-local SQLiteStore | Redis/Postgres | Unnecessary operational scope increase for MVP deployment |
| shared volume file flow | custom file proxy / presigned URLs | More moving parts, more auth/cleanup edge cases, no need for MVP |
| split compose files | one overloaded compose file with profiles | Harder operator handoff; less explicit prod vs internal-test intent |
Installation:
uv sync
Version verification: Verified on 2026-04-28 from PyPI and local environment.
| Package | Verified Version | Publish Date | Source |
|---|---|---|---|
| matrix-nio | 0.25.2 | 2024-10-04 | PyPI |
| httpx | 0.28.1 | 2024-12-06 | PyPI |
| structlog | 25.5.0 | 2025-10-27 | PyPI |
| pydantic | 2.13.3 | 2026-04-20 | PyPI |
| aiohttp | 3.13.5 | 2026-03-31 | PyPI |
| PyYAML | 6.0.3 | 2025-09-25 | PyPI |
| python-dotenv | 1.2.2 | 2026-03-01 | PyPI |
| pytest | 9.0.3 | 2026-04-07 | PyPI |
| pytest-asyncio | 1.3.0 | 2025-11-10 | PyPI |
Architecture Patterns
Recommended Project Structure
adapter/matrix/
├── bot.py                   # startup, sync bootstrap, live callbacks
├── reconciliation.py        # new: restart recovery from Matrix state
├── files.py                 # shared-volume path building / materialization
├── routed_platform.py       # room -> agent_id + platform_chat_id routing
├── store.py                 # room_meta/user_meta helpers and counters
└── handlers/
    ├── auth.py              # Space + first room provisioning
    ├── chat.py              # !new / !archive / !rename
    └── context_commands.py  # !save / !load / !clear / !context
deploy/
├── docker-compose.prod.yml       # bot-only handoff
└── docker-compose.fullstack.yml  # internal E2E stack
Pattern 1: Matrix Space State Is Canonical, SQLite Is Rebuildable
What: Treat Matrix Space membership and child-room state as the source of truth for room topology; use local SQLite metadata as a cached routing index that reconciliation can rebuild.
When to use: Startup, DB loss, stale local metadata, and any deployment where rooms may outlive the bot process.
Example:
# Source: repo pattern from adapter/matrix/store.py + Matrix Space state
room_meta = {
    "room_type": "chat",
    "chat_id": "C7",
    "display_name": "Research",
    "matrix_user_id": "@alice:example.org",
    "space_id": "!space:example.org",
    "agent_id": "agent-1",
    "platform_chat_id": "42",
}
await set_room_meta(store, room_id, room_meta)
await chat_mgr.get_or_create(
    user_id=room_meta["matrix_user_id"],
    chat_id=room_meta["chat_id"],
    platform="matrix",
    surface_ref=room_id,
    name=room_meta["display_name"],
)
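The rebuild step in this pattern can be sketched as a pure function over an in-memory mapping. This is a minimal illustration, not the repo's reconciliation module: `SpaceChild`, `reconcile_room_meta`, and the plain `dict` store are hypothetical stand-ins, and allocating a fresh `platform_chat_id` for a room whose metadata was lost is an assumption the source does not settle. The point it shows is idempotency: known rooms are never overwritten, and a second run changes nothing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpaceChild:
    """One child room discovered under a user's Space (hypothetical shape)."""
    room_id: str
    space_id: str
    matrix_user_id: str
    display_name: str


def reconcile_room_meta(store: dict[str, dict], children: list[SpaceChild]) -> list[str]:
    """Fill in missing room_meta entries without touching existing ones.

    Returns the room ids that were rebuilt so the caller can log a summary.
    """
    # Continue numbering after the highest numeric platform_chat_id on record.
    known_ids = [
        int(meta["platform_chat_id"])
        for meta in store.values()
        if str(meta.get("platform_chat_id", "")).isdigit()
    ]
    next_id = max(known_ids, default=0) + 1

    rebuilt: list[str] = []
    # Sort so that repeated runs over the same input allocate ids identically.
    for child in sorted(children, key=lambda c: c.room_id):
        if child.room_id in store:
            continue  # already known: reconciliation must not rewrite routing
        store[child.room_id] = {
            "room_type": "chat",
            "display_name": child.display_name,
            "matrix_user_id": child.matrix_user_id,
            "space_id": child.space_id,
            # Assumption: the old id is unrecoverable, so allocate a new one.
            "platform_chat_id": str(next_id),
        }
        rebuilt.append(child.room_id)
        next_id += 1
    return rebuilt
```

In the real bot this would run to completion before the sync loop starts dispatching live callbacks, with the children list derived from Space state rather than passed in by hand.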
Pattern 2: Per-Room platform_chat_id Is the Only Real Context Boundary
What: Route every working Matrix room to its own durable platform_chat_id.
When to use: Normal messaging, !save, !load, !context, !clear, restart restoration.
Example:
# Source: adapter/matrix/routed_platform.py + adapter/matrix/handlers/context_commands.py
old_chat_id = room_meta["platform_chat_id"]
new_chat_id = await next_platform_chat_id(store)
await set_platform_chat_id(store, room_id, new_chat_id)
# disconnect_chat is optional on the platform client, so probe for it first.
disconnect = getattr(platform, "disconnect_chat", None)
if callable(disconnect):
    await disconnect(old_chat_id)
Pattern 3: !clear Means Chat-ID Rotation, Not Global Wipe
What: Implement real clear by rotating only the current room's platform_chat_id and disconnecting the old upstream chat session.
When to use: User-triggered context reset for one room.
Example:
# Source: adapter/matrix/handlers/context_commands.py
room_id = await _resolve_room_id(event, chat_mgr)
old_chat_id = (room_meta or {}).get("platform_chat_id") or room_id
new_chat_id = await next_platform_chat_id(store)
await set_platform_chat_id(store, room_id, new_chat_id)
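To make the room-local scope of rotation concrete, here is a small self-contained sketch. `InMemoryStore` and `rotate_platform_chat_id` are hypothetical stand-ins for the repo's store and handler, `next_platform_chat_id` only mimics the name used above, and the `disconnected` list stands in for a call to the platform's `disconnect_chat`. It assumes a monotonic counter as the id source.

```python
import asyncio


class InMemoryStore:
    """Hypothetical stand-in for the repo's SQLiteStore."""

    def __init__(self) -> None:
        self.room_chat_ids: dict[str, str] = {}
        self.counter = 0


async def next_platform_chat_id(store: InMemoryStore) -> str:
    # Monotonic counter: an id is never reused, so old context cannot leak back.
    store.counter += 1
    return str(store.counter)


async def rotate_platform_chat_id(
    store: InMemoryStore, room_id: str, disconnected: list[str]
) -> str:
    """Give one room a new platform_chat_id and drop only its old session."""
    old_chat_id = store.room_chat_ids.get(room_id)
    new_chat_id = await next_platform_chat_id(store)
    store.room_chat_ids[room_id] = new_chat_id
    if old_chat_id is not None:
        # Stands in for: await platform.disconnect_chat(old_chat_id)
        disconnected.append(old_chat_id)
    return new_chat_id


async def demo() -> tuple[dict[str, str], list[str]]:
    store = InMemoryStore()
    dropped: list[str] = []
    await rotate_platform_chat_id(store, "!a:example.org", dropped)  # first id for room A
    await rotate_platform_chat_id(store, "!b:example.org", dropped)  # first id for room B
    await rotate_platform_chat_id(store, "!a:example.org", dropped)  # "!clear" in room A
    return store.room_chat_ids, dropped


RESULT = asyncio.run(demo())
```

After the final rotation, room B still maps to the id it was given, and the only disconnected session is room A's previous one.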
Pattern 4: Shared-Volume File Handoff Uses Relative Workspace Paths
What: Persist incoming Matrix media into a room-scoped path under the shared workspace, and pass only relative paths to the agent.
When to use: User uploads, staged attachments, agent-emitted files.
Example:
# Source: adapter/matrix/files.py
relative_path = (
    Path("surfaces") / "matrix" / safe_user / safe_room / "inbox" / f"{stamp}-{safe_name}"
)
return Attachment(
    type=attachment.type,
    url=attachment.url,
    filename=filename,
    mime_type=attachment.mime_type,
    workspace_path=relative_path.as_posix(),
)
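A sketch of how the relative-path contract could be enforced on both sides follows. The helper names (`safe_segment`, `inbox_relative_path`, `resolve_in_workspace`) and the sanitisation rule are assumptions for illustration, not the contents of adapter/matrix/files.py. The writer builds a room-scoped relative path, and the reader joins it onto its own mount root while rejecting anything that escapes that root.

```python
import re
from pathlib import Path, PurePosixPath


def safe_segment(value: str) -> str:
    """Replace characters outside [A-Za-z0-9._-] with '_' and strip leading dots."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", value).lstrip(".")
    return cleaned or "unnamed"


def inbox_relative_path(matrix_user_id: str, room_id: str, filename: str, stamp: str) -> str:
    """Build the room-scoped relative path that is handed to the agent."""
    path = (
        PurePosixPath("surfaces")
        / "matrix"
        / safe_segment(matrix_user_id)
        / safe_segment(room_id)
        / "inbox"
        / f"{stamp}-{safe_segment(filename)}"
    )
    return path.as_posix()


def resolve_in_workspace(workspace_root: Path, relative_path: str) -> Path:
    """Join a relative workspace path onto the local root, rejecting escapes."""
    root = workspace_root.resolve()
    candidate = (root / relative_path).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f"path escapes workspace root: {relative_path!r}")
    return candidate
```

Because only the relative part crosses the bot/agent boundary, each container can mount the shared volume at a different absolute location without changing the payload.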
Pattern 5: Compose Split By Operational Intent
What: Keep one compose artifact for operator handoff and one for internal full-stack testing.
When to use: Deployment packaging.
Example:
# docker-compose.prod.yml
services:
  matrix-bot:
    image: surfaces-bot:latest
    env_file: .env
    volumes:
      - agents:/agents

# docker-compose.fullstack.yml
services:
  matrix-bot:
    extends:
      file: docker-compose.prod.yml
      service: matrix-bot
  platform-agent:
    ...
volumes:
  agents:
Anti-Patterns to Avoid
- Lazy bootstrap as restart strategy: _bootstrap_unregistered_room() is acceptable for first-contact repair, not as the primary restart recovery path in production.
- Per-user context identity: a user-level or DM-level chat id breaks Space+rooms isolation and makes !clear incorrect.
- Global reset endpoint semantics: !clear must not wipe other rooms or all agent state for a user.
- Absolute attachment paths in platform payloads: keep agent attachment references relative to its workspace contract.
- Sleep-based service readiness: use Compose healthchecks and dependency conditions, not shell sleep.
Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---|---|---|---|
| Matrix room/Space protocol | Raw custom HTTP wrappers for state events | matrix-nio room_create, room_put_state, space_get_hierarchy, sync_forever, upload, download | Official support already exists and repo tests are built around nio |
| Restart topology discovery | Ad hoc timeline scraping | Full-state sync plus room state / Space child reconciliation | Timeline replay is noisy and brittle; state is the stable source |
| File transfer bus | Base64 blobs or custom bot-side file API | Shared /agents/ volume with relative workspace_path | Lower operational complexity and already matches upstream agent contract |
| Compose startup sequencing | Shell loops / sleeps | healthcheck + depends_on: condition: service_healthy | Official Compose behavior is deterministic and observable |
| Context reset | Deleting all SQLite rows or resetting the whole user | Rotate current room platform_chat_id and drop that room's live agent connection | Preserves other rooms and matches user expectation |
Key insight: The deceptively hard problems in this phase are already solved by the current stack: Matrix room state, nio media handling, named volumes, and service health gating. Custom alternatives add more failure modes than value.
Common Pitfalls
Pitfall 1: Unknown room after restart creates a duplicate working chat
What goes wrong: The bot treats an existing room as unregistered and provisions a fresh room/tree.
Why it happens: Local SQLite metadata is missing, but Matrix topology still exists.
How to avoid: Run reconciliation before live sync callbacks; only allow lazy bootstrap for genuinely new first-contact rooms.
Warning signs: New Чат N rooms appear after restart without a matching user action.
Pitfall 2: !clear resets the wrong scope
What goes wrong: Clearing one room also clears another room, or does nothing because the upstream session key did not change.
Why it happens: Context is keyed by user or local chat_id instead of durable room-local platform_chat_id.
How to avoid: Always resolve room -> platform_chat_id, rotate it, and disconnect only the old upstream chat.
Warning signs: Two rooms share response history or !context reports the same platform context id.
Pitfall 3: Space child linkage is incomplete
What goes wrong: Rooms exist but do not appear correctly under the user's Space.
Why it happens: Missing or malformed m.space.child state, especially missing via data.
How to avoid: Persist space_id, write m.space.child with state_key=room_id, and reconcile child links on startup.
Warning signs: Element shows the room outside the Space, or not at all in the hierarchy.
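A reconciliation pass could flag malformed child links before trusting them. The checker below is a hypothetical helper, not part of matrix-nio or the repo; it only encodes the rule from this pitfall that an m.space.child event needs a room id as its state key and a non-empty `via` list of server names.

```python
def space_child_problems(state_key: str, content: dict) -> list[str]:
    """Return human-readable problems with one m.space.child state event."""
    problems: list[str] = []
    if not state_key.startswith("!"):
        problems.append("state_key is not a room id")
    via = content.get("via")
    if not isinstance(via, list) or not via:
        # Without a usable via list the child is not treated as part of the Space.
        problems.append("via is missing or empty")
    elif not all(isinstance(server, str) and server for server in via):
        problems.append("via contains a non-string or empty entry")
    return problems
```

An empty result means the link looks structurally sound; anything else is a candidate for rewriting the state event during startup reconciliation.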
Pitfall 4: Shared volume works locally but fails in deployment
What goes wrong: Agent-generated files cannot be read by the bot, or bot-downloaded files are unreadable by the agent.
Why it happens: Mount mismatch, wrong root (/workspace vs /agents), or container user/group permissions.
How to avoid: Standardize one shared root, keep relative workspace paths, and align container permissions with Compose volume configuration.
Warning signs: Attachment paths exist in metadata but not on disk inside the other container.
Pitfall 5: Compose depends_on starts too early
What goes wrong: Bot starts before dependent services are actually ready.
Why it happens: Short-form depends_on only waits for container start, not health.
How to avoid: Use healthchecks and long-form depends_on with service_healthy in the full-stack compose file.
Warning signs: First requests fail after fresh docker compose up, then succeed on retry.
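One way to catch this pitfall in a smoke test is to inspect the parsed compose mapping for dependencies that are not health-gated. The function below is an illustrative sketch with a hypothetical name; it assumes the input is the dictionary obtained by loading a compose file (for example with PyYAML's `yaml.safe_load`) and does not invoke Docker at all.

```python
def unguarded_dependencies(compose: dict) -> list[tuple[str, str]]:
    """Return (service, dependency) pairs not gated on service_healthy."""
    findings: list[tuple[str, str]] = []
    for name, service in (compose.get("services") or {}).items():
        depends_on = (service or {}).get("depends_on") or {}
        if isinstance(depends_on, list):
            # Short form: Compose waits for container start only, not health.
            findings.extend((name, dep) for dep in depends_on)
            continue
        for dep, spec in depends_on.items():
            if (spec or {}).get("condition") != "service_healthy":
                findings.append((name, dep))
    return findings
```

An empty list for the full-stack file would indicate that every declared dependency waits on a healthcheck rather than on container start.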
Code Examples
Verified patterns from official sources and current repo:
Create a Space with matrix-nio
# Source: matrix-nio API docs
space_resp = await client.room_create(
    name=f"Lambda — {display_name}",
    visibility=RoomVisibility.private,
    invite=[matrix_user_id],
    space=True,
)
Add a child room to a Space
# Source: current repo pattern + Matrix spec
await client.room_put_state(
    room_id=space_id,
    event_type="m.space.child",
    content={"via": [homeserver]},
    state_key=chat_room_id,
)
Persist room-scoped attachment paths
# Source: adapter/matrix/files.py
relative_path, absolute_path = build_workspace_attachment_path(
    workspace_root=workspace_root,
    matrix_user_id=matrix_user_id,
    room_id=room_id,
    filename=filename,
)
absolute_path.parent.mkdir(parents=True, exist_ok=True)
absolute_path.write_bytes(body)
Health-gated startup in Compose
# Source: Docker Compose docs
services:
  matrix-bot:
    depends_on:
      platform-agent:
        condition: service_healthy
  platform-agent:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 10s
      timeout: 5s
      retries: 5
State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|---|---|---|---|
| Per-user or single shared platform context | Per-room platform_chat_id | Repo direction corrected on 2026-04-28 | Enables true room isolation and correct !clear |
| Single overloaded compose runtime | Separate prod handoff and full-stack E2E compose files | Current Phase 05 scope | Reduces operator ambiguity |
| Unknown room auto-bootstrap as recovery | Explicit reconciliation before live traffic | Recommended for Phase 05 | Prevents duplicate chat trees after restart |
| File payloads treated as transport concern | Shared-volume relative path contract | Already present in repo | Keeps bot/platform contract simple and durable |
Deprecated/outdated:
- Single-chat / DM-first deployment direction: explicitly discarded in Phase 05 reset.
- Global reset semantics for Matrix context commands: does not match Space+rooms UX.
- Using only local store as truth for restart recovery: unsafe once deployed rooms outlive the process.
Open Questions
- What exact Matrix state should reconciliation trust for chat_id labels?
  - What we know: room_meta.chat_id is local and not derivable from Matrix protocol by default.
  - What's unclear: whether chat labels should be reconstructed from room names, stored custom state, or cached local metadata when present.
  - Recommendation: persist chat_id in local SQLite, but make reconciliation able to regenerate a stable fallback label and avoid blocking routing if the label is missing.
- What readiness probe exists for platform-agent in the full-stack compose?
  - What we know: Compose health gating is the right pattern.
  - What's unclear: whether upstream agent image already exposes a reliable health endpoint.
  - Recommendation: inspect upstream container and add a bot-facing probe before finalizing docker-compose.fullstack.yml.
- Should prod mount root remain /workspace or be renamed to /agents externally?
  - What we know: current code defaults to SURFACES_WORKSPACE_DIR=/workspace, while deployment docs describe shared /agents/.
  - What's unclear: whether external handoff wants a host path named /agents while containers still use /workspace.
  - Recommendation: keep one in-container canonical path and let host-side naming vary only in Compose mounts.
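For the first question, one possible shape of a "stable fallback label" is a short digest of the room id. This is only a sketch of an option, not a decision: the function name, the `R` prefix (chosen here so fallback labels cannot collide with counter-style labels such as C7), and the eight-character length are all assumptions.

```python
import hashlib


def fallback_chat_label(room_id: str, prefix: str = "R") -> str:
    """Derive a label that is identical on every restart for the same room."""
    digest = hashlib.sha256(room_id.encode("utf-8")).hexdigest()
    return f"{prefix}{digest[:8]}"
```

Because the label depends only on the room id, reconciliation can regenerate it after a database loss without consulting any local state, and routing never has to wait for a human-friendly label to be restored.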
Environment Availability
| Dependency | Required By | Available | Version | Fallback |
|---|---|---|---|---|
| Python | bot runtime | ✓ | 3.14.3 | — |
| uv | dependency install | ✓ | 0.9.30 | pip |
| pytest | validation | ✓ | 9.0.2 installed | python -m pytest |
| Docker Engine | deployment packaging / E2E compose | ✓ | 29.1.3 | none |
| Docker Compose | split runtime orchestration | ✓ | 2.40.3 | none |
Missing dependencies with no fallback:
- None
Missing dependencies with fallback:
- None
Validation Architecture
Test Framework
| Property | Value |
|---|---|
| Framework | pytest + pytest-asyncio |
| Config file | pyproject.toml |
| Quick run command | pytest tests/adapter/matrix/test_restart_persistence.py -v |
| Full suite command | pytest tests/ -v |
Phase Requirements → Test Map
| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|---|---|---|---|---|
| PH05-01 | Space+rooms onboarding remains primary UX | integration | pytest tests/adapter/matrix/test_invite_space.py tests/adapter/matrix/test_chat_space.py -v | ✅ |
| PH05-02 | Per-room platform_chat_id isolates routing and powers real clear | integration | pytest tests/adapter/matrix/test_routed_platform.py tests/adapter/matrix/test_context_commands.py -v | ✅ |
| PH05-03 | Restart reconciliation restores routing metadata | integration | pytest tests/adapter/matrix/test_restart_persistence.py -v | ❌ new reconciliation tests needed |
| PH05-04 | Shared-volume file transfer is room-safe | integration | pytest tests/adapter/matrix/test_files.py tests/platform/test_real.py -v | ✅ partial |
| PH05-05 | Split prod/fullstack compose artifacts stay coherent | smoke | docker compose -f docker-compose.prod.yml config && docker compose -f docker-compose.fullstack.yml config | ❌ Wave 0 |
Sampling Rate
- Per task commit: pytest tests/adapter/matrix/test_restart_persistence.py -v
- Per wave merge: pytest tests/adapter/matrix/ -v
- Phase gate: pytest tests/ -v plus both compose files passing docker compose ... config
Wave 0 Gaps
- tests/adapter/matrix/test_reconciliation.py: startup recovery of user/room metadata from Matrix state
- tests/adapter/matrix/test_context_commands.py additions: !clear command contract and room-local rotation semantics
- tests/adapter/matrix/test_compose_artifacts.py or equivalent smoke command documentation: split compose validation
- tests/adapter/matrix/test_files.py additions: cross-room attachment path isolation and shared-root consistency
Sources
Primary (HIGH confidence)
- Local repo code and tests:
  - adapter/matrix/bot.py
  - adapter/matrix/store.py
  - adapter/matrix/files.py
  - adapter/matrix/routed_platform.py
  - adapter/matrix/handlers/auth.py
  - adapter/matrix/handlers/context_commands.py
  - tests/adapter/matrix/test_restart_persistence.py
  - tests/adapter/matrix/test_files.py
  - tests/platform/test_real.py
- Matrix-nio API docs: https://matrix-nio.readthedocs.io/en/latest/nio.html
- Matrix-nio async client docs: https://matrix-nio.readthedocs.io/en/latest/_modules/nio/client/async_client.html
- Matrix-nio PyPI release page: https://pypi.org/project/matrix-nio/
- Matrix spec Spaces / hierarchy: https://spec.matrix.org/v1.18/server-server-api/
- Matrix spec changelog note on via for m.space.child: https://spec.matrix.org/v1.16/changelog/v1.9/
- Docker Compose CLI reference: https://docs.docker.com/reference/cli/docker/compose/
- Docker Compose services reference: https://docs.docker.com/reference/compose-file/services/
Secondary (MEDIUM confidence)
- docs/deploy-architecture.md: repo-local deployment contract clarified on 2026-04-27
- docs/research/matrix-spaces.md: prior internal research aligned with spec, but not treated as primary
- README.md runtime notes for current Matrix backend and shared workspace behavior
Tertiary (LOW confidence)
- None
Metadata
Confidence breakdown:
- Standard stack: HIGH - current repo stack verified against official docs and package registries
- Architecture: HIGH - recommendations align with existing runtime boundaries and official Matrix / Compose behavior
- Pitfalls: HIGH - derived from current code paths, existing tests, and official protocol/runtime semantics
Research date: 2026-04-28
Valid until: 2026-05-28