# Hermes Agent - Future Improvements

> Ideas for enhancing the agent's capabilities, generated from self-analysis of the codebase.

---

## 🚨 HIGH PRIORITY - Immediate Fixes

These items need to be addressed ASAP:
### 1. SUDO Breaking Terminal Tool 🔐 ✅ COMPLETE

- [x] **Problem:** SUDO commands break the terminal tool execution (hangs indefinitely)
- [x] **Fix:** Created custom environment wrappers in `tools/terminal_tool.py`
  - `stdin=subprocess.DEVNULL` prevents hanging on interactive prompts
  - Sudo fails gracefully with a clear error if no password is configured
  - Same UX as Claude Code - the agent sees the error and tells the user to run the command themselves
- [x] **All 5 environments now have consistent behavior:**
  - `_LocalEnvironment` - local execution
  - `_DockerEnvironment` - Docker containers
  - `_SingularityEnvironment` - Singularity/Apptainer containers
  - `_ModalEnvironment` - Modal cloud sandboxes
  - `_SSHEnvironment` - remote SSH execution
- [x] **Optional sudo support via `SUDO_PASSWORD` env var:**
  - Shared `_transform_sudo_command()` helper used by all environments
  - If set, auto-transforms `sudo cmd` → pipes the password via `sudo -S`
  - Documented in `.env.example`, `cli-config.yaml`, and README
  - Works for chained commands: `cmd1 && sudo cmd2`
- [x] **Interactive sudo prompt in CLI mode:**
  - When sudo is detected and no password is configured, prompts the user
  - 45-second timeout (auto-skips if no input)
  - Hidden password input via `getpass` (password not visible)
  - Password cached for the session (doesn't ask repeatedly)
  - Spinner pauses during the prompt for clean UX
  - Uses `HERMES_INTERACTIVE` env var to detect CLI mode

### 2. Fix `browser_get_images` Tool 🖼️ ✅ VERIFIED WORKING

- [x] **Tested:** Tool works correctly on multiple sites
- [x] **Results:** Successfully extracts image URLs, alt text, dimensions
- [x] **Note:** Some sites (Pixabay, etc.) have Cloudflare bot protection that blocks headless browsers - this is expected behavior, not a bug

### 3. Better Action Logging for Debugging 📝 ✅ COMPLETE

- [x] **Problem:** Need better logging of agent actions for debugging
- [x] **Implementation:**
  - Save full session trajectories to the `logs/` directory as JSON
  - Each session gets a unique file: `session_YYYYMMDD_HHMMSS_UUID.json`
  - Logs all messages, tool calls with inputs/outputs, timestamps
  - Structured JSON format for easy parsing and replay
  - Automatic on CLI runs (configurable)

### 4. Automatic Context Compression 🗜️ ✅ COMPLETE

- [x] **Problem:** Long conversations exceed model context limits, causing errors
- [x] **Solution:** Auto-compress middle turns when approaching the limit
- [x] **Implementation:**
  - Fetches model context lengths from the OpenRouter `/api/v1/models` API (cached for 1 hour)
  - Tracks actual token usage from API responses (`usage.prompt_tokens`)
  - Triggers at 85% of the model's context limit (configurable)
  - Protects the first 3 turns (system, initial request, first response)
  - Protects the last 4 turns (recent context is most relevant)
  - Summarizes middle turns using a fast model (Gemini Flash)
  - Inserts the summary as a user message; the conversation continues seamlessly
  - If a context error occurs, attempts compression before failing
- [x] **Configuration (cli-config.yaml / env vars):**
  - `CONTEXT_COMPRESSION_ENABLED` (default: true)
  - `CONTEXT_COMPRESSION_THRESHOLD` (default: 0.85 = 85%)
  - `CONTEXT_COMPRESSION_MODEL` (default: google/gemini-2.0-flash-001)
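
The trigger logic above boils down to something like this sketch (function name and exact slice arithmetic are illustrative, not the shipped implementation):

```python
def select_compressible_turns(messages, prompt_tokens, context_limit,
                              threshold=0.85, protect_head=3, protect_tail=4):
    """Return the slice of middle turns to summarize, or None if
    compression isn't needed (or there's nothing safe to compress)."""
    if prompt_tokens < threshold * context_limit:
        return None  # still comfortably under the limit
    middle = messages[protect_head:len(messages) - protect_tail]
    if not middle:
        return None  # conversation too short to compress safely
    return middle
```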

### 5. Stream Thinking Summaries in Real-Time 💭 ⏸️ DEFERRED

- [ ] **Problem:** Thinking/reasoning summaries are not shown while streaming
- [ ] **Complexity:** This is a significant refactor - leaving it for later

**OpenRouter Streaming Info:**

- Uses `stream=True` with the OpenAI SDK
- Reasoning comes in `choices[].delta.reasoning_details` chunks
- Types: `reasoning.summary`, `reasoning.text`, `reasoning.encrypted`
- Tool call arguments stream as partial JSON (need accumulation)
- Items paradigm: the same ID is emitted multiple times with updated content

**Key Challenges:**

- Tool call JSON accumulation (partial `{"query": "wea` → `{"query": "weather"}`)
- Multiple concurrent outputs (thinking + tool calls + text simultaneously)
- State management for partial responses
- Error handling if the connection drops mid-stream
- Deciding when tool calls are "complete" enough to execute
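
The accumulation challenge can be sketched as follows (a minimal illustration, assuming deltas arrive as `(index, name, args_fragment)` tuples rather than the SDK's actual delta objects):

```python
import json

def accumulate_tool_calls(deltas):
    """Merge streamed tool-call deltas into complete calls.

    Arguments arrive as partial JSON fragments and only become
    executable once the full string parses.
    """
    calls = {}
    for index, name, args_fragment in deltas:
        call = calls.setdefault(index, {"name": None, "args": ""})
        if name:
            call["name"] = name
        call["args"] += args_fragment

    completed = []
    for index in sorted(calls):
        call = calls[index]
        try:
            arguments = json.loads(call["args"])
        except json.JSONDecodeError:
            continue  # still partial - not safe to execute yet
        completed.append({"name": call["name"], "arguments": arguments})
    return completed
```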

**UX Questions to Resolve:**

- Show raw thinking text or summarized?
- Live expanding text vs. spinner replacement?
- Markdown rendering while streaming?
- How to handle thinking + tool call display simultaneously?

**Implementation Options:**

- New `run_conversation_streaming()` method (keep non-streaming as a fallback)
- Wrapper that handles streaming internally
- Big refactor of the existing `run_conversation()`

**References:**

- https://openrouter.ai/docs/api/reference/streaming
- https://openrouter.ai/docs/guides/best-practices/reasoning-tokens#streaming-response

---

## 1. Subagent Architecture (Context Isolation) 🎯

**Problem:** Long-running tools (terminal commands, browser automation, complex file operations) consume massive context. A single `ls -la` can add hundreds of lines. Browser snapshots, debugging sessions, and iterative terminal work quickly bloat the main conversation, leaving less room for actual reasoning.

**Solution:** The main agent becomes an **orchestrator** that delegates context-heavy tasks to **subagents**.

**Architecture:**

```
┌─────────────────────────────────────────────────────────────────┐
│ ORCHESTRATOR (main agent)                                        │
│ - Receives user request                                          │
│ - Plans approach                                                 │
│ - Delegates heavy tasks to subagents                             │
│ - Receives summarized results                                    │
│ - Maintains clean, focused context                               │
└─────────────────────────────────────────────────────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
│ TERMINAL AGENT  │   │ BROWSER AGENT   │   │ CODE AGENT      │
│ - terminal tool │   │ - browser tools │   │ - file tools    │
│ - file tools    │   │ - web_search    │   │ - terminal      │
│                 │   │ - web_extract   │   │                 │
│ Isolated context│   │ Isolated context│   │ Isolated context│
│ Returns summary │   │ Returns summary │   │ Returns summary │
└─────────────────┘   └─────────────────┘   └─────────────────┘
```

**How it works:**

1. User asks: "Set up a new Python project with FastAPI and tests"
2. Orchestrator plans: "I need to create files, install deps, write code"
3. Orchestrator calls: `terminal_task(goal="Create venv, install fastapi pytest", context="New project in ~/myapp")`
4. **Subagent spawns** with fresh context and only terminal/file tools
5. Subagent iterates (may take 10+ tool calls, lots of output)
6. Subagent completes → returns a summary: "Created venv, installed fastapi==0.109.0, pytest==8.0.0"
7. Orchestrator receives **only the summary**; context stays clean
8. Orchestrator continues with the next subtask

**Key tools to implement:**

- [ ] `terminal_task(goal, context, cwd?)` - Delegate terminal/shell work
- [ ] `browser_task(goal, context, start_url?)` - Delegate web research/automation
- [ ] `code_task(goal, context, files?)` - Delegate code writing/modification
- [ ] Generic `delegate_task(goal, context, toolsets=[])` - Flexible delegation

**Implementation details:**

- [ ] Subagent uses the same `run_agent.py` but with:
  - Fresh/empty conversation history
  - Limited toolset (only what's needed)
  - Smaller max_iterations (focused task)
  - Task-specific system prompt
- [ ] Subagent returns a structured result:

  ```python
  {
      "success": True,
      "summary": "Installed 3 packages, created 2 files",
      "details": "Optional longer explanation if needed",
      "artifacts": ["~/myapp/requirements.txt", "~/myapp/main.py"],  # Files created
      "errors": []  # Any issues encountered
  }
  ```

- [ ] Orchestrator sees only the summary in its context
- [ ] Full subagent transcript saved separately for debugging

**Benefits:**

- 🧹 **Clean context** - Orchestrator stays focused, doesn't drown in tool output
- 📊 **Better token efficiency** - 50 terminal outputs → 1 summary paragraph
- 🎯 **Focused subagents** - Each agent has just the tools it needs
- 🔄 **Parallel potential** - Independent subtasks could run concurrently
- 🐛 **Easier debugging** - Each subtask has its own isolated transcript

**When to use subagents vs direct tools:**

- **Subagent**: Multi-step tasks, iteration likely, lots of output expected
- **Direct**: Quick one-off commands, simple file reads, user needs to see output

**Files to modify:** `run_agent.py` (add orchestration mode), new `tools/delegate_tools.py`, new `subagent_runner.py`
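
In outline, generic delegation could look like this sketch (the `run_subagent` callable and its keyword arguments are hypothetical stand-ins for whatever wrapper around `run_agent.py` gets built):

```python
def delegate_task(goal, context, toolsets, run_subagent):
    """Run a focused subagent and surface only its summary.

    `run_subagent` is whatever callable actually spins up the agent
    loop; it is injected here so the orchestrator stays decoupled
    from the runner.
    """
    result = run_subagent(
        system_prompt=f"You are a focused subagent. Goal: {goal}\nContext: {context}",
        toolsets=toolsets,
        max_iterations=15,        # focused task, tighter budget
        conversation_history=[],  # fresh, isolated context
    )
    # Only this compact line enters the orchestrator's context.
    status = "done" if result.get("success") else "failed"
    return f"[subagent {status}] {result.get('summary', '')}"
```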

---

## 2. Planning & Task Management 📋

**Problem:** Agent handles tasks reactively without explicit planning. Complex multi-step tasks lack structure, progress tracking, and the ability to decompose work into manageable chunks.

**Ideas:**

- [ ] **Task decomposition tool** - Break complex requests into subtasks:

  ```
  User: "Set up a new Python project with FastAPI, tests, and Docker"

  Agent creates plan:
  ├── 1. Create project structure and requirements.txt
  ├── 2. Implement FastAPI app skeleton
  ├── 3. Add pytest configuration and initial tests
  ├── 4. Create Dockerfile and docker-compose.yml
  └── 5. Verify everything works together
  ```

  - Each subtask becomes a trackable unit
  - Agent can report progress: "Completed 3/5 tasks"

- [ ] **Progress checkpoints** - Periodic self-assessment:
  - After N tool calls or time elapsed, pause to evaluate
  - "What have I accomplished? What remains? Am I on track?"
  - Detect if stuck in loops or making no progress
  - Could trigger replanning if the approach isn't working

- [ ] **Explicit plan storage** - Persist the plan in the conversation:
  - Store as structured data (not just in context)
  - Update status as tasks complete
  - User can ask "What's the plan?" or "What's left?"
  - Survives context compression (plans are protected)

- [ ] **Failure recovery with replanning** - When things go wrong:
  - Record what failed and why
  - Revise the plan to work around the issue
  - "Step 3 failed because X, adjusting approach to Y"
  - Prevents repeating failed strategies

**Files to modify:** `run_agent.py` (add planning hooks), new `tools/planning_tool.py`
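
A minimal sketch of the structured, compression-safe plan object described above (names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    """Structured plan stored outside the raw conversation context."""
    tasks: list = field(default_factory=list)  # [{"title": str, "done": bool}]

    def add(self, title):
        self.tasks.append({"title": title, "done": False})

    def complete(self, index):
        self.tasks[index]["done"] = True

    def progress(self):
        done = sum(1 for t in self.tasks if t["done"])
        return f"Completed {done}/{len(self.tasks)} tasks"
```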

---

## 3. Tool Composition & Learning 🔧

**Problem:** Tools are atomic. Complex tasks require repeated manual orchestration of the same tool sequences.

**Ideas:**

- [ ] **Macro tools / Tool chains** - Define reusable tool sequences:

  ```yaml
  research_topic:
    description: "Deep research on a topic"
    steps:
      - web_search: {query: "$topic"}
      - web_extract: {urls: "$search_results.urls[:3]"}
      - summarize: {content: "$extracted"}
  ```

  - Could be defined in skills or a new `macros/` directory
  - Agent can invoke a macro as a single tool call

- [ ] **Tool failure patterns** - Learn from failures:
  - Track: tool, input pattern, error type, what worked instead
  - Before calling a tool, check: "Has this pattern failed before?"
  - Persistent across sessions (stored in skills or a separate DB)

- [ ] **Parallel tool execution** - When tools are independent, run them concurrently:
  - Detect independence (no data dependencies between calls)
  - Use `asyncio.gather()` for parallel execution
  - Already have async support in some tools, just need orchestration
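
The `asyncio.gather()` orchestration could be as small as this sketch (assuming the tools involved are already async):

```python
import asyncio

async def run_tools_parallel(calls):
    """Execute independent tool calls concurrently.

    `calls` is a list of (async_tool, kwargs) pairs with no data
    dependencies between them; results come back in the same order.
    """
    return await asyncio.gather(*(tool(**kwargs) for tool, kwargs in calls))
```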

**Files to modify:** `model_tools.py`, `toolsets.py`, new `tool_macros.py`

---

## 4. Dynamic Skills Expansion 📚

**Problem:** The skills system is elegant but static. Skills must be manually created and added.

**Ideas:**

- [ ] **Skill acquisition from successful tasks** - After completing a complex task:
  - "This approach worked well. Save as a skill?"
  - Extract: goal, steps taken, tools used, key decisions
  - Generate SKILL.md automatically
  - Store in the user's skills directory

- [ ] **Skill templates** - Common patterns that can be parameterized:

  ```markdown
  # Debug {language} Error
  1. Reproduce the error
  2. Search for the error message: `web_search("{error_message} {language}")`
  3. Check common causes: {common_causes}
  4. Apply fix and verify
  ```

- [ ] **Skill chaining** - Combine skills for complex workflows:
  - Skills can reference other skills as dependencies
  - "To do X, first apply skill Y, then skill Z"
  - Directed graph of skill dependencies

**Files to modify:** `tools/skills_tool.py`, `skills/` directory structure, new `skill_generator.py`

---

## 5. Task Continuation Hints 🎯

**Problem:** The agent could be more helpful by suggesting logical next steps.

**Ideas:**

- [ ] **Suggest next steps** - At the end of a task, suggest logical continuations:
  - "Code is written. Want me to also write tests / docs / deploy?"
  - Based on common workflows for the task type
  - Non-intrusive, just offer options

**Files to modify:** `run_agent.py`, response generation logic

---

## 6. Interactive Clarifying Questions Tool ❓

**Problem:** Agent sometimes makes assumptions or guesses when it should ask the user. Currently it can only ask via text, which gets lost in long outputs.

**Ideas:**

- [ ] **Multiple-choice prompt tool** - Let the agent present structured choices to the user:

  ```
  ask_user_choice(
      question="Should the language switcher enable only German or all languages?",
      choices=[
          "Only enable German - works immediately",
          "Enable all, mark untranslated - show fallback notice",
          "Let me specify something else"
      ]
  )
  ```

  - Renders as an interactive terminal UI with arrow key / Tab navigation
  - User selects an option, result returned to the agent
  - Up to 4 choices + optional free-text option

- [ ] **Implementation:**
  - Use the `inquirer` or `questionary` Python library for rich terminal prompts
  - Tool returns the selected option text (or the user's custom input)
  - **CLI-only** - only works when running via `cli.py` (not API/programmatic use)
  - Graceful fallback: if not in interactive mode, return an error asking the agent to rephrase as text

- [ ] **Use cases:**
  - Clarify ambiguous requirements before starting work
  - Confirm destructive operations with clear options
  - Let the user choose between implementation approaches
  - Checkpoint complex multi-step workflows
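
A sketch of the tool with the graceful fallback described above (assumes `questionary` and the `HERMES_INTERACTIVE` convention mentioned earlier; the exact detection logic is illustrative):

```python
import os

try:
    import questionary  # third-party; only needed in interactive CLI mode
except ImportError:
    questionary = None

def ask_user_choice(question, choices):
    """Present structured choices to the user - CLI-only.

    Falls back to an error result when not running interactively so
    the agent can rephrase the question as plain text instead.
    """
    interactive = os.environ.get("HERMES_INTERACTIVE") == "1" and questionary is not None
    if not interactive:
        return {"error": "Not in interactive mode - ask the user in plain text instead."}
    answer = questionary.select(question, choices=choices).ask()
    return {"selected": answer}
```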

**Files to modify:** New `tools/ask_user_tool.py`, `cli.py` (detect interactive mode), `model_tools.py`

---

## 7. Resource Awareness & Efficiency 💰

**Problem:** No awareness of costs, time, or resource usage. Could be smarter about efficiency.

**Ideas:**

- [ ] **Tool result caching** - Don't repeat identical operations:
  - Cache web searches and extractions within a session
  - Invalidation based on the time-sensitivity of the query
  - Hash-based lookup: same input → cached output

- [ ] **Lazy evaluation** - Don't fetch everything upfront:
  - Get summaries first, full content only if needed
  - "I found 5 relevant pages. Want me to deep-dive on any?"

**Files to modify:** `model_tools.py`, new `resource_tracker.py`

---

## 8. Collaborative Problem Solving 🤝

**Problem:** Interaction is command/response. Complex problems benefit from dialogue.

**Ideas:**

- [ ] **Assumption surfacing** - Make implicit assumptions explicit:
  - "I'm assuming you want Python 3.11+. Correct?"
  - "This solution assumes you have sudo access..."
  - Let the user correct course before going down the wrong path

- [ ] **Checkpoint & confirm** - For high-stakes operations:
  - "About to delete 47 files. Here's the list - proceed?"
  - "This will modify your database. Want a backup first?"
  - Configurable threshold for when to ask

**Files to modify:** `run_agent.py`, system prompt configuration

---

## 9. Project-Local Context 💾

**Problem:** Valuable context is lost between sessions.

**Ideas:**

- [ ] **Project awareness** - Remember project-specific context:
  - Store `.hermes/context.md` in the project directory
  - "This is a Django project using PostgreSQL"
  - Coding style preferences, deployment setup, etc.
  - Load automatically when working in that directory

- [ ] **Handoff notes** - Leave notes for future sessions:
  - Write to `.hermes/notes.md` in the project
  - "TODO for next session: finish implementing X"
  - "Known issues: Y doesn't work on Windows"

**Files to modify:** New `project_context.py`, auto-load in `run_agent.py`

---

## 10. Graceful Degradation & Robustness 🛡️

**Problem:** When things go wrong, recovery is limited. The agent should fail gracefully.

**Ideas:**

- [ ] **Fallback chains** - When the primary approach fails, have backups:
  - `web_extract` fails → try `browser_navigate` → try `web_search` for a cached version
  - Define fallback order per tool type

- [ ] **Partial progress preservation** - Don't lose work on failure:
  - Long task fails midway → save what we've got
  - "I completed 3/5 steps before the error. Here's what I have..."

- [ ] **Self-healing** - Detect and recover from bad states:
  - Browser stuck → close and retry
  - Terminal hung → timeout and reset
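
A fallback chain executor might be as simple as this sketch (illustrative; real tools would raise tool-specific errors rather than bare `Exception`):

```python
def run_with_fallbacks(chain, *args):
    """Try each (name, tool) in order until one succeeds.

    Returns (name_that_worked, result); re-raises the last error if
    every fallback fails.
    """
    last_error = None
    for name, tool in chain:
        try:
            return name, tool(*args)
        except Exception as exc:
            last_error = exc  # remember and move on to the next fallback
    raise last_error
```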

**Files to modify:** `model_tools.py`, tool implementations, new `fallback_manager.py`

---

## 11. Tools & Skills Wishlist 🧰

*Things that would need new tool implementations (can't do well with current tools):*

### High-Impact

- [ ] **Audio/Video Transcription** 🎬 *(see also the Speech-to-Text section below for a detailed spec)*
  - Transcribe audio files, podcasts, YouTube videos
  - Extract key moments from video
  - Voice memo transcription for messaging integrations
  - *Provider options: Whisper API, Deepgram, local Whisper*

- [ ] **Diagram Rendering** 📊
  - Render Mermaid/PlantUML to actual images
  - Can generate the code, but rendering requires an external service or tool
  - "Show me how these components connect" → actual visual diagram
### Medium-Impact

- [ ] **Canvas / Visual Workspace** 🖼️
  - Agent-controlled visual panel for rendering interactive UI
  - Inspired by OpenClaw's Canvas feature
  - **Capabilities:**
    - `present` / `hide` - Show/hide the canvas panel
    - `navigate` - Load HTML files or URLs into the canvas
    - `eval` - Execute JavaScript in the canvas context
    - `snapshot` - Capture the rendered UI as an image
  - **Use cases:**
    - Display generated HTML/CSS/JS previews
    - Show interactive data visualizations (charts, graphs)
    - Render diagrams (Mermaid → rendered output)
    - Present structured information in rich format
    - A2UI-style component system for structured agent UI
  - **Implementation options:**
    - Electron-based panel for CLI
    - WebSocket-connected web app
    - VS Code webview extension
  - *Would let the agent "show" things rather than just describe them*

- [ ] **Document Generation** 📄
  - Create styled PDFs, Word docs, presentations
  - *Can do basic PDFs via terminal tools, but limited*

- [ ] **Diff/Patch Tool** 📝
  - Surgical code modifications with preview
  - "Change lines 45-50 to X" without rewriting the whole file
  - Show diffs before applying
  - *Can use `diff`/`patch`, but a native tool would be safer*
### Skills to Create

- [ ] **Domain-specific skill packs:**
  - DevOps/Infrastructure (Terraform, K8s, AWS)
  - Data Science workflows (EDA, model training)
  - Security/pentesting procedures

- [ ] **Framework-specific skills:**
  - React/Vue/Angular patterns
  - Django/Rails/Express conventions
  - Database optimization playbooks

- [ ] **Troubleshooting flowcharts:**
  - "Docker container won't start" → decision tree
  - "Production is slow" → systematic diagnosis

---

## 12. Messaging Platform Integrations 💬

**Problem:** The agent currently only works via `cli.py`, which requires direct terminal access. Users may want to interact via messaging apps from their phone or other devices.

**Architecture:**

- `run_agent.py` already accepts a `conversation_history` parameter and returns updated messages ✅
- Need: persistent session storage, platform monitors, session key resolution

**Implementation approach:**

```
┌─────────────────────────────────────────────────────────────┐
│ Platform Monitor (e.g., telegram_monitor.py)                │
│  ├─ Long-running daemon connecting to messaging platform    │
│  ├─ On message: resolve session key → load history from disk│
│  ├─ Call run_agent.py with loaded history                   │
│  ├─ Save updated history back to disk (JSONL)               │
│  └─ Send response back to platform                          │
└─────────────────────────────────────────────────────────────┘
```

**Platform support (each user sets up their own credentials):**

- [ ] **Telegram** - via `python-telegram-bot` or a `grammy` equivalent
  - Bot token from @BotFather
  - Easiest to set up, good for personal use
- [ ] **Discord** - via `discord.py`
  - Bot token from the Discord Developer Portal
  - Can work in servers (group sessions) or DMs
- [ ] **WhatsApp** - via `baileys` (WhatsApp Web protocol)
  - QR code scan to authenticate
  - More complex, but reaches most people

**Session management:**

- [ ] **Session store** - JSONL persistence per session key
  - `~/.hermes/sessions/{session_key}.jsonl`
  - Session keys: `telegram:dm:{user_id}`, `discord:channel:{id}`, etc.
- [ ] **Session expiry** - Configurable reset policies
  - Daily reset (default 4am) OR idle timeout (e.g., 2 hours)
  - Manual reset via a `/reset` or `/new` command in chat
- [ ] **Session continuity** - Conversations persist across messages until reset

**Files to create:** `monitors/telegram_monitor.py`, `monitors/discord_monitor.py`, `monitors/session_store.py`

---

## 13. Scheduled Tasks / Cron Jobs ⏰

**Problem:** The agent only runs on demand. Some tasks benefit from scheduled execution (daily summaries, monitoring, reminders).

**Ideas:**

- [ ] **Cron-style scheduler** - Run agent turns on a schedule
  - Store jobs in `~/.hermes/cron/jobs.json`
  - Each job: `{ id, schedule, prompt, session_mode, delivery }`
  - Uses APScheduler or a similar Python library

- [ ] **Session modes:**
  - `isolated` - Fresh session each run (no history, clean context)
  - `main` - Append to the main session (agent remembers previous scheduled runs)

- [ ] **Delivery options:**
  - Write output to a file (`~/.hermes/cron/output/{job_id}/{timestamp}.md`)
  - Send to a messaging channel (if integrations are enabled)
  - Both

- [ ] **CLI interface:**

  ```bash
  # List scheduled jobs
  python cli.py --cron list

  # Add a job (runs daily at 9am)
  python cli.py --cron add "Summarize my email inbox" --schedule "0 9 * * *"

  # Quick syntax for simple intervals
  python cli.py --cron add "Check server status" --every 30m

  # Remove a job
  python cli.py --cron remove <job_id>
  ```

- [ ] **Agent self-scheduling** - Let the agent create its own cron jobs
  - New tool: `schedule_task(prompt, schedule, session_mode)`
  - "Remind me to check the deployment tomorrow at 9am"
  - Agent can set follow-up tasks for itself

- [ ] **In-chat command:** `/cronjob {prompt} {frequency}` when using messaging integrations
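
Parsing the `--every 30m` quick syntax is straightforward; a sketch:

```python
import re

_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def parse_interval(spec):
    """Turn a quick-syntax interval like '30m' or '2h' into seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", spec.strip())
    if not match:
        raise ValueError(f"Bad interval: {spec!r} (expected e.g. 30m, 2h)")
    value, unit = match.groups()
    return int(value) * _UNITS[unit]
```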

**Files to create:** `cron/scheduler.py`, `cron/jobs.py`, `tools/schedule_tool.py`

---

## 14. Text-to-Speech (TTS) 🔊

**Problem:** The agent can only respond with text. Some users prefer audio responses (accessibility, hands-free use, podcasts).

**Ideas:**

- [ ] **TTS tool** - Generate audio files from text

  ```python
  tts_generate(text="Here's your summary...", voice="nova", output="summary.mp3")
  ```

  - Returns the path to the generated audio file
  - For messaging integrations: can send as a voice message

- [ ] **Provider options:**
  - Edge TTS (free, good quality, many voices)
  - OpenAI TTS (paid, excellent quality)
  - ElevenLabs (paid, best quality, voice cloning)
  - Local options (Coqui TTS, Bark)

- [ ] **Modes:**
  - On-demand: User explicitly asks "read this to me"
  - Auto-TTS: Configurable to always generate audio for responses
  - Long-text handling: Summarize or chunk very long responses

- [ ] **Integration with messaging:**
  - When enabled, can send voice notes instead of/alongside text
  - User preference per channel
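
For the long-text handling above, a sentence-aligned chunker might look like this sketch (the 400-character default is an arbitrary placeholder, not a real provider limit):

```python
import re

def chunk_for_tts(text, max_chars=400):
    """Split long responses into sentence-aligned chunks so each TTS
    request stays under the provider's length limit."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```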

**Files to create:** `tools/tts_tool.py`, config in `cli-config.yaml`

---

## 15. Speech-to-Text / Audio Transcription 🎤

**Problem:** Users may want to send voice memos instead of typing. The agent is blind to audio content.

**Ideas:**

- [ ] **Voice memo transcription** - For messaging integrations
  - User sends a voice message → transcribe → process as text
  - Seamless: user speaks, agent responds

- [ ] **Audio/video file transcription** - Existing idea, expanded:
  - Transcribe local audio files (mp3, wav, m4a)
  - Transcribe YouTube videos (download audio → transcribe)
  - Extract key moments with timestamps

- [ ] **Provider options:**
  - OpenAI Whisper API (good quality, cheap)
  - Deepgram (fast, good for real-time)
  - Local Whisper (free, runs on GPU)
  - Groq Whisper (fast, free tier available)

- [ ] **Tool interface:**

  ```python
  transcribe(source="audio.mp3")                  # Local file
  transcribe(source="https://youtube.com/...")    # YouTube
  transcribe(source="voice_message", data=bytes)  # Voice memo
  ```
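
Dispatching on the three source kinds could look like this sketch (routing only; the actual transcription calls are provider-specific):

```python
def classify_source(source, data=None):
    """Route a transcription request to the right handler.

    Mirrors the proposed interface: local files, remote URLs (e.g.
    YouTube), and raw voice-memo bytes each need different preprocessing.
    """
    if data is not None:
        return "voice_memo"    # raw bytes from a messaging platform
    if source.startswith(("http://", "https://")):
        return "remote_url"    # download audio first, then transcribe
    return "local_file"        # mp3/wav/m4a on disk
```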

**Files to create:** `tools/transcribe_tool.py`, integrate with messaging monitors

---

## Priority Order (Suggested)

1. **🎯 Subagent Architecture** - Critical for context management, enables everything else
2. **Memory & Context Management** - Complements subagents for remaining context
3. **Self-Reflection** - Improves reliability and reduces wasted tool calls
4. **Project-Local Context** - Practical win, keeps useful info across sessions
5. **Messaging Integrations** - Unlocks mobile access, new interaction patterns
6. **Scheduled Tasks / Cron Jobs** - Enables automation, reminders, monitoring
7. **Tool Composition** - Quality of life, builds on other improvements
8. **Dynamic Skills** - Force multiplier for repeated tasks
9. **Interactive Clarifying Questions** - Better UX for ambiguous tasks
10. **TTS / Audio Transcription** - Accessibility, hands-free use

---

## Removed Items (Unrealistic)

The following were removed because they're architecturally impossible:

- ~~Proactive suggestions / Prefetching~~ - Agent only runs on user request, can't interject
- ~~Clipboard integration~~ - No access to the user's local system clipboard

The following **moved to active TODO** (now possible with the new architecture):

- ~~Session save/restore~~ → See **Messaging Integrations** (session persistence)
- ~~Voice/TTS playback~~ → See **TTS** (can generate audio files, send via messaging)
- ~~Set reminders~~ → See **Scheduled Tasks / Cron Jobs**

The following were removed because they're **already possible**:

- ~~HTTP/API Client~~ → Use `curl` or Python `requests` in the terminal
- ~~Structured Data Manipulation~~ → Use `pandas` in the terminal
- ~~Git-Native Operations~~ → Use the `git` CLI in the terminal
- ~~Symbolic Math~~ → Use `SymPy` in the terminal
- ~~Code Quality Tools~~ → Run linters (`eslint`, `black`, `mypy`) in the terminal
- ~~Testing Framework~~ → Run `pytest`, `jest`, etc. in the terminal
- ~~Translation~~ → The LLM handles this fine, or use translation APIs

---
## 🧪 Brainstorm Ideas (Not Yet Fleshed Out)

*These are early-stage ideas that need more thinking before implementation. Captured here so they don't get lost.*

### Remote/Distributed Execution 🌐

**Concept:** Run the agent on a powerful remote server while interacting from a thin client.

**Why interesting:**

- Run on a beefy GPU server for local LLM inference
- Agent has access to the remote machine's resources (files, tools, internet)
- User interacts via a lightweight client (phone, low-power laptop)

**Open questions:**

- How does this differ from just SSH + running cli.py on the remote machine?
- Would need a secure communication channel (WebSocket? gRPC?)
- How to handle tool outputs that reference remote paths?
- Credential management for remote execution
- Latency considerations for interactive use

**Possible architecture:**

```
┌─────────────┐         ┌─────────────────────────┐
│ Thin Client │ ◄─────► │ Remote Hermes Server    │
│ (phone/web) │ WS/API  │ - Full agent + tools    │
└─────────────┘         │ - GPU for local LLM     │
                        │ - Access to server files│
                        └─────────────────────────┘
```

**Related to:** Messaging integrations (could be the "server" that monitors receive from)

---

### Multi-Agent Parallel Execution 🤖🤖

**Concept:** Extension of the Subagent Architecture (Section 1) - run multiple subagents in parallel.

**Why interesting:**

- Independent subtasks don't need to wait for each other
- "Research X while setting up Y" - both run simultaneously
- Faster completion for complex multi-part tasks

**Open questions:**

- How to detect which tasks are truly independent?
- Resource management (API rate limits, concurrent connections)
- How to merge results when parallel tasks have conflicts?
- Cost implications of multiple parallel LLM calls

*Note: Basic subagent delegation (Section 1) should be implemented first; parallel execution is an optimization on top.*

---

### Plugin/Extension System 🔌

**Concept:** Allow users to add custom tools/skills without modifying core code.

**Why interesting:**

- Community contributions
- Organization-specific tools
- Clean separation of core vs. extensions

**Open questions:**

- Security implications of loading arbitrary code
- Versioning and compatibility
- Discovery and installation UX

---

*Last updated: $(date +%Y-%m-%d)* 🤖