# Hermes Agent - Future Improvements

> Ideas for enhancing the agent's capabilities, generated from self-analysis of the codebase.

---

## 🚨 HIGH PRIORITY - Immediate Fixes

These items need to be addressed ASAP:
### 1. SUDO Breaking Terminal Tool 🔐 ✅ COMPLETE

- [x] **Problem:** SUDO commands break the terminal tool execution (hangs indefinitely)
- [x] **Fix:** Created custom environment wrappers in `tools/terminal_tool.py`
  - `stdin=subprocess.DEVNULL` prevents hanging on interactive prompts
  - Sudo fails gracefully with a clear error if no password is configured
  - Same UX as Claude Code - the agent sees the error and tells the user to run the command themselves
- [x] **All 5 environments now have consistent behavior:**
  - `_LocalEnvironment` - local execution
  - `_DockerEnvironment` - Docker containers
  - `_SingularityEnvironment` - Singularity/Apptainer containers
  - `_ModalEnvironment` - Modal cloud sandboxes
  - `_SSHEnvironment` - remote SSH execution
- [x] **Optional sudo support via `SUDO_PASSWORD` env var:**
  - Shared `_transform_sudo_command()` helper used by all environments
  - If set, auto-transforms `sudo cmd` → pipes the password via `sudo -S`
  - Documented in `.env.example`, `cli-config.yaml`, and README
  - Works for chained commands: `cmd1 && sudo cmd2`
- [x] **Interactive sudo prompt in CLI mode:**
  - When sudo is detected and no password is configured, prompts the user
  - 45-second timeout (auto-skips if no input)
  - Hidden password input via `getpass` (password not visible)
  - Password cached for the session (doesn't ask repeatedly)
  - Spinner pauses during the prompt for clean UX
  - Uses `HERMES_INTERACTIVE` env var to detect CLI mode

### 2. Fix `browser_get_images` Tool 🖼️ ✅ VERIFIED WORKING

- [x] **Tested:** Tool works correctly on multiple sites
- [x] **Results:** Successfully extracts image URLs, alt text, dimensions
- [x] **Note:** Some sites (Pixabay, etc.) have Cloudflare bot protection that blocks headless browsers - this is expected behavior, not a bug

### 3. Better Action Logging for Debugging 📝 ✅ COMPLETE

- [x] **Problem:** Need better logging of agent actions for debugging
- [x] **Implementation:**
  - Save full session trajectories to the `logs/` directory as JSON
  - Each session gets a unique file: `session_YYYYMMDD_HHMMSS_UUID.json`
  - Logs all messages, tool calls with inputs/outputs, timestamps
  - Structured JSON format for easy parsing and replay
  - Automatic on CLI runs (configurable)

### 4. Automatic Context Compression 🗜️ ✅ COMPLETE

- [x] **Problem:** Long conversations exceed model context limits, causing errors
- [x] **Solution:** Auto-compress middle turns when approaching the limit
- [x] **Implementation:**
  - Fetches model context lengths from the OpenRouter `/api/v1/models` API (cached for 1 hour)
  - Tracks actual token usage from API responses (`usage.prompt_tokens`)
  - Triggers at 85% of the model's context limit (configurable)
  - Protects the first 3 turns (system, initial request, first response)
  - Protects the last 4 turns (recent context is most relevant)
  - Summarizes middle turns using a fast model (Gemini Flash)
  - Inserts the summary as a user message; the conversation continues seamlessly
  - If a context error occurs, attempts compression before failing
- [x] **Configuration (cli-config.yaml / env vars):**
  - `CONTEXT_COMPRESSION_ENABLED` (default: true)
  - `CONTEXT_COMPRESSION_THRESHOLD` (default: 0.85 = 85%)
  - `CONTEXT_COMPRESSION_MODEL` (default: google/gemini-2.0-flash-001)
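
The trigger logic above boils down to something like this sketch (function name and exact slice arithmetic are illustrative, not the shipped implementation):

```python
def select_compressible_turns(messages, prompt_tokens, context_limit,
                              threshold=0.85, protect_head=3, protect_tail=4):
    """Return the slice of middle turns to summarize, or None if
    compression isn't needed (or there's nothing safe to compress)."""
    if prompt_tokens < threshold * context_limit:
        return None  # still comfortably under the limit
    middle = messages[protect_head:len(messages) - protect_tail]
    if not middle:
        return None  # conversation too short to compress safely
    return middle
```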

### 5. Stream Thinking Summaries in Real-Time 💭 ⏸️ DEFERRED

- [ ] **Problem:** Thinking/reasoning summaries are not shown while streaming
- [ ] **Complexity:** This is a significant refactor - leaving it for later

**OpenRouter Streaming Info:**

- Uses `stream=True` with the OpenAI SDK
- Reasoning comes in `choices[].delta.reasoning_details` chunks
- Types: `reasoning.summary`, `reasoning.text`, `reasoning.encrypted`
- Tool call arguments stream as partial JSON (need accumulation)
- Items paradigm: the same ID is emitted multiple times with updated content

**Key Challenges:**

- Tool call JSON accumulation (partial `{"query": "wea` → `{"query": "weather"}`)
- Multiple concurrent outputs (thinking + tool calls + text simultaneously)
- State management for partial responses
- Error handling if the connection drops mid-stream
- Deciding when tool calls are "complete" enough to execute
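
The accumulation challenge can be sketched as follows (a minimal illustration, assuming deltas arrive as `(index, name, args_fragment)` tuples rather than the SDK's actual delta objects):

```python
import json

def accumulate_tool_calls(deltas):
    """Merge streamed tool-call deltas into complete calls.

    Arguments arrive as partial JSON fragments and only become
    executable once the full string parses.
    """
    calls = {}
    for index, name, args_fragment in deltas:
        call = calls.setdefault(index, {"name": None, "args": ""})
        if name:
            call["name"] = name
        call["args"] += args_fragment

    completed = []
    for index in sorted(calls):
        call = calls[index]
        try:
            arguments = json.loads(call["args"])
        except json.JSONDecodeError:
            continue  # still partial - not safe to execute yet
        completed.append({"name": call["name"], "arguments": arguments})
    return completed
```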

**UX Questions to Resolve:**

- Show raw thinking text or summarized?
- Live expanding text vs. spinner replacement?
- Markdown rendering while streaming?
- How to handle thinking + tool call display simultaneously?

**Implementation Options:**

- New `run_conversation_streaming()` method (keep non-streaming as a fallback)
- Wrapper that handles streaming internally
- Big refactor of the existing `run_conversation()`

**References:**

- https://openrouter.ai/docs/api/reference/streaming
- https://openrouter.ai/docs/guides/best-practices/reasoning-tokens#streaming-response

---

## 1. Subagent Architecture (Context Isolation) 🎯

**Problem:** Long-running tools (terminal commands, browser automation, complex file operations) consume massive context. A single `ls -la` can add hundreds of lines. Browser snapshots, debugging sessions, and iterative terminal work quickly bloat the main conversation, leaving less room for actual reasoning.

**Solution:** The main agent becomes an **orchestrator** that delegates context-heavy tasks to **subagents**.

**Architecture:**

```
┌─────────────────────────────────────────────────────────────────┐
│ ORCHESTRATOR (main agent)                                        │
│ - Receives user request                                          │
│ - Plans approach                                                 │
│ - Delegates heavy tasks to subagents                             │
│ - Receives summarized results                                    │
│ - Maintains clean, focused context                               │
└─────────────────────────────────────────────────────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
│ TERMINAL AGENT  │   │ BROWSER AGENT   │   │ CODE AGENT      │
│ - terminal tool │   │ - browser tools │   │ - file tools    │
│ - file tools    │   │ - web_search    │   │ - terminal      │
│                 │   │ - web_extract   │   │                 │
│ Isolated context│   │ Isolated context│   │ Isolated context│
│ Returns summary │   │ Returns summary │   │ Returns summary │
└─────────────────┘   └─────────────────┘   └─────────────────┘
```

**How it works:**

1. User asks: "Set up a new Python project with FastAPI and tests"
2. Orchestrator plans: "I need to create files, install deps, write code"
3. Orchestrator calls: `terminal_task(goal="Create venv, install fastapi pytest", context="New project in ~/myapp")`
4. **Subagent spawns** with fresh context and only terminal/file tools
5. Subagent iterates (may take 10+ tool calls, lots of output)
6. Subagent completes → returns a summary: "Created venv, installed fastapi==0.109.0, pytest==8.0.0"
7. Orchestrator receives **only the summary**; context stays clean
8. Orchestrator continues with the next subtask

**Key tools to implement:**

- [ ] `terminal_task(goal, context, cwd?)` - Delegate terminal/shell work
- [ ] `browser_task(goal, context, start_url?)` - Delegate web research/automation
- [ ] `code_task(goal, context, files?)` - Delegate code writing/modification
- [ ] Generic `delegate_task(goal, context, toolsets=[])` - Flexible delegation

**Implementation details:**

- [ ] Subagent uses the same `run_agent.py` but with:
  - Fresh/empty conversation history
  - Limited toolset (only what's needed)
  - Smaller max_iterations (focused task)
  - Task-specific system prompt
- [ ] Subagent returns a structured result:

  ```python
  {
      "success": True,
      "summary": "Installed 3 packages, created 2 files",
      "details": "Optional longer explanation if needed",
      "artifacts": ["~/myapp/requirements.txt", "~/myapp/main.py"],  # Files created
      "errors": []  # Any issues encountered
  }
  ```

- [ ] Orchestrator sees only the summary in its context
- [ ] Full subagent transcript saved separately for debugging

**Benefits:**

- 🧹 **Clean context** - Orchestrator stays focused, doesn't drown in tool output
- 📊 **Better token efficiency** - 50 terminal outputs → 1 summary paragraph
- 🎯 **Focused subagents** - Each agent has just the tools it needs
- 🔄 **Parallel potential** - Independent subtasks could run concurrently
- 🐛 **Easier debugging** - Each subtask has its own isolated transcript

**When to use subagents vs direct tools:**

- **Subagent**: Multi-step tasks, iteration likely, lots of output expected
- **Direct**: Quick one-off commands, simple file reads, user needs to see output

**Files to modify:** `run_agent.py` (add orchestration mode), new `tools/delegate_tools.py`, new `subagent_runner.py`
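
In outline, generic delegation could look like this sketch (the `run_subagent` callable and its keyword arguments are hypothetical stand-ins for whatever wrapper around `run_agent.py` gets built):

```python
def delegate_task(goal, context, toolsets, run_subagent):
    """Run a focused subagent and surface only its summary.

    `run_subagent` is whatever callable actually spins up the agent
    loop; it is injected here so the orchestrator stays decoupled
    from the runner.
    """
    result = run_subagent(
        system_prompt=f"You are a focused subagent. Goal: {goal}\nContext: {context}",
        toolsets=toolsets,
        max_iterations=15,        # focused task, tighter budget
        conversation_history=[],  # fresh, isolated context
    )
    # Only this compact line enters the orchestrator's context.
    status = "done" if result.get("success") else "failed"
    return f"[subagent {status}] {result.get('summary', '')}"
```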

---

## 2. Planning & Task Management 📋

**Problem:** Agent handles tasks reactively without explicit planning. Complex multi-step tasks lack structure, progress tracking, and the ability to decompose work into manageable chunks.

**Ideas:**

- [ ] **Task decomposition tool** - Break complex requests into subtasks:

  ```
  User: "Set up a new Python project with FastAPI, tests, and Docker"

  Agent creates plan:
  ├── 1. Create project structure and requirements.txt
  ├── 2. Implement FastAPI app skeleton
  ├── 3. Add pytest configuration and initial tests
  ├── 4. Create Dockerfile and docker-compose.yml
  └── 5. Verify everything works together
  ```

  - Each subtask becomes a trackable unit
  - Agent can report progress: "Completed 3/5 tasks"

- [ ] **Progress checkpoints** - Periodic self-assessment:
  - After N tool calls or time elapsed, pause to evaluate
  - "What have I accomplished? What remains? Am I on track?"
  - Detect if stuck in loops or making no progress
  - Could trigger replanning if the approach isn't working

- [ ] **Explicit plan storage** - Persist the plan in the conversation:
  - Store as structured data (not just in context)
  - Update status as tasks complete
  - User can ask "What's the plan?" or "What's left?"
  - Survives context compression (plans are protected)

- [ ] **Failure recovery with replanning** - When things go wrong:
  - Record what failed and why
  - Revise the plan to work around the issue
  - "Step 3 failed because X, adjusting approach to Y"
  - Prevents repeating failed strategies

**Files to modify:** `run_agent.py` (add planning hooks), new `tools/planning_tool.py`
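
A minimal sketch of the structured, compression-safe plan object described above (names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    """Structured plan stored outside the raw conversation context."""
    tasks: list = field(default_factory=list)  # [{"title": str, "done": bool}]

    def add(self, title):
        self.tasks.append({"title": title, "done": False})

    def complete(self, index):
        self.tasks[index]["done"] = True

    def progress(self):
        done = sum(1 for t in self.tasks if t["done"])
        return f"Completed {done}/{len(self.tasks)} tasks"
```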

---

## 3. Tool Composition & Learning 🔧

**Problem:** Tools are atomic. Complex tasks require repeated manual orchestration of the same tool sequences.

**Ideas:**

- [ ] **Macro tools / Tool chains** - Define reusable tool sequences:

  ```yaml
  research_topic:
    description: "Deep research on a topic"
    steps:
      - web_search: {query: "$topic"}
      - web_extract: {urls: "$search_results.urls[:3]"}
      - summarize: {content: "$extracted"}
  ```

  - Could be defined in skills or a new `macros/` directory
  - Agent can invoke a macro as a single tool call

- [ ] **Tool failure patterns** - Learn from failures:
  - Track: tool, input pattern, error type, what worked instead
  - Before calling a tool, check: "Has this pattern failed before?"
  - Persistent across sessions (stored in skills or a separate DB)

- [ ] **Parallel tool execution** - When tools are independent, run them concurrently:
  - Detect independence (no data dependencies between calls)
  - Use `asyncio.gather()` for parallel execution
  - Already have async support in some tools, just need orchestration
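
The `asyncio.gather()` orchestration could be as small as this sketch (assuming the tools involved are already async):

```python
import asyncio

async def run_tools_parallel(calls):
    """Execute independent tool calls concurrently.

    `calls` is a list of (async_tool, kwargs) pairs with no data
    dependencies between them; results come back in the same order.
    """
    return await asyncio.gather(*(tool(**kwargs) for tool, kwargs in calls))
```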

**Files to modify:** `model_tools.py`, `toolsets.py`, new `tool_macros.py`

---

## 4. Dynamic Skills Expansion 📚

**Problem:** The skills system is elegant but static. Skills must be manually created and added.

**Ideas:**

- [ ] **Skill acquisition from successful tasks** - After completing a complex task:
  - "This approach worked well. Save as a skill?"
  - Extract: goal, steps taken, tools used, key decisions
  - Generate SKILL.md automatically
  - Store in the user's skills directory

- [ ] **Skill templates** - Common patterns that can be parameterized:

  ```markdown
  # Debug {language} Error
  1. Reproduce the error
  2. Search for the error message: `web_search("{error_message} {language}")`
  3. Check common causes: {common_causes}
  4. Apply fix and verify
  ```

- [ ] **Skill chaining** - Combine skills for complex workflows:
  - Skills can reference other skills as dependencies
  - "To do X, first apply skill Y, then skill Z"
  - Directed graph of skill dependencies

**Files to modify:** `tools/skills_tool.py`, `skills/` directory structure, new `skill_generator.py`

---

## 5. Task Continuation Hints 🎯

**Problem:** The agent could be more helpful by suggesting logical next steps.

**Ideas:**

- [ ] **Suggest next steps** - At the end of a task, suggest logical continuations:
  - "Code is written. Want me to also write tests / docs / deploy?"
  - Based on common workflows for the task type
  - Non-intrusive, just offer options

**Files to modify:** `run_agent.py`, response generation logic

---

## 6. Interactive Clarifying Questions Tool ❓

**Problem:** Agent sometimes makes assumptions or guesses when it should ask the user. Currently it can only ask via text, which gets lost in long outputs.

**Ideas:**

- [ ] **Multiple-choice prompt tool** - Let the agent present structured choices to the user:

  ```
  ask_user_choice(
      question="Should the language switcher enable only German or all languages?",
      choices=[
          "Only enable German - works immediately",
          "Enable all, mark untranslated - show fallback notice",
          "Let me specify something else"
      ]
  )
  ```

  - Renders as an interactive terminal UI with arrow key / Tab navigation
  - User selects an option, result returned to the agent
  - Up to 4 choices + optional free-text option

- [ ] **Implementation:**
  - Use the `inquirer` or `questionary` Python library for rich terminal prompts
  - Tool returns the selected option text (or the user's custom input)
  - **CLI-only** - only works when running via `cli.py` (not API/programmatic use)
  - Graceful fallback: if not in interactive mode, return an error asking the agent to rephrase as text

- [ ] **Use cases:**
  - Clarify ambiguous requirements before starting work
  - Confirm destructive operations with clear options
  - Let the user choose between implementation approaches
  - Checkpoint complex multi-step workflows
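
A sketch of the tool with the graceful fallback described above (assumes `questionary` and the `HERMES_INTERACTIVE` convention mentioned earlier; the exact detection logic is illustrative):

```python
import os

try:
    import questionary  # third-party; only needed in interactive CLI mode
except ImportError:
    questionary = None

def ask_user_choice(question, choices):
    """Present structured choices to the user - CLI-only.

    Falls back to an error result when not running interactively so
    the agent can rephrase the question as plain text instead.
    """
    interactive = os.environ.get("HERMES_INTERACTIVE") == "1" and questionary is not None
    if not interactive:
        return {"error": "Not in interactive mode - ask the user in plain text instead."}
    answer = questionary.select(question, choices=choices).ask()
    return {"selected": answer}
```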

**Files to modify:** New `tools/ask_user_tool.py`, `cli.py` (detect interactive mode), `model_tools.py`

---

## 7. Resource Awareness & Efficiency 💰

**Problem:** No awareness of costs, time, or resource usage. Could be smarter about efficiency.

**Ideas:**

- [ ] **Tool result caching** - Don't repeat identical operations:
  - Cache web searches and extractions within a session
  - Invalidation based on the time-sensitivity of the query
  - Hash-based lookup: same input → cached output

- [ ] **Lazy evaluation** - Don't fetch everything upfront:
  - Get summaries first, full content only if needed
  - "I found 5 relevant pages. Want me to deep-dive on any?"

**Files to modify:** `model_tools.py`, new `resource_tracker.py`

---

## 8. Collaborative Problem Solving 🤝

**Problem:** Interaction is command/response. Complex problems benefit from dialogue.

**Ideas:**

- [ ] **Assumption surfacing** - Make implicit assumptions explicit:
  - "I'm assuming you want Python 3.11+. Correct?"
  - "This solution assumes you have sudo access..."
  - Let the user correct course before going down the wrong path

- [ ] **Checkpoint & confirm** - For high-stakes operations:
  - "About to delete 47 files. Here's the list - proceed?"
  - "This will modify your database. Want a backup first?"
  - Configurable threshold for when to ask

**Files to modify:** `run_agent.py`, system prompt configuration

---

## 9. Project-Local Context 💾

**Problem:** Valuable context is lost between sessions.

**Ideas:**

- [ ] **Project awareness** - Remember project-specific context:
  - Store `.hermes/context.md` in the project directory
  - "This is a Django project using PostgreSQL"
  - Coding style preferences, deployment setup, etc.
  - Load automatically when working in that directory

- [ ] **Handoff notes** - Leave notes for future sessions:
  - Write to `.hermes/notes.md` in the project
  - "TODO for next session: finish implementing X"
  - "Known issues: Y doesn't work on Windows"

**Files to modify:** New `project_context.py`, auto-load in `run_agent.py`

---

## 10. Graceful Degradation & Robustness 🛡️

**Problem:** When things go wrong, recovery is limited. The agent should fail gracefully.

**Ideas:**

- [ ] **Fallback chains** - When the primary approach fails, have backups:
  - `web_extract` fails → try `browser_navigate` → try `web_search` for a cached version
  - Define fallback order per tool type

- [ ] **Partial progress preservation** - Don't lose work on failure:
  - Long task fails midway → save what we've got
  - "I completed 3/5 steps before the error. Here's what I have..."

- [ ] **Self-healing** - Detect and recover from bad states:
  - Browser stuck → close and retry
  - Terminal hung → timeout and reset
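
A fallback chain executor might be as simple as this sketch (illustrative; real tools would raise tool-specific errors rather than bare `Exception`):

```python
def run_with_fallbacks(chain, *args):
    """Try each (name, tool) in order until one succeeds.

    Returns (name_that_worked, result); re-raises the last error if
    every fallback fails.
    """
    last_error = None
    for name, tool in chain:
        try:
            return name, tool(*args)
        except Exception as exc:
            last_error = exc  # remember and move on to the next fallback
    raise last_error
```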

**Files to modify:** `model_tools.py`, tool implementations, new `fallback_manager.py`

---

## 11. Tools & Skills Wishlist 🧰

*Things that would need new tool implementations (can't do well with current tools):*

### High-Impact

- [ ] **Audio/Video Transcription** 🎬 *(see also the Speech-to-Text section below for a detailed spec)*
  - Transcribe audio files, podcasts, YouTube videos
  - Extract key moments from video
  - Voice memo transcription for messaging integrations
  - *Provider options: Whisper API, Deepgram, local Whisper*

- [ ] **Diagram Rendering** 📊
  - Render Mermaid/PlantUML to actual images
  - Can generate the code, but rendering requires an external service or tool
  - "Show me how these components connect" → actual visual diagram
### Medium-Impact

- [ ] **Canvas / Visual Workspace** 🖼️
  - Agent-controlled visual panel for rendering interactive UI
  - Inspired by OpenClaw's Canvas feature
  - **Capabilities:**
    - `present` / `hide` - Show/hide the canvas panel
    - `navigate` - Load HTML files or URLs into the canvas
    - `eval` - Execute JavaScript in the canvas context
    - `snapshot` - Capture the rendered UI as an image
  - **Use cases:**
    - Display generated HTML/CSS/JS previews
    - Show interactive data visualizations (charts, graphs)
    - Render diagrams (Mermaid → rendered output)
    - Present structured information in rich format
    - A2UI-style component system for structured agent UI
  - **Implementation options:**
    - Electron-based panel for CLI
    - WebSocket-connected web app
    - VS Code webview extension
  - *Would let the agent "show" things rather than just describe them*

- [ ] **Document Generation** 📄
  - Create styled PDFs, Word docs, presentations
  - *Can do basic PDFs via terminal tools, but limited*

- [ ] **Diff/Patch Tool** 📝
  - Surgical code modifications with preview
  - "Change lines 45-50 to X" without rewriting the whole file
  - Show diffs before applying
  - *Can use `diff`/`patch`, but a native tool would be safer*
### Skills to Create

- [ ] **Domain-specific skill packs:**
  - DevOps/Infrastructure (Terraform, K8s, AWS)
  - Data Science workflows (EDA, model training)
  - Security/pentesting procedures

- [ ] **Framework-specific skills:**
  - React/Vue/Angular patterns
  - Django/Rails/Express conventions
  - Database optimization playbooks

- [ ] **Troubleshooting flowcharts:**
  - "Docker container won't start" → decision tree
  - "Production is slow" → systematic diagnosis

---

## 12. Messaging Platform Integrations 💬

**Problem:** The agent currently only works via `cli.py`, which requires direct terminal access. Users may want to interact via messaging apps from their phone or other devices.

**Architecture:**

- `run_agent.py` already accepts a `conversation_history` parameter and returns updated messages ✅
- Need: persistent session storage, platform monitors, session key resolution

**Implementation approach:**

```
┌─────────────────────────────────────────────────────────────┐
│ Platform Monitor (e.g., telegram_monitor.py)                │
│  ├─ Long-running daemon connecting to messaging platform    │
│  ├─ On message: resolve session key → load history from disk│
│  ├─ Call run_agent.py with loaded history                   │
│  ├─ Save updated history back to disk (JSONL)               │
│  └─ Send response back to platform                          │
└─────────────────────────────────────────────────────────────┘
```

**Platform support (each user sets up their own credentials):**

- [ ] **Telegram** - via `python-telegram-bot` or a `grammy` equivalent
  - Bot token from @BotFather
  - Easiest to set up, good for personal use
- [ ] **Discord** - via `discord.py`
  - Bot token from the Discord Developer Portal
  - Can work in servers (group sessions) or DMs
- [ ] **WhatsApp** - via `baileys` (WhatsApp Web protocol)
  - QR code scan to authenticate
  - More complex, but reaches most people

**Session management:**

- [ ] **Session store** - JSONL persistence per session key
  - `~/.hermes/sessions/{session_key}.jsonl`
  - Session keys: `telegram:dm:{user_id}`, `discord:channel:{id}`, etc.
- [ ] **Session expiry** - Configurable reset policies
  - Daily reset (default 4am) OR idle timeout (e.g., 2 hours)
  - Manual reset via a `/reset` or `/new` command in chat
- [ ] **Session continuity** - Conversations persist across messages until reset

**Files to create:** `monitors/telegram_monitor.py`, `monitors/discord_monitor.py`, `monitors/session_store.py`

---

## 13. Scheduled Tasks / Cron Jobs ⏰

**Problem:** The agent only runs on demand. Some tasks benefit from scheduled execution (daily summaries, monitoring, reminders).

**Ideas:**

- [ ] **Cron-style scheduler** - Run agent turns on a schedule
  - Store jobs in `~/.hermes/cron/jobs.json`
  - Each job: `{ id, schedule, prompt, session_mode, delivery }`
  - Uses APScheduler or a similar Python library

- [ ] **Session modes:**
  - `isolated` - Fresh session each run (no history, clean context)
  - `main` - Append to the main session (agent remembers previous scheduled runs)

- [ ] **Delivery options:**
  - Write output to a file (`~/.hermes/cron/output/{job_id}/{timestamp}.md`)
  - Send to a messaging channel (if integrations are enabled)
  - Both

- [ ] **CLI interface:**

  ```bash
  # List scheduled jobs
  python cli.py --cron list

  # Add a job (runs daily at 9am)
  python cli.py --cron add "Summarize my email inbox" --schedule "0 9 * * *"

  # Quick syntax for simple intervals
  python cli.py --cron add "Check server status" --every 30m

  # Remove a job
  python cli.py --cron remove <job_id>
  ```

- [ ] **Agent self-scheduling** - Let the agent create its own cron jobs
  - New tool: `schedule_task(prompt, schedule, session_mode)`
  - "Remind me to check the deployment tomorrow at 9am"
  - Agent can set follow-up tasks for itself

- [ ] **In-chat command:** `/cronjob {prompt} {frequency}` when using messaging integrations
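
Parsing the `--every 30m` quick syntax is straightforward; a sketch:

```python
import re

_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def parse_interval(spec):
    """Turn a quick-syntax interval like '30m' or '2h' into seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", spec.strip())
    if not match:
        raise ValueError(f"Bad interval: {spec!r} (expected e.g. 30m, 2h)")
    value, unit = match.groups()
    return int(value) * _UNITS[unit]
```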

**Files to create:** `cron/scheduler.py`, `cron/jobs.py`, `tools/schedule_tool.py`

---

## 14. Text-to-Speech (TTS) 🔊

**Problem:** The agent can only respond with text. Some users prefer audio responses (accessibility, hands-free use, podcasts).

**Ideas:**

- [ ] **TTS tool** - Generate audio files from text

  ```python
  tts_generate(text="Here's your summary...", voice="nova", output="summary.mp3")
  ```

  - Returns the path to the generated audio file
  - For messaging integrations: can send as a voice message

- [ ] **Provider options:**
  - Edge TTS (free, good quality, many voices)
  - OpenAI TTS (paid, excellent quality)
  - ElevenLabs (paid, best quality, voice cloning)
  - Local options (Coqui TTS, Bark)

- [ ] **Modes:**
  - On-demand: User explicitly asks "read this to me"
  - Auto-TTS: Configurable to always generate audio for responses
  - Long-text handling: Summarize or chunk very long responses

- [ ] **Integration with messaging:**
  - When enabled, can send voice notes instead of/alongside text
  - User preference per channel
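
For the long-text handling above, a sentence-aligned chunker might look like this sketch (the 400-character default is an arbitrary placeholder, not a real provider limit):

```python
import re

def chunk_for_tts(text, max_chars=400):
    """Split long responses into sentence-aligned chunks so each TTS
    request stays under the provider's length limit."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```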

**Files to create:** `tools/tts_tool.py`, config in `cli-config.yaml`

---

## 15. Speech-to-Text / Audio Transcription 🎤

**Problem:** Users may want to send voice memos instead of typing. The agent is blind to audio content.

**Ideas:**

- [ ] **Voice memo transcription** - For messaging integrations
  - User sends a voice message → transcribe → process as text
  - Seamless: user speaks, agent responds

- [ ] **Audio/video file transcription** - Existing idea, expanded:
  - Transcribe local audio files (mp3, wav, m4a)
  - Transcribe YouTube videos (download audio → transcribe)
  - Extract key moments with timestamps

- [ ] **Provider options:**
  - OpenAI Whisper API (good quality, cheap)
  - Deepgram (fast, good for real-time)
  - Local Whisper (free, runs on GPU)
  - Groq Whisper (fast, free tier available)

- [ ] **Tool interface:**

  ```python
  transcribe(source="audio.mp3")                  # Local file
  transcribe(source="https://youtube.com/...")    # YouTube
  transcribe(source="voice_message", data=bytes)  # Voice memo
  ```
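
Dispatching on the three source kinds could look like this sketch (routing only; the actual transcription calls are provider-specific):

```python
def classify_source(source, data=None):
    """Route a transcription request to the right handler.

    Mirrors the proposed interface: local files, remote URLs (e.g.
    YouTube), and raw voice-memo bytes each need different preprocessing.
    """
    if data is not None:
        return "voice_memo"    # raw bytes from a messaging platform
    if source.startswith(("http://", "https://")):
        return "remote_url"    # download audio first, then transcribe
    return "local_file"        # mp3/wav/m4a on disk
```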

**Files to create:** `tools/transcribe_tool.py`, integrate with messaging monitors

---

## Priority Order (Suggested)

1. **🎯 Subagent Architecture** - Critical for context management, enables everything else
2. **Memory & Context Management** - Complements subagents for remaining context
3. **Self-Reflection** - Improves reliability and reduces wasted tool calls
4. **Project-Local Context** - Practical win, keeps useful info across sessions
5. **Messaging Integrations** - Unlocks mobile access, new interaction patterns
6. **Scheduled Tasks / Cron Jobs** - Enables automation, reminders, monitoring
7. **Tool Composition** - Quality of life, builds on other improvements
8. **Dynamic Skills** - Force multiplier for repeated tasks
9. **Interactive Clarifying Questions** - Better UX for ambiguous tasks
10. **TTS / Audio Transcription** - Accessibility, hands-free use

---

## Removed Items (Unrealistic)

The following were removed because they're architecturally impossible:

- ~~Proactive suggestions / Prefetching~~ - Agent only runs on user request, can't interject
- ~~Clipboard integration~~ - No access to the user's local system clipboard

The following **moved to active TODO** (now possible with the new architecture):

- ~~Session save/restore~~ → See **Messaging Integrations** (session persistence)
- ~~Voice/TTS playback~~ → See **TTS** (can generate audio files, send via messaging)
- ~~Set reminders~~ → See **Scheduled Tasks / Cron Jobs**

The following were removed because they're **already possible**:

- ~~HTTP/API Client~~ → Use `curl` or Python `requests` in the terminal
- ~~Structured Data Manipulation~~ → Use `pandas` in the terminal
- ~~Git-Native Operations~~ → Use the `git` CLI in the terminal
- ~~Symbolic Math~~ → Use `SymPy` in the terminal
- ~~Code Quality Tools~~ → Run linters (`eslint`, `black`, `mypy`) in the terminal
- ~~Testing Framework~~ → Run `pytest`, `jest`, etc. in the terminal
- ~~Translation~~ → The LLM handles this fine, or use translation APIs

---
## 🧪 Brainstorm Ideas (Not Yet Fleshed Out)

*These are early-stage ideas that need more thinking before implementation. Captured here so they don't get lost.*

### Remote/Distributed Execution 🌐

**Concept:** Run the agent on a powerful remote server while interacting from a thin client.

**Why interesting:**

- Run on a beefy GPU server for local LLM inference
- Agent has access to the remote machine's resources (files, tools, internet)
- User interacts via a lightweight client (phone, low-power laptop)

**Open questions:**

- How does this differ from just SSH + running cli.py on the remote machine?
- Would need a secure communication channel (WebSocket? gRPC?)
- How to handle tool outputs that reference remote paths?
- Credential management for remote execution
- Latency considerations for interactive use

**Possible architecture:**

```
┌─────────────┐         ┌─────────────────────────┐
│ Thin Client │ ◄─────► │ Remote Hermes Server    │
│ (phone/web) │ WS/API  │ - Full agent + tools    │
└─────────────┘         │ - GPU for local LLM     │
                        │ - Access to server files│
                        └─────────────────────────┘
```

**Related to:** Messaging integrations (could be the "server" that monitors receive from)

---

### Multi-Agent Parallel Execution 🤖🤖

**Concept:** Extension of the Subagent Architecture (Section 1) - run multiple subagents in parallel.

**Why interesting:**

- Independent subtasks don't need to wait for each other
- "Research X while setting up Y" - both run simultaneously
- Faster completion for complex multi-part tasks

**Open questions:**

- How to detect which tasks are truly independent?
- Resource management (API rate limits, concurrent connections)
- How to merge results when parallel tasks have conflicts?
- Cost implications of multiple parallel LLM calls

*Note: Basic subagent delegation (Section 1) should be implemented first; parallel execution is an optimization on top.*

---

### Plugin/Extension System 🔌

**Concept:** Allow users to add custom tools/skills without modifying core code.

**Why interesting:**

- Community contributions
- Organization-specific tools
- Clean separation of core vs. extensions

**Open questions:**

- Security implications of loading arbitrary code
- Versioning and compatibility
- Discovery and installation UX

---

*Last updated: $(date +%Y-%m-%d)* 🤖