The architecture has been updated

This commit is contained in:
Skyber_2 2026-03-31 23:31:36 +03:00
parent 805f7a017e
commit a01257ead9
1119 changed files with 226 additions and 352 deletions


@@ -0,0 +1,334 @@
# Hermes-Agent Atropos Environments
This directory contains the integration layer between **hermes-agent's** tool-calling capabilities and the **Atropos** RL training framework. It provides everything needed to run agentic LLMs through multi-turn tool-calling loops, score their output with arbitrary reward functions, and feed results into Atropos for training or evaluation.
## Architecture Overview
```
Atropos Framework
┌───────────────────────┐
│ BaseEnv │ (atroposlib)
│ - Server management │
│ - Worker scheduling │
│ - Wandb logging │
│ - CLI (serve/process/ │
│ evaluate) │
└───────────┬───────────┘
│ inherits
┌───────────┴───────────┐
│ HermesAgentBaseEnv │ hermes_base_env.py
│ - Terminal backend │
│ - Tool resolution │
│ - Agent loop │
│ - ToolContext │
│ - Async patches │
└───────────┬───────────┘
│ inherits
┌─────────────────┼─────────────────┐
│ │ │
TerminalTestEnv HermesSweEnv TerminalBench2EvalEnv
(stack testing) (SWE training) (TB2 benchmark eval)
```
### Inheritance Chain
**BaseEnv** (from `atroposlib`) is the Atropos base class. It provides:
- Server management (OpenAI-compatible API servers, VLLM, SGLang)
- Worker scheduling for parallel rollouts
- Wandb integration for metrics and rollout logging
- CLI interface with three subcommands: `serve`, `process`, `evaluate`
- `evaluate_log()` for saving eval results to JSON + samples.jsonl
**HermesAgentBaseEnv** (`hermes_base_env.py`) extends BaseEnv with hermes-agent specifics:
- Sets `os.environ["TERMINAL_ENV"]` to configure the terminal backend (local, docker, modal, daytona, ssh, singularity)
- Resolves hermes-agent toolsets via `_resolve_tools_for_group()` (calls `get_tool_definitions()` which queries `tools/registry.py`)
- Implements `collect_trajectory()` which runs the full agent loop and computes rewards
- Supports two-phase operation (Phase 1: OpenAI server, Phase 2: VLLM ManagedServer)
- Applies monkey patches for async-safe tool operation at import time
Concrete environments inherit from `HermesAgentBaseEnv` and implement:
- `setup()` -- Load dataset, initialize state
- `get_next_item()` -- Return the next item for rollout
- `format_prompt()` -- Convert a dataset item into the user message
- `compute_reward()` -- Score the rollout using ToolContext
- `evaluate()` -- Periodic evaluation logic
## Core Components
### Agent Loop (`agent_loop.py`)
`HermesAgentLoop` is the reusable multi-turn agent engine. It runs the same pattern as hermes-agent's `run_agent.py`:
1. Send messages + tools to the API via `server.chat_completion()`
2. If the response contains `tool_calls`, execute each one via `handle_function_call()` (which delegates to `tools/registry.py`'s `dispatch()`)
3. Append tool results to the conversation and go back to step 1
4. If the response has no tool_calls, the agent is done
Tool calls are executed in a thread pool (`run_in_executor`) so backends that use `asyncio.run()` internally (Modal, Docker) don't deadlock inside Atropos's event loop.
Returns an `AgentResult` containing the full conversation history, turn count, reasoning content per turn, tool errors, and optional ManagedServer state (for Phase 2).
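A minimal usage sketch (constructor arguments mirror `agent_loop.py`; `self.server`, `tool_schemas`, `system_prompt`, `user_prompt`, and `task_id` are assumed to come from the enclosing environment, with `tool_schemas` in OpenAI format):
```python
from environments.agent_loop import HermesAgentLoop

loop = HermesAgentLoop(
    server=self.server,                  # any object exposing chat_completion()
    tool_schemas=tool_schemas,           # OpenAI-format tool definitions
    valid_tool_names={s["function"]["name"] for s in tool_schemas},
    max_turns=30,
    task_id=task_id,                     # scopes the terminal/browser session
    temperature=1.0,
)
result = await loop.run([
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
])
print(result.turns_used, result.finished_naturally, len(result.tool_errors))
```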
### Tool Context (`tool_context.py`)
`ToolContext` is a per-rollout handle that gives reward/verification functions direct access to **all** hermes-agent tools, scoped to the rollout's `task_id`. Because the same `task_id` is reused, the terminal/browser session is the **same** one the model used during its rollout -- all state (files, processes, browser tabs) is preserved.
```python
async def compute_reward(self, item, result, ctx: ToolContext):
# Run tests in the model's terminal sandbox
test = ctx.terminal("pytest -v")
if test["exit_code"] == 0:
return 1.0
# Check if a file was created
content = ctx.read_file("/workspace/solution.py")
if content.get("content"):
return 0.5
# Download files locally for verification (binary-safe)
ctx.download_file("/remote/output.bin", "/local/output.bin")
return 0.0
```
Available methods (composed in the short sketch after this list):
- **Terminal**: `terminal(command, timeout)` -- run shell commands
- **Files**: `read_file(path)`, `write_file(path, content)`, `search(query, path)`
- **Transfers**: `upload_file()`, `upload_dir()`, `download_file()`, `download_dir()` -- binary-safe file transfers between host and sandbox
- **Web**: `web_search(query)`, `web_extract(urls)`
- **Browser**: `browser_navigate(url)`, `browser_snapshot()`
- **Generic**: `call_tool(name, args)` -- call any hermes-agent tool by name
- **Cleanup**: `cleanup()` -- release all resources (called automatically after `compute_reward`)
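The transfer and generic helpers compose the same way. A short sketch (the `web_search` argument key is an assumption based on the signatures listed above):
```python
# Pull the agent's artifacts out of the sandbox for local inspection
# (remote path first, local path second, matching download_file above).
ctx.download_dir("/workspace/results", "/tmp/results")

# Any hermes-agent tool can also be reached by name via the generic helper.
hits = ctx.call_tool("web_search", {"query": "pytest exit code meanings"})

# Usually unnecessary: cleanup() runs automatically after compute_reward().
ctx.cleanup()
```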
### Patches (`patches.py`)
**Problem**: Some hermes-agent tools use `asyncio.run()` internally (e.g., the Modal backend via SWE-ReX). This crashes when called from inside Atropos's event loop because `asyncio.run()` cannot be nested.
**Solution**: `patches.py` monkey-patches `SwerexModalEnvironment` to use a dedicated background thread (`_AsyncWorker`) with its own event loop. The calling code sees the same sync interface, but internally the async work happens on a separate thread that doesn't conflict with Atropos's loop. A minimal version of this pattern is sketched at the end of this subsection.
What gets patched:
- `SwerexModalEnvironment.__init__` -- creates Modal deployment on a background thread
- `SwerexModalEnvironment.execute` -- runs commands on the same background thread
- `SwerexModalEnvironment.stop` -- stops deployment on the background thread
The patches are:
- **Idempotent** -- calling `apply_patches()` multiple times is safe
- **Transparent** -- same interface and behavior, only the internal async execution changes
- **Universal** -- works identically inside Atropos's event loop and in normal CLI use (where no event loop is running)
Applied automatically at import time by `hermes_base_env.py`.
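The underlying pattern is simple: a worker thread owns its own event loop, and the patched sync methods submit coroutines to it. An illustrative sketch (not the actual `_AsyncWorker` implementation):
```python
import asyncio
import threading


class AsyncWorker:
    """Owns a private event loop on a daemon thread; sync callers submit coroutines to it."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro):
        # The coroutine executes on this worker's private loop, so the caller
        # never calls asyncio.run() and therefore never nests event loops.
        return asyncio.run_coroutine_threadsafe(coro, self.loop).result()
```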
### Tool Call Parsers (`tool_call_parsers/`)
Client-side parsers that extract structured `tool_calls` from raw model output text. Used in **Phase 2** (VLLM server type) where ManagedServer's `/generate` endpoint returns raw text without tool call parsing.
Each parser is a standalone reimplementation of the corresponding VLLM parser's `extract_tool_calls()` logic. No VLLM dependency -- only standard library (`re`, `json`, `uuid`) and `openai` types.
Available parsers:
- `hermes` -- Hermes/ChatML `<tool_call>` XML format
- `mistral` -- Mistral `[TOOL_CALLS]` format
- `llama3_json` -- Llama 3 JSON tool calling
- `qwen` -- Qwen tool calling format
- `qwen3_coder` -- Qwen3 Coder format
- `deepseek_v3` -- DeepSeek V3 format
- `deepseek_v3_1` -- DeepSeek V3.1 format
- `kimi_k2` -- Kimi K2 format
- `longcat` -- Longcat format
- `glm45` / `glm47` -- GLM model formats
Usage:
```python
from environments.tool_call_parsers import get_parser
parser = get_parser("hermes")
content, tool_calls = parser.parse(raw_model_output)
```
In Phase 1 (OpenAI server type), these parsers are not needed -- the server handles tool call parsing natively.
## Two-Phase Operation
### Phase 1: OpenAI Server (Evaluation / SFT Data Generation)
Uses `server.chat_completion()` with the `tools=` parameter. The server (VLLM, SGLang, OpenRouter, OpenAI) parses tool calls natively and returns `ChatCompletion` objects with structured `tool_calls` (a single-turn sketch follows the bullets below).
- Good for: evaluation, SFT data generation, testing
- Run with: `serve` (with `run-api`), `process`, or `evaluate` subcommands
- Placeholder tokens are created for the Atropos pipeline
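A single Phase 1 turn is a plain OpenAI-spec request. Sketch, assuming `server`, `messages`, and `tool_schemas` from the surrounding environment:
```python
response = await server.chat_completion(
    messages=messages,
    tools=tool_schemas,
    n=1,
    temperature=1.0,
)
msg = response.choices[0].message
if msg.tool_calls:
    # Already-structured tool calls, parsed server-side; the agent loop
    # executes each one via handle_function_call() and appends the results.
    print([tc.function.name for tc in msg.tool_calls])
```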
### Phase 2: VLLM ManagedServer (Full RL Training)
Uses ManagedServer for exact token IDs + logprobs via `/generate`. Client-side tool call parser (from `tool_call_parsers/`) reconstructs structured `tool_calls` from raw output.
- Good for: full RL training with GRPO/PPO
- Run with: `serve` subcommand
- Real tokens, masks, and logprobs flow through the pipeline
## Directory Structure
```
environments/
├── README.md # This file
├── __init__.py # Package exports
├── hermes_base_env.py # Abstract base (HermesAgentBaseEnv)
├── agent_loop.py # Multi-turn agent engine (HermesAgentLoop)
├── tool_context.py # Per-rollout tool access for reward functions
├── patches.py # Async-safety patches for Modal backend
├── tool_call_parsers/ # Phase 2 client-side parsers
│ ├── __init__.py # Registry + base class
│ ├── hermes_parser.py
│ ├── mistral_parser.py
│ ├── llama_parser.py
│ ├── qwen_parser.py
│ ├── qwen3_coder_parser.py
│ ├── deepseek_v3_parser.py
│ ├── deepseek_v3_1_parser.py
│ ├── kimi_k2_parser.py
│ ├── longcat_parser.py
│ ├── glm45_parser.py
│ └── glm47_parser.py
├── terminal_test_env/ # Stack validation environment
│ └── terminal_test_env.py
├── hermes_swe_env/ # SWE-bench style training environment
│ └── hermes_swe_env.py
└── benchmarks/ # Evaluation benchmarks
├── terminalbench_2/ # 89 terminal tasks, Modal sandboxes
│ └── terminalbench2_env.py
├── tblite/ # 100 calibrated tasks (fast TB2 proxy)
│ └── tblite_env.py
└── yc_bench/ # Long-horizon strategic benchmark
└── yc_bench_env.py
```
## Concrete Environments
### TerminalTestEnv (`terminal_test_env/`)
A self-contained environment with inline tasks (no external dataset needed) for validating the full stack end-to-end. Each task asks the model to create a file at a known path, and the verifier checks the content matches.
```bash
# Serve mode (needs run-api)
run-api
python environments/terminal_test_env/terminal_test_env.py serve
# Process mode (no run-api, saves to JSONL)
python environments/terminal_test_env/terminal_test_env.py process \
--env.data_path_to_save_groups terminal_test_output.jsonl
```
### HermesSweEnv (`hermes_swe_env/`)
SWE-bench style training environment. The model gets a coding task, uses terminal + file + web tools to solve it, and the reward function runs tests in the same Modal sandbox.
```bash
python environments/hermes_swe_env/hermes_swe_env.py serve \
--openai.model_name YourModel \
--env.dataset_name bigcode/humanevalpack \
--env.terminal_backend modal
```
### TerminalBench2EvalEnv (`benchmarks/terminalbench_2/`)
**Eval-only** environment for the Terminal-Bench 2.0 benchmark (89 tasks). Each task gets a pre-built Docker Hub image, a natural language instruction, and a test suite. The agent uses terminal + file tools to solve the task, then the test suite verifies correctness.
Follows the standard Atropos eval pattern (like GPQA, MMLU, etc.):
- Run via `evaluate` subcommand (no `run-api` needed)
- `setup()` loads the dataset, `evaluate()` runs all tasks
- `rollout_and_score_eval()` handles per-task agent loop + test verification
- Downloads verifier output locally for reliable reward checking (Harbor pattern)
```bash
# Run full benchmark
python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
--openai.model_name anthropic/claude-opus-4.6
# Run subset of tasks
python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
--openai.model_name anthropic/claude-opus-4.6 \
--env.task_filter fix-git,git-multibranch
# Skip specific tasks
python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
--openai.model_name anthropic/claude-opus-4.6 \
--env.skip_tasks heavy-task,slow-task
```
## Creating a New Environment
### Training Environment
1. Create a new directory under `environments/`
2. Create your env file inheriting from `HermesAgentBaseEnv`
3. Implement the four abstract methods + `evaluate()`
```python
from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
class MyEnvConfig(HermesAgentEnvConfig):
pass # Add custom fields as needed
class MyEnv(HermesAgentBaseEnv):
name = "my-env"
env_config_cls = MyEnvConfig
@classmethod
def config_init(cls):
env_config = MyEnvConfig(
enabled_toolsets=["terminal", "file"],
terminal_backend="modal",
# ... other config
)
server_configs = [APIServerConfig(...)]
return env_config, server_configs
async def setup(self):
self.dataset = load_dataset(...)
self.iter = 0
async def get_next_item(self):
item = self.dataset[self.iter % len(self.dataset)]
self.iter += 1
return item
def format_prompt(self, item):
return item["instruction"]
async def compute_reward(self, item, result, ctx):
# ctx gives you full tool access to the rollout's sandbox
test = ctx.terminal("pytest -v")
return 1.0 if test["exit_code"] == 0 else 0.0
async def evaluate(self, *args, **kwargs):
# Periodic evaluation logic
...
if __name__ == "__main__":
MyEnv.cli()
```
### Eval-Only Environment (Benchmark)
For eval benchmarks, follow the pattern in `terminalbench2_env.py` (a minimal skeleton follows these steps):
1. Create under `environments/benchmarks/your-benchmark/`
2. Inherit from `HermesAgentBaseEnv`
3. Set eval-only config: `eval_handling=STOP_TRAIN`, `steps_per_eval=1`, `total_steps=1`
4. Stub the training methods (`collect_trajectories`, `score`)
5. Implement `rollout_and_score_eval()` and `evaluate()`
6. Run with `evaluate` subcommand
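A hedged skeleton of the eval-only wiring (class name and stub bodies are illustrative; `terminalbench2_env.py` is the full reference implementation):
```python
from atroposlib.envs.base import EvalHandlingEnum
from atroposlib.envs.server_handling.server_manager import APIServerConfig
from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig

class MyBenchmarkEnv(HermesAgentBaseEnv):
    name = "my-benchmark"
    env_config_cls = HermesAgentEnvConfig

    @classmethod
    def config_init(cls):
        env_config = HermesAgentEnvConfig(
            eval_handling=EvalHandlingEnum.STOP_TRAIN,  # eval-only settings
            steps_per_eval=1,
            total_steps=1,
            group_size=1,
        )
        return env_config, [APIServerConfig(...)]

    # Training pipeline stubs -- never exercised by the `evaluate` subcommand.
    async def collect_trajectories(self, item):
        return None, []

    async def score(self, rollout_group_data):
        return None

    # setup(), get_next_item(), format_prompt(), and compute_reward() look the
    # same as in the training example above; the eval entry points are:
    async def rollout_and_score_eval(self, eval_item):
        ...  # per-task agent loop + verification

    async def evaluate(self, *args, **kwargs):
        ...  # iterate tasks, aggregate pass rates, log via evaluate_log()
```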
## Key Config Fields
| Field | Description | Default |
|-------|-------------|---------|
| `enabled_toolsets` | Which hermes toolsets to enable | `None` (all) |
| `disabled_toolsets` | Toolsets to disable | `None` |
| `distribution` | Probabilistic toolset distribution name | `None` |
| `max_agent_turns` | Max LLM calls per rollout | `30` |
| `agent_temperature` | Sampling temperature | `1.0` |
| `terminal_backend` | `local`, `docker`, `modal`, `daytona`, `ssh`, `singularity` | `local` |
| `system_prompt` | System message for the agent | `None` |
| `tool_call_parser` | Parser name for Phase 2 | `hermes` |
| `eval_handling` | `STOP_TRAIN`, `LIMIT_TRAIN`, `NONE` | `STOP_TRAIN` |


@@ -0,0 +1,36 @@
"""
Hermes-Agent Atropos Environments
Provides a layered integration between hermes-agent's tool-calling capabilities
and the Atropos RL training framework.
Core layers:
- agent_loop: Reusable multi-turn agent loop with standard OpenAI-spec tool calling
- tool_context: Per-rollout tool access handle for reward/verification functions
- hermes_base_env: Abstract base environment (BaseEnv subclass) for Atropos
- tool_call_parsers: Client-side tool call parser registry for Phase 2 (VLLM /generate)
Concrete environments:
- terminal_test_env/: Simple file-creation tasks for testing the stack
- hermes_swe_env/: SWE-bench style tasks with Modal sandboxes
Benchmarks (eval-only):
- benchmarks/terminalbench_2/: Terminal-Bench 2.0 evaluation
"""
try:
from environments.agent_loop import AgentResult, HermesAgentLoop
from environments.tool_context import ToolContext
from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
except ImportError:
# atroposlib not installed — environments are unavailable but
# submodules like tool_call_parsers can still be imported directly.
pass
__all__ = [
"AgentResult",
"HermesAgentLoop",
"ToolContext",
"HermesAgentBaseEnv",
"HermesAgentEnvConfig",
]


@@ -0,0 +1,511 @@
"""
HermesAgentLoop -- Reusable Multi-Turn Agent Engine
Runs the hermes-agent tool-calling loop using standard OpenAI-spec tool calling.
Works with any server that returns ChatCompletion objects with tool_calls:
- Phase 1: OpenAI server type (VLLM, SGLang, OpenRouter, OpenAI API)
- Phase 2: ManagedServer with client-side tool call parser
The loop passes tools= and checks response.choices[0].message.tool_calls,
identical to hermes-agent's run_agent.py. Tool execution is dispatched via
handle_function_call() from model_tools.py.
"""
import asyncio
import concurrent.futures
import json
import logging
import os
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Set
from model_tools import handle_function_call
# Thread pool for running sync tool calls that internally use asyncio.run()
# (e.g., the Modal/Docker/Daytona terminal backends). Running them in a separate
# thread gives them a clean event loop so they don't deadlock inside Atropos's loop.
# Size must be large enough for concurrent eval tasks (e.g., 89 TB2 tasks all
# making tool calls). Too small = thread pool starvation, tasks queue for minutes.
# Resized at runtime by HermesAgentBaseEnv.__init__ via resize_tool_pool().
_tool_executor = concurrent.futures.ThreadPoolExecutor(max_workers=128)
def resize_tool_pool(max_workers: int):
"""
Replace the global tool executor with a new one of the given size.
Called by HermesAgentBaseEnv.__init__ based on config.tool_pool_size.
Safe to call before any tasks are submitted.
"""
global _tool_executor
old_executor = _tool_executor
_tool_executor = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
old_executor.shutdown(wait=False)
logger.info("Tool thread pool resized to %d workers", max_workers)
logger = logging.getLogger(__name__)
@dataclass
class ToolError:
"""Record of a tool execution error during the agent loop."""
turn: int # Which turn the error occurred on
tool_name: str # Which tool was called
arguments: str # The arguments passed (truncated)
error: str # The error message
tool_result: str # The raw result returned to the model
@dataclass
class AgentResult:
"""Result of running the agent loop."""
# Full conversation history in OpenAI message format
messages: List[Dict[str, Any]]
# ManagedServer.get_state() if available (Phase 2), None otherwise
managed_state: Optional[Dict[str, Any]] = None
# How many LLM calls were made
turns_used: int = 0
# True if model stopped calling tools naturally (vs hitting max_turns)
finished_naturally: bool = False
# Extracted reasoning content per turn (from PR #297 helpers)
reasoning_per_turn: List[Optional[str]] = field(default_factory=list)
# Tool errors encountered during the loop
tool_errors: List[ToolError] = field(default_factory=list)
def _extract_reasoning_from_message(message) -> Optional[str]:
"""
Extract reasoning content from a ChatCompletion message.
Handles multiple provider formats:
1. message.reasoning_content field (some providers)
2. message.reasoning field (some providers)
3. message.reasoning_details[].text (OpenRouter style)
Note: <think> block extraction from content is NOT done here -- that's
handled by the response already in Phase 1 (server does it) or by
ManagedServer's patch in Phase 2.
Args:
message: The assistant message from ChatCompletion response
Returns:
Extracted reasoning text, or None if not found
"""
# Check reasoning_content field (common across providers)
if hasattr(message, "reasoning_content") and message.reasoning_content:
return message.reasoning_content
# Check reasoning field
if hasattr(message, "reasoning") and message.reasoning:
return message.reasoning
# Check reasoning_details (OpenRouter style)
if hasattr(message, "reasoning_details") and message.reasoning_details:
for detail in message.reasoning_details:
if hasattr(detail, "text") and detail.text:
return detail.text
if isinstance(detail, dict) and detail.get("text"):
return detail["text"]
return None
class HermesAgentLoop:
"""
Runs hermes-agent's tool-calling loop using standard OpenAI-spec tool calling.
Same pattern as run_agent.py:
- Pass tools= to the API
- Check response.choices[0].message.tool_calls
- Dispatch via handle_function_call()
Works identically with any server type -- OpenAI, VLLM, SGLang, OpenRouter,
or ManagedServer with a parser. The server determines how tool_calls get
populated on the response.
"""
def __init__(
self,
server,
tool_schemas: List[Dict[str, Any]],
valid_tool_names: Set[str],
max_turns: int = 30,
task_id: Optional[str] = None,
temperature: float = 1.0,
max_tokens: Optional[int] = None,
extra_body: Optional[Dict[str, Any]] = None,
):
"""
Initialize the agent loop.
Args:
server: Server object with chat_completion() method (OpenAIServer,
ManagedServer, ServerManager, etc.)
tool_schemas: OpenAI-format tool definitions from get_tool_definitions()
valid_tool_names: Set of tool names the model is allowed to call
max_turns: Maximum number of LLM calls before stopping
task_id: Unique ID for terminal/browser session isolation
temperature: Sampling temperature for generation
max_tokens: Max tokens per generation (None for server default)
extra_body: Extra parameters passed to the OpenAI client's create() call.
Used for OpenRouter provider preferences, transforms, etc.
e.g. {"provider": {"ignore": ["DeepInfra"]}}
"""
self.server = server
self.tool_schemas = tool_schemas
self.valid_tool_names = valid_tool_names
self.max_turns = max_turns
self.task_id = task_id or str(uuid.uuid4())
self.temperature = temperature
self.max_tokens = max_tokens
self.extra_body = extra_body
async def run(self, messages: List[Dict[str, Any]]) -> AgentResult:
"""
Execute the full agent loop using standard OpenAI tool calling.
Args:
messages: Initial conversation messages (system + user).
Modified in-place as the conversation progresses.
Returns:
AgentResult with full conversation history, managed state, and metadata
"""
reasoning_per_turn = []
tool_errors: List[ToolError] = []
# Per-loop TodoStore for the todo tool (ephemeral, dies with the loop)
from tools.todo_tool import TodoStore, todo_tool as _todo_tool
_todo_store = TodoStore()
# Extract user task from first user message for browser_snapshot context
_user_task = None
for msg in messages:
if msg.get("role") == "user":
content = msg.get("content", "")
if isinstance(content, str) and content.strip():
_user_task = content.strip()[:500] # Cap to avoid huge strings
break
import time as _time
for turn in range(self.max_turns):
turn_start = _time.monotonic()
# Build the chat_completion kwargs
chat_kwargs = {
"messages": messages,
"n": 1,
"temperature": self.temperature,
}
# Only pass tools if we have them
if self.tool_schemas:
chat_kwargs["tools"] = self.tool_schemas
# Only pass max_tokens if explicitly set
if self.max_tokens is not None:
chat_kwargs["max_tokens"] = self.max_tokens
# Inject extra_body for provider-specific params (e.g., OpenRouter
# provider preferences like banned/preferred providers, transforms)
if self.extra_body:
chat_kwargs["extra_body"] = self.extra_body
# Make the API call -- standard OpenAI spec
api_start = _time.monotonic()
try:
response = await self.server.chat_completion(**chat_kwargs)
except Exception as e:
api_elapsed = _time.monotonic() - api_start
logger.error("API call failed on turn %d (%.1fs): %s", turn + 1, api_elapsed, e)
return AgentResult(
messages=messages,
managed_state=self._get_managed_state(),
turns_used=turn + 1,
finished_naturally=False,
reasoning_per_turn=reasoning_per_turn,
tool_errors=tool_errors,
)
api_elapsed = _time.monotonic() - api_start
if not response or not response.choices:
logger.warning("Empty response on turn %d (api=%.1fs)", turn + 1, api_elapsed)
return AgentResult(
messages=messages,
managed_state=self._get_managed_state(),
turns_used=turn + 1,
finished_naturally=False,
reasoning_per_turn=reasoning_per_turn,
tool_errors=tool_errors,
)
assistant_msg = response.choices[0].message
# Extract reasoning content from the response (all provider formats)
reasoning = _extract_reasoning_from_message(assistant_msg)
reasoning_per_turn.append(reasoning)
# Check for tool calls -- standard OpenAI spec.
# Fallback: if response has no structured tool_calls but content
# contains raw tool call tags (e.g. <tool_call>), parse them using
# hermes-agent's standalone parsers. This handles the case where
# ManagedServer's ToolCallTranslator couldn't parse because vLLM
# isn't installed.
if (
not assistant_msg.tool_calls
and assistant_msg.content
and self.tool_schemas
and "<tool_call>" in (assistant_msg.content or "")
):
try:
from environments.tool_call_parsers import get_parser
fallback_parser = get_parser("hermes")
parsed_content, parsed_calls = fallback_parser.parse(
assistant_msg.content
)
if parsed_calls:
assistant_msg.tool_calls = parsed_calls
if parsed_content is not None:
assistant_msg.content = parsed_content
logger.debug(
"Fallback parser extracted %d tool calls from raw content",
len(parsed_calls),
)
except Exception:
pass # Fall through to no tool calls
if assistant_msg.tool_calls:
# Normalize tool calls to dicts — they may come as objects
# (OpenAI API) or dicts (vLLM ToolCallTranslator).
def _tc_to_dict(tc):
if isinstance(tc, dict):
return {
"id": tc.get("id", f"call_{uuid.uuid4().hex[:8]}"),
"type": "function",
"function": {
"name": tc.get("function", {}).get("name", tc.get("name", "")),
"arguments": tc.get("function", {}).get("arguments", tc.get("arguments", "{}")),
},
}
return {
"id": tc.id,
"type": "function",
"function": {
"name": tc.function.name,
"arguments": tc.function.arguments,
},
}
# Build the assistant message dict for conversation history
msg_dict: Dict[str, Any] = {
"role": "assistant",
"content": assistant_msg.content or "",
"tool_calls": [_tc_to_dict(tc) for tc in assistant_msg.tool_calls],
}
# Preserve reasoning_content for multi-turn chat template handling
# (e.g., Kimi-K2's template renders <think> blocks differently
# for history vs. the latest turn based on this field)
if reasoning:
msg_dict["reasoning_content"] = reasoning
messages.append(msg_dict)
# Execute each tool call via hermes-agent's dispatch
for tc in assistant_msg.tool_calls:
# Handle both object (OpenAI) and dict (vLLM) formats
if isinstance(tc, dict):
tool_name = tc.get("function", {}).get("name", tc.get("name", ""))
tool_args_raw = tc.get("function", {}).get("arguments", tc.get("arguments", "{}"))
else:
tool_name = tc.function.name
tool_args_raw = tc.function.arguments
# Validate tool name
if tool_name not in self.valid_tool_names:
tool_result = json.dumps(
{
"error": f"Unknown tool '{tool_name}'. "
f"Available tools: {sorted(self.valid_tool_names)}"
}
)
tool_errors.append(ToolError(
turn=turn + 1, tool_name=tool_name,
arguments=tool_args_raw[:200],
error=f"Unknown tool '{tool_name}'",
tool_result=tool_result,
))
logger.warning(
"Model called unknown tool '%s' on turn %d",
tool_name, turn + 1,
)
else:
# Parse arguments
try:
args = json.loads(tool_args_raw)
except json.JSONDecodeError as e:
args = None
tool_result = json.dumps(
{"error": f"Invalid JSON in tool arguments: {e}. Please retry with valid JSON."}
)
tool_errors.append(ToolError(
turn=turn + 1, tool_name=tool_name,
arguments=tool_args_raw[:200],
error=f"Invalid JSON: {e}",
tool_result=tool_result,
))
logger.warning(
"Invalid JSON in tool call arguments for '%s': %s",
tool_name, tool_args_raw[:200],
)
# Dispatch tool only if arguments parsed successfully
if args is not None:
try:
if tool_name == "terminal":
backend = os.getenv("TERMINAL_ENV", "local")
cmd_preview = args.get("command", "")[:80]
logger.info(
"[%s] $ %s", self.task_id[:8], cmd_preview,
)
tool_submit_time = _time.monotonic()
# Todo tool -- handle locally (needs per-loop TodoStore)
if tool_name == "todo":
tool_result = _todo_tool(
todos=args.get("todos"),
merge=args.get("merge", False),
store=_todo_store,
)
tool_elapsed = _time.monotonic() - tool_submit_time
elif tool_name == "memory":
tool_result = json.dumps({"error": "Memory is not available in RL environments."})
tool_elapsed = _time.monotonic() - tool_submit_time
elif tool_name == "session_search":
tool_result = json.dumps({"error": "Session search is not available in RL environments."})
tool_elapsed = _time.monotonic() - tool_submit_time
else:
# Run tool calls in a thread pool so backends that
# use asyncio.run() internally (modal, docker, daytona) get
# a clean event loop instead of deadlocking.
loop = asyncio.get_event_loop()
# Capture current tool_name/args for the lambda
_tn, _ta, _tid = tool_name, args, self.task_id
tool_result = await loop.run_in_executor(
_tool_executor,
lambda: handle_function_call(
_tn, _ta, task_id=_tid,
user_task=_user_task,
),
)
tool_elapsed = _time.monotonic() - tool_submit_time
# Log slow tools and thread pool stats for debugging
pool_active = _tool_executor._work_queue.qsize()
if tool_elapsed > 30:
logger.warning(
"[%s] turn %d: %s took %.1fs (pool queue=%d)",
self.task_id[:8], turn + 1, tool_name,
tool_elapsed, pool_active,
)
except Exception as e:
tool_result = json.dumps(
{"error": f"Tool execution failed: {type(e).__name__}: {str(e)}"}
)
tool_errors.append(ToolError(
turn=turn + 1, tool_name=tool_name,
arguments=tool_args_raw[:200],
error=f"{type(e).__name__}: {str(e)}",
tool_result=tool_result,
))
logger.error(
"Tool '%s' execution failed on turn %d: %s",
tool_name, turn + 1, e,
)
# Also check if the tool returned an error in its JSON result
try:
result_data = json.loads(tool_result)
if isinstance(result_data, dict):
err = result_data.get("error")
exit_code = result_data.get("exit_code")
if err and exit_code and exit_code < 0:
tool_errors.append(ToolError(
turn=turn + 1, tool_name=tool_name,
arguments=tool_args_raw[:200],
error=str(err),
tool_result=tool_result[:500],
))
except (json.JSONDecodeError, TypeError):
pass
# Add tool response to conversation
tc_id = tc.get("id", "") if isinstance(tc, dict) else tc.id
messages.append(
{
"role": "tool",
"tool_call_id": tc_id,
"content": tool_result,
}
)
turn_elapsed = _time.monotonic() - turn_start
logger.info(
"[%s] turn %d: api=%.1fs, %d tools, turn_total=%.1fs",
self.task_id[:8], turn + 1, api_elapsed,
len(assistant_msg.tool_calls), turn_elapsed,
)
else:
# No tool calls -- model is done
msg_dict = {
"role": "assistant",
"content": assistant_msg.content or "",
}
if reasoning:
msg_dict["reasoning_content"] = reasoning
messages.append(msg_dict)
turn_elapsed = _time.monotonic() - turn_start
logger.info(
"[%s] turn %d: api=%.1fs, no tools (finished), turn_total=%.1fs",
self.task_id[:8], turn + 1, api_elapsed, turn_elapsed,
)
return AgentResult(
messages=messages,
managed_state=self._get_managed_state(),
turns_used=turn + 1,
finished_naturally=True,
reasoning_per_turn=reasoning_per_turn,
tool_errors=tool_errors,
)
# Hit max turns without the model stopping
logger.info("Agent hit max_turns (%d) without finishing", self.max_turns)
return AgentResult(
messages=messages,
managed_state=self._get_managed_state(),
turns_used=self.max_turns,
finished_naturally=False,
reasoning_per_turn=reasoning_per_turn,
tool_errors=tool_errors,
)
def _get_managed_state(self) -> Optional[Dict[str, Any]]:
"""
Get ManagedServer state if the server supports it.
Returns state dict with SequenceNodes containing tokens/logprobs/masks,
or None if the server doesn't support get_state() (e.g., regular OpenAI server).
"""
if hasattr(self.server, "get_state"):
return self.server.get_state()
return None

File diff suppressed because it is too large


@@ -0,0 +1,73 @@
# OpenThoughts-TBLite Evaluation Environment
This environment evaluates terminal agents on the [OpenThoughts-TBLite](https://huggingface.co/datasets/open-thoughts/OpenThoughts-TBLite) benchmark, a difficulty-calibrated subset of [Terminal-Bench 2.0](https://www.tbench.ai/leaderboard/terminal-bench/2.0).
## Source
OpenThoughts-TBLite was created by the [OpenThoughts](https://www.openthoughts.ai/) Agent team in collaboration with [Snorkel AI](https://snorkel.ai/) and [Bespoke Labs](https://bespokelabs.ai/). The original dataset and documentation live at:
- **Dataset (source):** [open-thoughts/OpenThoughts-TBLite](https://huggingface.co/datasets/open-thoughts/OpenThoughts-TBLite)
- **GitHub:** [open-thoughts/OpenThoughts-TBLite](https://github.com/open-thoughts/OpenThoughts-TBLite)
- **Blog post:** [openthoughts.ai/blog/openthoughts-tblite](https://www.openthoughts.ai/blog/openthoughts-tblite)
## Our Dataset
We converted the source into the same schema used by our Terminal-Bench 2.0 environment (pre-built Docker Hub images, base64-encoded test tarballs, etc.) and published it as:
- **Dataset (ours):** [NousResearch/openthoughts-tblite](https://huggingface.co/datasets/NousResearch/openthoughts-tblite)
- **Docker images:** `nousresearch/tblite-<task-name>:latest` on Docker Hub (100 images)
The conversion script is at `scripts/prepare_tblite_dataset.py`.
## Why TBLite?
Terminal-Bench 2.0 is one of the strongest frontier evaluations for terminal agents, but when a model scores near the floor (e.g., Qwen 3 8B at <1%), many changes look identical in aggregate score. TBLite addresses this by calibrating task difficulty using Claude Haiku 4.5 as a reference:
| Difficulty | Pass Rate Range | Tasks |
|------------|----------------|-------|
| Easy | >= 70% | 40 |
| Medium | 40-69% | 26 |
| Hard | 10-39% | 26 |
| Extreme | < 10% | 8 |
This gives enough solvable tasks to detect small improvements quickly, while preserving enough hard tasks to avoid saturation. The correlation between TBLite and TB2 scores is **r = 0.911**.
TBLite also runs 2.6-8x faster than the full TB2, making it practical for iteration loops.
## Usage
```bash
# Run the full benchmark
python environments/benchmarks/tblite/tblite_env.py evaluate
# Filter to specific tasks
python environments/benchmarks/tblite/tblite_env.py evaluate \
--env.task_filter "broken-python,pandas-etl"
# Use a different model
python environments/benchmarks/tblite/tblite_env.py evaluate \
--server.model_name "qwen/qwen3-30b"
```
## Architecture
`TBLiteEvalEnv` is a thin subclass of `TerminalBench2EvalEnv`. All evaluation logic (agent loop, Docker sandbox management, test verification, metrics) is inherited. Only the defaults differ (a condensed code view follows the table):
| Setting | TB2 | TBLite |
|----------------|----------------------------------|-----------------------------------------|
| Dataset | `NousResearch/terminal-bench-2` | `NousResearch/openthoughts-tblite` |
| Tasks | 89 | 100 |
| Task timeout | 1800s (30 min) | 1200s (20 min) |
| Wandb name | `terminal-bench-2` | `openthoughts-tblite` |
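In code, the subclass only swaps defaults (condensed from `tblite_env.py`; the full `config_init()` is omitted here):
```python
from pydantic import Field

from environments.benchmarks.terminalbench_2.terminalbench2_env import (
    TerminalBench2EvalConfig,
    TerminalBench2EvalEnv,
)

class TBLiteEvalConfig(TerminalBench2EvalConfig):
    # Only the dataset default and the per-task timeout change.
    dataset_name: str = Field(default="NousResearch/openthoughts-tblite")
    task_timeout: int = Field(default=1200)  # 20 min; TBLite tasks are faster

class TBLiteEvalEnv(TerminalBench2EvalEnv):
    name = "openthoughts-tblite"
    env_config_cls = TBLiteEvalConfig
```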
## Citation
```bibtex
@software{OpenThoughts-TBLite,
author = {{OpenThoughts-Agent team} and {Snorkel AI} and {Bespoke Labs}},
month = feb,
title = {{OpenThoughts-TBLite: A High-Signal Benchmark for Iterating on Terminal Agents}},
howpublished = {https://www.openthoughts.ai/blog/openthoughts-tblite},
year = {2026}
}
```


@@ -0,0 +1,39 @@
# OpenThoughts-TBLite Evaluation -- Default Configuration
#
# Eval-only environment for the TBLite benchmark (100 difficulty-calibrated
# terminal tasks, a faster proxy for Terminal-Bench 2.0).
# Uses Modal terminal backend for per-task cloud-isolated sandboxes
# and OpenRouter for inference.
#
# Usage:
# python environments/benchmarks/tblite/tblite_env.py evaluate \
# --config environments/benchmarks/tblite/default.yaml
#
# # Override model:
# python environments/benchmarks/tblite/tblite_env.py evaluate \
# --config environments/benchmarks/tblite/default.yaml \
# --openai.model_name anthropic/claude-sonnet-4
env:
enabled_toolsets: ["terminal", "file"]
max_agent_turns: 60
max_token_length: 32000
agent_temperature: 0.8
terminal_backend: "modal"
terminal_timeout: 300 # 5 min per command (builds, pip install)
tool_pool_size: 128 # thread pool for 100 parallel tasks
dataset_name: "NousResearch/openthoughts-tblite"
test_timeout: 600
task_timeout: 1200 # 20 min wall-clock per task (TBLite tasks are faster)
tokenizer_name: "NousResearch/Hermes-3-Llama-3.1-8B"
use_wandb: true
wandb_name: "openthoughts-tblite"
ensure_scores_are_not_same: false
data_dir_to_save_evals: "environments/benchmarks/evals/openthoughts-tblite"
openai:
base_url: "https://openrouter.ai/api/v1"
model_name: "anthropic/claude-opus-4.6"
server_type: "openai"
health_check: false
# api_key loaded from OPENROUTER_API_KEY in .env


@@ -0,0 +1,38 @@
# OpenThoughts-TBLite Evaluation -- Docker Backend (Local Compute)
#
# Runs tasks in Docker containers on the local machine.
# Sandboxed like Modal but no cloud costs. Good for dev/testing.
#
# Usage:
# python environments/benchmarks/tblite/tblite_env.py evaluate \
# --config environments/benchmarks/tblite/local.yaml
#
# # Override concurrency:
# python environments/benchmarks/tblite/tblite_env.py evaluate \
# --config environments/benchmarks/tblite/local.yaml \
# --env.eval_concurrency 4
env:
enabled_toolsets: ["terminal", "file"]
max_agent_turns: 60
max_token_length: 32000
agent_temperature: 0.8
terminal_backend: "docker"
terminal_timeout: 300
tool_pool_size: 16
dataset_name: "NousResearch/openthoughts-tblite"
test_timeout: 600
task_timeout: 1200
eval_concurrency: 8 # max 8 tasks at once
tokenizer_name: "NousResearch/Hermes-3-Llama-3.1-8B"
use_wandb: false
wandb_name: "openthoughts-tblite-local"
ensure_scores_are_not_same: false
data_dir_to_save_evals: "environments/benchmarks/evals/openthoughts-tblite-local"
openai:
base_url: "https://openrouter.ai/api/v1"
model_name: "anthropic/claude-sonnet-4"
server_type: "openai"
health_check: false
# api_key loaded from OPENROUTER_API_KEY in .env


@@ -0,0 +1,40 @@
# OpenThoughts-TBLite Evaluation -- Local vLLM Backend
#
# Runs against a local vLLM server with Docker sandboxes.
#
# Start the vLLM server from the atropos directory:
# python -m example_trainer.vllm_api_server \
# --model Qwen/Qwen3-4B-Instruct-2507 \
# --port 9001 \
# --gpu-memory-utilization 0.8 \
# --max-model-len=32000
#
# Then run:
# python environments/benchmarks/tblite/tblite_env.py evaluate \
# --config environments/benchmarks/tblite/local_vllm.yaml
env:
enabled_toolsets: ["terminal", "file"]
max_agent_turns: 60
max_token_length: 16000
agent_temperature: 0.6
terminal_backend: "docker"
terminal_timeout: 300
tool_pool_size: 16
dataset_name: "NousResearch/openthoughts-tblite"
test_timeout: 600
task_timeout: 1200
eval_concurrency: 8
tool_call_parser: "hermes"
system_prompt: "You are an expert terminal agent. You MUST use the provided tools to complete tasks. Use the terminal tool to run shell commands, read_file to read files, write_file to write files, search_files to search, and patch to edit files. Do NOT write out solutions as text - execute them using the tools. Always start by exploring the environment with terminal commands."
tokenizer_name: "Qwen/Qwen3-4B-Instruct-2507"
use_wandb: false
wandb_name: "tblite-qwen3-4b-instruct"
ensure_scores_are_not_same: false
data_dir_to_save_evals: "environments/benchmarks/evals/tblite-qwen3-4b-local"
openai:
base_url: "http://localhost:9001"
model_name: "Qwen/Qwen3-4B-Instruct-2507"
server_type: "vllm"
health_check: false


@@ -0,0 +1,42 @@
#!/bin/bash
# OpenThoughts-TBLite Evaluation
#
# Run from repo root:
# bash environments/benchmarks/tblite/run_eval.sh
#
# Override model:
# bash environments/benchmarks/tblite/run_eval.sh \
# --openai.model_name anthropic/claude-sonnet-4
#
# Run a subset:
# bash environments/benchmarks/tblite/run_eval.sh \
# --env.task_filter broken-python,pandas-etl
#
# All terminal settings (backend, timeout, lifetime, pool size) are
# configured via env config fields -- no env vars needed.
set -euo pipefail
mkdir -p logs environments/benchmarks/evals/openthoughts-tblite
LOG_FILE="logs/tblite_$(date +%Y%m%d_%H%M%S).log"
echo "OpenThoughts-TBLite Evaluation"
echo "Log file: $LOG_FILE"
echo ""
# Unbuffered python output so logs are written in real-time
export PYTHONUNBUFFERED=1
# Show INFO-level agent loop timing (api/tool durations per turn)
# These go to the log file; tqdm + [START]/[PASS]/[FAIL] go to terminal
export LOGLEVEL=INFO
python environments/benchmarks/tblite/tblite_env.py evaluate \
    --config environments/benchmarks/tblite/default.yaml \
    "$@" \
    2>&1 | tee "$LOG_FILE"
echo ""
echo "Log saved to: $LOG_FILE"
echo "Eval results: environments/benchmarks/evals/openthoughts-tblite/"


@@ -0,0 +1,119 @@
"""
OpenThoughts-TBLite Evaluation Environment
A lighter, faster alternative to Terminal-Bench 2.0 for iterating on terminal
agents. Uses the same evaluation logic as TerminalBench2EvalEnv but defaults
to the NousResearch/openthoughts-tblite dataset (100 difficulty-calibrated
tasks vs TB2's 89 harder tasks).
TBLite tasks are a curated subset of TB2 with a difficulty distribution
designed to give meaningful signal even for smaller models:
- Easy (40 tasks): >= 70% pass rate with Claude Haiku 4.5
- Medium (26 tasks): 40-69% pass rate
- Hard (26 tasks): 10-39% pass rate
- Extreme (8 tasks): < 10% pass rate
Usage:
python environments/benchmarks/tblite/tblite_env.py evaluate
# Filter to specific tasks:
python environments/benchmarks/tblite/tblite_env.py evaluate \\
--env.task_filter "broken-python,pandas-etl"
"""
import os
import sys
from pathlib import Path
from typing import List, Tuple
_repo_root = Path(__file__).resolve().parent.parent.parent.parent
if str(_repo_root) not in sys.path:
sys.path.insert(0, str(_repo_root))
from pydantic import Field
from atroposlib.envs.base import EvalHandlingEnum
from atroposlib.envs.server_handling.server_manager import APIServerConfig
from environments.benchmarks.terminalbench_2.terminalbench2_env import (
TerminalBench2EvalConfig,
TerminalBench2EvalEnv,
)
class TBLiteEvalConfig(TerminalBench2EvalConfig):
"""Configuration for the OpenThoughts-TBLite evaluation environment.
Inherits all TB2 config fields. Only the dataset default and task timeout
differ -- TBLite tasks are calibrated to be faster.
"""
dataset_name: str = Field(
default="NousResearch/openthoughts-tblite",
description="HuggingFace dataset containing TBLite tasks.",
)
task_timeout: int = Field(
default=1200,
description="Maximum wall-clock seconds per task. TBLite tasks are "
"generally faster than TB2, so 20 minutes is usually sufficient.",
)
class TBLiteEvalEnv(TerminalBench2EvalEnv):
"""OpenThoughts-TBLite evaluation environment.
Inherits all evaluation logic from TerminalBench2EvalEnv (agent loop,
test verification, Docker image resolution, metrics, wandb logging).
Only the default configuration differs.
"""
name = "openthoughts-tblite"
env_config_cls = TBLiteEvalConfig
@classmethod
def config_init(cls) -> Tuple[TBLiteEvalConfig, List[APIServerConfig]]:
env_config = TBLiteEvalConfig(
enabled_toolsets=["terminal", "file"],
disabled_toolsets=None,
distribution=None,
max_agent_turns=60,
max_token_length=16000,
agent_temperature=0.6,
system_prompt=None,
terminal_backend="modal",
terminal_timeout=300,
test_timeout=180,
# 100 tasks in parallel
tool_pool_size=128,
eval_handling=EvalHandlingEnum.STOP_TRAIN,
group_size=1,
steps_per_eval=1,
total_steps=1,
tokenizer_name="NousResearch/Hermes-3-Llama-3.1-8B",
use_wandb=True,
wandb_name="openthoughts-tblite",
ensure_scores_are_not_same=False,
)
server_configs = [
APIServerConfig(
base_url="https://openrouter.ai/api/v1",
model_name="anthropic/claude-sonnet-4",
server_type="openai",
api_key=os.getenv("OPENROUTER_API_KEY", ""),
health_check=False,
)
]
return env_config, server_configs
if __name__ == "__main__":
TBLiteEvalEnv.cli()


@@ -0,0 +1,42 @@
# Terminal-Bench 2.0 Evaluation -- Default Configuration
#
# Eval-only environment for the TB2 benchmark (89 terminal tasks).
# Uses Modal terminal backend for per-task cloud-isolated sandboxes
# and OpenRouter for inference.
#
# Usage:
# python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
# --config environments/benchmarks/terminalbench_2/default.yaml
#
# # Override model:
# python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
# --config environments/benchmarks/terminalbench_2/default.yaml \
# --openai.model_name anthropic/claude-sonnet-4
env:
enabled_toolsets: ["terminal", "file"]
max_agent_turns: 60
max_token_length: 32000
agent_temperature: 0.8
terminal_backend: "modal"
terminal_timeout: 300 # 5 min per command (builds, pip install)
tool_pool_size: 128 # thread pool for 89 parallel tasks
dataset_name: "NousResearch/terminal-bench-2"
test_timeout: 600
task_timeout: 1800 # 30 min wall-clock per task, auto-FAIL if exceeded
tokenizer_name: "NousResearch/Hermes-3-Llama-3.1-8B"
use_wandb: true
wandb_name: "terminal-bench-2"
ensure_scores_are_not_same: false
data_dir_to_save_evals: "environments/benchmarks/evals/terminal-bench-2"
# CRITICAL: Limit concurrent Modal sandbox creations to avoid deadlocks.
# Modal's blocking calls (App.lookup, etc.) deadlock when too many sandboxes
# are created simultaneously inside thread pool workers via asyncio.run().
max_concurrent_tasks: 8
openai:
base_url: "https://openrouter.ai/api/v1"
model_name: "anthropic/claude-opus-4.6"
server_type: "openai"
health_check: false
# api_key loaded from OPENROUTER_API_KEY in .env


@@ -0,0 +1,42 @@
#!/bin/bash
# Terminal-Bench 2.0 Evaluation
#
# Run from repo root:
# bash environments/benchmarks/terminalbench_2/run_eval.sh
#
# Override model:
# bash environments/benchmarks/terminalbench_2/run_eval.sh \
# --openai.model_name anthropic/claude-sonnet-4
#
# Run a subset:
# bash environments/benchmarks/terminalbench_2/run_eval.sh \
# --env.task_filter fix-git,git-multibranch
#
# All terminal settings (backend, timeout, lifetime, pool size) are
# configured via env config fields -- no env vars needed.
set -euo pipefail
mkdir -p logs environments/benchmarks/evals/terminal-bench-2
LOG_FILE="logs/terminalbench2_$(date +%Y%m%d_%H%M%S).log"
echo "Terminal-Bench 2.0 Evaluation"
echo "Log file: $LOG_FILE"
echo ""
# Unbuffered python output so logs are written in real-time
export PYTHONUNBUFFERED=1
# Show INFO-level agent loop timing (api/tool durations per turn)
# These go to the log file; tqdm + [START]/[PASS]/[FAIL] go to terminal
export LOGLEVEL=INFO
python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
    --config environments/benchmarks/terminalbench_2/default.yaml \
    "$@" \
    2>&1 | tee "$LOG_FILE"
echo ""
echo "Log saved to: $LOG_FILE"
echo "Eval results: environments/benchmarks/evals/terminal-bench-2/"


@@ -0,0 +1,515 @@
"""
TerminalBench2Env -- Terminal-Bench 2.0 Evaluation Environment
Evaluates agentic LLMs on challenging terminal tasks from Terminal-Bench 2.0.
Each task provides a unique Docker environment (pre-built on Docker Hub), a natural
language instruction, and a test suite for verification. The agent uses terminal +
file tools to complete the task, then the test suite runs inside the same sandbox.
This is an eval-only environment (not a training environment). It is designed to
be run via the `evaluate` subcommand:
python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \\
--env.dataset_name NousResearch/terminal-bench-2
The evaluate flow:
1. setup() -- Loads the TB2 dataset from HuggingFace
2. evaluate() -- Iterates over all tasks, running each through:
a. rollout_and_score_eval() -- Per-task agent loop + test verification
- Resolves Docker image (pre-built Hub image or Dockerfile fallback)
- Registers per-task Modal sandbox via register_task_env_overrides()
- Runs the HermesAgentLoop (terminal + file tools)
- Uploads test suite and runs test.sh in the same sandbox
- Returns binary pass/fail result
b. Aggregates per-task, per-category, and overall pass rates
c. Logs results via evaluate_log() and wandb
Key features:
- Per-task Modal sandboxes using pre-built Docker Hub images
- Binary reward: 1.0 if all tests pass, 0.0 otherwise
- Concurrency-controlled parallel evaluation via asyncio.Semaphore
- Per-task, per-category, and aggregate pass rate tracking
"""
import asyncio
import base64
import io
import json
import logging
import os
import shutil
import sys
import tarfile
import tempfile
import time
import uuid
from collections import defaultdict
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple, Union
# Ensure repo root is on sys.path for imports
_repo_root = Path(__file__).resolve().parent.parent.parent.parent
if str(_repo_root) not in sys.path:
sys.path.insert(0, str(_repo_root))
from pydantic import Field
from atroposlib.envs.base import EvalHandlingEnum
from atroposlib.envs.server_handling.server_manager import APIServerConfig
from environments.agent_loop import AgentResult, HermesAgentLoop
from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
from environments.tool_context import ToolContext
from tools.terminal_tool import (
register_task_env_overrides,
clear_task_env_overrides,
cleanup_vm,
)
logger = logging.getLogger(__name__)
# =============================================================================
# Configuration
# =============================================================================
class TerminalBench2EvalConfig(HermesAgentEnvConfig):
"""
Configuration for the Terminal-Bench 2.0 evaluation environment.
Extends HermesAgentEnvConfig with TB2-specific settings for dataset loading,
test execution, task filtering, and eval concurrency.
"""
# --- Dataset ---
dataset_name: str = Field(
default="NousResearch/terminal-bench-2",
description="HuggingFace dataset containing TB2 tasks.",
)
# --- Test execution ---
test_timeout: int = Field(
default=180,
description="Timeout in seconds for running the test suite after agent completes.",
)
# --- Image strategy ---
force_build: bool = Field(
default=False,
description="If True, always build from Dockerfile (ignore docker_image). "
"Useful for testing custom Dockerfiles.",
)
# --- Task filtering (comma-separated from CLI) ---
task_filter: Optional[str] = Field(
default=None,
description="Comma-separated task names to run (e.g., 'fix-git,git-multibranch'). "
"If not set, all tasks are run.",
)
skip_tasks: Optional[str] = Field(
default=None,
description="Comma-separated task names to skip on top of the default skip list.",
)
# --- Per-task wall-clock timeout ---
task_timeout: int = Field(
default=1800,
description="Maximum wall-clock seconds per task (agent loop + verification). "
"Tasks exceeding this are scored as FAIL. Default 30 minutes.",
)
# --- Concurrency control ---
max_concurrent_tasks: int = Field(
default=8,
description="Maximum number of tasks to run concurrently. "
"Limits concurrent Modal sandbox creations to avoid async/threading deadlocks. "
"Modal has internal limits and creating too many sandboxes simultaneously "
"causes blocking calls to deadlock inside the thread pool.",
)
# --- Eval concurrency ---
eval_concurrency: int = Field(
default=0,
description="Maximum number of tasks to evaluate in parallel. "
"0 means unlimited (all tasks run concurrently). "
"Set to 8 for local backends to avoid overwhelming the machine.",
)
# Tasks that cannot run properly on Modal and are excluded from scoring.
MODAL_INCOMPATIBLE_TASKS = {
"qemu-startup", # Needs KVM/hardware virtualization
"qemu-alpine-ssh", # Needs KVM/hardware virtualization
"crack-7z-hash", # Password brute-force -- too slow for cloud sandbox timeouts
}
# =============================================================================
# Tar extraction helper
# =============================================================================
def _extract_base64_tar(b64_data: str, target_dir: Path):
"""Extract a base64-encoded tar.gz archive into target_dir."""
if not b64_data:
return
raw = base64.b64decode(b64_data)
buf = io.BytesIO(raw)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
tar.extractall(path=str(target_dir))
# =============================================================================
# Main Environment
# =============================================================================
class TerminalBench2EvalEnv(HermesAgentBaseEnv):
"""
Terminal-Bench 2.0 evaluation environment (eval-only, no training).
Inherits from HermesAgentBaseEnv for:
- Terminal backend setup (os.environ["TERMINAL_ENV"])
- Tool resolution via _resolve_tools_for_group()
- Monkey patches for async-safe tool operation
- Wandb trajectory formatting
The evaluate flow (triggered by `environment.py evaluate`):
1. setup() -- Load dataset from HuggingFace
2. evaluate() -- Run all tasks through rollout_and_score_eval()
Each task in rollout_and_score_eval():
1. Resolve Docker image (pre-built Hub image or Dockerfile fallback)
2. Register per-task Modal sandbox override
3. Run HermesAgentLoop with terminal + file tools
4. Upload test suite and execute test.sh in the same sandbox
5. Check /logs/verifier/reward.txt for pass/fail
6. Clean up sandbox, overrides, and temp files
"""
name = "terminal-bench-2"
env_config_cls = TerminalBench2EvalConfig
@classmethod
def config_init(cls) -> Tuple[TerminalBench2EvalConfig, List[APIServerConfig]]:
"""
Default configuration for Terminal-Bench 2.0 evaluation.
Uses eval-only settings:
- eval_handling=STOP_TRAIN so the eval flow runs cleanly
- steps_per_eval=1, total_steps=1 so eval triggers immediately
- group_size=1 (one rollout per group, each task is expensive)
Uses Modal terminal backend (cloud-isolated sandbox per task) and
OpenRouter with Claude for inference.
"""
env_config = TerminalBench2EvalConfig(
# Terminal + file tools only (the agent interacts via shell commands)
enabled_toolsets=["terminal", "file"],
disabled_toolsets=None,
distribution=None,
# Agent settings -- TB2 tasks are complex, need many turns
max_agent_turns=60,
max_token_length=***
agent_temperature=0.6,
system_prompt=None,
# Modal backend for per-task cloud-isolated sandboxes
terminal_backend="modal",
terminal_timeout=300, # 5 min per command (builds, pip install, etc.)
# Test execution timeout (TB2 test scripts can install deps like pytest)
test_timeout=180,
# 89 tasks run in parallel, each needs a thread for tool calls
tool_pool_size=128,
# --- Eval-only Atropos settings ---
# These settings make the env work as an eval-only environment:
# - STOP_TRAIN: pauses training during eval (standard for eval envs)
# - steps_per_eval=1, total_steps=1: eval triggers immediately
# - group_size=1: one rollout per group (each task is expensive)
eval_handling=EvalHandlingEnum.STOP_TRAIN,
group_size=1,
steps_per_eval=1,
total_steps=1,
tokenizer_name="NousRe...1-8B",
use_wandb=True,
wandb_name="terminal-bench-2",
ensure_scores_are_not_same=False, # Binary rewards may all be 0 or 1
)
# OpenRouter with Claude -- API key loaded from .env
server_configs = [
APIServerConfig(
base_url="https://openrouter.ai/api/v1",
model_name="anthropic/claude-sonnet-4",
server_type="openai",
api_key=os.get...EY", ""),
health_check=False,
)
]
return env_config, server_configs
# =========================================================================
# Setup -- load dataset
# =========================================================================
async def setup(self):
"""Load the Terminal-Bench 2.0 dataset from HuggingFace."""
from datasets import load_dataset
# Auto-set terminal_lifetime to task_timeout + 120s so sandboxes
# never get killed during an active task, but still get cleaned up
# promptly after the task times out.
lifetime = self.config.task_timeout + 120
self.config.terminal_lifetime = lifetime
os.environ["TERMINAL_LIFETIME_SECONDS"] = str(lifetime)
print(f" Terminal lifetime auto-set to {lifetime}s (task_timeout + 120s)")
print(f"Loading TB2 dataset from: {self.config.dataset_name}")
ds = load_dataset(self.config.dataset_name, split="train")
# Apply task filters (comma-separated strings from CLI)
tasks = list(ds)
if self.config.task_filter:
allowed = {name.strip() for name in self.config.task_filter.split(",")}
tasks = [t for t in tasks if t["task_name"] in allowed]
print(f" Filtered to {len(tasks)} tasks: {sorted(allowed)}")
# Skip tasks incompatible with the current backend (e.g., QEMU on Modal)
# plus any user-specified skip_tasks
skip = set(MODAL_INCOMPATIBLE_TASKS) if self.config.terminal_backend == "modal" else set()
if self.config.skip_tasks:
skip |= {name.strip() for name in self.config.skip_tasks.split(",")}
if skip:
before = len(tasks)
tasks = [t for t in tasks if t["task_name"] not in skip]
skipped = before - len(tasks)
if skipped > 0:
print(f" Skipped {skipped} incompatible tasks: {sorted(skip & {t['task_name'] for t in ds})}")
self.all_eval_items = tasks
self.iter = 0
# Build category index for per-category metrics
self.category_index: Dict[str, List[int]] = defaultdict(list)
for i, task in enumerate(self.all_eval_items):
self.category_index[task.get("category", "unknown")].append(i)
# Reward tracking for wandb logging
self.eval_metrics: List[Tuple[str, float]] = []
# Streaming JSONL writer -- saves each task's full conversation
# immediately on completion so data is preserved even on Ctrl+C.
# Timestamped filename so each run produces a unique file.
import datetime
log_dir = os.path.join(os.path.dirname(__file__), "logs")
os.makedirs(log_dir, exist_ok=True)
run_ts = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
self._streaming_path = os.path.join(log_dir, f"samples_{run_ts}.jsonl")
self._streaming_file = open(self._streaming_path, "w")
self._streaming_lock = __import__("threading").Lock()
print(f" Streaming results to: {self._streaming_path}")
print(f"TB2 ready: {len(self.all_eval_items)} tasks across {len(self.category_index)} categories")
for cat, indices in sorted(self.category_index.items()):
print(f" {cat}: {len(indices)} tasks")
def _save_result(self, result: Dict[str, Any]):
"""Write a single task result to the streaming JSONL file immediately."""
if not hasattr(self, "_streaming_file") or self._streaming_file.closed:
return
with self._streaming_lock:
self._streaming_file.write(json.dumps(result, ensure_ascii=False, default=str) + "\n")
self._streaming_file.flush()
# =========================================================================
# Training pipeline stubs -- NOT used in eval-only mode
# =========================================================================
# These satisfy the abstract method requirements from HermesAgentBaseEnv.
# The evaluate subcommand calls setup() -> evaluate() directly, bypassing
# the training pipeline entirely.
async def get_next_item(self):
"""Return next item (stub -- not used in eval-only mode)."""
item = self.all_eval_items[self.iter % len(self.all_eval_items)]
self.iter += 1
return item
def format_prompt(self, item: Dict[str, Any]) -> str:
"""Return the task's instruction as the user prompt."""
return item["instruction"]
async def compute_reward(self, item, result, ctx) -> float:
"""Compute reward (stub -- actual verification is in rollout_and_score_eval)."""
return 0.0
async def collect_trajectories(self, item):
"""Collect trajectories (stub -- not used in eval-only mode)."""
return None, []
async def score(self, rollout_group_data):
"""Score rollouts (stub -- not used in eval-only mode)."""
return None
# =========================================================================
# Docker image resolution
# =========================================================================
def _resolve_task_image(
self, item: Dict[str, Any], task_name: str
) -> Tuple[str, Optional[Path]]:
"""
Resolve the Docker image for a task, with fallback to Dockerfile.
Strategy (mirrors Harbor's approach):
1. If force_build=True, always build from Dockerfile in environment_tar
2. If docker_image is available, use the pre-built Docker Hub image (fast)
3. Otherwise, extract Dockerfile from environment_tar and build (slow)
Returns:
(modal_image, temp_dir) -- modal_image is a Docker Hub name or a
Dockerfile path. temp_dir is set if we extracted files that need
cleanup later.
"""
docker_image = item.get("docker_image", "")
environment_tar = item.get("environment_tar", "")
# Fast path: use pre-built Docker Hub image
if docker_image and not self.config.force_build:
logger.info("Task %s: using pre-built image %s", task_name, docker_image)
return docker_image, None
# Slow path: extract Dockerfile from environment_tar and build
if environment_tar:
task_dir = Path(tempfile.mkdtemp(prefix=f"tb2-{task_name}-"))
_extract_base64_tar(environment_tar, task_dir)
dockerfile_path = task_dir / "Dockerfile"
if dockerfile_path.exists():
logger.info(
"Task %s: building from Dockerfile (force_build=%s, docker_image=%s)",
task_name, self.config.force_build, bool(docker_image),
)
return str(dockerfile_path), task_dir
# No usable Dockerfile was found -- fall back to the Hub image (if any),
# even though force_build was requested
if docker_image:
logger.warning(
"Task %s: force_build=True but no environment_tar, "
"falling back to docker_image %s", task_name, docker_image,
)
return docker_image, None
return "", None
# =========================================================================
# Per-task evaluation -- agent loop + test verification
# =========================================================================
async def rollout_and_score_eval(self, eval_item: Dict[str, Any]) -> Dict:
"""
Evaluate a single TB2 task: run the agent loop, then verify with tests.
This is the core evaluation method. For each task it:
1. Resolves the Docker image and registers the Modal sandbox override
2. Runs HermesAgentLoop with terminal + file tools
3. Uploads the test suite into the sandbox
4. Executes test.sh and checks the result
5. Cleans up the sandbox and temp files
Args:
eval_item: A single TB2 task dict from the dataset
Returns:
Dict with 'passed' (bool), 'reward' (float), 'task_name' (str),
'category' (str), and optional debug info
"""
task_name = eval_item.get("task_name", "unknown")
category = eval_item.get("category", "unknown")
task_id = str(uuid.uuid4())
task_dir = None # Set if we extract a Dockerfile (needs cleanup)
from tqdm import tqdm
tqdm.write(f" [START] {task_name} (task_id={task_id[:8]})")
task_start = time.time()
try:
# --- 1. Resolve Docker image ---
modal_image, task_dir = self._resolve_task_image(eval_item, task_name)
if not modal_image:
logger.error("Task %s: no docker_image or environment_tar, skipping", task_name)
return {
"passed": False, "reward": 0.0,
"task_name": task_name, "category": category,
"error": "no_image",
}
# --- 2. Register per-task image override ---
# Set both modal_image and docker_image so the task image is used
# regardless of which backend is configured.
register_task_env_overrides(task_id, {
"modal_image": modal_image,
"docker_image": modal_image,
"cwd": "/app",
})
logger.info(
"Task %s: registered image override for task_id %s",
task_name, task_id[:8],
)
# --- 3. Resolve tools and build messages ---
tools, valid_names = self._resolve_tools_for_group()
messages: List[Dict[str, Any]] = []
if self.config.system_prompt:
messages.append({"role": "system", "content": self.config.system_prompt})
messages.append({"role": "user", "content": self.format_prompt(eval_item)})
# --- 4. Run agent loop ---
# Use ManagedServer (Phase 2) for vLLM/SGLang backends to get
# token-level tracking via /generate. Falls back to direct
# ServerManager (Phase 1) for OpenAI endpoints.
if self._use_managed_server():
async with self.server.managed_server(
tokenizer=self.tokenizer,
preserve_think_blocks=bool(self.config.thinking_mode),
) as managed:
agent = HermesAgentLoop(
server=managed,
tool_schemas=tools,
valid_tool_names=valid_names,
max_turns=self.config.max_agent_turns,
task_id=task_id,
temperature=self.config.agent_temperature,
max_tokens=self.config.max_token_length,
extra_body=self.config.extra_body,
)
result = await agent.run(messages)
else:
agent = HermesAgentLoop(
server=self.server,
tool_schemas=tools,
valid_tool_names=valid_names,
max_turns=self.config.max_agent_turns,
task_id=task_id,
temperature=self.config.agent_temperature,
max_tokens=self.config.max_token_length,
extra_body=self.config.extra_body,
)
result = await agent.run(messages)
# --- 5. Verify -- run test suite in the agent's sandbox ---
# Skip verification if the agent produced no meaningful output
only_system_and_user = all(
msg.get("role") in ("system", "user") for msg in result.messages
)
if result.turns_used == 0 or only_system_and_user:
logger.warning(
"Task %s: agent produced no output (turns=%d). Reward=0.",
task_name, result.turns_used,
)
reward = 0.0
else:
# Run tests in a thread so the blocking ctx.terminal() calls

View file

@@ -0,0 +1,115 @@
# YC-Bench: Long-Horizon Agent Benchmark
[YC-Bench](https://github.com/collinear-ai/yc-bench) by [Collinear AI](https://collinear.ai/) is a deterministic, long-horizon benchmark that tests LLM agents' ability to act as a tech startup CEO. The agent manages a simulated company over 1-3 years, making compounding decisions about resource allocation, cash flow, task management, and prestige specialisation across 4 skill domains.
Unlike TerminalBench2 (which evaluates per-task coding ability with binary pass/fail), YC-Bench measures **long-term strategic coherence** — whether an agent can maintain consistent strategy, manage compounding consequences, and adapt plans over hundreds of turns.
## Setup
```bash
# Install yc-bench (optional dependency)
pip install "hermes-agent[yc-bench]"
# Or install from source
git clone https://github.com/collinear-ai/yc-bench
cd yc-bench && pip install -e .
# Verify
yc-bench --help
```
## Running
```bash
# From the repo root:
bash environments/benchmarks/yc_bench/run_eval.sh
# Or directly:
python environments/benchmarks/yc_bench/yc_bench_env.py evaluate \
--config environments/benchmarks/yc_bench/default.yaml
# Override model:
bash environments/benchmarks/yc_bench/run_eval.sh \
--openai.model_name anthropic/claude-opus-4-20250514
# Quick single-preset test:
bash environments/benchmarks/yc_bench/run_eval.sh \
--env.presets '["fast_test"]' --env.seeds '[1]'
```
## How It Works
### Architecture
```
HermesAgentLoop (our agent)
-> terminal tool -> subprocess("yc-bench company status") -> JSON output
-> terminal tool -> subprocess("yc-bench task accept --task-id X") -> JSON
-> terminal tool -> subprocess("yc-bench sim resume") -> JSON (advance time)
-> ... (100-500 turns per run)
```
The environment initialises the simulation via `yc-bench sim init` (NOT `yc-bench run`, which would start yc-bench's own built-in agent loop). Our `HermesAgentLoop` then drives all interaction through CLI commands.
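In concrete terms, the per-run setup that `yc_bench_env.py` performs (shown in full further down this commit) boils down to the sketch below; the DB path is illustrative:

```python
import os
import subprocess

# Each run gets an isolated SQLite DB; yc-bench picks these up from the environment.
db_path = "/tmp/yc_bench_dbs/yc_bench_demo.db"   # illustrative path
os.environ["DATABASE_URL"] = f"sqlite:///{db_path}"
os.environ["YC_BENCH_EXPERIMENT"] = "fast_test"  # preset name

# `sim init` sets up the world and returns -- it does NOT start yc-bench's own agent.
subprocess.run(
    [
        "yc-bench", "sim", "init",
        "--seed", "1",
        "--start-date", "01/01/2025",
        "--company-name", "BenchCo",
        "--horizon-years", "1",
    ],
    check=True, capture_output=True, text=True, timeout=30,
)
# From here, HermesAgentLoop drives the run by issuing CLI calls such as
# `yc-bench company status` and `yc-bench sim resume` via the terminal tool.
```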
### Simulation Mechanics
- **4 skill domains**: research, inference, data_environment, training
- **Prestige system** (1.0-10.0): Gates access to higher-paying tasks
- **Employee management**: Junior/Mid/Senior with domain-specific skill rates
- **Throughput splitting**: `effective_rate = base_rate / N`, where N is the number of active tasks assigned to an employee (see the short example after this list)
- **Financial pressure**: Monthly payroll, bankruptcy = game over
- **Deterministic**: SHA256-based RNG — same seed + preset = same world
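A quick illustration of the throughput rule referenced above, with made-up numbers:

```python
# Illustrative numbers only -- actual rates come from the preset.
base_rate = 2.0        # an employee's skill rate in one domain
active_tasks = 3       # tasks currently assigned to that employee
effective_rate = base_rate / active_tasks   # ~0.67: each task progresses at a third of the rate
```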
### Difficulty Presets
| Preset | Employees | Tasks | Focus |
|-----------|-----------|-------|-------|
| tutorial | 3 | 50 | Basic loop mechanics |
| easy | 5 | 100 | Throughput awareness |
| **medium**| 5 | 150 | Prestige climbing + domain specialisation |
| **hard** | 7 | 200 | Precise ETA reasoning |
| nightmare | 8 | 300 | Sustained perfection under payroll pressure |
| fast_test | (varies) | (varies) | Quick validation (~50 turns) |
Default eval runs **fast_test + medium + hard** × 3 seeds = 9 runs.
### Scoring
```
composite = 0.5 × survival + 0.5 × normalised_funds
```
- **Survival** (binary): Did the company avoid bankruptcy?
- **Normalised funds** (0.0-1.0): Log-scale relative to initial $250K capital
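The helper implementing this lives in `yc_bench_env.py` further down this commit; a condensed sketch of the same arithmetic:

```python
import math

INITIAL_FUNDS_CENTS = 25_000_000  # $250K starting capital, in cents

def composite_score(final_funds_cents: int, survived: bool) -> float:
    # Log-scaled funds capped at 100x starting capital, then blended 50/50 with survival.
    if final_funds_cents <= 0:
        funds_score = 0.0
    else:
        ratio = final_funds_cents / INITIAL_FUNDS_CENTS
        funds_score = min(math.log1p(ratio) / math.log1p(100.0), 1.0)
    return 0.5 * (1.0 if survived else 0.0) + 0.5 * funds_score

# Surviving with funds unchanged at $250K scores roughly 0.5 + 0.5 * 0.15 = 0.575.
```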
## Configuration
Key fields in `default.yaml`:
| Field | Default | Description |
|-------|---------|-------------|
| `presets` | `["fast_test", "medium", "hard"]` | Which presets to evaluate |
| `seeds` | `[1, 2, 3]` | RNG seeds per preset |
| `max_agent_turns` | 200 | Max LLM calls per run |
| `run_timeout` | 3600 | Wall-clock timeout per run (seconds) |
| `survival_weight` | 0.5 | Weight of survival in composite score |
| `funds_weight` | 0.5 | Weight of normalised funds in composite |
| `horizon_years` | null | Override horizon (null = auto from preset) |
## Cost & Time Estimates
Each run is 100-500 LLM turns. Approximate costs per run at typical API rates:
| Preset | Turns | Time | Est. Cost |
|--------|-------|------|-----------|
| fast_test | ~50 | 5-10 min | $1-5 |
| medium | ~200 | 20-40 min | $5-15 |
| hard | ~300 | 30-60 min | $10-25 |
Full default eval (9 runs): ~3-6 hours, $50-200 depending on model.
## References
- [collinear-ai/yc-bench](https://github.com/collinear-ai/yc-bench) — Official repository
- [Collinear AI](https://collinear.ai/) — Company behind yc-bench
- [TerminalBench2](../terminalbench_2/) — Per-task coding benchmark (complementary)

View file

@@ -0,0 +1,43 @@
# YC-Bench Evaluation -- Default Configuration
#
# Long-horizon agent benchmark: agent plays CEO of an AI startup over
# a simulated 1-3 year run, interacting via yc-bench CLI subcommands.
#
# Requires: pip install "hermes-agent[yc-bench]"
#
# Usage:
# python environments/benchmarks/yc_bench/yc_bench_env.py evaluate \
# --config environments/benchmarks/yc_bench/default.yaml
#
# # Override model:
# python environments/benchmarks/yc_bench/yc_bench_env.py evaluate \
# --config environments/benchmarks/yc_bench/default.yaml \
# --openai.model_name anthropic/claude-opus-4-20250514
env:
enabled_toolsets: ["terminal"]
max_agent_turns: 200
max_token_length: 32000
agent_temperature: 0.0
terminal_backend: "local"
terminal_timeout: 60
presets: ["fast_test", "medium", "hard"]
seeds: [1, 2, 3]
run_timeout: 3600 # 60 min wall-clock per run, auto-FAIL if exceeded
survival_weight: 0.5 # weight of binary survival in composite score
funds_weight: 0.5 # weight of normalised final funds in composite score
db_dir: "/tmp/yc_bench_dbs"
company_name: "BenchCo"
start_date: "01/01/2025" # MM/DD/YYYY (yc-bench convention)
tokenizer_name: "NousResearch/Hermes-3-Llama-3.1-8B"
use_wandb: true
wandb_name: "yc-bench"
ensure_scores_are_not_same: false
data_dir_to_save_evals: "environments/benchmarks/evals/yc-bench"
openai:
base_url: "https://openrouter.ai/api/v1"
model_name: "anthropic/claude-sonnet-4.6"
server_type: "openai"
health_check: false
# api_key loaded from OPENROUTER_API_KEY in .env

View file

@@ -0,0 +1,34 @@
#!/bin/bash
# YC-Bench Evaluation
#
# Requires: pip install "hermes-agent[yc-bench]"
#
# Run from repo root:
# bash environments/benchmarks/yc_bench/run_eval.sh
#
# Override model:
# bash environments/benchmarks/yc_bench/run_eval.sh \
# --openai.model_name anthropic/claude-opus-4-20250514
#
# Run a single preset:
# bash environments/benchmarks/yc_bench/run_eval.sh \
# --env.presets '["fast_test"]' --env.seeds '[1]'
set -euo pipefail
mkdir -p logs evals/yc-bench
LOG_FILE="logs/yc_bench_$(date +%Y%m%d_%H%M%S).log"
echo "YC-Bench Evaluation"
echo "Log: $LOG_FILE"
echo ""
PYTHONUNBUFFERED=1 LOGLEVEL="${LOGLEVEL:-INFO}" \
python environments/benchmarks/yc_bench/yc_bench_env.py evaluate \
--config environments/benchmarks/yc_bench/default.yaml \
"$@" \
2>&1 | tee "$LOG_FILE"
echo ""
echo "Log saved to: $LOG_FILE"

View file

@@ -0,0 +1,847 @@
"""
YCBenchEvalEnv -- YC-Bench Long-Horizon Agent Benchmark Environment
Evaluates agentic LLMs on YC-Bench: a deterministic, long-horizon benchmark
where the agent acts as CEO of an AI startup over a simulated 1-3 year run.
The agent manages cash flow, employees, tasks, and prestige across 4 domains,
interacting exclusively via CLI subprocess calls against a SQLite-backed
discrete-event simulation.
Unlike TerminalBench2 (per-task binary pass/fail), YC-Bench measures sustained
multi-turn strategic coherence -- whether an agent can manage compounding
decisions over hundreds of turns without going bankrupt.
This is an eval-only environment. Run via:
python environments/benchmarks/yc_bench/yc_bench_env.py evaluate \
--config environments/benchmarks/yc_bench/default.yaml
The evaluate flow:
1. setup() -- Verifies yc-bench installed, builds eval matrix (preset x seed)
2. evaluate() -- Iterates over all runs sequentially through:
a. rollout_and_score_eval() -- Per-run agent loop
- Initialises a fresh yc-bench simulation via `sim init` (NOT `run`)
- Runs HermesAgentLoop with terminal tool only
- Reads final SQLite DB to extract score
- Returns survival (0/1) + normalised funds score
b. Aggregates per-preset and overall metrics
c. Logs results via evaluate_log() and wandb
Key features:
- CLI-only interface: agent calls yc-bench subcommands via terminal tool
- Deterministic: same seed + preset = same world (SHA256-based RNG)
- Multi-dimensional scoring: survival + normalised final funds
- Per-preset difficulty breakdown in results
- Isolated SQLite DB per run (no cross-run state leakage)
Requires: pip install hermes-agent[yc-bench]
"""
import asyncio
import datetime
import json
import logging
import math
import os
import sqlite3
import subprocess
import sys
import threading
import time
import uuid
from collections import defaultdict
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple
_repo_root = Path(__file__).resolve().parent.parent.parent.parent
if str(_repo_root) not in sys.path:
sys.path.insert(0, str(_repo_root))
from pydantic import Field
from atroposlib.envs.base import EvalHandlingEnum
from atroposlib.envs.server_handling.server_manager import APIServerConfig
from environments.agent_loop import HermesAgentLoop
from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
logger = logging.getLogger(__name__)
# =============================================================================
# System prompt
# =============================================================================
YC_BENCH_SYSTEM_PROMPT = """\
You are the autonomous CEO of an early-stage AI startup in a deterministic
business simulation. You manage the company exclusively through the `yc-bench`
CLI tool. Your primary goal is to **survive** until the simulation horizon ends
without going bankrupt, while **maximising final funds**.
## Simulation Mechanics
- **Funds**: You start with $250,000 seed capital. Revenue comes from completing
tasks. Rewards scale with your prestige: `base × (1 + scale × (prestige - 1))`.
- **Domains**: There are 4 skill domains: **research**, **inference**,
**data_environment**, and **training**. Each has its own prestige level
(1.0-10.0). Higher prestige unlocks better-paying tasks.
- **Employees**: You have employees (Junior/Mid/Senior) with domain-specific
skill rates. **Throughput splits**: `effective_rate = base_rate / N` where N
is the number of active tasks assigned to that employee. Focus beats breadth.
- **Payroll**: Deducted automatically on the first business day of each month.
Running out of funds = bankruptcy = game over.
- **Time**: The simulation runs on business days (Mon-Fri), 09:00-18:00.
Time only advances when you call `yc-bench sim resume`.
## Task Lifecycle
1. Browse market tasks with `market browse`
2. Accept a task with `task accept` (this sets its deadline)
3. Assign employees with `task assign`
4. Dispatch with `task dispatch` to start work
5. Call `sim resume` to advance time and let employees make progress
6. Tasks complete when all domain requirements are fulfilled
**Penalties for failure vary by difficulty preset.** Completing a task on time
earns full reward + prestige gain. Missing a deadline or cancelling a task
incurs prestige penalties -- cancelling is always more costly than letting a
task fail, so cancel only as a last resort.
## CLI Commands
### Observe
- `yc-bench company status` -- funds, prestige, runway
- `yc-bench employee list` -- skills, salary, active tasks
- `yc-bench market browse [--domain D] [--required-prestige-lte N]` -- available tasks
- `yc-bench task list [--status active|planned]` -- your tasks
- `yc-bench task inspect --task-id UUID` -- progress, deadline, assignments
- `yc-bench finance ledger [--category monthly_payroll|task_reward]` -- transaction history
- `yc-bench report monthly` -- monthly P&L
### Act
- `yc-bench task accept --task-id UUID` -- accept from market
- `yc-bench task assign --task-id UUID --employee-id UUID` -- assign employee
- `yc-bench task dispatch --task-id UUID` -- start work (needs >=1 assignment)
- `yc-bench task cancel --task-id UUID --reason "text"` -- cancel (prestige penalty)
- `yc-bench sim resume` -- advance simulation clock
### Memory (persists across context truncation)
- `yc-bench scratchpad read` -- read your persistent notes
- `yc-bench scratchpad write --content "text"` -- overwrite notes
- `yc-bench scratchpad append --content "text"` -- append to notes
- `yc-bench scratchpad clear` -- clear notes
## Strategy Guidelines
1. **Specialise in 2-3 domains** to climb the prestige ladder faster and unlock
high-reward tasks. Don't spread thin across all 4 domains early on.
2. **Focus employees** -- assigning one employee to many tasks halves their
throughput per additional task. Keep assignments concentrated.
3. **Use the scratchpad** to track your strategy, upcoming deadlines, and
employee assignments. This persists even if conversation context is truncated.
4. **Monitor runway** -- always know how many months of payroll you can cover.
Accept high-reward tasks before payroll dates.
5. **Don't over-accept** -- taking too many tasks and missing deadlines cascades
into prestige loss, locking you out of profitable contracts.
6. Use `finance ledger` and `report monthly` to track revenue trends.
## Your Turn
Each turn:
1. Call `yc-bench company status` and `yc-bench task list` to orient yourself.
2. Check for completed tasks and pending deadlines.
3. Browse market for profitable tasks within your prestige level.
4. Accept, assign, and dispatch tasks strategically.
5. Call `yc-bench sim resume` to advance time.
6. Repeat until the simulation ends.
Think step by step before acting."""
# Starting funds in cents ($250,000)
INITIAL_FUNDS_CENTS = 25_000_000
# Default horizon per preset (years)
_PRESET_HORIZONS = {
"tutorial": 1,
"easy": 1,
"medium": 1,
"hard": 1,
"nightmare": 1,
"fast_test": 1,
"default": 3,
"high_reward": 1,
}
# =============================================================================
# Configuration
# =============================================================================
class YCBenchEvalConfig(HermesAgentEnvConfig):
"""
Configuration for the YC-Bench evaluation environment.
Extends HermesAgentEnvConfig with YC-Bench-specific settings for
preset selection, seed control, scoring, and simulation parameters.
"""
presets: List[str] = Field(
default=["fast_test", "medium", "hard"],
description="YC-Bench preset names to evaluate.",
)
seeds: List[int] = Field(
default=[1, 2, 3],
description="Random seeds -- each preset x seed = one run.",
)
run_timeout: int = Field(
default=3600,
description="Maximum wall-clock seconds per run. Default 60 minutes.",
)
survival_weight: float = Field(
default=0.5,
description="Weight of survival (0/1) in composite score.",
)
funds_weight: float = Field(
default=0.5,
description="Weight of normalised final funds in composite score.",
)
db_dir: str = Field(
default="/tmp/yc_bench_dbs",
description="Directory for per-run SQLite databases.",
)
horizon_years: Optional[int] = Field(
default=None,
description=(
"Simulation horizon in years. If None (default), inferred from "
"preset name (1 year for most, 3 for 'default')."
),
)
company_name: str = Field(
default="BenchCo",
description="Name of the simulated company.",
)
start_date: str = Field(
default="01/01/2025",
description="Simulation start date in MM/DD/YYYY format (yc-bench convention).",
)
# =============================================================================
# Scoring helpers
# =============================================================================
def _read_final_score(db_path: str) -> Dict[str, Any]:
"""
Read final game state from a YC-Bench SQLite database.
Returns dict with final_funds_cents (int), survived (bool),
terminal_reason (str).
Note: yc-bench table names are plural -- 'companies' not 'company',
'sim_events' not 'simulation_log'.
"""
if not os.path.exists(db_path):
logger.warning("DB not found at %s", db_path)
return {
"final_funds_cents": 0,
"survived": False,
"terminal_reason": "db_missing",
}
conn = None
try:
conn = sqlite3.connect(db_path)
cur = conn.cursor()
# Read final funds from the 'companies' table
cur.execute("SELECT funds_cents FROM companies LIMIT 1")
row = cur.fetchone()
funds = row[0] if row else 0
# Determine terminal reason from 'sim_events' table
terminal_reason = "unknown"
try:
cur.execute(
"SELECT event_type FROM sim_events "
"WHERE event_type IN ('bankruptcy', 'horizon_end') "
"ORDER BY scheduled_at DESC LIMIT 1"
)
event_row = cur.fetchone()
if event_row:
terminal_reason = event_row[0]
except sqlite3.OperationalError:
# Table may not exist if simulation didn't progress
pass
survived = funds >= 0 and terminal_reason != "bankruptcy"
return {
"final_funds_cents": funds,
"survived": survived,
"terminal_reason": terminal_reason,
}
except Exception as e:
logger.error("Failed to read DB %s: %s", db_path, e)
return {
"final_funds_cents": 0,
"survived": False,
"terminal_reason": f"db_error: {e}",
}
finally:
if conn:
conn.close()
def _compute_composite_score(
final_funds_cents: int,
survived: bool,
survival_weight: float = 0.5,
funds_weight: float = 0.5,
initial_funds_cents: int = INITIAL_FUNDS_CENTS,
) -> float:
"""
Compute composite score from survival and final funds.
Score = survival_weight * survival_score
+ funds_weight * normalised_funds_score
Normalised funds uses log-scale relative to initial capital:
- funds <= 0: 0.0
- funds == initial: ~0.15
- funds == 10x: ~0.52
- funds == 100x: 1.0
"""
survival_score = 1.0 if survived else 0.0
if final_funds_cents <= 0:
funds_score = 0.0
else:
max_ratio = 100.0
ratio = final_funds_cents / max(initial_funds_cents, 1)
funds_score = min(math.log1p(ratio) / math.log1p(max_ratio), 1.0)
return survival_weight * survival_score + funds_weight * funds_score
# =============================================================================
# Main Environment
# =============================================================================
class YCBenchEvalEnv(HermesAgentBaseEnv):
"""
YC-Bench long-horizon agent benchmark environment (eval-only).
Each eval item is a (preset, seed) pair. The environment initialises the
simulation via ``yc-bench sim init`` (NOT ``yc-bench run`` which would start
a competing built-in agent loop). The HermesAgentLoop then drives the
interaction by calling individual yc-bench CLI commands via the terminal tool.
After the agent loop ends, the SQLite DB is read to extract the final score.
Scoring:
composite = 0.5 * survival + 0.5 * normalised_funds
"""
name = "yc-bench"
env_config_cls = YCBenchEvalConfig
@classmethod
def config_init(cls) -> Tuple[YCBenchEvalConfig, List[APIServerConfig]]:
env_config = YCBenchEvalConfig(
enabled_toolsets=["terminal"],
disabled_toolsets=None,
distribution=None,
max_agent_turns=200,
max_token_length=32000,
agent_temperature=0.0,
system_prompt=YC_BENCH_SYSTEM_PROMPT,
terminal_backend="local",
terminal_timeout=60,
presets=["fast_test", "medium", "hard"],
seeds=[1, 2, 3],
run_timeout=3600,
survival_weight=0.5,
funds_weight=0.5,
db_dir="/tmp/yc_bench_dbs",
eval_handling=EvalHandlingEnum.STOP_TRAIN,
group_size=1,
steps_per_eval=1,
total_steps=1,
tokenizer_name="NousResearch/Hermes-3-Llama-3.1-8B",
use_wandb=True,
wandb_name="yc-bench",
ensure_scores_are_not_same=False,
)
server_configs = [
APIServerConfig(
base_url="https://openrouter.ai/api/v1",
model_name="anthropic/claude-sonnet-4.6",
server_type="openai",
api_key=os.getenv("OPENROUTER_API_KEY", ""),
health_check=False,
)
]
return env_config, server_configs
# =========================================================================
# Setup
# =========================================================================
async def setup(self):
"""Verify yc-bench is installed and build the eval matrix."""
# Verify yc-bench CLI is available
try:
result = subprocess.run(
["yc-bench", "--help"], capture_output=True, text=True, timeout=10
)
if result.returncode != 0:
raise FileNotFoundError
except (FileNotFoundError, subprocess.TimeoutExpired):
raise RuntimeError(
"yc-bench CLI not found. Install with:\n"
' pip install "hermes-agent[yc-bench]"\n'
"Or: git clone https://github.com/collinear-ai/yc-bench "
"&& cd yc-bench && pip install -e ."
)
print("yc-bench CLI verified.")
# Build eval matrix: preset x seed
self.all_eval_items = [
{"preset": preset, "seed": seed}
for preset in self.config.presets
for seed in self.config.seeds
]
self.iter = 0
os.makedirs(self.config.db_dir, exist_ok=True)
self.eval_metrics: List[Tuple[str, float]] = []
# Streaming JSONL log for crash-safe result persistence
log_dir = os.path.join(os.path.dirname(__file__), "logs")
os.makedirs(log_dir, exist_ok=True)
run_ts = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
self._streaming_path = os.path.join(log_dir, f"samples_{run_ts}.jsonl")
self._streaming_file = open(self._streaming_path, "w")
self._streaming_lock = threading.Lock()
print(f"\nYC-Bench eval matrix: {len(self.all_eval_items)} runs")
for item in self.all_eval_items:
print(f" preset={item['preset']!r} seed={item['seed']}")
print(f"Streaming results to: {self._streaming_path}\n")
def _save_result(self, result: Dict[str, Any]):
"""Write a single run result to the streaming JSONL file immediately."""
if not hasattr(self, "_streaming_file") or self._streaming_file.closed:
return
with self._streaming_lock:
self._streaming_file.write(
json.dumps(result, ensure_ascii=False, default=str) + "\n"
)
self._streaming_file.flush()
# =========================================================================
# Training pipeline stubs (eval-only -- not used)
# =========================================================================
async def get_next_item(self):
item = self.all_eval_items[self.iter % len(self.all_eval_items)]
self.iter += 1
return item
def format_prompt(self, item: Dict[str, Any]) -> str:
preset = item["preset"]
seed = item["seed"]
return (
f"A new YC-Bench simulation has been initialized "
f"(preset='{preset}', seed={seed}).\n"
f"Your company '{self.config.company_name}' is ready.\n\n"
"Begin by calling:\n"
"1. `yc-bench company status` -- see your starting funds and prestige\n"
"2. `yc-bench employee list` -- see your team and their skills\n"
"3. `yc-bench market browse --required-prestige-lte 1` -- find tasks "
"you can take\n\n"
"Then accept 2-3 tasks, assign employees, dispatch them, and call "
"`yc-bench sim resume` to advance time. Repeat this loop until the "
"simulation ends (horizon reached or bankruptcy)."
)
async def compute_reward(self, item, result, ctx) -> float:
return 0.0
async def collect_trajectories(self, item):
return None, []
async def score(self, rollout_group_data):
return None
# =========================================================================
# Per-run evaluation
# =========================================================================
async def rollout_and_score_eval(self, eval_item: Dict[str, Any]) -> Dict:
"""
Evaluate a single (preset, seed) run.
1. Sets DATABASE_URL and YC_BENCH_EXPERIMENT env vars
2. Initialises the simulation via ``yc-bench sim init`` (NOT ``run``)
3. Runs HermesAgentLoop with terminal tool
4. Reads SQLite DB to compute final score
5. Returns result dict with survival, funds, and composite score
"""
preset = eval_item["preset"]
seed = eval_item["seed"]
run_id = str(uuid.uuid4())[:8]
run_key = f"{preset}_seed{seed}_{run_id}"
from tqdm import tqdm
tqdm.write(f" [START] preset={preset!r} seed={seed} (run_id={run_id})")
run_start = time.time()
# Isolated DB per run -- prevents cross-run state leakage
db_path = os.path.join(self.config.db_dir, f"yc_bench_{run_key}.db")
os.environ["DATABASE_URL"] = f"sqlite:///{db_path}"
os.environ["YC_BENCH_EXPERIMENT"] = preset
# Determine horizon: explicit config override > preset lookup > default 1
horizon = self.config.horizon_years or _PRESET_HORIZONS.get(preset, 1)
try:
# ----------------------------------------------------------
# Step 1: Initialise the simulation via CLI
# IMPORTANT: We use `sim init`, NOT `yc-bench run`.
# `yc-bench run` starts yc-bench's own LLM agent loop (via
# LiteLLM), which would compete with our HermesAgentLoop.
# `sim init` just sets up the world and returns.
# ----------------------------------------------------------
init_cmd = [
"yc-bench", "sim", "init",
"--seed", str(seed),
"--start-date", self.config.start_date,
"--company-name", self.config.company_name,
"--horizon-years", str(horizon),
]
init_result = subprocess.run(
init_cmd, capture_output=True, text=True, timeout=30,
)
if init_result.returncode != 0:
error_msg = (init_result.stderr or init_result.stdout).strip()
raise RuntimeError(f"yc-bench sim init failed: {error_msg}")
tqdm.write(f" Simulation initialized (horizon={horizon}yr)")
# ----------------------------------------------------------
# Step 2: Run the HermesAgentLoop
# ----------------------------------------------------------
tools, valid_names = self._resolve_tools_for_group()
messages: List[Dict[str, Any]] = [
{"role": "system", "content": YC_BENCH_SYSTEM_PROMPT},
{"role": "user", "content": self.format_prompt(eval_item)},
]
agent = HermesAgentLoop(
server=self.server,
tool_schemas=tools,
valid_tool_names=valid_names,
max_turns=self.config.max_agent_turns,
task_id=run_id,
temperature=self.config.agent_temperature,
max_tokens=self.config.max_token_length,
extra_body=self.config.extra_body,
)
result = await agent.run(messages)
# ----------------------------------------------------------
# Step 3: Read final score from the simulation DB
# ----------------------------------------------------------
score_data = _read_final_score(db_path)
final_funds = score_data["final_funds_cents"]
survived = score_data["survived"]
terminal_reason = score_data["terminal_reason"]
composite = _compute_composite_score(
final_funds_cents=final_funds,
survived=survived,
survival_weight=self.config.survival_weight,
funds_weight=self.config.funds_weight,
)
elapsed = time.time() - run_start
status = "SURVIVED" if survived else "BANKRUPT"
if final_funds >= 0:
funds_str = f"${final_funds / 100:,.0f}"
else:
funds_str = f"-${abs(final_funds) / 100:,.0f}"
tqdm.write(
f" [{status}] preset={preset!r} seed={seed} "
f"funds={funds_str} score={composite:.3f} "
f"turns={result.turns_used} ({elapsed:.0f}s)"
)
out = {
"preset": preset,
"seed": seed,
"survived": survived,
"final_funds_cents": final_funds,
"final_funds_usd": final_funds / 100,
"terminal_reason": terminal_reason,
"composite_score": composite,
"turns_used": result.turns_used,
"finished_naturally": result.finished_naturally,
"elapsed_seconds": elapsed,
"db_path": db_path,
"messages": result.messages,
}
self._save_result(out)
return out
except Exception as e:
elapsed = time.time() - run_start
logger.error("Run %s failed: %s", run_key, e, exc_info=True)
tqdm.write(
f" [ERROR] preset={preset!r} seed={seed}: {e} ({elapsed:.0f}s)"
)
out = {
"preset": preset,
"seed": seed,
"survived": False,
"final_funds_cents": 0,
"final_funds_usd": 0.0,
"terminal_reason": f"error: {e}",
"composite_score": 0.0,
"turns_used": 0,
"error": str(e),
"elapsed_seconds": elapsed,
}
self._save_result(out)
return out
# =========================================================================
# Evaluate
# =========================================================================
async def _run_with_timeout(self, item: Dict[str, Any]) -> Dict:
"""Wrap a single rollout with a wall-clock timeout."""
preset = item["preset"]
seed = item["seed"]
try:
return await asyncio.wait_for(
self.rollout_and_score_eval(item),
timeout=self.config.run_timeout,
)
except asyncio.TimeoutError:
from tqdm import tqdm
tqdm.write(
f" [TIMEOUT] preset={preset!r} seed={seed} "
f"(exceeded {self.config.run_timeout}s)"
)
out = {
"preset": preset,
"seed": seed,
"survived": False,
"final_funds_cents": 0,
"final_funds_usd": 0.0,
"terminal_reason": f"timeout ({self.config.run_timeout}s)",
"composite_score": 0.0,
"turns_used": 0,
"error": "timeout",
}
self._save_result(out)
return out
async def evaluate(self, *args, **kwargs) -> None:
"""
Run YC-Bench evaluation over all (preset, seed) combinations.
Runs sequentially -- each run is 100-500 turns, so parallelising would be
prohibitively expensive and would conflict over the process-global
DATABASE_URL / YC_BENCH_EXPERIMENT env vars.
"""
start_time = time.time()
from tqdm import tqdm
# --- tqdm-compatible logging handler (TB2 pattern) ---
class _TqdmHandler(logging.Handler):
def emit(self, record):
try:
tqdm.write(self.format(record))
except Exception:
self.handleError(record)
root = logging.getLogger()
handler = _TqdmHandler()
handler.setFormatter(
logging.Formatter("%(levelname)s %(name)s: %(message)s")
)
root.handlers = [handler]
for noisy in ("httpx", "openai"):
logging.getLogger(noisy).setLevel(logging.WARNING)
# --- Print config summary ---
print(f"\n{'='*60}")
print("Starting YC-Bench Evaluation")
print(f"{'='*60}")
print(f" Presets: {self.config.presets}")
print(f" Seeds: {self.config.seeds}")
print(f" Total runs: {len(self.all_eval_items)}")
print(f" Max turns/run: {self.config.max_agent_turns}")
print(f" Run timeout: {self.config.run_timeout}s")
print(f"{'='*60}\n")
results = []
pbar = tqdm(
total=len(self.all_eval_items), desc="YC-Bench", dynamic_ncols=True
)
try:
for item in self.all_eval_items:
result = await self._run_with_timeout(item)
results.append(result)
survived_count = sum(1 for r in results if r.get("survived"))
pbar.set_postfix_str(
f"survived={survived_count}/{len(results)}"
)
pbar.update(1)
except (KeyboardInterrupt, asyncio.CancelledError):
tqdm.write("\n[INTERRUPTED] Stopping evaluation...")
pbar.close()
try:
from tools.terminal_tool import cleanup_all_environments
cleanup_all_environments()
except Exception:
pass
if hasattr(self, "_streaming_file") and not self._streaming_file.closed:
self._streaming_file.close()
return
pbar.close()
end_time = time.time()
# --- Compute metrics ---
valid = [r for r in results if r is not None]
if not valid:
print("Warning: No valid results.")
return
total = len(valid)
survived_total = sum(1 for r in valid if r.get("survived"))
survival_rate = survived_total / total if total else 0.0
avg_score = (
sum(r.get("composite_score", 0) for r in valid) / total
if total
else 0.0
)
preset_results: Dict[str, List[Dict]] = defaultdict(list)
for r in valid:
preset_results[r["preset"]].append(r)
eval_metrics = {
"eval/survival_rate": survival_rate,
"eval/avg_composite_score": avg_score,
"eval/total_runs": total,
"eval/survived_runs": survived_total,
"eval/evaluation_time_seconds": end_time - start_time,
}
for preset, items in sorted(preset_results.items()):
ps = sum(1 for r in items if r.get("survived"))
pt = len(items)
pa = (
sum(r.get("composite_score", 0) for r in items) / pt
if pt
else 0
)
key = preset.replace("-", "_")
eval_metrics[f"eval/survival_rate_{key}"] = ps / pt if pt else 0
eval_metrics[f"eval/avg_score_{key}"] = pa
self.eval_metrics = [(k, v) for k, v in eval_metrics.items()]
# --- Print summary ---
print(f"\n{'='*60}")
print("YC-Bench Evaluation Results")
print(f"{'='*60}")
print(
f"Overall survival rate: {survival_rate:.1%} "
f"({survived_total}/{total})"
)
print(f"Average composite score: {avg_score:.4f}")
print(f"Evaluation time: {end_time - start_time:.1f}s")
print("\nPer-preset breakdown:")
for preset, items in sorted(preset_results.items()):
ps = sum(1 for r in items if r.get("survived"))
pt = len(items)
pa = (
sum(r.get("composite_score", 0) for r in items) / pt
if pt
else 0
)
print(f" {preset}: {ps}/{pt} survived avg_score={pa:.4f}")
for r in items:
status = "SURVIVED" if r.get("survived") else "BANKRUPT"
funds = r.get("final_funds_usd", 0)
print(
f" seed={r['seed']} [{status}] "
f"${funds:,.0f} "
f"score={r.get('composite_score', 0):.3f}"
)
print(f"{'='*60}\n")
# --- Log results ---
samples = [
{k: v for k, v in r.items() if k != "messages"} for r in valid
]
try:
await self.evaluate_log(
metrics=eval_metrics,
samples=samples,
start_time=start_time,
end_time=end_time,
generation_parameters={
"temperature": self.config.agent_temperature,
"max_tokens": self.config.max_token_length,
"max_agent_turns": self.config.max_agent_turns,
},
)
except Exception as e:
print(f"Error logging results: {e}")
# --- Cleanup (TB2 pattern) ---
if hasattr(self, "_streaming_file") and not self._streaming_file.closed:
self._streaming_file.close()
print(f"Results saved to: {self._streaming_path}")
try:
from tools.terminal_tool import cleanup_all_environments
cleanup_all_environments()
except Exception:
pass
try:
from environments.agent_loop import _tool_executor
_tool_executor.shutdown(wait=False, cancel_futures=True)
except Exception:
pass
# =========================================================================
# Wandb logging
# =========================================================================
async def wandb_log(self, wandb_metrics: Optional[Dict] = None):
"""Log YC-Bench-specific metrics to wandb."""
if wandb_metrics is None:
wandb_metrics = {}
for k, v in self.eval_metrics:
wandb_metrics[k] = v
self.eval_metrics = []
await super().wandb_log(wandb_metrics)
if __name__ == "__main__":
YCBenchEvalEnv.cli()

View file

@@ -0,0 +1,670 @@
"""
HermesAgentBaseEnv -- Abstract Base Environment for Hermes-Agent + Atropos
Provides the Atropos integration plumbing that all hermes-agent environments share:
- Two-mode operation (OpenAI server for Phase 1, VLLM ManagedServer for Phase 2)
- Per-group toolset/distribution resolution
- Agent loop orchestration via HermesAgentLoop
- ToolContext creation for reward functions
- ScoredDataGroup construction from ManagedServer state
Subclasses only need to implement:
setup() -- Load dataset, initialize state
get_next_item() -- Return the next item from the dataset
format_prompt() -- Convert a dataset item into the user message
compute_reward() -- Score the rollout (has full ToolContext access)
evaluate() -- Periodic evaluation
"""
import asyncio
import json
import logging
import os
import sys
import uuid
from abc import abstractmethod
from pathlib import Path
from typing import Any, Dict, List, Optional, Set, Tuple, Union
# Ensure the hermes-agent repo root is on sys.path so that imports like
# `from model_tools import ...` and `from environments.X import ...` work
# regardless of where the script is invoked from.
_repo_root = Path(__file__).resolve().parent.parent
if str(_repo_root) not in sys.path:
sys.path.insert(0, str(_repo_root))
from dotenv import load_dotenv
from pydantic import Field
# Load API keys from hermes-agent/.env so all environments can access them
_env_path = _repo_root / ".env"
if _env_path.exists():
load_dotenv(dotenv_path=_env_path)
# Apply monkey patches for async-safe tool operation inside Atropos's event loop.
# This patches SwerexModalEnvironment to use a background thread instead of
# asyncio.run(), which would deadlock inside Atropos. Safe for normal CLI too.
from environments.patches import apply_patches
apply_patches()
from atroposlib.envs.base import (
BaseEnv,
BaseEnvConfig,
ScoredDataGroup,
ScoredDataItem,
)
from atroposlib.envs.server_handling.server_manager import (
APIServerConfig,
ServerBaseline,
ServerManager,
)
from atroposlib.type_definitions import Item
from environments.agent_loop import AgentResult, HermesAgentLoop
from environments.tool_context import ToolContext
# Import hermes-agent toolset infrastructure
from model_tools import get_tool_definitions
from toolset_distributions import sample_toolsets_from_distribution
logger = logging.getLogger(__name__)
class HermesAgentEnvConfig(BaseEnvConfig):
"""
Configuration for hermes-agent Atropos environments.
Extends BaseEnvConfig with agent-specific settings for toolsets,
terminal backend, dataset loading, and tool call parsing.
"""
# --- Toolset configuration ---
# Mutually exclusive: use either enabled_toolsets OR distribution
enabled_toolsets: Optional[List[str]] = Field(
default=None,
description="Explicit list of hermes toolsets to enable (e.g., ['terminal', 'file', 'web']). "
"If None and distribution is also None, all available toolsets are enabled.",
)
disabled_toolsets: Optional[List[str]] = Field(
default=None,
description="Toolsets to disable. Applied as a filter on top of enabled_toolsets or distribution.",
)
distribution: Optional[str] = Field(
default=None,
description="Name of a toolset distribution from toolset_distributions.py "
"(e.g., 'development', 'terminal_tasks'). Sampled once per group. "
"Mutually exclusive with enabled_toolsets.",
)
# --- Agent loop configuration ---
max_agent_turns: int = Field(
default=30,
description="Maximum number of LLM calls (tool-calling iterations) per rollout.",
)
system_prompt: Optional[str] = Field(
default=None,
description="System prompt for the agent. Tools are handled via the tools= parameter, "
"not embedded in the prompt text.",
)
agent_temperature: float = Field(
default=1.0,
description="Sampling temperature for agent generation during rollouts.",
)
# --- Terminal backend ---
terminal_backend: str = Field(
default="local",
description="Terminal backend: 'local', 'docker', 'modal', 'daytona', 'ssh', 'singularity'. "
"Modal or Daytona recommended for production RL (cloud isolation per rollout).",
)
terminal_timeout: int = Field(
default=120,
description="Per-command timeout in seconds for terminal tool calls. "
"Commands exceeding this are killed. Increase for tasks with long-running "
"commands (compilation, pip install, etc.).",
)
terminal_lifetime: int = Field(
default=3600,
description="Sandbox inactivity lifetime in seconds. The cleanup thread kills "
"sandboxes that have been idle longer than this. Must be longer than "
"the longest gap between tool calls (e.g., waiting for LLM response).",
)
# --- Dataset ---
dataset_name: Optional[str] = Field(
default=None,
description="HuggingFace dataset name. Optional if tasks are defined inline.",
)
dataset_split: str = Field(
default="train",
description="Dataset split to use.",
)
prompt_field: str = Field(
default="prompt",
description="Which field in the dataset contains the prompt.",
)
# --- Thread pool ---
tool_pool_size: int = Field(
default=128,
description="Thread pool size for tool execution. Each concurrent task needs a "
"thread for tool calls. Must be large enough for parallel evaluation. "
"Too small = thread pool starvation.",
)
# --- Phase 2: Tool call parsing ---
tool_call_parser: str = Field(
default="hermes",
description="Tool call parser name for Phase 2 (VLLM server type). "
"Ignored in Phase 1 (OpenAI server type where VLLM parses natively). "
"Options: hermes, mistral, llama3_json, qwen, deepseek_v3, etc.",
)
# --- Provider-specific parameters ---
# Passed as extra_body to the OpenAI client's chat.completions.create() call.
# Useful for OpenRouter provider preferences, transforms, route settings, etc.
# Example YAML:
# extra_body:
# provider:
# ignore: ["DeepInfra", "Fireworks"]
# order: ["Together"]
# transforms: ["middle-out"]
extra_body: Optional[Dict[str, Any]] = Field(
default=None,
description="Extra body parameters passed to the OpenAI client's "
"chat.completions.create(). Used for OpenRouter provider preferences, "
"transforms, and other provider-specific settings.",
)
class HermesAgentBaseEnv(BaseEnv):
"""
Abstract base environment for hermes-agent Atropos integration.
Handles two modes of operation:
- Phase 1 (OpenAI server type): Uses server.chat_completion() directly.
The server (VLLM, SGLang, OpenRouter, OpenAI) handles tool call parsing
and reasoning extraction natively. DummyManagedServer provides placeholder
tokens. Good for SFT data gen, verifier testing, evaluation.
- Phase 2 (VLLM server type): Uses ManagedServer for exact token IDs + logprobs
via /generate. Client-side tool call parser reconstructs structured tool_calls
from raw output. Full RL training capability.
Subclasses must implement:
setup() -- Load dataset, initialize state
get_next_item() -- Return the next item to roll out
format_prompt() -- Convert a dataset item into the user message string
compute_reward() -- Score the rollout using ToolContext
evaluate() -- Periodic evaluation
"""
name: Optional[str] = "hermes-agent"
env_config_cls = HermesAgentEnvConfig
def __init__(
self,
config: HermesAgentEnvConfig,
server_configs: Union[ServerBaseline, List[APIServerConfig]],
slurm=False,
testing=False,
):
super().__init__(config, server_configs, slurm, testing)
# Set terminal environment variables so hermes tools pick them up.
# These can all be overridden per-environment via config fields instead
# of requiring users to set shell env vars.
if config.terminal_backend:
os.environ["TERMINAL_ENV"] = config.terminal_backend
os.environ["TERMINAL_TIMEOUT"] = str(config.terminal_timeout)
os.environ["TERMINAL_LIFETIME_SECONDS"] = str(config.terminal_lifetime)
print(
f"🖥️ Terminal: backend={config.terminal_backend}, "
f"timeout={config.terminal_timeout}s, lifetime={config.terminal_lifetime}s"
)
# Resize the agent loop's thread pool for tool execution.
# This must be large enough for the number of concurrent tasks
# (e.g., 89 parallel TB2 eval tasks each need a thread for tool calls).
from environments.agent_loop import resize_tool_pool
resize_tool_pool(config.tool_pool_size)
# Set tool_parser on the ServerManager so ManagedServer uses it
# for bidirectional tool call translation (raw text ↔ OpenAI tool_calls).
if hasattr(self.server, 'tool_parser'):
self.server.tool_parser = config.tool_call_parser
print(f"🔧 Tool parser: {config.tool_call_parser}")
# Current group's resolved tools (set in collect_trajectories)
self._current_group_tools: Optional[Tuple[List[Dict], Set[str]]] = None
# Tool error tracking for wandb logging
self._tool_error_buffer: List[Dict[str, Any]] = []
# =========================================================================
# Toolset resolution (per-group)
# =========================================================================
def _resolve_tools_for_group(self) -> Tuple[List[Dict[str, Any]], Set[str]]:
"""
Resolve toolsets for a group. Called once in collect_trajectories(),
then shared by all collect_trajectory() calls in the group.
If distribution is set, samples probabilistically.
If enabled_toolsets is set, uses that explicit list.
disabled_toolsets is applied as a filter on top.
Returns:
(tool_schemas, valid_tool_names) tuple
"""
config = self.config
if config.distribution:
group_toolsets = sample_toolsets_from_distribution(config.distribution)
logger.info("Sampled toolsets from '%s': %s", config.distribution, group_toolsets)
else:
group_toolsets = config.enabled_toolsets # None means "all available"
if group_toolsets is None:
logger.warning(
"enabled_toolsets is None -- loading ALL tools including messaging. "
"Set explicit enabled_toolsets for RL training."
)
tools = get_tool_definitions(
enabled_toolsets=group_toolsets,
disabled_toolsets=config.disabled_toolsets,
quiet_mode=True,
)
valid_names = {t["function"]["name"] for t in tools} if tools else set()
logger.info("Resolved %d tools for group: %s", len(valid_names), sorted(valid_names))
return tools, valid_names
# =========================================================================
# Server mode detection
# =========================================================================
def _use_managed_server(self) -> bool:
"""
Determine if we should use ManagedServer (Phase 2) or direct server (Phase 1).
Phase 2 (ManagedServer) is used when the server type is 'vllm' or 'sglang',
which go through the /generate endpoint for exact token tracking.
Phase 1 (direct server) is used for 'openai' server type, which uses
/v1/chat/completions with native tool call parsing.
"""
if not self.server.servers:
return False
server = self.server.servers[0]
# If the server is an OpenAI server (not VLLM/SGLang), use direct mode
from atroposlib.envs.server_handling.openai_server import OpenAIServer
return not isinstance(server, OpenAIServer)
# =========================================================================
# Core Atropos integration
# =========================================================================
async def collect_trajectories(
self, item: Item
) -> Tuple[
Union[Optional[ScoredDataGroup], List[Optional[ScoredDataGroup]]],
List[Item],
]:
"""
Override collect_trajectories to resolve toolsets once per group,
then delegate to the standard group-level collection.
The default BaseEnv.collect_trajectories() calls collect_trajectory()
group_size times in parallel. We resolve tools once here and store
them for all those calls to use.
"""
# Resolve toolsets for this group (shared by all rollouts in the group)
self._current_group_tools = self._resolve_tools_for_group()
# Delegate to the default implementation which calls collect_trajectory()
# group_size times via asyncio.gather
return await super().collect_trajectories(item)
# =========================================================================
# Wandb rollout display -- format trajectories nicely
# =========================================================================
@staticmethod
def _format_trajectory_for_display(messages: List[Dict[str, Any]]) -> str:
"""
Format a conversation's messages into a readable trajectory string
for wandb rollout tables. Shows tool calls, tool results, and reasoning
in a structured way instead of raw token decoding.
"""
parts = []
for msg in messages:
role = msg.get("role", "unknown")
content = msg.get("content", "")
if role == "system":
parts.append(f"[SYSTEM]\n{content}")
elif role == "user":
parts.append(f"[USER]\n{content}")
elif role == "assistant":
# Show reasoning if present
reasoning = msg.get("reasoning_content", "")
if reasoning:
# Truncate long reasoning for display
if len(reasoning) > 300:
reasoning = reasoning[:300] + "..."
parts.append(f"[ASSISTANT thinking]\n{reasoning}")
# Show content
if content:
parts.append(f"[ASSISTANT]\n{content}")
# Show tool calls
tool_calls = msg.get("tool_calls", [])
for tc in tool_calls:
func = tc.get("function", {})
name = func.get("name", "?")
args = func.get("arguments", "{}")
# Truncate long arguments for display
if len(args) > 200:
args = args[:200] + "..."
parts.append(f"[TOOL CALL] {name}({args})")
elif role == "tool":
tool_id = msg.get("tool_call_id", "")
result = content
# Truncate long tool results for display
if len(result) > 500:
result = result[:500] + "..."
parts.append(f"[TOOL RESULT] {result}")
return "\n\n".join(parts)
async def add_rollouts_for_wandb(
self,
scored_data,
item=None,
):
"""
Override to show formatted trajectories with tool calls visible,
instead of raw token decoding which loses all structure.
"""
num_keep = self.config.num_rollouts_per_group_for_logging
if num_keep == -1:
num_keep = self.config.group_size
group = []
for i in range(min(num_keep, len(scored_data.get("scores", [])))):
score = scored_data["scores"][i]
# Use messages if available for rich display
messages = None
if scored_data.get("messages") and i < len(scored_data["messages"]):
messages = scored_data["messages"][i]
if messages:
text = self._format_trajectory_for_display(messages)
elif scored_data.get("tokens") and i < len(scored_data["tokens"]):
text = self.tokenizer.decode(scored_data["tokens"][i])
else:
text = "(no data)"
group.append((text, score))
self.rollouts_for_wandb.append(group)
if len(self.rollouts_for_wandb) > self.config.num_rollouts_to_keep:
self.rollouts_for_wandb.pop(0)
async def wandb_log(self, wandb_metrics: Optional[Dict] = None):
"""Log base metrics including tool errors to wandb."""
if wandb_metrics is None:
wandb_metrics = {}
# Log tool error stats
if self._tool_error_buffer:
wandb_metrics["train/tool_errors_count"] = len(self._tool_error_buffer)
# Log error details as a summary string (tables can crash wandb on tmp cleanup)
error_summaries = []
for err in self._tool_error_buffer:
error_summaries.append(
f"[turn {err['turn']}] {err['tool']}({err['args'][:80]}) -> {err['error'][:150]}"
)
wandb_metrics["train/tool_error_details"] = "\n".join(error_summaries)
# Also print to stdout for immediate visibility
for summary in error_summaries:
print(f" Tool Error: {summary}")
self._tool_error_buffer = []
else:
wandb_metrics["train/tool_errors_count"] = 0
await super().wandb_log(wandb_metrics)
async def collect_trajectory(
self, item: Item
) -> Tuple[Optional[Union[ScoredDataItem, Any]], List[Item]]:
"""
Run a single rollout: agent loop + reward computation.
This is called group_size times in parallel by collect_trajectories().
Each call gets its own task_id for terminal/browser session isolation.
"""
task_id = str(uuid.uuid4())
# Get group-level tools (resolved once in collect_trajectories)
if self._current_group_tools is None:
# Fallback: resolve per-trajectory if called outside collect_trajectories
tools, valid_names = self._resolve_tools_for_group()
else:
tools, valid_names = self._current_group_tools
# Build initial messages
messages: List[Dict[str, Any]] = []
if self.config.system_prompt:
messages.append({"role": "system", "content": self.config.system_prompt})
messages.append({"role": "user", "content": self.format_prompt(item)})
# Run the agent loop
result: AgentResult
if self._use_managed_server():
# Phase 2: ManagedServer with ToolCallTranslator -- exact tokens + logprobs
# tool_parser is set on ServerManager in __init__ and passed through
# to ManagedServer, which uses ToolCallTranslator for bidirectional
# translation between raw text and OpenAI tool_calls.
try:
async with self.server.managed_server(
tokenizer=self.tokenizer,
preserve_think_blocks=bool(self.config.thinking_mode),
) as managed:
agent = HermesAgentLoop(
server=managed,
tool_schemas=tools,
valid_tool_names=valid_names,
max_turns=self.config.max_agent_turns,
task_id=task_id,
temperature=self.config.agent_temperature,
max_tokens=self.config.max_token_length,
extra_body=self.config.extra_body,
)
result = await agent.run(messages)
except NotImplementedError:
# DummyManagedServer not allowed -- fall back to Phase 1
logger.warning(
"ManagedServer not available (OpenAI server?). "
"Falling back to direct server mode."
)
agent = HermesAgentLoop(
server=self.server,
tool_schemas=tools,
valid_tool_names=valid_names,
max_turns=self.config.max_agent_turns,
task_id=task_id,
temperature=self.config.agent_temperature,
max_tokens=self.config.max_token_length,
extra_body=self.config.extra_body,
)
result = await agent.run(messages)
else:
# Phase 1: OpenAI server -- native tool_calls, placeholder tokens
agent = HermesAgentLoop(
server=self.server,
tool_schemas=tools,
valid_tool_names=valid_names,
max_turns=self.config.max_agent_turns,
task_id=task_id,
temperature=self.config.agent_temperature,
max_tokens=self.config.max_token_length,
extra_body=self.config.extra_body,
)
result = await agent.run(messages)
# Skip reward computation if the agent loop produced no meaningful work
# (e.g., API call failed on turn 1). No point spinning up a Modal sandbox
# just to verify files that were never created.
only_system_and_user = all(
msg.get("role") in ("system", "user") for msg in result.messages
)
if result.turns_used == 0 or only_system_and_user:
logger.warning(
"Agent loop produced no output (turns=%d, msgs=%d). Skipping reward.",
result.turns_used, len(result.messages),
)
reward = 0.0
else:
# Compute reward using ToolContext (gives verifier full tool access)
ctx = ToolContext(task_id)
try:
reward = await self.compute_reward(item, result, ctx)
except Exception as e:
logger.error("compute_reward failed: %s", e)
reward = 0.0
finally:
ctx.cleanup()
# Track tool errors for wandb logging
if result.tool_errors:
for err in result.tool_errors:
self._tool_error_buffer.append({
"turn": err.turn,
"tool": err.tool_name,
"args": err.arguments[:150],
"error": err.error[:300],
"result": err.tool_result[:300],
})
# Build ScoredDataItem from ManagedServer state
# Phase 2: real tokens/masks/logprobs from SequenceNodes
# Phase 1: placeholder tokens (still need a valid ScoredDataItem for the pipeline)
nodes = (result.managed_state or {}).get("nodes", [])
if nodes:
# Phase 2 (or DummyManagedServer): use actual node data
node = nodes[-1] # Final sequence node = full trajectory
scored_item: Dict[str, Any] = {
"tokens": node.tokens,
"masks": node.masked_tokens,
"scores": reward,
}
# Include logprobs if available (Phase 2)
if hasattr(node, "logprobs") and node.logprobs:
scored_item["advantages"] = None # Computed by trainer
scored_item["ref_logprobs"] = None
else:
# Phase 1 with no managed state: create placeholder tokens
# so the data pipeline doesn't break. These are NOT suitable
# for training but allow process mode (SFT data gen) to work.
# Tokenize the full conversation to get approximate tokens.
full_text = "\n".join(
msg.get("content", "") for msg in result.messages if msg.get("content")
)
if self.tokenizer:
tokens = self.tokenizer.encode(full_text, add_special_tokens=True)
else:
tokens = list(range(min(len(full_text) // 4, 128)))
scored_item = {
"tokens": tokens,
"masks": [-100] + tokens[1:], # Mask first token as prompt
"scores": reward,
}
# Always include messages for wandb rollout display and data logging
scored_item["messages"] = result.messages
return scored_item, []
# =========================================================================
# Abstract methods -- subclasses must implement
# =========================================================================
@abstractmethod
async def setup(self):
"""
Load dataset, initialize state.
Called once when the environment starts. Typical implementation:
self.dataset = load_dataset(self.config.dataset_name, split=self.config.dataset_split)
self.iter = 0
"""
raise NotImplementedError
@abstractmethod
async def get_next_item(self) -> Item:
"""
Return the next item from the dataset for rollout.
Called by the base env's main loop to get items for workers.
Should cycle through the dataset.
"""
raise NotImplementedError
@abstractmethod
def format_prompt(self, item: Item) -> str:
"""
Convert a dataset item into the user message for the agent.
Args:
item: Dataset item (dict, tuple, etc.)
Returns:
The prompt string to send to the agent
"""
raise NotImplementedError
@abstractmethod
async def compute_reward(
self, item: Item, result: AgentResult, ctx: ToolContext
) -> float:
"""
Score the rollout. Has full access to:
- item: the original dataset item (ground truth, test commands, etc.)
- result: AgentResult with full messages, turn count, reasoning, etc.
- ctx: ToolContext -- call ANY hermes-agent tool (terminal, file, web,
browser, vision...) scoped to this rollout's sandbox. Nothing
is off-limits.
Args:
item: The dataset item that was rolled out
result: The agent's rollout result
ctx: ToolContext with full tool access for verification
Returns:
Reward float (typically 0.0 to 1.0, but any float is valid)
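Example (an illustrative sketch only; the pytest command and /workspace
path are assumptions for a typical SWE-style task, not part of this base class):
test = ctx.terminal("cd /workspace && pytest -q", timeout=120)
return 1.0 if test["exit_code"] == 0 else 0.0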
"""
raise NotImplementedError
@abstractmethod
async def evaluate(self, *args, **kwargs):
"""
Periodic evaluation. Called every steps_per_eval steps.
Typical implementation runs the agent on a held-out eval set
and logs metrics via wandb/evaluate_log.
"""
raise NotImplementedError

View file

@ -0,0 +1,34 @@
# SWE Environment -- Default Configuration
#
# SWE-bench style tasks with Modal sandboxes for cloud isolation.
# Uses terminal + file + web toolsets.
#
# Usage:
# python environments/hermes_swe_env/hermes_swe_env.py serve \
# --config environments/hermes_swe_env/default.yaml
env:
enabled_toolsets: ["terminal", "file", "web"]
max_agent_turns: 30
max_token_length: 4096
group_size: 4
terminal_backend: "modal"
tool_call_parser: "hermes"
tokenizer_name: "NousResearch/DeepHermes-3-Llama-3-3B-Preview"
dataset_name: "bigcode/humanevalpack"
dataset_split: "test"
prompt_field: "prompt"
steps_per_eval: 50
total_steps: 500
use_wandb: true
wandb_name: "hermes-swe"
system_prompt: >
You are a skilled software engineer. You have access to a terminal,
file tools, and web search. Use these tools to complete the coding task.
Write clean, working code and verify it runs correctly before finishing.
openai:
base_url: "http://localhost:8000/v1"
model_name: "NousResearch/DeepHermes-3-Llama-3-3B-Preview"
server_type: "openai"
api_key: ""

View file

@ -0,0 +1,229 @@
"""
HermesSweEnv -- SWE-Bench Style Environment with Modal Sandboxes
A concrete environment for software engineering tasks where the model writes code
and the reward function runs tests to verify correctness. Uses Modal terminal
backend for cloud-isolated sandboxes per rollout.
The reward function uses ToolContext.terminal() to run test commands in the same
Modal sandbox the model used during its agentic loop. All filesystem state from
the model's tool calls is preserved for verification.
Usage:
# Phase 1: OpenAI server type
vllm serve YourModel --enable-auto-tool-choice --tool-call-parser hermes
run-api
python environments/hermes_swe_env/hermes_swe_env.py serve \\
--openai.base_url http://localhost:8000/v1 \\
--openai.model_name YourModel \\
--openai.server_type openai \\
--env.dataset_name bigcode/humanevalpack \\
--env.terminal_backend modal
# Phase 2: VLLM server type (full RL training)
python environments/hermes_swe_env/hermes_swe_env.py serve \\
--openai.base_url http://localhost:8000/v1 \\
--openai.model_name YourModel \\
--openai.server_type vllm \\
--env.tool_call_parser hermes \\
--env.terminal_backend modal
"""
import logging
import sys
import time
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple, Union
# Ensure repo root is on sys.path for imports
_repo_root = Path(__file__).resolve().parent.parent.parent
if str(_repo_root) not in sys.path:
sys.path.insert(0, str(_repo_root))
from datasets import load_dataset
from atroposlib.envs.base import ScoredDataGroup
from atroposlib.envs.server_handling.server_manager import APIServerConfig
from atroposlib.type_definitions import Item
from environments.agent_loop import AgentResult
from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
from environments.tool_context import ToolContext
logger = logging.getLogger(__name__)
class HermesSweEnvConfig(HermesAgentEnvConfig):
"""Config with defaults for SWE-bench style tasks."""
pass # Inherits all fields, overrides defaults in config_init
class HermesSweEnv(HermesAgentBaseEnv):
"""
SWE-bench style environment using Modal terminal backend.
The model gets a coding task, uses terminal + file + web tools to solve it,
and the reward function runs tests in the same Modal sandbox to verify.
Subclass this for specific SWE datasets (HumanEval, SWE-bench, etc.)
and customize format_prompt() and compute_reward() as needed.
"""
name = "hermes-swe"
env_config_cls = HermesSweEnvConfig
@classmethod
def config_init(cls) -> Tuple[HermesSweEnvConfig, List[APIServerConfig]]:
"""
Default configuration for the SWE environment.
Uses Modal terminal backend for cloud isolation and terminal + file + web toolsets.
"""
env_config = HermesSweEnvConfig(
# Toolsets: terminal for running code, file for reading/writing, web for docs
enabled_toolsets=["terminal", "file", "web"],
disabled_toolsets=None,
distribution=None,
# Agent settings -- SWE tasks need more turns
max_agent_turns=30,
max_token_length=4096,
agent_temperature=1.0,
system_prompt=(
"You are a skilled software engineer. You have access to a terminal, "
"file tools, and web search. Use these tools to complete the coding task. "
"Write clean, working code and verify it runs correctly before finishing."
),
# Modal backend for cloud-isolated sandboxes
terminal_backend="modal",
# Dataset -- override via CLI for your specific SWE dataset
dataset_name="bigcode/humanevalpack",
dataset_split="test",
prompt_field="prompt",
# Atropos settings
group_size=4,
tokenizer_name="NousResearch/DeepHermes-3-Llama-3-3B-Preview",
tool_call_parser="hermes",
steps_per_eval=50,
total_steps=500,
use_wandb=True,
wandb_name="hermes-swe",
)
server_configs = [
APIServerConfig(
base_url="http://localhost:8000/v1",
model_name="NousResearch/DeepHermes-3-Llama-3-3B-Preview",
server_type="openai", # Phase 1; switch to "vllm" for Phase 2
api_key="",
)
]
return env_config, server_configs
async def setup(self):
"""Load the SWE dataset."""
if self.config.dataset_name:
self.dataset = load_dataset(
self.config.dataset_name, split=self.config.dataset_split
)
else:
# Placeholder if no dataset specified
self.dataset = []
self.iter = 0
self.reward_buffer: List[float] = []
async def get_next_item(self) -> Dict[str, Any]:
"""Cycle through the SWE dataset."""
if not self.dataset:
raise ValueError("No dataset loaded. Set dataset_name in config.")
item = self.dataset[self.iter % len(self.dataset)]
self.iter += 1
return item
def format_prompt(self, item: Dict[str, Any]) -> str:
"""
Format the SWE task prompt.
Override this in subclasses for different dataset formats.
Default assumes the dataset has a 'prompt' field and optionally a 'test' field.
"""
prompt = item.get(self.config.prompt_field, "")
# If the dataset has test information, include it in the prompt
test_info = item.get("test", item.get("test_code", item.get("tests", "")))
if test_info:
prompt += f"\n\nTests to pass:\n{test_info}"
return prompt
async def compute_reward(
self, item: Dict[str, Any], result: AgentResult, ctx: ToolContext
) -> float:
"""
Score by running tests in the model's Modal sandbox.
Default implementation:
- If the dataset item has a 'test' or 'test_code' field, run it
- Check exit code: 0 = pass, non-zero = fail
- Partial credit for file creation
Override this in subclasses for more sophisticated reward logic.
"""
# Find the test command from the dataset item
test_code = item.get("test", item.get("test_code", item.get("tests", "")))
if test_code:
# Write the test code to a scratch file and run it in the model's sandbox.
# (Passing multi-line test code through `python3 -c "..."` breaks on quoting.)
ctx.write_file("/workspace/_verify_test.py", test_code)
test_result = ctx.terminal(
"cd /workspace && python3 _verify_test.py", timeout=60
)
if test_result["exit_code"] == 0:
self.reward_buffer.append(1.0)
return 1.0
# Partial credit: check if the model created any Python files.
# NOTE: relies on a /tmp/.start_marker timestamp file existing in the sandbox;
# if it is missing, find exits non-zero and no partial credit is awarded.
file_check = ctx.terminal("find /workspace -name '*.py' -newer /tmp/.start_marker 2>/dev/null | head -5")
if file_check["exit_code"] == 0 and file_check.get("output", "").strip():
self.reward_buffer.append(0.1)
return 0.1
self.reward_buffer.append(0.0)
return 0.0
async def evaluate(self, *args, **kwargs):
"""
Run evaluation on a held-out set.
Override for dataset-specific evaluation logic.
"""
start_time = time.time()
end_time = time.time()
eval_metrics = {"eval/placeholder": 0.0}
await self.evaluate_log(
metrics=eval_metrics,
start_time=start_time,
end_time=end_time,
)
async def wandb_log(self, wandb_metrics: Optional[Dict] = None):
"""Log SWE-specific metrics."""
if wandb_metrics is None:
wandb_metrics = {}
if self.reward_buffer:
wandb_metrics["train/avg_reward"] = sum(self.reward_buffer) / len(
self.reward_buffer
)
wandb_metrics["train/pass_rate"] = sum(
1 for r in self.reward_buffer if r == 1.0
) / len(self.reward_buffer)
self.reward_buffer = []
await super().wandb_log(wandb_metrics)
if __name__ == "__main__":
HermesSweEnv.cli()

View file

@ -0,0 +1,42 @@
"""
Monkey patches for making hermes-agent tools work inside async frameworks (Atropos).
Problem:
Some tools use asyncio.run() internally (e.g., Modal backend via SWE-ReX,
web_extract). This crashes when called from inside Atropos's event loop because
asyncio.run() can't be nested.
Solution:
The Modal environment (tools/environments/modal.py) now uses a dedicated
_AsyncWorker thread internally, making it safe for both CLI and Atropos use.
No monkey-patching is required.
This module is kept for backward compatibility -- apply_patches() is now a no-op.
Usage:
Call apply_patches() once at import time (done automatically by hermes_base_env.py).
This is idempotent -- calling it multiple times is safe.
"""
import logging
logger = logging.getLogger(__name__)
_patches_applied = False
def apply_patches():
"""Apply all monkey patches needed for Atropos compatibility.
Now a no-op -- Modal async safety is built directly into ModalEnvironment.
Safe to call multiple times.
"""
global _patches_applied
if _patches_applied:
return
# Modal async-safety is now built into tools/environments/modal.py
# via the _AsyncWorker class. No monkey-patching needed.
logger.debug("apply_patches() called — no patches needed (async safety is built-in)")
_patches_applied = True

View file

@ -0,0 +1,34 @@
# Terminal Test Environment -- Default Configuration
#
# Simple file-creation tasks for validating the full Atropos + hermes-agent stack.
# Uses Modal terminal backend and OpenRouter (Claude) for inference.
# API keys loaded from ~/hermes-agent/.env
#
# Usage:
# run-api
# python environments/terminal_test_env/terminal_test_env.py serve \
# --config environments/terminal_test_env/default.yaml
env:
enabled_toolsets: ["terminal", "file"]
max_agent_turns: 10
max_token_length: 2048
group_size: 3
total_steps: 3
steps_per_eval: 3
terminal_backend: "modal"
tool_call_parser: "hermes"
tokenizer_name: "NousResearch/DeepHermes-3-Llama-3-3B-Preview"
ensure_scores_are_not_same: false
use_wandb: false
system_prompt: >
You are a helpful assistant with access to a terminal and file tools.
Complete the user's request by using the available tools.
Be precise and follow instructions exactly.
openai:
base_url: "https://openrouter.ai/api/v1"
model_name: "anthropic/claude-opus-4.6"
server_type: "openai"
health_check: false
# api_key loaded from OPENROUTER_API_KEY in .env

View file

@ -0,0 +1,292 @@
"""
TerminalTestEnv -- Simple Test Environment for Validating the Stack
A self-contained environment with inline tasks (no external dataset needed).
Each task asks the model to create a file at a known path with specific content.
The reward verifier cats the file and checks if the content matches.
Enables only terminal + file toolsets. Uses Modal terminal backend with
OpenRouter (Claude) by default.
Training tasks (3):
1. Create ~/greeting.txt with "Hello from Hermes Agent"
2. Create ~/count.txt with numbers 1-5, one per line
3. Create ~/answer.txt with the result of 123 + 456
Eval task (1):
1. Create ~/result.txt with the result of 6 * 7
Usage:
# Start Atropos API server
run-api
# Run environment (uses OpenRouter + Modal by default)
python environments/terminal_test_env/terminal_test_env.py serve
# Process mode (no run-api needed, saves to JSONL)
python environments/terminal_test_env/terminal_test_env.py process \\
--env.data_path_to_save_groups terminal_test_output.jsonl
"""
import logging
import os
import sys
import time
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple, Union
# Ensure repo root is on sys.path for imports
_repo_root = Path(__file__).resolve().parent.parent.parent
if str(_repo_root) not in sys.path:
sys.path.insert(0, str(_repo_root))
from atroposlib.envs.base import ScoredDataGroup
from atroposlib.envs.server_handling.server_manager import APIServerConfig
from atroposlib.type_definitions import Item
from environments.agent_loop import AgentResult
from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
from environments.tool_context import ToolContext
logger = logging.getLogger(__name__)
# =============================================================================
# Inline task definitions -- no external dataset needed
# =============================================================================
TRAIN_TASKS = [
{
"prompt": "Create a file at ~/greeting.txt containing exactly the text: Hello from Hermes Agent",
"verify_path": "~/greeting.txt",
"expected_content": "Hello from Hermes Agent",
},
{
"prompt": "Create a file at ~/count.txt containing the numbers 1 through 5, one per line",
"verify_path": "~/count.txt",
"expected_content": "1\n2\n3\n4\n5",
},
{
"prompt": "Create a file at ~/answer.txt containing the result of 123 + 456",
"verify_path": "~/answer.txt",
"expected_content": "579",
},
]
EVAL_TASKS = [
{
"prompt": "Create a file at ~/result.txt containing the result of 6 * 7",
"verify_path": "~/result.txt",
"expected_content": "42",
},
]
class TerminalTestEnvConfig(HermesAgentEnvConfig):
"""Config with defaults suitable for terminal testing."""
pass # Inherits all fields, overrides defaults in config_init
class TerminalTestEnv(HermesAgentBaseEnv):
"""
Simple test environment with inline file-creation tasks.
All tasks follow the same pattern: "create a file at ~/X.txt with content Y".
The verifier runs `cat ~/X.txt` in the rollout's terminal and checks the output
against the expected string. Same verifier logic for all tasks.
This environment is designed to validate the full stack end-to-end:
- Agent loop executes tool calls (terminal/file)
- ToolContext provides terminal access to the reward function
- Reward function verifies file content via cat
- Scored data flows through the Atropos pipeline
"""
name = "terminal-test"
env_config_cls = TerminalTestEnvConfig
@classmethod
def config_init(cls) -> Tuple[TerminalTestEnvConfig, List[APIServerConfig]]:
"""
Default configuration for the terminal test environment.
Uses Modal terminal backend for cloud isolation and OpenRouter with
Claude for inference. API keys loaded from ~/hermes-agent/.env.
"""
env_config = TerminalTestEnvConfig(
# Terminal + file tools only
enabled_toolsets=["terminal", "file"],
disabled_toolsets=None,
distribution=None,
# Agent settings
max_agent_turns=10, # Simple tasks, don't need many turns
max_token_length=16000,
agent_temperature=1.0,
system_prompt=(
"You are a helpful assistant with access to a terminal and file tools. "
"Complete the user's request by using the available tools. "
"Be precise and follow instructions exactly."
),
# Modal terminal backend for cloud-isolated sandboxes per rollout
terminal_backend="modal",
# Atropos settings
group_size=3, # 3 rollouts per group
tokenizer_name="NousResearch/q-30b-t-h45-e1",
tool_call_parser="hermes",
steps_per_eval=3, # Eval after all 3 steps
total_steps=3, # 3 groups total (1 group per step)
use_wandb=True,
wandb_name="terminal-test",
ensure_scores_are_not_same=False, # Allow all-same scores for simple tasks
# No external dataset
dataset_name=None,
)
# OpenRouter with Claude -- API key loaded from .env (OPENROUTER_API_KEY)
server_configs = [
APIServerConfig(
base_url="https://openrouter.ai/api/v1",
model_name="anthropic/claude-opus-4.6",
server_type="openai",
api_key=os.getenv("OPENROUTER_API_KEY", ""),
health_check=False, # OpenRouter doesn't have a /health endpoint
)
]
return env_config, server_configs
async def setup(self):
"""Initialize inline task lists."""
self.train_tasks = list(TRAIN_TASKS)
self.eval_tasks = list(EVAL_TASKS)
self.iter = 0
# Track reward stats for wandb logging
self.reward_buffer: List[float] = []
async def get_next_item(self) -> Dict[str, str]:
"""Cycle through training tasks."""
item = self.train_tasks[self.iter % len(self.train_tasks)]
self.iter += 1
return item
def format_prompt(self, item: Dict[str, str]) -> str:
"""The prompt is directly in the task item."""
return item["prompt"]
async def compute_reward(
self, item: Dict[str, str], result: AgentResult, ctx: ToolContext
) -> float:
"""
Verify by cat-ing the expected file path and checking content matches.
Same verifier for all tasks -- they all write a file at a known path.
Scoring:
1.0 = exact match
0.5 = expected content is present but has extra stuff
0.0 = file doesn't exist or content doesn't match
"""
verify_result = ctx.terminal(f"cat {item['verify_path']}")
# File doesn't exist or can't be read
if verify_result["exit_code"] != 0:
self.reward_buffer.append(0.0)
return 0.0
actual = verify_result.get("output", "").strip()
expected = item["expected_content"].strip()
# Exact match
if actual == expected:
self.reward_buffer.append(1.0)
return 1.0
# Partial credit: expected content is present but has extra stuff
if expected in actual:
self.reward_buffer.append(0.5)
return 0.5
self.reward_buffer.append(0.0)
return 0.0
async def evaluate(self, *args, **kwargs):
"""
Run the eval tasks as single-turn completions and log the samples.
Keeps eval fast; the full agent loop is exercised during training.
"""
start_time = time.time()
total = len(self.eval_tasks)
samples = []
for eval_item in self.eval_tasks:
try:
# For eval, we do a simple single-turn completion (not full agent loop)
# to keep eval fast. The agent loop is tested via training.
completion = await self.server.chat_completion(
messages=[
{"role": "system", "content": self.config.system_prompt or ""},
{"role": "user", "content": eval_item["prompt"]},
],
n=1,
max_tokens=self.config.max_token_length,
temperature=0.0,
split="eval",
)
response_content = (
completion.choices[0].message.content if completion.choices else ""
)
samples.append(
{
"prompt": eval_item["prompt"],
"response": response_content,
"expected": eval_item["expected_content"],
}
)
except Exception as e:
logger.error("Eval failed for item: %s", e)
samples.append(
{
"prompt": eval_item["prompt"],
"response": f"ERROR: {e}",
"expected": eval_item["expected_content"],
}
)
end_time = time.time()
eval_metrics = {
"eval/num_samples": total,
}
await self.evaluate_log(
metrics=eval_metrics,
samples=samples,
start_time=start_time,
end_time=end_time,
)
async def wandb_log(self, wandb_metrics: Optional[Dict] = None):
"""Log training metrics including reward stats and accuracy."""
if wandb_metrics is None:
wandb_metrics = {}
if self.reward_buffer:
total = len(self.reward_buffer)
correct = sum(1 for r in self.reward_buffer if r == 1.0)
partial = sum(1 for r in self.reward_buffer if r == 0.5)
wandb_metrics["train/avg_reward"] = sum(self.reward_buffer) / total
wandb_metrics["train/accuracy"] = correct / total
wandb_metrics["train/partial_match_rate"] = partial / total
wandb_metrics["train/total_rollouts"] = total
self.reward_buffer = []
await super().wandb_log(wandb_metrics)
if __name__ == "__main__":
TerminalTestEnv.cli()

View file

@ -0,0 +1,120 @@
"""
Tool Call Parser Registry
Client-side parsers that extract structured tool_calls from raw model output text.
Used in Phase 2 (VLLM server type) where ManagedServer's /generate endpoint returns
raw text without tool call parsing.
Each parser is a standalone reimplementation of the corresponding VLLM parser's
non-streaming extract_tool_calls() logic. No VLLM dependency -- only standard library
(re, json, uuid) and openai types.
Usage:
from environments.tool_call_parsers import get_parser
parser = get_parser("hermes")
content, tool_calls = parser.parse(raw_model_output)
# content = text with tool call markup stripped
# tool_calls = list of ChatCompletionMessageToolCall objects, or None
"""
import logging
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple, Type
from openai.types.chat.chat_completion_message_tool_call import (
ChatCompletionMessageToolCall,
)
logger = logging.getLogger(__name__)
# Type alias for parser return value
ParseResult = Tuple[Optional[str], Optional[List[ChatCompletionMessageToolCall]]]
class ToolCallParser(ABC):
"""
Base class for tool call parsers.
Each parser knows how to extract structured tool_calls from a specific
model family's raw output text format.
"""
@abstractmethod
def parse(self, text: str) -> ParseResult:
"""
Parse raw model output text for tool calls.
Args:
text: Raw decoded text from the model's completion
Returns:
Tuple of (content, tool_calls) where:
- content: text with tool call markup stripped (the message 'content' field),
or None if the entire output was tool calls
- tool_calls: list of ChatCompletionMessageToolCall objects,
or None if no tool calls were found
"""
raise NotImplementedError
# Global parser registry: name -> parser class
PARSER_REGISTRY: Dict[str, Type[ToolCallParser]] = {}
def register_parser(name: str):
"""
Decorator to register a parser class under a given name.
Usage:
@register_parser("hermes")
class HermesToolCallParser(ToolCallParser):
...
"""
def decorator(cls: Type[ToolCallParser]) -> Type[ToolCallParser]:
PARSER_REGISTRY[name] = cls
return cls
return decorator
def get_parser(name: str) -> ToolCallParser:
"""
Get a parser instance by name.
Args:
name: Parser name (e.g., "hermes", "mistral", "llama3_json")
Returns:
Instantiated parser
Raises:
KeyError: If parser name is not found in registry
"""
if name not in PARSER_REGISTRY:
available = sorted(PARSER_REGISTRY.keys())
raise KeyError(
f"Tool call parser '{name}' not found. Available parsers: {available}"
)
return PARSER_REGISTRY[name]()
def list_parsers() -> List[str]:
"""Return sorted list of registered parser names."""
return sorted(PARSER_REGISTRY.keys())
# Import all parser modules to trigger registration via @register_parser decorators
# Each module registers itself when imported
from environments.tool_call_parsers.hermes_parser import HermesToolCallParser # noqa: E402, F401
from environments.tool_call_parsers.longcat_parser import LongcatToolCallParser # noqa: E402, F401
from environments.tool_call_parsers.mistral_parser import MistralToolCallParser # noqa: E402, F401
from environments.tool_call_parsers.llama_parser import LlamaToolCallParser # noqa: E402, F401
from environments.tool_call_parsers.qwen_parser import QwenToolCallParser # noqa: E402, F401
from environments.tool_call_parsers.deepseek_v3_parser import DeepSeekV3ToolCallParser # noqa: E402, F401
from environments.tool_call_parsers.deepseek_v3_1_parser import DeepSeekV31ToolCallParser # noqa: E402, F401
from environments.tool_call_parsers.kimi_k2_parser import KimiK2ToolCallParser # noqa: E402, F401
from environments.tool_call_parsers.glm45_parser import Glm45ToolCallParser # noqa: E402, F401
from environments.tool_call_parsers.glm47_parser import Glm47ToolCallParser # noqa: E402, F401
from environments.tool_call_parsers.qwen3_coder_parser import Qwen3CoderToolCallParser # noqa: E402, F401

View file

@ -0,0 +1,72 @@
"""
DeepSeek V3.1 tool call parser.
Similar to V3 but with a slightly different format:
<tool▁call▁begin>function_name<tool▁sep>arguments<tool▁call▁end>
Note: V3 has type+name before the separator, V3.1 has name before and args after.
Based on VLLM's DeepSeekV31ToolParser.extract_tool_calls()
"""
import re
import uuid
from typing import List, Optional
from openai.types.chat.chat_completion_message_tool_call import (
ChatCompletionMessageToolCall,
Function,
)
from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
@register_parser("deepseek_v3_1")
@register_parser("deepseek_v31")
class DeepSeekV31ToolCallParser(ToolCallParser):
"""
Parser for DeepSeek V3.1 tool calls.
Slightly different regex than V3: function_name comes before the separator,
arguments come after (no type field, no json code block wrapper).
"""
START_TOKEN = "<tool▁calls▁begin>"
# Regex captures: function_name, function_arguments
PATTERN = re.compile(
r"<tool▁call▁begin>(?P<function_name>.*?)<tool▁sep>(?P<function_arguments>.*?)<tool▁call▁end>",
re.DOTALL,
)
def parse(self, text: str) -> ParseResult:
if self.START_TOKEN not in text:
return text, None
try:
matches = self.PATTERN.findall(text)
if not matches:
return text, None
tool_calls: List[ChatCompletionMessageToolCall] = []
for match in matches:
func_name, func_args = match
tool_calls.append(
ChatCompletionMessageToolCall(
id=f"call_{uuid.uuid4().hex[:8]}",
type="function",
function=Function(
name=func_name.strip(),
arguments=func_args.strip(),
),
)
)
if not tool_calls:
return text, None
content = text[: text.find(self.START_TOKEN)].strip()
return content if content else None, tool_calls
except Exception:
return text, None

View file

@ -0,0 +1,89 @@
"""
DeepSeek V3 tool call parser.
Format uses special unicode tokens:
<tool▁calls▁begin>
<tool▁call▁begin>type<tool▁sep>function_name
```json
{"arg": "value"}
```
<tool▁call▁end>
<tool▁calls▁end>
Fixes Issue #989: Support for multiple simultaneous tool calls.
"""
import re
import uuid
import logging
from typing import List, Optional, Tuple
from openai.types.chat.chat_completion_message_tool_call import (
ChatCompletionMessageToolCall,
Function,
)
from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
logger = logging.getLogger(__name__)
@register_parser("deepseek_v3")
class DeepSeekV3ToolCallParser(ToolCallParser):
"""
Parser for DeepSeek V3 tool calls.
Uses special unicode tokens with fullwidth angle brackets and block elements.
Extracts type, function name, and JSON arguments from the structured format.
Ensures all tool calls are captured when the model executes multiple actions.
"""
START_TOKEN = "<tool▁calls▁begin>"
# Updated PATTERN: Using \s* instead of literal \n for increased robustness
# against variations in model formatting (Issue #989).
PATTERN = re.compile(
r"<tool▁call▁begin>(?P<type>.*?)<tool▁sep>(?P<function_name>.*?)\s*```json\s*(?P<function_arguments>.*?)\s*```\s*<tool▁call▁end>",
re.DOTALL,
)
def parse(self, text: str) -> ParseResult:
"""
Parses the input text and extracts all available tool calls.
"""
if self.START_TOKEN not in text:
return text, None
try:
# Using finditer to capture ALL tool calls in the sequence
matches = list(self.PATTERN.finditer(text))
if not matches:
return text, None
tool_calls: List[ChatCompletionMessageToolCall] = []
for match in matches:
func_name = match.group("function_name").strip()
func_args = match.group("function_arguments").strip()
tool_calls.append(
ChatCompletionMessageToolCall(
id=f"call_{uuid.uuid4().hex[:8]}",
type="function",
function=Function(
name=func_name,
arguments=func_args,
),
)
)
if tool_calls:
# Content is text before the first tool call block
content_index = text.find(self.START_TOKEN)
content = text[:content_index].strip()
return content if content else None, tool_calls
return text, None
except Exception as e:
logger.error(f"Error parsing DeepSeek V3 tool calls: {e}")
return text, None

View file

@ -0,0 +1,109 @@
"""
GLM 4.5 (GLM-4-MoE) tool call parser.
Format uses custom arg_key/arg_value tags rather than standard JSON:
<tool_call>function_name
<arg_key>param1</arg_key><arg_value>value1</arg_value>
<arg_key>param2</arg_key><arg_value>value2</arg_value>
</tool_call>
Values are deserialized using json.loads -> ast.literal_eval -> raw string fallback.
Based on VLLM's Glm4MoeModelToolParser.extract_tool_calls()
"""
import ast
import json
import re
import uuid
from typing import Any, Dict, List, Optional
from openai.types.chat.chat_completion_message_tool_call import (
ChatCompletionMessageToolCall,
Function,
)
from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
def _deserialize_value(value: str) -> Any:
"""
Try to deserialize a string value to its native Python type.
Attempts json.loads, then ast.literal_eval, then returns raw string.
"""
try:
return json.loads(value)
except (json.JSONDecodeError, TypeError):
pass
try:
return ast.literal_eval(value)
except (ValueError, SyntaxError, TypeError):
pass
return value
@register_parser("glm45")
class Glm45ToolCallParser(ToolCallParser):
"""
Parser for GLM 4.5 (GLM-4-MoE) tool calls.
Uses <tool_call>...</tool_call> tags with <arg_key>/<arg_value> pairs
instead of standard JSON arguments.
"""
FUNC_CALL_REGEX = re.compile(r"<tool_call>.*?</tool_call>", re.DOTALL)
FUNC_DETAIL_REGEX = re.compile(r"<tool_call>([^\n]*)\n(.*)</tool_call>", re.DOTALL)
FUNC_ARG_REGEX = re.compile(
r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL
)
START_TOKEN = "<tool_call>"
def parse(self, text: str) -> ParseResult:
if self.START_TOKEN not in text:
return text, None
try:
matched_calls = self.FUNC_CALL_REGEX.findall(text)
if not matched_calls:
return text, None
tool_calls: List[ChatCompletionMessageToolCall] = []
for match in matched_calls:
detail = self.FUNC_DETAIL_REGEX.search(match)
if not detail:
continue
func_name = detail.group(1).strip()
func_args_raw = detail.group(2)
# Parse arg_key/arg_value pairs
pairs = self.FUNC_ARG_REGEX.findall(func_args_raw) if func_args_raw else []
arg_dict: Dict[str, Any] = {}
for key, value in pairs:
arg_key = key.strip()
arg_val = _deserialize_value(value.strip())
arg_dict[arg_key] = arg_val
tool_calls.append(
ChatCompletionMessageToolCall(
id=f"call_{uuid.uuid4().hex[:8]}",
type="function",
function=Function(
name=func_name,
arguments=json.dumps(arg_dict, ensure_ascii=False),
),
)
)
if not tool_calls:
return text, None
content = text[: text.find(self.START_TOKEN)].strip()
return content if content else None, tool_calls
except Exception:
return text, None

View file

@ -0,0 +1,35 @@
"""
GLM 4.7 tool call parser.
Same as GLM 4.5 but with slightly different regex patterns.
The tool_call tags may wrap differently and arg parsing handles
newlines between key/value pairs.
Based on VLLM's Glm47MoeModelToolParser (extends Glm4MoeModelToolParser).
"""
import re
from environments.tool_call_parsers import ParseResult, register_parser
from environments.tool_call_parsers.glm45_parser import Glm45ToolCallParser
@register_parser("glm47")
class Glm47ToolCallParser(Glm45ToolCallParser):
"""
Parser for GLM 4.7 tool calls.
Extends GLM 4.5 with updated regex patterns.
"""
def __init__(self):
super().__init__()
# GLM 4.7 uses a slightly different detail regex that includes
# the <tool_call> wrapper and optional arg_key content
self.FUNC_DETAIL_REGEX = re.compile(
r"<tool_call>(.*?)(<arg_key>.*?)?</tool_call>", re.DOTALL
)
# GLM 4.7 handles newlines between arg_key and arg_value tags
self.FUNC_ARG_REGEX = re.compile(
r"<arg_key>(.*?)</arg_key>(?:\\n|\s)*<arg_value>(.*?)</arg_value>",
re.DOTALL,
)

View file

@ -0,0 +1,73 @@
"""
Hermes tool call parser.
Format: <tool_call>{"name": "func", "arguments": {...}}</tool_call>
Based on VLLM's Hermes2ProToolParser.extract_tool_calls()
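Illustrative round-trip (the tool name and arguments are made-up examples):
parser = HermesToolCallParser()
content, calls = parser.parse(
'Sure.<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
)
# content == "Sure."; calls[0].function.name == "get_weather"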
"""
import json
import re
import uuid
from typing import List, Optional, Tuple
from openai.types.chat.chat_completion_message_tool_call import (
ChatCompletionMessageToolCall,
Function,
)
from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
@register_parser("hermes")
class HermesToolCallParser(ToolCallParser):
"""
Parser for Hermes-format tool calls.
Matches <tool_call>...</tool_call> tags containing JSON with "name" and "arguments".
Also handles unclosed <tool_call> at end-of-string (truncated generation).
"""
# Matches both closed and unclosed tool_call tags
PATTERN = re.compile(
r"<tool_call>\s*(.*?)\s*</tool_call>|<tool_call>\s*(.*)", re.DOTALL
)
def parse(self, text: str) -> ParseResult:
if "<tool_call>" not in text:
return text, None
try:
matches = self.PATTERN.findall(text)
if not matches:
return text, None
tool_calls: List[ChatCompletionMessageToolCall] = []
for match in matches:
# match is a tuple: (closed_content, unclosed_content)
raw_json = match[0] if match[0] else match[1]
if not raw_json.strip():
continue
tc_data = json.loads(raw_json)
tool_calls.append(
ChatCompletionMessageToolCall(
id=f"call_{uuid.uuid4().hex[:8]}",
type="function",
function=Function(
name=tc_data["name"],
arguments=json.dumps(
tc_data.get("arguments", {}), ensure_ascii=False
),
),
)
)
if not tool_calls:
return text, None
# Content is everything before the first <tool_call> tag
content = text[: text.find("<tool_call>")].strip()
return content if content else None, tool_calls
except Exception:
return text, None

View file

@ -0,0 +1,93 @@
"""
Kimi K2 tool call parser.
Format:
<|tool_calls_section_begin|>
<|tool_call_begin|>function_id:0<|tool_call_argument_begin|>{"arg": "val"}<|tool_call_end|>
<|tool_calls_section_end|>
The function_id format is typically "functions.func_name:index" or "func_name:index".
Based on VLLM's KimiK2ToolParser.extract_tool_calls()
"""
import re
import uuid
from typing import List, Optional
from openai.types.chat.chat_completion_message_tool_call import (
ChatCompletionMessageToolCall,
Function,
)
from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
@register_parser("kimi_k2")
class KimiK2ToolCallParser(ToolCallParser):
"""
Parser for Kimi K2 tool calls.
Uses section begin/end tokens wrapping individual tool call begin/end tokens.
The tool_call_id contains the function name (after last dot, before colon).
"""
# Support both singular and plural variants
START_TOKENS = [
"<|tool_calls_section_begin|>",
"<|tool_call_section_begin|>",
]
# Regex captures: tool_call_id (e.g., "functions.get_weather:0"), function_arguments
PATTERN = re.compile(
r"<\|tool_call_begin\|>\s*(?P<tool_call_id>[^<]+:\d+)\s*"
r"<\|tool_call_argument_begin\|>\s*"
r"(?P<function_arguments>(?:(?!<\|tool_call_begin\|>).)*?)\s*"
r"<\|tool_call_end\|>",
re.DOTALL,
)
def parse(self, text: str) -> ParseResult:
# Check for any variant of the start token
has_start = any(token in text for token in self.START_TOKENS)
if not has_start:
return text, None
try:
matches = self.PATTERN.findall(text)
if not matches:
return text, None
tool_calls: List[ChatCompletionMessageToolCall] = []
for match in matches:
function_id, function_args = match
# Extract function name from ID format: "functions.get_weather:0" -> "get_weather"
function_name = function_id.split(":")[0].split(".")[-1]
tool_calls.append(
ChatCompletionMessageToolCall(
id=function_id, # Preserve the original ID format
type="function",
function=Function(
name=function_name,
arguments=function_args.strip(),
),
)
)
if not tool_calls:
return text, None
# Content is everything before the tool calls section
earliest_start = len(text)
for token in self.START_TOKENS:
idx = text.find(token)
if idx >= 0 and idx < earliest_start:
earliest_start = idx
content = text[:earliest_start].strip()
return content if content else None, tool_calls
except Exception:
return text, None

View file

@ -0,0 +1,96 @@
"""
Llama 3.x / 4 tool call parser.
Format: The model outputs JSON objects with "name" and "arguments" (or "parameters") keys.
May be preceded by <|python_tag|> token. Supports multiple JSON objects separated
by content or semicolons.
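Illustrative example of accepted output (function name and args are made up):
<|python_tag|>{"name": "get_weather", "parameters": {"city": "Paris"}}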
Based on VLLM's Llama3JsonToolParser.extract_tool_calls()
"""
import json
import re
import uuid
from typing import List, Optional
from openai.types.chat.chat_completion_message_tool_call import (
ChatCompletionMessageToolCall,
Function,
)
from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
@register_parser("llama3_json")
@register_parser("llama4_json")
class LlamaToolCallParser(ToolCallParser):
"""
Parser for Llama 3.x and 4 JSON-format tool calls.
Finds JSON objects containing "name" + ("arguments" or "parameters") keys.
Uses Python's json.JSONDecoder.raw_decode for robust extraction of
JSON objects from mixed text.
"""
BOT_TOKEN = "<|python_tag|>"
# Regex to find the start of potential JSON objects
JSON_START = re.compile(r"\{")
def parse(self, text: str) -> ParseResult:
# Quick check: need either the bot token or a JSON brace
if self.BOT_TOKEN not in text and "{" not in text:
return text, None
try:
decoder = json.JSONDecoder()
tool_calls: List[ChatCompletionMessageToolCall] = []
end_index = -1 # Track where the last parsed JSON ended
for match in self.JSON_START.finditer(text):
start = match.start()
# Skip if this brace is inside a previously parsed JSON object
if start <= end_index:
continue
try:
obj, json_end = decoder.raw_decode(text[start:])
end_index = start + json_end
# Must have "name" and either "arguments" or "parameters"
name = obj.get("name")
args = obj.get("arguments", obj.get("parameters"))
if not name or args is None:
continue
# Normalize arguments to JSON string
if isinstance(args, dict):
args = json.dumps(args, ensure_ascii=False)
elif not isinstance(args, str):
args = json.dumps(args, ensure_ascii=False)
tool_calls.append(
ChatCompletionMessageToolCall(
id=f"call_{uuid.uuid4().hex[:8]}",
type="function",
function=Function(name=name, arguments=args),
)
)
except (json.JSONDecodeError, KeyError, ValueError):
continue
if not tool_calls:
return text, None
# Content is everything before the first tool call JSON
# Find where the first tool call starts in the text
first_tc_start = text.find("{")
if self.BOT_TOKEN in text:
first_tc_start = text.find(self.BOT_TOKEN)
content = text[:first_tc_start].strip() if first_tc_start > 0 else None
return content, tool_calls
except Exception:
return text, None

View file

@ -0,0 +1,69 @@
"""
Longcat Flash Chat tool call parser.
Same as Hermes but uses <longcat_tool_call> tags instead of <tool_call>.
Based on VLLM's LongcatFlashToolParser (extends Hermes2ProToolParser).
"""
import json
import re
import uuid
from typing import List, Optional
from openai.types.chat.chat_completion_message_tool_call import (
ChatCompletionMessageToolCall,
Function,
)
from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
@register_parser("longcat")
class LongcatToolCallParser(ToolCallParser):
"""
Parser for Longcat Flash Chat tool calls.
Identical logic to Hermes, just different tag names.
"""
PATTERN = re.compile(
r"<longcat_tool_call>\s*(.*?)\s*</longcat_tool_call>|<longcat_tool_call>\s*(.*)",
re.DOTALL,
)
def parse(self, text: str) -> ParseResult:
if "<longcat_tool_call>" not in text:
return text, None
try:
matches = self.PATTERN.findall(text)
if not matches:
return text, None
tool_calls: List[ChatCompletionMessageToolCall] = []
for match in matches:
raw_json = match[0] if match[0] else match[1]
if not raw_json.strip():
continue
tc_data = json.loads(raw_json)
tool_calls.append(
ChatCompletionMessageToolCall(
id=f"call_{uuid.uuid4().hex[:8]}",
type="function",
function=Function(
name=tc_data["name"],
arguments=json.dumps(
tc_data.get("arguments", {}), ensure_ascii=False
),
),
)
)
if not tool_calls:
return text, None
content = text[: text.find("<longcat_tool_call>")].strip()
return content if content else None, tool_calls
except Exception:
return text, None

View file

@ -0,0 +1,135 @@
"""
Mistral tool call parser.
Supports two formats depending on tokenizer version:
- Pre-v11: content[TOOL_CALLS] [{"name": ..., "arguments": {...}}, ...]
- v11+: content[TOOL_CALLS]tool_name1{"arg": "val"}[TOOL_CALLS]tool_name2{"arg": "val"}
Based on VLLM's MistralToolParser.extract_tool_calls()
The [TOOL_CALLS] token is the bot_token used by Mistral models.
"""
import json
import uuid
from typing import List, Optional
from openai.types.chat.chat_completion_message_tool_call import (
ChatCompletionMessageToolCall,
Function,
)
from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
def _generate_mistral_id() -> str:
"""Mistral tool call IDs are 9-char alphanumeric strings."""
import random
import string
return "".join(random.choices(string.ascii_letters + string.digits, k=9))
@register_parser("mistral")
class MistralToolCallParser(ToolCallParser):
"""
Parser for Mistral-format tool calls.
Detects format by checking if the content after [TOOL_CALLS] starts with '['
(pre-v11 JSON array) or with a tool name (v11+ format).
"""
# The [TOOL_CALLS] token -- may appear as different strings depending on tokenizer
BOT_TOKEN = "[TOOL_CALLS]"
def parse(self, text: str) -> ParseResult:
if self.BOT_TOKEN not in text:
return text, None
try:
parts = text.split(self.BOT_TOKEN)
content = parts[0].strip()
raw_tool_calls = parts[1:]
# Detect format: if the first raw part starts with '[', it's pre-v11
first_raw = raw_tool_calls[0].strip() if raw_tool_calls else ""
is_pre_v11 = first_raw.startswith("[") or first_raw.startswith("{")
tool_calls: List[ChatCompletionMessageToolCall] = []
if not is_pre_v11:
# v11+ format: [TOOL_CALLS]tool_name{args}[TOOL_CALLS]tool_name2{args2}
for raw in raw_tool_calls:
raw = raw.strip()
if not raw or "{" not in raw:
continue
brace_idx = raw.find("{")
tool_name = raw[:brace_idx].strip()
args_str = raw[brace_idx:]
# Validate and clean the JSON arguments
try:
parsed_args = json.loads(args_str)
args_str = json.dumps(parsed_args, ensure_ascii=False)
except json.JSONDecodeError:
pass # Keep raw if parsing fails
tool_calls.append(
ChatCompletionMessageToolCall(
id=_generate_mistral_id(),
type="function",
function=Function(name=tool_name, arguments=args_str),
)
)
else:
# Pre-v11 format: [TOOL_CALLS] [{"name": ..., "arguments": {...}}]
try:
parsed = json.loads(first_raw)
if isinstance(parsed, dict):
parsed = [parsed]
for tc in parsed:
args = tc.get("arguments", {})
if isinstance(args, dict):
args = json.dumps(args, ensure_ascii=False)
tool_calls.append(
ChatCompletionMessageToolCall(
id=_generate_mistral_id(),
type="function",
function=Function(
name=tc["name"], arguments=args
),
)
)
except json.JSONDecodeError:
# Fallback: extract JSON objects using raw_decode
decoder = json.JSONDecoder()
idx = 0
while idx < len(first_raw):
try:
obj, end_idx = decoder.raw_decode(first_raw, idx)
if isinstance(obj, dict) and "name" in obj:
args = obj.get("arguments", {})
if isinstance(args, dict):
args = json.dumps(args, ensure_ascii=False)
tool_calls.append(
ChatCompletionMessageToolCall(
id=_generate_mistral_id(),
type="function",
function=Function(
name=obj["name"], arguments=args
),
)
)
idx = end_idx
except json.JSONDecodeError:
idx += 1
if not tool_calls:
return text, None
return content if content else None, tool_calls
except Exception:
return text, None

View file

@ -0,0 +1,163 @@
"""
Qwen3-Coder tool call parser.
Format uses XML-style nested tags:
<tool_call>
<function=function_name>
<parameter=param_name>value</parameter>
<parameter=param_name2>value2</parameter>
</function>
</tool_call>
Parameters are extracted from <parameter=name>value</parameter> tags and
type-converted using the schema if available, otherwise treated as strings.
Based on VLLM's Qwen3CoderToolParser.extract_tool_calls()
"""
import ast
import json
import re
import uuid
from typing import Any, Dict, List, Optional
from openai.types.chat.chat_completion_message_tool_call import (
ChatCompletionMessageToolCall,
Function,
)
from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
def _try_convert_value(value: str) -> Any:
"""
Try to convert a parameter value string to a native Python type.
Handles null, numbers, booleans, JSON objects/arrays, and falls back to string.
"""
stripped = value.strip()
# Handle null
if stripped.lower() == "null":
return None
# Try JSON first (handles objects, arrays, strings, numbers, booleans)
try:
return json.loads(stripped)
except (json.JSONDecodeError, TypeError):
pass
# Try Python literal eval (handles tuples, etc.)
try:
return ast.literal_eval(stripped)
except (ValueError, SyntaxError, TypeError):
pass
# Return as string
return stripped
@register_parser("qwen3_coder")
class Qwen3CoderToolCallParser(ToolCallParser):
"""
Parser for Qwen3-Coder XML-format tool calls.
Uses nested XML tags: <tool_call><function=name><parameter=key>val</parameter></function></tool_call>
"""
START_TOKEN = "<tool_call>"
FUNCTION_PREFIX = "<function="
# Find complete tool_call blocks (or unclosed at end)
TOOL_CALL_REGEX = re.compile(
r"<tool_call>(.*?)</tool_call>|<tool_call>(.*?)$", re.DOTALL
)
# Find function blocks within a tool_call
FUNCTION_REGEX = re.compile(
r"<function=(.*?)</function>|<function=(.*)$", re.DOTALL
)
# Find parameter blocks within a function
PARAMETER_REGEX = re.compile(
r"<parameter=(.*?)(?:</parameter>|(?=<parameter=)|(?=</function>)|$)",
re.DOTALL,
)
def _parse_function_call(self, function_str: str) -> Optional[ChatCompletionMessageToolCall]:
"""Parse a single <function=name>...</function> block into a ToolCall."""
try:
# Extract function name: everything before the first '>'
gt_idx = function_str.index(">")
func_name = function_str[:gt_idx].strip()
params_str = function_str[gt_idx + 1:]
# Extract parameters
param_dict: Dict[str, Any] = {}
for match_text in self.PARAMETER_REGEX.findall(params_str):
if ">" not in match_text:
continue
eq_idx = match_text.index(">")
param_name = match_text[:eq_idx].strip()
param_value = match_text[eq_idx + 1:]
# Clean up whitespace
if param_value.startswith("\n"):
param_value = param_value[1:]
if param_value.endswith("\n"):
param_value = param_value[:-1]
param_dict[param_name] = _try_convert_value(param_value)
return ChatCompletionMessageToolCall(
id=f"call_{uuid.uuid4().hex[:24]}",
type="function",
function=Function(
name=func_name,
arguments=json.dumps(param_dict, ensure_ascii=False),
),
)
except (ValueError, IndexError):
return None
def parse(self, text: str) -> ParseResult:
if self.FUNCTION_PREFIX not in text:
return text, None
try:
# Find all tool_call blocks
tc_matches = self.TOOL_CALL_REGEX.findall(text)
raw_blocks = [m[0] if m[0] else m[1] for m in tc_matches]
# Fallback: if no tool_call tags, try the whole text
if not raw_blocks:
raw_blocks = [text]
# Find function blocks within each tool_call
function_strs: List[str] = []
for block in raw_blocks:
func_matches = self.FUNCTION_REGEX.findall(block)
function_strs.extend(m[0] if m[0] else m[1] for m in func_matches)
if not function_strs:
return text, None
# Parse each function call
tool_calls: List[ChatCompletionMessageToolCall] = []
for func_str in function_strs:
tc = self._parse_function_call(func_str)
if tc is not None:
tool_calls.append(tc)
if not tool_calls:
return text, None
# Content before tool calls
first_tc = text.find(self.START_TOKEN)
if first_tc < 0:
first_tc = text.find(self.FUNCTION_PREFIX)
content = text[:first_tc].strip() if first_tc > 0 else None
return content, tool_calls
except Exception:
return text, None

View file

@ -0,0 +1,19 @@
"""
Qwen 2.5 tool call parser.
Uses the same <tool_call> format as Hermes.
Registered as a separate parser name for clarity when using --env.tool_call_parser qwen.
"""
from environments.tool_call_parsers import register_parser
from environments.tool_call_parsers.hermes_parser import HermesToolCallParser
@register_parser("qwen")
class QwenToolCallParser(HermesToolCallParser):
"""
Parser for Qwen 2.5 tool calls.
Same <tool_call>{"name": ..., "arguments": ...}</tool_call> format as Hermes.
"""
pass # Identical format -- inherits everything from Hermes

View file

@ -0,0 +1,474 @@
"""
ToolContext -- Unrestricted Tool Access for Reward Functions
A per-rollout handle that gives reward/verification functions direct access to
ALL hermes-agent tools, scoped to the rollout's task_id. The same task_id means
the terminal/browser session is the SAME one the model used during its rollout --
all state (files, processes, browser tabs) is preserved.
The verifier author decides which tools to use. Nothing is hardcoded or gated.
Example usage in a compute_reward():
async def compute_reward(self, item, result, ctx):
# Run tests in the model's terminal sandbox
test = ctx.terminal("pytest -v")
if test["exit_code"] == 0:
return 1.0
# Check if a file was created
content = ctx.read_file("/workspace/solution.py")
if content.get("content"):
return 0.5
return 0.0
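Binary-safe transfers are also available (the paths below are illustrative):
# Pull an artifact out of the sandbox for host-side inspection
ctx.download_file("/workspace/report.json", "/tmp/report.json")
# Push reference data into the sandbox before running checks
ctx.upload_file("tests/golden.bin", "/workspace/golden.bin")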
"""
import json
import logging
import os
from typing import Any, Dict, List, Optional
import asyncio
import concurrent.futures
from model_tools import handle_function_call
from tools.terminal_tool import cleanup_vm
from tools.browser_tool import cleanup_browser
logger = logging.getLogger(__name__)
# Thread pool for running sync tool calls that internally use asyncio.run()
_tool_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)
def _run_tool_in_thread(tool_name: str, arguments: Dict[str, Any], task_id: str) -> str:
"""
Run a tool call in a thread pool executor so backends that use asyncio.run()
internally (modal, docker, daytona) get a clean event loop.
If we're already in an async context, executes handle_function_call() in a
disposable worker thread and blocks for the result.
If not (e.g., called from sync code), runs directly.
"""
try:
asyncio.get_running_loop()
# We're in an async context -- need to run in a worker thread
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
future = pool.submit(
handle_function_call, tool_name, arguments, task_id
)
return future.result(timeout=300)
except RuntimeError:
# No running event loop -- safe to call directly
return handle_function_call(tool_name, arguments, task_id)
class ToolContext:
"""
Open-ended access to all hermes-agent tools for a specific rollout.
Passed to compute_reward() so verifiers can use any tool they need:
terminal commands, file reads/writes, web searches, browser automation, etc.
All calls share the rollout's task_id for session isolation.
"""
def __init__(self, task_id: str):
self.task_id = task_id
# -------------------------------------------------------------------------
# Terminal tools
# -------------------------------------------------------------------------
def terminal(self, command: str, timeout: int = 180) -> Dict[str, Any]:
"""
Run a command in the rollout's terminal session.
Args:
command: Shell command to execute
timeout: Command timeout in seconds
Returns:
Dict with 'exit_code' (int) and 'output' (str)
"""
backend = os.getenv("TERMINAL_ENV", "local")
logger.debug("ToolContext.terminal [%s backend] task=%s: %s", backend, self.task_id[:8], command[:100])
# Run via thread helper so modal/docker/daytona backends' asyncio.run() doesn't deadlock
result = _run_tool_in_thread(
"terminal",
{"command": command, "timeout": timeout},
self.task_id,
)
try:
return json.loads(result)
except json.JSONDecodeError:
return {"exit_code": -1, "output": result}
# -------------------------------------------------------------------------
# File tools
# -------------------------------------------------------------------------
def read_file(self, path: str) -> Dict[str, Any]:
"""
Read a file from the rollout's filesystem.
Args:
path: File path to read
Returns:
Dict with file content or error
"""
result = handle_function_call(
"read_file", {"path": path}, task_id=self.task_id
)
try:
return json.loads(result)
except json.JSONDecodeError:
return {"error": result}
def write_file(self, path: str, content: str) -> Dict[str, Any]:
"""
Write a TEXT file in the rollout's filesystem.
Uses a shell heredoc under the hood, so this is only safe for text content.
For binary files (images, compiled artifacts, etc.), use upload_file() instead.
Args:
path: File path to write
content: Text content to write
Returns:
Dict with success status or error
"""
result = handle_function_call(
"write_file", {"path": path, "content": content}, task_id=self.task_id
)
try:
return json.loads(result)
except json.JSONDecodeError:
return {"error": result}
def upload_file(self, local_path: str, remote_path: str) -> Dict[str, Any]:
"""
Upload a local file to the rollout's sandbox (binary-safe).
Unlike write_file() which passes content through a shell heredoc (text-only),
this method base64-encodes the file and decodes it inside the sandbox.
Safe for any file type: binaries, images, archives, etc.
For large files (>1MB), the content is split into chunks to avoid
hitting shell command-length limits.
Args:
local_path: Path to a local file on the host
remote_path: Destination path inside the sandbox
Returns:
Dict with 'exit_code' and 'output'
"""
import base64
from pathlib import Path as _Path
local = _Path(local_path)
if not local.exists():
return {"exit_code": -1, "output": f"Local file not found: {local_path}"}
raw = local.read_bytes()
b64 = base64.b64encode(raw).decode("ascii")
# Ensure parent directory exists in the sandbox
parent = str(_Path(remote_path).parent)
if parent not in (".", "/"):
self.terminal(f"mkdir -p {parent}", timeout=10)
# For small files, single command is fine
chunk_size = 60_000 # ~60KB per chunk (well within shell limits)
if len(b64) <= chunk_size:
result = self.terminal(
f"printf '%s' '{b64}' | base64 -d > {remote_path}",
timeout=30,
)
else:
# For larger files, write base64 in chunks then decode
tmp_b64 = "/tmp/_hermes_upload.b64"
self.terminal(f": > {tmp_b64}", timeout=5) # truncate
for i in range(0, len(b64), chunk_size):
chunk = b64[i : i + chunk_size]
self.terminal(f"printf '%s' '{chunk}' >> {tmp_b64}", timeout=15)
result = self.terminal(
f"base64 -d {tmp_b64} > {remote_path} && rm -f {tmp_b64}",
timeout=30,
)
return result
def upload_dir(self, local_dir: str, remote_dir: str) -> List[Dict[str, Any]]:
"""
Upload an entire local directory to the rollout's sandbox (binary-safe).
Recursively uploads all files, preserving directory structure.
Args:
local_dir: Path to a local directory on the host
remote_dir: Destination directory inside the sandbox
Returns:
List of results, one per file uploaded
"""
from pathlib import Path as _Path
local = _Path(local_dir)
if not local.exists() or not local.is_dir():
return [{"exit_code": -1, "output": f"Local directory not found: {local_dir}"}]
results = []
for file_path in sorted(local.rglob("*")):
if file_path.is_file():
relative = file_path.relative_to(local)
target = f"{remote_dir}/{relative}"
results.append(self.upload_file(str(file_path), target))
return results
def download_file(self, remote_path: str, local_path: str) -> Dict[str, Any]:
"""
Download a file from the rollout's sandbox to the host (binary-safe).
The inverse of upload_file(). Base64-encodes the file inside the sandbox,
reads the encoded data through the terminal, and decodes it locally.
Safe for any file type.
Args:
remote_path: Path to the file inside the sandbox
local_path: Destination path on the host
Returns:
Dict with 'success' (bool) and 'bytes' (int) or 'error' (str)
"""
import base64
from pathlib import Path as _Path
# Base64-encode the file inside the sandbox and capture output
result = self.terminal(
f"base64 {remote_path} 2>/dev/null",
timeout=30,
)
if result.get("exit_code", -1) != 0:
return {
"success": False,
"error": f"Failed to read remote file: {result.get('output', '')}",
}
b64_data = result.get("output", "").strip()
if not b64_data:
return {"success": False, "error": f"Remote file is empty or missing: {remote_path}"}
try:
raw = base64.b64decode(b64_data)
except Exception as e:
return {"success": False, "error": f"Base64 decode failed: {e}"}
# Write to local host filesystem
local = _Path(local_path)
local.parent.mkdir(parents=True, exist_ok=True)
local.write_bytes(raw)
return {"success": True, "bytes": len(raw)}
def download_dir(self, remote_dir: str, local_dir: str) -> List[Dict[str, Any]]:
"""
Download a directory from the rollout's sandbox to the host (binary-safe).
Lists all files in the remote directory, then downloads each one.
Preserves directory structure.
Args:
remote_dir: Path to the directory inside the sandbox
local_dir: Destination directory on the host
Returns:
List of results, one per file downloaded
"""
from pathlib import Path as _Path
# List files in the remote directory
ls_result = self.terminal(
f"find {remote_dir} -type f 2>/dev/null",
timeout=15,
)
if ls_result.get("exit_code", -1) != 0:
return [{"success": False, "error": f"Failed to list remote dir: {remote_dir}"}]
file_list = ls_result.get("output", "").strip()
if not file_list:
return [{"success": False, "error": f"Remote directory is empty or missing: {remote_dir}"}]
results = []
for remote_file in file_list.splitlines():
remote_file = remote_file.strip()
if not remote_file:
continue
# Compute the relative path to preserve directory structure
if remote_file.startswith(remote_dir):
relative = remote_file[len(remote_dir):].lstrip("/")
else:
relative = _Path(remote_file).name
local_file = str(_Path(local_dir) / relative)
results.append(self.download_file(remote_file, local_file))
return results
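# Illustrative round trip (the paths are assumptions, not part of the API):
#   ctx.upload_dir("fixtures/", "/workspace/fixtures")   # seed the sandbox
#   ... agent rollout runs ...
#   ctx.download_dir("/workspace/output", "results/")    # pull artifacts back for scoring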
def search(self, query: str, path: str = ".") -> Dict[str, Any]:
"""
Search for text in the rollout's filesystem.
Args:
query: Search query
path: Directory to search in
Returns:
Dict with search results
"""
result = handle_function_call(
"search_files", {"pattern": query, "path": path}, task_id=self.task_id
)
try:
return json.loads(result)
except json.JSONDecodeError:
return {"error": result}
# -------------------------------------------------------------------------
# Web tools
# -------------------------------------------------------------------------
def web_search(self, query: str) -> Dict[str, Any]:
"""
Search the web.
Args:
query: Search query
Returns:
Dict with search results
"""
result = handle_function_call("web_search", {"query": query})
try:
return json.loads(result)
except json.JSONDecodeError:
return {"error": result}
def web_extract(self, urls: List[str]) -> Dict[str, Any]:
"""
Extract content from URLs.
Args:
urls: List of URLs to extract content from
Returns:
Dict with extracted content
"""
result = handle_function_call("web_extract", {"urls": urls})
try:
return json.loads(result)
except json.JSONDecodeError:
return {"error": result}
# -------------------------------------------------------------------------
# Browser tools
# -------------------------------------------------------------------------
def browser_navigate(self, url: str) -> Dict[str, Any]:
"""
Navigate the rollout's browser session to a URL.
Args:
url: URL to navigate to
Returns:
Dict with page snapshot or error
"""
result = handle_function_call(
"browser_navigate", {"url": url}, task_id=self.task_id
)
try:
return json.loads(result)
except json.JSONDecodeError:
return {"error": result}
def browser_snapshot(self) -> Dict[str, Any]:
"""
Take a snapshot of the current browser page.
Returns:
Dict with page content/accessibility snapshot
"""
result = handle_function_call(
"browser_snapshot", {}, task_id=self.task_id
)
try:
return json.loads(result)
except json.JSONDecodeError:
return {"error": result}
# -------------------------------------------------------------------------
# Generic tool access
# -------------------------------------------------------------------------
def call_tool(self, tool_name: str, arguments: Dict[str, Any]) -> str:
"""
Call any hermes-agent tool by name.
This is the generic escape hatch -- if a tool doesn't have a convenience
wrapper above, you can call it directly here.
Args:
tool_name: Name of the tool (e.g., "vision_analyze", "skills_list")
arguments: Dict of arguments for the tool
Returns:
Raw JSON string result from the tool
"""
return _run_tool_in_thread(tool_name, arguments, self.task_id)
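# Sketch of how a verification function might combine the wrappers above with
# call_tool(). Not executed here; the tool name "browser_screenshot" and the
# pytest command are assumptions that depend on the registered toolsets.
#
#   def verify(ctx: "ToolContext") -> float:
#       tests = ctx.terminal("pytest -q", timeout=300)    # run checks in the rollout's sandbox
#       notes = ctx.read_file("/workspace/notes.md")      # inspect files the agent wrote
#       raw = ctx.call_tool("browser_screenshot", {})     # escape hatch for unwrapped tools
#       return 1.0 if tests.get("exit_code") == 0 else 0.0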
# -------------------------------------------------------------------------
# Cleanup
# -------------------------------------------------------------------------
def cleanup(self):
"""
Release all resources (terminal VMs, browser sessions, background processes)
for this rollout.
Called automatically by the base environment via try/finally after
compute_reward() completes. You generally don't need to call this yourself.
"""
# Kill any background processes from this rollout (safety net)
try:
from tools.process_registry import process_registry
killed = process_registry.kill_all(task_id=self.task_id)
if killed:
logger.debug("Process cleanup for task %s: killed %d process(es)", self.task_id, killed)
except Exception as e:
logger.debug("Process cleanup for task %s: %s", self.task_id, e)
try:
cleanup_vm(self.task_id)
except Exception as e:
logger.debug("VM cleanup for task %s: %s", self.task_id, e)
# Suppress browser_tool's noisy debug prints during cleanup.
# The cleanup still runs (safe), it just doesn't spam the console.
_prev_quiet = os.environ.get("HERMES_QUIET")
os.environ["HERMES_QUIET"] = "1"
try:
cleanup_browser(self.task_id)
except Exception as e:
logger.debug("Browser cleanup for task %s: %s", self.task_id, e)
finally:
if _prev_quiet is None:
os.environ.pop("HERMES_QUIET", None)
else:
os.environ["HERMES_QUIET"] = _prev_quiet

View file

@@ -0,0 +1,718 @@
"""
WebResearchEnv: RL Environment for Multi-Step Web Research
============================================================
Trains models to do accurate, efficient, multi-source web research.
Reward signals:
- Answer correctness (LLM judge, 0.0-1.0)
- Source diversity (used ≥2 distinct domains)
- Efficiency (penalizes excessive tool calls)
- Tool usage (bonus for actually using web tools)
Dataset: FRAMES benchmark (Google, 2024), multi-hop factual questions
HuggingFace: google/frames-benchmark
Fallback: built-in sample questions (no HF token needed)
Usage:
# Phase 1 (OpenAI-compatible server)
python environments/web_research_env.py serve \\
--openai.base_url http://localhost:8000/v1 \\
--openai.model_name YourModel \\
--openai.server_type openai
# Process mode (offline data generation)
python environments/web_research_env.py process \\
--env.data_path_to_save_groups data/web_research.jsonl
# Standalone eval
python environments/web_research_env.py evaluate \\
--openai.base_url http://localhost:8000/v1 \\
--openai.model_name YourModel
Built by: github.com/jackx707
Inspired by: GroceryMind production Hermes agent doing live web research
across German grocery stores (firecrawl + hermes-agent)
"""
from __future__ import annotations
import asyncio
import json
import logging
import os
import random
import re
import sys
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple
from urllib.parse import urlparse
from pydantic import Field
# Ensure hermes-agent root is on path
_repo_root = Path(__file__).resolve().parent.parent
if str(_repo_root) not in sys.path:
sys.path.insert(0, str(_repo_root))
# ---------------------------------------------------------------------------
# Optional HuggingFace datasets import
# ---------------------------------------------------------------------------
try:
from datasets import load_dataset
HF_AVAILABLE = True
except ImportError:
HF_AVAILABLE = False
from atroposlib.envs.base import ScoredDataGroup
from atroposlib.envs.server_handling.server_manager import APIServerConfig
from atroposlib.type_definitions import Item
from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
from environments.agent_loop import AgentResult
from environments.tool_context import ToolContext
logger = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# Fallback sample dataset (used when HuggingFace is unavailable)
# Multi-hop questions requiring real web search to answer.
# ---------------------------------------------------------------------------
SAMPLE_QUESTIONS = [
{
"question": "What is the current population of the capital city of the country that won the 2022 FIFA World Cup?",
"answer": "Buenos Aires has approximately 3 million people in the city proper, or around 15 million in the greater metro area.",
"difficulty": "medium",
"hops": 2,
},
{
"question": "Who is the CEO of the company that makes the most widely used open-source container orchestration platform?",
"answer": "The Linux Foundation oversees Kubernetes. CNCF (Cloud Native Computing Foundation) is the specific body — it does not have a traditional CEO but has an executive director.",
"difficulty": "medium",
"hops": 2,
},
{
"question": "What programming language was used to write the original version of the web framework used by Instagram?",
"answer": "Django, which Instagram was built on, is written in Python.",
"difficulty": "easy",
"hops": 2,
},
{
"question": "In what year was the university founded where the inventor of the World Wide Web currently holds a professorship?",
"answer": "Tim Berners-Lee holds a professorship at MIT (founded 1861) and the University of Southampton (founded 1952).",
"difficulty": "hard",
"hops": 3,
},
{
"question": "What is the latest stable version of the programming language that ranks #1 on the TIOBE index as of this year?",
"answer": "Python is currently #1 on TIOBE. The latest stable version should be verified via the official python.org site.",
"difficulty": "medium",
"hops": 2,
},
{
"question": "How many employees does the parent company of Instagram have?",
"answer": "Meta Platforms (parent of Instagram) employs approximately 70,000+ people as of recent reports.",
"difficulty": "medium",
"hops": 2,
},
{
"question": "What is the current interest rate set by the central bank of the country where the Eiffel Tower is located?",
"answer": "The European Central Bank sets rates for France/eurozone. The current rate should be verified — it has changed frequently in 2023-2025.",
"difficulty": "hard",
"hops": 2,
},
{
"question": "Which company acquired the startup founded by the creator of Oculus VR?",
"answer": "Palmer Luckey founded Oculus VR, which was acquired by Facebook (now Meta). He later founded Anduril Industries.",
"difficulty": "medium",
"hops": 2,
},
{
"question": "What is the market cap of the company that owns the most popular search engine in Russia?",
"answer": "Yandex (now split into separate entities after 2024 restructuring). Current market cap should be verified via financial sources.",
"difficulty": "hard",
"hops": 2,
},
{
"question": "What was the GDP growth rate of the country that hosted the most recent Summer Olympics?",
"answer": "Paris, France hosted the 2024 Summer Olympics. France's recent GDP growth should be verified via World Bank or IMF data.",
"difficulty": "hard",
"hops": 2,
},
]
# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------
class WebResearchEnvConfig(HermesAgentEnvConfig):
"""Configuration for the web research RL environment."""
# Reward weights
correctness_weight: float = Field(
default=0.6,
description="Weight for answer correctness in reward (LLM judge score).",
)
tool_usage_weight: float = Field(
default=0.2,
description="Weight for tool usage signal (did the model actually use web tools?).",
)
efficiency_weight: float = Field(
default=0.2,
description="Weight for efficiency signal (penalizes excessive tool calls).",
)
diversity_bonus: float = Field(
default=0.1,
description="Bonus reward for citing ≥2 distinct domains.",
)
# Efficiency thresholds
efficient_max_calls: int = Field(
default=5,
description="Maximum tool calls before efficiency penalty begins.",
)
heavy_penalty_calls: int = Field(
default=10,
description="Tool call count where efficiency penalty steepens.",
)
# Eval
eval_size: int = Field(
default=20,
description="Number of held-out items for evaluation.",
)
eval_split_ratio: float = Field(
default=0.1,
description="Fraction of dataset to hold out for evaluation (0.01.0).",
)
# Dataset
dataset_name: str = Field(
default="google/frames-benchmark",
description="HuggingFace dataset name for research questions.",
)
# ---------------------------------------------------------------------------
# Environment
# ---------------------------------------------------------------------------
class WebResearchEnv(HermesAgentBaseEnv):
"""
RL environment for training multi-step web research skills.
The model is given a factual question requiring 2-3 hops of web research
and must use web_search / web_extract tools to find and synthesize the answer.
Reward is multi-signal:
60% answer correctness (LLM judge)
20% tool usage (did the model actually search the web?)
20% efficiency (penalizes >5 tool calls)
Bonus +0.1 for source diversity (≥2 distinct domains cited).
"""
name = "web-research"
env_config_cls = WebResearchEnvConfig
# Default toolsets for this environment — web + file for saving notes
default_toolsets = ["web", "file"]
@classmethod
def config_init(cls) -> Tuple[WebResearchEnvConfig, List[APIServerConfig]]:
"""Default configuration for the web research environment."""
env_config = WebResearchEnvConfig(
enabled_toolsets=["web", "file"],
max_agent_turns=15,
agent_temperature=1.0,
system_prompt=(
"You are a highly capable research agent. When asked a factual question, "
"always use web_search to find current, accurate information before answering. "
"Cite at least 2 sources. Be concise and accurate."
),
group_size=4,
total_steps=1000,
steps_per_eval=100,
use_wandb=True,
wandb_name="web-research",
)
server_configs = [
APIServerConfig(
base_url="https://openrouter.ai/api/v1",
model_name="anthropic/claude-sonnet-4.5",
server_type="openai",
api_key=os.getenv("OPENROUTER_API_KEY", ""),
health_check=False,
)
]
return env_config, server_configs
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self._items: list[dict] = []
self._eval_items: list[dict] = []
self._index: int = 0
# Metrics tracking for wandb
self._reward_buffer: list[float] = []
self._correctness_buffer: list[float] = []
self._tool_usage_buffer: list[float] = []
self._efficiency_buffer: list[float] = []
self._diversity_buffer: list[float] = []
# ------------------------------------------------------------------
# 1. Setup — load dataset
# ------------------------------------------------------------------
async def setup(self) -> None:
"""Load the FRAMES benchmark or fall back to built-in samples."""
if HF_AVAILABLE:
try:
logger.info("Loading FRAMES benchmark from HuggingFace...")
ds = load_dataset(self.config.dataset_name, split="test")
self._items = [
{
"question": row["Prompt"],
"answer": row["Answer"],
"difficulty": row.get("reasoning_types", "unknown"),
"hops": 2,
}
for row in ds
]
# Hold out for eval
eval_size = max(
self.config.eval_size,
int(len(self._items) * self.config.eval_split_ratio),
)
random.shuffle(self._items)
self._eval_items = self._items[:eval_size]
self._items = self._items[eval_size:]
logger.info(
f"Loaded {len(self._items)} train / {len(self._eval_items)} eval items "
f"from FRAMES benchmark."
)
return
except Exception as e:
logger.warning(f"Could not load FRAMES from HuggingFace: {e}. Using built-in samples.")
# Fallback
random.shuffle(SAMPLE_QUESTIONS)
split = max(1, len(SAMPLE_QUESTIONS) * 8 // 10)
self._items = SAMPLE_QUESTIONS[:split]
self._eval_items = SAMPLE_QUESTIONS[split:]
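# With the 10 built-in samples this fallback yields an 8 train / 2 eval split.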
logger.info(
f"Using built-in sample dataset: {len(self._items)} train / "
f"{len(self._eval_items)} eval items."
)
# ------------------------------------------------------------------
# 2. get_next_item — return the next question
# ------------------------------------------------------------------
async def get_next_item(self) -> dict:
"""Return the next item, cycling through the dataset."""
if not self._items:
raise RuntimeError("Dataset is empty. Did you call setup()?")
item = self._items[self._index % len(self._items)]
self._index += 1
return item
# ------------------------------------------------------------------
# 3. format_prompt — build the user-facing prompt
# ------------------------------------------------------------------
def format_prompt(self, item: dict) -> str:
"""Format the research question as a task prompt."""
return (
f"Research the following question thoroughly using web search. "
f"You MUST search the web to find current, accurate information — "
f"do not rely solely on your training data.\n\n"
f"Question: {item['question']}\n\n"
f"Requirements:\n"
f"- Use web_search and/or web_extract tools to find information\n"
f"- Search at least 2 different sources\n"
f"- Provide a concise, accurate answer (2-4 sentences)\n"
f"- Cite the sources you used"
)
# ------------------------------------------------------------------
# 4. compute_reward — multi-signal scoring
# ------------------------------------------------------------------
async def compute_reward(
self,
item: dict,
result: AgentResult,
ctx: ToolContext,
) -> float:
"""
Multi-signal reward function:
correctness_weight * correctness   (LLM judge comparing answer to ground truth)
tool_usage_weight * tool_used      (binary: did the model use web tools?)
efficiency_weight * efficiency     (penalizes wasteful tool usage)
+ diversity_bonus                  (source diversity: ≥2 distinct domains)
"""
# Extract final response from messages (last assistant message with content)
final_response = ""
tools_used: list[str] = []
for msg in reversed(result.messages):
if msg.get("role") == "assistant" and msg.get("content") and not final_response:
final_response = msg["content"]
# Collect tool names from tool call messages
if msg.get("role") == "assistant" and msg.get("tool_calls"):
for tc in msg["tool_calls"]:
fn = tc.get("function", {}) if isinstance(tc, dict) else {}
name = fn.get("name", "")
if name:
tools_used.append(name)
# Count actual tool calls; fall back to the turn count if none were recorded.
tool_call_count: int = len(tools_used) or result.turns_used
cfg = self.config
# ---- Signal 1: Answer correctness (LLM judge) ----------------
correctness = await self._llm_judge(
question=item["question"],
expected=item["answer"],
model_answer=final_response,
)
# ---- Signal 2: Web tool usage --------------------------------
web_tools = {"web_search", "web_extract", "search", "firecrawl"}
tool_used = 1.0 if any(t in web_tools for t in tools_used) else 0.0
# ---- Signal 3: Efficiency ------------------------------------
if tool_call_count <= cfg.efficient_max_calls:
efficiency = 1.0
elif tool_call_count <= cfg.heavy_penalty_calls:
efficiency = 1.0 - (tool_call_count - cfg.efficient_max_calls) * 0.08
else:
efficiency = max(0.0, 1.0 - (tool_call_count - cfg.efficient_max_calls) * 0.12)
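# Example values with the default thresholds (efficient_max_calls=5,
# heavy_penalty_calls=10): 5 calls -> 1.0, 8 -> 0.76, 10 -> 0.60, 15 -> 0.0.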
# ---- Bonus: Source diversity ---------------------------------
domains = self._extract_domains(final_response)
diversity = cfg.diversity_bonus if len(domains) >= 2 else 0.0
# ---- Combine ------------------------------------------------
reward = (
cfg.correctness_weight * correctness
+ cfg.tool_usage_weight * tool_used
+ cfg.efficiency_weight * efficiency
+ diversity
)
reward = min(1.0, max(0.0, reward)) # clamp to [0, 1]
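# Worked example with the default weights: correctness=0.8, tool_used=1.0,
# efficiency=1.0, and ≥2 domains cited ->
#   0.6*0.8 + 0.2*1.0 + 0.2*1.0 + 0.1 = 0.98 (then clamped to [0, 1]).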
# Track for wandb
self._reward_buffer.append(reward)
self._correctness_buffer.append(correctness)
self._tool_usage_buffer.append(tool_used)
self._efficiency_buffer.append(efficiency)
self._diversity_buffer.append(diversity)
logger.debug(
f"Reward breakdown — correctness={correctness:.2f}, "
f"tool_used={tool_used:.1f}, efficiency={efficiency:.2f}, "
f"diversity={diversity:.1f} → total={reward:.3f}"
)
return reward
# ------------------------------------------------------------------
# 5. evaluate — run on held-out eval split
# ------------------------------------------------------------------
async def evaluate(self, *args, **kwargs) -> None:
"""Run evaluation on the held-out split using the full agent loop with tools.
Each eval item runs through the same agent loop as training
the model can use web_search, web_extract, etc. to research answers.
This measures actual agentic research capability, not just knowledge.
"""
import time
import uuid
from environments.agent_loop import HermesAgentLoop
from environments.tool_context import ToolContext
items = self._eval_items
if not items:
logger.warning("No eval items available.")
return
eval_size = min(self.config.eval_size, len(items))
eval_items = items[:eval_size]
logger.info(f"Running eval on {len(eval_items)} questions (with agent loop + tools)...")
start_time = time.time()
samples = []
# Resolve tools once for all eval items
tools, valid_names = self._resolve_tools_for_group()
for i, item in enumerate(eval_items):
task_id = str(uuid.uuid4())
logger.info(f"Eval [{i+1}/{len(eval_items)}]: {item['question'][:80]}...")
try:
# Build messages
messages: List[Dict[str, Any]] = []
if self.config.system_prompt:
messages.append({"role": "system", "content": self.config.system_prompt})
messages.append({"role": "user", "content": self.format_prompt(item)})
# Run the full agent loop with tools
agent = HermesAgentLoop(
server=self.server,
tool_schemas=tools,
valid_tool_names=valid_names,
max_turns=self.config.max_agent_turns,
task_id=task_id,
temperature=0.0, # Deterministic for eval
max_tokens=self.config.max_token_length,
extra_body=self.config.extra_body,
)
result = await agent.run(messages)
# Extract final response and tool usage from messages
final_response = ""
tool_call_count = 0
for msg in reversed(result.messages):
if msg.get("role") == "assistant" and msg.get("content") and not final_response:
final_response = msg["content"]
if msg.get("role") == "assistant" and msg.get("tool_calls"):
tool_call_count += len(msg["tool_calls"])
# Compute reward (includes LLM judge for correctness)
# Temporarily save buffer lengths so we can extract the
# correctness score without calling judge twice, and avoid
# polluting training metric buffers with eval data.
buf_len = len(self._correctness_buffer)
ctx = ToolContext(task_id)
try:
reward = await self.compute_reward(item, result, ctx)
finally:
ctx.cleanup()
# Extract correctness from the buffer (compute_reward appended it)
# then remove eval entries from training buffers
correctness = (
self._correctness_buffer[buf_len]
if len(self._correctness_buffer) > buf_len
else 0.0
)
# Roll back buffers to avoid polluting training metrics
for buf in (
self._reward_buffer, self._correctness_buffer,
self._tool_usage_buffer, self._efficiency_buffer,
self._diversity_buffer,
):
if len(buf) > buf_len:
buf.pop()
samples.append({
"prompt": item["question"],
"response": final_response[:500],
"expected": item["answer"],
"correctness": correctness,
"reward": reward,
"tool_calls": tool_call_count,
"turns": result.turns_used,
})
logger.info(
f" → correctness={correctness:.2f}, reward={reward:.3f}, "
f"tools={tool_call_count}, turns={result.turns_used}"
)
except Exception as e:
logger.error(f"Eval error on item: {e}")
samples.append({
"prompt": item["question"],
"response": f"ERROR: {e}",
"expected": item["answer"],
"correctness": 0.0,
"reward": 0.0,
"tool_calls": 0,
"turns": 0,
})
end_time = time.time()
# Compute aggregate metrics
correctness_scores = [s["correctness"] for s in samples]
rewards = [s["reward"] for s in samples]
tool_counts = [s["tool_calls"] for s in samples]
n = len(samples)
eval_metrics = {
"eval/mean_correctness": sum(correctness_scores) / n if n else 0.0,
"eval/mean_reward": sum(rewards) / n if n else 0.0,
"eval/mean_tool_calls": sum(tool_counts) / n if n else 0.0,
"eval/tool_usage_rate": sum(1 for t in tool_counts if t > 0) / n if n else 0.0,
"eval/n_items": n,
}
logger.info(
f"Eval complete — correctness={eval_metrics['eval/mean_correctness']:.3f}, "
f"reward={eval_metrics['eval/mean_reward']:.3f}, "
f"tool_usage={eval_metrics['eval/tool_usage_rate']:.0%}"
)
await self.evaluate_log(
metrics=eval_metrics,
samples=samples,
start_time=start_time,
end_time=end_time,
)
# ------------------------------------------------------------------
# 6. wandb_log — custom metrics
# ------------------------------------------------------------------
async def wandb_log(self, wandb_metrics: Optional[Dict] = None) -> None:
"""Log reward breakdown metrics to wandb."""
if wandb_metrics is None:
wandb_metrics = {}
if self._reward_buffer:
n = len(self._reward_buffer)
wandb_metrics["train/mean_reward"] = sum(self._reward_buffer) / n
wandb_metrics["train/mean_correctness"] = sum(self._correctness_buffer) / n
wandb_metrics["train/mean_tool_usage"] = sum(self._tool_usage_buffer) / n
wandb_metrics["train/mean_efficiency"] = sum(self._efficiency_buffer) / n
wandb_metrics["train/mean_diversity"] = sum(self._diversity_buffer) / n
wandb_metrics["train/total_rollouts"] = n
# Accuracy buckets
wandb_metrics["train/correct_rate"] = (
sum(1 for c in self._correctness_buffer if c >= 0.7) / n
)
wandb_metrics["train/tool_usage_rate"] = (
sum(1 for t in self._tool_usage_buffer if t > 0) / n
)
# Clear buffers
self._reward_buffer.clear()
self._correctness_buffer.clear()
self._tool_usage_buffer.clear()
self._efficiency_buffer.clear()
self._diversity_buffer.clear()
await super().wandb_log(wandb_metrics)
# ------------------------------------------------------------------
# Private helpers
# ------------------------------------------------------------------
async def _llm_judge(
self,
question: str,
expected: str,
model_answer: str,
) -> float:
"""
Use the server's LLM to judge answer correctness.
Falls back to keyword heuristic if LLM call fails.
"""
if not model_answer or not model_answer.strip():
return 0.0
judge_prompt = (
"You are an impartial judge evaluating the quality of an AI research answer.\n\n"
f"Question: {question}\n\n"
f"Reference answer: {expected}\n\n"
f"Model answer: {model_answer}\n\n"
"Score the model answer on a scale from 0.0 to 1.0 where:\n"
" 1.0 = fully correct and complete\n"
" 0.7 = mostly correct with minor gaps\n"
" 0.4 = partially correct\n"
" 0.1 = mentions relevant topic but wrong or very incomplete\n"
" 0.0 = completely wrong or no answer\n\n"
"Consider: factual accuracy, completeness, and relevance.\n"
'Respond with ONLY a JSON object: {"score": <float>, "reason": "<one sentence>"}'
)
try:
response = await self.server.chat_completion(
messages=[{"role": "user", "content": judge_prompt}],
n=1,
max_tokens=150,
temperature=0.0,
split="eval",
)
text = response.choices[0].message.content if response.choices else ""
parsed = self._parse_judge_json(text)
if parsed is not None:
return float(parsed)
except Exception as e:
logger.debug(f"LLM judge failed: {e}. Using heuristic.")
return self._heuristic_score(expected, model_answer)
@staticmethod
def _parse_judge_json(text: str) -> Optional[float]:
"""Extract the score float from LLM judge JSON response."""
try:
clean = re.sub(r"```(?:json)?|```", "", text).strip()
data = json.loads(clean)
score = float(data.get("score", -1))
if 0.0 <= score <= 1.0:
return score
except Exception:
match = re.search(r'"score"\s*:\s*([0-9.]+)', text)
if match:
score = float(match.group(1))
if 0.0 <= score <= 1.0:
return score
return None
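# e.g. text = '```json\n{"score": 0.7, "reason": "mostly correct"}\n```' parses to 0.7.
# If json.loads() fails on malformed output, the '"score": <num>' regex fallback
# still recovers a value as long as it lies in [0.0, 1.0].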
@staticmethod
def _heuristic_score(expected: str, model_answer: str) -> float:
"""Lightweight keyword overlap score as fallback."""
stopwords = {
"the", "a", "an", "is", "are", "was", "were", "of", "in", "on",
"at", "to", "for", "with", "and", "or", "but", "it", "its",
"this", "that", "as", "by", "from", "be", "has", "have", "had",
}
def tokenize(text: str) -> set:
tokens = re.findall(r'\b\w+\b', text.lower())
return {t for t in tokens if t not in stopwords and len(t) > 2}
expected_tokens = tokenize(expected)
answer_tokens = tokenize(model_answer)
if not expected_tokens:
return 0.5
overlap = len(expected_tokens & answer_tokens)
union = len(expected_tokens | answer_tokens)
jaccard = overlap / union if union > 0 else 0.0
recall = overlap / len(expected_tokens)
return min(1.0, 0.4 * jaccard + 0.6 * recall)
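# Worked example: expected="Django is written in Python",
# answer="Instagram uses Django, written in Python"
#   expected tokens {django, written, python}; overlap=3, union=5
#   -> jaccard=0.6, recall=1.0, score = 0.4*0.6 + 0.6*1.0 = 0.84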
@staticmethod
def _extract_domains(text: str) -> set:
"""Extract unique domains from URLs cited in the response."""
urls = re.findall(r'https?://[^\s\)>\]"\']+', text)
domains = set()
for url in urls:
try:
parsed = urlparse(url)
# str.lstrip() strips a character set, not a prefix; use removeprefix for "www."
domain = parsed.netloc.lower().removeprefix("www.")
if domain:
domains.add(domain)
except Exception:
pass
return domains
# ---------------------------------------------------------------------------
# Entry point
# ---------------------------------------------------------------------------
if __name__ == "__main__":
WebResearchEnv.cli()