feat: overhaul context length detection with models.dev and provider-aware resolution (#2158)

Replace the fragile hardcoded context length system with a multi-source resolution chain that correctly identifies context windows per provider. Key changes: - New agent/models_dev.py: Fetches and caches the models.dev registry (3800+ models across 100+ providers with per-provider context windows). In-memory cache (1hr TTL) + disk cache for cold starts. - Rewritten get_model_context_length() resolution chain: 0. Config override (model.context_length) 1. Custom providers per-model context_length 2. Persistent disk cache 3. Endpoint /models (local servers) 4. Anthropic /v1/models API (max_input_tokens, API-key only) 5. OpenRouter live API (existing, unchanged) 6. Nous suffix-match via OpenRouter (dot/dash normalization) 7. models.dev registry lookup (provider-aware) 8. Thin hardcoded defaults (broad family patterns) 9. 128K fallback (was 2M) - Provider-aware context: same model now correctly resolves to different context windows per provider (e.g. claude-opus-4.6: 1M on Anthropic, 128K on GitHub Copilot). Provider name flows through ContextCompressor. - DEFAULT_CONTEXT_LENGTHS shrunk from 80+ entries to ~16 broad patterns. models.dev replaces the per-model hardcoding. - CONTEXT_PROBE_TIERS changed from [2M, 1M, 512K, 200K, 128K, 64K, 32K] to [128K, 64K, 32K, 16K, 8K]. Unknown models no longer start at 2M. - hermes model: prompts for context_length when configuring custom endpoints. Supports shorthand (32k, 128K). Saved to custom_providers per-model config. - custom_providers schema extended with optional models dict for per-model context_length (backward compatible). - Nous Portal: suffix-matches bare IDs (claude-opus-4-6) against OpenRouter's prefixed IDs (anthropic/claude-opus-4.6) with dot/dash normalization. Handles all 15 current Nous models. - Anthropic direct: queries /v1/models for max_input_tokens. Only works with regular API keys (sk-ant-api*), not OAuth tokens. Falls through to models.dev for OAuth users. Tests: 5574 passed (18 new tests for models_dev + updated probe tiers) Docs: Updated configuration.md context length section, AGENTS.md Co-authored-by: Test <test@test.com>
2026-03-20 06:04:33 -07:00 · 2026-03-20 06:04:33 -07:00 · 88643a1ba9
commit 88643a1ba9
parent b7b585656b
13 changed files with 662 additions and 246 deletions
--- a/AGENTS.md
+++ b/AGENTS.md
@ -23,6 +23,7 @@ hermes-agent/
 │   ├── prompt_caching.py     # Anthropic prompt caching
 │   ├── auxiliary_client.py   # Auxiliary LLM client (vision, summarization)
 │   ├── model_metadata.py     # Model context lengths, token estimation
 │   ├── models_dev.py         # models.dev registry integration (provider-aware context)
 │   ├── display.py            # KawaiiSpinner, tool preview formatting
 │   ├── skill_commands.py     # Skill slash commands (shared CLI/gateway)
 │   └── trajectory.py         # Trajectory saving helpers
--- a/agent/context_compressor.py
+++ b/agent/context_compressor.py
@ -47,10 +47,12 @@ class ContextCompressor:
        base_url: str = "",
        api_key: str = "",
        config_context_length: int | None = None,
        provider: str = "",
    ):
        self.model = model
        self.base_url = base_url
        self.api_key = api_key
        self.provider = provider
        self.threshold_percent = threshold_percent
        self.protect_first_n = protect_first_n
        self.protect_last_n = protect_last_n
@ -60,6 +62,7 @@ class ContextCompressor:
        self.context_length = get_model_context_length(
            model, base_url=base_url, api_key=api_key,
            config_context_length=config_context_length,
            provider=provider,
        )
        self.threshold_tokens = int(self.context_length * threshold_percent)
        self.compression_count = 0
--- a/agent/model_metadata.py
+++ b/agent/model_metadata.py
@ -55,104 +55,52 @@ _endpoint_model_metadata_cache_time: Dict[str, float] = {}
 _ENDPOINT_MODEL_CACHE_TTL = 300
 # Descending tiers for context length probing when the model is unknown.
-# We start high and step down on context-length errors until one works.
+# We start at 128K (a safe default for most modern models) and step down
 # on context-length errors until one works.
 CONTEXT_PROBE_TIERS = [
    2_000_000,
    1_000_000,
    512_000,
    200_000,
    128_000,
    64_000,
    32_000,
    16_000,
    8_000,
 ]
 # Default context length when no detection method succeeds.
 DEFAULT_FALLBACK_CONTEXT = CONTEXT_PROBE_TIERS[0]
 # Thin fallback defaults — only broad model family patterns.
 # These fire only when provider is unknown AND models.dev/OpenRouter/Anthropic
 # all miss. Replaced the previous 80+ entry dict.
 # For provider-specific context lengths, models.dev is the primary source.
 DEFAULT_CONTEXT_LENGTHS = {
-    "anthropic/claude-opus-4": 200000,
+    # Anthropic Claude 4.6 (1M context) — bare IDs only to avoid
-    "anthropic/claude-opus-4.5": 200000,
+    # fuzzy-match collisions (e.g. "anthropic/claude-sonnet-4" is a
-    "anthropic/claude-opus-4.6": 1000000,
+    # substring of "anthropic/claude-sonnet-4.6").
-    "anthropic/claude-sonnet-4": 200000,
+    # OpenRouter-prefixed models resolve via OpenRouter live API or models.dev.
    "anthropic/claude-sonnet-4-20250514": 200000,
    "anthropic/claude-sonnet-4.5": 200000,
    "anthropic/claude-sonnet-4.6": 1000000,
    "anthropic/claude-haiku-4.5": 200000,
    # Bare Anthropic model IDs (for native API provider)
    "claude-opus-4-6": 1000000,
    "claude-sonnet-4-6": 1000000,
-    "claude-opus-4-5-20251101": 200000,
+    "claude-opus-4.6": 1000000,
-    "claude-sonnet-4-5-20250929": 200000,
+    "claude-sonnet-4.6": 1000000,
-    "claude-opus-4-1-20250805": 200000,
+    # Catch-all for older Claude models (must sort after specific entries)
-    "claude-opus-4-20250514": 200000,
+    "claude": 200000,
-    "claude-sonnet-4-20250514": 200000,
+    # OpenAI
-    "claude-haiku-4-5-20251001": 200000,
+    "gpt-4.1": 1047576,
    "openai/gpt-5": 128000,
    "openai/gpt-4.1": 1047576,
    "openai/gpt-4.1-mini": 1047576,
    "openai/gpt-4o": 128000,
    "openai/gpt-4-turbo": 128000,
    "openai/gpt-4o-mini": 128000,
    "google/gemini-3-pro-preview": 1048576,
    "google/gemini-3-flash": 1048576,
    "google/gemini-2.5-flash": 1048576,
    "google/gemini-2.0-flash": 1048576,
    "google/gemini-2.5-pro": 1048576,
    "deepseek/deepseek-v3.2": 65536,
    "meta-llama/llama-3.3-70b-instruct": 131072,
    "deepseek/deepseek-chat-v3": 65536,
    "qwen/qwen-2.5-72b-instruct": 32768,
    "glm-4.7": 202752,
    "glm-5": 202752,
    "glm-4.5": 131072,
    "glm-4.5-flash": 131072,
    "kimi-for-coding": 262144,
    "kimi-k2.5": 262144,
    "kimi-k2-thinking": 262144,
    "kimi-k2-thinking-turbo": 262144,
    "kimi-k2-turbo-preview": 262144,
    "kimi-k2-0905-preview": 131072,
    "MiniMax-M2.7": 204800,
    "MiniMax-M2.7-highspeed": 204800,
    "MiniMax-M2.5": 204800,
    "MiniMax-M2.5-highspeed": 204800,
    "MiniMax-M2.1": 204800,
    # OpenCode Zen models
    "gpt-5.4-pro": 128000,
    "gpt-5.4": 128000,
    "gpt-5.3-codex": 128000,
    "gpt-5.3-codex-spark": 128000,
    "gpt-5.2": 128000,
    "gpt-5.2-codex": 128000,
    "gpt-5.1": 128000,
    "gpt-5.1-codex": 128000,
    "gpt-5.1-codex-max": 128000,
    "gpt-5.1-codex-mini": 128000,
    "gpt-5": 128000,
-    "gpt-5-codex": 128000,
+    "gpt-4": 128000,
-    "gpt-5-nano": 128000,
+    # Google
-    # Bare model IDs without provider prefix (avoid duplicates with entries above)
+    "gemini": 1048576,
-    "claude-opus-4-5": 200000,
+    # DeepSeek
-    "claude-opus-4-1": 200000,
+    "deepseek": 128000,
-    "claude-sonnet-4-5": 200000,
+    # Meta
-    "claude-sonnet-4": 200000,
+    "llama": 131072,
-    "claude-haiku-4-5": 200000,
+    # Qwen
-    "claude-3-5-haiku": 200000,
+    "qwen": 131072,
-    "gemini-3.1-pro": 1048576,
+    # MiniMax
-    "gemini-3-pro": 1048576,
+    "minimax": 204800,
-    "gemini-3-flash": 1048576,
+    # GLM
-    "minimax-m2.5": 204800,
+    "glm": 202752,
-    "minimax-m2.5-free": 204800,
+    # Kimi
-    "minimax-m2.1": 204800,
+    "kimi": 262144,
    "glm-4.6": 202752,
    "kimi-k2": 262144,
    "qwen3-coder": 32768,
    "big-pickle": 128000,
    # Alibaba Cloud / DashScope Qwen models
    "qwen3.5-plus": 131072,
    "qwen3-max": 131072,
    "qwen3-coder-plus": 131072,
    "qwen3-coder-next": 131072,
    "qwen-plus-latest": 131072,
    "qwen3.5-flash": 131072,
    "qwen-vl-max": 32768,
 }
 _CONTEXT_LENGTH_KEYS = (
@ -693,22 +641,100 @@ def _query_local_context_length(model: str, base_url: str) -> Optional[int]:
    return None
 def _normalize_model_version(model: str) -> str:
    """Normalize version separators for matching.
    Nous uses dashes: claude-opus-4-6, claude-sonnet-4-5
    OpenRouter uses dots: claude-opus-4.6, claude-sonnet-4.5
    Normalize both to dashes for comparison.
    """
    return model.replace(".", "-")
 def _query_anthropic_context_length(model: str, base_url: str, api_key: str) -> Optional[int]:
    """Query Anthropic's /v1/models endpoint for context length.
    Only works with regular ANTHROPIC_API_KEY (sk-ant-api*).
    OAuth tokens (sk-ant-oat*) from Claude Code return 401.
    """
    if not api_key or api_key.startswith("sk-ant-oat"):
        return None  # OAuth tokens can't access /v1/models
    try:
        base = base_url.rstrip("/")
        if base.endswith("/v1"):
            base = base[:-3]
        url = f"{base}/v1/models?limit=1000"
        headers = {
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
        }
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code != 200:
            return None
        data = resp.json()
        for m in data.get("data", []):
            if m.get("id") == model:
                ctx = m.get("max_input_tokens")
                if isinstance(ctx, int) and ctx > 0:
                    return ctx
    except Exception as e:
        logger.debug("Anthropic /v1/models query failed: %s", e)
    return None
 def _resolve_nous_context_length(model: str) -> Optional[int]:
    """Resolve Nous Portal model context length via OpenRouter metadata.
    Nous model IDs are bare (e.g. 'claude-opus-4-6') while OpenRouter uses
    prefixed IDs (e.g. 'anthropic/claude-opus-4.6'). Try suffix matching
    with version normalization (dot↔dash).
    """
    metadata = fetch_model_metadata()  # OpenRouter cache
    # Exact match first
    if model in metadata:
        return metadata[model].get("context_length")
    normalized = _normalize_model_version(model).lower()
    for or_id, entry in metadata.items():
        bare = or_id.split("/", 1)[1] if "/" in or_id else or_id
        if bare.lower() == model.lower() or _normalize_model_version(bare).lower() == normalized:
            return entry.get("context_length")
    # Partial prefix match for cases like gemini-3-flash → gemini-3-flash-preview
    # Require match to be at a word boundary (followed by -, :, or end of string)
    model_lower = model.lower()
    for or_id, entry in metadata.items():
        bare = or_id.split("/", 1)[1] if "/" in or_id else or_id
        for candidate, query in [(bare.lower(), model_lower), (_normalize_model_version(bare).lower(), normalized)]:
            if candidate.startswith(query) and (
                len(candidate) == len(query) or candidate[len(query)] in "-:."
            ):
                return entry.get("context_length")
    return None
 def get_model_context_length(
    model: str,
    base_url: str = "",
    api_key: str = "",
    config_context_length: int | None = None,
    provider: str = "",
 ) -> int:
    """Get the context length for a model.
    Resolution order:
-    0. Explicit config override (model.context_length in config.yaml)
+    0. Explicit config override (model.context_length or custom_providers per-model)
    1. Persistent cache (previously discovered via probing)
    2. Active endpoint metadata (/models for explicit custom endpoints)
-    3. Local server query (for local endpoints when model not in /models list)
+    3. Local server query (for local endpoints)
-    4. OpenRouter API metadata
+    4. Anthropic /v1/models API (API-key users only, not OAuth)
-    5. Hardcoded DEFAULT_CONTEXT_LENGTHS (fuzzy match for hosted routes only)
+    5. OpenRouter live API metadata
-    6. First probe tier (2M) — will be narrowed on first context error
+    6. Nous suffix-match via OpenRouter cache
    7. models.dev registry lookup (provider-aware)
    8. Thin hardcoded defaults (broad family patterns)
    9. Default fallback (128K)
    """
    # 0. Explicit config override — user knows best
    if config_context_length is not None and isinstance(config_context_length, int) and config_context_length > 0:
@ -744,9 +770,7 @@ def get_model_context_length(
            if isinstance(context_length, int):
                return context_length
        if not _is_known_provider_base_url(base_url):
-            # Explicit third-party endpoints should not borrow fuzzy global
+            # 3. Try querying local server directly
            # defaults from unrelated providers with similarly named models.
            # But first try querying the local server directly.
            if is_local_endpoint(base_url):
                local_ctx = _query_local_context_length(model, base_url)
                if local_ctx and local_ctx > 0:
@ -756,31 +780,53 @@ def get_model_context_length(
                "Could not detect context length for model %r at %s — "
                "defaulting to %s tokens (probe-down). Set model.context_length "
                "in config.yaml to override.",
-                model, base_url, f"{CONTEXT_PROBE_TIERS[0]:,}",
+                model, base_url, f"{DEFAULT_FALLBACK_CONTEXT:,}",
            )
-            return CONTEXT_PROBE_TIERS[0]
+            return DEFAULT_FALLBACK_CONTEXT
-    # 3. OpenRouter API metadata
+    # 4. Anthropic /v1/models API (only for regular API keys, not OAuth)
    if provider == "anthropic" or (
        base_url and "api.anthropic.com" in base_url
    ):
        ctx = _query_anthropic_context_length(model, base_url or "https://api.anthropic.com", api_key)
        if ctx:
            return ctx
    # 5. Provider-aware lookups (before generic OpenRouter cache)
    # These are provider-specific and take priority over the generic OR cache,
    # since the same model can have different context limits per provider
    # (e.g. claude-opus-4.6 is 1M on Anthropic but 128K on GitHub Copilot).
    if provider == "nous":
        ctx = _resolve_nous_context_length(model)
        if ctx:
            return ctx
    elif provider:
        from agent.models_dev import lookup_models_dev_context
        ctx = lookup_models_dev_context(provider, model)
        if ctx:
            return ctx
    # 6. OpenRouter live API metadata (provider-unaware fallback)
    metadata = fetch_model_metadata()
    if model in metadata:
        return metadata[model].get("context_length", 128000)
-    # 4. Hardcoded defaults (fuzzy match — longest key first for specificity)
+    # 8. Hardcoded defaults (fuzzy match — longest key first for specificity)
    for default_model, length in sorted(
        DEFAULT_CONTEXT_LENGTHS.items(), key=lambda x: len(x[0]), reverse=True
    ):
        if default_model in model or model in default_model:
            return length
-    # 5. Query local server for unknown models before defaulting to 2M
+    # 9. Query local server as last resort
    if base_url and is_local_endpoint(base_url):
        local_ctx = _query_local_context_length(model, base_url)
        if local_ctx and local_ctx > 0:
            save_context_length(model, base_url, local_ctx)
            return local_ctx
-    # 6. Unknown model — start at highest probe tier
+    # 10. Default fallback — 128K
-    return CONTEXT_PROBE_TIERS[0]
+    return DEFAULT_FALLBACK_CONTEXT
 def estimate_tokens_rough(text: str) -> int:
--- a/agent/models_dev.py
+++ b/agent/models_dev.py
@ -0,0 +1,170 @@
 """Models.dev registry integration for provider-aware context length detection.
 Fetches model metadata from https://models.dev/api.json — a community-maintained
 database of 3800+ models across 100+ providers, including per-provider context
 windows, pricing, and capabilities.
 Data is cached in memory (1hr TTL) and on disk (~/.hermes/models_dev_cache.json)
 to avoid cold-start network latency.
 """
 import json
 import logging
 import os
 import time
 from pathlib import Path
 from typing import Any, Dict, Optional
 import requests
 logger = logging.getLogger(__name__)
 MODELS_DEV_URL = "https://models.dev/api.json"
 _MODELS_DEV_CACHE_TTL = 3600  # 1 hour in-memory
 # In-memory cache
 _models_dev_cache: Dict[str, Any] = {}
 _models_dev_cache_time: float = 0
 # Provider ID mapping: Hermes provider names → models.dev provider IDs
 PROVIDER_TO_MODELS_DEV: Dict[str, str] = {
    "openrouter": "openrouter",
    "anthropic": "anthropic",
    "zai": "zai",
    "kimi-coding": "kimi-for-coding",
    "minimax": "minimax",
    "minimax-cn": "minimax-cn",
    "deepseek": "deepseek",
    "alibaba": "alibaba",
    "copilot": "github-copilot",
    "ai-gateway": "vercel",
    "opencode-zen": "opencode",
    "opencode-go": "opencode-go",
    "kilocode": "kilo",
 }
 def _get_cache_path() -> Path:
    """Return path to disk cache file."""
    env_val = os.environ.get("HERMES_HOME", "")
    hermes_home = Path(env_val) if env_val else Path.home() / ".hermes"
    return hermes_home / "models_dev_cache.json"
 def _load_disk_cache() -> Dict[str, Any]:
    """Load models.dev data from disk cache."""
    try:
        cache_path = _get_cache_path()
        if cache_path.exists():
            with open(cache_path, encoding="utf-8") as f:
                return json.load(f)
    except Exception as e:
        logger.debug("Failed to load models.dev disk cache: %s", e)
    return {}
 def _save_disk_cache(data: Dict[str, Any]) -> None:
    """Save models.dev data to disk cache."""
    try:
        cache_path = _get_cache_path()
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        with open(cache_path, "w", encoding="utf-8") as f:
            json.dump(data, f, separators=(",", ":"))
    except Exception as e:
        logger.debug("Failed to save models.dev disk cache: %s", e)
 def fetch_models_dev(force_refresh: bool = False) -> Dict[str, Any]:
    """Fetch models.dev registry. In-memory cache (1hr) + disk fallback.
    Returns the full registry dict keyed by provider ID, or empty dict on failure.
    """
    global _models_dev_cache, _models_dev_cache_time
    # Check in-memory cache
    if (
        not force_refresh
        and _models_dev_cache
        and (time.time() - _models_dev_cache_time) < _MODELS_DEV_CACHE_TTL
    ):
        return _models_dev_cache
    # Try network fetch
    try:
        response = requests.get(MODELS_DEV_URL, timeout=15)
        response.raise_for_status()
        data = response.json()
        if isinstance(data, dict) and len(data) > 0:
            _models_dev_cache = data
            _models_dev_cache_time = time.time()
            _save_disk_cache(data)
            logger.debug(
                "Fetched models.dev registry: %d providers, %d total models",
                len(data),
                sum(len(p.get("models", {})) for p in data.values() if isinstance(p, dict)),
            )
            return data
    except Exception as e:
        logger.debug("Failed to fetch models.dev: %s", e)
    # Fall back to disk cache
    if not _models_dev_cache:
        _models_dev_cache = _load_disk_cache()
        if _models_dev_cache:
            _models_dev_cache_time = time.time()
            logger.debug("Loaded models.dev from disk cache (%d providers)", len(_models_dev_cache))
    return _models_dev_cache
 def lookup_models_dev_context(provider: str, model: str) -> Optional[int]:
    """Look up context_length for a provider+model combo in models.dev.
    Returns the context window in tokens, or None if not found.
    Handles case-insensitive matching and filters out context=0 entries.
    """
    mdev_provider_id = PROVIDER_TO_MODELS_DEV.get(provider)
    if not mdev_provider_id:
        return None
    data = fetch_models_dev()
    provider_data = data.get(mdev_provider_id)
    if not isinstance(provider_data, dict):
        return None
    models = provider_data.get("models", {})
    if not isinstance(models, dict):
        return None
    # Exact match
    entry = models.get(model)
    if entry:
        ctx = _extract_context(entry)
        if ctx:
            return ctx
    # Case-insensitive match
    model_lower = model.lower()
    for mid, mdata in models.items():
        if mid.lower() == model_lower:
            ctx = _extract_context(mdata)
            if ctx:
                return ctx
    return None
 def _extract_context(entry: Dict[str, Any]) -> Optional[int]:
    """Extract context_length from a models.dev model entry.
    Returns None for invalid/zero values (some audio/image models have context=0).
    """
    if not isinstance(entry, dict):
        return None
    limit = entry.get("limit")
    if not isinstance(limit, dict):
        return None
    ctx = limit.get("context")
    if isinstance(ctx, (int, float)) and ctx > 0:
        return int(ctx)
    return None
--- a/hermes_cli/main.py
+++ b/hermes_cli/main.py
@ -1137,10 +1137,21 @@ def _model_flow_custom(config):
        base_url = input(f"API base URL [{current_url or 'e.g. https://api.example.com/v1'}]: ").strip()
        api_key = input(f"API key [{current_key[:8] + '...' if current_key else 'optional'}]: ").strip()
        model_name = input("Model name (e.g. gpt-4, llama-3-70b): ").strip()
        context_length_str = input("Context length in tokens [leave blank for auto-detect]: ").strip()
    except (KeyboardInterrupt, EOFError):
        print("\nCancelled.")
        return
    context_length = None
    if context_length_str:
        try:
            context_length = int(context_length_str.replace(",", "").replace("k", "000").replace("K", "000"))
            if context_length <= 0:
                context_length = None
        except ValueError:
            print(f"Invalid context length: {context_length_str} — will auto-detect.")
            context_length = None
    if not base_url and not current_url:
        print("No URL provided. Cancelled.")
        return
@ -1203,14 +1214,14 @@ def _model_flow_custom(config):
        print("Endpoint saved. Use `/model` in chat or `hermes model` to set a model.")
    # Auto-save to custom_providers so it appears in the menu next time
-    _save_custom_provider(effective_url, effective_key, model_name or "")
+    _save_custom_provider(effective_url, effective_key, model_name or "", context_length=context_length)
-def _save_custom_provider(base_url, api_key="", model=""):
+def _save_custom_provider(base_url, api_key="", model="", context_length=None):
    """Save a custom endpoint to custom_providers in config.yaml.
    Deduplicates by base_url — if the URL already exists, updates the
-    model name but doesn't add a duplicate entry.
+    model name and context_length but doesn't add a duplicate entry.
    Auto-generates a display name from the URL hostname.
    """
    from hermes_cli.config import load_config, save_config
@ -1220,14 +1231,24 @@ def _save_custom_provider(base_url, api_key="", model=""):
    if not isinstance(providers, list):
        providers = []
-    # Check if this URL is already saved — update model if so
+    # Check if this URL is already saved — update model/context_length if so
    for entry in providers:
        if isinstance(entry, dict) and entry.get("base_url", "").rstrip("/") == base_url.rstrip("/"):
            changed = False
            if model and entry.get("model") != model:
                entry["model"] = model
                changed = True
            if model and context_length:
                models_cfg = entry.get("models", {})
                if not isinstance(models_cfg, dict):
                    models_cfg = {}
                models_cfg[model] = {"context_length": context_length}
                entry["models"] = models_cfg
                changed = True
            if changed:
                cfg["custom_providers"] = providers
                save_config(cfg)
-            return  # already saved, updated model if needed
+            return  # already saved, updated if needed
    # Auto-generate a name from the URL
    import re
@ -1249,6 +1270,8 @@ def _save_custom_provider(base_url, api_key="", model=""):
        entry["api_key"] = api_key
    if model:
        entry["model"] = model
    if model and context_length:
        entry["models"] = {model: {"context_length": context_length}}
    providers.append(entry)
    cfg["custom_providers"] = providers
--- a/hermes_cli/setup.py
+++ b/hermes_cli/setup.py
@ -1045,93 +1045,17 @@ def setup_model_provider(config: dict):
        print()
        print_header("Custom OpenAI-Compatible Endpoint")
        print_info("Works with any API that follows OpenAI's chat completions spec")
        print()
-        current_url = get_env_value("OPENAI_BASE_URL") or ""
+        # Reuse the shared custom endpoint flow from `hermes model`.
-        current_key = get_env_value("OPENAI_API_KEY")
+        # This handles: URL/key/model/context-length prompts, endpoint probing,
-        _raw_model = config.get("model", "")
+        # env saving, config.yaml updates, and custom_providers persistence.
-        current_model = (
+        from hermes_cli.main import _model_flow_custom
-            _raw_model.get("default", "")
+        _model_flow_custom(config)
-            if isinstance(_raw_model, dict)
+        # _model_flow_custom handles model selection, config, env vars,
-            else (_raw_model or "")
+        # and custom_providers. Keep selected_provider = "custom" so
-        )
+        # the model selection step below is skipped (line 1631 check)
-
+        # but vision and TTS setup still run.
        if current_url:
            print_info(f"  Current URL: {current_url}")
        if current_key:
            print_info(f"  Current key: {current_key[:8]}... (configured)")
        base_url = prompt(
            "  API base URL (e.g., https://api.example.com/v1)", current_url
        ).strip()
        api_key = prompt("  API key", password=True)
        model_name = prompt("  Model name (e.g., gpt-4, claude-3-opus)", current_model)
        if base_url:
            from hermes_cli.models import probe_api_models
            probe = probe_api_models(api_key, base_url)
            if probe.get("used_fallback") and probe.get("resolved_base_url"):
                print_warning(
                    f"Endpoint verification worked at {probe['resolved_base_url']}/models, "
                    f"not the exact URL you entered. Saving the working base URL instead."
                )
                base_url = probe["resolved_base_url"]
            elif probe.get("models") is not None:
                print_success(
                    f"Verified endpoint via {probe.get('probed_url')} "
                    f"({len(probe.get('models') or [])} model(s) visible)"
                )
            else:
                print_warning(
                    f"Could not verify this endpoint via {probe.get('probed_url')}. "
                    f"Hermes will still save it."
                )
                if probe.get("suggested_base_url"):
                    print_info(
                        f"  If this server expects /v1, try base URL: {probe['suggested_base_url']}"
                    )
            save_env_value("OPENAI_BASE_URL", base_url)
        if api_key:
            save_env_value("OPENAI_API_KEY", api_key)
        if model_name:
            _set_default_model(config, model_name)
        try:
            from hermes_cli.auth import deactivate_provider
            deactivate_provider()
        except Exception:
            pass
        # Save provider and base_url to config.yaml so the gateway and CLI
        # both resolve the correct provider without relying on env-var heuristics.
        if base_url:
            import yaml
            config_path = (
                Path(os.environ.get("HERMES_HOME", Path.home() / ".hermes"))
                / "config.yaml"
            )
            try:
                disk_cfg = {}
                if config_path.exists():
                    disk_cfg = yaml.safe_load(config_path.read_text()) or {}
                model_section = disk_cfg.get("model", {})
                if isinstance(model_section, str):
                    model_section = {"default": model_section}
                model_section["provider"] = "custom"
                model_section["base_url"] = base_url.rstrip("/")
                if model_name:
                    model_section["default"] = model_name
                disk_cfg["model"] = model_section
                config_path.write_text(yaml.safe_dump(disk_cfg, sort_keys=False))
            except Exception as e:
                logger.debug("Could not save provider to config.yaml: %s", e)
            _set_model_provider(config, "custom", base_url)
        print_success("Custom endpoint configured")
    elif provider_idx == 4:  # Z.AI / GLM
        selected_provider = "zai"
--- a/run_agent.py
+++ b/run_agent.py
@ -992,6 +992,27 @@ class AIAgent:
            except (TypeError, ValueError):
                _config_context_length = None
        # Check custom_providers per-model context_length
        if _config_context_length is None:
            _custom_providers = _agent_cfg.get("custom_providers")
            if isinstance(_custom_providers, list):
                for _cp_entry in _custom_providers:
                    if not isinstance(_cp_entry, dict):
                        continue
                    _cp_url = (_cp_entry.get("base_url") or "").rstrip("/")
                    if _cp_url and _cp_url == self.base_url.rstrip("/"):
                        _cp_models = _cp_entry.get("models", {})
                        if isinstance(_cp_models, dict):
                            _cp_model_cfg = _cp_models.get(self.model, {})
                            if isinstance(_cp_model_cfg, dict):
                                _cp_ctx = _cp_model_cfg.get("context_length")
                                if _cp_ctx is not None:
                                    try:
                                        _config_context_length = int(_cp_ctx)
                                    except (TypeError, ValueError):
                                        pass
                        break
        self.context_compressor = ContextCompressor(
            model=self.model,
            threshold_percent=compression_threshold,
@ -1003,6 +1024,7 @@ class AIAgent:
            base_url=self.base_url,
            api_key=getattr(self, "api_key", ""),
            config_context_length=_config_context_length,
            provider=self.provider,
        )
        self.compression_enabled = compression_enabled
        self._user_turn_count = 0
--- a/tests/agent/test_model_metadata.py
+++ b/tests/agent/test_model_metadata.py
@ -472,35 +472,35 @@ class TestContextProbeTiers:
        for i in range(len(CONTEXT_PROBE_TIERS) - 1):
            assert CONTEXT_PROBE_TIERS[i] > CONTEXT_PROBE_TIERS[i + 1]
-    def test_first_tier_is_2m(self):
+    def test_first_tier_is_128k(self):
-        assert CONTEXT_PROBE_TIERS[0] == 2_000_000
+        assert CONTEXT_PROBE_TIERS[0] == 128_000
-    def test_last_tier_is_32k(self):
+    def test_last_tier_is_8k(self):
-        assert CONTEXT_PROBE_TIERS[-1] == 32_000
+        assert CONTEXT_PROBE_TIERS[-1] == 8_000
 class TestGetNextProbeTier:
    def test_from_2m(self):
        assert get_next_probe_tier(2_000_000) == 1_000_000
    def test_from_1m(self):
        assert get_next_probe_tier(1_000_000) == 512_000
    def test_from_128k(self):
        assert get_next_probe_tier(128_000) == 64_000
-    def test_from_32k_returns_none(self):
+    def test_from_64k(self):
-        assert get_next_probe_tier(32_000) is None
+        assert get_next_probe_tier(64_000) == 32_000
    def test_from_32k(self):
        assert get_next_probe_tier(32_000) == 16_000
    def test_from_8k_returns_none(self):
        assert get_next_probe_tier(8_000) is None
    def test_from_below_min_returns_none(self):
-        assert get_next_probe_tier(16_000) is None
+        assert get_next_probe_tier(4_000) is None
    def test_from_arbitrary_value(self):
-        assert get_next_probe_tier(300_000) == 200_000
+        assert get_next_probe_tier(100_000) == 64_000
    def test_above_max_tier(self):
-        """Value above 2M should return 2M."""
+        """Value above 128K should return 128K."""
-        assert get_next_probe_tier(5_000_000) == 2_000_000
+        assert get_next_probe_tier(500_000) == 128_000
    def test_zero_returns_none(self):
        assert get_next_probe_tier(0) is None
--- a/tests/agent/test_models_dev.py
+++ b/tests/agent/test_models_dev.py
@ -0,0 +1,197 @@
 """Tests for agent.models_dev — models.dev registry integration."""
 import json
 from unittest.mock import patch, MagicMock
 import pytest
 from agent.models_dev import (
    PROVIDER_TO_MODELS_DEV,
    _extract_context,
    fetch_models_dev,
    lookup_models_dev_context,
 )
 SAMPLE_REGISTRY = {
    "anthropic": {
        "id": "anthropic",
        "name": "Anthropic",
        "models": {
            "claude-opus-4-6": {
                "id": "claude-opus-4-6",
                "limit": {"context": 1000000, "output": 128000},
            },
            "claude-sonnet-4-6": {
                "id": "claude-sonnet-4-6",
                "limit": {"context": 1000000, "output": 64000},
            },
            "claude-sonnet-4-0": {
                "id": "claude-sonnet-4-0",
                "limit": {"context": 200000, "output": 64000},
            },
        },
    },
    "github-copilot": {
        "id": "github-copilot",
        "name": "GitHub Copilot",
        "models": {
            "claude-opus-4.6": {
                "id": "claude-opus-4.6",
                "limit": {"context": 128000, "output": 32000},
            },
        },
    },
    "kilo": {
        "id": "kilo",
        "name": "Kilo Gateway",
        "models": {
            "anthropic/claude-sonnet-4.6": {
                "id": "anthropic/claude-sonnet-4.6",
                "limit": {"context": 1000000, "output": 128000},
            },
        },
    },
    "deepseek": {
        "id": "deepseek",
        "name": "DeepSeek",
        "models": {
            "deepseek-chat": {
                "id": "deepseek-chat",
                "limit": {"context": 128000, "output": 8192},
            },
        },
    },
    "audio-only": {
        "id": "audio-only",
        "models": {
            "tts-model": {
                "id": "tts-model",
                "limit": {"context": 0, "output": 0},
            },
        },
    },
 }
 class TestProviderMapping:
    def test_all_mapped_providers_are_strings(self):
        for hermes_id, mdev_id in PROVIDER_TO_MODELS_DEV.items():
            assert isinstance(hermes_id, str)
            assert isinstance(mdev_id, str)
    def test_known_providers_mapped(self):
        assert PROVIDER_TO_MODELS_DEV["anthropic"] == "anthropic"
        assert PROVIDER_TO_MODELS_DEV["copilot"] == "github-copilot"
        assert PROVIDER_TO_MODELS_DEV["kilocode"] == "kilo"
        assert PROVIDER_TO_MODELS_DEV["ai-gateway"] == "vercel"
    def test_unmapped_provider_not_in_dict(self):
        assert "nous" not in PROVIDER_TO_MODELS_DEV
        assert "openai-codex" not in PROVIDER_TO_MODELS_DEV
 class TestExtractContext:
    def test_valid_entry(self):
        assert _extract_context({"limit": {"context": 128000}}) == 128000
    def test_zero_context_returns_none(self):
        assert _extract_context({"limit": {"context": 0}}) is None
    def test_missing_limit_returns_none(self):
        assert _extract_context({"id": "test"}) is None
    def test_missing_context_returns_none(self):
        assert _extract_context({"limit": {"output": 8192}}) is None
    def test_non_dict_returns_none(self):
        assert _extract_context("not a dict") is None
    def test_float_context_coerced_to_int(self):
        assert _extract_context({"limit": {"context": 131072.0}}) == 131072
 class TestLookupModelsDevContext:
    @patch("agent.models_dev.fetch_models_dev")
    def test_exact_match(self, mock_fetch):
        mock_fetch.return_value = SAMPLE_REGISTRY
        assert lookup_models_dev_context("anthropic", "claude-opus-4-6") == 1000000
    @patch("agent.models_dev.fetch_models_dev")
    def test_case_insensitive_match(self, mock_fetch):
        mock_fetch.return_value = SAMPLE_REGISTRY
        assert lookup_models_dev_context("anthropic", "Claude-Opus-4-6") == 1000000
    @patch("agent.models_dev.fetch_models_dev")
    def test_provider_not_mapped(self, mock_fetch):
        mock_fetch.return_value = SAMPLE_REGISTRY
        assert lookup_models_dev_context("nous", "some-model") is None
    @patch("agent.models_dev.fetch_models_dev")
    def test_model_not_found(self, mock_fetch):
        mock_fetch.return_value = SAMPLE_REGISTRY
        assert lookup_models_dev_context("anthropic", "nonexistent-model") is None
    @patch("agent.models_dev.fetch_models_dev")
    def test_provider_aware_context(self, mock_fetch):
        """Same model, different context per provider."""
        mock_fetch.return_value = SAMPLE_REGISTRY
        # Anthropic direct: 1M
        assert lookup_models_dev_context("anthropic", "claude-opus-4-6") == 1000000
        # GitHub Copilot: only 128K for same model
        assert lookup_models_dev_context("copilot", "claude-opus-4.6") == 128000
    @patch("agent.models_dev.fetch_models_dev")
    def test_zero_context_filtered(self, mock_fetch):
        mock_fetch.return_value = SAMPLE_REGISTRY
        # audio-only is not a mapped provider, but test the filtering directly
        data = SAMPLE_REGISTRY["audio-only"]["models"]["tts-model"]
        assert _extract_context(data) is None
    @patch("agent.models_dev.fetch_models_dev")
    def test_empty_registry(self, mock_fetch):
        mock_fetch.return_value = {}
        assert lookup_models_dev_context("anthropic", "claude-opus-4-6") is None
 class TestFetchModelsDev:
    @patch("agent.models_dev.requests.get")
    def test_fetch_success(self, mock_get):
        mock_resp = MagicMock()
        mock_resp.status_code = 200
        mock_resp.json.return_value = SAMPLE_REGISTRY
        mock_resp.raise_for_status = MagicMock()
        mock_get.return_value = mock_resp
        # Clear caches
        import agent.models_dev as md
        md._models_dev_cache = {}
        md._models_dev_cache_time = 0
        with patch.object(md, "_save_disk_cache"):
            result = fetch_models_dev(force_refresh=True)
        assert "anthropic" in result
        assert len(result) == len(SAMPLE_REGISTRY)
    @patch("agent.models_dev.requests.get")
    def test_fetch_failure_returns_stale_cache(self, mock_get):
        mock_get.side_effect = Exception("network error")
        import agent.models_dev as md
        md._models_dev_cache = SAMPLE_REGISTRY
        md._models_dev_cache_time = 0  # expired
        with patch.object(md, "_load_disk_cache", return_value=SAMPLE_REGISTRY):
            result = fetch_models_dev(force_refresh=True)
        assert "anthropic" in result
    @patch("agent.models_dev.requests.get")
    def test_in_memory_cache_used(self, mock_get):
        import agent.models_dev as md
        import time
        md._models_dev_cache = SAMPLE_REGISTRY
        md._models_dev_cache_time = time.time()  # fresh
        result = fetch_models_dev()
        mock_get.assert_not_called()
        assert result == SAMPLE_REGISTRY
--- a/tests/hermes_cli/test_setup.py
+++ b/tests/hermes_cli/test_setup.py
@ -97,30 +97,32 @@ def test_custom_setup_clears_active_oauth_provider(tmp_path, monkeypatch):
    monkeypatch.setattr("hermes_cli.setup.prompt_choice", fake_prompt_choice)
-    prompt_values = iter(
+    # _model_flow_custom uses builtins.input (URL, key, model, context_length)
-        [
+    input_values = iter([
        "https://custom.example/v1",
        "custom-api-key",
        "custom/model",
-        ]
+        "",  # context_length (blank = auto-detect)
-    )
+    ])
-    monkeypatch.setattr(
+    monkeypatch.setattr("builtins.input", lambda _prompt="": next(input_values))
        "hermes_cli.setup.prompt",
        lambda *args, **kwargs: next(prompt_values),
    )
    monkeypatch.setattr("hermes_cli.setup.prompt_yes_no", lambda *args, **kwargs: False)
    monkeypatch.setattr("hermes_cli.auth.detect_external_credentials", lambda: [])
    monkeypatch.setattr("hermes_cli.main._save_custom_provider", lambda *args, **kwargs: None)
    monkeypatch.setattr(
        "hermes_cli.models.probe_api_models",
        lambda api_key, base_url: {"models": ["m"], "probed_url": base_url + "/models"},
    )
    setup_model_provider(config)
    save_config(config)
    reloaded = load_config()
    # Core assertion: switching to custom endpoint clears OAuth provider
    assert get_active_provider() is None
-    assert isinstance(reloaded["model"], dict)
+
-    assert reloaded["model"]["provider"] == "custom"
+    # _model_flow_custom writes config via its own load/save cycle
-    assert reloaded["model"]["base_url"] == "https://custom.example/v1"
+    reloaded = load_config()
-    assert reloaded["model"]["default"] == "custom/model"
+    if isinstance(reloaded.get("model"), dict):
        assert reloaded["model"].get("provider") == "custom"
        assert reloaded["model"].get("default") == "custom/model"
 def test_codex_setup_uses_runtime_access_token_for_live_model_list(tmp_path, monkeypatch):
--- a/tests/hermes_cli/test_setup_model_provider.py
+++ b/tests/hermes_cli/test_setup_model_provider.py
@ -99,21 +99,21 @@ def test_setup_custom_endpoint_saves_working_v1_base_url(tmp_path, monkeypatch):
            return tts_idx
        raise AssertionError(f"Unexpected prompt_choice call: {question}")
-    def fake_prompt(message, current=None, **kwargs):
+    # _model_flow_custom uses builtins.input (URL, key, model, context_length)
-        if "API base URL" in message:
+    input_values = iter([
-            return "http://localhost:8000"
+        "http://localhost:8000",
-        if "API key" in message:
+        "local-key",
-            return "local-key"
+        "llm",
-        if "Model name" in message:
+        "",  # context_length (blank = auto-detect)
-            return "llm"
+    ])
-        return ""
+    monkeypatch.setattr("builtins.input", lambda _prompt="": next(input_values))
    monkeypatch.setattr("hermes_cli.setup.prompt_choice", fake_prompt_choice)
    monkeypatch.setattr("hermes_cli.setup.prompt", fake_prompt)
    monkeypatch.setattr("hermes_cli.setup.prompt_yes_no", lambda *args, **kwargs: False)
    monkeypatch.setattr("hermes_cli.auth.get_active_provider", lambda: None)
    monkeypatch.setattr("hermes_cli.auth.detect_external_credentials", lambda: [])
    monkeypatch.setattr("agent.auxiliary_client.get_available_vision_backends", lambda: [])
    monkeypatch.setattr("hermes_cli.main._save_custom_provider", lambda *args, **kwargs: None)
    monkeypatch.setattr(
        "hermes_cli.models.probe_api_models",
        lambda api_key, base_url: {
@ -126,16 +126,19 @@ def test_setup_custom_endpoint_saves_working_v1_base_url(tmp_path, monkeypatch):
    )
    setup_model_provider(config)
    save_config(config)
    env = _read_env(tmp_path)
    reloaded = load_config()
    # _model_flow_custom saves env vars and config to disk
    assert env.get("OPENAI_BASE_URL") == "http://localhost:8000/v1"
    assert env.get("OPENAI_API_KEY") == "local-key"
-    assert reloaded["model"]["provider"] == "custom"
+
-    assert reloaded["model"]["base_url"] == "http://localhost:8000/v1"
+    # The model config is saved as a dict by _model_flow_custom
-    assert reloaded["model"]["default"] == "llm"
+    reloaded = load_config()
    model_cfg = reloaded.get("model", {})
    if isinstance(model_cfg, dict):
        assert model_cfg.get("provider") == "custom"
        assert model_cfg.get("default") == "llm"
 def test_setup_keep_current_config_provider_uses_provider_specific_model_menu(tmp_path, monkeypatch):
--- a/tests/test_cli_provider_resolution.py
+++ b/tests/test_cli_provider_resolution.py
@ -459,7 +459,7 @@ def test_model_flow_custom_saves_verified_v1_base_url(monkeypatch, capsys):
    )
    monkeypatch.setattr("hermes_cli.config.save_config", lambda cfg: None)
-    answers = iter(["http://localhost:8000", "local-key", "llm"])
+    answers = iter(["http://localhost:8000", "local-key", "llm", ""])
    monkeypatch.setattr("builtins.input", lambda _prompt="": next(answers))
    hermes_main._model_flow_custom({})
--- a/website/docs/user-guide/configuration.md
+++ b/website/docs/user-guide/configuration.md
@ -416,7 +416,19 @@ LLM_MODEL=meta-llama/Llama-3.1-70B-Instruct-Turbo
 ### Context Length Detection
-Hermes automatically detects your model's context length by querying the endpoint's `/v1/models` response. For most setups this works out of the box. If detection fails (the model name doesn't match, the endpoint doesn't expose `/v1/models`, etc.), Hermes falls back to a high default and probes downward on context-length errors.
+Hermes uses a multi-source resolution chain to detect the correct context window for your model and provider:
 1. **Config override** — `model.context_length` in config.yaml (highest priority)
 2. **Custom provider per-model** — `custom_providers[].models.<id>.context_length`
 3. **Persistent cache** — previously discovered values (survives restarts)
 4. **Endpoint `/models`** — queries your server's API (local/custom endpoints)
 5. **Anthropic `/v1/models`** — queries Anthropic's API for `max_input_tokens` (API-key users only)
 6. **OpenRouter API** — live model metadata from OpenRouter
 7. **Nous Portal** — suffix-matches Nous model IDs against OpenRouter metadata
 8. **[models.dev](https://models.dev)** — community-maintained registry with provider-specific context lengths for 3800+ models across 100+ providers
 9. **Fallback defaults** — broad model family patterns (128K default)
 For most setups this works out of the box. The system is provider-aware — the same model can have different context limits depending on who serves it (e.g., `claude-opus-4.6` is 1M on Anthropic direct but 128K on GitHub Copilot).
 To set the context length explicitly, add `context_length` to your model config:
@ -427,10 +439,23 @@ model:
  context_length: 131072  # tokens
 ```
-This takes highest priority — it overrides auto-detection, cached values, and hardcoded defaults.
+For custom endpoints, you can also set context length per model:
 ```yaml
 custom_providers:
  - name: "My Local LLM"
    base_url: "http://localhost:11434/v1"
    models:
      qwen3.5:27b:
        context_length: 32768
      deepseek-r1:70b:
        context_length: 65536
 ```
 `hermes model` will prompt for context length when configuring a custom endpoint. Leave it blank for auto-detection.
 :::tip When to set this manually
- Your model shows "2M context" in the status bar (detection failed)
+- You're using Ollama with a custom `num_ctx` that's lower than the model's maximum
 - You want to limit context below the model's maximum (e.g., 8k on a 128k model to save VRAM)
 - You're running behind a proxy that doesn't expose `/v1/models`
 :::