fix: detect context length for custom model endpoints via fuzzy matching + config override (#2051)

* fix: detect context length for custom model endpoints via fuzzy matching + config override

Custom model endpoints (non-OpenRouter, non-known-provider) were silently
falling back to 2M tokens when the model name didn't exactly match what the
endpoint's /v1/models reported. This happened because:

1. Endpoint metadata lookup used exact match only — model name mismatches
   (e.g. 'qwen3.5:9b' vs 'Qwen3.5-9B-Q4_K_M.gguf') caused a miss
2. Single-model servers (common for local inference) required exact name
   match even though only one model was loaded
3. No user escape hatch to manually set context length

Changes:
- Add fuzzy matching for endpoint model metadata: single-model servers
  use the only available model regardless of name; multi-model servers
  try substring matching in both directions
- Add model.context_length config override (highest priority) so users
  can explicitly set their model's context length in config.yaml
- Log an informative message when falling back to 2M probe, telling
  users about the config override option
- Thread config_context_length through ContextCompressor and AIAgent init

Tests: 6 new tests covering fuzzy match, single-model fallback, config
override (including zero/None edge cases).

* fix: auto-detect local model name and context length for local servers

Cherry-picked from PR #2043 by sudoingX.

- Auto-detect model name from local server's /v1/models when only one
  model is loaded (no manual model name config needed)
- Add n_ctx_train and n_ctx to context length detection keys for llama.cpp
- Query llama.cpp /props endpoint for actual allocated context (not just
  training context from GGUF metadata)
- Strip .gguf suffix from display in banner and status bar
- _auto_detect_local_model() in runtime_provider.py for CLI init

Co-authored-by: sudo <sudoingx@users.noreply.github.com>

* fix: revert accidental summary_target_tokens change + add docs for context_length config

- Revert summary_target_tokens from 2500 back to 500 (accidental change
  during patching)
- Add 'Context Length Detection' section to Custom & Self-Hosted docs
  explaining model.context_length config override

---------

Co-authored-by: Test <test@test.com>
Co-authored-by: sudo <sudoingx@users.noreply.github.com>
This commit is contained in:
Teknium 2026-03-19 06:01:16 -07:00 committed by GitHub
parent 7b6d14e62a
commit d76fa7fc37
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
8 changed files with 208 additions and 5 deletions

View file

@ -218,6 +218,79 @@ class TestGetModelContextLength:
assert result == CONTEXT_PROBE_TIERS[0]
@patch("agent.model_metadata.fetch_model_metadata")
@patch("agent.model_metadata.fetch_endpoint_model_metadata")
def test_custom_endpoint_single_model_fallback(self, mock_endpoint_fetch, mock_fetch):
"""Single-model servers: use the only model even if name doesn't match."""
mock_fetch.return_value = {}
mock_endpoint_fetch.return_value = {
"Qwen3.5-9B-Q4_K_M.gguf": {"context_length": 131072}
}
result = get_model_context_length(
"qwen3.5:9b",
base_url="http://myserver.example.com:8080/v1",
api_key="test-key",
)
assert result == 131072
@patch("agent.model_metadata.fetch_model_metadata")
@patch("agent.model_metadata.fetch_endpoint_model_metadata")
def test_custom_endpoint_fuzzy_substring_match(self, mock_endpoint_fetch, mock_fetch):
"""Fuzzy match: configured model name is substring of endpoint model."""
mock_fetch.return_value = {}
mock_endpoint_fetch.return_value = {
"org/llama-3.3-70b-instruct-fp8": {"context_length": 131072},
"org/qwen-2.5-72b": {"context_length": 32768},
}
result = get_model_context_length(
"llama-3.3-70b-instruct",
base_url="http://myserver.example.com:8080/v1",
api_key="test-key",
)
assert result == 131072
@patch("agent.model_metadata.fetch_model_metadata")
def test_config_context_length_overrides_all(self, mock_fetch):
"""Explicit config_context_length takes priority over everything."""
mock_fetch.return_value = {
"test/model": {"context_length": 200000}
}
result = get_model_context_length(
"test/model",
config_context_length=65536,
)
assert result == 65536
@patch("agent.model_metadata.fetch_model_metadata")
def test_config_context_length_zero_is_ignored(self, mock_fetch):
"""config_context_length=0 should be treated as unset."""
mock_fetch.return_value = {}
result = get_model_context_length(
"anthropic/claude-sonnet-4",
config_context_length=0,
)
assert result == 200000
@patch("agent.model_metadata.fetch_model_metadata")
def test_config_context_length_none_is_ignored(self, mock_fetch):
"""config_context_length=None should be treated as unset."""
mock_fetch.return_value = {}
result = get_model_context_length(
"anthropic/claude-sonnet-4",
config_context_length=None,
)
assert result == 200000
# =========================================================================
# fetch_model_metadata — caching, TTL, slugs, failures