Custom endpoints (LM Studio, Ollama, vLLM, llama.cpp) silently fall
back to a 2M-token default context length when /v1/models doesn't
report context_length.
Adds _query_local_context_length(), which queries server-specific APIs
(sketched below the list):
- LM Studio: /api/v1/models (max_context_length + loaded instances)
- Ollama: /api/show (model_info + num_ctx parameters)
- llama.cpp: /props (n_ctx from default_generation_settings)
- vLLM: /v1/models/{model} (max_model_len)
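A minimal sketch of the dispatch, assuming a requests-based client, a
pre-detected server_type string, and a short probe timeout; response
shapes are inferred from the endpoints listed above rather than taken
from the implementation:

```python
import requests

def _query_local_context_length(base_url: str, model: str, server_type: str) -> int | None:
    """Return the server-reported context window, or None if unavailable."""
    try:
        if server_type == "lmstudio":
            # LM Studio lists models with max_context_length; loaded-instance
            # handling is shown in the next sketch.
            data = requests.get(f"{base_url}/api/v1/models", timeout=5).json()
            for entry in data.get("data", []):
                if entry.get("id") == model and entry.get("max_context_length"):
                    return int(entry["max_context_length"])

        elif server_type == "ollama":
            # Ollama's /api/show returns a "parameters" string (num_ctx, if
            # configured) and "model_info" with an architecture-prefixed key
            # such as "llama.context_length".
            data = requests.post(f"{base_url}/api/show",
                                 json={"model": model}, timeout=5).json()
            for line in data.get("parameters", "").splitlines():
                parts = line.split()
                if len(parts) == 2 and parts[0] == "num_ctx":
                    return int(parts[1])
            for key, value in data.get("model_info", {}).items():
                if key.endswith(".context_length"):
                    return int(value)

        elif server_type == "llamacpp":
            # llama.cpp's /props exposes the active n_ctx.
            data = requests.get(f"{base_url}/props", timeout=5).json()
            return int(data["default_generation_settings"]["n_ctx"])

        elif server_type == "vllm":
            # vLLM reports max_model_len on the per-model endpoint.
            data = requests.get(f"{base_url}/v1/models/{model}", timeout=5).json()
            return int(data["max_model_len"])
    except (requests.RequestException, KeyError, TypeError, ValueError):
        pass  # any probe failure falls through to the generic default
    return None
```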
Prefers loaded instance context over max (e.g., 122K loaded vs 1M max).
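The loaded-instance preference, as a standalone parser for one LM Studio
model entry; the field names loaded_instances and context_length are
assumptions about the response shape:

```python
def _lmstudio_context_length(entry: dict) -> int | None:
    # Prefer the context the running instance was actually loaded with
    # (e.g. 122K): budgeting against the advertised max (e.g. 1M) would
    # overshoot what the server will accept.
    for instance in entry.get("loaded_instances", []):
        if instance.get("context_length"):
            return int(instance["context_length"])
    if entry.get("max_context_length"):
        return int(entry["max_context_length"])
    return None
```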
Results are cached via save_context_length() to avoid repeated queries.
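A sketch of the cache-through path, reusing _query_local_context_length()
from the sketch above; save_context_length() is named in the change, but
the in-memory store here is a stand-in for whatever persistence it
actually uses:

```python
_CONTEXT_CACHE: dict[tuple[str, str], int] = {}

def save_context_length(base_url: str, model: str, length: int) -> None:
    _CONTEXT_CACHE[(base_url, model)] = length

def get_local_context_length(base_url: str, model: str, server_type: str) -> int | None:
    cached = _CONTEXT_CACHE.get((base_url, model))
    if cached is not None:
        return cached  # hit: no network round-trip
    length = _query_local_context_length(base_url, model, server_type)
    if length is not None:
        save_context_length(base_url, model, length)
    return length
```

With this wrapper, each (endpoint, model) pair costs at most one probe.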
Also fixes detect_local_server_type() misidentifying LM Studio as
Ollama (LM Studio returns 200 for /api/tags with an error body).
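A sketch of the corrected probe, assuming detection starts with Ollama's
/api/tags: a 200 status alone doesn't prove Ollama, so the body is
inspected too.

```python
import requests

def detect_local_server_type(base_url: str) -> str | None:
    try:
        resp = requests.get(f"{base_url}/api/tags", timeout=5)
        if resp.status_code == 200:
            body = resp.json()
            if "models" in body:   # genuine Ollama tag listing
                return "ollama"
            if "error" in body:    # LM Studio: HTTP 200 but an error payload
                return "lmstudio"
    except (requests.RequestException, ValueError):
        pass
    return None  # fall through to probes for llama.cpp / vLLM (not shown)
```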