refactor(gateway): remove broken 1.4x hygiene multiplier entirely
The previous commit capped the 1.4x at 95% of context, but the multiplier
itself is unnecessary and confusing:

  85% threshold × 1.4 = 119% of context → never fires
  95% warn      × 1.4 = 133% of context → never warns

The 85% hygiene threshold already provides ample headroom over the agent's
own 50% compressor. Even if rough estimates overestimate by 50%, hygiene
would fire at ~57% actual usage — safe and harmless.

Remove the multiplier entirely. Both actual and estimated token paths now
use the same 85% / 95% thresholds. Update tests and comments.
parent b2b4a9ee7d
commit b799bca7a3
2 changed files with 55 additions and 76 deletions
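The arithmetic in the message is easy to verify. A minimal sketch, assuming a hypothetical 200K-token context window (the GLM-5 class of models named in the diff):

# Threshold arithmetic from the commit message, for an assumed
# 200K-token context window (illustrative, not from the repo).
context_length = 200_000

compress_threshold = int(context_length * 0.85)  # 170_000 (85% hygiene)
warn_threshold = int(context_length * 0.95)      # 190_000 (95% warn)

# Old behaviour: the 1.4x multiplier pushed both thresholds past the
# context window itself, so neither check could ever fire.
assert int(compress_threshold * 1.4) > context_length  # ~238K, 119% of context
assert int(warn_threshold * 1.4) > context_length      # ~266K, 133% of context

# Even a 50% overestimate in the rough count only makes hygiene fire
# early, at roughly 0.85 / 1.5 of real usage.
print(f"{0.85 / 1.5:.0%}")  # 57%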
@@ -1757,9 +1757,9 @@ class GatewayRunner:
         # Token source priority:
         #   1. Actual API-reported prompt_tokens from the last turn
         #      (stored in session_entry.last_prompt_tokens)
-        #   2. Rough char-based estimate (str(msg)//4) with a 1.4x
-        #      safety factor to account for overestimation on tool-heavy
-        #      conversations (code/JSON tokenizes at 5-7+ chars/token).
+        #   2. Rough char-based estimate (str(msg)//4). Overestimates
+        #      by 30-50% on code/JSON-heavy sessions, but that just
+        #      means hygiene fires a bit early — safe and harmless.
         # -----------------------------------------------------------------
         if history and len(history) >= 4:
             from agent.model_metadata import (
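For reference, estimate_messages_tokens_rough is called but not shown in this diff. A sketch of what the chars//4 heuristic described in the comments might look like; this is a hypothetical reconstruction, not the repository's actual helper:

from typing import Any, Iterable

def estimate_messages_tokens_rough(messages: Iterable[Any]) -> int:
    """Rough token count at ~4 chars per token (hypothetical sketch).

    Code/JSON-heavy content often tokenizes at 5-7+ chars per token,
    so chars//4 overestimates those sessions by 30-50%. Per the new
    comments, that only makes hygiene fire a bit early.
    """
    return sum(len(str(msg)) for msg in messages) // 4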
@@ -1845,29 +1845,20 @@ class GatewayRunner:
 
         # Prefer actual API-reported tokens from the last turn
         # (stored in session entry) over the rough char-based estimate.
         # The rough estimate (str(msg)//4) overestimates by 30-50% on
         # tool-heavy/code-heavy conversations, causing premature compression.
         _stored_tokens = session_entry.last_prompt_tokens
         if _stored_tokens > 0:
             _approx_tokens = _stored_tokens
             _token_source = "actual"
         else:
             _approx_tokens = estimate_messages_tokens_rough(history)
-            # Apply safety factor only for rough estimates.
-            # Cap the adjusted threshold at 95% of context length
-            # so it never exceeds what the model can actually handle
-            # (the 1.4x factor previously pushed the threshold above
-            # the model's context limit for ~200K models like GLM-5).
-            _max_safe_threshold = int(_hyg_context_length * 0.95)
-            _compress_token_threshold = min(
-                int(_compress_token_threshold * 1.4),
-                _max_safe_threshold,
-            )
-            _warn_token_threshold = min(
-                int(_warn_token_threshold * 1.4),
-                _hyg_context_length,
-            )
             _token_source = "estimated"
+            # Note: rough estimates overestimate by 30-50% for code/JSON-heavy
+            # sessions, but that just means hygiene fires a bit early — which
+            # is safe and harmless. The 85% threshold already provides ample
+            # headroom (agent's own compressor runs at 50%). A previous 1.4x
+            # multiplier tried to compensate by inflating the threshold, but
+            # 85% * 1.4 = 119% of context — which exceeds the model's limit
+            # and prevented hygiene from ever firing for ~200K models (GLM-5).
 
         _needs_compress = _approx_tokens >= _compress_token_threshold
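Taken together, the two hunks leave both token paths on the same unscaled thresholds. A self-contained sketch of the post-change flow, where the 85% / 95% setup and the stubbed session_entry and history are assumptions drawn from the commit message, not code visible in this diff:

from types import SimpleNamespace

def estimate_messages_tokens_rough(messages):
    # Re-using the chars//4 sketch from above.
    return sum(len(str(m)) for m in messages) // 4

_hyg_context_length = 200_000                                # example window
_compress_token_threshold = int(_hyg_context_length * 0.85)  # 170_000
_warn_token_threshold = int(_hyg_context_length * 0.95)      # 190_000

session_entry = SimpleNamespace(last_prompt_tokens=0)  # no API count yet
history = [{"role": "user", "content": "x" * 4_000}] * 8

_stored_tokens = session_entry.last_prompt_tokens
if _stored_tokens > 0:
    _approx_tokens = _stored_tokens  # actual API-reported count
    _token_source = "actual"
else:
    _approx_tokens = estimate_messages_tokens_rough(history)
    _token_source = "estimated"  # same thresholds as the "actual" path

_needs_compress = _approx_tokens >= _compress_token_threshold
print(_token_source, _approx_tokens, _needs_compress)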