Add context compression feature for long conversations

- Implemented automatic context compression for long conversations that approach the model's context limit.
- Summarizes middle turns while protecting the first three and last four turns, so the most important context is retained.
- Added configuration options in `cli-config.yaml`, with matching environment variables, for enabling/disabling compression and setting thresholds.
- Updated `README.md`, `cli.md`, and `.env.example` to document the feature and its configuration.
- Updated `cli.py` to load compression settings into environment variables for seamless CLI integration.
- Marked the context compression item complete in `TODO.md`.
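The protect-first-three / protect-last-four turn selection described above can be sketched as follows. This is a minimal illustration, not the actual `cli.py` code; `compressible_slice` and the constants are hypothetical names.

```python
# Hypothetical sketch of the turn-selection logic; not the actual cli.py API.
PROTECT_FIRST = 3  # system prompt, initial request, first response
PROTECT_LAST = 4   # most recent turns carry the most relevant context


def compressible_slice(messages):
    """Return (start, end) indices of the middle turns eligible for
    summarization, or None if the conversation is too short to compress."""
    start = PROTECT_FIRST
    end = len(messages) - PROTECT_LAST
    if end - start < 2:  # nothing meaningful to summarize
        return None
    return start, end
```

Under this sketch, a ten-turn conversation would summarize only turns at indices 3 through 5, leaving the protected head and tail untouched.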
Author: teknium1
Date: 2026-02-01 18:01:31 -08:00
Parent: bbeed5b5d1
Commit: 9b4d9452ba
7 changed files with 614 additions and 12 deletions

TODO.md (19 changed lines)

@@ -47,7 +47,24 @@ These items need to be addressed ASAP:
 - Structured JSON format for easy parsing and replay
 - Automatic on CLI runs (configurable)
-### 4. Stream Thinking Summaries in Real-Time 💭 ⏸️ DEFERRED
+### 4. Automatic Context Compression 🗜️ ✅ COMPLETE
+- [x] **Problem:** Long conversations exceed model context limits, causing errors
+- [x] **Solution:** Auto-compress middle turns when approaching limit
+- [x] **Implementation:**
+  - Fetches model context lengths from OpenRouter `/api/v1/models` API (cached 1hr)
+  - Tracks actual token usage from API responses (`usage.prompt_tokens`)
+  - Triggers at 85% of model's context limit (configurable)
+  - Protects first 3 turns (system, initial request, first response)
+  - Protects last 4 turns (most relevant recent context)
+  - Summarizes middle turns using fast model (Gemini Flash)
+  - Inserts summary as user message, conversation continues seamlessly
+  - If context error occurs, attempts compression before failing
+- [x] **Configuration (cli-config.yaml / env vars):**
+  - `CONTEXT_COMPRESSION_ENABLED` (default: true)
+  - `CONTEXT_COMPRESSION_THRESHOLD` (default: 0.85 = 85%)
+  - `CONTEXT_COMPRESSION_MODEL` (default: google/gemini-2.0-flash-001)
+### 5. Stream Thinking Summaries in Real-Time 💭 ⏸️ DEFERRED
 - [ ] **Problem:** Thinking/reasoning summaries not shown while streaming
 - [ ] **Complexity:** This is a significant refactor - leaving for later
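The 85% trigger condition described in the TODO entry can be sketched as below. This is an illustration under the stated assumptions (token counts come from the API response's `usage.prompt_tokens`); `should_compress` is a hypothetical name, not the project's actual function.

```python
# Hedged sketch of the compression trigger; assumes prompt_tokens is the
# value reported in the API response's usage.prompt_tokens field.
DEFAULT_THRESHOLD = 0.85  # mirrors CONTEXT_COMPRESSION_THRESHOLD's default


def should_compress(prompt_tokens, context_limit, threshold=DEFAULT_THRESHOLD):
    """True once actual prompt token usage reaches the configured
    fraction of the model's context window."""
    return prompt_tokens >= context_limit * threshold
```

For example, with a 128,000-token context window and the default 0.85 threshold, compression would trigger once a request's prompt reaches 108,800 tokens.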