Add context compression feature for long conversations
- Implemented automatic context compression to manage long conversations that approach the model's context limit.
- Configured the feature to summarize middle turns while protecting the first three and last four turns, ensuring important context is retained.
- Added configuration options in `cli-config.yaml` and environment variables for enabling/disabling compression and setting thresholds.
- Updated documentation in `README.md`, `cli.md`, and `.env.example` to explain the context compression functionality and its configuration.
- Enhanced `cli.py` to load compression settings into environment variables, ensuring seamless integration with the CLI.
- Completed the context compression item from the TODO list, a significant enhancement to conversation management.
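The `cli.py` change above (surfacing `cli-config.yaml` settings as environment variables) might look roughly like this minimal sketch. The `CONTEXT_COMPRESSION_*` variable names come from the TODO.md entry in this commit; the YAML section layout and the helper's name are assumptions for illustration:

```python
import os


def load_compression_settings(config: dict) -> None:
    """Mirror compression settings from the parsed cli-config.yaml into
    environment variables so the rest of the CLI reads one source.
    Sketch only: the YAML key names here are assumptions."""
    section = config.get("context_compression", {})
    mapping = {
        "enabled": "CONTEXT_COMPRESSION_ENABLED",
        "threshold": "CONTEXT_COMPRESSION_THRESHOLD",
        "model": "CONTEXT_COMPRESSION_MODEL",
    }
    for key, env_var in mapping.items():
        # Existing environment variables win over the YAML file.
        if key in section and env_var not in os.environ:
            os.environ[env_var] = str(section[key])


load_compression_settings({"context_compression": {"enabled": True, "threshold": 0.85}})
```

Keeping environment variables authoritative over the YAML file lets a one-off `CONTEXT_COMPRESSION_ENABLED=false` override the config without editing it.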
parent: bbeed5b5d1
commit: 9b4d9452ba
7 changed files with 614 additions and 12 deletions
TODO.md (19 changes)

```diff
@@ -47,7 +47,24 @@ These items need to be addressed ASAP:
 - Structured JSON format for easy parsing and replay
 - Automatic on CLI runs (configurable)
 
-### 4. Stream Thinking Summaries in Real-Time 💭 ⏸️ DEFERRED
+### 4. Automatic Context Compression 🗜️ ✅ COMPLETE
+- [x] **Problem:** Long conversations exceed model context limits, causing errors
+- [x] **Solution:** Auto-compress middle turns when approaching limit
+- [x] **Implementation:**
+  - Fetches model context lengths from OpenRouter `/api/v1/models` API (cached 1hr)
+  - Tracks actual token usage from API responses (`usage.prompt_tokens`)
+  - Triggers at 85% of model's context limit (configurable)
+  - Protects first 3 turns (system, initial request, first response)
+  - Protects last 4 turns (recent context most relevant)
+  - Summarizes middle turns using fast model (Gemini Flash)
+  - Inserts summary as user message, conversation continues seamlessly
+  - If context error occurs, attempts compression before failing
+- [x] **Configuration (cli-config.yaml / env vars):**
+  - `CONTEXT_COMPRESSION_ENABLED` (default: true)
+  - `CONTEXT_COMPRESSION_THRESHOLD` (default: 0.85 = 85%)
+  - `CONTEXT_COMPRESSION_MODEL` (default: google/gemini-2.0-flash-001)
+
+### 5. Stream Thinking Summaries in Real-Time 💭 ⏸️ DEFERRED
 - [ ] **Problem:** Thinking/reasoning summaries not shown while streaming
 - [ ] **Complexity:** This is a significant refactor - leaving for later
 
```
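The trigger-and-protect scheme described in the TODO entry (compress at 85% of the context limit, keep the first 3 and last 4 turns, summarize the middle) can be sketched as follows. This is an illustrative sketch, not the actual `cli.py` code; the function name and signature are assumptions:

```python
def select_turns_to_compress(messages, prompt_tokens, context_limit,
                             threshold=0.85, keep_first=3, keep_last=4):
    """Return the middle turns that should be summarized, or an empty
    list if compression is not (yet) needed.

    Sketch of the scheme from TODO.md: trigger once actual usage
    (`usage.prompt_tokens` from the API response) reaches `threshold`
    of the model's context limit, while protecting the first
    `keep_first` and last `keep_last` turns."""
    if prompt_tokens < threshold * context_limit:
        return []  # still under the trigger point, nothing to do
    if len(messages) <= keep_first + keep_last:
        return []  # too short to compress without touching protected turns
    return messages[keep_first:len(messages) - keep_last]


msgs = [f"turn-{i}" for i in range(10)]
# 90k tokens against a 100k limit exceeds the 85% threshold,
# so turns 3..5 (everything between the protected edges) are selected.
middle = select_turns_to_compress(msgs, 90_000, 100_000)
```

The selected turns would then be sent to the fast summarizer model and replaced by a single user-role summary message, letting the conversation continue under the limit.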