Merge: OBLITERATUS skill v2.0 + unified gateway compression

OBLITERATUS skill (PR #408 updated):
- 9 CLI methods, 28 analysis modules, 116 model presets
- Default method: advanced (multi-direction SVD, norm-preserving)
- Live-tested: Qwen2.5-3B 75%→0% refusal, Qwen2.5-0.5B 60%→20%
- References, templates, and real-world pitfalls included

Gateway compression fix (PR #739):
- Unified session hygiene with agent compression config
- Uses model context length × compression.threshold from config.yaml
- Removed hardcoded 100k/200-msg thresholds
teknium1 2026-03-09 02:59:41 -07:00
commit c21d77ca08
3 changed files with 423 additions and 402 deletions


@@ -1,19 +1,19 @@
---
name: obliteratus
description: Remove refusal behaviors from open-weight LLMs using OBLITERATUS — mechanistic interpretability techniques (diff-in-means, SVD, whitened SVD, LEACE, SAE decomposition, etc.) to excise guardrails while preserving reasoning. 9 CLI methods, 28 analysis modules, 116 model presets across 5 compute tiers, tournament evaluation, and telemetry-driven recommendations. Use when a user wants to uncensor, abliterate, or remove refusal from an LLM.
version: 2.0.0
author: Hermes Agent
license: MIT
dependencies: [obliteratus, torch, transformers, bitsandbytes, accelerate, safetensors]
metadata:
  hermes:
    tags: [Abliteration, Uncensoring, Refusal-Removal, LLM, Weight-Projection, SVD, Mechanistic-Interpretability, HuggingFace, Model-Surgery]
    related_skills: [vllm, gguf, huggingface-tokenizers]
---

# OBLITERATUS Skill

Remove refusal behaviors (guardrails) from open-weight LLMs without retraining or fine-tuning. Uses mechanistic interpretability techniques — including diff-in-means, SVD, whitened SVD, LEACE concept erasure, SAE decomposition, Bayesian kernel projection, and more — to identify and surgically excise refusal directions from model weights while preserving reasoning capabilities.

**License warning:** OBLITERATUS is AGPL-3.0. NEVER import it as a Python library. Always invoke via CLI (`obliteratus` command) or subprocess. This keeps Hermes Agent's MIT license clean.
@@ -25,7 +25,7 @@ Trigger when the user:
- Wants to create an uncensored version of Llama, Qwen, Mistral, etc.
- Mentions "refusal removal", "abliteration", "weight projection"
- Wants to analyze how a model's refusal mechanism works
- References OBLITERATUS, abliterator, or refusal directions

## Step 1: Installation
@@ -35,10 +35,12 @@ obliteratus --version 2>/dev/null && echo "INSTALLED" || echo "NOT INSTALLED"
```

If not installed, clone and install from GitHub:

```bash
git clone https://github.com/elder-plinius/OBLITERATUS.git
cd OBLITERATUS
pip install -e .
# For Gradio web UI support:
# pip install -e ".[spaces]"
```

**IMPORTANT:** Confirm with user before installing. This pulls in ~5-10GB of dependencies (PyTorch, Transformers, bitsandbytes, etc.).
@@ -51,7 +53,7 @@ python3 -c "
import torch
if torch.cuda.is_available():
    gpu = torch.cuda.get_device_name(0)
    vram = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f'GPU: {gpu}')
    print(f'VRAM: {vram:.1f} GB')
    if vram < 4: print('TIER: tiny (models under 1B)')
@@ -75,25 +77,28 @@ else:
| 48 GB+    | ~72B+ params | Qwen2.5-72B, DeepSeek-R1 |
| Multi-GPU | 200B+ params | Llama 3.1 405B, DeepSeek-V3 (685B MoE) |
## Step 3: Browse Available Models & Get Recommendations

```bash
# Browse models by compute tier
obliteratus models --tier medium

# Get architecture info for a specific model
obliteratus info <model_name>

# Get telemetry-driven recommendation for best method & params
obliteratus recommend <model_name>
obliteratus recommend <model_name> --insights  # global cross-architecture rankings
```
## Step 4: Choose a Method

### Method Selection Guide

**Default / recommended for most cases: `advanced`.** It uses multi-direction SVD with norm-preserving projection and is well-tested.

| Situation | Recommended Method | Why |
|:----------------------------------|:-------------------|:-----------------------------------------|
| Default / most models | `advanced` | Multi-direction SVD, norm-preserving, reliable |
| Quick test / prototyping | `basic` | Fast, simple, good enough to evaluate |
| Dense model (Llama, Mistral) | `advanced` | Multi-direction, norm-preserving |
| MoE model (DeepSeek, Mixtral) | `nuclear` | Expert-granular, handles MoE complexity |
@@ -101,214 +106,225 @@ obliteratus info meta-llama/Llama-3.1-8B-Instruct
| Stubborn refusals persist | `aggressive` | Whitened SVD + head surgery + jailbreak |
| Want reversible changes | Steering vectors | Inference-time hooks (see Analysis section) |
| Maximum quality, time no object | `optimized` | Bayesian search for best parameters |
| Experimental auto-detection | `informed` | Auto-detects alignment type — experimental, may not always outperform `advanced` |
### 9 CLI Methods
- **basic** — Single refusal direction via diff-in-means. Fast (~5-10 min for 8B).
- **advanced** (DEFAULT, RECOMMENDED) — Multiple SVD directions, norm-preserving projection, 2 refinement passes. Medium speed (~10-20 min).
- **aggressive** — Whitened SVD + jailbreak-contrastive + attention head surgery. Higher risk of coherence damage.
- **spectral_cascade** — DCT frequency-domain decomposition. Research/novel approach.
- **informed** — Runs analysis DURING abliteration to auto-configure. Experimental — slower and less predictable than advanced.
- **surgical** — SAE features + neuron masking + head surgery + per-expert. Very slow (~1-2 hrs). Best for reasoning models.
- **optimized** — Bayesian hyperparameter search (Optuna TPE). Longest runtime but finds optimal parameters.
- **inverted** — Flips the refusal direction. Model becomes actively willing.
- **nuclear** — Maximum force combo for stubborn MoE models. Expert-granular.
### Direction Extraction Methods (`--direction-method` flag)
- **diff_means** (default) — Simple difference-in-means between refused/complied activations. Robust.
- **svd** — Multi-direction SVD extraction. Better for complex alignment.
- **leace** — LEACE (LEAst-squares Concept Erasure). Closed-form optimal linear erasure.
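For intuition, the core idea behind `diff_means` can be sketched in a few lines of NumPy. This is an illustrative toy with synthetic activations, not OBLITERATUS's actual implementation:

```python
import numpy as np

def refusal_direction(h_refused, h_complied):
    """Difference-in-means direction between two activation batches,
    each of shape (n_prompts, d_model), collected at one layer."""
    d = h_refused.mean(axis=0) - h_complied.mean(axis=0)
    return d / np.linalg.norm(d)  # unit vector

def project_out(W, v):
    """Orthogonal projection removing v from W: W' = (I - v v^T) W."""
    return W - np.outer(v, v) @ W

# Toy demo: synthetic activations shifted along a known direction
rng = np.random.default_rng(0)
v_true = rng.normal(size=64)
v_true /= np.linalg.norm(v_true)
h_comp = rng.normal(size=(32, 64))
h_ref = h_comp + 3.0 * v_true  # "refused" activations shifted along v_true
v_hat = refusal_direction(h_ref, h_comp)
W = rng.normal(size=(64, 64))
W_abl = project_out(W, v_hat)
print(np.linalg.norm(v_hat @ W_abl))  # ~0: the direction no longer passes through W
```

The projection is exact for a single linear direction; real refusal geometry is often a cone of several directions, which is what the multi-direction methods above target.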
### 4 Python-API-Only Methods

(NOT available via CLI — they require a Python import, which violates the AGPL boundary. Mention to the user only if they explicitly want to use OBLITERATUS as a library in their own AGPL project.)

- failspy, gabliteration, heretic, rdo
## Step 5: Run Abliteration

### Standard usage

```bash
# Default method (advanced) — recommended for most models
obliteratus obliterate <model_name> --method advanced --output-dir ./abliterated-models

# With 4-bit quantization (saves VRAM)
obliteratus obliterate <model_name> --method advanced --quantization 4bit --output-dir ./abliterated-models

# Large models (70B+) — conservative defaults
obliteratus obliterate <model_name> --method advanced --quantization 4bit --large-model --output-dir ./abliterated-models
```
### Fine-tuning parameters

```bash
obliteratus obliterate <model_name> \
  --method advanced \
  --direction-method diff_means \
  --n-directions 4 \
  --refinement-passes 2 \
  --regularization 0.1 \
  --quantization 4bit \
  --output-dir ./abliterated-models \
  --contribute  # opt-in telemetry for community research
```
### Key flags

| Flag | Description | Default |
|:-----|:------------|:--------|
| `--method` | Abliteration method | advanced |
| `--direction-method` | Direction extraction | diff_means |
| `--n-directions` | Number of refusal directions (1-32) | method-dependent |
| `--refinement-passes` | Iterative passes (1-5) | 2 |
| `--regularization` | Regularization strength (0.0-1.0) | 0.1 |
| `--quantization` | Load in 4bit or 8bit | none (full precision) |
| `--large-model` | Conservative defaults for 120B+ | false |
| `--output-dir` | Where to save the abliterated model | ./obliterated_model |
| `--contribute` | Share anonymized results for research | false |
| `--verify-sample-size` | Number of test prompts for refusal check | 20 |
| `--dtype` | Model dtype (float16, bfloat16) | auto |
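As a rough mental model of how `--n-directions` and `--regularization` interact, consider a partial multi-direction projection. The exact semantics are OBLITERATUS internals; this is an assumed reading in which regularization keeps back a fraction of the removed component:

```python
import numpy as np

def soft_multi_projection(W, V, reg=0.1):
    """Partially remove the span of k directions from W.

    V: (k, d) orthonormal refusal directions (rows).
    reg: fraction of the removed component to keep (0.0 = full removal),
         a hypothetical reading of --regularization.
    """
    P = V.T @ V                     # (d, d) projector onto the refusal subspace
    return W - (1.0 - reg) * P @ W

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(64, 4)))  # 4 orthonormal directions (--n-directions 4)
V = Q.T
W = rng.normal(size=(64, 64))
W_abl = soft_multi_projection(W, V, reg=0.1)
print(np.allclose(V @ W_abl, 0.1 * (V @ W)))  # True: residual component is reg * original
```

This is why raising `--regularization` is the first lever when coherence suffers: more of the original weight survives along the removed subspace.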
### Other execution modes

```bash
# Interactive guided mode (hardware → model → preset)
obliteratus interactive

# Web UI (Gradio)
obliteratus ui --port 7860

# Run a full ablation study from YAML config
obliteratus run config.yaml --preset quick

# Tournament: pit all methods against each other
obliteratus tourney <model_name>
```
## Step 6: Verify Results

After abliteration, check the output metrics:

| Metric | Good Value | Warning |
|:-------|:-----------|:--------|
| Refusal rate | < 5% (ideally ~0%) | > 10% means refusals persist |
| Perplexity change | < 10% increase | > 15% means coherence damage |
| KL divergence | < 0.1 | > 0.5 means significant distribution shift |
| Coherence | High / passes qualitative check | Degraded responses, repetition |
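The tool reports these metrics itself, but a hand-rolled sanity check is straightforward. The refusal markers and toy numbers below are illustrative assumptions, not OBLITERATUS's exact definitions:

```python
import numpy as np

REFUSAL_MARKERS = ("i cannot", "i can't", "i won't", "i'm sorry", "as an ai")

def refusal_rate(outputs):
    """Fraction of generations that look like refusals (naive string match)."""
    return sum(any(m in o.lower() for m in REFUSAL_MARKERS) for o in outputs) / len(outputs)

def perplexity(token_logprobs):
    """Perplexity from per-token log-probs of held-out text under one model."""
    return float(np.exp(-np.mean(token_logprobs)))

def kl_divergence(logits_p, logits_q):
    """KL(P || Q) between next-token distributions given raw logits."""
    p = np.exp(logits_p - logits_p.max()); p /= p.sum()
    q = np.exp(logits_q - logits_q.max()); q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

outs = ["Sure, here is how...", "I cannot help with that.", "Here you go:"]
print(refusal_rate(outs))                 # 1/3 of the sampled outputs refuse
ppl_orig = perplexity(np.log([0.2, 0.5, 0.1]))
ppl_abl = perplexity(np.log([0.18, 0.45, 0.1]))
print((ppl_abl / ppl_orig - 1) * 100)     # percent perplexity change
rng = np.random.default_rng(0)
logits = rng.normal(size=50)
print(kl_divergence(logits, logits + 0.01 * rng.normal(size=50)))  # tiny shift, near 0
```

String matching understates jailbroken-but-evasive answers, so treat the qualitative coherence check as mandatory, not optional.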
### If refusals persist (> 10%)

1. Try the `aggressive` method
2. Increase `--n-directions` (e.g., 8 or 16)
3. Add `--refinement-passes 3`
4. Try `--direction-method svd` instead of diff_means

### If coherence is damaged (perplexity > 15% increase)

1. Reduce `--n-directions` (try 2)
2. Increase `--regularization` (try 0.3)
3. Reduce `--refinement-passes` to 1
4. Try the `basic` method (gentler)
## Step 7: Use the Abliterated Model

The output is a standard HuggingFace model directory.

```bash
# Test locally with transformers
python3 -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained('./abliterated-models/<model>')
tokenizer = AutoTokenizer.from_pretrained('./abliterated-models/<model>')
inputs = tokenizer('How do I pick a lock?', return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
"

# Upload to HuggingFace Hub
huggingface-cli upload <username>/<model-name>-abliterated ./abliterated-models/<model>

# Serve with vLLM
vllm serve ./abliterated-models/<model>
```
## CLI Command Reference
| Command | Description |
|:--------|:------------|
| `obliteratus obliterate` | Main abliteration command |
| `obliteratus info <model>` | Print model architecture details |
| `obliteratus models --tier <tier>` | Browse curated models by compute tier |
| `obliteratus recommend <model>` | Telemetry-driven method/param suggestion |
| `obliteratus interactive` | Guided setup wizard |
| `obliteratus tourney <model>` | Tournament: all methods head-to-head |
| `obliteratus run <config.yaml>` | Execute ablation study from YAML |
| `obliteratus strategies` | List all registered ablation strategies |
| `obliteratus report <results.json>` | Regenerate visual reports |
| `obliteratus ui` | Launch Gradio web interface |
| `obliteratus aggregate` | Summarize community telemetry data |
## Analysis Modules
OBLITERATUS includes 28 analysis modules for mechanistic interpretability.
See `skill_view(name="obliteratus", file_path="references/analysis-modules.md")` for the full reference.
### Quick analysis commands
```bash ```bash
# Run specific analysis modules
obliteratus run analysis-config.yaml --preset quick
# Key modules to run first:
# - alignment_imprint: Fingerprint DPO/RLHF/CAI/SFT alignment method
# - concept_geometry: Single direction vs polyhedral cone
# - logit_lens: Which layer decides to refuse
# - anti_ouroboros: Self-repair risk score
# - causal_tracing: Causally necessary components
```

### Steering Vectors (Reversible Alternative)

Instead of permanent weight modification, use inference-time steering:

```python
# Python API only — for user's own projects
from obliteratus.analysis.steering_vectors import SteeringVectorFactory, SteeringHookManager
```
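Conceptually, a steering hook subtracts a direction from activations at inference time and can be removed at will. A generic PyTorch sketch of the idea (not the OBLITERATUS API; `SteeringHook` here is a hypothetical name):

```python
import torch
import torch.nn as nn

class SteeringHook:
    """Subtract the component along direction v from a module's output.

    Inference-time only: removing the hook restores original behavior;
    the module's weights are never modified.
    """
    def __init__(self, module, v, alpha=1.0):
        self.v = v / v.norm()
        self.alpha = alpha
        self.handle = module.register_forward_hook(self._hook)

    def _hook(self, module, inputs, output):
        coeff = output @ self.v  # (batch,) projection coefficients
        return output - self.alpha * coeff.unsqueeze(-1) * self.v

    def remove(self):
        self.handle.remove()

# Toy demo on a single linear layer
torch.manual_seed(0)
layer = nn.Linear(16, 16)
v = torch.randn(16)
x = torch.randn(2, 16)
hook = SteeringHook(layer, v, alpha=1.0)
y_steered = layer(x)  # component along v removed
hook.remove()
y_plain = layer(x)    # hook gone, weights untouched
```

This is why steering is useful for A/B testing: the same checkpoint serves both behaviors, and `alpha` can be tuned continuously.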
## Ablation Strategies

Beyond direction-based abliteration, OBLITERATUS includes structural ablation strategies:
- **Embedding Ablation** — Target embedding layer components
- **FFN Ablation** — Feed-forward network block removal
- **Head Pruning** — Attention head pruning
- **Layer Removal** — Full layer removal
List all available: `obliteratus strategies`
## Evaluation

OBLITERATUS includes built-in evaluation tools:
- Refusal rate benchmarking
- Perplexity comparison (before/after)
- LM Eval Harness integration for academic benchmarks
- Head-to-head competitor comparison
- Baseline performance tracking
## Platform Support

- **CUDA** — Full support (NVIDIA GPUs)
- **Apple Silicon (MLX)** — Supported via MLX backend
- **CPU** — Supported for tiny models (< 1B params)
## YAML Config Templates

Load templates for reproducible runs via `skill_view`:
- `templates/abliteration-config.yaml` — Standard single-model config
- `templates/analysis-study.yaml` — Pre-abliteration analysis study
- `templates/batch-abliteration.yaml` — Multi-model batch processing
## Telemetry

OBLITERATUS can optionally contribute anonymized run data to a global research dataset.
Enable with the `--contribute` flag. No personal data is collected — only model name, method, and metrics.
## Common Pitfalls

1. **Don't use `informed` as default** — it's experimental and slower. Use `advanced` for reliable results.
2. **Models under ~1B respond poorly to abliteration** — their refusal behaviors are shallow and fragmented, making clean direction extraction difficult. Expect partial results (20-40% remaining refusal). Models 3B+ have cleaner refusal directions and respond much better (often 0% refusal with `advanced`).
3. **`aggressive` can make things worse** — on small models it can damage coherence and actually increase the refusal rate. Only use it if `advanced` leaves > 10% refusals on a 3B+ model.
4. **Always check perplexity** — if it spikes > 15%, the model is damaged. Reduce aggressiveness.
5. **MoE models need special handling** — use the `nuclear` method for Mixtral, DeepSeek-MoE, etc.
6. **Quantized models can't be re-quantized** — abliterate the full-precision model, then quantize the output.
7. **VRAM estimation is approximate** — 4-bit quant helps, but peak usage can spike during extraction.
8. **Reasoning models are sensitive** — use `surgical` for R1 distills to preserve chain-of-thought.
9. **Check `obliteratus recommend`** — telemetry data may suggest better parameters than the defaults.
10. **AGPL license** — never `import obliteratus` in MIT/Apache projects. CLI invocation only.
11. **Large models (70B+)** — always use `--large-model` flag for conservative defaults.
12. **Spectral certification RED is common** — the spectral check often flags "incomplete" even when practical refusal rate is 0%. Check actual refusal rate rather than relying on spectral certification alone.
## Complementary Skills

- **vllm** — Serve abliterated models with high throughput
- **gguf** — Convert abliterated models to GGUF for llama.cpp
- **huggingface-tokenizers** — Work with model tokenizers
## Resources
- [OBLITERATUS GitHub](https://github.com/elder-plinius/OBLITERATUS) (AGPL-3.0)
- [HuggingFace Spaces Demo](https://huggingface.co/spaces/pliny-the-prompter/obliteratus)
- [Arditi et al. 2024 — Refusal in LMs Is Mediated by a Single Direction](https://arxiv.org/abs/2406.11717)
- [Refusal Direction Optimization — ICML 2025](https://arxiv.org/abs/2411.14793)


@@ -1,170 +1,166 @@
# OBLITERATUS Analysis Modules — Reference

OBLITERATUS includes 28 analysis modules for mechanistic interpretability of refusal in LLMs.
These modules help understand how and where refusal behaviors are encoded before performing abliteration.

---
## Core Analysis (Run These First)

### 1. Alignment Imprint Detection (`alignment_imprint.py`)

Fingerprints whether a model was trained via DPO, RLHF, CAI, or SFT.
This determines which extraction strategy will work best.
### 2. Concept Cone Geometry (`concept_geometry.py`)

Determines if refusal is a single linear direction or a polyhedral cone
(set of multiple mechanisms). Single-direction models respond well to `basic`;
polyhedral models need `advanced` or `surgical`.
### 3. Refusal Logit Lens (`logit_lens.py`)

Identifies the specific layer where a model "decides" to refuse by decoding
intermediate layer representations into token space.
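A minimal logit-lens sketch, assuming access to the per-layer residual stream and the unembedding matrix (synthetic data below; the module's real token sets and plots differ):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def refusal_logit_lens(resid_stream, W_unembed, refusal_token_ids):
    """Per-layer probability mass on refusal tokens (e.g. 'I', 'cannot').

    resid_stream: (n_layers, d_model) residual stream at the final position.
    W_unembed:    (d_model, vocab_size) unembedding matrix.
    A sharp rise at layer k suggests refusal is decided around layer k.
    """
    return [float(softmax(h @ W_unembed)[refusal_token_ids].sum())
            for h in resid_stream]

# Synthetic demo: refusal signal appears from layer 8 onward
rng = np.random.default_rng(0)
W_U = rng.normal(size=(32, 100))
resid = rng.normal(size=(12, 32)) * 0.1
refusal_ids = np.array([7, 42])
resid[8:] += 5.0 * W_U[:, 7] / np.linalg.norm(W_U[:, 7])  # push late layers toward token 7
scores = refusal_logit_lens(resid, W_U, refusal_ids)
```

Plotting `scores` against layer index gives the layer-by-layer refusal-probability curve this module produces.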
### 4. Ouroboros Detection (`anti_ouroboros.py`)

Identifies whether a model attempts to "self-repair" refusal behaviors after
excision. Reports a risk score (0-1); high scores mean additional refinement
passes are needed.
### 5. Causal Tracing (`causal_tracing.py`)

Identifies which components (layers, heads, MLPs) are causally necessary
for refusal behavior using activation patching.

---
## Geometric Analysis

### 6. Cross-Layer Alignment (`cross_layer.py`)

Measures how refusal directions align across different layers. High alignment
means the refusal signal is consistent; low alignment suggests layer-specific
mechanisms.
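The alignment matrix can be sketched as pairwise cosine similarities between per-layer directions (illustrative only, with synthetic directions sharing one common component):

```python
import numpy as np

def alignment_matrix(directions):
    """Pairwise |cosine similarity| between per-layer refusal directions.

    directions: (n_layers, d_model), one extracted direction per layer.
    Values near 1 mean the same direction recurs across layers.
    """
    D = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    return np.abs(D @ D.T)

rng = np.random.default_rng(0)
shared = rng.normal(size=64)
dirs = shared + 0.3 * rng.normal(size=(12, 64))  # 12 layers, mostly shared direction
A = alignment_matrix(dirs)
print(A.shape)  # (12, 12); off-diagonal entries stay high for a shared direction
```

Clustering the rows of `A` is one simple way to find layer groups that need separate directions.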
### 7. Residual Stream Decomposition (`residual_stream.py`)

Decomposes the residual stream into attention and MLP contributions to
understand which component type contributes more to refusal.
### 8. Riemannian Manifold Geometry (`riemannian_manifold.py`)

Analyzes the curvature and geometry of the weight manifold near refusal
directions. Informs how aggressively projections can be applied without
damaging the manifold structure.
### 9. Whitened SVD (`whitened_svd.py`)

Covariance-normalized SVD extraction that separates guardrail signals from
natural activation variance. More precise than standard SVD for models with
high activation variance.
**Output:** Cleaner refusal directions with less noise ### 10. Concept Cone Geometry (extended)
**Why it matters:** Produces more precise directions, especially for noisy activations Maps the full polyhedral structure of refusal, including cone angles,
face counts, and intersection patterns.
---

## Probing & Classification

### 11. Activation Probing (`activation_probing.py`)
Post-excision verification — probes for residual refusal concepts after
abliteration, reporting residual signal strength per layer, to verify
removal is complete.
### 12. Probing Classifiers (`probing_classifiers.py`)
Trains linear classifiers to detect refusal in hidden states. Used both
before (to verify refusal exists) and after (to verify it's gone);
classification accuracy should drop to ~50% after abliteration.
### 13. Activation Patching (`activation_patching.py`)
Interchange interventions — swaps activations between refused and complied
runs to identify which components are sufficient (not just necessary) for
refusal. Complementary to causal tracing; together they give the full picture.

### 14. Tuned Lens (`tuned_lens.py`)
Trained version of the logit lens that provides more accurate per-layer
decoding by learning affine transformations for each layer. Especially
useful for deeper models.

### 15. Multi-Token Position Analysis (`multi_token_position.py`)
Analyzes refusal signals across multiple token positions, not just the
last token. Important for models that distribute refusal across the sequence
(e.g., encoding it at the system prompt position rather than the query).
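Per-position analysis reduces to scoring each position's hidden state against the refusal direction. A minimal sketch (synthetic data; the signal is planted at the system-prompt position to show why last-token-only analysis can miss it):

```python
import numpy as np

rng = np.random.default_rng(6)
seq_len, hidden = 10, 16
d = rng.standard_normal(hidden)
d /= np.linalg.norm(d)

# One sequence of hidden states with the refusal signal planted at
# position 0 (the system prompt), not at the final token
acts = rng.standard_normal((seq_len, hidden))
acts[0] += 5.0 * d

scores = acts @ d              # per-position refusal score

assert scores.argmax() == 0    # signal lives at position 0, not at -1
```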
---
## Abliteration & Manipulation
### 16. SAE-Based Abliteration (`sae_abliteration.py`)
Uses Sparse Autoencoder features to identify and remove specific refusal
features. More surgical than direction-based methods.
### 17. Steering Vectors (`steering_vectors.py`)
Creates and applies inference-time steering vectors for reversible refusal
modification. Includes `SteeringVectorFactory` and `SteeringHookManager`.
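At its core, steering adds or subtracts a scaled vector from hidden states at inference time. A minimal sketch of that arithmetic (not the `SteeringHookManager` API itself, which hooks into the forward pass):

```python
import numpy as np

rng = np.random.default_rng(7)
hidden = 16
v = rng.standard_normal(hidden)
v /= np.linalg.norm(v)                  # unit steering vector
h = rng.standard_normal(hidden)         # a hidden state mid-forward-pass

alpha = 4.0
h_steered = h - alpha * v               # suppress refusal at inference time
h_restored = h_steered + alpha * v      # fully reversible, unlike weight surgery

assert np.allclose(h_restored, h)       # removing the hook restores behavior
assert (h_steered @ v) < (h @ v)        # component along v was reduced
```

Reversibility is the key design property: no weights change, so the modification can be toggled per request.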
### 18. LEACE Concept Erasure (`leace.py`)
LEAst-squares Concept Erasure — closed-form, provably optimal linear
concept removal. Available as both an analysis module and a direction
extraction method.
### 19. Sparse Surgery (`sparse_surgery.py`)
High-precision weight modification targeting individual neurons and
weight matrix entries rather than full directions.
### 20. Conditional Abliteration (`conditional_abliteration.py`)
Targeted removal that only affects specific refusal categories while
preserving others (e.g., remove weapons refusal but keep CSAM refusal).
---
## Transfer & Robustness

### 21. Cross-Model Transfer (`cross_model_transfer.py`)
Tests whether refusal directions extracted from one model transfer to
another architecture. Measures universality of guardrail directions;
when directions transfer, the PROBE stage can be skipped on similar models.

### 22. Defense Robustness (`defense_robustness.py`)
Evaluates how robust the abliteration is against various defense mechanisms
and re-alignment attempts. Higher robustness scores indicate a more
aggressive method is needed.

### 23. Spectral Certification (`spectral_certification.py`)
Provides mathematical bounds on the completeness of refusal removal
using spectral gap analysis of the projection.

### 24. Wasserstein Optimal Extraction (`wasserstein_optimal.py`)
Uses optimal transport theory for more precise direction extraction
that minimizes distribution shift.
### 25. Wasserstein Transfer (`wasserstein_transfer.py`)
Distribution transfer between models using Wasserstein distance
for cross-architecture refusal direction mapping.
---
## Advanced / Research

### 26. Bayesian Kernel Projection (`bayesian_kernel_projection.py`)
Probabilistic feature mapping that estimates uncertainty in refusal
direction identification via a posterior distribution over directions.

### 27. Cross-Model Universality Index
Measures whether guardrail directions generalize across different model
architectures and training regimes.

### 28. Visualization (`visualization.py`)
Plotting and graphing utilities for all analysis modules. Generates
heatmaps, direction plots, and layer-wise analysis charts.
---

## Running Analysis

### Via CLI

```bash
# Run analysis from a YAML config
obliteratus run analysis-study.yaml --preset quick

# Available study presets:
#   quick      — Fast sanity check (2-3 modules)
#   full       — All core + geometric analysis
#   jailbreak  — Refusal circuit localization
#   knowledge  — Knowledge preservation analysis
#   robustness — Stress testing / defense evaluation
```

### Via YAML Config
See the `templates/analysis-study.yaml` template for a complete example.
Load with: `skill_view(name="obliteratus", file_path="templates/analysis-study.yaml")`

# OBLITERATUS Methods — Detailed Guide
> **Important:** The CLI (`obliteratus obliterate --method`) accepts 9 methods:
> basic, advanced, aggressive, spectral_cascade, informed, surgical, optimized,
> inverted, nuclear. Four additional methods (failspy, gabliteration, heretic, rdo) —
> faithful reproductions of prior community/academic work, useful as baselines —
> are available only via the Python API and will be rejected by argparse on the CLI.
## How Abliteration Works (Theory)
When a model is trained with RLHF/DPO/CAI, it learns to represent "should I refuse?"
as a direction in its internal activation space. When processing a "harmful" prompt,
activations shift along this direction, causing the model to generate refusal text.
Abliteration works by:
1. Measuring this direction (the difference between mean activations on harmful and harmless prompts)
2. Removing it from the model's weight matrices via orthogonal projection
3. The model can no longer "point toward" refusal, so it responds normally

Mathematically: `W_new = W_old - (W_old @ d @ d.T)` where `d` is the unit refusal direction.
The key challenge is finding accurate refusal directions without damaging other capabilities.
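The projection formula can be verified in a few lines of NumPy (a minimal illustration on a random matrix, not OBLITERATUS's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))   # stand-in for a weight matrix
d = rng.standard_normal(8)
d /= np.linalg.norm(d)            # the refusal direction must be unit-norm

# W_new = W - (W @ d) d^T : remove each row's component along d
W_new = W - np.outer(W @ d, d)

# After projection the matrix can no longer write along d
assert np.allclose(W_new @ d, 0.0)
```

Since `d.T @ d = 1`, the residual `W_new @ d = W @ d - (W @ d)(d.T @ d)` vanishes exactly; that is why normalizing `d` matters.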
---
## Direction Extraction Methods
Before projecting, OBLITERATUS extracts refusal directions using one of three methods:
| Method | Flag | Description | Best For |
|:-------|:-----|:------------|:---------|
| Diff-in-Means | `--direction-method diff_means` | Difference between mean activations on refused vs. complied prompts | Default, fast, robust |
| SVD | `--direction-method svd` | Multi-direction extraction via Singular Value Decomposition | Complex alignment, multiple refusal mechanisms |
| LEACE | `--direction-method leace` | LEAst-squares Concept Erasure — closed-form, mathematically optimal | Maximum precision, research |
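The default diff-in-means extraction reduces to a mean difference plus normalization. A toy sketch with synthetic activations standing in for collected hidden states (the planted direction is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
hidden = 16
true_dir = np.zeros(hidden)
true_dir[0] = 1.0  # planted refusal axis

# Synthetic activations: "harmful" prompts shift along the refusal axis
harmless = rng.standard_normal((100, hidden))
harmful = rng.standard_normal((100, hidden)) + 3.0 * true_dir

# Diff-in-means: difference of mean activations, normalized to unit length
d = harmful.mean(axis=0) - harmless.mean(axis=0)
d /= np.linalg.norm(d)

# The recovered direction is close to the planted one
assert abs(d @ true_dir) > 0.9
```

SVD and LEACE replace the mean difference with multi-direction or covariance-aware estimators, but the input (paired refused/complied activations) is the same.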
---
## Method Details

### basic
- **Directions:** 1 (single diff-in-means vector)
- **Based on:** Arditi et al. 2024 ("Refusal in Language Models Is Mediated by a Single Direction")
- **Speed:** Fast (~5-10 min for an 8B model)
- **Risk:** Low
- **Use case:** Quick tests, prototyping, evaluating whether abliteration works for a model
- **How it works:** Extracts one refusal direction and projects it out uniformly across all layers. Misses complex multi-direction refusal patterns.

### advanced (DEFAULT — RECOMMENDED)
- **Directions:** 4 (multi-direction SVD)
- **Speed:** Medium (~10-20 min for an 8B model)
- **Risk:** Low-Medium
- **Refinement passes:** 2
- **Use case:** Default for most models. Well-tested and reliable.
- **How it works:** Extracts multiple refusal directions via SVD, applies norm-preserving bi-projection to prevent weight magnitude drift. Two refinement passes catch residual refusal.
### aggressive
- **Directions:** 8+ (whitened SVD + jailbreak-contrastive)
- **Speed:** Medium-Slow
- **Risk:** Medium-High (may damage coherence)
- **Use case:** When `advanced` leaves > 10% refusals. Stubborn models.
- **How it works:** Uses whitened SVD for covariance-normalized extraction (separating the refusal signal from natural activation variance), adds jailbreak-contrastive directions, and performs attention head surgery on the most refusal-active heads.
### spectral_cascade
- **Speed:** Medium
- **Risk:** Medium
- **Use case:** Research, exploring alternative decomposition strategies; less battle-tested
- **How it works:** DCT (Discrete Cosine Transform) frequency-domain decomposition of refusal signals. Separates high-frequency (surface-level) from low-frequency (deep) refusal patterns.
### informed (EXPERIMENTAL)
- **Speed:** Slow (~20-40 min for 8B model)
- **Risk:** Variable — results depend on analysis quality
- **Use case:** When you want auto-configuration, but be aware this is experimental and may not outperform `advanced`.
- **How it works:** Runs 4 analysis modules first (alignment imprint, concept geometry, logit lens, ouroboros detection), then auto-configures extraction strategy. Includes an "Ouroboros loop" that detects and counteracts self-repair.
- **Note:** The auto-detection can sometimes misconfigure. If results are poor, fall back to `advanced`.
### surgical
- **Speed:** Very slow (~1-2 hrs for 8B model)
- **Risk:** Low (very precise)
- **Use case:** Reasoning models (R1 distills, QwQ, etc.) where chain-of-thought must be preserved.
- **How it works:** Uses SAE (Sparse Autoencoder) features + individual neuron masking + attention head surgery + per-expert decomposition (for MoE). CoT-aware — identifies and protects reasoning-critical directions before projecting.
### optimized
- **Speed:** Very slow (hours — runs many trials)
- **Risk:** Low (finds optimal parameters)
- **Use case:** When quality matters more than speed. Production models.
- **How it works:** Bayesian hyperparameter search via Optuna TPE sampler. Optimizes n_directions, regularization, refinement passes, and layer selection jointly. Evaluates each configuration on refusal rate + perplexity.
### inverted
- **Speed:** Fast
- **Risk:** High (model behavior changes dramatically)
- **Use case:** Research, studying refusal mechanisms
- **How it works:** Instead of projecting out the refusal direction, reflects it. The model actively complies rather than passively not-refusing. Useful for understanding the geometry of alignment.
- **Warning:** Can make the model too eager and may reduce safety-adjacent reasoning.

### nuclear
- **Speed:** Slow
- **Risk:** Medium-High
- **Use case:** Stubborn MoE models (DeepSeek-MoE, Mixtral, etc.)
- **How it works:** Combines expert-granular abliteration (EGA), steering vector injection, attention head pruning, and multi-pass refinement. Decomposes refusal signals into per-expert components for MoE architectures.

---
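The reflection behind `inverted` is a one-line change to the projection: subtract twice the component along `d` instead of once (a Householder-style reflection). A minimal sketch on a random matrix, not the actual implementation:

```python
import numpy as np

rng = np.random.default_rng(9)
W = rng.standard_normal((8, 8))
d = rng.standard_normal(8)
d /= np.linalg.norm(d)

# Reflection: flip, rather than remove, the component along d
W_inverted = W - 2.0 * np.outer(W @ d, d)

# The matrix now writes the *opposite* of refusal along d
assert np.allclose(W_inverted @ d, -(W @ d))
```

Compare with removal (`W - np.outer(W @ d, d)`), which zeroes the component instead of negating it; that single factor of 2 is what turns neutrality into active compliance.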
## Method Selection Flowchart

```
Is this a quick test?
  → YES: basic
  → NO: continue

Is it an MoE model (Mixtral, DeepSeek-MoE)?
  → YES: nuclear
  → NO: continue

Is it a reasoning model (R1, QwQ, CoT-focused)?
  → YES: surgical
  → NO: continue

Do you need the absolute best quality and have time?
  → YES: optimized
  → NO: advanced (recommended default)

Did advanced leave > 10% refusals?
  → YES: aggressive
  → Still refusing: nuclear
```
---
## Key Parameters

| Parameter | Range | Default | Effect |
|:----------|:------|:--------|:-------|
| `--n-directions` | 1-32 | method-dependent | More directions = more complete removal, but higher damage risk |
| `--regularization` | 0.0-1.0 | 0.1 | Higher = more conservative (less removal, less damage) |
| `--refinement-passes` | 1-5 | 2 | More passes catch residual refusal (the Ouroboros self-repair effect), with diminishing returns |
| `--quantization` | 4bit, 8bit | none | Reduces VRAM usage; quality impact minimal for extraction |
| `--verify-sample-size` | 10-200 | 20 | More samples = more accurate refusal-rate estimate |
---
## Troubleshooting

| Problem | Likely Cause | Fix |
|:--------|:-------------|:----|
| Refusal rate > 20% | Too few directions | Increase `--n-directions`, try `aggressive` or `nuclear` |
| Refusal rate 5-20% | Residual refusal | Add `--refinement-passes 3`, try `--direction-method svd` |
| Perplexity spike > 20% | Over-aggressive removal | Reduce `--n-directions`, increase `--regularization` |
| Model generates nonsense or repetitive output | Weight matrix damage; regularization too low | Use fewer directions, raise `--regularization` to 0.2-0.3, check norm preservation |
| MoE model still refuses | Non-expert-aware method | Switch to `nuclear` (expert-granular) |
| CoT reasoning degraded | Reasoning-critical directions damaged | Use the `surgical` method (CoT-aware) |
| OOM during extraction | Insufficient VRAM | Add `--quantization 4bit` and/or `--large-model` |