feat: update OBLITERATUS skill to v2.0 — match current repo state

Major updates to reflect the current OBLITERATUS codebase:

- Change default recommendation from 'informed' (experimental) to
  'advanced' (reliable, well-tested multi-direction SVD)
- Add new CLI commands: tourney, recommend, strategies, report,
  aggregate, abliterate (alias)
- Add --direction-method flag (diff_means, svd, leace)
- Add strategies module (embedding/FFN ablation, head pruning,
  layer removal)
- Add evaluation module with LM Eval Harness integration
- Expand analysis modules from 15 to 28
- Add Apple Silicon (MLX) support
- Add study presets (quick, jailbreak, knowledge, etc.)
- Add --contribute, --verify-sample-size, --preset flags
- Add complete CLI command reference table
- Fix torch property name: total_mem -> total_memory (caught
  during live testing)

Tested: Successfully abliterated Qwen2.5-0.5B-Instruct using
'advanced' method — refusal rate 0.4%, coherence 1.0, model
responds without refusal to test prompts.
teknium1 2026-03-09 02:39:03 -07:00
parent 763c6d104d
commit a6d3becd6a
3 changed files with 420 additions and 402 deletions


@ -1,170 +1,166 @@
# OBLITERATUS Analysis Modules — Reference
15 analysis modules for mechanistic interpretability of refusal in LLMs.
These help you understand HOW a model refuses before you decide to remove it.
OBLITERATUS includes 28 analysis modules for mechanistic interpretability of refusal in LLMs.
These modules help you understand how and where refusal behaviors are encoded before performing abliteration.
> **Note:** The `analysis/` directory contains additional utility files (utils.py,
> visualization.py, etc.) and helper functions beyond the 15 core analysis modules
> listed below. The module count matches the README's "15 deep analysis modules."
---
## Core Analysis (Run These First)
### Alignment Imprint Detection
**File:** `alignment_imprint.py`
**Purpose:** Identifies what alignment technique was used to train the model
**Detects:** DPO, RLHF, CAI (Constitutional AI), SFT (Supervised Fine-Tuning)
**How:** Analyzes subspace geometry — each alignment method leaves a distinct
geometric "fingerprint" in the weight space
**Output:** Detected method + confidence score
**Why it matters:** Different alignment methods need different abliteration approaches.
DPO models typically have cleaner single-direction refusal; RLHF is more diffuse.
### 1. Alignment Imprint Detection (`alignment_imprint.py`)
Fingerprints whether a model was trained via DPO, RLHF, CAI, or SFT.
This determines which extraction strategy will work best.
### Concept Cone Geometry
**File:** `concept_geometry.py`
**Purpose:** Maps whether refusal is one direction or a polyhedral cone (many)
**Output:** Cone angle, dimensionality, per-category breakdown
**Why it matters:** If refusal is a single direction, `basic` method works. If it's
a cone (multiple directions for different refusal categories), you need `advanced`
or `informed` with higher `n_directions`.
### 2. Concept Cone Geometry (`concept_geometry.py`)
Determines if refusal is a single linear direction or a polyhedral cone
(set of multiple mechanisms). Single-direction models respond well to `basic`;
polyhedral models need `advanced` or `surgical`.
### Refusal Logit Lens
**File:** `logit_lens.py`
**Purpose:** Identifies the specific layer where the model "decides" to refuse
**How:** Projects intermediate hidden states to vocabulary space at each layer,
watches when "I cannot" tokens spike in probability
**Output:** Layer-by-layer refusal probability plot
**Why it matters:** Tells you which layers are most important to target
### 3. Refusal Logit Lens (`logit_lens.py`)
Identifies the specific layer where a model "decides" to refuse by decoding
intermediate layer representations into token space.
### Ouroboros (Self-Repair) Detection
**File:** `anti_ouroboros.py`
**Purpose:** Predicts whether the model will reconstruct its refusal after removal
**How:** Measures redundancy in refusal representation across layers
**Output:** Self-repair risk score (0-1)
**Why it matters:** High self-repair risk means you need multiple refinement passes
or the `informed` method which auto-compensates
### 4. Ouroboros Detection (`anti_ouroboros.py`)
Identifies if a model attempts to "self-repair" refusal behaviors after
excision. Reports a risk score (0-1). High scores mean additional refinement
passes are needed.
### Causal Tracing
**File:** `causal_tracing.py`
**Purpose:** Determines which components are causally necessary for refusal
**How:** Patches activations between clean and corrupted runs, measures causal effect
**Output:** Causal importance map across layers, heads, and MLPs
**Why it matters:** Shows exactly which components to target for surgical removal
### 5. Causal Tracing (`causal_tracing.py`)
Identifies which components (layers, heads, MLPs) are causally necessary
for refusal behavior using activation patching.
---
## Geometric Analysis
### Cross-Layer Alignment
**File:** `cross_layer.py`
**Purpose:** Measures how aligned refusal directions are across layers
**Output:** Alignment matrix, cluster assignments
**Why it matters:** If directions are highly aligned across layers, removal is easier.
If they cluster, you may need layer-group-specific directions.
### 6. Cross-Layer Alignment (`cross_layer.py`)
Measures how refusal directions align across different layers. High alignment
means the refusal signal is consistent; low alignment suggests layer-specific
mechanisms.
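As an illustration of what "alignment across layers" can mean, the sketch below computes pairwise cosine similarity between per-layer direction vectors. It is a toy NumPy example, not the implementation in `cross_layer.py`; the function name `cross_layer_alignment` and the hand-written directions are assumptions made for illustration.

```python
import numpy as np

def cross_layer_alignment(directions: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between per-layer direction vectors.

    directions has shape (n_layers, hidden_dim); the result is (n_layers, n_layers).
    """
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    return unit @ unit.T

# Hypothetical directions for four layers in a 3-dimensional toy space.
directions = np.array([
    [1.0, 0.0, 0.0],   # layer 0
    [2.0, 0.0, 0.0],   # layer 1: same direction as layer 0, different norm
    [1.0, 1.0, 0.0],   # layer 2: partially aligned with layer 0
    [0.0, 0.0, 1.0],   # layer 3: orthogonal to layer 0
])

sim = cross_layer_alignment(directions)
```

Entries near 1 indicate layers that share a direction, while entries near 0 indicate layer-specific directions that may need separate handling.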
### Residual Stream Decomposition
**File:** `residual_stream.py`
**Purpose:** Breaks down refusal into Attention vs MLP contributions
**Output:** Per-layer Attention/MLP contribution to refusal direction
**Why it matters:** Helps decide whether to target attention heads, MLPs, or both
### 7. Residual Stream Decomposition (`residual_stream.py`)
Decomposes the residual stream into attention and MLP contributions to
understand which component type contributes more to refusal.
### Riemannian Manifold Geometry
**File:** `riemannian_manifold.py` (673 lines)
**Purpose:** Analyzes the weight manifold geometry around refusal directions
**Output:** Curvature, geodesics, tangent space analysis
**Why it matters:** Research-grade; helps understand the geometric structure of alignment
### 8. Riemannian Manifold Geometry (`riemannian_manifold.py`)
Analyzes the curvature and geometry of the weight manifold near refusal
directions. Informs how aggressively projections can be applied without
damaging the manifold structure.
### Whitened SVD
**File:** `whitened_svd.py`
**Purpose:** Covariance-normalized SVD extraction
**How:** Whitens the activation covariance before computing refusal directions,
separating true refusal signal from natural activation variance
**Output:** Cleaner refusal directions with less noise
**Why it matters:** Produces more precise directions, especially for noisy activations
### 9. Whitened SVD (`whitened_svd.py`)
Covariance-normalized SVD extraction that separates guardrail signals from
natural activation variance. More precise than standard SVD for models with
high activation variance.
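The idea of covariance normalization can be sketched on synthetic data. The example below is an assumption-laden illustration, not the code in `whitened_svd.py`: it whitens activations using the covariance of a baseline set, then compares the top singular direction with and without whitening. The names `whitening_matrix` and `top_direction`, the `eps` value, and the synthetic "harmless"/"harmful" arrays are all hypothetical.

```python
import numpy as np

def whitening_matrix(X: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Symmetric (ZCA-style) whitening matrix C^(-1/2) from the covariance of X."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)
    vals, vecs = np.linalg.eigh(cov)          # cov is symmetric, so eigh applies
    return vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T

def top_direction(M: np.ndarray) -> np.ndarray:
    """First right singular vector of M (sign is arbitrary)."""
    _, _, vt = np.linalg.svd(M, full_matrices=False)
    return vt[0]

rng = np.random.default_rng(0)
scales = np.array([10.0, 1.0, 1.0, 1.0, 1.0, 1.0])   # axis 0 has large natural variance
harmless = rng.normal(size=(500, 6)) * scales
harmful = rng.normal(size=(500, 6)) * scales
harmful[:, 1] += 3.0                                  # synthetic signal on low-variance axis 1

mu = harmless.mean(axis=0)
Wh = whitening_matrix(harmless)

plain_top = top_direction(harmful - mu)               # dominated by the high-variance axis
white_top = top_direction((harmful - mu) @ Wh)        # expressed in whitened coordinates
```

In this toy setup the unwhitened SVD latches onto the high-variance axis, while the whitened version recovers the axis carrying the injected signal. Note that `white_top` lives in whitened coordinates and would need to be mapped back before being applied to weights.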
### 10. Concept Cone Geometry (extended)
Maps the full polyhedral structure of refusal, including cone angles,
face counts, and intersection patterns.
---
## Probing & Classification
### Activation Probing
**File:** `activation_probing.py`
**Purpose:** Post-excision probing to verify refusal signal is truly gone
**Output:** Residual refusal signal strength per layer
**Why it matters:** Verification that abliteration was complete
### 11. Activation Probing (`activation_probing.py`)
Post-excision verification — probes for residual refusal concepts after
abliteration to ensure complete removal.
### Probing Classifiers
**File:** `probing_classifiers.py`
**Purpose:** Trains linear classifiers to detect refusal in hidden states
**Output:** Classification accuracy per layer (should drop to ~50% after abliteration)
**Why it matters:** Quantitative measure of refusal removal completeness
### 12. Probing Classifiers (`probing_classifiers.py`)
Trains linear classifiers to detect refusal in activations. Used both
before (to verify refusal exists) and after (to verify it's gone).
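A before/after probe comparison might look like the following sketch. It uses synthetic activations and a small logistic-regression probe written in plain NumPy; it is not the code in `probing_classifiers.py`, and `train_linear_probe`, `probe_accuracy`, and the synthetic signal direction `t` are hypothetical names chosen for illustration.

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, steps=300):
    """Fit logistic regression by gradient descent; returns (weights, bias)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probability of class 1
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def probe_accuracy(X, y, w, b):
    """Fraction of examples whose predicted class matches the label."""
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))

rng = np.random.default_rng(0)
dim, n = 8, 1000
t = np.zeros(dim)
t[0] = 1.0                                        # synthetic unit-norm signal direction
y = (np.arange(n) % 2).astype(float)              # balanced 0/1 labels
X = rng.normal(size=(n, dim)) + np.outer(2 * y - 1, 2.0 * t)

X_abl = X - np.outer(X @ t, t)                    # remove the component along t

train, test = slice(0, n // 2), slice(n // 2, n)
w, b = train_linear_probe(X[train], y[train])
acc_before = probe_accuracy(X[test], y[test], w, b)

w2, b2 = train_linear_probe(X_abl[train], y[train])
acc_after = probe_accuracy(X_abl[test], y[test], w2, b2)
```

Held-out accuracy is high while the signal is present and falls to roughly chance once the direction is projected out, which is the pattern the probe is meant to detect.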
### Activation Patching
**File:** `activation_patching.py`
**Purpose:** Interchange interventions — swap activations between harmful/harmless runs
**Output:** Which components are sufficient (not just necessary) for refusal
**Why it matters:** Complementary to causal tracing; together they give full picture
### 13. Activation Patching (`activation_patching.py`)
Interchange interventions — swaps activations between refused and complied
runs to identify causal components.
### 14. Tuned Lens (`tuned_lens.py`)
Trained version of logit lens that provides more accurate per-layer
decoding by learning affine transformations for each layer.
### 15. Multi-Token Position Analysis (`multi_token_position.py`)
Analyzes refusal signals across multiple token positions, not just the
last token. Important for models that distribute refusal across the sequence.
---
## Abliteration & Manipulation
### 16. SAE-Based Abliteration (`sae_abliteration.py`)
Uses Sparse Autoencoder features to identify and remove specific refusal
features. More surgical than direction-based methods.
### 17. Steering Vectors (`steering_vectors.py`)
Creates and applies inference-time steering vectors for reversible refusal
modification. Includes `SteeringVectorFactory` and `SteeringHookManager`.
### 18. LEACE Concept Erasure (`leace.py`)
Linear Erasure via Closed-form Estimation — mathematically optimal linear
concept removal. Available as both analysis module and direction extraction method.
### 19. Sparse Surgery (`sparse_surgery.py`)
High-precision weight modification targeting individual neurons and
weight matrix entries rather than full directions.
### 20. Conditional Abliteration (`conditional_abliteration.py`)
Targeted removal that only affects specific refusal categories while
preserving others (e.g., remove weapons refusal but keep CSAM refusal).
---
## Transfer & Robustness
### Cross-Model Transfer
**File:** `cross_model_transfer.py`
**Purpose:** Tests if refusal directions from one model work on another
**Output:** Transfer success rate between model pairs
**Why it matters:** If directions transfer, you can skip PROBE stage on similar models
### 21. Cross-Model Transfer (`cross_model_transfer.py`)
Tests whether refusal directions extracted from one model transfer to
another architecture. Measures universality of guardrail directions.
### Defense Robustness
**File:** `defense_robustness.py`
**Purpose:** Evaluates how robust the model's refusal defenses are
**Output:** Robustness score, entanglement mapping
**Why it matters:** Higher robustness = need more aggressive method
### 22. Defense Robustness (`defense_robustness.py`)
Evaluates how robust the abliteration is against various defense mechanisms
and re-alignment attempts.
### Spectral Certification
**File:** `spectral_certification.py`
**Purpose:** Certifies completeness of refusal direction removal
**Output:** Spectral gap analysis, completeness score
**Why it matters:** Formal verification that all major refusal components are addressed
### 23. Spectral Certification (`spectral_certification.py`)
Provides mathematical bounds on the completeness of refusal removal
using spectral analysis of the projection.
### 24. Wasserstein Optimal Extraction (`wasserstein_optimal.py`)
Uses optimal transport theory for more precise direction extraction
that minimizes distribution shift.
### 25. Wasserstein Transfer (`wasserstein_transfer.py`)
Distribution transfer between models using Wasserstein distance
for cross-architecture refusal direction mapping.
---
## Advanced / Research
### SAE-based Abliteration
**File:** `sae_abliteration.py` (762 lines)
**Purpose:** Uses Sparse Autoencoder features to decompose refusal at feature level
**Output:** Refusal-specific SAE features, targeted removal
**Why it matters:** Most fine-grained approach; can target individual refusal "concepts"
### 26. Bayesian Kernel Projection (`bayesian_kernel_projection.py`)
Probabilistic feature mapping that estimates uncertainty in refusal
direction identification.
### Wasserstein Optimal Extraction
**File:** `wasserstein_optimal.py`
**Purpose:** Optimal transport-based direction extraction
**Output:** Wasserstein-optimal refusal directions
**Why it matters:** Theoretically optimal direction extraction under distributional assumptions
### 27. Cross-Model Universality Index
Measures if guardrail directions generalize across different model
architectures and training regimes.
### Bayesian Kernel Projection
**File:** `bayesian_kernel_projection.py`
**Purpose:** Bayesian approach to refusal direction projection
**Output:** Posterior distribution over refusal directions
**Why it matters:** Quantifies uncertainty in direction estimation
### 28. Visualization (`visualization.py`)
Plotting and graphing utilities for all analysis modules. Generates
heatmaps, direction plots, and layer-wise analysis charts.
### Conditional Abliteration
**File:** `conditional_abliteration.py`
**Purpose:** Domain-specific conditional removal (remove refusal for topic X but keep for Y)
**Output:** Per-domain refusal directions
**Why it matters:** Selective uncensoring — remove only specific refusal categories
---
### Steering Vectors
**File:** `steering_vectors.py`
**Purpose:** Generate inference-time steering vectors (reversible alternative)
**Output:** Steering vector files that can be applied/removed at inference
**Why it matters:** Non-destructive alternative to permanent weight modification
## Running Analysis
### Tuned Lens
**File:** `tuned_lens.py`
**Purpose:** Trained linear probes per layer (more accurate than raw logit lens)
**Output:** Layer-by-layer refusal representation with trained projections
**Why it matters:** More accurate than logit lens, especially for deeper models
### Via CLI
```bash
# Run analysis from a YAML config
obliteratus run analysis-study.yaml --preset quick
### Multi-Token Position Analysis
**File:** `multi_token_position.py`
**Purpose:** Analyzes refusal signal at multiple token positions (not just last)
**Output:** Position-dependent refusal direction maps
**Why it matters:** Some models encode refusal at the system prompt position, not the query
# Available study presets:
# quick — Fast sanity check (2-3 modules)
# full — All core + geometric analysis
# jailbreak — Refusal circuit localization
# knowledge — Knowledge preservation analysis
# robustness — Stress testing / defense evaluation
```
### Sparse Surgery
**File:** `sparse_surgery.py`
**Purpose:** Row-level sparse weight surgery instead of full matrix projection
**Output:** Targeted weight modifications at the row level
**Why it matters:** More surgical than full-matrix projection, less collateral damage
### Via YAML Config
See the `templates/analysis-study.yaml` template for a complete example.
Load with: `skill_view(name="obliteratus", file_path="templates/analysis-study.yaml")`

View file


@ -1,132 +1,141 @@
# OBLITERATUS Methods — Detailed Guide
> **Important:** The CLI (`obliteratus obliterate --method`) accepts 9 methods:
> basic, advanced, aggressive, spectral_cascade, informed, surgical, optimized,
> inverted, nuclear. Four additional methods (failspy, gabliteration, heretic, rdo)
> are available only via the Python API and will be rejected by argparse if used on CLI.
> The CLI accepts 9 methods via `--method`: basic, advanced, aggressive, spectral_cascade,
> informed, surgical, optimized, inverted, nuclear.
> Four additional methods (failspy, gabliteration, heretic, rdo) are available only via the Python API.
## How Abliteration Works (Theory)
When a model is trained with RLHF/DPO/CAI, it learns to represent "should I refuse?"
as a direction in its internal activation space. When processing a "harmful" prompt,
activations shift in this direction, causing the model to generate refusal text.
Abliteration works by:
1. Measuring this direction (the difference between harmful and harmless activations)
2. Removing it from the model's weight matrices via orthogonal projection
3. The model can no longer "point toward" refusal, so it responds normally
Abliteration identifies a "refusal direction" — a vector in the model's activation space that
corresponds to refusal behavior — and projects it out of the weight matrices.
Mathematically: `W_new = W_old - (W_old @ d @ d.T)`, where `d` is the refusal direction written as a unit-norm column vector, so that `d @ d.T` is the projector onto that direction.
The key challenge is finding accurate refusal directions without damaging other capabilities.
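The projection formula can be checked numerically. The sketch below is a minimal NumPy illustration on a random matrix, not OBLITERATUS code; `project_out` is a hypothetical helper name, and `d` is assumed to live in the input space of `W`.

```python
import numpy as np

def project_out(W: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Return W with the component along direction d removed: W - W d d^T."""
    d = d / np.linalg.norm(d)        # normalize so that d d^T is a projector
    return W - np.outer(W @ d, d)    # equivalent to W - W @ d[:, None] @ d[None, :]

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))          # toy weight matrix acting on 4-dimensional inputs
d = rng.normal(size=4)               # toy direction (not extracted from any model)

W_new = project_out(W, d)
d_hat = d / np.linalg.norm(d)
residual = np.linalg.norm(W_new @ d_hat)   # response of the edited matrix along d
```

After the edit, the matrix no longer responds to inputs along `d`, and applying the projection a second time changes nothing.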
---
## Direction Extraction Methods
Before projecting, OBLITERATUS extracts refusal directions using one of three methods:
| Method | Flag | Description | Best For |
|:-------|:-----|:------------|:---------|
| Diff-in-Means | `--direction-method diff_means` | Difference between mean activations on refused vs. complied prompts | Default, fast, robust |
| SVD | `--direction-method svd` | Multi-direction extraction via Singular Value Decomposition | Complex alignment, multiple refusal mechanisms |
| LEACE | `--direction-method leace` | Linear Erasure via Closed-form Estimation — mathematically optimal | Maximum precision, research |
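The first two rows of the table can be illustrated on synthetic activations. This is a sketch under stated assumptions rather than the OBLITERATUS implementation: `diff_means_direction` and `svd_directions` are hypothetical names, the SVD variant is assumed to operate on paired difference vectors (which requires equal numbers of refused and complied samples), and the data is random noise with a planted direction.

```python
import numpy as np

def diff_means_direction(refused: np.ndarray, complied: np.ndarray) -> np.ndarray:
    """Unit vector along the difference of mean activations."""
    d = refused.mean(axis=0) - complied.mean(axis=0)
    return d / np.linalg.norm(d)

def svd_directions(refused: np.ndarray, complied: np.ndarray, k: int) -> np.ndarray:
    """Top-k right singular vectors of the paired activation differences."""
    diffs = refused - complied                       # assumes row i pairs with row i
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]                                    # rows are orthonormal directions

rng = np.random.default_rng(0)
n, dim = 200, 16
true_dir = np.zeros(dim)
true_dir[3] = 1.0                                    # planted synthetic direction

complied = rng.normal(size=(n, dim))
refused = rng.normal(size=(n, dim)) + 3.0 * true_dir

d_mean = diff_means_direction(refused, complied)
d_svd = svd_directions(refused, complied, k=4)
```

Diff-in-means yields a single vector, while SVD returns several orthonormal directions ordered by singular value. The sign of each singular vector is arbitrary, so comparisons should use the absolute value of the dot product.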
---
## Method Details
### basic
**Technique:** Single refusal direction via diff-in-means
**Based on:** Arditi et al. 2024 ("Refusal in Language Models Is Mediated by a Single Direction")
**Speed:** Fast (~5-10 min for 8B)
**Quality:** Moderate — works for simple refusal patterns
**Best for:** Quick tests, models with clean single-direction refusal
**Limitation:** Misses complex multi-direction refusal patterns
- **Directions:** 1 (single diff-in-means vector)
- **Speed:** Fast (~5-10 min for 8B model)
- **Risk:** Low
- **Use case:** Quick tests, prototyping, evaluating if abliteration works for a model
- **How it works:** Extracts one refusal direction and projects it out uniformly across all layers.
### advanced (DEFAULT)
**Technique:** Multiple SVD directions with norm-preserving projection
**Speed:** Medium (~10-20 min for 8B)
**Quality:** Good — handles multi-direction refusal
**Best for:** Dense models (Llama, Qwen, Mistral) as a reliable default
**Key improvement:** Norm preservation prevents weight magnitude drift
### informed (RECOMMENDED)
**Technique:** Analysis-guided auto-configuration
**Speed:** Slow (~20-40 min for 8B, runs 4 analysis modules first)
**Quality:** Best — adapts to each model's specific refusal implementation
**Best for:** Any model when quality matters more than speed
The informed pipeline runs these analysis modules during abliteration:
1. **AlignmentImprintDetector** — Detects DPO/RLHF/CAI/SFT → sets regularization
2. **ConceptConeAnalyzer** — Polyhedral vs linear refusal → sets n_directions
3. **CrossLayerAlignmentAnalyzer** — Cluster-aware → selects target layers
4. **DefenseRobustnessEvaluator** — Self-repair risk → sets refinement passes
5. **Ouroboros loop** — Re-probes after excision, re-excises if refusal persists
### advanced (DEFAULT — RECOMMENDED)
- **Directions:** 4 (multi-direction SVD)
- **Speed:** Medium (~10-20 min for 8B model)
- **Risk:** Low-Medium
- **Refinement passes:** 2
- **Use case:** Default for most models. Well-tested and reliable.
- **How it works:** Extracts multiple refusal directions via SVD, applies norm-preserving bi-projection to maintain weight matrix norms. Two refinement passes catch residual refusal.
### aggressive
**Technique:** Whitened SVD + jailbreak-contrastive activations + attention head surgery
**Speed:** Slow (~30-60 min for 8B)
**Quality:** High but higher risk of coherence damage
**Best for:** Models that resist gentler methods
**Key feature:** Whitened SVD separates refusal signal from natural activation variance
### surgical
**Technique:** SAE features + neuron masking + head surgery + per-expert directions
**Speed:** Very slow (~1-2 hrs for 8B, needs SAE)
**Quality:** Highest precision
**Best for:** Reasoning models (R1 distills) where you must preserve CoT
**Key feature:** CoT-Aware — explicitly protects reasoning-critical directions
### nuclear
**Technique:** Everything combined — expert transplant + steering + per-expert directions
**Speed:** Very slow
**Quality:** Most thorough removal, highest risk of side effects
**Best for:** Stubborn MoE models (DeepSeek, Mixtral, DBRX) that resist other methods
**Key feature:** Expert-granular abliteration decomposes signals per MoE expert
### optimized
**Technique:** Bayesian hyperparameter search via Optuna TPE
**Speed:** Very slow (runs many trials)
**Quality:** Finds optimal configuration automatically
**Best for:** Research, when you want the mathematically best parameters
**Requires:** optuna package
- **Directions:** 8+ (whitened SVD + jailbreak-contrastive)
- **Speed:** Medium-Slow
- **Risk:** Medium-High (may damage coherence)
- **Use case:** When `advanced` leaves > 10% refusals. Stubborn models.
- **How it works:** Uses whitened SVD for covariance-normalized extraction, adds jailbreak-contrastive directions, performs attention head surgery on the most refusal-active heads.
### spectral_cascade
**Technique:** DCT frequency-domain decomposition of refusal signal
**Speed:** Medium-slow
**Quality:** Novel approach, less battle-tested
**Best for:** Research, exploring alternative decomposition strategies
- **Speed:** Medium
- **Risk:** Medium
- **Use case:** Research, novel approaches
- **How it works:** DCT (Discrete Cosine Transform) frequency-domain decomposition of refusal signals. Separates high-frequency (surface-level) from low-frequency (deep) refusal patterns.
### informed (EXPERIMENTAL)
- **Speed:** Slow (~20-40 min for 8B model)
- **Risk:** Variable — results depend on analysis quality
- **Use case:** When you want auto-configuration. This method is experimental and may not outperform `advanced`.
- **How it works:** Runs 4 analysis modules first (alignment imprint, concept geometry, logit lens, ouroboros detection), then auto-configures extraction strategy. Includes an "Ouroboros loop" that detects and counteracts self-repair.
- **Note:** The auto-detection can sometimes misconfigure. If results are poor, fall back to `advanced`.
### surgical
- **Speed:** Very slow (~1-2 hrs for 8B model)
- **Risk:** Low (very precise)
- **Use case:** Reasoning models (R1 distills, QwQ, etc.) where chain-of-thought must be preserved.
- **How it works:** Uses SAE (Sparse Autoencoder) features + individual neuron masking + attention head surgery + per-expert decomposition (for MoE). CoT-aware — identifies and protects reasoning-critical directions before projecting.
### optimized
- **Speed:** Very slow (hours — runs many trials)
- **Risk:** Low (finds optimal parameters)
- **Use case:** When quality matters more than speed. Production models.
- **How it works:** Bayesian hyperparameter search via Optuna TPE sampler. Optimizes n_directions, regularization, refinement passes, and layer selection jointly. Evaluates each configuration on refusal rate + perplexity.
### inverted
**Technique:** Reflects (inverts) the refusal direction instead of removing it
**Speed:** Fast (same as basic)
**Quality:** Aggressive — model becomes actively willing, not just neutral
**Best for:** When you want the model to be maximally helpful
**Warning:** Can make the model too eager; may reduce safety-adjacent reasoning
- **Speed:** Fast
- **Risk:** High (model behavior changes dramatically)
- **Use case:** Research, studying refusal mechanisms
- **How it works:** Instead of projecting out the refusal direction, reflects it. The model actively complies rather than passively not-refusing. Useful for understanding the geometry of alignment.
### failspy / gabliteration / heretic / rdo (PYTHON API ONLY)
**Technique:** Faithful reproductions of prior community/academic work
**Speed:** Varies
**Quality:** Known baselines
**Best for:** Reproducing published results, comparing methods
**⚠️ NOT available via CLI** — these methods are only accessible via the Python API.
Do not use `--method failspy` etc. in CLI commands; argparse will reject them.
### nuclear
- **Speed:** Slow
- **Risk:** Medium-High
- **Use case:** Stubborn MoE models (DeepSeek-MoE, Mixtral, etc.)
- **How it works:** Combines expert-granular abliteration (EGA), steering vector injection, attention head pruning, and multi-pass refinement. Decomposes refusal signals into per-expert components for MoE architectures.
---
## Method Selection Flowchart
```
Is this a quick test?
├─ YES → basic
└─ NO → Is the model MoE (DeepSeek, Mixtral)?
├─ YES → nuclear
└─ NO → Is it a reasoning model (R1 distill)?
├─ YES → surgical
└─ NO → Do you care about speed?
├─ YES → advanced
└─ NO → informed
→ YES: basic
→ NO: continue
Is it an MoE model (Mixtral, DeepSeek-MoE)?
→ YES: nuclear
→ NO: continue
Is it a reasoning model (R1, QwQ, CoT-focused)?
→ YES: surgical
→ NO: continue
Do you need the absolute best quality and have time?
→ YES: optimized
→ NO: advanced (recommended default)
Did advanced leave > 10% refusals?
→ YES: aggressive
→ Still refusing: nuclear
```
---
## Key Parameters
| Parameter | Range | Default | Effect |
|:--------------------|:---------|:--------|:--------------------------------------------|
| n_directions | 1-32 | auto | More = more thorough but riskier |
| regularization | 0.0-1.0 | 0.0 | Higher preserves more original behavior |
| refinement_passes | 1-5 | 1 | More catches self-repair (Ouroboros effect) |
| quantization | 4/8 bit | none | Saves VRAM, slight quality tradeoff |
| Parameter | Range | Default | Effect |
|:----------|:------|:--------|:-------|
| `--n-directions` | 1-32 | method-dependent | More directions = more complete removal, but higher damage risk |
| `--regularization` | 0.0-1.0 | 0.1 | Higher = more conservative (less removal, less damage) |
| `--refinement-passes` | 1-5 | 2 | More passes catch residual refusal, but diminishing returns |
| `--quantization` | 4bit, 8bit | none | Reduces VRAM usage; quality impact minimal for extraction |
| `--verify-sample-size` | 10-200 | 20 | More samples = more accurate refusal rate estimate |
---
## Troubleshooting
| Problem | Solution |
|:---------------------------|:--------------------------------------------------|
| Refusal rate still > 10% | Try aggressive/nuclear, add refinement passes |
| Perplexity up > 20% | Reduce n_directions, increase regularization |
| Model generates nonsense | Regularization too low, try 0.2-0.3 |
| OOM on GPU | Use 4-bit quantization, or try smaller model |
| MoE model barely changes | Use nuclear method (expert-granular) |
| CoT reasoning broken | Use surgical method (CoT-aware) |
| Problem | Likely Cause | Fix |
|:--------|:-------------|:----|
| Refusal rate > 20% | Too few directions | Increase `--n-directions`, try `aggressive` |
| Refusal rate 5-20% | Residual refusal | Add `--refinement-passes 3`, try `--direction-method svd` |
| Perplexity spike > 20% | Over-aggressive removal | Reduce `--n-directions`, increase `--regularization` |
| Repetitive output | Weight matrix damage | Use `basic` or reduce `--n-directions`, check norm preservation |
| MoE model still refuses | Non-expert-aware method | Switch to `nuclear` |
| Reasoning degraded | CoT directions damaged | Use `surgical` method |
| OOM during extraction | Insufficient VRAM | Add `--quantization 4bit` and/or `--large-model` |