fix: restore all removed bundled skills + fix skills sync system

- Restored 21 skills removed in commits 757d012 and 740dd92: accelerate, audiocraft, code-review, faiss, flash-attention, gguf, grpo-rl-training, guidance, llava, nemo-curator, obliteratus, peft, pytorch-fsdp, pytorch-lightning, simpo, slime, stable-diffusion, tensorrt-llm, torchtitan, trl-fine-tuning, whisper
- Rewrote sync_skills() with proper update semantics:
  * New skills (not in manifest): copied to user dir
  * Existing skills (in manifest + on disk): updated via hash comparison
  * User-deleted skills (in manifest, not on disk): respected, not re-added
  * Stale manifest entries (removed from bundled): cleaned from manifest
- Added sync_skills() to CLI startup (cmd_chat) and gateway startup (start_gateway); previously it only ran during 'hermes update'
- Updated cmd_update output to show new/updated/cleaned counts
- Rewrote tests: 20 tests covering manifest CRUD, dir hashing, fresh install, user deletion respect, update detection, stale cleanup, and name collision handling

75 bundled skills total. 2002 tests pass.
2026-03-06 15:57:12 -08:00
commit ab0f4126cf
parent 68fbae5692
74 changed files with 27881 additions and 44 deletions
--- a/skills/mlops/slime/SKILL.md
+++ b/skills/mlops/slime/SKILL.md
@@ -0,0 +1,467 @@
+---
+name: slime-rl-training
+description: Provides guidance for LLM post-training with RL using slime, a Megatron+SGLang framework. Use when training GLM models, implementing custom data generation workflows, or needing tight Megatron-LM integration for RL scaling.
+version: 1.0.0
+author: Orchestra Research
+license: MIT
+dependencies: [sglang-router>=0.2.3, ray, torch>=2.0.0, transformers>=4.40.0]
+metadata:
+  hermes:
+    tags: [Reinforcement Learning, Megatron-LM, SGLang, GRPO, Post-Training, GLM]
+
+---
+
+# slime: LLM Post-Training Framework for RL Scaling
+
+slime is an LLM post-training framework from Tsinghua's THUDM team, powering GLM-4.5, GLM-4.6, and GLM-4.7. It connects Megatron-LM for training with SGLang for high-throughput rollout generation.
+
+## When to Use slime
+
+**Choose slime when you need:**
+- Megatron-LM native training with SGLang inference
+- Custom data generation workflows with flexible data buffers
+- Training GLM, Qwen3, DeepSeek V3, or Llama 3 models
+- Research-grade framework with production backing (Z.ai)
+
+**Consider alternatives when:**
+- You need enterprise-grade stability features → use **miles**
+- You want flexible backend swapping → use **verl**
+- You need PyTorch-native abstractions → use **torchforge**
+
+## Key Features
+
+- **Training**: Megatron-LM with full parallelism support (TP, PP, DP, SP)
+- **Rollout**: SGLang-based high-throughput generation with router
+- **Data Buffer**: Flexible prompt management and sample storage
+- **Models**: GLM-4.x, Qwen3, DeepSeek V3/R1, Llama 3
+
+## Architecture Overview
+
+```
+┌─────────────────────────────────────────────────────────┐
+│                    Data Buffer                          │
+│ - Prompt initialization and management                  │
+│ - Custom data generation and filtering                  │
+│ - Rollout sample storage                                │
+└─────────────┬───────────────────────────┬───────────────┘
+              │                           │
+┌─────────────▼───────────┐ ┌─────────────▼───────────────┐
+│ Training (Megatron-LM)  │ │ Rollout (SGLang + Router)   │
+│ - Actor model training  │ │ - Response generation       │
+│ - Critic (optional)     │ │ - Reward/verifier output    │
+│ - Weight sync to rollout│ │ - Multi-turn support        │
+└─────────────────────────┘ └─────────────────────────────┘
+```
+
+## Installation
+
+```bash
+# Recommended: Docker
+docker pull slimerl/slime:latest
+docker run --rm --gpus all --ipc=host --shm-size=16g \
+  -it slimerl/slime:latest /bin/bash
+
+# Inside container
+cd /root/slime && pip install -e . --no-deps
+```
+
+### From Source
+
+```bash
+git clone https://github.com/THUDM/slime.git
+cd slime
+pip install -r requirements.txt
+pip install -e .
+```
+
+## Quick Start: GRPO Training
+
+```bash
+# Source model configuration
+source scripts/models/qwen3-4B.sh
+
+# Launch training
+python train.py \
+    --actor-num-nodes 1 \
+    --actor-num-gpus-per-node 4 \
+    --rollout-num-gpus 4 \
+    --advantage-estimator grpo \
+    --use-kl-loss --kl-loss-coef 0.001 \
+    --rollout-batch-size 32 \
+    --n-samples-per-prompt 8 \
+    --global-batch-size 256 \
+    --num-rollout 3000 \
+    --prompt-data /path/to/data.jsonl \
+    ${MODEL_ARGS[@]} ${CKPT_ARGS[@]}
+```
+
+---
+
+## Workflow 1: Standard GRPO Training
+
+Use this workflow for training reasoning models with group-relative advantages.
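+
+"Group-relative" means that each response's reward is normalized against the other responses sampled for the same prompt (the group size is `--n-samples-per-prompt`). The sketch below illustrates that normalization with the standard library only. It shows the general GRPO idea rather than slime's implementation; the function name and the `eps` term are assumptions.
+
+```python
+from statistics import mean, pstdev
+
+
+def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
+    """Normalize rewards within one prompt's group: (r - mean) / (std + eps)."""
+    mu = mean(rewards)
+    sigma = pstdev(rewards)  # population standard deviation of the group
+    return [(r - mu) / (sigma + eps) for r in rewards]
+
+
+# One prompt, four sampled responses scored 1.0 (correct) or 0.0 (incorrect)
+print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
+```
+
+A group in which every response receives the same reward yields all-zero advantages, so prompts that are always solved or never solved contribute no gradient signal.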
+
+### Prerequisites Checklist
+- [ ] Docker environment or Megatron-LM + SGLang installed
+- [ ] Model checkpoint (HuggingFace or Megatron format)
+- [ ] Training data in JSONL format
+
+### Step 1: Prepare Data
+
+Each line of `data.jsonl` is one JSON object holding a prompt and its label:
+
+```json
+{"prompt": "What is 2 + 2?", "label": "4"}
+{"prompt": "Solve: 3x = 12", "label": "x = 4"}
+```
+
+Or with chat format (pretty-printed here; in the file each record must occupy a single line):
+```json
+{
+    "prompt": [
+        {"role": "system", "content": "You are a math tutor."},
+        {"role": "user", "content": "What is 15 + 27?"}
+    ],
+    "label": "42"
+}
+```
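+
+JSONL stores one JSON object per line, so a multi-line chat record like the one above has to be serialized without line breaks. A minimal sketch using `json.dumps`; the helper name `to_jsonl` is an assumption, and the two records only demonstrate the two prompt styles (a real dataset would normally use one style throughout).
+
+```python
+import json
+
+records = [
+    {"prompt": "What is 2 + 2?", "label": "4"},
+    {
+        "prompt": [
+            {"role": "system", "content": "You are a math tutor."},
+            {"role": "user", "content": "What is 15 + 27?"},
+        ],
+        "label": "42",
+    },
+]
+
+
+def to_jsonl(rows: list[dict]) -> str:
+    """Serialize each record onto its own line, as JSONL requires."""
+    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)
+
+
+# Redirect this output to the file you pass to --prompt-data
+print(to_jsonl(records), end="")
+```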
+
+### Step 2: Configure Model
+
+Choose a pre-configured model script:
+
+```bash
+# List available models
+ls scripts/models/
+# glm4-9B.sh, qwen3-4B.sh, qwen3-30B-A3B.sh, deepseek-v3.sh, llama3-8B.sh, ...
+
+# Source your model
+source scripts/models/qwen3-4B.sh
+```
+
+### Step 3: Launch Training
+
+```bash
+python train.py \
+    --actor-num-nodes 1 \
+    --actor-num-gpus-per-node 8 \
+    --rollout-num-gpus 8 \
+    --advantage-estimator grpo \
+    --use-kl-loss \
+    --kl-loss-coef 0.001 \
+    --prompt-data /path/to/train.jsonl \
+    --input-key prompt \
+    --label-key label \
+    --apply-chat-template \
+    --rollout-batch-size 32 \
+    --n-samples-per-prompt 8 \
+    --global-batch-size 256 \
+    --num-rollout 3000 \
+    --save-interval 100 \
+    --eval-interval 50 \
+    ${MODEL_ARGS[@]}
+```
+
+### Step 4: Monitor Training
+- [ ] Check TensorBoard: `tensorboard --logdir outputs/`
+- [ ] Verify reward curves are increasing
+- [ ] Monitor GPU utilization across nodes
+
+---
+
+## Workflow 2: Asynchronous Training
+
+Use async mode for higher throughput by overlapping rollout and training.
+
+### When to Use Async
+- Large models with long generation times
+- High GPU idle time in synchronous mode
+- Sufficient memory for buffering
+
+### Launch Async Training
+
+```bash
+python train_async.py \
+    --actor-num-nodes 1 \
+    --actor-num-gpus-per-node 8 \
+    --rollout-num-gpus 8 \
+    --advantage-estimator grpo \
+    --async-buffer-size 4 \
+    --prompt-data /path/to/train.jsonl \
+    ${MODEL_ARGS[@]}
+```
+
+### Async-Specific Parameters
+
+```bash
+--async-buffer-size 4        # Number of rollouts to buffer
+--update-weights-interval 2  # Sync weights every N rollouts
+```
+
+---
+
+## Workflow 3: Multi-Turn Agentic Training
+
+Use this workflow for training agents with tool use or multi-step reasoning.
+
+### Prerequisites
+- [ ] Custom generate function for multi-turn logic
+- [ ] Tool/environment interface
+
+### Step 1: Define Custom Generate Function
+
+```python
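+# custom_generate.py
+# generate_single, extract_tool_call, execute_tool and compute_reward are
+# placeholders for your own helpers; they are not provided by slime.
+async def custom_generate(args, samples, evaluation=False):
+    """Multi-turn generation with tool calling."""
+    for sample in samples:
+        # Copy the chat history so appended turns do not mutate sample.prompt
+        conversation = list(sample.prompt)
+        response = ""
+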
+        for turn in range(args.max_turns):
+            # Generate response
+            response = await generate_single(conversation)
+
+            # Check for tool call
+            tool_call = extract_tool_call(response)
+            if tool_call:
+                tool_result = execute_tool(tool_call)
+                conversation.append({"role": "assistant", "content": response})
+                conversation.append({"role": "tool", "content": tool_result})
+            else:
+                break
+
+        sample.response = response
+        sample.reward = compute_reward(sample)
+
+    return samples
+```
+
+### Step 2: Launch with Custom Function
+
+```bash
+python train.py \
+    --custom-generate-function-path custom_generate.py \
+    --max-turns 5 \
+    --prompt-data /path/to/agent_data.jsonl \
+    ${MODEL_ARGS[@]}
+```
+
+See `examples/search-r1/` for a complete multi-turn search example.
+
+---
+
+## Configuration Reference
+
+### Three Argument Categories
+
+slime uses three types of arguments:
+
+**1. Megatron Arguments** (passed directly):
+```bash
+--tensor-model-parallel-size 2
+--pipeline-model-parallel-size 1
+--num-layers 32
+--hidden-size 4096
+```
+
+**2. SGLang Arguments** (prefixed with `--sglang-`):
+```bash
+--sglang-mem-fraction-static 0.8
+--sglang-context-length 8192
+--sglang-log-level INFO
+```
+
+**3. slime Arguments**:
+```bash
+# Resource allocation
+--actor-num-nodes 1
+--actor-num-gpus-per-node 8
+--rollout-num-gpus 8
+--colocate  # Share GPUs between training/inference
+
+# Data
+--prompt-data /path/to/data.jsonl
+--input-key prompt
+--label-key label
+
+# Training loop
+--num-rollout 3000
+--rollout-batch-size 32
+--n-samples-per-prompt 8
+--global-batch-size 256
+
+# Algorithm
+--advantage-estimator grpo  # or: gspo, ppo, reinforce_plus_plus
+--use-kl-loss
+--kl-loss-coef 0.001
+```
+
+### Key Constraints
+
+```
+rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout
+```
+
+Example: 32 × 8 = 256 × 1
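+
+The constraint can be checked before launching a job. The helper below is a small sketch (its name is an assumption, not a slime function) that derives `num_steps_per_rollout` from the other three values and rejects combinations that do not divide evenly.
+
+```python
+def steps_per_rollout(rollout_batch_size: int, n_samples_per_prompt: int, global_batch_size: int) -> int:
+    """Return the num_steps_per_rollout implied by the constraint, or raise if it is not an integer."""
+    samples = rollout_batch_size * n_samples_per_prompt
+    if samples % global_batch_size != 0:
+        raise ValueError(
+            f"{samples} samples per rollout is not a multiple of global batch size {global_batch_size}"
+        )
+    return samples // global_batch_size
+
+
+print(steps_per_rollout(32, 8, 256))  # 1, matching the example: 32 * 8 = 256 * 1
+```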
+
+---
+
+## Data Buffer System
+
+slime's data buffer enables flexible data management:
+
+### Basic Data Source
+
+```python
+class RolloutDataSource:
+    def get_samples(self, num_samples):
+        """Fetch prompts from dataset."""
+        return self.dataset.sample(num_samples)
+
+    def add_samples(self, samples):
+        """Called after generation (no-op by default)."""
+        pass
+```
+
+### Buffered Data Source (Off-Policy)
+
+```python
+class RolloutDataSourceWithBuffer(RolloutDataSource):
+    def __init__(self):
+        self.buffer = []
+
+    def add_samples(self, samples):
+        """Store generated samples for reuse."""
+        self.buffer.extend(samples)
+
+    def buffer_filter(self, args, buffer, num_samples):
+        """Custom selection logic (prioritized, stratified, etc.)."""
+        return select_best(buffer, num_samples)
+```
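+
+`select_best` in the snippet above is left undefined. One possible version is sketched below: it returns the highest-reward samples and removes them from the buffer so they are not replayed. `BufferedSample` is a stand-in for slime's `Sample`, and the remove-on-select policy is an assumption, not documented slime behavior.
+
+```python
+from dataclasses import dataclass
+from typing import Optional
+
+
+@dataclass
+class BufferedSample:
+    """Stand-in for slime's Sample with only the fields this sketch needs."""
+    prompt: str
+    reward: Optional[float] = None
+
+
+def select_best(buffer: list[BufferedSample], num_samples: int) -> list[BufferedSample]:
+    """Remove and return up to num_samples scored samples with the highest reward."""
+    scored = sorted(
+        (s for s in buffer if s.reward is not None),
+        key=lambda s: s.reward,
+        reverse=True,
+    )
+    chosen = scored[:num_samples]
+    chosen_ids = {id(s) for s in chosen}
+    # Keep everything that was not chosen, including samples without a reward yet
+    buffer[:] = [s for s in buffer if id(s) not in chosen_ids]
+    return chosen
+```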
+
+---
+
+## Common Issues and Solutions
+
+### Issue: SGLang Engine Crash
+
+**Symptoms**: Inference engine dies mid-training
+
+**Solutions**:
+```bash
+# Enable fault tolerance
+--use-fault-tolerance
+
+# Increase memory allocation
+--sglang-mem-fraction-static 0.85
+
+# Reduce batch size
+--rollout-batch-size 16
+```
+
+### Issue: Weight Sync Timeout
+
+**Symptoms**: Training hangs after rollout
+
+**Solutions**:
+```bash
+# Increase sync interval (async training)
+--update-weights-interval 5
+
+# Use colocated mode (no network transfer; synchronous training only)
+--colocate
+```
+
+### Issue: OOM During Training
+
+**Symptoms**: CUDA OOM in backward pass
+
+**Solutions**:
+```bash
+# Enable gradient checkpointing
+--recompute-activations
+
+# Reduce micro-batch size
+--micro-batch-size 1
+
+# Enable sequence parallelism
+--sequence-parallel
+```
+
+### Issue: Slow Data Loading
+
+**Symptoms**: GPU idle during data fetch
+
+**Solutions**:
+```bash
+# Increase data workers
+--num-data-workers 4
+
+# Use streaming dataset
+--streaming-data
+```
+
+---
+
+## Supported Models
+
+| Model Family | Configurations |
+|--------------|----------------|
+| GLM | GLM-4.5, GLM-4.6, GLM-4.7, GLM-Z1-9B |
+| Qwen | Qwen3 (4B, 8B, 30B-A3B), Qwen3-MoE, Qwen2.5 |
+| DeepSeek | V3, V3.1, R1 |
+| Llama | Llama 3 (8B, 70B) |
+| Others | Kimi K2, Moonlight-16B |
+
+Each model has pre-configured scripts in `scripts/models/`.
+
+---
+
+## Advanced Topics
+
+### Co-location Mode
+
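+Run training and inference on the same GPUs to reduce the number of GPUs required. Both workloads then share device memory, so lower SGLang's static memory fraction: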
+
+```bash
+python train.py \
+    --colocate \
+    --actor-num-gpus-per-node 8 \
+    --sglang-mem-fraction-static 0.4 \
+    ${MODEL_ARGS[@]}
+```
+
+### Custom Reward Model
+
+```python
+# custom_rm.py
+class CustomRewardModel:
+    def __init__(self, model_path):
+        self.model = load_model(model_path)
+
+    def compute_reward(self, prompts, responses):
+        inputs = self.tokenize(prompts, responses)
+        scores = self.model(inputs)
+        return scores.tolist()
+```
+
+```bash
+--custom-rm-path custom_rm.py
+```
+
+### Multi-Task Evaluation
+
+```bash
+--eval-prompt-data aime /path/to/aime.jsonl \
+--eval-prompt-data gsm8k /path/to/gsm8k.jsonl \
+--n-samples-per-eval-prompt 16
+```
+
+---
+
+## Resources
+
+- **Documentation**: https://thudm.github.io/slime/
+- **GitHub**: https://github.com/THUDM/slime
+- **Blog**: https://lmsys.org/blog/2025-07-09-slime/
+- **Examples**: See `examples/` directory for 14+ worked examples
+
--- a/skills/mlops/slime/references/api-reference.md
+++ b/skills/mlops/slime/references/api-reference.md
@@ -0,0 +1,392 @@
+# slime API Reference
+
+## Architecture Overview
+
+slime operates with a three-module architecture orchestrated by Ray:
+
+```
+┌─────────────────────────────────────────────────────────┐
+│                    Data Buffer                          │
+│ - Prompt initialization and management                  │
+│ - Custom data generation and filtering                  │
+│ - Rollout sample storage                                │
+└─────────────┬───────────────────────────┬───────────────┘
+              │                           │
+┌─────────────▼───────────┐ ┌─────────────▼───────────────┐
+│ Training (Megatron-LM)  │ │ Rollout (SGLang + Router)   │
+│ - Actor model training  │ │ - Response generation       │
+│ - Critic (optional)     │ │ - Reward/verifier output    │
+│ - Weight sync to rollout│ │ - Multi-turn support        │
+└─────────────────────────┘ └─────────────────────────────┘
+```
+
+## Core Data Structures
+
+### Sample Object
+
+The `Sample` object is the core data structure defined in `slime/utils/types.py`:
+
+```python
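+# Abridged from slime/utils/types.py; Status is the enum shown in the next section
+from dataclasses import dataclass, field
+from typing import Any, Optional
+
+@dataclass
+class Sample:
+    # Core fields
+    group_index: Optional[int] = None       # Group index for batching
+    index: Optional[int] = None             # Sample index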
+    prompt: str | list[dict] = ""           # Input prompt or chat history
+    tokens: list[int] = field(default_factory=list)  # Token IDs
+    response: str = ""                      # Generated response
+    response_length: int = 0                # Response length in tokens
+    label: Optional[str] = None             # Ground truth label
+    reward: Optional[float | dict] = None   # RL reward signal
+    loss_mask: Optional[list[int]] = None   # 1=compute loss, 0=mask
+    status: Status = Status.PENDING         # Sample status
+    metadata: dict = field(default_factory=dict)  # Custom data
+
+    # Multimodal support
+    multimodal_inputs: Optional[Any] = None       # Raw multimodal data (images, videos)
+    multimodal_train_inputs: Optional[Any] = None # Processed multimodal data (pixel_values)
+
+    # Rollout tracking
+    weight_versions: list[str] = field(default_factory=list)
+    rollout_log_probs: Optional[list[float]] = None    # Log probs from SGLang
+    rollout_routed_experts: Optional[list[list[int]]] = None  # Expert routing (MoE)
+
+    # Control fields
+    remove_sample: bool = False
+    generate_function_path: Optional[str] = None
+    train_metadata: Optional[dict] = None
+    non_generation_time: float = 0.0
+
+    # Speculative decoding info (nested dataclass)
+    @dataclass
+    class SpecInfo:
+        spec_accept_token_num: int = 0
+        spec_draft_token_num: int = 0
+        spec_verify_ct: int = 0
+        completion_token_num: int = 0
+```
+
+### Status Enum
+
+```python
+from enum import Enum
+
+class Status(Enum):
+    PENDING = "pending"           # Not yet processed
+    COMPLETED = "completed"       # Successfully generated
+    TRUNCATED = "truncated"       # Hit max length
+    ABORTED = "aborted"           # Generation stopped before completion
+    FAILED = "failed"             # Generation failed
+```
+
+## Configuration System
+
+slime uses three categories of command-line arguments:
+
+### 1. Megatron Arguments
+
+All Megatron-LM arguments are supported directly:
+
+```bash
+--tensor-model-parallel-size 2
+--pipeline-model-parallel-size 1
+--num-layers 32
+--hidden-size 4096
+--num-attention-heads 32
+--seq-length 4096
+--micro-batch-size 1
+--global-batch-size 256
+```
+
+### 2. SGLang Arguments
+
+SGLang arguments are prefixed with `--sglang-`:
+
+```bash
+--sglang-mem-fraction-static 0.8   # GPU memory for KV cache
+--sglang-context-length 8192       # Maximum context length
+--sglang-log-level INFO            # Logging verbosity
+--sglang-tp-size 2                 # Tensor parallelism
+--sglang-disable-cuda-graph        # Disable CUDA graphs
+```
+
+### 3. slime-Specific Arguments
+
+Defined in `slime/utils/arguments.py`:
+
+```bash
+# Resource Allocation
+--actor-num-nodes 1                # Training nodes
+--actor-num-gpus-per-node 8        # GPUs per training node
+--rollout-num-gpus 8               # Total rollout GPUs
+--rollout-num-gpus-per-engine 2    # GPUs per SGLang engine
+--colocate                         # Share GPUs for train/inference
+
+# Data Configuration
+--prompt-data /path/to/data.jsonl  # Training data path
+--input-key prompt                 # Key for prompts in JSON
+--label-key label                  # Key for labels in JSON
+--apply-chat-template              # Apply chat formatting
+
+# Training Loop
+--num-rollout 3000                 # Total rollout iterations
+--rollout-batch-size 32            # Prompts per rollout
+--n-samples-per-prompt 8           # Responses per prompt
+--global-batch-size 256            # Training batch size
+--num-steps-per-rollout 1          # Training steps per rollout
+
+# RL Algorithm
+--advantage-estimator grpo         # grpo, gspo, ppo, reinforce_plus_plus
+--use-kl-loss                      # Enable KL loss
+--kl-loss-coef 0.001               # KL coefficient
+--calculate-per-token-loss         # Token-level loss
+
+# Off-Policy Options
+--use-tis                          # Truncated Importance Sampling
+--tis-threshold 0.9                # TIS threshold
+--true-on-policy-mode              # Force on-policy training
+```
+
+## Data Buffer System
+
+### RolloutDataSource (Base Class)
+
+```python
+# Simplified view of the base-class interface
+from slime.utils.types import Sample
+
+class RolloutDataSource:
+    def __init__(self, dataset, args):
+        self.dataset = dataset
+        self.args = args
+
+    def get_samples(self, num_samples: int) -> list[Sample]:
+        """Fetch prompts from dataset."""
+        return [Sample(prompt=p) for p in self.dataset.sample(num_samples)]
+
+    def add_samples(self, samples: list[Sample]) -> None:
+        """Called after generation (no-op by default)."""
+        pass
+```
+
+### Buffered Data Source (Off-Policy)
+
+```python
+# Simplified sketch; extends the RolloutDataSource base class shown above
+
+class RolloutDataSourceWithBuffer(RolloutDataSource):
+    def __init__(self, dataset, args):
+        super().__init__(dataset, args)
+        self.buffer = []
+
+    def add_samples(self, samples: list[Sample]) -> None:
+        """Store generated samples for reuse."""
+        self.buffer.extend(samples)
+
+    def buffer_filter(self, args, buffer, num_samples) -> list[Sample]:
+        """Custom selection logic."""
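+        # Example: prioritized sampling based on reward (unscored samples sort last)
+        def score(s: Sample) -> float:
+            return s.reward if isinstance(s.reward, (int, float)) else float("-inf")
+        sorted_buffer = sorted(buffer, key=score, reverse=True)
+        return sorted_buffer[:num_samples]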
+```
+
+## Custom Functions
+
+### Custom Generate Function
+
+For multi-turn or tool-calling scenarios:
+
+```python
+# custom_generate.py
+from slime.utils.types import Sample
+
+async def custom_generate(args, samples: list[Sample], evaluation: bool = False) -> list[Sample]:
+    """
+    Custom generation function for multi-turn interactions.
+
+    Args:
+        args: Training arguments
+        samples: List of Sample objects with prompts
+        evaluation: Whether this is an evaluation run
+
+    Returns:
+        List of Sample objects with responses and rewards
+    """
+    for sample in samples:
+        conversation = sample.prompt if isinstance(sample.prompt, list) else [
+            {"role": "user", "content": sample.prompt}
+        ]
+
+        for turn in range(args.max_turns):
+            # Generate response
+            response = await generate_single(conversation)
+
+            # Check for tool call
+            tool_call = extract_tool_call(response)
+            if tool_call:
+                # Execute tool
+                tool_result = await execute_tool(tool_call)
+                conversation.append({"role": "assistant", "content": response})
+                conversation.append({"role": "tool", "content": tool_result})
+            else:
+                # Final response
+                break
+
+        # Record the last response, even if max_turns ran out during tool calls
+        sample.response = response
+
+        # Compute reward
+        sample.reward = compute_reward(sample)
+
+        # Set loss mask (1 for model tokens, 0 for tool responses)
+        sample.loss_mask = build_loss_mask(sample)
+
+    return samples
+```
+
+Usage:
+```bash
+python train.py \
+    --custom-generate-function-path custom_generate.py \
+    --max-turns 5
+```
+
+### Custom Reward Function
+
+```python
+# custom_rm.py
+from slime.utils.types import Sample
+
+async def reward_func(args, sample: Sample, **kwargs) -> float:
+    """
+    Compute reward for a single sample.
+
+    Args:
+        args: Training arguments
+        sample: Sample object with response
+
+    Returns:
+        Reward score (float)
+    """
+    response = sample.response
+    ground_truth = sample.label or sample.metadata.get("answer", "")
+
+    # Example: exact match reward
+    if response.strip() == ground_truth.strip():
+        return 1.0
+    return 0.0
+
+# For batched processing (more efficient)
+async def batched_custom_rm(args, samples: list[Sample]) -> list[float]:
+    """Batch reward computation."""
+    rewards = []
+    for sample in samples:
+        reward = await reward_func(args, sample)
+        rewards.append(reward)
+    return rewards
+```
+
+Usage:
+```bash
+python train.py \
+    --custom-rm-path custom_rm.py \
+    --group-rm  # Enable batched processing
+```
+
+## Model Configuration
+
+### Pre-configured Model Scripts
+
+Located in `scripts/models/`:
+
+```bash
+# List available models
+ls scripts/models/
+# glm4-9B.sh, qwen3-4B.sh, qwen3-30B-A3B.sh, deepseek-v3.sh, llama3-8B.sh
+
+# Source model configuration
+source scripts/models/qwen3-4B.sh
+# This sets MODEL_ARGS and CKPT_ARGS arrays
+```
+
+### Example Model Script
+
+```bash
+# scripts/models/qwen3-4B.sh
+export MODEL_ARGS=(
+    --num-layers 36
+    --hidden-size 2560
+    --num-attention-heads 20
+    --num-query-groups 4
+    --ffn-hidden-size 6912
+    --max-position-embeddings 32768
+    --rotary-percent 1.0
+    --rotary-base 1000000
+    --swiglu
+    --untie-embeddings-and-output-weights
+    --no-position-embedding
+    --normalization RMSNorm
+    --tokenizer-type HuggingFaceTokenizer
+    --bf16
+)
+
+export CKPT_ARGS=(
+    --hf-checkpoint /path/to/qwen3-4b-hf
+    --initial-megatron-checkpoint /path/to/megatron/ckpt
+)
+```
+
+## Async Training
+
+### Enabling Async Mode
+
+```bash
+python train_async.py \
+    --actor-num-gpus-per-node 8 \
+    --rollout-num-gpus 8 \
+    --async-buffer-size 4 \
+    --update-weights-interval 2 \
+    ${MODEL_ARGS[@]}
+```
+
+### Async-Specific Parameters
+
+```bash
+--async-buffer-size 4            # Number of rollouts to buffer
+--update-weights-interval 2      # Sync weights every N rollouts
+```
+
+**Note**: Colocated mode (`--colocate`) is NOT supported with async training.
+
+## Evaluation
+
+### Multi-Task Evaluation
+
+```bash
+--eval-prompt-data aime /path/to/aime.jsonl \
+--eval-prompt-data gsm8k /path/to/gsm8k.jsonl \
+--n-samples-per-eval-prompt 16 \
+--eval-interval 50
+```
+
+### Evaluation Configuration
+
+```bash
+--eval-interval 50               # Evaluate every N rollouts
+--n-samples-per-eval-prompt 16   # Samples for evaluation
+--eval-temperature 0.0           # Greedy decoding for eval
+```
+
+## Supported Models
+
+| Model Family | Configurations |
+|--------------|----------------|
+| GLM | GLM-4.5, GLM-4.6, GLM-4.7, GLM-Z1-9B |
+| Qwen | Qwen3 (4B, 8B, 30B-A3B), Qwen3-MoE, Qwen2.5 |
+| DeepSeek | V3, V3.1, R1 |
+| Llama | Llama 3 (8B, 70B) |
+| Others | Kimi K2, Moonlight-16B |
+
+## Resources
+
+- Documentation: https://thudm.github.io/slime/
+- GitHub: https://github.com/THUDM/slime
+- Blog: https://lmsys.org/blog/2025-07-09-slime/
+- Examples: `examples/` directory (14+ worked examples)
--- a/skills/mlops/slime/references/troubleshooting.md
+++ b/skills/mlops/slime/references/troubleshooting.md
@@ -0,0 +1,386 @@
+# slime Troubleshooting Guide
+
+## Common Issues and Solutions
+
+### SGLang Issues
+
+#### Issue: SGLang Engine Crash
+
+**Symptoms**: Inference engine dies mid-training, connection errors
+
+**Solutions**:
+
+1. **Enable fault tolerance**:
+```bash
+--use-fault-tolerance
+```
+
+2. **Increase memory allocation**:
+```bash
+--sglang-mem-fraction-static 0.85  # Increase from 0.8
+```
+
+3. **Reduce batch size**:
+```bash
+--rollout-batch-size 16  # Reduce from 32
+```
+
+4. **Disable CUDA graphs** (for debugging):
+```bash
+--sglang-disable-cuda-graph
+```
+
+#### Issue: SGLang Router Load Imbalance
+
+**Symptoms**: Some SGLang engines overloaded while others idle
+
+**Solutions**:
+
+1. **Adjust routing strategy**:
+```bash
+--sglang-router-strategy round_robin
+```
+
+2. **Increase number of engines**:
+```bash
+--rollout-num-gpus-per-engine 1  # More engines, fewer GPUs each
+```
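+
+To see how the per-engine setting changes the number of engines the router can balance across, the arithmetic can be sketched as follows. This assumes the engine count is simply the total rollout GPUs divided by the GPUs per engine; the helper name is hypothetical.
+
+```python
+def num_rollout_engines(rollout_num_gpus: int, gpus_per_engine: int) -> int:
+    """Engines available to the router, assuming GPUs are split evenly."""
+    if rollout_num_gpus % gpus_per_engine != 0:
+        raise ValueError("rollout GPUs must divide evenly across engines")
+    return rollout_num_gpus // gpus_per_engine
+
+
+# With 8 rollout GPUs: compare 1, 2 and 4 GPUs per engine
+for gpus_per_engine in (1, 2, 4):
+    print(gpus_per_engine, num_rollout_engines(8, gpus_per_engine))
+```
+
+Fewer GPUs per engine only works if the model still fits in the memory of a single engine.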
+
+### Weight Synchronization Issues
+
+#### Issue: Weight Sync Timeout
+
+**Symptoms**: Training hangs after rollout, timeout errors
+
+**Solutions**:
+
+1. **Increase sync interval** (async mode):
+```bash
+--update-weights-interval 5  # Increase from 2
+```
+
+2. **Use colocated mode** (eliminates network transfer; not available with `train_async.py`):
+```bash
+--colocate
+```
+
+3. **Check network bandwidth**:
+```bash
+# Verify InfiniBand is enabled
+ibstat
+```
+
+#### Issue: Weight Sync Failures in Multi-Node
+
+**Symptoms**: Nodes fail to receive updated weights
+
+**Solutions**:
+
+1. **Set NCCL environment**:
+```bash
+export NCCL_DEBUG=INFO
+export NCCL_SOCKET_IFNAME=eth0
+export NCCL_IB_DISABLE=0
+```
+
+2. **Increase the distributed timeout** (Megatron-LM argument, in minutes):
+```bash
+--distributed-timeout-minutes 30
+```
+
+### Memory Issues
+
+#### Issue: OOM During Training
+
+**Symptoms**: CUDA OOM in backward pass
+
+**Solutions**:
+
+1. **Enable gradient checkpointing**:
+```bash
+--recompute-activations
+```
+
+2. **Reduce micro-batch size**:
+```bash
+--micro-batch-size 1
+```
+
+3. **Enable sequence parallelism**:
+```bash
+--sequence-parallel
+```
+
+4. **Reduce global batch size**:
+```bash
+--global-batch-size 128  # Reduce from 256
+```
+
+#### Issue: OOM in Colocated Mode
+
+**Symptoms**: OOM when both training and inference run on same GPUs
+
+**Solutions**:
+
+1. **Reduce SGLang memory**:
+```bash
+--sglang-mem-fraction-static 0.4  # Reduce from 0.8
+```
+
+2. **Enable offloading**:
+```bash
+--offload-optimizer-states
+```
+
+3. **Use smaller sequence length**:
+```bash
+--seq-length 2048  # Reduce from 4096
+```
+
+### Data Loading Issues
+
+#### Issue: Slow Data Loading
+
+**Symptoms**: GPU idle during data fetch, low GPU utilization
+
+**Solutions**:
+
+1. **Increase data workers**:
+```bash
+--num-data-workers 4
+```
+
+2. **Use streaming dataset**:
+```bash
+--streaming-data
+```
+
+3. **Pre-tokenize data**:
+```python
+# Pre-process data offline
+from transformers import AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("model_path")
+# Save tokenized data
+```
+
+#### Issue: Data Format Errors
+
+**Symptoms**: KeyError, missing fields, parsing failures
+
+**Solutions**:
+
+1. **Verify data format**:
+```python
+import json
+with open("data.jsonl") as f:
+    for line in f:
+        data = json.loads(line)
+        assert "prompt" in data, "Missing prompt field"
+        assert "label" in data, "Missing label field"
+```
+
+2. **Check key names**:
+```bash
+--input-key prompt  # Must match your data
+--label-key label   # Must match your data
+```
+
+### Training Stability Issues
+
+#### Issue: Loss Explosion / NaN
+
+**Symptoms**: Loss becomes NaN or explodes
+
+**Solutions**:
+
+1. **Reduce learning rate**:
+```bash
+--lr 1e-6  # Reduce from 5e-6
+```
+
+2. **Enable gradient clipping**:
+```bash
+--clip-grad 1.0
+```
+
+3. **Check for data issues**:
+```python
+# Verify no empty prompts or responses
+for sample in dataset:
+    assert len(sample["prompt"]) > 0
+```
+
+4. **Use BF16 instead of FP16**:
+```bash
+--bf16  # More numerically stable
+```
+
+#### Issue: Reward Collapse
+
+**Symptoms**: Reward drops to zero, model outputs garbage
+
+**Solutions**:
+
+1. **Increase KL penalty**:
+```bash
+--kl-loss-coef 0.01  # Increase from 0.001
+```
+
+2. **Reduce number of samples**:
+```bash
+--n-samples-per-prompt 4  # Reduce from 8
+```
+
+3. **Verify reward function**:
+```python
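+# Test reward function independently (reward_func is async, so it must be awaited)
+import asyncio
+from custom_rm import reward_func
+from slime.utils.types import Sample
+
+args = None  # Placeholder: substitute your parsed training arguments
+sample = Sample(prompt="test", response="test response")
+reward = asyncio.run(reward_func(args, sample))
+print(f"Reward: {reward}")  # Should be reasonable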
+```
+
+### Async Training Issues
+
+#### Issue: Async Training Not Supported with Colocate
+
+**Symptoms**: Error when using `--colocate` with `train_async.py`
+
+**Solution**: Colocated mode is NOT supported for async training. Use separate GPUs:
+```bash
+# Remove --colocate flag
+python train_async.py \
+    --actor-num-gpus-per-node 4 \
+    --rollout-num-gpus 4 \
+    # No --colocate
+```
+
+#### Issue: Stale Weights in Async Mode
+
+**Symptoms**: Policy divergence, inconsistent behavior
+
+**Solutions**:
+
+1. **Reduce async buffer size**:
+```bash
+--async-buffer-size 2  # Reduce from 4
+```
+
+2. **Increase weight update frequency**:
+```bash
+--update-weights-interval 1  # Sync every rollout
+```
+
+### Multi-Turn Training Issues
+
+#### Issue: Tool Responses Included in Loss
+
+**Symptoms**: Model learns to output tool responses verbatim
+
+**Solution**: Properly set loss mask in custom generate function:
+```python
+def build_loss_mask(sample):
+    """Create loss mask that excludes tool responses (is_tool_response is user-defined)."""
+    mask = []
+    for token in sample.tokens:
+        if is_tool_response(token, sample.metadata):
+            mask.append(0)  # Don't compute loss
+        else:
+            mask.append(1)  # Compute loss
+    return mask
+```
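+
+`is_tool_response` is not part of slime and has to be supplied by you. One way to avoid classifying tokens after the fact is to record, while the conversation is being built, which token spans came from the model. The sketch below assumes you keep such `(token_ids, from_model)` segments yourself; that bookkeeping and the function name are hypothetical.
+
+```python
+def build_tokens_and_mask(segments: list[tuple[list[int], bool]]) -> tuple[list[int], list[int]]:
+    """Concatenate token segments and emit 1 for model tokens, 0 for tool tokens."""
+    tokens: list[int] = []
+    loss_mask: list[int] = []
+    for token_ids, from_model in segments:
+        tokens.extend(token_ids)
+        loss_mask.extend([1 if from_model else 0] * len(token_ids))
+    return tokens, loss_mask
+
+
+# Assistant turn, tool result, assistant turn (token IDs are made up)
+segments = [([11, 12, 13], True), ([21, 22], False), ([31], True)]
+tokens, loss_mask = build_tokens_and_mask(segments)
+print(loss_mask)  # [1, 1, 1, 0, 0, 1]
+```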
+
+#### Issue: Multi-Turn Context Too Long
+
+**Symptoms**: OOM or truncation in multi-turn conversations
+
+**Solutions**:
+
+1. **Limit conversation history**:
+```python
+# In custom generate function
+conversation = sample.prompt[-10:]  # Keep the last 10 messages (re-add the system prompt if you rely on one)
+```
+
+2. **Increase context length**:
+```bash
+--sglang-context-length 16384
+```
+
+### Checkpoint Issues
+
+#### Issue: Checkpoint Loading Fails
+
+**Symptoms**: Cannot load saved checkpoint
+
+**Solutions**:
+
+1. **Verify checkpoint path**:
+```bash
+ls -la /path/to/checkpoint/
+```
+
+2. **Check parallelism matches**:
+```bash
+# Checkpoint was saved with TP=2, must load with TP=2
+--tensor-model-parallel-size 2
+```
+
+3. **Convert HuggingFace to Megatron** (if needed):
+```bash
+python tools/convert_hf_to_megatron.py \
+    --hf_model_path /path/to/hf/model \
+    --save_path /path/to/megatron/checkpoint
+```
+
+### Debugging Tips
+
+#### Enable Verbose Logging
+
+```bash
+--log-level DEBUG
+export SLIME_DEBUG=1
+```
+
+#### Check GPU Utilization
+
+```bash
+watch -n 1 nvidia-smi
+```
+
+#### Monitor Training
+
+```bash
+tensorboard --logdir outputs/
+```
+
+#### Test Custom Functions Independently
+
+```python
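+# Test reward function
+import asyncio
+from custom_rm import reward_func
+from slime.utils.types import Sample
+
+args = None  # Placeholder: substitute your parsed training arguments
+
+async def test():
+    sample = Sample(prompt="test", response="test", label="expected")
+    reward = await reward_func(args, sample)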
+    print(f"Reward: {reward}")
+
+asyncio.run(test())
+```
+
+## Constraint Reference
+
+Key constraint to remember:
+
+```
+rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout
+```
+
+Example: `32 × 8 = 256 × 1`
+
+## Resources
+
+- GitHub Issues: https://github.com/THUDM/slime/issues
+- Documentation: https://thudm.github.io/slime/
+- Examples: `examples/` directory