fix: restore all removed bundled skills + fix skills sync system

- Restored 21 skills removed in commits 757d012 and 740dd92: accelerate, audiocraft, code-review, faiss, flash-attention, gguf, grpo-rl-training, guidance, llava, nemo-curator, obliteratus, peft, pytorch-fsdp, pytorch-lightning, simpo, slime, stable-diffusion, tensorrt-llm, torchtitan, trl-fine-tuning, whisper
- Rewrote sync_skills() with proper update semantics:
  * New skills (not in manifest): copied to user dir
  * Existing skills (in manifest + on disk): updated via hash comparison
  * User-deleted skills (in manifest, not on disk): respected, not re-added
  * Stale manifest entries (removed from bundled): cleaned from manifest
- Added sync_skills() to CLI startup (cmd_chat) and gateway startup (start_gateway); previously it only ran during 'hermes update'
- Updated cmd_update output to show new/updated/cleaned counts
- Rewrote tests: 20 tests covering manifest CRUD, dir hashing, fresh install, user deletion respect, update detection, stale cleanup, and name collision handling

75 bundled skills total. 2002 tests pass.
2026-03-06 15:57:12 -08:00 · 2026-03-06 15:57:12 -08:00 · ab0f4126cf
commit ab0f4126cf
parent 68fbae5692
74 changed files with 27881 additions and 44 deletions
--- a/skills/mlops/flash-attention/references/benchmarks.md
+++ b/skills/mlops/flash-attention/references/benchmarks.md
@ -0,0 +1,215 @@
+# Performance Benchmarks
+
+## Contents
+- Speed comparisons across GPUs
+- Memory usage analysis
+- Scaling with sequence length
+- Training vs inference performance
+- Flash Attention versions comparison
+- Real-world model benchmarks
+- Recommendations by use case
+
+## Speed comparisons across GPUs
+
+### A100 80GB (Ampere)
+
+**Forward pass time** (milliseconds, batch=8, heads=32, dim=64):
+
+| Seq Length | Standard | Flash Attn 2 | Flash Attn 3 | Speedup (FA2) |
+|------------|----------|--------------|--------------|---------------|
+| 512 | 1.2 | 0.9 | N/A | 1.3x |
+| 1024 | 3.8 | 1.4 | N/A | 2.7x |
+| 2048 | 14.2 | 4.8 | N/A | 3.0x |
+| 4096 | 55.1 | 17.3 | N/A | 3.2x |
+| 8192 | 218.5 | 66.2 | N/A | 3.3x |
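+
+The speedup column is the standard time divided by the Flash Attention time. As a minimal sketch, the Python below recomputes it from the A100 rows above (the `a100_rows` and `speedup` names are only illustrative):
+
+```python
+# (seq_len, standard_ms, flash_attn_2_ms) copied from the A100 table above
+a100_rows = [
+    (512, 1.2, 0.9),
+    (1024, 3.8, 1.4),
+    (2048, 14.2, 4.8),
+    (4096, 55.1, 17.3),
+    (8192, 218.5, 66.2),
+]
+
+def speedup(standard_ms, flash_ms):
+    """Return how many times faster the Flash Attention kernel is."""
+    return standard_ms / flash_ms
+
+for seq_len, standard_ms, flash_ms in a100_rows:
+    print(f"{seq_len:>5}: {speedup(standard_ms, flash_ms):.1f}x")
+```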
+
+### H100 80GB (Hopper)
+
+**Forward pass time** (milliseconds, same config):
+
+| Seq Length | Standard | Flash Attn 2 | Flash Attn 3 (FP16) | Flash Attn 3 (FP8) | Best Speedup |
+|------------|----------|--------------|---------------------|--------------------|--------------|
+| 512 | 0.8 | 0.6 | 0.4 | 0.3 | 2.7x |
+| 1024 | 2.6 | 1.0 | 0.6 | 0.4 | 6.5x |
+| 2048 | 9.8 | 3.4 | 2.0 | 1.3 | 7.5x |
+| 4096 | 38.2 | 12.5 | 7.2 | 4.8 | 8.0x |
+| 8192 | 151.4 | 47.8 | 27.1 | 18.2 | 8.3x |
+
+**Key insight**: Flash Attention 3 on H100 reaches about 740 TFLOPS in FP16 (roughly 75% of the theoretical maximum) and close to 1.2 PFLOPS with FP8.
+
+### A10G 24GB (Ampere)
+
+**Forward pass time** (milliseconds, batch=4):
+
+| Seq Length | Standard | Flash Attn 2 | Speedup |
+|------------|----------|--------------|---------|
+| 512 | 2.1 | 1.6 | 1.3x |
+| 1024 | 6.8 | 2.8 | 2.4x |
+| 2048 | 25.9 | 9.4 | 2.8x |
+| 4096 | 102.1 | 35.2 | 2.9x |
+
+## Memory usage analysis
+
+### GPU memory consumption (batch=8, heads=32, dim=64)
+
+**Standard attention memory**:
+
+| Seq Length | Attention Matrix | KV Cache | Total | Notes |
+|------------|------------------|----------|-------|-------|
+| 512 | 8 MB | 32 MB | 40 MB | Manageable |
+| 2048 | 128 MB | 128 MB | 256 MB | Growing |
+| 8192 | 2048 MB (2 GB) | 512 MB | 2.5 GB | Large |
+| 32768 | 32768 MB (32 GB) | 2048 MB | 34 GB | OOM on 24GB GPUs |
+
+**Flash Attention 2 memory**:
+
+| Seq Length | Attention (on-chip) | KV Cache | Total | Reduction |
+|------------|---------------------|----------|-------|-----------|
+| 512 | 0 MB (recomputed) | 32 MB | 32 MB | 20% |
+| 2048 | 0 MB | 128 MB | 128 MB | 50% |
+| 8192 | 0 MB | 512 MB | 512 MB | 80% |
+| 32768 | 0 MB | 2048 MB | 2 GB | 94% |
+
+**Key insight**: Flash Attention never materializes the full attention matrix, which saves O(N²) memory.
+
+### Memory scaling comparison
+
+**Llama 2 7B model memory** (float16, batch=1):
+
+| Context Length | Standard Attention | Flash Attention 2 | Can Fit 24GB GPU? |
+|----------------|-------------------|-------------------|-------------------|
+| 2K | 3.2 GB | 2.1 GB | Both: Yes |
+| 4K | 5.8 GB | 2.8 GB | Both: Yes |
+| 8K | 12.1 GB | 4.2 GB | Both: Yes |
+| 16K | 26.3 GB (OOM) | 7.8 GB | Only Flash: Yes |
+| 32K | OOM | 14.2 GB | Only Flash: Yes |
+
+### Training memory (Llama 2 7B, batch=4)
+
+| Context | Standard (GB) | Flash Attn (GB) | Reduction |
+|---------|---------------|-----------------|-----------|
+| 2K | 18.2 | 12.4 | 32% |
+| 4K | 34.8 | 16.8 | 52% |
+| 8K | OOM (>40GB) | 26.2 | Fits! |
+
+## Scaling with sequence length
+
+### Computational complexity
+
+**Standard attention**:
+- Time: O(N² × d)
+- Memory: O(N² + N × d)
+
+**Flash Attention**:
+- Time: O(N² × d) (same, but with better constants)
+- Memory: O(N × d) (linear!)
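+
+The linear memory bound comes from processing keys and values block by block while keeping only a running maximum, a running softmax sum, and a running output. The pure-Python sketch below illustrates that idea for a single query row. It is a toy illustration, not the actual CUDA kernel: the function names and `block_size` are assumptions, and the usual 1/sqrt(d) scaling is omitted for brevity.
+
+```python
+import math
+
+def naive_attention_row(q, keys, values):
+    """Standard attention for one query: holds all N scores at once."""
+    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
+    m = max(scores)
+    weights = [math.exp(s - m) for s in scores]
+    total = sum(weights)
+    dim = len(values[0])
+    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]
+
+def blockwise_attention_row(q, keys, values, block_size=2):
+    """Same result, visiting keys/values one block at a time."""
+    dim = len(values[0])
+    running_max = float("-inf")
+    running_sum = 0.0
+    acc = [0.0] * dim
+    for start in range(0, len(keys), block_size):
+        k_block = keys[start:start + block_size]
+        v_block = values[start:start + block_size]
+        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_block]
+        new_max = max(running_max, max(scores))
+        # Rescale everything accumulated under the previous maximum
+        scale = math.exp(running_max - new_max)
+        running_sum *= scale
+        acc = [a * scale for a in acc]
+        for s, v in zip(scores, v_block):
+            w = math.exp(s - new_max)
+            running_sum += w
+            acc = [a + w * vj for a, vj in zip(acc, v)]
+        running_max = new_max
+    return [a / running_sum for a in acc]
+
+q = [0.1, 0.2]
+keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [2.0, 0.3]]
+values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
+
+reference = naive_attention_row(q, keys, values)
+blockwise = blockwise_attention_row(q, keys, values)
+print(max(abs(a - b) for a, b in zip(reference, blockwise)))  # floating-point noise only
+```
+
+The blockwise version does the same O(N² × d) arithmetic over a full set of queries, but its working state does not grow with the number of keys.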
+
+### Empirical scaling (A100, batch=1, heads=32, dim=64)
+
+**Time per token (milliseconds)**:
+
+| Sequence | 512 | 1K | 2K | 4K | 8K | 16K |
+|----------|-----|-----|-----|-----|-----|------|
+| Standard | 0.15 | 0.37 | 1.11 | 3.44 | 13.4 | 52.8 |
+| Flash Attn 2 | 0.11 | 0.14 | 0.24 | 0.43 | 0.83 | 1.64 |
+| Speedup | 1.4x | 2.6x | 4.6x | 8.0x | 16.1x | 32.2x |
+
+**Observation**: Speedup grows roughly linearly with sequence length: beyond 4K it about doubles each time the sequence length doubles.
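+
+One way to check how the speedup scales is to fit a power law, speedup ≈ c × N^k, to the table above. An exponent k near 1 means linear growth and near 2 means quadratic growth. The sketch below is only illustrative (the `loglog_slope` helper is not part of any library); for these numbers the fitted exponent comes out a little under 1.
+
+```python
+import math
+
+# Values copied from the "Time per token" table above
+seq_lens = [512, 1024, 2048, 4096, 8192, 16384]
+speedups = [1.4, 2.6, 4.6, 8.0, 16.1, 32.2]
+
+def loglog_slope(xs, ys):
+    """Least-squares slope of log(y) against log(x)."""
+    lx = [math.log(x) for x in xs]
+    ly = [math.log(y) for y in ys]
+    mean_x = sum(lx) / len(lx)
+    mean_y = sum(ly) / len(ly)
+    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
+    den = sum((a - mean_x) ** 2 for a in lx)
+    return num / den
+
+exponent = loglog_slope(seq_lens, speedups)
+print(f"speedup ~ N^{exponent:.2f}")
+```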
+
+### Memory per token (MB)
+
+| Sequence | 512 | 1K | 2K | 4K | 8K | 16K |
+|----------|-----|-----|-----|-----|-----|------|
+| Standard | 0.08 | 0.13 | 0.25 | 0.64 | 2.05 | 8.13 |
+| Flash Attn 2 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 |
+
+**Observation**: Flash Attention memory per token is constant!
+
+## Training vs inference performance
+
+### Training (forward + backward, Llama 2 7B, A100)
+
+| Batch × Seq | Standard (samples/sec) | Flash Attn (samples/sec) | Speedup |
+|-------------|------------------------|--------------------------|---------|
+| 4 × 2K | 1.2 | 3.1 | 2.6x |
+| 8 × 2K | 2.1 | 5.8 | 2.8x |
+| 4 × 4K | 0.4 | 1.3 | 3.3x |
+| 8 × 4K | OOM | 2.4 | Enabled |
+| 2 × 8K | 0.1 | 0.4 | 4.0x |
+
+### Inference (generation, Llama 2 7B, A100)
+
+| Context Length | Standard (tokens/sec) | Flash Attn (tokens/sec) | Speedup |
+|----------------|----------------------|-------------------------|---------|
+| 512 | 48 | 52 | 1.1x |
+| 2K | 42 | 62 | 1.5x |
+| 4K | 31 | 58 | 1.9x |
+| 8K | 18 | 51 | 2.8x |
+| 16K | OOM | 42 | Enabled |
+
+**Note**: The inference speedup is less dramatic than in training because generation is memory-bound (dominated by KV cache accesses).
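+
+To get a feel for why generation is memory-bound, it helps to estimate how large the KV cache is, since every generated token has to read all of it. The sketch below assumes the published Llama 2 7B configuration (32 layers, 32 attention heads, head dimension 128) with fp16 values. The `kv_cache_bytes` helper is hypothetical and ignores framework overhead.
+
+```python
+def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
+                   bytes_per_elem=2, batch=1):
+    """Bytes of cached keys + values for seq_len tokens (defaults: Llama 2 7B, fp16)."""
+    # The factor of 2 accounts for storing both keys and values
+    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len * batch
+
+for ctx in (512, 2048, 4096, 8192, 16384):
+    print(f"{ctx:>6} tokens: {kv_cache_bytes(ctx) / 2**30:.2f} GiB")
+```
+
+Under these assumptions the cache costs 0.5 MiB per token, so a 4K context already means about 2 GiB read per decoding step.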
+
+## Flash Attention versions comparison
+
+### Flash Attention 1 vs 2 vs 3 (H100, seq=4096, batch=8)
+
+| Metric | FA1 | FA2 | FA3 (FP16) | FA3 (FP8) |
+|--------|-----|-----|------------|-----------|
+| Forward time (ms) | 28.4 | 12.5 | 7.2 | 4.8 |
+| Memory (GB) | 4.8 | 4.2 | 4.2 | 2.8 |
+| TFLOPS | 180 | 420 | 740 | 1150 |
+| GPU util % | 35% | 55% | 75% | 82% |
+
+**Key improvements**:
+- FA2: 2.3x faster than FA1 (better parallelism)
+- FA3 (FP16): 1.7x faster than FA2 (H100 async optimizations)
+- FA3 (FP8): 2.6x faster than FA2 (low precision)
+
+### Features by version
+
+| Feature | FA1 | FA2 | FA3 |
+|---------|-----|-----|-----|
+| Basic attention | ✅ | ✅ | ✅ |
+| Causal masking | ✅ | ✅ | ✅ |
+| Multi-query attention | ❌ | ✅ | ✅ |
+| Sliding window | ❌ | ✅ | ✅ |
+| Paged KV cache | ❌ | ✅ | ✅ |
+| FP8 support | ❌ | ❌ | ✅ (H100 only) |
+| Work partitioning | Basic | Advanced | Optimal |
+
+## Real-world model benchmarks
+
+### Llama 2 models (A100 80GB, batch=4, seq=2048)
+
+| Model | Params | Standard (samples/sec) | Flash Attn (samples/sec) | Speedup |
+|-------|--------|------------------------|--------------------------|---------|
+| Llama 2 7B | 7B | 1.2 | 3.1 | 2.6x |
+| Llama 2 13B | 13B | 0.6 | 1.7 | 2.8x |
+| Llama 2 70B | 70B | 0.12 | 0.34 | 2.8x |
+
+### GPT-style models (seq=1024)
+
+| Model | Standard (tokens/sec) | Flash Attn (tokens/sec) | Speedup |
+|-------|----------------------|-------------------------|---------|
+| GPT-2 (124M) | 520 | 680 | 1.3x |
+| GPT-J (6B) | 42 | 98 | 2.3x |
+| GPT-NeoX (20B) | 8 | 22 | 2.75x |
+
+## Recommendations by use case
+
+**Training large models (>7B parameters)**:
+- Use Flash Attention 2 on A100
+- Use Flash Attention 3 FP8 on H100 for maximum speed
+- Expected: 2.5-3x speedup
+
+**Long context inference (>4K tokens)**:
+- Flash Attention essential (enables contexts standard attention can't handle)
+- Expected: 2-4x speedup, 5-10x memory reduction
+
+**Short sequences (<512 tokens)**:
+- Flash Attention provides 1.2-1.5x speedup
+- Minimal memory benefit
+- Still worth enabling (no downside)
+
+**Multi-user serving**:
+- Flash Attention reduces per-request memory
+- Allows higher concurrent batch sizes
+- Can serve 2-3x more users on same hardware
--- a/skills/mlops/flash-attention/references/transformers-integration.md
+++ b/skills/mlops/flash-attention/references/transformers-integration.md
@ -0,0 +1,293 @@
+# HuggingFace Transformers Integration
+
+## Contents
+- Enabling Flash Attention in Transformers
+- Supported model architectures
+- Configuration examples
+- Performance comparisons
+- Troubleshooting model-specific issues
+- Best practices
+
+## Enabling Flash Attention in Transformers
+
+HuggingFace Transformers (v4.36+) supports Flash Attention 2 natively.
+
+**Simple enable for any supported model**:
+```python
+import torch
+from transformers import AutoModel
+
+model = AutoModel.from_pretrained(
+    "meta-llama/Llama-2-7b-hf",
+    attn_implementation="flash_attention_2",
+    torch_dtype=torch.float16,
+    device_map="auto"
+)
+```
+
+**Install requirements**:
+```bash
+pip install "transformers>=4.36"  # quoted so the shell does not treat >= as a redirect
+pip install flash-attn --no-build-isolation
+```
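+
+Before loading a model, it can be useful to confirm that both requirements are actually met. The sketch below uses only the standard library. The `version_tuple` and `flash_attention_ready` helpers are hypothetical names, and the check is deliberately coarse (major and minor version only).
+
+```python
+import importlib.util
+from importlib import metadata
+
+def version_tuple(version):
+    """Turn '4.36.2' into (4, 36) for a coarse major/minor comparison."""
+    major, minor = version.split(".")[:2]
+    return int(major), int(minor)
+
+def flash_attention_ready(min_transformers=(4, 36)):
+    """Return (ok, reason) describing whether both packages look usable."""
+    if importlib.util.find_spec("flash_attn") is None:
+        return False, "flash-attn is not installed"
+    try:
+        installed = metadata.version("transformers")
+    except metadata.PackageNotFoundError:
+        return False, "transformers is not installed"
+    if version_tuple(installed) < min_transformers:
+        return False, f"transformers {installed} is older than 4.36"
+    return True, "ok"
+
+ok, reason = flash_attention_ready()
+print(ok, reason)
+```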
+
+## Supported model architectures
+
+As of Transformers 4.40:
+
+**Fully supported**:
+- Llama / Llama 2 / Llama 3
+- Mistral / Mixtral
+- Falcon
+- GPT-NeoX
+- Phi / Phi-2 / Phi-3
+- Qwen / Qwen2
+- Gemma
+- Starcoder2
+- GPT-J
+- OPT
+- BLOOM
+
+**Partially supported** (encoder-decoder):
+- BART
+- T5 / Flan-T5
+- Whisper
+
+**Check support**:
+```python
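+from transformers import AutoModelForCausalLM
+import torch
+
+# Loading fails with an error if the architecture has no Flash Attention 2 support
+model = AutoModelForCausalLM.from_pretrained(
+    "model-name",
+    attn_implementation="flash_attention_2",
+    torch_dtype=torch.float16,
+)
+print(model.config._attn_implementation)  # 'flash_attention_2' once loading succeeds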
+```
+
+## Configuration examples
+
+### Llama 2 with Flash Attention
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+
+model_id = "meta-llama/Llama-2-7b-hf"
+
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    attn_implementation="flash_attention_2",
+    torch_dtype=torch.float16,
+    device_map="auto"
+)
+
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+# Generate
+inputs = tokenizer("Once upon a time", return_tensors="pt").to("cuda")
+outputs = model.generate(**inputs, max_length=100)
+print(tokenizer.decode(outputs[0]))
+```
+
+### Mistral with Flash Attention for long context
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+
+model_id = "mistralai/Mistral-7B-v0.1"
+
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    attn_implementation="flash_attention_2",
+    torch_dtype=torch.bfloat16,  # Better for long context
+    device_map="auto",
+    max_position_embeddings=32768  # Extended context
+)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+# Process long document (32K tokens); the string below is a placeholder
+long_text = "..." * 10000
+inputs = tokenizer(long_text, return_tensors="pt", truncation=False).to("cuda")
+outputs = model.generate(**inputs, max_new_tokens=512)
+```
+
+### Fine-tuning with Flash Attention
+
+```python
+from transformers import Trainer, TrainingArguments
+from transformers import AutoModelForCausalLM
+import torch
+
+model = AutoModelForCausalLM.from_pretrained(
+    "meta-llama/Llama-2-7b-hf",
+    attn_implementation="flash_attention_2",
+    torch_dtype=torch.bfloat16  # bf16 weights train without an fp16 gradient scaler
+)
+
+training_args = TrainingArguments(
+    output_dir="./results",
+    per_device_train_batch_size=4,
+    gradient_accumulation_steps=4,
+    num_train_epochs=3,
+    bf16=True,  # Must match model dtype
+    optim="adamw_torch_fused"  # Fast optimizer
+)
+
+# train_dataset: a tokenized dataset prepared beforehand (not shown here)
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=train_dataset
+)
+
+trainer.train()
+```
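+
+The batch settings above combine into the effective batch size per optimizer step. A small arithmetic sketch (the `effective_batch_size` helper and its `num_devices` parameter are illustrative, not part of Transformers):
+
+```python
+def effective_batch_size(per_device, grad_accum_steps, num_devices=1):
+    """Samples contributing to each optimizer step."""
+    return per_device * grad_accum_steps * num_devices
+
+print(effective_batch_size(4, 4))     # 16 on a single GPU
+print(effective_batch_size(4, 4, 8))  # 128 across 8 data-parallel GPUs
+```
+
+Because Flash Attention lowers activation memory, it is often possible to raise `per_device_train_batch_size` and lower `gradient_accumulation_steps` while keeping this product unchanged.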
+
+### Multi-GPU training
+
+```python
+from transformers import AutoModelForCausalLM
+import torch
+
+# Model parallelism with Flash Attention
+model = AutoModelForCausalLM.from_pretrained(
+    "meta-llama/Llama-2-13b-hf",
+    attn_implementation="flash_attention_2",
+    torch_dtype=torch.float16,
+    device_map="auto",  # Automatic multi-GPU placement
+    max_memory={0: "20GB", 1: "20GB"}  # Limit per GPU
+)
+```
+
+## Performance comparisons
+
+### Memory usage (Llama 2 7B, batch=1)
+
+| Sequence Length | Standard Attention | Flash Attention 2 | Reduction |
+|-----------------|-------------------|-------------------|-----------|
+| 512 | 1.2 GB | 0.9 GB | 25% |
+| 2048 | 3.8 GB | 1.4 GB | 63% |
+| 8192 | 14.2 GB | 3.2 GB | 77% |
+| 32768 | OOM (>24GB) | 10.8 GB | Fits! |
+
+### Speed (tokens/sec, A100 80GB)
+
+| Model | Standard | Flash Attn 2 | Speedup |
+|-------|----------|--------------|---------|
+| Llama 2 7B (seq=2048) | 42 | 118 | 2.8x |
+| Llama 2 13B (seq=4096) | 18 | 52 | 2.9x |
+| Llama 2 70B (seq=2048) | 4 | 11 | 2.75x |
+
+### Training throughput (samples/sec)
+
+| Model | Batch Size | Standard | Flash Attn 2 | Speedup |
+|-------|------------|----------|--------------|---------|
+| Llama 2 7B | 4 | 1.2 | 3.1 | 2.6x |
+| Llama 2 7B | 8 | 2.1 | 5.8 | 2.8x |
+| Llama 2 13B | 2 | 0.6 | 1.7 | 2.8x |
+
+## Troubleshooting model-specific issues
+
+### Issue: Model doesn't support Flash Attention
+
+Check the support list above. If the model is not supported, use PyTorch SDPA as a fallback:
+
+```python
+model = AutoModelForCausalLM.from_pretrained(
+    "model-name",
+    attn_implementation="sdpa",  # PyTorch native (still faster)
+    torch_dtype=torch.float16
+)
+```
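+
+The fallback decision can also be made in code before calling `from_pretrained`. The sketch below is a hypothetical helper, not a Transformers API. It assumes you already know whether the architecture is on the support list, and it relies on Flash Attention 2 requiring an Ampere-class GPU (compute capability 8.0) or newer.
+
+```python
+def pick_attn_implementation(model_supports_fa2, compute_capability, flash_attn_installed):
+    """Return the attn_implementation string to pass to from_pretrained.
+
+    compute_capability is a (major, minor) tuple, for example the value
+    returned by torch.cuda.get_device_capability().
+    """
+    if model_supports_fa2 and flash_attn_installed and compute_capability >= (8, 0):
+        return "flash_attention_2"
+    return "sdpa"  # PyTorch native fallback
+
+print(pick_attn_implementation(True, (8, 0), True))  # flash_attention_2
+print(pick_attn_implementation(True, (7, 5), True))  # sdpa
+```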
+
+### Issue: CUDA out of memory during loading
+
+Reduce memory footprint:
+
+```python
+model = AutoModelForCausalLM.from_pretrained(
+    "model-name",
+    attn_implementation="flash_attention_2",
+    torch_dtype=torch.float16,
+    device_map="auto",
+    max_memory={0: "18GB"},  # Reserve memory for KV cache
+    low_cpu_mem_usage=True
+)
+```
+
+### Issue: Slower inference than expected
+
+Ensure dtype matches:
+
+```python
+# Model and inputs must both be float16/bfloat16
+model = model.to(torch.float16)
+inputs = tokenizer(..., return_tensors="pt").to("cuda")
+inputs = {k: v.to(torch.float16) if v.dtype == torch.float32 else v
+          for k, v in inputs.items()}
+```
+
+### Issue: Different outputs vs standard attention
+
+Flash Attention computes exact attention, so it is mathematically equivalent to the standard implementation, but it performs the operations in a different order. Small floating-point differences (<1e-3) are normal:
+
+```python
+# Compare outputs (both models on the GPU; tokenizer loaded as in the examples above)
+model_standard = AutoModelForCausalLM.from_pretrained(
+    "model-name",
+    attn_implementation="eager",  # Explicit baseline; newer versions may default to SDPA
+    torch_dtype=torch.float16,
+    device_map="cuda"
+)
+model_flash = AutoModelForCausalLM.from_pretrained(
+    "model-name",
+    attn_implementation="flash_attention_2",
+    torch_dtype=torch.float16,
+    device_map="cuda"
+)
+
+inputs = tokenizer("Test", return_tensors="pt").to("cuda")
+
+with torch.no_grad():
+    out_standard = model_standard(**inputs).logits
+    out_flash = model_flash(**inputs).logits
+
+diff = (out_standard - out_flash).abs().max()
+print(f"Max diff: {diff:.6f}")  # Should be ~1e-3 to 1e-4
+```
+
+### Issue: ImportError during model loading
+
+Install flash-attn:
+```bash
+pip install flash-attn --no-build-isolation
+```
+
+Or disable Flash Attention:
+```python
+model = AutoModelForCausalLM.from_pretrained(
+    "model-name",
+    attn_implementation="eager",  # Standard PyTorch
+    torch_dtype=torch.float16
+)
+```
+
+## Best practices
+
+1. **Always use float16/bfloat16** with Flash Attention (not float32)
+2. **Set device_map="auto"** for automatic device placement
+3. **Use bfloat16 for long context** (better numerical stability)
+4. **Enable gradient checkpointing** for training large models
+5. **Monitor memory** with `torch.cuda.max_memory_allocated()`
+
+**Example with all best practices**:
+```python
+import torch
+from transformers import AutoModelForCausalLM, TrainingArguments
+
+model = AutoModelForCausalLM.from_pretrained(
+    "meta-llama/Llama-2-7b-hf",
+    attn_implementation="flash_attention_2",
+    torch_dtype=torch.bfloat16,  # Better for training
+    device_map="auto",
+    low_cpu_mem_usage=True
+)
+
+# Enable gradient checkpointing for memory
+model.gradient_checkpointing_enable()
+
+# Training with optimizations
+training_args = TrainingArguments(
+    output_dir="./results",
+    per_device_train_batch_size=8,
+    gradient_accumulation_steps=2,
+    bf16=True,  # Match model dtype
+    optim="adamw_torch_fused",
+    gradient_checkpointing=True
+)
+```
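+
+Practice 5 can be wrapped in a small helper that reports the peak after a training or generation run. This is a sketch with a hypothetical `report_peak_memory` function. It returns `None` when PyTorch or CUDA is unavailable, and it assumes you call `torch.cuda.reset_peak_memory_stats()` before the run you want to measure.
+
+```python
+def report_peak_memory(label="run"):
+    """Print peak CUDA memory since the last reset; return GB, or None without CUDA."""
+    try:
+        import torch
+    except ImportError:
+        return None
+    if not torch.cuda.is_available():
+        return None
+    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
+    print(f"[{label}] peak GPU memory: {peak_gb:.2f} GB")
+    return peak_gb
+
+peak = report_peak_memory("after generate")
+```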