refactor: remove outdated skills and references from MLOps

- Deleted the `huggingface-accelerate` skill documentation, which included details on distributed training and common workflows.
- Removed `custom-plugins.md`, `megatron-integration.md`, `performance.md`, and other related reference documents that were no longer relevant or necessary.
- This cleanup aims to streamline the MLOps skills repository and improve maintainability.
2026-02-25 04:22:48 -08:00 · 757d012ab5
commit 757d012ab5
parent f64a87209d
47 changed files with 170 additions and 21638 deletions
--- a/skills/mlops/accelerate/SKILL.md
+++ b/skills/mlops/accelerate/SKILL.md
@@ -1,335 +0,0 @@
----
-name: huggingface-accelerate
-description: Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard.
-version: 1.0.0
-author: Orchestra Research
-license: MIT
-dependencies: [accelerate, torch, transformers]
-metadata:
-  hermes:
-    tags: [Distributed Training, HuggingFace, Accelerate, DeepSpeed, FSDP, Mixed Precision, PyTorch, DDP, Unified API, Simple]
-
----
-
-# HuggingFace Accelerate - Unified Distributed Training
-
-## Quick start
-
-Accelerate simplifies distributed training to 4 lines of code.
-
-**Installation**:
-```bash
-pip install accelerate
-```
-
-**Convert PyTorch script** (4 lines):
-```python
-import torch
-+ from accelerate import Accelerator
-
-+ accelerator = Accelerator()
-
-  model = torch.nn.Transformer()
-  optimizer = torch.optim.Adam(model.parameters())
-  dataloader = torch.utils.data.DataLoader(dataset)
-
-+ model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
-
-  for batch in dataloader:
-      optimizer.zero_grad()
-      loss = model(batch)
--     loss.backward()
-+     accelerator.backward(loss)
-      optimizer.step()
-```
-
-**Run** (single command):
-```bash
-accelerate launch train.py
-```
-
-## Common workflows
-
-### Workflow 1: From single GPU to multi-GPU
-
-**Original script**:
-```python
-# train.py
-import torch
-
-model = torch.nn.Linear(10, 2).to('cuda')
-optimizer = torch.optim.Adam(model.parameters())
-dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
-
-for epoch in range(10):
-    for batch in dataloader:
-        batch = batch.to('cuda')
-        optimizer.zero_grad()
-        loss = model(batch).mean()
-        loss.backward()
-        optimizer.step()
-```
-
-**With Accelerate** (4 lines added):
-```python
-# train.py
-import torch
-from accelerate import Accelerator  # +1
-
-accelerator = Accelerator()  # +2
-
-model = torch.nn.Linear(10, 2)
-optimizer = torch.optim.Adam(model.parameters())
-dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
-
-model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)  # +3
-
-for epoch in range(10):
-    for batch in dataloader:
-        # No .to('cuda') needed - automatic!
-        optimizer.zero_grad()
-        loss = model(batch).mean()
-        accelerator.backward(loss)  # +4
-        optimizer.step()
-```
-
-**Configure** (interactive):
-```bash
-accelerate config
-```
-
-**Questions**:
-- Which machine? (single/multi GPU/TPU/CPU)
-- How many machines? (1)
-- Mixed precision? (no/fp16/bf16/fp8)
-- DeepSpeed? (no/yes)
-
-**Launch** (works on any setup):
-```bash
-# Single GPU
-accelerate launch train.py
-
-# Multi-GPU (8 GPUs)
-accelerate launch --multi_gpu --num_processes 8 train.py
-
-# Multi-node
-accelerate launch --multi_gpu --num_processes 16 \
-  --num_machines 2 --machine_rank 0 \
-  --main_process_ip $MASTER_ADDR \
-  train.py
-```
-
-### Workflow 2: Mixed precision training
-
-**Enable FP16/BF16**:
-```python
-from accelerate import Accelerator
-
-# FP16 (with gradient scaling)
-accelerator = Accelerator(mixed_precision='fp16')
-
-# BF16 (no scaling, more stable)
-accelerator = Accelerator(mixed_precision='bf16')
-
-# FP8 (H100+)
-accelerator = Accelerator(mixed_precision='fp8')
-
-model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
-
-# Everything else is automatic!
-for batch in dataloader:
-    with accelerator.autocast():  # Optional, done automatically
-        loss = model(batch)
-    accelerator.backward(loss)
-```
-
-### Workflow 3: DeepSpeed ZeRO integration
-
-**Enable DeepSpeed ZeRO-2**:
-```python
-from accelerate import Accelerator
-
-accelerator = Accelerator(
-    mixed_precision='bf16',
-    deepspeed_plugin={
-        "zero_stage": 2,  # ZeRO-2
-        "offload_optimizer": False,
-        "gradient_accumulation_steps": 4
-    }
-)
-
-# Same code as before!
-model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
-```
-
-**Or via config**:
-```bash
-accelerate config
-# Select: DeepSpeed → ZeRO-2
-```
-
-**deepspeed_config.json**:
-```json
-{
-    "fp16": {"enabled": false},
-    "bf16": {"enabled": true},
-    "zero_optimization": {
-        "stage": 2,
-        "offload_optimizer": {"device": "cpu"},
-        "allgather_bucket_size": 5e8,
-        "reduce_bucket_size": 5e8
-    }
-}
-```
-
-**Launch**:
-```bash
-accelerate launch --config_file deepspeed_config.json train.py
-```
-
-### Workflow 4: FSDP (Fully Sharded Data Parallel)
-
-**Enable FSDP**:
-```python
-from accelerate import Accelerator, FullyShardedDataParallelPlugin
-
-fsdp_plugin = FullyShardedDataParallelPlugin(
-    sharding_strategy="FULL_SHARD",  # ZeRO-3 equivalent
-    auto_wrap_policy="TRANSFORMER_AUTO_WRAP",
-    cpu_offload=False
-)
-
-accelerator = Accelerator(
-    mixed_precision='bf16',
-    fsdp_plugin=fsdp_plugin
-)
-
-model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
-```
-
-**Or via config**:
-```bash
-accelerate config
-# Select: FSDP → Full Shard → No CPU Offload
-```
-
-### Workflow 5: Gradient accumulation
-
-**Accumulate gradients**:
-```python
-from accelerate import Accelerator
-
-accelerator = Accelerator(gradient_accumulation_steps=4)
-
-model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
-
-for batch in dataloader:
-    with accelerator.accumulate(model):  # Handles accumulation
-        optimizer.zero_grad()
-        loss = model(batch)
-        accelerator.backward(loss)
-        optimizer.step()
-```
-
-**Effective batch size**: `batch_size * num_gpus * gradient_accumulation_steps`
-
-## When to use vs alternatives
-
-**Use Accelerate when**:
-- Want simplest distributed training
-- Need single script for any hardware
-- Use HuggingFace ecosystem
-- Want flexibility (DDP/DeepSpeed/FSDP/Megatron)
-- Need quick prototyping
-
-**Key advantages**:
-- **4 lines**: Minimal code changes
-- **Unified API**: Same code for DDP, DeepSpeed, FSDP, Megatron
-- **Automatic**: Device placement, mixed precision, sharding
-- **Interactive config**: No manual launcher setup
-- **Single launch**: Works everywhere
-
-**Use alternatives instead**:
-- **PyTorch Lightning**: Need callbacks, high-level abstractions
-- **Ray Train**: Multi-node orchestration, hyperparameter tuning
-- **DeepSpeed**: Direct API control, advanced features
-- **Raw DDP**: Maximum control, minimal abstraction
-
-## Common issues
-
-**Issue: Wrong device placement**
-
-Don't manually move to device:
-```python
-# WRONG
-batch = batch.to('cuda')
-
-# CORRECT
-# Accelerate handles it automatically after prepare()
-```
-
-**Issue: Gradient accumulation not working**
-
-Use context manager:
-```python
-# CORRECT
-with accelerator.accumulate(model):
-    optimizer.zero_grad()
-    accelerator.backward(loss)
-    optimizer.step()
-```
-
-**Issue: Checkpointing in distributed**
-
-Use accelerator methods:
-```python
-# Save only on main process
-if accelerator.is_main_process:
-    accelerator.save_state('checkpoint/')
-
-# Load on all processes
-accelerator.load_state('checkpoint/')
-```
-
-**Issue: Different results with FSDP**
-
-Ensure same random seed:
-```python
-from accelerate.utils import set_seed
-set_seed(42)
-```
-
-## Advanced topics
-
-**Megatron integration**: See [references/megatron-integration.md](references/megatron-integration.md) for tensor parallelism, pipeline parallelism, and sequence parallelism setup.
-
-**Custom plugins**: See [references/custom-plugins.md](references/custom-plugins.md) for creating custom distributed plugins and advanced configuration.
-
-**Performance tuning**: See [references/performance.md](references/performance.md) for profiling, memory optimization, and best practices.
-
-## Hardware requirements
-
-- **CPU**: Works (slow)
-- **Single GPU**: Works
-- **Multi-GPU**: DDP (default), DeepSpeed, or FSDP
-- **Multi-node**: DDP, DeepSpeed, FSDP, Megatron
-- **TPU**: Supported
-- **Apple MPS**: Supported
-
-**Launcher requirements**:
-- **DDP**: `torch.distributed.run` (built-in)
-- **DeepSpeed**: `deepspeed` (pip install deepspeed)
-- **FSDP**: PyTorch 1.12+ (built-in)
-- **Megatron**: Custom setup
-
-## Resources
-
-- Docs: https://huggingface.co/docs/accelerate
-- GitHub: https://github.com/huggingface/accelerate
-- Version: 1.11.0+
-- Tutorial: "Accelerate your scripts"
-- Examples: https://github.com/huggingface/accelerate/tree/main/examples
-- Used by: HuggingFace Transformers, TRL, PEFT, all HF libraries
-
-
-