refactor: reorganize skills into sub-categories

The skills directory was getting disorganized: mlops alone had 40 skills in a flat list, and 12 categories were singletons with just one skill each.

Code change:
- prompt_builder.py: Support sub-categories in skill scanner. skills/mlops/training/axolotl/SKILL.md now shows as category 'mlops/training' instead of just 'mlops'. Backwards-compatible with existing flat structure.

Split mlops (40 skills) into 7 sub-categories:
- mlops/training (12): accelerate, axolotl, flash-attention, grpo-rl-training, peft, pytorch-fsdp, pytorch-lightning, simpo, slime, torchtitan, trl-fine-tuning, unsloth
- mlops/inference (8): gguf, guidance, instructor, llama-cpp, obliteratus, outlines, tensorrt-llm, vllm
- mlops/models (6): audiocraft, clip, llava, segment-anything, stable-diffusion, whisper
- mlops/vector-databases (4): chroma, faiss, pinecone, qdrant
- mlops/evaluation (5): huggingface-tokenizers, lm-evaluation-harness, nemo-curator, saelens, weights-and-biases
- mlops/cloud (2): lambda-labs, modal
- mlops/research (1): dspy

Merged singleton categories:
- gifs → media (gif-search joins youtube-content)
- music-creation → media (heartmula, songsee)
- diagramming → creative (excalidraw joins ascii-art)
- ocr-and-documents → productivity
- domain → research (domain-intel)
- feeds → research (blogwatcher)
- market-data → research (polymarket)

Fixed misplaced skills:
- mlops/code-review → software-development (not ML-specific)
- mlops/ml-paper-writing → research (academic writing)

Added DESCRIPTION.md files for all new/updated categories.
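
The sub-category rule described for prompt_builder.py can be illustrated with a minimal sketch. This is not the code from the commit: the function name `skill_category` and the default root `skills` are hypothetical, and the sketch assumes every skill lives in its own directory containing a SKILL.md. The category is taken to be every directory between the skills root and the skill's own directory, so a flat layout yields a single-segment category and a nested layout yields a slash-separated one.

```python
from pathlib import PurePosixPath


def skill_category(skill_md: str, skills_root: str = "skills") -> str:
    """Return the category for a SKILL.md path (hypothetical helper).

    The category is every directory between the skills root and the
    skill's own directory, joined with '/'.
    """
    rel = PurePosixPath(skill_md).relative_to(skills_root)
    # rel.parts looks like (*category_dirs, skill_name, "SKILL.md"),
    # so dropping the last two parts leaves only the category directories.
    category_parts = rel.parts[:-2]
    return "/".join(category_parts)


# Nested layout: the sub-category is part of the category name.
print(skill_category("skills/mlops/training/axolotl/SKILL.md"))  # mlops/training
# Flat layout: still resolves to the single top-level category.
print(skill_category("skills/media/gif-search/SKILL.md"))  # media
```

Deriving the category from the path depth, rather than hard-coding one level, is what would keep the existing flat directories working unchanged under this assumption.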
2026-03-09 03:35:53 -07:00
commit 732c66b0f3
parent d6c710706f
217 changed files with 39 additions and 4 deletions
--- a/skills/mlops/trl-fine-tuning/references/online-rl.md
+++ b/skills/mlops/trl-fine-tuning/references/online-rl.md
@@ -1,82 +0,0 @@
-# Online RL Methods
-
-Guide to online reinforcement learning with PPO, GRPO, RLOO, and OnlineDPO.
-
-## Overview
-
-Online RL generates completions during training and optimizes based on rewards.
-
-## PPO (Proximal Policy Optimization)
-
-Classic RL algorithm for LLM alignment.
-
-### Basic Usage
-
-```bash
-python -m trl.scripts.ppo \
-    --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
-    --reward_model_path reward-model \
-    --dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
-    --output_dir model-ppo \
-    --learning_rate 3e-6 \
-    --per_device_train_batch_size 64 \
-    --total_episodes 10000 \
-    --num_ppo_epochs 4 \
-    --kl_coef 0.05
-```
-
-### Key Parameters
-
-- `kl_coef`: KL penalty (0.05-0.2)
-- `num_ppo_epochs`: Epochs per batch (2-4)
-- `cliprange`: PPO clip (0.1-0.3)
-- `vf_coef`: Value function coef (0.1)
-
-## GRPO (Group Relative Policy Optimization)
-
-Memory-efficient online RL.
-
-### Basic Usage
-
-```python
-from trl import GRPOTrainer, GRPOConfig
-from datasets import load_dataset
-
-# Define reward function
-def reward_func(completions, **kwargs):
-    return [len(set(c.split())) for c in completions]
-
-config = GRPOConfig(
-    output_dir="model-grpo",
-    num_generations=4,  # Completions per prompt
-    max_new_tokens=128
-)
-
-trainer = GRPOTrainer(
-    model="Qwen/Qwen2-0.5B-Instruct",
-    reward_funcs=reward_func,
-    args=config,
-    train_dataset=load_dataset("trl-lib/tldr", split="train")
-)
-trainer.train()
-```
-
-### Key Parameters
-
-- `num_generations`: 2-8 completions
-- `max_new_tokens`: 64-256
-- Learning rate: 1e-5 to 1e-4
-
-## Memory Comparison
-
-| Method | Memory (7B) | Speed | Use Case |
-|--------|-------------|-------|----------|
-| PPO | 40GB | Medium | Maximum control |
-| GRPO | 24GB | Fast | **Memory-constrained** |
-| OnlineDPO | 28GB | Fast | No reward model |
-
-## References
-
-- PPO paper: https://arxiv.org/abs/1707.06347
-- GRPO paper: https://arxiv.org/abs/2402.03300
-- TRL docs: https://huggingface.co/docs/trl/