fix: restore all removed bundled skills + fix skills sync system

- Restored 21 skills removed in commits 757d012 and 740dd92: accelerate, audiocraft, code-review, faiss, flash-attention, gguf, grpo-rl-training, guidance, llava, nemo-curator, obliteratus, peft, pytorch-fsdp, pytorch-lightning, simpo, slime, stable-diffusion, tensorrt-llm, torchtitan, trl-fine-tuning, whisper
- Rewrote sync_skills() with proper update semantics:
  * New skills (not in manifest): copied to user dir
  * Existing skills (in manifest + on disk): updated via hash comparison
  * User-deleted skills (in manifest, not on disk): respected, not re-added
  * Stale manifest entries (removed from bundled): cleaned from manifest
- Added sync_skills() to CLI startup (cmd_chat) and gateway startup (start_gateway); previously it only ran during 'hermes update'
- Updated cmd_update output to show new/updated/cleaned counts
- Rewrote tests: 20 tests covering manifest CRUD, dir hashing, fresh install, user deletion respect, update detection, stale cleanup, and name collision handling

75 bundled skills total. 2002 tests pass.
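
The four sync rules in the message can be sketched as a pure planning function. This is only an illustration under stated assumptions, not the code in this commit: `plan_sync`, its argument shapes, the made-up hash strings, and the choice to leave an unmanaged same-name directory untouched are all hypothetical.

```python
def plan_sync(bundled, manifest, on_disk):
    """Decide an action per skill (hypothetical helper, not the real sync_skills()).

    bundled:  {skill name: content hash} for skills shipped with the app
    manifest: {skill name: content hash} recorded at the previous sync
    on_disk:  set of skill names present in the user directory
    """
    plan = {"copy": [], "update": [], "skip": [], "clean": []}
    for name, digest in sorted(bundled.items()):
        if name not in manifest:
            # New skill. Assumption: a same-named directory the manifest
            # does not know about is left alone rather than overwritten.
            plan["skip" if name in on_disk else "copy"].append(name)
        elif name not in on_disk:
            plan["skip"].append(name)    # user deleted it; do not re-add
        elif manifest[name] != digest:
            plan["update"].append(name)  # bundled content changed since last sync
    # Manifest entries whose skill is no longer bundled are stale
    plan["clean"] = sorted(set(manifest) - set(bundled))
    return plan


plan = plan_sync(
    bundled={"peft": "a1", "whisper": "b2", "faiss": "c3"},
    manifest={"whisper": "b1", "faiss": "c3", "old-skill": "d4"},
    on_disk={"whisper"},
)
print(plan)
```

Here `peft` is new and gets copied, `whisper` changed and gets updated, `faiss` was deleted by the user and stays deleted, and `old-skill` is dropped from the manifest.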
2026-03-06 15:57:12 -08:00
commit ab0f4126cf
parent 68fbae5692
74 changed files with 27881 additions and 44 deletions
--- a/skills/mlops/trl-fine-tuning/references/reward-modeling.md
+++ b/skills/mlops/trl-fine-tuning/references/reward-modeling.md
@@ -0,0 +1,122 @@
+# Reward Modeling
+
+Guide to training reward models with TRL for RLHF pipelines.
+
+## Overview
+
+Reward models score completions according to learned human preferences. They are used in:
+- PPO training (RL feedback)
+- GRPO online RL
+- Completion ranking
+
+## Basic Training
+
+```python
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+from trl import RewardTrainer, RewardConfig
+from datasets import load_dataset
+
+# Load model (num_labels=1 for single reward score)
+model = AutoModelForSequenceClassification.from_pretrained(
+    "Qwen/Qwen2.5-0.5B-Instruct",
+    num_labels=1
+)
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
+
+# Load preference dataset (chosen/rejected pairs)
+dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
+
+# Configure
+config = RewardConfig(
+    output_dir="Qwen2.5-Reward",
+    per_device_train_batch_size=2,
+    num_train_epochs=1,
+    learning_rate=1e-5
+)
+
+# Train
+trainer = RewardTrainer(
+    model=model,
+    args=config,
+    processing_class=tokenizer,
+    train_dataset=dataset
+)
+trainer.train()
+```
+
+## Dataset Format
+
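+Each example is a preference pair. `chosen` and `rejected` are always needed; whether a separate `prompt` field is accepted, or the prompt must instead be folded into both responses, depends on the TRL version: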
+```json
+{
+  "prompt": "Question or instruction",
+  "chosen": "Better response",
+  "rejected": "Worse response"
+}
+```
+
+## Bradley-Terry Loss
+
+Default loss function:
+```
+loss = -log(sigmoid(reward_chosen - reward_rejected))
+```
+
+Minimizing this loss trains the model to assign a higher reward to the chosen response than to the rejected one.
+
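A minimal sketch of the loss formula above in plain Python, for a single pair of scalar rewards. The function name `bradley_terry_loss` is made up for illustration; it is not a TRL API, and a real trainer would compute this on batched tensors.

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(reward_chosen - reward_rejected)), computed stably."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch so exp() never overflows
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Loss shrinks as the chosen reward pulls ahead of the rejected one
for chosen, rejected in [(2.0, -1.0), (0.0, 0.0), (-1.0, 2.0)]:
    print(chosen, rejected, round(bradley_terry_loss(chosen, rejected), 4))
```

When the two rewards are equal the loss is log 2 (about 0.693); it approaches 0 as the margin grows and increases roughly linearly when the rejected response scores higher.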
+## Using Reward Models
+
+### Inference
+
+```python
+from transformers import pipeline
+
+# Load trained reward model
+reward_pipe = pipeline("text-classification", model="Qwen2.5-Reward")
+
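+# Score completions; each result is a dict with "label" and "score" keys.
+# With num_labels=1 the pipeline applies a sigmoid by default, so scores lie in (0, 1).
+texts = ["Good answer", "Bad answer"]
+scores = reward_pipe(texts)
+print(scores)  # Higher score = better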
+```
+
+### In PPO
+
+```python
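+from transformers import AutoModelForSequenceClassification
+from trl import PPOTrainer, PPOConfig
+
+config = PPOConfig(reward_model_path="Qwen2.5-Reward")  # Use trained reward model
+
+# The path is only a config field; load the model and pass it explicitly.
+reward_model = AutoModelForSequenceClassification.from_pretrained(
+    config.reward_model_path, num_labels=1
+)
+
+# Keyword names follow TRL releases that use `processing_class`; check your version.
+# policy_model, ref_model, value_model, prompt_dataset are assumed to exist already.
+trainer = PPOTrainer(
+    args=config,
+    processing_class=tokenizer,
+    model=policy_model,
+    ref_model=ref_model,
+    reward_model=reward_model,
+    value_model=value_model,
+    train_dataset=prompt_dataset,
+)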
+```
+
+## Hyperparameters
+
+| Model Size | Learning Rate | Batch Size | Epochs |
+|------------|---------------|------------|--------|
+| <1B | 2e-5 | 4-8 | 1-2 |
+| 1-7B | 1e-5 | 2-4 | 1 |
+| 7-13B | 5e-6 | 1-2 | 1 |
+
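The table above can be encoded as a small lookup, sketched here for convenience. `suggest_hparams` is a hypothetical helper, not part of TRL, and the handling of boundary sizes (exactly 1B, 7B, or 13B) is an assumption because the table's ranges overlap at their edges.

```python
# Rows copied from the table; ranges are stored as (low, high) tuples
HPARAMS = {
    "<1B": {"learning_rate": 2e-5, "batch_size": (4, 8), "epochs": (1, 2)},
    "1-7B": {"learning_rate": 1e-5, "batch_size": (2, 4), "epochs": (1, 1)},
    "7-13B": {"learning_rate": 5e-6, "batch_size": (1, 2), "epochs": (1, 1)},
}


def suggest_hparams(params_in_billions: float) -> dict:
    """Return the table row for a model size (hypothetical helper)."""
    if params_in_billions <= 0:
        raise ValueError("model size must be positive")
    if params_in_billions < 1:
        return HPARAMS["<1B"]
    # Assumption: a size exactly on a boundary falls in the smaller bucket
    if params_in_billions <= 7:
        return HPARAMS["1-7B"]
    if params_in_billions <= 13:
        return HPARAMS["7-13B"]
    raise ValueError("the table has no row above 13B")


print(suggest_hparams(0.5))
print(suggest_hparams(8))
```

The values are starting points taken from the table, not tuned results; the helper raises for sizes the table does not cover instead of extrapolating.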
+## Evaluation
+
+Check reward separation:
+```python
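+import torch
+
+# chosen_inputs / rejected_inputs: tokenized batches of the same pairs, in the same order
+model.eval()
+with torch.no_grad():
+    chosen_rewards = model(**chosen_inputs).logits.squeeze(-1)
+    rejected_rewards = model(**rejected_inputs).logits.squeeze(-1)
+
+# Chosen should score higher than rejected
+accuracy = (chosen_rewards > rejected_rewards).float().mean().item()
+print(f"Accuracy: {accuracy:.2%}")  # Target: >80%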
+```
+
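The same check can be sketched without a model, on plain lists of reward scores, which also makes it easy to report the average margin alongside accuracy. The function names and the four score pairs are made up for illustration.

```python
from statistics import mean


def pairwise_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of pairs in which the chosen response outscores the rejected one."""
    if len(chosen_rewards) != len(rejected_rewards) or not chosen_rewards:
        raise ValueError("need two non-empty reward lists of equal length")
    wins = [c > r for c, r in zip(chosen_rewards, rejected_rewards)]
    return sum(wins) / len(wins)


def mean_margin(chosen_rewards, rejected_rewards):
    """Average of (chosen - rejected); larger means wider separation."""
    return mean(c - r for c, r in zip(chosen_rewards, rejected_rewards))


# Made-up scores for four preference pairs
chosen = [1.5, 0.25, 2.0, -0.5]
rejected = [0.5, 0.75, -1.0, -1.5]
print(f"Accuracy: {pairwise_accuracy(chosen, rejected):.2%}")
print(f"Mean margin: {mean_margin(chosen, rejected):.2f}")
```

Accuracy alone can hide a model that ranks pairs correctly by tiny margins, so tracking the mean margin as well gives a rough sense of how cleanly the two groups separate.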
+## References
+
+- InstructGPT paper: https://arxiv.org/abs/2203.02155
+- TRL docs: https://huggingface.co/docs/trl/reward_trainer