refactor: reorganize skills into sub-categories

The skills directory was getting disorganized: mlops alone had 40 skills in a flat list, and 12 categories were singletons with just one skill each.

Code change:
- prompt_builder.py: Support sub-categories in skill scanner. skills/mlops/training/axolotl/SKILL.md now shows as category 'mlops/training' instead of just 'mlops'. Backwards-compatible with existing flat structure.

Split mlops (40 skills) into 7 sub-categories:
- mlops/training (12): accelerate, axolotl, flash-attention, grpo-rl-training, peft, pytorch-fsdp, pytorch-lightning, simpo, slime, torchtitan, trl-fine-tuning, unsloth
- mlops/inference (8): gguf, guidance, instructor, llama-cpp, obliteratus, outlines, tensorrt-llm, vllm
- mlops/models (6): audiocraft, clip, llava, segment-anything, stable-diffusion, whisper
- mlops/vector-databases (4): chroma, faiss, pinecone, qdrant
- mlops/evaluation (5): huggingface-tokenizers, lm-evaluation-harness, nemo-curator, saelens, weights-and-biases
- mlops/cloud (2): lambda-labs, modal
- mlops/research (1): dspy

Merged singleton categories:
- gifs → media (gif-search joins youtube-content)
- music-creation → media (heartmula, songsee)
- diagramming → creative (excalidraw joins ascii-art)
- ocr-and-documents → productivity
- domain → research (domain-intel)
- feeds → research (blogwatcher)
- market-data → research (polymarket)

Fixed misplaced skills:
- mlops/code-review → software-development (not ML-specific)
- mlops/ml-paper-writing → research (academic writing)

Added DESCRIPTION.md files for all new/updated categories.
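
The sub-category behavior described for prompt_builder.py can be illustrated with a minimal sketch. This is not the actual scanner code, which this diff does not show; the function name `skill_category` and the `root` parameter are hypothetical. It assumes the category is every directory between the skills root and the skill's own directory, which covers both the nested and the flat layout.

```python
from pathlib import PurePosixPath


def skill_category(skill_md: str, root: str = "skills") -> str:
    """Derive a category string from the path of a SKILL.md file.

    Hypothetical helper: assumes the layout <root>/<category...>/<skill>/SKILL.md,
    where <category...> is one or more directories.
    """
    rel = PurePosixPath(skill_md).relative_to(root)
    # rel.parts looks like (*category_dirs, skill_name, "SKILL.md");
    # dropping the last two entries leaves only the category directories.
    return "/".join(rel.parts[:-2])


# Nested layout yields a sub-category; flat layout still yields the top-level one.
print(skill_category("skills/mlops/training/axolotl/SKILL.md"))
print(skill_category("skills/creative/ascii-art/SKILL.md"))
```

Because the category is derived from path depth rather than a fixed position, a skill that stays in a flat directory keeps its single-segment category, which is one way the backwards compatibility mentioned above could be achieved.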
2026-03-09 03:35:53 -07:00
commit 732c66b0f3
parent d6c710706f
217 changed files with 39 additions and 4 deletions
--- a/skills/mlops/clip/SKILL.md
+++ b/skills/mlops/clip/SKILL.md
@@ -1,256 +0,0 @@
----
-name: clip
-description: OpenAI's model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.
-version: 1.0.0
-author: Orchestra Research
-license: MIT
-dependencies: [transformers, torch, pillow]
-metadata:
-  hermes:
-    tags: [Multimodal, CLIP, Vision-Language, Zero-Shot, Image Classification, OpenAI, Image Search, Cross-Modal Retrieval, Content Moderation]
-
----
-
-# CLIP - Contrastive Language-Image Pre-Training
-
-OpenAI's model that understands images from natural language.
-
-## When to use CLIP
-
-**Use when:**
-- Zero-shot image classification (no training data needed)
-- Image-text similarity/matching
-- Semantic image search
-- Content moderation (detect NSFW, violence)
-- Visual question answering
-- Cross-modal retrieval (image→text, text→image)
-
-**Metrics**:
-- **25,300+ GitHub stars**
-- Trained on 400M image-text pairs
-- Matches ResNet-50 on ImageNet (zero-shot)
-- MIT License
-
-**Use alternatives instead**:
-- **BLIP-2**: Better captioning
-- **LLaVA**: Vision-language chat
-- **Segment Anything**: Image segmentation
-
-## Quick start
-
-### Installation
-
-```bash
-pip install git+https://github.com/openai/CLIP.git
-pip install torch torchvision ftfy regex tqdm
-```
-
-### Zero-shot classification
-
-```python
-import torch
-import clip
-from PIL import Image
-
-# Load model
-device = "cuda" if torch.cuda.is_available() else "cpu"
-model, preprocess = clip.load("ViT-B/32", device=device)
-
-# Load image
-image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
-
-# Define possible labels
-text = clip.tokenize(["a dog", "a cat", "a bird", "a car"]).to(device)
-
-# Compute similarity
-with torch.no_grad():
-    image_features = model.encode_image(image)
-    text_features = model.encode_text(text)
-
-    # Logits: cosine similarity scaled by the model's learned temperature
-    logits_per_image, logits_per_text = model(image, text)
-    probs = logits_per_image.softmax(dim=-1).cpu().numpy()
-
-# Print results
-labels = ["a dog", "a cat", "a bird", "a car"]
-for label, prob in zip(labels, probs[0]):
-    print(f"{label}: {prob:.2%}")
-```
-
-## Available models
-
-```python
-# Models (sorted by size)
-models = [
-    "RN50",           # ResNet-50
-    "RN101",          # ResNet-101
-    "ViT-B/32",       # Vision Transformer (recommended)
-    "ViT-B/16",       # Better quality, slower
-    "ViT-L/14",       # Best quality, slowest
-]
-
-model, preprocess = clip.load("ViT-B/32")
-```
-
-| Model | Parameters | Speed | Quality |
-|-------|------------|-------|---------|
-| RN50 | 102M | Fast | Good |
-| ViT-B/32 | 151M | Medium | Better |
-| ViT-L/14 | 428M | Slow | Best |
-
-## Image-text similarity
-
-```python
-# Compute embeddings
-image_features = model.encode_image(image)
-text_features = model.encode_text(text)
-
-# Normalize
-image_features /= image_features.norm(dim=-1, keepdim=True)
-text_features /= text_features.norm(dim=-1, keepdim=True)
-
-# Cosine similarity (.item() assumes exactly one image and one text)
-similarity = (image_features @ text_features.T).item()
-print(f"Similarity: {similarity:.4f}")
-```
-
-## Semantic image search
-
-```python
-# Index images
-image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
-image_embeddings = []
-
-for img_path in image_paths:
-    image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
-    with torch.no_grad():
-        embedding = model.encode_image(image)
-        embedding /= embedding.norm(dim=-1, keepdim=True)
-    image_embeddings.append(embedding)
-
-image_embeddings = torch.cat(image_embeddings)
-
-# Search with text query
-query = "a sunset over the ocean"
-text_input = clip.tokenize([query]).to(device)
-with torch.no_grad():
-    text_embedding = model.encode_text(text_input)
-    text_embedding /= text_embedding.norm(dim=-1, keepdim=True)
-
-# Find most similar images
-similarities = (text_embedding @ image_embeddings.T).squeeze(0)
-top_k = similarities.topk(3)
-
-for idx, score in zip(top_k.indices, top_k.values):
-    print(f"{image_paths[idx]}: {score:.3f}")
-```
-
-## Content moderation
-
-```python
-# Define categories
-categories = [
-    "safe for work",
-    "not safe for work",
-    "violent content",
-    "graphic content"
-]
-
-text = clip.tokenize(categories).to(device)
-
-# Check image
-with torch.no_grad():
-    logits_per_image, _ = model(image, text)
-    probs = logits_per_image.softmax(dim=-1)
-
-# Get classification
-max_idx = probs.argmax().item()
-max_prob = probs[0, max_idx].item()
-
-print(f"Category: {categories[max_idx]} ({max_prob:.2%})")
-```
-
-## Batch processing
-
-```python
-# Process multiple images
-images = [preprocess(Image.open(f"img{i}.jpg")) for i in range(10)]
-images = torch.stack(images).to(device)
-
-with torch.no_grad():
-    image_features = model.encode_image(images)
-    image_features /= image_features.norm(dim=-1, keepdim=True)
-
-# Batch text
-texts = ["a dog", "a cat", "a bird"]
-text_tokens = clip.tokenize(texts).to(device)
-
-with torch.no_grad():
-    text_features = model.encode_text(text_tokens)
-    text_features /= text_features.norm(dim=-1, keepdim=True)
-
-# Similarity matrix (10 images × 3 texts)
-similarities = image_features @ text_features.T
-print(similarities.shape)  # (10, 3)
-```
-
-## Integration with vector databases
-
-```python
-# Store CLIP embeddings in Chroma/FAISS
-import chromadb
-
-client = chromadb.Client()
-collection = client.create_collection("image_embeddings")
-
-# Add image embeddings
-for img_path, embedding in zip(image_paths, image_embeddings):
-    collection.add(
-        embeddings=[embedding.cpu().numpy().tolist()],
-        metadatas=[{"path": img_path}],
-        ids=[img_path]
-    )
-
-# Query with text (detach so .numpy() works; shape (1, 512) is already a list of one query)
-query = "a sunset"
-text_embedding = model.encode_text(clip.tokenize([query]).to(device)).detach()
-results = collection.query(
-    query_embeddings=text_embedding.cpu().numpy().tolist(),
-    n_results=5
-)
-```
-
-## Best practices
-
-1. **Use ViT-B/32 for most cases** - Good balance
-2. **Normalize embeddings** - Required for cosine similarity
-3. **Batch processing** - More efficient
-4. **Cache embeddings** - Expensive to recompute
-5. **Use descriptive labels** - Better zero-shot performance
-6. **GPU recommended** - 10-50× faster
-7. **Preprocess images** - Use provided preprocess function
-
-## Performance
-
-| Operation | CPU | GPU (V100) |
-|-----------|-----|------------|
-| Image encoding | ~200ms | ~20ms |
-| Text encoding | ~50ms | ~5ms |
-| Similarity compute | <1ms | <1ms |
-
-## Limitations
-
-1. **Not for fine-grained tasks** - Best for broad categories
-2. **Requires descriptive text** - Vague labels perform poorly
-3. **Biased on web data** - May have dataset biases
-4. **No bounding boxes** - Whole image only
-5. **Limited spatial understanding** - Position/counting weak
-
-## Resources
-
-- **GitHub**: https://github.com/openai/CLIP ⭐ 25,300+
-- **Paper**: https://arxiv.org/abs/2103.00020
-- **Colab**: https://colab.research.google.com/github/openai/clip/
-- **License**: MIT
-
-