media-skill/story-gen/SKILL.md
2026-04-06 22:12:48 +03:00


name: story-gen
description: Generate a structured video scenario (JSON) from any input: product description, idea, joke, educational topic, or URL. Adapts to platform (TikTok, WB, YouTube, Instagram, VK), audience, and content restrictions. Returns scenes with detailed visual prompts for image/video generation, voiceover text, captions, and timing. Use when: user wants to create any video, whether ad, viral reel, educational, postcard, long-form (2 min), or product showcase.

Story Gen

Universal video scenario generator. Works for any content type and platform.

Language rules

  • This skill and all its documentation are written in English only
  • Input can be in any language — Russian, English, Chinese, etc.
  • visual_prompt is always in English (required by gpt-image-1.5 and veo-3.1)
  • voiceover and caption match the input language or --lang parameter
  • If --lang auto (default): language is detected automatically from input
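
The rules above can be sketched as a small resolver. This is only an illustration: the source does not say how automatic detection works, so `detect_language` below is a hypothetical script-based heuristic, and both function names are assumptions rather than part of generate.py.

```python
def detect_language(text: str) -> str:
    """Naive guess from the Unicode script (hypothetical stand-in for real detection)."""
    for ch in text:
        code = ord(ch)
        if 0x0400 <= code <= 0x04FF:  # Cyrillic block
            return "ru"
        if 0x4E00 <= code <= 0x9FFF:  # CJK Unified Ideographs
            return "zh"
    return "en"


def resolve_languages(input_text: str, lang: str = "auto") -> dict:
    """Return the language used for each text field of a scene."""
    target = detect_language(input_text) if lang == "auto" else lang
    return {
        "visual_prompt": "en",  # always English, per the rules above
        "voiceover": target,    # follows the input language or --lang
        "caption": target,
    }


print(resolve_languages("Женская сумка из экокожи"))
print(resolve_languages("Анекдот про программиста и кофе", lang="en"))
```

The point of the sketch is the split: `visual_prompt` never follows the input language, while `voiceover` and `caption` always do unless `--lang` overrides it.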

When to use

  • User wants to make a video for Wildberries, TikTok, Instagram, YouTube, VK
  • User has a product, idea, joke, or topic and wants a ready script
  • Pipeline needs structured JSON with visual prompts + voiceover for next steps
  • User provides assets (photos, URLs) that need analysis before scripting

Setup

Needs env:

Two modes

--mode image (default)

Generates a storyboard scenario for image/visual generation. Each scene has a visual_prompt (English) ready for gpt-image-1.5 or veo-3.1.

For trend-photo asks such as anime portrait, studio headshot, USSR postcard, photo booth, aged self, flowers in hair, and the other curated portrait trends stored under scripts/trends/, generate.py can also emit a single-image scenario that downstream image generation can consume directly.

--mode video

Generates a full shooting script for real video production. Each scene has:

  • timecode — cumulative start time HH:MM:SS
  • voiceover — exact words spoken by narrator (in target language)
  • action — what happens on screen in English (for director / video generation)

Automatically saves two files when --out is given:

  • scenario.json — full structured script
  • scenario_voiceover.txt — ready for voice/voice_acting.py in [HH:MM:SS] text format
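
A minimal sketch of how the two files could be derived from one scenario, assuming each voiceover line is written as `[HH:MM:SS] text` and the timecode is the running sum of the preceding `duration_sec` values. The helper names (`to_timecode`, `voiceover_lines`, `save_outputs`) are illustrative and not taken from generate.py.

```python
import json
from pathlib import Path


def to_timecode(total_seconds: int) -> str:
    """Format a number of seconds as HH:MM:SS."""
    hours, rest = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def voiceover_lines(scenes: list[dict]) -> list[str]:
    """Build one '[HH:MM:SS] text' line per scene with a cumulative start time."""
    lines = []
    elapsed = 0
    for scene in scenes:
        lines.append(f"[{to_timecode(elapsed)}] {scene['voiceover']}")
        elapsed += scene["duration_sec"]
    return lines


def save_outputs(scenario: dict, out_path: str) -> tuple[Path, Path]:
    """Write <name>.json and <name>_voiceover.txt next to each other."""
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(scenario, ensure_ascii=False, indent=2), encoding="utf-8")
    voiceover_path = out.with_name(out.stem + "_voiceover.txt")
    text = "\n".join(voiceover_lines(scenario["scenes"])) + "\n"
    voiceover_path.write_text(text, encoding="utf-8")
    return out, voiceover_path
```

With `--out assets/scenario.json`, this naming scheme yields `assets/scenario_voiceover.txt`, matching the file names listed above.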

Parameters

| Parameter | Values | Description |
| --- | --- | --- |
| --mode | image, video | image: visual storyboard; video: full shooting script with voiceover |
| --format | wb_ad, reels, viral, long, postcard, educational, trend_photo, auto | Video format (image mode only) |
| --platform | tiktok, instagram, wb, youtube, vk, auto | Target platform |
| --audience | any text | Target audience description |
| --duration | seconds | Target duration |
| --lang | ru, en, de, auto | Language for voiceover and captions |
| --photo | filepath | Reference photo path for trend-photo scenarios |
| --analyze | flag | Analyze assets before generating (image mode only) |
| --out | filepath | Save JSON to file (video mode also saves _voiceover.txt) |
| --voice | flag | After script generation, immediately run voice synthesis (video mode + --out required) |
| --voice-out | dirpath | Directory for voice segments (default: voice_segments/ next to --out) |
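
For orientation, the parameters above could be declared with `argparse` roughly as follows. This is a sketch, not the actual parser in generate.py: the `auto` defaults for `--format` and `--platform` are assumptions (only the defaults for `--mode` and `--lang` are stated in this document), and `build_parser` / `parse_args` are illustrative names.

```python
import argparse
from pathlib import Path

FORMATS = ["wb_ad", "reels", "viral", "long", "postcard", "educational", "trend_photo", "auto"]
PLATFORMS = ["tiktok", "instagram", "wb", "youtube", "vk", "auto"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="generate.py")
    parser.add_argument("input", help="product description, idea, joke, topic, or URL")
    parser.add_argument("--mode", choices=["image", "video"], default="image")
    parser.add_argument("--format", choices=FORMATS, default="auto")      # default assumed
    parser.add_argument("--platform", choices=PLATFORMS, default="auto")  # default assumed
    parser.add_argument("--audience", help="target audience description")
    parser.add_argument("--duration", type=int, help="target duration in seconds")
    parser.add_argument("--lang", default="auto")
    parser.add_argument("--photo", help="reference photo for trend-photo scenarios")
    parser.add_argument("--analyze", action="store_true")
    parser.add_argument("--out", help="path of the JSON file to write")
    parser.add_argument("--voice", action="store_true")
    parser.add_argument("--voice-out", dest="voice_out")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # --voice only makes sense for a video script that was saved to disk.
    if args.voice and not (args.mode == "video" and args.out):
        parser.error("--voice requires --mode video and --out")
    # Default segment directory: voice_segments/ next to the --out file.
    if args.voice and args.voice_out is None:
        args.voice_out = str(Path(args.out).parent / "voice_segments")
    return args
```

The cross-parameter check mirrors the "video mode + --out required" note in the table. `--lang` is left unrestricted because the output JSON allows languages beyond those listed.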

Usage examples

# WB product ad — image storyboard (default mode)
python3 {baseDir}/scripts/generate.py \
  "Женская сумка из экокожи, бежевая, 2500 руб" \
  --format wb_ad --platform wb

# Full video shooting script + automatically run voice synthesis
python3 {baseDir}/scripts/generate.py \
  "Обзор беговых кроссовок Nike для TikTok" \
  --mode video --platform tiktok --duration 60 --lang ru \
  --out assets/scenario.json --voice
# → saves assets/scenario.json
# → saves assets/scenario_voiceover.txt
# → runs voice_acting.py → saves wav segments to assets/voice_segments/
# → saves assets/voice_segments/segments.txt (manifest for combine_audio.sh)

# Without auto voice (manual step later):
python3 {baseDir}/scripts/generate.py \
  "Обзор беговых кроссовок Nike для TikTok" \
  --mode video --platform tiktok --duration 60 --lang ru \
  --out assets/scenario.json
# Then manually:
python3 voice/voice_acting.py assets/scenario_voiceover.txt -o assets/voice_segments

# Viral TikTok image storyboard (English voiceover)
python3 {baseDir}/scripts/generate.py \
  "Анекдот про программиста и кофе" \
  --format viral --platform tiktok --lang en

# Curated trend-photo scenario for downstream image generation
python3 {baseDir}/scripts/generate.py \
  "Сделай меня в стиле аниме" \
  --format trend_photo --photo assets/me.jpg \
  --out assets/trend-scenario.json
# Then hand the JSON to image-generation:
# python3 image-generation/scripts/generate-image.py --scenario assets/trend-scenario.json

# Long educational video shooting script
python3 {baseDir}/scripts/generate.py \
  "How to choose your first bicycle" \
  --mode video --platform youtube --duration 120 --lang en \
  --out assets/bicycle_scenario.json

Output JSON — image mode

{
  "title": "video title",
  "format": "wb_ad|reels|viral|long|postcard|educational|trend_photo",
  "platform": "tiktok|instagram|wb|youtube|vk",
  "language": "ru|en|...",
  "duration_sec": 30,
  "hook": "first 3 seconds — grabbing phrase or action",
  "target_audience": "who watches this",
  "content_restrictions": "platform rules (aspect ratio, age restrictions, etc.)",
  "scenes": [
    {
      "id": 1,
      "duration_sec": 5,
      "visual_prompt": "ALWAYS IN ENGLISH — detailed prompt for gpt-image-1.5 or veo-3.1",
      "visual_type": "image|video_clip|text_only",
      "voiceover": "narration text in target language",
      "caption": "on-screen text in target language"
    }
  ],
  "image_request": {
    "prompt": "single image prompt for downstream image-generation",
    "reference_image_required": true,
    "reference_image_path": "/abs/path/to/photo.jpg"
  },
  "storyboard_grid_prompt": "NxN storyboard grid — all scenes as one image. null if no recurring subject.",
  "music_mood": "upbeat|calm|dramatic|funny|inspirational",
  "style_notes": "overall style and delivery notes",
  "asset_analysis": null
}
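
Before handing an image-mode scenario to later steps, a light structural check can catch obvious problems. The sketch below assumes which keys are mandatory (the source does not say) and uses an ASCII test as a rough proxy for "visual_prompt is in English". `validate_image_scenario` is a hypothetical helper name.

```python
REQUIRED_TOP_LEVEL = ("title", "format", "platform", "language", "duration_sec", "scenes")
REQUIRED_SCENE = ("id", "duration_sec", "visual_prompt", "visual_type", "voiceover", "caption")
VISUAL_TYPES = {"image", "video_clip", "text_only"}


def validate_image_scenario(scenario: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the check passed."""
    problems = [f"missing top-level key: {key}" for key in REQUIRED_TOP_LEVEL if key not in scenario]
    for scene in scenario.get("scenes", []):
        label = f"scene {scene.get('id', '?')}"
        for key in REQUIRED_SCENE:
            if key not in scene:
                problems.append(f"{label}: missing key {key}")
        if "visual_type" in scene and scene["visual_type"] not in VISUAL_TYPES:
            problems.append(f"{label}: unknown visual_type {scene['visual_type']!r}")
        # Rough heuristic only: English prompts may legitimately contain non-ASCII characters.
        if not str(scene.get("visual_prompt", "")).isascii():
            problems.append(f"{label}: visual_prompt contains non-ASCII text, expected English")
    return problems
```

Returning a list rather than raising on the first error lets a caller report every issue in one pass.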

Output JSON — video mode

{
  "title": "video title",
  "platform": "tiktok|instagram|wb|youtube|vk",
  "language": "ru|en|...",
  "duration_sec": 60,
  "hook": "first 3 seconds — what grabs attention",
  "target_audience": "who watches this",
  "scenes": [
    {
      "id": 1,
      "timecode": "00:00:00",
      "duration_sec": 5,
      "voiceover": "exact words spoken by narrator in target language",
      "action": "detailed English description of what is on screen: camera, subject, movement, lighting"
    }
  ],
  "music_mood": "upbeat|calm|dramatic|funny|inspirational",
  "style_notes": "overall visual style, pacing, tone"
}
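
Since `timecode` is described as a cumulative start time, each scene should begin where the previous one ends. The following sketch checks that property. It assumes the first scene starts at 00:00:00 and that scenes are contiguous, and the function names are illustrative.

```python
def parse_timecode(timecode: str) -> int:
    """Convert 'HH:MM:SS' to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in timecode.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def check_timing(scenario: dict) -> list[str]:
    """Report scenes whose timecode does not match the end of the previous scene."""
    problems = []
    expected = 0
    for scene in scenario["scenes"]:
        actual = parse_timecode(scene["timecode"])
        if actual != expected:
            problems.append(
                f"scene {scene['id']}: starts at {actual}s, expected {expected}s"
            )
        # Continue from the declared start so one bad scene is reported only once.
        expected = actual + scene["duration_sec"]
    return problems


def total_duration(scenario: dict) -> int:
    """Sum of scene durations, to compare against the top-level duration_sec target."""
    return sum(scene["duration_sec"] for scene in scenario["scenes"])
```

Comparing `total_duration(scenario)` with the top-level `duration_sec` is left to the caller, since that value is described as a target rather than an exact length.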

Pipeline integration

Image mode output feeds into:

  • visual_prompt → image generation (gpt-image-1.5) or video (veo-3.1)
  • image_request.prompt + reference_image_path → image-generation/scripts/generate-image.py for trend-photo edits
  • voiceover → TTS (Pocket-TTS or ElevenLabs)
  • caption + duration_sec → ffmpeg montage (../ffmpeg-editing/SKILL.md)
  • Full JSON → orchestrator (../SKILL.md)
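
The fan-out above can be pictured as splitting one scenario into per-stage work lists. This is a hedged sketch: the assumption that `text_only` scenes need no visual generation, the job dictionaries, and the name `split_for_pipeline` are all illustrative and not defined by this skill.

```python
def split_for_pipeline(scenario: dict) -> dict:
    """Group image-mode scene fields by the downstream step that consumes them."""
    visual_jobs, tts_jobs, captions = [], [], []
    start = 0
    for scene in scenario["scenes"]:
        end = start + scene["duration_sec"]
        # Assumption: only image and video_clip scenes are sent to a generator.
        if scene["visual_type"] in ("image", "video_clip"):
            visual_jobs.append(
                {"scene_id": scene["id"], "kind": scene["visual_type"], "prompt": scene["visual_prompt"]}
            )
        if scene.get("voiceover"):
            tts_jobs.append({"scene_id": scene["id"], "text": scene["voiceover"]})
        if scene.get("caption"):
            # Start and end offsets give the montage step its caption timing.
            captions.append(
                {"scene_id": scene["id"], "text": scene["caption"], "start_sec": start, "end_sec": end}
            )
        start = end
    return {"visual_jobs": visual_jobs, "tts_jobs": tts_jobs, "captions": captions}
```

Keeping `scene_id` on every job lets the montage step match generated images, audio, and captions back to the same scene.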

Video mode output feeds into:

  • _voiceover.txt → voice/voice_acting.py for speech synthesis
  • action per scene → video generation or director instructions
  • Full JSON → orchestrator (../SKILL.md)
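
A consumer of the voiceover file might read it back as follows. This is a sketch under the assumption that every line has the form `[HH:MM:SS] text`. It does not reproduce how voice/voice_acting.py actually parses its input, and `parse_voiceover` is a hypothetical name.

```python
import re

# One segment per line: a bracketed timecode followed by the spoken text.
LINE_RE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s+(.*)$")


def parse_voiceover(text: str) -> list[dict]:
    """Turn '[HH:MM:SS] text' lines into segments with a start offset in seconds."""
    segments = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip blank or malformed lines
        hours, minutes, seconds, spoken = match.groups()
        start = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
        segments.append({"start_sec": start, "text": spoken})
    return segments
```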