| name | description |
|---|---|
| story-gen | Generate a structured video scenario (JSON) from any input: product description, idea, joke, educational topic, or URL. Adapts to platform (TikTok, WB, YouTube, Instagram, VK), audience, and content restrictions. Returns scenes with detailed visual prompts for image/video generation, voiceover text, captions, and timing. Use when: user wants to create any video — ad, viral reel, educational, postcard, long-form (2 min), or product showcase. |
Story Gen
Universal video scenario generator. Works for any content type and platform.
Language rules
- This skill and all its documentation are written in English only
- Input can be in any language: Russian, English, Chinese, etc.
- `visual_prompt` is always in English (required by gpt-image-1.5 and veo-3.1)
- `voiceover` and `caption` match the input language or the `--lang` parameter
- If `--lang auto` (default): language is detected automatically from input
When to use
- User wants to make a video for Wildberries, TikTok, Instagram, YouTube, VK
- User has a product, idea, joke, or topic and wants a ready script
- Pipeline needs structured JSON with visual prompts + voiceover for next steps
- User provides assets (photos, URLs) that need analysis before scripting
Setup
Required environment variables:
- `OPENAI_API_KEY`: API key
- `OPENAI_BASE_URL`: endpoint (default: https://llm.lambda.coredump.ru/v1)
- `STORY_MODEL`: model (default: qwen3.5-122b)
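As a rough illustration, the environment variables above could be resolved as follows. This is a minimal sketch, not the actual code in generate.py: the `load_settings` helper is hypothetical, and treating a missing `OPENAI_API_KEY` as an error is an assumption.

```python
import os

# Defaults copied from the Setup list above.
DEFAULT_BASE_URL = "https://llm.lambda.coredump.ru/v1"
DEFAULT_MODEL = "qwen3.5-122b"


def load_settings(env=None):
    """Return API settings, applying the documented defaults (hypothetical helper)."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        # Assumption: the key has no default, so a missing key is an error.
        raise RuntimeError("OPENAI_API_KEY is not set")
    return {
        "api_key": api_key,
        "base_url": env.get("OPENAI_BASE_URL", DEFAULT_BASE_URL),
        "model": env.get("STORY_MODEL", DEFAULT_MODEL),
    }
```

Passing a plain dict instead of `os.environ` makes the helper easy to exercise without touching the real environment.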
Two modes
--mode image (default)
Generates a storyboard scenario for image/visual generation. Each scene has a visual_prompt (English) ready for gpt-image-1.5 or veo-3.1.
For trend-photo asks such as anime portrait, studio headshot, USSR postcard,
photo booth, aged self, flowers in hair, and the other curated portrait trends
stored under scripts/trends/, generate.py can also emit a single-image
scenario that downstream image generation can consume directly.
--mode video
Generates a full shooting script for real video production. Each scene has:
- `timecode`: cumulative start time in `HH:MM:SS`
- `voiceover`: exact words spoken by narrator (in target language)
- `action`: what happens on screen, in English (for director / video generation)
Automatically saves two files when --out is given:
- `scenario.json`: full structured script
- `scenario_voiceover.txt`: ready for `voice/voice_acting.py`, in `[HH:MM:SS] text` format
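The voiceover file can be thought of as a projection of the video-mode JSON onto `[HH:MM:SS] text` lines. The sketch below shows one way to derive it. The `voiceover_lines` function is hypothetical, and skipping scenes with an empty `voiceover` is an assumption rather than documented behavior.

```python
def voiceover_lines(scenario):
    """Build '[HH:MM:SS] text' lines from a video-mode scenario dict."""
    lines = []
    for scene in scenario["scenes"]:
        text = scene.get("voiceover", "").strip()
        if text:  # assumption: scenes without narration produce no line
            lines.append(f"[{scene['timecode']}] {text}")
    return lines


example = {
    "scenes": [
        {"id": 1, "timecode": "00:00:00", "voiceover": "Meet your new running shoes."},
        {"id": 2, "timecode": "00:00:05", "voiceover": ""},
        {"id": 3, "timecode": "00:00:09", "voiceover": "Light, fast, and durable."},
    ]
}
print("\n".join(voiceover_lines(example)))
```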
Parameters
| Parameter | Values | Description |
|---|---|---|
| `--mode` | `image`, `video` | `image`: visual storyboard; `video`: full shooting script with voiceover |
| `--format` | `wb_ad`, `reels`, `viral`, `long`, `postcard`, `educational`, `trend_photo`, `auto` | Video format (image mode only) |
| `--platform` | `tiktok`, `instagram`, `wb`, `youtube`, `vk`, `auto` | Target platform |
| `--audience` | any text | Target audience description |
| `--duration` | seconds | Target duration |
| `--lang` | `ru`, `en`, `de`, `auto` | Language for voiceover and captions |
| `--photo` | filepath | Reference photo path for trend-photo scenarios |
| `--analyze` | flag | Analyze assets before generating (image mode only) |
| `--out` | filepath | Save JSON to file (video mode also saves `_voiceover.txt`) |
| `--voice` | flag | After script generation, immediately run voice synthesis (video mode + `--out` required) |
| `--voice-out` | dirpath | Directory for voice segments (default: `voice_segments/` next to `--out`) |
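To make the constraints in the table concrete, here is a minimal `argparse` sketch of the same interface. It is not the parser used by generate.py. The `auto` defaults for `--format` and `--platform` are assumptions (only `--mode image` and `--lang auto` are documented as defaults), and the `--voice` check simply encodes the "video mode + `--out` required" rule.

```python
import argparse

FORMATS = ["wb_ad", "reels", "viral", "long", "postcard",
           "educational", "trend_photo", "auto"]
PLATFORMS = ["tiktok", "instagram", "wb", "youtube", "vk", "auto"]


def build_parser():
    p = argparse.ArgumentParser(prog="generate.py")
    p.add_argument("text", help="product description, idea, joke, topic, or URL")
    p.add_argument("--mode", choices=["image", "video"], default="image")
    p.add_argument("--format", choices=FORMATS, default="auto")      # default assumed
    p.add_argument("--platform", choices=PLATFORMS, default="auto")  # default assumed
    p.add_argument("--audience")
    p.add_argument("--duration", type=int, help="target duration in seconds")
    p.add_argument("--lang", default="auto")  # left open: ru, en, de, auto, ...
    p.add_argument("--photo")
    p.add_argument("--analyze", action="store_true")
    p.add_argument("--out")
    p.add_argument("--voice", action="store_true")
    p.add_argument("--voice-out", dest="voice_out")
    return p


def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # --voice only makes sense when a video script is written to disk.
    if args.voice and (args.mode != "video" or not args.out):
        parser.error("--voice requires --mode video and --out")
    return args
```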
Usage examples
# WB product ad — image storyboard (default mode)
python3 {baseDir}/scripts/generate.py \
"Женская сумка из экокожи, бежевая, 2500 руб" \
--format wb_ad --platform wb
# Full video shooting script + automatically run voice synthesis
python3 {baseDir}/scripts/generate.py \
"Обзор беговых кроссовок Nike для TikTok" \
--mode video --platform tiktok --duration 60 --lang ru \
--out assets/scenario.json --voice
# → saves assets/scenario.json
# → saves assets/scenario_voiceover.txt
# → runs voice_acting.py → saves wav segments to assets/voice_segments/
# → saves assets/voice_segments/segments.txt (manifest for combine_audio.sh)
# Without auto voice (manual step later):
python3 {baseDir}/scripts/generate.py \
"Обзор беговых кроссовок Nike для TikTok" \
--mode video --platform tiktok --duration 60 --lang ru \
--out assets/scenario.json
# Then manually:
python3 voice/voice_acting.py assets/scenario_voiceover.txt -o assets/voice_segments
# Viral TikTok image storyboard (English voiceover)
python3 {baseDir}/scripts/generate.py \
"Анекдот про программиста и кофе" \
--format viral --platform tiktok --lang en
# Curated trend-photo scenario for downstream image generation
python3 {baseDir}/scripts/generate.py \
"Сделай меня в стиле аниме" \
--format trend_photo --photo assets/me.jpg \
--out assets/trend-scenario.json
# Then hand the JSON to image-generation:
# python3 image-generation/scripts/generate-image.py --scenario assets/trend-scenario.json
# Long educational video shooting script
python3 {baseDir}/scripts/generate.py \
"How to choose your first bicycle" \
--mode video --platform youtube --duration 120 --lang en \
--out assets/bicycle_scenario.json
Output JSON — image mode
{
  "title": "video title",
  "format": "wb_ad|reels|viral|long|postcard|educational|trend_photo",
  "platform": "tiktok|instagram|wb|youtube|vk",
  "language": "ru|en|...",
  "duration_sec": 30,
  "hook": "first 3 seconds: grabbing phrase or action",
  "target_audience": "who watches this",
  "content_restrictions": "platform rules (aspect ratio, age restrictions, etc.)",
  "scenes": [
    {
      "id": 1,
      "duration_sec": 5,
      "visual_prompt": "ALWAYS IN ENGLISH: detailed prompt for gpt-image-1.5 or veo-3.1",
      "visual_type": "image|video_clip|text_only",
      "voiceover": "narration text in target language",
      "caption": "on-screen text in target language"
    }
  ],
  "image_request": {
    "prompt": "single image prompt for downstream image-generation",
    "reference_image_required": true,
    "reference_image_path": "/abs/path/to/photo.jpg"
  },
  "storyboard_grid_prompt": "NxN storyboard grid: all scenes as one image. null if no recurring subject.",
  "music_mood": "upbeat|calm|dramatic|funny|inspirational",
  "style_notes": "overall style and delivery notes",
  "asset_analysis": null
}
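A downstream step may want to sanity-check an image-mode scenario before spending image-generation calls on it. The sketch below is one possible check, not part of the skill. The `check_image_scenario` function is hypothetical, treating every scene field as required is an assumption, and the ASCII test is only a rough heuristic for "`visual_prompt` is in English".

```python
REQUIRED_SCENE_KEYS = {"id", "duration_sec", "visual_prompt",
                       "visual_type", "voiceover", "caption"}
VISUAL_TYPES = {"image", "video_clip", "text_only"}


def check_image_scenario(scenario):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for scene in scenario.get("scenes", []):
        sid = scene.get("id")
        missing = REQUIRED_SCENE_KEYS - scene.keys()
        if missing:
            problems.append(f"scene {sid}: missing {sorted(missing)}")
        if "visual_type" in scene and scene["visual_type"] not in VISUAL_TYPES:
            problems.append(f"scene {sid}: unknown visual_type {scene['visual_type']!r}")
        # Heuristic only: non-ASCII characters suggest a non-English prompt.
        if not scene.get("visual_prompt", "").isascii():
            problems.append(f"scene {sid}: visual_prompt contains non-ASCII text")
    return problems
```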
Output JSON — video mode
{
  "title": "video title",
  "platform": "tiktok|instagram|wb|youtube|vk",
  "language": "ru|en|...",
  "duration_sec": 60,
  "hook": "first 3 seconds: what grabs attention",
  "target_audience": "who watches this",
  "scenes": [
    {
      "id": 1,
      "timecode": "00:00:00",
      "duration_sec": 5,
      "voiceover": "exact words spoken by narrator in target language",
      "action": "detailed English description of what is on screen: camera, subject, movement, lighting"
    }
  ],
  "music_mood": "upbeat|calm|dramatic|funny|inspirational",
  "style_notes": "overall visual style, pacing, tone"
}
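Since `timecode` is a cumulative start time, it can be recomputed from the per-scene `duration_sec` values, which is handy for checking or repairing a generated script. The following is a minimal sketch with hypothetical helper names. It assumes scenes play back to back with no gaps.

```python
def format_timecode(seconds):
    """Format a non-negative number of seconds as HH:MM:SS."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def assign_timecodes(scenes):
    """Return copies of the scenes with cumulative start timecodes filled in."""
    elapsed = 0
    result = []
    for scene in scenes:
        result.append({**scene, "timecode": format_timecode(elapsed)})
        elapsed += scene["duration_sec"]  # assumption: no gaps between scenes
    return result
```

For scenes lasting 5, 70, and 3 seconds, this yields start times `00:00:00`, `00:00:05`, and `00:01:15`.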
Pipeline integration
Image mode output feeds into:
- `visual_prompt` → image generation (gpt-image-1.5) or video (veo-3.1)
- `image_request.prompt` + `reference_image_path` → `image-generation/scripts/generate-image.py` for trend-photo edits
- `voiceover` → TTS (`Pocket-TTS` or `ElevenLabs`)
- `caption` + `duration_sec` → ffmpeg montage (../ffmpeg-editing/SKILL.md)
- Full JSON → orchestrator (../SKILL.md)
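The image-mode routing above amounts to splitting one scenario into per-stage payloads. A minimal sketch of that split is shown here. The `split_for_pipeline` function and the key names of the returned dict are hypothetical and are not an interface defined by the orchestrator.

```python
def split_for_pipeline(scenario):
    """Group image-mode scenario fields by the downstream stage that consumes them."""
    scenes = scenario["scenes"]
    return {
        # visual_prompt -> image or video generation
        "visual_prompts": [s["visual_prompt"] for s in scenes],
        # voiceover -> TTS
        "tts_lines": [s["voiceover"] for s in scenes],
        # caption + duration_sec -> ffmpeg montage
        "montage": [(s["caption"], s["duration_sec"]) for s in scenes],
        # image_request -> generate-image.py (only present for trend-photo scenarios)
        "image_request": scenario.get("image_request"),
    }
```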
Video mode output feeds into:
- `_voiceover.txt` → `voice/voice_acting.py` for speech synthesis
- `action` per scene → video generation or director instructions
- Full JSON → orchestrator (../SKILL.md)