| name | description |
|---|---|
| media-skill | Orchestrate end-to-end media production from mixed inputs. Use when an AI agent needs one main workflow for analyzing source assets such as local videos, images, video links, and user wishes in a scenario; writing a production-ready script; acquiring heavy assets by downloading source clips or generating missing shots with a video model; then producing voiceover and assembling the final montage. Also use for requests about asset analysis, scripting, video generation, clip downloading, voiceover, and montage. |
Media Production Pipeline
Overview
This skill is the top-level orchestrator for turning rough media inputs into a finished short-form video or a production-ready execution plan.
This repository contains local specialist skills that may not appear in the session-level skill registry. Before handling media tasks here, inspect the repo-local SKILL.md files and prefer the narrowest matching skill when the request clearly targets one stage.
Repository setup and tool installation commands live in SETUP.md.
Use it when the task spans several stages at once:
- asset analysis
- script writing
- heavy asset acquisition
- voiceover
- montage
Treat the local modules in this repository as specialists:
- story-gen/SKILL.md: generate the story and structured scenario brief from the normalized request and available assets.
- image-generation/SKILL.md: generate still images from prompts through the repo-local Nano Banana helper when the heavy-assets phase needs new art or another heavy still asset instead of downloaded stills.
- video-generation/SKILL.md: use the existing repo-specific AI video generation pipelines when the heavy-assets phase needs provider-backed generated video, marketplace promo generation, or staged narrative generation instead of ad hoc prompting.
- download-images/SKILL.md: fetch direct still-image assets into assets/ for heavy-asset acquisition, cutouts, and overlays.
- download-youtube-segment/SKILL.md: fetch exact source ranges from YouTube.
- ffmpeg-editing/SKILL.md: deterministic cutting, reframing, audio work, captions, transitions, and export.
- remove-background/SKILL.md: remove still-image backgrounds locally with rembg and transparent PNG output.
- voice/SKILL.md: synthesize and integrate narration with GPT-SoVITS.
Known local skills in this repo:
- SKILL.md: top-level media workflow orchestrator.
- image-generation/SKILL.md: minimal text-to-image path for generated stills and other heavy generated visual assets via openai/gemini-2.5-flash-image.
- video-generation/SKILL.md: repo-specific AI video generation pipelines for generated clips, marketplace promo runs, Telegram-bot-backed generation, and microdrama/story-adaptation work during the heavy-assets phase.
- download-images/SKILL.md: download direct still-image assets into the local working set for heavy-assets acquisition.
- download-youtube-segment/SKILL.md: download YouTube segments or frames with the helper scripts in download-youtube-segment/scripts/.
- ffmpeg-editing/SKILL.md: deterministic ffmpeg-based editing workflows.
- remove-background/SKILL.md: local background removal with rembg, including bootstrap and model download.
- story-gen/SKILL.md: story and scenario generation.
- voice/SKILL.md: voice synthesis and audio integration.
Routing reminders:
- If the heavy-assets phase specifically needs a newly generated still image or another heavy generated visual asset from a prompt, use image-generation/SKILL.md first instead of inventing a fresh image-generation flow.
- If the user explicitly asks only for an AI-generated still image, including requests like "generate a picture based on trends" or other trend-based image generation, treat that as a narrow image-generation task and go directly to image-generation/SKILL.md or the session-level imagegen skill instead of forcing the full media pipeline.
- If the heavy-assets phase specifically needs generated video clips and this repository's existing provider-backed generation workflows fit the task, use video-generation/SKILL.md first instead of inventing a fresh generation flow.
- If the request is specifically about YouTube clipping, use download-youtube-segment/SKILL.md first instead of falling back to ad hoc commands.
- If the request is specifically about removing the background from an image, use remove-background/SKILL.md first instead of ad hoc image-editing commands.
- If the request is specifically about downloading internet stills for the heavy-assets phase, use download-images/SKILL.md first.
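The routing reminders above amount to a lookup from task kind to specialist skill. A minimal sketch, assuming hypothetical task-kind labels (the keys are illustrative and not defined by this repository; the skill paths are the ones listed above):

```python
# Task-kind labels are hypothetical; the skill paths come from this document.
ROUTES = {
    "generated_still": "image-generation/SKILL.md",
    "generated_video": "video-generation/SKILL.md",
    "youtube_clip": "download-youtube-segment/SKILL.md",
    "remove_background": "remove-background/SKILL.md",
    "download_still": "download-images/SKILL.md",
}


def route(task_kind: str) -> str:
    """Return the narrowest matching skill, or the top-level orchestrator."""
    # Anything that does not match a single stage stays with the full pipeline.
    return ROUTES.get(task_kind, "SKILL.md")
```

A request that does not map to one stage falls through to the top-level SKILL.md, which matches the default of treating end-result wording as a full pipeline request.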
There is no single bundled script here for neural video generation. When missing shots must be generated, this skill should define the prompt/spec, target duration, framing, and continuity constraints, then place the generated outputs back into the same pipeline as normal assets.
Inputs To Normalize
Treat assets/ as the canonical folder for source materials unless the user explicitly says otherwise. This folder can contain videos, voice files, images, text wishes, transcripts, source-range briefs, and other reference inputs for the project. Downloaded, generated, extracted, and intermediate media artifacts should also be saved under assets/ by default unless the user explicitly requests another destination.
First reduce the user request into a concrete inventory:
- local videos and images
- external video links
- direct image URLs
- transcript or raw notes
- scenario wishes: tone, hook, pacing, effects, subtitles, music, language
- delivery constraints: aspect ratio, duration, platform, output format
If the user gives a chaotic brief, normalize it before doing expensive work.
Mandatory Sequencing Contract
Treat requests such as "make a video", "make a meme", "assemble a short", or similar end-result wording as a full pipeline request by default, not as permission to jump straight to montage. Only skip to a narrower module when the user explicitly asks for a single stage such as "just clip this", "only write the script", or "only add captions".
If the deliverable is only a trend-based AI-generated still image, this is a single-stage exception: go straight to image-generation/SKILL.md or the session-level imagegen skill and do not require a scenario brief or the rest of the media pipeline.
Before any heavy production step, the agent must create or update a structured scenario brief under assets/, preferably assets/scenario.json, and then use that file as the source of truth for later steps.
This is a hard gate:
- Do not start final source downloads, generation, voice synthesis, or montage until a scenario brief file exists.
- The only allowed pre-scenario exception is lightweight inspection work needed to author the scenario, such as reading local assets, checking metadata, or downloading subtitles/transcripts for quote search.
- If story-gen/SKILL.md cannot run because API access or env is missing, the agent must still write the scenario brief manually instead of skipping the scripting stage.
- If the brief is only a URL plus a short wish such as "make a meme", first inspect/subtitle the source, then write the beat choice and exact source ranges into the scenario brief, and only then acquire or trim the final clip.
- Once production starts, the raw chat message or assets/text.txt must no longer be treated as the de facto plan; the scenario brief must be the working contract.
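The hard gate above can be expressed as a small pre-flight check. This is a sketch under assumptions: the step names and the requirement of a non-empty "beats" list are illustrative choices, not a schema defined by this repository.

```python
import json
from pathlib import Path

# Steps the contract treats as heavy; these labels are hypothetical.
HEAVY_STEPS = {"download", "generate", "voice", "montage"}


def can_start(step: str, brief_path: Path) -> bool:
    """Return True if `step` may run under the scenario-brief hard gate."""
    if step not in HEAVY_STEPS:
        # Lightweight inspection (metadata, subtitles) is allowed pre-scenario.
        return True
    if not brief_path.is_file():
        return False
    try:
        brief = json.loads(brief_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False
    # Assumed minimal shape: a JSON object with a non-empty "beats" list.
    return isinstance(brief, dict) and bool(brief.get("beats"))
```

Calling `can_start("montage", Path("assets/scenario.json"))` before each heavy step keeps the gate mechanical instead of relying on the agent remembering it.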
Workflow
1. Analyze Assets
Build an asset coverage view before writing commands or generating media.
For each input, capture:
- what it is: video, image, link, transcript, or pure creative note
- whether it is usable as-is
- which story beats it can cover
- whether it needs trimming, reframing, downloading, generation, or still-image cleanup such as background removal
- any hard constraints such as aspect ratio, duration, or required wording
If the brief would benefit from extra still images that are not already local, note that the heavy-assets phase may source them from the internet and feed them into the edit as normal assets.
Output of this stage:
- a beat list
- an asset inventory
- a gap list showing which beats are already covered and which still need heavy assets
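One way to derive the gap list from the asset inventory is sketched below. The `Asset` fields and beat ids are hypothetical names chosen for illustration; the repository does not prescribe this structure.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    path: str
    kind: str  # "video", "image", "link", "transcript", or "note"
    usable_as_is: bool = True
    covers: list[str] = field(default_factory=list)  # beat ids this asset can cover


def gap_list(beats: list[str], assets: list[Asset]) -> dict[str, list[str]]:
    """Map each beat id to the asset paths covering it; an empty list is a gap."""
    coverage: dict[str, list[str]] = {beat: [] for beat in beats}
    for asset in assets:
        for beat in asset.covers:
            if beat in coverage:
                coverage[beat].append(asset.path)
    return coverage
```

Beats whose list stays empty are the ones that still need heavy assets.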
If the input already looks like a structured scenario brief, extract:
- source URL
- exact source time ranges
- target timeline ranges
- voiceover text
- on-screen text
- montage notes and effects
2. Write The Script
Lock the narrative before starting expensive asset work.
Use story-gen/SKILL.md to generate the story and first structured scenario draft from the normalized brief and available assets. If needed, refine that draft into the final production-ready brief for downstream steps.
This stage should also generate a structured scenario brief file, usually under assets/, that becomes the source of truth for downstream steps.
The script should usually include:
- hook
- beat-by-beat structure
- duration per beat
- final voiceover wording
- on-screen text or subtitle intent
- source plan for each beat: existing asset, downloaded clip, downloaded still, or generated shot
Keep the script production-ready rather than literary. Every beat should answer:
- what the viewer sees
- what they hear
- how long it lasts
- where the visual comes from
The structured scenario brief should usually capture:
- target timeline ranges
- source links or source asset IDs
- source ranges when they are already known
- voiceover text per beat
- on-screen text per beat
- montage notes, transitions, and effects
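As an illustration of the fields listed above, the sketch below builds a small brief and checks its target timeline for overlaps. The key names (`timeline`, `source_range`, and so on) and the sample beats are assumptions made for this example, not the schema that story-gen/SKILL.md emits.

```python
import json


def make_beat(beat_id, timeline, source, source_range=None,
              voiceover="", on_screen_text="", notes=""):
    """Build one beat entry; `timeline` and `source_range` are (start, end) in seconds."""
    return {
        "id": beat_id,
        "timeline": list(timeline),
        "source": source,
        "source_range": list(source_range) if source_range else None,
        "voiceover": voiceover,
        "on_screen_text": on_screen_text,
        "notes": notes,
    }


def timeline_problems(beats):
    """Return readable problems: empty or inverted ranges, and overlaps between beats."""
    problems = []
    ordered = sorted(beats, key=lambda b: b["timeline"][0])
    for beat in ordered:
        start, end = beat["timeline"]
        if end <= start:
            problems.append(f"{beat['id']}: empty or inverted range")
    for prev, cur in zip(ordered, ordered[1:]):
        if cur["timeline"][0] < prev["timeline"][1]:
            problems.append(f"{cur['id']}: overlaps {prev['id']}")
    return problems


# Hypothetical two-beat brief; file names follow a beat-aligned pattern.
brief = {"beats": [
    make_beat("b01_hook", (0.0, 2.5), "assets/b01_hook.mp4",
              on_screen_text="Setup line"),
    make_beat("b02_punchline", (2.5, 6.0), "assets/b02_punchline.mp4",
              source_range=(12.0, 15.5), notes="hard cut, no transition"),
]}

print(json.dumps(brief, ensure_ascii=False, indent=2))
```

Running the overlap check before saving the file catches timing mistakes while they are still cheap to fix.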
Heavy asset acquisition, voiceover, and montage should consume this brief instead of reconstructing the plan from the raw user prompt.
Do not start heavy downloads or generation until the beat structure and approximate timings are stable.
3. Produce Heavy Assets
Only after the script is stable, acquire the expensive or slow assets.
Prefer these paths in order:
- Reuse existing local footage.
- Reuse existing local still images.
- Find and download still-image assets from the internet when the beat would benefit from real supporting imagery such as a logo, poster, product photo, reaction image, accessory overlay, sticker, prop, or reference still.
- Download exact source ranges from external video links.
- Generate missing shots or draw new still assets only when real footage or downloadable stills do not exist or cannot achieve the needed moment.
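The preference order above can be encoded as an ordered list and a first-match lookup. The option labels are hypothetical names for the five paths; this is a sketch, not code from the repository.

```python
# Ordered from most to least preferred, mirroring the list above.
SOURCE_PREFERENCE = [
    "local_footage",
    "local_still",
    "downloaded_still",
    "downloaded_clip",
    "generated",
]


def pick_source(candidates: set[str]) -> str:
    """Return the most preferred option available for a beat."""
    for option in SOURCE_PREFERENCE:
        if option in candidates:
            return option
    # Nothing real exists for this beat, so generation is the remaining path.
    return "generated"
```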
If the missing asset is a generated still image or another heavy generated visual asset rather than a generated video clip, route that work through image-generation/SKILL.md before falling back to ad hoc API calls.
For source downloads:
- If the user gives only a concept for a helpful still image such as a logo, poster, reaction image, prop, sticker, glasses, clown wig, or clown nose, the heavy-assets phase may first use built-in web/image search to find a suitable asset, then save the direct image URL into the scenario brief and fetch it with download-images/SKILL.md.
- If the beat needs a still image from a direct image URL, use download-images/SKILL.md and save it under assets/ with a beat-aligned filename.
- If the user gives one URL and explicit time ranges, use download-youtube-segment/SKILL.md.
- If the brief already contains many source ranges from one video, prefer the repository batch helper instead of downloading each clip manually from a text plan in assets/.
- If the asset is already local, trim it with ffmpeg-editing/SKILL.md instead of re-downloading anything.
- Save newly acquired clips back into assets/ by default so later steps can treat that folder as the single working set.
- Save newly acquired still images back into assets/ by default so later steps can treat that folder as the single working set.
- If a YouTube source must be understood before clipping, save subtitles into assets/ first and use the .srt to search for quotes, beats, and punchline moments before final clip extraction.
For generated shots, define a precise request:
- shot purpose in the story
- duration in seconds
- aspect ratio
- camera movement
- subject/action
- lighting/style
- continuity constraints relative to neighboring shots
- negative constraints for elements that must not appear
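The request fields above fit naturally into a small record that can be rendered into a prompt. The class name, field names, and prompt wording are assumptions for illustration; the actual format expected by video-generation/SKILL.md may differ.

```python
from dataclasses import dataclass, field


@dataclass
class ShotRequest:
    purpose: str          # role of the shot in the story
    duration_s: float
    aspect_ratio: str     # for example "9:16"
    camera: str
    subject_action: str
    style: str
    continuity: str = ""
    negatives: list[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Render the visual fields as one prompt string (hypothetical wording)."""
        parts = [
            f"{self.subject_action}.",
            f"Camera: {self.camera}.",
            f"Lighting/style: {self.style}.",
        ]
        if self.continuity:
            parts.append(f"Continuity: {self.continuity}.")
        if self.negatives:
            parts.append("Avoid: " + ", ".join(self.negatives) + ".")
        return " ".join(parts)
```

Duration and aspect ratio stay as separate fields because generation pipelines typically take them as parameters rather than as prompt text.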
If the missing asset should be a generated video clip rather than a still image, and the request fits this repository's existing generation system, route that heavy-assets work through video-generation/SKILL.md instead of inventing a separate local generation flow.
Save outputs with beat-aligned names so the final edit stage stays mechanical.
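A beat-aligned name can be produced by a small helper like the one below. The `bNN_label_kind.ext` pattern is an assumed convention for this sketch; the repository does not mandate a specific format.

```python
import re


def beat_filename(index: int, label: str, kind: str, ext: str) -> str:
    """Build a path such as assets/b03_punch-line_clip.mp4 (hypothetical pattern)."""
    # Lowercase the label and collapse anything non-alphanumeric into single dashes.
    slug = re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")
    return f"assets/b{index:02d}_{slug}_{kind}.{ext.lstrip('.')}"
```

Zero-padded beat indices keep the files sorted in timeline order in any directory listing.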
If a local or newly downloaded still image needs a transparent cutout, product-photo cleanup, sticker extraction, or isolation of a foreground subject before compositing, use remove-background/SKILL.md before the montage step instead of trying to solve that with ffmpeg. Treat the resulting *.nobg.png as the named overlay asset for the next editing phase.
Treat downloadable stills as one of the normal heavy-assets options whenever they help the content. If a downloaded still can cover the beat cleanly, prefer that over inventing unnecessary new art.
4. Produce Voiceover
After the script and heavy assets are locked, render the narration as a separate deliverable.
For narration:
- Generate speech from the approved text with voice/tts_generate.py.
- If the narration text already includes timestamps, keep those timings unless the script changed.
- Save the generated segments and manifest as their own artifact before starting montage.
5. Assemble The Montage
Only after the picture assets and voice package are stable, build the final timeline.
For montage:
- cut and merge clips
- reframe to the delivery format, often 9:16
- combine the voice package back into the cut with voice/replace_audio.sh when narration replaces or drives the audio
- overlay downloaded stills, logos, posters, or transparent cutout PNGs produced during the heavy-assets phase
- add music or ducking
- add captions or text overlays
- add transitions only where they support the beat
- export a clean delivery file
Use ffmpeg-editing/SKILL.md for the deterministic rendering work.
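For the reframing step, the crop window can be computed before invoking ffmpeg. The sketch below builds a filter string from ffmpeg's `crop` and `scale` filters. It assumes a source wider than 9:16 (for example 16:9 landscape) and a hypothetical `center_x` parameter for the subject's horizontal position; it is not the helper that ffmpeg-editing/SKILL.md ships.

```python
def vertical_crop_filter(src_w: int, src_h: int, center_x: float = 0.5,
                         out_w: int = 1080, out_h: int = 1920) -> str:
    """Build an ffmpeg -vf string that crops a 9:16 window and scales it to the output size."""
    # Use the full source height; keep dimensions even for common codecs.
    crop_h = src_h - (src_h % 2)
    crop_w = min(src_w, int(crop_h * out_w / out_h))
    crop_w -= crop_w % 2
    # Center the window on the subject, clamped so it stays inside the frame.
    x = int(center_x * src_w - crop_w / 2)
    x = max(0, min(x, src_w - crop_w))
    return f"crop={crop_w}:{crop_h}:{x}:0,scale={out_w}:{out_h}"
```

For a 1920x1080 source with the subject centered, this yields `crop=606:1080:657:0,scale=1080:1920`, which can be passed to ffmpeg's `-vf` option.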
For meme-style or reaction-style shorts:
- prefer a short punchline cut rather than a long explanatory segment
- for TikTok or other phone-first 9:16 meme delivery, prefer reframing directly from the source so the subject fills most of the screen instead of placing a small horizontal card inside a blurred plate
- use the centered-clip-over-background approach only when preserving the full horizontal frame is more important than maximizing screen occupancy
- keep top/bottom meme text inside a bounded caption area; do not let long lines run off the frame
- prefer wrapped caption cards with a softer contour and shadow over a thick hard outline
- when the joke or story benefits from extra still imagery, it is reasonable to add downloaded overlay assets plus local cutouts instead of limiting the edit to the base frame only
- prefer pre-rendered caption bars/cards with a real bold font over brittle one-pass inline text when the meme text needs to look polished
- when a meme has a setup beat and a punchline beat, it is acceptable for those beats to use different reframes if that makes both stages read better on a phone
- after reframing to vertical, reposition glasses, masks, and other face overlays against the new crop instead of reusing coordinates from the horizontal version
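Repositioning face overlays after a vertical reframe is a coordinate transform from the source frame into the cropped and scaled frame. A minimal sketch, assuming the crop window and output size are known; the function name and parameters are hypothetical:

```python
def remap_overlay(px: float, py: float,
                  crop_x: int, crop_y: int, crop_w: int, crop_h: int,
                  out_w: int = 1080, out_h: int = 1920) -> tuple[int, int]:
    """Map a point from source-frame pixels into the cropped-and-scaled vertical frame."""
    inside_x = crop_x <= px <= crop_x + crop_w
    inside_y = crop_y <= py <= crop_y + crop_h
    if not (inside_x and inside_y):
        raise ValueError("overlay anchor falls outside the crop window")
    # Shift into crop-local coordinates, then apply the scale factors.
    new_x = (px - crop_x) * out_w / crop_w
    new_y = (py - crop_y) * out_h / crop_h
    return round(new_x), round(new_y)
```

With a 606x1080 crop starting at x=657 on a 1920x1080 source, the source center (960, 540) maps to (540, 960) in the 1080x1920 output. Overlay sizes need the same scale factors applied, since the crop is enlarged when scaled to the output.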
Decision Rules
- Prefer real footage over generated footage when both can satisfy the beat.
- Prefer downloaded still assets when they materially improve the beat and can be integrated cleanly.
- Prefer locking the script first; do not spend compute or download time on unstable story beats.
- If the user gives only wishes and no structure, produce the beat list and asset gap analysis before any heavy step.
- If the user gives exact source ranges, acquire those ranges before inventing new visuals.
- If narration drives the pacing, freeze the voice text before voice generation and lock voice timing before the final montage.
- Keep file naming beat-oriented so later assembly is obvious.
- For TikTok-style delivery, validate the result on extracted vertical frames before calling it done; if the phone-sized frame still feels like a small embedded video, reframe more aggressively.
Heavy-assets still-image rule:
- If the content would be improved by extra still imagery that is not already local, the heavy-assets phase may search, download, and optionally cut out those images before the edit. This applies to meme accessories, logos, product shots, reaction images, props, stickers, and similar supporting visuals.
Expected Outputs
Depending on scope, this skill should produce one or more of these artifacts:
- normalized brief
- beat list or shot list
- approved script
- structured scenario brief
- heavy asset acquisition plan
- downloaded or generated source assets
- transparent cutout PNGs or cleaned still-image assets
- voice segments and manifest
- montage plan or assembled timeline
- final rendered video
If the request only covers one stage, hand off directly to the narrower skill instead of forcing the full pipeline. A request to make a video, make a meme, or assemble a short does not count as a single-stage request unless the user explicitly narrows the scope.