2026-04-06 22:12:48 +03:00

name	description
media-skill	Orchestrate end-to-end media production from mixed inputs. Use when an AI agent needs one main workflow for analyzing source assets such as local videos, images, video links, and user wishes in a scenario; writing a production-ready script; acquiring heavy assets by downloading source clips or generating missing shots with a video model; then producing voiceover and assembling the final montage. Also use for Russian-language requests about asset analysis, scripting, video generation, clip downloading, voiceover, and montage.

Media Production Pipeline

Overview

This skill is the top-level orchestrator for turning rough media inputs into a finished short-form video or a production-ready execution plan.

This repository contains local specialist skills that may not appear in the session-level skill registry. Before handling media tasks here, inspect the repo-local SKILL.md files and prefer the narrowest matching skill when the request clearly targets one stage.

Repository setup and tool installation commands live in SETUP.md.

Use it when the task spans several stages at once:

asset analysis
script writing
heavy asset acquisition
voiceover
montage

Treat the local modules in this repository as specialists:

story-gen/SKILL.md: generate the story and structured scenario brief from the normalized request and available assets.
image-generation/SKILL.md: generate still images from prompts through the repo-local Nano Banana helper when the heavy-assets phase needs new art or another heavy still asset instead of downloaded stills.
video-generation/SKILL.md: use the existing repo-specific AI video generation pipelines when the heavy-assets phase needs provider-backed generated video, marketplace promo generation, or staged narrative generation instead of ad hoc prompting.
download-images/SKILL.md: fetch direct still-image assets into assets/ for heavy-asset acquisition, cutouts, and overlays.
download-youtube-segment/SKILL.md: fetch exact source ranges from YouTube.
ffmpeg-editing/SKILL.md: deterministic cutting, reframing, audio work, captions, transitions, and export.
remove-background/SKILL.md: remove still-image backgrounds locally with rembg and transparent PNG output.
voice/SKILL.md: synthesize and integrate narration with GPT-SoVITS.

Known local skills in this repo:

SKILL.md: top-level media workflow orchestrator.
image-generation/SKILL.md: minimal text-to-image path for generated stills and other heavy generated visual assets via openai/gemini-2.5-flash-image.
video-generation/SKILL.md: repo-specific AI video generation pipelines for generated clips, marketplace promo runs, Telegram-bot-backed generation, and microdrama/story-adaptation work during the heavy-assets phase.
download-images/SKILL.md: download direct still-image assets into the local working set for heavy-assets acquisition.
download-youtube-segment/SKILL.md: download YouTube segments or frames with the helper scripts in download-youtube-segment/scripts/.
ffmpeg-editing/SKILL.md: deterministic ffmpeg-based editing workflows.
remove-background/SKILL.md: local background removal with rembg, including bootstrap and model download.
story-gen/SKILL.md: story and scenario generation.
voice/SKILL.md: voice synthesis and audio integration.

Routing reminders:

If the heavy-assets phase specifically needs a newly generated still image or another heavy generated visual asset from a prompt, use image-generation/SKILL.md first instead of inventing a fresh image-generation flow.
If the user explicitly asks only for an AI-generated still image, including requests like сгенерируй картинку по трендам (generate a picture based on trends) or other trend-based image generation, treat that as a narrow image-generation task and go directly to image-generation/SKILL.md or the session-level imagegen skill instead of forcing the full media pipeline.
If the heavy-assets phase specifically needs generated video clips and this repository's existing provider-backed generation workflows fit the task, use video-generation/SKILL.md first instead of inventing a fresh generation flow.
If the request is specifically about YouTube clipping, use download-youtube-segment/SKILL.md first instead of falling back to ad hoc commands.
If the request is specifically about removing the background from an image, use remove-background/SKILL.md first instead of ad hoc image-editing commands.
If the request is specifically about downloading internet stills for the heavy-assets phase, use download-images/SKILL.md first.
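
The routing reminders above amount to a lookup from request type to the narrowest skill. A minimal sketch, assuming hypothetical request labels (the label strings and the `route` function are illustrative; only the SKILL.md paths come from this document):

```python
# Hypothetical request labels mapped to the repo-local skills named above.
# The labels are illustrative assumptions; the paths are from this document.
NARROW_ROUTES = {
    "generated_still": "image-generation/SKILL.md",
    "generated_video": "video-generation/SKILL.md",
    "youtube_clip": "download-youtube-segment/SKILL.md",
    "remove_background": "remove-background/SKILL.md",
    "download_still": "download-images/SKILL.md",
}


def route(request_kind: str) -> str:
    """Return the narrowest matching skill, else the top-level orchestrator."""
    return NARROW_ROUTES.get(request_kind, "SKILL.md")
```

Anything that does not match a narrow label falls through to the top-level SKILL.md, which mirrors the default of treating end-result requests as full-pipeline work.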

There is no single bundled script here for neural video generation. When missing shots must be generated, this skill should define the prompt/spec, target duration, framing, and continuity constraints, then place the generated outputs back into the same pipeline as normal assets.

Inputs To Normalize

First reduce the user request into a concrete inventory:

Treat assets/ as the canonical folder for source materials unless the user explicitly says otherwise. This folder can contain videos, voice files, images, text wishes, transcripts, source-range briefs, and other reference inputs for the project. Downloaded, generated, extracted, and intermediate media artifacts should also be saved under assets/ by default unless the user explicitly requests another destination.

local videos and images
external video links
direct image URLs
transcript or raw notes
scenario wishes: tone, hook, pacing, effects, subtitles, music, language
delivery constraints: aspect ratio, duration, platform, output format

If the user gives a chaotic brief, normalize it before doing expensive work.
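
One way to hold the normalized inventory is a small record type. This is a sketch only: the class name, field names, and delivery keys below are assumptions chosen to mirror the list above, not a schema defined by this repository.

```python
from dataclasses import dataclass, field


@dataclass
class NormalizedBrief:
    """Hypothetical container for the normalized request inventory."""
    local_media: list[str] = field(default_factory=list)    # local videos and images
    video_links: list[str] = field(default_factory=list)    # external video links
    image_urls: list[str] = field(default_factory=list)     # direct image URLs
    notes: str = ""                                          # transcript or raw notes
    wishes: dict[str, str] = field(default_factory=dict)    # tone, hook, pacing, ...
    delivery: dict[str, str] = field(default_factory=dict)  # delivery constraints

    def missing_delivery_fields(self) -> list[str]:
        """List delivery constraints that the brief has not pinned down yet."""
        required = ("aspect_ratio", "duration", "platform", "output_format")
        return [key for key in required if key not in self.delivery]
```

Checking `missing_delivery_fields()` before any expensive work gives a cheap signal that a chaotic brief still needs normalizing.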

Mandatory Sequencing Contract

Treat requests such as сделай видео (make a video), сделай мем (make a meme), собери ролик (assemble a clip), or similar end-result wording as a full pipeline request by default, not as permission to jump straight to montage. Only skip to a narrower module when the user explicitly asks for a single stage such as just clip this, only write the script, or only add captions.

If the deliverable is only a trend-based AI-generated still image, this is a single-stage exception: go straight to image-generation/SKILL.md or the session-level imagegen skill and do not require a scenario brief or the rest of the media pipeline.

Before any heavy production step, the agent must create or update a structured scenario brief under assets/, preferably assets/scenario.json, and then use that file as the source of truth for later steps.

This is a hard gate:

Do not start final source downloads, generation, voice synthesis, or montage until a scenario brief file exists.
The only allowed pre-scenario exception is lightweight inspection work needed to author the scenario, such as reading local assets, checking metadata, or downloading subtitles/transcripts for quote search.
If story-gen/SKILL.md cannot run because API access or env is missing, the agent must still write the scenario brief manually instead of skipping the scripting stage.
If the brief is only a URL plus a short wish such as make a meme, first inspect the source and fetch its subtitles, then write the beat choice and exact source ranges into the scenario brief, and only then acquire or trim the final clip.
Once production starts, the raw chat message or assets/text.txt must no longer be treated as the de facto plan; the scenario brief must be the working contract.
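
The hard gate can be expressed as a single precondition check. A minimal sketch, assuming the preferred path assets/scenario.json; the step labels and the `may_start` function are hypothetical names used only for illustration:

```python
from pathlib import Path

# Hypothetical labels for the heavy steps named in the gate above.
HEAVY_STEPS = {"download", "generate", "voice", "montage"}


def may_start(step: str, assets_dir: str = "assets") -> bool:
    """Allow heavy steps only once a scenario brief file exists."""
    if step not in HEAVY_STEPS:
        # Lightweight inspection (metadata, subtitles) is allowed pre-scenario.
        return True
    return (Path(assets_dir) / "scenario.json").is_file()
```

The check looks only for the file's existence; whether the brief is complete enough is a separate question for the scripting stage.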

Workflow

1. Analyze Assets

Build an asset coverage view before writing commands or generating media.

For each input, capture:

what it is: video, image, link, transcript, or pure creative note
whether it is usable as-is
which story beats it can cover
whether it needs trimming, reframing, downloading, generation, or still-image cleanup such as background removal
any hard constraints such as aspect ratio, duration, or required wording

If the brief would benefit from extra still images that are not already local, note that the heavy-assets phase may source them from the internet and feed them into the edit as normal assets.

Output of this stage:

a beat list
an asset inventory
a gap list showing which beats are already covered and which still need heavy assets
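
The gap list follows directly from the beat list and the asset inventory. A sketch, assuming the inventory is a mapping from asset path to the beats it can cover (the function name and data shape are illustrative):

```python
def gap_list(beats: list[str], coverage: dict[str, list[str]]) -> dict[str, list[str]]:
    """Split beats into covered and uncovered, given asset -> beats coverage."""
    covered = {beat for beat_ids in coverage.values() for beat in beat_ids}
    return {
        "covered": [b for b in beats if b in covered],
        "needs_heavy_assets": [b for b in beats if b not in covered],
    }
```

For example, with beats hook, setup, and punchline and two local assets covering hook and punchline, only setup lands in `needs_heavy_assets`.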

If the input already looks like a structured scenario brief, extract:

source URL
exact source time ranges
target timeline ranges
voiceover text
on-screen text
montage notes and effects

2. Write The Script

Lock the narrative before starting expensive asset work.

Use story-gen/SKILL.md to generate the story and first structured scenario draft from the normalized brief and available assets. If needed, then refine that output into the final production-ready brief for downstream steps.

This stage should also generate a structured scenario brief file, usually under assets/, that becomes the source of truth for downstream steps.

The script should usually include:

hook
beat-by-beat structure
duration per beat
final voiceover wording
on-screen text or subtitle intent
source plan for each beat: existing asset, downloaded clip, downloaded still, or generated shot

Keep the script production-ready rather than literary. Every beat should answer:

what the viewer sees
what they hear
how long it lasts
where the visual comes from

The structured scenario brief should usually capture:

target timeline ranges
source links or source asset IDs
source ranges when they are already known
voiceover text per beat
on-screen text per beat
montage notes, transitions, and effects

Heavy asset acquisition, voiceover, and montage should consume this brief instead of reconstructing the plan from the raw user prompt.
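
To make the brief's shape concrete, here is an illustrative single-beat brief plus a completeness check. The key names, the example values, and the file path are assumptions for this sketch; the actual schema emitted by story-gen may differ.

```python
# Illustrative per-beat keys mirroring the list above (not an official schema).
REQUIRED_BEAT_KEYS = {"timeline", "source", "voiceover", "on_screen_text", "montage_notes"}

# Hypothetical example brief; every value here is made up for illustration.
EXAMPLE_BRIEF = {
    "beats": [
        {
            "id": "beat01_hook",
            "timeline": [0.0, 2.5],  # target timeline range, seconds
            "source": {"kind": "existing_asset", "ref": "assets/clip_a.mp4", "range": [12.0, 14.5]},
            "voiceover": "",
            "on_screen_text": "wait for it",
            "montage_notes": "hard cut in",
        }
    ]
}


def missing_keys(brief: dict) -> dict[str, list[str]]:
    """Map each incomplete beat id to the sorted list of keys it lacks."""
    problems = {}
    for beat in brief.get("beats", []):
        missing = sorted(REQUIRED_BEAT_KEYS - beat.keys())
        if missing:
            problems[beat.get("id", "?")] = missing
    return problems
```

Running a check like this before the heavy-assets phase catches beats that would otherwise force later steps to fall back on the raw prompt.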

Do not start heavy downloads or generation until the beat structure and approximate timings are stable.

3. Produce Heavy Assets

Only after the script is stable, acquire the expensive or slow assets.

Prefer these paths in order:

Reuse existing local footage.
Reuse existing local still images.
Find and download still-image assets from the internet when the beat would benefit from real supporting imagery such as a logo, poster, product photo, reaction image, accessory overlay, sticker, prop, or reference still.
Download exact source ranges from external video links.
Generate missing shots or draw new still assets only when real footage or downloadable stills do not exist or cannot achieve the needed moment.
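
The preference order above can be sketched as a first-match scan over an ordered list. The source-kind labels are hypothetical names for the five paths:

```python
# Hypothetical labels for the five acquisition paths, in preference order.
SOURCE_PRIORITY = [
    "local_footage",
    "local_still",
    "downloaded_still",
    "downloaded_clip",
    "generated",
]


def pick_source(available: set[str]) -> str:
    """Return the most preferred source kind that can cover the beat."""
    for kind in SOURCE_PRIORITY:
        if kind in available:
            return kind
    # Nothing real exists for this beat, so generation is the last resort.
    return "generated"
```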

If the missing asset is a generated still image or another heavy generated visual asset rather than a generated video clip, route that work through image-generation/SKILL.md before falling back to ad hoc API calls.

For source downloads:

If the user gives only a concept for a helpful still image such as a logo, poster, reaction image, prop, sticker, glasses, clown wig, or clown nose, the heavy-assets phase may first use built-in web/image search to find a suitable asset, then save the direct image URL into the scenario brief and fetch it with download-images/SKILL.md.
If the beat needs a still image from a direct image URL, use download-images/SKILL.md and save it under assets/ with a beat-aligned filename.
If the user gives one URL and explicit time ranges, use download-youtube-segment/SKILL.md.
If the brief already contains many source ranges from one video, prefer the repository batch helper instead of downloading each clip manually from a text plan in assets/.
If the asset is already local, trim it with ffmpeg-editing/SKILL.md instead of re-downloading anything.
Save newly acquired clips and still images back into assets/ by default so later steps can treat that folder as the single working set.
If a YouTube source must be understood before clipping, save subtitles into assets/ first and use the .srt to search for quotes, beats, and punchline moments before final clip extraction.
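
Searching a saved .srt for a quote needs only a small parser. A sketch, assuming standard SubRip layout (index line, timing line, then text lines, with blank lines between cues); the function name is illustrative:

```python
import re

# Matches a SubRip timing line such as "00:00:04,000 --> 00:00:06,000".
TIME_LINE = re.compile(r"(\d{2}:\d{2}:\d{2}[,.]\d{3}) --> (\d{2}:\d{2}:\d{2}[,.]\d{3})")


def find_quotes(srt_text: str, needle: str) -> list[tuple[str, str, str]]:
    """Return (start, end, text) for every cue whose text contains needle."""
    hits = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIME_LINE.search(line)
            if match:
                text = " ".join(lines[i + 1:]).strip()
                if needle.lower() in text.lower():
                    hits.append((match.group(1), match.group(2), text))
                break  # one timing line per cue
    return hits
```

The returned start and end stamps are candidates for the exact source ranges recorded in the scenario brief; a quote that spans two cues would need the neighboring cue's range as well.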

For generated shots, define a precise request:

shot purpose in the story
duration in seconds
aspect ratio
camera movement
subject/action
lighting/style
continuity constraints relative to neighboring shots
negative constraints for elements that must not appear
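
The request fields above map naturally onto a small record that can be flattened into a prompt. This is a sketch: the class, its field names, and the prompt layout are assumptions, not the format that video-generation/SKILL.md expects.

```python
from dataclasses import dataclass, field


@dataclass
class ShotRequest:
    """Hypothetical spec for one generated shot."""
    purpose: str            # shot purpose in the story (kept for planning, not prompted)
    duration_s: float
    aspect_ratio: str
    camera_movement: str
    subject_action: str
    lighting_style: str
    continuity: str = ""    # constraints relative to neighboring shots
    negative: list[str] = field(default_factory=list)  # elements that must not appear

    def to_prompt(self) -> str:
        """Flatten the spec into one semicolon-separated prompt string."""
        parts = [
            self.subject_action,
            f"camera: {self.camera_movement}",
            f"style: {self.lighting_style}",
            f"{self.duration_s:g}s, {self.aspect_ratio}",
        ]
        if self.continuity:
            parts.append(f"continuity: {self.continuity}")
        if self.negative:
            parts.append("avoid: " + ", ".join(self.negative))
        return "; ".join(parts)
```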

If the missing asset should be a generated video clip rather than a still image, and the request fits this repository's existing generation system, route that heavy-assets work through video-generation/SKILL.md instead of inventing a separate local generation flow.

Save outputs with beat-aligned names so the final edit stage stays mechanical.
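
A beat-aligned name can be derived mechanically from the beat index and label. The naming pattern below is one possible convention, assumed for illustration rather than prescribed by this repository:

```python
import re


def beat_filename(index: int, beat: str, kind: str, ext: str) -> str:
    """Build an assets/ path that sorts by beat order and names the beat."""
    slug = re.sub(r"[^a-z0-9]+", "-", beat.lower()).strip("-")
    return f"assets/beat{index:02d}_{slug}_{kind}.{ext.lstrip('.')}"
```

Zero-padding the index keeps a plain directory listing in timeline order, so the montage step can pick files up without consulting anything but the brief.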

If a local or newly downloaded still image needs a transparent cutout, product-photo cleanup, sticker extraction, or isolation of a foreground subject before compositing, use remove-background/SKILL.md before the montage step instead of trying to solve that with ffmpeg. Treat the resulting *.nobg.png as the named overlay asset for the next editing phase.

Treat downloadable stills as one of the normal heavy-assets options whenever they help the content. If a downloaded still can cover the beat cleanly, prefer that over inventing unnecessary new art.

4. Produce Voiceover

After the script and heavy assets are locked, render the narration as a separate deliverable.

For narration:

Generate speech from the approved text with voice/tts_generate.py.
If the narration text already includes timestamps, keep those timings unless the script changed.
Save the generated segments and manifest as their own artifact before starting montage.

5. Assemble The Montage

Only after the picture assets and voice package are stable, build the final timeline.

For montage:

cut and merge clips
reframe to the delivery format, often 9:16
combine the voice package back into the cut with voice/replace_audio.sh when narration replaces or drives the audio
overlay downloaded stills, logos, posters, or transparent cutout PNGs produced during the heavy-assets phase
add music or ducking
add captions or text overlays
add transitions only where they support the beat
export a clean delivery file

Use ffmpeg-editing/SKILL.md for the deterministic rendering work.

For meme-style or reaction-style shorts:

prefer a short punchline cut rather than a long explanatory segment
for TikTok or other phone-first 9:16 meme delivery, prefer reframing directly from the source so the subject fills most of the screen instead of placing a small horizontal card inside a blurred plate
use the centered-clip-over-background approach only when preserving the full horizontal frame is more important than maximizing screen occupancy
keep top/bottom meme text inside a bounded caption area; do not let long lines run off the frame
prefer wrapped caption cards with a softer contour and shadow over a thick hard outline
when the joke or story benefits from extra still imagery, it is reasonable to add downloaded overlay assets plus local cutouts instead of limiting the edit to the base frame only
prefer pre-rendered caption bars/cards with a real bold font over brittle one-pass inline text when the meme text needs to look polished
when a meme has a setup beat and a punchline beat, it is acceptable for those beats to use different reframes if that makes both stages read better on a phone
after reframing to vertical, reposition glasses, masks, and other face overlays against the new crop instead of reusing coordinates from the horizontal version
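
The last point is plain coordinate arithmetic. A sketch, assuming a centered 9:16 crop and a 1920-pixel-tall output (both are assumptions; an off-center subject needs a different crop origin). The function names are illustrative, and the crop tuple is ordered like ffmpeg's `crop=w:h:x:y` filter arguments:

```python
def vertical_crop(src_w: int, src_h: int) -> tuple[int, int, int, int]:
    """Centered 9:16 crop as (w, h, x, y); width is forced even for encoders."""
    crop_w = min(src_w, src_h * 9 // 16)
    crop_w -= crop_w % 2
    return crop_w, src_h, (src_w - crop_w) // 2, 0


def remap_overlay(ox: int, oy: int, crop: tuple[int, int, int, int],
                  out_h: int = 1920) -> tuple[int, int]:
    """Map an overlay position from source pixels to the scaled vertical frame."""
    _crop_w, crop_h, cx, cy = crop
    scale = out_h / crop_h
    return round((ox - cx) * scale), round((oy - cy) * scale)


def crop_filter(crop: tuple[int, int, int, int], out_w: int = 1080, out_h: int = 1920) -> str:
    """Format the crop plus scale as an ffmpeg -vf filter string."""
    w, h, x, y = crop
    return f"crop={w}:{h}:{x}:{y},scale={out_w}:{out_h}"
```

For a 1920x1080 source, an overlay placed at (960, 400) in the horizontal frame moves to roughly (539, 711) in the vertical one, which is why reusing the horizontal coordinates misplaces glasses and masks.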

Decision Rules

Prefer real footage over generated footage when both can satisfy the beat.
Prefer downloaded still assets when they materially improve the beat and can be integrated cleanly.
Prefer locking the script first; do not spend compute or download time on unstable story beats.
If the user gives only wishes and no structure, produce the beat list and asset gap analysis before any heavy step.
If the user gives exact source ranges, acquire those ranges before inventing new visuals.
If narration drives the pacing, freeze the voice text before voice generation and lock voice timing before the final montage.
Keep file naming beat-oriented so later assembly is obvious.
For TikTok-style delivery, validate the result on extracted vertical frames before calling it done; if the phone-sized frame still feels like a small embedded video, reframe more aggressively.

Heavy-assets still-image rule:

If the content would be improved by extra still imagery that is not already local, the heavy-assets phase may search, download, and optionally cut out those images before the edit. This applies to meme accessories, logos, product shots, reaction images, props, stickers, and similar supporting visuals.

Expected Outputs

Depending on scope, this skill should produce one or more of these artifacts:

normalized brief
beat list or shot list
approved script
structured scenario brief
heavy asset acquisition plan
downloaded or generated source assets
transparent cutout PNGs or cleaned still-image assets
voice segments and manifest
montage plan or assembled timeline
final rendered video

If the request only covers one stage, hand off directly to the narrower skill instead of forcing the full pipeline. A request to make a video, make a meme, or assemble a short does not count as a single-stage request unless the user explicitly narrows the scope.
