media-skill/ffmpeg-editing/SKILL.md
2026-04-03 22:48:26 +03:00

name: ffmpeg-editing
description: Plan and execute deterministic audio and video edits with ffmpeg and ffprobe. Use when an AI agent needs to cut clips, concatenate videos, reorder segments, replace or mix audio, burn or mux captions, add text or image overlays, reframe footage for vertical or square formats such as 9:16, add transitions between clips, change speed, extract frames, normalize exports, or translate a plain-English editing request into concrete ffmpeg commands or scripts.

FFmpeg Editing

Overview

Inspect the media first, then choose the simplest edit path that satisfies the request with the least quality loss.

Prefer stream copy for pure trims, remuxes, and compatible concatenation. Re-encode when the request involves filters, frame-accurate cuts, captions, overlays, speed changes, reframing, or mixed audio.

Quick Start

  1. Inspect every input with ffprobe.
  2. Normalize the request into an edit plan:
    • inputs
    • desired output
    • exact time ranges
    • whether timing must be frame-accurate
    • whether subtitles are burned in or soft
    • whether original audio must be preserved, replaced, or mixed
  3. Choose the edit family:
    • trim/remux
    • concat/reorder
    • filter-based video edit
    • filter-based audio edit
    • subtitle or overlay pass
  4. Choose stream copy or re-encode deliberately.
  5. Build explicit -map rules instead of relying on default stream selection.
  6. For larger graphs, write the graph to a file and load it with -filter_complex_script (deprecated since FFmpeg 7.0 in favor of -/filter_complex <file>) instead of an unreadable inline filter string.
  7. For MP4 outputs, usually add -movflags +faststart.

Scripts

Prefer the bundled scripts in simple or repetitive cases before writing raw ffmpeg by hand:

  • scripts/trim-clip.sh: cut one file by start/end or start/+duration, with copy or accurate mode.
  • scripts/merge-clips.sh: concatenate already-compatible clips after checking their stream signatures.
  • scripts/make-vertical.sh: export a 9:16 version with crop, pad, or dynamic motion mode.
  • scripts/render-meme-vertical.sh: build a meme-style vertical render with a blurred background plate, centered source clip, and wrapped top/bottom caption cards.
  • scripts/replace-audio.sh: attach a new audio track to an existing video.
  • scripts/mix-audio.sh: mix or duck background music under the original track.
  • scripts/burn-captions.sh: burn .srt, .vtt, or .ass captions into the picture.
  • scripts/transition-two-clips.sh: build a normalized two-clip render with xfade and acrossfade.

Use the scripts for the common path. Fall back to references/patterns.md when the request needs a custom graph or a multi-stage edit.

When the edit uses downloaded still images or transparent cutout PNGs from the heavy-assets phase, keep those files in assets/ so the montage step can reference them mechanically, and treat them as first-class overlay inputs, exactly like local PNGs, logos, or cutouts.

Workflow

1. Inspect Inputs

Run ffprobe before writing commands. Capture:

  • duration
  • resolution
  • frame rate
  • pixel format
  • video codec
  • audio codec
  • channel layout
  • subtitle streams
  • time base issues or variable frame rate

Use this information to decide whether stream copy is safe, whether concat demuxer can work, and whether a compatibility transcode is needed first.

2. Classify the Request

Map the user request to one of these patterns:

  • Cut a clip: trim one source into one output.
  • Merge videos: concatenate compatible clips or use concat filter after normalizing them.
  • Apply sound to video: replace audio, mix music under speech, or keep only one track.
  • Apply captions: burn captions into video or mux subtitle streams.
  • Make vertical: scale, crop, and optionally zoom for 9:16.
  • Add transitions: crossfade, fade-to-black, fade-to-white, wipe, or slide between adjacent clips.
  • Add text/logo: use drawtext or overlay.
  • Composite downloaded stills or cutouts: use overlay with the downloaded image or *.nobg.png asset from the heavy-assets phase.
  • Build a meme short: trim a punchline moment, reframe to 9:16, then overlay strong top/bottom text without letting long lines run off the frame.
  • Build a meme still: composite one source frame with downloaded overlay assets, then add strong top/bottom caption bars.
  • Speed up / slow down: use setpts and atempo.
  • Build a short edit: trim multiple ranges, transform each segment, then concat.
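
The speed-change pattern in the list above can be sketched as follows. The function names are hypothetical. Chaining atempo so that each stage stays between 0.5 and 2.0 is a conservative assumption that works on older ffmpeg builds; newer builds accept a wider range in a single stage.

```python
def atempo_chain(speed):
    """Split a speed factor into atempo stages that each stay within 0.5..2.0."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    factors = []
    remaining = float(speed)
    while remaining > 2.0:
        factors.append(2.0)
        remaining /= 2.0
    while remaining < 0.5:
        factors.append(0.5)
        remaining /= 0.5
    factors.append(remaining)
    return ",".join(f"atempo={f:g}" for f in factors)


def speed_filters(speed):
    """Return (video_filter, audio_filter) for a constant speed change."""
    # Dividing presentation timestamps by the factor speeds the video up.
    return f"setpts=PTS/{speed:g}", atempo_chain(speed)
```

The two strings go to -filter:v and -filter:a respectively, and the output must be re-encoded because both streams pass through filters.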

If the source is a YouTube URL and the task is only to fetch one segment in this repository, prefer ../download-youtube-segment/scripts/download-clip.py before doing further editing.

3. Choose Copy vs Re-encode

Use stream copy when all of these are true:

  • no filter is required
  • approximate keyframe-aligned cutting is acceptable
  • codecs and container are already acceptable

Re-encode when any of these are true:

  • cut points must be exact
  • subtitles must be burned in
  • text, image, crop, scale, pad, blur, zoom, or transitions are needed
  • audio must be mixed, ducked, faded, or normalized
  • clips need normalization before concatenation
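
The two checklists above reduce to a single predicate. This is an illustrative sketch only; EditPlan and its field names are hypothetical and simply mirror the bullet points.

```python
from dataclasses import dataclass


@dataclass
class EditPlan:
    """Hypothetical summary of a request; one flag per rule above."""
    frame_accurate: bool = False       # cut points must be exact
    burn_subtitles: bool = False       # captions rendered into the picture
    video_filters: bool = False        # text, image, crop, scale, pad, blur, zoom, transitions
    audio_filters: bool = False        # mix, duck, fade, normalize
    needs_normalization: bool = False  # clips differ before concatenation
    codecs_acceptable: bool = True     # codecs and container already fit the target


def must_reencode(plan):
    """Stream copy is safe only when no re-encode trigger is set."""
    return (
        plan.frame_accurate
        or plan.burn_subtitles
        or plan.video_filters
        or plan.audio_filters
        or plan.needs_normalization
        or not plan.codecs_acceptable
    )
```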

4. Build Commands Deliberately

Apply these rules:

  • Use explicit -map values.
  • Set codecs intentionally instead of relying on defaults for production outputs.
  • Use libx264 with -pix_fmt yuv420p and a -crf value between 18 and 23 (lower means higher quality and larger files) for broadly compatible H.264 delivery unless the user needs something else.
  • Use AAC for common MP4 audio delivery.
  • Use -shortest only when you explicitly want the output to end at the shortest stream.
  • For accurate trims, prefer filter-based trim / atrim or place -ss after input with re-encode.
  • For fast rough trims, place -ss before input and copy when acceptable.
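
As a rough illustration of these rules, the sketch below builds both trim variants as argument lists. trim_argv is a hypothetical helper, the CRF of 20 is an arbitrary pick from the 18 to 23 range, and the trailing "?" on the audio map makes that stream optional so silent sources still work.

```python
def trim_argv(src, dst, start, duration, accurate=False):
    """Build an ffmpeg trim command; copy mode is fast, accurate mode re-encodes."""
    if accurate:
        return [
            "ffmpeg", "-i", src,
            "-ss", start, "-t", duration,       # -ss after -i: decode, then cut exactly
            "-map", "0:v:0", "-map", "0:a:0?",  # explicit stream selection
            "-c:v", "libx264", "-crf", "20", "-pix_fmt", "yuv420p",
            "-c:a", "aac",
            "-movflags", "+faststart",
            dst,
        ]
    return [
        "ffmpeg", "-ss", start, "-i", src,      # -ss before -i: fast keyframe seek
        "-t", duration,
        "-map", "0:v:0", "-map", "0:a:0?",
        "-c", "copy",                           # no re-encode, no generation loss
        dst,
    ]
```

Passing the list to subprocess.run avoids shell quoting problems with file names that contain spaces.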

5. Validate the Output

After rendering, inspect the output with ffprobe and verify:

  • expected duration
  • expected resolution and aspect ratio
  • expected stream count
  • audio is present and synchronized
  • captions/overlays appear when expected

If the user asked for a reusable workflow, keep the command readable and parameterized.
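
The mechanical part of this checklist can be sketched as a comparison against parsed ffprobe JSON (the output of ffprobe -show_format -show_streams -of json). check_output and its parameters are hypothetical, and the 0.1 second tolerance is an assumption. Audio sync and the visibility of captions or overlays still need a visual check.

```python
def check_output(info, duration, size, stream_count, tolerance=0.1):
    """Return a list of mismatches between ffprobe JSON and expectations."""
    problems = []
    streams = info.get("streams", [])

    actual = float(info.get("format", {}).get("duration", 0.0))
    if abs(actual - duration) > tolerance:
        problems.append(f"duration {actual:.3f}s, expected {duration:.3f}s")

    video = next((s for s in streams if s.get("codec_type") == "video"), None)
    if video is None:
        problems.append("no video stream")
    elif (video.get("width"), video.get("height")) != tuple(size):
        problems.append(
            f"resolution {video.get('width')}x{video.get('height')}, "
            f"expected {size[0]}x{size[1]}"
        )

    if len(streams) != stream_count:
        problems.append(f"{len(streams)} streams, expected {stream_count}")

    if not any(s.get("codec_type") == "audio" for s in streams):
        problems.append("no audio stream")

    return problems
```

An empty list means the render passed the automated checks.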

Decision Rules

Trim One Clip

  • Use copy trim for speed and zero generation loss when keyframe-aligned cut points are acceptable.
  • Use re-encode trim for frame accuracy.

Concatenate Clips

  • Use the concat demuxer when clips already match codec, time base, dimensions, and stream layout.
  • Use the concat filter when clips differ or need per-clip transforms first.
  • Use xfade and acrossfade when the user wants polished clip-to-clip transitions instead of hard cuts.
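
For the concat demuxer path, a minimal sketch of the list file and the matching command follows. The helper names are hypothetical. The -safe 0 option is included on the assumption that the list may contain absolute paths.

```python
def concat_list(paths):
    """Render the text of a concat demuxer list file, one 'file' line per clip."""
    lines = []
    for path in paths:
        # A single quote inside a quoted path is written as '\'' .
        escaped = path.replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def concat_argv(list_path, dst):
    """Join already-compatible clips without re-encoding."""
    return [
        "ffmpeg", "-f", "concat", "-safe", "0", "-i", list_path,
        "-map", "0", "-c", "copy", dst,
    ]
```

Write the result of concat_list to a file such as clips.txt, then run concat_argv("clips.txt", "merged.mp4"). If the clips do not match, normalize them first and use the concat filter instead.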

Replace or Mix Audio

  • Replace audio by mapping the video from one input and audio from another.
  • Mix audio with amix. When speech must stay clear over music, duck the music with sidechaincompress keyed by the speech track before mixing.
  • Fade music in or out instead of hard starts and stops unless the user asked for abrupt edits.
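
The sketch below illustrates both audio paths under stated assumptions: input 0 is the video with speech, input 1 is the music, and the threshold, ratio, and fade values are starting points to tune by ear rather than recommended settings. The function names are hypothetical.

```python
def replace_audio_argv(video, audio, dst):
    """Keep the picture from input 0 and take the sound from input 1."""
    return [
        "ffmpeg", "-i", video, "-i", audio,
        "-map", "0:v:0", "-map", "1:a:0",
        "-c:v", "copy", "-c:a", "aac",
        "-shortest",  # deliberately end at the shorter of the two streams
        dst,
    ]


def duck_graph(threshold=0.05, ratio=8, fade_in=1.0):
    """Filter graph: fade the music in, duck it under speech, then mix."""
    return (
        f"[1:a]afade=t=in:st=0:d={fade_in:g}[music];"
        "[0:a]asplit=2[voice][key];"
        f"[music][key]sidechaincompress=threshold={threshold:g}:ratio={ratio:g}[ducked];"
        "[voice][ducked]amix=inputs=2:duration=first[aout]"
    )


def mix_audio_argv(video, music, dst):
    """Mix ducked music under the original track without touching the video."""
    return [
        "ffmpeg", "-i", video, "-i", music,
        "-filter_complex", duck_graph(),
        "-map", "0:v:0", "-map", "[aout]",
        "-c:v", "copy", "-c:a", "aac",
        dst,
    ]
```

The speech track is split so that one copy drives the compressor and the other reaches the final mix unchanged.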

Apply Captions

  • Burn captions into the picture for platform-safe delivery or when the user wants styled subtitles.
  • Mux subtitles as soft tracks when the user needs togglable captions.
  • If no subtitle file or transcript exists, note that speech-to-text is a separate step.
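
A sketch of the two caption paths, assuming an MP4 target and a subtitle file name without characters that need filter escaping. The helper names are hypothetical. The subtitles filter requires an ffmpeg build with libass, and mov_text is the subtitle codec MP4 containers accept.

```python
def burn_captions_argv(src, subs, dst):
    """Render captions into the picture; the video must be re-encoded."""
    return [
        "ffmpeg", "-i", src,
        "-map", "0:v:0", "-map", "0:a:0?",
        "-vf", f"subtitles={subs}",
        "-c:v", "libx264", "-crf", "20", "-pix_fmt", "yuv420p",
        "-c:a", "copy",
        "-movflags", "+faststart",
        dst,
    ]


def soft_captions_argv(src, subs, dst):
    """Mux captions as a togglable track; video and audio are copied."""
    return [
        "ffmpeg", "-i", src, "-i", subs,
        "-map", "0:v:0", "-map", "0:a:0?", "-map", "1:0",
        "-c:v", "copy", "-c:a", "copy",
        "-c:s", "mov_text",
        dst,
    ]
```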

Make Social Formats

  • Use crop for intentional reframing.
  • Use pad when preserving the full frame matters more than filling the canvas.
  • Keep 9:16 as a first-class output path for Shorts, Reels, and TikTok-style requests.
  • Add slight zoom or drift only when it supports the framing. Avoid constant motion on every clip.
  • Keep output fps explicit when building short-form deliverables.
  • For phone-first meme shorts, prefer a true fullscreen reframe from the source whenever the joke can survive cropping; do not default to a small horizontal clip floating inside a blurred background.
  • Use the centered-foreground-over-background pattern as a fallback when preserving the whole horizontal frame matters more than screen occupancy or when cropping would destroy the beat.
  • Different beats in the same meme may use different reframes. A setup can stay wider while the punchline snaps into a tighter fullscreen crop.
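
The crop and pad choices above map to two short filter chains. This is a sketch with assumed defaults (1080x1920 at 30 fps); vertical_filter is a hypothetical helper and does not reproduce scripts/make-vertical.sh.

```python
def vertical_filter(mode, width=1080, height=1920, fps=30):
    """Return a -vf chain that fills (crop) or fits (pad) a vertical canvas."""
    if mode == "crop":
        # Scale until the frame covers the canvas, then center-crop the excess.
        body = (f"scale={width}:{height}:force_original_aspect_ratio=increase,"
                f"crop={width}:{height}")
    elif mode == "pad":
        # Scale until the frame fits inside the canvas, then pad around it.
        body = (f"scale={width}:{height}:force_original_aspect_ratio=decrease,"
                f"pad={width}:{height}:(ow-iw)/2:(oh-ih)/2")
    else:
        raise ValueError("mode must be 'crop' or 'pad'")
    # Square pixels and an explicit frame rate keep the deliverable predictable.
    return f"{body},setsar=1,fps={fps}"
```

Crop mode centers the frame by default; an off-center subject needs explicit x and y arguments on the crop filter.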

Add Text Or Meme Captions

  • For short meme renders, prefer scripts/render-meme-vertical.sh over ad hoc drawtext when the output needs top/bottom reaction text.
  • Never assume the caption fits on one line. Wrap long phrases into a bounded caption box so the text stays inside the frame.
  • Prefer overlaying rendered caption cards for multi-line meme text instead of building brittle single-line drawtext expressions.
  • Avoid thick harsh outlines by default. Prefer a thinner dark contour plus a soft shadow so the text stays readable without looking cheap.
  • Keep meme captions centered and leave explicit margins from the top and bottom edges.
  • If the output is a still meme rather than a video, it is acceptable to pre-render the caption bars/cards with ImageMagick using a real bold font and then composite them deterministically. Preserve the same wrapped-card look instead of dropping back to weak inline text.
  • For fullscreen TikTok renders, size caption cards for actual phone readability rather than reusing small caption assets from the horizontal version.
  • Re-seat glasses, masks, and other face overlays after every major reframe; coordinates that worked in a horizontal crop should be treated as invalid after a vertical fullscreen recut.
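
The wrapping rule above can be approximated before any rendering happens. The sketch below is not what scripts/render-meme-vertical.sh does; wrap_caption is hypothetical, and the 0.6 glyph-width ratio is a crude assumption for a bold sans font, so measure real text when accuracy matters.

```python
import textwrap


def wrap_caption(text, box_width_px, font_size_px, glyph_ratio=0.6, max_lines=3):
    """Wrap caption text so every line fits an estimated character budget."""
    # Estimate how many glyphs fit across the caption box.
    max_chars = max(1, int(box_width_px / (font_size_px * glyph_ratio)))
    lines = textwrap.wrap(text, width=max_chars)
    if len(lines) > max_lines:
        raise ValueError(
            f"caption needs {len(lines)} lines; shorten it or reduce the font size"
        )
    return lines
```

Raising an error instead of silently shrinking the text keeps the decision about font size or shorter wording explicit.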

Composite Downloaded Stills

  • Prefer placing externally downloaded stills, logos, or product photos into assets/ during the heavy-assets phase before you write the ffmpeg command.
  • If the still needs transparency, run remove-background/SKILL.md first and use the resulting PNG as the overlay input.
  • Keep overlay asset names beat-oriented so the edit can reference them without reconstructing the scenario.
  • For overlays prepared in the heavy-assets phase, prefer the downloaded and locally cleaned asset that best supports the beat.
  • Position the overlay against the actual face/head landmarks in the chosen frame; do not leave the asset floating off-face just because the download step succeeded.
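
A sketch of landmark-anchored placement, assuming the target point (for example the midpoint between the eyes) has already been measured in the output frame. overlay_graph and overlay_argv are hypothetical names. Inside the overlay filter, w and h refer to the overlay image, so subtracting half of each centers the asset on the point.

```python
def overlay_graph(center_x, center_y, overlay_width):
    """Scale the still to a given width and center it on (center_x, center_y)."""
    return (
        f"[1:v]scale={overlay_width}:-1[ov];"
        f"[0:v][ov]overlay=x={center_x}-w/2:y={center_y}-h/2[v]"
    )


def overlay_argv(video, image, dst, center_x, center_y, overlay_width):
    """Composite one still (input 1) over the video (input 0)."""
    return [
        "ffmpeg", "-i", video, "-i", image,
        "-filter_complex", overlay_graph(center_x, center_y, overlay_width),
        "-map", "[v]", "-map", "0:a:0?",
        "-c:v", "libx264", "-crf", "20", "-pix_fmt", "yuv420p",
        "-c:a", "copy",
        dst,
    ]
```

After any reframe, measure the landmark again and pass the new coordinates rather than reusing the old ones.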

Add Transitions

  • Prefer short transitions, usually 0.10-0.35 seconds, for social edits unless the user wants a slower dramatic style.
  • Use plain cuts for beat-driven action when transitions would blur impact.
  • Use xfade for video and acrossfade for audio so the transition feels cohesive.
  • Normalize resolution, fps, sample rate, and channel layout before applying transitions.
  • For larger edits, put the full graph in a filter script instead of building one long quoted command.
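
The key arithmetic in a two-clip transition is the xfade offset: the second clip starts fading in at the first clip's duration minus the transition length. The sketch below assumes both clips are already normalized as described above; transition_graph is a hypothetical helper and does not reproduce scripts/transition-two-clips.sh.

```python
def transition_graph(first_duration, transition=0.25, style="fade"):
    """Build an xfade plus acrossfade graph for two normalized clips."""
    if not 0 < transition < first_duration:
        raise ValueError("transition must be shorter than the first clip")
    offset = first_duration - transition
    return (
        f"[0:v][1:v]xfade=transition={style}:"
        f"duration={transition:.3f}:offset={offset:.3f}[v];"
        f"[0:a][1:a]acrossfade=d={transition:.3f}[a]"
    )
```

For a 5 second first clip and the default 0.25 second fade, the offset is 4.75 seconds. The result can be passed to -filter_complex with -map "[v]" -map "[a]", or saved to a filter script file for larger edits.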

Reference Files

Read references/patterns.md for command skeletons covering the main editing patterns.