October 4, 2025

Building an FFmpeg Pipeline for LLM Video Benchmarks


Context

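When evaluating LLMs on video tasks, inconsistent preprocessing makes the results unreliable: differences in scores may come from the input, not the model. I use FFmpeg as a deterministic ingestion layer so that every model sees the same input and comparisons are fair.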

Core idea

Every model should receive the same normalized input:

same resolution
same frame rate
same audio format
same clip boundaries

Without this, benchmark results mix model quality with preprocessing noise.
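One way to check that these properties actually hold is to inspect each normalized file with ffprobe and compare it against the expected profile. The following is a minimal sketch, not a definitive validator: `EXPECTED`, `probe`, and `check_probe` are hypothetical names, and the target values are assumed to match the baseline FFmpeg command used in this article. It covers resolution, frame rate, and audio format; clip boundaries would need an additional duration comparison.

```python
import json
import subprocess
from fractions import Fraction

# Hypothetical target profile; the values mirror the baseline FFmpeg command.
EXPECTED = {"width": 1280, "height": 720, "fps": Fraction(2),
            "sample_rate": 16000, "channels": 1}


def probe(path):
    """Return ffprobe's JSON description of the streams in `path`."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-print_format", "json", "-show_streams", path],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def check_probe(info, expected=EXPECTED):
    """Return a list of mismatches; an empty list means the file conforms."""
    problems = []
    streams = info.get("streams", [])
    video = next((s for s in streams if s.get("codec_type") == "video"), None)
    audio = next((s for s in streams if s.get("codec_type") == "audio"), None)

    if video is None:
        problems.append("no video stream")
    else:
        size = (video["width"], video["height"])
        if size != (expected["width"], expected["height"]):
            problems.append(f"resolution is {size[0]}x{size[1]}")
        try:
            rate = Fraction(video["r_frame_rate"])  # reported as e.g. "2/1"
        except ZeroDivisionError:  # ffprobe can report "0/0" for unknown rates
            rate = None
        if rate != expected["fps"]:
            problems.append(f"frame rate is {video['r_frame_rate']}")

    if audio is None:
        problems.append("no audio stream")
    else:
        if int(audio["sample_rate"]) != expected["sample_rate"]:
            problems.append(f"sample rate is {audio['sample_rate']}")
        if int(audio["channels"]) != expected["channels"]:
            problems.append(f"channel count is {audio['channels']}")
    return problems
```

In use, `check_probe(probe("normalized.mp4"))` returning an empty list would mean the clip is safe to include in a run.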

Benchmark UI snapshot

Figure: FFmpeg and LLM benchmark interface. Screenshot from my benchmark setup UI, used to compare prompts and model outputs.

Baseline FFmpeg normalization

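# scale fits the frame inside 1280x720; pad letterboxes it to exactly 1280x720 for any source aspect ratio
ffmpeg -i input.mp4 \
  -vf "scale=1280:720:force_original_aspect_ratio=decrease,pad=1280:720:(ow-iw)/2:(oh-ih)/2,fps=2" \
  -c:v libx264 -preset medium -crf 23 \
  -c:a aac -ar 16000 -ac 1 \
  -y normalized.mp4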
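To apply the same settings to many clips, the command can be generated from a single profile object instead of being retyped. This is a sketch under assumptions: `Profile` and `normalize_cmd` are hypothetical names, the `pad` step is included so every output has exactly the target dimensions, and the optional `clip_start` and `clip_duration` arguments map to FFmpeg's `-ss` and `-t` options to pin clip boundaries.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Profile:
    """Hypothetical normalization profile; defaults mirror the baseline command."""
    width: int = 1280
    height: int = 720
    fps: int = 2
    crf: int = 23
    preset: str = "medium"
    audio_rate: int = 16000
    audio_channels: int = 1


def normalize_cmd(src: str, dst: str, p: Profile = Profile(),
                  clip_start: Optional[float] = None,
                  clip_duration: Optional[float] = None) -> List[str]:
    """Build the FFmpeg argument list for one clip (does not run it)."""
    vf = (
        f"scale={p.width}:{p.height}:force_original_aspect_ratio=decrease,"
        f"pad={p.width}:{p.height}:(ow-iw)/2:(oh-ih)/2,"
        f"fps={p.fps}"
    )
    cmd = ["ffmpeg"]
    if clip_start is not None:
        cmd += ["-ss", str(clip_start)]  # seek in the input before decoding
    cmd += ["-i", src]
    if clip_duration is not None:
        cmd += ["-t", str(clip_duration)]  # limit the output duration
    cmd += [
        "-vf", vf,
        "-c:v", "libx264", "-preset", p.preset, "-crf", str(p.crf),
        "-c:a", "aac", "-ar", str(p.audio_rate), "-ac", str(p.audio_channels),
        "-y", dst,
    ]
    return cmd
```

Passing the list to `subprocess.run(normalize_cmd("input.mp4", "normalized.mp4"), check=True)` avoids shell quoting issues with the filter string.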

I also export frames for visual reasoning tasks:

mkdir -p frames
ffmpeg -i normalized.mp4 -vf "fps=2" frames/frame_%05d.jpg
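When frames are passed to a model, it often helps to know roughly where each one sits in the clip. The sketch below maps a frame filename to an approximate timestamp, assuming a constant frame rate, a video that starts at time zero, and FFmpeg's default image-sequence numbering, which begins at 1. `frame_time` and `list_frames` are hypothetical helper names.

```python
import re
from pathlib import Path

# Matches names produced by the pattern frames/frame_%05d.jpg
FRAME_RE = re.compile(r"frame_(\d+)\.jpg$")


def frame_time(filename, fps=2.0):
    """Approximate timestamp in seconds for an exported frame."""
    match = FRAME_RE.search(filename)
    if match is None:
        raise ValueError(f"not an exported frame: {filename}")
    index = int(match.group(1))  # numbering starts at 1 by default
    return (index - 1) / fps


def list_frames(frames_dir, fps=2.0):
    """Return (filename, approximate timestamp) pairs in playback order."""
    paths = sorted(Path(frames_dir).glob("frame_*.jpg"))
    return [(p.name, frame_time(p.name, fps)) for p in paths]
```

The timestamps are approximations; for exact values, the presentation timestamps would have to be read from the video itself.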

Benchmark dimensions

For each model run, I track at least:

Task quality: correctness against labeled data
Latency: end-to-end runtime per clip
Cost: token usage plus infrastructure cost
Stability: variance across repeated runs
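These four dimensions can be summarized per clip and model from repeated runs. A minimal sketch, assuming each run is a dict with the `score`, `latency_ms`, and `total_cost_usd` fields from the record schema in this article; `summarize` is a hypothetical name, and sample standard deviation is just one possible stability measure.

```python
from statistics import mean, stdev


def summarize(runs):
    """Aggregate repeated runs of the same clip, model, and prompt."""
    if not runs:
        raise ValueError("need at least one run")
    scores = [r["score"] for r in runs]
    return {
        "n_runs": len(runs),
        "mean_score": mean(scores),                                # task quality
        "score_stdev": stdev(scores) if len(scores) > 1 else 0.0,  # stability
        "mean_latency_ms": mean(r["latency_ms"] for r in runs),    # latency
        "total_cost_usd": sum(r["total_cost_usd"] for r in runs),  # cost
    }
```

A high `score_stdev` relative to the gap between two models suggests that more repetitions are needed before ranking them.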

Suggested record schema

{
  "clip_id": "video_0042",
  "model": "my-llm-name",
  "prompt_version": "v3",
  "latency_ms": 1840,
  "total_cost_usd": 0.0184,
  "score": 0.91
}
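Records in this shape are easy to store as JSON Lines, one record per line. The sketch below checks that the required fields are present with plausible types before serializing; `REQUIRED`, `validate_record`, and `to_jsonl` are hypothetical names, and the type rules are an assumption based on the example values above.

```python
import json

# Field names come from the suggested schema; the types are assumed from its example values.
REQUIRED = {
    "clip_id": str,
    "model": str,
    "prompt_version": str,
    "latency_ms": (int, float),
    "total_cost_usd": (int, float),
    "score": (int, float),
}


def validate_record(record):
    """Raise if a required field is missing or has an unexpected type."""
    for key, expected_type in REQUIRED.items():
        if key not in record:
            raise ValueError(f"missing field: {key}")
        value = record[key]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise TypeError(f"unexpected type for {key}: {type(value).__name__}")
    return record


def to_jsonl(record):
    """Serialize one validated record as a single JSON line with stable key order."""
    return json.dumps(validate_record(record), sort_keys=True)
```

Sorting keys keeps the output stable across runs, which makes result files easier to diff.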

Lessons from practice

1) Prompt quality is not enough

Even strong prompts underperform when frame extraction is noisy or inconsistent.

2) Frame cadence changes outcomes

For some tasks, fps=1 is enough; for motion-sensitive tasks, fps=2 or fps=4 improves recall but increases cost, because the number of frames sent to the model grows in proportion to the frame rate.
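The cost side of this trade-off is simple arithmetic. A rough sketch, assuming constant-rate sampling and a fixed per-frame token cost: `frame_count` and `estimate_visual_tokens` are hypothetical helpers, and `tokens_per_frame` is model-specific and has to be measured, not assumed.

```python
import math


def frame_count(duration_s, fps):
    """Approximate number of frames sampled from a clip at a constant rate."""
    return max(1, math.floor(duration_s * fps))


def estimate_visual_tokens(duration_s, fps, tokens_per_frame):
    """Rough visual-token estimate; tokens_per_frame must be measured per model."""
    return frame_count(duration_s, fps) * tokens_per_frame
```

For a 30-second clip, `frame_count` gives 30, 60, and 120 frames at fps=1, 2, and 4, so doubling the cadence roughly doubles the visual input. The exact number of frames FFmpeg writes can differ by one.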

3) Keep benchmark versions immutable

Version these pieces independently:

FFmpeg profile
prompt template
evaluation script
model version

If any of them changes, treat it as a new benchmark version and do not compare its scores directly with earlier runs.
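One way to enforce this is to derive a benchmark identifier from the four version strings, so that changing any one of them yields a new identifier. A minimal sketch; `benchmark_version` is a hypothetical helper, and hashing is only one possible scheme.

```python
import hashlib
import json


def benchmark_version(ffmpeg_profile, prompt_template, eval_script, model_version):
    """Return a short, stable identifier for one combination of versions."""
    payload = json.dumps(
        {
            "ffmpeg_profile": ffmpeg_profile,
            "prompt_template": prompt_template,
            "eval_script": eval_script,
            "model_version": model_version,
        },
        sort_keys=True,  # stable key order keeps the hash reproducible
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]
```

Storing this identifier alongside each result record makes it harder to mix runs from different setups by accident.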

Where Hugging Face fits

Hugging Face is useful for both:

selecting open models to compare
reusing evaluation datasets

Relying on public models and datasets helps keep your benchmark transparent and reproducible across environments.

Final takeaway

A good LLM benchmark is mostly about system design discipline. FFmpeg gives you the deterministic preprocessing layer; your evaluation pipeline gives you the truth.