@vishalsachdev
Created January 15, 2026 04:37
Clawdbot: local video understanding w/ Ollama (conversation excerpt)

Conversation excerpt (from: "can you explore a local video understanding model...")

USER — 2026-01-15T04:13:13.017Z

[Telegram Vishal Sachdev (@svishal2001) id:5101503769 2026-01-15T04:13Z] can you explore a local video understanding model so we can be more carefree about usage. we have ollama available. Dont update from github, as if you have summarize cli, i am guessing clawdbot updates will trigger summarize cli updates [message_id: 513]

ASSISTANT — 2026-01-15T04:14:33.509Z

[[reply_to:513]] Yep — makes sense. Right now your Ollama install has only text/code models (no vision models yet), so “local video understanding” isn’t possible until we add at least one vision-capable model.

What I found locally

ollama list currently shows:

  • deepseek-coder:33b-base-q4_0
  • qwen2.5-coder:32b-instruct-q4_0
  • nemotron-16k:latest
  • nemotron-3-nano:latest

No vision model (e.g., LLaVA / Llama-Vision / Moondream) is installed yet.

What I’m doing to explore (local + carefree usage)

  1. Add a lightweight vision model in Ollama
    I started pulling moondream (a relatively small, commonly-used local vision model) so we can do quick “what’s on this slide/frame?” extraction without caring about API costs.

  2. Use Ollama locally via HTTP API for images
    On this machine, ollama run --help doesn’t show an --image flag, so I wrote a tiny script that calls Ollama’s local endpoint (/api/generate) with base64 images (the core call is sketched after the workflow list below):

    • Script: /Users/vishal/clawd/scripts/ollama_vision_frames.py
    • It takes a folder of frames (jpg/png) and produces JSONL with per-frame descriptions.

    Example usage (once moondream is pulled):

    python3 /Users/vishal/clawd/scripts/ollama_vision_frames.py \
      --model moondream \
      --frames-dir /Users/vishal/clawd/skills/rlm-long-context-agent/references/frames_selected \
      --out /Users/vishal/clawd/tmp/vision_frames_moondream.jsonl
  3. Workflow we can standardize

  • yt-dlp download (already working)
  • ffmpeg extract frames (already working)
  • local vision model over selected frames → “visual notes”
  • optional: use a local text model (e.g., nemotron-16k) to combine transcript + visual notes into a structured summary/skill.
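
For reference, here’s a minimal sketch of the kind of per-frame call such a script makes. The real ollama_vision_frames.py isn’t reproduced here; the prompt text, file handling, and JSONL field names are illustrative placeholders, while /api/generate with model/prompt/images/stream is Ollama’s standard local API:

    import base64, json, sys
    from pathlib import Path
    from urllib import request

    OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

    def describe_frame(image_path: Path, model: str = "moondream") -> str:
        """Ask a local Ollama vision model to describe one frame."""
        b64 = base64.b64encode(image_path.read_bytes()).decode()
        payload = {
            "model": model,
            # Placeholder prompt; the real script tunes this per use case.
            "prompt": "Describe this frame: app/site, screen state, and any visible text.",
            "images": [b64],   # vision models accept base64-encoded images here
            "stream": False,
        }
        req = request.Request(OLLAMA_URL, data=json.dumps(payload).encode(),
                              headers={"Content-Type": "application/json"})
        with request.urlopen(req) as resp:
            return json.loads(resp.read())["response"]

    if __name__ == "__main__":
        frames_dir, out_path = Path(sys.argv[1]), Path(sys.argv[2])
        frames = sorted(p for p in frames_dir.iterdir()
                        if p.suffix.lower() in {".jpg", ".jpeg", ".png"})
        with out_path.open("w") as out:
            for frame in frames:
                out.write(json.dumps({"frame": frame.name,
                                      "description": describe_frame(frame)}) + "\n")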

Model options (Ollama) to consider next

  • moondream: lightweight; good for “slide description / OCR-ish” tasks; easiest to run often.
  • llama3.2-vision (if we want higher quality): better visual understanding, but heavier.

Re: summarize updates

Understood — I won’t update from GitHub. We’ll treat summarize as managed via your existing Clawdbot / environment updates.

If you tell me what “video understanding” means for you (slide OCR + outline? diagram understanding? “chaptering”?), I can tune the prompts + decide whether we should stick with moondream or also pull a heavier vision model.

USER — 2026-01-15T04:14:35.045Z

[Telegram Vishal Sachdev (@svishal2001) id:5101503769 2026-01-15T04:14Z] create a reminder for me to do this on friday - • vault/01_Projects/folders-sync-setup.md#L1. [message_id: 514]

ASSISTANT — 2026-01-15T04:14:55.117Z

[[reply_to:514]] Done — I scheduled a reminder for Friday (Jan 16) at 9:00am America/Chicago:

“Reminder: do the folders sync setup task: vault/01_Projects/folders-sync-setup.md#L1.”

USER — 2026-01-15T04:16:17.468Z

Read HEARTBEAT.md if exists. Consider outstanding tasks. Checkup sometimes on your human during (user local) day time.

ASSISTANT — 2026-01-15T04:16:23.134Z

HEARTBEAT_OK

USER — 2026-01-15T04:16:59.596Z

[Telegram Vishal Sachdev (@svishal2001) id:5101503769 2026-01-15T04:16Z] video understanding will depend on the video. Some are walkthroughs of a new way of doing things, so they have demos etc. I usually dont have slides based videos. [message_id: 517]

ASSISTANT — 2026-01-15T04:17:15.961Z

[[reply_to:517]] That helps—a lot. For demo/walkthrough videos, “video understanding” is less about slide OCR and more about turning what happened on screen into a structured set of steps + key screenshots.

Here’s the approach I’d recommend locally with Ollama:

  1. Transcript track (what’s said)
  • Keep doing what we’re doing: yt-dlp captions (or Whisper if needed).
  • This usually gives you the narrative + intent.
  2. Visual track (what’s shown)
  Instead of evenly sampling frames, do event-driven keyframes:
  • Detect scene changes / big visual shifts (good for “new screen / new menu / new step”).
  • For each segment, grab 1–3 representative frames.

Then run a local vision model on those frames with prompts like:

  • “What app/site is this? What screen/state are we in?”
  • “What did the user click/change? What’s the before/after?”
  • “Extract any visible UI text, button labels, menu items.”
  • “What step of the workflow is being demonstrated?”
  3. Synthesis (turn into a how-to)
  Use a local text model (e.g., nemotron-16k) to combine the transcript summary + per-keyframe descriptions (a minimal call is sketched after this list) into:
  • a numbered procedure
  • common pitfalls
  • “if you see X screen, do Y”
  • and optionally a reusable “skill” / checklist
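
Here’s a minimal sketch of what that synthesis call can look like, assuming the milestone notes are the JSONL produced by the vision pass above; the prompt wording, field names, and function name are placeholders rather than the exact prompts I’ll ship:

    import json
    from pathlib import Path
    from urllib import request

    OLLAMA_URL = "http://localhost:11434/api/generate"

    def synthesize_skill(transcript_summary: str, notes_jsonl: Path,
                         model: str = "nemotron-16k") -> str:
        """Combine the transcript summary and per-milestone notes into a skill draft."""
        notes = [json.loads(line) for line in notes_jsonl.read_text().splitlines() if line.strip()]
        milestones = "\n".join(f"- {n['frame']}: {n['description']}" for n in notes)
        prompt = (
            "Write a reusable agent skill from this demo video.\n\n"
            f"Transcript summary:\n{transcript_summary}\n\n"
            f"Milestone notes (one per keyframe):\n{milestones}\n\n"
            "Output: a numbered procedure, common pitfalls, and 'if you see X, do Y' rules."
        )
        payload = {"model": model, "prompt": prompt, "stream": False}
        req = request.Request(OLLAMA_URL, data=json.dumps(payload).encode(),
                              headers={"Content-Type": "application/json"})
        with request.urlopen(req) as resp:
            return json.loads(resp.read())["response"]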

Status: I’ve already started pulling a local vision model (moondream) and I wrote a script to run an Ollama vision model over a set of frames and output JSONL (scripts/ollama_vision_frames.py). Next step is improving frame selection for demos (scene-change-based rather than slide-based).

Two quick questions so I optimize for your use:

  • Do you want outputs optimized for (A) a written how-to / checklist or (B) a reusable agent skill (with workflow + guardrails)?
  • For demos, is it more valuable to capture click-by-click UI steps (lots of frames) or milestone states (fewer frames, higher-level)?

USER — 2026-01-15T04:18:59.296Z

[Telegram Vishal Sachdev (@svishal2001) id:5101503769 2026-01-15T04:18Z] B, milestone states [message_id: 519]

ASSISTANT — 2026-01-15T04:19:09.960Z

[[reply_to:519]] Great — “reusable agent skill” + milestone states is the cleanest setup.

Plan (local-first)

Pipeline for demo/walkthrough videos:

  1. Transcript (captions via yt-dlp; Whisper fallback only if needed)
  2. Milestone frames (scene-change based, not time-based)
  3. Local vision pass (Ollama vision model) to label each milestone: “what screen/state is this + what changed + why it matters”
  4. Synthesize SKILL.md (local text model) using: transcript summary + milestone notes + timestamps

What I’ll change vs the slide-based approach

Instead of fps=1/15, we’ll use scene-change detection to pick key frames, e.g. ffmpeg’s select='gt(scene,THRESH)' and then cap to ~10–25 milestones for a ~1 hour video.
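
Concretely, the selection step can look something like this (a sketch that wraps ffmpeg from Python for consistency with the other scripts; the 0.3 threshold, output naming, and function name are placeholders to tune, not the final script):

    import subprocess
    from pathlib import Path

    def extract_scene_frames(video: Path, out_dir: Path, threshold: float = 0.3) -> list[Path]:
        """Write one JPEG per detected scene change using ffmpeg's scene filter."""
        out_dir.mkdir(parents=True, exist_ok=True)
        subprocess.run(
            ["ffmpeg", "-i", str(video),
             "-vf", f"select='gt(scene,{threshold})'",  # keep frames that differ strongly from the previous one
             "-vsync", "vfr",                           # emit only the selected frames
             str(out_dir / "frame_%04d.jpg")],
            check=True,
        )
        return sorted(out_dir.glob("frame_*.jpg"))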

What’s already in motion

  • I’m pulling a local vision model in Ollama (moondream) so we can run this without hosted costs.
  • I added a script to run an Ollama vision model over a directory of frames and emit JSONL:
    • /Users/vishal/clawd/scripts/ollama_vision_frames.py

Quick confirm (so I tune defaults)

For milestone states, what’s your preferred target count per video?

  • 10–15 (very high-level)
  • 20–30 (still milestones, but more actionable)

Once you answer, I’ll bake that into the default extraction settings and we can re-run on the last video as a test.

USER — 2026-01-15T04:20:00.427Z

[Telegram Vishal Sachdev (@svishal2001) id:5101503769 2026-01-15T04:20Z] 20-30 assuming we are only using local models, so we should not be worried about compute [message_id: 521]

ASSISTANT — 2026-01-15T04:29:15.681Z

[[reply_to:521]] Perfect — 20–30 milestones is a good default if we’re staying local.

I went ahead and set up + validated the local pipeline end-to-end:

  • Installed local vision model (Ollama): moondream:latest is now pulled and available.

  • Milestone (scene-change) frame extraction: I added a script that uses ffmpeg scene-change detection and then downsamples to a target count (the downsampling step is sketched after this list):

    • /Users/vishal/clawd/scripts/extract_milestone_frames.sh
    • The target count defaults to 25; the scene-change threshold is tunable per video.
  • Local vision pass over milestone frames (Ollama → JSONL):

    • Script: /Users/vishal/clawd/scripts/ollama_vision_frames.py
    • Output example from the last video:
      • Frames: tmp/milestones/NQ2MiVuuJ6A/selected/ (25 frames)
      • Vision notes: tmp/milestones/NQ2MiVuuJ6A/moondream_frames.jsonl (+ .md)
  • Local synthesis into “skill-like” structure (text model):

    • Using nemotron-16k locally, I generated a structured draft from transcript summary + milestone notes:
      • tmp/milestones/NQ2MiVuuJ6A/nemotron_synthesis.md
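
For the “downsample to a target count” step, the idea is simply to keep roughly 25 evenly spaced candidates; here’s a Python sketch of that logic (extract_milestone_frames.sh itself isn’t reproduced here, and the function name and even-spacing strategy are just illustrative):

    from pathlib import Path

    def downsample_frames(candidates: list[Path], target: int = 25) -> list[Path]:
        """Keep roughly `target` evenly spaced frames out of the scene-change candidates."""
        if len(candidates) <= target:
            return candidates  # under the cap: keep everything (may warrant a lower-threshold re-run)
        step = len(candidates) / target
        return [candidates[int(i * step)] for i in range(target)]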

Next improvement (to make it work better for non-slide demos): I’ll tweak the prompts for moondream to focus on UI state + “what changed” (not “slide description”), and I’ll add a “cap to 20–30” default plus automatic threshold adjustment when scene-change detection returns too few frames.

If you send the next demo-style YouTube link you care about, I can run this full local pipeline and generate a new SKILL.md output from it.

USER — 2026-01-15T04:36:07.075Z

[Telegram Vishal Sachdev (@svishal2001) id:5101503769 2026-01-15T04:36Z] can you look at our conversation history above since my message "can you explore a local video understanding model so we can be more carefree about usage. we have ollama available. Dont update from github, as if you have summarize cli, i am guessing clawdbot updates will trigger summarize cli updates" and write an article about how we improved. I am mindblown by your capabilities, and started writing a tweet "@clawdbot is awesome! I have been feeding it videos I like that would help improve my AI agent workflows by creating skills from the video. Have it working on a beefy m2 max with Ollama for local models and codex auth for gpt5.2 as the orchestrator/brain ...." . Use that seed to write the article and follow my other guidelines in the write article skill. Then create the twitter thread and give it to me here. publish a public gist of the conversation to link in the twitter thread. [message_id: 523]
