[Telegram Vishal Sachdev (@svishal2001) id:5101503769 2026-01-15T04:13Z] can you explore a local video understanding model so we can be more carefree about usage. we have ollama available. Dont update from github, as if you have summarize cli, i am guessing clawdbot updates will trigger summarize cli updates [message_id: 513]
[[reply_to:513]] Yep — makes sense. Right now your Ollama install has only text/code models (no vision models yet), so “local video understanding” isn’t possible until we add at least one vision-capable model.
ollama list currently shows:
- deepseek-coder:33b-base-q4_0
- qwen2.5-coder:32b-instruct-q4_0
- nemotron-16k:latest
- nemotron-3-nano:latest
No vision model (e.g., LLaVA / Llama-Vision / Moondream) is installed yet.
Here's what I've set up so far:
- Add a lightweight vision model in Ollama
  - I started pulling `moondream` (a relatively small, commonly used local vision model) so we can do quick "what's on this slide/frame?" extraction without worrying about API costs.
- Use Ollama locally via its HTTP API for images
  - On this machine, `ollama run --help` doesn't show an `--image` flag, so I wrote a tiny script that calls Ollama's local endpoint (`/api/generate`) with base64 images (a rough sketch of that call is below).
  - Script: /Users/vishal/clawd/scripts/ollama_vision_frames.py
  - It takes a folder of frames (jpg/png) and produces JSONL with per-frame descriptions.
  - Example usage (once `moondream` is pulled):

    ```
    python3 /Users/vishal/clawd/scripts/ollama_vision_frames.py \
      --model moondream \
      --frames-dir /Users/vishal/clawd/skills/rlm-long-context-agent/references/frames_selected \
      --out /Users/vishal/clawd/tmp/vision_frames_moondream.jsonl
    ```
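For reference, here is a minimal sketch of the kind of call that script makes (a sketch under assumptions, not the actual script): it POSTs a base64-encoded frame to Ollama's local `/api/generate` endpoint. The helper name `describe_frame`, the prompt text, and the frame path are placeholders.

```python
# Minimal sketch (not the actual ollama_vision_frames.py): describe one frame
# with a local Ollama vision model via the /api/generate endpoint.
import base64
from pathlib import Path

import requests  # assumes the requests package is installed

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local port

def describe_frame(image_path: str, model: str = "moondream") -> str:
    """Send a single frame to the vision model and return its description."""
    image_b64 = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    payload = {
        "model": model,
        "prompt": "Describe what is shown in this frame.",  # placeholder prompt
        "images": [image_b64],  # Ollama accepts base64-encoded images for vision models
        "stream": False,        # one JSON response instead of a token stream
    }
    resp = requests.post(OLLAMA_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    print(describe_frame("frames/0001.jpg"))  # hypothetical frame path
```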
Workflow we can standardize:
- `yt-dlp` download (already working)
- `ffmpeg` frame extraction (already working)
- local vision model over selected frames → "visual notes"
- optional: a local text model (e.g., `nemotron-16k`) to combine transcript + visual notes into a structured summary/skill

A rough sketch of the first two steps is below.
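This is roughly what the two "already working" steps amount to; the function names, paths, and the 1-frame-per-15-seconds rate here are illustrative, not our exact invocation.

```python
# Sketch of the download + time-based frame extraction steps. Paths are
# illustrative; fps=1/15 matches the current time-based sampling.
import subprocess
from pathlib import Path

VIDEO_EXTS = {".mp4", ".mkv", ".webm"}

def download_video(url: str, workdir: Path) -> Path:
    """Grab the video plus auto-generated English captions with yt-dlp."""
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["yt-dlp", "-o", str(workdir / "video.%(ext)s"),
         "--write-auto-subs", "--sub-langs", "en", url],
        check=True,
    )
    # yt-dlp fills in the real extension; pick the video file, not the subtitles.
    return next(p for p in workdir.glob("video.*") if p.suffix in VIDEO_EXTS)

def extract_frames(video: Path, frames_dir: Path, every_seconds: int = 15) -> None:
    """Sample one frame every N seconds with ffmpeg."""
    frames_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", str(video),
         "-vf", f"fps=1/{every_seconds}",
         str(frames_dir / "%04d.jpg")],
        check=True,
    )
```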
Model options:
- moondream: lightweight; good for "slide description / OCR-ish" tasks; easiest to run often.
- llama3.2-vision (if we want higher quality): better visual understanding, but heavier.
Understood — I won’t update from GitHub. We’ll treat summarize as managed via your existing Clawdbot / environment updates.
If you tell me what “video understanding” means for you (slide OCR + outline? diagram understanding? “chaptering”?), I can tune the prompts + decide whether we should stick with moondream or also pull a heavier vision model.
[Telegram Vishal Sachdev (@svishal2001) id:5101503769 2026-01-15T04:14Z] create a reminder for me to do this on friday - • vault/01_Projects/folders-sync-setup.md#L1. [message_id: 514]
[[reply_to:514]] Done — I scheduled a reminder for Friday (Jan 16) at 9:00am America/Chicago:
“Reminder: do the folders sync setup task: vault/01_Projects/folders-sync-setup.md#L1.”
Read HEARTBEAT.md if exists. Consider outstanding tasks. Checkup sometimes on your human during (user local) day time.
HEARTBEAT_OK
[Telegram Vishal Sachdev (@svishal2001) id:5101503769 2026-01-15T04:16Z] video understanding will depend on the video. Some are walkthroughs of a new way of doing things, so they have demos etc. I usually dont have slides based videos. [message_id: 517]
[[reply_to:517]] That helps—a lot. For demo/walkthrough videos, “video understanding” is less about slide OCR and more about turning what happened on screen into a structured set of steps + key screenshots.
Here’s the approach I’d recommend locally with Ollama:
- Transcript track (what's said)
  - Keep doing what we're doing: `yt-dlp` captions (or Whisper if needed).
  - This usually gives you the narrative + intent.
- Visual track (what's shown)
  - Instead of evenly sampling frames, do event-driven keyframes:
    - Detect scene changes / big visual shifts (good for "new screen / new menu / new step").
    - For each segment, grab 1–3 representative frames.
Then run a local vision model on those frames with prompts like:
- “What app/site is this? What screen/state are we in?”
- “What did the user click/change? What’s the before/after?”
- “Extract any visible UI text, button labels, menu items.”
- “What step of the workflow is being demonstrated?”
- Synthesis (turn into a how-to)
  - Use a local text model (e.g., `nemotron-16k`) to combine the transcript summary + per-keyframe descriptions into:
    - a numbered procedure
    - common pitfalls
    - "if you see X screen, do Y"
    - and optionally a reusable "skill" / checklist
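A minimal sketch of what that synthesis call could look like, assuming the per-frame notes are read from the JSONL output and sent together with a transcript summary to a local text model over the same `/api/generate` endpoint; the function name, prompt wording, and file layout are placeholders, not the exact ones in use.

```python
# Sketch of the synthesis step: transcript summary + per-frame notes -> one
# structured how-to draft from a local text model. Prompt/paths are placeholders.
import json
from pathlib import Path

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def synthesize(transcript_summary: str, frame_notes_jsonl: Path,
               model: str = "nemotron-16k") -> str:
    # Each JSONL line is assumed to be one frame's vision note.
    notes = [json.loads(line)
             for line in frame_notes_jsonl.read_text().splitlines() if line.strip()]
    prompt = (
        "You are turning a demo/walkthrough video into a reusable skill.\n\n"
        f"Transcript summary:\n{transcript_summary}\n\n"
        f"Per-frame visual notes:\n{json.dumps(notes, indent=2)}\n\n"
        "Produce: (1) a numbered procedure, (2) common pitfalls, "
        "(3) 'if you see X screen, do Y' rules, (4) a reusable checklist."
    )
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```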
Status: I’ve already started pulling a local vision model (moondream) and I wrote a script to run an Ollama vision model over a set of frames and output JSONL (scripts/ollama_vision_frames.py). Next step is improving frame selection for demos (scene-change-based rather than slide-based).
Two quick questions so I optimize for your use:
- Do you want outputs optimized for (A) a written how-to / checklist or (B) a reusable agent skill (with workflow + guardrails)?
- For demos, is it more valuable to capture click-by-click UI steps (lots of frames) or milestone states (fewer frames, higher-level)?
[Telegram Vishal Sachdev (@svishal2001) id:5101503769 2026-01-15T04:18Z] B, milestone states [message_id: 519]
[[reply_to:519]] Great — “reusable agent skill” + milestone states is the cleanest setup.
Pipeline for demo/walkthrough videos:
- Transcript (captions via yt-dlp; Whisper fallback only if needed)
- Milestone frames (scene-change based, not time-based)
- Local vision pass (Ollama vision model) to label each milestone: “what screen/state is this + what changed + why it matters”
- Synthesize SKILL.md (local text model) using: transcript summary + milestone notes + timestamps
Instead of fps=1/15, we’ll use scene-change detection to pick key frames, e.g. ffmpeg’s select='gt(scene,THRESH)' and then cap to ~10–25 milestones for a ~1 hour video.
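As a sketch, the scene-change selection amounts to the following; 0.3 is an illustrative threshold (not a tuned value) and the helper name is hypothetical.

```python
# Sketch of scene-change keyframe extraction via ffmpeg's scene filter.
# threshold=0.3 is illustrative; real videos need per-video tuning.
import subprocess
from pathlib import Path

def extract_scene_frames(video: Path, out_dir: Path, threshold: float = 0.3) -> list[Path]:
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", str(video),
         # keep only frames whose scene-change score exceeds the threshold
         "-vf", f"select='gt(scene,{threshold})'",
         # emit frames as they are selected rather than at the source frame rate
         "-vsync", "vfr",
         str(out_dir / "%04d.jpg")],
        check=True,
    )
    return sorted(out_dir.glob("*.jpg"))
```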
Status so far:
- I'm pulling a local vision model in Ollama (`moondream`) so we can run this without hosted costs.
- I added a script to run an Ollama vision model over a directory of frames and emit JSONL: /Users/vishal/clawd/scripts/ollama_vision_frames.py
For milestone states, what’s your preferred target count per video?
- 10–15 (very high-level)
- 20–30 (still milestones, but more actionable)
Once you answer, I’ll bake that into the default extraction settings and we can re-run on the last video as a test.
[Telegram Vishal Sachdev (@svishal2001) id:5101503769 2026-01-15T04:20Z] 20-30 assuming we are only using local models, so we should not be worried about compute [message_id: 521]
[[reply_to:521]] Perfect — 20–30 milestones is a good default if we’re staying local.
I went ahead and set up + validated the local pipeline end-to-end:
- Installed local vision model (Ollama):
  - `moondream:latest` is now pulled and available.
- Milestone (scene-change) frame extraction:
  - I added a script that uses ffmpeg scene-change detection and then downsamples to a target count: /Users/vishal/clawd/scripts/extract_milestone_frames.sh (a rough sketch of the downsampling step follows this list)
  - Default target count can be 25; the threshold is tunable per video.
- Local vision pass over milestone frames (Ollama → JSONL):
  - Script: /Users/vishal/clawd/scripts/ollama_vision_frames.py
  - Output example from the last video:
    - Frames: tmp/milestones/NQ2MiVuuJ6A/selected/ (25 frames)
    - Vision notes: tmp/milestones/NQ2MiVuuJ6A/moondream_frames.jsonl (+ .md)
- Local synthesis into "skill-like" structure (text model):
  - Using `nemotron-16k` locally, I generated a structured draft from transcript summary + milestone notes: tmp/milestones/NQ2MiVuuJ6A/nemotron_synthesis.md
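As referenced above, the "downsample to a target count" step could be as simple as this sketch; the actual extract_milestone_frames.sh may implement it differently. It keeps an evenly spaced subset of the scene-change candidates.

```python
# Sketch of downsampling scene-change candidates to a target milestone count
# (e.g. 25). The real extract_milestone_frames.sh may do this differently.
from pathlib import Path

def downsample(frames: list[Path], target: int = 25) -> list[Path]:
    if len(frames) <= target:
        return frames
    # Pick indices spread evenly across the candidate list, keeping first and last.
    step = (len(frames) - 1) / (target - 1)
    return [frames[round(i * step)] for i in range(target)]
```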
Next improvement (to make it work better for non-slide demos): I’ll tweak the prompts for moondream to focus on UI state + “what changed” (not “slide description”), and I’ll add a small “cap to 20–30” default + automatic threshold adjustment when scene-change returns too few frames.
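The automatic threshold adjustment could look like this sketch, reusing `extract_scene_frames` from the earlier sketch; the threshold ladder and the minimum of 20 frames are illustrative defaults, not decided values.

```python
# Sketch of automatic threshold adjustment: if scene detection returns too few
# frames, retry with a lower threshold. Builds on extract_scene_frames above;
# the threshold ladder and min_frames default are illustrative.
from pathlib import Path

def extract_with_fallback(video: Path, out_dir: Path, min_frames: int = 20) -> list[Path]:
    thresholds = (0.4, 0.3, 0.2, 0.1)
    frames: list[Path] = []
    for i, threshold in enumerate(thresholds):
        frames = extract_scene_frames(video, out_dir, threshold)
        if len(frames) >= min_frames or i == len(thresholds) - 1:
            return frames  # enough milestones, or nothing lower left to try
        # Too few milestones at this threshold: clear and retry with a lower one.
        for f in frames:
            f.unlink()
    return frames
```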
If you send the next demo-style YouTube link you care about, I can run this full local pipeline and generate a new SKILL.md output from it.
[Telegram Vishal Sachdev (@svishal2001) id:5101503769 2026-01-15T04:36Z] can you look at our conversation history above since my message "can you explore a local video understanding model so we can be more carefree about usage. we have ollama available. Dont update from github, as if you have summarize cli, i am guessing clawdbot updates will trigger summarize cli updates" and write an article about how we improved. I am mindblown by your capabilities, and started writing a tweet "@clawdbot is awesome! I have been feeding it videos I like that would help improve my AI agent workflows by creating skills from the video. Have it working on a beefy m2 max with Ollama for local models and codex auth for gpt5.2 as the orchestrator/brain ...." . Use that seed to write the article and follow my other guidelines in the write article skill. Then create the twitter thread and give it to me here. publish a public gist of the conversation to link in the twitter thread. [message_id: 523]