[Telegram Vishal Sachdev (@svishal2001) id:5101503769 2026-01-15T04:13Z] can you explore a local video understanding model so we can be more carefree about usage. we have ollama available. Dont update from github, as if you have summarize cli, i am guessing clawdbot updates will trigger summarize cli updates [message_id: 513]
[[reply_to:513]] Yep — makes sense. Right now your Ollama install has only text/code models (no vision models yet), so “local video understanding” isn’t possible until we add at least one vision-capable model.
ollama list currently shows:
- deepseek-coder:33b-base-q4_0
- qwen2.5-coder:32b-instruct-q4_0
- nemotron-16k:latest
- nemotron-3-nano:latest
No vision model (e.g., LLaVA / Llama-Vision / Moondream) is installed yet.
Here's what I've set up so far:
- Add a lightweight vision model in Ollama
  - I started pulling `moondream` (a relatively small, commonly used local vision model) so we can do quick "what's on this slide/frame?" extraction without worrying about API costs.
- Use Ollama locally via its HTTP API for images
  - On this machine, `ollama run --help` doesn't show an `--image` flag, so I wrote a tiny script that calls Ollama's local endpoint (`/api/generate`) with base64 images (a rough sketch of that call is below).
  - Script: /Users/vishal/clawd/scripts/ollama_vision_frames.py
  - It takes a folder of frames (jpg/png) and produces JSONL with per-frame descriptions.
  - Example usage (once `moondream` is pulled):

    ```
    python3 /Users/vishal/clawd/scripts/ollama_vision_frames.py \
      --model moondream \
      --frames-dir /Users/vishal/clawd/skills/rlm-long-context-agent/references/frames_selected \
      --out /Users/vishal/clawd/tmp/vision_frames_moondream.jsonl
    ```
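For reference, here is a minimal sketch of the kind of call that script makes (a sketch under assumptions, not the actual script): it POSTs a base64-encoded frame to Ollama's local `/api/generate` endpoint. The helper name `describe_frame`, the prompt text, and the frame path are placeholders.

```python
# Minimal sketch (not the actual ollama_vision_frames.py): describe one frame
# with a local Ollama vision model via the /api/generate endpoint.
import base64
from pathlib import Path

import requests  # assumes the requests package is installed

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local port

def describe_frame(image_path: str, model: str = "moondream") -> str:
    """Send a single frame to the vision model and return its description."""
    image_b64 = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    payload = {
        "model": model,
        "prompt": "Describe what is shown in this frame.",  # placeholder prompt
        "images": [image_b64],  # Ollama accepts base64-encoded images for vision models
        "stream": False,        # one JSON response instead of a token stream
    }
    resp = requests.post(OLLAMA_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    print(describe_frame("frames/0001.jpg"))  # hypothetical frame path
```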
Workflow we can standardize:
- `yt-dlp` download (already working)
- `ffmpeg` frame extraction (already working)
- local vision model over selected frames → "visual notes"
- optional: a local text model (e.g., `nemotron-16k`) to combine transcript + visual notes into a structured summary/skill

A rough sketch of the first two steps is below.
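This is roughly what the two "already working" steps amount to; the function names, paths, and the 1-frame-per-15-seconds rate here are illustrative, not our exact invocation.

```python
# Sketch of the download + time-based frame extraction steps. Paths are
# illustrative; fps=1/15 matches the current time-based sampling.
import subprocess
from pathlib import Path

VIDEO_EXTS = {".mp4", ".mkv", ".webm"}

def download_video(url: str, workdir: Path) -> Path:
    """Grab the video plus auto-generated English captions with yt-dlp."""
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["yt-dlp", "-o", str(workdir / "video.%(ext)s"),
         "--write-auto-subs", "--sub-langs", "en", url],
        check=True,
    )
    # yt-dlp fills in the real extension; pick the video file, not the subtitles.
    return next(p for p in workdir.glob("video.*") if p.suffix in VIDEO_EXTS)

def extract_frames(video: Path, frames_dir: Path, every_seconds: int = 15) -> None:
    """Sample one frame every N seconds with ffmpeg."""
    frames_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", str(video),
         "-vf", f"fps=1/{every_seconds}",
         str(frames_dir / "%04d.jpg")],
        check=True,
    )
```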
Model options:
- moondream: lightweight; good for "slide description / OCR-ish" tasks; easiest to run often.
- llama3.2-vision (if we want higher quality): better visual understanding, but heavier.
Understood — I won’t update from GitHub. We’ll treat summarize as managed via your existing Clawdbot / environment updates.
If you tell me what “video understanding” means for you (slide OCR + outline? diagram understanding? “chaptering”?), I can tune the prompts + decide whether we should stick with moondream or also pull a heavier vision model.
[Telegram Vishal Sachdev (@svishal2001) id:5101503769 2026-01-15T04:14Z] create a reminder for me to do this on friday - • vault/01_Projects/folders-sync-setup.md#L1. [message_id: 514]
[[reply_to:514]] Done — I scheduled a reminder for Friday (Jan 16) at 9:00am America/Chicago:
“Reminder: do the folders sync setup task: vault/01_Projects/folders-sync-setup.md#L1.”
Read HEARTBEAT.md if exists. Consider outstanding tasks. Checkup sometimes on your human during (user local) day time.
HEARTBEAT_OK
[Telegram Vishal Sachdev (@svishal2001) id:5101503769 2026-01-15T04:16Z] video understanding will depend on the video. Some are walkthroughs of a new way of doing things, so they have demos etc. I usually dont have slides based videos. [message_id: 517]
[[reply_to:517]] That helps—a lot. For demo/walkthrough videos, “video understanding” is less about slide OCR and more about turning what happened on screen into a structured set of steps + key screenshots.
Here’s the approach I’d recommend locally with Ollama:
- Transcript track (what's said)
  - Keep doing what we're doing: `yt-dlp` captions (or Whisper if needed).
  - This usually gives you the narrative + intent.
- Visual track (what's shown)
  - Instead of evenly sampling frames, do event-driven keyframes:
    - Detect scene changes / big visual shifts (good for "new screen / new menu / new step").
    - For each segment, grab 1–3 representative frames.
Then run a local vision model on those frames with prompts like:
- “What app/site is this? What screen/state are we in?”
- “What did the user click/change? What’s the before/after?”
- “Extract any visible UI text, button labels, menu items.”
- “What step of the workflow is being demonstrated?”
- Synthesis (turn into a how-to)
  - Use a local text model (e.g., `nemotron-16k`) to combine the transcript summary + per-keyframe descriptions into:
    - a numbered procedure
    - common pitfalls
    - "if you see X screen, do Y"
    - and optionally a reusable "skill" / checklist
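A minimal sketch of what that synthesis call could look like, assuming the per-frame notes are read from the JSONL output and sent together with a transcript summary to a local text model over the same `/api/generate` endpoint; the function name, prompt wording, and file layout are placeholders, not the exact ones in use.

```python
# Sketch of the synthesis step: transcript summary + per-frame notes -> one
# structured how-to draft from a local text model. Prompt/paths are placeholders.
import json
from pathlib import Path

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def synthesize(transcript_summary: str, frame_notes_jsonl: Path,
               model: str = "nemotron-16k") -> str:
    # Each JSONL line is assumed to be one frame's vision note.
    notes = [json.loads(line)
             for line in frame_notes_jsonl.read_text().splitlines() if line.strip()]
    prompt = (
        "You are turning a demo/walkthrough video into a reusable skill.\n\n"
        f"Transcript summary:\n{transcript_summary}\n\n"
        f"Per-frame visual notes:\n{json.dumps(notes, indent=2)}\n\n"
        "Produce: (1) a numbered procedure, (2) common pitfalls, "
        "(3) 'if you see X screen, do Y' rules, (4) a reusable checklist."
    )
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```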
Status: I’ve already started pulling a local vision model (moondream) and I wrote a script to run an Ollama vision model over a set of frames and output JSONL (scripts/ollama_vision_frames.py). Next step is improving frame selection for demos (scene-change-based rather than slide-based).
Two quick questions so I optimize for your use:
- Do you want outputs optimized for (A) a written how-to / checklist or (B) a reusable agent skill (with workflow + guardrails)?
- For demos, is it more valuable to capture click-by-click UI steps (lots of frames) or milestone states (fewer frames, higher-level)?
[Telegram Vishal Sachdev (@svishal2001) id:5101503769 2026-01-15T04:18Z] B, milestone states [message_id: 519]
[[reply_to:519]] Great — “reusable agent skill” + milestone states is the cleanest setup.
Pipeline for demo/walkthrough videos:
- Transcript (captions via yt-dlp; Whisper fallback only if needed)
- Milestone frames (scene-change based, not time-based)
- Local vision pass (Ollama vision model) to label each milestone: “what screen/state is this + what changed + why it matters”
- Synthesize SKILL.md (local text model) using: transcript summary + milestone notes + timestamps
Instead of fps=1/15, we’ll use scene-change detection to pick key frames, e.g. ffmpeg’s select='gt(scene,THRESH)' and then cap to ~10–25 milestones for a ~1 hour video.
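As a sketch, the scene-change selection amounts to the following; 0.3 is an illustrative threshold (not a tuned value) and the helper name is hypothetical.

```python
# Sketch of scene-change keyframe extraction via ffmpeg's scene filter.
# threshold=0.3 is illustrative; real videos need per-video tuning.
import subprocess
from pathlib import Path

def extract_scene_frames(video: Path, out_dir: Path, threshold: float = 0.3) -> list[Path]:
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", str(video),
         # keep only frames whose scene-change score exceeds the threshold
         "-vf", f"select='gt(scene,{threshold})'",
         # emit frames as they are selected rather than at the source frame rate
         "-vsync", "vfr",
         str(out_dir / "%04d.jpg")],
        check=True,
    )
    return sorted(out_dir.glob("*.jpg"))
```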
Status so far:
- I'm pulling a local vision model in Ollama (`moondream`) so we can run this without hosted costs.
- I added a script to run an Ollama vision model over a directory of frames and emit JSONL: /Users/vishal/clawd/scripts/ollama_vision_frames.py
For milestone states, what’s your preferred target count per video?
- 10–15 (very high-level)
- 20–30 (still milestones, but more actionable)
Once you answer, I’ll bake that into the default extraction settings and we can re-run on the last video as a test.
[Telegram Vishal Sachdev (@svishal2001) id:5101503769 2026-01-15T04:20Z] 20-30 assuming we are only using local models, so we should not be worried about compute [message_id: 521]
[[reply_to:521]] Perfect — 20–30 milestones is a good default if we’re staying local.
I went ahead and set up + validated the local pipeline end-to-end:
- Installed local vision model (Ollama):
  - `moondream:latest` is now pulled and available.
- Milestone (scene-change) frame extraction:
  - I added a script that uses ffmpeg scene-change detection and then downsamples to a target count: /Users/vishal/clawd/scripts/extract_milestone_frames.sh (a rough sketch of the downsampling step follows this list)
  - Default target count can be 25; the threshold is tunable per video.
- Local vision pass over milestone frames (Ollama → JSONL):
  - Script: /Users/vishal/clawd/scripts/ollama_vision_frames.py
  - Output example from the last video:
    - Frames: tmp/milestones/NQ2MiVuuJ6A/selected/ (25 frames)
    - Vision notes: tmp/milestones/NQ2MiVuuJ6A/moondream_frames.jsonl (+ .md)
- Local synthesis into "skill-like" structure (text model):
  - Using `nemotron-16k` locally, I generated a structured draft from transcript summary + milestone notes: tmp/milestones/NQ2MiVuuJ6A/nemotron_synthesis.md
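As referenced above, the "downsample to a target count" step could be as simple as this sketch; the actual extract_milestone_frames.sh may implement it differently. It keeps an evenly spaced subset of the scene-change candidates.

```python
# Sketch of downsampling scene-change candidates to a target milestone count
# (e.g. 25). The real extract_milestone_frames.sh may do this differently.
from pathlib import Path

def downsample(frames: list[Path], target: int = 25) -> list[Path]:
    if len(frames) <= target:
        return frames
    # Pick indices spread evenly across the candidate list, keeping first and last.
    step = (len(frames) - 1) / (target - 1)
    return [frames[round(i * step)] for i in range(target)]
```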
Next improvement (to make it work better for non-slide demos): I’ll tweak the prompts for moondream to focus on UI state + “what changed” (not “slide description”), and I’ll add a small “cap to 20–30” default + automatic threshold adjustment when scene-change returns too few frames.
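The automatic threshold adjustment could look like this sketch, reusing `extract_scene_frames` from the earlier sketch; the threshold ladder and the minimum of 20 frames are illustrative defaults, not decided values.

```python
# Sketch of automatic threshold adjustment: if scene detection returns too few
# frames, retry with a lower threshold. Builds on extract_scene_frames above;
# the threshold ladder and min_frames default are illustrative.
from pathlib import Path

def extract_with_fallback(video: Path, out_dir: Path, min_frames: int = 20) -> list[Path]:
    thresholds = (0.4, 0.3, 0.2, 0.1)
    frames: list[Path] = []
    for i, threshold in enumerate(thresholds):
        frames = extract_scene_frames(video, out_dir, threshold)
        if len(frames) >= min_frames or i == len(thresholds) - 1:
            return frames  # enough milestones, or nothing lower left to try
        # Too few milestones at this threshold: clear and retry with a lower one.
        for f in frames:
            f.unlink()
    return frames
```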
If you send the next demo-style YouTube link you care about, I can run this full local pipeline and generate a new SKILL.md output from it.
[Telegram Vishal Sachdev (@svishal2001) id:5101503769 2026-01-15T04:36Z] can you look at our conversation history above since my message "can you explore a local video understanding model so we can be more carefree about usage. we have ollama available. Dont update from github, as if you have summarize cli, i am guessing clawdbot updates will trigger summarize cli updates" and write an article about how we improved. I am mindblown by your capabilities, and started writing a tweet "@clawdbot is awesome! I have been feeding it videos I like that would help improve my AI agent workflows by creating skills from the video. Have it working on a beefy m2 max with Ollama for local models and codex auth for gpt5.2 as the orchestrator/brain ...." . Use that seed to write the article and follow my other guidelines in the write article skill. Then create the twitter thread and give it to me here. publish a public gist of the conversation to link in the twitter thread. [message_id: 523]