I've been experimenting as much as I can with Gemini video understanding. Multiple videos, multi-step prompts, etc.
Analyze this YouTube video.
https://www.youtube.com/watch?v=PgyJs0jfp_o
- Pick three interesting segments that are 2 minutes long or shorter, for posting on social media.
- Summarize each segment and explain why it's interesting.
- Transcribe each segment.
I let Gemini Pro do its "automatic thinking budget" thing. Took about 90 seconds to start streaming the real (post-thinking) response.
Of course! Here are three interesting segments from the video, perfect for sharing on social media.
Timestamp: 01:35 - 02:51
Summary: Kwindla Hultman Kramer breaks down the fundamental architecture of a modern voice AI agent. Using a clear block diagram, he explains the step-by-step pipeline: user audio comes in, it's converted from speech to text, turn detection figures out when the user is done talking, the text is added to the conversation context, sent to an LLM, the LLM's text response is converted back to speech, and the resulting agent audio is sent back to the user. He emphasizes that this entire loop must be managed for every single turn in the conversation.
Why it's interesting: This is a fantastic "explainer" clip that demystifies how voice AI agents work. The block diagram provides a simple, visual guide to a complex process, making it highly valuable for developers, product managers, or anyone curious about the mechanics behind conversational AI. It’s a perfect foundational piece for understanding the rest of the conversation around voice AI.
Transcription: Um, so let's just talk a little bit about voice architecture, because if you haven't built voice agents before, it's worth understanding kind of what we do in a typical voice agent. This is the block diagram for a single conversation turn. We run in a loop because a voice conversation is multi-turn, and as you know from building with LLMs, LLMs are stateless, so you have to give the conversation history to the LLM each turn. Uh, you may do things like modify and summarize the conversation in flight. Those are slightly more advanced topics, which we won't cover today, but we can talk about them in the Q&A.
But the basic block diagram is audio comes in from the user. You turn that audio into text. Uh, and in parallel, you do turn detection to know when the user is done talking and you should trigger inference and have the voice agent respond. Uh, that user text needs to go into whatever context management logic you're using so that you're tracking it. Uh, the LLM inference comes out as text and goes into a text-to-speech or a voice generation model. That generates the audio that the user actually hears. Again, you have to put the text or the audio or both in your context management logic. And then you run that loop again when you get new user input. Every voice agent in production looks like this.
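That per-turn loop is simple enough to sketch in code. Here's a toy Python version of the architecture described above; every function in it is a stub standing in for a real speech-to-text, LLM, or text-to-speech service, and none of the names are real APIs:

```python
# Toy sketch of the single-conversation-turn loop described above. The three
# service calls are stubs standing in for real STT, LLM, and TTS providers;
# the names are placeholders, not actual library APIs.

def transcribe(audio: bytes) -> str:            # stand-in for speech-to-text
    return "What's the weather like today?"

def llm_complete(history: list[dict]) -> str:   # stand-in for LLM inference
    return "It's sunny and warm."

def synthesize(text: str) -> bytes:             # stand-in for text-to-speech
    return b"\x00" * 16000                      # pretend audio samples

def run_turn(user_audio: bytes, context: list[dict]) -> bytes:
    """One conversation turn: user audio in, agent audio out, context updated."""
    user_text = transcribe(user_audio)                             # speech-to-text
    context.append({"role": "user", "content": user_text})         # context management
    agent_text = llm_complete(context)                             # LLM gets the full history
    context.append({"role": "assistant", "content": agent_text})
    return synthesize(agent_text)                                  # voice generation

# Because the LLM is stateless, the full context list is passed in every turn.
# A real agent would also run turn detection to decide when to call run_turn().
context: list[dict] = []
agent_audio = run_turn(b"...user audio...", context)
```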
Timestamp: 08:01 - 09:37
Summary: In this segment, Kwindla is asked to clarify what "long context" means specifically for voice AI applications. He explains that it's a combination of two things: first, the easiest way to implement RAG (Retrieval-Augmented Generation) is to load a large amount of context at the beginning of the call, which immediately creates a long context. Second, many voice AI use cases, like a 30-minute customer support call, naturally involve many back-and-forth turns. Each turn adds to the conversation history, which cumulatively creates a very long context that the model must process, leading to unique challenges.
Why it's interesting: This clip provides a clear, practical answer to a nuanced technical question. It highlights that "long context" in voice AI isn't just about a single long user prompt, but about the accumulation of turns and upfront RAG data. This is a critical insight for anyone building robust conversational AI, as it directly impacts model performance, cost, and the user experience over the duration of a call. The interactive Q&A format also makes it very engaging.
Transcription: Shreya: Yeah, I have a question. Um, how long is long context? What does that mean in the voice case? Do you mean it's context that you're retrieving as part of RAG given some voice snippet, or like it's like people giving a soliloquy of five minutes and that becomes long context?
Kwindla Kramer: So it's two things. It's not usually a soliloquy, for most use cases, although there are some voice use cases where people talk and talk and talk, dictation style, and then you do something with it. But for most use cases, it's two things. One is the easiest way to do RAG for voice AI is to pull a bunch of stuff into the first part of the context at the very beginning of the conversation. So often you're adding, you know, 50,000 tokens or something at the beginning of the conversation, if you can do it, like if your evals survive it, right?
The other thing is that there are a lot of voice AI use cases where there's a 20-minute or a 30-minute conversation. Like think of a customer support call where you're on the phone a long time trying to get help in a customer support context. So the multi-turn, you're just adding, like if you do that naively, you are adding a user input and an LLM output message every single conversation turn.
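It's worth doing the arithmetic on that accumulation. Here's a back-of-the-envelope sketch: the 50,000-token RAG preamble is the figure from the talk, but the per-turn token counts and the turns-per-minute pace are my own guesses for illustration.

```python
# Back-of-the-envelope: how context grows over a long voice call when history
# is accumulated naively. The 50,000-token RAG preamble comes from the talk;
# the per-turn token sizes and pace are assumptions, not measurements.

RAG_PREAMBLE_TOKENS = 50_000    # documents loaded at the start of the call
USER_TOKENS_PER_TURN = 60       # assumed average transcribed user utterance
AGENT_TOKENS_PER_TURN = 120     # assumed average LLM response

def context_size(turns: int) -> int:
    """Tokens the model must process after the given number of turns."""
    return RAG_PREAMBLE_TOKENS + turns * (USER_TOKENS_PER_TURN + AGENT_TOKENS_PER_TURN)

# A 30-minute support call at roughly 4 turns per minute:
print(context_size(turns=30 * 4))   # 71,600 tokens by the end of the call
```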
Timestamp: 15:53 - 17:22
Summary: Kwindla demonstrates his "Do It Yourself" approach to collecting conversation data. He walks through the Python code he uses to save key metrics for each conversation turn—like session ID, start/end times, user and LLM text, and response times—into a simple SQLite database table. Additionally, the entire conversation is saved as a WAV file. He then uses a simple script to play back a specific turn from a previous conversation, revealing a surprising and amusing failure where the AI, when asked for a break suggestion, responds with "How about a quick stroll outside for some fresh air, or maybe just a five-minute dance party to your favorite song?"
Why it's interesting: This clip is a perfect, hands-on demonstration of the "look at your data" mantra. It shows how simple it can be to start collecting structured data and audio from voice conversations. The AI's unexpected and funny "dance party" suggestion is a memorable example of a hallucination or an unexpected model behavior that you would only catch by actually listening to the data. It’s a compelling case for why raw data analysis is indispensable for improving voice AI.
Transcription: So this was late last night. I was on my Linux box. I definitely had like a slightly weird sample rate thing going on with my input microphone. I mean, you know, I don't know. It was a local connection though, so it shouldn't have been a network issue. Um, but we'll come back to it. Uh, yeah, just making sure I have the right slides. Yes, okay.
Um, so let's jump into doing something similar to that, but more from scratch. There's more code here, so I won't show code, but you can look at 003-bot-sqlite.py for the code. It's probably another 100 lines of code to save exactly the data I wanted. And here's the list of things that I was like, oh, I want to save this stuff because this is the kind of thing I always need to look at, either in an individual debugging session, like what just, why did that bot just do that, or aggregates that I care about across a bunch of a bunch of sessions.
So like a unique session ID, the turn number in the conversation, the turn start time, the turned end time, the user text, the LLM text, the voice-to-voice response time, because latency matters so much, and whether the user interrupted the bot before it was done talking. So it's a pretty good basic set. Plus we'll save the entire wave file so that for any chunk of the conversation we can sort of play back the audio as we heard it. And up in the top right is the SQLite uh, database creation command I ran from from Bash to to set this up. Uh, and then I vibe-coded some "look at your data" scripts.
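Here's my guess at what that per-turn table might look like, based on the list of fields he reads out. The column names are mine, not necessarily the ones 003-bot-sqlite.py actually creates:

```python
# A guess at the per-turn SQLite table Kwindla describes; column names are
# mine and may not match what 003-bot-sqlite.py really uses.
import sqlite3

conn = sqlite3.connect("conversations.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS turns (
    session_id        TEXT,     -- unique ID for the whole conversation
    turn_number       INTEGER,  -- position of this turn within the session
    turn_start_time   REAL,     -- wall-clock time the turn started
    turn_end_time     REAL,     -- wall-clock time the turn ended
    user_text         TEXT,     -- transcribed user speech for the turn
    llm_text          TEXT,     -- the LLM's text response
    voice_to_voice_ms REAL,     -- latency from end of user speech to agent audio
    user_interrupted  INTEGER   -- 1 if the user cut the bot off mid-response
)
""")
conn.commit()

# A "look at your data" query in the same spirit: the ten slowest turns.
slowest = conn.execute(
    "SELECT session_id, turn_number, voice_to_voice_ms "
    "FROM turns ORDER BY voice_to_voice_ms DESC LIMIT 10"
).fetchall()
```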