ThursdAI Show Notes - January 22, 2026

Prepared by Wolfred 🐺


Inworld AI releases TTS-1.5, the fastest production-grade text-to-speech model that's 4x faster and 25x cheaper than competitors (X, Blog, Press Release)

  • Executive summary:

  • Inworld AI just dropped TTS-1.5 today, and this is a pretty significant release for anyone building voice-enabled applications. They're already sitting at number one on the Artificial Analysis TTS leaderboard based on blind user tests, and with this update they've achieved sub-250 millisecond latency for their Max model and under 130 milliseconds for Mini - that's 4 times faster than their previous generation. The quality improvements are substantial too: 30% more expressive speech and 40% lower word error rates, meaning fewer hallucinations and audio artifacts. But here's the kicker on pricing - at half a cent per minute for Mini and one cent for Max, they're claiming to be 25 times cheaper than the next best alternative. They've also expanded to 15 languages including Hindi, added enhanced voice cloning via API, and now offer on-premise deployment for enterprises with data residency requirements.

  • 10 factoids:

  • TTS-1.5 Max achieves P90 latency under 250ms, while Mini hits under 130ms - both representing a 4x speed improvement over previous generations

  • Ranked #1 on Artificial Analysis TTS Leaderboard based on blind comparisons by thousands of real users

  • 30% more expressive speech output compared to prior versions

  • 40% reduction in word error rate, minimizing hallucinations, cutoffs, and audio artifacts

  • Pricing is $0.005/minute (Mini) and $0.01/minute (Max) - equivalent to $5-10 per million characters, claimed to be 25x cheaper than competitors (see the quick cost arithmetic after this list)

  • Supports 15 languages including English, Spanish, French, Korean, Chinese, Hindi, Japanese, German, and more

  • Voice cloning now available via API with 5-15 seconds of audio for instant cloning, or professional fine-tuning for maximum fidelity

  • On-premise deployment now available for enterprises with compliance/data residency requirements

  • Already integrated with Layercode, LiveKit, NLX, Pipecat, Stream Vision Agents, Ultravox, Vapi, and Voximplant

  • Founded by a team from Google and DeepMind, backed by Lightspeed, Kleiner Perkins, and Stanford University
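
Quick sanity check on that pricing claim: the per-minute and per-million-characters figures line up if you assume a typical output rate of roughly 1,000 characters of text per minute of generated speech. That rate is our assumption for illustration, not a figure from Inworld's announcement.

```python
# Back-of-the-envelope check of the pricing equivalence in the factoid above.
# Assumption (not from the announcement): ~1,000 characters of text per
# minute of generated speech.
CHARS_PER_MINUTE = 1_000

def cost_per_million_chars(price_per_minute: float) -> float:
    """Convert a per-minute TTS price into a per-million-characters price."""
    minutes_per_million_chars = 1_000_000 / CHARS_PER_MINUTE
    return price_per_minute * minutes_per_million_chars

print(f"Mini: ${cost_per_million_chars(0.005):.2f} per 1M characters")  # ~$5
print(f"Max:  ${cost_per_million_chars(0.01):.2f} per 1M characters")   # ~$10
```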

Additional links: TTS Product Page, TTS Playground, Documentation, Artificial Analysis Leaderboard



Flashlabs releases Chroma 1.0, the world's first open-source real-time speech-to-speech dialogue model with voice cloning (X)

  • Executive summary:

  • Flashlabs just dropped Chroma 1.0, and this is a big one for the open source voice AI community. It's being billed as the world's first open-source, end-to-end, real-time speech-to-speech dialogue model that includes personalized voice cloning. We're talking sub-150 millisecond latency here, which means it can actually hold a natural conversation without those awkward pauses. The voice cloning is remarkably efficient, needing only a few seconds of audio to replicate someone's voice, and they're claiming a speaker similarity score of 0.817, which is actually about 11% better than human baseline. What's really impressive is that they packed all this into just 4 billion parameters; on top of that, the full weights and code are open source, and it's been optimized with SGLang for faster inference. This is going to be huge for anyone building voice assistants, real-time translation, or conversational AI applications.

  • 10 factoids:

  • First open-source end-to-end real-time speech-to-speech dialogue model with voice cloning

  • Sub-150ms end-to-end latency enables natural conversational flow

  • High-fidelity voice cloning requires only seconds of reference audio

  • Speaker similarity (SIM) score of 0.817 — 10.96% higher than human baseline (see the similarity sketch after this list)

  • Compact architecture with only 4 billion parameters

  • Fully open weights and code released

  • Optimized with SGLang for faster inference performance

  • Developed by Flashlabs

  • End-to-end architecture means no separate ASR/TTS pipeline needed

  • Enables personalized voice cloning for custom voice assistants and applications
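
For context on that 0.817 SIM number: speaker-similarity scores like this are usually computed as the cosine similarity between speaker embeddings of the reference audio and the cloned output. Below is a minimal, generic sketch of that metric; the embedding models named in the comment are an assumption about common practice, and this is not Flashlabs' evaluation code.

```python
import numpy as np

def speaker_similarity(ref_embedding: np.ndarray, gen_embedding: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings (higher = more similar)."""
    ref = ref_embedding / np.linalg.norm(ref_embedding)
    gen = gen_embedding / np.linalg.norm(gen_embedding)
    return float(np.dot(ref, gen))

# Toy example with random vectors standing in for real speaker embeddings.
# In practice these would come from a speaker-verification encoder (for
# example a WavLM- or ECAPA-based model), which is our assumption, not
# Flashlabs' stated evaluation setup.
rng = np.random.default_rng(0)
ref, gen = rng.normal(size=256), rng.normal(size=256)
print(f"SIM = {speaker_similarity(ref, gen):.3f}")
```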

Additional links: Paper, HuggingFace Model, GitHub Code (links from the announcement tweet)



Liquid AI releases LFM2.5-1.2B-Thinking, a reasoning model that runs entirely on-device in under 900MB of memory (X, Blog, Hugging Face)

  • Executive summary:

  • Liquid AI just dropped something pretty wild - a reasoning model called LFM2.5-1.2B-Thinking that can run entirely on your phone with less than 900 megabytes of memory. This is a 1.2 billion parameter model that generates internal thinking traces before producing answers, essentially doing the chain-of-thought reasoning we've seen in much larger models, but at edge-scale latency. The kicker? What required a data center two years ago now runs offline in your pocket. They've trained this thing on 28 trillion tokens and used multi-stage reinforcement learning to make it particularly good at tool use, math, and instruction following. On benchmarks, it's matching or beating Qwen3-1.7B despite being roughly 30% smaller, and it scores 87.96 on MATH-500 compared to the non-thinking version's 63.2. Launch partners include Qualcomm, AMD, Ollama, and others, so this is ready for deployment across phones, laptops, vehicles, and IoT devices on day one.

  • 10 factoids:

  • Runs in under 900MB of memory - fits on any modern smartphone (see the memory back-of-the-envelope after this list)

  • 1.2 billion parameters with 16 layers (10 LIV convolution blocks + 6 GQA blocks)

  • Trained on 28 trillion tokens (up from 10T in LFM2)

  • 32,768 token context length with robust long-context scaling

  • MATH-500 benchmark: 87.96% (jumped from 63.2% in the non-thinking Instruct version)

  • Decodes at 70 tok/s on Samsung Galaxy S25 Ultra CPU, up to 235 tok/s on AMD Ryzen AI Max 395

  • Uses GRPO-style reinforcement learning with curriculum training across math, reasoning, and tool use domains

  • Solved the "doom loop" problem (repetitive text patterns) - reduced from 15.74% to 0.36% through preference alignment

  • Open-weight with day-one support for llama.cpp, MLX, vLLM, ONNX, Ollama, and LM Studio

  • LFM2 family has crossed 6 million downloads on Hugging Face
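
Here's a back-of-the-envelope check on the sub-900MB figure. The quantization bit-widths below are our assumptions for illustration, not numbers from Liquid AI's post.

```python
# Rough sanity check of the <900MB on-device figure. The bit-widths are
# assumptions for illustration, not figures from Liquid AI's announcement.
PARAMS = 1.2e9

for name, bits_per_weight in [("fp16", 16), ("8-bit", 8), ("~4.5-bit (Q4_K_M-style)", 4.5)]:
    mb = PARAMS * bits_per_weight / 8 / 1e6
    print(f"{name:>24}: ~{mb:,.0f} MB of weights")

# fp16 would need ~2,400 MB and 8-bit ~1,200 MB, so a ~4- to 5-bit
# quantization (~700 MB of weights, plus runtime overhead and KV cache)
# is what makes the sub-900MB envelope plausible on a phone.
```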

Additional links:



Vercel launches skills.sh, an "npm for AI agents" that hit 20K installs within hours (X, Vercel Changelog, GitHub)

Date: January 20, 2026

  • Executive summary:

    • Vercel just dropped skills.sh, which is essentially a package manager for AI coding agents—think npm but for teaching your AI assistant best practices. You install a skill with one command like npx skills add vercel-labs/agent-skills, and suddenly your Claude Code, Cursor, Codex, Windsurf, or any of the 17+ supported agents knows 10 years' worth of React and Next.js optimization patterns, web design guidelines, and can even deploy to Vercel for you. The ecosystem exploded on launch—their top React skill hit over 26,000 installs, the announcement tweet got 125,000+ views, and major players like Stripe, Expo, and Remotion shipped their own skills within hours. This is Vercel doing what Vercel does: entering a market and immediately becoming the default.
  • 10 factoids:

    • The skills.sh leaderboard shows 200+ skills already live, with Vercel's React Best Practices skill leading at 26,000+ installs
    • Skills follow the Agent Skills specification, an open format originally developed by Anthropic
    • Supports 17+ AI coding agents: Claude Code, Cursor, Codex, GitHub Copilot, Windsurf, Clawdbot, Amp, Antigravity, Gemini, Goose, Kilo, Kiro, OpenCode, Roo, Trae, and more
    • The react-best-practices skill contains 40+ rules across 8 categories covering waterfalls, bundle size, server-side performance, and more
    • The web-design-guidelines skill includes 100+ rules covering accessibility, ARIA, forms, animation, typography, dark mode, and i18n
    • The vercel-deploy-claimable skill auto-detects 40+ frameworks and returns a preview URL plus a claimable URL for ownership transfer
    • Remotion's Jonny Burger created a whole video just by prompting Claude Code with Remotion's skill—the announcement video hit 147,000 views
    • Stripe shipped their own best practices skill within hours of the ecosystem launch
    • Skills are structured as folders with SKILL.md for instructions, a scripts directory for automation, and optional references (see the scaffold sketch after this list)
    • The ecosystem already includes skills from Expo (8 skills for React Native), Anthropic, Callstack, Trail of Bits (security), and many community contributors
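
To make the folder layout concrete, here's a minimal scaffold for the structure described above (SKILL.md plus a scripts directory and optional references). The SKILL.md front-matter fields are illustrative assumptions; check the Agent Skills spec linked below for the authoritative format.

```python
from pathlib import Path

# Minimal scaffold for the skill layout described in the factoid above.
# The front-matter fields in SKILL.md are illustrative assumptions, not
# the full Agent Skills specification.
SKILL_MD = """\
---
name: my-team-conventions
description: House rules the agent should follow when editing this repo.
---

# My team conventions

- Prefer server components; avoid client-side data waterfalls.
- Run the formatter before proposing a diff.
"""

def scaffold_skill(root: str = "my-team-conventions") -> None:
    skill = Path(root)
    (skill / "scripts").mkdir(parents=True, exist_ok=True)   # automation scripts
    (skill / "references").mkdir(exist_ok=True)              # optional references
    (skill / "SKILL.md").write_text(SKILL_MD)                # agent-facing instructions

if __name__ == "__main__":
    scaffold_skill()
```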

Additional links: Agent Skills Spec, skills.sh Directory, MarkTechPost Coverage, Medium: 20K installs in hours



Anthropic's Claude Code VS Code Extension Hits General Availability, Bringing Full Agentic Coding to the IDE (X, VS Code Marketplace, Docs)

  • Executive summary:

  • Anthropic has officially launched the general availability of their Claude Code VS Code extension, bringing the full power of their CLI-based agentic coding tool directly into the world's most popular code editor. The extension provides a native graphical interface that lets developers chat with Claude to build features, debug code, and navigate codebases - all without leaving VS Code. What makes this release significant is the feature parity with the CLI: you can now @-mention files for context (with fuzzy matching!), use familiar slash commands like /model, /mcp, and /context, and Claude can autonomously explore your codebase, write code, and run terminal commands with your permission. The extension supports powerful agentic features including subagents, custom slash commands, and MCP (Model Context Protocol) for connecting to external tools like GitHub, Jira, and Google Drive. It's available for Pro, Max, Team, and Enterprise subscribers, or via pay-as-you-go pricing.

  • 10 factoids:

  • The extension requires VS Code 1.98.0 or higher and works with Cursor as well

  • Three permission modes available: normal (asks each time), Plan mode (describes then waits for approval), and auto-accept (makes edits without asking)

  • @-mentions support fuzzy matching - type @auth to find auth.js, AuthService.ts, etc., and add trailing slash for folders like @src/components/

  • Claude can see your highlighted code automatically when you select text, and you can toggle this visibility on/off

  • Multiple conversations can run in parallel using "Open in New Tab" or "Open in New Window" - each with isolated context

  • Extension and CLI share conversation history - use claude --resume in terminal to continue any extension conversation

  • Supports third-party AI providers including Amazon Bedrock, Google Vertex AI, and Microsoft Foundry

  • MCP (Model Context Protocol) servers must be configured via CLI first, then become available in the extension

  • @terminal:name syntax lets you reference terminal output in prompts without copy-pasting error messages

  • A terminal-mode option exists for developers who prefer the CLI-style interface - toggle via settings

Additional links:



OpenAI announces ads coming to ChatGPT free and Go tiers in the US (X, OpenAI Blog)

  • Executive summary:

  • OpenAI is taking its first big step into the advertising business—in the coming weeks, they'll start testing ads in ChatGPT for logged-in adult users in the US on the free tier and the new $8/month ChatGPT Go plan. The company is being very careful to emphasize that ads won't influence ChatGPT's actual responses—they'll appear in clearly labeled boxes at the bottom of answers, completely separate from the conversation. They're promising not to sell user data to advertisers, not to show ads on sensitive topics like health, mental health, or politics, and not to serve ads to users under 18. Paid subscribers on Plus, Pro, Business, and Enterprise tiers won't see any ads at all. This move reflects the massive financial pressure OpenAI is under—burning through about $9 billion this year while only 5% of their 800 million weekly users actually pay for subscriptions.

  • 10 factoids:

  • Ads will only appear on free tier and the new $8/month ChatGPT Go subscription—Plus ($20/mo), Pro, Business, and Enterprise remain ad-free

  • Ads will be placed at the bottom of ChatGPT's answers in clearly labeled, separate boxes—never within the actual response text

  • OpenAI promises conversations are kept private from advertisers and user data is never sold to advertisers

  • Users can turn off ad personalization and clear data used for ads at any time without affecting other ChatGPT personalization features

  • Ads won't appear near sensitive or regulated topics including health, mental health, and politics

  • No ads will be shown to users under 18 (either self-reported or predicted by an age-prediction model OpenAI is rolling out)

  • OpenAI expects to burn through ~$9 billion this year while generating $13 billion in revenue

  • Only about 5% of ChatGPT's 800 million weekly active users currently pay for subscriptions (see the quick arithmetic after this list)

  • OpenAI has committed to spending about $1.4 trillion on massive data centers and chips for AI

  • CEO Sam Altman previously said in 2024 that he found the combination of ads and AI "uniquely unsettling"—this marks a notable shift from that position
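
Quick arithmetic on the figures above; everything here is derived from the numbers quoted in this item, not additional reporting.

```python
# Derived from the figures quoted in the factoids above.
weekly_users = 800e6
paying_share = 0.05
revenue = 13e9          # USD, expected this year
burn = 9e9              # USD, expected net loss this year

paying_users = weekly_users * paying_share          # ~40M paying users
implied_spend = revenue + burn                      # ~$22B total spend implied
revenue_per_payer = revenue / paying_users          # upper bound: ignores API revenue

print(f"paying users          ~{paying_users/1e6:.0f}M")
print(f"implied total spend   ~${implied_spend/1e9:.0f}B")
print(f"revenue per payer     ~${revenue_per_payer:.0f}/year (if all revenue were subscriptions)")
```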

Additional links:



Z.ai releases GLM-4.7-Flash, a 30B parameter MoE model that sets a new standard for lightweight local AI assistants (X, Technical Blog, HuggingFace)

  • Executive summary:

  • Zhipu AI, the Chinese AI company behind the ChatGLM family, just dropped GLM-4.7-Flash, which is a 30B parameter MoE (mixture of experts) model with only 3 billion active parameters - making it incredibly efficient for local deployment. This model is specifically designed to be your local coding and agentic assistant, and it's crushing benchmarks in its weight class. It scores 91.6% on AIME 2025, hits 59.2% on SWE-bench Verified (compared to just 22% for Qwen3-30B), and delivers 79.5% on τ²-Bench for agent tasks. The model supports vLLM and SGLang for local inference, offers a free API tier with 1 concurrency, and is particularly recommended for coding, creative writing, translation, long-context tasks, and roleplay. Weights are fully open on HuggingFace.

  • 10 factoids:

  • GLM-4.7-Flash is a 30B-A3B MoE model (30 billion total parameters, only 3 billion active at any time)

  • Achieves 91.6% on AIME 2025, essentially matching GPT-OSS-20B (91.7%) and crushing Qwen3-30B-A3B-Thinking (85%)

  • Scores 59.2% on SWE-bench Verified vs only 22% for comparable Qwen3-30B model - nearly 3x better at coding agents

  • Supports up to 131,072 max new tokens for generation

  • Free API available with 1 concurrency, plus FlashX tier for high-speed inference

  • Open weights available on HuggingFace and ModelScope for local deployment

  • Supports vLLM and SGLang inference frameworks with speculative decoding (see the vLLM sketch after this list)

  • Features "Preserved Thinking" mode that retains reasoning across multi-turn conversations for complex agentic tasks

  • Part of the GLM-4.7 family from Zhipu AI (Z.ai), which also includes full GLM-4.7 with 73.8% on SWE-bench Verified

  • Built by THUDM (Tsinghua University Data Mining group) and Zhipu AI, the team behind ChatGLM, CogView, and CogVideo
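
If you want to try the open weights locally, here's a minimal vLLM sketch (vLLM is one of the supported backends listed above). The Hugging Face model ID is an assumption based on Z.ai's usual naming, so check the actual model card before running.

```python
# Minimal local-inference sketch using vLLM. The model ID below is an
# assumption based on Z.ai's usual Hugging Face naming, not a confirmed repo.
from vllm import LLM, SamplingParams

llm = LLM(model="zai-org/GLM-4.7-Flash", trust_remote_code=True)
params = SamplingParams(temperature=0.6, max_tokens=1024)

outputs = llm.chat(
    [{"role": "user", "content": "Write a Python function that merges two sorted lists."}],
    sampling_params=params,
)
print(outputs[0].outputs[0].text)
```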

Additional links:



Overworld releases a real-time local-first diffusion world model that runs at 60fps on consumer hardware (X, Press Release)

  • Executive summary:

  • Overworld, formerly known as Wayfarer Labs, just dropped a research preview of their real-time diffusion world model that's all about creating playable, interactive AI-generated worlds that run entirely on your local machine. This is huge because unlike cloud-based solutions that need round-trips to data centers, their model runs at 60 frames per second with sub-20ms latency on consumer-grade GPUs - we're talking Chromebooks, gaming PCs, even console-class hardware. The company is positioning this as a new kind of interactive AI where you can shape adaptive worlds directly through human imagination - they describe it as "living worlds that behave more like lucid dreams than software." It's backed by a $4.5 million pre-seed round led by Kindred Ventures with angels like Logan Kilpatrick on board, and they're making the code open source on GitHub.

  • 10 factoids:

  • Runs at 60fps with sub-20ms latency entirely on local consumer GPUs - no cloud required (see the frame-budget sketch after this list)

  • Works on everything from Chromebooks to gaming PCs to console-class hardware

  • The company rebranded from Wayfarer Labs to Overworld for this release

  • Backed by $4.5 million pre-seed round led by Kindred Ventures

  • Notable angel investors include Logan Kilpatrick and senior leaders from Snowflake and Roblox

  • Founded by Louis Castricato (CEO, Brown University) and Shahbuland Matiana (Chief Science Officer)

  • Uses diffusion-based world models structured as continuous real-time systems that incorporate user input into every frame

  • Fully open source with code on GitHub including world_engine inference library and Biome desktop client

  • Supports full keyboard and mouse input, not just basic WASD controls

  • The model uses a DiT architecture with autoencoder, text encoder, and KV cache with optimized backends for Nvidia, AMD, and Apple Silicon
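
To put the 60fps / sub-20ms numbers in perspective: at 60 frames per second, each frame has roughly 16.7 milliseconds to ingest the latest user input, run the model, and present a frame. Here's a generic real-time loop sketch that illustrates that budget; it does not use Overworld's world_engine API.

```python
import time

# Generic real-time loop sketch (not Overworld's world_engine API): at 60 fps
# the model has a ~16.7 ms budget per frame to read input, run a denoising
# step (or a few), and hand a frame to the display.
TARGET_FPS = 60
FRAME_BUDGET = 1.0 / TARGET_FPS        # ~16.7 ms

def read_input():
    return {}                          # placeholder for keyboard/mouse state

def generate_frame(state, user_input):
    time.sleep(0.005)                  # stand-in for the diffusion step(s)
    return state

state = {}
for _ in range(120):                   # ~2 seconds of simulated frames
    start = time.perf_counter()
    state = generate_frame(state, read_input())
    elapsed = time.perf_counter() - start
    if elapsed > FRAME_BUDGET:
        print(f"frame overran budget by {(elapsed - FRAME_BUDGET) * 1000:.1f} ms")
    else:
        time.sleep(FRAME_BUDGET - elapsed)   # pace the loop to 60 fps
```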

Additional links:



Sakana AI introduces RePo, a new way for language models to dynamically reorganize their context for better attention (X, Paper, Website)

  • Executive summary:

  • Sakana AI, the Tokyo-based research lab, just dropped a pretty clever innovation called RePo - short for Context Re-Positioning. The core insight here is that current LLMs are stuck processing everything as one flat, rigid sequence of tokens, which makes it hard for them to handle noisy contexts, structured data like tables, or long documents where important info is far apart. RePo adds a lightweight learned module that assigns each token a real-valued position based on its meaning, not just its place in the sequence. This lets the model dynamically pull semantically related tokens closer together in "attention space" even if they're far apart in the actual text, and push irrelevant noise away. They trained it on OLMo-2 1B for 50 billion tokens and saw consistent gains - 11 points better on noisy context tasks, almost 2 points better on structured data, and strong improvements on long-context benchmarks up to 16K tokens even though it was only trained on 4K. The overhead is less than 1% compute, so it's basically free performance.

  • 10 factoids:

  • RePo stands for "Context Re-Positioning" - it lets models assign learned, real-valued positions to tokens based on semantics rather than fixed integer indices (see the code sketch after this list)

  • The technique is inspired by Cognitive Load Theory from human psychology - the idea that poor organization creates "extraneous load" that wastes our working memory

  • RePo improved noisy context performance by 11.04 points over standard RoPE positional encoding on the RULER benchmark

  • On structured data tasks (graphs and tables linearized into text), RePo beat RoPE by 1.94 exact match points

  • The model was trained on only 4K context length but extrapolated well to 8K and 16K tokens, beating baselines by at least 5.48 points on LongBench

  • RePo adds less than 1% compute overhead - it's essentially free performance

  • Analysis shows RePo allocates 15% more attention to "needle" tokens (distant but relevant) compared to standard RoPE

  • Different attention heads learn different position ranges - some specialize in local reorganization, others in global structure

  • The learned positions show non-linear patterns with plateaus and jumps, indicating the model discovers meaningful structure

  • The code, models, and an interactive demo are all open-source - you can visualize how tokens get repositioned in real-time
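
Here's a minimal PyTorch sketch of the core idea as described above: a small learned head maps each token's hidden state to a real-valued position, and those continuous positions drive the rotary embedding instead of fixed integer indices. This is a reconstruction from the summary, not Sakana AI's released code.

```python
import torch
import torch.nn as nn

class ContextRepositioner(nn.Module):
    """Sketch of the RePo idea from the summary above: learn a real-valued
    position per token from its hidden state, then feed those continuous
    positions into rotary embeddings. Not Sakana AI's implementation."""

    def __init__(self, d_model: int, head_dim: int, base: float = 10_000.0):
        super().__init__()
        self.pos_head = nn.Sequential(nn.Linear(d_model, d_model // 4),
                                      nn.GELU(),
                                      nn.Linear(d_model // 4, 1))
        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2) / head_dim))
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, hidden: torch.Tensor):
        # hidden: (batch, seq, d_model)
        seq_len = hidden.shape[1]
        integer_pos = torch.arange(seq_len, device=hidden.device, dtype=hidden.dtype)
        # A learned real-valued offset lets semantically related tokens move
        # closer together in attention space, and noise move further away.
        offset = self.pos_head(hidden).squeeze(-1)       # (batch, seq)
        positions = integer_pos + offset                 # continuous positions
        angles = positions[..., None] * self.inv_freq    # (batch, seq, head_dim/2)
        return angles.cos(), angles.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq, head_dim) for a single attention head
    x1, x2 = x[..., ::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., ::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Toy usage: rotate one head's queries with the learned continuous positions.
batch, seq, d_model, head_dim = 2, 8, 64, 32
hidden = torch.randn(batch, seq, d_model)
repo = ContextRepositioner(d_model, head_dim)
cos, sin = repo(hidden)
q = apply_rope(torch.randn(batch, seq, head_dim), cos, sin)
```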

Additional links:



Runway Gen-4.5

TL;DR

Runway released Gen-4.5, introducing a novel tiled input method that generates multi-panel videos from image grids. Also launched two new apps: Sketch Motion (draw animation paths on images) and Character Swap (insert yourself into scenes).

Executive Summary

Runway's Gen-4.5 represents a significant evolution in video generation, most notably through its innovative tiled input system. Users can submit 2x2, 3x3, or 4x4 image grids, and the model generates coherent multi-panel animations—a technique the community has been exploring for VFX-like motion effects.

Creative techniques emerging include "abusing whip pans" for controlled visual exploration and using panels individually for smoother animations, though consistency varies across generations. The model scores highly (92/100) for visuals and coherence in early reviews.

Alongside the model, Runway introduced Sketch Motion (draw animation paths directly on images) and Character Swap (insert users into scenes), with more apps promised next week. The platform is evolving toward an all-in-one node-based UI for tool integration.

Key Details

  • Tiled Input Method: Submit 2x2, 3x3, or 4x4 image grids for multi-panel video generation (see the grid-compositing sketch after this list)
  • Sketch Motion: New app for drawing animation paths on images
  • Character Swap: Insert yourself into generated scenes
  • No native audio: Community using MMAudio and external tools
  • VFX potential: Professional-grade motion from still images
  • Whip pan technique: Discovered for controlled exploration
  • Node-based UI: Platform positioning as all-in-one suite
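
Here's a generic Pillow helper for building the 2x2, 3x3, or 4x4 grids mentioned above; it just composites panels into a single tiled input image and does not call Runway's API.

```python
from PIL import Image

def make_tile_grid(panel_paths: list[str], grid: int = 2, tile_size: int = 512) -> Image.Image:
    """Composite grid*grid panel images into one square grid image (2x2, 3x3,
    or 4x4) for use as a tiled input. Generic Pillow helper, not Runway's API."""
    assert len(panel_paths) == grid * grid, "need exactly grid*grid panels"
    canvas = Image.new("RGB", (grid * tile_size, grid * tile_size))
    for i, path in enumerate(panel_paths):
        panel = Image.open(path).convert("RGB").resize((tile_size, tile_size))
        row, col = divmod(i, grid)
        canvas.paste(panel, (col * tile_size, row * tile_size))
    return canvas

# e.g. make_tile_grid(["a.png", "b.png", "c.png", "d.png"], grid=2).save("grid_2x2.png")
```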

Limitations

  • Occasional over-animation of grids
  • Consistency varies across generations
  • No native sound generation

10+ Factoids

  • Gen-4.5 accepts tiled image grids (2x2, 3x3, 4x4) as input
  • Multi-panel animations can be used for VFX-like motion effects
  • Sketch Motion lets you literally draw where things should move
  • Character Swap inserts your likeness into generated scenes
  • Community scoring it 92/100 for visuals and coherence
  • "Whip pan abuse" is becoming a creative technique
  • More apps coming next week per Runway
  • Detail retention from source images described as "robust"
  • Works well for professional-grade motion from stills
  • Stress tests include feathers, spaghetti physics, and complex scenarios
  • MMAudio being used for sound since native audio is absent
  • Node-based UI improvements positioning it as evolving platform

Community Links


End of show notes — 10 items
