Generative AI Model Research Plan

Priority: TOP — This informs schema design, skill prompts, and render pipeline.

Last Updated: 2025-12-28

Status: ✅ REFERENCES VALIDATED (cloud + on-device) — synthesis still pending


✅ Reference Library Re-Validation (Status)

Problem Identified (resolved for references/): Early drafts were created from web searches and aggregator sources and required systematic validation against official vendor documentation.

What exists:

  • 50 cloud model reference docs
  • 3 on-device model compilation docs (55+ models)
  • Synthesis documents in planning/synthesis/* (not yet revalidated)
  • Model ID audit (complete for models already documented; new models may appear over time)

What is now done (2025-12-28):

  • All existing references/** docs have been validated against official vendor documentation (cloud) or primary upstream sources on HuggingFace/GitHub (on-device).
  • Gaps (models that should exist in the library but do not yet have dedicated docs) are tracked in:
    • references/GAPS.md
    • references/MODEL-INVENTORY.md

Remaining risk: Any earlier inaccuracies may still exist in planning/synthesis/* until those documents are revalidated against the now-canonical reference docs.


Re-Validation Plan

Phase 1: Cloud Models (existing docs) — DONE (2025-12-28)

Each document requires an agent to complete the following steps (a scripted dispatch sketch follows the list):

  1. Fetch official vendor documentation
  2. Compare EVERY claim in the reference doc
  3. Verify prompting vocabulary matches official guidance
  4. Verify capabilities (resolution, duration, formats)
  5. Verify pricing (cross-check with MODEL-AUDIT.md)
  6. Verify API parameters and endpoints
  7. Update the reference doc with corrections
  8. Note any new features/capabilities not captured
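
A minimal sketch of how this dispatch loop could be scripted (all names and paths here are hypothetical, not an existing tool in this repo; the comparison work itself is done by the agents per the templates later in this doc):

```python
# Hypothetical Phase 1 dispatch sketch: builds one work order per reference doc.
from pathlib import Path

REFERENCE_ROOT = Path("references")

VALIDATION_STEPS = [
    "fetch official vendor documentation",
    "compare every claim in the reference doc",
    "verify prompting vocabulary against official guidance",
    "verify capabilities (resolution, duration, formats)",
    "verify pricing (cross-check MODEL-AUDIT.md)",
    "verify API parameters and endpoints",
    "update the reference doc with corrections",
    "note new features/capabilities not yet captured",
]

def work_order(doc_path: Path, primary_source: str) -> dict:
    """Assemble the brief one agent receives for one cloud reference doc."""
    return {
        "reference_doc": str(doc_path),
        "primary_source": primary_source,
        "steps": VALIDATION_STEPS,
    }

if __name__ == "__main__":
    for doc in sorted(REFERENCE_ROOT.glob("*/*.md")):
        order = work_order(doc, primary_source="<official docs URL>")
        print(order["reference_doc"], "->", len(order["steps"]), "steps")
```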

Phase 2: On-Device Models (3 compilation docs) — DONE (2025-12-28)

Similar validation against HuggingFace model cards and GitHub repos.

Phase 3: Synthesis Documents (7 documents) — TODO

After reference docs are validated, verify synthesis docs reflect corrected information.


Cloud Models in the Reference Library (as of 2025-12-28)

The canonical “what’s covered vs missing” list lives in:

  • references/MODEL-INVENTORY.md
  • references/GAPS.md

The tables below are a convenience snapshot for this plan doc.

Video Generation (15 documents)

| Document | Model | Provider | Primary Source | Status |
| --- | --- | --- | --- | --- |
| references/video/veo-3.md | Veo 3.1 | Google | cloud.google.com/vertex-ai/generative-ai/docs | COVERED |
| references/video/sora-2.md | Sora 2 | OpenAI | platform.openai.com/docs | COVERED |
| references/video/runway-gen4.5.md | Gen-4/4.5 | Runway | docs.dev.runwayml.com | COVERED |
| references/video/kling-2.1.md | Kling 2.1 | Kuaishou | klingai.com/global/dev | COVERED |
| references/video/luma-ray3.md | Ray2/Ray3 | Luma AI | docs.lumalabs.ai | COVERED |
| references/video/hailuo-02.md | Hailuo 02 | MiniMax | platform.minimaxi.com/docs/api-reference/video-generation-intro | COVERED |
| references/video/midjourney-video.md | Midjourney Video | Midjourney | docs.midjourney.com/docs/video | COVERED |
| references/video/seedance-1.5-pro.md | Seedance 1.5 Pro / 1.0 family | ByteDance (Volcengine Ark) | volcengine.com/docs/82379 | COVERED |
| references/video/pika-2.md | Pika 2.2 (via fal.ai) | Pika | fal.ai/models | COVERED |
| references/video/pixverse.md | PixVerse (v5.5) | PixVerse | docs.platform.pixverse.ai | COVERED |
| references/video/haiper-2.x.md | Haiper Video 2.x | Haiper | docs.haiper.ai/api-reference | COVERED |
| references/video/vidu.md | Vidu (viduq1 / 2.0 / 1.5) | Vidu | docs.platform.vidu.com | COVERED |
| references/video/firefly-video.md | Firefly Video (Generate Video API) | Adobe | developer.adobe.com/firefly-services/docs | COVERED |
| references/video/nova-reel.md | Nova Reel | AWS (Amazon Bedrock) | docs.aws.amazon.com/nova/latest/userguide | COVERED |
| references/video/alibaba-wan.md | Wan (Wan2.x / Wanx2.1 + VACE editing) | Alibaba Cloud (Model Studio / DashScope) | alibabacloud.com/help | COVERED |

Image Generation (17 documents)

| Document | Model | Provider | Primary Source | Status |
| --- | --- | --- | --- | --- |
| references/image/nano-banana-pro.md | Nano Banana / Nano Banana Pro | Google | ai.google.dev/gemini-api/docs/image-generation | COVERED |
| references/image/imagen-4.md | Imagen 4 | Google | ai.google.dev/gemini-api/docs/imagen | COVERED |
| references/image/flux-2.md | FLUX.2 | Black Forest Labs | docs.bfl.ai | COVERED |
| references/image/flux-kontext.md | FLUX.1 Kontext | Black Forest Labs | docs.bfl.ai/kontext | COVERED |
| references/image/gpt-image.md | GPT Image 1.5 | OpenAI | platform.openai.com/docs/guides/image-generation | COVERED |
| references/image/midjourney.md | Midjourney V7 | Midjourney | docs.midjourney.com | COVERED |
| references/image/ideogram-3.md | Ideogram 3.0 | Ideogram | developer.ideogram.ai | COVERED |
| references/image/seedream-4.md | Seedream 4.5 | ByteDance | docs.byteplus.com | COVERED |
| references/image/firefly-image.md | Firefly Image (API) | Adobe | developer.adobe.com/firefly-services/docs | COVERED |
| references/image/stability-image.md | Stable Image + SD 3.5 (API) | Stability AI | api.stability.ai/v2alpha/openapi | COVERED |
| references/image/nova-canvas.md | Nova Canvas | AWS (Amazon Bedrock) | docs.aws.amazon.com/nova/latest/userguide | COVERED |
| references/image/minimax-image.md | MiniMax Image Generation (image-01, image-01-live) | MiniMax | platform.minimaxi.com/docs/api-reference/image-generation-intro | COVERED |
| references/image/recraft.md | Recraft (Recraft API) | Recraft | recraft.ai/docs/api-reference | COVERED |
| references/image/leonardo.md | Leonardo (Image API) | Leonardo AI | docs.leonardo.ai/reference | COVERED |
| references/image/reve-image.md | Reve Image API (Create/Edit/Remix) | Reve | api.reve.com | COVERED |
| references/image/krea.md | Krea (Image/Video API) | Krea | docs.krea.ai/api-reference | COVERED |
| references/image/freepik-mystic.md | Freepik Mystic | Freepik | docs.freepik.com/api-reference | COVERED |

Audio Generation (18 documents)

| Document | Model | Provider | Primary Source | Status |
| --- | --- | --- | --- | --- |
| references/audio/elevenlabs.md | ElevenLabs TTS | ElevenLabs | elevenlabs.io/docs | COVERED |
| references/audio/eleven-music.md | Eleven Music | ElevenLabs | elevenlabs.io/docs | COVERED |
| references/audio/minimax-music.md | MiniMax Music 2.0 (music-2.0) | MiniMax | platform.minimaxi.com/docs/api-reference/music-intro | COVERED |
| references/audio/suno-v5.md | Suno v5 | Suno | help.suno.com | COVERED |
| references/audio/udio.md | Udio v1.5 | Udio | help.udio.com | COVERED |
| references/audio/openai-tts.md | OpenAI TTS | OpenAI | platform.openai.com/docs/guides/text-to-speech | COVERED |
| references/audio/fish-audio-openaudio-s1.md | OpenAudio S1 | Fish Audio | docs.fish.audio | COVERED |
| references/audio/cartesia-sonic.md | Sonic 3 | Cartesia | docs.cartesia.ai | COVERED |
| references/audio/playht.md | PlayHT | PlayHT | docs.play.ht | COVERED |
| references/audio/gemini-tts.md | Gemini Preview TTS | Google (Gemini API) | ai.google.dev/gemini-api/docs/speech-generation | COVERED |
| references/audio/minimax-speech.md | MiniMax Speech (T2A + Async + Voice Design/Cloning) | MiniMax | platform.minimaxi.com/docs/api-reference/speech-t2a-intro | COVERED |
| references/audio/google-cloud-tts.md | Google Cloud TTS | Google Cloud | cloud.google.com/text-to-speech | COVERED |
| references/audio/azure-tts.md | Azure TTS | Microsoft | learn.microsoft.com/azure/ai-services/speech-service | COVERED |
| references/audio/amazon-polly.md | Amazon Polly | AWS | docs.aws.amazon.com/polly | COVERED |
| references/audio/respeecher.md | Respeecher | Respeecher | docs.respeecher.com | COVERED |
| references/audio/stable-audio.md | Stable Audio 2 / 2.5 | Stability AI | api.stability.ai/v2alpha/openapi | COVERED |
| references/audio/lyria-2.md | Lyria 2 | Google | docs.cloud.google.com/vertex-ai/generative-ai/docs | COVERED |
| references/audio/lyria-realtime.md | Lyria RealTime | Google (Gemini API) | ai.google.dev/gemini-api/docs/music-generation | COVERED |

On-Device Models to Validate

| Document | Models | Primary Sources | Status |
| --- | --- | --- | --- |
| references/video/on-device-models.md | compilation doc | HuggingFace model cards, GitHub | COVERED |
| references/image/on-device-models.md | compilation doc | HuggingFace model cards, GitHub | COVERED |
| references/audio/on-device-models.md | compilation doc | HuggingFace model cards | COVERED |

Agent Prompts for Validation

Template: Cloud Model Validation Agent

```
**Task**: Validate the reference document for [MODEL] against official [VENDOR] documentation.

**Reference Document**: `references/[category]/[file].md`
**Primary Source**: [OFFICIAL_DOCS_URL]
**Secondary Sources**: [AGGREGATOR_URLS]

**Validation Checklist**:

1. **Model Identity**
   - [ ] Correct model name/version
   - [ ] Correct API model_id (cross-check MODEL-AUDIT.md)
   - [ ] Correct provider attribution

2. **Capabilities**
   - [ ] Resolution limits verified
   - [ ] Duration limits verified
   - [ ] Supported formats verified
   - [ ] Feature claims verified (audio support, text rendering, etc.)

3. **Pricing**
   - [ ] Current pricing verified
   - [ ] Pricing tiers/variants verified
   - [ ] Credit system (if applicable) verified

4. **API Documentation**
   - [ ] Endpoint format verified
   - [ ] Authentication method verified
   - [ ] Required parameters verified
   - [ ] Optional parameters verified
   - [ ] Response format verified

5. **Prompting Guide**
   - [ ] Camera movement vocabulary verified (video)
   - [ ] Style/aesthetic terminology verified (image)
   - [ ] Voice/emotion controls verified (audio)
   - [ ] Best practices match official guidance
   - [ ] Example prompts verified

6. **Limitations**
   - [ ] Known limitations documented
   - [ ] Rate limits documented
   - [ ] Content restrictions documented

**Output**:
- List of CONFIRMED items (with evidence links)
- List of CORRECTIONS needed (with correct information and evidence)
- List of ADDITIONS (new features/capabilities not in current doc)
- Updated reference document content

**Quality Bar**:
- Every claim must have evidence from official source
- No "seems" or "probably" - use UNKNOWN if unverifiable
- Preserve document structure, only update content
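
To keep that Output contract machine-checkable when many agents report back, the findings could be collected as structured records. A minimal sketch (field names are assumptions, not an existing schema in this repo):

```python
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class Finding:
    claim: str                      # the exact claim from the reference doc
    status: Literal["CONFIRMED", "CORRECTION", "ADDITION", "UNKNOWN"]
    evidence_url: str = ""          # official source; required unless UNKNOWN
    corrected_value: str = ""       # filled for CORRECTION/ADDITION only

@dataclass
class ValidationReport:
    reference_doc: str
    findings: list[Finding] = field(default_factory=list)

    def quality_bar_violations(self) -> list[Finding]:
        # Quality bar: every non-UNKNOWN claim must carry evidence.
        return [f for f in self.findings
                if f.status != "UNKNOWN" and not f.evidence_url]
```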

Specific Agent Prompts

Video: Veo 3.1

Validate `references/video/veo-3.md` against:
- https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/veo-video-generation
- https://cloud.google.com/vertex-ai/generative-ai/docs/models/veo/3-1-generate
- https://ai.google.dev/gemini-api/docs

Focus areas:
- Timestamp prompting format (is [00:00-00:03] correct? a checker sketch follows the list)
- Audio generation capabilities
- Camera movement vocabulary (what terms does Google recommend?)
- Resolution/duration limits
- Pricing per second
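
For the timestamp question, a small self-contained check the agent could run over the doc's example prompts; the bracketed [MM:SS-MM:SS] pattern is the format under review here, not confirmed Veo syntax:

```python
import re

# Matches bracketed segments like [00:00-00:03]; whether Veo 3.1 actually
# accepts this format is exactly what the agent must confirm in the docs.
TIMESTAMP_SEGMENT = re.compile(r"\[(\d{2}):(\d{2})-(\d{2}):(\d{2})\]")

def extract_segments(prompt: str) -> list[tuple[int, int]]:
    """Return (start_s, end_s) pairs for each bracketed segment in a prompt."""
    segments = []
    for m in TIMESTAMP_SEGMENT.finditer(prompt):
        start = int(m.group(1)) * 60 + int(m.group(2))
        end = int(m.group(3)) * 60 + int(m.group(4))
        segments.append((start, end))
    return segments

assert extract_segments("[00:00-00:03] aerial shot, [00:03-00:08] slow dolly-in") \
    == [(0, 3), (3, 8)]
```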

Video: Sora 2

Validate `references/video/sora-2.md` against:
- https://platform.openai.com/docs/models/sora-2
- https://platform.openai.com/docs/models/sora-2-pro
- https://platform.openai.com/docs/api-reference/videos

Focus areas:
- Multi-scene capabilities
- Duration limits (sora-2 vs sora-2-pro)
- Resolution options
- Prompt structure recommendations
- Credit/pricing system

Video: Runway Gen-4

Validate `references/video/runway-gen4.5.md` against:
- https://docs.dev.runwayml.com/guides/models/
- https://docs.dev.runwayml.com/guides/pricing/

Focus areas:
- Gen-4 vs Gen-4.5 availability (Gen-4.5 API not yet available per audit)
- Motion Brush documentation
- Camera control parameters
- Credit system

Video: Kling 2.1

Validate `references/video/kling-2.1.md` against:
- https://klingai.com/global/dev
- https://app.klingai.com/global/dev/document-api

Focus areas:
- Model tiers (standard/pro/master)
- Lip-sync capabilities
- Camera movement vocabulary
- Duration limits per tier
- Pricing structure

Video: Luma Ray

Validate `references/video/luma-ray3.md` against:
- https://docs.lumalabs.ai/docs/api
- https://lumalabs.ai/learning-hub

Focus areas:
- Ray2 vs Ray3 availability (Ray3 API not yet available per audit)
- HDR capabilities
- Draft mode documentation
- Credit system

Video: Hailuo 02

Validate `references/video/hailuo-02.md` against:
- https://platform.minimaxi.com/docs/api-reference/video-generation-intro

Focus areas:
- Model variants (02 vs 2.3 vs 2.3-Fast)
- Resolution/duration options
- Pricing per resolution tier

Image: Nano Banana / Nano Banana Pro (Gemini native image generation)

Validate `references/image/nano-banana-pro.md` against:
- https://ai.google.dev/gemini-api/docs/nanobanana
- https://ai.google.dev/gemini-api/docs/image-generation
- https://ai.google.dev/gemini-api/docs/pricing

Focus areas:
- Correct model IDs (`gemini-2.5-flash-image`, `gemini-3-pro-image-preview`); a call sketch follows the list
- Token/pricing tables and image-size token costs
- 4K output + “Thinking” + thought signatures behavior (Pro)
- Prompting vocabulary + official prompt templates
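
If a live smoke test helps confirm those model IDs, a minimal sketch, assuming the google-genai Python SDK; the exact call shape should itself be verified against the image-generation page above:

```python
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-2.5-flash-image",
    contents="A photorealistic banana-shaped spaceship over a city at dusk",
)

# Image bytes come back as inline_data parts alongside any text parts.
for i, part in enumerate(response.candidates[0].content.parts):
    if part.inline_data is not None:
        with open(f"nano-banana-{i}.png", "wb") as f:
            f.write(part.inline_data.data)
```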

Image: Imagen 4

Validate `references/image/imagen-4.md` against:
- https://ai.google.dev/gemini-api/docs/imagen
- https://cloud.google.com/vertex-ai/generative-ai/docs/models/imagen/4-0-generate
- https://cloud.google.com/vertex-ai/generative-ai/pricing

Focus areas:
- Model variants (fast/standard/ultra) and IDs
- Pricing (Gemini API vs Vertex AI pricing surfaces)
- Aspect ratio + output size constraints
- Prompting guidance (official)

Image: FLUX.2

Validate `references/image/flux-2.md` against:
- https://docs.bfl.ai/quick_start/generating_images
- https://docs.bfl.ai/flux_2/flux2_overview
- https://bfl.ai/pricing

Focus areas:
- All FLUX.2 variants (pro/max/flex/dev)
- Endpoint-based API (not model_id based)
- Text rendering capabilities
- Pricing per megapixel

Image: GPT Image 1.5

Validate `references/image/gpt-image.md` against:
- https://platform.openai.com/docs/models/gpt-image-1.5
- https://platform.openai.com/docs/guides/image-generation

Focus areas:
- Model versions (1.5 vs 1 vs 1-mini)
- Token-based pricing
- Quality tiers
- Text rendering accuracy

Image: Midjourney V7

Validate `references/image/midjourney.md` against:
- https://docs.midjourney.com

Focus areas:
- V7 capabilities
- API availability (still no public API?)
- Parameter syntax (--ar, --stylize, etc.)
- Style reference system

Image: Ideogram 3.0

Validate `references/image/ideogram-3.md` against:
- https://developer.ideogram.ai/api-reference/api-reference/generate-v3
- https://ideogram.ai/features/3.0

Focus areas:
- Version 3.0 features
- Text rendering accuracy claims
- Style Codes feature
- API endpoint format

Image: Seedream 4.5

Validate `references/image/seedream-4.md` against:
- https://docs.byteplus.com/en/docs/ModelArk
- https://seed.bytedance.com/en/seedream4_5

Focus areas:
- API availability (via BytePlus ModelArk)
- Multi-reference fusion capabilities
- Speed benchmarks
- Pricing

Audio: ElevenLabs

Validate `references/audio/elevenlabs.md` against:
- https://elevenlabs.io/docs/overview/models
- https://elevenlabs.io/docs/api-reference

Focus areas:
- Model IDs (eleven_v3, eleven_multilingual_v2, etc.; underscores, not hyphens; a lint sketch follows the list)
- Voice cloning requirements
- Stability/similarity controls
- Pricing per character
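
Since the underscore convention is an easy thing to regress on, a tiny lint the agent could run over the doc text; the hyphenated pattern is illustrative, not an observed ElevenLabs ID format:

```python
import re

# ElevenLabs model IDs use underscores (eleven_v3, eleven_multilingual_v2);
# flag any hyphenated lookalikes that slipped into the reference doc.
HYPHENATED_ID = re.compile(r"\beleven-[a-z0-9-]+\b")

def lint_model_ids(doc_text: str) -> list[str]:
    return HYPHENATED_ID.findall(doc_text)

assert lint_model_ids("use eleven_multilingual_v2, not eleven-multilingual-v2") \
    == ["eleven-multilingual-v2"]
```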

Audio: Suno v5

Validate `references/audio/suno-v5.md` against:
- https://help.suno.com
- https://suno.com

Focus areas:
- v5 capabilities vs v4
- NO official API (only third-party wrappers)
- Song duration limits
- Lyric formatting

Audio: Udio v1.5

Validate `references/audio/udio.md` against:
- https://help.udio.com
- https://www.udio.com/blog

Focus areas:
- v1.5 and v1.5 Allegro differences
- NO official API (Udio explicitly states this)
- Stem separation features
- Key control

Audio: OpenAI TTS

Validate `references/audio/openai-tts.md` against:
- https://platform.openai.com/docs/guides/text-to-speech
- https://platform.openai.com/docs/api-reference/audio

Focus areas:
- Model IDs (tts-1, tts-1-hd, gpt-4o-mini-tts)
- Voice options
- Instructions support (gpt-4o-mini-tts only; a call sketch follows the list)
- Pricing structure
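
A minimal call sketch for spot-checking the instructions claim, assuming the official openai Python SDK; parameters should still be verified against the API reference above:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Per the focus areas, `instructions` is claimed to work only with
# gpt-4o-mini-tts; the agent should confirm how tts-1/tts-1-hd handle it.
speech = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input="Reference library validation complete.",
    instructions="Speak in a calm, neutral tone.",
)
speech.write_to_file("validation-check.mp3")
```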

Audio: Fish Audio S1

Validate `references/audio/fish-audio-openaudio-s1.md` against:
- https://docs.fish.audio/api-reference/endpoint/openapi-v1/text-to-speech
- https://docs.fish.audio/developer-guide/models-pricing

Focus areas:
- Model ID is just "s1" in API
- Pricing per UTF-8 bytes
- Emotion control capabilities
- Voice cloning

Audio: Cartesia Sonic

Validate `references/audio/cartesia-sonic.md` against:
- https://docs.cartesia.ai/build-with-cartesia/tts-models
- https://cartesia.ai/pricing

Focus areas:
- Sonic-3 vs Sonic-2 vs Sonic-turbo
- Date-stamped version snapshots
- State Space Models claims
- Latency benchmarks

On-Device Model Validation (55 Agents, One Per Model)

On-Device Agent Template

```
**Task**: Validate on-device model [MODEL] against HuggingFace/GitHub.

**Sources**:
- HuggingFace model card: [HF_URL]
- GitHub repo: [GITHUB_URL]

**MANDATORY Validation Checklist**:

1. **Hardware Requirements**
   - [ ] Minimum VRAM verified
   - [ ] Recommended VRAM verified
   - [ ] RAM requirements verified

2. **Mac Compatibility** (CRITICAL: the target user works on a MacBook)
   - [ ] MPS (Metal) support: YES/NO/PARTIAL
   - [ ] Apple Silicon (M1/M2/M3/M4) tested: YES/NO/UNKNOWN
   - [ ] Mac-specific installation steps documented
   - [ ] Mac performance benchmarks if available
   - [ ] Known Mac limitations or issues

3. **License**
   - [ ] License type verified
   - [ ] Commercial use allowed: YES/NO/CONDITIONAL
   - [ ] Revenue limits (if any)

4. **Model Specs**
   - [ ] Parameter count verified
   - [ ] Current version/release date
   - [ ] Output specs (resolution, duration, quality)

5. **Quality Claims**
   - [ ] Benchmark scores verified with source
   - [ ] Comparison claims verified

**Output**: Corrections + Mac compatibility assessment
```
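
So the 55 Mac assessments aggregate cleanly, each agent's verdict could be captured as a structured record. A hedged sketch (field names are assumptions, not an existing schema):

```python
from dataclasses import dataclass
from typing import Literal

TriState = Literal["YES", "NO", "PARTIAL", "UNKNOWN"]

@dataclass
class MacAssessment:
    model: str
    mps_support: TriState            # Metal (MPS) backend support
    apple_silicon_tested: TriState   # M1/M2/M3/M4 evidence on HF/GitHub
    install_steps_documented: bool
    known_limitations: list[str]
    evidence_urls: list[str]

def usable_on_mac(a: MacAssessment) -> bool:
    # Conservative gate: PARTIAL counts only with documented install steps.
    if a.mps_support == "YES":
        return True
    return a.mps_support == "PARTIAL" and a.install_steps_documented
```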

Video On-Device (13 agents)

| # | Model | HuggingFace/GitHub | Focus |
| --- | --- | --- | --- |
| 1 | HunyuanVideo 1.5 | tencent/HunyuanVideo-1.5 | GGUF options, VRAM, SSTA claims |
| 2 | Wan2.1/2.2 | Wan-AI/Wan2.1-T2V-14B, Wan-AI/Wan2.2-TI2V-5B | MoE architecture, Apache 2.0 |
| 3 | LTX-Video | Lightricks/LTX-Video | MPS support, speed claims |
| 4 | CogVideoX | THUDM/CogVideoX-5b, THUDM/CogVideoX-2b | Quantization, Mac support |
| 5 | Mochi 1 | genmo/mochi-1-preview | VRAM requirements, ComfyUI |
| 6 | Stable Video Diffusion | stabilityai/stable-video-diffusion-img2vid-xt | License, optimizations |
| 7 | Open-Sora 2.0 | hpcaitech/Open-Sora | VRAM, output specs |
| 8 | Open-Sora Plan | PKU-YuanGroup/Open-Sora-Plan | v1.5 capabilities |
| 9 | AnimateDiff | guoyww/AnimateDiff | VRAM by config, SDXL support |
| 10 | SkyReels V1 | SkyworkAI/SkyReels-V1 | Human-centric features, VBench |
| 11 | Pyramid Flow | rain1011/pyramid-flow-sd3 | MIT license, Mac support |
| 12 | Kandinsky 5.0 | kandinskylab/Kandinsky-5.0-T2V-Lite | 10s video, attention engines |
| 13 | Step-Video | stepfun-ai/Step-Video-T2V | 30B params, multi-GPU |

Image On-Device (18 agents)

| # | Model | HuggingFace | Focus |
| --- | --- | --- | --- |
| 14 | SD 1.5 | runwayml/stable-diffusion-v1-5 | License, ecosystem |
| 15 | SDXL | stabilityai/stable-diffusion-xl-base-1.0 | License terms, refiner |
| 16 | SDXL Turbo | stabilityai/sdxl-turbo | Steps, resolution limits |
| 17 | SDXL Lightning | ByteDance | 2-8 step quality |
| 18 | SD 3.5 Medium | stabilityai/stable-diffusion-3.5-medium | License (<$1M), VRAM |
| 19 | SD 3.5 Large | stabilityai/stable-diffusion-3.5-large | Quantization options |
| 20 | FLUX.1 Schnell | black-forest-labs/FLUX.1-schnell | Apache 2.0, NF4 options |
| 21 | FLUX.1 Dev | black-forest-labs/FLUX.1-dev | Non-commercial terms |
| 22 | FLUX.2 Dev | black-forest-labs/FLUX.2-dev | 32B params, consumer viability |
| 23 | Stable Cascade | stabilityai/stable-cascade | 3-stage architecture |
| 24 | PixArt-Sigma | PixArt-alpha/PixArt-Sigma-XL-2-1024-MS | DiT architecture, 4K |
| 25 | HiDream-I1 | HiDream.ai | 17B params, GGUF variants |
| 26 | Z-Image Turbo | Tongyi-MAI/Z-Image-Turbo | #1 leaderboard, bilingual |
| 27 | Kolors | Kwai-Kolors/Kolors | Commercial registration |
| 28 | Playground v2.5 | playgroundai/playground-v2.5-1024px-aesthetic | Open vs v3 closed |
| 29 | HunyuanDiT | Tencent | OpenVINO, Chinese |
| 30 | DeepFloyd IF | DeepFloyd/IF-I-XL-v1.0 | Text rendering, VRAM |
| 31 | Kandinsky 5.0 Lite | kandinskylab/kandinsky-5.0-image-lite | Multi-modal family |

Audio TTS On-Device (17 agents)

| # | Model | HuggingFace/GitHub | Focus |
| --- | --- | --- | --- |
| 32 | Chatterbox | ResembleAI/chatterbox | MIT, emotion control, 63.8% pref |
| 33 | Fish Speech/OpenAudio S1 | fishaudio/fish-speech | CC-BY-NC, #1 TTS-Arena |
| 34 | CosyVoice2 | FunAudioLLM/CosyVoice2-0.5B | Apache 2.0, streaming |
| 35 | Kokoro-82M | hexgrad/Kokoro-82M | Apache 2.0, 82M params |
| 36 | F5-TTS | SWivid/F5-TTS | CC-BY-NC weights |
| 37 | IndexTTS-2 | index-tts/index-tts | Duration control |
| 38 | XTTS v2 | coqui/XTTS-v2 | Coqui license, 17 langs |
| 39 | StyleTTS2 | yl4579/StyleTTS2 | MIT, human-level |
| 40 | GPT-SoVITS | RVC-Boss/GPT-SoVITS | MIT, singing support |
| 41 | Bark | suno/bark | MIT, sound effects |
| 42 | OpenVoice v2 | myshell-ai/OpenVoiceV2 | MIT, lightweight |
| 43 | Piper | rhasspy/piper | MIT, CPU-only |
| 44 | Tortoise TTS | neonbjb/tortoise-tts | Apache 2.0, slow |
| 45 | WhisperSpeech | WhisperSpeech/WhisperSpeech | Apache 2.0/MIT |
| 46 | MaskGCT | Amphion | ICLR 2025, 6 langs |
| 47 | OuteTTS | edwko/OuteTTS | MIT, llama.cpp |
| 48 | Spark-TTS | SparkAudio/Spark-TTS-0.5B | CC-BY-NC-SA |

Audio Music On-Device (7 agents)

| # | Model | HuggingFace/GitHub | Focus |
| --- | --- | --- | --- |
| 49 | ACE-Step | ACE-Step/ACE-Step-v1-3.5B | Apache 2.0, 4min songs |
| 50 | YuE | multimodal-art-projection/YuE | Apache 2.0, 5min |
| 51 | DiffRhythm | ASLP-lab/DiffRhythm | Apache 2.0, 4m45s |
| 52 | MusicGen | facebook/musicgen-large | CC-BY-NC, variants |
| 53 | Stable Audio Open | stabilityai/stable-audio-open-1.0 | <$1M license |
| 54 | Riffusion | riffusion/riffusion-model-v1 | MIT, spectrograms |
| 55 | Magenta RT | Google | Open weights, real-time |

Execution Plan

Session: Full Library Re-Validation (75 Agents Total)

Phase 1: Cloud Models (20 Opus agents, parallel)

  • 7 video model agents
  • 7 image model agents
  • 6 audio model agents (4 TTS + 2 music)
  • Each validates against official vendor docs
  • Returns: corrections, updated content, evidence links

Phase 2: On-Device Models (55 agents, parallel batches)

  • 13 video model agents
  • 18 image model agents
  • 17 TTS model agents
  • 7 music model agents
  • Each validates against HuggingFace + GitHub
  • CRITICAL: Mac compatibility verification for each model

Phase 3: Merge & Update

  • Merge all corrections into reference docs
  • Update 3 on-device compilation docs with per-model corrections
  • Cross-check against MODEL-AUDIT.md

Phase 4: Synthesis Update

  • Update PROMPT-VOCABULARY.md with verified terminology
  • Update comparison docs with verified capabilities
  • Update COST-OPTIMIZATION.md with verified pricing

Phase 5: Finalize

  • Mark all documents as validated
  • Update CONTINUITY.md
  • Layer 2 truly complete

Agent Summary

| Category | Cloud Agents | On-Device Agents | Total |
| --- | --- | --- | --- |
| Video | 7 | 13 | 20 |
| Image | 7 | 18 | 25 |
| Audio (TTS) | 4 | 17 | 21 |
| Audio (Music) | 2 | 7 | 9 |
| Total | 20 | 55 | 75 |

Success Criteria

  • All 20 cloud model docs validated against official sources
  • All 55 on-device models validated against HuggingFace/GitHub
  • Mac compatibility verified for every on-device model
  • Every prompting guide verified against vendor recommendations
  • Every capability claim has evidence link
  • MODEL-AUDIT.md corrections applied to reference docs
  • Synthesis docs updated to reflect corrected information
  • CONTINUITY.md updated with completion status

Files Structure

```
references/
├── README.md              # Library index (needs status update)
├── GLOSSARY.md            # Terms and conventions
├── GAPS.md                # Known gaps
├── VALIDATION-REPORT.md   # Accuracy verification (needs update)
├── video/
│   ├── README.md
│   ├── veo-3.md           # NEEDS REVIEW
│   ├── sora-2.md          # NEEDS REVIEW
│   ├── runway-gen4.5.md   # NEEDS REVIEW
│   ├── kling-2.1.md       # NEEDS REVIEW
│   ├── luma-ray3.md       # NEEDS REVIEW
│   ├── hailuo-02.md       # NEEDS REVIEW
│   ├── midjourney-video.md # NEEDS REVIEW
│   └── on-device-models.md # NEEDS REVIEW
├── image/
│   ├── README.md
│   ├── nano-banana-pro.md # NEEDS REVIEW
│   ├── imagen-4.md        # NEEDS REVIEW
│   ├── flux-2.md          # NEEDS REVIEW
│   ├── gpt-image.md       # NEEDS REVIEW
│   ├── midjourney.md      # NEEDS REVIEW
│   ├── ideogram-3.md      # NEEDS REVIEW
│   ├── seedream-4.md      # NEEDS REVIEW
│   └── on-device-models.md # NEEDS REVIEW
└── audio/
    ├── README.md
    ├── elevenlabs.md      # NEEDS REVIEW
    ├── suno-v5.md         # NEEDS REVIEW
    ├── udio.md            # NEEDS REVIEW
    ├── openai-tts.md      # NEEDS REVIEW
    ├── fish-audio-openaudio-s1.md # NEEDS REVIEW
    ├── cartesia-sonic.md  # NEEDS REVIEW
    └── on-device-models.md # NEEDS REVIEW

planning/synthesis/
├── MODEL-AUDIT.md         # COMPLETE (model IDs verified)
├── VIDEO-COMPARISON.md    # NEEDS UPDATE after validation
├── IMAGE-COMPARISON.md    # NEEDS UPDATE after validation
├── AUDIO-COMPARISON.md    # NEEDS UPDATE after validation
├── PROMPT-VOCABULARY.md   # NEEDS UPDATE after validation
├── COST-OPTIMIZATION.md   # NEEDS UPDATE after validation
├── SCHEMA-RECOMMENDATIONS.md # NEEDS UPDATE after validation
└── INTEGRATION-PATTERNS.md # NEEDS UPDATE after validation
```
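
For Phase 5's "mark all documents as validated" step, a small sketch that walks this tree and flags anything still carrying a review marker; the marker strings are assumptions about how status gets recorded inside the files:

```python
from pathlib import Path

REVIEW_MARKERS = ("NEEDS REVIEW", "NEEDS UPDATE")  # assumed status strings

def unvalidated_docs(root: str = "references") -> list[Path]:
    """Return markdown docs whose contents still carry a review marker."""
    flagged = []
    for doc in Path(root).rglob("*.md"):
        text = doc.read_text(encoding="utf-8", errors="ignore")
        if any(marker in text for marker in REVIEW_MARKERS):
            flagged.append(doc)
    return sorted(flagged)

if __name__ == "__main__":
    for doc in unvalidated_docs():
        print("still pending:", doc)
```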

This research plan was updated 2025-12-27 to require full library re-validation before Layer 2 can be considered complete.
