
Proposal: Orchestrated Code Creation — A Framework for AI-Driven Next.js/React Development

Author: Michael Villander · Team: Design Team Vega · Status: Draft · Date: 2026-02-27


Summary

Teams across G2i are already using AI to write code and shipping fast to meet customer needs. But too often the code is getting worse, not better: people vibe-code with AI, generating large volumes of code without really understanding it, and the mess passes review because we don't have good systems to catch problems before a human has to look at everything.

This proposal is about building a review pipeline that actually works: five AI agents, each specialized in a different lane, running on every pull request. They catch the obvious problems so engineers can focus on the real decisions instead of being the last line of defense against bad code.

"You're not just typing code anymore, you're orchestrating code creation." — Addy Osmani, React Summit 2025


Problem

When AI code turns out badly, people blame the model. But most of the time it's not the model. It's that someone gave vague instructions, no context, and no rules to follow. The difference between "AI helped me ship fast" and "AI gave me a disaster" usually comes down to how specific you were when asking and how good your checks are on the other side.

It's a Pipeline, Not a Model

People always blame the model when AI code is bad: "Claude messed up," "GPT gave me garbage." But most of the time it's not the model's fault. It's everything around it.

Think about all the pieces involved when you ask AI to write or review code:

| Layer | What it is | Where you have control |
| --- | --- | --- |
| Base model | Foundation: language, reasoning, code knowledge | Model selection |
| System prompt and instructions | Agent specs, reference guides, review process | High (this is where this proposal lives) |
| Your user prompt | The diff, the task, the context you provide | High |
| Fine-tuning and code training | React, TypeScript, Next.js baked into the model | Low (accept it as given) |
| Tools and retrieval (RAG) | Agents fetching reference guides on demand | High |
| Agent loops (iteration) | Each agent reasoning in steps, not one shot | High |
| Post-processing | Synthesis step: grouping labels, final report | High |

Most of these layers are within our control. The model's training is done; we can't change that. But the instructions, the context, how agents use tools, and how they reason step by step are all on us. This proposal focuses exactly there.

Current state on most teams:

  • No structured review of AI-generated code before it reaches a human
  • Findings from review all come out at the same level — blocking and advisory look identical
  • Agents silently skip categories on thin diffs, producing false clean bills
  • No awareness of spec/code drift when documentation files change alongside code
  • Vague checks with no actionable workflow

Proposal

We will build five specialized Claude agents that run in parallel on every diff. Each agent owns a lane and has a tight spec, a reference guide, a structured review process, and pre-labeled output.

Agent 1: Coding Reviewer

Reviews architecture, type safety, error handling, performance, DRY, naming, and design patterns against a references/coding.md guide.

App Router awareness is critical here. The agent must check for:

  • "use client" overuse on components that don't need browser APIs
  • Missing cache policy on fetch() calls (no cache: or next: { revalidate })
  • Ad-hoc loading state with a useState boolean instead of loading.tsx or <Suspense>
  • Ad-hoc error handling in render instead of error.tsx
  • Component contract enforcement — props interface, union types for variants, named interactive states, named state unions — before examining JSX

The difference between a poor and a strong component prompt illustrates why contract-first matters:

Poor

Make a modal component that can show different content

Strong

Create a Modal component for our design system:

Props:
- isOpen: boolean
- onClose: () => void
- title: string
- size: 'sm' | 'md' | 'lg' | 'fullscreen'
- closeOnOverlayClick?: boolean (default: true)
- closeOnEscape?: boolean (default: true)

Requirements:
- Use Radix Dialog primitive under the hood
- Trap focus inside modal when open
- Return focus to trigger element on close
- Animate in/out with Tailwind (fade + scale)
- Lock body scroll when open
- ARIA: role="dialog", aria-modal="true", aria-labelledby for title

Example usage:
<Modal isOpen={showConfirm} onClose={() => setShowConfirm(false)} title="Delete Item?" size="sm">
  <p>You sure you want to delete this? No going back.</p>
  <Button onClick={handleDelete}>Yes, delete</Button>
</Modal>

The poor prompt generates something that may work but probably has no focus trap, no Escape key handling, and possibly broken scroll on mobile. The strong one hits every requirement because it defines the contract before asking for an implementation.
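Stepping back to the fetch() cache-policy check listed above, it can be sketched as a simple heuristic. The function name and regexes below are illustrative assumptions, not part of the proposal; a production agent would reason over the diff rather than regex-match it:

```typescript
// Hypothetical helper: flag fetch() calls that declare neither a `cache:`
// option nor a `next:` revalidation config. This is a naive sketch — it
// only handles calls without nested parentheses in the argument list.
function findUncachedFetches(source: string): string[] {
  const findings: string[] = [];
  const fetchCall = /fetch\(([^)]*)\)/g;
  let match: RegExpExecArray | null;
  while ((match = fetchCall.exec(source)) !== null) {
    const args = match[1];
    // Either option being present means the author made an explicit choice.
    if (!/\bcache\s*:/.test(args) && !/\bnext\s*:/.test(args)) {
      findings.push(match[0]);
    }
  }
  return findings;
}
```

A check like this only gates the obvious cases; the agent still decides whether an uncached fetch is intentional.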

Agent 2: React Reviewer

Covers React correctness, performance, React 18/19+ patterns, Suspense boundaries, and hydration issues — backed by a references/react.md guide. This entire category is currently unreviewed on most teams, which is a serious gap.

Correctness

Missing cleanup breaks Strict Mode

// ❌ Bad
useEffect(() => { api.subscribe(); }, []);

// ✅ Good
useEffect(() => {
  api.subscribe();
  return () => api.unsubscribe();
}, []);

Stale closures

// ❌ Bad — count is always the initial value
useEffect(() => {
  const id = setInterval(() => console.log(count), 1000);
  return () => clearInterval(id);
}, []); // missing dep

// ✅ Good
useEffect(() => {
  const id = setInterval(() => console.log(count), 1000);
  return () => clearInterval(id);
}, [count]);

Hydration-unsafe values

// ❌ Bad — different on server vs client
return <div>{Date.now()}</div>;

// ✅ Good
const [time, setTime] = useState(null);
useEffect(() => { setTime(Date.now()); }, []);
return <div>{time ?? 'Loading...'}</div>;

Performance

Context fan-out

// ❌ Bad — new object every render, all consumers re-render
<ThemeContext.Provider value={{ theme, setTheme }}>

// ✅ Good
const value = useMemo(() => ({ theme, setTheme }), [theme]);
<ThemeContext.Provider value={value}>

Derived state in effects

// ❌ Bad — no effect needed for derived state
useEffect(() => { setFullName(first + ' ' + last); }, [first, last]);

// ✅ Good — derive during render; much simpler
const fullName = `${first} ${last}`;

Patterns

Suspense boundary placement

// ❌ Bad — the whole page suspends for one widget
<Suspense fallback={<PageLoader />}><Page /></Suspense>

// ✅ Good
<Page>
  <Suspense fallback={<WidgetSkeleton />}>
    <SlowWidget />
  </Suspense>
</Page>

React 19+

Manual form state instead of useActionState

// ❌ Bad — manual loading/error bookkeeping
const [loading, setLoading] = useState(false);
const [error, setError] = useState(null);

// ✅ Good — React 19's useActionState handles it
const [state, formAction, isPending] = useActionState(createPost, {});

Agent 3: Testing Reviewer

Reviews test organization against fixture management and co-location principles:

  • Extract shared mock data to __fixtures__/
  • Module mocks to __mocks__/
  • Co-locate tests with source files — no __tests__/ directories
  • Keep tests DRY, always
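A minimal sketch of the layout this agent enforces; the file names and the formatUser function are hypothetical, collapsed inline here for brevity:

```typescript
// __fixtures__/user.ts — shared mock data, extracted once and reused
const userFixture = { id: 'u_1', name: 'Ada', role: 'admin' };

// formatUser.ts — the unit under test (hypothetical)
function formatUser(u: { name: string; role: string }): string {
  return `${u.name} (${u.role})`;
}

// formatUser.test.ts — co-located next to formatUser.ts, no __tests__/ dir.
// Every test imports the same fixture instead of redefining mock users.
function testFormatUser(): void {
  if (formatUser(userFixture) !== 'Ada (admin)') {
    throw new Error('formatUser should render "name (role)"');
  }
}
testFormatUser();
```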

Agent 4: Documentation Reviewer

Two-pass strategy to catch what a diff alone misses:

  • Pass 1 (diff): Functions whose signatures changed always require JSDoc re-evaluation
  • Pass 2 (full source): For each src/lib/ file in the diff, scan all exported functions — undocumented functions that were never touched won't show up in the diff
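Pass 2 could be sketched like the following regex heuristic; the function name is hypothetical, and a real implementation would use the TypeScript compiler API rather than line matching:

```typescript
// Illustrative pass-2 scan: list exported functions in a source file that
// have no JSDoc block ending immediately above them. Regex sketch only.
function findUndocumentedExports(source: string): string[] {
  const lines = source.split('\n');
  const undocumented: string[] = [];
  lines.forEach((line, i) => {
    const m = line.match(/^export (?:async )?function (\w+)/);
    if (!m) return;
    // A documented function has a JSDoc terminator on the previous line.
    const prev = (lines[i - 1] ?? '').trim();
    if (!prev.endsWith('*/')) undocumented.push(m[1]);
  });
  return undocumented;
}
```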

Agent 5: Linting Reviewer

Style consistency and import patterns across two principles.

Principle 12: Styling & UI Consistency

Semantic classes — hardcoded colors break in dark mode

// ❌ Bad — breaks in dark mode
<div className="bg-yellow-100 text-yellow-800">Warning</div>

// ✅ Good
<div className="bg-warning text-warning-foreground">Warning</div>

Design token discipline — check tailwind.config.ts before flagging

// ❌ Bad — arbitrary values when tokens exist
<div className="gap-[8px] text-[12px] bg-[#f3f4f6]">

// ✅ Good
<div className="gap-2 text-xs bg-muted">

Semantic HTML — <div onClick> breaks keyboard navigation

// ❌ Bad — invisible to keyboard and screen-reader users
<div onClick={handleSubmit} className="cursor-pointer">Submit</div>

// ✅ Good
<button onClick={handleSubmit}>Submit</button>

Principle 13: Import Consistency

// ❌ Bad — brittle relative paths, inconsistent aliasing
import { cn } from "../../../lib/utils";
import { Button } from "components/ui/button";

// ✅ Good — consistent @/ alias
import { cn } from "@/lib/utils";
import { Button } from "@/components/ui/button";

Output Design

Severity Labels

Every finding should carry a severity label assigned by the agent at the time of reporting, not re-interpreted later:

| Label | Meaning |
| --- | --- |
| [blocking] | Must fix before merge |
| [discuss] | Warrants a conversation, not a hard block |
| [advisory] | Improve if convenient |

Coverage Confirmation

Every agent should end its report with an explicit list of every category it checked. This distinguishes "I looked and found nothing" from "there wasn't enough code in this diff to evaluate": two different signals that currently look identical.
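One way to make that distinction explicit in the report format (the type and field names below are assumptions for illustration, not part of the proposal):

```typescript
// Each category a reviewer owns ends in exactly one explicit state,
// so a silently skipped category can never masquerade as a clean bill.
type CategoryResult =
  | { category: string; status: 'clean' }                  // looked, found nothing
  | { category: string; status: 'findings'; count: number }
  | { category: string; status: 'insufficient-code' };     // diff too thin to judge

function coverageLine(r: CategoryResult): string {
  switch (r.status) {
    case 'clean':
      return `${r.category}: checked, no findings`;
    case 'findings':
      return `${r.category}: ${r.count} finding(s)`;
    case 'insufficient-code':
      return `${r.category}: not enough code in this diff to evaluate`;
  }
}
```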

Living-Document Check

The coding reviewer should check for spec/code drift: if CLAUDE.md, README.md, or a reference file appears in the diff alongside code changes, verify the two are still consistent with each other.

Synthesis

The orchestrator collects all agent output and groups findings by the labels the agents already assigned — no re-interpretation needed. One structured report with a single verdict: Pass / Needs Minor Changes / Needs Major Changes.
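The label-to-verdict mapping could be as simple as the sketch below. The exact policy (any blocking finding forces major changes; any discuss finding forces minor changes) is an assumption, not specified in this proposal:

```typescript
type Severity = 'blocking' | 'discuss' | 'advisory';

interface Finding {
  agent: string;      // e.g. 'react', 'linting'
  severity: Severity; // assigned by the agent at report time, never re-interpreted
  message: string;
}

type Verdict = 'Pass' | 'Needs Minor Changes' | 'Needs Major Changes';

function synthesize(findings: Finding[]): Verdict {
  if (findings.some((f) => f.severity === 'blocking')) return 'Needs Major Changes';
  if (findings.some((f) => f.severity === 'discuss')) return 'Needs Minor Changes';
  return 'Pass'; // advisory-only (or empty) reports still pass
}
```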


Important Caveats

Look, this pipeline is not going to solve everything. You still need humans in the loop. Think of agents like interns: they can do a lot of work, but you need to tell them exactly what you want, check their work, and never let them push to production alone. No auto-merge, ever.

The difference between a poor and a strong agent task prompt:

Poor

The search is not working right, users are complaining

Strong

Task: Fix search results not updating when filters change

Context:
- Location: app/products/search/page.tsx and lib/hooks/useProductSearch.ts
- Bug: When the user selects a category filter, search results don't refresh
- Root cause (suspected): useEffect dependency array missing the filters param
- Sentry link: https://sentry.io/issues/12345

Current behavior:
1. User types "shoes" in search -> results show correctly
2. User clicks "Running" category filter -> results stay the same
3. User has to type something else to trigger new search

Expected behavior:
- Results should refresh immediately when any filter changes

How to verify:
1. Run `npm run dev`
2. Go to /products/search
3. Search for something, then change category filter
4. Results should update without needing to retype

Constraints:
- Don't break the debounce on text input (we need that for performance)
- Keep the URL in sync with search state (already working; don't break it)
- Add test case in useProductSearch.test.ts

The poor prompt leaves everything undefined — which file, what exactly is broken, what is expected. The agent will guess, and probably refactor half the search system when the fix was a single missing dependency.

At the end of the day, this is just writing down what good engineers already do in their heads. We make it explicit so agents can follow the same rules. Nothing magical here, just discipline turned into automation.


Phases

Phase 1: Five Specialist Agents

Build the five agents described in the Proposal section above — coding, React, testing, documentation, and linting — each with a tight spec, a reference guide, severity labels, and coverage confirmation. This is the foundation everything else builds on.

Phase 2: MCP and Real Tooling

The first version reads files and runs git diffs. Phase 2 gives agents direct access to real tools:

  • Build logs and CI output
  • Sentry traces
  • Playwright test results
  • Next.js DevTools (routes, layouts, server actions, runtime errors)
  • Chrome DevTools (console, network, performance traces, screenshots)

Instead of guessing whether the code is okay, the agent actually checks the running app. There is a big difference between an agent that reads code and hopes for the best and one that can see real errors, real test results, and real performance data.

Phase 3: Context Engineering

Most agent failures are context failures. Phase 1 agents get a diff and a reference guide. Phase 3 gives them more signal:

  • Component usage across the codebase
  • Related test failures
  • Lighthouse score before and after
  • Failing CI output that would catch a bad patch before merge

More context is not always better: dump everything on the agent and it gets confused. The goal is to give just the right information — not too much, not too little — so the agent can actually make good decisions.

Phase 4: Guardrails That Verify, Not Just Review

ESLint, Prettier, and typecheck are wired into the pipeline so agents cannot suggest something that would fail your existing tooling. The agents snap to your house style, not a generic one.
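A sketch of the veto logic (the Check abstraction is hypothetical; in practice each run callback would shell out to the repo's own eslint, prettier --check, and tsc --noEmit):

```typescript
// Hypothetical guardrail harness: every configured check must pass on the
// patched tree before an agent suggestion is surfaced to a human reviewer.
interface Check {
  name: string;
  run: () => boolean; // true = the tool passed (e.g. exit code 0)
}

function passesGuardrails(checks: Check[]): { ok: boolean; failed: string[] } {
  // Any single failing tool vetoes the suggestion and names itself.
  const failed = checks.filter((c) => !c.run()).map((c) => c.name);
  return { ok: failed.length === 0, failed };
}
```

The key property is that the veto uses the project's own configuration, so agents can never propose code the team's tooling would reject.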


The Bigger Picture

Every team has different conventions, different toolchains, different taste. This is not about setting a universal standard. The goal is to teach teams how to think in this framework so they can codify their own standards and get the full advantage of Claude or any AI model.

Every team keeps its autonomy: their principles, their conventions, their guardrails — not our reference guides — automated and running on every diff. Each team with its own style.

The workflow is nothing new, honestly. It's what good engineers already do:

  1. Know what you're building before you start typing
  2. Give the AI all the context it needs — the stack, the docs, the rules
  3. Make it show you a plan first; don't let it write code unchecked
  4. Take small steps, test each one, and fix problems as you go
  5. Keep going until it's actually ready for production, not just "works on my machine"

The best teams using AI aren't doing anything special. They take what good engineers already know in their heads and write it down so the AI can follow it. Then they let the AI handle the boring parts that don't need a senior engineer's attention.


