
Proposal: Orchestrated Code Creation — A Framework for AI-Driven Next.js/React Development

Author: Michael Villander · Team: Design Team Vega · Status: Draft · Date: 2026-02-27


Summary

Teams across G2i are already using AI to write code and shipping fast to meet customer needs. But too often the code is getting worse, not better: people vibe-code with AI, generating large volumes of code without really understanding it, and the mess passes review because we don't have good systems to catch problems before a human has to look at everything.

This proposal is about building a review pipeline that actually works: five AI agents, each specialized in a different lane, running on every pull request. They catch the obvious problems so engineers can focus on the real decisions instead of being the last line of defense against bad code.

"You're not just typing code anymore, you're orchestrating code creation." — Addy Osmani, React Summit 2025


Problem

When AI code turns out badly, people blame the model. But most of the time it's not the model. It's that someone gave vague instructions, no context, and no rules to follow. The difference between "AI helped me ship fast" and "AI gave me a disaster" usually comes down to how specific you were when asking and how good your checks are on the other side.

It's a Pipeline, Not a Model

People always blame the model when AI code is bad: "Claude messed up," "GPT gave me garbage." But most of the time it's not the model's fault. It's everything around it.

Think about all the pieces involved when you ask AI to write or review code:

| Layer | What it is | Where you have control |
| --- | --- | --- |
| Base model | Foundation: language, reasoning, code knowledge | Model selection |
| System prompt and instructions | Agent specs, reference guides, review process | High (this is where this proposal lives) |
| Your user prompt | The diff, the task, the context you provide | High |
| Fine-tuning and code training | React, TypeScript, Next.js baked into the model | Low (accept it as given) |
| Tools and retrieval (RAG) | Agents fetching reference guides on demand | High |
| Agent loops (iteration) | Each agent reasoning in steps, not one shot | High |
| Post-processing | Synthesis step: grouping labels, final report | High |

Most of these layers are within our control. The model's training is done; we can't change that. But the instructions, the context, how agents use tools, and how they reason step by step are all on us. This proposal focuses exactly there.

Current state on most teams:

  • No structured review of AI-generated code before it reaches a human
  • Findings from review all come out at the same level — blocking and advisory look identical
  • Agents silently skip categories on thin diffs, producing false clean bills
  • No awareness of spec/code drift when documentation files change alongside code
  • Vague checks with no actionable workflow

Proposal

We will build five specialized Claude agents that run in parallel on every diff. Each agent owns a lane and has a tight spec, a reference guide, a structured review process, and pre-labeled output.

Agent 1: Coding Reviewer

Reviews architecture, type safety, error handling, performance, DRY, naming, and design patterns against a references/coding.md guide.

App Router awareness is critical here. The agent must check for:

  • "use client" overuse on components that don't need browser APIs
  • Missing cache policy on fetch() calls (no cache: or next: { revalidate })
  • Ad-hoc loading state with a useState boolean instead of loading.tsx or <Suspense>
  • Ad-hoc error handling in render instead of error.tsx
  • Component contract enforcement — props interface, union types for variants, named interactive states, named state unions — before examining JSX

The difference between a poor and a strong component prompt illustrates why contract-first matters:

Poor

Make a modal component that can show different content

Strong

Create a Modal component for our design system:

Props:
- isOpen: boolean
- onClose: () => void
- title: string
- size: 'sm' | 'md' | 'lg' | 'fullscreen'
- closeOnOverlayClick?: boolean (default: true)
- closeOnEscape?: boolean (default: true)

Requirements:
- Use Radix Dialog primitive under the hood
- Trap focus inside modal when open
- Return focus to trigger element on close
- Animate in/out with Tailwind (fade + scale)
- Lock body scroll when open
- ARIA: role="dialog", aria-modal="true", aria-labelledby for title

Example usage:
<Modal isOpen={showConfirm} onClose={() => setShowConfirm(false)} title="Delete Item?" size="sm">
  <p>You sure you want to delete this? No going back.</p>
  <Button onClick={handleDelete}>Yes, delete</Button>
</Modal>

The poor prompt generates something that may work but probably has no focus trap, no Escape key handling, and possibly broken scroll on mobile. The strong one hits every requirement because it defines the contract before asking for an implementation.
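Stepping back to the fetch() cache-policy check listed above, it can be sketched as a simple heuristic. The function name and regexes below are illustrative assumptions, not part of the proposal; a production agent would reason over the diff rather than regex-match it:

```typescript
// Hypothetical helper: flag fetch() calls that declare neither a `cache:`
// option nor a `next:` revalidation config. This is a naive sketch — it
// only handles calls without nested parentheses in the argument list.
function findUncachedFetches(source: string): string[] {
  const findings: string[] = [];
  const fetchCall = /fetch\(([^)]*)\)/g;
  let match: RegExpExecArray | null;
  while ((match = fetchCall.exec(source)) !== null) {
    const args = match[1];
    // Either option being present means the author made an explicit choice.
    if (!/\bcache\s*:/.test(args) && !/\bnext\s*:/.test(args)) {
      findings.push(match[0]);
    }
  }
  return findings;
}
```

A check like this only gates the obvious cases; the agent still decides whether an uncached fetch is intentional.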

Agent 2: React Reviewer

Covers React correctness, performance, React 18/19+ patterns, Suspense boundaries, and hydration issues — backed by a references/react.md guide. This entire category is currently unreviewed on most teams, which is a serious gap.

Correctness

Missing cleanup breaks Strict Mode

// ❌ Bad
useEffect(() => { api.subscribe(); }, []);

// ✅ Good
useEffect(() => {
  api.subscribe();
  return () => api.unsubscribe();
}, []);

Stale closures

// ❌ Bad — count is always the initial value
useEffect(() => {
  const id = setInterval(() => console.log(count), 1000);
  return () => clearInterval(id);
}, []); // missing dep

// ✅ Good
useEffect(() => {
  const id = setInterval(() => console.log(count), 1000);
  return () => clearInterval(id);
}, [count]);

Hydration-unsafe values

// ❌ Bad — different on server vs client
return <div>{Date.now()}</div>;

// ✅ Good
const [time, setTime] = useState(null);
useEffect(() => { setTime(Date.now()); }, []);
return <div>{time ?? 'Loading...'}</div>;

Performance

Context fan-out

// ❌ Bad — new object every render, all consumers re-render
<ThemeContext.Provider value={{ theme, setTheme }}>

// ✅ Good
const value = useMemo(() => ({ theme, setTheme }), [theme]);
<ThemeContext.Provider value={value}>

Derived state in effects

// ❌ Bad — no effect needed for derived state
useEffect(() => { setFullName(first + ' ' + last); }, [first, last]);

// ✅ Good — derive during render; much simpler
const fullName = `${first} ${last}`;

Patterns

Suspense boundary placement

// ❌ Bad — the whole page suspends for one widget
<Suspense fallback={<PageLoader />}><Page /></Suspense>

// ✅ Good
<Page>
  <Suspense fallback={<WidgetSkeleton />}>
    <SlowWidget />
  </Suspense>
</Page>

React 19+

Manual form state instead of useActionState

// ❌ Bad — manual loading/error bookkeeping
const [loading, setLoading] = useState(false);
const [error, setError] = useState(null);

// ✅ Good — React 19's useActionState handles it
const [state, formAction, isPending] = useActionState(createPost, {});

Agent 3: Testing Reviewer

Reviews test organization against fixture management and co-location principles:

  • Extract shared mock data to __fixtures__/
  • Module mocks to __mocks__/
  • Co-locate tests with source files — no __tests__/ directories
  • Keep tests DRY, always
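A minimal sketch of the layout this agent enforces; the file names and the formatUser function are hypothetical, collapsed inline here for brevity:

```typescript
// __fixtures__/user.ts — shared mock data, extracted once and reused
const userFixture = { id: 'u_1', name: 'Ada', role: 'admin' };

// formatUser.ts — the unit under test (hypothetical)
function formatUser(u: { name: string; role: string }): string {
  return `${u.name} (${u.role})`;
}

// formatUser.test.ts — co-located next to formatUser.ts, no __tests__/ dir.
// Every test imports the same fixture instead of redefining mock users.
function testFormatUser(): void {
  if (formatUser(userFixture) !== 'Ada (admin)') {
    throw new Error('formatUser should render "name (role)"');
  }
}
testFormatUser();
```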

Agent 4: Documentation Reviewer

Two-pass strategy to catch what a diff alone misses:

  • Pass 1 (diff): Functions whose signatures changed always require JSDoc re-evaluation
  • Pass 2 (full source): For each src/lib/ file in the diff, scan all exported functions — undocumented functions that were never touched won't show up in the diff
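Pass 2 could be sketched like the following regex heuristic; the function name is hypothetical, and a real implementation would use the TypeScript compiler API rather than line matching:

```typescript
// Illustrative pass-2 scan: list exported functions in a source file that
// have no JSDoc block ending immediately above them. Regex sketch only.
function findUndocumentedExports(source: string): string[] {
  const lines = source.split('\n');
  const undocumented: string[] = [];
  lines.forEach((line, i) => {
    const m = line.match(/^export (?:async )?function (\w+)/);
    if (!m) return;
    // A documented function has a JSDoc terminator on the previous line.
    const prev = (lines[i - 1] ?? '').trim();
    if (!prev.endsWith('*/')) undocumented.push(m[1]);
  });
  return undocumented;
}
```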

Agent 5: Linting Reviewer

Style consistency and import patterns across two principles.

Principle 12: Styling & UI Consistency

Semantic classes — hardcoded colors break in dark mode

// ❌ Bad — breaks in dark mode
<div className="bg-yellow-100 text-yellow-800">Warning</div>

// ✅ Good
<div className="bg-warning text-warning-foreground">Warning</div>

Design token discipline — check tailwind.config.ts before flagging

// ❌ Bad — arbitrary values when tokens exist
<div className="gap-[8px] text-[12px] bg-[#f3f4f6]">

// ✅ Good
<div className="gap-2 text-xs bg-muted">

Semantic HTML — <div onClick> breaks keyboard navigation

// ❌ Bad — invisible to keyboard and screen-reader users
<div onClick={handleSubmit} className="cursor-pointer">Submit</div>

// ✅ Good
<button onClick={handleSubmit}>Submit</button>

Principle 13: Import Consistency

// ❌ Bad — brittle relative paths, inconsistent aliasing
import { cn } from "../../../lib/utils";
import { Button } from "components/ui/button";

// ✅ Good — consistent @/ alias
import { cn } from "@/lib/utils";
import { Button } from "@/components/ui/button";

Output Design

Severity Labels

Every finding should carry a severity label assigned by the agent at the time of reporting, not re-interpreted later:

| Label | Meaning |
| --- | --- |
| [blocking] | Must fix before merge |
| [discuss] | Warrants a conversation, not a hard block |
| [advisory] | Improve if convenient |

Coverage Confirmation

Every agent should end its report with an explicit list of every category it checked. This distinguishes "I looked and found nothing" from "there wasn't enough code in this diff to evaluate": two different signals that currently look identical.
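One way to make that distinction explicit in the report format (the type and field names below are assumptions for illustration, not part of the proposal):

```typescript
// Each category a reviewer owns ends in exactly one explicit state,
// so a silently skipped category can never masquerade as a clean bill.
type CategoryResult =
  | { category: string; status: 'clean' }                  // looked, found nothing
  | { category: string; status: 'findings'; count: number }
  | { category: string; status: 'insufficient-code' };     // diff too thin to judge

function coverageLine(r: CategoryResult): string {
  switch (r.status) {
    case 'clean':
      return `${r.category}: checked, no findings`;
    case 'findings':
      return `${r.category}: ${r.count} finding(s)`;
    case 'insufficient-code':
      return `${r.category}: not enough code in this diff to evaluate`;
  }
}
```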

Living-Document Check

The coding reviewer should check for spec/code drift: if CLAUDE.md, README.md, or a reference file appears in the diff alongside code changes, verify the two are still consistent with each other.

Synthesis

The orchestrator collects all agent output and groups findings by the labels the agents already assigned — no re-interpretation needed. One structured report with a single verdict: Pass / Needs Minor Changes / Needs Major Changes.
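The label-to-verdict mapping could be as simple as the sketch below. The exact policy (any blocking finding forces major changes; any discuss finding forces minor changes) is an assumption, not specified in this proposal:

```typescript
type Severity = 'blocking' | 'discuss' | 'advisory';

interface Finding {
  agent: string;      // e.g. 'react', 'linting'
  severity: Severity; // assigned by the agent at report time, never re-interpreted
  message: string;
}

type Verdict = 'Pass' | 'Needs Minor Changes' | 'Needs Major Changes';

function synthesize(findings: Finding[]): Verdict {
  if (findings.some((f) => f.severity === 'blocking')) return 'Needs Major Changes';
  if (findings.some((f) => f.severity === 'discuss')) return 'Needs Minor Changes';
  return 'Pass'; // advisory-only (or empty) reports still pass
}
```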


Important Caveats

Look, this pipeline is not going to solve everything. You still need humans in the loop. Think of agents like interns: they can do a lot of work, but you need to tell them exactly what you want, check their work, and never let them push to production alone. No auto-merge, ever.

The difference between a poor and a strong agent task prompt:

Poor

The search is not working right, users are complaining

Strong

Task: Fix search results not updating when filters change

Context:
- Location: app/products/search/page.tsx and lib/hooks/useProductSearch.ts
- Bug: When the user selects a category filter, search results don't refresh
- Root cause (suspected): useEffect dependency array missing the filters param
- Sentry link: https://sentry.io/issues/12345

Current behavior:
1. User types "shoes" in search -> results show correctly
2. User clicks "Running" category filter -> results stay the same
3. User has to type something else to trigger new search

Expected behavior:
- Results should refresh immediately when any filter changes

How to verify:
1. Run `npm run dev`
2. Go to /products/search
3. Search for something, then change category filter
4. Results should update without needing to retype

Constraints:
- Don't break the debounce on text input (we need that for performance)
- Keep the URL in sync with search state (already working; don't break it)
- Add test case in useProductSearch.test.ts

The poor prompt leaves everything undefined — which file, what exactly is broken, what is expected. The agent will guess, and probably refactor half the search system when the fix was a single missing dependency.

At the end of the day, this is just writing down what good engineers already do in their heads. We make it explicit so agents can follow the same rules. Nothing magical here, just discipline turned into automation.


Phases

Phase 1: Five Specialist Agents

Build the five agents described in the Proposal section above — coding, React, testing, documentation, and linting — each with a tight spec, a reference guide, severity labels, and coverage confirmation. This is the foundation everything else builds on.

Phase 2: MCP and Real Tooling

The first version reads files and runs git diffs. Phase 2 gives agents direct access to real tools:

  • Build logs and CI output
  • Sentry traces
  • Playwright test results
  • Next.js DevTools (routes, layouts, server actions, runtime errors)
  • Chrome DevTools (console, network, performance traces, screenshots)

Instead of guessing whether the code is okay, the agent actually checks the running app. There is a big difference between an agent that reads code and hopes for the best and one that can see real errors, real test results, and real performance data.

Phase 3: Context Engineering

Most agent failures are context failures. Phase 1 agents get a diff and a reference guide. Phase 3 gives them more signal:

  • Component usage across the codebase
  • Related test failures
  • Lighthouse score before and after
  • Failing CI output that would catch a bad patch before merge

More context is not always better: dump everything on the agent and it gets confused. The goal is to give just the right information — not too much, not too little — so the agent can actually make good decisions.

Phase 4: Guardrails That Verify, Not Just Review

ESLint, Prettier, and typecheck are wired into the pipeline so agents cannot suggest something that would fail your existing tooling. The agents snap to your house style, not a generic one.
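A sketch of the veto logic (the Check abstraction is hypothetical; in practice each run callback would shell out to the repo's own eslint, prettier --check, and tsc --noEmit):

```typescript
// Hypothetical guardrail harness: every configured check must pass on the
// patched tree before an agent suggestion is surfaced to a human reviewer.
interface Check {
  name: string;
  run: () => boolean; // true = the tool passed (e.g. exit code 0)
}

function passesGuardrails(checks: Check[]): { ok: boolean; failed: string[] } {
  // Any single failing tool vetoes the suggestion and names itself.
  const failed = checks.filter((c) => !c.run()).map((c) => c.name);
  return { ok: failed.length === 0, failed };
}
```

The key property is that the veto uses the project's own configuration, so agents can never propose code the team's tooling would reject.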


The Bigger Picture

Every team has different conventions, different toolchains, different taste. This is not about setting a universal standard. The goal is to teach teams how to think in this framework so they can codify their own standards and get the full advantage of Claude or any AI model.

Every team keeps its autonomy: their principles, their conventions, their guardrails — not our reference guides — automated and running on every diff. Each team with its own style.

The workflow is nothing new, honestly. It's what good engineers already do:

  1. Know what you're building before you start typing
  2. Give the AI all the context it needs — the stack, the docs, the rules
  3. Make it show you a plan first; don't let it write code unchecked
  4. Take small steps, test each one, and fix problems as you go
  5. Keep going until it's actually ready for production, not just "works on my machine"

The best teams using AI aren't doing anything special. They take what good engineers already know in their heads and write it down so the AI can follow it. Then they let the AI handle the boring parts that don't need a senior engineer's attention.


