kvenkatrajan/pr-1244-review.md

## pr-1244-review.md

      
PR #1244 Review: azure-infra-planner skill

High-Level Suggestions

1. Rename to azure-enterprise-infra-planner

azure-prepare plans infrastructure by analyzing source code and mapping it to one of 5 host types (containerapp, appservice, function, staticwebapp, aks). It has no logic to plan the remaining 43 resource types in this skill — VNets, Firewalls, VPN Gateways, VMs, Service Bus, Key Vault, etc. It cannot look at a workload description and decide "you need a hub-spoke VNet with NSGs and a Firewall."
This skill fills that gap: infrastructure-first planning for platform engineers, where the input is a workload description (not code) and the output covers all 48 resource types. Renaming to azure-enterprise-infra-planner makes this distinction clear.

2. This is a parallel path, not a pre-step to azure-prepare

The two skills serve different personas and should be independent workflows:


| | `azure-prepare` | `azure-enterprise-infra-planner` |
|---|---|---|
| Persona | App developer | Platform engineer / cloud architect |
| Input | Source code | Workload requirements |
| Output | azure.yaml + infra/ + Dockerfiles | infra/ only (Bicep or Terraform) |
| Deployment | `azd up` | `az deployment` / `terraform apply` |
| Scope | Resource group | Often subscription-level |


The skill should own its full lifecycle (plan → generate IaC → deploy) independently. bicep-generation.md, terraform-generation.md, and deployment.md should stay but be infra-focused (subscription-scope deployments, CAF naming, module-per-category structure).

Why the infra planner can't flow through azure-validate → azure-deploy

The existing chain is tightly coupled to the app-developer workflow. azure-validate requires .azure/plan.md in azure-prepare's schema and runs app-centric checks (azure.yaml validation, project builds). azure-deploy requires Validated status and runs azd up (azd provision + azd deploy); for infra-only templates with no services, the azd deploy step would no-op or fail because there is no app code to push.


| Option | How it works | Tradeoff |
|--------|--------------|----------|
| A. Own its deployment | Infra planner runs `az deployment create` / `terraform apply` / `azd provision` directly via its own `deployment.md` | Simple, self-contained. Loses azure-deploy's error recovery. |
| B. Generate a compatible plan | Infra planner creates a `.azure/plan.md` that azure-validate and azure-deploy understand | Requires changes to validate + deploy to accept infra-only plans (no services). |
| C. Hybrid | Infra planner owns `az deployment` / `terraform apply` directly but generates `azure.yaml` + `infra/` for users who want `azd provision` | Most flexible. No changes to existing skills. |


The PR currently takes Option A. The infra planner's deployment.md runs az deployment group create and terraform apply directly.
Option C would be the ideal future state — add the ability to emit a minimal azure.yaml with no services: block (infra-only project) and run azd provision as an alternative deployment path. This lets enterprise users benefit from azd's environment management without forcing through a pipeline designed for app deployments.
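
To make Option C concrete, here is a minimal sketch of how the skill might emit an infra-only project file. It assumes azd accepts an azure.yaml that has `name` and an `infra` block but no `services:` section, which is what this review proposes rather than something the PR has verified. The function name and the project name `hub-spoke-network` are hypothetical.

```python
def render_infra_only_azure_yaml(
    project_name: str, provider: str = "bicep", infra_path: str = "infra"
) -> str:
    """Render an azure.yaml that declares infrastructure but no services.

    Assumption: azd tolerates a project file without a `services:` block.
    """
    if provider not in ("bicep", "terraform"):
        raise ValueError(f"unsupported provider: {provider}")
    lines = [
        f"name: {project_name}",
        "infra:",
        f"  provider: {provider}",
        f"  path: {infra_path}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical project name, for illustration only.
    print(render_infra_only_azure_yaml("hub-spoke-network"), end="")
```

With a file like this next to `infra/`, a user could run `azd provision` instead of calling `az deployment` directly; the direct path would remain the default.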

Workflow diagram

```mermaid
flowchart TD
    UP["User Prompt"] --> D1{"What kind of request?"}

    D1 -->|"deploy my app<br>create a web app<br>build and deploy"| AP["azure-prepare"]
    D1 -->|"plan Azure infrastructure<br>set up networking + VMs<br>architect landing zone"| IP["azure-enterprise-infra-planner"]

    AP --> AP1["Scan source code"]
    AP1 --> AP2["Select recipe + host type"]
    AP2 --> AP3["Generate azure.yaml"]
    AP3 --> AP4["Generate infra/ + Dockerfiles"]

    IP --> IP1["Research WAF + requirements"]
    IP1 --> IP2["Plan resources from catalog"]
    IP2 --> IP3["Validate pairing constraints"]
    IP3 --> IP4["Generate infra/ only<br>No azure.yaml · No app code"]

    AP4 --> AV["azure-validate"]
    AV --> AD["azure-deploy<br>azd up"]

    IP4 --> DD{"Deployment option"}
    DD -->|"Default"| AZ["az deployment create"]
    DD -->|"Terraform"| TF["terraform apply"]
    DD -->|"Option C future"| AZD["azd provision<br>infra-only azure.yaml"]

    style AP fill:#4a9eda,color:#fff
    style IP fill:#e06c3a,color:#fff
    style AV fill:#4a9eda,color:#fff
    style AD fill:#4a9eda,color:#fff
    style AZ fill:#e06c3a,color:#fff
    style TF fill:#e06c3a,color:#fff
    style AZD fill:#e06c3a,color:#fff,stroke-dasharray: 5 5
```

  
3. Replace 166 static resource files with tool calls (200 → ~39 files)

The bulk of the PR (166 of 200 files) is per-resource reference files. Each of the 48 resources has 3-6 small files (bicep.md, constraints.md, properties.md, skus.md). Most of this content is obtainable at runtime from existing tools:


| Deleted content | Replace with tool |
|-----------------|-------------------|
| `bicep.md` (all 48) | `mcp_bicep_get_az_resource_type_schema`: returns full schema with required properties |
| SKUs, Key Properties, Child Resources | Same Bicep schema tool: includes all property types, descriptions, valid values |
| Naming rules | `microsoft_docs_search`: "Service Bus naming rules" returns min/max length, chars, scope |
| Pairing constraints | ❌ No tool provides this; must stay as static file |


Replace 166 files with 2 shared files:

- resource-catalog.md: lookup table (~123 lines) with ARM type, API version, CAF prefix for all 48 resources
- constraints.md: merged pairing rules for all 48 resources (the only content tools can't provide)

4. Narrow description triggers to avoid routing conflicts

Current triggers overlap significantly with azure-prepare. For example, "generate Bicep", "deploy to Azure Container Apps", "deploy a GenAI backend", and "provision microservices on AKS" all appear in both skills' descriptions. Narrow to enterprise/IaaS-specific language and replace DO NOT USE FOR with PREFER to avoid injecting competing keywords into the routing description:

```yaml
description: "Architect and provision enterprise Azure infrastructure from
workload descriptions. For platform engineers needing networking, security,
compliance, and WAF alignment. Generates Bicep or Terraform directly (no azd).
WHEN: 'plan Azure infrastructure', 'set up networking and VMs',
'architect Azure landing zone', 'design hub-spoke network',
'provision enterprise workload', 'plan DR infrastructure'.
PREFER azure-prepare FOR app-centric workflows."
```

Prompts that overlap today → resolved with new description


| Prompt | Current routing | With new description |
|--------|-----------------|----------------------|
| "generate Bicep for my app" | Both match | → azure-prepare (fixed: `app` only in azure-prepare) |
| "deploy to Azure Container Apps" | Both match | → azure-prepare (fixed: `deploy to`, `Container Apps` removed from infra-planner) |
| "deploy a GenAI backend with supporting services" | Both match | → azure-prepare (fixed: `deploy`, `GenAI`, `backend` removed) |
| "provision microservices on AKS" | Both match | → azure-prepare (fixed: `microservices`, `AKS` removed) |
| "generate Terraform from a workload description" | Both match | → infra-planner (stronger match: `workload descriptions` is in its description) |
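
A cheap guard against this kind of overlap creeping back in is a lint-style check that compares the trigger phrases of the two skills. The sketch below is only an illustration: the real router matches on the full description rather than exact phrases, and both trigger lists here are hypothetical stand-ins built from the four overlapping phrases named in this review.

```python
def shared_triggers(triggers_a, triggers_b):
    """Return trigger phrases present in both lists, ignoring case and spacing."""
    def normalize(phrase):
        return " ".join(phrase.lower().split())

    return sorted({normalize(p) for p in triggers_a} & {normalize(p) for p in triggers_b})


# Hypothetical trigger lists; the real ones live in each skill's description.
AZURE_PREPARE = [
    "generate Bicep",
    "deploy to Azure Container Apps",
    "deploy a GenAI backend",
    "provision microservices on AKS",
    "create a web app",
]
INFRA_PLANNER_BEFORE = [
    "generate Bicep",
    "deploy to Azure Container Apps",
    "deploy a GenAI backend",
    "provision microservices on AKS",
    "plan Azure infrastructure",
]
INFRA_PLANNER_AFTER = [
    "plan Azure infrastructure",
    "set up networking and VMs",
    "architect Azure landing zone",
    "design hub-spoke network",
    "provision enterprise workload",
    "plan DR infrastructure",
]
```

Under these assumed lists, `shared_triggers(AZURE_PREPARE, INFRA_PLANNER_BEFORE)` reports four collisions and `shared_triggers(AZURE_PREPARE, INFRA_PLANNER_AFTER)` reports none.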


Prompts that route to azure-prepare only


| Prompt | Why |
|--------|-----|
| "create a Node.js web app and deploy it" | App code + deployment |
| "add authentication to my existing API" | Modifying existing app |
| "build a todo list with React frontend and Express API" | Scaffolding app code |
| "deploy my Python Flask app to App Service" | Source code → host type mapping |
| "containerize my .NET app for Container Apps" | App code + Dockerfile + azure.yaml |
| "create a timer-triggered Azure Function" | App code generation |


Prompts that route to azure-enterprise-infra-planner only


| Prompt | Why |
|--------|-----|
| "design a hub-spoke network with VPN Gateway and Firewall" | Enterprise networking, no app code |
| "plan Azure landing zone infrastructure for our organization" | Platform engineering, subscription-scope |
| "set up VMs with NSGs, bastion host, and load balancer" | IaaS resources azure-prepare can't plan |
| "architect disaster recovery across two regions" | Infra-only, cross-region topology |
| "provision a Service Bus namespace with private endpoints and Key Vault" | Enterprise middleware, no app |
| "plan infrastructure for PCI-DSS compliance" | Compliance-driven infra planning |


5. Add integration tests for end-to-end prompt testing

Current integration tests only validate skill invocation (correct skill is selected). They don't test the full prompt → plan → IaC generation flow. Add end-to-end tests that:

- Send a natural language prompt (e.g., "plan infrastructure for a microservices app with Service Bus and Key Vault")
- Verify the generated infrastructure-plan.json contains expected resources and pairings
- Verify the generated Bicep/Terraform output compiles and includes required properties
- Test error paths (e.g., incompatible SKU pairings trigger constraint violations)
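
As a rough illustration of the second check, the assertion on the generated plan could look like the sketch below. The plan shape (a top-level `resources` list whose entries carry a `type` field) is an assumption made for this example; the real layout is whatever plan-schema.md defines, and the helper name is hypothetical.

```python
import json


def missing_resource_types(plan: dict, expected_types: set) -> set:
    """Return the expected ARM types that the generated plan does not contain.

    Assumes a hypothetical plan shape: {"resources": [{"type": ...}, ...]}.
    """
    planned = {resource["type"] for resource in plan.get("resources", [])}
    return set(expected_types) - planned


# Stand-in for the infrastructure-plan.json an end-to-end run would produce.
SAMPLE_PLAN = json.loads("""
{
  "resources": [
    {"name": "sbns-orders", "type": "Microsoft.ServiceBus/namespaces"},
    {"name": "kv-orders", "type": "Microsoft.KeyVault/vaults"}
  ]
}
""")
```

An end-to-end test would load the plan produced for the prompt and assert that `missing_resource_types(...)` is empty for the resources the prompt names.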

6. Split into reviewable PRs (Optional — if file count is reduced to ~39)


| PR | Contents | Files |
|----|----------|-------|
| PR A: Core skill | SKILL.md, research.md, plan-schema.md, verification.md, pairing-checks.md, waf-checklist.md, resource-catalog.md, constraints.md, bicep/terraform generation, deployment, sample plan | ~12 |
| PR B: Tests & evals | Tests, eval tasks, golden dataset, skills.json, eslint | ~23 |


Implementation Walkthrough: Messaging Resources

BEFORE: 10 files

```
references/resources/messaging/
├── index.md                           ← lookup table (15 lines)
├── service-bus/
│   ├── service-bus.md                 ← ARM type, SKUs, naming, properties (68 lines)
│   ├── bicep.md                       ← Bicep snippet (10 lines)
│   └── constraints.md                 ← pairing rules (12 lines)
├── event-hub/
│   ├── event-hub.md                   ← (72 lines)
│   ├── bicep.md                       ← (9 lines)
│   └── constraints.md                 ← (13 lines)
└── event-grid/
    ├── event-grid.md                  ← (68 lines)
    ├── bicep.md                       ← (6 lines)
    └── constraints.md                 ← (14 lines)
```

10 files, ~287 lines.

AFTER: 0 messaging-specific files

Everything folds into 2 shared files that cover ALL 48 resources:

File 1: resource-catalog.md (one row per resource):

```markdown
## Messaging

| Resource | ARM Type | API Version | CAF Prefix | Scope | Region |
|----------|----------|-------------|------------|-------|--------|
| Service Bus | `Microsoft.ServiceBus/namespaces` | `2024-01-01` | `sbns` | Global | Foundational |
| Event Hub | `Microsoft.EventHub/namespaces` | `2024-01-01` | `evhns` | Global | Foundational |
| Event Grid | `Microsoft.EventGrid/topics` | `2025-02-15` | `evgt` | Region | Mainstream |
```

File 2: constraints.md (pairing rules per resource):

```markdown
## Service Bus
| Paired With | Constraint |
|-------------|------------|
| Topics | Standard/Premium only. Basic = queues only |
| VNet | Premium only supports private endpoints |
| Message Size | Basic/Standard: 256 KB. Premium: 100 MB |
| Function App | Needs `ServiceBusConnection` in app settings |

## Event Hub
| Paired With | Constraint |
|-------------|------------|
| Kafka | Standard/Premium only |
| Capture | Standard/Premium only |
| Retention | Basic: 1d. Standard: 7d. Premium: 90d |

## Event Grid
| Paired With | Constraint |
|-------------|------------|
| Private Endpoint | Premium SKU only |
| Managed Identity | Required for dead-letter delivery |
```
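
To show how a pairing rule from constraints.md could become a machine-checkable error path, here is a small sketch that encodes two of the Service Bus rows above. The feature keys, the function name, and the violation message format are hypothetical; the skill itself may apply these rules through prompting rather than code.

```python
# Two Service Bus rows from constraints.md, encoded as feature -> allowed SKUs.
SERVICE_BUS_FEATURE_SKUS = {
    "topics": {"Standard", "Premium"},   # Basic = queues only
    "private_endpoint": {"Premium"},     # Premium only supports private endpoints
}


def check_service_bus(sku: str, features: list) -> list:
    """Return one violation message per feature the chosen SKU cannot support."""
    violations = []
    for feature in features:
        allowed = SERVICE_BUS_FEATURE_SKUS.get(feature)
        if allowed is not None and sku not in allowed:
            violations.append(
                f"{feature} requires one of {sorted(allowed)}, got {sku}"
            )
    return violations
```

For example, `check_service_bus("Basic", ["topics"])` yields a single violation, while a Premium namespace with topics and a private endpoint passes with an empty list.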

What happened to the deleted content?


| Deleted content | Now comes from | How |
|-----------------|----------------|-----|
| `bicep.md` (Bicep snippet) | `mcp_bicep_get_az_resource_type_schema` tool | Agent calls with ARM type from catalog → gets full schema with required properties → generates Bicep |
| SKUs, Properties from `service-bus.md` | Same Bicep schema tool | Schema includes SKU values (`Basic\|Standard\|Premium`), all properties with types |
| Naming rules from `service-bus.md` | `microsoft_docs_search` tool | Agent searches "Service Bus naming rules" → gets min/max length, allowed chars |
| `index.md` | `resource-catalog.md` | Folded into the shared catalog |
| `constraints.md` | `constraints.md` (shared) | Moved into the shared file; the only content that stays static |


Agent workflow at runtime

```text
1. READ resource-catalog.md
   → "Service Bus = Microsoft.ServiceBus/namespaces @ 2024-01-01, prefix sbns"

2. CALL mcp_bicep_get_az_resource_type_schema(Microsoft.ServiceBus/namespaces, 2024-01-01)
   → Gets: full 20KB schema (all properties, SKU values, required flags)
   → Replaces: service-bus.md + bicep.md

3. CALL microsoft_docs_search("Service Bus naming rules")
   → Gets: min 6, max 50, globally unique
   → Replaces: Naming section of service-bus.md

4. READ constraints.md § Service Bus
   → Gets: pairing rules (no tool provides these)
```


Proposed Skill Structure

Summary (skill files only, excludes tests)


| | Current | Proposed | Reduction |
|---|---------|----------|-----------|
| Files | 177 (11 core + 166 resource) | 13 (11 core + 2 shared resource) | -164 files (93%) |
| Lines | 4,917 (579 core + 4,338 resource) | ~1,200 (579 core + ~623 catalog/constraints) | -3,717 lines (76%) |
| Words | ~37,100 (4,587 core + 32,510 resource) | ~7,500 (4,587 core + ~2,900 catalog/constraints) | -29,600 words (80%) |
| Est. tokens | ~48,200 | ~9,700 | -38,500 tokens (80%) |


Token estimate: words × 1.3 (accounts for markdown formatting, pipes, code fences).
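
The token row can be reproduced from the word counts in the summary with a few lines of arithmetic. The 1.3 multiplier is the review's own heuristic, not a tokenizer measurement, and the function name is only illustrative.

```python
TOKENS_PER_WORD = 1.3  # heuristic from this review, not a tokenizer measurement


def estimate_tokens(words: int) -> int:
    """Estimate token count from a word count using the 1.3 multiplier."""
    return round(words * TOKENS_PER_WORD)


current_words = 4_587 + 32_510    # core + per-resource files
proposed_words = 4_587 + 2_900    # core + catalog/constraints (approximate)

current_tokens = estimate_tokens(current_words)     # 48226, reported as ~48,200
proposed_tokens = estimate_tokens(proposed_words)   # 9733, reported as ~9,700
reduction = 1 - proposed_tokens / current_tokens    # about 0.80
```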

The 166 deleted resource files are replaced by runtime tool calls that fetch richer, always-up-to-date content (20-61KB per resource from the Bicep schema tool alone).

Current: 200 files

```
azure-infra-planner/
├── SKILL.md
└── references/
    ├── deployment.md
    ├── pairing-checks.md
    ├── plan-schema.md
    ├── research.md
    ├── resources.md
    ├── verification.md
    ├── waf-checklist.md
    ├── sample_infrastructure_plan.json
    ├── DSLs/
    │   ├── bicep/bicep-generation.md
    │   └── terraform/terraform-generation.md
    └── resources/                          ← 166 files across 48 services
        ├── ai/ (7 index + service files)
        ├── compute/ (37 files)
        ├── data/ (34 files)
        ├── messaging/ (10 files)
        ├── monitoring/ (7 files)
        ├── networking/ (56 files)
        └── security/ (9 files)

tests/azure-infra-planner/                  ← 23 test files
```

Proposed: ~39 files

```
azure-enterprise-infra-planner/
├── SKILL.md                               ← renamed, narrow triggers, enterprise focus
└── references/
    ├── research.md                        ← UPDATED: directs tool calls instead of file reads
    ├── plan-schema.md                     ← unchanged
    ├── verification.md                    ← unchanged
    ├── pairing-checks.md                  ← unchanged
    ├── waf-checklist.md                   ← unchanged
    ├── resource-catalog.md                ← NEW: single lookup table (~123 lines, all 48 resources)
    ├── constraints.md                     ← NEW: merged pairing rules (~500 lines, all 48 resources)
    ├── bicep-generation.md                ← infra-focused (subscription scope, CAF naming)
    ├── terraform-generation.md            ← infra-focused (module-per-category)
    ├── deployment.md                      ← az deployment / terraform apply (not azd)
    ├── resources.md                       ← simplified (points to catalog + tools)
    └── sample_infrastructure_plan.json    ← unchanged

tests/azure-enterprise-infra-planner/       ← 23 test files (updated paths)
```

Key changes


| Change | Before | After |
|--------|--------|-------|
| Resource reference files | 166 files (48 dirs × 3-6 files) | 2 files (`resource-catalog.md` + `constraints.md`) |
| `resources/` directory | 7 categories × many subdirs | Deleted entirely |
| `index.md` files (×7) | Category lookup tables | Folded into `resource-catalog.md` |
| `bicep.md` files (×48) | Hand-written Bicep snippets | Tool call: `mcp_bicep_get_az_resource_type_schema` |
| `<service>.md` files (×48) | SKUs, naming, properties | Tool calls: Bicep schema + `microsoft_docs_search` |
| `constraints.md` files (×48) | Pairing rules | Merged into single `constraints.md` |
| DSLs directory | `DSLs/bicep/` and `DSLs/terraform/` | Flattened to `bicep-generation.md` and `terraform-generation.md` |
| `research.md` | "Load these static files" | "Call these tools with ARM type from catalog" |
| Skill name | `azure-infra-planner` | `azure-enterprise-infra-planner` |
| Total files | 200 | ~39 |