@kvenkatrajan
Last active March 13, 2026 18:57
PR #1244 Review: azure-infra-planner skill — compression, architecture, and rename suggestions

PR #1244 Review: azure-infra-planner skill

High-Level Suggestions

1. Rename to azure-enterprise-infra-planner

azure-prepare plans infrastructure by analyzing source code and mapping it to one of 5 host types (containerapp, appservice, function, staticwebapp, aks). It has no logic to plan the remaining 43 resource types in this skill — VNets, Firewalls, VPN Gateways, VMs, Service Bus, Key Vault, etc. It cannot look at a workload description and decide "you need a hub-spoke VNet with NSGs and a Firewall."

This skill fills that gap: infrastructure-first planning for platform engineers, where the input is a workload description (not code) and the output covers all 48 resource types. Renaming to azure-enterprise-infra-planner makes this distinction clear.

2. This is a parallel path, not a pre-step to azure-prepare

The two skills serve different personas and should be independent workflows:

| | azure-prepare | azure-enterprise-infra-planner |
|---|---|---|
| Persona | App developer | Platform engineer / cloud architect |
| Input | Source code | Workload requirements |
| Output | azure.yaml + infra/ + Dockerfiles | infra/ only (Bicep or Terraform) |
| Deployment | azd up | az deployment / terraform apply |
| Scope | Resource group | Often subscription-level |

The skill should own its full lifecycle (plan → generate IaC → deploy) independently. bicep-generation.md, terraform-generation.md, and deployment.md should stay but be infra-focused (subscription-scope deployments, CAF naming, module-per-category structure).

Why the infra planner can't flow through azure-validate → azure-deploy

The existing chain is tightly coupled to the app-developer workflow: azure-validate requires .azure/plan.md in azure-prepare's schema and runs app-centric checks (azure.yaml validation, project builds). azure-deploy requires Validated status and runs azd up (azd provision + azd deploy) — but for infra-only templates with no services, azd deploy would no-op or fail since there's no app code to push.

| Option | How it works | Tradeoff |
|---|---|---|
| A. Own its deployment | Infra planner runs az deployment group create / terraform apply / azd provision directly via its own deployment.md | Simple, self-contained. Loses azure-deploy's error recovery. |
| B. Generate a compatible plan | Infra planner creates a .azure/plan.md that azure-validate and azure-deploy understand | Requires changes to validate + deploy to accept infra-only plans (no services). |
| C. Hybrid | Infra planner owns az deployment / terraform apply directly but generates azure.yaml + infra/ for users who want azd provision | Most flexible. No changes to existing skills. |

The PR currently takes Option A. The infra planner's deployment.md runs az deployment group create and terraform apply directly.
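To make Option A concrete, here is a minimal sketch of how a deployment step might pick its command. The function name, the default template path `infra/main.bicep`, and the `scope` values are assumptions for illustration, not taken from the PR's deployment.md. The sketch only builds the argument list and does not run anything.

```python
from typing import List, Optional


def build_deploy_command(
    tool: str,
    template_file: str = "infra/main.bicep",  # hypothetical default path
    scope: str = "group",                      # "group" or "sub" (assumed names)
    resource_group: Optional[str] = None,
    location: Optional[str] = None,
) -> List[str]:
    """Return the argv for one deployment path. The command is built, not executed."""
    if tool == "terraform":
        # Terraform reads its configuration from the working directory.
        return ["terraform", "apply"]
    if tool != "bicep":
        raise ValueError(f"unsupported tool: {tool}")
    if scope == "group":
        if not resource_group:
            raise ValueError("resource-group deployments need resource_group")
        return ["az", "deployment", "group", "create",
                "--resource-group", resource_group,
                "--template-file", template_file]
    if scope == "sub":
        if not location:
            raise ValueError("subscription deployments need location")
        return ["az", "deployment", "sub", "create",
                "--location", location,
                "--template-file", template_file]
    raise ValueError(f"unsupported scope: {scope}")
```

Keeping the subscription-scope branch separate matters here because the comparison table above notes that this skill's scope is often subscription-level, whereas the PR currently deploys at resource-group scope.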

Option C would be the ideal future state — add the ability to emit a minimal azure.yaml with no services: block (infra-only project) and run azd provision as an alternative deployment path. This lets enterprise users benefit from azd's environment management without forcing through a pipeline designed for app deployments.
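A rough sketch of what emitting an infra-only azure.yaml could look like. The helper name and defaults are hypothetical. It assumes azd accepts a project file that has `name` and an `infra` block but no `services:` key, which is the premise of Option C and should be verified against the azure.yaml schema before relying on it.

```python
def render_infra_only_azure_yaml(
    project_name: str,
    provider: str = "bicep",   # assumed allowed values: "bicep" or "terraform"
    infra_path: str = "infra",
) -> str:
    """Render a minimal azure.yaml with no services block (infra-only project)."""
    if provider not in ("bicep", "terraform"):
        raise ValueError(f"unsupported provider: {provider}")
    lines = [
        f"name: {project_name}",
        "infra:",
        f"  provider: {provider}",
        f"  path: {infra_path}",
    ]
    # Deliberately no "services:" key: there is no app code to deploy.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_infra_only_azure_yaml("hub-spoke-network"))
```

With a file like this in place, `azd provision` would be the alternative deployment path, and `azd deploy` is never needed.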

Workflow diagram

flowchart TD
    UP["User Prompt"] --> D1{"What kind of request?"}

    D1 -->|"deploy my app<br>create a web app<br>build and deploy"| AP["azure-prepare"]
    D1 -->|"plan Azure infrastructure<br>set up networking + VMs<br>architect landing zone"| IP["azure-enterprise-infra-planner"]

    AP --> AP1["Scan source code"]
    AP1 --> AP2["Select recipe + host type"]
    AP2 --> AP3["Generate azure.yaml"]
    AP3 --> AP4["Generate infra/ + Dockerfiles"]

    IP --> IP1["Research WAF + requirements"]
    IP1 --> IP2["Plan resources from catalog"]
    IP2 --> IP3["Validate pairing constraints"]
    IP3 --> IP4["Generate infra/ only<br>No azure.yaml · No app code"]

    AP4 --> AV["azure-validate"]
    AV --> AD["azure-deploy<br>azd up"]

    IP4 --> DD{"Deployment option"}
    DD -->|"Default"| AZ["az deployment group create"]
    DD -->|"Terraform"| TF["terraform apply"]
    DD -->|"Option C future"| AZD["azd provision<br>infra-only azure.yaml"]

    style AP fill:#4a9eda,color:#fff
    style IP fill:#e06c3a,color:#fff
    style AV fill:#4a9eda,color:#fff
    style AD fill:#4a9eda,color:#fff
    style AZ fill:#e06c3a,color:#fff
    style TF fill:#e06c3a,color:#fff
    style AZD fill:#e06c3a,color:#fff,stroke-dasharray: 5 5

3. Replace 166 static resource files with tool calls (200 → ~39 files)

The bulk of the PR (166 of 200 files) is per-resource reference files. Each of the 48 resources has 3-6 small files (bicep.md, constraints.md, properties.md, skus.md). Most of this content is obtainable at runtime from existing tools:

| Deleted content | Replace with tool |
|---|---|
| bicep.md (all 48) | mcp_bicep_get_az_resource_type_schema: returns full schema with required properties |
| SKUs, Key Properties, Child Resources | Same Bicep schema tool: includes all property types, descriptions, valid values |
| Naming rules | microsoft_docs_search: "Service Bus naming rules" returns min/max length, chars, scope |
| Pairing constraints | No tool provides this; must stay as a static file |

Replace 166 files with 2 shared files:

  • resource-catalog.md — lookup table (~123 lines) with ARM type, API version, CAF prefix for all 48 resources
  • constraints.md — merged pairing rules for all 48 resources (the only content tools can't provide)

4. Narrow description triggers to avoid routing conflicts

Current triggers overlap significantly with azure-prepare. For example, "generate Bicep", "deploy to Azure Container Apps", "deploy a GenAI backend", and "provision microservices on AKS" all appear in both skills' descriptions. Narrow to enterprise/IaaS-specific language and replace DO NOT USE FOR with PREFER to avoid injecting competing keywords into the routing description:

description: "Architect and provision enterprise Azure infrastructure from 
workload descriptions. For platform engineers needing networking, security, 
compliance, and WAF alignment. Generates Bicep or Terraform directly (no azd). 
WHEN: 'plan Azure infrastructure', 'set up networking and VMs', 
'architect Azure landing zone', 'design hub-spoke network', 
'provision enterprise workload', 'plan DR infrastructure'.
PREFER azure-prepare FOR app-centric workflows."
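One way to catch this kind of conflict before merge is a simple phrase-overlap check between the two descriptions. The sketch below is a heuristic under stated assumptions: real routing is not a literal string match, the azure-prepare list here contains only the four overlapping phrases named in this review (its full trigger list is not reproduced), and the "before" list for the infra planner is illustrative.

```python
from typing import Iterable, List


def normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace so phrases compare reliably."""
    return " ".join(phrase.lower().split())


def shared_triggers(a: Iterable[str], b: Iterable[str]) -> List[str]:
    """Return trigger phrases that appear in both skills' descriptions."""
    return sorted({normalize(p) for p in a} & {normalize(p) for p in b})


# The four phrases this review reports as present in both descriptions.
OVERLAPPING = [
    "generate Bicep",
    "deploy to Azure Container Apps",
    "deploy a GenAI backend",
    "provision microservices on AKS",
]

PREPARE_TRIGGERS = list(OVERLAPPING)  # assumed subset of the real list
INFRA_TRIGGERS_BEFORE = OVERLAPPING + ["plan Azure infrastructure"]  # illustrative
INFRA_TRIGGERS_AFTER = [  # WHEN phrases from the proposed description
    "plan Azure infrastructure",
    "set up networking and VMs",
    "architect Azure landing zone",
    "design hub-spoke network",
    "provision enterprise workload",
    "plan DR infrastructure",
]

if __name__ == "__main__":
    print("before:", shared_triggers(PREPARE_TRIGGERS, INFRA_TRIGGERS_BEFORE))
    print("after: ", shared_triggers(PREPARE_TRIGGERS, INFRA_TRIGGERS_AFTER))
```

A check like this could run as a unit test over the descriptions of every skill pair, failing when a new phrase lands in two descriptions.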

Prompts that overlap today → resolved with new description

| Prompt | Current routing | With new description |
|---|---|---|
| "generate Bicep for my app" | Both match | azure-prepare (fixed: app only in azure-prepare) |
| "deploy to Azure Container Apps" | Both match | azure-prepare (fixed: deploy to, Container Apps removed from infra-planner) |
| "deploy a GenAI backend with supporting services" | Both match | azure-prepare (fixed: deploy, GenAI, backend removed) |
| "provision microservices on AKS" | Both match | azure-prepare (fixed: microservices, AKS removed) |
| "generate Terraform from a workload description" | Both match | infra-planner (stronger match: workload descriptions is in its description) |

Prompts that route to azure-prepare only

| Prompt | Why |
|---|---|
| "create a Node.js web app and deploy it" | App code + deployment |
| "add authentication to my existing API" | Modifying existing app |
| "build a todo list with React frontend and Express API" | Scaffolding app code |
| "deploy my Python Flask app to App Service" | Source code → host type mapping |
| "containerize my .NET app for Container Apps" | App code + Dockerfile + azure.yaml |
| "create a timer-triggered Azure Function" | App code generation |

Prompts that route to azure-enterprise-infra-planner only

| Prompt | Why |
|---|---|
| "design a hub-spoke network with VPN Gateway and Firewall" | Enterprise networking, no app code |
| "plan Azure landing zone infrastructure for our organization" | Platform engineering, subscription-scope |
| "set up VMs with NSGs, bastion host, and load balancer" | IaaS resources azure-prepare can't plan |
| "architect disaster recovery across two regions" | Infra-only, cross-region topology |
| "provision a Service Bus namespace with private endpoints and Key Vault" | Enterprise middleware, no app |
| "plan infrastructure for PCI-DSS compliance" | Compliance-driven infra planning |

5. Add integration tests for end-to-end prompt testing

Current integration tests only validate skill invocation (correct skill is selected). They don't test the full prompt → plan → IaC generation flow. Add end-to-end tests that:

  • Send a natural language prompt (e.g., "plan infrastructure for a microservices app with Service Bus and Key Vault")
  • Verify the generated infrastructure-plan.json contains expected resources and pairings
  • Verify the generated Bicep/Terraform output compiles and includes required properties
  • Test error paths (e.g., incompatible SKU pairings trigger constraint violations)
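The second bullet (checking the generated plan for expected resources) could be sketched as below. The plan shape used here, a top-level `resources` list whose entries carry a `type` field, is a hypothetical stand-in because plan-schema.md is not reproduced in this review. Adjust the field names to the real schema.

```python
import json
from typing import Iterable, List


def missing_resource_types(plan_json: str, expected_types: Iterable[str]) -> List[str]:
    """Return expected ARM types that the generated plan does not contain.

    Assumes a hypothetical plan shape: {"resources": [{"type": "..."}, ...]}.
    ARM types are compared case-insensitively.
    """
    plan = json.loads(plan_json)
    present = {r["type"].lower() for r in plan.get("resources", [])}
    return [t for t in expected_types if t.lower() not in present]


if __name__ == "__main__":
    # Stand-in for the infrastructure-plan.json an end-to-end run would produce.
    sample_plan = json.dumps({
        "resources": [
            {"name": "sbns-orders", "type": "Microsoft.ServiceBus/namespaces"},
        ]
    })
    expected = ["Microsoft.ServiceBus/namespaces", "Microsoft.KeyVault/vaults"]
    print(missing_resource_types(sample_plan, expected))
```

An end-to-end test would send the prompt, load the emitted plan, and assert that this list is empty. The compile check for Bicep or Terraform would be a separate step.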

6. Split into reviewable PRs (Optional — if file count is reduced to ~39)

| PR | Contents | Files |
|---|---|---|
| PR A: Core skill | SKILL.md, research.md, plan-schema.md, verification.md, pairing-checks.md, waf-checklist.md, resource-catalog.md, constraints.md, bicep/terraform generation, deployment, sample plan | ~12 |
| PR B: Tests & evals | Tests, eval tasks, golden dataset, skills.json, eslint | ~23 |

Implementation Walkthrough: Messaging Resources

BEFORE — 10 files

references/resources/messaging/
├── index.md                           ← lookup table (15 lines)
├── service-bus/
│   ├── service-bus.md                 ← ARM type, SKUs, naming, properties (68 lines)
│   ├── bicep.md                       ← Bicep snippet (10 lines)
│   └── constraints.md                 ← pairing rules (12 lines)
├── event-hub/
│   ├── event-hub.md                   ← (72 lines)
│   ├── bicep.md                       ← (9 lines)
│   └── constraints.md                 ← (13 lines)
└── event-grid/
    ├── event-grid.md                  ← (68 lines)
    ├── bicep.md                       ← (6 lines)
    └── constraints.md                 ← (14 lines)

10 files, ~287 lines.

AFTER — 0 messaging-specific files

Everything folds into 2 shared files that cover ALL 48 resources:

File 1: resource-catalog.md — one row per resource:

## Messaging

| Resource | ARM Type | API Version | CAF Prefix | Scope | Region |
|----------|----------|-------------|------------|-------|--------|
| Service Bus | `Microsoft.ServiceBus/namespaces` | `2024-01-01` | `sbns` | Global | Foundational |
| Event Hub | `Microsoft.EventHub/namespaces` | `2024-01-01` | `evhns` | Global | Foundational |
| Event Grid | `Microsoft.EventGrid/topics` | `2025-02-15` | `evgt` | Region | Mainstream |
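To show that a single lookup table is enough for the agent (or a test) to resolve ARM type and API version, here is a minimal sketch that parses the catalog rows above into dictionaries. The parser and its function name are illustrative and not part of the PR. It assumes plain pipe tables with one header row per section.

```python
from typing import Dict

CATALOG = """\
## Messaging

| Resource | ARM Type | API Version | CAF Prefix | Scope | Region |
|----------|----------|-------------|------------|-------|--------|
| Service Bus | `Microsoft.ServiceBus/namespaces` | `2024-01-01` | `sbns` | Global | Foundational |
| Event Hub | `Microsoft.EventHub/namespaces` | `2024-01-01` | `evhns` | Global | Foundational |
| Event Grid | `Microsoft.EventGrid/topics` | `2025-02-15` | `evgt` | Region | Mainstream |
"""


def parse_catalog(markdown: str) -> Dict[str, Dict[str, str]]:
    """Map each resource name to its catalog row, keyed by column header."""
    rows: Dict[str, Dict[str, str]] = {}
    header = None
    for raw in markdown.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            header = None  # a table ends at the first non-table line
            continue
        cells = [c.strip().strip("`") for c in line.strip("|").split("|")]
        if header is None:
            header = cells  # first table line is the header row
            continue
        if all(set(c) <= set("-: ") for c in cells):
            continue  # separator row such as |-----|-----|
        row = dict(zip(header, cells))
        rows[row["Resource"]] = row
    return rows


if __name__ == "__main__":
    entry = parse_catalog(CATALOG)["Service Bus"]
    print(entry["ARM Type"], entry["API Version"], entry["CAF Prefix"])
```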

File 2: constraints.md — pairing rules per resource:

## Service Bus
| Paired With | Constraint |
|-------------|------------|
| Topics | Standard/Premium only. Basic = queues only |
| VNet | Private endpoints require Premium |
| Message Size | Basic/Standard: 256 KB. Premium: 100 MB |
| Function App | Needs `ServiceBusConnection` in app settings |

## Event Hub
| Paired With | Constraint |
|-------------|------------|
| Kafka | Standard/Premium only |
| Capture | Standard/Premium only |
| Retention | Basic: 1d. Standard: 7d. Premium: 90d |

## Event Grid
| Paired With | Constraint |
|-------------|------------|
| Private Endpoint | Premium SKU only |
| Managed Identity | Required for dead-letter delivery |
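As an illustration of how these static rows might drive a pairing check, the sketch below encodes two of the Service Bus rules from the table above (topics and private endpoints). The function name and parameters are hypothetical; the PR's actual pairing-checks.md logic is not shown in this review.

```python
from typing import List

VALID_SKUS = ("Basic", "Standard", "Premium")


def service_bus_violations(
    sku: str,
    uses_topics: bool = False,
    uses_private_endpoint: bool = False,
) -> List[str]:
    """Return constraint violations for a planned Service Bus namespace."""
    sku = sku.capitalize()
    if sku not in VALID_SKUS:
        raise ValueError(f"unknown Service Bus SKU: {sku}")
    violations: List[str] = []
    if uses_topics and sku == "Basic":
        # Table row "Topics": Standard/Premium only, Basic = queues only.
        violations.append("Topics require Standard or Premium; Basic supports queues only")
    if uses_private_endpoint and sku != "Premium":
        # Table row "VNet": private endpoints are a Premium feature.
        violations.append("Private endpoints require Premium")
    return violations


if __name__ == "__main__":
    print(service_bus_violations("Standard", uses_topics=True, uses_private_endpoint=True))
```

The same pattern would serve the error-path tests suggested in item 5: feed an incompatible SKU pairing and assert that a violation is reported.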

What happened to the deleted content?

| Deleted content | Now comes from | How |
|---|---|---|
| bicep.md (Bicep snippet) | mcp_bicep_get_az_resource_type_schema tool | Agent calls with ARM type from catalog → gets full schema with required properties → generates Bicep |
| SKUs, Properties from service-bus.md | Same Bicep schema tool | Schema includes SKU values (Basic, Standard, Premium), all properties with types |
| Naming rules from service-bus.md | microsoft_docs_search tool | Agent searches "Service Bus naming rules" → gets min/max length, allowed chars |
| index.md | resource-catalog.md | Folded into the shared catalog |
| constraints.md | constraints.md (shared) | Moved into the shared file; the only content that stays static |

Agent workflow at runtime

1. READ resource-catalog.md
   → "Service Bus = Microsoft.ServiceBus/namespaces @ 2024-01-01, prefix sbns"

2. CALL mcp_bicep_get_az_resource_type_schema(Microsoft.ServiceBus/namespaces, 2024-01-01)
   → Gets: full 20KB schema — all properties, SKU values, required flags
   → Replaces: service-bus.md + bicep.md

3. CALL microsoft_docs_search("Service Bus naming rules")
   → Gets: min 6, max 50, globally unique
   → Replaces: Naming section of service-bus.md

4. READ constraints.md § Service Bus
   → Gets: pairing rules (no tool provides these)
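The four steps above can be sketched as an ordered list of lookups. The tool names come from this review. The argument names (`resource_type`, `api_version`, `query`) and the step dictionary layout are assumptions for illustration, since the tools' actual parameter schemas are not shown here.

```python
from typing import Any, Dict, List


def plan_lookups(resource: str, arm_type: str, api_version: str) -> List[Dict[str, Any]]:
    """Return the ordered reads and tool calls for one resource.

    arm_type and api_version are the values found in resource-catalog.md (step 1).
    """
    return [
        {"action": "read", "target": "resource-catalog.md", "section": resource},
        {
            "action": "call",
            "tool": "mcp_bicep_get_az_resource_type_schema",
            # Argument names are hypothetical.
            "args": {"resource_type": arm_type, "api_version": api_version},
        },
        {
            "action": "call",
            "tool": "microsoft_docs_search",
            "args": {"query": f"{resource} naming rules"},  # hypothetical arg name
        },
        {"action": "read", "target": "constraints.md", "section": resource},
    ]


if __name__ == "__main__":
    for step in plan_lookups("Service Bus", "Microsoft.ServiceBus/namespaces", "2024-01-01"):
        print(step)
```

Only the first and last steps touch static files, which is the point of the restructuring: everything in between is fetched at runtime.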

Proposed Skill Structure

Summary (skill files only, excludes tests)

| Metric | Current | Proposed | Reduction |
|---|---|---|---|
| Files | 177 (11 core + 166 resource) | 13 (11 core + 2 shared resource) | -164 files (93%) |
| Lines | 4,917 (579 core + 4,338 resource) | ~1,200 (579 core + ~623 catalog/constraints) | -3,717 lines (76%) |
| Words | ~37,100 (4,587 core + 32,510 resource) | ~7,500 (4,587 core + ~2,900 catalog/constraints) | -29,600 words (80%) |
| Est. tokens | ~48,200 | ~9,700 | -38,500 tokens (80%) |

Token estimate: words × 1.3 (accounts for markdown formatting, pipes, code fences).
The 166 deleted resource files are replaced by runtime tool calls that fetch richer, always-up-to-date content (20-61KB per resource from the Bicep schema tool alone).
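For anyone who wants to re-derive the summary figures, the arithmetic can be reproduced as follows. The inputs are the counts quoted in this review. The 1.3 tokens-per-word factor is the review's own heuristic, not a tokenizer measurement.

```python
TOKENS_PER_WORD = 1.3  # heuristic stated in this review


def estimate_tokens(words: int) -> int:
    """Estimate tokens from a word count using the review's 1.3 multiplier."""
    return round(words * TOKENS_PER_WORD)


def pct_reduction(before: float, after: float) -> float:
    """Percentage reduction from before to after."""
    return (before - after) / before * 100


current_words = 4587 + 32510   # core + resource files
proposed_words = 4587 + 2900   # core + catalog/constraints (approximate)

current_tokens = estimate_tokens(current_words)
proposed_tokens = estimate_tokens(proposed_words)

if __name__ == "__main__":
    print("files :", round(pct_reduction(177, 13)), "%")
    print("lines :", round(pct_reduction(4917, 579 + 623)), "%")
    print("words :", round(pct_reduction(current_words, proposed_words)), "%")
    print("tokens:", current_tokens, "->", proposed_tokens,
          f"({round(pct_reduction(current_tokens, proposed_tokens))} %)")
```

The results agree with the rounded values in the table (93%, 76%, 80%, 80%).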

Current: 200 files

azure-infra-planner/
├── SKILL.md
└── references/
    ├── deployment.md
    ├── pairing-checks.md
    ├── plan-schema.md
    ├── research.md
    ├── resources.md
    ├── verification.md
    ├── waf-checklist.md
    ├── sample_infrastructure_plan.json
    ├── DSLs/
    │   ├── bicep/bicep-generation.md
    │   └── terraform/terraform-generation.md
    └── resources/                          ← 166 files across 48 services
        ├── ai/ (7 index + service files)
        ├── compute/ (37 files)
        ├── data/ (34 files)
        ├── messaging/ (10 files)
        ├── monitoring/ (7 files)
        ├── networking/ (56 files)
        └── security/ (9 files)

tests/azure-infra-planner/                  ← 23 test files

Proposed: ~39 files

azure-enterprise-infra-planner/
├── SKILL.md                               ← renamed, narrow triggers, enterprise focus
└── references/
    ├── research.md                        ← UPDATED: directs tool calls instead of file reads
    ├── plan-schema.md                     ← unchanged
    ├── verification.md                    ← unchanged
    ├── pairing-checks.md                  ← unchanged
    ├── waf-checklist.md                   ← unchanged
    ├── resource-catalog.md                ← NEW: single lookup table (~123 lines, all 48 resources)
    ├── constraints.md                     ← NEW: merged pairing rules (~500 lines, all 48 resources)
    ├── bicep-generation.md                ← infra-focused (subscription scope, CAF naming)
    ├── terraform-generation.md            ← infra-focused (module-per-category)
    ├── deployment.md                      ← az deployment / terraform apply (not azd)
    ├── resources.md                       ← simplified (points to catalog + tools)
    └── sample_infrastructure_plan.json    ← unchanged

tests/azure-enterprise-infra-planner/       ← 23 test files (updated paths)

Key changes

| Change | Before | After |
|---|---|---|
| Resource reference files | 166 files (48 dirs × 3-6 files) | 2 files (resource-catalog.md + constraints.md) |
| resources/ directory | 7 categories × many subdirs | Deleted entirely |
| index.md files (×7) | Category lookup tables | Folded into resource-catalog.md |
| bicep.md files (×48) | Hand-written Bicep snippets | Tool call: mcp_bicep_get_az_resource_type_schema |
| <service>.md files (×48) | SKUs, naming, properties | Tool calls: Bicep schema + microsoft_docs_search |
| constraints.md files (×48) | Pairing rules | Merged into single constraints.md |
| DSLs directory | DSLs/bicep/ and DSLs/terraform/ | Flattened to bicep-generation.md and terraform-generation.md |
| research.md | "Load these static files" | "Call these tools with ARM type from catalog" |
| Skill name | azure-infra-planner | azure-enterprise-infra-planner |
| Total files | 200 | ~39 |