kvenkatrajan/azure-skills-agent-skills-convergence.md

azure-skills + MicrosoftDocs/Agent-Skills Convergence: Diagnostics & Troubleshooting


Date: March 11, 2026
Context: Should azure-skills reference MicrosoftDocs/Agent-Skills content? Should diagnostics be split into diagnostics + troubleshooting?


1. The Problem Today

azure-diagnostics has deep, actionable guides for 2 out of 20+ relevant Azure services:


| Service | Coverage in azure-skills | Agent-Skills Troubleshooting Entries |
|---|---|---|
| Container Apps | Complete (105 lines, 5 issues, CLI commands) | 11 entries |
| Function Apps | Complete (88 lines, App Insights linking) | 23 entries |
| App Service | Minimal (1 ARG query only) | 3 entries |
| AKS | None | 17 entries |
| Cosmos DB | None | 50 entries (per-SDK, per-HTTP-error) |
| Azure Monitor | None | 42 entries |
| Azure SQL | None | Available in Agent-Skills |
| Azure Storage | None | Available in Agent-Skills |
| Key Vault | None | Available in Agent-Skills |
| API Management | Partial (AI Gateway only) | Available in Agent-Skills |

When a user says "my Cosmos DB is slow" or "my AKS pod keeps crashing," the diagnostics skill has no service-specific guidance to offer, only generic KQL templates.

2. What Each Repo Brings to the Table

azure-skills diagnostics (depth, actionability)

Strengths:
├── Inline CLI commands (az containerapp show, az functionapp ...)
├── KQL query templates (copy-paste ready)
├── Azure Resource Graph queries
├── Step-by-step fix instructions (not just "read the docs")
├── MCP tool orchestration (AppLens, Azure Monitor)
└── Plan-first diagnostic workflows

Weaknesses:
├── Only 2 services have dedicated guides
├── No error-code-level guidance
├── No per-SDK troubleshooting
├── No limits/quotas awareness
└── No best practices content

MicrosoftDocs/Agent-Skills (breadth, documentation depth)

Strengths:
├── 180+ services covered
├── Categorized by: Troubleshooting, Best Practices, Limits & Quotas,
│   Architecture, Security, Configuration, Deployment, Integrations
├── Error-code-specific docs (Cosmos HTTP 400/401/403/404/408/409/429/503)
├── Per-SDK troubleshooting (Cosmos .NET, Java v4, Python, Async Java v2)
├── Operationally relevant best practices (AKS upgrade strategies,
│   autoscale flapping, partitioning)
├── Auto-updated weekly from Microsoft Learn
└── Limits/quotas that cause operational issues

Weaknesses:
├── No inline commands or scripts
├── No actionable fix-it steps
├── Requires network access to fetch Learn content
├── No KQL/ARG queries
└── No MCP tool integration

The two approaches are complementary, not competing.

3. Recommendation: Don't Split — Expand References

Why NOT to create a separate "troubleshooting" skill


| Concern | Problem with splitting |
|---|---|
| User experience | "My app is broken" is one workflow: diagnose then troubleshoot then fix. Splitting forces two skill invocations. |
| Routing ambiguity | Is "App Service won't start" a diagnostics issue or a troubleshooting issue? The agent can't reliably distinguish. |
| Token waste | Two skills means two SKILL.md files always loaded into context for overlapping scenarios. |
| Maintenance | Two skills to update when a service's diagnostic story changes. |


What to do instead: expand references/ with Agent-Skills content

The pattern that already works in azure-skills is progressive disclosure via just-in-time (JIT) reference files: only the relevant service's reference loads when needed.

Current structure (2 services)

azure-diagnostics/
├── SKILL.md                         ← Generic diagnostic flow (~400 tokens)
├── references/
│   ├── kql-queries.md               ← Generic KQL templates
│   ├── azure-resource-graph.md      ← Generic ARG queries
│   ├── container-apps/README.md     ← Deep: 5 issues, CLI commands, fixes
│   └── functions/README.md          ← Deep: App Insights linking, deploys

Proposed structure (20+ services)

azure-diagnostics/
├── SKILL.md                         ← Generic flow + service routing table
├── references/
│   ├── kql-queries.md               ← Generic KQL templates
│   ├── azure-resource-graph.md      ← Generic ARG queries
│   │
│   ├── container-apps/README.md     ← EXISTING: deep actionable guide
│   ├── functions/README.md          ← EXISTING: deep actionable guide
│   │
│   ├── app-service/README.md        ← NEW: actionable + Learn URLs
│   ├── aks/README.md                ← NEW: actionable + Learn URLs
│   ├── cosmos-db/README.md          ← NEW: actionable + Learn URLs
│   ├── azure-sql/README.md          ← NEW: thin reference + Learn URLs
│   ├── storage/README.md            ← NEW: thin reference + Learn URLs
│   ├── key-vault/README.md          ← NEW: thin reference + Learn URLs
│   ├── monitor/README.md            ← NEW: thin reference + Learn URLs
│   └── ...

SKILL.md routing table addition

Add a service routing section to the main SKILL.md:
## Service-Specific Guides

When the user mentions a specific Azure service, load the corresponding reference:

| Service | Reference |
|---|---|
| Container Apps | [references/container-apps/README.md](references/container-apps/README.md) |
| Function Apps | [references/functions/README.md](references/functions/README.md) |
| App Service | [references/app-service/README.md](references/app-service/README.md) |
| AKS / Kubernetes | [references/aks/README.md](references/aks/README.md) |
| Cosmos DB | [references/cosmos-db/README.md](references/cosmos-db/README.md) |
| Azure SQL | [references/azure-sql/README.md](references/azure-sql/README.md) |
| Azure Storage | [references/storage/README.md](references/storage/README.md) |
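
How an agent might apply this table can be sketched in Python. This is a minimal illustration, not the skill's actual routing logic: the keyword lists and the `route_reference` helper are assumptions made for the example, and only the reference paths come from the table above.

```python
import re
from typing import Optional

# (keywords, reference path). Paths mirror the routing table; the keyword
# lists are illustrative assumptions, not part of azure-skills.
ROUTES = [
    (("container app",), "references/container-apps/README.md"),
    (("function app", "azure functions"), "references/functions/README.md"),
    (("app service", "web app"), "references/app-service/README.md"),
    (("aks", "kubernetes", "pod"), "references/aks/README.md"),
    (("cosmos",), "references/cosmos-db/README.md"),
    (("azure sql", "sql database"), "references/azure-sql/README.md"),
    (("azure storage", "storage account", "blob"), "references/storage/README.md"),
]


def route_reference(query: str) -> Optional[str]:
    """Return the reference file for the first service keyword found in the query."""
    text = query.lower()
    for keywords, path in ROUTES:
        for keyword in keywords:
            # Word boundaries keep "aks" from matching inside "breaks";
            # the optional "s" accepts plurals such as "function apps".
            if re.search(rf"\b{re.escape(keyword)}s?\b", text):
                return path
    return None  # No service named: fall back to the generic diagnostic flow.
```

A query that names no known service returns `None`, which corresponds to running the generic flow without loading any service guide.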

4. Reference File Templates

Template A: Deep Actionable Guide (for high-traffic services)

Use for services where users frequently need operational help (App Service, AKS, Cosmos DB).
# {Service} Troubleshooting

## Common Issues

| Symptom | Likely Cause | Diagnostic Command | Fix |
|---|---|---|---|
| HTTP 503 responses | Instance unhealthy | `az webapp show -n {app} -g {rg} --query state` | Restart: `az webapp restart` |
| Slow cold starts | Large app package | Check App Insights: `requests \| summarize avg(duration)` | Enable Always On or pre-warm |
| Deployment fails | SCM locked | `az webapp log deployment show -n {app} -g {rg}` | Check deployment locks |

## KQL Queries

### Recent errors for this service
```kusto
AppExceptions
| where TimeGenerated > ago(1h)
| where AppRoleName == "{service-name}"
| summarize count() by ProblemId, OuterMessage
| top 10 by count_
```

## Error Codes

| Code | Description | Learn Reference |
|---|---|---|
| HTTP 502 | Bad Gateway | [Troubleshoot HTTP 502/503](https://learn.microsoft.com/azure/app-service/troubleshoot-http-502-http-503) |
| HTTP 503 | Service Unavailable | [App Service diagnostics](https://learn.microsoft.com/azure/app-service/overview-diagnostics) |

## Further Reading
<!-- Sourced from MicrosoftDocs/Agent-Skills -->
- [Best practices for App Service](https://learn.microsoft.com/azure/app-service/deploy-best-practices)
- [Networking features](https://learn.microsoft.com/azure/app-service/networking-features)
- [App Service limits](https://learn.microsoft.com/azure/azure-resource-manager/management/azure-subscription-service-limits#app-service-limits)

Template B: Thin Reference (for long-tail services)

Use for services where you don't yet have deep actionable content. Provides instant baseline coverage with minimal effort.
# {Service} Troubleshooting

> For comprehensive troubleshooting, refer to Microsoft Learn documentation below.

## Quick Diagnostics

- Check health: `az {service} show --name {name} -g {rg} --query provisioningState`
- Resource health: Use [azure-resource-graph.md](../azure-resource-graph.md) queries
- Recent errors: Use [kql-queries.md](../kql-queries.md) templates filtered to this service

## Learn References
<!-- Sourced from MicrosoftDocs/Agent-Skills: {skill-name} -->

### Troubleshooting
- [{Title}]({learn-url})
- [{Title}]({learn-url})

### Best Practices
- [{Title}]({learn-url})

### Limits & Quotas
- [{Title}]({learn-url})
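
Filling Template B is mechanical enough to script. The sketch below renders the template from a service name and a set of links. The function name `render_thin_reference` and its parameters are hypothetical, and the output simply follows the template text above.

```python
CATEGORIES = ("Troubleshooting", "Best Practices", "Limits & Quotas")


def render_thin_reference(service, cli_group, skill_name, links):
    """Render Template B.

    links maps a category name to a list of (title, url) pairs;
    categories with no links are omitted from the output.
    """
    lines = [
        f"# {service} Troubleshooting",
        "",
        "> For comprehensive troubleshooting, refer to Microsoft Learn documentation below.",
        "",
        "## Quick Diagnostics",
        "",
        # Doubled braces keep {name} and {rg} as literal placeholders.
        f"- Check health: `az {cli_group} show --name {{name}} -g {{rg}} --query provisioningState`",
        "- Resource health: Use [azure-resource-graph.md](../azure-resource-graph.md) queries",
        "- Recent errors: Use [kql-queries.md](../kql-queries.md) templates filtered to this service",
        "",
        "## Learn References",
        f"<!-- Sourced from MicrosoftDocs/Agent-Skills: {skill_name} -->",
    ]
    for category in CATEGORIES:
        entries = links.get(category, [])
        if not entries:
            continue
        lines += ["", f"### {category}"]
        lines += [f"- [{title}]({url})" for title, url in entries]
    return "\n".join(lines) + "\n"
```

Keeping the category order fixed means regenerated files diff cleanly when only the link set changes.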

5. Token Budget Analysis


| Component | Tokens | When loaded |
|---|---|---|
| SKILL.md (with routing table) | ~500 | Always (on skill activation) |
| kql-queries.md | ~250 | On demand |
| azure-resource-graph.md | ~350 | On demand |
| Deep service guide (Template A) | ~400-600 | On demand (only 1 loaded per query) |
| Thin service reference (Template B) | ~150-250 | On demand (only 1 loaded per query) |


Typical cost per query: ~500 (SKILL.md) + ~600 (one deep guide) = ~1,100 tokens. Worst case, with both generic references (kql-queries.md and azure-resource-graph.md) also loaded: ~1,700 tokens.

Budget: SKILL.md hard limit is 5,000 tokens; references hard limit is 2,000 each. Both totals fit comfortably.

Adding 20 service reference files does NOT increase per-query context cost: only the one relevant file loads via JIT progressive disclosure.
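
The arithmetic can be written out as a small Python sketch. The constants are the upper bounds from the table in this section, and `query_cost` is a hypothetical helper used only for illustration. Note that the number of reference files on disk never appears in the calculation, which is the point of JIT loading.

```python
# Approximate token costs (upper bounds from the token budget table).
SKILL_MD = 500
GENERIC_REFS = {"kql-queries.md": 250, "azure-resource-graph.md": 350}
DEEP_GUIDE = 600   # Template A
THIN_REF = 250     # Template B


def query_cost(service_guide_tokens, generic_refs=()):
    """Tokens loaded for one query: SKILL.md, one service guide,
    and whichever generic references the agent also pulls in."""
    return SKILL_MD + service_guide_tokens + sum(GENERIC_REFS[name] for name in generic_refs)


# One deep guide, no generic references.
deep_only = query_cost(DEEP_GUIDE)
# One deep guide plus both generic references.
deep_with_generic = query_cost(DEEP_GUIDE, ("kql-queries.md", "azure-resource-graph.md"))
```

Under these assumptions `deep_only` is 1,100 tokens, and `deep_with_generic` is 1,700 tokens if the agent also loads both generic references.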

6. Concrete Example: Cosmos DB Reference File

Showing how Agent-Skills content maps into the azure-skills reference format:
# Cosmos DB Troubleshooting

## Common Issues

| Symptom | Likely Cause | Diagnostic Command | Fix |
|---|---|---|---|
| HTTP 429 (Too Many Requests) | RU limit exceeded | `az cosmosdb sql database show -a {acct} -n {db} -g {rg}` | Increase RU/s or enable autoscale |
| High latency on reads | Cross-region reads / bad partition key | App Insights: `dependencies \| where target has "cosmos"` | Review partition key strategy |
| HTTP 503 (Service Unavailable) | Transient / region failover | Check service health in portal | Retry with exponential backoff |
| Connection timeout | Firewall / VNET misconfiguration | `az cosmosdb show -n {acct} -g {rg} --query "publicNetworkAccess"` | Check firewall rules, VNET service endpoints |

## KQL Queries

### Cosmos DB dependency failures
```kusto
AppDependencies
| where TimeGenerated > ago(1h)
| where DependencyType == "Azure DocumentDB"
| where Success == false
| summarize count() by ResultCode, DependencyType, Target
| order by count_ desc
```

### Cosmos DB latency by operation
```kusto
AppDependencies
| where TimeGenerated > ago(1h)
| where DependencyType == "Azure DocumentDB"
| summarize avg(DurationMs), p95=percentile(DurationMs, 95) by OperationName
| order by avg_DurationMs desc
```

## Error Codes

| Code | Description | Learn Reference |
|---|---|---|
| HTTP 400 | Bad Request | [Troubleshoot bad requests](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-bad-request) |
| HTTP 401 | Unauthorized | [Troubleshoot unauthorized](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-unauthorized) |
| HTTP 403 | Forbidden | [Troubleshoot forbidden](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-forbidden) |
| HTTP 404 | Not Found | [Troubleshoot not found](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-not-found) |
| HTTP 408 | Request Timeout | [Troubleshoot request timeout](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-request-timeout) |
| HTTP 409 | Conflict | [Troubleshoot conflict](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-conflict) |
| HTTP 429 | Too Many Requests | [Troubleshoot rate limiting](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-request-rate-too-large) |
| HTTP 503 | Service Unavailable | [Troubleshoot service unavailable](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-service-unavailable) |

## Per-SDK Troubleshooting

| SDK | Guide |
|---|---|
| .NET SDK | [Troubleshoot .NET SDK](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-dotnet-sdk) |
| Java v4 SDK | [Troubleshoot Java v4 SDK](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-java-sdk-v4) |
| Python SDK | [Performance tips for Python SDK](https://learn.microsoft.com/azure/cosmos-db/nosql/performance-tips-python-sdk) |

## Best Practices
<!-- Sourced from MicrosoftDocs/Agent-Skills: azure-cosmos-db, Best Practices category -->
- [Performance tips for .NET SDK v3](https://learn.microsoft.com/azure/cosmos-db/nosql/performance-tips-dotnet-sdk-v3)
- [Best practices for scaling provisioned throughput](https://learn.microsoft.com/azure/cosmos-db/scaling-provisioned-throughput-best-practices)
- [Diagnose and troubleshoot query performance](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-query-performance)

## Limits & Quotas
<!-- Sourced from MicrosoftDocs/Agent-Skills: azure-cosmos-db, Limits & Quotas category -->
- [Azure Cosmos DB service quotas](https://learn.microsoft.com/azure/cosmos-db/concepts-limits)
- [RU burst capacity](https://learn.microsoft.com/azure/cosmos-db/burst-capacity)
- [Autoscale FAQ](https://learn.microsoft.com/azure/cosmos-db/nosql/autoscale-faq)

7. Implementation Plan

Phase 1: High-Impact Services (Week 1-2)

Add deep actionable guides (Template A) for the 3 most-requested services missing coverage:


| Service | Agent-Skills Entries to Source From | Estimated Effort |
|---|---|---|
| App Service | 3 troubleshooting + 8 best practices | 4 hours |
| AKS | 17 troubleshooting + 44 best practices | 6 hours |
| Cosmos DB | 50 troubleshooting + 57 best practices | 6 hours |


Deliverables:

- references/app-service/README.md
- references/aks/README.md
- references/cosmos-db/README.md
- Updated SKILL.md with service routing table
- Bump metadata.version
- Add/update tests

Phase 2: Broad Coverage (Week 3-4)

Add thin reference files (Template B) for 10+ additional services:


| Service | Effort |
|---|---|
| Azure SQL | 1 hour |
| Azure Storage (Blob, Queue, Table) | 1 hour |
| Key Vault | 1 hour |
| Azure Monitor / Log Analytics | 2 hours |
| Event Hubs | 1 hour (supplement existing messaging skill) |
| Service Bus | 1 hour (supplement existing messaging skill) |
| VMs / Virtual Machine Scale Sets | 2 hours |
| Azure Cache for Redis | 1 hour |
| Azure API Management | 1 hour |
| Azure Front Door / Application Gateway | 1 hour |


Deliverables:

- 10+ thin reference files
- Updated routing table
- Version bump + tests

Phase 3: Automation (Month 2+)

Build a script that:

1. Reads Agent-Skills SKILL.md files for target services
2. Extracts Troubleshooting, Best Practices, and Limits & Quotas URLs
3. Generates thin reference files (Template B) automatically
4. Validates URLs still resolve (404 check)
5. Runs on schedule to detect new Agent-Skills entries

MicrosoftDocs/Agent-Skills          azure-skills
┌──────────────────────┐            ┌──────────────────────────┐
│ skills/               │            │ azure-diagnostics/       │
│   azure-cosmos-db/   │  sync      │   references/            │
│     SKILL.md         │──script──▶│     cosmos-db/README.md  │
│       Troubleshooting│            │       CLI commands       │
│       Best Practices │            │       KQL queries        │
│       Limits & Quotas│            │       + Learn URLs ◄──── │
└──────────────────────┘            └──────────────────────────┘
     (auto-updated                       (actionable content
      weekly by pipeline)                 + sourced URLs)


8. What Changes in SKILL.md

The main SKILL.md needs a small addition: a routing table that tells the agent which reference to load based on the service mentioned. This does NOT require splitting the skill.

Current SKILL.md flow

1. Identify symptoms
2. Check resource health (ARG)
3. Review logs (KQL)
4. Analyze metrics
5. Investigate recent changes

Updated SKILL.md flow

1. Identify the Azure service involved
2. If service-specific guide exists → load it from references/
3. Run the generic diagnostic flow:
   a. Check resource health (ARG)
   b. Review logs (KQL)
   c. Analyze metrics
   d. Investigate recent changes
4. If issue persists → consult Learn references in the service guide

The routing step adds ~100 tokens to SKILL.md. Well within budget.
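
As a rough Python sketch of the updated flow: the service guide is optional and wraps the generic steps, which stay unchanged. `build_plan` is a hypothetical helper, and the step wording is copied from the lists above.

```python
from typing import List, Optional

GENERIC_FLOW = [
    "Check resource health (ARG)",
    "Review logs (KQL)",
    "Analyze metrics",
    "Investigate recent changes",
]


def build_plan(service: Optional[str], guide_path: Optional[str]) -> List[str]:
    """Return the ordered diagnostic steps for one query.

    The generic flow always runs; the service guide adds one step before it
    (load the guide) and one after it (consult its Learn references).
    """
    plan = [f"Identify the Azure service involved: {service or 'unknown'}"]
    if guide_path:
        plan.append(f"Load service guide: {guide_path}")
    plan.extend(GENERIC_FLOW)
    if guide_path:
        plan.append("If issue persists, consult Learn references in the service guide")
    return plan
```

When no guide exists for the service, the plan contains only the identification step and the generic flow, so the pre-change behavior is preserved.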

9. Summary


| Question | Answer |
|---|---|
| Should diagnostics be split into two skills? | No. One skill, expanded references. |
| Should azure-skills reference Agent-Skills content? | Yes. Learn URLs in `Further Reading` / `Error Codes` sections of each reference file. |
| What's the pattern? | Top of reference = actionable CLI/KQL commands. Bottom = curated Learn URLs sourced from Agent-Skills. |
| How does this scale? | JIT progressive disclosure: only one service reference loads per query. 20 files cost nothing until activated. |
| Can it be automated? | Yes. A sync script can generate thin reference files from Agent-Skills and detect new entries. |
| What's the effort? | Phase 1 (3 deep guides): ~16 hours. Phase 2 (10 thin refs): ~12 hours. Phase 3 (automation): ~1-2 weeks. |