azure-skills + MicrosoftDocs/Agent-Skills Convergence: Diagnostics & Troubleshooting Strategy
Date: March 11, 2026
Context: Should azure-skills reference MicrosoftDocs/Agent-Skills content? Should diagnostics be split into diagnostics + troubleshooting?
azure-diagnostics has deep, actionable guides for 2 out of 20+ relevant Azure services:
| Service | Coverage in azure-skills | Agent-Skills Troubleshooting Entries |
|---|---|---|
| Container Apps | Complete (105 lines, 5 issues, CLI commands) | 11 entries |
| Function Apps | Complete (88 lines, App Insights linking) | 23 entries |
| App Service | Minimal (1 ARG query only) | 3 entries |
| AKS | None | 17 entries |
| Cosmos DB | None | 50 entries (per-SDK, per-HTTP-error) |
| Azure Monitor | None | 42 entries |
| Azure SQL | None | Available in Agent-Skills |
| Azure Storage | None | Available in Agent-Skills |
| Key Vault | None | Available in Agent-Skills |
| API Management | Partial (AI Gateway only) | Available in Agent-Skills |
When a user says "my Cosmos DB is slow" or "my AKS pod keeps crashing," the diagnostics skill has nothing service-specific to offer, only generic KQL templates.
azure-skills (azure-diagnostics) strengths:
├── Inline CLI commands (az containerapp show, az functionapp ...)
├── KQL query templates (copy-paste ready)
├── Azure Resource Graph queries
├── Step-by-step fix instructions (not just "read the docs")
├── MCP tool orchestration (AppLens, Azure Monitor)
└── Plan-first diagnostic workflows
Weaknesses:
├── Only 2 services have dedicated guides
├── No error-code-level guidance
├── No per-SDK troubleshooting
├── No limits/quotas awareness
└── No best practices content
MicrosoftDocs/Agent-Skills strengths:
├── 180+ services covered
├── Categorized by: Troubleshooting, Best Practices, Limits & Quotas,
│ Architecture, Security, Configuration, Deployment, Integrations
├── Error-code-specific docs (Cosmos HTTP 400/401/403/404/408/409/429/503)
├── Per-SDK troubleshooting (Cosmos .NET, Java v4, Python, Async Java v2)
├── Operationally relevant best practices (AKS upgrade strategies,
│ autoscale flapping, partitioning)
├── Auto-updated weekly from Microsoft Learn
└── Limits/quotas that cause operational issues
Weaknesses:
├── No inline commands or scripts
├── No actionable fix-it steps
├── Requires network access to fetch Learn content
├── No KQL/ARG queries
└── No MCP tool integration
The two approaches are complementary, not competing.
| Concern | Problem with splitting |
|---|---|
| User experience | "My app is broken" is one workflow — diagnose then troubleshoot then fix. Splitting forces two skill invocations. |
| Routing ambiguity | Is "App Service won't start" a diagnostics issue or a troubleshooting issue? The agent can't reliably distinguish. |
| Token waste | Two skills means two SKILL.md files always loaded into context for overlapping scenarios. |
| Maintenance | Two skills to update when a service's diagnostic story changes. |
The pattern that already works in azure-skills is progressive disclosure via JIT reference files. Only the relevant service's reference loads when needed.
Current structure:
azure-diagnostics/
├── SKILL.md ← Generic diagnostic flow (~400 tokens)
├── references/
│ ├── kql-queries.md ← Generic KQL templates
│ ├── azure-resource-graph.md ← Generic ARG queries
│ ├── container-apps/README.md ← Deep: 5 issues, CLI commands, fixes
│ └── functions/README.md ← Deep: App Insights linking, deploys
Proposed structure:
azure-diagnostics/
├── SKILL.md ← Generic flow + service routing table
├── references/
│ ├── kql-queries.md ← Generic KQL templates
│ ├── azure-resource-graph.md ← Generic ARG queries
│ │
│ ├── container-apps/README.md ← EXISTING: deep actionable guide
│ ├── functions/README.md ← EXISTING: deep actionable guide
│ │
│ ├── app-service/README.md ← NEW: actionable + Learn URLs
│ ├── aks/README.md ← NEW: actionable + Learn URLs
│ ├── cosmos-db/README.md ← NEW: actionable + Learn URLs
│ ├── azure-sql/README.md ← NEW: thin reference + Learn URLs
│ ├── storage/README.md ← NEW: thin reference + Learn URLs
│ ├── key-vault/README.md ← NEW: thin reference + Learn URLs
│ ├── monitor/README.md ← NEW: thin reference + Learn URLs
│ └── ...
Add a service routing section to the main SKILL.md:
## Service-Specific Guides
When the user mentions a specific Azure service, load the corresponding reference:
| Service | Reference |
|---|---|
| Container Apps | [references/container-apps/README.md](references/container-apps/README.md) |
| Function Apps | [references/functions/README.md](references/functions/README.md) |
| App Service | [references/app-service/README.md](references/app-service/README.md) |
| AKS / Kubernetes | [references/aks/README.md](references/aks/README.md) |
| Cosmos DB | [references/cosmos-db/README.md](references/cosmos-db/README.md) |
| Azure SQL | [references/azure-sql/README.md](references/azure-sql/README.md) |
| Azure Storage | [references/storage/README.md](references/storage/README.md) |

Template A (deep service guide): use for services where users frequently need operational help (App Service, AKS, Cosmos DB).
# {Service} Troubleshooting
## Common Issues
| Symptom | Likely Cause | Diagnostic Command | Fix |
|---|---|---|---|
| HTTP 503 responses | Instance unhealthy | `az webapp show -n {app} -g {rg} --query state` | Restart: `az webapp restart -n {app} -g {rg}` |
| Slow cold starts | Large app package | Check App Insights: `requests \| summarize avg(duration)` | Enable Always On or pre-warm |
| Deployment fails | SCM locked | `az webapp log deployment show -n {app} -g {rg}` | Check deployment locks |
## KQL Queries
### Recent errors for this service
```kusto
AppExceptions
| where TimeGenerated > ago(1h)
| where AppRoleName == "{service-name}"
| summarize count() by ProblemId, OuterMessage
| top 10 by count_
```
## Error Codes
| Code | Description | Learn Reference |
|---|---|---|
| HTTP 502 | Bad Gateway | [Troubleshoot HTTP 502/503](https://learn.microsoft.com/azure/app-service/troubleshoot-http-502-http-503) |
| HTTP 503 | Service Unavailable | [App Service diagnostics](https://learn.microsoft.com/azure/app-service/overview-diagnostics) |
## Further Reading
<!-- Sourced from MicrosoftDocs/Agent-Skills -->
- [Best practices for App Service](https://learn.microsoft.com/azure/app-service/deploy-best-practices)
- [Networking features](https://learn.microsoft.com/azure/app-service/networking-features)
- [App Service limits](https://learn.microsoft.com/azure/azure-resource-manager/management/azure-subscription-service-limits#app-service-limits)

Template B (thin service reference): use for services where you don't yet have deep actionable content. It provides instant baseline coverage with minimal effort.
# {Service} Troubleshooting
> For comprehensive troubleshooting, refer to Microsoft Learn documentation below.
## Quick Diagnostics
- Check health: `az {service} show --name {name} -g {rg} --query provisioningState`
- Resource health: Use [azure-resource-graph.md](../azure-resource-graph.md) queries
- Recent errors: Use [kql-queries.md](../kql-queries.md) templates filtered to this service
## Learn References
<!-- Sourced from MicrosoftDocs/Agent-Skills: {skill-name} -->
### Troubleshooting
- [{Title}]({learn-url})
- [{Title}]({learn-url})
### Best Practices
- [{Title}]({learn-url})
### Limits & Quotas
- [{Title}]({learn-url})

| Component | Tokens | When loaded |
|---|---|---|
| SKILL.md (with routing table) | ~500 | Always (on skill activation) |
| kql-queries.md | ~250 | On demand |
| azure-resource-graph.md | ~350 | On demand |
| Deep service guide (Template A) | ~400-600 | On demand (only 1 loaded per query) |
| Thin service reference (Template B) | ~150-250 | On demand (only 1 loaded per query) |
Worst case per query: ~500 (SKILL.md) + ~600 (one deep guide) = ~1,100 tokens
Budget: SKILL.md hard limit is 5,000 tokens; references hard limit is 2,000 each. This fits comfortably.
Adding 20 service reference files does NOT increase context cost — only the one relevant file loads via JIT progressive disclosure.
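The arithmetic behind this budget can be checked with a short sketch. The figures are copied from the table above, taking the upper bound of each range; the names `query_cost` and `over_budget` are illustrative and not part of azure-skills.

```python
# Token estimates from the table above (upper bound of each range).
SKILL_MD_TOKENS = 500
REFERENCE_TOKENS = {
    "kql-queries.md": 250,
    "azure-resource-graph.md": 350,
    "deep service guide (Template A)": 600,
    "thin service reference (Template B)": 250,
}

# Hard limits quoted in the budget line above.
SKILL_MD_LIMIT = 5_000
REFERENCE_LIMIT = 2_000


def query_cost(loaded_references):
    """Return the context cost of SKILL.md plus the references loaded for one query."""
    return SKILL_MD_TOKENS + sum(REFERENCE_TOKENS[name] for name in loaded_references)


def over_budget():
    """Return the names of components whose estimate exceeds its hard limit."""
    offenders = [
        name for name, tokens in REFERENCE_TOKENS.items() if tokens > REFERENCE_LIMIT
    ]
    if SKILL_MD_TOKENS > SKILL_MD_LIMIT:
        offenders.append("SKILL.md")
    return offenders


# One deep guide on top of SKILL.md, as in the estimate above.
one_deep_guide = query_cost(["deep service guide (Template A)"])

# Same query if the generic KQL and ARG templates are loaded as well.
with_generic_templates = query_cost(
    ["kql-queries.md", "azure-resource-graph.md", "deep service guide (Template A)"]
)

print(one_deep_guide, with_generic_templates, over_budget())
```

With these figures the single-deep-guide case comes to 1,100 tokens. If the generic KQL and ARG templates are loaded for the same query, a case the estimate above does not count, the same arithmetic gives 1,700, which is still well under the limits.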
The following example shows how Agent-Skills content maps into the azure-skills reference format:
# Cosmos DB Troubleshooting
## Common Issues
| Symptom | Likely Cause | Diagnostic Command | Fix |
|---|---|---|---|
| HTTP 429 (Too Many Requests) | RU limit exceeded | `az cosmosdb sql database show -a {acct} -n {db} -g {rg}` | Increase RU/s or enable autoscale |
| High latency on reads | Cross-region reads / bad partition key | App Insights: `dependencies \| where target has "cosmos"` | Review partition key strategy |
| HTTP 503 (Service Unavailable) | Transient / region failover | Check service health in portal | Retry with exponential backoff |
| Connection timeout | Firewall / VNET misconfiguration | `az cosmosdb show -n {acct} -g {rg} --query "publicNetworkAccess"` | Check firewall rules, VNET service endpoints |
## KQL Queries
### Cosmos DB dependency failures
```kusto
AppDependencies
| where TimeGenerated > ago(1h)
| where DependencyType == "Azure DocumentDB"
| where Success == false
| summarize count() by ResultCode, DependencyType, Target
| order by count_ desc
```
### Cosmos DB latency by operation
```kusto
AppDependencies
| where DependencyType == "Azure DocumentDB"
| summarize avg(DurationMs), p95=percentile(DurationMs, 95) by OperationName
| order by avg_DurationMs desc
```
## Error Codes
| Code | Description | Learn Reference |
|---|---|---|
| HTTP 400 | Bad Request | [Troubleshoot bad requests](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-bad-request) |
| HTTP 401 | Unauthorized | [Troubleshoot unauthorized](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-unauthorized) |
| HTTP 403 | Forbidden | [Troubleshoot forbidden](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-forbidden) |
| HTTP 404 | Not Found | [Troubleshoot not found](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-not-found) |
| HTTP 408 | Request Timeout | [Troubleshoot request timeout](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-request-timeout) |
| HTTP 409 | Conflict | [Troubleshoot conflict](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-conflict) |
| HTTP 429 | Too Many Requests | [Troubleshoot rate limiting](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-request-rate-too-large) |
| HTTP 503 | Service Unavailable | [Troubleshoot service unavailable](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-service-unavailable) |
## Per-SDK Troubleshooting
| SDK | Guide |
|---|---|
| .NET SDK | [Troubleshoot .NET SDK](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-dotnet-sdk) |
| Java v4 SDK | [Troubleshoot Java v4 SDK](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-java-sdk-v4) |
| Python SDK | [Performance tips for Python SDK](https://learn.microsoft.com/azure/cosmos-db/nosql/performance-tips-python-sdk) |
## Best Practices
<!-- Sourced from MicrosoftDocs/Agent-Skills: azure-cosmos-db, Best Practices category -->
- [Performance tips for .NET SDK v3](https://learn.microsoft.com/azure/cosmos-db/nosql/performance-tips-dotnet-sdk-v3)
- [Best practices for scaling provisioned throughput](https://learn.microsoft.com/azure/cosmos-db/scaling-provisioned-throughput-best-practices)
- [Diagnose and troubleshoot query performance](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-query-performance)
## Limits & Quotas
<!-- Sourced from MicrosoftDocs/Agent-Skills: azure-cosmos-db, Limits & Quotas category -->
- [Azure Cosmos DB service quotas](https://learn.microsoft.com/azure/cosmos-db/concepts-limits)
- [RU burst capacity](https://learn.microsoft.com/azure/cosmos-db/burst-capacity)
- [Autoscale FAQ](https://learn.microsoft.com/azure/cosmos-db/nosql/autoscale-faq)

Phase 1: add deep actionable guides (Template A) for the 3 most-requested services missing coverage:
| Service | Agent-Skills Entries to Source From | Estimated Effort |
|---|---|---|
| App Service | 3 troubleshooting + 8 best practices | 4 hours |
| AKS | 17 troubleshooting + 44 best practices | 6 hours |
| Cosmos DB | 50 troubleshooting + 57 best practices | 6 hours |
Deliverables:
- `references/app-service/README.md`
- `references/aks/README.md`
- `references/cosmos-db/README.md`
- Updated SKILL.md with service routing table
- Bump `metadata.version`
- Add/update tests
Phase 2: add thin reference files (Template B) for 10+ additional services:
| Service | Effort |
|---|---|
| Azure SQL | 1 hour |
| Azure Storage (Blob, Queue, Table) | 1 hour |
| Key Vault | 1 hour |
| Azure Monitor / Log Analytics | 2 hours |
| Event Hubs | 1 hour (supplement existing messaging skill) |
| Service Bus | 1 hour (supplement existing messaging skill) |
| VMs / Virtual Machine Scale Sets | 2 hours |
| Azure Cache for Redis | 1 hour |
| Azure API Management | 1 hour |
| Azure Front Door / Application Gateway | 1 hour |
Deliverables:
- 10+ thin reference files
- Updated routing table
- Version bump + tests
Phase 3: build a script that:
- Reads Agent-Skills SKILL.md files for target services
- Extracts Troubleshooting, Best Practices, and Limits & Quotas URLs
- Generates thin reference files (Template B) automatically
- Validates URLs still resolve (404 check)
- Runs on schedule to detect new Agent-Skills entries
MicrosoftDocs/Agent-Skills azure-skills
┌──────────────────────┐ ┌──────────────────────────┐
│ skills/ │ │ azure-diagnostics/ │
│ azure-cosmos-db/ │ sync │ references/ │
│ SKILL.md │──script──▶│ cosmos-db/README.md │
│ Troubleshooting│ │ CLI commands │
│ Best Practices │ │ KQL queries │
│ Limits & Quotas│ │ + Learn URLs ◄──── │
└──────────────────────┘ └──────────────────────────┘
(auto-updated (actionable content
weekly by pipeline) + sourced URLs)
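A minimal sketch of the extract, generate, and validate steps follows. It assumes that an Agent-Skills SKILL.md groups its Learn links as Markdown bullets under headings named after the three categories; that layout, the sample text, and every function name here are assumptions for illustration, not the actual Agent-Skills format or an existing script. The URL check takes a caller-supplied `probe` function so that the sketch makes no network calls, and the Quick Diagnostics section of Template B is left for hand authoring.

```python
import re

CATEGORIES = ("Troubleshooting", "Best Practices", "Limits & Quotas")
LINK = re.compile(r"\[([^\]]+)\]\((https://learn\.microsoft\.com/[^)\s]+)\)")
HEADING = re.compile(r"^#{1,6}\s+(.*?)\s*$")


def extract_links(skill_md):
    """Group (title, url) pairs by the category heading they appear under."""
    found = {category: [] for category in CATEGORIES}
    current = None
    for line in skill_md.splitlines():
        heading = HEADING.match(line)
        if heading:
            title = heading.group(1)
            # Headings outside the three target categories switch collection off.
            current = title if title in CATEGORIES else None
            continue
        if current:
            found[current].extend(LINK.findall(line))
    return found


def render_thin_reference(service, skill_name, links):
    """Render the Learn References part of a Template B file."""
    lines = [
        f"# {service} Troubleshooting",
        "> For comprehensive troubleshooting, refer to Microsoft Learn documentation below.",
        "## Learn References",
        f"<!-- Sourced from MicrosoftDocs/Agent-Skills: {skill_name} -->",
    ]
    for category in CATEGORIES:
        if links[category]:
            lines.append(f"### {category}")
            lines.extend(f"- [{title}]({url})" for title, url in links[category])
    return "\n".join(lines) + "\n"


def find_dead_urls(links, probe):
    """Return every URL for which probe(url) reports HTTP 404."""
    return [url for entries in links.values() for _, url in entries if probe(url) == 404]


# Hypothetical SKILL.md excerpt in the assumed layout.
SAMPLE = """\
# azure-cosmos-db
## Troubleshooting
- [Troubleshoot bad requests](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-bad-request)
## Architecture
- Entries under other categories are ignored.
## Limits & Quotas
- [Azure Cosmos DB service quotas](https://learn.microsoft.com/azure/cosmos-db/concepts-limits)
"""

links = extract_links(SAMPLE)
print(render_thin_reference("Cosmos DB", "azure-cosmos-db", links))
```

In a scheduled run, `probe` could wrap an HTTP HEAD request and the rendered text could be diffed against the committed reference file to detect new Agent-Skills entries; both are left out here to keep the sketch self-contained.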
The main SKILL.md needs a small addition — a routing table that tells the agent which reference to load based on the service mentioned. This does NOT require splitting the skill.
Current flow:
1. Identify symptoms
2. Check resource health (ARG)
3. Review logs (KQL)
4. Analyze metrics
5. Investigate recent changes
Proposed flow:
1. Identify the Azure service involved
2. If service-specific guide exists → load it from references/
3. Run the generic diagnostic flow:
a. Check resource health (ARG)
b. Review logs (KQL)
c. Analyze metrics
d. Investigate recent changes
4. If issue persists → consult Learn references in the service guide
The routing step adds ~100 tokens to SKILL.md. Well within budget.
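The routing step can be made concrete with a small lookup. In practice the agent performs this routing by reading the table in SKILL.md; the sketch below only makes the decision explicit. The keyword lists and the function names `pick_reference` and `plan` are hypothetical, and the reference paths are the ones proposed in the routing table.

```python
import re

# Hypothetical keyword map mirroring the proposed routing table.
ROUTES = [
    (("container app",), "references/container-apps/README.md"),
    (("function app", "azure functions"), "references/functions/README.md"),
    (("app service", "web app"), "references/app-service/README.md"),
    (("aks", "kubernetes"), "references/aks/README.md"),
    (("cosmos",), "references/cosmos-db/README.md"),
    (("azure sql", "sql database"), "references/azure-sql/README.md"),
    (("storage", "blob"), "references/storage/README.md"),
]

GENERIC_STEPS = [
    "Check resource health (ARG)",
    "Review logs (KQL)",
    "Analyze metrics",
    "Investigate recent changes",
]


def pick_reference(message):
    """Return the reference path for the first service keyword found, else None."""
    text = message.lower()
    for keywords, path in ROUTES:
        for keyword in keywords:
            # Match at a word start so "aks" does not fire inside "leaks",
            # while plurals such as "container apps" still match.
            if re.search(r"\b" + re.escape(keyword), text):
                return path
    return None


def plan(message):
    """Build the diagnostic plan: optional service guide, then the generic flow."""
    reference = pick_reference(message)
    steps = ["Identify the Azure service involved"]
    if reference:
        steps.append(f"Load {reference}")
    steps.extend(GENERIC_STEPS)
    if reference:
        steps.append("If the issue persists, consult the Learn references in the service guide")
    return steps


for step in plan("my AKS pod keeps crashing"):
    print(step)
```

When no keyword matches, the plan falls back to the generic flow alone, which matches the behavior of the skill before the routing table is added.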
| Question | Answer |
|---|---|
| Should diagnostics be split into two skills? | No. One skill, expanded references. |
| Should azure-skills reference Agent-Skills content? | Yes. Learn URLs in Further Reading / Error Codes sections of each reference file. |
| What's the pattern? | Top of reference = actionable CLI/KQL commands. Bottom = curated Learn URLs sourced from Agent-Skills. |
| How does this scale? | JIT progressive disclosure — only one service reference loads per query. 20 files cost nothing until activated. |
| Can it be automated? | Yes — a sync script can generate thin reference files from Agent-Skills and detect new entries. |
| What's the effort? | Phase 1 (3 deep guides): ~16 hours. Phase 2 (10 thin refs): ~12 hours. Phase 3 (automation): ~1-2 weeks. |