@kvenkatrajan
Last active March 11, 2026 14:16
azure-skills + MicrosoftDocs/Agent-Skills Convergence: Diagnostics & Troubleshooting Strategy

azure-skills + MicrosoftDocs/Agent-Skills Convergence: Diagnostics & Troubleshooting

Date: March 11, 2026

Context: Should azure-skills reference MicrosoftDocs/Agent-Skills content? Should diagnostics be split into diagnostics + troubleshooting?


1. The Problem Today

azure-diagnostics has deep, actionable guides for 2 out of 20+ relevant Azure services:

| Service | Coverage in azure-skills | Agent-Skills Troubleshooting Entries |
|---|---|---|
| Container Apps | Complete (105 lines, 5 issues, CLI commands) | 11 entries |
| Function Apps | Complete (88 lines, App Insights linking) | 23 entries |
| App Service | Minimal (1 ARG query only) | 3 entries |
| AKS | None | 17 entries |
| Cosmos DB | None | 50 entries (per-SDK, per-HTTP-error) |
| Azure Monitor | None | 42 entries |
| Azure SQL | None | Available in Agent-Skills |
| Azure Storage | None | Available in Agent-Skills |
| Key Vault | None | Available in Agent-Skills |
| API Management | Partial (AI Gateway only) | Available in Agent-Skills |

When a user says "my Cosmos DB is slow" or "my AKS pod keeps crashing," the diagnostics skill has nothing service-specific to offer, only generic KQL templates.


2. What Each Repo Brings to the Table

azure-skills diagnostics (depth, actionability)

Strengths:
├── Inline CLI commands (az containerapp show, az functionapp ...)
├── KQL query templates (copy-paste ready)
├── Azure Resource Graph queries
├── Step-by-step fix instructions (not just "read the docs")
├── MCP tool orchestration (AppLens, Azure Monitor)
└── Plan-first diagnostic workflows

Weaknesses:
├── Only 2 services have dedicated guides
├── No error-code-level guidance
├── No per-SDK troubleshooting
├── No limits/quotas awareness
└── No best practices content

MicrosoftDocs/Agent-Skills (breadth, documentation depth)

Strengths:
├── 180+ services covered
├── Categorized by: Troubleshooting, Best Practices, Limits & Quotas,
│   Architecture, Security, Configuration, Deployment, Integrations
├── Error-code-specific docs (Cosmos HTTP 400/401/403/404/408/409/429/503)
├── Per-SDK troubleshooting (Cosmos .NET, Java v4, Python, Async Java v2)
├── Operationally relevant best practices (AKS upgrade strategies,
│   autoscale flapping, partitioning)
├── Auto-updated weekly from Microsoft Learn
└── Limits/quotas that cause operational issues

Weaknesses:
├── No inline commands or scripts
├── No actionable fix-it steps
├── Requires network access to fetch Learn content
├── No KQL/ARG queries
└── No MCP tool integration

The two approaches are complementary, not competing.


3. Recommendation: Don't Split — Expand References

Why NOT to create a separate "troubleshooting" skill

| Concern | Problem with splitting |
|---|---|
| User experience | "My app is broken" is one workflow: diagnose, then troubleshoot, then fix. Splitting forces two skill invocations. |
| Routing ambiguity | Is "App Service won't start" a diagnostics issue or a troubleshooting issue? The agent can't reliably distinguish. |
| Token waste | Two skills means two SKILL.md files always loaded into context for overlapping scenarios. |
| Maintenance | Two skills to update when a service's diagnostic story changes. |

What to do instead: expand references/ with Agent-Skills content

The pattern that already works in azure-skills is progressive disclosure via just-in-time (JIT) reference files: SKILL.md stays small, and only the relevant service's reference loads when needed.

Current structure (2 services)

azure-diagnostics/
├── SKILL.md                         ← Generic diagnostic flow (~400 tokens)
├── references/
│   ├── kql-queries.md               ← Generic KQL templates
│   ├── azure-resource-graph.md      ← Generic ARG queries
│   ├── container-apps/README.md     ← Deep: 5 issues, CLI commands, fixes
│   └── functions/README.md          ← Deep: App Insights linking, deploys

Proposed structure (20+ services)

azure-diagnostics/
├── SKILL.md                         ← Generic flow + service routing table
├── references/
│   ├── kql-queries.md               ← Generic KQL templates
│   ├── azure-resource-graph.md      ← Generic ARG queries
│   │
│   ├── container-apps/README.md     ← EXISTING: deep actionable guide
│   ├── functions/README.md          ← EXISTING: deep actionable guide
│   │
│   ├── app-service/README.md        ← NEW: actionable + Learn URLs
│   ├── aks/README.md                ← NEW: actionable + Learn URLs
│   ├── cosmos-db/README.md          ← NEW: actionable + Learn URLs
│   ├── azure-sql/README.md          ← NEW: thin reference + Learn URLs
│   ├── storage/README.md            ← NEW: thin reference + Learn URLs
│   ├── key-vault/README.md          ← NEW: thin reference + Learn URLs
│   ├── monitor/README.md            ← NEW: thin reference + Learn URLs
│   └── ...

SKILL.md routing table addition

Add a service routing section to the main SKILL.md:

## Service-Specific Guides

When the user mentions a specific Azure service, load the corresponding reference:

| Service | Reference |
|---|---|
| Container Apps | [references/container-apps/README.md](references/container-apps/README.md) |
| Function Apps | [references/functions/README.md](references/functions/README.md) |
| App Service | [references/app-service/README.md](references/app-service/README.md) |
| AKS / Kubernetes | [references/aks/README.md](references/aks/README.md) |
| Cosmos DB | [references/cosmos-db/README.md](references/cosmos-db/README.md) |
| Azure SQL | [references/azure-sql/README.md](references/azure-sql/README.md) |
| Azure Storage | [references/storage/README.md](references/storage/README.md) |
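
In practice the agent does this routing by reading the table, but the lookup it implies can be sketched in a few lines of Python. This is a minimal illustration, not part of azure-skills: the paths mirror the routing table, while `ROUTES`, `resolve_reference`, and the keyword lists are hypothetical names and values chosen for the example.

```python
import re
from typing import Optional

# Paths mirror the routing table; the keywords are illustrative assumptions.
ROUTES = {
    "references/container-apps/README.md": ["container app"],
    "references/functions/README.md": ["function app", "azure functions"],
    "references/app-service/README.md": ["app service", "web app"],
    "references/aks/README.md": ["aks", "kubernetes"],
    "references/cosmos-db/README.md": ["cosmos"],
    "references/azure-sql/README.md": ["azure sql", "sql database"],
    "references/storage/README.md": ["storage", "blob"],
}


def resolve_reference(message: str) -> Optional[str]:
    """Return the first reference whose keyword starts a word in the message."""
    text = message.lower()
    for path, keywords in ROUTES.items():
        for keyword in keywords:
            # \b keeps "aks" from matching inside words such as "breaks".
            if re.search(rf"\b{re.escape(keyword)}", text):
                return path
    return None  # No service guide: fall back to the generic flow.
```

For example, `resolve_reference("my AKS pod keeps crashing")` returns `references/aks/README.md`, and a message that names no known service returns `None`, so only the generic diagnostic flow applies.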

4. Reference File Templates

Template A: Deep Actionable Guide (for high-traffic services)

Use for services where users frequently need operational help (App Service, AKS, Cosmos DB).

# {Service} Troubleshooting

## Common Issues

| Symptom | Likely Cause | Diagnostic Command | Fix |
|---|---|---|---|
| HTTP 503 responses | Instance unhealthy | `az webapp show -n {app} -g {rg} --query state` | Restart: `az webapp restart -n {app} -g {rg}` |
| Slow cold starts | Large app package | Check App Insights: `requests \| summarize avg(duration)` | Enable Always On or pre-warm |
| Deployment fails | SCM locked | `az webapp log deployment show -n {app} -g {rg}` | Check deployment locks |

## KQL Queries

### Recent errors for this service
```kusto
AppExceptions
| where TimeGenerated > ago(1h)
| where AppRoleName == "{service-name}"
| summarize count() by ProblemId, OuterMessage
| top 10 by count_
```

## Error Codes

| Code | Description | Learn Reference |
|---|---|---|
| HTTP 502 | Bad Gateway | [Troubleshoot HTTP 502/503](https://learn.microsoft.com/azure/app-service/troubleshoot-http-502-http-503) |
| HTTP 503 | Service Unavailable | [App Service diagnostics](https://learn.microsoft.com/azure/app-service/overview-diagnostics) |

## Further Reading
<!-- Sourced from MicrosoftDocs/Agent-Skills -->
- [Best practices for App Service](https://learn.microsoft.com/azure/app-service/deploy-best-practices)
- [Networking features](https://learn.microsoft.com/azure/app-service/networking-features)
- [App Service limits](https://learn.microsoft.com/azure/azure-resource-manager/management/azure-subscription-service-limits#app-service-limits)

Template B: Thin Reference (for long-tail services)

Use for services where you don't yet have deep actionable content. Provides instant baseline coverage with minimal effort.

# {Service} Troubleshooting

> For comprehensive troubleshooting, refer to Microsoft Learn documentation below.

## Quick Diagnostics

- Check health: `az {service} show --name {name} -g {rg} --query provisioningState`
- Resource health: Use [azure-resource-graph.md](../azure-resource-graph.md) queries
- Recent errors: Use [kql-queries.md](../kql-queries.md) templates filtered to this service

## Learn References
<!-- Sourced from MicrosoftDocs/Agent-Skills: {skill-name} -->

### Troubleshooting
- [{Title}]({learn-url})
- [{Title}]({learn-url})

### Best Practices
- [{Title}]({learn-url})

### Limits & Quotas
- [{Title}]({learn-url})
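
Because Template B is purely structural, it can be filled in mechanically. The sketch below shows one way to render it in Python; `render_thin_reference` and the `Links` type are hypothetical names, and the `cli_group` argument assumes the service has an `az <group> show` command, which does not hold for every service.

```python
from typing import Dict, List, Tuple

# Category name -> list of (title, Learn URL) pairs.
Links = Dict[str, List[Tuple[str, str]]]


def render_thin_reference(service: str, cli_group: str, links: Links) -> str:
    """Render a Template B style reference; categories without links are skipped."""
    lines = [
        f"# {service} Troubleshooting",
        "",
        "> For comprehensive troubleshooting, refer to Microsoft Learn documentation below.",
        "",
        "## Quick Diagnostics",
        "",
        # The literal {name} and {rg} placeholders are left for the reader to fill in.
        "- Check health: `az " + cli_group + " show --name {name} -g {rg} --query provisioningState`",
        "- Resource health: Use [azure-resource-graph.md](../azure-resource-graph.md) queries",
        "- Recent errors: Use [kql-queries.md](../kql-queries.md) templates filtered to this service",
        "",
        "## Learn References",
    ]
    for category in ("Troubleshooting", "Best Practices", "Limits & Quotas"):
        entries = links.get(category, [])
        if not entries:
            continue
        lines += ["", f"### {category}"]
        lines += [f"- [{title}]({url})" for title, url in entries]
    return "\n".join(lines) + "\n"
```

Calling it with `"Cosmos DB"`, `"cosmosdb"`, and a dictionary holding one Limits & Quotas link produces a file with the Quick Diagnostics section and a single `### Limits & Quotas` list.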

5. Token Budget Analysis

| Component | Tokens | When loaded |
|---|---|---|
| SKILL.md (with routing table) | ~500 | Always (on skill activation) |
| kql-queries.md | ~250 | On demand |
| azure-resource-graph.md | ~350 | On demand |
| Deep service guide (Template A) | ~400-600 | On demand (only 1 loaded per query) |
| Thin service reference (Template B) | ~150-250 | On demand (only 1 loaded per query) |

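Worst case per query: ~500 (SKILL.md) + ~600 (one deep guide) = ~1,100 tokens. If the generic kql-queries.md and azure-resource-graph.md references are also loaded on demand, that adds ~250 + ~350, for ~1,700 tokens in total.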

Budget: SKILL.md hard limit is 5,000 tokens; references hard limit is 2,000 each. This fits comfortably.

Adding 20 service reference files does NOT increase context cost — only the one relevant file loads via JIT progressive disclosure.
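
The arithmetic can be checked with a small Python sketch. The numbers are the upper-bound estimates from the table above, not measurements, and `ESTIMATES`, `context_cost`, and `over_limit` are hypothetical names used only for this illustration.

```python
from typing import Dict, Iterable, List

SKILL_MD_LIMIT = 5_000   # hard limit for SKILL.md, in tokens
REFERENCE_LIMIT = 2_000  # hard limit for each reference file, in tokens

# Upper-bound estimates taken from the token budget table.
ESTIMATES: Dict[str, int] = {
    "SKILL.md": 500,
    "kql-queries.md": 250,
    "azure-resource-graph.md": 350,
    "deep-guide": 600,
    "thin-reference": 250,
}


def context_cost(loaded: Iterable[str]) -> int:
    """Total tokens for the files actually loaded into context for one query."""
    return sum(ESTIMATES[name] for name in loaded)


def over_limit(estimates: Dict[str, int]) -> List[str]:
    """Names of files whose estimate exceeds the applicable hard limit."""
    return [
        name
        for name, tokens in estimates.items()
        if tokens > (SKILL_MD_LIMIT if name == "SKILL.md" else REFERENCE_LIMIT)
    ]
```

With these estimates, `context_cost(["SKILL.md", "deep-guide"])` is 1,100 and `over_limit(ESTIMATES)` is empty. The cost depends only on the files loaded for a query, not on how many reference files exist in the repository.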


6. Concrete Example: Cosmos DB Reference File

Showing how Agent-Skills content maps into the azure-skills reference format:

# Cosmos DB Troubleshooting

## Common Issues

| Symptom | Likely Cause | Diagnostic Command | Fix |
|---|---|---|---|
| HTTP 429 (Too Many Requests) | RU limit exceeded | `az cosmosdb sql database show -a {acct} -n {db} -g {rg}` | Increase RU/s or enable autoscale |
| High latency on reads | Cross-region reads / bad partition key | App Insights: `dependencies \| where target has "cosmos"` | Review partition key strategy |
| HTTP 503 (Service Unavailable) | Transient / region failover | Check service health in portal | Retry with exponential backoff |
| Connection timeout | Firewall / VNET misconfiguration | `az cosmosdb show -n {acct} -g {rg} --query "publicNetworkAccess"` | Check firewall rules, VNET service endpoints |

## KQL Queries

### Cosmos DB dependency failures
```kusto
AppDependencies
| where TimeGenerated > ago(1h)
| where DependencyType == "Azure DocumentDB"
| where Success == false
| summarize count() by ResultCode, DependencyType, Target
| order by count_ desc
```

### Cosmos DB latency by operation
```kusto
AppDependencies
| where DependencyType == "Azure DocumentDB"
| summarize avg(DurationMs), p95=percentile(DurationMs, 95) by OperationName
| order by avg_DurationMs desc
```

## Error Codes

| Code | Description | Learn Reference |
|---|---|---|
| HTTP 400 | Bad Request | [Troubleshoot bad requests](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-bad-request) |
| HTTP 401 | Unauthorized | [Troubleshoot unauthorized](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-unauthorized) |
| HTTP 403 | Forbidden | [Troubleshoot forbidden](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-forbidden) |
| HTTP 404 | Not Found | [Troubleshoot not found](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-not-found) |
| HTTP 408 | Request Timeout | [Troubleshoot request timeout](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-request-timeout) |
| HTTP 409 | Conflict | [Troubleshoot conflict](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-conflict) |
| HTTP 429 | Too Many Requests | [Troubleshoot rate limiting](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-request-rate-too-large) |
| HTTP 503 | Service Unavailable | [Troubleshoot service unavailable](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-service-unavailable) |

## Per-SDK Troubleshooting

| SDK | Guide |
|---|---|
| .NET SDK | [Troubleshoot .NET SDK](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-dotnet-sdk) |
| Java v4 SDK | [Troubleshoot Java v4 SDK](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-java-sdk-v4) |
| Python SDK | [Performance tips for Python SDK](https://learn.microsoft.com/azure/cosmos-db/nosql/performance-tips-python-sdk) |

## Best Practices
<!-- Sourced from MicrosoftDocs/Agent-Skills: azure-cosmos-db, Best Practices category -->
- [Performance tips for .NET SDK v3](https://learn.microsoft.com/azure/cosmos-db/nosql/performance-tips-dotnet-sdk-v3)
- [Best practices for scaling provisioned throughput](https://learn.microsoft.com/azure/cosmos-db/scaling-provisioned-throughput-best-practices)
- [Diagnose and troubleshoot query performance](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-query-performance)

## Limits & Quotas
<!-- Sourced from MicrosoftDocs/Agent-Skills: azure-cosmos-db, Limits & Quotas category -->
- [Azure Cosmos DB service quotas](https://learn.microsoft.com/azure/cosmos-db/concepts-limits)
- [RU burst capacity](https://learn.microsoft.com/azure/cosmos-db/burst-capacity)
- [Autoscale FAQ](https://learn.microsoft.com/azure/cosmos-db/nosql/autoscale-faq)

7. Implementation Plan

Phase 1: High-Impact Services (Week 1-2)

Add deep actionable guides (Template A) for the 3 most-requested services missing coverage:

| Service | Agent-Skills Entries to Source From | Estimated Effort |
|---|---|---|
| App Service | 3 troubleshooting + 8 best practices | 4 hours |
| AKS | 17 troubleshooting + 44 best practices | 6 hours |
| Cosmos DB | 50 troubleshooting + 57 best practices | 6 hours |

Deliverables:

  • references/app-service/README.md
  • references/aks/README.md
  • references/cosmos-db/README.md
  • Updated SKILL.md with service routing table
  • Bump metadata.version
  • Add/update tests

Phase 2: Broad Coverage (Week 3-4)

Add thin reference files (Template B) for 10+ additional services:

| Service | Effort |
|---|---|
| Azure SQL | 1 hour |
| Azure Storage (Blob, Queue, Table) | 1 hour |
| Key Vault | 1 hour |
| Azure Monitor / Log Analytics | 2 hours |
| Event Hubs | 1 hour (supplement existing messaging skill) |
| Service Bus | 1 hour (supplement existing messaging skill) |
| VMs / Virtual Machine Scale Sets | 2 hours |
| Azure Cache for Redis | 1 hour |
| Azure API Management | 1 hour |
| Azure Front Door / Application Gateway | 1 hour |

Deliverables:

  • 10+ thin reference files
  • Updated routing table
  • Version bump + tests

Phase 3: Automation (Month 2+)

Build a script that:

  1. Reads Agent-Skills SKILL.md files for target services
  2. Extracts Troubleshooting, Best Practices, and Limits & Quotas URLs
  3. Generates thin reference files (Template B) automatically
  4. Validates URLs still resolve (404 check)
  5. Runs on schedule to detect new Agent-Skills entries

MicrosoftDocs/Agent-Skills          azure-skills
┌──────────────────────┐            ┌──────────────────────────┐
│ skills/               │            │ azure-diagnostics/       │
│   azure-cosmos-db/   │  sync      │   references/            │
│     SKILL.md         │──script──▶│     cosmos-db/README.md  │
│       Troubleshooting│            │       CLI commands       │
│       Best Practices │            │       KQL queries        │
│       Limits & Quotas│            │       + Learn URLs ◄──── │
└──────────────────────┘            └──────────────────────────┘
     (auto-updated                       (actionable content
      weekly by pipeline)                 + sourced URLs)
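
A minimal Python sketch of steps 2, 4, and 5 follows. It rests on an assumption about the input: that an Agent-Skills SKILL.md groups its Learn links as markdown link lines under headings named after the categories. The real file layout may differ, and `extract_links`, `new_entries`, and `url_resolves` are hypothetical helper names.

```python
import re
import urllib.error
import urllib.request
from typing import Dict, List, Tuple

Links = Dict[str, List[Tuple[str, str]]]

CATEGORIES = ("Troubleshooting", "Best Practices", "Limits & Quotas")
HEADING = re.compile(r"^#{1,6}\s+(.*\S)\s*$")
LINK = re.compile(r"\[([^\]]+)\]\((https://learn\.microsoft\.com/[^)\s]+)\)")


def extract_links(skill_md: str) -> Links:
    """Collect (title, url) pairs that appear under the target category headings."""
    found: Links = {category: [] for category in CATEGORIES}
    current = None
    for line in skill_md.splitlines():
        heading = HEADING.match(line)
        if heading:
            title = heading.group(1)
            # Any other heading ends the current category.
            current = title if title in CATEGORIES else None
            continue
        if current:
            found[current].extend(LINK.findall(line))
    return found


def new_entries(previous: Links, latest: Links) -> Links:
    """Links present in the latest extraction but absent from the previous one."""
    fresh: Links = {}
    for category, links in latest.items():
        seen = {url for _, url in previous.get(category, [])}
        added = [(title, url) for title, url in links if url not in seen]
        if added:
            fresh[category] = added
    return fresh


def url_resolves(url: str, timeout: float = 10.0) -> bool:
    """Best-effort link check; assumes the server answers HEAD requests."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, TimeoutError):
        return False
```

A scheduled job could keep the last extraction on disk, call `new_entries` against a fresh one, and regenerate a thin reference only when the result is non-empty. Hand-written CLI and KQL sections in deep guides would need to be preserved rather than overwritten.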

8. What Changes in SKILL.md

The main SKILL.md needs a small addition — a routing table that tells the agent which reference to load based on the service mentioned. This does NOT require splitting the skill.

Current SKILL.md flow

1. Identify symptoms
2. Check resource health (ARG)
3. Review logs (KQL)
4. Analyze metrics
5. Investigate recent changes

Updated SKILL.md flow

1. Identify the Azure service involved
2. If service-specific guide exists → load it from references/
3. Run the generic diagnostic flow:
   a. Check resource health (ARG)
   b. Review logs (KQL)
   c. Analyze metrics
   d. Investigate recent changes
4. If issue persists → consult Learn references in the service guide

The routing step adds ~100 tokens to SKILL.md. Well within budget.
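
The updated flow can be written out as a short Python sketch to make the ordering explicit. `build_plan` and the `guides` mapping are hypothetical; the step texts are taken from the flow above.

```python
from typing import Dict, List, Optional

GENERIC_STEPS = [
    "Check resource health (ARG)",
    "Review logs (KQL)",
    "Analyze metrics",
    "Investigate recent changes",
]


def build_plan(service: Optional[str], guides: Dict[str, str]) -> List[str]:
    """Order the diagnostic steps; the service guide is optional."""
    guide = guides.get(service) if service else None
    plan: List[str] = []
    if guide:
        plan.append(f"Load {guide}")
    plan.extend(GENERIC_STEPS)
    if guide:
        plan.append(f"If issue persists, consult Learn references in {guide}")
    return plan
```

When no guide exists for the service, the plan is exactly the four generic steps, which matches the current SKILL.md behavior minus the symptom-identification step.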


9. Summary

| Question | Answer |
|---|---|
| Should diagnostics be split into two skills? | No. One skill, expanded references. |
| Should azure-skills reference Agent-Skills content? | Yes. Learn URLs in Further Reading / Error Codes sections of each reference file. |
| What's the pattern? | Top of reference = actionable CLI/KQL commands. Bottom = curated Learn URLs sourced from Agent-Skills. |
| How does this scale? | JIT progressive disclosure: only one service reference loads per query. 20 files cost nothing until activated. |
| Can it be automated? | Yes. A sync script can generate thin reference files from Agent-Skills and detect new entries. |
| What's the effort? | Phase 1 (3 deep guides): ~16 hours. Phase 2 (10 thin refs): ~12 hours. Phase 3 (automation): ~1-2 weeks. |