Date: 2025-10-02
Author: Ryan Bruns (with Claude)
This document catalogs and analyzes two MCP (Model Context Protocol) servers developed for Slack's monitoring infrastructure. The analysis identifies tool patterns, use cases, naming conventions, and opportunities for optimization.
- Two Distinct Servers with Different Purposes:
  - monitoring-mcp: Basic Prometheus integration with example tools (hello world, echo, server info)
  - monitoring-ai/mcp: Production-ready unified server with Grafana and AlertManager toolsets
- Tool Overlap: Both servers have Prometheus tools, but with different implementations and capabilities
- Naming Patterns: Tools use descriptive function-style names (verb_noun) with MCP namespace prefixes
- Primary Use Case: Enabling AI assistants to query metrics, logs, and alerts for incident investigation and monitoring tasks
Location: /Users/rbruns/local/src/slack-github.com/slack/monitoring-mcp
Purpose: Basic MCP server template with Prometheus integration
Architecture: FastMCP with tool modules
| Tool Name | Category | Purpose | Use Case | Status |
|---|---|---|---|---|
| hello_world | Demo | Returns greeting with timestamp | Learning/testing MCP | Keep for examples |
| echo | Demo | Echo message with metadata | Testing structured responses | Keep for examples |
| server_info | System | Server status and configuration | Debugging/monitoring | Keep |
| Tool Name | Category | Purpose | Use Case | Status |
|---|---|---|---|---|
| prometheus_query | Metrics | Legacy PromQL query tool | Direct Prometheus queries | Consider deprecating |
| get_rules | Metrics | Query monitoring rules with filtering | Rule discovery and analysis | Useful |
| list_metrics | Metrics | List available metrics in cluster | Metric discovery | Useful |
- Registration Pattern: Tools are organized in modules (`tools/greeting.py`, `tools/prometheus.py`), each exposing a `register_tools(mcp_server)` function (see the sketch after this list)
- Metrics Tracking: All tools use the `@track_tool_usage` decorator for observability
- Cluster Routing: Prometheus tools take a cluster parameter for routing (`{base_url}/{cluster}`)
- Time Handling: Basic support only; no sophisticated time parsing
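To make the pattern concrete, here is a minimal sketch of a tool module that follows this layout. The `track_tool_usage` body shown is a simplified stand-in for the project's real decorator, and the greeting tool is purely illustrative.

```python
# Hypothetical tools/greeting.py-style module: a sketch of the registration
# pattern described above, not the repository's actual code.
import functools
import time


def track_tool_usage(func):
    """Simplified stand-in for the real observability decorator."""
    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return await func(*args, **kwargs)
        finally:
            # The real decorator would emit Prometheus metrics rather than print.
            print(f"tool={func.__name__} duration_s={time.monotonic() - start:.3f}")
    return wrapper


def register_tools(mcp_server):
    """Attach this module's tools to a FastMCP server instance."""

    @mcp_server.tool()
    @track_tool_usage
    async def hello_world(name: str = "world") -> str:
        """Return a greeting with a timestamp."""
        return f"Hello, {name}! It is {time.strftime('%Y-%m-%d %H:%M:%S')}."
```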
- Issue: Both servers have Prometheus query capabilities but with different feature sets
- Recommendation: Consolidate into monitoring-ai/mcp unified server
- Issue: monitoring-mcp only offers basic query, rules, and metric listing; no label discovery or advanced features
- Recommendation: Adopt monitoring-ai/mcp Grafana tools which include label discovery and better time handling
- Issue: prometheus_query is marked as "legacy" but is still present
- Recommendation: Deprecate it in favor of grafana_prometheus_query, which has better features
Location: /Users/rbruns/local/src/slack-github.com/slack/monitoring-ai/mcp
Purpose: Production unified MCP server for Grafana and AlertManager
Architecture: Unified Starlette app with toolset routing
- Unified Endpoint: Single `/monitoring/v1/mcp` endpoint with a `?toolset={grafana|alertmanager}` parameter (see the routing sketch after this list)
- Sub-apps: Separate FastMCP apps for Grafana and AlertManager
- Authentication: JWT/OAuth2 with PKCE flow via toolbelt.tinyspeck.com
- Metrics: Per-toolset metrics tracking with user context
- Service Discovery: OAuth-protected resource discovery at `/.well-known/oauth-protected-resource`
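A minimal sketch of the toolset routing idea, assuming each toolset is exposed as an ASGI sub-app (as FastMCP apps can be). The names `grafana_app` and `alertmanager_app` are placeholders, not the server's actual variables.

```python
# Pure-ASGI dispatcher sketch: route /monitoring/v1/mcp?toolset=... to a sub-app.
from urllib.parse import parse_qs


class ToolsetRouter:
    """Dispatch requests to the sub-app named by the ?toolset= query parameter."""

    def __init__(self, toolsets: dict, default: str = "grafana"):
        # e.g. {"grafana": grafana_app, "alertmanager": alertmanager_app}
        self.toolsets = toolsets
        self.default = default

    async def __call__(self, scope, receive, send):
        params = parse_qs(scope.get("query_string", b"").decode())
        name = params.get("toolset", [self.default])[0]
        app = self.toolsets.get(name)
        if app is None:
            await send({"type": "http.response.start", "status": 400,
                        "headers": [(b"content-type", b"text/plain")]})
            await send({"type": "http.response.body",
                        "body": f"unknown toolset: {name}".encode()})
            return
        await app(scope, receive, send)  # hand the request to the chosen sub-app
```

In the unified server, a router like this would be mounted at `/monitoring/v1/mcp` inside the Starlette app.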
| Tool Name | Category | Purpose | Use Case | Quality |
|---|---|---|---|---|
| grafana_search_dashboards | Discovery | Search dashboards with quality ranking | Find existing monitoring patterns | Excellent |
| grafana_get_dashboard | Discovery | Extract dashboard config, panels, queries | Copy proven PromQL/Lucene queries | Excellent |
Quality Features:
- Quality scoring based on update frequency, "Viz" tag, version count
- Results limited to 20 with ranking
- Includes panel queries, variables, thresholds
| Tool Name | Category | Purpose | Use Case | Quality |
|---|---|---|---|---|
| grafana_list_datasources | Discovery | List available datasources with descriptions | Find correct datasource UID | Excellent |
Quality Features:
- Filter by type (metrics, logs)
- Comprehensive descriptions for 60+ datasources
- Usage guidance in descriptions
- Covers Prometheus, Astra/OpenSearch datasources
| Tool Name | Category | Purpose | Use Case | Quality |
|---|---|---|---|---|
| grafana_prometheus_query | Query | Execute PromQL range queries | Query metrics over time | Excellent |
| grafana_prometheus_label_values | Discovery | Discover label values for metrics | Explore metric dimensions | Very Good |
| grafana_prometheus_metric_names | Discovery | Find metric names by glob pattern | Metric discovery | Very Good |
Quality Features:
- Sophisticated time parsing ("now-1h", "2025-08-21 14:40:18")
- Automatic step calculation for optimal granularity (see the sketch after this list)
- Series limiting (15 series max) for performance
- Error handling for empty results
- Glob pattern filtering with 20-result limit
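The automatic step calculation mentioned above can be approximated as follows. This is assumed logic for illustration; the production limits for maximum points and minimum step may differ.

```python
def calculate_step_seconds(start_ms: int, end_ms: int,
                           max_points: int = 500, min_step_s: int = 15) -> int:
    """Pick a PromQL range-query step that keeps the result under max_points."""
    range_s = max((end_ms - start_ms) // 1000, 1)
    return max(range_s // max_points, min_step_s)


# Example: a 6-hour window (21600 s) with max_points=500 yields a 43-second step;
# very short windows fall back to the 15-second minimum.
```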
| Tool Name | Category | Purpose | Use Case | Quality |
|---|---|---|---|---|
| grafana_opensearch_query | Query | Search logs with Lucene queries | Investigate application logs | Excellent |
| grafana_opensearch_field_mapping | Discovery | Discover available log fields by type | Build Lucene queries | Very Good |
Quality Features:
- Returns both histogram (trend) and sample logs (20 entries)
- Automatic interval calculation for histograms
- Clear documentation of sample limitations
- Field mapping grouped by Elasticsearch types
| Tool Name | Category | Purpose | Use Case | Quality |
|---|---|---|---|---|
| get_alertmanager_status | Status | Get AlertManager instance status | Check service health | Good |
| get_alertmanager_server_config | Configuration | Get AlertManager config (if available) | Understand alerting setup | Good |
| get_receivers | Configuration | List notification receivers | Find alert destinations | Good |
| Tool Name | Category | Purpose | Use Case | Quality |
|---|---|---|---|---|
| get_alerts | Query | Get list of current alerts | View active alerts | Good |
| get_alert_groups | Query | Get alert groups | View grouped alerts | Good |
| get_alerts_summary | Query | Get alert counts by severity/status | Quick alert overview | Very Good |
| search_alerts_by_label | Query | Search alerts by label | Find specific alert types | Very Good |
| Tool Name | Category | Purpose | Use Case | Quality |
|---|---|---|---|---|
| get_silences | Query | List silences with optional filter | View muted alerts | Good |
| get_silence_by_id | Query | Get specific silence details | Check silence status | Good |
Missing Capabilities:
- No write operations (create/delete silences, acknowledge alerts)
- All tools are read-only
- Naming Convention: `{service}_{resource}_{action}` (e.g., `grafana_search_dashboards`, `get_alertmanager_status`)
- Toolset Routing: Tools grouped by service, routed by query parameter
- MCP Annotations: Uses an `annotations` dict with `title`, `readOnlyHint`, `openWorldHint`, `tags` (see the sketch after this list)
- Time Parsing: Sophisticated `parse_time_to_epoch_ms()` supporting relative ("now-1h") and absolute times
- Result Limiting: Consistent limits (15-20 results) with metadata about truncation
- Error Context: Helpful error messages with suggestions
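For illustration, registering a read-only tool with the annotations described above might look like the following. The exact decorator signature and accepted annotation keys are assumptions about the FastMCP SDK in use, so treat this as a sketch rather than the server's code.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("grafana")


@mcp.tool(
    annotations={
        "title": "Query Prometheus metrics via Grafana",
        "readOnlyHint": True,   # the tool never mutates state
        "openWorldHint": True,  # results depend on external systems
        "tags": ["grafana", "prometheus"],
    }
)
async def grafana_prometheus_query(datasource_uid: str, expr: str,
                                   start_time: str = "now-1h",
                                   end_time: str = "now") -> dict:
    """Execute a PromQL range query against the given datasource."""
    ...  # the real implementation calls the Grafana datasource proxy
```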
- 60+ datasources with detailed descriptions
- Usage guidance ("Use for...", "Contains...")
- Critical warnings (e.g., Prometheus_Global: "VERY SLOW - DO NOT USE")
- Coverage explanation (what data is in each source)
- Flexible input ("now-1h", "now-30m", "2025-08-21 14:40:18")
- Automatic step/interval calculation for optimal performance
- Time range validation
- Multi-factor scoring (update frequency, Viz tag, version count)
- Helps users find actively maintained dashboards
- Sorted results with quality scores
- Consistent result limiting (15-20 items)
- Metadata about truncation (`totalSeries`, `limited`)
- Automatic step/interval calculation to cap data points
- Detailed docstrings with use cases
- Examples in tool descriptions
- Critical limitations clearly stated
| Functionality | monitoring-mcp | monitoring-ai/mcp | Recommendation |
|---|---|---|---|
| Basic PromQL query | prometheus_query | grafana_prometheus_query | Use monitoring-ai version (better features) |
| Rules discovery | get_rules | None | Migrate to monitoring-ai |
| Metrics listing | list_metrics | grafana_prometheus_metric_names | monitoring-ai version has better filtering |
| Label discovery | None | grafana_prometheus_label_values | Keep in monitoring-ai |
| Dashboard search | None | grafana_search_dashboards | Keep in monitoring-ai |
- Incident Investigation
  - ✅ Query metrics to identify anomalies
  - ✅ Search logs for error patterns
  - ✅ Check active alerts
  - ✅ View alert history and silences
- Monitoring Discovery
  - ✅ Find existing dashboards for services
  - ✅ Discover available metrics and logs
  - ✅ Extract working queries from dashboards
  - ✅ Understand datasource coverage
- Query Development
  - ✅ Test PromQL queries
  - ✅ Test Lucene log queries
  - ✅ Discover metric labels and values
  - ✅ Explore log field schemas
- Alert Management
  - ❌ Create/delete silences
  - ❌ Acknowledge alerts
  - ❌ Create/modify alerting rules
- Dashboard Management
  - ❌ Create/modify dashboards
  - ❌ Create/modify panels
  - ❌ Set dashboard permissions
- Advanced Analytics
  - ❌ Long-term trend analysis (Thanos queries)
  - ❌ Cross-cluster queries
  - ❌ Distributed tracing integration
monitoring-mcp:
- Simple names: `hello_world`, `echo`, `server_info`
- Prometheus tools: `prometheus_query`, `get_rules`, `list_metrics`
- Issues: Inconsistent (verb_noun vs noun_verb), no namespace prefix

monitoring-ai/mcp:
- Namespace prefixed: `grafana_*`, `alertmanager_*`, `get_*`
- Format: `{service}_{resource}_{action}` or `{action}_{resource}`
- Issues: Inconsistent verb position (`grafana_search_dashboards` vs `get_alertmanager_status`)
Proposed standard: `{service}_{action}_{resource}`

Examples:
- `grafana_search_dashboards` ✅ (already correct)
- `grafana_get_dashboard` ✅ (already correct)
- `grafana_list_datasources` ✅ (already correct)
- `grafana_query_prometheus` ✅ (rename from `grafana_prometheus_query`)
- `alertmanager_get_status` ✅ (rename from `get_alertmanager_status`)
- `alertmanager_list_alerts` ✅ (rename from `get_alerts`)
- `alertmanager_search_alerts` ✅ (rename from `search_alerts_by_label`)
Benefits:
- Consistent service namespace prefix
- Action verb in middle (get, list, search, query, create, delete)
- Resource noun at end (dashboards, alerts, silences)
- Easy to autocomplete in AI tools
Action: Deprecate monitoring-mcp, migrate tools to monitoring-ai/mcp
Rationale:
- Reduces maintenance burden
- monitoring-ai has superior Prometheus integration via Grafana
- Unified authentication and metrics
Migration Plan:
- Add a `get_rules` equivalent to monitoring-ai (via Grafana datasource proxy or Prometheus API)
- Verify monitoring-ai has equivalent metric discovery (already has `grafana_prometheus_metric_names`)
- Migrate demo tools (hello_world, echo, server_info) if needed for testing
- Update documentation and client configurations
- Archive the monitoring-mcp repository
Action: Rename tools to follow {service}_{action}_{resource} pattern
Changes:
```
# AlertManager tools
get_alertmanager_status        -> alertmanager_get_status
get_alerts                     -> alertmanager_list_alerts
get_alert_groups               -> alertmanager_list_alert_groups
get_alerts_summary             -> alertmanager_summarize_alerts
search_alerts_by_label         -> alertmanager_search_alerts
get_silences                   -> alertmanager_list_silences
get_silence_by_id              -> alertmanager_get_silence
get_receivers                  -> alertmanager_list_receivers
get_alertmanager_server_config -> alertmanager_get_config

# Grafana tools (minor adjustments)
grafana_prometheus_query         -> grafana_query_metrics
grafana_prometheus_label_values  -> grafana_discover_label_values
grafana_prometheus_metric_names  -> grafana_discover_metrics
grafana_opensearch_query         -> grafana_query_logs
grafana_opensearch_field_mapping -> grafana_discover_log_fields
```

Migration: Maintain aliases for backward compatibility during the transition period (see the sketch below).
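One possible way to keep the legacy names working during the transition (a sketch under assumed FastMCP behavior, not the project's actual migration code) is to register the same coroutine under both names:

```python
def register_with_alias(mcp_server, func, new_name: str, legacy_name: str):
    """Register func under its new name plus a deprecated alias."""
    mcp_server.tool(name=new_name)(func)
    # Deprecated alias; remove once clients have migrated.
    mcp_server.tool(name=legacy_name)(func)


# Example usage (names taken from the mapping above; get_alerts_impl is hypothetical):
# register_with_alias(mcp, get_alerts_impl, "alertmanager_list_alerts", "get_alerts")
```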
Action: Add write capabilities for common workflows
Proposed Tools:
```
# Silence management
alertmanager_create_silence(matchers, duration, creator, comment)
alertmanager_delete_silence(silence_id)

# Alert management
alertmanager_acknowledge_alert(alert_id)

# Dashboard management (optional)
grafana_create_dashboard(title, panels, folder)
grafana_update_dashboard(uid, updates)
```

Implementation Considerations:
- Require explicit user confirmation for write operations
- Add audit logging for all modifications
- Implement dry-run mode for testing
- Add permission validation
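As a concrete illustration, here is a hedged sketch of what `alertmanager_create_silence` could look like against the public AlertManager v2 API (POST /api/v2/silences). The base URL, parameter names, and the `dry_run` flag are assumptions for this sketch, not existing project code.

```python
from datetime import datetime, timedelta, timezone

import httpx

ALERTMANAGER_URL = "http://alertmanager.example.internal"  # placeholder base URL


async def alertmanager_create_silence(matchers: list[dict], duration_minutes: int,
                                      creator: str, comment: str,
                                      dry_run: bool = True) -> dict:
    """Create an AlertManager silence; dry_run returns the payload without writing."""
    now = datetime.now(timezone.utc)
    payload = {
        # e.g. [{"name": "alertname", "value": "HighErrorRate", "isRegex": False}]
        "matchers": matchers,
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=duration_minutes)).isoformat(),
        "createdBy": creator,
        "comment": comment,
    }
    if dry_run:
        # Dry-run mode: show what would be created without calling the API.
        return {"dry_run": True, "would_create": payload}
    async with httpx.AsyncClient() as client:
        resp = await client.post(f"{ALERTMANAGER_URL}/api/v2/silences", json=payload)
        resp.raise_for_status()
        return resp.json()  # AlertManager returns the new silence ID
```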
Action: Add cross-cutting discovery tools
Proposed Tools:
```
# Prometheus rules (missing from monitoring-ai)
grafana_list_alert_rules(datasource_uid, rule_type=None, filters={})
grafana_get_alert_rule(datasource_uid, rule_id)

# Dashboard annotations
grafana_search_annotations(dashboard_uid, time_range)

# Cross-datasource search
grafana_search_metrics_across_datasources(pattern)
```

Action: Standardize docstring format
Template:
```python
async def {service}_{action}_{resource}(...) -> ReturnType:
    """One-line summary of what the tool does.

    Detailed description explaining the tool's purpose and behavior.
    Include information about data limitations, performance considerations,
    and relationships to other tools.

    Use this tool to:
    - Primary use case
    - Secondary use case
    - When to prefer this over alternatives

    Args:
        param1: Description with format examples
        param2: Description with valid values/ranges

    Returns:
        Description of return structure with key fields listed

    Raises:
        ExceptionType: When this happens

    Example:
        To accomplish X, use param1="value" with param2=123.

    See Also:
        - related_tool_name: For alternative approach
    """
```

- `grafana_search_dashboards` - Quality ranking, comprehensive results
- `grafana_list_datasources` - Detailed descriptions, usage guidance
- `grafana_prometheus_query` - Time parsing, auto-stepping, result limiting
- `grafana_opensearch_query` - Histogram + samples, interval calculation
- `grafana_get_dashboard` - Could add annotation support (currently commented out)
- `get_alerts_summary` - Good aggregation, could add time filtering
- `grafana_prometheus_label_values` - Solid implementation, could add result limiting
- AlertManager tools - Missing write operations, limited filtering
- `get_alertmanager_server_config` - Error handling could be improved
- monitoring-mcp Prometheus tools - Less sophisticated than Grafana equivalents
- `prometheus_query` (monitoring-mcp) - Superseded by `grafana_prometheus_query`
- Demo tools in production deployment - Consider a separate demo server
Location: monitoring-ai/mcp/grafana_server.py:36
```python
def parse_time_to_epoch_ms(time_str: str) -> int:
    """Convert time string to epoch milliseconds.

    Supports:
    - Relative time: "now", "now-1h", "now-30m", "now-7d", etc.
    - Datetime strings: "2025-08-21 14:40:18" (assumes local timezone)
    """
```

Recommendation: Extract to a shared utility module for reuse (a possible implementation is sketched below).
Location: monitoring-ai/mcp/grafana_server.py:198
```python
def calculate_quality_score(dashboard):
    # Multi-factor scoring based on:
    # - Update recency (1000 points for <7 days)
    # - Viz tag presence (800 points)
    # - Version count (300 points for >50 versions)
    # - Folder organization (50 points)
    # - Starred status (30 points)
```

Recommendation: The pattern is applicable to ranking other resources (alerts, panels); a generic sketch follows.
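A generic version of the same pattern, sketched for ranking arbitrary resources as suggested above; the weights and predicates shown are illustrative only.

```python
from typing import Callable, Iterable


def rank_by_quality(items: Iterable[dict],
                    factors: list[tuple[Callable[[dict], bool], int]],
                    limit: int = 20) -> list[dict]:
    """Score each item by the sum of weights whose predicate matches, then sort."""
    scored = []
    for item in items:
        score = sum(weight for predicate, weight in factors if predicate(item))
        scored.append({**item, "quality_score": score})
    return sorted(scored, key=lambda i: i["quality_score"], reverse=True)[:limit]


# Example factors for dashboards, mirroring calculate_quality_score above:
# factors = [(lambda d: d.get("days_since_update", 999) < 7, 1000),
#            (lambda d: "Viz" in d.get("tags", []), 800)]
```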
Location: Throughout monitoring-ai tools
```python
results = all_results[:15]
return {
    "data": results,
    "total": len(all_results),
    "limited": len(all_results) > 15
}
```

Recommendation: Standardize across all list/search operations, e.g., via a shared helper like the sketch below.
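A small shared helper would standardize the pattern; this is a sketch whose field names simply follow the example above.

```python
def limit_results(all_results: list, limit: int = 15, key: str = "data") -> dict:
    """Return the first `limit` items plus truncation metadata."""
    return {
        key: all_results[:limit],
        "total": len(all_results),
        "limited": len(all_results) > limit,
    }
```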
Location: monitoring-ai/mcp/grafana_server.py:491
Large dictionary mapping datasource names to detailed descriptions with:
- What data is contained
- When to use it
- Performance considerations
- Example use cases
Recommendation: Consider externalizing to a YAML/JSON config file (see the loading sketch below)
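If the descriptions were externalized, loading them could be as simple as the following sketch; the file name and schema are assumptions, and the example entries are illustrative.

```python
import yaml  # PyYAML


def load_datasource_descriptions(path: str = "datasource_descriptions.yaml") -> dict:
    """Return a mapping of datasource name -> description text."""
    with open(path) as f:
        return yaml.safe_load(f) or {}


# datasource_descriptions.yaml (illustrative entries):
# Prometheus_Global: "VERY SLOW - DO NOT USE. Prefer per-cluster datasources."
# example-opensearch-logs: "Application logs; use for Lucene log queries."
```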
Location: Both servers use the `@track_tool_usage` decorator
Recommendation: Ensure consistent metrics labels across toolsets (see the sketch below)
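One way to enforce consistent labels is a shared decorator parameterized by server and toolset, sketched here with prometheus_client; the metric and label names are assumptions, not the project's existing ones.

```python
import functools
import time

from prometheus_client import Counter, Histogram

TOOL_CALLS = Counter("mcp_tool_calls_total", "Tool invocations",
                     ["server", "toolset", "tool", "status"])
TOOL_LATENCY = Histogram("mcp_tool_duration_seconds", "Tool call latency",
                         ["server", "toolset", "tool"])


def track_tool_usage(server: str, toolset: str):
    """Decorator factory that emits the same label set from every toolset."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            start = time.monotonic()
            status = "ok"
            try:
                return await func(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                TOOL_CALLS.labels(server, toolset, func.__name__, status).inc()
                TOOL_LATENCY.labels(server, toolset, func.__name__).observe(
                    time.monotonic() - start)
        return wrapper
    return decorator
```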
monitoring-mcp:
- Authentication: Optional JWT via the `MCP_JWT_AUDIENCE` env var (see the sketch after this list)
- Token Validation: Against the toolbelt.tinyspeck.com JWKS
- Development Mode: No auth if `MCP_JWT_AUDIENCE` is unset
- OAuth Discovery: Standard `/.well-known/oauth-protected-resource` endpoint
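A minimal sketch of this env-var-gated check, assuming PyJWT with a JWKS client; the JWKS URL shown is a placeholder, not the confirmed toolbelt endpoint.

```python
import os

import jwt
from jwt import PyJWKClient

JWKS_URL = "https://toolbelt.tinyspeck.com/.well-known/jwks.json"  # placeholder path
AUDIENCE = os.environ.get("MCP_JWT_AUDIENCE")


def validate_token(token: str) -> dict | None:
    """Return decoded claims, or None when auth is disabled (development mode)."""
    if not AUDIENCE:
        return None  # MCP_JWT_AUDIENCE unset: no authentication
    signing_key = PyJWKClient(JWKS_URL).get_signing_key_from_jwt(token)
    return jwt.decode(token, signing_key.key, algorithms=["RS256"], audience=AUDIENCE)
```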
monitoring-ai/mcp:
- Authentication: JWT with PKCE flow; token validation is currently disabled (temporary workaround)
- Token Validation: Skipped due to an MCP client RFC8707 issue (see commit 7ab728e)
- User Tracking: Extracts the user from the JWT payload for metrics
- OAuth Discovery: Standard endpoints with toolset-specific resource URLs
- Re-enable JWT validation once MCP client RFC8707 issue is resolved
- Audit logging for all tool invocations (especially write operations)
- Rate limiting to prevent abuse
- Scope-based permissions (e.g., read-only vs read-write tokens)
- Tool Usage: All tools instrumented with `@track_tool_usage`
- User Context: monitoring-ai tracks user and toolset per request
- Prometheus Metrics: Exposed at the `/metrics` endpoint
- Health Checks: `/health` endpoint on both servers
- Query Performance: Track query duration, result size
- Error Rates: Track failures by tool and error type
- User Activity: Most active users, tools, queries
- Datasource Health: Track datasource availability and latency
- Cache Hit Rates: If implementing query caching
- Consolidate to single server (monitoring-ai)
- Standardize tool naming
- Re-enable JWT validation
- Add missing Prometheus rules tool
- Add write operations (silences, dashboards)
- Improve result pagination
- Add query caching layer
- Cross-datasource search
- Distributed tracing integration (Tempo)
- Advanced analytics (Thanos long-term storage)
- Alert recommendation system
- Automated incident investigation workflows
| Tool | Category | Input Parameters | Output | Use Case |
|---|---|---|---|---|
| hello_world | Demo | name: str | str | Testing MCP |
| echo | Demo | message: str | Dict[str, Any] | Testing structured responses |
| server_info | System | None | Dict[str, Any] | Server status check |
| prometheus_query | Metrics | query: str, host: str | Dict[str, Any] | Direct PromQL query |
| get_rules | Metrics | cluster: str, filters... | Dict[str, Any] | Alert/recording rule discovery |
| list_metrics | Metrics | cluster: str, match_filter: Optional[str] | Dict[str, Any] | Metric name discovery |
| Tool | Category | Input Parameters | Output | Use Case |
|---|---|---|---|---|
| grafana_search_dashboards | Discovery | query: Optional[str] | List[Dict] | Find relevant dashboards |
| grafana_get_dashboard | Discovery | uid: str | Dict | Extract dashboard config |
| grafana_list_datasources | Discovery | filter_type: Optional[str] | List[Dict] | Find datasource UIDs |
| grafana_prometheus_query | Query | datasource_uid, expr, start_time, end_time | Dict | Query metrics time series |
| grafana_prometheus_label_values | Discovery | datasource_uid, series_selector, start, end | Dict | Discover label values |
| grafana_prometheus_metric_names | Discovery | datasource_uid, glob_pattern, start, end | Dict | Find metric names |
| grafana_opensearch_query | Query | datasource_uid, query, start_time, end_time | Dict | Search logs with Lucene |
| grafana_opensearch_field_mapping | Discovery | datasource_uid | Dict | Discover log fields |
| Tool | Category | Input Parameters | Output | Use Case |
|---|---|---|---|---|
| get_alertmanager_status | Status | None | Dict | Check service health |
| get_receivers | Config | None | List[Dict] | List notification receivers |
| get_silences | Query | filter_query: Optional[str] | List[Dict] | List silences |
| get_silence_by_id | Query | silence_id: str | Dict | Get specific silence |
| get_alerts | Query | filter_query, silenced, inhibited, active | List[Dict] | List current alerts |
| get_alert_groups | Query | filter_query, silenced, inhibited, active | List[Dict] | List alert groups |
| get_alerts_summary | Query | filter_query, active_only | Dict | Alert count summary |
| search_alerts_by_label | Query | label_name, label_value, active_only | List[Dict] | Search alerts by label |
| get_alertmanager_server_config | Config | None | Dict | Get AlertManager config |
The monitoring-ai/mcp unified server represents a mature, production-ready MCP implementation with excellent tool design patterns. Key strengths include:
- Comprehensive coverage of monitoring workflows (metrics, logs, alerts)
- Outstanding UX patterns (quality ranking, detailed descriptions, time parsing)
- Performance optimizations (result limiting, auto-stepping, interval calculation)
- Consistent architecture (toolset routing, metrics tracking, error handling)
Primary recommendations:
- Consolidate monitoring-mcp into monitoring-ai/mcp
- Standardize tool naming to `{service}_{action}_{resource}`
- Add write operations for silence and dashboard management
- Re-enable JWT validation once client issues are resolved
The codebase demonstrates strong engineering practices and can serve as a reference implementation for future MCP servers.
Generated with: Claude Code + claude-opus-4-1
Repositories: slack-github.com/slack/monitoring-mcp, slack-github.com/slack/monitoring-ai