Gist by @nullren · Created October 2, 2025 14:35
MCP Tools Analysis: Monitoring Infrastructure

Date: 2025-10-02
Author: Ryan Bruns (with Claude)

Executive Summary

This document catalogs and analyzes two MCP (Model Context Protocol) servers developed for Slack's monitoring infrastructure. The analysis identifies tool patterns, use cases, naming conventions, and opportunities for optimization.

Key Findings

  1. Two Distinct Servers with Different Purposes:

    • monitoring-mcp: Basic Prometheus integration with example tools (hello world, echo, server info)
    • monitoring-ai/mcp: Production-ready unified server with Grafana and AlertManager toolsets
  2. Tool Overlap: Both servers have Prometheus tools but with different implementations and capabilities

  3. Naming Patterns: Tools use descriptive function-style names (verb_noun) with MCP namespace prefixes

  4. Primary Use Case: Enabling AI assistants to query metrics, logs, and alerts for incident investigation and monitoring tasks


Server 1: monitoring-mcp

Location: /Users/rbruns/local/src/slack-github.com/slack/monitoring-mcp
Purpose: Basic MCP server template with Prometheus integration
Architecture: FastMCP with tool modules

Tools Inventory

Example/Demo Tools

| Tool Name | Category | Purpose | Use Case | Status |
| --- | --- | --- | --- | --- |
| hello_world | Demo | Returns greeting with timestamp | Learning/testing MCP | Keep for examples |
| echo | Demo | Echo message with metadata | Testing structured responses | Keep for examples |
| server_info | System | Server status and configuration | Debugging/monitoring | Keep |

Prometheus Tools

| Tool Name | Category | Purpose | Use Case | Status |
| --- | --- | --- | --- | --- |
| prometheus_query | Metrics | Legacy PromQL query tool | Direct Prometheus queries | Consider deprecating |
| get_rules | Metrics | Query monitoring rules with filtering | Rule discovery and analysis | Useful |
| list_metrics | Metrics | List available metrics in cluster | Metric discovery | Useful |

Tool Patterns Observed

  1. Registration Pattern: Tools organized in modules (tools/greeting.py, tools/prometheus.py) with register_tools(mcp_server) functions
  2. Metrics Tracking: All tools use @track_tool_usage decorator for observability
  3. Cluster Routing: Prometheus tools use cluster parameter for routing ({base_url}/{cluster})
  4. Time Handling: Basic support, no sophisticated time parsing
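
The registration and metrics-tracking patterns can be sketched in miniature. The names `track_tool_usage` and `register_tools` come from the source; the in-memory counter store and the FastMCP decorator usage inside `register_tools` are illustrative assumptions, not the servers' actual implementation:

```python
import functools
import time

# Illustrative in-memory store; the real servers export Prometheus metrics instead.
TOOL_USAGE = {}

def track_tool_usage(func):
    """Record call count and last duration for each tool (sketch of the decorator)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return func(*args, **kwargs)
        finally:
            stats = TOOL_USAGE.setdefault(func.__name__, {"calls": 0, "last_seconds": 0.0})
            stats["calls"] += 1
            stats["last_seconds"] = time.monotonic() - start
    return wrapper

def register_tools(mcp_server):
    """Per-module registration hook, mirroring tools/greeting.py."""
    @mcp_server.tool()          # FastMCP's decorator; exact signature assumed
    @track_tool_usage
    def hello_world(name: str) -> str:
        return f"Hello, {name}!"
```

Stacking the usage decorator under the registration decorator keeps observability orthogonal to tool logic, which is why every tool in both servers can be instrumented uniformly.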

Issues and Optimization Opportunities

1. Tool Overlap with monitoring-ai/mcp

  • Issue: Both servers have Prometheus query capabilities but with different feature sets
  • Recommendation: Consolidate into monitoring-ai/mcp unified server

2. Limited Prometheus Capabilities

  • Issue: Only has basic query, rules, and metric listing - no label discovery or advanced features
  • Recommendation: Adopt monitoring-ai/mcp Grafana tools which include label discovery and better time handling

3. Legacy Tool (prometheus_query)

  • Issue: Marked as "legacy" but still present
  • Recommendation: Deprecate in favor of grafana_prometheus_query with better features

Server 2: monitoring-ai/mcp

Location: /Users/rbruns/local/src/slack-github.com/slack/monitoring-ai/mcp
Purpose: Production unified MCP server for Grafana and AlertManager
Architecture: Unified Starlette app with toolset routing

Architecture Highlights

  • Unified Endpoint: Single /monitoring/v1/mcp endpoint with ?toolset={grafana|alertmanager} parameter
  • Sub-apps: Separate FastMCP apps for Grafana and AlertManager
  • Authentication: JWT/OAuth2 with PKCE flow via toolbelt.tinyspeck.com
  • Metrics: Per-toolset metrics tracking with user context
  • Service Discovery: OAuth-protected resource discovery at /.well-known/oauth-protected-resource
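
The unified-endpoint idea reduces to dispatching on a query parameter. A dependency-free sketch (the handler stand-ins and the "grafana" default are assumptions; the real implementation mounts FastMCP sub-apps inside Starlette):

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical stand-ins for the two FastMCP sub-apps.
TOOLSETS = {
    "grafana": lambda: "handled by grafana sub-app",
    "alertmanager": lambda: "handled by alertmanager sub-app",
}

def route_mcp_request(url: str) -> str:
    """Dispatch /monitoring/v1/mcp?toolset=... to the matching sub-app (sketch)."""
    parsed = urlparse(url)
    if parsed.path != "/monitoring/v1/mcp":
        raise LookupError(f"unknown path: {parsed.path}")
    values = parse_qs(parsed.query).get("toolset", [])
    toolset = values[0] if values else "grafana"  # default toolset is an assumption
    if toolset not in TOOLSETS:
        raise ValueError(f"unknown toolset {toolset!r}; expected one of {sorted(TOOLSETS)}")
    return TOOLSETS[toolset]()
```

Keeping one endpoint with a `toolset` selector means clients configure a single URL while the server retains independent tool registries per service.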

Tools Inventory

Grafana Toolset (grafana_server.py)

Dashboard Tools

| Tool Name | Category | Purpose | Use Case | Quality |
| --- | --- | --- | --- | --- |
| grafana_search_dashboards | Discovery | Search dashboards with quality ranking | Find existing monitoring patterns | Excellent |
| grafana_get_dashboard | Discovery | Extract dashboard config, panels, queries | Copy proven PromQL/Lucene queries | Excellent |

Quality Features:

  • Quality scoring based on update frequency, "Viz" tag, version count
  • Results limited to 20 with ranking
  • Includes panel queries, variables, thresholds

Datasource Tools

| Tool Name | Category | Purpose | Use Case | Quality |
| --- | --- | --- | --- | --- |
| grafana_list_datasources | Discovery | List available datasources with descriptions | Find correct datasource UID | Excellent |

Quality Features:

  • Filter by type (metrics, logs)
  • Comprehensive descriptions for 60+ datasources
  • Usage guidance in descriptions
  • Covers Prometheus, Astra/OpenSearch datasources

Metrics Tools

| Tool Name | Category | Purpose | Use Case | Quality |
| --- | --- | --- | --- | --- |
| grafana_prometheus_query | Query | Execute PromQL range queries | Query metrics over time | Excellent |
| grafana_prometheus_label_values | Discovery | Discover label values for metrics | Explore metric dimensions | Very Good |
| grafana_prometheus_metric_names | Discovery | Find metric names by glob pattern | Metric discovery | Very Good |

Quality Features:

  • Sophisticated time parsing ("now-1h", "2025-08-21 14:40:18")
  • Automatic step calculation for optimal granularity
  • Series limiting (15 series max) for performance
  • Error handling for empty results
  • Glob pattern filtering with 20-result limit
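
The automatic step calculation likely works along these lines; the 250-point target and 15-second floor are illustrative constants, not the server's actual values:

```python
def calculate_step_seconds(start_ms, end_ms, max_points=250, min_step=15):
    """Pick a PromQL range-query step so the response stays under max_points samples."""
    range_seconds = max((end_ms - start_ms) // 1000, 1)
    step = range_seconds // max_points or 1
    # Clamp to a floor so short ranges don't request sub-scrape-interval steps.
    return max(step, min_step)
```

Deriving the step from the requested range is what keeps a 24-hour query and a 1-hour query roughly the same response size.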

Logs Tools

| Tool Name | Category | Purpose | Use Case | Quality |
| --- | --- | --- | --- | --- |
| grafana_opensearch_query | Query | Search logs with Lucene queries | Investigate application logs | Excellent |
| grafana_opensearch_field_mapping | Discovery | Discover available log fields by type | Build Lucene queries | Very Good |

Quality Features:

  • Returns both histogram (trend) and sample logs (20 entries)
  • Automatic interval calculation for histograms
  • Clear documentation of sample limitations
  • Field mapping grouped by Elasticsearch types

AlertManager Toolset (alertmanager_server.py)

Status & Configuration Tools

| Tool Name | Category | Purpose | Use Case | Quality |
| --- | --- | --- | --- | --- |
| get_alertmanager_status | Status | Get AlertManager instance status | Check service health | Good |
| get_alertmanager_server_config | Configuration | Get AlertManager config (if available) | Understand alerting setup | Good |
| get_receivers | Configuration | List notification receivers | Find alert destinations | Good |

Alert Query Tools

| Tool Name | Category | Purpose | Use Case | Quality |
| --- | --- | --- | --- | --- |
| get_alerts | Query | Get list of current alerts | View active alerts | Good |
| get_alert_groups | Query | Get alert groups | View grouped alerts | Good |
| get_alerts_summary | Query | Get alert counts by severity/status | Quick alert overview | Very Good |
| search_alerts_by_label | Query | Search alerts by label | Find specific alert types | Very Good |

Silence Tools

| Tool Name | Category | Purpose | Use Case | Quality |
| --- | --- | --- | --- | --- |
| get_silences | Query | List silences with optional filter | View muted alerts | Good |
| get_silence_by_id | Query | Get specific silence details | Check silence status | Good |

Missing Capabilities:

  • No write operations (create/delete silences, acknowledge alerts)
  • All tools are read-only

Tool Patterns Observed

  1. Naming Convention: {service}_{resource}_{action} (e.g., grafana_search_dashboards, alertmanager_get_status)
  2. Toolset Routing: Tools grouped by service, routed by query parameter
  3. MCP Annotations: Uses annotations dict with title, readOnlyHint, openWorldHint, tags
  4. Time Parsing: Sophisticated parse_time_to_epoch_ms() supporting relative ("now-1h") and absolute times
  5. Result Limiting: Consistent limits (15-20 results) with metadata about truncation
  6. Error Context: Helpful error messages with suggestions

Outstanding Quality Features

1. Datasource Descriptions

  • 60+ datasources with detailed descriptions
  • Usage guidance ("Use for...", "Contains...")
  • Critical warnings (e.g., Prometheus_Global: "VERY SLOW - DO NOT USE")
  • Coverage explanation (what data is in each source)

2. Time Handling

  • Flexible input ("now-1h", "now-30m", "2025-08-21 14:40:18")
  • Automatic step/interval calculation for optimal performance
  • Time range validation

3. Dashboard Quality Ranking

  • Multi-factor scoring (update frequency, Viz tag, version count)
  • Helps users find actively maintained dashboards
  • Sorted results with quality scores

4. Performance Optimization

  • Consistent result limiting (15-20 items)
  • Metadata about truncation (totalSeries, limited)
  • Automatic step/interval calculation to cap data points

5. Documentation

  • Detailed docstrings with use cases
  • Examples in tool descriptions
  • Critical limitations clearly stated

Cross-Server Analysis

Tool Overlap

| Functionality | monitoring-mcp | monitoring-ai/mcp | Recommendation |
| --- | --- | --- | --- |
| Basic PromQL query | prometheus_query | grafana_prometheus_query | Use monitoring-ai version (better features) |
| Rules discovery | get_rules | None | Migrate to monitoring-ai |
| Metrics listing | list_metrics | grafana_prometheus_metric_names | monitoring-ai version has better filtering |
| Label discovery | None | grafana_prometheus_label_values | Keep in monitoring-ai |
| Dashboard search | None | grafana_search_dashboards | Keep in monitoring-ai |

Use Case Coverage

Supported Use Cases

  1. Incident Investigation

    • ✅ Query metrics to identify anomalies
    • ✅ Search logs for error patterns
    • ✅ Check active alerts
    • ✅ View alert history and silences
  2. Monitoring Discovery

    • ✅ Find existing dashboards for services
    • ✅ Discover available metrics and logs
    • ✅ Extract working queries from dashboards
    • ✅ Understand datasource coverage
  3. Query Development

    • ✅ Test PromQL queries
    • ✅ Test Lucene log queries
    • ✅ Discover metric labels and values
    • ✅ Explore log field schemas

Missing Use Cases

  1. Alert Management

    • ❌ Create/delete silences
    • ❌ Acknowledge alerts
    • ❌ Create/modify alerting rules
  2. Dashboard Management

    • ❌ Create/modify dashboards
    • ❌ Create/modify panels
    • ❌ Set dashboard permissions
  3. Advanced Analytics

    • ❌ Long-term trend analysis (Thanos queries)
    • ❌ Cross-cluster queries
    • ❌ Distributed tracing integration

Naming Convention Analysis

Current Patterns

monitoring-mcp

  • Simple names: hello_world, echo, server_info
  • Prometheus tools: prometheus_query, get_rules, list_metrics
  • Issues: Inconsistent (verb_noun vs noun_verb), no namespace prefix

monitoring-ai/mcp

  • Namespace prefixed: grafana_* and alertmanager_*, though several AlertManager tools use bare get_*
  • Format: {service}_{resource}_{action} or {action}_{resource}
  • Issues: Inconsistent verb position (grafana_search_dashboards vs get_alertmanager_status)

Recommended Convention

{service}_{action}_{resource}

Examples:

  • grafana_search_dashboards ✅ (already correct)
  • grafana_get_dashboard ✅ (already correct)
  • grafana_list_datasources ✅ (already correct)
  • grafana_query_metrics ✅ (rename from grafana_prometheus_query)
  • alertmanager_get_status ✅ (rename from get_alertmanager_status)
  • alertmanager_list_alerts ✅ (rename from get_alerts)
  • alertmanager_search_alerts ✅ (rename from search_alerts_by_label)

Benefits:

  • Consistent service namespace prefix
  • Action verb in middle (get, list, search, query, create, delete)
  • Resource noun at end (dashboards, alerts, silences)
  • Easy to autocomplete in AI tools
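
The convention is simple enough to lint mechanically. A hypothetical checker; the service and action vocabularies below are drawn from the examples in this document and would need to track the real tool registry:

```python
import re

SERVICES = {"grafana", "alertmanager"}
ACTIONS = {"get", "list", "search", "query", "create", "delete", "update",
           "discover", "summarize"}

NAME_RE = re.compile(r"^(?P<service>[a-z]+)_(?P<action>[a-z]+)_(?P<resource>[a-z_]+)$")

def follows_convention(tool_name: str) -> bool:
    """True if the name matches {service}_{action}_{resource} with known vocab."""
    m = NAME_RE.match(tool_name)
    return bool(m) and m["service"] in SERVICES and m["action"] in ACTIONS
```

A check like this could run in CI so new tools cannot reintroduce the verb-position drift described above.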

Optimization Recommendations

Priority 1: Consolidate Servers

Action: Deprecate monitoring-mcp, migrate tools to monitoring-ai/mcp

Rationale:

  • Reduces maintenance burden
  • monitoring-ai has superior Prometheus integration via Grafana
  • Unified authentication and metrics

Migration Plan:

  1. Add get_rules equivalent to monitoring-ai (via Grafana datasource proxy or Prometheus API)
  2. Verify monitoring-ai has equivalent metric discovery (already has grafana_prometheus_metric_names)
  3. Migrate demo tools (hello_world, echo, server_info) if needed for testing
  4. Update documentation and client configurations
  5. Archive monitoring-mcp repository

Priority 2: Standardize Tool Names

Action: Rename tools to follow {service}_{action}_{resource} pattern

Changes:

```
# AlertManager tools
get_alertmanager_status        -> alertmanager_get_status
get_alerts                     -> alertmanager_list_alerts
get_alert_groups               -> alertmanager_list_alert_groups
get_alerts_summary             -> alertmanager_summarize_alerts
search_alerts_by_label         -> alertmanager_search_alerts
get_silences                   -> alertmanager_list_silences
get_silence_by_id              -> alertmanager_get_silence
get_receivers                  -> alertmanager_list_receivers
get_alertmanager_server_config -> alertmanager_get_config

# Grafana tools (minor adjustments)
grafana_prometheus_query         -> grafana_query_metrics
grafana_prometheus_label_values  -> grafana_discover_label_values
grafana_prometheus_metric_names  -> grafana_discover_metrics
grafana_opensearch_query         -> grafana_query_logs
grafana_opensearch_field_mapping -> grafana_discover_log_fields
```

Migration: Maintain aliases for backward compatibility during transition period
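
One low-risk way to keep both names live during the transition is to register each tool under its new name plus a deprecated alias. A sketch, assuming a plain name-to-function registry rather than FastMCP's real decorator API:

```python
import warnings

TOOL_REGISTRY = {}

# Old name -> new name, from the rename table above (remaining entries elided).
RENAMES = {
    "get_alertmanager_status": "alertmanager_get_status",
    "get_alerts": "alertmanager_list_alerts",
}

def register_with_alias(new_name, func):
    """Register a tool under its new name plus a deprecated alias for the old one."""
    TOOL_REGISTRY[new_name] = func
    for old_name, target in RENAMES.items():
        if target == new_name:
            def deprecated_alias(*args, _old=old_name, _new=new_name, **kwargs):
                warnings.warn(f"{_old} is deprecated; use {_new}", DeprecationWarning)
                return func(*args, **kwargs)
            TOOL_REGISTRY[old_name] = deprecated_alias
```

The alias emits a deprecation warning on every call, which gives a usage signal (via the existing tool metrics) for deciding when the old names can be removed.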

Priority 3: Add Write Operations

Action: Add write capabilities for common workflows

Proposed Tools:

```python
# Silence management
alertmanager_create_silence(matchers, duration, creator, comment)
alertmanager_delete_silence(silence_id)

# Alert management
alertmanager_acknowledge_alert(alert_id)

# Dashboard management (optional)
grafana_create_dashboard(title, panels, folder)
grafana_update_dashboard(uid, updates)
```

Implementation Considerations:

  • Require explicit user confirmation for write operations
  • Add audit logging for all modifications
  • Implement dry-run mode for testing
  • Add permission validation
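
These guardrails compose naturally as a wrapper around write tools. A sketch; the `confirm`/`dry_run` parameters and the audit-log shape are assumptions about how such a wrapper could look, not the servers' actual interface:

```python
import time

AUDIT_LOG = []

def guarded_write(action_name, apply_fn, *, user, confirm=False, dry_run=False, **params):
    """Run a write operation only with explicit confirmation; log every attempt."""
    AUDIT_LOG.append({"action": action_name, "user": user, "params": params,
                      "ts": time.time(), "dry_run": dry_run})
    if dry_run:
        # Show what would change without touching AlertManager/Grafana.
        return {"status": "dry_run", "would_apply": params}
    if not confirm:
        raise PermissionError(f"{action_name} requires confirm=True")
    return apply_fn(**params)
```

Routing every write through one chokepoint makes audit logging and permission checks uniform instead of per-tool.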

Priority 4: Enhance Discovery Tools

Action: Add cross-cutting discovery tools

Proposed Tools:

```python
# Prometheus rules (missing from monitoring-ai)
grafana_list_alert_rules(datasource_uid, rule_type=None, filters={})
grafana_get_alert_rule(datasource_uid, rule_id)

# Dashboard annotations
grafana_search_annotations(dashboard_uid, time_range)

# Cross-datasource search
grafana_search_metrics_across_datasources(pattern)
```

Priority 5: Improve Documentation

Action: Standardize docstring format

Template:

```python
async def {service}_{action}_{resource}(...) -> ReturnType:
    """One-line summary of what the tool does.

    Detailed description explaining the tool's purpose and behavior.
    Include information about data limitations, performance considerations,
    and relationships to other tools.

    Use this tool to:
    - Primary use case
    - Secondary use case
    - When to prefer this over alternatives

    Args:
        param1: Description with format examples
        param2: Description with valid values/ranges

    Returns:
        Description of return structure with key fields listed

    Raises:
        ExceptionType: When this happens

    Example:
        To accomplish X, use param1="value" with param2=123.

    See Also:
        - related_tool_name: For alternative approach
    """
```

Tool Quality Assessment

Excellent Quality (Keep as Reference)

  1. grafana_search_dashboards - Quality ranking, comprehensive results
  2. grafana_list_datasources - Detailed descriptions, usage guidance
  3. grafana_prometheus_query - Time parsing, auto-stepping, result limiting
  4. grafana_opensearch_query - Histogram + samples, interval calculation

Very Good Quality (Minor improvements)

  1. grafana_get_dashboard - Could add annotation support (currently commented out)
  2. get_alerts_summary - Good aggregation, could add time filtering
  3. grafana_prometheus_label_values - Solid implementation, could add result limiting

Good Quality (Needs enhancement)

  1. AlertManager tools - Missing write operations, limited filtering
  2. get_alertmanager_server_config - Error handling could be improved
  3. monitoring-mcp Prometheus tools - Less sophisticated than Grafana equivalents

Consider Deprecating

  1. prometheus_query (monitoring-mcp) - Superseded by grafana_prometheus_query
  2. Demo tools in production deployment - Consider separate demo server

Implementation Patterns to Adopt

1. Time Parsing Function

Location: monitoring-ai/mcp/grafana_server.py:36

```python
def parse_time_to_epoch_ms(time_str: str) -> int:
    """Convert time string to epoch milliseconds.

    Supports:
    - Relative time: "now", "now-1h", "now-30m", "now-7d", etc.
    - Datetime strings: "2025-08-21 14:40:18" (assumes local timezone)
    """
```

Recommendation: Extract to shared utility module for reuse
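
A minimal pure-Python implementation of that contract, supporting only the formats the docstring lists (the real function may accept more; the injectable `now` parameter is added here for testability):

```python
import re
import time
from datetime import datetime

_RELATIVE = re.compile(r"^now(?:-(\d+)([smhd]))?$")
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def parse_time_to_epoch_ms(time_str, now=None):
    """Convert "now", "now-1h", or "YYYY-MM-DD HH:MM:SS" to epoch milliseconds."""
    now = time.time() if now is None else now
    match = _RELATIVE.match(time_str.strip())
    if match:
        amount, unit = match.group(1), match.group(2)
        offset = int(amount or 0) * _UNIT_SECONDS[unit or "s"]
        return int((now - offset) * 1000)
    # Absolute datetimes are interpreted in the local timezone, as documented.
    return int(datetime.strptime(time_str.strip(), "%Y-%m-%d %H:%M:%S").timestamp() * 1000)
```

As a shared utility this gives every toolset identical time semantics, which matters when a user pastes the same "now-1h" into a metrics query and a log query.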

2. Quality Ranking Function

Location: monitoring-ai/mcp/grafana_server.py:198

```python
def calculate_quality_score(dashboard):
    # Multi-factor scoring based on:
    # - Update recency (1000 points for <7 days)
    # - Viz tag presence (800 points)
    # - Version count (300 points for >50 versions)
    # - Folder organization (50 points)
    # - Starred status (30 points)
```

Recommendation: Pattern applicable to ranking other resources (alerts, panels)
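
A runnable sketch of that scoring; the dashboard field names (days_since_update, tags, version_count, folder, is_starred) are invented for illustration, where the real function reads Grafana's search/dashboard API fields:

```python
def calculate_quality_score(dashboard: dict) -> int:
    """Score a dashboard by the factors above; higher means better maintained."""
    score = 0
    if dashboard.get("days_since_update", 10**6) < 7:
        score += 1000  # recently updated
    if "Viz" in dashboard.get("tags", []):
        score += 800   # curated "Viz" tag
    if dashboard.get("version_count", 0) > 50:
        score += 300   # long edit history
    if dashboard.get("folder"):
        score += 50    # organized into a folder
    if dashboard.get("is_starred"):
        score += 30
    return score
```

Because it returns a plain integer, the same function can serve as a `sorted(..., key=calculate_quality_score, reverse=True)` key for alerts or panels as well.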

3. Result Limiting Pattern

Location: Throughout monitoring-ai tools

```python
results = all_results[:15]
return {
    "data": results,
    "total": len(all_results),
    "limited": len(all_results) > 15
}
```

Recommendation: Standardize across all list/search operations
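
Extracted as a shared helper, the pattern might look like this (the function name and `key` parameter are illustrative):

```python
def limit_results(items: list, limit: int = 15, key: str = "data") -> dict:
    """Truncate a result list and attach metadata describing the truncation."""
    return {
        key: items[:limit],
        "total": len(items),
        "limited": len(items) > limit,
    }
```

Standardizing on one helper guarantees that every list/search tool reports `total` and `limited` the same way, so clients can always tell a complete result from a truncated one.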

4. Datasource Description Pattern

Location: monitoring-ai/mcp/grafana_server.py:491

Large dictionary mapping datasource names to detailed descriptions with:

  • What data is contained
  • When to use it
  • Performance considerations
  • Example use cases

Recommendation: Consider externalizing to YAML/JSON config file

5. Tool Metrics Tracking

Location: Both servers use @track_tool_usage decorator

Recommendation: Ensure consistent metrics labels across toolsets


Security and Authentication Analysis

monitoring-mcp

  • Authentication: Optional JWT via MCP_JWT_AUDIENCE env var
  • Token Validation: Against toolbelt.tinyspeck.com JWKS
  • Development Mode: No auth if MCP_JWT_AUDIENCE unset
  • OAuth Discovery: Standard /.well-known/oauth-protected-resource endpoint

monitoring-ai/mcp

  • Authentication: JWT with PKCE flow; token validation currently disabled (temporary workaround)
  • Token Validation: Skipped due to MCP client RFC8707 issue (see commit 7ab728e)
  • User Tracking: Extracts user from JWT payload for metrics
  • OAuth Discovery: Standard endpoints with toolset-specific resource URLs

Security Recommendations

  1. Re-enable JWT validation once MCP client RFC8707 issue is resolved
  2. Audit logging for all tool invocations (especially write operations)
  3. Rate limiting to prevent abuse
  4. Scope-based permissions (e.g., read-only vs read-write tokens)

Metrics and Observability

Current Metrics

  • Tool Usage: All tools instrumented with @track_tool_usage
  • User Context: monitoring-ai tracks user and toolset per request
  • Prometheus Metrics: Exposed at /metrics endpoint
  • Health Checks: /health endpoint on both servers

Recommended Additions

  1. Query Performance: Track query duration, result size
  2. Error Rates: Track failures by tool and error type
  3. User Activity: Most active users, tools, queries
  4. Datasource Health: Track datasource availability and latency
  5. Cache Hit Rates: If implementing query caching

Future Directions

Short Term (1-3 months)

  1. Consolidate to single server (monitoring-ai)
  2. Standardize tool naming
  3. Re-enable JWT validation
  4. Add missing Prometheus rules tool

Medium Term (3-6 months)

  1. Add write operations (silences, dashboards)
  2. Improve result pagination
  3. Add query caching layer
  4. Cross-datasource search

Long Term (6+ months)

  1. Distributed tracing integration (Tempo)
  2. Advanced analytics (Thanos long-term storage)
  3. Alert recommendation system
  4. Automated incident investigation workflows

Appendix: Complete Tool Inventory

monitoring-mcp Tools

| Tool | Category | Input Parameters | Output | Use Case |
| --- | --- | --- | --- | --- |
| hello_world | Demo | name: str | str | Testing MCP |
| echo | Demo | message: str | Dict[str, Any] | Testing structured responses |
| server_info | System | None | Dict[str, Any] | Server status check |
| prometheus_query | Metrics | query: str, host: str | Dict[str, Any] | Direct PromQL query |
| get_rules | Metrics | cluster: str, filters... | Dict[str, Any] | Alert/recording rule discovery |
| list_metrics | Metrics | cluster: str, match_filter: Optional[str] | Dict[str, Any] | Metric name discovery |

monitoring-ai/mcp Grafana Tools

| Tool | Category | Input Parameters | Output | Use Case |
| --- | --- | --- | --- | --- |
| grafana_search_dashboards | Discovery | query: Optional[str] | List[Dict] | Find relevant dashboards |
| grafana_get_dashboard | Discovery | uid: str | Dict | Extract dashboard config |
| grafana_list_datasources | Discovery | filter_type: Optional[str] | List[Dict] | Find datasource UIDs |
| grafana_prometheus_query | Query | datasource_uid, expr, start_time, end_time | Dict | Query metrics time series |
| grafana_prometheus_label_values | Discovery | datasource_uid, series_selector, start, end | Dict | Discover label values |
| grafana_prometheus_metric_names | Discovery | datasource_uid, glob_pattern, start, end | Dict | Find metric names |
| grafana_opensearch_query | Query | datasource_uid, query, start_time, end_time | Dict | Search logs with Lucene |
| grafana_opensearch_field_mapping | Discovery | datasource_uid | Dict | Discover log fields |

monitoring-ai/mcp AlertManager Tools

| Tool | Category | Input Parameters | Output | Use Case |
| --- | --- | --- | --- | --- |
| get_alertmanager_status | Status | None | Dict | Check service health |
| get_receivers | Config | None | List[Dict] | List notification receivers |
| get_silences | Query | filter_query: Optional[str] | List[Dict] | List silences |
| get_silence_by_id | Query | silence_id: str | Dict | Get specific silence |
| get_alerts | Query | filter_query, silenced, inhibited, active | List[Dict] | List current alerts |
| get_alert_groups | Query | filter_query, silenced, inhibited, active | List[Dict] | List alert groups |
| get_alerts_summary | Query | filter_query, active_only | Dict | Alert count summary |
| search_alerts_by_label | Query | label_name, label_value, active_only | List[Dict] | Search alerts by label |
| get_alertmanager_server_config | Config | None | Dict | Get AlertManager config |

Conclusion

The monitoring-ai/mcp unified server represents a mature, production-ready MCP implementation with excellent tool design patterns. Key strengths include:

  1. Comprehensive coverage of monitoring workflows (metrics, logs, alerts)
  2. Outstanding UX patterns (quality ranking, detailed descriptions, time parsing)
  3. Performance optimizations (result limiting, auto-stepping, interval calculation)
  4. Consistent architecture (toolset routing, metrics tracking, error handling)

Primary recommendations:

  1. Consolidate monitoring-mcp into monitoring-ai/mcp
  2. Standardize tool naming to {service}_{action}_{resource}
  3. Add write operations for silence and dashboard management
  4. Re-enable JWT validation once the client issues are resolved

The codebase demonstrates strong engineering practices and can serve as a reference implementation for future MCP servers.


Generated with: Claude Code + claude-opus-4-1
Repository: slack-github.com/slack/monitoring-mcp, slack-github.com/slack/monitoring-ai
