@yaroslav-tykhonchuk
Created November 18, 2025 09:48
History-troubleshooter SKILL.md
name: history-troubleshooter
description: Diagnose and troubleshoot YouScan history collection pipeline - Search phase (Elasticsearch cluster health, slow queries, unassigned shards) and Save phase (slow saves, external service errors, pipeline bottlenecks)
version: 2.2.0

History Troubleshooter

This skill helps diagnose and troubleshoot the YouScan history collection pipeline, analyzing both the Search phase (loading mentions from a >2.48 PB Elasticsearch History Cluster) and the Save phase (processing mentions through SaveMentionPipeline with external services).

When to Use This Skill

Use this skill when:

  • You need a systematic analysis of the history collection pipeline (Search and Save phases)
  • History collection is slow or failing for topics
  • Search issues: Elasticsearch cluster shows timeouts, search context missing exceptions, unassigned shards, or degraded cluster health
  • Save issues: Slow batch saves (>150s), external service errors, or mention processing delays
  • You need to investigate performance issues with history searches or saves
  • Too many parallel history collections are suspected
  • External services (aspect-classification, sentiment-analysis, image-recognition) are showing errors

Cluster Architecture Overview

Storage Tiers

  • Hot nodes (SSD): Last 1 year of data, 1 replica for last 45 days, 0 replicas after
  • Warm nodes (HDD): Older data, replicas up to 2206 (2022-06), gradual removal when space is limited
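
The tier rules above can be sketched as a small lookup, which helps when judging whether a missing shard on a given day can still be served from a replica. This is a minimal sketch, not the cluster's actual allocation logic: the function name is hypothetical, "1 year" is approximated as 365 days, and the warm-tier replica count is returned as `None` because it depends on available space.

```python
from datetime import date
from typing import Optional, Tuple

HOT_RETENTION_DAYS = 365  # assumption: "last 1 year" approximated as 365 days
REPLICA_WINDOW_DAYS = 45  # hot indices keep 1 replica for the last 45 days


def storage_tier(index_day: date, today: date) -> Tuple[str, Optional[int]]:
    """Return (tier, expected_replicas) for the daily index of index_day.

    The warm-tier replica count is not derivable from the rules above,
    so it is reported as None instead of being guessed.
    """
    age_days = (today - index_day).days
    if age_days <= REPLICA_WINDOW_DAYS:
        return ("hot", 1)
    if age_days <= HOT_RETENTION_DAYS:
        return ("hot", 0)
    return ("warm", None)
```

For hot indices older than 45 days the expected replica count is 0, so a node failure there leaves no replica to promote and the affected primary shards stay unassigned until the node returns.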

Index Structure

  • Daily indices with 12 shards (recent years)
  • Index types:
    • Regular: bs-prd-YYMM-DD or fh-prd-YYMM-DD
  • Prefix: bs (better search, since 2023-05-04), fh (full history, before)
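
To map a date to the daily index it lives in, the naming rules above can be sketched as follows. The function name is hypothetical, and the sketch assumes that the `bs` prefix applies from 2023-05-04 inclusive and that `YYMM-DD` is the literal layout of the suffix.

```python
from datetime import date

# Assumption: "since 2023-05-04" means that day itself already uses the bs prefix.
BS_START = date(2023, 5, 4)


def daily_index_name(day: date) -> str:
    """Build the regular daily index name for a given day."""
    prefix = "bs" if day >= BS_START else "fh"
    # %y%m gives the two-digit year and month, %d the zero-padded day.
    return f"{prefix}-prd-{day:%y%m}-{day:%d}"
```

For example, `daily_index_name(date(2022, 6, 15))` yields `fh-prd-2206-15`, which matches the `2206 (2022-06)` notation used for the warm tier.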

Common Issues

Search Phase:

  • Complex boolean queries with wildcards and distance operators
  • Too many parallel history collections for the same period
  • Node failures causing unassigned shards
  • Search context missing exceptions (timeouts)
  • ShardsMissingException errors

Save Phase:

  • Slow batch saves (SaveDuration >150 seconds)
  • External service failures (aspect-classification, sentiment-analysis, image-recognition, entity-extraction)
  • External service rate limiting ("Too many requests")
  • Pipeline backpressure from slow enrichment

Main Agent Workflow

IMPORTANT: You are the main history-troubleshooter agent. Your role is to:

  1. Launch 2 specialized analysis subagents in parallel (Search and Save)
  2. Receive synthesized findings from each subagent
  3. Consolidate both reports into a comprehensive diagnosis

History Collection Pipeline Overview

History collection has two main phases:

  1. Search Phase: Loading mentions from Elasticsearch History Cluster

    • Querying Elasticsearch with complex boolean queries
    • Scanning through daily indices on hot/warm nodes
    • Managing search contexts and timeouts
  2. Save Phase: Processing and saving mentions through SaveMentionPipeline

    • Enrichment and validation
    • Passing through Data Science services
    • Indexing to topic-specific indices in MentionStream Elasticsearch cluster

Both phases can have independent issues, so they must be analyzed in parallel.


Step 1: Launch Analysis Subagents in Parallel

Call both subagents simultaneously to minimize time:

Task 1 - Search Analysis Subagent:

Launch the search subagent to analyze the Search phase of history collection:
- How mentions are retrieved from Elasticsearch History Cluster
- Active collection patterns and load
- Cluster health and performance
- Search errors and slow queries

The search subagent will coordinate 3 specialized subagents:
- Collections analysis (API)
- Grafana metrics (cluster health)
- Logs analysis (HistorySearcher errors)

See: search-subagent/search-subagent.md

Return: Comprehensive Search phase analysis with:
- Search phase summary and health status
- Key search findings and root cause analysis
- Search impact assessment
- Prioritized recommendations for search issues

Task 2 - Save Analysis Subagent:

Launch the save subagent to analyze the Save phase of history collection:
- How mentions are processed through SaveMentionPipeline
- Slow save operations (SaveDuration >150s)
- External services health (aspect-classification, sentiment-analysis, image-recognition, entity-extraction)
- Save success rates and failures

The save subagent will coordinate 2 specialized subagents:
- Logs analysis (slow save operations)
- Grafana metrics (external services health and continuous errors)

See: save-subagent/save-subagent.md

Return: Comprehensive Save phase analysis with:
- Save phase summary and health status
- Key save findings and root cause analysis
- Save impact assessment
- Prioritized recommendations for save issues
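
The parallel launch can be pictured with a short asyncio sketch. The two `run_*_subagent` coroutines are hypothetical stand-ins for the real subagents (which follow their own `.md` instructions), and the report fields are illustrative only. The point is that both tasks start together and that a failure in one does not discard the other's report.

```python
import asyncio
from typing import Any, Dict


async def run_search_subagent() -> Dict[str, Any]:
    # Hypothetical stand-in for search-subagent/search-subagent.md
    await asyncio.sleep(0)
    return {"phase": "search", "health": "Healthy"}


async def run_save_subagent() -> Dict[str, Any]:
    # Hypothetical stand-in for save-subagent/save-subagent.md
    await asyncio.sleep(0)
    return {"phase": "save", "health": "Healthy"}


async def launch_analysis() -> Dict[str, Dict[str, Any]]:
    # gather() runs both coroutines concurrently and keeps argument order.
    # return_exceptions=True returns a failure as a value instead of raising,
    # so one broken subagent does not hide the other's findings.
    results = await asyncio.gather(
        run_search_subagent(), run_save_subagent(), return_exceptions=True
    )
    reports: Dict[str, Dict[str, Any]] = {}
    for phase, result in zip(("search", "save"), results):
        if isinstance(result, Exception):
            reports[phase] = {"phase": phase, "health": "Unknown", "error": str(result)}
        else:
            reports[phase] = result
    return reports


if __name__ == "__main__":
    print(asyncio.run(launch_analysis()))
```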

Step 2: Consolidate Search and Save Findings

After receiving synthesized reports from both Search and Save subagents, create a comprehensive diagnosis:

Structure your response:

1. Executive Summary (3-4 sentences)

  • Overall history collection health: Healthy / Degraded / Critical
  • Search phase status summary
  • Save phase status summary
  • Primary bottleneck: Search / Save / Both
  • Severity assessment
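
One possible way to derive the overall health and primary bottleneck from the two phase statuses is sketched below. The skill does not spell out this mapping. The rule here is an assumption inferred from the three example scenarios later in this document (one unhealthy phase gives "Degraded", both phases critical gives "Critical"), and the function name is hypothetical.

```python
from typing import Tuple

RANK = {"Healthy": 0, "Degraded": 1, "Critical": 2}


def consolidate_health(search: str, save: str) -> Tuple[str, str]:
    """Return (overall_health, primary_bottleneck) for two phase statuses."""
    s, v = RANK[search], RANK[save]
    if s == 0 and v == 0:
        return ("Healthy", "None")
    # Assumption: overall is Critical only when both phases are Critical.
    overall = "Critical" if s == 2 and v == 2 else "Degraded"
    if s == v:
        bottleneck = "Both"
    else:
        bottleneck = "Search" if s > v else "Save"
    return (overall, bottleneck)
```

Mixed cases such as Critical plus Degraded are not covered by the scenarios, so treat the result for those as a starting point and adjust it by judgment.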

2. Search Phase Overview

Brief summary of search subagent findings:

  • Search health status
  • Key search issues (1-2 bullet points)
  • Search severity
  • Top search recommendation

3. Save Phase Overview

Brief summary of save subagent findings:

  • Save health status
  • Key save issues (1-2 bullet points)
  • Save severity
  • Top save recommendation

4. Combined Root Cause Analysis

  • What is the primary bottleneck in history collection?
  • Are Search and Save issues related or independent?
  • Is one phase blocking the other?
  • Examples:
    • Search overload slowing retrieval, Save pipeline idle
    • Search healthy, Save pipeline backpressured
    • Both phases overwhelmed, cascading failures
    • Independent issues in both phases

5. Overall Impact Assessment

  • Which topics/users are impacted overall?
  • What is the combined effect on history collection performance?
  • Is data at risk from either phase?
  • Overall severity: Low / Medium / High / Critical

6. Prioritized Combined Actions

Order actions by urgency and impact across both phases:

CRITICAL (immediate):

  • Actions that address critical issues in either phase
  • Must be done within minutes/hours
  • Example: Unassigned shards, service failures, data loss risk

HIGH (urgent):

  • Actions that address severe degradation
  • Must be done within hours
  • Example: Abort excessive collections, restart failed pipelines

MEDIUM (short-term):

  • Actions that improve degraded performance
  • Should be done within days
  • Example: Optimize queries, adjust pipeline configuration

LOW (preventive):

  • Long-term improvements and monitoring
  • Plan and implement over weeks
  • Example: Add alerts, implement throttling policies

Be specific: reference topic IDs, user emails, service names, commands
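
Merging the two subagents' recommendations into one list ordered by urgency rather than by phase can be sketched as below. The `Action` record and its fields are hypothetical; only the four priority levels come from this section.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, List


class Priority(IntEnum):
    CRITICAL = 0
    HIGH = 1
    MEDIUM = 2
    LOW = 3


@dataclass(frozen=True)
class Action:
    priority: Priority
    phase: str  # "Search", "Save", or "Both"
    description: str


def prioritize(search_actions: Iterable[Action], save_actions: Iterable[Action]) -> List[Action]:
    """Merge both phases' actions, most urgent first."""
    # sorted() is stable, so actions with equal priority keep their input order.
    return sorted([*search_actions, *save_actions], key=lambda a: a.priority)
```

Because the sort is stable, the order in which each subagent listed its own recommendations is preserved within a priority level.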

7. Additional Investigation (if needed)

  • Questions that require deeper analysis in either phase
  • Specific components to investigate further
  • Follow-up metrics or logs to review

Best Practices for Main Agent

Focus on Consolidation:

  • You are NOT analyzing raw data - subagents do that
  • Your job is to synthesize TWO pre-analyzed reports
  • Identify connections between Search and Save findings
  • Determine the primary bottleneck

Avoid Duplication:

  • Don't repeat detailed subagent findings
  • Summarize key points from each phase
  • Focus on how phases interact or conflict

Prioritize Across Phases:

  • A critical Search issue takes priority over minor Save issues
  • Identify dependencies: Can't fix Save if Search is broken
  • Order recommendations by overall impact, not phase

Be Concise:

  • Subagents provide detailed analysis
  • Your report should be a high-level executive summary
  • Reference subagent reports for details: "See Search phase report for details"

Quantify Overall Impact:

  • How many collections affected total?
  • What percentage of mentions are delayed?
  • How many users impacted?
  • Overall performance degradation: X%
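
The percentages above are simple ratios, sketched here to keep the arithmetic consistent between reports. Both helper names are hypothetical, and defining "performance degradation" as the drop in throughput relative to a baseline is an assumption, since the skill does not define the metric.

```python
def percent(part: float, whole: float) -> float:
    """Share of `part` in `whole`, in percent, rounded to one decimal."""
    return 0.0 if whole == 0 else round(100 * part / whole, 1)


def degradation_pct(baseline: float, current: float) -> float:
    """Assumed definition: throughput drop relative to the baseline, in percent."""
    return percent(baseline - current, baseline)
```

For instance, `percent(22, 28)` gives 78.6, the share of slow saves that fall inside a service error period when 22 of 28 do.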

Additional Resources

Analysis Subagents

  • search-subagent/search-subagent.md: Coordinates Search phase analysis (collections, Grafana cluster health, HistorySearcher logs)
  • save-subagent/save-subagent.md: Coordinates Save phase analysis (slow saves, external services health, save failures)

Search Phase Subagents

  • search-subagent/collections-subagent.md: Analyzes active history collections via API
  • search-subagent/grafana-subagent.md: Analyzes Elasticsearch cluster health and metrics
  • search-subagent/logs-subagent.md: Analyzes HistorySearcher error logs

Save Phase Subagents

  • save-subagent/logs-subagent.md: Analyzes slow save operations (SaveDuration >150s)
  • save-subagent/grafana-subagent.md: Analyzes external services health and continuous errors

Example Scenarios

Scenario 1: Search Overload, Save Healthy

Main Agent Consolidation:

Executive Summary:

  • History collection health: Degraded (Search bottleneck)
  • Search phase: CRITICAL - 65 active collections overwhelming Elasticsearch cluster
  • Save phase: HEALTHY - Pipeline operating normally with available capacity
  • Primary bottleneck: Search (retrieval from Elasticsearch)
  • Severity: High

Search Phase Overview (from search subagent):

  • Health: Critical
  • Key issues: 65 active collections (VERY HIGH), 3 power users, high CPU/latency
  • Severity: High
  • Top recommendation: Abort 10 collections, contact power users

Save Phase Overview (from save subagent):

  • Health: Healthy
  • Key issues: None (<5 slow saves in 24h, all external services healthy)
  • Severity: None
  • Top recommendation: Monitor for increased load when search recovers

Combined Root Cause Analysis: Search phase is the bottleneck. Too many parallel collections are overwhelming Elasticsearch cluster capacity, slowing mention retrieval. Save pipeline is healthy and has spare capacity but is starved of mentions due to slow search. Once search issues are resolved, ensure save pipeline can handle increased throughput.

Prioritized Combined Actions:

CRITICAL (immediate):

  1. Abort 10 lowest-priority collections on old data (Search issue)
    • Themes: 123456, 234567, 345678, 456789, 567890, 678901, 789012, 890123, 901234, 012345
  2. Contact 3 power users to coordinate collection launches (Search issue)

HIGH (urgent):

  1. Prioritize hot node collections for faster completion (Search optimization)
  2. Monitor save pipeline after search load reduces (ensure no downstream issues)

MEDIUM (short-term):

  1. Stagger warm node collection launches (Search scheduling)
  2. Review query complexity for top error-prone topics (Search optimization)

LOW (preventive):

  1. Implement per-user collection limit: max 5 concurrent (Search policy)
  2. Add monitoring alert: search latency p95 >2000ms (Search monitoring)
  3. Add monitoring alert: SaveDuration >150s or external service errors >5 min (Save monitoring)
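
The per-user limit proposed above (max 5 concurrent collections) can be checked against a list of active collections as sketched here. The record shape with a `user` key is hypothetical, since the real collections API response is described only in the collections subagent.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping

MAX_CONCURRENT_PER_USER = 5  # limit proposed in the LOW (preventive) actions


def users_over_limit(
    active_collections: Iterable[Mapping[str, str]],
    limit: int = MAX_CONCURRENT_PER_USER,
) -> Dict[str, int]:
    """Return {user: active_count} for users above the concurrent limit."""
    counts = Counter(c["user"] for c in active_collections)
    return {user: n for user, n in counts.items() if n > limit}
```

The same output doubles as a list of power users to contact.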

Scenario 2: Search Healthy, Save Phase Issues

Main Agent Consolidation:

Executive Summary:

  • History collection health: Degraded (Save bottleneck)
  • Search phase: HEALTHY - Normal collection load, cluster performing well
  • Save phase: DEGRADED - External service errors causing slow saves
  • Primary bottleneck: Save (external services)
  • Severity: Medium-High

Search Phase Overview (from search subagent):

  • Health: Healthy
  • Key issues: None (35 active collections, normal load, cluster metrics good)
  • Severity: None
  • Top recommendation: Continue normal operations

Save Phase Overview (from save subagent):

  • Health: Degraded
  • Key issues: 28 slow saves in 24h, ImageRecognition service 8% error rate for 15 minutes
  • Severity: Medium-High
  • Top recommendation: Review ImageRecognition rate limits, monitor service health

Combined Root Cause Analysis: Save phase is the bottleneck. Search is retrieving mentions efficiently, but SaveMentionPipeline is experiencing delays due to ImageRecognition service rate limiting ("Too many requests"). 22 of 28 slow saves correlate with the service error period. Topics with high visual mention volumes are the most affected.
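
A correlation like "22 of 28 slow saves" can be obtained by counting slow saves whose timestamps fall inside the service error window. The sketch below assumes hypothetical log records with `timestamp` and `save_duration_s` keys. Only the 150 second threshold comes from this skill.

```python
from datetime import datetime
from typing import Iterable, Mapping, Tuple

SLOW_SAVE_SECONDS = 150  # SaveDuration threshold used throughout this skill


def slow_saves_in_window(
    saves: Iterable[Mapping],
    window_start: datetime,
    window_end: datetime,
) -> Tuple[int, int]:
    """Return (slow saves inside the window, all slow saves)."""
    slow = [s for s in saves if s["save_duration_s"] > SLOW_SAVE_SECONDS]
    inside = [s for s in slow if window_start <= s["timestamp"] <= window_end]
    return len(inside), len(slow)
```

A high ratio suggests the external service is the cause, while a low ratio points to something else in the pipeline.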

Prioritized Combined Actions:

CRITICAL (immediate):

  1. Review ImageRecognition service rate limits and increase if needed (Save issue)
  2. Monitor ImageRecognition service for continued errors (Save monitoring)

HIGH (urgent):

  1. Implement a circuit breaker to skip image processing when the service is failing (Save resilience)
  2. Review topics with high visual mention volumes (Save optimization)

MEDIUM (short-term):

  1. Add adaptive batch size reduction during service errors (Save optimization)
  2. Implement auto-scaling for ImageRecognition based on request rate (Save infrastructure)

LOW (preventive):

  1. Add alert: External service error rate >5% for >5 minutes (Save monitoring)
  2. Monitor for approaching rate limits (Save proactive alerting)
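
The alert condition "error rate >5% for >5 minutes" differs from a plain threshold check because the breach has to be continuous. A minimal sketch follows, assuming hypothetical metric samples given as `(unix_seconds, error_rate_pct)` pairs sorted by time. It is not tied to any particular Grafana or alerting API.

```python
from typing import Iterable, Optional, Tuple


def sustained_breach(
    samples: Iterable[Tuple[float, float]],
    threshold_pct: float = 5.0,
    min_duration_s: float = 300.0,
) -> bool:
    """True if the rate stays above threshold_pct for longer than min_duration_s."""
    breach_start: Optional[float] = None
    for ts, rate in samples:
        if rate > threshold_pct:
            if breach_start is None:
                breach_start = ts  # first sample of a new breach
            if ts - breach_start > min_duration_s:
                return True
        else:
            breach_start = None  # any healthy sample resets the timer
    return False
```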

Scenario 3: Both Phases Degraded - Infrastructure Failure

Main Agent Consolidation:

Executive Summary:

  • History collection health: CRITICAL (Both phases failing)
  • Search phase: CRITICAL - Unassigned shards, node failures
  • Save phase: CRITICAL - (TO BE DEFINED - specific failure indicators)
  • Primary bottleneck: Both (infrastructure failures)
  • Severity: Critical

Search Phase Overview (from search subagent):

  • Health: Critical
  • Key issues: 12 unassigned primary shards, 6 nodes down, ShardsMissingException errors
  • Severity: Critical
  • Top recommendation: Recover missing nodes immediately
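
Findings like these typically come from the Elasticsearch cluster health response, whose `status` field is `green`, `yellow`, or `red` and which also reports `unassigned_shards`. The sketch below maps such a response (already parsed into a dict) to the health labels used in this skill. The mapping itself and the function name are assumptions, not something the skill defines.

```python
from typing import Any, Mapping, Tuple


def classify_cluster_health(health: Mapping[str, Any]) -> Tuple[str, int]:
    """Map a cluster health response to (skill health label, unassigned shards)."""
    unassigned = int(health.get("unassigned_shards", 0))
    status = health.get("status")
    if status == "red":
        # red: at least one primary shard is unassigned
        return ("Critical", unassigned)
    if status == "yellow":
        # yellow: all primaries assigned, some replicas are not
        return ("Degraded", unassigned)
    if status == "green":
        return ("Healthy", unassigned)
    return ("Unknown", unassigned)
```

For example, `classify_cluster_health({"status": "red", "unassigned_shards": 12})` returns `("Critical", 12)`.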

Save Phase Overview (from save subagent):

  • Health: Critical
  • Key issues: Multiple external service failures (aspect-classification and sentiment-analysis both at >50% error rate), 75+ slow saves
  • Severity: Critical
  • Top recommendation: Investigate external services infrastructure, check service dependencies

Combined Root Cause Analysis: Infrastructure failure affecting both phases of history collection. This is infrastructure-wide, not a load or configuration issue. Both Search and Save phases require immediate attention.

Prioritized Combined Actions:

CRITICAL (immediate):

  1. HIGHEST PRIORITY: Recover 6 History cluster nodes (Search infrastructure)
  2. HIGHEST PRIORITY: Investigate external services failures - check common dependencies (databases, network, shared infrastructure) (Save infrastructure)
  3. Investigate infrastructure root cause (likely network partition or shared resource failure)
  4. Pause all new collection launches until both phases stabilize
  5. Assess data loss for unassigned shards and mentions that failed to save

HIGH (urgent - after recovery):

  1. Verify History cluster health returns to GREEN (Search verification)
  2. Verify external services recovery (aspect-classification, sentiment-analysis error rates back to <1%) (Save verification)
  3. Restart all failed collections (Search operations)
  4. Check save failure rates and identify mentions that may need reprocessing (Save operations)

MEDIUM (short-term):

  1. Review replica configuration - increase replicas for recent indices (Search resilience)
  2. Add circuit breakers for external services to prevent cascading failures (Save resilience)
  3. Create infrastructure failover runbook (Both phases)

LOW (preventive):

  1. Implement automated node health checks and alerts (Infrastructure monitoring)
  2. Add redundancy for critical infrastructure
  3. Test disaster recovery procedures quarterly (Operations)
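
The circuit breakers recommended in the scenarios above can be illustrated with a minimal sketch. It is a generic version of the pattern, not the pipeline's implementation: the class, its parameters, and the default thresholds are all hypothetical.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stop calling a failing service until a cooldown has passed."""

    def __init__(
        self,
        failure_threshold: int = 5,
        reset_after_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a call to the service may be attempted."""
        if self._opened_at is None:
            return True  # closed: normal operation
        # Open: allow trial calls again only once the cooldown has elapsed.
        # The next recorded failure re-opens the breaker immediately.
        return self._clock() - self._opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

When `allow()` returns `False`, the caller would skip the enrichment step (for example image processing) and continue with the save, rather than waiting on a service that is known to be failing.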