| name | description | version |
|---|---|---|
| history-troubleshooter | Diagnose and troubleshoot YouScan history collection pipeline - Search phase (Elasticsearch cluster health, slow queries, unassigned shards) and Save phase (slow saves, external service errors, pipeline bottlenecks) | 2.2.0 |
This skill helps diagnose and troubleshoot the YouScan history collection pipeline, analyzing both the Search phase (loading mentions from a >2.48 PB Elasticsearch History Cluster) and the Save phase (processing mentions through SaveMentionPipeline with external services).
Use this skill when:
- You need a systematic analysis of the history collection pipeline (Search and Save phases)
- History collection is slow or failing for topics
- Search issues: the Elasticsearch cluster shows timeouts, search context missing exceptions, unassigned shards, or degraded cluster health
- Save issues: slow batch saves (>150s), external service errors, or mention processing delays
- You need to investigate performance issues with history searches or saves
- Too many parallel history collections are suspected
- External services (aspect-classification, sentiment-analysis, image-recognition) are showing errors
Elasticsearch History Cluster layout:
- Hot nodes (SSD): Last 1 year of data, 1 replica for last 45 days, 0 replicas after
- Warm nodes (HDD): Older data, replicas up to 2206 (2022-06), gradual removal when space is limited
- Daily indices with 12 shards (recent years)
- Index types:
  - Regular: `bs-prd-YYMM-DD` or `fh-prd-YYMM-DD`
  - Prefix: `bs` (better search, since 2023-05-04), `fh` (full history, before)
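The naming scheme above can be turned into a small helper for working out which daily indices a collection period touches. This is a minimal sketch, assuming the `bs` prefix applies from 2023-05-04 inclusive and that hot nodes hold roughly the last 365 days; the function names and the exact cutoffs are illustrative and not taken from the pipeline code.

```python
from datetime import date, timedelta

# Assumption: "bs" indices start on 2023-05-04 (inclusive); earlier days use "fh".
BS_START = date(2023, 5, 4)
# Assumption: "last 1 year" on hot nodes is approximated as 365 days.
HOT_WINDOW_DAYS = 365


def index_name(day: date) -> str:
    """Return the daily index name in the <prefix>-prd-YYMM-DD form."""
    prefix = "bs" if day >= BS_START else "fh"
    return f"{prefix}-prd-{day:%y%m}-{day:%d}"


def indices_for_range(start: date, end: date) -> list[str]:
    """List daily index names for every day from start to end, inclusive."""
    days = (end - start).days
    return [index_name(start + timedelta(days=i)) for i in range(days + 1)]


def node_tier(day: date, today: date) -> str:
    """Rough guess at the tier holding a day's index: 'hot' or 'warm'."""
    age_days = (today - day).days
    return "hot" if age_days <= HOT_WINDOW_DAYS else "warm"
```

A collection spanning 2023-05-03 to 2023-05-05 would cross the prefix change, so it touches one `fh` index followed by two `bs` indices.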
Search Phase:
- Complex boolean queries with wildcards and distance operators
- Too many parallel history collections for the same period
- Node failures causing unassigned shards
- Search context missing exceptions (timeouts)
- ShardsMissingException errors
Save Phase:
- Slow batch saves (SaveDuration >150 seconds)
- External service failures (aspect-classification, sentiment-analysis, image-recognition, entity-extraction)
- External service rate limiting ("Too many requests")
- Pipeline backpressure from slow enrichment
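The slow-save threshold can be applied to parsed log records as follows. This is a minimal sketch: the `SaveRecord` type and its field names are hypothetical, and only the 150-second threshold comes from the list above.

```python
from dataclasses import dataclass

# Threshold from the "SaveDuration >150 seconds" rule.
SLOW_SAVE_THRESHOLD_S = 150.0


@dataclass
class SaveRecord:
    topic_id: int            # hypothetical field name
    save_duration_s: float   # hypothetical field, parsed from SaveDuration


def slow_saves(records: list[SaveRecord]) -> list[SaveRecord]:
    """Keep only the batch saves that exceeded the slow-save threshold."""
    return [r for r in records if r.save_duration_s > SLOW_SAVE_THRESHOLD_S]


def slow_saves_by_topic(records: list[SaveRecord]) -> dict[int, int]:
    """Count slow saves per topic to show where delays concentrate."""
    counts: dict[int, int] = {}
    for record in slow_saves(records):
        counts[record.topic_id] = counts.get(record.topic_id, 0) + 1
    return counts
```

Grouping by topic makes it easier to tell a single problematic topic from pipeline-wide backpressure.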
IMPORTANT: You are the main history-troubleshooter agent. Your role is to:
- Launch 2 specialized analysis subagents in parallel (Search and Save)
- Receive synthesized findings from each subagent
- Consolidate both reports into a comprehensive diagnosis
History collection has two main phases:
- Search Phase: Loading mentions from Elasticsearch History Cluster
- Querying Elasticsearch with complex boolean queries
- Scanning through daily indices on hot/warm nodes
- Managing search contexts and timeouts
- Save Phase: Processing and saving mentions through SaveMentionPipeline
- Enrichment and validation
- Passing through Data Science services
- Indexing to topic-specific indices in the MentionStream Elasticsearch cluster
Both phases can have independent issues, so they must be analyzed in parallel.
Call both subagents simultaneously to minimize time:
Task 1 - Search Analysis Subagent:
Launch the search subagent to analyze the Search phase of history collection:
- How mentions are retrieved from Elasticsearch History Cluster
- Active collection patterns and load
- Cluster health and performance
- Search errors and slow queries
The search subagent will coordinate 3 specialized subagents:
- Collections analysis (API)
- Grafana metrics (cluster health)
- Logs analysis (HistorySearcher errors)
See: search-subagent/search-subagent.md
Return: Comprehensive Search phase analysis with:
- Search phase summary and health status
- Key search findings and root cause analysis
- Search impact assessment
- Prioritized recommendations for search issues
Task 2 - Save Analysis Subagent:
Launch the save subagent to analyze the Save phase of history collection:
- How mentions are processed through SaveMentionPipeline
- Slow save operations (SaveDuration >150s)
- External services health (aspect-classification, sentiment-analysis, image-recognition, entity-extraction)
- Save success rates and failures
The save subagent will coordinate 2 specialized subagents:
- Logs analysis (slow save operations)
- Grafana metrics (external services health and continuous errors)
See: save-subagent/save-subagent.md
Return: Comprehensive Save phase analysis with:
- Save phase summary and health status
- Key save findings and root cause analysis
- Save impact assessment
- Prioritized recommendations for save issues
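The parallel launch of the two tasks above can be pictured with `asyncio.gather`. This is only an illustration of the concurrency pattern: `run_search_subagent` and `run_save_subagent` are hypothetical placeholders, and the real subagents are launched through the agent framework rather than through Python coroutines.

```python
import asyncio


async def run_search_subagent() -> dict:
    """Hypothetical placeholder for the Search subagent."""
    await asyncio.sleep(0)  # stands in for the real analysis work
    return {"phase": "search", "health": "Healthy"}


async def run_save_subagent() -> dict:
    """Hypothetical placeholder for the Save subagent."""
    await asyncio.sleep(0)  # stands in for the real analysis work
    return {"phase": "save", "health": "Healthy"}


async def collect_reports() -> tuple[dict, dict]:
    """Run both subagents concurrently and return (search, save) reports."""
    # gather() schedules both coroutines at once and returns results
    # in the same order as its arguments.
    search_report, save_report = await asyncio.gather(
        run_search_subagent(),
        run_save_subagent(),
    )
    return search_report, save_report
```

Because the results come back in argument order, the consolidation step can rely on the first report being Search and the second being Save.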
After receiving synthesized reports from both Search and Save subagents, create a comprehensive diagnosis:
Structure your response:
- Overall history collection health: Healthy / Degraded / Critical
- Search phase status summary
- Save phase status summary
- Primary bottleneck: Search / Save / Both
- Severity assessment
Brief summary of search subagent findings:
- Search health status
- Key search issues (1-2 bullet points)
- Search severity
- Top search recommendation
Brief summary of save subagent findings:
- Save health status
- Key save issues (1-2 bullet points)
- Save severity
- Top save recommendation
- What is the primary bottleneck in history collection?
- Are Search and Save issues related or independent?
- Is one phase blocking the other?
- Examples:
- Search overload slowing retrieval, Save pipeline idle
- Search healthy, Save pipeline backpressured
- Both phases overwhelmed, cascading failures
- Independent issues in both phases
- Which topics/users are impacted overall?
- What is the combined effect on history collection performance?
- Is data at risk from either phase?
- Overall severity: Low / Medium / High / Critical
Order actions by urgency and impact across both phases:
CRITICAL (immediate):
- Actions that address critical issues in either phase
- Must be done within minutes/hours
- Example: Unassigned shards, service failures, data loss risk
HIGH (urgent):
- Actions that address severe degradation
- Must be done within hours
- Example: Abort excessive collections, restart failed pipelines
MEDIUM (short-term):
- Actions that improve degraded performance
- Should be done within days
- Example: Optimize queries, adjust pipeline configuration
LOW (preventive):
- Long-term improvements and monitoring
- Plan and implement over weeks
- Example: Add alerts, implement throttling policies
Be specific: reference topic IDs, user emails, service names, commands
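The ordering of actions across both phases can be sketched as a sort on priority alone. The `Action` type and its fields are assumptions made for illustration; only the four priority levels come from the text above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """Lower value means more urgent."""
    CRITICAL = 0
    HIGH = 1
    MEDIUM = 2
    LOW = 3


@dataclass
class Action:
    priority: Priority
    phase: str         # "Search", "Save", or "Both" (hypothetical field)
    description: str


def order_actions(actions: list[Action]) -> list[Action]:
    """Order actions by urgency, ignoring which phase they belong to."""
    # sorted() is stable, so actions with equal priority keep input order.
    return sorted(actions, key=lambda action: action.priority)
```

Sorting on priority only, rather than grouping by phase first, keeps a critical Save action ahead of a medium Search action.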
- Questions that require deeper analysis in either phase
- Specific components to investigate further
- Follow-up metrics or logs to
Focus on Consolidation:
- You are NOT analyzing raw data - subagents do that
- Your job is to synthesize TWO pre-analyzed reports
- Identify connections between Search and Save findings
- Determine the primary bottleneck
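One possible way to determine the primary bottleneck from the two health statuses is to compare their ranks. This is a rough sketch, not a rule defined by the skill: the ranking and the "None" result for two healthy phases are assumptions, and real reports may call for judgment beyond a rank comparison.

```python
# Assumption: health statuses are ranked Healthy < Degraded < Critical.
HEALTH_RANK = {"Healthy": 0, "Degraded": 1, "Critical": 2}


def primary_bottleneck(search_health: str, save_health: str) -> str:
    """Return 'Search', 'Save', 'Both', or 'None' from two phase statuses."""
    search_rank = HEALTH_RANK[search_health]
    save_rank = HEALTH_RANK[save_health]
    if search_rank == 0 and save_rank == 0:
        return "None"  # assumption: no bottleneck when both are healthy
    if search_rank > save_rank:
        return "Search"
    if save_rank > search_rank:
        return "Save"
    return "Both"  # equally unhealthy phases
```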
Avoid Duplication:
- Don't repeat detailed subagent findings
- Summarize key points from each phase
- Focus on how phases interact or conflict
Prioritize Across Phases:
- A critical Search issue takes priority over minor Save issues
- Identify dependencies: Can't fix Save if Search is broken
- Order recommendations by overall impact, not phase
Be Concise:
- Subagents provide detailed analysis
- Your report should be a high-level executive summary
- Reference subagent reports for details: "See Search phase report for details"
Quantify Overall Impact:
- How many collections affected total?
- What percentage of mentions are delayed?
- How many users impacted?
- Overall performance degradation: X%
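The impact figures above reduce to simple ratios. This is a minimal sketch, assuming degradation is measured as the drop in throughput relative to a healthy baseline; both function names and that definition are assumptions, not something the subagent reports specify.

```python
def percent(part: float, whole: float) -> float:
    """Share of `part` in `whole`, as a percentage (0.0 when whole is 0)."""
    if whole == 0:
        return 0.0
    return 100.0 * part / whole


def degradation_percent(baseline: float, current: float) -> float:
    """Throughput drop relative to baseline, clamped at 0 for improvements."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return max(0.0, 100.0 * (baseline - current) / baseline)
```

For example, `percent(affected_collections, total_collections)` gives the share of collections affected, and `degradation_percent` gives the "X%" figure when a baseline throughput is available.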
- search-subagent/search-subagent.md: Coordinates Search phase analysis (collections, Grafana cluster health, HistorySearcher logs)
- save-subagent/save-subagent.md: Coordinates Save phase analysis (slow saves, external services health, save failures)
- search-subagent/collections-subagent.md: Analyzes active history collections via API
- search-subagent/grafana-subagent.md: Analyzes Elasticsearch cluster health and metrics
- search-subagent/logs-subagent.md: Analyzes HistorySearcher error logs
- save-subagent/logs-subagent.md: Analyzes slow save operations (SaveDuration >150s)
- save-subagent/grafana-subagent.md: Analyzes external services health and continuous errors
Main Agent Consolidation:
Executive Summary:
- History collection health: Degraded (Search bottleneck)
- Search phase: CRITICAL - 65 active collections overwhelming Elasticsearch cluster
- Save phase: HEALTHY - Pipeline operating normally with available capacity
- Primary bottleneck: Search (retrieval from Elasticsearch)
- Severity: High
Search Phase Overview (from search subagent):
- Health: Critical
- Key issues: 65 active collections (VERY HIGH), 3 power users, high CPU/latency
- Severity: High
- Top recommendation: Abort 10 collections, contact power users
Save Phase Overview (from save subagent):
- Health: Healthy
- Key issues: None (<5 slow saves in 24h, all external services healthy)
- Severity: None
- Top recommendation: Monitor for increased load when search recovers
Combined Root Cause Analysis: Search phase is the bottleneck. Too many parallel collections are overwhelming Elasticsearch cluster capacity, slowing mention retrieval. Save pipeline is healthy and has spare capacity but is starved of mentions due to slow search. Once search issues are resolved, ensure save pipeline can handle increased throughput.
Prioritized Combined Actions:
CRITICAL (immediate):
- Abort 10 lowest-priority collections on old data (Search issue)
- Themes: 123456, 234567, 345678, 456789, 567890, 678901, 789012, 890123, 901234, 012345
- Contact 3 power users to coordinate collection launches (Search issue)
- user1@example.com (12 collections), user2@example.com (11), user3@example.com (10)
HIGH (urgent):
- Prioritize hot node collections for faster completion (Search optimization)
- Monitor save pipeline after search load reduces (ensure no downstream issues)
MEDIUM (short-term):
- Stagger warm node collection launches (Search scheduling)
- Review query complexity for top error-prone topics (Search optimization)
LOW (preventive):
- Implement per-user collection limit: max 5 concurrent (Search policy)
- Add monitoring alert: search latency p95 >2000ms (Search monitoring)
- Add monitoring alert: SaveDuration >150s or external service errors >5 min (Save monitoring)
Main Agent Consolidation:
Executive Summary:
- History collection health: Degraded (Save bottleneck)
- Search phase: HEALTHY - Normal collection load, cluster performing well
- Save phase: DEGRADED - External service errors causing slow saves
- Primary bottleneck: Save (external services)
- Severity: Medium-High
Search Phase Overview (from search subagent):
- Health: Healthy
- Key issues: None (35 active collections, normal load, cluster metrics good)
- Severity: None
- Top recommendation: Continue normal operations
Save Phase Overview (from save subagent):
- Health: Degraded
- Key issues: 28 slow saves in 24h, ImageRecognition service 8% error rate for 15 minutes
- Severity: Medium-High
- Top recommendation: Review ImageRecognition rate limits, monitor service health
Combined Root Cause Analysis: Save phase is the bottleneck. Search is retrieving mentions efficiently, but SaveMentionPipeline is experiencing delays due to ImageRecognition service rate limiting ("Too many requests"). 22 of 28 slow saves correlate with the service error period. Topics with high visual mention volumes are the most affected.
Prioritized Combined Actions:
CRITICAL (immediate):
- Review ImageRecognition service rate limits and increase if needed (Save issue)
- Monitor ImageRecognition service for continued errors (Save monitoring)
HIGH (urgent):
- Implement circuit breaker to skip image processing when service failing (Save resilience)
- Review topics with high visual mention volumes (Save optimization)
MEDIUM (short-term):
- Add adaptive batch size reduction during service errors (Save optimization)
- Implement auto-scaling for ImageRecognition based on request rate (Save infrastructure)
LOW (preventive):
- Add alert: External service error rate >5% for >5 minutes (Save monitoring)
- Monitor for rate limit approaching (Save proactive alerting)
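The "error rate >5% for >5 minutes" alert condition can be sketched as a check for a sustained breach over time-ordered samples. The sample format (timestamp in seconds, error rate as a fraction) and the function name are assumptions for illustration; an actual alert would normally be defined in the monitoring system itself.

```python
def sustained_breach(
    samples: list[tuple[float, float]],
    threshold: float = 0.05,
    min_duration_s: float = 300.0,
) -> bool:
    """Return True if the error rate stays above `threshold` for longer
    than `min_duration_s`.

    `samples` is a list of (timestamp_s, error_rate) sorted by timestamp.
    """
    breach_start = None
    for timestamp, rate in samples:
        if rate > threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of a new breach
            if timestamp - breach_start > min_duration_s:
                return True
        else:
            breach_start = None  # any healthy sample resets the window
    return False
```

Resetting on any healthy sample means short spikes do not trigger the alert, only continuous errors do.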
Main Agent Consolidation:
Executive Summary:
- History collection health: CRITICAL (Both phases failing)
- Search phase: CRITICAL - Unassigned shards, node failures
- Save phase: CRITICAL - Multiple external services failing, 75+ slow saves
- Primary bottleneck: Both (infrastructure failures)
- Severity: Critical
Search Phase Overview (from search subagent):
- Health: Critical
- Key issues: 12 unassigned primary shards, 6 nodes down, ShardsMissingException errors
- Severity: Critical
- Top recommendation: Recover missing nodes immediately
Save Phase Overview (from save subagent):
- Health: Critical
- Key issues: Multiple external service failures (aspect-classification and sentiment-analysis both >50% error rate), 75+ slow saves
- Severity: Critical
- Top recommendation: Investigate external services infrastructure, check service dependencies
Combined Root Cause Analysis: Infrastructure failure affecting both phases of history collection. This is infrastructure-wide, not a load or configuration issue. Both Search and Save phases require immediate attention.
Prioritized Combined Actions:
CRITICAL (immediate):
- HIGHEST PRIORITY: Recover 6 History cluster nodes (Search infrastructure)
- HIGHEST PRIORITY: Investigate external services failures - check common dependencies (databases, network, shared infrastructure) (Save infrastructure)
- Investigate infrastructure root cause (likely network partition or shared resource failure)
- Pause all new collection launches until both phases stabilize
- Assess data loss for unassigned shards and mentions that failed to save
HIGH (urgent - after recovery):
- Verify History cluster health returns to GREEN (Search verification)
- Verify external services recovery (aspect-classification, sentiment-analysis error rates back to <1%) (Save verification)
- Restart all failed collections (Search operations)
- Check save failure rates and identify mentions that may need reprocessing (Save operations)
MEDIUM (short-term):
- Review replica configuration - increase replicas for recent indices (Search resilience)
- Add circuit breakers for external services to prevent cascading failures (Save resilience)
- Create infrastructure failover runbook (Both phases)
LOW (preventive):
- Implement automated node health checks and alerts (Infrastructure monitoring)
- Add redundancy for critical infrastructure
- Test disaster recovery procedures quarterly (Operations)