yaroslav-tykhonchuk/SKILL.md

name: history-troubleshooter
description: Diagnose and troubleshoot YouScan history collection pipeline - Search phase (Elasticsearch cluster health, slow queries, unassigned shards) and Save phase (slow saves, external service errors, pipeline bottlenecks)
version: 2.2.0

History Troubleshooter

This skill helps diagnose and troubleshoot the YouScan history collection pipeline, analyzing both the Search phase (loading mentions from a >2.48 PB Elasticsearch History Cluster) and the Save phase (processing mentions through SaveMentionPipeline with external services).

When to Use This Skill

Use this skill when:

You need a systematic analysis of the history collection pipeline (Search and Save phases)
History collection is slow or failing for topics
Search issues: the Elasticsearch cluster shows timeouts, search context missing exceptions, unassigned shards, or degraded cluster health
Save issues: slow batch saves (>150s), external service errors, or mention processing delays
You need to investigate performance issues with history searches or saves
Too many parallel history collections are suspected
External services (aspect-classification, sentiment-analysis, image-recognition) are showing errors

Cluster Architecture Overview

Storage Tiers


Hot nodes (SSD): Last 1 year of data, 1 replica for last 45 days, 0 replicas after
Warm nodes (HDD): Older data, replicas up to 2206 (2022-06), gradual removal when space is limited
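
The tier rules above can be read as a lookup on index age. The following is a minimal sketch, assuming "last 1 year" means 365 days and that age is counted from the index date. The function name and return shape are hypothetical. Warm-tier replica counts are returned as None because the text only says they are removed gradually when space is limited.

```python
from datetime import date
from typing import Optional, Tuple

HOT_DAYS = 365      # assumption: "last 1 year" treated as 365 days
REPLICA_DAYS = 45   # from the text: 1 replica for the last 45 days


def tier_and_replicas(index_date: date, today: date) -> Tuple[str, Optional[int]]:
    """Return (tier, expected replica count) for a daily index.

    The replica count is None for warm indices, since the text does not
    give a fixed number for them.
    """
    age_days = (today - index_date).days
    if age_days < 0:
        raise ValueError("index_date is in the future")
    if age_days <= HOT_DAYS:
        return "hot", 1 if age_days <= REPLICA_DAYS else 0
    return "warm", None
```

A check like this may help when deciding whether an unassigned shard on a given index implies missing data (0 replicas) or only reduced redundancy (1 replica).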

Index Structure


Daily indices with 12 shards (recent years)
Index types:

Regular: bs-prd-YYMM-DD or fh-prd-YYMM-DD


Prefix: bs (better search, since 2023-05-04), fh (full history, before)
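
The naming pattern above can be turned into a small helper for locating the index for a given day. This is a sketch under two assumptions: the cutover day 2023-05-04 itself already uses the bs prefix, and the middle segment is always the literal "prd". The function name is hypothetical.

```python
from datetime import date

BS_CUTOVER = date(2023, 5, 4)  # from the text: "bs" since 2023-05-04


def history_index_name(day: date) -> str:
    """Build the daily index name in the form <prefix>-prd-YYMM-DD."""
    # Assumption: the cutover date itself is a "bs" index.
    prefix = "bs" if day >= BS_CUTOVER else "fh"
    return f"{prefix}-prd-{day:%y%m}-{day:%d}"
```

For example, `history_index_name(date(2022, 6, 15))` yields `fh-prd-2206-15`, which matches the "2206 (2022-06)" notation used for the warm tier.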

Common Issues

Search Phase:

Complex boolean queries with wildcards and distance operators
Too many parallel history collections for the same period
Node failures causing unassigned shards
Search context missing exceptions (timeouts)
ShardsMissingException errors

Save Phase:

Slow batch saves (SaveDuration >150 seconds)
External service failures (aspect-classification, sentiment-analysis, image-recognition, entity-extraction)
External service rate limiting ("Too many requests")
Pipeline backpressure from slow enrichment
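
The slow-save threshold can be applied to parsed log records as follows. This is only a sketch: it assumes each record is a dict with a "SaveDuration" value in seconds, and the record shape and function name are hypothetical rather than the pipeline's actual log schema.

```python
from typing import Dict, Iterable, List

SLOW_SAVE_SECONDS = 150.0  # threshold from the text: SaveDuration >150 seconds


def find_slow_saves(records: Iterable[Dict]) -> List[Dict]:
    """Return records whose SaveDuration exceeds the threshold, slowest first."""
    slow = [r for r in records if float(r["SaveDuration"]) > SLOW_SAVE_SECONDS]
    return sorted(slow, key=lambda r: float(r["SaveDuration"]), reverse=True)
```

Sorting slowest first keeps the worst batches at the top of a report, which is usually where the correlation with external service errors is easiest to see.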


Main Agent Workflow

IMPORTANT: You are the main history-troubleshooter agent. Your role is to:

Launch 2 specialized analysis subagents in parallel (Search and Save)
Receive synthesized findings from each subagent
Consolidate both reports into a comprehensive diagnosis

History Collection Pipeline Overview

History collection has two main phases:


Search Phase: Loading mentions from Elasticsearch History Cluster

Querying Elasticsearch with complex boolean queries
Scanning through daily indices on hot/warm nodes
Managing search contexts and timeouts


Save Phase: Processing and saving mentions through SaveMentionPipeline

Enrichment and validation
Passing through Data Science services
Indexing to topic-specific indices in the MentionStream Elasticsearch cluster


Both phases can have independent issues, so they must be analyzed in parallel.

Step 1: Launch Analysis Subagents in Parallel

Call both subagents simultaneously to minimize time:
Task 1 - Search Analysis Subagent:
Launch the search subagent to analyze the Search phase of history collection:
- How mentions are retrieved from Elasticsearch History Cluster
- Active collection patterns and load
- Cluster health and performance
- Search errors and slow queries

The search subagent will coordinate 3 specialized subagents:
- Collections analysis (API)
- Grafana metrics (cluster health)
- Logs analysis (HistorySearcher errors)

See: search-subagent/search-subagent.md

Return: Comprehensive Search phase analysis with:
- Search phase summary and health status
- Key search findings and root cause analysis
- Search impact assessment
- Prioritized recommendations for search issues

Task 2 - Save Analysis Subagent:
Launch the save subagent to analyze the Save phase of history collection:
- How mentions are processed through SaveMentionPipeline
- Slow save operations (SaveDuration >150s)
- External services health (aspect-classification, sentiment-analysis, image-recognition, entity-extraction)
- Save success rates and failures

The save subagent will coordinate 2 specialized subagents:
- Logs analysis (slow save operations)
- Grafana metrics (external services health and continuous errors)

See: save-subagent/save-subagent.md

Return: Comprehensive Save phase analysis with:
- Save phase summary and health status
- Key save findings and root cause analysis
- Save impact assessment
- Prioritized recommendations for save issues
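
The parallel launch described in Step 1 can be pictured with a small asyncio sketch. The two coroutine functions below are hypothetical stand-ins for the Search and Save subagents and return placeholder reports; how subagents are actually invoked depends on the agent runtime and is not specified here.

```python
import asyncio
from typing import Dict


async def run_search_subagent() -> Dict[str, str]:
    # Hypothetical stand-in for search-subagent/search-subagent.md
    await asyncio.sleep(0)
    return {"phase": "search", "health": "Healthy"}


async def run_save_subagent() -> Dict[str, str]:
    # Hypothetical stand-in for save-subagent/save-subagent.md
    await asyncio.sleep(0)
    return {"phase": "save", "health": "Healthy"}


async def collect_reports() -> Dict[str, Dict[str, str]]:
    """Start both analyses at once and wait until both reports are back."""
    search, save = await asyncio.gather(run_search_subagent(), run_save_subagent())
    return {"search": search, "save": save}
```

Calling `asyncio.run(collect_reports())` returns both reports together, so consolidation in Step 2 starts only after both phases have been analyzed.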


Step 2: Consolidate Search and Save Findings

After receiving synthesized reports from both Search and Save subagents, create a comprehensive diagnosis:
Structure your response:
1. Executive Summary (3-4 sentences)


Overall history collection health: Healthy / Degraded / Critical
Search phase status summary
Save phase status summary
Primary bottleneck: Search / Save / Both
Severity assessment

2. Search Phase Overview

Brief summary of search subagent findings:

Search health status
Key search issues (1-2 bullet points)
Search severity
Top search recommendation

3. Save Phase Overview

Brief summary of save subagent findings:

Save health status
Key save issues (1-2 bullet points)
Save severity
Top save recommendation

4. Combined Root Cause Analysis


What is the primary bottleneck in history collection?
Are Search and Save issues related or independent?
Is one phase blocking the other?
Examples:

Search overload slowing retrieval, Save pipeline idle
Search healthy, Save pipeline backpressured
Both phases overwhelmed, cascading failures
Independent issues in both phases
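
A first-pass label for the primary bottleneck can be derived from the two phase health values. The sketch below assumes the health values are exactly Healthy, Degraded, or Critical, as in the Executive Summary. The function name is hypothetical, and the result does not answer whether the issues are related or whether one phase is blocking the other.

```python
VALID_HEALTH = {"Healthy", "Degraded", "Critical"}


def primary_bottleneck(search_health: str, save_health: str) -> str:
    """Map two phase health values to Search / Save / Both / None."""
    for value in (search_health, save_health):
        if value not in VALID_HEALTH:
            raise ValueError(f"unknown health value: {value}")
    search_bad = search_health != "Healthy"
    save_bad = save_health != "Healthy"
    if search_bad and save_bad:
        return "Both"
    if search_bad:
        return "Search"
    if save_bad:
        return "Save"
    return "None"
```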


5. Overall Impact Assessment


Which topics/users are impacted overall?
What is the combined effect on history collection performance?
Is data at risk from either phase?
Overall severity: Low / Medium / High / Critical

6. Prioritized Combined Actions

Order actions by urgency and impact across both phases:
CRITICAL (immediate):

Actions that address critical issues in either phase
Must be done within minutes/hours
Example: Unassigned shards, service failures, data loss risk

HIGH (urgent):

Actions that address severe degradation
Must be done within hours
Example: Abort excessive collections, restart failed pipelines

MEDIUM (short-term):

Actions that improve degraded performance
Should be done within days
Example: Optimize queries, adjust pipeline configuration

LOW (preventive):

Long-term improvements and monitoring
Plan and implement over weeks
Example: Add alerts, implement throttling policies

Be specific: reference topic IDs, user emails, service names, commands
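
Ordering actions by urgency across both phases, rather than grouping them by phase, can be sketched like this. The Action type and its field names are hypothetical and only illustrate the four priority levels listed above.

```python
from dataclasses import dataclass
from typing import List

PRIORITY_ORDER = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}


@dataclass
class Action:
    priority: str  # one of the keys in PRIORITY_ORDER
    phase: str     # "Search", "Save", or "Both"
    text: str


def order_actions(actions: List[Action]) -> List[Action]:
    """Sort by urgency only; the sort is stable, so ties keep input order."""
    return sorted(actions, key=lambda a: PRIORITY_ORDER[a.priority])
```

Because the phase is not part of the sort key, a CRITICAL Save action is listed ahead of a HIGH Search action regardless of which subagent reported it.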
7. Additional Investigation (if needed)


Questions that require deeper analysis in either phase
Specific components to investigate further
Follow-up metrics or logs to review


Best Practices for Main Agent

Focus on Consolidation:

You are NOT analyzing raw data - subagents do that
Your job is to synthesize TWO pre-analyzed reports
Identify connections between Search and Save findings
Determine the primary bottleneck

Avoid Duplication:

Don't repeat detailed subagent findings
Summarize key points from each phase
Focus on how phases interact or conflict

Prioritize Across Phases:

A critical Search issue takes priority over minor Save issues
Identify dependencies: Can't fix Save if Search is broken
Order recommendations by overall impact, not phase

Be Concise:

Subagents provide detailed analysis
Your report should be a high-level executive summary
Reference subagent reports for details: "See Search phase report for details"

Quantify Overall Impact:

How many collections affected total?
What percentage of mentions are delayed?
How many users impacted?
Overall performance degradation: X%


Additional Resources

Analysis Subagents


search-subagent/search-subagent.md: Coordinates Search phase analysis (collections, Grafana cluster health, HistorySearcher logs)
save-subagent/save-subagent.md: Coordinates Save phase analysis (slow saves, external services health, save failures)

Search Phase Subagents


search-subagent/collections-subagent.md: Analyzes active history collections via API
search-subagent/grafana-subagent.md: Analyzes Elasticsearch cluster health and metrics
search-subagent/logs-subagent.md: Analyzes HistorySearcher error logs

Save Phase Subagents


save-subagent/logs-subagent.md: Analyzes slow save operations (SaveDuration >150s)
save-subagent/grafana-subagent.md: Analyzes external services health and continuous errors

Example Scenarios

Scenario 1: Search Overload, Save Healthy

Main Agent Consolidation:
Executive Summary:
History collection health: Degraded (Search bottleneck)
Search phase: CRITICAL - 65 active collections overwhelming Elasticsearch cluster
Save phase: HEALTHY - Pipeline operating normally with available capacity
Primary bottleneck: Search (retrieval from Elasticsearch)
Severity: High
Search Phase Overview (from search subagent):

Health: Critical
Key issues: 65 active collections (VERY HIGH), 3 power users, high CPU/latency
Severity: High
Top recommendation: Abort 10 collections, contact power users

Save Phase Overview (from save subagent):

Health: Healthy
Key issues: None (<5 slow saves in 24h, all external services healthy)
Severity: None
Top recommendation: Monitor for increased load when search recovers

Combined Root Cause Analysis:
Search phase is the bottleneck. Too many parallel collections are overwhelming Elasticsearch cluster capacity, slowing mention retrieval. Save pipeline is healthy and has spare capacity but is starved of mentions due to slow search. Once search issues are resolved, ensure save pipeline can handle increased throughput.
Prioritized Combined Actions:
CRITICAL (immediate):

Abort 10 lowest-priority collections on old data (Search issue)

Themes: 123456, 234567, 345678, 456789, 567890, 678901, 789012, 890123, 901234, 012345


Contact 3 power users to coordinate collection launches (Search issue)

user1@example.com (12 collections), user2@example.com (11), user3@example.com (10)


HIGH (urgent):

Prioritize hot node collections for faster completion (Search optimization)
Monitor save pipeline after search load reduces (ensure no downstream issues)

MEDIUM (short-term):

Stagger warm node collection launches (Search scheduling)
Review query complexity for top error-prone topics (Search optimization)

LOW (preventive):

Implement per-user collection limit: max 5 concurrent (Search policy)
Add monitoring alert: search latency p95 >2000ms (Search monitoring)
Add monitoring alert: SaveDuration >150s or external service errors >5 min (Save monitoring)
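
The per-user limit suggested above (max 5 concurrent collections) could be checked against a list of active collections as sketched below. The input shape, a sequence of (user_email, topic_id) pairs, is an assumption, as is the function name. The actual collections API response is not described in this document.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

MAX_CONCURRENT_PER_USER = 5  # from the text: "max 5 concurrent"


def users_over_limit(active: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Given (user_email, topic_id) pairs, return users above the limit."""
    counts = Counter(user for user, _topic in active)
    return {u: n for u, n in counts.items() if n > MAX_CONCURRENT_PER_USER}
```

Applied to Scenario 1, the three power users with 12, 11, and 10 collections would all be returned, while a user with exactly 5 would not.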


Scenario 2: Search Healthy, Save Phase Issues

Main Agent Consolidation:
Executive Summary:
History collection health: Degraded (Save bottleneck)
Search phase: HEALTHY - Normal collection load, cluster performing well
Save phase: DEGRADED - External service errors causing slow saves
Primary bottleneck: Save (external services)
Severity: Medium-High
Search Phase Overview (from search subagent):

Health: Healthy
Key issues: None (35 active collections, normal load, cluster metrics good)
Severity: None
Top recommendation: Continue normal operations

Save Phase Overview (from save subagent):

Health: Degraded
Key issues: 28 slow saves in 24h, ImageRecognition service 8% error rate for 15 minutes
Severity: Medium-High
Top recommendation: Review ImageRecognition rate limits, monitor service health

Combined Root Cause Analysis:
Save phase is the bottleneck. Search is retrieving mentions efficiently, but SaveMentionPipeline is experiencing delays due to ImageRecognition service rate limiting ("Too many requests"). 22 of the 28 slow saves correlate with the service error period. Topics with high visual mention volumes are most affected.
Prioritized Combined Actions:
CRITICAL (immediate):

Review ImageRecognition service rate limits and increase if needed (Save issue)
Monitor ImageRecognition service for continued errors (Save monitoring)

HIGH (urgent):

Implement circuit breaker to skip image processing when service failing (Save resilience)
Review topics with high visual mention volumes (Save optimization)

MEDIUM (short-term):

Add adaptive batch size reduction during service errors (Save optimization)
Implement auto-scaling for ImageRecognition based on request rate (Save infrastructure)

LOW (preventive):

Add alert: External service error rate >5% for >5 minutes (Save monitoring)
Monitor for request rates approaching the rate limit (Save proactive alerting)
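
The alert condition "error rate >5% for >5 minutes" is a sustained-threshold check rather than a single-sample check. A minimal sketch follows, assuming samples arrive as (unix_seconds, error_rate) pairs in time order with the rate expressed as a fraction. The function name and input shape are hypothetical and not tied to any particular monitoring system.

```python
from typing import Iterable, Optional, Tuple

ERROR_RATE_THRESHOLD = 0.05  # from the text: >5%
MIN_DURATION_SECONDS = 300   # from the text: >5 minutes


def sustained_error(samples: Iterable[Tuple[float, float]]) -> bool:
    """Return True if the rate stays above the threshold for long enough.

    A run starts at the first sample above the threshold and is reset by
    any sample at or below it.
    """
    run_start: Optional[float] = None
    for ts, rate in samples:
        if rate > ERROR_RATE_THRESHOLD:
            if run_start is None:
                run_start = ts
            if ts - run_start > MIN_DURATION_SECONDS:
                return True
        else:
            run_start = None
    return False
```

With one sample per minute, the 8% error rate lasting 15 minutes in this scenario would trigger the alert, while a two-minute spike would not.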


Scenario 3: Both Phases Degraded - Infrastructure Failure

Main Agent Consolidation:
Executive Summary:
History collection health: CRITICAL (Both phases failing)
Search phase: CRITICAL - Unassigned shards, node failures
Save phase: CRITICAL - Multiple external services failing (>50% error rate), 75+ slow saves
Primary bottleneck: Both (infrastructure failures)
Severity: Critical
Search Phase Overview (from search subagent):

Health: Critical
Key issues: 12 unassigned primary shards, 6 nodes down, ShardsMissingException errors
Severity: Critical
Top recommendation: Recover missing nodes immediately

Save Phase Overview (from save subagent):

Health: Critical
Key issues: Multiple external service failures (aspect-classification and sentiment-analysis both at >50% error rate), 75+ slow saves
Severity: Critical
Top recommendation: Investigate external services infrastructure, check service dependencies

Combined Root Cause Analysis:
Infrastructure failure affecting both phases of history collection. This is infrastructure-wide, not a load or configuration issue. Both Search and Save phases require immediate attention.
Prioritized Combined Actions:
CRITICAL (immediate):

HIGHEST PRIORITY: Recover 6 History cluster nodes (Search infrastructure)
HIGHEST PRIORITY: Investigate external services failures - check common dependencies (databases, network, shared infrastructure) (Save infrastructure)
Investigate infrastructure root cause (likely network partition or shared resource failure)
Pause all new collection launches until both phases stabilize
Assess data loss for unassigned shards and mentions that failed to save

HIGH (urgent - after recovery):

Verify History cluster health returns to GREEN (Search verification)
Verify external services recovery (aspect-classification, sentiment-analysis error rates back to <1%) (Save verification)
Restart all failed collections (Search operations)
Check save failure rates and identify mentions that may need reprocessing (Save operations)

MEDIUM (short-term):

Review replica configuration - increase replicas for recent indices (Search resilience)
Add circuit breakers for external services to prevent cascading failures (Save resilience)
Create infrastructure failover runbook (Both phases)

LOW (preventive):

Implement automated node health checks and alerts (Infrastructure monitoring)
Add redundancy for critical infrastructure
Test disaster recovery procedures quarterly (Operations)
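
The circuit breakers recommended for external services in Scenarios 2 and 3 could take roughly the following shape. This is a generic sketch of the pattern, not the SaveMentionPipeline implementation. The class, its parameters, and the default values are assumptions chosen for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Skip calls to a failing service after repeated consecutive failures."""

    def __init__(self, max_failures: int = 3, cooldown_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock  # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # Once the cooldown has elapsed, allow a trial call through.
        return self.clock() - self.opened_at < self.cooldown_seconds

    def call(self, fn: Callable[[], T], fallback: T) -> T:
        """Run fn unless the breaker is open; return fallback on skip or error."""
        if self.is_open():
            return fallback
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In the image-processing case, `fn` would wrap the enrichment request and `fallback` would be whatever value means "enrichment skipped", so the save can proceed without waiting on a service that is returning "Too many requests".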