
# AgentX README Review & Improvement Suggestions

**AgentBeats Competition - Phase 1 (Green Agent)**


## Executive Summary

Your AgentX submission is technically excellent, with comprehensive documentation, but it needs agent-optimized presentation to maximize evaluation scores. The README covers the 5 judging criteria well, yet it neither leads with the information an evaluating agent needs first nor features the agent-facing interface prominently.

**Key Finding:** Agents (especially reasoning agents) will parse your README first. The structure must be scannable and interface-centric.


## What You're Doing Well

- **Clear value proposition** - You explain what AgentX does and why it matters
- **Rich technical documentation** - Architecture diagrams, scoring methodology, and API references are thorough
- **Production-ready focus** - Docker, error handling, health checks, and reproducibility are demonstrated
- **Real examples** - Shows actual code examples and task-complexity comparisons
- **Unique features highlighted** - Hallucination detection and error categorization

## Critical Gaps for Agent Evaluation

### Gap 1: Missing Task Overview at Top

**Competition Requirement:** "Brief description of the tasks your green agent evaluates"

**Current Problem:** The README jumps into technical architecture before explaining what tasks are evaluated.

**What Agents Need Immediately:**

- Which agent type you're benchmarking (Coding? Web? Research?)
- What the benchmark actually measures
- Real-world relevance
- Task count and difficulty range

**Suggested Addition (after the title):**

## Benchmark at a Glance

**Evaluation Track:** Coding Agent (SQL Generation)

**What We Evaluate:**
- 27+ SQL tasks spanning 4 difficulty levels (Easy → Enterprise)
- SQL agents' ability to generate correct, efficient, and safe queries
- Real-world patterns: star schemas, sessionization, cohort retention analysis

**Key Metrics (7 Dimensions):**
- Correctness (35%) - Exact result matching
- Safety (20%) - No hallucinations, valid syntax
- Efficiency (15%) - Query execution time
- Semantic Accuracy (10%) - Value precision
- Completeness (10%) - All data returned
- Best Practices (5%) - Code quality
- Plan Quality (5%) - Execution efficiency

**Why This Matters:**
SQL generation is critical for data analytics agents. Most benchmarks only check if queries run (binary pass/fail). AgentX evaluates HOW WELL they're written; production-readiness matters.
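
To make the weighting concrete for readers (and for agents parsing the README), you could pair the metric list with a tiny aggregation example. A minimal sketch, assuming the overall score is a plain weighted sum of per-dimension scores in [0, 1]; the dictionary keys and function name are illustrative, not AgentX's actual scoring code:

```python
# Illustrative only: assumes the overall score is a simple weighted sum of the
# seven per-dimension scores (each a fraction in [0, 1]). Names are made up.
WEIGHTS = {
    "correctness": 0.35,
    "safety": 0.20,
    "efficiency": 0.15,
    "semantic_accuracy": 0.10,
    "completeness": 0.10,
    "best_practices": 0.05,
    "plan_quality": 0.05,
}

def overall_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension scores (0..1) into a single weighted score."""
    return sum(WEIGHTS[dim] * dimension_scores.get(dim, 0.0) for dim in WEIGHTS)

# Example: perfect safety and correctness, weaker elsewhere.
print(overall_score({
    "correctness": 1.0, "safety": 1.0, "efficiency": 0.6,
    "semantic_accuracy": 0.8, "completeness": 0.9,
    "best_practices": 0.5, "plan_quality": 0.5,
}))  # ≈ 0.86
```

Even if the real scorer is more nuanced, showing the arithmetic once removes ambiguity about what ">85% overall" means.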

### Gap 2: No Clear Agent Interface Section

**Competition Requirement:** "README describing how to run the green agent"

**Current Problem:** An API Reference exists but isn't highlighted as the entry point for competing agents.

**What Agents Need:**

- How to call the benchmark (endpoint + format)
- What their agent needs to implement
- Example request/response cycle
- Success criteria

**Suggested Addition - new section "For Competing Agents" (before the technical details):**

```markdown
## For Competing Agents (Purple Agents)

### How to Compete Against AgentX

AgentX is fully A2A-compatible. To run your agent against this benchmark:

#### 1. Start the Green Agent (AgentX Evaluator)
docker run keshavdalmia10/agentx-green:latest --host 0.0.0.0 --port 8001

#### 2. Your Agent Must Implement
- GET /.well-known/agent.json - A2A descriptor
- POST /generate - SQL generation endpoint

#### 3. Trigger Assessment
curl -X POST http://localhost:8001/assess \
  -H "Content-Type: application/json" \
  -d '{"participants": {"my_agent": "http://my-agent:8080"}, "config": {"task_count": 27, "difficulty": ["easy", "medium", "hard", "enterprise"], "scorer_preset": "default"}}'

#### 4. What AgentX Sends Your Agent
{
  "question": "Find customers who placed orders > $100",
  "schema": {"tables": {"customers": {...}, "orders": {...}}},
  "difficulty": "medium",
  "task_id": "task_123"
}

#### 5. What Your Agent Should Return
{
  "sql": "SELECT * FROM customers WHERE id IN (SELECT customer_id FROM orders WHERE total > 100)",
  "confidence": 0.92,
  "reasoning": "Using subquery to find qualifying customers"
}

#### 6. You'll Get Scored On
| Dimension | Weight | What It Measures |
|-----------|--------|-----------------|
| Correctness | 35% | Exact result matching |
| Safety | 20% | No hallucinations, valid syntax |
| Efficiency | 15% | Query execution time |
| Semantic Accuracy | 10% | Values match, not just row counts |
| Completeness | 10% | All expected data returned |
| Best Practices | 5% | Code quality |
| Plan Quality | 5% | Efficient execution plan |

**Goal:** Achieve >85% overall score to be competitive.
```
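
To make the interface section even more actionable, consider linking a minimal reference purple agent so competitors can see the full request/response cycle running end to end. A rough sketch of what that could look like, assuming Python with Flask; only the two endpoint paths and the JSON fields come from the contract above, everything else (names, the SQL itself) is placeholder:

```python
# Toy purple agent, for illustration only (not AgentX's reference implementation).
# The two endpoint paths and JSON fields follow the contract above; the agent
# card contents and the "SQL generation" logic are placeholders.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.get("/.well-known/agent.json")
def agent_card():
    # A2A descriptor; the exact required fields depend on the A2A spec version in use.
    return jsonify({"name": "my_agent", "description": "Toy SQL-generating agent"})

@app.post("/generate")
def generate():
    task = request.get_json()
    question = task["question"]
    tables = task.get("schema", {}).get("tables", {})
    # A real agent would plan a query from the question and schema here.
    first_table = next(iter(tables), "some_table")
    return jsonify({
        "sql": f"SELECT * FROM {first_table} LIMIT 10",  # stub query
        "confidence": 0.1,                               # low: this is a stub
        "reasoning": f"Stub answer for: {question}",
    })

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Even a stub like this makes the "Your Agent Must Implement" list concrete for first-time participants.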
                  
---

### Gap 3: "Why This Benchmark Matters" Missing

**Current Problem:** Agents don't immediately understand why SQL benchmarking is important or how this differs from simpler benchmarks.

**Suggested Addition - Strengthen Introduction:**

```markdown
## Why SQL Generation is Hard for Agents

AgentX targets real-world challenges that simple heuristics can't solve:

1. **Schema Complexity** - 19-table enterprise schemas test multi-step reasoning
2. **Hallucination Detection** - Agents can imagine phantom tables/columns that don't exist; AgentX catches these BEFORE database errors
3. **Performance Matters** - Not just ANY valid SQL, but EFFICIENT SQL (production-readiness)
4. **Error Pattern Learning** - AgentX categorizes failures to help agents learn from mistakes

## Agents That Excel on AgentX

- Coding Agents with strong SQL understanding
- Data Analytics Agents analyzing databases
- Research Agents querying knowledge bases
- Finance/BI Agents generating reports
- Agents trained on production SQL patterns
```
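
On point 2 above (hallucination detection), a small concrete snippet in the README would make the feature tangible for judges. A simplified sketch of the idea, assuming table references are extracted with the sqlglot parser; this illustrates the concept only and is not AgentX's actual checker (which presumably also covers columns, functions, and CTEs):

```python
# Simplified illustration of hallucination detection: flag tables a query
# references that don't exist in the provided schema. Assumes the sqlglot
# parser; deliberately naive (e.g., it would also flag CTE names).
import sqlglot
from sqlglot import expressions as exp

def phantom_tables(sql: str, schema: dict) -> set[str]:
    known = {name.lower() for name in schema.get("tables", {})}
    referenced = {t.name.lower() for t in sqlglot.parse_one(sql).find_all(exp.Table)}
    return referenced - known

schema = {"tables": {"customers": {}, "orders": {}}}
sql = "SELECT * FROM customres c JOIN orders o ON o.customer_id = c.id"
print(phantom_tables(sql, schema))  # -> {'customres'}, caught before the DB ever sees it
```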
                  
---

### Gap 4: Judging Criteria Coverage

**The competition judges on 5 criteria. Ensure each is visibly addressed:**

| Judging Criterion | Your Coverage | Recommendation |
|---|---|---|
| **Technical Correctness & Documentation** | Strong | Add: "Tested on 27+ SQL scenarios covering 4 difficulty levels" |
| **Reproducibility** | Good | Emphasize: "Same 27 tasks, fixed seed data, guaranteed consistent results" |
| **Benchmark Design Quality** | Excellent | Add: "Avoids simple heuristics; requires real SQL understanding" |
| **Evaluation Methodology** | Great | Add: "Fully automated evaluation; no manual intervention" |
| **Innovation & Impact** | Buried | **MOVE TO FRONT**; this is your differentiator |

---
                  
## Recommended README Restructure

**Current Order:** Overview → Architecture → Quick Start → Design → Evaluation → Docker → API

**Recommended Agent-Friendly Order:**
```
1. Title + One-liner
2. Benchmark at a Glance (NEW)
3. Innovation & Impact (MOVED UP)
4. Why This Matters (NEW)
5. For Competing Agents (NEW - API focus)
6. What Makes AgentX Unique (REORDERED)
7. Benchmark Design Quality
8. Evaluation Methodology
9. Error Categories
10. Deployment & Setup
11. Resource Requirements & Performance
12. API Reference
13. Contributing & Citation
```
                  
---

## Specific Presentational Improvements

### 1. Lead with Impact, Not Implementation
- Current: "7-Dimensional Scoring: Correctness, Efficiency, Safety..."
- Better: "AgentX tests whether agents generate PRODUCTION-READY SQL, not just any SQL"

### 2. Make Hallucination Detection Your #1 Unique Feature
- This is what competitors care about most
- Move it to position #1 (currently #2)
- Add a real example of how it prevents production failures

### 3. Add Benchmark Coverage Matrix
```markdown
| Difficulty | Tasks | Example | Avg Time | Skills Tested |
|---|---|---|---|---|
| Easy | 10 | Basic SELECT, WHERE, LIMIT | 0.5s | Schema understanding |
| Medium | 10 | JOINs, GROUP BY, Subqueries | 1.2s | Multi-table reasoning |
| Hard | 4 | Window functions, CTEs | 0.8s | Advanced SQL |
| Enterprise | 30 | Star schema, SCD, Cohorts | 3.0s | Real-world patterns |
```
                  
### 4. Add "Success Criteria" Section
```markdown
## What Does a Winning Agent Look Like?

- **Correctness > 85%** (exact results match)
- **Safety = 100%** (zero hallucinations)
- **Efficiency > 80%** (respects time budgets)
- **Best Practices > 70%** (clean SQL)
- **Overall > 82%** (competitive leaderboard position)
```
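
If you add this section, a tiny self-check helper (in the repo or inline in the README) lets agent authors verify a local run against those thresholds before submitting. A sketch, assuming per-dimension scores are available as fractions; the threshold values mirror the suggested criteria above, the rest is illustrative:

```python
# Illustrative self-check against the suggested success criteria above.
# Assumes per-dimension scores are already available as fractions (0..1).
THRESHOLDS = {
    "correctness": 0.85,
    "safety": 1.00,
    "efficiency": 0.80,
    "best_practices": 0.70,
    "overall": 0.82,
}

def check_run(scores: dict[str, float]) -> list[str]:
    """Return the criteria that were missed (empty list means competitive)."""
    return [
        f"{name}: {scores.get(name, 0.0):.2f} < {minimum:.2f}"
        for name, minimum in THRESHOLDS.items()
        if scores.get(name, 0.0) < minimum
    ]

missed = check_run({"correctness": 0.9, "safety": 1.0, "efficiency": 0.75,
                    "best_practices": 0.8, "overall": 0.83})
print(missed or "Looks competitive")  # -> ['efficiency: 0.75 < 0.80']
```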
                  
### 5. Make Requirements Explicit
Add a "What Your Agent Needs" checklist:

```markdown
## Baseline Purple Agent Requirements

Your agent must implement:
- GET /.well-known/agent.json - A2A descriptor
- POST /generate - Takes question + schema, returns SQL + confidence
- Handles SQLite, DuckDB, PostgreSQL, BigQuery dialects
- Responds within 60 seconds per query
- Returns valid SQL (or reasonable error message)
- Can handle 19-table enterprise schemas
- Tested on 3+ agents minimum (for comparison)

Optional but Competitive:
- Confidence scoring (shows uncertainty)
- Reasoning explanation (helps learning)
- Multi-turn interaction (asks clarifying questions)
```
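
Alongside the checklist, a short smoke test would let authors confirm the two required endpoints respond before triggering a full assessment. A sketch using Python's requests library; the payload fields follow the request/response contract shown earlier, and the URL is simply wherever the purple agent happens to be served:

```python
# Quick pre-submission smoke test for the two required endpoints.
# Assumes the purple agent is already running locally; payload fields follow
# the request/response contract described earlier in this review.
import requests

AGENT_URL = "http://localhost:8080"  # wherever your purple agent is served

card = requests.get(f"{AGENT_URL}/.well-known/agent.json", timeout=10)
card.raise_for_status()
print("agent card:", card.json().get("name", "<unnamed>"))

resp = requests.post(
    f"{AGENT_URL}/generate",
    json={
        "question": "Find customers who placed orders > $100",
        "schema": {"tables": {"customers": {}, "orders": {}}},
        "difficulty": "easy",
        "task_id": "smoke_test_1",
    },
    timeout=60,  # the checklist allows up to 60 seconds per query
)
resp.raise_for_status()
body = resp.json()
assert "sql" in body, "response must include a 'sql' field"
print("generated SQL:", body["sql"])
```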
                  
---

## What NOT to Change

- Keep all technical documentation (it's excellent)
- Keep the error categorization section (very detailed)
- Keep the reproducibility explanation (important for agents)
- Keep the Docker deployment (it works well)
- Don't remove the architecture diagrams (just add metadata)
                  
---

## Summary: Priority Changes

| Priority | Change | Impact |
|----------|--------|--------|
| HIGH | Add "Benchmark at a Glance" at top | Agents understand scope immediately |
| HIGH | Add "For Competing Agents" section with interface | Agents know how to submit |
| HIGH | Move Innovation & Impact to front | Judges see differentiation immediately |
| MEDIUM | Add Success Criteria section | Agents know what to optimize for |
| MEDIUM | Restructure: move agent interface before architecture | Better information scannability |
| MEDIUM | Add "Why This Matters" subsection | Agents understand value |
| LOW | Add benchmark coverage matrix | Agents understand task coverage |
                  
---

## Bottom Line

You have **excellent technical content**. This is about **reorganizing and adding agent-friendly guidance** on top of what you already have. No new features to build; just presentation optimization.

**Estimated effort:** 2-3 hours to restructure and enhance the README

**Expected outcome:** Significant improvement in judging scores, especially for the technical correctness, documentation, and innovation criteria.