@jmanhype
Last active January 20, 2026 05:38

BLACKICE vs Code Intelligence Survey: Gap Analysis

Based on an exploration of the BLACKICE architecture against the Code Intelligence survey's coverage, here is what BLACKICE already implements and what it could adopt.


βœ… What BLACKICE Already Has (Survey-Aligned)

| Survey Pattern | BLACKICE Implementation |
| --- | --- |
| SWE Agent Architecture | RalphLoop, Supervisor, WorkflowDAG |
| Reflexion / Self-Improvement | Full ReflexionLoop with 6 quality dimensions |
| Multi-Agent Coordination | AgentRegistry, ConsensusEngine, MessageBroker |
| Tool Use | Claude Code adapter with file edit, bash, search |
| Memory Systems | Letta archival + semantic embeddings |
| Safety Guards | Cost limits, loop detection, workspace isolation |
| Multi-Model Routing | SmartRouter with capability-based selection |

πŸ”΄ Gaps: What the Survey Covers That BLACKICE Lacks

1. Formal Benchmarking & Evaluation

Survey covers: HumanEval, MBPP, SWE-bench, CodeContests, LiveCodeBench

Gap: BLACKICE has no standardized benchmark integration

MISSING:
- SWE-bench task runner for measuring agent performance
- Pass@k metrics tracking
- Regression testing against known benchmarks
- Performance comparison across model versions

Recommendation: Add benchmarks/ module with SWE-bench, HumanEval runners to measure improvement over time.
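As one concrete starting point for the hypothetical benchmarks/ module, pass@k can be tracked with the standard unbiased estimator from the Codex evaluation methodology. This is a sketch that assumes nothing about BLACKICE internals:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability
    that at least one of k samples, drawn without replacement from
    n generations of which c are correct, passes."""
    if n - c < k:
        return 1.0  # fewer incorrect generations than k: a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Logging this per task across model versions is enough to turn benchmark runs into a regression signal over time.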


2. Code Retrieval / RAG for Code

Survey covers: RepoMap, codebase indexing, AST-based retrieval

Gap: BLACKICE uses keyword/embedding search but lacks:

MISSING:
- AST-aware code chunking
- Repository-level context graphs
- Call graph / dependency indexing
- Intelligent context window packing

Recommendation: Integrate tree-sitter parsing + code graph indexing (like Aider's RepoMap or Cursor's approach).
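Until the declared tree-sitter dependency is wired up, the idea behind AST-aware chunking can be illustrated with Python's built-in ast module: split a file at top-level definition boundaries rather than fixed-size windows. This is only a single-language sketch, not the tree-sitter integration itself:

```python
import ast

def ast_chunks(source: str) -> list[tuple[str, str]]:
    """Split Python source into (name, code) chunks at top-level
    function/class boundaries -- a stand-in for tree-sitter parsing."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive
            code = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append((node.name, code))
    return chunks
```

Chunks that follow syntactic boundaries embed and retrieve far better than arbitrary line windows, which is the core of the RepoMap approach.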


3. Execution-Based Verification

Survey covers: Sandboxed execution, test generation, property-based testing

Gap: BLACKICE validates via pytest/file checks but lacks:

MISSING:
- Auto-generated unit tests for verification
- Property-based testing (Hypothesis-style)
- Fuzzing for edge case discovery
- Execution traces for debugging

Recommendation: Add test generation as validation step before declaring success.
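The shape of a property-based verification step can be sketched in a few lines without pulling in Hypothesis; a real integration would use Hypothesis strategies and shrinking, but the validation contract is the same: generate inputs, check an invariant, surface the first counterexample.

```python
import random

def check_property(fn, prop, gen, trials=200, seed=0):
    """Minimal property-based check: draw random inputs from `gen`
    and verify `prop(x, fn(x))`; returns the first counterexample,
    or None if all trials pass."""
    rng = random.Random(seed)  # seeded for reproducible failures
    for _ in range(trials):
        x = gen(rng)
        if not prop(x, fn(x)):
            return x
    return None
```

For example, generated sorting code could be gated on the property "output equals sorted(input) and preserves length" before the agent declares success.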


4. Fine-Tuning / Alignment Pipeline

Survey covers: SFT, RLHF, DPO, RLVR (reinforcement learning from verifiable rewards)

Gap: BLACKICE uses Reflexion (verbal RL) but has no weight-update pipeline:

MISSING:
- Data collection for fine-tuning local models
- Preference pairs capture from user feedback
- RLVR integration (reward = tests pass)
- LoRA/QLoRA training pipeline for Ollama models

Recommendation: Capture (task, successful_output) pairs β†’ fine-tune local Qwen/CodeLlama for domain-specific improvement.
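The capture step is mostly bookkeeping: append verified (task, output) pairs to a JSONL file that a later LoRA/SFT job can consume. The field names below are illustrative, not a fixed schema:

```python
import json
from pathlib import Path

def capture_pair(path: str, task: str, output: str, passed: bool) -> None:
    """Append one training record; only verified successes (reward=1.0)
    become SFT positives, while failures can seed preference pairs."""
    rec = {"prompt": task, "completion": output, "reward": 1.0 if passed else 0.0}
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(rec) + "\n")
```

With "reward = tests pass" this doubles as an RLVR-style dataset: the validator already computes the reward signal for free.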


5. Code Safety Analysis

Survey covers: Vulnerability detection, insecure code patterns, CWE classification

Gap: BLACKICE's safety guards cover cost limits and loop detection, not code security:

MISSING:
- Static analysis integration (Semgrep, Bandit)
- Vulnerability scanning of generated code
- CWE pattern detection
- Security-focused code review step

Recommendation: Add security validator in CompositeValidator chain (run Semgrep before accepting code).
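The validator's shape might look like the sketch below. The class name, `validate` method, and pattern table are all hypothetical; a production version would shell out to Semgrep or Bandit rather than regex-match, but the chain position (run before accepting code) is the same:

```python
import re

# Illustrative patterns only -- real scanning belongs to Semgrep/Bandit.
INSECURE_PATTERNS = {
    "CWE-95": re.compile(r"\beval\s*\("),                         # eval injection
    "CWE-78": re.compile(r"subprocess\.\w+\([^)]*shell\s*=\s*True"),  # shell injection
    "CWE-502": re.compile(r"\bpickle\.loads?\s*\("),              # unsafe deserialization
}

class SecurityValidator:
    """Hypothetical CompositeValidator member: reject generated code
    that matches known insecure patterns, returning the CWE hits."""
    def validate(self, code: str) -> list[str]:
        return [cwe for cwe, pat in INSECURE_PATTERNS.items() if pat.search(code)]
```

An empty result lets the code proceed down the chain; any hit becomes a Reflexion signal instead of an accepted diff.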


6. Long-Context / Repository-Scale Understanding

Survey covers: 100K+ context models, hierarchical summarization, context compression

Gap: BLACKICE packs recent history but lacks:

MISSING:
- Hierarchical repo summarization
- Context compression techniques
- Selective context based on task relevance
- Long-context model routing for large codebases

Recommendation: Build repo summary cache + relevance-based context selection.
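Relevance-based selection can be sketched as a greedy packer over cached per-file summaries. The keyword-overlap scoring here is a naive stand-in (embeddings would do this better), but the budget-packing shape is the point:

```python
def pack_context(task: str, file_summaries: dict[str, str], budget: int) -> list[str]:
    """Rank repo files by keyword overlap with the task, then greedily
    pack the most relevant summaries into a character budget."""
    words = set(task.lower().split())
    scored = sorted(
        file_summaries.items(),
        key=lambda kv: -len(words & set(kv[1].lower().split())),
    )
    picked, used = [], 0
    for path, summary in scored:
        if used + len(summary) > budget:
            continue  # skip summaries that overflow the budget
        picked.append(path)
        used += len(summary)
    return picked
```

Swapping the scoring function for cosine similarity over the existing nomic-embed-text vectors would reuse infrastructure BLACKICE already has.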


7. Agent Trajectories & Debugging

Survey covers: Agent trajectory datasets, step-by-step debugging, execution traces

Gap: BLACKICE has Beads events but lacks:

MISSING:
- Visual trajectory replay
- Step-by-step debugging interface
- Trajectory comparison (success vs failure)
- Human-in-the-loop trajectory correction

Recommendation: Build trajectory viewer from Beads events for debugging failed runs.
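A minimal viewer over Beads events could start as a numbered timeline renderer; the event schema below is illustrative since the actual Beads event shape is not reproduced here:

```python
def render_trajectory(events: list[dict]) -> str:
    """Render Beads-style events as a step-by-step timeline string
    for post-mortem debugging of a failed run."""
    lines = []
    for i, ev in enumerate(events, 1):
        ts = ev.get("ts", "")
        lines.append(f"{i:>3}. [{ts}] {ev['type']}: {ev.get('detail', '')}")
    return "\n".join(lines)
```

Rendering a failed and a successful run side by side from the same task is the cheapest form of trajectory comparison.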


8. Multi-Modal Code Understanding

Survey covers: Diagram-to-code, screenshot-to-code, UI understanding

Gap: BLACKICE is text-only:

MISSING:
- Screenshot/UI analysis for frontend tasks
- Architecture diagram parsing
- Visual bug reproduction
- Design mockup β†’ code generation

Recommendation: Route multimodal tasks to GPT-4o/Claude Vision via SmartRouter.
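The routing rule itself is small. Model names and the detection heuristic below are placeholders, not SmartRouter's actual logic, but they show where a modality check would slot into capability-based selection:

```python
IMAGE_EXTS = (".png", ".jpg", ".jpeg", ".gif", ".webp")

def pick_model(task_text: str, attachments: list[str]) -> str:
    """Route tasks with image inputs to a vision-capable model;
    names and heuristic are illustrative placeholders."""
    has_image = any(a.lower().endswith(IMAGE_EXTS) for a in attachments)
    if has_image or "screenshot" in task_text.lower():
        return "vision-model"  # e.g. GPT-4o or Claude with vision
    return "text-model"
```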


πŸ“Š Priority Matrix

| Gap | Impact | Effort | Priority |
| --- | --- | --- | --- |
| Benchmarking | High | Medium | P1 |
| Code RAG/Retrieval | High | High | P1 |
| Security Analysis | High | Low | P1 |
| Fine-tuning Pipeline | Medium | High | P2 |
| Execution Verification | Medium | Medium | P2 |
| Trajectory Debugging | Medium | Medium | P2 |
| Long-Context | Low | High | P3 |
| Multi-Modal | Low | Medium | P3 |

πŸš€ Suggested First Actions

  1. Add Semgrep validator - Low effort, immediate security benefit
  2. Integrate SWE-bench runner - Measure actual agent performance
  3. Build RepoMap-style indexing - Better context = better code generation

Scaffolded Infrastructure Already Available

BLACKICE already has partial scaffolding for some of these:

| Component | Location | Status |
| --- | --- | --- |
| Tree-sitter deps | ai-factory/pyproject.toml | Declared, not implemented |
| Extraction package | ai-factory/packages/extraction/ | Empty directory scaffold |
| Semantic embeddings | ralph/semantic_memory.py | Working with nomic-embed-text |
| Prometheus metrics | ralph/instrumentation/metrics.py | Working |
| SQLite caching | Pattern in memory.py | Working |
| Beads event store | ralph/beads.py | 40+ event types, working |

Additional Scaffolded But Incomplete Infrastructure

High-Impact Items

| Item | Location | Status |
| --- | --- | --- |
| Cloud Storage Backends | ralph/storage/cloud.py | Protocol + exceptions defined, only MemoryBackend works |
| CLI Commands | ralph/cli/commands/ | artifacts, dlq, dashboard are pure pass stubs |
| ai-factory packages | ai-factory/packages/ | 7 empty dirs: extraction, inference, mcp-servers, orchestration, skills, speckit, validation |
| SafetyGuard Alerts | retry.py:378 | TODO - never triggers |
| Task-Type Routing | orchestrator.py:387 | TODO - falls back to default |
| Grafana Dashboards | checklist.md | Prometheus metrics exist, no dashboards |

From Quality Checklist (73% Complete)

Not Implemented:

  • Agent-to-agent authentication
  • At-rest encryption for local Beads DB
  • Grafana dashboards pre-built
  • Alert rules defined (PagerDuty/Slack)
  • Load testing results
  • Kubernetes deployment guide
  • End-to-end tests
  • Chaos testing
  • OpenAPI documentation

BLACKICE Architecture Summary

53,000+ lines of Python organized as 12 layers:

  • Layer 0: Infrastructure (Docker, Postgres, Ollama, Letta)
  • Layer 1: Dispatcher (task classification)
  • Layer 2: Adapters (Claude, Ollama, Letta, Codex)
  • Layer 3: Core Loop (RalphLoop - try β†’ fail β†’ reflect β†’ retry)
  • Layer 4: Service Colony (multi-agent coordination)
  • Layer 5: Instrumentation (safety, cost, tracing)
  • Layer 6: Persistence (Beads event store)
  • Layer 7: Recovery (crash recovery, DLQ, worktrees)
  • Layer 8: Reflexion (self-improvement with 6 quality dimensions)
  • Layer 9: EnterpriseFlywheel (unified orchestrator)
  • Layer 10: Orchestrator
  • Layer 11: CLI

References

  • Service Colony Paper: arXiv:2407.07267
  • Reflexion Paper: Shinn et al., 2023 - "Reflexion: Language Agents with Verbal Reinforcement Learning"
  • BLACKICE Codebase: /Users/speed/proxmox/blackice/ (53,000+ lines)
  • Code Intelligence Survey: "Comprehensive Survey and Practical Guide to Code Intelligence"

Generated from BLACKICE architecture analysis session - January 2026
