Based on an exploration of the BLACKICE codebase against the Code Intelligence survey's coverage, here is what BLACKICE could adopt.
| Survey Pattern | BLACKICE Implementation |
|---|---|
| SWE Agent Architecture | RalphLoop, Supervisor, WorkflowDAG |
| Reflexion/Self-Improvement | Full ReflexionLoop with 6 quality dimensions |
| Multi-Agent Coordination | AgentRegistry, ConsensusEngine, MessageBroker |
| Tool Use | Claude Code adapter with file edit, bash, search |
| Memory Systems | Letta archival + semantic embeddings |
| Safety Guards | Cost limits, loop detection, workspace isolation |
| Multi-Model Routing | SmartRouter with capability-based selection |
Survey covers: HumanEval, MBPP, SWE-bench, CodeContests, LiveCodeBench
Gap: BLACKICE has no standardized benchmark integration
MISSING:
- SWE-bench task runner for measuring agent performance
- Pass@k metrics tracking
- Regression testing against known benchmarks
- Performance comparison across model versions
Recommendation: Add benchmarks/ module with SWE-bench, HumanEval runners to measure improvement over time.
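A benchmarks/ module would need the standard unbiased pass@k estimator from the HumanEval paper; a minimal sketch (the function name and signature here are illustrative, not existing BLACKICE code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per task
    c: number of samples that passed the tests
    k: sample budget being evaluated
    """
    if n - c < k:
        # Fewer than k failing samples: every k-subset contains a pass.
        return 1.0
    # 1 - P(all k sampled completions fail)
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Aggregating this per-task score across SWE-bench or HumanEval runs over time gives the regression-tracking signal the module is meant to provide.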
Survey covers: RepoMap, codebase indexing, AST-based retrieval
Gap: BLACKICE uses keyword/embedding search but lacks:
MISSING:
- AST-aware code chunking
- Repository-level context graphs
- Call graph / dependency indexing
- Intelligent context window packing
Recommendation: Integrate tree-sitter parsing + code graph indexing (like Aider's RepoMap or Cursor's approach).
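Full tree-sitter integration is the goal, but the core idea can be sketched with Python's stdlib `ast` as a stand-in: chunk a file at definition boundaries instead of fixed-size windows, so retrieval returns whole functions and classes:

```python
import ast

def chunk_by_definitions(source: str) -> list[dict]:
    """Split a module into per-definition chunks for AST-aware retrieval."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                # lineno/end_lineno are 1-based; slice is end-exclusive.
                "text": "\n".join(lines[node.lineno - 1:node.end_lineno]),
            })
    return chunks
```

Tree-sitter generalizes the same boundary-detection idea across languages; the chunks would then feed the embedding index instead of raw line windows.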
Survey covers: Sandboxed execution, test generation, property-based testing
Gap: BLACKICE validates via pytest/file checks but lacks:
MISSING:
- Auto-generated unit tests for verification
- Property-based testing (Hypothesis-style)
- Fuzzing for edge case discovery
- Execution traces for debugging
Recommendation: Add test generation as validation step before declaring success.
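A minimal property-based check, sketched in pure Python (helper names are illustrative); Hypothesis would replace the hand-rolled generator and add input shrinking:

```python
import random

def check_property(prop, gen, trials: int = 200, seed: int = 0) -> None:
    """Hypothesis-style check: random inputs, fail on first counterexample."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = gen(rng)
        assert prop(x), f"property falsified by {x!r}"

# Example property a generated sort fix should preserve: idempotence.
def sort_is_idempotent(xs):
    return sorted(sorted(xs)) == sorted(xs)

check_property(
    sort_is_idempotent,
    lambda rng: [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))],
)
```

Run as a validation step, a falsified property rejects the candidate patch before the loop declares success.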
Survey covers: SFT, RLHF, DPO, RLVR (reinforcement learning from verifiable rewards)
Gap: BLACKICE uses Reflexion (verbal RL) but has no:
MISSING:
- Data collection for fine-tuning local models
- Preference pairs capture from user feedback
- RLVR integration (reward = tests pass)
- LoRA/QLoRA training pipeline for Ollama models
Recommendation: Capture (task, successful_output) pairs → fine-tune local Qwen/CodeLlama for domain-specific improvement.
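The capture side can be sketched assuming a chat-style JSONL schema compatible with common SFT pipelines (the schema and function name are assumptions, not existing BLACKICE code):

```python
import json
from pathlib import Path

def record_success(task: str, output: str, store: Path) -> None:
    """Append one fine-tuning example after a run passes validation."""
    example = {
        "messages": [
            {"role": "user", "content": task},
            {"role": "assistant", "content": output},
        ]
    }
    with store.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(example) + "\n")
```

Preference pairs for DPO would extend the same record with a rejected output; a LoRA/QLoRA pipeline then trains directly off the accumulated file.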
Survey covers: Vulnerability detection, insecure code patterns, CWE classification
Gap: BLACKICE's safety is about cost/loops, not code security:
MISSING:
- Static analysis integration (Semgrep, Bandit)
- Vulnerability scanning of generated code
- CWE pattern detection
- Security-focused code review step
Recommendation: Add security validator in CompositeValidator chain (run Semgrep before accepting code).
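Until a real Semgrep subprocess call is wired into the CompositeValidator chain, the validator's shape can be sketched with a few illustrative regex rules (these patterns are stand-ins, not a substitute for Semgrep's rule packs):

```python
import re

# Minimal pattern checks standing in for Semgrep/Bandit integration.
RISKY_PATTERNS = {
    "exec-use": re.compile(r"\b(eval|exec)\s*\("),
    "shell-true": re.compile(r"subprocess\.\w+\([^)]*shell\s*=\s*True"),
    "hardcoded-secret": re.compile(r"(?i)(api_key|password|secret)\s*=\s*['\"]"),
}

def security_findings(code: str) -> list[str]:
    """Return matching rule ids for generated code; empty list means pass."""
    return [rule for rule, pat in RISKY_PATTERNS.items() if pat.search(code)]
```

A non-empty findings list would fail the validation step, the same way a failing pytest run does today.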
Survey covers: 100K+ context models, hierarchical summarization, context compression
Gap: BLACKICE packs recent history but lacks:
MISSING:
- Hierarchical repo summarization
- Context compression techniques
- Selective context based on task relevance
- Long-context model routing for large codebases
Recommendation: Build repo summary cache + relevance-based context selection.
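Relevance-based selection can start as greedy packing over cached per-file summaries; a sketch assuming keyword overlap as the relevance score (an embedding similarity would drop into the same slot):

```python
def pack_context(task: str, summaries: dict[str, str], budget_chars: int) -> str:
    """Score cached file summaries against the task, pack best-first
    under a character budget."""
    task_words = set(task.lower().split())

    def score(text: str) -> int:
        return len(task_words & set(text.lower().split()))

    picked, used = [], 0
    for path, summary in sorted(summaries.items(),
                                key=lambda kv: score(kv[1]), reverse=True):
        entry = f"# {path}\n{summary}\n"
        if used + len(entry) > budget_chars:
            continue  # skip entries that would blow the budget
        picked.append(entry)
        used += len(entry)
    return "\n".join(picked)
```

The budget would come from the routed model's context window, which also gives long-context routing a natural hook.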
Survey covers: Agent trajectory datasets, step-by-step debugging, execution traces
Gap: BLACKICE has Beads events but lacks:
MISSING:
- Visual trajectory replay
- Step-by-step debugging interface
- Trajectory comparison (success vs failure)
- Human-in-the-loop trajectory correction
Recommendation: Build trajectory viewer from Beads events for debugging failed runs.
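Trajectory comparison can begin with a frequency diff over event types; a sketch assuming Beads events are dicts with a `type` field (the exact schema is an assumption):

```python
from collections import Counter

def compare_trajectories(success_events, failure_events):
    """Diff event-type counts between a passing and a failing run to
    surface where the failing run diverged (e.g. extra retries)."""
    s = Counter(e["type"] for e in success_events)
    f = Counter(e["type"] for e in failure_events)
    return {t: f[t] - s[t] for t in (s | f) if f[t] != s[t]}
```

Positive deltas point at event types the failing run over-produced, which narrows where to start replaying.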
Survey covers: Diagram-to-code, screenshot-to-code, UI understanding
Gap: BLACKICE is text-only:
MISSING:
- Screenshot/UI analysis for frontend tasks
- Architecture diagram parsing
- Visual bug reproduction
- Design mockup → code generation
Recommendation: Route multimodal tasks to GPT-4o/Claude Vision via SmartRouter.
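Capability-based routing in SmartRouter could reduce to a coverage check over a model-capability table; a sketch where the table contents and function name are illustrative:

```python
# Hypothetical capability table; model names here are examples.
MODEL_CAPS = {
    "gpt-4o": {"vision", "code"},
    "claude-sonnet": {"vision", "code"},
    "qwen2.5-coder": {"code"},
}

def route(task_caps: set[str]) -> str:
    """Pick the first model whose capabilities cover the task's needs."""
    for model, caps in MODEL_CAPS.items():
        if task_caps <= caps:
            return model
    raise LookupError(f"no model covers {task_caps}")
```

Tagging screenshot or diagram tasks with a `vision` requirement is then enough for the existing router to send them to a multimodal model.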
| Gap | Impact | Effort | Priority |
|---|---|---|---|
| Benchmarking | High | Medium | P1 |
| Code RAG/Retrieval | High | High | P1 |
| Security Analysis | High | Low | P1 |
| Fine-tuning Pipeline | Medium | High | P2 |
| Execution Verification | Medium | Medium | P2 |
| Trajectory Debugging | Medium | Medium | P2 |
| Long-Context | Low | High | P3 |
| Multi-Modal | Low | Medium | P3 |
- Add Semgrep validator - Low effort, immediate security benefit
- Integrate SWE-bench runner - Measure actual agent performance
- Build RepoMap-style indexing - Better context = better code generation
BLACKICE already has partial scaffolding for some of these:
| Component | Location | Status |
|---|---|---|
| Tree-sitter deps | ai-factory/pyproject.toml | Declared, not implemented |
| Extraction package | ai-factory/packages/extraction/ | Empty directory scaffold |
| Semantic embeddings | ralph/semantic_memory.py | Working with nomic-embed-text |
| Prometheus metrics | ralph/instrumentation/metrics.py | Working |
| SQLite caching | Pattern in memory.py | Working |
| Beads event store | ralph/beads.py | 40+ event types, working |
| Item | Location | Status |
|---|---|---|
| Cloud Storage Backends | ralph/storage/cloud.py | Protocol + exceptions defined, only MemoryBackend works |
| CLI Commands | ralph/cli/commands/ | artifacts, dlq, dashboard are pure pass stubs |
| ai-factory packages | ai-factory/packages/ | 7 empty dirs: extraction, inference, mcp-servers, orchestration, skills, speckit, validation |
| SafetyGuard Alerts | retry.py:378 | TODO - never triggers |
| Task-Type Routing | orchestrator.py:387 | TODO - falls back to default |
| Grafana Dashboards | checklist.md | Prometheus metrics exist, no dashboards |
Not Implemented:
- Agent-to-agent authentication
- At-rest encryption for local Beads DB
- Grafana dashboards pre-built
- Alert rules defined (PagerDuty/Slack)
- Load testing results
- Kubernetes deployment guide
- End-to-end tests
- Chaos testing
- OpenAPI documentation
53,000+ lines of Python organized as 12 layers:
- Layer 0: Infrastructure (Docker, Postgres, Ollama, Letta)
- Layer 1: Dispatcher (task classification)
- Layer 2: Adapters (Claude, Ollama, Letta, Codex)
- Layer 3: Core Loop (RalphLoop - try → fail → reflect → retry)
- Layer 4: Service Colony (multi-agent coordination)
- Layer 5: Instrumentation (safety, cost, tracing)
- Layer 6: Persistence (Beads event store)
- Layer 7: Recovery (crash recovery, DLQ, worktrees)
- Layer 8: Reflexion (self-improvement with 6 quality dimensions)
- Layer 9: EnterpriseFlywheel (unified orchestrator)
- Layer 10: Orchestrator
- Layer 11: CLI
- Service Colony Paper: arXiv:2407.07267
- Reflexion Paper: Shinn et al., 2023 - "Reflexion: Language Agents with Verbal Reinforcement Learning"
- BLACKICE Codebase: /Users/speed/proxmox/blackice/ (53,000+ lines)
- Code Intelligence Survey: "Comprehensive Survey and Practical Guide to Code Intelligence"
Generated from BLACKICE architecture analysis session - January 2026