Research findings for open-source and commercial LLM proxy solutions that provide request queuing, rate limiting, and automatic retry mechanisms.
This document covers the top solutions found through research on GitHub and other sources for LLM proxy/gateway servers that act as middleware between clients and LLM APIs.
GitHub: https://github.com/BerriAI/litellm | Stars: ~89k | Language: Python
- Rate Limiting & Load Balancing: Built-in load balancing with retry/fallback logic for routing traffic
- Retry Logic: Exception handling and automatic fallback mechanisms when providers fail
- Monitoring/Logging: Comprehensive logging, observability callbacks (Lunary, MLflow, Langfuse), virtual keys for access control, spend tracking
- Unified interface supporting 100+ LLM providers (OpenAI format compatible)
- High performance: 8ms P95 latency at 1k RPS
- Enterprise adoption by Stripe, Netflix, Google ADK
- Stable release with 12-hour load testing before publication
- Advanced queuing/backoff algorithms not explicitly documented as native feature
- Full proxy features require setup for virtual keys and dashboard UI
Production Readiness: ✅ Yes - Actively maintained, major enterprise users, stable releases
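LiteLLM's router applies retries per provider and falls back down an ordered list when a provider keeps failing. The sketch below illustrates that retry-then-fallback pattern generically; it is not LiteLLM's actual API, and the `flaky`/`stable` providers are hypothetical stand-ins.

```python
import time

class AllProvidersFailed(Exception):
    """Raised when every provider in the fallback chain has failed."""

def call_with_fallback(providers, prompt, retries_per_provider=2, base_delay=0.0):
    """Try each provider in order; retry transient failures before falling back.

    `providers` is an ordered list of (name, callable) pairs, where each
    callable takes a prompt and returns a completion or raises on failure.
    """
    errors = []
    for name, provider in providers:
        for attempt in range(retries_per_provider):
            try:
                return provider(prompt)
            except Exception as exc:
                errors.append((name, attempt, exc))
                time.sleep(base_delay * (attempt + 1))  # pause between retries
    raise AllProvidersFailed(errors)

# Usage: the backup provider answers after the primary keeps failing.
def flaky(prompt):
    raise ConnectionError("primary provider down")

def stable(prompt):
    return f"echo: {prompt}"

print(call_with_fallback([("primary", flaky), ("backup", stable)], "hi"))
# → echo: hi
```

In LiteLLM itself, the equivalent behavior is configured on its Router (retry counts and fallback model lists) rather than hand-rolled.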
GitHub: https://github.com/katanemo/plano | Stars: ~6.1k | Language: Rust
- Rate Limiting: Guardrails and filtering capabilities
- Smart Model Routing: Semantic routing via 4B-parameter orchestrator
- Monitoring/Observability: Zero-code OpenTelemetry tracing, OTEL traces/metrics across every agent, rich agentic signals
- Built on Envoy (production-grade proxy infrastructure)
- Low-latency orchestration between agents
- Unified LLM APIs with automatic state management
- Production-ready agentic application deployment
- Out-of-process dataplane adds infrastructure overhead
- Production scaling requires local deployment or API keys for hosted models
Production Readiness: ✅ Yes - Actively maintained (614 commits, v0.4.8 Feb 2026), backed by research
GitHub: https://github.com/thushan/olla | Stars: ~160 | Language: Go
- Rate Limiting: Production-ready rate limiting and request size limits
- Request Queuing: Smart routing with priority-based queues
- Retry/Failover: Automatic retry on connection failures with transparent endpoint switching
- Circuit Breakers: Continuous health checks with circuit breakers and automatic recovery
- Dual engine architecture (Sherpa for simplicity, Olla for maximum performance)
- Extremely lightweight: Runs on <50MB RAM
- Native support for Ollama, vLLM, LiteLLM, LM Studio
- Active failover and model discovery
- Still in active development; some features being finalized
- TLS termination and dynamic configuration API pending completion
Production Readiness: ⚠️ Partial - Core routing and failover are solid, but TLS termination and the dynamic configuration API are still pending
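Olla's priority-based request queues can be sketched with a standard min-heap: lower priority numbers are served first, and a counter preserves FIFO order among equal priorities. This `PriorityRequestQueue` class and its priority scale are illustrative assumptions, not Olla's actual implementation (which is in Go).

```python
import heapq
import itertools

class PriorityRequestQueue:
    """Minimal priority queue for requests: lower number = served first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order

    def enqueue(self, request, priority=10):
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def dequeue(self):
        if not self._heap:
            raise IndexError("queue is empty")
        _, _, request = heapq.heappop(self._heap)
        return request

    def __len__(self):
        return len(self._heap)

# Usage: an interactive request jumps ahead of earlier batch work.
q = PriorityRequestQueue()
q.enqueue("batch-embeddings", priority=20)
q.enqueue("chat-completion", priority=1)
print(q.dequeue())  # → chat-completion
```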
GitHub: https://github.com/KikuAI-Lab/reliapi | Stars: ~7 | Language: Python/Redis
- Retry with Exponential Backoff: Automatic retries built-in for transient failures
- Circuit Breaker: Explicit circuit breaker to prevent cascading failures
- Rate Limiting: Built-in token bucket rate limiting per tier
- Request Coalescing: Idempotency keys (similar to queuing)
- Combines multiple reliability patterns in single gateway
- Redis-based TTL caching for GET requests and LLM responses
- Budget caps and cost estimation prevent runaway expenses
- Dedicated SDKs for Python/JavaScript
- Requires Redis as hard dependency
- Early adoption stage (7 stars, minimal issues visible)
- Payment processing relies on third-party Paddle
Production Readiness: ⚠️ Early stage - Strong reliability patterns on paper, but minimal adoption so far; validate in your environment before relying on it
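ReliAPI's per-tier rate limiting uses the classic token-bucket algorithm: tokens refill at a steady rate up to a burst capacity, and each request spends one. A minimal sketch of that algorithm (not ReliAPI's actual code), with an injectable clock so the example is deterministic:

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: `rate` tokens/second, burst up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)  # start full to permit an initial burst
        self.clock = clock
        self.last = clock()

    def allow(self, cost=1.0):
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Usage with a fake clock for a deterministic demonstration:
t = [0.0]
bucket = TokenBucket(rate=1, capacity=2, clock=lambda: t[0])
print(bucket.allow(), bucket.allow(), bucket.allow())  # → True True False
t[0] += 1.0  # one simulated second passes, refilling one token
print(bucket.allow())  # → True
```

ReliAPI stores its limiter state in Redis so it works across processes; the in-memory version above only illustrates the accounting.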
GitHub: https://github.com/theopenco/llmgateway | Stars: ~900 | Language: TypeScript
- Rate Limiting: Advertised via the repository's topic tag, but implementation details are not documented
- Monitoring/Logging: Usage analytics (requests, tokens, response times, costs)
- Provider-agnostic via OpenAI-compatible interface
- Built-in cost and token tracking without external tools
- Modern stack using Next.js and Hono
- Dual licensing (AGPLv3 core + Enterprise tier)
- Advanced resilience patterns not documented
- AGPLv3 license requires open-sourcing modifications for public use
Production Readiness: ⚠️ Partial - Solid analytics and cost tracking, but advanced resilience patterns are undocumented
GitHub: https://github.com/Nayjest/lm-proxy | Stars: ~80 | Language: Python/FastAPI
- Rate Limiting: API key validation via user groups; granular rate limiting not detailed
- Logging: Configurable logging with database connector, custom log writers
- Provider agnostic (OpenAI, Anthropic, Google, local PyTorch)
- Extensible library/standalone service design
- Secure key management separating proxy keys from provider keys
- Lightweight FastAPI-based
- Manual routing configuration required (no automatic discovery)
- Limited resilience patterns out-of-the-box
- May require infrastructure layer implementation for high-scale use
Production Readiness: ⚠️ Partial - Lightweight and extensible, but limited resilience patterns out of the box
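LM Proxy's separation of proxy keys from provider keys means clients authenticate with proxy-issued keys while the real upstream credentials never leave the gateway. A hypothetical sketch of that lookup (the key store, key names, and `resolve_provider_key` helper are all illustrative, not LM Proxy's API):

```python
import hmac

# Hypothetical key store: clients hold proxy-issued keys; real provider keys
# stay server-side (in practice they would come from a secrets manager).
PROXY_KEYS = {
    "pk-client-alpha": {"group": "dev", "provider_key_id": "openai-main"},
}
PROVIDER_KEYS = {
    "openai-main": "sk-real-provider-key",
}

def resolve_provider_key(proxy_key):
    """Validate a client's proxy key and return the upstream provider key."""
    for known, meta in PROXY_KEYS.items():
        # constant-time comparison avoids leaking key prefixes via timing
        if hmac.compare_digest(proxy_key, known):
            return PROVIDER_KEYS[meta["provider_key_id"]]
    raise PermissionError("unknown proxy key")

print(resolve_provider_key("pk-client-alpha"))  # → sk-real-provider-key
```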
GitHub: https://github.com/fabiojbg/LLMApiGateway | Language: Python
- Retry Logic: Configurable retry attempts and delays per model
- Fault Tolerance/Fallbacks: Automatic provider switching on failure, custom fallback sequences
- Flexible multi-provider failover configuration
- Easy Docker/pip deployment
- Usage statistics and cost tracking via local UI
- No enterprise-grade rate limiting or distributed tracing
- Personal project origin (may need hardening for production)
Production Readiness: ⚠️ Limited - Personal project origin; likely needs hardening before production use
| Project | Rate Limiting | Request Queuing | Retry w/ Backoff | Circuit Breaker | Monitoring | Stars | Language |
|---|---|---|---|---|---|---|---|
| LiteLLM | Partial | ❌ | Basic | Limited | Excellent | 89k | Python |
| Plano | Guardrails | ❌ | ❌ | ❌ | OTEL | 6.1k | Rust |
| Olla | Full | ✅ Priority Queues | Auto Failover | Health Checks | Basic | 160 | Go |
| ReliAPI | Token Bucket | Idempotency Keys | Exponential Backoff | Explicit | Prometheus | 7 | Python/Redis |
| LLM Gateway | Topic Tagged | ❌ | ❌ | ❌ | Analytics | 900 | TypeScript |
| LM Proxy | Basic Groups | ❌ | Client-level | ❌ | Configurable Logging | 80 | Python/FastAPI |
LiteLLM:
- ✅ Best balance of features and production readiness
- ✅ Largest community (~89k stars)
- ✅ Enterprise adoption verified
- ⚠️ May need external tooling for advanced queuing/backoff
ReliAPI:
- ✅ Explicit exponential backoff implementation
- ✅ Circuit breaker pattern implemented
- ✅ Token bucket rate limiting
- ⚠️ Early stage (7 stars) - validate in your environment first
Olla:
- ✅ Built specifically for LLM proxy workloads
- ✅ Lightweight (<50MB RAM)
- ✅ Native support for Ollama, vLLM, LiteLLM
- ⚠️ Active development - some features pending completion
Plano:
- ✅ Zero-code OpenTelemetry integration
- ✅ Enterprise-grade tracing and metrics
- ⚠️ Limited documentation on queuing/backoff specifics
| Solution | Type | Notable Features |
|---|---|---|
| Katanemo Plano (Hosted) | Commercial SaaS | Production scaling, US-central region free tier |
| OpenRouter | Commercial API | Multi-provider access, cost tracking |
| Portkey AI | Commercial Gateway | Rate limiting, analytics, routing rules |
If none of the solutions fully meet your requirements, consider:
- Combining Solutions: Use LiteLLM for routing + external rate limiting middleware
- Custom Middleware Layer: Build retry/backoff and circuit-breaker logic on top of any proxy
- Infrastructure-Level Solution: Place NGINX/Traefik load balancer in front of LLM gateway
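A custom middleware layer combining the two missing patterns might look like the sketch below: exponential backoff with full jitter for transient errors, plus a simple consecutive-failure circuit breaker that sheds load once a threshold is crossed. The class name, thresholds, and wrapped function are illustrative assumptions, not a reference implementation.

```python
import random
import time

class CircuitOpen(Exception):
    """Raised when the breaker is open and calls are being shed."""

class ResilientCaller:
    """Wrap any callable with retry-plus-backoff and a failure-count breaker."""

    def __init__(self, fn, max_retries=3, base_delay=0.5, failure_threshold=5):
        self.fn = fn
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def call(self, *args, **kwargs):
        if self.consecutive_failures >= self.failure_threshold:
            raise CircuitOpen("too many consecutive failures; shedding load")
        for attempt in range(self.max_retries + 1):
            try:
                result = self.fn(*args, **kwargs)
                self.consecutive_failures = 0  # success closes the breaker
                return result
            except Exception:
                self.consecutive_failures += 1
                if attempt == self.max_retries:
                    raise
                # exponential backoff with full jitter
                time.sleep(random.uniform(0, self.base_delay * (2 ** attempt)))

# Usage: a call that fails twice, then succeeds, is retried transparently.
calls = {"n": 0}
def sometimes_fails():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

caller = ResilientCaller(sometimes_fails, max_retries=3, base_delay=0.0)
print(caller.call())  # → ok
```

A production version would also need a half-open state (periodically probing the upstream instead of staying open forever) and per-endpoint breaker state, which is roughly what ReliAPI and Olla provide out of the box.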
For production use requiring all three features (queueing, rate limiting, retry with exponential backoff), I recommend:
- Primary Choice: LiteLLM for broadest feature support and enterprise readiness
- Backup/Complementary: Consider using ReliAPI patterns or implementing circuit breakers separately if ReliAPI is too early-stage for your use case
For discovering more solutions, search these GitHub topics:
- llm-proxy: https://github.com/topics/llm-proxy
- ai-gateway: https://github.com/topics/ai-gateway
- rate-limiting: https://github.com/topics/rate-limiting
Research conducted on 2026-02-27 using GitHub search for LLM proxy, gateway, rate limiting, and reliability patterns.