Research findings for open-source and commercial LLM proxy solutions that provide request queuing, rate limiting, and automatic retry mechanisms.
This document covers the top solutions found through research on GitHub and other sources for LLM proxy/gateway servers that act as middleware between clients and LLM APIs.
GitHub: https://github.com/BerriAI/litellm | Stars: ~89k | Language: Python
- Rate Limiting & Load Balancing: Built-in load balancing with retry/fallback logic for routing traffic
- Retry Logic: Exception handling and automatic fallback mechanisms when providers fail
- Monitoring/Logging: Comprehensive logging, observability callbacks (Lunary, MLflow, Langfuse), virtual keys for access control, spend tracking
- Unified interface supporting 100+ LLM providers (OpenAI format compatible)
- High performance: 8ms P95 latency at 1k RPS
- Enterprise adoption by Stripe, Netflix, Google ADK
- Stable release with 12-hour load testing before publication
- Advanced queuing/backoff algorithms not explicitly documented as native feature
- Full proxy features require setup for virtual keys and dashboard UI
Production Readiness: ✅ Yes - Actively maintained, major enterprise users, stable releases
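LiteLLM's router applies retries per provider and falls back down an ordered list when a provider keeps failing. The sketch below illustrates that retry-then-fallback pattern generically; it is not LiteLLM's actual API, and the `flaky`/`stable` providers are hypothetical stand-ins.

```python
import time

class AllProvidersFailed(Exception):
    """Raised when every provider in the fallback chain has failed."""

def call_with_fallback(providers, prompt, retries_per_provider=2, base_delay=0.0):
    """Try each provider in order; retry transient failures before falling back.

    `providers` is an ordered list of (name, callable) pairs, where each
    callable takes a prompt and returns a completion or raises on failure.
    """
    errors = []
    for name, provider in providers:
        for attempt in range(retries_per_provider):
            try:
                return provider(prompt)
            except Exception as exc:
                errors.append((name, attempt, exc))
                time.sleep(base_delay * (attempt + 1))  # pause between retries
    raise AllProvidersFailed(errors)

# Usage: the backup provider answers after the primary keeps failing.
def flaky(prompt):
    raise ConnectionError("primary provider down")

def stable(prompt):
    return f"echo: {prompt}"

print(call_with_fallback([("primary", flaky), ("backup", stable)], "hi"))
# → echo: hi
```

In LiteLLM itself, the equivalent behavior is configured on its Router (retry counts and fallback model lists) rather than hand-rolled.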
GitHub: https://github.com/katanemo/plano | Stars: ~6.1k | Language: Rust
- Rate Limiting: Guardrails and filtering capabilities
- Smart Model Routing: Semantic routing via 4B-parameter orchestrator
- Monitoring/Observability: Zero-code OpenTelemetry tracing, OTEL traces/metrics across every agent, rich agentic signals
- Built on Envoy (production-grade proxy infrastructure)
- Low-latency orchestration between agents
- Unified LLM APIs with automatic state management
- Production-ready agentic application deployment
- Out-of-process dataplane adds infrastructure overhead
- Production scaling requires local deployment or API keys for hosted models
Production Readiness: ✅ Yes - Actively maintained (614 commits, v0.4.8 Feb 2026), backed by research
GitHub: https://github.com/thushan/olla | Stars: ~160 | Language: Go
- Rate Limiting: Production-ready rate limiting and request size limits
- Request Queuing: Smart routing with priority-based queues
- Retry/Failover: Automatic retry on connection failures with transparent endpoint switching
- Circuit Breakers: Continuous health checks with circuit breakers and automatic recovery
- Dual engine architecture (Sherpa for simplicity, Olla for maximum performance)
- Extremely lightweight: Runs on <50MB RAM
- Native support for Ollama, vLLM, LiteLLM, LM Studio
- Active failover and model discovery
- Still in active development; some features being finalized
- TLS termination and dynamic configuration API pending completion
Production Readiness: ⚠️ Partial - Core routing and failover are solid, but TLS termination and the dynamic configuration API are still pending
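Olla's priority-based request queues can be sketched with a standard min-heap: lower priority numbers are served first, and a counter preserves FIFO order among equal priorities. This `PriorityRequestQueue` class and its priority scale are illustrative assumptions, not Olla's actual implementation (which is in Go).

```python
import heapq
import itertools

class PriorityRequestQueue:
    """Minimal priority queue for requests: lower number = served first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order

    def enqueue(self, request, priority=10):
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def dequeue(self):
        if not self._heap:
            raise IndexError("queue is empty")
        _, _, request = heapq.heappop(self._heap)
        return request

    def __len__(self):
        return len(self._heap)

# Usage: an interactive request jumps ahead of earlier batch work.
q = PriorityRequestQueue()
q.enqueue("batch-embeddings", priority=20)
q.enqueue("chat-completion", priority=1)
print(q.dequeue())  # → chat-completion
```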
GitHub: https://github.com/KikuAI-Lab/reliapi | Stars: ~7 | Language: Python/Redis
- Retry with Exponential Backoff: Automatic retries built-in for transient failures
- Circuit Breaker: Explicit circuit breaker to prevent cascading failures
- Rate Limiting: Built-in token bucket rate limiting per tier
- Request Coalescing: Idempotency keys (similar to queuing)
- Combines multiple reliability patterns in single gateway
- Redis-based TTL caching for GET requests and LLM responses
- Budget caps and cost estimation prevent runaway expenses
- Dedicated SDKs for Python/JavaScript
- Requires Redis as hard dependency
- Early adoption stage (7 stars, minimal issues visible)
- Payment processing relies on third-party Paddle
Production Readiness: ⚠️ Early stage - Strong reliability patterns on paper, but minimal adoption so far; validate in your environment before relying on it
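ReliAPI's per-tier rate limiting uses the classic token-bucket algorithm: tokens refill at a steady rate up to a burst capacity, and each request spends one. A minimal sketch of that algorithm (not ReliAPI's actual code), with an injectable clock so the example is deterministic:

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: `rate` tokens/second, burst up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)  # start full to permit an initial burst
        self.clock = clock
        self.last = clock()

    def allow(self, cost=1.0):
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Usage with a fake clock for a deterministic demonstration:
t = [0.0]
bucket = TokenBucket(rate=1, capacity=2, clock=lambda: t[0])
print(bucket.allow(), bucket.allow(), bucket.allow())  # → True True False
t[0] += 1.0  # one simulated second passes, refilling one token
print(bucket.allow())  # → True
```

ReliAPI stores its limiter state in Redis so it works across processes; the in-memory version above only illustrates the accounting.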
GitHub: https://github.com/theopenco/llmgateway | Stars: ~900 | Language: TypeScript
- Rate Limiting: Advertised via the repository's topic tag, but implementation details are not documented
- Monitoring/Logging: Usage analytics (requests, tokens, response times, costs)
- Provider-agnostic via OpenAI-compatible interface
- Built-in cost and token tracking without external tools
- Modern stack using Next.js and Hono
- Dual licensing (AGPLv3 core + Enterprise tier)
- Advanced resilience patterns not documented
- AGPLv3 license requires open-sourcing modifications for public use
Production Readiness: ⚠️ Partial - Solid analytics and cost tracking, but advanced resilience patterns are undocumented
GitHub: https://github.com/Nayjest/lm-proxy | Stars: ~80 | Language: Python/FastAPI
- Rate Limiting: API key validation via user groups; granular rate limiting not detailed
- Logging: Configurable logging with database connector, custom log writers
- Provider agnostic (OpenAI, Anthropic, Google, local PyTorch)
- Extensible library/standalone service design
- Secure key management separating proxy keys from provider keys
- Lightweight FastAPI-based
- Manual routing configuration required (no automatic discovery)
- Limited resilience patterns out-of-the-box
- May require infrastructure layer implementation for high-scale use
Production Readiness: ⚠️ Partial - Lightweight and extensible, but limited resilience patterns out of the box
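LM Proxy's separation of proxy keys from provider keys means clients authenticate with proxy-issued keys while the real upstream credentials never leave the gateway. A hypothetical sketch of that lookup (the key store, key names, and `resolve_provider_key` helper are all illustrative, not LM Proxy's API):

```python
import hmac

# Hypothetical key store: clients hold proxy-issued keys; real provider keys
# stay server-side (in practice they would come from a secrets manager).
PROXY_KEYS = {
    "pk-client-alpha": {"group": "dev", "provider_key_id": "openai-main"},
}
PROVIDER_KEYS = {
    "openai-main": "sk-real-provider-key",
}

def resolve_provider_key(proxy_key):
    """Validate a client's proxy key and return the upstream provider key."""
    for known, meta in PROXY_KEYS.items():
        # constant-time comparison avoids leaking key prefixes via timing
        if hmac.compare_digest(proxy_key, known):
            return PROVIDER_KEYS[meta["provider_key_id"]]
    raise PermissionError("unknown proxy key")

print(resolve_provider_key("pk-client-alpha"))  # → sk-real-provider-key
```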
GitHub: https://github.com/fabiojbg/LLMApiGateway | Language: Python
- Retry Logic: Configurable retry attempts and delays per model
- Fault Tolerance/Fallbacks: Automatic provider switching on failure, custom fallback sequences
- Flexible multi-provider failover configuration
- Easy Docker/pip deployment
- Usage statistics and cost tracking via local UI
- No enterprise-grade rate limiting or distributed tracing
- Personal project origin (may need hardening for production)
Production Readiness: ⚠️ Limited - Personal project origin; likely needs hardening before production use
| Project | Rate Limiting | Request Queuing | Retry w/ Backoff | Circuit Breaker | Monitoring | Stars | Language |
|---|---|---|---|---|---|---|---|
| LiteLLM | Partial | ❌ | Basic | Limited | Excellent | 89k | Python |
| Plano | Guardrails | ❌ | ❌ | ❌ | OTEL | 6.1k | Rust |
| Olla | Full | ✅ Priority Queues | Auto Failover | Health Checks | Basic | 160 | Go |
| ReliAPI | Token Bucket | Idempotency Keys | Exponential Backoff | Explicit | Prometheus | 7 | Python/Redis |
| LLM Gateway | Topic Tagged | ❌ | ❌ | ❌ | Analytics | 900 | TypeScript |
| LM Proxy | Basic Groups | ❌ | Client-level | ❌ | Configurable Logging | 80 | Python/FastAPI |
LiteLLM:
- ✅ Best balance of features and production readiness
- ✅ Largest community (~89k stars)
- ✅ Enterprise adoption verified
- ⚠️ May need external tooling for advanced queuing/backoff
ReliAPI:
- ✅ Explicit exponential backoff implementation
- ✅ Circuit breaker pattern implemented
- ✅ Token bucket rate limiting
- ⚠️ Early stage (7 stars) - validate in your environment first
Olla:
- ✅ Built specifically for LLM proxy workloads
- ✅ Lightweight (<50MB RAM)
- ✅ Native support for Ollama, vLLM, LiteLLM
- ⚠️ Active development - some features pending completion
Plano:
- ✅ Zero-code OpenTelemetry integration
- ✅ Enterprise-grade tracing and metrics
- ⚠️ Limited documentation on queuing/backoff specifics
| Solution | Type | Notable Features |
|---|---|---|
| Katanemo Plano (Hosted) | Commercial SaaS | Production scaling, US-central region free tier |
| OpenRouter | Commercial API | Multi-provider access, cost tracking |
| Portkey AI | Commercial Gateway | Rate limiting, analytics, routing rules |
If none of the solutions fully meet your requirements, consider:
- Combining Solutions: Use LiteLLM for routing + external rate limiting middleware
- Custom Middleware Layer: Build retry/backoff and circuit-breaker logic on top of any proxy
- Infrastructure-Level Solution: Place NGINX/Traefik load balancer in front of LLM gateway
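A custom middleware layer combining the two missing patterns might look like the sketch below: exponential backoff with full jitter for transient errors, plus a simple consecutive-failure circuit breaker that sheds load once a threshold is crossed. The class name, thresholds, and wrapped function are illustrative assumptions, not a reference implementation.

```python
import random
import time

class CircuitOpen(Exception):
    """Raised when the breaker is open and calls are being shed."""

class ResilientCaller:
    """Wrap any callable with retry-plus-backoff and a failure-count breaker."""

    def __init__(self, fn, max_retries=3, base_delay=0.5, failure_threshold=5):
        self.fn = fn
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def call(self, *args, **kwargs):
        if self.consecutive_failures >= self.failure_threshold:
            raise CircuitOpen("too many consecutive failures; shedding load")
        for attempt in range(self.max_retries + 1):
            try:
                result = self.fn(*args, **kwargs)
                self.consecutive_failures = 0  # success closes the breaker
                return result
            except Exception:
                self.consecutive_failures += 1
                if attempt == self.max_retries:
                    raise
                # exponential backoff with full jitter
                time.sleep(random.uniform(0, self.base_delay * (2 ** attempt)))

# Usage: a call that fails twice, then succeeds, is retried transparently.
calls = {"n": 0}
def sometimes_fails():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

caller = ResilientCaller(sometimes_fails, max_retries=3, base_delay=0.0)
print(caller.call())  # → ok
```

A production version would also need a half-open state (periodically probing the upstream instead of staying open forever) and per-endpoint breaker state, which is roughly what ReliAPI and Olla provide out of the box.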
For production use requiring all three features (queueing, rate limiting, retry with exponential backoff), I recommend:
- Primary Choice: LiteLLM for broadest feature support and enterprise readiness
- Backup/Complementary: Consider using ReliAPI patterns or implementing circuit breakers separately if ReliAPI is too early-stage for your use case
For discovering more solutions, search these GitHub topics:
- llm-proxy: https://github.com/topics/llm-proxy
- ai-gateway: https://github.com/topics/ai-gateway
- rate-limiting: https://github.com/topics/rate-limiting
Research conducted on 2026-02-27 using GitHub search for LLM proxy, gateway, rate limiting, and reliability patterns.