@Laurian
Created February 26, 2026 23:01

LLM Proxy Solutions with Queuing and Retry Capabilities

Research findings for open-source and commercial LLM proxy solutions that provide request queuing, rate limiting, and automatic retry mechanisms.


Executive Summary

This document covers the top solutions found through research on GitHub and other sources for LLM proxy/gateway servers that act as middleware between clients and LLM APIs.


Top Recommended Solutions

1. LiteLLM (berriai/litellm) ⭐ Primary Recommendation

GitHub: https://github.com/BerriAI/litellm | Stars: ~89k | Language: Python

Key Features:

  • Rate Limiting & Load Balancing: Built-in load balancing with retry/fallback logic for routing traffic
  • Retry Logic: Exception handling and automatic fallback mechanisms when providers fail
  • Monitoring/Logging: Comprehensive logging, observability callbacks (Lunary, MLflow, Langfuse), virtual keys for access control, spend tracking

Pros:

  • Unified interface supporting 100+ LLM providers (OpenAI format compatible)
  • High performance: 8ms P95 latency at 1k RPS
  • Enterprise adoption by Stripe, Netflix, Google ADK
  • Stable release with 12-hour load testing before publication

Cons:

  • Advanced queuing/backoff algorithms not explicitly documented as a native feature
  • Full proxy features require setup for virtual keys and dashboard UI

Production Readiness: ✅ Yes - Actively maintained, major enterprise users, stable releases
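As an illustration of how LiteLLM's retry/fallback routing is typically wired up, here is a minimal proxy `config.yaml` sketch. The model names, environment-variable keys, and the exact `router_settings` fields are assumptions based on LiteLLM's documented configuration style; verify against the current LiteLLM docs before deploying:

```yaml
model_list:
  - model_name: gpt-4o              # public alias that clients request
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: gpt-4o              # second deployment under the same alias -> load-balanced
    litellm_params:
      model: azure/gpt-4o-deployment
      api_key: os.environ/AZURE_API_KEY

router_settings:
  num_retries: 3                    # retry transient failures before falling back
  fallbacks:
    - gpt-4o: ["claude-fallback"]   # illustrative fallback alias (would need its own model_list entry)
```

Two deployments sharing one `model_name` alias is how LiteLLM expresses load balancing; retries happen within the alias before the fallback list is consulted.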

2. Plano (katanemo/plano) ⭐ Best for Observability

GitHub: https://github.com/katanemo/plano | Stars: ~6.1k | Language: Rust

Key Features:

  • Rate Limiting: Guardrails and filtering capabilities
  • Smart Model Routing: Semantic routing via a 4B-parameter orchestrator model
  • Monitoring/Observability: Zero-code OpenTelemetry tracing, OTEL traces/metrics across every agent, rich agentic signals

Pros:

  • Built on Envoy (production-grade proxy infrastructure)
  • Low-latency orchestration between agents
  • Unified LLM APIs with automatic state management
  • Production-ready agentic application deployment

Cons:

  • Out-of-process dataplane adds infrastructure overhead
  • Production scaling requires local deployment or API keys for hosted models

Production Readiness: ✅ Yes - Actively maintained (614 commits, v0.4.8 Feb 2026), backed by research

3. Olla (thushan/olla) ⭐ Best Performance/Ollama Focus

GitHub: https://github.com/thushan/olla | Stars: ~160 | Language: Go

Key Features:

  • Rate Limiting: Production-ready rate limiting and request size limits
  • Request Queuing: Smart routing with priority-based queues
  • Retry/Failover: Automatic retry on connection failures with transparent endpoint switching
  • Circuit Breakers: Continuous health checks with circuit breakers and automatic recovery

Pros:

  • Dual engine architecture (Sherpa for simplicity, Olla for maximum performance)
  • Extremely lightweight: Runs on <50MB RAM
  • Native support for Ollama, vLLM, LiteLLM, LM Studio
  • Active failover and model discovery

Cons:

  • Still in active development; some features being finalized
  • TLS termination and dynamic configuration API pending completion

Production Readiness: ⚠️ Mostly Yes - Marked production-ready but actively stabilizing
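Olla's priority-based request queuing can be sketched generically as a heap keyed on priority with a FIFO tie-breaker. This is an independent illustration of the pattern, not Olla's actual code (which is Go and adds locking, backpressure, and health-aware dispatch); the class and priority values are invented:

```python
import heapq
import itertools

class PriorityRequestQueue:
    """Minimal priority queue for proxy requests; lower number = higher priority."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # preserves FIFO order within a priority level

    def put(self, request, priority=10):
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def get(self):
        _priority, _seq, request = heapq.heappop(self._heap)
        return request

# Interactive traffic jumps ahead of batch work already in the queue:
q = PriorityRequestQueue()
q.put("batch-embedding-job", priority=20)
q.put("interactive-chat", priority=1)
first = q.get()  # "interactive-chat" is dispatched first
```

A real proxy would drain this queue from worker tasks and shed or time out low-priority entries under load.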

4. ReliAPI (KikuAI-Lab/reliapi) ⭐ Best for Reliability Patterns

GitHub: https://github.com/KikuAI-Lab/reliapi | Stars: ~7 | Language: Python/Redis

Key Features:

  • Retry with Exponential Backoff: Automatic retries built-in for transient failures
  • Circuit Breaker: Explicit circuit breaker to prevent cascading failures
  • Rate Limiting: Built-in token bucket rate limiting per tier
  • Request Coalescing: Idempotency keys deduplicate concurrent identical requests (a partial substitute for queuing)

Pros:

  • Combines multiple reliability patterns in single gateway
  • Redis-based TTL caching for GET requests and LLM responses
  • Budget caps and cost estimation prevent runaway expenses
  • Dedicated SDKs for Python/JavaScript

Cons:

  • Requires Redis as hard dependency
  • Early adoption stage (7 stars, minimal issues visible)
  • Payment processing relies on third-party Paddle

Production Readiness: ⚠️ Early Stage - Good architecture but not enterprise-grade stable yet
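The token-bucket limiting ReliAPI advertises can be illustrated with a short sketch. This is a generic implementation of the pattern, not ReliAPI's code (which keeps state in Redis per tier); the `TokenBucket` class and its parameters are invented for this example:

```python
import time

class TokenBucket:
    """Token-bucket limiter: up to `capacity` tokens, refilled at `rate` tokens/second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity     # start full so bursts up to capacity are allowed
        self.clock = clock         # injectable clock makes the limiter testable
        self.last = clock()

    def allow(self, cost=1):
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# e.g. 5 requests/second sustained, bursts of up to 10:
limiter = TokenBucket(rate=5, capacity=10)
```

A distributed deployment would hold the bucket state in Redis (as ReliAPI does) so all proxy replicas share one budget.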

5. LLM Gateway (theopenco/llmgateway)

GitHub: https://github.com/theopenco/llmgateway | Stars: ~900 | Language: TypeScript

Key Features:

  • Rate Limiting: Advertised via the repository's topic tags, but not documented in detail
  • Monitoring/Logging: Usage analytics (requests, tokens, response times, costs)

Pros:

  • Provider-agnostic via OpenAI-compatible interface
  • Built-in cost and token tracking without external tools
  • Modern stack using Next.js and Hono
  • Dual licensing (AGPLv3 core + Enterprise tier)

Cons:

  • Advanced resilience patterns not documented
  • AGPLv3 license requires releasing source modifications when the software is offered as a network service

Production Readiness: ⚠️ Partially - Functional but limited advanced features in open-source version

6. LM Proxy (Nayjest/lm-proxy)

GitHub: https://github.com/Nayjest/lm-proxy | Stars: ~80 | Language: Python/FastAPI

Key Features:

  • Rate Limiting: API-key validation via user groups; granular rate limiting is not detailed
  • Logging: Configurable logging with database connector, custom log writers

Pros:

  • Provider agnostic (OpenAI, Anthropic, Google, local PyTorch)
  • Extensible library/standalone service design
  • Secure key management separating proxy keys from provider keys
  • Lightweight FastAPI-based

Cons:

  • Manual routing configuration required (no automatic discovery)
  • Limited resilience patterns out-of-the-box
  • May require infrastructure layer implementation for high-scale use

Production Readiness: ⚠️ Basic - Suitable for straightforward proxying; needs additional hardening for enterprise use

7. LLMApiGateway (fabiojbg/LLMApiGateway)

GitHub: https://github.com/fabiojbg/LLMApiGateway | Language: Python

Key Features:

  • Retry Logic: Configurable retry attempts and delays per model
  • Fault Tolerance/Fallbacks: Automatic provider switching on failure, custom fallback sequences

Pros:

  • Flexible multi-provider failover configuration
  • Easy Docker/pip deployment
  • Usage statistics and cost tracking via local UI

Cons:

  • No enterprise-grade rate limiting or distributed tracing
  • Personal project origin (may need hardening for production)

Production Readiness: ⚠️ Limited Production Ready - Good for personal use, needs external tools for high availability
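The per-model retry-then-fallback behavior described above can be sketched as a small helper. This is a generic illustration of the pattern, not LLMApiGateway's code; the function name, provider mapping, and parameters are invented:

```python
import time

def call_with_fallbacks(prompt, providers, attempts=2, delay=0.5):
    """Try each provider in order; retry each up to `attempts` times before falling back.

    `providers` maps a provider name to a callable that returns a completion
    or raises on failure. Returns (provider_name, result).
    """
    last_error = None
    for name, call in providers.items():   # dicts preserve insertion order = fallback order
        for _attempt in range(attempts):
            try:
                return name, call(prompt)
            except Exception as exc:       # a real gateway would retry only transient errors
                last_error = exc
                time.sleep(delay)
    raise RuntimeError("all providers failed") from last_error
```

Configurable `attempts` and `delay` per model, as the project describes, would replace the single shared values here.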


Feature Comparison Matrix

| Project | Rate Limiting | Request Queuing | Retry w/ Backoff | Circuit Breaker | Monitoring | Stars | Language |
|---|---|---|---|---|---|---|---|
| LiteLLM | ✅ | Partial | Basic | Limited | Excellent | ~89k | Python |
| Plano | Guardrails | — | — | — | OTEL | ~6.1k | Rust |
| Olla | Full | ✅ Priority Queues | Auto Failover | Health Checks | Basic | ~160 | Go |
| ReliAPI | Token Bucket | Idempotency Keys | Exponential Backoff | Explicit | Prometheus | ~7 | Python/Redis |
| LLM Gateway | Topic Tagged | — | — | — | Analytics | ~900 | TypeScript |
| LM Proxy | Basic Groups | — | Client-level | — | Configurable Logging | ~80 | Python/FastAPI |

Recommendations Based on Requirements

Best Overall: LiteLLM

  • ✅ Best balance of features and production readiness
  • ✅ Largest community (89k stars)
  • ✅ Enterprise adoption verified
  • ⚠️ May need external tooling for advanced queuing/backoff

Best for Reliability Patterns: ReliAPI

  • ✅ Explicit exponential backoff implementation
  • ✅ Circuit breaker pattern implemented
  • ✅ Token bucket rate limiting
  • ⚠️ Early stage (7 stars) - validate in your environment first

Best Performance/Ollama Focus: Olla

  • ✅ Built specifically for LLM proxy workloads
  • ✅ Lightweight (<50MB RAM)
  • ✅ Native support for Ollama, vLLM, LiteLLM
  • ⚠️ Active development - some features pending completion

Best for Observability: Plano

  • ✅ Zero-code OpenTelemetry integration
  • ✅ Enterprise-grade tracing and metrics
  • ⚠️ Limited documentation on queuing/backoff specifics

Commercial Alternatives Worth Considering

| Solution | Type | Notable Features |
|---|---|---|
| Katanemo Plano (Hosted) | Commercial SaaS | Production scaling, US-central region free tier |
| OpenRouter | Commercial API | Multi-provider access, cost tracking |
| Portkey AI | Commercial Gateway | Rate limiting, analytics, routing rules |

Implementation Considerations

If none of the solutions fully meet your requirements, consider:

  1. Combining Solutions: Use LiteLLM for routing + external rate limiting middleware
  2. Custom Middleware Layer: Build retry/backoff and circuit-breaker logic above any proxy
  3. Infrastructure-Level Solution: Place NGINX/Traefik load balancer in front of LLM gateway
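For option 2 above, the core of such a middleware layer is retry with capped exponential backoff guarded by a circuit breaker. The sketch below is a minimal single-process illustration under invented names and thresholds, not any listed project's implementation:

```python
import random
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a probe after `cooldown` seconds."""

    def __init__(self, threshold=5, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.cooldown  # half-open probe

    def record(self, success):
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()

def call_with_backoff(fn, breaker, retries=3, base=0.5, cap=8.0):
    """Retry `fn` with capped exponential backoff and full jitter, guarded by `breaker`."""
    for attempt in range(retries + 1):
        if not breaker.allow():
            raise RuntimeError("circuit open: upstream LLM marked unhealthy")
        try:
            result = fn()
            breaker.record(success=True)
            return result
        except Exception:
            breaker.record(success=False)
            if attempt == retries:
                raise
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Full jitter (a random sleep up to the exponential cap) avoids synchronized retry storms when many clients hit the same failing provider; a production layer would also distinguish retryable (429/5xx) from non-retryable errors.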

Final Recommendation

For production use requiring all three features (queuing, rate limiting, retry with exponential backoff), I recommend:

  1. Primary Choice: LiteLLM for broadest feature support and enterprise readiness
  2. Backup/Complementary: Consider using ReliAPI patterns or implementing circuit breakers separately if ReliAPI is too early-stage for your use case

GitHub Search Links for Discovery

For discovering more solutions, search GitHub topics such as `llm-proxy`, `llm-gateway`, and `rate-limiting`.


References

Research conducted on 2026-02-27 using GitHub search for LLM proxy, gateway, rate limiting, and reliability patterns.
