Issue #4561 by @yuchen-ecnu proposes adding federation capability to KubeRay so that a single logical RayCluster can span multiple Kubernetes clusters. The core motivation:
- Fragmented GPUs: Organizations procure GPUs across multiple cloud vendors/AZs. Today these are isolated into separate K8s clusters, preventing a unified Ray cluster.
- Operational pain: Users must split datasets, deploy multiple small RayClusters, and manually manage them — causing long-tail performance issues and complexity.
- Virtual Kubelet limitations: The common workaround (aggregating via Virtual Kubelet) creates control-plane scalability bottlenecks, especially at scale (e.g., 10K→400K+ cores in an hour).
Desired end state: Submit one RayJob to one federated RayCluster, and Ray Data/Serve automatically schedules tasks across workers in any AZ/cloud, with preemption resilience and cross-cluster load balancing.
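From the user's side, the desired workflow would look like an ordinary RayJob submission against one pre-created federated cluster. A hypothetical manifest, assuming the standard RayJob `clusterSelector` mechanism for targeting an existing RayCluster (the federated cluster name and entrypoint are illustrative):

```yaml
# Illustrative RayJob: one submission, one logical cluster.
# The referenced "federated-raycluster" is hypothetical — the federation
# capability behind it is exactly what the proposal asks for.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: batch-inference
spec:
  entrypoint: python infer.py
  clusterSelector:
    ray.io/cluster: federated-raycluster
```

Ray Data/Serve inside the job would then see workers in every AZ/cloud as ordinary Ray nodes.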
| Participant | Position |
|---|---|
| @andrewsykim (KubeRay maintainer) | Notes MultiKueue as an alternative but acknowledges it works at CRD-level, not Ray task/actor-level. Sees Ray Data batch inference as the ideal use case since it's GPU-local and horizontally scalable. |
| @siyuanfoundation (contributor) | Raises the cross-AZ communication overhead concern — Ray head must be topology-aware to avoid performance degradation. Suggests scoping to non-reshuffling Ray Data and Ray Serve only. Proposes an alternative: a Ray job delegator/proxy that dispatches jobs to separate RayClusters and aggregates status. Also notes MultiKueue supports pod-level scheduling, so worker pods could be delegated while head stays in one cluster. |
| @Future-Outlier (KubeRay member) | Asks about SkyPilot overlap. |
| @yuchen-ecnu (author) | Clarifies this is about federating resources into a single Ray cluster, not cross-K8s management (which SkyPilot does). Cites Tencent's prior art from Ray Forward 2025. |
| Solution | Abstraction Level | Single Logical Cluster? | Ray-Aware? | Maturity | Best For |
|---|---|---|---|---|---|
| KubeRay Federation (proposed) | Ray task/actor | ✅ Yes — unified RayCluster | ✅ Native | Proposal stage | Ray Data batch inference, Ray Serve |
| NVIDIA Dynamo | Inference framework | ✅ Yes — disaggregated prefill/decode | ❌ No (own framework) | Production (2025+) | LLM inference optimization |
| SkyPilot | Cluster provisioning | ❌ Separate clusters per cloud | ✅ Provisions Ray clusters | Production | Multi-cloud Ray cluster provisioning |
| MultiKueue (Kueue) | K8s job/pod dispatch | ❌ Dispatches whole jobs to clusters | ❌ No | Beta (v0.9+) | Job-level multi-cluster GPU scheduling |
| Karmada | K8s resource federation | ❌ Propagates CRDs/workloads | ❌ No | CNCF Incubating | General K8s multi-cluster federation |
| Volcano Global | Batch scheduling | ❌ Cross-cluster queue & dispatch | ❌ No | Early production | Gang-scheduled training, batch AI |
| Admiralty | Pod scheduling | ❌ Proxy pods in target clusters | ❌ No | Niche/Stable | Multi-cluster pod placement |
| Liqo | Network/resource mesh | ❌ K8s-level virtual nodes | ❌ No | CNCF Sandbox | Hybrid/edge cloud bursting |
Dynamo solves a related but different problem: optimizing multi-node LLM inference (disaggregated prefill/decode, KV cache routing, MoE rebalancing). It operates within a GPU cluster, not across K8s clusters. However, its Grove API for Kubernetes-native orchestration shows the industry trajectory toward framework-aware, topology-aware scheduling — which is exactly what KubeRay Federation would need to handle cross-AZ latency.
Key distinction: Dynamo is inference-engine-level optimization; KubeRay Federation is cluster-topology-level resource pooling. They're complementary, not competing.
SkyPilot provisions and manages separate Ray clusters across clouds (Shopify's multi-cloud GPU fleet is a notable example). It doesn't create a single unified Ray cluster — each provisioned cluster is independent. The proposal explicitly distinguishes itself from SkyPilot: KubeRay Federation wants one RayCluster with workers distributed across K8s clusters, enabling intra-cluster load balancing via Ray's scheduler.
Key distinction: SkyPilot = multi-cloud cluster provisioning; KubeRay Federation = single-cluster resource unification.
MultiKueue is the most architecturally adjacent K8s-native solution: it dispatches entire workloads to whichever worker cluster has capacity. As @siyuanfoundation noted, it could be adapted: deploy the RayCluster head in one cluster, and use MultiKueue to dispatch worker pods to remote clusters (since MultiKueue supports pod-level scheduling). This is a pragmatic middle ground that avoids deep Ray-level changes.
Key distinction: MultiKueue dispatches at the K8s object boundary; KubeRay Federation wants Ray's scheduler to balance tasks across all workers regardless of their physical cluster.
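A sketch of the middle ground @siyuanfoundation describes, assuming Kueue's plain-pod integration is enabled in the worker clusters (the queue name, image tags, and replica counts are illustrative, not a tested configuration):

```yaml
# Illustrative RayCluster: the head stays in the manager cluster, while
# worker pods carry a Kueue queue label so MultiKueue can dispatch them
# to whichever remote cluster has GPU capacity.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: federated-raycluster
spec:
  headGroupSpec:
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0
  workerGroupSpecs:
  - groupName: gpu-workers
    replicas: 8
    template:
      metadata:
        labels:
          kueue.x-k8s.io/queue-name: multikueue-gpu-queue  # illustrative queue name
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.9.0-gpu
          resources:
            limits:
              nvidia.com/gpu: 1
```

The open question this sketch leaves unanswered is networking: dispatched workers still need a route back to the head's GCS.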
Karmada federates K8s resources across clusters via PropagationPolicies. Volcano Global adds AI-specific scheduling (gang scheduling, queue fairness) atop Karmada. Together they could propagate RayCluster worker groups to different clusters. However, like MultiKueue, this operates at the K8s level — Ray wouldn't "know" about the topology, so cross-AZ data shuffling could silently degrade performance.
Key distinction: General-purpose K8s federation vs. Ray-topology-aware federation.
- Cross-AZ network latency: As @siyuanfoundation flagged, Ray's head node must be topology-aware. Without this, object transfers and task scheduling will blindly route across AZs, potentially destroying performance for anything involving data movement. The proposal should explicitly scope to workloads with minimal cross-node communication (batch inference with local GPU execution, stateless serving).
- Ray GCS and head node as single point of failure: The head node lives in one cluster. Cross-cluster network partitions could orphan all remote workers simultaneously.
- Networking: Worker pods in remote clusters must reach the head node's GCS and Ray object store. This requires cross-cluster networking (service mesh, VPN, or public endpoints), which adds latency and security surface area.
- Autoscaler integration: KubeRay's autoscaler currently talks to a single K8s API server. Federation requires it to create/delete worker pods across multiple clusters, meaning multi-cluster API credentials and reconciliation.
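As a toy illustration of the locality policy the latency risk argues for (this is not Ray's actual scheduler, and the worker/zone names are hypothetical), a topology-aware placement rule simply prefers a worker in the same zone as a task's input data:

```python
# Toy sketch of a topology-aware placement rule: prefer a worker
# co-located with the task's data, fall back to any free worker.
# Ray's real scheduler is far richer; this only shows the policy shape.
from typing import Optional


def place_task(data_zone: str, free_workers: dict[str, str]) -> Optional[str]:
    """free_workers maps worker id -> zone; returns the chosen worker id."""
    # First pass: a worker in the data's zone (no cross-AZ transfer).
    for worker, zone in free_workers.items():
        if zone == data_zone:
            return worker
    # Fallback: any free worker, accepting a cross-AZ object transfer.
    return next(iter(free_workers), None)
```

A federation-aware head would need exactly this kind of zone signal attached to every worker, which today's RayCluster topology does not surface across K8s clusters.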
The proposal addresses a real and growing pain point — GPU resource fragmentation across clusters is one of the top operational challenges for large-scale AI teams. However:
- Start narrow: The community feedback correctly suggests scoping to Ray Data (no shuffle) and Ray Serve first. These workloads are embarrassingly parallel / stateless and tolerate cross-AZ latency.
- Consider the proxy/delegator alternative: @siyuanfoundation's suggestion of a job delegator that dispatches to separate RayClusters and aggregates results may deliver 80% of the value with 20% of the complexity — no Ray-level changes needed, just a KubeRay-level orchestration layer.
- Leverage MultiKueue for worker pod placement: Rather than building full federation from scratch, using MultiKueue to place worker pods in remote clusters (while the head stays in one cluster) could be a pragmatic first step that's composable with the existing ecosystem.
- Study Dynamo as a design reference: NVIDIA Dynamo is not a competitor here, but its topology-aware routing patterns (especially the Grove API) are worth studying as a reference for how to make the Ray head aware of worker locality.
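To make the delegator alternative concrete, here is a minimal pure-Python sketch of its two core responsibilities — choosing a member RayCluster for each job and aggregating per-cluster statuses. The cluster names, status strings, and capacity-reporting mechanism are all hypothetical; a real implementation would sit in a KubeRay-level controller:

```python
# Toy delegator: pick the member RayCluster with the most free GPUs,
# record where each job went, and roll up per-cluster statuses.
# Cluster names and status values are hypothetical placeholders.
from dataclasses import dataclass, field


@dataclass
class Delegator:
    free_gpus: dict[str, int]                    # cluster name -> free GPU count
    placements: dict[str, str] = field(default_factory=dict)

    def dispatch(self, job_id: str, gpus_needed: int) -> str:
        # Choose the cluster with the most headroom that still fits the job.
        candidates = {c: n for c, n in self.free_gpus.items() if n >= gpus_needed}
        if not candidates:
            raise RuntimeError(f"no cluster can fit {gpus_needed} GPUs")
        cluster = max(candidates, key=candidates.get)
        self.free_gpus[cluster] -= gpus_needed
        self.placements[job_id] = cluster
        return cluster

    def aggregate_status(self, statuses: dict[str, str]) -> str:
        # The federated job fails if any member fails, and succeeds
        # only once every member cluster reports success.
        if any(s == "FAILED" for s in statuses.values()):
            return "FAILED"
        if all(s == "SUCCEEDED" for s in statuses.values()):
            return "SUCCEEDED"
        return "RUNNING"
```

Even this toy version shows why the delegator is cheap: it never touches Ray's scheduler, only K8s-level placement and status plumbing.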
The proposal fills a unique gap — none of the existing solutions provide a single logical Ray cluster spanning K8s boundaries. Whether that's built as deep Ray+KubeRay integration or as a lighter-weight orchestration layer above multiple RayClusters is the key architectural decision the community needs to make.