Issue #4561 by @yuchen-ecnu proposes adding federation capability to KubeRay so that a single logical RayCluster can span multiple Kubernetes clusters. The core motivation:
- Fragmented GPUs: Organizations procure GPUs across multiple cloud vendors/AZs. Today these are isolated into separate K8s clusters, preventing a unified Ray cluster.
- Operational pain: Users must split datasets, deploy multiple small RayClusters, and manually manage them — causing long-tail performance issues and complexity.
- Virtual Kubelet limitations: The common workaround (aggregating via Virtual Kubelet) creates control-plane scalability bottlenecks, especially at scale (e.g., 10K→400K+ cores in an hour).
Desired end state: Submit one RayJob to one federated RayCluster, and Ray Data/Serve automatically schedules tasks across workers in any AZ/cloud, with preemption resilience and cross-cluster load balancing.
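From the user's side, the desired workflow would look like an ordinary RayJob submission against one pre-created federated cluster. A hypothetical manifest, assuming the standard RayJob `clusterSelector` mechanism for targeting an existing RayCluster (the federated cluster name and entrypoint are illustrative):

```yaml
# Illustrative RayJob: one submission, one logical cluster.
# The referenced "federated-raycluster" is hypothetical — the federation
# capability behind it is exactly what the proposal asks for.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: batch-inference
spec:
  entrypoint: python infer.py
  clusterSelector:
    ray.io/cluster: federated-raycluster
```

Ray Data/Serve inside the job would then see workers in every AZ/cloud as ordinary Ray nodes.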
| Participant | Position |
|---|---|
| @andrewsykim (KubeRay maintainer) | Notes MultiKueue as an alternative but acknowledges it works at CRD-level, not Ray task/actor-level. Sees Ray Data batch inference as the ideal use case since it's GPU-local and horizontally scalable. |
| @siyuanfoundation (contributor) | Raises the cross-AZ communication overhead concern — Ray head must be topology-aware to avoid performance degradation. Suggests scoping to non-reshuffling Ray Data and Ray Serve only. Proposes an alternative: a Ray job delegator/proxy that dispatches jobs to separate RayClusters and aggregates status. Also notes MultiKueue supports pod-level scheduling, so worker pods could be delegated while head stays in one cluster. |
| @Future-Outlier (KubeRay member) | Asks about SkyPilot overlap. |
| @yuchen-ecnu (author) | Clarifies this is about federating resources into a single Ray cluster, not cross-K8s management (which SkyPilot does). Cites Tencent's prior art from Ray Forward 2025. |
| Solution | Abstraction Level | Single Logical Cluster? | Ray-Aware? | Maturity | Best For |
|---|---|---|---|---|---|
| KubeRay Federation (proposed) | Ray task/actor | ✅ Yes — unified RayCluster | ✅ Native | Proposal stage | Ray Data batch inference, Ray Serve |
| NVIDIA Dynamo | Inference framework | ✅ Yes — disaggregated prefill/decode | ❌ No (own framework) | Production (2025+) | LLM inference optimization |
| SkyPilot | Cluster provisioning | ❌ Separate clusters per cloud | ✅ Provisions Ray clusters | Production | Multi-cloud Ray cluster provisioning |
| MultiKueue (Kueue) | K8s job/pod dispatch | ❌ Dispatches whole jobs to clusters | ❌ No | Beta (v0.9+) | Job-level multi-cluster GPU scheduling |
| Karmada | K8s resource federation | ❌ Propagates CRDs/workloads | ❌ No | CNCF Incubating | General K8s multi-cluster federation |
| Volcano Global | Batch scheduling | ❌ Cross-cluster queue & dispatch | ❌ No | Early production | Gang-scheduled training, batch AI |
| Admiralty | Pod scheduling | ❌ Proxy pods in target clusters | ❌ No | Niche/Stable | Multi-cluster pod placement |
| Liqo | Network/resource mesh | ❌ K8s-level virtual nodes | ❌ No | CNCF Sandbox | Hybrid/edge cloud bursting |
Dynamo solves a related but different problem: optimizing multi-node LLM inference (disaggregated prefill/decode, KV cache routing, MoE rebalancing). It operates within a GPU cluster, not across K8s clusters. However, its Grove API for Kubernetes-native orchestration shows the industry trajectory toward framework-aware, topology-aware scheduling — which is exactly what KubeRay Federation would need to handle cross-AZ latency.
Key distinction: Dynamo is inference-engine-level optimization; KubeRay Federation is cluster-topology-level resource pooling. They're complementary, not competing.
SkyPilot provisions and manages separate Ray clusters across clouds (Shopify's multi-cloud GPU fleet is a notable example). It doesn't create a single unified Ray cluster — each provisioned cluster is independent. The proposal explicitly distinguishes itself from SkyPilot: KubeRay Federation wants one RayCluster with workers distributed across K8s clusters, enabling intra-cluster load balancing via Ray's scheduler.
Key distinction: SkyPilot = multi-cloud cluster provisioning; KubeRay Federation = single-cluster resource unification.
MultiKueue is the most architecturally adjacent K8s-native solution: it dispatches entire workloads to whichever worker cluster has capacity. As @siyuanfoundation noted, it could be adapted: deploy the RayCluster head in one cluster, and use MultiKueue to dispatch worker pods to remote clusters (since MultiKueue supports pod-level scheduling). This is a pragmatic middle ground that avoids deep Ray-level changes.
Key distinction: MultiKueue dispatches at the K8s object boundary; KubeRay Federation wants Ray's scheduler to balance tasks across all workers regardless of their physical cluster.
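A sketch of the middle ground @siyuanfoundation describes, assuming Kueue's plain-pod integration is enabled in the worker clusters (the queue name, image tags, and replica counts are illustrative, not a tested configuration):

```yaml
# Illustrative RayCluster: the head stays in the manager cluster, while
# worker pods carry a Kueue queue label so MultiKueue can dispatch them
# to whichever remote cluster has GPU capacity.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: federated-raycluster
spec:
  headGroupSpec:
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0
  workerGroupSpecs:
  - groupName: gpu-workers
    replicas: 8
    template:
      metadata:
        labels:
          kueue.x-k8s.io/queue-name: multikueue-gpu-queue  # illustrative queue name
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.9.0-gpu
          resources:
            limits:
              nvidia.com/gpu: 1
```

The open question this sketch leaves unanswered is networking: dispatched workers still need a route back to the head's GCS.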
Karmada federates K8s resources across clusters via PropagationPolicies. Volcano Global adds AI-specific scheduling (gang scheduling, queue fairness) atop Karmada. Together they could propagate RayCluster worker groups to different clusters. However, like MultiKueue, this operates at the K8s level — Ray wouldn't "know" about the topology, so cross-AZ data shuffling could silently degrade performance.
Key distinction: General-purpose K8s federation vs. Ray-topology-aware federation.
- Cross-AZ network latency: As @siyuanfoundation flagged, Ray's head node must be topology-aware. Without this, object transfers and task scheduling will blindly route across AZs, potentially destroying performance for anything involving data movement. The proposal should explicitly scope to workloads with minimal cross-node communication (batch inference with local GPU execution, stateless serving).
- Ray GCS and head node as single point of failure: The head node lives in one cluster. Cross-cluster network partitions could orphan all remote workers simultaneously.
- Networking: Worker pods in remote clusters must reach the head node's GCS and Ray object store. This requires cross-cluster networking (service mesh, VPN, or public endpoints), which adds latency and security surface area.
- Autoscaler integration: KubeRay's autoscaler currently talks to a single K8s API server. Federation requires it to create/delete worker pods across multiple clusters, meaning multi-cluster API credentials and reconciliation.
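As a toy illustration of the locality policy the latency risk argues for (this is not Ray's actual scheduler, and the worker/zone names are hypothetical), a topology-aware placement rule simply prefers a worker in the same zone as a task's input data:

```python
# Toy sketch of a topology-aware placement rule: prefer a worker
# co-located with the task's data, fall back to any free worker.
# Ray's real scheduler is far richer; this only shows the policy shape.
from typing import Optional


def place_task(data_zone: str, free_workers: dict[str, str]) -> Optional[str]:
    """free_workers maps worker id -> zone; returns the chosen worker id."""
    # First pass: a worker in the data's zone (no cross-AZ transfer).
    for worker, zone in free_workers.items():
        if zone == data_zone:
            return worker
    # Fallback: any free worker, accepting a cross-AZ object transfer.
    return next(iter(free_workers), None)
```

A federation-aware head would need exactly this kind of zone signal attached to every worker, which today's RayCluster topology does not surface across K8s clusters.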
The proposal addresses a real and growing pain point — GPU resource fragmentation across clusters is one of the top operational challenges for large-scale AI teams. However:
- Start narrow: The community feedback correctly suggests scoping to Ray Data (no shuffle) and Ray Serve first. These workloads are embarrassingly parallel / stateless and tolerate cross-AZ latency.
- Consider the proxy/delegator alternative: @siyuanfoundation's suggestion of a job delegator that dispatches to separate RayClusters and aggregates results may deliver 80% of the value with 20% of the complexity — no Ray-level changes needed, just a KubeRay-level orchestration layer.
- Leverage MultiKueue for worker pod placement: Rather than building full federation from scratch, using MultiKueue to place worker pods in remote clusters (while the head stays in one cluster) could be a pragmatic first step that's composable with the existing ecosystem.
- Study Dynamo as a design reference: NVIDIA Dynamo is not a competitor here, but its topology-aware routing patterns (especially the Grove API) are worth studying as a reference for how to make the Ray head aware of worker locality.
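To make the delegator alternative concrete, here is a minimal pure-Python sketch of its two core responsibilities — choosing a member RayCluster for each job and aggregating per-cluster statuses. The cluster names, status strings, and capacity-reporting mechanism are all hypothetical; a real implementation would sit in a KubeRay-level controller:

```python
# Toy delegator: pick the member RayCluster with the most free GPUs,
# record where each job went, and roll up per-cluster statuses.
# Cluster names and status values are hypothetical placeholders.
from dataclasses import dataclass, field


@dataclass
class Delegator:
    free_gpus: dict[str, int]                    # cluster name -> free GPU count
    placements: dict[str, str] = field(default_factory=dict)

    def dispatch(self, job_id: str, gpus_needed: int) -> str:
        # Choose the cluster with the most headroom that still fits the job.
        candidates = {c: n for c, n in self.free_gpus.items() if n >= gpus_needed}
        if not candidates:
            raise RuntimeError(f"no cluster can fit {gpus_needed} GPUs")
        cluster = max(candidates, key=candidates.get)
        self.free_gpus[cluster] -= gpus_needed
        self.placements[job_id] = cluster
        return cluster

    def aggregate_status(self, statuses: dict[str, str]) -> str:
        # The federated job fails if any member fails, and succeeds
        # only once every member cluster reports success.
        if any(s == "FAILED" for s in statuses.values()):
            return "FAILED"
        if all(s == "SUCCEEDED" for s in statuses.values()):
            return "SUCCEEDED"
        return "RUNNING"
```

Even this toy version shows why the delegator is cheap: it never touches Ray's scheduler, only K8s-level placement and status plumbing.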
The proposal fills a unique gap — none of the existing solutions provide a single logical Ray cluster spanning K8s boundaries. Whether that's built as deep Ray+KubeRay integration or as a lighter-weight orchestration layer above multiple RayClusters is the key architectural decision the community needs to make.