@jackfrancis
Created March 5, 2026 19:48
Ray Azure Provider plan

Agent Workflow: Reinforce azure support

Auto-generated by providerize on 2026-02-24 22:56 UTC.
Repository: https://github.com/ray-project/ray

Context

This repository has cloud provider integrations for: aws, gcp, azure.

Current ranking (best → worst):

  1. aws — 2890 detections across 9 categories, 630 files
  2. gcp — 1719 detections across 9 categories, 307 files
  3. azure — 458 detections across 9 categories, 60 files ← target

7 gaps identified for azure.

Identified Gaps

Under-Implemented Categories

azure has implementations in these categories, but they are significantly sparser than aws's:

Task 1: Expand container coverage

Provider 'azure' has 35 container detections vs. 80 for 'aws'.

Reference examples (aws) — areas not yet covered by azure:

  • python/ray/autoscaler/launch_and_verify_cluster.py line 200: bucket_name = "aws-cluster-launcher-test" (pattern: EKS integration)
  • python/ray/autoscaler/azure/tests/azure_compute.yaml line 2: # The test script runs on AWS while the actual cluster is created in Azure. (pattern: EKS integration)
  • python/ray/autoscaler/aws/example-cloudwatch.yaml line 14: # We depend on AWS Systems Manager (SSM) to deploy CloudWatch configuration updates to your cluster, (pattern: EKS integration)
  • ci/ray_ci/test_windows_container.py line 28: image = "029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:hi" (pattern: ECR registry)
  • ci/ray_ci/test_windows_container.py line 67: "029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:test", (pattern: ECR registry)

Action items:

  1. Review the full set of aws container integrations (80 total).
  2. Identify which specific features/paths are missing for azure (current count: 35).
  3. Implement the missing pieces, matching existing code style and conventions.
  4. Add tests for each new integration point.

Task 2: Expand identity coverage

Provider 'azure' has 14 identity detections vs. 50 for 'aws'.

Reference examples (aws) — areas not yet covered by azure:

  • docker/fix-docker-latest.sh line 28: AWS_ACCESS_KEY_ID=$(echo "$ASSUME_ROLE_CREDENTIALS" | jq -r .Credentials.AccessKeyId) (pattern: AWS credentials)
  • docker/fix-docker-latest.sh line 29: AWS_SECRET_ACCESS_KEY=$(echo "$ASSUME_ROLE_CREDENTIALS" | jq -r .Credentials.SecretAccessKey) (pattern: AWS credentials)
  • docker/fix-docker-latest.sh line 30: AWS_SESSION_TOKEN=$(echo "$ASSUME_ROLE_CREDENTIALS" | jq -r .Credentials.SessionToken) (pattern: AWS credentials)
  • docker/fix-docker-latest.sh line 33: AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY AWS_SESSION_TOK... (pattern: AWS credentials)
  • python/ray/_common/test_utils.py line 191: os.environ["AWS_ACCESS_KEY_ID"] = "testing" (pattern: AWS credentials)

Action items:

  1. Review the full set of aws identity integrations (50 total).
  2. Identify which specific features/paths are missing for azure (current count: 14).
  3. Implement the missing pieces, matching existing code style and conventions.
  4. Add tests for each new integration point.

Task 3: Expand storage coverage

Provider 'azure' has 80 storage detections vs. 583 for 'aws'.

Reference examples (aws) — areas not yet covered by azure:

  • python/ray/llm/_internal/common/callbacks/cloud_downloader.py line 50: ("s3://bucket/path/to/file.txt", "/local/path/to/file.txt"), (pattern: S3 storage)
  • python/ray/llm/_internal/common/utils/cloud_filesystem/s3_filesystem.py line 35: uri: S3 URI (e.g., s3://bucket/path/to/object or s3://anonymous@bucket/path/to/object) (pattern: S3 storage)
  • python/ray/llm/_internal/common/utils/cloud_filesystem/s3_filesystem.py line 45: if uri.startswith("s3://anonymous@"): (pattern: S3 storage)
  • python/ray/llm/_internal/common/utils/cloud_filesystem/s3_filesystem.py line 47: uri = uri.replace("s3://anonymous@", "s3://", 1) (pattern: S3 storage)
  • python/ray/llm/_internal/common/utils/cloud_filesystem/pyarrow_filesystem.py line 35: Example: s3://anonymous@bucket/path (pattern: S3 storage)

Action items:

  1. Review the full set of aws storage integrations (583 total).
  2. Identify which specific features/paths are missing for azure (current count: 80).
  3. Implement the missing pieces, matching existing code style and conventions.
  4. Add tests for each new integration point.

Task 4: Expand networking coverage

Provider 'azure' has 26 networking detections vs. 61 for 'aws'.

Reference examples (aws) — areas not yet covered by azure:

  • python/ray/tests/aws/test_autoscaler_aws.py line 281: head_node_config["SecurityGroupIds"] = ["sg-1234abcd"] (pattern: AWS VPC)
  • python/ray/tests/aws/test_autoscaler_aws.py line 282: worker_node_config["SecurityGroupIds"] = ["sg-1234abcd"] (pattern: AWS VPC)
  • python/ray/tests/aws/test_autoscaler_aws.py line 325: head_node_config["SecurityGroupIds"] = ["sg-1234abcd"] (pattern: AWS VPC)
  • python/ray/tests/aws/test_autoscaler_aws.py line 326: worker_node_config["SecurityGroupIds"] = ["sg-1234abcd"] (pattern: AWS VPC)
  • python/ray/tests/aws/test_autoscaler_aws.py line 348: field is of form SubnetIds: [subnet-xxxxx]. (pattern: AWS VPC)

Action items:

  1. Review the full set of aws networking integrations (61 total).
  2. Identify which specific features/paths are missing for azure (current count: 26).
  3. Implement the missing pieces, matching existing code style and conventions.
  4. Add tests for each new integration point.

Task 5: Expand compute coverage

Provider 'azure' has 48 compute detections vs. 1073 for 'aws'.

Reference examples (aws) — areas not yet covered by azure:

  • python/ray/_common/tests/test_usage_stats.py line 965: InstanceType: m5.large (pattern: EC2 instance type)
  • python/ray/_common/tests/test_usage_stats.py line 970: InstanceType: m3.large (pattern: EC2 instance type)
  • python/ray/_common/tests/test_usage_stats.py line 988: assert cluster_config_to_report.head_node_instance_type == "m5.large" (pattern: EC2 instance type)
  • python/ray/_common/tests/test_usage_stats.py line 990: "m3.large", (pattern: EC2 instance type)
  • python/ray/_common/tests/test_usage_stats.py line 1007: InstanceType: m5.large (pattern: EC2 instance type)

Action items:

  1. Review the full set of aws compute integrations (1073 total).
  2. Identify which specific features/paths are missing for azure (current count: 48).
  3. Implement the missing pieces, matching existing code style and conventions.
  4. Add tests for each new integration point.

Task 6: Expand driver coverage

Provider 'azure' has 9 driver detections vs. 53 for 'aws'.

Reference examples (aws) — areas not yet covered by azure:

  • python/ray/util/accelerators/accelerators.py line 30: AWS_NEURON_CORE = "aws-neuron-core" (pattern: AWS Inferentia)
  • python/ray/tests/test_autoscaler_yaml.py line 153: "InstanceType": "inf2.xlarge", (pattern: AWS Inferentia)
  • python/ray/tests/test_autoscaler_yaml.py line 178: "accelerator_type:aws-neuron-core": 1, (pattern: AWS Inferentia)
  • python/ray/tests/test_autoscaler_yaml.py line 203: "InstanceType": "inf2.xlarge", (pattern: AWS Inferentia)
  • python/ray/tests/test_autoscaler_yaml.py line 207: "Accelerators": [{"Name": "Inferentia", "Count": 1}] (pattern: AWS Inferentia)

Action items:

  1. Review the full set of aws driver integrations (53 total).
  2. Identify which specific features/paths are missing for azure (current count: 9).
  3. Implement the missing pieces, matching existing code style and conventions.
  4. Add tests for each new integration point.

Task 7: Expand api coverage

Provider 'azure' has 5 api detections vs. 806 for 'aws'.

Reference examples (aws) — areas not yet covered by azure:

  • python/ray/llm/_internal/common/callbacks/cloud_downloader.py line 50: ("s3://bucket/path/to/file.txt", "/local/path/to/file.txt"), (pattern: S3 API)
  • python/ray/llm/_internal/common/utils/cloud_filesystem/s3_filesystem.py line 35: uri: S3 URI (e.g., s3://bucket/path/to/object or s3://anonymous@bucket/path/to/object) (pattern: S3 API)
  • python/ray/llm/_internal/common/utils/cloud_filesystem/s3_filesystem.py line 45: if uri.startswith("s3://anonymous@"): (pattern: S3 API)
  • python/ray/llm/_internal/common/utils/cloud_filesystem/s3_filesystem.py line 47: uri = uri.replace("s3://anonymous@", "s3://", 1) (pattern: S3 API)
  • python/ray/llm/_internal/common/utils/cloud_filesystem/pyarrow_filesystem.py line 35: Example: s3://anonymous@bucket/path (pattern: S3 API)

Action items:

  1. Review the full set of aws api integrations (806 total).
  2. Identify which specific features/paths are missing for azure (current count: 5).
  3. Implement the missing pieces, matching existing code style and conventions.
  4. Add tests for each new integration point.

LLM Code Analysis — Identified Gaps

The following gaps were identified by LLM analysis of the actual source code, providing deeper functional insight than pattern matching alone:

Task 8: 🔴 container (high)

get_docker_image() in ci/ray_ci/container.py only constructs AWS ECR image URIs via _DOCKER_ECR_REPO. Despite _DOCKER_AZURE_REGISTRY being defined, there is no equivalent function to produce Azure Container Registry image references for CI workflows. All CI containers resolve to '029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp'.

Reference files (aws):

  • ci/ray_ci/container.py

Existing target files (azure):

  • ci/ray_ci/container.py

Recommendation: Add a get_docker_image_azure() function or parameterize get_docker_image() to accept a registry argument, using _DOCKER_AZURE_REGISTRY ('rayreleasetest.azurecr.io') to construct Azure CR image URIs alongside ECR URIs.
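
A minimal sketch of the parameterized approach. The constant names mirror those described above from ci/ray_ci/container.py, but the function signature and the "rayproject/citemp" namespace under the Azure registry are assumptions, not the repository's actual API:

```python
# Hypothetical sketch: parameterize CI image construction by registry.
# Constant values mirror the report above; the real ones live in
# ci/ray_ci/container.py and may differ.
_DOCKER_ECR_REPO = "029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp"
_DOCKER_AZURE_REGISTRY = "rayreleasetest.azurecr.io"


def get_docker_image(tag: str, registry: str = "ecr") -> str:
    """Build a CI image URI for the requested registry."""
    if registry == "ecr":
        return f"{_DOCKER_ECR_REPO}:{tag}"
    if registry == "azure":
        # Azure CR repositories are namespaced under the registry host.
        return f"{_DOCKER_AZURE_REGISTRY}/rayproject/citemp:{tag}"
    raise ValueError(f"unknown registry: {registry}")
```

Keeping one function with a registry argument (rather than a parallel get_docker_image_azure()) avoids drift between the two code paths as tags and namespaces evolve.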

Task 9: 🔴 container (high)

The CI init script .buildkite/release/custom-image-build-and-test-init.sh authenticates to AWS ECR ('aws ecr get-login-password') and GCP ('gcloud_docker_login.sh') but has no Azure Container Registry authentication step (e.g., 'az acr login'). Azure images cannot be pushed or pulled in CI without this.

Reference files (aws):

  • .buildkite/release/custom-image-build-and-test-init.sh

Existing target files (azure):

  • .buildkite/release/custom-image-build-and-test-init.sh

Recommendation: Add an 'az acr login --name rayreleasetest' step (or token-based auth) after the AWS ECR and GCP auth steps in custom-image-build-and-test-init.sh.

Task 10: 🔴 container (high)

All Buildkite pipeline configs (.buildkite/macos/config.yml, .buildkite/bisect/config.yml, .buildkite/release/config.yml, .buildkite/release-automation/config.yml, .buildkite/cicd-cron/config.yaml) hardcode ci_work_repo to AWS ECR ('029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp'). No Azure Container Registry work repo is configured for any CI pipeline.

Reference files (aws):

  • .buildkite/macos/config.yml
  • .buildkite/bisect/config.yml
  • .buildkite/release/config.yml
  • .buildkite/release-automation/config.yml
  • .buildkite/cicd-cron/config.yaml

Recommendation: Add an azure_work_repo field (e.g., 'rayreleasetest.azurecr.io/rayproject/citemp') to each Buildkite config, or make ci_work_repo configurable per-cloud so CI can run against Azure CR.

Task 11: 🟡 container (medium)

CI artifact storage (artifacts_bucket and ci_temp) in all Buildkite configs uses AWS S3 exclusively (e.g., 's3://ray-ci-artifact-branch-public/ci-temp/'). No Azure Blob Storage equivalent exists for storing CI artifacts from container builds.

Reference files (aws):

  • .buildkite/macos/config.yml
  • .buildkite/release/config.yml
  • .buildkite/release-automation/config.yml
  • .buildkite/bisect/config.yml
  • .buildkite/cicd-cron/config.yaml

Recommendation: Define Azure Blob Storage equivalents for artifacts_bucket and ci_temp (e.g., 'https://.blob.core.windows.net/ray-ci-artifacts/') in Buildkite configs or a shared config file.

Task 12: 🟡 container (medium)

The state_machine config in both ci/ray_ci/oss_config.yaml and release/ray_release/configs/oss_config.yaml only defines aws_bucket for PR and branch CI results ('ray-ci-pr-results', 'ray-ci-results'). There is no Azure storage bucket/container equivalent for storing CI state machine results.

Reference files (aws):

  • ci/ray_ci/oss_config.yaml
  • release/ray_release/configs/oss_config.yaml

Existing target files (azure):

  • ci/ray_ci/oss_config.yaml
  • release/ray_release/configs/oss_config.yaml

Recommendation: Add azure_container (or azure_bucket) fields under state_machine.pr and state_machine.branch in both oss_config.yaml files, pointing to Azure Blob Storage containers for CI results.

Task 13: 🟡 container (medium)

python/ray/autoscaler/launch_and_verify_cluster.py defines download_ssh_key_aws() (from S3) and download_ssh_key_gcp() (from GCS) but has no download_ssh_key_azure() function. Azure cluster launch tests cannot retrieve SSH keys from Azure storage.

Reference files (aws):

  • python/ray/autoscaler/launch_and_verify_cluster.py

Existing target files (azure):

  • python/ray/autoscaler/launch_and_verify_cluster.py

Recommendation: Implement download_ssh_key_azure() using azure-storage-blob SDK to download SSH keys from an Azure Blob Storage container, mirroring the S3 and GCS download patterns.

Task 14: 🟢 container (low)

release/ray_release/tests/test_global_config.py validates byod_ecr (AWS ECR '029272617770.dkr.ecr.us-west-2.amazonaws.com') and gcp_cr in the test config, but the _TEST_CONFIG does not include azure_cr and no assertion validates Azure CR configuration, despite azure_cr being present in the production oss_config.yaml.

Reference files (aws):

  • release/ray_release/tests/test_global_config.py

Existing target files (azure):

  • release/ray_release/tests/test_global_config.py

Recommendation: Add 'azure_cr: rayreleasetest.azurecr.io' to _TEST_CONFIG in test_global_config.py and add an assertion: assert config['release_byod_azure_cr'] == 'rayreleasetest.azurecr.io'.

Task 15: 🟢 container (low)

Container tests (ci/ray_ci/test_windows_container.py, test_windows_tester_container.py, test_linux_container.py) all hardcode AWS ECR image URIs ('029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp'). No test validates container operations against Azure Container Registry.

Reference files (aws):

  • ci/ray_ci/test_windows_container.py
  • ci/ray_ci/test_windows_tester_container.py
  • ci/ray_ci/test_linux_container.py

Recommendation: Add parameterized test cases or separate test functions that verify container image construction and run commands work with Azure CR URIs ('rayreleasetest.azurecr.io/rayproject/citemp').

Task 16: 🔴 identity (high)

Azure credentials are not propagated to Ray actor/task runtime environments. In class_cache.py, ENV_VARS_TO_PROPAGATE explicitly lists AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SECURITY_TOKEN, and AWS_SESSION_TOKEN for propagation into remote actor env_vars. No Azure equivalents (AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID, AZURE_STORAGE_SAS_TOKEN) are included, so Azure-authenticated workloads lose credentials when dispatched to actors.

Reference files (aws):

  • python/ray/tune/execution/class_cache.py

Existing target files (azure):

  • python/ray/tune/execution/class_cache.py

Recommendation: Add AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID, and AZURE_STORAGE_SAS_TOKEN to the ENV_VARS_TO_PROPAGATE set in class_cache.py so Azure credentials propagate to trainable actors the same way AWS credentials do.
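
A sketch of the proposed change. The AWS entries are those listed in the description above; the Azure additions are the proposal, and the propagated_env() helper is illustrative rather than the actual shape of class_cache.py:

```python
# Existing AWS credential variables propagated to remote actors (per the
# description of ENV_VARS_TO_PROPAGATE above).
AWS_ENV_VARS = {
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SECURITY_TOKEN",
    "AWS_SESSION_TOKEN",
}

# Proposed Azure additions so Azure-authenticated workloads keep their
# credentials when dispatched to actors.
AZURE_ENV_VARS = {
    "AZURE_CLIENT_ID",
    "AZURE_CLIENT_SECRET",
    "AZURE_TENANT_ID",
    "AZURE_SUBSCRIPTION_ID",
    "AZURE_STORAGE_SAS_TOKEN",
}

ENV_VARS_TO_PROPAGATE = AWS_ENV_VARS | AZURE_ENV_VARS


def propagated_env(environ: dict) -> dict:
    """Select only the credential variables that should reach remote actors."""
    return {k: v for k, v in environ.items() if k in ENV_VARS_TO_PROPAGATE}
```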

Task 17: 🔴 identity (high)

Azure node provider has no credential expiration error handling. In aws/utils.py, handle_boto_error() catches the ExpiredTokenException, ExpiredToken, and RequestExpired error codes and provides user-friendly recovery instructions (an aws sts get-session-token command plus export instructions for AWS_SECRET_ACCESS_KEY/AWS_SESSION_TOKEN/AWS_ACCESS_KEY_ID). The Azure node provider (node_provider.py) uses DefaultAzureCredential but has no equivalent error interception or recovery guidance for expired Azure tokens.

Reference files (aws):

  • python/ray/autoscaler/_private/aws/utils.py

Existing target files (azure):

  • python/ray/autoscaler/_private/_azure/node_provider.py

Recommendation: Create an Azure-equivalent error handler in the _azure package that catches azure.core.exceptions.ClientAuthenticationError and azure.identity.CredentialUnavailableError, and provides recovery guidance (e.g., 'az login', 'az account get-access-token') similar to how handle_boto_error() guides users through AWS STS token refresh.

Task 18: 🟡 identity (medium)

No Azure mock credential test infrastructure exists. AWS has extensive test fixtures: configure_aws fixture in test_cli.py sets AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY/AWS_SECURITY_TOKEN/AWS_SESSION_TOKEN with 'testing' values and uses moto's mock_aws context manager; aws_credentials fixture in data/tests/conftest.py does the same; simulate_s3_bucket in test_utils.py creates a mocked S3 environment. Azure has no equivalent mock fixtures for testing Azure-authenticated code paths.

Reference files (aws):

  • python/ray/tests/test_cli.py
  • python/ray/data/tests/conftest.py
  • python/ray/_common/test_utils.py

Existing target files (azure):

  • python/ray/data/tests/conftest.py

Recommendation: Create Azure mock credential fixtures analogous to configure_aws and aws_credentials that set AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_CLIENT_SECRET to test values, and integrate a mock Azure storage library (e.g., Azurite or unittest.mock patches on azure.storage.blob) to enable Azure code path testing similar to moto-based S3 testing.
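
A minimal sketch of such a fixture, written as a plain context manager so it works with or without pytest. The values are dummies, mirroring the 'testing' convention of configure_aws; real tests would pair this with Azurite or unittest.mock patches on azure.storage.blob, as suggested above:

```python
import os
from contextlib import contextmanager

# Fake Azure credential values, analogous to the 'testing' values used by
# the configure_aws and aws_credentials fixtures described above.
_FAKE_AZURE_ENV = {
    "AZURE_CLIENT_ID": "testing",
    "AZURE_TENANT_ID": "testing",
    "AZURE_CLIENT_SECRET": "testing",
}


@contextmanager
def azure_test_credentials():
    """Temporarily install fake Azure credentials, restoring any originals."""
    saved = {k: os.environ.get(k) for k in _FAKE_AZURE_ENV}
    os.environ.update(_FAKE_AZURE_ENV)
    try:
        yield
    finally:
        for key, value in saved.items():
            if value is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = value
```

Wrapping this in a pytest fixture is a one-liner (`yield from` inside a `@pytest.fixture`), which keeps parity with how the AWS fixtures are consumed.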

Task 19: 🟡 identity (medium)

No Azure equivalent of STS assume-role for cross-account/elevated credential access. AWS uses boto3 sts.assume_role in firehose_utils.py (assumes arn:aws:iam::830883877497:role/... to get temporary AccessKeyId/SecretAccessKey/SessionToken) and in docker/fix-docker-latest.sh (aws sts assume-role --role-arn to invoke Lambda). Azure node_provider.py only uses DefaultAzureCredential which does not support cross-tenant or cross-subscription role assumption.

Reference files (aws):

  • release/llm_tests/serve/benchmark/firehose_utils.py
  • docker/fix-docker-latest.sh

Existing target files (azure):

  • python/ray/autoscaler/_private/_azure/node_provider.py

Recommendation: Implement an Azure credential elevation utility that supports azure.identity.ClientSecretCredential or on-behalf-of flows for cross-subscription scenarios, analogous to how AWS STS assume_role provides scoped temporary credentials. Add optional provider config fields (e.g., azure_tenant_id, azure_client_id) to support federated identity scenarios.

Task 20: 🟡 identity (medium)

No programmatic Azure credential retrieval and export pattern. In test_durable_trainable.py, AWS credentials are retrieved programmatically via boto3.Session().get_credentials().get_frozen_credentials() and exported to AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY/AWS_SESSION_TOKEN env vars. There is no Azure equivalent path — the code only checks for AWS credentials and prints 'Cannot setup AWS credentials (is this running on GCE?)' with no Azure fallback.

Reference files (aws):

  • release/tune_tests/scalability_tests/workloads/test_durable_trainable.py

Existing target files (azure):

  • release/tune_tests/scalability_tests/workloads/test_durable_trainable.py

Recommendation: Add an Azure credential retrieval fallback using azure.identity.DefaultAzureCredential().get_token() to extract and export Azure credentials (AZURE_CLIENT_ID, AZURE_TENANT_ID, etc.) to environment variables when AWS credentials are unavailable, enabling durable trainable tests to work with Azure storage backends.

Task 21: 🟢 identity (low)

Azure EKS-equivalent documentation lacks identity/credential guidance. doc/source/cluster/kubernetes/user-guides/aws-eks-gpu-cluster.md references AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN for eksctl authentication in Step 3. There is no equivalent AKS GPU cluster guide documenting Azure identity setup (az login, service principal, managed identity for AKS).

Reference files (aws):

  • doc/source/cluster/kubernetes/user-guides/aws-eks-gpu-cluster.md

Recommendation: Create a parallel doc/source/cluster/kubernetes/user-guides/azure-aks-gpu-cluster.md guide that documents Azure identity setup for AKS: az login, service principal creation, managed identity configuration for the AKS cluster, and kubeconfig credential setup via az aks get-credentials.

Task 22: 🔴 storage (high)

No dedicated Azure filesystem implementation equivalent to S3FileSystem (s3_filesystem.py). S3 has a native boto3-based implementation with optimized connection pooling (max_pool_connections=50), adaptive retry configuration (max_attempts=3, mode='adaptive'), and TCP keepalive via _get_s3_client(). Azure relies entirely on the PyArrow abstraction layer in pyarrow_filesystem.py via _create_azure_filesystem(), which wraps adlfs.AzureBlobFileSystem with no equivalent connection pooling, retry, or keepalive configuration.

Reference files (aws):

  • python/ray/llm/_internal/common/utils/cloud_filesystem/s3_filesystem.py

Existing target files (azure):

  • python/ray/llm/_internal/common/utils/cloud_filesystem/pyarrow_filesystem.py

Recommendation: Create a dedicated azure_filesystem.py implementing BaseCloudFileSystem using the native Azure SDK (azure-storage-blob). Include connection pooling via BlobServiceClient, retry policies via azure.core.pipeline.policies.RetryPolicy, and equivalent get_file(), list_subfolders(), and download_files() methods matching S3FileSystem's interface.

Task 23: 🔴 storage (high)

S3FileSystem.get_file() provides direct-to-memory file download using boto3's get_object() with proper error handling (catches ClientError with 'NoSuchKey'/'404' codes, returns None). Azure has no equivalent native SDK get_file() implementation — it must go through PyArrowFileSystem.get_file() which uses open_input_file/get_file_info from the PyArrow abstraction, adding overhead and losing Azure-specific error handling capabilities.

Reference files (aws):

  • python/ray/llm/_internal/common/utils/cloud_filesystem/s3_filesystem.py

Existing target files (azure):

  • python/ray/llm/_internal/common/utils/cloud_filesystem/pyarrow_filesystem.py

Recommendation: Implement get_file() in the new azure_filesystem.py using BlobServiceClient.get_blob_client().download_blob().readall() with proper Azure-specific exception handling (ResourceNotFoundError) to return None for missing blobs, mirroring S3FileSystem.get_file() behavior.

Task 24: 🟡 storage (medium)

S3FileSystem supports anonymous/unsigned access via the 's3://anonymous@bucket/path' URI pattern, handled in both _parse_s3_uri() (returns is_anonymous flag) and _get_s3_client() (sets signature_version=UNSIGNED). PyArrowFileSystem.get_fs_and_path() explicitly excludes Azure URIs from anonymous access parsing with the condition: 'not (object_uri.startswith("abfss://") or object_uri.startswith("azure://"))'. There is no mechanism for anonymous Azure Blob Storage access.

Reference files (aws):

  • python/ray/llm/_internal/common/utils/cloud_filesystem/s3_filesystem.py
  • python/ray/llm/_internal/common/utils/cloud_filesystem/pyarrow_filesystem.py

Existing target files (azure):

  • python/ray/llm/_internal/common/utils/cloud_filesystem/pyarrow_filesystem.py

Recommendation: Add anonymous/public access support for Azure Blob Storage by allowing 'azure://anonymous@account.blob.core.windows.net/container/path' URIs. In PyArrowFileSystem, update the anonymous access check to not exclude Azure URIs. In the native Azure implementation, support anonymous access by omitting credentials from BlobServiceClient.
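
A sketch of the URI handling this would require. The 'azure://anonymous@' form is the proposal from the recommendation above, not an existing Ray convention, and the function name is hypothetical:

```python
# Hypothetical parser for an anonymous-access Azure URI, mirroring the
# 's3://anonymous@...' convention handled by _parse_s3_uri().
def parse_azure_uri(uri: str) -> tuple:
    """Return (normalized_uri, is_anonymous) for azure:// and abfss:// URIs."""
    is_anonymous = False
    for scheme in ("azure://", "abfss://"):
        marker = f"{scheme}anonymous@"
        if uri.startswith(marker):
            # Strip only the anonymous marker, keeping the rest of the URI.
            uri = scheme + uri[len(marker):]
            is_anonymous = True
            break
    return uri, is_anonymous
```

On the client side, is_anonymous would translate to constructing BlobServiceClient without a credential, the Azure counterpart of S3's signature_version=UNSIGNED.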

Task 25: 🟡 storage (medium)

S3FileSystem uses ThreadPoolExecutor with configurable max_workers for concurrent file downloads, with _get_s3_client() creating clients that have max_pool_connections matching max_workers for optimal concurrent performance. The S3 download_files tests (test_s3_filesystem.py) validate this concurrent download path. Azure downloads go through PyArrowFileSystem.download_files() which calls pyarrow.fs.copy_files() — a single bulk operation with no equivalent fine-grained concurrency control or connection pool tuning.

Reference files (aws):

  • python/ray/llm/_internal/common/utils/cloud_filesystem/s3_filesystem.py
  • python/ray/llm/tests/common/cloud/test_s3_filesystem.py

Existing target files (azure):

  • python/ray/llm/_internal/common/utils/cloud_filesystem/pyarrow_filesystem.py
  • python/ray/llm/tests/common/cloud/test_pyarrow_filesystem.py

Recommendation: Implement concurrent download_files() in azure_filesystem.py using ThreadPoolExecutor with azure.storage.blob's BlobClient.download_blob() per file, matching S3FileSystem's concurrency model. Configure max_workers and use connection pooling via BlobServiceClient's built-in connection pool settings.
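
A sketch of the proposed concurrency model. The real implementation would pass a closure over BlobClient.download_blob(); here the per-file fetch function is injected so the concurrency logic is visible (and testable) on its own, and the function name mirrors but is not the actual S3FileSystem API:

```python
from concurrent.futures import ThreadPoolExecutor


def download_files(paths, fetch_one, max_workers: int = 8) -> dict:
    """Fetch each path concurrently; returns {path: contents}.

    fetch_one would wrap BlobClient.download_blob().readall() in the
    real azure_filesystem.py; max_workers should be kept in line with the
    client's connection pool size, as S3FileSystem does.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order while the fetches run in parallel.
        results = pool.map(fetch_one, paths)
        return dict(zip(paths, results))
```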

Task 26: 🟡 storage (medium)

S3FileSystem.list_subfolders() uses boto3's list_objects_v2() with Delimiter='/' and parses CommonPrefixes for efficient server-side directory listing (confirmed in test_s3_filesystem.py). Azure has no native equivalent — list_subfolders() goes through PyArrowFileSystem which uses get_file_info() from the PyArrow abstraction with adlfs, requiring client-side filtering of FileType.Directory entries rather than server-side prefix enumeration.

Reference files (aws):

  • python/ray/llm/_internal/common/utils/cloud_filesystem/s3_filesystem.py
  • python/ray/llm/tests/common/cloud/test_s3_filesystem.py

Existing target files (azure):

  • python/ray/llm/_internal/common/utils/cloud_filesystem/pyarrow_filesystem.py
  • python/ray/llm/tests/common/cloud/test_pyarrow_filesystem.py

Recommendation: Implement list_subfolders() in azure_filesystem.py using BlobServiceClient.get_container_client().walk_blobs(name_starts_with=prefix, delimiter='/') which provides efficient server-side prefix enumeration equivalent to S3's CommonPrefixes.
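
A sketch of that listing logic. walk_blobs(name_starts_with=..., delimiter=...) is a real azure.storage.blob ContainerClient method; with a delimiter it yields BlobPrefix entries (names ending in the delimiter) for each "directory". The container client is passed in here so the prefix handling can be shown without the SDK:

```python
def list_subfolders(container_client, prefix: str) -> list:
    """Return immediate subfolder names under prefix, via server-side listing.

    container_client is duck-typed; a real azure.storage.blob
    ContainerClient works, as does any object exposing walk_blobs().
    """
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    subfolders = []
    for item in container_client.walk_blobs(name_starts_with=prefix, delimiter="/"):
        # BlobPrefix entries end with the delimiter; plain blobs at this
        # level do not and are skipped, mirroring S3's CommonPrefixes.
        if item.name.endswith("/"):
            subfolders.append(item.name[len(prefix):].rstrip("/"))
    return subfolders
```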

Task 27: 🟡 storage (medium)

The runtime_env Protocol class in protocol.py handles S3 with smart credential fallback: _handle_s3_protocol() checks session.get_credentials() and falls back to UNSIGNED config for public buckets. The Azure equivalent _handle_azure_protocol() requires AZURE_STORAGE_ACCOUNT env var with no fallback, raising ValueError if unset. There is no graceful degradation or auto-detection of authentication method for Azure.

Reference files (aws):

  • python/ray/_private/runtime_env/protocol.py

Existing target files (azure):

  • python/ray/_private/runtime_env/protocol.py

Recommendation: Update _handle_azure_protocol() to implement credential fallback: try DefaultAzureCredential first (which chains multiple auth methods), then fall back to anonymous BlobServiceClient for public containers. Extract account name from the URI itself instead of requiring AZURE_STORAGE_ACCOUNT env var, similar to how _handle_abfss_protocol() parses the URI.
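
A sketch of the account-extraction piece, assuming the standard Azure addressing forms 'abfss://container@account.dfs.core.windows.net/path' and 'https://account.blob.core.windows.net/...'; the function name is hypothetical:

```python
from typing import Optional
from urllib.parse import urlparse


def account_from_uri(uri: str) -> Optional[str]:
    """Extract the storage account name from an Azure storage URI.

    Returns None for non-Azure URIs, letting the caller fall back to the
    AZURE_STORAGE_ACCOUNT env var or raise as before.
    """
    # urlparse strips the 'container@' userinfo, leaving the bare host.
    host = urlparse(uri).hostname or ""
    for suffix in (".dfs.core.windows.net", ".blob.core.windows.net"):
        if host.endswith(suffix):
            return host[: -len(suffix)]
    return None
```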

Task 28: 🟢 storage (low)

test_s3_filesystem.py provides comprehensive unit tests for the native S3 implementation including: get_file (string and bytes), get_file_not_found, get_file_anonymous, list_subfolders with parameterized URI variants. There is no equivalent test_azure_filesystem.py for a native Azure implementation. Azure tests exist only for the PyArrow abstraction layer in test_pyarrow_filesystem.py (TestPyArrowFileSystemAzureSupport).

Reference files (aws):

  • python/ray/llm/tests/common/cloud/test_s3_filesystem.py

Existing target files (azure):

  • python/ray/llm/tests/common/cloud/test_pyarrow_filesystem.py

Recommendation: Create test_azure_filesystem.py with unit tests mirroring test_s3_filesystem.py structure: test_get_file (string/bytes), test_get_file_not_found (ResourceNotFoundError handling), test_list_subfolders, test_download_files, and test_parse_azure_uri with parameterized URI variants.

Task 29: 🟢 storage (low)

LoRA model loading in lora_serve_utils.py has S3-centric documentation: load_model_from_config docstring says 'fetching its mirror config from S3', and the example comments show only 's3://ray-llama-weights' as base_path. While the code is cloud-agnostic at runtime, documentation does not mention Azure usage patterns or ABFSS/Azure URI examples.

Reference files (aws):

  • python/ray/llm/_internal/serve/utils/lora_serve_utils.py

Existing target files (azure):

  • python/ray/llm/_internal/serve/utils/lora_serve_utils.py

Recommendation: Update docstrings and comments in lora_serve_utils.py to reference all supported cloud providers. Change 'from S3' to 'from cloud storage' in load_model_from_config, and add Azure URI examples alongside S3 examples (e.g., 'abfss://container@account.dfs.core.windows.net/lora-weights').

Task 30: 🔴 networking (high)

Azure lacks per-node-type security group support. AWS's _get_or_create_vpc_security_groups() in config.py creates separate security groups per VPC/node-type, and example-head-and-worker-security-group.yaml allows distinct SecurityGroupIds for head vs worker nodes. Azure's azure-config-template.json creates only a single shared NSG (variables.nsgName) applied to the entire subnet, with no mechanism for node-type-specific network isolation.

Reference files (aws):

  • python/ray/autoscaler/_private/aws/config.py
  • python/ray/autoscaler/aws/example-head-and-worker-security-group.yaml

Existing target files (azure):

  • python/ray/autoscaler/_private/_azure/azure-config-template.json
  • python/ray/autoscaler/_private/_azure/config.py

Recommendation: Add support for per-node-type NSGs in the Azure config. Extend azure-config-template.json to accept optional separate NSG parameters for head and worker nodes. In config.py's _configure_resource_group(), allow provider config fields like 'security_group' with per-node-type overrides similar to AWS's SecurityGroupIds per available_node_type.

Task 31: 🔴 networking (high)

Azure lacks configurable security group rules. AWS's _upsert_security_group_rules() dynamically manages inbound rules including custom IpPermissions from provider.security_group config, and the test suite (test_autoscaler_aws.py test_create_sg_with_custom_inbound_rules) verifies custom inbound rule propagation. Azure's azure-config-template.json hardcodes only a single SSH rule (port 22), missing Ray Dashboard (8265), Ray Client (10001), GCS (6379), and inter-node communication ports that the doc-level azure-ray-template.json includes (SSH, JupyterLab 8000, RayWebUI 8265, TensorBoard).

Reference files (aws):

  • python/ray/autoscaler/_private/aws/config.py
  • python/ray/tests/aws/test_autoscaler_aws.py

Existing target files (azure):

  • python/ray/autoscaler/_private/_azure/azure-config-template.json

Recommendation: Add configurable NSG rules to azure-config-template.json via an ARM template parameter for additional security rules. At minimum, add default rules for Ray Dashboard (8265) and inter-node Ray communication ports. Expose a provider config option (e.g., provider.security_group.rules) in config.py that injects custom rules into the ARM deployment parameters.
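
A sketch of the rule-merging side of that config option. The 'security_group.rules' provider key is hypothetical, the rule dicts follow the ARM securityRules shape, and the default port set is the minimum suggested above:

```python
# Default inbound rules: SSH (the only rule the current template has) plus
# the Ray Dashboard, per the recommendation above.
_DEFAULT_RULES = [
    {"name": "ssh", "properties": {
        "destinationPortRange": "22", "protocol": "Tcp",
        "access": "Allow", "direction": "Inbound", "priority": 100}},
    {"name": "ray-dashboard", "properties": {
        "destinationPortRange": "8265", "protocol": "Tcp",
        "access": "Allow", "direction": "Inbound", "priority": 110}},
]


def build_nsg_rules(provider_config: dict) -> list:
    """Merge provider.security_group.rules over the defaults by rule name."""
    custom = provider_config.get("security_group", {}).get("rules", [])
    by_name = {rule["name"]: rule for rule in _DEFAULT_RULES}
    for rule in custom:
        by_name[rule["name"]] = rule  # same-named custom rules win
    return list(by_name.values())
```

The merged list would then be passed as an ARM deployment parameter that azure-config-template.json splices into the NSG's securityRules array.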

Task 32: 🟡 networking (medium)

Azure lacks multiple subnet support for head/worker node separation. AWS's example-subnets.yaml allows per-node-type SubnetIds, and config.py's _get_subnets_or_die() validates user-specified subnets with VPC peering support across different VPCs. Azure's config.py generates a single random subnet_mask ('10.{random}.0.0/16') and azure-config-template.json creates one VNet with one subnet. The doc-level azure-ray-template.json defines separate subnetWorkers and subnetHead variables, but this pattern is absent from the autoscaler template.

Reference files (aws):

  • python/ray/autoscaler/aws/example-subnets.yaml
  • python/ray/autoscaler/_private/aws/config.py

Existing target files (azure):

  • python/ray/autoscaler/_private/_azure/azure-config-template.json
  • python/ray/autoscaler/_private/_azure/config.py

Recommendation: Extend azure-config-template.json to support separate subnets for head and worker nodes (mirroring the pattern already in doc/azure/azure-ray-template.json with subnetWorkers/subnetHead). Add subnet_mask_head and subnet_mask_worker provider config options in config.py, and allow users to specify pre-existing subnet resource IDs per node type.
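Carving the head and worker subnets out of the generated /16 range is straightforward with the stdlib; the /24 prefix size below is an illustrative choice, not a requirement of the template.

```python
import ipaddress

# Sketch of splitting the randomly generated /16 VNet range into separate
# head and worker subnets, as the doc-level template's subnetHead /
# subnetWorkers variables suggest. Prefix sizes are illustrative.

def split_vnet(vnet_cidr, subnet_prefix=24):
    """Return (head_subnet, worker_subnet) carved from vnet_cidr."""
    net = ipaddress.ip_network(vnet_cidr)
    subnets = net.subnets(new_prefix=subnet_prefix)
    head = str(next(subnets))
    worker = str(next(subnets))
    return head, worker

head_subnet, worker_subnet = split_vnet("10.42.0.0/16")
```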

Task 33: 🟡 networking (medium)

Azure has no network interface configuration support. AWS's example-network-interfaces.yaml supports multiple NetworkInterfaces per node with features including: fixed PrivateIpAddress for head nodes, multiple DeviceIndex/NetworkCardIndex entries, per-interface SubnetId and security Groups, EFA (Elastic Fabric Adapter) support via InterfaceType field, and AssociatePublicIpAddress control. Azure's node_provider.py handles NIC cleanup (_cleanup_subnet, _cleanup_nsg) but provides no mechanism to configure custom network interfaces at node creation time.

Reference files (aws):

  • python/ray/autoscaler/aws/example-network-interfaces.yaml

Existing target files (azure):

  • python/ray/autoscaler/_private/_azure/node_provider.py
  • python/ray/autoscaler/_private/_azure/config.py

Recommendation: Add support for Azure network interface configuration in node_config, allowing users to specify multiple NICs with properties like subnet, NSG, private IP, and Accelerated Networking (Azure's equivalent of enhanced networking). Provide an example config similar to example-network-interfaces.yaml that demonstrates multi-NIC setup with Azure InfiniBand/RDMA for HPC workloads (Azure's analog to EFA).

Task 34: 🟡 networking (medium)

Azure lacks subnet validation and discovery logic. AWS's _get_subnets_or_die() (with @lru_cache) validates that user-specified subnet IDs actually exist via ec2.subnets.filter(), and _get_vpc_id_or_die() resolves VPC membership from subnets. The AWS test suite includes stubs for describe_subnets (configure_subnet_default, describe_a_thousand_subnets_in_different_vpcs, describe_twenty_subnets_in_different_azs) and tests like test_subnet_given_head_and_worker_sg that verify correct subnet resolution. Azure's config.py simply generates a random subnet CIDR and deploys it via ARM template without validating existing network topology.

Reference files (aws):

  • python/ray/autoscaler/_private/aws/config.py
  • python/ray/tests/aws/utils/stubs.py
  • python/ray/tests/aws/utils/constants.py

Existing target files (azure):

  • python/ray/autoscaler/_private/_azure/config.py

Recommendation: Add a _validate_subnet() function in Azure config.py that uses the NetworkManagementClient to verify user-specified subnets/VNets exist and have sufficient address space. When users provide existing VNet/subnet resource IDs, validate them before deployment. Add corresponding unit tests with mocked Azure SDK responses similar to AWS's stubs.py pattern.
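The injectable-client shape this recommends can be sketched as follows; the fake client stands in for azure.mgmt.network's NetworkManagementClient, and its method names and the subnet dict shape are simplified assumptions made for testability.

```python
import ipaddress

# Sketch of the proposed _validate_subnet(): the network client is injected
# so unit tests can pass a fake, following AWS's stubs.py pattern.

def validate_subnet(network_client, resource_group, vnet_name, subnet_name,
                    min_hosts=64):
    subnet = network_client.get_subnet(resource_group, vnet_name, subnet_name)
    if subnet is None:
        raise ValueError(f"Subnet {subnet_name} not found in VNet {vnet_name}")
    net = ipaddress.ip_network(subnet["address_prefix"])
    if net.num_addresses < min_hosts:
        raise ValueError(
            f"Subnet {subnet_name} has only {net.num_addresses} addresses; "
            f"need at least {min_hosts}"
        )
    return subnet

class FakeNetworkClient:
    """Test double standing in for NetworkManagementClient (simplified)."""
    def __init__(self, subnets):
        self._subnets = subnets
    def get_subnet(self, rg, vnet, name):
        return self._subnets.get((rg, vnet, name))

client = FakeNetworkClient({
    ("ray-rg", "ray-vnet", "ray-subnet"): {"address_prefix": "10.0.0.0/24"},
})
ok = validate_subnet(client, "ray-rg", "ray-vnet", "ray-subnet")
```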

Task 35: 🟢 networking (low)

Azure lacks use_internal_ips configuration parity. AWS's example-network-interfaces.yaml demonstrates provider.use_internal_ips: True for clusters where nodes don't have public IPs, requiring communication via private addresses within the same VPC. While Azure's config.py handles internal networking implicitly via the generated VNet, there is no explicit use_internal_ips provider option or documentation for private-only networking setups comparable to AWS's pattern.

Reference files (aws):

  • python/ray/autoscaler/aws/example-network-interfaces.yaml

Existing target files (azure):

  • python/ray/autoscaler/_private/_azure/config.py

Recommendation: Document and expose a provider.use_internal_ips option for Azure that controls whether nodes are assigned public IPs and whether Ray communicates over private addresses. This is relevant for Azure Private Link and VNet-only deployments where public IP association should be disabled on the NIC level.
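At its core the option only changes which address the autoscaler reports for a node; a minimal sketch, with an illustrative node dict rather than the actual Azure node provider representation:

```python
# Minimal sketch of the proposed provider.use_internal_ips option: choose
# which address the autoscaler reports for a node. The node dict shape
# here is illustrative.

def external_ip(node, provider_config):
    if provider_config.get("use_internal_ips", False):
        # Private-only deployment: no public IP is ever associated,
        # so the "external" address is the private one.
        return node["private_ip"]
    return node.get("public_ip") or node["private_ip"]

node = {"private_ip": "10.0.0.4", "public_ip": "20.1.2.3"}
with_public = external_ip(node, {})
internal_only = external_ip(node, {"use_internal_ips": True})
```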

Task 36: 🔴 compute (high)

Elastic training instance self-termination (terminate_current_instance() in elastic_util.py) is AWS-only. It uses EC2 IMDSv2 to fetch instance-id and region via http://169.254.169.254/latest/api/token and http://169.254.169.254/latest/meta-data/, then calls aws ec2 terminate-instances. No Azure equivalent exists using Azure IMDS (http://169.254.169.254/metadata/instance) and az vm delete or the Azure SDK.

Reference files (aws):

  • release/train_tests/elastic_training/elastic_util.py

Recommendation: Add terminate_current_azure_instance() that queries Azure IMDS at http://169.254.169.254/metadata/instance?api-version=2021-12-13 with Metadata: true header to get subscription, resource group, and VM name, then terminates via az vm delete or azure.mgmt.compute.ComputeManagementClient. Add a dispatcher function that calls the correct provider termination based on get_cloud_from_metadata_requests() result.
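The identity-extraction half of this can be sketched without any network calls. The field names below (subscriptionId, resourceGroupName, name) match the documented IMDS compute schema; the sample values are illustrative and the actual termination call is left as a comment.

```python
# Sketch of the self-termination flow. The function only parses an Azure
# IMDS "compute" payload of the kind returned by
# http://169.254.169.254/metadata/instance?api-version=2021-12-13
# (requested with the "Metadata: true" header).

def parse_imds_compute(compute):
    """Extract (subscription_id, resource_group, vm_name) from IMDS data."""
    return (
        compute["subscriptionId"],
        compute["resourceGroupName"],
        compute["name"],
    )

# Example payload shape (values are illustrative):
sample = {
    "subscriptionId": "00000000-0000-0000-0000-000000000000",
    "resourceGroupName": "ray-ci-rg",
    "name": "ray-worker-3",
}
sub, rg, vm = parse_imds_compute(sample)
# Termination would then shell out, e.g.:
#   az vm delete --subscription <sub> -g <rg> -n <vm> --yes
```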

Task 37: 🟡 compute (medium)

Test utility _terminate_ec2_instance() and _execute_command_on_node() in test_utils.py are AWS-only. They use EC2 IMDSv2 (token-based metadata at 169.254.169.254/latest/api/token) and aws ec2 terminate-instances to terminate nodes during integration tests. No Azure VM termination equivalent exists for test infrastructure running on Azure.

Reference files (aws):

  • python/ray/_private/test_utils.py

Recommendation: Add _terminate_azure_vm() that uses Azure IMDS (http://169.254.169.254/metadata/instance?api-version=2021-12-13 with Metadata: true header) to get VM identity, then calls az vm delete or uses the Azure SDK. Refactor _execute_command_on_node to be provider-agnostic or add an Azure-specific variant.

Task 38: 🟡 compute (medium)

CI reproduction environment (ci/repro-ci.py ReproSession class) is entirely AWS-specific. It uses boto3.client('ec2') to start/manage EC2 instances, AWS-specific SSH user ec2-user, AWS key pair buildkite-repro-env, security group sg-0ccfca2ef191c04ae, and reads BUILDKITE_AGENT_META_DATA_AWS_INSTANCE_TYPE and BUILDKITE_AGENT_META_DATA_AWS_AMI_ID environment variables. No Azure VM-based CI repro path exists.

Reference files (aws):

  • ci/repro-ci.py

Recommendation: Add an AzureReproSession class that uses azure.mgmt.compute.ComputeManagementClient to create/manage Azure VMs for CI reproduction. Map AWS concepts to Azure: AMI → Azure image reference, instance type → vmSize, security group → NSG, and use Azure-specific environment variables like BUILDKITE_AGENT_META_DATA_AZURE_VM_SIZE and BUILDKITE_AGENT_META_DATA_AZURE_IMAGE_ID.

Task 39: 🟢 compute (low)

CI secrets retrieval in ci/repro-ci.py uses AWS Secrets Manager (boto3.client('secretsmanager', region_name='us-west-2').get_secret_value(SecretId='arn:aws:secretsmanager:...')) to fetch the Buildkite token. No Azure Key Vault equivalent is provided for environments running on Azure.

Reference files (aws):

  • ci/repro-ci.py

Recommendation: Add an Azure Key Vault path using azure.keyvault.secrets.SecretClient alongside the existing AWS Secrets Manager path. Use get_cloud_from_metadata_requests() or an environment variable to select the correct secrets backend.
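The backend-selection plumbing can be sketched independently of either SDK; the environment variable name `REPRO_CI_SECRETS_BACKEND` is an assumption, and the loader bodies are placeholders for the real boto3 and azure.keyvault.secrets calls.

```python
import os

# Sketch of a secrets-backend dispatcher for the Buildkite token. The
# loader functions are stubs; a real implementation would call
# boto3 / azure.keyvault.secrets.SecretClient respectively.

def _from_aws():
    return "token-from-secrets-manager"  # placeholder for boto3 call

def _from_azure():
    return "token-from-key-vault"  # placeholder for SecretClient call

def get_buildkite_token(env=None):
    env = os.environ if env is None else env
    backend = env.get("REPRO_CI_SECRETS_BACKEND", "aws")  # var name assumed
    loaders = {"aws": _from_aws, "azure": _from_azure}
    if backend not in loaders:
        raise ValueError(f"Unknown secrets backend: {backend}")
    return loaders[backend]()

azure_token = get_buildkite_token({"REPRO_CI_SECRETS_BACKEND": "azure"})
default_token = get_buildkite_token({})
```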

Task 40: 🟢 compute (low)

Cross-cloud IAM federation config (release/aws2gce_iam.json) only exists for AWS-to-GCP workload identity federation using AWS STS token type urn:ietf:params:aws:token-type:aws4_request and AWS credential source at 169.254.169.254/latest/meta-data/iam/security-credentials. No Azure-to-GCP federation config exists for CI/release jobs running on Azure compute.

Reference files (aws):

  • release/aws2gce_iam.json

Recommendation: Create release/azure2gce_iam.json using GCP Workload Identity Federation with Azure as the identity provider. Use subject_token_type: urn:ietf:params:oauth:token-type:jwt with Azure managed identity token from http://169.254.169.254/metadata/identity/oauth2/token as the credential source.

Task 41: 🔴 driver (high)

Azure lacks automatic node type resource detection. AWS autoscaler calls fillout_available_node_types_resources using EC2 describe_instance_types to auto-populate CPU, memory, GPU count, GPU name, and accelerator info (seen in test_autoscaler_yaml.py boto3_dict mock with VCpuInfo, MemoryInfo, GpuInfo, AcceleratorInfo fields). Azure's example-gpu-docker.yaml requires manually specifying resources: {"CPU": 6, "GPU": 1} for every node type, with no equivalent Azure Compute API-based auto-detection.

Reference files (aws):

  • python/ray/tests/test_autoscaler_yaml.py

Existing target files (azure):

  • python/ray/autoscaler/azure/example-gpu-docker.yaml

Recommendation: Implement fillout_available_node_types_resources in the Azure node provider that calls the Azure Compute REST API (VirtualMachineSizes or ResourceSKUs) to auto-detect vCPUs, memory, and GPU counts/types for each vmSize, mirroring the AWS pattern using client_cache and describe_instance_types.
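The capability-to-resources mapping can be sketched against the shape the ResourceSKUs API returns (a list of {"name", "value"} capabilities such as vCPUs, MemoryGB, GPUs). The NC-series values below are illustrative, and the output keys mirror Ray's resources dict convention (memory in bytes).

```python
# Sketch of the Azure fillout: map a ResourceSKUs-style capability list to
# a Ray resources dict, mirroring the AWS describe_instance_types pattern.

def resources_from_sku(capabilities):
    caps = {c["name"]: c["value"] for c in capabilities}
    resources = {"CPU": int(caps["vCPUs"])}
    gpus = int(caps.get("GPUs", "0"))
    if gpus:
        resources["GPU"] = gpus
    # Ray expresses memory in bytes.
    resources["memory"] = int(float(caps["MemoryGB"]) * 1024 ** 3)
    return resources

nc6_caps = [  # illustrative values for an NC-series size
    {"name": "vCPUs", "value": "6"},
    {"name": "MemoryGB", "value": "56"},
    {"name": "GPUs", "value": "1"},
]
nc6 = resources_from_sku(nc6_caps)
```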

Task 42: 🔴 driver (high)

Azure has no custom accelerator manager equivalent to AWS's NeuronAcceleratorManager in neuron.py. That class provides get_resource_name() ('neuron_cores'), get_current_node_num_accelerators() (via neuron-ls --json-output), get_current_node_accelerator_type() (returning AWS_NEURON_CORE), get_visible_accelerator_ids_env_var() ('NEURON_RT_VISIBLE_CORES'), validate_resource_request_quantity(), and the AWS_NEURON_INSTANCE_MAP mapping instance types to core counts. Azure has no analogous accelerator manager for Azure-specific accelerators (e.g., Azure Maia, FPGAs).

Reference files (aws):

  • python/ray/_private/accelerators/neuron.py

Recommendation: Create python/ray/_private/accelerators/azure_accelerator.py implementing an AcceleratorManager subclass for any Azure-specific accelerators, following the same pattern as NeuronAcceleratorManager: implement get_resource_name, get_current_node_num_accelerators, get_current_node_accelerator_type, get_visible_accelerator_ids_env_var, and validate_resource_request_quantity.
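A standalone sketch of that interface (deliberately not subclassing Ray's AcceleratorManager, to stay self-contained). The resource name, accelerator-type constant, env var, and detection logic are all assumptions; Azure has not published a Maia device-enumeration tool analogous to neuron-ls, so detection is stubbed on an injected device listing.

```python
# Hypothetical Azure accelerator manager following the
# NeuronAcceleratorManager method set. All names below are assumptions.

class AzureMaiaAcceleratorManager:
    @staticmethod
    def get_resource_name():
        return "maia_accelerators"  # assumed, cf. "neuron_cores"

    @staticmethod
    def get_current_node_accelerator_type():
        return "azure-maia-100"  # assumed constant, cf. AWS_NEURON_CORE

    @staticmethod
    def get_visible_accelerator_ids_env_var():
        return "MAIA_VISIBLE_DEVICES"  # assumed, cf. NEURON_RT_VISIBLE_CORES

    @staticmethod
    def get_current_node_num_accelerators(device_listing=()):
        # A real implementation would probe a device tool's JSON output,
        # analogous to `neuron-ls --json-output`.
        return len(device_listing)

    @staticmethod
    def validate_resource_request_quantity(quantity):
        if quantity != int(quantity):
            return (False, "accelerator requests must be whole numbers")
        return (True, None)

mgr = AzureMaiaAcceleratorManager
count = mgr.get_current_node_num_accelerators(["dev0", "dev1"])
ok, _ = mgr.validate_resource_request_quantity(2)
bad, msg = mgr.validate_resource_request_quantity(0.5)
```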

Task 43: 🟡 driver (medium)

Azure has no registered accelerator type constant in accelerators.py. AWS defines AWS_NEURON_CORE = "aws-neuron-core" (line 30) which is referenced by NeuronAcceleratorManager.get_current_node_accelerator_type() and used as accelerator_type:aws-neuron-core in resource auto-detection (test_autoscaler_yaml.py expected_available_node_types). Azure has no equivalent constant for its accelerator hardware.

Reference files (aws):

  • python/ray/util/accelerators/accelerators.py
  • python/ray/_private/accelerators/neuron.py

Recommendation: Add Azure-specific accelerator constants (e.g., AZURE_MAIA_100 = "azure-maia-100") to python/ray/util/accelerators/accelerators.py and reference them from a corresponding Azure accelerator manager.

Task 44: 🟡 driver (medium)

Azure has no specialized training backend equivalent to _TorchAwsNeuronXLABackend in python/ray/train/torch/xla/config.py. That class provides TorchXLAConfig with neuron_parallel_compile support, XLA environment variable setup (_set_xla_env_vars including EFA/XLA vars like FI_PROVIDER, FI_EFA_USE_DEVICE_RDMA), XLA process group initialization (_setup_xla_torch_process_group), Neuron graph extraction/compilation (_neuron_compile_extracted_graphs using libneuronxla), and xrt server lifecycle management (_kill_xrt_server). Azure has no equivalent accelerator-specific training backend.

Reference files (aws):

  • python/ray/train/torch/xla/config.py

Recommendation: If Azure introduces custom training accelerators, implement an Azure-specific backend class extending Backend (similar to _TorchAwsNeuronXLABackend) with appropriate environment setup, distributed process group initialization, and compilation workflows in a new python/ray/train/torch/azure/config.py.

Task 45: 🟡 driver (medium)

Azure has no Ray Serve accelerator inference examples. AWS provides two complete NeuronCore inference tutorials: aws_neuron_core_inference_serve.py (text classification on NeuronCores with ray_actor_options={"resources": {"neuron_cores": 1}}) and aws_neuron_core_inference_serve_stable_diffusion.py (Stable Diffusion with NeuronStableDiffusionXLPipeline and neuron_cores: 2), both listed in doc/source/serve/examples.yml under 'ai accelerators'. Azure has no equivalent accelerator-specific serving examples.

Reference files (aws):

  • doc/source/serve/examples.yml
  • doc/source/serve/doc_code/aws_neuron_core_inference_serve.py
  • doc/source/serve/doc_code/aws_neuron_core_inference_serve_stable_diffusion.py

Recommendation: Add Azure GPU-optimized Ray Serve inference examples (e.g., using Azure NC/ND-series VMs with ONNX Runtime or TensorRT) to doc/source/serve/doc_code/ and register them in doc/source/serve/examples.yml under the 'ai accelerators' category.

Task 46: 🟢 driver (low)

Azure has no accelerator-specific distributed training tutorial. AWS provides doc/source/train/examples/aws-trainium/llama3.rst demonstrating Llama 3.1 fine-tuning on Trainium with EKS/KubeRay, and lists it in doc/source/train/examples.yml with frameworks: [pytorch, aws neuron]. Azure's azure-aks-gpu-cluster.md only covers generic AKS GPU cluster setup but has no equivalent accelerator-specific training walkthrough.

Reference files (aws):

  • doc/source/train/examples.yml
  • doc/source/train/examples/aws-trainium/llama3.rst

Existing target files (azure):

  • doc/source/cluster/kubernetes/user-guides/azure-aks-gpu-cluster.md

Recommendation: Create an Azure-specific training tutorial (e.g., doc/source/train/examples/azure-gpu/llama3.rst) demonstrating LLM fine-tuning on Azure ND-series GPU VMs with AKS/KubeRay, and add it to doc/source/train/examples.yml with appropriate framework tags.

Task 47: 🟢 driver (low)

Azure autoscaler test coverage lacks GPU and accelerator auto-detection validation. test_autoscaler_yaml.py has testValidateDefaultConfigAWSMultiNodeTypes which validates AWS-specific resource fillout including GPU types (GpuInfo.Gpus[].Name = 'V100'), accelerator types (AcceleratorInfo.Accelerators), and neuron core counts via mocked describe_instance_types. There are no equivalent tests verifying Azure node type resource auto-detection with GPU/accelerator metadata.

Reference files (aws):

  • python/ray/tests/test_autoscaler_yaml.py

Recommendation: Add testValidateDefaultConfigAzureMultiNodeTypes to test_autoscaler_yaml.py that mocks Azure Compute API responses (ResourceSKUs with GPU capability data) and validates that the Azure provider correctly auto-fills resources including GPU counts and accelerator types.

Task 48: 🔴 api (high)

No dedicated Azure filesystem implementation equivalent to S3FileSystem. AWS has a purpose-built S3FileSystem class in s3_filesystem.py using boto3 with connection pooling (max_pool_connections=50), adaptive retry logic (max_attempts=3, mode='adaptive'), TCP keepalive, and concurrent ThreadPoolExecutor-based downloads. Azure relies solely on the generic PyArrowFileSystem abstraction in pyarrow_filesystem.py, which lacks these optimizations.

Reference files (aws):

  • python/ray/llm/_internal/common/utils/cloud_filesystem/s3_filesystem.py

Existing target files (azure):

  • python/ray/llm/_internal/common/utils/cloud_filesystem/pyarrow_filesystem.py

Recommendation: Create a dedicated AzureFileSystem class (e.g., azure_filesystem.py) extending BaseCloudFileSystem that uses the Azure SDK (azure-storage-blob) directly. Implement _parse_azure_uri(), _get_blob_client() with connection pooling and retry policies (via azure.core.pipeline.policies), and get_file(), list_subfolders(), download_files() methods with optimized concurrent operations using ThreadPoolExecutor, mirroring the S3FileSystem pattern.
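The URI-parsing piece can be sketched with the stdlib. Both abfss:// and azure:// URIs put `container@account.<endpoint>` in the netloc, so one parser covers both; the (account, container, path) return shape is an assumption about what the eventual AzureFileSystem would want.

```python
from urllib.parse import urlparse

# Sketch of the proposed _parse_azure_uri(), handling both
#   abfss://container@account.dfs.core.windows.net/path
#   azure://container@account.blob.core.windows.net/path

def parse_azure_uri(uri):
    parsed = urlparse(uri)
    if parsed.scheme not in ("abfss", "azure"):
        raise ValueError(f"Not an Azure storage URI: {uri}")
    container, _, host = parsed.netloc.partition("@")
    if not container or not host:
        raise ValueError(f"Expected container@account host in: {uri}")
    account = host.split(".")[0]
    return account, container, parsed.path.lstrip("/")

acct, cont, path = parse_azure_uri(
    "abfss://models@myaccount.dfs.core.windows.net/llama/weights.bin"
)
```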

Task 49: 🟡 api (medium)

No dedicated Azure filesystem tests equivalent to test_s3_filesystem.py. AWS has comprehensive unit tests covering get_file (string/bytes), get_file_not_found, get_file_anonymous, and list_subfolders with parameterized URI variants, all mocking boto3.client. Azure has no analogous test_azure_filesystem.py test file.

Reference files (aws):

  • python/ray/llm/tests/common/cloud/test_s3_filesystem.py

Existing target files (azure):

  • python/ray/llm/tests/common/cloud/test_pyarrow_filesystem.py

Recommendation: Create test_azure_filesystem.py with tests for the new AzureFileSystem class covering get_file (UTF-8 and bytes), file-not-found handling, list_subfolders, and download_files, mocking azure.storage.blob.BlobServiceClient. Also add Azure-specific URI parsing tests for both abfss:// and azure:// schemes.

Task 50: 🟡 api (medium)

Upload utility tests in test_upload_utils.py only cover gs:// and pyarrow-s3:// URI schemes (lines 33 and 62). No test exercises Azure upload paths (abfss:// or azure://), despite CloudDownloaderConfig.validate_paths in cloud_downloader.py accepting these schemes as valid.

Reference files (aws):

  • python/ray/llm/tests/common/utils/test_upload_utils.py

Existing target files (azure):

  • python/ray/llm/tests/common/utils/test_upload_utils.py

Recommendation: Add test cases test_upload_custom_model_azure and test_upload_downloaded_hf_model_azure in test_upload_utils.py that exercise upload_model_files with abfss://container@account.dfs.core.windows.net/model-id and azure://container@account.blob.core.windows.net/model-id URIs, verifying pyarrow.fs.copy_files is called with the correct Azure filesystem and path.

Task 51: 🟡 api (medium)

Test fixtures in conftest.py download models exclusively from S3 (S3_ARTIFACT_URL = 'https://air-example-data.s3.amazonaws.com/'). The download_model_from_s3 helper and all model fixtures (model_opt_125m, model_llava_354m, model_smolvlm_256m) are hardcoded to S3. No Azure Blob Storage equivalent exists for integration testing.

Reference files (aws):

  • python/ray/llm/tests/conftest.py

Existing target files (azure):

  • python/ray/llm/tests/conftest.py

Recommendation: Add a download_model_from_azure helper function that downloads from Azure Blob Storage using the Azure SDK, and create corresponding Azure-backed model fixtures (or make the existing fixtures configurable via environment variables like TEST_ARTIFACT_STORAGE=azure) so integration tests can validate Azure storage paths end-to-end.

Task 52: 🟢 api (low)

The LoraModelLoader.load_model_from_config method in lora_serve_utils.py (line 163) is documented as 'fetching its mirror config from S3' and examples in the code reference S3 paths exclusively (e.g., s3://ray-llama-weights). The LoRA dynamic loading path pattern is AWS-centric with no Azure documentation or examples.

Reference files (aws):

  • python/ray/llm/_internal/serve/utils/lora_serve_utils.py

Existing target files (azure):

  • python/ray/llm/_internal/serve/utils/lora_serve_utils.py

Recommendation: Update the load_model_from_config docstring to be storage-agnostic ('fetching its mirror config from cloud storage'). Add Azure examples in lora_serve_utils.py comments showing dynamic_lora_loading_path set to abfss://container@account.dfs.core.windows.net/lora-weights and document required Azure credentials setup (e.g., AZURE_STORAGE_ACCOUNT_NAME, DefaultAzureCredential).

Task 53: 🟢 api (low)

The CloudDownloader callback class docstring in cloud_downloader.py (lines 43-58) only shows S3 and GCS examples in its usage documentation. No Azure (abfss:// or azure://) example is provided, despite these being listed as valid schemes in CloudDownloaderConfig.validate_paths.

Reference files (aws):

  • python/ray/llm/_internal/common/callbacks/cloud_downloader.py

Existing target files (azure):

  • python/ray/llm/_internal/common/callbacks/cloud_downloader.py

Recommendation: Add Azure examples to the CloudDownloader docstring showing tuples like ('abfss://container@account.dfs.core.windows.net/path/to/file.txt', '/local/path/to/file.txt') and ('azure://container@account.blob.core.windows.net/path/to/file.txt', '/local/path/to/file.txt') alongside the existing S3 and GCS examples.

LLM Holistic Analysis



Azure vs. AWS Provider Gap Analysis

Architectural Patterns. AWS's provider implementation follows a deeply layered architecture that Azure only partially mirrors. The AWS autoscaler (config.py) performs extensive bootstrap work—automated AMI selection per-region (DEFAULT_AMI dict), IAM role creation (DEFAULT_RAY_IAM_ROLE), VPC/subnet configuration (_configure_subnet, _get_subnets_or_die), and security group templating (SECURITY_GROUP_TEMPLATE). Azure's config.py is structurally simpler: bootstrap_azure only calls _configure_key_pair and _configure_resource_group, with no equivalent for subnet selection, security group rule upsert, or automatic image resolution. AWS's node provider also integrates fillout_available_node_types_resources to auto-detect GPU/accelerator metadata via boto3.describe_instance_types, whereas Azure's node provider has no equivalent—users must manually specify all resources in YAML. The test infrastructure reflects this: test_autoscaler_aws.py uses elaborate stub/mock fixtures (stubs.configure_iam_role_default, stubs.configure_key_pair_default) that test full bootstrap flows, while test_autoscaler_azure.py only tests zone-selection logic with shallow mocks that skip Azure API calls entirely.

Weakest Functional Areas. Azure's weakest areas are driver/accelerator support, storage, and identity propagation. On the driver side, AWS defines NeuronAcceleratorManager with AWS_NEURON_INSTANCE_MAP and registers it in the accelerator registry; Azure has zero accelerator constants, no instance-to-GPU map, and no AcceleratorManager subclass—meaning the autoscaler cannot auto-populate GPU resources or accelerator_type labels for NC/ND-series VMs. On storage, AWS has a dedicated S3FileSystem class (tested in test_s3_filesystem.py) with get_file(), anonymous access, and ClientError handling via botocore.exceptions; Azure relies entirely on the generic PyArrowFileSystem with adlfs integration (as shown in test_pyarrow_filesystem.py), which works but lacks a first-class AzureBlobFileSystem wrapper with equivalent error handling, retry logic, and direct-to-memory download. On identity, the Azure node provider imports DefaultAzureCredential and get_cli_profile but has no credential refresh, expiration handling, or propagation to Ray actors—contrast AWS's handle_boto_error utility and the extensive credential mock infrastructure in test_autoscaler_aws.py.

Specific Recommendations. The highest-impact change is implementing the AZURE_GPU_INSTANCE_MAP and AzureGPUAcceleratorManager (as the existing plan.md details), which would close the accelerator gap for NC/ND/NV-series VMs. Next, Azure should add a dedicated AzureBlobFileSystem class in ray/llm/_internal/common/utils/cloud_filesystem/ mirroring S3FileSystem's interface—wrapping azure.storage.blob.BlobServiceClient with get_file(), put_file(), and proper ResourceNotFoundError handling. The Azure bootstrap (bootstrap_azure) should be extended with _configure_network_security_group and _configure_subnet functions that auto-create NSG rules and select subnets within a VNet, following the AWS pattern of _get_or_create_vpc_security_groups. Finally, Azure needs test fixtures comparable to AWS's stubs module—mock ComputeManagementClient, NetworkManagementClient, and ResourceManagementClient factories that enable full bootstrap flow testing without live Azure API calls.

Cross-Cutting Concerns. Error handling diverges significantly: AWS wraps SDK errors through handle_boto_error (in aws/utils.py) and tests specific error codes like NoSuchKey and InstanceLimitExceeded (visible in test_info_string_with_launch_failures with UnavailableNodeInformation categories); Azure's node provider has no equivalent error-classification layer, making it harder to surface actionable failure messages in the autoscaler status string. Configuration validation is another gap—AWS's test suite validates subnet/VPC consistency (test_use_subnets_in_only_one_vpc) and security group rule correctness with named constants (DEFAULT_SG_WITH_RULES, CUSTOM_IN_BOUND_RULES); Azure tests only cover availability zone logic. The CI/container pipeline is entirely AWS-centric (ECR URIs, Buildkite configs), but this is lower priority since KubeRay on AKS (documented in azure-aks-gpu-cluster.md) handles container deployment separately via Kubernetes-native tooling.

Analysis based on reading actual source files. LLM tokens used: 0 input, 0 output.

Summary Checklist

  • Task 1
  • Task 2
  • Task 3
  • Task 4
  • Task 5
  • Task 6
  • Task 7
  • Task 8
  • Task 9
  • Task 10
  • Task 11
  • Task 12
  • Task 13
  • Task 14
  • Task 15
  • Task 16
  • Task 17
  • Task 18
  • Task 19
  • Task 20
  • Task 21
  • Task 22
  • Task 23
  • Task 24
  • Task 25
  • Task 26
  • Task 27
  • Task 28
  • Task 29
  • Task 30
  • Task 31
  • Task 32
  • Task 33
  • Task 34
  • Task 35
  • Task 36
  • Task 37
  • Task 38
  • Task 39
  • Task 40
  • Task 41
  • Task 42
  • Task 43
  • Task 44
  • Task 45
  • Task 46
  • Task 47
  • Task 48
  • Task 49
  • Task 50
  • Task 51
  • Task 52
  • Task 53

Guidelines

  • Follow existing code conventions and directory structure.
  • Each task should be a separate commit or PR.
  • Include tests for all new provider-specific code.
  • Update documentation (README, CHANGELOG) as appropriate.
  • Use the aws implementation as the primary reference, but adapt to azure-specific APIs and conventions.