@ksuderman
Created January 15, 2026 02:14
Compare and contrast the two options available to dispatch Galaxy jobs to Google Batch.

GCP Batch Job Runners Comparison

Galaxy supports two approaches for dispatching jobs to Google Cloud Batch: the Direct GCP Batch Runner and the Pulsar GCP Batch Runner. Each has distinct architectures and trade-offs.

Architecture Overview

Direct GCP Batch Runner (gcp_batch)

  • Runner class: galaxy.jobs.runners.gcp_batch:GoogleCloudBatchJobRunner
  • File access: NFS mount from Kubernetes cluster
  • Communication: Direct polling of GCP Batch API
  • Container model: Single container running the tool
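
As a sketch, this file-access model maps onto GCP Batch's volumes spec roughly as follows; the server address, export path, and tool image are illustrative assumptions, not the runner's exact output:

```yaml
# Hypothetical sketch of a Batch task spec with an NFS volume.
# The tool container reads Galaxy's datasets directly over the
# mount; nothing is staged to the VM.
taskGroups:
  - taskSpec:
      volumes:
        - nfs:
            server: 10.0.0.5            # NFS server reachable from Batch VMs
            remotePath: /export/galaxy
          mountPath: /galaxy/data
      runnables:
        - container:
            imageUri: my-tool-image     # single container running the tool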

Pulsar GCP Batch Runner (pulsar_gcp)

  • Runner class: galaxy.jobs.runners.pulsar:PulsarGcpBatchJobRunner
  • File access: HTTP transfer to local SSD via Pulsar sidecar
  • Communication: RabbitMQ message queue + Galaxy API
  • Container model: Two containers (Pulsar sidecar + tool container)
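
The staging model above can be sketched as a GCP Batch job spec; the image names, paths, and disk device name below are illustrative assumptions, not the exact spec produced by pulsar-galaxy-lib:

```yaml
# Hypothetical sketch of the two-container Batch job. The Pulsar
# sidecar stages files over HTTP onto a local SSD that both
# containers mount as shared scratch space.
allocationPolicy:
  instances:
    - policy:
        disks:
          - deviceName: local-ssd          # assumed device name
            newDisk:
              type: local-ssd
              sizeGb: 375
taskGroups:
  - taskSpec:
      volumes:
        - deviceName: local-ssd
          mountPath: /mnt/disks/pulsar     # shared scratch space
      runnables:
        - container:
            imageUri: galaxy/pulsar:latest # sidecar image (assumed name)
          background: true                 # sidecar keeps running while
                                           # the tool runnable executes
        - container:
            imageUri: my-tool-image        # tool container; reads inputs
                                           # from the local SSD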

Comparison

| Aspect | Direct GCP Batch | Pulsar GCP Batch |
| --- | --- | --- |
| Startup overhead | Lower (only an NFS mount) | Higher (file staging required) |
| I/O performance | Network-bound (NFS) | Local SSD (375 GB+) |
| Large input files | Better (no transfer) | Slower (inputs must be downloaded) |
| I/O-intensive tools | Slower (network latency) | Faster (local disk) |
| Network configuration | Supported (network/subnet params) | Not yet supported |
| Galaxy accessibility | Internal IP (same VPC) | Requires public IP or VPC peering |
| Complexity | Simpler | More complex |
| Firewall requirements | NFS ports (2049, 111) | RabbitMQ (5672) + HTTP (80/443) |

When to Use Each Approach

Use Direct GCP Batch (gcp_batch) when:

  • Input files are large (avoids staging transfers entirely)
  • Tools have moderate I/O requirements
  • You want simpler infrastructure
  • Galaxy and Batch VMs are in the same VPC
  • You need fine-grained network control

Use Pulsar GCP Batch (pulsar_gcp) when:

  • Tools are I/O-intensive (benefit from local SSD)
  • Input files are small to medium sized
  • You need to run jobs in a different network/project
  • Tool execution time dominates over file transfer time

Current Limitations

Direct GCP Batch

  • Requires NFS server accessible from Batch VMs
  • Batch VMs must be in the same VPC as the Galaxy cluster (otherwise VPC peering, or Cloud Filestore for NFS access, is required)

Pulsar GCP Batch

  • CRITICAL: Runnable deadlock - the Pulsar sidecar container is missing the background: true flag, so the tool container never starts (requires a fix in pulsar-galaxy-lib's gcp_job_template())
  • Missing network and subnet parameters (requires a code fix in pulsar-galaxy-lib)
  • No automatic machine_type computation from cores/mem (a hardcoded default is used instead)
  • Requires Galaxy to be reachable via a public IP (until network params are added)
  • kill() is not implemented, so jobs cannot be cancelled cleanly
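
The deadlock noted above comes down to a single missing field: in the GCP Batch API, a runnable marked background: true continues running while subsequent runnables execute. A minimal sketch of the intended spec (the image names are assumptions):

```yaml
runnables:
  - container:
      imageUri: galaxy/pulsar:latest   # Pulsar sidecar (assumed image name)
    background: true                   # currently missing: without it, Batch
                                       # waits for the sidecar to exit, so the
                                       # tool runnable below never starts
  - container:
      imageUri: my-tool-image          # tool container runs once the sidecar
                                       # is in the background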

Configuration Examples

Direct GCP Batch

runners:
  gcp_batch:
    load: galaxy.jobs.runners.gcp_batch:GoogleCloudBatchJobRunner
    project_id: my-project
    region: us-east4
    network: default
    subnet: default
    nfs_server: 10.0.0.5
    nfs_path: /export/galaxy

Pulsar GCP Batch

runners:
  pulsar_gcp:
    load: galaxy.jobs.runners.pulsar:PulsarGcpBatchJobRunner
    amqp_url: pyamqp://user:pass@rabbitmq-ip:5672//
    galaxy_url: http://galaxy-public-ip

execution:
  environments:
    pulsar_gcp:
      runner: pulsar_gcp
      project_id: my-project
      region: us-east4
      machine_type: n2-standard-8
      disk_size: 375