@mvdbeek
Created February 26, 2026 12:58
Triage artifacts for Galaxy issue #21642 - Remote data fetch not respecting quota

Issue #21642: Fetching data from repositories does not seem to respect storage quota

State: OPEN Author: martenson Labels: area/backend, area/jobs, kind/bug Assignees: mvdbeek Comments: 1

Description

I have a 25.1 test Galaxy instance set up and as an anon user I have a quota of 100 MB.

However there is seemingly no limit on how many things I can fetch from remote repositories. At the moment my history has 3 GB and I can request more datasets to be fetched without any issues.

Looking at the logs the fetching jobs do not seem to be ever paused, so this is likely different code path than galaxyproject/galaxy#20637 (also does not go through pulsar)

Related Issue

Issue #21642: Code Research - Remote Repository Data Fetching Not Respecting Storage Quota

Problem Statement

Galaxy 25.1 test instance with an anonymous user having 100 MB quota allows fetching unlimited data (3 GB observed) from remote repositories. Fetching jobs are never paused. The code path differs from issue #20637 and doesn't involve Pulsar.

Key Code Locations

File Structure:

  • /lib/galaxy/tools/data_fetch.py (Lines 51-73) - Main do_fetch() function
  • /lib/galaxy/tools/execute.py (Lines 206-235) - Tool execution with Celery decision
  • /lib/galaxy/celery/tasks.py (Lines 227-335) - Celery task definitions
  • /lib/galaxy/jobs/__init__.py (Lines 1789-1809) - JobWrapper quota checking
  • /lib/galaxy/quota/__init__.py (Lines 372-386) - Quota agent implementation
  • /lib/galaxy/config/__init__.py (Lines 1428-1437) - Celery config check

Root Cause Analysis

The Problem: When is_fetch_with_celery_enabled() returns True (Galaxy 25.1 default), the __DATA_FETCH__ tool execution bypasses the traditional JobWrapper.enqueue() method that performs quota checks. Instead, it executes as a Celery task chain that completely skips quota validation.

Execution Flow Comparison

Traditional Path (Non-Celery):

JobWrapper.enqueue() → _set_object_store_ids() → _pause_job_if_over_quota() ✓

Celery Path (Default in 25.1):

setup_fetch_data() → _set_object_store_ids() [NO QUOTA CHECK] ✗
fetch_data() → change_state(RUNNING) [NO QUOTA CHECK] ✗
_fetch_data() → actual fetching
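
The difference between the two flows can be illustrated with a toy model. This is only a sketch: `ToyJob`, `traditional_path`, `celery_path`, and the simplified `is_over_quota` are hypothetical stand-ins, not Galaxy code, and the trace entries merely mirror the method names listed above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyJob:
    """Hypothetical stand-in for a Galaxy job; sizes are in MB."""
    user_usage_mb: int
    quota_mb: int
    state: str = "new"
    trace: List[str] = field(default_factory=list)


def is_over_quota(job: ToyJob) -> bool:
    # Simplified stand-in for the quota agent's check.
    return job.user_usage_mb >= job.quota_mb


def traditional_path(job: ToyJob) -> ToyJob:
    # Models JobWrapper.enqueue(): the quota check runs before queuing.
    job.trace.append("_set_object_store_ids")
    job.trace.append("_pause_job_if_over_quota")
    job.state = "paused" if is_over_quota(job) else "queued"
    return job


def celery_path(job: ToyJob) -> ToyJob:
    # Models setup_fetch_data() followed by fetch_data(): no check anywhere.
    job.trace.append("_set_object_store_ids")
    job.state = "running"
    job.trace.append("_fetch_data")
    return job


print(traditional_path(ToyJob(user_usage_mb=3000, quota_mb=100)).state)  # paused
print(celery_path(ToyJob(user_usage_mb=3000, quota_mb=100)).state)       # running
```

With the same over-quota user, only the path that passes through the enqueue step ever reaches the pause logic.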

Three Theories on Root Cause

Theory 1: Missing Quota Check in Celery Tasks (MOST LIKELY - 95% probability)

In /lib/galaxy/celery/tasks.py, the setup_fetch_data() function (line 245) sets object store IDs but never calls _pause_job_if_over_quota(). Similarly, the fetch_data() function (line 334) transitions the job directly to RUNNING state without any quota verification.

The quota check method exists in /lib/galaxy/jobs/__init__.py lines 1804-1809 but is only called from JobWrapper.enqueue() which is bypassed entirely for Celery-based fetch jobs.

Theory 2: Session Management Issue in Setup Task (60% probability)

The setup_fetch_data() is a setup callback that returns values to the main task. Even if quota check were added here, changes to job state might not persist properly due to database session lifecycle in async contexts. The actual quota check needs to occur in the main fetch_data() task with proper session management.

Theory 3: Quota Tracking Before Job Execution (30% probability)

The quota agent's is_over_quota() method only checks current disk usage. If output datasets created during job setup don't have their sizes reflected in the quota calculation, quota checks might incorrectly show available space. Multiple concurrent fetch jobs could each pass checks individually.
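
The concurrency concern in Theory 3 can be reproduced with a few lines of plain Python. This is a deterministic sketch under stated assumptions: the `is_over_quota` function here is a simplified stand-in that only looks at usage already recorded, and the sizes are made up for illustration.

```python
def is_over_quota(current_usage_mb: int, quota_mb: int) -> bool:
    # Simplified stand-in: only usage that is already recorded counts.
    return current_usage_mb >= quota_mb


quota_mb = 100
usage_mb = 0
fetch_sizes_mb = [80, 80, 80]  # three fetch jobs submitted at about the same time

# Phase 1: every job runs its check before any download has finished,
# so each one sees usage_mb == 0 and is admitted.
admitted = [size for size in fetch_sizes_mb if not is_over_quota(usage_mb, quota_mb)]

# Phase 2: the downloads complete and only now count toward usage.
usage_mb += sum(admitted)

print(len(admitted), usage_mb)  # 3 240
```

Each job passes the check on its own, yet together they end at 240 MB against a 100 MB quota, which is why a check on current usage alone cannot bound concurrent fetches.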

Critical Code Sections

Normal Quota Check (JobWrapper.enqueue, lines 1800-1809):

def _pause_job_if_over_quota(self, job):
    quota_source_map = self.app.object_store.get_quota_source_map()
    if self.app.quota_agent.is_over_quota(quota_source_map, job):
        log.info("(%d) User (%s) is over quota: job paused", job.id, job.user_id)
        # `message` is the user-facing pause reason; its definition is omitted from this excerpt
        self.pause(job, message)

Celery Setup Task (Missing Check, line 245):

def setup_fetch_data(...):
    mini_job_wrapper._set_object_store_ids(job)  # Object store assigned
    # *** MISSING: _pause_job_if_over_quota() call here ***
    return mini_job_wrapper.working_directory, ...

Celery Main Task (No Check Before Execution, lines 334-335):

def fetch_data(...):
    mini_job_wrapper.change_state(model.Job.states.RUNNING, flush=True, job=job)
    return abort_when_job_stops(_fetch_data, ...)  # Starts fetching without quota check

Summary

The root cause is that Celery-based data fetch jobs bypass the traditional job enqueue path where quota checks are performed. The fix should add quota checking to the Celery task chain, preferably in setup_fetch_data() or at the start of fetch_data() before changing the job state to RUNNING.

Issue #21642: Git History Research

Timeline of Celery Fetch Implementation

April 1, 2022 - Initial Celery Fetch Implementation

Commit: aa67d9dd387 Author: mvdbeek

  • Created setup_fetch_data and fetch_data Celery tasks
  • No quota checks were included in the new path
  • This was the initial implementation of Celery-based data fetching

March 10, 2023 - Celery Fetch Made Configurable

Commit: f35e8f8288e Author: John Davis PR: #15767 Target: release_23.0

  • Added is_fetch_with_celery_enabled() function
  • Allowed disabling Celery fetch as a workaround, but not a fix
  • Made Celery fetch the default behavior

February 10, 2025 - Quota Check Added to Job Enqueue

Commit: ecaa747104a Author: davelopez

  • Added _pause_job_if_over_quota() to MinimalJobWrapper.enqueue()
  • CRITICAL ISSUE: This only fixes the traditional path, NOT the Celery path
  • Celery tasks still bypass enqueue() entirely

The Core Problem

The Celery path in execute() at line 206 of /lib/galaxy/tools/execute.py completely bypasses the job enqueue mechanism:

# Celery Path (NO quota check)
setup_fetch_data.s() | fetch_data.s() | set_job_metadata.s() | finish_job.si()

# Traditional Path (WITH quota check as of Feb 10, 2025)
tool.app.job_manager.enqueue(job2, tool=tool)

Key Authors

Author | Contribution
mvdbeek | Original Celery fetch implementation, currently assigned to issue
John Davis | Made Celery fetch configurable (PR #15767)
davelopez | Added quota check to MinimalJobWrapper.enqueue() (Feb 2025)

Regression Assessment

  • Type: REGRESSION
  • Introduced: Galaxy 23.0 (March 2023) when Celery fetch became default
  • Duration: ~2 years undetected (Mar 2023 → Feb 2025)
  • Severity: High (quota is a security/fairness control)

Root Cause Summary

The Celery implementation didn't replicate all job enqueue-time checks. When the Celery path was introduced, it was designed for performance but inadvertently bypassed the quota enforcement that exists in the traditional job handling path.

The recent fix by davelopez (Feb 2025) added quota checking to MinimalJobWrapper.enqueue(), but this doesn't help the Celery path because Celery tasks never call enqueue() - they directly execute the fetch operations.

Key Files

File | Line | Description
/lib/galaxy/tools/execute.py | 206 | Decides which path (Celery vs traditional) to use
/lib/galaxy/celery/tasks.py | 227-335 | Celery tasks with no quota checks
/lib/galaxy/jobs/__init__.py | 1612 | New quota check (doesn't help Celery path)
/lib/galaxy/config/__init__.py | 1428 | is_fetch_with_celery_enabled() function

Related PRs and Issues

  • PR #15767 - Made Celery fetch configurable (March 2023)
  • Issue #20637 - Similar quota issue with different code path (mentioned in original issue)

Issue #21642: Importance Assessment

1. SEVERITY: CRITICAL

Rationale:

  • Enables complete bypass of storage quota enforcement for anonymous users
  • Allows unlimited data consumption on quota-limited systems
  • Can cause disk space exhaustion on public Galaxy instances
  • Represents a fundamental failure of a core resource management feature
  • Anonymous access, the weakest point from a security standpoint, is the most exposed to this

2. BLAST RADIUS: HIGH - Affects all public-facing Galaxy instances

Who is affected:

  • All Galaxy instances with:
    • Quotas enabled (especially those serving anonymous users)
    • Data fetch/import functionality (core feature)
    • Remote data sources configured (common in shared servers)

Specific vulnerability:

  • Anonymous users with quotas can consume unlimited storage
  • Logged-in users may also be affected (quota only enforced after fetch completes)
  • This is not an edge case but a fundamental workflow pattern

3. WORKAROUND EXISTENCE: PAINFUL

Option | Description | Impact
Server-side | Disable data_fetch tool entirely | Breaks core functionality
Partial | Remove remote repository access | Limits use cases
Monitoring | Manual disk usage monitoring and user suspension | Reactive, not preventive

Assessment: No practical workaround for maintaining functionality while enforcing quotas.

4. REGRESSION STATUS: LIKELY INTRODUCED WITH CELERY FETCH PATH

The issue stems from architectural design where:

  • Celery-based data fetch jobs bypass traditional job enqueue path
  • Quota checks happen in JobWrapper.enqueue() which is not called for Celery fetch jobs
  • The _pause_job_if_over_quota() at line 1800 in lib/galaxy/jobs/__init__.py is never invoked

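This likely became an issue when the Celery fetch path gated by is_fetch_with_celery_enabled() became the default (Galaxy 23.0 according to the git history research above; the report itself is against 25.1).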

5. USER IMPACT SIGNALS

  • Issue explicitly filed by Galaxy maintainer (martenson)
  • Assigned to core developer (mvdbeek)
  • Labeled kind/bug, area/backend, and area/jobs
  • Affects resource management (critical for multi-tenant deployments)
  • Public Galaxy instances serving anonymous users are severely impacted

6. RECOMMENDATION: HOTFIX (Priority: CRITICAL)

Rationale:

  1. Resource exhaustion: Anonymous users can exhaust shared infrastructure
  2. Security/DoS vector: Could be exploited for denial of service against legitimate users
  3. Multi-tenant impact: Affects all users on shared Galaxy instances
  4. No workaround: Server operators cannot mitigate without disabling functionality

Implementation Strategy:

  1. Immediate (Hotfix): Enforce quota BEFORE Celery fetch job execution

    • Add quota check in setup_fetch_data() or start of fetch_data() in lib/galaxy/celery/tasks.py
    • Similar to _pause_job_if_over_quota() but in the Celery code path
    • Pause job and prevent fetching if user is already over quota
  2. Medium-term: Implement quota-aware streaming during fetch

    • Monitor accumulated data size during fetch process
    • Cancel/pause fetch if quota exceeded mid-stream
    • Prevent partial/orphaned datasets
  3. Testing: Add integration tests for:

    • Anonymous user with quota cannot fetch > quota
    • Fetch jobs are properly paused when quota exceeded
    • Different quota source labels are respected
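
The medium-term item above (quota-aware streaming) could look roughly like the following. This is a minimal sketch, not Galaxy's fetch code: `copy_with_budget` and `QuotaExceeded` are hypothetical names, and `budget_bytes` is assumed to be the user's remaining quota computed by the caller.

```python
import io
from typing import BinaryIO


class QuotaExceeded(Exception):
    """Raised when a download would exceed the remaining byte budget."""


def copy_with_budget(src: BinaryIO, dst: BinaryIO, budget_bytes: int,
                     chunk_size: int = 64 * 1024) -> int:
    """Copy src to dst in chunks, aborting before the budget is exceeded.

    Returns the number of bytes written on success.
    """
    written = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return written
        if written + len(chunk) > budget_bytes:
            raise QuotaExceeded(f"fetch would exceed remaining quota of {budget_bytes} bytes")
        dst.write(chunk)
        written += len(chunk)


src = io.BytesIO(b"x" * 1000)
dst = io.BytesIO()
try:
    copy_with_budget(src, dst, budget_bytes=300, chunk_size=128)
except QuotaExceeded:
    # Two full chunks were written before the third was refused.
    print(len(dst.getvalue()))  # 256
```

The caller still has to discard whatever was already written when `QuotaExceeded` is raised; that cleanup is what the "prevent partial/orphaned datasets" bullet refers to.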

Backporting:

  • Should be backported to all supported stable releases (25.x, 24.x)
  • Consider security implications for older versions in active use

Key Code Locations for Fix

  • /lib/galaxy/celery/tasks.py - setup_fetch_data() and fetch_data() - add quota check
  • /lib/galaxy/jobs/__init__.py - MinimalJobWrapper - may need quota checking capability
  • /lib/galaxy/quota/__init__.py - DatabaseQuotaAgent.is_over_quota() - already exists
  • Tests: /test/integration/test_quota.py - add Celery data_fetch specific tests

Summary

Issue #21642 represents a critical vulnerability in Galaxy's quota enforcement system that allows storage exhaustion on public instances. The problem is architectural—Celery-based data fetch jobs bypass the traditional job enqueue path where quota checks are performed. This requires a hotfix that enforces quotas in the Celery task chain. All public Galaxy instances with quotas enabled are at risk of disk space exhaustion.

Issue #21642: Implementation Plan

Issue Summary

GitHub Issue #21642: Remote repository data fetching not respecting storage quota in Galaxy 25.1 when Celery-based fetch is enabled (default).

Root Cause Analysis

Confirmed Root Cause (95% confidence): When is_fetch_with_celery_enabled() returns True (Galaxy 25.1 default), the __DATA_FETCH__ tool execution completely bypasses the traditional JobWrapper.enqueue() method where quota checks are performed.

Code Path Comparison

Path | Flow | Quota Check
Traditional (Non-Celery) | JobWrapper.enqueue() → _set_object_store_ids() → _pause_job_if_over_quota() | YES
Celery (Default in 25.1) | setup_fetch_data() → _set_object_store_ids() → fetch_data() → change_state(RUNNING) | NO

Key Code Locations

  1. /lib/galaxy/tools/execute.py (lines 206-231) - Celery task chain creation
  2. /lib/galaxy/celery/tasks.py (lines 227-257) - setup_fetch_data() function
  3. /lib/galaxy/celery/tasks.py (lines 323-335) - fetch_data() function
  4. /lib/galaxy/jobs/__init__.py (lines 1789-1809) - JobWrapper.enqueue() and _pause_job_if_over_quota()
  5. /lib/galaxy/quota/__init__.py (lines 372-386) - DatabaseQuotaAgent.is_over_quota()

Step-by-Step Implementation Plan

Step 1: Add Quota Check Method to MinimalJobWrapper

File: /lib/galaxy/jobs/__init__.py

Add a new method check_and_pause_if_over_quota() to MinimalJobWrapper class (around line 1560, after the pause() method):

def check_and_pause_if_over_quota(self, job=None) -> bool:
    """Check if user is over quota and pause job if so.

    Returns True if job was paused due to quota, False otherwise.
    """
    if job is None:
        job = self.get_job()

    # Get quota source map from object store
    quota_source_map = self.app.object_store.get_quota_source_map()

    # Check quota using the quota agent
    if self.app.quota_agent.is_over_quota(quota_source_map, job):
        log.info("(%d) User (%s) is over quota: job paused", job.id, job.user_id)
        message = "Execution of this dataset's job is paused because you were over your disk quota at the time it was ready to run"
        self.pause(job, message)
        return True
    return False

Step 2: Modify setup_fetch_data() to Check Quota

File: /lib/galaxy/celery/tasks.py

Modify setup_fetch_data() function (starting at line 227) to check quota after setting object store IDs:

@galaxy_task(bind=True)
def setup_fetch_data(
    self,
    job_id: int,
    raw_tool_source: str,
    app: MinimalManagerApp,
    sa_session: galaxy_scoped_session,
    task_user_id: Optional[int] = None,
):
    tool = cached_create_tool_from_representation(app=app, raw_tool_source=raw_tool_source)
    job = sa_session.get(Job, job_id)
    assert job
    job.handler = self.request.hostname
    job.job_runner_name = "celery"
    mini_job_wrapper = MinimalJobWrapper(job=job, app=app, tool=tool)
    mini_job_wrapper.change_state(model.Job.states.QUEUED, flush=False, job=job)
    mini_job_wrapper._set_object_store_ids(job)

    # NEW: Check quota after object store is assigned
    if mini_job_wrapper.check_and_pause_if_over_quota(job):
        sa_session.commit()
        # Return None to signal the task chain should not continue
        return None

    # ... rest of the function unchanged

Step 3: Modify fetch_data() to Handle Paused Jobs

File: /lib/galaxy/celery/tasks.py

Modify fetch_data() function (starting at line 323) to handle None return from setup_fetch_data():

@galaxy_task(action="Run fetch_data")
def fetch_data(
    setup_return,
    job_id: int,
    app: MinimalManagerApp,
    sa_session: galaxy_scoped_session,
    task_user_id: Optional[int] = None,
) -> Optional[str]:
    # NEW: If setup_return is None, job was paused due to quota
    if setup_return is None:
        log.info("(%d) Fetch job was paused (likely due to quota), skipping execution", job_id)
        return None

    job = sa_session.get(Job, job_id)
    assert job

    # NEW: Double-check job state - don't proceed if paused
    if job.state == model.Job.states.PAUSED:
        log.info("(%d) Job is paused, skipping fetch execution", job_id)
        return None

    # ... rest of the function unchanged

Step 4: Update finish_job() to Handle Paused Jobs

File: /lib/galaxy/celery/tasks.py

Modify finish_job() function to handle paused jobs:

@galaxy_task
def finish_job(
    job_id: int,
    raw_tool_source: str,
    app: MinimalManagerApp,
    sa_session: galaxy_scoped_session,
    task_user_id: Optional[int] = None,
):
    tool = cached_create_tool_from_representation(app=app, raw_tool_source=raw_tool_source)
    job = sa_session.get(Job, job_id)
    assert job

    # NEW: Don't finish if job is paused (quota exceeded)
    if job.state == model.Job.states.PAUSED:
        log.info("(%d) Job is paused, skipping finish", job_id)
        return

    # ... rest of the function unchanged

Step 5: Update set_job_metadata() to Handle Paused Jobs

File: /lib/galaxy/celery/tasks.py

Modify set_job_metadata() function to handle None input:

@galaxy_task(action="set metadata for job")
def set_job_metadata(
    tool_job_working_directory,
    extended_metadata_collection: bool,
    job_id: int,
    sa_session: galaxy_scoped_session,
    task_user_id: Optional[int] = None,
) -> None:
    # NEW: If working directory is None, job was paused
    if tool_job_working_directory is None:
        log.info("(%d) Job metadata skipped - job was paused", job_id)
        return None

    # ... rest of the function unchanged
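
Steps 3 to 5 are needed because returning None from a Celery task does not stop a chain: a `.s()` signature still receives the previous task's return value as its first argument, and an immutable `.si()` signature runs regardless of it. The pure-Python model below illustrates that propagation without requiring Celery. `run_chain` and the toy task bodies are hypothetical stand-ins that only share names with the real tasks.

```python
from typing import Any, Callable, List, Tuple

# (function, immutable?) pairs; immutable steps ignore the previous result, like .si()
Step = Tuple[Callable[..., Any], bool]


def run_chain(steps: List[Step]) -> Any:
    """Run steps in order, feeding each result to the next non-immutable step."""
    result = None
    for index, (func, immutable) in enumerate(steps):
        result = func() if index == 0 or immutable else func(result)
    return result


calls: List[str] = []
job_state = "paused"  # pretend setup paused the job because of quota


def setup_fetch_data():
    calls.append("setup")
    return None  # sentinel: job was paused


def fetch_data(setup_return):
    if setup_return is None:
        calls.append("fetch skipped")
        return None
    calls.append("fetch ran")
    return "/hypothetical/working/dir"


def set_job_metadata(working_directory):
    if working_directory is None:
        calls.append("metadata skipped")
        return None
    calls.append("metadata ran")


def finish_job():
    # Receives no upstream value, so it has to look at the job state instead.
    calls.append("finish skipped" if job_state == "paused" else "finish ran")


run_chain([(setup_fetch_data, False), (fetch_data, False),
           (set_job_metadata, False), (finish_job, True)])
print(calls)  # ['setup', 'fetch skipped', 'metadata skipped', 'finish skipped']
```

Every task still runs, which is why each one needs its own guard: the `.s()` tasks can test the sentinel, while the `.si()` task has to consult the job state.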

Test Strategy

Unit Tests

File: /test/unit/celery/test_fetch_data_quota.py (new file)

Test cases:

  • test_setup_fetch_data_pauses_job_when_over_quota
  • test_fetch_data_skips_execution_when_setup_returns_none
  • test_fetch_data_skips_paused_job
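
A unit test for the quota check could be shaped like the sketch below. It does not import Galaxy: `StubJobWrapper` is a hypothetical stand-in that copies only the logic of the proposed check_and_pause_if_over_quota() method, and the app, quota agent, and job are `unittest.mock.MagicMock` objects. A real test would exercise MinimalJobWrapper itself.

```python
from unittest.mock import MagicMock


class StubJobWrapper:
    """Hypothetical stand-in carrying only what the quota check needs."""

    def __init__(self, app):
        self.app = app
        self.pause = MagicMock()  # records calls instead of changing job state

    def check_and_pause_if_over_quota(self, job) -> bool:
        quota_source_map = self.app.object_store.get_quota_source_map()
        if self.app.quota_agent.is_over_quota(quota_source_map, job):
            self.pause(job, "over quota")
            return True
        return False


def make_wrapper(over_quota: bool) -> StubJobWrapper:
    app = MagicMock()
    app.quota_agent.is_over_quota.return_value = over_quota
    return StubJobWrapper(app)


def test_pauses_job_when_over_quota():
    wrapper, job = make_wrapper(over_quota=True), MagicMock()
    assert wrapper.check_and_pause_if_over_quota(job) is True
    wrapper.pause.assert_called_once()


def test_does_not_pause_when_under_quota():
    wrapper, job = make_wrapper(over_quota=False), MagicMock()
    assert wrapper.check_and_pause_if_over_quota(job) is False
    wrapper.pause.assert_not_called()


test_pauses_job_when_over_quota()
test_does_not_pause_when_under_quota()
```

Mocking the quota agent keeps the test independent of a database, at the cost of not covering how the real agent treats anonymous users.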

Integration Tests

File: /test/integration/test_fetch_quota.py (new file)

Test cases:

  • test_fetch_paused_when_over_quota - Verify fetching data pauses job when user is over quota
  • test_fetch_from_url_respects_quota - Verify fetching from remote URLs respects quota

Potential Edge Cases and Risks

  1. Race Conditions: Multiple concurrent fetch jobs could each pass the quota check before any complete
  2. Quota Calculation Timing: Quota is checked before actual data download; final size unknown until complete
  3. Session Management: Celery tasks run in separate workers; ensure proper session commits
  4. Error Handling in Task Chain: Ensure all downstream tasks handle None gracefully
  5. User Object Stores: Verify is_over_quota() works correctly in Celery context
  6. Anonymous Users: Verify quota check works for job.user = None cases

Backporting Considerations

Target Branches:

  • release_25.1 (immediate fix needed)
  • release_25.0 (if applicable)

Backporting Steps:

  1. Check for API differences in MinimalJobWrapper between versions
  2. Verify Celery task signatures match
  3. Test both enable_celery_tasks=True and False paths
  4. Consider documenting the existing switch for disabling Celery fetch (made configurable in PR #15767, per the git history above) as a temporary workaround

Implementation Order

  1. Add check_and_pause_if_over_quota() to MinimalJobWrapper
  2. Modify setup_fetch_data() with quota check
  3. Update fetch_data() to handle None/paused
  4. Update finish_job() to skip paused jobs
  5. Update set_job_metadata() to handle None
  6. Add unit tests
  7. Add integration tests
  8. Manual testing on development instance
  9. Create PR and run CI

Critical Files for Implementation

File | Changes Needed
/lib/galaxy/jobs/__init__.py | Add check_and_pause_if_over_quota() to MinimalJobWrapper
/lib/galaxy/celery/tasks.py | Modify setup_fetch_data(), fetch_data(), finish_job(), set_job_metadata()
/lib/galaxy/quota/__init__.py | Reference only (no changes needed)
/test/integration/objectstore/test_quota_limit.py | Pattern to follow for integration tests
/lib/galaxy/tools/execute.py | Reference only (no changes needed)

Issue #21642: Triage Summary

Top-Line Summary

Issue: Remote repository data fetching does not respect storage quota in Galaxy 25.1+

Root Cause: When Celery-based data fetch is enabled (default since Galaxy 23.0), the __DATA_FETCH__ tool execution bypasses the traditional JobWrapper.enqueue() method where quota checks are performed. The Celery task chain (setup_fetch_data → fetch_data → set_job_metadata → finish_job) directly executes without calling _pause_job_if_over_quota(). This is a regression introduced in Galaxy 23.0 (March 2023, PR #15767) when Celery fetch became the default, and has been undetected for approximately 2 years.

Most Probable Fix: Add quota checking to the setup_fetch_data() Celery task in /lib/galaxy/celery/tasks.py by calling a new check_and_pause_if_over_quota() method on MinimalJobWrapper after object store IDs are set. All downstream tasks in the chain must be updated to handle paused jobs gracefully.


Importance Assessment Summary

Criterion | Assessment
Severity | CRITICAL - Enables complete bypass of quota enforcement
Blast Radius | HIGH - Affects all public-facing Galaxy instances with quotas enabled
Workaround | PAINFUL - No practical workaround without disabling core functionality
Regression Status | REGRESSION - Introduced in Galaxy 23.0 (March 2023)
Priority Recommendation | HOTFIX - Should be backported to all supported releases

Discussion Questions

  1. Concurrent fetch jobs: Multiple simultaneous fetch requests could each pass quota checks before any complete. Should we implement a quota reservation mechanism to prevent race conditions?

  2. Unknown file sizes: For remote URL fetches, the final file size isn't known until download completes. Should we implement:

    • A Content-Length based pre-check?
    • Mid-stream cancellation if quota exceeded during download?
  3. Backporting scope: The fix should be backported to 25.1, but should it also go to 25.0 and earlier supported releases?

  4. Testing coverage: The existing quota tests don't cover the Celery fetch path. What integration test scenarios should be prioritized?

  5. Anonymous user impact: Anonymous users with quotas appear to be the most affected. Are there specific configurations or use cases we should test?
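
Questions 1 and 2 could be explored with something like the following sketch of a reservation ledger combined with a Content-Length pre-check. Everything here is an assumption for discussion: `QuotaLedger` and `declared_size` are hypothetical names, the ledger lives in process memory (a real implementation would need shared state such as the database), and Content-Length is only advisory because servers may omit or misreport it.

```python
import threading
from typing import Mapping, Optional


def declared_size(headers: Mapping[str, str]) -> Optional[int]:
    """Return Content-Length as a non-negative int, or None if absent or malformed."""
    value = {key.lower(): val for key, val in headers.items()}.get("content-length")
    if value is None:
        return None
    try:
        size = int(value)
    except ValueError:
        return None
    return size if size >= 0 else None


class QuotaLedger:
    """Tracks used and reserved bytes so concurrent fetches cannot overcommit."""

    def __init__(self, quota_bytes: int, used_bytes: int = 0):
        self._lock = threading.Lock()
        self._quota = quota_bytes
        self._used = used_bytes
        self._reserved = 0

    def reserve(self, size_bytes: int) -> bool:
        # Check and reserve atomically so two jobs cannot both claim the same space.
        with self._lock:
            if self._used + self._reserved + size_bytes > self._quota:
                return False
            self._reserved += size_bytes
            return True

    def settle(self, reserved_bytes: int, actual_bytes: int) -> None:
        # Release the reservation and record what was really written.
        with self._lock:
            self._reserved -= reserved_bytes
            self._used += actual_bytes


ledger = QuotaLedger(quota_bytes=100)
size = declared_size({"Content-Length": "80"})
print(ledger.reserve(size))  # True
print(ledger.reserve(size))  # False
```

The second reservation is refused even though no bytes have been downloaded yet, which is the property a check on current usage alone lacks. Fetches with no declared size would still need a mid-stream limit.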


Effort Estimate

Aspect | Assessment
Implementation Effort | Medium - 5 files to modify, well-scoped changes
Testing Complexity | Medium - Requires Celery worker setup for integration tests
Reproduction Difficulty | Easy - Set up quota, enable Celery fetch (default), fetch data
Risk Level | Low - Changes are additive, existing code paths unchanged

Key Files

  • /lib/galaxy/celery/tasks.py - Primary fix location
  • /lib/galaxy/jobs/__init__.py - Add quota check method to MinimalJobWrapper
  • /lib/galaxy/tools/execute.py - Reference for task chain (no changes)
  • /lib/galaxy/quota/__init__.py - Existing quota check implementation (no changes)

Related Issues

  • Issue #20637 - Similar quota issue on a different code path (unlike that issue, this one does not go through Pulsar)