mvdbeek/ISSUE_21642.md

## ISSUE_21642.md

      
    Issue #21642: Fetching data from repositories does not seem to respect storage quota

State: OPEN
Author: martenson
Labels: area/backend, area/jobs, kind/bug
Assignees: mvdbeek
Comments: 1
Description

I have a 25.1 test Galaxy instance set up and as an anon user I have a quota of 100 MB.
However there is seemingly no limit on how many things I can fetch from remote repositories. At the moment my history has 3 GB and I can request more datasets to be fetched without any issues.
Looking at the logs, the fetching jobs never seem to be paused, so this is likely a different code path than galaxyproject/galaxy#20637 (it also does not go through Pulsar).
Related Issue


galaxyproject/galaxy#20637 - Similar quota-related issue (different code path)


## ISSUE_21642_CODE_RESEARCH.md

      
    Issue #21642: Code Research - Remote Repository Data Fetching Not Respecting Storage Quota

Problem Statement

Galaxy 25.1 test instance with an anonymous user having 100 MB quota allows fetching unlimited data (3 GB observed) from remote repositories. Fetching jobs are never paused. The code path differs from issue #20637 and doesn't involve Pulsar.
Key Code Locations

File Structure:


/lib/galaxy/tools/data_fetch.py (Lines 51-73) - Main do_fetch() function
/lib/galaxy/tools/execute.py (Lines 206-235) - Tool execution with Celery decision
/lib/galaxy/celery/tasks.py (Lines 227-335) - Celery task definitions
/lib/galaxy/jobs/__init__.py (Lines 1789-1809) - JobWrapper quota checking
/lib/galaxy/quota/__init__.py (Lines 372-386) - Quota agent implementation
/lib/galaxy/config/__init__.py (Lines 1428-1437) - Celery config check

Root Cause Analysis

The Problem: When is_fetch_with_celery_enabled() returns True (Galaxy 25.1 default), the __DATA_FETCH__ tool execution bypasses the traditional JobWrapper.enqueue() method that performs quota checks. Instead, it executes as a Celery task chain that completely skips quota validation.
Execution Flow Comparison

Traditional Path (Non-Celery):
JobWrapper.enqueue() → _set_object_store_ids() → _pause_job_if_over_quota() ✓

Celery Path (Default in 25.1):
setup_fetch_data() → _set_object_store_ids() [NO QUOTA CHECK] ✗
fetch_data() → change_state(RUNNING) [NO QUOTA CHECK] ✗
_fetch_data() → actual fetching
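
The difference between the two flows can be illustrated with a toy model. Everything in this sketch (`FakeJob`, `FakeWrapper`, `run_traditional`, `run_celery_like`, and the "usage greater than quota" rule) is a hypothetical stand-in, not Galaxy code; it only mirrors the order of calls listed above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeJob:
    """Hypothetical stand-in for a job record; not the real Galaxy model."""

    usage_bytes: int
    quota_bytes: int
    state: str = "new"
    calls: List[str] = field(default_factory=list)


class FakeWrapper:
    """Toy wrapper that records which steps ran."""

    def __init__(self, job: FakeJob) -> None:
        self.job = job

    def set_object_store_ids(self) -> None:
        self.job.calls.append("set_object_store_ids")

    def pause_if_over_quota(self) -> bool:
        # Assumption: "over quota" means recorded usage exceeds the quota.
        self.job.calls.append("pause_if_over_quota")
        if self.job.usage_bytes > self.job.quota_bytes:
            self.job.state = "paused"
            return True
        return False


def run_traditional(job: FakeJob) -> str:
    """Mirrors the traditional flow: assign object store, then check quota."""
    wrapper = FakeWrapper(job)
    wrapper.set_object_store_ids()
    if wrapper.pause_if_over_quota():
        return job.state
    job.state = "running"
    return job.state


def run_celery_like(job: FakeJob) -> str:
    """Mirrors the Celery flow as described: no quota check before RUNNING."""
    wrapper = FakeWrapper(job)
    wrapper.set_object_store_ids()
    job.state = "running"
    return job.state
```

With a 100 MB quota and 3 GB of usage, the first function leaves the job paused while the second moves it to running, which matches the behavior reported in the issue.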

Three Theories on Root Cause

Theory 1: Missing Quota Check in Celery Tasks (MOST LIKELY - 95% probability)

In /lib/galaxy/celery/tasks.py, the setup_fetch_data() function (line 245) sets object store IDs but never calls _pause_job_if_over_quota(). Similarly, the fetch_data() function (line 334) transitions the job directly to RUNNING state without any quota verification.
The quota check method exists in /lib/galaxy/jobs/__init__.py lines 1804-1809 but is only called from JobWrapper.enqueue() which is bypassed entirely for Celery-based fetch jobs.
Theory 2: Session Management Issue in Setup Task (60% probability)

setup_fetch_data() is a setup callback that returns values to the main task. Even if a quota check were added here, changes to job state might not persist properly because of the database session lifecycle in async contexts. Under this theory, the quota check would need to occur in the main fetch_data() task with proper session management.
Theory 3: Quota Tracking Before Job Execution (30% probability)

The quota agent's is_over_quota() method only checks current disk usage. If output datasets created during job setup don't have their sizes reflected in the quota calculation, quota checks might incorrectly show available space. Multiple concurrent fetch jobs could each pass checks individually.
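
The concurrency concern in Theory 3 can be shown with a small simulation. The functions below are hypothetical and only model a check that looks at already-recorded usage; they are not the quota agent's actual logic.

```python
from typing import List

MB = 1_000_000


def is_over_quota(recorded_usage: int, quota: int) -> bool:
    """Toy check that only looks at usage already recorded."""
    return recorded_usage > quota


def admit_concurrent_fetches(recorded_usage: int, quota: int, pending_sizes: List[int]) -> List[int]:
    """Admit every pending fetch whose check runs before any download finishes.

    Recorded usage does not change between checks, so each job sees the same value.
    """
    return [size for size in pending_sizes if not is_over_quota(recorded_usage, quota)]


# Three 50 MB fetches are requested while 90 MB of a 100 MB quota is in use.
admitted = admit_concurrent_fetches(recorded_usage=90 * MB, quota=100 * MB, pending_sizes=[50 * MB] * 3)
final_usage = 90 * MB + sum(admitted)
```

All three fetches pass the check individually, and the final usage ends up well above the quota. A check at job start therefore limits the overshoot but does not bound it.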
Critical Code Sections

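Normal Quota Check (JobWrapper.enqueue, lines 1800-1809):

```python
def _pause_job_if_over_quota(self, job):
    quota_source_map = self.app.object_store.get_quota_source_map()
    if self.app.quota_agent.is_over_quota(quota_source_map, job):
        log.info("(%d) User (%s) is over quota: job paused", job.id, job.user_id)
        # The excerpt omitted the message definition; this text is the one used in the implementation plan.
        message = "Execution of this dataset's job is paused because you were over your disk quota at the time it was ready to run"
        self.pause(job, message)
```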
Celery Setup Task (Missing Check, line 245):

```python
def setup_fetch_data(...):
    mini_job_wrapper._set_object_store_ids(job)  # Object store assigned
    # *** MISSING: _pause_job_if_over_quota() call here ***
    return mini_job_wrapper.working_directory, ...
```

Celery Main Task (No Check Before Execution, lines 334-335):

```python
def fetch_data(...):
    mini_job_wrapper.change_state(model.Job.states.RUNNING, flush=True, job=job)
    return abort_when_job_stops(_fetch_data, ...)  # Starts fetching without quota check
```
Summary

The root cause is that Celery-based data fetch jobs bypass the traditional job enqueue path where quota checks are performed. The fix should add quota checking to the Celery task chain, preferably in setup_fetch_data() or at the start of fetch_data() before changing the job state to RUNNING.

  
## ISSUE_21642_HISTORY.md

      
    Issue #21642: Git History Research

Timeline of Celery Fetch Implementation

April 1, 2022 - Initial Celery Fetch Implementation

Commit: aa67d9dd387
Author: mvdbeek

Created setup_fetch_data and fetch_data Celery tasks
No quota checks were included in the new path
This was the initial implementation of Celery-based data fetching

March 10, 2023 - Celery Fetch Made Configurable

Commit: f35e8f8288e
Author: John Davis
PR: #15767
Target: release_23.0

Added is_fetch_with_celery_enabled() function
Allowed disabling Celery fetch as a workaround, but not a fix
Made Celery fetch the default behavior

February 10, 2025 - Quota Check Added to Job Enqueue

Commit: ecaa747104a
Author: davelopez

Added _pause_job_if_over_quota() to MinimalJobWrapper.enqueue()
CRITICAL ISSUE: This only fixes the traditional path, NOT the Celery path
Celery tasks still bypass enqueue() entirely

The Core Problem

The Celery path in execute() at line 206 of /lib/galaxy/tools/execute.py completely bypasses the job enqueue mechanism:
```python
# Celery Path (NO quota check)
setup_fetch_data.s() | fetch_data.s() | set_job_metadata.s() | finish_job.si()

# Traditional Path (WITH quota check as of Feb 10, 2025)
tool.app.job_manager.enqueue(job2, tool=tool)
```
Key Authors


| Author | Contribution |
| --- | --- |
| mvdbeek | Original Celery fetch implementation, currently assigned to issue |
| John Davis | Made Celery fetch configurable (PR #15767) |
| davelopez | Added quota check to `MinimalJobWrapper.enqueue()` (Feb 2025) |


Regression Assessment


Type: REGRESSION
Introduced: Galaxy 23.0 (March 2023) when Celery fetch became default
Duration: ~2 years (Mar 2023 → Feb 2025) before quota checks were revisited; the Celery path is still unchecked
Severity: High (quota is a security/fairness control)

Root Cause Summary

The Celery implementation didn't replicate all job enqueue-time checks. When the Celery path was introduced, it was designed for performance but inadvertently bypassed the quota enforcement that exists in the traditional job handling path.
The recent fix by davelopez (Feb 2025) added quota checking to MinimalJobWrapper.enqueue(), but this doesn't help the Celery path because Celery tasks never call enqueue() - they directly execute the fetch operations.
Key Files


| File | Line | Description |
| --- | --- | --- |
| `/lib/galaxy/tools/execute.py` | 206 | Decides which path (Celery vs traditional) to use |
| `/lib/galaxy/celery/tasks.py` | 227-335 | Celery tasks with no quota checks |
| `/lib/galaxy/jobs/__init__.py` | 1612 | New quota check (doesn't help Celery path) |
| `/lib/galaxy/config/__init__.py` | 1428 | `is_fetch_with_celery_enabled()` function |


Related PRs and Issues


PR #15767 - Made Celery fetch configurable (March 2023)
Issue #20637 - Similar quota issue with different code path (mentioned in original issue)


## ISSUE_21642_IMPORTANCE.md

      
    Issue #21642: Importance Assessment

1. SEVERITY: CRITICAL

Rationale:

Enables complete bypass of storage quota enforcement for anonymous users
Allows unlimited data consumption on quota-limited systems
Can cause disk space exhaustion on public Galaxy instances
Represents a fundamental failure of a core resource management feature
Anonymous users (the weakest point from a security perspective) can exploit this most easily

2. BLAST RADIUS: HIGH - Affects all public-facing Galaxy instances

Who is affected:

All Galaxy instances with:

Quotas enabled (especially those serving anonymous users)
Data fetch/import functionality (core feature)
Remote data sources configured (common in shared servers)


Specific vulnerability:

Anonymous users with quotas can consume unlimited storage
Logged-in users may also be affected (quota only enforced after fetch completes)
This is not an edge case but a fundamental workflow pattern

3. WORKAROUND EXISTENCE: PAINFUL


| Option | Description | Impact |
| --- | --- | --- |
| Server-side | Disable data_fetch tool entirely | Breaks core functionality |
| Partial | Remove remote repository access | Limits use cases |
| Monitoring | Manual disk usage monitoring and user suspension | Reactive, not preventive |


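Assessment: None of the options above preserves functionality while enforcing quotas. The history research records that Celery fetch was made configurable (PR #15767), so disabling it and falling back to the traditional path may be a less painful workaround; that option is not evaluated here.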
4. REGRESSION STATUS: LIKELY INTRODUCED WITH CELERY FETCH PATH

The issue stems from architectural design where:

Celery-based data fetch jobs bypass traditional job enqueue path
Quota checks happen in JobWrapper.enqueue() which is not called for Celery fetch jobs
The _pause_job_if_over_quota() at line 1800 in lib/galaxy/jobs/__init__.py is never invoked

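This likely became an issue when is_fetch_with_celery_enabled() became the default path. The history research dates that change to Galaxy 23.0 (March 2023, PR #15767); the issue itself was observed on Galaxy 25.1.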
5. USER IMPACT SIGNALS


Issue explicitly filed by Galaxy maintainer (martenson)
Assigned to core developer (mvdbeek)
Labeled kind/bug, area/backend, and area/jobs
Affects resource management (critical for multi-tenant deployments)
Public Galaxy instances serving anonymous users are severely impacted

6. RECOMMENDATION: HOTFIX (Priority: CRITICAL)

Rationale:

Resource exhaustion: Anonymous users can exhaust shared infrastructure
Security/DoS vector: Could be exploited for denial of service against legitimate users
Multi-tenant impact: Affects all users on shared Galaxy instances
No workaround: Server operators cannot mitigate without disabling functionality

Implementation Strategy:


Immediate (Hotfix): Enforce quota BEFORE Celery fetch job execution

Add quota check in setup_fetch_data() or start of fetch_data() in lib/galaxy/celery/tasks.py
Similar to _pause_job_if_over_quota() but in the Celery code path
Pause job and prevent fetching if user is already over quota


Medium-term: Implement quota-aware streaming during fetch

Monitor accumulated data size during fetch process
Cancel/pause fetch if quota exceeded mid-stream
Prevent partial/orphaned datasets


Testing: Add integration tests for:

Anonymous user with quota cannot fetch > quota
Fetch jobs are properly paused when quota exceeded
Different quota source labels are respected


Backporting:

Should be backported to all supported stable releases (25.x, 24.x)
Consider security implications for older versions in active use
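
The medium-term item above (quota-aware streaming) could look roughly like the following. This is a minimal sketch under stated assumptions: `QuotaExceeded` and `stream_within_quota` are hypothetical names, and the remaining quota is assumed to be known as a byte count before the download starts.

```python
from typing import Iterable, Iterator


class QuotaExceeded(Exception):
    """Raised when a download would exceed the remaining quota (hypothetical)."""


def stream_within_quota(chunks: Iterable[bytes], remaining_bytes: int) -> Iterator[bytes]:
    """Yield chunks while tracking the running total against the remaining quota.

    The chunk that would cross the limit is not yielded, so the caller can
    stop writing and clean up the partial dataset.
    """
    received = 0
    for chunk in chunks:
        received += len(chunk)
        if received > remaining_bytes:
            raise QuotaExceeded(f"received {received} bytes, only {remaining_bytes} allowed")
        yield chunk
```

A caller would wrap the response body iterator with `stream_within_quota()` and treat `QuotaExceeded` as the signal to pause or cancel the job and remove the partial file.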

Key Code Locations for Fix


/lib/galaxy/celery/tasks.py - setup_fetch_data() and fetch_data() - add quota check
/lib/galaxy/jobs/__init__.py - MinimalJobWrapper - may need quota checking capability
/lib/galaxy/quota/__init__.py - DatabaseQuotaAgent.is_over_quota() - already exists
Tests: /test/integration/test_quota.py - add Celery data_fetch specific tests

Summary

Issue #21642 represents a critical vulnerability in Galaxy's quota enforcement system that allows storage exhaustion on public instances. The problem is architectural: Celery-based data fetch jobs bypass the traditional job enqueue path where quota checks are performed. This requires a hotfix that enforces quotas in the Celery task chain. All public Galaxy instances with quotas enabled are at risk of disk space exhaustion.

  
## ISSUE_21642_PLAN.md

      
    Issue #21642: Implementation Plan

Issue Summary

GitHub Issue #21642: Remote repository data fetching not respecting storage quota in Galaxy 25.1 when Celery-based fetch is enabled (default).
Root Cause Analysis

Most Likely Root Cause (95% confidence):
When is_fetch_with_celery_enabled() returns True (Galaxy 25.1 default), the __DATA_FETCH__ tool execution completely bypasses the traditional JobWrapper.enqueue() method where quota checks are performed.
Code Path Comparison


| Path | Flow | Quota Check |
| --- | --- | --- |
| Traditional (Non-Celery) | `JobWrapper.enqueue()` → `_set_object_store_ids()` → `_pause_job_if_over_quota()` | YES |
| Celery (Default in 25.1) | `setup_fetch_data()` → `_set_object_store_ids()` → `fetch_data()` → `change_state(RUNNING)` | NO |


Key Code Locations


/lib/galaxy/tools/execute.py (lines 206-231) - Celery task chain creation
/lib/galaxy/celery/tasks.py (lines 227-257) - setup_fetch_data() function
/lib/galaxy/celery/tasks.py (lines 323-335) - fetch_data() function
/lib/galaxy/jobs/__init__.py (lines 1789-1809) - JobWrapper.enqueue() and _pause_job_if_over_quota()
/lib/galaxy/quota/__init__.py (lines 372-386) - DatabaseQuotaAgent.is_over_quota()


Step-by-Step Implementation Plan

Step 1: Add Quota Check Method to MinimalJobWrapper

File: /lib/galaxy/jobs/__init__.py
Add a new method check_and_pause_if_over_quota() to MinimalJobWrapper class (around line 1560, after the pause() method):
```python
def check_and_pause_if_over_quota(self, job=None) -> bool:
    """Check if user is over quota and pause job if so.

    Returns True if job was paused due to quota, False otherwise.
    """
    if job is None:
        job = self.get_job()

    # Get quota source map from object store
    quota_source_map = self.app.object_store.get_quota_source_map()

    # Check quota using the quota agent
    if self.app.quota_agent.is_over_quota(quota_source_map, job):
        log.info("(%d) User (%s) is over quota: job paused", job.id, job.user_id)
        message = "Execution of this dataset's job is paused because you were over your disk quota at the time it was ready to run"
        self.pause(job, message)
        return True
    return False
```
Step 2: Modify setup_fetch_data() to Check Quota

File: /lib/galaxy/celery/tasks.py
Modify setup_fetch_data() function (starting at line 227) to check quota after setting object store IDs:
```python
@galaxy_task(bind=True)
def setup_fetch_data(
    self,
    job_id: int,
    raw_tool_source: str,
    app: MinimalManagerApp,
    sa_session: galaxy_scoped_session,
    task_user_id: Optional[int] = None,
):
    tool = cached_create_tool_from_representation(app=app, raw_tool_source=raw_tool_source)
    job = sa_session.get(Job, job_id)
    assert job
    job.handler = self.request.hostname
    job.job_runner_name = "celery"
    mini_job_wrapper = MinimalJobWrapper(job=job, app=app, tool=tool)
    mini_job_wrapper.change_state(model.Job.states.QUEUED, flush=False, job=job)
    mini_job_wrapper._set_object_store_ids(job)

    # NEW: Check quota after object store is assigned
    if mini_job_wrapper.check_and_pause_if_over_quota(job):
        sa_session.commit()
        # Returning None does not stop the chain by itself; it is a marker
        # that the downstream tasks (Steps 3-5) detect in order to skip their work.
        return None

    # ... rest of the function unchanged
```
Step 3: Modify fetch_data() to Handle Paused Jobs

File: /lib/galaxy/celery/tasks.py
Modify fetch_data() function (starting at line 323) to handle None return from setup_fetch_data():
```python
@galaxy_task(action="Run fetch_data")
def fetch_data(
    setup_return,
    job_id: int,
    app: MinimalManagerApp,
    sa_session: galaxy_scoped_session,
    task_user_id: Optional[int] = None,
) -> Optional[str]:  # was "-> str"; the paused branches below return None
    # NEW: If setup_return is None, job was paused due to quota
    if setup_return is None:
        log.info("(%d) Fetch job was paused (likely due to quota), skipping execution", job_id)
        return None

    job = sa_session.get(Job, job_id)
    assert job

    # NEW: Double-check job state - don't proceed if paused
    if job.state == model.Job.states.PAUSED:
        log.info("(%d) Job is paused, skipping fetch execution", job_id)
        return None

    # ... rest of the function unchanged
```
Step 4: Update finish_job() to Handle Paused Jobs

File: /lib/galaxy/celery/tasks.py
Modify finish_job() function to handle paused jobs:
```python
@galaxy_task
def finish_job(
    job_id: int,
    raw_tool_source: str,
    app: MinimalManagerApp,
    sa_session: galaxy_scoped_session,
    task_user_id: Optional[int] = None,
):
    tool = cached_create_tool_from_representation(app=app, raw_tool_source=raw_tool_source)
    job = sa_session.get(Job, job_id)
    assert job

    # NEW: Don't finish if job is paused (quota exceeded)
    if job.state == model.Job.states.PAUSED:
        log.info("(%d) Job is paused, skipping finish", job_id)
        return

    # ... rest of the function unchanged
```
Step 5: Update set_job_metadata() to Handle Paused Jobs

File: /lib/galaxy/celery/tasks.py
Modify set_job_metadata() function to handle None input:
```python
@galaxy_task(action="set metadata for job")
def set_job_metadata(
    tool_job_working_directory,
    extended_metadata_collection: bool,
    job_id: int,
    sa_session: galaxy_scoped_session,
    task_user_id: Optional[int] = None,
) -> None:
    # NEW: If working directory is None, job was paused
    if tool_job_working_directory is None:
        log.info("(%d) Job metadata skipped - job was paused", job_id)
        return None

    # ... rest of the function unchanged
```
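
Steps 2 through 5 rely on how a Celery chain passes values along: a regular signature (`.s()`) receives the previous task's return value as its first argument, while an immutable signature (`.si()`) ignores it, and a `None` return does not stop the chain on its own. The sketch below imitates that behavior in plain Python so it runs without Celery; `run_chain` and the four toy functions are hypothetical stand-ins, not Galaxy tasks.

```python
from typing import Any, Callable, List, Tuple

# Each step is (function, immutable). Immutable steps ignore the previous result.
Step = Tuple[Callable[..., Any], bool]


def run_chain(steps: List[Step]) -> List[Any]:
    """Run steps in order, feeding each result to the next mutable step."""
    results: List[Any] = []
    previous: Any = None
    for index, (func, immutable) in enumerate(steps):
        if index == 0 or immutable:
            previous = func()
        else:
            previous = func(previous)
        results.append(previous)
    return results


def setup() -> None:
    """Pretend the job was paused for quota, so setup returns None."""
    return None


def fetch(setup_return: Any) -> Any:
    return None if setup_return is None else "working_dir"


def set_metadata(working_dir: Any) -> str:
    return "skipped" if working_dir is None else "metadata set"


def finish() -> str:
    """Immutable step: it never sees the upstream None, so it must check job state itself."""
    return "finish ran"


outcome = run_chain([(setup, False), (fetch, False), (set_metadata, False), (finish, True)])
```

Every step still runs after `setup` returns `None`, which is why each downstream task needs its own guard, and why `finish_job()` has to look at the job state rather than at an argument.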

Test Strategy

Unit Tests

File: /test/unit/celery/test_fetch_data_quota.py (new file)
Test cases:

test_setup_fetch_data_pauses_job_when_over_quota
test_fetch_data_skips_execution_when_setup_returns_none
test_fetch_data_skips_paused_job
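
A unit test for the Step 1 method might be structured as follows. This is a sketch, not a test against the real class: `WrapperUnderTest` is a hypothetical stand-in that carries only a copy of the proposed method, and the app, quota agent, and job are `unittest.mock.MagicMock` objects.

```python
import logging
from unittest.mock import MagicMock

log = logging.getLogger(__name__)

PAUSE_MESSAGE = (
    "Execution of this dataset's job is paused because you were over your "
    "disk quota at the time it was ready to run"
)


class WrapperUnderTest:
    """Hypothetical stand-in holding only the method proposed in Step 1."""

    def __init__(self, app: MagicMock) -> None:
        self.app = app
        self.pause = MagicMock()

    def check_and_pause_if_over_quota(self, job) -> bool:
        quota_source_map = self.app.object_store.get_quota_source_map()
        if self.app.quota_agent.is_over_quota(quota_source_map, job):
            log.info("(%d) User (%s) is over quota: job paused", job.id, job.user_id)
            self.pause(job, PAUSE_MESSAGE)
            return True
        return False


def make_wrapper(over_quota: bool) -> WrapperUnderTest:
    """Build a wrapper whose mocked quota agent reports the given state."""
    app = MagicMock()
    app.quota_agent.is_over_quota.return_value = over_quota
    return WrapperUnderTest(app)


def test_pauses_when_over_quota() -> None:
    wrapper = make_wrapper(over_quota=True)
    job = MagicMock(id=1, user_id=None)  # user_id=None models an anonymous user
    assert wrapper.check_and_pause_if_over_quota(job) is True
    wrapper.pause.assert_called_once_with(job, PAUSE_MESSAGE)


def test_does_not_pause_when_under_quota() -> None:
    wrapper = make_wrapper(over_quota=False)
    job = MagicMock(id=2, user_id=7)
    assert wrapper.check_and_pause_if_over_quota(job) is False
    wrapper.pause.assert_not_called()
```

The real tests would exercise `MinimalJobWrapper` and the Celery tasks directly; the mock-based shape above only shows which interactions are worth asserting.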

Integration Tests

File: /test/integration/test_fetch_quota.py (new file)
Test cases:

test_fetch_paused_when_over_quota - Verify fetching data pauses job when user is over quota
test_fetch_from_url_respects_quota - Verify fetching from remote URLs respects quota


Potential Edge Cases and Risks


Race Conditions: Multiple concurrent fetch jobs could each pass the quota check before any complete
Quota Calculation Timing: Quota is checked before actual data download; final size unknown until complete
Session Management: Celery tasks run in separate workers; ensure proper session commits
Error Handling in Task Chain: Ensure all downstream tasks handle None gracefully
User Object Stores: Verify is_over_quota() works correctly in Celery context
Anonymous Users: Verify quota check works for job.user = None cases
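
One possible way to narrow the race in the first item is to reserve space before a fetch starts. The sketch below is illustrative only: `QuotaLedger` is a hypothetical in-process class, whereas Celery workers run as separate processes, so a real implementation would need to keep reservations in shared storage such as the database.

```python
import threading


class QuotaLedger:
    """Hypothetical ledger that counts reserved bytes alongside used bytes."""

    def __init__(self, quota_bytes: int, used_bytes: int = 0) -> None:
        self._lock = threading.Lock()
        self.quota_bytes = quota_bytes
        self.used_bytes = used_bytes
        self.reserved_bytes = 0

    def try_reserve(self, size_bytes: int) -> bool:
        """Reserve space atomically; refuse if usage plus reservations would exceed quota."""
        with self._lock:
            if self.used_bytes + self.reserved_bytes + size_bytes > self.quota_bytes:
                return False
            self.reserved_bytes += size_bytes
            return True

    def settle(self, reserved_bytes: int, actual_bytes: int) -> None:
        """Replace a reservation with the size actually written."""
        with self._lock:
            self.reserved_bytes -= reserved_bytes
            self.used_bytes += actual_bytes
```

Because the check and the reservation happen under one lock, a second concurrent request sees the first request's reservation instead of stale usage. This only helps when an expected size is available up front, which ties into the second item in the list.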


Backporting Considerations

Target Branches:

release_25.1 (immediate fix needed)
release_25.0 (if applicable)

Backporting Steps:

Check for API differences in MinimalJobWrapper between versions
Verify Celery task signatures match
Test both enable_celery_tasks=True and False paths
Consider documenting the existing ability to disable Celery fetch (made configurable in PR #15767) as a temporary workaround


Implementation Order


Add check_and_pause_if_over_quota() to MinimalJobWrapper
Modify setup_fetch_data() with quota check
Update fetch_data() to handle None/paused
Update finish_job() to skip paused jobs
Update set_job_metadata() to handle None
Add unit tests
Add integration tests
Manual testing on development instance
Create PR and run CI


Critical Files for Implementation


| File | Changes Needed |
| --- | --- |
| `/lib/galaxy/jobs/__init__.py` | Add `check_and_pause_if_over_quota()` to `MinimalJobWrapper` |
| `/lib/galaxy/celery/tasks.py` | Modify `setup_fetch_data()`, `fetch_data()`, `finish_job()`, `set_job_metadata()` |
| `/lib/galaxy/quota/__init__.py` | Reference only (no changes needed) |
| `/test/integration/objectstore/test_quota_limit.py` | Pattern to follow for integration tests |
| `/lib/galaxy/tools/execute.py` | Reference only (no changes needed) |


## ISSUE_21642_SUMMARY.md

      
    Issue #21642: Triage Summary

Top-Line Summary

Issue: Remote repository data fetching does not respect storage quota in Galaxy 25.1+
Root Cause: When Celery-based data fetch is enabled (default since Galaxy 23.0), the __DATA_FETCH__ tool execution bypasses the traditional JobWrapper.enqueue() method where quota checks are performed. The Celery task chain (setup_fetch_data → fetch_data → set_job_metadata → finish_job) directly executes without calling _pause_job_if_over_quota(). This is a regression introduced in Galaxy 23.0 (March 2023, PR #15767) when Celery fetch became the default, and has been undetected for approximately 2 years.
Most Probable Fix: Add quota checking to the setup_fetch_data() Celery task in /lib/galaxy/celery/tasks.py by calling a new check_and_pause_if_over_quota() method on MinimalJobWrapper after object store IDs are set. All downstream tasks in the chain must be updated to handle paused jobs gracefully.

Importance Assessment Summary


| Criterion | Assessment |
| --- | --- |
| Severity | CRITICAL - Enables complete bypass of quota enforcement |
| Blast Radius | HIGH - Affects all public-facing Galaxy instances with quotas enabled |
| Workaround | PAINFUL - No practical workaround without disabling core functionality |
| Regression Status | REGRESSION - Introduced in Galaxy 23.0 (March 2023) |
| Priority Recommendation | HOTFIX - Should be backported to all supported releases |


Discussion Questions


Concurrent fetch jobs: Multiple simultaneous fetch requests could each pass quota checks before any complete. Should we implement a quota reservation mechanism to prevent race conditions?


Unknown file sizes: For remote URL fetches, the final file size isn't known until download completes. Should we implement:

A Content-Length based pre-check?
Mid-stream cancellation if quota exceeded during download?


Backporting scope: The fix should be backported to 25.1, but should it also go to 25.0 and earlier supported releases?


Testing coverage: The existing quota tests don't cover the Celery fetch path. What integration test scenarios should be prioritized?


Anonymous user impact: Anonymous users with quotas appear to be the most affected. Are there specific configurations or use cases we should test?
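
For the Content-Length question above, a pre-check could be as small as the following sketch. The function names and the three-way result are hypothetical. The header is advisory and is often absent (for example with chunked transfer encoding), so an "unknown" result would still need a mid-stream check.

```python
from typing import Mapping, Optional


def declared_size(headers: Mapping[str, str]) -> Optional[int]:
    """Return the declared body size in bytes, or None if absent or malformed."""
    value = next((v for k, v in headers.items() if k.lower() == "content-length"), None)
    if value is None:
        return None
    try:
        size = int(value.strip())
    except ValueError:
        return None
    return size if size >= 0 else None


def precheck(headers: Mapping[str, str], remaining_bytes: int) -> str:
    """Classify a fetch as "allow", "reject", or "unknown" before downloading."""
    size = declared_size(headers)
    if size is None:
        return "unknown"
    return "reject" if size > remaining_bytes else "allow"
```

A declared size larger than the remaining quota lets the job be paused before any bytes are transferred; a smaller or missing value only means the fetch may start, not that it will fit.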


Effort Estimate


| Aspect | Assessment |
| --- | --- |
| Implementation Effort | Medium - 2 source files to modify (`/lib/galaxy/jobs/__init__.py`, `/lib/galaxy/celery/tasks.py`) plus new test files, well-scoped changes |
| Testing Complexity | Medium - Requires Celery worker setup for integration tests |
| Reproduction Difficulty | Easy - Set up quota, enable Celery fetch (default), fetch data |
| Risk Level | Low - Changes are additive, existing code paths unchanged |


Key Files


/lib/galaxy/celery/tasks.py - Primary fix location
/lib/galaxy/jobs/__init__.py - Add quota check method to MinimalJobWrapper
/lib/galaxy/tools/execute.py - Reference for task chain (no changes)
/lib/galaxy/quota/__init__.py - Existing quota check implementation (no changes)


Related Issues


Issue #20637 - Similar quota issue on a different code path (the path in this issue does not go through Pulsar)