@jmchilton
Created January 15, 2026 12:23

Triage for Galaxy Issue #21589: Job cache shows file as deleted but file is there

Issue 21589: Job cache shows file as deleted but file is there in the original history

Author: paulzierep Created: 2026-01-15 State: OPEN Labels: None

Description

Describe the bug: Tried to run a workflow with the job cache. This is a retry of issue galaxyproject/galaxy#21556, now fixed. Some jobs worked as expected, but one job reported input dataset ... was deleted before the job started .... However, the original dataset is available in the history used as the basis for the job cache. The full workflow paused after this failed job, even though other jobs do not depend on it.

Galaxy Version and/or server at which you observed the bug: version_major: "25.1", version_minor: "1.dev0"

Browser and Operating System: Operating System: Linux; Browser: Chrome

To Reproduce: Steps to reproduce the behavior:

  1. Get this history: https://usegalaxy.eu/u/paulzierep/h/building-an-amplicon-sequence-variant-asv-table-from-16s-data-using-dada2
  2. Get this workflow: https://usegalaxy.eu/u/paulzierep/w/building-an-amplicon-sequence-variant-asv-table-from-16s-data-using-dada2-1-1-1
  3. Run the workflow with reads and Pasted Entry as input
  4. See error

Expected behavior: Even though I cannot understand in the first case why it reports the job as failed, in general: since jobs running with the cache should "Attempt to re-use jobs with identical parameters", they should not be able to reuse failed jobs that had missing inputs.

Screenshots

Original history shows dataset available. New history shows error claiming dataset was deleted.

Related Issues

Issue 21589: Code Research - Job Cache Reports Input Dataset Deleted

Summary

The issue reports that when running a workflow with job caching enabled, some jobs fail with the error message "input dataset ... was deleted before the job started" even though the dataset is available in the original history. This is a follow-up to issue #21556, which was recently fixed.

Job Caching Mechanism Overview

Galaxy's job caching allows reusing results from previously executed jobs if inputs and parameters match.

Flow (sketched in code after the list):

  1. Early Cache Check (Tool.completed_jobs): Before job creation, Galaxy searches for existing completed jobs with matching:

    • Tool ID & version
    • Input datasets (with matching metadata, extension)
    • Parameters
    • By default requires dataset name match
  2. Job Creation with Cache Reference: If the early cache finds a match, completed_job is passed to tool_action.execute(). The new job is created and its outputs are marked to be copied from the cached job.

  3. Late Cache Check (JobWrapper.prepare): If the early cache missed but __use_cached_job__ was set, a second search happens with require_name_match=False. If a match is found, job.copy_from_job() is called and the job returns early.
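
The three steps above can be sketched schematically. This is not Galaxy's actual code; every name below is a placeholder standing in for Tool.completed_jobs, tool_action.execute(), JobWrapper.prepare, and JobSearch:

# Schematic sketch of the job cache flow; placeholder names only.
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedJob:
    job_id: int


@dataclass
class NewJob:
    copied_from: Optional[CachedJob] = None

    def copy_from_job(self, cached: CachedJob, copy_outputs: bool = True) -> None:
        # Stand-in for Job.copy_from_job: reuse the cached job's outputs
        # instead of actually running the tool.
        self.copied_from = cached


def search_for_cached_job(require_name_match: bool) -> Optional[CachedJob]:
    # Stand-in for JobSearch: match on tool id/version, inputs, and parameters.
    return None  # pretend the search missed


def run_with_job_cache(use_cached_job: bool) -> NewJob:
    completed_job = None
    if use_cached_job:
        # 1. Early cache check, before job creation; name match required by default.
        completed_job = search_for_cached_job(require_name_match=True)

    # 2. Job creation; on an early hit the outputs are marked for copying.
    job = NewJob()
    if completed_job is not None:
        job.copy_from_job(completed_job)
        return job

    # 3. Late cache check (in Galaxy this lives in JobWrapper.prepare): retry
    #    the search without the name-match requirement and copy on a hit.
    if use_cached_job:
        late_match = search_for_cached_job(require_name_match=False)
        if late_match is not None:
            job.copy_from_job(late_match)
    return job


job = run_with_job_cache(use_cached_job=True)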

Key Files:

File | Purpose
lib/galaxy/managers/jobs.py (JobSearch class) | Job search/matching logic
lib/galaxy/tools/actions/__init__.py | Job creation and input/output recording
lib/galaxy/tools/execute.py | Tool execution coordination
lib/galaxy/jobs/__init__.py (JobWrapper) | Late cache check and job preparation
lib/galaxy/jobs/handler.py | Input validation and job state checking
lib/galaxy/model/__init__.py (Job.copy_from_job) | Job copying logic

Error Message Source

The error "was deleted before the job started" comes from two locations in lib/galaxy/jobs/handler.py:

Location 1: __filter_jobs_with_invalid_input_states() (lines 644-649)

for job_id, hda_deleted, hda_state, hda_name, dataset_deleted, dataset_purged, dataset_state in queries:
    if hda_deleted or dataset_deleted:
        if dataset_purged:
            jobs_to_fail[job_id].append(f"Input dataset '{hda_name}' was deleted before the job started")
        else:
            jobs_to_pause[job_id].append(f"Input dataset '{hda_name}' was deleted before the job started")

This checks via SQL query if the job's input datasets are deleted.

Location 2: __verify_in_memory_job_inputs() (lines 800-802)

if idata.deleted:
    self.job_wrappers.pop(job.id, self.job_wrapper(job)).fail(
        f"input data {idata.hid} (file: {idata.get_file_name()}) was deleted before the job started"
    )

Job Cache Matching Logic

In lib/galaxy/managers/jobs.py, the _build_stmt_for_hda() method builds the query to find matching jobs. Key condition on line 827:

or_(b.deleted == false(), c.deleted == false()),

Where:

  • b = The HDA used by the previously run job (the cached job's input)
  • c = The HDA from the current request

This means: A match is allowed if either the original job's input OR the new request's input is not deleted.
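
As a quick illustration of what this condition admits, the four deletion combinations can be enumerated in plain Python:

# Which (cached input deleted, new input deleted) combinations the
# or_(b.deleted == false(), c.deleted == false()) condition accepts.
for b_deleted in (False, True):
    for c_deleted in (False, True):
        matches = (not b_deleted) or (not c_deleted)
        print(f"cached input deleted={b_deleted}, new input deleted={c_deleted} -> match={matches}")
# Only the case where both are deleted is rejected, so a cached job whose
# original input was later deleted still matches as long as the new input exists.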

Related Issue #21556 Fix (PR #21558)

The fix changed lib/galaxy/model/__init__.py in DatasetCollection.replace_elements_with_copies():

-            if replacement.child_collection:
+            elif replacement.child_collection:

This fixed a bug where both if replacement.hda and if replacement.child_collection could execute, causing a "Cannot replace" error during collection copying for cached jobs.

Theories for Issue #21589

Theory 1: Input HDA from Cached Job's History is Deleted (Most Likely)

When the job cache finds a match:

  • The new job records input dataset associations pointing to the new workflow invocation's input HDAs
  • However, the cached job's inputs may have been deleted in the original history
  • The query allows this because c.deleted == false() (current request's input exists)
  • When the job handler validates inputs, it checks the new job's input associations
  • But somehow the check may be looking at the wrong HDA (cached job's input vs new job's input)

The key question: When is the "deleted" check in handler.py performed - against the new job's input dataset associations or something else?

Looking at the handler code, it joins through:

.join(job_to_input, input_association.job)
.join(input_association)

This should check the new job's input associations, which point to datasets in the new history. If those are not deleted, this shouldn't fail.
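
For reference, a simplified sketch of the kind of statement this validation amounts to. This is an approximation, not the handler's exact query, and the helper name input_deletion_flags is invented; it only assumes the standard Galaxy model mappings:

from sqlalchemy import select

from galaxy import model


def input_deletion_flags(session, job_ids):
    # Follow the NEW job's input associations to their HDAs and underlying
    # datasets, returning the deletion/purge flags the handler inspects.
    stmt = (
        select(
            model.Job.id,
            model.HistoryDatasetAssociation.deleted,
            model.HistoryDatasetAssociation.name,
            model.Dataset.deleted,
            model.Dataset.purged,
            model.Dataset.state,
        )
        .join(model.JobToInputDatasetAssociation, model.JobToInputDatasetAssociation.job_id == model.Job.id)
        .join(
            model.HistoryDatasetAssociation,
            model.HistoryDatasetAssociation.id == model.JobToInputDatasetAssociation.dataset_id,
        )
        .join(model.Dataset, model.Dataset.id == model.HistoryDatasetAssociation.dataset_id)
        .where(model.Job.id.in_(job_ids))
    )
    return session.execute(stmt).all()

Because the joins start from the new job's JobToInputDatasetAssociation rows, only HDAs in the new history should be examined; if those are not deleted, this validation alone should not fail.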

Possible root cause: During early cache matching (before job creation) vs late cache (during job prepare), the input dataset associations may be different or may reference datasets differently.

Theory 2: Race Condition in Input Dataset State

When job caching is used during workflow execution:

  1. Workflow creates input datasets (possibly copied or derived from original history)
  2. Cache check happens
  3. Between cache match and job handler validation, the input's state changes

This could happen if:

  • A workflow step deletes intermediate outputs
  • Post-job actions modify input datasets
  • Collection manipulation marks elements as deleted

Theory 3: Collection Element Association Issue

The issue mentions it happens with some jobs but not all in the same workflow. Looking at the previous fix (#21556), it was specifically about collection handling.

When a cached job produces collections:

  • The new job creates new collection elements
  • replace_elements_with_copies() is called to copy outputs
  • Collection elements may have HDA associations that point to datasets marked deleted

The condition in job cache matching:

or_(b.deleted == false(), c.deleted == false())

This is for individual HDAs, but for collections, the elements inside may have different deletion states. The cache might find a match based on the collection being available, but individual elements inside could be deleted.

Specific scenario:

  • Original job's output collection has elements that were later deleted
  • Cache matches on collection level
  • New job copies collection structure
  • Handler validates inputs and finds one of the HDA elements (from the cached job's collection) is deleted

Relevant Code Paths for Further Investigation

  1. Input dataset recording during job creation with completed_job:

    • lib/galaxy/tools/actions/__init__.py line 747: self._record_inputs(trans, tool, job, incoming, inp_data, inp_dataset_collections)
    • Does this record the right inputs when completed_job is provided?
  2. Collection element copying in cache scenario:

    • lib/galaxy/model/__init__.py - DatasetCollection.replace_elements_with_copies()
    • How does element HDA state affect the new job's input associations?
  3. Late cache path:

    • lib/galaxy/jobs/__init__.py line 1292: job.copy_from_job(job_to_copy, copy_outputs=True)
    • Does this modify input associations? Does handler then check wrong inputs?
  4. Handler input validation query:

    • lib/galaxy/jobs/handler.py lines 608-637
    • Verify which input associations are being checked

New Theory: Cached Job Matching Doesn't Validate Cached Job's State

Looking more closely at the job cache query (line 827):

or_(b.deleted == false(), c.deleted == false())

This allows matching a cached job even when:

  • The cached job's original input (b) is deleted
  • As long as the new request's input (c) is not deleted

However, the job cache query also checks the job state at line 688:

stmt = stmt.where(Job.state.in_(job_states))

Where job_states defaults to {Job.states.OK}.

Key insight: The cache only checks that the cached job itself finished in the OK state; it does not verify that the cached job's input datasets still exist now. A job that completed successfully can later have its inputs deleted in the original history, and the cache will still offer it for reuse.

Most likely scenario:

  1. First workflow runs, job completes OK (state=ok)
  2. Some time later, inputs in original history get deleted
  3. Second workflow runs with same inputs (in different history)
  4. Cache finds match (job state=ok, original inputs deleted but new inputs exist)
  5. New job created with new input dataset associations
  6. But during output copying or collection element copying, something references the original job's deleted input
  7. Handler validation sees the deleted reference and fails

The _exclude_jobs_with_deleted_outputs check (lines 725-765) excludes jobs with deleted outputs, but there's no check for deleted inputs of the cached job.

Suggested Fix Direction

Add a check in JobSearch to exclude cached jobs whose inputs are now deleted. This could be added to _filter_jobs or as a new exclusion filter similar to _exclude_jobs_with_deleted_outputs.

Unresolved Questions

  1. In the specific failing workflow, is the deleted-reported dataset an input or part of a collection?
  2. What is the exact tool producing the failed job?
  3. Is early caching or late caching being used in the failing case?
  4. Are there any post-job actions that could be modifying input states?
  5. Is the issue reproducible with a simpler workflow?
  6. When the cached job's input is deleted, what exactly gets copied/referenced during output copy?

Issue 21589: Importance Assessment

Summary

Job cache incorrectly reports "input dataset was deleted before the job started" when the dataset exists in the original history. The workflow pauses entirely even though the failing job is not an upstream dependency of the other steps. Follow-up to the recently fixed #21556.


1. Severity: MEDIUM

  • Not critical: No data loss or security implications
  • Not high: Not a crash/hang - workflow pauses, doesn't terminate
  • Medium: Functional breakage - job caching feature fails incorrectly, blocks workflow execution
  • The error message is misleading (claims deleted when not deleted) but recoverable via re-run without caching

2. Blast Radius: Specific Configurations

Affected users:

  • Users running workflows with job caching enabled (use_cached_job=True)
  • Specifically when re-using jobs from histories where original inputs may have different states
  • Likely affects collection-based workflows more than simple HDA workflows (based on #21556 pattern)

Not affected:

  • Users running workflows without caching (default)
  • Users running individual tools
  • Fresh workflow runs with no previous cached jobs

Production impact:

  • usegalaxy.eu confirmed affected (reporter's environment)
  • Any production Galaxy instance with job caching enabled for workflows

3. Workaround Existence: ACCEPTABLE

Workaround | Difficulty
Disable job caching for the problematic workflow | Easy (single checkbox)
Re-run failed jobs individually | Easy (manual intervention)
Delete cached history and re-run fresh | Medium (loses time savings)

The workarounds allow completion of scientific work, just without caching benefits.


4. Regression Status: NEW REGRESSION (25.1.dev)

Timeline:

  • 2026-01-12: Issue #21556 reported (job cache collection copy error)
  • 2026-01-12: PR #21558 fix merged (changed if to elif in replace_elements_with_copies)
  • 2026-01-15: Issue #21589 reported - same user, same workflow, different error after fix

Key evidence:

  • Same reporter (paulzierep) testing same workflow after #21556 fix
  • Error changed from "Cannot replace" to "was deleted before job started"
  • Suggests incomplete fix or exposed secondary bug
  • Version: 25.1.1.dev0 (development branch)

Root cause hypothesis (from code research):

  • Job cache query allows matching when cached job's input is deleted (via or_(b.deleted == false(), c.deleted == false()))
  • Query checks job state=OK but doesn't verify cached job's inputs still exist
  • Handler validation then fails when it encounters deleted reference during output copying

5. User Impact Signals

Signal | Value | Notes
Issue reactions | 0 | Too new (same day)
Duplicate reports | 0 | No duplicates found
Related issues | #20196 (closed) | Similar "deleted before job started" error, different root cause
Comments | 2 | mvdbeek responded, user clarified
Reporter history | Active tester | Same user reported #21556, actively testing job cache

Comment from mvdbeek (maintainer):

"Jobs and datasets are only picked as source if they are in OK state, however you might easily run into this if an implicit converter fails."

This suggests a potential edge case with implicit converters, not just collection copying.

Support signal:

  • Reporter is actively testing GTN (Galaxy Training Network) tutorials
  • Represents real-world workflow usage pattern
  • Workflow caching is specifically important for training/education scenarios

6. Recommendation: NEXT RELEASE

Rationale:

Factor | Assessment
Severity | Medium: blocks workflow but workaround exists
Urgency | Low: acceptable workaround available
Complexity | Medium-High: requires careful investigation of cache matching plus handler validation
Release impact | Should be fixed before the 25.1 release

Not a hotfix because:

  • Workaround available (disable caching)
  • No data loss or corruption
  • Limited blast radius (caching users only)
  • Requires investigation to avoid incomplete fix (like #21558)

Not backlog because:

  • Job caching is an important production feature
  • Part of incomplete fix chain from #21556
  • Affects training/education workflows specifically
  • Active user testing and expecting resolution

Recommended actions:

  1. Label: Add kind/bug and area/job-caching labels
  2. Milestone: Target 25.1 release
  3. Priority: P2 (should fix)
  4. Investigation needed:
    • Verify if implicit converter scenario is root cause (per mvdbeek comment)
    • Check if _exclude_jobs_with_deleted_outputs should have sibling _exclude_jobs_with_deleted_inputs
    • Review handler validation to ensure it checks correct input associations
  5. Test: Add regression test covering deleted-input-in-cached-job scenario

Related Issues Summary

Issue | Status | Relationship
#21556 | CLOSED | Immediate predecessor: "Cannot replace" error
#21558 | MERGED | Fix PR for #21556
#20196 | CLOSED | Similar error message, different context (Pick Value tool)
#6887 | OPEN | Long-standing "make job cache more useful" RFC

Unresolved Questions

  1. Is implicit converter failure the actual cause (mvdbeek hypothesis)?
  2. Why does error occur for only one job in workflow, not all cached jobs?
  3. Does the specific tool matter (which tool in workflow fails)?
  4. Is the issue in early cache path, late cache path, or both?
  5. Should job cache query filter out jobs with deleted inputs entirely?

Fix Plan: Issue 21589 - Job Cache Reports Input Dataset Deleted

1. Problem Analysis

Summary: When running a workflow with job caching enabled, jobs fail with "input dataset ... was deleted before the job started" even though the input dataset exists in the current history.

Root Cause: The job cache query in lib/galaxy/managers/jobs.py allows matching a cached job even when the cached job's original input datasets have been deleted. Line 827 has:

or_(b.deleted == false(), c.deleted == false())

Where:

  • b = The HDA used by the cached job (original input)
  • c = The HDA from the current request (new input)

This condition passes if either is not deleted. So a cached job whose original inputs are now deleted will still match as long as the new request's inputs are valid.

Downstream Effect: When a cached job match is found:

  1. New job created with input associations pointing to new history's datasets
  2. Output copying happens via Job.copy_from_job()
  3. Handler (lib/galaxy/jobs/handler.py) validates inputs
  4. Something in the copy/validation chain references the cached job's deleted original input
  5. Handler fails the job with "was deleted before the job started"

2. Proposed Solution

Add an input exclusion filter similar to _exclude_jobs_with_deleted_outputs() that excludes cached jobs whose input datasets or input collections are now deleted.

The fix should:

  1. Create _exclude_jobs_with_deleted_inputs() method
  2. Call it after _exclude_jobs_with_deleted_outputs() in the search pipeline
  3. Optionally: Fix the condition on line 827 to require both b and c not deleted

3. Implementation Steps

Step 1: Add _exclude_jobs_with_deleted_inputs() method

Location: lib/galaxy/managers/jobs.py after _exclude_jobs_with_deleted_outputs() (line 765)

def _exclude_jobs_with_deleted_inputs(self, stmt):
    """Exclude jobs whose input datasets or collections are now deleted."""
    subquery_alias = stmt.subquery("pre_input_filter_subquery")
    outer_select_columns = [subquery_alias.c[col.name] for col in stmt.selected_columns]
    outer_stmt = select(*outer_select_columns).select_from(subquery_alias)
    job_id_from_subquery = subquery_alias.c.job_id

    # Subquery for deleted input collections
    deleted_input_collection_exists = exists().where(
        and_(
            model.JobToInputDatasetCollectionAssociation.job_id == job_id_from_subquery,
            model.JobToInputDatasetCollectionAssociation.dataset_collection_id
            == model.HistoryDatasetCollectionAssociation.id,
            model.HistoryDatasetCollectionAssociation.deleted == true(),
        )
    )

    # Subquery for deleted input datasets
    deleted_input_dataset_exists = exists().where(
        and_(
            model.JobToInputDatasetAssociation.job_id == job_id_from_subquery,
            model.JobToInputDatasetAssociation.dataset_id == model.HistoryDatasetAssociation.id,
            model.HistoryDatasetAssociation.deleted == true(),
        )
    )

    # Exclude jobs where a deleted input collection OR deleted input dataset exists
    outer_stmt = outer_stmt.where(
        and_(
            ~deleted_input_collection_exists,
            ~deleted_input_dataset_exists,
        )
    )
    return outer_stmt

Step 2: Call the new exclusion filter

Location: lib/galaxy/managers/jobs.py line 573

Change from:

stmt = self._exclude_jobs_with_deleted_outputs(stmt)

To:

stmt = self._exclude_jobs_with_deleted_outputs(stmt)
stmt = self._exclude_jobs_with_deleted_inputs(stmt)

Step 3: Fix the inline condition (optional but recommended)

Location: lib/galaxy/managers/jobs.py line 827

Change from:

or_(b.deleted == false(), c.deleted == false()),

To:

and_(b.deleted == false(), c.deleted == false()),

This makes the matching stricter: both the cached job's input AND the new request's input must not be deleted, as a belt-and-suspenders complement to the exclusion filter in Step 1.

Rationale: The original or_ condition seems intentionally permissive (maybe to allow caching when original was deleted but new exists?). However, this causes downstream errors. Requiring both to exist is safer.

4. Files to Modify

File | Line(s) | Change
lib/galaxy/managers/jobs.py | 573 | Add call to _exclude_jobs_with_deleted_inputs()
lib/galaxy/managers/jobs.py | 765-766 | Add new _exclude_jobs_with_deleted_inputs() method
lib/galaxy/managers/jobs.py | 827 | Change or_ to and_ (optional)

5. Testing Strategy

Unit Test

Create test in test/unit/app/managers/test_job_search.py (new file):

def test_exclude_jobs_with_deleted_inputs():
    """Verify cached jobs with deleted inputs are not matched."""
    # Setup:
    # 1. Create job with input HDA, complete it
    # 2. Delete the input HDA
    # 3. Create new HDA with same dataset
    # 4. Search for matching job
    # Assert: No job found (deleted input should exclude)

Integration Test

Add test in test/integration/test_workflow_caching.py (may need to create):

def test_workflow_cache_with_deleted_original_inputs():
    """Test that job cache doesn't reuse jobs whose original inputs were deleted."""
    # 1. Run workflow with job caching
    # 2. Delete input from original history
    # 3. Run same workflow again (new history, same inputs by dataset)
    # 4. Verify: New job runs (no cache hit) OR proper handling

Manual Testing

  1. Get reproduction workflow from issue
  2. Run workflow with caching enabled
  3. Delete input dataset from original history
  4. Run workflow again with equivalent input in new history
  5. Verify: No "was deleted before job started" error

Red-to-Green Approach

  1. Write failing test that reproduces the bug:

    • Create job with input
    • Delete input HDA (mark deleted=True)
    • Create new HDA pointing to same dataset
    • Call JobSearch.by_tool_input()
    • Assert job is NOT found (currently fails - job IS found)
  2. Implement fix

  3. Test passes

6. Risks and Considerations

Performance

  • New exclusion filter adds another subquery/EXISTS check
  • Should be low impact - filtered on indexed job_id column
  • Follows same pattern as _exclude_jobs_with_deleted_outputs

Backward Compatibility

  • Breaking change potential: Some users may rely on current behavior where cache works even when original inputs deleted
  • However, current behavior causes job failures, so fixing is preferred
  • No API changes, only internal query logic

Edge Cases

  1. Partially deleted inputs: Job has 3 inputs, only 1 deleted

    • Current fix: Excludes job entirely
    • This is correct - can't reuse job with any deleted input
  2. Collection with deleted elements: HDCA exists but contains deleted HDAs

    • Current fix checks HDCA deletion, not element deletion
    • May need separate handling if collection elements matter (a sketch follows this list)
  3. Library datasets (LDDA):

    • JobToInputLibraryDatasetAssociation also exists
    • Should add LDDA check too for completeness
  4. Deferred datasets:

    • Deferred datasets have different state handling
    • Should verify fix doesn't break deferred dataset caching
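
For edge case 2, a possible extension is sketched below. This is only a sketch: it reuses the job_id_from_subquery name from the Step 1 method above, handles a single level of nesting (nested collections would require walking child_collection_id recursively), and the exact join conditions against Galaxy's model are untested:

    # Additional EXISTS clause for _exclude_jobs_with_deleted_inputs():
    # flag cached jobs whose input HDCA contains a deleted element HDA.
    deleted_collection_element_exists = exists().where(
        and_(
            model.JobToInputDatasetCollectionAssociation.job_id == job_id_from_subquery,
            model.JobToInputDatasetCollectionAssociation.dataset_collection_id
            == model.HistoryDatasetCollectionAssociation.id,
            model.DatasetCollectionElement.dataset_collection_id
            == model.HistoryDatasetCollectionAssociation.collection_id,
            model.DatasetCollectionElement.hda_id == model.HistoryDatasetAssociation.id,
            model.HistoryDatasetAssociation.deleted == true(),
        )
    )
    # ...and add ~deleted_collection_element_exists to the outer where() clause.

If this element-level check turns out to be needed, it would slot into the same filter method as the HDA and HDCA checks rather than requiring a separate search pass.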

Alternative Approaches

  1. Fix at copy time: Instead of excluding from search, handle gracefully during Job.copy_from_job()

    • Pro: Less restrictive
    • Con: More complex, may still fail downstream
  2. Fix the OR condition only: Just change line 827 from or_ to and_

    • Pro: Simpler change
    • Con: Doesn't catch collections, relies on join condition
  3. Validate inputs during handler: Check if cached job's inputs still valid before using

    • Pro: Most accurate check at point of use
    • Con: Late failure, already committed to using cached job

7. Unresolved Questions

  1. Why or_ in original condition? Was there a use case where caching should work with deleted original input? Need to check git history for line 827.

  2. Collection element deletion: Should we also check individual DatasetCollectionElement HDAs for deletion, not just the HDCA?

  3. Library dataset inputs: Need LDDA deletion check too?

  4. Test data availability: Can we reproduce with usegalaxy.eu workflow? Or need simpler reproduction case?

  5. Interaction with issue #21556: Recent fix was in collection copying. Is new issue related to collection input specifically, or also happens with simple HDA inputs?

Issue 21589: Triage Summary

Top-Line Summary

Job caching incorrectly matches cached jobs whose original input datasets have been deleted. The job cache query in lib/galaxy/managers/jobs.py:827 uses or_(b.deleted == false(), c.deleted == false()) which allows a match when the new request's input exists even if the cached job's original input was deleted. During output copying or handler validation, something references the deleted original input, causing the "was deleted before the job started" error. The most probable fix is adding _exclude_jobs_with_deleted_inputs() filter similar to the existing _exclude_jobs_with_deleted_outputs(). This is a follow-up regression to #21556, reported by the same user testing the same workflow after that fix was merged.


Importance Assessment

Factor | Assessment
Severity | MEDIUM: functional breakage, no data loss
Blast Radius | Specific configs: job caching users only
Workaround | ACCEPTABLE: disable-caching checkbox
Regression | NEW in 25.1.dev (after #21556 fix on 2026-01-12)
Priority | NEXT RELEASE: should fix before 25.1

Questions for Discussion

  1. Reproduction: Can we get a minimal reproduction case? The workflow on usegalaxy.eu is complex (DADA2 16S analysis).

  2. Collection vs HDA: Is this specifically a collection input issue (like #21556), or does it also happen with simple HDA inputs?

  3. Which tool fails?: Which specific tool in the workflow triggers the error? Understanding this would help narrow down the cache/copy path.

  4. Original OR condition intent: The or_(b.deleted == false(), c.deleted == false()) condition on line 827 seems intentionally permissive. Was there a use case for allowing cache match when original input is deleted?

  5. mvdbeek's converter hypothesis: mvdbeek mentioned implicit converter failures could trigger this. Is an implicit conversion happening in the failing job?


Fix Estimate

Aspect | Estimate
Complexity | Medium: query change plus new filter method
Lines of code | ~30-40 LOC
Files | 1 (lib/galaxy/managers/jobs.py)
Risk | Low: additive filter, follows existing pattern
Testing difficulty | Medium: needs a workflow plus deleted-input scenario

Reproduction Difficulty

Hard to reproduce without:

  • Access to usegalaxy.eu shared history/workflow
  • Understanding which specific job in workflow fails
  • Dataset collection with specific deletion state

Suggested approach: Create minimal test case with:

  1. Simple tool with one input
  2. Run with caching
  3. Delete original input HDA
  4. Re-run with equivalent input
  5. Verify error/fix

Related Documents
