Issue 21589: Job cache shows file as deleted but file is there in the original history
Author: paulzierep
Created: 2026-01-15
State: OPEN
Labels: None
Description
Describe the bug
Tried to run a workflow with job cache.
This is a retry of the workflow from issue galaxyproject/galaxy#21556, which is now fixed.
Some jobs worked as expected. But one job reported
input dataset ... was deleted before the job started ...
However, the original dataset is available in the history used as basis for the job cache.
Full workflow paused after this failed job, even though other jobs do not depend on this job.
Galaxy Version and/or server at which you observed the bug
version_major: "25.1",
version_minor: "1.dev0"
Browser and Operating System
Operating System: Linux
Browser: Chrome
To Reproduce
Run the workflow with reads and a Pasted Entry as input.
See the error.
Expected behavior
Even though I cannot understand, in the first case, why it reports the job as failed, in general: since jobs running with the cache are only supposed to "Attempt to re-use jobs with identical parameters", they should not be able to reuse failed jobs that had missing inputs.
Screenshots
Original history shows dataset available.
New history shows error claiming dataset was deleted.
Issue reports that when running a workflow with job caching enabled, some jobs fail with error message "input dataset ... was deleted before the job started" even though the dataset is available in the original history. This is a follow-up to issue #21556 which was recently fixed.
Job Caching Mechanism Overview
Galaxy's job caching allows reusing results from previously executed jobs if inputs and parameters match.
Flow:
Early Cache Check (Tool.completed_jobs): Before job creation, Galaxy searches for existing completed jobs with matching tool, inputs, and parameters.
Job Creation with Cache Reference: If the early cache check finds a match, completed_job is passed to tool_action.execute(). The new job is created with its outputs marked to be copied from the cached job.
Late Cache Check (JobWrapper.prepare): If early cache missed but __use_cached_job__ was set, a second search happens with require_name_match=False. If found, job.copy_from_job() is called and job returns early.
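A condensed, self-contained sketch of this two-phase lookup is shown below. The function and the CacheDecision class are illustrative stand-ins, not Galaxy's actual API; in reality the early check happens in Tool.completed_jobs and the late check in JobWrapper.prepare.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheDecision:
    """Illustrative result of a cache lookup (not a Galaxy class)."""
    completed_job_id: Optional[int]
    phase: str  # "early", "late", or "miss"


def lookup_cached_job(search, tool_id, inputs, params, use_cached_job):
    """Schematic two-phase lookup: a strict early check before job creation,
    then a relaxed late check (names no longer required to match)."""
    if not use_cached_job:
        return CacheDecision(None, "miss")

    # Early check: strict match on tool, inputs, and parameters.
    job_id = search(tool_id, inputs, params, require_name_match=True)
    if job_id is not None:
        return CacheDecision(job_id, "early")

    # Late check: same search, but with require_name_match=False.
    job_id = search(tool_id, inputs, params, require_name_match=False)
    if job_id is not None:
        return CacheDecision(job_id, "late")

    return CacheDecision(None, "miss")


# Example with a fake search that only matches when name matching is relaxed.
fake_search = lambda tool_id, inputs, params, require_name_match: None if require_name_match else 7
print(lookup_cached_job(fake_search, "tool", {}, {}, use_cached_job=True))
# CacheDecision(completed_job_id=7, phase='late')
```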
Key Files:

| File | Purpose |
| --- | --- |
| lib/galaxy/managers/jobs.py (JobSearch class) | Job search/matching logic |
| lib/galaxy/tools/actions/__init__.py | Job creation and input/output recording |
| lib/galaxy/tools/execute.py | Tool execution coordination |
| lib/galaxy/jobs/__init__.py (JobWrapper) | Late cache check and job preparation |
| lib/galaxy/jobs/handler.py | Input validation and job state checking |
| lib/galaxy/model/__init__.py (Job.copy_from_job) | Job copying logic |
Error Message Source
The error "was deleted before the job started" comes from two locations in lib/galaxy/jobs/handler.py:
```python
for job_id, hda_deleted, hda_state, hda_name, dataset_deleted, dataset_purged, dataset_state in queries:
    if hda_deleted or dataset_deleted:
        if dataset_purged:
            jobs_to_fail[job_id].append(f"Input dataset '{hda_name}' was deleted before the job started")
        else:
            jobs_to_pause[job_id].append(f"Input dataset '{hda_name}' was deleted before the job started")
```
This checks via SQL query if the job's input datasets are deleted.
```python
if idata.deleted:
    self.job_wrappers.pop(job.id, self.job_wrapper(job)).fail(
        f"input data {idata.hid} (file: {idata.get_file_name()}) was deleted before the job started"
    )
```
Job Cache Matching Logic
In lib/galaxy/managers/jobs.py, the _build_stmt_for_hda() method builds the query to find matching jobs. Key condition on line 827:
```python
or_(b.deleted == false(), c.deleted == false()),
```
Where:
b = The HDA used by the previously run job (the cached job's input)
c = The HDA from the current request
This means: A match is allowed if either the original job's input OR the new request's input is not deleted.
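To make the permissiveness concrete, here is a tiny plain-Python truth table for this predicate; b_deleted and c_deleted are just booleans mirroring the two HDAs' deleted flags, not SQLAlchemy columns.

```python
# Truth table for the matching predicate described above:
# a match is allowed when (not b.deleted) OR (not c.deleted).
# b = cached job's original input HDA, c = the current request's HDA.
cases = [
    {"b_deleted": False, "c_deleted": False},
    {"b_deleted": True,  "c_deleted": False},  # the problematic case in this issue
    {"b_deleted": False, "c_deleted": True},
    {"b_deleted": True,  "c_deleted": True},
]

for case in cases:
    or_match = (not case["b_deleted"]) or (not case["c_deleted"])
    and_match = (not case["b_deleted"]) and (not case["c_deleted"])
    print(case, "or_ matches:", or_match, "and_ matches:", and_match)
```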
Related Issue #21556 Fix (PR #21558)
The fix changed lib/galaxy/model/__init__.py in DatasetCollection.replace_elements_with_copies():
```diff
- if replacement.child_collection:
+ elif replacement.child_collection:
```
This fixed a bug where both if replacement.hda and if replacement.child_collection could execute, causing a "Cannot replace" error during collection copying for cached jobs.
Theories for Issue #21589
Theory 1: Input HDA from Cached Job's History is Deleted (Most Likely)
When the job cache finds a match:
The new job records input dataset associations pointing to the new workflow invocation's input HDAs
However, the cached job's inputs may have been deleted in the original history
The query allows this because c.deleted == false() (current request's input exists)
When the job handler validates inputs, it checks the new job's input associations
But somehow the check may be looking at the wrong HDA (cached job's input vs new job's input)
The key question: When is the "deleted" check in handler.py performed - against the new job's input dataset associations or something else?
This should check the new job's input associations, which point to datasets in the new history. If those are not deleted, this shouldn't fail.
Possible root cause: During early cache matching (before job creation) vs late cache (during job prepare), the input dataset associations may be different or may reference datasets differently.
Theory 2: Race Condition in Input Dataset State
When job caching is used during workflow execution:
Workflow creates input datasets (possibly copied or derived from original history)
Cache check happens
Between cache match and job handler validation, the input's state changes
This could happen if:
A workflow step deletes intermediate outputs
Post-job actions modify input datasets
Collection manipulation marks elements as deleted
Theory 3: Collection Element Association Issue
The issue mentions it happens with some jobs but not all in the same workflow. Looking at the previous fix (#21556), it was specifically about collection handling.
When a cached job produces collections:
The new job creates new collection elements
replace_elements_with_copies() is called to copy outputs
Collection elements may have HDA associations that point to datasets marked deleted
The condition in job cache matching:
```python
or_(b.deleted == false(), c.deleted == false())
```
This is for individual HDAs, but for collections, the elements inside may have different deletion states. The cache might find a match based on the collection being available, but individual elements inside could be deleted.
Specific scenario:
Original job's output collection has elements that were later deleted
Cache matches on collection level
New job copies collection structure
Handler validates inputs and finds one of the HDA elements (from the cached job's collection) is deleted
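A small self-contained sketch of the element-level check this scenario suggests; the HDA, Element, and Collection classes below are simplified stand-ins for Galaxy's models, not the real ones.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HDA:                      # stand-in for a history dataset association
    deleted: bool = False


@dataclass
class Element:                  # stand-in for a DatasetCollectionElement
    hda: Optional[HDA] = None
    child_collection: Optional["Collection"] = None


@dataclass
class Collection:               # stand-in for a DatasetCollection / HDCA
    deleted: bool = False
    elements: List[Element] = field(default_factory=list)


def has_deleted_element(collection: Collection) -> bool:
    """Return True if any HDA anywhere inside the collection is deleted,
    even when the collection object itself is not marked deleted."""
    if collection.deleted:
        return True
    for element in collection.elements:
        if element.hda is not None and element.hda.deleted:
            return True
        if element.child_collection is not None and has_deleted_element(element.child_collection):
            return True
    return False


# The situation described above: the HDCA looks fine, but one nested HDA is deleted.
hdca = Collection(elements=[Element(hda=HDA()), Element(hda=HDA(deleted=True))])
print(has_deleted_element(hdca))  # True
```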
Relevant Code Paths for Further Investigation
Input dataset recording during job creation with completed_job:
lib/galaxy/tools/actions/__init__.py line 747: self._record_inputs(trans, tool, job, incoming, inp_data, inp_dataset_collections)
Does this record the right inputs when completed_job is provided?
How does element HDA state affect the new job's input associations?
Late cache path:
lib/galaxy/jobs/__init__.py line 1292: job.copy_from_job(job_to_copy, copy_outputs=True)
Does this modify input associations? Does handler then check wrong inputs?
Handler input validation query:
lib/galaxy/jobs/handler.py lines 608-637
Verify which input associations are being checked
New Theory: Cached Job Matching Doesn't Validate Cached Job's State
Looking more closely at the job cache query (line 827):
```python
or_(b.deleted == false(), c.deleted == false())
```
This allows matching a cached job even when:
The cached job's original input (b) is deleted
As long as the new request's input (c) is not deleted
But wait - the job cache query checks job state at line 688:
```python
stmt = stmt.where(Job.state.in_(job_states))
```
Where job_states defaults to {Job.states.OK}.
Key insight: The cache only checks that the cached job itself finished in state OK; it never verifies that the cached job's inputs are still intact. The job may well have been OK when it originally ran, but if its inputs were deleted afterwards (or the original workflow had deleted-input trouble around that job), the search can still return it as a cache hit, leaving us in an inconsistent state.
Most likely scenario:
First workflow runs, job completes OK (state=ok)
Some time later, inputs in original history get deleted
Second workflow runs with same inputs (in different history)
Cache finds match (job state=ok, original inputs deleted but new inputs exist)
New job created with new input dataset associations
BUT - during output copying or collection element copying, something references the original job's deleted input
Handler validation sees the deleted reference and fails
The _exclude_jobs_with_deleted_outputs check (lines 725-765) excludes jobs with deleted outputs, but there's no check for deleted inputs of the cached job.
Suggested Fix Direction
Add a check in JobSearch to exclude cached jobs whose inputs are now deleted. This could be added to _filter_jobs or as a new exclusion filter similar to _exclude_jobs_with_deleted_outputs.
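A hedged sketch of what such a filter might look like follows. It is written against throwaway SQLAlchemy tables rather than Galaxy's real mappings, and the function name simply mirrors the suggestion above; the actual method would live in JobSearch and reuse its existing statement-building helpers.

```python
# Toy SQLAlchemy Core sketch of an input-exclusion filter in the spirit of
# _exclude_jobs_with_deleted_outputs. The tables are simplified stand-ins,
# not Galaxy's actual schema.
import sqlalchemy as sa

metadata = sa.MetaData()
job = sa.Table("job", metadata, sa.Column("id", sa.Integer, primary_key=True))
job_input_link = sa.Table(
    "job_input_link",
    metadata,
    sa.Column("job_id", sa.Integer),
    sa.Column("hda_id", sa.Integer),
)
input_hda = sa.Table(
    "input_hda",
    metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("deleted", sa.Boolean),
    sa.Column("purged", sa.Boolean),
)


def exclude_jobs_with_deleted_inputs(stmt):
    """Append a NOT EXISTS clause that drops candidate cached jobs having
    at least one input dataset that is now deleted or purged."""
    deleted_input = (
        sa.select(job_input_link.c.job_id)
        .join(input_hda, input_hda.c.id == job_input_link.c.hda_id)
        .where(
            job_input_link.c.job_id == job.c.id,
            sa.or_(input_hda.c.deleted == sa.true(), input_hda.c.purged == sa.true()),
        )
    )
    return stmt.where(~deleted_input.exists())


# Toy candidate statement standing in for the cache search query.
print(exclude_jobs_with_deleted_inputs(sa.select(job.c.id)))
```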
Unresolved Questions
In the specific failing workflow, is the dataset reported as deleted a plain HDA input or part of a collection?
What is the exact tool producing the failed job?
Is early caching or late caching being used in the failing case?
Are there any post-job actions that could be modifying input states?
Is the issue reproducible with a simpler workflow?
When the cached job's input is deleted, what exactly gets copied/referenced during output copy?
Job cache incorrectly reports "input dataset was deleted before the job started" when dataset exists in original history. Workflow pauses entirely even when failing job is not upstream dependency. Follow-up to recently fixed #21556.
1. Severity: MEDIUM
Not critical: No data loss or security implications
Not high: Not a crash/hang - workflow pauses, doesn't terminate
Summary: When running a workflow with job caching enabled, jobs fail with "input dataset ... was deleted before the job started" even though the input dataset exists in the current history.
Root Cause: The job cache query in lib/galaxy/managers/jobs.py allows matching a cached job even when the cached job's original input datasets have been deleted. Line 827 has:
```python
or_(b.deleted == false(), c.deleted == false())
```
Where:
b = The HDA used by the cached job (original input)
c = The HDA from the current request (new input)
This condition passes if either is not deleted. So a cached job whose original inputs are now deleted will still match as long as the new request's inputs are valid.
Downstream Effect: When a cached job match is found:
New job created with input associations pointing to new history's datasets
Something in the copy/validation chain references the cached job's deleted original input
Handler fails the job with "was deleted before the job started"
2. Proposed Solution
Add an input exclusion filter similar to _exclude_jobs_with_deleted_outputs() that excludes cached jobs whose input datasets or input collections are now deleted.
The fix should:
Create _exclude_jobs_with_deleted_inputs() method
Call it after _exclude_jobs_with_deleted_outputs() in the search pipeline
Optionally: Fix the condition on line 827 to require both b and c not deleted
Step 3: Fix the inline condition (optional but recommended)
Location: lib/galaxy/managers/jobs.py line 827
Change from:
```python
or_(b.deleted == false(), c.deleted == false()),
```
To:
```python
and_(b.deleted == false(), c.deleted == false()),
```
This makes the matching stricter - both the cached job's input AND the new request's input must not be deleted. This is a belt-and-suspenders approach.
Rationale: The original or_ condition seems intentionally permissive (maybe to allow caching when original was deleted but new exists?). However, this causes downstream errors. Requiring both to exist is safer.
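For concreteness, a minimal SQLAlchemy snippet (throwaway tables, not Galaxy's models) showing how the one-token swap changes the rendered predicate:

```python
# Throwaway tables illustrating the or_ -> and_ change; not Galaxy's models.
import sqlalchemy as sa

metadata = sa.MetaData()
b_tbl = sa.Table("cached_job_input", metadata, sa.Column("deleted", sa.Boolean))
c_tbl = sa.Table("request_input", metadata, sa.Column("deleted", sa.Boolean))

permissive = sa.or_(b_tbl.c.deleted == sa.false(), c_tbl.c.deleted == sa.false())
strict = sa.and_(b_tbl.c.deleted == sa.false(), c_tbl.c.deleted == sa.false())

# Roughly: cached_job_input.deleted = false OR request_input.deleted = false
print(permissive)
# Roughly: cached_job_input.deleted = false AND request_input.deleted = false
print(strict)
```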
4. Files to Modify
| File | Line(s) | Change |
| --- | --- | --- |
| lib/galaxy/managers/jobs.py | 573 | Add call to _exclude_jobs_with_deleted_inputs() |
| lib/galaxy/managers/jobs.py | 765-766 | Add new _exclude_jobs_with_deleted_inputs() method |
| lib/galaxy/managers/jobs.py | 827 | Change or_ to and_ (optional) |
5. Testing Strategy
Unit Test
Create test in test/unit/app/managers/test_job_search.py (new file):
```python
def test_exclude_jobs_with_deleted_inputs():
    """Verify cached jobs with deleted inputs are not matched."""
    # Setup:
    # 1. Create job with input HDA, complete it
    # 2. Delete the input HDA
    # 3. Create new HDA with same dataset
    # 4. Search for matching job
    # Assert: No job found (deleted input should exclude)
```
Integration Test
Add test in test/integration/test_workflow_caching.py (may need to create):
```python
def test_workflow_cache_with_deleted_original_inputs():
    """Test that job cache doesn't reuse jobs whose original inputs were deleted."""
    # 1. Run workflow with job caching
    # 2. Delete input from original history
    # 3. Run same workflow again (new history, same inputs by dataset)
    # 4. Verify: New job runs (no cache hit) OR proper handling
```
Manual Testing
Get reproduction workflow from issue
Run workflow with caching enabled
Delete input dataset from original history
Run workflow again with equivalent input in new history
Verify: No "was deleted before job started" error
Red-to-Green Approach
Write failing test that reproduces the bug:
Create job with input
Delete input HDA (mark deleted=True)
Create new HDA pointing to same dataset
Call JobSearch.by_tool_input()
Assert job is NOT found (currently fails - job IS found)
Implement fix
Test passes
6. Risks and Considerations
Performance
New exclusion filter adds another subquery/EXISTS check
Should be low impact - filtered on indexed job_id column
Follows same pattern as _exclude_jobs_with_deleted_outputs
Backward Compatibility
Breaking change potential: Some users may rely on current behavior where cache works even when original inputs deleted
However, current behavior causes job failures, so fixing is preferred
No API changes, only internal query logic
Edge Cases
Partially deleted inputs: Job has 3 inputs, only 1 deleted
Current fix: Excludes job entirely
This is correct - can't reuse job with any deleted input
Collection with deleted elements: HDCA exists but contains deleted HDAs
Current fix checks HDCA deletion, not element deletion
May need separate handling if collection elements matter
Library datasets (LDDA):
JobToInputLibraryDatasetAssociation also exists
Should add LDDA check too for completeness
Deferred datasets:
Deferred datasets have different state handling
Should verify fix doesn't break deferred dataset caching
Alternative Approaches
Fix at copy time: Instead of excluding from search, handle gracefully during Job.copy_from_job()
Pro: Less restrictive
Con: More complex, may still fail downstream
Fix the OR condition only: Just change line 827 from or_ to and_
Pro: Simpler change
Con: Doesn't catch collections, relies on join condition
Validate inputs during handler: Check if cached job's inputs still valid before using
Pro: Most accurate check at point of use
Con: Late failure, already committed to using cached job
7. Unresolved Questions
Why or_ in original condition? Was there a use case where caching should work with deleted original input? Need to check git history for line 827.
Collection element deletion: Should we also check individual DatasetCollectionElement HDAs for deletion, not just the HDCA?
Library dataset inputs: Need LDDA deletion check too?
Test data availability: Can we reproduce with usegalaxy.eu workflow? Or need simpler reproduction case?
Interaction with issue #21556: Recent fix was in collection copying. Is new issue related to collection input specifically, or also happens with simple HDA inputs?
Job caching incorrectly matches cached jobs whose original input datasets have been deleted. The job cache query in lib/galaxy/managers/jobs.py:827 uses or_(b.deleted == false(), c.deleted == false()) which allows a match when the new request's input exists even if the cached job's original input was deleted. During output copying or handler validation, something references the deleted original input, causing the "was deleted before the job started" error. The most probable fix is adding _exclude_jobs_with_deleted_inputs() filter similar to the existing _exclude_jobs_with_deleted_outputs(). This is a follow-up regression to #21556, reported by the same user testing the same workflow after that fix was merged.
Importance Assessment
| Factor | Assessment |
| --- | --- |
| Severity | MEDIUM - Functional breakage, no data loss |
| Blast Radius | Specific configs - job caching users only |
| Workaround | ACCEPTABLE - Disable caching checkbox |
| Regression | NEW in 25.1.dev (after #21556 fix on 2026-01-12) |
| Priority | NEXT RELEASE - Should fix before 25.1 |
Questions for Discussion
Reproduction: Can we get a minimal reproduction case? The workflow on usegalaxy.eu is complex (DADA2 16S analysis).
Collection vs HDA: Is this specifically a collection input issue (like #21556), or does it also happen with simple HDA inputs?
Which tool fails?: Which specific tool in the workflow triggers the error? Understanding this would help narrow down the cache/copy path.
Original OR condition intent: The or_(b.deleted == false(), c.deleted == false()) condition on line 827 seems intentionally permissive. Was there a use case for allowing cache match when original input is deleted?
mvdbeek's converter hypothesis: mvdbeek mentioned implicit converter failures could trigger this. Is an implicit conversion happening in the failing job?
Fix Estimate
| Aspect | Estimate |
| --- | --- |
| Complexity | Medium - Query change + new filter method |
| Lines of code | ~30-40 LOC |
| Files | 1 (lib/galaxy/managers/jobs.py) |
| Risk | Low - Additive filter, follows existing pattern |
| Testing difficulty | Medium - Need workflow + deleted input scenario |
Reproduction Difficulty
Hard to reproduce without:
Access to usegalaxy.eu shared history/workflow
Understanding which specific job in workflow fails
Dataset collection with specific deletion state
Suggested approach: Create minimal test case with: