jmchilton/ISSUE_21573.md

## ISSUE_21573.md

      
    Raw
  

              ISSUE_21573.md
            
          
    Issue 21573: Conversions triggered via display application

Author: mvdbeek (Marius van den Beek)
Created: 2026-01-13
State: OPEN
Labels: kind/bug
Description

From @jennaj
Hi @dannon -- would you be able to help me with this? The users datasets are being "converted" hundreds of times. I've not seen this before. They are not very large but can clutter up the storage view and is confusing them. This looks like a bug but I am not sure how to ticket it. They stated that they are reviewing data at UCSC. Maybe we are recreating the index for them at each request, instead of reusing it? (guess!) Thank you!
History: https://usegalaxy.org/u/jen-galaxyproject/h/copy-of-tiago-lp-history-too-many-converted-datasets.
Public topic (but this is also in bugs from https://help.galaxyproject.org/t/datasets-visible-in-history-size-but-not-in-history/17205/2
Key Details


Datasets being converted hundreds of times
User is reviewing data at UCSC (display application)
Conversions should be reused, not recreated each request
Clutters storage view and confuses users


## ISSUE_21573_CODE_RESEARCH.md

      
    Raw
  

              ISSUE_21573_CODE_RESEARCH.md
            
          
    Issue 21573: Conversions Triggered via Display Applications - Code Research

Issue Summary

Datasets are being converted hundreds of times when users view data at UCSC via display applications. Conversions should be reused, not recreated on each request. This clutters storage and confuses users.
Relevant Code Paths

1. Display Application Flow

When a user clicks "display at UCSC", the following path is executed:


API Endpoint (lib/galaxy/webapps/galaxy/api/display_applications.py:52-71)

POST /api/display_applications/create_link calls DisplayApplicationsManager.create_link()


Manager (lib/galaxy/managers/display_applications.py:115-206)

create_link() decodes dataset/user, creates PopulatedDisplayApplicationLink
If display_link.display_ready() is False, calls display_link.prepare_display() (line 199)


PopulatedDisplayApplicationLink (lib/galaxy/datatypes/display_applications/application.py:209-273)

prepare_display() (line 235-249) calls param.prepare() for each data parameter
preparing_display() (line 230-233) checks if conversion is already running


DisplayApplicationDataParameter (lib/galaxy/datatypes/display_applications/parameters.py:79-203)

prepare() (line 166-189) - KEY METHOD that triggers conversions
_get_dataset_like_object() (line 119-159) checks for existing conversions
Calls data.get_converted_files_by_type(ext) to find existing converted datasets (line 151)


2. Conversion Triggering Path

In DisplayApplicationDataParameter.prepare() (line 166-189):
def prepare(self, other_values, dataset_hash, user_hash, trans):
    data = self._get_dataset_like_object(other_values)
    if not data and self.formats:
        data = other_values.get(self.dataset, None)
        # start conversion
        (direct_match, target_ext, converted_dataset) = self.datatypes_registry.find_conversion_destination_for_dataset_by_extensions(
            data.extension, self.extensions
        )
        if not direct_match:
            if target_ext and not converted_dataset:
                data.datatype.convert_dataset(
                    trans, data, target_ext, return_output=True, visible=False, history=data.history
                )
3. Conversion Checking Logic

_get_dataset_like_object() (line 147-158):
elif (self.formats and self.extensions and (self.force_conversion or not isinstance(data.datatype, self.formats))):
    for ext in self.extensions:
        rval = data.get_converted_files_by_type(ext)  # Check existing conversions
        if rval:
            return rval
    # If no converted file found, return None (triggers conversion in prepare())
    return None
get_converted_files_by_type() in model (lib/galaxy/model/__init__.py:5440-5451):
def get_converted_files_by_type(self, file_type, include_errored=False):
    for assoc in self.implicitly_converted_datasets:
        if not assoc.deleted and assoc.type == file_type:
            item = assoc.dataset or assoc.dataset_ldda
            valid_states = (Dataset.states.ERROR, *Dataset.valid_input_states) if include_errored else Dataset.valid_input_states
            if not item.deleted and item.state in valid_states:
                return item
    return None
4. convert_dataset Method

Data.convert_dataset() (lib/galaxy/datatypes/data.py:829-880):

Creates converter job
Note: use_cached_job=False by default when called from display applications (line 839)
Creates ImplicitlyConvertedDatasetAssociation (line 871)
Critical: Uses flush_job=False (line 863) - association not immediately visible to DB

5. Dataset Controller Route (Legacy/Direct Access)

DatasetInterface.display_application() (lib/galaxy/webapps/galaxy/controllers/dataset.py:466-601):

Called when UCSC fetches data directly
This is the controller endpoint that serves converted data to external applications
Each data fetch could potentially trigger check for/creation of conversions

Key Files


File
Purpose


lib/galaxy/datatypes/display_applications/parameters.py
Display parameter logic including conversion triggers


lib/galaxy/datatypes/display_applications/application.py
PopulatedDisplayApplicationLink, prepare_display()


lib/galaxy/managers/display_applications.py
create_link API logic


lib/galaxy/datatypes/data.py:829-880
convert_dataset() method


lib/galaxy/model/__init__.py:5440-5451
get_converted_files_by_type()


lib/galaxy/model/__init__.py:5464-5519
get_converted_dataset()


lib/galaxy/model/__init__.py:6868-6948
ImplicitlyConvertedDatasetAssociation model


lib/galaxy/webapps/galaxy/controllers/dataset.py:466-601
display_application controller


lib/galaxy/datatypes/display_applications/configs/ucsc/*.xml
UCSC display app definitions


Theories About Repeated Conversions

Theory 1: Race Condition During Initial Conversion

Evidence:

convert_dataset() uses flush_job=False (line 863), so the ImplicitlyConvertedDatasetAssociation is added to session but NOT immediately committed
Multiple concurrent requests to create_link or display_application will each call get_converted_files_by_type()
If the first request's conversion hasn't been committed yet, subsequent requests won't see it and will trigger new conversions

Race Window:
Request 1: _get_dataset_like_object() -> None (no conversion exists)
Request 1: prepare() -> convert_dataset() -> creates job + assoc (not flushed)
Request 2: _get_dataset_like_object() -> None (assoc not visible in DB yet!)
Request 2: prepare() -> convert_dataset() -> creates ANOTHER job + assoc
...
Request N: Same pattern

Supporting Code:

attach_implicitly_converted_dataset() calls session.add() but there's no commit before the request returns
preparing_display() checks if conversion is RUNNING, but doesn't protect against NEW conversions being triggered

Theory 2: UCSC Makes Multiple Concurrent Data Requests

Evidence:

UCSC display applications use hgt.customText URLs that point back to Galaxy
When UCSC loads the track, it may make multiple byte-range requests concurrently
Each request to display_application controller could trigger conversion check

UCSC Request Pattern:

Initial redirect to UCSC with track URL
UCSC fetches track description file
UCSC makes multiple range requests for actual data
Each request to Galaxy could independently trigger conversion

Supporting Code:

display_application() controller (line 468) doesn't cache/lock on conversion state
Multiple simultaneous requests from UCSC = multiple conversion triggers

Theory 3: Checking Against Running Conversions is Insufficient

Evidence:

is_preparing() only checks if dataset state is NEW, UPLOAD, QUEUED, or RUNNING
get_converted_files_by_type() uses Dataset.valid_input_states which may not include all "in progress" states
If conversion job is created but not yet QUEUED, it might not be detected as "preparing"

Code Path:
def is_preparing(self, other_values):
    value = self._get_dataset_like_object(other_values)
    if value and value.state in (DatasetState.NEW, DatasetState.UPLOAD, DatasetState.QUEUED, DatasetState.RUNNING):
        return True
    return False
The issue: _get_dataset_like_object() returns None if no converted dataset exists yet (even if one is being created), so is_preparing() checks if None has a state - which it doesn't.
Configuration and Parameters

Relevant Galaxy Config Options


enable_old_display_applications - enables legacy display path
Display sites configuration in datatypes_registry.get_display_sites('ucsc')

Display Application XML Parameters


format="bedstrict" - requires conversion from interval/BED to strict BED
format="vcf_bgzip" - requires bgzip compression
format="tabix" - requires tabix indexing
force_conversion="True" - forces conversion even if format matches

UCSC Display Applications that Trigger Conversions


ucsc_interval_as_bed.xml - interval -> bedstrict (line 13)
ucsc_vcf.xml - vcf -> vcf_bgzip and tabix (lines 13-14)
Most others just reference the original data

Potential Fix Areas


Add locking/deduplication in prepare() method - prevent concurrent conversion triggers
Flush conversion association immediately - make it visible to concurrent requests
Add DB-level unique constraint - prevent duplicate ImplicitlyConvertedDatasetAssociation entries
Cache conversion state per-request - avoid re-checking for each parameter
Use use_cached_job=True - reuse completed conversion jobs (currently False by default from display apps)

Unresolved Questions


Are the "hundreds of conversions" from a single user click, or from UCSC making many range requests?
Is there transaction isolation preventing visibility of newly created conversions?
Would use_cached_job=True help, or does it only work for completed jobs (not pending)?
Is there a way to reproduce this reliably in tests?


## ISSUE_21573_HISTORY.md

      
    Raw
  

              ISSUE_21573_HISTORY.md
            
          
    Issue 21573: Git History Research

Summary

Root Cause: The flush_job=False parameter in convert_dataset() method means the ImplicitlyConvertedDatasetAssociation record isn't committed to the database before the method returns. Concurrent requests checking get_converted_files_by_type() won't see the pending conversion, triggering duplicate conversions.
Assessment: This is a regression introduced in April 2024 (release 24.0).
Timeline of Key Changes

1. April 21, 2024 - The Regression Introduced

Commit: 1b1094be215 by mvdbeek
PR: #18036 - [24.0] Fix History Dataset Association creation so that hid is always set
Release: 24.0
This commit:

Added flush_job=False to converter.execute() call in convert_dataset()
Removed session.commit() from attach_implicitly_converted_dataset()
Moved ICDA creation into convert_dataset() (from parameters.py)

The purpose was to fix HDAs missing hid values, but it inadvertently removed the database commit that made conversions visible to concurrent requests.
Before this commit: Display application code in parameters.py explicitly committed:
trans.sa_session.add(assoc)
with transaction(trans.sa_session):
    trans.sa_session.commit()
After this commit: No explicit commit - relies on eventual flush which may not happen before concurrent requests check.
2. October 7, 2025 - Attempted Fix (Partial)

Commit: 2b5ea2234c4 by mvdbeek
PR: #21021 - Use job cache also for implicit conversions
Release: 25.0
This added job caching for implicit conversions:
completed_jobs = converter.completed_jobs(trans, all_params=[params], use_cached_job=use_cached_job)
completed_job = completed_jobs[0] if completed_jobs else None
However, use_cached_job=False by default, and display applications don't pass True, so this doesn't help the UCSC race condition.
3. November 6, 2025 - Display Application Refactor

Commit: 8aa1d7d71bf by guerler
PR: #15076 - Refactor display application handling
Updated parameters.py to pass history=data.history, but didn't address the race condition.
4. January 30, 2025 - Related Fix

Commit: 291dd45af6d by mvdbeek
PR: #19494 - Prevent cycling through failing conversion jobs in trackster
Added include_errored parameter to get_converted_files_by_type() but doesn't help with the pending conversion visibility issue.
Relevant Code Locations


File
Function
Issue


lib/galaxy/datatypes/data.py:858-865
convert_dataset()
Uses flush_job=False


lib/galaxy/model/__init__.py:5521-5528
attach_implicitly_converted_dataset()
No commit after add


lib/galaxy/model/__init__.py:5440-5451
get_converted_files_by_type()
Can't see unflushed records


lib/galaxy/datatypes/display_applications/parameters.py:180-186
prepare()
Calls convert_dataset without waiting


Authors Involved


mvdbeek (Marius van den Beek) - Introduced regression in PR #18036, worked on related fixes
guerler (Aysam Guerler) - Recent display application refactoring
jdavcs (John Davis) - Earlier session handling changes

Pre-Regression Behavior (Before April 2024)

In parameters.py (before commit 1b1094be215):
new_data = next(iter(data.datatype.convert_dataset(...).values()))
new_data.hid = data.hid
new_data.name = data.name
trans.sa_session.add(new_data)
assoc = trans.app.model.ImplicitlyConvertedDatasetAssociation(...)
trans.sa_session.add(assoc)
with transaction(trans.sa_session):
    trans.sa_session.commit()  # <-- This was removed
The explicit commit ensured the ICDA was visible to concurrent requests immediately.
Conclusion

This is a regression from Galaxy 24.0 (April 2024). The fix that prevented missing HIDs inadvertently created a race condition where concurrent display application requests don't see each other's pending conversions because:

flush_job=False delays job record creation
attach_implicitly_converted_dataset() no longer commits
Session changes are only visible after eventual auto-flush or commit

Potential fixes:

Add explicit commit/flush after attach_implicitly_converted_dataset()
Pass use_cached_job=True to convert_dataset() from display applications
Check for pending (unflushed) conversions in get_converted_files_by_type()
Add locking/mutex around conversion check-and-start logic


## ISSUE_21573_IMPORTANCE.md

      
    Raw
  

              ISSUE_21573_IMPORTANCE.md
            
          
    Issue 21573 Importance Assessment

Issue Summary

Datasets converted hundreds of times when viewing via display applications (e.g., UCSC browser). Conversions should be reused, not recreated each request.
1. Severity: MEDIUM

No data loss - User data remains intact; conversions create additional copies but don't corrupt or delete source data.
No security issue - Not a vulnerability or authentication/authorization bypass.
No crash/hang - Galaxy remains operational; issue manifests as excessive resource consumption.
Functional breakage - Yes, but partial:

Conversion reuse mechanism is broken (functional issue)
Galaxy still works, just inefficiently
User workflow not blocked, but cluttered

Classification: Resource waste + UX degradation, not critical failure.
2. Blast Radius: LIMITED

Affected users:

Users of display applications (UCSC browser, IGV, etc.)
Particularly those making multiple requests to view same dataset

NOT affected:

Tool execution / job running
Data upload/download
History management (outside display apps)
API consumers not using display applications
Users who don't use external genome browsers

Configurations affected:

All Galaxy instances with display applications enabled
Primarily usegalaxy.org and similar production instances
Dev/test instances less impacted (lower traffic, less concurrency)

Estimate: Maybe 5-15% of user sessions involve display applications, and of those, only subset trigger concurrent conversion requests.
3. Workaround Existence: YES, but painful

Admin workaround:

Periodically clean up duplicate ICDA records
SQL: Identify and remove duplicate conversions of same parent/type

User workaround:

Wait for conversion to complete before clicking display link again
Delete duplicate converted datasets manually from history
Use storage manager to clean up clutter

Workaround pain level: MODERATE

Users confused by clutter (see support request in issue)
Storage usage inflated (quota impact)
Manual cleanup tedious but possible

4. Regression Status: CONFIRMED REGRESSION

Regression introduced: Galaxy 24.0 (April 2024)
Source: PR #18036 - "Fix History Dataset Association creation so that hid is always set"
Mechanism: PR #18036 changed session flush behavior. ICDA records added to session but not immediately flushed. Concurrent requests don't see pending conversions in get_converted_files_by_type() because query only sees committed records.
Age of regression: ~9 months (April 2024 to January 2025)
Code path:

attach_implicitly_converted_dataset() creates ICDA
Adds to session without immediate flush
Second concurrent request checks get_converted_files_by_type()
Query returns None (ICDA not committed yet)
Second conversion triggered

5. User Impact Signals

Issue reactions: None recorded yet (newly filed)
Duplicate reports: None found via search
Support requests:

Original report via usegalaxy.org support (link in issue body)
Help forum thread: https://help.galaxyproject.org/t/datasets-visible-in-history-size-but-not-in-history/17205/2

Reporter: Jennifer Hillman-Jackson (Galaxy team member) via Marius van den Beek - credible, well-documented report.
User testimony: "datasets are being converted hundreds of times" - indicates significant user-facing impact when it occurs.
6. Overall Recommendation: NEXT RELEASE (High Priority)

NOT Hotfix because:


Not critical/breaking functionality
Workarounds exist
No data loss or security impact
Regression exists 9 months without widespread reports

Not Backlog because:


Confirmed regression in recent release
Simple fix identified (add flush after ICDA creation)
Causes storage waste on production instances
User confusion documented
Fix is low-risk

Recommended Action:


Priority: High for next regular release (24.2 or 25.0)
Fix: Add session flush in attach_implicitly_converted_dataset() after ICDA creation
Test: Add concurrency test for display application conversion
Backport: Consider backport to 24.1.x if low-risk

Risk Assessment of Fix:


Low risk - Adding flush is defensive, won't break existing behavior
Session already has autoflush on queries; explicit flush just makes timing deterministic
Similar pattern used elsewhere in codebase

Why higher priority than typical medium-severity bugs:


Affects public-facing usegalaxy.org
Creates visible clutter users notice and report
Storage quota implications for users
Simple, low-risk fix available
Regression vs long-standing issue


Assessment date: 2025-01-13
Assessor: Claude Code

  
## ISSUE_21573_PLAN.md

      
    Raw
  

              ISSUE_21573_PLAN.md
            
          
    Issue 21573 Fix Plan: Duplicate Conversions via Display Applications

Problem Summary

Datasets are being converted hundreds of times when users view data at UCSC via display applications. This is a race condition where multiple concurrent requests check for existing conversions, find none (because the ICDA record hasn't been committed yet), and all start new conversions.
Root Cause Analysis

The Race Condition Timeline


Request A arrives, calls get_converted_files_by_type() - finds no conversion
Request A calls convert_dataset() which:

Creates job with flush_job=False (line 863 in data.py)
Creates ICDA record via attach_implicitly_converted_dataset() - NO COMMIT YET
Calls job_manager.enqueue() - commit happens here eventually


Request B arrives (before Request A's commit), calls get_converted_files_by_type() - finds no conversion
Request B starts another conversion...
Repeat for hundreds of concurrent requests

The Breaking Change (PR #18036, Galaxy 24.0)

Before PR #18036 (attach_implicitly_converted_dataset in model/init.py):
def attach_implicitly_converted_dataset(self, session, new_dataset, target_ext: str):
    # ...
    session.add(new_dataset)
    session.add(assoc)
    with transaction(session):
        session.commit()  # <-- IMMEDIATE COMMIT
    return new_dataset
After PR #18036:
def attach_implicitly_converted_dataset(self, session, new_dataset, target_ext: str):
    # ...
    session.add(new_dataset)
    session.add(assoc)
    # NO COMMIT - deferred to job_manager.enqueue()
The change was made to fix HID assignment issues, but it introduced this race window.
Key Files


File
Line
Function


lib/galaxy/datatypes/display_applications/parameters.py
166-189
DisplayApplicationDataParameter.prepare() - triggers conversion


lib/galaxy/datatypes/data.py
829-880
convert_dataset() - creates job and ICDA


lib/galaxy/model/__init__.py
5440-5451
get_converted_files_by_type() - checks for existing conversion


lib/galaxy/model/__init__.py
5521-5528
attach_implicitly_converted_dataset() - creates ICDA record


lib/galaxy/datatypes/registry.py
948
find_conversion_destination_for_dataset_by_extensions() - also checks


Proposed Approaches

Approach 1: Re-add Flush After ICDA Creation (Recommended)

Rationale: Simplest fix, restores pre-24.0 behavior for this specific path, minimal risk.
Changes:
File: lib/galaxy/datatypes/data.py (lines 858-872)
# Current code:
job, converted_datasets, *_ = converter.execute(
    trans,
    incoming=params,
    set_output_hid=visible,
    history=history,
    flush_job=False,
    completed_job=completed_job,
)
# We should only have a single converted output, but let's be defensive here
n_converted_datasets = len(converted_datasets)
for converted_dataset in converted_datasets.values():
    if converted_dataset.extension == "auto" and n_converted_datasets == 1:
        converted_dataset.extension = target_type
    original_dataset.attach_implicitly_converted_dataset(trans.sa_session, converted_dataset, target_type)
trans.app.job_manager.enqueue(job, tool=converter)

# Proposed change:
job, converted_datasets, *_ = converter.execute(
    trans,
    incoming=params,
    set_output_hid=visible,
    history=history,
    flush_job=False,
    completed_job=completed_job,
)
# We should only have a single converted output, but let's be defensive here
n_converted_datasets = len(converted_datasets)
for converted_dataset in converted_datasets.values():
    if converted_dataset.extension == "auto" and n_converted_datasets == 1:
        converted_dataset.extension = target_type
    original_dataset.attach_implicitly_converted_dataset(trans.sa_session, converted_dataset, target_type)
# Flush ICDA record immediately to prevent duplicate conversions from concurrent requests
trans.sa_session.flush()
trans.app.job_manager.enqueue(job, tool=converter)
Pros:

Simple, targeted fix
Low risk - only adds a flush, doesn't change model or schema
Restores intended behavior without full commit

Cons:

Doesn't prevent true concurrent execution (microsecond window still exists)
Requires the flush to complete before enqueue


Approach 2: Database Unique Constraint

Rationale: Database-level enforcement prevents duplicates even with concurrent transactions.
Changes:
File: lib/galaxy/model/__init__.py (after line 6869)
class ImplicitlyConvertedDatasetAssociation(Base, Serializable):
    __tablename__ = "implicitly_converted_dataset_association"
    __table_args__ = (
        # Prevent duplicate conversions for same parent/type combination
        UniqueConstraint('hda_parent_id', 'type', name='uq_icda_hda_parent_type'),
        UniqueConstraint('ldda_parent_id', 'type', name='uq_icda_ldda_parent_type'),
    )
File: lib/galaxy/model/migrations/alembic/versions_gxy/XXX_add_icda_unique_constraint.py (new migration)
"""Add unique constraint to implicitly_converted_dataset_association

Revision ID: XXX
"""
from alembic import op

def upgrade():
    # First, delete duplicate records (keep oldest)
    op.execute("""
        DELETE FROM implicitly_converted_dataset_association
        WHERE id NOT IN (
            SELECT MIN(id)
            FROM implicitly_converted_dataset_association
            GROUP BY hda_parent_id, type
        ) AND hda_parent_id IS NOT NULL
    """)
    op.execute("""
        DELETE FROM implicitly_converted_dataset_association
        WHERE id NOT IN (
            SELECT MIN(id)
            FROM implicitly_converted_dataset_association
            GROUP BY ldda_parent_id, type
        ) AND ldda_parent_id IS NOT NULL
    """)

    op.create_unique_constraint('uq_icda_hda_parent_type',
                                'implicitly_converted_dataset_association',
                                ['hda_parent_id', 'type'])
    op.create_unique_constraint('uq_icda_ldda_parent_type',
                                'implicitly_converted_dataset_association',
                                ['ldda_parent_id', 'type'])

def downgrade():
    op.drop_constraint('uq_icda_hda_parent_type', 'implicitly_converted_dataset_association')
    op.drop_constraint('uq_icda_ldda_parent_type', 'implicitly_converted_dataset_association')
File: lib/galaxy/model/__init__.py (modify attach_implicitly_converted_dataset)
def attach_implicitly_converted_dataset(self, session, new_dataset, target_ext: str):
    new_dataset.name = self.name
    self.copy_attributes(new_dataset)
    assoc = ImplicitlyConvertedDatasetAssociation(
        parent=self, file_type=target_ext, dataset=new_dataset, metadata_safe=False
    )
    session.add(new_dataset)
    session.add(assoc)
    try:
        session.flush()
    except IntegrityError:
        session.rollback()
        # Another request already created this conversion, return None to signal duplicate
        return None
    return assoc
File: lib/galaxy/datatypes/data.py (modify convert_dataset)
for converted_dataset in converted_datasets.values():
    if converted_dataset.extension == "auto" and n_converted_datasets == 1:
        converted_dataset.extension = target_type
    assoc = original_dataset.attach_implicitly_converted_dataset(trans.sa_session, converted_dataset, target_type)
    if assoc is None:
        # Duplicate conversion already exists, skip this job
        log.info(f"Skipping duplicate conversion to {target_type} for dataset {original_dataset.id}")
        return None  # or handle differently
trans.app.job_manager.enqueue(job, tool=converter)
Pros:

Database-enforced uniqueness - bulletproof against race conditions
Catches duplicates at the database level
Self-documenting constraint

Cons:

Requires migration with data cleanup
Need to handle IntegrityError in application code
Migration could be slow on large databases
More complex implementation


Approach 3: Application-Level Locking

Rationale: Use Redis/database locking to serialize conversion initiation.
Changes:
File: lib/galaxy/datatypes/data.py (modify convert_dataset)
def convert_dataset(
    self,
    trans,
    original_dataset: DatasetHasHidProtocol,
    target_type: str,
    ...
):
    # Generate lock key for this specific conversion
    lock_key = f"conversion:{original_dataset.id}:{target_type}"

    # Try to acquire lock (non-blocking)
    lock = trans.app.model.context.get_conversion_lock(lock_key)
    if not lock.acquire(blocking=False):
        # Another request is already starting this conversion
        log.debug(f"Conversion to {target_type} for {original_dataset.id} already in progress")
        return None

    try:
        # Double-check after acquiring lock
        existing = original_dataset.get_converted_files_by_type(target_type)
        if existing:
            return existing  # Already converted

        # Proceed with conversion...
        converter = trans.app.datatypes_registry.get_converter_by_target_type(...)
        # ... rest of conversion logic
    finally:
        lock.release()
Pros:

Serializes conversion requests properly
No database schema changes

Cons:

Requires lock infrastructure (Redis or similar)
Complex implementation
Lock management overhead
Potential deadlock risks


Approach 4: Enable Job Caching for Display Applications

Rationale: PR #21021 added job caching for implicit conversions, but it's not enabled by default for display applications.
Changes:
File: lib/galaxy/datatypes/display_applications/parameters.py (line 184-186)
# Current code:
data.datatype.convert_dataset(
    trans, data, target_ext, return_output=True, visible=False, history=data.history
)

# Proposed change:
data.datatype.convert_dataset(
    trans, data, target_ext, return_output=True, visible=False, history=data.history,
    use_cached_job=True
)
File: lib/galaxy/datatypes/data.py (line 839)
Already has use_cached_job: bool = False parameter, but it's not being used from display applications.
Analysis of why this alone won't fix the issue:
The job caching in PR #21021 uses completed_jobs() which queries for completed jobs with matching parameters. However:

It checks for completed jobs, not in-progress ones
The race occurs BEFORE any job is committed to the database
Job caching helps avoid re-running a conversion that already succeeded, not preventing concurrent starts

Verdict: Job caching is useful but doesn't fix this race condition. Multiple requests would still all fail to find a cached job and start new ones.

Recommended Solution

Approach 1 (Re-add Flush) is recommended for these reasons:

Simplest fix - one line change
Low risk - doesn't change schema or require migrations
Addresses root cause - makes ICDA visible before enqueue
Easy to verify - behavior matches pre-24.0

Secondary recommendation: Also add Approach 2 (unique constraint) as a longer-term safeguard, but as a separate PR/release to avoid coupling fixes.

Implementation Plan

Phase 1: Immediate Fix (Approach 1)


Add flush in convert_dataset()
File: lib/galaxy/datatypes/data.py, after line 871
for converted_dataset in converted_datasets.values():
    if converted_dataset.extension == "auto" and n_converted_datasets == 1:
        converted_dataset.extension = target_type
    original_dataset.attach_implicitly_converted_dataset(trans.sa_session, converted_dataset, target_type)
# Flush ICDA record to prevent duplicate conversions from concurrent requests
trans.sa_session.flush()
trans.app.job_manager.enqueue(job, tool=converter)


Phase 2: Future Enhancement (Approach 2)


Create migration to add unique constraint (separate PR)
Handle IntegrityError in attach_implicitly_converted_dataset


Test Strategy

Unit Test: Race Condition Simulation

Create a test that simulates concurrent conversion requests:
File: lib/galaxy_test/unit/data/test_conversion_race.py (new file)
import threading
from unittest.mock import MagicMock, patch

def test_concurrent_conversion_requests_dont_duplicate():
    """
    Verify that concurrent requests to convert the same dataset
    don't create duplicate ICDA records.
    """
    mock_session = MagicMock()
    mock_dataset = MagicMock()
    mock_dataset.id = 1
    mock_dataset.get_converted_files_by_type.return_value = None

    conversions_started = []
    lock = threading.Lock()

    def mock_convert(*args, **kwargs):
        with lock:
            conversions_started.append(1)
        return MagicMock()

    # Start multiple threads attempting conversion
    threads = []
    for _ in range(10):
        t = threading.Thread(target=mock_convert)
        threads.append(t)
        t.start()

    for t in threads:
        t.join()

    # Should only have started one conversion
    # (This would fail with current code, pass with fix)
Integration Test: Display Application Conversion

File: lib/galaxy_test/api/test_display_applications.py (new or extend existing)
from concurrent.futures import ThreadPoolExecutor, as_completed

class DisplayApplicationsConversionTestCase(ApiTestCase):

    def test_concurrent_display_app_conversion(self):
        """
        Test that concurrent requests to display application
        don't create duplicate conversions.
        """
        # Upload a dataset that requires conversion (e.g., BAM)
        history_id = self.dataset_populator.new_history()
        hda = self.dataset_populator.new_dataset(
            history_id,
            content=open("test-data/1.bam", "rb").read(),
            file_type="bam"
        )
        dataset_id = hda["id"]

        # Make concurrent requests to UCSC display app
        def make_request():
            return self._post(
                "display_applications/create_link",
                data={
                    "app_name": "ucsc",
                    "link_name": "main",
                    "dataset_id": dataset_id,
                }
            )

        # 20 concurrent requests
        with ThreadPoolExecutor(max_workers=20) as executor:
            futures = [executor.submit(make_request) for _ in range(20)]
            results = [f.result() for f in as_completed(futures)]

        # Verify only one conversion job was created
        # Query ICDA table for this dataset
        icda_count = self._get_icda_count_for_dataset(dataset_id)
        assert icda_count == 1, f"Expected 1 ICDA, got {icda_count}"
Manual Testing


Set up Galaxy with UCSC display application
Upload a BAM file
Open UCSC link in multiple browser tabs simultaneously
Verify in database: SELECT COUNT(*) FROM implicitly_converted_dataset_association WHERE hda_parent_id = <id>
Should be exactly 1


Risks and Considerations

Performance Impact


Flush overhead: One additional flush per conversion is minimal
Lock contention: None with Approach 1

Backward Compatibility


No API changes
No schema changes (for Approach 1)
Behavior matches pre-24.0 Galaxy

Edge Cases


Session already flushed: If session auto-flushes before our explicit flush, no harm
Transaction rollback: If later code rollbacks, both ICDA and job are undone (correct behavior)
Database disconnection: Standard error handling applies

Migration Concerns (if Approach 2 is added later)


Existing duplicate ICDA records need cleanup before constraint can be added
Large tables may take time to add constraint


Unresolved Questions


Should we also enable use_cached_job=True in display_applications/parameters.py for additional protection?
For Approach 2, what's the strategy for cleaning up existing duplicate ICDAs - delete all but oldest, or all but newest?
Should we add logging to track when a duplicate conversion was prevented?
Is there a concern about the flush happening before enqueue affecting HID assignment (the original reason for PR #18036)?
Should fix be backported to 24.0, 24.1, 25.0 branches?


## ISSUE_21573_SUMMARY.md

      
    Raw
  

              ISSUE_21573_SUMMARY.md
            
          
    Issue 21573 Summary: Duplicate Conversions via Display Applications

Top-Line Summary

Display applications (e.g., UCSC browser) trigger duplicate dataset conversions due to a race condition introduced in Galaxy 24.0 (PR #18036). The fix that prevented missing HIDs inadvertently removed an immediate session commit from attach_implicitly_converted_dataset(), causing concurrent requests to not see pending conversions and each start new ones. The fix is simple: add trans.sa_session.flush() after ICDA creation in lib/galaxy/datatypes/data.py:871.
Importance Assessment


Dimension
Assessment


Severity
MEDIUM - Resource waste and UX degradation, no data loss or security issues


Blast Radius
LIMITED - Only affects display application users (5-15% of sessions)


Workaround
EXISTS but painful - Manual cleanup possible, storage quota impact


Regression
CONFIRMED - Introduced in Galaxy 24.0 (April 2024), PR #18036


Priority
HIGH for next release - Simple fix, low risk, affects usegalaxy.org


Questions for Discussion


Should fix be backported to 24.x/25.0 branches?
Does flushing before job_manager.enqueue() affect HID assignment (original concern in PR #18036)?
Add unique constraint on ICDA table as additional safeguard (separate PR)?
Enable use_cached_job=True for display applications too?
Cleanup strategy for existing duplicate conversions on production instances?

Effort Estimate

Fix complexity: LOW (one-line change)

Add trans.sa_session.flush() after ICDA creation loop in convert_dataset()
File: lib/galaxy/datatypes/data.py, after line 871

Testing complexity: MEDIUM

Need concurrency test to verify fix
ThreadPoolExecutor test with display_applications API
Manual testing with UCSC browser

Reproduction difficulty: MODERATE

Requires concurrent requests to display application
More likely on production instances with real traffic
Can simulate with concurrent API calls

Key Files


File
Function
Role


lib/galaxy/datatypes/data.py:829-880
convert_dataset()
Where fix goes


lib/galaxy/datatypes/display_applications/parameters.py:166-189
prepare()
Triggers conversion


lib/galaxy/model/__init__.py:5440-5451
get_converted_files_by_type()
Race condition check


lib/galaxy/model/__init__.py:5521-5528
attach_implicitly_converted_dataset()
Creates ICDA (no commit)


Regression Source

PR #18036 (April 2024) by @mvdbeek removed session.commit() from attach_implicitly_converted_dataset() to fix HID assignment issues. This created a race window where ICDA records aren't visible to concurrent requests.
File	Purpose
`lib/galaxy/datatypes/display_applications/parameters.py`	Display parameter logic including conversion triggers
`lib/galaxy/datatypes/display_applications/application.py`	PopulatedDisplayApplicationLink, prepare_display()
`lib/galaxy/managers/display_applications.py`	create_link API logic
`lib/galaxy/datatypes/data.py:829-880`	convert_dataset() method
`lib/galaxy/model/__init__.py:5440-5451`	get_converted_files_by_type()
`lib/galaxy/model/__init__.py:5464-5519`	get_converted_dataset()
`lib/galaxy/model/__init__.py:6868-6948`	ImplicitlyConvertedDatasetAssociation model
`lib/galaxy/webapps/galaxy/controllers/dataset.py:466-601`	display_application controller
`lib/galaxy/datatypes/display_applications/configs/ucsc/*.xml`	UCSC display app definitions
File	Function	Issue
`lib/galaxy/datatypes/data.py:858-865`	`convert_dataset()`	Uses `flush_job=False`
`lib/galaxy/model/__init__.py:5521-5528`	`attach_implicitly_converted_dataset()`	No commit after add
`lib/galaxy/model/__init__.py:5440-5451`	`get_converted_files_by_type()`	Can't see unflushed records
`lib/galaxy/datatypes/display_applications/parameters.py:180-186`	`prepare()`	Calls convert_dataset without waiting
File	Line	Function
`lib/galaxy/datatypes/display_applications/parameters.py`	166-189	`DisplayApplicationDataParameter.prepare()` - triggers conversion
`lib/galaxy/datatypes/data.py`	829-880	`convert_dataset()` - creates job and ICDA
`lib/galaxy/model/__init__.py`	5440-5451	`get_converted_files_by_type()` - checks for existing conversion
`lib/galaxy/model/__init__.py`	5521-5528	`attach_implicitly_converted_dataset()` - creates ICDA record
`lib/galaxy/datatypes/registry.py`	948	`find_conversion_destination_for_dataset_by_extensions()` - also checks
Dimension	Assessment
Severity	MEDIUM - Resource waste and UX degradation, no data loss or security issues
Blast Radius	LIMITED - Only affects display application users (5-15% of sessions)
Workaround	EXISTS but painful - Manual cleanup possible, storage quota impact
Regression	CONFIRMED - Introduced in Galaxy 24.0 (April 2024), PR #18036
Priority	HIGH for next release - Simple fix, low risk, affects usegalaxy.org
File	Function	Role
`lib/galaxy/datatypes/data.py:829-880`	`convert_dataset()`	Where fix goes
`lib/galaxy/datatypes/display_applications/parameters.py:166-189`	`prepare()`	Triggers conversion
`lib/galaxy/model/__init__.py:5440-5451`	`get_converted_files_by_type()`	Race condition check
`lib/galaxy/model/__init__.py:5521-5528`	`attach_implicitly_converted_dataset()`	Creates ICDA (no commit)