Skip to content

Instantly share code, notes, and snippets.

@jmchilton
Created January 13, 2026 20:51
Show Gist options
  • Select an option

  • Save jmchilton/3f9f3b7845488f8b0cb109a2ed01b074 to your computer and use it in GitHub Desktop.

Select an option

Save jmchilton/3f9f3b7845488f8b0cb109a2ed01b074 to your computer and use it in GitHub Desktop.
Issue 21573 Triage: Duplicate Conversions via Display Applications

Issue 21573: Conversions triggered via display application

Author: mvdbeek (Marius van den Beek) Created: 2026-01-13 State: OPEN Labels: kind/bug

Description

From @jennaj

Hi @dannon -- would you be able to help me with this? The users datasets are being "converted" hundreds of times. I've not seen this before. They are not very large but can clutter up the storage view and is confusing them. This looks like a bug but I am not sure how to ticket it. They stated that they are reviewing data at UCSC. Maybe we are recreating the index for them at each request, instead of reusing it? (guess!) Thank you!

History: https://usegalaxy.org/u/jen-galaxyproject/h/copy-of-tiago-lp-history-too-many-converted-datasets. Public topic (but this is also in bugs from https://help.galaxyproject.org/t/datasets-visible-in-history-size-but-not-in-history/17205/2

Key Details

  • Datasets being converted hundreds of times
  • User is reviewing data at UCSC (display application)
  • Conversions should be reused, not recreated each request
  • Clutters storage view and confuses users

Issue 21573: Conversions Triggered via Display Applications - Code Research

Issue Summary

Datasets are being converted hundreds of times when users view data at UCSC via display applications. Conversions should be reused, not recreated on each request. This clutters storage and confuses users.

Relevant Code Paths

1. Display Application Flow

When a user clicks "display at UCSC", the following path is executed:

  1. API Endpoint (lib/galaxy/webapps/galaxy/api/display_applications.py:52-71)

    • POST /api/display_applications/create_link calls DisplayApplicationsManager.create_link()
  2. Manager (lib/galaxy/managers/display_applications.py:115-206)

    • create_link() decodes dataset/user, creates PopulatedDisplayApplicationLink
    • If display_link.display_ready() is False, calls display_link.prepare_display() (line 199)
  3. PopulatedDisplayApplicationLink (lib/galaxy/datatypes/display_applications/application.py:209-273)

    • prepare_display() (line 235-249) calls param.prepare() for each data parameter
    • preparing_display() (line 230-233) checks if conversion is already running
  4. DisplayApplicationDataParameter (lib/galaxy/datatypes/display_applications/parameters.py:79-203)

    • prepare() (line 166-189) - KEY METHOD that triggers conversions
    • _get_dataset_like_object() (line 119-159) checks for existing conversions
    • Calls data.get_converted_files_by_type(ext) to find existing converted datasets (line 151)

2. Conversion Triggering Path

In DisplayApplicationDataParameter.prepare() (line 166-189):

def prepare(self, other_values, dataset_hash, user_hash, trans):
    data = self._get_dataset_like_object(other_values)
    if not data and self.formats:
        data = other_values.get(self.dataset, None)
        # start conversion
        (direct_match, target_ext, converted_dataset) = self.datatypes_registry.find_conversion_destination_for_dataset_by_extensions(
            data.extension, self.extensions
        )
        if not direct_match:
            if target_ext and not converted_dataset:
                data.datatype.convert_dataset(
                    trans, data, target_ext, return_output=True, visible=False, history=data.history
                )

3. Conversion Checking Logic

_get_dataset_like_object() (line 147-158):

elif (self.formats and self.extensions and (self.force_conversion or not isinstance(data.datatype, self.formats))):
    for ext in self.extensions:
        rval = data.get_converted_files_by_type(ext)  # Check existing conversions
        if rval:
            return rval
    # If no converted file found, return None (triggers conversion in prepare())
    return None

get_converted_files_by_type() in model (lib/galaxy/model/__init__.py:5440-5451):

def get_converted_files_by_type(self, file_type, include_errored=False):
    for assoc in self.implicitly_converted_datasets:
        if not assoc.deleted and assoc.type == file_type:
            item = assoc.dataset or assoc.dataset_ldda
            valid_states = (Dataset.states.ERROR, *Dataset.valid_input_states) if include_errored else Dataset.valid_input_states
            if not item.deleted and item.state in valid_states:
                return item
    return None

4. convert_dataset Method

Data.convert_dataset() (lib/galaxy/datatypes/data.py:829-880):

  • Creates converter job
  • Note: use_cached_job=False by default when called from display applications (line 839)
  • Creates ImplicitlyConvertedDatasetAssociation (line 871)
  • Critical: Uses flush_job=False (line 863) - association not immediately visible to DB

5. Dataset Controller Route (Legacy/Direct Access)

DatasetInterface.display_application() (lib/galaxy/webapps/galaxy/controllers/dataset.py:466-601):

  • Called when UCSC fetches data directly
  • This is the controller endpoint that serves converted data to external applications
  • Each data fetch could potentially trigger check for/creation of conversions

Key Files

File Purpose
lib/galaxy/datatypes/display_applications/parameters.py Display parameter logic including conversion triggers
lib/galaxy/datatypes/display_applications/application.py PopulatedDisplayApplicationLink, prepare_display()
lib/galaxy/managers/display_applications.py create_link API logic
lib/galaxy/datatypes/data.py:829-880 convert_dataset() method
lib/galaxy/model/__init__.py:5440-5451 get_converted_files_by_type()
lib/galaxy/model/__init__.py:5464-5519 get_converted_dataset()
lib/galaxy/model/__init__.py:6868-6948 ImplicitlyConvertedDatasetAssociation model
lib/galaxy/webapps/galaxy/controllers/dataset.py:466-601 display_application controller
lib/galaxy/datatypes/display_applications/configs/ucsc/*.xml UCSC display app definitions

Theories About Repeated Conversions

Theory 1: Race Condition During Initial Conversion

Evidence:

  • convert_dataset() uses flush_job=False (line 863), so the ImplicitlyConvertedDatasetAssociation is added to session but NOT immediately committed
  • Multiple concurrent requests to create_link or display_application will each call get_converted_files_by_type()
  • If the first request's conversion hasn't been committed yet, subsequent requests won't see it and will trigger new conversions

Race Window:

Request 1: _get_dataset_like_object() -> None (no conversion exists)
Request 1: prepare() -> convert_dataset() -> creates job + assoc (not flushed)
Request 2: _get_dataset_like_object() -> None (assoc not visible in DB yet!)
Request 2: prepare() -> convert_dataset() -> creates ANOTHER job + assoc
...
Request N: Same pattern

Supporting Code:

  • attach_implicitly_converted_dataset() calls session.add() but there's no commit before the request returns
  • preparing_display() checks if conversion is RUNNING, but doesn't protect against NEW conversions being triggered

Theory 2: UCSC Makes Multiple Concurrent Data Requests

Evidence:

  • UCSC display applications use hgt.customText URLs that point back to Galaxy
  • When UCSC loads the track, it may make multiple byte-range requests concurrently
  • Each request to display_application controller could trigger conversion check

UCSC Request Pattern:

  • Initial redirect to UCSC with track URL
  • UCSC fetches track description file
  • UCSC makes multiple range requests for actual data
  • Each request to Galaxy could independently trigger conversion

Supporting Code:

  • display_application() controller (line 468) doesn't cache/lock on conversion state
  • Multiple simultaneous requests from UCSC = multiple conversion triggers

Theory 3: Checking Against Running Conversions is Insufficient

Evidence:

  • is_preparing() only checks if dataset state is NEW, UPLOAD, QUEUED, or RUNNING
  • get_converted_files_by_type() uses Dataset.valid_input_states which may not include all "in progress" states
  • If conversion job is created but not yet QUEUED, it might not be detected as "preparing"

Code Path:

def is_preparing(self, other_values):
    value = self._get_dataset_like_object(other_values)
    if value and value.state in (DatasetState.NEW, DatasetState.UPLOAD, DatasetState.QUEUED, DatasetState.RUNNING):
        return True
    return False

The issue: _get_dataset_like_object() returns None if no converted dataset exists yet (even if one is being created), so is_preparing() checks if None has a state - which it doesn't.

Configuration and Parameters

Relevant Galaxy Config Options

  • enable_old_display_applications - enables legacy display path
  • Display sites configuration in datatypes_registry.get_display_sites('ucsc')

Display Application XML Parameters

  • format="bedstrict" - requires conversion from interval/BED to strict BED
  • format="vcf_bgzip" - requires bgzip compression
  • format="tabix" - requires tabix indexing
  • force_conversion="True" - forces conversion even if format matches

UCSC Display Applications that Trigger Conversions

  1. ucsc_interval_as_bed.xml - interval -> bedstrict (line 13)
  2. ucsc_vcf.xml - vcf -> vcf_bgzip and tabix (lines 13-14)
  3. Most others just reference the original data

Potential Fix Areas

  1. Add locking/deduplication in prepare() method - prevent concurrent conversion triggers
  2. Flush conversion association immediately - make it visible to concurrent requests
  3. Add DB-level unique constraint - prevent duplicate ImplicitlyConvertedDatasetAssociation entries
  4. Cache conversion state per-request - avoid re-checking for each parameter
  5. Use use_cached_job=True - reuse completed conversion jobs (currently False by default from display apps)

Unresolved Questions

  1. Are the "hundreds of conversions" from a single user click, or from UCSC making many range requests?
  2. Is there transaction isolation preventing visibility of newly created conversions?
  3. Would use_cached_job=True help, or does it only work for completed jobs (not pending)?
  4. Is there a way to reproduce this reliably in tests?

Issue 21573: Git History Research

Summary

Root Cause: The flush_job=False parameter in convert_dataset() method means the ImplicitlyConvertedDatasetAssociation record isn't committed to the database before the method returns. Concurrent requests checking get_converted_files_by_type() won't see the pending conversion, triggering duplicate conversions.

Assessment: This is a regression introduced in April 2024 (release 24.0).

Timeline of Key Changes

1. April 21, 2024 - The Regression Introduced

Commit: 1b1094be215 by mvdbeek PR: #18036 - [24.0] Fix History Dataset Association creation so that hid is always set Release: 24.0

This commit:

  1. Added flush_job=False to converter.execute() call in convert_dataset()
  2. Removed session.commit() from attach_implicitly_converted_dataset()
  3. Moved ICDA creation into convert_dataset() (from parameters.py)

The purpose was to fix HDAs missing hid values, but it inadvertently removed the database commit that made conversions visible to concurrent requests.

Before this commit: Display application code in parameters.py explicitly committed:

trans.sa_session.add(assoc)
with transaction(trans.sa_session):
    trans.sa_session.commit()

After this commit: No explicit commit - relies on eventual flush which may not happen before concurrent requests check.

2. October 7, 2025 - Attempted Fix (Partial)

Commit: 2b5ea2234c4 by mvdbeek PR: #21021 - Use job cache also for implicit conversions Release: 25.0

This added job caching for implicit conversions:

completed_jobs = converter.completed_jobs(trans, all_params=[params], use_cached_job=use_cached_job)
completed_job = completed_jobs[0] if completed_jobs else None

However, use_cached_job=False by default, and display applications don't pass True, so this doesn't help the UCSC race condition.

3. November 6, 2025 - Display Application Refactor

Commit: 8aa1d7d71bf by guerler PR: #15076 - Refactor display application handling

Updated parameters.py to pass history=data.history, but didn't address the race condition.

4. January 30, 2025 - Related Fix

Commit: 291dd45af6d by mvdbeek PR: #19494 - Prevent cycling through failing conversion jobs in trackster

Added include_errored parameter to get_converted_files_by_type() but doesn't help with the pending conversion visibility issue.

Relevant Code Locations

File Function Issue
lib/galaxy/datatypes/data.py:858-865 convert_dataset() Uses flush_job=False
lib/galaxy/model/__init__.py:5521-5528 attach_implicitly_converted_dataset() No commit after add
lib/galaxy/model/__init__.py:5440-5451 get_converted_files_by_type() Can't see unflushed records
lib/galaxy/datatypes/display_applications/parameters.py:180-186 prepare() Calls convert_dataset without waiting

Authors Involved

  • mvdbeek (Marius van den Beek) - Introduced regression in PR #18036, worked on related fixes
  • guerler (Aysam Guerler) - Recent display application refactoring
  • jdavcs (John Davis) - Earlier session handling changes

Pre-Regression Behavior (Before April 2024)

In parameters.py (before commit 1b1094be215):

new_data = next(iter(data.datatype.convert_dataset(...).values()))
new_data.hid = data.hid
new_data.name = data.name
trans.sa_session.add(new_data)
assoc = trans.app.model.ImplicitlyConvertedDatasetAssociation(...)
trans.sa_session.add(assoc)
with transaction(trans.sa_session):
    trans.sa_session.commit()  # <-- This was removed

The explicit commit ensured the ICDA was visible to concurrent requests immediately.

Conclusion

This is a regression from Galaxy 24.0 (April 2024). The fix that prevented missing HIDs inadvertently created a race condition where concurrent display application requests don't see each other's pending conversions because:

  1. flush_job=False delays job record creation
  2. attach_implicitly_converted_dataset() no longer commits
  3. Session changes are only visible after eventual auto-flush or commit

Potential fixes:

  1. Add explicit commit/flush after attach_implicitly_converted_dataset()
  2. Pass use_cached_job=True to convert_dataset() from display applications
  3. Check for pending (unflushed) conversions in get_converted_files_by_type()
  4. Add locking/mutex around conversion check-and-start logic

Issue 21573 Importance Assessment

Issue Summary

Datasets converted hundreds of times when viewing via display applications (e.g., UCSC browser). Conversions should be reused, not recreated each request.

1. Severity: MEDIUM

No data loss - User data remains intact; conversions create additional copies but don't corrupt or delete source data.

No security issue - Not a vulnerability or authentication/authorization bypass.

No crash/hang - Galaxy remains operational; issue manifests as excessive resource consumption.

Functional breakage - Yes, but partial:

  • Conversion reuse mechanism is broken (functional issue)
  • Galaxy still works, just inefficiently
  • User workflow not blocked, but cluttered

Classification: Resource waste + UX degradation, not critical failure.

2. Blast Radius: LIMITED

Affected users:

  • Users of display applications (UCSC browser, IGV, etc.)
  • Particularly those making multiple requests to view same dataset

NOT affected:

  • Tool execution / job running
  • Data upload/download
  • History management (outside display apps)
  • API consumers not using display applications
  • Users who don't use external genome browsers

Configurations affected:

  • All Galaxy instances with display applications enabled
  • Primarily usegalaxy.org and similar production instances
  • Dev/test instances less impacted (lower traffic, less concurrency)

Estimate: Maybe 5-15% of user sessions involve display applications, and of those, only subset trigger concurrent conversion requests.

3. Workaround Existence: YES, but painful

Admin workaround:

  • Periodically clean up duplicate ICDA records
  • SQL: Identify and remove duplicate conversions of same parent/type

User workaround:

  • Wait for conversion to complete before clicking display link again
  • Delete duplicate converted datasets manually from history
  • Use storage manager to clean up clutter

Workaround pain level: MODERATE

  • Users confused by clutter (see support request in issue)
  • Storage usage inflated (quota impact)
  • Manual cleanup tedious but possible

4. Regression Status: CONFIRMED REGRESSION

Regression introduced: Galaxy 24.0 (April 2024)

Source: PR #18036 - "Fix History Dataset Association creation so that hid is always set"

Mechanism: PR #18036 changed session flush behavior. ICDA records added to session but not immediately flushed. Concurrent requests don't see pending conversions in get_converted_files_by_type() because query only sees committed records.

Age of regression: ~9 months (April 2024 to January 2025)

Code path:

  1. attach_implicitly_converted_dataset() creates ICDA
  2. Adds to session without immediate flush
  3. Second concurrent request checks get_converted_files_by_type()
  4. Query returns None (ICDA not committed yet)
  5. Second conversion triggered

5. User Impact Signals

Issue reactions: None recorded yet (newly filed)

Duplicate reports: None found via search

Support requests:

Reporter: Jennifer Hillman-Jackson (Galaxy team member) via Marius van den Beek - credible, well-documented report.

User testimony: "datasets are being converted hundreds of times" - indicates significant user-facing impact when it occurs.

6. Overall Recommendation: NEXT RELEASE (High Priority)

NOT Hotfix because:

  • Not critical/breaking functionality
  • Workarounds exist
  • No data loss or security impact
  • Regression exists 9 months without widespread reports

Not Backlog because:

  • Confirmed regression in recent release
  • Simple fix identified (add flush after ICDA creation)
  • Causes storage waste on production instances
  • User confusion documented
  • Fix is low-risk

Recommended Action:

  1. Priority: High for next regular release (24.2 or 25.0)
  2. Fix: Add session flush in attach_implicitly_converted_dataset() after ICDA creation
  3. Test: Add concurrency test for display application conversion
  4. Backport: Consider backport to 24.1.x if low-risk

Risk Assessment of Fix:

  • Low risk - Adding flush is defensive, won't break existing behavior
  • Session already has autoflush on queries; explicit flush just makes timing deterministic
  • Similar pattern used elsewhere in codebase

Why higher priority than typical medium-severity bugs:

  • Affects public-facing usegalaxy.org
  • Creates visible clutter users notice and report
  • Storage quota implications for users
  • Simple, low-risk fix available
  • Regression vs long-standing issue

Assessment date: 2025-01-13 Assessor: Claude Code

Issue 21573 Fix Plan: Duplicate Conversions via Display Applications

Problem Summary

Datasets are being converted hundreds of times when users view data at UCSC via display applications. This is a race condition where multiple concurrent requests check for existing conversions, find none (because the ICDA record hasn't been committed yet), and all start new conversions.

Root Cause Analysis

The Race Condition Timeline

  1. Request A arrives, calls get_converted_files_by_type() - finds no conversion
  2. Request A calls convert_dataset() which:
    • Creates job with flush_job=False (line 863 in data.py)
    • Creates ICDA record via attach_implicitly_converted_dataset() - NO COMMIT YET
    • Calls job_manager.enqueue() - commit happens here eventually
  3. Request B arrives (before Request A's commit), calls get_converted_files_by_type() - finds no conversion
  4. Request B starts another conversion...
  5. Repeat for hundreds of concurrent requests

The Breaking Change (PR #18036, Galaxy 24.0)

Before PR #18036 (attach_implicitly_converted_dataset in model/init.py):

def attach_implicitly_converted_dataset(self, session, new_dataset, target_ext: str):
    # ...
    session.add(new_dataset)
    session.add(assoc)
    with transaction(session):
        session.commit()  # <-- IMMEDIATE COMMIT
    return new_dataset

After PR #18036:

def attach_implicitly_converted_dataset(self, session, new_dataset, target_ext: str):
    # ...
    session.add(new_dataset)
    session.add(assoc)
    # NO COMMIT - deferred to job_manager.enqueue()

The change was made to fix HID assignment issues, but it introduced this race window.

Key Files

File Line Function
lib/galaxy/datatypes/display_applications/parameters.py 166-189 DisplayApplicationDataParameter.prepare() - triggers conversion
lib/galaxy/datatypes/data.py 829-880 convert_dataset() - creates job and ICDA
lib/galaxy/model/__init__.py 5440-5451 get_converted_files_by_type() - checks for existing conversion
lib/galaxy/model/__init__.py 5521-5528 attach_implicitly_converted_dataset() - creates ICDA record
lib/galaxy/datatypes/registry.py 948 find_conversion_destination_for_dataset_by_extensions() - also checks

Proposed Approaches

Approach 1: Re-add Flush After ICDA Creation (Recommended)

Rationale: Simplest fix, restores pre-24.0 behavior for this specific path, minimal risk.

Changes:

File: lib/galaxy/datatypes/data.py (lines 858-872)

# Current code:
job, converted_datasets, *_ = converter.execute(
    trans,
    incoming=params,
    set_output_hid=visible,
    history=history,
    flush_job=False,
    completed_job=completed_job,
)
# We should only have a single converted output, but let's be defensive here
n_converted_datasets = len(converted_datasets)
for converted_dataset in converted_datasets.values():
    if converted_dataset.extension == "auto" and n_converted_datasets == 1:
        converted_dataset.extension = target_type
    original_dataset.attach_implicitly_converted_dataset(trans.sa_session, converted_dataset, target_type)
trans.app.job_manager.enqueue(job, tool=converter)

# Proposed change:
job, converted_datasets, *_ = converter.execute(
    trans,
    incoming=params,
    set_output_hid=visible,
    history=history,
    flush_job=False,
    completed_job=completed_job,
)
# We should only have a single converted output, but let's be defensive here
n_converted_datasets = len(converted_datasets)
for converted_dataset in converted_datasets.values():
    if converted_dataset.extension == "auto" and n_converted_datasets == 1:
        converted_dataset.extension = target_type
    original_dataset.attach_implicitly_converted_dataset(trans.sa_session, converted_dataset, target_type)
# Flush ICDA record immediately to prevent duplicate conversions from concurrent requests
trans.sa_session.flush()
trans.app.job_manager.enqueue(job, tool=converter)

Pros:

  • Simple, targeted fix
  • Low risk - only adds a flush, doesn't change model or schema
  • Restores intended behavior without full commit

Cons:

  • Doesn't prevent true concurrent execution (microsecond window still exists)
  • Requires the flush to complete before enqueue

Approach 2: Database Unique Constraint

Rationale: Database-level enforcement prevents duplicates even with concurrent transactions.

Changes:

File: lib/galaxy/model/__init__.py (after line 6869)

class ImplicitlyConvertedDatasetAssociation(Base, Serializable):
    __tablename__ = "implicitly_converted_dataset_association"
    __table_args__ = (
        # Prevent duplicate conversions for same parent/type combination
        UniqueConstraint('hda_parent_id', 'type', name='uq_icda_hda_parent_type'),
        UniqueConstraint('ldda_parent_id', 'type', name='uq_icda_ldda_parent_type'),
    )

File: lib/galaxy/model/migrations/alembic/versions_gxy/XXX_add_icda_unique_constraint.py (new migration)

"""Add unique constraint to implicitly_converted_dataset_association

Revision ID: XXX
"""
from alembic import op

def upgrade():
    # First, delete duplicate records (keep oldest)
    op.execute("""
        DELETE FROM implicitly_converted_dataset_association
        WHERE id NOT IN (
            SELECT MIN(id)
            FROM implicitly_converted_dataset_association
            GROUP BY hda_parent_id, type
        ) AND hda_parent_id IS NOT NULL
    """)
    op.execute("""
        DELETE FROM implicitly_converted_dataset_association
        WHERE id NOT IN (
            SELECT MIN(id)
            FROM implicitly_converted_dataset_association
            GROUP BY ldda_parent_id, type
        ) AND ldda_parent_id IS NOT NULL
    """)

    op.create_unique_constraint('uq_icda_hda_parent_type',
                                'implicitly_converted_dataset_association',
                                ['hda_parent_id', 'type'])
    op.create_unique_constraint('uq_icda_ldda_parent_type',
                                'implicitly_converted_dataset_association',
                                ['ldda_parent_id', 'type'])

def downgrade():
    op.drop_constraint('uq_icda_hda_parent_type', 'implicitly_converted_dataset_association')
    op.drop_constraint('uq_icda_ldda_parent_type', 'implicitly_converted_dataset_association')

File: lib/galaxy/model/__init__.py (modify attach_implicitly_converted_dataset)

def attach_implicitly_converted_dataset(self, session, new_dataset, target_ext: str):
    new_dataset.name = self.name
    self.copy_attributes(new_dataset)
    assoc = ImplicitlyConvertedDatasetAssociation(
        parent=self, file_type=target_ext, dataset=new_dataset, metadata_safe=False
    )
    session.add(new_dataset)
    session.add(assoc)
    try:
        session.flush()
    except IntegrityError:
        session.rollback()
        # Another request already created this conversion, return None to signal duplicate
        return None
    return assoc

File: lib/galaxy/datatypes/data.py (modify convert_dataset)

for converted_dataset in converted_datasets.values():
    if converted_dataset.extension == "auto" and n_converted_datasets == 1:
        converted_dataset.extension = target_type
    assoc = original_dataset.attach_implicitly_converted_dataset(trans.sa_session, converted_dataset, target_type)
    if assoc is None:
        # Duplicate conversion already exists, skip this job
        log.info(f"Skipping duplicate conversion to {target_type} for dataset {original_dataset.id}")
        return None  # or handle differently
trans.app.job_manager.enqueue(job, tool=converter)

Pros:

  • Database-enforced uniqueness - bulletproof against race conditions
  • Catches duplicates at the database level
  • Self-documenting constraint

Cons:

  • Requires migration with data cleanup
  • Need to handle IntegrityError in application code
  • Migration could be slow on large databases
  • More complex implementation

Approach 3: Application-Level Locking

Rationale: Use Redis/database locking to serialize conversion initiation.

Changes:

File: lib/galaxy/datatypes/data.py (modify convert_dataset)

def convert_dataset(
    self,
    trans,
    original_dataset: DatasetHasHidProtocol,
    target_type: str,
    ...
):
    # Generate lock key for this specific conversion
    lock_key = f"conversion:{original_dataset.id}:{target_type}"

    # Try to acquire lock (non-blocking)
    lock = trans.app.model.context.get_conversion_lock(lock_key)
    if not lock.acquire(blocking=False):
        # Another request is already starting this conversion
        log.debug(f"Conversion to {target_type} for {original_dataset.id} already in progress")
        return None

    try:
        # Double-check after acquiring lock
        existing = original_dataset.get_converted_files_by_type(target_type)
        if existing:
            return existing  # Already converted

        # Proceed with conversion...
        converter = trans.app.datatypes_registry.get_converter_by_target_type(...)
        # ... rest of conversion logic
    finally:
        lock.release()

Pros:

  • Serializes conversion requests properly
  • No database schema changes

Cons:

  • Requires lock infrastructure (Redis or similar)
  • Complex implementation
  • Lock management overhead
  • Potential deadlock risks

Approach 4: Enable Job Caching for Display Applications

Rationale: PR #21021 added job caching for implicit conversions, but it's not enabled by default for display applications.

Changes:

File: lib/galaxy/datatypes/display_applications/parameters.py (line 184-186)

# Current code:
data.datatype.convert_dataset(
    trans, data, target_ext, return_output=True, visible=False, history=data.history
)

# Proposed change:
data.datatype.convert_dataset(
    trans, data, target_ext, return_output=True, visible=False, history=data.history,
    use_cached_job=True
)

File: lib/galaxy/datatypes/data.py (line 839)

Already has use_cached_job: bool = False parameter, but it's not being used from display applications.

Analysis of why this alone won't fix the issue:

The job caching in PR #21021 uses completed_jobs() which queries for completed jobs with matching parameters. However:

  1. It checks for completed jobs, not in-progress ones
  2. The race occurs BEFORE any job is committed to the database
  3. Job caching helps avoid re-running a conversion that already succeeded, not preventing concurrent starts

Verdict: Job caching is useful but doesn't fix this race condition. Multiple requests would still all fail to find a cached job and start new ones.


Recommended Solution

Approach 1 (Re-add Flush) is recommended for these reasons:

  1. Simplest fix - one line change
  2. Low risk - doesn't change schema or require migrations
  3. Addresses root cause - makes ICDA visible before enqueue
  4. Easy to verify - behavior matches pre-24.0

Secondary recommendation: Also add Approach 2 (unique constraint) as a longer-term safeguard, but as a separate PR/release to avoid coupling fixes.


Implementation Plan

Phase 1: Immediate Fix (Approach 1)

  1. Add flush in convert_dataset()

    File: lib/galaxy/datatypes/data.py, after line 871

    for converted_dataset in converted_datasets.values():
        if converted_dataset.extension == "auto" and n_converted_datasets == 1:
            converted_dataset.extension = target_type
        original_dataset.attach_implicitly_converted_dataset(trans.sa_session, converted_dataset, target_type)
    # Flush ICDA record to prevent duplicate conversions from concurrent requests
    trans.sa_session.flush()
    trans.app.job_manager.enqueue(job, tool=converter)

Phase 2: Future Enhancement (Approach 2)

  1. Create migration to add unique constraint (separate PR)
  2. Handle IntegrityError in attach_implicitly_converted_dataset

Test Strategy

Unit Test: Race Condition Simulation

Create a test that simulates concurrent conversion requests:

File: lib/galaxy_test/unit/data/test_conversion_race.py (new file)

import threading
from unittest.mock import MagicMock, patch

def test_concurrent_conversion_requests_dont_duplicate():
    """
    Verify that concurrent requests to convert the same dataset
    don't create duplicate ICDA records.
    """
    mock_session = MagicMock()
    mock_dataset = MagicMock()
    mock_dataset.id = 1
    mock_dataset.get_converted_files_by_type.return_value = None

    conversions_started = []
    lock = threading.Lock()

    def mock_convert(*args, **kwargs):
        with lock:
            conversions_started.append(1)
        return MagicMock()

    # Start multiple threads attempting conversion
    threads = []
    for _ in range(10):
        t = threading.Thread(target=mock_convert)
        threads.append(t)
        t.start()

    for t in threads:
        t.join()

    # Should only have started one conversion
    # (This would fail with current code, pass with fix)

Integration Test: Display Application Conversion

File: lib/galaxy_test/api/test_display_applications.py (new or extend existing)

from concurrent.futures import ThreadPoolExecutor, as_completed

class DisplayApplicationsConversionTestCase(ApiTestCase):

    def test_concurrent_display_app_conversion(self):
        """
        Test that concurrent requests to display application
        don't create duplicate conversions.
        """
        # Upload a dataset that requires conversion (e.g., BAM)
        history_id = self.dataset_populator.new_history()
        hda = self.dataset_populator.new_dataset(
            history_id,
            content=open("test-data/1.bam", "rb").read(),
            file_type="bam"
        )
        dataset_id = hda["id"]

        # Make concurrent requests to UCSC display app
        def make_request():
            return self._post(
                "display_applications/create_link",
                data={
                    "app_name": "ucsc",
                    "link_name": "main",
                    "dataset_id": dataset_id,
                }
            )

        # 20 concurrent requests
        with ThreadPoolExecutor(max_workers=20) as executor:
            futures = [executor.submit(make_request) for _ in range(20)]
            results = [f.result() for f in as_completed(futures)]

        # Verify only one conversion job was created
        # Query ICDA table for this dataset
        icda_count = self._get_icda_count_for_dataset(dataset_id)
        assert icda_count == 1, f"Expected 1 ICDA, got {icda_count}"

Manual Testing

  1. Set up Galaxy with UCSC display application
  2. Upload a BAM file
  3. Open UCSC link in multiple browser tabs simultaneously
  4. Verify in database: SELECT COUNT(*) FROM implicitly_converted_dataset_association WHERE hda_parent_id = <id>
  5. Should be exactly 1

Risks and Considerations

Performance Impact

  • Flush overhead: One additional flush per conversion is minimal
  • Lock contention: None with Approach 1

Backward Compatibility

  • No API changes
  • No schema changes (for Approach 1)
  • Behavior matches pre-24.0 Galaxy

Edge Cases

  1. Session already flushed: If session auto-flushes before our explicit flush, no harm
  2. Transaction rollback: If later code rollbacks, both ICDA and job are undone (correct behavior)
  3. Database disconnection: Standard error handling applies

Migration Concerns (if Approach 2 is added later)

  • Existing duplicate ICDA records need cleanup before constraint can be added
  • Large tables may take time to add constraint

Unresolved Questions

  1. Should we also enable use_cached_job=True in display_applications/parameters.py for additional protection?
  2. For Approach 2, what's the strategy for cleaning up existing duplicate ICDAs - delete all but oldest, or all but newest?
  3. Should we add logging to track when a duplicate conversion was prevented?
  4. Is there a concern about the flush happening before enqueue affecting HID assignment (the original reason for PR #18036)?
  5. Should fix be backported to 24.0, 24.1, 25.0 branches?

Issue 21573 Summary: Duplicate Conversions via Display Applications

Top-Line Summary

Display applications (e.g., UCSC browser) trigger duplicate dataset conversions due to a race condition introduced in Galaxy 24.0 (PR #18036). The fix that prevented missing HIDs inadvertently removed an immediate session commit from attach_implicitly_converted_dataset(), causing concurrent requests to not see pending conversions and each start new ones. The fix is simple: add trans.sa_session.flush() after ICDA creation in lib/galaxy/datatypes/data.py:871.

Importance Assessment

Dimension Assessment
Severity MEDIUM - Resource waste and UX degradation, no data loss or security issues
Blast Radius LIMITED - Only affects display application users (5-15% of sessions)
Workaround EXISTS but painful - Manual cleanup possible, storage quota impact
Regression CONFIRMED - Introduced in Galaxy 24.0 (April 2024), PR #18036
Priority HIGH for next release - Simple fix, low risk, affects usegalaxy.org

Questions for Discussion

  1. Should fix be backported to 24.x/25.0 branches?
  2. Does flushing before job_manager.enqueue() affect HID assignment (original concern in PR #18036)?
  3. Add unique constraint on ICDA table as additional safeguard (separate PR)?
  4. Enable use_cached_job=True for display applications too?
  5. Cleanup strategy for existing duplicate conversions on production instances?

Effort Estimate

Fix complexity: LOW (one-line change)

  • Add trans.sa_session.flush() after ICDA creation loop in convert_dataset()
  • File: lib/galaxy/datatypes/data.py, after line 871

Testing complexity: MEDIUM

  • Need concurrency test to verify fix
  • ThreadPoolExecutor test with display_applications API
  • Manual testing with UCSC browser

Reproduction difficulty: MODERATE

  • Requires concurrent requests to display application
  • More likely on production instances with real traffic
  • Can simulate with concurrent API calls

Key Files

File Function Role
lib/galaxy/datatypes/data.py:829-880 convert_dataset() Where fix goes
lib/galaxy/datatypes/display_applications/parameters.py:166-189 prepare() Triggers conversion
lib/galaxy/model/__init__.py:5440-5451 get_converted_files_by_type() Race condition check
lib/galaxy/model/__init__.py:5521-5528 attach_implicitly_converted_dataset() Creates ICDA (no commit)

Regression Source

PR #18036 (April 2024) by @mvdbeek removed session.commit() from attach_implicitly_converted_dataset() to fix HID assignment issues. This created a race window where ICDA records aren't visible to concurrent requests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment