
@jmchilton
Created February 24, 2026 14:55
CWL Secondary Files in Output Records

Test

test_conformance_v1_1_secondary_files_in_output_records (xfail in v1.1 and v1.2)

CWL Tool

test/functional/tools/cwl_tools/v1.1/tests/record-out-secondaryFiles.cwl:

outputs:
  record_output:
    type:
      type: record
      fields:
        f1:
          type: File
          secondaryFiles: .s2
          outputBinding:
            glob: A
        f2:
          type: { type: array, items: File }
          secondaryFiles: .s3
          outputBinding:
            glob: [B, C]
baseCommand: touch
arguments: [A, A.s2, B, B.s3, C, C.s3]

The tool takes no inputs. It touches six empty files and expects a record output with:

  • f1: File "A" with secondary file "A.s2"
  • f2: array of [File "B" with "B.s3", File "C" with "C.s3"]
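Since `touch` accepts multiple arguments, the tool's invocation reduces to a single command creating all six empty files:

```shell
# Equivalent of baseCommand: touch with the six arguments above
touch A A.s2 B B.s3 C C.s3
ls A A.s2 B B.s3 C C.s3
```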

What Happens

Expected

{
  "f1": {"class": "File", "location": "A", "secondaryFiles": [{"location": "A.s2"}]},
  "f2": [
    {"class": "File", "location": "B", "secondaryFiles": [{"location": "B.s3"}]},
    {"class": "File", "location": "C", "secondaryFiles": [{"location": "C.s3"}]}
  ]
}

Got

{
  "f1": {"class": "File", "basename": "record-out-secondaryFiles.cwl", "secondaryFiles": [{"basename": "A.s2"}]},
  "f2": {"class": "File", "basename": "record-out-secondaryFiles.cwl"}
}

Problems:

  1. f1 wrong basename: "record-out-secondaryFiles.cwl" (the CWL tool file) instead of "A"
  2. f1 secondary files present: A.s2 IS found (secondary files partially work)
  3. f2 is a single File, not array: Should be [{B}, {C}], got single File
  4. f2 no secondary files: Missing B.s3, C.s3
  5. f2 wrong basename: Same tool filename

Root Cause: 4 Layered Bugs

Bug 1: Record output pre-creation doesn't support nested collections

lib/galaxy/model/dataset_collections/types/record.py:44-55:

def prototype_elements(self, fields=None, **kwds):
    for field in fields:
        name = field.get("name", None)
        assert field.get("type", "File")  # NS: this assert doesn't make sense
        field_dataset = DatasetCollectionElement(
            element=HistoryDatasetAssociation(),
            element_identifier=name,
        )
        yield field_dataset

Every record field becomes a plain HDA. The CWL type info in field.get("type") is ignored. For f2 (array of File), it should create a nested list collection, but instead creates a single HDA.
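The type dispatch that `prototype_elements` is missing can be sketched in isolation. `classify_field` below is a hypothetical helper, not Galaxy code; the field dicts mirror the record fields of the tool above:

```python
# Minimal sketch of the missing type dispatch: an array-typed CWL field
# should map to a nested collection, everything else to a plain dataset.
# classify_field is a hypothetical helper for illustration only.
def classify_field(field):
    """Return 'collection' for CWL array-typed fields, 'dataset' otherwise."""
    field_type = field.get("type", "File")
    if isinstance(field_type, dict) and field_type.get("type") == "array":
        return "collection"
    return "dataset"

fields = [
    {"name": "f1", "type": "File"},
    {"name": "f2", "type": {"type": "array", "items": "File"}},
]
print([classify_field(f) for f in fields])  # → ['dataset', 'collection']
```

The current code never inspects `field["type"]`, so both f1 and f2 take the 'dataset' path.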

Bug 2: runtime_actions.py else branch uses wrong variables (crashes handle_outputs)

lib/galaxy/tool_util/cwl/runtime_actions.py:193-201:

elif isinstance(output, dict):
    prefix = f"{output_name}|__part__|"
    for record_key, record_value in output.items():
        record_value_output_key = f"{prefix}{record_key}"
        if isinstance(record_value, dict) and "class" in record_value:
            handle_known_output(record_value, record_value_output_key)
        else:
            handle_known_output_json(output, output_name)  # BUG

The else branch (line 201) has TWO wrong variables:

  • output (entire record dict) instead of record_value (the field value)
  • output_name ("record_output") instead of record_value_output_key ("record_output|__part__|f2")

output_name = "record_output" is the COLLECTION name, not a dataset. It's not in _output_dict (which only has record_output|__part__|f1 and record_output|__part__|f2). So job_proxy.output_path("record_output") raises KeyError, crashing handle_outputs() before it writes provided_metadata (line 228).

Consequence: The metadata JSON is never written. ALL record field HDAs keep their default names (the CWL tool filename) and created_from_basename is never set. This explains why f1 has wrong basename even though move_output correctly copied file "A" to f1's path and wrote secondary files.
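The lookup failure is easy to reproduce in a toy form. The dict below stands in for the proxy's `_output_dict`, which is keyed only by per-field names:

```python
# Toy reproduction of the crash: the proxy's output dict is keyed by the
# per-field names, so looking up the collection name itself raises KeyError,
# aborting handle_outputs() before provided_metadata is written.
_output_dict = {
    "record_output|__part__|f1": {"path": "/working/f1.dat"},
    "record_output|__part__|f2": {"path": "/working/f2.dat"},
}

def output_path(output_name):
    # Mirrors job_proxy.output_path: a plain dict lookup, no fallback.
    return _output_dict[output_name]["path"]

try:
    output_path("record_output")  # the collection name, not a dataset key
    crashed = False
except KeyError:
    crashed = True

print("crashed:", crashed)  # → crashed: True
```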

Bug 3: No array handling in record output loop

Lines 203-216 handle arrays at the TOP level:

elif isinstance(output, list):
    elements = []
    for index, el in enumerate(output):
        if isinstance(el, dict) and el["class"] == "File":
            elements.append({"name": str(index), "filename": output_path, ...})
        ...
    provided_metadata[output_name] = {"elements": elements}

But INSIDE the record loop (lines 195-201), only dict with "class" is handled. List values (arrays) fall through to the broken else branch.

Even if the else branch were fixed, a list field needs special handling — it can't just be JSON-dumped into a single HDA.
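The missing branch would need to build an "elements" entry under the per-field key, mirroring the top-level list handler. A hedged sketch, with `provided_metadata` standing in for the real runtime_actions state:

```python
# Sketch of the list branch the record loop lacks: collect each File
# element into an "elements" list keyed by the per-field output name,
# the same shape the top-level list handler produces.
provided_metadata = {}

def handle_record_list_field(record_value, record_value_output_key):
    elements = []
    for index, el in enumerate(record_value):
        if isinstance(el, dict) and el.get("class") == "File":
            elements.append({"name": str(index), "filename": el["location"]})
    provided_metadata[record_value_output_key] = {"elements": elements}

handle_record_list_field(
    [{"class": "File", "location": "B"}, {"class": "File", "location": "C"}],
    "record_output|__part__|f2",
)
print(provided_metadata["record_output|__part__|f2"])
```

This only covers metadata bookkeeping; the files themselves would still need moving, and their secondary files hit Bug 4.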

Bug 4: Secondary files not implemented for list elements

runtime_actions.py:128-129:

for secondary_file in secondary_files:
    if output_name is None:
        raise NotImplementedError("secondaryFiles are unimplemented for dynamic list elements")

The top-level list handler (lines 203-216) doesn't pass output_name to move_output. Even if arrays-in-records were implemented, secondary files for array elements would hit this NotImplementedError.

Data Flow

Output creation path

CwlToolSource.parse_outputs()
  → _parse_output_record()
    → ToolOutputCollection(structure=ToolOutputCollectionStructure(collection_type="record", fields=...))
      → RecordDatasetCollectionType.prototype_elements(fields)
        → DatasetCollectionElement(element=HistoryDatasetAssociation(), element_identifier=name)
          → Creates plain HDA for EVERY field (f1, f2) regardless of CWL type

Job execution path

CwlToolEvaluation (tools/evaluation.py:1246-1270)
  → out_data = job.io_dicts() → {
      "record_output|__part__|f1": HDA_f1,
      "record_output|__part__|f2": HDA_f2,
    }
  → output_dict = {name: {"id": ..., "path": ...} for name, dataset in out_data.items()}
  → cwl_job_proxy = JobProxy(input_json, output_dict, ...)

Post-job relocate path

handle_outputs()
  → cwltool returns: {"record_output": {"f1": {class:File,...}, "f2": [{class:File,...}, ...]}}
  → record loop: f1 handled correctly by handle_known_output ✓
  → record loop: f2 is list → else → handle_known_output_json(output, "record_output") → KeyError!
  → provided_metadata never written → all HDAs keep defaults

Output reconversion path (test comparison)

CwlToolRun._output_name_to_object("record_output")
  → job["output_collections"]["record_output"] → GalaxyOutput(dataset_collection)
output_to_cwl_json()
  → collection_type "record" → iterate elements → element_to_cwl_json(element)
    → f1 element → single HDA → File (but wrong basename from missing metadata)
    → f2 element → single HDA → File (should be list of Files)

Key Files

  • lib/galaxy/model/dataset_collections/types/record.py:44-55 — prototype_elements, all fields become plain HDAs
  • lib/galaxy/tool_util/cwl/runtime_actions.py:193-201 — record output handling, broken else branch
  • lib/galaxy/tool_util/cwl/runtime_actions.py:117-158 — move_output with secondary files, NotImplementedError for lists
  • lib/galaxy/tool_util/cwl/runtime_actions.py:203-216 — top-level list output handling (not used in records)
  • lib/galaxy/tool_util/parser/cwl.py:275-287 — _parse_output_record
  • lib/galaxy/tool_util/parser/output_objects.py:318-367 — known_outputs, ToolOutputCollectionPart
  • lib/galaxy/tools/evaluation.py:1246-1270 — output_dict construction
  • lib/galaxy/tool_util/cwl/util.py:683-694 — record output to CWL JSON conversion

Fix Strategy

Phase 1: Fix the crash (Bug 2) — immediate, safe

Fix the else branch to use correct variables:

handle_known_output_json(record_value, record_value_output_key)

This prevents the KeyError crash and fixes f1's metadata (basename). Non-File/non-array record fields (scalars, expressions) would also work correctly. f2 would still be wrong (JSON-serialized list in a single HDA) but at least f1 works and the function doesn't crash.
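The corrected loop can be exercised standalone with stub handlers recording what they were called with:

```python
# Runnable sketch of the Phase 1 fix: the else branch now receives the
# field value and the per-field key. The two handlers are stubs that
# record their arguments instead of moving files.
calls = []

def handle_known_output(value, key):
    calls.append(("file", key))

def handle_known_output_json(value, key):
    calls.append(("json", key))

output_name = "record_output"
output = {
    "f1": {"class": "File", "location": "A"},
    "f2": [{"class": "File", "location": "B"}, {"class": "File", "location": "C"}],
}

prefix = f"{output_name}|__part__|"
for record_key, record_value in output.items():
    record_value_output_key = f"{prefix}{record_key}"
    if isinstance(record_value, dict) and "class" in record_value:
        handle_known_output(record_value, record_value_output_key)
    else:
        handle_known_output_json(record_value, record_value_output_key)  # fixed variables

print(calls)
# → [('file', 'record_output|__part__|f1'), ('json', 'record_output|__part__|f2')]
```

With the fix, the f2 list is routed to the JSON handler under its own key instead of crashing the whole record.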

Phase 2: Array handling in records (Bug 1 + Bug 3) — medium scope

  1. record.py: Detect array-type fields from field.get("type") and create nested list collections instead of plain HDAs
  2. runtime_actions.py: Add list handling case in the record loop (similar to lines 203-216) using record_value_output_key as the output name
  3. cwl.py: Ensure the fields list passed through has type info preserved
  4. util.py: Handle nested collections within record elements during reconversion

Phase 3: Secondary files for array elements (Bug 4) — larger scope

  1. runtime_actions.py: Implement secondary files for list elements in move_output (or a new handler for array elements with secondary files)
  2. Needs a way to store secondary files per-element, potentially using the element index in the path structure
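One possible per-element layout, purely hypothetical (the `__secondary_files__` directory name and helper below are not current Galaxy behavior), keys each element's secondary files by its index:

```python
# Hypothetical design sketch: store each array element's secondary files
# under a directory keyed by the element index, so a future move_output
# could locate them per element. Not existing Galaxy code.
import os

def element_secondary_dir(outputs_dir, output_name, element_index):
    """Directory for secondary files of element N of a list-valued field."""
    return os.path.join(
        outputs_dir, "__secondary_files__", output_name, str(element_index)
    )

path = element_secondary_dir("/job/outputs", "record_output|__part__|f2", 1)
print(path)
```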

Alternative: Dynamic output discovery for record arrays

Instead of pre-creating nested collections, treat array-within-record fields as dynamic outputs and use from_provided_metadata discovery (like top-level array outputs do). This might be simpler since it avoids changing the pre-creation infrastructure.

Unresolved Questions

  • Existing tests for record outputs with plain File fields (no arrays, no secondaryFiles)? Could verify Bug 2 in isolation.
  • Does the fields list passed to record.py:prototype_elements contain CWL type info, or stripped?
  • Would dynamic output discovery (from_provided_metadata) work for nested collections within records?
  • The secondary_files_in_unnamed_records test (also xfail) — same root cause or different?
  • Does anyone currently use CWL record outputs successfully for simpler cases (all-File fields)?