
@jmchilton
Created February 24, 2026 14:55
CWL Secondary Files in Output Records

Test

test_conformance_v1_1_secondary_files_in_output_records (xfail in v1.1 and v1.2)

CWL Tool

test/functional/tools/cwl_tools/v1.1/tests/record-out-secondaryFiles.cwl:

outputs:
  record_output:
    type:
      type: record
      fields:
        f1:
          type: File
          secondaryFiles: .s2
          outputBinding:
            glob: A
        f2:
          type: { type: array, items: File }
          secondaryFiles: .s3
          outputBinding:
            glob: [B, C]
baseCommand: touch
arguments: [A, A.s2, B, B.s3, C, C.s3]

The tool takes no inputs. It touches six empty files and expects a record output with:

  • f1: File "A" with secondary file "A.s2"
  • f2: array of [File "B" with "B.s3", File "C" with "C.s3"]
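Since `touch` accepts multiple arguments, the tool's invocation reduces to a single command creating all six empty files:

```shell
# Equivalent of baseCommand: touch with the six arguments above
touch A A.s2 B B.s3 C C.s3
ls A A.s2 B B.s3 C C.s3
```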

What Happens

Expected

{
  "f1": {"class": "File", "location": "A", "secondaryFiles": [{"location": "A.s2"}]},
  "f2": [
    {"class": "File", "location": "B", "secondaryFiles": [{"location": "B.s3"}]},
    {"class": "File", "location": "C", "secondaryFiles": [{"location": "C.s3"}]}
  ]
}

Got

{
  "f1": {"class": "File", "basename": "record-out-secondaryFiles.cwl", "secondaryFiles": [{"basename": "A.s2"}]},
  "f2": {"class": "File", "basename": "record-out-secondaryFiles.cwl"}
}

Problems:

  1. f1 wrong basename: "record-out-secondaryFiles.cwl" (the CWL tool file) instead of "A"
  2. f1 secondary files present: A.s2 IS found (secondary files partially work)
  3. f2 is a single File, not array: Should be [{B}, {C}], got single File
  4. f2 no secondary files: Missing B.s3, C.s3
  5. f2 wrong basename: Same tool filename

Root Cause: 4 Layered Bugs

Bug 1: Record output pre-creation doesn't support nested collections

lib/galaxy/model/dataset_collections/types/record.py:44-55:

def prototype_elements(self, fields=None, **kwds):
    for field in fields:
        name = field.get("name", None)
        assert field.get("type", "File")  # NS: this assert doesn't make sense
        field_dataset = DatasetCollectionElement(
            element=HistoryDatasetAssociation(),
            element_identifier=name,
        )
        yield field_dataset

Every record field becomes a plain HDA. The CWL type info in field.get("type") is ignored. For f2 (array of File), it should create a nested list collection, but instead creates a single HDA.
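The type dispatch that `prototype_elements` is missing can be sketched in isolation. `classify_field` below is a hypothetical helper, not Galaxy code; the field dicts mirror the record fields of the tool above:

```python
# Minimal sketch of the missing type dispatch: an array-typed CWL field
# should map to a nested collection, everything else to a plain dataset.
# classify_field is a hypothetical helper for illustration only.
def classify_field(field):
    """Return 'collection' for CWL array-typed fields, 'dataset' otherwise."""
    field_type = field.get("type", "File")
    if isinstance(field_type, dict) and field_type.get("type") == "array":
        return "collection"
    return "dataset"

fields = [
    {"name": "f1", "type": "File"},
    {"name": "f2", "type": {"type": "array", "items": "File"}},
]
print([classify_field(f) for f in fields])  # → ['dataset', 'collection']
```

The current code never inspects `field["type"]`, so both f1 and f2 take the 'dataset' path.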

Bug 2: runtime_actions.py else branch uses wrong variables (crashes handle_outputs)

lib/galaxy/tool_util/cwl/runtime_actions.py:193-201:

elif isinstance(output, dict):
    prefix = f"{output_name}|__part__|"
    for record_key, record_value in output.items():
        record_value_output_key = f"{prefix}{record_key}"
        if isinstance(record_value, dict) and "class" in record_value:
            handle_known_output(record_value, record_value_output_key)
        else:
            handle_known_output_json(output, output_name)  # BUG

The else branch (line 201) has TWO wrong variables:

  • output (entire record dict) instead of record_value (the field value)
  • output_name ("record_output") instead of record_value_output_key ("record_output|__part__|f2")

output_name = "record_output" is the COLLECTION name, not a dataset. It's not in _output_dict (which only has record_output|__part__|f1 and record_output|__part__|f2). So job_proxy.output_path("record_output") raises KeyError, crashing handle_outputs() before it writes provided_metadata (line 228).

Consequence: The metadata JSON is never written. ALL record field HDAs keep their default names (the CWL tool filename) and created_from_basename is never set. This explains why f1 has wrong basename even though move_output correctly copied file "A" to f1's path and wrote secondary files.
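The lookup failure is easy to reproduce in a toy form. The dict below stands in for the proxy's `_output_dict`, which is keyed only by per-field names:

```python
# Toy reproduction of the crash: the proxy's output dict is keyed by the
# per-field names, so looking up the collection name itself raises KeyError,
# aborting handle_outputs() before provided_metadata is written.
_output_dict = {
    "record_output|__part__|f1": {"path": "/working/f1.dat"},
    "record_output|__part__|f2": {"path": "/working/f2.dat"},
}

def output_path(output_name):
    # Mirrors job_proxy.output_path: a plain dict lookup, no fallback.
    return _output_dict[output_name]["path"]

try:
    output_path("record_output")  # the collection name, not a dataset key
    crashed = False
except KeyError:
    crashed = True

print("crashed:", crashed)  # → crashed: True
```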

Bug 3: No array handling in record output loop

Lines 203-216 handle arrays at the TOP level:

elif isinstance(output, list):
    elements = []
    for index, el in enumerate(output):
        if isinstance(el, dict) and el["class"] == "File":
            elements.append({"name": str(index), "filename": output_path, ...})
        ...
    provided_metadata[output_name] = {"elements": elements}

But INSIDE the record loop (lines 195-201), only dict with "class" is handled. List values (arrays) fall through to the broken else branch.

Even if the else branch were fixed, a list field needs special handling — it can't just be JSON-dumped into a single HDA.
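The missing branch would need to build an "elements" entry under the per-field key, mirroring the top-level list handler. A hedged sketch, with `provided_metadata` standing in for the real runtime_actions state:

```python
# Sketch of the list branch the record loop lacks: collect each File
# element into an "elements" list keyed by the per-field output name,
# the same shape the top-level list handler produces.
provided_metadata = {}

def handle_record_list_field(record_value, record_value_output_key):
    elements = []
    for index, el in enumerate(record_value):
        if isinstance(el, dict) and el.get("class") == "File":
            elements.append({"name": str(index), "filename": el["location"]})
    provided_metadata[record_value_output_key] = {"elements": elements}

handle_record_list_field(
    [{"class": "File", "location": "B"}, {"class": "File", "location": "C"}],
    "record_output|__part__|f2",
)
print(provided_metadata["record_output|__part__|f2"])
```

This only covers metadata bookkeeping; the files themselves would still need moving, and their secondary files hit Bug 4.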

Bug 4: Secondary files not implemented for list elements

runtime_actions.py:128-129:

for secondary_file in secondary_files:
    if output_name is None:
        raise NotImplementedError("secondaryFiles are unimplemented for dynamic list elements")

The top-level list handler (lines 203-216) doesn't pass output_name to move_output. Even if arrays-in-records were implemented, secondary files for array elements would hit this NotImplementedError.

Data Flow

Output creation path

CwlToolSource.parse_outputs()
  → _parse_output_record()
    → ToolOutputCollection(structure=ToolOutputCollectionStructure(collection_type="record", fields=...))
      → RecordDatasetCollectionType.prototype_elements(fields)
        → DatasetCollectionElement(element=HistoryDatasetAssociation(), element_identifier=name)
          → Creates plain HDA for EVERY field (f1, f2) regardless of CWL type

Job execution path

CwlToolEvaluation (tools/evaluation.py:1246-1270)
  → out_data = job.io_dicts() → {
      "record_output|__part__|f1": HDA_f1,
      "record_output|__part__|f2": HDA_f2,
    }
  → output_dict = {name: {"id": ..., "path": ...} for name, dataset in out_data.items()}
  → cwl_job_proxy = JobProxy(input_json, output_dict, ...)

Post-job relocate path

handle_outputs()
  → cwltool returns: {"record_output": {"f1": {class:File,...}, "f2": [{class:File,...}, ...]}}
  → record loop: f1 handled correctly by handle_known_output ✓
  → record loop: f2 is list → else → handle_known_output_json(output, "record_output") → KeyError!
  → provided_metadata never written → all HDAs keep defaults

Output reconversion path (test comparison)

CwlToolRun._output_name_to_object("record_output")
  → job["output_collections"]["record_output"] → GalaxyOutput(dataset_collection)
output_to_cwl_json()
  → collection_type "record" → iterate elements → element_to_cwl_json(element)
    → f1 element → single HDA → File (but wrong basename from missing metadata)
    → f2 element → single HDA → File (should be list of Files)

Key Files

  • lib/galaxy/model/dataset_collections/types/record.py:44-55 — prototype_elements, all fields become plain HDAs
  • lib/galaxy/tool_util/cwl/runtime_actions.py:193-201 — record output handling, broken else branch
  • lib/galaxy/tool_util/cwl/runtime_actions.py:117-158 — move_output with secondary files, NotImplementedError for lists
  • lib/galaxy/tool_util/cwl/runtime_actions.py:203-216 — top-level list output handling (not used in records)
  • lib/galaxy/tool_util/parser/cwl.py:275-287 — _parse_output_record
  • lib/galaxy/tool_util/parser/output_objects.py:318-367 — known_outputs, ToolOutputCollectionPart
  • lib/galaxy/tools/evaluation.py:1246-1270 — output_dict construction
  • lib/galaxy/tool_util/cwl/util.py:683-694 — record output to CWL JSON conversion

Fix Strategy

Phase 1: Fix the crash (Bug 2) — immediate, safe

Fix the else branch to use correct variables:

handle_known_output_json(record_value, record_value_output_key)

This prevents the KeyError crash and fixes f1's metadata (basename). Non-File/non-array record fields (scalars, expressions) would also work correctly. f2 would still be wrong (JSON-serialized list in a single HDA) but at least f1 works and the function doesn't crash.
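The corrected loop can be exercised standalone with stub handlers recording what they were called with:

```python
# Runnable sketch of the Phase 1 fix: the else branch now receives the
# field value and the per-field key. The two handlers are stubs that
# record their arguments instead of moving files.
calls = []

def handle_known_output(value, key):
    calls.append(("file", key))

def handle_known_output_json(value, key):
    calls.append(("json", key))

output_name = "record_output"
output = {
    "f1": {"class": "File", "location": "A"},
    "f2": [{"class": "File", "location": "B"}, {"class": "File", "location": "C"}],
}

prefix = f"{output_name}|__part__|"
for record_key, record_value in output.items():
    record_value_output_key = f"{prefix}{record_key}"
    if isinstance(record_value, dict) and "class" in record_value:
        handle_known_output(record_value, record_value_output_key)
    else:
        handle_known_output_json(record_value, record_value_output_key)  # fixed variables

print(calls)
# → [('file', 'record_output|__part__|f1'), ('json', 'record_output|__part__|f2')]
```

With the fix, the f2 list is routed to the JSON handler under its own key instead of crashing the whole record.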

Phase 2: Array handling in records (Bug 1 + Bug 3) — medium scope

  1. record.py: Detect array-type fields from field.get("type") and create nested list collections instead of plain HDAs
  2. runtime_actions.py: Add list handling case in the record loop (similar to lines 203-216) using record_value_output_key as the output name
  3. cwl.py: Ensure the fields list passed through has type info preserved
  4. util.py: Handle nested collections within record elements during reconversion

Phase 3: Secondary files for array elements (Bug 4) — larger scope

  1. runtime_actions.py: Implement secondary files for list elements in move_output (or a new handler for array elements with secondary files)
  2. Needs a way to store secondary files per-element, potentially using the element index in the path structure
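One possible per-element layout, purely hypothetical (the `__secondary_files__` directory name and helper below are not current Galaxy behavior), keys each element's secondary files by its index:

```python
# Hypothetical design sketch: store each array element's secondary files
# under a directory keyed by the element index, so a future move_output
# could locate them per element. Not existing Galaxy code.
import os

def element_secondary_dir(outputs_dir, output_name, element_index):
    """Directory for secondary files of element N of a list-valued field."""
    return os.path.join(
        outputs_dir, "__secondary_files__", output_name, str(element_index)
    )

path = element_secondary_dir("/job/outputs", "record_output|__part__|f2", 1)
print(path)
```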

Alternative: Dynamic output discovery for record arrays

Instead of pre-creating nested collections, treat array-within-record fields as dynamic outputs and use from_provided_metadata discovery (like top-level array outputs do). This might be simpler since it avoids changing the pre-creation infrastructure.

Unresolved Questions

  • Existing tests for record outputs with plain File fields (no arrays, no secondaryFiles)? Could verify Bug 2 in isolation.
  • Does the fields list passed to record.py:prototype_elements contain CWL type info, or stripped?
  • Would dynamic output discovery (from_provided_metadata) work for nested collections within records?
  • The secondary_files_in_unnamed_records test (also xfail) — same root cause or different?
  • Does anyone currently use CWL record outputs successfully for simpler cases (all-File fields)?