Currently `BatchDataInstance` (line 534) and `BatchDataInstanceInternal` (line 883) are simple `{src, id}` models. Add `map_over_type: Optional[str] = None` to both. Use `Optional[str]`, consistent with how `collection_type` is modeled elsewhere.

This is the core request-layer gap: `map_over_type` is how clients express subcollection-mapping intent in batch values, but the schema does not model it.
Files: lib/galaxy/tool_util_models/parameters.py
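A minimal sketch of the field addition, assuming a pydantic model; the `src` literals shown are assumptions for illustration, and the real models live in lib/galaxy/tool_util_models/parameters.py:

```python
# Hypothetical sketch -- not the real Galaxy model; src literals are assumed.
from typing import Literal, Optional
from pydantic import BaseModel, StrictStr


class BatchDataInstance(BaseModel):
    src: Literal["hda", "ldda", "hdca"]  # assumed; check the real union
    id: StrictStr  # encoded ID at the external request layer
    map_over_type: Optional[str] = None  # e.g. "paired" or "list:paired"; None = no subcollection mapping
```

`BatchDataInstanceInternal` would gain the identical field, with its decoded integer `id`.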
DCE is backend-produced during batch expansion — it does NOT belong in the external request layer.
Add `DataRequestInternalDce` with `src: Literal["dce"]`, `id: StrictInt` (if not already present).

Add `"dce"` to the internal-only types:
- `DataRequestInternalDereferencedT` union: add `DatasetCollectionElementReference` (already exists at parameters.py:1067) to cover `job_internal` DCE refs produced by subcollection mapping expansion.
- Verify the `MultiDataInstanceInternal` and `MultiDataInstanceInternalDereferenced` unions include `DataRequestInternalDce`.
- Do NOT add `DataRequestDce` to the external `DataRequest` union or `BatchDataInstance.src`.
- Do NOT add `"dce"` to `BatchDataInstanceInternal.src`: batch expansion happens after `request_internal`, so DCE never appears in Batch values at that layer.
Files: lib/galaxy/tool_util_models/parameters.py
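A minimal sketch of the internal-only DCE model and how the union might gain it; the union membership shown is illustrative only, and the real unions in parameters.py have more members:

```python
# Hypothetical sketch -- stand-ins for the real internal models.
from typing import Literal, Union
from pydantic import BaseModel, StrictInt


class DataRequestInternalHda(BaseModel):  # stand-in for the existing model
    src: Literal["hda"]
    id: StrictInt


class DataRequestInternalDce(BaseModel):
    src: Literal["dce"]
    id: StrictInt  # decoded integer ID; never the encoded external form


# Illustrative union wiring; the real DataRequestInternalDereferencedT has more members.
DataRequestInternalDereferencedT = Union[DataRequestInternalHda, DataRequestInternalDce]
```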
The `encode()` and `decode()` functions in convert.py work with a generic src-dict format. Verify they handle `src: "dce"` in internal representations without special-casing. The `dereference()` function may need DCE handling if a dereference step encounters stored DCE refs.

Fix `runtimeify` in convert.py (line 548): it currently hardcodes `DataRequestInternalHda(**value)`, which breaks on DCE src dicts. It needs to dispatch on `src` and handle DCE-to-dataset resolution.
Files: lib/galaxy/tool_util/parameters/convert.py
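A minimal dispatch sketch under stated assumptions: `resolve_dce` is a made-up callback standing in for however the real `runtimeify` resolves a collection element to its dataset; the real function in convert.py works with the pydantic models, not plain dicts:

```python
# Hypothetical dispatch-on-src sketch -- not the real convert.py implementation.
from typing import Any, Callable, Dict


def runtimeify_value(
    value: Dict[str, Any],
    resolve_dce: Callable[[int], Dict[str, Any]],
) -> Dict[str, Any]:
    """Dispatch a src dict instead of hardcoding one model.

    resolve_dce is an assumed callback mapping a DCE id to the dataset
    reference it wraps.
    """
    src = value["src"]
    if src == "dce":
        # Resolve the collection element to its underlying dataset ref.
        return resolve_dce(value["id"])
    if src in ("hda", "ldda"):
        return dict(value)  # already a dataset reference; pass through
    raise ValueError(f"unhandled src {src!r}")
```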
```shell
PYTHONPATH=lib python -m pytest test/unit/tool_util/test_parameter_specification.py -x --timeout=60
```

Existing tests should still pass; we are only adding new fields/types, not changing existing validation.
Add test cases to the `gx_data` entry. These validate the client-facing schema:

```yaml
# request_valid additions — map_over_type on batch values:
- parameter: {__class__: "Batch", values: [{src: hdca, id: abcdabcd, map_over_type: paired}]}
- parameter: {__class__: "Batch", values: [{src: hdca, id: abcdabcd, map_over_type: "list:paired"}]}
# map_over_type: null should also be valid (no subcollection mapping)
- parameter: {__class__: "Batch", values: [{src: hdca, id: abcdabcd, map_over_type: null}]}
# landing_request_valid additions — landing pages can pre-fill batch params with map_over_type:
- parameter: {__class__: "Batch", values: [{src: hdca, id: abcdabcd, map_over_type: paired}]}
# request_invalid additions — dce should NOT be valid in external request:
- parameter: {__class__: "Batch", values: [{src: dce, id: abcdabcd}]}
- parameter: {src: dce, id: abcdabcd}
```

These validate post-decode representations where `map_over_type` carries through:

```yaml
# request_internal_valid additions:
- parameter: {__class__: "Batch", values: [{src: hdca, id: 5, map_over_type: paired}]}
# request_internal_dereferenced_valid additions:
- parameter: {__class__: "Batch", values: [{src: hdca, id: 5, map_over_type: paired}]}
```

DCE does NOT belong in Batch values at `request_internal`: batch expansion has not happened yet, and reruns reconstruct HDCA refs via `build_for_rerun`.
After expansion, individual job params contain DCE refs (not wrapped in Batch — Batch is expanded away by this layer). Subcollection mapping over gx_data produces {"src": "dce", "id": <int>} via to_decoded_json — each expanded job gets a DCE representing one subcollection element whose child_collection contains the datasets the tool will process.
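A toy illustration of that expansion step; the HDCA id and the `DatasetCollectionElement` ids below are made up for the example:

```python
# Toy sketch: a Batch value mapped over a list:paired HDCA with
# map_over_type "paired" expands to one job per pair element,
# each job holding a bare DCE ref (the Batch wrapper is gone).
batch_value = {
    "__class__": "Batch",
    "values": [{"src": "hdca", "id": 7, "map_over_type": "paired"}],
}

# Assumed: the backend looked up HDCA 7 and found its top-level
# pair elements with these (made-up) DatasetCollectionElement ids.
pair_element_ids = [11, 12, 13]

expanded_job_params = [{"src": "dce", "id": dce_id} for dce_id in pair_element_ids]
```

Each entry in `expanded_job_params` is what one expanded job sees for the parameter: a DCE whose `child_collection` holds the datasets that job will process.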
```yaml
# job_internal_valid additions — subcollection mapping produces DCE refs:
- parameter: {src: dce, id: 5}
# job_internal_invalid — DCE with encoded ID should fail:
- parameter: {src: dce, id: abcdabcd}
```

The current `job_internal` schema for `gx_data` only allows `src: "hda"` or `src: "ldda"` (`DataRequestInternalDereferencedT`). `DatasetCollectionElementReference` must be added to the union.

```shell
PYTHONPATH=lib python -m pytest test/unit/tool_util/test_parameter_specification.py -x --timeout=60
```

Write specs first (red), then fix any model issues (green).
Files: test/unit/tool_util/parameter_specification.yml
Currently (meta.py:472) the async path rejects `src != "hdca"`. Change it to accept `"dce"` and resolve DatasetCollectionElement → child collection, matching the sync path.

This matters for job reruns, where stored job state contains DCE refs from a previous expansion.
```python
# Sketch of the change around meta.py:472; the exception message is elided here.
if src not in ("hdca", "dce"):
    raise exceptions.ToolMetaParameterException(...)
if src == "dce":
    # Resolve the collection element, then map over its child collection,
    # matching what the sync path already does.
    item = app.model.context.get(DatasetCollectionElement, item_id)
    collection = item.child_collection
else:
    item = app.model.context.get(HistoryDatasetCollectionAssociation, item_id)
    collection = item.collection
```

Files: lib/galaxy/tools/parameters/meta.py
The existing `test_map_over_with_nested_paired_output_format_actions` uses a manual dict. Refactor it to use the `tool_input_format` fixture (runs 3x: flat, nested, request) so it gains request-format coverage with `map_over_type`.

The request-format callback produces `{__class__: "Batch", values: [{src: "hdca", id: ..., map_over_type: "paired"}]}`. Check whether `DescribeToolInputs` supports this or whether the fluent API needs to be extended.
Migrate `test_simple_subcollection_mapping` from test_tools.py to test_tool_execute.py with request-format coverage:

```python
@requires_tool_id("cat1")
def test_simple_subcollection_mapping(
    target_history: TargetHistory,
    required_tool: RequiredTool,
    tool_input_format: DescribeToolInputs,
):
    hdca = target_history.with_example_list_of_pairs()
    # legacy/nested: {"f1": {"batch": True, "values": [{"src": "hdca", "map_over_type": "paired", "id": hdca_id}]}}
    # request: {"f1": {"__class__": "Batch", "values": [{"src": "hdca", "id": hdca_id, "map_over_type": "paired"}]}}
    ...
```

Refactor the existing `test_map_over_paired_or_unpaired_with_list_paired` to use the `tool_input_format` fixture so it covers all three input formats, including request.
Review `DescribeToolInputs` in populators.py to see if `.when.request()` callbacks can produce batch inputs with `map_over_type`. If not, extend the fluent API. A helper may be needed, such as:

```python
def batch_with_map_over(hdca, map_over_type):
    return {"__class__": "Batch", "values": [{**hdca.src_dict, "map_over_type": map_over_type}]}
```

```shell
PYTHONPATH=lib python -m pytest test/unit/tool_util/test_parameter_specification.py -x
./run_tests.sh -api lib/galaxy_test/api/test_tool_execute.py -k "subcollection or dce or map_over"
./run_tests.sh -api lib/galaxy_test/api/test_tool_execute.py
```

| Step | Phase | Description | Test First? |
|---|---|---|---|
| 1 | 2a-2b | Write parameter specification tests for map_over_type (expect failures) | Yes (red) |
| 2 | 1a | Add map_over_type to BatchDataInstance/BatchDataInstanceInternal | Green |
| 3 | 2d | Verify spec tests pass | Green check |
| 4 | 1b-1c | Add DCE to internal representations, fix runtimeify in convert.py | Implementation |
| 5 | 2c | Write job_internal spec tests for DCE (red→green) | Red→Green |
| 6 | 4a-4d | Write API execution tests (expect failures for request format) | Yes (red) |
| 7 | 3a | Fix async expansion for DCE | Green |
| 8 | 4d | Extend fluent API if needed | Green |
| 9 | 5a-5c | Full test runs | Regression |