jmchilton/CWL_EXTRA_PROPERTIES_PROBLEM.md

## CWL_EXTRA_PROPERTIES_PROBLEM.md

      
    CWL Extra Properties in Job JSON

Problem

Failing test: test_conformance_v1_2_nested_prefixes_arrays
err_msg: "2 validation errors for binding-test.cwl (request model)
  min_std_max_min - Extra inputs are not permitted
  minimum_seed_length - Extra inputs are not permitted"

CWL conformance tests share job JSON files across multiple tools. bwa-mem-job.json has 4 keys:

- reference (File)
- reads (File[])
- min_std_max_min ([1,2,3,4])
- minimum_seed_length (3)

But binding-test.cwl defines only three inputs: reference, reads, and #args.py.
Galaxy's Pydantic request model uses extra="forbid", so it rejects the two remaining
keys (min_std_max_min and minimum_seed_length).
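To make the mismatch concrete, here is a minimal stand-alone sketch in plain Python. It is not Galaxy's Pydantic model: `strict_errors` is a hypothetical helper that only imitates what an extra="forbid" check reports, and the File values in the job are placeholders rather than the real contents of bwa-mem-job.json.

```python
# Inputs declared by the tool (names taken from the text above).
TOOL_INPUTS = {"reference", "reads", "#args.py"}

# Job object with the same four keys as bwa-mem-job.json; File values are placeholders.
JOB = {
    "reference": {"class": "File", "location": "placeholder-reference"},
    "reads": [{"class": "File", "location": "placeholder-reads"}],
    "min_std_max_min": [1, 2, 3, 4],
    "minimum_seed_length": 3,
}


def strict_errors(inputs, allowed):
    """Hypothetical helper: one message per key the tool does not declare."""
    return [
        f"{key} - Extra inputs are not permitted"
        for key in inputs
        if key not in allowed
    ]


errors = strict_errors(JOB, TOOL_INPUTS)
print(f"{len(errors)} validation errors")
for message in errors:
    print("  " + message)
```

The two messages correspond to the two keys named in the failing test's err_msg.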

What CWL spec / cwltool actually do

The CWL v1.2 spec does NOT explicitly address extra job inputs. It says "Validate the input
object against the inputs schema" but doesn't specify whether extra properties should be
rejected or ignored.
However, cwltool (reference implementation) keeps extra keys and makes them available in JS:

1. process.py:_init_job() copies the raw joborder dict → job
2. Validates with validate_ex(schema, job, strict=False); strict=False means extra keys produce a warning, not an error (schema_salad/validate.py:438-441)
3. job (with extras still present) is passed to Builder.__init__ → self.job
4. Builder.do_eval() passes self.job to expression.do_eval() as jobinput
5. expression.do_eval() sets {"inputs": jobinput, ...} as the JS root context (cwl_utils/expression.py:292)
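The steps above can be modelled with a small simulation. This is a simplified sketch, not cwltool code: `validate_non_strict` and `build_js_context` are hypothetical names, the warning text is invented for illustration, and only the shape of the `{"inputs": ...}` root object is taken from the description above.

```python
import warnings


def validate_non_strict(job, declared):
    """Warn about undeclared keys but leave the job dict untouched."""
    extras = [key for key in job if key not in declared]
    for key in extras:
        # Illustrative wording only; not cwltool's actual warning message.
        warnings.warn(f"undeclared input field {key!r} kept in job")
    return extras


def build_js_context(job):
    """Mirror the {"inputs": jobinput, ...} root object given to expressions."""
    return {"inputs": job}


job = {"reference": "ref", "reads": ["r1", "r2"], "minimum_seed_length": 3}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    extras = validate_non_strict(job, {"reference", "reads", "#args.py"})

context = build_js_context(job)
# The extra key is still reachable, as $(inputs.minimum_seed_length) would be.
print(context["inputs"]["minimum_seed_length"])
```

The point of the sketch is that validation only reports the extra key; nothing removes it before the expression context is built.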

So extra job keys survive into $(inputs.extra_key) in JavaScript expressions. cwltool's
--strict flag would reject them. Although strict=True is the argparse default, the
conformance test runner passes --non-strict, so extras are tolerated there.
The conformance test itself is implicit evidence: nested_prefixes_arrays uses
bwa-mem-job.json (4 keys) with binding-test.cwl (3 inputs). The test expects success.
No explicit "extra inputs" conformance test exists.
For Galaxy's purposes: filtering extras before validation is safe — Galaxy doesn't run
cwltool JS expressions at request validation time; it passes them through to cwltool later
in the job execution where cwltool handles the full job dict independently.

Flow


1. CwlPopulator.run_cwl_job() loads bwa-mem-job.json with all 4 keys
2. stage_inputs() processes ALL keys, even creating an HDCA from [1,2,3,4] for min_std_max_min
3. _run_cwl_tool_job() → tool_request_raw() POSTs to /api/jobs with all 4 keys
4. JobsService.create() builds RequestToolState from inputs
5. RequestToolState.validate() creates Pydantic model with only the tool's defined params (reference, reads, #args.py)
6. Pydantic model has extra="forbid" (via create_model_strict() in tool_util_models/parameters.py:2497)
7. Validation rejects min_std_max_min and minimum_seed_length as extra forbidden inputs

Key Files


- lib/galaxy/webapps/galaxy/services/jobs.py:241-255: JobsService.create(), validation entry point
- lib/galaxy/tool_util/parameters/state.py:87-92: RequestToolState uses create_request_model
- lib/galaxy/tool_util_models/parameters.py:2495-2497: create_model_strict with extra="forbid"
- lib/galaxy_test/base/populators.py:3053-3085: _run_cwl_tool_job, submits raw job dict
- lib/galaxy_test/base/populators.py:3111-3175: run_cwl_job, loads job JSON and stages inputs
- test/functional/tools/cwl_tools/v1.2/tests/binding-test.cwl: tool with 3 inputs
- test/functional/tools/cwl_tools/v1.2/tests/bwa-mem-job.json: job with 4 keys (2 extra)

Existing Infrastructure

lib/galaxy/tool_util/cwl/job_conversion.py already has cwl_job_to_request() which strips extra keys:

    param_names = {p.name for p in input_models.parameters}
    for key in list(job.keys()):
        if key not in param_names:
            del job[key]

This function isn't used in the conformance test submission path though.

Staging is a blind walk (no schema)

galactic_job_json() (tool_util/cwl/util.py:418-422) iterates every key in the job dict
with zero schema awareness:

    replace_keys = {}
    for key, value in job.items():
        replace_keys[key] = replacement_item(value)
    job.update(replace_keys)

replacement_item() dispatches purely on Python type / class field:

- {class: File} → upload → {src: hda, id: ...}
- {class: Directory} → tar + upload → {src: hda, id: ...}
- list → each item uploaded, wrapped in HDCA → {src: hdca, id: ...}
- scalar (for tools) → pass through unchanged

No schema is used client-side at any point — not for staging, not for submission. All the
CWL input schema parsing and parameter model generation happens server-side via the tool
parameter models (which already have good test coverage for CWL types).
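The type-only dispatch described above can be illustrated with a toy model. This is a sketch under stated assumptions, not the code in tool_util/cwl/util.py: `fake_upload` and `stage` are hypothetical, nothing is actually uploaded, and the returned ids are made up.

```python
uploads = []  # records every simulated upload or collection creation


def fake_upload(src, payload):
    """Pretend to create a dataset (hda) or collection (hdca) and return a reference."""
    uploads.append(payload)
    return {"src": src, "id": f"{src}-{len(uploads)}"}


def replacement_item(value):
    """Dispatch on Python type / "class" field only; no tool schema is consulted."""
    if isinstance(value, dict) and value.get("class") in ("File", "Directory"):
        return fake_upload("hda", value)
    if isinstance(value, list):
        elements = [fake_upload("hda", item) for item in value]
        return fake_upload("hdca", elements)
    return value  # scalars pass through unchanged


def stage(job):
    """Walk every key blindly, including keys the tool never declared."""
    return {key: replacement_item(value) for key, value in job.items()}


staged = stage({"min_std_max_min": [1, 2, 3, 4], "minimum_seed_length": 3})
print(staged)
print(len(uploads), "simulated uploads for keys the tool does not use")
```

Because the walk never sees the tool's inputs, the list [1,2,3,4] is staged as a collection even though binding-test.cwl has no such parameter.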

Recommended Fix

Filter extra keys server-side in JobsService.create(), after loading the tool but
before request validation. The tool's parameter models are already available at this point
and correctly handle all CWL schema complexity. The JobRequest Pydantic model accepts
inputs: dict[str, Any], so extras pass through FastAPI fine — rejection happens at
RequestToolState.validate() inside create().

    # jobs.py:create(), after line 247 (inputs = job_request.inputs)
    if inputs and tool.tool_type in ("cwl", "galactic_cwl"):
        param_names = {p.name for p in tool.parameters}
        inputs = {k: v for k, v in inputs.items() if k in param_names}

This reuses the server's already-parsed parameter models — no CWL schema parsing needed.
The CWL job runner builds its own job dict independently from the tool source, so filtering
at the API boundary doesn't lose anything.
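For experimenting with the filter outside Galaxy, the snippet can be wrapped in a self-contained function. This is only a sketch: `Param`, `Tool`, and `filter_cwl_extras` are hypothetical stand-ins, and the assumption that the loaded tool exposes `tool_type` and `parameters` comes from the snippet above rather than from a verified Galaxy API.

```python
from dataclasses import dataclass, field
from typing import Any

CWL_TOOL_TYPES = ("cwl", "galactic_cwl")


@dataclass
class Param:
    """Hypothetical stand-in for a parsed tool parameter model."""
    name: str


@dataclass
class Tool:
    """Hypothetical stand-in for the loaded tool object."""
    tool_type: str
    parameters: list[Param] = field(default_factory=list)


def filter_cwl_extras(tool: Tool, inputs: dict[str, Any]) -> dict[str, Any]:
    """Drop undeclared keys for CWL tools; leave other tool types untouched."""
    if not inputs or tool.tool_type not in CWL_TOOL_TYPES:
        return inputs
    param_names = {p.name for p in tool.parameters}
    return {k: v for k, v in inputs.items() if k in param_names}


binding_test = Tool("cwl", [Param("reference"), Param("reads"), Param("#args.py")])
request_inputs = {
    "reference": {"src": "hda", "id": "1"},
    "reads": {"src": "hdca", "id": "2"},
    "min_std_max_min": {"src": "hdca", "id": "3"},
    "minimum_seed_length": 3,
}
filtered = filter_cwl_extras(binding_test, request_inputs)
print(sorted(filtered))
```

Restricting the filter to CWL tool types keeps strict validation in place for every other tool, so a misspelled parameter on a regular Galaxy tool would still be rejected.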

Side effect: wasteful staging remains

stage_inputs() still blindly uploads all job keys (e.g. creating an HDCA from [1,2,3,4]
for min_std_max_min). This is harmless but wasteful. Fixing it would require either:

- Passing tool parameter info to the client (more invasive)
- Parsing the CWL file client-side (fragile: list vs dict forms, # prefixes, nested types, $import/$mixin, etc.)

Not worth it for now.

Unresolved Questions


Do any other conformance tests hit the same issue? Likely yes: any test that shares a job
JSON across tools with different input sets.