@jmchilton
Created February 24, 2026 01:56
CWL Extra Properties in Job JSON

Problem

Failing test: test_conformance_v1_2_nested_prefixes_arrays

err_msg: "2 validation errors for binding-test.cwl (request model)
  min_std_max_min - Extra inputs are not permitted
  minimum_seed_length - Extra inputs are not permitted"

CWL conformance tests share job JSON files across multiple tools. bwa-mem-job.json has 4 keys:

  • reference (File)
  • reads (File[])
  • min_std_max_min ([1,2,3,4])
  • minimum_seed_length (3)

But binding-test.cwl only defines inputs: reference, reads, #args.py.

Galaxy's Pydantic request model uses extra="forbid" and rejects them.

What CWL spec / cwltool actually do

The CWL v1.2 spec does NOT explicitly address extra job inputs. It says "Validate the input object against the inputs schema" but doesn't specify whether extra properties should be rejected or ignored.

However, cwltool (reference implementation) keeps extra keys and makes them available in JS:

  1. process.py:_init_job() copies the raw joborder dict → job
  2. Validates with validate_ex(schema, job, strict=False); strict=False means extra keys produce a warning, not an error (schema_salad/validate.py:438-441)
  3. job (with extras still present) is passed to Builder.__init__ and stored as self.job
  4. Builder.do_eval() passes self.job to expression.do_eval() as jobinput
  5. expression.do_eval() sets {"inputs": jobinput, ...} as the JS root context (cwl_utils/expression.py:292)

So extra job keys survive into $(inputs.extra_key) in JavaScript expressions. cwltool's --strict flag would reject them, and strict=True is the argparse default, but the conformance test runner uses --non-strict.
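The non-strict behaviour described above can be sketched in a few lines. This is not cwltool code: `validate_job` and `build_js_context` are hypothetical stand-ins, and the File values are placeholders. It only illustrates that validation warns about undeclared keys without removing them, so they remain reachable under `inputs`.

```python
import warnings


def validate_job(input_names, job, strict=False):
    """Hypothetical stand-in for schema validation of a job dict.

    Unknown keys raise in strict mode and only warn otherwise.
    The job dict itself is never modified.
    """
    extras = sorted(key for key in job if key not in input_names)
    if extras and strict:
        raise ValueError(f"unknown job keys: {extras}")
    for key in extras:
        warnings.warn(f"job key not declared by the tool: {key}")
    return extras


def build_js_context(job):
    # The whole job dict, extras included, becomes the "inputs" object.
    return {"inputs": job}


# Placeholder values; only the key names come from bwa-mem-job.json.
job = {
    "reference": "<File>",
    "reads": ["<File>", "<File>"],
    "min_std_max_min": [1, 2, 3, 4],
    "minimum_seed_length": 3,
}
tool_inputs = {"reference", "reads", "#args.py"}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    extras = validate_job(tool_inputs, job, strict=False)

context = build_js_context(job)
print(extras)
print(context["inputs"]["minimum_seed_length"])
```

With strict=True the same call raises instead of warning, which mirrors the difference between --strict and --non-strict described above.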

The conformance test itself is implicit evidence: nested_prefixes_arrays uses bwa-mem-job.json (4 keys) with binding-test.cwl (3 inputs). The test expects success. No explicit "extra inputs" conformance test exists.

For Galaxy's purposes: filtering extras before validation is safe — Galaxy doesn't run cwltool JS expressions at request validation time; it passes them through to cwltool later in the job execution where cwltool handles the full job dict independently.

Flow

  1. CwlPopulator.run_cwl_job() loads bwa-mem-job.json with all 4 keys
  2. stage_inputs() processes ALL keys — even creates an HDCA from [1,2,3,4] for min_std_max_min
  3. _run_cwl_tool_job() → tool_request_raw() POSTs to /api/jobs with all 4 keys
  4. JobsService.create() builds RequestToolState from inputs
  5. RequestToolState.validate() creates Pydantic model with only the tool's defined params (reference, reads, #args.py)
  6. Pydantic model has extra="forbid" (via create_model_strict() in tool_util_models/parameters.py:2497)
  7. Validation rejects min_std_max_min and minimum_seed_length as extra forbidden inputs
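The last three steps can be imitated without Pydantic to make the failure mode concrete. The helper below is a hypothetical, stdlib-only sketch that reproduces the shape of the error message quoted in the Problem section; it is not Galaxy's or Pydantic's actual validation code, and the input values are placeholders.

```python
def forbid_extra_errors(model_name, defined_names, inputs):
    """Imitate the shape of an extra="forbid" failure (not Pydantic itself)."""
    # Any submitted key the model does not define counts as an error.
    extras = [key for key in inputs if key not in defined_names]
    if not extras:
        return None
    noun = "error" if len(extras) == 1 else "errors"
    lines = [f"{len(extras)} validation {noun} for {model_name}"]
    lines += [f"  {key} - Extra inputs are not permitted" for key in extras]
    return "\n".join(lines)


# Key names from the document; values are placeholders.
submitted = {
    "reference": "<hda ref>",
    "reads": "<hdca ref>",
    "min_std_max_min": "<hdca ref>",
    "minimum_seed_length": 3,
}
defined = {"reference", "reads", "#args.py"}

message = forbid_extra_errors("binding-test.cwl (request model)", defined, submitted)
print(message)
```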

Key Files

  • lib/galaxy/webapps/galaxy/services/jobs.py:241-255 - JobsService.create(), validation entry point
  • lib/galaxy/tool_util/parameters/state.py:87-92 - RequestToolState uses create_request_model
  • lib/galaxy/tool_util_models/parameters.py:2495-2497 - create_model_strict with extra="forbid"
  • lib/galaxy_test/base/populators.py:3053-3085 - _run_cwl_tool_job, submits raw job dict
  • lib/galaxy_test/base/populators.py:3111-3175 - run_cwl_job, loads job JSON and stages inputs
  • test/functional/tools/cwl_tools/v1.2/tests/binding-test.cwl - tool with 3 inputs
  • test/functional/tools/cwl_tools/v1.2/tests/bwa-mem-job.json - job with 4 keys (2 extra)

Existing Infrastructure

lib/galaxy/tool_util/cwl/job_conversion.py already has cwl_job_to_request() which strips extra keys:

param_names = {p.name for p in input_models.parameters}
for key in list(job.keys()):
    if key not in param_names:
        del job[key]

This function isn't used in the conformance test submission path though.

Staging is a blind walk (no schema)

galactic_job_json() (tool_util/cwl/util.py:418-422) iterates every key in the job dict with zero schema awareness:

replace_keys = {}
for key, value in job.items():
    replace_keys[key] = replacement_item(value)
job.update(replace_keys)

replacement_item() dispatches purely on Python type / class field:

  • {class: File} → upload → {src: hda, id: ...}
  • {class: Directory} → tar + upload → {src: hda, id: ...}
  • list → each item uploaded, wrapped in HDCA → {src: hdca, id: ...}
  • scalar (for tools) → pass through unchanged

No schema is used client-side at any point — not for staging, not for submission. All the CWL input schema parsing and parameter model generation happens server-side via the tool parameter models (which already have good test coverage for CWL types).
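The type-based dispatch can be sketched as follows. This is an illustration only: `sketch_replacement_item`, `fake_upload`, and `stage` are hypothetical names, the ids are made up, and no real upload happens. It shows why an undeclared key such as min_std_max_min still ends up as a collection reference.

```python
import itertools

_ids = itertools.count(1)


def fake_upload(src):
    """Hypothetical stand-in for an upload; returns a made-up reference."""
    return {"src": src, "id": f"{src}-{next(_ids)}"}


def sketch_replacement_item(value):
    # Dispatch on the Python type and the "class" field only; no schema.
    if isinstance(value, dict) and value.get("class") == "File":
        return fake_upload("hda")
    if isinstance(value, dict) and value.get("class") == "Directory":
        return fake_upload("hda")  # the real path tars the directory first
    if isinstance(value, list):
        return fake_upload("hdca")  # items uploaded, then wrapped in a collection
    return value  # scalars pass through unchanged


def stage(job):
    # Every key is processed, whether or not the tool declares it.
    return {key: sketch_replacement_item(value) for key, value in job.items()}


# Key names from bwa-mem-job.json; File contents are placeholders.
job = {
    "reference": {"class": "File"},
    "reads": [{"class": "File"}, {"class": "File"}],
    "min_std_max_min": [1, 2, 3, 4],
    "minimum_seed_length": 3,
}
staged = stage(job)
print(staged["min_std_max_min"]["src"])
print(staged["minimum_seed_length"])
```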

Recommended Fix

Filter extra keys server-side in JobsService.create(), after loading the tool but before request validation. The tool's parameter models are already available at this point and correctly handle all CWL schema complexity. The JobRequest Pydantic model accepts inputs: dict[str, Any], so extras pass through FastAPI fine — rejection happens at RequestToolState.validate() inside create().

# jobs.py:create(), after line 247 (inputs = job_request.inputs)
if inputs and tool.tool_type in ("cwl", "galactic_cwl"):
    param_names = {p.name for p in tool.parameters}
    inputs = {k: v for k, v in inputs.items() if k in param_names}

This reuses the server's already-parsed parameter models — no CWL schema parsing needed. The CWL job runner builds its own job dict independently from the tool source, so filtering at the API boundary doesn't lose anything.
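As a standalone sketch, the same filter can be exercised outside Galaxy. `Param`, `Tool`, and `filter_cwl_inputs` below are hypothetical stand-ins for the real tool and parameter model objects; only the tool_type values and the filtering expression come from the snippet above.

```python
from dataclasses import dataclass, field

CWL_TOOL_TYPES = ("cwl", "galactic_cwl")


@dataclass
class Param:
    # Hypothetical stand-in for a tool parameter model.
    name: str


@dataclass
class Tool:
    # Hypothetical stand-in for the loaded tool object.
    tool_type: str
    parameters: list = field(default_factory=list)


def filter_cwl_inputs(tool, inputs):
    """Drop keys the tool does not declare; leave non-CWL tools untouched."""
    if not inputs or tool.tool_type not in CWL_TOOL_TYPES:
        return inputs
    param_names = {p.name for p in tool.parameters}
    # Build a new dict rather than mutating the request payload.
    return {k: v for k, v in inputs.items() if k in param_names}


binding_test = Tool("cwl", [Param("reference"), Param("reads"), Param("#args.py")])
# Placeholder values; key names are from the document.
inputs = {
    "reference": "<hda ref>",
    "reads": "<hdca ref>",
    "min_std_max_min": "<hdca ref>",
    "minimum_seed_length": 3,
}
filtered = filter_cwl_inputs(binding_test, inputs)
print(sorted(filtered))
```

Unlike the in-place deletion in cwl_job_to_request(), this variant returns a new dict, so the original request inputs stay available for logging or debugging.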

Side effect: wasteful staging remains

stage_inputs() still blindly uploads all job keys (e.g. creating an HDCA from [1,2,3,4] for min_std_max_min). This is harmless but wasteful. Fixing it would require either:

  • Passing tool parameter info to the client (more invasive)
  • Parsing the CWL file client-side (fragile — list vs dict forms, # prefixes, nested types, $import/$mixin, etc.)

Not worth it for now.

Unresolved Questions

  • Any other conformance tests hit same issue? Likely yes — any test sharing a job JSON across tools with different input sets.