@jmchilton
Created February 24, 2026 18:09
Tool Runtime Timelimit Plan

Problem

CWL conformance test test_conformance_v1_2_timelimit_basic is a false green - it passes because a broad exception handler (lib/galaxy_test/base/populators.py:3191-3194) catches an unrelated failure, not because Galaxy actually enforces the ToolTimeLimit requirement. Galaxy declares ToolTimeLimit in SUPPORTED_TOOL_REQUIREMENTS but never extracts, propagates, or enforces the timelimit value.
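
To make the false-green mechanism concrete, here is a minimal sketch of how a broad handler can make a should_fail test pass for the wrong reason. Every name in it (ToolTimeLimitExceeded, UnrelatedFailure, run_job, and the two checkers) is hypothetical and is not taken from populators.py.

```python
class ToolTimeLimitExceeded(Exception):
    """Hypothetical: the failure the test is meant to observe."""


class UnrelatedFailure(Exception):
    """Hypothetical: any other error, e.g. a staging or parsing problem."""


def run_job(enforce_timelimit: bool) -> None:
    # Stand-in for running the tool. Without enforcement, something
    # unrelated breaks instead of the job being killed by a timeout.
    if enforce_timelimit:
        raise ToolTimeLimitExceeded("killed after 3s")
    raise UnrelatedFailure("failed for another reason")


def should_fail_broad(enforce_timelimit: bool) -> bool:
    """Passes whenever anything raises, which hides the real cause."""
    try:
        run_job(enforce_timelimit)
    except Exception:
        return True
    return False


def should_fail_strict(enforce_timelimit: bool) -> bool:
    """Passes only when the expected failure is the one observed."""
    try:
        run_job(enforce_timelimit)
    except ToolTimeLimitExceeded:
        return True
    except Exception:
        return False
    return False
```

With enforcement absent, should_fail_broad(False) reports a pass while should_fail_strict(False) does not, which is the gap this plan is meant to close.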

Goal

  1. Add timelimit as a first-class resource requirement in Galaxy XML/YAML tools (non-CWL)
  2. Wire it through the existing resource requirements infrastructure (which implicitly reaches TPV)
  3. Enforce it in the local job runner
  4. Wire CWL ToolTimeLimit into Galaxy's resource requirement system

Commits 1-3 are CWL-independent and can be branched off for a standalone PR. Commit 4 is CWL-specific.

Background

Existing Infrastructure

  • ResourceType literal (lib/galaxy/tool_util/deps/requirements.py:236-249): 12 types covering cores, ram, tmpdir, cuda, and shm variants. There is no timelimit type.
  • ResourceRequirement class (same file, line 253): stores value_or_expression + resource_type. It accepts both numeric values and expressions, but evaluating an expression raises NotImplementedError.
  • resource_requirements_from_list() (line 280): maps CWL camelCase keys to Galaxy snake_case keys via cwl_to_galaxy dict. For Galaxy-format items (type: resource), valid keys come from cwl_to_galaxy.values().
  • ResourceRequirement Pydantic model (lib/galaxy/tool_util_models/tool_source.py:56-73): separate model for YAML tool validation/schema generation. Has explicit fields for each resource type. Drives ToolSourceSchema.json generation.
  • Local runner (lib/galaxy/jobs/runners/local.py): already has __poll_if_needed() (line 235) that polls running processes and calls job_wrapper.check_limits() for output size and global walltime.
  • Job wrapper check_limits() (lib/galaxy/jobs/__init__.py:2414): checks global walltime_delta from job config. runtime parameter is a datetime.timedelta. This is an admin-set global limit, NOT a per-tool limit.
  • Job wrapper has_limits() (line 2442): gates whether __poll_if_needed() activates polling. Currently only checks global output_size and walltime.
  • Runner states (lib/galaxy/jobs/runners/util/__init__.py:10-17): includes WALLTIME_REACHED and GLOBAL_WALLTIME_REACHED as distinct states.
  • CWL ToolTimeLimit format: ToolTimeLimit: { timelimit: <seconds> } where value is int/float in seconds. Can also be an expression $(...).
  • XSD schema (galaxy.xsd:8090-8156): ResourceType simpleType with 12 enumerated values.
  • Other runners using check_limits(): drmaa.py and pbs.py also call check_limits() via the runner state wrapper - they'll automatically benefit. Kubernetes and Pulsar do NOT use check_limits() and won't get timelimit support from this change.
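
The key-mapping behaviour described for resource_requirements_from_list() can be modelled roughly as follows. This is a simplified sketch based only on the description above, not the actual Galaxy function: the mapping is abbreviated to three entries, and CWL_TO_GALAXY and collect_resources are illustrative names.

```python
from typing import Any, Dict, List, Tuple

# Abbreviated, illustrative mapping; the real dict covers all resource types.
CWL_TO_GALAXY: Dict[str, str] = {
    "coresMin": "cores_min",
    "ramMin": "ram_min",
    "tmpdirMin": "tmpdir_min",
}


def collect_resources(items: List[Dict[str, Any]]) -> List[Tuple[str, Any]]:
    """Return (galaxy_key, value) pairs for recognised resource keys."""
    found: List[Tuple[str, Any]] = []
    for item in items:
        if item.get("class") == "ResourceRequirement":
            valid = set(CWL_TO_GALAXY)           # CWL camelCase keys
        elif item.get("type") == "resource":
            valid = set(CWL_TO_GALAXY.values())  # Galaxy snake_case keys
        else:
            continue
        for key, value in item.items():
            if key in valid:
                # Translate CWL keys; Galaxy keys pass through unchanged.
                found.append((CWL_TO_GALAXY.get(key, key), value))
    return found
```

Under this model, a Galaxy-format key that is absent from the mapping's values is silently ignored, which matters when a new resource type is introduced.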

Key Distinction: Global Walltime vs Per-Tool Timelimit

The existing walltime limit in check_limits() is a global admin setting from job_conf.xml. Our new timelimit is a per-tool declaration by the tool author. These are complementary - both should be checked, with the more restrictive one winning.
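
Checking both limits independently has the same effect as enforcing the smaller of the two. The sketch below expresses that rule directly. effective_limit is a hypothetical helper, not an existing Galaxy function, and it assumes a per-tool value of zero means "no limit", in line with the CWL conformance tests.

```python
import datetime
from typing import List, Optional


def effective_limit(
    global_walltime: Optional[datetime.timedelta],
    tool_timelimit_seconds: Optional[float],
) -> Optional[datetime.timedelta]:
    """Return the stricter of the two limits, or None if neither applies."""
    candidates: List[datetime.timedelta] = []
    if global_walltime is not None:
        candidates.append(global_walltime)
    # Zero (or a negative value) is treated as "no per-tool limit".
    if tool_timelimit_seconds is not None and tool_timelimit_seconds > 0:
        candidates.append(datetime.timedelta(seconds=tool_timelimit_seconds))
    return min(candidates) if candidates else None
```

For example, a one-hour global walltime combined with a 3-second tool timelimit yields a 3-second effective limit.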

CWL Timelimit Conformance Tests

| Test | Tool | should_fail | Expected Behavior |
|------|------|-------------|-------------------|
| timelimit_basic | timelimit.cwl | true | sleep 15, 3s limit - killed by timeout |
| timelimit_invalid | timelimit2.cwl | true | negative timelimit -1 - CWL schema validation error |
| timelimit_zero_unlimited | timelimit3.cwl | false | zero timelimit = no limit, sleep 15 succeeds |
| timelimit_from_expression | timelimit4.cwl | true | $(1+2) expression - requires JS eval |
| timelimit_expressiontool | timelimit5.cwl | false | ExpressionTool ignores timelimit |
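
The value semantics these tests exercise can be summarised in a few lines. This is an illustrative sketch only: classify_timelimit and its return labels are hypothetical, and it covers CommandLineTool values (the ExpressionTool case is about where enforcement happens, not about the value).

```python
from typing import Union


def classify_timelimit(value: Union[int, float, str]) -> str:
    """Map a ToolTimeLimit value to the behaviour the tests expect."""
    if isinstance(value, str):
        return "expression"  # e.g. "$(1+2)"; needs JS evaluation first
    if value < 0:
        return "invalid"     # rejected by CWL schema validation
    if value == 0:
        return "unlimited"   # zero means no limit
    return "enforced"        # job is killed once the limit is exceeded
```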

Implementation Plan

Commit 1: Add timelimit resource requirement type to Galaxy XML/YAML tools

Files to modify:

  1. lib/galaxy/tool_util/deps/requirements.py

    • Add "timelimit" to ResourceType literal (line 248, before closing paren)
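    • No CWL key needs to be added to the cwl_to_galaxy dict yet - CWL ToolTimeLimit is a separate requirement class, not a ResourceRequirement field, and the dict's keys are only consulted for class: ResourceRequirement items. For Galaxy-format type: resource items, however, valid keys come from cwl_to_galaxy.values(), and adding "timelimit" to the ResourceType literal alone does not put it in that set. The valid key set must therefore also be extended to include "timelimit" (for example by deriving it from ResourceType), otherwise the key appears likely to be ignored.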
  2. lib/galaxy/tool_util_models/tool_source.py

    • Add timelimit: Optional[Union[int, float]] = None field to the ResourceRequirement Pydantic model (after shm_size, ~line 73). This is required for YAML tool validation and ToolSourceSchema.json generation.
  3. lib/galaxy/tool_util/xsd/galaxy.xsd

    • Add <xs:enumeration value="timelimit"> with documentation to ResourceType simpleType (after shm_size, before line 8155). Doc: "Maximum time in seconds the tool is allowed to run. Job will be terminated if exceeded."
  4. test/functional/tools/resource_requirements.xml

    • Add <resource type="timelimit">60</resource> after line 14
  5. test/unit/tool_util/test_parsing.py

    • Update TOOL_XML_1 fixture: add <resource type="timelimit">60</resource> (~line 55)
    • Update TOOL_YAML_1 fixture: add - type: resource / timelimit: 60 block (~line 166)
    • Update TestXmlLoader.test_requirements(): change count from 7 to 8, add assert resource_requirements[7].resource_type == "timelimit" after line 386
    • Update TestYamlLoader.test_requirements(): change len(resource_requirements) == 7 to == 8 (line 574), add assertion for resource_requirements[7]
  6. Regenerate ToolSourceSchema.json (via client/src/components/Tool/rebuild.py) after tool_source.py change.

Red-to-green test: Write the test assertion for timelimit first, see it fail (resource type not found), then add the type.
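
As a rough illustration of what the red-to-green step checks, the sketch below validates resource types against a Literal and shows the new "timelimit" entry being accepted. It is a simplified stand-in, not the Galaxy class: the literal is abbreviated, ResourceRequirementSketch is a hypothetical name, and treating any non-numeric string as an expression is an assumption.

```python
from typing import Literal, Union, get_args

# Abbreviated literal; the real one lists 12 types plus the new "timelimit".
ResourceType = Literal["cores_min", "ram_min", "shm_size", "timelimit"]
VALID_RESOURCE_TYPES = get_args(ResourceType)


class ResourceRequirementSketch:
    def __init__(self, value_or_expression: Union[int, float, str], resource_type: str):
        if resource_type not in VALID_RESOURCE_TYPES:
            raise ValueError(f"Unknown resource type: {resource_type}")
        self.value_or_expression = value_or_expression
        self.resource_type = resource_type
        # Assumption: anything that does not parse as a number is an
        # expression that would need evaluation at runtime.
        try:
            float(value_or_expression)
            self.runtime_required = False
        except ValueError:
            self.runtime_required = True

    def get_value(self) -> float:
        if self.runtime_required:
            raise NotImplementedError("expression evaluation is not implemented")
        return float(self.value_or_expression)
```

Before "timelimit" is added to the literal, constructing the requirement raises ValueError (red); afterwards it succeeds and get_value() returns the number of seconds (green).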

Commit 2: Local runner enforces per-tool timelimit

Files to modify:

  1. lib/galaxy/jobs/runners/util/__init__.py

    • Add new runner state: TOOL_TIMELIMIT_REACHED="tool_timelimit_reached" (after GLOBAL_WALLTIME_REACHED, ~line 15). Distinct from walltime states for operational visibility.
  2. lib/galaxy/tools/__init__.py

    • After self.resource_requirements is set (~line 1535), extract timelimit:
      self.timelimit = None
      for rr in self.resource_requirements:
          if rr.resource_type == "timelimit" and not rr.runtime_required:
              self.timelimit = rr.get_value()
              break
    • No dedicated parse_timelimit() on tool source interface - cores_min having its own accessor is a historical artifact. Extracting from the already-parsed resource_requirements list is cleaner.
  3. lib/galaxy/jobs/__init__.py (JobWrapper)

    • In has_limits() (~line 2442): add check for per-tool timelimit:
      has_tool_timelimit = self.tool is not None and getattr(self.tool, 'timelimit', None) is not None
      return has_output_limit or has_walltime_limit or has_tool_timelimit
      This is critical - without it, __poll_if_needed() won't activate when only per-tool timelimit exists (no global walltime configured).
    • In check_limits() (~line 2414): after global walltime check, add per-tool timelimit check:
      if self.tool and getattr(self.tool, 'timelimit', None) and runtime is not None:
          timelimit_seconds = self.tool.timelimit
          if timelimit_seconds > 0:  # zero = no limit (CWL spec)
              timelimit_delta = datetime.timedelta(seconds=timelimit_seconds)
              if runtime > timelimit_delta:
                  return (
                      JobState.runner_states.TOOL_TIMELIMIT_REACHED,
                      f"Job exceeded tool time limit ({timelimit_seconds}s)"
                  )
      Note: runtime is already a timedelta, so we convert timelimit seconds to timedelta for comparison.
  4. lib/galaxy/jobs/runners/local.py

    • No changes needed. __poll_if_needed() already calls job_wrapper.has_limits() and job_wrapper.check_limits(runtime=...).
  5. lib/galaxy/jobs/runners/state_handlers/resubmit.py

    • Add tool_timelimit_reached to MESSAGES dict (~line 19):
      tool_timelimit_reached="it exceeded the tool's time limit",
    • Add to _ExpressionContext (~line 147):
      "tool_timelimit_reached": runner_state == JobState.runner_states.TOOL_TIMELIMIT_REACHED,
  6. test/unit/app/jobs/test_runner_local.py

    • MockJobWrapper.has_limits() is hardcoded to False (line 214). Update to check self.tool.timelimit:
      def has_limits(self):
          return getattr(self.tool, 'timelimit', None) is not None
    • Add check_limits() mock method that mirrors the real implementation.
    • Add test: mock tool with timelimit=3, run sleep 15, assert job is killed and fail() is called.
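
The mock described in item 6 might look roughly like the following. This is a standalone sketch rather than the actual MockJobWrapper: ToolStub and JobWrapperSketch are hypothetical names, the runner state is a plain string constant, and only the per-tool check is modelled (no output size or global walltime).

```python
import datetime
from typing import Optional, Tuple

TOOL_TIMELIMIT_REACHED = "tool_timelimit_reached"


class ToolStub:
    """Hypothetical stand-in for a tool carrying an optional timelimit."""

    def __init__(self, timelimit: Optional[float] = None):
        self.timelimit = timelimit


class JobWrapperSketch:
    def __init__(self, tool: Optional[ToolStub]):
        self.tool = tool

    def has_limits(self) -> bool:
        # Polling only activates when this returns True.
        return self.tool is not None and getattr(self.tool, "timelimit", None) is not None

    def check_limits(self, runtime: Optional[datetime.timedelta] = None) -> Optional[Tuple[str, str]]:
        timelimit = getattr(self.tool, "timelimit", None) if self.tool else None
        # Zero or None means no per-tool limit is enforced.
        if timelimit and timelimit > 0 and runtime is not None:
            if runtime > datetime.timedelta(seconds=timelimit):
                return (
                    TOOL_TIMELIMIT_REACHED,
                    f"Job exceeded tool time limit ({timelimit}s)",
                )
        return None
```

A wrapper built with ToolStub(timelimit=3) returns None for a 2-second runtime and the TOOL_TIMELIMIT_REACHED tuple for a 4-second runtime.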

Red-to-green test: Add test that runs a tool with timelimit: 3 and sleep 15, expect job failure. Should fail initially since timelimit isn't enforced, then pass after implementation.
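
The behaviour this test expects, polling a running process and killing it once the limit is exceeded, can be sketched independently of Galaxy. The helper below is illustrative only: run_with_timelimit is a hypothetical function that does not reproduce __poll_if_needed(), and it uses the Python interpreter as a portable stand-in for sleep.

```python
import datetime
import subprocess
import sys
import time
from typing import List, Tuple


def run_with_timelimit(
    argv: List[str], timelimit_seconds: float, poll_interval: float = 0.1
) -> Tuple[int, bool]:
    """Run argv and kill it if it runs longer than timelimit_seconds.

    A timelimit of zero disables enforcement. Returns (return_code, killed).
    """
    limit = datetime.timedelta(seconds=timelimit_seconds)
    start = datetime.datetime.now()
    proc = subprocess.Popen(argv)
    while proc.poll() is None:
        runtime = datetime.datetime.now() - start
        if timelimit_seconds > 0 and runtime > limit:
            proc.kill()
            proc.wait()  # reap the process so returncode is set
            return proc.returncode, True
        time.sleep(poll_interval)
    return proc.returncode, False


if __name__ == "__main__":
    # Analogous to "sleep 15" with a 3s limit, scaled down for a quick run.
    sleeper = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timelimit(sleeper, 0.5))
```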

Commit 3: Integration test with TPV (optional)

The timelimit resource requirement is automatically available to TPV through tool.resource_requirements - no explicit wiring needed. But an integration test confirming TPV can read it would be good.

Files to modify:

  1. test/integration/test_user_defined_tool_job_conf.py
    • Add TOOL_WITH_TIMELIMIT_SPECIFICATION constant following the pattern of TOOL_WITH_RESOURCE_SPECIFICATION (~line 18)
    • Verify TPV receives it (similar to existing test_user_defined_applies_resource_requirements test for cores_min)

Note: This commit is optional if TPV doesn't yet have a {timelimit} template variable. The important thing is that timelimit appears in tool.resource_requirements which TPV already iterates.

Commit 4: Wire CWL ToolTimeLimit into Galaxy resource requirements

Files to modify:

  1. lib/galaxy/tool_util/cwl/parser.py

    • Add timelimit_requirements() method on the tool proxy:
      def timelimit_requirements(self) -> List:
          return self.hints_or_requirements_of_class("ToolTimeLimit")
  2. lib/galaxy/tool_util/parser/cwl.py

    • In parse_requirements(), extract ToolTimeLimit and add it to resource_requirements list:
      for tl in self.tool_proxy.timelimit_requirements():
          timelimit_value = tl.get("timelimit")
          if timelimit_value is not None:
              resource_requirements.append({"type": "resource", "timelimit": timelimit_value})
    • Pass these through to parse_requirements_from_lists(resource_requirements=...). Each item is converted to a ResourceRequirement object, provided Commit 1 makes "timelimit" a valid key for Galaxy-format type: resource items. Note that membership in the ResourceType literal alone does not add the key to cwl_to_galaxy.values().
  3. CWL conformance test expectations

    • timelimit_basic: converts from false-green to true-green. Job actually killed by timeout after 3s. The broad exception handler at populators.py:3191 still catches it, but now it's the right exception.
    • timelimit_invalid (negative): fails at CWL schema validation before Galaxy. No change.
    • timelimit_zero_unlimited: zero = no limit. check_limits() skips enforcement when timelimit_seconds <= 0. Job completes successfully.
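    • timelimit_from_expression: $(1+2) expression. ResourceRequirement marks runtime_required=True, and get_value() raises NotImplementedError. Acceptable failure mode - expression evaluation is a pre-existing TODO across all resource requirements. Note that the extraction loop in Commit 2 skips requirements with runtime_required set, so on that path tool.timelimit stays None and the limit is simply not enforced rather than an error being raised.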
    • timelimit_expressiontool: ExpressionTools don't go through the job runner, so timelimit enforcement doesn't apply. Passes as expected.

Red-to-green test: Add unit test that loads a CWL tool with ToolTimeLimit and verifies it appears in parse_requirements() output as a ResourceRequirement with resource_type="timelimit".
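
The conversion step in parse_requirements() amounts to filtering ToolTimeLimit entries and re-expressing them as Galaxy-format resource items. A minimal standalone sketch follows. timelimit_resource_items is a hypothetical helper that takes a plain list of hint/requirement dicts in place of the tool proxy.

```python
from typing import Any, Dict, List


def timelimit_resource_items(hints_and_requirements: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Convert CWL ToolTimeLimit entries into Galaxy-format resource items."""
    items: List[Dict[str, Any]] = []
    for entry in hints_and_requirements:
        if entry.get("class") != "ToolTimeLimit":
            continue
        value = entry.get("timelimit")
        if value is not None:
            # The value may be a number of seconds or an expression string.
            items.append({"type": "resource", "timelimit": value})
    return items


cwl_entries = [
    {"class": "DockerRequirement", "dockerPull": "busybox"},
    {"class": "ToolTimeLimit", "timelimit": 3},
]
print(timelimit_resource_items(cwl_entries))
```

A unit test along these lines would then assert that the resulting item is turned into a ResourceRequirement with resource_type="timelimit".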

Commit Isolation Strategy

master
  |
  +-- Branch: tool_timelimit (commits 1-3, non-CWL PR)
  |     |-- Commit 1: Add timelimit resource type
  |     |-- Commit 2: Local runner timelimit enforcement
  |     +-- Commit 3: TPV integration test (optional)
  |
  +-- Branch: cwl_tool_state (commit 4, CWL PR, depends on tool_timelimit)
        |-- ... existing CWL work ...
        +-- Commit 4: Wire CWL ToolTimeLimit

Commits 1-3 can be branched off independently. Commit 4 depends on commits 1-3.

Resolved Decisions

  1. Units: Seconds (matches CWL spec). No HH:MM:SS support - that's a global walltime format concern.
  2. Negative values: The CWL spec treats a negative value as a schema validation error (timelimit2.cwl). Galaxy should skip enforcement for values <= 0.
  3. No timelimit_max variant: CWL has single timelimit field. Just timelimit.
  4. Precedence: Stricter of global walltime and per-tool timelimit wins. Both are checked independently in check_limits().
  5. Zero = no limit: Per CWL spec (timelimit3.cwl conformance test). check_limits() skips when timelimit_seconds <= 0.
  6. Runner state: New TOOL_TIMELIMIT_REACHED state, distinct from WALLTIME_REACHED/GLOBAL_WALLTIME_REACHED.
  7. Expression timelimit: Not implemented (pre-existing TODO for all resource requirement expressions). NotImplementedError is acceptable failure mode.

Unresolved Questions

  1. Should the TPV integration test (Commit 3) be deferred if TPV doesn't have timelimit template support yet?