@kabouzeid
Last active October 14, 2025 16:20
#!/usr/bin/env bash
set -euo pipefail
signal_workers () {
    # Forward the signal to all child processes of torchrun, which are the actual workers.
    while IFS= read -r pid; do
        echo "Sending SIG$1 to $pid"
        # A worker may exit between pgrep and kill; do not let set -e abort the handler.
        kill -"$1" "$pid" || true
    done < <(pgrep -P "$trpid")
    # Wait for torchrun to finish, then exit with its exit code
    # instead of the default 128 + signal_number.
    wait "$trpid"
    exit $?
}
trap 'signal_workers USR1' USR1 # for use with Slurm's --signal option, e.g. --signal=SIGUSR1@60
torchrun \
    --nproc-per-node="$SLURM_GPUS_ON_NODE" \
    --nnodes="$SLURM_JOB_NUM_NODES" \
    --node_rank="$SLURM_NODEID" \
    --master_addr="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)" \
    --master_port=$((10000 + 10#$(echo -n "$SLURM_JOBID" | tail -c 4))) \
    "$@" &
trpid=$!
wait "$trpid"
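
The --master_port expression above takes the last four characters of the job ID and adds them to 10000, so every node in the same job computes the same port. The 10# prefix forces base 10, since Bash would otherwise read a value with a leading zero such as 0008 as octal and fail. A minimal sketch of that arithmetic in isolation, where job_port is a hypothetical helper name and the job IDs are made-up sample values:

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of the script above): derive a port
# from the last four characters of a numeric job ID.
job_port () {
    local last4
    last4=$(printf '%s' "$1" | tail -c 4)
    # 10# forces base-10 so that e.g. "0008" is not parsed as octal.
    echo $((10000 + 10#$last4))
}

job_port 1234567   # prints 14567
job_port 420008    # prints 10008
```

The resulting ports fall in the range 10000 to 19999. Two jobs whose IDs share the same last four digits would compute the same port.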
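
For context, here is a sketch of how a wrapper like this might be submitted, assuming one task per node so that each node runs a single torchrun. The file names torchrun_slurm.sh (this script) and train.py (the training script) are placeholders, and the node count, GPU count, and time limit are arbitrary sample values:

```shell
#!/usr/bin/env bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1   # one wrapper (and one torchrun) per node
#SBATCH --gpus-per-node=4
#SBATCH --time=01:00:00
#SBATCH --signal=SIGUSR1@60   # request SIGUSR1 about 60 seconds before the time limit

# torchrun_slurm.sh and train.py are assumed placeholder names.
srun ./torchrun_slurm.sh train.py
```

Everything after the wrapper's name is passed through to torchrun via "$@". The training script would need its own SIGUSR1 handler (for example, to write a checkpoint) for the forwarded signal to have any effect.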