@b1tg
Last active November 18, 2025 15:43
mi350 RuntimeError: Wait timeout
[2771223.012041] amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32779)
[2771223.023037] amdgpu 0000:05:00.0: amdgpu: for process python3 pid 3895854 thread python3 pid 3895854)
[2771223.033665] amdgpu 0000:05:00.0: amdgpu: in page starting at address 0x0000ffffffbfe000 from IH client 0x1b (UTCL2)
[2771223.045852] amdgpu 0000:05:00.0: amdgpu: cookie node_id 2 fault from die AID0.XCD1
[2771223.054826] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x003012B1
[2771223.063501] amdgpu 0000:05:00.0: amdgpu: Faulty UTCL2 client ID: SQC (inst) (0x9)
[2771223.072377] amdgpu 0000:05:00.0: amdgpu: MORE_FAULTS: 0x1
[2771223.078908] amdgpu 0000:05:00.0: amdgpu: WALKER_ERROR: 0x0
[2771223.085538] amdgpu 0000:05:00.0: amdgpu: PERMISSION_FAULTS: 0xb
[2771223.092651] amdgpu 0000:05:00.0: amdgpu: MAPPING_ERROR: 0x0
[2771223.099372] amdgpu 0000:05:00.0: amdgpu: RW: 0x0
[2771227.011513] amdgpu 0000:05:00.0: amdgpu: XCC 0: Queue preemption failed for queue with doorbell_id: 80006000
[2771227.022950] amdgpu 0000:05:00.0: amdgpu: XCC 1: Queue preemption failed for queue with doorbell_id: 80006000
[2771227.034277] amdgpu 0000:05:00.0: amdgpu: XCC 2: Queue preemption failed for queue with doorbell_id: 80006000
[2771227.045610] amdgpu 0000:05:00.0: amdgpu: XCC 3: Queue preemption failed for queue with doorbell_id: 80006000
[2771227.056938] amdgpu 0000:05:00.0: amdgpu: XCC 4: Queue preemption failed for queue with doorbell_id: 80006000
[2771227.068263] amdgpu 0000:05:00.0: amdgpu: XCC 5: Queue preemption failed for queue with doorbell_id: 80006000
[2771227.079591] amdgpu 0000:05:00.0: amdgpu: XCC 6: Queue preemption failed for queue with doorbell_id: 80006000
[2771227.090913] amdgpu 0000:05:00.0: amdgpu: XCC 7: Queue preemption failed for queue with doorbell_id: 80006000
[2771227.105113] amdgpu 0000:05:00.0: amdgpu: queue id 0x2 at pasid 3895854 is reset
[2771227.114463] amdgpu 0000:05:00.0: amdgpu: queue id 0x2 at pasid 3895854 is reset
[2771227.124110] amdgpu 0000:05:00.0: amdgpu: queue id 0x2 at pasid 3895854 is reset
[2771227.133481] amdgpu 0000:05:00.0: amdgpu: queue id 0x2 at pasid 3895854 is reset
[2771227.143135] amdgpu 0000:05:00.0: amdgpu: queue id 0x2 at pasid 3895854 is reset
[2771227.152462] amdgpu 0000:05:00.0: amdgpu: queue id 0x2 at pasid 3895854 is reset
[2771227.162089] amdgpu 0000:05:00.0: amdgpu: queue id 0x2 at pasid 3895854 is reset
[2771227.171482] amdgpu 0000:05:00.0: amdgpu: queue id 0x2 at pasid 3895854 is reset
[2771227.180039] amdgpu 0000:05:00.0: amdgpu: Queues reset on process python3 tid 3895854 thread python3 pid 3895854
[2771227.191741] amdgpu 0000:05:00.0: amdgpu: Queues reset on process python3 tid 3888885 thread python3 pid 3888885
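The VM_L2_PROTECTION_FAULT_STATUS value above (0x003012B1) can be unpacked into the individual fields the driver prints. A minimal sketch, assuming the bit layout used by the upstream amdgpu GMC v9 code (the field offsets are an assumption; verify against your kernel's register headers):

```python
# Hypothetical decoder for the VM_L2_PROTECTION_FAULT_STATUS value in the
# dmesg above. Bit layout assumed from the upstream amdgpu gmc_v9 code.
STATUS = 0x003012B1  # value reported in the log

fields = {
    "MORE_FAULTS":       (0, 1),   # bit 0
    "WALKER_ERROR":      (1, 3),   # bits 1-3
    "PERMISSION_FAULTS": (4, 4),   # bits 4-7
    "MAPPING_ERROR":     (8, 1),   # bit 8
    "CID":               (9, 9),   # bits 9-17 (UTCL2 client id)
    "RW":                (18, 1),  # bit 18
}

decoded = {name: (STATUS >> shift) & ((1 << width) - 1)
           for name, (shift, width) in fields.items()}
for name, val in decoded.items():
    print(f"{name}: {val:#x}")
```

Under that layout the decoded fields match what the driver printed: MORE_FAULTS 0x1, WALKER_ERROR 0x0, PERMISSION_FAULTS 0xb, MAPPING_ERROR 0x0, RW 0x0, and CID 0x9, which is the SQC client named in the "Faulty UTCL2 client ID" line.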
(.venv) (master) ~/tinygrad$ HCQDEV_WAIT_TIMEOUT_MS=300000 DEFAULT_FLOAT=HALF AMD_LLVM=0 BASEDIR="/raid/datasets/wiki" RUNMLPERF=1 PYTHONPATH=. BENCHMARK=10 GPUS=2 BS=128 MODEL=bert python3 examples/mlperf/model_train.py
training bert
training on ['AMD:0', 'AMD:1']
Total parameters: 367.480636M
HParam: "GPUS": ['AMD:0', 'AMD:1']
HParam: "seed": 12345
HParam: "BS": 128
HParam: "GRADIENT_ACC_STEPS": 1
HParam: "GLOBAL_BATCH_SIZE": 128
HParam: "EVAL_BS": 2
HParam: "OPT_BASE_LEARNING_RATE": 0.000202072594216369
HParam: "OPT_LAMB_BETA_1": 0.9
HParam: "OPT_LAMB_BETA_2": 0.999
HParam: "TRAIN_STEPS": 28125
HParam: "NUM_WARMUP_STEPS": 1
HParam: "MAX_EVAL_STEPS": 5000
HParam: "EVAL_STEP_FREQ": 1171
HParam: "SAVE_CKPT_FREQ": 1000
HParam: "KEEP_CKPT_AMOUNT": 5
HParam: "SAVE_CKPT_DIR": ./ckpts
HParam: "INIT_CKPT_DIR": /raid/datasets/wiki
HParam: "LOSS_SCALER": 2048.0
HParam: "DECAY": 0.01
HParam: "EPSILON": 1e-06
HParam: "POLY_POWER": 1.0
HParam: "DEFAULT_FLOAT": half
HParam: "DISABLE_DROPOUT": 0
HParam: "TRAIN_BEAM": 0
HParam: "EVAL_BEAM": 0
training with global batch size 128 for one epoch with 28125 steps
0 42625.96 ms run, 42609.51 ms python, 1.00 ms fetch data, 15.45 ms AMD * 2, 6.78 loss, 0.000202 LR, 10.76 GB used, 5747.66 GFLOPS
1 36893.95 ms run, 36871.53 ms python, 1.23 ms fetch data, 21.18 ms AMD * 2, 6.33 loss, 0.000202 LR, 172.72 GB used, 6640.64 GFLOPS
Traceback (most recent call last):
File "/home/b1tg/tinygrad/tinygrad/runtime/support/hcq.py", line 390, in synchronize
try: self.timeline_signal.wait(self.timeline_value - 1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/b1tg/tinygrad/tinygrad/runtime/support/hcq.py", line 262, in wait
if not_passed and self.value < value: raise RuntimeError(f"Wait timeout: {timeout} ms! (the signal is not set to {value}, but {self.value})")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Wait timeout: 300000 ms! (the signal is not set to 16518, but 16515)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/b1tg/tinygrad/examples/mlperf/model_train.py", line 1648, in <module>
with Profiling(enabled=getenv("PYPROFILE")): globals()[nm]()
^^^^^^^^^^^^^^^
File "/home/b1tg/tinygrad/examples/mlperf/model_train.py", line 1160, in train_bert
loss = loss.item()
^^^^^^^^^^^
File "/home/b1tg/tinygrad/tinygrad/tensor.py", line 4227, in _wrapper
ret = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/b1tg/tinygrad/tinygrad/tensor.py", line 339, in item
return self.data()[(0,) * len(self.shape)]
^^^^^^^^^^^
File "/home/b1tg/tinygrad/tinygrad/tensor.py", line 4202, in _wrapper
if _METADATA.get() is not None: return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/b1tg/tinygrad/tinygrad/tensor.py", line 327, in data
return self._buffer().as_typed_buffer(self.shape)
^^^^^^^^^^^^^^
File "/home/b1tg/tinygrad/tinygrad/tensor.py", line 4202, in _wrapper
if _METADATA.get() is not None: return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/b1tg/tinygrad/tinygrad/tensor.py", line 313, in _buffer
return cast(Buffer, x.realize().uop.base.buffer).ensure_allocated()
^^^^^^^^^^^
File "/home/b1tg/tinygrad/tinygrad/tensor.py", line 4202, in _wrapper
if _METADATA.get() is not None: return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/b1tg/tinygrad/tinygrad/tensor.py", line 276, in realize
run_schedule(*Tensor.schedule_with_vars(*to_realize), do_update_stats=do_update_stats)
File "/home/b1tg/tinygrad/tinygrad/engine/realize.py", line 257, in run_schedule
ei.run(var_vals, do_update_stats=do_update_stats)
File "/home/b1tg/tinygrad/tinygrad/engine/realize.py", line 192, in run
et = self.prg(bufs, var_vals, wait=wait or DEBUG >= 2)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/b1tg/tinygrad/tinygrad/engine/realize.py", line 153, in __call__
self.copy(dest, src)
File "/home/b1tg/tinygrad/tinygrad/engine/realize.py", line 148, in copy
dest.copyin(src.as_buffer(allow_zero_copy=True)) # may allocate a CPU buffer depending on allow_zero_copy
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/b1tg/tinygrad/tinygrad/device.py", line 179, in as_buffer
return self.copyout(memoryview(bytearray(self.nbytes)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/b1tg/tinygrad/tinygrad/device.py", line 198, in copyout
self.allocator._copyout(mv, self._buf)
File "/home/b1tg/tinygrad/tinygrad/runtime/support/hcq.py", line 531, in _copyout
self.dev.synchronize()
File "/home/b1tg/tinygrad/tinygrad/runtime/support/hcq.py", line 393, in synchronize
if hasattr(self, 'on_device_hang'): self.on_device_hang()
^^^^^^^^^^^^^^^^^^^^^
File "/home/b1tg/tinygrad/tinygrad/runtime/ops_amd.py", line 1000, in on_device_hang
def on_device_hang(self): self.iface.on_device_hang()
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/b1tg/tinygrad/tinygrad/runtime/ops_amd.py", line 779, in on_device_hang
raise RuntimeError("\n".join(report))
RuntimeError: MMU fault: 0xFFFFFFBFE000 | NotPresent=0 ReadOnly=0 NoExecute=0 imprecise=0
HW fault: reset_type=0 reset_cause=0 memory_lost=1 gpu_id=54552
^CException ignored in atexit callback: <bound method finalize._exitfunc of <class 'weakref.finalize'>>
Traceback (most recent call last):
File "/usr/lib/python3.12/weakref.py", line 666, in _exitfunc
f()
File "/usr/lib/python3.12/weakref.py", line 590, in __call__
return info.func(*info.args, **(info.kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/b1tg/tinygrad/tinygrad/runtime/support/hcq.py", line 304, in _fini
def _fini(dev, buf, spec): dev.allocator.free(buf, buf.size, spec)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/b1tg/tinygrad/tinygrad/device.py", line 262, in free
else: super().free(opaque, size, options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/b1tg/tinygrad/tinygrad/device.py", line 231, in free
self._free(opaque, options if options is not None else self.default_buffer_spec)
File "/home/b1tg/tinygrad/tinygrad/helpers.py", line 112, in wrapper
try: return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/b1tg/tinygrad/tinygrad/runtime/ops_amd.py", line 622, in _free
self.dev.synchronize()
File "/home/b1tg/tinygrad/tinygrad/runtime/support/hcq.py", line 390, in synchronize
try: self.timeline_signal.wait(self.timeline_value - 1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/b1tg/tinygrad/tinygrad/runtime/support/hcq.py", line 260, in wait
self._sleep(time_spent)
File "/home/b1tg/tinygrad/tinygrad/runtime/ops_amd.py", line 44, in _sleep
if time_spent_waiting_ms > 2000 and self.is_timeline and self.owner is not None: self.owner.iface.sleep(200)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/b1tg/tinygrad/tinygrad/runtime/ops_amd.py", line 766, in sleep
def sleep(self, tm:int): kfd.AMDKFD_IOC_WAIT_EVENTS(KFDIface.kfd, events_ptr=self.queue_event_arr_ptr, num_events=1, wait_for_all=1, timeout=tm)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/b1tg/tinygrad/tinygrad/runtime/autogen/kfd.py", line 17, in _do_ioctl
ret = __fd.ioctl((__idir<<30) | (ctypes.sizeof(made := __user_struct(**kwargs))<<16) | (__base<<8) | __nr, made)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/b1tg/tinygrad/tinygrad/runtime/support/hcq.py", line 30, in ioctl
def ioctl(self, request, arg): return fcntl.ioctl(self.fd, request, arg)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt:
^CException ignored in atexit callback: <function <lambda> at 0x702c27b98360>
Traceback (most recent call last):
File "/home/b1tg/tinygrad/tinygrad/device.py", line 51, in <lambda>
atexit.register(lambda: [Device[dn].finalize() for dn in Device._opened_devices])
^^^^^^^^^^^^^^^^^^^^^
File "/home/b1tg/tinygrad/tinygrad/runtime/support/hcq.py", line 448, in finalize
try: self.synchronize() # Try to finalize device in any case.
^^^^^^^^^^^^^^^^^^
File "/home/b1tg/tinygrad/tinygrad/runtime/support/hcq.py", line 390, in synchronize
try: self.timeline_signal.wait(self.timeline_value - 1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/b1tg/tinygrad/tinygrad/runtime/support/hcq.py", line 259, in wait
while (not_passed:=(prev_value:=self.value) < value) and (time_spent:=int(time.perf_counter() * 1000) - start_time) < timeout:
^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt:
^C
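For context on the "Wait timeout" itself: the HCQ runtime blocks on a monotonically increasing timeline signal that the GPU bumps as work completes, and the error above means the GPU stopped at 16515 while the host was waiting for 16518. A minimal sketch of that wait pattern (not tinygrad's actual implementation; the class name and timings are illustrative):

```python
import time

class TimelineSignal:
    """Toy model of a timeline signal: the GPU writes ever-increasing values."""
    def __init__(self): self.value = 0  # written by the device on completion

    def wait(self, value: int, timeout_ms: int = 300_000):
        # Spin until the device passes `value`, or give up after timeout_ms
        # (tinygrad's timeout is controlled by HCQDEV_WAIT_TIMEOUT_MS).
        start = time.perf_counter()
        while self.value < value:
            if (time.perf_counter() - start) * 1000 >= timeout_ms:
                raise RuntimeError(f"Wait timeout: {timeout_ms} ms! "
                                   f"(the signal is not set to {value}, but {self.value})")
            time.sleep(0.001)

sig = TimelineSignal()
sig.value = 16515               # last value the hung GPU managed to write
err = None
try:
    sig.wait(16518, timeout_ms=10)  # host expects 16518, as in the log
except RuntimeError as e:
    err = e
print(err)
```

When the device hangs mid-kernel (as after the MMU fault here), the signal never advances past the last completed packet, so every later synchronize (including the atexit finalizers in the traceback) hits the same timeout.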
b1tg commented Nov 11, 2025

reproduce command on tinyamd4:

  • RUNMLPERF needs to be 1
  • AMD_LLVM=1 also triggers the bug, but AMD_LLVM=0 makes it easier to trigger

HCQDEV_WAIT_TIMEOUT_MS=300000 DEFAULT_FLOAT=HALF AMD_LLVM=0 BASEDIR="/raid/datasets/wiki" RUNMLPERF=1 PYTHONPATH=. BENCHMARK=10 GPUS=2 BS=128 MODEL=bert python3 examples/mlperf/model_train.py
