Last active November 18, 2025 15:43
mi350 RuntimeError: Wait timeout
| [2771223.012041] amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32779) | |
| [2771223.023037] amdgpu 0000:05:00.0: amdgpu: for process python3 pid 3895854 thread python3 pid 3895854) | |
| [2771223.033665] amdgpu 0000:05:00.0: amdgpu: in page starting at address 0x0000ffffffbfe000 from IH client 0x1b (UTCL2) | |
| [2771223.045852] amdgpu 0000:05:00.0: amdgpu: cookie node_id 2 fault from die AID0.XCD1 | |
| [2771223.054826] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x003012B1 | |
| [2771223.063501] amdgpu 0000:05:00.0: amdgpu: Faulty UTCL2 client ID: SQC (inst) (0x9) | |
| [2771223.072377] amdgpu 0000:05:00.0: amdgpu: MORE_FAULTS: 0x1 | |
| [2771223.078908] amdgpu 0000:05:00.0: amdgpu: WALKER_ERROR: 0x0 | |
| [2771223.085538] amdgpu 0000:05:00.0: amdgpu: PERMISSION_FAULTS: 0xb | |
| [2771223.092651] amdgpu 0000:05:00.0: amdgpu: MAPPING_ERROR: 0x0 | |
| [2771223.099372] amdgpu 0000:05:00.0: amdgpu: RW: 0x0 | |
| [2771227.011513] amdgpu 0000:05:00.0: amdgpu: XCC 0: Queue preemption failed for queue with doorbell_id: 80006000 | |
| [2771227.022950] amdgpu 0000:05:00.0: amdgpu: XCC 1: Queue preemption failed for queue with doorbell_id: 80006000 | |
| [2771227.034277] amdgpu 0000:05:00.0: amdgpu: XCC 2: Queue preemption failed for queue with doorbell_id: 80006000 | |
| [2771227.045610] amdgpu 0000:05:00.0: amdgpu: XCC 3: Queue preemption failed for queue with doorbell_id: 80006000 | |
| [2771227.056938] amdgpu 0000:05:00.0: amdgpu: XCC 4: Queue preemption failed for queue with doorbell_id: 80006000 | |
| [2771227.068263] amdgpu 0000:05:00.0: amdgpu: XCC 5: Queue preemption failed for queue with doorbell_id: 80006000 | |
| [2771227.079591] amdgpu 0000:05:00.0: amdgpu: XCC 6: Queue preemption failed for queue with doorbell_id: 80006000 | |
| [2771227.090913] amdgpu 0000:05:00.0: amdgpu: XCC 7: Queue preemption failed for queue with doorbell_id: 80006000 | |
| [2771227.105113] amdgpu 0000:05:00.0: amdgpu: queue id 0x2 at pasid 3895854 is reset | |
| [2771227.114463] amdgpu 0000:05:00.0: amdgpu: queue id 0x2 at pasid 3895854 is reset | |
| [2771227.124110] amdgpu 0000:05:00.0: amdgpu: queue id 0x2 at pasid 3895854 is reset | |
| [2771227.133481] amdgpu 0000:05:00.0: amdgpu: queue id 0x2 at pasid 3895854 is reset | |
| [2771227.143135] amdgpu 0000:05:00.0: amdgpu: queue id 0x2 at pasid 3895854 is reset | |
| [2771227.152462] amdgpu 0000:05:00.0: amdgpu: queue id 0x2 at pasid 3895854 is reset | |
| [2771227.162089] amdgpu 0000:05:00.0: amdgpu: queue id 0x2 at pasid 3895854 is reset | |
| [2771227.171482] amdgpu 0000:05:00.0: amdgpu: queue id 0x2 at pasid 3895854 is reset | |
| [2771227.180039] amdgpu 0000:05:00.0: amdgpu: Queues reset on process python3 tid 3895854 thread python3 pid 3895854 | |
| [2771227.191741] amdgpu 0000:05:00.0: amdgpu: Queues reset on process python3 tid 3888885 thread python3 pid 3888885 |
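The individual fields the driver prints after `VM_L2_PROTECTION_FAULT_STATUS:0x003012B1` can be recovered from the raw register value. The following is a minimal sketch, assuming a gfx9-style bit layout for this register; the layout table is an assumption and is not taken from this log, although the values it yields agree with the `MORE_FAULTS`, `WALKER_ERROR`, `PERMISSION_FAULTS`, `MAPPING_ERROR` and `RW` lines above, with the client ID `0x9`, and with `vmid:3` in the first line. The names `FIELDS` and `decode_fault_status` are hypothetical.

```python
# Assumed gfx9-style layout of VM_L2_PROTECTION_FAULT_STATUS.
# Each entry is (bit offset, width in bits). This table is an assumption,
# checked only against the values printed in the dmesg output above.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),    # UTCL2 client ID
    "RW": (18, 1),
    "VMID": (20, 4),
}

def decode_fault_status(status: int) -> dict[str, int]:
    """Extract every field in FIELDS from the raw register value."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }

decoded = decode_fault_status(0x003012B1)
for name, value in decoded.items():
    print(f"{name}: {value:#x}")
```

Decoding by hand like this is mainly useful when only the raw status word is available, for example in a truncated log.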
| (.venv) (master) ~/tinygrad$ HCQDEV_WAIT_TIMEOUT_MS=300000 DEFAULT_FLOAT=HALF AMD_LLVM=0 BASEDIR="/raid/datasets/wiki" RUNMLPERF=1 PYTHONPATH=. BENCHMARK=10 GPUS=2 BS=128 MODEL=bert python3 examples/mlperf/model_train.py | |
| training bert | |
| training on ['AMD:0', 'AMD:1'] | |
| Total parameters: 367.480636M | |
| HParam: "GPUS": ['AMD:0', 'AMD:1'] | |
| HParam: "seed": 12345 | |
| HParam: "BS": 128 | |
| HParam: "GRADIENT_ACC_STEPS": 1 | |
| HParam: "GLOBAL_BATCH_SIZE": 128 | |
| HParam: "EVAL_BS": 2 | |
| HParam: "OPT_BASE_LEARNING_RATE": 0.000202072594216369 | |
| HParam: "OPT_LAMB_BETA_1": 0.9 | |
| HParam: "OPT_LAMB_BETA_2": 0.999 | |
| HParam: "TRAIN_STEPS": 28125 | |
| HParam: "NUM_WARMUP_STEPS": 1 | |
| HParam: "MAX_EVAL_STEPS": 5000 | |
| HParam: "EVAL_STEP_FREQ": 1171 | |
| HParam: "SAVE_CKPT_FREQ": 1000 | |
| HParam: "KEEP_CKPT_AMOUNT": 5 | |
| HParam: "SAVE_CKPT_DIR": ./ckpts | |
| HParam: "INIT_CKPT_DIR": /raid/datasets/wiki | |
| HParam: "LOSS_SCALER": 2048.0 | |
| HParam: "DECAY": 0.01 | |
| HParam: "EPSILON": 1e-06 | |
| HParam: "POLY_POWER": 1.0 | |
| HParam: "DEFAULT_FLOAT": half | |
| HParam: "DISABLE_DROPOUT": 0 | |
| HParam: "TRAIN_BEAM": 0 | |
| HParam: "EVAL_BEAM": 0 | |
| training with global batch size 128 for one epoch with 28125 steps | |
| 0 42625.96 ms run, 42609.51 ms python, 1.00 ms fetch data, 15.45 ms AMD * 2, 6.78 loss, 0.000202 LR, 10.76 GB used, 5747.66 GFLOPS | |
| 1 36893.95 ms run, 36871.53 ms python, 1.23 ms fetch data, 21.18 ms AMD * 2, 6.33 loss, 0.000202 LR, 172.72 GB used, 6640.64 GFLOPS | |
| Traceback (most recent call last): | |
| File "/home/b1tg/tinygrad/tinygrad/runtime/support/hcq.py", line 390, in synchronize | |
| try: self.timeline_signal.wait(self.timeline_value - 1) | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/home/b1tg/tinygrad/tinygrad/runtime/support/hcq.py", line 262, in wait | |
| if not_passed and self.value < value: raise RuntimeError(f"Wait timeout: {timeout} ms! (the signal is not set to {value}, but {self.value})") | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| RuntimeError: Wait timeout: 300000 ms! (the signal is not set to 16518, but 16515) | |
| During handling of the above exception, another exception occurred: | |
| Traceback (most recent call last): | |
| File "/home/b1tg/tinygrad/examples/mlperf/model_train.py", line 1648, in <module> | |
| with Profiling(enabled=getenv("PYPROFILE")): globals()[nm]() | |
| ^^^^^^^^^^^^^^^ | |
| File "/home/b1tg/tinygrad/examples/mlperf/model_train.py", line 1160, in train_bert | |
| loss = loss.item() | |
| ^^^^^^^^^^^ | |
| File "/home/b1tg/tinygrad/tinygrad/tensor.py", line 4227, in _wrapper | |
| ret = fn(*args, **kwargs) | |
| ^^^^^^^^^^^^^^^^^^^ | |
| File "/home/b1tg/tinygrad/tinygrad/tensor.py", line 339, in item | |
| return self.data()[(0,) * len(self.shape)] | |
| ^^^^^^^^^^^ | |
| File "/home/b1tg/tinygrad/tinygrad/tensor.py", line 4202, in _wrapper | |
| if _METADATA.get() is not None: return fn(*args, **kwargs) | |
| ^^^^^^^^^^^^^^^^^^^ | |
| File "/home/b1tg/tinygrad/tinygrad/tensor.py", line 327, in data | |
| return self._buffer().as_typed_buffer(self.shape) | |
| ^^^^^^^^^^^^^^ | |
| File "/home/b1tg/tinygrad/tinygrad/tensor.py", line 4202, in _wrapper | |
| if _METADATA.get() is not None: return fn(*args, **kwargs) | |
| ^^^^^^^^^^^^^^^^^^^ | |
| File "/home/b1tg/tinygrad/tinygrad/tensor.py", line 313, in _buffer | |
| return cast(Buffer, x.realize().uop.base.buffer).ensure_allocated() | |
| ^^^^^^^^^^^ | |
| File "/home/b1tg/tinygrad/tinygrad/tensor.py", line 4202, in _wrapper | |
| if _METADATA.get() is not None: return fn(*args, **kwargs) | |
| ^^^^^^^^^^^^^^^^^^^ | |
| File "/home/b1tg/tinygrad/tinygrad/tensor.py", line 276, in realize | |
| run_schedule(*Tensor.schedule_with_vars(*to_realize), do_update_stats=do_update_stats) | |
| File "/home/b1tg/tinygrad/tinygrad/engine/realize.py", line 257, in run_schedule | |
| ei.run(var_vals, do_update_stats=do_update_stats) | |
| File "/home/b1tg/tinygrad/tinygrad/engine/realize.py", line 192, in run | |
| et = self.prg(bufs, var_vals, wait=wait or DEBUG >= 2) | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/home/b1tg/tinygrad/tinygrad/engine/realize.py", line 153, in __call__ | |
| self.copy(dest, src) | |
| File "/home/b1tg/tinygrad/tinygrad/engine/realize.py", line 148, in copy | |
| dest.copyin(src.as_buffer(allow_zero_copy=True)) # may allocate a CPU buffer depending on allow_zero_copy | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/home/b1tg/tinygrad/tinygrad/device.py", line 179, in as_buffer | |
| return self.copyout(memoryview(bytearray(self.nbytes))) | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/home/b1tg/tinygrad/tinygrad/device.py", line 198, in copyout | |
| self.allocator._copyout(mv, self._buf) | |
| File "/home/b1tg/tinygrad/tinygrad/runtime/support/hcq.py", line 531, in _copyout | |
| self.dev.synchronize() | |
| File "/home/b1tg/tinygrad/tinygrad/runtime/support/hcq.py", line 393, in synchronize | |
| if hasattr(self, 'on_device_hang'): self.on_device_hang() | |
| ^^^^^^^^^^^^^^^^^^^^^ | |
| File "/home/b1tg/tinygrad/tinygrad/runtime/ops_amd.py", line 1000, in on_device_hang | |
| def on_device_hang(self): self.iface.on_device_hang() | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/home/b1tg/tinygrad/tinygrad/runtime/ops_amd.py", line 779, in on_device_hang | |
| raise RuntimeError("\n".join(report)) | |
| RuntimeError: MMU fault: 0xFFFFFFBFE000 | NotPresent=0 ReadOnly=0 NoExecute=0 imprecise=0 | |
| HW fault: reset_type=0 reset_cause=0 memory_lost=1 gpu_id=54552 | |
| ^CException ignored in atexit callback: <bound method finalize._exitfunc of <class 'weakref.finalize'>> | |
| Traceback (most recent call last): | |
| File "/usr/lib/python3.12/weakref.py", line 666, in _exitfunc | |
| f() | |
| File "/usr/lib/python3.12/weakref.py", line 590, in __call__ | |
| return info.func(*info.args, **(info.kwargs or {})) | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/home/b1tg/tinygrad/tinygrad/runtime/support/hcq.py", line 304, in _fini | |
| def _fini(dev, buf, spec): dev.allocator.free(buf, buf.size, spec) | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/home/b1tg/tinygrad/tinygrad/device.py", line 262, in free | |
| else: super().free(opaque, size, options) | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/home/b1tg/tinygrad/tinygrad/device.py", line 231, in free | |
| self._free(opaque, options if options is not None else self.default_buffer_spec) | |
| File "/home/b1tg/tinygrad/tinygrad/helpers.py", line 112, in wrapper | |
| try: return func(*args, **kwargs) | |
| ^^^^^^^^^^^^^^^^^^^^^ | |
| File "/home/b1tg/tinygrad/tinygrad/runtime/ops_amd.py", line 622, in _free | |
| self.dev.synchronize() | |
| File "/home/b1tg/tinygrad/tinygrad/runtime/support/hcq.py", line 390, in synchronize | |
| try: self.timeline_signal.wait(self.timeline_value - 1) | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/home/b1tg/tinygrad/tinygrad/runtime/support/hcq.py", line 260, in wait | |
| self._sleep(time_spent) | |
| File "/home/b1tg/tinygrad/tinygrad/runtime/ops_amd.py", line 44, in _sleep | |
| if time_spent_waiting_ms > 2000 and self.is_timeline and self.owner is not None: self.owner.iface.sleep(200) | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/home/b1tg/tinygrad/tinygrad/runtime/ops_amd.py", line 766, in sleep | |
| def sleep(self, tm:int): kfd.AMDKFD_IOC_WAIT_EVENTS(KFDIface.kfd, events_ptr=self.queue_event_arr_ptr, num_events=1, wait_for_all=1, timeout=tm) | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/home/b1tg/tinygrad/tinygrad/runtime/autogen/kfd.py", line 17, in _do_ioctl | |
| ret = __fd.ioctl((__idir<<30) | (ctypes.sizeof(made := __user_struct(**kwargs))<<16) | (__base<<8) | __nr, made) | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/home/b1tg/tinygrad/tinygrad/runtime/support/hcq.py", line 30, in ioctl | |
| def ioctl(self, request, arg): return fcntl.ioctl(self.fd, request, arg) | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| KeyboardInterrupt: | |
| ^CException ignored in atexit callback: <function <lambda> at 0x702c27b98360> | |
| Traceback (most recent call last): | |
| File "/home/b1tg/tinygrad/tinygrad/device.py", line 51, in <lambda> | |
| atexit.register(lambda: [Device[dn].finalize() for dn in Device._opened_devices]) | |
| ^^^^^^^^^^^^^^^^^^^^^ | |
| File "/home/b1tg/tinygrad/tinygrad/runtime/support/hcq.py", line 448, in finalize | |
| try: self.synchronize() # Try to finalize device in any case. | |
| ^^^^^^^^^^^^^^^^^^ | |
| File "/home/b1tg/tinygrad/tinygrad/runtime/support/hcq.py", line 390, in synchronize | |
| try: self.timeline_signal.wait(self.timeline_value - 1) | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/home/b1tg/tinygrad/tinygrad/runtime/support/hcq.py", line 259, in wait | |
| while (not_passed:=(prev_value:=self.value) < value) and (time_spent:=int(time.perf_counter() * 1000) - start_time) < timeout: | |
| ^^^^^^^^^^^^^^^^^^^ | |
| KeyboardInterrupt: | |
| ^C |
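The `Wait timeout` error in the first traceback comes from a polling wait on a timeline signal: the host waits for the signal value to reach a target and gives up once the timeout expires. The following is a minimal, self-contained sketch of that pattern, loosely modelled on the `hcq.py` lines visible in the traceback. It is not tinygrad's implementation: `FakeSignal` is a hypothetical stand-in, and the 20 ms timeout is chosen only to keep the example fast.

```python
import time

class FakeSignal:
    """Hypothetical stand-in for a device timeline signal (not a tinygrad class)."""

    def __init__(self, value: int = 0):
        self.value = value

    def wait(self, value: int, timeout_ms: int = 100) -> None:
        """Poll until self.value >= value, or raise once timeout_ms has elapsed."""
        start = time.perf_counter()
        while self.value < value:
            elapsed_ms = (time.perf_counter() - start) * 1000
            if elapsed_ms >= timeout_ms:
                raise RuntimeError(
                    f"Wait timeout: {timeout_ms} ms! "
                    f"(the signal is not set to {value}, but {self.value})"
                )
            time.sleep(0.001)  # yield briefly instead of spinning

# A signal that never advances, mimicking a hung queue stuck at 16515.
stalled = FakeSignal(16515)
try:
    stalled.wait(16518, timeout_ms=20)
    message = None
except RuntimeError as err:
    message = str(err)
print(message)
```

In the log above the signal stopped at 16515 while the host waited for 16518, so the wait could never complete after the queue reset and only the timeout ended it.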
reproduce command on tinyamd4: