| Device | Flash | Download | SHA-256 Checksum |
|---|---|---|---|
| Pixel 6 (oriole) | Link | Link | 090145837d44224448311b65ec98c9af32890dd499ce64c3c657b5319a9643c9 |
| Pixel 6 Pro (raven) | Link | Link | b84d3368c74ad579c125cfc60597209236a730a2a7bbd776119d4dcc9008038f |
| Pixel 6a (bluejay) | Link | Link | 9ac60f986b386847878a3d531d1f6a75c43d8c7ccb900ccb5 |

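
The checksums in the table are meant to be compared against the SHA-256 digest of the downloaded file. A minimal sketch of that check, using only Python's standard `hashlib`, is below; the helper names `sha256_of_file` and `matches_checksum` are hypothetical and not part of any tool referenced here.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large images do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare the file's digest with a published value, ignoring case and surrounding whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A full SHA-256 digest is 64 hexadecimal characters, so a shorter published value cannot match and should be treated as incomplete rather than as a failed download.
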
This is Felix Kuehling, long-time KFD driver architect. I started looking into the TinyGrad source code yesterday, focusing on ops_kfd.py, ops_hsa.py and driver/hsa.py, to understand how TinyGrad talks to our HW and help with the ongoing debugging effort from the top down. This analysis is based on this commit: https://github.com/tinygrad/tinygrad/tree/3de855ea50d72238deac14fc05cda2a611497778
I'm intrigued by the use of Python for low-level programming. I think I can learn something from your use of ctypes and clang2py for fast prototyping and test development. I want to share some observations based on my initial review.
ops_kfd looks pretty new, and I see many problems with it based on my long experience working on KFD. I think it's interesting, but probably not relevant for the most pressing problems at hand, so I'll cover that last.
ops_hsa uses ROCr APIs to manage GPU memory, create a user-mode AQL queue for GPU kernel dispatch, perform async SDMA copies, and do signal-based synchronization with barrier packets.
| (tf) root@rocm:~/tmp# python benchmark.py | |
| 2023-10-14 15:02:22.116047: E external/local_xla/xla/stream_executor/plugin_registry.cc:93] Invalid plugin kind specified: DNN | |
| 2023-10-14 15:02:22.348480: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. | |
| To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. | |
| 2023-10-14 15:02:23.756833: I external/local_xla/xla/stream_executor/rocm/rocm_gpu_executor.cc:787] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero | |
| 2023-10-14 15:02:23.982269: I external/local_xla/xla/stream_executor/rocm/rocm_gpu_executor.cc:787] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero | |
| 2023-10-14 15:02:23.9823 |
Scalable Vector Extensions (SVE) is ARM’s latest SIMD extension to their instruction set, which was announced back in 2016. A follow-up SVE2 extension was announced in 2019, designed to incorporate all functionality from ARM’s current primary SIMD extension, NEON (aka ASIMD).
Despite being announced 5 years ago, there is currently no generally available CPU which supports any form of SVE (which excludes the [Fugaku supercomputer](https://www.fujitsu.com/global/about/innovation/
| % ./clpeak | |
| [mvk-info] MoltenVK version 1.1.5, supporting Vulkan version 1.1.189. | |
| The following 72 Vulkan extensions are supported: | |
| VK_KHR_16bit_storage v1 | |
| VK_KHR_8bit_storage v1 | |
| VK_KHR_bind_memory2 v1 | |
| VK_KHR_create_renderpass2 v1 | |
| VK_KHR_dedicated_allocation v3 | |
| VK_KHR_depth_stencil_resolve v1 | |
| VK_KHR_descriptor_update_template v1 |
| # IDA (disassembler) and Hex-Rays (decompiler) plugin for Apple AMX | |
| # | |
| # WIP research. (This was edited to add more info after someone posted it to | |
| # Hacker News. Click "Revisions" to see full changes.) | |
| # | |
| # Copyright (c) 2020 dougallj | |
| # Based on Python port of VMX intrinsics plugin: | |
| # Copyright (c) 2019 w4kfu - Synacktiv |
Here are easy steps to try Windows 10 on ARM or Ubuntu for ARM64 on your Apple Silicon Mac. Enjoy!
NOTE: This reflects the current state as of 10/1/2021.
- Install Xcode from App Store or install Command Line Tools on your Mac
| platform: 7.5 | |
| ext: 7p5 | |
| name: HSW | |
| 1 add add 0x40 Addition | |
| 0xfc0 u8 i8 u16 i16 u32 i32 , 0xfc0 u8 i8 u16 i16 u32 i32 | |
| 0x20000 f32 , 0xfc0 u8 i8 u16 i16 u32 i32 | |
| 0x20000 f32 , 0x20000 f32 | |
| 0x40000 f64 , 0x40000 f64 | |
| 3 addc addc 0x4e Addition with Carry | |
| 0x400 u32 , 0x400 u32 |