Lee Gao leegao

@leegao
leegao / fusion_5.txt
Created January 8, 2026 17:23
Annotated libtpu VLIW dump for the operation max(x[64,32] @ w[32,64], axis=1). Input in VMEM; weights in HBM (need DMA).
// Operation: max(x[64,32] @ w[32,64], axis=1), Input in VMEM. Weights in HBM (need DMA).
0x0 : { %v_const_neg_inf = vmov -inf // Init accumulator for Max reduction
;; %ptr_x_in = inlined_call_operand.vmem [shape: f32[64,32]] // operand 0: input x
;; %ptr_w_hbm = inlined_call_operand.hbm [shape: f32[32,64]] // operand 1: weights w (HBM)
;; %ptr_out_max = inlined_call_operand.vmem [shape: f32[64]] // operand 2: output
;; %ptr_out_full = inlined_call_operand.vmem [shape: f32[64,64]] // operand 3: scratch }
0x1 : { %6 = vst [vmem:[#allocation1] sm:$0xff] /*vst_source=*/%v_const_neg_inf } // for another kernel, #allocation1 is an addr of -\infty
// First Phase: get w out of HBM into vmem (the mxu and vpu can only use vmem)
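
For orientation, here is a minimal pure-Python reference for the computation the dump annotates, `max(x @ w, axis=1)`. It is only a sketch of the math, not of how libtpu schedules the work: the function name `matmul_rowmax` and the random test data are assumptions made for illustration. The one detail borrowed from the dump is the `-inf` initial value for the max accumulator.

```python
import random


def matmul_rowmax(x, w):
    """Reference for max(x @ w, axis=1) with x: [M][K] and w: [K][N].

    Returns a list of M values, one row maximum per row of x @ w.
    """
    n_inner = len(w)
    n_cols = len(w[0])
    out = []
    for row in x:
        best = float("-inf")  # max accumulator starts at -inf, as in the dump at 0x0
        for j in range(n_cols):
            # Dot product of one row of x with column j of w.
            acc = sum(row[k] * w[k][j] for k in range(n_inner))
            best = max(best, acc)
        out.append(best)
    return out


# Shapes taken from the gist description: x is 64x32, w is 32x64.
random.seed(0)
x = [[random.uniform(-1.0, 1.0) for _ in range(32)] for _ in range(64)]
w = [[random.uniform(-1.0, 1.0) for _ in range(64)] for _ in range(32)]
row_max = matmul_rowmax(x, w)
print(len(row_max))
```

The result has one entry per row of `x`, which matches the `f32[64]` shape of operand 2 in the dump.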
@leegao
leegao / compute.py
Created January 8, 2026 04:32
VLIW dump of mini_attention (softmax(x @ w1) @ w2)
!rm -rf compiler_dump
!rm compiler_dump.zip
import os
# Create dump directories
DUMP_ROOT = "compiler_dump/"
HLO_DUMP_PATH = os.path.join(DUMP_ROOT, "hlo")
LLO_DUMP_PATH = os.path.join(DUMP_ROOT, "llo")
@leegao
leegao / compute.py
Created January 8, 2026 04:11
TPU v5e softmax kernel
!rm -rf compiler_dump
!rm compiler_dump.zip
import os
# Create dump directories
DUMP_ROOT = "compiler_dump/"
HLO_DUMP_PATH = os.path.join(DUMP_ROOT, "hlo")
LLO_DUMP_PATH = os.path.join(DUMP_ROOT, "llo")
@leegao
leegao / Gauss.md
Created January 4, 2026 17:29
Numerical sensitivity + floating point roundoff errors for a 2013 computational physics course

Suppose little Gauss lived in the modern age. Little Gauss’ teacher wanted to surf the internet, so he assigned all of his students the following integral to evaluate:

$$ \int_0^1 x^{100} e^{x-1} dx $$

Being the clever alter-ego of the boy who immediately saw $\sum_{k=1}^n k = {n+1 \choose 2}$, little Gauss constructed a sequence

$$ s_k = \int_0^1 x^k e^{x-1} dx $$

and with a little bit of clever manipulation (integration by parts), he found that, using $u_k = x^k$, $dv = e^{x-1}dx$, $v = e^{x-1}$
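
The preview cuts off here, but carrying out the integration by parts with the stated $u_k$ and $v$ gives $s_k = \left[x^k e^{x-1}\right]_0^1 - k \int_0^1 x^{k-1} e^{x-1} dx = 1 - k\, s_{k-1}$, with $s_0 = 1 - e^{-1}$. The sketch below assumes this recurrence is the one the note goes on to study. The function names `s_forward` and `s_backward`, and the `extra=50` starting offset, are illustrative choices and not taken from the gist.

```python
import math


def s_forward(k):
    """Forward recurrence s_j = 1 - j * s_{j-1}, starting from s_0 = 1 - 1/e.

    Each step multiplies the inherited rounding error by j, so the error
    grows roughly like k! and swamps the true value for large k.
    """
    s = 1.0 - math.exp(-1.0)
    for j in range(1, k + 1):
        s = 1.0 - j * s
    return s


def s_backward(k, extra=50):
    """Backward recurrence s_{j-1} = (1 - s_j) / j.

    Seeded with the crude guess s_{k+extra} = 0; each step divides the
    inherited error by j, so the bad seed is quickly forgotten.
    """
    s = 0.0
    for j in range(k + extra, k, -1):
        s = (1.0 - s) / j
    return s


print("forward  s_100:", s_forward(100))
print("backward s_100:", s_backward(100))
```

Since $x^k e^{-1} \le x^k e^{x-1} \le x^k$ on $[0, 1]$, the true value satisfies $\frac{1}{e(k+1)} \le s_k \le \frac{1}{k+1}$, which gives a quick sanity check on either result.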

Caching ASTC Parameters to Disk

Intro

Here's a negative result around compressing and caching ASTC parameters to reduce transfer overhead.

Taking an Adreno 650 as our reference:

  1. GPU - 1.2 TFlops (1200 GFlops)
  2. CPU SIMD int8 - 18 GFlops
@leegao
leegao / dxvk 2.0 profile on Android
Created August 2, 2025 00:01
DXVK 2.0 Vulkan Requirements by Android drivers
[REQUIRED] depthBiasClamp / CORE10 (feature)
FL=9.0 - Required for D3D9 and D3D11. Allows clamping the depth bias value.
Qcom Turnip
+ unsupported (0/53): []
+ supported (53/53): ['23.0-26.708 (52/52)']
Qcom Proprietary
+ unsupported (0/163): []
+ supported (163/163): ['512.502-512.826 (162/162)']
Mali Proprietary
+ unsupported (0/67): []
@leegao
leegao / cmds.txt
Created July 21, 2025 13:14
ACC.exe spirv shader causing Mali G715 to fail CreateGraphicsPipeline
CreateFramebuffer
in: device: VkDevice (handle) = 0xb4000075a4e69010
in: pCreateInfo: VkFramebufferCreateInfo*
.flags: VkFramebufferCreateFlags = 0x0
.renderPass: VkRenderPass (handle) = 0xb400007454e8d0d0
.attachmentCount: uint32_t = 0x1
.pAttachments[0]: VkImageView* = 0xb40000746501b800
.width: uint32_t = 0x500
.height: uint32_t = 0x2d0
.layers: uint32_t = 0x1

From AI Studio analysis of https://github.com/google/angle/blob/6a04a50f98cac71b25464d10289ce7a013841caf/src/libANGLE/renderer/vulkan/vk_renderer.cpp#L4879

1. ARM / Mali

These workarounds apply to GPUs designed by ARM (Mali), found in chipsets like Samsung Exynos, Google Tensor, and MediaTek Dimensity.

| Feature / Workaround | mFeatures Flag | Condition / Driver Version | Reason & Impact |
| --- | --- | --- | --- |
| Protected Memory Restriction | supportsProtectedMemory | Blocked if: isARM && !pipelineProtectedAccess | Bug: On older ARM platforms, enabling VK_KHR_protected_memory causes excessive, unnecessary load/store unit activity. Workaround: Only enabled on ARM if the newer VK_EXT_pipeline_protected_access extension is also present, indicating a fixed driver. (b/208458772) |
| Mixed Load Op Restriction | disallowMixedDepthStencilLoadOpNoneAndLoad | Enabled if: isARM && driverVersion < r38.1.0 | Bug: ARM drivers older than r38p1 are bug
@leegao
leegao / gist:e24afbb5f55fe678139197d703d7f600
Last active July 17, 2025 14:16
dxvk 1.10.3 features
[REQUIRED] robustBufferAccess / CORE10 (feature)
FL=9.1 - Always enabled if supported by Vulkan. Used for robustness and constant buffer range checks.
ImgTec
+ unsupported (0/42): []
+ supported (42/42): ['0.1017-139.3 (41/41)']
Mali Proprietary
+ unsupported (0/66): []
+ supported (66/66): ['25.1-50.0 (65/65)']
Qcom Proprietary
+ unsupported (0/157): []
@leegao
leegao / dxvk_1_10_3_feature_support.json
Created July 17, 2025 08:23
Vulkan features and extension features requirements and coverage on Android devices for those requested by dxvk 1.10.3
[
{
"name": "robustBufferAccess",
"type": "feature",
"extension": "CORE10",
"required": true,
"feature_level": "9.1",
"notes": "Always enabled if supported by Vulkan. Used for robustness and constant buffer range checks.",
"supported_driver_versions": {
"Qcom Proprietary": [