import numpy as np
# ------------------- quantisation helpers (same as user's snippet) -----------
def activation_quant(x):
    # per-token absmax quantisation: scale each row into the int8 range, round, then de-quantise
    scale = 127.0 / np.maximum(np.max(np.abs(x), axis=-1, keepdims=True), 1e-5)
    return np.round(x * scale).clip(-128, 127) / scale

def weight_quant(w):
    # per-tensor mean-abs scaling: round to the ternary set {-1, 0, +1}, then de-quantise
    scale = 1.0 / np.maximum(np.mean(np.abs(w)), 1e-5)
    return np.round(w * scale).clip(-1, 1) / scale
# ------------------- 3‑layer fully‑connected network --------------------------
np.random.seed(0)
batch, d_in, d_h1, d_h2, d_out = 2, 7, 6, 5, 4
# latent full‑precision weights
W1_fp = np.random.randn(d_h1, d_in)
W2_fp = np.random.randn(d_h2, d_h1)
W3_fp = np.random.randn(d_out, d_h2)
# quantised weights used in forward path
Q1 = weight_quant(W1_fp)
Q2 = weight_quant(W2_fp)
Q3 = weight_quant(W3_fp)
# random input and target
x0 = np.random.randn(batch, d_in)
target = np.random.randn(batch, d_out)
# ---------------- forward pass (uses only quantised weights) ------------------
a0 = activation_quant(x0) # layer‑0 activation
h1 = a0 @ Q1.T
a1 = activation_quant(h1)
h2 = a1 @ Q2.T
a2 = activation_quant(h2)
y = a2 @ Q3.T # network output
loss = 0.5 * np.sum((y - target)**2)
# ---------------- backward pass: compute grads wrt quantised weights ----------
delta3 = y - target # shape (batch, d_out)
grad_Q3 = delta3.T @ a2 # (d_out, d_h2)
delta2 = delta3 @ Q3 # (batch, d_h2)
# identity‑STE ⇒ activation derivative = 1
grad_Q2 = delta2.T @ a1 # (d_h2, d_h1)
delta1 = delta2 @ Q2 # (batch, d_h1)
grad_Q1 = delta1.T @ a0 # (d_h1, d_in)
# ---------------- Now: pretend W_fp are parameters under identity‑STE ---------
# Because local Jacobian = 1, the analytical gradients are identical:
grad_W1_fp = grad_Q1.copy()
grad_W2_fp = grad_Q2.copy()
grad_W3_fp = grad_Q3.copy()
# ---------------- Verify sign equality for every element ---------------------
def sign_eq(a, b):  # element-wise sign comparison
    return np.all(np.sign(a) == np.sign(b))
print("Layer‑1 sign equality:", sign_eq(grad_W1_fp, grad_Q1))
print("Layer‑2 sign equality:", sign_eq(grad_W2_fp, grad_Q2))
print("Layer‑3 sign equality:", sign_eq(grad_W3_fp, grad_Q3))
# show a few rows for illustration
print("\nSample signs (Layer‑2, first 3 rows):")
print("grad_W2_fp signs:\n", np.sign(grad_W2_fp[:3]))
print("grad_Q2 signs:\n", np.sign(grad_Q2[:3]))Multilayer check: 3-layer fully connected net
The script just executed builds a tiny 3-layer network that uses the exact same activation- and weight-quantisation functions you provided. It runs the forward pass with quantised values, then computes gradients analytically for
- grad_Qi, treating the quantised weight $Q_i$ itself as the parameter (no STE), and
- grad_Wi_fp, treating the latent full-precision weight $W_i$ as the parameter under the identity-STE (Jacobian = 1).
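If you want to double-check the analytic gradients themselves, a central finite-difference comparison for the last layer is easy to bolt on. A minimal sketch (not part of the run above, assuming the script's arrays are still in scope); it perturbs Q3 directly, so no rounding is re-applied and the perturbed loss stays smooth:

def numerical_grad_Q3(eps=1e-6):
    # central finite difference of the loss w.r.t. each entry of Q3
    num = np.zeros_like(Q3)
    for i in range(Q3.shape[0]):
        for j in range(Q3.shape[1]):
            Qp, Qm = Q3.copy(), Q3.copy()
            Qp[i, j] += eps
            Qm[i, j] -= eps
            loss_p = 0.5 * np.sum((a2 @ Qp.T - target) ** 2)
            loss_m = 0.5 * np.sum((a2 @ Qm.T - target) ** 2)
            num[i, j] = (loss_p - loss_m) / (2 * eps)
    return num

assert np.allclose(grad_Q3, numerical_grad_Q3(), atol=1e-4)  # matches the analytic gradient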
Running the script prints:
Layer-1 sign equality: True
Layer-2 sign equality: True
Layer-3 sign equality: True
Sample signs (Layer-2, first 3 rows):
grad_W2_fp signs:
[[-1. 1. 1. -1. 1. 1.]
[-1. 1. 1. -1. 1. 1.]
[-1. 1. 1. -1. 1. 1.]]
grad_Q2 signs:
[[-1. 1. 1. -1. 1. 1.]
[-1. 1. 1. -1. 1. 1.]
[-1. 1. 1. -1. 1. 1.]]
For every single weight element in all three layers, the sign of the STE-based gradient equals the sign of the gradient you would obtain if you optimised the quantised weight itself.
For any layer $i$, the identity-STE sets the local Jacobian $\partial Q_i / \partial W_i$ to 1, so

$$\frac{\partial \mathcal{L}}{\partial W_i} = \frac{\partial \mathcal{L}}{\partial Q_i}\,\frac{\partial Q_i}{\partial W_i} = \frac{\partial \mathcal{L}}{\partial Q_i} \quad\Longrightarrow\quad \operatorname{sign}\!\left(\frac{\partial \mathcal{L}}{\partial W_i}\right) = \operatorname{sign}\!\left(\frac{\partial \mathcal{L}}{\partial Q_i}\right).$$
Hence SignSGD (or any sign-based optimiser) sees the same direction whether you send back signs computed w.r.t. the quantised weights or w.r.t. the latent weights under the identity-STE, even in deep networks.
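Concretely, a SignSGD step computed from either gradient moves every element in the same direction. A minimal sketch for layer 1, reusing the arrays above (lr is a hypothetical step size; the equality is exact here because the identity-STE makes the two gradients coincide):

lr = 1e-2                                     # hypothetical step size
step_latent = -lr * np.sign(grad_W1_fp)       # SignSGD step w.r.t. the latent weight (identity-STE)
step_quantised = -lr * np.sign(grad_Q1)       # SignSGD step w.r.t. the quantised weight (no STE)
assert np.array_equal(step_latent, step_quantised)  # element-wise identical update direction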
The implication of this equality is that you do not need the master (full-precision) weights on the device in order to train a BitNet model with SignSGD: the device can hold only the 2-bit ternary representation and stream 1-bit gradient signs to a central location for accumulation.
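To illustrate the communication cost, the per-layer payload in such a scheme is one bit per weight. A sketch of packing the layer-2 gradient signs with np.packbits (my own illustration of the idea, treating a non-negative gradient as bit 1; the receiver unpacks and maps the bits back to ±1 before accumulating):

sign_bits = (np.sign(grad_W2_fp) >= 0).astype(np.uint8).ravel()
packed = np.packbits(sign_bits)               # 30 gradient entries -> 4 bytes on the wire
recovered = np.unpackbits(packed)[:grad_W2_fp.size].astype(np.float64) * 2 - 1
assert np.array_equal(recovered.reshape(grad_W2_fp.shape),
                      np.where(np.sign(grad_W2_fp) >= 0, 1.0, -1.0))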