d-kja/microgpt.md

## microgpt.md

      
    Raw
  

              microgpt.md
            
          
    Karpathy's Minimal GPT — Explained for Beginners

This is Andrej Karpathy's famous minimal GPT implementation — a complete AI language model in pure Python. Let me walk you through every single piece.
The Big Picture First

This program does three things:

Reads a dataset of human names (like "Emma", "Olivia", "Liam")
Trains a tiny brain (neural network) to learn the patterns in those names
Generates brand new, made-up names that sound plausible

It's the same core idea behind ChatGPT, just shrunk down to its absolute minimum.
Part 1: Setup and Imports

import os       # os.path.exists
import math     # math.log, math.exp
import random   # random.seed, random.choices, random.gauss, random.shuffle
random.seed(42)
import is how Python loads toolboxes. Think of it like grabbing tools off a shelf:

os — lets you interact with files on your computer
math — gives you math functions like logarithms and exponents
random — generates random numbers

random.seed(42) — Random numbers on computers aren't truly random. They follow a sequence determined by a "seed." Setting it to 42 means every time you run this code, you get the exact same "random" numbers. This makes experiments reproducible. (42 is a cultural reference to The Hitchhiker's Guide to the Galaxy.)
Part 2: Loading the Dataset

if not os.path.exists('input.txt'):
    import urllib.request
    names_url = 'https://raw.githubusercontent.com/karpathy/makemore/988aa59/names.txt'
    urllib.request.urlretrieve(names_url, 'input.txt')
docs = [line.strip() for line in open('input.txt') if line.strip()]
random.shuffle(docs)
This says: "If we don't already have the data file, download it from the internet." The file is a list of ~32,000 human names, one per line.
docs = [line.strip() for line in open('input.txt') if line.strip()] — This is called a list comprehension, a compact Python way to build a list. In plain English: "Open the file, go through each line, strip off whitespace (spaces, newlines), and keep it if it's not empty." After this, docs is a list like ["emma", "olivia", "liam", ...].
random.shuffle(docs) — Randomizes the order of the list, like shuffling a deck of cards. This prevents the model from memorizing the alphabetical order of names.
Part 3: The Tokenizer

uchars = sorted(set(''.join(docs)))
BOS = len(uchars)
vocab_size = len(uchars) + 1
Neural networks can't read letters. They only understand numbers. So we need to convert text into numbers.

''.join(docs) — Glues all names into one giant string: "emmaolivaliam...".
set(...) — A set removes duplicates. So this gives us every unique character: {'a', 'b', 'c', ..., 'z'}.
sorted(...) — Puts them in alphabetical order: ['a', 'b', 'c', ..., 'z'].

Now each character has a position (index) in this list. 'a' is 0, 'b' is 1, etc. These indices are the token IDs — the numbers that represent each character.
BOS (Beginning of Sequence) — A special token that signals "this is the start (or end) of a name." It gets the next available number (26, after the 26 letters). Think of it as a punctuation mark that says "name boundary here."
vocab_size — Total number of unique tokens (27: 26 letters + 1 BOS).
Part 4: Autograd (The Engine of Learning)

This is the most conceptually dense part. Let me build up to it.
Why do we need this?

Training a neural network means:

Make a prediction
Measure how wrong it is (the loss)
Figure out how to adjust each parameter to make the loss smaller
Adjust them

Step 3 requires calculus — specifically, computing the gradient (derivative) of the loss with respect to every single parameter. With thousands of parameters connected through complex math, doing this by hand is impossible. Autograd (automatic differentiation) does it for you.
The Value class

class Value:
A class in Python is a blueprint for creating objects. Think of it like a cookie cutter — it defines the shape, and you stamp out cookies (objects) with it.
Each Value object wraps a single number but also remembers:

What number it holds (data)
How sensitive the final loss is to this number (grad — short for gradient)
What other Values were used to compute it (_children)
The local derivative formulas (_local_grads)

__slots__ = ('data', 'grad', '_children', '_local_grads')
This is a Python optimization. Normally Python objects store their attributes in a flexible dictionary. __slots__ says "these are the only attributes this object will ever have," which uses less memory — important when you create millions of Value objects.
The __init__ method (constructor)

def __init__(self, data, children=(), local_grads=()):
    self.data = data
    self.grad = 0
    self._children = children
    self._local_grads = local_grads
__init__ is a special Python method that runs when you create a new object. self refers to the object being created.

data — the actual number (e.g., 3.7)
grad — starts at 0, will be filled in during the backward pass
_children — the inputs that produced this value (empty if it's a raw input)
_local_grads — the local derivatives (from basic calculus)

Operator overloading (making math work)

def __add__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    return Value(self.data + other.data, (self, other), (1, 1))
In Python, __add__ defines what happens when you write a + b. This lets us write natural math expressions like x + y where x and y are Value objects.
The key insight: when we compute c = a + b:

The result's data is a.data + b.data (just normal addition)
The children are (a, b) (we remember the inputs)
The local gradients are (1, 1) (from calculus: the derivative of a + b with respect to a is 1, and with respect to b is also 1)

Similarly for multiplication:
def __mul__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    return Value(self.data * other.data, (self, other), (other.data, self.data))
For c = a * b:

The derivative with respect to a is b (that's other.data)
The derivative with respect to b is a (that's self.data)

This is basic calculus: d(a×b)/da = b, and d(a×b)/db = a.
The other operations follow the same pattern, each encoding the correct derivative from calculus:

__pow__ (power): d(a^n)/da = n × a^(n-1)
log: d(log(a))/da = 1/a
exp: d(e^a)/da = e^a
relu: derivative is 1 if positive, 0 if negative (a simple on/off switch)

The __r*__ methods (like __radd__, __rmul__) handle the case where the Value is on the right side: 5 + value calls value.__radd__(5).
The backward pass

def backward(self):
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._children:
                build_topo(child)
            topo.append(v)
    build_topo(self)
    self.grad = 1
    for v in reversed(topo):
        for child, local_grad in zip(v._children, v._local_grads):
            child.grad += local_grad * v.grad
This is the backpropagation algorithm — the single most important algorithm in deep learning.
Topological sort (build_topo): The computation forms a graph (a tree-like structure). We need to process nodes in the right order — a node must be processed before its children. This recursive function builds that ordering.
The chain rule (the core loop): Starting from the loss (whose gradient is 1, since d(loss)/d(loss) = 1), we walk backward through every operation. For each node, we propagate gradients to its children using:
child.grad += local_grad * v.grad

This is the chain rule from calculus. In English: "How much does the loss change if I wiggle this child? It's the local sensitivity times how much the loss changes if I wiggle the parent."
The += (not just =) is critical — a value might be used in multiple places, and we need to accumulate all the gradient contributions.
Part 5: Model Parameters

n_layer = 1       # depth
n_embd = 16       # width
block_size = 16   # max context length
n_head = 4        # number of attention heads
head_dim = n_embd // n_head  # = 4
These are the hyperparameters — choices made by the human, not learned by the model:

n_layer = 1 — How deep the network is. Real GPTs have 96+ layers. This has 1.
n_embd = 16 — The size of the vector representing each token. GPT-4 uses ~12,000+. This uses 16.
block_size = 16 — How far back the model can look. GPT-4 looks back 128,000 tokens. This looks back 16.
n_head = 4 — Attention is split into multiple "heads" that each look for different patterns.
head_dim = 4 — Each head works with 4-dimensional vectors (16 / 4 = 4).

matrix = lambda nout, nin, std=0.08: [[Value(random.gauss(0, std)) for _ in range(nin)] for _ in range(nout)]
lambda creates a small anonymous function. This one creates a 2D grid (matrix) of Value objects, each initialized with a small random number drawn from a Gaussian (bell curve) distribution centered at 0. The std=0.08 controls how spread out the initial values are — small enough that the network starts "calm."
state_dict = {
    'wte': matrix(vocab_size, n_embd),   # token embeddings
    'wpe': matrix(block_size, n_embd),    # position embeddings
    'lm_head': matrix(vocab_size, n_embd) # output projection
}
state_dict is a dictionary (key-value store) holding all the model's weight matrices:

wte (word token embedding): A 27×16 table. Each of the 27 tokens gets a 16-dimensional vector. These vectors are what the model learns to represent each character.
wpe (word position embedding): A 16×16 table. Each position (0-15) gets a 16-dimensional vector, so the model knows where in the sequence it is.
lm_head: Converts the model's internal 16-dimensional representation back to 27 scores (one per possible next token).

For each layer, we also create:

attn_wq, attn_wk, attn_wv, attn_wo — Weight matrices for the attention mechanism (Q, K, V, and output projection — explained below)
mlp_fc1, mlp_fc2 — Weight matrices for the feed-forward network

params = [p for mat in state_dict.values() for row in mat for p in row]
This flattens all matrices into one big list of individual Value objects — roughly 3,000+ parameters. These are the "knobs" that training will tune.
Part 6: The Model Architecture (The GPT Function)

This is where the magic happens. Let me explain each building block.
Linear layer

def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(wo, x)) for wo in w]
A linear layer (also called a fully connected layer) is the most fundamental neural network operation. It's a matrix-vector multiplication: each output element is the dot product of one row of the weight matrix with the input vector.
In plain English: "For each row in the weight matrix, multiply it element-by-element with the input and sum up the products." This is how the network mixes information from different dimensions.
Softmax

def softmax(logits):
    max_val = max(val.data for val in logits)
    exps = [(val - max_val).exp() for val in logits]
    total = sum(exps)
    return [e / total for e in exps]
Softmax converts a list of arbitrary numbers ("logits") into a probability distribution — all values between 0 and 1, summing to 1. The steps:

Find the maximum value (for numerical stability — prevents overflow)
Subtract the max and exponentiate each value (making them all positive)
Divide by the total (normalizing so they sum to 1)

The result: larger logits get higher probabilities, but everything is normalized. For example, [2.0, 1.0, 0.1] might become [0.66, 0.24, 0.10].
RMSNorm

def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]
Normalization prevents numbers from growing too large or too small as they flow through the network. RMSNorm (Root Mean Square Normalization) computes the root-mean-square of the vector and divides by it. The 1e-5 (0.00001) prevents division by zero.
The GPT function itself

def gpt(token_id, pos_id, keys, values):
This function processes one token at a time and returns logits (scores) for what the next token should be.
Step 1: Embeddings

tok_emb = state_dict['wte'][token_id]  # look up token embedding
pos_emb = state_dict['wpe'][pos_id]    # look up position embedding
x = [t + p for t, p in zip(tok_emb, pos_emb)]  # add them together
The token "a" (ID 0) gets its 16-dimensional embedding from row 0 of wte. Position 3 gets its embedding from row 3 of wpe. Adding them together gives the model both what the token is and where it is.
Step 2: Attention (the breakthrough idea behind transformers)

Attention answers the question: "When predicting the next character, which previous characters should I pay attention to?"
For example, if the name so far is "Joh", to predict the next letter, the model should pay attention to the "J" (names starting with "Jo" often continue as "John" or "Joseph").
Here's how it works:
q = linear(x, state_dict[f'layer{li}.attn_wq'])  # Query: "What am I looking for?"
k = linear(x, state_dict[f'layer{li}.attn_wk'])  # Key: "What do I contain?"
v = linear(x, state_dict[f'layer{li}.attn_wv'])  # Value: "What information do I offer?"
Think of it like a library:

Query (Q): You walk in with a question — "I need something related to X"
Key (K): Each book on the shelf has a label describing its contents
Value (V): The actual content of each book

The model computes a compatibility score between the current query and every previous key:
attn_logits = [sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5 ...]
This is a dot product (measuring similarity) divided by √head_dim (to keep numbers from getting too large). High score = "these tokens are relevant to each other."
These scores become weights via softmax, then the values are combined using those weights — tokens that are more relevant contribute more to the output.
The KV cache (keys, values) stores previous keys and values so we don't recompute them. This is crucial for efficient generation.
Multi-head means this process runs 4 times in parallel, each on a different slice of the embedding. Each head can learn to look for different patterns (one head might focus on vowel patterns, another on consonant patterns, etc.).
Step 3: MLP (Feed-Forward Network)

x = linear(x, state_dict[f'layer{li}.mlp_fc1'])   # expand 16 → 64
x = [xi.relu() for xi in x]                         # non-linearity
x = linear(x, state_dict[f'layer{li}.mlp_fc2'])   # compress 64 → 16
The MLP processes each token independently (no mixing between positions). It expands the representation to 4× the size, applies ReLU (which zeros out negative values, introducing non-linearity), then compresses back. This is where the model "thinks" about each position.
ReLU (max(0, x)) is critical — without non-linearity, stacking linear layers would just be one big linear layer. Non-linearity lets the network learn complex patterns.
Residual connections are the x = [a + b for a, b in zip(x, x_residual)] lines. They add the input of a block back to its output. This creates a "highway" for gradients to flow and makes deep networks much easier to train.
Step 4: Output

logits = linear(x, state_dict['lm_head'])
Project the 16-dimensional representation back to 27 scores — one for each possible next character (26 letters + BOS).
Part 7: The Optimizer (Adam)

learning_rate, beta1, beta2, eps_adam = 0.01, 0.85, 0.99, 1e-8
m = [0.0] * len(params)  # first moment (mean of gradients)
v = [0.0] * len(params)  # second moment (mean of squared gradients)
Adam is the most popular optimizer in deep learning. Plain gradient descent just says "move in the direction of the gradient." Adam is smarter:

m (first moment): A running average of the gradient direction. Like a ball rolling downhill that builds momentum — it smooths out noisy gradients.
v (second moment): A running average of the gradient magnitude. This adapts the learning rate for each parameter individually — parameters with consistently large gradients get smaller steps, and vice versa.
beta1 = 0.85: How much momentum to keep (85% old, 15% new)
beta2 = 0.99: How much of the magnitude history to keep (99% old, 1% new)
eps_adam = 1e-8: Tiny number to prevent division by zero

Part 8: The Training Loop

for step in range(num_steps):
    doc = docs[step % len(docs)]
    tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
For each training step, pick one name and convert it to tokens. For example, "emma" becomes [26, 4, 12, 12, 0, 26] (BOS, e, m, m, a, BOS).
The model's job: given each prefix, predict the next character.

Given [BOS] → predict e
Given [BOS, e] → predict m
Given [BOS, e, m] → predict m
Given [BOS, e, m, m] → predict a
Given [BOS, e, m, m, a] → predict BOS (end of name)

logits = gpt(token_id, pos_id, keys, values)
probs = softmax(logits)
loss_t = -probs[target_id].log()
The loss measures "how wrong was the model?" Using negative log probability: if the model assigned probability 0.9 to the correct next character, the loss is -log(0.9) = 0.105 (small, good!). If it assigned probability 0.01, the loss is -log(0.01) = 4.6 (large, bad!). The model is trying to minimize this number.
loss.backward()
This single line triggers the entire backpropagation algorithm — computing how every single parameter should change to reduce the loss.
lr_t = learning_rate * (1 - step / num_steps)  # learning rate decay
for i, p in enumerate(params):
    m[i] = beta1 * m[i] + (1 - beta1) * p.grad
    v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
    m_hat = m[i] / (1 - beta1 ** (step + 1))  # bias correction
    v_hat = v[i] / (1 - beta2 ** (step + 1))
    p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
    p.grad = 0  # reset gradient for next step
The Adam update for each parameter:

Update the momentum (m) and magnitude (v) estimates
Apply bias correction (m_hat, v_hat) — early in training, the estimates are biased toward zero because they started at zero, so we correct for that
Update the parameter: p.data -= ... (move in the direction that reduces loss)
Reset the gradient to 0 for the next step

The learning rate decay (1 - step / num_steps) starts at the full learning rate and linearly decreases to 0. Large steps early (explore broadly) → small steps later (fine-tune).
Part 9: Inference (Generation)

temperature = 0.5
for sample_idx in range(20):
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    token_id = BOS
    sample = []
    for pos_id in range(block_size):
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax([l / temperature for l in logits])
        token_id = random.choices(range(vocab_size), weights=[p.data for p in probs])[0]
        if token_id == BOS:
            break
        sample.append(uchars[token_id])
Now we use the trained model to generate names:

Start with the BOS token
Feed it through the model to get probabilities for the next character
Sample from those probabilities (randomly pick, weighted by probability)
Feed that character back in, repeat
Stop when the model outputs BOS (signaling "end of name")

Temperature controls randomness. The logits are divided by temperature before softmax:

Low temperature (0.1): Makes the probability distribution "sharper" — the model almost always picks the highest-probability character. Output is predictable and conservative.
High temperature (1.0): Keeps the original distribution — more diverse and creative, but sometimes nonsensical.
Temperature 0.5: A middle ground.

random.choices picks a random element from the list, where each element's chance of being picked is proportional to its weight (probability). This is how we get different names each time, not just the single most likely name.
The Beautiful Summary

The entire algorithm is:

Represent text as numbers
Define a mathematical function (the GPT) that maps input tokens to predictions
Measure how wrong the predictions are (loss)
Compute how to adjust every parameter to be less wrong (backpropagation)
Adjust the parameters (Adam optimizer)
Repeat 1,000 times
Generate by repeatedly asking "what comes next?" and sampling

Every piece of "intelligence" the model exhibits comes from adjusting those ~3,000 numbers to minimize prediction error on the training data. That's it. The same principle scales up to GPT-4's hundreds of billions of parameters — as Karpathy says in the header: "Everything else is just efficiency."
Original code

"""
The most atomic way to train and run inference for a GPT in pure, dependency-free Python.
This file is the complete algorithm.
Everything else is just efficiency.

@karpathy
"""

import os       # os.path.exists
import math     # math.log, math.exp
import random   # random.seed, random.choices, random.gauss, random.shuffle
random.seed(42) # Let there be order among chaos

# Let there be a Dataset `docs`: list[str] of documents (e.g. a list of names)
if not os.path.exists('input.txt'):
    import urllib.request
    names_url = 'https://raw.githubusercontent.com/karpathy/makemore/988aa59/names.txt'
    urllib.request.urlretrieve(names_url, 'input.txt')
docs = [line.strip() for line in open('input.txt') if line.strip()]
random.shuffle(docs)
print(f"num docs: {len(docs)}")

# Let there be a Tokenizer to translate strings to sequences of integers ("tokens") and back
uchars = sorted(set(''.join(docs))) # unique characters in the dataset become token ids 0..n-1
BOS = len(uchars) # token id for a special Beginning of Sequence (BOS) token
vocab_size = len(uchars) + 1 # total number of unique tokens, +1 is for BOS
print(f"vocab size: {vocab_size}")

# Let there be Autograd to recursively apply the chain rule through a computation graph
class Value:
    __slots__ = ('data', 'grad', '_children', '_local_grads') # Python optimization for memory usage

    def __init__(self, data, children=(), local_grads=()):
        self.data = data                # scalar value of this node calculated during forward pass
        self.grad = 0                   # derivative of the loss w.r.t. this node, calculated in backward pass
        self._children = children       # children of this node in the computation graph
        self._local_grads = local_grads # local derivative of this node w.r.t. its children

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def __pow__(self, other): return Value(self.data**other, (self,), (other * self.data**(other-1),))
    def log(self): return Value(math.log(self.data), (self,), (1/self.data,))
    def exp(self): return Value(math.exp(self.data), (self,), (math.exp(self.data),))
    def relu(self): return Value(max(0, self.data), (self,), (float(self.data > 0),))
    def __neg__(self): return self * -1
    def __radd__(self, other): return self + other
    def __sub__(self, other): return self + (-other)
    def __rsub__(self, other): return other + (-self)
    def __rmul__(self, other): return self * other
    def __truediv__(self, other): return self * other**-1
    def __rtruediv__(self, other): return other * self**-1

    def backward(self):
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad

# Initialize the parameters, to store the knowledge of the model
n_layer = 1     # depth of the transformer neural network (number of layers)
n_embd = 16     # width of the network (embedding dimension)
block_size = 16 # maximum context length of the attention window (note: the longest name is 15 characters)
n_head = 4      # number of attention heads
head_dim = n_embd // n_head # derived dimension of each head
matrix = lambda nout, nin, std=0.08: [[Value(random.gauss(0, std)) for _ in range(nin)] for _ in range(nout)]
state_dict = {'wte': matrix(vocab_size, n_embd), 'wpe': matrix(block_size, n_embd), 'lm_head': matrix(vocab_size, n_embd)}
for i in range(n_layer):
    state_dict[f'layer{i}.attn_wq'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.attn_wk'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.attn_wv'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.attn_wo'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.mlp_fc1'] = matrix(4 * n_embd, n_embd)
    state_dict[f'layer{i}.mlp_fc2'] = matrix(n_embd, 4 * n_embd)
params = [p for mat in state_dict.values() for row in mat for p in row] # flatten params into a single list[Value]
print(f"num params: {len(params)}")

# Define the model architecture: a function mapping tokens and parameters to logits over what comes next
# Follow GPT-2, blessed among the GPTs, with minor differences: layernorm -> rmsnorm, no biases, GeLU -> ReLU
def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(wo, x)) for wo in w]

def softmax(logits):
    max_val = max(val.data for val in logits)
    exps = [(val - max_val).exp() for val in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]

def gpt(token_id, pos_id, keys, values):
    tok_emb = state_dict['wte'][token_id] # token embedding
    pos_emb = state_dict['wpe'][pos_id] # position embedding
    x = [t + p for t, p in zip(tok_emb, pos_emb)] # joint token and position embedding
    x = rmsnorm(x) # note: not redundant due to backward pass via the residual connection

    for li in range(n_layer):
        # 1) Multi-head Attention block
        x_residual = x
        x = rmsnorm(x)
        q = linear(x, state_dict[f'layer{li}.attn_wq'])
        k = linear(x, state_dict[f'layer{li}.attn_wk'])
        v = linear(x, state_dict[f'layer{li}.attn_wv'])
        keys[li].append(k)
        values[li].append(v)
        x_attn = []
        for h in range(n_head):
            hs = h * head_dim
            q_h = q[hs:hs+head_dim]
            k_h = [ki[hs:hs+head_dim] for ki in keys[li]]
            v_h = [vi[hs:hs+head_dim] for vi in values[li]]
            attn_logits = [sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5 for t in range(len(k_h))]
            attn_weights = softmax(attn_logits)
            head_out = [sum(attn_weights[t] * v_h[t][j] for t in range(len(v_h))) for j in range(head_dim)]
            x_attn.extend(head_out)
        x = linear(x_attn, state_dict[f'layer{li}.attn_wo'])
        x = [a + b for a, b in zip(x, x_residual)]
        # 2) MLP block
        x_residual = x
        x = rmsnorm(x)
        x = linear(x, state_dict[f'layer{li}.mlp_fc1'])
        x = [xi.relu() for xi in x]
        x = linear(x, state_dict[f'layer{li}.mlp_fc2'])
        x = [a + b for a, b in zip(x, x_residual)]

    logits = linear(x, state_dict['lm_head'])
    return logits

# Let there be Adam, the blessed optimizer and its buffers
learning_rate, beta1, beta2, eps_adam = 0.01, 0.85, 0.99, 1e-8
m = [0.0] * len(params) # first moment buffer
v = [0.0] * len(params) # second moment buffer

# Repeat in sequence
num_steps = 1000 # number of training steps
for step in range(num_steps):

    # Take single document, tokenize it, surround it with BOS special token on both sides
    doc = docs[step % len(docs)]
    tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
    n = min(block_size, len(tokens) - 1)

    # Forward the token sequence through the model, building up the computation graph all the way to the loss
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax(logits)
        loss_t = -probs[target_id].log()
        losses.append(loss_t)
    loss = (1 / n) * sum(losses) # final average loss over the document sequence. May yours be low.

    # Backward the loss, calculating the gradients with respect to all model parameters
    loss.backward()

    # Adam optimizer update: update the model parameters based on the corresponding gradients
    lr_t = learning_rate * (1 - step / num_steps) # linear learning rate decay
    for i, p in enumerate(params):
        m[i] = beta1 * m[i] + (1 - beta1) * p.grad
        v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
        m_hat = m[i] / (1 - beta1 ** (step + 1))
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
        p.grad = 0

    print(f"step {step+1:4d} / {num_steps:4d} | loss {loss.data:.4f}", end='\r')

# Inference: may the model babble back to us
temperature = 0.5 # in (0, 1], control the "creativity" of generated text, low to high
print("\n--- inference (new, hallucinated names) ---")
for sample_idx in range(20):
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    token_id = BOS
    sample = []
    for pos_id in range(block_size):
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax([l / temperature for l in logits])
        token_id = random.choices(range(vocab_size), weights=[p.data for p in probs])[0]
        if token_id == BOS:
            break
        sample.append(uchars[token_id])
    print(f"sample {sample_idx+1:2d}: {''.join(sample)}")
This markdown was generated by AI for reference only.
No results found