Tutorial WIP

nanoGPT from scratch — Karpathy walkthrough

Following Karpathy’s “Let’s build GPT from scratch” video and the nanoGPT repo. I’m reimplementing each piece in a fresh notebook to make sure I can derive it without copy-pasting.

Plan

  1. Bigram baseline on tinyshakespeare.
  2. Add self-attention → multi-head → blocks.
  3. Add positional embeddings + layer norm + residual.
  4. Train on GPU, sample.
  5. Swap dataset (my own corpus).
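
As a reference point for step 1, here is a minimal sketch of a bigram character model. It is a count-based version (a lookup table of next-character frequencies), not the trainable embedding-table bigram from the video, and `toy_text` is a made-up stand-in for tinyshakespeare. The function names are my own.

```python
import random
from collections import Counter, defaultdict

def fit_bigram(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts

def sample(counts, start, length, seed=0):
    """Sample one character at a time from the empirical next-char distribution."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        followers = counts.get(out[-1])
        if not followers:  # dead end: this character was never seen with a successor
            break
        chars = list(followers)
        weights = [followers[c] for c in chars]
        out.append(rng.choices(chars, weights=weights, k=1)[0])
    return "".join(out)

# Toy stand-in for tinyshakespeare (hypothetical text, not the real dataset).
toy_text = "to be or not to be"
counts = fit_bigram(toy_text)
generated = sample(counts, start="t", length=10)
print(generated)
```

Every adjacent pair in the sample is a pair that occurs in the training text, which is all a bigram model can express. That limitation is what the attention steps are meant to lift.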

Live notebook

(Will replace with my actual training notebook — for now the embed demo proves the wiring works.)

Notebook embed demo

This .ipynb is rendered statically at Jekyll build time. Markdown, code, stdout, errors, plots, and DataFrames all work.

In [1]:
msg = 'Hello from a notebook cell.'
print(msg)
Hello from a notebook cell.
In [2]:
def answer():
    return 6 * 7

answer()
Out[2]:
42

Math also renders (MathJax already loaded):

\[J(\theta) = \tfrac{1}{2} \sum_i (h_\theta(x_i) - y_i)^2\]
In [3]:
1 / 0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: division by zero

Key derivations (TODO — fill as I go)

  • Scaled dot-product attention as a soft k-NN over keys.
  • Why the /√d_k scaling matters (variance argument).
  • Causal mask = -inf strictly above the diagonal of the score matrix, applied before the softmax.
  • Why pre-LayerNorm is more stable than post-LN.
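
Until the derivations are written up, a small sketch of the first three items in one place: single-head scaled dot-product attention with a causal mask, plus an empirical check of the variance argument. This uses NumPy rather than PyTorch to keep it dependency-light. The function name, shapes, and random inputs are assumptions for illustration, not nanoGPT's code.

```python
import numpy as np

def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: arrays of shape (T, d_k). Returns (output, weights).
    """
    T, d_k = q.shape
    # (T, T) similarity of each query to each key, scaled by sqrt(d_k).
    scores = q @ k.T / np.sqrt(d_k)
    # True strictly above the diagonal: position t must not see positions > t.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Softmax over keys; subtracting the row max is for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of value rows (the "soft lookup").
    return weights @ v, weights

rng = np.random.default_rng(0)
T, d_k = 4, 64
q, k, v = (rng.standard_normal((T, d_k)) for _ in range(3))
out, weights = causal_self_attention(q, k, v)

# Variance check: for independent unit-variance q and k, a raw dot product
# has variance of about d_k; dividing by sqrt(d_k) brings it back to about 1.
a = rng.standard_normal((10_000, d_k))
b = rng.standard_normal((10_000, d_k))
raw = (a * b).sum(axis=1)
raw_var = raw.var()
scaled_var = (raw / np.sqrt(d_k)).var()
print(raw_var, scaled_var)
```

Without the scaling, scores with variance near d_k push the softmax toward one-hot weights, so each position effectively attends to a single key. With the mask, the first position can only attend to itself, so its output equals its own value vector.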

Bugs I hit

  • Forgot to detach hidden state when iterating batches, so the autograd graph kept growing from one batch to the next. Lesson: in a training loop with custom slicing, print loss.grad_fn once to sanity check.
  • Off-by-one in the targets shift: the targets are y = x[:, 1:] (the next tokens), not x[:, :-1], which is the input slice.
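
A minimal sketch of the targets shift from the second bug, assuming each row of a batch holds T + 1 consecutive token ids. The array values and the names `tokens`, `inputs`, and `targets` are made up for illustration.

```python
import numpy as np

# Hypothetical batch of token ids, shape (B, T + 1): one extra token per row
# so that inputs and targets can both have length T.
tokens = np.array([[10, 11, 12, 13, 14],
                   [20, 21, 22, 23, 24]])

inputs = tokens[:, :-1]   # tokens 0 .. T-1
targets = tokens[:, 1:]   # tokens 1 .. T: the token that follows each input

# Cheap check to run once: the target at position t is the input at position t + 1.
assert inputs.shape == targets.shape
assert np.array_equal(targets[:, :-1], inputs[:, 1:])
print(inputs[0], targets[0])
```

Swapping the two slices still trains without any shape error, since both have length T, which is why the check on the values is worth keeping.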