Following Karpathy’s “Let’s build GPT from scratch” video and the nanoGPT repo. I'm reimplementing each piece in a fresh notebook to make sure I can derive it without copy-paste.
Plan
- Bigram baseline on tinyshakespeare.
- Add self-attention → multi-head → blocks.
- Add positional embeddings + layer norm + residual.
- Train on GPU, sample.
- Swap dataset (my own corpus).
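A minimal sketch of the first plan item, assuming a count-based character bigram rather than a learned model, and a short inline string standing in for tinyshakespeare. The names `train_bigram` and `avg_nll` are illustrative and not taken from nanoGPT. The point is to have a loss number that any later model should beat.

```python
import math
from collections import Counter, defaultdict

def train_bigram(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def avg_nll(counts, text, vocab_size):
    """Average negative log-likelihood per character, with add-one smoothing."""
    pairs = list(zip(text, text[1:]))
    total = 0.0
    for a, b in pairs:
        row = counts[a]
        p = (row[b] + 1) / (sum(row.values()) + vocab_size)
        total -= math.log(p)
    return total / len(pairs)

# Toy corpus (assumption: stands in for the real dataset).
text = "to be or not to be that is the question"
vocab_size = len(set(text))

counts = train_bigram(text)
loss = avg_nll(counts, text, vocab_size)
uniform = math.log(vocab_size)  # loss of a model that guesses uniformly

print(f"bigram loss: {loss:.3f}  uniform baseline: {uniform:.3f}")
```

The uniform baseline `log(vocab_size)` is also roughly what an untrained network should report at step 0, which makes it a handy sanity check on the initial loss.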
Live notebook
(Will replace with my actual training notebook — for now the embed demo proves the wiring works.)
Notebook embed demo
This .ipynb is rendered statically at Jekyll build time. Markdown, code, stdout, errors, plots, and DataFrames all work.
msg = 'Hello from a notebook cell.'
print(msg)

Hello from a notebook cell.

def answer():
    return 6 * 7

answer()

42
Math also renders (MathJax already loaded):
\[J(\theta) = \tfrac{1}{2} \sum_i (h_\theta(x_i) - y_i)^2\]

1 / 0

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: division by zero
Key derivations (TODO — fill as I go)
- Scaled dot-product attention as a soft k-NN over keys.
- Why the `/√d_k` scaling matters (variance argument).
- Causal mask = upper-triangular `-inf`.
- Why pre-LayerNorm is more stable than post-LN.
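A rough sketch of the first three items, written in plain Python rather than PyTorch so it runs anywhere. The helper names (`score_variance`, `causal_attention`) are my own placeholders. The first part checks the variance argument empirically: for `q` and `k` with independent unit-variance entries, `q·k` has variance close to `d_k`, and dividing by `√d_k` brings it back to about 1. The second part shows the causal mask: scores for future positions are set to `-inf` before the softmax, so their weights come out as exactly zero and each output is a weighted average of past values (the "soft k-NN" view).

```python
import math
import random
import statistics

random.seed(0)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def score_variance(d_k, scaled, trials=2000):
    """Empirical variance of q.k for q, k with i.i.d. N(0, 1) entries."""
    scores = []
    for _ in range(trials):
        q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        s = dot(q, k)
        scores.append(s / math.sqrt(d_k) if scaled else s)
    return statistics.pvariance(scores)

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V are lists of T vectors. Returns (outputs, attention weights).
    """
    T, d_k = len(Q), len(Q[0])
    outputs, weights = [], []
    for t in range(T):
        # Position t may only look at keys s <= t; future keys get -inf.
        row = [
            dot(Q[t], K[s]) / math.sqrt(d_k) if s <= t else float("-inf")
            for s in range(T)
        ]
        w = softmax(row)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append(
            [sum(w[s] * V[s][j] for s in range(T)) for j in range(len(V[0]))]
        )
    return outputs, weights

var_raw = score_variance(64, scaled=False)
var_scaled = score_variance(64, scaled=True)
print(f"d_k=64: var(q.k) ~ {var_raw:.1f}, var(q.k / sqrt(d_k)) ~ {var_scaled:.2f}")

T, d = 4, 3
Q = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(T)]
K = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(T)]
V = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(T)]
out, attn = causal_attention(Q, K, V)
for row in attn:
    print([round(w, 3) for w in row])  # zeros above the diagonal
```

Without the scaling, scores with variance near `d_k` push the softmax toward one-hot outputs, which is where the small-gradient problem comes from. In a tensor framework the same mask is usually built once as a lower-triangular matrix and applied to the whole score matrix instead of row by row.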
Bugs I hit
- Forgot to detach hidden state when iterating batches, so the gradient graph exploded. Lesson: in a training loop with custom slicing, print `loss.grad_fn` once to sanity check.
- Off-by-one in the targets shift: `y = x[:, 1:]` not `x[:, :-1]`.
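The targets shift is easy to check on a toy sequence. This is a sketch with plain lists, assuming each example is cut as a chunk of `block_size + 1` tokens; `make_example` is a hypothetical helper name. The inputs drop the last token and the targets drop the first, so `y[t]` is always the token that follows `x[t]`.

```python
def make_example(tokens, block_size, start):
    """Slice one training example: inputs and next-token targets."""
    chunk = tokens[start : start + block_size + 1]  # one extra token for the shift
    x = chunk[:-1]  # inputs: everything except the last token
    y = chunk[1:]   # targets: everything except the first token
    return x, y

tokens = list("hello world")
x, y = make_example(tokens, block_size=4, start=0)

for t in range(len(x)):
    print(f"context {x[: t + 1]} -> target {y[t]!r}")
```

With the slices swapped, the model would be trained to predict the previous token instead of the next one, and the loss would still go down, which is why this bug is easy to miss.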