The earlier chapters built each component in isolation. This project assembles them into a single codebase that prepares data, trains a model, and generates text. Each stage builds directly on the previous one, so work through them in order.
We will train GPT-2 Small: 12 layers, 12 attention heads, 768-dimensional embeddings, roughly 124 million parameters. The training data is FineWeb-Edu, a high-quality filtered web corpus. Training at this scale requires a CUDA GPU. If you are on a CPU or Apple Silicon, the architecture chapter includes a smaller configuration (fewer layers and a narrower embedding) that runs the same code unchanged.
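The architecture numbers above pin down the parameter count. A minimal sketch of how they combine, assuming the standard GPT-2 vocabulary size (50,257 BPE tokens) and context length (1,024), with hypothetical names like `GPTConfig` chosen for illustration:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    n_layer: int = 12        # transformer blocks
    n_head: int = 12         # attention heads per block
    n_embd: int = 768        # embedding dimension
    vocab_size: int = 50257  # GPT-2 BPE vocabulary (assumed, not stated above)
    block_size: int = 1024   # context length (assumed, not stated above)

def approx_params(c: GPTConfig) -> int:
    # token + position embeddings
    emb = (c.vocab_size + c.block_size) * c.n_embd
    # per block: ~4*d^2 for attention projections + ~8*d^2 for the MLP
    per_layer = 12 * c.n_embd ** 2
    return emb + c.n_layer * per_layer

print(f"{approx_params(GPTConfig()):,}")  # on the order of 124 million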
By the End of the Project
- `prepare_data.py` tokenizes FineWeb-Edu shards into `train.bin` and `val.bin`
- `data.py` serves aligned x/y batches from those files
- `model.py` wraps the Transformer stack into a trainable GPT
- `train.py` runs the training loop and saves checkpoints
- `generate.py` loads a checkpoint and samples continuations from a prompt