Based on Andrej Karpathy's 'NanoGPT' lecture (training a small transformer on a Shakespeare dataset), refactored for training with PyTorch Lightning.
Uses a character-level tokenizer and a decoder-only transformer architecture with masked (causal) self-attention.
Training was tested on an A100 (40 GB) GPU and an M2 MacBook.
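
To illustrate the two ideas named above, here is a minimal, dependency-free sketch of a character-level tokenizer and of causal masking applied to attention scores. It is not the repository's code: the sample `text`, the helpers `encode`, `decode`, and `causal_attention_weights`, and the use of plain Python lists instead of PyTorch tensors are all assumptions made for illustration.

```python
import math

# Stand-in corpus (assumption); the real project trains on Shakespeare text.
text = "to be, or not to be"

# Character-level vocabulary: one integer id per distinct character.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}


def encode(s):
    """Map a string to a list of character ids."""
    return [stoi[c] for c in s]


def decode(ids):
    """Map a list of character ids back to a string."""
    return "".join(itos[i] for i in ids)


def causal_attention_weights(scores):
    """Row-wise softmax over a T x T score matrix with future positions masked.

    Position t may only attend to positions j <= t, which is what makes
    the self-attention "masked" in a decoder-only model.
    """
    T = len(scores)
    weights = []
    for t in range(T):
        # Replace scores for future positions with -inf so they get weight 0.
        row = [scores[t][j] if j <= t else -math.inf for j in range(T)]
        m = max(row)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


ids = encode("to be")
roundtrip = decode(ids)
uniform = causal_attention_weights([[0.0] * 3 for _ in range(3)])
```

With uniform scores, each row of `uniform` spreads its weight evenly over the current and earlier positions and assigns zero weight to later ones. In a PyTorch model the same effect is usually obtained by filling the upper triangle of the score tensor with `-inf` before the softmax.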