Trained a 125M LM from scratch instead of fine-tuning GPT-2 — releasing weights + SFT framework for others to build on

📰 Reddit r/LocalLLaMA

Trained a 125M LM from scratch (custom tokenizer) + released instruct checkpoint and SFT framework so others can fine-tune their own variants

I’ve been experimenting with training small language models fully from scratch (no GPT-2 init, no borrowed tokenizer) and wanted to share something others here might be able to build on. I trained a 12-layer, 125M-parameter causal LM using a custom 16k BPE tokenizer on WikiText-103 + TinyStories. Training ran …
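The post gives the layer count, vocab size, and total parameter count, but not the hidden size or context length. A quick parameter-count sketch for a GPT-2-style decoder block shows how those pieces trade off; the dimensions below (`d_model=768`, `ctx=1024`) are illustrative assumptions, not numbers from the post:

```python
# Parameter count for a GPT-2-style decoder-only transformer.
# The post specifies 12 layers, a 16k BPE vocab, and ~125M params;
# hidden size and context length are NOT given, so the defaults
# here are assumptions for illustration.

def count_params(n_layers=12, d_model=768, vocab_size=16_000, ctx=1024):
    emb = vocab_size * d_model           # token embeddings (tied with LM head)
    pos = ctx * d_model                  # learned positional embeddings
    attn = 4 * d_model**2 + 4 * d_model  # QKV + output projections, with biases
    mlp = 8 * d_model**2 + 5 * d_model   # 4x-expansion MLP, with biases
    ln = 4 * d_model                     # two LayerNorms per block (scale + bias)
    block = attn + mlp + ln
    final_ln = 2 * d_model
    return emb + pos + n_layers * block + final_ln

print(f"{count_params():,}")
```

With these assumed dims the count comes out near 98M, not 125M; a 16k vocab contributes far fewer embedding parameters than GPT-2's 50k, so reaching 125M at 12 layers implies a wider model (roughly `d_model` in the high 800s) or extra parameters elsewhere.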

Published 13 Apr 2026