Welcome to the first post in my series on learning Large Language Models (LLMs) from scratch! As someone who’s been working with LLMs in production environments, I’ve realized there’s a gap between using these models and truly understanding how they work under the hood.
Why This Series?
Over the past year, I’ve been building AI-powered applications, including WhereToPlant - an AI system that analyzes environmental parameters for forest restoration. While working with LLMs through APIs has been incredibly powerful, I want to dive deeper into the fundamentals:
- How do transformers actually work?
- What makes attention mechanisms so effective?
- How is training orchestrated at scale?
- What are the mathematical foundations?
What We’ll Cover
This series will be my learning journey, documented in real time. I’ll be working through:
- Foundations: Neural networks, backpropagation, and optimization
- Attention Mechanisms: Self-attention and multi-head attention
- Transformers: Architecture, training, and inference
- Tokenization: BPE, WordPiece, and SentencePiece
- Training at Scale: Distributed training, mixed precision, and optimization techniques
- Fine-tuning: Transfer learning, LoRA, and adaptation strategies
The Approach
I’ll be:
- Building implementations from scratch in PyTorch (a first sketch follows this list)
- Reading seminal papers (“Attention Is All You Need”, the GPT series, etc.)
- Visualizing concepts to build intuition
- Sharing code examples and experiments
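
To give a flavor of what “from scratch” will look like, here’s a minimal sketch of scaled dot-product attention in PyTorch. The function name and toy tensors are just placeholders of mine; the actual posts on attention will build this up properly, with learned projections, masking, and multiple heads.

```python
import torch
import torch.nn.functional as F


def scaled_dot_product_attention(query, key, value):
    """attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = query.size(-1)
    # Similarity of every query to every key, scaled so the softmax
    # stays in a well-behaved range as d_k grows.
    scores = query @ key.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ value               # weighted average of the values


# Toy self-attention: a "sequence" of 4 tokens, each an 8-dim vector,
# attending to itself (Q = K = V). No projections, masking, or heads yet.
x = torch.randn(4, 8)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # torch.Size([4, 8])
```

A handful of lines like these, plus the visualizations and paper notes, is roughly the shape each post will take.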
Resources I’m Following
- The Illustrated Transformer by Jay Alammar
- Original papers: “Attention Is All You Need” and the GPT-1/2/3 papers
- Karpathy’s Neural Networks: Zero to Hero
- Various open-source implementations (nanoGPT, minGPT)
Next Up
In the next post, we’ll start with the absolute basics - understanding the transformer architecture at a high level before diving into the mathematics of attention.
If you’re also learning about LLMs or have resources to share, I’d love to hear from you!
This is part 1 of the “Learning LLMs from Scratch” series.