Welcome to the first post in my series on learning Large Language Models (LLMs) from scratch! As someone who’s been working with LLMs in production environments, I’ve realized there’s a gap between using these models and truly understanding how they work under the hood.

Why This Series?

Over the past year, I’ve been building AI-powered applications, including WhereToPlant, an AI system that analyzes environmental parameters for forest restoration. While working with LLMs through APIs has been incredibly powerful, I want to dive deeper into the fundamentals:

  • How do transformers actually work?
  • What makes attention mechanisms so effective?
  • How is training orchestrated at scale?
  • What are the mathematical foundations?

What We’ll Cover

This series will be my learning journey, documented in real time. I’ll be working through:

  1. Foundations: Neural networks, backpropagation, and optimization
  2. Attention Mechanisms: Self-attention and multi-head attention
  3. Transformers: Architecture, training, and inference
  4. Tokenization: BPE, WordPiece, and SentencePiece
  5. Training at Scale: Distributed training, mixed precision, and optimization techniques
  6. Fine-tuning: Transfer learning, LoRA, and adaptation strategies

The Approach

I’ll be:

  • Building implementations from scratch in PyTorch (a short sketch follows this list)
  • Reading seminal papers (Attention is All You Need, GPT series, etc.)
  • Visualizing concepts to build intuition
  • Sharing code examples and experiments
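
To give a flavor of the kind of from-scratch code this series will include, here is a minimal sketch of scaled dot-product attention in PyTorch. It’s only a teaser, not the implementation we’ll build in the attention post; the function name and tensor shapes are illustrative assumptions.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Minimal scaled dot-product attention.

    q, k, v: tensors of shape (batch, seq_len, d_k).
    Returns the attended values and the attention weights.
    """
    d_k = q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    # Normalize the scores into a probability distribution over keys
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the values
    return weights @ v, weights

# Tiny smoke test with random tensors
q = torch.randn(1, 4, 8)
k = torch.randn(1, 4, 8)
v = torch.randn(1, 4, 8)
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)  # torch.Size([1, 4, 8]) torch.Size([1, 4, 4])
```

Nothing here is transformer-specific yet; wrapping this in learned projections and running several of these in parallel is exactly what the multi-head attention post will build up to.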

Resources I’m Following

  • “Attention Is All You Need” (Vaswani et al., 2017)
  • The GPT series of papers

Next Up

In the next post, we’ll start with the absolute basics: understanding the transformer architecture at a high level before diving into the mathematics of attention.
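
As a small preview of that mathematics, the core operation we’ll unpack is scaled dot-product attention, defined in “Attention Is All You Need” as

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$

where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the key dimension.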

If you’re also learning about LLMs or have resources to share, I’d love to hear from you!


This is part 1 of the “Learning LLMs from Scratch” series.