Welcome to the first post in my series on learning Large Language Models (LLMs) from scratch! As someone who’s been working with LLMs in production environments, I’ve realized there’s a gap between using these models and truly understanding how they work under the hood.
Why This Series?
Over the past year, I’ve been building AI-powered applications, including WhereToPlant - an AI system that analyzes environmental parameters for forest restoration. While working with LLMs through APIs has been incredibly powerful, I want to dive deeper into the fundamentals:
- How do transformers actually work?
- What makes attention mechanisms so effective?
- How is training orchestrated at scale?
- What are the mathematical foundations?
What We’ll Cover
This series will be my learning journey, documented in real time. I’ll be working through:
- Foundations: Neural networks, backpropagation, and optimization
- Attention Mechanisms: Self-attention and multi-head attention
- Transformers: Architecture, training, and inference
- Tokenization: BPE, WordPiece, and SentencePiece
- Training at Scale: Distributed training, mixed precision, and optimization techniques
- Fine-tuning: Transfer learning, LoRA, and adaptation strategies
The Approach
I’ll be:
- Building implementations from scratch in PyTorch (a first sketch follows this list)
- Reading seminal papers (“Attention Is All You Need”, the GPT series, etc.)
- Visualizing concepts to build intuition
- Sharing code examples and experiments
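
To give a flavor of what “from scratch” will look like, here’s a minimal sketch of scaled dot-product attention in PyTorch. The function name and toy tensors are just placeholders of mine; the actual posts on attention will build this up properly, with learned projections, masking, and multiple heads.

```python
import torch
import torch.nn.functional as F


def scaled_dot_product_attention(query, key, value):
    """attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = query.size(-1)
    # Similarity of every query to every key, scaled so the softmax
    # stays in a well-behaved range as d_k grows.
    scores = query @ key.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ value               # weighted average of the values


# Toy self-attention: a "sequence" of 4 tokens, each an 8-dim vector,
# attending to itself (Q = K = V). No projections, masking, or heads yet.
x = torch.randn(4, 8)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # torch.Size([4, 8])
```

A handful of lines like these, plus the visualizations and paper notes, is roughly the shape each post will take.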
Resources I’m Following
- The Illustrated Transformer by Jay Alammar
- Original papers: “Attention Is All You Need” and the GPT-1/2/3 papers
- Karpathy’s Neural Networks: Zero to Hero
- Various open-source implementations (nanoGPT, minGPT)
Next Up
In the next post, we’ll start with the absolute basics - understanding the transformer architecture at a high level before diving into the mathematics of attention.
If you’re also learning about LLMs or have resources to share, I’d love to hear from you!
This is part 1 of the “Learning LLMs from Scratch” series.