A Pedagogical Journey through Attention, Memory, and Parallelization
Long Short-Term Memory (LSTM) networks dominated sequence modeling from roughly 2014 to 2017. But they were fundamentally limited by their architecture: the Conveyor Belt.
Because an LSTM is effectively a sequential for loop over tokens, it cannot be trained efficiently on modern parallel GPUs.
LSTMs process tokens strictly one by one, and each step must wait for the one before it, causing massive traffic jams.
The `amnesia_test_demo.py` script demonstrates this limitation empirically.
Strings of 20 random letters are generated. Both an LSTM and a Transformer are tasked to memorize and output the very first letter of the sequence.
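Because the LSTM must pass the first letter (for example 'K') through 19 subsequent hidden states, the gradient signal linking the output to that letter shrinks at every step, and the letter tends to be forgotten.
Theoretical Takeaway: Recurrent models such as LSTMs are prone to the Vanishing Gradient problem over long sequences (gating reduces it but does not eliminate it), whereas Transformers can retrieve the first token directly via Attention.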
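The shrinking gradient can be illustrated with a toy calculation. This is a minimal sketch under stated assumptions, not the contents of `amnesia_test_demo.py`: it uses a scalar tanh recurrence instead of a real LSTM, and the weight `w` and the input values are made up for illustration. It multiplies the per-step derivatives to show how little gradient reaches the first position after 19 steps.

```python
import math

def gradient_to_first_step(inputs, w=0.5):
    """Return d h_T / d h_1 for the toy recurrence h_t = tanh(w * h_{t-1} + x_t).

    Assumption: a scalar tanh recurrence stands in for the LSTM; `w` is hypothetical.
    """
    h = math.tanh(inputs[0])   # h_1 depends only on the first input
    grad = 1.0                 # d h_1 / d h_1
    for x in inputs[1:]:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2), which is at most |w|
        grad *= w * (1.0 - h * h)
    return grad

# Hypothetical 20-step sequence (one number per letter position).
sequence = [0.3] * 20
for length in (2, 5, 10, 20):
    print(length, gradient_to_first_step(sequence[:length]))
```

Each extra step multiplies the gradient by a factor no larger than `w`, so with `w = 0.5` the contribution of the first position decays at least geometrically with sequence length.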
In 2017, Google published "Attention Is All You Need". They abandoned the sequential conveyor belt entirely.
Instead of passing a hidden state down a line, the Transformer builds a Web of Connections. Every single word connects directly to every other word simultaneously.
The "Path Length" from Word 1 to Word 10,000 is $O(1)$: a single Attention step connects them directly, whereas a recurrent model must pass through 9,999 intermediate steps.
Self-Attention allows every token to directly analyze every other token instantly.
Imagine being at a noisy cocktail party. When talking to someone, all other voices are instantly tuned out to focus entirely on them. This is what the Transformer does using three matrices: Query, Key, and Value (Q, K, V).
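When a Query matches a Key (measured by their Dot Product), the model absorbs more of the corresponding Value. If the Query for 'The' matches the Key for 'Bank' almost perfectly, 'The' might draw, say, 99% of its new representation from the Value of 'Bank' to understand its own context.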
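The Query/Key/Value matching for a single token can be sketched in a few lines. This is an illustrative example only: the 2-dimensional vectors are invented so that the Query for 'The' aligns strongly with the Key for 'Bank', and the usual scaling factor is left out for simplicity.

```python
import math

def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Dot-product attention for one query: weights over keys, then a weighted sum of values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical vectors, chosen by hand for illustration.
tokens = ["The", "river", "Bank"]
keys = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_for_the = [1.0, 1.0]

weights, output = attend(query_for_the, keys, values)
for token, weight in zip(tokens, weights):
    print(f"{token}: {weight:.4f}")
print("output for 'The':", output)
```

With these made-up numbers, more than 99% of the weight lands on 'Bank', so the output for 'The' is almost entirely the Value of 'Bank'.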
The `transformer_deep_dive.py` script traces this exact Matrix Math.
By multiplying the Query matrix by the transpose of the Key matrix (in the standard formulation, the result is also scaled by $\sqrt{d_k}$, the Key dimension) and applying a Softmax function to each row, a grid of percentages is produced.
Notice how, in the script's output, the model assigns 14% of its attention to the word "network" to understand the context of the word "propose".
Theoretical Takeaway: The Softmax function ensures that each token's attention weights are non-negative and sum to 1.0, so they can be read as percentages of focus distributed across the entire sequence.
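The grid of percentages can be reproduced with a small sketch of scaled dot-product attention weights, $\mathrm{softmax}(QK^T / \sqrt{d_k})$. This is not the code from `transformer_deep_dive.py`; the `Q` and `K` values below are hypothetical, and plain Python lists are used so the example runs without extra libraries.

```python
import math

def softmax(row):
    """Normalize one row of scores into weights that sum to 1."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(Q, K):
    """Return softmax(Q K^T / sqrt(d_k)), one row of weights per query."""
    scale = math.sqrt(len(K[0]))          # d_k is the key dimension
    grid = []
    for q_row in Q:
        scores = [sum(q * k for q, k in zip(q_row, k_row)) / scale for k_row in K]
        grid.append(softmax(scores))
    return grid

# Hypothetical 3-token example with 2-dimensional queries and keys.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

grid = attention_weights(Q, K)
for row in grid:
    print([round(w, 2) for w in row], "sum =", round(sum(row), 6))
```

Every row sums to 1, which is what lets each row be read as one token's attention budget spread over the sequence.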
Backpropagation is applied to this math in `transformer_training_demo.py`.
At Epoch 0, the weights that produce Q, K, and V are randomly initialized, so the model outputs gibberish.
During training, these weights are automatically adjusted so that the attention patterns increasingly reflect the regularities of English grammar, steadily lowering the loss toward zero.
Theoretical Takeaway: As the loss decreases, the Attention Matrices learn to map relationships correctly (e.g. adjectives to nouns), transforming random guesses into coherent text.
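The link between a falling loss and sharper attention can be shown with a deliberately tiny stand-in. This sketch is an assumption-heavy simplification and not what `transformer_training_demo.py` does: instead of backpropagating through Q, K, V projections, it trains one row of attention logits directly with gradient descent so that the weight concentrates on a hypothetical `target` position.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_attention_logits(target, n_positions=4, steps=200, lr=0.5, seed=0):
    """Gradient descent on cross-entropy loss for a single row of attention logits."""
    rng = random.Random(seed)
    logits = [rng.uniform(-1.0, 1.0) for _ in range(n_positions)]  # random start, like Epoch 0
    losses = []
    for _ in range(steps):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
        # Gradient of cross-entropy w.r.t. the logits is probs - one_hot(target).
        for i in range(n_positions):
            grad = probs[i] - (1.0 if i == target else 0.0)
            logits[i] -= lr * grad
    return softmax(logits), losses

weights, losses = train_attention_logits(target=2)
print("first loss:", losses[0])
print("last loss:", losses[-1])
print("final weights:", [round(w, 3) for w in weights])
```

As the loss falls, the weight on the target position grows toward 1, mirroring on a small scale how training sharpens attention toward the relevant token.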
Why do Transformers scale to trillions of parameters? Because they eliminated the for loop.
The `lstm_vs_transformer_race.py` script processes a massive 1,000-word document in pure Python.
The LSTM is forced to wait for the previous word 1,000 times. The Transformer computes all 1,000 words simultaneously via matrices.
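Theoretical Takeaway: The Transformer needs only $O(1)$ sequential steps per layer regardless of sequence length, so it utilizes hardware accelerators far better than the LSTM, which needs $O(N)$ sequential steps. The trade-off is that Self-Attention performs $O(N^2)$ total work, but that work can run in parallel.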
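The difference in dependency structure can be made concrete with a small sketch. It is not the code from `lstm_vs_transformer_race.py`, and it does not measure speed (plain Python still runs one operation at a time). It only shows that each attention output can be computed in any order, because it depends on the inputs alone, while the recurrent loop must run in sequence. The scalar "embeddings", the weight `w`, and the sequence length of 200 are assumptions chosen to keep the example quick.

```python
import math
import random

def recurrent_states(inputs, w=0.5):
    """Sequential: each state needs the previous one, so the steps cannot be reordered."""
    states, h = [], 0.0
    for x in inputs:
        h = math.tanh(w * h + x)
        states.append(h)
    return states

def attention_output(i, inputs):
    """Output for position i depends only on the raw inputs, never on other outputs."""
    scores = [inputs[i] * x for x in inputs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * x for e, x in zip(exps, inputs))

rng = random.Random(0)
inputs = [rng.uniform(-1.0, 1.0) for _ in range(200)]  # hypothetical scalar token embeddings

states = recurrent_states(inputs)  # 200 dependent steps, one after another

# Attention outputs computed left to right...
in_order = [attention_output(i, inputs) for i in range(len(inputs))]

# ...and again in a random order: the results are identical.
positions = list(range(len(inputs)))
rng.shuffle(positions)
any_order = [None] * len(inputs)
for i in positions:
    any_order[i] = attention_output(i, inputs)

print("same results in any order:", in_order == any_order)
```

Because no attention output waits on another, a GPU can compute all of them at once as matrix operations, which is the property the race script exploits.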
Finally, both fully trained models are given the prompt "We propose a". Because the LSTM struggles with long dependencies, it loses track of grammar and produces incoherent text. The Transformer keeps every earlier token within an $O(1)$ path length and generates more coherent, domain-specific text. The same Attention mechanism underlies large language models such as ChatGPT.