From basic N-Grams to RNNs, LSTMs, and Transformers.
Before we look at Neural Networks, we must ask: How does a machine predict the next token?
If you see: "I like to eat ____"
Before Deep Learning, we just counted how often sequences appeared in our dataset!
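The counting idea can be sketched in a few lines. This is a minimal illustration, not the code behind the results in this guide: the toy corpus and the helper names `train_ngram` and `predict_next` are made up for the example.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n=2):
    """Count which token follows each (n-1)-token context."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def predict_next(counts, context):
    """Return the most frequent continuation, or None for an unseen context."""
    followers = counts.get(tuple(context))
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Toy corpus, invented for illustration only.
corpus = "i like to eat pizza . i like to eat pizza . i like to eat pasta .".split()

bigram_counts = train_ngram(corpus, n=2)
prediction = predict_next(bigram_counts, ["eat"])   # most common word after "eat"
unseen = predict_next(bigram_counts, ["drink"])     # "drink" never appears in the corpus

print(prediction, unseen)
```

Note that the model has nothing to say about a context it has never counted, which is the weakness the rest of this guide keeps coming back to.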
Why not just build a 100-Gram model that looks at the last 99 words?
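One way to see the problem is to count how many distinct contexts such a model would have to store. The sketch below assumes a vocabulary of 10,000 words, a number picked only for illustration; `num_contexts` is a hypothetical helper.

```python
def num_contexts(n, vocab_size):
    """Number of distinct (n-1)-word contexts an n-gram model could encounter."""
    return vocab_size ** (n - 1)

vocab_size = 10_000  # assumed vocabulary size, for illustration only

bigram_contexts = num_contexts(2, vocab_size)
hundred_gram_contexts = num_contexts(100, vocab_size)

print(f"2-gram contexts:   {bigram_contexts:,}")
print(f"100-gram contexts: a number with {len(str(hundred_gram_contexts))} digits")
```

No dataset comes close to covering that many contexts, so almost every 99-word history the model meets at test time has a count of zero.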
When we test our N-Gram model against an RNN and an LSTM on a sequence that never appears in that exact order in our training data:
Neural Networks don't rely on exact matches! They use a Hidden State to compress and generalize the context, allowing them to make educated predictions even on unseen sequences.
Predicting the next word is one thing, but what about translating an entire sequence?
Imagine reading a word letter by letter:
COMMUNICATION
Instead of looking at a fixed "N" window, a Recurrent Neural Network (RNN) passes a Hidden State forward at every step.
Like a baton in a relay race, each character adds its information to the baton and hands it to the next character.
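The baton pass can be sketched as a single update rule applied once per character. This is a minimal illustration in pure Python, assuming a tanh update, a hidden size of 4, and random untrained weights; none of these choices come from the models used in this guide.

```python
import math
import random

def rnn_step(h, x, W_h, W_x):
    """One recurrent update: h_new = tanh(W_h @ h + W_x @ x)."""
    size = len(h)
    return [
        math.tanh(
            sum(W_h[i][j] * h[j] for j in range(size))
            + sum(W_x[i][k] * x[k] for k in range(len(x)))
        )
        for i in range(size)
    ]

def one_hot(ch, alphabet):
    """Encode a character as a vector with a single 1.0."""
    return [1.0 if ch == a else 0.0 for a in alphabet]

word = "COMMUNICATION"
alphabet = sorted(set(word))
hidden_size = 4  # arbitrary size, chosen for illustration

# Random, untrained weights: this shows the mechanics, not a useful model.
rng = random.Random(0)
W_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
W_x = [[rng.uniform(-0.5, 0.5) for _ in alphabet] for _ in range(hidden_size)]

h = [0.0] * hidden_size  # the empty baton
for ch in word:
    # Each character updates the baton and hands it to the next step.
    h = rnn_step(h, one_hot(ch, alphabet), W_h, W_x)

print(h)
```

The whole word ends up squeezed into the same four numbers, no matter how long it is.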
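While the baton pass makes sense conceptually, it breaks down on long sequences: the hidden state has a fixed size and is rewritten at every step, and the training gradients shrink as they are propagated back through many steps (the vanishing gradient problem).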
Just like a massive game of "Telephone", the original message is constantly overwritten and distorted at every step.
By the time a plain RNN reaches the end of a 13-letter word, it may have almost entirely forgotten the first 3 letters.
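A rough way to picture the forgetting: suppose each step keeps only a fixed fraction of the signal it receives. The factor of 0.5 below is an assumption made purely for illustration. In a real RNN the factor depends on the weights and activations and changes from step to step.

```python
def surviving_signal(per_step_factor, steps):
    """Fraction of a signal left after being scaled by the same factor at every step."""
    return per_step_factor ** steps

word = "COMMUNICATION"
per_step_factor = 0.5  # assumed value, chosen only to make the decay visible

for position, ch in enumerate(word):
    steps_to_end = len(word) - 1 - position
    share = surviving_signal(per_step_factor, steps_to_end)
    print(f"{ch} (position {position + 1:2d}): {share:.6f}")

first_letter_share = surviving_signal(per_step_factor, len(word) - 1)
```

Under this assumption, the first letter's contribution is scaled by 0.5 twelve times before the last step, leaving 1/4096 of it, while the last letter arrives at full strength.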
To fix this amnesia, the Long Short-Term Memory (LSTM) was invented. It introduces a Cell State—a separate track of memory that acts like a smart notebook.
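Here is a minimal sketch of one LSTM step using the standard gate equations, with scalars instead of vectors to keep it readable. The parameter names and the hand-picked values are assumptions for illustration: the forget gate is biased open and the input gate is biased shut, so you can watch the notebook keep its contents.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM update with scalar input, hidden state, and cell state."""
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])    # forget gate: how much old memory to keep
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])    # input gate: how much new info to write
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])    # output gate: how much memory to expose
    g = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])  # candidate content to write
    c = f * c_prev + i * g   # cell state: the separate memory track
    h = o * math.tanh(c)     # hidden state passed to the next step
    return h, c

# Hand-picked, untrained parameters (illustrative assumption):
# all weights are zero, the forget gate is held open, the input gate is held shut.
params = {
    "w_f": 0.0, "u_f": 0.0, "b_f": 10.0,
    "w_i": 0.0, "u_i": 0.0, "b_i": -10.0,
    "w_o": 0.0, "u_o": 0.0, "b_o": 0.0,
    "w_g": 0.0, "u_g": 0.0, "b_g": 0.0,
}

h, c = 0.0, 1.0  # start with something written in the notebook
for x in [0.3] * 13:  # 13 steps, one per letter of a 13-letter word
    h, c = lstm_step(x, h, c, params)

print(c)
```

With these settings the cell state is still close to 1.0 after 13 steps. A trained LSTM learns when to open and close each gate instead of having them fixed.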
When generating full translations character by character, the standard RNN falls apart on long words because it forgets the prefix. The LSTM handles them flawlessly.
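LSTMs are great, but they still have a fundamental problem: Sequential Processing. Each step needs the hidden state from the step before it, so the tokens of a sequence cannot be processed in parallel.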
In 2017, researchers at Google published "Attention Is All You Need", introducing the Transformer. It drops recurrence entirely and relies on attention instead.
In our `transformer_demo.py` script, we see exactly how Self-Attention links words directly to each other without passing through intermediate states.
Notice how the word "bank" assigns a massive 0.40 weight directly to "river" to understand its own context! No sequential baton pass required.
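The mechanism can be sketched as scaled dot-product attention in pure Python. This is not the code from `transformer_demo.py`: the three-token sentence and its 2-dimensional vectors are invented for illustration, and the same vectors are reused as queries, keys, and values where a real Transformer would apply learned projections. The weights it prints will not match the 0.40 above.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a short sequence of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this token against every token directly, in one hop.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Blend the value vectors according to the attention weights.
        output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

tokens = ["the", "river", "bank"]
# Made-up embeddings: "river" and "bank" point in similar directions.
vectors = [[0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]

outputs, weights = self_attention(vectors, vectors, vectors)

bank = tokens.index("bank")
for token, w in zip(tokens, weights[bank]):
    print(f"bank -> {token}: {w:.2f}")
```

Every token's scores are computed independently of the others, which is why attention can be run for all positions at once.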
The true power of Language Models is that they don't just predict one word: they can generate text of arbitrary length. By taking the predicted output and feeding it back as the new input, we create an autoregressive loop. This is the same basic loop ChatGPT uses to write essays, one token at a time!
Whether using an LSTM or a Transformer, the model "hallucinates" or creates new paths based on the structure it learned!
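The autoregressive loop itself is small. In the sketch below, `generate` and the lookup-table "model" are hypothetical stand-ins: any function that maps the tokens so far to a next token, whether backed by an N-Gram table, an LSTM, or a Transformer, could be plugged in as `predict_next`.

```python
def generate(predict_next, prompt, max_new_tokens):
    """Repeatedly predict a token and feed it back in as part of the input."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)
        if next_token is None:  # the model has nothing to continue with
            break
        tokens.append(next_token)
    return tokens

# Stand-in "model": a hypothetical lookup table keyed on the last token only.
toy_table = {"i": "like", "like": "to", "to": "eat", "eat": "pizza"}

def toy_predict_next(tokens):
    return toy_table.get(tokens[-1])

generated = generate(toy_predict_next, ["i"], max_new_tokens=10)
truncated = generate(toy_predict_next, ["i"], max_new_tokens=2)

print(" ".join(generated))
```

Real systems usually sample from the predicted probabilities rather than always taking a single fixed answer, and they stop at an end-of-sequence token or a length limit.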