Memory in Machines

From basic N-Grams to RNNs, LSTMs, and Transformers.

The Problem: Language Modeling

Before we look at Neural Networks, we must ask: How does a machine predict the next token?

If you see: "I like to eat ____"

How do you know the next word is "apples" and not "cars"?
The answer is Context. You look at the words that came before.
A core challenge in language modeling is figuring out how much past context a model needs to predict the next token accurately.

The Simplest Approach: N-Gram Models

Before Deep Learning, we just counted how often sequences appeared in our dataset!

1-Gram (Unigram) - 0 Context:
Ignores context and always picks the most frequent word in the training data. Predicts: "the".
2-Gram (Bigram) - 1 Word Context:
Looks only at the very last word ("eat") and counts what usually follows it. Predicts: "apples".
N-Gram - (N-1) Context:
Looks at a fixed window of N-1 previous words.
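
The counting idea above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind these slides: the toy corpus and the names `train_ngram` and `predict_next` are made up for the example.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n):
    """Count which word follows each (n-1)-word context."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        next_word = tokens[i + n - 1]
        counts[context][next_word] += 1
    return counts

def predict_next(counts, context):
    """Return the most frequent follower of `context`, or None if it was never seen."""
    followers = counts.get(tuple(context))
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Toy corpus (hypothetical, for illustration only).
corpus = "i like to eat apples and i like to eat apples and i like to eat grapes".split()

bigram = train_ngram(corpus, n=2)
trigram = train_ngram(corpus, n=3)

print(predict_next(bigram, ["eat"]))         # apples
print(predict_next(trigram, ["to", "eat"]))  # apples
print(predict_next(trigram, ["and", "we"]))  # None (context never seen)
```

The model is nothing more than a lookup table of counts, which is why a context that never appeared returns no prediction at all.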

The N-Gram Wall (Curse of Dimensionality)

Why not just build a 100-Gram model that looks at the last 99 words?

Data Sparsity: The exact sequence of 99 specific words has probably never occurred in your training data!
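If the exact context never appeared in training, a plain count-based N-Gram model has nothing to count, so it cannot make a prediction. Smoothing and backoff soften this, but mostly by falling back to shorter contexts.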
We need a model that can generalize context, rather than just counting exact matches.
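
To get a feel for the sparsity problem, we can count how many distinct contexts an N-Gram model could encounter. The vocabulary size of 10,000 below is an assumption chosen purely for illustration.

```python
vocab_size = 10_000  # assumed vocabulary size, for illustration only

def possible_contexts(vocab_size, n):
    """Number of distinct (n-1)-word contexts an n-gram model could see."""
    return vocab_size ** (n - 1)

for n in (2, 3, 5, 100):
    exponent = len(str(possible_contexts(vocab_size, n))) - 1
    print(f"{n}-gram: 10^{exponent} possible contexts")
```

Even a 5-Gram already has 10^16 possible contexts under this assumption, far more than any training set can cover.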

Breaking the Wall: Neural Language Models

Here we test our N-Gram model against an RNN and an LSTM on a sequence that never appeared in this exact order in our training data:

--- Testing the N-Gram Wall Sequence ---
Context: 'and i do'
N-Gram Model Prediction: <UNKNOWN CONTEXT>
RNN Model Prediction: 'not'
LSTM Model Prediction: 'like'

Neural Networks don't rely on exact matches! They use a Hidden State to compress and generalize the context, allowing them to make educated predictions even on unseen sequences.

A Harder Problem: Sequence-to-Sequence

Predicting the next word is one thing, but what about translating an entire sequence?

Imagine reading a word letter by letter:
COMMUNICATION

If I ask you what the 13th letter is, you can't just look at the letter N. You have to remember everything that came before it.
Sequence-to-Sequence (Seq2Seq) models solve this using an Encoder and a Decoder.
The Encoder builds a memory of the English word. The Decoder then uses that memory to translate it to Tamil.

RNN: Passing the Baton

Instead of looking at a fixed "N" window, a Recurrent Neural Network (RNN) passes a Hidden State forward at every step.

Like a baton in a relay race, each character adds its information to the baton and hands it to the next character.
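
The baton pass can be sketched as a single update rule, $h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$, applied once per character. The snippet below is a minimal pure-Python sketch with random, untrained weights; the hidden size of 4 and the helper names `rnn_step` and `one_hot` are assumptions for illustration.

```python
import math
import random

def rnn_step(h_prev, x, W_h, W_x, b):
    """One vanilla RNN step: h_t = tanh(W_h @ h_prev + W_x @ x + b)."""
    h_new = []
    for i in range(len(h_prev)):
        total = b[i]
        total += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        total += sum(W_x[i][k] * x[k] for k in range(len(x)))
        h_new.append(math.tanh(total))
    return h_new

def one_hot(ch, alphabet):
    """Encode a character as a vector with a single 1.0."""
    return [1.0 if c == ch else 0.0 for c in alphabet]

random.seed(0)
word = "communication"
alphabet = sorted(set(word))
hidden_size = 4  # assumed size; untrained random weights below

W_h = [[random.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
W_x = [[random.uniform(-0.5, 0.5) for _ in range(len(alphabet))] for _ in range(hidden_size)]
b = [0.0] * hidden_size

h = [0.0] * hidden_size  # the empty baton
for ch in word:
    # Each character updates the baton and hands it on.
    h = rnn_step(h, one_hot(ch, alphabet), W_h, W_x, b)

print([round(v, 3) for v in h])
```

However long the word is, everything the network remembers has to fit into this one fixed-size vector.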

The Fatal Flaw: Vanishing Gradient

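While the baton pass makes sense conceptually, it breaks down on long sequences. During training, the gradient is multiplied by a similar factor at every step it travels back through time, and when those factors are smaller than 1 it shrinks exponentially.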

Just like a massive game of "Telephone", the original message is constantly overwritten and distorted at every step.

By the time an RNN reaches the end of a 13-letter word, it may have almost entirely forgotten the first 3 letters.
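
A one-neuron RNN is enough to watch the gradient vanish. For $h_t = \tanh(w h_{t-1} + x_t)$, the chain rule gives $\partial h_t / \partial h_{t-1} = w (1 - h_t^2)$, and the influence of the first step on the last is the product of these factors. The weight `w=0.9` and the constant input below are assumptions chosen for illustration.

```python
import math

def gradient_through_time(w, inputs):
    """Scalar RNN h_t = tanh(w * h_{t-1} + x_t).

    Returns d h_t / d h_0 after each step, i.e. how strongly the
    initial state still influences the current hidden state.
    """
    h = 0.0
    grad = 1.0
    history = []
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)  # chain rule factor for this step
        history.append(grad)
    return history

# 13 steps, one per letter of a 13-letter word (toy constant input).
history = gradient_through_time(w=0.9, inputs=[0.5] * 13)
for step, g in enumerate(history, start=1):
    print(f"step {step:2d}: {g:.2e}")
```

Since every factor here has magnitude at most 0.9, the product can only shrink as the sequence gets longer.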

LSTM: The Smart Notebook

To fix this amnesia, the Long Short-Term Memory (LSTM) was invented. It introduces a Cell State—a separate track of memory that acts like a smart notebook.

Forget Gate: Decides what old information to erase.
Input Gate: Decides what new information to write down.
Output Gate: Decides what to output as the current Hidden State.
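
The three gates can be sketched with a scalar LSTM cell in pure Python. This follows the standard LSTM equations, but the parameters are hand-picked rather than trained, and the name `lstm_step` and the parameter layout are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell. `params` maps a name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre("forget"))        # Forget Gate: how much old memory to keep
    i = sigmoid(pre("input"))         # Input Gate: how much new content to write
    g = math.tanh(pre("candidate"))   # candidate content to write
    o = sigmoid(pre("output"))        # Output Gate: how much memory to expose
    c = f * c_prev + i * g            # updated Cell State (the notebook)
    h = o * math.tanh(c)              # updated Hidden State
    return h, c

# Hand-picked (not learned) parameters: keep the notebook, write almost nothing.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 10.0),
}

h, c = 0.0, 0.8  # pretend 0.8 was written at the first letter
for _ in range(13):
    h, c = lstm_step(1.0, h, c, params)

print(round(c, 3))
```

With the Forget Gate near 1 and the Input Gate near 0, the cell state stays close to 0.8 across all 13 steps. The update to `c` is additive rather than a full overwrite, which is what lets information survive.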

The Ultimate Test: Character Translation

When generating full translations character-by-character in our demo, the standard RNN falls apart on long words because it forgets the prefix, while the LSTM translates them correctly.

The Remaining Bottleneck

LSTMs are great, but they still have a massive fundamental problem: Sequential Processing.

You cannot calculate the hidden state for word #2 until you finish word #1.
This means the time steps of a single sequence cannot be computed in parallel, so LSTMs make poor use of GPUs and are painfully slow to train on massive datasets.
Even with a "notebook", information from word #1 still has to travel through 1,000 steps to reach word #1,000. It is an $O(N)$ path length.

The Transformer Revolution: Self-Attention

In 2017, Google published "Attention Is All You Need", introducing the Transformer. It abandoned RNNs entirely.

Instead of passing a baton sequentially, it uses Self-Attention.
Imagine a room full of people. Instead of passing a whisper down the line, everyone is shouting and listening to everyone else simultaneously.
The path length from any word to any other word is $O(1)$: a direct connection within a single layer.
Because no step waits on the previous one, the whole sequence can be processed in parallel on a GPU during training!
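
Here is a minimal pure-Python sketch of scaled dot-product self-attention, $\mathrm{softmax}(QK^\top / \sqrt{d_k})\,V$. To keep it short it skips the learned query, key, and value projections and uses the toy embeddings directly as Q, K, and V; the embedding values are made up for illustration and are not the ones behind the matrix on the next slide.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(K[0])
    weights, outputs = [], []
    for q in Q:
        # Every query is compared with every key directly: no baton pass.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * V[j][d] for j in range(len(V)))
                        for d in range(len(V[0]))])
    return outputs, weights

tokens = ["the", "bank", "of", "the", "river"]
# Hypothetical 3-dimensional embeddings, one per token.
X = [
    [1.0, 0.0, 0.2],
    [0.3, 1.0, 0.8],
    [0.1, 0.2, 0.1],
    [1.0, 0.0, 0.2],
    [0.2, 0.9, 1.0],
]

outputs, weights = self_attention(X, X, X)
for tok, row in zip(tokens, weights):
    print(f"{tok:>6}", " ".join(f"{w:.2f}" for w in row))
```

Each row of `weights` sums to 1 and is computed independently of the other rows, which is why the rows can be computed in parallel.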

The Attention Matrix in Action

In our `transformer_demo.py` script, we see exactly how Self-Attention links words directly to each other without passing through intermediate states.

--- The Attention Matrix ---
How much does each word 'pay attention' to the other words?

         the    bank   of     the    river
the      0.05   0.51   0.18   0.20   0.06
bank     0.41   0.07   0.04   0.08   0.40
of       0.18   0.03   0.15   0.41   0.23
the      0.28   0.07   0.27   0.29   0.09
river    0.08   0.50   0.25   0.07   0.09

Notice how the word "bank" assigns a 0.40 weight directly to "river", nearly as much as its largest weight (0.41, on the first "the"), to understand its own context! No sequential baton pass required.

The Grand Finale: Text Generation

The true power of Language Models is that they don't just predict one word; they can keep generating text for as long as we let them. By taking the predicted output and feeding it back as the new input, we create an autoregressive loop. This is the same basic loop that ChatGPT uses to write essays!

========================================
5. The Magic of Generation (Autoregressive)
========================================

Seed Context: 'i like'
RNN Generates: 'i like to eat grapes and i like to eat'
LSTM Generates: 'i like to eat grapes and i like to eat'
Transformer Generates: 'i like to eat apples and i like to eat'

Whether using an LSTM or a Transformer, the model "hallucinates" or creates new paths based on the structure it learned!
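
The autoregressive loop itself is independent of the model inside it. As a minimal sketch, here it is wrapped around a simple bigram counter instead of a neural network; the toy corpus and the names `train_bigram` and `generate` are assumptions for illustration, not the script that produced the output above.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count which word follows each word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, seed, max_new_tokens):
    """Greedy autoregressive loop: predict a word, append it, repeat."""
    tokens = list(seed)
    for _ in range(max_new_tokens):
        followers = counts.get(tokens[-1])
        if not followers:
            break  # unseen context: nothing to predict
        next_word = followers.most_common(1)[0][0]
        tokens.append(next_word)  # the output becomes part of the next input
    return tokens

# Toy corpus (hypothetical, for illustration only).
corpus = "i like to eat apples and i like to eat apples and i like to eat grapes".split()
model = train_bigram(corpus)

print(" ".join(generate(model, ["i", "like"], max_new_tokens=8)))
# i like to eat apples and i like to eat
```

Swapping the bigram lookup for an RNN, LSTM, or Transformer changes how the next word is predicted, but the loop stays the same.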