AI - Part 5:
Embeddings & Language Models


Seeking the soul of words. The mathematical world of language, from Word2Vec to Language Models and Neural Networks.

Word Embeddings · Language Models · Neural Networks · Attention

Word Embeddings

Giving Words a Soul

Transforming discrete words into continuous, dense, multi-dimensional floating-point vectors (Digital DNA).

  • The Semantic Topology: In this high-dimensional space, words with similar meanings (e.g., King, Prince) naturally gravitate toward each other.
  • Unlike the rigid accountant that is BoW, Embeddings act like a poet—capturing the nuances, emotions, and multifaceted relationships of human language.
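
💻 Code: One-Hot vs. Dense Vectors

(A minimal sketch of the contrast described above. The four-word vocabulary and the 3-dimensional values are made up for illustration; real embeddings are learned from data and usually have hundreds of dimensions.)

```python
import numpy as np

# Hypothetical vocabulary; real vocabularies hold tens of thousands of words
vocab = ["king", "prince", "apple", "banana"]
word_to_id = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Sparse representation: all zeros except a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_id[word]] = 1.0
    return vec

# Hand-picked dense vectors (assumed values, not trained)
embedding_matrix = np.array([
    [0.90, 0.80, 0.10],  # king
    [0.85, 0.75, 0.15],  # prince
    [0.10, 0.05, 0.90],  # apple
    [0.05, 0.10, 0.95],  # banana
])

def embed(word):
    """Dense representation: one row of the embedding matrix."""
    return embedding_matrix[word_to_id[word]]

def distance(a, b):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(a - b))

# One-hot vectors: every pair of distinct words is equally far apart
print("one-hot  king-prince:", round(distance(one_hot("king"), one_hot("prince")), 3))
print("one-hot  king-apple :", round(distance(one_hot("king"), one_hot("apple")), 3))

# Dense vectors: similar words sit closer together
print("embedded king-prince:", round(distance(embed("king"), embed("prince")), 3))
print("embedded king-apple :", round(distance(embed("king"), embed("apple")), 3))
```

With one-hot vectors the distance carries no meaning, while the dense vectors place "king" much closer to "prince" than to "apple".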

Word2Vec

Training the Semantic Map

Rule: *"You shall know a word by the company it keeps."* (J.R. Firth, 1957; the principle behind Word2Vec, introduced by Mikolov et al. in 2013)

  • Architecture 1: CBOW (Continuous Bag of Words): Predicts a hidden target word based on its surrounding context words. (Fast and efficient).
  • Architecture 2: Skip-Gram: Takes a single target word and predicts the surrounding context words. (Slower, but superior for rare words).
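
💻 Code: CBOW vs. Skip-Gram Training Pairs

(A small sketch of how the two architectures frame the same sentence as training examples. The helper name `training_pairs` and the window size are illustrative assumptions; no actual network is trained here.)

```python
def training_pairs(tokens, window=2):
    """Return (cbow_examples, skipgram_examples) for a tokenized sentence."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        # Context = up to `window` words on each side of the target
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        # CBOW: many context words -> one target word
        cbow.append((context, target))
        # Skip-Gram: one target word -> each context word separately
        skipgram.extend((target, ctx) for ctx in context)
    return cbow, skipgram

tokens = "the quick brown fox jumps".split()
cbow, skipgram = training_pairs(tokens, window=1)

print("CBOW examples (context -> target):")
for context, target in cbow:
    print(f"  {context} -> {target}")

print("Skip-Gram examples (target -> context):")
for target, ctx in skipgram:
    print(f"  {target} -> {ctx}")
```

Skip-Gram produces one example per (target, context) pair, so it yields more training examples per sentence than CBOW.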

The Magic of Word2Vec

Algebraic Semantics

vector("King") - vector("Man") + vector("Woman") ≈ vector("Queen")

  • Real-world Impact:
    1. Semantic Search: Querying "Cheap Laptop" retrieves "Budget Notebook".
    2. Cross-Lingual Embeddings: Mapping French and English vector spaces for zero-shot translation.
    3. Recommendation Systems (Item2Vec): "Users who bought this also bought..."

💻 Code: Vector Arithmetic

(Understanding word relationships using Numpy arrays)


import numpy as np

# Hypothetical multi-dimensional word embeddings
# Dimensions: [Royalty_Score, Masculinity_Score]
vec_king   = np.array([0.9,  0.9])
vec_man    = np.array([0.0,  0.9])
vec_woman  = np.array([0.0, -0.9])

# The Word2Vec Magic equation
vec_result = vec_king - vec_man + vec_woman

print("Computed Vector (King - Man + Woman):")
print(vec_result)

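# Ideal vector for Queen: high royalty, negative masculinity (i.e., feminine)
vec_queen = np.array([0.9, -0.9])

print("\nTarget Vector (Queen):")
print(vec_queen)

# Verify the match numerically; real Word2Vec vectors only land near the target
if np.allclose(vec_result, vec_queen):
    print("\n💡 In this toy example, the algebraic operation arrives exactly at 'Queen'!")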
                        

💻 Code: Cosine Similarity

(The mathematical backbone of semantic search and NLP distance metrics)


import numpy as np
from numpy.linalg import norm

def cosine_similarity(vec1, vec2):
    """Calculates cosine of angle between vectors. Range: -1 to 1"""
    return np.dot(vec1, vec2) / (norm(vec1) * norm(vec2))

# Hypothetical embeddings (toy values chosen by hand so that King and Queen point in similar directions)
vec_king   = np.array([0.9,  0.8,  0.1])
vec_queen  = np.array([0.9,  0.7,  0.1])
vec_man    = np.array([0.1,  0.8,  0.0])
vec_apple  = np.array([0.0,  0.0,  0.9])

print("Cosine Similarity Scores:")
print("-" * 30)
print(f"King  & Queen : {cosine_similarity(vec_king, vec_queen):.3f}")
print(f"King  & Man   : {cosine_similarity(vec_king, vec_man):.3f}")
print(f"King  & Apple : {cosine_similarity(vec_king, vec_apple):.3f}")

print("\nConclusion: King and Queen share high semantic alignment.")
print("King and Apple are nearly orthogonal (unrelated).")
                        

The Flaws of Static Embeddings

  • OOV (Out-of-Vocabulary): It utterly fails when encountering a word not present in its training corpus.
  • Morphological Blindness: Without subword tokenization, "run", "runner", and "running" are treated as totally separate entities.
  • Static Nature: Word2Vec assigns a single, permanent vector to the word "bat", permanently collapsing the distinction between the animal and the sports equipment.
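
💻 Code: Static Lookup in Action

(A minimal sketch of why a fixed lookup table shows these flaws. The table `static_embeddings` and its values are hypothetical stand-ins for a trained Word2Vec model.)

```python
# Hypothetical static embedding table (values are made up)
static_embeddings = {
    "bat":    [0.4, 0.5],
    "flew":   [0.1, 0.9],
    "bought": [0.8, 0.2],
    "run":    [0.7, 0.1],
}

def lookup(word):
    """Static lookup: the vector depends only on the word, never on the sentence."""
    return static_embeddings.get(word)  # None signals an out-of-vocabulary word

sentence_a = "he bought a new bat".split()
sentence_b = "the bat flew away".split()

# Static nature: "bat" gets the identical vector in both sentences
vec_a = lookup(sentence_a[-1])
vec_b = lookup(sentence_b[1])
print("Same vector for both senses of 'bat':", vec_a == vec_b)

# OOV / morphological blindness: knowing "run" tells the table nothing about "running"
print("'run' known     :", lookup("run") is not None)
print("'running' known :", lookup("running") is not None)
```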

Contextual Embeddings

The Dynamic Revolution (ELMo, BERT)

A direct answer to the "static" limitation of Word2Vec.

  • Mechanism: Before finalizing the vector representation of a word, the model dynamically reads the entire surrounding sentence.
  • The Result:
    • "He bought a new bat" → Generates the equipment vector.
    • "The bat flew away" → Generates the mammal vector.
  • This largely resolved lexical ambiguity and pushed NLP into the modern era.

Language Models (LMs)

From Reader to Creator

  • The Evolution: Machines that learned to read and classify text are now ready to generate it.
  • Core Philosophy: "What word comes next?" The entire premise is based on probabilistic prediction.
  • Just as human intuition completes sentences (e.g., "The sky is... -> blue"), AI trains on terabytes of textual data to map the statistical relationships between words and predict sequences.

First-Order Sequence Models

The "Goldfish Memory" Approach

Markov Chains: Predicting the next state based exclusively on the current state, ignoring all previous history.

P(next word | current word)

Example: "The cat sat on the [___]".
The model looks only at the word 'the' and checks its statistical history to see how often 'mat' or 'chair' followed it.

  • Severe Limitation: By looking only at one preceding word, it loses global context. (e.g., "The astronaut flew to the [tree]" - highly probable grammatically, but contextually absurd).

Transition Matrices

The AI's Probability Map

A massive lookup table storing the probability distribution of shifting from one specific word to another.

P(word₂ | word₁) = Count(word₁, word₂) / Count(word₁)

  • <start> → "Play" (100% chance)
  • "Play" → "the" (80% chance)
  • "the" → "music" (90% chance)
  • "music" → <end>

*Historical Note: Andrey Markov introduced these chains in 1906 and, in 1913, applied them to the sequence of vowels and consonants in Alexander Pushkin's verse novel "Eugene Onegin".*
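
💻 Code: Transition Probabilities from Counts

(A plain-Python sketch of the formula P(word₂ | word₁) = Count(word₁, word₂) / Count(word₁). The four-sentence corpus and the helper `transition_probs` are assumptions made for illustration.)

```python
from collections import Counter, defaultdict

# Tiny hypothetical corpus of commands, one sentence per entry
corpus = ["play the music", "play the song", "play some music", "stop the music"]

pair_counts = defaultdict(Counter)
for sentence in corpus:
    # Sentence boundary markers let the model learn how sentences begin and end
    tokens = ["<start>"] + sentence.split() + ["<end>"]
    for word1, word2 in zip(tokens, tokens[1:]):
        pair_counts[word1][word2] += 1

def transition_probs(word1):
    """P(word2 | word1) = Count(word1, word2) / Count(word1)."""
    total = sum(pair_counts[word1].values())
    return {word2: count / total for word2, count in pair_counts[word1].items()}

for word in ["<start>", "play", "the"]:
    print(f"{word!r:>10} -> {transition_probs(word)}")
```

Each row of the transition matrix is one such dictionary, and its probabilities sum to 1.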

💻 Code: Markov Generation

(Building a Transition Matrix using NLTK N-grams)


from nltk import ngrams, ConditionalFreqDist

text = "I love AI . I love Python . AI is amazing .".split()

# Extract Bigrams (Pairs of consecutive words)
bigrams = list(ngrams(text, 2))
print("Sample Bigrams:", bigrams[:4], "...\n")

# Build Conditional Frequency Distribution (the raw counts behind the Transition Matrix)
cfd = ConditionalFreqDist(bigrams)

print("Transition counts for 'AI':", dict(cfd['AI']))
print("Transition counts for 'love':", dict(cfd['love']))

# Next Word Prediction
current_word = "I"
# Max() returns the most statistically probable next word
next_word = cfd[current_word].max()
print(f"\nPrediction: After the word '{current_word}', the model predicts '{next_word}'.")
                        

Second-Order Markov Models

A Broader Horizon

To predict the next word, the model uses a sliding window to look at the two preceding words (Trigrams) instead of just one.

  • Advantage: The ambiguity of ("Play" -> ?) is resolved by analyzing ("Play", "the" -> "music"). Uncertainty drops drastically.
  • Skip-Grams (in the n-gram sense, not the Word2Vec architecture): A variation that allows gaps between the words of an n-gram, so a word can be paired with a relevant neighbor a few positions away rather than only with the adjacent one.
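
💻 Code: First-Order vs. Second-Order Context

(A small sketch comparing one-word and two-word contexts on a hypothetical four-sentence corpus. It only counts continuations; a fuller model would normalize the counts into probabilities.)

```python
from collections import Counter, defaultdict

corpus = ["play the music", "play the music loud", "play a game", "fix the car"]

first_order = defaultdict(Counter)   # context: one preceding word
second_order = defaultdict(Counter)  # context: two preceding words

for sentence in corpus:
    tokens = sentence.split()
    for w1, w2 in zip(tokens, tokens[1:]):
        first_order[w1][w2] += 1
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        second_order[(w1, w2)][w3] += 1

print("After 'the'      :", dict(first_order["the"]))
print("After 'play the' :", dict(second_order[("play", "the")]))
print("After 'fix the'  :", dict(second_order[("fix", "the")]))
```

After "the" alone, the model cannot choose between "music" and "car"; with two words of context, each case has a single continuation in this corpus.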

The Barrier of Statistical Models

Stochastic Parrots vs. True Intelligence

  • Human Perspective: We utilize world knowledge and common sense mental models to understand text.
  • Markov Models: They are heartless statistical engines. They don't know that "the sun rises in the west" is a physical impossibility; they only care if it's statistically probable in the training corpus.
  • Conclusion: Pure statistical frequency models are fundamentally incapable of deep semantic comprehension.

Sentiment Analysis

The Digital Empathy

The computational study of extracting subjective information, identifying whether the underlying emotional tone of a text is Positive, Negative, or Neutral.

  • Key Metrics:
    • Polarity: Ranges from -1 (Extremely Negative) to +1 (Extremely Positive).
    • Subjectivity: Ranges from 0 (Objective Fact) to 1 (Personal Opinion).
  • The LLM Revolution: Modern deep learning models are far better at detecting nuanced emotional states, including sarcasm and irony, which routinely defeated legacy lexicon-based NLP.
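
💻 Code: A Toy Polarity Score

(A deliberately naive sketch of how a lexicon-based scorer can produce a polarity in the range -1 to +1. The `LEXICON` entries and their scores are invented for illustration and do not come from any real sentiment library.)

```python
# Hypothetical word-level polarity lexicon (scores are illustrative only)
LEXICON = {"love": 0.8, "great": 0.7, "engaging": 0.6,
           "awful": -0.9, "boring": -0.6, "hate": -0.8}

def polarity(text):
    """Average the scores of known words; the result stays within [-1, 1]."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

for sentence in ["I love this course, it is great!",
                 "The homework is awful.",
                 "The lecture starts at nine."]:
    print(f"{polarity(sentence):+.2f}  {sentence}")
```

Because it only looks up individual words, a scorer like this would rate a sarcastic "I just love homework" as positive, which is the kind of case that legacy approaches struggled with.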

💻 Code: Sentiment Analysis

(Implementing VADER - Valence Aware Dictionary and sEntiment Reasoner)


from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# VADER is highly optimized for social media text and microblogs
analyzer = SentimentIntensityAnalyzer()

text = "I absolutely love this NLP course, it is incredibly engaging! But the homework is awful."

# Generate polarity scores
scores = analyzer.polarity_scores(text)
print("Input Text:", text)
print("\nRaw Scoring Metrics:", scores)

# The 'compound' score is a normalized, weighted composite score
compound = scores['compound']

if compound >= 0.05:
    print("\nOverall Sentiment: Positive 😊")
elif compound <= -0.05:
    print("\nOverall Sentiment: Negative 😠")
else:
    print("\nOverall Sentiment: Neutral 😐")
                        

Neural Networks

Recurrent Neural Networks (RNN)

The Need for Sequence: Word order carries meaning ("dog bites man" vs. "man bites dog"), so how can a model keep track of what it has already read?

  • The RNN Solution: It reads tokens sequentially, maintaining a "Hidden State" (a digital notepad) that carries memory of previous words into the calculation of the next word.
  • The Villains (Drawbacks):
    • Vanishing Gradients (Amnesia): During training on long sentences, gradients shrink toward zero as they are propagated back through many time steps, so the network effectively "forgets" the beginning of the sentence.
    • Sequential Bottleneck: Each step depends on the previous hidden state, so tokens must be processed one at a time, which prevents parallelizing across the sequence on GPUs.
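
💻 Code: The Hidden State Update

(A minimal NumPy sketch of a vanilla RNN forward pass, assuming the common update h = tanh(W_xh·x + W_hh·h + b). The weights are random and untrained, and the dimensions are arbitrary; the goal is only to show how the hidden state carries information forward.)

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 4, 3

# Randomly initialized weights (untrained; for shape and flow only)
W_xh = rng.normal(scale=0.5, size=(hidden_dim, embed_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

def rnn_forward(inputs):
    """Run a vanilla RNN over a sequence and return the final hidden state."""
    h = np.zeros(hidden_dim)            # the "notepad" starts empty
    for x in inputs:                    # strictly one token after another
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
    return h

# Three hypothetical token embeddings
tokens = rng.normal(size=(3, embed_dim))

h_forward = rnn_forward(tokens)
h_reversed = rnn_forward(tokens[::-1])

print("Final state, original order:", np.round(h_forward, 3))
print("Final state, reversed order:", np.round(h_reversed, 3))
```

The same tokens in a different order produce a different final state, and the loop makes the sequential bottleneck visible: step t cannot start before step t-1 has finished.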

Long Short-Term Memory (LSTM)

Curing the Network's Amnesia

An advanced RNN architecture specifically engineered to carry long-term dependencies across vast sequences.

  • The Three Gates:
    1. Forget Gate: Decides which irrelevant historical data to aggressively purge.
    2. Input Gate: Determines what incoming new data is critical enough to store.
    3. Output Gate: Controls what information is exposed to the next step.
  • The Cell State: Information flows along a central highway that is updated by element-wise addition, which lets gradients survive across many time steps and greatly mitigates the Vanishing Gradient problem.
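
💻 Code: One LSTM Step

(A NumPy sketch of a single LSTM time step using the standard gate equations. The parameter dictionary, dimensions, and random weights are assumptions for illustration; a real layer would learn these weights during training.)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step. Returns the new hidden state and cell state."""
    z = np.concatenate([h_prev, x])                       # gates see previous state + new input
    f = sigmoid(params["W_f"] @ z + params["b_f"])        # forget gate: how much old memory to keep
    i = sigmoid(params["W_i"] @ z + params["b_i"])        # input gate: how much new content to store
    o = sigmoid(params["W_o"] @ z + params["b_o"])        # output gate: how much memory to expose
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])  # candidate memory content
    c = f * c_prev + i * c_tilde                          # additive update of the cell state
    h = o * np.tanh(c)                                    # hidden state passed to the next step
    return h, c

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
params = {}
for gate in "fioc":
    params[f"W_{gate}"] = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim + input_dim))
    params[f"b_{gate}"] = np.zeros(hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):   # five hypothetical token embeddings
    h, c = lstm_step(x, h, c, params)

print("Hidden state:", np.round(h, 3))
print("Cell state  :", np.round(c, 3))
```

The line `c = f * c_prev + i * c_tilde` is the "highway": old memory is scaled and added to, rather than being repeatedly squashed through a nonlinearity.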

Seq2Seq Architecture

The Dual-Expert Alliance

How do we capture the "essence" of one language and generate an entirely new sequence? (e.g., Machine Translation, Summarization).

  • The Architecture:
    1. The Encoder: Reads the input sequence and compresses its semantic meaning into a dense mathematical representation.
    2. The Decoder: Receives this compressed representation and dynamically unwraps it to generate a new output sequence.

Deep Dive: The Encoder

  1. Tokenization: "Hello world" → [Hello, world]
  2. Vocabulary Indexing: Tokens mapped to integers.
  3. Embedding Layer: Integers expanded into dense continuous vectors.
  4. Recurrent Processing: An RNN/LSTM processes tokens step-by-step, updating its internal hidden state.
  5. The Context Vector: The final hidden state becomes the "Thought Vector"—a compressed mathematical summary of the entire input sentence.
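
💻 Code: The Encoder, Step by Step

(A compact NumPy sketch that follows the five steps above. The three-entry vocabulary, the whitespace tokenizer, and the random untrained weights are all stand-ins chosen for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Tokenization (whitespace split as a stand-in for a real tokenizer)
tokens = "Hello world".split()

# 2. Vocabulary indexing (hypothetical three-entry vocabulary)
vocab = {"<unk>": 0, "Hello": 1, "world": 2}
token_ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

# 3. Embedding layer: each integer selects one row of the embedding matrix
embed_dim, hidden_dim = 4, 3
embedding_matrix = rng.normal(size=(len(vocab), embed_dim))
embedded = embedding_matrix[token_ids]

# 4. Recurrent processing with untrained, randomly initialized weights
W_xh = rng.normal(scale=0.5, size=(hidden_dim, embed_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
h = np.zeros(hidden_dim)
for x in embedded:
    h = np.tanh(W_xh @ x + W_hh @ h)

# 5. The final hidden state serves as the context vector
context_vector = h

print("Tokens        :", tokens)
print("Token ids     :", token_ids)
print("Embedded shape:", embedded.shape)
print("Context vector:", np.round(context_vector, 3))
```

Whatever the input length, the context vector keeps the same fixed size (here 3 numbers).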

Deep Dive: The Decoder

  1. Initial State: Inherits the Encoder's final Context Vector as its absolute starting point.
  2. Autoregressive Generation: Generates the first token, then feeds that output back into itself to predict the next token, creating a sequential chain reaction.
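  3. Detokenization: Converts each predicted probability distribution into a token (e.g., the most probable vocabulary entry) and maps the token IDs back to human-readable text via the vocabulary.
  4. Termination: Halts generation when it predicts a special "EOS" (End of Sequence) token; in practice, a maximum length cap is applied as well.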
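
💻 Code: Autoregressive Decoding Loop

(A plain-Python sketch of the generate-and-feed-back loop with EOS termination. The score table `NEXT_TOKEN_SCORES` and the `<sos>` start token are hypothetical stand-ins for a trained decoder, which would also condition on its hidden state and the context vector.)

```python
# Hypothetical stand-in for a trained decoder step: maps the previous token
# to scores over the next token. A real decoder would also use its hidden state.
NEXT_TOKEN_SCORES = {
    "<sos>":   {"bonjour": 0.9, "monde": 0.1},
    "bonjour": {"le": 0.7, "<eos>": 0.3},
    "le":      {"monde": 0.8, "<eos>": 0.2},
    "monde":   {"<eos>": 0.95, "le": 0.05},
}

def greedy_decode(max_len=10):
    """Generate tokens one at a time, feeding each output back in as the next input."""
    output, prev = [], "<sos>"
    for _ in range(max_len):                 # length cap guards against a missing EOS
        scores = NEXT_TOKEN_SCORES[prev]
        prev = max(scores, key=scores.get)   # greedy choice: highest-scoring token
        if prev == "<eos>":                  # termination token ends generation
            break
        output.append(prev)
    return output

print(" ".join(greedy_decode()))
```

Greedy selection is only one decoding strategy; it is used here because it keeps the loop short.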

The Seq2Seq Bottleneck

The Information Crush

  • The Flaw: Forcing an entire complex paragraph into a single, fixed-size Context Vector is like trying to summarize a 3-hour movie on a sticky note. Critical nuance is lost.
  • The Revolutionary Fix: Instead of relying on one static vector, what if the Decoder could look back at the entire original input sequence dynamically while generating every single new word?

Self-Attention Mechanism

The Magic Trick of Modern NLP

A mathematical mechanism allowing a model to calculate how strongly every word in a sequence relates to every other word simultaneously.

"The animal didn't cross the street because it was too tired."

How does the machine know what 'it' refers to? The street or the animal? Self-Attention assigns a high attention weight between 'it' and 'animal', which lets the model resolve the coreference.

💻 Code: Self-Attention Math

(Using Matrix Multiplication to calculate focus weightings)


import numpy as np

# Sequence: "Bank", "of", "River"
words = ["Bank", "of", "River"]

# Simplified Embeddings: [Water_Feature, Financial_Feature]
vectors = np.array([
    [0.9, 0.1],  # Bank (Assuming riverbank context here)
    [0.1, 0.1],  # of
    [0.8, 0.2]   # River
])

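# Core of the self-attention formula: Q x K^T (Query matrix dot Key matrix).
# Simplification: the raw embeddings stand in for both Q and K (no learned projections, scaling, or softmax).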
attention_scores = np.dot(vectors, vectors.T)

print("Raw Attention Scores for 'Bank' against all words:")
print(np.round(attention_scores[0], 2))

# Identify which context word 'Bank' pays the most attention to
highest_attention_idx = np.argmax(attention_scores[0][1:]) + 1
print(f"\nThe word 'Bank' attends most strongly to: '{words[highest_attention_idx]}'")
print("💡 The model successfully contextualizes 'Bank' as a body of water!")
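
💻 Code: Scaled Attention with Softmax

(A follow-up sketch that turns raw dot-product scores into attention weights using the scaled dot-product form softmax(QKᵀ / √d) · V. As a simplification, Q, K, and V are all set to the input embeddings; a trained model would apply learned projection matrices first.)

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned projections)."""
    d_k = X.shape[-1]
    scores = X @ X.T / np.sqrt(d_k)   # similarity of every word with every other word
    weights = softmax(scores)         # each row becomes a probability distribution
    return weights, weights @ X       # outputs are weighted mixes of all word vectors

words = ["Bank", "of", "River"]
# Toy embeddings: [Water_Feature, Financial_Feature]
X = np.array([[0.9, 0.1], [0.1, 0.1], [0.8, 0.2]])

weights, contextualized = self_attention(X)
for word, row in zip(words, weights):
    print(f"{word:>5} attends with weights {np.round(row, 2)}")
```

Each row of `weights` sums to 1, and each row of `contextualized` is a new vector for that word, blended from the whole sequence.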
                        

Transformers

The Heart of Generative AI

By discarding slow RNN architecture entirely and relying solely on Attention mechanisms ("Attention Is All You Need", 2017), the Transformer unlocked massive parallel processing capabilities.

  • The Future is Here: This exact architecture serves as the foundation for the world's most powerful Generative AI models, including GPT, BERT, and LLaMA.
  • Models built on it can write poetry, generate code, and hold fluent conversations, often producing text that is hard to distinguish from human writing.

🎯 The NLP Pipeline Summary

From Raw Data to Intelligent Generation

① Data Processing

  • Tokenization (Words/Subwords)
  • Stop Words Removal
  • Stemming & Lemmatization
  • POS Tagging & NER

② Feature Engineering

  • Bag of Words (BoW)
  • TF-IDF Vectorization
  • N-grams & Markov Models

③ Semantic Representation

  • Word Embeddings (Word2Vec)
  • Contextual Embeddings (BERT)
  • Cosine Similarity Metrics

④ Deep Learning Architectures

  • RNN & LSTM Networks
  • Seq2Seq Topologies
  • Self-Attention Mechanism
  • Transformers (LLMs)

💡 Key Takeaway: There is no ML without clean data. There is no comprehension without embeddings. Modern AI is the culmination of this entire pipeline.