Bridging the gap between the fortress of logic and the realm of human emotion.
Natural Language Processing (NLP) - A Comprehensive Introduction.
Words are chameleons; they change meaning based entirely on their surrounding context.
Example: "He saw the bat."
The True Victory of NLP: Not merely reading strings of text, but successfully inferring the semantic context behind them.
The foundational pre-processing step of breaking down a continuous stream of text into smaller, meaningful, and machine-readable units called tokens.
Example: "Welcome to NLP! What is your name?"
Tokens: ["Welcome", "to", "NLP", "!", "What", "is", "your", "name", "?"]
Instead of splitting by whole words or individual characters, modern models split text into frequently occurring character sequences (Subwords). This is the secret engine powering LLMs like GPT and BERT.
Example: "Unhappiness"
Tokens: ["Un", "happi", "ness"]
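The merge rules behind real BPE/WordPiece vocabularies are learned from corpus statistics, but the matching step itself can be sketched with a toy greedy longest-match tokenizer. The three-piece vocabulary below is a hand-picked assumption chosen to reproduce the example above, not a real model's vocabulary.

```python
# Toy subword tokenizer: greedy longest-match against a fixed vocabulary.
# Real BPE/WordPiece learn their subword inventory from corpus frequencies;
# this tiny vocabulary is a hand-picked assumption for illustration only.
VOCAB = {"un", "happi", "ness"}

def subword_tokenize(word, vocab):
    """Split a word into the longest known vocabulary pieces, left to right."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest possible piece starting at position i first
        for j in range(len(word), i, -1):
            piece = word[i:j].lower()
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No known piece: fall back to a single-character token
            tokens.append(word[i])
            i += 1
    return tokens

print(subword_tokenize("Unhappiness", VOCAB))  # → ['un', 'happi', 'ness']
```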
(Implementing Word & Sentence Tokenization using NLTK standards)
from nltk.tokenize import WordPunctTokenizer
import re
text = "Welcome to NLP! What is your name?"
# 1. Word Tokenization
tokenizer = WordPunctTokenizer()
words = tokenizer.tokenize(text)
print("Word Tokens:", words)
# 2. Sentence Tokenization
def simple_sent_tokenize(text):
    """Simple regex-based sentence tokenizer (browser-friendly)."""
    # Split after sentence-ending punctuation (. ! ?) followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return [s.strip() for s in sentences if s.strip()]
sentences = simple_sent_tokenize(text)
print("Sentence Tokens:", sentences)
print("\n💡 Note: In production, nltk.sent_tokenize is used.")
Splitting by "whitespace" is not a universal solution!
The process of removing high-frequency words that serve a grammatical purpose but contribute little to the actual semantic meaning of a sentence (e.g., "a", "an", "the", "is").
Example: "This is a simple example of removing stop words"
Result: ['This', 'simple', 'example', 'removing', 'stop', 'words']
(Filtering noise using Scikit-Learn's built-in English corpus)
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.tokenize import WordPunctTokenizer
text = "This is a simple example of removing stop words"
# Tokenize the text into lowercase words
tokenizer = WordPunctTokenizer()
words = tokenizer.tokenize(text.lower())
# Filter out words that appear in the stop-word set
filtered_words = [w for w in words if w not in ENGLISH_STOP_WORDS]
print("Original Tokens:\n", words)
print("\nFiltered Meaningful Tokens:\n", filtered_words)
A crude, heuristic process that chops off the ends of words (suffixes like -ing, -ly, -ed) to reduce them to a base form (the Stem).
Example:
running -> run | easily -> easili
Unlike blind chopping, Lemmatization relies on vocabulary analysis and morphological rules to return the proper, dictionary base form of a word (the Lemma).
Example:
better -> good
leaves (Noun) -> leaf
leaves (Verb) -> leave
Conclusion: Use Stemming for sheer speed; use Lemmatization for semantic accuracy.
(Using standard NLTK algorithms: Porter and Snowball)
from nltk.stem import PorterStemmer, SnowballStemmer
porter = PorterStemmer()
snowball = SnowballStemmer("english")
words = ["running", "easily", "leaves", "fairly", "better"]
print(f"{'Original':10} | {'Porter':10} | {'Snowball':10}")
print("-" * 35)
for w in words:
    # Notice how both algorithms fail on words like 'better' and 'leaves'
    print(f"{w:10} | {porter.stem(w):10} | {snowball.stem(w):10}")
(Context-aware root extraction using NLTK's WordNet Lexicon)
from nltk.stem import WordNetLemmatizer
import nltk
# Download required WordNet datasets
try:
    nltk.download('wordnet', quiet=True)
    nltk.download('omw-1.4', quiet=True)
except Exception:
    print("⚠️ Browser environment blocked dataset download.")
    print("Expected theoretical output shown below:\n")
lemmatizer = WordNetLemmatizer()
words = ["better", "leaves", "running", "geese", "cacti"]
print(f"{'Word':12} | {'Lemma (Noun)':15} | {'Lemma (Verb)':15}")
print("-" * 45)
try:
    for w in words:
        # Lemmatization changes output based on grammatical part of speech
        noun_form = lemmatizer.lemmatize(w, pos='n')
        verb_form = lemmatizer.lemmatize(w, pos='v')
        print(f"{w:12} | {noun_form:15} | {verb_form:15}")
    # The comparative 'better' only maps back to 'good' as an adjective
    print(f"\n'better' as adjective -> {lemmatizer.lemmatize('better', pos='a')}")
except LookupError:
    print("better       | better          | better")
    print("leaves       | leaf            | leave")
    print("running      | running         | run")
    print("geese        | goose           | geese")
    print("cacti        | cactus          | cacti")
    print("\n'better' as adjective -> good")
print("\n💡 Note how 'better' becomes 'good' only when analyzed as an adjective!")
Part-of-Speech (POS) tagging is the process of labeling each word in a text corpus with its corresponding grammatical tag (e.g., Noun, Verb, Adjective).
Example: "I love learning NLP."
Benefit: It allows the machine to grasp syntactic structure and disambiguate words that can act as multiple parts of speech.
The task of identifying and classifying key informational elements (entities) present in a text into predefined categories like Persons, Organizations, Locations, etc.
Example: "Tim Cook is the CEO of Apple, located in California."
-> Tim Cook (PERSON), Apple (ORGANIZATION), California (LOCATION).
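Production NER relies on trained models (e.g., spaCy pipelines or nltk.ne_chunk). As a dependency-free illustration of the task's input and output, here is a toy gazetteer lookup over the example sentence; the entity lists are hand-picked assumptions, not a learned model.

```python
# Toy NER via gazetteer (dictionary) lookup. Real systems use trained
# statistical models; these entity lists are hand-picked assumptions.
GAZETTEER = {
    "Tim Cook": "PERSON",
    "Apple": "ORGANIZATION",
    "California": "LOCATION",
}

def toy_ner(text, gazetteer):
    """Return (entity, label) pairs found by exact string lookup."""
    found = []
    for entity, label in gazetteer.items():
        if entity in text:
            found.append((entity, label))
    return found

text = "Tim Cook is the CEO of Apple, located in California."
for entity, label in toy_ner(text, GAZETTEER):
    print(f"{entity:12} -> {label}")
```

A real model would also handle entities it has never seen, which a lookup table cannot.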
(Using NLTK's pre-trained Averaged Perceptron Tagger)
import nltk
from nltk import pos_tag
try:
    nltk.download('averaged_perceptron_tagger', quiet=True)
    nltk.download('averaged_perceptron_tagger_eng', quiet=True)
except Exception:
    print("⚠️ Browser blocked POS model download.\n")
sentence = "Dr Kalam lived in Delhi"
words = sentence.split()
print("NLTK POS Tagger Output:")
print("-" * 30)
try:
    # Machine-learning-based POS tagging (trained on the Penn Treebank corpus)
    pos_tags = pos_tag(words)
    for word, tag in pos_tags:
        print(f"{word:10} -> {tag}")
except LookupError:
    print("Dr         -> NNP (Proper Noun, Singular)")
    print("Kalam      -> NNP (Proper Noun, Singular)")
    print("lived      -> VBD (Verb, Past Tense)")
    print("in         -> IN  (Preposition / Conjunction)")
    print("Delhi      -> NNP (Proper Noun, Singular)")
print("\n💡 Penn Treebank Tagset uses 36 unique POS tags!")
The computational study of extracting subjective information, identifying whether the underlying emotional tone of a text is Positive, Negative, or Neutral.
(Implementing VADER - Valence Aware Dictionary and sEntiment Reasoner)
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
# VADER is highly optimized for social media text and microblogs
analyzer = SentimentIntensityAnalyzer()
text = "I absolutely love this NLP course, it is incredibly engaging! But the homework is awful."
# Generate polarity scores
scores = analyzer.polarity_scores(text)
print("Input Text:", text)
print("\nRaw Scoring Metrics:", scores)
# The 'compound' score is a normalized, weighted composite score
compound = scores['compound']
if compound >= 0.05:
    print("\nOverall Sentiment: Positive 😊")
elif compound <= -0.05:
    print("\nOverall Sentiment: Negative 😠")
else:
    print("\nOverall Sentiment: Neutral 😐")
Machines don't understand text; they understand numbers. We must first represent words as numeric vectors.
We use the algebraic dot product to measure how closely related two vectors are.
The Flaw of One-Hot: The words "Happy" and "Joyful" are represented as completely different vectors. Their dot product is 0. One-hot encoding captures zero semantic meaning!
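This flaw takes two lines of NumPy to verify: with one-hot encoding, every pair of distinct words is orthogonal, no matter how close their meanings are.

```python
import numpy as np

# One-hot vectors over a 4-word vocabulary: ["happy", "joyful", "sad", "dog"]
happy  = np.array([1, 0, 0, 0])
joyful = np.array([0, 1, 0, 0])

# Synonyms, yet no shared dimension means no shared meaning
print("happy · joyful =", np.dot(happy, joyful))  # → 0
print("happy · happy  =", np.dot(happy, happy))   # → 1
```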
Markov Chains: Predicting the next state based exclusively on the current state, ignoring all previous history.
Example: "The cat sat on the [___]".
The model looks only at the word 'the' and checks its statistical history to see how often 'mat' or 'chair' followed it.
A massive lookup table storing the probability distribution of shifting from one specific word to another.
*Historical Note: Andrey Markov originally developed this by analyzing the consonant/vowel distribution in Alexander Pushkin's poetry.
(Building a Transition Matrix using NLTK N-grams)
from nltk import ngrams, ConditionalFreqDist
text = "I love AI . I love Python . AI is amazing .".split()
# Extract Bigrams (Pairs of consecutive words)
bigrams = list(ngrams(text, 2))
print("Sample Bigrams:", bigrams[:4], "...\n")
# Build Conditional Frequency Distribution (Transition Matrix)
cfd = ConditionalFreqDist(bigrams)
print("Transition Matrix for 'AI':", dict(cfd['AI']))
print("Transition Matrix for 'love':", dict(cfd['love']))
# Next Word Prediction
current_word = "I"
# Max() returns the most statistically probable next word
next_word = cfd[current_word].max()
print(f"\nPrediction: After the word '{current_word}', the model predicts '{next_word}'.")
To predict the next word, the model uses a sliding window to look at the two preceding words (Trigrams) instead of just one.
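A minimal sketch of that trigram window, reusing the same toy corpus as the bigram demo: the ConditionalFreqDist is simply conditioned on the preceding word *pair* instead of a single word.

```python
from nltk import ngrams, ConditionalFreqDist

text = "I love AI . I love Python . AI is amazing .".split()

# Condition on the two preceding words instead of one
trigrams = ngrams(text, 3)
cfd = ConditionalFreqDist(((w1, w2), w3) for w1, w2, w3 in trigrams)

context = ("I", "love")
print(f"After {context}, candidates:", dict(cfd[context]))
print(f"Most probable next word: '{cfd[context].max()}'")
```

In this corpus "I love" is followed once by "AI" and once by "Python", so the two candidates tie; larger contexts give sharper, but sparser, statistics.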
Discarding word order entirely! Throwing the vocabulary into a bag and simply counting occurrences.
(Extracting features using Scikit-Learn's CountVectorizer)
from sklearn.feature_extraction.text import CountVectorizer
# Our mini-corpus
corpus = [
    "Dog bites man",
    "Man bites dog"
]
# Initialize and fit the BoW model
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# Retrieve the learned vocabulary
print("Vocabulary Index:", vectorizer.get_feature_names_out())
# Display the dense array representation
print("\nBoW Vectors:")
print(X.toarray())
print("\n🚨 Flaw Exposed: Both sentences result in identical vectors!")
An intelligent weighting schema designed to solve the flaws of BoW. The backbone of modern Information Retrieval systems.
(Applying intelligent statistical weighting to text)
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "Cats and dogs are enemies"
]
# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
feature_names = vectorizer.get_feature_names_out()
print("Vocabulary:", feature_names)
print("\nTF-IDF Scores for Document 1:")
first_doc_vector = X[0].toarray()[0]
# Display non-zero scores
for word, score in zip(feature_names, first_doc_vector):
    if score > 0:
        print(f"{word:10} -> {score:.3f}")
print("\n💡 Words unique to one document (cat, mat) carry a higher IDF weight than words shared across documents (the, sat, on)!")
Transforming discrete words into continuous, dense, multi-dimensional floating-point vectors (Digital DNA).
Rule: *"You shall know a word by the company it keeps."* (J.R. Firth, 1957; operationalized by Mikolov et al.'s word2vec in 2013)
vector("King") - vector("Man") + vector("Woman") ≈ vector("Queen")
(Understanding word relationships using Numpy arrays)
import numpy as np
# Hypothetical multi-dimensional word embeddings
# Dimensions: [Royalty_Score, Masculinity_Score]
vec_king = np.array([0.9, 0.9])
vec_man = np.array([0.0, 0.9])
vec_woman = np.array([0.0, -0.9])
# The Word2Vec Magic equation
vec_result = vec_king - vec_man + vec_woman
print("Computed Vector (King - Man + Woman):")
print(vec_result)
# Ideal vector for Queen = [Royalty, Femininity]
vec_queen = np.array([0.9, -0.9])
print("\nTarget Vector (Queen):")
print(vec_queen)
print("\n💡 The algebraic operation perfectly arrives at the semantic concept of 'Queen'!")
(The mathematical backbone of semantic search and NLP distance metrics)
import numpy as np
from numpy.linalg import norm
def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two vectors. Range: -1 to 1."""
    return np.dot(vec1, vec2) / (norm(vec1) * norm(vec2))
# Hypothetical embeddings (chosen so the printed conclusions actually hold)
vec_king = np.array([0.9, 0.8, 0.0])
vec_queen = np.array([0.85, 0.7, 0.0])
vec_man = np.array([0.1, 0.8, 0.0])
vec_apple = np.array([0.0, 0.0, 0.9])
print("Cosine Similarity Scores:")
print("-" * 30)
print(f"King & Queen : {cosine_similarity(vec_king, vec_queen):.3f}")
print(f"King & Man   : {cosine_similarity(vec_king, vec_man):.3f}")
print(f"King & Apple : {cosine_similarity(vec_king, vec_apple):.3f}")
print("\nConclusion: King and Queen share high semantic alignment.")
print("King and Apple are orthogonal (unrelated).")
The ultimate solution to the Word2Vec "Static" limitation.
The Need for Sequence: How do we differentiate grammatical structures over time?
An advanced RNN architecture specifically engineered to carry long-term dependencies across vast sequences.
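The gating machinery that lets an LSTM carry long-term state can be sketched as a single cell's forward step in NumPy. The weights below are random placeholders, not a trained model; the point is only how the forget, input, and output gates route information into and out of the cell state.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b stack the 4 gates: input, forget, cell, output."""
    z = W @ x + U @ h_prev + b          # pre-activations, shape (4*hidden,)
    H = len(h_prev)
    i = sigmoid(z[0:H])                 # input gate: what to write
    f = sigmoid(z[H:2*H])               # forget gate: what to keep from c_prev
    g = np.tanh(z[2*H:3*H])             # candidate cell values
    o = sigmoid(z[3*H:4*H])             # output gate: what to expose
    c = f * c_prev + i * g              # cell state carries long-term memory
    h = o * np.tanh(c)                  # hidden state is the step's output
    return h, c

hidden, inputs = 3, 2
W = rng.normal(size=(4 * hidden, inputs))   # random placeholder weights
U = rng.normal(size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)

h = np.zeros(hidden)
c = np.zeros(hidden)
for x in [np.array([1.0, 0.0]), np.array([0.0, 1.0])]:
    h, c = lstm_step(x, h, c, W, U, b)
print("Hidden state after 2 steps:", np.round(h, 3))
```

The additive update `c = f * c_prev + i * g` is what lets gradients, and therefore dependencies, survive across long sequences.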
How do we capture the "essence" of one language and generate an entirely new sequence? (e.g., Machine Translation, Summarization).
A mathematical mechanism allowing a model to calculate how strongly every word in a sequence relates to every other word simultaneously.
"The animal didn't cross the street because it was too tired."
How does the machine know what 'it' refers to? The street or the animal? Self-Attention assigns massive mathematical weight between 'it' and 'animal', instantly resolving the coreference.
(Using Matrix Multiplication to calculate focus weightings)
import numpy as np
# Sequence: "Bank", "of", "River"
words = ["Bank", "of", "River"]
# Simplified Embeddings: [Water_Feature, Financial_Feature]
vectors = np.array([
    [0.9, 0.1],  # Bank (assuming riverbank context here)
    [0.1, 0.1],  # of
    [0.8, 0.2]   # River
])
# Self-Attention score core: Q x K^T (here Q = K = the raw embeddings, for simplicity)
attention_scores = np.dot(vectors, vectors.T)
print("Raw Attention Scores for 'Bank' against all words:")
print(np.round(attention_scores[0], 2))
# Identify which context word 'Bank' pays the most attention to
highest_attention_idx = np.argmax(attention_scores[0][1:]) + 1
print(f"\nThe word 'Bank' attends most strongly to: '{words[highest_attention_idx]}'")
print("💡 The model successfully contextualizes 'Bank' as a body of water!")
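For reference, the full scaled dot-product attention from the Transformer paper adds a 1/√d_k scale and a softmax on top of the raw Q·Kᵀ scores computed above. A minimal NumPy sketch, reusing the same toy embeddings and again letting Q = K = V:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# Same toy embeddings as the 'Bank of River' demo above
vectors = np.array([
    [0.9, 0.1],  # Bank
    [0.1, 0.1],  # of
    [0.8, 0.2],  # River
])

output, weights = scaled_dot_product_attention(vectors, vectors, vectors)
print("Attention weights for 'Bank':", np.round(weights[0], 3))
print("Context-mixed vector for 'Bank':", np.round(output[0], 3))
```

Each output row is a weighted blend of every word's value vector, which is exactly what "every word attends to every other word simultaneously" means.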
By discarding slow RNN architecture entirely and relying solely on Attention mechanisms ("Attention Is All You Need", 2017), the Transformer unlocked massive parallel processing capabilities.
① Data Processing
② Feature Engineering
③ Semantic Representation
④ Deep Learning Architectures
💡 Key Takeaway: There is no ML without clean data. There is no comprehension without embeddings. Modern AI is the culmination of this entire pipeline.