Bridging the gap between the fortress of logic and the realm of human emotion.
Natural Language Processing (NLP) - A Comprehensive Introduction.
Interactive Python Examples · Live Code Execution
Words are chameleons; they change meaning based entirely on their surrounding context.
"He saw the bat." - What does 'bat' mean here?
The True Victory of NLP: Not merely reading strings of text, but successfully inferring the semantic context behind them.
Tokenization: the foundational pre-processing step of breaking a continuous stream of text into smaller, meaningful, machine-readable units called tokens.
Example: "Welcome to NLP! What is your name?"
Tokens: ["Welcome", "to", "NLP", "!", "What", "is", "your", "name", "?"]
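Instead of splitting by whole words or individual characters, modern models split text into frequently occurring character sequences (subwords). Subword schemes such as Byte-Pair Encoding (used by GPT) and WordPiece (used by BERT) let a fixed-size vocabulary cover rare and unseen words.
Example: "Unhappiness"
Tokens: ["Un", "happi", "ness"] (illustrative; the actual split depends on the learned vocabulary)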
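The idea of splitting a word into known pieces can be sketched with a greedy longest-match-first loop, similar in spirit to how WordPiece segments a word at inference time. This is a toy under stated assumptions: `toy_vocab` is a hand-picked, hypothetical vocabulary (real vocabularies are learned from a corpus), and the "##" continuation markers that BERT uses are left out for brevity.

```python
def subword_tokenize(word, vocab):
    """Greedy longest-match-first segmentation of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate substring until it matches a vocabulary entry
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No vocabulary entry starts at this position
            return ["[UNK]"]
        tokens.append(word[start:end])
        start = end
    return tokens

# Hypothetical vocabulary, hand-picked for illustration only
toy_vocab = {"un", "happi", "ness", "happy", "ly", "kind"}

print(subword_tokenize("unhappiness", toy_vocab))
print(subword_tokenize("unkindly", toy_vocab))
```

With this vocabulary, a word the model has never seen as a whole ("unkindly") is still covered by pieces it does know.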
(NLTK WordPunctTokenizer & PunktSentenceTokenizer)
from nltk.tokenize import WordPunctTokenizer, PunktSentenceTokenizer
text = "Welcome to NLP! What is your name?"
# NLTK Tokenizers - no data download needed
word_tok = WordPunctTokenizer()
sent_tok = PunktSentenceTokenizer()
# 1. Word Tokenization
words = word_tok.tokenize(text)
print("Word Tokens:", words)
# 2. Sentence Tokenization
sentences = sent_tok.tokenize(text)
print("Sentence Tokens:", sentences)
print("\n💡 NLTK Tokenizers: work without data downloads!")
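Splitting by "whitespace" is not a universal solution! Languages such as Chinese and Japanese do not put spaces between words, and even in English, punctuation and contractions stay glued to their neighbours.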
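A small standard-library sketch of the problem in English. The sample sentence is made up, and the regular expression is just one simple way to separate word characters from punctuation, not a complete tokenizer.

```python
import re

text = "Don't panic, it's only NLP!"

# Naive approach: split on whitespace only
whitespace_tokens = text.split()
print(whitespace_tokens)

# Regex approach: runs of word characters, or runs of other non-space symbols
regex_tokens = re.findall(r"\w+|[^\w\s]+", text)
print(regex_tokens)
```

The whitespace version keeps "panic," and "NLP!" as single tokens, while the regex version separates the punctuation but also breaks "Don't" into three pieces. Neither choice is right for every task.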
Stop word removal: the process of removing high-frequency words that serve a grammatical purpose but contribute little to the actual semantic meaning of a sentence (e.g., "a", "an", "the", "is").
Example: "This is a simple example of removing stop words"
Result (after lowercasing): ['simple', 'example', 'removing', 'stop', 'words']
(Filtering noise using Scikit-Learn's built-in English corpus)
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.tokenize import WordPunctTokenizer
text = "This is a simple example of removing stop words"
# Tokenize the text into lowercase words
tokenizer = WordPunctTokenizer()
words = tokenizer.tokenize(text.lower())
# Filter out words that exist in the Stop Words dictionary
filtered_words = [w for w in words if w not in ENGLISH_STOP_WORDS]
print("Original Tokens:\n", words)
print("\nFiltered Meaningful Tokens:\n", filtered_words)
Stemming: a crude, heuristic process that chops off the ends of words (suffixes like -ing, -ly, -ed) to reduce them to a base form (the stem), which need not be a real word.
Example:
running -> run | easily -> easili
Unlike blind chopping, Lemmatization relies on vocabulary analysis and morphological rules to return the proper, dictionary base form of a word (the Lemma).
Example:
better -> good
leaves (Noun) -> leaf
leaves (Verb) -> leave
Conclusion: Use Stemming for sheer speed; use Lemmatization for semantic accuracy.
(Using standard NLTK algorithms: Porter and Snowball)
from nltk.stem import PorterStemmer, SnowballStemmer
porter = PorterStemmer()
snowball = SnowballStemmer("english")
words = ["running", "easily", "leaves", "fairly", "better"]
print(f"{'Original':10} | {'Porter':10} | {'Snowball':10}")
print("-" * 35)
for w in words:
    # Notice how both algorithms fail on words like 'better' and 'leaves'
    print(f"{w:10} | {porter.stem(w):10} | {snowball.stem(w):10}")
(simplemma - bundled data, 50+ languages)
import simplemma
# simplemma - bundled data, no downloads needed
words = ["better", "leaves", "running", "geese", "studies", "flying"]
print(f"{'Word':12} | {'Lemma':15}")
print("-" * 30)
for w in words:
    # Find the root form using simplemma
    lemma = simplemma.lemmatize(w, lang='en')
    print(f"{w:12} | {lemma:15}")
print("\n💡 simplemma: 50+ languages, works directly in browser!")
Part-of-Speech (POS) tagging is the process of labeling each word in a text corpus with its corresponding grammatical tag (e.g., Noun, Verb, Adjective).
Example: "I love learning NLP."
Benefit: It allows the machine to grasp syntactic structure and disambiguate words that can act as multiple parts of speech.
Named Entity Recognition (NER): the task of identifying key informational elements (entities) in a text and classifying them into predefined categories such as Persons, Organizations, and Locations.
Example: "Tim Cook is the CEO of Apple, located in California."
-> Tim Cook (PERSON), Apple (ORGANIZATION), California (LOCATION).
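The input and output of entity recognition can be illustrated with a toy dictionary lookup. This is only a sketch: `GAZETTEER` and `find_entities` are hypothetical names introduced here, and the table is hand-written. Real NER systems use context, so they can tell Apple the company from an apple; this lookup cannot.

```python
# Hypothetical gazetteer: a hand-written lookup table, not a trained model
GAZETTEER = {
    "Tim Cook": "PERSON",
    "Apple": "ORGANIZATION",
    "California": "LOCATION",
}

def find_entities(text, gazetteer):
    """Return (entity, label, start_offset) for every gazetteer hit, in text order."""
    hits = []
    for name, label in gazetteer.items():
        start = text.find(name)
        while start != -1:
            hits.append((name, label, start))
            start = text.find(name, start + len(name))
    return sorted(hits, key=lambda hit: hit[2])

sentence = "Tim Cook is the CEO of Apple, located in California."
for name, label, start in find_entities(sentence, GAZETTEER):
    print(f"{name:12} -> {label} (char {start})")
```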
(NLTK UnigramTagger - lookup-table model with a default-tag backoff)
from nltk import UnigramTagger, DefaultTagger
sentence = "Dr Kalam lived in Delhi"
words = sentence.split()
# Hand-written word -> tag table using Penn Treebank tag names.
# (Illustrative only: a real UnigramTagger learns this table from a tagged corpus.)
lookup_table = {
    'Dr': 'NNP', 'Mr': 'NNP', 'Mrs': 'NNP', 'Ms': 'NNP',
    'in': 'IN', 'on': 'IN', 'at': 'IN', 'to': 'TO', 'for': 'IN',
    'the': 'DT', 'a': 'DT', 'an': 'DT', 'The': 'DT',
    'is': 'VBZ', 'are': 'VBP', 'was': 'VBD', 'were': 'VBD',
    'lived': 'VBD', 'went': 'VBD', 'came': 'VBD', 'said': 'VBD',
    'I': 'PRP', 'you': 'PRP', 'he': 'PRP', 'she': 'PRP', 'it': 'PRP',
    'and': 'CC', 'but': 'CC', 'or': 'CC',
}
# UnigramTagger with DefaultTagger fallback (for unknown words)
default = DefaultTagger('NNP')  # Proper nouns as default
tagger = UnigramTagger(model=lookup_table, backoff=default)
pos_tags = tagger.tag(words)
print("NLTK UnigramTagger (lookup table + backoff):")
print("-" * 40)
for word, tag in pos_tags:
    print(f"{word:10} -> {tag}")
print("\n💡 UnigramTagger: one tag per known word; in practice the table is learned from a tagged corpus!")
Machines don't understand text; they understand numbers. We must represent words as vectors.
We use the algebraic dot product to measure how closely related two vectors are.
The Flaw of One-Hot: The words "Happy" and "Joyful" are represented as completely different vectors. Their dot product is 0. One-hot encoding captures zero semantic meaning!
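A minimal sketch of this flaw using plain Python lists. The three-word vocabulary and the helper names `one_hot` and `dot` are made up for illustration.

```python
vocabulary = ["happy", "joyful", "sad"]

def one_hot(word, vocab):
    """Vector of zeros with a single 1 at the word's vocabulary index."""
    return [1 if entry == word else 0 for entry in vocab]

def dot(u, v):
    """Algebraic dot product: sum of element-wise products."""
    return sum(a * b for a, b in zip(u, v))

happy = one_hot("happy", vocabulary)
joyful = one_hot("joyful", vocabulary)

print("happy  :", happy)
print("joyful :", joyful)
print("happy . joyful =", dot(happy, joyful))  # 0: no overlap at all
print("happy . happy  =", dot(happy, happy))   # 1: a word only matches itself
```

Every pair of distinct words scores 0, so "happy" is exactly as far from "joyful" as it is from "sad".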
Bag of Words (BoW) discards word order entirely! It throws each document's words into a bag and simply counts them.
(Extracting features using Scikit-Learn's CountVectorizer)
from sklearn.feature_extraction.text import CountVectorizer
# Our mini-corpus
corpus = [
    "Dog bites man",
    "Man bites dog"
]
# Initialize and fit the BoW model
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# Retrieve the learned vocabulary
print("Vocabulary Index:", vectorizer.get_feature_names_out())
# Display the dense array representation
print("\nBoW Vectors:")
print(X.toarray())
print("\n🚨 Flaw Exposed: Both sentences result in identical vectors!")
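TF-IDF (Term Frequency - Inverse Document Frequency): a weighting scheme designed to address a key flaw of BoW, where raw counts let frequent but uninformative words dominate. A term's weight rises with its count in a document and falls with the number of documents that contain it. A long-standing backbone of Information Retrieval systems.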
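The weighting can be worked out by hand in a few lines. This sketch assumes one common variant: raw term count multiplied by a smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is the variant scikit-learn documents as the default for TfidfVectorizer. The whitespace tokenizer and the pre-lowercased sentences are simplifications.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are enemies",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Raw count x smoothed IDF, then L2-normalised (assumed variant)."""
    counts = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

scores = tfidf(tokenized[0])
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term:5} -> {score:.3f}")
```

In the first document, "cat" and "mat" (found in one document) outrank "sat" and "on" (found in two). "the" still comes out on top because it occurs twice in that sentence, which is why stop word removal is often combined with TF-IDF.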
(Applying intelligent statistical weighting to text)
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "Cats and dogs are enemies"
]
# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
feature_names = vectorizer.get_feature_names_out()
print("Vocabulary:", feature_names)
print("\nTF-IDF Scores for Document 1:")
first_doc_vector = X[0].toarray()[0]
# Display non-zero scores
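for word, score in zip(feature_names, first_doc_vector):
    if score > 0:
        print(f"{word:10} -> {score:.3f}")
print("\n💡 Notice how cat and mat outscore sat and on, which also appear in document 2!")
print("   'the' still scores highest because it occurs twice here; pass stop_words='english' to drop it.")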