AI - Part 4:
Language & Machines


Bridging the gap between the fortress of logic and the realm of human emotion.

Natural Language Processing (NLP) - A Comprehensive Introduction.

Interactive Python Examples · Live Code Execution

The Enigma of Language

Fortress of Logic vs. Realm of Emotion

  • The World of AI: Historically built on numbers, rigid rules, and perfect mathematical equations (e.g., Chess, AlphaGo, Protein Folding). Everything is binary: either right or wrong.
  • The Human World: Driven by emotions, context, sarcasm, and ambiguity.
  • The Challenge: Sentences like "He's a big shot" or "That joke killed me" are inherently confusing to machines, because their intended meaning differs from their literal one.
  • The Solution: NLP (Natural Language Processing) acts as the bridge, enabling machines to understand, interpret, and generate human language.

The Complexity of Language

One Word, Multiple Worlds (Ambiguity)

Words are chameleons; they change meaning based entirely on their surrounding context.

The Question This Raises:

"He saw the bat." - What does 'bat' mean here?

  • Is "bat" a piece of sporting equipment 🏏 or a flying mammal 🦇?
  • "He saw the bat flying at the zoo." → Mammal.
  • "He saw the bat in the sports equipment aisle." → Equipment.

The True Victory of NLP: Not merely reading strings of text, but successfully inferring the semantic context behind them.
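To make the idea of inferring context concrete, here is a deliberately tiny sketch that picks a sense for "bat" by counting overlapping cue words. The cue lists and the `guess_sense` helper are hypothetical, hand-picked for this example; real systems learn such signals from data rather than from a fixed list.

```python
import re

# Hypothetical cue words for each sense of "bat" (hand-picked for illustration)
SENSE_CUES = {
    "mammal": {"flying", "zoo", "cave", "wings", "nocturnal"},
    "equipment": {"sports", "equipment", "aisle", "cricket", "baseball", "swing"},
}

def guess_sense(sentence):
    """Return the sense whose cue words overlap most with the sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(words & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    # With no cue word present, the sentence alone does not decide the sense
    return best if scores[best] > 0 else "unknown"

print(guess_sense("He saw the bat flying at the zoo."))
print(guess_sense("He saw the bat in the sports equipment aisle."))
print(guess_sense("He saw the bat."))
```

The last call returns "unknown", which mirrors the point of the slide: without surrounding context, the bare sentence is genuinely ambiguous.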

Tokenization

Slicing the Language

The foundational pre-processing step of breaking down a continuous stream of text into smaller, meaningful, and machine-readable units called tokens.

  • Two Fundamental Approaches:
    1. Sentence Tokenization (Splitting paragraphs into sentences)
    2. Word Tokenization (Splitting sentences into words/punctuation)

Example: "Welcome to NLP! What is your name?"
Tokens: ["Welcome", "to", "NLP", "!", "What", "is", "your", "name", "?"]

Modern Tokenization

Subword Tokenization (BPE / WordPiece)

Instead of splitting by whole words or individual characters, modern models split text into frequently occurring character sequences (subwords). This is the tokenization scheme behind models such as GPT (BPE) and BERT (WordPiece).

Example: "Unhappiness"
Tokens: ["Un", "happi", "ness"]

  • Why is this crucial?: It largely sidesteps the Out-Of-Vocabulary (OOV) problem: a new or misspelled word can still be represented as a sequence of known subword pieces.
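A minimal sketch of how a word can be broken into known pieces, assuming a tiny hand-written vocabulary. The `greedy_subword_split` helper and `toy_vocab` are hypothetical; the greedy longest-match strategy is in the spirit of WordPiece inference, but real tokenizers learn their vocabulary from data and mark word-internal pieces (for example with a "##" prefix), which is omitted here.

```python
def greedy_subword_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation over a fixed subword vocabulary."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate from the right until it is a known piece
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts at this position: give up on the word
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical toy vocabulary (a real one holds tens of thousands of pieces)
toy_vocab = {"un", "happi", "ness", "kind", "ly"}

print(greedy_subword_split("unhappiness", toy_vocab))
print(greedy_subword_split("unkindly", toy_vocab))
print(greedy_subword_split("xyz", toy_vocab))
```

"unkindly" never appears in the vocabulary as a whole word, yet it is still covered by the pieces "un", "kind", and "ly".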

💻 Code: Tokenization

(NLTK WordPunctTokenizer & PunktSentenceTokenizer)


from nltk.tokenize import WordPunctTokenizer, PunktSentenceTokenizer

text = "Welcome to NLP! What is your name?"

# NLTK Tokenizers - no data download needed
word_tok = WordPunctTokenizer()
sent_tok = PunktSentenceTokenizer()

# 1. Word Tokenization
words = word_tok.tokenize(text)
print("Word Tokens:", words)

# 2. Sentence Tokenization
sentences = sent_tok.tokenize(text)
print("Sentence Tokens:", sentences)

print("\n💡 NLTK Tokenizers: work without data downloads!")
                        

Challenges in Tokenization

Every Language is a Puzzle

Splitting by "whitespace" is not a universal solution!

  • English: Has spaces, but issues arise with hyphenations or contractions (How do we split "Don't" or "New York"?).
  • Chinese/Japanese: Continuous strings with no spaces ("我是学生"). Requires complex dictionary lookups and statistical modeling.
  • Agglutinative Languages (e.g., Tamil, Turkish): Words are formed by stringing together morphemes. Complex morphological parsers are required to split root words from their suffixes.
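The first two cases above can be reproduced with nothing but the standard library. This is only an illustration of where naive splitting breaks down, not a recommended tokenizer; the regular expression is an assumption chosen for the demo.

```python
import re

english = "Don't go to New York"
chinese = "我是学生"

# Whitespace splitting keeps "Don't" whole but separates "New" from "York"
whitespace_tokens = english.split()
print(whitespace_tokens)

# A simple word/punctuation regex breaks the contraction into three pieces
regex_tokens = re.findall(r"\w+|[^\w\s]", english)
print(regex_tokens)

# With no spaces in the text, whitespace splitting returns one giant token
chinese_tokens = chinese.split()
print(chinese_tokens)
```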

Stop Words

Filtering Noise to Hear the Music

The process of removing high-frequency words that serve a grammatical purpose but contribute little to the actual semantic meaning of a sentence (e.g., "a", "an", "the", "is").

  • Benefit: Reduces vocabulary size and computational load, and can improve the accuracy of basic information retrieval models.

Example: "This is a simple example of removing stop words"
Result: ['simple', 'example', 'removing', 'stop', 'words']

💻 Code: Stop Words Removal

(Filtering noise using Scikit-Learn's built-in English stop-word list)


from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.tokenize import WordPunctTokenizer

text = "This is a simple example of removing stop words"

# Tokenize the text into lowercase words
tokenizer = WordPunctTokenizer()
words = tokenizer.tokenize(text.lower())

# Filter out words that exist in the Stop Words dictionary
filtered_words = [w for w in words if w not in ENGLISH_STOP_WORDS]

print("Original Tokens:\n", words)
print("\nFiltered Meaningful Tokens:\n", filtered_words)
                        

Stemming

A Brute-Force Search for the Root

A crude, heuristic process that chops off the ends of words (suffixes like -ing, -ly, -ed) to reduce them to a base form (the Stem).

Example:
running -> run | easily -> easili

  • Pros/Cons: Extremely fast and computationally cheap. However, it often produces non-words ("easili") and completely ignores grammatical context.

Lemmatization

The Intelligent Sculptor

Unlike blind chopping, Lemmatization relies on vocabulary analysis and morphological rules to return the proper, dictionary base form of a word (the Lemma).

Example:
better -> good
leaves (Noun) -> leaf
leaves (Verb) -> leave

Conclusion: Use Stemming for sheer speed; use Lemmatization for semantic accuracy.
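The "leaves" example shows that the correct lemma can depend on the part of speech. A minimal sketch of that idea, assuming a tiny hand-written table: `LEMMA_TABLE` and `lookup_lemma` are hypothetical and cover only the words on this slide, whereas a real lemmatizer relies on a full dictionary plus morphological rules.

```python
# Hypothetical lookup keyed by (word, part of speech)
LEMMA_TABLE = {
    ("better", "ADJ"): "good",
    ("leaves", "NOUN"): "leaf",
    ("leaves", "VERB"): "leave",
}

def lookup_lemma(word, pos):
    """Return the lemma for (word, pos); fall back to the lowercased word."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(lookup_lemma("leaves", "NOUN"))
print(lookup_lemma("leaves", "VERB"))
print(lookup_lemma("better", "ADJ"))
```

The same surface form maps to two different lemmas, which is exactly the information a stemmer throws away.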

💻 Code: Stemming in Practice

(Using standard NLTK algorithms: Porter and Snowball)


from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

words = ["running", "easily", "leaves", "fairly", "better"]

print(f"{'Original':10} | {'Porter':10} | {'Snowball':10}")
print("-" * 35)

for w in words:
    # Notice that neither stemmer maps 'better' to 'good' or 'leaves' to a dictionary word
    print(f"{w:10} | {porter.stem(w):10} | {snowball.stem(w):10}")
                        

💻 Code: Lemmatization (simplemma)

(simplemma - bundled data, 50+ languages)


import simplemma

# simplemma - bundled data, no downloads needed
words = ["better", "leaves", "running", "geese", "studies", "flying"]

print(f"{'Word':12} | {'Lemma':15}")
print("-" * 30)

for w in words:
    # Find the root form using simplemma
    lemma = simplemma.lemmatize(w, lang='en')
    print(f"{w:12} | {lemma:15}")

print("\n💡 simplemma: 50+ languages, works directly in browser!")
                        

POS Tagging

Assigning Grammatical Roles

Part-of-Speech (POS) tagging is the process of labeling each word in a text corpus with its corresponding grammatical tag (e.g., Noun, Verb, Adjective).

Example: "I love learning NLP."

  • I (Pronoun), love (Verb), learning (Gerund), NLP (Proper Noun).

Benefit: It allows the machine to grasp syntactic structure and disambiguate words that can act as multiple parts of speech.

NER (Named Entity Recognition)

Extracting the Real World

The task of identifying and classifying key informational elements (entities) present in a text into predefined categories like Persons, Organizations, Locations, etc.

Example: "Tim Cook is the CEO of Apple, located in California."
-> Tim Cook (PERSON), Apple (ORGANIZATION), California (LOCATION).

  • The Challenge: Ambiguity. Is "Apple" the fruit or the tech giant? Contextual NER models use the surrounding words to decide, though they are not infallible.
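The simplest possible take on NER is a dictionary (gazetteer) lookup. The sketch below is an assumption-laden toy: `GAZETTEER` and `tag_entities` are hypothetical and hand-written for the example sentence, and a plain lookup like this cannot tell the company "Apple" from the fruit, which is why practical systems use contextual models.

```python
import re

# Hypothetical gazetteer: lowercase token tuples mapped to entity types
GAZETTEER = {
    ("tim", "cook"): "PERSON",
    ("apple",): "ORGANIZATION",
    ("california",): "LOCATION",
}

def tag_entities(text):
    """Scan left to right, preferring the longest gazetteer match at each position."""
    tokens = re.findall(r"\w+", text)
    max_len = max(len(key) for key in GAZETTEER)
    entities = []
    i = 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):
            if i + n > len(tokens):
                continue  # not enough tokens left for an n-token phrase
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                entities.append((" ".join(tokens[i:i + n]), GAZETTEER[key]))
                i += n
                break
        else:
            i += 1  # no entity starts at this token
    return entities

print(tag_entities("Tim Cook is the CEO of Apple, located in California."))
```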

💻 Code: POS Tagging (UnigramTagger)

(NLTK UnigramTagger - per-word lookup tagger with a default fallback)


from nltk import UnigramTagger, DefaultTagger

sentence = "Dr Kalam lived in Delhi"
words = sentence.split()

# Hand-written word -> tag lookup using Penn Treebank tag names
# (Illustrative only: a real UnigramTagger learns this table from a tagged corpus)
trained_data = {
    'Dr': 'NNP', 'Mr': 'NNP', 'Mrs': 'NNP', 'Ms': 'NNP',
    'in': 'IN', 'on': 'IN', 'at': 'IN', 'to': 'TO', 'for': 'IN',
    'the': 'DT', 'a': 'DT', 'an': 'DT', 'The': 'DT',
    'is': 'VBZ', 'are': 'VBP', 'was': 'VBD', 'were': 'VBD',
    'lived': 'VBD', 'went': 'VBD', 'came': 'VBD', 'said': 'VBD',
    'I': 'PRP', 'you': 'PRP', 'he': 'PRP', 'she': 'PRP', 'it': 'PRP',
    'and': 'CC', 'but': 'CC', 'or': 'CC',
}

# UnigramTagger with DefaultTagger fallback (for unknown words)
default = DefaultTagger('NNP')  # Proper nouns as default
tagger = UnigramTagger(model=trained_data, backoff=default)

pos_tags = tagger.tag(words)

print("NLTK UnigramTagger (Lookup Model):")
print("-" * 40)

for word, tag in pos_tags:
    print(f"{word:10} -> {tag}")

print("\n💡 UnigramTagger: tags each word by lookup; train it on a tagged corpus for real use!")
                        

The Mathematical Foundation

Vocabulary & One-Hot Encoding

Machines don't understand text; they understand numbers. We must represent words as vectors.

  • Vocabulary Indexing: Assigning integers (I=1, Drink=2). Flaw: The machine might mathematically assume 2 is 'greater' or 'better' than 1.
  • One-Hot Encoding: Representing words as mutually exclusive, sparse binary vectors to prevent mathematical bias.
    • 'I' → [1, 0, 0]
    • 'Water' → [0, 1, 0]
    • 'Drink' → [0, 0, 1]
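A minimal sketch of building these vectors in plain Python. The vocabulary order is taken from the bullets above; the `one_hot` helper is a hypothetical name introduced for the example.

```python
# Vocabulary in the same order as the slide: I, Water, Drink
vocab = ["I", "Water", "Drink"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

for word in vocab:
    print(f"{word:6} -> {one_hot(word)}")
```

Each vector has as many dimensions as the vocabulary has words, so the representation grows very sparse as the vocabulary grows.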

Dot Product & Orthogonality

Measuring Vector Relationships

We use the algebraic dot product to measure how closely related two vectors are.

  • Self-Correlation: 'Cat' [1,0,0] . 'Cat' [1,0,0] = 1 (Perfect match).
  • Orthogonality: 'Cat' [1,0,0] . 'Dog' [0,1,0] = 0 (Orthogonal vectors: zero similarity).

The Flaw of One-Hot: The words "Happy" and "Joyful" are represented as completely different vectors. Their dot product is 0. One-hot encoding captures zero semantic meaning!
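The dot products above can be checked with a few lines of plain Python. The four-word vocabulary below is an assumption made for the demo, so the vectors have four dimensions rather than the three shown on the slide.

```python
def dot(u, v):
    """Algebraic dot product: sum of element-wise products."""
    return sum(a * b for a, b in zip(u, v))

# One-hot vectors over a hypothetical vocabulary: cat, dog, happy, joyful
cat    = [1, 0, 0, 0]
dog    = [0, 1, 0, 0]
happy  = [0, 0, 1, 0]
joyful = [0, 0, 0, 1]

print("cat . cat      =", dot(cat, cat))
print("cat . dog      =", dot(cat, dog))
print("happy . joyful =", dot(happy, joyful))
```

"happy" and "joyful" come out exactly as unrelated as "cat" and "dog", which is the limitation the slide describes.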

The Realm of Embeddings

Bag of Words (BoW)

Discarding word order entirely! Throw every word of a document into a "bag" and simply count.

  • Mechanism: Creates a vocabulary index and counts the occurrence frequency of each word in a document. The document becomes a sparse numerical vector [1, 1, 2, 0, 1].
  • Applications:
    1. Spam Detection (Counting "Free", "Win", "$$").
    2. Basic Sentiment Analysis.
    3. Information Retrieval (Legacy Search Engines).

The Flaws of BoW

The Vulnerability of Simplicity

  • Zero Context: "Dog bites man" and "Man bites dog" yield the exact same BoW representation! Destroying word order destroys meaning.
  • Zero Weighting: A highly critical word like "Danger" is given the exact same numerical weight as a generic pronoun.

💻 Code: Bag of Words (BoW)

(Extracting features using Scikit-Learn's CountVectorizer)


from sklearn.feature_extraction.text import CountVectorizer

# Our mini-corpus
corpus = [
    "Dog bites man",
    "Man bites dog"
]

# Initialize and fit the BoW model
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Retrieve the learned vocabulary
print("Vocabulary Index:", vectorizer.get_feature_names_out())

# Display the dense array representation
print("\nBoW Vectors:")
print(X.toarray())
print("\n🚨 Flaw Exposed: Both sentences result in identical vectors!")
                        

TF-IDF Vectorization

Term Frequency-Inverse Document Frequency

An intelligent weighting scheme designed to fix BoW's lack of term weighting (word order is still ignored). A long-standing backbone of Information Retrieval systems.

  • TF (Term Frequency): How often a word appears in a specific document.
  • IDF (Inverse Document Frequency): Penalizes highly frequent words (like "the") across all documents, while heavily rewarding rare, domain-specific words.
  • Formula: TF × IDF = The true 'importance weight' of a token.
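The formula can be worked through by hand in a few lines. This sketch assumes one common textbook variant, TF = count / document length and IDF = ln(N / df); the `tf_idf` helper is a hypothetical name. Library implementations differ in detail: scikit-learn's TfidfVectorizer, for instance, uses a smoothed IDF and L2-normalizes each document by default, so its numbers will not match these.

```python
import math
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are enemies",
]
docs = [text.split() for text in corpus]
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    """Textbook TF-IDF; only defined for terms that occur somewhere in the corpus."""
    tf = doc.count(term) / len(doc)
    idf = math.log(n_docs / df[term])
    return tf * idf

for term in ["the", "cat", "sat"]:
    print(f"{term:5} -> {tf_idf(term, docs[0]):.3f}")
```

Under this variant, "cat" (one occurrence, found in a single document) outweighs "the" (two occurrences, found in two of the three documents).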

💻 Code: TF-IDF Extraction

(Applying intelligent statistical weighting to text)


from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "Cats and dogs are enemies"
]

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

feature_names = vectorizer.get_feature_names_out()
print("Vocabulary:", feature_names)

print("\nTF-IDF Scores for Document 1:")
first_doc_vector = X[0].toarray()[0]

# Display non-zero scores
for word, score in zip(feature_names, first_doc_vector):
    if score > 0:
        print(f"{word:10} -> {score:.3f}")

print("\n💡 Notice how rarer words (cat, mat) outscore sat and on; 'the' still ranks highest here because it appears twice in this document!")