Natural Language Processing

Teaching machines to understand human language.

Natural Language Processing Basics

Computers are great with numbers but terrible with words. NLP bridges that gap — teaching machines to understand, interpret, and generate human language. Every time you use voice assistants, translation apps, or chatbots, you're using NLP.

The first challenge: computers need numbers, not words. So the entire field starts with one fundamental question — how do we convert text into numbers that preserve meaning?

Tokenization

Before we can process text, we need to break it into pieces called tokens. This sounds simple but has surprising depth.

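Word tokenization: Split on spaces and punctuation. "I love AI" becomes ["I", "love", "AI"]. Simple, but contractions need special handling: some tokenizers split "don't" into ["do", "n't"], while a naive split on punctuation breaks it into ["don", "'", "t"].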

Subword tokenization: Break words into smaller units learned from data. "unhappiness" → ["un", "happi", "ness"] (the exact split depends on the tokenizer's learned vocabulary). Used by BERT and GPT because rare words can be built from known pieces.

Character tokenization: Each character is a token. ["H", "e", "l", "l", "o"]. Useful for some tasks but produces very long sequences.


    Tokenization Approaches
    ──────────────────────────────────────────────
    │ Input: "NLP is amazing!"                   │
    │                                             │
    │ Word:     ["NLP", "is", "amazing", "!"]    │
    │ Subword:  ["NLP", "is", "amaz", "ing", "!"]│
    │ Character: ["N","L","P"," ","i","s",...]   │
    ──────────────────────────────────────────────
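
The word and character rows of the diagram can be reproduced with a few lines of Python. This is a minimal sketch using a regular expression, not the tokenizer of any particular library; the function names are illustrative.

```python
import re

def word_tokenize(text):
    # A token is either a run of word characters or one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    # Every character, including spaces, becomes its own token.
    return list(text)

text = "NLP is amazing!"
print(word_tokenize(text))       # ['NLP', 'is', 'amazing', '!']
print(char_tokenize(text)[:6])   # ['N', 'L', 'P', ' ', 'i', 's']
print(word_tokenize("don't"))    # ['don', "'", 't']
```

The last line shows why contractions are tricky: a plain punctuation split breaks "don't" into three pieces, which is rarely what you want.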

Text Preprocessing

Raw text is messy. Before feeding it to models, we clean it up:

Lowercasing: Convert "The" and "the" to the same token. Reduces vocabulary size but might lose information (e.g., "US" vs "us").

Removing stop words: Words like "the", "is", "at" carry little meaning. Removing them reduces noise — but sometimes they matter ("to be or not to be").

Stemming: Chop word endings mechanically. "running" → "run", "happiness" → "happi". Fast but crude.

Lemmatization: Reduce words to their dictionary form using a vocabulary and morphological analysis, often guided by part of speech. "better" → "good", "ran" → "run". More accurate but slower.


    Preprocessing Pipeline
    ──────────────────────────────────────────────
    │ Raw:    "The cats were RUNNING happily!"   │
    │              │                              │
    │              ▼ Lowercase                    │
    │         "the cats were running happily!"    │
    │              │                              │
    │              ▼ Remove stopwords             │
    │         "cats running happily"              │
    │              │                              │
    │              ▼ Lemmatize                    │
    │         "cat run happily"                   │
    ──────────────────────────────────────────────
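
The pipeline can be sketched in plain Python. The stop-word set and the lemma lookup table below are tiny, hand-written stand-ins (assumptions for illustration); real pipelines rely on much larger resources from an NLP library. Words missing from the lookup table are left unchanged, so the adverb "happily" stays as it is, which is also how dictionary-based lemmatizers usually treat adverbs.

```python
import re

# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"the", "is", "at", "were", "a", "an"}
LEMMAS = {"cats": "cat", "running": "run", "ran": "run", "better": "good"}

def preprocess(text):
    # 1. Lowercase, and keep only alphabetic runs (this drops punctuation).
    tokens = re.findall(r"[a-z]+", text.lower())
    # 2. Remove stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 3. Lemmatize via table lookup; unknown words pass through unchanged.
    return [LEMMAS.get(t, t) for t in tokens]

print(preprocess("The cats were RUNNING happily!"))
# ['cat', 'run', 'happily']
```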

Word Embeddings

Here's the key breakthrough: representing words as dense vectors that capture meaning. Instead of a sparse one-hot vector of 50,000 dimensions, we use a compact vector of 100-300 dimensions where similar words are close together.
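
To make the contrast concrete, here is a small sketch comparing one-hot vectors with dense vectors. The three-word vocabulary and the 2-dimensional dense values are made up for illustration; trained embeddings would have hundreds of dimensions.

```python
import math

vocab = ["king", "queen", "cat"]

def one_hot(word):
    # One dimension per vocabulary word; exactly one entry is 1.
    return [1.0 if w == word else 0.0 for w in vocab]

# Made-up dense vectors (assumption): similar words get similar values.
dense = {"king": [0.9, 0.8], "queen": [0.8, 0.9], "cat": [-0.7, 0.1]}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# One-hot vectors of different words are always orthogonal.
print(cosine(one_hot("king"), one_hot("queen")))  # 0.0
# Dense vectors can express that king is nearer to queen than to cat.
print(cosine(dense["king"], dense["queen"]) > cosine(dense["king"], dense["cat"]))  # True
```

With one-hot vectors every pair of distinct words looks equally unrelated, which is exactly the limitation dense embeddings remove.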

The famous example: King - Man + Woman ≈ Queen. The vector arithmetic captures gender relationships!


    Word Vector Space (simplified 2D)
    ──────────────────────────────────────────────
    │                                             │
    │  king ●                    ● queen          │
    │        \                  /                 │
    │         \    man ●───────● woman            │
    │          \      │         │                 │
    │           \     │  gender │                 │
    │            \    │ vector  │                 │
    │             \   │         │                 │
    │              cat ●─────────● dog            │
    │                    pet relationship         │
    │                                             │
    │  Similar words cluster together             │
    │  Relationship directions are consistent     │
    ──────────────────────────────────────────────
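
The analogy arithmetic can be tried on toy vectors. The 2-dimensional values below are invented so that one axis loosely tracks "royalty" and the other "gender"; they are an assumption for illustration, not output from a trained model.

```python
import math

# Hypothetical vectors: axis 0 ~ royalty, axis 1 ~ gender.
vectors = {
    "king":  [0.9, 0.1],
    "queen": [0.9, 0.9],
    "man":   [0.1, 0.1],
    "woman": [0.1, 0.9],
    "cat":   [-0.5, 0.5],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    # Compute a - b + c, then return the nearest remaining word by cosine.
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman"))  # queen
```

Excluding the three input words from the candidates is a common convention in analogy evaluation; without it, the nearest vector is often one of the inputs.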

Popular Embedding Methods

Word2Vec (2013): The OG. Uses two architectures — CBOW (predict word from context) and Skip-gram (predict context from word). Fast, effective, still widely used.

GloVe (2014): Global Vectors. Fits word vectors to word-word co-occurrence counts gathered across the entire corpus, so it combines corpus-wide statistics with local context windows.

FastText (2016): Extension of Word2Vec that uses subword information. Handles out-of-vocabulary words better — can generate embeddings for words it hasn't seen.

Contextual embeddings (BERT, GPT): The word "bank" gets different vectors depending on context — "river bank" vs "bank account". This is what modern NLP uses.
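
Two of the ideas above are easy to sketch without any training code: the (center, context) pairs that Skip-gram learns from, and the character n-grams that FastText uses as subword information. The function names are illustrative. The n-gram helper is simplified to a single n, whereas FastText combines several n-gram lengths plus the whole word.

```python
def skipgram_pairs(tokens, window=2):
    # For each center word, pair it with every word within `window` positions.
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def char_ngrams(word, n=3):
    # Add boundary markers so prefixes and suffixes get distinct n-grams.
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(skipgram_pairs(["i", "love", "nlp"], window=1))
# [('i', 'love'), ('love', 'i'), ('love', 'nlp'), ('nlp', 'love')]
print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

An unseen word still shares many n-grams with known words, which is why a FastText-style model can assemble a vector for it.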

The Vocabulary Problem

Every word in your training data gets an index in a vocabulary. But what about words you haven't seen? This is the out-of-vocabulary (OOV) problem.

Solutions include using subword tokenization (BPE, WordPiece, SentencePiece), character-level embeddings, or simply mapping unknown words to a special [UNK] token. Modern subword tokenizers handle this gracefully — most words are compositions of known subwords.
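
Both strategies can be sketched briefly: mapping unseen words to [UNK], and splitting a word into known subwords. The subword function uses greedy longest-match-first splitting, in the spirit of WordPiece but without its continuation markers; the piece set passed in is a hypothetical vocabulary, not one learned from data.

```python
UNK = "[UNK]"

def build_vocab(tokens):
    # Index 0 is reserved for unknown words.
    vocab = {UNK: 0}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab):
    # Any token missing from the vocabulary maps to the [UNK] index.
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]

def greedy_subwords(word, pieces):
    # Repeatedly take the longest known piece starting at the current position.
    out, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in pieces:
            end -= 1
        if end == start:      # no known piece fits here: give up on the word
            return [UNK]
        out.append(word[start:end])
        start = end
    return out

vocab = build_vocab(["i", "love", "ai"])
print(encode(["i", "love", "nlp"], vocab))                       # [1, 2, 0]
print(greedy_subwords("unhappiness", {"un", "happi", "ness"}))   # ['un', 'happi', 'ness']
```

The word-level encoder loses "nlp" entirely, while the subword splitter keeps every part of a word it has never seen whole.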

From Words to Sequences

Sentences and documents vary in length, but neural networks process batches as fixed-shape tensors. Common approaches:


    Handling Variable Length
    ──────────────────────────────────────────────
    │                                             │
    │ Padding:   "I love AI" → [5, 12, 8, 0, 0] │
    │            (0 = padding token)              │
    │                                             │
    │ Truncation: Long texts → keep first N words │
    │                                             │
    │ Packing: Multiple short texts in one tensor │
    │           (efficient batching)              │
    ──────────────────────────────────────────────
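
Padding and truncation fit in one small helper. This sketch assumes token id 0 is reserved for padding, as in the diagram, and also returns a mask marking which positions hold real tokens; the helper name is illustrative.

```python
PAD_ID = 0  # assumption: id 0 is reserved for the padding token

def pad_or_truncate(ids, max_len):
    # Truncate: keep only the first max_len ids.
    ids = ids[:max_len]
    n_pad = max_len - len(ids)
    # Mask: 1 for real tokens, 0 for padding.
    mask = [1] * len(ids) + [0] * n_pad
    return ids + [PAD_ID] * n_pad, mask

batch = [[5, 12, 8], [7, 3, 9, 4, 11, 2]]
for seq in batch:
    print(pad_or_truncate(seq, 5))
# ([5, 12, 8, 0, 0], [1, 1, 1, 0, 0])
# ([7, 3, 9, 4, 11], [1, 1, 1, 1, 1])
```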

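Attention mechanisms (the backbone of Transformers) process all positions of a sequence in parallel and can relate any two tokens directly, no matter how far apart they are, which is one reason they revolutionized NLP. Batches are still padded in practice; an attention mask tells the model to ignore the padding positions.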
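
When sequences in a batch are padded to a common length, attention implementations typically mask the padded positions so they receive zero weight. Below is a minimal sketch of a masked softmax over one row of attention scores; the scores are made up, and the function assumes at least one position is unmasked.

```python
import math

def masked_softmax(scores, mask):
    # Masked positions get -inf, so their exponential is exactly 0.
    masked = [s if m else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Three real tokens followed by two padding positions.
weights = masked_softmax([2.0, 1.0, 0.5, 0.0, 0.0], [1, 1, 1, 0, 0])
print(weights[3], weights[4])   # 0.0 0.0
```

The remaining weights still sum to 1, so padding changes the tensor shape without changing what the model attends to.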

NLP Tasks Overview

NLP covers a wide range of tasks:

Text classification: Spam detection, sentiment analysis, topic labeling. The bread and butter of NLP.

Named Entity Recognition (NER): Finding people, places, organizations in text. "Apple CEO Tim Cook lives in California" → [Apple: ORG, Tim Cook: PERSON, California: LOCATION].

Machine translation: Converting text between languages. Google Translate uses Transformer models.

Question answering: Reading a passage and answering questions about it. Powers search engines and virtual assistants.

Text generation: Writing stories, emails, code. Large language models such as the GPT family are widely used for this.
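
As a taste of the first task, text classification, here is a tiny spam filter. It uses a bag-of-words Naive Bayes classifier with add-one smoothing, which is an illustrative choice and not a method named in this section; the four training messages are invented.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    # examples: list of (tokens, label) pairs.
    word_counts = defaultdict(Counter)   # label -> word frequencies
    label_counts = Counter()             # label -> number of documents
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def predict(tokens, model):
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior plus sum of log likelihoods (add-one smoothing).
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

data = [
    ("win free prize now".split(), "spam"),
    ("free money win".split(), "spam"),
    ("meeting at noon tomorrow".split(), "ham"),
    ("lunch meeting tomorrow".split(), "ham"),
]
model = train(data)
print(predict("free prize".split(), model))        # spam
print(predict("meeting tomorrow".split(), model))  # ham
```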
