Natural Language Processing Basics
Computers are great with numbers but terrible with words. NLP bridges that gap by teaching machines to understand, interpret, and generate human language. Every time you use a voice assistant, translation app, or chatbot, you're using NLP.
The first challenge: computers need numbers, not words. So the entire field starts with one fundamental question: how do we convert text into numbers that preserve meaning?
Tokenization
Before we can process text, we need to break it into pieces called tokens. This sounds simple but has surprising depth.
Word tokenization: Split on spaces and punctuation. "I love AI" becomes ["I", "love", "AI"]. Simple, but it has to make choices about cases like contractions: some tokenizers split "don't" into ["do", "n't"].
Subword tokenization: Break words into meaningful chunks. "unhappiness" → ["un", "happi", "ness"]. Used by BERT (WordPiece) and GPT (byte-pair encoding) because it handles rare words gracefully.
Character tokenization: Each character is a token. "Hello" becomes ["H", "e", "l", "l", "o"]. Useful for some tasks but produces very long sequences.
Tokenization Approaches
    Input:      "NLP is amazing!"

    Word:       ["NLP", "is", "amazing", "!"]
    Subword:    ["NLP", "is", "amaz", "ing", "!"]
    Character:  ["N", "L", "P", " ", "i", "s", ...]
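The three approaches can be sketched in a few lines of Python. This is a minimal illustration, not how production tokenizers work: the function names and TOY_VOCAB are made up for this example, and the subword splitter uses a simple greedy longest-match against a hand-written vocabulary, whereas real BPE or WordPiece tokenizers learn their vocabulary from a corpus and mark word-internal pieces.

```python
import re

def word_tokenize(text):
    # Runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    # Every character, including spaces, becomes a token.
    return list(text)

def subword_tokenize(word, vocab):
    # Greedy longest-match-first split against a fixed vocabulary.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here: give up on the whole word.
            return ["[UNK]"]
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical hand-written vocabulary, for illustration only.
TOY_VOCAB = {"NLP", "is", "amaz", "ing", "!", "un", "happi", "ness"}

text = "NLP is amazing!"
words = word_tokenize(text)
subwords = [p for w in words for p in subword_tokenize(w, TOY_VOCAB)]
chars = char_tokenize(text)

print(words)
print(subwords)
print(chars[:6])
```

Note how the sequence grows from 4 word tokens to 5 subword tokens to 15 character tokens for the same input.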
Text Preprocessing
Raw text is messy. Before feeding it to models, we clean it up:
Lowercasing: Convert "The" and "the" to the same token. Reduces vocabulary size but might lose information (e.g., "US" vs "us").
Removing stop words: Words like "the", "is", "at" carry little meaning. Removing them reduces noise, but sometimes they matter ("to be or not to be").
Stemming: Chop word endings mechanically. "running" → "run", "happiness" → "happi". Fast but crude.
Lemmatization: Reduce words to their dictionary form using grammar rules. "better" → "good", "ran" → "run". More accurate but slower.
Preprocessing Pipeline
    Raw: "The cats were RUNNING happily!"
        │
        ▼ Lowercase
    "the cats were running happily!"
        │
        ▼ Remove stop words and punctuation
    "cats running happily"
        │
        ▼ Lemmatize
    "cat run happily"
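A rough sketch of such a pipeline, using only the standard library. The stop-word set and the lemma lookup table here are tiny hand-written stand-ins (assumptions for this example); a real project would typically take both from an NLP library. Words missing from the toy table, such as "happily", pass through unchanged.

```python
import re

# Tiny illustrative stop-word list; real lists contain a few hundred words.
STOP_WORDS = {"the", "is", "at", "were", "a", "an"}

# Toy lemma lookup table standing in for a real lemmatizer.
LEMMAS = {"cats": "cat", "running": "run", "ran": "run", "better": "good"}

def preprocess(text):
    lowered = text.lower()
    # Keep runs of letters and apostrophes; this also drops punctuation.
    tokens = re.findall(r"[a-z']+", lowered)
    kept = [t for t in tokens if t not in STOP_WORDS]
    # Map each token to its lemma if the table knows it.
    return [LEMMAS.get(t, t) for t in kept]

result = preprocess("The cats were RUNNING happily!")
print(result)
```

The order of the steps matters: lowercasing first means the stop-word and lemma lookups only need lowercase entries.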
Word Embeddings
Here's the key breakthrough: representing words as dense vectors that capture meaning. Instead of a sparse one-hot vector of 50,000 dimensions, we use a compact vector of 100-300 dimensions where similar words are close together.
The famous example: King - Man + Woman ≈ Queen. The vector arithmetic captures gender relationships!
Word Vector Space (simplified 2D)
    king ●────────────────────● queen
         │   gender vector    │
    man  ●────────────────────● woman

    cat  ●────────────────────● dog
            pet relationship

    Similar words cluster together
    Relationship directions are consistent
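The analogy arithmetic can be tried on hand-made vectors. Everything in this sketch is an assumption for illustration: the three dimensions (loosely "royalty", "gender", "animal") and all the numbers are invented, whereas real embeddings are learned from data and their dimensions are not individually interpretable.

```python
import math

# Invented toy vectors: (royalty, gender, animal). Not learned embeddings.
VECTORS = {
    "king":  [0.9, 0.1, 0.0],
    "queen": [0.9, 0.9, 0.0],
    "man":   [0.1, 0.1, 0.0],
    "woman": [0.1, 0.9, 0.0],
    "cat":   [0.0, 0.5, 0.9],
    "dog":   [0.0, 0.4, 0.9],
}

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction, 0.0 means orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    # Word closest to vec(a) - vec(b) + vec(c), excluding the three inputs.
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

answer = analogy("king", "man", "woman", VECTORS)
print(answer)
print(cosine(VECTORS["cat"], VECTORS["dog"]) > cosine(VECTORS["cat"], VECTORS["king"]))
```

Excluding the input words from the candidates is a common convention for analogy queries; without it, the nearest vector is often one of the inputs themselves.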
Popular Embedding Methods
Word2Vec (2013): The OG. Offers two architectures: CBOW (predict a word from its context) and Skip-gram (predict the context from a word). Fast, effective, still widely used.
GloVe (2014): Global Vectors. Uses co-occurrence statistics across the entire corpus. Captures both local and global relationships.
FastText (2016): Extension of Word2Vec that uses subword information. Handles out-of-vocabulary words better: it can build embeddings for words it hasn't seen from their character n-grams.
Contextual embeddings (BERT, GPT): The word "bank" gets different vectors depending on context, as in "river bank" vs "bank account". This is what modern NLP uses.
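To make the training setups above concrete, here is a small sketch of the data each method works from. It only builds training examples and n-grams; it does not train any model. The function names are hypothetical, and the window size and n-gram length are simplified (FastText, for instance, normally combines several n-gram lengths rather than a single one).

```python
def skipgram_pairs(tokens, window=1):
    # Skip-gram: (center word, one context word) pairs.
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=1):
    # CBOW: (all context words, target word) examples.
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

def char_ngrams(word, n=3):
    # FastText-style: add boundary markers, then slide a window of n characters.
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

tokens = ["i", "love", "nlp"]
print(skipgram_pairs(tokens))
print(cbow_examples(tokens))
print(char_ngrams("where"))
```

Because an unseen word still shares character n-grams with known words, a FastText-style model can sum the n-gram vectors to produce an embedding for it.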
The Vocabulary Problem
Every word in your training data gets an index in a vocabulary. But what about words you haven't seen? This is the out-of-vocabulary (OOV) problem.
Solutions include using subword tokenization (BPE, WordPiece, SentencePiece), character-level embeddings, or simply mapping unknown words to a special [UNK] token. Modern subword tokenizers handle this gracefully, since most words are compositions of known subwords.
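The simplest of these options, the [UNK] fallback, might look like the sketch below. The helper names, the whitespace tokenization, and the choice to reserve index 0 for padding and index 1 for unknown words are assumptions made for this example.

```python
from collections import Counter

PAD, UNK = "[PAD]", "[UNK]"

def build_vocab(sentences, min_count=1):
    # Count whitespace-separated tokens across the training sentences.
    counts = Counter(tok for s in sentences for tok in s.split())
    # Reserve the first two indices for the special tokens.
    vocab = {PAD: 0, UNK: 1}
    for tok, count in counts.most_common():
        if count >= min_count:
            vocab[tok] = len(vocab)
    return vocab

def encode(sentence, vocab):
    # Unknown words fall back to the [UNK] index.
    return [vocab.get(tok, vocab[UNK]) for tok in sentence.split()]

vocab = build_vocab(["i love ai", "i love nlp"])
ids = encode("i love robots", vocab)
print(vocab)
print(ids)
```

Here "robots" never appeared in training, so it is mapped to the same index as every other unknown word, which is exactly the information loss that subword tokenization avoids.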
From Words to Sequences
Text comes in variable lengths, but a batch fed to a neural network has to be a fixed-size tensor. Common approaches:
Handling Variable Length
    Padding:     "I love AI" → [5, 12, 8, 0, 0]
                 (0 = padding token)

    Truncation:  Long texts → keep first N words

    Packing:     Multiple short texts in one tensor
                 (efficient batching)
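Attention mechanisms (the backbone of Transformers) handle variable lengths elegantly: the same weights apply to a sequence of any length, and when sequences are padded into a batch, an attention mask tells the model to ignore the padding. This flexibility is one reason they revolutionized NLP.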
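Padding and truncation can be sketched with one small helper. The name pad_or_truncate and the token ids are hypothetical. The helper also returns a mask marking real tokens (1) versus padding (0), which is the kind of mask Transformer implementations typically use to ignore padded positions.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    # Truncation: keep only the first max_len tokens.
    clipped = ids[:max_len]
    n_pad = max_len - len(clipped)
    # Mask: 1 for real tokens, 0 for padding.
    mask = [1] * len(clipped) + [0] * n_pad
    # Padding: fill the remainder with the padding id.
    padded = clipped + [pad_id] * n_pad
    return padded, mask

batch = [[5, 12, 8], [7, 3, 9, 4, 2, 6, 1]]
padded_batch = [pad_or_truncate(seq, max_len=5) for seq in batch]

for padded, mask in padded_batch:
    print(padded, mask)
```

After this step every sequence in the batch has length 5, so the batch can be stacked into a single rectangular tensor.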
NLP Tasks Overview
NLP covers a wide range of tasks:
Text classification: Spam detection, sentiment analysis, topic labeling. The bread and butter of NLP.
Named Entity Recognition (NER): Finding people, places, organizations in text. "Apple CEO Tim Cook lives in California" → [Apple: ORG, Tim Cook: PERSON, California: LOCATION].
Machine translation: Converting text between languages. Google Translate uses Transformer models.
Question answering: Reading a passage and answering questions about it. Powers search engines and virtual assistants.
Text generation: Writing stories, emails, code. Large language models such as the GPT family are the best-known examples.