Text Classification & Sentiment Analysis

Categorizing text and detecting emotions.

Text classification is one of the most practical NLP tasks. Given a piece of text, assign it a category. Is this email spam or not? Is this review positive or negative? Is this news about sports, politics, or tech?

It's the gateway drug to NLP — simple enough to learn quickly, powerful enough to solve real problems.

Bag of Words (BoW)

The simplest approach: count how many times each word appears, ignore order entirely. It's like putting all the words in a bag and shaking it.

"The cat sat on the mat" and "The mat sat on the cat" produce the exact same representation. Surprisingly, this works well for many tasks!


    Bag of Words Representation
    ──────────────────────────────────────────────
    │ Vocabulary: [cat, mat, on, sat, the]       │
    │                                             │
    │ "The cat sat on the mat":                   │
    │   [1, 1, 1, 1, 2]  ← "the" appears twice  │
    │                                             │
    │ "The mat sat on the cat":                   │
    │   [1, 1, 1, 1, 2]  ← same vector!         │
    │                                             │
    │ "A dog chased the cat":                     │
    │   [1, 0, 0, 0, 1]  ← "dog" not in vocab   │
    ──────────────────────────────────────────────
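
To make the diagram concrete, here is a minimal sketch in plain Python. The `tokenize` and `bag_of_words` helpers are illustrative names, not part of any library, and the tokenizer simply lowercases and splits on whitespace (an assumption; real tokenizers also handle punctuation).

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def bag_of_words(text, vocabulary):
    """Return one count per vocabulary word; words outside the vocabulary are dropped."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

vocabulary = ["cat", "mat", "on", "sat", "the"]

v1 = bag_of_words("The cat sat on the mat", vocabulary)
v2 = bag_of_words("The mat sat on the cat", vocabulary)
v3 = bag_of_words("A dog chased the cat", vocabulary)

print(v1)  # [1, 1, 1, 1, 2]
print(v2)  # [1, 1, 1, 1, 2]  (word order is lost)
print(v3)  # [1, 0, 0, 0, 1]  ("a", "dog", "chased" have no slot)
```

Note that every word missing from the vocabulary silently disappears, which is why the vocabulary is usually built from the training data first.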

TF-IDF

Bag of Words treats all words equally, but "the" appears in almost every document while "quantum" is rare. TF-IDF fixes this by weighting words by their importance.

Term Frequency (TF): How often a word appears in a document. More frequent = more important to that document.

Inverse Document Frequency (IDF): How rare a word is across all documents. Rare words are more informative.


    TF-IDF Scoring
    ──────────────────────────────────────────────
    │ TF(t,d) = Count of term t in doc d         │
    │           ─────────────────────────────     │
    │           Total terms in doc d              │
    │                                             │
    │ IDF(t) = ln(Total docs / Docs with t)      │
    │                                             │
    │ TF-IDF = TF × IDF                          │
    │                                             │
    │ Example:                                    │
    │   "the" appears 1000/1000 docs → IDF ≈ 0   │
    │   "quantum" appears 5/1000 docs → IDF ≈ 5.3│
    │                                             │
    │ "quantum" gets HIGH weight (informative)    │
    │ "the" gets LOW weight (common noise)        │
    ──────────────────────────────────────────────
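
The formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names and the three-document corpus are made up for the example, the logarithm is the natural log (which is what gives the ≈ 5.3 figure), and `idf` assumes the term occurs in at least one document.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: count of the term divided by the document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """Inverse document frequency with a natural log.
    Assumes the term occurs in at least one document."""
    docs_with_term = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / docs_with_term)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical toy corpus, already tokenized.
corpus = [
    "the quantum computer solved the problem".split(),
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
]

print(idf("the", corpus))                   # 0.0, appears in every document
print(tf_idf("quantum", corpus[0], corpus)) # > 0, rare and therefore informative
print(round(math.log(1000 / 5), 1))         # 5.3, the "quantum" example above
```

Libraries often use smoothed variants of IDF so that no term gets a weight of exactly zero and unseen terms do not cause a division by zero.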

Sentiment Analysis

The most popular text classification task: is this text positive, negative, or neutral? It powers product review analysis, social media monitoring, and customer feedback systems.

Document-level: Is this whole review positive? Straightforward.

Sentence-level: Each sentence gets a sentiment. Useful for detailed analysis.

Aspect-based: "The food was great but the service was terrible." Food = positive, Service = negative. More nuanced and more useful.


    Aspect-Based Sentiment
    ──────────────────────────────────────────────
    │ "The camera is amazing but battery life     │
    │  is disappointing and the price is fair"    │
    │                                             │
    │ ┌───────────┬────────────┬────────────┐     │
    │ │ Aspect    │ Sentiment  │ Confidence │     │
    │ ├───────────┼────────────┼────────────┤     │
    │ │ Camera    │ Positive   │ 0.95       │     │
    │ │ Battery   │ Negative   │ 0.89       │     │
    │ │ Price     │ Neutral    │ 0.72       │     │
    │ └───────────┴────────────┴────────────┘     │
    ──────────────────────────────────────────────
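
To get a feel for the idea, here is a toy rule-based sketch. It is not how real aspect-based systems work (they use trained models, which is also where confidence scores come from). The two lexicons and the `aspect_sentiments` function are hypothetical, and the clause-splitting rule is a simplifying assumption.

```python
import re

# Hypothetical toy lexicons; a real system would learn these from labeled data.
ASPECT_TERMS = {"camera": "Camera", "battery": "Battery", "price": "Price",
                "food": "Food", "service": "Service"}
OPINION_TERMS = {"amazing": "Positive", "great": "Positive",
                 "disappointing": "Negative", "terrible": "Negative",
                 "fair": "Neutral"}

def aspect_sentiments(text):
    """Pair each aspect with the first opinion word found in the same clause."""
    results = {}
    # Split into clauses on "but", "and", commas, and semicolons.
    clauses = re.split(r"\bbut\b|\band\b|[,;]", text.lower())
    for clause in clauses:
        words = re.findall(r"[a-z']+", clause)
        aspects = [ASPECT_TERMS[w] for w in words if w in ASPECT_TERMS]
        opinions = [OPINION_TERMS[w] for w in words if w in OPINION_TERMS]
        if aspects and opinions:
            results[aspects[0]] = opinions[0]
    return results

review = ("The camera is amazing but battery life is disappointing "
          "and the price is fair")
print(aspect_sentiments(review))
# {'Camera': 'Positive', 'Battery': 'Negative', 'Price': 'Neutral'}
```

A lexicon like this breaks quickly on negation or unseen vocabulary, which is exactly the gap that learned models fill.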

Classification Algorithms

You can use many algorithms for text classification, from simple to complex:

Naive Bayes: Uses Bayes' theorem with the "naive" assumption that features (words) are independent of each other given the class. Surprisingly effective for spam detection and simple classification. Fast, interpretable, and works well with small data.

Logistic Regression: A linear model that's easy to train and interpret. Great baseline — you'd be surprised how often it beats complex models.

SVM: Support Vector Machines find the boundary that separates classes with the widest margin. They work well in high-dimensional spaces (like text).

Neural Networks: Deep learning models that learn features automatically. Need more data but can capture complex patterns.


    Algorithm Comparison
    ──────────────────────────────────────────────
    │ Algorithm     │ Data Needed │ Accuracy  │   │
    ──────────────────────────────────────────────
    │ Naive Bayes   │ Small       │ Good      │   │
    │ Logistic Reg  │ Small       │ Good      │   │
    │ SVM           │ Small-Med   │ Very Good │   │
    │ Random Forest │ Medium      │ Good      │   │
    │ CNN/RNN       │ Large       │ Best      │   │
    │ BERT          │ Medium-Large│ SOTA      │   │
    ──────────────────────────────────────────────
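
As an illustration of the simplest entry in the table, here is a minimal sketch of a multinomial Naive Bayes spam classifier written from scratch. The function names and the four training messages are made up for the example, and add-one (Laplace) smoothing is an assumption chosen so that unseen word/class pairs do not produce a zero probability.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Count documents per class and words per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict(text, model):
    """Return the class with the highest log-probability score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothed log likelihood of the word given the class.
            score += math.log((word_counts[label][token] + 1)
                              / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training set.
docs = ["win free money now", "free prize claim now",
        "meeting at noon tomorrow", "lunch with the team tomorrow"]
labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(docs, labels)

print(predict("claim your free money", model))  # spam
print(predict("team meeting tomorrow", model))  # ham
```

Working in log space avoids multiplying many tiny probabilities together, which would otherwise underflow on longer documents.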

Evaluation Metrics

Accuracy alone can be misleading, especially with imbalanced classes:

Precision: Of all predicted positives, how many are actually positive? Important when false positives are costly (e.g., marking legitimate emails as spam).

Recall: Of all actual positives, how many did we catch? Important when false negatives are costly (e.g., missing a cancer diagnosis).

F1-Score: Harmonic mean of precision and recall. Balances both concerns.


    Confusion Matrix
    ──────────────────────────────────────────────
    │                │ Predicted +  │ Predicted - │
    ──────────────────────────────────────────────
    │ Actual +       │ True Pos     │ False Neg   │
    │ Actual -       │ False Pos    │ True Neg    │
    ──────────────────────────────────────────────
    │                                             │
    │ Precision = TP / (TP + FP)                  │
    │ Recall    = TP / (TP + FN)                  │
    │ F1        = 2 × (P × R) / (P + R)          │
    ──────────────────────────────────────────────
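
Here is a small sketch that computes these metrics from raw labels. The helper names and the ten-sample label lists are made up for the example; classes are encoded as 1 (positive) and 0 (negative), and each metric falls back to 0.0 when its denominator is zero (an assumption, since conventions differ).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Apply the formulas above, returning 0.0 where a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Imbalanced toy data: only 2 of 10 samples are positive.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print(accuracy)               # 0.8
print(precision, recall, f1)  # 0.5 0.5 0.5
```

The classifier looks decent at 80% accuracy, yet it finds only half of the positives and half of its positive predictions are wrong.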

Practical Tips

Start simple. Use TF-IDF + Logistic Regression as your baseline. It's fast, interpretable, and often "good enough." Only move to deep learning if you need better performance and have enough data.
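
To show what that baseline does under the hood, here is a from-scratch sketch in plain Python. In practice you would usually reach for a library such as scikit-learn instead. Everything here is illustrative: the function names, the four training reviews, and the learning rate and epoch count are assumptions picked for a tiny toy dataset.

```python
import math
from collections import Counter

def fit_tfidf(docs):
    """Learn a vocabulary and IDF weights; return a text -> TF-IDF vector function."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    idf = {t: math.log(len(tokenized) / sum(t in doc for doc in tokenized))
           for t in vocab}

    def vectorize(text):
        tokens = text.lower().split()
        counts = Counter(tokens)
        total = max(len(tokens), 1)  # avoid dividing by zero on empty input
        return [counts[t] / total * idf[t] for t in vocab]

    return vectorize

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_logreg(X, y, lr=1.0, epochs=200):
    """Batch gradient descent on the logistic (cross-entropy) loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, target in zip(X, y):
            err = sigmoid(dot(w, x) + b) - target
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

# Hypothetical toy training set: 1 = positive, 0 = negative.
train_docs = ["great movie loved it", "loved the acting great fun",
              "terrible movie hated it", "hated the plot terrible boring"]
train_labels = [1, 1, 0, 0]

vectorize = fit_tfidf(train_docs)
w, b = train_logreg([vectorize(d) for d in train_docs], train_labels)

def classify(text):
    return "positive" if sigmoid(dot(w, vectorize(text)) + b) > 0.5 else "negative"

print(classify("loved it great"))        # positive
print(classify("terrible boring plot"))  # negative
```

Because the model is linear, you can inspect `w` alongside the vocabulary to see which words push a prediction toward each class, which is a large part of why this baseline is easy to debug.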

Always look at your data first. What's the class distribution? Are there patterns? What preprocessing helps? A good look at your data beats any fancy algorithm.

For sentiment analysis specifically, consider using pre-trained models from Hugging Face. They typically handle nuances like negation ("not bad"), sarcasm, and domain-specific language better than bag-of-words approaches, although sarcasm remains hard for any model.
