Probabilistic Classification
The Naive Bayes classifier is based on Bayes' Theorem with a strong (naive) assumption that all features are independent of each other given the class label. Despite this simplification, it works surprisingly well in practice, especially for text classification problems like spam filtering and sentiment analysis.
Bayes' Theorem
P(Class | Features) = [P(Features | Class) × P(Class)] / P(Features)
In plain English:
Probability of the class given the features
= probability of the features given the class
× probability of the class
÷ probability of the features
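As a minimal sketch of the theorem for a single piece of evidence, the snippet below computes the posterior for two classes. The priors and likelihoods are hypothetical numbers chosen for illustration, and `posterior` is an illustrative helper name, not part of any library. P(Features) is obtained by summing likelihood × prior over all classes.

```python
def posterior(likelihoods, priors, target):
    """Return P(target | evidence) via Bayes' theorem.

    likelihoods[c] is P(evidence | c); priors[c] is P(c).
    """
    # P(evidence) by the law of total probability over all classes.
    evidence = sum(likelihoods[c] * priors[c] for c in priors)
    return likelihoods[target] * priors[target] / evidence


# Hypothetical numbers, for illustration only.
priors = {"spam": 0.4, "not_spam": 0.6}
likelihoods = {"spam": 0.30, "not_spam": 0.10}  # P("free" | class)

p_spam = posterior(likelihoods, priors, "spam")
p_not_spam = posterior(likelihoods, priors, "not_spam")
print(f"P(spam | 'free') = {p_spam:.3f}")
print(f"P(not spam | 'free') = {p_not_spam:.3f}")
```

Because both posteriors share the same denominator, they sum to 1.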
"Naive" assumption: features are conditionally independent given the class
P(f1, f2, f3 | Class) = P(f1|Class) × P(f2|Class) × P(f3|Class)
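The factorization can be sketched in a few lines. The three conditional probabilities below are hypothetical values for one class. The log-space version is included as a common practical variant (an addition here, not something the text above prescribes), since a product of many small probabilities can underflow.

```python
import math

# Hypothetical per-feature conditionals P(f_i | Class) for a single class.
cond_probs = [0.30, 0.23, 0.25]

# Naive assumption: the joint likelihood is the product of the per-feature terms.
joint = math.prod(cond_probs)

# Same quantity in log space: a sum of logs instead of a product.
log_joint = sum(math.log(p) for p in cond_probs)

print(joint)
print(math.exp(log_joint))
```

Both printed values agree up to floating-point rounding.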
Example: Spam Detection
Email contains: "free money winner"
P(Spam | "free money winner")
= [P("free"|Spam) × P("money"|Spam) × P("winner"|Spam) × P(Spam)] / P("free money winner")
Calculate the numerator for both Spam and Not Spam, then normalize so the two posteriors sum to 1 (illustrative values):
P(Spam | words) = 0.92
P(Not Spam | words) = 0.08
Prediction: SPAM
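The calculation above can be sketched end to end as follows. The text does not give the underlying word probabilities or priors, so the values here are hypothetical, picked so that the result reproduces the 0.92 / 0.08 split. `classify` is an illustrative helper, not a library function.

```python
import math

# Hypothetical probabilities, chosen to reproduce the 0.92 / 0.08 figures.
priors = {"spam": 0.4, "not_spam": 0.6}
word_probs = {
    "spam": {"free": 0.30, "money": 0.23, "winner": 0.25},
    "not_spam": {"free": 0.10, "money": 0.10, "winner": 0.10},
}


def classify(words):
    """Return (predicted_class, posteriors) for a list of words."""
    # Unnormalized score per class: P(w1|c) * P(w2|c) * ... * P(c)
    scores = {
        c: math.prod(word_probs[c][w] for w in words) * priors[c]
        for c in priors
    }
    # Dividing by the sum of the scores plays the role of P(Features).
    total = sum(scores.values())
    posteriors = {c: s / total for c, s in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors


label, posteriors = classify(["free", "money", "winner"])
print(label, {c: round(p, 2) for c, p in posteriors.items()})
```

This sketch only handles words present in `word_probs`. An unseen word would raise a `KeyError`, which a real implementation would need to handle, for example with smoothing.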
Types of Naive Bayes
- Gaussian NB: Assumes features follow a normal distribution. Works with continuous data.
- Multinomial NB: Works with discrete counts (like word frequencies). Great for text classification.
- Bernoulli NB: Works with binary features (word present or absent). Also good for text.
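The three variants differ mainly in how they model P(feature | Class). The sketch below shows one plausible per-class log-likelihood for each, in plain Python. The function names and all parameter values are hypothetical, and parameter estimation from training data is left out.

```python
import math


def gaussian_log_likelihood(x, mean, var):
    """Gaussian NB: log density of a continuous feature value x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def multinomial_log_likelihood(counts, word_probs):
    """Multinomial NB: counts[w] is how often word w occurs.

    The multinomial coefficient is omitted because it is the same
    for every class and does not affect the comparison.
    """
    return sum(n * math.log(word_probs[w]) for w, n in counts.items())


def bernoulli_log_likelihood(present, word_probs):
    """Bernoulli NB: every vocabulary word contributes, present or absent."""
    return sum(
        math.log(p if w in present else 1.0 - p)
        for w, p in word_probs.items()
    )


# Hypothetical parameters for a single class.
multinomial_probs = {"free": 0.5, "money": 0.3, "winner": 0.2}  # sums to 1
bernoulli_probs = {"free": 0.30, "money": 0.23, "winner": 0.25}  # independent

print(gaussian_log_likelihood(1.0, mean=0.0, var=1.0))
print(multinomial_log_likelihood({"free": 2, "winner": 1}, multinomial_probs))
print(bernoulli_log_likelihood({"free", "winner"}, bernoulli_probs))
```

Note the difference between the two text models: the multinomial version ignores absent words, while the Bernoulli version explicitly penalizes them through the `1 - p` term. In practice, scikit-learn ships these variants as `GaussianNB`, `MultinomialNB`, and `BernoulliNB` in `sklearn.naive_bayes`.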
Why "Naive" Still Works
The independence assumption is almost never true in real data. But Naive Bayes doesn't need the probabilities to be perfectly calibrated: it just needs to get the ranking right (which class has the higher probability). For classification, getting the ranking right is often enough.
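One way to see this is to feed the model perfectly correlated features. In the sketch below, each word is duplicated, so the copies carry no new information, yet the model treats them as fresh evidence. All probabilities are hypothetical and `spam_posterior` is an illustrative helper.

```python
import math

# Hypothetical probabilities, for illustration only.
priors = {"spam": 0.4, "not_spam": 0.6}
word_probs = {
    "spam": {"free": 0.30, "money": 0.23, "winner": 0.25},
    "not_spam": {"free": 0.10, "money": 0.10, "winner": 0.10},
}


def spam_posterior(words):
    """Naive Bayes P(spam | words), treating every word as independent."""
    scores = {
        c: math.prod(word_probs[c][w] for w in words) * priors[c]
        for c in priors
    }
    return scores["spam"] / sum(scores.values())


words = ["free", "money", "winner"]
# Duplicate every feature: the copies are perfectly correlated with the
# originals, but the model multiplies their likelihoods in again.
duplicated = words + words

p_single = spam_posterior(words)
p_duplicated = spam_posterior(duplicated)
print(f"original features:   P(spam) = {p_single:.3f}")
print(f"duplicated features: P(spam) = {p_duplicated:.3f}")
```

The duplicated input pushes the estimated probability closer to 1, which is overconfident, but the predicted class is unchanged. This is a constructed case: correlated features can flip a prediction in other settings, so it illustrates the argument without proving it.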
Pros and Cons
Pros: Extremely fast to train and predict, works well with small training data, handles high-dimensional data, great for text classification, no complex optimization needed.
Cons: Independence assumption is unrealistic, can be outperformed by more sophisticated models, probability estimates can be poorly calibrated, struggles with correlated features.