Naive Bayes Algorithm
Let's explore the Naive Bayes algorithm with an example.

Let's say we have a small training dataset: 6 data points (emails) and two class labels [spam/ham].
- First, prepare the data.
- Then, calculate the prior probabilities. Since 3 of the 6 emails are spam and 3 are ham:
P(spam) = 3/6 = 0.5
P(ham) = 3/6 = 0.5
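To make the arithmetic concrete, here is a minimal Python sketch of the prior calculation, assuming a hypothetical list of the six training labels:

```python
from collections import Counter

# Assumed toy labels for the 6 training emails (3 spam, 3 ham).
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

label_counts = Counter(labels)
priors = {label: count / len(labels) for label, count in label_counts.items()}
print(priors)  # {'spam': 0.5, 'ham': 0.5}
```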
- Now, calculate the likelihoods with Laplace smoothing.
Count the occurrences of each word within each class and apply Laplace smoothing (alpha = 1).

Likelihood probabilities are calculated using Laplace smoothing:
P(word | spam) = [count(word in spam) + 1] / [total words in spam + number of unique words]
P(word | ham) = [count(word in ham) + 1] / [total words in ham + number of unique words]
In this example, total words in spam = 11, total words in ham = 13, and the number of unique words in the training data = 19.
Word | P(word | spam) | P(word | ham)
free | (2 + 1)/(11 + 19) = 0.1 | (0 + 1)/(13 + 19) ≈ 0.03
Calculate the likelihood for each word in the vocabulary in the same way.
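Below is a small sketch of the smoothed likelihood calculation, assuming hypothetical tokenised spam and ham emails (the counts here will not match the worked numbers above exactly):

```python
from collections import Counter

# Assumed, already-tokenised training emails.
spam_docs = [["free", "lottery", "win", "free"], ["win", "cash", "now"]]
ham_docs = [["meeting", "at", "noon"], ["see", "you", "now"]]

vocab = {word for doc in spam_docs + ham_docs for word in doc}

def smoothed_likelihoods(docs, vocab, alpha=1):
    """P(word | class) = (count + alpha) / (total words in class + alpha * |vocab|)."""
    counts = Counter(word for doc in docs for word in doc)
    total = sum(counts.values())
    return {word: (counts[word] + alpha) / (total + alpha * len(vocab))
            for word in vocab}

p_word_given_spam = smoothed_likelihoods(spam_docs, vocab)
p_word_given_ham = smoothed_likelihoods(ham_docs, vocab)
print(p_word_given_spam["free"], p_word_given_ham["free"])
```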
- Calculate the posterior probabilities.
Let's take a new data point, a new email: "Free lottery now".
We need to find out whether it is spam or ham, so we compute the posterior probability of each class.
Words: "free", "lottery", "now"
P(spam | X) ∝ P(spam) · P(free | spam) · P(lottery | spam) · P(now | spam)
P(spam | X) ∝ 0.5 * 0.1 * 0.07 * 0.1 = 0.00035
P(ham | X) ∝ P(ham) · P(free | ham) · P(lottery | ham) · P(now | ham)
P(ham | X) ∝ 0.5 * 0.03 * 0.03 * 0.03 = 0.0000135
(The evidence term P(X) is the same for both classes, so it can be dropped when comparing them.)
Comparing the two, P(spam | X) is higher, so we label the new email as "spam".
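Here is the same comparison as a short Python sketch, plugging in the probabilities from the worked example above:

```python
# Probabilities taken from the worked example above.
priors = {"spam": 0.5, "ham": 0.5}
likelihoods = {
    "spam": {"free": 0.1, "lottery": 0.07, "now": 0.1},
    "ham":  {"free": 0.03, "lottery": 0.03, "now": 0.03},
}

email = ["free", "lottery", "now"]

scores = {}
for label in priors:
    score = priors[label]
    for word in email:
        score *= likelihoods[label][word]  # multiply prior by each word likelihood
    scores[label] = score

print(scores)                       # roughly {'spam': 0.00035, 'ham': 1.35e-05}
print(max(scores, key=scores.get))  # 'spam'
```

In practice, implementations usually sum log probabilities instead of multiplying raw probabilities, so that long products do not underflow to zero.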
Some edge-case scenarios to watch out for:
1. Unseen Words
- Issue: An email we need to classify might contain words that never appear in the training data. Without smoothing, their likelihoods are zero, so the posterior probability for every class collapses to zero.
- Solution: Use Laplace smoothing, which adds a small constant (usually 1) to each word count before calculating the probability.
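A tiny numeric sketch of why this matters, using the counts from the worked example above (total words in spam = 11, vocabulary size = 19):

```python
total_words_in_spam = 11  # from the worked example above
vocab_size = 19           # number of unique words in the training data
alpha = 1
count_unseen = 0          # a word that never appears in spam training emails

p_without_smoothing = count_unseen / total_words_in_spam
p_with_smoothing = (count_unseen + alpha) / (total_words_in_spam + alpha * vocab_size)
print(p_without_smoothing, p_with_smoothing)  # 0.0 vs roughly 0.033
```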
2. Highly Imbalanced Dataset
- Issue: If one class is significantly more frequent than the other, the classifier might be biased towards predicting the more frequent class.
- Solution: Adjust the prior probabilities or use techniques to balance the dataset, such as oversampling the minority class or undersampling the majority class.
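If you are using scikit-learn, one way to adjust the priors is the `class_prior` argument of `MultinomialNB`, which overrides the priors learned from the skewed label frequencies; the emails and labels below are assumed toy data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Assumed, imbalanced toy dataset: far more ham than spam.
texts = ["free lottery now", "meeting at noon", "see you tomorrow",
         "lunch at noon", "project update attached", "notes from the call"]
labels = ["spam", "ham", "ham", "ham", "ham", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# class_prior is given in the order of clf.classes_ (alphabetical: ham, spam).
clf = MultinomialNB(alpha=1.0, class_prior=[0.5, 0.5])
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["free lottery tonight"])))
```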
3. Multicollinearity
- Issue: Naive Bayes assumes independence between features, which might not hold in practice. For example, words in text data are often correlated (e.g., “lottery” and “win”).
- Solution: While this is a limitation of the algorithm, feature selection or transformation techniques (like PCA) can sometimes help reduce dependencies.
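As a rough sketch of the transformation idea: for sparse word-count matrices, `TruncatedSVD` (a PCA-style decomposition that works directly on sparse input) can fold correlated words into shared components; the texts below are assumed:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Assumed toy texts in which "lottery" and "win" tend to co-occur.
texts = ["free lottery win", "win the lottery now",
         "meeting at noon", "see you at noon"]

X = CountVectorizer().fit_transform(texts)

svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (4, 2): each email is now described by 2 components
```

Note that the transformed features are no longer word counts (they can be negative), so a Gaussian Naive Bayes variant would be the natural fit on them rather than the multinomial one.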
4. Rare Words
- Issue: Words that appear very rarely might unduly influence the classification.
- Solution: Apply techniques like term frequency-inverse document frequency (TF-IDF) to weight the importance of words.
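A hedged sketch of TF-IDF weighting feeding a Naive Bayes classifier with scikit-learn; the emails and labels are assumed toy data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Assumed toy training emails and labels.
texts = ["free lottery now", "win a cash prize", "meeting at noon", "see you tomorrow"]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)  # terms are re-weighted by inverse document frequency

clf = MultinomialNB().fit(X, labels)
print(clf.predict(vectorizer.transform(["free cash now"])))
```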