Naive Bayes Algorithm
Let's explore the Naive Bayes algorithm with an example.

Let's say we have a small training dataset: 6 data points (emails) and two class labels [spam/ham].
- First, prepare the data.
- Then, calculate the prior probabilities. Since 3 of the 6 emails are spam and 3 are ham:
P(spam) = 3/6 = 0.5
P(ham) = 3/6 = 0.5
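To make the arithmetic concrete, here is a minimal Python sketch of the prior calculation, assuming a hypothetical list of the six training labels:

```python
from collections import Counter

# Assumed toy labels for the 6 training emails (3 spam, 3 ham).
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

label_counts = Counter(labels)
priors = {label: count / len(labels) for label, count in label_counts.items()}
print(priors)  # {'spam': 0.5, 'ham': 0.5}
```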
- Now, calculate the likelihoods with Laplace smoothing.
Count the occurrences of each word within each class and apply Laplace smoothing (alpha = 1).

Likelihood probabilities are calculated using Laplace smoothing:
P(word | spam) = [count(word in spam) + 1] / [total words in spam + number of unique words]
P(word | ham) = [count(word in ham) + 1] / [total words in ham + number of unique words]
In this example, total words in spam = 11, total words in ham = 13, and the number of unique words in the training data = 19.
Word | P(word | spam) | P(word | ham)
free | (2 + 1)/(11 + 19) = 0.1 | (0 + 1)/(13 + 19) ≈ 0.03
Calculate the likelihood for each word in the vocabulary in the same way.
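Below is a small sketch of the smoothed likelihood calculation, assuming hypothetical tokenised spam and ham emails (the counts here will not match the worked numbers above exactly):

```python
from collections import Counter

# Assumed, already-tokenised training emails.
spam_docs = [["free", "lottery", "win", "free"], ["win", "cash", "now"]]
ham_docs = [["meeting", "at", "noon"], ["see", "you", "now"]]

vocab = {word for doc in spam_docs + ham_docs for word in doc}

def smoothed_likelihoods(docs, vocab, alpha=1):
    """P(word | class) = (count + alpha) / (total words in class + alpha * |vocab|)."""
    counts = Counter(word for doc in docs for word in doc)
    total = sum(counts.values())
    return {word: (counts[word] + alpha) / (total + alpha * len(vocab))
            for word in vocab}

p_word_given_spam = smoothed_likelihoods(spam_docs, vocab)
p_word_given_ham = smoothed_likelihoods(ham_docs, vocab)
print(p_word_given_spam["free"], p_word_given_ham["free"])
```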
- Calculate the posterior probabilities.
Let's take a new data point, a new email: "Free lottery now".
We need to find out whether it is spam or ham, so we compute the posterior probability of each class.
Words: "free", "lottery", "now"
P(spam | X) ∝ P(spam) · P(free | spam) · P(lottery | spam) · P(now | spam)
P(spam | X) ∝ 0.5 * 0.1 * 0.07 * 0.1 = 0.00035
P(ham | X) ∝ P(ham) · P(free | ham) · P(lottery | ham) · P(now | ham)
P(ham | X) ∝ 0.5 * 0.03 * 0.03 * 0.03 = 0.0000135
(The evidence term P(X) is the same for both classes, so it can be dropped when comparing them.)
Comparing the two, P(spam | X) is higher, so we label the new email as "spam".
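Here is the same comparison as a short Python sketch, plugging in the probabilities from the worked example above:

```python
# Probabilities taken from the worked example above.
priors = {"spam": 0.5, "ham": 0.5}
likelihoods = {
    "spam": {"free": 0.1, "lottery": 0.07, "now": 0.1},
    "ham":  {"free": 0.03, "lottery": 0.03, "now": 0.03},
}

email = ["free", "lottery", "now"]

scores = {}
for label in priors:
    score = priors[label]
    for word in email:
        score *= likelihoods[label][word]  # multiply prior by each word likelihood
    scores[label] = score

print(scores)                       # roughly {'spam': 0.00035, 'ham': 1.35e-05}
print(max(scores, key=scores.get))  # 'spam'
```

In practice, implementations usually sum log probabilities instead of multiplying raw probabilities, so that long products do not underflow to zero.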
Some edge-case scenarios to watch out for:
1. Unseen Words
- Issue: An email we need to classify might contain words that never appear in the training data. Without smoothing, their likelihoods are zero, so the posterior probability for every class collapses to zero.
- Solution: Use Laplace smoothing, which adds a small constant (usually 1) to each word count before calculating the probability.
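A tiny numeric sketch of why this matters, using the counts from the worked example above (total words in spam = 11, vocabulary size = 19):

```python
total_words_in_spam = 11  # from the worked example above
vocab_size = 19           # number of unique words in the training data
alpha = 1
count_unseen = 0          # a word that never appears in spam training emails

p_without_smoothing = count_unseen / total_words_in_spam
p_with_smoothing = (count_unseen + alpha) / (total_words_in_spam + alpha * vocab_size)
print(p_without_smoothing, p_with_smoothing)  # 0.0 vs roughly 0.033
```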
2. Highly Imbalanced Dataset
- Issue: If one class is significantly more frequent than the other, the classifier might be biased towards predicting the more frequent class.
- Solution: Adjust the prior probabilities or use techniques to balance the dataset, such as oversampling the minority class or undersampling the majority class.
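If you are using scikit-learn, one way to adjust the priors is the `class_prior` argument of `MultinomialNB`, which overrides the priors learned from the skewed label frequencies; the emails and labels below are assumed toy data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Assumed, imbalanced toy dataset: far more ham than spam.
texts = ["free lottery now", "meeting at noon", "see you tomorrow",
         "lunch at noon", "project update attached", "notes from the call"]
labels = ["spam", "ham", "ham", "ham", "ham", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# class_prior is given in the order of clf.classes_ (alphabetical: ham, spam).
clf = MultinomialNB(alpha=1.0, class_prior=[0.5, 0.5])
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["free lottery tonight"])))
```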
3. Multicollinearity
- Issue: Naive Bayes assumes independence between features, which might not hold in practice. For example, words in text data are often correlated (e.g., “lottery” and “win”).
- Solution: While this is a limitation of the algorithm, feature selection or transformation techniques (like PCA) can sometimes help reduce dependencies.
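As a rough sketch of the transformation idea: for sparse word-count matrices, `TruncatedSVD` (a PCA-style decomposition that works directly on sparse input) can fold correlated words into shared components; the texts below are assumed:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Assumed toy texts in which "lottery" and "win" tend to co-occur.
texts = ["free lottery win", "win the lottery now",
         "meeting at noon", "see you at noon"]

X = CountVectorizer().fit_transform(texts)

svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (4, 2): each email is now described by 2 components
```

Note that the transformed features are no longer word counts (they can be negative), so a Gaussian Naive Bayes variant would be the natural fit on them rather than the multinomial one.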
4. Rare Words
- Issue: Words that appear very rarely might unduly influence the classification.
- Solution: Apply techniques like term frequency-inverse document frequency (TF-IDF) to weight the importance of words.
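A hedged sketch of TF-IDF weighting feeding a Naive Bayes classifier with scikit-learn; the emails and labels are assumed toy data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Assumed toy training emails and labels.
texts = ["free lottery now", "win a cash prize", "meeting at noon", "see you tomorrow"]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)  # terms are re-weighted by inverse document frequency

clf = MultinomialNB().fit(X, labels)
print(clf.predict(vectorizer.transform(["free cash now"])))
```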