Naive Bayes Algorithm

Venkata Teja

Let’s explore the Naive Bayes algorithm with an example.

Let’s say we have a small email dataset: 6 data points and two class labels [spam/ham], with 3 emails in each class.

  • First, prepare the data.
  • Then, calculate the prior probabilities.

P(spam) = 3/6 = 0.5

P(ham) = 3/6 = 0.5
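
As a quick sketch of this step in Python (the label list below is just a stand-in for the 6 emails in the dataset, assuming 3 spam and 3 ham):

```python
from collections import Counter

# Stand-in labels for the 6 emails: 3 spam, 3 ham
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Prior probability of each class = (emails in that class) / (total emails)
total = len(labels)
priors = {cls: count / total for cls, count in Counter(labels).items()}
print(priors)  # {'spam': 0.5, 'ham': 0.5}
```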

  • Now calculate the likelihoods with Laplace smoothing.

Count the occurrences of each word across all data points and apply Laplace smoothing (alpha = 1). The likelihood probabilities are then calculated as:

P(word | spam) = (count of word in spam + 1) / (total words in spam + number of unique words)

P(word | ham) = (count of word in ham + 1) / (total words in ham + number of unique words)

Total words in spam = 11, total words in ham = 13, and the number of unique words = 19.

Word | P(word | spam) | P(word | ham)

free | (2 + 1) / (11 + 19) = 0.10 | (0 + 1) / (13 + 19) ≈ 0.03

Calculate the likelihood for each word in the vocabulary in the same way.
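
Here is a minimal sketch of this step in Python, plugging in the counts from the example above (2 occurrences of “free” in spam, 11 total spam words, 13 total ham words, 19 unique words):

```python
def likelihood(word_count, total_words_in_class, vocab_size, alpha=1):
    """Laplace-smoothed P(word | class)."""
    return (word_count + alpha) / (total_words_in_class + alpha * vocab_size)

# Counts taken from the worked example above
vocab_size = 19        # unique words across all emails
total_spam_words = 11
total_ham_words = 13

p_free_spam = likelihood(2, total_spam_words, vocab_size)  # (2 + 1) / (11 + 19) = 0.10
p_free_ham = likelihood(0, total_ham_words, vocab_size)    # (0 + 1) / (13 + 19) ≈ 0.03

print(round(p_free_spam, 2), round(p_free_ham, 2))  # 0.1 0.03
```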

  • Calculate the posterior probabilities.

Let’s take a new data point.

A new email: “Free lottery now”

We need to find out whether this is spam or ham. For that, let’s compute the posterior probability of each class.

Words: “free”, “lottery”, “now”

P(spam | X) ∝ P(spam) · P(free | spam) · P(lottery | spam) · P(now | spam) = 0.5 * 0.1 * 0.07 * 0.1 = 0.00035

P(ham | X) ∝ P(ham) · P(free | ham) · P(lottery | ham) · P(now | ham) = 0.5 * 0.03 * 0.03 * 0.03 = 0.0000135

(The common denominator P(X) is dropped, since it does not change which class scores higher.)

Comparing these two scores, P(spam | X) is higher, so we label the new data point as “Spam”.
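
Putting the whole classification step together, here is a small sketch that simply hard-codes the priors and smoothed likelihoods worked out above:

```python
# Smoothed likelihoods and priors from the worked example above
likelihoods = {
    "spam": {"free": 0.10, "lottery": 0.07, "now": 0.10},
    "ham":  {"free": 0.03, "lottery": 0.03, "now": 0.03},
}
priors = {"spam": 0.5, "ham": 0.5}

def score(words, cls):
    """Unnormalised posterior: P(class) * product of P(word | class)."""
    p = priors[cls]
    for word in words:
        p *= likelihoods[cls][word]
    return p

new_email = ["free", "lottery", "now"]
scores = {cls: score(new_email, cls) for cls in priors}
print(scores)                       # ≈ {'spam': 0.00035, 'ham': 1.35e-05}
print(max(scores, key=scores.get))  # spam
```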

There are some edge-case scenarios to be aware of:

1. Unseen words

Sometimes the email we need to classify contains new words that are not in the training dataset. Without smoothing, the likelihood of such a word is zero, and multiplying by it makes the posterior probability zero as well.

That’s why we use Laplace smoothing, where we add a small constant (most often 1) to every word count before calculating the probability.
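
A tiny sketch of the effect, reusing the counts from the example above (11 total spam words, 19 unique words):

```python
def likelihood(word_count, total_words_in_class, vocab_size, alpha=1):
    return (word_count + alpha) / (total_words_in_class + alpha * vocab_size)

# An unseen word has count 0; without smoothing its likelihood is 0
# and the whole product of probabilities collapses to 0.
print(likelihood(0, 11, 19, alpha=0))  # 0.0
print(likelihood(0, 11, 19, alpha=1))  # ≈ 0.033, small but non-zero
```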

2. Highly imbalanced dataset.

  • Issue: If one class is significantly more frequent than the other, the classifier might be biased towards predicting the more frequent class.
  • Solution: Adjust the prior probabilities or use techniques to balance the dataset, such as oversampling the minority class or undersampling the majority class.
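
As one possible sketch, scikit-learn’s MultinomialNB accepts a class_prior argument that overrides the priors estimated from an imbalanced dataset (the toy texts and labels below are made up purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy, deliberately imbalanced corpus (made up for illustration)
texts = ["free lottery now", "win free prize", "meeting at noon",
         "lunch tomorrow", "project update", "see you at the meeting"]
labels = ["spam", "spam", "ham", "ham", "ham", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(texts)

# class_prior overrides the priors that would otherwise be estimated
# from the imbalanced data; classes are ordered alphabetically: ['ham', 'spam']
model = MultinomialNB(alpha=1.0, class_prior=[0.5, 0.5])
model.fit(X, labels)

print(model.predict(vec.transform(["free lottery now"])))  # expected: ['spam']
```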

3. Multicollinearity

  • Issue: Naive Bayes assumes independence between features, which might not hold in practice. For example, words in text data are often correlated (e.g., “lottery” and “win”).
  • Solution: While this is a limitation of the algorithm, feature selection or transformation techniques (like PCA) can sometimes help reduce dependencies.
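
One hedged sketch of this idea with scikit-learn: project the word counts with PCA and feed the decorrelated components to GaussianNB (MultinomialNB would not accept PCA’s possibly negative output; the toy data is made up for illustration):

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Toy data, made up for illustration
texts = ["free lottery win", "win free lottery now", "meeting at noon", "lunch tomorrow"]
labels = ["spam", "spam", "ham", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(texts).toarray()  # PCA needs a dense array here

# PCA decorrelates the count features; GaussianNB then models the
# resulting (possibly negative) components.
model = make_pipeline(PCA(n_components=2), GaussianNB())
model.fit(X, labels)

print(model.predict(vec.transform(["free lottery"]).toarray()))  # expected: ['spam'] on this toy data
```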

4. Rare Words

  • Issue: Words that appear very rarely might unduly influence the classification.
  • Solution: Apply techniques like term frequency-inverse document frequency (TF-IDF) to weight the importance of words.
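
A short sketch with scikit-learn, chaining TfidfVectorizer and MultinomialNB in a pipeline (again with made-up toy data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data, made up for illustration
texts = ["free lottery now", "win a free prize now", "meeting at noon", "lunch tomorrow"]
labels = ["spam", "spam", "ham", "ham"]

# TF-IDF reweights raw counts by how informative each word is across
# documents before they reach Naive Bayes.
model = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=1.0))
model.fit(texts, labels)

print(model.predict(["free lottery tonight"]))  # expected: ['spam'] on this toy data
```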
