Naive Bayes Overview

Reference: https://www.educba.com/naive-bayes-algorithm/

Naive Bayes is a supervised machine learning method for classification. It is most commonly used for text classification, but it can be applied to other classification problems as well. The overall goal of Naive Bayes is to predict a label, or class, for a new data vector given the labeled data vectors already in the dataset. This is done by calculating the probability of every possible label given the new data vector. There are two crucial parts to understanding the Naive Bayes algorithm. One is that Naive Bayes is based on Bayes' Theorem. Two is that the predictors (variables) are assumed to be independent of each other. This assumption is what makes the probabilities in the algorithm feasible to calculate. Since independence is assumed, it is usually beneficial to check for correlation between variables before training a model; if high correlation is found between any pair of variables, one of them should be removed.
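As a quick illustration of that correlation check, here is a minimal sketch using pandas; the column names, data values, and the 0.9 threshold are all invented for the example.

```python
import pandas as pd

# Hypothetical feature matrix; a real dataset would replace these values.
df = pd.DataFrame({
    "humidity":   [0.30, 0.45, 0.80, 0.65, 0.90],
    "dew_point":  [0.28, 0.47, 0.78, 0.66, 0.88],  # closely tracks humidity
    "wind_speed": [12.0, 3.0, 7.0, 15.0, 5.0],
})

# Pairwise correlation between the predictors.
corr = df.corr()
print(corr)

# Flag pairs whose absolute correlation exceeds a chosen threshold (e.g. 0.9);
# one variable from each flagged pair is a candidate for removal.
threshold = 0.9
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if abs(corr.loc[a, b]) > threshold:
            print(f"{a} and {b} are highly correlated: {corr.loc[a, b]:.2f}")
```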

Naive Bayes Algorithm

Using Bayes' theorem, the probability of A happening given that B has occurred can be calculated: P(A | B) = (P(B | A) * P(A)) / P(B). In the Naive Bayes algorithm, A is the class to be predicted and B is the set of feature values of the new data vector; the probabilities on the right-hand side are estimated from the data vectors that have already been classified. The assumption made here is that the predictors/features are independent. That is, the presence of one particular feature does not affect another. This allows the probability for a data vector to be calculated when it would not be practical to estimate otherwise.

Reference: https://byjus.com/maths/bayes-theorem/

The following is an example based on a simple dataset that aims to answer the question: should we play pickleball?

Windy | Temperature | Weather | Play Pickleball
Yes   | Hot         | Sunny   | Yes
Yes   | Cold        | Sunny   | No
No    | Cold        | Cloudy  | No
Yes   | Mild        | Cloudy  | No
No    | Mild        | Sunny   | Yes
No    | Hot         | Cloudy  | Yes
No    | Hot         | Rainy   | No
Yes   | Cold        | Rainy   | No
No    | Mild        | Cloudy  | Yes
No    | Hot         | Sunny   | Yes

Given a new data vector X, we need to assign it a class, C, for Play Pickleball of either ‘Yes’ or ‘No’.

X = {Windy=No, Temperature=Mild, Weather=Sunny}

To simplify the probability equations, let A1 = (Windy = No), A2 = (Temperature = Mild), and A3 = (Weather = Sunny).

To determine the class for the new data vector X, the probability of each value of Play Pickleball given the values in X needs to be calculated. This is where Bayes’ Theorem comes into play.

P(C | X) = (P(A1 | C) * P(A2 | C) * P(A3 | C) * P(C)) / P(X)

Notice that the new data vector, X, stays the same regardless of the value of the class. This means the denominator of Bayes’ Theorem, P(X), can be dropped, as it has no effect on which class’s probability is larger. So only the numerator of Bayes’ Theorem needs to be calculated. The assumption of independence between the predictors is what allows the conditional probability P(X | C) to be split into a product of conditional probabilities for the individual values of X, as in the numerator above.

In this example, the class, C, can take on two values: ‘Yes’ or ‘No’. So the probability of each class given the new data vector, X, is calculated below. Because the shared denominator P(X) is dropped, each result is proportional to, rather than equal to, the true probability.

P(C=’Yes’ | A1, A2, A3) ∝ P(A1 | C=’Yes’) * P(A2 | C=’Yes’) * P(A3 | C=’Yes’) * P(C=’Yes’) = (4/5) * (2/5) * (3/5) * (5/10) = 0.096

P(C=’No’ | A1, A2, A3) ∝ P(A1 | C=’No’) * P(A2 | C=’No’) * P(A3 | C=’No’) * P(C=’No’) = (2/5) * (1/5) * (1/5) * (5/10) = 0.008

The probability for C=’Yes’ given data vector X is greater than the probability for C=’No’ given data vector X. As a result, the new data vector X is assigned the class ‘Yes’. So, given that it is not windy, the temperature is mild, and the weather is sunny, we should play pickleball.
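To make the arithmetic concrete, here is a minimal Python sketch (using pandas; the variable and function names are my own) that reproduces the two scores directly from the table above.

```python
import pandas as pd

# The pickleball dataset from the table above.
data = pd.DataFrame({
    "Windy":       ["Yes", "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "No"],
    "Temperature": ["Hot", "Cold", "Cold", "Mild", "Mild", "Hot", "Hot", "Cold", "Mild", "Hot"],
    "Weather":     ["Sunny", "Sunny", "Cloudy", "Cloudy", "Sunny", "Cloudy", "Rainy", "Rainy", "Cloudy", "Sunny"],
    "Play":        ["Yes", "No", "No", "No", "Yes", "Yes", "No", "No", "Yes", "Yes"],
})

# The new data vector X to be classified.
new_vector = {"Windy": "No", "Temperature": "Mild", "Weather": "Sunny"}

def score(label):
    """Numerator of Bayes' theorem: P(A1 | C) * P(A2 | C) * P(A3 | C) * P(C)."""
    subset = data[data["Play"] == label]
    prob = len(subset) / len(data)                 # prior P(C)
    for column, value in new_vector.items():
        prob *= (subset[column] == value).mean()   # conditional P(Ai | C)
    return prob

for label in ["Yes", "No"]:
    print(label, score(label))   # expected: Yes 0.096, No 0.008
```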

Quantitative Data

When the dataset includes quantitative variables, the Naive Bayes algorithm calculates the probabilities of the quantitative variables using a probability distribution. The default is a normal distribution, but this can be changed if desired. A quick way to choose an appropriate probability distribution is to plot a histogram of the quantitative variable and visually determine its approximate distribution.
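For example, scikit-learn's GaussianNB models each quantitative feature with a normal distribution per class; the sketch below uses made-up numbers purely for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up quantitative features (e.g. temperature, wind speed) and a binary label.
X = np.array([[75, 5], [80, 3], [55, 12], [60, 15], [82, 4], [58, 14]])
y = np.array([1, 1, 0, 0, 1, 0])

# Each feature is modeled with a normal distribution for each class.
model = GaussianNB()
model.fit(X, y)

print(model.predict([[70, 6]]))        # predicted class for a new vector
print(model.predict_proba([[70, 6]]))  # class probabilities
```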

Multinomial Naive Bayes

Implements the Naive Bayes algorithm for multinomially distributed data. It is one of the two most common Naive Bayes variants used for text classification. The multinomial distribution is a probability distribution over the counts of outcomes across two or more categories. Using this distribution allows Multinomial Naive Bayes to model features that are counts over two or more categories, which is why word counts in text classification are a natural fit.
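A rough sketch of how this variant is commonly applied to text, assuming scikit-learn's MultinomialNB on word counts from a CountVectorizer; the tiny corpus and labels are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented corpus with three classes.
texts = [
    "great match, love playing pickleball",
    "the paddle broke, terrible purchase",
    "rain forecast for the weekend",
    "fantastic game and friendly players",
    "storm warning, courts closed all day",
]
labels = ["positive", "negative", "weather", "positive", "weather"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)   # word-count features (multinomial data)

model = MultinomialNB()
model.fit(X, labels)

new_doc = vectorizer.transform(["love the friendly game"])
print(model.predict(new_doc))
```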

Bernoulli Naive Bayes

Implements the Naive Bayes algorithm for data that follows a Bernoulli distribution. It is one of the two most common Naive Bayes variants used for text classification. The Bernoulli distribution is a discrete probability distribution for experiments whose outcome is either a ‘success’ or a ‘failure’. Using this distribution means each feature in Bernoulli Naive Bayes is binary: the feature is either present (‘success’) or absent (‘failure’), as with word occurrence in a document.
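A minimal sketch, assuming binary present/absent features and scikit-learn's BernoulliNB; the feature matrix and labels are invented.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Binary features, e.g. whether certain words occur in a message.
X = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
])
y = np.array([1, 0, 1, 0, 1])   # class labels

# Each feature is modeled as a Bernoulli trial per class.
model = BernoulliNB()
model.fit(X, y)

print(model.predict([[1, 0, 1, 1]]))
```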

Reference: https://brilliant.org/wiki/bernoulli-distribution/

Smoothing

Smoothing is the solution to the zero-probability problem in Naive Bayes. Sometimes the conditional probability for one or more variables in a new data vector is zero for a class. When this happens, the overall conditional probability goes to zero regardless of the other individual conditional probabilities, which is not desired. To solve this, a smoothing technique is used. Smoothing techniques essentially add small counts to the numerator and denominator of each conditional probability so that the overall probability can never be exactly zero. The most popular technique is Laplace (add-one) smoothing, which adds one to each count; Lidstone smoothing generalizes this by adding a fractional count instead.
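As a concrete illustration, the sketch below applies Laplace (add-one) smoothing to one conditional probability from the pickleball table, P(Weather=’Rainy’ | Play=’Yes’), which would otherwise be zero.

```python
# Laplace (add-one) smoothing for P(Weather='Rainy' | Play='Yes'),
# using counts from the pickleball table above. Without smoothing this
# probability is zero, which would zero out the whole product for 'Yes'.

count_rainy_given_yes = 0   # no 'Yes' rows have Weather='Rainy'
count_yes = 5               # total 'Yes' rows
n_weather_values = 3        # Sunny, Cloudy, Rainy

unsmoothed = count_rainy_given_yes / count_yes
smoothed = (count_rainy_given_yes + 1) / (count_yes + n_weather_values)

print(unsmoothed)  # 0.0
print(smoothed)    # 0.125
```

In libraries such as scikit-learn, this behavior is typically controlled by an alpha parameter on the Naive Bayes classifier, where alpha=1 corresponds to Laplace smoothing.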

Reference: Introduction to Data Mining, 2nd Edition, p. 224