Naïve Bayes in Python - Casey Cooper

Data Cleaning

The dataset being used is the classic Titanic dataset from the seaborn library. The models will attempt to predict whether a passenger survived.

The initial uncleaned dataset can be viewed and downloaded below.

Unclean_Data_Titanic.csv Download

The first step is removing duplicate columns.

survived is the same as alive
embarked is the same as embark_town
sex is largely duplicated by who (who additionally labels children)
pclass is the same as class

So, alive, embark_town, who, and class were removed.
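
One way to confirm that two columns carry the same information is to check that their values map one-to-one. The sketch below uses a small hand-made DataFrame rather than the real data, and the helper name is_one_to_one is hypothetical. In this toy data the who column also contains a child label, so the check flags it as overlapping with sex rather than being an exact copy.

```python
import pandas as pd


def is_one_to_one(df, col_a, col_b):
    """Return True if every value of col_a pairs with exactly one value of col_b, and vice versa."""
    a_to_b = df.groupby(col_a)[col_b].nunique()
    b_to_a = df.groupby(col_b)[col_a].nunique()
    return bool((a_to_b == 1).all() and (b_to_a == 1).all())


# Toy rows for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1],
    "alive": ["no", "yes", "yes", "no", "yes"],
    "sex": ["male", "female", "female", "male", "male"],
    "who": ["man", "woman", "child", "man", "child"],
})

survived_alive_redundant = is_one_to_one(toy, "survived", "alive")
sex_who_redundant = is_one_to_one(toy, "sex", "who")

print("survived vs alive:", survived_alive_redundant)
print("sex vs who:", sex_who_redundant)
```

A column that fails this check can still be dropped when its extra detail is already captured elsewhere, which is a modeling judgment rather than a strict duplicate removal.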

The second step is dealing with missing values for the deck column. Approximately 77% of the data for this column was missing, so the deck variable was removed altogether.

The third step is dealing with missing values for the embarked column. There are only two missing values. Since this is a categorical column, the missing values were replaced with the mode, which was ‘S’.

The fourth step is dealing with missing values for the age column. Since age may have a relationship with other columns, the missing values were imputed with the mean after grouping by pclass and sex.
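
The two imputation steps above can be sketched on a tiny hand-made DataFrame. The values below are made up for illustration. The grouped mean is computed with groupby(...).transform('mean'), which returns a Series aligned to the original rows. This is one reasonable way to do it, not necessarily the only one.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "sex": ["female", "female", "female", "male", "male", "male"],
    "age": [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
    "embarked": ["S", "C", "S", None, "S", "Q"],
})

# Categorical column: fill gaps with the most frequent value (the mode)
embarked_mode = toy["embarked"].mode()[0]
toy["embarked"] = toy["embarked"].fillna(embarked_mode)

# Numeric column: fill each gap with the mean age of the same (pclass, sex) group
group_mean_age = toy.groupby(["pclass", "sex"])["age"].transform("mean")
toy["age"] = toy["age"].fillna(group_mean_age)

print(toy)
```

Here the missing age in the first-class female group becomes 35.0 (the mean of 30 and 40), and the one in the third-class male group becomes 22.0.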

The cleaned dataset can be viewed and downloaded below.

Clean_Data_Titanic.csv Download

The code for the data cleaning can be viewed and downloaded below.

Titanic_Data_Prep.py Download


# -*- coding: utf-8 -*-
"""
Created on Mon May 15 16:37:51 2023

@author: casey
"""

## LOAD LIBRARIES
# Set seed for reproducibility
import random
random.seed(53)
import pandas as pd
import seaborn as sns

# ------------------------------------------------------------------------------------------------ #
## LOAD DATASET
df = sns.load_dataset('titanic')

# ------------------------------------------------------------------------------------------------ #
## DISPLAY DETAILS OF DATASET

# Number of rows and columns
df.shape

# General details (includes missing values)
df.info()

# Visual look at dataset
df.head(10)

# ------------------------------------------------------------------------------------------------ #
## REMOVE DUPLICATE COLUMNS
# 'survived' is same as 'alive', 
# 'embarked' is abbreviation of 'embark_town', 
# 'sex' is largely duplicated by 'who' ('who' is man/woman/child)
# 'pclass' is same as 'class'

# columns= already selects the column axis, so axis=1 is not needed
df.drop(columns=['alive','embark_town','who','class'], inplace = True)

# ------------------------------------------------------------------------------------------------ #
## DEAL WITH MISSING VALUES
df.isnull().sum()

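# 'deck' has 77% of values missing so that column will just be removed
df.drop(columns=['deck'], inplace=True)

# 'embarked' has 2 missing values, will replace with the mode ('S')
# assign the result back instead of calling fillna(inplace=True) on a column,
# which newer pandas versions warn about and may not apply to df
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])

# age may have a relationship with other columns, so impute with the mean of each (pclass, sex) group
# transform('mean') returns a Series aligned to df's index, so it can be passed straight to fillna
df['age'] = df['age'].fillna(df.groupby(['pclass', 'sex'])['age'].transform('mean'))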

# ------------------------------------------------------------------------------------------------ #
## WRITE CLEANED DATASET TO .CSV

df.to_csv('C:/Users/casey/OneDrive/Documents/Machine_Learning/Supervised_Learning/Data/Clean_Data_Titanic.csv',
          index = False)

The final cleaned dataset contains the following columns.

survived: whether the passenger survived or not (1=survived, 0=not survived)
pclass: the class the passenger stayed in (1, 2, or 3)
sex: the sex of the passenger
age: the age of the passenger
sibsp: number of siblings/spouses aboard for each passenger
parch: number of parents/children aboard for each passenger
fare: the fare for each passenger
embarked: port each passenger embarked from
adult_male: whether the passenger was an adult male or not
alone: whether the passenger was traveling alone or not

Modeling Prep

Numeric variables only
- Use one hot encoding for categorical variables
Check for variable correlation
- Naïve Bayes assumes independence between variables, so highly correlated variables should be removed
Split data into train and test sets

Creating Naïve Bayes models in Python requires three key preparatory steps. First, the variables need to be numeric, so any categorical variables must be converted using one-hot encoding. In this case, pclass, sex, embarked, and alone were one-hot encoded (adult_male was dropped beforehand because of its correlation with sex). Second, the variables need to be checked for correlation. Third, since Naïve Bayes is a supervised machine learning model, the prepped data must be split into training and testing sets. The training set is used to train the model, and the testing set is used to estimate its accuracy on unseen data. In the following example, the training set is a random 80% of the data and the testing set is the remaining 20%. This ratio is not the only option, just a popular one. The training and testing sets must be kept disjoint (separated) throughout the modeling process. Otherwise, the test accuracy will most likely be overly optimistic, hiding overfitting and poor performance on real data that comes from neither set.
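
As a minimal sketch of the encoding and splitting steps, the example below one-hot encodes two categorical columns of a toy DataFrame and makes an 80/20 split. The rows are invented for illustration, and random_state=53 is an assumed choice that simply makes the split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy rows for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
    "pclass": [3, 1, 2, 3, 1, 3, 2, 1, 2, 3],
    "sex": ["male", "female", "female", "male", "female",
            "male", "male", "female", "female", "male"],
    "fare": [7.25, 71.28, 13.0, 8.05, 53.1, 8.46, 21.0, 30.0, 26.0, 7.9],
})

# One-hot encode the categorical columns; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["pclass", "sex"], dtype=int)

# Separate the features from the label
X = encoded.drop(columns="survived")
y = encoded["survived"]

# 80% train, 20% test; the two sets share no rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=53
)

print(list(X.columns))
print(len(X_train), "training rows,", len(X_test), "testing rows")
```

Each category becomes its own 0/1 column (for example pclass_1 or sex_female), which is the numeric form the Naïve Bayes models expect.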

The code for the modeling prep, modeling, and model evaluation can be viewed and downloaded below.

Titanic_Naive_Bayes.py Download


# -*- coding: utf-8 -*-
"""
Created on Tue May 16 16:51:49 2023

@author: casey
"""

## LOAD LIBRARIES
# Set seed for reproducibility
import random
random.seed(53)
import pandas as pd

# Import all we need from sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

# Import visualization
import scikitplot as skplt
import matplotlib.pyplot as plt
import seaborn as sns

# ------------------------------------------------------------------------------------------------ #
## LOAD DATA
nb_df = pd.read_csv('C:/Users/casey/OneDrive/Documents/Machine_Learning/Supervised_Learning/Data/Clean_Data_Titanic.csv')

# ------------------------------------------------------------------------------------------------ #
## CHECK FOR VARIABLE CORRELATION

# convert categorical variables to number codes so they can be used in the heatmap
nb_df['sex'] = nb_df['sex'].astype('category')
nb_df['sex'] = nb_df['sex'].cat.codes
nb_df['embarked'] = nb_df['embarked'].astype('category')
nb_df['embarked'] = nb_df['embarked'].cat.codes
nb_df['adult_male'] = nb_df['adult_male'].astype('category')
nb_df['adult_male'] = nb_df['adult_male'].cat.codes
nb_df['alone'] = nb_df['alone'].astype('category')
nb_df['alone'] = nb_df['alone'].cat.codes

# heatmap plot
fig, ax = plt.subplots(figsize=(12,12)) 
dataplot = sns.heatmap(nb_df.corr(), cmap="YlGnBu", annot=True, ax=ax)
  
# displaying heatmap
plt.show()

# remove adult_male as it has high correlations
nb_df.drop(columns='adult_male', axis=1, inplace=True)

# ------------------------------------------------------------------------------------------------ #
## CREATE DUMMY VARIABLES FOR CATEGORICAL VARIABLES
nb_onehot = nb_df.copy()
nb_onehot = pd.get_dummies(nb_onehot, columns = ['pclass', 'sex', 'embarked', 'alone'])

# ------------------------------------------------------------------------------------------------ #
## CREATE TRAIN AND TEST SETS

# X will contain all variables except the labels (the labels are the first column 'survived')
X = nb_onehot.iloc[:,1:]
# y will contain the labels as a 1-D Series, which is the shape scikit-learn expects in fit()
y = nb_onehot['survived']

# split the data randomly into 80% train and 20% test
# X_train contains the features for the training set
# X_test contains the features for the testing set
# y_train contains the labels for the training set
# y_test contains the labels for the testing set
# random_state fixes the shuffle so the split is reproducible; random.seed() alone
# does not affect scikit-learn, which draws from NumPy's random generator
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=53)

# ------------------------------------------------------------------------------------------------ #
## CREATE MULTINOMIAL NAIVE BAYES MODEL
# default smoothing parameter alpha=1
# Look at below documentation for parameters
# https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
MultiNB_Classifier = MultinomialNB()
MultiNB_Classifier.fit(X_train, y_train)

## EVALUATE MODEL
y_pred = MultiNB_Classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# ------------------------------------------------------------------------------------------------ #
## PLOT CONFUSION MATRIX

fig = plt.figure(figsize=(15,6))

ax1 = fig.add_subplot(121)
# the true labels go first, then the predictions
skplt.metrics.plot_confusion_matrix(y_test, y_pred,
                                    title="Confusion Matrix for Multinomial Naive Bayes Model",
                                    cmap="Oranges",
                                    ax=ax1)

# ------------------------------------------------------------------------------------------------ #
## CREATE BERNOULLI NAIVE BAYES MODEL
# default smoothing parameter alpha=1
# Look at below documentation for parameters
# https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html
BernNB_Classifier = BernoulliNB()
BernNB_Classifier.fit(X_train, y_train)

## EVALUATE MODEL
y_pred = BernNB_Classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# ------------------------------------------------------------------------------------------------ #
## PLOT CONFUSION MATRIX

fig = plt.figure(figsize=(15,6))

ax1 = fig.add_subplot(121)
# the true labels go first, then the predictions
skplt.metrics.plot_confusion_matrix(y_test, y_pred,
                                    title="Confusion Matrix for Bernoulli Naive Bayes Model",
                                    cmap="Oranges",
                                    ax=ax1)

Model Evaluation Key Ideas

It’s necessary to first check for variable correlation
- If a variable is highly correlated with others, remove it before modeling
Attempt to classify passengers of the Titanic as survived or not, with high accuracy

Check for Variable Correlation

To check for correlation, a correlation heatmap is used.

The adult_male column has high correlation with sex as well as some moderate correlations with other variables. Due to this, the adult_male variable should be removed prior to Naïve Bayes modeling.
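
Besides reading the heatmap by eye, highly correlated pairs can be listed programmatically. The sketch below does this on a toy DataFrame of 0/1 codes. The rows are invented, and the 0.7 cutoff is a hypothetical threshold chosen only for the example.

```python
import pandas as pd

# Toy 0/1 codes for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    "sex": [1, 0, 0, 1, 1, 0, 1, 1],         # 1 = male, 0 = female
    "adult_male": [1, 0, 0, 1, 0, 0, 1, 1],  # one male in this toy data is a child
    "alone": [1, 0, 1, 1, 0, 0, 1, 0],
})

THRESHOLD = 0.7  # hypothetical cutoff for "highly correlated"

# Absolute Pearson correlation between every pair of columns
corr = toy.corr().abs()

# Visit each unordered pair once and keep those above the cutoff
high_pairs = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > THRESHOLD
]

print(high_pairs)
```

In this toy data only the sex and adult_male pair exceeds the cutoff, mirroring the pattern described above.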

Modeling (Multinomial Naïve Bayes)

The Multinomial Naïve Bayes model uses a multinomial probability distribution. This means it is designed for discrete features such as counts. Generally, Multinomial Naïve Bayes works best for text data, where the features are word counts.
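
To make the role of the smoothing parameter alpha concrete, the sketch below recomputes the per-class feature probabilities by hand as (count + alpha) / (class total + alpha * number of features) and compares them with what scikit-learn's MultinomialNB learns. The count matrix is a made-up example, not the Titanic data.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count features for illustration only (rows = samples, columns = features)
X = np.array([
    [2, 1, 0],
    [1, 0, 3],
    [0, 2, 1],
    [3, 0, 0],
])
y = np.array([0, 1, 1, 0])
alpha = 1.0  # the default additive (Laplace) smoothing

model = MultinomialNB(alpha=alpha).fit(X, y)

# Manual smoothed estimate for each class
manual = []
for c in model.classes_:
    counts = X[y == c].sum(axis=0)  # total count of each feature within class c
    manual.append((counts + alpha) / (counts.sum() + alpha * X.shape[1]))
manual = np.array(manual)

# feature_log_prob_ stores the log of the same probabilities
matches_sklearn = np.allclose(manual, np.exp(model.feature_log_prob_))
print(manual)
print("matches scikit-learn:", matches_sklearn)
```

Because of the smoothing, a feature that never occurs in a class still gets a small nonzero probability instead of zeroing out the whole product.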

The confusion matrix and evaluation metrics can be viewed below.

Accuracy: 0.70
Precision (0): 0.70
Precision (1): 0.71
Recall (0): 0.84
Recall (1): 0.52

Modeling (Bernoulli Naïve Bayes)

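The Bernoulli Naïve Bayes model uses a Bernoulli probability distribution. This means it is designed for binary (0/1) features. Since most of the one-hot encoded features are binary, it is expected that Bernoulli Naïve Bayes will perform better than Multinomial Naïve Bayes. Note that scikit-learn's BernoulliNB binarizes every feature at a threshold of 0.0 by default, so continuous columns such as age and fare are reduced to whether they are greater than zero.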
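
One detail worth seeing in code is how scikit-learn's BernoulliNB treats non-binary inputs: its binarize parameter (0.0 by default) maps every value greater than the threshold to 1 and everything else to 0. The sketch below uses invented rows whose columns stand in for age, fare, and a male indicator, and shows that the default model behaves like one trained on pre-binarized data.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy rows for illustration only: columns stand in for age, fare, sex_male
X = np.array([
    [22.0, 7.25, 1],
    [38.0, 71.28, 0],
    [26.0, 7.92, 0],
    [35.0, 53.10, 0],
    [35.0, 8.05, 1],
    [54.0, 51.86, 1],
])
y = np.array([0, 1, 1, 1, 0, 0])

# Default model: inputs are binarized internally at threshold 0.0
default_model = BernoulliNB().fit(X, y)

# Same binarization done by hand, with internal binarization switched off
X_binary = (X > 0.0).astype(int)
prebinarized_model = BernoulliNB(binarize=None).fit(X_binary, y)

same_predictions = np.array_equal(
    default_model.predict(X), prebinarized_model.predict(X_binary)
)
# Positive continuous columns collapse to a constant 1 and stop being informative
age_and_fare_constant = bool((X_binary[:, :2] == 1).all())

print("same predictions:", same_predictions)
print("age and fare constant after binarizing:", age_and_fare_constant)
```

If the continuous columns should contribute, one option is to bin them into categories and one-hot encode the bins before fitting.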

The confusion matrix and evaluation metrics can be viewed below.

Accuracy: 0.74
Precision (0): 0.76
Precision (1): 0.71
Recall (0): 0.80
Recall (1): 0.65
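
For reference, the metrics listed above are all derived from the four cells of a binary confusion matrix. The sketch below shows the arithmetic with hypothetical counts, not the counts from either model. scikit-learn's confusion_matrix lays the cells out as [[TN, FP], [FN, TP]] when the labels are 0 and 1.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute accuracy plus per-class precision and recall from confusion matrix cells."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision_0": tn / (tn + fn),  # of all predicted 0, how many were truly 0
        "precision_1": tp / (tp + fp),  # of all predicted 1, how many were truly 1
        "recall_0": tn / (tn + fp),     # of all true 0, how many were found
        "recall_1": tp / (tp + fn),     # of all true 1, how many were found
    }


# Hypothetical counts for illustration only
metrics = binary_metrics(tn=80, fp=20, fn=30, tp=50)

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Read this way, a low Recall (1) means many passengers who actually survived were predicted as not surviving.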