Data Cleaning
The dataset being used is the classic Titanic dataset from the seaborn library. The models will attempt to predict whether or not a passenger survived.
The initial uncleaned dataset can be viewed and downloaded below.
The first step is removing duplicate columns:
- survived is the same as alive
- embarked is the same as embark_town
- sex is the same as who
- pclass is the same as class
So, alive, embark_town, who, and class were removed.
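Before dropping anything, each suspected pair can be cross-tabulated to see how its categories line up. The snippet below is a minimal sketch of that check and is not part of the cleaning script that follows.
## VERIFY DUPLICATE COLUMN PAIRS (optional check)
import pandas as pd
import seaborn as sns

df = sns.load_dataset('titanic')
# Each crosstab shows how the categories of one column map onto the other
print(pd.crosstab(df['survived'], df['alive']))
print(pd.crosstab(df['embarked'], df['embark_town']))
print(pd.crosstab(df['sex'], df['who']))
print(pd.crosstab(df['pclass'], df['class']))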
The second step is dealing with missing values for the deck column. Approximately 77% of the values in this column were missing, so the deck variable was removed altogether.
The third step is dealing with missing values for the embarked column. There are only two missing values. Since this is a categorical column, the missing values were replaced with the mode, which was 'S'.
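Both of those decisions are easy to check directly; a small sketch, assuming df is the raw seaborn Titanic frame:
## CHECK MISSINGNESS AND MODE (optional check)
# Fraction of missing values in 'deck' (approximately 0.77 for this dataset)
print(df['deck'].isnull().mean())
# Distribution and mode of 'embarked'; the mode is 'S'
print(df['embarked'].value_counts())
print(df['embarked'].mode()[0])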
The fourth step is dealing with missing values for the age column. Since age may have a relationship with other columns, the missing values were imputed with the mean after grouping by pclass and sex.
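The motivation is that typical age differs across class and sex groups, so a single global mean would blur those differences. A minimal sketch of inspecting the group means that drive this imputation, again assuming df is the raw frame:
## INSPECT GROUP MEANS FOR AGE IMPUTATION (optional check)
# Mean age within each (pclass, sex) group; these are the fill values used for missing ages
print(df.groupby(['pclass', 'sex'])['age'].mean())
# Number of missing ages before imputation
print(df['age'].isnull().sum())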
The cleaned dataset can be viewed and downloaded below.
The code for the data cleaning can be viewed and downloaded below.
# -*- coding: utf-8 -*-
"""
Created on Mon May 15 16:37:51 2023
@author: casey
"""
## LOAD LIBRARIES
# Set seed for reproducibility
import random
random.seed(53)
import pandas as pd
import seaborn as sns
# ------------------------------------------------------------------------------------------------ #
## LOAD DATASET
df = sns.load_dataset('titanic')
# ------------------------------------------------------------------------------------------------ #
## DISPLAY DETAILS OF DATASET
# Number of rows and columns
df.shape
# General details (includes missing values)
df.info()
# Visual look at dataset
df.head(10)
# ------------------------------------------------------------------------------------------------ #
## REMOVE DUPLICATE COLUMNS
# 'survived' is same as 'alive',
# 'embarked' is abbreviation of 'embark_town',
# 'sex' is same as 'who'
# 'pclass' is same as 'class'
df.drop(columns=['alive', 'embark_town', 'who', 'class'], inplace=True)
# ------------------------------------------------------------------------------------------------ #
## DEAL WITH MISSING VALUES
df.isnull().sum()
# 'deck' has 77% of values missing so that column will just be removed
df.drop(columns=['deck'], inplace=True)
# `embarked` has 2 missing values, will replace with the mode
df['embarked'] = df['embarked'].fillna('S')
# age may have a relationship with other columns, so impute the group mean after grouping by 'pclass' and 'sex'
# (a simpler alternative would be a global fill: df['age'] = df['age'].fillna(df['age'].median()))
# transform keeps the result aligned with the original row index
df['age'] = df.groupby(['pclass', 'sex'])['age'].transform(lambda s: s.fillna(s.mean()))
# ------------------------------------------------------------------------------------------------ #
## WRITE CLEANED DATASET TO .CSV
df.to_csv('C:/Users/casey/OneDrive/Documents/Machine_Learning/Supervised_Learning/Data/Clean_Data_Titanic.csv',
index = False)
The final cleaned dataset contains the following columns.
- survived: whether the passenger survived or not (1 = survived, 0 = did not survive)
- pclass: the class the passenger stayed in (1, 2, or 3)
- sex: the sex of the passenger
- age: the age of the passenger
- sibsp: number of siblings/spouses aboard for each passenger
- parch: number of parents/children aboard for each passenger
- fare: the fare paid by each passenger
- embarked: the port each passenger embarked from
- adult_male: whether the passenger was an adult male or not
- alone: whether the passenger was traveling alone or not
Modeling Prep
- Numeric variables only
- Use one hot encoding for categorical variables
- Split data into train and test sets
- Tree-based, so no normalization necessary
Creating XGBoost models in Python requires two key steps. The first is that all variables need to be numeric, so any categorical variables must be converted to numeric variables using one-hot encoding. In this case, pclass, sex, embarked, alone, and adult_male were all converted to numeric indicator variables. The second is that, since XGBoost is a supervised machine learning model, the prepped data must be split into training and testing sets. The training set is used to train the model, and the testing set is used to test its accuracy. In the following example, the training set is created by randomly selecting 80% of the data and the testing set from the remaining 20%. These proportions are not the only option, just a popular one. The training and testing sets must be kept disjoint (separate) throughout the modeling process; failing to do so will most likely result in overfitting and poor performance on real data outside the training and testing sets. Since XGBoost is a tree-based method, no data normalization is necessary prior to modeling.
The code for the modeling prep as well as the modeling and model evaluation can be viewed and downloaded below.
# -*- coding: utf-8 -*-
"""
Created on Sat May 20 15:24:00 2023
@author: casey
"""
## LOAD LIBRARIES
import pandas as pd
import numpy as np
# Import all we need from sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold, cross_val_score
# Import XGBoost
import xgboost as xgb
# Import Bayesian Optimization
from hyperopt import tpe, STATUS_OK, Trials, hp, fmin, space_eval
# Import visualization
import scikitplot as skplt
import matplotlib.pyplot as plt
import seaborn as sns
# ------------------------------------------------------------------------------------------------ #
## LOAD DATA
dt_df = pd.read_csv('C:/Users/casey/OneDrive/Documents/Machine_Learning/Supervised_Learning/Data/Clean_Data_Titanic.csv')
# ------------------------------------------------------------------------------------------------ #
## CREATE DUMMY VARIABLES FOR CATEGORICAL VARIABLES
dt_onehot = dt_df.copy()
dt_onehot = pd.get_dummies(dt_onehot, columns = ['pclass', 'sex', 'embarked', 'alone', 'adult_male'])
# ------------------------------------------------------------------------------------------------ #
## CREATE TRAIN AND TEST SETS
# X will contain all variables except the labels (the labels are the first column 'survived')
X = dt_onehot.iloc[:, 1:]
# y will contain the labels as a 1-D Series (the labels are the first column 'survived')
y = dt_onehot.iloc[:, 0]
# split the data vectors randomly into 80% train and 20% test
# X_train contains the quantitative variables for the training set
# X_test contains the quantitative variables for the testing set
# y_train contains the labels for training set
# y_test contains the labels for the testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# ------------------------------------------------------------------------------------------------ #
## CREATE FULL XGBOOST
# https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn
# documentation for parameters
# https://xgboost.readthedocs.io/en/latest/parameter.html
# default parameters
# increasing max_depth could lead to overfitting
# alpha/lambda are regularization parameters
# gamma is typically between 0 and 5; larger values mean more regularization
# learning_rate is typically in the 0.01-0.3 range
# n_estimators is the number of trees, typically in the 100-1000 range
xgb_Classifier = xgb.XGBClassifier(max_depth = 6,
n_estimators = 100,
learning_rate = 0.3,
colsample_bytree = 1,
subsample = 1,
gamma = 0,
alpha = 0,
reg_lambda = 1,
seed = 4
)
xgb_Classifier.fit(X_train, y_train)
## EVALUATE MODEL
y_pred = xgb_Classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
# ------------------------------------------------------------------------------------------------ #
## PLOT CONFUSION MATRIX
fig = plt.figure(figsize=(15,6))
ax1 = fig.add_subplot(121)
skplt.metrics.plot_confusion_matrix(y_test, y_pred,
                                    title="Confusion Matrix for Default XGBoost",
                                    cmap="Oranges",
                                    ax=ax1)
# ------------------------------------------------------------------------------------------------ #
## GRIDSEARCH TO FIND OPTIMAL PARAMETERS
# Warning: Can take a long time to complete, 720 fits takes approx 4 min, 8640 fits takes approx 35 min
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
estimator = xgb.XGBClassifier(seed=4)
parameters = {
'max_depth': [3, 4, 5],
'n_estimators': range(100, 500, 50),
'learning_rate': [0.01, 0.05, 0.1],
'reg_lambda': [0, 0.5, 1, 2],
'alpha': [0, 0.5, 1]
}
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
grid_search = GridSearchCV(
estimator = estimator,
param_grid = parameters,
scoring = 'accuracy',
cv = skf,
verbose=1
)
grid_search.fit(X_train, y_train)
# gets best params
grid_search.best_params_
# calculates time to complete search
mean_fit_time= grid_search.cv_results_['mean_fit_time']
mean_score_time= grid_search.cv_results_['mean_score_time']
n_splits = grid_search.n_splits_ #number of splits of training data
n_iter = pd.DataFrame(grid_search.cv_results_).shape[0] # number of parameter combinations evaluated
print('Runtime:', (np.mean(mean_fit_time + mean_score_time) * n_splits * n_iter)/60, 'minutes')
# FIT MODEL WITH BEST PARAMS
xgb_Classifier = xgb.XGBClassifier(max_depth = 4,
n_estimators = 250,
learning_rate = 0.1,
colsample_bytree = 1,
subsample = 1,
gamma = 0,
alpha = 1,
reg_lambda = 2,
seed = 4
)
xgb_Classifier.fit(X_train, y_train)
## EVALUATE MODEL
y_pred = xgb_Classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
## PLOT CONFUSION MATRIX
fig = plt.figure(figsize=(15,6))
ax1 = fig.add_subplot(121)
skplt.metrics.plot_confusion_matrix(y_test, y_pred,
                                    title="Confusion Matrix for GridSearch Tuned XGBoost",
                                    cmap="Oranges",
                                    ax=ax1)
# ------------------------------------------------------------------------------------------------ #
## RANDOMSEARCH TO FIND OPTIMAL PARAMETERS
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
# Randomly samples parameters so faster than GridSearch, can include more options
estimator = xgb.XGBClassifier(seed=4)
parameters = {
'max_depth': [3, 4, 5, 6, 7],
'n_estimators': range(100, 500, 50),
'learning_rate': [0.001, 0.01, 0.05, 0.1],
'reg_lambda': [0, 0.1, 0.5, 1, 2, 3],
'alpha': [0, 0.1, 0.5, 1, 2, 3],
'gamma': [0, 0.1, 0.5, 1, 2, 3],
'colsample_bytree': [0.1, 0.3, 0.5, 0.7, 0.9, 1],
'subsample': [0.1, 0.5, 0.7, 1]
}
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
random_search = RandomizedSearchCV(
estimator = estimator,
param_distributions = parameters,
scoring = 'accuracy',
cv = skf,
verbose=1
)
random_search.fit(X_train, y_train)
# gets best params
random_search.best_params_
# FIT MODEL WITH BEST PARAMS
xgb_Classifier = xgb.XGBClassifier(max_depth = 3,
n_estimators = 400,
learning_rate = 0.001,
colsample_bytree = 0.7,
subsample = 1,
gamma = 1,
alpha = 0.1,
reg_lambda = 1,
seed = 4
)
xgb_Classifier.fit(X_train, y_train)
## EVALUATE MODEL
y_pred = xgb_Classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
## PLOT CONFUSION MATRIX
fig = plt.figure(figsize=(15,6))
ax1 = fig.add_subplot(121)
skplt.metrics.plot_confusion_matrix(y_test, y_pred,
                                    title="Confusion Matrix for RandomSearch Tuned XGBoost",
                                    cmap="Oranges",
                                    ax=ax1)
# ------------------------------------------------------------------------------------------------ #
## BAYESIAN OPTIMIZATION TO FIND OPTIMAL PARAMETERS
# http://hyperopt.github.io/hyperopt/
# Uses a probabilistic model (TPE) of the objective to pick promising parameter settings, so it can cover a large space with fewer evaluations than GridSearch
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
space = {
'max_depth': hp.choice('max_depth', [3, 4, 5, 6, 7]),
'n_estimators': hp.choice('n_estimators', [100, 150, 200, 250, 300, 350, 400, 450, 500]),
'learning_rate': hp.choice('learning_rate', [0.001, 0.01, 0.05, 0.1]),
'reg_lambda': hp.choice('reg_lambda', [0, 0.1, 0.5, 1, 2, 3]),
'alpha': hp.choice('alpha', [0, 0.1, 0.5, 1, 2, 3]),
'gamma': hp.choice('gamma', [0, 0.1, 0.5, 1, 2, 3]),
'colsample_bytree': hp.choice('colsample_bytree', [0.1, 0.3, 0.5, 0.7, 0.9, 1]),
'subsample': hp.choice('subsample', [0.1, 0.5, 0.7, 1])
}
# Objective function
def objective(params):
    xgboost = xgb.XGBClassifier(seed=4, **params)
    score = cross_val_score(estimator=xgboost,
                            X=X_train,
                            y=y_train,
                            cv=skf,
                            scoring='accuracy',
                            n_jobs=-1).mean()
    # Loss is the negative score (fmin minimizes)
    loss = -score
    # Dictionary with information for evaluation
    return {'loss': loss, 'params': params, 'status': STATUS_OK}
# Optimize
# fmin searches the space for the hyperparameter values with the smallest loss
best = fmin(fn = objective, space = space, algo = tpe.suggest, max_evals = 200, trials = Trials())
# Print the index of the best parameters
print(best)
# Print the values of the best parameters
print(space_eval(space, best))
# FIT MODEL WITH BEST PARAMS
xgb_Classifier = xgb.XGBClassifier(max_depth = 5,
n_estimators = 450,
learning_rate = 0.01,
colsample_bytree = 0.5,
subsample = 0.5,
gamma = 0.5,
alpha = 0.1,
reg_lambda = 0,
seed = 4
)
xgb_Classifier.fit(X_train, y_train)
## EVALUATE MODEL
y_pred = xgb_Classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
## PLOT CONFUSION MATRIX
fig = plt.figure(figsize=(15,6))
ax1 = fig.add_subplot(121)
skplt.metrics.plot_confusion_matrix(y_test, y_pred,
                                    title="Confusion Matrix for Bayesian Optimization Tuned XGBoost",
                                    cmap="Oranges",
                                    ax=ax1)
Model Evaluation Key Ideas
- Begin by fitting a basic model with reasonable parameters (a cross-validation sanity check of this baseline is sketched after this list)
- Then optimize parameters for best performance/speed
- Can do this by hand
- Can use optimization algorithms (Grid Search, Random Search, Bayesian Optimization)
- Attempt to classify passengers of the Titanic as survived or not, with high accuracy
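One way to carry out that first step before any tuning is to score the default model with the same stratified cross-validation the searches use later. The sketch below assumes X_train and y_train come from the prep script above.
## BASELINE CROSS-VALIDATION CHECK (optional sketch)
from sklearn.model_selection import StratifiedKFold, cross_val_score
import xgboost as xgb

# 10-fold stratified CV accuracy of the default XGBoost classifier on the training set
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
baseline = xgb.XGBClassifier(seed=4)
cv_scores = cross_val_score(baseline, X_train, y_train, cv=skf, scoring='accuracy')
print(cv_scores.mean(), cv_scores.std())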
Modeling (Initial)
To begin, an XGBoost model was created using the default parameters and all of the variables. The confusion matrix and evaluation metrics can be viewed below.
- Accuracy: 0.79
- Precision (0): 0.84
- Precision (1): 0.71
- Recall (0): 0.82
- Recall (1): 0.75
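For reference, all of the reported numbers follow directly from the confusion matrix. With sklearn's convention that rows are true labels and columns are predictions, they can be recomputed by hand as sketched below (y_test and y_pred are assumed to come from the script above).
## METRICS FROM THE CONFUSION MATRIX (reference sketch)
import numpy as np
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)      # rows: true class, columns: predicted class
accuracy = np.trace(cm) / cm.sum()         # correct predictions over all predictions
precision_0 = cm[0, 0] / cm[:, 0].sum()    # of passengers predicted 'not survived', fraction truly not survived
precision_1 = cm[1, 1] / cm[:, 1].sum()    # of passengers predicted 'survived', fraction truly survived
recall_0 = cm[0, 0] / cm[0, :].sum()       # of passengers truly 'not survived', fraction predicted as such
recall_1 = cm[1, 1] / cm[1, :].sum()       # of passengers truly 'survived', fraction predicted as such
print(accuracy, precision_0, precision_1, recall_0, recall_1)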
Modeling (Tuned: Grid Search)
Next, hyperparameters were tuned using Grid Search in an attempt to improve accuracy. The confusion matrix and evaluation metrics can be viewed below.
- Accuracy: 0.81
- Precision (0): 0.85
- Precision (1): 0.74
- Recall (0): 0.84
- Recall (1): 0.76
Using Grid Search to tune the hyperparameters resulted in a 2 percentage point increase in accuracy over the default model.
Modeling (Tuned: Random Search)
Next, hyperparameters were tuned using Random Search in an attempt to improve accuracy. The confusion matrix and evaluation metrics can be viewed below.
- Accuracy: 0.81
- Precision (0): 0.85
- Precision (1): 0.75
- Recall (0): 0.85
- Recall (1): 0.75
Using Random Search to tune the hyperparameters resulted in a 2 percentage point increase in accuracy over the default model.
Modeling (Tuned: Bayesian Optimization)
Finally, hyperparameters were tuned using Bayesian Optimization in an attempt to improve accuracy. The confusion matrix and evaluation metrics can be viewed below.
- Accuracy: 0.82
- Precision (0): 0.83
- Precision (1): 0.78
- Recall (0): 0.87
- Recall (1): 0.73
Using Bayesian Optimization to tune the hyperparameters resulted in a 3 percentage point increase in accuracy over the default model.
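For a quick side-by-side view, the reported test accuracies can be plotted together. The sketch below simply re-uses the numbers quoted in the sections above.
## COMPARE REPORTED TEST ACCURACIES (summary sketch)
import matplotlib.pyplot as plt

models = ['Default', 'Grid Search', 'Random Search', 'Bayesian Opt.']
accuracies = [0.79, 0.81, 0.81, 0.82]  # test accuracies reported above

plt.figure(figsize=(8, 4))
plt.bar(models, accuracies, color='orange')
plt.ylim(0.5, 1.0)
plt.ylabel('Test accuracy')
plt.title('XGBoost test accuracy by tuning method')
plt.show()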