Stratified K-Fold in Python

Why Cross-Validation is Used

Cross-validation is used for a few reasons. One is to make use of all of our data in both training and testing: rather than relying on a single train/test split, cross-validation ensures that several different combinations of train and test sets are used. A second reason is to get more information on how the model is actually performing. The ideal result is similar accuracy metrics across all of the different train and test combinations. However, it may be that one combination performs much better (suggesting overfitting to that particular split) or worse than all the others, or that each combination performs differently. In either case you would need to go back and investigate the data, because a model that scores inconsistently across folds won’t perform well on real data.

What is Stratified K-Fold Cross Validation

  1. Choose the value of ‘K’
  2. Split the dataset into ‘K’ folds. This is done using stratified sampling: instead of assigning data points to folds at random (which can give individual folds a skewed label distribution when the labels aren’t balanced), each fold contains the same percentage of each label as the whole dataset does.
  3. Use one fold as the test set and the remaining K-1 folds as the train set
  4. Train the model on the train set and validate it on the test set
  5. Repeat steps 3-4, using a different fold as the test set each time. Iterations are repeated ‘K’ times, until each of the ‘K’ folds has served as the test set once and all the different train and test combinations have been tried (see the sketch below)
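
The whole procedure takes only a few lines with scikit-learn's StratifiedKFold. A minimal sketch on synthetic data (the dataset and model here are illustrative, not the Titanic setup used later):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Synthetic 70/30 binary dataset (illustrative only)
X, y = make_classification(n_samples=500, weights=[0.7, 0.3], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # step 1: K = 5
scores = []
for train_index, test_index in skf.split(X, y):  # steps 2-5: K stratified splits
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_index], y[train_index])           # train on K-1 folds
    preds = model.predict(X[test_index])                # validate on the held-out fold
    scores.append(accuracy_score(y[test_index], preds))
print(scores)  # one accuracy per fold; similar values suggest a stable model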

When to Use Stratified K-Fold Cross Validation

Stratified K-Fold cross validation should be used when your dataset has an unequal label distribution. For example, suppose a dataset contains 70% ‘Yes’ labels and 30% ‘No’ labels. Random sampling might leave one or more of the folds with barely any ‘No’ labels, which is not ideal for getting an accurate picture of how the model is performing. Stratified sampling ensures that each fold is 70% ‘Yes’ and 30% ‘No’, which provides much more reliable results. If the label distribution in the dataset is approximately equal, then K-Fold with random sampling should be used.
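
The effect of stratification is easy to verify by checking the label proportion in each fold. A small sketch (the 70/30 label array is made up to match the example above):

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical labels: 70% 'Yes' (1) and 30% 'No' (0)
y = np.array([1] * 70 + [0] * 30)
X = np.zeros((100, 1))  # features don't affect how the folds are formed

for name, cv in [('KFold', KFold(n_splits=5, shuffle=True, random_state=0)),
                 ('StratifiedKFold', StratifiedKFold(n_splits=5, shuffle=True, random_state=0))]:
    props = [round(y[test].mean(), 2) for _, test in cv.split(X, y)]
    print(name, props)  # fraction of 'Yes' labels in each fold
# StratifiedKFold keeps every fold at ~0.70; plain KFold can drift fold to fold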

Data Cleaning

The dataset being used is the classic Titanic dataset from the seaborn library. The models will attempt to predict whether each passenger survived or not.


The first step is removing duplicate columns.

  • survived is the same as alive
  • embarked is the same as embark_town
  • sex is the same as who
  • pclass is the same as class

So, alive, embark_town, who, and class were removed.

The second step is dealing with missing values in the deck column. Approximately 77% of the data for this column was missing, so the deck column was removed altogether.

The third step is dealing with missing values in the embarked column. There are only two missing values. Since this is a categorical column, the missing values were replaced with the mode, which was ‘S’.

The fourth step is dealing with missing values for the age column. Since age may have a relationship with other columns, the missing values were imputed with the mean after grouping by pclass and sex.


The code for the data cleaning is shown below.

# -*- coding: utf-8 -*-
"""
Created on Mon May 15 16:37:51 2023

@author: casey
"""

## LOAD LIBRARIES
# Set seed for reproducibility
import random
random.seed(53)
import pandas as pd
import seaborn as sns

# ------------------------------------------------------------------------------------------------ #
## LOAD DATASET
df = sns.load_dataset('titanic')

# ------------------------------------------------------------------------------------------------ #
## DISPLAY DETAILS OF DATASET

# Number of rows and columns
df.shape

# General details (includes missing values)
df.info()

# Visual look at dataset
df.head(10)

# ------------------------------------------------------------------------------------------------ #
## REMOVE DUPLICATE COLUMNS
# 'survived' is same as 'alive', 
# 'embarked' is abbreviation of 'embark_town', 
# 'sex' is same as 'who' 
# 'pclass' is same as 'class'

df.drop(columns=['alive', 'embark_town', 'who', 'class'], inplace=True)

# ------------------------------------------------------------------------------------------------ #
## DEAL WITH MISSING VALUES
df.isnull().sum()

# 'deck' has 77% of values missing so that column will just be removed
df.drop(columns=['deck'], inplace=True)

# `embarked` has 2 missing values, will replace with the mode
df['embarked'] = df['embarked'].fillna('S')

# age may have a relationship with other columns, so try imputing after grouping
#df.age.fillna(df.age.median(), inplace = True)
# transform keeps the original row index, so the result aligns with df
df['age'] = df.groupby(['pclass', 'sex'])['age'].transform(lambda x: x.fillna(x.mean()))

# ------------------------------------------------------------------------------------------------ #
## WRITE CLEANED DATASET TO .CSV

df.to_csv('C:/Users/casey/OneDrive/Documents/Machine_Learning/Supervised_Learning/Data/Clean_Data_Titanic.csv',
          index = False)

The final cleaned dataset contains the following columns.

  • survived: whether the passenger survived or not (1 = survived, 0 = did not survive)
  • pclass: the class the passenger stayed in (1, 2, or 3)
  • sex: the sex of the passenger
  • age: the age of the passenger
  • sibsp: number of siblings/spouses aboard for each passenger
  • parch: number of parents/children aboard for each passenger
  • fare: the fare for each passenger
  • embarked: port each passenger embarked from
  • adult_male: whether the passenger was an adult male or not
  • alone: whether the passenger was traveling alone or not

Stratified K-Fold Cross Validation Prep

  • Different classifiers need slightly different datasets
    • Naïve Bayes uses a dataset that has adult_male_True removed, since it is correlated with other features
    • Decision Trees/ Random Forest use a dataset that only contains the variables deemed important
    • KNN, Logistic Regression, and SVMs use a normalized dataset with all of the variables
    • XGBoost uses a dataset with all of the variables

The code for the validation prep, as well as the validation itself, is shown below.

# -*- coding: utf-8 -*-
"""
Created on Wed May 17 14:30:36 2023

@author: casey
"""

## LOAD LIBRARIES
import pandas as pd
from statistics import mean, stdev

# Import all we need from sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn import linear_model
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Import XGBoost
import xgboost as xgb

# Import visualization
import scikitplot as skplt
import matplotlib.pyplot as plt
import seaborn as sns

# ------------------------------------------------------------------------------------------------ #
## LOAD DATA
df = pd.read_csv('C:/Users/casey/OneDrive/Documents/Machine_Learning/Supervised_Learning/Data/Clean_Data_Titanic.csv')

# ------------------------------------------------------------------------------------------------ #
## CREATE DATASET FOR KNN, SVM, LOGISTIC REGRESSION

## CREATE DUMMY VARIABLES FOR CATEGORICAL VARIABLES
knn_svm_lr_onehot = df.copy()
knn_svm_lr_onehot = pd.get_dummies(knn_svm_lr_onehot, columns = ['pclass', 'sex', 'embarked', 'alone', 'adult_male'])

X = knn_svm_lr_onehot.iloc[:,1:]
y = knn_svm_lr_onehot['survived']

# Try MinMaxScaler, StandardScaler, or RobustScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# ------------------------------------------------------------------------------------------------ #
## CROSS VALIDATE KNN or SVM or LOGISTIC REGRESSION
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html

# select the model to cross-validate
# optimal parameters from Titanic_KNN.py and Titanic_SVM.py
#model = linear_model.LogisticRegression()
#model = KNeighborsClassifier(n_neighbors=9)
model = svm.SVC(kernel='linear', C=1)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
lst_accu_stratified = []
roc_scores = []

for train_index, test_index in skf.split(X, y):
    X_train_fold, X_test_fold = X_scaled[train_index], X_scaled[test_index]
    y_train_fold, y_test_fold = y.iloc[train_index], y.iloc[test_index]
    model.fit(X_train_fold, y_train_fold)
    predictions = model.predict(X_test_fold)
    lst_accu_stratified.append(metrics.accuracy_score(y_test_fold, predictions))
    #print(classification_report(y_test_fold, predictions))
    roc_scores.append(roc_auc_score(y_test_fold, predictions))
    
# Print the Accuracy output.
print('List of possible accuracy:', lst_accu_stratified)
print('\nMaximum Accuracy that can be obtained from this model is:',
      max(lst_accu_stratified)*100, '%')
print('\nMinimum Accuracy:',
      min(lst_accu_stratified)*100, '%')
print('\nOverall Accuracy:',
      mean(lst_accu_stratified)*100, '%')
print('\nStandard Deviation is:', stdev(lst_accu_stratified))

# Print the ROC output.
print('List of possible ROC Scores:', roc_scores)
print('\nMaximum ROC that can be obtained from this model is:',
      max(roc_scores))
print('\nMinimum ROC:',
      min(roc_scores))
print('\nOverall ROC:',
      mean(roc_scores))
print('\nStandard Deviation is:', stdev(roc_scores))

# ------------------------------------------------------------------------------------------------ #
## CREATE DATASET FOR DECISION TREES/ RANDOM FOREST

## CREATE DUMMY VARIABLES FOR CATEGORICAL VARIABLES
dt_rf_onehot = df.copy()
dt_rf_onehot = pd.get_dummies(dt_rf_onehot, columns = ['pclass', 'sex', 'embarked', 'alone', 'adult_male'])

X = dt_rf_onehot.iloc[:,1:]
# select only important variables for either decision tree or random forest
#X_dt = X[['adult_male_False', 'fare', 'pclass_3', 'age', 'pclass_2', 'parch', 'embarked_C']]
X_rf = X[['adult_male_False', 'fare', 'pclass_3', 'age', 'sex_male', 'sex_female', 'adult_male_True']]
y = dt_rf_onehot['survived']


# ------------------------------------------------------------------------------------------------ #
## CROSS VALIDATE DECISION TREES/ RANDOM FOREST
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html

# select the model to cross-validate
#model = DecisionTreeClassifier()
model = RandomForestClassifier()

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
lst_accu_stratified = []
roc_scores = []

for train_index, test_index in skf.split(X, y):
    X_train_fold, X_test_fold = X.iloc[train_index], X.iloc[test_index]
    y_train_fold, y_test_fold = y.iloc[train_index], y.iloc[test_index]
    model.fit(X_train_fold, y_train_fold)
    predictions = model.predict(X_test_fold)
    lst_accu_stratified.append(metrics.accuracy_score(y_test_fold, predictions))
    #print(classification_report(y_test_fold, predictions))
    roc_scores.append(roc_auc_score(y_test_fold, predictions))
    
# Print the Accuracy output.
print('List of possible accuracy:', lst_accu_stratified)
print('\nMaximum Accuracy that can be obtained from this model is:',
      max(lst_accu_stratified)*100, '%')
print('\nMinimum Accuracy:',
      min(lst_accu_stratified)*100, '%')
print('\nOverall Accuracy:',
      mean(lst_accu_stratified)*100, '%')
print('\nStandard Deviation is:', stdev(lst_accu_stratified))

# Print the ROC output.
print('List of possible ROC Scores:', roc_scores)
print('\nMaximum ROC that can be obtained from this model is:',
      max(roc_scores))
print('\nMinimum ROC:',
      min(roc_scores))
print('\nOverall ROC:',
      mean(roc_scores))
print('\nStandard Deviation is:', stdev(roc_scores))

# ------------------------------------------------------------------------------------------------ #
## CREATE DATASET FOR NAIVE BAYES

## CREATE DUMMY VARIABLES FOR CATEGORICAL VARIABLES
dt_onehot = df.copy()
dt_onehot = pd.get_dummies(dt_onehot, columns = ['pclass', 'sex', 'embarked', 'alone'])

X = dt_onehot.iloc[:,1:]
y = dt_onehot['survived']

# ------------------------------------------------------------------------------------------------ #
## CROSS VALIDATE NAIVE BAYES
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html

# select the model to cross-validate
model = BernoulliNB()

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
lst_accu_stratified = []
roc_scores = []

for train_index, test_index in skf.split(X, y):
    X_train_fold, X_test_fold = X.iloc[train_index], X.iloc[test_index]
    y_train_fold, y_test_fold = y.iloc[train_index], y.iloc[test_index]
    model.fit(X_train_fold, y_train_fold)
    predictions = model.predict(X_test_fold)
    lst_accu_stratified.append(metrics.accuracy_score(y_test_fold, predictions))
    #print(classification_report(y_test_fold, predictions))
    roc_scores.append(roc_auc_score(y_test_fold, predictions))
    
# Print the Accuracy output.
print('List of possible accuracy:', lst_accu_stratified)
print('\nMaximum Accuracy that can be obtained from this model is:',
      max(lst_accu_stratified)*100, '%')
print('\nMinimum Accuracy:',
      min(lst_accu_stratified)*100, '%')
print('\nOverall Accuracy:',
      mean(lst_accu_stratified)*100, '%')
print('\nStandard Deviation is:', stdev(lst_accu_stratified))

# Print the ROC output.
print('List of possible ROC Scores:', roc_scores)
print('\nMaximum ROC that can be obtained from this model is:',
      max(roc_scores))
print('\nMinimum ROC:',
      min(roc_scores))
print('\nOverall ROC:',
      mean(roc_scores))
print('\nStandard Deviation is:', stdev(roc_scores))

# ------------------------------------------------------------------------------------------------ #
## CREATE DATASET FOR XGBOOST

## CREATE DUMMY VARIABLES FOR CATEGORICAL VARIABLES
xg_onehot = df.copy()
xg_onehot = pd.get_dummies(xg_onehot, columns = ['pclass', 'sex', 'embarked', 'alone', 'adult_male'])

X = xg_onehot.iloc[:,1:]
y = xg_onehot['survived']


# ------------------------------------------------------------------------------------------------ #
## CROSS VALIDATE XGBOOST
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html

# select the model to cross-validate
model = xgb.XGBClassifier(max_depth = 5,
                          n_estimators = 450,
                          learning_rate = 0.01,
                          colsample_bytree = 0.5,
                          subsample = 0.5,
                          gamma = 0.5,
                          reg_alpha = 0.1,
                          reg_lambda = 0,
                          random_state = 4
                          )

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
lst_accu_stratified = []
roc_scores = []

for train_index, test_index in skf.split(X, y):
    X_train_fold, X_test_fold = X.iloc[train_index], X.iloc[test_index]
    y_train_fold, y_test_fold = y.iloc[train_index], y.iloc[test_index]
    model.fit(X_train_fold, y_train_fold)
    predictions = model.predict(X_test_fold)
    lst_accu_stratified.append(metrics.accuracy_score(y_test_fold, predictions))
    #print(classification_report(y_test_fold, predictions))
    roc_scores.append(roc_auc_score(y_test_fold, predictions))
    
# Print the Accuracy output.
print('List of possible accuracy:', lst_accu_stratified)
print('\nMaximum Accuracy that can be obtained from this model is:',
      max(lst_accu_stratified)*100, '%')
print('\nMinimum Accuracy:',
      min(lst_accu_stratified)*100, '%')
print('\nOverall Accuracy:',
      mean(lst_accu_stratified)*100, '%')
print('\nStandard Deviation is:', stdev(lst_accu_stratified))

# Print the ROC output.
print('List of possible ROC Scores:', roc_scores)
print('\nMaximum ROC that can be obtained from this model is:',
      max(roc_scores))
print('\nMinimum ROC:',
      min(roc_scores))
print('\nOverall ROC:',
      mean(roc_scores))
print('\nStandard Deviation is:', stdev(roc_scores))
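
One caveat about the loops above: roc_auc_score is given hard 0/1 predictions, which scores a single decision threshold rather than the model's full ranking. Where the classifier exposes predict_proba (all of the models here do, except SVC, which needs probability=True or decision_function), a probability-based AUC can be substituted inside the loop. A sketch, reusing the fold variables from the loops above:

    # Rank test examples by the predicted probability of class 1 instead of
    # the hard label; this gives the threshold-independent AUC
    probs = model.predict_proba(X_test_fold)[:, 1]
    roc_scores.append(roc_auc_score(y_test_fold, probs))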

Decision Trees Validation

                     Accuracy   ROC
Maximum              87.64%     0.87
Minimum              73.33%     0.72
Overall              79.24%     0.78
Standard Deviation   0.043      0.043

Random Forest Validation

                     Accuracy   ROC
Maximum              89.89%     0.90
Minimum              73.33%     0.71
Overall              79.81%     0.78
Standard Deviation   0.050      0.055

K-Nearest Neighbors Validation

                     Accuracy   ROC
Maximum              91.01%     0.89
Minimum              71.91%     0.68
Overall              79.91%     0.77
Standard Deviation   0.055      0.065

Logistic Regression Validation

                     Accuracy   ROC
Maximum              92.13%     0.91
Minimum              75.28%     0.72
Overall              81.82%     0.80
Standard Deviation   0.056      0.058

Naïve Bayes Validation

                     Accuracy   ROC
Maximum              88.76%     0.88
Minimum              70.79%     0.69
Overall              79.35%     0.79
Standard Deviation   0.062      0.065

Support Vector Machines Validation

                     Accuracy   ROC
Maximum              89.89%     0.89
Minimum              76.40%     0.74
Overall              81.71%     0.80
Standard Deviation   0.041      0.041

XGBoost Validation

                     Accuracy   ROC
Maximum              91.01%     0.90
Minimum              79.78%     0.77
Overall              83.17%     0.81
Standard Deviation   0.040      0.046