Sports Betting Naive Bayes Data Prep

Data Prep in R

The R code, along with the .csv files for the original (gbg_cleaned.csv) and prepped (gbg_nb_full_r.csv) data, can be downloaded below.

A Naive Bayes model in R can be built on mixed (numeric and categorical) data. However, Naive Bayes assumes independence between variables, so correlated variables should not be included in the model. This means the only data prep needed is selecting the variables for the Naive Bayes model and then checking them for correlation.

Below is a plot showing the correlation between the numeric variables selected for the Naive Bayes model. There are no strong correlations between any of the variables. There are some weak to moderate correlations, but nothing strong enough to worry about, so none of the variables need to be removed before modeling.

An image of the prepped data can be viewed below.

The R code for the data prep can be viewed below.

# LOAD LIBRARIES

library(tidyverse)
library(corrplot)

# --------------------------------------------------------------------------------- #
# LOAD GAMES DATASET

gbg_nb <- read.csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/data/gbg_cleaned.csv')

# ---------------------------------------------------------------------------------- #
## SELECT VARIABLES TO USE FOR FULL NAIVE BAYES MODELING

gbg_nb_full <- gbg_nb %>%
  select(total_result, game_type, weekday, location, total_line, away_rest, home_rest, 
         div_game, roof, surface, temp, wind, total_team_elo, total_qb_elo,
         home_total_offense_rank, home_total_defense_rank, away_total_defense_rank,
         away_total_offense_rank, off_def_diff)

# ------------------------------------------------------------------------------- #
## CHECK FOR CORRELATION
# Naive Bayes assumes independence so we don't want correlation between variables

gbg_nb_num <- gbg_nb_full %>%
  keep(is.numeric)

col4 <- colorRampPalette(c("Black", "darkgrey", "#CFB87C"))

corrplot(cor(gbg_nb_num), method="ellipse", col=col4(100),
         addCoef.col = "black", tl.col="black", tl.cex=0.6, cl.cex=0.75,
         number.cex = 0.5)

# ------------------------------------------------------------------------------- #
## WRITE FULL PREPPED DATA TO .csv 
#  first Naive Bayes model will include all of the above variables
write.csv(gbg_nb_full, 'C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/Naive_Bayes/prepped_data/gbg_nb_full_r.csv',
          row.names = FALSE)


Since Naive Bayes is a supervised machine learning model, it requires labels, and the prepped data must be split into training and testing sets. The training set is used to train the model; the testing set is used to test the model's accuracy. The labels must be removed from both sets, but kept in separate variables so they can be used during modeling. In the following example, the training set is created by randomly sampling 80% of the data, and the testing set is the remaining 20%. An 80/20 split is not the only option, just a popular one. The training and testing sets must be kept disjoint (separate) throughout the modeling process; failing to do so will most likely result in overfitting and poor performance on real data that is not from the training or testing set.

The R code, along with the prepped data (gbg_nb_full_r.csv) needed to create the training and testing sets, can be downloaded below.

Images of the training and testing sets can be viewed below.

Train Set (80% of Data)
Test Set (20% of Data)

The training and testing sets can be downloaded below.

The R code for the creation of training and testing sets can be viewed below.

## LOAD LIBRARIES

library(dplyr)
library(ggplot2)
library(naivebayes)
library(tidyverse)
library(caret)
library(caretEnsemble)
library(psych)
library(Amelia)
library(mice)
library(GGally)
library(e1071)
library(klaR)

# ------------------------------------------------------------------------------ #
## LOAD DATA

gbg_nb_full <- read.csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/Naive_Bayes/prepped_data/gbg_nb_full_r.csv')

# ------------------------------------------------------------------------------ #
## SET VARIABLES TO CORRECT DATA TYPES

gbg_nb_full$total_result <- as.factor(gbg_nb_full$total_result)
gbg_nb_full$game_type <- as.factor(gbg_nb_full$game_type)
gbg_nb_full$weekday <- as.factor(gbg_nb_full$weekday)
gbg_nb_full$location <- as.factor(gbg_nb_full$location)
gbg_nb_full$roof <- as.factor(gbg_nb_full$roof)
gbg_nb_full$surface <- as.factor(gbg_nb_full$surface)


str(gbg_nb_full)

# ------------------------------------------------------------------------------ #
## SPLIT DATA INTO TRAIN AND TEST

# will split 80% train and 20% test
# check to see how big the training and testing datasets should be after splitting the data
nrow(gbg_nb_full)*0.8
nrow(gbg_nb_full)*0.2

## set a seed if you want the split to be the same each time
## you run the code. The number (like 1234) does not matter
#set.seed(1234)

# find the number corresponding to 80% of the data
n <- floor(0.8*nrow(gbg_nb_full)) 

# randomly sample indices to be included in our training set (80%)
index <- sample(seq_len(nrow(gbg_nb_full)), size = n)

# set the training set to be randomly sampled rows of the data (80%)
train <- gbg_nb_full[index, ]

# set the testing set to be the remaining rows (20%)
test <- gbg_nb_full[-index, ] 

# check to see if the size of the training and testing sets match what was expected
cat("There are", dim(train)[1], "rows and", dim(train)[2], 
    "columns in the training set.")
cat("There are", dim(test)[1], "rows and", dim(test)[2], 
    "columns in the testing set.")

# make sure the testing and training sets have balanced labels
table(train$total_result)
table(test$total_result)

# remove labels from training and testing set and keep them
test_labels <- test$total_result
train_labels <- train$total_result
test <- test[ , -which(names(test) %in% c("total_result"))]
train <- train[ , -which(names(train) %in% c("total_result"))]


#write.csv(train, 'C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/Naive_Bayes/prepped_data/nb_train_r.csv', row.names=FALSE)
#write.csv(test, 'C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/Naive_Bayes/prepped_data/nb_test_r.csv', row.names=FALSE)

Data Prep in Python

Data Prep Key Ideas

  • Numeric variables only
    • Use one hot encoding for categorical variables (see the sketch after this list)
  • Check for variable correlation
    • Naive Bayes assumes independence between variables, so any highly correlated variables need to be removed
  • Positive values only for Multinomial Naive Bayes
    • Multinomial Naive Bayes assumes features have a multinomial distribution, and a multinomial distribution can't have negative values
  • Split data into train and test sets
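
For example, a minimal sketch of one hot encoding with pandas (toy data for illustration, not the actual gbg columns):

import pandas as pd

# toy frame with two categorical columns, similar in spirit to roof and surface
toy = pd.DataFrame({'roof': ['dome', 'outdoors', 'dome'],
                    'surface': ['grass', 'turf', 'grass']})

# get_dummies replaces each categorical column with 0/1 indicator columns,
# e.g. roof_dome, roof_outdoors, surface_grass, surface_turf
print(pd.get_dummies(toy))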

Creating Naive Bayes models in Python requires four key preparatory steps. First, variables need to be numeric, so any categorical variables must be converted using one hot encoding; in this case, roof and surface were converted. Second, check for correlation among variables and remove any highly correlated ones. Third, the MultinomialNB() classifier from sklearn can't handle negative values, so if the data contains any (as the temp variable does here), a min-max scaler between 0 and 1 needs to be applied. Fourth, since Naive Bayes is a supervised machine learning model, the prepped data must be split into training and testing sets: the training set is used to train the model and the testing set to test its accuracy. In the following example, the training set is created by randomly sampling 80% of the data and the testing set is the remaining 20%; an 80/20 split is not the only option, just a popular one. The training and testing sets must be kept disjoint (separate) throughout the modeling process; failing to do so will most likely result in overfitting and poor performance on real data that is not from the training or testing set. (The prep script below covers the first three steps; a sketch of the split step follows it.)

Check for Variable Correlation

To check for correlation, a correlation heatmap is used.

The avg_home_total_yards, avg_away_total_yards, avg_home_total_yards_against, and avg_away_total_yards_against variables are all highly correlated with each other, so only avg_home_total_yards will be kept and the other three will be removed.

The final cleaned dataset contains the following columns.

  • total_result
  • total_line
  • temp
  • wind
  • total_qb_elo
  • team_elo_diff
  • qb_elo_diff
  • avg_home_total_yards
  • roof
  • surface
  • div_game

The code for the data prep, as well as the initial (gbg_cleaned) and cleaned (gbg_nb) datasets, can be found below.

# -*- coding: utf-8 -*-
"""
Created on Sun Jun 25 14:15:23 2023

@author: casey
"""

## LOAD LIBRARIES
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Import visualization
import matplotlib.pyplot as plt
import seaborn as sns

# --------------------------------------------------------------------------------------- #
## LOAD DATA

nb_df = pd.read_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/data/gbg_cleaned.csv')

# --------------------------------------------------------------------------------------- #
## SELECT COLUMNS FOR NAIVE BAYES MODELING

cols = ['total_result', 'total_line', 'temp', 'wind', 'total_qb_elo', 'team_elo_diff',
        'qb_elo_diff', 'avg_home_total_yards', 'avg_away_total_yards', 
        'avg_home_total_yards_against', 'avg_away_total_yards_against', 'roof', 'surface',
        'div_game']

nb_df = nb_df[cols]

# ------------------------------------------------------------------------------------------------ #
## CHECK FOR VARIABLE CORRELATION
cor_df = nb_df.copy()

# convert categorical variables to number codes so they can be used in the heatmap
cor_df['roof'] = cor_df['roof'].astype('category')
cor_df['roof'] = cor_df['roof'].cat.codes
cor_df['surface'] = cor_df['surface'].astype('category')
cor_df['surface'] = cor_df['surface'].cat.codes


# heatmap plot
fig, ax = plt.subplots(figsize=(12,12)) 
dataplot = sns.heatmap(cor_df.corr(), cmap="YlGnBu", annot=True, ax=ax)
  
# displaying heatmap
plt.show()
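
# (sketch) numeric cross-check of the heatmap: list every pair of variables
# whose absolute correlation exceeds a cutoff (0.8 here is illustrative)
corr = cor_df.corr().abs()
high_pairs = [(a, b, round(corr.loc[a, b], 2))
              for i, a in enumerate(corr.columns)
              for b in corr.columns[i+1:]
              if corr.loc[a, b] > 0.8]
print(high_pairs)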

# remove highly correlated columns from nb_df
nb_df.drop(columns=['avg_away_total_yards', 'avg_home_total_yards_against', 'avg_away_total_yards_against'], inplace=True)

# --------------------------------------------------------------------------------------- #
## ONE HOT ENCODING OF CATEGORICAL VARIABLES (roof, surface)

labels = nb_df['total_result']
nb_df.drop(['total_result'], inplace=True, axis=1)
nb_df = pd.get_dummies(nb_df)
nb_df['total_result'] = labels

# ------------------------------------------------------------------------------------------------ #
## NORMALIZE DATA TO DEAL WITH NEGATIVE VALUES
# Will use MinMaxScaler() to scale all quantitative variables between 0 and 1
scaler = MinMaxScaler(feature_range=(0, 1))
nb_scaled = scaler.fit_transform(nb_df.iloc[:,:-1])
nb_scaled_df = pd.DataFrame(nb_scaled, columns=nb_df.iloc[:,:-1].columns)
nb_scaled_df['total_result'] = nb_df['total_result']

# --------------------------------------------------------------------------------------- #
## WRITE PREPPED DATA TO A .CSV FILE

nb_df.to_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/Naive_Bayes/prepped_data/gbg_nb.csv',
              index=False)

nb_scaled_df.to_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/Naive_Bayes/prepped_data/gbg_nb_scaled.csv',
              index=False)
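
The script above covers the first three prep steps; the train/test split still has to happen before modeling. Below is a minimal sketch of that step using sklearn's train_test_split, continuing from the nb_scaled_df frame above (the 80/20 sizes, the stratify option, and the random_state value are illustrative choices, not requirements):

## SPLIT DATA INTO TRAIN AND TEST (80% / 20%)
from sklearn.model_selection import train_test_split

# separate the features from the label
X = nb_scaled_df.drop(columns=['total_result'])
y = nb_scaled_df['total_result']

# stratify=y keeps the label proportions similar in both sets;
# random_state makes the split reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1234)

print(X_train.shape, X_test.shape)

As in the R version, the labels are kept separate from the features, and the training and testing sets stay disjoint from here on.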