Sports Betting Decision Trees Data Prep

Data Prep in Python

Data Prep Key Ideas

  • Numeric variables only
    • Use one-hot encoding for categorical variables
  • Split data into train and test sets
  • Tree-based, so no normalization necessary

Creating decision tree models in Python requires two key data prep steps. First, all variables must be numeric, so any categorical variables need to be converted using one-hot encoding; in this case, roof and surface were converted to numeric indicator variables. Second, because Decision Trees are a supervised machine learning model, the prepped data must be split into training and testing sets. The training set is used to train the model, and the testing set is used to measure its accuracy. In the following example, the training set is created by randomly selecting 80% of the data, and the testing set consists of the remaining 20%. An 80/20 split is not the only option, just a popular one. The training and testing sets must be kept disjoint (separate) throughout the modeling process; failing to do so will most likely result in overfitting and poor performance on real data outside the training and testing sets. Finally, since Decision Trees are a tree-based method, no data normalization is necessary prior to modeling.
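
To illustrate, pandas' get_dummies() replaces each categorical column with one binary indicator column per category. Below is a minimal sketch using hypothetical roof values, for illustration only (the actual categories come from gbg_cleaned.csv).

import pandas as pd

# hypothetical roof values, for illustration only
demo = pd.DataFrame({'roof': ['outdoors', 'dome', 'outdoors']})

# one-hot encode: each category becomes its own 0/1 indicator column
print(pd.get_dummies(demo, dtype=int))
#    roof_dome  roof_outdoors
# 0          0              1
# 1          1              0
# 2          0              1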

The final cleaned dataset contains the following columns.

  • total_result
  • total_line
  • temp
  • wind
  • total_qb_elo
  • team_elo_diff
  • qb_elo_diff
  • avg_home_total_yards
  • avg_away_total_yards
  • avg_home_total_yards_against
  • avg_away_total_yards_against
  • roof
  • surface
  • div_game

The code for the data prep, as well as the initial (gbg_cleaned) and cleaned (gbg_dt) datasets, can be found below.

# -*- coding: utf-8 -*-
"""
Created on Sun Jun 25 11:09:18 2023

@author: casey
"""

## LOAD LIBRARIES

import pandas as pd

# --------------------------------------------------------------------------------------- #
## LOAD DATA

dt_df = pd.read_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/data/gbg_cleaned.csv')

# --------------------------------------------------------------------------------------- #
## SELECT COLUMNS FOR DECISION TREE MODELING

cols = ['total_result', 'total_line', 'temp', 'wind', 'total_qb_elo', 'team_elo_diff',
        'qb_elo_diff', 'avg_home_total_yards', 'avg_away_total_yards', 
        'avg_home_total_yards_against', 'avg_away_total_yards_against', 'roof', 'surface',
        'div_game']

dt_df = dt_df[cols]

# --------------------------------------------------------------------------------------- #
## ONE HOT ENCODING OF CATEGORICAL VARIABLES (roof, surface)

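# set the label column aside so get_dummies() only encodes the predictor columns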
labels = dt_df['total_result']
dt_df.drop(['total_result'], inplace=True, axis=1)
dt_df = pd.get_dummies(dt_df)
dt_df['total_result'] = labels

# --------------------------------------------------------------------------------------- #
## WRITE PREPPED DATA TO A .CSV FILE

dt_df.to_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/Decision_Trees/prepped_data/gbg_dt.csv',
             index=False)
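
The script above ends by writing the prepped data to gbg_dt.csv; the 80/20 train/test split described earlier happens at modeling time. Below is a minimal sketch of that split using scikit-learn's train_test_split (an illustrative assumption; the author's modeling code is not shown in this section).

## SPLIT PREPPED DATA INTO TRAIN AND TEST SETS (illustrative sketch)

from sklearn.model_selection import train_test_split

# separate the predictors (X) from the label (y)
X = dt_df.drop(['total_result'], axis=1)
y = dt_df['total_result']

# randomly select 80% of the rows for training and hold out the remaining
# 20% for testing; a fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234)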

Data Prep in R

The R code, along with the .csv files for the original (gbg_cleaned.csv) and prepped (gbg_dt_full_r.csv) data, can be downloaded below.

Decision Trees in R can be created from mixed data, meaning both numeric and categorical variables. As a result, the only data prep necessary is selecting the variables to use for the decision tree model.

An image of the prepped data can be viewed below.

The R code for the data prep can be viewed below.

# LOAD LIBRARIES

library(tidyverse)

# --------------------------------------------------------------------------------- #
# LOAD GAMES DATASET

gbg_dt <- read.csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/data/gbg_cleaned.csv')

# ---------------------------------------------------------------------------------- #
## SELECT VARIABLES TO USE FOR FULL DECISION TREE MODELING

gbg_dt_full <- gbg_dt %>%
  select(total_result, game_type, weekday, location, total_line, away_rest, home_rest, 
         div_game, roof, surface, temp, wind, total_team_elo, total_qb_elo,
         home_total_offense_rank, home_total_defense_rank, away_total_defense_rank,
         away_total_offense_rank, off_def_diff)

# ------------------------------------------------------------------------------- #
## WRITE FULL PREPPED DATA TO .csv 
#  first decision tree model will include all of the above variables
write.csv(gbg_dt_full, 'C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/Decision_Trees/prepped_data/gbg_dt_full_r.csv',
          row.names = FALSE)


Since Decision Trees are a supervised machine learning model, the data requires labels, and the prepped data must be split into training and testing sets. The training set is used to train the model, and the testing set is used to measure its accuracy. The labels must be removed from the testing set, but they should first be saved in a separate variable so they can be compared against the model's predictions during modeling. When creating decision trees in R with the rpart() function, the labels do not need to be removed from the training set. In the following example, the training set is created by randomly selecting 80% of the data, and the testing set consists of the remaining 20%. An 80/20 split is not the only option, just a popular one. The training and testing sets must be kept disjoint (separate) throughout the modeling process; failing to do so will most likely result in overfitting and poor performance on real data outside the training and testing sets.

The R code, along with the prepped data (gbg_dt_full_r.csv) needed to create the training and testing sets, can be downloaded below.

Images of the training and testing sets can be viewed below.

Train Set (80% of Data)
Test Set (20% of Data)

The training and testing sets can be downloaded below.

The R code for the creation of training and testing sets can be viewed below.

## LOAD LIBRARIES

library(dplyr)
library(rpart)   ## FOR Decision Trees
library(rattle)  ## FOR Decision Tree Vis
library(rpart.plot)
library(RColorBrewer)
library(Cairo)
library(network)
library(ggplot2)
library(slam)
library(quanteda)
library(proxy)
library(igraph)
library(caret)
library(randomForest)

# ------------------------------------------------------------------------------ #
## LOAD DATA

gbg_dt_full <- read.csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/Decision_Trees/prepped_data/gbg_dt_full_r.csv')

# ------------------------------------------------------------------------------ #
## SET VARIABLES TO CORRECT DATA TYPES

gbg_dt_full$total_result <- as.factor(gbg_dt_full$total_result)
gbg_dt_full$game_type <- as.factor(gbg_dt_full$game_type)
gbg_dt_full$weekday <- as.factor(gbg_dt_full$weekday)
gbg_dt_full$location <- as.factor(gbg_dt_full$location)
gbg_dt_full$roof <- as.factor(gbg_dt_full$roof)
gbg_dt_full$surface <- as.factor(gbg_dt_full$surface)


# confirm the factor conversions took effect
str(gbg_dt_full)

# ------------------------------------------------------------------------------ #
## SPLIT DATA INTO TRAIN AND TEST

# will split 80% train and 20% test
# check to see how big the training and testing datasets should be after splitting the data
nrow(gbg_dt_full)*0.8
nrow(gbg_dt_full)*0.2

## set a seed if you want the random split to be the same each time
## you run the code. The specific number (like 1234) does not matter
set.seed(1234)

# find the number corresponding to 80% of the data
n <- floor(0.8*nrow(gbg_dt_full)) 

# randomly sample the indices to be included in the training set (80%)
index <- sample(seq_len(nrow(gbg_dt_full)), size = n)

# set the training set to be randomly sampled rows of the data (80%)
train <- gbg_dt_full[index, ]

# set the testing set to be the remaining rows (20%)
test <- gbg_dt_full[-index, ] 

# check to see if the size of the training and testing sets match what was expected
cat("There are", dim(train)[1], "rows and", dim(train)[2], 
    "columns in the training set.")
cat("There are", dim(test)[1], "rows and", dim(test)[2], 
    "columns in the testing set.")

# make sure the testing and training sets have balanced labels
table(train$total_result)
table(test$total_result)

# save the testing labels in a separate variable, then remove them from the testing set
test_labels <- test$total_result
test <- test[ , -which(names(test) %in% c("total_result"))]