Data Prep in Python
Data Prep Key Ideas
- Numeric variables only
- Use one hot encoding for categorical variables
- Normalize the dataset
- Split data into train, test, and validation sets
- Encode labels
Creating neural network models in Keras requires four key preparatory steps. First, all variables must be numeric, so any categorical variables need to be converted using one hot encoding. In this case, roof and surface were converted to numeric variables. Second, the dataset needs to undergo normalization, which helps avoid exploding and vanishing gradients. Third, since a neural network is a supervised machine learning model, the prepped data must be split into training, validation, and testing sets (this is not done in the code below; instead it's done in the modeling Python file for ease of use). The training set is used to train the model. The validation set is used to check the accuracy of the model on data it isn't being trained on; monitoring it during training shows whether the model is overfitting or underfitting the training set. The testing set is used to measure the final accuracy of the model after training. Ideally, the training, validation, and testing sets will all yield similar accuracies. In this case, the training set is created by randomly selecting 80% of the data, and the validation and testing sets are each created by randomly selecting 10%. These proportions are not the only option, just a popular one. The training, validation, and testing sets must be kept disjoint (separated) throughout the modeling process; failure to do so will most likely result in overfitting and poor performance on real data outside the training and testing sets. Fourth, for a neural network the labels need to be encoded. In this case, the labels will be integer encoded, where 0 represents 'Over' and 1 represents 'Under'.
The final cleaned dataset contains the following columns:
- total_result
- total_line
- temp
- wind
- total_qb_elo
- team_elo_diff
- qb_elo_diff
- avg_home_total_yards
- avg_away_total_yards
- avg_home_total_yards_against
- avg_away_total_yards_against
- roof
- surface
- div_game
An image of the prepped data can be found below.
An image of the prepped labels can be found below.
The code for the data prep, as well as the initial (gbg_cleaned) and neural network cleaned (gbg_nn_py) datasets can be found below.
# -*- coding: utf-8 -*-
"""
Created on Sat Jun 24 14:37:02 2023
@author: casey
"""
## LOAD LIBRARIES
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
# --------------------------------------------------------------------------------------- #
## LOAD DATA
nn_df = pd.read_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/data/gbg_cleaned.csv')
# --------------------------------------------------------------------------------------- #
## SELECT COLUMNS FOR ANN MODELING
cols = ['total_result', 'total_line', 'temp', 'wind', 'total_qb_elo', 'team_elo_diff',
'qb_elo_diff', 'avg_home_total_yards', 'avg_away_total_yards',
'avg_home_total_yards_against', 'avg_away_total_yards_against', 'roof', 'surface',
'div_game']
nn_df = nn_df[cols]
# --------------------------------------------------------------------------------------- #
## CONVERT CATEGORICAL LABELS TO NUMERIC LABELS
# 0:Over, 1:Under
le = LabelEncoder()
nn_df['total_result'] = le.fit_transform(nn_df['total_result'])
# --------------------------------------------------------------------------------------- #
## ONE HOT ENCODE CATEGORICAL VARIABLES (roof, surface)
# Set the label aside so it isn't touched by get_dummies, then reattach it
labels = nn_df['total_result']
nn_df.drop(['total_result'], inplace=True, axis=1)
nn_df = pd.get_dummies(nn_df)
nn_df['total_result'] = labels
# --------------------------------------------------------------------------------------- #
## NORMALIZE DATA USING MINMAXSCALER
# Will use MinMaxScaler() to scale all quantitative variables between 0 and 1
scaler = MinMaxScaler(feature_range=(0, 1))
nn_df_scaled = scaler.fit_transform(nn_df.iloc[:,:-1])
nn_df_scaled = pd.DataFrame(nn_df_scaled, columns=nn_df.iloc[:,:-1].columns)
nn_df_scaled['total_result'] = nn_df['total_result']
# --------------------------------------------------------------------------------------- #
## WRITE PREPPED DATA TO A .CSV FILE
nn_df_scaled.to_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Fall_2023/Neural_Nets/Project/prepped_data/gbg_nn.csv',
index=False)