Data Prep in Python
Data Prep Key Ideas
- Numeric variables only
- Use one hot encoding for categorical variables
- Normalize the dataset
- Split data into train, test, and validation sets
- Encode labels
Creating neural network models in Keras requires four key preparatory steps. First, all variables must be numeric, so any categorical variables need to be converted using one hot encoding. In this case, roof and surface were converted to numeric variables. Second, the dataset needs to undergo normalization, which helps avoid exploding and vanishing gradients. Third, since a neural network is a supervised machine learning model, the prepped data must be split into training, validation, and testing sets (this is not done in the code below; instead it's done in the modeling Python file for ease of use). The training set is used to train the model. The validation set is used to check the accuracy of the model on data it isn't being trained on; monitoring it during training shows whether the model is overfitting or underfitting the training set. The testing set is used to measure the final accuracy of the model after training. Ideally, the training, validation, and testing sets will all yield similar accuracies. In this case, the training set is created by randomly selecting 80% of the data, and the validation and testing sets are each created by randomly selecting 10%. These proportions are not the only option, just a popular one. The training, validation, and testing sets must be kept disjoint (separated) throughout the modeling process; failure to do so will most likely result in overfitting and poor performance on real data outside the training and testing sets. Fourth, for a neural network the labels need to be encoded. In this case, the labels will be integer encoded, where 0 represents 'Over' and 1 represents 'Under'.
The final cleaned dataset contains the following columns:
- total_result
- total_line
- temp
- wind
- total_qb_elo
- team_elo_diff
- qb_elo_diff
- avg_home_total_yards
- avg_away_total_yards
- avg_home_total_yards_against
- avg_away_total_yards_against
- roof
- surface
- div_game
An image of the prepped data can be found below.
An image of the prepped labels can be found below.
The code for the data prep, as well as the initial (gbg_cleaned) and neural network cleaned (gbg_nn_py) datasets can be found below.
# -*- coding: utf-8 -*-
"""
Created on Sat Jun 24 14:37:02 2023
@author: casey
"""
## LOAD LIBRARIES
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
# --------------------------------------------------------------------------------------- #
## LOAD DATA
nn_df = pd.read_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/data/gbg_cleaned.csv')
# --------------------------------------------------------------------------------------- #
## SELECT COLUMNS FOR ANN MODELING
cols = ['total_result', 'total_line', 'temp', 'wind', 'total_qb_elo', 'team_elo_diff',
'qb_elo_diff', 'avg_home_total_yards', 'avg_away_total_yards',
'avg_home_total_yards_against', 'avg_away_total_yards_against', 'roof', 'surface',
'div_game']
nn_df = nn_df[cols]
# --------------------------------------------------------------------------------------- #
## CONVERT CATEGORICAL LABELS TO NUMERIC LABELS
# 0:Over, 1:Under
le = LabelEncoder()
nn_df['total_result'] = le.fit_transform(nn_df['total_result'])
# --------------------------------------------------------------------------------------- #
## ONE HOT ENCODE CATEGORICAL VARIABLES (roof, surface)
# Set the label aside so it isn't touched by get_dummies, then reattach it
labels = nn_df['total_result']
nn_df.drop(['total_result'], inplace=True, axis=1)
nn_df = pd.get_dummies(nn_df)
nn_df['total_result'] = labels
# --------------------------------------------------------------------------------------- #
## NORMALIZE DATA USING MINMAXSCALER
# Will use MinMaxScaler() to scale all quantitative variables between 0 and 1
scaler = MinMaxScaler(feature_range=(0, 1))
nn_df_scaled = scaler.fit_transform(nn_df.iloc[:,:-1])
nn_df_scaled = pd.DataFrame(nn_df_scaled, columns=nn_df.iloc[:,:-1].columns)
nn_df_scaled['total_result'] = nn_df['total_result']
# --------------------------------------------------------------------------------------- #
## WRITE PREPPED DATA TO A .CSV FILE
nn_df_scaled.to_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Fall_2023/Neural_Nets/Project/prepped_data/gbg_nn.csv',
index=False)