Games Dataset
Data Prep in Python
The Python code, along with the .csv files for the original data (gbg_cleaned.csv) and the prepped data (gbg_svm_py.csv), can be downloaded below.
A Support Vector Machines model in Python can only be created using quantitative data. This is because the algorithm separates the classes using dot products and distances between data points, which can only be calculated on numeric values. Those calculations are also sensitive to the range of each variable, so the data should be normalized or scaled to keep variables with large ranges from dominating the result. So, to prep the data, the labels (total_result) need to be selected, relevant quantitative variables need to be selected, and then those quantitative variables need to be scaled.
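To illustrate the scaling step, here is a minimal sketch that applies scikit-learn's MinMaxScaler to a small made-up table. The column names mirror two of the variables used on this page, but the values are hypothetical and are not taken from gbg_cleaned.csv.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values for illustration only (not from the real dataset).
# Note how different the raw ranges are: 0-20 versus 2800-3200.
toy = pd.DataFrame({
    'wind': [0.0, 5.0, 10.0, 20.0],
    'total_qb_elo': [2800.0, 3000.0, 3100.0, 3200.0],
})

# MinMaxScaler rescales each column to (x - min) / (max - min)
scaler = MinMaxScaler(feature_range=(0, 1))
toy_scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

print(toy_scaled)
```

After scaling, both columns run from 0 to 1, so neither one outweighs the other simply because of its units.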
The following variables were selected:
total_result
total_line
temp
wind
total_qb_elo
team_elo_diff
qb_elo_diff
avg_home_total_yards
avg_away_total_yards
avg_home_total_yards_against
avg_away_total_yards_against
Two new variables were also created:
home_away_yards_diff
- The difference in total yards between the home team offense and away team defense
- A positive value favors the home team offense, a negative value favors the away team defense, a value of zero indicates a perfectly even matchup
away_home_yards_diff
- The difference in total yards between the away team offense and home team defense
- A positive value favors the away team offense, a negative value favors the home team defense, a value of zero indicates a perfectly even matchup
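As a worked example of the two variables defined above, the sketch below computes them for two made-up games. The yardage values are hypothetical and chosen only to show a positive, a negative, and a zero difference.

```python
import pandas as pd

# Hypothetical per-game averages for illustration only
games = pd.DataFrame({
    'avg_home_total_yards': [380.0, 310.0],
    'avg_away_total_yards_against': [340.0, 355.0],
    'avg_away_total_yards': [330.0, 365.0],
    'avg_home_total_yards_against': [330.0, 350.0],
})

# Home offense versus away defense
games['home_away_yards_diff'] = (
    games['avg_home_total_yards'] - games['avg_away_total_yards_against']
)
# Away offense versus home defense
games['away_home_yards_diff'] = (
    games['avg_away_total_yards'] - games['avg_home_total_yards_against']
)

print(games[['home_away_yards_diff', 'away_home_yards_diff']])
```

In the first game the home offense is favored (40.0) and the away offense is evenly matched with the home defense (0.0). In the second game the away defense is favored (-45.0) and the away offense is favored (15.0).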
An image of the prepped data can be viewed below.
The Python code for the data prep can be viewed below.
# -*- coding: utf-8 -*-
"""
Created on Tue Mar 21 10:39:54 2023
@author: casey
"""
## LOAD LIBRARIES
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# --------------------------------------------------------------------------------------- #
## LOAD DATA
svm_df = pd.read_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/data/gbg_cleaned.csv')
# --------------------------------------------------------------------------------------- #
## SELECT COLUMNS FOR SVM MODELING
# only select quantitative variables and label variable
cols = ['total_result', 'total_line', 'temp', 'wind', 'total_qb_elo', 'team_elo_diff',
'qb_elo_diff', 'avg_home_total_yards', 'avg_away_total_yards',
'avg_home_total_yards_against', 'avg_away_total_yards_against']
svm_df = svm_df[cols]
# create new variable to get the difference in yards between home team offense and away team defense
# positive number favors home team offense, negative number favors away team defense
svm_df['home_away_yards_diff'] = svm_df['avg_home_total_yards'] - svm_df['avg_away_total_yards_against']
# create new variable to get the difference in yards between away team offense and home team defense
# positive number favors away team offense, negative number favors home team defense
svm_df['away_home_yards_diff'] = svm_df['avg_away_total_yards'] - svm_df['avg_home_total_yards_against']
# --------------------------------------------------------------------------------------- #
## CHECK FOR ANY MISSING VALUES
# should be no missing values
# print the count of missing values per column so the check is visible when run as a script
print(svm_df.isna().sum())
# --------------------------------------------------------------------------------------- #
## NORMALIZE DATA USING MINMAXSCALER
# Will use MinMaxScaler() to scale all quantitative variables between 0 and 1
scaler = MinMaxScaler(feature_range=(0, 1))
svm_df_scaled = scaler.fit_transform(svm_df.iloc[:,1:])
svm_df_scaled = pd.DataFrame(svm_df_scaled, columns=svm_df.iloc[:,1:].columns)
svm_df_scaled['total_result'] = svm_df['total_result']
# --------------------------------------------------------------------------------------- #
## WRITE PREPPED SVM DATA TO A .CSV FILE
svm_df_scaled.to_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/SVM/prepped_data/gbg_svm_py.csv',
index=False)
Since Support Vector Machines is a supervised machine learning model, it requires labeled data, and the prepped data must be split into training and testing sets. The training set is used to train the model. The testing set is used to estimate the accuracy of the model on data it has not seen. The labels must be removed from the training and testing sets and kept in separate variables so they can be supplied during modeling. In the following example, the data is split at random so that 80% of the rows form the training set and the remaining 20% form the testing set. This ratio is not the only option, just a popular one. The training and testing sets must be kept disjoint (no shared rows) throughout the modeling process. Otherwise the model is evaluated on data it has already seen, so the measured accuracy will be overly optimistic, overfitting can go undetected, and performance on real data outside both sets will most likely be poor.
The Python code and the prepped data (gbg_svm_py.csv) needed to make the training and testing sets can be downloaded below.
Images of the training and testing sets can be viewed below.
The training and testing sets can be downloaded below.
The Python code for the creation of the training and testing sets can be viewed below.
# -*- coding: utf-8 -*-
"""
Created on Wed Mar 22 13:40:28 2023
@author: casey
"""
## LOAD LIBRARIES
import pandas as pd
from sklearn.model_selection import train_test_split
# --------------------------------------------------------------------------------------- #
## LOAD DATA
svm_df = pd.read_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/SVM/prepped_data/gbg_svm_py.csv')
# --------------------------------------------------------------------------------------- #
## CREATE TRAINING AND TESTING SETS
# X will contain all variables except the labels (the labels are the last column 'total_result')
X = svm_df.iloc[:,:-1]
# y will contain the labels (the labels are the last column 'total_result')
y = svm_df.iloc[:,-1:]
# split the data vectors randomly into 80% train and 20% test
# X_train contains the quantitative variables for the training set
# X_test contains the quantitative variables for the testing set
# y_train contains the labels for the training set
# y_test contains the labels for the testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
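
As a quick check on the kind of split performed above, the sketch below uses a small made-up DataFrame to confirm that train_test_split produces disjoint sets in an 80/20 ratio. The data values, including the 'Over' and 'Under' labels, are hypothetical and not taken from gbg_svm_py.csv. The random_state argument is an addition that the original code does not use; it is set here only so the example is reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical, already-scaled data for illustration only
demo = pd.DataFrame({
    'total_line': [i / 9 for i in range(10)],
    'wind': [(i % 5) / 4 for i in range(10)],
    'total_result': ['Over', 'Under'] * 5,
})

# Features are every column except the last; labels are the last column
X = demo.iloc[:, :-1]
y = demo.iloc[:, -1]

# random_state is an assumption added for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Rows shared by both sets; an empty set means the split is disjoint
shared_rows = set(X_train.index) & set(X_test.index)
print(len(X_train), len(X_test), len(shared_rows))  # prints: 8 2 0
```

The same index comparison can be run on the real X_train and X_test at any point in the modeling process to verify that the two sets have stayed separate.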