Sports Betting Feature Importance/Selection

Feature Importance from Decision Trees

Looking at the plot, the most important features appear to be total_line, wind, avg_away_total_yards_against, and total_qb_elo. Features that may also have some impact are qb_elo_diff, team_elo_diff, avg_away_total_yards, temp, avg_home_total_yards, avg_home_total_yards_against, and surface_dessograss. The remaining features show little or no importance, so it should be possible to remove them without a significant effect on the accuracy of the model. A reduced model with only the important features should be created next to verify this.
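The verification step mentioned above could look roughly like the following sketch. It uses synthetic data from make_classification as a stand-in for the game data, and the 0.05 importance cutoff and the feat_0 ... feat_9 column names are arbitrary assumptions for illustration, not values taken from this analysis.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the game data (hypothetical, not gbg_cleaned.csv)
X_arr, y_demo = make_classification(n_samples=1000, n_features=10,
                                    n_informative=4, n_redundant=0,
                                    random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(10)])
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=0)

# Full model with every feature
full_tree = DecisionTreeClassifier(criterion="entropy", max_depth=7,
                                   random_state=0)
full_tree.fit(X_tr, y_tr)
full_acc = accuracy_score(y_te, full_tree.predict(X_te))

# Keep only features above an arbitrary importance cutoff (assumption: 0.05)
importances = pd.Series(full_tree.feature_importances_, index=X_demo.columns)
kept = importances[importances > 0.05].index.tolist()

# Reduced model trained on the kept features only
reduced_tree = DecisionTreeClassifier(criterion="entropy", max_depth=7,
                                      random_state=0)
reduced_tree.fit(X_tr[kept], y_tr)
reduced_acc = accuracy_score(y_te, reduced_tree.predict(X_te[kept]))

print(f"full:    {X_demo.shape[1]} features, accuracy {full_acc:.3f}")
print(f"reduced: {len(kept)} features, accuracy {reduced_acc:.3f}")
```

If the reduced accuracy stays close to the full accuracy, the dropped features were probably not contributing much. The same comparison could be run on the real data by swapping in the train and test sets from the script.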

Feature Importance from Logistic Regression

With a logistic regression model, feature importance can be read from the fitted coefficients and plotted. The data must be scaled before fitting so that all variables share the same range. Otherwise the coefficients are not an accurate representation of importance, because the size of each coefficient depends on the units of its variable (for the same effect, a variable measured on a larger scale gets a smaller coefficient than one measured on a smaller scale). Positive scores indicate a feature that pushes the prediction toward class 1 (Under), whereas negative scores indicate a feature that pushes the prediction toward class 0 (Over).
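A small sketch of why scaling matters for coefficient-based importance. The two features here are synthetic and hypothetical: they carry the same amount of signal, but one is stored in units 100 times larger than the other.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
n = 2000

# Two hypothetical features with equal influence on the outcome,
# but recorded on very different scales
small = rng.normal(0.0, 1.0, n)          # values of order 1
large = rng.normal(0.0, 1.0, n) * 100.0  # values of order 100
logit = 1.0 * small + 1.0 * (large / 100.0)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)
X = np.column_stack([small, large])

# Fit once on the raw features and once on min-max scaled features
raw_model = LogisticRegression(max_iter=1000).fit(X, y)
scaled_model = LogisticRegression(max_iter=1000).fit(
    MinMaxScaler().fit_transform(X), y)

raw_ratio = abs(raw_model.coef_[0][0]) / abs(raw_model.coef_[0][1])
scaled_ratio = abs(scaled_model.coef_[0][0]) / abs(scaled_model.coef_[0][1])

print("raw coefficients:   ", raw_model.coef_[0])
print("scaled coefficients:", scaled_model.coef_[0])
print(f"raw ratio: {raw_ratio:.1f}, scaled ratio: {scaled_ratio:.2f}")
```

On the raw data the two coefficients differ by roughly the ratio of the units even though both features matter equally. After scaling they become comparable, which is what makes a coefficient plot readable as an importance plot.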

Looking at the plot, the most important features for predicting a total_result of Under appear to be wind, avg_home_total_yards, and total_line. The most important features for predicting a total_result of Over appear to be surface_dessograss, avg_away_total_yards, total_qb_elo, and avg_away_total_yards_against. A reduced model with only the important features should be created next to verify this.

Feature Importance from Random Forest

The Random Forest model appears to flag more features as important than the Decision Tree model does. Looking at the plot, the most important variables appear to be team_elo_diff, total_qb_elo, qb_elo_diff, avg_home_total_yards_against, avg_away_total_yards_against, avg_away_total_yards, avg_home_total_yards, total_line, temp, and wind. A reduced model with only the important features should be created next to verify this.
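One way to cross-check a random forest ranking is permutation importance on held-out data. This technique is not used in this analysis, so the following is only a sketch. It assumes scikit-learn's sklearn.inspection.permutation_importance and uses synthetic data rather than the game data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 3 informative features out of 8
X_demo, y_demo = make_classification(n_samples=600, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          random_state=1)

forest = RandomForestClassifier(n_estimators=100, criterion="entropy",
                                random_state=1)
forest.fit(X_tr, y_tr)

# Shuffle one feature at a time on the test set and record the accuracy drop
perm = permutation_importance(forest, X_te, y_te, n_repeats=10,
                              random_state=1)

# Print both importance measures side by side, largest permutation score first
for i in perm.importances_mean.argsort()[::-1]:
    print(f"feature {i}: impurity={forest.feature_importances_[i]:.3f}, "
          f"permutation={perm.importances_mean[i]:.3f} "
          f"+/- {perm.importances_std[i]:.3f}")
```

Features that rank high on the impurity-based measure but show a permutation score near zero on the test set may be less useful for prediction than the plot suggests.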

Feature Importance from XGBoost

Interestingly, the XGBoost model indicates that all of the variables have some importance. This can be verified by doing feature selection with XGBoost next.

Feature Selection from XGBoost

Below are the accuracies of models that use a decreasing number of features, starting with all of the features. The accuracies stay between 48.46% and 53.41% and show no clear pattern as features are removed. This is consistent with what was seen above, with all of the features having some importance and none clearly being more important than the others.

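Thresh is the importance threshold: only features with an importance greater than or equal to Thresh are kept, so it equals the importance of the least important feature still in the model (these values should match up with the plot above). Features are removed starting with the least important. n is the total number of features in the model. Accuracy is the accuracy of the model on the test set.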
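How Thresh and n relate can be sketched with scikit-learn's SelectFromModel, which keeps the features whose importance is at least the threshold. To keep the example free of extra dependencies it uses GradientBoostingClassifier as a stand-in for XGBoost, along with synthetic data. Both are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (hypothetical) with 6 features
X_demo, y_demo = make_classification(n_samples=400, n_features=6,
                                     n_informative=3, n_redundant=0,
                                     random_state=2)
model = GradientBoostingClassifier(random_state=2).fit(X_demo, y_demo)

counts = []
# Use each feature's importance in turn as the threshold, smallest first
for thresh in np.sort(model.feature_importances_):
    selector = SelectFromModel(model, threshold=thresh, prefit=True)
    n_kept = selector.transform(X_demo).shape[1]  # features with importance >= thresh
    counts.append(n_kept)
    print(f"Thresh={thresh:.3f}, n={n_kept}")
```

The smallest threshold keeps every feature, and each larger threshold drops the least important feature that remained, which is why n counts down in the listing that follows.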

Thresh=0.000, n=22, Accuracy: 50.73%
Thresh=0.021, n=21, Accuracy: 50.73%
Thresh=0.028, n=20, Accuracy: 50.00%
Thresh=0.035, n=19, Accuracy: 52.27%
Thresh=0.040, n=18, Accuracy: 51.14%
Thresh=0.043, n=17, Accuracy: 53.41%
Thresh=0.045, n=16, Accuracy: 48.46%
Thresh=0.047, n=15, Accuracy: 51.46%
Thresh=0.048, n=14, Accuracy: 52.84%
Thresh=0.048, n=13, Accuracy: 51.62%
Thresh=0.051, n=12, Accuracy: 49.84%
Thresh=0.051, n=11, Accuracy: 52.27%
Thresh=0.052, n=10, Accuracy: 50.49%
Thresh=0.053, n=9, Accuracy: 50.24%
Thresh=0.053, n=8, Accuracy: 51.22%
Thresh=0.053, n=7, Accuracy: 51.70%
Thresh=0.053, n=6, Accuracy: 52.52%
Thresh=0.054, n=5, Accuracy: 50.73%
Thresh=0.054, n=4, Accuracy: 52.03%
Thresh=0.055, n=3, Accuracy: 51.14%
Thresh=0.058, n=2, Accuracy: 50.24%
Thresh=0.061, n=1, Accuracy: 51.30%
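To judge whether differences of a point or two between the rows above are meaningful, a rough sampling error for a test accuracy can be computed from the binomial standard error, sqrt(p(1 - p) / n). The sketch below assumes a hypothetical test-set size of 1000 rows, since the actual size is not stated here.

```python
import math

def accuracy_standard_error(p, n_test):
    """Binomial standard error of an accuracy p measured on n_test test rows."""
    return math.sqrt(p * (1.0 - p) / n_test)

n_test = 1000  # hypothetical test-set size, not taken from the analysis
se = accuracy_standard_error(0.5, n_test)

# A rough 95% band is about 1.96 standard errors on either side
print(f"standard error: {se:.4f}, rough 95% band: +/- {1.96 * se:.4f}")
```

If that band is wider than the gaps between rows, the ordering of the rows carries little information on its own.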

Code

The Python code, as well as the dataset needed to run it, can be found below.

# -*- coding: utf-8 -*-
"""
Created on Tue Jul 18 16:35:03 2023

@author: casey
"""

## LOAD LIBRARIES

import pandas as pd

# Import for preprocessing
from sklearn.preprocessing import LabelEncoder # for xgboost
from sklearn.preprocessing import MinMaxScaler # for logistic regression


# Import all we need from sklearn
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import linear_model

# Import XGBoost
import xgboost as xgb

# Import visualization
import matplotlib.pyplot as plt
import seaborn as sns

# --------------------------------------------------------------------------------------- #
## LOAD DATA

df = pd.read_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/data/gbg_cleaned.csv')

# --------------------------------------------------------------------------------------- #
## PREPROCESSING
# --------------------------------------------------------------------------------------- #


# --------------------------------------------------------------------------------------- #
## SELECT COLUMNS FOR  MODELING

cols = ['total_result', 'total_line', 'temp', 'wind', 'total_qb_elo', 'team_elo_diff',
        'qb_elo_diff', 'avg_home_total_yards', 'avg_away_total_yards', 
        'avg_home_total_yards_against', 'avg_away_total_yards_against', 'roof', 'surface',
        'div_game']

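# .copy() gives each model its own frame and avoids pandas'
# SettingWithCopyWarning when the frames are modified below
dt_rf_df = df[cols].copy() # df for decision trees and random forest
xg_df = df[cols].copy() # df for xgboost
lr_df = df[cols].copy() # df for logistic regression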

# --------------------------------------------------------------------------------------- #
## CONVERT CATEGORICAL LABELS TO NUMERIC LABELS FOR XGBOOST

le = LabelEncoder()
xg_df['total_result'] = le.fit_transform(xg_df['total_result'])

# --------------------------------------------------------------------------------------- #
## ONE HOT ENCODING OF QUALITATIVE CATEGORICAL VARIABLES (roof, surface)
## FOR DECISION TREE/RANDOM FOREST, XGBOOST, AND LOGISTIC REGRESSION

labels = dt_rf_df['total_result']
dt_rf_df.drop(['total_result'], inplace=True, axis=1)
dt_rf_df = pd.get_dummies(dt_rf_df)
dt_rf_df['total_result'] = labels

labels = xg_df['total_result']
xg_df.drop(['total_result'], inplace=True, axis=1)
xg_df = pd.get_dummies(xg_df)
xg_df['total_result'] = labels

labels = lr_df['total_result']
lr_df.drop(['total_result'], inplace=True, axis=1)
lr_df = pd.get_dummies(lr_df)
lr_df['total_result'] = labels

# --------------------------------------------------------------------------------------- #
## NORMALIZE DATA USING MINMAXSCALER FOR LOGISTIC REGRESSION ONLY
# Will use MinMaxScaler() to scale all quantitative variables between 0 and 1
scaler = MinMaxScaler(feature_range=(0, 1))
lr_df_scaled = scaler.fit_transform(lr_df.iloc[:,:-1])
lr_df_scaled = pd.DataFrame(lr_df_scaled, columns=lr_df.iloc[:,:-1].columns)
lr_df_scaled['total_result'] = lr_df['total_result']

# --------------------------------------------------------------------------------------- #
## MODELING
# --------------------------------------------------------------------------------------- #

# ------------------------------------------------------------------------------------------------ #
## CREATE TRAIN AND TEST SETS FOR DECISION TREE/RANDOM FOREST

# X will contain all variables except the labels (the labels are the last column 'total_result')
X = dt_rf_df.iloc[:,:-1]
# y will contain the labels (the labels are the last column 'total_result')
# select it as a 1-D Series, which is the shape sklearn expects for labels
y = dt_rf_df.iloc[:,-1]

# split the data vectors randomly into 80% train and 20% test
# X_train contains the quantitative variables for the training set
# X_test contains the quantitative variables for the testing set
# y_train contains the labels for training set
# y_test contains the labels for the testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# CREATE FULL DECISION TREE MODEL
# Look at below documentation for parameters
# https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
DT_Classifier = DecisionTreeClassifier(criterion='entropy', max_depth=7)
DT_Classifier.fit(X_train, y_train)

# CREATE FULL RANDOM FOREST MODEL
# Look at below documentation for parameters
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
RF_Classifier = RandomForestClassifier(criterion='entropy')
RF_Classifier.fit(X_train, y_train)


# ------------------------------------------------------------------------------------------------ #
## CREATE TRAIN AND TEST SETS FOR XGBOOST

# X will contain all variables except the labels (the labels are the last column 'total_result')
X = xg_df.iloc[:,:-1]
# y will contain the labels (the labels are the last column 'total_result')
# select it as a 1-D Series, which is the shape expected for labels
y = xg_df.iloc[:,-1]

# split the data vectors randomly into 80% train and 20% test
# X_train contains the quantitative variables for the training set
# X_test contains the quantitative variables for the testing set
# y_train contains the labels for training set
# y_test contains the labels for the testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# CREATE DEFAULT XGBOOST MODEL
# documentation for parameters
# https://xgboost.readthedocs.io/en/latest/parameter.html
xgb_Classifier = xgb.XGBClassifier()
xgb_Classifier.fit(X_train, y_train)

# ------------------------------------------------------------------------------------------------ #
## CREATE TRAIN AND TEST SETS FOR LOGISTIC REGRESSION

# X will contain all variables except the labels (the labels are the last column 'total_result')
X = lr_df_scaled.iloc[:,:-1]
# y will contain the labels (the labels are the last column 'total_result')
# select it as a 1-D Series, which is the shape sklearn expects for labels
y = lr_df_scaled.iloc[:,-1]

# split the data vectors randomly into 80% train and 20% test
# X_train contains the quantitative variables for the training set
# X_test contains the quantitative variables for the testing set
# y_train contains the labels for training set
# y_test contains the labels for the testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# CREATE DEFAULT LOGISTIC REGRESSION MODEL
# Look at below documentation for parameters
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
LOGR_Classifier = linear_model.LogisticRegression()
LOGR_Classifier.fit(X_train, y_train)


# --------------------------------------------------------------------------------------- #
## FEATURE IMPORTANCE
# --------------------------------------------------------------------------------------- #

## GET FEATURE IMPORTANCE FROM DECISION TREE
feat_dict= {}
for col, val in sorted(zip(X_train.columns, DT_Classifier.feature_importances_),key=lambda x:x[1],reverse=True):
  feat_dict[col]=val
  
feat_df = pd.DataFrame({'Feature':feat_dict.keys(),'Importance':feat_dict.values()})

## PLOT FEATURE IMPORTANCE FROM DECISION TREE
values = feat_df.Importance    
idx = feat_df.Feature
plt.figure(figsize=(10,8))
clrs = ['green' if (x < max(values)) else 'red' for x in values ]
sns.barplot(y=idx,x=values,palette=clrs).set(title='Important Features to Predict NFL Game Total Line Results (Decision Tree)')
plt.show()

## GET FEATURE IMPORTANCE FROM RANDOM FOREST
feat_dict= {}
for col, val in sorted(zip(X_train.columns, RF_Classifier.feature_importances_),key=lambda x:x[1],reverse=True):
  feat_dict[col]=val
  
feat_df = pd.DataFrame({'Feature':feat_dict.keys(),'Importance':feat_dict.values()})

## PLOT FEATURE IMPORTANCE FROM RANDOM FOREST
values = feat_df.Importance    
idx = feat_df.Feature
plt.figure(figsize=(10,8))
clrs = ['green' if (x < max(values)) else 'red' for x in values ]
sns.barplot(y=idx,x=values,palette=clrs).set(title='Important Features to Predict NFL Game Total Line Results (Random Forest)')
plt.show()

## GET FEATURE IMPORTANCE FROM XGBOOST
feat_dict= {}
for col, val in sorted(zip(X_train.columns, xgb_Classifier.feature_importances_),key=lambda x:x[1],reverse=True):
  feat_dict[col]=val
  
feat_df = pd.DataFrame({'Feature':feat_dict.keys(),'Importance':feat_dict.values()})

## PLOT FEATURE IMPORTANCE FROM XGBOOST
values = feat_df.Importance    
idx = feat_df.Feature
plt.figure(figsize=(10,8))
clrs = ['green' if (x < max(values)) else 'red' for x in values ]
sns.barplot(y=idx,x=values,palette=clrs).set(title='Important Features to Predict NFL Game Total Line Results (XGBoost)')
plt.show()

# GET FEATURE IMPORTANCE FROM LOGISTIC REGRESSION
# coef_[0] holds one coefficient per feature; print it so it is shown when run as a script
print(LOGR_Classifier.coef_[0])

feat_dict= {}
for col, val in sorted(zip(X_train.columns, LOGR_Classifier.coef_[0]),key=lambda x:x[1],reverse=True):
  feat_dict[col]=val
  
feat_df = pd.DataFrame({'Feature':feat_dict.keys(),'Importance':feat_dict.values()})

# PLOT FEATURE IMPORTANCE FROM LOGISTIC REGRESSION
values = feat_df.Importance    
idx = feat_df.Feature
plt.figure(figsize=(10,8))
clrs = ['green' if (x > 0) else 'red' for x in values ]
sns.barplot(y=idx,x=values,palette=clrs).set(title='Important Features to Predict NFL Game Total Line Results (Logistic Regression)')
plt.show()

# --------------------------------------------------------------------------------------- #
## FEATURE SELECTION
# --------------------------------------------------------------------------------------- #

## GET FEATURE SELECTION FROM XGBOOST
from numpy import sort
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel

 
X = xg_df.iloc[:,:-1]
# y will contain the labels (the labels are the last column 'total_result')
# select it as a 1-D Series, which is the shape expected for labels
y = xg_df.iloc[:,-1]
# split the data vectors randomly into 80% train and 20% test
# X_train contains the quantitative variables for the training set
# X_test contains the quantitative variables for the testing set
# y_train contains the labels for training set
# y_test contains the labels for the testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# CREATE DEFAULT XGBOOST MODEL
# documentation for parameters
# https://xgboost.readthedocs.io/en/latest/parameter.html
xgb_Classifier = xgb.XGBClassifier()
xgb_Classifier.fit(X_train, y_train)

# make predictions for test data and evaluate
predictions = xgb_Classifier.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Fit model using each importance as a threshold
thresholds = sort(xgb_Classifier.feature_importances_)
accuracies = []
for thresh in thresholds:
    # select features whose importance is >= thresh
    selection = SelectFromModel(xgb_Classifier, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    predictions = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, predictions)
    accuracies.append("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))

# print the results once, after every threshold has been evaluated
for result in accuracies:
    print(result)