Sports Betting XGBoost Data Prep

Data Prep Key Ideas

  • Numeric variables only
    • Use one hot encoding for categorical variables
  • Labels must be numeric
    • If labels are categorical strings, they must be converted to numbers using LabelEncoder from sklearn
  • Split data into train and test sets
  • Tree-based, so no normalization necessary

Creating XGBoost models in Python requires three key data prep steps. First, all features must be numeric, so any categorical variables need to be converted with one hot encoding; in this case, roof and surface were converted. Second, categorical labels must be converted to numeric labels. The labels are currently 'Over' and 'Under'; using LabelEncoder from sklearn, 'Over' will be converted to 0 and 'Under' to 1. Third, because XGBoost is a supervised machine learning model, the prepped data must be split into training and testing sets: the training set is used to train the model, and the testing set is used to measure its accuracy. In the following example, the training set is formed by randomly selecting 80% of the data and the testing set is the remaining 20%. This 80/20 split is not the only option, just a popular one. The training and testing sets must be kept disjoint (separate) throughout the modeling process; failing to do so will most likely result in overfitting and poor performance on real data outside the training and testing sets. Finally, since XGBoost is a tree-based method, no data normalization is necessary prior to modeling.
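The first two steps can be seen on a toy frame (hypothetical values, not rows from gbg_cleaned). LabelEncoder assigns integer codes to the classes in sorted order, which is why 'Over' becomes 0 and 'Under' becomes 1, and get_dummies expands each categorical column into 0/1 indicator columns:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (made-up values) mirroring the categorical columns used here
toy = pd.DataFrame({
    'roof': ['dome', 'outdoors', 'outdoors'],
    'surface': ['turf', 'grass', 'turf'],
    'total_result': ['Over', 'Under', 'Over'],
})

# LabelEncoder codes classes alphabetically: 'Over' -> 0, 'Under' -> 1
le = LabelEncoder()
labels = le.fit_transform(toy['total_result'])
print(labels.tolist())  # [0, 1, 0]

# One hot encoding: each category becomes its own indicator column
features = pd.get_dummies(toy[['roof', 'surface']])
print(sorted(features.columns))
# ['roof_dome', 'roof_outdoors', 'surface_grass', 'surface_turf']
```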

The final cleaned dataset contains the following columns.

  • total_result
  • total_line
  • temp
  • wind
  • total_qb_elo
  • team_elo_diff
  • qb_elo_diff
  • avg_home_total_yards
  • avg_away_total_yards
  • avg_home_total_yards_against
  • avg_away_total_yards_against
  • roof
  • surface
  • div_game

The code for the data prep, as well as the initial (gbg_cleaned) and cleaned (gbg_xg) datasets can be found below.

# -*- coding: utf-8 -*-
"""
Created on Sat Jun 24 14:37:02 2023

@author: casey
"""

## LOAD LIBRARIES

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# --------------------------------------------------------------------------------------- #
## LOAD DATA

xg_df = pd.read_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/data/gbg_cleaned.csv')

# --------------------------------------------------------------------------------------- #
## SELECT COLUMNS FOR XGBoost MODELING

cols = ['total_result', 'total_line', 'temp', 'wind', 'total_qb_elo', 'team_elo_diff',
        'qb_elo_diff', 'avg_home_total_yards', 'avg_away_total_yards', 
        'avg_home_total_yards_against', 'avg_away_total_yards_against', 'roof', 'surface',
        'div_game']

xg_df = xg_df[cols]

# --------------------------------------------------------------------------------------- #
## CONVERT CATEGORICAL LABELS TO NUMERIC LABELS

le = LabelEncoder()
xg_df['total_result'] = le.fit_transform(xg_df['total_result'])

# --------------------------------------------------------------------------------------- #
## ONE HOT ENCODING OF CATEGORICAL VARIABLES (roof, surface)

labels = xg_df['total_result']
xg_df.drop(['total_result'], inplace=True, axis=1)
xg_df = pd.get_dummies(xg_df)
xg_df['total_result'] = labels

# --------------------------------------------------------------------------------------- #
## WRITE PREPPED DATA TO A .CSV FILE

xg_df.to_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/XGBoost/prepped_data/gbg_xg.csv',
              index=False)
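The script above stops at writing the prepped file; the 80/20 split described earlier happens at modeling time. A minimal sketch of that step, using sklearn's train_test_split on a stand-in frame (a toy DataFrame here, since the real gbg_xg.csv lives on a local path):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the prepped data; in practice this would be pd.read_csv on gbg_xg.csv
df = pd.DataFrame({'total_line': range(10), 'total_result': [0, 1] * 5})

X = df.drop('total_result', axis=1)
y = df['total_result']

# test_size=0.2 gives the 80/20 split; random_state makes the shuffle
# reproducible, and stratify keeps the Over/Under ratio equal in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(len(X_train), len(X_test))  # 8 2
```

Keeping the split out of the prep script means the same prepped file can feed different splits or cross-validation schemes later, as long as train and test rows stay disjoint.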