Data Prep Key Ideas
- Numeric variables only
- Use one hot encoding for categorical variables
- Split data into train and test sets
- Tree-based, so no normalization necessary
Preparing data for a Random Forest model in Python involves two key steps. First, all variables must be numeric, so any categorical variables need to be converted using one hot encoding. In this case, roof and surface were converted to numeric variables. Second, because Random Forest is a supervised machine learning model, the prepped data must be split into training and testing sets. The training set is used to train the model, and the testing set is used to test its accuracy. In the following example, the training set is created by randomly selecting 80% of the data, and the remaining 20% forms the testing set. This ratio is not the only option, just a popular one. The training and testing sets must be kept disjoint (separated) throughout the modeling process. Failure to do so will most likely produce an overly optimistic accuracy estimate that hides overfitting, followed by poor performance on real data that is in neither set. Because Random Forest is a tree-based method, no data normalization is necessary prior to modeling.
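The two steps can be sketched on a small made-up frame. The rows and values below are invented for illustration; only the column names temp, roof, surface, and total_result come from the dataset. Drawing the split with `DataFrame.sample` is one option among several (scikit-learn's `train_test_split` is a common alternative), so treat this as a sketch and not as the exact procedure used later.

```python
import pandas as pd

# Toy data: values are invented, only the column names mirror the real dataset
toy = pd.DataFrame({
    'temp': [70, 45, 62, 38, 55, 80, 49, 66, 72, 41],
    'roof': ['dome', 'outdoors', 'outdoors', 'closed', 'dome',
             'outdoors', 'open', 'outdoors', 'dome', 'outdoors'],
    'surface': ['fieldturf', 'grass', 'grass', 'fieldturf', 'astroturf',
                'grass', 'grass', 'fieldturf', 'astroturf', 'grass'],
    'total_result': ['Over', 'Under', 'Under', 'Over', 'Over',
                     'Under', 'Over', 'Under', 'Over', 'Under'],
})

# Step 1: one hot encode the categorical features, leaving the label untouched
features = pd.get_dummies(toy.drop(columns=['total_result']),
                          columns=['roof', 'surface'])
encoded = features.assign(total_result=toy['total_result'])

# Step 2: random 80/20 split; dropping the sampled rows from the full frame
# guarantees that the training and testing sets share no rows
train = encoded.sample(frac=0.8, random_state=42)
test = encoded.drop(train.index)

print(train.shape, test.shape)
```

Building the testing set from whatever the training sample left behind, instead of sampling a second time, is what keeps the two sets disjoint.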
The final cleaned dataset contains the following columns.
- total_result
- total_line
- temp
- wind
- total_qb_elo
- team_elo_diff
- qb_elo_diff
- avg_home_total_yards
- avg_away_total_yards
- avg_home_total_yards_against
- avg_away_total_yards_against
- roof
- surface
- div_game
The code for the data prep, as well as the initial (gbg_cleaned) and cleaned (gbg_rf) datasets can be found below.
# -*- coding: utf-8 -*-
"""
Created on Sun Jun 18 14:35:43 2023
@author: casey
"""
## LOAD LIBRARIES
import pandas as pd
# --------------------------------------------------------------------------------------- #
## LOAD DATA
rf_df = pd.read_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/data/gbg_cleaned.csv')
# --------------------------------------------------------------------------------------- #
## SELECT COLUMNS FOR RANDOM FOREST MODELING
cols = ['total_result', 'total_line', 'temp', 'wind', 'total_qb_elo', 'team_elo_diff',
'qb_elo_diff', 'avg_home_total_yards', 'avg_away_total_yards',
'avg_home_total_yards_against', 'avg_away_total_yards_against', 'roof', 'surface',
'div_game']
rf_df = rf_df[cols]
# --------------------------------------------------------------------------------------- #
## ONE HOT ENCODING OF CATEGORICAL VARIABLES (roof, surface)
# set the label aside so it is not encoded, then append it back as the last column
# no scaling step follows: tree-based models do not need normalized inputs
labels = rf_df['total_result']
rf_df = rf_df.drop(columns=['total_result'])
rf_df = pd.get_dummies(rf_df)
rf_df['total_result'] = labels
# --------------------------------------------------------------------------------------- #
## WRITE PREPPED DATA TO A .CSV FILE
rf_df.to_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/Random_Forest/prepped_data/gbg_rf.csv',
index=False)
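Before modeling, it can help to confirm that every feature column in the prepped frame is numeric. The sketch below uses a hypothetical helper, `non_numeric_features`, on a made-up frame. It is not part of the original script, and the values are invented.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def non_numeric_features(df, label_col):
    """Return the feature columns that are not numeric (booleans count as numeric)."""
    return [col for col in df.columns
            if col != label_col and not is_numeric_dtype(df[col])]

# Made-up frame standing in for the data before and after one hot encoding
raw = pd.DataFrame({
    'wind': [5, 12, 0],
    'roof': ['dome', 'outdoors', 'dome'],
    'total_result': ['Over', 'Under', 'Over'],
})
prepped = pd.get_dummies(raw, columns=['roof'])

print(non_numeric_features(raw, 'total_result'))      # roof is still text here
print(non_numeric_features(prepped, 'total_result'))  # nothing left to encode
```

An empty list for the prepped frame means every feature is ready for the model. Any name that does appear points to a column that still needs encoding.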