Data Prep in Python
Data Prep Key Ideas
- Numeric variables only
- Use one hot encoding for categorical variables
- Normalize the dataset
  - Strictly necessary only when using regularization penalty parameters or when feature importance is desired
  - However, it should always be done to reduce the probability of the vanishing gradient problem during training
- Split data into train and test sets
Creating logistic regression models in Python requires three key preparatory steps. First, all variables need to be numeric, so any categorical variables must be converted using one-hot encoding. In this case, roof and surface were converted to numeric variables. Second, the dataset needs to be normalized. Feature importance cannot be determined accurately without normalization, because coefficient magnitudes depend on each variable's scale: a variable can receive a larger or smaller coefficient simply because of the units it is measured in, not because it matters more or less. Normalization also helps avoid the vanishing gradient problem when the model is trained.
Third, because logistic regression is a supervised machine learning model, the prepped data must be split into training and testing sets. The training set is used to train the model, and the testing set is used to measure its accuracy. In this example, 80% of the data is randomly selected for the training set and the remaining 20% forms the testing set. This ratio is a popular choice, not the only option. The training and testing sets must be kept disjoint (separated) throughout the modeling process. Otherwise, the measured test accuracy will be overly optimistic and can hide overfitting, which most likely means poor performance on real data that is not from the training or testing set.
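The 80/20 split described above can be sketched as follows. This is a minimal illustration, not the author's modeling code: the toy values, the "over"/"under" labels, and the choice of random_state are assumptions made up for the example, and train_test_split from scikit-learn is one common way to do the split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the prepped data (all values are made up for illustration)
toy_df = pd.DataFrame({
    "total_line": [41.5, 44.0, 47.5, 39.0, 50.5, 43.0, 45.5, 48.0, 42.0, 46.5],
    "wind": [3, 12, 0, 8, 5, 15, 2, 7, 10, 4],
    "total_result": ["over", "under", "over", "under", "over",
                     "under", "over", "under", "over", "under"],  # hypothetical labels
})

# Separate the features from the label column
X = toy_df.drop(columns=["total_result"])
y = toy_df["total_result"]

# Randomly assign 80% of the rows to training and 20% to testing.
# stratify=y keeps the label proportions similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The two sets should share no rows (disjoint)
shared_rows = set(X_train.index) & set(X_test.index)
print(len(X_train), len(X_test), shared_rows)
```

Fixing random_state only makes the split reproducible; any seed works.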
The final cleaned dataset contains the following columns.
- total_result
- total_line
- temp
- wind
- total_qb_elo
- team_elo_diff
- qb_elo_diff
- avg_home_total_yards
- avg_away_total_yards
- avg_home_total_yards_against
- avg_away_total_yards_against
- roof
- surface
- div_game
The code for the data prep, as well as the initial (gbg_cleaned) and cleaned (gbg_lr_py) datasets, can be found below.
# -*- coding: utf-8 -*-
"""
Created on Thu Mar 23 08:55:19 2023
@author: casey
"""
## LOAD LIBRARIES
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# --------------------------------------------------------------------------------------- #
## LOAD DATA
lr_df = pd.read_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/data/gbg_cleaned.csv')
# --------------------------------------------------------------------------------------- #
## SELECT COLUMNS FOR Logistic Regression MODELING
cols = ['total_result', 'total_line', 'temp', 'wind', 'total_qb_elo', 'team_elo_diff',
        'qb_elo_diff', 'avg_home_total_yards', 'avg_away_total_yards',
        'avg_home_total_yards_against', 'avg_away_total_yards_against', 'roof', 'surface',
        'div_game']
lr_df = lr_df[cols]
# --------------------------------------------------------------------------------------- #
## ONE HOT ENCODING OF qualitative CATEGORICAL VARIABLES (roof, surface)
# since I will use min max scaling between 0 and 1 it's okay to one hot encode before normalization
labels = lr_df['total_result']
lr_df.drop(['total_result'], inplace=True, axis=1)
lr_df = pd.get_dummies(lr_df)
lr_df['total_result'] = labels
# --------------------------------------------------------------------------------------- #
## NORMALIZE DATA USING MINMAXSCALER
# Will use MinMaxScaler() to scale all quantitative variables between 0 and 1
scaler = MinMaxScaler(feature_range=(0, 1))
lr_df_scaled = scaler.fit_transform(lr_df.iloc[:,:-1])
lr_df_scaled = pd.DataFrame(lr_df_scaled, columns=lr_df.iloc[:,:-1].columns)
lr_df_scaled['total_result'] = lr_df['total_result']
# --------------------------------------------------------------------------------------- #
## WRITE PREPPED DATA TO A .CSV FILE
lr_df_scaled.to_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/Logistic_Regression/prepped_data/gbg_lr_py.csv',
                    index=False)
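The script above scales the full dataset before any split is made. One possible variation, shown here only as a sketch, is to split first and fit the MinMaxScaler on the training rows alone, so that the test set's minimum and maximum values do not influence the scaling. The toy values and labels are made up for illustration, and this is not what the script above does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the encoded data (all values are made up for illustration)
toy_df = pd.DataFrame({
    "temp": [30.0, 55.0, 72.0, 45.0, 68.0, 80.0, 25.0, 60.0, 50.0, 75.0],
    "wind": [3, 12, 0, 8, 5, 15, 2, 7, 10, 4],
    "total_result": ["over", "under", "over", "under", "over",
                     "under", "over", "under", "over", "under"],  # hypothetical labels
})

X = toy_df.drop(columns=["total_result"])
y = toy_df["total_result"]

# Split first, so the scaler never sees the test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Learn each column's min and max from the training rows only
scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)

# Reuse the training min and max on the test rows.
# Test values can fall slightly outside [0, 1] if they exceed the training range.
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X_test.columns, index=X_test.index
)
```

With this ordering, the training columns span exactly 0 to 1, while the test columns are scaled by the same constants and may land just outside that range.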