Decision Trees in Python
Model Evaluation Key Ideas
- Simpler is better
- Begin by fitting a full tree that includes all of the variables
- Then reduce the depth and/or number of variables until accuracy is significantly impacted
- Leaf nodes with very few samples indicate overfitting
- Reduce the depth of the tree until no leaf nodes contain very few samples
- Alternatively, set the min_samples_leaf parameter to require a minimum number of samples in every leaf
- Decision Trees can be used to determine variable importance
- Attempt to classify NFL games as going over or under the projected number of total points scored by both teams
Modeling (Default Tree: Depth 5)
To begin, a decision tree model with default parameters and a maximum depth of 7 was created using all of the features. The resulting tree can be viewed below. Zooming in on the leaf nodes shows that some contain very few samples (1, 2, or 3), which indicates overfitting. So, a model with a maximum depth of 6 was tried next.
The default tree for a maximum depth of 6 can be viewed below. Again, some leaf nodes contain very few samples, which indicates overfitting. So, a model with default parameters and a maximum depth of 5 was created.
The default tree for a maximum depth of 5 can be viewed below. This one looks much better than the other two and will be the baseline model to improve. A couple of leaf nodes still indicate overfitting, but this will be addressed during hyperparameter tuning by setting the min_samples_leaf parameter above those leaf sizes.
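The leaf-size check described above can also be done in code rather than by zooming into the plotted tree. The following is a minimal sketch using scikit-learn, assuming synthetic data from `make_classification` as a stand-in for the NFL features (the real dataset is not shown here); the helper `leaf_sizes` and the threshold of 3 samples are illustrative choices, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the NFL feature table (assumption: real data not available here).
X, y = make_classification(n_samples=2000, n_features=18, n_informative=4, random_state=0)


def leaf_sizes(model):
    """Return the number of training samples that land in each leaf of a fitted tree."""
    tree = model.tree_
    is_leaf = tree.children_left == -1  # scikit-learn marks leaves with child index -1
    return tree.n_node_samples[is_leaf]


# Count tiny leaves (3 samples or fewer) at each of the depths tried above.
small_leaf_counts = {}
for depth in (7, 6, 5):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    sizes = leaf_sizes(model)
    small_leaf_counts[depth] = int((sizes <= 3).sum())
    print(f"max_depth={depth}: {len(sizes)} leaves, {small_leaf_counts[depth]} with <= 3 samples")

# Setting a floor on leaf size removes the tiny leaves directly.
floored = DecisionTreeClassifier(max_depth=5, min_samples_leaf=4, random_state=0).fit(X, y)
min_leaf = int(leaf_sizes(floored).min())
print("smallest leaf with min_samples_leaf=4:", min_leaf)
```

On real data, the same loop gives a quick count of suspicious leaves per depth, which is easier to compare than three large tree images.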
The confusion matrix and evaluation metrics for the baseline model can be viewed below.
- Accuracy: 0.52
- ROC AUC: 0.52
- Precision: 0.53
- Precision (Over): 0.55
- Precision (Under): 0.51
- Recall: 0.52
- Recall (Over): 0.24
- Recall (Under): 0.79
- F1: 0.48
- F1 (Over): 0.34
- F1 (Under): 0.62
Overall, this baseline decision tree model does a poor job of classifying whether the number of points went over or under the total. With an accuracy of 52%, it barely does better than random guessing. It’s interesting to note that it really struggles to identify the Over: recall for the Over is only 0.24, compared with 0.79 for the Under. This is where the majority of the accuracy is being lost. Next, hyperparameter tuning will be used to try to improve the accuracy of the model.
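The overall precision, recall, and F1 values in the list appear to be unweighted (macro) averages of the per-class values. That is an inference from the numbers, not something the write-up states. The sketch below recomputes them from the rounded per-class figures; small differences from the reported values (for example, F1 for the Over) are expected because the inputs are already rounded.

```python
# Per-class values as reported for the baseline model (rounded to two decimals).
precision = {"Over": 0.55, "Under": 0.51}
recall = {"Over": 0.24, "Under": 0.79}


def f1(p, r):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def macro(values):
    """Unweighted mean across classes."""
    return sum(values.values()) / len(values)


f1_by_class = {label: f1(precision[label], recall[label]) for label in precision}
macro_precision = macro(precision)
macro_recall = macro(recall)
macro_f1 = macro(f1_by_class)

print(f"macro precision: {macro_precision:.3f}")
print(f"macro recall:    {macro_recall:.3f}")
print(f"macro F1:        {macro_f1:.3f}")
```

The low F1 for the Over comes almost entirely from its low recall, which matches the observation that the baseline tree rarely predicts the Over.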
Hyperparameter Tuning of Baseline Model
Now that a baseline model has been created, hyperparameter tuning will be implemented to determine the optimal parameters with regard to the accuracy of the model. GridSearchCV, RandomizedSearchCV, and Bayesian Optimization will be used.
GridSearchCV
Performing GridSearchCV to find the optimal parameters returned the following:
- max_depth=3
- criterion='gini'
- min_samples_leaf=4
- min_samples_split=10
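For reference, a grid search over these four parameters could look like the sketch below. The candidate values in `param_grid`, the 5-fold setting, and the synthetic data are all assumptions for illustration, since the ranges actually searched are not given here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the NFL data (assumption).
X, y = make_classification(n_samples=1000, n_features=18, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Hypothetical grid: the values actually searched are not listed in the write-up.
param_grid = {
    "max_depth": [3, 4, 5, 6, 7],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [1, 2, 4, 8],
    "min_samples_split": [2, 5, 10],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="accuracy",  # tune for accuracy, as described above
    cv=5,                # assumed number of folds
)
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)  # accuracy of the refit best model
print(best_params, round(test_accuracy, 2))
```

GridSearchCV refits the best combination on the full training set by default, so `search` can be used directly as the tuned model.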
RandomizedSearchCV
Coming Soon…
Bayesian Optimization
Coming Soon…
Modeling (GridSearchCV Tuned Model)
The decision tree for the GridSearchCV tuned model can be viewed below.
The confusion matrix and evaluation metrics for the GridSearchCV tuned model can be viewed below.
- Accuracy: 0.52
- ROC AUC: 0.51
- Precision: 0.51
- Precision (Over): 0.50
- Precision (Under): 0.53
- Recall: 0.51
- Recall (Over): 0.47
- Recall (Under): 0.56
- F1: 0.51
- F1 (Over): 0.48
- F1 (Under): 0.55
There is very little change in accuracy from the baseline to the GridSearchCV tuned model. This indicates that altering parameters does not have much of an effect on the accuracy of the model. The one interesting note is that the recall and precision scores are much more balanced: they no longer strongly favor either label. That, along with the very poor performance, means a Decision Tree is probably not a great classifier for the data.
Cross Validation (GridSearchCV Tuned Model)
To verify that the accuracy obtained from a single random train and test split is representative, cross validation will be performed. KFold cross validation can be used since the label distribution is approximately equal (48.5% Over, 51.5% Under). The following cross validation used 10 folds.
Fold 1 : 0.51
Fold 2 : 0.54
Fold 3 : 0.50
Fold 4 : 0.52
Fold 5 : 0.54
Fold 6 : 0.56
Fold 7 : 0.52
Fold 8 : 0.53
Fold 9 : 0.54
Fold 10 : 0.57
Mean Accuracy: 0.53
Cross validation revealed that the random train and test split used for the tuned model slightly underperformed compared to the mean (0.52 vs. 0.53). The accuracies are similar across folds, ranging from 0.50 to 0.57, which is ideal. This means that the model performs similarly regardless of how the data is split (so there is not much risk of overfitting). So the mean accuracy of the cross validation, 53%, can be used as a good estimate of how the model will perform on real data.
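To put a number on how similar the folds are, the reported fold accuracies can be summarized with a mean, a standard deviation, and a range. This is a small sketch using only the values listed above; the choice of the population standard deviation is an assumption for illustration.

```python
from statistics import mean, pstdev

# Fold accuracies as reported in the 10-fold cross validation above.
fold_accuracies = [0.51, 0.54, 0.50, 0.52, 0.54, 0.56, 0.52, 0.53, 0.54, 0.57]

mean_accuracy = mean(fold_accuracies)
spread = pstdev(fold_accuracies)  # population standard deviation across the folds
accuracy_range = max(fold_accuracies) - min(fold_accuracies)

print(f"mean={mean_accuracy:.3f}, std={spread:.3f}, range={accuracy_range:.2f}")
```

A spread of a couple of percentage points around a mean near 0.53 also shows that several individual folds are indistinguishable from a coin flip.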
Feature Importance
Using the GridSearchCV model, the feature importance can be found and plotted.
Looking at the plot, the only impactful variables are ‘wind’, ‘qb_elo_diff’, ‘avg_home_total_yards’, and ‘total_qb_elo’. The ‘wind’ variable is really the only one with substantial importance. None of the remaining variables are impactful. This indicates that there really aren’t many features that are helpful for creating the decision tree. Again, since the model didn’t perform well, this reinforces that a decision tree isn’t the best model for these features. To verify that there really are only four important features, a reduced decision tree model can be created. A similar accuracy should be seen with the reduced model.
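The importance check and the reduced-model comparison can be sketched as follows. The data is synthetic and the feature names are placeholders (both assumptions); the tree parameters are the ones reported by the grid search.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the NFL data, with placeholder feature names (assumptions).
X, y = make_classification(n_samples=1000, n_features=18, n_informative=4, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Parameters reported by the grid search.
params = dict(max_depth=3, criterion="gini", min_samples_leaf=4,
              min_samples_split=10, random_state=0)
full = DecisionTreeClassifier(**params).fit(X, y)

# Impurity-based importances: a feature that is never split on scores exactly 0.
importances = full.feature_importances_
used = np.flatnonzero(importances > 0)
for i in used[np.argsort(importances[used])[::-1]]:
    print(f"{feature_names[i]}: {importances[i]:.3f}")

# Refit using only the features the full tree split on, then compare predictions.
reduced = DecisionTreeClassifier(**params).fit(X[:, used], y)
same_predictions = bool(np.array_equal(full.predict(X), reduced.predict(X[:, used])))
print("reduced tree reproduces the full tree's predictions:", same_predictions)
```

A tree with a maximum depth of 3 has at most 7 internal nodes, so at most 7 features can receive a nonzero importance no matter how many are supplied. Dropping the unused features would therefore be expected to leave the fitted tree essentially unchanged.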
Modeling (Reduced Tree: Only Important Features)
Next, a reduced model using only the important features was created. The parameters remain the same as the ones found through GridSearchCV. The resulting tree can be viewed below.
The confusion matrix and evaluation metrics can be viewed below.
- Accuracy: 0.52
- ROC AUC: 0.51
- Precision: 0.51
- Precision (Over): 0.50
- Precision (Under): 0.53
- Recall: 0.51
- Recall (Over): 0.47
- Recall (Under): 0.56
- F1: 0.51
- F1 (Over): 0.48
- F1 (Under): 0.55
The evaluation metrics are exactly the same, which verifies that only those four features were contributing.
Decision Trees in R
Full Model
The full model decision tree was created using all of the variables in the prepped data.
The following variables were used in the full model (in order of importance).
off_def_diff
total_line
wind
temp
away_total_defense_rank
home_total_offense_rank
home_total_defense_rank
surface
total_qb_elo
away_total_offense_rank
total_team_elo
roof
game_type
div_game
weekday
home_rest
location
away_rest
Below is an image of the decision tree from the full model (download to enlarge, so it can be explored).
Full Model Accuracy: 0.502
This accuracy can be examined further using the confusion matrix below.
Reduced Model Version 1 (Max Depth = 8)
The reduced model decision tree was created using the variables that were deemed most important after constructing the full model. Using a reduced model provided a number of advantages. It reduced the dimensionality of the data, allowed for a simpler model, and performed better than the full model. The following variables were used in the reduced model (in order of importance). There are some variables that were considered more important in the full model than the ones listed below (away_total_defense_rank, home_total_offense_rank, home_total_defense_rank). However, when run in the reduced model, they turned out to be less important than the ones listed below and were therefore left out of the final reduced model.
off_def_diff
total_line
total_qb_elo
wind
temp
surface
Below is an image of the decision tree created from the reduced model (download to enlarge, so it can be explored).
Reduced Model Accuracy: 0.53
This accuracy can be examined further using the confusion matrix below.
Some interesting notes from the confusion matrix:
- The overall accuracy of the reduced model is 0.53
- The reduced model was much better at correctly predicting the Under than the Over
- This is a pretty large difference and something to keep in mind
- The reduced model correctly identified 36.7% of the actual Overs (Sensitivity)
- The reduced model correctly identified 69.2% of the actual Unders (Specificity)
Reduced Model Version 2 (Max Depth = 6)
The second version of the reduced model decision tree was created using the variables that were deemed most important after constructing the full model. Using a reduced model provided a number of advantages. It reduced the dimensionality of the data, allowed for a simpler model, and performed better than the full model. Using a max depth of 6 creates a simpler tree compared to the first reduced model. However, it may not be as accurate. The following variables were used in the reduced model (in order of importance). There are some variables that were considered more important in the full model than the ones listed below (away_total_defense_rank, home_total_offense_rank, home_total_defense_rank). However, when run in the reduced model, they turned out to be less important than the ones listed below and were therefore left out of the final reduced model.
off_def_diff
total_line
total_qb_elo
wind
temp
surface
Below is an image of the decision tree created from the reduced model (download to enlarge, so it can be explored).
Reduced Model Accuracy: 0.51
This accuracy can be examined further using the confusion matrix below.
Some interesting notes from the confusion matrix:
- The overall accuracy of the second version of the reduced model is 0.51
- The second version of the reduced model was also much better at correctly predicting the Under than the Over (the difference was even larger than the original reduced model)
- The reduced model correctly identified 20.5% of the actual Overs (Sensitivity)
- The reduced model correctly identified 81.4% of the actual Unders (Specificity)
Cross-Validation of Reduced Model Version 1
Cross-validation is used to determine how the model performs on multiple different training and testing sets, which gives a more reliable estimate of how the model is performing. A single training and testing split is randomly selected from the original dataset, so random sampling alone might result in a model that performs above or below the average. K-fold cross-validation splits the data into K folds and trains and tests the model K times, using a different fold as the testing set each time. This reduces the amount that any one random split can affect the measured performance of the model. K-fold cross-validation was performed on the reduced model with K equal to 5 (the model is assessed 5 times and then the average accuracy is taken).
Cross-Validation was performed on the first version of the reduced model as it performed the best out of all three models.
CV Accuracy of Reduced Model: 0.513
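The mechanics of K-fold cross-validation can be sketched in a few lines of plain Python. This is an illustration of the general technique, not the code used for the result above: the function names, the toy labels, and the majority-class "model" are all hypothetical stand-ins for fitting and scoring the reduced tree.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices once and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(indices[start:start + size])
        start += size
    return folds


def cross_validate(n_samples, k, fit_and_score, seed=0):
    """Call fit_and_score(train_idx, test_idx) once per fold and average the scores."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Every fold except the current one forms the training set.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(fit_and_score(train_idx, test_idx))
    return sum(scores) / k, scores


# Toy labels (made up) and a stand-in "model" that predicts the majority training label.
labels = [i % 2 for i in range(103)]


def majority_baseline(train_idx, test_idx):
    ones = sum(labels[j] for j in train_idx)
    guess = 1 if ones * 2 >= len(train_idx) else 0
    return sum(labels[j] == guess for j in test_idx) / len(test_idx)


mean_score, fold_scores = cross_validate(len(labels), k=5, fit_and_score=majority_baseline)
print(f"mean accuracy over 5 folds: {mean_score:.3f}")
```

Each sample lands in the testing set exactly once, so every observation contributes to the final averaged accuracy.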
Decision Trees in Python Results
Coming Soon…