Sports Betting Data Prep/EDA

Below are links to the .csv files of the uncleaned datasets (elo_fill_na needed no cleaning). These four datasets are required to run the cleaning and EDA in the ML_EDA.R file.

Below are links to the .csv files of the cleaned datasets.

Below are links to the .py and .R files that contain the code used for the cleaning and EDA (clicking download gives the .py or .R file; clicking the link gives a .txt file). ‘API_Data’ is the .py file and contains the code that pulls the ‘news_headlines’ and ‘news_headlines_vectorized’ datasets from an API. ‘ML_EDA’ is the .R file and contains the code for the rest of the cleaning and EDA. The ‘gbg_uncleaned’, ‘nfl_elo’, ‘weather_na’, and ‘elo_fill_na’ .csv files are required to run the .R file.

Below is the Python code that API_Data.py contains.

# -*- coding: utf-8 -*-
"""
Created on Wed Jan 18 17:16:37 2023

@author: casey
"""
    
    
## ATTENTION READER...
##
## First, you will need to go to 
## https://newsapi.org/
## https://newsapi.org/register
## and get an API key



################## DO NOT USE MY KEY!!
## Get your own key. 
##
###################################################


### API KEY  - get a key!
##https://newsapi.org/

## Example URL
## https://newsapi.org/v2/everything?
## q=tesla&from=2021-05-20&sortBy=publishedAt&
## apiKey=e440d40102394e83888ea805f0485dfa


## What to import
import requests  ## for getting data from a server GET
import re   ## for regular expressions
import pandas as pd    ## for dataframes and related
from pandas import DataFrame

## To tokenize and vectorize text type data
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
## For word clouds
## conda install -c conda-forge wordcloud
## May also have to run conda update --all on cmd
#import PIL
#import Pillow
import wordcloud
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt


from sklearn.model_selection import train_test_split
import random as rd
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
#from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn import tree
## conda install python-graphviz
## restart kernel (click the little red x next to the Console)
import graphviz

from sklearn.decomposition import LatentDirichletAllocation

from sklearn.metrics import silhouette_samples, silhouette_score
import sklearn
from sklearn.cluster import KMeans

from sklearn import preprocessing

import seaborn as sns
## numpy, pandas, and matplotlib.pyplot are already imported above
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.manifold import MDS
from mpl_toolkits.mplot3d import Axes3D
from scipy.cluster.hierarchy import ward, dendrogram


####################################
##
##  Step 1: Connect to the server
##          Send a query
##          Collect and clean the 
##          results
####################################

####################################################
##In the following loop, we will query the newsapi servers
##for all the topic names in the list
## We will then build a large csv file 
## where each article is a row
##
## From there, we will convert this data
## into a labeled dataframe
## so we can train and then test our DT
## model
####################################################

####################################################
## Build the URL and GET the results
## NOTE: At the bottom of this code
## commented out, you will find a second
## method for doing the following. This is FYI.
####################################################

## This is the endpoint - the server and 
## location on the server where your data 
## will be retrieved from

## TEST FIRST!
## We are about to build this URL:
## https://newsapi.org/v2/everything?apiKey=e440d40102394e83888ea805f0485dfa&q=bitcoin



topics=["sports betting", "sports", "sports business", "sports politics", "politics", "business"]



## topics needs to be a list of strings (words)
## Next, let's build the csv file
## first and add the column names
## Create a new csv file to save the headlines
MyFILE=open("C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/Module_1/data/news_headlines.csv","w")  # "a"  for append   "r" for read
## with open
### Place the column names in - write to the first row
WriteThis="LABEL,Date,Source,Title,Headline\n"
MyFILE.write(WriteThis)
MyFILE.close()

## Check it! Can you find this file?
    
#### --------------------> GATHER - CLEAN - CREATE FILE    

## RE: documentation and options
## https://newsapi.org/docs/endpoints/everything

endpoint="https://newsapi.org/v2/everything"

################# enter for loop to collect
################# data on the six topics
#######################################

for topic in topics:

    ## Dictionary Structure
    URLPost = {'apiKey':'e440d40102394e83888ea805f0485dfa',
               'q':topic
    }

    response=requests.get(endpoint, URLPost)
    print(response)
    jsontxt = response.json()
    print(jsontxt)
    #####################################################
    
    
    ## Open the file for append
    MyFILE=open('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/Module_1/data/news_headlines.csv', "a")
    LABEL=topic
    for items in jsontxt["articles"]:
        print(items, "\n\n\n")
                  
        #Author=items["author"]
        #Author=str(Author)
        #Author=Author.replace(',', '')
        
        Source=items["source"]["name"]
        print(Source)
        
        Date=items["publishedAt"]
        ##clean up the date
        NewDate=Date.split("T")
        Date=NewDate[0]
        print(Date)
        
        ## CLEAN the Title
        ##----------------------------------------------------------
        ##Replace punctuation with space
        # Accept one or more copies of punctuation         
        # plus zero or more copies of a space
        # and replace it with a single space
        Title=items["title"]
        Title=str(Title)
        #print(Title)
        Title=re.sub(r'[,.;@#?!&$\-\']+', ' ', str(Title), flags=re.IGNORECASE)
        Title=re.sub(' +', ' ', str(Title), flags=re.IGNORECASE)
        Title=re.sub(r'\"', ' ', str(Title), flags=re.IGNORECASE)
        
        # and replace it with a single space
        ## NOTE: Using the "^" on the inside of the [] means
        ## we want to look for any chars NOT a-z or A-Z and replace
        ## them with blank. This removes chars that should not be there.
        Title=re.sub(r'[^a-zA-Z]', " ", str(Title), flags=re.VERBOSE)
        Title=Title.replace(',', '')
        Title=' '.join(Title.split())
        Title=re.sub("\n|\r", "", Title)
        print(Title)
        ##----------------------------------------------------------
        
        Headline=items["description"]
        Headline=str(Headline)
        Headline=re.sub(r'[,.;@#?!&$\-\']+', ' ', Headline, flags=re.IGNORECASE)
        Headline=re.sub(' +', ' ', Headline, flags=re.IGNORECASE)
        Headline=re.sub(r'\"', ' ', Headline, flags=re.IGNORECASE)
        Headline=re.sub(r'[^a-zA-Z]', " ", Headline, flags=re.VERBOSE)
        ## Be sure there are no commas in the headlines or it will
        ## write poorly to a csv file....
        Headline=Headline.replace(',', '')
        Headline=' '.join(Headline.split())
        Headline=re.sub("\n|\r", "", Headline)
        
        ### AS AN OPTION - remove words of a given length............
        Headline = ' '.join([wd for wd in Headline.split() if len(wd)>3])
    
        #print("Author: ", Author, "\n")
        #print("Title: ", Title, "\n")
        #print("Headline News Item: ", Headline, "\n\n")
        
        #print(Author)
        print(Title)
        print(Headline)
        
        WriteThis=str(LABEL)+","+str(Date)+","+str(Source)+","+ str(Title) + "," + str(Headline) + "\n"
        print(WriteThis)
        
        MyFILE.write(WriteThis)
        
    ## CLOSE THE FILE
    MyFILE.close()
    
################## END for loop

####################################################
##
## Where are we now?
## 
## So far, we have created a csv file
## with labeled data. Each row is a news article
##
## - BUT - 
## We are not done. We need to choose which
## parts of this data to use to model our decision tree
## and we need to convert the data into a data frame.
##
# ------------------------------------------------------------------------------------------- #

news_df=pd.read_csv("C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/Module_1/data/news_headlines.csv", on_bad_lines='skip')  ## on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
print(news_df.head())
# iterating the columns 
for col in news_df.columns: 
    print(col) 
    
print(news_df["Headline"])

## REMOVE any rows with NaN in them
news_df = news_df.dropna()
print(news_df["Headline"])

### Tokenize and Vectorize the Headlines
## Create the list of headlines
## Keep the labels!

HeadlineLIST=[]
LabelLIST=[]

for nexthead, nextlabel in zip(news_df["Headline"], news_df["LABEL"]):
    HeadlineLIST.append(nexthead)
    LabelLIST.append(nextlabel)

print("The headline list is:\n")
print(HeadlineLIST)

print("The label list is:\n")
print(LabelLIST)


##########################################
## Remove all words that match the topics.
## For example, if the topics are food and covid
## remove these exact words.
##
## We will need to do this by hand. 
NewHeadlineLIST=[]

for element in HeadlineLIST:
    print(element)
    print(type(element))
    ## make into list
    AllWords=element.split(" ")
    print(AllWords)
    
    ## Now remove words that are in your topics
    NewWordsList=[]
    for word in AllWords:
        print(word)
        word=word.lower()
        ## NOTE: multi-word topics like "sports betting" will never match a
        ##  single word, so this only removes one-word topic names
        if word in topics:
            print(word)
        else:
            NewWordsList.append(word)
            
    ##turn back to string
    NewWords=" ".join(NewWordsList)
    ## Place into NewHeadlineLIST
    NewHeadlineLIST.append(NewWords)


##
## Set the     HeadlineLIST to the new one
HeadlineLIST=NewHeadlineLIST
print(HeadlineLIST)     
#########################################
##
##  Build the labeled dataframe
##
######################################################

### Vectorize
## Instantiate your CV
MyCountV=CountVectorizer(
        input="content",  ## because we have a csv file
        lowercase=True, 
        stop_words = "english",
        max_features=50
        )

## Use your CV 
MyDTM = MyCountV.fit_transform(HeadlineLIST)  # create a sparse matrix
print(type(MyDTM))


ColumnNames=MyCountV.get_feature_names_out()  ## get_feature_names() was removed in newer scikit-learn
#print(type(ColumnNames))


## Build the data frame
MyDTM_DF=pd.DataFrame(MyDTM.toarray(),columns=ColumnNames)

## Convert the labels from list to df
Labels_DF = DataFrame(LabelLIST,columns=['LABEL'])

## Check your new DF and your new Labels df:
print("Labels\n")
print(Labels_DF)
print("News df\n")
print(MyDTM_DF.iloc[:,0:6])

##Save original DF - without the labels
My_Orig_DF=MyDTM_DF
print(My_Orig_DF)
######################
## AND - just to make sure our dataframe is fair
## let's remove columns named after the topics
## (as these are label names)
######################
#MyDTM_DF=MyDTM_DF.drop(topics, axis=1)


## Now - let's create a complete and labeled
## dataframe:
dfs = [Labels_DF, MyDTM_DF]
print(dfs)

Final_News_DF_Labeled = pd.concat(dfs,axis=1, join='inner')
## DF with labels
print(Final_News_DF_Labeled)

## with open
### Place the column names in - write to the first row
Final_News_DF_Labeled.to_csv("C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/Module_1/data/news_headlines_vectorized.csv", index=False)

Below is the R code that ML_EDA.R contains.

# Load necessary packages

library(tidyverse)
library(ggplot2)
library(nflfastR)
library(nflplotR)
library(plotly)
library(htmltools)
library(crosstalk)

# ----------------------------------------------------------------- #
## LOAD and CLEAN weather_na dataset

# loads dataset that contains temp and wind values for the 206 games missing them
weather_na <- read_csv("C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/data/weather_na.csv") 

# gets col types for weather_na
str(weather_na)

# select only necessary columns to join weather_na and games
weather_na <- weather_na %>% 
  select(game_id, temp, wind)

# ---------------------------------------------------------------- #
## LOAD and CLEAN nfl_elo dataset
# Dataset is current as of 2/18/2023
# To download more current .csv file go to
#  https://github.com/fivethirtyeight/data/blob/master/nfl-elo/README.md
nfl_elo <- read_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/data/nfl_elo.csv') %>%
  select(-c('qbelo_prob1', 'qbelo_prob2', 'playoff', 'quality',
            'qb1_game_value', 'qb2_game_value', 'qb1_value_post',
            'qb2_value_post', 'qbelo1_post', 'qbelo2_post', 'elo1_post',
            'elo2_post', 'neutral', 'importance', 'total_rating',
            'qb1', 'qb2')) %>%
  filter(season > 1999, !is.na(qbelo1_pre))

nfl_elo$team1[nfl_elo$team1 == 'WSH'] <- 'WAS'
nfl_elo$team2[nfl_elo$team2 == 'WSH'] <- 'WAS'
nfl_elo$team1[nfl_elo$team1 == 'LAC' & nfl_elo$season < 2017] <- 'SD'
nfl_elo$team2[nfl_elo$team2 == 'LAC' & nfl_elo$season < 2017] <- 'SD'
nfl_elo$team1[nfl_elo$team1 == 'LAR' & nfl_elo$season < 2016] <- 'STL'
nfl_elo$team2[nfl_elo$team2 == 'LAR' & nfl_elo$season < 2016] <- 'STL'
nfl_elo$team1[nfl_elo$team1 == 'OAK' & nfl_elo$season > 2019] <- 'LV'
nfl_elo$team2[nfl_elo$team2 == 'OAK' & nfl_elo$season > 2019] <- 'LV'

# ------------------------------------------------------------------------- #
## LOAD AND CLEAN elo_fill_na dataset
elo_fill_na <- read_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/data/elo_fill_na.csv')

str(elo_fill_na)
# ------------------------------------------------------------------------- #
## Load play by play

# Can take a little while to run, loads over 1 million rows
# play by play data from 2000-2022 from nflFastR
pbp <- load_pbp(2000:2022) %>%
  select(game_id, home_team, away_team, week, posteam, posteam_type,
         defteam, down, play_type, yards_gained, field_goal_result, two_point_conv_result,
         fourth_down_converted, fourth_down_failed, interception, penalty, penalty_team,
         penalty_yards, fumble_lost, sack, two_point_attempt, punt_attempt, season,
         stadium, weather, first_down, pass, rush, passing_yards, receiving_yards, rushing_yards)


# replace N/A with 0 for pass and rush yards because N/A currently represents 0
pbp <- pbp %>%
  mutate_at(c('passing_yards','rushing_yards'), ~replace_na(., 0)) %>%
  drop_na(posteam, posteam_type)

# calculate passing, rushing, and total yards per game and store in 'avg' df
avg <- pbp %>%
  group_by(game_id, home_team, away_team, week, season) %>%
  summarise(home_pass_yards = sum(passing_yards[posteam_type == 'home']),
            home_rush_yards = sum(rushing_yards[posteam_type == 'home']),
            away_pass_yards = sum(passing_yards[posteam_type == 'away']),
            away_rush_yards = sum(rushing_yards[posteam_type == 'away'])) %>%
  mutate(home_total_yards = home_pass_yards + home_rush_yards,
         away_total_yards = away_pass_yards + away_rush_yards)


# set averages to 0 to start, will be updated in for loop below
avg$avg_home_total_yards <- 0
avg$avg_away_total_yards <- 0
avg$avg_home_total_yards_against <- 0
avg$avg_away_total_yards_against <- 0

# gets a list of seasons to aggregate for
seasons <- unique(avg$season)

# For every season calculates the home and away team's average total yards 
#  for and against up to that current point in the season
# Can take a while to run
for (season_num in seasons) {
  print(season_num)
  f <- avg %>% filter(season == season_num)
  
  for(i in 1:nrow(f)) { 
    # gets number of games played for current home team
    h_gp <- length(which((f[i,]$home_team == f$home_team | f[i,]$home_team == f$away_team) & 
                           (f[i, ]$week > f$week)))
    
    # gets number of games played for current away team
    a_gp <- length(which((f[i,]$away_team == f$home_team | f[i,]$away_team == f$away_team) & 
                           (f[i, ]$week > f$week)))
    
    # gets indices of previous game played for the home and away team
    h_ind <- which((f[i,]$home_team == f$home_team | f[i,]$home_team == f$away_team) & 
                     (f[i, ]$week > f$week))
    a_ind <- which((f[i,]$away_team == f$home_team | f[i,]$away_team == f$away_team) & 
                     (f[i, ]$week > f$week))
    h_ind <- h_ind[length(h_ind)]
    a_ind <- a_ind[length(a_ind)]
    
    if (length(h_ind) == 0 & length(a_ind) == 0){
      # first game of the season for both teams: leave both averages at 0
      
      # update away team average only
    } else if (length(h_ind) == 0 & length(a_ind) != 0){
      if (f[i, ]$away_team != f[a_ind, ]$away_team) {
        f[i, ]$avg_away_total_yards <- ((f[a_ind, ]$avg_home_total_yards * (a_gp - 1)) + f[a_ind, ]$home_total_yards) / a_gp
        f[i, ]$avg_away_total_yards_against <- ((f[a_ind, ]$avg_away_total_yards * (a_gp - 1)) + f[a_ind, ]$away_total_yards) / a_gp
      } else {
        f[i, ]$avg_away_total_yards <- ((f[a_ind, ]$avg_away_total_yards * (a_gp - 1)) + f[a_ind, ]$away_total_yards) / a_gp
        f[i, ]$avg_away_total_yards_against <- ((f[a_ind, ]$avg_home_total_yards * (a_gp - 1)) + f[a_ind, ]$home_total_yards) / a_gp
      }
      
      # update home team average only
    } else if (length(a_ind) == 0 & length(h_ind) != 0) {
      if (f[i, ]$home_team != f[h_ind, ]$home_team) {
        f[i, ]$avg_home_total_yards <- ((f[h_ind, ]$avg_away_total_yards * (h_gp - 1)) + f[h_ind, ]$away_total_yards) / h_gp
        f[i, ]$avg_home_total_yards_against <- ((f[h_ind, ]$avg_home_total_yards * (h_gp - 1)) + f[h_ind, ]$home_total_yards) / h_gp
      } else {
        f[i, ]$avg_home_total_yards <- ((f[h_ind, ]$avg_home_total_yards * (h_gp - 1)) + f[h_ind, ]$home_total_yards) / h_gp
        f[i, ]$avg_home_total_yards_against <- ((f[h_ind, ]$avg_away_total_yards * (h_gp - 1)) + f[h_ind, ]$away_total_yards) / h_gp
      }
      
      # update both home and away team average
    } else {
      if (f[i, ]$home_team != f[h_ind, ]$home_team) {
        f[i, ]$avg_home_total_yards <- ((f[h_ind, ]$avg_away_total_yards * (h_gp - 1)) + f[h_ind, ]$away_total_yards) / h_gp
        f[i, ]$avg_home_total_yards_against <- ((f[h_ind, ]$avg_home_total_yards * (h_gp - 1)) + f[h_ind, ]$home_total_yards) / h_gp
      } else {
        f[i, ]$avg_home_total_yards <- ((f[h_ind, ]$avg_home_total_yards * (h_gp - 1)) + f[h_ind, ]$home_total_yards) / h_gp
        f[i, ]$avg_home_total_yards_against <- ((f[h_ind, ]$avg_away_total_yards * (h_gp - 1)) + f[h_ind, ]$away_total_yards) / h_gp
      }
      if (f[i, ]$away_team != f[a_ind, ]$away_team) {
        f[i, ]$avg_away_total_yards <- ((f[a_ind, ]$avg_home_total_yards * (a_gp - 1)) + f[a_ind, ]$home_total_yards) / a_gp
        f[i, ]$avg_away_total_yards_against <- ((f[a_ind, ]$avg_away_total_yards * (a_gp - 1)) + f[a_ind, ]$away_total_yards) / a_gp
      } else {
        f[i, ]$avg_away_total_yards <- ((f[a_ind, ]$avg_away_total_yards * (a_gp - 1)) + f[a_ind, ]$away_total_yards) / a_gp
        f[i, ]$avg_away_total_yards_against <- ((f[a_ind, ]$avg_home_total_yards * (a_gp - 1)) + f[a_ind, ]$home_total_yards) / a_gp
      }
    }
  }
  # update dataframe
  avg$avg_away_total_yards[avg$season == season_num] <- f$avg_away_total_yards
  avg$avg_home_total_yards[avg$season == season_num] <- f$avg_home_total_yards
  avg$avg_away_total_yards_against[avg$season == season_num] <- f$avg_away_total_yards_against
  avg$avg_home_total_yards_against[avg$season == season_num] <- f$avg_home_total_yards_against
}

# check distributions to ensure above loop worked correctly
# should have some zeros and everything else between ~150 and 500
hist(avg$avg_away_total_yards)
hist(avg$avg_away_total_yards_against)

# initializes ranks at 0
avg$home_total_offense_rank <- 0
avg$away_total_offense_rank <- 0
avg$home_total_defense_rank <- 0
avg$away_total_defense_rank <- 0

# Ranks the offense and defense of teams based on where they stand the current week
#  for offense the higher the rank the better the team is
#  for defense the lower the rank the better the team is
for (season_num in seasons){
  weeks <- unique(avg$week[avg$season == season_num])
  
  for (week_num in weeks) {
    # offensive ranks
    ho_list <- avg$avg_home_total_yards[avg$season == season_num & avg$week == week_num]
    ao_list <- avg$avg_away_total_yards[avg$season == season_num & avg$week == week_num]
    hao_list <- append(ho_list, ao_list)
    rank_list <- rank(hao_list, ties.method = 'max')
    ho_list <- rank_list[1:length(ho_list)]
    ao_list <- rank_list[(length(ho_list) + 1):length(rank_list)]
    avg$home_total_offense_rank[avg$season == season_num & avg$week == week_num] <- ho_list
    avg$away_total_offense_rank[avg$season == season_num & avg$week == week_num] <- ao_list
    
    # defensive ranks
    hd_list <- avg$avg_home_total_yards_against[avg$season == season_num & avg$week == week_num]
    ad_list <- avg$avg_away_total_yards_against[avg$season == season_num & avg$week == week_num]
    had_list <- append(hd_list, ad_list)
    rank_list <- rank(had_list, ties.method = 'max')
    hd_list <- rank_list[1:length(hd_list)]
    ad_list <- rank_list[(length(hd_list) + 1):length(rank_list)]
    avg$home_total_defense_rank[avg$season == season_num & avg$week == week_num] <- hd_list
    avg$away_total_defense_rank[avg$season == season_num & avg$week == week_num] <- ad_list
  }
}

# Check distributions to make sure above loop worked correctly
# Should be somewhat uniform between 1 and 32
#  ranks above 30 may appear slightly more often because every team is equal in week 1
hist(avg$away_total_offense_rank)
hist(avg$away_total_defense_rank)

# write the average dataframe to .csv file
#write_csv(avg, 'C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/data/team_averages.csv')

# ------------------------------------------------------------------------- #
## LOAD and CLEAN game by game dataset

# loads live updated game by game dataset
# gbg <- read_csv("http://www.habitatring.com/games.csv")

# writes uncleaned updated gbg dataset to csv
# write_csv(gbg, 'C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/data/gbg_uncleaned.csv')

# loads uncleaned game by game dataset as of 2/18/23 
gbg <- read_csv("C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/data/gbg_uncleaned.csv")

# gets col types for gbg
str(gbg)

# number of rows in games dataframe
nrow(gbg)

# gets number of null values per column
gbg %>% summarise_all(~ sum(is.na(.))) %>%
  select(1:10)

# removes nfl_detail_id, gsis, pfr, pff, and espn columns and odds columns
# if odds are desired need to filter out rows prior to 2007 season
# filters out 1999 season because there were no game times
# filters out 2022 super bowl game (!is.na(away_score)) because it hasn't happened yet
gbg <- gbg %>%
  select(-c('gsis', 'pfr', 'pff', 'espn','old_game_id', 'nfl_detail_id',
            'away_qb_id', 'home_qb_id', 'stadium_id', 'away_moneyline',
            'home_moneyline', 'away_spread_odds', 'home_spread_odds',
            'under_odds', 'over_odds')) %>%
  filter(season > 1999, 
         !is.na(away_score))

# updates missing referee for '2021_15_NE_IND' game_id
gbg$referee[is.na(gbg$referee)] <- 'Carl Cheffers'

# updates abbreviation for Los Angeles Rams
gbg$home_team[gbg$home_team == 'LA'] <- 'LAR'
gbg$away_team[gbg$away_team == 'LA'] <- 'LAR'

# updates missing surface for the following game_id
gbg$surface[gbg$game_id == '2022_07_KC_SF'] <- 'grass'
gbg$surface[gbg$game_id == '2022_04_MIN_NO'] <- 'grass'
gbg$surface[gbg$game_id == '2022_05_NYG_GB'] <- 'grass'
gbg$surface[gbg$game_id == '2022_08_DEN_JAX'] <- 'grass'
gbg$surface[gbg$game_id == '2022_10_SEA_TB'] <- 'grass'
gbg$surface[gbg$game_id == '2022_11_SF_ARI'] <- 'grass'

# updates to make a_turf equal to astroturf
gbg$surface[gbg$surface == 'a_turf'] <- 'astroturf'

# updates missing wind and temp values for stadiums with closed roofs or dome stadiums
gbg$wind[is.na(gbg$wind) & (gbg$roof == 'dome' | gbg$roof == 'closed')] <- 0
gbg$temp[is.na(gbg$temp) & (gbg$roof == 'dome' | gbg$roof == 'closed')] <- 72

# joins weather_na and gbg to update missing 206 temp and wind values in games
gbg <- gbg %>%
  left_join(weather_na, by = c("game_id")) %>%
  mutate(temp = ifelse(is.na(temp.x), temp.y, temp.x),
         wind = ifelse(is.na(wind.x), wind.y, wind.x)) %>%
  select(-c(temp.x, temp.y, wind.x, wind.y))

# joins nfl_elo and gbg 
gbg <- gbg %>% 
  full_join(nfl_elo, by=c('gameday' = 'date', 'season'='season',
                      'home_team' = 'team1', 'away_team' = 'team2',
                      'home_score' = 'score1', 'away_score' = 'score2')) %>%
  filter(!is.na(game_id))

# joins elo_fill_na and gbg to fill missing elo values in gbg
gbg <- gbg %>%
  left_join(elo_fill_na, by = c("game_id")) %>%
  mutate(elo1_pre = ifelse(is.na(elo1_pre.x), elo1_pre.y, elo1_pre.x),
         elo2_pre = ifelse(is.na(elo2_pre.x), elo2_pre.y, elo2_pre.x),
         elo_prob1 = ifelse(is.na(elo_prob1.x), elo_prob1.y, elo_prob1.x),
         elo_prob2 = ifelse(is.na(elo_prob2.x), elo_prob2.y, elo_prob2.x),
         qbelo1_pre = ifelse(is.na(qbelo1_pre.x), qbelo1_pre.y, qbelo1_pre.x),
         qbelo2_pre = ifelse(is.na(qbelo2_pre.x), qbelo2_pre.y, qbelo2_pre.x),
         qb1_value_pre = ifelse(is.na(qb1_value_pre.x), qb1_value_pre.y, qb1_value_pre.x),
         qb2_value_pre = ifelse(is.na(qb2_value_pre.x), qb2_value_pre.y, qb2_value_pre.x),
         qb1_adj = ifelse(is.na(qb1_adj.x), qb1_adj.y, qb1_adj.x),
         qb2_adj = ifelse(is.na(qb2_adj.x), qb2_adj.y, qb2_adj.x)) %>%
  select(-c(elo1_pre.x, elo1_pre.y, elo2_pre.x, elo2_pre.y, elo_prob1.x,
            elo_prob1.y, elo_prob2.x, elo_prob2.y,
            qbelo1_pre.x, qbelo1_pre.y, qbelo2_pre.x, qbelo2_pre.y,
            qb1_value_pre.x, qb1_value_pre.y, qb2_value_pre.x, qb2_value_pre.y,
            qb1_adj.x, qb1_adj.y, qb2_adj.x, qb2_adj.y)) 

# adds total_team_elo and total_qb_elo columns
# adds team_elo_diff and qb_elo_diff columns
gbg <- gbg %>%
  mutate(total_team_elo = qbelo1_pre + qbelo2_pre,
         total_qb_elo = qb1_adj + qb2_adj,
         team_elo_diff = abs(qbelo1_pre - qbelo2_pre),
         qb_elo_diff = abs(qb1_adj - qb2_adj))

# adds total_result and spread_result labels
gbg <- gbg %>%
  mutate(total_result = ifelse(total > total_line, 'Over', 'Under'),
         spread_result = ifelse((spread_line > 0 & result > spread_line) |
                          (spread_line < 0 & result < spread_line), 'Favorite Covered', 'Underdog Covered'))

# join gbg and avg together
avg <- avg %>%
  select(game_id, avg_home_total_yards, avg_away_total_yards,
         avg_home_total_yards_against, avg_away_total_yards_against,
         home_total_offense_rank, home_total_defense_rank,
         away_total_offense_rank, away_total_defense_rank)

gbg <- gbg %>%
  full_join(avg, by=c('game_id')) %>%
  select(-c('home_team.y', 'away_team.y', 'week.y')) %>%
  rename(week = week.x, 
         away_team = away_team.x,
         home_team = home_team.x)

# removes 2 games that don't match up
gbg <- gbg %>%
  filter(!is.na(avg_away_total_yards))

# write clean gbg to csv
write_csv(gbg, 'C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/data/gbg_cleaned.csv')

# gets number of null values per column
gbg %>% summarise_all(~ sum(is.na(.))) %>%
  select(c(50:53))

# ------------------------------------------------------------------------- #
## CLEANING PLOTS
gbg_uncl <- read_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/data/gbg_uncleaned.csv')

gbg_cl <- read_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/data/gbg_cleaned.csv')
gbg_cl$overtime <- as.factor(gbg_cl$overtime)
gbg_cl$div_game <- as.factor(gbg_cl$div_game)
gbg_cl$total_result <- as.factor(gbg_cl$total_result)
gbg_cl$spread_result <- as.factor(gbg_cl$spread_result)
gbg_cl$season <- as.character(gbg_cl$season)

str(gbg_cl)

# temp column before and after cleaning plots
temp_plt_uncl <- gbg_uncl %>% group_by(roof) %>%
  summarize(total_na = sum(is.na(temp)))

ggplot(data=temp_plt_uncl, aes(x=roof, y=total_na)) +
  geom_bar(stat='identity', color='darkred', fill='lightcoral') +
  xlab('Roof Type') + ylab('Total Missing') +
  ggtitle('Total Number of Missing Temperatures for NFL Games by Roof Type from 1999 through 2022') 

temp_plt_cl <- gbg_cl %>% group_by(roof) %>%
  summarize(total_na = sum(is.na(temp)))

ggplot(data=temp_plt_cl, aes(x=roof, y=total_na)) +
  geom_bar(stat='identity', color='darkred', fill='lightcoral') +
  xlab('Roof Type') + ylab('Total Missing') + ylim(0, 1000) +
  ggtitle('Total Number of Missing Temperatures for NFL Games by Roof Type from 2000 through 2022')

# wind column before and after cleaning plots
wind_plt_uncl <- gbg_uncl %>% group_by(roof) %>%
  summarize(total_na = sum(is.na(wind)))

ggplot(data=wind_plt_uncl, aes(x=roof, y=total_na)) +
  geom_bar(stat='identity', color='darkred', fill='lightcoral') +
  xlab('Roof Type') + ylab('Total Missing') +
  ggtitle('Total Number of Missing Wind Speeds for NFL Games by Roof Type from 1999 through 2022') 

wind_plt_cl <- gbg_cl %>% group_by(roof) %>%
  summarize(total_na = sum(is.na(wind)))

ggplot(data=wind_plt_cl, aes(x=roof, y=total_na)) +
  geom_bar(stat='identity', color='darkred', fill='lightcoral') +
  xlab('Roof Type') + ylab('Total Missing') + ylim(0, 1000) +
  ggtitle('Total Number of Missing Wind Speeds for NFL Games by Roof Type from 2000 through 2022')

# surface column before and after cleaning plots
surface_plt_uncl <- gbg_uncl %>% group_by(season) %>%
  summarize(total_na = sum(is.na(surface)))

ggplot(data=surface_plt_uncl, aes(x=season, y=total_na)) +
  geom_bar(stat='identity', color='darkred', fill='lightcoral') +
  xlab('Season') + ylab('Total Missing') +
  ggtitle('Total Number of Missing Field Surfaces per NFL Season from 1999 through 2022')

surface_plt_cl <- gbg_cl %>% group_by(season) %>%
  summarize(total_na = sum(is.na(surface)))

ggplot(data=surface_plt_cl, aes(x=season, y=total_na)) +
  geom_bar(stat='identity', color='darkred', fill='lightcoral') +
  xlab('Season') + ylab('Total Missing') + ylim(0, 300) +
  ggtitle('Total Number of Missing Field Surfaces per NFL Season from 2000 through 2022')

# referee column before and after cleaning plots
referee_plt_uncl <- gbg_uncl %>% group_by(season) %>%
  summarize(total_na = sum(is.na(referee)))

ggplot(data=referee_plt_uncl, aes(x=season, y=total_na)) +
  geom_bar(stat='identity', color='darkred', fill='lightcoral') +
  xlab('Season') + ylab('Total Missing') +
  ggtitle('Total Number of Missing Referees per NFL Season from 1999 through 2022')

referee_plt_cl <- gbg_cl %>% group_by(season) %>%
  summarize(total_na = sum(is.na(referee)))

ggplot(data=referee_plt_cl, aes(x=season, y=total_na)) +
  geom_bar(stat='identity', color='darkred', fill='lightcoral') +
  xlab('Season') + ylab('Total Missing') + ylim(0, 300) +
  ggtitle('Total Number of Missing Referees per NFL Season from 2000 through 2022')

# gametime column before and after cleaning plots
gametime_plt_uncl <- gbg_uncl %>% group_by(season) %>%
  summarize(total_na = sum(is.na(gametime)))

ggplot(data=gametime_plt_uncl, aes(x=season, y=total_na)) +
  geom_bar(stat='identity', color='darkred', fill='lightcoral') +
  xlab('Season') + ylab('Total Missing') +
  ggtitle('Total Number of Missing Gametimes per NFL Season from 1999 through 2022')

gametime_plt_cl <- gbg_cl %>% group_by(season) %>%
  summarize(total_na = sum(is.na(gametime)))

ggplot(data=gametime_plt_cl, aes(x=season, y=total_na)) +
  geom_bar(stat='identity', color='darkred', fill='lightcoral') +
  xlab('Season') + ylab('Total Missing') + ylim(0, 300) +
  ggtitle('Total Number of Missing Gametimes per NFL Season from 2000 through 2022')

# ------------------------------------------------------------------------ #
## ANOMALY PLOTS
p <- ggplot(data=gbg_cl)

# wind
p + geom_bar(aes(x=wind), color='darkblue', fill='lightblue') +
  xlab('Wind Speed (mph)') + ylab('Count') + 
  ggtitle('Wind Speed for Every NFL Game from the 2000 through 2022 Seasons')

# week
p + geom_bar(aes(x=week), color='darkblue', fill='lightblue') +
  xlab('Week') + ylab('Count') + 
  ggtitle('Number of Games Played by Week for each NFL Season from 2000 through 2022')

# season
games_by_season <- gbg_cl %>% group_by(season) %>%
  summarize(num_games = sum(is.na(game_id) == FALSE))

ggplot(data=games_by_season) +
  geom_bar(aes(x=season, y=num_games), color='darkblue', fill='lightblue', stat='identity') +
  xlab('Season') + ylab('Number of Games Played') + 
  ggtitle('Number of Games Played per NFL Season from 2000 through 2022')

# --------------------------------------------------------------- #
## EDA PLOTS

# Over/Under balance plot
#  gets frequencies of overs and unders
gbg_ov <- as.data.frame(table(gbg_cl$total_result))

ggplot(data=gbg_ov, aes(x=Var1, y=Freq)) + 
  geom_bar(color="darkblue", fill="lightblue", stat='identity') +
  xlab('') + ylab('Number of Games Played') + 
  ggtitle('Number of Times the NFL Game Total Went Over or Under from 2000 through 2022') +
  geom_text(aes(label=paste(as.character(round(Freq/nrow(gbg_cl)*100, digits=2)), '%')), vjust=-.5)

# Favorite/Underdog Covered balance plot
#  gets frequencies of who covered favorite or underdog
gbg_fav<- as.data.frame(table(gbg_cl$spread_result))

ggplot(data=gbg_fav, aes(x=Var1, y=Freq)) + 
  geom_bar(color="darkblue", fill="lightblue", stat='identity') +
  xlab('') + ylab('Number of Games Played') + 
  ggtitle('Number of Times the NFL Game was Covered by the Favorite from 2000 through 2022') +
  geom_text(aes(label=paste(as.character(round(Freq/nrow(gbg_cl)*100, digits=2)), '%')), vjust=-.5)

# Scatterplots wind vs total points
# spread
ggplot(data=gbg_cl, aes(x=wind, y=total)) +
  geom_point(aes(color=spread_result), position='jitter', alpha=0.5) +
  geom_smooth(method='lm') +
  xlab('Wind Speed (mph)') + ylab('Total Points Scored') + labs(color= 'Spread Result') +
  ggtitle('Total Points Scored vs Wind Speed for NFL Games from 2000 through 2022')

# total
ggplot(data=gbg_cl, aes(x=wind, y=total)) +
  geom_point(aes(color=total_result), position='jitter', alpha=0.5) +
  geom_smooth(method='lm') +
  xlab('Wind Speed (mph)') + ylab('Total Points Scored') + labs(color='Total Result') +
  ggtitle('Total Points Scored vs Wind Speed for NFL Games from 2000 through 2022')

# Scatterplots temp vs total points
# total
ggplot(data=gbg_cl, aes(x=temp, y=total)) +
  geom_point(aes(color=total_result), position='jitter', alpha=0.5) +
  geom_smooth(method='lm') +
  xlab('Temperature (F)') + ylab('Total Points Scored') + labs(color= 'Total Result') +
  ggtitle('Total Points Scored vs Temperature for NFL Games from 2000 through 2022')

# spread
ggplot(data=gbg_cl, aes(x=temp, y=total)) +
  geom_point(aes(color=spread_result), position='jitter', alpha=0.5) +
  geom_smooth(method='lm') +
  xlab('Temperature (F)') + ylab('Total Points Scored') + labs(color='Spread Result') +
  ggtitle('Total Points Scored vs Temperature for NFL Games from 2000 through 2022')

# Scatterplots total_points vs total_team_elo
# total
ggplot(data=gbg_cl, aes(x=total_team_elo, y=total)) +
  geom_point(aes(color=total_result), position='jitter', alpha=0.5) +
  geom_smooth(method='lm') +
  xlab('Combined Team ELO') + ylab('Total Points Scored') + labs(color= 'Total Result') +
  ggtitle('Total Points Scored vs Combined Team ELO Ratings for NFL Games from 2000 through 2022')

# spread
ggplot(data=gbg_cl, aes(x=total_team_elo, y=total)) +
  geom_point(aes(color=spread_result), position='jitter', alpha=0.5) +
  geom_smooth(method='lm') +
  xlab('Combined Team ELO') + ylab('Total Points Scored') + labs(color= 'Spread Result') +
  ggtitle('Total Points Scored vs Combined Team ELO Ratings for NFL Games from 2000 through 2022')

# Scatterplots total_points vs total_qb_elo
# total
ggplot(data=gbg_cl, aes(x=total_qb_elo, y=total)) +
  geom_point(aes(color=total_result), position='jitter', alpha=0.5) +
  geom_smooth(method='lm') +
  xlab('Combined QB ELO') + ylab('Total Points Scored') + labs(color= 'Total Result') +
  ggtitle('Total Points Scored vs Combined QB ELO Ratings for NFL Games from 2000 through 2022')

# spread
ggplot(data=gbg_cl, aes(x=total_qb_elo, y=total)) +
  geom_point(aes(color=spread_result), position='jitter', alpha=0.5) +
  geom_smooth(method='lm') +
  xlab('Combined QB ELO') + ylab('Total Points Scored') + labs(color= 'Spread Result') +
  ggtitle('Total Points Scored vs Combined QB ELO Ratings for NFL Games from 2000 through 2022')

# Scatterplots total_points vs team_elo_diff
# total
ggplot(data=gbg_cl, aes(x=team_elo_diff, y=total)) +
  geom_point(aes(color=total_result), position='jitter', alpha=0.5) +
  geom_smooth(method='lm') +
  xlab('Absolute Value Team ELO Diff') + ylab('Total Points Scored') + labs(color= 'Total Result') +
  ggtitle('Total Points Scored vs Team ELO Rating Difference for NFL Games from 2000 through 2022')

# spread
ggplot(data=gbg_cl, aes(x=team_elo_diff, y=total)) +
  geom_point(aes(color=spread_result), position='jitter', alpha=0.5) +
  geom_smooth(method='lm') +
  xlab('Absolute Value Team ELO Diff') + ylab('Total Points Scored') + labs(color= 'Spread Result') +
  ggtitle('Total Points Scored vs Team ELO Rating Difference for NFL Games from 2000 through 2022')

# Scatterplots total_points vs qb_elo_diff
# total
ggplot(data=gbg_cl, aes(x=qb_elo_diff, y=total)) +
  geom_point(aes(color=total_result), position='jitter', alpha=0.5) +
  geom_smooth(method='lm') +
  xlab('Absolute Value QB ELO Diff') + ylab('Total Points Scored') + labs(color= 'Total Result') +
  ggtitle('Total Points Scored vs QB ELO Rating Difference for NFL Games from 2000 through 2022')

# spread
ggplot(data=gbg_cl, aes(x=qb_elo_diff, y=total)) +
  geom_point(aes(color=spread_result), position='jitter', alpha=0.5) +
  geom_smooth(method='lm') +
  xlab('Absolute Value QB ELO Diff') + ylab('Total Points Scored') + labs(color= 'Spread Result') +
  ggtitle('Total Points Scored vs QB ELO Rating Difference for NFL Games from 2000 through 2022')

# Stacked barplot of gametype as percentages with total and spread as main plots
# total
ggplot(data=gbg_cl, aes(x=total_result, fill=game_type)) + 
  geom_bar(color="darkblue") +
  xlab('') + ylab('Number of Games Played') + labs(fill='Game Type') +
  ggtitle('Number of Times the NFL Game Total Went Over or Under from 2000 through 2022')
  
# spread
ggplot(data=gbg_cl, aes(x=spread_result, fill=game_type)) + 
  geom_bar(color="darkblue") +
  xlab('') + ylab('Number of Games Played') + labs(fill='Game Type') +
  ggtitle('Number of Times the NFL Game was Covered by the Favorite from 2000 through 2022')

# Stacked barplot of roof as percentages with total and spread as main plots
# total
ggplot(data=gbg_cl, aes(x=total_result, fill=roof)) + 
  geom_bar(color="darkblue") +
  xlab('') + ylab('Number of Games Played') + labs(fill='Roof Type') +
  ggtitle('Number of Times the NFL Game Total Went Over or Under from 2000 through 2022')

#spread
ggplot(data=gbg_cl, aes(x=spread_result, fill=roof)) + 
  geom_bar(color="darkblue") +
  xlab('') + ylab('Number of Games Played') + labs(fill='Roof Type') +
  ggtitle('Number of Times the NFL Game was Covered by the Favorite from 2000 through 2022')

# Stacked barplot of surface as percentages with total and spread as main plots
# total
ggplot(data=gbg_cl, aes(x=total_result, fill=surface)) + 
  geom_bar(color="darkblue") +
  xlab('') + ylab('Number of Games Played') + labs(fill='Surface Type') +
  ggtitle('Number of Times the NFL Game Total Went Over or Under from 2000 through 2022')

# spread
ggplot(data=gbg_cl, aes(x=spread_result, fill=surface)) + 
  geom_bar(color="darkblue") +
  xlab('') + ylab('Number of Games Played') + labs(fill='Surface Type') +
  ggtitle('Number of Times the NFL Game was Covered by the Favorite from 2000 through 2022')

# Histograms of important numeric columns
gbg_cl %>%
  select(c('away_rest', 'home_rest', 'away_score', 'home_score', 
         'total', 'result', 'total_line', 'spread_line', 
         'wind', 'temp', 'total_qb_elo', 'total_team_elo',
         'qb_elo_diff', 'team_elo_diff')) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~key, scales='free') +
  geom_histogram(bins=12, color='black', fill='#CFB87C', alpha=0.5) +
  theme_bw() + 
  ggtitle('Histograms of Possibly Important Numerical Columns from gbg Dataset')

# ---------------------------------------------------------------- #
sea_2000 <- gbg_cl %>%
  filter(season == '2000',
         week == 3)

ggplot(data=sea_2000, aes(x=home_total_defense_rank, y=home_total_offense_rank)) +
  geom_nfl_logos(aes(team_abbr=home_team), width=0.065, alpha=0.7) +
  labs(
    x='Home Team Defense Rank',
    y='Home Team Offense Rank',
    title = '2000 Week 3 Home Team Offense and Defense Ranks'
  )


  

Data Cleaning

‘gbg’ Dataset

This dataset contains NFL game-by-game statistics, information, and odds for every season from 1999 to present. This dataset came from leesharpe on GitHub, linked here. The original uncleaned dataset contained 45 columns and 6420 rows, each representing a specific NFL game.

Below are images of the uncleaned ‘gbg’ dataset (click to enlarge).

Below is a downloadable link to the .csv file of the uncleaned ‘gbg’ dataset.

‘gbg’ Dataset Cleaning Details

The gsis, pfr, pff, espn, old_game_id, nfl_detail_id, away_qb_id, home_qb_id, and stadium_id columns were removed because they are identifier columns that aren’t needed for this project. The away_moneyline, home_moneyline, away_spread_odds, home_spread_odds, over_odds, and under_odds columns were removed because this project focuses on the total and the spread, assuming the typical odds of -110; the odds columns also have no relevance to predicting the total or the spread.

The gametime column had 259 missing values (4.03%). The missing values were all from 1999, so that season was removed. The data frame now covers the 2000 through 2022 seasons. A quick numeric check of this claim is sketched below.
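A minimal sketch of that check, assuming the uncleaned data has been read in as gbg_uncl (as in the cleaning-plots section of ML_EDA.R):

# Sketch: count missing gametimes per season in the uncleaned data
gbg_uncl %>%
  group_by(season) %>%
  summarize(missing_gametime = sum(is.na(gametime))) %>%
  filter(missing_gametime > 0)
# expected output: a single row for the 1999 season with 259 missing values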

There are 259 missing values in the 1999 NFL season and 0 in all other seasons.
No missing values after removing the 1999 NFL season.

The away_score, home_score, result, total, and overtime columns each had 1 missing value (0.016%). These values were missing because they refer to the one game that had not occurred yet this season (the Super Bowl). That game was removed, which removed the missing values from those columns.

The surface column had 6 missing values (0.093%). All of the missing values were from games played internationally in the 2022 season. These missing values were filled in manually using data from Pro Football Reference. The surface column also had two names for astro turf, ‘a_turf’ and ‘astroturf’; all of the ‘a_turf’ values were updated to ‘astroturf’.

Before cleaning there are 6 missing values in the 2022 season.
No missing values after updating the missing surface values.

The temp and wind columns each had 1746 missing values (27.20%). 1540 of these missing values were from games where the roof was labeled ‘closed’ or ‘dome’. Since there is no wind in these stadium types, those missing wind values were set to 0, and since these stadiums are enclosed, those missing temp values were set to the typical indoor temperature of 72. This still left 206 rows where temp and wind were missing. These values were collected by hand into a .csv file (the ‘weather_na’ dataset below), which was then joined to the ‘gbg’ dataset to fill the remaining missing wind and temp values. An alternative way to write the fill step is sketched below.
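The same join-and-fill step can be written more compactly with dplyr::coalesce, which takes the first non-missing value; a minimal sketch, equivalent in spirit to the ifelse(is.na(...)) pattern used in ML_EDA.R:

# Sketch: fill missing temp/wind from weather_na using coalesce()
gbg <- gbg %>%
  left_join(weather_na, by = "game_id") %>%
  mutate(temp = coalesce(temp.x, temp.y),
         wind = coalesce(wind.x, wind.y)) %>%
  select(-c(temp.x, temp.y, wind.x, wind.y))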

Before Cleaning there are many missing values for temperature.
After Cleaning there are no missing values.
Before Cleaning there are many missing values for wind speed.
After cleaning there are no missing values.

The referee column had 2 missing values (0.031%). One of the missing values was due to the Super Bowl not being played yet; since the row containing the Super Bowl was removed, so was this missing value. The other missing value was from a game in week 15 of the 2021 season where the Patriots played the Colts. This value was filled in manually by looking up the referee for that game on Pro Football Reference.

Before cleaning there are two missing referees.
After updating the referees there are no missing values.

Two columns, total_result and spread_result, were added to the dataset as labels, so that every game is labeled both as going over or under the total and as being covered by the favorite or the underdog. A toy example of how the two labels are computed is sketched below.
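As a toy illustration (hypothetical numbers, not project data): a game with 47 points scored against a total line of 44.5 is an ‘Over’, and with the home team favored by 3 (spread_line = 3), a home win by 7 (result = 7) means the favorite covered:

# Toy example of the total_result and spread_result labels (hypothetical numbers)
toy <- tibble(total = 47, total_line = 44.5, result = 7, spread_line = 3)
toy %>%
  mutate(total_result = ifelse(total > total_line, 'Over', 'Under'),
         spread_result = ifelse((spread_line > 0 & result > spread_line) |
                                  (spread_line < 0 & result < spread_line),
                                'Favorite Covered', 'Underdog Covered'))
# total_result is 'Over' and spread_result is 'Favorite Covered'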

Looks at the balance of NFL games going over or under the total. It is very close to a 50/50 split.
Looks at the balance of NFL games where the favorite or underdog covered the spread. It is very close to a 50/50 split.
Data Cleaning Illusions
Appears to be an anomaly, but is actually expected. All closed and dome roofs had no wind, and it’s also possible for there to have been no wind at open and outdoor stadiums. That’s why zero wind is such a common occurrence.
Appears to be an anomaly, but is actually expected. The 2020 season added a 7th playoff seed, resulting in 2 more games being played. The 2021 season added an 18th week to the regular season, resulting in 16 more games being played. The 2022 season hadn’t concluded yet, which is why it has fewer games than 2021. The Houston Texans were added as an expansion team in 2002, which is why 2002 sees an uptick in games played.
Appears to be an anomaly, but is actually expected. Bye weeks happen, games are occasionally moved, and the playoffs have fewer games, which is why the number of games per week isn’t uniform.

The cleaned dataset contains 32 columns and 6161 games from 2000 through 2022. Below are definitions of all 32 columns.

  • game_id
    • Unique id assigned by the NFL to each game
  • season
    • The year of the NFL season
  • game_type
    • The type of NFL game (‘REG’ for a regular season game, ‘WC’ for a wild card playoff game, ‘DIV’ for a divisional playoff game, ‘CON’ for a conference championship playoff game, ‘SB’ for the Super Bowl)
  • week
    • The week of the NFL season
  • gameday
    • Date of the game
  • weekday
    • Day of the week the game is played on
  • gametime
    • Start time of the game
  • away_team
    • The abbreviation of the away team
  • away_score
    • The number of points scored by the away team
  • home_team
    • The abbreviation of the home team
  • home_score
    • The number of points scored by the home team
  • location
    • ‘Home’ when the home team is playing in their stadium or ‘Neutral’ when the game is played at a neutral stadium
  • result
    • The number of points the home team scored minus the number of points the away team scored
  • total
    • The sum of each teams points in the game
  • overtime
    • Whether the game went to overtime or not (1 for overtime and 0 for no overtime)
  • away_rest
    • The number of days the away team has had since playing their previous game
  • home_rest
    • The number of days the home team has had since playing their previous game
  • spread_line
    • The spread line for the game (positive number means the home team was favored by that many points and a negative number means the away team was favored by that many points)
  • total_line
    • The total line for the game
  • div_game
    • Whether the game is being played between two teams in the same division or not (1 is yes and 0 is no)
  • roof
    • The type of roof the stadium has
  • surface
    • The type of surface the stadium has
  • away_qb
    • The name of the away team’s quarterback
  • home_qb
    • The name of the home team’s quarterback
  • away_coach
    • The name of the away team’s coach
  • home_coach
    • The name of the home team’s coach
  • referee
    • The name of the head referee for the game
  • stadium
    • The name of the stadium the game is played in
  • temp
    • The temperature at or in the stadium in degrees Fahrenheit
  • wind
    • The wind speed at or in the stadium in mph
  • total_result
    • Whether the ‘total’ was over or under the ‘total_line’ (‘Over’ for over and ‘Under’ for under)
  • spread_result
    • Whether the favorite covered the spread (‘Favorite Covered’ for the favorite covering and ‘Underdog Covered’ for the favorite not covering)

Below are images of the cleaned ‘gbg’ dataset (click to enlarge).

Below is a downloadable link to the .csv file of the cleaned ‘gbg’ dataset.

‘weather_na’ Dataset

This dataset contains temperature and wind speed values for the specific NFL games that were missing this data in the ‘gbg’ dataset above. This dataset was manually created for the purpose of this project using information from https://nflweather.com.

Below is an image of the uncleaned ‘weather_na’ dataset (click to enlarge).

Below is a downloadable link to the uncleaned ‘weather_na’ .csv file.

‘weather_na’ Dataset Cleaning Details

Since this dataset was manually created, it required very little cleaning. The only step needed was selecting the columns necessary for the join: ‘game_id’, ‘temp’, and ‘wind’. The cleaned dataset was then joined with the ‘gbg’ dataset above to fix the missing values. This dataset has 8 columns and 207 rows, each representing a specific NFL game. Below are definitions of all 8 columns.

  • temp
    • The temperature at or in the stadium in degrees Fahrenheit
  • wind
    • The wind speed at or in the stadium in mph
  • roof
    • The type of roof the stadium has
  • game_id
    • Unique id assigned by the NFL to each game
  • season
    • The year of the NFL season
  • week
    • The week of the NFL season
  • away_team
    • The abbreviation of the away team
  • home_team
    • The abbreviation of the home team

Below are images of the cleaned ‘weather_na’ dataset (click to enlarge).

Below is a downloadable link to the cleaned ‘weather_na’ .csv file.

‘nfl_elo’ Dataset

This dataset was obtained from the FiveThirtyEight GitHub. It contains NFL ELO ratings for teams as well as quarterbacks for every NFL game from 1920 through 2022. The uncleaned dataset contained 33 columns and 17379 rows, each representing a specific NFL game. The purpose of this dataset was to join it with the ‘gbg’ dataset.

Below are images of the uncleaned ‘nfl_elo’ dataset (click to enlarge).

Below is a downloadable link to the .csv file of the uncleaned ‘nfl_elo’ dataset.

‘nfl_elo’ Dataset Cleaning Details

The first steps were removing unnecessary rows and columns. The ‘qbelo_prob1’, ‘qbelo_prob2’, ‘playoff’, ‘quality’, ‘qb1_game_value’, ‘qb2_game_value’, ‘qb1_value_post’, ‘qb2_value_post’, ‘qbelo1_post’, ‘qbelo2_post’, ‘elo1_post’, ‘elo2_post’, ‘neutral’, ‘importance’, ‘total_rating’, ‘qb1’, and ‘qb2’ columns were all removed as they provided no additional value to this project. The dataset was then filtered to remove all seasons prior to 2000, as well as the Super Bowl for the current season, to match up with the ‘gbg’ dataset. The final step prior to joining was altering some team abbreviations to match the ‘gbg’ dataset: prior to 2017, ‘LAC’ was changed to ‘SD’; prior to 2016, ‘LAR’ was changed to ‘STL’; after 2019, ‘OAK’ was changed to ‘LV’; and for every season, ‘WSH’ was changed to ‘WAS’ (a more compact way to write this recoding is sketched after the column list below). This dataset was then joined with the ‘gbg’ dataset. Four columns were added after the join: ‘total_team_elo’, ‘total_qb_elo’, ‘team_elo_diff’, and ‘qb_elo_diff’, because they might be relevant to predicting the total or the spread. The cleaned ‘nfl_elo’ dataset contains 6161 games from 2000 through 2022 and, with the four added columns, 20 columns in all. Below are definitions of all 20 columns.

  • date
    • Date the game was played
  • season
    • The year of the NFL season
  • team1
    • The abbreviation for the home team
  • team2
    • The abbreviation for the away team
  • elo1_pre
    • Home team’s ELO rating before the game
  • elo2_pre
    • Away team’s ELO rating before the game
  • elo_prob1
    • Home team’s probability of winning according to Elo ratings
  • elo_prob2
    • Away team’s probability of winning according to Elo ratings
  • qbelo1_pre
    • Home team’s quarterback-adjusted base rating before the game
  • qbelo2_pre
    • Away team’s quarterback-adjusted base rating before the game
  • qb1_value_pre
    • Home starting quarterbacks’ raw Elo value before the game
  • qb2_value_pre
    • Away starting quarterbacks’ raw Elo value before the game
  • qb1_adj
    • Home starting quarterbacks’ Elo adjustment for the game
  • qb2_adj
    • Away starting quarterbacks’ Elo adjustment for the game
  • score1
    • The number of points scored by the home team
  • score2
    • The number of points scored by the away team
  • total_team_elo
    • ‘qbelo1_pre’ + ‘qbelo2_pre’
  • total_qb_elo
    • ‘qb1_adj’ + ‘qb2_adj’
  • team_elo_diff
    • The absolute value of ‘qbelo1_pre’ – ‘qbelo2_pre’
  • qb_elo_diff
    • The absolute value of ‘qb1_adj’ – ‘qb2_adj’
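
The team abbreviation recoding mentioned above can also be expressed with a single dplyr::case_when helper rather than repeated base-R assignments; a minimal sketch (fix_abbr is a hypothetical helper name, not part of ML_EDA.R):

# Sketch: abbreviation recoding with case_when, behaviorally equivalent
#  to the base-R assignments in ML_EDA.R (fix_abbr is a hypothetical helper)
fix_abbr <- function(team, season) {
  case_when(
    team == 'WSH' ~ 'WAS',
    team == 'LAC' & season < 2017 ~ 'SD',
    team == 'LAR' & season < 2016 ~ 'STL',
    team == 'OAK' & season > 2019 ~ 'LV',
    TRUE ~ team
  )
}
nfl_elo <- nfl_elo %>%
  mutate(team1 = fix_abbr(team1, season),
         team2 = fix_abbr(team2, season))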

Below is an image of the cleaned ‘nfl_elo’ dataset (click to enlarge).

Below is a downloadable link to the .csv file of the cleaned ‘nfl_elo’ dataset.

‘elo_fill_na’ Dataset

This dataset was manually created using information from FiveThirtyEight to fill in missing values for ELO columns in the ‘gbg’ dataset. After joining the ‘gbg’ and ‘nfl_elo’ datasets, there were still 29 rows that had missing ELO data due to join mismatches. Below is a link to the .csv file of the ‘elo_fill_na’ dataset.

Below is an image of the ‘elo_fill_na’ dataset (click to enlarge).

‘elo_fill_na’ Dataset Cleaning Details

This dataset did not need any additional cleaning, as it was created with the sole purpose of replacing the missing values in those 29 rows. The dataset contains 11 columns and 29 rows, each representing a specific game. Below are definitions of all 11 columns.

  • game_id
    • Unique id assigned by the NFL to each game
  • elo1_pre
    • Home team’s ELO rating before the game
  • elo2_pre
    • Away team’s ELO rating before the game
  • elo_prob1
    • Home team’s probability of winning according to Elo ratings
  • elo_prob2
    • Away team’s probability of winning according to Elo ratings
  • qbelo1_pre
    • Home team’s quarterback-adjusted base rating before the game
  • qbelo2_pre
    • Away team’s quarterback-adjusted base rating before the game
  • qb1_value_pre
    • Home starting quarterbacks’ raw Elo value before the game
  • qb2_value_pre
    • Away starting quarterbacks’ raw Elo value before the game
  • qb1_adj
    • Home starting quarterbacks’ Elo adjustment for the game
  • qb2_adj
    • Away starting quarterbacks’ Elo adjustment for the game

‘news_headlines’ Dataset

This dataset was obtained using an API from https://newsapi.org. The dataset contains news articles labeled by topic. The six topics that were selected are ‘sports betting’, ‘sports’, ‘sports business’, ‘sports politics’, ‘politics’, and ‘business’. The dataset contains 5 columns and 600 rows (100 rows for each topic).

‘news_headlines’ Dataset Cleaning Details
  • Endpoint: “https://newsapi.org/v2/everything”
  • URLPost: https://newsapi.org/v2/everything?apiKey=your_api_key_here&q=topic
    • Your own API key is inserted in place of ‘your_api_key_here’
    • ‘topic’ is replaced by the selected news topic, so the six topics above are inserted here (an R sketch of the same request follows this list)
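
For reference, the same request can be made from R; a minimal sketch using the httr package (an assumption for illustration; the project itself makes this call with Python's requests, as shown in API_Data.py above):

# Sketch: one News API request from R via httr (httr assumed installed;
#  the project itself uses Python's requests for this)
library(httr)
resp <- GET('https://newsapi.org/v2/everything',
            query = list(apiKey = 'your_api_key_here', q = 'sports betting'))
articles <- content(resp, as = 'parsed')$articles
length(articles)  # number of articles returned for this topic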

Below are definitions of all 5 columns.

  • ‘LABEL’
    • Topic of the news article (one of the six topics above)
  • ‘Date’
    • Date the news article was published
  • ‘Source’
    • Publisher of the news article
  • ‘Title’
    • Title of the news article
  • ‘Headline’
    • Headline of the news article

Below is an image of the ‘news_headlines’ dataset (click to enlarge).

Below is a downloadable link to the .csv file of the ‘news_headlines’ dataset.

‘news_headlines_vectorized’ Dataset

This dataset contains the headlines of the news articles in a vectorized format for further text analysis.

Below is a downloadable link to the .csv file of the ‘news_headlines_vectorized’ dataset.

Further EDA

These plots show that total points scored and wind speed appear to be negatively correlated. When considering the spread result, wind appears to have no effect on whether the favorite or underdog covered, and the spread result shows no correlation with total points scored. When considering the total result, there might be a trend at higher wind speeds: below approximately 30 mph, wind doesn’t seem to have any effect on the total result, but when the wind speed is above approximately 30 mph the only result is the under. This might be something to watch for later in this project. It also appears that the under is much more likely when the total points scored is less than approximately 50.
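A minimal sketch of a numeric check on the high-wind observation, using the cleaned data loaded as gbg_cl:

# Sketch: tabulate the total result for games with wind above 30 mph
gbg_cl %>%
  filter(wind > 30) %>%
  count(total_result)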

These plots show that there is essentially no correlation between the temperature and total points scored. When looking at the spread result, there appears to be no correlation with temperature or total points scored. When looking at the total result, there also appears to be no correlation with temperature. As in the plots above, the total result and total points scored appear to be strongly correlated.

This plot appears to show that there is no correlation between the type of game and the total result.

This plot appears to show that there is no correlation between the type of game and the spread result.

This plot appears to show that when the stadium is a dome or has a closed roof, the over occurs a greater percentage of the time. This would need to be confirmed numerically, since it is only a visual impression; a sketch of that check follows.
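A minimal sketch of that numeric confirmation, computing the share of overs by roof type from the cleaned data:

# Sketch: percentage of games going over, by roof type
#  (the same pattern works for the surface comparison below)
gbg_cl %>%
  group_by(roof) %>%
  summarize(over_rate = mean(total_result == 'Over', na.rm = TRUE))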

This plot appears to show that there is no correlation between roof type and spread result.

This plot appears to show that astro turf, sport turf, and field turf lead to a higher percentage of overs. This would also need to be confirmed numerically, since it is only a visual impression (the same group_by check sketched above applies, with surface in place of roof).

This plot appears to show that there is no correlation between the field surface type and the spread result.

There appears to be a very slight positive correlation between combined team ELO and total points scored. There does not appear to be any correlation between combined team ELO and the total result or spread result. There does not appear to be any correlation between the spread result and total points scored. There appears to be a strong correlation between the total result and total points scored.

There appears to be a very slight positive correlation between combined quarterback ELO and total points scored. There does not appear to be any correlation between combined quarterback ELO and the total result or spread result. There does not appear to be any correlation between the spread result and total points scored. There appears to be a strong correlation between the total result and total points scored.

There appears to be no correlation between the absolute value of the difference between the two teams’ ELO ratings and total points scored. There does not appear to be any correlation between the absolute value of the difference between the two teams’ ELO ratings and the total result or spread result. There does not appear to be any correlation between the spread result and total points scored. There appears to be a strong correlation between the total result and total points scored.

There appears to be a very slight negative correlation between the absolute value of the difference between the two quarterbacks’ ELO ratings and total points scored. There does not appear to be any correlation between the absolute value of the difference between the two quarterbacks’ ELO ratings and the total result or spread result. There does not appear to be any correlation between the spread result and total points scored. There appears to be a strong correlation between the total result and total points scored.
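
These visual impressions about the ELO features can be cross-checked with simple Pearson correlations; a minimal sketch against the cleaned data:

# Sketch: Pearson correlations between total points and each ELO feature
gbg_cl %>%
  summarize(r_total_team_elo = cor(total, total_team_elo, use = 'complete.obs'),
            r_total_qb_elo = cor(total, total_qb_elo, use = 'complete.obs'),
            r_team_elo_diff = cor(total, team_elo_diff, use = 'complete.obs'),
            r_qb_elo_diff = cor(total, qb_elo_diff, use = 'complete.obs'))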

This plot shows distributions of numerical columns that might be important for predicting the spread or total result. This is just to get a sense of what the distributions of these variables look like in case it impacts later analysis. The ‘result’, ‘spread_line’, ‘total’, ‘total_line’, and ‘total_team_elo’ variables appear to be approximately normal. The ‘qb_elo_diff’, ‘team_elo_diff’, and ‘wind’ variables appear to be right-skewed. The ‘away_score’ and ‘home_score’ variables appear to be slightly right-skewed. The ‘temp’ and ‘total_qb_elo’ variables appear to be left-skewed. The ‘away_rest’ and ‘home_rest’ variables are unimodal with a strong peak at 7 days.
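
A rough numeric cross-check of these skew descriptions is to compare each variable’s mean and median; a minimal sketch:

# Sketch: mean vs. median as a rough skewness check
#  (mean > median suggests right skew, mean < median suggests left skew)
gbg_cl %>%
  summarize(across(c(wind, temp, qb_elo_diff, team_elo_diff),
                   list(mean = ~mean(., na.rm = TRUE),
                        median = ~median(., na.rm = TRUE))))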