Sports Betting ARM Data Prep

Association Rule Mining requires data to be in transaction format. Often labels are removed also as they are not necessary. However, for this project the ‘total_result’ labels (either ‘Over’ or ‘Under’) were kept as the goal is to determine if there is an association between the ‘total_result’ and other game pre-game statistics/information.

Games Dataset

The .R file containing the code to prep the games dataset for Association Rule Mining, the .csv file of the games dataset prior to being prepped, and the games dataset after being prepped can be downloaded below.

ARM_gbg_data_prep.R Download

gbg_cleaned.csv Download

gbg_prepped_for_ARM.csv Download

The first step was to select the columns from the games dataset to use for the Association Rule Mining model. The columns selected were: ‘total_result’, ‘week_day’, ‘location’, ‘game_type’, ‘roof’, ‘surface’, ‘referee’, ‘total_line’, ‘div_game’, ‘temp’, ‘wind’, ‘total_team_elo’, ‘total_qb_elo’, ‘home_total_offense_rank’, ‘home_total_defense_rank’, ‘away_total_offense_rank’, and ‘away_total_defense_rank’.

The second step was to alter the ‘div_game’ column. Currently the ‘div_game’ column contains the value 1 for a division game or the value 0 for a non-division game. These were updated so that the value of 1 was replaced with the text ‘div_game’ and the value of 0 was replaced with the text ‘non_div_game’. This makes it easier to interpret the results of the Association Rule Mining model for any rules containing values from the ‘div_game’ column.

The third step was to create new columns ‘total_team_elo_rank’ and ‘total_qb_elo_rank’. These columns contain the rank of the ‘total_team_elo’ and ‘total_qb_elo’ for each game in the current week. The higher the rank the better.

The fourth step was to create some new columns that could be used for Association Rule Mining. This was mainly done through discretization of numeric variables. The ‘total_line’, ‘temp’, ‘wind’, ‘home_total_offense_rank’, ‘home_total_defense_rank’, ‘away_total_offense_rank’, and ‘away_total_defense_rank’, ‘total_team_elo_rank’, and ‘total_qb_elo_rank’ were all binned. This created categories that could be tested in the Association Rule Mining model to see if they had association with the total result of over or under occurring. The newly created columns were named ‘total_line_bin’, ‘temp_bin’, ‘wind_bin’, ‘total_team_elo_bin’, ‘total_qb_elo_bin’, ‘home_total_offense_rank_bin’, ‘home_total_defense_rank_bin’, ‘away_total_offense_rank_bin’, and ‘away_total_defense_rank_bin’.

The fifth step was to create a new column ‘total_qb_elo_pos’, which contains the values ‘negative_qb_elo’ when the ‘total_qb_elo’ is negative and ‘positive_qb_elo’ when the ‘total_qb_elo’ is positive. This column was created to see if there was an association between the sign of the combined ELO ratings of the quarterbacks in the game and over or under occurring.

The sixth step was to remove any unnecessary columns that were only used to create or edit existing columns. So, this involved removing ‘season’, ‘week’, ‘total_line’, ‘temp’, ‘wind’, ‘total_team_elo’, ‘total_qb_elo’, ‘home_total_offense_rank’, ‘home_total_defense_rank’, ‘away_total_offense_rank’, ‘away_total_defense_rank’, ‘total_team_elo_rank’, and ‘total_qb_elo_rank’.

The final step was to write the prepped data frame to a .csv file. Then the first row from the .csv file needed to be removed because it contained column names.

The code for prepping the games dataset for Association Rule Mining in R is found below.

Copy Code

library(dplyr)

# Read in games dataset
gbg_cl <- read.csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/data/gbg_cleaned.csv')

# Select relevant columns for Association Rule Mining
gbg_cl <- gbg_cl %>% select(season, week, total_result,weekday, location, game_type, roof, surface, 
                            referee, total_line, div_game, temp, wind, 
                            total_team_elo, total_qb_elo, home_total_offense_rank,
                            home_total_defense_rank, away_total_offense_rank,
                            away_total_defense_rank)

# ---------------------------------------------------------------------------- #
# Ranks the total_team_elo and total_qb_elo of games based on where they stand the current week
# Ranks are 1-16 where 16 is the best and 1 is the worst
seasons <- unique(gbg_cl$season)

gbg_cl$total_team_elo_rank <- 0
gbg_cl$total_qb_elo_rank <- 0

for (season_num in seasons){
  weeks <- unique(gbg_cl$week[gbg_cl$season == season_num])
  
  for (week_num in weeks) {
    # total_team_elo ranks
    team_list <- gbg_cl$total_team_elo[gbg_cl$season == season_num & gbg_cl$week == week_num]
    rank_list <- rank(team_list, ties.method = 'max')
    gbg_cl$total_team_elo_rank[gbg_cl$season == season_num & gbg_cl$week == week_num] <- rank_list
    
    
    # total_qb_elo_ranks
    qb_list <- gbg_cl$total_qb_elo[gbg_cl$season == season_num & gbg_cl$week == week_num]
    rank_list <- rank(qb_list, ties.method = 'max')
    gbg_cl$total_qb_elo_rank[gbg_cl$season == season_num & gbg_cl$week == week_num] <- rank_list
  }
}

# ---------------------------------------------------------------------------- #
# BINNING

# Bins the necessary columns
gbg_cl <- gbg_cl %>%
  mutate(div_game = ifelse(div_game == 1, 'div_game', 'non_div_game'),
         total_line_bin = cut(total_line, labels=c('Under 35', '35-39', '40-50',
                                                   '51-55', 'Over 55'), 
                              breaks=c(0, 34, 39, 50, 55, 100)),
         temp_bin = cut(temp, labels=c('Below 0', '0-10', '11-20', '21-32',
                                       '33-45', '46-60', '61-72', '73-84', 'Over 84'),
                        breaks=c(-50, -1, 10, 20, 32, 45, 60, 72, 84, 200)),
         wind_bin = cut(wind, labels=c('No Wind', 'Light Wind', 'Moderate Wind',
                                       'Strong Wind', 'Gale', 'Strong Gale'), 
                        breaks=c(-50, 0, 12, 24, 31, 46, 150)),
         total_team_elo_bin = cut(total_team_elo_rank, labels=c('1-5', '6-10', 
                                                           '11+'),
                                  breaks=c(0, 5, 10, 20)),
         total_qb_elo_bin = cut(total_qb_elo_rank, labels=c('1-5', '6-10',
                                                             '11+'),
                                  breaks = c(0, 5, 10 , 20)),
         total_qb_elo_pos = ifelse(total_qb_elo < 0, 'negative_qb_elo', 
                                   'positive_qb_elo'),
         home_total_offense_rank_bin = cut(home_total_offense_rank, labels=c('1-5', '6-10',
                                                                         '11-15', '16-20',
                                                                         '21-25', '25+'),
                                       breaks=c(0, 5, 10, 15, 20, 25, 35)),
         away_total_offense_rank_bin = cut(away_total_offense_rank, labels=c('1-5', '6-10',
                                                                         '11-15', '16-20',
                                                                         '21-25', '25+'),
                                       breaks=c(0, 5, 10, 15, 20, 25, 35)),
         home_total_defense_rank_bin = cut(home_total_defense_rank, labels=c('1-5', '6-10',
                                                                         '11-15', '16-20',
                                                                         '21-25', '25+'),
                                       breaks=c(0, 5, 10, 15, 20, 25, 35)),
         away_total_defense_rank_bin = cut(away_total_defense_rank, labels=c('1-5', '6-10',
                                                                         '11-15', '16-20',
                                                                         '21-25', '25+'),
                                       breaks=c(0, 5, 10, 15, 20, 25, 35)))


# For the binned columns pastes on words to indicate what column the rank is from
# There will be no columns names when ARM is performed so this is necessary
gbg_cl$total_line_bin <- paste('Total of ', gbg_cl$total_line_bin)
gbg_cl$temp_bin <- paste(gbg_cl$temp_bin, ' degrees')
gbg_cl$total_team_elo_bin <- paste('Total Team ELO Rank of ', gbg_cl$total_team_elo_bin)
gbg_cl$total_qb_elo_bin <- paste('Total QB ELO Rank of ', gbg_cl$total_qb_elo_bin)
gbg_cl$home_total_offense_rank_bin <- paste('Total Home Offense Rank of ', gbg_cl$home_total_offense_rank_bin)
gbg_cl$away_total_offense_rank_bin <- paste('Total Away Offense Rank of ', gbg_cl$away_total_offense_rank_bin)
gbg_cl$home_total_defense_rank_bin <- paste('Total Home Defense Rank of ', gbg_cl$home_total_defense_rank_bin)
gbg_cl$away_total_defense_rank_bin <- paste('Total Away Defense Rank of ', gbg_cl$away_total_defense_rank_bin)

# --------------------------------------------------------------------------- #
# Remove Unneeded columns
gbg_cl <- gbg_cl %>% select(-c('season', 'week', 'total_line', 'temp', 'wind',
                               'total_team_elo', 'total_qb_elo', 
                               'home_total_offense_rank', 'away_total_offense_rank',
                               'home_total_defense_rank', 'away_total_defense_rank',
                               'total_team_elo_rank', 'total_qb_elo_rank')) 


# Write prepped data to .csv file
# Doesn't include row names
# After writing, remove the first row from the .csv file to get rid of column names
write.csv(gbg_cl, 'C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/ARM/prepped_data/gbg_prepped_for_ARM.csv',
          row.names=FALSE)