Clustering is an unsupervised machine learning method, which means it does not use labels. The purpose of clustering is to discover groups in the data. If the dataset already has labels, the groups are already defined and there is essentially no reason to perform clustering. The exception is using clustering to check how well the discovered groups agree with the existing labels. Clustering also requires a distance metric to determine how similar vectors (rows) are to each other, and calculating distance requires numeric data. So, to prep for clustering, any non-numeric columns and label columns need to be removed.
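To make the role of the distance metric concrete, here is a minimal Python sketch of two common choices, Euclidean and cosine distance. The three vectors are hypothetical word-count rows made up for illustration; they are not taken from the project data.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; depends on direction, not on vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical rows: word counts for three short documents
doc_a = [2, 0, 1]
doc_b = [4, 0, 2]  # same word proportions as doc_a, but twice as long
doc_c = [0, 3, 0]  # shares no words with doc_a

print(euclidean_distance(doc_a, doc_b))  # nonzero: the counts differ
print(cosine_distance(doc_a, doc_b))     # about 0: same direction
print(cosine_distance(doc_a, doc_c))     # 1: no shared words
```

Neither function can be evaluated on text or category values, which is why non-numeric columns have to be dropped or vectorized first.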
Data Prep of Text in R
The articles obtained using the News API in .txt format need to be prepped in order to cluster them. Twenty articles were randomly selected from each of the sports betting, politics, and business topics and placed in their own folder. This folder was read into R and stored in a Corpus. General text processing steps were then applied to the Corpus: all words were converted to lowercase, and all punctuation, stop words, and numbers were removed. The Corpus was then converted into a Document Term Matrix and subsequently into a data frame. Using a Document Term Matrix allows clustering by document (article), because each row is a document. If a Term Document Matrix were used instead, clustering would occur by word. Once the Document Term Matrix has been converted to a data frame, the articles are ready for clustering.
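The same idea can be sketched in a few lines of plain Python to show what the cleaning steps and the two matrix orientations look like. This is an illustrative sketch only, not the project's R code: the two sentences and the short stop-word list are made up, and punctuation is replaced with spaces here, which differs slightly from how the tm package removes it.

```python
import re
from collections import Counter

# Hypothetical stop-word list; real lists are much longer
STOP_WORDS = {"the", "a", "and", "on", "in", "of", "to", "is"}

def clean(text):
    """Lowercase, strip punctuation and digits, then drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only letters and whitespace
    return [w for w in text.split() if w not in STOP_WORDS]

def document_term_matrix(docs):
    """Return (vocabulary, rows) with one row of term counts per document."""
    counts = [Counter(clean(text)) for text in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[term] for term in vocab] for c in counts]
    return vocab, rows

# Two made-up documents standing in for the .txt articles
docs = [
    "The Senate votes on the budget in 2023.",
    "Odds shift as the team wins, and wins again!",
]

vocab, dtm = document_term_matrix(docs)

# Transposing gives the Term Document Matrix: one row per word
tdm = [list(col) for col in zip(*dtm)]

print(vocab)
print(dtm)  # rows are documents, so distances compare articles
print(tdm)  # rows are words, so distances compare terms
```

Clustering operates on rows, so the choice between the two orientations decides whether articles or words end up grouped.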
Below is an image of the dataset after cleaning and vectorizing the .txt articles.
The file below contains the R code for the prep steps listed above (click the Download button for the .R file of the code, or click the file name for a .txt file of the code). The final product is a .csv file that is ready to use for clustering. The code works with any folder containing .txt documents and is specific to prepping .txt documents to be clustered by document. To cluster by word, change the DocumentTermMatrix() function to TermDocumentMatrix(). Text data that starts in .csv format and needs to be cleaned requires different code.
Below is the R code used to clean the .txt articles for clustering.
## The following code is courtesy of
## Professor Ami Gates, Dept. Applied Math, Data Science, University of Colorado
## and has been adapted to fit this project
library(stats)
# clustering libraries
library(NbClust)
library(cluster)
library(mclust)
library(amap) ## for using Kmeans (notice the cap K)
library(factoextra) ## for cluster vis, silhouette, etc.
library(purrr)
library(stylo) ## for dist.cosine
library(philentropy) ## for distance() which offers 46 metrics
library(SnowballC)
library(caTools)
library(dplyr)
library(textstem)
library(stringr)
library(wordcloud)
library(tm) ## to read in corpus (text data)
# ---------------------------------------------------------------------------- #
## LOAD .txt documents from the given folder (corpus) containing the .txt documents
ArticlesCorpus <- Corpus(DirSource("sp_po_bu_articles_20"))
(getTransformations()) ## These work with library tm
# gets num of documents in the folder
(ndocs<-length(ArticlesCorpus))
# ---------------------------------------------------------------------------- #
## TEXT DATA CLEANING for multiple .txt documents
## Do some clean-up.............
# Convert all words to lowercase
ArticlesCorpus <- tm_map(ArticlesCorpus, content_transformer(tolower))
# Remove any punctuation
ArticlesCorpus <- tm_map(ArticlesCorpus, removePunctuation)
# Create a list of stop words to remove
# ("i" is lowercase because the corpus was already lowercased above)
MyStopWords <- c("and", "like", "very", "can", "i", "also", "lot", "say", "get",
"said", "will")
# Remove stop words from the above list
ArticlesCorpus <- tm_map(ArticlesCorpus, removeWords, MyStopWords)
##-------------------------------------------------------------
## Convert to Document Term Matrix
## If clustering by Word is desired then change DocumentTermMatrix() to TermDocumentMatrix()
## DOCUMENT Term Matrix (Docs are rows)
# Removes normal stopwords as well
# Only keeps word lengths between 4 and 10
# Removes punctuation
# Removes numbers
# Converts all words to lowercase
ArticlesCorpus_DTM <- DocumentTermMatrix(ArticlesCorpus,
control = list(
stopwords = TRUE, ## remove normal stopwords
wordLengths=c(4, 10),
removePunctuation = TRUE,
removeNumbers = TRUE,
tolower=TRUE
#stemming = TRUE,
))
inspect(ArticlesCorpus_DTM)
### Create a DF as well................
Articles_DF_DT <- as.data.frame(as.matrix(ArticlesCorpus_DTM))
## Create .csv file of cleaned and prepped articles
# Need to use write.csv instead of write_csv to keep row names
write.csv(Articles_DF_DT, 'C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/Clustering/prepped_data/articles_prepped_for_cluster.csv')
## Create an articles matrix for a word cloud
My_articles_m <- (as.matrix(Articles_DF_DT))
nrow(My_articles_m)
## WORD CLOUD
# Creates a word cloud for the articles
# Column sums of the document-term matrix give total counts per word
word.freq <- sort(colSums(My_articles_m), decreasing = TRUE)
wordcloud(words = names(word.freq), freq = word.freq*2, min.freq = 2,
random.order = FALSE)
Data Prep of Record Data in Python
The cleaned game-by-game dataset, ‘gbg_cleaned’, also needs to be prepped for clustering. Prepping non-text data in record format is much simpler: the only requirement is that the non-numeric columns be removed prior to clustering. For the ‘gbg_cleaned’ dataset this involves removing the ‘game_id’, ‘season’, ‘game_type’, ‘week’, ‘gameday’, ‘weekday’, ‘game_time’, ‘away_team’, ‘home_team’, ‘location’, ‘overtime’, ‘spread_line’, ‘div_game’, ‘roof’, ‘surface’, ‘away_qb’, ‘home_qb’, ‘away_coach’, ‘home_coach’, ‘referee’, ‘stadium’, and ‘spread_result’ columns. The numeric columns ‘away_score’, ‘home_score’, ‘result’, and ‘total’ were also removed because those values cannot be known before a game, so they wouldn’t help in prediction. The ‘total_result’ column was kept even though it is a label because it will be used to check the accuracy of the clustering. Ensure the ‘total_result’ column is removed before actual clustering takes place.
Below are images of the ‘gbg_cleaned’ dataset before prepping for clustering and the ‘gbg_prepped_for_cluster’ dataset after cluster preparation is complete.
Below is the Python code used to clean the ‘gbg_cleaned’ dataset for clustering. The .py file along with the gbg_cleaned.csv and gbg_prepped_for_cluster.csv (final product after ‘gbg_cleaned’ is prepped for clustering) can also be downloaded.
# -*- coding: utf-8 -*-
"""
Created on Mon Feb 20 10:54:49 2023
@author: casey
"""
# import necessary libraries
import pandas as pd
# read in game dataset
gbg = pd.read_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/data/gbg_cleaned.csv')
# need to only select numeric columns
# can't use labels or categorical or text for clustering
# will keep the 'total_result' column (contains labels) for now
# keep to check accuracy of clustering (make sure to remove the 'total_result' before clustering)
gbg_cl = gbg[['total_result','temp', 'wind','total_line', 'away_rest', 'home_rest', 'avg_away_total_yards', 'avg_home_total_yards',
'avg_away_total_yards_against', 'avg_home_total_yards_against',
'away_total_offense_rank', 'home_total_offense_rank',
'away_total_defense_rank', 'home_total_defense_rank',
'elo1_pre', 'elo2_pre', 'elo_prob1', 'elo_prob2', 'qbelo1_pre',
'qbelo2_pre', 'qb1_value_pre', 'qb2_value_pre', 'qb1_adj','qb2_adj',
'total_team_elo', 'total_qb_elo', 'team_elo_diff', 'qb_elo_diff']]
gbg_cl.to_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Machine_Learning/Clustering/prepped_data/gbg_prepped_for_cluster.csv',
index=False)
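When the prepped file is later loaded for clustering, the ‘total_result’ label has to be split off first. The sketch below shows one way to do that with pandas. It uses a small hypothetical stand-in for the prepped data rather than the real file, and the "Over"/"Under" label values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical three-row stand-in for gbg_prepped_for_cluster.csv;
# the label values "Over"/"Under" are assumed, not taken from the dataset
prepped = pd.DataFrame({
    "total_result": ["Over", "Under", "Over"],
    "temp": [55, 70, 41],
    "wind": [8, 3, 12],
    "total_line": [44.5, 47.0, 41.5],
})

# Split the label off so it cannot influence the distance calculations
labels = prepped["total_result"]
features = prepped.drop(columns=["total_result"])

# Guard: every remaining column must be numeric before computing distances
non_numeric = features.select_dtypes(exclude="number").columns.tolist()
assert not non_numeric, f"non-numeric columns remain: {non_numeric}"

print(features.dtypes)
```

Keeping `labels` as a separate Series makes it easy to compare cluster assignments against it afterwards without ever passing it to the clustering step.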