Articles Dataset
Hierarchical (hclust())
The above dendrogram visualizes the results of clustering the 60 articles hierarchically, using cosine distance (one minus cosine similarity) as the distance metric. Based on the dendrogram, hierarchical clustering appears to have worked fairly well: only a few articles were clustered incorrectly. There are three main clusters, which is what we expect. The first two clusters appear to be sports betting and business, while the third appears to be politics. The sports betting and business clusters are both homogeneous, which is desired. The politics cluster is less clean: it contains mostly politics articles, but some sports betting and business articles are mixed in as well. Overall, hierarchical clustering performed well on the articles and indicates that the value of ‘K’ should be 3, which matches the number of labels in the articles dataset.
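The clustering above was produced with R’s hclust(). As a rough sketch of the same procedure in Python, the following builds a cosine-distance hierarchy and cuts it into three clusters. The random matrix standing in for the 60-article document-term matrix and the choice of average linkage are assumptions; the original matrix and linkage method are not shown in the report.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Hypothetical stand-in for the articles' document-term matrix
# (60 documents x 6359 terms in the real dataset; a small random
# matrix here purely for illustration).
rng = np.random.default_rng(0)
X = rng.random((60, 20))

# Cosine distance = 1 - cosine similarity; pdist returns the condensed
# pairwise distance vector that linkage() expects.
dist = pdist(X, metric="cosine")

# Average linkage is a common pairing with cosine distance; the report's
# hclust() call may have used a different method.
Z = linkage(dist, method="average")

# Cut the tree into 3 clusters, matching the three article labels.
clusters = fcluster(Z, t=3, criterion="maxclust")
print(len(clusters), sorted(set(clusters)))
```

In practice `X` would be the TF-IDF (or raw count) matrix for the articles, and the resulting `clusters` vector could be compared against the true topic labels.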
K-Means
The above plots show clustering of the articles using K-Means with ‘K’ equal to 2, 3, and 4. As the plots show, K-Means clustered the articles poorly. This is not entirely surprising: the articles dataset is high dimensional, with 6359 dimensions, and K-Means relies on Euclidean distance as its similarity measure. Euclidean distance loses discriminating power in high-dimensional spaces, so K-Means can be expected to struggle on data like this.
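A minimal sketch of this experiment in Python (scikit-learn rather than the report’s original tooling) is below. The random matrix is a hypothetical stand-in for the real 60 × 6359 document-term matrix; on data like this, the within-cluster Euclidean objective that K-Means minimizes carries little signal, because pairwise Euclidean distances between high-dimensional points concentrate around the same value.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the 60 x 6359 document-term matrix.
rng = np.random.default_rng(0)
X = rng.random((60, 6359))

# Fit K-Means for K = 2, 3, 4 and report how many points land in each
# cluster. K-Means minimizes within-cluster squared Euclidean distance,
# the measure that degrades in high dimensions.
for k in (2, 3, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, np.bincount(km.labels_))
```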
The above silhouette plot shows the optimal value of ‘K’. For the articles dataset, the optimal value of ‘K’ for K-Means is 2. This conflicts both with the dendrogram and with the actual labels of the articles dataset. Since the dataset has 3 distinct labels and the dendrogram supports ‘K’ equal to 3, this further confirms that K-Means does a poor job of clustering the articles dataset.
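The silhouette analysis behind a plot like this can be sketched as follows. This is a hedged illustration on synthetic data, not the report’s actual computation: the two-blob dataset is constructed so that K = 2 scores highest, mirroring the result reported for the articles.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data with two well-separated blobs, so the silhouette
# analysis should favor K = 2 (an assumption made for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 5)),
               rng.normal(3, 0.3, (30, 5))])

# Average silhouette score for each candidate K; the best K is the one
# with the highest score.
scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

Plotting `scores` against K gives the silhouette curve; the peak is the optimal ‘K’.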
Games Dataset
Hierarchical (hclust())
The above dendrogram visualizes the results of clustering the 6160 NFL games hierarchically, again using cosine distance. Based on the dendrogram, hierarchical clustering does not appear to have worked well here. Since the games dataset has two labels that are approximately evenly distributed, the hope was to see two even clusters. Instead, the dendrogram shows more than two clusters, which is not ideal; it indicates that the value of ‘K’ should probably be 4 or 5.
K-Means
The above plots show clustering of the games using K-Means with ‘K’ equal to 2, 3, and 4. As the plots show, K-Means was actually somewhat successful on the games: the clusters are approximately equal in size for all three values of ‘K’. The silhouette plot below reveals which value of ‘K’ is actually best; the hope is that it is 2, since the games dataset has two labels (Over and Under).
The above silhouette plot shows the optimal value of ‘K’. For the games dataset, the optimal value of ‘K’ for K-Means is 2. This conflicts with the dendrogram, but it is the value of ‘K’ desired based on the number of labels in the dataset. Based on this, K-Means appears to do a better job of clustering the games dataset than hierarchical clustering does.
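One way to make the "better job" comparison concrete is to score each clustering against the true Over/Under labels with the adjusted Rand index, which is invariant to how cluster ids are numbered. The tiny label vectors below are hypothetical stand-ins (the real dataset has 6160 games), constructed so one clustering matches the labels and the other is fragmented:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Hypothetical true Over/Under labels and two candidate cluster
# assignments for a handful of games, purely for illustration.
true_labels = np.array([0, 0, 0, 1, 1, 1, 0, 1])
kmeans_clusters = np.array([1, 1, 1, 0, 0, 0, 1, 0])  # same grouping, ids swapped
hier_clusters = np.array([0, 1, 2, 3, 0, 1, 2, 3])    # fragmented into 4 groups

# ARI = 1.0 means perfect agreement with the true labels (up to
# relabeling); values near 0 mean chance-level agreement.
print(adjusted_rand_score(true_labels, kmeans_clusters))
print(adjusted_rand_score(true_labels, hier_clusters))
```

Run on the actual games dataset, a higher ARI for the K-Means assignment than for the hierarchical one would quantify the conclusion drawn above.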