Clustering is an unsupervised machine learning method. This means that any labels on the data are not used during clustering; the distance-based methods used here also require numeric (non-categorical) data. The main use of clustering is to discover whether the data can be grouped into categories or classes. If categories are identified, they can then be used to classify or predict the categories of other vectors (rows) of data. Clustering can also be applied to data that is already labeled, in order to confirm that the labels are correct (the labels must be removed and set aside prior to clustering). Clustering can also serve as a form of data reduction: if groups are discovered, each vector can be summarized by its cluster, which produces labels for the data. There are three main types of clustering: partitional, hierarchical, and density-based. Since this project focuses on partitional and hierarchical clustering, more detail on those two types is given below.
Partitional Clustering
In general, partitional clustering divides data vectors into non-overlapping groups. This means that no vector can be a member of more than one cluster, and every cluster must contain at least one vector. The most common type of partitional clustering is K-Means. Other types of partitional clustering include PAM (Partitioning Around Medoids, also called K-Medoids) and CLARA (Clustering Large Applications).
K-Means
K-Means is a partitional clustering algorithm that clusters vectors into ‘K’ groups, where ‘K’ is the number of clusters desired. The general goal of K-Means is to minimize the intracluster distances and maximize the intercluster distances. In other words, the distances between vectors within a cluster should be as small as possible, while the distances between the centers of different clusters should be as large as possible. Each cluster is defined by its centroid, which is the center of the cluster and is equal to the mean of all vectors assigned to that cluster. Below are the general steps of K-Means clustering.
- Choose the value of ‘K’ (number of clusters)
- ‘K’ centroids are chosen to be the centers of the clusters. The centroids are usually chosen by the algorithm at random from the data, but they can be specified manually if desired.
- Each vector in the dataset is assigned to its nearest centroid based on how similar it is to the centroid. Similarity is determined by a chosen distance metric (distance metrics are discussed in more detail below).
- After each vector is assigned to a centroid, the centroid of each cluster is recalculated by taking the mean of all vectors in the cluster.
- Each vector is again assigned to its nearest (newly calculated) centroid. This reassigns any vector that is now closer to a different cluster than the one it was originally in.
- The recalculation and reassignment steps are repeated until the cluster assignments stop changing or a maximum number of iterations is reached. At that point the final clusters have been created. A short code sketch of these steps follows this list.
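As a rough illustration of these steps, the sketch below runs K-Means with scikit-learn on a small synthetic dataset. The data, the choice of K = 3, and the random seed are made up for the example and are not from this project's datasets; fit_predict() handles the assign/recalculate loop internally until the assignments converge.

```python
# A minimal K-Means sketch using scikit-learn on synthetic data.
# The vectors below stand in for the real dataset being clustered.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three synthetic groups of 2-D vectors
data = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# Step 1: choose K; the remaining steps (pick centroids, assign,
# recalculate, reassign) run inside fit_predict() until convergence.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(data)

print("Cluster sizes:", np.bincount(labels))
print("Centroids:\n", kmeans.cluster_centers_)
```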
So, how is ‘K’ chosen? If the number of labels or groups desired is known beforehand, that is the value of ‘K’. However, this is often unknown, since clustering is commonly used to figure out how many groups a dataset contains. When that is the case, there are a few ways to determine ‘K’. The first is simply trying different values of ‘K’ and judging which performs best. The second is the ‘elbow’ or ‘silhouette’ method: the elbow method plots the within-cluster sum of squares for a range of ‘K’ values and looks for the point where the decrease levels off, while the silhouette method scores how well each vector fits its assigned cluster compared to the next closest cluster. Both methods can be plotted easily to quickly show the best value of ‘K’. The third method is to cluster the data hierarchically and determine the value of ‘K’ by looking at the resulting dendrogram.
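The sketch below is one possible way to compare the elbow and silhouette methods side by side, again using scikit-learn on synthetic data (the real dataset would be substituted in). The ‘best’ K is roughly where the inertia curve bends and/or the silhouette score peaks.

```python
# A rough sketch of the elbow and silhouette methods for choosing K.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

data, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    # Elbow method: within-cluster sum of squares (inertia) drops quickly
    # until the "elbow", then flattens out.
    wcss = km.inertia_
    # Silhouette method: how well each vector fits its own cluster versus
    # the next closest cluster, averaged over all vectors (higher is better).
    sil = silhouette_score(data, km.labels_)
    print(f"K={k}  WCSS={wcss:8.1f}  silhouette={sil:.3f}")
```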
Hierarchical Clustering
Hierarchical clustering works by creating a hierarchy of clusters. Since a hierarchy is created, it is not necessary to specify the number of clusters beforehand (this is also why hierarchical clustering can be used to find ‘K’ for K-Means). In general, hierarchical clustering algorithms use a similarity or distance matrix to either merge or split one cluster at a time. The end result is a set of nested clusters organized as a hierarchical tree, which can be visualized with a dendrogram; each merge or split in the tree corresponds to the creation of a new cluster. There are two main types of hierarchical clustering: agglomerative (referred to as AGNES) and divisive (referred to as DIANA).
Agglomerative (AGNES)
Agglomerative clustering works in a bottom-up way: it builds the hierarchical tree starting from the bottom and merging clusters until one large cluster is formed at the top. Initially, every vector in the dataset is considered to be its own cluster. Then the two most similar clusters are merged into one larger cluster. This process is repeated until every vector is part of a single large cluster.
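A minimal sketch of agglomerative clustering using SciPy's hierarchical clustering functions is shown below; the synthetic data and the choice of Ward linkage are assumptions for the example, not the project's actual settings.

```python
# Agglomerative (bottom-up) clustering with SciPy: linkage() starts with
# every vector as its own cluster and repeatedly merges the two most
# similar clusters; dendrogram() draws the resulting tree.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal((0, 0), 0.5, (20, 2)),
    rng.normal((4, 4), 0.5, (20, 2)),
])

# 'ward' merges the pair of clusters that increases total within-cluster
# variance the least; single, complete, and average linkage also work.
merges = linkage(data, method="ward")

# Cutting the tree into a chosen number of flat clusters is one way to
# pick 'K' for K-Means.
labels = fcluster(merges, t=2, criterion="maxclust")
print("Cluster sizes:", np.bincount(labels)[1:])

dendrogram(merges)
plt.title("Agglomerative clustering dendrogram")
plt.show()
```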
Divisive (DIANA)
Divisive clustering works in a top-down way: it builds the hierarchical tree starting from one large cluster at the top and splitting until every vector is in its own cluster at the bottom. Initially, every vector in the dataset is in one large cluster. That cluster is then split into the two clusters that are most different from each other. This process continues on the newly created clusters until every vector in the dataset has been split into its own cluster.
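DIANA itself is not available in scikit-learn or SciPy, so the sketch below only imitates the top-down idea by repeatedly splitting the largest remaining cluster in two with 2-means (a bisecting approach); true DIANA chooses splits based on dissimilarities instead. The synthetic data and the target of three clusters are made up for the example.

```python
# A rough stand-in for divisive clustering: start with one big cluster
# and keep splitting the largest cluster into two until the desired
# number of clusters is reached.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def divisive(data, n_clusters):
    clusters = [np.arange(len(data))]          # start: one big cluster
    while len(clusters) < n_clusters:
        # Pick the largest remaining cluster and split it in two.
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        halves = KMeans(n_clusters=2, n_init=10,
                        random_state=0).fit_predict(data[members])
        clusters.append(members[halves == 0])
        clusters.append(members[halves == 1])
    return clusters

data, _ = make_blobs(n_samples=120, centers=3, random_state=1)
for i, members in enumerate(divisive(data, 3)):
    print(f"cluster {i}: {len(members)} vectors")
```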
Distance Measures
The similarity between two or more vectors is determined by the distance between them, which is why these clustering methods require numeric data. There are many distance metrics that can be used; some of the most common are Euclidean, Manhattan, and Cosine Similarity. Euclidean and Manhattan distance are both special cases of Minkowski distance: Euclidean distance is Minkowski distance with ‘p’ equal to two, and Manhattan distance is Minkowski distance with ‘p’ equal to one.
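This relationship can be checked directly with SciPy's distance functions; the two example vectors below are arbitrary.

```python
# Minkowski distance with p=2 matches Euclidean distance, and with p=1
# it matches Manhattan (cityblock) distance.
from scipy.spatial.distance import minkowski, euclidean, cityblock

a = [1.0, 2.0, 3.0]
b = [4.0, 0.0, 3.5]

print(minkowski(a, b, p=2), euclidean(a, b))   # same value
print(minkowski(a, b, p=1), cityblock(a, b))   # same value
```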
Euclidean Distance
Euclidean distance is defined as the square root of the sum of the squared differences between two vectors in each dimension. This is the most common notion of distance and is also known as the L2 norm. Euclidean distance measures the most direct, straight-line path from point to point. However, it tends to be less useful for high-dimensional vectors, because in high dimensions the distances between any two points become large and hard to distinguish, no matter how similar the points are.
Manhattan Distance
The Manhattan distance is defined as the sum of the absolute differences between two vectors in each dimension. This is the L1 norm. Manhattan distance is the distance traveled when movement is restricted to the coordinate axes, rather than taking the most direct path. As a result, it corresponds to the shortest route between two points on a grid, such as a grid of city streets.
Cosine Similarity
Cosine Similarity is a measure based on the angle between the vectors drawn from the origin to the points being compared (to use it as a distance, one minus the similarity is typically taken). Cosine Similarity is considered non-Euclidean because it is based on the orientation of the vectors rather than their location in a defined space. Because it ignores magnitude and depends only on the angle, Cosine Similarity is a good choice for high-dimensional data.
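The small example below illustrates this property: a vector and a scaled copy of it point in the same direction, so their cosine similarity is 1 even though their Euclidean distance is not zero. The vectors are arbitrary, and note that SciPy's cosine() returns the cosine distance (1 minus the similarity).

```python
# Cosine similarity depends only on the angle between vectors, so scaling
# a vector does not change it, while Euclidean distance changes.
import numpy as np
from scipy.spatial.distance import cosine, euclidean

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])      # same direction as a, twice as long

print("cosine similarity:", 1 - cosine(a, b))   # 1.0 -> identical direction
print("euclidean distance:", euclidean(a, b))   # clearly nonzero
```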
Clustering in this Project
Since the data used in this project is labeled, clustering will be used mainly to confirm those labels, but also to see whether clustering can be used to predict labels. Clustering will be done on both the ‘gbg’ dataset and a set of news articles, to demonstrate clustering with text as well. The ‘gbg’ dataset is labeled with game totals (‘Over’ or ‘Under’). Clustering on the ‘gbg’ dataset will therefore be done to determine whether the games group together the way those labels say they should. If they do, clustering will then be used to try to correctly predict whether a game belongs in the ‘Over’ or ‘Under’ cluster. The articles are labeled by topic, so clustering on the articles will attempt to determine whether articles can be correctly grouped based on the text content they contain.
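As a rough sketch of how the article clustering could look (the tiny article list and the choice of K = 2 below are placeholders, not the project's actual articles or settings), each article's text can be converted to a TF-IDF vector and the vectors clustered with K-Means.

```python
# Cluster short texts by converting them to TF-IDF vectors and running
# K-Means on those vectors; the articles here are toy stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

articles = [
    "the quarterback threw for three touchdowns in the game",
    "the team scored late to win the football game",
    "the stock market fell sharply after the earnings report",
    "investors sold shares as the market dropped",
]

tfidf = TfidfVectorizer(stop_words="english")
vectors = tfidf.fit_transform(articles)          # one row per article

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
for text, label in zip(articles, km.labels_):
    print(label, text[:45])
```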