The problem of unsupervised learning can be formulated as follows: given a set of observed data, the goal is to estimate an unobservable underlying probability density function. This formulation is reminiscent of density estimation in statistics; however, unsupervised learning also encompasses many other methods that seek to summarize and explain key features of the data.
Clustering is one of the key ideas in unsupervised learning. It is the assignment of a set of data (observations) into subsets (clusters), so that data in the same cluster are similar in some sense. For example, one can find that most objects in the natural world fall into one of two categories: things that are brown and run away, and things that are green and don't run away. The first group is called animals, and the second, plants. Clustering is the operation of grouping things together.
There are many clustering methods, which generally fall into the following two types:
- Partitional Clustering
- Hierarchical Clustering
Partitional Clustering
Partitional clustering aims to determine all clusters at once. It is simply a division of the set of data objects into non-overlapping subsets. An example of partitional clustering is K-means Clustering, which is presented below.
K-means clustering is a parametric approach that aims to partition \(n\) data objects into \(k\) clusters, where each data object is assigned to the cluster with the nearest mean. In other words, it tries to find the centers of the natural clusters of the data objects.
To provide a formal description, consider a set of data objects \((x_1, x_2, \dots, x_n)\), where each data object is a real vector. K-means clustering aims to partition the \(n\) data objects into \(k \leq n\) clusters \(C = \{C_1, C_2, \dots, C_k\}\) so as to minimize the within-cluster sum of squares:
$$\underset{\mathbf{C}}{\operatorname{arg\,min}} \sum_{i=1}^{k} \sum_{\mathbf x_j \in C_i} \left\| \mathbf x_j - \boldsymbol\mu_i \right\|^2$$
where \(\boldsymbol\mu_i\) is the mean of the data objects in \(C_i\). An algorithm that is often used for k-means clustering alternates between the following two steps, which are iterated (counted by \(t\)) until convergence is reached (i.e. the assignments no longer change):
- Assignment step: assign each data object to the cluster with the nearest mean, i.e. \(C_i^{(t)} = \{\mathbf x_j : \|\mathbf x_j - \boldsymbol\mu_i^{(t)}\|^2 \leq \|\mathbf x_j - \boldsymbol\mu_m^{(t)}\|^2 \ \text{for all}\ 1 \leq m \leq k\}\).
- Update step: recompute each mean as the centroid of the data objects in its cluster, i.e. \(\boldsymbol\mu_i^{(t+1)} = \frac{1}{|C_i^{(t)}|} \sum_{\mathbf x_j \in C_i^{(t)}} \mathbf x_j\).
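A minimal NumPy sketch of these two steps (a version of this iterative procedure, often called Lloyd's algorithm) is given below; the function name `kmeans`, the random initialization of the means, and the `max_iter` safeguard are illustrative choices rather than part of the formal description above.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Partition the rows of X into k clusters by alternating the
    assignment and update steps until the assignments stop changing."""
    rng = np.random.default_rng(seed)
    # Initialize the k means with k distinct data objects chosen at random.
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assign = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assignment step: each data object goes to the cluster whose
        # mean is nearest (squared Euclidean distance).
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        new_assign = d2.argmin(axis=1)
        if np.array_equal(new_assign, assign):  # convergence reached
            break
        assign = new_assign
        # Update step: each mean becomes the centroid of its cluster.
        for i in range(k):
            if np.any(assign == i):
                mu[i] = X[assign == i].mean(axis=0)
    return assign, mu
```

Calling `kmeans(X, k=3)` on an \(n \times d\) data matrix returns the cluster index of each data object together with the \(k\) cluster means.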
Hierarchical Clustering
Hierarchical clustering is a non-parametric approach that aims to build a hierarchy of clusters. Approaches to hierarchical clustering generally fall into two types:
- Agglomerative: a "bottom-up" approach, in which one starts with each data object in its own cluster, and pairs of clusters are merged as one moves up the hierarchy (a minimal sketch of this approach is given after this list).
- Divisive: a "top-down" approach, in which one starts with all data objects in a single cluster, and splits are performed recursively as one moves down the hierarchy.
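To make the agglomerative approach concrete, here is a deliberately naive Python sketch that starts with one cluster per data object and repeatedly merges the closest pair of clusters; the use of single linkage (the minimum pairwise Euclidean distance between two clusters) and the stopping rule `target_k` are assumptions made for this example, not the only possible choices.

```python
import numpy as np
from itertools import combinations

def agglomerate(X, target_k):
    """Bottom-up clustering: start with each data object in its own
    cluster, then repeatedly merge the closest pair of clusters."""
    clusters = [[i] for i in range(len(X))]  # one cluster per object
    while len(clusters) > target_k:
        # Single-linkage distance between two clusters: the minimum
        # Euclidean distance over all cross-cluster pairs of objects.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: min(
                np.linalg.norm(X[i] - X[j])
                for i in clusters[p[0]]
                for j in clusters[p[1]]
            ),
        )
        clusters[a].extend(clusters[b])  # merge cluster b into cluster a
        del clusters[b]                  # b > a, so index a is unaffected
    return clusters
```

Each entry of the returned list holds the indices of the data objects in one cluster; recording the sequence of merges instead of stopping at `target_k` would yield the full hierarchy (a dendrogram).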
In order to determine which clusters should be merged, or where a cluster should be split, a measure of similarity/dissimilarity (e.g. a measure of distance) is needed between sets of data objects. In practice this is built from two parts: a distance between individual data objects, and a linkage criterion (e.g. single or complete linkage) that extends it to whole clusters. Some commonly used measures of distance for hierarchical clustering are (where \(a\) and \(b\) are two data objects represented as real vectors, and \(a_i\) and \(b_i\) are their components):
- Euclidean distance: \(\|a-b\|_2 = \sqrt{\sum_i (a_i-b_i)^2}\)
- Squared Euclidean distance: \(\|a-b\|_2^2 = \sum_i (a_i-b_i)^2\)
- Manhattan distance: \(\|a-b\|_1 = \sum_i |a_i-b_i|\)
- Maximum distance: \(\|a-b\|_\infty = \max_i |a_i-b_i|\)
- Mahalanobis distance: \(\sqrt{(a-b)^{\mathsf T} S^{-1} (a-b)}\), where \(S\) is the covariance matrix.
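All five measures take only a few lines of NumPy; in the sketch below, `a` and `b` are two data objects, and the covariance matrix `S` for the Mahalanobis distance is estimated from a hypothetical data matrix `X` (in practice it would come from the data set being clustered).

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 3.5])

euclidean    = np.sqrt(np.sum((a - b) ** 2))  # ||a - b||_2
sq_euclidean = np.sum((a - b) ** 2)           # ||a - b||_2^2
manhattan    = np.sum(np.abs(a - b))          # ||a - b||_1
maximum      = np.max(np.abs(a - b))          # ||a - b||_inf

# Mahalanobis distance: S must be an invertible covariance matrix,
# estimated here from a hypothetical data matrix X (rows = objects).
X = np.random.default_rng(0).normal(size=(50, 3))
S = np.cov(X, rowvar=False)
diff = a - b
mahalanobis = np.sqrt(diff @ np.linalg.inv(S) @ diff)
```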