Thursday, July 21, 2016

Reading Notes: Semi-Supervised Clustering

Reading Notes: A brief summary of papers I've read, mainly for personal future reference. The claims about the papers are due to my own (mis-)understanding and reflect my personal opinion. If you know other interesting papers on this topic, I would be happy if you could leave a comment!

Semi-supervised clustering, or constrained clustering, deals with clustering where a few of the data points are labeled, either with constraints stating that they must belong to the same or to different clusters, or with a cluster label.

Constraints as Features (Asafi, Cohen-Or): Data points are assumed to come from an $M$-dimensional space. It is assumed that a pairwise distance matrix is given. Must-Link constraints reduce distances, after which all other distances have to be recomputed to satisfy the triangle inequality. Each Cannot-Link constraint adds a dimension to the data set. Moreover, each Cannot-Link constraint adds an additional distance between the two corresponding data points; the distances between unconstrained data points are increased according to their diffusion distance to the constrained points (data points that have small diffusion distances to the constrained points get larger distances added). By adding a dimension for each constraint, these added distances are "orthogonal" to the original distances, thus avoiding a change of the "general shape" of the data set. The resulting distance matrix can then be clustered using standard, unsupervised clustering techniques. Experiments: Image segmentation, UCI repo. Noisy constraints are not considered.
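To make the construction concrete, here is a minimal toy sketch of how such a distance-matrix modification could look (my own code and naming, not the authors'; in particular, the diffusion-distance-based spreading of the added coordinate to unconstrained points is omitted, and the constrained points simply get a fixed offset):

```python
import numpy as np
from scipy.sparse.csgraph import floyd_warshall

def constrained_distances(D, must_link, cannot_link, shrink=0.1, push=None):
    """Toy approximation of the constraints-as-features idea.

    D           -- (n, n) symmetric pairwise distance matrix
    must_link   -- list of index pairs that must end up in the same cluster
    cannot_link -- list of index pairs that must end up in different clusters
    """
    D = D.astype(float).copy()
    n = D.shape[0]
    if push is None:
        push = D.max()  # scale of the added "orthogonal" distance

    # Must-Link: shrink the constrained distances, then repair the triangle
    # inequality by replacing D with shortest-path distances on the full graph.
    for i, j in must_link:
        D[i, j] = D[j, i] = shrink * D[i, j]
    D = floyd_warshall(D, directed=False)

    # Cannot-Link: each constraint contributes one extra, "orthogonal"
    # coordinate used only by the two constrained points (the paper spreads it
    # to nearby points via diffusion distances; that step is skipped here).
    D2 = D ** 2
    for i, j in cannot_link:
        a = np.zeros(n)
        a[i], a[j] = +push / 2, -push / 2
        D2 += (a[:, None] - a[None, :]) ** 2
    return np.sqrt(D2)

# The returned matrix can be fed into any distance-based unsupervised method,
# e.g. hierarchical or spectral clustering with a precomputed metric.
```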

Clustering with Partition Level Side Information (Liu, Fu): Data points are assumed to come from an $M$-dimensional space. It is assumed that a pairwise distance matrix is given. A subset of the data points is labeled with a cluster label $1$ through $K$. The labeling is assumed to extend the dimension, i.e., each data point is an element of an $(M+K)$-dimensional space, and all data points lie in an $M$-dimensional subspace. Data points labeled with cluster $i$ lie in the subspace obtained by shifting the original $M$-dimensional subspace by the unit vector in the $(M+i)$-th direction. A k-means-like clustering algorithm is designed, alternately computing cluster memberships and centroids. Experiments: UCI repo, analysis of noisy labels.
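As a rough illustration of the augmented representation (my own simplification: I just append the $K$ extra coordinates and run plain k-means instead of the authors' tailored alternating algorithm; function and parameter names are mine):

```python
import numpy as np
from sklearn.cluster import KMeans

def augmented_kmeans(X, partial_labels, K, shift=1.0, random_state=0):
    """Cluster X in the (M+K)-dimensional augmented space.

    X              -- (n, M) data matrix
    partial_labels -- length-n integer array, label in {0, ..., K-1} or -1 if unlabeled
    """
    n, _ = X.shape
    Z = np.zeros((n, K))
    labeled = partial_labels >= 0
    # Labeled points are shifted by a unit vector (scaled by `shift`) in the
    # direction corresponding to their cluster label.
    Z[labeled, partial_labels[labeled]] = shift
    X_aug = np.hstack([X, Z])
    return KMeans(n_clusters=K, random_state=random_state).fit_predict(X_aug)
```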

Fuzzy Clustering with Partial Supervision (Pedrycz, Waletzky): The authors suggest adapting the fuzzy c-means algorithm for semi-supervised clustering. Data points are assumed to come from an $M$-dimensional space, and it is assumed that a pairwise distance matrix or a pairwise Mahalanobis distance matrix is given. The partial supervision comes in terms of partition level side information. The cost function for fuzzy c-means is regularized with a term penalizing the difference between the obtained cluster membership matrix and the given memberships of the labeled data points. (Although I believe that the term $b_k$ in their equation (2) should be outside the $p$-power.) Experiments: synthetic data, IRIS data set, software data set. Noise was, as far as I know, not considered.
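If I remember the objective correctly, it is roughly of the form ($u_{ik}$: computed memberships, $f_{ik}$: given memberships, $b_k$: indicator of whether $x_k$ is labeled, $d_{ik}$: distance of $x_k$ to centroid $i$, $\alpha$: regularization weight)

$$J = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik}^{p}\, d_{ik}^2 \;+\; \alpha \sum_{i=1}^{c} \sum_{k=1}^{N} \left(u_{ik} - f_{ik}\, b_k\right)^{p} d_{ik}^2,$$

and my remark above is that $\left(u_{ik} - f_{ik}\right)^{p} b_k$ would seem more natural than having $b_k$ inside the $p$-power.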

Spectral Learning (Kamvar, Klein, Manning): Data points are given in terms of an affinity matrix. The labeling can be given either as cluster labels or as Must-Link/Cannot-Link constraints. For data points in the same group, the affinity values are overridden to the maximum (without changing the values of close data points; still, the effect on unlabeled points is seen in spectral space), while data points in different groups get the minimum affinity value. From the affinity matrix a Markov transition probability matrix is generated, from which the $K$ leading eigenvectors are extracted. These eigenvectors are the (spectral) features based on which a classifier is trained on the labeled data points, which is then used to classify the unlabeled data points. Experiments: Newsgroup data; performance increases both with the percentage of labeled data points and with the number of unlabeled data points. Noisy labels/constraints are not considered.
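A minimal sketch of the pipeline (my own code and choices; in particular, the row normalization and the k-NN classifier in spectral space are not necessarily what the paper uses):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def spectral_learning(A, labels, K, n_neighbors=3):
    """Sketch of spectral learning from an affinity matrix.

    A      -- (n, n) symmetric affinity matrix
    labels -- length-n integer array, class in {0, ..., K-1} or -1 if unlabeled
    """
    A = A.astype(float).copy()
    labeled = np.where(labels >= 0)[0]
    amax, amin = A.max(), A.min()

    # Override affinities of labeled pairs: maximum within a class, minimum
    # across classes.
    for i in labeled:
        for j in labeled:
            if i != j:
                A[i, j] = amax if labels[i] == labels[j] else amin

    # Row-normalize to a Markov transition matrix and take the K leading
    # eigenvectors as spectral features.
    P = A / A.sum(axis=1, keepdims=True)
    eigvals, eigvecs = np.linalg.eig(P)
    features = eigvecs[:, np.argsort(-eigvals.real)[:K]].real

    # Train a classifier on the labeled rows of the embedding, classify the rest.
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(features[labeled], labels[labeled])
    return clf.predict(features)
```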

Flexible Constrained Spectral Clustering (Wang, Davidson): The constraints are given as Must-Link/Cannot-Link constraints with an additional belief value between 0 and 1. The constraint utility is given as an inner product, i.e., if $Q$ is the matrix with the (real-valued) constraints (positive if $x$ and $y$ are in the same cluster, negative if in different clusters, 0 if unlabeled) and $u$ is a cluster assignment vector, then $u^T Q u$ decreases for a clustering that is inconsistent with the constraints. The clustering is obtained by spectral clustering, i.e., by solving the min-cut problem under the constraint that the constraint utility has to be larger than a predefined threshold. This can be done by solving a generalized eigenvalue problem. The focus is on partitioning into two clusters. Experiments: Image segmentation, UCI repo. An analysis of noisy labels is implicit in the belief values attached to the constraints.
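In my reading, the optimization problem is roughly (with $\bar{L}$ the normalized graph Laplacian, $\bar{Q}$ the normalized constraint matrix, $\alpha$ the threshold on the constraint utility, and $\mathrm{vol}(\mathcal{G})$ the volume of the graph)

$$\min_{v}\; v^T \bar{L} v \quad \text{s.t.} \quad v^T \bar{Q} v \ge \alpha, \quad v^T v = \mathrm{vol}(\mathcal{G}),$$

whose candidate solutions are obtained from the generalized eigenvalue problem $\bar{L} v = \lambda \bigl(\bar{Q} - \tfrac{\alpha}{\mathrm{vol}(\mathcal{G})} I\bigr) v$.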

Affinity and Penalty Jointly Constrained Spectral Clustering With All-Compatibility, Flexibility, and Robustness (Qian, Jiang, Wang, Su, Wang, Hu, Muzic): The approach is very similar to that of Wang and Davidson, with two main differences: 1) the Must-Link/Cannot-Link constraints are also incorporated into the affinity matrix (by setting affinities to one and zero, respectively), and 2) the generalized eigenvalue problem is converted into a standard eigenvalue problem using a Lagrange parameter. Moreover, the authors propose a second option for the constraint matrix and extend their analysis to more than two clusters. Finally, they perform a grid search over the parameters, which can be justified by having partial supervision available. Experiments: synthetic data, Keel data, UCI repo, USPS, a human facial data set, text data, image segmentation. Noise was not considered, as far as I know.
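Presumably the conversion moves the constraint into the objective with a multiplier $\gamma$, i.e.,

$$\min_{v}\; v^T \bar{L} v - \gamma\, v^T \bar{Q} v \quad \Longrightarrow \quad (\bar{L} - \gamma \bar{Q})\, v = \lambda v,$$

so that only a standard eigenvalue problem has to be solved, at the price of the additional parameter $\gamma$.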

Semi-Supervised Information-Maximization Clustering (Calandriello, Niu, Sugiyama): Data points are assumed to come from an $M$-dimensional space and are given explicitly. The clustering utility function is the $\chi^2$-divergence between the joint data-label distribution and the product of their marginals ("squared mutual information"). The utility/cost of Must-Link and Cannot-Link constraints is encoded as an inner product between the conditional class distributions of the two corresponding data points. Assuming a kernel density estimate of the probabilities, the cost function can be written as a matrix multiplication, and the optimal clustering is obtained by spectral methods (leading eigenvectors of the matrix). Strange things are going on: the authors add additional constraints for transitivity of the Must-Link and Cannot-Link constraints (for binary clustering), and the kernel matrix is modified to incorporate these constraints, too. It is therefore not clear which idea (modification of the kernel matrix, encoding transitivity, etc.) leads to the desired performance. Experiments: UCI repo, USPS digits, Olivetti faces. Performance is similar to spectral learning and constrained spectral clustering. Noisy constraints are not considered.
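For reference, the squared mutual information between the data $x$ and the cluster label $y \in \{1, \dots, K\}$ is, if I remember correctly,

$$\mathrm{SMI} = \frac{1}{2} \sum_{y=1}^{K} \int p(x)\, p(y) \left( \frac{p(x, y)}{p(x)\, p(y)} - 1 \right)^{\!2} \mathrm{d}x,$$

i.e., the $\chi^2$-divergence mentioned above (up to the factor $1/2$).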

Computing Gaussian Mixture Models with EM using Equivalence Constraints (Shental, Bar-Hillel, Hertz, Weinshall): Data points are assumed to come from an $M$-dimensional space. The constraints are given as Must-Link/Cannot-Link constraints. Data is modeled as coming from a GMM. The transitive closure of the Must-Link constraints (called a chunklet) is treated as a single data point in EM that is weighted according to the number of data points in the chunklet. Cannot-Link constraints introduce dependencies between the hidden variables (i.e., the GMM component indicators), which leads to a hidden Markov random field. This complicates the EM algorithm and requires a generalized EM procedure that solves the problem approximately. The EM algorithm then trains a GMM which provides a soft/hard clustering. Experiments: UCI repo. Noisy labels/constraints are not considered.
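The chunklets (the transitive closure of the Must-Link constraints) are easy to compute; a small union-find sketch (my own code, not from the paper):

```python
def chunklets(n, must_link):
    """Group indices 0..n-1 into chunklets, i.e., the transitive closure of
    the Must-Link constraints. Returns a list of index lists."""
    parent = list(range(n))

    def find(i):
        # Find the representative of i's chunklet (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in must_link:
        parent[find(i)] = find(j)   # merge the two chunklets

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# chunklets(5, [(0, 1), (1, 2)])  ->  [[0, 1, 2], [3], [4]]
```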

Semi-supervised Learning with Penalized Probabilistic Clustering (Lu, Leen): Data points are assumed to come from an $M$-dimensional space. The constraints are given as Must-Link/Cannot-Link constraints with an additional belief value between 0 and 1. Data is modeled as coming from a GMM. The (beliefs about the) constraints give a prior distribution over all clusterings (ruling out certain clusterings when hard constraints are given). The probabilities for the GMM components are evaluated numerically, after simplifying the model by, e.g., removing overlapping pairwise relations. The generalized EM algorithm trains a GMM which provides a soft/hard clustering. Experiments: UCI repo, partially labeled images (cluster-classes are given), texture image segmentation. Noisy labels/constraints are not considered.

Semi-Supervised Density-Based Clustering (Lelis, Sander): Again, data points come from an $M$-dimensional space, and a subset of the data points is labeled with a cluster label $1$ through $K$. The method extends DBSCAN. DBSCAN requires two parameters: a real-valued parameter defining the radius of a data point's neighborhood, and an integer specifying how many data points must lie in the neighborhood of a data point to make it a core point. The semi-supervised method uses the partial labeling to compute the real-valued parameter (such that data points with different labels fall into different clusters), hence the method only requires the integer parameter. Experiments: UCI repo. Noise was not considered.
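A very crude approximation of the idea (my own simplification, not the paper's actual construction: choose the radius just below the smallest distance between two differently labeled points and run plain DBSCAN; this does not guarantee that the labeled points end up in different clusters, but it conveys how the labels can replace the real-valued parameter):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def labeled_eps_dbscan(X, partial_labels, min_samples=5):
    """Derive DBSCAN's radius parameter from the partial labeling.

    X              -- (n, M) data matrix
    partial_labels -- length-n integer array, label >= 0 or -1 if unlabeled
    """
    labeled = np.where(partial_labels >= 0)[0]
    # Smallest distance between two points carrying different labels.
    gaps = [np.linalg.norm(X[i] - X[j])
            for i in labeled for j in labeled
            if partial_labels[i] != partial_labels[j]]
    eps = 0.99 * min(gaps)  # stay just below it
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
```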