Tuesday, October 31, 2017

Reading Notes: Co-Clustering

In co-clustering we have two sets of related entities, both of which we want to cluster -- the main goal is that the clusters of these two sets preserve the "meaningful" relationships between the original sets. As a running example, we assume that the two sets are a set of words $\mathcal{W}$ and a set of documents $\mathcal{D}$, and that the relation between these two is the number of times a word appears in a document. We furthermore assume that the relationship between these two sets is described by a $\mathcal{W}\times\mathcal{D}$ matrix $A$.

Co-clustering has tight connections to/applications in community detection in bi-partite graphs.

Block Models

Block models assume that the entries of $A$ are independently drawn from distributions parameterized by the word and document clusters; for example, if $A$ is binary, then for each co-cluster, a different Bernoulli distribution is assumed, while for $A$ being a contingency table, a Poisson distribution with a parameter depending on the co-cluster is often assumed. Clustering can then be performed by finding a model that maximizes the likelihood of the $A$, i.e., by applying a variant of classification EM. Relevant references can be found in the paper "Sparse Poisson Latent Block Model for Document Clustering (2017)" by Ailem et al. and mainly go back to Govaert and Nadif. Ailem et al. assumed that the block model is "block diagonal", i.e., that the words and documents are clustered in the same number of clusters, and they assumed that the off-diagonal blocks are parameterized with the same underlying parameter, leading to fewer parameters to be estimated by the EM-type algorithm.

Information-Theoretic Cost Functions

These cost functions are based on transforming $A$ to a joint probability distribution $P$ of two RVs $W$ and $D$ corresponding to the words and documents, i.e., $W$ has alphabet $\mathcal{W}$ and $D$ has alphabet $\mathcal{D}$. We denote the RVs corresponding to clusters by an overline, i.e., $\overline{W}$ and $\overline{D}$.
  • Not really simultaneous co-clustering, but still co-clustering: In "Document Clustering using Word Clusters via the Information Bottleneck Method (2000)" by Slonim and Tishby, the authors first clustered the words in order to maximize $I(\overline{W};D)$, and then used these word clusters in order to find document clusters such that $I(\overline{W};\overline{D})$ is maximized. They used the agglomerative information bottleneck method in each of these two steps. In "Iterative Double Clustering for Unsupervised and Semi-Supervised Learning (2002)", El-Yaniv and Sourourjon improved the method by clustering the words (to clusters $\overline{W}^2$) again, but this time maximizing $I(\overline{W}^2;\overline{D}^1)$, where $\overline{D}^1$ are the document clusters obtained in the first iteration. They they obtained the document clusters by maximizing $I(\overline{D}^2;\overline{W}^2)$. They then repeated this procedure several times until convergence.
  • The work "A divisive information-theoretic feature clustering algorithm for text classification (2003)" by Dhillon et al. is not on co-clustering, but on word clustering in order to make document classification more robust. The goal of this technique is to find word clusters such that the mutual information $I(D;\overline{W})$, the mutual information between the word clusters and the documents, is maximized. The maximization procedure is similar to k-means, i.e., cluster memberships are computed, on the basis of which new cluster centroids are determined.
  • Based on the previous work, Dhillon et al. wrote the influential paper "Information-Theoretic Co-Clustering (2003)". Co-clustering was suggested to yield improved performance compared to one-way clustering, presumably because of an implicit regularization/dimensionality reduction. The goal was to maximize $I(\overline{D};\overline{W})$, i.e., to maximize the information the word clusters share with the document clusters. This is equivalent to approximating $P$ by a non-negative matrix trifactorization $\tilde{P}=A_W Q A_D$, where $Q$ is the joint probability distribution between $\overline{D}$ and $\overline{W}$ (it is hence a $\overline{\mathcal{W}}\times\overline{\mathcal{D}}$ matrix), and where the cost function is the Kullback-Leibler divergence $D(P\Vert \tilde{P})$. The authors then suggested a sequential optimization algorithm, i.e., one starts with a co-clustering, then one computes the block model, based on which all word clusters are updated, which then changes the block model before the document clusters are updated.
  • The paper "Information Bottleneck Co-Clustering (2010)" by Wang et al. follows the spirit of the information bottleneck method for multiple variables, i.e., using multi-information. The multi-information of a collection of RVs, described as a Bayesian network, is the sum of the mutual information between each RV and its parent RVs in this network. The goal is then to minimize the multi-information of the "input graph" (described by $P$ and the clustering functions) while at the same time maximizing the multi-information of the co-clustering to the target variables (described by $Q$ as in the previous section and the mutual information between document clusters and words/word clusters and documents). The cost function thus becomes
    $$ I(W;\overline{W}) + I(D;\overline{D}) + I(W;D) - \beta I(W;\overline{D}) - \beta I(D;\overline{W}) - \beta I(\overline{W};\overline{D}).$$ They minimize this cost function either using an agglomerative technique or by a Blahut-Arimoto-type fixed-point iteration in combination with some heuristic (continuation method) to escape local optima.
  • An information-theoretic cost function was also used by Bekkerman et al. in "Multi-Way Distributional Clustering via Pairwise Interactions (2005)", where they investigated simultaneous clustering of $m\ge 2$ sets (or $m$ RVs). Rather than using multi-information, they depended on a graph structure with $m$ vertices and an edge $e_{ij}$ if they wanted to maximize $I(\tilde{X}_i;\tilde{X}_j)$, i.e., the mutual information between the clusterings of the $i$-th and $j$-th RV. While they essentially proposed a simple sequential technique, they combined it with splits and merges, i.e., they started with an initial clustering solution in which $\tilde{X}_i=X_i$ for some $i$, while $\tilde{X}_j=const.$ for others.

Spectral/Modularity-Based Methods

Spectral and modularity-based methods assume that the word-document matrix $A$ can be interpreted as the biadjacency matrix of a bipartite graph. In these techniques, one either tries to minimize some cut score, or one tries to maximize modularity, which is always defined relative to a null model. In these models, one always has $\overline{\mathcal{W}}=\overline{\mathcal{D}}$.
  • In "Co-clustering documents and words using bipartite spectral graph partitioning (2001)", Dhillon suggested co-clustering by cutting the graph, minimizing the normalized cut. Relaxing the cut criterion, for $\overline{\mathcal{W}}=\overline{\mathcal{D}}=2$, one can compute the second singular vectors of a normalization of $A$, stack them, and cluster the elements of this single vector into two clusters using k-means. By unstacking, these two clusters then correspond to the two word clusters and the two document clusters. This technique can be extended to $\overline{\mathcal{W}}=\overline{\mathcal{D}}=k$ by computing $\ell=\lceil\log_2 k\rceil$ singular vectors, stacking them to a matrix $Z$, and perform k-means for $k$ clusters on the $\ell$-dimensional rows of $Z$. There is a Matlab implemenation of this method.
  • The paper "Co-clustering for Binary and Categorical Data with Maximum Modularity (2011)" by Labiod and Nadif uses modularity as an objective function; modularity is maximized solving the relaxed generalized eigenvalue problem and then clustering the eigenvalues using k-means just as Dhillon proposed. They focused on binary and categorical data.
  • Ailem et al. proposed modularity as an objective function in "Co-clustering document-term matrices by direct maximization of graph modularity (2015)". They proposed searching for modules in the bipartite graph, by alternatingly maximizing the modularity over word clusters and document clusters, i.e., they first fix the word clusters and maximize modularity over the document clusters, then fix document clusters to find optimal word clusters.

Other Methods

  • Co-clustering can be seen also as a non-negative matrix trifactorization problem; Sra and Dhillon present update rules for both Euclidean distance and Kullback-Leibler divergence in their report " Generalized nonnegative matrix approximations (2006)". The latter cost function is also used in Information-Theoretic Co-Clustering by Dhillon et al.
  • The paper "Co-clustering through optimal transport (2017)" by Laclau et al. formulates the co-clustering problem as the problem to transport the empirical probability mass from the words to the documents, the solution of which is a joint probability distribution $\tilde{P}$ approximating the empirical one. Entropically regularized transport yields again the Kullback-Leibler divergence as cost function, thus minimizing $D(\tilde{P}\Vert P)$, where $P$ is obtained from $A$ (note the difference to Information-Theoretic Co-Clustering), and where $\tilde{P}$ is connected to variational inference. The authors showed that co-clustering can be solved via the Sinkhorn-Knopp algorithm and suggested a heuristic to determine the number of clusters.
  • Not exactly a co-clustering method, but rather a meta-heuristic, was proposed by Cheng et al. in "Co-ClusterD: A Distributed Framework for Data Co-Clustering with Sequential Updates (2015)". They analyzed the "concurrent" update rule of, e.g., Information-Theoretic Co-Clustering of Dhillon et al. and replaced it by a "sequential" update rule: Essentially, they propose updating the statistics for a co-cluster as soon as a single element of a set changes its cluster membership, thus influencing the next cluster update immediately. Concurrent update rules, on the other hand, update the co-cluster statistics only after all words have been reassigned to word clusters (and similarly for document clusters). They then present a framework for performing co-clustering in a distributed fashion.
  • A hierarchical co-clustering method was introduced by Li et al. in "Hierarchical co-clustering: a new way to organize the music data (2012)". Their "divisive" version applies Dhillon's spectral method recursively, while their "agglomerative" version simply merges clusters in order to minimize cluster heterogenity.
  • Banerjee et al. view co-clustering as a matrix approximation task in "A generalized maximum entropy approach to Bregman co-clustering and matrix approximation (2007)". Essentially, they view the co-clustering as a way to get a summary statistic of $A$ based on the co-clusters, and they proposed several different types of summary statistics (e.g., co-cluster means). They then show that, given $A$ and the type of summary statistic, the optimal summary statistic and the optimal co-clustering is obtained via minimizing any Bregman divergence, such as the squared Euclidean distance or the Kullback-Leibler divergence. Their algorithm is similar to the one proposed by Dhillon et al. for Information-Theoretic Co-Clustering.
  • In "Locally Discriminative Coclustering (2012)", Zhang et al. proposed a cost function mapping a word and a document to the same co-cluster if the corresponding entry in $A$ is large. In addition, they proposed also enforcing co-clusters that respect dependencies between words alone, and between documents alone. They showed that the complete problem can be relaxed to an eigenvalue problem.