Co-clustering
From ScribbleWiki: Analysis of Social Media
A short summary of the following paper.
In co-clustering, we are given a contigency table of a bipartite graph, the number of row cluster and column cluster; and the goal is to simultaneously find row cluster and column cluster. As its name implies, it is two-side clustering, which differs from the traditional one-side clustering.
In co-clustering, we treat row and clumn as two discrete random variables (X and Y), respectively; and the contigency table (by proper normalization) is treated as the joint distribution between X and Y (p(X,Y)). Simultaneously finding row cluster and column cluster is equivalent to find a mapping function C_X from X to \hat(X), where \hat(X) is the index for row cluster, as well as a mapping function C_Y from Y to \hat(Y), where \hat(Y) is the index for column cluster. Notice that both \hat(X) and \hat(Y) are random variable since they are the function of random variable X and Y, respectively. so, we can compute all desired distributions based on the mapping functions (e.g. the joint distribution between \hat(X) and \hat(Y), conditional distribution of X given \hat(X) and so on). In order to find a good co-cluster, the authors suggest to maximize the mutual information between \hat(X) and \hat(Y). This formulation is equivalent to find an approximate joint distribution between X and Y (q(X,Y)), which minimize the KL-Dist between p(X,Y) and q(X,Y).
The problem they define is NP-Hard. The authors suggest an greedy algorithm to gurantee a locol minimum. The algorithm alternates between updating row cocluster (while fixing column cluster) and updating column cluster (while fixing the row cluster), until a local minimum is found.
Due to the high-dimension, sparsity, and noise in the data, the authors find this two-side clustering usually performs better than one-side clustering. Another advantage of co-clustering is interprebility, - by two-side clustering, not only do we know how to find a cluster (e.g., CS documents), but also we know why they form a cluster (e.g., all these documents are talking about software, 'software', 'database', 'machine learning' etc, which themselves forms a term cluster)