Thursday, March 15, 2018

An Inequality Between Conditional Cross Entropy and Error Probability

Some time ago I wrote about an inequality between conditional entropy and error probability. Specifically, one can show that if $\hat{x}(Y)$ is the maximum a posteriori estimate of a discrete random variable $X$ given the random observation $Y$, then the probability of making an error is bounded from above by

$$P_e = \mathbb{P}(X\neq \hat{x}(Y)) \le 1- 2^{-H(X|Y)}.$$

This inequality is not new. I was not aware of any previous references, however, until I met Igal Sason at the International Zurich Seminar earlier this year. He spoke about a recently published paper of his which includes this inequality (eq. (2)), together with many more general results relating the error probability to the Arimoto-Renyi conditional entropy.
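For concreteness, here is a quick numerical sanity check of this bound (a minimal sketch in Python; the joint distribution below is randomly generated and purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    p_xy = rng.random((4, 3))              # joint distribution of (X, Y): 4 x-values, 3 y-values
    p_xy /= p_xy.sum()

    p_y = p_xy.sum(axis=0)                 # marginal of Y
    p_x_given_y = p_xy / p_y               # column y holds p_{X|Y}(.|y)

    # MAP error probability: for each y, the MAP estimate picks the largest posterior mass.
    P_e = 1.0 - np.sum(p_y * p_x_given_y.max(axis=0))

    # Conditional entropy H(X|Y) in bits.
    H_XgY = -np.sum(p_xy * np.log2(p_x_given_y))

    print(P_e, 1.0 - 2.0 ** (-H_XgY))      # the first value never exceeds the second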

I was hoping to generalize the above inequality to the mismatched case. Specifically, let $p_{X|Y}$ be the conditional distribution of $X$ given $Y$. The MAP estimator is $\hat{x}(y) = \arg\max_x p_{X|Y}(x|y)$. Now suppose that the MAP estimator does not have access to $p_{X|Y}$ but only to a surrogate distribution $q_{X|Y}$, in which case the estimator becomes $\tilde{x}(y) = \arg\max_x q_{X|Y}(x|y)$. I hoped to obtain a bound similar to the one above that includes this surrogate distribution and that bounds the error probability $\tilde{P}_e = \mathbb{P}(X\neq \tilde{x}(Y))$. To be more precise, I was hoping to replace the conditional entropy above by a conditional cross entropy, i.e., to obtain a bound similar to

$$\tilde{P}_e \le 1- 2^{-\mathbb{E}(H^\times(p_{X|Y}(\cdot|Y)\Vert q_{X|Y}(\cdot|Y)))}$$

possibly with the order of the distributions reversed. Since the cross entropy is an upper bound on the entropy, this bound would be larger than the bound for $P_e$. (Interestingly, even if $p_{X|Y}$ and $q_{X|Y}$ are different, we may have $\tilde{P}_e = P_e$: it suffices that, for every $y$, both conditional distributions have their modes at the same values of $x$.)

Unfortunately, my hope was not justified. We show this with a counterexample, i.e., we show that there exist distributions for which the inequality on $\tilde{P}_e$ is violated (for either order of the distributions). For the sake of simplicity, we look at guessing rather than estimation, i.e., the conditional distributions satisfy $p_{X|Y}=p_X$ and $q_{X|Y}=q_{X}$. Now suppose that $X$ is binary, i.e., it takes values 0 and 1. Furthermore, let $p_X(0)=q_X(1)=\epsilon\ll 0.5$. We then have $\tilde{x}=\arg\max_x q_X(x) = 0$ and thus

$$\tilde{P}_e=\mathbb{P}(X\neq 0)=1-\epsilon.$$

Since $p_X$ and $q_X$ are permutations of each other, we have $H(q_X)=H(p_X)$ and $D(p_X\Vert q_X) = D(q_X\Vert p_X)$, and thus

$$H^\times(p_X\Vert q_X) = H^\times(q_X\Vert p_X) = -\epsilon\log(1-\epsilon)-(1-\epsilon)\log\epsilon = -\log \epsilon - \epsilon\log\left(\frac{1-\epsilon}{\epsilon}\right).$$

With this we get

$$2^{-H^\times(p_X\Vert q_X)} = \epsilon\cdot 2^{\epsilon\log\left(\frac{1-\epsilon}{\epsilon}\right)} > \epsilon$$

and thus $1- 2^{-H^\times(p_X\Vert q_X)} < 1-\epsilon = \tilde{P}_e$, i.e., the hoped-for bound $\tilde{P}_e \le 1- 2^{-H^\times(p_X\Vert q_X)}$ cannot hold.
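A few lines of Python confirm this numerically (a minimal sketch; the value of $\epsilon$ below is an arbitrary choice):

    import numpy as np

    eps = 0.05
    p = np.array([eps, 1.0 - eps])             # true distribution p_X
    q = np.array([1.0 - eps, eps])             # mismatched distribution q_X

    x_tilde = int(np.argmax(q))                # mismatched guess, here 0
    P_e_mismatch = 1.0 - p[x_tilde]            # = 1 - eps = 0.95

    cross_entropy = -np.sum(p * np.log2(q))    # H^x(p||q) = H^x(q||p) in bits
    candidate_bound = 1.0 - 2.0 ** (-cross_entropy)

    print(P_e_mismatch, candidate_bound)       # 0.95 vs. roughly 0.942: the candidate bound is violated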

Monday, February 26, 2018

Talk at "International Zurich Seminar on Information and Communication"

I just came back from Zurich, where I attended the Int. Zurich Seminar. Being there for the fourth time in a row, it felt like a class reunion :). I spoke about joint work with Tobi Koch on the information dimension of multivariate Gaussian processes (see below for the slides). The work is on arXiv here and here, and Tobi and I discussed a roadmap for more results to come - stay tuned!

By the way, the IZS publishes its submissions on an ETH server that is free to access for everyone and that does not require authors to transfer copyright to the publisher. This is not only open access at its best, but it also gives you a chance to look at the amazing papers presented in Zurich.

Monday, February 19, 2018

"A Rate-Distortion Approach to Caching" Just Got Published!

Our joint work on rate-distortion theory for caching just got published in the IEEE Transactions on Information Theory (it's also available on arXiv, of course; there's also an older post). Thanks Roy, Shirin, and Michele for letting me be a part of this!

In this paper, we consider a lossy single-user caching problem with correlated sources -- just think of streaming compressed videos! Most users will watch these videos in the evening, leading to network congestion. If you have a player with a cache, though, you can fill this cache with data during times of low network usage, even though you may not know which video the user wants to watch in the evening. In our paper, we characterize the transmission rate required in the evening as a function of the cache size and as a function of the distortion one accepts when watching the videos. We furthermore hint at what should be put in the cache such that it is useful for a variety of videos, and we connect these results to the common-information measures proposed by Wyner, Gacs and Koerner.

The picture shows achievable upper (red) and converse lower (black) bounds on the transmission rate as a function of the cache capacity $C$ for a doubly symmetric binary source, with the requirement that the video is reconstructed perfectly at the receiver. $I(X_1;X_2)$ denotes the mutual information between the two videos, while $K_W(X_1,X_2)$ denotes Wyner's common information.

Friday, January 26, 2018

Talk at "Advanced Topics in Discrete Mathematics"

I recently gave a talk in the "Advanced Topics in Discrete Mathematics" series organized by the doctoral program on discrete mathematics. In case you missed it, be glad that you did: I went overtime quite a bit. You can obtain the slides by clicking on the image below.

Wednesday, January 17, 2018

My Story with Polar Codes Comes to an End

I had long been wondering whether polar codes are fractal, found out in 2015 that they are, and put a paper on arXiv. The story that began in 2010 finally comes to an end -- the paper is now published!

I submitted the paper to the IEEE Transactions on Information Theory, where it was under review for 18 months. I was told that the two reviewers thought the paper collects a few "interesting and cute mathematical observations", but that it does not tell us anything about the relevant code properties such as error probabilities or code complexity. I was further told that if I could "add a convincing application", then the paper could get to a second review round.

I withdrew the paper and submitted it to Entropy, an open-access journal published by MDPI, in October 2017. I received reviews in less than two months. Four of the five reviewers were inclined to accept the paper after some revision, while the fifth reviewer insisted on rejecting the paper (and probably still does) because of its missing connection to real-world codes. It is worth mentioning that the review process was not only much faster than the one at the IEEE Transactions on Information Theory, but that I also got high-quality reviews: some of the reviewers went through the math in detail and made corrections and suggestions for improvement that led to generalizing some of the results. This shows that the need for a proper, high-quality review process cannot be used to justify long review cycles.

Friday, January 12, 2018

Reading Notes: Neural Networks and the Information Bottleneck Principle

There has recently been quite some hype about the connection between neural networks (NNs) and the information bottleneck principle. In this post, I try to summarize a few papers that have been published since 2015. A popsci treatment of this topic can be found here. All linked papers are available on arXiv or on OpenReview.

The reader should be familiar with the information bottleneck (IB) principle. Specifically, it assumes that we have a class random variable (RV) $Y$ and a (multi-dimensional) feature RV $X$, and that our goal is to compress $X$ to a "less complex" RV $\hat{Y}$ that shares much information with $Y$ but little with $X$ (compression versus preserving relevant information). Mathematically, one tries to minimize
$$I(X;\hat{Y}) - \beta I(Y;\hat{Y}) \quad\text{or equivalently}\quad I(X;\hat{Y}) + \beta I(X;Y|\hat{Y}),$$
where $\beta$ is a parameter trading off between these two goals. The first term in either expression is often called the rate $R$, while the second term in the latter expression is the information distortion $D_{IB}$. This allows one to draw the IB curve, with $R$ on one axis and $D_{IB}$ on the other. This curve depends only on the joint distribution of $(X,Y)$ and is monotonically decreasing.
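To make these quantities concrete, here is a minimal sketch of how $R$ and $D_{IB}$ can be evaluated for a discrete toy problem; the joint distribution and the stochastic encoder below are arbitrary choices, not taken from any of the papers:

    import numpy as np

    def mutual_information(p_ab):
        """I(A;B) in bits for a joint distribution given as a 2-D array."""
        p_a = p_ab.sum(axis=1, keepdims=True)
        p_b = p_ab.sum(axis=0, keepdims=True)
        mask = p_ab > 0
        return np.sum(p_ab[mask] * np.log2(p_ab[mask] / (p_a @ p_b)[mask]))

    rng = np.random.default_rng(1)
    p_xy = rng.random((5, 3)); p_xy /= p_xy.sum()                     # joint distribution of (X, Y)
    enc = rng.random((5, 2)); enc /= enc.sum(axis=1, keepdims=True)   # stochastic encoder p(yhat|x)

    p_x_yhat = p_xy.sum(axis=1, keepdims=True) * enc                  # joint of (X, Yhat)
    p_y_yhat = p_xy.T @ enc                                           # joint of (Y, Yhat), Markov chain Y - X - Yhat

    R = mutual_information(p_x_yhat)                                  # rate I(X;Yhat)
    D_IB = mutual_information(p_xy) - mutual_information(p_y_yhat)    # I(X;Y|Yhat) = I(X;Y) - I(Y;Yhat)

    print(R, D_IB, R + 0.5 * D_IB)                                    # IB functional (second form) for beta = 0.5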

2015 - Tishby and Zaslavsky, "Deep Learning and the Information Bottleneck Principle". How it all started.

The authors claim that the goal of NNs is to compress the features as much as possible, with the constraint that the compression should be a minimal sufficient statistic of the class RV. The IB principle formulates this goal, hence they claim that it can be used to better understand NNs. Indeed, feed-forward NNs are a sequence of compression functions; the feature RV $X$ is compressed to the output of the first hidden layer $h_1$, which is then compressed to the output of the second hidden layer $h_2$, etc. Finally, the output of the last hidden layer, $h_m$, is compressed to $\hat{Y}$. The authors claim that the IB functional not only characterizes the NN as a whole, but also each layer separately. I.e., each layer should minimize $I(h_i;h_{i-1})$ while simultaneously maximizing $I(Y;h_i)$. The former has an interpretation in terms of the minimum description length principle, while the latter bounds the classification error. With this in mind, one can evaluate the optimality of each layer separately by comparing its performance with the IB curve. Each layer can only increase $D_{IB}$, but the hope is that $R$ is reduced by compression.
The authors then discuss finite sample bounds, which are relevant when estimating the IB curve from a finite data set. Specifically, it turns out that the empirical IB curve is bounded from below by the decreasing true IB curve, but from above by a convex function. I.e., the actual value of $D_{IB}$ may not necessarily decrease with increasing rate $R$. This is exactly the phenomenon of overfitting: a large $R$ may lead to a small $D_{IB}$ on the training set, but to a much larger $D_{IB}$ on the test set. It may thus be beneficial to aim at obtaining a representation close to the minimum of the convex upper bound. There, the rate $R$ is smaller and leads to a slightly larger $D_{IB}$ on the training set, but the worst-case increase of $D_{IB}$ for the test set is bounded. The authors argue that compression is thus necessary for generalization. Based on that, they argue that the IB principle may be used for the design of NNs. Finally, the authors briefly mention that there is a connection between bifurcations of the IB curve corresponding to constraints on $\hat{Y}$ (e.g., cardinality of the support) and the structure of the NN.

Based on this paper, two directions of research have been initiated: designing NNs using the IB principle, and understanding NNs using the IB principle.

Design of NNs

2016 - Khadivi, Tandon, Ramakrishnan, "Flow of Information in Feed-Forward Deep Neural Networks"

This paper proposes using the IB functional as a cost function for training a NN. Specifically, they decompose the rate $R$, which happens to be $H(\hat{Y})$ because the NN is deterministic, as a telescoping sum of information losses per layer. Moreover, they decompose the information distortion $D_{IB}$ as a sum of terms $I(h_{i-1};Y|h_{i})$; indeed, their equation (19) is precisely my Proposition 4 in this paper. Their goal is to maximize compression under a distortion constraint, as opposed to the variational formulation of Tishby. Strangely, though, their compression goal is to minimize $I(X;\hat{Y})+\sum_{i=1}^m I(X;h_i)$, i.e., they also want to compress the individual layer outputs separately. This leads to a stronger compression factor for early layers, or equivalently, to smaller values of $\beta$ for early layers.
Finally, the authors formulate the training problem not as finding weights, but as first finding admissible functions between layers (where admissibility means that the function can be implemented using a NN layer of the given structure). They then obtain the weights from the obtained function. This approach is rather strange and, I suppose, hard to implement in practice, which may be the reason why the authors focus on Boolean functions in their example section.

2016 - Alemi, Fischer, Dillon, Murphy, "Deep Variational Information Bottleneck"

The authors also propose using IB to train a NN, claiming that it is "natural and useful". However, since computing mutual information is "computationally challenging", they replace the terms in the IB functional by upper and lower bounds. Specifically, they replace the decoding distribution $P_{Y|\hat{Y}}$ by a surrogate in the term $I(\hat{Y};Y)$, and the marginal distribution $P_{\hat{Y}}$ by a surrogate in the term $I(\hat{Y};X)$. For example, they assume that $P_{Y|\hat{Y}}$ is a simple softmax and that $P_{\hat{Y}}$ is a spherical Gaussian. Moreover, they assume that the encoder that maps $X$ to $\hat{Y}$ is stochastic, i.e., it responds with a Gaussian RV whose mean vector and covariance matrix are obtained from a neural network. This makes it possible to sample from the distribution of $\hat{Y}$, to compute gradients, and to use backpropagation. Note that this latter assumption is not one on the surrogate $P_{\hat{Y}|X}$, but rather one on the structure of the NN: inside the NN there is a "bottleneck layer" that responds with $\hat{Y}$ (it may be the last layer), where the layer explicitly adds Gaussian noise.
They discover that the IB principle works as a decent regularizer in the sense of achieving good classification performance on MNIST data. They moreover plot the information plane (the IB curve) and discover that a poor choice of $\beta$, leading to low compression, leads to overfitting. An interesting figure shows a "two-dimensional bottleneck layer", i.e., $\hat{Y}$ has two components. For small compression, $\mathbb{R}^2$ is clearly separated into ten sectors corresponding to the 10 classes of the MNIST data set, but for larger compression these regions tend to overlap, leading to misclassification but also to robustness to adversarial examples. For image classification tasks it was shown that the IB principle leads to NNs that are more robust to adversarial examples.
It is worth mentioning that the authors apply the IB principle to the NN as a whole, i.e., not to individual layers as Tishby suggested. They leave applying it to multiple (or all) layers for future work.
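To fix ideas, here is a minimal PyTorch sketch of a loss of this flavor: a Gaussian bottleneck layer with the reparameterization trick, a softmax decoder, and a standard-normal surrogate for $P_{\hat{Y}}$. The layer sizes, the prior, the value of $\beta$, and the dummy batch are my own placeholders, not the authors' setup:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VIB(nn.Module):
        def __init__(self, in_dim=784, bottleneck_dim=32, num_classes=10):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
            self.mu = nn.Linear(256, bottleneck_dim)        # mean of the stochastic encoder
            self.log_var = nn.Linear(256, bottleneck_dim)   # log-variance of the stochastic encoder
            self.decoder = nn.Linear(bottleneck_dim, num_classes)  # softmax decoder

        def forward(self, x):
            h = self.encoder(x)
            mu, log_var = self.mu(h), self.log_var(h)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)  # reparameterized bottleneck sample
            return self.decoder(z), mu, log_var

    def vib_loss(logits, labels, mu, log_var, beta=1e-3):
        # Cross-entropy is the variational bound related to I(Yhat;Y); the KL term
        # to a standard normal upper-bounds I(Yhat;X).
        ce = F.cross_entropy(logits, labels)
        kl = 0.5 * torch.mean(torch.sum(mu**2 + log_var.exp() - log_var - 1.0, dim=1))
        return ce + beta * kl

    model = VIB()
    x, y = torch.randn(8, 784), torch.randint(0, 10, (8,))   # dummy batch
    logits, mu, log_var = model(x)
    loss = vib_loss(logits, y, mu, log_var)
    loss.backward()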

2017 - Kolchinsky, Tracey, Wolpert, "Nonlinear Information Bottleneck"

Although conceptually similar to the work of Alemi et al., these authors actually aim at implementing IB for non-Gaussian and non-discrete distributions (leading, in general, to a nonlinear compression function). Again, the authors argue that the IB functional is hard to compute and hence resort to bounds, replacing distributions by surrogate ones. Specifically, they assume a stochastic encoder $P_{\hat{Y}|X}$, which is based on a deterministic function (implemented using a neural network) to which spherical Gaussian noise is added. They moreover assume that the marginal distribution at the output of the deterministic neural network can be modeled by a Gaussian mixture. Using this marginal distribution and the conditional distribution, they present a non-parametric bound on the term $I(X;\hat{Y})$ (rather than a variational bound as in the previously summarized paper). They moreover implement the decoding function via a neural network and estimate the log decoding probability $\log P_{Y|\hat{Y}}$ by evaluating an "appropriately chosen neural network cost function" and by sampling from the "stochastic" bottleneck variable $\hat{Y}$. In other words, they use two neural networks to implement the encoding and decoding functions of the IB functional. Very interesting is their Figure 2, which shows how the bottleneck layer manages to separate AND compress the classes, such that data points from the same class almost coincide. If the parameter $\beta$ is set to infinity, then compression is no longer a goal. The bottleneck layer then still manages to separate the classes, but each class corresponds to a somewhat spread-out cloud of data points. Data points are still distinguishable, which means that the bottleneck layer not only contains class information, but also additional information about the data points that is actually irrelevant.

2018 - ?, Parametric Information Bottleneck to Optimize Stochastic Neural Networks

This paper on OpenReview trains a neural network using the IB functional in every layer, i.e., in each layer they want to minimize $I(h_i;h_{i-1})$ and maximize $I(h_i;Y)$, with the trade-off determined by a layer-specific parameter $\beta_i$. They do not rely on a NN decoder as Alemi et al. did, but empirically estimate $I(X;h_i)$ and use a variational approximation for $I(Y;h_i)$. For this, they treat the subsequent layers of the NN as an approximation of $P_{Y|h_i}$. They furthermore estimate $I(h_i;h_{i-1})$ by Monte Carlo sampling. They show that their approach, minimizing the IB functional, performs well on the MNIST data set, where all layers had the same parameter $\beta$.

Understanding NNs

2017 - Schwartz-Ziv, Tishby, "Opening the Black Box of Deep Neural Networks via Information"

This paper probably kicked up most of the dust; a recording of a talk by Tishby on this topic is available. In this paper, the authors train a NN with sigmoid activation functions and without regularization. They furthermore evaluate the IB functional for every layer throughout training. They observe that training is split into two phases. The first is a phase in which the training error decreases quickly; it is characterized by a stochastic gradient with a large mean and small variance ("drift phase") and by an increase of $I(h_i; Y)$ (the layer learns). Simultaneously, the terms $I(h_i; X)$ seem to increase, too (because the layers also learn something about the features). This drift phase is very short. The second, much longer phase is characterized by a compression of the input information subject to the training error criterion (i.e., the training error does not change?). In this phase, the gradient has a small mean and a large variance ("diffusion phase"); $I(h_i; Y)$ increases further, while $I(h_i; X)$ decreases in the deeper layers: the deeper layers compress, in the sense that they forget input information not relevant for classification. What is interesting is that, without regularization, neither a decrease in weight norms nor sparsity is observable. Rather, the compression occurs because of the stochasticity of gradient descent. This suggests that, without regularization, there are multiple NNs with (possibly random) weights that have equal classification performance. It therefore suffices to find a NN in this equivalence class; interpreting the output of the sigmoid function as a probability (which is equivalent to adding noise?) leads to choosing a function with low complexity (rather than one which is random and has high complexity; cf. Section 2.4). The authors furthermore indicate that the converged layers have IB functionals that lie close to the optimal curve for special values of $\beta$, where $\beta$ is smaller for deeper layers (more focus on compression). Furthermore, they observed that the compression phase is faster for NNs with more hidden layers. They also claim that the convergence of the layers to the IB curve can be used in training NNs.

2018 - ?, On the Information Bottleneck Theory of Deep Learning

This paper on OpenReview, where a long discussion between Tishby (a reviewer?) and the authors can be found, essentially refutes some of the claims of Tishby's paper. Specifically, they argue that compression in the diffusion phase does not appear for linear and ReLU activation functions. They argue moreover that the compression observed in Tishby's paper is due to the fact that the sigmoid nonlinearity is saturated, leading to small mutual information values when the latter is estimated via binning (to which Tishby responded that the authors don't know how to estimate mutual information). They showed that compression occurs independently of generalization, by showing that 1) a linear network generalizes but does not compress, and 2) an overfitted network compresses but does not generalize. Finally, they showed that similar results are obtained with batch gradient descent instead of stochastic gradient descent, which shows that compression is not caused by stochasticity. The verdict on which of these papers is closer to the truth is still out. It is worth mentioning, though, that they discuss several problems with information-theoretic cost functions in Appendix C, where they also discuss the importance of noise.
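For reference, the binning estimator both camps argue about is easy to state in code. The following is a minimal sketch under the usual assumptions (deterministic network, every training sample a distinct realization of $X$, so $I(X;h_i)$ reduces to the entropy of the binned activations); the bin count and the toy activations are arbitrary choices:

    import numpy as np
    from collections import Counter

    def binned_information(h, n_bins=30):
        """Estimate I(X; h) in bits by binning the activations h (samples x units)."""
        bins = np.linspace(h.min(), h.max(), n_bins + 1)
        codes = [tuple(row) for row in np.digitize(h, bins)]   # one discrete code per sample
        counts = np.array(list(Counter(codes).values()), dtype=float)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))                         # = H(binned h) = I(X; binned h)

    rng = np.random.default_rng(0)
    h_sigmoid = 1.0 / (1.0 + np.exp(-5.0 * rng.normal(size=(2000, 8))))  # saturating layer, 8 units
    print(binned_information(h_sigmoid))   # saturation pushes many samples into the same bins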

Tuesday, October 31, 2017

Reading Notes: Co-Clustering

In co-clustering we have two sets of related entities, both of which we want to cluster -- the main goal is that the clusters of these two sets preserve the "meaningful" relationships between the original sets. As a running example, we assume that the two sets are a set of words $\mathcal{W}$ and a set of documents $\mathcal{D}$, and that the relation between these two is the number of times a word appears in a document. We furthermore assume that the relationship between these two sets is described by a $\mathcal{W}\times\mathcal{D}$ matrix $A$.

Co-clustering has tight connections to/applications in community detection in bipartite graphs.

Block Models

Block models assume that the entries of $A$ are independently drawn from distributions parameterized by the word and document clusters; for example, if $A$ is binary, then a different Bernoulli distribution is assumed for each co-cluster, while if $A$ is a contingency table, a Poisson distribution with a parameter depending on the co-cluster is often assumed. Clustering can then be performed by finding a model that maximizes the likelihood of $A$, i.e., by applying a variant of classification EM. Relevant references can be found in the paper "Sparse Poisson Latent Block Model for Document Clustering (2017)" by Ailem et al. and mainly go back to Govaert and Nadif. Ailem et al. assumed that the block model is "block diagonal", i.e., that the words and documents are clustered into the same number of clusters, and they assumed that the off-diagonal blocks are parameterized with the same underlying parameter, leading to fewer parameters to be estimated by the EM-type algorithm.
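As an illustration, here is a minimal sketch of evaluating the complete-data log-likelihood of a generic Poisson block model for fixed cluster assignments; the block rates are simply the block means of $A$, and the toy data are random, so this is not the specific sparse/diagonal variant of Ailem et al.:

    import numpy as np
    from scipy.special import gammaln

    def poisson_block_loglik(A, w_lab, d_lab, n_wc, n_dc):
        ll = 0.0
        for k in range(n_wc):
            for l in range(n_dc):
                block = A[np.ix_(w_lab == k, d_lab == l)]   # entries of co-cluster (k, l)
                if block.size == 0:
                    continue
                lam = max(block.mean(), 1e-12)              # block-specific Poisson rate
                ll += np.sum(block * np.log(lam) - lam - gammaln(block + 1.0))
        return ll

    rng = np.random.default_rng(0)
    A = rng.poisson(2.0, size=(100, 80))                    # toy word-document count matrix
    w_lab = rng.integers(0, 3, size=100)                    # word cluster assignments
    d_lab = rng.integers(0, 3, size=80)                     # document cluster assignments
    print(poisson_block_loglik(A, w_lab, d_lab, 3, 3))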

Information-Theoretic Cost Functions

These cost functions are based on transforming $A$ to a joint probability distribution $P$ of two RVs $W$ and $D$ corresponding to the words and documents, i.e., $W$ has alphabet $\mathcal{W}$ and $D$ has alphabet $\mathcal{D}$. We denote the RVs corresponding to clusters by an overline, i.e., $\overline{W}$ and $\overline{D}$.
  • Not really simultaneous co-clustering, but still co-clustering: In "Document Clustering using Word Clusters via the Information Bottleneck Method (2000)" by Slonim and Tishby, the authors first clustered the words in order to maximize $I(\overline{W};D)$, and then used these word clusters in order to find document clusters such that $I(\overline{W};\overline{D})$ is maximized. They used the agglomerative information bottleneck method in each of these two steps. In "Iterative Double Clustering for Unsupervised and Semi-Supervised Learning (2002)", El-Yaniv and Sourourjon improved the method by clustering the words again (to clusters $\overline{W}^2$), but this time maximizing $I(\overline{W}^2;\overline{D}^1)$, where $\overline{D}^1$ are the document clusters obtained in the first iteration. They then obtained the document clusters by maximizing $I(\overline{D}^2;\overline{W}^2)$, and repeated this procedure until convergence.
  • The work "A divisive information-theoretic feature clustering algorithm for text classification (2003)" by Dhillon et al. is not on co-clustering, but on word clustering in order to make document classification more robust. The goal of this technique is to find word clusters such that the mutual information $I(D;\overline{W})$, the mutual information between the word clusters and the documents, is maximized. The maximization procedure is similar to k-means, i.e., cluster memberships are computed, on the basis of which new cluster centroids are determined.
  • Based on the previous work, Dhillon et al. wrote the influential paper "Information-Theoretic Co-Clustering (2003)". Co-clustering was suggested to yield improved performance compared to one-way clustering, presumably because of an implicit regularization/dimensionality reduction. The goal was to maximize $I(\overline{D};\overline{W})$, i.e., to maximize the information the word clusters share with the document clusters (a minimal sketch of evaluating this objective is given after this list). This is equivalent to approximating $P$ by a non-negative matrix trifactorization $\tilde{P}=A_W Q A_D$, where $Q$ is the joint probability distribution between $\overline{D}$ and $\overline{W}$ (it is hence a $\overline{\mathcal{W}}\times\overline{\mathcal{D}}$ matrix), and where the cost function is the Kullback-Leibler divergence $D(P\Vert \tilde{P})$. The authors then suggested a sequential optimization algorithm, i.e., one starts with a co-clustering, then one computes the block model, based on which all word clusters are updated, which then changes the block model before the document clusters are updated.
  • The paper "Information Bottleneck Co-Clustering (2010)" by Wang et al. follows the spirit of the information bottleneck method for multiple variables, i.e., using multi-information. The multi-information of a collection of RVs, described as a Bayesian network, is the sum of the mutual information between each RV and its parent RVs in this network. The goal is then to minimize the multi-information of the "input graph" (described by $P$ and the clustering functions) while at the same time maximizing the multi-information of the co-clustering to the target variables (described by $Q$ as in the previous section and the mutual information between document clusters and words/word clusters and documents). The cost function thus becomes
    $$ I(W;\overline{W}) + I(D;\overline{D}) + I(W;D) - \beta I(W;\overline{D}) - \beta I(D;\overline{W}) - \beta I(\overline{W};\overline{D}).$$ They minimize this cost function either using an agglomerative technique or by a Blahut-Arimoto-type fixed-point iteration in combination with some heuristic (continuation method) to escape local optima.
  • An information-theoretic cost function was also used by Bekkerman et al. in "Multi-Way Distributional Clustering via Pairwise Interactions (2005)", where they investigated simultaneous clustering of $m\ge 2$ sets (or $m$ RVs). Rather than using multi-information, they depended on a graph structure with $m$ vertices and an edge $e_{ij}$ if they wanted to maximize $I(\tilde{X}_i;\tilde{X}_j)$, i.e., the mutual information between the clusterings of the $i$-th and $j$-th RV. While they essentially proposed a simple sequential technique, they combined it with splits and merges, i.e., they started with an initial clustering solution in which $\tilde{X}_i=X_i$ for some $i$, while $\tilde{X}_j=const.$ for others.
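Here is the minimal sketch of the objective of Dhillon et al. referred to above: it aggregates $P$ into the cluster-level joint distribution $Q$ and computes $I(\overline{W};\overline{D})$; the count matrix and the cluster assignments are random placeholders. A sequential algorithm would then try to increase this value by reassigning words and documents one at a time.

    import numpy as np

    def coclustering_objective(A, w_lab, d_lab, n_wc, n_dc):
        P = A / A.sum()                                     # joint distribution of (W, D)
        # Aggregate P into the joint distribution Q of the cluster variables.
        Q = np.zeros((n_wc, n_dc))
        for k in range(n_wc):
            for l in range(n_dc):
                Q[k, l] = P[np.ix_(w_lab == k, d_lab == l)].sum()
        q_w = Q.sum(axis=1, keepdims=True)
        q_d = Q.sum(axis=0, keepdims=True)
        mask = Q > 0
        return np.sum(Q[mask] * np.log2(Q[mask] / (q_w @ q_d)[mask]))   # I(Wbar; Dbar) in bits

    rng = np.random.default_rng(0)
    A = rng.poisson(1.0, size=(200, 150)).astype(float)     # toy word-document counts
    w_lab = rng.integers(0, 4, size=200)
    d_lab = rng.integers(0, 4, size=150)
    print(coclustering_objective(A, w_lab, d_lab, 4, 4))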

Spectral/Modularity-Based Methods

Spectral and modularity-based methods assume that the word-document matrix $A$ can be interpreted as the biadjacency matrix of a bipartite graph. In these techniques, one either tries to minimize some cut score, or one tries to maximize modularity, which is always defined relative to a null model. In these models, one always has $\overline{\mathcal{W}}=\overline{\mathcal{D}}$.
  • In "Co-clustering documents and words using bipartite spectral graph partitioning (2001)", Dhillon suggested co-clustering by cutting the graph, minimizing the normalized cut. Relaxing the cut criterion, for $\overline{\mathcal{W}}=\overline{\mathcal{D}}=2$, one can compute the second singular vectors of a normalization of $A$, stack them, and cluster the elements of this single vector into two clusters using k-means. By unstacking, these two clusters then correspond to the two word clusters and the two document clusters. This technique can be extended to $\overline{\mathcal{W}}=\overline{\mathcal{D}}=k$ by computing $\ell=\lceil\log_2 k\rceil$ singular vectors, stacking them to a matrix $Z$, and perform k-means for $k$ clusters on the $\ell$-dimensional rows of $Z$. There is a Matlab implemenation of this method.
  • The paper "Co-clustering for Binary and Categorical Data with Maximum Modularity (2011)" by Labiod and Nadif uses modularity as an objective function; modularity is maximized solving the relaxed generalized eigenvalue problem and then clustering the eigenvalues using k-means just as Dhillon proposed. They focused on binary and categorical data.
  • Ailem et al. proposed modularity as an objective function in "Co-clustering document-term matrices by direct maximization of graph modularity (2015)". They proposed searching for modules in the bipartite graph by alternately maximizing the modularity over word clusters and document clusters, i.e., they first fix the word clusters and maximize modularity over the document clusters, and then fix the document clusters to find optimal word clusters.
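The sketch of Dhillon's spectral procedure referred to above might look as follows; the toy matrix and the number of clusters are placeholders, and scikit-learn also ships a ready-made SpectralCoclustering estimator based on the same paper:

    import numpy as np
    from sklearn.cluster import KMeans

    def spectral_cocluster(A, k):
        d1 = np.sqrt(A.sum(axis=1)) + 1e-12                 # square roots of word degrees
        d2 = np.sqrt(A.sum(axis=0)) + 1e-12                 # square roots of document degrees
        An = A / d1[:, None] / d2[None, :]                  # D1^{-1/2} A D2^{-1/2}
        U, s, Vt = np.linalg.svd(An, full_matrices=False)
        ell = int(np.ceil(np.log2(k)))
        Z_rows = U[:, 1:ell + 1] / d1[:, None]              # skip the first (trivial) singular pair
        Z_cols = Vt[1:ell + 1, :].T / d2[:, None]
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(np.vstack([Z_rows, Z_cols]))
        return labels[:A.shape[0]], labels[A.shape[0]:]     # word clusters, document clusters

    rng = np.random.default_rng(0)
    A = rng.poisson(1.0, size=(60, 40)).astype(float)       # toy word-document matrix
    w_clusters, d_clusters = spectral_cocluster(A, 2)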

Other Methods

  • Co-clustering can also be seen as a non-negative matrix trifactorization problem; Sra and Dhillon present update rules for both the Euclidean distance and the Kullback-Leibler divergence in their report "Generalized nonnegative matrix approximations (2006)". The latter cost function is also used in Information-Theoretic Co-Clustering by Dhillon et al.
  • The paper "Co-clustering through optimal transport (2017)" by Laclau et al. formulates the co-clustering problem as the problem to transport the empirical probability mass from the words to the documents, the solution of which is a joint probability distribution $\tilde{P}$ approximating the empirical one. Entropically regularized transport yields again the Kullback-Leibler divergence as cost function, thus minimizing $D(\tilde{P}\Vert P)$, where $P$ is obtained from $A$ (note the difference to Information-Theoretic Co-Clustering), and where $\tilde{P}$ is connected to variational inference. The authors showed that co-clustering can be solved via the Sinkhorn-Knopp algorithm and suggested a heuristic to determine the number of clusters.
  • Not exactly a co-clustering method, but rather a meta-heuristic, was proposed by Cheng et al. in "Co-ClusterD: A Distributed Framework for Data Co-Clustering with Sequential Updates (2015)". They analyzed the "concurrent" update rule of, e.g., Information-Theoretic Co-Clustering of Dhillon et al. and replaced it by a "sequential" update rule: Essentially, they propose updating the statistics for a co-cluster as soon as a single element of a set changes its cluster membership, thus influencing the next cluster update immediately. Concurrent update rules, on the other hand, update the co-cluster statistics only after all words have been reassigned to word clusters (and similarly for document clusters). They then present a framework for performing co-clustering in a distributed fashion.
  • A hierarchical co-clustering method was introduced by Li et al. in "Hierarchical co-clustering: a new way to organize the music data (2012)". Their "divisive" version applies Dhillon's spectral method recursively, while their "agglomerative" version simply merges clusters in order to minimize cluster heterogeneity.
  • Banerjee et al. view co-clustering as a matrix approximation task in "A generalized maximum entropy approach to Bregman co-clustering and matrix approximation (2007)". Essentially, they view the co-clustering as a way to obtain a summary statistic of $A$ based on the co-clusters, and they proposed several different types of summary statistics (e.g., co-cluster means). They then show that, given $A$ and the type of summary statistic, the optimal summary statistic and the optimal co-clustering are obtained by minimizing any Bregman divergence, such as the squared Euclidean distance or the Kullback-Leibler divergence. Their algorithm is similar to the one proposed by Dhillon et al. for Information-Theoretic Co-Clustering.
  • In "Locally Discriminative Coclustering (2012)", Zhang et al. proposed a cost function mapping a word and a document to the same co-cluster if the corresponding entry in $A$ is large. In addition, they proposed also enforcing co-clusters that respect dependencies between words alone, and between documents alone. They showed that the complete problem can be relaxed to an eigenvalue problem.