There is a recent hype regarding the connection between neural networks (NNs) and the information bottleneck principle. In this post, I try to summarize a few papers that have been published since 2015. A popsci treatment of this topic can be found

here. All linked papers are available on arXiv or on OpenReview.

The reader should be familiar with the

information bottleneck (IB) principle. Specifically, it assumes that we have a class random variable (RV) $Y$ and a (multi-dimensional) feature RV $X$, and that our goal is to compress $X$ to a "less complex" RV $\hat{Y}$ that shares much information with $Y$ but little with $X$ (compression versus preserving relevant information). Mathematically, one tries to minimize

$$

I(X;\hat{Y}) - \beta I(Y;\hat{Y}) \text{ or equivalently } I(X;\hat{Y}) + \beta I(X;Y|\hat{Y})

$$

where $\beta$ is a parameter trading between these two goals. The first term in either expression is often called the rate $R$, while the second term in the latter expression is the information distortion $D_{IB}$. This allows to draw the IB curve with $R$ on one and $D_{IB}$ on the other axis. This curve depends only on the joint distribution of $(X,Y)$ and is monotonically decreasing.

The authors claim that the goal of NNs is to compress the features as much as possible, with the constraint that the compression should be a minimal sufficient statistic of the class RV. The IB principle formulates this goal, hence they claim that it can be used to better understand NNs. Indeed, feed-forward NNs are a sequence of compression functions; the feature RV $X$ is compressed to the output of the first hidden layer $h_1$, which is then compressed to the output of the second hidden layer $h_2$, etc. Finally, the output of the last hidden layer, $h_m$, is compressed to $\hat{Y}$. The authors claim that the IB functional not only characterizes the NN as a whole, but each layer separately. I.e., each layer should minimize $I(h_i;h_{i-1})$ while simultaneously maximizing $I(Y;h_i)$. The former has an interpretation in terms of the minimum description length principle, while the latter bounds the classification error. With this in mind, one can evaluate the optimality of each layer separately by comparing its performance with the IB curve. Each layer can only increase $D_{IB}$, but the hope is that $R$ is reduced by compression.

The authors then discuss finite sample bounds, which are relevant in estimating the IB curve from a finite data set. Specifically, it turns out that the empirical IB curve is bounded from below by the decreasing true IB curve, but from above by a convex function. I.e., for increasing $R$ the actual value on the IB curve may not necessarily decrease with increasing rate $R$. This is exactly the phenomenon of overfitting: A large $R$ may lead to a small $D_{IB}$ on the training set, but to a much larger $D_{IB}$ on the test set. It may thus be beneficial to aim at obtaining a representation close to the minimum of the convex upper bound. There, the rate $R$ is smaller and leads to a slightly larger $D_{IB}$ on the training set, but the worst-case increase of $D_{IB}$ for the test set is bounded. The authors argue that compression is thus necessary for generalization. Based on that, they argue that the IB principle may be used for the design of NNs. Finally, the authors briefly mention that there is a connection between bifurcations of the IB curve corresponding to constraints on $\hat{Y}$ (e.g., cardinality of the support) and the structure of the NN.

Based on this paper, to directions of research have been initiated: Designing NNs using the IB principle, and understanding IBs using the IB principle.

###
Design of NNs

This paper proposes using the IB functional as a cost function for training a NN. Specifically, they decompose the rate $R$, which happens to be $H(\hat{Y})$ because the NN is deterministic, as a telescoping sum of information losses per layer. Moreover, they decompose the information distortion $D_{IB}$ as a sum of terms $I(h_{i-1};Y|h_{i})$; indeed, their equation (19) is precisely my Proposition 4 in

this paper. Their goal is to maximize compression with a constrained distortion, as opposed to the variational formulation of Tishby. Strangely, though, they have their compression goal is minimizing $I(X;\hat{Y})+\sum_{i=1}^m I(X;h_i)$, i.e., they also want to compress the individual layer outputs separately. This leads to a stronger compression factor for early layers, or equivalently, to smaller values of $\beta$ for early layers.

Finally, the authors formulate the training problem not as finding weights, but first finding admissible functions between layers (where admissibility means that the function can be implemented using a NN layer of given structure). They then obtain the weights from the obtained function. This approach is rather strange and hard to implement in practice, I suppose, and that may be the reason why the authors focus on Boolean functions in their example section.

The authors also propose using IB to train a NN, claiming that it is "natural and useful". However, since computing mutual information is "computationally challenging", they replace the terms in the IB functional by upper and lower bounds. Specifically, they replace the decoding distribution $P_{Y|\hat{Y}}$ by a surrogate in the term $I(\hat{Y};Y)$, and the marginal distribution $P_{\hat{Y}}$ by a surrogate in the term $I(\hat{Y};X)$. For example, they assume that $P_{Y|\hat{Y}}$ is a simple softmax and that $P_{\hat{Y}}$ is a spherical Gaussian. Moreover, they assume that the encoder that maps $X$ to $\hat{Y}$ is stochastic, i.e., it responds with a Gaussian RV for which the mean vector and covariance matrix are obtained from a neural network. This allows to sample from the distribution of $\hat{Y}$ and to compute gradients and to use backpropagation. Note that this latter assumption is not one on the surrogate $P_{\hat{Y}|X}$, but rather one on the structure of the NN: Inside the NN there is a "bottleneck layer" that responds with $\hat{Y}$ (it may be the last layer), where the layer explicitly adds Gaussian noise.

They discover that the IB principle works as a decent regularizer in the sense of achieving good classification performance on MNIST data. They moreover plot the information plane (the IB curve) and discover that a poor choice of $\beta$, leading to low compression, leads to overfitting. An interesting figure shows a "two-dimensional bottleneck layer", i.e., $\hat{Y}$ has two components. For small compression $\mathbb{R}^2$ is clearly separated into ten sectors corresponding to the 10 classes of the MNIST data set, but that larger compression these regions tend to overlap, leading to misclassification but also to robustness to adversarial examples. For image classification tasks it was shown that the IB principle leads to NNs that are more robust to adversarial examples.

It is worth mentioning that the authors apply the IB principle to the NN as a whole, i.e., not to individual layers as Tishby suggested. They leave applying it to multiple (or all) layers for future work.

Although conceptually similar to the work of Alemi et al., these authors actually aim at implementing IB for non-Gaussian and non-discrete distributions (leading, generally, to a nonlinear compression function). Again, the authors argue that the IB functional is hard to compute and hence resort to bounds by replacing distributions by surrogate ones. Specifically, they assume a stochastic encoder $P_{\hat{Y}|X}$, which is based on a differential function (implemented using a neural network) to which spherical Gaussian noise is added. They moreover assume that the marginal distribution at the output of the deterministic neural network can be modeled by a Gaussian mixture. Using this marginal distribution and the conditional distribution, they present a non-parametric bound on the term $I(X;\hat{Y})$ (rather than a variational bound as in the previously summarized paper). They moreover implement the decoding function via a neural network and estimate the log decoding probability $\log P_{Y|\hat{Y}}$ by evaluating an "appropriately chosen neural network cost function" and by sampling from the "stochastic" bottleneck variable $\hat{Y}$. In other words, they use two neural networks to implement the encoding and decoding functions of the IB functional. Very interesting is their Figure 2, which shows how the bottleneck layer manages to separate AND compress the classes, such that data points from the same class almost fall together. If the parameter $\beta$ is set to infinity, then compression is not a goal anymore. The bottleneck layer then still manages to separate the classes, but each class corresponds to a somewhat spread-out cloud of data points. Data points are still distinguishable, which means that the bottleneck layer not only contains class information, but also additional information about the data points that is actually irrelevant.

This paper on OpenReview trains a neural network using the IB functional in every layer, i.e., in each layer they want to minimize $I(h_i;h_{i-1})$ and maximize $I(h_i;Y)$, with the trade-off determined by a layer-specific parameter $\beta_i$. They do not rely on a NN decoder as Alemi et al. did, but empirically estimate $I(X;h_i)$ and use a variational approximation for $I(Y;h_i)$. For this, they treat the subsequent layers of the NN as an approximation of $P_{Y|h_i}$. They furthermore estimate $I(h_i;h_{i-1})$ by Monte Carlo sampling. They show that their approach, minimizing the IB functional, performs well on the MNIST data set, where all layers had the same parameter $\beta$.

###
Understanding NNs

This paper probably kicked up most of the dust; a

recording of a talk of Tishby's on this topic is available. In this paper, the authors train a NN with sigmoid activation functions and without regularization. They furthermore evaluate the IB functional for every layer throughout training. They observe that the training phase is split into two parts: A phase in which the training error reduces quickly. This phase is characterized by a stochastic gradient with a large mean and small variance ("drift phase") and by an increase of $I(h_i; Y)$ (the layer learns). Simultaneously, the terms $I(h_i; X)$ seem to increase, too (because they also learn something about the features). This drift phase is very short. The much longer phase is characterized by a compression of the input information subject to the training error criterion (i.e., training error does not change?). In this phase, the gradient has a small mean and a large variance ("diffusion phase"), and is characterized by further increase of $I(h_i; Y)$, but a simultaneous decrease of $I(h_i; X)$ in the deeper layers: The deeper layers compress, in the sense that they forget input information not relevant for classification. This phase is significantly longer. What is interesting is that, without regularization, there is no decrease in weight norms nor sparsity observable. Rather, the compression occurs because of the stochasticity of gradient descent. This suggests that, without regularization, there are multiple NNs with (possibly random) weights that have equal classification performance. It therefore suffices to find a NN in this equivalence class; interpreting the output of the sigmoid function as a probability (which is equivalent to adding noise?) leads to choosing a function with low complexity (rather than one which is random and has high complexity; cf. Section 2.4). The authors furthermore indicate that the converged layers have IB functionals that lie close to the optimal curve for special values of $\beta$, where $\beta$ is smaller for deeper layers (more focus on compression). Furthermore, they observed that the compression phase is faster for NNs with more hidden layers. They also claim that the convergence of the layers to the IB functional can be used in training NNs.

This paper on OpenReview, where a long discussion between Tishby (a reviewer?) and the authors can be found, essentially refutes some of the claims of Tishby's paper. Specifically, they argue that compression in the diffusion phase does not appear for linear and ReLU activation functions. They argure moreover that the compression observed in Tishby's paper is due to the fact that the sigmoid nonlinearity is saturated, leading to small mutual information values when the latter is estimated via binning (to which Tishby responded that the authors don't know how to estimate mutual information). They showed that compression occurs independently of generalization, by showing that 1) a linear network generalizes but does not compress, and that an overfitted network compresses but does not generalize. Finally, they showed that similar results are obtained by batch gradient descent instead of stochastic gradient descent, which shows that compression is not achieved by stochasticity. The verdict which of these papers is closer to the truth is still out. It is worth mentioning, though, that they discuss several problems with information-theoretic cost functions in Appendix C, where they also discuss the importance of noise.