Some time ago I wrote about an inequality between conditional entropy and error probability. Specifically, one can show that if $\hat{x}(Y)$ is the maximum a posteriori estimate of a discrete random variable $X$ given the random observation $Y$, then the probability of making an error is bounded from above by
$$
P_e = \mathbb{P}(X\neq \hat{x}(Y)) \le 1- 2^{H(X|Y)}.
$$
This inequality is not new. I was not aware, however, of any previous references until I met Igal Sason at the International Zurich Seminar earlier this year. He was speaking about a paper he recently published which includes this inequality (eq. (2)), together with many more general results relating error probability with Arimoto-Renyi conditional entropy.
I was hoping to generalize above inequality to the mismatched case. Specifically, let $p_{X|Y}$ be the conditional distribution of $X$ given $Y$. The MAP estimator is $\hat{x}(y) = \arg\max_x p_{X|Y}(x|y)$. Now suppose that the MAP estimator does not have access to $p_{X|Y}$ but only to a surrogate distribution $q_{X|Y}$, in which case the estimator becomes $\tilde{x}(y) = \arg\max_x q_{X|Y}(x|y)$. I thought to get a bound similar to the one above that includes this surrogate distribution and that bounds the error probability $\tilde{P}_e = \mathbb{P}(X\neq \tilde{x}(Y))$. To be more precise, I was hoping to exchange the conditional entropy above by some conditional cross entropy, i.e., to obtain a bound similar to
$$
\tilde{P}_e \le 1- 2^{\mathbb{E}(H^\times(p_{X|Y}(\cdot|Y)\Vert q_{X|Y}(\cdot|Y)))}
$$
possibly with the order of the distributions reversed. Since cross entropy is an upper bound on entropy, the bound will be larger than the bound for $P_e$. (Interestingly, even if $p_{X|Y}$ and $q_{X|Y}$ are different, we nay have $\tilde{P}_e =P_e$: It suffices that, for every $y$, both conditional distributions have their modes at the same values of $x$.)
Unfortunately, my hope was not justified. We show this in a counterexample, i.e., we show that there exists distributions such that the inequality on $\tilde{P}_e$ is violated (for any choice of order of distributions). For the sake of simplicity, we now look at guessing rather than estimation, i.e., the conditional distributions satisfy $p_{X|Y}=p_X$ and $q_{X|Y}=q_{X}$. Now suppose that $X$ is binary, i.e., it takes values 0 and 1. Furthermore, let $p_X(0)=q_X(1)=\epsilon\ll 0.5$. We have that $\tilde{x}=\arg\max_x q_X(x) = 0$, thus
$$\tilde{P}_e=\mathbb{P}(X\neq 0)=1-\epsilon.$$
Since $p_X$ and $q_X$ are permutations of each other, we have that $H(q_X)=H(p_X)$, that $D(p_X\Vert q_X) = D(q_X\Vert p_X)$, and that thus
$$
H^\times(p_X\Vert q_X) = H^\times(q_X\Vert p_X) = -\epsilon\log(1-\epsilon)-(1-\epsilon)\log\epsilon = -\log \epsilon - \epsilon\log\left(\frac{1-\epsilon}{\epsilon}\right).
$$
With this we get
$$
2^{-H^\times(p_X\Vert q_X)} = \epsilon\cdot 2^{\epsilon\log\left(\frac{1-\epsilon}{\epsilon}\right)} > \epsilon
$$
and thus the bound $\tilde{P}_e \le 1- 2^{-H^\times(p_X\Vert q_X)}$ cannot hold.
A collection of mathematical curiosities, chosen purely based on my own interest. Be prepared for a little information theory, combinatorics, and probability! (This blog is trackable: 5ZVNC2)
Showing posts with label information theory. Show all posts
Showing posts with label information theory. Show all posts
Thursday, March 15, 2018
Monday, February 26, 2018
Talk at "International Zurich Seminar on Information and Communication"
Just coming back from Zurich, where I attended the Int. Zurich Seminar. Being there for the fourth time in a row, it felt like a class reunion :). I was speaking about a joint work with Tobi Koch on the information dimension of multivariate Gaussian processes (see below for the slides). The work is on arXiv here and here, and Tobi and I discussed a roadmap for more results to come - stay tuned!
By the way, the IZS publishes its submissions on an ETH server that is free to access for everyone and that does not require to transfer copyrights from the authors to the publisher. This is not only open access at its best, but also gives you a chance to look at the amazing papers presented in Zurich.
Monday, February 19, 2018
"A Rate-Distortion Approach to Caching" Just Got Published!
Our joint work on rate-distortion theory for caching just got published in the IEEE Transactions on Information Theory (it's also available on arXiv, of course; there's also an older post). Thanks Roy, Shirin, and Michele for letting me be a part of this!
In this paper, we consider a lossy single-user caching problem with correlated sources -- just think of streaming compressed videos! Most users will watch these videos in the evening, leading to network congestion. If you have a player with a cache, though, you can fill this cache with data during times of low network usage, even though you may not know which video the user wants to watch in the evening. In our paper, we characterize the transmission rate required in the evening as a function of the cache size and as a function of the distortion one accepts when watching the videos. We furthermore hint at what should be put in the cache such that it is useful for a variety of videos, and we connect these results to the common-information measures proposed by Wyner, Gacs and Koerner.
Friday, January 26, 2018
Talk at "Advanced Topics in Discrete Mathematics"
I was recently giving a talk in the "Advanced Topics in Discrete Mathematics" series organized by the doctoral program on discrete mathematics. In case you missed it, be glad that you did: I went overtime quite a bit. You can obtain the slide by clicking on the image below.
Wednesday, January 17, 2018
My Story with Polar Codes Comes to an End
I have long been wondering whether polar codes are fractal, found out that they are in 2015, and put a paper on arXiv. The story that began in 2010 finally comes to end -- the paper is now published!
I submitted the paper to the IEEE Transactions on Information Theory, where it was under review for 18 months. I was told that the two reviewers thought the paper collects a few "interesting and cute mathematical observations", but that it does not tell us anything about the relevant code properties such as error probabilities or code complexity. I was further told that if I could "add a convincing application", then the paper could get to a second review round.
I withdrew the paper and submitted it to Entropy, and open-access journal published by MDPI, in October 2017. I received reviews within less than two months. Four of the five reviewers were inclined to accept the paper after some revision, while the fifth reviewer insisted to reject the paper (and probably still does) because of its missing connection to real-world codes. It is worth mentioning that the review process was not only much faster than the one for the IEEE Transactions on Information Theory, but that I also got high-quality reviews: Some of the reviewers went through the math in detail, made corrections and suggestions for improvement that led to generalizing some of the results. This shows that a proper, high-quality review process cannot be used to justify long review cycles.
I submitted the paper to the IEEE Transactions on Information Theory, where it was under review for 18 months. I was told that the two reviewers thought the paper collects a few "interesting and cute mathematical observations", but that it does not tell us anything about the relevant code properties such as error probabilities or code complexity. I was further told that if I could "add a convincing application", then the paper could get to a second review round.
I withdrew the paper and submitted it to Entropy, and open-access journal published by MDPI, in October 2017. I received reviews within less than two months. Four of the five reviewers were inclined to accept the paper after some revision, while the fifth reviewer insisted to reject the paper (and probably still does) because of its missing connection to real-world codes. It is worth mentioning that the review process was not only much faster than the one for the IEEE Transactions on Information Theory, but that I also got high-quality reviews: Some of the reviewers went through the math in detail, made corrections and suggestions for improvement that led to generalizing some of the results. This shows that a proper, high-quality review process cannot be used to justify long review cycles.
Friday, January 12, 2018
Reading Notes: Neural Networks and the Information Bottleneck Principle
There is a recent hype regarding the connection between neural networks (NNs) and the information bottleneck principle. In this post, I try to summarize a few papers that have been published since 2015. A popsci treatment of this topic can be found here. All linked papers are available on arXiv or on OpenReview.
The reader should be familiar with the information bottleneck (IB) principle. Specifically, it assumes that we have a class random variable (RV) $Y$ and a (multi-dimensional) feature RV $X$, and that our goal is to compress $X$ to a "less complex" RV $\hat{Y}$ that shares much information with $Y$ but little with $X$ (compression versus preserving relevant information). Mathematically, one tries to minimize
$$
I(X;\hat{Y}) - \beta I(Y;\hat{Y}) \text{ or equivalently } I(X;\hat{Y}) + \beta I(X;Y|\hat{Y})
$$
where $\beta$ is a parameter trading between these two goals. The first term in either expression is often called the rate $R$, while the second term in the latter expression is the information distortion $D_{IB}$. This allows to draw the IB curve with $R$ on one and $D_{IB}$ on the other axis. This curve depends only on the joint distribution of $(X,Y)$ and is monotonically decreasing.
The reader should be familiar with the information bottleneck (IB) principle. Specifically, it assumes that we have a class random variable (RV) $Y$ and a (multi-dimensional) feature RV $X$, and that our goal is to compress $X$ to a "less complex" RV $\hat{Y}$ that shares much information with $Y$ but little with $X$ (compression versus preserving relevant information). Mathematically, one tries to minimize
$$
I(X;\hat{Y}) - \beta I(Y;\hat{Y}) \text{ or equivalently } I(X;\hat{Y}) + \beta I(X;Y|\hat{Y})
$$
where $\beta$ is a parameter trading between these two goals. The first term in either expression is often called the rate $R$, while the second term in the latter expression is the information distortion $D_{IB}$. This allows to draw the IB curve with $R$ on one and $D_{IB}$ on the other axis. This curve depends only on the joint distribution of $(X,Y)$ and is monotonically decreasing.
2015 - Tishby and Zaslavsky, "Deep Learning and the Information Bottleneck Principle". How it all started.
The authors claim that the goal of NNs is to compress the features as much as possible, with the constraint that the compression should be a minimal sufficient statistic of the class RV. The IB principle formulates this goal, hence they claim that it can be used to better understand NNs. Indeed, feed-forward NNs are a sequence of compression functions; the feature RV $X$ is compressed to the output of the first hidden layer $h_1$, which is then compressed to the output of the second hidden layer $h_2$, etc. Finally, the output of the last hidden layer, $h_m$, is compressed to $\hat{Y}$. The authors claim that the IB functional not only characterizes the NN as a whole, but each layer separately. I.e., each layer should minimize $I(h_i;h_{i-1})$ while simultaneously maximizing $I(Y;h_i)$. The former has an interpretation in terms of the minimum description length principle, while the latter bounds the classification error. With this in mind, one can evaluate the optimality of each layer separately by comparing its performance with the IB curve. Each layer can only increase $D_{IB}$, but the hope is that $R$ is reduced by compression.
The authors then discuss finite sample bounds, which are relevant in estimating the IB curve from a finite data set. Specifically, it turns out that the empirical IB curve is bounded from below by the decreasing true IB curve, but from above by a convex function. I.e., for increasing $R$ the actual value on the IB curve may not necessarily decrease with increasing rate $R$. This is exactly the phenomenon of overfitting: A large $R$ may lead to a small $D_{IB}$ on the training set, but to a much larger $D_{IB}$ on the test set. It may thus be beneficial to aim at obtaining a representation close to the minimum of the convex upper bound. There, the rate $R$ is smaller and leads to a slightly larger $D_{IB}$ on the training set, but the worst-case increase of $D_{IB}$ for the test set is bounded. The authors argue that compression is thus necessary for generalization. Based on that, they argue that the IB principle may be used for the design of NNs. Finally, the authors briefly mention that there is a connection between bifurcations of the IB curve corresponding to constraints on $\hat{Y}$ (e.g., cardinality of the support) and the structure of the NN.
Based on this paper, to directions of research have been initiated: Designing NNs using the IB principle, and understanding IBs using the IB principle.
Design of NNs
2016 - Khadivi, Tandon, Ramakrishnan, "Flow of Information in Feed-Forward Deep Neural Networks"
This paper proposes using the IB functional as a cost function for training a NN. Specifically, they decompose the rate $R$, which happens to be $H(\hat{Y})$ because the NN is deterministic, as a telescoping sum of information losses per layer. Moreover, they decompose the information distortion $D_{IB}$ as a sum of terms $I(h_{i-1};Y|h_{i})$; indeed, their equation (19) is precisely my Proposition 4 in this paper. Their goal is to maximize compression with a constrained distortion, as opposed to the variational formulation of Tishby. Strangely, though, they have their compression goal is minimizing $I(X;\hat{Y})+\sum_{i=1}^m I(X;h_i)$, i.e., they also want to compress the individual layer outputs separately. This leads to a stronger compression factor for early layers, or equivalently, to smaller values of $\beta$ for early layers.
Finally, the authors formulate the training problem not as finding weights, but first finding admissible functions between layers (where admissibility means that the function can be implemented using a NN layer of given structure). They then obtain the weights from the obtained function. This approach is rather strange and hard to implement in practice, I suppose, and that may be the reason why the authors focus on Boolean functions in their example section.
2016 - Alemi, Fischer, Dillon, Murphy, "Deep Variational Information Bottleneck"
The authors also propose using IB to train a NN, claiming that it is "natural and useful". However, since computing mutual information is "computationally challenging", they replace the terms in the IB functional by upper and lower bounds. Specifically, they replace the decoding distribution $P_{Y|\hat{Y}}$ by a surrogate in the term $I(\hat{Y};Y)$, and the marginal distribution $P_{\hat{Y}}$ by a surrogate in the term $I(\hat{Y};X)$. For example, they assume that $P_{Y|\hat{Y}}$ is a simple softmax and that $P_{\hat{Y}}$ is a spherical Gaussian. Moreover, they assume that the encoder that maps $X$ to $\hat{Y}$ is stochastic, i.e., it responds with a Gaussian RV for which the mean vector and covariance matrix are obtained from a neural network. This allows to sample from the distribution of $\hat{Y}$ and to compute gradients and to use backpropagation. Note that this latter assumption is not one on the surrogate $P_{\hat{Y}|X}$, but rather one on the structure of the NN: Inside the NN there is a "bottleneck layer" that responds with $\hat{Y}$ (it may be the last layer), where the layer explicitly adds Gaussian noise.
They discover that the IB principle works as a decent regularizer in the sense of achieving good classification performance on MNIST data. They moreover plot the information plane (the IB curve) and discover that a poor choice of $\beta$, leading to low compression, leads to overfitting. An interesting figure shows a "two-dimensional bottleneck layer", i.e., $\hat{Y}$ has two components. For small compression $\mathbb{R}^2$ is clearly separated into ten sectors corresponding to the 10 classes of the MNIST data set, but that larger compression these regions tend to overlap, leading to misclassification but also to robustness to adversarial examples. For image classification tasks it was shown that the IB principle leads to NNs that are more robust to adversarial examples.
It is worth mentioning that the authors apply the IB principle to the NN as a whole, i.e., not to individual layers as Tishby suggested. They leave applying it to multiple (or all) layers for future work.
2017 - Kolchinsky, Tracey, Wolpert, "Nonlinear Information Bottleneck"
Although conceptually similar to the work of Alemi et al., these authors actually aim at implementing IB for non-Gaussian and non-discrete distributions (leading, generally, to a nonlinear compression function). Again, the authors argue that the IB functional is hard to compute and hence resort to bounds by replacing distributions by surrogate ones. Specifically, they assume a stochastic encoder $P_{\hat{Y}|X}$, which is based on a differential function (implemented using a neural network) to which spherical Gaussian noise is added. They moreover assume that the marginal distribution at the output of the deterministic neural network can be modeled by a Gaussian mixture. Using this marginal distribution and the conditional distribution, they present a non-parametric bound on the term $I(X;\hat{Y})$ (rather than a variational bound as in the previously summarized paper). They moreover implement the decoding function via a neural network and estimate the log decoding probability $\log P_{Y|\hat{Y}}$ by evaluating an "appropriately chosen neural network cost function" and by sampling from the "stochastic" bottleneck variable $\hat{Y}$. In other words, they use two neural networks to implement the encoding and decoding functions of the IB functional. Very interesting is their Figure 2, which shows how the bottleneck layer manages to separate AND compress the classes, such that data points from the same class almost fall together. If the parameter $\beta$ is set to infinity, then compression is not a goal anymore. The bottleneck layer then still manages to separate the classes, but each class corresponds to a somewhat spread-out cloud of data points. Data points are still distinguishable, which means that the bottleneck layer not only contains class information, but also additional information about the data points that is actually irrelevant.
2018 - ?, Parametric Information Bottleneck to Optimize Stochastic Neural Networks
This paper on OpenReview trains a neural network using the IB functional in every layer, i.e., in each layer they want to minimize $I(h_i;h_{i-1})$ and maximize $I(h_i;Y)$, with the trade-off determined by a layer-specific parameter $\beta_i$. They do not rely on a NN decoder as Alemi et al. did, but empirically estimate $I(X;h_i)$ and use a variational approximation for $I(Y;h_i)$. For this, they treat the subsequent layers of the NN as an approximation of $P_{Y|h_i}$. They furthermore estimate $I(h_i;h_{i-1})$ by Monte Carlo sampling. They show that their approach, minimizing the IB functional, performs well on the MNIST data set, where all layers had the same parameter $\beta$.
Understanding NNs
2017 - Schwartz-Ziv, Tishby, "Opening the Black Box of Deep Neural Networks via Information"
This paper probably kicked up most of the dust; a recording of a talk of Tishby's on this topic is available. In this paper, the authors train a NN with sigmoid activation functions and without regularization. They furthermore evaluate the IB functional for every layer throughout training. They observe that the training phase is split into two parts: A phase in which the training error reduces quickly. This phase is characterized by a stochastic gradient with a large mean and small variance ("drift phase") and by an increase of $I(h_i; Y)$ (the layer learns). Simultaneously, the terms $I(h_i; X)$ seem to increase, too (because they also learn something about the features). This drift phase is very short. The much longer phase is characterized by a compression of the input information subject to the training error criterion (i.e., training error does not change?). In this phase, the gradient has a small mean and a large variance ("diffusion phase"), and is characterized by further increase of $I(h_i; Y)$, but a simultaneous decrease of $I(h_i; X)$ in the deeper layers: The deeper layers compress, in the sense that they forget input information not relevant for classification. This phase is significantly longer. What is interesting is that, without regularization, there is no decrease in weight norms nor sparsity observable. Rather, the compression occurs because of the stochasticity of gradient descent. This suggests that, without regularization, there are multiple NNs with (possibly random) weights that have equal classification performance. It therefore suffices to find a NN in this equivalence class; interpreting the output of the sigmoid function as a probability (which is equivalent to adding noise?) leads to choosing a function with low complexity (rather than one which is random and has high complexity; cf. Section 2.4). The authors furthermore indicate that the converged layers have IB functionals that lie close to the optimal curve for special values of $\beta$, where $\beta$ is smaller for deeper layers (more focus on compression). Furthermore, they observed that the compression phase is faster for NNs with more hidden layers. They also claim that the convergence of the layers to the IB functional can be used in training NNs.
2018 - ?, On the Information Bottleneck Theory of Deep Learning
This paper on OpenReview, where a long discussion between Tishby (a reviewer?) and the authors can be found, essentially refutes some of the claims of Tishby's paper. Specifically, they argue that compression in the diffusion phase does not appear for linear and ReLU activation functions. They argure moreover that the compression observed in Tishby's paper is due to the fact that the sigmoid nonlinearity is saturated, leading to small mutual information values when the latter is estimated via binning (to which Tishby responded that the authors don't know how to estimate mutual information). They showed that compression occurs independently of generalization, by showing that 1) a linear network generalizes but does not compress, and that an overfitted network compresses but does not generalize. Finally, they showed that similar results are obtained by batch gradient descent instead of stochastic gradient descent, which shows that compression is not achieved by stochasticity. The verdict which of these papers is closer to the truth is still out. It is worth mentioning, though, that they discuss several problems with information-theoretic cost functions in Appendix C, where they also discuss the importance of noise.
Wednesday, November 30, 2016
(Lossy) Wyner's Common Information and Gacs-Körner's Common Randomness Revisited
Recently, I was involved (not so much, to be honest) in a research project dealing with caching, i.e., the storage of data close to a user such that it need not be transmitted anymore. A typical use case is Netflix: Netflix offers $N$ movies/series, and the user can choose any of these anytime. Most users watch during prime time, which taxes the communication network. During the day, the communication network has unused capacities, so one could try storing part of the Netflix database on a hard drive connected to the user's player. This reduces the network load during prime time and improves the user experience.
I our work, we not only looked at the fundamental limits of this caching scheme, but we also got a good idea of what should actually be stored in the user's hard drive ("cache"). The answers are decades old: One should store either what is known as Gacs-Körner's common randomness, or what is known as Wyner's common information.
For simplicity, we assume that there are only two files available on Netflix, namely $X$ and $Y$. The common information of these two random variables (RVs; in information theory, everything is random) is another RV $W$ that minimizes
$$I(X,Y; W)$$
subject to $X-W-Y$, i.e., subject to the fact that, given $W$, $X$ and $Y$ are conditionally independent. The common randomness of these two RVs is an RV $U$ that maximizes
$$ I(X,Y; U) $$
subject to the constraint that $X-Y-U$ and $Y-X-U$. It can be shown that
$$ I(X,Y; U) \le I(X;Y) \le I(X,Y; W).$$
Common randomness is something that is truly common, in the usual sense of the word. In particular, if $A$, $B$, and $C$ are independent RVs, and if $X=(A,B)$ and $Y=(A,C)$, then $U=A$. So much for the math. Now here comes an interpretation (and probably one for which an information theorists wants to punch me in the face): If $X$ and $Y$ are two different episodes of your favorite series, then $U$ is the opening theme. If you have a small hard drive, storing the opening theme is probably the most useful idea. Common information is a bit more intricate, but an intuitive explanation is the following: Suppose $X$ and $Y$ are two different episodes of your favorite series, and suppose the only thing $X$ and $Y$ have in common is an actress called $A$. In $A$, some parts of $A$ are seen, while in $Y$ other parts of $A$ are seen. (Say, $X$ shows $A$ running, while $Y$ shows $A$ talking to a person.) Obviously, if you could store all information about actress $A$ as a whole on your hard drive, the small computer in your player could reconstruct running scenes and talking scenes in which $A$ participates. Thus, if your hard drive is large enough, that's what you should store.
Let's take this one step further. Suppose that you are fine with watching not $X$ or $Y$, but lower-quality versions $\hat X$ or $\hat Y$, respectively. This requires the definition of lossy versions of common randomness and common information. And indeed, such definitions have been proposed in 1403.8093. Namely, the lossy common information is an RV $W$ that minimizes
$$ I(X,Y; W)$$
subject to $\hat X - W - \hat Y$ and $(X,Y) - (\hat X,\hat Y) - W$. Lossy common randomness is an RV $U$ that maximizes
$$ I(X,Y; U) $$
subject to the constraints $X-Y-U$, $Y-X-U$, $X-\hat X - U$, and $Y-\hat Y-U$. These definitions are all meaningful because they solve communication problems: The get their operational characterization in the Gray-Wyner problem (1403.8093) or in lossy caching (1610.07304).
The lossy version of common information seems fine to me: I only need to store as much information about actress $A$ in order to be able to reconstruct approximate depictions of her running and speaking. This might need more or less space on the hard drive than the perfect reconstruction: I might be able to store not only this reduced version of the actress $A$, but I might also store a, say, generic building, that allows to reconstruct two different but similar houses in episodes $X$ and $Y$. And indeed, lossy common information may be larger or smaller than lossless common information (see Lemma 1 in 1403.8093).
The lossy version of common randomness, however, is not fully satisfactory to me. If I am not interested in perfect reconstruction, should I not be able to store more on my internal hard drive? For example, if the credits of my two episodes $X$ and $Y$ differ only in a few actors' names, can I not store a "noisy" version of the credits in addition to the opening theme? Nevertheless, the lossy version of common randomness is usually smaller than the lossless version (see Corollary 2 in 1403.8093).
I would thus suggest different definitions of lossy common randomness: The lossy common randomness is an RV $U'$ that maximizes
$$ I(X,Y; U') \text{ or } I(\hat X,\hat Y; U')$$
subject to the constraints $\hat X-\hat Y-U'$, $\hat Y-\hat X-U'$, $X-\hat X - U'$, and $Y-\hat Y-U'$. The difference is subtle and mirrors the definition of lossy common information. While lossy common information requires $W$ to make the reconstructions $\hat X$ and $\hat Y$ conditionally independent, the original definition of lossy common randomness requires that $U$ is a common part of $X$ and $Y$ (the first two Markov constraints). The definition proposed here would require that $U'$ is only a common part of the reconstructions $\hat{X}$ and $\hat Y$. Given this definition, we may ask the following questions:
I our work, we not only looked at the fundamental limits of this caching scheme, but we also got a good idea of what should actually be stored in the user's hard drive ("cache"). The answers are decades old: One should store either what is known as Gacs-Körner's common randomness, or what is known as Wyner's common information.
For simplicity, we assume that there are only two files available on Netflix, namely $X$ and $Y$. The common information of these two random variables (RVs; in information theory, everything is random) is another RV $W$ that minimizes
$$I(X,Y; W)$$
subject to $X-W-Y$, i.e., subject to the fact that, given $W$, $X$ and $Y$ are conditionally independent. The common randomness of these two RVs is an RV $U$ that maximizes
$$ I(X,Y; U) $$
subject to the constraint that $X-Y-U$ and $Y-X-U$. It can be shown that
$$ I(X,Y; U) \le I(X;Y) \le I(X,Y; W).$$
Common randomness is something that is truly common, in the usual sense of the word. In particular, if $A$, $B$, and $C$ are independent RVs, and if $X=(A,B)$ and $Y=(A,C)$, then $U=A$. So much for the math. Now here comes an interpretation (and probably one for which an information theorists wants to punch me in the face): If $X$ and $Y$ are two different episodes of your favorite series, then $U$ is the opening theme. If you have a small hard drive, storing the opening theme is probably the most useful idea. Common information is a bit more intricate, but an intuitive explanation is the following: Suppose $X$ and $Y$ are two different episodes of your favorite series, and suppose the only thing $X$ and $Y$ have in common is an actress called $A$. In $A$, some parts of $A$ are seen, while in $Y$ other parts of $A$ are seen. (Say, $X$ shows $A$ running, while $Y$ shows $A$ talking to a person.) Obviously, if you could store all information about actress $A$ as a whole on your hard drive, the small computer in your player could reconstruct running scenes and talking scenes in which $A$ participates. Thus, if your hard drive is large enough, that's what you should store.
Let's take this one step further. Suppose that you are fine with watching not $X$ or $Y$, but lower-quality versions $\hat X$ or $\hat Y$, respectively. This requires the definition of lossy versions of common randomness and common information. And indeed, such definitions have been proposed in 1403.8093. Namely, the lossy common information is an RV $W$ that minimizes
$$ I(X,Y; W)$$
subject to $\hat X - W - \hat Y$ and $(X,Y) - (\hat X,\hat Y) - W$. Lossy common randomness is an RV $U$ that maximizes
$$ I(X,Y; U) $$
subject to the constraints $X-Y-U$, $Y-X-U$, $X-\hat X - U$, and $Y-\hat Y-U$. These definitions are all meaningful because they solve communication problems: The get their operational characterization in the Gray-Wyner problem (1403.8093) or in lossy caching (1610.07304).
The lossy version of common information seems fine to me: I only need to store as much information about actress $A$ in order to be able to reconstruct approximate depictions of her running and speaking. This might need more or less space on the hard drive than the perfect reconstruction: I might be able to store not only this reduced version of the actress $A$, but I might also store a, say, generic building, that allows to reconstruct two different but similar houses in episodes $X$ and $Y$. And indeed, lossy common information may be larger or smaller than lossless common information (see Lemma 1 in 1403.8093).
The lossy version of common randomness, however, is not fully satisfactory to me. If I am not interested in perfect reconstruction, should I not be able to store more on my internal hard drive? For example, if the credits of my two episodes $X$ and $Y$ differ only in a few actors' names, can I not store a "noisy" version of the credits in addition to the opening theme? Nevertheless, the lossy version of common randomness is usually smaller than the lossless version (see Corollary 2 in 1403.8093).
I would thus suggest different definitions of lossy common randomness: The lossy common randomness is an RV $U'$ that maximizes
$$ I(X,Y; U') \text{ or } I(\hat X,\hat Y; U')$$
subject to the constraints $\hat X-\hat Y-U'$, $\hat Y-\hat X-U'$, $X-\hat X - U'$, and $Y-\hat Y-U'$. The difference is subtle and mirrors the definition of lossy common information. While lossy common information requires $W$ to make the reconstructions $\hat X$ and $\hat Y$ conditionally independent, the original definition of lossy common randomness requires that $U$ is a common part of $X$ and $Y$ (the first two Markov constraints). The definition proposed here would require that $U'$ is only a common part of the reconstructions $\hat{X}$ and $\hat Y$. Given this definition, we may ask the following questions:
- Does the new definition satisfy the desired property that it is larger than the lossless common randomness?
- Is there a communications problem for which this definition is a single-letter characterization?
Thursday, September 8, 2016
My First Lecture in Numbers
It's done! The semester is over and I graded the last student of my lecture "Information-Theoretic System Analysis and Design". It was a great experience, and I thank Gerhard Kramer for offering me this opportunity. I also owe Lars, my colleage, more than a few drinks for taking care of most of the problem classes, and my colleagues Andrei and Ali for interesting discussions and comments about the course material. And I also thank my students, who bore with me until the end of the semester, and who were never tired of asking interesting questions -- I think I've never had such an engaged, excited audience.
Here's the course in numbers:
Here's the course in numbers:
- (I think) I held 14 lectures. Lars and I shared 6 problem classes.
- I spend roughly 200 hours preparing course notes, lecturing, correcting deliverables, etc.
- In the problem classes, students calculated 11 problems in front of their peers. The remaining 7 problems were done by Lars.
- We had 1 exam and 2 mandatory homework assignments, weighed 50%-25%-25%.
- Both homeworks had 9 problems.
- We had 8 regular students with (as far as I know) 8 different nationalities, originating from 2 continents.
- 7 of these students took the exam and delivered the 2 homework exercises. The average grade of this course was 1.35.
- The course notes totaled to 102 pages, containing 8 chapters, 40 worked examples, and 62 end-of-chapter problems.
- 6 students finished the course evaluation. An excerpt is shown below.
And here are some lessons I learned:
- My PhD thesis is far from complete.
- There is never enough time to prepare course notes. Even though the course was based on my PhD thesis, it was hard work to prepare the results in an understandable manner.
- You have to make sure that the description of homework problems is precise and complete.
- Don't get confused by registration numbers. 30+ students registered for the course, but in the first lecture less than 10 showed up. Apparently (so I was told), many students register for a course to get the course notes on Moodle.
- Students prefer a script that they can get printed at the Fachschaft. Putting chapters online a few days before the lecture is ok, but a script is better.
- It's good not to have a sttrict course syllabus if's the first time you give the course. You never how exactly how much time you have to spend at a given topic, so it's better to leave room for changes. Still, it's a good idea to have at least a rough sketch of what topics you want to cover.
Monday, June 20, 2016
The “I” in IT
This article has been published in the June issue of the Information Theory Society Newsletter, as this month's student column (students are always encouraged to send us their thoughts and ideas!). I gratefully acknowledge Gerhard Kramer's help in improving this article.
Have you ever asked yourself whether IT is the right field of science for you? I never did, but I guess IT wondered more than once whether I am the right person to work on its problems. Here’s my story.
I’m currently a postdoc at the Institute for Communications Engineering, Technical University of Munich, Germany. In this regard, my story has a happy ending, something I wouldn’t have dreamed of years ago. When I started my PhD at Graz University of Technology, Austria, I hadn’t heard a lecture on IT yet. My supervisor and I nevertheless decided to attempt developing a “theory for deterministic information processing”, or what we called it back then. I knew entropy from communications engineering and thought that I could get all the necessary information from Chapter 2 of the “Elements”. I later read Chapter 4 for stochastic processes and even resorted to more detailed results, such as those from Kolmogorov and Pinsker, when I tried taking limits of random variables. Nevertheless, phrases like “a sequence of codes” never appeared in my work. Probably the rest of my remaining scientific credibility would be lost if I told you how many of my papers got rejected - so I won’t. I will tell you, though, what most reviewers agreed on: That the information-theoretic quantities I introduced to characterize deterministic systems lack an operational characterization. That was a valid criticism, and I learned what an operational characterization is only a few months before I obtained my PhD. Relating to IEEE Spectrum’s article on Claude Shannon, I despaired that I had “jumped on the bandwagon without really understanding” (Slepian) and contributed to the “widespread abuse” (Pierce) of IT by other fields.
However, since then I have discovered that this kind of “abuse” is successful in scientific fields outside IT. For example, nonlinear adaptive filters are trained using an error entropy, Rényi and Cauchy-Schwarz divergences are used for learning and classification, and the information-bottleneck method enjoys widespread use in clustering. To the best of my knowledge, only few of these works accompany their objective functions with an operational characterization - mutual information is “just” a nonlinear measure of statistical dependence, and entropy is “just” a statistic that captures more than the error signal’s variance. Remarkably, these heuristic methods often show superior performance. Hence, at least in these other fields, the works are scientifically justified by experiment. (OK, I admit that this is a feeble attempt to justify my work a posteriori. As more self-justification, the Aims & Scope of the IEEE Transactions on Information Theory does state that the journal aims to publish theoretical and experimental papers.)
Should I, therefore, try to publish in other fields and transactions? Probably. My supervisor, suggested more than once that I should spread the word, trying to convince scientists in signal processing to use higher-order statistics, i.e., IT quantities. I haven’t listened, though, because I feel at home in the IT community, and I would not want to miss meeting my IT friends regularly at our annual events. Even in hindsight, I would rather submit to orthodox pressure and provide operational characterizations rather than to publish in a different community. In the future, I hope that I can do both.
That is my decision. As the IT society, we all must decide: what are our traditions (really) and how strong shall we hold on to them? Especially when dealing with future directions of IT, as mentioned in arXiv:1507.05941, how should we contribute to make our quantities be used in fields such as signal processing, control theory, and machine learning? Or should we not? For example, should we propose clustering algorithms using IT quantities, or should we rather focus on deriving asymptotic limits for the clustering problem? You, as a PhD student, must make your own decision: Are you sufficiently “in IT” so as to be accepted by the community? If your research topic is on the border between IT and another field, in which direction do you want to go?
Have you ever asked yourself whether IT is the right field of science for you? I never did, but I guess IT wondered more than once whether I am the right person to work on its problems. Here’s my story.
I’m currently a postdoc at the Institute for Communications Engineering, Technical University of Munich, Germany. In this regard, my story has a happy ending, something I wouldn’t have dreamed of years ago. When I started my PhD at Graz University of Technology, Austria, I hadn’t heard a lecture on IT yet. My supervisor and I nevertheless decided to attempt developing a “theory for deterministic information processing”, or what we called it back then. I knew entropy from communications engineering and thought that I could get all the necessary information from Chapter 2 of the “Elements”. I later read Chapter 4 for stochastic processes and even resorted to more detailed results, such as those from Kolmogorov and Pinsker, when I tried taking limits of random variables. Nevertheless, phrases like “a sequence of codes” never appeared in my work. Probably the rest of my remaining scientific credibility would be lost if I told you how many of my papers got rejected - so I won’t. I will tell you, though, what most reviewers agreed on: That the information-theoretic quantities I introduced to characterize deterministic systems lack an operational characterization. That was a valid criticism, and I learned what an operational characterization is only a few months before I obtained my PhD. Relating to IEEE Spectrum’s article on Claude Shannon, I despaired that I had “jumped on the bandwagon without really understanding” (Slepian) and contributed to the “widespread abuse” (Pierce) of IT by other fields.
However, since then I have discovered that this kind of “abuse” is successful in scientific fields outside IT. For example, nonlinear adaptive filters are trained using an error entropy, Rényi and Cauchy-Schwarz divergences are used for learning and classification, and the information-bottleneck method enjoys widespread use in clustering. To the best of my knowledge, only few of these works accompany their objective functions with an operational characterization - mutual information is “just” a nonlinear measure of statistical dependence, and entropy is “just” a statistic that captures more than the error signal’s variance. Remarkably, these heuristic methods often show superior performance. Hence, at least in these other fields, the works are scientifically justified by experiment. (OK, I admit that this is a feeble attempt to justify my work a posteriori. As more self-justification, the Aims & Scope of the IEEE Transactions on Information Theory does state that the journal aims to publish theoretical and experimental papers.)
Should I, therefore, try to publish in other fields and transactions? Probably. My supervisor, suggested more than once that I should spread the word, trying to convince scientists in signal processing to use higher-order statistics, i.e., IT quantities. I haven’t listened, though, because I feel at home in the IT community, and I would not want to miss meeting my IT friends regularly at our annual events. Even in hindsight, I would rather submit to orthodox pressure and provide operational characterizations rather than to publish in a different community. In the future, I hope that I can do both.
That is my decision. As the IT society, we all must decide: what are our traditions (really) and how strong shall we hold on to them? Especially when dealing with future directions of IT, as mentioned in arXiv:1507.05941, how should we contribute to make our quantities be used in fields such as signal processing, control theory, and machine learning? Or should we not? For example, should we propose clustering algorithms using IT quantities, or should we rather focus on deriving asymptotic limits for the clustering problem? You, as a PhD student, must make your own decision: Are you sufficiently “in IT” so as to be accepted by the community? If your research topic is on the border between IT and another field, in which direction do you want to go?
Tuesday, March 8, 2016
OMG, KLD is only LSC!
Recently, my colleage Georg came to my office to show me some exciting result he found in a book: A bound on mutual information (MI) based on the variational distance (VD) between the joint distribution and the product of marginal distributions. In other words, if $X$ and $Y$ are random variables with probability mass functions (PMFs) $p_X$ and $p_Y$, and a joint PMF $p_{XY}$, one can write
$$ I(X;Y) \le C \Vert p_{XY}-p_Xp_Y\Vert, $$
where $C$ only depends on the alphabet sizes of $X$ and $Y$ and where
$$\Vert p_{XY}-p_Xp_Y\Vert = \sum_{x,y} |p_{XY}(x,y) - p_X(x)p_Y(y)| .$$
This of course implies that if the variational distance goes to zero, then so does the mutual information. This result is exciting since in general it is not possible to bound Kullback-Leibler divergence (KLD; mutual information is a special case of a Kullback-Leibler divergence) via variational distance, at least not by a bound that does not depend on the distributions. The reverse is always possible: The Kullback-Leibler divergence bounds variational distance via the well-known Pinsker inequality.
While I was surprised by the existence of such a bound, I was not surprised by the weaker result that if VD goes to zero then also MI does. At least if $X$ and $Y$ have a finite alphabet, then the entropies $H(X)$, $H(Y)$, and $H(X,Y)$ are continuous in the PMFs, hence convergence of the PMFs implies convergence of the entropies, and hence, of the MI.
Very interestingly, however, it turned out that KLD is NOT continuous, but only lower semi-continuous (LSC), even if the supports of the distributions are finite:
A sequence of (probability) measures $\mu_n$ is said to converge setwise or strongly to another probability measure $\mu$ iff for every measurable set $A$ one has $\mu_n(A)\to \mu(A)$. I think if $\mu=n$ are discrete probability measures with finite alphabets, i.e., can be described by PMFs $p_n$ with finite support, then setwise convergence is equivalent to the fact that, for every point $x$ in the support, $p_n(x)\to p(x)$. I furthermore think that for such finite PMFs setwise convergence is equivalent to the fact that $\Vert p_n-p\Vert\to 0$. (I should check that.) Now let us take two sequences $p_n$ and $q_n$ of PMFs. Let both have a finite support. One would guess that if $p_n \to p$ and $q_n\to p$, then the KLD
$$ D(p_n\Vert q_n) = \sum_x p_n(x)\log\frac{p_n(x)}{q_n(x)} \to 0.$$
But that's not true. While entropies are continuous w.r.t. setwise convergence, KLD is only LSC. It follows that the right-hand side of above equation can be strictly positive, even if both $p_n$ and $q_n$ converge to $p$. Here is an example that you can easily work out yourself: $p_n(0) = 1-1/n$, $p_n(1)=1-p_n(0)$ and $q_n(0)=1-e^{-n^2}$, $q_n(1)=1-q_n(0)$.
$$ I(X;Y) \le C \Vert p_{XY}-p_Xp_Y\Vert, $$
where $C$ only depends on the alphabet sizes of $X$ and $Y$ and where
$$\Vert p_{XY}-p_Xp_Y\Vert = \sum_{x,y} |p_{XY}(x,y) - p_X(x)p_Y(y)| .$$
This of course implies that if the variational distance goes to zero, then so does the mutual information. This result is exciting since in general it is not possible to bound Kullback-Leibler divergence (KLD; mutual information is a special case of a Kullback-Leibler divergence) via variational distance, at least not by a bound that does not depend on the distributions. The reverse is always possible: The Kullback-Leibler divergence bounds variational distance via the well-known Pinsker inequality.
While I was surprised by the existence of such a bound, I was not surprised by the weaker result that if VD goes to zero then also MI does. At least if $X$ and $Y$ have a finite alphabet, then the entropies $H(X)$, $H(Y)$, and $H(X,Y)$ are continuous in the PMFs, hence convergence of the PMFs implies convergence of the entropies, and hence, of the MI.
Very interestingly, however, it turned out that KLD is NOT continuous, but only lower semi-continuous (LSC), even if the supports of the distributions are finite:
A sequence of (probability) measures $\mu_n$ is said to converge setwise or strongly to another probability measure $\mu$ iff for every measurable set $A$ one has $\mu_n(A)\to \mu(A)$. I think if $\mu=n$ are discrete probability measures with finite alphabets, i.e., can be described by PMFs $p_n$ with finite support, then setwise convergence is equivalent to the fact that, for every point $x$ in the support, $p_n(x)\to p(x)$. I furthermore think that for such finite PMFs setwise convergence is equivalent to the fact that $\Vert p_n-p\Vert\to 0$. (I should check that.) Now let us take two sequences $p_n$ and $q_n$ of PMFs. Let both have a finite support. One would guess that if $p_n \to p$ and $q_n\to p$, then the KLD
$$ D(p_n\Vert q_n) = \sum_x p_n(x)\log\frac{p_n(x)}{q_n(x)} \to 0.$$
But that's not true. While entropies are continuous w.r.t. setwise convergence, KLD is only LSC. It follows that the right-hand side of above equation can be strictly positive, even if both $p_n$ and $q_n$ converge to $p$. Here is an example that you can easily work out yourself: $p_n(0) = 1-1/n$, $p_n(1)=1-p_n(0)$ and $q_n(0)=1-e^{-n^2}$, $q_n(1)=1-q_n(0)$.
Wednesday, December 16, 2015
Auf der Jagd nach den verlorenen Bits
In
einer Zeit, in der Information eine so große Rolle spielt, sollten
wir genau wissen was mit der Information in den technischen Systemen,
die wir täglich verwenden, passiert. Wie viel davon geht auf ihrem
Weg von der Quelle zum Empfänger verloren? Und können wir unsere
Systeme so bauen, dass dieser Informationsverlust minimiert wird?
Das Smartphone-App Ihrer Bank ist sehr praktisch wenn Ihnen auf dem Weg durch die Herrengasse in Graz in einem Schaufenster ein schöner Pulli auffällt: Ein Blick auf den Kontostand genügt und Sie wissen, ob Sie sich diese sicherlich unnötige Ausgabe leisten können. Um die Sache zu vereinfachen, stellt das App negative Kontostände mit einer roten Zahl, positive mit einer schwarzen dar. Aufgrund eines merkwürdigen (und zugegebenermaßen unrealistischen) Displayfehlers sehen Sie aber nur eine blaue Zahl. Das Display hat einen wichtigen Teil der Kontoinformation zerstört.
Das Smartphone-App Ihrer Bank ist sehr praktisch wenn Ihnen auf dem Weg durch die Herrengasse in Graz in einem Schaufenster ein schöner Pulli auffällt: Ein Blick auf den Kontostand genügt und Sie wissen, ob Sie sich diese sicherlich unnötige Ausgabe leisten können. Um die Sache zu vereinfachen, stellt das App negative Kontostände mit einer roten Zahl, positive mit einer schwarzen dar. Aufgrund eines merkwürdigen (und zugegebenermaßen unrealistischen) Displayfehlers sehen Sie aber nur eine blaue Zahl. Das Display hat einen wichtigen Teil der Kontoinformation zerstört.
Aber
was ist Information
eigentlich? Im Wesentlichen ist Information unser Wissen
bzw.
Unwissen über etwas Zufälliges, je
nachdem ob wir die Information besitzen oder nicht.
Messen lässt sich Information z.B. über die durchschnittliche
Anzahl von Ja/Nein-Fragen, die gestellt werden müssen, um unser
Unwissen zu beseitigen:
Nehmen wir einen zufälligen Münzwurf als Beispiel, so
reicht bereits eine Frage, um uns über das Ergebnis zu erkundigen:
Kopf
oder Zahl? Bei
zwei Münzwürfen benötigen wir zwei Fragen, um beide Ergebnisse
zu erfahren,
bei drei Würfen drei Fragen, und so weiter. In
seiner
bahnbrechenden
Arbeit "A Mathematical Theory of Communication" präsentierte
ClaudeE. Shannon eine
mathematische
Formel, um die
Information eines Münzwurfes – seine
Entropie
– zu berechnen und definierte das Bit als Maßeinheit.
Die
Information eines Münzwurfes ist genau ein Bit:
man
benötigt exakt
eine Ja/Nein-Frage,
um
den Ausgang des Wurfes zu bestimmen.
Anders
sieht es bei einer gezinkten Münze aus die
in neun von zehn Fällen Kopf zeigt. Unser
(Vor-)Wissen
ist größer
als
im Fall einer fairen Münze und
"im
Durchschnitt"
benötigen
wir weniger Fragen, um den Ausgang des Münzwurfes zu bestimmen.
Konkret kann man sich das folgendermaßen veranschaulichen: Werfen
wir eine
faire
Münze zweimal, müssten wir immer
zwei
Fragen
stellen,
um
die Ergebnisse beider Würfe zu bestimmen.
Werfen
wir die
gezinkte Münze zweimal
ist
in
vielen Fällen eine einzige, schlau gestellte Frage ausreichend:
"Zeigten beide Würfe Kopf?" Wird diese Frage bejaht
(und das wird sie in 81% aller Fälle), muss eine zweite Frage mehr
nicht gestellt werden. Mit
Hilfe
von
Shannons Formel kann
man zeigen,
dass die Information
dieses gezinkten Münzwurfes bei ca. 0.46 Bit
liegt:
Es reichen
"durchschnittlich"
etwas weniger als eine halbe Frage, um unser Unwissen über einen
Münzwurf zu beseitigen,
etwas weniger als eine Frage für zwei Münzwürfe und etwas weniger
als eineinhalb Fragen für drei Würfe.
Wie
viel
Bit
der Kontoinformation
wurden
aber
durch
Ihren
Displayfehler zerstört? Mit
dieser Frage beschäftigte ich mich im
Zuge meiner Dissertation an der TU Graz, in
der ich den Informationsverlust
in
technischen
Systemen untersuchte.
Da
sich mit einer einzigen Frage feststellen lässt ob
der Kontostand positiv oder negativ ist, kann
der Informationsverlust Ihres Displays höchstens ein Bit betragen.
Dass der genaue
Wert
im
Wesentlichen von der Zufälligkeit Ihres Kontostandes abhängt ist
weniger offensichtlich, aber
in
Hinblick auf die gezinkte Münze leicht
verständlich.
Wenn
Sie eine sehr sparsame Person sind und immer einen kleinen Puffer auf
Ihrem
Konto wissen, fehlt Ihnen nicht viel Information: Sie können sehr
sicher sein
dass die Zahl eigentlich schwarz sein sollte und Sie sich den Pulli
leisten können. Wenn Sie eine sehr verschwenderische Person sind,
deren Kontostand höchstens ein paar Tage im Monat positiv ist, fehlt
Ihnen auch nicht viel Information: Die Zahl ist mit hoher
Wahrscheinlichkeit rot. (Den Pulli würden Sie sich in diesem Fall
wahrscheinlich trotzdem kaufen.) Bewegen Sie sich allerdings zwischen
diesen beiden Extremen, kann Ihr Display einen beträchtlichen Teil
der Information zerstört haben – im
schlimmsten Fall ein
Bit.
Ein
Bit ist
doch nicht
viel, sagen
Sie? Das
kommt ganz auf das Bit an! Wenn Sie wissen, dass Ihr Kontostand
zwischen -5000 und 5000 € liegt, müssen Sie nach Shannons Formel
rund 20 Fragen stellen, um den genauen Betrag bis auf den Cent zu
erfahren. Ihr Display beantwortet 19 dieser 20 Fragen für Sie –
nur leider die wichtigste nicht: Ist der Kontostand positiv oder
negativ?
So
interessant es also ist
den Informationsverlust eines Systems zu untersuchen, praktische
Bedeutung bekommt
diese Theorie erst unter Miteinbeziehung des Relevanzbegriffs:
Welcher Anteil
der verlorenen Information ist für uns relevant, und welcher
irrelevant bzw. störend? Ich erweiterte die Theorie in meiner
Dissertation in dieser Hinsicht und verwendete die Resultate, um zu
tun, was einem Ingenieur zu tun bestimmt ist:
Systeme zu
bauen!
Systeme
werden nach gewissen Anforderungen gebaut, um gewisse
Aufgaben
zu erfüllen. Ein elektronisches
Filter
kann zum Beispiel entworfen werden, um
störendes
Rauschen
beim Telefonieren zu
unterdrücken
und
dabei
das relevante
Sprachsignal des Gesprächspartners möglichst nicht zu
beeinflussen.
Historisch bedingt – und schlichtweg am
einfachsten
– werden Filter nach Energiekriterien
entworfen: Die Energie des störenden Rauschens
soll
nach der Filterung so
klein wie möglich sein. Gleichzeitig darf die Energie der durch das
Filter hervorgerufenen Störungen im Sprachsignal nicht zu groß
werden, um eine angenehme Kommunikation der Gesprächspartner zu
garantieren.
Ich
versuchte
mit
meiner Arbeit einen
anderen Weg einzuschlagen: In einer Zeit, in der viele
unserer Systeme Information verarbeiten oder übertragen, sollte
Information
als
Kriterium
für
den Systementwurf verwendet werden.
Die
von
Shannon entwickelte
Informationstheorie besagt,
dass jedes System Information nur
verringern
aber nicht vergrößern
kann. Jene
Information, die wir aus dem Lautsprecher am Smartphone
hören, war zuvor in der elektromagnetischen Welle in der Luft, im
digitalen Signal im Smartphone
unseres Gesprächspartners und
in den Schallwellen zwischen dessen
Mund und dem Mikrofon. Mehr
als das: In
jedem dieser
verarbeitenden Systeme
– Mikrofon, digitale Schaltkreise, Lautsprecher – ging
Information verloren.
Welches
Entwurfskriterium
könnte also besser geeignet sein als der Informationsverlust? Auf
das obige Beispiel angewendet gilt es also ein Filter zu entwerfen
welches
so wenig Sprachinformation
wie möglich zerstört
und dabei die "Information"
des
störenden
Rauschens
soweit
wie möglich reduziert.
Dass
ein
hinsichtlich Informationsverlust entworfenes
Filter besser
zur
Informationsübertragung
geeignet ist als
ein nach Energiekriterien entworfenes,
dürfte Sie inzwischen nicht mehr überraschen – eine andere
Methode liefert eben ein anderes Ergebnis.
Nichtsdestotrotz
stieß ich während meiner Dissertation in der Literatur immer wieder
auf Stellen, in denen Energie mit Information gleichgesetzt
wurde. So wird zum Beispiel in der Statistik
seit Jahrzehnten die Hauptkomponentenanalyse eingesetzt, um die
Komplexität
großer
Datensätze
zu verringern. Dabei werden die mehrdimensionalen Datensätze
transformiert und Daten
mit geringer Energie verworfen. Diese
Vorgehensweise wird mit der Behauptung gerechtfertigt, dass
die Daten
mit der
größten
Energie
auch
die
meiste relevante
Information
beinhalten. Diese
Behauptung ist nicht immer richtig (und wer
am lautesten schreit, hat auch
nicht immer recht):
Für
die Hauptkomponentenanalyse konnte ich zum Beispiel zeigen, dass sie
den
Informationsverlust
nur dann
minimiert,
wenn
die relevante Information in einem besonderen Zusammenhang mit den
Datensätzen
steht, einer
Tatsache, die
in der Statistik
nicht immer zutrifft.
Es
ist höchste Zeit, umzudenken.
Neben
Filterentwurf, der
Hauptkomponentenanalyse
und der Analyse Ihres defekten
Displays gibt es natürlich eine Vielzahl weiterer
Anwendungen einer Theorie des Informationsverlusts: Zum
Beispiel entwickelte
ich gemeinsam mit anderen Forschern eine
Methode, um
Markoffschen Ketten zu
vereinfachen, ohne
dabei Information zu zerstören. Markoffsche Ketten – Folgen
von zufälligen Zahlen, die in einem statistischen Zusammenhang
miteinander
stehen
–
sind
wichtige mathematische Modelle und werden
in der Sprachverarbeitung, als Modelle chemischer Reaktionen, in
der
Genetik,
in
der Bioinformatik und in der Warteschlangentheorie eingesetzt.
Information
ist überall. Um sie nutzbar zu machen, sollten unsere technischen
Systeme – Computer, Smartphones, etc. – so wenig wie möglich
davon zerstören. Und selbst wenn wir es nicht schaffen sollten,
diese Systeme dementsprechend zu bauen, so sollten wir zumindest
wissen wie viel Information durch ungeeignet entworfene Systeme
verloren geht. Die Wichtigkeit der von Shannon begründeten
Informationstheorie, die ich mit meinen Resultaten zum
Informationsverlust ein klein wenig ergänzen durfte, ist nicht zu
unterschätzen. Es gilt heute viel mehr als je zuvor Norbert Wieners
Behauptung: "Information
ist Information, weder Materie noch Energie. Kein Materialismus, der
dies nicht berücksichtigt, kann heute überleben."
Tuesday, September 8, 2015
My Story with Polar Codes
It was the end of August 2010, and the weather was unusually good in Dublin. During the breaks at the IEEE Information Theory Workshop we could easily go outside, sit in the grass, and discuss the previous talks. Well, some people discussed, some just listened. Since this was my first conference in information theory (I did not even have a paper there, my boss allowed me to go there without actively participating -- an opportunity I am extremely grateful for), I clearly belonged to the second group. But then, already on the second day, there were some talks I felt comfortable with: Erdal Arikan gave a lecture on Polar Codes, a new coding technique he invented recently. What followed were several talks related to polar coding, including talks from Emmanuel Abbe (who was my main source of inspiration for my investigation of the fractality of polar codes), Tanaka (who looked at the polar coding procedure from a stochastic dynamical system point-of-view), Ido Tal and Alexander Vardy (whose code construction technique was extremely useful in proving fractality), and Eran Hof. While I did not understand a lot from his talk, I owe Eran big time: Being a total noob, I did not know anybody at this conference, and he was so kind to introduce the "big players" to me. That was absolutely necessary, because during one of the breaks I approached Sergio Verdu and asked him about the polar code construction technique he just presented. Well, apparently I confused him with Alexander Vardy (I have no idea why). Anyways, although this was one of the best conferences I've ever been to (an excursion to Castle Malahide, Whiskey tasting at Jameson's, a Riverdance show during the conference dinner, a pleasant stay at a cute little hotel, great weather, nice people, and all this together with my back-then-future-wife), this was just the beginning of my interest in polar coding.
It was in November 2011 when I presented my first paper at an information theory-related conference, the ISWCS in Aachen (during one of the short coffee breaks I met Gerhard Kramer, my current boss, for the first time; I also met Georg Böcherer, just finishing his PhD and winning the best paper award). As usual for these conferences, the first day featured tutorials in the afternoon. One of these tutorials was on polar codes, held by Emmanuel Abbe. I still have the printed handouts; one of the few paper documents I moved from Graz to Munich. It was an unforgettable experience, partly because of the awesome presentation and Emmanuel's great didactic skills, partly because of Emmanuel's lively discussion with the audience (particularly with Tor Aulin). And this great impressions led me to think about possible research directions on the train ride home to Graz. In a document called "Resarch Ideas.txt" I wrote
Although I am not sure if any of the obtained results is useful, I'm extremely happy to have these things out of my mind (and in a paper on arXiv). I could also present these results at the NEWCOM# Emerging Topics Workshop in Cambridge. One of the participants of this workshop was Erdal Arikan, the father of polar codes, and the very same person we were sitting next to during the whiskey tasting event at Jameson's in Dublin five years ago.
If a channel (one of the partitioned channels) is already very good (close to perfect), is it possible to assume all its descendant channels to be good? [...] Can we use this to find an approximation of a "good" polar code?And then nothing happened for quite some time, as I continued my PhD research on information loss in deterministic systems. All this time, these polar codes stayed in the back of my mind and came to the front whenever I attended a conference (there were ALWAYS talks on polar codes). But time went by fast, I finished my PhD and -- totally unexpected -- started working at the Institute for Communications Engineering in Munich, becoming a member of Gerhard's lab and Georg's colleague. Some months there, not much research done (I was busy writing project proposals), the whole lab took a skiing trip to the alps in South Tirol, disguised under the title "Joint Conference on Communications and Coding". Well, in fact we did more research than skiing, something which I am quite grateful for (there are some very embarrassing photos of me trying cross-country skiing...). Instead of presenting work already done, I presented open problems, among them some ideas about polar codes. The preparation for the talk inspired me to write a blog article, and soon after that all the questions in this article were answered. Emmanuel suggested via email to extend the analysis to Reed-Muller codes, which was completed almost equally fast. It was one of the very few lucky streaks I had in my research career so far, and still some open questions pester me.
Although I am not sure if any of the obtained results is useful, I'm extremely happy to have these things out of my mind (and in a paper on arXiv). I could also present these results at the NEWCOM# Emerging Topics Workshop in Cambridge. One of the participants of this workshop was Erdal Arikan, the father of polar codes, and the very same person we were sitting next to during the whiskey tasting event at Jameson's in Dublin five years ago.
Wednesday, July 8, 2015
Candy Crush Codes
Candy Crush Saga is a game in which... no, I don't think I have to tell you! But I bet you didn't know yet that the game is actually NP-hard! The original proof can be found on arXiv (or, again, on arXiv; these authors claim to prove a little more general result), but there are also plenty of pop-sci articles around. (Now, that is quite interesting: Walsh posted his paper on March 8th, 2014, and the press wrote articles about it only five days later.) Anyways, that's not really what I wanted to tell you.
If you look at above picture, you will see no three candies in a row or in a column having the same flavor. Why? Because they would disappear immediately (that's more-less the only game rule). But this game rule more-less immediately leads to the following question:
If and $n\times m$ field, filled with candies of $k$ different flavors that are randomly, independently and uniformly chosen, what is the probability that there is at least one horizontal or vertical triple of the same flavor?
In other words, what is the probability that, starting a new level, something happens without you actually touching the screen. The question is obviously combinatorial and interesting in itself. But there is even more to it.
If we look at $n\times m$ fields filled with candies of $k$ different flavors, there are in total $k^{nm}$ possible arrangements. The rules of Candy Crush Saga, however, allow only a subset of these, namely, $k^{nm}$ minus the answer to above question. For example, if we take just one row with three columns and two flavors, there are eight possible arrangements, of which two are excluded. Taking two rows allows 36 out of 64 arrangements, and three rows permit 102 out of 512. (At least that's what my first calculations show.) In other words, the set of all possible fields allows for a redundancy: If we want to transmit messages by sending pictures of Candy Crush fields, a $3\times 3$ field allows you transmit only up to 102 different messages, although there are 512 different possible arrangements. If now, during the transmission, an error occurs -- the flavor of a candy changes -- you might end up with an invalid field! Candy Crush Saga can be used as a nonlinear block code! Super-exciting, right?
At that, in turn, allows us to ask several questions:
![]() |
| A typical game field of Candy Crush Saga (copyright with King Inc., taken from Wikipedia) |
If and $n\times m$ field, filled with candies of $k$ different flavors that are randomly, independently and uniformly chosen, what is the probability that there is at least one horizontal or vertical triple of the same flavor?
In other words, what is the probability that, starting a new level, something happens without you actually touching the screen. The question is obviously combinatorial and interesting in itself. But there is even more to it.
If we look at $n\times m$ fields filled with candies of $k$ different flavors, there are in total $k^{nm}$ possible arrangements. The rules of Candy Crush Saga, however, allow only a subset of these, namely, $k^{nm}$ minus the answer to above question. For example, if we take just one row with three columns and two flavors, there are eight possible arrangements, of which two are excluded. Taking two rows allows 36 out of 64 arrangements, and three rows permit 102 out of 512. (At least that's what my first calculations show.) In other words, the set of all possible fields allows for a redundancy: If we want to transmit messages by sending pictures of Candy Crush fields, a $3\times 3$ field allows you transmit only up to 102 different messages, although there are 512 different possible arrangements. If now, during the transmission, an error occurs -- the flavor of a candy changes -- you might end up with an invalid field! Candy Crush Saga can be used as a nonlinear block code! Super-exciting, right?
At that, in turn, allows us to ask several questions:
- How should we encode a message efficiently? Is there anything better than just keeping a table of , say,102 fields? (There is.)
- How should we decode the message efficiently? That's a more difficult question if you don't want to store a table of all allowed arrangements. Maybe belief propagation can help here, as it does for Sudoku codes.
- What are the basic properties of an $(n,m,k)$ Candy Crush Code? What is the maximally possible rate given the game rule, i.e., what is the ratio between the number of possible and allowed arrangements? What is the minimum Hamming distance, etc.? How can we improve our encoder in order to guarantee a positive Hamming distance?
- Given that we change the flavor of every candy with a given probability $p \ll 1$ (and choose the new flavor uniformly among the remaining ones), what is the error probability? I.e., what is the probability that the changed arrangement is not detected as erroneous, because it also satisfies the game rules?
- What about error propagation of this code? I.e., if a given candy changes its flavor, how many "information bits" are effected? Can we still reconstruct part of the information, or is the whole block destroyed? (I guess the answer is no.)
- Connecting to the last question: What happens if we let one dimension of the field go to infinity? We would need sequential encoding and decoding techniques, such as they are usual for convolutional codes.
- What happens to all of these topics if we span the field on a cylinder (such that the rules extend over one side of the field: two candies of a given flavor on one side rule out a single candy of the same flavor on the other side) or on a torus?
Most probably, the code thus constructed is crappy. But it is nevertheless a nice topic for a research internship, I guess. And it makes learning nonlinear codes for symmetric channels as fun as Sudoku codes do for erasure channels!
Wednesday, June 17, 2015
Polar Code ARE Fractal!
This is a post about very recent work that has not yet been reviewed, but is publicly available on arXiv. This is not science by press conference, but hopefully encouraging discussions and comments from other researchers in the field.
At least, that is what I found out -- I'm pretty confident of the correctness of the results, but, you know, mistakes happen all the time. If you feel that some of my derivations may have flaws, I'd be happy if you could contact me!
Alright, regarding the questions in one of my previous blog post, what have I found out? Here's the short version, for more details please take a look at the paper on arXiv:
At least, that is what I found out -- I'm pretty confident of the correctness of the results, but, you know, mistakes happen all the time. If you feel that some of my derivations may have flaws, I'd be happy if you could contact me!
Alright, regarding the questions in one of my previous blog post, what have I found out? Here's the short version, for more details please take a look at the paper on arXiv:
- The set of good channel $\mathcal{G}$ is Lebesgue measurable, and its Lebesgue measure is $I(W)$, the capacity of the (unpolarized) channel. This is intuitive in the sense that a fraction of $I(W)$ channels will be polarized to perfect channels. From this immediately follows that the Hausdorff dimension of this set is one.
- The dyadic rationals are good and bad. This is because dyadic rationals have two possible binary expansions, one terminating (with infinitely many zeros) and one repeating (with infinitely many ones). While the former polarizes the channel to a useless one, the latter polarizes it to a perfect one.
- Even if we do not take into account the dyadic rationals, the good channels still form a dense subset of the unit interval. The reason is that between any two dyadic rationals we can find a rational number (with repeating binary expansion) which leads to a good channel. For the binary erasure channel we can also find a rational number leading to a bad channel, so we can say a little more about BECs.
- And finally, the set of good channels is indeed self-similar, in the sense that it is a subset of its (scaled and shifted) right half (and left half for BECs). This can be most easily explained by a picture:
This picture (generated by the following code, polynomial composition from this forum) shows the thresholds on the Bhattacharyya parameter for some values inside the unit interval: If the Bhattacharyya parameter is smaller than the threshold, then the channel will polarize to a perfect channel. It can be seen that the right half (top) is an upper bound on the original set (center), which is (for BECs) an upper bound on its left half (bottom).
The most useful tools to derive these results were results about the binary expansion of real numbers, about fixed points of dynamical systems, and the Tal-Vardy polar code construction technique.
Subscribe to:
Posts (Atom)





