Showing posts with label information theory. Show all posts
Showing posts with label information theory. Show all posts

Thursday, March 15, 2018

An Inequality Between Conditional Cross Entropy and Error Probability

Some time ago I wrote about an inequality between conditional entropy and error probability. Specifically, one can show that if $\hat{x}(Y)$ is the maximum a posteriori estimate of a discrete random variable $X$ given the random observation $Y$, then the probability of making an error is bounded from above by

$$
  P_e = \mathbb{P}(X\neq \hat{x}(Y)) \le 1- 2^{H(X|Y)}.
$$

This inequality is not new. I was not aware, however, of any previous references until I met Igal Sason at the International Zurich Seminar earlier this year. He was speaking about a paper he recently published which includes this inequality (eq. (2)), together with many more general results relating error probability with Arimoto-Renyi conditional entropy.

I was hoping to generalize above inequality to the mismatched case. Specifically, let $p_{X|Y}$ be the conditional distribution of $X$ given $Y$. The MAP estimator is $\hat{x}(y) = \arg\max_x p_{X|Y}(x|y)$. Now suppose that the MAP estimator does not have access to $p_{X|Y}$ but only to a surrogate distribution $q_{X|Y}$, in which case the estimator becomes $\tilde{x}(y) = \arg\max_x q_{X|Y}(x|y)$. I thought to get a bound similar to the one above that includes this surrogate distribution and that bounds the error probability $\tilde{P}_e = \mathbb{P}(X\neq \tilde{x}(Y))$. To be more precise, I was hoping to exchange the conditional entropy above by some conditional cross entropy, i.e., to obtain a bound similar to

$$
\tilde{P}_e \le 1- 2^{\mathbb{E}(H^\times(p_{X|Y}(\cdot|Y)\Vert q_{X|Y}(\cdot|Y)))}
$$

possibly with the order of the distributions reversed. Since cross entropy is an upper bound on entropy, the bound will be larger than the bound for $P_e$. (Interestingly, even if $p_{X|Y}$ and $q_{X|Y}$ are different, we nay have $\tilde{P}_e =P_e$: It suffices that, for every $y$, both conditional distributions have their modes at the same values of $x$.)

Unfortunately, my hope was not justified. We show this in a counterexample, i.e., we show that there exists distributions such that the inequality on $\tilde{P}_e$ is violated (for any choice of order of distributions). For the sake of simplicity, we now look at guessing rather than estimation, i.e., the conditional distributions satisfy $p_{X|Y}=p_X$ and $q_{X|Y}=q_{X}$. Now suppose that $X$ is binary, i.e., it takes values 0 and 1. Furthermore, let $p_X(0)=q_X(1)=\epsilon\ll 0.5$. We have that $\tilde{x}=\arg\max_x q_X(x) = 0$, thus

$$\tilde{P}_e=\mathbb{P}(X\neq 0)=1-\epsilon.$$

Since $p_X$ and $q_X$ are permutations of each other, we have that $H(q_X)=H(p_X)$, that $D(p_X\Vert q_X) = D(q_X\Vert p_X)$, and that thus

$$
H^\times(p_X\Vert q_X) = H^\times(q_X\Vert p_X) = -\epsilon\log(1-\epsilon)-(1-\epsilon)\log\epsilon = -\log \epsilon - \epsilon\log\left(\frac{1-\epsilon}{\epsilon}\right).
$$

With this we get

$$
2^{-H^\times(p_X\Vert q_X)} = \epsilon\cdot 2^{\epsilon\log\left(\frac{1-\epsilon}{\epsilon}\right)} > \epsilon
$$

and thus the bound $\tilde{P}_e \le 1- 2^{-H^\times(p_X\Vert q_X)}$ cannot hold.

Monday, February 26, 2018

Talk at "International Zurich Seminar on Information and Communication"

Just coming back from Zurich, where I attended the Int. Zurich Seminar. Being there for the fourth time in a row, it felt like a class reunion :). I was speaking about a joint work with Tobi Koch on the information dimension of multivariate Gaussian processes (see below for the slides). The work is on arXiv here and here, and Tobi and I discussed a roadmap for more results to come - stay tuned!


By the way, the IZS publishes its submissions on an ETH server that is free to access for everyone and that does not require to transfer copyrights from the authors to the publisher. This is not only open access at its best, but also gives you a chance to look at the amazing papers presented in Zurich.

Monday, February 19, 2018

"A Rate-Distortion Approach to Caching" Just Got Published!

Our joint work on rate-distortion theory for caching just got published in the IEEE Transactions on Information Theory (it's also available on arXiv, of course; there's also an older post). Thanks Roy, Shirin, and Michele for letting me be a part of this!

In this paper, we consider a lossy single-user caching problem with correlated sources -- just think of streaming compressed videos! Most users will watch these videos in the evening, leading to network congestion. If you have a player with a cache, though, you can fill this cache with data during times of low network usage, even though you may not know which video the user wants to watch in the evening. In our paper, we characterize the transmission rate required in the evening as a function of the cache size and as a function of the distortion one accepts when watching the videos. We furthermore hint at what should be put in the cache such that it is useful for a variety of videos, and we connect these results to the common-information measures proposed by Wyner, Gacs and Koerner.


The picture shows achievable upper (red) and lower converse (black) bounds on the transmission rate as a function of the cache capacity $C$ for doubly symmetric binary source, with the requirement the video is reconstructed perfectly at the receiver. $I(X_1;X_2)$ denotes the mutual information between the two videos, while $K_W(X_1,X_2)$ denotes Wyner's common information.

Friday, January 26, 2018

Talk at "Advanced Topics in Discrete Mathematics"

I was recently giving a talk in the "Advanced Topics in Discrete Mathematics" series organized by the doctoral program on discrete mathematics. In case you missed it, be glad that you did: I went overtime quite a bit. You can obtain the slide by clicking on the image below.

Wednesday, January 17, 2018

My Story with Polar Codes Comes to an End

I have long been wondering whether polar codes are fractal, found out that they are in 2015, and put a paper on arXiv. The story that began in 2010 finally comes to end -- the paper is now published!

I submitted the paper to the IEEE Transactions on Information Theory, where it was under review for 18 months. I was told that the two reviewers thought the paper collects a few "interesting and cute mathematical observations", but that it does not tell us anything about the relevant code properties such as error probabilities or code complexity. I was further told that if I could "add a convincing application", then the paper could get to a second review round.

I withdrew the paper and submitted it to Entropy, and open-access journal published by MDPI, in October 2017. I received reviews within less than two months. Four of the five reviewers were inclined to accept the paper after some revision, while the fifth reviewer insisted to reject the paper (and probably still does) because of its missing connection to real-world codes. It is worth mentioning that the review process was not only much faster than the one for the IEEE Transactions on Information Theory, but that I also got high-quality reviews: Some of the reviewers went through the math in detail, made corrections and suggestions for improvement that led to generalizing some of the results. This shows that a proper, high-quality review process cannot be used to justify long review cycles.

Friday, January 12, 2018

Reading Notes: Neural Networks and the Information Bottleneck Principle

There is a recent hype regarding the connection between neural networks (NNs) and the information bottleneck principle. In this post, I try to summarize a few papers that have been published since 2015. A popsci treatment of this topic can be found here. All linked papers are available on arXiv or on OpenReview.

The reader should be familiar with the information bottleneck (IB) principle. Specifically, it assumes that we have a class random variable (RV) $Y$ and a (multi-dimensional) feature RV $X$, and that our goal is to compress $X$ to a "less complex" RV $\hat{Y}$ that shares much information with $Y$ but little with $X$ (compression versus preserving relevant information). Mathematically, one tries to minimize
$$
  I(X;\hat{Y}) - \beta I(Y;\hat{Y}) \text{ or equivalently } I(X;\hat{Y}) + \beta I(X;Y|\hat{Y})
$$
where $\beta$ is a parameter trading between these two goals. The first term in either expression is often called the rate $R$, while the second term in the latter expression is the information distortion $D_{IB}$. This allows to draw the IB curve with $R$ on one and $D_{IB}$ on the other axis. This curve depends only on the joint distribution of $(X,Y)$ and is monotonically decreasing.

2015 - Tishby and Zaslavsky, "Deep Learning and the Information Bottleneck Principle". How it all started.

The authors claim that the goal of NNs is to compress the features as much as possible, with the constraint that the compression should be a minimal sufficient statistic of the class RV. The IB principle formulates this goal, hence they claim that it can be used to better understand NNs. Indeed, feed-forward NNs are a sequence of compression functions; the feature RV $X$ is compressed to the output of the first hidden layer $h_1$, which is then compressed to the output of the second hidden layer $h_2$, etc. Finally, the output of the last hidden layer, $h_m$, is compressed to $\hat{Y}$. The authors claim that the IB functional not only characterizes the NN as a whole, but each layer separately. I.e., each layer should minimize $I(h_i;h_{i-1})$ while simultaneously maximizing $I(Y;h_i)$. The former has an interpretation in terms of the minimum description length principle, while the latter bounds the classification error. With this in mind, one can evaluate the optimality of each layer separately by comparing its performance with the IB curve. Each layer can only increase $D_{IB}$, but the hope is that $R$ is reduced by compression.
The authors then discuss finite sample bounds, which are relevant in estimating the IB curve from a finite data set. Specifically, it turns out that the empirical IB curve is bounded from below by the decreasing true IB curve, but from above by a convex function. I.e., for increasing $R$ the actual value on the IB curve may not necessarily decrease with increasing rate $R$. This is exactly the phenomenon of overfitting: A large $R$ may lead to a small $D_{IB}$ on the training set, but to a much larger $D_{IB}$ on the test set. It may thus be beneficial to aim at obtaining a representation close to the minimum of the convex upper bound. There, the rate $R$ is smaller and leads to a slightly larger $D_{IB}$ on the training set, but the worst-case increase of $D_{IB}$ for the test set is bounded. The authors argue that compression is thus necessary for generalization. Based on that, they argue that the IB principle may be used for the design of NNs. Finally, the authors briefly mention that there is a connection between bifurcations of the IB curve corresponding to constraints on $\hat{Y}$ (e.g., cardinality of the support) and the structure of the NN.

Based on this paper, to directions of research have been initiated: Designing NNs using the IB principle, and understanding IBs using the IB principle.

Design of NNs

2016 - Khadivi, Tandon, Ramakrishnan, "Flow of Information in Feed-Forward Deep Neural Networks"

This paper proposes using the IB functional as a cost function for training a NN. Specifically, they decompose the rate $R$, which happens to be $H(\hat{Y})$ because the NN is deterministic, as a telescoping sum of information losses per layer. Moreover, they decompose the information distortion $D_{IB}$ as a sum of terms $I(h_{i-1};Y|h_{i})$; indeed, their equation (19) is precisely my Proposition 4 in this paper. Their goal is to maximize compression with a constrained distortion, as opposed to the variational formulation of Tishby. Strangely, though, they have their compression goal is minimizing $I(X;\hat{Y})+\sum_{i=1}^m I(X;h_i)$, i.e., they also want to compress the individual layer outputs separately. This leads to a stronger compression factor for early layers, or equivalently, to smaller values of $\beta$ for early layers.
Finally, the authors formulate the training problem not as finding weights, but first finding admissible functions between layers (where admissibility means that the function can be implemented using a NN layer of given structure). They then obtain the weights from the obtained function. This approach is rather strange and hard to implement in practice, I suppose, and that may be the reason why the authors focus on Boolean functions in their example section.

2016 - Alemi, Fischer, Dillon, Murphy, "Deep Variational Information Bottleneck"

The authors also propose using IB to train a NN, claiming that it is "natural and useful". However, since computing mutual information is "computationally challenging", they replace the terms in the IB functional by upper and lower bounds. Specifically, they replace the decoding distribution $P_{Y|\hat{Y}}$ by a surrogate in the term $I(\hat{Y};Y)$, and the marginal distribution $P_{\hat{Y}}$ by a surrogate in the term $I(\hat{Y};X)$. For example, they assume that $P_{Y|\hat{Y}}$ is a simple softmax and that $P_{\hat{Y}}$ is a spherical Gaussian. Moreover, they assume that the encoder that maps $X$ to $\hat{Y}$ is stochastic, i.e., it responds with a Gaussian RV for which the mean vector and covariance matrix are obtained from a neural network. This allows to sample from the distribution of $\hat{Y}$ and to compute gradients and to use backpropagation. Note that this latter assumption is not one on the surrogate $P_{\hat{Y}|X}$, but rather one on the structure of the NN: Inside the NN there is a "bottleneck layer" that responds with $\hat{Y}$ (it may be the last layer), where the layer explicitly adds Gaussian noise.
They discover that the IB principle works as a decent regularizer in the sense of achieving good classification performance on MNIST data. They moreover plot the information plane (the IB curve) and discover that a poor choice of $\beta$, leading to low compression, leads to overfitting. An interesting figure shows a "two-dimensional bottleneck layer", i.e., $\hat{Y}$ has two components. For small compression $\mathbb{R}^2$ is clearly separated into ten sectors corresponding to the 10 classes of the MNIST data set, but that larger compression these regions tend to overlap, leading to misclassification but also to robustness to adversarial examples. For image classification tasks it was shown that the IB principle leads to NNs that are more robust to adversarial examples.
It is worth mentioning that the authors apply the IB principle to the NN as a whole, i.e., not to individual layers as Tishby suggested. They leave applying it to multiple (or all) layers for future work.

2017 - Kolchinsky, Tracey, Wolpert, "Nonlinear Information Bottleneck"

Although conceptually similar to the work of Alemi et al., these authors actually aim at implementing IB for non-Gaussian and non-discrete distributions (leading, generally, to a nonlinear compression function). Again, the authors argue that the IB functional is hard to compute and hence resort to bounds by replacing distributions by surrogate ones. Specifically, they assume a stochastic encoder $P_{\hat{Y}|X}$, which is based on a differential function (implemented using a neural network) to which spherical Gaussian noise is added. They moreover assume that the marginal distribution at the output of the deterministic neural network can be modeled by a Gaussian mixture. Using this marginal distribution and the conditional distribution, they present a non-parametric bound on the term $I(X;\hat{Y})$ (rather than a variational bound as in the previously summarized paper). They moreover implement the decoding function via a neural network and estimate the log decoding probability $\log P_{Y|\hat{Y}}$ by evaluating an "appropriately chosen neural network cost function" and by sampling from the "stochastic" bottleneck variable $\hat{Y}$. In other words, they use two neural networks to implement the encoding and decoding functions of the IB functional. Very interesting is their Figure 2, which shows how the bottleneck layer manages to separate AND compress the classes, such that data points from the same class almost fall together. If the parameter $\beta$ is set to infinity, then compression is not a goal anymore. The bottleneck layer then still manages to separate the classes, but each class corresponds to a somewhat spread-out cloud of data points. Data points are still distinguishable, which means that the bottleneck layer not only contains class information, but also additional information about the data points that is actually irrelevant.

2018 - ?, Parametric Information Bottleneck to Optimize Stochastic Neural Networks

This paper on OpenReview trains a neural network using the IB functional in every layer, i.e., in each layer they want to minimize $I(h_i;h_{i-1})$ and maximize $I(h_i;Y)$, with the trade-off determined by a layer-specific parameter $\beta_i$. They do not rely on a NN decoder as Alemi et al. did, but empirically estimate $I(X;h_i)$ and use a variational approximation for $I(Y;h_i)$. For this, they treat the subsequent layers of the NN as an approximation of $P_{Y|h_i}$. They furthermore estimate $I(h_i;h_{i-1})$ by Monte Carlo sampling. They show that their approach, minimizing the IB functional, performs well on the MNIST data set, where all layers had the same parameter $\beta$.

Understanding NNs

2017 - Schwartz-Ziv, Tishby, "Opening the Black Box of Deep Neural Networks via Information"

This paper probably kicked up most of the dust; a recording of a talk of Tishby's on this topic is available. In this paper, the authors train a NN with sigmoid activation functions and without regularization. They furthermore evaluate the IB functional for every layer throughout training. They observe that the training phase is split into two parts: A phase in which the training error reduces quickly. This phase is characterized by a stochastic gradient with a large mean and small variance ("drift phase") and by an increase of $I(h_i; Y)$ (the layer learns). Simultaneously, the terms $I(h_i; X)$ seem to increase, too (because they also learn something about the features). This drift phase is very short. The much longer phase is characterized by a compression of the input information subject to the training error criterion (i.e., training error does not change?). In this phase, the gradient has a small mean and a large variance ("diffusion phase"), and is characterized by further increase of $I(h_i; Y)$, but a simultaneous decrease of $I(h_i; X)$ in the deeper layers: The deeper layers compress, in the sense that they forget input information not relevant for classification. This phase is significantly longer. What is interesting is that, without regularization, there is no decrease in weight norms nor sparsity observable. Rather, the compression occurs because of the stochasticity of gradient descent. This suggests that, without regularization, there are multiple NNs with (possibly random) weights that have equal classification performance. It therefore suffices to find a NN in this equivalence class; interpreting the output of the sigmoid function as a probability (which is equivalent to adding noise?) leads to choosing a function with low complexity (rather than one which is random and has high complexity; cf. Section 2.4). The authors furthermore indicate that the converged layers have IB functionals that lie close to the optimal curve for special values of $\beta$, where $\beta$ is smaller for deeper layers (more focus on compression). Furthermore, they observed that the compression phase is faster for NNs with more hidden layers. They also claim that the convergence of the layers to the IB functional can be used in training NNs.

2018 - ?, On the Information Bottleneck Theory of Deep Learning

This paper on OpenReview, where a long discussion between Tishby (a reviewer?) and the authors can be found, essentially refutes some of the claims of Tishby's paper. Specifically, they argue that compression in the diffusion phase does not appear for linear and ReLU activation functions. They argure moreover that the compression observed in Tishby's paper is due to the fact that the sigmoid nonlinearity is saturated, leading to small mutual information values when the latter is estimated via binning (to which Tishby responded that the authors don't know how to estimate mutual information). They showed that compression occurs independently of generalization, by showing that 1) a linear network generalizes but does not compress, and that an overfitted network compresses but does not generalize. Finally, they showed that similar results are obtained by batch gradient descent instead of stochastic gradient descent, which shows that compression is not achieved by stochasticity. The verdict which of these papers is closer to the truth is still out. It is worth mentioning, though, that they discuss several problems with information-theoretic cost functions in Appendix C, where they also discuss the importance of noise.

Wednesday, November 30, 2016

(Lossy) Wyner's Common Information and Gacs-Körner's Common Randomness Revisited

Recently, I was involved (not so much, to be honest) in a research project dealing with caching, i.e., the storage of data close to a user such that it need not be transmitted anymore. A typical use case is Netflix: Netflix offers $N$ movies/series, and the user can choose any of these anytime. Most users watch during prime time, which taxes the communication network. During the day, the communication network has unused capacities, so one could try storing part of the Netflix database on a hard drive connected to the user's player. This reduces the network load during prime time and improves the user experience.

I our work, we not only looked at the fundamental limits of this caching scheme, but we also got a good idea of what should actually be stored in the user's hard drive ("cache"). The answers are decades old: One should store either what is known as Gacs-Körner's common randomness, or what is known as Wyner's common information.

For simplicity, we assume that there are only two files available on Netflix, namely $X$ and $Y$. The common information of these two random variables (RVs; in information theory, everything is random) is another RV $W$ that minimizes
$$I(X,Y; W)$$
subject to $X-W-Y$, i.e., subject to the fact that, given $W$, $X$ and $Y$ are conditionally independent. The common randomness of these two RVs is an RV $U$ that maximizes
$$ I(X,Y; U) $$
subject to the constraint that $X-Y-U$ and $Y-X-U$. It can be shown that
$$ I(X,Y; U) \le I(X;Y) \le I(X,Y; W).$$

Common randomness is something that is truly common, in the usual sense of the word. In particular, if $A$, $B$, and $C$ are independent RVs, and if $X=(A,B)$ and $Y=(A,C)$, then $U=A$. So much for the math. Now here comes an interpretation (and probably one for which an information theorists wants to punch me in the face): If $X$ and $Y$ are two different episodes of your favorite series, then $U$ is the opening theme. If you have a small hard drive, storing the opening theme is probably the most useful idea. Common information is a bit more intricate, but an intuitive explanation is the following: Suppose $X$ and $Y$ are two different episodes of your favorite series, and suppose the only thing $X$ and $Y$ have in common is an actress called $A$. In $A$, some parts of $A$ are seen, while in $Y$ other parts of $A$ are seen. (Say, $X$ shows $A$ running, while $Y$ shows $A$ talking to a person.) Obviously, if you could store all information about actress $A$ as a whole on your hard drive, the small computer in your player could reconstruct running scenes and talking scenes in which $A$ participates. Thus, if your hard drive is large enough, that's what you should store.

Let's take this one step further. Suppose that you are fine with watching not $X$ or $Y$, but lower-quality versions $\hat X$ or $\hat Y$, respectively. This requires the definition of lossy versions of common randomness and common information. And indeed, such definitions have been proposed in 1403.8093. Namely, the lossy common information is an RV $W$ that minimizes
$$ I(X,Y; W)$$
subject to $\hat X - W - \hat Y$ and $(X,Y) - (\hat X,\hat Y) - W$. Lossy common randomness is an RV $U$ that maximizes
$$ I(X,Y; U) $$
subject to the constraints $X-Y-U$, $Y-X-U$, $X-\hat X - U$, and $Y-\hat Y-U$. These definitions are all meaningful because they solve communication problems: The get their operational characterization in the Gray-Wyner problem (1403.8093) or in lossy caching (1610.07304).

The lossy version of common information seems fine to me: I only need to store as much information about actress $A$ in order to be able to reconstruct approximate depictions of her running and speaking. This might need more or less space on the hard drive than the perfect reconstruction: I might be able to store not only this reduced version of the actress $A$, but I might also store a, say, generic building, that allows to reconstruct two different but similar houses in episodes $X$ and $Y$. And indeed, lossy common information may be larger or smaller than lossless common information (see Lemma 1 in 1403.8093).

The lossy version of common randomness, however, is not fully satisfactory to me. If I am not interested in perfect reconstruction, should I not be able to store more on my internal hard drive? For example, if the credits of my two episodes $X$ and $Y$ differ only in a few actors' names, can I not store a "noisy" version of the credits in addition to the opening theme? Nevertheless, the lossy version of common randomness is usually smaller than the lossless version (see Corollary 2 in 1403.8093).

I would thus suggest different definitions of lossy common randomness: The lossy common randomness is an RV $U'$ that maximizes
$$ I(X,Y; U') \text{ or } I(\hat X,\hat Y; U')$$
subject to the constraints $\hat X-\hat Y-U'$, $\hat Y-\hat X-U'$, $X-\hat X - U'$, and $Y-\hat Y-U'$. The difference is subtle and mirrors the definition of lossy common information. While lossy common information requires $W$ to make the reconstructions $\hat X$ and $\hat Y$ conditionally independent, the original definition of lossy common randomness requires that $U$ is a common part of $X$ and $Y$ (the first two Markov constraints). The definition proposed here would require that $U'$ is only a common part of the reconstructions $\hat{X}$ and $\hat Y$. Given this definition, we may ask the following questions:

  1. Does the new definition satisfy the desired property that it is larger than the lossless common randomness?
  2. Is there a communications problem for which this definition is a single-letter characterization?

Thursday, September 8, 2016

My First Lecture in Numbers

It's done! The semester is over and I graded the last student of my lecture "Information-Theoretic System Analysis and Design". It was a great experience, and I thank Gerhard Kramer for offering me this opportunity. I also owe Lars, my colleage, more than a few drinks for taking care of most of the problem classes, and my colleagues Andrei and Ali for interesting discussions and comments about the course material. And I also thank my students, who bore with me until the end of the semester, and who were never tired of asking interesting questions -- I think I've never had such an engaged, excited audience.

Here's the course in numbers:
  • (I think) I held 14 lectures. Lars and I shared 6 problem classes.
  • I spend roughly 200 hours preparing course notes, lecturing, correcting deliverables, etc.
  • In the problem classes, students calculated 11 problems in front of their peers. The remaining 7 problems were done by Lars.
  • We had 1 exam and 2 mandatory homework assignments, weighed 50%-25%-25%.
  • Both homeworks had 9 problems.
  • We had 8 regular students with (as far as I know) 8 different nationalities, originating from 2 continents.
  • 7 of these students took the exam and delivered the 2 homework exercises. The average grade of this course was 1.35.
  • The course notes totaled to 102 pages, containing 8 chapters, 40 worked examples, and 62 end-of-chapter problems.
  • 6 students finished the course evaluation. An excerpt is shown below.
Course Evaluation (in parts)
And here are some lessons I learned:
  • My PhD thesis is far from complete.
  • There is never enough time to prepare course notes. Even though the course was based on my PhD thesis, it was hard work to prepare the results in an understandable manner.
  • You have to make sure that the description of homework problems is precise and complete.
  • Don't get confused by registration numbers. 30+ students registered for the course, but in the first lecture less than 10 showed up. Apparently (so I was told), many students register for a course to get the course notes on Moodle.
  • Students prefer a script that they can get printed at the Fachschaft. Putting chapters online a few days before the lecture is ok, but a script is better.
  • It's good not to have a sttrict course syllabus if's the first time you give the course. You never how exactly how much time you have to spend at a given topic, so it's better to leave room for changes. Still, it's a good idea to have at least a rough sketch of what topics you want to cover.

Monday, June 20, 2016

The “I” in IT

This article has been published in the June issue of the Information Theory Society Newsletter, as this month's student column (students are always encouraged to send us their thoughts and ideas!). I gratefully acknowledge Gerhard Kramer's help in improving this article.

Have you ever asked yourself whether IT is the right field of science for you? I never did, but I guess IT wondered more than once whether I am the right person to work on its problems. Here’s my story.

I’m currently a postdoc at the Institute for Communications Engineering, Technical University of Munich, Germany. In this regard, my story has a happy ending, something I wouldn’t have dreamed of years ago. When I started my PhD at Graz University of Technology, Austria, I hadn’t heard a lecture on IT yet. My supervisor and I nevertheless decided to attempt developing a “theory for deterministic information processing”, or what we called it back then. I knew entropy from communications engineering and thought that I could get all the necessary information from Chapter 2 of the “Elements”. I later read Chapter 4 for stochastic processes and even resorted to more detailed results, such as those from Kolmogorov and Pinsker, when I tried taking limits of random variables. Nevertheless, phrases like “a sequence of codes” never appeared in my work. Probably the rest of my remaining scientific credibility would be lost if I told you how many of my papers got rejected - so I won’t. I will tell you, though, what most reviewers agreed on: That the information-theoretic quantities I introduced to characterize deterministic systems lack an operational characterization. That was a valid criticism, and I learned what an operational characterization is only a few months before I obtained my PhD. Relating to IEEE Spectrum’s article on Claude Shannon, I despaired that I had “jumped on the bandwagon without really understanding” (Slepian) and contributed to the “widespread abuse” (Pierce) of IT by other fields.

However, since then I have discovered that this kind of “abuse” is successful in scientific fields outside IT. For example, nonlinear adaptive filters are trained using an error entropy, Rényi and Cauchy-Schwarz divergences are used for learning and classification, and the information-bottleneck method enjoys widespread use in clustering. To the best of my knowledge, only few of these works accompany their objective functions with an operational characterization - mutual information is “just” a nonlinear measure of statistical dependence, and entropy is “just” a statistic that captures more than the error signal’s variance. Remarkably, these heuristic methods often show superior performance. Hence, at least in these other fields, the works are scientifically justified by experiment. (OK, I admit that this is a feeble attempt to justify my work a posteriori. As more self-justification, the Aims & Scope of the IEEE Transactions on Information Theory does state that the journal aims to publish theoretical and experimental papers.)

Should I, therefore, try to publish in other fields and transactions? Probably. My supervisor, suggested more than once that I should spread the word, trying to convince scientists in signal processing to use higher-order statistics, i.e., IT quantities. I haven’t listened, though, because I feel at home in the IT community, and I would not want to miss meeting my IT friends regularly at our annual events. Even in hindsight, I would rather submit to orthodox pressure and provide operational characterizations rather than to publish in a different community. In the future, I hope that I can do both.

That is my decision. As the IT society, we all must decide: what are our traditions (really) and how strong shall we hold on to them? Especially when dealing with future directions of IT, as mentioned in arXiv:1507.05941, how should we contribute to make our quantities be used in fields such as signal processing, control theory, and machine learning? Or should we not? For example, should we propose clustering algorithms using IT quantities, or should we rather focus on deriving asymptotic limits for the clustering problem? You, as a PhD student, must make your own decision: Are you sufficiently “in IT” so as to be accepted by the community? If your research topic is on the border between IT and another field, in which direction do you want to go?

Tuesday, March 8, 2016

OMG, KLD is only LSC!

Recently, my colleage Georg came to my office to show me some exciting result he found in a book: A bound on mutual information (MI) based on the variational distance (VD) between the joint distribution and the product of marginal distributions. In other words, if $X$ and $Y$ are random variables with probability mass functions (PMFs) $p_X$ and $p_Y$, and a joint PMF $p_{XY}$, one can write

$$ I(X;Y) \le C \Vert p_{XY}-p_Xp_Y\Vert, $$

where $C$ only depends on the alphabet sizes of $X$ and $Y$ and where

$$\Vert p_{XY}-p_Xp_Y\Vert = \sum_{x,y} |p_{XY}(x,y) - p_X(x)p_Y(y)| .$$

This of course implies that if the variational distance goes to zero, then so does the mutual information. This result is exciting since in general it is not possible to bound Kullback-Leibler divergence (KLD; mutual information is a special case of a Kullback-Leibler divergence) via variational distance, at least not by a bound that does not depend on the distributions. The reverse is always possible: The Kullback-Leibler divergence bounds variational distance via the well-known Pinsker inequality.

While I was surprised by the existence of such a bound, I was not surprised by the weaker result that if VD goes to zero then also MI does. At least if $X$ and $Y$ have a finite alphabet, then the entropies $H(X)$, $H(Y)$, and $H(X,Y)$ are continuous in the PMFs, hence convergence of the PMFs implies convergence of the entropies, and hence, of the MI.

Very interestingly, however, it turned out that KLD is NOT continuous, but only lower semi-continuous (LSC), even if the supports of the distributions are finite:

A sequence of (probability) measures $\mu_n$ is said to converge setwise or strongly to another probability measure $\mu$ iff for every measurable set $A$ one has $\mu_n(A)\to \mu(A)$. I think if $\mu=n$ are discrete probability measures with finite alphabets, i.e., can be described by PMFs $p_n$ with finite support, then setwise convergence is equivalent to the fact that, for every point $x$ in the support, $p_n(x)\to p(x)$. I furthermore think that for such finite PMFs setwise convergence is equivalent to the fact that $\Vert p_n-p\Vert\to 0$. (I should check that.) Now let us take two sequences $p_n$ and $q_n$ of PMFs. Let both have a finite support. One would guess that if $p_n \to p$ and $q_n\to p$, then the KLD

$$ D(p_n\Vert q_n) = \sum_x p_n(x)\log\frac{p_n(x)}{q_n(x)} \to 0.$$

But that's not true. While entropies are continuous w.r.t. setwise convergence, KLD is only LSC. It follows that the right-hand side of above equation can be strictly positive, even if both $p_n$ and $q_n$ converge to $p$. Here is an example that you can easily work out yourself: $p_n(0) = 1-1/n$, $p_n(1)=1-p_n(0)$ and $q_n(0)=1-e^{-n^2}$, $q_n(1)=1-q_n(0)$.

Wednesday, December 16, 2015

Auf der Jagd nach den verlorenen Bits

In einer Zeit, in der Information eine so große Rolle spielt, sollten wir genau wissen was mit der Information in den technischen Systemen, die wir täglich verwenden, passiert. Wie viel davon geht auf ihrem Weg von der Quelle zum Empfänger verloren? Und können wir unsere Systeme so bauen, dass dieser Informationsverlust minimiert wird?

Das Smartphone-App Ihrer Bank ist sehr praktisch wenn Ihnen auf dem Weg durch die Herrengasse in Graz in einem Schaufenster ein schöner Pulli auffällt: Ein Blick auf den Kontostand genügt und Sie wissen, ob Sie sich diese sicherlich unnötige Ausgabe leisten können. Um die Sache zu vereinfachen, stellt das App negative Kontostände mit einer roten Zahl, positive mit einer schwarzen dar. Aufgrund eines merkwürdigen (und zugegebenermaßen unrealistischen) Displayfehlers sehen Sie aber nur eine blaue Zahl. Das Display hat einen wichtigen Teil der Kontoinformation zerstört.

Aber was ist Information eigentlich? Im Wesentlichen ist Information unser Wissen bzw. Unwissen über etwas Zufälliges, je nachdem ob wir die Information besitzen oder nicht. Messen lässt sich Information z.B. über die durchschnittliche Anzahl von Ja/Nein-Fragen, die gestellt werden müssen, um unser Unwissen zu beseitigen: Nehmen wir einen zufälligen Münzwurf als Beispiel, so reicht bereits eine Frage, um uns über das Ergebnis zu erkundigen: Kopf oder Zahl? Bei zwei Münzwürfen benötigen wir zwei Fragen, um beide Ergebnisse zu erfahren, bei drei Würfen drei Fragen, und so weiter. In seiner bahnbrechenden Arbeit "A Mathematical Theory of Communicationpräsentierte ClaudeE. Shannon eine mathematische Formel, um die Information eines Münzwurfes – seine Entropie – zu berechnen und definierte das Bit als Maßeinheit.

Die Information eines Münzwurfes ist genau ein Bit: man benötigt exakt eine Ja/Nein-Frage, um den Ausgang des Wurfes zu bestimmen. Anders sieht es bei einer gezinkten Münze aus die in neun von zehn Fällen Kopf zeigt. Unser (Vor-)Wissen ist größer als im Fall einer fairen Münze und "im Durchschnitt" benötigen wir weniger Fragen, um den Ausgang des Münzwurfes zu bestimmen. Konkret kann man sich das folgendermaßen veranschaulichen: Werfen wir eine faire Münze zweimal, müssten wir immer zwei Fragen stellen, um die Ergebnisse beider Würfe zu bestimmen. Werfen wir die gezinkte Münze zweimal ist in vielen Fällen eine einzige, schlau gestellte Frage ausreichend: "Zeigten beide Würfe Kopf?" Wird diese Frage bejaht (und das wird sie in 81% aller Fälle), muss eine zweite Frage mehr nicht gestellt werden. Mit Hilfe von Shannons Formel kann man zeigen, dass die Information dieses gezinkten Münzwurfes bei ca. 0.46 Bit liegt: Es reichen "durchschnittlich" etwas weniger als eine halbe Frage, um unser Unwissen über einen Münzwurf zu beseitigen, etwas weniger als eine Frage für zwei Münzwürfe und etwas weniger als eineinhalb Fragen für drei Würfe.

Wie viel Bit der Kontoinformation wurden aber durch Ihren Displayfehler zerstört? Mit dieser Frage beschäftigte ich mich im Zuge meiner Dissertation an der TU Graz, in der ich den Informationsverlust in technischen Systemen untersuchte. Da sich mit einer einzigen Frage feststellen lässt ob der Kontostand positiv oder negativ ist, kann der Informationsverlust Ihres Displays höchstens ein Bit betragen. Dass der genaue Wert im Wesentlichen von der Zufälligkeit Ihres Kontostandes abhängt ist weniger offensichtlich, aber in Hinblick auf die gezinkte Münze leicht verständlich. Wenn Sie eine sehr sparsame Person sind und immer einen kleinen Puffer auf Ihrem Konto wissen, fehlt Ihnen nicht viel Information: Sie können sehr sicher sein dass die Zahl eigentlich schwarz sein sollte und Sie sich den Pulli leisten können. Wenn Sie eine sehr verschwenderische Person sind, deren Kontostand höchstens ein paar Tage im Monat positiv ist, fehlt Ihnen auch nicht viel Information: Die Zahl ist mit hoher Wahrscheinlichkeit rot. (Den Pulli würden Sie sich in diesem Fall wahrscheinlich trotzdem kaufen.) Bewegen Sie sich allerdings zwischen diesen beiden Extremen, kann Ihr Display einen beträchtlichen Teil der Information zerstört haben – im schlimmsten Fall ein Bit.

Ein Bit ist doch nicht viel, sagen Sie? Das kommt ganz auf das Bit an! Wenn Sie wissen, dass Ihr Kontostand zwischen -5000 und 5000 € liegt, müssen Sie nach Shannons Formel rund 20 Fragen stellen, um den genauen Betrag bis auf den Cent zu erfahren. Ihr Display beantwortet 19 dieser 20 Fragen für Sie – nur leider die wichtigste nicht: Ist der Kontostand positiv oder negativ? So interessant es also ist den Informationsverlust eines Systems zu untersuchen, praktische Bedeutung bekommt diese Theorie erst unter Miteinbeziehung des Relevanzbegriffs: Welcher Anteil der verlorenen Information ist für uns relevant, und welcher irrelevant bzw. störend? Ich erweiterte die Theorie in meiner Dissertation in dieser Hinsicht und verwendete die Resultate, um zu tun, was einem Ingenieur zu tun bestimmt ist: Systeme zu bauen!

Systeme werden nach gewissen Anforderungen gebaut, um gewisse Aufgaben zu erfüllen. Ein elektronisches Filter kann zum Beispiel entworfen werden, um störendes Rauschen beim Telefonieren zu unterdrücken und dabei das relevante Sprachsignal des Gesprächspartners möglichst nicht zu beeinflussen. Historisch bedingt – und schlichtweg am einfachsten – werden Filter nach Energiekriterien entworfen: Die Energie des störenden Rauschens soll nach der Filterung so klein wie möglich sein. Gleichzeitig darf die Energie der durch das Filter hervorgerufenen Störungen im Sprachsignal nicht zu groß werden, um eine angenehme Kommunikation der Gesprächspartner zu garantieren.

Ich versuchte mit meiner Arbeit einen anderen Weg einzuschlagen: In einer Zeit, in der viele unserer Systeme Information verarbeiten oder übertragen, sollte Information als Kriterium für den Systementwurf verwendet werden. Die von Shannon entwickelte Informationstheorie besagt, dass jedes System Information nur verringern aber nicht vergrößern kann. Jene Information, die wir aus dem Lautsprecher am Smartphone hören, war zuvor in der elektromagnetischen Welle in der Luft, im digitalen Signal im Smartphone unseres Gesprächspartners und in den Schallwellen zwischen dessen Mund und dem Mikrofon. Mehr als das: In jedem dieser verarbeitenden Systeme – Mikrofon, digitale Schaltkreise, Lautsprecher – ging Information verloren. Welches Entwurfskriterium könnte also besser geeignet sein als der Informationsverlust? Auf das obige Beispiel angewendet gilt es also ein Filter zu entwerfen welches so wenig Sprachinformation wie möglich zerstört und dabei die "Information" des störenden Rauschens soweit wie möglich reduziert.

Dass ein hinsichtlich Informationsverlust entworfenes Filter besser zur Informationsübertragung geeignet ist als ein nach Energiekriterien entworfenes, dürfte Sie inzwischen nicht mehr überraschen – eine andere Methode liefert eben ein anderes Ergebnis. Nichtsdestotrotz stieß ich während meiner Dissertation in der Literatur immer wieder auf Stellen, in denen Energie mit Information gleichgesetzt wurde. So wird zum Beispiel in der Statistik seit Jahrzehnten die Hauptkomponentenanalyse eingesetzt, um die Komplexität großer Datensätze zu verringern. Dabei werden die mehrdimensionalen Datensätze transformiert und Daten mit geringer Energie verworfen. Diese Vorgehensweise wird mit der Behauptung gerechtfertigt, dass die Daten mit der größten Energie auch die meiste relevante Information beinhalten. Diese Behauptung ist nicht immer richtig (und wer am lautesten schreit, hat auch nicht immer recht): Für die Hauptkomponentenanalyse konnte ich zum Beispiel zeigen, dass sie den Informationsverlust nur dann minimiert, wenn die relevante Information in einem besonderen Zusammenhang mit den Datensätzen steht, einer Tatsache, die in der Statistik nicht immer zutrifft. Es ist höchste Zeit, umzudenken.

Neben Filterentwurf, der Hauptkomponentenanalyse und der Analyse Ihres defekten Displays gibt es natürlich eine Vielzahl weiterer Anwendungen einer Theorie des Informationsverlusts: Zum Beispiel entwickelte ich gemeinsam mit anderen Forschern eine Methode, um Markoffschen Ketten zu vereinfachen, ohne dabei Information zu zerstören. Markoffsche Ketten – Folgen von zufälligen Zahlen, die in einem statistischen Zusammenhang miteinander stehen sind wichtige mathematische Modelle und werden in der Sprachverarbeitung, als Modelle chemischer Reaktionen, in der Genetik, in der Bioinformatik und in der Warteschlangentheorie eingesetzt.

Information ist überall. Um sie nutzbar zu machen, sollten unsere technischen Systeme – Computer, Smartphones, etc. – so wenig wie möglich davon zerstören. Und selbst wenn wir es nicht schaffen sollten, diese Systeme dementsprechend zu bauen, so sollten wir zumindest wissen wie viel Information durch ungeeignet entworfene Systeme verloren geht. Die Wichtigkeit der von Shannon begründeten Informationstheorie, die ich mit meinen Resultaten zum Informationsverlust ein klein wenig ergänzen durfte, ist nicht zu unterschätzen. Es gilt heute viel mehr als je zuvor Norbert Wieners Behauptung: "Information ist Information, weder Materie noch Energie. Kein Materialismus, der dies nicht berücksichtigt, kann heute überleben."

Tuesday, September 8, 2015

My Story with Polar Codes

It was the end of August 2010, and the weather was unusually good in Dublin. During the breaks at the IEEE Information Theory Workshop we could easily go outside, sit in the grass, and discuss the previous talks. Well, some people discussed, some just listened. Since this was my first conference in information theory (I did not even have a paper there, my boss allowed me to go there without actively participating -- an opportunity I am extremely grateful for), I clearly belonged to the second group. But then, already on the second day, there were some talks I felt comfortable with: Erdal Arikan gave a lecture on Polar Codes, a new coding technique he invented recently. What followed were several talks related to polar coding, including talks from Emmanuel Abbe (who was my main source of inspiration for my investigation of the fractality of polar codes), Tanaka (who looked at the polar coding procedure from a stochastic dynamical system point-of-view), Ido Tal and Alexander Vardy (whose code construction technique was extremely useful in proving fractality), and Eran Hof. While I did not understand a lot from his talk, I owe Eran big time: Being a total noob, I did not know anybody at this conference, and he was so kind to introduce the "big players" to me. That was absolutely necessary, because during one of the breaks I approached Sergio Verdu and asked him about the polar code construction technique he just presented. Well, apparently I confused him with Alexander Vardy (I have no idea why). Anyways, although this was one of the best conferences I've ever been to (an excursion to Castle Malahide, Whiskey tasting at Jameson's, a Riverdance show during the conference dinner, a pleasant stay at a cute little hotel, great weather, nice people, and all this together with my back-then-future-wife), this was just the beginning of my interest in polar coding.

It was in November 2011 when I presented my first paper at an information theory-related conference, the ISWCS in Aachen (during one of the short coffee breaks I met Gerhard Kramer, my current boss, for the first time; I also met Georg Böcherer, just finishing his PhD and winning the best paper award). As usual for these conferences, the first day featured tutorials in the afternoon. One of these tutorials was on polar codes, held by Emmanuel Abbe. I still have the printed handouts; one of the few paper documents I moved from Graz to Munich. It was an unforgettable experience, partly because of the awesome presentation and Emmanuel's great didactic skills, partly because of Emmanuel's lively discussion with the audience (particularly with Tor Aulin). And this great impressions led me to think about possible research directions on the train ride home to Graz. In a document called "Resarch Ideas.txt" I wrote
If a channel (one of the partitioned channels) is already very good (close to perfect), is it possible to assume all its descendant channels to be good? [...] Can we use this to find an approximation of a "good" polar code?
And then nothing happened for quite some time, as I continued my PhD research on information loss in deterministic systems. All this time, these polar codes stayed in the back of my mind and came to the front whenever I attended a conference (there were ALWAYS talks on polar codes). But time went by fast, I finished my PhD and -- totally unexpected -- started working at the Institute for Communications Engineering in Munich, becoming a member of Gerhard's lab and Georg's colleague. Some months there, not much research done (I was busy writing project proposals), the whole lab took a skiing trip to the alps in South Tirol, disguised under the title "Joint Conference on Communications and Coding". Well, in fact we did more research than skiing, something which I am quite grateful for (there are some very embarrassing photos of me trying cross-country skiing...). Instead of presenting work already done, I presented open problems, among them some ideas about polar codes. The preparation for the talk inspired me to write a blog article, and soon after that all the questions in this article were answered. Emmanuel suggested via email to extend the analysis to Reed-Muller codes, which was completed almost equally fast. It was one of the very few lucky streaks I had in my research career so far, and still some open questions pester me.

Although I am not sure if any of the obtained results is useful, I'm extremely happy to have these things out of my mind (and in a paper on arXiv). I could also present these results at the NEWCOM# Emerging Topics Workshop in Cambridge. One of the participants of this workshop was Erdal Arikan, the father of polar codes, and the very same person we were sitting next to during the whiskey tasting event at Jameson's in Dublin five years ago.

Wednesday, July 8, 2015

Candy Crush Codes

Candy Crush Saga is a game in which... no, I don't think I have to tell you! But I bet you didn't know yet that the game is actually NP-hard! The original proof can be found on arXiv (or, again, on arXiv; these authors claim to prove a little more general result), but there are also plenty of pop-sci articles around. (Now, that is quite interesting: Walsh posted his paper on March 8th, 2014, and the press wrote articles about it only five days later.) Anyways, that's not really what I wanted to tell you.

A typical game field of Candy Crush Saga (copyright with King Inc., taken from Wikipedia)
If you look at above picture, you will see no three candies in a row or in a column having the same flavor. Why? Because they would disappear immediately (that's more-less the only game rule). But this game rule more-less immediately leads to the following question:

If and $n\times m$ field, filled with candies of $k$ different flavors that are randomly, independently and uniformly chosen, what is the probability that there is at least one horizontal or vertical triple of the same flavor?

In other words, what is the probability that, starting a new level, something happens without you actually touching the screen. The question is obviously combinatorial and interesting in itself. But there is even more to it.

If we look at $n\times m$ fields filled with candies of $k$ different flavors, there are in total $k^{nm}$ possible arrangements. The rules of Candy Crush Saga, however, allow only a subset of these, namely,  $k^{nm}$ minus the answer to above question. For example, if we take just one row with three columns and two flavors, there are eight possible arrangements, of which two are excluded. Taking two rows allows 36 out of  64 arrangements, and three rows permit 102 out of 512. (At least that's what my first calculations show.) In other words, the set of all possible fields allows for a redundancy: If we want to transmit messages by sending pictures of Candy Crush fields, a $3\times 3$ field allows you transmit only up to 102 different messages, although there are 512 different possible arrangements. If now, during the transmission, an error occurs -- the flavor of a candy changes -- you might end up with an invalid field! Candy Crush Saga can be used as a nonlinear block code! Super-exciting, right?

At that, in turn, allows us to ask several questions:

  • How should we encode a message efficiently? Is there anything better than just keeping a table of , say,102 fields? (There is.)
  • How should we decode the message efficiently? That's a more difficult question if you don't want to store a table of all allowed arrangements. Maybe belief propagation can help here, as it does for Sudoku codes.
  • What are the basic properties of an $(n,m,k)$ Candy Crush Code? What is the maximally possible rate given the game rule, i.e., what is the ratio between the number of possible and allowed arrangements? What is the minimum Hamming distance, etc.? How can we improve our encoder in order to guarantee a positive Hamming distance?
  • Given that we change the flavor of every candy with a given probability $p \ll 1$ (and choose the new flavor uniformly among the remaining ones), what is the error probability? I.e., what is the probability that the changed arrangement is not detected as erroneous, because it also satisfies the game rules?
  • What about error propagation of this code? I.e., if a given candy changes its flavor, how many "information bits" are effected? Can we still reconstruct part of the information, or is the whole block destroyed? (I guess the answer is no.)
  • Connecting to the last question: What happens if we let one dimension of the field go to infinity? We would need sequential encoding and decoding techniques, such as they are usual for convolutional codes.
  • What happens to all of these topics if we span the field on a cylinder (such that the rules extend over one side of the field: two candies of a given flavor on one side rule out a single candy of the same flavor on the other side) or on a torus?
Most probably, the code thus constructed is crappy. But it is nevertheless a nice topic for a research internship, I guess. And it makes learning nonlinear codes for symmetric channels as fun as Sudoku codes do for erasure channels!

Wednesday, June 17, 2015

Polar Code ARE Fractal!

This is a post about very recent work that has not yet been reviewed, but is publicly available on arXiv. This is not science by press conference, but hopefully encouraging discussions and comments from other researchers in the field.

At least, that is what I found out -- I'm pretty confident of the correctness of the results, but, you know, mistakes happen all the time. If you feel that some of my derivations may have flaws, I'd be happy if you could contact me!

Alright, regarding the questions in one of my previous blog post, what have I found out? Here's the short version, for more details please take a look at the paper on arXiv:

  • The set of good channel $\mathcal{G}$ is Lebesgue measurable, and its Lebesgue measure is $I(W)$, the capacity of the (unpolarized) channel. This is intuitive in the sense that a fraction of $I(W)$ channels will be polarized to perfect channels. From this immediately follows that the Hausdorff dimension of this set is one.
  • The dyadic rationals are good and bad. This is because dyadic rationals have two possible binary expansions, one terminating (with infinitely many zeros) and one repeating (with infinitely many ones). While the former polarizes the channel to a useless one, the latter polarizes it to a perfect one.
  • Even if we do not take into account the dyadic rationals, the good channels still form a dense subset of the unit interval. The reason is that between any two dyadic rationals we can find a rational number (with repeating binary expansion) which leads to a good channel. For the binary erasure channel we can also find a rational number leading to a bad channel, so we can say a little more about BECs.
  • And finally, the set of good channels is indeed self-similar, in the sense that it is a subset of its (scaled and shifted) right half (and left half for BECs). This can be most easily explained by a picture:

This picture (generated by the following code, polynomial composition from this forum) shows the thresholds on the Bhattacharyya parameter for some values inside the unit interval: If the Bhattacharyya parameter is smaller than the threshold, then the channel will polarize to a perfect channel. It can be seen that the right half (top) is an upper bound on the original set (center), which is (for BECs) an upper bound on its left half (bottom).

The most useful tools to derive these results were results about the binary expansion of real numbers, about fixed points of dynamical systems, and the Tal-Vardy polar code construction technique.