Tuesday, March 10, 2015

Are Polar Codes Fractal?

Alright, this is very preliminary. I myself don't yet believe most of the things I say below; I just hope that they are true. I also hope that in the near future we will be able to prove them true, or prove them false. Let's see. In any case, I will keep you posted!

Polar codes are a beautiful new scheme to achieve the capacity of a binary-input memoryless channel (BIMC) without requiring iterative decoders, and without relying on random coding arguments (which are beautiful mathematical methods, too). They were first introduced by Erdal Arikan in 2009, so the topic of polar coding is relatively new and still hot. Although the encoding and decoding complexity of these codes is low compared to other methods with similar performance, the construction of polar codes is a difficult (and computationally costly) problem; an approximate code construction method with sub-exponential complexity was proposed by Tal and Vardy. That polar codes might be fractal was mentioned in a blog post by Emmanuel Abbe a long time ago. Kahraman, Viterbo, and Celebi observed that the coding matrix used there (very similar -- or identical? -- to a Hadamard matrix over the Galois field GF(2)) resembles the Sierpinski triangle, a well-known fractal.

What are Polar Codes?

I'd like to keep this part very short, and instead refer the interested reader to Arikan's paper or to Sasoglu's monograph. But at a very abstract level, we can interpret these codes in the following way: Assume you have a BIMC with capacity $0<C<1$. You are interested in a code that uses the channel multiple times, such that you can transmit at a rate sufficiently close to $C$ and with a very small probability of error.

There are two ways you can use the channel twice: Either as it is, leading to a total capacity of $C+C=2C$, or with preprocessing of the channel inputs. The preprocessing will lead to one channel use with a higher capacity $C^1>C$ and one channel use with a lower capacity $C^0<C$, which satisfy $C^1+C^0=2C$. Similarly, if you want to use the channel four times, you can either take $C+C+C+C=4C$, or $C^1+C^1+C^0+C^0=4C$, or you can preprocess the two better channels $C^1$ into two channels with $C^{11}>C^1>C^{10}$, and the two worse channels $C^0$ into two channels with $C^{01}>C^0>C^{00}$. Note that here you already have

$$C^{11} > C^1 >C>C^0>C^{00}.$$

In other words, out of multiple uses of a given channel with capacity $C$, you can generate multiple channels with different capacities by preprocessing; some of these channels will have a higher capacity, some of them will have a lower capacity. Instead of using the same channel multiple times, you use preprocessing to get better channels and worse channels. And indeed, if we do infinitely many preprocessing steps (doubling the number of channel uses in each step), we end up either with a perfect channel ($C=1$) or with a completely useless one ($C=0$): The channels have been polarized, hence the name for this type of code.

$$C^{110101000101101001101100101010100101011101101000...} \text{ is either 0 or 1}$$
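For the binary erasure channel (BEC), the two capacities after one preprocessing step are known in closed form: a BEC with capacity $C$ splits into a worse channel with capacity $C^0=C^2$ and a better one with capacity $C^1=2C-C^2$ (these are the standard BEC recursions from Arikan's paper). A minimal sketch in Python, with function names of my own choosing:

```python
def polarize_bec(c):
    """One polarization step for a binary erasure channel (BEC) of
    capacity c: returns the worse and the better channel's capacity."""
    return c * c, 2 * c - c * c   # (C^0, C^1), always summing to 2*c

c0, c1 = polarize_bec(0.433)
print(c0, c1, c0 + c1)            # approx. 0.187489 0.678511 0.866
```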

The interesting fact is that if we do these preprocessing steps randomly, i.e., we randomly choose between making the channel better (a 1 in the superscript) or worse (a 0 in the superscript), then with probability $C$ we get a perfect channel, and with probability $1-C$ a useless one. The idea of polar coding is to transmit data on the perfect channels (where no error occurs), and to set all the other, bad channels to fixed or "frozen" bit values. Since, according to the polar coding theorem, a fraction $C$ of the channels will always be good, the rate of the code is equal to (or at least close to) the capacity $C$ of the original channel.
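For completeness, here is the martingale argument behind this statement (it is essentially Arikan's original reasoning, so nothing here is new): each preprocessing step splits the capacity evenly, $\frac{1}{2}C^0+\frac{1}{2}C^1=C$, so the capacity along a random polarization path is a bounded martingale and converges almost surely; by the polarization result, the limit $C^\infty$ lies in $\{0,1\}$, and therefore

$$ C=E[C^\infty]=1\cdot Prob[C^\infty=1]+0\cdot Prob[C^\infty=0]=Prob[C^\infty=1].$$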

Which Channels are Good, Which are Bad?

That's the interesting question! But before we go on to talk about fractality, let me mention that in practice we never go all the way to the perfect or useless channels, as this would require infinite block lengths. Instead, we fix a number $n$ and perform exactly $n$ preprocessing steps, ending up with $2^n$ channel uses, i.e., a block length of $2^n$. However, even in this case it is often not easy to decide which of these $2^n$ channels are close to perfect! But I hope to convince you that the true problems appear as $n$ goes to infinity.

The picture below shows the channels obtained by preprocessing a binary erasure channel with capacity $C=0.433$. Preprocessing, or polarization, has been performed seven times, leading to 128 channels. As you can see, some of these channels are already quite close to having a capacity of 1, and some of them have a capacity close to 0. Intuition would suggest that if we continue polarizing, the channels on the left will stay bad, while the channels on the right will become perfect.

The capacity of 128 channels derived from a single binary erasure channel with capacity 0.433
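If you want to reproduce the numbers behind this figure, here is a minimal sketch using the BEC recursions from above (the ordering of the 128 channels is one natural choice and not necessarily the one used in the plot):

```python
caps = [0.433]
for _ in range(7):   # seven polarization steps double the channels each time
    caps = [x for c in caps for x in (c * c, 2 * c - c * c)]
print(len(caps), min(caps), max(caps))   # 128 capacities, pushed towards 0 and 1
```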

To show that this is not the case, let's focus on the channel marked with a red point:

Polarizing a channel with low capacity (shown in red)...

This channel has a very low capacity, and from now on we will polarize this channel only. Starting from a very bad channel, even the "better" channel derived from it will be relatively bad (i.e., have a low capacity). But after sufficiently many polarization steps, this channel with a capacity close to zero will have generated several channels with a high capacity, and in the long run some of these channels will even be perfect ($C=1$): Indeed, if we start polarizing this channel with capacity $C^{b_1b_2\cdots b_7}$ (where $b_1b_2\cdots b_7$ is the sequence of preprocessing steps that led to it), after infinitely many steps a relative fraction of $C^{b_1b_2\cdots b_7}$ of the derived channels will be perfect.

...will eventually give rise to good channels.
Thus, even if after some steps a channel appears to be very bad (good), this channel will in the long run give rise to good (bad) channels.
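This, too, can be checked numerically for the BEC. A Monte Carlo sketch (the number of steps, the number of trials, and the threshold for calling a channel "good" are arbitrary choices of mine):

```python
import random

def fraction_good(c, steps=40, trials=100_000, threshold=0.99):
    """Monte Carlo: polarize a BEC of capacity c along random paths and
    count how often the resulting channel looks (near-)perfect."""
    good = 0
    for _ in range(trials):
        x = c
        for _ in range(steps):
            x = 2 * x - x * x if random.random() < 0.5 else x * x
        good += x > threshold
    return good / trials

print(fraction_good(0.05))   # roughly 0.05: even a bad channel yields good ones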

Let's Talk Business!

I hope by now I have given you some intuition on why I believe that polar codes could be fractal: wherever you start, wherever you look, you will always find good and bad channels. But let's make all of this a bit more precise: Take a real number $b\in [0,1]$, and let $b_2$ be the digit sequence of the fractional part of its binary expansion. In the general case, $b_2$ is an infinite sequence of zeros and ones. To be precise, if $b$ is irrational, $b_2$ is non-repeating and infinitely long, and if $b$ is rational, then $b_2$ is either terminating or recurring. Let $C^b$ be the capacity of the channel obtained by preprocessing/polarizing the original channel with capacity $C$ according to the sequence specified by $b_2$. Then, the polar coding theorem says that

$$ \text{for almost all } b\in[0,1]{:}\ C^b\in\{0,1\}.$$
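To make this concrete, here is a sketch that approximates $C^b$ for the BEC by truncating the binary expansion of $b$ (the truncation depth and the sample values of $b$ are arbitrary choices of mine):

```python
def capacity_along_b(c, b, depth=60):
    """Approximate C^b for a BEC of capacity c by following the first
    `depth` binary digits of b (digit 1: better step, digit 0: worse)."""
    for _ in range(depth):
        b *= 2
        digit = int(b)   # next binary digit of b
        b -= digit
        c = 2 * c - c * c if digit else c * c
    return c

for b in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(b, capacity_along_b(0.433, b))   # each ends up very close to 0 or 1
```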

The term "for almost all" is well-defined and relates to the fact that there may be $b$ such that $C^b$ is neither zero or one, but that these $b$ will be extremely rare. The quest for a good polar code is the quest for finding good channels, i.e., it is the quest for specifying the set

$$G:=\{b\in[0,1]{:}\ C^b=1\}.$$

Let's see what we can say about this set.

$G$ is Lebesgue-measurable and its Lebesgue measure is $C$?

Why do I think that this is true? Well, I hope that we can replace the standard probability space $(\Omega,\mathcal{A},P)$ by $([0,1],\mathfrak{B}([0,1]),\lambda)$, where $\lambda$ is the Lebesgue measure and $\mathfrak{B}([0,1])$ is the Borel sigma algebra on the unit interval $[0,1]$. Clearly, $([0,1],\mathfrak{B}([0,1]),\lambda)$ is a probability space. The original proofs of channel polarization rely either on the theory of martingales or on stochastic dynamical systems. As far as I know, both define the (random) sequence of polarizations leading to a specific channel $C^b$ as a realization of an i.i.d. Bernoulli process, and one such realization corresponds to one element of the probability space $\Omega$. If we can show that, equivalently, one such realization corresponds to exactly one element $b$ of $[0,1]$, we can use the probability space associated with the unit interval. The claim that $G$ is measurable then follows from the fact that $C^b$ is a well-defined random variable (hence measurable). And since we know that

$$ C=Prob[C^b=1]=\lambda(\{b\in[0,1]{:}\ C^b=1\})=\lambda(G)$$

this would show that the Lebesgue measure of $G$ is indeed $C$.

$G^c$ is Dense in $[0,1]$?

I think this should be relatively easy to show: The complement of $G$, $G^c$, contains all $b$ such that the corresponding channel has capacity $C^b<1$. Now assume that the binary expansion of $b$ is terminating, i.e., after finitely many zeros and ones we can be sure that there will be only zeros left, and infinitely many of them. These zeros will polarize the channel to a capacity of zero, hence, all $b$ with a terminating binary expansion will be in $G^c$. But all dyadic rationals, i.e., rationals that can be written as $p/2^q$ for integers $p$ and $q$, have terminating binary expansions. Consequently, the set of dyadic rationals in $[0,1]$ is a subset of $G^c$. Moreover, since the set of dyadic rationals is dense in the set of reals, the set of dyadic rationals in $[0,1]$ is dense in $[0,1]$. Hence, $G^c$ is dense in $[0,1]$.
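For instance, take $b=3/4=0.11_2$: two "better" steps followed by an infinite tail of "worse" steps. A quick check for the BEC (truncating the tail is my choice):

```python
c = 0.433
for digit in [1, 1]:     # the terminating part of b = 0.11 (binary)
    c = 2 * c - c * c    # "better" step
for _ in range(20):      # the infinite tail of zeros, truncated to 20 steps
    c = c * c            # "worse" step: squaring crushes the capacity
print(c)                 # prints 0.0 (underflows): b = 3/4 lies in G^c
```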

What does this mean? It means that every interval $I\subset [0,1]$ will contain at least one "bad" channel. More abstractly, if a polar code is a stochastic dynamical system with two attractors, then the basins of attraction (in terms of the number $b$) are not intervals. (I'm not yet sure if this is true, and even if it is, I have no idea what it means.)

$G$ is Self-Similar?

Self-similarity is typical for fractals, so showing that $G$ is self-similar may suggest that $G$ is fractal, too. Fractality is a concept related to the dimension of a set, namely to the fact that this dimension is non-integer. In fact, since there are many measures of dimensionality, the definition of fractality may depend on the measure used. To make a strong statement that (and why!) the construction of polar codes is difficult, I think it suffices to show that $G$ is self-similar. To prove self-similarity, we would need the following result:

Claim: Let $C_1$ and $C_2$ be the capacities of two different channels, and let $C_1<C_2$. Then, for all $b\in[0,1]$, $C_1^b\le C_2^b$ (the limits may coincide at 0 or 1, but the order is never reversed). Hence, if $G_1$ and $G_2$ are the sets of good channels associated with $C_1$ and $C_2$, respectively, then $G_1 \subseteq G_2$.

In other words, we would need that if a specific sequence polarizes a given channel to zero (one), then it will polarize all worse (better) channels to zero (one) as well. This is equivalent to saying that a dynamical system iterating according to a number $b$ has exactly two attractors and two basins of attraction. In other words, for a given $b$ we can find a number $\theta$ such that for all $C>\theta$, $C^b=1$, and for all $C<\theta$, $C^b=0$. $C=\theta$ may of course be another fixed point, but this fixed point is unstable. For the BEC, at least the order-preserving part is easy to see: both one-step maps $C\mapsto C^2$ and $C\mapsto 2C-C^2$ are non-decreasing on $[0,1]$, so every composition of them preserves the order of capacities, which already gives $G_1\subseteq G_2$.

Example: Consider the binary erasure channel and let $b=1/3$, with binary expansion $b_2=010101\ldots$. One period of this pattern maps the capacity $C$ to $2C^2-C^4$ (one "worse" step followed by one "better" step), and $\theta=\frac{\sqrt{5}-1}{2}$ is the nontrivial fixed point of this map. A binary erasure channel with capacity $C>\frac{\sqrt{5}-1}{2}$ will converge to a perfect channel, one with a capacity strictly lower than this number will converge to a useless channel, and one with a capacity exactly equal to this number will stay there forever. (This is why the polar coding theorem can only show convergence almost surely, or for almost all $b$: There may be $b$ for which a given channel converges to neither zero nor one.)
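A quick numerical check of this threshold (the offsets and the number of periods are arbitrary choices of mine):

```python
PHI = (5 ** 0.5 - 1) / 2   # the critical capacity, approx. 0.618

def polarize_along_one_third(c, periods=50):
    """Polarize a BEC along b = 1/3, i.e. the repeating digit pattern 01."""
    for _ in range(periods):
        c = c * c            # digit 0: worse step
        c = 2 * c - c * c    # digit 1: better step
    return c

for c in [PHI - 0.01, PHI, PHI + 0.01]:
    print(c, "->", polarize_along_one_third(c))
# below the threshold the capacity dies, above it it converges to 1;
# the fixed point itself moves only by floating-point error (and with many
# more periods that error would eventually push it to 0 or 1, too)
```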

If this claim holds, then we can say the following: Clearly, $[0,0.5]$ is similar to $[0,1]$ in the sense that the unit interval is obtained by a simple left shift of the binary expansions in $[0,0.5]$, or $[0,1]=2\cdot[0,0.5]$. Similarly, the set $[0.5,1]$ differs from the set $[0,0.5]$ only in the first digit of the binary expansion. Moreover, $[0,1]=2\cdot[0.5,1]-1$. We consider again a channel with capacity $C$, whose set of good channels is $G$. Preprocessing the channel once either improves it or worsens it, so we get capacities $C^1>C>C^0$. Let $G_1$ be the set of good channels when starting with channel $C^0$, and let $G_2$ be the set of good channels when starting with $C^1$, i.e., let

$$ G_1:=\{b\in[0,0.5]{:}\ C^b=1\}$$

and

$$ G_2:=\{b\in[0.5,1]{:}\ C^b=1\}.$$

But since the set $[0,0.5]$ is similar to the set $[0,1]$, the set $G_1$ will also be similar to the set $G$: In fact,

$$ 2\cdot G_1=\{b\in[0,1]{:}\ C^{0b}=1\}$$

and

$$ 2\cdot G_2-1=\{b\in[0,1]{:}\ C^{1b}=1\}.$$

Since the first preprocessing step changes the capacity of the channel with which we start polarizing, we can apply the claim to get $2\cdot G_1 \subset G \subset 2\cdot G_2-1$. At the same time, however, $G=G_1\cup G_2$. In other words, the set $G$ consists of a better (right) half and a worse (left) half, both of which are similar to $G$ in the sense of the above set inclusions. (If the sets were equal rather than nested, the set $G$ would be composed of exactly two scaled copies of itself.) Of course, this analysis can be iterated ad infinitum. $G$ is self-similar.
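At finite depth, these inclusions can at least be observed numerically. A sketch for the BEC, where "good" is approximated by a capacity threshold (the depth and the threshold are arbitrary choices of mine):

```python
from itertools import product

def capacity_after(c, digits):
    """Polarize a BEC of capacity c along a finite digit sequence."""
    for d in digits:
        c = 2 * c - c * c if d else c * c
    return c

def good_set(c, depth, threshold=0.99):
    """All digit strings of a given depth whose channel looks good."""
    return {s for s in product([0, 1], repeat=depth)
            if capacity_after(c, s) > threshold}

C = 0.433
C0, C1 = C * C, 2 * C - C * C     # capacities after one worse/better step
depth = 12
G = good_set(C, depth)
G1 = good_set(C0, depth)           # plays the role of 2*G_1
G2 = good_set(C1, depth)           # plays the role of 2*G_2 - 1
print(G1 <= G <= G2)               # True: exactly the nesting claimed above
```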

Fractal Dimension of $G$?

Alright, this might be a tough one. Since $G$ will probably be dense, the box-counting dimension will probably be 1. Since $G$ consists of (more or less) two copies of itself, scaled by a factor of $1/2$, the scaling dimension will be 1, too. The Hausdorff dimension is beyond my understanding.