Entropy of Random Talks: analysis

Recently, my colleage Georg came to my office to show me some exciting result he found in a book: A bound on mutual information (MI) based on the variational distance (VD) between the joint distribution and the product of marginal distributions. In other words, if $X$ and $Y$ are random variables with probability mass functions (PMFs) $p_X$ and $p_Y$, and a joint PMF $p_{XY}$, one can write

$$ I(X;Y) \le C \Vert p_{XY}-p_Xp_Y\Vert, $$

where $C$ only depends on the alphabet sizes of $X$ and $Y$ and where

$$\Vert p_{XY}-p_Xp_Y\Vert = \sum_{x,y} |p_{XY}(x,y) - p_X(x)p_Y(y)| .$$

This of course implies that if the variational distance goes to zero, then so does the mutual information. This result is exciting since in general it is not possible to bound Kullback-Leibler divergence (KLD; mutual information is a special case of a Kullback-Leibler divergence) via variational distance, at least not by a bound that does not depend on the distributions. The reverse is always possible: The Kullback-Leibler divergence bounds variational distance via the well-known Pinsker inequality.

While I was surprised by the existence of such a bound, I was not surprised by the weaker result that if VD goes to zero then also MI does. At least if $X$ and $Y$ have a finite alphabet, then the entropies $H(X)$, $H(Y)$, and $H(X,Y)$ are continuous in the PMFs, hence convergence of the PMFs implies convergence of the entropies, and hence, of the MI.

Very interestingly, however, it turned out that KLD is NOT continuous, but only lower semi-continuous (LSC), even if the supports of the distributions are finite:

A sequence of (probability) measures $\mu_n$ is said to converge setwise or strongly to another probability measure $\mu$ iff for every measurable set $A$ one has $\mu_n(A)\to \mu(A)$. I think if $\mu=n$ are discrete probability measures with finite alphabets, i.e., can be described by PMFs $p_n$ with finite support, then setwise convergence is equivalent to the fact that, for every point $x$ in the support, $p_n(x)\to p(x)$. I furthermore think that for such finite PMFs setwise convergence is equivalent to the fact that $\Vert p_n-p\Vert\to 0$. (I should check that.) Now let us take two sequences $p_n$ and $q_n$ of PMFs. Let both have a finite support. One would guess that if $p_n \to p$ and $q_n\to p$, then the KLD

$$ D(p_n\Vert q_n) = \sum_x p_n(x)\log\frac{p_n(x)}{q_n(x)} \to 0.$$

But that's not true. While entropies are continuous w.r.t. setwise convergence, KLD is only LSC. It follows that the right-hand side of above equation can be strictly positive, even if both $p_n$ and $q_n$ converge to $p$. Here is an example that you can easily work out yourself: $p_n(0) = 1-1/n$, $p_n(1)=1-p_n(0)$ and $q_n(0)=1-e^{-n^2}$, $q_n(1)=1-q_n(0)$.

This post is loosely coupled with my recent post on the fractality of polar codes; although the problem posed there is already solved (I might write about that later), there are still a few open questions which might be of independent interest.

Assume you are given two functions $g_0{:}\ [0,1]\to[0,1]$ and $g_1{:}\ [0,1]\to[0,1]$, which are defined as
$$ g_0(x)=2x-x^2 $$
and
$$ g_1(x)=x^2. $$

Now assume that we have a sequence $a^k:=(a_1,a_2,\dots,a_k)\in\{0,1\}^k$ which determines the polynomial
$$ p_{a^k}(x) := g_{a_k}(g_{a_{k-1}}(\dots g_1(x)\dots )). $$

We are now interested in the properties of this polynomial. Since the case where all $a_i$ are equal (i.e., all are either zero or one) is trivial to analyze, we will assume that $a^k$ contains zeros and ones. W.l.o.g. (see below) we may assume that $a_1=1$ and $a_2=0$, hence $p_{a^2}(x)=2x^2-x^4$. We are looking for the number of fixed points of $p_{a^k}$ for fixed $k$, i.e., points $x^*\in[0,1]$ which satisfy $x^*=p_{a^k}(x^*)$. Can't be that complicated, can it?

What do we know about $p_{a^k}$?

Only a few things, unfortunately:

Since $g_0$ and $g_1$ are from $[0,1]$ to $[0,1]$, so is $p_{a^k}$.
Since $g_0$ and $g_1$ are strictly monotonously increasing, so is $p_{a^k}$.
Since $g_0$ and $g_1$ have fixed points at 0 and 1, so has $p_{a^k}$.
For $a_1=1$ and $a_2=0$, the derivative of $p_{a^k}$ vanishes at 0 and 1; hence, these two fixed points are attractive.
$p_{a^k}$ is a polynomial in $x$ of order $2^k$; hence, by the fundamental theorem of algebra, it has exactly $2^k$ roots; hence, so has $p_{a^k}(x)-x$, which means that there are exactly $2^k$ fixed points.

What do we expect?

Well, this is easy: I think (and extensive numerical analyses confirmed that) $p_{a^k}$ has exactly one (repelling) fixed point on $(0,1)$. This can be understood intuitively as follows: The fixed point on $(0,1)$ is the intersection of $p_{a^k}$ with the identity function ($f(x)=x$), both of which pass through 0 and 1. Since the slope of $p_{a^k}$ is zero at 0 and 1, it must lie above the identity function for values close to 1 and below it for values close to 0. Hence, there should be one point in between where $p_{a^k}$ crosses the identity function. The properties of $p_{a^k}$ ensure that there can be only an odd number of fixed points on $(0,1)$.

What about the rest of the $2^k$ fixed points? They either have to lie outside $[0,1]$ or have to be complex (and conjugated).

What have I already tried?

Not much, I'm afraid, but I spend a few days on this problem. I tried essentially two different approaches, but neither lead (quickly) to the desired result:

The first approach relied upon mathematical induction: For $p_{a^2}$ it is easy to show that the derivative is a quasi-concave function, which admits the application of certain inequalities. If then $g_0(p_{a^k})$ and $g_1(p_{a^k})$ have quasi-concave derivatives, the problem will be solved. But unfortunately, this cannot be shown (there are counter-examples). The approach could be refined by using the fact that the derivative of $p_{a^2}$ is convex on intervals $[0,a]$ and $[b,1]$, but concave on the interval $[a,b]$ (for $a$ and $b$ chosen appropriately). That this property is preserved under applying $g_0$ or $g_1$ is not as easy to show. And even if we would succeed in that, it is not clear if this delivers the desired result.

This brings us to the second, more direct approach: It relies on counting the number of real roots of $p_{a^2}(x)-x$ in the interval $(0,1)$. While Descartes' rule of signs is of not much use here, and while Sturm's method is too difficult to apply in the general case, Fourier's theorem looks very promising, given that the derivatives need to be evaluated only at the fixed points 0 and 1. Since Fourier's theorem involves higher-order derivatives of a composition of functions, the most straightforward way would be to apply Faa di Bruno's formula to either $g_i(p_{a^k}(x))$ or to $p_{a^k}(g_i(x))$. "Straightforward" in this case does not mean "easy", and I had to quit my efforts soon (mental note: I should offer a student's project...).

A third, related approach is to show that the second derivative of $p_{a^k}$ has exactly one root in $(0,1)$, showing that $p_{a^k}$ has only one inflection point in this interval. This would do the job, but again requires tedious calculations with Faa di Bruno's formula.

What does this have to do with Polar Codes?

I'm glad that you ask. You might remember the example from my previous post:

Example: Consider the binary erasure channel and let $b=2/3$. Then, the binary expansion of $b$ is 101010101.... A binary erasure channel with capacity $C>(\sqrt{5}−1)/2$ will converge to a perfect channel, one with a capacity strictly lower than this number will converge to a useless channel, and one with a capacity exactly equal to this number will stay at this number.

Essentially, all rational numbers have binary expansions that are recurring, i.e., that repeat after a given period. We do not consider dyadic rationals here, although they can be called recurring, too. In general, i.e., if the rational is not dyadic, then the period will contain zeros and ones. And since we can analyze any period we want, we can choose the period such that $a_1=1$ and $a_2=0$.

Polar coding also suggests that what we expect is in fact true: Lemma 11 in this paper states that, for almost every binary expansion $a^\infty$ there is exactly one fixed point of $p_{a^\infty}$ in $(0,1)$. This "almost every" unfortunately does not tell us a little thing about binary expansions of rational numbers, since in this context, rational numbers are almost nothing. Hence, if we were able to show above property for $p_{a^k}$, we could improve our knowledge about the behavior of polar codes, or of the polarization effect in general.

Let me know if you want to work on this -- it is a nice problem, and I would love to see a definite answer to it.

Entropy of Random Talks

Pages

Tuesday, March 8, 2016

OMG, KLD is only LSC!

Wednesday, April 1, 2015

Properties of a Special Iterated Function System (open)

What do we know about $p_{a^k}$?

What do we expect?

What have I already tried?

What does this have to do with Polar Codes?