Entropy of Random Talks: probability

Showing posts with label probability. Show all posts

Thursday, March 15, 2018

An Inequality Between Conditional Cross Entropy and Error Probability

Some time ago I wrote about an inequality between conditional entropy and error probability. Specifically, one can show that if $\hat{x}(Y)$ is the maximum a posteriori estimate of a discrete random variable $X$ given the random observation $Y$, then the probability of making an error is bounded from above by

$$
P_e = \mathbb{P}(X\neq \hat{x}(Y)) \le 1- 2^{H(X|Y)}.
$$

This inequality is not new. I was not aware, however, of any previous references until I met Igal Sason at the International Zurich Seminar earlier this year. He was speaking about a paper he recently published which includes this inequality (eq. (2)), together with many more general results relating error probability with Arimoto-Renyi conditional entropy.

I was hoping to generalize above inequality to the mismatched case. Specifically, let $p_{X|Y}$ be the conditional distribution of $X$ given $Y$. The MAP estimator is $\hat{x}(y) = \arg\max_x p_{X|Y}(x|y)$. Now suppose that the MAP estimator does not have access to $p_{X|Y}$ but only to a surrogate distribution $q_{X|Y}$, in which case the estimator becomes $\tilde{x}(y) = \arg\max_x q_{X|Y}(x|y)$. I thought to get a bound similar to the one above that includes this surrogate distribution and that bounds the error probability $\tilde{P}_e = \mathbb{P}(X\neq \tilde{x}(Y))$. To be more precise, I was hoping to exchange the conditional entropy above by some conditional cross entropy, i.e., to obtain a bound similar to

$$
\tilde{P}_e \le 1- 2^{\mathbb{E}(H^\times(p_{X|Y}(\cdot|Y)\Vert q_{X|Y}(\cdot|Y)))}
$$

possibly with the order of the distributions reversed. Since cross entropy is an upper bound on entropy, the bound will be larger than the bound for $P_e$. (Interestingly, even if $p_{X|Y}$ and $q_{X|Y}$ are different, we nay have $\tilde{P}_e =P_e$: It suffices that, for every $y$, both conditional distributions have their modes at the same values of $x$.)

Unfortunately, my hope was not justified. We show this in a counterexample, i.e., we show that there exists distributions such that the inequality on $\tilde{P}_e$ is violated (for any choice of order of distributions). For the sake of simplicity, we now look at guessing rather than estimation, i.e., the conditional distributions satisfy $p_{X|Y}=p_X$ and $q_{X|Y}=q_{X}$. Now suppose that $X$ is binary, i.e., it takes values 0 and 1. Furthermore, let $p_X(0)=q_X(1)=\epsilon\ll 0.5$. We have that $\tilde{x}=\arg\max_x q_X(x) = 0$, thus

$$\tilde{P}_e=\mathbb{P}(X\neq 0)=1-\epsilon.$$

Since $p_X$ and $q_X$ are permutations of each other, we have that $H(q_X)=H(p_X)$, that $D(p_X\Vert q_X) = D(q_X\Vert p_X)$, and that thus

$$
H^\times(p_X\Vert q_X) = H^\times(q_X\Vert p_X) = -\epsilon\log(1-\epsilon)-(1-\epsilon)\log\epsilon = -\log \epsilon - \epsilon\log\left(\frac{1-\epsilon}{\epsilon}\right).
$$

With this we get

$$
2^{-H^\times(p_X\Vert q_X)} = \epsilon\cdot 2^{\epsilon\log\left(\frac{1-\epsilon}{\epsilon}\right)} > \epsilon
$$

and thus the bound $\tilde{P}_e \le 1- 2^{-H^\times(p_X\Vert q_X)}$ cannot hold.

Friday, September 15, 2017

Invariant Distribution of a Higher-Order Markov Chain (Pt. 2)

Some time ago I was looking into the uniqueness of the invariant distribution of a second-(or higher-)order Markov chain, and I was running into problems. After a few weeks of work, a manuscript was ready in which I found some useful sufficient conditions for this invariant distribution to be unique. The manuscript is available on arXiv, and published in the Statistics and Probability Letters. You can access the full-text of the manuscript for free until October 1st 2017 (afterwards, the arXiv version remains available, of course). I'm happy to get feedback, so please don't hold back :)!

Wednesday, November 30, 2016

(Lossy) Wyner's Common Information and Gacs-Körner's Common Randomness Revisited

Recently, I was involved (not so much, to be honest) in a research project dealing with caching, i.e., the storage of data close to a user such that it need not be transmitted anymore. A typical use case is Netflix: Netflix offers $N$ movies/series, and the user can choose any of these anytime. Most users watch during prime time, which taxes the communication network. During the day, the communication network has unused capacities, so one could try storing part of the Netflix database on a hard drive connected to the user's player. This reduces the network load during prime time and improves the user experience.

I our work, we not only looked at the fundamental limits of this caching scheme, but we also got a good idea of what should actually be stored in the user's hard drive ("cache"). The answers are decades old: One should store either what is known as Gacs-Körner's common randomness, or what is known as Wyner's common information.

For simplicity, we assume that there are only two files available on Netflix, namely $X$ and $Y$. The common information of these two random variables (RVs; in information theory, everything is random) is another RV $W$ that minimizes
$$I(X,Y; W)$$
subject to $X-W-Y$, i.e., subject to the fact that, given $W$, $X$ and $Y$ are conditionally independent. The common randomness of these two RVs is an RV $U$ that maximizes
$$ I(X,Y; U) $$
subject to the constraint that $X-Y-U$ and $Y-X-U$. It can be shown that
$$ I(X,Y; U) \le I(X;Y) \le I(X,Y; W).$$

Common randomness is something that is truly common, in the usual sense of the word. In particular, if $A$, $B$, and $C$ are independent RVs, and if $X=(A,B)$ and $Y=(A,C)$, then $U=A$. So much for the math. Now here comes an interpretation (and probably one for which an information theorists wants to punch me in the face): If $X$ and $Y$ are two different episodes of your favorite series, then $U$ is the opening theme. If you have a small hard drive, storing the opening theme is probably the most useful idea. Common information is a bit more intricate, but an intuitive explanation is the following: Suppose $X$ and $Y$ are two different episodes of your favorite series, and suppose the only thing $X$ and $Y$ have in common is an actress called $A$. In $A$, some parts of $A$ are seen, while in $Y$ other parts of $A$ are seen. (Say, $X$ shows $A$ running, while $Y$ shows $A$ talking to a person.) Obviously, if you could store all information about actress $A$ as a whole on your hard drive, the small computer in your player could reconstruct running scenes and talking scenes in which $A$ participates. Thus, if your hard drive is large enough, that's what you should store.

Let's take this one step further. Suppose that you are fine with watching not $X$ or $Y$, but lower-quality versions $\hat X$ or $\hat Y$, respectively. This requires the definition of lossy versions of common randomness and common information. And indeed, such definitions have been proposed in 1403.8093. Namely, the lossy common information is an RV $W$ that minimizes
$$ I(X,Y; W)$$
subject to $\hat X - W - \hat Y$ and $(X,Y) - (\hat X,\hat Y) - W$. Lossy common randomness is an RV $U$ that maximizes
$$ I(X,Y; U) $$
subject to the constraints $X-Y-U$, $Y-X-U$, $X-\hat X - U$, and $Y-\hat Y-U$. These definitions are all meaningful because they solve communication problems: The get their operational characterization in the Gray-Wyner problem (1403.8093) or in lossy caching (1610.07304).

The lossy version of common information seems fine to me: I only need to store as much information about actress $A$ in order to be able to reconstruct approximate depictions of her running and speaking. This might need more or less space on the hard drive than the perfect reconstruction: I might be able to store not only this reduced version of the actress $A$, but I might also store a, say, generic building, that allows to reconstruct two different but similar houses in episodes $X$ and $Y$. And indeed, lossy common information may be larger or smaller than lossless common information (see Lemma 1 in 1403.8093).

The lossy version of common randomness, however, is not fully satisfactory to me. If I am not interested in perfect reconstruction, should I not be able to store more on my internal hard drive? For example, if the credits of my two episodes $X$ and $Y$ differ only in a few actors' names, can I not store a "noisy" version of the credits in addition to the opening theme? Nevertheless, the lossy version of common randomness is usually smaller than the lossless version (see Corollary 2 in 1403.8093).

I would thus suggest different definitions of lossy common randomness: The lossy common randomness is an RV $U'$ that maximizes
$$ I(X,Y; U') \text{ or } I(\hat X,\hat Y; U')$$
subject to the constraints $\hat X-\hat Y-U'$, $\hat Y-\hat X-U'$, $X-\hat X - U'$, and $Y-\hat Y-U'$. The difference is subtle and mirrors the definition of lossy common information. While lossy common information requires $W$ to make the reconstructions $\hat X$ and $\hat Y$ conditionally independent, the original definition of lossy common randomness requires that $U$ is a common part of $X$ and $Y$ (the first two Markov constraints). The definition proposed here would require that $U'$ is only a common part of the reconstructions $\hat{X}$ and $\hat Y$. Given this definition, we may ask the following questions:

Does the new definition satisfy the desired property that it is larger than the lossless common randomness?
Is there a communications problem for which this definition is a single-letter characterization?

Thursday, July 21, 2016

Reading Notes: Semi-Supervised Clustering

Reading Notes: A brief summary of papers I've read, mainly for personal future reference. The claims about the papers are due to my own (mis-)understanding and reflect my personal opinion. If you know other interesting papers on this topic, I would be happy if you could leave a comment!

Semi-supervised clustering, or constraint clustering, deals with clustering where few of the data points are labeled, either with constraints that they must belong to the same or to different clusters, or with a cluster label.

Constraints as Features (Asafi, Cohen-Or): Data points are assumed to come from an $M$-dimensional space. It is assumed that a pairwise distance matrix is given. Must-Link constraints reduce distances, after which all other distances have to be recomputed to satisfy the triangle inequality. Each Cannot-Link constraint adds a dimension to the data set. Moreover, each Cannot-Link constraint adds an additional distance between the two corresponding data points; the distances between unconstrained data points are increased according to their diffusion distance to the constrained points (data points that have small diffusion distances to the constrained points get larger distances added). By adding a dimension for each constraint, these added distances are "orthogonal" to the original distances, thus avoiding a change of the "general shape" of the data set. The thus obtained distance matrix can be clustered using standard, unsupervised clustering techniques. Experiments: Image segmentation, UCI repo. Noisy constraints are not considered.

Clustering with Partition Level Side Information (Liu, Fu): Data points are assumed to come from an $M$-dimensional space. It is assumed that a pairwise distance matrix is given. A subset of the data points are labeled with a cluster label $1$ through $K$. Labeling is assumed to extend the dimension, i.e., each data point is an element of $M+K$-dimensional space; all data points lie in an $M$-dimensional subspace. Data points labeled to cluster $i$ lie in the subspace that is obtained by shifting the original $M$-dimensional subspace by the unit vector with in the $M+i$-th direction. A k-means-like clustering algorithm is designed, computing alternatingly cluster memberships and centroids. Experiments: UCI repo, analysis of noisy labels.

Fuzzy Clustering with Partial Supervision (Pedrycz, Waletzky): The authors suggest adapting the fuzzy c-means algorithm for semi-supervised clustering. Data points are assumed to come from an $M$-dimensional space, and it is assumed that a pairwise distance matrix or a pairwise Mahalanobis distance matrix is given. The partial supervision comes in terms of partition level side information. The cost function for fuzzy c-means is regularized with a term penalizing a difference between obtained cluster membership matrix and the cluster membership of the labeled data points. (Although I believe that the term $b_k$ in their equation (2) should be outside the $p$-power.) Experiments: synthetic data, IRIS data set, software data set. Noise was, as far as I know, not considered.

Spectral Learning (Kamvar, Klein, Manning): Data points are given in terms of an affinity matrix. Labeling can either be given as cluster labels or a Must-Link/Cannot-Link constraints. Data points in the same group override affinity values to the max (without changing the values of close data points; still, the effect on unlabeled points is seen in spectral space), data points in different groups get the minimum affinity values. From the affinity matrix a Markov transition probability matrix is generated, from which the $K$ leading eigenvalues are extracted. These eigenvalues are the (spectral) feature based on which a classifier is trained on the labeled data points, which is then used to classify the unlabeled data points. Experiments: Newsgroup data, performance increases both with the percentage of labeled data points, as well as with the number of unlabeled data points. Noisy labels/constraints are not considered.

Flexible Constrained Spectral Clustering (Wang, Davidson): The constraints are given as Must-Link/Cannot-Link constraints with an additional belief value between 0 and 1. The utility is given as an inner product, i.e., if $Q$ is the matrix with the (real-valued) constraints (positive if $x$ and $y$ are in the same cluster, negative if in different clusters, 0 if unlabeled) and $u$ a cluster assignment vector, then $u^TQu$ decreases by a clustering inconsistent with the labels. Clustering is obtained by spectral clustering, i.e., solving the min-cut problem, under the constraint that the clustering utility has to be larger than a predefined threshold. This can be done by solving a generalized eigenvalue problem. The focus is on bi-clustering. Experiments: Image segmentation, UCI repo. Analysis of noisy labels implicit by assigning belief in labeling.

Affinity and Penalty Jointly Constrained Spectral Clustering With All-Compatibility, Flexibility, and Robustness (Qian, Jiang, Wang, Su, Wang, Hu, Muzic): The approach is very similar to the one of Wang and Davidson, with two main differences: 1) They incorporate the Must-Link/Cannot-Link constraints also in the affinity matrix (by setting affinities to one and zero, respectively), and 2) the generalized eigenvalue problem is converted to a standard eigenvalue problem with a Lagrange parameter. Moreover, the authors propose a second option for the constraint matrix and extend their analysis to more than two clusters. Finally, they perform a grid search over the parameters, which can be justified by having partial supervision available. Experiments: Synthetic data, Keel data, UCI repo, USPS, human facial data set, text data, image segmentation. Noise was not considered, as far as I know.

Semi-Supervised Information-Maximization Clustering (Calandriello, Niu, Sugiyama): Data points are assumed to come from an $M$-dimensional space and are given. The clustering utility function is the $\chi^2$-divergence between the joint data-label distribution and the product of their marginals ("squared mutual information"). The utility/cost of Must-Link and Cannot-Link constraints are encoded as inner products between the conditional class distribution of the two corresponding data points. Assuming a kernel density estimate of the probabilities, the cost function can be written as a matrix multiplication, and the optimal clustering is obtained by spectral methods (leading eigenvectors of the matrix). Strange things are going on: The authors add additional constraints for transitivity of the Must-Link and Cannot-Link constraints (for binary clustering), and the kernel matrix is modified to incorporate these constraints, too. It is therefore not clear which idea (modifikation of the kernel matrix, encoding transitivity, etc.) leads to the desired performance. Experiments: UCI repo, USPS digits, Olivetti faces. Performance similar to spectral lerning and constraint spectral clustering. Noisy constraints are not considered.

Computing Gaussian Mixture Models with EM using Equivalence Constraints (Shental, Bar-Hillel, Hertz, Weinshall): Data points are assumed to come from an $M$-dimensional space. The constraints are given as Must-Link/Cannot-Link constraints. Data is modeled as coming from a GMM. The transitive closure of Must-Link constraints (called chunklet) is treated as a single data point in EM that is weighted according to the number of data points in the chunklet. Cannot-Link constraints introduce dependencies between the hidden variables (i.e., GMM component indicator), which leads to a hidden Markov random field. This complicates the EM algorithm, requires a generalized EM procedure that approximately solves the problem. The EM algorithm then trains a GMM which provides a soft/hard clustering. Experiments: UCI repo, noisy labels/constraints are not considered.

Semi-supervised Learning with Penalized Probabilistic Clustering (Lu, Leen): Data points are assumed to come from an $M$-dimensional space. The constraints are given as Must-Link/Cannot-Link constraints with an additional belief value between 0 and 1. Data is modeled as coming from a GMM. The (beliefs about the) constraints give a prior distribution over all clusterings (ruling out certain clusterings when hard constraints are given). The probabilities for the GMM components are evaluated numerically, after simplifying the model by, e.g., removing overlapping pairwise relations. The generalized EM algorithm trains a GMM which provides a soft/hard clustering. Experiments: UCI repo, partially labeled images (cluster-classes are given), texture image segmentation. Noisy labels/constraints are not considered.

Semi-Supervised Density-Based Clustering (Lelis, Sander): Again, data points come from an $M$-dimensional space, and a subset of the data points are labeled with a cluster label $1$ through $K$. The method extends DBSCAN. DBSCAN requires two parameters: an real-valued parameter defining the area of a data point's neighborhood, and an integer that specifies how many data points must lie in the neighborhood of a data point to make it a core point. The semi-supervised method uses the partial labeling to compute the real-valued parameter (such that data points with different labels fall in different clusters), hence the method only requires the integer parameter. Experiments: UCI. Noise was not considered.

Tuesday, May 3, 2016

Busfahren in Graz

This post about the waiting time paradox is in German; there are several instances in the web which explain the paradox, such as the article on Mahalanobis (requires login) with a Poisson arrival process and the more general article on Implicit Note.

Heute machen wir eine ganz kleine Rechnung zum Wartezeitenparadoxon, welches folgendes Phänomen beschreibt: Wenn sich der Fahrer der Buslinie 52 exakt an den Fahrplan halten würde, würde an der Haltestelle "HTL-BULME" exakt alle 15 Minuten ein Bus ankommen. Wenn ein Schüler also zu einem zufälligen Zeitpunkt zur Haltestelle geht, wird er oder sie im Mittel 7,5 Minuten warten müssen.

Wenn der Busfahrer den Fahrplan aus irgendwelchen Gründen aber nicht einhalten kann, muss man länger warten. Auf den ersten Blick erscheint das unlogisch: Es fahren immer noch vier Busse pro Stunde, die Wartezeit von Bus zu Bus beträgt im Mittel immer noch 15 Minuten. Trotzdem ist die erwartete Wartezeit -- unter der Voraussetzung, dass man zu einem zufälligen Zeitpunkt zur Haltestelle geht -- strikt größer als 7,5 Minuten. Mathematisch: Wenn der Bus im Mittel alle $T$ Minuten fährt, die Abweichung vom Fahrplan aber $\sigma_T$ beträgt, wartet man im Mittel

$$ \frac{T}{2} + \frac{\sigma_T^2}{2T} $$

Minuten bis man in den Bus einsteigen kann. (Interessanterweise hängt dieses Ergebnis nicht von der Verteilung der Ankunftszeiten ab, sonden nur von deren Standardabweichung!)

Intuitiv kann man sich das so vorstellen: Nehmen wir an, es fahren vier Busse pro Stunde (also $T=15$), wobei diese Busse alle unmittelbar hintereinander fahren. D.h. die ersten vier Busse fahren hintereinander um 12:00, die nächsten vier erst wieder um 13:00, etc. Die mittlere Zeit zwischen den Bussen beträgt 15 Minuten, aber die Streuung $\sigma_T$ ist sehr groß:

$$ \sigma_T^2 = \frac{1}{4} (0-15)^2+ \frac{1}{4} (0-15)^2+ \frac{1}{4} (0-15)^2+ \frac{1}{4} (60-15)^2 = 675$$

und mit obiger Formel warten wir im Mittel 30 Minuten. Es ist, also ob nur ein einziger (großer) Bus fahren würde.

Mir wurde gesagt, dass die Grazer Linien um bis zu zwei Minuten vom Fahrplan abweichen dürfen (Referenz folgt hoffentlich noch). Das bedeutet, dass $\sigma_T$ für die Grazer Linien bei 1,15 Minuten liegt. Im Mittel warten die armen BULME-SchülerInnen auf den 52er statt 7.5 Minuten ganze 7,54 Minuten.

Tuesday, March 8, 2016

OMG, KLD is only LSC!

Recently, my colleage Georg came to my office to show me some exciting result he found in a book: A bound on mutual information (MI) based on the variational distance (VD) between the joint distribution and the product of marginal distributions. In other words, if $X$ and $Y$ are random variables with probability mass functions (PMFs) $p_X$ and $p_Y$, and a joint PMF $p_{XY}$, one can write

$$ I(X;Y) \le C \Vert p_{XY}-p_Xp_Y\Vert, $$

where $C$ only depends on the alphabet sizes of $X$ and $Y$ and where

$$\Vert p_{XY}-p_Xp_Y\Vert = \sum_{x,y} |p_{XY}(x,y) - p_X(x)p_Y(y)| .$$

This of course implies that if the variational distance goes to zero, then so does the mutual information. This result is exciting since in general it is not possible to bound Kullback-Leibler divergence (KLD; mutual information is a special case of a Kullback-Leibler divergence) via variational distance, at least not by a bound that does not depend on the distributions. The reverse is always possible: The Kullback-Leibler divergence bounds variational distance via the well-known Pinsker inequality.

While I was surprised by the existence of such a bound, I was not surprised by the weaker result that if VD goes to zero then also MI does. At least if $X$ and $Y$ have a finite alphabet, then the entropies $H(X)$, $H(Y)$, and $H(X,Y)$ are continuous in the PMFs, hence convergence of the PMFs implies convergence of the entropies, and hence, of the MI.

Very interestingly, however, it turned out that KLD is NOT continuous, but only lower semi-continuous (LSC), even if the supports of the distributions are finite:

A sequence of (probability) measures $\mu_n$ is said to converge setwise or strongly to another probability measure $\mu$ iff for every measurable set $A$ one has $\mu_n(A)\to \mu(A)$. I think if $\mu=n$ are discrete probability measures with finite alphabets, i.e., can be described by PMFs $p_n$ with finite support, then setwise convergence is equivalent to the fact that, for every point $x$ in the support, $p_n(x)\to p(x)$. I furthermore think that for such finite PMFs setwise convergence is equivalent to the fact that $\Vert p_n-p\Vert\to 0$. (I should check that.) Now let us take two sequences $p_n$ and $q_n$ of PMFs. Let both have a finite support. One would guess that if $p_n \to p$ and $q_n\to p$, then the KLD

$$ D(p_n\Vert q_n) = \sum_x p_n(x)\log\frac{p_n(x)}{q_n(x)} \to 0.$$

But that's not true. While entropies are continuous w.r.t. setwise convergence, KLD is only LSC. It follows that the right-hand side of above equation can be strictly positive, even if both $p_n$ and $q_n$ converge to $p$. Here is an example that you can easily work out yourself: $p_n(0) = 1-1/n$, $p_n(1)=1-p_n(0)$ and $q_n(0)=1-e^{-n^2}$, $q_n(1)=1-q_n(0)$.

Tuesday, March 10, 2015

Are Polar Codes Fractal?

Alright, this is very preliminary. I don't believe in most of the things I say below myself, but I just hope that they are true. I also hope that in the near future we will be able to prove that these things are true, or that we will prove them false. Let's see. In any case, I will keep you posted!

Polar codes are a beautiful new scheme to achieve the capacity of a binary input memoryless channel (BIMC) without requiring iterative decoders, and without relying on random coding arguments (which are beautiful mathematical methods, too). They were first introduced by Erdal Arikan in 2009, so the topic of polar coding is relatively new and still hot. Although encoding and decoding complexity of these codes is low compared to other methods with similar performance, the construction of polar codes is a difficult (and computationally costly) problem. An approximate code construction method was proposed by Tal and Vardy, which has sub-exponential complexity. That polar codes might be fractal was mentioned in a blog post by Emmanuel Abbe long time ago. Kahraman, Viterbo, and Celebi observed that the coding matrix (very similar -- or identical? -- to a Hadamard matrix on Galois field 2) used there resembles the Sierpinski triangle, a well-known fractal.

What are Polar Codes?

I'd like to keep this one very short, and instead recommend the interested reader to Arikan's paper or to Sasoglu's monograph. But on a very abstract basis, we can interpret these codes in the following way: Assume you have a BIMC with a capacity $0<C<1$. You are interested in a code using the channel multiple times, such that you can transmit with a rate sufficiently close to $C$ and with a very small probability of error.

There are two ways you can use the channel twice: Either as it is leading to a total capacity of $C+C=2C$, or you do preprocessing of the channel inputs. The preprocessing will lead to a channel use with a higher capacity $C^1>C$ and to a channel use with lower capacity $C^0<C$, which satisfy $C^1+C^0=2C$. Similarly, if you want to use the channel four times, you can either take $C+C+C+C=4C$, or $C^1+C^1+C^0+C^0=4C$, or you can preprocess the two better channels $C^1$ to two channels with $C^{11}>C^1>C^{10}$, and the two worse channels $C^0$ to two channels with $C^{01}>C^0>C^{00}$. Note that here you already have

$$C^{11} > C^1 >C>C^0>C^{00}.$$

In other words, out of multiple uses of a given channel with capacity $C$, you can generate multiple channels with different capacities by preprocessing; some of these channels will have a higher capacity, some of the will have a lower capacity. Instead of using the same channel multiple times, you use preprocessing to get better channels and worse channels. And indeed, if we do infinitely many preprocessing steps (doubling the number of channel uses in each step), we end up either with a perfect channel ($C=1$) or with a completely useless one ($C=0$): The channels have been polarized, hence the name for this type of code.

$$C^{110101000101101001101100101010100101011101101000...} \text{ is either 0 or 1}$$

The interesting fact is that if we do these preprocessing steps randomly, i.e., we randomly choose between making the channel better (a 1 in the superscript) or worse (a 0 in the superscript), then with probability $C$ we get a perfect channel, and with probability $1-C$ a useless one. The idea of polar coding is to transmit data on the perfect channels (no error occurs), and to set all the other, bad channels to fixed or "frozen" bit values. Since a fraction of $C$ channels will always be good according to the polar coding theorem, the rate of the code is equal to (or at least close to) the capacity of the original channel $C$.

Which Channels are Good, Which are Bad?

That's the interesting question! But before we go on to talk about fractality, let me mention that in practice we never go all the way to the perfect or useless channels, as this would require using infinite block lengths. In fact, we fix a number $n$ and perform exactly as many preprocessing steps to pairs of channel each, ending up with $2^n$ channel uses or a block length of $2^n$. However, even in this case it is often not easy to decide which of these $2^n$ channels is close to perfect! But I hope to convince you that the true problems appear if we go with $n$ to infinity.

The picture below show the channels obtained by preprocessing a binary erasure channel with a capacity of $C=0.433$. Preprocessing, or polarization, has been performed seven times, leading to 128 channels. As you can see, some of these channels are already quite close to having a capacity of 1, some of them have a capacity close to 0. Intuition would suggest that if we continue polarizing, channels on the left will stay bad, while channels on the right will become perfect.

The capacity of 128 channels derived from a single binary erasure channel with capacity 0.433

To show that this is not the case, let's focus on the channel marked with a red point:

Polarizing a channel with low capacity (shown in red)...

This channel has a very low capacity, and from now on we will start polarizing this channel only. Starting from a very bad channel, even the "better" channel derived from it will be relatively bad (i.e., have a low capacity). But after sufficiently many polarization steps, this channel with a capacity close to zero will have generated several channels with a high capacity, and in the long run some of these channels will even be perfect ($C=1$): Indeed, if we start polarizing this channel with capacity $C^{--\cdots--}$, after infinitely many steps a relative fraction of $C^{--\cdots--}$ channels will be perfect.

...will eventually give rise to good channels.

Thus, even if after some steps a channel appears to be very bad (good), this channel will in the long run give rise to good (bad) channels.

Let's Talk Business!

I hope by now I gave you some intuition on why I believe that polar codes could be fractal: wherever you start, wherever you look, you will always find good and bad channels. But let's make this all a bit more precise: Take a real number $b\in [0,1]$, and let $b_2$ be the fractional part of its binary expansion. In the general case, the binary expansion of $b$ is an infinite sequence of zeros and ones. To be precise, if $b$ is irrational, $b_2$ is non-repeating and infinitely long and if $b$ is rational, then $b_2$ is either terminating or recurring. Let $C^b$ be the capacity of the channel obtained by preprocessing/polarizing the original channel with capacity $C$ according to the sequence specified by $b_2$. Then, the polar coding theorem says that

$$ \text{for almost all } b\in[0,1]{:}\ C^b\in\{0,1\}.$$

The term "for almost all" is well-defined and relates to the fact that there may be $b$ such that $C^b$ is neither zero or one, but that these $b$ will be extremely rare. The quest for a good polar code is the quest for finding good channels, i.e., it is the quest for specifying the set

$$G:=\{b\in[0,1]{:}\ C^b=1\}.$$

Let's see what we can say about this set.

$G$ is Lebesgue-measurable and its Lebesgue measure is $C$?

Why do I think that this is true? Well, I hope that we can replace the standard probability space $(\Omega,\mathcal{A},P)$ by $([0,1],\mathfrak{B}([0,1]),\lambda)$, where $\lambda$ is the Lebesgue measure and where $\mathfrak{B}([0,1])$ is the Borel sigma algebra for the unit interval $[0,1]$. Clearly, $([0,1],\mathfrak{B}([0,1]),\lambda)$ is a probability space. The original proof for channel polarization either relies on the theory of martingales, or on stochastic dynamical systems. As far as I know, both define the (random) sequence of polarizations leading to a specific channel $C^b$ as a realization of an i.i.d. Bernoulli process. One such realization corresponds to one element of the probability space $\Omega$. If we can show that, equivalently, one such realization corresponds to exactly one element $b$ in $[0,1]$, we can use the probability space associated with the unit interval. The claim that $G$ is measurable follows from the fact that $C^b$ is a well-defined random variable (hence measurable). And since we know that

$$ C=Prob[C^b=1]=\lambda(\{b\in[0,1]{:}\ C^b=1\})=\lambda(G)$$

this would show that, indeed, the Lebesgue measure of $G$ would be $C$.

$G^c$ is Dense in $[0,1]$?

I think this should be relatively easy to show: The complement of $G$, $G^c$, contains all $b$ such that the corresponding channel has capacity $C^b<1$. Now assume that the binary expansion of $b$ is terminating, i.e., after finitely many zeros and ones we can be sure that there will be only zeros left, and infinitely many of them. These zeros will polarize the channel to a capacity of zero, hence, all $b$ with a terminating binary expansion will be in $G^c$. But all dyadic rationals, i.e., rationals that can be written as $p/2^q$ for integers $p$ and $q$, have terminating binary expansions. Consequently, the set of dyadic rationals in $[0,1]$ is a subset of $G^c$. Moreover, since the set of dyadic rationals is dense in the set of reals, the set of dyadic rationals in $[0,1]$ is dense in $[0,1]$. Hence, $G^c$ is dense in $[0,1]$.

What does this mean? This means that every interval $I\subset [0,1]$ will contain at least one "bad" channel. More abstractly, if a polar code is a stochastic dynamical system with two attractors, then the basins of attraction (in terms of the number $b$) are no intervals. (I'm not yet sure if this is true, and even if it is, I have no idea what this means.)

$G$ is Self-Similar?

Self-similarity is typical for fractals, so showing that $G$ is self-similar may suggest that $G$ is fractal, too. Fractality is a a concept related to the dimension of a set, and to the fact that the dimension of this set is non-integer. In fact, since there are many measures of dimensionality, the definition of fractality could in fact depend on the measure used. To have a strong statement that (and why!) the construction of polar codes is difficult, I think it suffices to show that $G$ is self-similar. To prove self-similarity, we would need the following result:

Claim: Let $C_1$ and $C_2$ be the capacities of two different channels, and let $C_1<C_2$. Then, for all $b\in[0,1]$, $C_1^b<C_2^b$. Hence, if $G_1$ and $G_2$ are the sets of good channels associated with $C_1$ and $C_2$, respectively, then $G_1 \subseteq G_2$.

In other words, we would need that if a specific sequence polarizes a given channel to zero (one), then it will polarize all worse (better) channels to zero (one) as well. This is equivalent to saying that a dynamical system iterating according to a number $b$ has exactly two attractors and two basins of attraction. In other words, for a given $b$ we can find a number $\theta$, such that for all $C>\theta$, $C^b=1$ and for all $C<\theta$, $C^b=0$. $C=\theta$ may of course be another fixed point, but this fixed point is unstable.

Example: Consider the binary erasure channel and let $b=1/3$. Then, the binary expansion of $b$ is $0101010101...$. A binary erasure channel with capacity $C>\frac{\sqrt{5}-1}{2}$ will converge to a perfect channel, one with a capacity strictly lower than this number will converge to a useless channel, and one with a capacity exactly equal to this number will stay at this number. (This is why the polar coding theorem can only show convergence almost surely, or for almost all $b$: There may be $b$ for which a given channel does not converge to either zero or one.)

If this claim holds, then we can say the following: Clearly $[0,0.5]$ is similar to $[0,1]$ in the sense that the unit interval is obtained by a simple left shift of the binary expansion of $[0,0.5]$, or $[0,1]=2\cdot[0,0.5]$. Similarly, the set $[0.5,1]$ differs from the set $[0,0.5]$ only in the first digit of the binary expansion. Moreover, $[0,1]=2\cdot[0.5,1]-1$. We consider again a channel with capacity $C$, and the set of good channels shall be $G$. Preprocessing the channel once either improves it or worsens it, so we get capacities $C^1>C>C^0$. Let $G_1$ be the set of good channels when starting with channel $C^0$, and let $G_2$ be the set of good channels when starting with $C^1$. Let

$$ G_1:=\{b\in[0,0.5]{:}\ C^b=1\}$$

and

$$ G_2:=\{b\in[0.5,1]{:}\ C^b=1\}.$$

But since the set $[0,0.5]$ is similar to the set $[0,1]$, the set $G_1$ will also be similar to the set $G$: In fact,

$$ 2\cdot G_1=\{b\in[0,1]{:}\ C^{0b}=1\}$$

and

$$ 2\cdot G_2-1=\{b\in[0,1]{:}\ C^{1b}=1\}.$$

Since the first preprocessing changes the capacity of the channel with which we start polarizing, we can apply the claim to get $2\cdot G_1 \subset G \subset 2\cdot G_2-1$. At the same time, however, $G=G_1\cup G_2$. In other words, the set $G$ consists of a better (right) half and a worse (left) half, both of which are similar in the sense of above set inclusion. (If the sets would be equal rather than inclusions of each other, the set $G$ would be composed of exactly two scaled copies of $G$.) Of course, this analysis can be iterated ad infinitum. $G$ is self-similar.

Fractal Dimension of $G$?

Alright, this might be a tough one. Since $G$ will probably be dense, the box-counting dimension will probably be 1. Since $G$ consists of (more-less) two copies of its own, scaled by a factor of $1/2$, the scaling dimension will be 1, too. Hausdorff dimension is beyond my understanding.

Tuesday, December 30, 2014

Secret Santa Problems (Part II)

This post is in English now, since this is exactly the version of Secret Santa we played at our lab's Christmas party. I could not find the results of this post anywhere in the internet, but they probably appear somewhere. Possibly they are wrong.

This post is the second part to my previous post "Wichtel-Probleme".

Let's play Secret Santa again! Usually, at the beginning of the Christmas season all $n$ players would fix a derangement, which is a permutation of all players' names such that no player ends up with his or her own name. Naturally, this derangement is anonymous, i.e., a given player only knows which name he got assigned, and nothing else. The problem of finding a proper assignments is quite complicated, as the number of derangements is given by the sub-factorial

$$!n = n! \cdot \sum_{k=0}^n \frac{(-1)^k}{k!}$$

which for large $n$ quickly approaches a fraction of $1/e$ of all $n!$ permutations. Hence, only about 36% of all arrangements are proper, i.e., derangements. Typically, one would not want to play something like that in an office with 30 or more employees, since the process of drawing names can be quite time consuming (but see this blog post by Martin Eden about an algorithm ensuring a proper permutation).

Instead, let us play a different version of Secret Santa: Every player brings a wrapped gift and deposits it on a nicely decorated table. Then, the players take turns and draw exactly one gift. Let us assume that none of the players will choose his or her own gift -- that wouldn't be fun, would it? But what if a player has no choice? This brings us to the question of this post:

What is the probability that the last player has to take home his or her own gift?

Clearly, if there are just two players, this situation cannot occur: Either both players get their own gifts, or none does. Since the first option is ruled out, for $n=2$ this probability is zero. And for three players? Let's look at the following table listing all six possible permutations of gifts. Invalid permutations (those in which any of the first two players takes his or her own gift) are crossed out.:

1	2	3
1	2	3
1	3	2
2	1	3
2	3	1
3	1	2
3	2	1

Of all six permutations, only three follow the given rules, and of these only one (indicated in red) makes the last player draw his or her own gift. Hence, the probability is 1/3.

Let's generalize this:

How many permutations end up with the last player taking his or her own gift, given that none of the first $n-1$ players take their own gifts? And how many permutations are there such that the first $n-1$ players do not take their own gifts?

The ratio between the answers of these two questions is the desired probability.

The first question is easy to answer: Just fix the last gift to the last player and count the number of derangements for the first $n-1$ players; the result is $!(n-1)$. The second question is a bit more tricky, and I am not sure if the result is correct. Intuition tells me that the result can be obtained by complete induction or a different kind of recursion, but maybe also the following reasoning is correct: Essentially, there are just two options for the first $n-1$ players not to take their own gifts. Either all $n$ gifts are properly deranged, or the first $n-1$ gifts are deranged and the last gift is fixed. These two options are exclusive: If all $n$ names are deranged, the last gift has to be taken by any of the first $n-1$ players. Hence, we can add the possibilities of these two options; $!n$ for complete derangements and $!(n-1)$ for derangements with the last gift fixed. The probability that the last player has to take home his or her own gift is thus

$$ \frac{!(n-1)}{!(n-1)+!n} $$

The picture below shows that this probability decreases approximately by $1/n$ with the number of players. The more, the merrier!

By the way: The number of derangements with $k<n$ fixed points (i.e., exactly $k$ players take home their own gifts) is given by the Recontres numbers. However, given our rule, more than one fixed point cannot occur.

Wednesday, December 17, 2014

Wichtel-Probleme

This post is in German. There are already a few blog posts about this topic in English; see the list of references at the end of this post.

Lasst uns wichteln! Nehmen wir an, wir wären $n$ Freunde. Wir schreiben unsere Namen auf Kärtchen, werfen diese in einen Hut, und ziehen der Reihe nach jeweils eine Karte wieder heraus. Es kann natürlich passieren, dass jemand von uns seinen eigenen Namen zieht, worauf wir den Vorgang wiederholen müssten. Doch wie groß ist die Wahrscheinlichkeit, dass niemand von uns seinen eigenen Namen zieht?

Das Problem ist kombinatorischer Natur: Es gibt genausoviele Möglichkeiten, die Karten wieder unter uns aufzuteilen, wie es Permutationen der Karten gibt: Das wären $n!=1\cdot2\cdot 3\dots\cdot n$. Unter diesen möglichen Permutationen gibt es nur einige sogenannte fixpunktfreie Permutationen, also solche, bei denen niemand seine eigene Karte zieht. Die Anzahl der fixpunktfreien Permutationen ist

$$!n = n! \cdot \sum_{k=0}^n \frac{(-1)^k}{k!}$$

Sind wir nur zu zweit ($n=2$), gibt es natürlich nur zwei Permutationen, und nur eine davon ist fixpunktfrei. Die Wahrscheinlichkeit, dass wir beim ersten Ziehen bereits erfolgreich sind, liegt bei 50%. Spielen wir zu dritt, gibt es sechs Permutationen, wovon zwei fixpunktfrei sind (die zufälligerweise auch gleichzeitig Rotationen sind). Die Wahrscheinlichkeit sinkt also schnell auf ca. 33%. Interessant wird es, wenn wir $n$ wachsen lassen: Es zeigt sich, dass die Wahrscheinlichkeit sehr schnell gegen den Grenzwert $1/e$ konvergieren: Die Wahrscheinlichkeit, dass wir beim ersten Ziehen erfolgreich sind, liegt für vier oder mehr Spieler ungefähr bei 37%, und das unabhängig von der Anzahl der Spieler!

Etwas schwieriger wird es, wenn eine Gruppe von Paaren sich entscheidet zu wichteln: Die Regeln verändern sich dahingehend, dass man nicht nur sich selbst nicht ziehen darf, sondern auch seinen Partner nicht -- das wäre ja zu einfach! Und genau diese Situation trat in unserer Familie auf: Wir schrieben Zettel, warfen sie in einen Hut, zogen, und erhielten eine ungültige Zuordnung. Auch der zweite Versuch schlug fehl. Erst beim dritten Mal klappte es. Ich hatte das dumpfe Gefühl, dass wir beim dritten Mal bereits außerordentliches Glück hatten und wollte der Sache auf den Grund gehen.

Wenn wir $n$ Paare betrachten (also $2n$ Personen), dann gibt es $(2n)!$ Permutationen, von denen $!(2n)$ fixpunktfrei sind. Allerdings sind nicht alle diese fixpunktfreien Permutationen gültig: Die Permutation, in der jede Person ihren Partner zieht ist zum Beispiel fixpunktfrei. Es gibt also höchstens $!(2n)$ gültige Permutationen. Eine gültige Permutation wäre zum Beispiel eine solche, in der jedes Paar ein anderes Paar zieht; innerhalb dieses Paares gibt es dann zwei mögliche Zuordnungen: Mann-Mann und Frau-Frau oder Mann-Frau und Frau-Mann. Davon gibt es natürlich $2\cdot !n$ mögliche Permutationen ($!n$ für die Auswahl der Paare, und $2$ für die Zuordnung innerhalb der Paare). Nachdem das nicht alle gültigen Permutationen sind ist das eine untere Grenze, so wie $!(2n)$ eine obere ist. Die obere Grenze für die Wahrscheinlichkeit einer gültigen Permutation ist also (wie oben) bei 37%, the untere Grenze¹ ist eng mit dem quadrupel factorial verknüpft:

$$\frac{2 \cdot !n}{(2n)!}= \frac{2 \cdot n!}{(2n)!} \cdot \sum_{k=0}^n \frac{(-1)^k}{k!}$$

Bei zwei Paaren ($n=2$, vier Personen) gibt es also höchstens neun und mindestens zwei gültige Permutationen. In der Tat kann man zeigen, dass es genau vier gültige Permutationen bei zwei Paaren gibt. Bei drei Paaren, so wie in unserer Familie, gibt es bereits 720 mögliche Permutationen, von denen 265 fixpunktfrei sind, wovon wiederum nur 80 der Regel entsprechen, dass niemand den eigenen Partner ziehen darf! Die Wahrscheinlichkeit, bei drei Paaren gültig zu ziehen liegt also bei ca. 11%. Wenden wir nun noch die geometrische Verteilung an, welche die Wahrscheinlichkeit berechnet, den ersten Erfolg nach genau $k$ Zügen zu haben, stellt sich heraus dass man nur mit einer Wahrscheinlichkeit von ca. 30% innerhalb der ersten drei Versuche eine gültige Zuordnung bekommt. Wir hatten tatsächlich Glück!

Es geht aber noch ein wenig komplizierter: Nehmen wir an, wir haben $N$ Gruppen von Personen, und jede dieser Gruppen hat eine unterschiedliche Anzahl von Personen: Gruppe 1 hat $n_1$ Mitglieder, Gruppe 2 $n_2$, und so weiter. Wie groß ist die Wahrscheinlichkeit, dass bei der Auslosung niemand sich selbst oder ein Mitglied der eigenen Gruppe zieht? (Wichteln mit Paaren ist davon natürlich ein Sonderfall, und zwar mit $N=n$ und $n_1=n_2=\cdots=n_n=2$.) Die Fragestellung ist äußerst kompliziert, aber die Lösung existiert seit 1974². Sie ist über ein kompliziertes Integral über das Produkt mehrerer Laguerre-Polynome gegeben. Mehr will ich dazu jetzt nicht sagen.

Übrigens: Wenn die Anzahl $n$ der Pärchen wächst, erhält man als Grenzwert für die Wahrscheinlichkeit einer gültigen Zuordnung $1/e^2$! Für diesen Fall ist das oben genannte Integral nämlich relativ einfach lösbar.

Weiterführende Literatur/Further Reading:

1 Diese untere Grenze ist übrigens, für großes $n$, sehr schlecht: das quadrupel factorial wächst sehr schnell, weshalb die untere Grenze für die Wahrscheinlichkeit schnell gegen null geht.^↩

2 Derangements and Laguerre polynomials, S. Even and J. Gillis (1976). Mathematical Proceedings of the Cambridge Philosophical Society, Volume 79, Issue 01, January 1976, pp 135-143.^↩

Pages