Tuesday, April 11, 2017

An Inequality between Conditional Entropy and Error Probability

Suppose that $X$ is a random variable that can assume $N$ different values, and suppose that $Y$ is another random variable that is linked to $X$ in an arbitrary way. We want to estimate $X$ from observing $Y$, i.e., we define a (possibly random) function $\hat{X}(Y)$. (We may assume that $X$ is a signal we want to estimate and that $Y$ is a noisy or distorted observation of $X$.) We define the detection probability as $P_d:=\mathbb{P}(X=\hat{X}(Y))$, where the probability is taken w.r.t. the joint distribution $p_{X,Y}$ of $X$ and $Y$. A famous inequality relating $P_d$ to the conditional entropy of $X$ given $Y$ is Fano's inequality:
$$
 H(X|Y) \le h_2(P_d) + (1-P_d)\log (N-1)
$$
where $h_2(p):=-p\log p-(1-p)\log(1-p)$. Fano's inequality depends on the alphabet size $N$ of $X$. In what follows we derive a different inequality that is independent of $N$.

To this end, suppose that we use the maximum a posteriori (MAP) estimate of $X$ given $Y$, i.e.,
$$
  \hat{x}(y) = \arg\max_x p_{X,Y}(x,y) = \arg\max_x p_{X|Y}(x|y).
$$
Given that $Y=y$, we can thus define
$$
P_d(y) = \sum_{x} p_{X|Y}(x|y) \mathbb{I}(x=\hat{x}(y)) = \max_x p_{X|Y}(x|y)
$$
from which $P_d=\sum_y p_Y(y) P_d(y)$ follows. Comparing the right-hand side with Rényi's entropy of order infinity, we observe that
$$
 P_d(y) = 2^{-H_\infty(X|Y=y)}.
$$
Rényi's entropy is non-increasing in its order, so $H_\infty(X|Y=y)\le H(X|Y=y)$ and hence $P_d(y)\ge 2^{-H(X|Y=y)}$. Moreover, the function $2^{-x}$ is convex in $x$, so Jensen's inequality yields
$$
 P_d = \sum_y p_Y(y) P_d(y) \ge \sum_y p_Y(y) 2^{-H(X|Y=y)} \ge 2^{- \sum_y p_Y(y) H(X|Y=y)} = 2^{-H(X|Y)}.
$$
Thus, the MAP detection probability is bounded from below by a decreasing function of the conditional entropy. The bound does not depend (explicitly) on the alphabet size of $X$.
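
This bound is easy to sanity-check numerically. Below is a small sketch (mine, not part of the derivation) that draws a random joint pmf, computes the MAP detection probability and $2^{-H(X|Y)}$, and verifies the inequality; all variable names are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Random joint pmf p_{X,Y} for a toy alphabet: |X| = 5, |Y| = 4.
N, M = 5, 4
p_xy = rng.random((N, M))
p_xy /= p_xy.sum()

p_y = p_xy.sum(axis=0)          # marginal of Y
p_x_given_y = p_xy / p_y        # column y holds p_{X|Y}(.|y)

# MAP detection probability: P_d = sum_y p_Y(y) max_x p_{X|Y}(x|y).
P_d = np.sum(p_y * p_x_given_y.max(axis=0))

# Conditional entropy H(X|Y) in bits.
H_X_given_Y = -np.sum(p_xy * np.log2(p_x_given_y))

print("P_d         =", P_d)
print("2^{-H(X|Y)} =", 2.0 ** (-H_X_given_Y))
assert P_d >= 2.0 ** (-H_X_given_Y)   # the bound derived above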

Monday, January 9, 2017

New Features for WhatsApp & Co?

This was originally intended as a submission to the Science Park Graz idea contest (which is why it was written in German) - but then I decided that the idea was not innovative enough, even though I would love to see these features in WhatsApp or Hangouts. Who knows, maybe Google is reading this ;).


Two use cases where WhatsApp, Facebook & Co fail

Use Case 1: Sharing a weekend trip with friends

You are spending the weekend in Munich and want to send your friends at home a few photos of Marienplatz, the Allianz Arena, and the Olympiaturm. You could share the photos on Facebook, Instagram, or Google+. But those of your friends who rarely use these social networks might miss the photos because their news feed has been flooded with other updates in the meantime. You could send the photos to your friends via WhatsApp - maybe a suitable group even exists already. But on the one hand your friends are pestered with multiple notifications, and on the other hand they may feel obliged to react to your photos with comments. The evening at the Hofbräuhaus quickly turns into a messaging marathon.

Use Case 2: Sending Christmas greetings and a year-in-review to acquaintances

You want to wish distant acquaintances (former colleagues, distant family members, etc.) a merry Christmas and let them know what happened in your life over the past year. What you have to say exceeds the scope of the average WhatsApp message, and Facebook and Google+ feel too public to you - besides, there is a real risk that your acquaintances never receive the message, for the reasons mentioned in the previous use case.

Detailed description

To cover these use cases, the app offers several features that go beyond the specifications of currently available messaging apps. For better overview, the novel features (such as message types, group roles, and group settings) are marked in bold.

Layout

The layout resembles WhatsApp and the subscriptions view of YouTube: the start screen lists the groups you are a member of and shows, next to each group name, the number of new messages in that group. Tapping a group name takes you to the group chat, which again resembles WhatsApp. An essential difference is the two new message types (stories and eReader-style text messages, see below), which are shown in the group chat only in compact form as a headline. In addition, every message has a footer with the number of likes and comments (see below). The app also has a web interface, which is particularly helpful for composing longer messages.

Group roles

In addition to the group administrator and the group member, which also exist in WhatsApp and other messaging groups, there is also a reader, i.e., a purely passive group role.

  • Group administrator: sets the group settings; invites members to the group and assigns group roles to them
  • Group member: shares messages with other group members, and comments on and likes other group members' messages
  • Reader: passive participation; may (depending on the settings) read, comment on, and like messages, but cannot share new messages

Message types

Besides speech-bubble-style text messages, the app of course also allows sharing photos, videos, documents, and web links. A special feature is eReader-style text messages. This message type is designed for long texts and makes for pleasant reading by turning pages instead of scrolling. In addition there are stories, i.e., messages that are written over a longer period of time and enriched with photos or videos. A group member writes and sends a story like a normal message, but it does not reach the other group members immediately. Only once the story has been marked as finished do all accumulated messages appear in the group chat.
There are three ways to react to a message: with a new message -- entirely in the style of WhatsApp -- or by commenting on and liking it -- entirely in the style of social networks such as Facebook or Google+.

Group settings

The group settings are defined by the group administrator. Among other things, they govern the following (a rough data-model sketch follows the list):

  • Which message types may be used in the group?
  • Are there minimum sizes for stories and eReader-style text messages?
  • Which message types trigger push notifications (unless the group member has disabled push notifications for this group)?
  • How may members react to messages? Are comments allowed? Is liking possible?
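
To make this concrete, here is a rough sketch of how the roles, message types, and group settings above could be modeled. This is purely illustrative; all names and fields are my own invention and not part of the original idea submission.

from dataclasses import dataclass
from enum import Enum, auto


class Role(Enum):
    ADMIN = auto()    # sets group settings, invites members, assigns roles
    MEMBER = auto()   # shares messages, comments, likes
    READER = auto()   # passive: may read and (if allowed) comment and like


class MessageType(Enum):
    BUBBLE = auto()   # classic speech-bubble text message
    PHOTO = auto()
    VIDEO = auto()
    DOCUMENT = auto()
    LINK = auto()
    EREADER = auto()  # long text, paged instead of scrolled
    STORY = auto()    # accumulated over time, delivered once marked as finished


@dataclass
class GroupSettings:
    allowed_types: set          # which MessageTypes may be used in the group
    push_types: set             # which MessageTypes trigger push notifications
    min_story_items: int = 0    # minimum size for stories (0 = no minimum)
    comments_allowed: bool = True
    likes_allowed: bool = True


# Example: the weekend-trip group from use case 1.
weekend_group = GroupSettings(
    allowed_types={MessageType.STORY},
    push_types={MessageType.STORY},
    comments_allowed=False,
    likes_allowed=True,
)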

Individual settings

Furthermore, there are settings that each user can configure for themselves. These include disabling push notifications and, of course, the font size of eReader-style text messages. Each user can also decide who may see comments on a message: all members of a group, or only the author of the message. The order of the group list can be customized as well: groups can be sorted by the number or recency of unread messages, by the user's role in the respective group, or by a user-defined priority. Users can furthermore decide how they want to be added to groups: adding without confirmation, adding with confirmation via a message, or adding through personal contact (e.g., via NFC transfer or by photographing a QR code on the group administrator's screen). These settings can be made separately for each group role the user is to be assigned and for selected inviting persons. For example, I allow trusted contacts to add me as a reader to any group in which they are the group administrator; as a group administrator, I only want to be added when I meet the other group administrators of the same group in person.

Back to the use cases

Use Case 1: Sharing a weekend trip with friends

As group administrator, you create a group with your friends as group members. You allow only the story message type and forbid comments - liking is allowed. Push notifications are enabled.
Over the course of the weekend you send several photos to this group. On Sunday evening your friends receive a notification that you have shared a story. They can look at the photos and like them, but the group settings forbid comments. Nobody is obliged to reply to you -- and whoever wants to anyway will certainly find a way (e.g., a direct message or an SMS).

Use Case 2: Sending Christmas greetings and a year-in-review to acquaintances

As group administrator, you create a group with your acquaintances as readers. You disable push notifications but allow comments.
You write your message in the web interface and send it to the group as an eReader-style text message. The next time your readers open the app, they see a red dot next to the group name on the main screen. When they open the message, they do not see a scrollable message but an eReader-style view: touching the screen turns the page. Since you enabled the option, your acquaintances can comment on your message. They can decide for themselves whether all readers or only you can see the comment.

Wednesday, November 30, 2016

(Lossy) Wyner's Common Information and Gács-Körner's Common Randomness Revisited

Recently, I was involved (not so much, to be honest) in a research project dealing with caching, i.e., the storage of data close to a user such that it need not be transmitted anymore. A typical use case is Netflix: Netflix offers $N$ movies/series, and the user can choose any of these anytime. Most users watch during prime time, which taxes the communication network. During the day, the communication network has unused capacities, so one could try storing part of the Netflix database on a hard drive connected to the user's player. This reduces the network load during prime time and improves the user experience.

In our work, we not only looked at the fundamental limits of this caching scheme, but we also got a good idea of what should actually be stored on the user's hard drive ("cache"). The answers are decades old: one should store either what is known as Gács-Körner's common randomness, or what is known as Wyner's common information.

For simplicity, we assume that there are only two files available on Netflix, namely $X$ and $Y$. The common information of these two random variables (RVs; in information theory, everything is random) is another RV $W$ that minimizes
$$I(X,Y; W)$$
subject to $X-W-Y$, i.e., subject to the fact that, given $W$, $X$ and $Y$ are conditionally independent. The common randomness of these two RVs is an RV $U$ that maximizes
$$ I(X,Y; U) $$
subject to the constraint that $X-Y-U$ and $Y-X-U$. It can be shown that
$$ I(X,Y; U) \le I(X;Y) \le I(X,Y; W).$$

Common randomness is something that is truly common, in the usual sense of the word. In particular, if $A$, $B$, and $C$ are independent RVs, and if $X=(A,B)$ and $Y=(A,C)$, then $U=A$. So much for the math. Now here comes an interpretation (and probably one for which an information theorist wants to punch me in the face): If $X$ and $Y$ are two different episodes of your favorite series, then $U$ is the opening theme. If you have a small hard drive, storing the opening theme is probably the most useful idea. Common information is a bit more intricate, but an intuitive explanation is the following: Suppose $X$ and $Y$ are two different episodes of your favorite series, and suppose the only thing $X$ and $Y$ have in common is an actress called $A$. In $X$, some parts of $A$ are seen, while in $Y$ other parts of $A$ are seen. (Say, $X$ shows $A$ running, while $Y$ shows $A$ talking to a person.) Obviously, if you could store all information about actress $A$ as a whole on your hard drive, the small computer in your player could reconstruct running scenes and talking scenes in which $A$ participates. Thus, if your hard drive is large enough, that's what you should store.
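
For the $X=(A,B)$, $Y=(A,C)$ example, the inequality chain above is easy to verify numerically. Here is a small sketch (mine) with independent fair bits $A$, $B$, $C$ and $U=A$:

import itertools
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# A, B, C independent fair bits; X = (A, B), Y = (A, C), U = A.
# Enumerate the joint distribution of (X, Y, U).
joint = {}
for a, b, c in itertools.product([0, 1], repeat=3):
    x, y, u = (a, b), (a, c), a
    joint[(x, y, u)] = joint.get((x, y, u), 0) + 1 / 8

def marginal(keys):
    """Marginal distribution over the given components (0 = X, 1 = Y, 2 = U)."""
    m = {}
    for triple, p in joint.items():
        k = tuple(triple[i] for i in keys)
        m[k] = m.get(k, 0) + p
    return np.array(list(m.values()))

H_X, H_Y = entropy(marginal([0])), entropy(marginal([1]))
H_XY, H_U = entropy(marginal([0, 1])), entropy(marginal([2]))
H_XYU = entropy(marginal([0, 1, 2]))

I_X_Y = H_X + H_Y - H_XY        # I(X;Y) = 1 bit = H(A)
I_XY_U = H_XY + H_U - H_XYU     # I(X,Y;U) = 1 bit: U = A attains the bound I(X;Y)

print("I(X;Y)   =", I_X_Y)
print("I(X,Y;U) =", I_XY_U)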

Let's take this one step further. Suppose that you are fine with watching not $X$ or $Y$, but lower-quality versions $\hat X$ or $\hat Y$, respectively. This requires the definition of lossy versions of common randomness and common information. And indeed, such definitions have been proposed in 1403.8093. Namely, the lossy common information is an RV $W$ that minimizes
$$ I(X,Y; W)$$
subject to $\hat X - W - \hat Y$ and $(X,Y) - (\hat X,\hat Y) - W$. Lossy common randomness is an RV $U$ that maximizes
$$ I(X,Y; U) $$
subject to the constraints $X-Y-U$, $Y-X-U$, $X-\hat X - U$, and $Y-\hat Y-U$. These definitions are all meaningful because they solve communication problems: they get their operational characterizations in the Gray-Wyner problem (1403.8093) or in lossy caching (1610.07304).

The lossy version of common information seems fine to me: I only need to store as much information about actress $A$ as is needed to reconstruct approximate depictions of her running and speaking. This might need more or less space on the hard drive than perfect reconstruction: besides this reduced version of actress $A$, I might also store, say, a generic building that allows reconstructing two different but similar houses in episodes $X$ and $Y$. And indeed, lossy common information may be larger or smaller than lossless common information (see Lemma 1 in 1403.8093).

The lossy version of common randomness, however, is not fully satisfactory to me. If I am not interested in perfect reconstruction, should I not be able to store more on my internal hard drive? For example, if the credits of my two episodes $X$ and $Y$ differ only in a few actors' names, can I not store a "noisy" version of the credits in addition to the opening theme? Nevertheless, the lossy version of common randomness is usually smaller than the lossless version (see Corollary 2 in 1403.8093).

I would thus suggest different definitions of lossy common randomness: The lossy common randomness is an RV $U'$ that maximizes
$$ I(X,Y; U') \text{ or } I(\hat X,\hat Y; U')$$
subject to the constraints $\hat X-\hat Y-U'$, $\hat Y-\hat X-U'$, $X-\hat X - U'$, and $Y-\hat Y-U'$. The difference is subtle and mirrors the definition of lossy common information. While lossy common information requires $W$ to make the reconstructions $\hat X$ and $\hat Y$ conditionally independent, the original definition of lossy common randomness requires that $U$ is a common part of $X$ and $Y$ (the first two Markov constraints). The definition proposed here would require that $U'$ is only a common part of the reconstructions $\hat{X}$ and $\hat Y$. Given this definition, we may ask the following questions:

  1. Does the new definition satisfy the desired property that it is larger than the lossless common randomness?
  2. Is there a communications problem for which this definition is a single-letter characterization?

Monday, October 24, 2016

Invariant Distribution of a Second-Order Markov Chain?

A second-order Markov chain $X$ on a finite state space $\mathcal{X}$ is a stochastic process that satisfies
$$
 \mathbb{P}(X_n=x|X_{n-1}=x_{n-1},\dots,X_1=x_1) = \mathbb{P}(X_n=x|X_{n-1}=x_{n-1},X_{n-2}=x_{n-2})
$$
If the right-hand side does not depend on $n$, we call the second-order Markov chain homogeneous and write
$$
  Q_{x,y\to z}= \mathbb{P}(X_3=z|X_2=y,X_1=x)
$$
We say that this Markov chain is irreducible if and only if, from every pair $(x,y)$, every state $z$ can be reached in some finite number of steps. In other words, let
$$
  Q^n_{x,y\to z}= \mathbb{P}(X_{n+2}=z|X_2=y,X_1=x).
$$
Then $X$ is irreducible if and only if for every $(x,y)$ and every $z$ there exists an $n=n(x,y,z)\ge 1$ such that $Q^n_{x,y\to z}>0$. An even stronger condition is regularity: a second-order Markov chain $X$ is regular if and only if this integer $n$ can be chosen independently of $(x,y)$ and $z$; in this case, we write $Q^n>0$.

We are now interested in the invariant distribution of $X$. In other words, we look for a probability distribution $\pi_{x,y}$ on $\mathcal{X}^2$ such that
$$
  \pi_{y,z} = \sum_{x\in\mathcal{X}} \pi_{x,y}Q_{x,y\to z}.
$$

More precisely, we are interested in whether there exists a unique invariant distribution. It is known, for example, that if $X$ is a first-order Markov chain, a unique invariant distribution exists if $X$ is irreducible, and a fortiori if $X$ is regular. Moreover, for a first-order Markov chain, a unique invariant distribution exists if $X$ is a so-called ergodic unichain, i.e., a chain with a single recurrent class (plus possibly transient states).

It is easy to show that if $X$ is a second-order Markov chain, then the process $X^{(2)}$ with samples $(X_1,X_2)$, $(X_2,X_3)$, $(X_3,X_4)$, etc. is a first-order Markov chain. The hope is that this fact allows us to compute the invariant distribution (or prove its uniqueness) with the help of the simple first-order case (see, e.g., here). Unfortunately, this is not possible in general: even if $X$ is a regular second-order Markov chain (and thus irreducible), the chain $X^{(2)}$ need not be irreducible. To see this, consider the following example:

Let $X$ be a second-order Markov chain on $\{1,2,3,4\}$ with transition matrix $Q$, where
$$Q=\begin{bmatrix} 0.5 & 0.5 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0.5 & 0.5 \\
0 & 0.5 & 0.5 & 0 \\
0.5 & 0 & 0 & 0.5 \\
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 \\
0 & 0.5 & 0.5 & 0 \\
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0.5 & 0.5\end{bmatrix}$$
In this matrix, the columns are labeled with states $\{1,2,3,4\}$, while the rows are labeled with the state tuples, i.e., $\{11,12,13,14,21,\dots,44\}$.

This Markov chain is regular, since $Q^{10}>0$, $Q^{11}>0$, etc. It turns out, however, that $X^{(2)}$ is not irreducible: depending on the initial states, $X$ eventually cycles either through $1-2-3-4-1$ and $1-2-3-1$, or through $1-4-3-2-1$ and $1-3-2-1$. It follows that $X^{(2)}$ has the transient states $\{(1,1),(2,2),(2,4),(3,3),(4,2),(4,4)\}$ and the two communicating classes
$$
  \{(1,2),(2,3),(3,1),(3,4),(4,1)\}
$$
 and
$$
  \{(1,3),(3,2),(2,1),(1,4),(4,3)\}.
$$
It follows that there is no unique invariant distribution $\pi_{x,y}$.
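
These claims are easy to check numerically. Below is a small sketch (mine, not from the original post) that builds the first-order transition matrix of the pair chain $X^{(2)}$ from $Q$, checks the positivity of $Q^n$, and tests whether $X^{(2)}$ is irreducible:

import itertools
import numpy as np

states = [1, 2, 3, 4]
pairs = list(itertools.product(states, states))   # (1,1), (1,2), ..., (4,4)

# Q[(x,y)][z-1] = P(X_3 = z | X_1 = x, X_2 = y), rows in the order 11, 12, ..., 44.
Q = np.array([
    [0.5, 0.5, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 1, 0],
    [0, 0, 0.5, 0.5], [0, 0.5, 0.5, 0], [0.5, 0, 0, 0.5], [1, 0, 0, 0],
    [0, 1, 0, 0], [1, 0, 0, 0], [0, 0.5, 0.5, 0], [1, 0, 0, 0],
    [0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0.5, 0.5],
])

# Transition matrix of the pair chain X^{(2)}: (x,y) -> (y,z) with probability Q_{x,y->z}.
P = np.zeros((16, 16))
for i, (x, y) in enumerate(pairs):
    for j, (y2, z) in enumerate(pairs):
        if y2 == y:
            P[i, j] = Q[i, z - 1]

# Regularity check: Q^n_{x,y->z} = P(X_{n+2}=z | X_1=x, X_2=y) = sum_w P^n[(x,y),(w,z)].
# Q^10 and Q^11 should be entrywise positive, matching the claim of regularity.
for n in (9, 10, 11):
    Pn = np.linalg.matrix_power(P, n)
    Qn = np.array([[Pn[i, [pairs.index((w, z)) for w in states]].sum()
                    for z in states] for i in range(16)])
    print(f"min of Q^{n}:", Qn.min())

# Irreducibility of X^{(2)}: can every pair state reach every other pair state?
reach = np.linalg.matrix_power(np.eye(16) + P, 16) > 0
print("X^(2) irreducible:", bool(reach.all()))   # False: two communicating classes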

To find out (a little bit) more, I recommend Chapter 7 of this book. Unfortunately, that is the only reference I have found, the original works being available only in Romanian and on paper. I would be extremely grateful for any reference linked here!

Thursday, September 8, 2016

My First Lecture in Numbers

It's done! The semester is over and I graded the last student of my lecture "Information-Theoretic System Analysis and Design". It was a great experience, and I thank Gerhard Kramer for offering me this opportunity. I also owe Lars, my colleague, more than a few drinks for taking care of most of the problem classes, and I thank my colleagues Andrei and Ali for interesting discussions and comments about the course material. And I also thank my students, who bore with me until the end of the semester and never got tired of asking interesting questions -- I think I've never had such an engaged, excited audience.

Here's the course in numbers:
  • (I think) I held 14 lectures. Lars and I shared 6 problem classes.
  • I spent roughly 200 hours preparing course notes, lecturing, correcting deliverables, etc.
  • In the problem classes, students solved 11 problems in front of their peers. The remaining 7 problems were presented by Lars.
  • We had 1 exam and 2 mandatory homework assignments, weighted 50%-25%-25%.
  • Each of the two homeworks had 9 problems.
  • We had 8 regular students with (as far as I know) 8 different nationalities, originating from 2 continents.
  • 7 of these students took the exam and handed in the 2 homework assignments. The average grade of the course was 1.35.
  • The course notes totaled 102 pages, containing 8 chapters, 40 worked examples, and 62 end-of-chapter problems.
  • 6 students finished the course evaluation. An excerpt is shown below.
Course Evaluation (in parts)
And here are some lessons I learned:
  • My PhD thesis is far from complete.
  • There is never enough time to prepare course notes. Even though the course was based on my PhD thesis, it was hard work to prepare the results in an understandable manner.
  • You have to make sure that the description of homework problems is precise and complete.
  • Don't get confused by registration numbers. 30+ students registered for the course, but fewer than 10 showed up in the first lecture. Apparently (so I was told), many students register for a course to get the course notes on Moodle.
  • Students prefer a script that they can get printed at the Fachschaft. Putting chapters online a few days before the lecture is ok, but a script is better.
  • It's good not to have a strict course syllabus if it's the first time you give the course. You never know exactly how much time you have to spend on a given topic, so it's better to leave room for changes. Still, it's a good idea to have at least a rough sketch of the topics you want to cover.

Thursday, July 21, 2016

Reading Notes: Semi-Supervised Clustering

Reading Notes: A brief summary of papers I've read, mainly for personal future reference. The claims about the papers are due to my own (mis-)understanding and reflect my personal opinion. If you know other interesting papers on this topic, I would be happy if you could leave a comment!

Semi-supervised clustering, or constrained clustering, deals with clustering where a few of the data points are labeled, either with constraints stating that they must belong to the same or to different clusters, or with a cluster label.

Constraints as Features (Asafi, Cohen-Or): Data points are assumed to come from an $M$-dimensional space. It is assumed that a pairwise distance matrix is given. Must-Link constraints reduce distances, after which all other distances have to be recomputed to satisfy the triangle inequality. Each Cannot-Link constraint adds a dimension to the data set. Moreover, each Cannot-Link constraint adds an additional distance between the two corresponding data points; the distances between unconstrained data points are increased according to their diffusion distance to the constrained points (data points that have small diffusion distances to the constrained points get larger distances added). By adding a dimension for each constraint, these added distances are "orthogonal" to the original distances, thus avoiding a change of the "general shape" of the data set. The thus obtained distance matrix can be clustered using standard, unsupervised clustering techniques. Experiments: Image segmentation, UCI repo. Noisy constraints are not considered.

Clustering with Partition Level Side Information (Liu, Fu): Data points are assumed to come from an $M$-dimensional space. It is assumed that a pairwise distance matrix is given. A subset of the data points is labeled with a cluster label $1$ through $K$. The labels are assumed to extend the dimension, i.e., each data point becomes an element of an $(M+K)$-dimensional space; all data points initially lie in the original $M$-dimensional subspace, and data points labeled with cluster $i$ are shifted into the parallel subspace obtained by adding the unit vector in the $(M+i)$-th direction. A k-means-like clustering algorithm is designed that alternately computes cluster memberships and centroids. Experiments: UCI repo, analysis of noisy labels.
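
As I read it, the lifting idea amounts to appending a scaled one-hot label vector to each labeled point and then clustering in the extended space. The sketch below is my own simplification (plain k-means instead of the authors' modified updates); the helper name and parameters are invented for illustration.

import numpy as np
from sklearn.cluster import KMeans

def lift_features(X, labels, K, shift=1.0):
    """Append K extra dimensions; labeled points get a scaled one-hot vector there.

    X:      (n, M) data matrix
    labels: length-n array, cluster label in {0, ..., K-1} or -1 if unlabeled
    """
    n = X.shape[0]
    extra = np.zeros((n, K))
    for i, lab in enumerate(labels):
        if lab >= 0:
            extra[i, lab] = shift
    return np.hstack([X, extra])

# Toy example: two Gaussian blobs, a handful of labeled points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
labels = -np.ones(100, dtype=int)
labels[:5] = 0          # five points labeled as cluster 0
labels[50:55] = 1       # five points labeled as cluster 1

X_lifted = lift_features(X, labels, K=2, shift=2.0)
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_lifted)
print(pred[:10], pred[50:60])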

Fuzzy Clustering with Partial Supervision (Pedrycz, Waletzky): The authors suggest adapting the fuzzy c-means algorithm for semi-supervised clustering. Data points are assumed to come from an $M$-dimensional space, and it is assumed that a pairwise distance matrix or a pairwise Mahalanobis distance matrix is given. The partial supervision comes in terms of partition level side information. The cost function of fuzzy c-means is regularized with a term penalizing the difference between the obtained cluster membership matrix and the cluster memberships of the labeled data points. (Although I believe that the term $b_k$ in their equation (2) should be outside the $p$-power.) Experiments: synthetic data, IRIS data set, software data set. Noise was, as far as I know, not considered.

Spectral Learning (Kamvar, Klein, Manning): Data points are given in terms of an affinity matrix. Labels can be given either as cluster labels or as Must-Link/Cannot-Link constraints. For data points in the same group, the affinity values are overridden to the maximum (without changing the values of nearby data points; still, the effect on unlabeled points is visible in the spectral space), while data points in different groups get the minimum affinity value. From the affinity matrix, a Markov transition probability matrix is generated, from which the $K$ leading eigenvectors are extracted. These eigenvectors form the (spectral) features on which a classifier is trained using the labeled data points; the classifier is then used to classify the unlabeled data points. Experiments: Newsgroup data; performance increases both with the percentage of labeled data points and with the number of unlabeled data points. Noisy labels/constraints are not considered.
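
Here is a rough sketch of that pipeline as I understand it (my own simplified code; the normalization and the classifier are placeholders, not necessarily the authors' choices):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def spectral_learning(A, labels, K):
    """Semi-supervised spectral pipeline: constrain affinities, embed, classify.

    A:      (n, n) symmetric affinity matrix
    labels: length-n integer array, class in {0, ..., K-1} or -1 if unlabeled
    """
    A = A.copy().astype(float)
    lab_idx = np.where(labels >= 0)[0]
    a_max, a_min = A.max(), A.min()
    # Override affinities between labeled points: same class -> max, different -> min.
    for i in lab_idx:
        for j in lab_idx:
            if i != j:
                A[i, j] = a_max if labels[i] == labels[j] else a_min

    # Row-normalize to a Markov transition matrix and take the K leading eigenvectors.
    P = A / A.sum(axis=1, keepdims=True)
    eigvals, eigvecs = np.linalg.eig(P)
    order = np.argsort(-eigvals.real)
    features = eigvecs[:, order[:K]].real      # spectral embedding of all points

    # Train a simple classifier on the labeled points and classify the rest.
    clf = KNeighborsClassifier(n_neighbors=1).fit(features[lab_idx], labels[lab_idx])
    pred = labels.copy()
    pred[labels < 0] = clf.predict(features[labels < 0])
    return pred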

Flexible Constrained Spectral Clustering (Wang, Davidson): The constraints are given as Must-Link/Cannot-Link constraints with an additional belief value between 0 and 1. The utility is given as an inner product, i.e., if $Q$ is the matrix with the (real-valued) constraints (positive if $x$ and $y$ are in the same cluster, negative if in different clusters, 0 if unlabeled) and $u$ a cluster assignment vector, then $u^TQu$ is decreased by a clustering that is inconsistent with the constraints. The clustering is obtained by spectral clustering, i.e., by solving the min-cut problem under the constraint that the clustering utility is larger than a predefined threshold. This can be done by solving a generalized eigenvalue problem. The focus is on the two-cluster case. Experiments: Image segmentation, UCI repo. An analysis of noisy labels is implicit in the belief values assigned to the constraints.

Affinity and Penalty Jointly Constrained Spectral Clustering With All-Compatibility, Flexibility, and Robustness (Qian, Jiang, Wang, Su, Wang, Hu, Muzic): The approach is very similar to the one of Wang and Davidson, with two main differences: 1) They incorporate the Must-Link/Cannot-Link constraints also in the affinity matrix (by setting affinities to one and zero, respectively), and 2) the generalized eigenvalue problem is converted to a standard eigenvalue problem with a Lagrange parameter. Moreover, the authors propose a second option for the constraint matrix and extend their analysis to more than two clusters. Finally, they perform a grid search over the parameters, which can be justified by having partial supervision available. Experiments: Synthetic data, Keel data, UCI repo, USPS, human facial data set, text data, image segmentation. Noise was not considered, as far as I know.

Semi-Supervised Information-Maximization Clustering (Calandriello, Niu, Sugiyama): Data points are assumed to come from an $M$-dimensional space and are given. The clustering utility function is the $\chi^2$-divergence between the joint data-label distribution and the product of their marginals ("squared mutual information"). The utility/cost of Must-Link and Cannot-Link constraints is encoded via inner products between the conditional class distributions of the two corresponding data points. Assuming a kernel density estimate of the probabilities, the cost function can be written as a matrix multiplication, and the optimal clustering is obtained by spectral methods (leading eigenvectors of the matrix). Strange things are going on: The authors add additional constraints for transitivity of the Must-Link and Cannot-Link constraints (for binary clustering), and the kernel matrix is modified to incorporate these constraints, too. It is therefore not clear which idea (modification of the kernel matrix, encoding transitivity, etc.) leads to the desired performance. Experiments: UCI repo, USPS digits, Olivetti faces. Performance similar to spectral learning and constrained spectral clustering. Noisy constraints are not considered.

Computing Gaussian Mixture Models with EM using Equivalence Constraints (Shental, Bar-Hillel, Hertz, Weinshall): Data points are assumed to come from an $M$-dimensional space. The constraints are given as Must-Link/Cannot-Link constraints. The data is modeled as coming from a GMM. The transitive closure of the Must-Link constraints (called a chunklet) is treated as a single data point in EM, weighted according to the number of data points in the chunklet. Cannot-Link constraints introduce dependencies between the hidden variables (i.e., the GMM component indicators), which leads to a hidden Markov random field. This complicates the EM algorithm and requires a generalized EM procedure that solves the problem approximately. The EM algorithm then trains a GMM which provides a soft/hard clustering. Experiments: UCI repo; noisy labels/constraints are not considered.

Semi-supervised Learning with Penalized Probabilistic Clustering (Lu, Leen): Data points are assumed to come from an $M$-dimensional space. The constraints are given as Must-Link/Cannot-Link constraints with an additional belief value between 0 and 1. Data is modeled as coming from a GMM. The (beliefs about the) constraints give a prior distribution over all clusterings (ruling out certain clusterings when hard constraints are given). The probabilities for the GMM components are evaluated numerically, after simplifying the model by, e.g., removing overlapping pairwise relations. The generalized EM algorithm trains a GMM which provides a soft/hard clustering. Experiments: UCI repo, partially labeled images (cluster-classes are given), texture image segmentation. Noisy labels/constraints are not considered.

Semi-Supervised Density-Based Clustering (Lelis, Sander): Again, data points come from an $M$-dimensional space, and a subset of the data points is labeled with a cluster label $1$ through $K$. The method extends DBSCAN. DBSCAN requires two parameters: a real-valued parameter defining the radius of a data point's neighborhood, and an integer that specifies how many data points must lie in the neighborhood of a data point to make it a core point. The semi-supervised method uses the partial labeling to compute the real-valued parameter (such that data points with different labels fall in different clusters); hence the method only requires the integer parameter. Experiments: UCI repo. Noise was not considered.

Monday, June 20, 2016

The “I” in IT

This article has been published in the June issue of the Information Theory Society Newsletter, as this month's student column (students are always encouraged to send us their thoughts and ideas!). I gratefully acknowledge Gerhard Kramer's help in improving this article.

Have you ever asked yourself whether IT is the right field of science for you? I never did, but I guess IT wondered more than once whether I am the right person to work on its problems. Here’s my story.

I’m currently a postdoc at the Institute for Communications Engineering, Technical University of Munich, Germany. In this regard, my story has a happy ending, something I wouldn’t have dreamed of years ago. When I started my PhD at Graz University of Technology, Austria, I hadn’t attended a lecture on IT yet. My supervisor and I nevertheless decided to attempt developing a “theory for deterministic information processing”, as we called it back then. I knew entropy from communications engineering and thought that I could get all the necessary information from Chapter 2 of the “Elements”. I later read Chapter 4 for stochastic processes and even resorted to more detailed results, such as those from Kolmogorov and Pinsker, when I tried taking limits of random variables. Nevertheless, phrases like “a sequence of codes” never appeared in my work. I would probably lose what remains of my scientific credibility if I told you how many of my papers got rejected - so I won’t. I will tell you, though, what most reviewers agreed on: that the information-theoretic quantities I introduced to characterize deterministic systems lack an operational characterization. That was a valid criticism, and I learned what an operational characterization is only a few months before I obtained my PhD. Recalling IEEE Spectrum’s article on Claude Shannon, I despaired that I had “jumped on the bandwagon without really understanding” (Slepian) and contributed to the “widespread abuse” (Pierce) of IT by other fields.

However, since then I have discovered that this kind of “abuse” is successful in scientific fields outside IT. For example, nonlinear adaptive filters are trained using an error entropy, Rényi and Cauchy-Schwarz divergences are used for learning and classification, and the information-bottleneck method enjoys widespread use in clustering. To the best of my knowledge, only a few of these works accompany their objective functions with an operational characterization - mutual information is “just” a nonlinear measure of statistical dependence, and entropy is “just” a statistic that captures more than the error signal’s variance. Remarkably, these heuristic methods often show superior performance. Hence, at least in these other fields, the works are scientifically justified by experiment. (OK, I admit that this is a feeble attempt to justify my work a posteriori. As further self-justification, the Aims & Scope of the IEEE Transactions on Information Theory does state that the journal aims to publish theoretical and experimental papers.)

Should I, therefore, try to publish in other fields and transactions? Probably. My supervisor suggested more than once that I should spread the word, trying to convince scientists in signal processing to use higher-order statistics, i.e., IT quantities. I haven’t listened, though, because I feel at home in the IT community, and I would not want to miss meeting my IT friends regularly at our annual events. Even in hindsight, I would rather submit to orthodox pressure and provide operational characterizations than publish in a different community. In the future, I hope that I can do both.

That is my decision. As the IT society, we all must decide: what are our traditions (really) and how strongly shall we hold on to them? Especially when dealing with future directions of IT, as mentioned in arXiv:1507.05941, how should we help get our quantities used in fields such as signal processing, control theory, and machine learning? Or should we not? For example, should we propose clustering algorithms using IT quantities, or should we rather focus on deriving asymptotic limits for the clustering problem? You, as a PhD student, must make your own decision: Are you sufficiently “in IT” so as to be accepted by the community? If your research topic is on the border between IT and another field, in which direction do you want to go?