Tuesday, June 4, 2019

Talk at LSIT 2019

On May 30th, I had the great pleasure of giving a talk at the 5th London Symposium on Information Theory. The symposium is a revival of a conference series that was started in the 50s and 60s, with notable speakers such as Shannon and Turing. As back then, this year’s LSIT was jointly organized by Imperial College London (Deniz Gündüz) and King’s College London (Osvaldo Simeone). It was a great honor to be one of the invited speakers, and I was happy to talk about the potentials and pitfalls of training neural networks to minimize the information bottleneck functional (joint work with Ali Amjad from TUM). The paper accompanying this work has been accepted for publication in the IEEE Transactions on Pattern Analysis and Machine Intelligence (but you can also find it on arXiv). If you are interested in the talk, as always you can download it by clicking on the image below.

Unfortunately, my stay at this symposium was the shortest I have ever had (and, hopefully, will ever have): I got notice on the morning of my talk that my wife and my son had fallen sick, so I decided to fly back right after my talk to support them as best I could. Apparently, the universe decided at the same time to make my trip back home as complicated as possible: The mobile website of Austrian Airlines claimed that my last name was invalid (whatever that means), a two-mile run to get my luggage from the hotel left me all sweaty, and a fire alarm right in the middle of my talk upset the conference schedule. I still managed to give my talk – it would not have been possible without the generous help of the organizers and the kind understanding of the entire audience.

Leaving a conference right after the talk is rude; it does not give your colleagues the opportunity to discuss your ideas with you offline over coffee (or beer). Even worse, it can be seen as an expression of disinterest in the talks of your colleagues. In my case, leaving the conference so early made me sad in one more way: I had to leave a group of people – information theorists – that I consider my academic family (and many of whom I even consider friends). Only my own family could make me do that – and I know that the attendees of the London Symposium understand. Thanks!

Monday, April 15, 2019

Talk at apc|m 2019

I recently attended the 19th European Advanced Process Control and Manufacturing Conference, held this year in the nice city of Villach, Austria. The conference hosts experts in semiconductor manufacturing from both academia and industry.

I had the pleasure of talking about our work on an information-theoretic similarity measure for patterns on analog wafermaps. Analog wafermaps depict electrical measurement values of devices on a wafer, and patterns on these wafermaps may indicate process deviations. Detecting and classifying these patterns, and reacting appropriately, can prevent further such deviations and, consequently, yield loss. Our work, a collaboration between Know-Center and K-AI within the SemI40 project, makes use of a feature extraction pipeline that was recently accepted for publication in the IEEE Transactions on Semiconductor Manufacturing. If you are interested in the slides, just click on the image below.

Tuesday, November 6, 2018

Data Science 101: Average Silhouette Coefficient

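In this short entry I will talk about the average silhouette coefficient (ASC), which is a popular internal cluster validation measure. To be precise, the ASC is the average of the silhouette values of all points in a given dataset, where the silhouette of a point compares its mean distance to the other points of its own cluster with its mean distance to the points of the nearest other cluster. We will consider a very specific dataset in this entry, which we shall call the Mouse dataset: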
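For reference, here is a minimal sketch of the usual definition of the silhouette: for a point i, let a(i) be its mean distance to the other points of its own cluster and b(i) its mean distance to the points of the nearest other cluster; then s(i) = (b(i) - a(i)) / max(a(i), b(i)), and the ASC is the mean of all s(i). The function names, the toy points, and the convention s(i) = 0 for singleton clusters are my assumptions for illustration, not taken from a particular library.

```python
from math import dist
from statistics import mean


def silhouette_values(points, labels):
    """Silhouette s(i) of every point, using Euclidean distances."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of p's own cluster
        # (dist(p, p) = 0, so summing over the whole cluster is harmless)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(mean(dist(p, q) for q in members)
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


def average_silhouette_coefficient(points, labels):
    """ASC: the mean of the silhouette values of all points."""
    return mean(silhouette_values(points, labels))


# Toy data (made up): two tight pairs, four units apart.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
labels = [0, 0, 1, 1]
asc = average_silhouette_coefficient(points, labels)
print(round(asc, 3))  # 0.754
```

Every s(i) lies between -1 and 1; values close to 1 mean that a point is much closer to its own cluster than to any other, which is why a larger ASC is read as a "better" clustering.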

We will next cluster this dataset into three clusters using k-means. Furthermore, we will evaluate both the clustering result from k-means and the groundtruth clustering (namely, one "head" and two "ears") by means of the ASC:

What we observe is quite interesting. First of all, it can be seen that k-means fails to detect the groundtruth clustering, even though the clusters are separated. (See also here; it is argued that k-means prefers clusters of similar size, where size is taken in a Euclidean sense and not in the sense of equal number of datapoints.) Second, and more importantly, the ASC of the "wrong" solution is larger (i.e., better) than that of the groundtruth.
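The experiment can be sketched roughly as follows. This is not the code behind the figures: the disc centers, radii, and sample sizes of this mouse-like dataset are made up, k-means is a plain Lloyd iteration with random restarts written in NumPy, and `disc`, `asc`, and `kmeans` are hypothetical helper names. Whether the k-means solution scores higher than the groundtruth depends on the chosen geometry, so the sketch only prints both values.

```python
import numpy as np

rng = np.random.default_rng(0)


def disc(center, radius, n):
    """Sample n points uniformly from a disc."""
    r = radius * np.sqrt(rng.uniform(size=n))
    phi = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return np.asarray(center) + np.column_stack((r * np.cos(phi), r * np.sin(phi)))


def asc(X, labels):
    """Average silhouette coefficient for Euclidean distances."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    ids, idx = np.unique(labels, return_inverse=True)
    onehot = np.eye(len(ids))[idx]            # (n, k) cluster indicators
    counts = onehot.sum(axis=0)               # cluster sizes
    sums = D @ onehot                         # summed distance to each cluster
    rows = np.arange(len(X))
    a = sums[rows, idx] / np.maximum(counts[idx] - 1, 1)
    means = sums / counts
    means[rows, idx] = np.inf                 # exclude the own cluster from the min
    b = means.min(axis=1)
    s = np.where(counts[idx] > 1, (b - a) / np.maximum(a, b), 0.0)
    return s.mean()


def kmeans(X, k, n_init=10, n_iter=100):
    """Plain Lloyd iterations with random restarts; returns the best labels."""
    best_labels, best_cost = None, np.inf
    for _ in range(n_init):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        cost = ((X - centers[labels]) ** 2).sum()  # within-cluster sum of squares
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels


# Hypothetical "mouse": one large head and two small, well-separated ears.
X = np.vstack((disc((0, 0), 2.0, 300),
               disc((-2.2, 2.2), 0.7, 100),
               disc((2.2, 2.2), 0.7, 100)))
truth = np.repeat([0, 1, 2], [300, 100, 100])

found = kmeans(X, k=3)
asc_truth = asc(X, truth)
asc_kmeans = asc(X, found)
print(f"ASC groundtruth: {asc_truth:.3f}, ASC k-means: {asc_kmeans:.3f}")
```

Comparing `found` with `truth` (for instance via a contingency table) shows where k-means cuts into the head; the two printed numbers are the quantities compared in the text.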

As a second experiment, we projected the Mouse dataset into three-dimensional space and evaluated the ASC for the groundtruth clustering:

As can be seen, the ASC differs from the ASC of the same cluster assignment in two-dimensional space -- the ASC depends on the dimension of the dataset.

All this of course makes sense once we recognize that the ASC is distance-dependent. Since distances can change when a dataset is projected into some higher-dimensional space, it is not surprising that the ASC changes as well. Furthermore, since k-means is a distance-based clustering technique, it is not surprising that the ASC of a k-means clustering is high. And finally, the ASC will be a good indicator of cluster validity if the clusters in the dataset are distance-based (and not, e.g., density-, model-, or graph-based).
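The distance dependence can be illustrated with a small sketch. The entry does not state which map was used to move the data into three dimensions, so both maps below are my assumptions: an isometric embedding (append a zero coordinate, then apply a random orthogonal transformation), which preserves all pairwise distances, and a nonlinear lift (third coordinate equal to the squared norm), which does not. The two-blob data and the helper `asc` are likewise made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)


def asc(X, labels):
    """Average silhouette coefficient for Euclidean distances."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    ids, idx = np.unique(labels, return_inverse=True)
    onehot = np.eye(len(ids))[idx]            # (n, k) cluster indicators
    counts = onehot.sum(axis=0)               # cluster sizes
    sums = D @ onehot                         # summed distance to each cluster
    rows = np.arange(len(X))
    a = sums[rows, idx] / np.maximum(counts[idx] - 1, 1)
    means = sums / counts
    means[rows, idx] = np.inf                 # exclude the own cluster from the min
    b = means.min(axis=1)
    s = np.where(counts[idx] > 1, (b - a) / np.maximum(a, b), 0.0)
    return s.mean()


# Made-up data: two Gaussian blobs in the plane.
X = np.vstack((rng.normal((0, 0), 0.5, size=(100, 2)),
               rng.normal((3, 0), 0.5, size=(100, 2))))
labels = np.repeat([0, 1], 100)

# Isometric embedding: append a zero coordinate, then apply an orthogonal map.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
X_iso = np.column_stack((X, np.zeros(len(X)))) @ Q.T

# Non-isometric lift: the third coordinate is the squared norm of each point.
X_lift = np.column_stack((X, (X ** 2).sum(axis=1)))

asc_2d = asc(X, labels)
asc_iso = asc(X_iso, labels)
asc_lift = asc(X_lift, labels)
print(f"2-D: {asc_2d:.4f}, isometric 3-D: {asc_iso:.4f}, lifted 3-D: {asc_lift:.4f}")
```

The isometric embedding reproduces the two-dimensional ASC up to rounding, while the lift changes it. Under these assumptions, it is the change in pairwise distances, rather than the number of dimensions as such, that moves the ASC.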

Related to this, in "Understanding of Internal Clustering Validation Measures" it is shown that k-means performs worse than Chameleon (Figure 6) on a very similar dataset (Figure 5); at least using Chameleon, the ASC is maximized by the correct number of clusters. This paper and the short analysis presented in this entry lead to the following questions:

  • Based on what cluster assumptions (distance, density, etc.) are different internal validation measures defined?
  • Given any internal validation measure, can we find a synthetic dataset for which the groundtruth clustering has a bad value, while an "obviously wrong" clustering has an extremely good value? I.e., can we find pathological examples for which a given internal validation measure fails? (This entry shows that the answer is yes for the ASC.)
  • Given these pathological examples, can we show that their properties are in contrast with the cluster assumptions inherent to the considered internal validation measure?
Answering these questions will improve our understanding of these internal cluster validity measures and will help us choose the correct validity measure.