I had the great pleasure to give two lectures at the IDE Autumn School 2018 „Digitale Edition – Vertiefung und Nutzung“. The autumn school joins scholars from all areas of the Digital Humanities and aims at broadening their toolsets (see the remarkable syllabus of this year's school).

My task was to talk about network analysis, including graph-theoretic centrality measures and graph visualization in Python. I chose Shakespeare's Hamlet as a toy example, making use of the CSV-files from Roger W. Haworth. The slightly edited CSV-file and the Jupyter Notebooks (handout and solution) are available for download. And so are, of course, the slides (just click on the image below).

# Entropy of Random Talks

A collection of mathematical curiosities, chosen purely based on my own interest. Be prepared for a little information theory, combinatorics, and probability! (This blog is trackable: 5ZVNC2)

## Wednesday, October 3, 2018

## Wednesday, September 12, 2018

### A Visit from Marek Smieja

This week I had a nice visit from Marek Smieja, Assitant Professor at the Group of Machine Learning Research at Jagiellonian University in Krakov.

I first met Marek in 2012 on the terrace of Lausanne Palace Hotel - and if that doesn't make a good start for a story then I don't know what does. Drinks in our hands, we were discussing the exciting talks we have heard during the first few days of the IEEE Information Theory Workshop held at EPF Lausanne. Back then Marek and I were working both on information dimension, but we already had the idea that we will soon move towards machine learning.

Fast forward to 2015, when Marek invited me to visit Poland to attend the first edition of Theoretical Foundations of Machine Learning. A year or so later, Marek and I started working on a method for semi-supervised clustering, mixing ideas from model-based clustering and the information bottleneck principle. We have stayed in contact ever since, but this week was the first time we met again since raising glasses on this terrace in 2012. (Marek could not attend the conference in 2015.) Needless to say, I was really happy when he announced that he will be able to visit me at the Know-Center.

During two intensive days in Know-Centers "Small Meeting Room" we brought each other up to date regarding our current research projects, found a topic on which we may collaborate in the near future, and defined a student project with the aim to combine Markov aggregation and semi-supervised clustering. I'm looking forward to continuing this year-long academic relationship with Marek!

PS: If you have some hot results on the theory of machine learning ready, you may want to submit it to TFML 2019. Deadline is November 30th.

Marek presenting some of his most recent results - you can find the paper on arXiv (and in the upcoming proceedings of this year's edition of NIPS!) |

I first met Marek in 2012 on the terrace of Lausanne Palace Hotel - and if that doesn't make a good start for a story then I don't know what does. Drinks in our hands, we were discussing the exciting talks we have heard during the first few days of the IEEE Information Theory Workshop held at EPF Lausanne. Back then Marek and I were working both on information dimension, but we already had the idea that we will soon move towards machine learning.

Fast forward to 2015, when Marek invited me to visit Poland to attend the first edition of Theoretical Foundations of Machine Learning. A year or so later, Marek and I started working on a method for semi-supervised clustering, mixing ideas from model-based clustering and the information bottleneck principle. We have stayed in contact ever since, but this week was the first time we met again since raising glasses on this terrace in 2012. (Marek could not attend the conference in 2015.) Needless to say, I was really happy when he announced that he will be able to visit me at the Know-Center.

Marek in front of the Uhrturm, one of Graz' main attractions. |

During two intensive days in Know-Centers "Small Meeting Room" we brought each other up to date regarding our current research projects, found a topic on which we may collaborate in the near future, and defined a student project with the aim to combine Markov aggregation and semi-supervised clustering. I'm looking forward to continuing this year-long academic relationship with Marek!

PS: If you have some hot results on the theory of machine learning ready, you may want to submit it to TFML 2019. Deadline is November 30th.

## Thursday, August 16, 2018

### Santa Fe Institute Spring 2018 Complexity Challenge

At the end of my Schroedinger fellowship I decided to treat myself with participating in the Santa Fe Institute Spring 2018 Complexity Challenge. I've been a fan of the SFI as long as I've known what it does, and I've seen the most impressive papers published by authors affiliated with SFI.

The challenge itself was an agent-based model with rewards, i.e., it was closely connected to partially-observable Markov decision processes and game theory. In my solution, (which earned me an honorable mention) I took neither approach but just played around a bit with strategies I chose ad hoc. Despite this, I was amazed by the rich diversity of behavior I observed. It took me some 40 work hours but it was totally worth the effort. Thanks heaps to the organizers of the challenge, and congrats to the winners -- I hope I get the chance to participate again at some later time!

## Monday, July 9, 2018

### Data Science 101: Radar Charts

I actually wanted to write down what I don't like about radar charts, but then I found out that this already has been done. So I won't. But, adding to "Misreading 1: Area" in Ghosts on the Radar , I was thinking of the following puzzles:

Suppose you have $n$ real-valued, non-negative measurements $x_1$ through $x_n$. You are a super-smart data scientist and you want to use a radar chart to trick your customer to believe that the measurement result is "good" or "bad" -- i.e., the area under the polygon obtained by connecting points corresponding to these measurements should be large/small. Since the angle between the axes is fixed to $360/(n-1)$, you can influence the size of the polygonal area only by selecting the order in which you plot the measurements to the axes. What is the optimal assignment of measurement values to axes such that the area under the polygon is maximized/minimized?

Edit on July 12th, 2018: Here is the solution to the problem of maxizing the area.

Suppose you have $n$ real-valued, non-negative measurements $x_1$ through $x_n$. You are a super-smart data scientist and you want to use a radar chart to trick your customer to believe that the measurement result is "good" or "bad" -- i.e., the area under the polygon obtained by connecting points corresponding to these measurements should be large/small. Since the angle between the axes is fixed to $360/(n-1)$, you can influence the size of the polygonal area only by selecting the order in which you plot the measurements to the axes. What is the optimal assignment of measurement values to axes such that the area under the polygon is maximized/minimized?

Edit on July 12th, 2018: Here is the solution to the problem of maxizing the area.

## Tuesday, April 10, 2018

### Data Science 101: Affine Transforms Prior to Linear Regression

At work, we were recently talking about whether it is ok to affinely transform features prior to linear regression minimizing the least squares. I.e., do the common practices of min-max scaling or unit-variance scaling and centering result in regression coefficients equivalent in the sense that there exists an affine transform that maps the resulting coefficients to those obtained without affinely transforming the features?

The short answer: Any affine transform is ok if you fit the intercept term in linear regression, and any linear transform is ok even if you don't. However, if you fit a linear regression model

**without**intercept to an affine transform of your features, then the coefficients you get are not equivalent to those you get when you fit the linear regression model to the original features.

Indeed, suppose that $y$ is the target vector and $X$ the matrix of features. The optimal coefficients for linear regression without intercept are given by $\beta_o = X^+y$, where $X^+$ is the pseudo-inverse of $X$. Now let $W$ be a linear, invertible transform (e.g., scaling each feature to unit variance). Then, the optimal coefficients for the scaled features are $\gamma_o=(XW)^+y=W^{-1}X^+y=W^{-1}\beta_o$, which follows because the pseudo-inverse of a matrix product is the product of pseudo-inverses with the order of matrices switched. Thus, $\gamma_o$ and $\beta_o$ are equivalent in above sense. The situation changes, of course, if we transform $X$ affinely, which amounts to a rank-1 update of $XW$. Unfortunately, the relation between the pseudo-inverse of $XW$ and its rank-1 update is not obvious (but may be obtained by the Woodbury matrix identity?). In any case, and as the experiments show below, the coefficients are not equivalent in above sense.

If you fit the intercept term, then you choose a coefficient vector $\beta_o$ and a scalar coefficient $\beta_{o,0}$ such that $y$ is well approximated by $X\beta_o + 1\beta_{o,0}$, where $1$ is a vector of ones. Transforming $X$ affinely generates a feature matrix $XW+1\lambda^T$, where $\lambda$ is a vector of values subtracted from each linearly transformed feature. For this feature matrix, we want to find coefficients $\gamma_o$ and $\gamma_{o,0}$ such that $y$ is well approximated by $XW\gamma_o +1\lambda^T\gamma_o + 1\gamma_{o,0}$. One can see that $\gamma_{o,0}$ provides sufficient degrees of freedom to freely choose $\gamma_o$, which can thus be done by computing the pseudo-inverse of $XW$. Indeed, we obtain $\gamma_o=W^{-1}\beta_o$. With $\gamma_o$ now fixed, the intercept term becomes $\gamma_{o,0}=\beta_{o,0}+\lambda^T\gamma_o$.

In [1]:

```
## Setting up the model
import numpy as np
import matplotlib.pyplot as plt
# generate random numbers, mean = 500, std = 10
N=1000
mu_x=500
mu_n=0
sigma_x=10
sigma_n=0.05
x=np.random.normal(mu_x, sigma_x, N)
n=np.random.normal(mu_n, sigma_n, N)
# generate linear+noise model
a=0.01
y=a*x+n
```

In [2]:

```
## Learning a linear regression model with intercept
from sklearn import datasets, linear_model
# coefficients of linear regression model w/o scaling and centering
regr = linear_model.LinearRegression()
regr.fit(x.reshape(-1,1), y)
# coefficients of linear regression model with scaling and centering
std=50
mu=200
regr_centered=linear_model.LinearRegression()
x_centered=(x-mu)/std
regr_centered.fit(x_centered.reshape(-1,1), y)
# plot the coefficients
x_list = np.linspace(460, 540, 10)
plt.plot(x,y,'c.')
plt.plot(x_list,x_list*regr.coef_+regr.intercept_,'b')
plt.plot(x_list,x_list*(1/std)*regr_centered.coef_+regr_centered.intercept_-regr_centered.coef_*mu/std,'b-o')
plt.show()
```

In [3]:

```
## Learning a linear regression model without intercept
# coefficients of linear regression model w/o scaling and centering
regr = linear_model.LinearRegression(fit_intercept=False)
regr.fit(x.reshape(-1,1), y)
# coefficients of linear regression model with scaling and centering
std=50
mu=200
regr_centered=linear_model.LinearRegression(fit_intercept=False)
x_centered=(x-mu)/std
regr_centered.fit(x_centered.reshape(-1,1), y)
# plot the coefficients
x_list = np.linspace(460, 540, 10)
plt.plot(x,y,'c.')
plt.plot(x_list,x_list*regr.coef_+regr.intercept_,'r')
plt.plot(x_list,x_list*(1/std)*regr_centered.coef_+regr_centered.intercept_-regr_centered.coef_*mu/std,'b-o')
plt.show()
```

Subscribe to:
Posts (Atom)