The challenge itself was an agent-based model with rewards, i.e., it was closely connected to partially observable Markov decision processes and game theory. In my solution (which earned me an honorable mention), I took neither approach but simply played around with strategies I chose ad hoc. Despite this, I was amazed by the rich diversity of behavior I observed. It took me some 40 work hours, but it was totally worth the effort. Thanks heaps to the organizers of the challenge, and congrats to the winners -- I hope I get the chance to participate again some time!

# Entropy of Random Talks

A collection of mathematical curiosities, chosen purely based on my own interest. Be prepared for a little information theory, combinatorics, and probability! (This blog is trackable: 5ZVNC2)

## Thursday, August 16, 2018

### Santa Fe Institute Spring 2018 Complexity Challenge

At the end of my Schroedinger fellowship I decided to treat myself by participating in the Santa Fe Institute Spring 2018 Complexity Challenge. I've been a fan of the SFI for as long as I've known what it does, and I've seen some of the most impressive papers published by authors affiliated with it.

## Monday, July 9, 2018

### Data Science 101: Radar Charts

I actually wanted to write down what I don't like about radar charts, but then I found out that this has already been done. So I won't. But, adding to "Misreading 1: Area" in Ghosts on the Radar, I came up with the following puzzle:

Suppose you have $n$ real-valued, non-negative measurements $x_1$ through $x_n$. You are a super-smart data scientist and you want to use a radar chart to trick your customer into believing that the measurement result is "good" or "bad" -- i.e., the area of the polygon obtained by connecting the points corresponding to these measurements should be large/small. Since the angle between adjacent axes is fixed to $360/n$ degrees, you can influence the polygonal area only by selecting the order in which you assign the measurements to the axes. What is the optimal assignment of measurement values to axes such that the area of the polygon is maximized/minimized?

Edit on July 12th, 2018: Here is the solution to the problem of maximizing the area.
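A quick way to build intuition is a brute-force search. The polygon's area is $\frac{1}{2}\sin(2\pi/n)\sum_i x_i x_{i+1}$ (indices cyclic), so only the cyclic sum of adjacent products matters. A minimal sketch (the helper names `radar_area` and `best_arrangement` are mine, not from the solution post); for the maximum it finds a unimodal "up then down" order of the sorted values:

```python
import itertools
import math

def radar_area(values):
    # area of the radar polygon when values sit on n equally spaced axes:
    # 0.5 * sin(2*pi/n) * (cyclic sum of adjacent products)
    n = len(values)
    return 0.5 * math.sin(2 * math.pi / n) * sum(
        values[i] * values[(i + 1) % n] for i in range(n))

def best_arrangement(values, maximize=True):
    # brute force over circular orders; fixing the first element
    # removes rotations of the same arrangement
    first, rest = values[0], tuple(values[1:])
    cands = ((first,) + p for p in itertools.permutations(rest))
    return (max if maximize else min)(cands, key=radar_area)

print(best_arrangement([1, 2, 3, 4, 5, 6]))         # a unimodal order
print(best_arrangement([1, 2, 3, 4, 5, 6], False))  # large/small values alternate
```

For small $n$ this exhaustive search is cheap ($(n-1)!$ candidates) and lets you check any conjectured ordering rule directly.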

## Tuesday, April 10, 2018

### Data Science 101: Affine Transforms Prior to Linear Regression

At work, we were recently discussing whether it is OK to affinely transform features prior to least-squares linear regression. That is: do the common practices of min-max scaling, or unit-variance scaling and centering, yield regression coefficients that are equivalent in the sense that there exists an affine transform mapping them to the coefficients obtained without transforming the features?

The short answer: any affine transform is OK if you fit the intercept term in linear regression, and any linear transform is OK even if you don't. However, if you fit a linear regression model **without** intercept to an affine transform of your features, then the coefficients you get are not equivalent to those you get when fitting the model to the original features.

Indeed, suppose that $y$ is the target vector and $X$ the matrix of features. The optimal coefficients for linear regression without intercept are given by $\beta_o = X^+y$, where $X^+$ is the pseudo-inverse of $X$. Now let $W$ be an invertible linear transform (e.g., scaling each feature to unit variance). Then the optimal coefficients for the scaled features are $\gamma_o=(XW)^+y=W^{-1}X^+y=W^{-1}\beta_o$; the identity $(XW)^+=W^{-1}X^+$ holds here because $X$ has full column rank and $W$ is invertible (in general, the pseudo-inverse of a product does not simply factor into the reversed product of pseudo-inverses). Thus, $\gamma_o$ and $\beta_o$ are equivalent in the above sense. The situation changes, of course, if we transform $X$ affinely, which amounts to a rank-1 update of $XW$. Unfortunately, the relation between the pseudo-inverse of $XW$ and its rank-1 update is not obvious (though it may be obtainable via the Woodbury matrix identity?). In any case, as the experiments below show, the coefficients are not equivalent in the above sense.
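The claim $\gamma_o = W^{-1}\beta_o$ for an invertible linear transform is easy to check numerically. A minimal sketch with NumPy (random data and a diagonal $W$, as in per-feature scaling; the specific numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # random features; full column rank almost surely
y = rng.normal(size=100)
W = np.diag([2.0, 0.5, 10.0])   # invertible linear transform (per-feature scaling)

beta = np.linalg.pinv(X) @ y        # coefficients for the original features
gamma = np.linalg.pinv(X @ W) @ y   # coefficients for the linearly transformed features

# the two solutions are related by gamma = W^{-1} beta
assert np.allclose(gamma, np.linalg.inv(W) @ beta)
```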

If you fit the intercept term, then you choose a coefficient vector $\beta_o$ and a scalar coefficient $\beta_{o,0}$ such that $y$ is well approximated by $X\beta_o + 1\beta_{o,0}$, where $1$ is a vector of ones. Transforming $X$ affinely generates a feature matrix $XW-1\lambda^T$, where $\lambda$ is the vector of values subtracted from each linearly transformed feature. For this feature matrix, we want to find coefficients $\gamma_o$ and $\gamma_{o,0}$ such that $y$ is well approximated by $XW\gamma_o -1\lambda^T\gamma_o + 1\gamma_{o,0}$. One can see that $\gamma_{o,0}$ provides enough degrees of freedom to choose $\gamma_o$ freely, which can thus be done by computing the pseudo-inverse of $XW$; indeed, we obtain $\gamma_o=W^{-1}\beta_o$. With $\gamma_o$ now fixed, the intercept term becomes $\gamma_{o,0}=\beta_{o,0}+\lambda^T\gamma_o$.
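The intercept relation can also be checked directly before running the notebook experiments. A minimal sketch, again on random data; `fit_with_intercept` is a hypothetical helper (not part of the original post) that appends a column of ones and solves least squares:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -2.0]) + 0.3 + rng.normal(scale=0.1, size=200)

def fit_with_intercept(A, y):
    # append a column of ones and solve the least-squares problem
    A1 = np.column_stack([A, np.ones(len(A))])
    coef = np.linalg.lstsq(A1, y, rcond=None)[0]
    return coef[:-1], coef[-1]

W = np.diag([3.0, 0.25])        # invertible linear transform
lam = np.array([1.0, -2.0])     # subtracted after the linear transform

beta, beta0 = fit_with_intercept(X, y)
gamma, gamma0 = fit_with_intercept(X @ W - lam, y)

# gamma = W^{-1} beta and gamma_0 = beta_0 + lambda^T gamma
assert np.allclose(gamma, np.linalg.inv(W) @ beta)
assert np.isclose(gamma0, beta0 + lam @ gamma)
```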

In [1]:

```
## Setting up the model
import numpy as np
import matplotlib.pyplot as plt

# generate N feature samples (mean 500, std 10) and noise samples (mean 0, std 0.05)
N = 1000
mu_x, sigma_x = 500, 10
mu_n, sigma_n = 0, 0.05
x = np.random.normal(mu_x, sigma_x, N)
n = np.random.normal(mu_n, sigma_n, N)

# generate the linear+noise model y = a*x + n
a = 0.01
y = a*x + n
```

In [2]:

```
## Learning a linear regression model with intercept
from sklearn import linear_model

# coefficients of the linear regression model w/o scaling and centering
regr = linear_model.LinearRegression()
regr.fit(x.reshape(-1, 1), y)

# coefficients of the linear regression model with scaling and centering
std = 50
mu = 200
regr_centered = linear_model.LinearRegression()
x_centered = (x - mu)/std
regr_centered.fit(x_centered.reshape(-1, 1), y)

# plot both fits in the original feature coordinates; the two lines coincide
x_list = np.linspace(460, 540, 10)
plt.plot(x, y, 'c.')
plt.plot(x_list, x_list*regr.coef_ + regr.intercept_, 'b')
plt.plot(x_list, x_list*(1/std)*regr_centered.coef_ + regr_centered.intercept_ - regr_centered.coef_*mu/std, 'b-o')
plt.show()
```

In [3]:

```
## Learning a linear regression model without intercept
# coefficients of the linear regression model w/o scaling and centering
regr = linear_model.LinearRegression(fit_intercept=False)
regr.fit(x.reshape(-1, 1), y)

# coefficients of the linear regression model with scaling and centering
std = 50
mu = 200
regr_centered = linear_model.LinearRegression(fit_intercept=False)
x_centered = (x - mu)/std
regr_centered.fit(x_centered.reshape(-1, 1), y)

# plot both fits in the original feature coordinates; the two lines now differ
x_list = np.linspace(460, 540, 10)
plt.plot(x, y, 'c.')
plt.plot(x_list, x_list*regr.coef_ + regr.intercept_, 'r')
plt.plot(x_list, x_list*(1/std)*regr_centered.coef_ + regr_centered.intercept_ - regr_centered.coef_*mu/std, 'b-o')
plt.show()
```
