## Tuesday, April 10, 2018

### Data Science 101: Affine Transforms Prior to Linear Regression

At work, we recently discussed whether it is OK to affinely transform features before fitting a least-squares linear regression. That is, do the common practices of min-max scaling, or of scaling to unit variance and centering, yield regression coefficients that are equivalent in the sense that there exists an affine transform mapping them to the coefficients obtained on the untransformed features?

The short answer: any invertible affine transform is OK if you fit the intercept term in linear regression, and any invertible linear transform is OK even if you don't. However, if you fit a linear regression model without an intercept to an affinely transformed version of your features, the resulting coefficients are not equivalent to those obtained on the original features.

Indeed, suppose that $y$ is the target vector and $X$ the matrix of features. The optimal coefficients for linear regression without intercept are given by $\beta_o = X^+y$, where $X^+$ is the pseudo-inverse of $X$. Now let $W$ be an invertible linear transform (e.g., scaling each feature to unit variance). Then the optimal coefficients for the scaled features are $\gamma_o=(XW)^+y=W^{-1}X^+y=W^{-1}\beta_o$. Here the reverse-order law $(XW)^+=W^{-1}X^+$ applies because $X$ has full column rank (as is typical in regression) and $W$ is invertible; it does not hold for arbitrary matrix products. Thus, $\gamma_o$ and $\beta_o$ are equivalent in the above sense. The situation changes, of course, if we transform $X$ affinely, which amounts to a rank-1 update of $XW$. Unfortunately, the relation between the pseudo-inverse of $XW$ and that of its rank-1 update is not obvious (it may perhaps be obtained via the Woodbury matrix identity). In any case, as the experiments below show, the coefficients are not equivalent in the above sense.
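
As a quick numerical sanity check (a minimal sketch I added, with made-up dimensions and a hypothetical diagonal $W$), the following verifies the reverse-order law for a full-column-rank $X$ and an invertible $W$:

import numpy as np

# check (XW)^+ y = W^{-1} X^+ y for full-column-rank X and invertible W
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))    # full column rank with probability one
W = np.diag([2.0, 5.0, 0.5])     # invertible linear transform
y = rng.normal(size=100)

beta = np.linalg.pinv(X) @ y         # fit on the raw features
gamma = np.linalg.pinv(X @ W) @ y    # fit on the linearly transformed features
print(np.allclose(gamma, np.linalg.solve(W, beta)))  # True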

If you fit the intercept term, then you choose a coefficient vector $\beta_o$ and a scalar intercept $\beta_{o,0}$ such that $y$ is well approximated by $X\beta_o + 1\beta_{o,0}$, where $1$ is a vector of ones. Transforming $X$ affinely generates a feature matrix $XW-1\lambda^T$, where $\lambda$ is the vector of values subtracted from each linearly transformed feature. For this feature matrix, we want to find coefficients $\gamma_o$ and $\gamma_{o,0}$ such that $y$ is well approximated by $XW\gamma_o - 1\lambda^T\gamma_o + 1\gamma_{o,0}$. Since $-1\lambda^T\gamma_o$ is a multiple of the all-ones column, it can be absorbed into the intercept, so $\gamma_{o,0}$ provides enough degrees of freedom to choose $\gamma_o$ exactly as in the linear case, i.e., via the pseudo-inverse of $XW$. Indeed, we obtain $\gamma_o=W^{-1}\beta_o$. With $\gamma_o$ fixed, matching the constant terms yields $\gamma_{o,0}=\beta_{o,0}+\lambda^T\gamma_o$.
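
Again as a sketch I added (random data, a hypothetical diagonal $W$ and offset vector lam), the following checks both identities using a hand-rolled least-squares fit on the augmented matrix $[X, 1]$:

import numpy as np

rng = np.random.default_rng(1)
N = 200
X = rng.normal(size=(N, 2))
W = np.diag([3.0, 0.25])           # invertible linear transform
lam = np.array([1.5, -2.0])        # offsets subtracted after scaling
y = X @ np.array([0.7, -1.2]) + 0.5 + 0.1 * rng.normal(size=N)

def fit_with_intercept(F, y):
    # least squares on [F, 1]; the last coefficient is the intercept
    A = np.column_stack([F, np.ones(len(F))])
    coef = np.linalg.pinv(A) @ y
    return coef[:-1], coef[-1]

beta, beta0 = fit_with_intercept(X, y)
gamma, gamma0 = fit_with_intercept(X @ W - lam, y)   # XW - 1*lam^T
print(np.allclose(gamma, np.linalg.solve(W, beta)))  # True: gamma = W^{-1} beta
print(np.isclose(gamma0, beta0 + lam @ gamma))       # True: intercept relation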

In [1]:
## Setting up the model

import numpy as np
import matplotlib.pyplot as plt

# draw features x ~ N(mu_x, sigma_x^2) and noise n ~ N(mu_n, sigma_n^2)
N = 1000
mu_x = 500
mu_n = 0
sigma_x = 10
sigma_n = 0.05
x = np.random.normal(mu_x, sigma_x, N)
n = np.random.normal(mu_n, sigma_n, N)

# generate the linear + noise model: y = a*x + n
a = 0.01
y = a * x + n

In [2]:
## Learning a linear regression model with intercept

from sklearn import linear_model

# coefficients of the linear regression model w/o scaling and centering
regr = linear_model.LinearRegression()
regr.fit(x.reshape(-1, 1), y)

# coefficients of the linear regression model with scaling and centering
std = 50
mu = 200
regr_centered = linear_model.LinearRegression()
x_centered = (x - mu) / std
regr_centered.fit(x_centered.reshape(-1, 1), y)

# plot the data, the raw-feature fit, and the centered fit mapped back to
# the original feature scale; the two lines coincide
x_list = np.linspace(460, 540, 10)
plt.plot(x, y, 'c.')
plt.plot(x_list, x_list * regr.coef_ + regr.intercept_, 'b')
plt.plot(x_list, x_list * (1 / std) * regr_centered.coef_ + regr_centered.intercept_ - regr_centered.coef_ * mu / std, 'b-o')
plt.show()
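
To see the equivalence numerically rather than only in the plot, one can compare the mapped-back coefficients to the direct fit (a check I added; the printed values depend on the random draw, but the two lines should agree):

# mapped-back coefficients should match the direct fit
print(regr.coef_, regr.intercept_)
print(regr_centered.coef_ / std,
      regr_centered.intercept_ - regr_centered.coef_[0] * mu / std)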

In [3]:
## Learning a linear regression model without intercept

# coefficients of the linear regression model w/o scaling and centering,
# forcing the fitted line through the origin
regr = linear_model.LinearRegression(fit_intercept=False)
regr.fit(x.reshape(-1, 1), y)

# coefficients of the linear regression model with scaling and centering
std = 50
mu = 200
regr_centered = linear_model.LinearRegression(fit_intercept=False)
x_centered = (x - mu) / std
regr_centered.fit(x_centered.reshape(-1, 1), y)

# plot the data, the raw-feature fit (red), and the centered fit mapped back
# to the original feature scale (blue circles); the two lines now differ
# (with fit_intercept=False, intercept_ is 0.0)
x_list = np.linspace(460, 540, 10)
plt.plot(x, y, 'c.')
plt.plot(x_list, x_list * regr.coef_ + regr.intercept_, 'r')
plt.plot(x_list, x_list * (1 / std) * regr_centered.coef_ + regr_centered.intercept_ - regr_centered.coef_ * mu / std, 'b-o')
plt.show()
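
Conversely, a quick comparison I added makes the non-equivalence explicit: the mapped-back line has a different slope and a nonzero effective intercept, while the direct no-intercept fit is forced through the origin:

# the mapped-back line differs from the direct no-intercept fit
print(regr.coef_)                          # direct slope through the origin
print(regr_centered.coef_ / std)           # mapped-back slope: differs
print(-regr_centered.coef_[0] * mu / std)  # mapped-back effective intercept: nonzero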