Linear regression
👋 About the lesson
We start with one of the simplest models that can be used for regression analysis - linear regression (LR). Despite its simplicity, it introduces a lot of fundamental concepts which are used later on in more powerful models like neural networks. This lesson might seem too theoretical given that you can do LR in a few lines of code using sklearn, but the main focus of this course is on the actual models and how they work rather than on how to make them work using some library. It is okay to feel confused; certain things just need some time to digest, and they will come to you naturally as we iterate through the concepts over and over.
📓 Notes
Fundamental information about LR
Brief model description
In general, given some feature vector \(x\), we want to predict the target variable \(y\) using a linear combination of the input features (and their transformations). A transformation can simply be the square of a given feature, or the product of two features. Note that the model is still linear: it is linear in its parameters. However, you might also say that it is non-linear in its input features if you apply some non-linear transformations.
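To make this concrete, here is one possible model with transformed features (my own illustrative example): it is non-linear in \(x_1\) and \(x_2\), but still linear in the parameters \(\theta\):
\[h(x) = \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_1 x_2 + \theta_0\]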
Model mathematical definition
\[h(x) = \theta x + \theta_0 + \epsilon\]Here \(\theta\) represents a vector of all parameters and \(x\) is a feature vector. We must not forget the offset term \(\theta_0\) and the error term \(\epsilon\), which is a variable that captures all the variance that our model has not been able to account for. Note that this is a theoretical definition of the model, i.e., this model would be able to perfectly match the target variable \(y\). In practice, the following model (without the error term) is used to make predictions:
\[h(x) = \theta x + \theta_0\]
I discuss the relationship between the error term and residual variance in a separate section below.
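As a quick worked example (with made-up numbers): suppose \(\theta = (2, -1)\), \(\theta_0 = 0.5\) and \(x = (1, 3)\); then
\[h(x) = 2 \cdot 1 + (-1) \cdot 3 + 0.5 = -0.5\]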
How to train
Analytical solution. There exists an equation which you can use to compute unbiased estimates of the linear model’s parameters; it is called the normal equation:
\[\hat{\theta} = (X^T X)^{-1} X^T y\]
If this looks familiar, it should, since this is the way we were taught to estimate parameters in the linear algebra course. For a small feature space, this is probably the better solution. However, as the feature space grows, it gets expensive to compute the inverse. Here \(X\) is called the design matrix and its dimension is \(n \times p\), where \(n\) is the number of training samples and \(p\) is the number of features (including the bias term). The design matrix therefore looks as follows:
\[X = \begin{bmatrix} 1 & x_1^{(1)} & \dots & x_{p-1}^{(1)} \\ 1 & x_1^{(2)} & \dots & x_{p-1}^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \dots & x_{p-1}^{(n)} \end{bmatrix}\]
If we quickly go through the matrix multiplication steps (in the normal equation):
- \(p \times n\) @ \(n \times p\) –> \(p \times p\)
- \(p \times p\) @ \(p \times n\) –> \(p \times n\)
- \(p \times n\) @ \(n \times 1\) –> \(p \times 1\)
This yields a \(p \times 1\) vector with the coefficient estimates. We then use these to make predictions on future data.
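A minimal sketch of the analytical solution in numpy (the toy data and variable names are my own, not part of the lecture):

```python
import numpy as np

# toy data: n = 5 samples, one raw feature
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])

# design matrix: a leading column of ones for the bias term theta_0
X = np.column_stack([np.ones_like(x), x])      # shape (n, p) with p = 2

# normal equation: theta_hat = (X^T X)^{-1} X^T y
# (np.linalg.solve avoids explicitly inverting X^T X)
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # shape (p,)

# predictions: multiply the design matrix with the estimated parameters
y_pred = X @ theta_hat
print(theta_hat)  # roughly [0.23, 1.93] for this toy data
```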
Numerical solution. This has not been taught yet, so do not worry; it is here only if you are interested. Alternatively, you can use gradient descent to find the optimal parameters:
\[\begin{aligned} \theta_0 &:= \theta_0 - \alpha\frac{1}{n}\sum_{i=1}^{n}\left(h(x^{(i)}) - y^{(i)}\right) \\ \theta &:= \theta - \alpha\frac{1}{n}X^T(X\theta - y) \end{aligned}\]where \(h(x^{(i)})\) is the model's prediction for the \(i\)-th training sample.
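A minimal sketch of batch gradient descent, assuming \(X\) already contains the bias column of ones (so \(\theta_0\) is simply the first entry of theta and one vectorised update suffices); names like alpha, n_iters and lam are my own choices:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=1000, lam=0.0):
    """Batch gradient descent for linear regression.

    Assumes X already contains a leading column of ones for the bias term,
    so theta[0] plays the role of theta_0. Setting lam > 0 adds the L2
    penalty discussed just below (the bias term is left unpenalised here).
    """
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_iters):
        residuals = X @ theta - y          # (X theta - y)
        grad = X.T @ residuals / n         # 1/n * X^T (X theta - y)
        penalty = 2 * lam * theta / n      # L2 term
        penalty[0] = 0.0                   # do not regularise the bias
        theta -= alpha * (grad + penalty)
    return theta

# reusing the toy design matrix X and targets y from the previous snippet:
# theta_gd = gradient_descent(X, y, alpha=0.05, n_iters=5000)
```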
Note that you might also want to regularise the model against overfitting using L2 regularisation:
\[\theta := \theta - \alpha\frac{1}{n}\left[X^T(X\theta - y) + 2\lambda\theta\right]\]
Summary of the section
In this section I explained the following concepts:
- Linear regression model - a linear combination of coefficients with (transformed) features. Linear because there is no interaction between the coefficients. More on the transformations of features below.
- How to estimate coefficients - you have a model and training data, and your task is to estimate the model’s parameters (a.k.a. coefficients). In the lecture, you were taught the analytical way using the normal equation (a plug-and-play formula). In the future, we will talk about the numerical way, which uses gradient descent.
- How to make predictions - simply multiply the input matrix with the estimated parameter vector (see the short sklearn sketch below).
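As a sanity check, here is a short sketch of the same fit done via sklearn (the library route mentioned at the start); it should recover roughly the same estimates as the normal-equation snippet above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# same toy data as before, but without the column of ones:
# sklearn estimates the intercept (theta_0) itself by default
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])

model = LinearRegression()
model.fit(x, y)

print(model.intercept_, model.coef_)      # roughly 0.23 and [1.93]
print(model.predict(np.array([[6.0]])))   # prediction for a new sample
```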
More on the features (and their transformation)
From simple to complex LR model
Let’s start with the simplest linear regression model, which has just one feature and one response variable:
\[Y = X \theta_1 + \theta_0\]This is nice, but in the real world we have datasets with multiple features, therefore to make it more general we can write:
\[Y = X \theta + \theta_0\]So far, we have varied the number of features. Let’s step it up and vary the type of features, i.e., apart from continuous ones, we will now also have categorical ones. For simplicity, I will just use one continuous feature and one categorical feature with two levels. This means:
\[Y = X_1 \theta_1 + X_2 \theta_2 + \theta_0\]where \(X_2\) is the categorical one (taking on binary values). It is interesting to think about how the model will look for samples where \(X_2 = 1\): we add \(\theta_2\) to the output. In practice, this shifts the line (the representation of the model) either up or down. In the other case, where \(X_2 = 0\), we do not add anything. This means that our lines will be parallel, with the same slope. This class of LR models is often referred to as parallel slopes models. Notice one important implication: if we want to investigate how a change in \(X_1\) impacts \(Y\), we get:
\[\frac{\partial Y}{\partial X_1} = \theta_1\]
This means that the rate of change depends solely on the coefficient associated with the continuous feature. But what if we wanted to also incorporate the presence of the categorical variable? We can add a new term called the interaction term:
\[Y = X_1 \theta_1 + X_2 \theta_2 + X_1 X_2 \theta_3 + \theta_0\]
Let’s differentiate again and see what we get:
\[\frac{\partial Y}{\partial X_1} = \theta_1 + X_2 \theta_3\]It’s no longer dependent only on \(\theta_1\), but also on the actual value of \(X_2\), which is then weighted by \(\theta_3\). As a result, if we were to plot it, the lines would no longer be parallel, since they now have different slopes. So far, if we were to visualize the model, we would get some linear construct (line, plane, hyperplane), since there are no non-linear transformations. Therefore, to achieve a non-linear shape of the model, we could for instance use some non-linear transformation of one of the continuous features:
\[Y = X_1 \theta_1 + X_2 \theta_2 + X_1 X_2 \theta_3 + t(X_1) \theta_4 + \theta_0\]where \(t(x)\) could for instance be the square of the input, etc.
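To tie this together, here is a small self-contained sketch (toy data and coefficient values are made up by me) showing that a model with a categorical feature, an interaction term, and a squared term is still fitted exactly as before, because it remains linear in the parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: one continuous feature x1 and one binary (categorical) feature x2
n = 100
x1 = rng.uniform(0, 10, size=n)
x2 = rng.integers(0, 2, size=n).astype(float)

# simulate the target from made-up "true" coefficients plus noise
y = 1.5 * x1 - 2.0 * x2 + 0.8 * x1 * x2 + 0.3 * x1**2 + 1.0 + rng.normal(0, 1, n)

# design matrix: bias, x1, x2, interaction x1*x2, and the transformation t(x1) = x1^2
X = np.column_stack([np.ones(n), x1, x2, x1 * x2, x1**2])

# still linear in the parameters, so the normal equation applies unchanged
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)  # roughly [1.0, 1.5, -2.0, 0.8, 0.3], up to noise
```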
In summary
This section was supposed to show you that LR is more than just a simple multiplication of coefficients and features. In addition, it should give you an intuition about how the interaction between continuous and categorical features influences the behaviour of the model for different groups of samples.
More on estimation of parameters
To be added.
Inspecting model(s) (Hypothesis testing, residuals, Q-Q plot)
To be added.