Linear regression solutions

Supervised vs Unsupervised problems

Exercise 1. For each of the following problems, determine whether the problem is supervised or unsupervised.

(a) Given detailed phone usage from many people, find interesting groups of people with similar behaviour.

(b) Given detailed phone usage of many users along with their historic churn, predict if people are going to change contracts again.

(c) Given expression measurements of 1000s of genes for 100s of patients along with a binary variable indicating presence or absence of a specific cancer, predict if the cancer is present for a new patient.

(d) Given expression measurements of 1000s of genes for 100s of patients, find groups of functionally similar genes.

Solution

(a) Since we have no response variable \(y\) to train on, this is unsupervised.

(b) In this case, our features might be, for example, the number of calls or the amount of data used in the last month, and we also have access to the target variable: whether the given customer ended their subscription to the service or not. Therefore, this is supervised.

(c) Again, we have access to the binary target variable, therefore supervised.

(d) In this case, there is no response variable; instead we are searching for groups (i.e. genes with similar expression patterns), therefore this is unsupervised.


Classification vs Regression

Exercise 2. For data with each of the following outcome variables, determine whether the problem is suitable for classification or regression:

(a) Presence or absence of cancer.

(b) Favourite fruits

(c) Annual income in kroner

(d) Income bracket

Solution

(a) This is an example of binary classification.

(b) This is multi-class classification.

(c) This is regression since we are predicting a continuous value.

(d) Assuming there is a limited number of brackets, this is again multi-class classification.


Linear regression models

Imagine we have a dataset with two features, \(x_1\) and \(x_2\), that are numerical (real-valued) variables. Consider the following models:

(a) \(Y=\beta_0+\beta_1 x_1+\epsilon\)

(b) \(Y=\beta_0+\beta_1 x_1+\beta_2 x_2+\epsilon\)

(c) \(Y=\beta_0+\beta_1 x_1+\beta_2 x_2+\beta_3\left(x_1 x_2\right)+\epsilon\)

(d) \(Y=\beta_0+\beta_1 x_1+\beta_2 x_2+\beta_3\left(x_1 x_2\right)+\beta_4 x_1^2+\beta_5 x_2^2+\epsilon\)

Exercise 3. Explain how a unit change in \(x_1\) would affect \(Y\) (leaving \(x_2\) unchanged) in each model (a-d).

Exercise 4. Make a sketch of the functional relationship between \(Y\) and the two features in each of the models (a-d).

Solution (3 & 4). To find how a change in \(x_1\) affects the output, we can simply differentiate \(Y\) with respect to \(x_1\) (a short symbolic check in Python follows the list). Therefore:

(a) \(\frac{dY}{dx_1} = \beta_1\) - in this case, the rate of change is constant, which makes sense given that the model is just a simple line.

(b) \(\frac{\partial Y}{\partial x_1} = \beta_1\) - this might look very similar to (a), but notice that this is now a partial derivative. Why does it make sense that we again get \(\beta_1\)? Try the Geogebra applet: for the function, you can for instance write 1 + 0.5x + 2y, which should give you a nice plane. Then tick y as constant. You should then see two planes intersect, i.e., you should see a line. One plane is the model, the other represents the fixed \(x_2\). This is why we obtain exactly the same result as in (a): we are again looking at a line that grows/decreases at a constant rate. If you still feel a bit unsure, I suggest you watch this short video on how to interpret partial derivatives. Overall, the answer is that changing \(x_1\) again affects \(Y\) in a constant manner.

(c) \(\frac{\partial Y}{\partial x_1} = \beta_1 + \beta_3 x_2\) - now it gets more interesting, since the partial derivative is no longer constant but depends on \(x_2\). Again, if you enter the new model into the applet above, you should see why. First of all, our model no longer looks like a plane but a bit more complex, yet it is still linear (linear in its coefficients, although non-linear from the perspective of its features). Try sliding the \(y\) value (in our case this is \(x_2\)); you should see how the intersection line changes. In other words, when we want to know how the output changes as we change \(x_1\), we also have to take into account where we are in terms of \(x_2\). The rate of change is therefore linear in \(x_2\), as can also be seen directly from the partial derivative.

(d) \(\frac{\partial Y}{\partial x_1} = \beta_1 + \beta_3 x_2 + 2\beta_4 x_1\) - this is the most complex model, which can also be seen in the Geogebra applet. In this case, not only does \(x_2\) determine the rate of change, we also have to consider the position on \(x_1\). This makes sense, since the intersection is no longer a line but a parabola, which grows/decreases at different rates depending on \(x_1\). Again, as the partial derivative shows, the rate of change of \(Y\) with respect to \(x_1\) depends linearly on both features.
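If you want to double-check these derivatives, here is a minimal sketch using sympy (the symbol names are made up for this illustration); the noise term \(\epsilon\) is left out since it does not depend on \(x_1\):

```python
import sympy as sp

# symbols for the two features and the coefficients beta0 ... beta5
x1, x2 = sp.symbols("x1 x2")
b0, b1, b2, b3, b4, b5 = sp.symbols("beta0:6")

# the four models (epsilon omitted, it does not depend on x1)
models = {
    "(a)": b0 + b1 * x1,
    "(b)": b0 + b1 * x1 + b2 * x2,
    "(c)": b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2,
    "(d)": b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2 + b4 * x1**2 + b5 * x2**2,
}

# partial derivative of Y with respect to x1 for each model;
# expected results (terms may print in a different order):
# (a) beta1, (b) beta1, (c) beta1 + beta3*x2, (d) beta1 + beta3*x2 + 2*beta4*x1
for name, Y in models.items():
    print(name, sp.diff(Y, x1))
```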

I believe the main takeaway from this exercise is that the more complex the model, the more the effect of a small change in a given feature depends on the values of the features themselves. You could see that for the models represented by a plane, the rate of change was always constant, whereas for the more complex models, the rate of change was linear in the input features.
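For the sketches in Exercise 4, you can also plot the surfaces directly in Python instead of using the Geogebra applet. Below is a minimal sketch with arbitrarily chosen coefficient values (not taken from any fitted model): model (b) gives a flat plane, while model (d) gives a curved surface.

```python
import numpy as np
import matplotlib.pyplot as plt

# grid of feature values
x1, x2 = np.meshgrid(np.linspace(-3, 3, 50), np.linspace(-3, 3, 50))

# arbitrary coefficients, chosen only so that the shapes are easy to see
b0, b1, b2, b3, b4, b5 = 1.0, 0.5, 2.0, 1.5, 0.8, -0.6

surfaces = {
    "(b) plane": b0 + b1 * x1 + b2 * x2,
    "(d) curved surface": (b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2
                           + b4 * x1**2 + b5 * x2**2),
}

# side-by-side 3D surface plots of the two models
fig = plt.figure(figsize=(10, 4))
for i, (title, Y) in enumerate(surfaces.items(), start=1):
    ax = fig.add_subplot(1, 2, i, projection="3d")
    ax.plot_surface(x1, x2, Y, cmap="viridis")
    ax.set_title(title)
    ax.set_xlabel("x1")
    ax.set_ylabel("x2")
    ax.set_zlabel("Y")
plt.tight_layout()
plt.show()
```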

Exercise 5. Explain how the design matrix would look in each model.

Solution

We know that the design matrix has \(n\) rows, one per sample, and \(p + 1\) columns, where \(p\) denotes the number of model terms (including any interaction or squared terms). The extra column is a column of ones that accounts for the intercept (bias) term. Therefore, the shape of the design matrix would be (a small numpy illustration follows the list):

(a) \(n \times 2\)

(b) \(n \times 3\)

(c) \(n \times 4\)

(d) \(n \times 6\)
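As a concrete illustration (with a small, made-up dataset), the design matrices can be built by stacking the relevant columns; the interaction and squared terms simply become additional columns:

```python
import numpy as np

# a tiny made-up dataset with n = 4 samples
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([0.5, 1.5, 2.5, 3.5])
ones = np.ones_like(x1)  # column of ones for the intercept term

X_a = np.column_stack([ones, x1])                             # n x 2
X_b = np.column_stack([ones, x1, x2])                         # n x 3
X_c = np.column_stack([ones, x1, x2, x1 * x2])                # n x 4
X_d = np.column_stack([ones, x1, x2, x1 * x2, x1**2, x2**2])  # n x 6

for name, X in [("(a)", X_a), ("(b)", X_b), ("(c)", X_c), ("(d)", X_d)]:
    print(name, X.shape)
# (a) (4, 2)
# (b) (4, 3)
# (c) (4, 4)
# (d) (4, 6)
```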

Introduce now a third, categorical feature \(C\) with two levels yes/no (shown here for the first model; the other models would be extended analogously):

\[Y=\beta_0+\beta_1 x_1+\beta_2 C_{\mathrm{yes}}+\epsilon\]

Exercise 7. Can you sketch (or explain) the change to the relationship between \(Y\) and \(x_1\) and \(x_2\) if you introduce interactions between \(C\) and \(x_1\) and between \(C\) and \(x_2\) in each of the four models (a-d)?

Solution

In practice, the dummy variable \(C_{\mathrm{yes}}\) adds the constant \(\beta_2\) to the prediction whenever the answer is yes; in other words, the prediction for the "yes" group is offset by this constant. In Python, you could emulate this with a simple if-else statement that adds the constant or not depending on the answer, and the value of the constant is determined in the training phase since it is one of the model's parameters. Once you additionally introduce interactions between \(C\) and \(x_1\) and between \(C\) and \(x_2\), the level of \(C\) no longer just shifts the prediction but also changes the slopes: effectively, you fit a separate line/surface for each level of \(C\), differing in both intercept and slope.
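Here is a minimal sketch of how this looks in code using the statsmodels formula interface (the data frame and the column name group, standing in for \(C\), are made up for illustration): the main-effect model only shifts the intercept for the "yes" group, while the interaction model also lets the slopes of \(x_1\) and \(x_2\) differ between the two groups.

```python
import pandas as pd
import statsmodels.formula.api as smf

# made-up data; the "group" column plays the role of the yes/no feature C
df = pd.DataFrame({
    "y":     [1.2, 3.4, 2.1, 5.6, 4.3, 6.1, 3.9, 7.0],
    "x1":    [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0],
    "x2":    [1.0, 0.8, 1.3, 1.9, 2.2, 2.7, 3.1, 3.4],
    "group": ["yes", "no", "yes", "no", "yes", "no", "yes", "no"],
})

# main effect only: the dummy variable shifts the intercept for the "yes" group
fit_main = smf.ols("y ~ x1 + x2 + C(group)", data=df).fit()

# interactions: the level of the factor also changes the slopes of x1 and x2
fit_inter = smf.ols("y ~ x1 + x2 + C(group) + C(group):x1 + C(group):x2",
                    data=df).fit()

print(fit_main.params)
print(fit_inter.params)
```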


Linear regression models in Python

Now familiarise yourself with building linear regression models in Python. Consider for this exercise the following three ways of fitting a linear regression:

  • The OLS method from statsmodels.api

  • The ols method from statsmodels.formula.api. This method allows us to specify models "R-style" rather than via the design matrix (see e.g. https://www.statsmodels.org/dev/example_formulas.html). Note that, in the formula, you can specify that \(X\) is a factor as \(C(X)\). Also note that anything enclosed in "the identity function" \(I()\) is taken literally; for instance, \(I(x1 * x2)\) gives a new variable with the numeric product of the variables \(x_1\) and \(x_2\).

  • The LinearRegression method from sklearn.linear_model.

You are encouraged to try all three methods. When you read the manual pages, note that outcome \(Y\) is referred to as the endogenous variable and features \(X\) as the exogenous variable(s).
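Below is a minimal sketch of all three approaches on the same simulated data (the coefficients used to generate the data are arbitrary); the formula version also shows the "R-style" syntax with \(I()\) for the literal product:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

# simulate a small dataset (the coefficients 1, 0.5 and 2 are arbitrary)
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
df["y"] = 1 + 0.5 * df["x1"] + 2 * df["x2"] + rng.normal(scale=0.1, size=50)

# 1) statsmodels.api.OLS: build the design matrix yourself
X = sm.add_constant(df[["x1", "x2"]])  # adds the column of ones
fit_ols = sm.OLS(df["y"], X).fit()     # endog = y, exog = X
print(fit_ols.params)

# 2) statsmodels.formula.api.ols: specify the model "R-style"
fit_formula = smf.ols("y ~ x1 + x2 + I(x1 * x2)", data=df).fit()
print(fit_formula.params)

# 3) sklearn.linear_model.LinearRegression: the intercept is handled by the estimator
fit_skl = LinearRegression().fit(df[["x1", "x2"]], df["y"])
print(fit_skl.intercept_, fit_skl.coef_)
```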

Solution

Solutions to the rest of the exercises can be found in this notebook. (You should be able to add comments in the notebook, so feel free to use it to ask questions.)