Partial derivatives
👋 About the lesson
Welcome to the lecture about partial derivatives. This might sound like a completely new concept, but as you will see, it is not. Without going into specifics yet, I would like to emphasize that knowledge of partial derivatives is especially important for machine learning. Therefore, it is worth spending some time on this topic. Without further ado, let’s get into it!
📓 Notes
What is a partial derivative
In previous lectures, we have talked about two kinds of functions (mappings):
- single input and single output: \(R \rightarrow R\)
- single input and multiple outputs: \(R \rightarrow R^m\)
In this lecture, we will talk about functions with multiple inputs and a single output: \(R^m \rightarrow R\).
To differentiate such functions, we need to use partial differentiation. I will explain the concept on a simple example. Assume we are given the following multivariate function \(f(x, y)\):
\[f(x, y) = x^2 + 2y^3\]Since the function depends on two parameters, we need to differentiate the function \(f\) with respect to each of the parameters separately. This means that the partial derivative with respect to:
- \(x\) is \(f_x(x, y) = 2x\)
- \(y\) is \(f_y(x, y) = 6y^2\)
Therefore, as a result we have obtained two equations, each describing the partial derivative of \(f\) with respect to one of the parameters. When doing partial differentiation, we focus on the parameter at hand and treat all other parameters as constants.
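If you would like to double-check partial derivatives like these on a computer, here is a minimal sketch using the Python library sympy (my choice of tool, not something prescribed by the course). It differentiates the example function above with respect to each variable in turn:

```python
# A minimal sketch: verify the partial derivatives of f(x, y) = x^2 + 2y^3 with sympy.
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 + 2*y**3          # the example function f(x, y)

f_x = sp.diff(f, x)        # differentiate with respect to x, treating y as a constant
f_y = sp.diff(f, y)        # differentiate with respect to y, treating x as a constant

print(f_x)  # 2*x
print(f_y)  # 6*y**2
```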
Interpretation of the partial derivatives
Recall from lecture 10 that when we compute the derivative of a function \(f(x)\), it describes the slope of the tangent line at a given point \(x\). The slope is, in other words, a rate of change. More specifically, it tells us how much we can expect the dependent variable \(y\) to change if we change the input variable \(x\).
In the case of a multivariate function, each partial derivative describes the rate of change with respect to one variable. For instance, given a function \(f(x, y)\) and one of its partial derivatives \(f_x(x, y)\), we know that \(f_x(x, y)\) describes the change in \(f\) if we make a tiny change along the dimension of \(x\).
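To make “a tiny change along the dimension of \(x\)” concrete, here is a small numerical sketch in plain Python (the point \((1.5, -0.5)\) and the step size are hypothetical values I picked for illustration). Nudging only \(x\) while keeping \(y\) fixed reproduces \(f_x(x, y) = 2x\):

```python
# A minimal sketch: approximate f_x at a point by nudging only x and keeping y fixed.
def f(x, y):
    return x**2 + 2 * y**3

x0, y0, h = 1.5, -0.5, 1e-6

finite_diff = (f(x0 + h, y0) - f(x0, y0)) / h   # rate of change along x only
analytical = 2 * x0                             # f_x(x, y) = 2x

print(finite_diff)  # ~3.000001
print(analytical)   # 3.0
```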
In the next lecture, we will be talking about the gradient vector. The gradient vector of a function \(f(x, y)\) is:
\[\nabla f(x, y) = \begin{bmatrix} f_x(x, y) \\ f_y(x, y) \end{bmatrix}\]This vector points in the direction of the steepest growth of the function \(f\) at a particular point \((x, y)\). We will talk about the significance of this vector more in the next lecture, stay tuned!
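As a small preview of the next lecture, the sketch below (again assuming sympy) stacks the two partial derivatives into the gradient vector and evaluates it at a hypothetical point \((1, 2)\):

```python
# A minimal sketch: build the gradient vector from the two partial derivatives.
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 + 2*y**3

grad_f = sp.Matrix([sp.diff(f, x), sp.diff(f, y)])   # [f_x, f_y] as a column vector
print(grad_f)                                        # Matrix([[2*x], [6*y**2]])

# Evaluate at the point (x, y) = (1, 2): the direction of steepest growth there.
print(grad_f.subs({x: 1, y: 2}))                     # Matrix([[2], [24]])
```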
Generalised chain rule
So now you should know how to differentiate all kinds of functions. However, so far we have mostly talked about functions that are ‘flat’, i.e., there is no nesting.
Okay, I know this sounds super vague. Let me give you a concrete example of a non-flat function:
\[f(x, y) = x + y\]where \(x(t) = t^2\) and \(y(t) = t\). How do you differentiate such functions? Quite simply, we first come up with a variable-dependency graph:
  f
 / \
x   y
 \ /
  t
Then, we can clearly see that \(f\) depends on \(t\). For this reason, to find \(\frac{df}{dt}\), we can write:
\[\frac{df}{dt} = \frac{\partial f}{\partial x} \frac{dx}{dt} + \frac{\partial f}{\partial y} \frac{dy}{dt}\]How did I use the dependency graph? I found all paths from \(f\) to \(t\):
- f - x, x - t
- f - y, y - t
Then, for each path, I defined the edge weights as the derivative of the start of the edge with respect to its end. Notice that I chose partial derivatives for \(f\) and full derivatives for \(x\) and \(y\). This is because \(f\) depends on two variables, while \(x\) and \(y\) each depend only on \(t\). For each path, I then take the product of the edge weights, and finally I sum the products over all paths.
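To see the path-summing rule in action, here is a sketch (sympy assumed again) that computes \(\frac{df}{dt}\) with the formula above and checks it against substituting \(x(t)\) and \(y(t)\) into \(f\) first and then differentiating directly:

```python
# A minimal sketch: generalised chain rule for f(x, y) = x + y with x(t) = t^2, y(t) = t.
import sympy as sp

t = sp.symbols('t')
x, y = sp.symbols('x y')

f = x + y
x_of_t = t**2
y_of_t = t

# Path f - x - t and path f - y - t: product of edge weights, then sum over paths.
df_dt_chain = (sp.diff(f, x) * sp.diff(x_of_t, t)
               + sp.diff(f, y) * sp.diff(y_of_t, t))
df_dt_chain = df_dt_chain.subs({x: x_of_t, y: y_of_t})

# Direct check: substitute first, then differentiate.
df_dt_direct = sp.diff(f.subs({x: x_of_t, y: y_of_t}), t)

print(sp.simplify(df_dt_chain))   # 2*t + 1
print(sp.simplify(df_dt_direct))  # 2*t + 1
```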
You can try this approach on something more complex. Just as a reminder from the intro week of calculus, where we talked about the chain rule in the context of functions like \(f(g(x)) = (x + 1)^2\): the principle of course also applies here. The dependence is \(f \rightarrow g \rightarrow x\). Therefore:
\[\frac{df}{dx} = \frac{df}{dg}\frac{dg}{dx}\]In words, multiply the derivative of the outer function by the derivative of the inner function. To practice this concept, see this week’s exercises.
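The same kind of sketch works for the nested single-variable case \(f(g(x)) = (x + 1)^2\), with the inner function taken as \(g(x) = x + 1\) (sympy assumed):

```python
# A minimal sketch: chain rule for f(g(x)) = (x + 1)^2.
import sympy as sp

x, g = sp.symbols('x g')

f_of_g = g**2          # outer function, written in terms of g
g_of_x = x + 1         # inner function

# df/dx = (df/dg) * (dg/dx), then express everything in terms of x.
df_dx = (sp.diff(f_of_g, g) * sp.diff(g_of_x, x)).subs(g, g_of_x)

print(sp.expand(df_dx))                    # 2*x + 2
print(sp.expand(sp.diff((x + 1)**2, x)))   # 2*x + 2, direct check
```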
Approximating \(f(x, y)\) via tangent plane
Assume we have the following single-input function \(f(x) = x^2\). Our goal is to find its approximation around a given point \(x\). This approximation can be found through the tangent line that goes through this point and is tangent to the function \(f\). This tangent line has the general form \(t(x) = ax + b\).
We know that to find \(a\) (the slope), we just evaluate \(f'(x) = 2x\) at the given point \(x\). Let’s be more specific and say we want to find the tangent line at \(x = 2\). Therefore, the tangent line has the form \(t(x) = 4x + b\). How do we get \(b\)? This is actually also fairly easy: we know that the tangent line goes through the point \((2, f(2)) = (2, 4)\). Therefore, if we plug this into the equation for the tangent line: \(4 = 8 + b \Rightarrow b = -4\). Therefore, the tangent line has the form:
\[t(x) = 4x - 4\]To formalize this procedure, the tangent line is:
\[t(x)=f(a)+f^{\prime}(a)(x-a)\]where \(a\) is the point around which we want to approximate the function \(f\).
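As a quick sanity check on the worked example, here is a sketch in plain Python that builds \(t(x) = f(a) + f'(a)(x - a)\) for \(f(x) = x^2\) at \(a = 2\) (the sample points near \(a\) are just illustrative):

```python
# A minimal sketch: tangent line of f(x) = x^2 at a = 2.
def f(x):
    return x**2

def f_prime(x):
    return 2 * x

a = 2

def tangent(x):
    # t(x) = f(a) + f'(a) * (x - a) = 4 + 4*(x - 2) = 4x - 4
    return f(a) + f_prime(a) * (x - a)

for x in [1.9, 2.0, 2.1]:
    print(x, f(x), tangent(x))   # the tangent line matches f closely near a = 2
```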
Now, you might be wondering why I went through all that math. Well, because if we now add one more input parameter to our function, i.e., \(f(x, y)\), we can approximate this multivariate function around a given point \((a, b)\) as follows:
\[t(x, y) = f(a, b) + f_x(a, b)(x-a) + f_y(a, b)(y-b)\]Yes, this results in the equation of a plane. In this case, we specifically refer to this plane as the tangent plane. Check out this week’s exercises for examples using this formula.
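And here is the two-input analogue, again in plain Python, for the running example \(f(x, y) = x^2 + 2y^3\) at the hypothetical point \((a, b) = (1, 1)\):

```python
# A minimal sketch: tangent plane of f(x, y) = x^2 + 2y^3 at (a, b) = (1, 1).
def f(x, y):
    return x**2 + 2 * y**3

def f_x(x, y):
    return 2 * x

def f_y(x, y):
    return 6 * y**2

a, b = 1.0, 1.0

def tangent_plane(x, y):
    # t(x, y) = f(a, b) + f_x(a, b)(x - a) + f_y(a, b)(y - b)
    return f(a, b) + f_x(a, b) * (x - a) + f_y(a, b) * (y - b)

# Near (a, b) the plane is a good approximation of f.
print(f(1.05, 0.95), tangent_plane(1.05, 0.95))   # ~2.817 vs 2.8
```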
⛳️ Learning goals checklist
After this week’s exercise, you should be able to:
- compute the partial derivatives of functions of many variables
- compute the formula for the tangent plane of a given function at a given point
- use the generalized chain rule to differentiate some more complex functions
This is it for this week. See you next week, when we will talk about the gradient, another essential topic for machine learning!