The gradient
👋 About the lesson
This week, we will first learn about the gradient vector
(how to compute it and how to interpret it). As a next step, we will learn how to use it to optimize functions under certain constraints.
Please note that this exercise might feel more challenging, and this is fine since the material is generally perceived as challenging by students. Therefore, I suggest you first focus on knowing how to compute things, and then also try to better understand the intuition behind the computations.
📓 Notes
Gradient of a function
The gradient of a function \(f(x, y)\) is defined as:
\[\nabla f(x, y) = \langle f_x(x, y), f_y(x, y) \rangle\]In words, the gradient of a function that can be visualized in 3D is a two-dimensional vector where each component is the partial derivative of the function \(f\) with respect to one of its parameters. Therefore, as far as the computation goes, it is fairly straightforward.
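To make the computation concrete, here is a minimal sketch in plain Python (not part of the exercises) using the same function \(f(x, y) = x^2 + y^2\) as the applet below: we hand-compute the gradient and sanity-check it with central finite differences.

```python
# Verify the gradient of f(x, y) = x^2 + y^2 numerically.
# The partial derivatives are f_x = 2x and f_y = 2y, so
# grad f(x, y) = <2x, 2y>.

def f(x, y):
    return x**2 + y**2

def grad_f(x, y):
    """Analytic gradient <f_x, f_y>, computed by hand."""
    return (2 * x, 2 * y)

def numeric_grad(f, x, y, h=1e-6):
    """Central-difference approximation of the gradient."""
    fx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    fy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return (fx, fy)

print(grad_f(1.0, 2.0))          # (2.0, 4.0)
print(numeric_grad(f, 1.0, 2.0))  # approximately (2.0, 4.0)
```

The same pattern (one partial derivative per input variable) extends directly to functions of three or more variables.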
Now, let’s discuss how to actually interpret the gradient vector. For this, I will use the following GeoGebra applet. For now, I want you to focus on the following elements of the applet (right half):
- function \(f\) defined as \(f(x, y) = x^2 + y^2\) (red color)
- gradient vector \(\nabla f(x, y)\) (green color)
Try to manipulate the position of the point \(P(x_0, y_0)\) via the sliders in the lower left corner and watch how it influences the gradient vector. You should observe that no matter where we place the point \(P\), the gradient vector always points in the direction of the steepest ascent of the function \(f\). Therefore, if we were to move the point \(P\) along the direction of the gradient vector, then the value of the function \(f\) at this new point \(P\) would be the highest compared to all other directions that we could take. If we multiply this vector by minus one, then it denotes the step in the direction of the steepest descent.
In addition, the gradient vector has a couple more useful properties:
- The magnitude of the gradient vector \(\|\nabla f(x, y)\|\) can be interpreted as the rate of change in the direction pointed to by the gradient vector. Mathematically speaking:
\[\|\nabla f(x, y)\| = \lim_{h \to 0} \frac{f(P + h\,u) - f(P)}{h}, \quad u = \frac{\nabla f(x, y)}{\|\nabla f(x, y)\|}\]Here, the tiny step is \(h\): we evaluate the ratio between the difference of the function \(f\) at two points distanced by \(h\) and the step \(h\) itself. Thus, the higher the rate of change, the faster the function grows in the given direction.
- The gradient vector is always perpendicular to the level curves of the function \(f\) (this property is essential for approximating the given function around a given point; we will talk about it in a later section).
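Both properties can be checked numerically. The sketch below (plain Python, using the applet's function \(f(x, y) = x^2 + y^2\)) compares the gradient magnitude against the rate of change along the gradient direction, and checks perpendicularity to the level curve (a circle, whose tangent direction at \((x_0, y_0)\) is \((-y_0, x_0)\)).

```python
import math

# For f(x, y) = x^2 + y^2, check two properties of the gradient at a point:
#  1. its magnitude equals the rate of change in the gradient direction,
#  2. it is perpendicular to the level curve through the point.

def f(x, y):
    return x**2 + y**2

def grad_f(x, y):
    return (2 * x, 2 * y)

x0, y0 = 1.0, 2.0
gx, gy = grad_f(x0, y0)
mag = math.hypot(gx, gy)  # ||grad f|| = sqrt(20) ~ 4.472

# Rate of change along the unit gradient direction, via a tiny step h.
h = 1e-6
ux, uy = gx / mag, gy / mag
rate = (f(x0 + h * ux, y0 + h * uy) - f(x0, y0)) / h
print(mag, rate)  # both approximately 4.472

# The level curve through (x0, y0) is the circle x^2 + y^2 = 5;
# its tangent direction there is (-y0, x0). Dot with gradient = 0.
tangent = (-y0, x0)
print(gx * tangent[0] + gy * tangent[1])  # 0.0
```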
Directional derivative
Now, you might also be wondering how all of this relates to directional derivatives. Simply speaking, a directional derivative is a number, while the gradient is a vector. Mathematically speaking, the relationship between the two can be expressed as follows:
\[\nabla_u f(x, y) = \nabla f(x, y) \cdot u\]where \(u\) is a unit vector denoting the direction in which we take the step at point \(P(x, y)\). Notice that there are many possible directions that we can take. In fact, you might be asked to compute the directional derivative of a function \(f\) for a given direction \(u\) (based on the above formula, this should be easy). But, very importantly, notice that if the direction of \(u\) is the same as the direction of \(\nabla f(x, y)\), then the dot product is maximized. The dot product of these two vectors gives us the directional derivative, which can be interpreted as the rate of growth of the function \(f\) in a particular direction. Thus, if \(u\) points in the same direction as \(\nabla f(x, y)\), the rate of growth is maximized: steepest ascent.
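A quick numerical illustration of this maximization claim (plain Python, again with \(f(x, y) = x^2 + y^2\)): sampling many unit directions shows the directional derivative never exceeds \(\|\nabla f\|\), and is largest in the gradient direction.

```python
import math

# Directional derivative of f(x, y) = x^2 + y^2 at P = (1, 2):
# D_u f = grad f . u, where u must be a UNIT vector.

def grad_f(x, y):
    return (2 * x, 2 * y)

def directional_derivative(x, y, u):
    gx, gy = grad_f(x, y)
    return gx * u[0] + gy * u[1]

x0, y0 = 1.0, 2.0
gx, gy = grad_f(x0, y0)
mag = math.hypot(gx, gy)  # sqrt(20) ~ 4.472

# Sweep unit directions in 1-degree steps; the maximum occurs
# (approximately) in the gradient direction and equals ||grad f||.
best = max(
    directional_derivative(x0, y0, (math.cos(i * math.pi / 180),
                                    math.sin(i * math.pi / 180)))
    for i in range(360)
)
print(best, mag)  # best is close to mag ~ 4.472
```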
Computing value of tangent line/plane to a level curve/surface
Before diving into the explanation, you first need to properly understand what a level curve is. The same principle then applies to level surfaces. For instance, for the function \(f(x, y)\), a level curve is, mathematically speaking, a set of points:
\[S = \{(x, y) \mid f(x, y) = c\}\]where \(c\) is some constant. In words, if we plug any of the points from \(S\) into our function \(f(x, y)\), then it will always return the same number, i.e., \(c\). For this reason, if we can draw \(f(x, y)\) in 3D, then the level curve can be displayed in just 2D.
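As a tiny sanity check (plain Python): for \(f(x, y) = x^2 + y^2\) and \(c = 4\), the level curve is the circle of radius 2, and every point on it maps back to the same value \(c\).

```python
import math

# Level curve of f(x, y) = x^2 + y^2 for c = 4: the circle of radius 2.
# Every point on the circle maps to the same value c.

def f(x, y):
    return x**2 + y**2

c = 4.0
r = math.sqrt(c)
# Sample a few angles on the circle and evaluate f there.
values = [f(r * math.cos(t), r * math.sin(t)) for t in (0.0, 0.7, 1.9, 3.1)]
print(values)  # all approximately 4.0
```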
Finally, I will again use the Geogebra applet to explain how this all relates to computing value of the tangent line. I want you to focus on the following:
- two level curves (blue)
- gradient vector (green)
Try to move the point \(P(x_0, y_0)\) so that it lies on one of the curves. You can clearly see that if we were to draw a tangent line through the given point, then any vector that is part of that tangent line would be perpendicular to the gradient vector. Let me define any vector which is part of the tangent line that goes through point \(P(x_0, y_0)\) as:
\[u = \langle x - x_0, y - y_0 \rangle\]Since \(u\) is perpendicular to \(\nabla f(x_0, y_0)\), then:
\[u \cdot \nabla f(x_0, y_0) = 0\]Given some concrete values, this equation yields the formula for the tangent line to the level curve \(f(x, y) = c\) at the given point \(P(x_0, y_0)\). To see a concrete example, see the first exercise in this week’s exercises.
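Here is one such concrete example worked out (my own illustration, not the exercise itself): for \(f(x, y) = x^2 + y^2\) at \(P = (1, 2)\), which lies on the level curve \(f(x, y) = 5\), we have \(\nabla f(1, 2) = \langle 2, 4 \rangle\), and \(u \cdot \nabla f = 0\) gives \(2(x - 1) + 4(y - 2) = 0\), i.e., \(x + 2y = 5\).

```python
# Tangent line to the level curve x^2 + y^2 = 5 at P = (1, 2).
# grad f(1, 2) = <2, 4> and u = <x - 1, y - 2>, so u . grad f = 0 gives
#   2(x - 1) + 4(y - 2) = 0  =>  x + 2y = 5.

def on_tangent_line(x, y):
    return abs(x + 2 * y - 5) < 1e-12

# P itself lies on the line, and the direction along the line, (2, -1),
# is perpendicular to the gradient (2, 4): 2*2 + (-1)*4 = 0.
print(on_tangent_line(1.0, 2.0))  # True
print(on_tangent_line(3.0, 1.0))  # True: this is P + (2, -1)
print(2 * 2 + (-1) * 4)           # 0
```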
Optimization of multivariate functions WITHOUT constraints
Recall that for functions with a single variable as input, such as \(f(x)\), we would simply find \(f'(x)\) and solve the following equation:
\[f'(x) = 0\]in order to find all \(x\) for which the function \(f\) might reach a (local) maximum or minimum. We then use the first derivative test
to determine whether the given point is a max, a min, or neither (see lecture 10).
Not much changes when it comes to multivariate functions. For instance, for \(f(x, y)\), we solve the system:
\[\begin{aligned} f_x(x, y) &= 0 \\ f_y(x, y) &= 0 \end{aligned}\]and then use the second derivative
test to determine, for the given set of points \(S = \{(x, y)\}\), whether they are a max, a min, or a saddle point. So how does the second derivative test work?
We first collect the partial derivatives \(f_{x x}\), \(f_{x y} = f_{y x}\), and \(f_{y y}\) into a matrix:
\[\left[\begin{array}{ll} f_{x x} & f_{x y} \\ f_{y x} & f_{y y} \end{array}\right]\]Then we compute the determinant of this matrix at a critical point \((x_0, y_0)\):
\[D = f_{x x}(x_0, y_0)\, f_{y y}(x_0, y_0) - \left[f_{x y}(x_0, y_0)\right]^2\]Then we interpret the result as follows:
- If \(D>0\) and \(f_{x x}\left(x_0, y_0\right)>0\), the point is a local minimum.
- If \(D>0\) and \(f_{x x}\left(x_0, y_0\right)<0\), the point is a local maximum.
- If \(D<0\), the point is a saddle point.
- If \(D=0\), higher order tests must be used.
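The procedure above can be sketched end to end in plain Python. The example function \(f(x, y) = x^3 - 3x + y^2\) is my own choice; its derivatives are hand-computed: \(f_x = 3x^2 - 3\), \(f_y = 2y\), \(f_{xx} = 6x\), \(f_{yy} = 2\), \(f_{xy} = 0\), giving critical points \((1, 0)\) and \((-1, 0)\).

```python
# Second derivative test for f(x, y) = x^3 - 3x + y^2.
# Hand-computed derivatives: f_x = 3x^2 - 3, f_y = 2y,
# f_xx = 6x, f_yy = 2, f_xy = 0.
# Solving f_x = 0 and f_y = 0 gives critical points (1, 0) and (-1, 0).

def classify(x0, y0):
    """Classify a critical point of f via the determinant D."""
    fxx, fyy, fxy = 6 * x0, 2.0, 0.0
    D = fxx * fyy - fxy**2
    if D > 0:
        return "local minimum" if fxx > 0 else "local maximum"
    if D < 0:
        return "saddle point"
    return "inconclusive"

print(classify(1, 0))   # local minimum  (D = 12 > 0, f_xx = 6 > 0)
print(classify(-1, 0))  # saddle point   (D = -12 < 0)
```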
Optimization of multivariate functions WITH constraints
So far, we have considered optimizing a given function \(f\) without any constraints, i.e., we could consider all possible inputs. For instance, in the case of \(f(x)\), we would consider all possible values of \(x\). However, in real-world scenarios such as machine learning, we often need to embed some kind of constraint into our optimization.
The Lagrange multiplier method is an optimization technique that allows you to do exactly that. More specifically, it allows you to optimize a given function \(f\) subject to a constraint \(g\). In other words, it is an extension of the regular way of optimizing a function with no constraints. So in the end, what the Lagrange method does is allow you to come up with a new function \(L\), which you can then optimize as if there were no constraints, since \(L\) has the constraints embedded in it.
Let me define an example scenario. We are given a function \(f(x, y)\) which we want to optimize (either maximize or minimize). This function is subject to the constraint:
\[g(x, y) = c\]Note that for simplicity, I use only two variables \(x\) and \(y\), but the concept naturally translates to higher dimensions. In addition, I present here only one constraint \(g\); however, you can indeed have more than one constraint, as I comment on in a paragraph below. In general, you can find two ways in which people define the constraint \(g\): the one I introduced above, i.e., \(g(x, y) = c\), and the alternative one, \(g(x, y) = 0\). In the latter, the constant \(c\) is embedded into \(g(x, y)\). By embedded I mean that you simply redefine:
\[g(x, y) := g(x, y) - c = 0\]From now on, I will assume this embedding, and thus the definition of the constraint is \(g(x, y) = 0\). So how do we define the function \(L\)?
The Lagrange function \(L\), given some constraint \(g\), can be defined as follows:
\[L(x, y, \lambda) = f(x, y) - \lambda g(x, y)\]Further, if we have more constraints, then we can write:
\[L(x, y, \boldsymbol{\lambda}) = f(x, y) - \boldsymbol{\lambda} \cdot \mathbf{g}(x, y) = f(x, y) - \sum_{i} \lambda_i g_i(x, y)\]More specifically, adding a new constraint simply means subtracting another term within the equation. (For compactness, the formula uses a dot product between the vector of multipliers \(\boldsymbol{\lambda}\) and the vector of constraints \(\mathbf{g}\).)
As I mentioned above, you can now treat the function \(L\) as a regular function with no constraints. Therefore, in order to find an optimum, we need to find points where \(\nabla L = 0\). In practice, this yields the following set of (usually) non-linear equations:
\[\frac{\partial L}{\partial x} = 0 \text{; } \frac{\partial L}{\partial y} = 0 \text{; } \frac{\partial L}{\partial \lambda} = 0\]This is equivalent to writing:
\[\begin{aligned} \nabla f(x, y) &= \lambda \nabla g(x, y) \\ g(x, y) &= 0 \end{aligned}\]You can try to expand this and see for yourself that you will get the same set of equations.
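Here is the whole recipe worked through by hand on a small example of my own choosing (not from the exercises): maximize \(f(x, y) = xy\) subject to \(g(x, y) = x + y - 1 = 0\). The sketch below checks the resulting stationary point and cross-checks it against a brute-force search over the constraint.

```python
# Lagrange multipliers for f(x, y) = x*y subject to g(x, y) = x + y - 1 = 0.
# L(x, y, lam) = x*y - lam*(x + y - 1), and grad L = 0 gives:
#   dL/dx   = y - lam       = 0
#   dL/dy   = x - lam       = 0
#   dL/dlam = -(x + y - 1)  = 0
# => x = y = lam = 1/2, with maximum f = 1/4.

x = y = lam = 0.5
# Stationarity conditions and the constraint all hold at the solution.
print(y - lam, x - lam, x + y - 1)  # 0.0 0.0 0.0
print(x * y)                        # 0.25

# Cross-check: among a fine grid of points on the constraint x + y = 1,
# none beats f = 1/4.
candidates = [(t, 1 - t) for t in (i / 1000 for i in range(1001))]
best = max(a * b for a, b in candidates)
print(best)  # 0.25, attained at t = 0.5
```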
⛳️ Learning goals checklist
Wow, this has been a lot to digest. To recap, after this week you should be able to:
- compute the gradient and directional derivatives
- compute the formula for tangent lines to curves given as level curves, and tangent planes to level surfaces
- use the second derivative test
- solve optimization problems with constraints using Lagrange multipliers
See you next week at our very last exercise session. 😭