Introduction to Calculus | Ludek Cizinsky

👋 About the lesson

What is optimization about

It is week 10, and we are moving from linear algebra to the optimization part of the course. The real question is: what are we going to optimize? 🤔 Broadly speaking, each problem at hand can be described by some function f which produces an output based on its input. More formally, f is a mapping from (a subset of) \(\mathbb{R}^n\) to \(\mathbb{R}^m\). Think of a mapping as some kind of computation. For instance, based on a person’s age and the number of times they run per week, you could compute their expected marathon time.

Where am I going to use it

You will encounter optimization in many upcoming courses, for instance:

Applied Statistics - computing the probability of a given set of events for a given random variable
Machine learning - in supervised learning, you always define a cost function L and then optimize it, i.e., you look for the parameters that minimize the cost

In this lesson

In this lesson, we will focus on the following:

what is a derivative of a function: the formal definition as well as the intuition behind it
what is an integral of a function
important rules on how to integrate and differentiate
analysing behaviour of the given function
how to approximate part of the function using Taylor polynomials

📓 Notes

What is a derivative of a function

The derivative of a function f is another function, denoted f', which tells you how fast f changes at a given input. For instance, consider \(f(x) = x^2\) and its derivative \(f'(x) = 2x\). Let’s say I am interested in the behaviour of f at the point x = -2. There, f' gives -4. How do we interpret this? First, look at the sign: it is negative, so f is decreasing at that point. How about the value itself? Its magnitude tells you how fast the function is decreasing.

Similarly, if we plug in x = 2, we obtain 4: f is increasing there, at the same rate at which it was decreasing at x = -2. This makes sense since f is symmetric about the y-axis. If we plug in x = 100, we get 200, which also makes sense: the further you ‘go’ left or right, the faster f decreases or increases.

Let’s now look at the formal definition. The derivative of f at some particular point \(x_0\) is defined as:

\[\lim _{\Delta x \rightarrow 0} \frac{\Delta f}{\Delta x}\]

where \(\Delta f=f\left(x_{1}\right)-f\left(x_{0}\right) \text { and } \Delta x=x_{1}-x_{0}\). In words: if I take an infinitely small step along the x-axis, how much does the output of the function change relative to the size of that step?
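
To make the limit more tangible, here is a minimal sketch that evaluates the ratio \(\Delta f / \Delta x\) for \(f(x) = x^2\) at \(x_0 = -2\) with smaller and smaller steps. The helper name `difference_quotient` and the chosen step sizes are my own, picked only for illustration.

```python
def f(x):
    return x ** 2

def difference_quotient(f, x0, delta_x):
    """Slope of f between x0 and x0 + delta_x, i.e. delta_f / delta_x."""
    return (f(x0 + delta_x) - f(x0)) / delta_x

x0 = -2.0
for delta_x in (1.0, 0.1, 0.001, 1e-6):
    slope = difference_quotient(f, x0, delta_x)
    print(f"delta_x = {delta_x:g}: slope = {slope:.6f}")
```

As the step shrinks, the printed slopes should get closer and closer to -4, which is exactly the value of \(f'(-2) = 2 \cdot (-2)\).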

Differentiation rules

Now that we know what a derivative is, how do we actually compute it? This is where the rules come in. Since there is a variety of functions, there is also a variety of rules that help you find the derivative of a given function. Let’s start with the derivatives of the most important functions:

\[\begin{array}{|l|l|l|} \hline \text { Function name } & f(x) & f'(x) \\ \hline \text { Constant } & \text { const } & 0 \\ \hline \text { Linear } & x & 1 \\ \hline \text { Power } & x^{a} & a x^{a-1} \\ \hline \text { Exponential } & e^{x} & e^{x} \\ \hline \text { Exponential } & a^{x} & a^{x} \ln a \\ \hline \text { Natural logarithm } & \ln (x) & \frac{1}{x} \\ \hline \text { Logarithm } & \log_{b}(x) & \frac{1}{x \ln (b)} \\ \hline \text { Sine } & \sin x & \cos x \\ \hline \text { Cosine } & \cos x & -\sin x \\ \hline \end{array}\]

Now, in the real world, you may encounter that these functions are somehow combined. If they are combined linearly (constant multiplication, addition), then we can use the linearity rule. Formally this rule says:

\[(a f(x)+b g(x))^{\prime}=a f^{\prime}(x)+b g^{\prime}(x)\]

Informally, consider the following function h: \(h(x) = 2x^2 + 6x\). This function can be decomposed into the following pieces:

constants: a = 2 and b = 6
two functions: \(f(x) = x^2\) and \(g(x) = x\)

The linearity rule says that we can get the derivative of h by decomposing it into the parts above and dealing with each part separately: \(f'(x) = 2x\) and \(g'(x) = 1\), so \(h'(x) = 2 \cdot 2x + 6 \cdot 1 = 4x + 6\). But what if the functions are nested, i.e., the output of one function serves as the input to another? Formally, this is written as \((f \circ g)(x)\) or \(f(g(x))\). In this case, the chain rule comes to the rescue:

\[(f \circ g)'(x) = g'(x)\,(f' \circ g)(x) = g'(x)\, f'(g(x))\]

I would be pretty confused seeing this formal definition. So in human words: you first differentiate the inner function g, and then you multiply by the derivative of the outer function f, evaluated at the output of the inner function, g(x). Let’s see some examples:

Example 1: \(h(x) = \sin(x^2)\)
- inner is \(x^2\), outer is \(\sin(x)\)
- inner function’s derivative is \(2x\), outer’s is \(\cos(x)\)
- as a result: \(2x\cos(x^2)\) - remember the input to the outer function is \(x^2\)
Example 2: \(h(x) = e^{x^2}\)
- inner is \(x^2\), outer is \(e^x\)
- inner’s derivative is \(2x\), outer’s is \(e^x\)
- as a result: \(2xe^{x^2}\)

How do you determine which function is inner and which is outer? Imagine you were computing the value of the composite function for some particular value of \(x\). What would you compute first, and what second? The first is the inner function, the second is the outer one. Finally, you might also encounter the case where h(x) is a product of two functions: \(h(x) = f(x)g(x)\). For this case, your toolbox should also include the product rule:

\[h'(x) = f(x)g'(x) + f'(x)g(x)\]

Here is a simple example: \(h(x) = 2x\sin(x)\):

\(f(x) = 2x\) and \(f'(x) = 2\)
\(g(x) = \sin(x)\) and \(g'(x) = \cos(x)\)
plug and play: \(h'(x) = 2x\cos(x) + 2\sin(x)\)

All these rules share one common philosophy: divide and conquer. In other words, divide the problem into smaller subproblems, solve these, and then put the intermediate solutions together to get the overall solution. 👌
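
If you want to double-check a derivative you obtained with these rules, one option is to compare it against a numerical approximation. The sketch below does this for the three worked examples above (linearity, chain and product rule). The helper names `numerical_derivative` and `max_error`, the step size and the test points are my own choices for illustration, not something prescribed by the course.

```python
import math

def numerical_derivative(func, x, step=1e-5):
    """Central difference approximation of the derivative of func at x."""
    return (func(x + step) - func(x - step)) / (2 * step)

# Pairs of (function, derivative obtained with the rules) from the examples above
examples = {
    "linearity": (lambda x: 2 * x**2 + 6 * x, lambda x: 4 * x + 6),
    "chain": (lambda x: math.sin(x**2), lambda x: 2 * x * math.cos(x**2)),
    "product": (
        lambda x: 2 * x * math.sin(x),
        lambda x: 2 * x * math.cos(x) + 2 * math.sin(x),
    ),
}

def max_error(func, func_prime, points=(-1.5, 0.3, 2.0)):
    """Largest gap between the rule-based and the numerical derivative."""
    return max(abs(numerical_derivative(func, x) - func_prime(x)) for x in points)

for name, (func, func_prime) in examples.items():
    print(name, max_error(func, func_prime))
```

The printed errors should be tiny. A large error would suggest that a rule was applied incorrectly somewhere.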

What is an integral of a function

I believe that by this point, in IDSP, you have been introduced to different probability distributions. The most famous one is the Gaussian, a.k.a. normal, distribution. This distribution can be described by a function called the density function. This function takes a possible outcome as input and returns the corresponding probability density; probabilities are obtained as areas under this function. Let’s say we are modelling people’s height, and you are interested in the probability that a person is between 170 and 180 cm tall. To solve this problem, you would integrate the density function from 170 to 180.

Let me be more specific. In human words, you can imagine the integral as a for loop that runs over all possible values between 170 and 180. Each of these values is fed into the density function, which returns the corresponding density; multiplied by a small step dx, this gives the probability of a small slice. These slices are then summed up. So in Python, you could write something like this:

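import numpy as np

def integral(a, b, f, dx=1e-3):
    result = 0.0
    for i in np.arange(a, b, dx):  # range() needs integers; np.arange accepts a float step
        result += f(i) * dx  # area of one rectangle: width dx, height f(i)
    return result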

This is literally it. Notice one important thing, which is the dx. What is that? In the loop, this third argument defines the increment. In theory, this increment is infinitely small, or more formally \(dx \rightarrow 0\). So essentially, what you are doing in each iteration is computing the area of a rectangle whose width is given by dx and whose height is given by f(i). Unfortunately, computers represent numbers with finite precision, so there is no such thing as an infinitely small step. To find out the smallest positive normalized float, you can write:

import sys
sys.float_info.min
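
Returning to the height example, here is a minimal sketch of the whole computation. Note that the mean of 175 cm and the standard deviation of 7 cm are hypothetical values I picked only for illustration, and the helper names `normal_density` and `riemann_sum` are my own. The result is compared against a reference value computed with `math.erf`.

```python
import math

def normal_density(x, mean, std):
    """Density of the normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def riemann_sum(f, a, b, n=10_000):
    """Sum of n rectangles of width dx under f between a and b."""
    dx = (b - a) / n
    return sum(f(a + i * dx) * dx for i in range(n))

# Hypothetical parameters, chosen only for illustration
MEAN_CM, STD_CM = 175.0, 7.0

prob = riemann_sum(lambda x: normal_density(x, MEAN_CM, STD_CM), 170, 180)

# Reference: for a normal variable, P(|X - mean| < d) = erf(d / (std * sqrt(2)))
exact = math.erf(5.0 / (STD_CM * math.sqrt(2)))
print(prob, exact)
```

The two printed numbers should agree to several decimal places, which shows that the for-loop view of the integral is a reasonable approximation.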

So, formally, the integral of a function f from a to b is written as:

\[\int_{a}^{b} f(x) d x\]

The integral symbol denotes this infinite sum, which in Python I approximate with a for loop. But is there a closed-form formula that gives us the exact value of the integral? Yes, there is, and it is described by the fundamental theorem of calculus, which says:

\[\int_{a}^{b} f(x) d x = F(b) - F(a)\]

So what is F? The symbol F denotes an antiderivative of the function f. What is an antiderivative, you ask? It is a function that satisfies:

\[F'(x) = f(x)\]

In words, when you differentiate F, you should obtain the original function f. For instance if \(f(x) = 2x\), then \(F(x) = x^2\).
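
As a quick sanity check of the fundamental theorem, the sketch below compares the for-loop approximation with \(F(b) - F(a)\) for \(f(x) = 2x\) and \(F(x) = x^2\). The interval from 1 to 3 and the number of rectangles are arbitrary choices of mine.

```python
def f(x):
    return 2 * x

def F(x):
    return x ** 2  # antiderivative of f, since F'(x) = 2x

def riemann_sum(f, a, b, n=100_000):
    """Sum of n rectangles of width dx under f between a and b."""
    dx = (b - a) / n
    return sum(f(a + i * dx) * dx for i in range(n))

a, b = 1.0, 3.0
approx = riemann_sum(f, a, b)
exact = F(b) - F(a)
print(approx, exact)
```

The approximation should land very close to the exact value, and increasing `n` (i.e., shrinking dx) should bring it even closer.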

Integration rules

To make your life easier, here is a quick overview of the most common functions:

\[\begin{array}{|l|l|l|} \hline \text { Function name } & f(x) & F(x) \\ \hline \text{constant} & 1 & x \\ \hline \text{linear} & x & \frac{1}{2}x^{2} \\ \hline \text{power} & x^{a}(a \neq-1) & \frac{x^{a+1}}{a+1} \\ \hline \text{power: } a = -1& \frac{1}{x} & \ln |x| \\ \hline \text{exponential} & e^{x} & e^{x} \\ \hline \text{exponential} & a^{x}(a>0) & \frac{a^{x}}{\ln a} \\ \hline \text{natural logarithm} & \ln(x) & x \ln (x)-x \\ \hline \text{logarithm} & \log_b(x) & x \log _{b} x-\frac{x}{\ln b} \\ \hline \text{sine} & \sin x & -\cos x \\ \hline \text{cosine} & \cos x & \sin x \\ \hline \end{array}\]

This is all nice, but as we already know from differentiation, these functions might be more complicated. Let’s start with the linearity rule which can also be used for integrals as follows:

constant multiplication: \(\int_a^b cf(x) d x = c \int_a^b f(x) d x\)
addition: \(\int_a^b \left(f(x) + g(x)\right) d x= \int_a^b f(x) d x + \int_a^b g(x) d x\)

As a natural next step, we should discuss how to deal with nested functions as well as products of two functions. Recall that for differentiation we handled these two scenarios with the chain rule and the product rule. For integration, we can use something called the substitution rule, which is the counterpart of the chain rule. As far as I know, this was not discussed during the lecture, but there appears to be one exercise on it in our session. Therefore, I suggest you first read this pdf describing the substitution process and then check out the solution sheet, where I follow these guidelines to solve one of the problems.

Analysing behaviour of the given function

Finally, we can get to one of the first applications of the above theory. Most of the time, we are interested in where the given function has its extremes, i.e., its maxima and minima. Candidates for these can be identified by setting the derivative of the function to zero: \(f'(x) = 0\). (Given what was mentioned before, why do we set it to 0?)

Solving this equation gives you all the candidate points. It is important to mention that a function can have many extremes, so we distinguish two types: local and global. A local maximum (minimum) is the highest (lowest) point in its neighbourhood, while the global maximum (minimum) is the highest (lowest) value the function attains overall, so there can be at most one global maximum value and one global minimum value.

So how do we actually determine whether the function attains a min or a max at a given point x? It is fairly simple:

pick one value slightly to the left and one slightly to the right of x, and plug them into f'
for max: the left value is positive, and the right is negative
for min: the left is negative, and the right is positive

Why? Simply imagine going over the top of a hill (max): you first climb up (increase, positive), and then climb down (decrease, negative). The minimum works the other way around. Once you have identified which points are minima and maxima, you can compare the function values at these points to find the global min and max.
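
The left/right sign check can be sketched in a few lines. The function \(f(x) = x^3 - 3x\) is an example of my own choosing, and the helper name `classify_critical_point` as well as the offset `eps` are hypothetical. The offset has to be small enough that no other point with \(f'(x) = 0\) lies in between.

```python
def classify_critical_point(f_prime, x, eps=1e-3):
    """Classify x (where f'(x) = 0) from the sign of f' just left and right of it."""
    left, right = f_prime(x - eps), f_prime(x + eps)
    if left > 0 and right < 0:
        return "max"  # climbing up, then down
    if left < 0 and right > 0:
        return "min"  # going down, then up
    return "neither"

# Example: f(x) = x^3 - 3x, so f'(x) = 3x^2 - 3, which is 0 at x = -1 and x = 1
def f_prime(x):
    return 3 * x**2 - 3

for x in (-1.0, 1.0):
    print(x, classify_critical_point(f_prime, x))
```

Note the third outcome: a point where the derivative is zero does not have to be an extreme at all. For \(f(x) = x^3\), the derivative is zero at 0 but positive on both sides.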

Approximating function using the Taylor polynomial

Sometimes, a function is too complex to describe as a whole, but for your use case it might be sufficient to simply approximate it around a certain point. This is exactly the use case of the Taylor polynomial.

Formal definition is as follows:

\[T_{n, f, a}(x)=\sum_{i=0}^{n} \frac{f^{(i)}(a)}{i !}(x-a)^{i}\]

In words, we use the n-th degree Taylor polynomial to approximate the function f around the point a. As a result, you get a new function into which you can plug any value x and get an estimate of the true value f(x); the estimate is typically best close to a. To see this in practice, check the last exercise in the solution sheet.
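
Here is a minimal sketch of the formula in code, assuming you already know the values of the derivatives of f at the point a. The function name `taylor_polynomial` and the example \(f(x) = e^x\) around \(a = 0\) are my own choices for illustration.

```python
import math

def taylor_polynomial(derivatives_at_a, a, x):
    """Evaluate the sum over i of f^(i)(a) / i! * (x - a)^i.

    derivatives_at_a[i] holds the i-th derivative of f at a (index 0 is f(a)),
    so a list of length n + 1 gives the n-th degree polynomial.
    """
    return sum(
        d / math.factorial(i) * (x - a) ** i
        for i, d in enumerate(derivatives_at_a)
    )

# Example: f(x) = e^x around a = 0; every derivative of e^x is e^x, so f^(i)(0) = 1
a, x = 0.0, 0.5
for n in (1, 2, 4, 8):
    approx = taylor_polynomial([1.0] * (n + 1), a, x)
    print(n, approx, abs(approx - math.exp(x)))
```

The last printed column is the approximation error, which should shrink as the degree n grows.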

⛳️ Learning goals checklist

Congrats, you have made it through! There has been a lot of information to digest, so do not worry if you feel a bit overwhelmed. To make it easier, here are, in my view, the most important learning objectives:

I can explain the idea behind derivative and integral of the given function
I can do basic differentiation and integration - using linearity rule
I am able to analyse behaviour of the function
I am able to use Taylor polynomial to approximate given function
[More advanced] I am able to use the product, chain and (substitution) rules