57  Applications with scalar functions

This section uses these add-on packages:

using CalculusWithJulia
using Plots
plotly()
using SymPy
using Roots

And the following from the Contour package:

import Contour: contours, levels, level, lines, coordinates

This section presents different applications of scalar functions.

57.1 Tangent planes, linearization

Consider the case \(f:R^2 \rightarrow R\). We visualize \(z=f(x,y)\) through a surface. At a point \((a, b)\), this surface, if \(f\) is sufficiently smooth, can be approximated by a flat area, or a plane. For example, the Northern hemisphere of the earth might be modeled simplistically by \(z = \sqrt{R^2 - (x^2 + y^2)}\) for some \(R\) and with the origin at the earth’s core. The ancient view of a “flat earth” can be more generously seen as identifying this tangent plane with the sphere. More apt for current times is the use of GPS coordinates to describe location. The difference between any two coordinates is technically a distance on a curved, nearly spherical, surface. But if the two points are reasonably close (miles, not tens of miles) and accuracy isn’t of utmost importance (i.e., not used for self-driving cars), then the distance can be found from the Euclidean distance formula, \(\sqrt{(\Delta\text{latitude})^2 + (\Delta\text{longitude})^2}\). That is, as if the points were on a plane, not a curved surface.

For the univariate case, the tangent line has many different uses. Here we see the tangent plane also does.

57.1.1 Equation of the tangent plane

The partial derivatives have the geometric view of being the derivatives of the univariate functions \(f(\vec\gamma_x(t))\) and \(f(\vec\gamma_y(t))\), where \(\vec\gamma_x\) moves just parallel to the \(x\) axis (e.g., \(\langle t + a, b\rangle\)) and \(\vec\gamma_y\) moves just parallel to the \(y\) axis. The partial derivatives then are slopes of tangent lines to each curve. The tangent plane, should it exist, should match both slopes at a given point. With this observation, we can identify it.

Consider \(f(\vec\gamma_x)\) at a point \((a,b)\). The path has a tangent vector with “slope” \(\frac{\partial f}{\partial x}\) that points in the direction of the \(x\) axis, but not the \(y\) axis, as does this vector: \(\langle 1, 0, \frac{\partial f}{\partial x} \rangle\). Similarly, the vector \(\langle 0, 1, \frac{\partial f}{\partial y} \rangle\) describes the tangent line to \(f(\vec\gamma_y)\) at the point.

These two vectors will lie in the plane. The normal vector is found by their cross product:

@syms f_x f_y
n = [1, 0, f_x] × [0, 1, f_y]

\(\left[\begin{smallmatrix}- f_{x}\\- f_{y}\\1\end{smallmatrix}\right]\)

Let \(\vec{x} = \langle a, b, f(a,b) \rangle\). The tangent plane at \(\vec{x}\) then is described by all vectors \(\vec{v}\) with \(\vec{n}\cdot(\vec{v} - \vec{x}) = 0\). Using \(\vec{v} = \langle x,y,z\rangle\), we have:

\[ [-\frac{\partial f}{\partial x}, -\frac{\partial f}{\partial y}, 1] \cdot [x-a, y-b, z - f(a,b)] = 0, \]

or,

\[ z = f(a,b) + \frac{\partial f}{\partial x} (x-a) + \frac{\partial f}{\partial y} (y-b), \]

which is more compactly expressed as

\[ z = f(a,b) + \nabla(f) \cdot \langle x-a, y-b \rangle. \]

This form would then generalize to scalar functions from \(R^n \rightarrow R\). This is consistent with the definition of \(f\) being differentiable, where \(\nabla{f}\) plays the role of the slope in the formulas.

The following figure illustrates the above for the function \(f(x,y) = 6 - x^2 - y^2\):

f(x, y) = 6 - x^2 - y^2
f(x) = f(x...)

a,b = 1, -1/2


# draw surface
xr = 7/4
xs = ys = range(-xr, xr, length=100)
surface(xs, ys, f, legend=false)

# visualize tangent plane as 3d polygon
pt = [a,b]
tplane(x) = f(pt) + gradient(f)(pt) ⋅ (x - [a,b])

pts = [[a-1,b-1], [a+1, b-1], [a+1, b+1], [a-1, b+1], [a-1, b-1]]
plot!(unzip([[pt..., tplane(pt)] for pt in pts])...)

# plot paths in x and y direction through (a,b)
γ_x(t) = pt + t*[1,0]
γ_y(t) = pt + t*[0,1]

plot_parametric!((-xr-a)..(xr-a), t -> [γ_x(t)..., (f∘γ_x)(t)],  linewidth=3)
plot_parametric!((-xr-b)..(xr-b), t -> [γ_y(t)..., (f∘γ_y)(t)],  linewidth=3)

# draw directional derivatives in 3d and normal
pt = [a, b, f(a,b)]
fx, fy = gradient(f)(a,b)
arrow!(pt, [1, 0, fx], linewidth=3)
arrow!(pt, [0, 1, fy], linewidth=3)
arrow!(pt, [-fx, -fy, 1], linewidth=3) # normal

# draw point in base, x-y, plane
pt = [a, b, 0]
scatter!(unzip([pt])...)
arrow!(pt, [1,0,0], linestyle=:dash)
arrow!(pt, [0,1,0], linestyle=:dash)

Alternate forms

The equation for the tangent plane is often expressed in a more explicit form. For \(n=2\), if we set \(dx = x-a\) and \(dy=y-b\), then the equation for the plane becomes:

\[ z = f(a,b) + \frac{\partial f}{\partial x} dx + \frac{\partial f}{\partial y} dy, \]

which is a common form for the equation, though possibly confusing, as \(\partial x\) and \(dx\) need to be distinguished. For \(n > 2\), additional terms follow this pattern. This explicit form is helpful when doing calculations by hand, but much less so when working on the computer, say with Julia, as representations using vectors (or matrices) can be readily implemented and are much closer to the formulas. For example, consider these two possible functions to find the tangent plane (returned as a function) at a point in \(2\) dimensions:

function tangent_plane_1st_crack(f, pt)
  fx, fy = ForwardDiff.gradient(f, pt)
  x -> f(x...) + fx * (x[1]-pt[1]) + fy * (x[2]-pt[2])
end
tangent_plane_1st_crack (generic function with 1 method)

It isn’t so bad, but as written, we specialized to the number of dimensions, used indexing, and with additional dimensions, it clearly would get tedious to generalize. Using vectors, we might have:

function tangent_plane(f, pt)
  ∇f = ForwardDiff.gradient(f, pt) # using a variable ∇f
  x -> f(pt) + ∇f ⋅ (x - pt)
end
tangent_plane (generic function with 1 method)

This is much more like the compact formula and able to handle higher dimensions without rewriting.
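
For example, here is a quick check in three variables; the function f₃ and the point below are our own choices for illustration, not from the text:

f₃(v) = sum(v.^2)                           # a hypothetical 3-variable function
tp₃ = tangent_plane(f₃, [1.0, 2.0, 3.0])
tp₃([1.1, 2.1, 3.1]), f₃([1.1, 2.1, 3.1])   # the tangent-plane value should be close to the function value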

57.1.2 Tangent plane for level curves

Consider the surface described by \(f(x,y,z) = c\), a constant. This is more general than surfaces described by \(z = f(x,y)\). The concept of a tangent plane should still be applicable, though. Suppose \(\vec{\gamma}(t)\) is a curve in \(x\)-\(y\)-\(z\) space lying on the surface, so that \((f\circ\vec\gamma)(t)\) is constant. Its derivative is given by the chain rule through \(\nabla{f}(\vec\gamma(t))\cdot \vec\gamma'(t)\). But this composition is constantly the same value, so the derivative is \(0\). This says that \(\nabla{f}(\vec\gamma(t))\) is orthogonal to \(\vec\gamma'(t)\) for any such curve. As these tangent vectors to \(\vec\gamma\) lie in the tangent plane, the tangent plane can be characterized by having \(\nabla{f}\) as its normal.

This computation was previously done in two dimensions, and showed the gradient is orthogonal to the contour lines (and points in the direction of greatest ascent). It can be generalized to higher dimensions.

The surface \(F(x,y,z) = z - f(x,y) = 0\) has gradient given by \(\langle -\partial{f}/\partial{x}, -\partial{f}/\partial{y}, 1\rangle\), and as seen above, this vector is normal to the tangent plane, so this generalization agrees on the easier case.

For clarity:

  • The scalar function \(z = f(x,y)\) describes a surface, \((x,y,f(x,y))\); the gradient, \(\nabla{f}\), is \(2\) dimensional and points in the direction of greatest ascent for the surface.
  • The scalar function \(f(x,y,z)\) also describes a surface, through a level surface \(f(x,y,z) = c\), for some constant \(c\). The gradient \(\nabla{f}\) is \(3\) dimensional and orthogonal to the surface.
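
A quick numeric illustration of this orthogonality, using a sphere and a curve of our own choosing (the names F₃ and γ₃ are not used elsewhere):

F₃(x, y, z) = x^2 + y^2 + z^2              # the unit sphere is the level surface F₃ = 1
F₃(v) = F₃(v...)
γ₃(t) = [cos(t), sin(t), 0]                # a curve lying on that level surface
ForwardDiff.gradient(F₃, γ₃(1/2)) ⋅ γ₃'(1/2)   # ≈ 0: the gradient is orthogonal to the tangent vector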
Example

Let \(z = f(x,y) = \sin(x)\cos(x-y)\). Find an equation for the tangent plane at \((\pi/4, \pi/3)\).

We have many possible forms to express this in, but we will use the functional description:

@syms x, y
(x, y)
f(x,y) = sin(x) * cos(x-y)
f(x) = f(x...)
vars = [x, y]

gradf = diff.(f(x,y), vars)  # or use gradient(f, vars) or ∇((f,vars))

pt = [PI/4, PI/3]
gradfa = subs.(gradf, x=>pt[1], y=>pt[2])

f(pt) + gradfa ⋅ (vars - pt)

\(\left(x - \frac{\pi}{4}\right) \left(- \frac{\sqrt{2} \left(- \frac{\sqrt{6}}{4} + \frac{\sqrt{2}}{4}\right)}{2} + \frac{\sqrt{2} \left(\frac{\sqrt{2}}{4} + \frac{\sqrt{6}}{4}\right)}{2}\right) + \frac{\sqrt{2} \left(- \frac{\sqrt{6}}{4} + \frac{\sqrt{2}}{4}\right) \left(y - \frac{\pi}{3}\right)}{2} + \frac{\sqrt{2} \left(\frac{\sqrt{2}}{4} + \frac{\sqrt{6}}{4}\right)}{2}\)

Example

A cylinder \(f(x,y,z) = (x-a)^2 + y^2 = a^2\) is intersected with a sphere \(g(x,y,z) = x^2 + y^2 + z^2 = (2a)^2\). Let \(V\) be the curve of intersection (Viviani’s curve). Let \(P\) be a point on the curve. Describe the tangent to the curve at \(P\).

The line of intersection will have its tangent line lying in the tangent plane to both surfaces. These two surfaces have normal vectors given by the gradients, \(\vec{n}_1 = \langle 2(x-a), 2y, 0 \rangle\) and \(\vec{n}_2 = \langle 2x, 2y, 2z \rangle\). The cross product of these two vectors will lie in both tangent planes, so we have:

\[ P + t (\vec{n}_1 \times \vec{n}_2), \]

will describe the tangent.

The curve may be described parametrically by \(\vec\gamma(t) = a \langle 1 + \cos(t), \sin(t), 2\sin(t/2) \rangle\). Let’s see that the above is correct by verifying that the cross product of the tangent vector computed two ways is \(0\):

a = 1
gamma(t) = a * [1 + cos(t), sin(t), 2sin(t/2) ]
P = gamma(1/2)
n1(x,y,z)= [2*(x-a), 2y, 0]
n2(x,y,z) = [2x,2y,2z]
n1(x) = n1(x...)
n2(x) = n2(x...)

t = 1/2
(n1(gamma(t)) × n2(gamma(t))) × gamma'(t)
3-element Vector{Float64}:
 0.0
 0.0
 0.0

Plotting level surfaces of \(F(x,y,z) = c\)

The wireframe plot can be used to visualize a surface of the type \(z=f(x,y)\), as previously illustrated. However, we have no way of plotting \(3\)-dimensional implicit surfaces (of the type \(F(x,y,z)=c\)) as we do for \(2\)-dimensional implicitly defined curves with Plots. (The MDBM or IntervalConstraintProgramming packages can be used along with the Makie plotting package to produce one.)

The CalculusWithJulia package provides a stop-gap function, plot_implicit_surface, for this task. The basic idea is to slice up an axis, by default the \(z\) axis, and for each level \(z_0\) plot the contours of \((x,y) \rightarrow f(x,y,z_0)-c\), which becomes a \(2\)-dimensional problem. The function allows any of the \(3\) axes to be chosen to slice over, the default being just the \(z\) axis.

We demonstrate with an example from a February 14, 2019 article in the New York Times. It shows an equation for a “heart,” as the graphic will illustrate:

a, b = 1, 3
f(x,y,z) = (x^2 + ((1+b) * y)^2 + z^2 - 1)^3 - x^2 * z^3 - a * y^2 * z^3

CalculusWithJulia.plot_implicit_surface(f, xlim=-2..2, ylim=-1..1, zlim=-1..2)
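
To see the slicing idea directly, here is a single slice of this surface at \(z = 1/2\) (an arbitrary choice of ours), using the Contour functions imported above; with \(z\) fixed, the problem reduces to a \(2\)-dimensional one:

z₀ = 1/2
xs = range(-2, 2, length=100)
ys = range(-1, 1, length=100)
zs = [f(x, y, z₀) for x in xs, y in ys]    # fix z, leaving a function of (x, y)
p = plot(legend=false)
for cl in levels(contours(xs, ys, zs, [0.0])), line in lines(cl)
    plot!(p, coordinates(line)...)         # the 0-level curves of this slice
end
p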

57.2 Linearization

The tangent plane is the best “linear approximation” to a function at a point. “Linear” refers to mathematical properties of the tangent plane, but at a practical level it means easy to compute, as it will involve only multiplication and addition. “Approximation” is useful in that if a bit of error is an acceptable tradeoff for computational ease, the tangent plane may be used in place of the function. In the univariate case, this is known as linearization, and the tradeoff is widely used in the derivation of theoretical relationships, as well as in practice to get reasonable numeric values.

Formally, this is saying:

\[ f(\vec{x}) \approx f(\vec{a}) + ∇f(\vec{a}) ⋅ (\vec{x} - \vec{a}). \]

The explicit meaning of \(\approx\) will be made clear when the generalization of Taylor’s theorem is stated.

Example: Linear approximation

The volume of a cylinder is \(V=\pi r^2 h\). It is thought a cylinder has \(r=1\) and \(h=2\). If instead, the amounts are \(r=1.01, h=2.01\), what is the difference in volume?

That is, if \(V(r,h) = \pi r^2 h\), what is \(V(1.01, 2.01) - V(1,2)\)?

We can use linear approximation to see that this difference is approximately \(\nabla{V} \cdot \langle 0.01, 0.01 \rangle\). This is:

V(r, h) = pi * r^2 * h
V(v) = V(v...)
a₁ = [1,2]
dx₁ = [0.01, 0.01]
ForwardDiff.gradient(V, a₁) ⋅ dx₁   # or use ∇(V)(a₁) ⋅ dx₁
0.15707963267948966

The exact difference can be computed:

V(a₁ + dx₁) - V(a₁)
0.15833941133357854
Example

Let \(f(x,y) = \sin(\pi x y^2)\). Estimate \(f(1.1, 0.9)\).

Using linear approximation with \(dx=0.1\) and \(dy=-0.1\), this is

\[ f(1,1) + \nabla{f}(1,1) \cdot \langle 0.1, -0.1\rangle, \]

where \(f(1,1) = \sin(\pi) = 0\) and \(\nabla{f} = \langle \pi y^2\cos(\pi x y^2), \cos(\pi x y^2) 2\pi x y\rangle = \pi y \cos(\pi x y^2)\langle y,2x\rangle\). So, the answer is:

\[ 0 + \pi \cos(\pi) \langle 1,2\rangle\cdot \langle 0.1, -0.1 \rangle = (-\pi)(0.1 - 2(0.1)) = 0.1\pi. \]
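
A quick numeric check of this hand computation, using a scratch function name (fᵢ) so as not to interfere with later definitions:

fᵢ(x, y) = sin(pi * x * y^2)
fᵢ(v) = fᵢ(v...)
est = fᵢ(1, 1) + ForwardDiff.gradient(fᵢ, [1.0, 1.0]) ⋅ [0.1, -0.1]
est, 0.1 * pi, fᵢ(1.1, 0.9)   # the estimate, its value 0.1π, and the actual value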

Example

A piriform is described by the quartic surface \(f(x,y,z) = x^4 -x^3 + y^2+z^2 = 0\). Find an equation for the tangent plane at the point \(\langle 2,2,2 \rangle\).

Here, \(\nabla{f}\) describes a normal to the tangent plane. A plane may be described by \(\hat{N}\cdot(\vec{x} - \vec{x}_0) = 0\), where \(\vec{x}_0\) is identified with a point on the plane (the point \((2,2,2)\) here). With this, we have \(\hat{N}\cdot\vec{x} = ax + by + cz = \hat{N}\cdot\langle 2,2,2\rangle = 2(a+b+c)\). For this problem, \(\nabla{f}(2,2,2) = \langle a, b, c\rangle\) is given by:

f(x,y,z) = x^4 -x^3 + y^2 + z^2
f(v) = f(v...)
a, b, c = ∇(f)(2,2,2)
"$a x + $b y  + $c z = $([a,b,c] ⋅ [2,2,2])"
"20 x + 4 y  + 4 z = 56"

57.2.1 Newton’s method to solve \(f(x,y) = 0\) and \(g(x,y)=0\).

The level curve \(f(x,y)=0\) and the level curve \(g(x,y)=0\) may intersect. Solving algebraically for the intersection may be difficult in most cases, though the linear case is not (the linear case being the intersection of two lines).

To elaborate, consider two linear equations written in a general form:

\[ \begin{align*} ax + by &= u\\ cx + dy &= v \end{align*} \]

A method to solve this by hand would be to solve for \(y\) from one equation, substitute this expression into the second equation, and then solve for \(x\). From there, \(y\) can be found. A more advanced method expresses the problem in a matrix formulation of the form \(Mx=b\) and solves that equation. This form of solving is implemented in Julia through the “backslash” operator. Here is the general solution:

@syms a b c d u v
M = [a b; c d]
B = [u, v]
M \ B .|> simplify

\(\left[\begin{smallmatrix}\frac{- b v + d u}{a d - b c}\\\frac{a v - c u}{a d - b c}\end{smallmatrix}\right]\)

The term \(\det(M) = ad-bc\) is important, as evidenced by its appearance in the denominator of each term. When this is zero there is not a unique solution, as there is in the typical case.

Using Newton’s method to solve for intersection points uses linearization of the surfaces to replace the problem with the intersection of the level curves of the tangent planes. This is the linear case, which can be readily solved. As with Newton’s method for the univariate case, the new answer is generally a better approximation to the solution, and the process is iterated to get a good enough approximation, as defined through some tolerance.

Consider the functions \(f(x,y) = 2 - x^2 - y^2\) and \(g(x,y) = 3 - 2x^2 - (1/3)y^2\). Graphs of their surfaces with the level sets for \(c=0\) drawn, and of just the level sets, show that they intersect in \(4\) places.

We look to find the intersection point near \((1,1)\) using Newton’s method.

We have by linearization:

\[ \begin{align*} f(x,y) &\approx f(x_n, y_n) + \frac{\partial f}{\partial x}\Delta x + \frac{\partial f}{\partial y}\Delta y \\ g(x,y) &\approx g(x_n, y_n) + \frac{\partial g}{\partial x}\Delta x + \frac{\partial g}{\partial y}\Delta y, \end{align*} \]

where \(\Delta x = x- x_n\) and \(\Delta y = y-y_n\). Setting \(f(x,y)=0\) and \(g(x,y)=0\), leaves these two linear equations in \(\Delta x\) and \(\Delta y\):

\[ \begin{align*} \frac{\partial f}{\partial x} \Delta x + \frac{\partial f}{\partial y} \Delta y &= -f(x_n, y_n)\\ \frac{\partial g}{\partial x} \Delta x + \frac{\partial g}{\partial y} \Delta y &= -g(x_n, y_n). \end{align*} \]

One step of Newton’s method defines \((x_{n+1}, y_{n+1})\) to be the values \((x,y)\) that make the linearized functions about \((x_n, y_n)\) both equal to \(\vec{0}\).

As just described, we can use Julia’s \ operation to solve the above system of equations, if we express them in matrix form. With this, one step of Newton’s method can be coded as follows:

function newton_step(f, g, xn)
    M = [ForwardDiff.gradient(f, xn)'; ForwardDiff.gradient(g, xn)']
    b = -[f(xn), g(xn)]
    Delta = M \ b
    xn + Delta
end
newton_step (generic function with 1 method)

We investigate what happens starting at \((1,1)\) after one step:

𝒇(x,y) = 2 - x^2 - y^2
𝒈(x,y) = 3 - 2x^2 - (1/3)y^2
𝒇(v) = 𝒇(v...); 𝒈(v) = 𝒈(v...)
𝒙₀ = [1,1]
𝒙₁ = newton_step(𝒇, 𝒈, 𝒙₀)
2-element Vector{Float64}:
 1.2
 0.8

The new function values are

𝒇(𝒙₁), 𝒈(𝒙₁)
(-0.08000000000000007, -0.09333333333333327)

We can get better approximations by iterating. Here we hard code \(4\) more steps:

𝒙₂ = newton_step(𝒇, 𝒈, 𝒙₁)
𝒙₃ = newton_step(𝒇, 𝒈, 𝒙₂)
𝒙₄ = newton_step(𝒇, 𝒈, 𝒙₃)
𝒙₅ = newton_step(𝒇, 𝒈, 𝒙₄)
𝒙₅, 𝒇(𝒙₅), 𝒈(𝒙₅)
([1.1832159566199232, 0.7745966692414834], 0.0, 1.6653345369377348e-16)

We see that at the new point, 𝒙₅, both function values are essentially \(0\), so we have approximated the intersection point.

For nearby initial guesses and reasonable functions, Newton’s method converges quadratically, so it should take only a few steps to converge, as above.

Here is a simplistic method to iterate \(n\) steps:

function nm(f, g, x, n=5)
    for i in 1:n
      x = newton_step(f, g, x)
    end
    x
end
nm (generic function with 2 methods)
Example

Consider the bicylinder, the intersection of two perpendicular cylinders of the same radius. If the radius is \(1\), we might express these by the functions:

\[ f(x,y) = \sqrt{1 - y^2}, \quad g(x,y) = \sqrt{1 - x^2}. \]

We see that \((1,1)\), \((-1,1)\), \((1,-1)\) and \((-1,-1)\) are solutions to \(f(x,y)=0\), \(g(x,y)=0\) and \((0,0)\) is a solution to \(f(x,y)=1\) and \(g(x,y)=1\). What about a level like \(1/2\), say?

Rather than work with \(f(x,y) = c\) we solve \(f(x,y)^2 = c^2\), as that will avoid issues with the square root not being defined. Here is one way to solve:

c = 1/2
f(x,y) = 1 - y^2 - c^2
g(x,y) = (1 - x^2) - c^2
f(v) = f(v...); g(v) = g(v...)
nm(f, g, [1/2, 1/3])
2-element Vector{Float64}:
 0.8660254037844386
 0.8660254037935468

That \(x=y\) is not so surprising, and in fact, this problem can more easily be solved analytically through \(x^2 = y^2 = 1 - c^2\).
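
A quick check that the analytic value agrees with the numeric one found above:

c, sqrt(1 - c^2)   # ≈ (0.5, 0.8660...), matching the intersection point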

57.3 Implicit differentiation

Implicit differentiation of an equation of two variables (say \(x\) and \(y\)) is performed by assuming \(y\) is a function of \(x\) and when differentiating an expression with \(y\), use the chain rule. For example, the slope of the tangent line, \(dy/dx\), for the general ellipse \(x^2/a + y^2/b = 1\) can be found through this calculation:

\[ \frac{d}{dx}(\frac{x^2}{a} + \frac{y^2}{b}) = \frac{d}{dx}(1), \]

or, using \(d/dx(y^2) = 2y dy/dx\):

\[ \frac{2x}{a} + \frac{2y \frac{dy}{dx}}{b} = 0. \]

From this, solving for \(dy/dx\) is routine, as the equation is linear in that unknown: \(dy/dx = -(b/a)(x/y)\).
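
This can be checked with SymPy by treating \(y\) as an unspecified function \(Y(x)\); the symbols aₑ and bₑ below are our own, chosen so as not to clash with other uses of a and b:

@syms aₑ, bₑ, x, Y()
solve(diff(x^2/aₑ + Y(x)^2/bₑ - 1, x), diff(Y(x), x))   # should recover dy/dx = -(bₑ/aₑ)(x/Y(x))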

With more variables, the same technique may be used. Say we have variables \(x\), \(y\), and \(z\) in a relation like \(F(x,y,z) = 0\). If we assume \(z=z(x,y)\) for some differentiable function (we mention later what conditions will ensure this assumption is valid for some open set), then we can proceed as before, using the chain rule as necessary.

For example, consider the ellipsoid: \(x^2/a + y^2/b + z^2/c = 1\). What is \(\partial z/\partial x\) and \(\partial{z}/\partial{y}\), as needed to describe the tangent plane as above?

To find \(\partial/\partial{x}\) we have:

\[ \frac{\partial}{\partial{x}}(x^2/a + y^2/b + z^2/c) = \frac{\partial}{\partial{x}}1, \]

or

\[ \frac{2x}{a} + \frac{0}{b} + \frac{2z\frac{\partial{z}}{\partial{x}}}{c} = 0. \]

Again the desired unknown is within a linear equation so can readily be solved:

\[ \frac{\partial{z}}{\partial{x}} = -\frac{c}{a} \frac{x}{z}. \]

A similar approach can be used for \(\partial{z}/\partial{y}\).

Example

Let \(f(x,y,z) = x^4 -x^3 + y^2 + z^2 = 0\) be a surface with point \((2,2,2)\). Find \(\partial{z}/\partial{x}\) and \(\partial{z}/\partial{y}\).

To find \(\partial{z}/\partial{x}\) and \(\partial{z}/\partial{y}\) we have:

@syms x, y, Z()
∂x = solve(diff(x^4 -x^3 + y^2 + Z(x,y)^2, x), diff(Z(x,y),x))
∂y = solve(diff(x^4 -x^3 + y^2 + Z(x,y)^2, y), diff(Z(x,y),y))
∂x, ∂y
(Sym{PyCall.PyObject}[x^2*(3 - 4*x)/(2*Z(x, y))], Sym{PyCall.PyObject}[-y/Z(x, y)])
Example

The find_zero function has been used to solve for \(f(x, p) = 0\) where \(p\) is a parameter. Mathematically, this equation defines \(x^*(p)\) satisfying \(f(x^*(p), p)=0,\) that is \(x^*\) is implicitly a function of \(p\). What is the derivative (gradient) of \(x^*\)?

Assume for now that \(p\) is a scalar quantity, so \(f(x,p)\) is a scalar function of two variables. Then we can differentiate in \(p\) to see:

\[ \frac{d}{dp}f(x^*(p), p) = \frac{\partial}{\partial x}f(x^*(p),p) \frac{d}{dp}x^*(p) + \frac{\partial}{\partial p}f(x^*(p),p) \frac{d}{dp}p = 0 \]

As the partial in \(x\) of \(f\) is a scalar quantity, we can divide to get:

\[ \frac{d}{dp}x^*(p) = -\frac{\frac{\partial}{\partial p}f(x^*(p),p)}{\frac{\partial}{\partial x}f(x^*(p),p)} \]

For example, consider the equation \(f(x, p) = \cos(x) - px\) for \(p > 0\). This will always have a unique solution in \((0, \pi/2)\) which can be found, as follows with \(p=2\):

f(x, p) = cos(x) - p*x
p = 2
xᵅ = find_zero(f, (0, pi/2), p)
0.45018361129487355

The derivative could be found directly through automatic differentiation by ForwardDiff.derivative, though it takes a bit of an adjustment to our function call to get the types to flow through. Rather than make that adjustment, we use the above to find the derivative for a given \(p\), for example, again with \(p=2\):

p = 2
xᵅ =  find_zero(f, (0, pi/2), p)
fₓ = ForwardDiff.derivative(x -> f(x,p), xᵅ)
fₚ = ForwardDiff.derivative(p -> f(xᵅ, p), p)
- fₚ / fₓ
-0.1848703980832297

The negative sign makes sense: as \(p\) gets bigger, the line \(y=px\) gets steeper and intersects the cosine graph earlier, hence \(x^*\) is decreasing.

A plot can easily be made by wrapping the above in a function:

function find_zero_derivative(f, x₀, p)
    xᵅ =  find_zero(f, x₀, p)
    fₓ = ForwardDiff.derivative(x -> f(x,p), xᵅ)
    fₚ = ForwardDiff.derivative(p -> f(xᵅ, p), p)
    - fₚ / fₓ
end
F(p) = find_zero_derivative(f, (0, pi/2), p)
plot(F, 0.01, 5)  # p > 0

This problem does not have a readily expressed value for \(x^*\), but when \(p \approx 0\) we should get similar behavior to the intersection of \(y=px\) and \(y=\pi/2 - x\) for \(x^*\), or \(x^* \approx \pi/(2(1+p))\) which has derivative of \(-\pi/2\) at \(p=0\), matching the above graph. For large \(p\), the problem looks like the intersection of the line \(y=1\) with \(y=px\) or \(x^* \approx 1/p\) which has derivative that goes to \(0\) as \(p\) goes to infinity, again matching this graph.
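
These limiting behaviors can be checked numerically with the F defined above (the particular values of p are our own choices):

F(0.01), -pi/2    # nearly equal for small p
F(100)            # close to 0 for large p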

57.4 Optimization

For a continuous univariate function \(f:R \rightarrow R\) over an interval \(I\) the question of finding a maximum or minimum value is aided by two theorems:

  • The Extreme Value Theorem, which states that if \(I\) is closed (e.g., \(I=[a,b]\)) then \(f\) has a maximum (minimum) value \(M\) and there is at least one value \(c\) with \(a \leq c \leq b\) with \(M = f(c)\).
  • Fermat’s theorem on critical points, which states that if \(f:(a,b) \rightarrow R\), \(a < x_0 < b\), \(f(x_0)\) is a local extremum, and \(f\) is differentiable at \(x_0\), then \(f'(x_0) = 0\). That is, local extrema of \(f\) happen at points where the derivative does not exist or is \(0\) (critical points).

These two theorems provide an algorithm to find the extreme values of a continuous function over a closed interval: find the critical points, check these and the end points for the maximum and minimum value.

These checks can be reduced by two theorems that can classify critical points as local extrema, the first and second derivative tests.

These theorems have generalizations to scalar functions, allowing a similar study of extrema.

First, we define a local maximum for \(f:R^n \rightarrow R\) over a region \(U\): a point \(\vec{a}\) in \(U\) is a local maximum if \(f(\vec{a}) \geq f(\vec{u})\) for all \(\vec{u}\) in some ball about \(\vec{a}\). A local minimum would have \(\leq\) instead.

An absolute maximum over \(U\), should it exist, would be \(f(\vec{a})\) if there exists a value \(\vec{a}\) in \(U\) with the property \(f(\vec{a}) \geq f(\vec{u})\) for all \(\vec{u}\) in \(U\).

The difference is the same as the one-dimensional case: local is a statement about nearby points only, absolute a statement about all the points in the specified set.

Let \(f:R^n \rightarrow R\) be continuous and defined on a closed and bounded set \(V\). Then \(f\) has a minimum value \(m\) and maximum value \(M\) over \(V\), and there exist points \(\vec{a}\) and \(\vec{b}\) in \(V\) with \(m = f(\vec{a})\) and \(M = f(\vec{b})\).

Fermat’s theorem on critical points

Let \(f:R^n \rightarrow R\) be a continuous function defined on an open set \(U\). If \(x \in U\) is a point where \(f\) has a local extremum and \(f\) is differentiable at \(x\), then the gradient of \(f\) at \(x\) is \(\vec{0}\).

Call a point in the domain of \(f\) where the function is differentiable and the gradient is zero a stationary point, and a point in the domain where the function is either not differentiable or is a stationary point a critical point. By Fermat’s theorem, local extrema can only happen at critical points.

Consider the function \(f(x,y) = e^{-(x^2 + y^2)/5} \cos(x^2 + y^2)\).

f(x,y)= exp(-(x^2 + y^2)/5) * cos(x^2 + y^2)
xs = ys = range(-4, 4, length=100)
surface(xs, ys, f, legend=false)

This function is differentiable and the gradient is given by:

\[ \nabla{f} = -2/5e^{-(x^2 + y^2)/5} (5\sin(x^2 + y^2) + \cos(x^2 + y^2)) \langle x, y \rangle. \]

This is zero at the origin, or when \(5\sin(x^2 + y^2) = -\cos(x^2 + y^2)\). The latter happens on circles of radius \(r\) where \(5\sin(r^2) = -\cos(r^2)\), or \(r^2 = \tan^{-1}(-1/5) + k\pi\) for \(k = 1, 2, \dots\). This matches the graph, where the extrema are on circles by symmetry. Imagine now picking a value where the function takes a maximum and adding the tangent plane. As the gradient is \(\vec{0}\), this will be flat. At the origin, the surface falls off from the tangent plane in each direction, whereas at the other points there is a circle where the tangent plane rests on the surface, but otherwise the surface falls off from the tangent plane. Characterizing this “falling off” will help to identify local maxima that are distinct.


Now consider the differentiable function \(f(x,y) = xy\), graphed below with the projections of the \(x\) and \(y\) axes:

f(x,y) = x*y
xs = ys = range(-3, 3, length=100)
surface(xs, ys, f, legend=false)

plot_parametric!(-4..4, t -> [t, 0, f(t, 0)], linewidth=5)
plot_parametric!(-4..4, t -> [0, t, f(0, t)], linewidth=5)

The extrema happen at the edges of the region. The gradient is \(\nabla{f} = \langle y, x \rangle\). This is \(\vec{0}\) only at the origin. At the origin, were we to imagine a tangent plane, the surface falls below the plane in one direction but rises above it in the other direction. Such a point is referred to as a saddle point. A saddle point for a continuous \(f:R^n \rightarrow R\) would be a critical point, \(\vec{a}\), where for any ball with non-zero radius about \(\vec{a}\) there are values where the function is greater than \(f(\vec{a})\) and values where the function is less.

To identify these through formulas, and not graphically, we could try to use the first derivative test along all paths through \(\vec{a}\), but this approach is better at showing something isn’t the case, much as two paths are used to show non-continuity.

The generalization of the second derivative test is more concrete though. Recall, the second derivative test is about the concavity of the function at the critical point. When the concavity can be determined as non-zero, the test is conclusive; when the concavity is zero, the test is not conclusive. Similarly here:

The second Partial Derivative Test for \(f:R^2 \rightarrow R\).

Assume the first and second partial derivatives of \(f\) are defined and continuous, let \(\vec{a}\) be a critical point of \(f\), let \(H\) be the Hessian matrix, \([f_{xx}\quad f_{xy};f_{xy}\quad f_{yy}]\), and let \(d = \det(H) = f_{xx} f_{yy} - f_{xy}^2\) be the determinant of the Hessian matrix. Then:

  • The function \(f\) has a local minimum at \(\vec{a}\) if \(f_{xx} > 0\) and \(d>0\),
  • The function \(f\) has a local maximum at \(\vec{a}\) if \(f_{xx} < 0\) and \(d>0\),
  • The function \(f\) has a saddle point at \(\vec{a}\) if \(d < 0\),
  • Nothing can be said if \(d=0\).

The intuition behind a proof follows. The case when \(f_{xx} > 0\) and \(d > 0\) uses a consequence of these assumptions: for any non-zero vector \(\vec{x}\) it must be that \(\vec{x}\cdot(H\vec{x}) > 0\) (positive definiteness), and the quadratic approximation \(f(\vec{a}+d\vec{x}) \approx f(\vec{a}) + \nabla{f}(\vec{a}) \cdot d\vec{x} + d\vec{x} \cdot (Hd\vec{x}) = f(\vec{a}) + d\vec{x} \cdot (Hd\vec{x})\), so for any \(d\vec{x}\) small enough, \(f(\vec{a}+d\vec{x}) \geq f(\vec{a})\). That is, \(f(\vec{a})\) is a local minimum. Similarly, a proof for the local maximum follows by considering \(-f\). Finally, if \(d < 0\), then there are vectors \(d\vec{x}\) for which \(d\vec{x} \cdot (Hd\vec{x})\) will have different signs, and along these vectors the function will be concave up/concave down.

Applying this to \(f(x,y) = xy\) at \(\vec{a} = \vec{0}\), we have \(f_{xx} = f_{yy} = 0\) and \(f_{xy} = 1\), so the determinant of the Hessian is \(-1\). By the second partial derivative test, this critical point is a saddle point, as seen in the previous graph.
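
This Hessian can also be checked numerically (the name H₀ is ours):

H₀ = ForwardDiff.hessian(v -> v[1] * v[2], [0.0, 0.0])
H₀, det(H₀)   # determinant is -1, so a saddle point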

Applying this to \(f(x,y) = e^{-(x^2 + y^2)/5} \cos(x^2 + y^2)\), we will use SymPy to compute the derivatives, as they get a bit involved:

fₖ(x,y) =  exp(-(x^2 + y^2)/5) * cos(x^2 + y^2)
Hₖ = sympy.hessian(fₖ(x,y), (x,y))

\(\left[\begin{smallmatrix}\frac{8 x^{2} e^{- \frac{x^{2}}{5} - \frac{y^{2}}{5}} \sin{\left(x^{2} + y^{2} \right)}}{5} - \frac{96 x^{2} e^{- \frac{x^{2}}{5} - \frac{y^{2}}{5}} \cos{\left(x^{2} + y^{2} \right)}}{25} - 2 e^{- \frac{x^{2}}{5} - \frac{y^{2}}{5}} \sin{\left(x^{2} + y^{2} \right)} - \frac{2 e^{- \frac{x^{2}}{5} - \frac{y^{2}}{5}} \cos{\left(x^{2} + y^{2} \right)}}{5} & \frac{8 x y e^{- \frac{x^{2}}{5} - \frac{y^{2}}{5}} \sin{\left(x^{2} + y^{2} \right)}}{5} - \frac{96 x y e^{- \frac{x^{2}}{5} - \frac{y^{2}}{5}} \cos{\left(x^{2} + y^{2} \right)}}{25}\\\frac{8 x y e^{- \frac{x^{2}}{5} - \frac{y^{2}}{5}} \sin{\left(x^{2} + y^{2} \right)}}{5} - \frac{96 x y e^{- \frac{x^{2}}{5} - \frac{y^{2}}{5}} \cos{\left(x^{2} + y^{2} \right)}}{25} & \frac{8 y^{2} e^{- \frac{x^{2}}{5} - \frac{y^{2}}{5}} \sin{\left(x^{2} + y^{2} \right)}}{5} - \frac{96 y^{2} e^{- \frac{x^{2}}{5} - \frac{y^{2}}{5}} \cos{\left(x^{2} + y^{2} \right)}}{25} - 2 e^{- \frac{x^{2}}{5} - \frac{y^{2}}{5}} \sin{\left(x^{2} + y^{2} \right)} - \frac{2 e^{- \frac{x^{2}}{5} - \frac{y^{2}}{5}} \cos{\left(x^{2} + y^{2} \right)}}{5}\end{smallmatrix}\right]\)

This is messy, but we only consider it at critical points. The point \((0,0)\) is graphically a local maximum. We can see from the Hessian that the second partial derivative test will give the same characterization:

H₀₀ = subs.(Hₖ, x=>0, y=>0)

\(\left[\begin{smallmatrix}- \frac{2}{5} & 0\\0 & - \frac{2}{5}\end{smallmatrix}\right]\)

Which satisfies:

H₀₀[1,1] < 0 && det(H₀₀) > 0
true

Now consider \(\vec{a} = \langle \sqrt{2\pi + \tan^{-1}(-1/5)}, 0 \rangle\), a point on the first visible ring on the graph. The gradient vanishes here:

gradfₖ = diff.(fₖ(x,y), [x,y])
a = [sqrt(2PI + atan(-Sym(1)//5)), 0]
subs.(gradfₖ, x => a[1], y => a[2])

\(\left[\begin{smallmatrix}0\\0\end{smallmatrix}\right]\)

But the test is inconclusive, as the determinant of the Hessian is \(0\):

a = [sqrt(2PI + atan(-Sym(1)//5)), 0]
H_a = subs.(Hₖ, x => a[1], y => a[2])
det(H_a)

\(0\)

(The test is inconclusive, as it needs the function to “fall away” from the tangent plane in all directions; in this case, along a circular curve the function touches the tangent plane, so it doesn’t fall away.)

Example

Characterize the critical points of \(f(x,y) = 4xy - x^4 - y^4\).

The critical points may be found by solving when the gradient is \(\vec{0}\):

fⱼ(x,y) = 4x*y - x^4 - y^4
gradfⱼ = diff.(fⱼ(x,y), [x,y])

\(\left[\begin{smallmatrix}- 4 x^{3} + 4 y\\4 x - 4 y^{3}\end{smallmatrix}\right]\)

all_ptsⱼ = solve(gradfⱼ, [x,y])
ptsⱼ = filter(u -> all(isreal.(u)), all_ptsⱼ)
3-element Vector{Tuple{Sym{PyCall.PyObject}, Sym{PyCall.PyObject}}}:
 (-1, -1)
 (0, 0)
 (1, 1)

There are \(3\) real critical points. To classify them we need the sign of \(f_{xx}\) and the determinant of the Hessian. We make a simple function to compute these, then apply it to each point using a comprehension:

Hⱼ = sympy.hessian(fⱼ(x,y), (x,y))
function classify(H, pt)
  Ha = subs.(H, x => pt[1], y => pt[2])
  (det=det(Ha), f_xx=Ha[1,1])
end
[classify(Hⱼ, pt) for pt in ptsⱼ]
3-element Vector{@NamedTuple{det::Sym{PyCall.PyObject}, f_xx::Sym{PyCall.PyObject}}}:
 (det = 128, f_xx = -12)
 (det = -16, f_xx = 0)
 (det = 128, f_xx = -12)

We see the first and third points have positive determinant and negative \(f_{xx}\), so they are relative maxima, and the second point has negative determinant, so it is a saddle point. We graphically confirm this:

xs = ys = range(-3/2, 3/2, length=100)
p = surface(xs, ys, fⱼ, legend=false)
for pt ∈ ptsⱼ
    scatter!(p, unzip([N.([pt...,fⱼ(pt...)])])...,
             markercolor=:black, markersize=5)  # add each pt on surface
end
p
Example

Consider the function \(f(x,y) = x^2 + 2y^2 - x\) over the region \(x^2 + y^2 \leq 1\). This is a continuous function over a closed set, so it will have both an absolute maximum and minimum. Find these from an investigation of the critical points and the boundary points.

The gradient is easily found: \(\nabla{f} = \langle 2x - 1, 4y \rangle\), and is \(\vec{0}\) only at \(\vec{a} = \langle 1/2, 0 \rangle\). The Hessian is:

\[ H = \begin{bmatrix} 2 & 0\\ 0 & 4 \end{bmatrix}. \]

At \(\vec{a}\) this has positive determinant and \(f_{xx} > 0\), so \(\vec{a}\) corresponds to a local minimum with value \(f(\vec{a}) = (1/2)^2 + 2(0) - 1/2 = -1/4\). The absolute maximum and minimum may occur here (well, not the maximum) or on the boundary, so that must be considered. In this case we can easily parameterize the boundary and turn this into the univariate case:

fₗ(x,y) = x^2 + 2y^2 - x
fₗ(v) = fₗ(v...)
gammaₗ(t) = [cos(t), sin(t)]  # traces out x^2 + y^2 = 1 over [0, 2pi]
gₗ = fₗ  gammaₗ

cpsₗ = find_zeros(gₗ', 0, 2pi) # critical points of g
append!(cpsₗ, [0, 2pi])
unique!(cpsₗ)
gₗ.(cpsₗ)
5-element Vector{Float64}:
 0.0
 2.25
 2.0
 2.25
 0.0

We see that the maximum value is 2.25 and that the interior point, \(\vec{a}\), is where the minimum value occurs. To see exactly where the maximum occurs, we look at the corresponding critical points:

inds = [2,4]
cpsₗ[inds]
2-element Vector{Float64}:
 2.0943951023931953
 4.1887902047863905

These are multiples of \(\pi\):

cpsₗ[inds]/pi
2-element Vector{Float64}:
 0.6666666666666666
 1.3333333333333333

So we have the maximum occurs at the angles \(2\pi/3\) and \(4\pi/3\). Here we visualize, using a hacky trick of assigning NaN values to the function to avoid plotting outside the circle:

hₗ(x,y) = fₗ(x,y) * (x^2 + y^2 <= 1 ? 1 : NaN)
hₗ (generic function with 1 method)
xs = ys = range(-1,1, length=100)
surface(xs, ys, hₗ)

ts = cpsₗ[inds]  # 2pi/3 and 4pi/3 by above
xs, ys = cos.(ts), sin.(ts)
zs = fₗ.(xs, ys)
scatter3d!(xs, ys, zs)

A contour plot also shows that one extremum - and only one - happens in the interior:

xs = ys = range(-1,1, length=100)
contour(xs, ys, hₗ)

The extrema are identified by the enclosing regions, in this case the one around the point \((1/2, 0)\).

Example: Steiner’s problem

This is from Strang p 506.

We have three points in the plane, \((x_1, y_1)\), \((x_2, y_2)\), and \((x_3,y_3)\). A point \(p=(p_x, p_y)\) will have \(3\) distances to them, \(d_1\), \(d_2\), and \(d_3\). Broadly speaking, we want to minimize some combination of these distances to find the point \(p\) “nearest” the three fixed points. Locating a facility so that it can service \(3\) separate cities might be one application. The answer depends on the notion of what measure of distance to use.

If the measure is the Euclidean distance, then \(d_i^2 = (p_x - x_i)^2 + (p_y - y_i)^2\). If we sought to minimize \(d_1^2 + d_2^2 + d_3^2\), then we would proceed as follows:

@syms x1 y1 x2 y2 x3 y3
d2(p,x) = (p[1] - x[1])^2 + (p[2]-x[2])^2
d2_1, d2_2, d2_3 = d2((x,y), (x1, y1)), d2((x,y), (x2, y2)), d2((x,y), (x3, y3))
exₛ = d2_1 + d2_2 + d2_3

\(\left(x - x_{1}\right)^{2} + \left(x - x_{2}\right)^{2} + \left(x - x_{3}\right)^{2} + \left(y - y_{1}\right)^{2} + \left(y - y_{2}\right)^{2} + \left(y - y_{3}\right)^{2}\)

We then find the gradient, and solve for when it is \(\vec{0}\):

gradfₛ = diff.(exₛ, [x,y])
xstarₛ = solve(gradfₛ, [x,y])

\[\begin{equation*}\begin{cases}x & \text{=>} &\frac{x_{1}}{3} + \frac{x_{2}}{3} + \frac{x_{3}}{3}\\y & \text{=>} &\frac{y_{1}}{3} + \frac{y_{2}}{3} + \frac{y_{3}}{3}\\\end{cases}\end{equation*}\]

There is only one critical point, so it must be a minimum.

We confirm this by looking at the Hessian and noting \(H_{11} > 0\) (and the determinant is positive):

Hₛ = subs.(hessian(exₛ, [x,y]), x=>xstarₛ[x], y=>xstarₛ[y])

\(\left[\begin{smallmatrix}6 & 0\\0 & 6\end{smallmatrix}\right]\)

As it occurs at \((\bar{x}, \bar{y})\) where \(\bar{x} = (x_1 + x_2 + x_3)/3\) and \(\bar{y} = (y_1+y_2+y_3)/3\) - the averages of the three values - the critical point is an interior point of the triangle.

As mentioned by Strang, the real problem is to minimize \(d_1 + d_2 + d_3\). A direct approach with SymPy - just replacing d2 above with the square root - fails. Consider instead the gradient of \(d_1\), say. To avoid square roots, this is taken implicitly from \(d_1^2\):

\[ \frac{\partial}{\partial{x}}(d_1^2) = 2 d_1 \frac{\partial{d_1}}{\partial{x}}. \]

But computing directly from the expression yields \(2(x - x_1)\). Solving yields:

\[ \frac{\partial{d_1}}{\partial{x}} = \frac{(x-x_1)}{d_1}, \quad \frac{\partial{d_1}}{\partial{y}} = \frac{(y-y_1)}{d_1}. \]

The gradient is then \((\vec{p} - \vec{x}_1)/\|\vec{p} - \vec{x}_1\|\), a unit vector, call it \(\hat{u}_1\). Similarly for \(\hat{u}_2\) and \(\hat{u}_3\).

Let \(f = d_1 + d_2 + d_3\). Then \(\nabla{f} = \hat{u}_1 + \hat{u}_2 + \hat{u}_3\). At the minimum, the gradient is \(\vec{0}\), so the three unit vectors must cancel. This can only happen if the three make a “peace” sign with angles \(120^\circ\) between them. To find the minimum within the triangle, then, this point must be considered along with the boundary, for when this point falls outside the triangle.
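
As a quick numeric check that three unit vectors separated by \(120^\circ\) cancel (the starting angle \(\pi/7\) is an arbitrary choice of ours):

sum([cos(t), sin(t)] for t in (pi/7, pi/7 + 2pi/3, pi/7 + 4pi/3))   # ≈ the zero vector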

Here is a triangle, where the minimum would be within the triangle:

usₛ = [[cos(t), sin(t)] for t in (0, 2pi/3, 4pi/3)]
polygon(ps) = unzip(vcat(ps, ps[1:1])) # easier way to plot a polygon

pₛ = scatter([0],[0], markersize=2, legend=false, aspect_ratio=:equal)

asₛ = (1,2,3)
plot!(polygon([a*u for (a,u) in zip(asₛ, usₛ)])...)
[arrow!([0,0], a*u, alpha=0.5) for (a,u) in zip(asₛ, usₛ)]
pₛ

For this next triangle, the Steiner point falls outside of the triangle.

asₛ₁ = (1, -1, 3)
scatter([0],[0], markersize=2, legend=false)
psₛₗ = [a*u for (a,u) in zip(asₛ₁, usₛ)]
plot!(polygon(psₛₗ)...)

Let’s see where the minimum distance point is by constructing a plot. The minimum must be on the boundary, as the only point where the gradient vanishes is the origin, which is not in the triangle. The plot of the triangle has a contour plot of the distance function, so we see clearly that the minimum happens at the vertex [0.5, -0.866025]. On this plot, we drew the gradient at some points along the boundary. The gradient points in the direction of greatest increase - away from the minimum. That the gradient vectors have a non-zero projection onto the edges of the triangle pointing away from that vertex indicates that the distance function would increase if we moved along the boundary in that direction, as indeed it does.

euclid_dist(x; ps=psₛₗ) = sum(norm(x-p) for p in ps)
euclid_dist(x,y; ps=psₛₗ) = euclid_dist([x,y]; ps=ps)
euclid_dist (generic function with 2 methods)
xs = range(-1.5, 1.5, length=100)
ys = range(-3, 1.0, length=100)

p = plot(polygon(psₛₗ)..., linewidth=3, legend=false)
scatter!(p, unzip(psₛₗ)..., markersize=3)
contour!(p, xs, ys, euclid_dist)

# add some gradients along boundary
li(t, p1, p2) = p1 + t*(p2-p1)  # t in [0,1]
for t in range(1/100, 1/2, length=3)
    pt = li(t, psₛₗ[2], psₛₗ[3])
    arrow!(pt, ForwardDiff.gradient(euclid_dist, pt))
    pt = li(t, psₛₗ[2], psₛₗ[1])
    arrow!(pt, ForwardDiff.gradient(euclid_dist, pt))
end

p

The following graph, shows distance along each edge:

li(t, p1, p2) = p1 + t*(p2-p1)
p = plot(legend=false)
for i in 1:2, j in (i+1):3
  plot!(p, t -> euclid_dist(li(t, psₛₗ[i], psₛₗ[j]); ps=psₛₗ), 0, 1)
end
p

The smallest value is when \(t=0\) or \(t=1\), so at one of the vertices, given how li is defined above.

Example: least squares

We know that two points determine a line. What happens when there are more than two points? This is common in statistics where a bivariate data set (pairs of points \((x,y)\)) is summarized through a linear model \(\mu_{y|x} = \alpha + \beta x\). That is, the average value for \(y\) given a particular \(x\) value is given through the equation of a line. The data is used to identify what the slope and intercept are for this line. We consider a simple case - \(3\) points - with the case of \(n > 3\) being similar.

We have a line \(l(x) = \alpha + \beta x\) and three points \((x_1, y_1)\), \((x_2, y_2)\), and \((x_3, y_3)\). Unless these three points happen to be collinear, they can’t possibly all lie on the same line. So to approximate a relationship by a line requires some inexactness. One measure of inexactness is the total vertical distance to the line:

\[ d1(\alpha, \beta) = |y_1 - l(x_1)| + |y_2 - l(x_2)| + |y_3 - l(x_3)|. \]

Another might be the vertical squared distance to the line:

\[ \begin{align*} d2(\alpha, \beta) &= (y_1 - l(x_1))^2 + (y_2 - l(x_2))^2 + (y_3 - l(x_3))^2 \\ &= (y_1 - (\alpha + \beta x_1))^2 + (y_2 - (\alpha + \beta x_2))^2 + (y_3 - (\alpha + \beta x_3))^2 \end{align*} \]

Another might be the shortest distance to the line:

\[ d3(\alpha, \beta) = \frac{|\beta x_1 - y_1 + \alpha|}{\sqrt{1 + \beta^2}} + \frac{|\beta x_2 - y_2 + \alpha|}{\sqrt{1 + \beta^2}} + \frac{|\beta x_3 - y_3 + \alpha|}{\sqrt{1 + \beta^2}}. \]

The method of least squares minimizes the second one of these. That is, it chooses \(\alpha\) and \(\beta\) that make the expression a minimum.

@syms xₗₛ[1:3] yₗₛ[1:3] α β
li(x, alpha, beta) =  alpha + beta * x
d₂(alpha, beta) = sum((y - li(x, alpha, beta))^2 for (y,x) in zip(yₗₛ, xₗₛ))
d₂(α, β)

\(\left(- xₗₛ₁ β + yₗₛ₁ - α\right)^{2} + \left(- xₗₛ₂ β + yₗₛ₂ - α\right)^{2} + \left(- xₗₛ₃ β + yₗₛ₃ - α\right)^{2}\)

To identify \(\alpha\) and \(\beta\) we find the gradient:

grad_d₂ = diff.(d₂(α, β), [α, β])

\(\left[\begin{smallmatrix}2 xₗₛ₁ β + 2 xₗₛ₂ β + 2 xₗₛ₃ β - 2 yₗₛ₁ - 2 yₗₛ₂ - 2 yₗₛ₃ + 6 α\\- 2 xₗₛ₁ \left(- xₗₛ₁ β + yₗₛ₁ - α\right) - 2 xₗₛ₂ \left(- xₗₛ₂ β + yₗₛ₂ - α\right) - 2 xₗₛ₃ \left(- xₗₛ₃ β + yₗₛ₃ - α\right)\end{smallmatrix}\right]\)

outₗₛ = solve(grad_d₂, [α, β])

\[\begin{equation*}\begin{cases}β & \text{=>} &\frac{2 xₗₛ₁ yₗₛ₁ - xₗₛ₁ yₗₛ₂ - xₗₛ₁ yₗₛ₃ - xₗₛ₂ yₗₛ₁ + 2 xₗₛ₂ yₗₛ₂ - xₗₛ₂ yₗₛ₃ - xₗₛ₃ yₗₛ₁ - xₗₛ₃ yₗₛ₂ + 2 xₗₛ₃ yₗₛ₃}{2 xₗₛ₁^{2} - 2 xₗₛ₁ xₗₛ₂ - 2 xₗₛ₁ xₗₛ₃ + 2 xₗₛ₂^{2} - 2 xₗₛ₂ xₗₛ₃ + 2 xₗₛ₃^{2}}\\α & \text{=>} &\frac{xₗₛ₁^{2} yₗₛ₂ + xₗₛ₁^{2} yₗₛ₃ - xₗₛ₁ xₗₛ₂ yₗₛ₁ - xₗₛ₁ xₗₛ₂ yₗₛ₂ - xₗₛ₁ xₗₛ₃ yₗₛ₁ - xₗₛ₁ xₗₛ₃ yₗₛ₃ + xₗₛ₂^{2} yₗₛ₁ + xₗₛ₂^{2} yₗₛ₃ - xₗₛ₂ xₗₛ₃ yₗₛ₂ - xₗₛ₂ xₗₛ₃ yₗₛ₃ + xₗₛ₃^{2} yₗₛ₁ + xₗₛ₃^{2} yₗₛ₂}{2 xₗₛ₁^{2} - 2 xₗₛ₁ xₗₛ₂ - 2 xₗₛ₁ xₗₛ₃ + 2 xₗₛ₂^{2} - 2 xₗₛ₂ xₗₛ₃ + 2 xₗₛ₃^{2}}\\\end{cases}\end{equation*}\]

As found, the formulas aren’t pretty. If \(x_1 + x_2 + x_3 = 0\) they simplify. For example:

subs(outₗₛ[β], sum(xₗₛ) => 0)

\(\frac{2 xₗₛ₁ yₗₛ₁ - xₗₛ₁ yₗₛ₂ - xₗₛ₁ yₗₛ₃ - xₗₛ₂ yₗₛ₁ + 2 xₗₛ₂ yₗₛ₂ - xₗₛ₂ yₗₛ₃ - xₗₛ₃ yₗₛ₁ - xₗₛ₃ yₗₛ₂ + 2 xₗₛ₃ yₗₛ₃}{2 xₗₛ₁^{2} - 2 xₗₛ₁ xₗₛ₂ - 2 xₗₛ₁ xₗₛ₃ + 2 xₗₛ₂^{2} - 2 xₗₛ₂ xₗₛ₃ + 2 xₗₛ₃^{2}}\)

Let \(\vec{x} = \langle x_1, x_2, x_3 \rangle\) and \(\vec{y} = \langle y_1, y_2, y_3 \rangle\); then this is simply \((\vec{x} \cdot \vec{y})/(\vec{x}\cdot \vec{x})\), a formula that will generalize to \(n > 3\). The assumption is not a restriction - it comes about by subtracting the mean, \(\bar{x} = (x_1 + x_2 + x_3)/3\), from each \(x\) term (and similarly subtracting \(\bar{y}\) from each \(y\) term), a process called “centering.”

With this observation, the formulas can be re-expressed through:

\[ \beta = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sum(x_i-\bar{x})^2}, \quad \alpha = \bar{y} - \beta \bar{x}. \]

Relative to the centered values, this may be viewed as a line through \((\bar{x}, \bar{y})\) with slope given by \((\vec{x}-\bar{x})\cdot(\vec{y}-\bar{y}) / \|\vec{x}-\bar{x}\|^2\).

As an example, if the points are \((1,1), (2,3), (5,8)\) we get:

[k => subs(v, xₗₛ[1]=>1, yₗₛ[1]=>1, xₗₛ[2]=>2, yₗₛ[2]=>3,
           xₗₛ[3]=>5, yₗₛ[3]=>8) for (k,v) in outₗₛ]
2-element Vector{Pair{Sym{PyCall.PyObject}, Sym{PyCall.PyObject}}}:
 β => 45/26
 α => -8/13
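
The centered formulas reproduce these values numerically (a quick check; the variable names are ours):

xdata, ydata = [1, 2, 5], [1, 3, 8]
x̄, ȳ = sum(xdata)/3, sum(ydata)/3
β̂ = sum((xdata .- x̄) .* (ydata .- ȳ)) / sum((xdata .- x̄).^2)
α̂ = ȳ - β̂ * x̄
(β̂, 45/26), (α̂, -8/13)   # the two pairs should agree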

57.4.1 Gradient descent

As seen in the examples above, extrema may be identified analytically by solving for when the gradient is \(0\). Here we discuss some numeric algorithms for finding extrema.

An algorithm to identify where a surface is at its minimum is gradient descent. The gradient points in the direction of the steepest ascent of the surface and the negative gradient in the direction of the steepest descent. To move toward a minimum then, it makes intuitive sense to move in the direction of the negative gradient. How far? That is a different question, and one with different answers. Let’s formulate the movement first, then discuss how far.

Let \(\vec{x}_0\), \(\vec{x}_1\), \(\dots\), \(\vec{x}_n\) be the positions of the algorithm over \(n\) steps starting from an initial point \(\vec{x}_0\). The difference between successive points is given by:

\[ \vec{x}_{n+1} = \vec{x}_n - \gamma \nabla{f}(\vec{x}_n), \]

where \(\gamma\) is some scaling factor for the gradient. The above quantifies the idea: to go from \(\vec{x}_n\) to \(\vec{x}_{n+1}\), move along \(-\nabla{f}\) by a certain amount.

Let \(\Delta_x =\vec{x}_{n}- \vec{x}_{n-1}\) and \(\Delta_y = \nabla{f}(\vec{x}_{n}) - \nabla{f}(\vec{x}_{n-1})\). A variant of the Barzilai-Borwein method is to take \(\gamma_n = \left| \frac{\Delta_x \cdot \Delta_y}{\Delta_y \cdot \Delta_y} \right|\).

To illustrate, take \(f(x,y) = - e^{-((x-1)^2 + 2(y-1/2)^2)}\) and a starting point \(\langle 0, 0 \rangle\). Starting with \(\gamma_0 = 1\), there are \(5\) steps taken:

f₂(x,y) = -exp(-((x-1)^2 + 2(y-1/2)^2))
f₂(x) = f₂(x...)

xs₂ = [[0.0, 0.0]] # we store a vector
gammas₂ = [1.0]

for n in 1:5
    xn = xs₂[end]
    gamma₀ = gammas₂[end]
    xn1 = xn - gamma₀ * gradient(f₂)(xn)
    dx, dy = xn1 - xn, gradient(f₂)(xn1) - gradient(f₂)(xn)
    gamman1 = abs( (dx ⋅ dy) / (dy ⋅ dy) )

    push!(xs₂, xn1)
    push!(gammas₂, gamman1)
end

[(x, f₂(x)) for x in xs₂]
6-element Vector{Tuple{Vector{Float64}, Float64}}:
 ([0.0, 0.0], -0.22313016014842982)
 ([0.44626032029685964, 0.44626032029685964], -0.7316862045596354)
 ([0.5719399641782019, 0.4706543959065717], -0.8311394210020312)
 ([1.3127598757955443, 0.5722280351701136], -0.8974009578884475)
 ([0.9982224839581173, 0.4269509740243237], -0.9893813007474934)
 ([0.9996828943781475, 0.5469853998120562], -0.9955943772073014)

We now visualize, using the Contour package to draw the contour lines in the \(x-y\) plane:

function surface_contour(xs, ys, f; offset=0)
  p = surface(xs, ys, f, legend=false, fillalpha=0.5)

  ## we add to the graphic p, then plot
  zs = [f(x,y) for x in xs, y in ys]  # reverse order for use with Contour package
  for cl in levels(contours(xs, ys, zs))
    lvl = level(cl) # the z-value of this contour level
    for line in lines(cl)
        _xs, _ys = coordinates(line) # coordinates of this line segment
        _zs = offset * _xs
        plot!(p, _xs, _ys, _zs, alpha=0.5)        # add curve on x-y plane
    end
  end
  p
end


offset = 0
us = vs = range(-1, 2, length=100)
surface_contour(us, vs, f₂, offset=offset)
pts = [[pt..., offset] for pt in xs₂]
scatter3d!(unzip(pts)...)
plot!(unzip(pts)..., linewidth=3)

57.4.2 Newton’s method for minimization

A variant of Newton’s method can be used to minimize a function \(f:R^2 \rightarrow R\). We look for points where both partial derivatives of \(f\) vanish. Let \(g(x,y) = \partial f/\partial x(x,y)\) and \(h(x,y) = \partial f/\partial y(x,y)\). Then applying Newton’s method, as above to solve simultaneously for when \(g=0\) and \(h=0\), we considered this matrix:

\[ M = [\nabla{g}'; \nabla{h}'], \]

and had a step expressible in terms of the inverse of \(M\) as \(M^{-1} [g; h]\). In terms of the function \(f\), this step is \(H^{-1}\nabla{f}\), where \(H\) is the Hessian matrix. Newton’s method then becomes:

\[ \vec{x}_{n+1} = \vec{x}_n - [H_f(\vec{x}_n)]^{-1} \nabla(f)(\vec{x}_n). \]

The Wikipedia page states that, where applicable, Newton’s method converges much faster towards a local maximum or minimum than gradient descent.
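
Here is a minimal sketch of a single such step (the helper name and test point are ours; f₂ is the function from the gradient descent illustration above):

function newton_minimization_step(f, xn)
    H  = ForwardDiff.hessian(f, xn)    # Hessian at the current point
    ∇f = ForwardDiff.gradient(f, xn)   # gradient at the current point
    xn - H \ ∇f                        # solve H*Δ = ∇f rather than forming the inverse
end

newton_minimization_step(f₂, [0.8, 0.4])   # one step moves toward the minimizer near (1, 1/2)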

We apply it to the task of characterizing the following function, which has a few different peaks over the region \([-3,3] \times [-2,2]\):

function peaks(x, y)
    z = 3 * (1 - x)^2 * exp(-x^2 - (y + 1)^2)
    z += -10 * (x / 5 - x^3 - y^5) * exp(-x^2 - y^2)
    z += -1/3 * exp(-(x+1)^2 - y^2)
    return z
end
peaks(v) = peaks(v...)
peaks (generic function with 2 methods)
xs = range(-3, stop=3, length=100)
ys = range(-2, stop=2, length=100)
Ps = surface(xs, ys, peaks, legend=false)
Pc = contour(xs, ys, peaks, legend=false)
plot(Ps, Pc, layout=2) # combine plots