Num | Date | Summary |
---|---|---|
11 | 30.Sep | We covered various parts of the material contained on pages 1 to 26 of the notes on differential equations. Specifically: We discussed first-order and second-order differential equations. We discussed the analytic homogeneous solutions one obtains when those equations are linear. We considered some numerical methods for solving first-order initial-value problems, including Euler's method, two of the Runge-Kutta methods, and two of the Adams-Bashforth methods. We discussed local error, noting that Euler's method has a local error proportional to h², with h being the step size, while Runge-Kutta 4 has a local error proportional to h⁵, as does the Adams-Bashforth method of order 4. (A short Python sketch of the Euler and Runge-Kutta 4 update rules appears after this table.) |
12 | 2.Oct | Today we began our discussion of optimization. Throughout, we were working with real-valued functions defined on Euclidean space (or possibly some convex subdomain of Euclidean space). We assumed that all our functions had continuous second partial derivatives. We defined a function's gradient ∇f and Hessian [∇²f]. The gradient is the vector (∂f/∂xᵢ) of first partial derivatives of f. The Hessian is the matrix [∂²f/∂xᵢ∂xⱼ] of second partial derivatives of f. The Hessian is symmetric since f has continuous second partials. We also defined the critical points of a function as the set of points where the function's gradient is zero. Much as in one-dimensional high-school calculus, if a function f has a local minimum at point x*, then the gradient ∇f(x*) at that point must be zero and the Hessian [∇²f](x*) at that point must be symmetric positive semi-definite. Conversely, if ∇f(x*) is zero and [∇²f](x*) is symmetric positive definite, then x* is a strict local minimum. We looked at a 2D example: f(x,y) = xy. This function has a single critical point, namely the origin. The shape of the function there is what is known as a saddle point, meaning f(x,y) appears to have a local minimum along one direction and a local maximum along another direction. We could see this by looking at the directions of the gradient of f near the critical point, or by looking at the Hessian, in particular its eigenvalues (here +1 and −1, with eigenvectors along the directions y = x and y = −x). (A rough picture shown in lecture illustrated the saddle, with black vectors drawn perpendicular to the surface.) We reviewed Golden Section Search for 1D optimization, as a generalization of bisection for root finding. (That technique winds up being useful for line optimizations in arbitrary dimensions; a minimal Python sketch appears after this table.) We discussed the significance of positive semi-definite matrices, observing that these characterize convex functions. We proved that every local minimum of a convex function is a global minimum and that the set of all points at which the function attains this minimum is a convex set. We discussed Gradient Descent (also known as Steepest Descent) for finding local minima. We noted the potential for zigzag behavior in ellipsoidal troughs of iso-contours. We observed that the shape of those troughs depends on the eigenvalues of the Hessian. This suggested turning ellipsoidal troughs into circles or spheres, something we will investigate further next time. We mentioned Newton's Method in the context of optimization. |
13 | 7.Oct | Last time we saw that Gradient Descent has the potential for zigzag behavior in ellipsoidal troughs of iso-contours. The shape of the troughs depends on the eigenvalues of the Hessian of the function being optimized. Today we thought about choosing descent directions more carefully, so as to reduce such zigzagging. Near a local optimum a function looks much like a quadratic (by Taylor's Theorem), so it is convenient to focus on purely quadratic functions as models in developing an algorithm that mitigates zigzagging. The algorithm we derived is called the Conjugate Gradient Algorithm. It chooses descent directions in a manner that effectively turns ellipsoidal troughs into circles or spheres (for the purely quadratic case). Recall that a purely quadratic function is of the form f(x) = ½ xᵀQx + bᵀx + c, with c a real number, both b and x real n-dimensional vectors, and Q a real symmetric positive-definite n×n matrix. There are several magical steps in deriving the Conjugate Gradient Algorithm, meaning that doing something local has a global effect. A first piece of magic is the simple fact that we can view a global optimization as a collection of local projections, using the inner product defined by the second derivative matrix Q, namely ⟨u, v⟩ = uᵀQv. A second instance of magic occurs in the Expanding Subspace Theorem: Doing a line optimization in that particular setting actually ensures that the function attains an optimum over an entire affine subspace. This magic is a consequence of function convexity and Q-orthogonality of the descent directions. A third instance of magic appears in the Conjugate Gradient Theorem, in defining the descent directions. Even though we pick the next direction merely to be Q-orthogonal to a single previous direction, it actually turns out to be Q-orthogonal to all (other) directions. Details: Recall (see notes for a proof) that a set of nonzero pairwise Q-orthogonal vectors is linearly independent, and so can form a basis for its span. That idea appears throughout the development of the Conjugate Gradient Algorithm. We used this independence to express the vector x* − x₀ in terms of projections onto a Q-orthogonal basis {d₀, ..., dₙ₋₁}. Here x* is the location of function f's global optimum and x₀ is the (arbitrary) starting point of the optimization process. Specifically: x* − x₀ = Σ αᵢdᵢ, with αᵢ = ⟨x* − x₀, dᵢ⟩ / ⟨dᵢ, dᵢ⟩ = dᵢᵀQ(x* − x₀) / dᵢᵀQdᵢ. In an actual optimization, x* is unknown. We need to replace x* with some computable quantity. For the pure quadratic, we know that x* satisfies Qx* = −b. That means we can simplify the numerator of the projection coefficients αᵢ from dᵢᵀQ(x* − x₀) to −dᵢᵀ(b + Qx₀). Observe that b + Qx₀ is the gradient of f evaluated at the starting point x₀. In fact, in the formula for αᵢ, one can replace b + Qx₀ with b + Qxᵢ, the gradient of f evaluated at the point xᵢ. Here xᵢ = xᵢ₋₁ + αᵢ₋₁dᵢ₋₁, starting from x₀. That idea is useful in generalizing the algorithm from pure quadratics to other functions. So: αᵢ = −dᵢᵀgᵢ / dᵢᵀQdᵢ, with gᵢ = b + Qxᵢ being the gradient of f at xᵢ. Observe how these equations let us view coordinate projections as steps in an optimization: A typical step is from xᵢ to xᵢ₊₁, by moving along direction dᵢ with a scale factor of αᵢ. Formally, we saw this in the Expanding Subspace Theorem, which states that each of the αᵢ may be computed via a line optimization, and that doing so actually optimizes f over an entire affine subspace (the starting point x₀ plus the span of all descent directions encountered thus far), not merely over a line. This was an example of leveraging a local computation for a global effect. (Another example occurs in choosing the descent directions. One computes dᵢ₊₁ = −gᵢ₊₁ + βᵢdᵢ, with βᵢ chosen so that ⟨dᵢ, dᵢ₊₁⟩ = 0. Although a local computation, one magically obtains ⟨dᵢ, dⱼ⟩ = 0 for all distinct i and j.) Consequently, for a pure quadratic function, one finds the global optimum after n steps, with n the dimension of the space. For general functions, the Conjugate Gradient algorithm repeatedly performs packets of such n-step descents. Each descent step is computed by doing a one-dimensional line optimization. (A short Python sketch of the pure-quadratic iteration appears after this table.) At the end of lecture, we briefly looked at an example in which the conjugate gradient algorithm converged faster than one might have expected based on the discussion above. The reason had to do with a symmetry in the Hessian of the function. |
14 | 9.Oct | We discussed the first- and second-order necessary conditions for optimality in constrained optimization, using Lagrange Multipliers. We worked a simple example. Subsequently, we started our discussion of the Calculus of Variations. We derived the basic Euler-Lagrange Equation. As an example, we showed that the shortest curve between two points in the plane is a straight line (this computation is written out after this table). We considered a constrained version of this problem in which one seeks a shortest curve with a given area below the curve. The curve, when it exists, is now a circular arc. We solved this problem by extending the Calculus of Variations to include Lagrange multipliers. We further observed that the dual problem, in which one seeks to maximize the area below a curve of fixed length, gives rise to essentially the same Euler-Lagrange Equation as the original constrained problem, and thus has the same type of optimizing curve, when an optimum exists. In the context of the Calculus of Variations, we observed that some cost functions do not have optimal solutions that are twice differentiable. In the constrained problem mentioned above this can happen when the endpoint and area (or length) conditions are incompatible. In that situation, one may be able to interpret the optimal curve as a circular arc with jump discontinuities at the endpoints. |
15 | 21.Oct | We surveyed various generalizations of the so-called Simplest Problem in the Calculus of Variations that we had discussed last time. In particular, we considered optimization with free boundaries, surface integrals, multiple dependent variables, and higher-order derivatives. We sketched examples for two of these generalizations, as follows: As an example of a problem with a free boundary, we worked through key parts of a brachistochrone example: finding the shape of a ski slope that minimizes horizontal traversal time for a skier whose motion is determined by gravity. We mentioned that when the cost integrand F(x,y,y′) does not depend directly on x, it can be helpful to replace the Euler-Lagrange Equation with the following differential equation, called the Beltrami identity (see page 26 of the Calculus of Variations notes for a derivation; a derivation sketch also appears after this table): F − y′ ∂F/∂y′ = C, with C a constant. We also considered the case of a free boundary point specified merely to lie on a curve given implicitly by an equation of the form g(x,y) = 0. In that case the relevant endpoint condition becomes (F − y′ ∂F/∂y′) ∂g/∂y = (∂F/∂y′) ∂g/∂x at that boundary point. The key steps in a derivation of this condition appear in the notes. As an example of a problem involving a surface integral, we found the shape of a soap film hanging from a ring, in which the potential energy of stretching counteracts the potential energy of gravity. The film seeks an equilibrium shape at which the net potential energy is a minimum (or, more precisely, at which the potential energy is stationary, meaning differential perturbations in the soap film's shape do not change the potential energy). The basic technique is very similar to what we did originally when computing a cost as an integral over a one-dimensional interval (as for the shortest path problem), except now we compute a cost by integrating over a two-dimensional region. In order to obtain an appropriate Euler-Lagrange Equation, one needs a generalization of integration by parts to higher dimensions. That generalization is Stokes' Theorem, which we mentioned but did not discuss in detail. The notes contain a description of a two-dimensional instance of Stokes' Theorem also known as Green's Theorem (see pages 33 and 34). |
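Referenced from lecture 11 above: a minimal Python sketch of the Euler and classical Runge-Kutta 4 update rules for a first-order initial-value problem y′ = f(t, y). The function names and the test problem are illustrative assumptions, not taken from the notes.

```python
# Minimal sketch (assumed names, not from the notes) of two lecture-11
# integrators for the scalar initial-value problem y' = f(t, y).

def euler_step(f, t, y, h):
    # One Euler step; local error is proportional to h^2.
    return y + h * f(t, y)

def rk4_step(f, t, y, h):
    # One classical Runge-Kutta 4 step; local error is proportional to h^5.
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

# Illustrative test problem: y' = y with y(0) = 1, so y(1) = e.
f = lambda t, y: y
t, y_euler, y_rk4, h = 0.0, 1.0, 1.0, 0.1
for _ in range(10):            # ten steps of size 0.1 reach t = 1
    y_euler = euler_step(f, t, y_euler, h)
    y_rk4 = rk4_step(f, t, y_rk4, h)
    t += h
print(y_euler, y_rk4)          # RK4 lands much closer to e ≈ 2.71828
```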
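Referenced from lecture 12 above: a minimal golden-section search sketch in Python. The interface and tolerance are assumptions for illustration; the method presumes a unimodal function on the bracket [a, b].

```python
import math

def golden_section_search(f, a, b, tol=1e-8):
    # Shrink the bracket [a, b] by the inverse golden ratio each round,
    # keeping two interior probe points c < d.
    inv_phi = (math.sqrt(5) - 1) / 2          # ≈ 0.618
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) < f(d):                       # minimum lies in [a, d]
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:                                 # minimum lies in [c, b]
            a, c = c, d
            d = a + inv_phi * (b - a)
    return (a + b) / 2

# Illustrative use: minimize (x - 2)^2 on [0, 5]; prints ≈ 2.0.
print(golden_section_search(lambda x: (x - 2) ** 2, 0.0, 5.0))
```

(For brevity this sketch re-evaluates f at both probe points each round; a production version would reuse the surviving probe's value.)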
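Referenced from lecture 13 above: a sketch of the Conjugate Gradient iteration for the pure quadratic f(x) = ½ xᵀQx + bᵀx + c, using the summary's formulas αᵢ = −dᵢᵀgᵢ / dᵢᵀQdᵢ and dᵢ₊₁ = −gᵢ₊₁ + βᵢdᵢ. The particular βᵢ expression below is the standard choice enforcing Q-orthogonality to the previous direction; treat the code as an assumed illustration rather than the course's exact pseudocode.

```python
import numpy as np

def conjugate_gradient(Q, b, x0):
    # Minimize f(x) = 0.5 x^T Q x + b^T x + c for symmetric
    # positive-definite Q; the gradient at x is g = b + Q x.
    x = x0.astype(float)
    g = b + Q @ x
    d = -g                           # first direction: steepest descent
    for _ in range(len(b)):          # n steps suffice for a pure quadratic
        Qd = Q @ d
        alpha = -(d @ g) / (d @ Qd)  # exact line optimization along d
        x = x + alpha * d
        g = b + Q @ x                # gradient at the new point
        beta = (g @ Qd) / (d @ Qd)   # makes the next direction Q-orthogonal to d
        d = -g + beta * d
    return x

# Illustrative check: the optimum x* satisfies Q x* = -b.
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(Q, b, np.zeros(2)))  # should match:
print(np.linalg.solve(Q, -b))
```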
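Referenced from lecture 14 above: the shortest-curve example written out as a worked equation in LaTeX, using the standard form of the Euler-Lagrange Equation.

```latex
% Euler-Lagrange Equation for J[y] = \int_{x_0}^{x_1} F(x, y, y')\,dx:
\frac{\partial F}{\partial y} - \frac{d}{dx}\,\frac{\partial F}{\partial y'} = 0
% Arc length has F = \sqrt{1 + (y')^2}, which is independent of y,
% so \partial F / \partial y = 0 and the equation reduces to
\frac{d}{dx}\left(\frac{y'}{\sqrt{1 + (y')^2}}\right) = 0
\quad\Longrightarrow\quad y' \text{ is constant,}
% i.e., the shortest curve between the two endpoints is a straight line.
```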
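Referenced from lecture 15 above: a standard derivation sketch of the Beltrami identity in LaTeX (the Calculus of Variations notes, page 26, give a full derivation).

```latex
% Differentiate F - y' F_{y'} along a curve y(x), with F = F(x, y, y'):
\frac{d}{dx}\left(F - y' F_{y'}\right)
  = F_x + F_y y' + F_{y'} y'' - y'' F_{y'} - y' \frac{d}{dx} F_{y'}
  = F_x + y'\left(F_y - \frac{d}{dx} F_{y'}\right)
% The Euler-Lagrange Equation makes the final parenthesis vanish, so if
% F does not depend directly on x (F_x = 0), then along any solution
F - y' \frac{\partial F}{\partial y'} = C, \qquad C \text{ constant.}
```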