Appendix A
Gradient-Based Optimization Methods

In this appendix, a few popular gradient-based optimization methods are outlined. In addition, a simple heuristic technique is described, which is by default used in the experimental software implementation to locate a feasible region in parameter space for further optimization by one of the other methods.

A.1 Alternatives for Steepest Descent

A practical introduction to the methods described in this section can be found in [17], as well as in many other books, so we will only add a few notes.

The simplest gradient-based optimization scheme is the steepest descent method. In the present software implementation, more efficient methods are provided, among them the Fletcher-Reeves and Polak-Ribière conjugate gradient optimization methods [16]. A detailed discussion of these methods, especially w.r.t. the 1-dimensional search, would lead too far beyond the presentation of basic modelling principles, and would in fact require a rather extensive introduction to general gradient-based optimization theory. However, a few remarks on the algorithmic part may be useful to give an idea of the structure and (lack of) complexity of the method. Basically, conjugate gradient defines subsequent search directions $s$ by

$$ s^{(k+1)} = -g^{(k+1)} + \beta^{(k)} s^{(k)} \tag{A.1} $$

where the superscript indicates the iteration count. Here $g$ is the gradient of an error or cost function $E$ which has to be minimized by choosing suitable parameters $p$; $g = \nabla E$, or in terms of the notation used before, $g = \frac{\partial E}{\partial p^T}$. If $\beta^{(k)} = 0 \;\forall k$, this scheme corresponds to steepest descent with learning rate $\eta = 1$ and momentum $\mu = 0$, see Eq. (3.24). With conjugate gradient, however, generally only $\beta^{(0)} = 0$, and with the Fletcher-Reeves scheme, for $k = 1, 2, \ldots$,

$$ \beta^{(k)} = \frac{g^{(k+1)T} g^{(k+1)}}{g^{(k)T} g^{(k)}} \tag{A.2} $$

while the Polak-Ribiere scheme involves

$$ \beta^{(k)} = \frac{\left( g^{(k+1)} - g^{(k)} \right)^T g^{(k+1)}}{g^{(k)T} g^{(k)}} \tag{A.3} $$

For quadratic functions $E$, these two schemes for $\beta^{(k)}$ can be shown to be equivalent, which implies that for any nonlinear function $E$ the schemes behave similarly near a smooth minimum, because the local Taylor expansion is then nearly quadratic. New parameter vectors $p$ are obtained by searching for a minimum of $E$ in the $s$ direction, i.e., by calculating the value of the scalar parameter $\alpha$ which minimizes $E(p^{(k)} + \alpha s^{(k)})$. The new point in parameter space thus obtained becomes the starting point $p^{(k+1)}$ for the next iteration, i.e., the next 1-dimensional search. The details of the 1-dimensional search are omitted here, but it typically involves estimating the position of the minimum of $E$ (only in the search direction!) by interpolating subsequent points in each 1-dimensional search with a parabola or a cubic polynomial, whose minima can be found analytically. The slope along the search direction is given by $\frac{dE}{d\alpha} = s^T g$. Special measures have to be taken to ensure that $E$ will never increase with subsequent iterations.
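To give an impression of how these pieces fit together, the following sketch combines Eqs. (A.1)-(A.3) with a deliberately crude step-halving line search in place of the interpolation-based 1-dimensional search described above. It is a minimal illustration under stated assumptions (the function names, the use of NumPy, and the simple backtracking search are choices made here), not the actual software implementation.

```python
import numpy as np

def conjugate_gradient(E, grad, p, n_iter=100, scheme="polak-ribiere", tol=1e-8):
    """Minimize E(p) by nonlinear conjugate gradient, Eqs. (A.1)-(A.3).

    E      -- callable returning the scalar cost
    grad   -- callable returning the gradient g = dE/dp as a vector
    p      -- initial parameter vector
    scheme -- "fletcher-reeves" or "polak-ribiere"
    """
    g = grad(p)
    s = -g                                   # first direction: beta^(0) = 0
    for k in range(n_iter):
        # 1-dimensional search: find alpha that decreases E(p + alpha*s).
        # Crude step halving stands in for the parabolic/cubic interpolation
        # described in the text; it also ensures that E never increases.
        alpha, E0 = 1.0, E(p)
        while alpha > 1e-12 and E(p + alpha * s) >= E0:
            alpha *= 0.5
        p = p + alpha * s
        g_new = grad(p)
        if np.linalg.norm(g_new) < tol:      # (near-)zero gradient: done
            break
        if scheme == "fletcher-reeves":      # Eq. (A.2)
            beta = (g_new @ g_new) / (g @ g)
        else:                                # Eq. (A.3), Polak-Ribiere
            beta = ((g_new - g) @ g_new) / (g @ g)
        s = -g_new + beta * s                # Eq. (A.1)
        g = g_new
    return p
```

For instance, `conjugate_gradient(lambda p: p @ p, lambda p: 2 * p, np.array([3.0, -1.0]))` returns the zero vector after a single line search.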

The background of the conjugate gradient method lies in a Gram-Schmidt orthogonalization procedure, which simplifies to the Fletcher-Reeves scheme for quadratic functions. For quadratic functions, the optimization is guaranteed to reach the minimum within a finite number of exact 1-dimensional searches: at most $n$, where $n$ is the number of parameters in $E$. For more general forms of $E$, no such guarantees can be given, and a significant amount of heuristic knowledge is needed to obtain an implementation that is numerically robust and that has good convergence properties. Unfortunately, this is still a bit of an art, if not alchemy.

Finally, it should be noted that still more powerful optimization methods are known. Among them, the so-called BFGS quasi-Newton method has become rather popular. Slightly less popular is the DFP quasi-Newton method. These quasi-Newton methods build up an approximation of the inverse Hessian of the error function in successive iterations, using only gradient information. In practice, these methods typically need some two or three times fewer iterations than the conjugate gradient methods, at the expense of handling an approximation of the inverse Hessian [16]. Due to the matrix multiplications involved in this scheme, the cost of creating the approximation grows quadratically with the number of parameters to be determined. This can become prohibitive for large neural networks. On the other hand, as long as the CPU-time for evaluating the error function and its gradient is the dominant factor, these methods tend to provide a significant saving (again a factor two or three) in overall CPU-time. For relatively small problems to be characterized in the least-squares sense, the Levenberg-Marquardt method can be attractive. This method builds an approximation of the Hessian in a single iteration, again using only gradient information. However, the overhead of this method grows even cubically with the number of model parameters, due to the need to solve a corresponding set of linear equations for each iteration. All in all, one can say that while these more advanced optimization methods certainly provide added value, they rarely provide an order of magnitude (or more) reduction in overall CPU-time. This general observation has been confirmed by the experience of the author with many experiments not described in this thesis.
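For orientation only, the snippet below compares conjugate gradient and BFGS on the standard Rosenbrock test function using SciPy's general-purpose optimizers, and solves the same problem in least-squares form with a Levenberg-Marquardt style solver. The test function and all names here are illustrative assumptions with no relation to the models treated in this thesis.

```python
import numpy as np
from scipy.optimize import minimize, least_squares

# Rosenbrock function: a standard non-quadratic test problem.
def E(p):
    return (1 - p[0])**2 + 100 * (p[1] - p[0]**2)**2

def grad(p):
    return np.array([
        -2 * (1 - p[0]) - 400 * p[0] * (p[1] - p[0]**2),
        200 * (p[1] - p[0]**2),
    ])

p0 = np.array([-1.2, 1.0])

# Conjugate gradient vs. BFGS quasi-Newton: BFGS typically needs fewer
# iterations, at the cost of maintaining an approximate inverse Hessian.
for method in ("CG", "BFGS"):
    res = minimize(E, p0, jac=grad, method=method)
    print(method, res.nit, res.x)

# The same cost written as a sum of squared residuals, solved with a
# Levenberg-Marquardt style least-squares method.
def residuals(p):
    return np.array([1 - p[0], 10 * (p[1] - p[0]**2)])

print("LM", least_squares(residuals, p0, method="lm").x)
```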

A.2 Heuristic Optimization Method

It was found that in many cases the methods of the preceding section failed to converge quickly to a reasonable fit of the target data set. In itself this is not at all surprising, since these methods were designed to work well close to a quadratic minimum, while nothing is guaranteed about their performance far away from a minimum. However, it came as somewhat of a surprise that under these circumstances a very simple heuristic method often turned out to be more successful at quickly reaching a reasonable fit, although it converges far more slowly close to a minimum.

This method is essentially a one-dimensional bisection-like search scheme, rather boldly extended for use in multiple dimensions, as if there were no interaction at all among the various dimensions w.r.t. the position of the minima of the cost function: each parameter is given its own step, and the step sizes are adapted as the optimization proceeds. Some additional precautions are needed to avoid (strong) divergence, since convergence is not guaranteed. One may, for example, reduce all parameter steps by a factor close to zero whenever the cost function would increase (too much). When the parameter steps have the opposite sign of the gradient, this step size reduction ensures that eventually a sufficiently small step in this (generally not steepest) descent direction leads to a decrease of the cost function, as long as a minimum has not been reached.
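The sketch below shows one minimal way such a scheme could look. The per-parameter sign rule, the growth and shrink factors, and the acceptance test are assumptions made here for illustration; they are not the exact heuristics of the experimental software.

```python
import numpy as np

def heuristic_search(E, grad, p, step0=0.1, n_iter=100, shrink=0.5, grow=2.0):
    """Bisection-like search, boldly applied per parameter dimension.

    Each parameter keeps its own step size; here the step is taken
    against the sign of the corresponding gradient component (an
    assumption), and all steps shrink whenever the cost would increase.
    """
    step = np.full_like(p, step0, dtype=float)
    E0 = E(p)
    for _ in range(n_iter):
        # Step each parameter independently, as if the dimensions did
        # not interact w.r.t. the position of the minima.
        p_new = p - step * np.sign(grad(p))
        E1 = E(p_new)
        if E1 < E0:
            p, E0 = p_new, E1
            step *= grow      # accepted: cautiously enlarge the steps
        else:
            # Rejected: reduce all steps; a sufficiently small step in a
            # descent direction must eventually decrease E.
            step *= shrink
    return p
```

Doubling the steps on acceptance is one common choice; any factor greater than one serves to recover step size lost to an overly cautious reduction.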

After using this method for a certain number of iterations, it is advisable to switch to one of the methods of the preceding section. At present, this is still done manually, but one could conceive additional heuristics for doing this automatically.