Chapter Contents Previous Next
 The NLP Procedure

## Optimization Algorithms

There are three groups of optimization techniques available in PROC NLP. A particular optimizer can be selected with the TECH=name option on the PROC NLP statement.

 Algorithm TECH= Linear Complementary Problem LICOMP Quadratic Active Set Technique QUADAS Trust-Region Method TRUREG Newton-Raphson Method With Line Search NEWRAP Newton-Raphson Method With Ridging NRRIDG Quasi-Newton Methods (DBFGS, DDFP, BFGS, DFP) QUANEW Double-Dogleg Method (DBFGS, DDFP) DBLDOG Conjugate Gradient Methods (PB, FR, PR, CD) CONGRA Nelder-Mead Simplex Method NMSIMP Levenberg-Marquardt Method LEVMAR Hybrid Quasi-Newton Methods (DBFGS, DDFP) HYQUAN

No algorithm for optimizing general nonlinear functions exists that will always find the global optimum for a general nonlinear minimization problem in a reasonable amount of time. Since no single optimization technique is invariably superior to others, PROC NLP provides a variety of optimization techniques that work well in various circumstances. However, it possible to devise problems for which none of the techniques in PROC NLP can find the correct solution. Moreover, nonlinear optimization can be computationally expensive in terms of time and memory so care must be taken when matching an algorithm to a problem.

All optimization techniques in PROC NLP use O(n2) memory except the conjugate gradient methods, which uses only O(n) of memory and are designed to optimize problems with many variables. Since the techniques are iterative they require the repeated computation of:

• the function value (optimization criterion),
• the gradient vector (first-order partial derivatives),
• for some techniques, the (approximate) Hessian matrix (second-order partial derivatives),
• values of linear and nonlinear constraints,
• the first-order partial derivatives (Jacobian) of nonlinear constraints.

However, since each of the optimizers requires different derivatives and supports different types of constraints some computational efficiencies can be gained. The following table shows, for each optimization technique, which derivatives are needed (FOD: first order derivatives; SOD: second order derivatives) and what kind of constraints (BC: boundary constraints; LIC: linear constraints, NLC: nonlinear constraints) are supported.

 Algorithm FOD SOD BC LIC NLC LICOMP - - x x - QUADAS - - x x - TRUREG x x x x - NEWRAP x x x x - NRRIDG x x x x - QUANEW x - x x x DBLDOG x - x x - CONGRA x - x x - NMSIMP - - x x x LEVMAR x - x x - HYQUAN x - x x -

### Choosing an Optimization Algorithm

The factors that go into choosing a particular optimizer for a particular problem are complex and may involve trial and error. The following should be taken into account: First, the structure of the problem has to be considered: Is it quadratic? least-squares? Does it have linear or nonlinear constraints? Next, it is important to consider the type of derivatives of the objective function and the constraints which are needed and whether these are analytically tractable or not. This section provides some guidelines for making the right choices.

For many optimization problems, computing the gradient takes more computer time than computing the function value, and computing the Hessian sometimes takes much more computer time and memory than computing the gradient, especially when there are many decision variables. Unfortunately, optimization techniques that do not use the Hessian usually require many more iterations than techniques that do use the (approximate) Hessian, and so are often slower. Techniques that do not use the Hessian also tend to be less reliable (e.g., terminating at local rather than global optima).

The (very general) function compiler does not compute derivatives very efficiently. This is particularly apparent for second-order derivatives. For large problems, memory and computer time can be saved by programming your own derivatives using the GRADIENT, JACOBIAN, CRPJAC, HESSIAN, and JACNLC statements. If the optimization problem depends on more than 50 parameters, and you are not able to specify first and second-order derivatives of the objective function, you are advised not to use an optimization technique that requires the computation of second derivatives.

The following provides some guidance for matching an algorithm to a particular problem.

• LINCOMP
• General Nonlinear Optimization
• Nonlinear Constraints
• Small Problems: NMSIMP
Not suitable for highly nonlinear problems or for problems with n > 20.
• Medium Problems: QUANEW
• Only Linear Constraints
• Small Problems: TRUREG (NEWRAP, NRRIDG)
() where the Hessian matrix is not expensive to compute. Sometimes NRRIDG can be faster than TRUREG, but TRUREG can be more stable. NRRIDG needs only one matrix with n(n+1)/2 double words; TRUREG and NEWRAP need two such matrices.
• Medium Problems: QUANEW (DBLDOG)
() where the objective function and the gradient are much faster to evaluate than the Hessian. QUANEW and DBLDOG in general need more iterations than TRUREG, NRRIDG, and NEWRAP, but each iteration can be much faster. QUANEW and DBLDOG need only the gradient to update an approximate Hessian. QUANEW and DBLDOG need slightly less memory than TRUREG or NEWRAP (essentially one matrix with n(n+1)/2 double words).
• Large Problems: CONGRA
(n > 200) where the objective function and the gradient can be computed much faster than the Hessian and where too much memory is needed to store the (approximate) Hessian. CONGRA in general needs more iterations than QUANEW or DBLDOG, but each iteration can be much faster. Since CONGRA needs only a factor of n double-word memory, many large applications of PROC NLP can be solved only by CONGRA.
• No Derivatives: NMSIMP
() where derivatives are not continuous or are very difficult to compute.
• Least-Squares Minimization
• Small Problems: LEVMAR (HYQUAN)
() where the cross-product Jacobian matrix is easy and inexpensive to compute. In general, LEVMAR is more reliable, but there are problems with high residuals where HYQUAN can be faster than LEVMAR.
• Medium Problems: QUANEW (DBLDOG)
() where the objective function and the gradient are much faster to evaluate than the cross-product Jacobian. QUANEW and DBLDOG in general need more iterations than LEVMAR or HYQUAN, but each iteration can be much faster.
• Large Problems: CONGRA
• No Derivatives: NMSIMP

The QUADAS and LICOMP algorithms can be used to minimize or maximize a quadratic objective function,

f(x) = [1 /2] xT G x + gT x + c,     with     GT = G ,
with linear or boundary constraints
where x = (x1, ... ,xn)T, g = (g1, ... ,gn)T, G is an n ×n symmetric matrix, A is an m ×n matrix of general linear constraints, and b = (b1, ... ,bm)T. The value of c only modifies the value of the objective function, not its derivatives, and the location of the optimizer x* does not depend on the value of the constant term c. For QUADAS or LICOMP, the objective function must be specified using the MINQUAD or MAX QUAD statement or using an INQUAD= data set. In this case derivatives do not need to be specified. because the gradient vector
and the n ×n Hessian matrix
are easily obtained from the data input.

Simple boundary and general linear constraints can be specified using the BOUNDS or LINCON statement or an INQUAD= or INEST= data set.

The QUADAS algorithm is an active set method which iteratively updates the QT decomposition of the matrix Ak of active linear constraints and the Cholesky factor of the projected Hessian ZTGZ simultaneously. The update of active boundary and linear constraints is done separately; see

Gill, Murray, Saunders, & Wright (1984). Here is Q an nfree ×nfree orthogonal matrix containing null space Z in its first nfree - nalc columns and range space Y in its last nalc columns; T is an nalc ×nalc triangular matrix of special form, tij=0 for i < n-j, where nfree is the number of free parameters (n minus the number of active boundary constraints), and nalc is the number of active linear constraints. The Cholesky factor of the projected Hessian matrix ZTkGZk and the QT decomposition are updated simultaneously when the active set changes.

#### Linear Complementarity (LICOMP)

The LICOMP technique solves a quadratic problem as a linear complementarity problem. It can be used only if G is positive (negative) semi-definite for minimization (maximization) and if the parameters are restricted to be positive.

This technique finds a point that meets the Karush-Kuhn-Tucker conditions by solving the linear complementary problem

w = M z + q
with constraints
where
Only the LCEPSILON= option may be used to specify a tolerance used in computations.

### General Nonlinear Optimization

#### Trust-Region Optimization (TRUREG)

The trust-region method uses the gradient g(x(k)) and Hessian matrix G(x(k)) and thus requires that the objective function f(x) have continuous first- and second-order derivatives inside the feasible region.

The trust-region method iteratively optimizes a quadratic approximation to the nonlinear objective function within a hyperelliptic trust region with radius that constrains the step size corresponding to the quality of the quadratic approximation. The trust-region method is implemented using Dennis, Gay, & Welsch (1981), Gay (1983),

and Mor & Sorensen (1983)

.

The trust region method performs well for small- to medium-sized problems and does not need many function, gradient, and Hessian calls. However, if the computation of the Hessian matrix is computationally expensive, one of the (dual) quasi-Newton or conjugate gradient algorithms may be more efficient.

#### Newton-Raphson Optimization With Line Search (NEWRAP)

The NEWRAP technique uses the gradient g(x(k)) and Hessian matrix G(x(k)) and thus requires that the objective function have continuous first- and second-order derivatives inside the feasible region. If second-order derivatives are computed efficiently and precisely the NEWRAP method may perform well for medium-sized to large problems and does not need many function, gradient, and Hessian calls.

This algorithm uses a pure Newton step when the Hessian is positive definite and when the Newton step reduces the value of the objective function successfully. Otherwise a combination of ridging and line-search is done to compute successful steps. If the Hessian is not positive definite, a multiple of the identity matrix is added to the Hessian matrix to make it positive definite (Eskow & Schnabel, 1991).

In each iteration a line search is done along the search direction to find an approximate optimum of the objective function. The default line-search method uses quadratic interpolation and cubic extrapolation (LIS=2).

#### Newton-Raphson Ridge Optimization (NRRIDG)

The NRRIDG technique uses the gradient g(x(k)) and Hessian matrix G(x(k)) and thus requires that the objective function have continuous first- and second-order derivatives inside the feasible region.

This algorithm uses a pure Newton step when the Hessian is positive definite and when the Newton step reduces the value of the objective function successfully. If at least one of these two conditions is not satisfied, a multiple of the identity matrix is added to the Hessian matrix. If this algorithm is used for least-squares problems, it performs a ridged Gauss-Newton minimization.

The NRRIDG method performs well for small to medium-sized problems and does not need many function, gradient, and Hessian calls. However, if the computation of the Hessian matrix is computationally expensive, one of the (dual) quasi-Newton or conjugate gradient algorithms may be more efficient.

Since NRRIDG uses an orthogonal decomposition of the approximate Hessian, each iteration of NRRIDG can be slower than that of NEWRAP which works with Cholesky decomposition. However, usually NRRIDG needs fewer iterations than NEWRAP.

#### Quasi-Newton Optimization (QUANEW)

The (dual) quasi-Newton method uses the gradient g(x(k)) and does not need to compute second-order derivatives since they are approximated. It works well for medium to moderately large optimization problems where the objective function and the gradient are much faster to compute than the Hessian but in general requires more iterations than the techniques TRUREG, NEWRAP, and NRRIDG which compute second-order derivatives.

The QUANEW algorithm depends on whether there are nonlinear constraints or not.

If there are no nonlinear constraints QUANEW is either

• the original quasi-Newton algorithm, that updates an approximation of the inverse Hessian,
• the dual quasi-Newton algorithm, that updates the Cholesky factor of an approximate Hessian (default),
depending upon the value of the UPDATE= options. For problems with general linear inequality constraints the dual quasi-Newton methods can be more efficient than the original ones.

Four update formulas can be specified with the UPDATE= option:

DBFGS
performs the dual BFGS (Broyden, Fletcher, Goldfarb, & Shanno) update of the Cholesky factor of the Hessian matrix. This is the default.
DDFP
performs the dual DFP (Davidon, Fletcher, & Powell) update of the Cholesky factor of the Hessian matrix .
BFGS
performs the original BFGS (Broyden, Fletcher, Goldfarb, & Shanno) update of the inverse Hessian matrix.
DFP
performs the original DFP (Davidon, Fletcher, & Powell) update of the inverse Hessian matrix.

In each iteration a line search is done along the search direction to find an approximate optimum. The default line-search method uses quadratic interpolation and cubic extrapolation to obtain a step size satisfying the Goldstein conditions. One of the Goldstein conditions can be violated if the feasible region defines an upper limit of the step size. Violating the left side Goldstein condition can affect the positive definiteness of the quasi-Newton update. In those cases either the update is skipped or the iterations are restarted with an identity matrix resulting in the steepest descent or ascent search direction. Line-search algorithms other than the default one can be specified with the LIS= option.

The algorithm used for nonlinearly constrained quasi-Newton optimization is an efficient modification of Powell's (1978, 1982) Variable Metric Constrained WatchDog (VMCWD) algorithm. A similar but older algorithm (VF02AD) is part of the Harwell library. Both VMCWD and VF02AD use Fletcher's VE02AD algorithm (part of the Harwell library) for positive definite quadratic programming. The PROC NLP QUANEW implementation uses a quadratic programming subroutine that updates and downdates the approximation of the Cholesky factor when the active set changes. The nonlinear QUANEW algorithm is not a feasible point algorithm, and the value of the objective function need not decrease (minimization) or increase (maximization) monotonically. Instead, the algorithm tries to reduce a linear combination of the objective function and constraint violations, called the merit function.

The following are similarities and differences between this algorithm and VMCWD:

• A modification of this algorithm can be performed by specifying VERSION=1 which replaces the update of the Lagrange vector with the original update of Powell (1978)

which is used in VF02AD. This can be helpful for some applications with linearly dependent active constraints.

• If VERSION is not specified or VERSION=2 is specified, the evaluation of the Lagrange vector is performed in the same way as Powell (1982) describes.
• Instead of updating an approximate Hessian matrix, this algorithm uses the dual BFGS (or DFP) update that updates the Cholesky factor of an approximate Hessian. If the condition of the updated matrix gets too bad, a restart is done with a positive diagonal matrix. At the end of the first iteration after each restart the Cholesky factor is scaled.
• The Cholesky factor is loaded into the quadratic programming subroutine, automatically ensuring positive definiteness of the problem. During the quadratic programming step, the Cholesky factor of the projected Hessian matrix ZTkGZk and the QT decomposition are updated simultaneously when the active set changes. See Gill, Murray, Saunders, & Wright (1984) for more information.
• The line-search strategy is very similar to that of Powell (1982). However, this algorithm does not call for derivatives during the line search. So that the algorithm generally needs fewer derivative calls than function calls. VMCWD always requires the same number of derivative and function calls. Sometimes Powell's line-search method uses steps that are too long. In these cases, use the INSTEP= option to restrict the step length .
• The watchdog strategy is similar to that of

Powell (1982), however, it doesn't return automatically after a fixed number of iterations to a former better point. A return here is further delayed if the observed function reduction is close to the expected function reduction of the quadratic model.

• The Powell termination criterion still is used (as FTOL2), the QUANEW implementation uses two additional termination criteria (GTOL and ABSGTOL).

The nonlinear QUANEW algorithm needs the Jacobian matrix of the first-order derivatives (constraints normals) of the constraints CJ(x).

You can specify two update formulas with the UPDATE= option:

• UPDATE=DBFGS performs the dual BFGS update of the Cholesky factor of the Hessian matrix. This is the default.
• UPDATE=DDFP performs the dual DFP update of the Cholesky factor of the Hessian matrix.
This algorithm uses its own line-search technique. All options and parameters (except the INSTEP= option) controlling the line search in the other algorithms do not apply here. In several applications, large steps in the first iterations were troublesome. You can use the INSTEP= option to impose an upper bound for the step size during the first five iterations. You may also use the INHESSIAN[=r] option to specify a different starting approximation for the Hessian. Choosing simply the INHESSIAN option will use the Cholesky factor of a (possibly ridged) finite difference approximation of the Hessian to initialize the quasi-Newton update process. The values of the LCSINGULAR=, LCEPSILON=, and LCDEACT= options, which control the processing of linear and boundary constraints, are valid only for the quadratic programming subroutine used in each iteration of the nonlinear constraints QUANEW algorithm.

#### Double Dogleg Optimization (DBLDOG)

The double dogleg optimization method combines the ideas of quasi-Newton and trust region methods. The double dogleg algorithm computes in each iteration the step s(k) as the linear combination of the steepest descent or ascent search direction s1(k) and a quasi-Newton search direction s2(k),

The step is requested to remain within a prespecified trust region radius, see Fletcher (1987, p. 107). Thus the DBLDOG subroutine uses the dual quasi-Newton update but does not perform a line search. Two update formulas can be specified with the UPDATE= option:
DBFGS
performs the dual BFGS (Broyden, Fletcher, Goldfarb, & Shanno) update of the Cholesky factor of the Hessian matrix. This is the default.
DDFP
performs the dual DFP (Davidon, Fletcher, & Powell) update of the Cholesky factor of the Hessian matrix.

The double dogleg optimization technique works well for medium to moderately large optimization problems where the objective function and the gradient are much faster to compute than the Hessian. The implementation is based on Dennis & Mei (1979) and Gay (1983)

but is extended for dealing with boundary and linear constraints. DBLDOG generally needs more iterations than the techniques TRUREG, NEWRAP, or NRRIDG which need second-order derivatives, but each of the DBLDOG iterations is computationally cheap. Furthermore, DBLDOG needs only gradient calls for the update of the Cholesky factor of an approximate Hessian.

Second-order derivatives are not needed by CONGRA and not even approximated. The CONGRA algorithm can be expensive in function and gradient calls but needs only O(n) memory for unconstrained optimization. In general, many iterations are needed to obtain a precise solution, but each of the CONGRA iterations is computationally cheap. Four different update formulas for generating the conjugate directions can be specified using the UPDATE= option:

PB
: performs the automatic restart update method of Powell (1977) and Beale (1972). This is the default.
FR
: performs the Fletcher-Reeves update (Fletcher 1987).
PR
: performs the Polak-Ribiere update (Fletcher 1987).
CD
: performs a conjugate-descent update of Fletcher (1987).

The default is UPDATE=PB, since it behaved best in most test examples. You are advised to avoid the option UPDATE=CD, which behaved worst in most test examples.

The CONGRA subroutine should be used for optimization problems with large n. For the unconstrained or boundary constrained case, CONGRA needs only O(n) bytes of working memory, whereas all other optimization methods require order O(n2) bytes of working memory. During n successive iterations, uninterrupted by restarts or changes in the working set, the conjugate gradient algorithm computes a cycle of n conjugate search directions. In each iteration, a line search is done along the search direction to find an approximate optimum of the objective function. The default line-search method uses quadratic interpolation and cubic extrapolation to obtain a step size satisfying the Goldstein conditions. One of the Goldstein conditions can be violated if the feasible region defines an upper limit for the step size. Other line-search algorithms can be specified with the LIS= option.

The Nelder-Mead simplex method does not use any derivatives and does not assume that the objective function has continuous derivatives. The objective function itself needs to be continuous. This technique is quite expensive in the number of function calls and may be unable to generate precise results for .

Depending on the kind of constraints, one of the following Nelder-Mead simplex algorithms is used:

• unconstrained or only boundary constrained problems:
The original Nelder-Mead simplex algorithm is implemented and extended to boundary constraints. This algorithm does not compute the objective for infeasible points. This algorithm is automatically invoked if the LINCON or NLINCON statements are not specified.
• general linearly constrained or nonlinearly constrained problems:
A slightly modified version of Powell's (1992) COBYLA (Constrained Optimization BY Linear Approximations) implementation is used. This algorithm is automatically invoked if the either LINCON or the NLINCON statement is specified.

The original Nelder-Mead algorithm cannot be used for general linear or nonlinear constraints but can be faster for the unconstrained or boundary constrained case. The original Nelder-Mead algorithm changes the shape of the simplex adapting the nonlinearities of the objective function which contributes to an increased speed of convergence. The two NMSIMP subroutines use special sets of termination criteria. For more details see the section "Termination Criteria".

Powell's COBYLA algorithm is a sequential trust-region algorithm (originally with a monotonically decreasing

radius of a spheric trust region) that tries to maintain a regular-shaped simplex over the iterations. A small modification was made to the original algorithm, which permits an increase of the trust-region radius in special situations. A sequence of iterations is performed with a constant trust-region radius until the computed objective function reduction is much less than the predicted reduction. Then, the trust-region radius is reduced. The trust-region radius is increased only if the computed function reduction is relatively close to the predicted reduction and the simplex is well-shaped. The start radius and the final radius can be specified using =INSTEP and =ABSXTOL. The convergence to small values of (high precision) may take many calls of the function and constraint modules and may result in numerical problems. There are two main reasons for the slow convergence of the COBYLA algorithm:

• Only linear approximations of the objective and constraint functions are used locally.
• Maintaining the regular-shaped simplex and not adapting its shape to nonlinearities yields very small simplexes for highly nonlinear functions (for example, fourth order polynomials).

### Nonlinear Least-Squares Optimization

#### Levenberg-Marquardt Least-Squares Method (LEVMAR)

The Levenberg-Marquardt method is a modification of the trust-region method for nonlinear least-squares problems and is implemented as in Mor (1978).

This is the recommended algorithm for small- to medium-sized least-squares problems. Large least-squares problems can be transformed into minimization problems, which can be processed with conjugate gradient or (dual) quasi-Newton techniques. In each iteration LEVMAR solves a quadratically constrained quadratic minimization problem that restricts the step to stay at the surface of or inside an n dimensional elliptical (or spherical) trust region. In each iteration, LEVMAR uses the cross-product Jacobian matrix JTJ as an approximate Hessian matrix.

#### Hybrid Quasi-Newton Least-Squares Methods (HYQUAN)

In each iteration of one of the Fletcher & Xu (1987)

(refer also to AlBaali & Fletcher, 1985, 1986)

hybrid quasi-Newton methods, a criterion is used to decide whether a Gauss-Newton or a dual quasi-Newton search direction is appropriate. The VERSION= option can be used to choose one of three criteria (HY1, HY2, HY3) proposed by Fletcher & Xu (1987). The default is VERSION=2; that is, HY2. In each iteration, HYQUAN computes the cross-product Jacobian (used for the Gauss-Newton step), updates the Cholesky factor of an approximate Hessian (used for the quasi-Newton step), and does a line search to compute an approximate minimum along the search direction. The default line-search technique used by HYQUAN is especially designed for least-squares problems (refer to Lindstrm & Wedin, 1984, and AlBaali & Fletcher, 1986). Using the LIS= option you can choose a different line-search algorithm than the default one. Two update formulas can be specified with the UPDATE= option:

DBFGS
performs the dual BFGS (Broyden, Fletcher, Goldfarb, & Shanno) update of the Cholesky factor of the Hessian matrix. This is the default.
DDFP
performs the dual DFP (Davidon, Fletcher, & Powell) update of the Cholesky factor of the Hessian matrix.

The HYQUAN subroutine needs about the same amount of working memory as the LEVMAR algorithm. In most applications LEVMAR seems to be superior to HYQUAN, and using HYQUAN is recommended only when problems are experienced with the performance of LEVMAR.

 Chapter Contents Previous Next Top