 The MODEL Procedure

## Troubleshooting Convergence Problems

As with any nonlinear estimation routine, there is no guarantee that the estimation will be successful for a given model and data. If the equations are linear with respect to the parameters, the parameter estimates always converge in one iteration. The methods that iterate the S matrix must iterate further for the S matrix to converge. Nonlinear models are not guaranteed to converge.

Convergence can be expected only with fully identified parameters, adequate data, and starting values sufficiently close to solution estimates.

Convergence and the rate of convergence may depend primarily on the choice of starting values for the estimates. This does not mean that a great deal of effort should be invested in choosing starting values. First, try the default values. If the estimation fails with these starting values, examine the model and data and re-run the estimation using reasonable starting values. It is usually not necessary that the starting values be very good, just that they not be very bad; choose values that seem plausible for the model and data.

### An Example of Requiring Starting Values

Suppose you want to regress a variable Y on a variable X assuming that the variables are related by the following nonlinear equation:

$$ y = a + b x^{c} + \epsilon $$

In this equation, Y is linearly related to a power transformation of X. The unknown parameters are a, b, and c, and $\epsilon$ is an unobserved random error. Simulated data were generated using the following SAS statements. In this simulation, a=10, b=2, and the use of the SQRT function corresponds to c=.5.

    data test;
       do i = 1 to 20;
          x = 5 * ranuni(1234);
          y = 10 + 2 * sqrt(x) + .5 * rannor(2345);
          output;
       end;
    run;


The following statements specify the model and give descriptive labels to the model parameters. Then the FIT statement attempts to estimate a, b, and c using the default starting value .0001.

    proc model data=test;
       y = a + b * x ** c;
       label a = "Intercept"
             b = "Coefficient of Transformed X"
             c = "Power Transformation Parameter";
       fit y;
    run;


PROC MODEL prints model summary and estimation problem summary reports and then prints the output shown in Figure 14.17.

 The MODEL Procedure OLS Estimation

 NOTE: The iteration limit is exceeded for OLS.

 ERROR: The parameter estimates failed to converge for OLS after 100 iterations using CONVERGE=0.001 as the convergence criteria.

 The MODEL Procedure OLS Estimation

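| Iteration | N Obs | R | Objective | N Subit | a | b | c |
|---|---|---|---|---|---|---|---|
| OLS 100 | 20 | 0.9627 | 3.9678 | 2 | 137.3844 | -126.536 | -0.00213 |

 Gauss Method Parameter Change Vector

| a | b | c |
|---|---|---|
| -69367.57 | 69366.51 | -1.16 |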

 NOTE: The parameter estimation is abandoned. Check your model and data. If the model is correct and the input data are appropriate, try rerunning the parameter estimation using different starting values for the parameter estimates.

 PROC MODEL continues as if the parameter estimates had converged.

Figure 14.17: Diagnostics for Convergence Failure

By using the default starting values, PROC MODEL was unable to take even the first step in iterating to the solution. The change in the parameters that the Gauss-Newton method computes is very extreme and makes the objective values worse instead of better. Even when this step is shortened by a factor of a million, the objective function is still worse, and PROC MODEL is unable to estimate the model parameters.

The problem is caused by the starting value of C. Using the default starting value C=.0001, the first iteration attempts to compute better values of A and B by what is, in effect, a linear regression of Y on the 10,000th root of X, which is almost the same as the constant 1. Thus the matrix that is inverted to compute the changes is nearly singular, which degrades the accuracy of the computed parameter changes.
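To make the near-singularity concrete, the partial derivatives of the model with respect to the parameters can be written out. This is a sketch derived from the model equation $y = a + b x^{c}$, not something PROC MODEL prints:

```latex
\frac{\partial y}{\partial a} = 1, \qquad
\frac{\partial y}{\partial b} = x^{c}, \qquad
\frac{\partial y}{\partial c} = b\,x^{c}\ln x

% At the default starting values b = c = 0.0001:
x^{c} = e^{c \ln x} \approx 1 + 0.0001\,\ln x
\quad\Longrightarrow\quad
\frac{\partial y}{\partial b} - \frac{\partial y}{\partial a} \approx 0.0001\,\ln x
\approx \frac{\partial y}{\partial c}
```

To first order, the derivative with respect to c equals the difference of the other two derivatives, so the three columns are almost exactly linearly dependent at the default starting point.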

This is also illustrated by the next part of the output, which displays collinearity diagnostics for the crossproducts matrix of the partial derivatives with respect to the parameters, shown in Figure 14.18.

 The MODEL Procedure OLS Estimation

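 Collinearity Diagnostics (columns a, b, and c give the Proportion of Variation)

| Number | Eigenvalue | Condition Number | a | b | c |
|---|---|---|---|---|---|
| 1 | 2.376793 | 1.0000 | 0.0000 | 0.0000 | 0.0000 |
| 2 | 0.623207 | 1.9529 | 0.0000 | 0.0000 | 0.0000 |
| 3 | 1.684616E-12 | 1187805 | 1.0000 | 1.0000 | 1.0000 |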

Figure 14.18: Collinearity Diagnostics

This output shows that the matrix is nearly singular (the smallest eigenvalue is about 1.7E-12) and that the partial derivatives of the residual with respect to A, B, and C are collinear at the point ( 0.0001, 0.0001, 0.0001 ) in the parameter space. See the section "Linear Dependencies" for a full explanation of the collinearity diagnostics.

The MODEL procedure next prints the note shown in Figure 14.19, which suggests that you try different starting values.

 The MODEL Procedure OLS Estimation

 NOTE: The parameter estimation is abandoned. Check your model and data. If the model is correct and the input data are appropriate, try rerunning the parameter estimation using different starting values for the parameter estimates.

 PROC MODEL continues as if the parameter estimates had converged.

Figure 14.19: Estimation Failure Note

PROC MODEL then produces the usual printout of results for the nonconverged parameter values. The estimation summary is shown in Figure 14.20. The heading includes the reminder "(Not Converged)."

 The MODEL Procedure OLS Estimation Summary (Not Converged)

| Minimization Summary | |
|---|---|
| Parameters Estimated | 3 |
| Method | Gauss |
| Iterations | 100 |
| Subiterations | 239 |
| Average Subiterations | 2.39 |

| Final Convergence Criteria | |
|---|---|
| R | 0.962666 |
| PPC(b) | 548.1977 |
| RPC(b) | 540.4224 |
| Object | 2.633E-6 |
| Trace(S) | 4.667947 |
| Objective Value | 3.967755 |

| Observations Processed | |
|---|---|
| Read | 20 |
| Solved | 20 |

Figure 14.20: Nonconverged Estimation Summary

The nonconverged estimation results are shown in Figure 14.21.

 The MODEL Procedure

 Nonlinear OLS Summary of Residual Errors (Not Converged)

| Equation | DF Model | DF Error | SSE | MSE | Root MSE | R-Square | Adj R-Sq |
|---|---|---|---|---|---|---|---|
| y | 3 | 17 | 79.3551 | 4.6679 | 2.1605 | -1.6812 | -1.9966 |

 Nonlinear OLS Parameter Estimates (Not Converged)

| Parameter | Estimate | Approx Std Err | t Value | Approx Pr > \|t\| | Label |
|---|---|---|---|---|---|
| a | 137.3844 | 263342 | 0.00 | 0.9996 | Intercept |
| b | -126.536 | 263342 | -0.00 | 0.9996 | Coefficient of Transformed X |
| c | -0.00213 | 4.4371 | -0.00 | 0.9996 | Power Transformation Parameter |

Figure 14.21: Nonconverged Results

Note that the $R^2$ statistic is negative. An $R^2 < 0$ results when the residual mean square error for the model is larger than the variance of the dependent variable. Negative $R^2$ statistics may be produced when either the parameter estimates fail to converge correctly, as in this case, or when the correctly estimated model fits the data very poorly.
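The arithmetic behind the negative value can be sketched as follows. This assumes that the reported R-Square is computed as one minus the ratio of SSE to the corrected total sum of squares of Y (written $CSS$ here, a symbol introduced only for this illustration). Under that assumption, the value of $CSS$ implied by Figure 14.21 is:

```latex
R^2 = 1 - \frac{SSE}{CSS}
\quad\Longrightarrow\quad
CSS = \frac{SSE}{1 - R^2} = \frac{79.3551}{1 - (-1.6812)} \approx 29.60

% With SSE = 79.3551 far larger than CSS \approx 29.60:
R^2 = 1 - \frac{79.3551}{29.60} \approx -1.68
```

In words, the nonconverged parameter values predict Y worse than the sample mean of Y would, which is what drives the statistic below zero.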

### Controlling Starting Values

To fit the preceding model you must specify a better starting value for C. Avoid starting values of C that are either very large or close to 0. For starting values of A and B, you can either specify values, use the default, or have PROC MODEL fit starting values for them conditional on the starting value for C.

Starting values are specified with the START= option of the FIT statement or on a PARMS statement. For example, the following statements estimate the model parameters using the starting values A=.0001, B=.0001, and C=5.

    proc model data=test;
       y = a + b * x ** c;
       label a = "Intercept"
             b = "Coefficient of Transformed X"
             c = "Power Transformation Parameter";
       fit y start=(c=5);
    run;


Using these starting values, the estimates converge in 16 iterations. The results are shown in Figure 14.22. Note that since the START= option explicitly declares parameters, the parameter C is placed first in the table.

 The MODEL Procedure

 Nonlinear OLS Summary of Residual Errors

| Equation | DF Model | DF Error | SSE | MSE | Root MSE | R-Square | Adj R-Sq |
|---|---|---|---|---|---|---|---|
| y | 3 | 17 | 5.7359 | 0.3374 | 0.5809 | 0.8062 | 0.7834 |

 Nonlinear OLS Parameter Estimates

| Parameter | Estimate | Approx Std Err | t Value | Approx Pr > \|t\| | Label |
|---|---|---|---|---|---|
| c | 0.327079 | 0.2892 | 1.13 | 0.2738 | Power Transformation Parameter |
| a | 8.384311 | 3.3775 | 2.48 | 0.0238 | Intercept |
| b | 3.505391 | 3.4858 | 1.01 | 0.3287 | Coefficient of Transformed X |

Figure 14.22: Converged Results
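The same starting values can also be supplied on a PARMS statement instead of the START= option. The following is a minimal sketch of that alternative. The values for A and B simply repeat the default of .0001, and the listing this run would produce is not shown here:

```sas
proc model data=test;
   /* each parameter name is followed by its starting value */
   parms a .0001 b .0001 c 5;
   y = a + b * x ** c;
   label a = "Intercept"
         b = "Coefficient of Transformed X"
         c = "Power Transformation Parameter";
   fit y;
run;
```

Because the PARMS statement declares the parameters in the order A, B, C, the Parameter Estimates table should list them in that order rather than with C first.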

### Using the STARTITER Option

PROC MODEL can compute starting values for some parameters conditional on starting values you specify for the other parameters. You supply starting values for some parameters and specify the STARTITER option on the FIT statement.

For example, the following statements set C to 1 and compute starting values for A and B by estimating these parameters conditional on the fixed value of C. With C=1 this is equivalent to computing A and B by linear regression on X. A PARMS statement is used to declare the parameters in alphabetical order. The ITPRINT option is used to print the parameter values at each iteration.

    proc model data=test;
       parms a b c;
       y = a + b * x ** c;
       label a = "Intercept"
             b = "Coefficient of Transformed X"
             c = "Power Transformation Parameter";
       fit y start=(c=1) / startiter itprint;
    run;


With better starting values, the estimates converge in only 5 iterations. Counting the 2 iterations required to compute the starting values for A and B, this is 5 fewer than the 12 iterations required without the STARTITER option. The iteration history listing is shown in Figure 14.23.

 The MODEL Procedure OLS Estimation

| Iteration | N Obs | R | Objective | N Subit | a | b | c |
|---|---|---|---|---|---|---|---|
| GRID 0 | 20 | 0.9970 | 161.9 | 0 | 0.00010 | 0.00010 | 5.00000 |
| GRID 1 | 20 | 0.0000 | 0.9675 | 0 | 12.29508 | 0.00108 | 5.00000 |

| Iteration | N Obs | R | Objective | N Subit | a | b | c |
|---|---|---|---|---|---|---|---|
| OLS 0 | 20 | 0.6551 | 0.9675 | 0 | 12.29508 | 0.00108 | 5.00000 |
| OLS 1 | 20 | 0.6882 | 0.9558 | 4 | 12.26426 | 0.00201 | 4.44013 |
| OLS 2 | 20 | 0.6960 | 0.9490 | 4 | 12.25554 | 0.00251 | 4.28262 |
| OLS 3 | 20 | 0.7058 | 0.9428 | 2 | 12.24487 | 0.00323 | 4.09977 |
| OLS 4 | 20 | 0.7177 | 0.9380 | 2 | 12.23186 | 0.00430 | 3.89040 |
| OLS 5 | 20 | 0.7317 | 0.9354 | 2 | 12.21610 | 0.00592 | 3.65450 |
| OLS 6 | 20 | 0.7376 | 0.9289 | 3 | 12.20663 | 0.00715 | 3.52417 |
| OLS 7 | 20 | 0.7445 | 0.9223 | 2 | 12.19502 | 0.00887 | 3.37407 |
| OLS 8 | 20 | 0.7524 | 0.9162 | 2 | 12.18085 | 0.01130 | 3.20393 |
| OLS 9 | 20 | 0.7613 | 0.9106 | 2 | 12.16366 | 0.01477 | 3.01460 |
| OLS 10 | 20 | 0.7705 | 0.9058 | 2 | 12.14298 | 0.01975 | 2.80839 |
| OLS 11 | 20 | 0.7797 | 0.9015 | 2 | 12.11827 | 0.02690 | 2.58933 |
| OLS 12 | 20 | 0.7880 | 0.8971 | 2 | 12.08900 | 0.03712 | 2.36306 |
| OLS 13 | 20 | 0.7947 | 0.8916 | 2 | 12.05460 | 0.05152 | 2.13650 |
| OLS 14 | 20 | 0.7993 | 0.8835 | 2 | 12.01449 | 0.07139 | 1.91695 |
| OLS 15 | 20 | 0.8015 | 0.8717 | 2 | 11.96803 | 0.09808 | 1.71101 |
| OLS 16 | 20 | 0.8013 | 0.8551 | 2 | 11.91459 | 0.13284 | 1.52361 |
| OLS 17 | 20 | 0.7987 | 0.8335 | 2 | 11.85359 | 0.17666 | 1.35745 |
| OLS 18 | 20 | 0.8026 | 0.8311 | 1 | 11.71551 | 0.28373 | 1.06872 |
| OLS 19 | 20 | 0.7945 | 0.7935 | 2 | 11.57666 | 0.40366 | 0.89662 |
| OLS 20 | 20 | 0.7872 | 0.7607 | 1 | 11.29346 | 0.65999 | 0.67059 |
| OLS 21 | 20 | 0.7632 | 0.6885 | 1 | 10.81372 | 1.11483 | 0.48842 |
| OLS 22 | 20 | 0.6976 | 0.5587 | 0 | 9.54889 | 2.34556 | 0.30461 |
| OLS 23 | 20 | 0.0108 | 0.2868 | 0 | 8.44333 | 3.44826 | 0.33232 |
| OLS 24 | 20 | 0.0008 | 0.2868 | 0 | 8.39438 | 3.49500 | 0.32790 |

 NOTE: At OLS Iteration 24 CONVERGE=0.001 Criteria Met.

Figure 14.23: ITPRINT Listing

The results produced in this case are almost the same as the results shown in Figure 14.22, except that the PARMS statement causes the Parameter Estimates table to be ordered A, B, C instead of C, A, B. They are not exactly the same because the different starting values caused the iterations to converge at a slightly different place. This effect is controlled by changing the convergence criterion with the CONVERGE= option.

By default, the STARTITER option performs one iteration to find starting values for the parameters not given values. In this case the model is linear in A and B, so only one iteration is needed. If A or B were nonlinear, you could specify more than one "starting values" iteration by specifying a number for the STARTITER= option.
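As an illustration of requesting several starting-value iterations, the following sketch uses a hypothetical variant of the model in which the coefficient enters as EXP(B), so the model is no longer linear in B. The variant and the choice of 3 iterations are assumptions made only for this example:

```sas
proc model data=test;
   parms a b c;
   /* hypothetical variant: nonlinear in b because of exp(b) */
   y = a + exp(b) * x ** c;
   /* hold c at 1 and take up to 3 iterations to find starting values for a and b */
   fit y start=(c=1) / startiter=3 itprint;
run;
```

With the ITPRINT option, the listing should show the extra starting-value iterations before the main estimation begins, which makes it easy to judge whether the chosen number was enough.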

### Finding Starting Values by Grid Search

PROC MODEL can try various combinations of parameter values and use the combination producing the smallest objective function value as starting values. (For OLS the objective function is the residual mean square.) This is known as a preliminary grid search. You can combine the STARTITER option with a grid search.

For example, the following statements try 5 different starting values for C: 10, 5, 2.5, -2.5, -5. For each value of C, values for A and B are estimated. The combination of A, B, and C values producing the smallest residual mean square is then used to start the iterative process.

    proc model data=test;
       parms a b c;
       y = a + b * x ** c;
       label a = "Intercept"
             b = "Coefficient of Transformed X"
             c = "Power Transformation Parameter";
       fit y start=(c=10 5 2.5 -2.5 -5) / startiter itprint;
    run;


The iteration history listing is shown in Figure 14.24. Using the best starting values found by the grid search (C=2.5), the OLS estimation requires 12 iterations. Since the grid search itself required 10 iterations, the total number of iterations in this case is 22.

 The MODEL Procedure OLS Estimation

| Iteration | N Obs | R | Objective | N Subit | a | b | c |
|---|---|---|---|---|---|---|---|
| GRID 0 | 20 | 1.0000 | 26815.5 | 0 | 0.00010 | 0.00010 | 10.00000 |
| GRID 1 | 20 | 0.0000 | 1.2193 | 0 | 12.51792 | 0.00000 | 10.00000 |
| GRID 0 | 20 | 0.6012 | 1.5151 | 0 | 12.51792 | 0.00000 | 5.00000 |
| GRID 1 | 20 | 0.0000 | 0.9675 | 0 | 12.29508 | 0.00108 | 5.00000 |
| GRID 0 | 20 | 0.7804 | 1.6091 | 0 | 12.29508 | 0.00108 | 2.50000 |
| GRID 1 | 20 | 0.0000 | 0.6290 | 0 | 11.87327 | 0.06372 | 2.50000 |
| GRID 0 | 20 | 0.8779 | 4.1604 | 0 | 11.87327 | 0.06372 | -2.50000 |
| GRID 1 | 20 | 0.0000 | 0.9542 | 0 | 12.92455 | -0.04700 | -2.50000 |
| GRID 0 | 20 | 0.9998 | 2776.1 | 0 | 12.92455 | -0.04700 | -5.00000 |
| GRID 1 | 20 | 0.0000 | 1.0450 | 0 | 12.86129 | -0.00060 | -5.00000 |

| Iteration | N Obs | R | Objective | N Subit | a | b | c |
|---|---|---|---|---|---|---|---|
| OLS 0 | 20 | 0.6685 | 0.6290 | 0 | 11.87327 | 0.06372 | 2.50000 |
| OLS 1 | 20 | 0.6649 | 0.5871 | 3 | 11.79268 | 0.10083 | 2.11710 |
| OLS 2 | 20 | 0.6713 | 0.5740 | 2 | 11.71445 | 0.14901 | 1.81658 |
| OLS 3 | 20 | 0.6726 | 0.5621 | 2 | 11.63772 | 0.20595 | 1.58705 |
| OLS 4 | 20 | 0.6678 | 0.5471 | 2 | 11.56098 | 0.26987 | 1.40903 |
| OLS 5 | 20 | 0.6587 | 0.5295 | 2 | 11.48317 | 0.33953 | 1.26760 |
| OLS 6 | 20 | 0.6605 | 0.5235 | 1 | 11.32436 | 0.48846 | 1.03784 |
| OLS 7 | 20 | 0.6434 | 0.4997 | 2 | 11.18704 | 0.62475 | 0.90793 |
| OLS 8 | 20 | 0.6294 | 0.4805 | 1 | 10.93520 | 0.87965 | 0.73319 |
| OLS 9 | 20 | 0.6031 | 0.4530 | 1 | 10.55670 | 1.26879 | 0.57385 |
| OLS 10 | 20 | 0.6052 | 0.4526 | 0 | 9.62442 | 2.23114 | 0.36146 |
| OLS 11 | 20 | 0.1652 | 0.2948 | 0 | 8.56683 | 3.31774 | 0.32417 |
| OLS 12 | 20 | 0.0008 | 0.2868 | 0 | 8.38015 | 3.50974 | 0.32664 |

 NOTE: At OLS Iteration 12 CONVERGE=0.001 Criteria Met.

Figure 14.24: ITPRINT Listing

Because no initial values for A or B were provided in the PARAMETERS statement or were read in with a PARMSDATA= or ESTDATA= option, A and B were given the default value of 0.0001 for the first iteration. At the second grid point, C=5, the values of A and B obtained from the previous iterations are used for the initial iteration. If initial values are provided for parameters, the parameters start at those initial values at each grid point.
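To have A and B restart from fixed values at every grid point rather than carrying over from the previous grid point, initial values can be given on the PARMS statement. The following is a sketch only. The values 10 and 1 are illustrative guesses, not values taken from the output above:

```sas
proc model data=test;
   /* a and b restart from these values at each grid point; c comes from the grid */
   parms a 10 b 1 c;
   y = a + b * x ** c;
   fit y start=(c=10 5 2.5 -2.5 -5) / startiter itprint;
run;
```

Comparing the GRID 0 rows of the resulting ITPRINT listing with those in Figure 14.24 is one way to confirm which behavior is in effect.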

### Guessing Starting Values from the Logic of the Model

Example 14.1, which fits a logistic growth curve model to the U.S. population, illustrates the need for reasonable starting values. This model can be written

$$ pop = \frac{a}{1 + \exp(b - c(t - 1790))} $$

where t is time in years. The model is estimated using decennial census data of the U.S. population in millions. If this simple but highly nonlinear model is estimated using the default starting values, the estimation fails to converge.

To find reasonable starting values, first consider the meaning of a and c. Taking the limit as time increases, a is the limiting or maximum possible population. So, as a starting value for a, several times the most recent population known can be used, for example, one billion (1000 million).

Dividing the time derivative by the function to find the growth rate and taking the limit as t moves into the past, you can determine that c is the initial growth rate. You can examine the data and compute an estimate of the growth rate for the first few decades, or you can pick a number that sounds like a plausible population growth rate figure, such as 2%.
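The two limits used in this argument can be worked out explicitly. The following derivation is a sketch that assumes $c > 0$ and uses the shorthand $u = b - c(t - 1790)$, introduced only for this illustration:

```latex
% Limiting population: as t -> infinity, u -> -infinity and exp(u) -> 0
\lim_{t \to \infty} pop = \lim_{u \to -\infty} \frac{a}{1 + e^{u}} = a

% Growth rate: time derivative divided by the function
\frac{d\,pop}{dt} = \frac{a\,c\,e^{u}}{(1 + e^{u})^{2}}
\quad\Longrightarrow\quad
\frac{1}{pop}\,\frac{d\,pop}{dt} = \frac{c\,e^{u}}{1 + e^{u}}

% Initial growth rate: as t -> -infinity, u -> +infinity
\lim_{t \to -\infty} \frac{c\,e^{u}}{1 + e^{u}} = c
```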

To find a starting value for b, let t equal the base year used, 1790, which causes c to drop out of the formula for that year, and then solve for the value of b that is consistent with the known population in 1790 and with the starting value of a. This yields b = ln(a/3.9-1) or about 5.5, where a is 1000 and 3.9 is roughly the population for 1790 given in the data. The estimates converge using these starting values.
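Put together, these starting values might be supplied as in the following sketch. The data set name `uspop` and the variable name `year` (standing in for t) are assumptions made for illustration, since the example's statements are not reproduced in this section:

```sas
proc model data=uspop;                 /* hypothetical data set name */
   /* a: 1000 million, several times the latest census figure   */
   /* b: log(1000/3.9 - 1), approximately 5.5                   */
   /* c: a plausible initial growth rate of 2 percent per year  */
   parms a 1000 b 5.5 c .02;
   pop = a / ( 1 + exp( b - c * (year - 1790) ) );
   fit pop;
run;
```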

### Convergence Problems

When estimating nonlinear models, you may encounter some of the following convergence problems.

#### Unable to Improve

The optimization algorithm may be unable to find a step that improves the objective function. If this happens in the Gauss-Newton method, the step size is halved to find a change vector for which the objective improves. In the Marquardt method, $\lambda$ is increased to find a change vector for which the objective improves. If, after MAXSUBITER= step-size halvings or increases in $\lambda$, the change vector still does not produce a better objective value, the iterations are stopped and an error message is printed.

Failure of the algorithm to improve the objective value can be caused by a CONVERGE= value that is too small. Look at the convergence measures reported at the point of failure. If the estimates appear to be approximately converged, you can accept the NOT CONVERGED results reported, or you can try re-running the FIT task with a larger CONVERGE= value.

If the procedure fails to converge because it is unable to find a change vector that improves the objective value, check your model and data to ensure that all parameters are identified and data values are reasonably scaled. Then, re-run the model with different starting values. Also, consider using the Marquardt method if Gauss-Newton fails; the Gauss-Newton method can get into trouble if the Jacobian matrix is nearly singular or ill-conditioned. Keep in mind that a nonlinear model may be well-identified and well-conditioned for parameter values close to the solution values but unidentified or numerically ill-conditioned for other parameter values. The choice of starting values can make a big difference.
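As a sketch of these suggestions applied to the power-transformation model from the earlier example, the following statements switch to the Marquardt method and allow more subiterations. The MAXSUBITER= value of 50 is an arbitrary illustrative choice:

```sas
proc model data=test;
   parms a b c;
   y = a + b * x ** c;
   /* Marquardt method, a larger subiteration limit, and a nondefault start for c */
   fit y start=(c=5) / method=marquardt maxsubiter=50 itprint;
run;
```

The ITPRINT option is included so that the subiteration counts and the objective value can be inspected at each step if the estimation still fails.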

#### Nonconvergence

The estimates may diverge into areas where the program overflows or the estimates may go into areas where function values are illegal or too badly scaled for accurate calculation. The estimation may also take steps that are too small or that make only marginal improvement in the objective function and, thus, fail to converge within the iteration limit.

When the estimates fail to converge, collinearity diagnostics for the Jacobian crossproducts matrix are printed if there are 20 or fewer parameters estimated. See "Linear Dependencies" later in this section for an explanation of these diagnostics.

#### Inadequate Convergence Criterion

If convergence is obtained, the resulting estimates will only approximate a minimum point of the objective function. The statistical validity of the results is based on the exact minimization of the objective function, and for nonlinear models the quality of the results depends on the accuracy of the approximation of the minimum. This is controlled by the convergence criterion used.

There are many nonlinear functions for which the objective function is quite flat in a large region around the minimum point so that many quite different parameter vectors may satisfy a weak convergence criterion. By using different starting values, different convergence criteria, or different minimization methods, you can produce very different estimates for such models.

You can guard against this by running the estimation with different starting values and different convergence criteria and checking that the estimates produced are essentially the same. If they are not, use a smaller CONVERGE= value.
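One way to carry out this check is sketched below for the power-transformation model from the earlier example. The data set names `est1`, `est2`, and `both`, and the CONVERGE= value, are arbitrary choices for illustration:

```sas
/* Run 1: start c at 5 with a tighter convergence criterion */
proc model data=test;
   parms a b c;
   y = a + b * x ** c;
   fit y start=(c=5) / converge=1e-6 outest=est1;
run;
quit;

/* Run 2: same model and criterion, different starting value for c */
proc model data=test;
   parms a b c;
   y = a + b * x ** c;
   fit y start=(c=1) / converge=1e-6 outest=est2;
run;
quit;

/* Stack the two sets of estimates for a side-by-side comparison */
data both;
   set est1 est2;
run;

proc print data=both;
run;
```

If the two rows of estimates differ by more than you are willing to tolerate, that is a sign the objective function is flat near the minimum and a smaller CONVERGE= value is warranted.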

#### Local Minimum

You may have converged to a local minimum rather than a global one. This problem is difficult to detect because the procedure will appear to have succeeded. You can guard against this by running the estimation with different starting values or with a different minimization technique. The START= option can be used to automatically perform a grid search to aid in the search for a global minimum.

#### Discontinuities

The computational methods assume that the model is a continuous and smooth function of the parameters. If this is not the case, the methods may not work.

If the model equations or their derivatives contain discontinuities, the estimation will usually succeed, provided that the final parameter estimates lie in a continuous interval and that the iterations do not produce parameter values at points of discontinuity or parameter values that try to cross asymptotes.

One common case of discontinuities causing estimation failure is that of an asymptotic discontinuity between the final estimates and the initial values. For example, consider the following model, which is basically linear but is written with one parameter in reciprocal form:

    y = a + b * x1 + x2 / c;


By placing the parameter C in the denominator, a singularity is introduced into the parameter space at C=0. This is not necessarily a problem, but if the correct estimate of C is negative while the starting value is positive (or vice versa), the asymptotic discontinuity at 0 will lie between the estimate and the starting value. This means that the iterations have to pass through the singularity to get to the correct estimates. The situation is shown in Figure 14.25.

Figure 14.25: Asymptotic Discontinuity

Because of the incorrect sign of the starting value, the C estimate goes off towards positive infinity in a vain effort to get past the asymptote and onto the correct arm of the hyperbola. As the computer is required to work with ever closer approximations to infinity, the numerical calculations break down and an "objective function was not improved" convergence failure message is printed. At this point, the iterations terminate with an extremely large positive value for C. When the sign of the starting value for C is changed, the estimates converge quickly to the correct values.
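When the sign of C is not known in advance, one option is to let a grid search try a starting value on each side of the asymptote. The following is a sketch only. The data set name `mydata` is hypothetical and is assumed to contain Y, X1, and X2:

```sas
proc model data=mydata;          /* hypothetical data set */
   parms a b c;
   y = a + b * x1 + x2 / c;
   /* try one starting value on each side of the singularity at c=0 */
   fit y start=(c=-1 1) / startiter itprint;
run;
```

The grid point with the smaller objective value is used to start the iterations, so the estimation begins on whichever arm of the hyperbola fits better at the trial values.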

### Linear Dependencies

In some cases, the Jacobian matrix may not be of full rank; parameters may not be fully identified for the current parameter values with the current data. When linear dependencies occur among the derivatives of the model, some parameters appear with a standard error of 0 and with the word BIASED printed in place of the t statistic. When this happens, collinearity diagnostics for the Jacobian crossproducts matrix are printed if the DETAILS option is specified and there are twenty or fewer parameters estimated. Collinearity diagnostics are also printed out automatically when a minimization method fails, or when the COLLIN option is specified.

For each parameter, the proportion of the variance of the estimate accounted for by each principal component is printed. The principal components are constructed from the eigenvalues and eigenvectors of the correlation matrix (scaled covariance matrix). When collinearity exists, a single principal component is associated with a large proportion of the variance of more than one parameter. The numbers reported are proportions, so they remain between 0 and 1. If two or more parameters have large proportion values associated with the same principal component, then two problems can occur: the computation of the parameter estimates is slow or nonconvergent, and the parameter estimates have inflated variances (Belsley, Kuh, and Welsch 1980, pp. 105-117).

For example, the following cubic model is fit to a quadratic data set:

    proc model data=test3;
       exogenous x1;
       parms b1 a1 c1;
       y1 = a1 * x1 + b1 * x1 * x1 + c1 * x1 * x1 * x1;
       fit y1 / collin;
    run;

The collinearity diagnostics are shown in Figure 14.26.

 The MODEL Procedure

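 Collinearity Diagnostics (columns b1, a1, and c1 give the Proportion of Variation)

| Number | Eigenvalue | Condition Number | b1 | a1 | c1 |
|---|---|---|---|---|---|
| 1 | 2.942920 | 1.0000 | 0.0001 | 0.0004 | 0.0002 |
| 2 | 0.056638 | 7.2084 | 0.0001 | 0.0357 | 0.0148 |
| 3 | 0.000442 | 81.5801 | 0.9999 | 0.9639 | 0.9850 |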

Figure 14.26: Collinearity Diagnostics

Notice that the proportions associated with the smallest eigenvalue are almost 1. For this model, removing any of the parameters will decrease the variances of the remaining parameters.
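The Condition Number column is not defined in the text, but the printed values are consistent with the square root of the ratio of the largest eigenvalue to each eigenvalue. As a check under that reading, using the eigenvalues printed in Figure 14.26:

```latex
\kappa_i = \sqrt{\frac{\lambda_{\max}}{\lambda_i}}

\kappa_2 = \sqrt{\frac{2.942920}{0.056638}} \approx 7.21,
\qquad
\kappa_3 = \sqrt{\frac{2.942920}{0.000442}} \approx 81.6
```

The small difference from the printed 81.5801 comes from the rounding of the smallest eigenvalue to three significant digits.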

In many models the collinearity might not be clear cut. Collinearity is not necessarily something you remove. A model may need to be reformulated to remove the redundant parameterization, or the limitations on the estimability of the model can be accepted.

Collinearity diagnostics are also useful when an estimation does not converge. The diagnostics provide insight into the numerical problems and can suggest which parameters need better starting values. These diagnostics are based on the approach of Belsley, Kuh, and Welsch (1980).
