Robust Regression Examples

Overview

SAS/IML has three subroutines that can be used for outlier detection and robust regression. The Least Median of Squares (LMS) and Least Trimmed Squares (LTS) subroutines perform robust regression (sometimes called resistant regression). These subroutines are able to detect outliers and perform a least-squares regression on the remaining observations. The Minimum Volume Ellipsoid Estimation (MVE) subroutine can be used to find the minimum volume ellipsoid estimator, that is, a robust estimate of location and covariance that can be used for constructing confidence regions and for detecting multivariate outliers and leverage points. Moreover, the MVE subroutine provides a table of robust distances and classical Mahalanobis distances. The LMS, LTS, and MVE subroutines and some other robust estimation theories and methods were developed by Rousseeuw (1984) and Rousseeuw and Leroy (1987). Some statistical applications for MVE are described in Rousseeuw and Van Zomeren (1990).

Whereas robust regression methods like L1 or Huber M-estimators reduce the influence of outliers only (compared to least-squares or L2 regression), resistant regression methods like LMS and LTS can completely disregard influential outliers (sometimes called leverage points) from the fit of the model. The algorithms used in the LMS and LTS subroutines are based on the PROGRESS program by Rousseeuw and Leroy (1987). Rousseeuw and Hubert (1996) prepared a new version of PROGRESS to facilitate its inclusion in SAS software, and they have incorporated several recent developments. Among other things, the new version of PROGRESS now yields the exact LMS for simple regression, and the program uses a new definition of the robust coefficient of determination (R^2). Therefore, the outputs may differ slightly from those given in Rousseeuw and Leroy (1987) or those obtained from software based on the older version of PROGRESS. The MVE algorithm is based on the algorithm used in the MINVOL program by Rousseeuw (1984).

The three SAS/IML subroutines are designed for

   LMS: minimizing the hth ordered squared residual
   LTS: minimizing the sum of the h smallest squared residuals
   MVE: minimizing the volume of an ellipsoid containing h points
where h is defined in the range
\frac{N}{2} + 1 \leq h \leq \frac{3N}{4} + \frac{n+1}{4}
In the preceding equation, N is the number of observations and n is the number of regressors. The value of h determines the breakdown point, which is "the smallest fraction of contamination that can cause the estimator T to take on values arbitrarily far from T(Z)" (Rousseeuw and Leroy 1987, p. 10). Here, T(Z) denotes the estimator T applied to a sample Z.
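As a small illustration of the bounds on h, the following sketch computes the allowed integer range for given N and n. The rounding convention (taking the floor of each bound) is an assumption for this sketch, not necessarily the convention used by the subroutines.

```python
from math import floor

def h_range(N, n):
    """Integer range of h allowed by the bounds in the text:
    N/2 + 1 <= h <= 3N/4 + (n+1)/4 (floored here by assumption)."""
    lo = floor(N / 2) + 1
    hi = floor(3 * N / 4 + (n + 1) / 4)
    return lo, hi

# e.g. N = 20 observations, n = 2 regressors
print(h_range(20, 2))  # -> (11, 15)
```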

For each parameter vector b = (b_1, \ldots, b_n), the residual of observation i is r_i = y_i - x_i b. You then denote the ordered, squared residuals as

(r^2)_{1:N} \leq \cdots \leq (r^2)_{N:N}
The objective functions for the LMS and LTS optimization problems are defined as follows:

   LMS:  F_{\mathrm{LMS}} = (r^2)_{h:N} \longrightarrow \min
   LTS:  F_{\mathrm{LTS}} = \sqrt{\frac{1}{h} \sum_{i=1}^{h} (r^2)_{i:N}} \longrightarrow \min

Because of the nonsmooth form of these objective functions, the estimates cannot be obtained with traditional optimization algorithms. For LMS and LTS, the algorithm, as in the PROGRESS program, selects a number of subsets of n observations out of the N given observations, evaluates the objective function for each, and saves the subset with the lowest objective function value. As long as the problem size permits you to evaluate all such subsets, the result is a global optimum. If computer time does not permit evaluating all the different subsets, a random collection of subsets is evaluated instead; in that case, you may not obtain the global optimum.
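The resampling scheme just described can be sketched in Python (this is an illustration under the definitions in the text, not the PROGRESS implementation itself). For simple regression with an intercept, each subset of two observations determines an exact line, which is then scored over all N residuals; enumerating every subset yields the global optimum:

```python
import math
from itertools import combinations

def lms_objective(residuals, h):
    """LMS criterion: the h-th smallest squared residual."""
    r2 = sorted(r * r for r in residuals)
    return r2[h - 1]

def lts_objective(residuals, h):
    """LTS criterion: root mean of the h smallest squared residuals."""
    r2 = sorted(r * r for r in residuals)
    return math.sqrt(sum(r2[:h]) / h)

def lms_simple_regression(x, y, h):
    """Exhaustive elemental-subset search for LMS with one regressor
    plus an intercept: each pair of observations fixes an exact line,
    which is scored by the LMS objective over all N residuals."""
    N = len(y)
    best_score, best_coef = float("inf"), None
    for i, j in combinations(range(N), 2):
        if x[i] == x[j]:
            continue  # degenerate subset: no unique exact fit
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        residuals = [y[k] - intercept - slope * x[k] for k in range(N)]
        score = lms_objective(residuals, h)
        if score < best_score:
            best_score, best_coef = score, (intercept, slope)
    return best_score, best_coef

# Four points on the line y = 2x + 1, plus one gross leverage point
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [3.0, 5.0, 7.0, 9.0, 0.0]
score, (b0, b1) = lms_simple_regression(x, y, h=3)
print(score, b0, b1)  # -> 0.0 1.0 2.0
```

Because every pair is enumerated, the result is the global optimum for this small problem; for larger N, PROGRESS instead evaluates a random collection of subsets, as noted above.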

Note that the LMS, LTS, and MVE subroutines are executed only when the number N of observations is over twice the number n of explanatory variables xj (including the intercept), that is, if N > 2n.


Flow Chart for LMS, LTS, and MVE


Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.