Robust Regression Examples

Example 9.2: LMS and LTS: Stackloss Data

This example presents the results for Brownlee's (1965) stackloss data, which is also used for documenting the L1 regression module. The three explanatory variables correspond to measurements for a plant oxidizing ammonia to nitric acid on 21 consecutive days.

The response variable yi gives the permillage of ammonia lost (stackloss). These data are also given in Rousseeuw and Leroy (1987, p.76) and Osborne (1985, p.267):
   print "Stackloss Data";
      /* columns: intercept, air flow, cooling-water inlet
         temperature, acid concentration, stack loss (response) */
      aa = { 1  80  27  89  42,
             1  80  27  88  37,
             1  75  25  90  37,
             1  62  24  87  28,
             1  62  22  87  18,
             1  62  23  87  18,
             1  62  24  93  19,
             1  62  24  93  20,
             1  58  23  87  15,
             1  58  18  80  14,
             1  58  18  89  14,
             1  58  17  88  13,
             1  58  18  82  11,
             1  58  19  93  12,
             1  50  18  89   8,
             1  50  18  86   7,
             1  50  19  72   8,
             1  50  19  79   8,
             1  50  20  80   9,
             1  56  20  82  15,
             1  70  20  91  15 };
Rousseeuw and Leroy (1987, p.76) cite a large number of papers in which this data set was previously analyzed. They state that most researchers "concluded that observations 1, 3, 4, and 21 were outliers" and that some also reported observation 2 as an outlier.

Consider 2000 Random Subsets

For N=21 and n=4 (three explanatory variables plus an intercept), there are a total of 5985 different subsets of 4 observations out of 21. If you do not specify optn[6], the LMS and LTS algorithms draw Nrep=2000 random subsets. Because many subsets lead to singular linear systems, which you do not want printed, you can specify optn[2]=2 for reduced printed output.
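The counts quoted here and in the output that follows can be verified directly. The following is a quick Python (not IML) sanity check, assuming the default quantile h = floor((N+n+1)/2) and the breakdown-value formula of Rousseeuw and Leroy (1987):

```python
from math import comb

N, n = 21, 4                         # observations; parameters (3 slopes + intercept)

n_subsets = comb(N, n)               # distinct 4-subsets of the 21 observations
h = (N + n + 1) // 2                 # quantile minimized by LMS/LTS (default)
breakdown = ((N - n) // 2 + 1) / N   # highest possible breakdown value

print(n_subsets)                     # 5985
print(h)                             # 13 -> "Minimizing 13th Ordered Squared Residual"
print(round(100 * breakdown, 2))     # 42.86 (percent)
```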
   title2 "***Use 2000 Random Subsets***";
         a = aa[,2:4]; b = aa[,5];
         optn = j(8,1,.);
         optn[2]= 2;    /* ipri */
         optn[3]= 3;    /* ilsq */
         optn[8]= 3;    /* icov */

   call lms(sc,coef,wgt,optn,b,a);

Output 9.2.1: Some Simple Statistics

Median and Mean

             Median          Mean
VAR1             58  60.428571429
VAR2             20  21.095238095
VAR3             87  86.285714286
Intercep          1             1
Response         15  17.523809524

Dispersion and Standard Deviation

           Dispersion        StdDev
VAR1      5.930408874  9.1682682584
VAR2      2.965204437   3.160771455
VAR3     4.4478066555  5.3585712381
Intercep            0             0
Response  5.930408874  10.171622524
 

The following are the results of LS regression.

Output 9.2.2: Table of Unweighted LS Regression
 
Unweighted Least-Squares Estimation
LS Parameter Estimates
Variable     Estimate  Approx Std Err  t Value  Pr > |t|   Lower WCI   Upper WCI
VAR1 0.7156402 0.13485819 5.31 <.0001 0.45132301 0.97995739
VAR2 1.29528612 0.36802427 3.52 0.0026 0.57397182 2.01660043
VAR3 -0.1521225 0.15629404 -0.97 0.3440 -0.4584532 0.15420818
Intercep -39.919674 11.8959969 -3.36 0.0038 -63.2354 -16.603949
 
Sum of Squares = 178.8299616
 
Degrees of Freedom = 17
 
LS Scale Estimate = 3.2433639182

 
 
Cov Matrix of Parameter Estimates
  VAR1 VAR2 VAR3 Intercep
VAR1 0.0181867302 -0.036510675 -0.007143521 0.2875871057
VAR2 -0.036510675 0.1354418598 0.0000104768 -0.651794369
VAR3 -0.007143521 0.0000104768 0.024427828 -1.676320797
Intercep 0.2875871057 -0.651794369 -1.676320797 141.51474107
 
R-squared = 0.9135769045
 
F(3,17) Statistic = 59.9022259
 
Probability = 3.0163272E-9

The following are the LMS results for the 2000 random subsets.

Output 9.2.3: Iteration History and Optimization Results
Random Subsampling for LMS

Subset  Singular  Best Criterion  Percent
500 23 0.163262 25
1000 55 0.140519 50
1500 79 0.140519 75
2000 103 0.126467 100

Minimum Criterion= 0.1264668282

Least Median of Squares (LMS) Method

Minimizing 13th Ordered Squared Residual.

Highest Possible Breakdown Value = 42.86 %

Random Selection of 2103 Subsets

Among 2103 subsets 103 are singular.

Observations of Best Subset
15 11 19 10

Estimated Coefficients
VAR1 VAR2 VAR3 Intercep
0.75 0.5 0 -39.25

LMS Objective Function = 0.75

Preliminary LMS Scale = 1.0478510755

Robust R Squared = 0.96484375

Final LMS Scale = 1.2076147288


For LMS, observations 1, 3, 4, and 21 have scaled residuals larger than 2.5 (output not shown), and they are considered outliers. The following are the corresponding WLS results.
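The reweighting rule that produces the WLS fit is exactly the 2.5 cutoff the text describes: observations whose LMS residual, divided by the final LMS scale, exceeds 2.5 in absolute value receive weight zero. A minimal Python sketch (not the IML code), applied to the LMS residuals and final scale reported in Output 9.2.7 and Output 9.2.3:

```python
def reweight(residuals, scale, cutoff=2.5):
    """0/1 weights for the reweighted least-squares step: drop points
    whose scaled robust residual exceeds the cutoff in absolute value."""
    return [1 if abs(r) / scale <= cutoff else 0 for r in residuals]

# LMS residuals (Output 9.2.7) and final LMS scale (Output 9.2.3)
lms_resid = [6.4176097, 2.2772163, 6.21059, 7.2456884, -0.20702,
             -0.621059, -0.20702, 0.621059, -0.621059, 0.621059,
             0.621059, 0.2070197, -1.863177, -1.449138, 0.621059,
             -0.20702, 0.2070197, 0.2070197, 0.621059, 1.863177,
             -6.831649]
final_scale = 1.2076147288

w = reweight(lms_resid, final_scale)
print([i + 1 for i, wi in enumerate(w) if wi == 0])   # [1, 3, 4, 21]
print(sum(w))                                         # 17 nonzero weights
```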

Output 9.2.4: Table of Weighted LS Regression
 
Weighted Least-Squares Estimation
RLS Parameter Estimates Based on LMS
Variable     Estimate  Approx Std Err  t Value  Pr > |t|   Lower WCI   Upper WCI
VAR1 0.79768556 0.06743906 11.83 <.0001 0.66550742 0.9298637
VAR2 0.57734046 0.16596894 3.48 0.0041 0.25204731 0.9026336
VAR3 -0.0670602 0.06160314 -1.09 0.2961 -0.1878001 0.05367975
Intercep -37.652459 4.73205086 -7.96 <.0001 -46.927108 -28.37781
 
Weighted Sum of Squares = 20.400800254
 
Degrees of Freedom = 13
 
RLS Scale Estimate = 1.2527139846

 
 
Cov Matrix of Parameter Estimates
  VAR1 VAR2 VAR3 Intercep
VAR1 0.0045480273 -0.007921409 -0.001198689 0.0015681747
VAR2 -0.007921409 0.0275456893 -0.00046339 -0.065017508
VAR3 -0.001198689 -0.00046339 0.0037949466 -0.246102248
Intercep 0.0015681747 -0.065017508 -0.246102248 22.392305355
 
Weighted R-squared = 0.9750062263
 
F(3,13) Statistic = 169.04317954
 
Probability = 1.158521E-10
 
There are 17 points with nonzero weight.
 
Average Weight = 0.8095238095

The subroutine prilmts(), which is in the file robustmc.sas in the sample library, can be called to print a summary of the output:

   call prilmts(3,sc,coef,wgt);

Output 9.2.5: First Part of Output Generated by prilmts()

Results of the Least Median Squares Estimation

Quantile: 13
Number of Subsets: 2103
Number of Singular Subsets: 103
Number of Nonzero Weights: 17
Objective Function: 0.75
Preliminary Scale Estimate: 1.0478511
Final Scale Estimate: 1.2076147
Robust R-Squared: 0.9648438
Asymptotic Consistency Factor: 1.1413664
RLS Scale Estimate: 1.252714
Weighted Sum of Squares: 20.4008
Weighted R-Squared: 0.9750062
F Statistic: 169.04318


Output 9.2.6: Second Part of Output Generated by prilmts()
 
Estimated LMS Coefficients 
0.75     0.5     0     -39.25 
 
Indices of Best Sample
15     11     19     10     
 
Estimated WLS Coefficients
0.7976856     0.5773405     -0.06706     -37.65246     
 
Standard Errors
0.0674391     0.1659689     0.0616031     4.7320509     
 
T Values
11.828242     3.4786054     -1.088584     -7.956901     
 
Probabilities
2.4838E-8     0.004078     0.2961071     2.3723E-6     
 
Lower Wald CI
0.6655074     0.2520473     -0.1878     -46.92711     
 
Upper Wald CI
0.9298637     0.9026336     0.0536798     -28.37781
 
 

Output 9.2.7: Third Part of Output Generated by prilmts()
LMS Residuals
 6.4176097   2.2772163   6.21059     7.2456884  -0.20702    -0.621059
-0.20702     0.621059   -0.621059    0.621059    0.621059    0.2070197
-1.863177   -1.449138    0.621059   -0.20702     0.2070197   0.2070197
 0.621059    1.863177   -6.831649

Diagnostics
10.448052    7.9317507  10          11.666667    2.7297297   3.4864865
 4.7297297   4.2432432   3.6486486   3.7598351   4.6057675   4.9251688
 3.8888889   4.5864209   5.2970297   4.009901    6.679576    4.3053404
 4.0199755   3          11

WLS Residuals
 4.9634454   0.9185794   5.1312163   6.5250478  -0.535877   -0.996749
-0.338162    0.4601047  -0.844485    0.286883    0.7686702   0.377432
-2.000854   -1.074607    1.0731966   0.1143341  -0.297718    0.0770058
 0.4679328   1.544008   -6.888934
 

You now want to report the results of LTS for the 2000 random subsets:

   title2 "***Use 2000 Random Subsets***";
      a = aa[,2:4]; b = aa[,5];
      optn = j(8,1,.);
      optn[2]= 2;    /* ipri */
      optn[3]= 3;    /* ilsq */
      optn[8]= 3;    /* icov */

   call lts(sc,coef,wgt,optn,b,a);
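The two methods differ only in the criterion they minimize over candidate fits: LMS takes the h-th smallest squared residual, while LTS (as the output below states) minimizes the sum of the h smallest squared residuals. A small illustrative Python sketch of the two criteria (the printed IML objective values are scaled differently, so only the relative comparison of candidate fits carries over):

```python
def lms_criterion(residuals, h):
    """h-th smallest squared residual (LMS)."""
    return sorted(r * r for r in residuals)[h - 1]

def lts_criterion(residuals, h):
    """Sum of the h smallest squared residuals (LTS)."""
    return sum(sorted(r * r for r in residuals)[:h])

r = [0.1, -0.2, 0.05, 3.0, -0.15]   # toy residuals; 3.0 is a gross outlier
h = 4                               # trim only the largest residual
print(lms_criterion(r, h))          # ~0.04: the outlier never enters
print(lts_criterion(r, h))          # ~0.075
```

Because both criteria ignore the largest residuals, a single gross outlier cannot inflate them the way it inflates an ordinary sum of squares.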

Output 9.2.8: Iteration History and Optimization Results
 
Random Subsampling for LTS
Subset  Singular  Best Criterion  Percent
500 23 0.099507 25
1000 55 0.087814 50
1500 79 0.084061 75
2000 103 0.084061 100
 
Minimum Criterion= 0.0840614072
 
Least Trimmed Squares (LTS) Method
 
Minimizing Sum of 13 Smallest Squared Residuals.
 
Highest Possible Breakdown Value = 42.86 %
 
Random Selection of 2103 Subsets
 
Among 2103 subsets 103 are singular.

 
 
Observations of Best Subset
10 11 7 15

 
 
Estimated Coefficients
VAR1 VAR2 VAR3 Intercep
0.75 0.3333333333 0 -35.70512821
 
 
 
LTS Objective Function = 0.4985185153
 
Preliminary LTS Scale = 1.0379336739
 
Robust R Squared = 0.9719626168
 
Final LTS Scale = 1.0407755737

In addition to observations 1, 3, 4, and 21, which were considered outliers by LMS, LTS flags observation 2, whose scaled residual is considerably larger than 2.5 (output not shown), so it is also considered an outlier. Therefore, the WLS results based on LTS differ from those based on LMS.

Output 9.2.9: Table of Weighted LS Regression
Weighted Least-Squares Estimation

RLS Parameter Estimates Based on LTS
Variable     Estimate  Approx Std Err  t Value  Pr > |t|   Lower WCI   Upper WCI
VAR1 0.75694055 0.07860766 9.63 <.0001 0.60287236 0.91100874
VAR2 0.45353029 0.13605033 3.33 0.0067 0.18687654 0.72018405
VAR3 -0.05211 0.05463722 -0.95 0.3607 -0.159197 0.054977
Intercep -34.05751 3.82881873 -8.90 <.0001 -41.561857 -26.553163

Weighted Sum of Squares = 10.273044977

Degrees of Freedom = 11

RLS Scale Estimate = 0.9663918355

Cov Matrix of Parameter Estimates
  VAR1 VAR2 VAR3 Intercep
VAR1 0.0061791648 -0.005776855 -0.002300587 -0.034290068
VAR2 -0.005776855 0.0185096933 0.0002582502 -0.069740883
VAR3 -0.002300587 0.0002582502 0.0029852254 -0.131487406
Intercep -0.034290068 -0.069740883 -0.131487406 14.659852903

Weighted R-squared = 0.9622869127

F(3,11) Statistic = 93.558645037

Probability = 4.1136826E-8

There are 15 points with nonzero weight.

Average Weight = 0.7142857143
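The Average Weight lines are simply the fraction of observations that keep nonzero weight in the respective reweighting step, so the two reported values can be cross-checked directly (a trivial Python check):

```python
# average weight = (number of points with nonzero 0/1 weight) / N
N = 21
print(17 / N)   # 0.8095238095... (WLS based on LMS: 17 points kept)
print(15 / N)   # 0.7142857142... (WLS based on LTS: 15 points kept)
```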


Consider All 5985 Subsets

You now report the results of LMS for all different subsets:
   title2 "*** Use All 5985 Subsets***";
      a = aa[,2:4]; b = aa[,5];
      optn = j(8,1,.);
      optn[2]= 2;    /* ipri */
      optn[3]= 3;    /* ilsq */
      optn[6]= -1;   /* nrep: all 5985 subsets */
      optn[8]= 3;    /* icov */

   call lms(sc,coef,wgt,optn,b,a);
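Complete enumeration replaces the random draw with a loop over every n-subset: fit each subset exactly, evaluate the criterion on all observations, and keep the best candidate. A miniature pure-Python illustration of this idea (not the IML implementation) for a simple line fit, where each 2-point subset determines a candidate and the LMS criterion discards lines pulled toward the outlier; the skip for coincident x-values mirrors the singular subsets counted in the output:

```python
from itertools import combinations

def lms_by_enumeration(points, h):
    """Brute-force LMS line fit: every 2-point subset gives an exact-fit
    candidate; keep the one whose h-th smallest squared residual is minimal."""
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:                       # singular subset (vertical line): skip
            continue
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        sq = sorted((y - slope * x - intercept) ** 2 for x, y in points)
        crit = sq[h - 1]                   # h-th ordered squared residual
        if best is None or crit < best[0]:
            best = (crit, slope, intercept)
    return best

pts = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 20)]  # last point is a gross outlier
h = (len(pts) + 2 + 1) // 2                      # h = 4 for N=5, n=2
crit, slope, intercept = lms_by_enumeration(pts, h)
print(slope, intercept)                          # 2.0 1.0 -- the outlier is ignored
```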

Output 9.2.10: Iteration History and Optimization Results for LMS
Complete Enumeration for LMS

Subset  Singular  Best Criterion  Percent
1497 36 0.185899 25
2993 87 0.158268 50
4489 149 0.140519 75
5985 266 0.126467 100

Minimum Criterion= 0.1264668282

Least Median of Squares (LMS) Method

Minimizing 13th Ordered Squared Residual.

Highest Possible Breakdown Value = 42.86 %

Selection of All 5985 Subsets of 4 Cases Out of 21

Among 5985 subsets 266 are singular.

Observations of Best Subset
8 10 15 19

Estimated Coefficients
VAR1 VAR2 VAR3 Intercep
0.75 0.5 1.29526E-16 -39.25

LMS Objective Function = 0.75

Preliminary LMS Scale = 1.0478510755

Robust R Squared = 0.96484375

Final LMS Scale = 1.2076147288


Next, report the results of LTS for all different subsets:
   title2 "*** Use All 5985 Subsets***";
      a = aa[,2:4]; b = aa[,5];
      optn = j(8,1,.);
      optn[2]= 2;    /* ipri */
      optn[3]= 3;    /* ilsq */
      optn[6]= -1;   /* nrep: all 5985 subsets */
      optn[8]= 3;    /* icov */

   call lts(sc,coef,wgt,optn,b,a);

Output 9.2.11: Iteration History and Optimization Results for LTS
Complete Enumeration for LTS

Subset  Singular  Best Criterion  Percent
1497 36 0.135449 25
2993 87 0.107084 50
4489 149 0.081536 75
5985 266 0.081536 100

Minimum Criterion= 0.0815355299

Least Trimmed Squares (LTS) Method

Minimizing Sum of 13 Smallest Squared Residuals.

Highest Possible Breakdown Value = 42.86 %

Selection of All 5985 Subsets of 4 Cases Out of 21

Among 5985 subsets 266 are singular.

Observations of Best Subset
5 12 17 18

Estimated Coefficients
VAR1 VAR2 VAR3 Intercep
0.7291666667 0.4166666667 2.220446E-16 -36.22115385

LTS Objective Function = 0.4835390299

Preliminary LTS Scale = 1.0067458407

Robust R Squared = 0.9736222371

Final LTS Scale = 1.009470149



Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.