# Robust Regression Examples

## Example 9.2: LMS and LTS: Stackloss Data

This example presents the results for Brownlee's (1965) stackloss data, which is also used for documenting the L1 regression module. The three explanatory variables correspond to measurements for a plant oxidizing ammonia to nitric acid on 21 consecutive days.

- x1: air flow to the plant
- x2: cooling water inlet temperature
- x3: acid concentration

The response variable y gives the permillage of ammonia lost (stackloss). These data are also given in Rousseeuw and Leroy (1987, p. 76) and Osborne (1985, p. 267):
```
print "Stackloss Data";
aa = { 1  80  27  89  42,
       1  80  27  88  37,
       1  75  25  90  37,
       1  62  24  87  28,
       1  62  22  87  18,
       1  62  23  87  18,
       1  62  24  93  19,
       1  62  24  93  20,
       1  58  23  87  15,
       1  58  18  80  14,
       1  58  18  89  14,
       1  58  17  88  13,
       1  58  18  82  11,
       1  58  19  93  12,
       1  50  18  89   8,
       1  50  18  86   7,
       1  50  19  72   8,
       1  50  19  79   8,
       1  50  20  80   9,
       1  56  20  82  15,
       1  70  20  91  15 };
```
Rousseeuw and Leroy (1987, p. 76) cite a large number of earlier papers that analyzed this data set. They state that most researchers "concluded that observations 1, 3, 4, and 21 were outliers" and that some also reported observation 2 as an outlier.

### Consider 2000 Random Subsets

For N=21 and n=4 (three explanatory variables plus an intercept), there are 5985 distinct subsets of 4 observations out of 21. If you do not specify optn[6], the LMS and LTS algorithms draw Nrep=2000 random subsets. Because many subsets lead to singular linear systems whose messages you do not want printed, specify optn[2]=2 for reduced printed output.
```
title2 "***Use 2000 Random Subsets***";
a = aa[,2:4]; b = aa[,5];
optn = j(8,1,.);
optn[2]= 2;    /* ipri */
optn[3]= 3;    /* ilsq */
optn[8]= 3;    /* icov */

call lms(sc,coef,wgt,optn,b,a);
```
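The counts and quantile that appear in the output below follow from simple combinatorics. The following Python snippet is only an illustrative cross-check and is not part of the IML session:

```python
import math

N, n = 21, 4                          # observations; parameters incl. intercept
subsets = math.comb(N, n)             # distinct 4-subsets of 21 observations
h = (N + n + 1) // 2                  # LMS/LTS quantile: the 13th ordered squared residual
breakdown = ((N - n) // 2 + 1) / N    # highest possible breakdown value

print(subsets, h, round(100 * breakdown, 2))  # → 5985 13 42.86
```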

Output 9.2.1: Some Simple Statistics
Median and Mean

| Variable | Median | Mean |
|----------|--------|------|
| VAR1 | 58 | 60.428571429 |
| VAR2 | 20 | 21.095238095 |
| VAR3 | 87 | 86.285714286 |
| Intercep | 1 | 1 |
| Response | 15 | 17.523809524 |

Dispersion and Standard Deviation

| Variable | Dispersion | StdDev |
|----------|------------|--------|
| VAR1 | 5.930408874 | 9.1682682584 |
| VAR2 | 2.965204437 | 3.160771455 |
| VAR3 | 4.4478066555 | 5.3585712381 |
| Intercep | 0 | 0 |
| Response | 5.930408874 | 10.171622524 |
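The "Dispersion" column is consistent with the usual robust scale estimate: approximately 1.4826 times the median absolute deviation (MAD) from the median, where the constant makes the MAD consistent with the standard deviation under normality. A sketch in Python (illustrative, outside the IML session):

```python
import statistics

def dispersion(x):
    """Robust analogue of the standard deviation: ~1.4826 * MAD."""
    med = statistics.median(x)
    return 1.4826 * statistics.median(abs(v - med) for v in x)

# VAR1 (air flow) column of the stackloss data
var1 = [80, 80, 75, 62, 62, 62, 62, 62, 58, 58, 58,
        58, 58, 58, 50, 50, 50, 50, 50, 56, 70]
print(dispersion(var1))   # ≈ 5.9304, matching the table entry for VAR1
```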

The following are the results of LS regression.

Output 9.2.2: Table of Unweighted LS Regression

 Unweighted Least-Squares Estimation
LS Parameter Estimates

| Variable | Estimate | Approx Std Err | t Value | Pr > \|t\| | Lower WCI | Upper WCI |
|----------|----------|----------------|---------|------------|-----------|-----------|
| VAR1 | 0.7156402 | 0.13485819 | 5.31 | <.0001 | 0.45132301 | 0.97995739 |
| VAR2 | 1.29528612 | 0.36802427 | 3.52 | 0.0026 | 0.57397182 | 2.01660043 |
| VAR3 | -0.1521225 | 0.15629404 | -0.97 | 0.3440 | -0.4584532 | 0.15420818 |
| Intercep | -39.919674 | 11.8959969 | -3.36 | 0.0038 | -63.2354 | -16.603949 |

 Sum of Squares = 178.8299616

 Degrees of Freedom = 17

 LS Scale Estimate = 3.2433639182

Cov Matrix of Parameter Estimates

|  | VAR1 | VAR2 | VAR3 | Intercep |
|--|------|------|------|----------|
| VAR1 | 0.0181867302 | -0.036510675 | -0.007143521 | 0.2875871057 |
| VAR2 | -0.036510675 | 0.1354418598 | 0.0000104768 | -0.651794369 |
| VAR3 | -0.007143521 | 0.0000104768 | 0.024427828 | -1.676320797 |
| Intercep | 0.2875871057 | -0.651794369 | -1.676320797 | 141.51474107 |

 R-squared = 0.9135769045

 F(3,17) Statistic = 59.9022259

 Probability = 3.0163272E-9
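The unweighted LS estimates above can be reproduced from the listed data with any least-squares solver. The following NumPy sketch (illustrative, outside the IML session) fits the same model:

```python
import numpy as np

# Stackloss data: x1 air flow, x2 water temperature, x3 acid conc., y stackloss
data = np.array([
    [80, 27, 89, 42], [80, 27, 88, 37], [75, 25, 90, 37], [62, 24, 87, 28],
    [62, 22, 87, 18], [62, 23, 87, 18], [62, 24, 93, 19], [62, 24, 93, 20],
    [58, 23, 87, 15], [58, 18, 80, 14], [58, 18, 89, 14], [58, 17, 88, 13],
    [58, 18, 82, 11], [58, 19, 93, 12], [50, 18, 89,  8], [50, 18, 86,  7],
    [50, 19, 72,  8], [50, 19, 79,  8], [50, 20, 80,  9], [56, 20, 82, 15],
    [70, 20, 91, 15]], dtype=float)

X = np.column_stack([data[:, :3], np.ones(len(data))])  # VAR1-VAR3 plus intercept
y = data[:, 3]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # ≈ [0.7156, 1.2953, -0.1521, -39.9197], matching the LS table
```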

The following are the LMS results for the 2000 random subsets.

Output 9.2.3: Iteration History and Optimization Results
 Random Subsampling for LMS

| Subset | Singular | Best Criterion | Percent |
|--------|----------|----------------|---------|
| 500 | 23 | 0.163262 | 25 |
| 1000 | 55 | 0.140519 | 50 |
| 1500 | 79 | 0.140519 | 75 |
| 2000 | 103 | 0.126467 | 100 |

 Minimum Criterion= 0.1264668282

 Least Median of Squares (LMS) Method

 Minimizing 13th Ordered Squared Residual.

 Highest Possible Breakdown Value = 42.86 %

 Random Selection of 2103 Subsets

 Among 2103 subsets 103 are singular.

 Observations of Best Subset 15 11 19 10

Estimated Coefficients

| VAR1 | VAR2 | VAR3 | Intercep |
|------|------|------|----------|
| 0.75 | 0.5 | 0 | -39.25 |

 LMS Objective Function = 0.75

 Preliminary LMS Scale = 1.0478510755

 Robust R Squared = 0.96484375

 Final LMS Scale = 1.2076147288

For LMS, observations 1, 3, 4, and 21 have scaled residuals larger than 2.5 (output not shown), and they are considered outliers. The following are the corresponding WLS results.
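Because VAR3 has a zero LMS coefficient, the flagged observations can be reproduced directly from the printed coefficients and the final LMS scale. A sketch in Python (illustrative; the scale 1.2076147288 is taken from the output above):

```python
# (x1, x2, y) per observation; x3 is omitted since its LMS coefficient is zero
obs = [(80, 27, 42), (80, 27, 37), (75, 25, 37), (62, 24, 28), (62, 22, 18),
       (62, 23, 18), (62, 24, 19), (62, 24, 20), (58, 23, 15), (58, 18, 14),
       (58, 18, 14), (58, 17, 13), (58, 18, 11), (58, 19, 12), (50, 18, 8),
       (50, 18, 7), (50, 19, 8), (50, 19, 8), (50, 20, 9), (56, 20, 15),
       (70, 20, 15)]

scale = 1.2076147288                     # final LMS scale from the output
resid = [y - (0.75 * x1 + 0.5 * x2 - 39.25) for x1, x2, y in obs]
flagged = [i + 1 for i, r in enumerate(resid) if abs(r / scale) > 2.5]
print(flagged)  # → [1, 3, 4, 21]
```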

Output 9.2.4: Table of Weighted LS Regression

 Weighted Least-Squares Estimation
RLS Parameter Estimates Based on LMS

| Variable | Estimate | Approx Std Err | t Value | Pr > \|t\| | Lower WCI | Upper WCI |
|----------|----------|----------------|---------|------------|-----------|-----------|
| VAR1 | 0.79768556 | 0.06743906 | 11.83 | <.0001 | 0.66550742 | 0.9298637 |
| VAR2 | 0.57734046 | 0.16596894 | 3.48 | 0.0041 | 0.25204731 | 0.9026336 |
| VAR3 | -0.0670602 | 0.06160314 | -1.09 | 0.2961 | -0.1878001 | 0.05367975 |
| Intercep | -37.652459 | 4.73205086 | -7.96 | <.0001 | -46.927108 | -28.37781 |

 Weighted Sum of Squares = 20.400800254

 Degrees of Freedom = 13

 RLS Scale Estimate = 1.2527139846

Cov Matrix of Parameter Estimates

|  | VAR1 | VAR2 | VAR3 | Intercep |
|--|------|------|------|----------|
| VAR1 | 0.0045480273 | -0.007921409 | -0.001198689 | 0.0015681747 |
| VAR2 | -0.007921409 | 0.0275456893 | -0.00046339 | -0.065017508 |
| VAR3 | -0.001198689 | -0.00046339 | 0.0037949466 | -0.246102248 |
| Intercep | 0.0015681747 | -0.065017508 | -0.246102248 | 22.392305355 |

 Weighted R-squared = 0.9750062263

 F(3,13) Statistic = 169.04317954

 Probability = 1.158521E-10

 There are 17 points with nonzero weight.

 Average Weight = 0.8095238095

The subroutine prilmts(), which is in the file robustmc.sas in the sample library, can be called to print the output summary:

```
call prilmts(3,sc,coef,wgt);
```

Output 9.2.5: First Part of Output Generated by prilmts()

 Results of the Least Median Squares Estimation

| Quantity | Value |
|----------|-------|
| Quantile | 13 |
| Number of Subsets | 2103 |
| Number of Singular Subsets | 103 |
| Number of Nonzero Weights | 17 |
| Objective Function | 0.75 |
| Preliminary Scale Estimate | 1.0478511 |
| Final Scale Estimate | 1.2076147 |
| Robust R-Squared | 0.9648438 |
| Asymptotic Consistency Factor | 1.1413664 |
| RLS Scale Estimate | 1.252714 |
| Weighted Sum of Squares | 20.4008 |
| Weighted R-Squared | 0.9750062 |
| F Statistic | 169.04318 |

Output 9.2.6: Second Part of Output Generated by prilmts()

|  | VAR1 | VAR2 | VAR3 | Intercep |
|--|------|------|------|----------|
| Estimated LMS Coefficients | 0.75 | 0.5 | 0 | -39.25 |
| Indices of Best Sample | 15 | 11 | 19 | 10 |
| Estimated WLS Coefficients | 0.7976856 | 0.5773405 | -0.06706 | -37.65246 |
| Standard Errors | 0.0674391 | 0.1659689 | 0.0616031 | 4.7320509 |
| T Values | 11.828242 | 3.4786054 | -1.088584 | -7.956901 |
| Probabilities | 2.4838E-8 | 0.004078 | 0.2961071 | 2.3723E-6 |
| Lower Wald CI | 0.6655074 | 0.2520473 | -0.1878 | -46.92711 |
| Upper Wald CI | 0.9298637 | 0.9026336 | 0.0536798 | -28.37781 |

Output 9.2.7: Third Part of Output Generated by prilmts()
LMS Residuals (observations 1-21):
6.4176097, 2.2772163, 6.21059, 7.2456884, -0.20702, -0.621059, -0.20702,
0.621059, -0.621059, 0.621059, 0.621059, 0.2070197, -1.863177, -1.449138,
0.621059, -0.20702, 0.2070197, 0.2070197, 0.621059, 1.863177, -6.831649

Diagnostics (observations 1-21):
10.448052, 7.9317507, 10, 11.666667, 2.7297297, 3.4864865, 4.7297297,
4.2432432, 3.6486486, 3.7598351, 4.6057675, 4.9251688, 3.8888889, 4.5864209,
5.2970297, 4.009901, 6.679576, 4.3053404, 4.0199755, 3, 11

WLS Residuals (observations 1-21):
4.9634454, 0.9185794, 5.1312163, 6.5250478, -0.535877, -0.996749, -0.338162,
0.4601047, -0.844485, 0.286883, 0.7686702, 0.377432, -2.000854, -1.074607,
1.0731966, 0.1143341, -0.297718, 0.0770058, 0.4679328, 1.544008, -6.888934

You now want to report the results of LTS for the 2000 random subsets:

```
title2 "***Use 2000 Random Subsets***";
a = aa[,2:4]; b = aa[,5];
optn = j(8,1,.);
optn[2]= 2;    /* ipri */
optn[3]= 3;    /* ilsq */
optn[8]= 3;    /* icov */

call lts(sc,coef,wgt,optn,b,a);
```

Output 9.2.8: Iteration History and Optimization Results

 Random Subsampling for LTS
| Subset | Singular | Best Criterion | Percent |
|--------|----------|----------------|---------|
| 500 | 23 | 0.099507 | 25 |
| 1000 | 55 | 0.087814 | 50 |
| 1500 | 79 | 0.084061 | 75 |
| 2000 | 103 | 0.084061 | 100 |

 Minimum Criterion= 0.0840614072

 Least Trimmed Squares (LTS) Method

 Minimizing Sum of 13 Smallest Squared Residuals.

 Highest Possible Breakdown Value = 42.86 %

 Random Selection of 2103 Subsets

 Among 2103 subsets 103 are singular.

 Observations of Best Subset 10 11 7 15

Estimated Coefficients

| VAR1 | VAR2 | VAR3 | Intercep |
|------|------|------|----------|
| 0.75 | 0.3333333333 | 0 | -35.70512821 |

 LTS Objective Function = 0.4985185153

 Preliminary LTS Scale = 1.0379336739

 Robust R Squared = 0.9719626168

 Final LTS Scale = 1.0407755737

In addition to observations 1, 3, 4, and 21, which LMS flagged as outliers, observation 2 has a scaled residual considerably larger than 2.5 under LTS (output not shown) and is also considered an outlier. The WLS results based on LTS therefore differ from those based on LMS.
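The printed LTS objective can be cross-checked against its definition: the square root of the mean of the h=13 smallest squared residuals under the best-subset coefficients. A sketch in Python (illustrative; the coefficients are taken from the output above, with the printed 0.3333333333 read as 1/3):

```python
# (x1, x2, y) per observation; x3 is omitted since its LTS coefficient is zero
obs = [(80, 27, 42), (80, 27, 37), (75, 25, 37), (62, 24, 28), (62, 22, 18),
       (62, 23, 18), (62, 24, 19), (62, 24, 20), (58, 23, 15), (58, 18, 14),
       (58, 18, 14), (58, 17, 13), (58, 18, 11), (58, 19, 12), (50, 18, 8),
       (50, 18, 7), (50, 19, 8), (50, 19, 8), (50, 20, 9), (56, 20, 15),
       (70, 20, 15)]

h = 13                                          # trimming quantile for N=21, n=4
b1, b2, b0 = 0.75, 1 / 3, -35.70512821          # LTS coefficients from the output
sq = sorted((y - (b1 * x1 + b2 * x2 + b0)) ** 2 for x1, x2, y in obs)
objective = (sum(sq[:h]) / h) ** 0.5
print(objective)  # ≈ 0.49851852, matching the printed LTS objective function
```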

Output 9.2.9: Table of Weighted LS Regression
 Weighted Least-Squares Estimation

RLS Parameter Estimates Based on LTS

| Variable | Estimate | Approx Std Err | t Value | Pr > \|t\| | Lower WCI | Upper WCI |
|----------|----------|----------------|---------|------------|-----------|-----------|
| VAR1 | 0.75694055 | 0.07860766 | 9.63 | <.0001 | 0.60287236 | 0.91100874 |
| VAR2 | 0.45353029 | 0.13605033 | 3.33 | 0.0067 | 0.18687654 | 0.72018405 |
| VAR3 | -0.05211 | 0.05463722 | -0.95 | 0.3607 | -0.159197 | 0.054977 |
| Intercep | -34.05751 | 3.82881873 | -8.90 | <.0001 | -41.561857 | -26.553163 |

 Weighted Sum of Squares = 10.273044977

 Degrees of Freedom = 11

 RLS Scale Estimate = 0.9663918355

Cov Matrix of Parameter Estimates

|  | VAR1 | VAR2 | VAR3 | Intercep |
|--|------|------|------|----------|
| VAR1 | 0.0061791648 | -0.005776855 | -0.002300587 | -0.034290068 |
| VAR2 | -0.005776855 | 0.0185096933 | 0.0002582502 | -0.069740883 |
| VAR3 | -0.002300587 | 0.0002582502 | 0.0029852254 | -0.131487406 |
| Intercep | -0.034290068 | -0.069740883 | -0.131487406 | 14.659852903 |

 Weighted R-squared = 0.9622869127

 F(3,11) Statistic = 93.558645037

 Probability = 4.1136826E-8

 There are 15 points with nonzero weight.

 Average Weight = 0.7142857143

### Consider All 5985 Subsets

You now report the results of LMS for all 5985 subsets:
```
title2 "*** Use All 5985 Subsets***";
a = aa[,2:4]; b = aa[,5];
optn = j(8,1,.);
optn[2]= 2;    /* ipri */
optn[3]= 3;    /* ilsq */
optn[6]= -1;   /* nrep: all 5985 subsets */
optn[8]= 3;    /* icov */

call lms(sc,coef,wgt,optn,b,a);
```

Output 9.2.10: Iteration History and Optimization Results for LMS
 Complete Enumeration for LMS

| Subset | Singular | Best Criterion | Percent |
|--------|----------|----------------|---------|
| 1497 | 36 | 0.185899 | 25 |
| 2993 | 87 | 0.158268 | 50 |
| 4489 | 149 | 0.140519 | 75 |
| 5985 | 266 | 0.126467 | 100 |

 Minimum Criterion= 0.1264668282

 Least Median of Squares (LMS) Method

 Minimizing 13th Ordered Squared Residual.

 Highest Possible Breakdown Value = 42.86 %

 Selection of All 5985 Subsets of 4 Cases Out of 21

 Among 5985 subsets 266 are singular.

 Observations of Best Subset 8 10 15 19

Estimated Coefficients

| VAR1 | VAR2 | VAR3 | Intercep |
|------|------|------|----------|
| 0.75 | 0.5 | 1.29526E-16 | -39.25 |

 LMS Objective Function = 0.75

 Preliminary LMS Scale = 1.0478510755

 Robust R Squared = 0.96484375

 Final LMS Scale = 1.2076147288

Next, report the results of LTS for all 5985 subsets:
```
title2 "*** Use All 5985 Subsets***";
a = aa[,2:4]; b = aa[,5];
optn = j(8,1,.);
optn[2]= 2;    /* ipri */
optn[3]= 3;    /* ilsq */
optn[6]= -1;   /* nrep: all 5985 subsets */
optn[8]= 3;    /* icov */

call lts(sc,coef,wgt,optn,b,a);
```

Output 9.2.11: Iteration History and Optimization Results for LTS
 Complete Enumeration for LTS

| Subset | Singular | Best Criterion | Percent |
|--------|----------|----------------|---------|
| 1497 | 36 | 0.135449 | 25 |
| 2993 | 87 | 0.107084 | 50 |
| 4489 | 149 | 0.081536 | 75 |
| 5985 | 266 | 0.081536 | 100 |

 Minimum Criterion= 0.0815355299

 Least Trimmed Squares (LTS) Method

 Minimizing Sum of 13 Smallest Squared Residuals.

 Highest Possible Breakdown Value = 42.86 %

 Selection of All 5985 Subsets of 4 Cases Out of 21

 Among 5985 subsets 266 are singular.

 Observations of Best Subset 5 12 17 18

Estimated Coefficients

| VAR1 | VAR2 | VAR3 | Intercep |
|------|------|------|----------|
| 0.7291666667 | 0.4166666667 | 2.220446E-16 | -36.22115385 |

 LTS Objective Function = 0.4835390299

 Preliminary LTS Scale = 1.0067458407

 Robust R Squared = 0.9736222371

 Final LTS Scale = 1.009470149
