The STDIZE Procedure

| Method | Location | Scale |
|---|---|---|
| MEAN | mean | 1 |
| MEDIAN | median | 1 |
| SUM | 0 | sum |
| EUCLEN | 0 | Euclidean length |
| USTD | 0 | standard deviation about origin |
| STD | mean | standard deviation |
| RANGE | minimum | range |
| MIDRANGE | midrange | range/2 |
| MAXABS | 0 | maximum absolute value |
| IQR | median | interquartile range |
| MAD | median | median absolute deviation from median |
| ABW(c) | biweight 1-step M-estimate | biweight A-estimate |
| AHUBER(c) | Huber 1-step M-estimate | Huber A-estimate |
| AWAVE(c) | Wave 1-step M-estimate | Wave A-estimate |
| AGK(p) | mean | AGK estimate (ACECLUS) |
| SPACING(p) | mid minimum-spacing | minimum spacing |
| L(p) | L(p) | L(p) |
| IN(ds) | read from data set | read from data set |
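As an illustration of how the location and scale columns are used, the following Python sketch (not SAS code) computes the two measures for a few of the simpler methods and applies the usual standardization, (original value - location) / scale. The function names `location_scale` and `standardize` are hypothetical, and the use of an n - 1 divisor for the standard deviation is an assumption; the robust methods and any additive or multiplicative options are deliberately left out.

```python
import math
import statistics


def location_scale(values, method):
    """Return (location, scale) for a few of the tabulated methods.

    Illustrative only: the name and signature are hypothetical.
    """
    x = list(values)
    lo, hi = min(x), max(x)
    if method == "MEAN":
        return statistics.mean(x), 1.0
    if method == "MEDIAN":
        return statistics.median(x), 1.0
    if method == "SUM":
        return 0.0, sum(x)
    if method == "EUCLEN":
        return 0.0, math.sqrt(sum(v * v for v in x))
    if method == "STD":
        # Assumption: sample standard deviation with an n - 1 divisor.
        return statistics.mean(x), statistics.stdev(x)
    if method == "RANGE":
        return lo, hi - lo
    if method == "MIDRANGE":
        return (lo + hi) / 2, (hi - lo) / 2
    if method == "MAXABS":
        return 0.0, max(abs(v) for v in x)
    raise ValueError(f"method not covered by this sketch: {method}")


def standardize(values, method):
    """Apply (value - location) / scale to every value."""
    loc, scale = location_scale(values, method)
    return [(v - loc) / scale for v in values]


data = [2.0, 4.0, 4.0, 6.0, 9.0]
range_scaled = standardize(data, "RANGE")        # values fall in [0, 1]
midrange_scaled = standardize(data, "MIDRANGE")  # values fall in [-1, 1]
std_scaled = standardize(data, "STD")            # mean 0, unit sample variance
```

With these definitions, RANGE maps the minimum to 0 and the maximum to 1, while MIDRANGE maps them to -1 and 1.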

For METHOD=ABW(*c*), METHOD=AHUBER(*c*), or
METHOD=AWAVE(*c*), *c* is a positive
numeric tuning constant.

For METHOD=AGK(*p*), *p* is a numeric constant giving the
proportion of pairs to be included in the estimation of the
within-cluster variances.

For METHOD=SPACING(*p*), *p* is a numeric constant giving
the proportion of data to be contained in the spacing.

For METHOD=L(*p*), *p* is a numeric constant greater than
or equal to 1 specifying the power to which differences
are to be raised in computing an L(*p*) or Minkowski
metric.

For METHOD=IN(*ds*), *ds* is the name of a SAS data set that
meets either one of the following two conditions:

- contains a _TYPE_ variable.
  The observation that contains the location
  measure corresponds to the value _TYPE_='LOCATION' and the observation that
  contains the scale measure corresponds to the value _TYPE_='SCALE'.
  You can also use a data set created by the
  OUTSTAT= option from another PROC STDIZE statement as the
  *ds* data set. See the section "Output Data Sets" for the contents of the OUTSTAT= data set.
- contains the location and scale variables specified by the LOCATION and SCALE statements.

PROC STDIZE reads the location and scale measures
from the *ds* data set by
first looking for the _TYPE_ variable.
If it finds this variable, PROC STDIZE
then searches the data set for all variables specified
in the VAR statement. If it does not find the _TYPE_ variable,
PROC STDIZE
searches for the location variables specified in
the LOCATION statement and the scale variables
specified in the SCALE statement.
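The lookup order described above can be sketched in Python (not SAS code). Everything here is a stand-in: a data set is modeled as a list of dictionaries, the function name `read_location_scale` is hypothetical, and two details are assumptions rather than documented behavior: that the LOCATION and SCALE variables pair up positionally with the VAR variables, and that the first observation is the one read when there is no _TYPE_ variable.

```python
def read_location_scale(ds, var_names, location_names=None, scale_names=None):
    """Return {variable: (location, scale)} from a data-set stand-in.

    ds is a list of dict rows; all names here are hypothetical.
    """
    has_type = any("_TYPE_" in row for row in ds)
    if has_type:
        # Case 1: a _TYPE_ variable exists, so the measures are stored
        # under the VAR variable names in the LOCATION and SCALE rows.
        by_type = {row["_TYPE_"]: row for row in ds if "_TYPE_" in row}
        loc_row, scale_row = by_type["LOCATION"], by_type["SCALE"]
        return {v: (loc_row[v], scale_row[v]) for v in var_names}

    # Case 2: no _TYPE_ variable, so fall back to the variables named in
    # the LOCATION and SCALE statements.
    # Assumptions: positional pairing with VAR, first observation used.
    row = ds[0]
    return {
        v: (row[loc], row[scl])
        for v, loc, scl in zip(var_names, location_names, scale_names)
    }


# A data set shaped like the _TYPE_ layout described above.
typed_ds = [
    {"_TYPE_": "LOCATION", "height": 170.0, "weight": 65.0},
    {"_TYPE_": "SCALE", "height": 10.0, "weight": 12.0},
]
from_typed = read_location_scale(typed_ds, ["height", "weight"])

# A data set without _TYPE_, using separately named variables.
plain_ds = [{"h_loc": 170.0, "h_scl": 10.0, "w_loc": 65.0, "w_scl": 12.0}]
from_plain = read_location_scale(
    plain_ds,
    ["height", "weight"],
    location_names=["h_loc", "w_loc"],
    scale_names=["h_scl", "w_scl"],
)
```

Both calls yield the same location and scale pair for each variable, which shows that the two layouts carry the same information in different shapes.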

For robust estimators, refer to Goodall (1983) and Iglewicz (1983).
The MAD method has the
highest breakdown point (50%), but it is somewhat inefficient.
The ABW, AHUBER, and AWAVE methods provide a good compromise between
breakdown and efficiency. The L(*p*) location estimates are
increasingly robust as *p* drops from 2 (corresponding to least squares,
or mean estimation) to 1 (corresponding to least absolute value, or median
estimation). However, the L(*p*) scale estimates are not robust.
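A small numerical sketch in Python (not SAS code) can make the robustness claims concrete. It assumes the usual definitions: the MAD is the unscaled median of absolute deviations from the median, and the L(*p*) location is the value *m* that minimizes the sum of |x - m|^p. The helper names `mad` and `lp_location` are hypothetical, and the ternary search is only one simple way to find the minimizer, not a description of how PROC STDIZE computes it.

```python
import statistics


def mad(values):
    """Median absolute deviation from the median (no rescaling applied)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def lp_location(values, p, tol=1e-9):
    """Minimize sum(|x - m|**p) over m by ternary search.

    The objective is convex in m for p >= 1, so the search converges.
    """
    if p < 1:
        raise ValueError("p must be greater than or equal to 1")

    def cost(m):
        return sum(abs(v - m) ** p for v in values)

    lo, hi = min(values), max(values)
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if cost(m1) < cost(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2


clean = [8.0, 9.0, 10.0, 11.0, 12.0]
dirty = [8.0, 9.0, 10.0, 11.0, 1000.0]  # one gross outlier

# The MAD is unchanged by the outlier; the standard deviation is not.
mad_clean, mad_dirty = mad(clean), mad(dirty)
std_clean, std_dirty = statistics.stdev(clean), statistics.stdev(dirty)

# L(p) locations on the contaminated data for p = 2, 1.5, and 1.
loc_p2 = lp_location(dirty, 2)     # close to the mean
loc_p15 = lp_location(dirty, 1.5)  # between the mean and the median
loc_p1 = lp_location(dirty, 1)     # close to the median
```

On the contaminated sample, the location estimate moves from the mean toward the median as *p* decreases from 2 to 1, which is the behavior the paragraph above describes.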

The SPACING method is robust to both outliers and clustering
(Janssen et al. 1995) and is,
therefore, a good choice for cluster analysis or
nonparametric density estimation. The mid-minimum
spacing method estimates the mode for small *p*. The AGK method is also
robust to clustering and more efficient than the SPACING method,
but it is not as robust to outliers and takes longer to
compute. If you expect *g* clusters, the argument to
METHOD=SPACING or METHOD=AGK should be 1/*g* or less.
The AGK method is less biased
than the SPACING method for small samples. As a general guide, it is
reasonable to use AGK for samples of size 100 or less
and SPACING for samples of size 1000 or more, with the
treatment of intermediate sample sizes depending on the
available computer resources.
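To illustrate why the argument should be 1/*g* or less, here is a minimal Python sketch (not SAS code) of a minimum-spacing calculation. It assumes that the minimum spacing is the width of the shortest interval containing a proportion *p* of the sorted data and that the mid minimum-spacing is the midpoint of that interval. The function name `minimum_spacing` and the rule for converting *p* into a number of points (rounding up, with at least two points) are assumptions.

```python
import math


def minimum_spacing(values, p):
    """Return (mid minimum-spacing, minimum spacing) for proportion p.

    Illustrative only: the point-count rule below is an assumption.
    """
    x = sorted(values)
    n = len(x)
    # Assumption: a window holds ceil(p * n) points, at least 2, at most n.
    k = min(n, max(2, math.ceil(p * n)))
    best_width, best_mid = None, None
    for i in range(n - k + 1):
        lo, hi = x[i], x[i + k - 1]
        width = hi - lo
        if best_width is None or width < best_width:
            best_width, best_mid = width, (lo + hi) / 2
    return best_mid, best_width


# Two well-separated clusters of five points each (g = 2).
two_clusters = [0.0, 0.1, 0.2, 0.3, 0.4, 10.0, 10.2, 10.4, 10.6, 10.8]

# p = 1/g: the shortest window fits inside the tighter cluster.
mid_half, width_half = minimum_spacing(two_clusters, 0.5)

# p = 1: the window must span both clusters, so the scale reflects the
# distance between clusters rather than the spread within one.
mid_all, width_all = minimum_spacing(two_clusters, 1.0)
```

With *p* = 1/2, the scale reflects the spread within a single cluster; with *p* = 1, it is dominated by the gap between the clusters.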

Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.