| The FASTCLUS Procedure |
You can specify the following options in the PROC FASTCLUS statement. Table 25.1 summarizes the options.
Table 25.1: Options Available in the PROC FASTCLUS Statement
| Task | Options |
| Specify data set details | CLUSTER=, DATA=, MEAN=, OUT=, OUTITER, OUTSEED=, OUTSTAT=, SEED= |
| Specify distance dimension | BINS=, HC=, HP=, IRLS, LEAST= |
| Select initial cluster seeds | RANDOM=, REPLACE= |
| Compute final cluster seeds | CONVERGE=, DELETE=, DRIFT, MAXCLUSTERS=, MAXITER=, RADIUS=, STRICT |
| Work with missing values | IMPUTE, NOMISS |
| Specify variance divisor | VARDEF= |
| Control output | DISTANCE, LIST, NOPRINT, SHORT, SUMMARY |
The following list provides details on these options. The list is in alphabetical order.
For the homotopy options (HC= and HP=): if the initial homotopy parameter is too large or is decreased too slowly, the optimization may require many iterations. If the initial homotopy parameter is too small or is decreased too quickly, convergence to a local optimum is likely.
If you specify the IMPUTE option and also request an OUT= data set, that data set contains the imputed values.
If you do not specify the LEAST= option, PROC FASTCLUS uses the least-squares (L2) criterion. However, the default number of iterations is only 1 if you omit the LEAST= option, so the optimization of the criterion is generally not completed. If you specify the LEAST= option, the maximum number of iterations is increased to allow the optimization process a chance to converge. See the MAXITER= option.
Specifying the LEAST= option also changes the default convergence criterion from 0.02 to 0.0001. See the CONVERGE= option.
When LEAST=2, PROC FASTCLUS tries to minimize the root mean square difference between the data and the corresponding cluster means. When LEAST=1, PROC FASTCLUS tries to minimize the mean absolute difference between the data and the corresponding cluster medians. When LEAST=MAX, PROC FASTCLUS tries to minimize the maximum absolute difference between the data and the corresponding cluster midranges. For general values of p, PROC FASTCLUS tries to minimize the pth root of the mean of the pth powers of the absolute differences between the data and the corresponding cluster seeds.
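The criterion described above can be written out directly. The following Python sketch is illustrative only and is not how PROC FASTCLUS is implemented: the function name `lp_criterion` is hypothetical, the data are a single variable, and each datum is paired with the seed of the cluster it belongs to. LEAST=MAX is represented by `math.inf`.

```python
import math

def lp_criterion(values, seeds, p):
    """p-th root of the mean p-th power absolute difference between each
    value and its corresponding cluster seed; p = math.inf gives the
    maximum absolute difference (the LEAST=MAX case)."""
    diffs = [abs(v - s) for v, s in zip(values, seeds)]
    if math.isinf(p):
        return max(diffs)
    return (sum(d ** p for d in diffs) / len(diffs)) ** (1.0 / p)

# One cluster of four values, evaluated at its mean, median, and midrange.
data = [1.0, 2.0, 3.0, 10.0]
for label, p, seed in [("LEAST=2", 2, 4.0),
                       ("LEAST=1", 1, 2.5),
                       ("LEAST=MAX", math.inf, 5.5)]:
    print(label, lp_criterion(data, [seed] * len(data), p))
```

In this example the mean (4.0) gives a smaller LEAST=2 criterion than the median (2.5) does, which is consistent with the pairing of criteria and seed types described above.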
The divisor in the clustering criterion is either the number of nonmissing data used in the analysis or, if there is a WEIGHT statement, the sum of the weights corresponding to all the nonmissing data used in the analysis (that is, an observation with n nonmissing data contributes n times the observation weight to the divisor). The divisor is not adjusted for degrees of freedom.
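As a small illustration of this divisor, the Python sketch below counts nonmissing data per observation and multiplies by the observation weight. The function name `criterion_divisor` and the use of `None` to mark a missing value are assumptions made for the example; they are not part of PROC FASTCLUS.

```python
def criterion_divisor(rows, weights=None):
    """Sum, over observations, of (number of nonmissing values) times
    (observation weight). Without weights, every weight is taken as 1,
    so the result is the count of nonmissing data."""
    total = 0.0
    for i, row in enumerate(rows):
        n_nonmissing = sum(1 for v in row if v is not None)
        w = 1.0 if weights is None else weights[i]
        total += n_nonmissing * w
    return total

# Three observations on two variables; the second has one missing value.
rows = [[1.0, 2.0], [None, 3.0], [4.0, 5.0]]
print(criterion_divisor(rows))                   # unweighted
print(criterion_divisor(rows, [2.0, 1.0, 0.5]))  # with observation weights
```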
The method for updating cluster seeds during iteration depends on the LEAST= option, as follows (Gonin and Money 1989).
| LEAST=p | Algorithm for Computing Cluster Seeds |
| p=1 | bin sort for median |
| 1<p<2 | modified Merle-Spath if you specify IRLS, otherwise modified Ekblom-Newton |
| p=2 | arithmetic mean |
| 2<p<∞ | Newton |
| p=∞ | midrange |
During the final pass, a modified Merle-Spath step is taken to compute the cluster centers for 1≤p<2 or 2<p<∞.
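For the three cases with a closed-form seed (median, arithmetic mean, and midrange), the update can be sketched in Python as follows. This is an illustration under stated assumptions: `update_seed` is a hypothetical name, the median comes from the standard library rather than a bin sort, and the iterative methods (Merle-Spath, Ekblom-Newton, Newton) are deliberately left unimplemented.

```python
import math
import statistics

def update_seed(values, p):
    """Closed-form seed for one variable in one cluster.
    Only p = 1, p = 2, and p = math.inf are handled here."""
    if p == 1:
        return statistics.median(values)          # minimizes mean absolute difference
    if p == 2:
        return sum(values) / len(values)          # arithmetic mean
    if math.isinf(p):
        return (min(values) + max(values)) / 2.0  # midrange
    raise NotImplementedError("other values of p need an iterative method")

cluster = [1.0, 2.0, 3.0, 10.0]
print(update_seed(cluster, 1), update_seed(cluster, 2), update_seed(cluster, math.inf))
```

The outlying value 10.0 pulls the mean and the midrange further than it pulls the median, which is one reason to consider LEAST=1 for data with outliers.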
If you specify the LEAST=p option with a value other than 2, PROC FASTCLUS computes pooled scale estimates analogous to the root mean square standard deviation but based on pth power deviations instead of squared deviations.
| LEAST=p | Scale Estimate |
| p=1 | mean absolute deviation |
| 1<p<∞ | root mean pth-power absolute deviation |
| p=∞ | maximum absolute deviation |
The divisors for computing the mean absolute deviation or the root mean pth-power absolute deviation are adjusted for degrees of freedom just like the divisors for computing standard deviations. This adjustment can be suppressed by the VARDEF= option.
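The following Python sketch shows one way to read these scale estimates for a single variable. It is an assumption-laden illustration, not the procedure's documented formula: the name `pooled_scale` is hypothetical, the degrees-of-freedom adjustment is taken to be a divisor of n-c (observations minus clusters) by analogy with a pooled standard deviation, and no adjustment is applied to the maximum absolute deviation.

```python
import math

def pooled_scale(values, seeds, p, n_clusters, adjust_df=True):
    """Root mean p-th power absolute deviation from the cluster seeds.
    With adjust_df=True the divisor is n - n_clusters (assumed analogue of
    the standard-deviation adjustment); otherwise it is n."""
    devs = [abs(v - s) for v, s in zip(values, seeds)]
    if math.isinf(p):
        return max(devs)  # maximum absolute deviation
    divisor = len(devs) - n_clusters if adjust_df else len(devs)
    return (sum(d ** p for d in devs) / divisor) ** (1.0 / p)

# Two clusters with seeds 2 and 12; absolute deviations are 1, 1, 2, 2.
values = [1.0, 3.0, 10.0, 14.0]
seeds = [2.0, 2.0, 12.0, 12.0]
print(pooled_scale(values, seeds, 1, n_clusters=2))
print(pooled_scale(values, seeds, 1, n_clusters=2, adjust_df=False))
```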
The default value of the MAXITER= option depends on the LEAST=p option.
| LEAST=p | MAXITER= |
| not specified | 1 |
| p=1 | 20 |
| 1<p<1.5 | 50 |
| 1.5≤p<2 | 20 |
| p=2 | 10 |
| 2<p≤∞ | 20 |
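The table can be read as a simple lookup. In this Python sketch the function name `default_maxiter` is hypothetical, `None` stands for an omitted LEAST= option, `math.inf` stands for LEAST=MAX, and the two remaining 20-iteration rows are taken to cover 1.5≤p<2 and p>2. Rejecting p<1 is also an assumption of the sketch.

```python
import math

def default_maxiter(p=None):
    """Default MAXITER= value as a function of LEAST=p (None = not specified)."""
    if p is None:
        return 1
    if p < 1:
        raise ValueError("this sketch assumes p >= 1")
    if p == 1:
        return 20
    if p < 1.5:
        return 50
    if p < 2:
        return 20
    if p == 2:
        return 10
    return 20  # p > 2, including math.inf for LEAST=MAX

for p in (None, 1, 1.2, 1.5, 2, 3, math.inf):
    print(p, default_maxiter(p))
```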
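The VARDEF= option selects the variance divisor (see Table 25.1). Its possible values and the associated divisors are as follows.
| Value | Description | Divisor |
| DF | error degrees of freedom | n-c |
| N | number of observations | n |
| WDF | sum of weights DF | (Σ w_i)-c |
| WEIGHT or WGT | sum of weights | Σ w_i |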
In the preceding definitions, c represents the number of clusters.
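A Python sketch of the divisor selection follows. The function name `vardef_divisor` is hypothetical, n is the number of observations, and the WDF divisor is assumed to be the sum of the weights minus c, by analogy with the DF row.

```python
def vardef_divisor(vardef, n, c, weights=None):
    """Divisor for a given VARDEF= value; c is the number of clusters."""
    key = vardef.upper()
    if key == "DF":
        return n - c
    if key == "N":
        return n
    if key in ("WDF", "WEIGHT", "WGT"):
        if weights is None:
            raise ValueError("weighted divisors need observation weights")
        total = sum(weights)
        return total - c if key == "WDF" else total
    raise ValueError("unknown VARDEF= value: " + vardef)

print(vardef_divisor("DF", n=10, c=3))
print(vardef_divisor("WDF", n=3, c=1, weights=[2.0, 2.0, 1.0]))
```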
Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.