| The CLUSTER Procedure |
The METHOD= specification determines the clustering method used by the procedure. Any one of the following 11 methods can be specified for name:
The following table summarizes the options in the PROC CLUSTER statement.
| Tasks | Options |
|---|---|
| Specify input and output data sets | |
| specify input data set | DATA= |
| create output data set | OUTTREE= |
| Specify clustering methods | |
| specify clustering method | METHOD= |
| beta for flexible beta method | BETA= |
| minimum number of members for modal clusters | MODE= |
| penalty coefficient for maximum-likelihood | PENALTY= |
| Wong's hybrid clustering method | HYBRID |
| Control data processing prior to clustering | |
| suppress computation of eigenvalues | NOEIGEN |
| suppress normalizing of distances | NONORM |
| suppress squaring of distances | NOSQUARE |
| standardize variables | STANDARD |
| omit points with low probability densities | TRIM= |
| Control density estimation | |
| dimensionality for estimates | DIM= |
| number of neighbors for kth-nearest-neighbor | K= |
| radius of sphere of support for uniform-kernel | R= |
| Suppress checking for ties | NOTIE |
| Control display of the cluster history | |
| display cubic clustering criterion | CCC |
| suppress display of ID values | NOID |
| specify number of generations to display | PRINT= |
| display pseudo F and t² statistics | PSEUDO |
| display root-mean-square standard deviation | RMSSTD |
| display R² and semipartial R² | RSQUARE |
| Control other aspects of output | |
| suppress display of all output | NOPRINT |
| display simple summary statistics | SIMPLE |
The following list provides details on these options.
You cannot use a TYPE=CORR data set as input to PROC CLUSTER, since the procedure uses dissimilarity measures. Instead, you can use a DATA step or the IML procedure to extract the correlation matrix from a TYPE=CORR data set and transform the values to dissimilarities such as 1 - r or 1 - r², where r is the correlation.
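As a minimal sketch of that transformation, the following steps assume a hypothetical data set work.raw with numeric variables x1-x3. PROC CORR writes a TYPE=CORR data set, and a DATA step keeps only the correlation rows and replaces each r with 1 - r. All data set and variable names are illustrative, and METHOD=COMPLETE is an arbitrary choice.

```sas
/* Assumption: work.raw exists and contains numeric variables x1-x3. */
proc corr data=work.raw outp=work.corr noprint;
   var x1 x2 x3;
run;

/* Keep only the correlation rows and convert each r to 1 - r. */
data work.dist(type=distance);
   set work.corr;
   where _type_ = 'CORR';
   array v{*} x1 x2 x3;
   do i = 1 to dim(v);
      v{i} = 1 - v{i};
   end;
   drop i _type_;
run;

/* Cluster the variables using the dissimilarity matrix. */
proc cluster data=work.dist method=complete outtree=work.tree;
   id _name_;
run;
```

Note that this clusters the variables x1-x3 rather than the observations, because the dissimilarities are derived from correlations between variables.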
All methods produce the same results when used with coordinate data as when used with Euclidean distances computed from the coordinates. However, the DIM= option must be used with distance data if you specify METHOD=TWOSTAGE or METHOD=DENSITY or if you specify the TRIM= option.
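The following sketch shows one way DIM= might be supplied with distance data. The distance matrix, the data set name work.eucl, and the value DIM=2 are all assumptions made for illustration; the distances are invented and the points are simply taken to lie in a plane.

```sas
/* Hypothetical lower-triangular distance matrix for four points. */
data work.eucl(type=distance);
   input name $ p1-p4;
   datalines;
A 0   .   .   .
B 1.0 0   .   .
C 4.2 3.6 0   .
D 5.0 4.1 1.2 0
;

/* Density linkage on distance data: DIM= gives the assumed dimensionality. */
proc cluster data=work.eucl method=density k=2 dim=2 outtree=work.tree;
   id name;
run;
```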
Certain methods that are most naturally defined in terms of coordinates require squared Euclidean distances to be used in the combinatorial distance formulas (Lance and Williams 1967). For this reason, distance data are automatically squared when used with METHOD=AVERAGE, METHOD=CENTROID, METHOD=MEDIAN, or METHOD=WARD. If you want the combinatorial formulas to be applied to the (unsquared) distances with these methods, use the NOSQUARE option.
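To illustrate the difference, the two runs below use a hypothetical TYPE=DISTANCE data set work.eucl with an ID variable called name (both names are assumptions). The first run relies on the automatic squaring described above; the second adds NOSQUARE so that the combinatorial formula is applied to the distances as supplied.

```sas
/* Assumption: work.eucl is a TYPE=DISTANCE data set with ID variable name. */

/* Default behavior: distances are squared before the formula is applied. */
proc cluster data=work.eucl method=average outtree=work.tree_sq;
   id name;
run;

/* NOSQUARE: the formula is applied to the unsquared distances. */
proc cluster data=work.eucl method=average nosquare outtree=work.tree_raw;
   id name;
run;
```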
The MEAN= data set produced by the FASTCLUS procedure is suitable for input to the CLUSTER procedure for hybrid clustering. Since this data set contains _FREQ_ and _RMSSTD_ variables, you can use it as input and then omit the FREQ and RMSSTD statements.
You must specify either METHOD=DENSITY or METHOD=TWOSTAGE with the HYBRID option. You cannot use this option in combination with the TRIM=, K=, or R= option.
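A hybrid analysis along these lines might be sketched as follows. The input data set work.raw, the variables x1-x3, and the choice of 50 preliminary clusters are assumptions for illustration only.

```sas
/* Assumption: work.raw exists and contains numeric variables x1-x3. */

/* Preliminary k-means step; MEAN= writes cluster means with _FREQ_ and _RMSSTD_. */
proc fastclus data=work.raw maxclusters=50 mean=work.prelim noprint;
   var x1 x2 x3;
run;

/* Hybrid clustering of the preliminary clusters; no FREQ or RMSSTD
   statement is given because _FREQ_ and _RMSSTD_ are in the data set. */
proc cluster data=work.prelim method=twostage hybrid outtree=work.tree;
   var x1 x2 x3;
run;
```

The VAR statement in the PROC CLUSTER step restricts the analysis to the original variables, since the MEAN= data set also contains other numeric variables written by PROC FASTCLUS.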
If you request an analysis that requires density estimation (the TRIM= option, METHOD=DENSITY, or METHOD=TWOSTAGE), you must specify one of the K=, HYBRID, or R= options.
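For example, a density linkage analysis might request either kth-nearest-neighbor or uniform-kernel estimation, as sketched below. The data set work.raw, the variables x1-x3, and the values K=10 and R=1.5 are arbitrary assumptions.

```sas
/* Assumption: work.raw exists and contains numeric variables x1-x3. */

/* kth-nearest-neighbor density estimation. */
proc cluster data=work.raw method=density k=10 outtree=work.tree_k;
   var x1 x2 x3;
run;

/* Uniform-kernel density estimation with a fixed radius. */
proc cluster data=work.raw method=density r=1.5 outtree=work.tree_r;
   var x1 x2 x3;
run;
```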
Use the MODE= option only with METHOD=DENSITY or METHOD=TWOSTAGE. With METHOD=TWOSTAGE, the MODE= option affects the number of modal clusters formed. With METHOD=DENSITY, the MODE= option does not affect the clustering process but does determine the number of modal clusters reported on the output and identified by the _MODE_ variable in the output data set.
If you specify the K= option, the default value of MODE= is the same as the value of K= because the use of kth-nearest-neighbor density estimation limits the resolution that can be obtained for clusters with fewer than k members. If you do not specify the K= option, the default is MODE=2.
If you specify MODE=0, the default value is used instead of 0.
If you specify a FREQ statement or if a _FREQ_ variable appears in the input data set, the MODE= value is compared with the number of actual observations in the clusters being joined, not with the sum of the frequencies in the clusters.
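The interaction between K= and MODE= can be sketched with two runs. The data set work.raw, the variables x1-x3, and the particular values are assumptions; the comments restate the defaults described above rather than verified output.

```sas
/* Assumption: work.raw exists and contains numeric variables x1-x3. */

/* K=10 without MODE=: per the text above, MODE= defaults to the K= value. */
proc cluster data=work.raw method=twostage k=10 outtree=work.tree_default;
   var x1 x2 x3;
run;

/* Explicit MODE=5: modal clusters need at least 5 members. */
proc cluster data=work.raw method=twostage k=10 mode=5 outtree=work.tree_mode5;
   var x1 x2 x3;
run;
```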
If you specify the NOSQUARE option with distance data, the data are assumed to be squared Euclidean distances for computing R-squared and related statistics defined in a Euclidean coordinate system.
If you specify the NOSQUARE option with coordinate data with METHOD=CENTROID, METHOD=MEDIAN, or METHOD=WARD, then the combinatorial formula is applied to unsquared Euclidean distances. The resulting cluster distances do not have their usual Euclidean interpretation and are, therefore, labeled "False" in the output.
You must use either the K= or R= option when you use TRIM=. You cannot use the HYBRID option in combination with TRIM=, so you may want to use the DIM= option instead. If you specify the STANDARD option in combination with TRIM=, the variables are standardized both before and after trimming.
The TRIM= option is useful for removing outliers and reducing chaining. Trimming is highly recommended with METHOD=WARD or METHOD=COMPLETE because clusters from these methods can be severely distorted by outliers. Trimming is also valuable with METHOD=SINGLE since single linkage is the method most susceptible to chaining. Most other methods also benefit from trimming. However, trimming is unnecessary with METHOD=TWOSTAGE or METHOD=DENSITY when kth-nearest-neighbor density estimation is used.
Use of the TRIM= option may spuriously inflate the cubic clustering criterion and the pseudo F and t² statistics. Trimming only outliers improves the accuracy of the statistics, but trimming saddle regions between clusters yields excessively large values.
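A trimmed analysis might look like the following sketch, which pairs TRIM= with K= as required and requests the statistics discussed above so that they can be inspected with the caveat about inflation in mind. The data set work.raw, the variables x1-x3, and the values TRIM=10 and K=10 are arbitrary assumptions.

```sas
/* Assumption: work.raw exists and contains numeric variables x1-x3. */

/* Ward's method with low-density points trimmed; K= supplies the
   density estimate that TRIM= needs. */
proc cluster data=work.raw method=ward trim=10 k=10
             ccc pseudo outtree=work.tree;
   var x1 x2 x3;
run;
```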
Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.