The MODECLUS Procedure
Options available in the PROC MODECLUS statement are classified by function in Table 38.1. The corresponding default value for each option, if applicable, is also listed in this table.
Table 38.1: Functional Summary
| Description | Option | Default Value |
| Data Sets | ||
| specify input data set name | DATA= | _LAST_ |
| specify output data set name for observations | OUT= | |
| specify output data set name for clusters | OUTC= | |
| specify output data set name for cluster solutions | OUTS= | |
| Variables in Output Data Sets | ||
| specify variable in the OUT= and OUTCLUS= data sets identifying clusters | CLUSTER= | CLUSTER |
| specify variable in the OUT= data set containing density estimates | DENSITY= | DENSITY |
| specify length of variables in the output data sets | OUTLENGTH= | 8 |
| Results and Data Processing before Clustering * | ||
| request simple statistics | SIMPLE | |
| standardize the variables to mean 0 and standard deviation 1 | STANDARD | |
| Smoothing Parameters | ||
| specify number of neighbors to use for kth-nearest-neighbor density estimation | DK= | |
| specify number of neighbors to use for clustering | CK= | |
| specify number of neighbors to use for kth-nearest-neighbor density estimation and clustering | K= | |
| specify radius of the sphere of support for uniform-kernel density estimation | DR= | |
| specify radius of the neighborhood for clustering | CR= | |
| specify radius of the sphere of support for uniform-kernel density estimation and of the neighborhood for clustering | R= | |
| Density Estimation Options | ||
| specify number of times the density estimates are to be cascaded | CASCADE= | 0 |
| compute the likelihood cross-validation criterion | CROSS or CROSSLIST | |
| specify dimensionality to be used when computing density estimates | DIMENSION= | nvar* or 1 * |
| use arithmetic means for cascading density estimates | AM | |
| use harmonic means for cascading density estimates | HM | |
| use sums for cascading density estimates | SUM | |
| Clustering Methods Options | ||
| dissolve clusters with n or fewer members | DOCK= | |
| stop the analysis after obtaining a solution with either no cluster or a single cluster | EARLY | |
| request that nonsignificant clusters be hierarchically joined | JOIN(=) | |
| specify maximum number of clusters to be obtained with METHOD=6 | MAXCLUSTERS= | no limit |
| specify clustering method to use | METHOD= | |
| specify minimum members for either cluster to be designated a modal cluster when two clusters are joined using METHOD=5 | MODE= | the value of K * or 2* |
| specify power of the density used with METHOD=6 | POWER= | 2 |
| specify approximate significance tests for the number of clusters | TEST | |
| specify assignment threshold used with METHOD=6 | THRESHOLD= | 0.5 |
| Miscellaneous Options | ||
| produce all optional output | ALL | |
| display the density and cluster membership of observations with neighbors belonging to a different cluster | BOUNDARY | |
| retain the neighbor lists for each observation in memory | CORE | |
| display the estimated cross-validated log density of each observation | CROSSLIST | |
| display the estimated density and cluster membership of each observation | LIST | |
| display estimates of local dimensionality and write them to the OUT= data set | LOCAL | |
| display the neighbors of each observation | NEIGHBOR | |
| suppress the display of the output | NOPRINT | |
| suppress the display of the summary of the number of clusters, number of unassigned observations, and maximum p-value for each analysis | NOSUMMARY | |
| suppress the display of statistics for each cluster | SHORT | |
| trace the cluster assignments for the METHOD=6 algorithm | TRACE | |
You must specify at least one of the following options for smoothing parameters for density estimation: DK=, K=, DR=, or R=. To obtain a cluster analysis, you must specify the METHOD= option and at least one of the following smoothing parameters for clustering: CK=, K=, CR=, or R=. If you want significance tests for the number of clusters, you must specify either the DR= or R= option. See the section "Density Estimation" for a discussion of smoothing parameters.
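The three requirements above can be restated as a small check. This is only an illustrative Python sketch, not SAS code and not how the procedure validates its input: the function name `check_options` and the message strings are hypothetical, and the sketch assumes that significance tests are requested through the TEST option.

```python
# Option names are taken from the text above; everything else is hypothetical.
DENSITY_SMOOTHING = {"DK", "K", "DR", "R"}
CLUSTER_SMOOTHING = {"CK", "K", "CR", "R"}
TEST_SMOOTHING = {"DR", "R"}


def check_options(specified):
    """Return a list of unmet requirements for a collection of option names."""
    # Accept names written with or without the trailing '=' and in any case.
    opts = {name.upper().rstrip("=") for name in specified}
    problems = []
    if not opts & DENSITY_SMOOTHING:
        problems.append("density estimation needs one of DK=, K=, DR=, R=")
    if "METHOD" in opts and not opts & CLUSTER_SMOOTHING:
        problems.append("clustering needs one of CK=, K=, CR=, R=")
    # Assumption: tests are requested with the TEST option.
    if "TEST" in opts and not opts & TEST_SMOOTHING:
        problems.append("significance tests need DR= or R=")
    return problems
```

For example, `check_options({"METHOD=", "DK="})` reports that a clustering smoothing parameter is missing, because DK= applies only to density estimation.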
You can specify lists of values for the DK=, CK=, K=, DR=, CR=, and R= options. Numbers in the lists can be separated by blanks or commas. You can include in the lists one or more items of the form start TO stop BY increment. Each list can contain either one value or the same number of values as in every other list that contains more than one value. If a list has only one value, that value is used in combination with all the values in longer lists. If two or more lists have more than one value, then one analysis is done using the first value in each list, another analysis is done using the second value in each list, and so on.
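The pairing rule for lists can be sketched in Python as follows. The helper names `expand_to_by` and `pair_lists` are hypothetical and the code only illustrates the rule as stated above; it does not reproduce the procedure's own parsing, and the sketch handles positive increments only.

```python
def expand_to_by(start, stop, increment=1):
    """Expand an item of the form 'start TO stop BY increment' into values."""
    if increment <= 0:
        raise ValueError("this sketch only handles positive increments")
    if stop < start:
        return []
    # The small tolerance guards against floating-point error in the division.
    count = int((stop - start) / increment + 1e-9) + 1
    return [start + i * increment for i in range(count)]


def pair_lists(lists):
    """Map option name -> list of values to one dict of values per analysis.

    A one-value list is reused for every analysis; all longer lists must
    have the same length and are matched position by position.
    """
    if any(len(values) == 0 for values in lists.values()):
        raise ValueError("every list needs at least one value")
    lengths = {len(values) for values in lists.values() if len(values) > 1}
    if len(lengths) > 1:
        raise ValueError("lists with more than one value must have equal lengths")
    n = lengths.pop() if lengths else 1
    return [
        {name: (values[0] if len(values) == 1 else values[i])
         for name, values in lists.items()}
        for i in range(n)
    ]
```

With `{"DR": expand_to_by(1, 3, 1), "CK": [10]}`, the sketch yields three analyses, each using CK=10 with DR=1, 2, and 3 in turn.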
You can specify the following options in the PROC MODECLUS statement.
You can specify a list of values for the CASCADE= option. Each value in the list is combined with each combination of smoothing parameters to produce a separate analysis.
If the DATA= data set is TYPE=DISTANCE, the data are interpreted as a distance matrix. The number of variables must equal the number of observations in the data set or in each BY group. The distances are assumed to be Euclidean, but the procedure accepts other types of distances or dissimilarities. Unlike the CLUSTER procedure, PROC MODECLUS uses the entire distance matrix, not just the lower triangle; the distances are not required to be symmetric. The neighbors of a given observation are determined solely from the distances in that observation. Missing values are considered infinite. Various distance measures can be computed from coordinate data using the %DISTANCE macro in the SAS/STAT sample library.
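A consequence of using the whole matrix is that neighbor relations need not be mutual. The Python sketch below illustrates this for a fixed-radius neighborhood. It is not the procedure's implementation: the function `neighbors_within` is hypothetical, `None` stands in for a missing value, and the sketch assumes that an observation is not counted as its own neighbor and that the neighborhood includes its boundary.

```python
import math


def neighbors_within(dist, i, radius):
    """Return the neighbors of observation i, using only row i of dist."""
    result = []
    for j, d in enumerate(dist[i]):
        if j == i:
            continue  # assumption: an observation is not its own neighbor
        d = math.inf if d is None else d  # missing distances count as infinite
        if d <= radius:  # assumption: the boundary belongs to the neighborhood
            result.append(j)
    return result


# An asymmetric example: row 0 says observation 1 is close (1.0),
# but row 1 says observation 0 is far (3.0).
dist = [
    [0.0, 1.0, None],
    [3.0, 0.0, 0.5],
    [2.0, 0.5, 0.0],
]
```

Here `neighbors_within(dist, 0, 2.0)` contains observation 1, while `neighbors_within(dist, 1, 2.0)` does not contain observation 0, and the missing entry in row 0 never qualifies as a neighbor.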
If the data set is not TYPE=DISTANCE, the data are interpreted as coordinates in a Euclidean space, and Euclidean distances are computed. The variables can be discrete or continuous and should be at the interval level of measurement.
For the JOIN= option, any value of p less than 1E-8 is set to 1E-8.
You must specify the METHOD= option to obtain a cluster analysis.
You can specify a list of values for the METHOD= option. Each value in the list is combined with each combination of smoothing and cascading parameters to produce a separate cluster analysis.
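The way CASCADE= and METHOD= lists multiply the number of analyses can be sketched in Python. The function `enumerate_analyses` is hypothetical, and the order in which it lists the analyses is an assumption of the sketch, not a statement about the order PROC MODECLUS uses.

```python
from itertools import product


def enumerate_analyses(smoothing_combos, cascade_values=(0,), method_values=()):
    """List one dict per cluster analysis.

    smoothing_combos: list of dicts, one per combination of smoothing parameters.
    cascade_values:   CASCADE= values (0 is the documented default).
    method_values:    METHOD= values; with none given, no cluster analysis results.
    """
    return [
        {**combo, "CASCADE": cascade, "METHOD": method}
        for combo, cascade, method in product(
            smoothing_combos, cascade_values, method_values
        )
    ]
```

Two smoothing combinations, two CASCADE= values, and two METHOD= values therefore give 2 x 2 x 2 = 8 cluster analyses.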
The OUTLENGTH= option applies only to the following variables that appear in all of the output data sets: _K_, _DK_, _CK_, _R_, _DR_, _CR_, _CASCAD_, _METHOD_, _NJOIN_, and _LOCAL_.
The minimum OUTLENGTH= value is 2 or 3, depending on the operating system. The maximum value, which is also the default, is 8.
The significance tests performed by PROC MODECLUS are valid only for simple random samples, and they require at least 20 observations per cluster to have enough power to be of any use. See the section "Significance Tests".
Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.