PROC VARCLUS Statement
- PROC VARCLUS < options >;
The PROC VARCLUS statement starts the VARCLUS procedure
and optionally identifies a data set or requests particular
cluster analyses. By default, the procedure uses
the most recently created SAS data set and omits observations
with missing values from the analysis.
Table 59.1 summarizes
some of the options available in the PROC VARCLUS
statement.
Table 59.1: Options available on the PROC VARCLUS statement
| Task | Options |
| Specify data sets | DATA=, OUTSTAT=, OUTTREE= |
| Determine the number of clusters | MAXCLUSTERS=, MINCLUSTERS=, MAXEIGEN=, PROPORTION= |
| Specify cluster formation | CENTROID, COVARIANCE, HIERARCHY, INITIAL=, MAXITER=, MAXSEARCH=, MULTIPLEGROUP, RANDOM= |
| Control output | CORR, NOPRINT, SHORT, SIMPLE, SUMMARY, TRACE |
| Omit intercept | NOINT |
| Specify divisor for variances | VARDEF= |
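As a minimal sketch of how options from several rows of the table combine in one statement, the following step names the input data set, caps the number of clusters, and shortens the output. It assumes the SASHELP.CARS sample data set that ships with SAS 9 and later; the data set and the variable list are illustrative choices, not part of the procedure's documentation.

```sas
/* Assumption: SASHELP.CARS is available (SAS 9 and later). */
proc varclus data=sashelp.cars   /* DATA=: input data set              */
             maxclusters=3       /* stop after at most three clusters  */
             short;              /* suppress the larger matrices       */
   var MSRP Invoice EngineSize Cylinders Horsepower
       MPG_City MPG_Highway Weight Wheelbase Length;
run;
```

Observations with missing values (for example, a missing Cylinders value) are omitted from the analysis, as described above.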
The following list gives details on these
options. The list is in alphabetical order, except that
MAXITER= follows PROPORTION=.
-
CENTROID
-
uses centroid components rather than principal
components. You should specify centroid components
if you want the cluster components to be
unweighted averages of the standardized variables (the default) or the
unstandardized variables (if you specify the COV option).
It is possible to obtain
locally optimal clusterings in which a variable is not
assigned to the cluster component with which it has the
highest squared correlation. You cannot specify the CENTROID
option with the MAXEIGEN= option.
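The following sketch requests centroid components. It assumes the SASHELP.CARS sample data set (SAS 9 and later) and an illustrative variable list. Because MAXEIGEN= cannot be combined with CENTROID, the number of clusters is controlled here with MAXCLUSTERS= instead.

```sas
/* Assumption: SASHELP.CARS is available (SAS 9 and later). */
proc varclus data=sashelp.cars
             centroid            /* unweighted averages of standardized variables */
             maxclusters=3;      /* MAXEIGEN= is not allowed with CENTROID        */
   var MSRP Invoice EngineSize Cylinders Horsepower
       MPG_City MPG_Highway Weight Wheelbase Length;
run;
```

Adding the COV option to this step would make the cluster components unweighted averages of the unstandardized variables.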
-
CORR
- C
-
displays the correlation matrix.
-
COVARIANCE
- COV
-
analyzes the covariance matrix rather than the
correlation matrix.
-
DATA=SAS-data-set
-
specifies the input data set to be analyzed. The data set
can be an ordinary SAS data set or TYPE=CORR, UCORR, COV, UCOV,
FACTOR, or SSCP. If you do not specify
the DATA= option, the most recently created SAS
data set is used.
See Appendix A, "Special SAS Data Sets,"
for more information on types of SAS data sets.
-
HIERARCHY
- HI
-
requires the clusters at different levels to
maintain a hierarchical structure.
-
INITIAL=GROUP
- INITIAL=INPUT
- INITIAL=RANDOM
- INITIAL=SEED
-
specifies the method for initializing the clusters.
If the INITIAL= option is omitted and the MINCLUSTERS= option
is greater than 1,
the initial cluster components are obtained by extracting
the required number of principal components and performing
an orthoblique rotation. The following list describes the values
for the INITIAL= option:
-
GROUP
- specifies that clusters be initialized by group. You can use this
option if the input data set is a
TYPE=CORR, UCORR, COV, UCOV, or FACTOR data set.
The cluster membership of each variable is obtained from an
observation with _TYPE_='GROUP', which contains an integer
for each variable ranging from one to the number of clusters.
You can use a data set created either by a
previous run of PROC VARCLUS or in a DATA step.
-
INPUT
- specifies that the input data set is a TYPE=CORR, UCORR, COV,
UCOV, or FACTOR data set, in which case scoring coefficients
are read from observations where _TYPE_='SCORE'.
You can use scoring coefficients from the FACTOR procedure
or a previous run of PROC VARCLUS, or
you can enter other coefficients in a DATA step.
-
RANDOM
- assigns variables randomly to clusters.
If you specify INITIAL=RANDOM without the CENTROID option,
you should also specify MAXSEARCH=5, although doing so
substantially increases the CPU time required.
-
SEED
- initializes clusters according to the
variables named in the SEED statement.
Each variable listed in the SEED statement becomes the sole
member of a cluster, and the other variables remain unassigned.
If you do not specify the SEED statement, the first
MINCLUSTERS= variables in the VAR statement are used as seeds.
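As an illustration of INITIAL=SEED, the following sketch names three seed variables in a SEED statement, so that each one starts as the sole member of its own cluster. The SASHELP.CARS data set (SAS 9 and later) and the choice of seed variables are assumptions made only for this example.

```sas
/* Assumption: SASHELP.CARS is available (SAS 9 and later). */
proc varclus data=sashelp.cars
             initial=seed
             minclusters=3       /* one cluster per seed variable */
             maxclusters=3;
   var MSRP Invoice EngineSize Cylinders Horsepower
       MPG_City MPG_Highway Weight Wheelbase Length;
   seed Horsepower MPG_City Length;   /* illustrative seeds */
run;
```

If the SEED statement were omitted, the first three variables in the VAR statement (MSRP, Invoice, and EngineSize) would serve as seeds.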
-
MAXCLUSTERS=n
- MAXC=n
-
specifies the largest number of clusters desired. The
default value is the number of variables.
-
MAXEIGEN=n
-
specifies the largest permissible value of the second
eigenvalue in each cluster. If you specify neither
the PROPORTION= nor the
MAXCLUSTERS= option, the default value is the average of
the diagonal elements of the matrix being analyzed.
This value is either the average variance if a covariance matrix is analyzed,
or 1 if the
correlation matrix is analyzed (unless some of the variables are
constant, in which case the value is the number of nonconstant
variables divided by the number of variables). Otherwise, the default is 0.
The MAXEIGEN= option cannot be used with the CENTROID option.
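The following sketch sets MAXEIGEN= explicitly. A correlation matrix is analyzed, so the default would be 1; lowering the limit to 0.7 permits a smaller second eigenvalue in each cluster and therefore tends to produce more clusters. The SASHELP.CARS data set (SAS 9 and later) and the value 0.7 are illustrative assumptions.

```sas
/* Assumption: SASHELP.CARS is available (SAS 9 and later). */
proc varclus data=sashelp.cars
             maxeigen=0.7;       /* second eigenvalue of each cluster must not exceed 0.7 */
   var MSRP Invoice EngineSize Cylinders Horsepower
       MPG_City MPG_Highway Weight Wheelbase Length;
run;
```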
-
MAXSEARCH=n
-
specifies the maximum number of iterations during the
search phase. The default is 10 if you specify the CENTROID option;
the default is 0 otherwise.
-
MINCLUSTERS=n
- MINC=n
-
specifies the smallest number of clusters desired. The
default value is 2 if INITIAL=RANDOM or INITIAL=SEED;
otherwise, the procedure begins with one cluster and tries
to split it in accordance with the PROPORTION= or MAXEIGEN=
option.
-
MULTIPLEGROUP
- MG
-
performs a multiple group component analysis (refer to Harman 1976).
The input data set must be TYPE=CORR, UCORR, COV, UCOV,
FACTOR, or SSCP and must contain an observation
with _TYPE_='GROUP' defining the variable groups.
Specifying the MULTIPLEGROUP option is equivalent to
specifying all of the following options:
MINC=1, MAXITER=0, MAXSEARCH=0, MAXEIGEN=0,
PROPORTION=0, and INITIAL=GROUP.
-
NOINT
-
requests that no intercept be used; covariances or correlations are
not corrected for the mean.
If you specify the NOINT option, the OUTSTAT= data set is
TYPE=UCORR.
-
NOPRINT
-
suppresses the output. Note that this option
temporarily disables the Output Delivery System (ODS).
For more information, see Chapter 14, "Using the Output Delivery System."
-
OUTSTAT=SAS-data-set
-
creates an output data set to contain statistics including
means, standard deviations, correlations, cluster scoring
coefficients, and the cluster structure. If you want to
create a permanent SAS data set, you must specify a two-level name.
The OUTSTAT= data set is TYPE=UCORR if the NOINT option is specified.
For more information on permanent SAS data sets, refer to
"SAS Files" and "DATA Step Concepts" in
SAS Language Reference: Concepts.
For information on types of SAS data sets,
see Appendix A.
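The following sketch writes the OUTSTAT= data set under a two-level name and then lists it. The libref MYLIB and the data set name VCSTAT are hypothetical. So that the step runs as written, the libref points at the WORK directory; in practice you would point it at a directory of your own so that the data set persists after the session ends. The SASHELP.CARS data set (SAS 9 and later) is also an assumption.

```sas
/* Hypothetical libref: replace the path with a permanent directory. */
libname mylib "%sysfunc(pathname(work))";

/* Assumption: SASHELP.CARS is available (SAS 9 and later). */
proc varclus data=sashelp.cars
             maxclusters=3
             outstat=mylib.vcstat   /* two-level name: libref.member */
             noprint;
   var MSRP Invoice EngineSize Cylinders Horsepower
       MPG_City MPG_Highway Weight Wheelbase Length;
run;

/* Inspect the statistics, including the _TYPE_ variable. */
proc print data=mylib.vcstat;
run;
```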
-
OUTTREE=SAS-data-set
-
creates an output data set to contain information on the
tree structure that can be used by the TREE procedure to
print a tree diagram. The OUTTREE= option implies the
HIERARCHY option. See Example 59.1 for
use of the OUTTREE= option.
If you want to create a permanent SAS data
set, you must specify a two-level name.
For more information on permanent SAS data sets,
refer to "SAS Files" and
"DATA Step Concepts" in
SAS Language Reference: Concepts.
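As a minimal sketch of the OUTTREE= option, the following steps save the tree structure and pass it to the TREE procedure. The data set name WORK.VCTREE is hypothetical, and the SASHELP.CARS input (SAS 9 and later) is an assumption. Because OUTTREE= implies HIERARCHY, that option is not specified separately.

```sas
/* Assumption: SASHELP.CARS is available (SAS 9 and later). */
proc varclus data=sashelp.cars
             maxclusters=5
             outtree=work.vctree    /* implies HIERARCHY */
             noprint;
   var MSRP Invoice EngineSize Cylinders Horsepower
       MPG_City MPG_Highway Weight Wheelbase Length;
run;

/* Draw the tree diagram from the saved structure. */
proc tree data=work.vctree horizontal;
run;
```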
-
PROPORTION=n
- PERCENT=n
-
gives the proportion or percentage of variation that must
be explained by the cluster component. Values greater than
1.0 are considered to be percentages, so
PROPORTION=0.75 and
PERCENT=75 are equivalent.
If you specify
the CENTROID option, the
default value is 0.75; otherwise, the default value is 0.
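The following two steps sketch the equivalence described above: both require each cluster component to explain at least 75 percent of the variation. The SASHELP.CARS data set (SAS 9 and later) and the variable list are illustrative assumptions.

```sas
/* Assumption: SASHELP.CARS is available (SAS 9 and later). */

/* Value not greater than 1.0: read as a proportion. */
proc varclus data=sashelp.cars proportion=0.75 short;
   var MSRP Invoice EngineSize Cylinders Horsepower
       MPG_City MPG_Highway Weight Wheelbase Length;
run;

/* Value greater than 1.0: read as a percentage. */
proc varclus data=sashelp.cars percent=75 short;
   var MSRP Invoice EngineSize Cylinders Horsepower
       MPG_City MPG_Highway Weight Wheelbase Length;
run;
```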
-
MAXITER=n
-
specifies the maximum number of iterations during the
alternating least-squares phase. The default value is 1 if
you specify the CENTROID option; the default is 10 otherwise.
-
RANDOM=n
-
specifies a positive integer as a starting value for use with
INITIAL=RANDOM. If you do not specify the RANDOM= option,
the time of day is
used to initialize the pseudo-random number sequence.
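The following sketch fixes the starting value so that a random initialization can be repeated from run to run. The value 12345 is arbitrary, and the SASHELP.CARS data set (SAS 9 and later) is an assumption. MAXSEARCH=5 follows the recommendation given under INITIAL=RANDOM for analyses that do not use the CENTROID option.

```sas
/* Assumption: SASHELP.CARS is available (SAS 9 and later). */
proc varclus data=sashelp.cars
             initial=random
             random=12345        /* arbitrary positive integer starting value */
             minclusters=3
             maxclusters=3
             maxsearch=5;        /* recommended with INITIAL=RANDOM           */
   var MSRP Invoice EngineSize Cylinders Horsepower
       MPG_City MPG_Highway Weight Wheelbase Length;
run;
```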
-
SHORT
-
suppresses printing of the cluster structure, scoring
coefficient, and intercluster correlation matrices.
-
SIMPLE
- S
-
displays means and standard deviations.
-
SUMMARY
-
suppresses all default output except the final summary table.
-
TRACE
-
lists the cluster to which each variable is
assigned during the iterations.
-
VARDEF=DF
- VARDEF=N
- VARDEF=WDF
- VARDEF=WEIGHT | WGT
-
specifies the divisor to be used in the calculation of
variances and covariances. The default value is
VARDEF=DF. The values and associated divisors
are displayed in the following table.
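| Value | Divisor | Formula |
| DF | degrees of freedom | $n - i$ |
| N | number of observations | $n$ |
| WDF | sum of weights minus one | $\left(\sum_j w_j\right) - 1$ |
| WEIGHT or WGT | sum of weights | $\sum_j w_j$ |
In the preceding table, $n$ is the number of observations,
$w_j$ is the weight of the $j$th observation, and
$i=0$ if the NOINT option is specified and $i=1$ otherwise.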
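As a small sketch of the VARDEF= option, the following step analyzes the covariance matrix and divides by the number of observations rather than by the degrees of freedom. The SASHELP.CARS data set (SAS 9 and later) and the four variables are illustrative assumptions. The COV option is included because the divisor changes the covariances being analyzed.

```sas
/* Assumption: SASHELP.CARS is available (SAS 9 and later). */
proc varclus data=sashelp.cars
             cov                 /* analyze covariances, not correlations */
             vardef=n            /* divisor is n instead of n - 1         */
             maxclusters=2;
   var MPG_City MPG_Highway Wheelbase Length;
run;
```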
Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.