
Introduction to Clustering Procedures

You can use SAS clustering procedures to cluster the observations or the variables in a SAS data set. Both hierarchical and disjoint clusters can be obtained. Only numeric variables can be analyzed directly by the procedures, although the %DISTANCE macro can compute a distance matrix using character or numeric variables.

The purpose of cluster analysis is to place objects
into groups or clusters suggested by the data, not
defined a priori, such that objects in a given cluster
tend to be similar to each other in some sense, and
objects in different clusters tend to be dissimilar.
You can also use cluster analysis for summarizing data rather than
for finding "natural" or "real" clusters; this use
of clustering is sometimes called *dissection* (Everitt 1980).

Any generalization about cluster analysis must be vague because a vast number of clustering methods have been developed in several different fields, with different definitions of clusters and similarity among objects. The variety of clustering techniques is reflected by the variety of terms used for cluster analysis: botryology, classification, clumping, competitive learning, morphometrics, nosography, nosology, numerical taxonomy, partitioning, Q-analysis, systematics, taximetrics, taxonorics, typology, unsupervised pattern recognition, vector quantization, and winner-take-all learning. Good (1977) has also suggested aciniformics and agminatics.

Several types of clusters are possible:

- Disjoint clusters place each object in one and only one cluster.
- Hierarchical clusters are organized so that one cluster may be entirely contained within another cluster, but no other kind of overlap between clusters is allowed.
- Overlapping clusters can be constrained to limit the number of objects that belong simultaneously to two clusters, or they can be unconstrained, allowing any degree of overlap in cluster membership.
- Fuzzy clusters are defined by a probability or grade of membership of each object in each cluster. Fuzzy clusters can be disjoint, hierarchical, or overlapping.

The data representations of objects to be clustered also take many forms. The most common are

- a square distance or similarity matrix, in which both rows and columns correspond to the objects to be clustered. A correlation matrix is an example of a similarity matrix.
- a coordinate matrix, in which the rows are observations and the columns are variables, as in the usual SAS multivariate data set. The observations, the variables, or both may be clustered.
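
The two representations can be sketched as SAS data sets. This is only an illustration under stated assumptions: the data set names `points` and `dist`, the variables `name`, `x`, and `y`, and the four objects are made up for the example. The `TYPE=DISTANCE` data set option marks the second data set as a distance matrix, so that a procedure such as PROC CLUSTER reads it as distances rather than coordinates.

```sas
/* Coordinate matrix: rows are observations, columns are variables. */
/* The four objects and their values are hypothetical.              */
data points;
   input name $ x y;
   datalines;
a 1 1
b 1 2
c 8 8
d 9 8
;
run;

/* Square distance matrix for the same four objects (Euclidean      */
/* distances rounded to two decimals). Both rows and columns        */
/* correspond to the objects to be clustered.                       */
data dist(type=distance);
   input name $ a b c d;
   datalines;
a  0.00  1.00  9.90 10.63
b  1.00  0.00  9.22 10.00
c  9.90  9.22  0.00  1.00
d 10.63 10.00  1.00  0.00
;
run;
```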

- CLUSTER: performs hierarchical clustering of observations using eleven agglomerative methods applied to coordinate data or distance data.
- FASTCLUS: finds disjoint clusters of observations using a *k*-means method applied to coordinate data. PROC FASTCLUS is especially suitable for large data sets.
- MODECLUS: finds disjoint clusters of observations with coordinate or distance data using nonparametric density estimation. It can also perform approximate nonparametric significance tests for the number of clusters.
- VARCLUS: performs both hierarchical and disjoint clustering of variables by oblique multiple-group component analysis.
- TREE: draws tree diagrams, also called *dendrograms* or *phenograms*, using output from the CLUSTER or VARCLUS procedures. PROC TREE can also create a data set indicating cluster membership at any specified level of the cluster tree.
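
As a rough sketch of how these procedures fit together, the following steps assume a hypothetical coordinate data set `points` with numeric variables `x` and `y` and an identifying variable `name`. The choice of Ward's method, the request for three clusters, and the output data set names are arbitrary and serve only as illustration.

```sas
/* Disjoint clusters: k-means method on coordinate data. */
proc fastclus data=points maxclusters=3 out=kmeans_out;
   var x y;
run;

/* Hierarchical clusters of observations. OUTTREE= saves the */
/* cluster tree for later use by PROC TREE.                  */
proc cluster data=points method=ward outtree=tree;
   var x y;
   id name;
run;

/* Draw the tree diagram and write a data set that gives     */
/* cluster membership at the three-cluster level.            */
proc tree data=tree nclusters=3 out=members;
run;
```

The `out=` data set from PROC FASTCLUS and the `out=` data set from PROC TREE each contain a `CLUSTER` variable identifying the cluster to which an observation is assigned.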

The following procedures are useful for processing data prior to the actual cluster analysis:

- ACECLUS: attempts to estimate the pooled within-cluster covariance matrix from coordinate data without knowledge of the number or the membership of the clusters (Art, Gnanadesikan, and Kettenring 1982). PROC ACECLUS outputs a data set containing canonical variable scores to be used in the cluster analysis proper.
- PRINCOMP: performs a principal component analysis and outputs principal component scores.
- STDIZE: standardizes variables using any of a variety of location and scale measures, including mean and standard deviation, minimum and range, median and absolute deviation from the median, various *m* estimators and *a* estimators, and some scale estimators designed specifically for cluster analysis.
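
A minimal sketch of preprocessing before clustering, again assuming a hypothetical data set `points` with numeric variables `x` and `y`. `METHOD=STD` (mean and standard deviation) is only one of the available location and scale choices; `METHOD=RANGE` and `METHOD=MAD` would select minimum and range, or median and absolute deviation from the median, instead. The output data set names and the number of clusters are assumptions.

```sas
/* Standardize each variable to mean 0 and standard deviation 1. */
proc stdize data=points method=std out=points_std;
   var x y;
run;

/* Alternatively, compute principal component scores. OUT= adds  */
/* the score variables Prin1, Prin2, ... to the output data set. */
proc princomp data=points out=points_pc;
   var x y;
run;

/* Run the cluster analysis proper on the standardized data. */
proc fastclus data=points_std maxclusters=2 out=clusters;
   var x y;
run;
```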

Massart and Kaufman (1983) is the best elementary introduction to cluster analysis. Other important texts are Anderberg (1973), Sneath and Sokal (1973), Duran and Odell (1974), Hartigan (1975), Titterington, Smith, and Makov (1985), McLachlan and Basford (1988), and Kaufman and Rousseeuw (1990). Hartigan (1975) and Spath (1980) give numerous FORTRAN programs for clustering. Any prospective user of cluster analysis should study the Monte Carlo results of Milligan (1980), Milligan and Cooper (1985), and Cooper and Milligan (1984). Important references on the statistical aspects of clustering include MacQueen (1967), Wolfe (1970), Scott and Symons (1971), Hartigan (1977; 1978; 1981; 1985), Symons (1981), Everitt (1981), Sarle (1983), Bock (1985), and Thode et al. (1988). Bayesian methods have important advantages over maximum likelihood; refer to Binder (1978; 1981), Banfield and Raftery (1993), and Bensmail et al. (1997). For fuzzy clustering, refer to Bezdek (1981) and Bezdek and Pal (1992). The signal-processing perspective is provided by Gersho and Gray (1992). Refer to Blashfield and Aldenderfer (1978) for a discussion of the fragmented state of the literature on cluster analysis.


Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.