Introduction to Clustering Procedures

Overview

You can use SAS clustering procedures to cluster the observations or the variables in a SAS data set. Both hierarchical and disjoint clusters can be obtained. Only numeric variables can be analyzed directly by the procedures, although the %DISTANCE macro can compute a distance matrix using character or numeric variables.

The purpose of cluster analysis is to place objects into groups or clusters suggested by the data, not defined a priori, such that objects in a given cluster tend to be similar to each other in some sense, and objects in different clusters tend to be dissimilar. You can also use cluster analysis for summarizing data rather than for finding "natural" or "real" clusters; this use of clustering is sometimes called dissection (Everitt 1980).

Any generalization about cluster analysis must be vague because a vast number of clustering methods have been developed in several different fields, with different definitions of clusters and similarity among objects. The variety of clustering techniques is reflected by the variety of terms used for cluster analysis: botryology, classification, clumping, competitive learning, morphometrics, nosography, nosology, numerical taxonomy, partitioning, Q-analysis, systematics, taximetrics, taxonorics, typology, unsupervised pattern recognition, vector quantization, and winner-take-all learning. Good (1977) has also suggested aciniformics and agminatics.

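Several types of clusters are possible:

Disjoint clusters place each object in one and only one cluster.

Hierarchical clusters are organized so that one cluster can be entirely contained within another cluster, but no other kind of overlap between clusters is allowed.

Overlapping clusters allow an object to belong to more than one cluster at the same time.

Fuzzy clusters are defined by a probability or grade of membership of each object in each cluster.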

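The data representations of objects to be clustered also take many forms. The most common are

a square distance or similarity matrix, in which both rows and columns correspond to the objects to be clustered (a correlation matrix is an example of a similarity matrix)

a coordinate matrix, in which the rows are observations and the columns are variables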

The SAS procedures for clustering are oriented toward disjoint or hierarchical clusters from coordinate data, distance data, or a correlation or covariance matrix. The following procedures are used for clustering:
CLUSTER
performs hierarchical clustering of observations using eleven agglomerative methods applied to coordinate data or distance data.

FASTCLUS
finds disjoint clusters of observations using a k-means method applied to coordinate data. PROC FASTCLUS is especially suitable for large data sets.

MODECLUS
finds disjoint clusters of observations with coordinate or distance data using nonparametric density estimation. It can also perform approximate nonparametric significance tests for the number of clusters.

VARCLUS
performs both hierarchical and disjoint clustering of variables by oblique multiple-group component analysis.

TREE
draws tree diagrams, also called dendrograms or phenograms, using output from the CLUSTER or VARCLUS procedures. PROC TREE can also create a data set indicating cluster membership at any specified level of the cluster tree.
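
As an illustration of how these procedures fit together, the following statements are a minimal sketch rather than a recommended analysis. The data set POINTS, its variables NAME, X, and Y, the output data set names, and the choice of three clusters are all hypothetical and chosen only for the example. The sketch runs PROC CLUSTER with Ward's method, uses PROC TREE to cut the resulting tree into three clusters, and then finds three disjoint clusters directly with PROC FASTCLUS.

```sas
/* Hypothetical coordinate data: six points in two dimensions. */
data points;
   input name $ x y;
   datalines;
a 1.0 1.1
b 1.2 0.9
c 5.0 5.2
d 5.1 4.8
e 9.0 1.0
f 9.2 1.3
;
run;

/* Hierarchical clustering of observations with Ward's method.    */
/* OUTTREE= saves the cluster history for later use by PROC TREE. */
proc cluster data=points method=ward outtree=tree noprint;
   var x y;
   id name;
run;

/* Cut the tree at the three-cluster level and save the */
/* cluster membership of each observation in MEMBERS.   */
proc tree data=tree nclusters=3 out=members noprint;
   id name;
run;

/* Disjoint clusters from a k-means method, again with three clusters. */
/* OUT= saves the original data along with the cluster assignment.     */
proc fastclus data=points maxclusters=3 out=kmeans noprint;
   var x y;
run;
```

The MEMBERS and KMEANS data sets each contain a variable named CLUSTER that identifies the cluster assigned to each observation, so the two solutions can be compared with a cross-tabulation after merging by NAME.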

The following procedures are useful for processing data prior to the actual cluster analysis:
ACECLUS
attempts to estimate the pooled within-cluster covariance matrix from coordinate data without knowledge of the number or the membership of the clusters (Art, Gnanadesikan, and Kettenring 1982). PROC ACECLUS outputs a data set containing canonical variable scores to be used in the cluster analysis proper.

PRINCOMP
performs a principal component analysis and outputs principal component scores.

STDIZE
standardizes variables using any of a variety of location and scale measures, including mean and standard deviation, minimum and range, median and absolute deviation from the median, various M estimators and A estimators, and some scale estimators designed specifically for cluster analysis.
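
A preprocessing step might look like the following sketch. The data set POINTS, its variables, and the output data set names are assumptions made for the example, and the choice of METHOD=STD (mean and standard deviation) is only one of the available location and scale measures. PROC ACECLUS is omitted because its tuning options depend on the data.

```sas
/* Hypothetical coordinate data with variables on different scales. */
data points;
   input name $ x y;
   datalines;
a 1.0 110
b 1.2  90
c 5.0 520
d 5.1 480
e 9.0 100
f 9.2 130
;
run;

/* Standardize X and Y to mean 0 and standard deviation 1 */
/* so that neither variable dominates the distances.      */
proc stdize data=points method=std out=stdpoints;
   var x y;
run;

/* Alternatively, compute principal component scores. */
/* OUT= adds the score variables Prin1 and Prin2.     */
proc princomp data=points out=scores noprint;
   var x y;
run;

/* Cluster the standardized coordinates. */
proc cluster data=stdpoints method=average outtree=tree noprint;
   var x y;
   id name;
run;
```

Whether to standardize, and with which measure, is a modeling decision; the sketch shows only where such a step sits relative to the cluster analysis proper.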

Massart and Kaufman (1983) is the best elementary introduction to cluster analysis. Other important texts are Anderberg (1973), Sneath and Sokal (1973), Duran and Odell (1974), Hartigan (1975), Titterington, Smith, and Makov (1985), McLachlan and Basford (1988), and Kaufman and Rousseeuw (1990). Hartigan (1975) and Späth (1980) give numerous FORTRAN programs for clustering. Any prospective user of cluster analysis should study the Monte Carlo results of Milligan (1980), Milligan and Cooper (1985), and Cooper and Milligan (1984). Important references on the statistical aspects of clustering include MacQueen (1967), Wolfe (1970), Scott and Symons (1971), Hartigan (1977; 1978; 1981; 1985), Symons (1981), Everitt (1981), Sarle (1983), Bock (1985), and Thode et al. (1988). Bayesian methods have important advantages over maximum likelihood; refer to Binder (1978; 1981), Banfield and Raftery (1993), and Bensmail et al. (1997). For fuzzy clustering, refer to Bezdek (1981) and Bezdek and Pal (1992). The signal-processing perspective is provided by Gersho and Gray (1992). Refer to Blashfield and Aldenderfer (1978) for a discussion of the fragmented state of the literature on cluster analysis.


Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.