The FASTCLUS Procedure

Example 27.1: Fisher's Iris Data

The iris data published by Fisher (1936) have been widely used for examples in discriminant analysis and cluster analysis. The sepal length, sepal width, petal length, and petal width are measured in millimeters on fifty iris specimens from each of three species, Iris setosa, I. versicolor, and I. virginica. Mezzich and Solomon (1980) discuss a variety of cluster analyses of the iris data.

In this example, the FASTCLUS procedure is used to find two clusters and then three clusters. Each run creates an output data set, and PROC FREQ is invoked to compare the clusters with the species classification. See Output 27.1.1 and Output 27.1.2 for these results. For the three-cluster solution, the CANDISC procedure computes canonical variables that are then used to plot the clusters. See Output 27.1.3 for the results.

   proc format;
      value specname
         1='Setosa    '
         2='Versicolor'
         3='Virginica ';
   run;

   data iris;
      title 'Fisher (1936) Iris Data';
      input SepalLength SepalWidth PetalLength PetalWidth Species @@;
      format Species specname.;
      label SepalLength='Sepal Length in mm.'
            SepalWidth ='Sepal Width in mm.'
            PetalLength='Petal Length in mm.'
            PetalWidth ='Petal Width in mm.';
      symbol = put(species, specname10.);
      datalines;
   50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3
   63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2
   59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2
   65 30 52 20 3 56 25 39 11 2 65 30 55 18 3 58 27 51 19 3
   68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3
   77 38 67 22 3 63 33 47 16 2 67 33 57 25 3 76 30 66 21 3
   49 25 45 17 3 55 35 13 02 1 67 30 52 23 3 70 32 47 14 2
   64 32 45 15 2 61 28 40 13 2 48 31 16 02 1 59 30 51 18 3
   55 24 38 11 2 63 25 50 19 3 64 32 53 23 3 52 34 14 02 1
   49 36 14 01 1 54 30 45 15 2 79 38 64 20 3 44 32 13 02 1
   67 33 57 21 3 50 35 16 06 1 58 26 40 12 2 44 30 13 02 1
   77 28 67 20 3 63 27 49 18 3 47 32 16 02 1 55 26 44 12 2
   50 23 33 10 2 72 32 60 18 3 48 30 14 03 1 51 38 16 02 1
   61 30 49 18 3 48 34 19 02 1 50 30 16 02 1 50 32 12 02 1
   61 26 56 14 3 64 28 56 21 3 43 30 11 01 1 58 40 12 02 1
   51 38 19 04 1 67 31 44 14 2 62 28 48 18 3 49 30 14 02 1
   51 35 14 02 1 56 30 45 15 2 58 27 41 10 2 50 34 16 04 1
   46 32 14 02 1 60 29 45 15 2 57 26 35 10 2 57 44 15 04 1
   50 36 14 02 1 77 30 61 23 3 63 34 56 24 3 58 27 51 19 3
   57 29 42 13 2 72 30 58 16 3 54 34 15 04 1 52 41 15 01 1
   71 30 59 21 3 64 31 55 18 3 60 30 48 18 3 63 29 56 18 3
   49 24 33 10 2 56 27 42 13 2 57 30 42 12 2 55 42 14 02 1
   49 31 15 02 1 77 26 69 23 3 60 22 50 15 3 54 39 17 04 1
   66 29 46 13 2 52 27 39 14 2 60 34 45 16 2 50 34 15 02 1
   44 29 14 02 1 50 20 35 10 2 55 24 37 10 2 58 27 39 12 2
   47 32 13 02 1 46 31 15 02 1 69 32 57 23 3 62 29 43 13 2
   74 28 61 19 3 59 30 42 15 2 51 34 15 02 1 50 35 13 03 1
   56 28 49 20 3 60 22 40 10 2 73 29 63 18 3 67 25 58 18 3
   49 31 15 01 1 67 31 47 15 2 63 23 44 13 2 54 37 15 02 1
   56 30 41 13 2 63 25 49 15 2 61 28 47 12 2 64 29 43 13 2
   51 25 30 11 2 57 28 41 13 2 65 30 58 22 3 69 31 54 21 3
   54 39 13 04 1 51 35 14 03 1 72 36 61 25 3 65 32 51 20 3
   61 29 47 14 2 56 29 36 13 2 69 31 49 15 2 64 27 53 19 3
   68 30 55 21 3 55 25 40 13 2 48 34 16 02 1 48 30 14 01 1
   45 23 13 03 1 57 25 50 20 3 57 38 17 03 1 51 38 15 03 1
   55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1
   51 37 15 04 1 52 35 15 02 1 58 28 51 24 3 67 30 50 17 2
   63 33 60 25 3 53 37 15 02 1
   ;

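   /* Two-cluster solution; OUT= saves each observation's cluster assignment */
   proc fastclus data=iris maxc=2 maxiter=10 out=clus;
      var SepalLength SepalWidth PetalLength PetalWidth;
   run;

   /* Cross-tabulate cluster membership against the known species */
   proc freq data=clus;
      tables cluster*species;
   run;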

   /* Three-cluster solution; the OUT= data set replaces the previous one */
   proc fastclus data=iris maxc=3 maxiter=10 out=clus;
      var SepalLength SepalWidth PetalLength PetalWidth;
   run;

   proc freq data=clus;
      tables cluster*Species;
   run;

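   /* Canonical discriminant analysis with the clusters as classes; */
   /* OUT= saves the canonical variables Can1 and Can2              */
   proc candisc data=clus anova out=can;
      class cluster;
      var SepalLength SepalWidth PetalLength PetalWidth;
      title2 'Canonical Discriminant Analysis of Iris Clusters';
   run;

   /* Global SAS/GRAPH statements used by the PROC GPLOT step */
   legend1 frame cframe=ligr label=none cborder=black
           position=center value=(justify=center);
   axis1 label=(angle=90 rotate=0) minor=none;
   axis2 minor=none;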

   proc gplot data=Can;
      plot Can2*Can1=Cluster/frame cframe=ligr
                     legend=legend1 vaxis=axis1 haxis=axis2;
      title2 'Plot of Canonical Variables Identified by Cluster';
   run;

Output 27.1.1: Fisher's Iris Data: PROC FASTCLUS with MAXC=2 and PROC FREQ

Fisher (1936) Iris Data

The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=2 Maxiter=10 Converge=0.02

Initial Seeds
Cluster SepalLength SepalWidth PetalLength PetalWidth
1 43.00000000 30.00000000 11.00000000 1.00000000
2 77.00000000 26.00000000 69.00000000 23.00000000

Minimum Distance Between Initial Seeds = 70.85196


Iteration History
                           Relative Change in Cluster Seeds
Iteration    Criterion          1            2
    1          11.0638       0.1904       0.3163
    2           5.3780       0.0596       0.0264
    3           5.0718       0.0174       0.00766

Convergence criterion is satisfied.

Criterion Based on Final Seeds = 5.0417

Cluster Summary
                                      Maximum Distance
                         RMS Std         from Seed        Radius    Nearest   Distance Between
Cluster   Frequency     Deviation     to Observation     Exceeded   Cluster   Cluster Centroids
   1          53          3.7050         21.1621                       2           39.2879
   2          97          5.6779         24.6430                       1           39.2879

Statistics for Variables
Variable Total STD Within STD R-Square RSQ/(1-RSQ)
SepalLength 8.28066 5.49313 0.562896 1.287784
SepalWidth 4.35866 3.70393 0.282710 0.394137
PetalLength 17.65298 6.80331 0.852470 5.778291
PetalWidth 7.62238 3.57200 0.781868 3.584390
OVER-ALL 10.69224 5.07291 0.776410 3.472463

Pseudo F Statistic = 513.92

Approximate Expected Over-All R-Squared = 0.51539

Cubic Clustering Criterion = 14.806

WARNING: The two above values are invalid  for correlated variables.
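
The pseudo F statistic can be cross-checked against the OVER-ALL row of the Statistics for Variables table. The following is a minimal sketch, assuming the usual definition pseudo F = [R-Square/(c-1)] / [(1-R-Square)/(n-c)] with n = 150 observations and c = 2 clusters; the variable names are illustrative and are not part of the original example.

```sas
data _null_;
   n     = 150;        /* number of observations                        */
   c     = 2;          /* number of clusters (MAXC=2)                   */
   ratio = 3.472463;   /* OVER-ALL RSQ/(1-RSQ) copied from the output   */
   pseudoF = ratio * (n - c) / (c - 1);
   put pseudoF= 8.2;   /* writes the value to the SAS log               */
run;
```

Here 3.472463 x 148 = 513.92, which agrees with the value that PROC FASTCLUS reports.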


Cluster Means
Cluster SepalLength SepalWidth PetalLength PetalWidth
1 50.05660377 33.69811321 15.60377358 2.90566038
2 63.01030928 28.86597938 49.58762887 16.95876289

Cluster Standard Deviations
Cluster SepalLength SepalWidth PetalLength PetalWidth
1 3.427350930 4.396611045 4.404279486 2.105525249
2 6.336887455 3.267991438 7.800577673 4.155612484


The FREQ Procedure

Table of CLUSTER by Species

CLUSTER(Cluster)     Species

Frequency
Percent
Row Pct
Col Pct       Setosa   Versicolor   Virginica     Total
1                 50            3           0        53
               33.33         2.00        0.00     35.33
               94.34         5.66        0.00
              100.00         6.00        0.00
2                  0           47          50        97
                0.00        31.33       33.33     64.67
                0.00        48.45       51.55
                0.00        94.00      100.00
Total             50           50          50       150
               33.33        33.33       33.33    100.00

Output 27.1.2: Fisher's Iris Data: PROC FASTCLUS with MAXC=3 and PROC FREQ

Fisher (1936) Iris Data

The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=3 Maxiter=10 Converge=0.02

Initial Seeds
Cluster SepalLength SepalWidth PetalLength PetalWidth
1 58.00000000 40.00000000 12.00000000 2.00000000
2 77.00000000 38.00000000 67.00000000 22.00000000
3 49.00000000 25.00000000 45.00000000 17.00000000

Minimum Distance Between Initial Seeds = 38.23611


Iteration History
                           Relative Change in Cluster Seeds
Iteration    Criterion          1            2            3
    1           6.7591       0.2652       0.3205       0.2985
    2           3.7097       0            0.0459       0.0317
    3           3.6427       0            0.0182       0.0124

Convergence criterion is satisfied.

Criterion Based on Final Seeds = 3.6289

Cluster Summary
                                      Maximum Distance
                         RMS Std         from Seed        Radius    Nearest   Distance Between
Cluster   Frequency     Deviation     to Observation     Exceeded   Cluster   Cluster Centroids
   1          50          2.7803         12.4803                       3           33.5693
   2          38          4.0168         14.9736                       3           17.9718
   3          62          4.0398         16.9272                       2           17.9718

Statistics for Variables
Variable Total STD Within STD R-Square RSQ/(1-RSQ)
SepalLength 8.28066 4.39488 0.722096 2.598359
SepalWidth 4.35866 3.24816 0.452102 0.825156
PetalLength 17.65298 4.21431 0.943773 16.784895
PetalWidth 7.62238 2.45244 0.897872 8.791618
OVER-ALL 10.69224 3.66198 0.884275 7.641194

Pseudo F Statistic = 561.63

Approximate Expected Over-All R-Squared = 0.62728

Cubic Clustering Criterion = 25.021

WARNING: The two above values are invalid  for correlated variables.
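
The R-Square column can be reproduced from the two standard deviation columns. The sketch below assumes that Total STD is computed with n-1 degrees of freedom and that Within STD is pooled over the clusters with n-c degrees of freedom, so that R-Square = 1 - [(n-c)/(n-1)] x (Within STD / Total STD)**2. The variable names are illustrative, and the inputs are the SepalLength values from the three-cluster output.

```sas
data _null_;
   n = 150;                /* number of observations              */
   c = 3;                  /* number of clusters (MAXC=3)         */
   totalStd  = 8.28066;    /* Total STD for SepalLength           */
   withinStd = 4.39488;    /* Within STD for SepalLength          */
   rsq   = 1 - ((n - c) / (n - 1)) * (withinStd / totalStd)**2;
   ratio = rsq / (1 - rsq);
   put rsq= 8.6 ratio= 8.6;   /* writes both values to the SAS log */
run;
```

Under these assumptions the computed values are approximately 0.7221 and 2.598, which match the SepalLength row up to rounding of the inputs.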


Cluster Means
Cluster SepalLength SepalWidth PetalLength PetalWidth
1 50.06000000 34.28000000 14.62000000 2.46000000
2 68.50000000 30.73684211 57.42105263 20.71052632
3 59.01612903 27.48387097 43.93548387 14.33870968

Cluster Standard Deviations
Cluster SepalLength SepalWidth PetalLength PetalWidth
1 3.524896872 3.790643691 1.736639965 1.053855894
2 4.941550255 2.900924461 4.885895746 2.798724562
3 4.664100551 2.962840548 5.088949673 2.974997167


The FREQ Procedure

Table of CLUSTER by Species

CLUSTER(Cluster)     Species

Frequency
Percent
Row Pct
Col Pct       Setosa   Versicolor   Virginica     Total
1                 50            0           0        50
               33.33         0.00        0.00     33.33
              100.00         0.00        0.00
              100.00         0.00        0.00
2                  0            2          36        38
                0.00         1.33       24.00     25.33
                0.00         5.26       94.74
                0.00         4.00       72.00
3                  0           48          14        62
                0.00        32.00        9.33     41.33
                0.00        77.42       22.58
                0.00        96.00       28.00
Total             50           50          50       150
               33.33        33.33       33.33    100.00
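
One way to summarize the three-cluster table is the share of specimens whose cluster agrees with the predominant species of that cluster. The sketch below uses the cell counts from the table and assumes that cluster 1 is identified with Setosa, cluster 2 with Virginica, and cluster 3 with Versicolor. Cluster numbers are arbitrary labels, so this matching is an assumption that applies only to the output shown here.

```sas
data _null_;
   agree = 50 + 36 + 48;   /* Setosa in 1, Virginica in 2, Versicolor in 3 */
   total = 150;            /* all specimens                                */
   rate  = agree / total;  /* proportion of specimens that agree           */
   put agree= total= rate= percent8.2;
run;
```

With this matching, 134 of the 150 specimens (about 89.3%) agree with the species classification. The 16 disagreements are the 2 Versicolor specimens in cluster 2 and the 14 Virginica specimens in cluster 3.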

Output 27.1.3: Fisher's Iris Data: PROC CANDISC and PROC GPLOT

Fisher (1936) Iris Data
Canonical Discriminant Analysis of Iris Clusters

The CANDISC Procedure

Observations   150     DF Total             149
Variables        4     DF Within Classes    147
Classes          3     DF Between Classes     2

Class Level Information
            Variable
CLUSTER     Name       Frequency     Weight     Proportion
   1        _1             50        50.0000     0.333333
   2        _2             38        38.0000     0.253333
   3        _3             62        62.0000     0.413333


Univariate Test Statistics
F Statistics, Num DF=2, Den DF=147

                                      Total       Pooled      Between
                                      Standard    Standard    Standard                R-Square
Variable      Label                   Deviation   Deviation   Deviation   R-Square   / (1-RSq)   F Value   Pr > F
SepalLength   Sepal Length in mm.        8.2807      4.3949      8.5893     0.7221      2.5984    190.98   <.0001
SepalWidth    Sepal Width in mm.         4.3587      3.2482      3.5774     0.4521      0.8252     60.65   <.0001
PetalLength   Petal Length in mm.       17.6530      4.2143     20.9336     0.9438     16.7849   1233.69   <.0001
PetalWidth    Petal Width in mm.         7.6224      2.4524      8.8164     0.8979      8.7916    646.18   <.0001

Average R-Square
Unweighted 0.7539604
Weighted by Variance 0.8842753

Multivariate Statistics and F Approximations
S=2 M=0.5 N=71
Statistic Value F Value Num DF Den DF Pr > F
Wilks' Lambda 0.03222337 164.55 8 288 <.0001
Pillai's Trace 1.25669612 61.29 8 290 <.0001
Hotelling-Lawley Trace 21.06722883 377.66 8 203.4 <.0001
Roy's Greatest Root 20.63266809 747.93 4 145 <.0001

NOTE: F Statistic for Roy's Greatest Root is an upper bound.

NOTE: F Statistic for Wilks' Lambda is exact.


                    Adjusted      Approximate    Squared
      Canonical     Canonical     Standard       Canonical
      Correlation   Correlation   Error          Correlation
1     0.976613      0.976123      0.003787       0.953774
2     0.550384      0.543354      0.057107       0.302923

Eigenvalues of Inv(E)*H = CanRsq/(1-CanRsq)
      Eigenvalue    Difference    Proportion     Cumulative
1       20.6327       20.1981       0.9794         0.9794
2        0.4346                     0.0206         1.0000

Test of H0: The canonical correlations in the current row and all that follow are zero
      Likelihood    Approximate
      Ratio         F Value       Num DF    Den DF    Pr > F
1     0.03222337     164.55          8        288     <.0001
2     0.69707749      21.00          3        145     <.0001
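
The four multivariate statistics reported earlier in this output can be recovered from the two squared canonical correlations. The sketch below uses the standard definitions: Wilks' lambda is the product of (1-CanRsq), Pillai's trace is the sum of CanRsq, the Hotelling-Lawley trace is the sum of the eigenvalues, and Roy's greatest root is the largest eigenvalue. The variable names are illustrative.

```sas
data _null_;
   /* squared canonical correlations copied from the output */
   array canrsq[2] _temporary_ (0.953774 0.302923);
   wilks = 1; pillai = 0; hotelling = 0; roy = 0;
   do i = 1 to 2;
      eigenvalue = canrsq[i] / (1 - canrsq[i]);   /* CanRsq/(1-CanRsq) */
      wilks      = wilks * (1 - canrsq[i]);
      pillai     = pillai + canrsq[i];
      hotelling  = hotelling + eigenvalue;
      roy        = max(roy, eigenvalue);
   end;
   put wilks= 10.6 pillai= 10.6 hotelling= 10.4 roy= 10.4;
run;
```

Because the inputs are rounded to six decimal places, the results agree with the reported values (0.0322, 1.2567, 21.067, and 20.633) only to about four significant digits.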


Total Canonical Structure
Variable Label Can1 Can2
SepalLength Sepal Length in mm. 0.831965 0.452137
SepalWidth Sepal Width in mm. -0.515082 0.810630
PetalLength Petal Length in mm. 0.993520 0.087514
PetalWidth Petal Width in mm. 0.966325 0.154745

Between Canonical Structure
Variable Label Can1 Can2
SepalLength Sepal Length in mm. 0.956160 0.292846
SepalWidth Sepal Width in mm. -0.748136 0.663545
PetalLength Petal Length in mm. 0.998770 0.049580
PetalWidth Petal Width in mm. 0.995952 0.089883

Pooled Within Canonical Structure
Variable Label Can1 Can2
SepalLength Sepal Length in mm. 0.339314 0.716082
SepalWidth Sepal Width in mm. -0.149614 0.914351
PetalLength Petal Length in mm. 0.900839 0.308136
PetalWidth Petal Width in mm. 0.650123 0.404282


Total-Sample Standardized Canonical Coefficients
Variable Label Can1 Can2
SepalLength Sepal Length in mm. 0.047747341 1.021487262
SepalWidth Sepal Width in mm. -0.577569244 0.864455153
PetalLength Petal Length in mm. 3.341309573 -1.283043758
PetalWidth Petal Width in mm. 0.996451144 0.900476563

Pooled Within-Class Standardized Canonical Coefficients
Variable Label Can1 Can2
SepalLength Sepal Length in mm. 0.0253414487 0.5421446856
SepalWidth Sepal Width in mm. -.4304161258 0.6442092294
PetalLength Petal Length in mm. 0.7976741592 -.3063023132
PetalWidth Petal Width in mm. 0.3205998034 0.2897207865

Raw Canonical Coefficients
Variable Label Can1 Can2
SepalLength Sepal Length in mm. 0.0057661265 0.1233581748
SepalWidth Sepal Width in mm. -.1325106494 0.1983303556
PetalLength Petal Length in mm. 0.1892773419 -.0726814163
PetalWidth Petal Width in mm. 0.1307270927 0.1181359305

Class Means on Canonical Variables
CLUSTER Can1 Can2
1 -6.131527227 0.244761516
2 4.931414018 0.861972277
3 1.922300462 -0.725693908


[Figure: Plot of Canonical Variables Identified by Cluster (Can2 plotted against Can1, with points identified by cluster)]

Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.