The FASTCLUS Procedure

Example 27.2: Outliers

The second example involves data artificially generated to contain two clusters and several severe outliers. A preliminary analysis specifies twenty clusters and outputs an OUTSEED= data set to be used for a diagnostic plot. The exact number of initial clusters is not important; similar results could be obtained with ten or fifty initial clusters. Examination of the plot suggests that clusters with more than five observations (again, the exact number is not important) may yield good seeds for the main analysis. A DATA step deletes clusters with five or fewer observations, and the remaining cluster means provide seeds for the next PROC FASTCLUS analysis.

Two clusters are requested in this analysis; the LEAST= option specifies the mean absolute deviation criterion (LEAST=1). Values of the LEAST= option less than 2 reduce the effect of outliers on cluster centers.

The next analysis also requests two clusters; the STRICT= option is specified to prevent outliers from distorting the results. The STRICT= value is chosen to be close to the _GAP_ and _RADIUS_ values of the larger clusters in the diagnostic plot; the exact value is not critical.

A final PROC FASTCLUS run assigns the outliers to clusters. The results are displayed in Output 27.2.1 through Output 27.2.4.

   /*   Create artificial data set with two clusters         */
   /*   and some outliers.                                   */

   data x;
   title 'Using PROC FASTCLUS to Analyze Data with Outliers';
      drop n;
      do n=1 to 100;
         x=rannor(12345)+2;
         y=rannor(12345);
         output;
      end;
      do n=1 to 100;
         x=rannor(12345)-2;
         y=rannor(12345);
         output;
      end;
      do n=1 to 10;
         x=10*rannor(12345);
         y=10*rannor(12345);
         output;
      end;
   run;

   /* Run PROC FASTCLUS with many clusters and OUTSEED= output */
   /* data set for diagnostic plot.                            */

   title2 'Preliminary PROC FASTCLUS Analysis with 20 Clusters';
   proc fastclus data=x outseed=mean1 maxc=20 maxiter=0 summary;
      var x y;
   run;

   legend1 frame cframe=ligr label=none cborder=black 
           position=center value=(justify=center);

   axis1 label=(angle=90 rotate=0) minor=none order=(0 to 10 by 2);
   axis2 minor=none ;

   proc gplot data=mean1;
        plot  _gap_*_freq_ _radius_*_freq_ /overlay frame 
        cframe=ligr vaxis=axis1 haxis=axis2 legend=legend1;
   run;

Output 27.2.1: Preliminary Analysis of Data with Outliers: PROC FASTCLUS and PROC GPLOT

Using PROC FASTCLUS to Analyze Data with Outliers
Preliminary PROC FASTCLUS Analysis with 20 Clusters

The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=20 Maxiter=0

Criterion Based on Final Seeds = 0.6873

Cluster Summary
                               Maximum Distance
                      RMS Std         from Seed    Radius  Nearest   Distance Between
Cluster  Frequency  Deviation    to Observation  Exceeded  Cluster  Cluster Centroids
      1          8     0.4753            1.1924                 19             1.7205
      2          1          .                 0                  6             6.2847
      3         44     0.6252            1.6774                  5             1.4386
      4          1          .                 0                 20             5.2130
      5         38     0.5603            1.4528                  3             1.4386
      6          2     0.0542            0.1085                  2             6.2847
      7          1          .                 0                 14             2.5094
      8          2     0.6480            1.2961                  1             1.8450
      9          1          .                 0                  7             9.4534
     10          1          .                 0                 18             4.2514
     11          1          .                 0                 16             4.7582
     12         20     0.5911            1.6291                 16             1.5601
     13          5     0.6682            1.4244                  3             1.9553
     14          1          .                 0                  7             2.5094
     15          5     0.4074            1.2678                  3             1.7609
     16         22     0.4168            1.5139                 19             1.4936
     17          8     0.4031            1.4794                  5             1.5564
     18          1          .                 0                 10             4.2514
     19         45     0.6475            1.6285                 16             1.4936
     20          3     0.5719            1.3642                 15             1.8999

Pseudo F Statistic = 207.58

Approximate Expected Over-All R-Squared = 0.96103

Cubic Clustering Criterion = -2.503

WARNING: The two above values are invalid for correlated variables.


[Figure: PROC GPLOT overlay of _GAP_ and _RADIUS_ against _FREQ_ for the 20 preliminary clusters]
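
As a numerical companion to the diagnostic plot, the following sketch summarizes the _GAP_ and _RADIUS_ values of the larger clusters in the OUTSEED= data set. It is an illustration only and is not part of the original example: the cutoff of five observations mirrors the one used in this example, and the choice of statistics is an assumption. A summary of this kind may help when choosing a STRICT= value.

```sas
   /* Sketch: summarize _GAP_ and _RADIUS_ for clusters with   */
   /* more than five observations. The cutoff and the choice   */
   /* of statistics are illustrative assumptions.              */

   proc means data=mean1 n min median max maxdec=4;
      where _freq_>5;
      var _gap_ _radius_;
   run;
```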

      /*    Remove low frequency clusters.  */
   data seed;
      set mean1;
      if _freq_>5;
   run;
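
To confirm which preliminary clusters survive the filter, the seed data set can be listed. This is a small sketch, not part of the original example; it assumes the default OUTSEED= variable names (CLUSTER, _FREQ_, _GAP_, and _RADIUS_) together with the analysis variables x and y. Judging from the Cluster Summary in Output 27.2.1, seven clusters (1, 3, 5, 12, 16, 17, and 19) have more than five observations.

```sas
   /* Sketch: list the seeds kept for the main analysis.       */
   /* Assumes the default OUTSEED= variable names.             */

   proc print data=seed noobs;
      var cluster _freq_ _gap_ _radius_ x y;
   run;
```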

      /*   Run PROC FASTCLUS again, selecting seeds from the          */
      /*   high frequency clusters in the previous analysis           */
      /*   using LEAST=1 Clustering Criterion                         */

   title2 'PROC FASTCLUS Analysis Using LEAST= Clustering Criterion';
   title3 'Values < 2 Reduce Effect of Outliers on Cluster Centers';
   proc fastclus data=x seed=seed maxc=2 least=1 out=out;
      var x y;
   run;

   legend1 frame cframe=ligr label=none cborder=black 
           position=center value=(justify=center);
   axis1 label=(angle=90 rotate=0) minor=none;
   axis2 minor=none;

   proc gplot data=out;
      plot y*x=cluster/frame cframe=ligr
                     legend=legend1 vaxis=axis1 haxis=axis2;
   run;

Output 27.2.2: Analysis of Data with Outliers using the LEAST= Option

Using PROC FASTCLUS to Analyze Data with Outliers
PROC FASTCLUS Analysis Using LEAST= Clustering Criterion
Values < 2 Reduce Effect of Outliers on Cluster Centers

The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=2 Maxiter=20 Converge=0.0001 Least=1

Initial Seeds
Cluster x y
1 2.794174248 -0.065970836
2 -2.027300384 -2.051208579

Minimum Distance Between Initial Seeds = 6.806712

Preliminary L(1) Scale Estimate = 2.796579


Number of Bins = 100

Iteration History
                                          Relative Change in
                                            Cluster Seeds
Iteration  Criterion  Maximum Bin Size           1           2
        1     1.3983            0.2263      0.4091      0.6696
        2     1.0776            0.0226     0.00511      0.0452
        3     1.0771           0.00226     0.00229     0.00234
        4     1.0771          0.000396    0.000253    0.000144
        5     1.0771          0.000396           0           0

Convergence criterion is satisfied.


Criterion Based on Final Seeds = 1.0771

Cluster Summary
                                   Maximum Distance
                    Mean Absolute         from Seed    Radius  Nearest  Distance Between
Cluster  Frequency      Deviation    to Observation  Exceeded  Cluster   Cluster Medians
      1        102         1.1278           24.1622                  2            4.2585
      2        108         1.0494           14.8292                  1            4.2585

Cluster Medians
Cluster x y
1 1.923023887 0.222482918
2 -1.826721743 -0.286253041

Mean Absolute Deviations from Final Seeds
Cluster x y
1 1.113465261 1.142120480
2 0.890331835 1.208370913


[Figure: PROC GPLOT scatter plot of y against x, marked by cluster, from the LEAST=1 analysis]
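
With LEAST=1 the cluster centers are reported as medians. As a rough cross-check, the sketch below recomputes the median of x and y within each cluster from the OUT= data set; it assumes the default OUT= variable name CLUSTER. The recomputed values should be close to the Cluster Medians table in Output 27.2.2 but need not match it exactly, since the procedure's own estimates come from its iterative algorithm.

```sas
   /* Sketch: recompute per-cluster medians from the OUT= data */
   /* set. Assumes the default cluster variable name CLUSTER.  */

   proc means data=out n median maxdec=4;
      class cluster;
      var x y;
   run;
```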

      /*   Run PROC FASTCLUS again, selecting seeds from the        */
      /*   high frequency clusters in the previous analysis         */
      /*   STRICT= prevents outliers from distorting the results.   */

   title2 'PROC FASTCLUS Analysis Using STRICT= to Omit Outliers';
   proc fastclus data=x seed=seed
        maxc=2 strict=3.0 out=out outseed=mean2;
      var x y;
   run;

   proc gplot data=out;
      plot y*x=cluster/frame cframe=ligr
                     legend=legend1 vaxis=axis1 haxis=axis2;
   run;

Output 27.2.3: Cluster Analysis with Outliers Omitted: PROC FASTCLUS and PROC GPLOT

Using PROC FASTCLUS to Analyze Data with Outliers
PROC FASTCLUS Analysis Using STRICT= to Omit Outliers

The FASTCLUS Procedure
Replace=FULL Radius=0 Strict=3 Maxclusters=2 Maxiter=1

Initial Seeds
Cluster x y
1 2.794174248 -0.065970836
2 -2.027300384 -2.051208579

Criterion Based on Final Seeds = 0.9515

Cluster Summary
                               Maximum Distance
                      RMS Std         from Seed    Radius  Nearest   Distance Between
Cluster  Frequency  Deviation    to Observation  Exceeded  Cluster  Cluster Centroids
      1         99     0.9501            2.9589                  2             3.7666
      2         99     0.9290            2.8011                  1             3.7666

12 Observation(s) were not assigned to a cluster because the minimum distance to a cluster seed exceeded the STRICT= value.

Statistics for Variables
Variable Total STD Within STD R-Square RSQ/(1-RSQ)
x 2.06854 0.87098 0.823609 4.669219
y 1.02113 1.00352 0.039093 0.040683
OVER-ALL 1.63119 0.93959 0.669891 2.029303

Pseudo F Statistic = 397.74

Approximate Expected Over-All R-Squared = 0.60615

Cubic Clustering Criterion = 3.197

WARNING: The two above values are invalid for correlated variables.


Cluster Means
Cluster x y
1 1.825111432 0.141211701
2 -1.919910712 -0.261558725

Cluster Standard Deviations
Cluster x y
1 0.889549271 1.006965219
2 0.852000588 1.000062579


[Figure: PROC GPLOT scatter plot of y against x, marked by cluster, from the STRICT=3.0 analysis]
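
The observations left unassigned by the STRICT= option can be listed from the OUT= data set. The sketch below is not part of the original example. It assumes the documented behavior that unassigned observations receive a negative cluster number whose absolute value identifies the cluster with the nearest seed, and it assumes the default OUT= variable names CLUSTER and DISTANCE. For this analysis the listing should contain the 12 observations reported in Output 27.2.3.

```sas
   /* Sketch: list observations that STRICT= left unassigned.  */
   /* Assumes unassigned observations carry a negative value   */
   /* of CLUSTER in the OUT= data set.                         */

   proc print data=out;
      where cluster<0;
      var x y cluster distance;
   run;
```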

      /* Run PROC FASTCLUS one more time with zero iterations */
      /* to assign outliers and tails to clusters.            */
   title2 'Final PROC FASTCLUS Analysis Assigning Outliers to '
          'Clusters';
   proc fastclus data=x seed=mean2 maxc=2 maxiter=0 out=out;
      var x y;
   run;


   proc gplot data=out;
      plot y*x=cluster/frame cframe=ligr
                     legend=legend1 vaxis=axis1 haxis=axis2;
   run;

Output 27.2.4: Final Analysis with Outliers Assigned to Clusters: PROC FASTCLUS and PROC GPLOT

Using PROC FASTCLUS to Analyze Data with Outliers
Final PROC FASTCLUS Analysis Assigning Outliers to Clusters

The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=2 Maxiter=0

Initial Seeds
Cluster x y
1 1.825111432 0.141211701
2 -1.919910712 -0.261558725

Criterion Based on Final Seeds = 2.0594

Cluster Summary
                               Maximum Distance
                      RMS Std         from Seed    Radius  Nearest   Distance Between
Cluster  Frequency  Deviation    to Observation  Exceeded  Cluster  Cluster Centroids
      1        103     2.2569           17.9426                  2             4.3753
      2        107     1.8371           11.7362                  1             4.3753

Statistics for Variables
Variable Total STD Within STD R-Square RSQ/(1-RSQ)
x 2.92721 1.95529 0.555950 1.252000
y 2.15248 2.14754 0.009347 0.009435
OVER-ALL 2.56922 2.05367 0.364119 0.572621

Pseudo F Statistic = 119.11

Approximate Expected Over-All R-Squared = 0.49090

Cubic Clustering Criterion = -5.338

WARNING: The two above values are invalid for correlated variables.


Cluster Means
Cluster x y
1 2.280017469 0.263940765
2 -2.075547895 -0.151348765

Cluster Standard Deviations
Cluster x y
1 2.412264861 2.089922815
2 1.379355878 2.201567557


[Figure: PROC GPLOT scatter plot of y against x, marked by cluster, from the final analysis with outliers assigned]

Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.