The CLUSTER Procedure

Example 23.5: Computing a Distance Matrix

A wide variety of distance and similarity measures are used in cluster analysis (Anderberg 1973, Sneath and Sokal 1973). If your data are in coordinate form and you want to use a non-Euclidean distance for clustering, you can compute a distance matrix using a DATA step or the IML procedure.
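As a minimal sketch of the IML route, assuming SAS/IML is available, the following statements compute a city-block distance matrix for three made-up observations and save it as a TYPE=DISTANCE data set. The matrix x, the data set name dist, and the choice of city-block distance are illustrative assumptions and are not part of this example.

```sas
proc iml;
   /* Hypothetical coordinate data: 3 observations, 2 variables */
   x = {1 2,
        4 6,
        0 0};
   n = nrow(x);
   d = j(n, n, 0);                 /* n x n matrix of zeros */
   do i = 1 to n;
      do k = 1 to n;
         /* city-block distance: sum of absolute coordinate differences */
         d[i, k] = sum(abs(x[i, ] - x[k, ]));
      end;
   end;
   print d;
   create dist from d;             /* variables are named COL1-COL3 */
   append from d;
   close dist;
quit;

/* Mark the data set as a distance matrix for PROC CLUSTER */
data dist(type=distance);
   set dist;
run;
```

For these three points the off-diagonal distances are 7, 3, and 10, and the diagonal is 0.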

Similarity measures must be converted to dissimilarities before being used in PROC CLUSTER. Such conversion can be done in a variety of ways, such as taking reciprocals or subtracting from a large value. The choice of conversion method depends on the application and the similarity measure.
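The two conversions mentioned above can be sketched in a DATA step. The data set simpairs, its variables, and the assumption that similarities lie between 0 and 1 (so that 1 is the "large value") are hypothetical and serve only as an illustration.

```sas
/* Hypothetical input: one row per pair, similarity between 0 and 1 */
data simpairs;
   input item1 $ item2 $ sim;
   datalines;
A B 0.80
A C 0.25
B C 0.50
;

data dissim;
   set simpairs;
   /* Subtract from the largest possible similarity (assumed to be 1) */
   d_subtract = 1 - sim;
   /* Take the reciprocal; left missing when the similarity is 0 */
   if sim > 0 then d_recip = 1 / sim;
run;

proc print data=dissim;
run;
```

The two conversions preserve the ordering of the pairs but space them differently, which is why the choice depends on the application.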

In the following example, the observations are states. Each binary-valued variable corresponds to one ground for divorce and indicates whether that ground applies in the state.

The %DISTANCE macro is used to compute the Jaccard coefficient (Anderberg 1973, pp. 89, 115, and 117) between each pair of states. The Jaccard coefficient is defined as the number of variables that are coded as 1 for both states divided by the number of variables that are coded as 1 for either or both states. The Jaccard coefficient is converted to a distance measure by subtracting it from 1.
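To make the definition concrete, the following DATA step is a small hand check for one pair of states, using the codes for ALABAMA (111111111) and ALASKA (111011110) from the data below. It is only an illustration of the arithmetic, not the way the macro is implemented; the data set and variable names are made up.

```sas
data jaccard_check;
   array a[9] _temporary_ (1 1 1 1 1 1 1 1 1);   /* ALABAMA */
   array b[9] _temporary_ (1 1 1 0 1 1 1 1 0);   /* ALASKA  */
   both   = 0;   /* variables coded 1 for both states        */
   either = 0;   /* variables coded 1 for at least one state */
   do i = 1 to 9;
      both   = both   + (a[i] = 1 and b[i] = 1);
      either = either + (a[i] = 1 or  b[i] = 1);
   end;
   jaccard  = both / either;    /* similarity */
   distance = 1 - jaccard;      /* Jaccard distance */
   put both= either= jaccard= 7.5 distance= 7.5;
   drop i;
run;
```

Here both is 7 and either is 9, so the distance is 1 - 7/9 = 2/9, or about 0.22222, which agrees with the ALABAMA and ALASKA entry in Output 23.5.1.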

   %include '<location of  SAS/STAT sample library>/xmacro.sas';
   %include '<location of  SAS/STAT sample library>/distnew.sas';

   options ls=120 ps=60;
   data divorce;
      title 'Grounds for Divorce';
      input state $15.
            (incompat cruelty desertn non_supp alcohol 
             felony impotenc insanity separate) (1.) @@;
      if mod(_n_,2) then input +4 @@; else input;
      datalines;
   ALABAMA        111111111    ALASKA         111011110
   ARIZONA        100000000    ARKANSAS       011111111
   CALIFORNIA     100000010    COLORADO       100000000
   CONNECTICUT    111111011    DELAWARE       100000001
   FLORIDA        100000010    GEORGIA        111011110
   HAWAII         100000001    IDAHO          111111011
   ILLINOIS       011011100    INDIANA        100001110
   IOWA           100000000    KANSAS         111011110
   KENTUCKY       100000000    LOUISIANA      000001001
   MAINE          111110110    MARYLAND       011001111
   MASSACHUSETTS  111111101    MICHIGAN       100000000
   MINNESOTA      100000000    MISSISSIPPI    111011110
   MISSOURI       100000000    MONTANA        100000000
   NEBRASKA       100000000    NEVADA         100000011
   NEW HAMPSHIRE  111111100    NEW JERSEY     011011011
   NEW MEXICO     111000000    NEW YORK       011001001
   NORTH CAROLINA 000000111    NORTH DAKOTA   111111110
   OHIO           111011101    OKLAHOMA       111111110
   OREGON         100000000    PENNSYLVANIA   011001110
   RHODE ISLAND   111111101    SOUTH CAROLINA 011010001
   SOUTH DAKOTA   011111000    TENNESSEE      111111100
   TEXAS          111001011    UTAH           011111110
   VERMONT        011101011    VIRGINIA       010001001
   WASHINGTON     100000001    WEST VIRGINIA  111011011
   WISCONSIN      100000001    WYOMING        100000011
   ;

   %distance(data=divorce, id=state, options=nomiss, out=distjacc,
             shape=square, method=djaccard, var=incompat--separate);

   proc print data=distjacc(obs=10);
      id state; var alabama--georgia;
      title2 'First 10 states';
   run;
   title2;

   proc cluster data=distjacc method=centroid 
                pseudo outtree=tree;
      id state;
      var alabama--wyoming;
   run;

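   /* Cut the tree at nine clusters; OUT= receives each state's cluster */
   proc tree data=tree noprint n=9 out=out;
      id state;
   run;

   /* Sort the cluster assignments by state so that they can be merged */
   proc sort data=out;
      by state;
   run;

   /* Attach the cluster number to the original binary data
      (DIVORCE is already in alphabetical order by state) */
   data clus;
      merge divorce out;
      by state;
   run;

   proc sort data=clus;
      by cluster;
   run;

   /* List the grounds for divorce within each cluster */
   proc print data=clus;
      id state;
      var incompat--separate;
      by cluster;
   run;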

Output 23.5.1: Computing a Distance Matrix

Grounds for Divorce
First 10 states

state        ALABAMA   ALASKA  ARIZONA  ARKANSAS  CALIFORNIA  COLORADO  CONNECTICUT  DELAWARE  FLORIDA  GEORGIA
ALABAMA      0.00000  0.22222  0.88889   0.11111     0.77778   0.88889      0.11111   0.77778  0.77778  0.22222
ALASKA       0.22222  0.00000  0.85714   0.33333     0.71429   0.85714      0.33333   0.87500  0.71429  0.00000
ARIZONA      0.88889  0.85714  0.00000   1.00000     0.50000   0.00000      0.87500   0.50000  0.50000  0.85714
ARKANSAS     0.11111  0.33333  1.00000   0.00000     0.88889   1.00000      0.22222   0.88889  0.88889  0.33333
CALIFORNIA   0.77778  0.71429  0.50000   0.88889     0.00000   0.50000      0.75000   0.66667  0.00000  0.71429
COLORADO     0.88889  0.85714  0.00000   1.00000     0.50000   0.00000      0.87500   0.50000  0.50000  0.85714
CONNECTICUT  0.11111  0.33333  0.87500   0.22222     0.75000   0.87500      0.00000   0.75000  0.75000  0.33333
DELAWARE     0.77778  0.87500  0.50000   0.88889     0.66667   0.50000      0.75000   0.00000  0.66667  0.87500
FLORIDA      0.77778  0.71429  0.50000   0.88889     0.00000   0.50000      0.75000   0.66667  0.00000  0.71429
GEORGIA      0.22222  0.00000  0.85714   0.33333     0.71429   0.85714      0.33333   0.87500  0.71429  0.00000


Grounds for Divorce

The CLUSTER Procedure
Centroid Hierarchical Cluster Analysis

Root-Mean-Square Distance Between Observations = 0.694873

Cluster History
NCL  Clusters Joined                FREQ    PSF  PST2  Norm Cent Dist  Tie
 49  ARIZONA        COLORADO           2      .     .               0  T
 48  CALIFORNIA     FLORIDA            2      .     .               0  T
 47  ALASKA         GEORGIA            2      .     .               0  T
 46  DELAWARE       HAWAII             2      .     .               0  T
 45  CONNECTICUT    IDAHO              2      .     .               0  T
 44  CL49           IOWA               3      .     .               0  T
 43  CL47           KANSAS             3      .     .               0  T
 42  CL44           KENTUCKY           4      .     .               0  T
 41  CL42           MICHIGAN           5      .     .               0  T
 40  CL41           MINNESOTA          6      .     .               0  T
 39  CL43           MISSISSIPPI        4      .     .               0  T
 38  CL40           MISSOURI           7      .     .               0  T
 37  CL38           MONTANA            8      .     .               0  T
 36  CL37           NEBRASKA           9      .     .               0  T
 35  NORTH DAKOTA   OKLAHOMA           2      .     .               0  T
 34  CL36           OREGON            10      .     .               0  T
 33  MASSACHUSETTS  RHODE ISLAND       2      .     .               0  T
 32  NEW HAMPSHIRE  TENNESSEE          2      .     .               0  T
 31  CL46           WASHINGTON         3      .     .               0  T
 30  CL31           WISCONSIN          4      .     .               0  T
 29  NEVADA         WYOMING            2      .     .               0
 28  ALABAMA        ARKANSAS           2   1561     .          0.1599  T
 27  CL33           CL32               4    479     .          0.1799  T
 26  CL39           CL35               6    265     .          0.1799  T
 25  CL45           WEST VIRGINIA      3    231     .          0.1799
 24  MARYLAND       PENNSYLVANIA       2    199     .          0.2399
 23  CL28           UTAH               3    167   3.2          0.2468
 22  CL27           OHIO               5    136   5.4          0.2698
 21  CL26           MAINE              7    111   8.9          0.2998
 20  CL23           CL21              10   75.2   8.7          0.3004
 19  CL25           NEW JERSEY         4   71.8   6.5          0.3053  T
 18  CL19           TEXAS              5   69.1   2.5          0.3077
 17  CL20           CL22              15   48.7   9.9          0.3219
 16  NEW YORK       VIRGINIA           2   50.1     .          0.3598
 15  CL18           VERMONT            6   49.4   2.9          0.3797
 14  CL17           ILLINOIS          16   47.0   3.2          0.4425
 13  CL14           CL15              22   29.2  15.3          0.4722
 12  CL48           CL29               4   29.5     .          0.4797  T
 11  CL13           CL24              24   27.6   4.5          0.5042
 10  CL11           SOUTH DAKOTA      25   28.4   2.4          0.5449
  9  LOUISIANA      CL16               3   30.3   3.5          0.5844
  8  CL34           CL30              14   23.3     .          0.7196
  7  CL8            CL12              18   19.3  15.0          0.7175
  6  CL10           SOUTH CAROLINA    26   21.4   4.2          0.7384
  5  CL6            NEW MEXICO        27   24.0   4.7          0.8303
  4  CL5            INDIANA           28   28.9   4.1          0.8343
  3  CL4            CL9               31   31.7  10.9          0.8472
  2  CL3            NORTH CAROLINA    32   55.1   4.1          1.0017
  1  CL2            CL7               50      .  55.1          1.0663


Grounds for Divorce

CLUSTER=1

state incompat cruelty desertn non_supp alcohol felony impotenc insanity separate
ARIZONA 1 0 0 0 0 0 0 0 0
COLORADO 1 0 0 0 0 0 0 0 0
IOWA 1 0 0 0 0 0 0 0 0
KENTUCKY 1 0 0 0 0 0 0 0 0
MICHIGAN 1 0 0 0 0 0 0 0 0
MINNESOTA 1 0 0 0 0 0 0 0 0
MISSOURI 1 0 0 0 0 0 0 0 0
MONTANA 1 0 0 0 0 0 0 0 0
NEBRASKA 1 0 0 0 0 0 0 0 0
OREGON 1 0 0 0 0 0 0 0 0

CLUSTER=2

state incompat cruelty desertn non_supp alcohol felony impotenc insanity separate
CALIFORNIA 1 0 0 0 0 0 0 1 0
FLORIDA 1 0 0 0 0 0 0 1 0
NEVADA 1 0 0 0 0 0 0 1 1
WYOMING 1 0 0 0 0 0 0 1 1

CLUSTER=3

state incompat cruelty desertn non_supp alcohol felony impotenc insanity separate
ALABAMA 1 1 1 1 1 1 1 1 1
ALASKA 1 1 1 0 1 1 1 1 0
ARKANSAS 0 1 1 1 1 1 1 1 1
CONNECTICUT 1 1 1 1 1 1 0 1 1
GEORGIA 1 1 1 0 1 1 1 1 0
IDAHO 1 1 1 1 1 1 0 1 1
ILLINOIS 0 1 1 0 1 1 1 0 0
KANSAS 1 1 1 0 1 1 1 1 0
MAINE 1 1 1 1 1 0 1 1 0
MARYLAND 0 1 1 0 0 1 1 1 1
MASSACHUSETTS 1 1 1 1 1 1 1 0 1
MISSISSIPPI 1 1 1 0 1 1 1 1 0
NEW HAMPSHIRE 1 1 1 1 1 1 1 0 0
NEW JERSEY 0 1 1 0 1 1 0 1 1
NORTH DAKOTA 1 1 1 1 1 1 1 1 0
OHIO 1 1 1 0 1 1 1 0 1
OKLAHOMA 1 1 1 1 1 1 1 1 0
PENNSYLVANIA 0 1 1 0 0 1 1 1 0
RHODE ISLAND 1 1 1 1 1 1 1 0 1
SOUTH DAKOTA 0 1 1 1 1 1 0 0 0
TENNESSEE 1 1 1 1 1 1 1 0 0
TEXAS 1 1 1 0 0 1 0 1 1
UTAH 0 1 1 1 1 1 1 1 0
VERMONT 0 1 1 1 0 1 0 1 1
WEST VIRGINIA 1 1 1 0 1 1 0 1 1

CLUSTER=4

state incompat cruelty desertn non_supp alcohol felony impotenc insanity separate
DELAWARE 1 0 0 0 0 0 0 0 1
HAWAII 1 0 0 0 0 0 0 0 1
WASHINGTON 1 0 0 0 0 0 0 0 1
WISCONSIN 1 0 0 0 0 0 0 0 1

CLUSTER=5

state incompat cruelty desertn non_supp alcohol felony impotenc insanity separate
LOUISIANA 0 0 0 0 0 1 0 0 1
NEW YORK 0 1 1 0 0 1 0 0 1
VIRGINIA 0 1 0 0 0 1 0 0 1

CLUSTER=6

state incompat cruelty desertn non_supp alcohol felony impotenc insanity separate
SOUTH CAROLINA 0 1 1 0 1 0 0 0 1

CLUSTER=7

state incompat cruelty desertn non_supp alcohol felony impotenc insanity separate
NEW MEXICO 1 1 1 0 0 0 0 0 0

CLUSTER=8

state incompat cruelty desertn non_supp alcohol felony impotenc insanity separate
INDIANA 1 0 0 0 0 1 1 1 0

CLUSTER=9

state incompat cruelty desertn non_supp alcohol felony impotenc insanity separate
NORTH CAROLINA 0 0 0 0 0 0 1 1 1


Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.