
The CLUSTER Procedure

The CLUSTER procedure stores the data (including the COPY and ID variables) in memory or, if necessary, on disk. If eigenvalues are computed, the covariance matrix is stored in memory. If the stored distance or sorted distance algorithm is used, the distances are stored in memory or, if necessary, on disk.

With coordinate data, the increase in CPU time is roughly proportional to the number of variables. The VAR statement should list the variables in order of decreasing variance for greatest efficiency.
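As a rough illustration of the ordering advice above, the following Python sketch ranks variables by sample variance and builds a VAR statement listing them in decreasing order. The helper name `var_statement` and the dictionary-of-columns input are hypothetical and not part of SAS; the sketch assumes the data are already available outside SAS and that every column has at least two values.

```python
from statistics import variance


def var_statement(columns):
    """Build a VAR statement listing variables by decreasing sample variance.

    `columns` maps each variable name to a list of numeric values
    (hypothetical input format, at least two values per variable).
    """
    # Sort names so that the variable with the largest variance comes first.
    ordered = sorted(columns, key=lambda name: variance(columns[name]), reverse=True)
    return "var " + " ".join(ordered) + ";"
```

In practice the variances could equally be obtained inside SAS; the sketch only shows the ordering step.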

For both coordinate and distance data, the dominant
factor determining CPU time is the number of observations.
For density methods with coordinate data, the asymptotic
time requirements are somewhere between *n* ln(*n*) and
*n*^{2}, depending on how the smoothing parameter increases.
For other methods except EML, time
is roughly proportional to

PROC CLUSTER runs much faster if the data can be stored
in memory and, if the stored distance algorithm is used,
the distance matrix can be stored in memory as well.
To estimate the number of bytes of memory needed for the data, evaluate the
following expression and round the result up to the nearest multiple of *d*.

$$
\begin{array}{lll}
n\,( & vd + 8d + i & \\
+ & i & \text{if density estimation or the sorted distance algorithm is used} \\
+ & 3d & \text{if the stored data algorithm is used} \\
+ & 3d & \text{if density estimation is used} \\
+ & \max(8,\ \text{length of ID variable}) & \text{if an ID variable is used} \\
+ & \text{length of ID variable} & \text{if an ID variable is used} \\
+ & \text{sum of lengths of COPY variables}\,) & \text{if COPY variables are used}
\end{array}
$$

where

n | is the number of observations |

v | is the number of variables |

d | is the size of a C variable of type double. For most computers, d = 8. |

i | is the size of a C variable of type int. For most computers, i = 4. |
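The expression above can be turned into a small calculator. The following Python sketch is one reading of it, not SAS code: the function name `estimate_data_bytes` and its keyword arguments are hypothetical, the defaults d = 8 and i = 4 follow the "most computers" values given above, and the rounding is assumed to apply to the final total.

```python
def estimate_data_bytes(n, v, *, density=False, sorted_distance=False,
                        stored_data=False, id_length=None, copy_lengths=(),
                        d=8, i=4):
    """Estimate the bytes needed to hold the data, per the expression above.

    n: number of observations; v: number of variables.
    d, i: sizes of a C double and a C int (assumed 8 and 4 by default).
    id_length: length of the ID variable, or None if no ID variable is used.
    copy_lengths: lengths of the COPY variables, if any.
    """
    per_obs = v * d + 8 * d + i
    if density or sorted_distance:
        per_obs += i
    if stored_data:
        per_obs += 3 * d
    if density:
        per_obs += 3 * d
    if id_length is not None:
        # Both ID terms in the expression apply when an ID variable is used.
        per_obs += max(8, id_length) + id_length
    per_obs += sum(copy_lengths)
    total = n * per_obs
    # Assumption: round the total up to the nearest multiple of d.
    return -(-total // d) * d
```

For example, `estimate_data_bytes(1000, 5)` evaluates 1000 × (5·8 + 8·8 + 4) = 108000, which is already a multiple of 8.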

The number of bytes needed for the distance matrix is *dn*(*n*+1)/2.
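To get a feel for how quickly the distance matrix grows, the formula can be evaluated directly. This Python sketch uses a hypothetical helper name and assumes d = 8 unless told otherwise.

```python
def distance_matrix_bytes(n, d=8):
    """Bytes needed for the distance matrix: d * n * (n + 1) / 2.

    n: number of observations; d: size of a C double (assumed 8 by default).
    """
    # n * (n + 1) is always even, so integer division is exact.
    return d * n * (n + 1) // 2
```

With d = 8, 1,000 observations need about 4 MB (4,004,000 bytes), while 10,000 observations need about 400 MB (400,040,000 bytes), which illustrates why the number of observations dominates the resource requirements.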


Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.