The CORR Procedure

# Concepts

Correlation coefficients convey both the strength and the direction of a linear relationship between two numeric random variables. If one variable x is an exact linear function of another variable y, the correlation is 1 when the relationship is positive and -1 when the relationship is inverse. If there is no linear predictability between the two variables, the correlation is 0. If the variables are jointly normally distributed and the correlation is 0, the two variables are independent. However, correlation does not imply causality because, in some cases, an underlying causal relationship might not exist.

The scatterplots in Examining Correlations Using Scatterplots depict the relationship between two numeric random variables.

When the relationship between two variables is nonlinear or when outliers are present, the correlation coefficient incorrectly estimates the strength of the relationship. Plotting the data before computing a correlation coefficient enables you to verify the linear relationship and to identify the potential outliers.
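One way to plot the data alongside the correlations is sketched below. It assumes a SAS release in which PROC CORR supports the PLOTS= option through ODS Graphics. The data set sashelp.class and its Height and Weight variables are stand-ins chosen for illustration, so substitute your own data.

```sas
/* Sketch: request a scatter plot together with the correlation.   */
/* Assumption: sashelp.class (Height, Weight) is illustrative data. */
ods graphics on;

proc corr data=sashelp.class plots=scatter;
   var Height Weight;
run;

ods graphics off;
```

Inspecting the plot first helps you judge whether the relationship looks linear and whether any observations look like potential outliers before you interpret the coefficient.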

The only factor limiting the number of variables that you can analyze is the amount of available memory. The computer resources that PROC CORR requires depend on which statements and options you specify. To determine the computer resources that you need, use

| Symbol | Meaning |
|---|---|
| N | number of observations in the data set |
| C | number of correlation types (1 to 4) |
| V | number of VAR statement variables |
| W | number of WITH statement variables |
| P | number of PARTIAL statement variables |

so that

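$$
\begin{aligned}
T &= V + W + P \\
K &= \begin{cases} V \cdot W & \text{when } W > 0 \\ V(V+1)/2 & \text{when } W = 0 \end{cases} \\
L &= \begin{cases} K & \text{when } P = 0 \\ T(T+1)/2 & \text{when } P > 0 \end{cases}
\end{aligned}
$$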
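As a quick check of these definitions, here are two worked cases. The variable counts are hypothetical values chosen only for illustration.

With V = 2, W = 3, and P = 1:

$$
T = 2 + 3 + 1 = 6, \qquad K = 2 \cdot 3 = 6, \qquad L = \frac{6 \cdot 7}{2} = 21
$$

With V = 3, W = 0, and P = 0:

$$
T = 3, \qquad K = \frac{3 \cdot 4}{2} = 6, \qquad L = K = 6
$$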

For small N and large K, the CPU time varies as K for all types of correlations. For large N, the CPU time depends on the type of correlation and varies as

| Correlation type | CPU time varies as |
|---|---|
| PEARSON (default) | K*N |
| SPEARMAN | T*N*log N |
| HOEFFDING or KENDALL | K*N*log N |

You can reduce CPU time by specifying NOMISS. With NOMISS, processing is much faster when most observations do not contain missing values.
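A minimal sketch of specifying NOMISS follows. The data set sashelp.class and the variables Age, Height, and Weight are assumptions used only for illustration.

```sas
/* Sketch: NOMISS drops any observation that has a missing value   */
/* for one of the analysis variables before statistics are         */
/* computed. Assumption: sashelp.class is illustrative data.       */
proc corr data=sashelp.class nomiss;
   var Age Height Weight;
run;
```

Because NOMISS excludes incomplete observations from every correlation, all coefficients in the output are based on the same set of observations.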

The options and statements you use in the procedure require different amounts of storage to process the data. For Pearson correlations, the amount of temporary storage in bytes (M) is

$$
M = \begin{cases}
40T + 16L & \text{with NOMISS and NOSIMPLE} \\
40T + 16L + 56T & \text{with NOMISS} \\
40T + 16L + 56K & \text{with NOSIMPLE} \\
40T + 16L + 56K + 56T & \text{with no options}
\end{cases}
$$

Using a PARTIAL statement increases the amount of temporary storage by 12T bytes. Using the ALPHA option increases the amount of temporary storage by 32V+16 bytes.

The following example uses a PARTIAL statement, which invokes NOMISS.

    proc corr;
       var x1 x2;
       with y1 y2 y3;
       partial z1;
    run;

Therefore, using 40T+16L+56T+12T, the minimum temporary storage equals 984 bytes (T=2+3+1=6 and L=T(T+1)/2=21).
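The arithmetic for this example can be reproduced with a short DATA step. This is only a sketch for checking the formula by hand. The step and its variable names are illustrative and are not part of PROC CORR.

```sas
/* Sketch: evaluate the Pearson storage formula for the example.   */
data _null_;
   V = 2;  W = 3;  P = 1;      /* VAR, WITH, and PARTIAL variable counts */
   T = V + W + P;
   if W > 0 then K = V * W;
   else K = V * (V + 1) / 2;
   if P > 0 then L = T * (T + 1) / 2;
   else L = K;
   /* PARTIAL invokes NOMISS, so the 56K term is absent; */
   /* the PARTIAL statement itself adds 12T bytes.       */
   M = 40*T + 16*L + 56*T + 12*T;
   put T= K= L= M=;            /* writes T=6 K=6 L=21 M=984 to the log */
run;
```

Changing V, W, and P lets you estimate the storage for other statement combinations, provided you also adjust the terms of M to match the options in effect.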

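Using the SPEARMAN, KENDALL, or HOEFFDING option requires additional temporary storage for each observation. For the most time-efficient processing, the amount of temporary storage in bytes is

$$
40T + 8K + 8LC + 12TN + 28N + \mathrm{QS} + \mathrm{QP} + \mathrm{QK}
$$

where

$$
\begin{aligned}
\mathrm{QS} &= \begin{cases} 0 & \text{with NOSIMPLE} \\ 68T & \text{otherwise} \end{cases} \\
\mathrm{QP} &= \begin{cases} 56K & \text{with PEARSON and without NOMISS} \\ 0 & \text{otherwise} \end{cases} \\
\mathrm{QK} &= \begin{cases} 32N & \text{with KENDALL or HOEFFDING} \\ 0 & \text{otherwise} \end{cases}
\end{aligned}
$$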

The following example uses KENDALL:

    proc corr kendall;
       var x1 x2 x3;
    run;

Therefore, the minimum temporary storage in bytes is

$$
40 \cdot 3 + 8 \cdot 6 + 8 \cdot 6 \cdot 1 + 12 \cdot 3N + 28N + 3 \cdot 68 + 32N = 420 + 96N
$$

where N is the number of observations.
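To turn this expression into a concrete figure, the following DATA step evaluates it for a hypothetical data set of 1000 observations. The value of N is an assumption for illustration only, and the step simply mirrors the terms of the formula.

```sas
/* Sketch: evaluate the rank-based storage formula for KENDALL.    */
data _null_;
   N = 1000;                   /* hypothetical number of observations */
   T = 3;  K = 6;  L = 6;  C = 1;
   QS = 68 * T;                /* NOSIMPLE is not specified           */
   QP = 0;                     /* PEARSON is not requested            */
   QK = 32 * N;                /* KENDALL is requested                */
   M = 40*T + 8*K + 8*L*C + 12*T*N + 28*N + QS + QP + QK;
   put M=;                     /* writes M=96420 to the log, that is 420 + 96*1000 */
run;
```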

If M bytes are not available, PROC CORR must process the data multiple times to compute all the statistics. This reduces the minimum temporary storage you need by 12(T-2)N bytes. When this occurs, PROC CORR prints a note suggesting a larger memory region.
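To illustrate the size of this reduction, take the KENDALL analysis of three VAR variables (T = 3), whose single-pass storage is 420+96N bytes. The following is a sketch that assumes the reduction applies directly to that total:

$$
12(T-2)N = 12(3-2)N = 12N, \qquad (420 + 96N) - 12N = 420 + 84N
$$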