The CORRESP Procedure
You can use an indicator matrix as input to PROC CORRESP using the VAR statement. An indicator matrix is composed of several submatrices, each of which is a design matrix with one column for each category of a categorical variable. In order to create an indicator matrix, you must code an indicator variable for each level of each categorical variable. For example, the categorical variable Sex, with two levels (Female and Male), would be coded using two indicator variables.
A binary indicator variable is coded 1 to indicate the presence of an attribute and 0 to indicate its absence. For the variable Sex, a male would be coded Female=0 and Male=1, and a female would be coded Female=1 and Male=0. The indicator variables representing a categorical variable must sum to 1.0. You can specify the BINARY option to create a binary table.
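The coding described above can be sketched in a DATA step. This is only an illustration: the data set name SexCoded, the variable Name, and the two sample observations are assumptions, not part of the original example. Each comparison evaluates to 1 when true and 0 when false, so the two indicator variables sum to 1.0 for every observation with a nonmissing value of Sex.

```sas
/* Hypothetical data set and observations, for illustration only */
data SexCoded;
   input Name $ Sex $;
   Female = (Sex = 'Female');   /* 1 if female, 0 otherwise */
   Male   = (Sex = 'Male');     /* 1 if male, 0 otherwise   */
   datalines;
Ann Female
Bob Male
;

proc print data=SexCoded;
run;
```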
Sometimes binary data such as Yes/No data are available. For example, 1 means "Yes, I have bought this brand in the last month" and 0 means "No, I have not bought this brand in the last month".
title 'Doubling Yes/No Data';

proc format;
   value yn 0 = 'No '
            1 = 'Yes';
run;

data BrandChoice;
   input a b c;
   label a = 'Brand A'
         b = 'Brand B'
         c = 'Brand C';
   format a b c yn.;
   datalines;
0 0 1
1 1 0
0 1 1
0 1 0
1 0 0
;
Data such as these cannot be analyzed directly because the raw data do not consist of partitions, each with one column per level and exactly one 1 in each row. The data must be doubled so that Yes and No are each represented by a column in the data matrix. The TRANSREG procedure provides one way of doubling. In the following statements, the DESIGN option specifies that PROC TRANSREG is being used only for coding, not analysis. The option SEPARATORS=': ' specifies that labels for the coded columns are constructed from input variable labels, followed by a colon and space, followed by the formatted value. The variables are designated in the MODEL statement as CLASS variables, and the ZERO=NONE option creates binary variables for all levels. The OUTPUT statement specifies the output data set and drops the _NAME_, _TYPE_, and Intercept variables. PROC TRANSREG stores a list of coded variable names in a macro variable &_TRGIND, which in this case has the value "aNo aYes bNo bYes cNo cYes". This macro variable can be used directly in the VAR statement in PROC CORRESP.
proc transreg data=BrandChoice design separators=': ';
   model class(a b c / zero=none);
   output out=Doubled(drop=_: Intercept);
run;

proc print label;
run;

proc corresp data=Doubled norow short;
   var &_trgind;
run;
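For comparison, the same doubling can be sketched by hand in a DATA step. This is an alternative to PROC TRANSREG, not the method shown above. The data set name DoubledByHand is an assumption, and the coded variable names are chosen to match the value of &_TRGIND given earlier. The sketch also assumes that a, b, and c contain no missing values; a missing value would produce a row whose No and Yes columns are both 0.

```sas
/* Manual doubling of the Yes/No variables in BrandChoice.     */
/* DoubledByHand is a hypothetical data set name.              */
data DoubledByHand;
   set BrandChoice;
   aNo = (a = 0);  aYes = (a = 1);   /* one No/Yes pair per brand */
   bNo = (b = 0);  bYes = (b = 1);
   cNo = (c = 0);  cYes = (c = 1);
   drop a b c;                       /* keep only the doubled columns */
run;

proc corresp data=DoubledByHand norow short;
   var aNo aYes bNo bYes cNo cYes;
run;
```

Unlike the PROC TRANSREG approach, this version does not create descriptive labels for the coded columns, so they would have to be added with a LABEL statement if needed.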
A fuzzy-coded indicator also sums to 1.0 across levels of the categorical variable, but it is coded with fractions rather than with 1 and 0. The fractions represent the distribution of the attribute across several levels of the categorical variable.
Ordinal variables, such as survey responses of 1 to 3, can be represented as two design variables.
Ordinal Value | First Coded Variable | Second Coded Variable
1 | 0.25 | 0.75
2 | 0.50 | 0.50
3 | 0.75 | 0.25
Values of the coding sum to one across the two coded variables.
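A minimal sketch of this fuzzy coding in a DATA step follows. The data set name FuzzyOrdinal and the variable names Response, Code1, and Code2 are assumptions made for illustration. The first coded variable is the response divided by 4, and the second is its complement, which reproduces the fractions in the coding above for responses of 1, 2, and 3.

```sas
/* Hypothetical names; reproduces the ordinal coding for values 1 to 3 */
data FuzzyOrdinal;
   input Response @@;
   Code1 = Response / 4;   /* 0.25, 0.50, 0.75 for responses 1, 2, 3 */
   Code2 = 1 - Code1;      /* complement, so Code1 + Code2 = 1       */
   datalines;
1 2 3
;

proc print data=FuzzyOrdinal;
run;
```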
This next example illustrates the use of binary and fuzzy-coded indicator variables. Fuzzy-coded indicators are used to represent missing data. Note that the missing values in the observation Igor are coded with equal proportions.
proc transreg data=Neighbor design cprefix=0;
   model class(Age Sex Height Hair / zero=none);
   output out=Neighbor2(drop=_: Intercept);
   id Name;
run;

data Neighbor3;
   set Neighbor2;
   if Sex = ' ' then do;
      Female = 0.5;
      Male   = 0.5;
   end;
   if Hair = ' ' then do;
      White = 1/3;
      Brown = 1/3;
      Blond = 1/3;
   end;
run;

proc print label;
run;
There is one set of coded variables for each input categorical variable. If observation 12 (Igor) is excluded, each set is a binary design matrix. Each design matrix has one column for each category and exactly one 1 in each row.
Fuzzy-coding is shown in the final observation, Igor. The observation Igor has missing values for the variables Sex and Hair. The design matrix variables are coded with fractions that sum to one within each categorical variable.
An alternative way to represent missing data is to treat missing values as an additional level of the categorical variable. This alternative is available with the MISSING option in the PROC CORRESP statement. This approach yields coordinates for missing responses, allowing the comparison of "missing" along with the other levels of the categorical variables.
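As a rough sketch of this alternative, the MISSING option can be combined with the MCA option and a TABLES statement that reads the raw categorical variables directly. The choice of the MCA and SHORT options here is an assumption made for illustration; the sketch also assumes the Neighbor data set used in the preceding example.

```sas
/* Sketch: missing values of Sex and Hair become their own levels. */
/* Assumes the Neighbor data set from the preceding example.       */
proc corresp data=Neighbor mca missing short;
   tables Age Sex Height Hair;
run;
```

With this approach no fuzzy coding step is needed, because the procedure builds the table from the raw variables and adds a category for the missing responses.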
Greenacre and Hastie (1987) discuss additional coding schemes, including one for continuous variables. Continuous variables can be coded with PROC TRANSREG by specifying BSPLINE(variables / degree=1) in the MODEL statement.
Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.