The CATMOD Procedure
Data Levels | Design Columns | |
A | A | |
1 | 1 | 0 |
2 | 0 | 1 |
3 | -1 | -1 |
B | B | |
1 | 1 | |
2 | -1 | |
For an effect with three levels, such as A, PROC CATMOD produces two parameter estimates for each response function. By default, the first (corresponding to the first row in the "Design Columns") estimates the effect of level 1 of A. The second (corresponding to the second row in the "Design Columns") estimates the effect of level 2 of A. The sum-to-zero constraint requires the effect of level 3 of A to be the negative of the sum of the level 1 and 2 effects (as shown by the third row in the "Design Columns").
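The coding rule described above can be sketched in a few lines of Python. This is only an illustration of sum-to-zero coding, not PROC CATMOD's implementation; the function name `effect_columns` is hypothetical.

```python
def effect_columns(levels):
    """Sum-to-zero coding: one row per level, len(levels) - 1 columns.

    Hypothetical helper for illustration; not part of any SAS interface.
    """
    k = len(levels)
    rows = {}
    for i, level in enumerate(levels):
        if i < k - 1:
            # Level i gets a 1 in its own column and 0 elsewhere.
            rows[level] = [1 if j == i else 0 for j in range(k - 1)]
        else:
            # The last level is the negative of the sum of the others.
            rows[level] = [-1] * (k - 1)
    return rows


A = effect_columns([1, 2, 3])  # three levels -> two design columns
B = effect_columns([1, 2])     # two levels -> one design column

for level, row in A.items():
    print("A =", level, row)
for level, row in B.items():
    print("B =", level, row)
```

Each column sums to zero over the levels, which is the constraint that lets the effect of the last level be recovered from the displayed estimates.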
Data Levels | | Design Matrix Columns | | | | |
A | B | A | | B | A*B | |
1 | 1 | 1 | 0 | 1 | 1 | 0 |
1 | 2 | 1 | 0 | -1 | -1 | 0 |
2 | 1 | 0 | 1 | 1 | 0 | 1 |
2 | 2 | 0 | 1 | -1 | 0 | -1 |
3 | 1 | -1 | -1 | 1 | -1 | -1 |
3 | 2 | -1 | -1 | -1 | 1 | 1 |
The number of degrees of freedom for a crossed effect (that is, the number of design matrix columns) is equal to the product of the numbers of degrees of freedom for the separate effects.
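As a rough illustration, the A*B columns in the preceding table can be reproduced by multiplying each A column by each B column. The helper names `effect_row` and `crossed_row` are hypothetical, and levels are assumed to be numbered 1 through n.

```python
from itertools import product


def effect_row(level, n_levels):
    """Sum-to-zero coding for one level (levels assumed numbered 1..n_levels)."""
    if level == n_levels:
        return [-1] * (n_levels - 1)
    return [1 if j == level - 1 else 0 for j in range(n_levels - 1)]


def crossed_row(a_row, b_row):
    """Columns of A*B: every A column multiplied by every B column."""
    return [a * b for a in a_row for b in b_row]


design = {}
for a, b in product([1, 2, 3], [1, 2]):
    a_row, b_row = effect_row(a, 3), effect_row(b, 2)
    # Columns in the order A, B, A*B, as in the table.
    design[(a, b)] = a_row + b_row + crossed_row(a_row, b_row)

for levels, row in design.items():
    print(levels, row)
```

Here A contributes two columns and B contributes one, so the crossed effect has 2 x 1 = 2 columns, in line with the degrees-of-freedom rule.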
Data Levels | | Design Matrix Columns | | | |
B | A | A(B) | | | |
1 | 1 | 1 | 0 | 0 | 0 |
1 | 2 | 0 | 1 | 0 | 0 |
1 | 3 | -1 | -1 | 0 | 0 |
2 | 1 | 0 | 0 | 1 | 0 |
2 | 2 | 0 | 0 | 0 | 1 |
2 | 3 | 0 | 0 | -1 | -1 |
PROC CATMOD actually allocates a column for all possible combinations of values even though some combinations may not be present in the data.
Data Levels | | Design Matrix Columns | | | |
B | A | A(B=1) | | A(B=2) | |
1 | 1 | 1 | 0 | 0 | 0 |
1 | 2 | 0 | 1 | 0 | 0 |
1 | 3 | -1 | -1 | 0 | 0 |
2 | 1 | 0 | 0 | 1 | 0 |
2 | 2 | 0 | 0 | 0 | 1 |
2 | 3 | 0 | 0 | -1 | -1 |
Each effect has n_{a}-1 degrees of freedom, where n_{a} is the number of levels of A, assuming a complete combination. Thus, for the example, in which A has three levels, each effect has two degrees of freedom.
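The nested and nested-by-value columns can be sketched as follows: the coding for A is placed in the block of columns that belongs to the observed level of B, and every other block is zero. This is an illustrative sketch only; `effect_row` and `nested_row` are hypothetical names, and levels are assumed to be numbered 1 through n.

```python
def effect_row(level, n_levels):
    """Sum-to-zero coding for one level (levels assumed numbered 1..n_levels)."""
    if level == n_levels:
        return [-1] * (n_levels - 1)
    return [1 if j == level - 1 else 0 for j in range(n_levels - 1)]


def nested_row(a, b, n_a, n_b):
    """Columns of A(B): A's coding in the block for level b, zeros elsewhere."""
    row = []
    for b_level in range(1, n_b + 1):
        if b_level == b:
            row += effect_row(a, n_a)
        else:
            row += [0] * (n_a - 1)
    return row


# Keys are (B, A), matching the column order of the tables.
rows = {(b, a): nested_row(a, b, 3, 2) for b in (1, 2) for a in (1, 2, 3)}

# The nested-by-value effects are the per-level blocks of the same columns.
a_in_b1 = {key: row[:2] for key, row in rows.items()}  # A(B=1)
a_in_b2 = {key: row[2:] for key, row in rows.items()}  # A(B=2)

for key, row in rows.items():
    print(key, row)
```

Splitting A(B) into A(B=1) and A(B=2) does not change the columns; it only separates them into two effects of two degrees of freedom each.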
The procedure compares nested values to data values on the basis of formatted values. If a format is not specified for the variable, the procedure formats internal data values to BEST16, left-justified. The nested values specified in nested-by-value effects are also converted to a BEST16 formatted value, left-justified.
For example, if the numeric variable B has internal data values 1 and 2, then A(B=1), A(B=1.0), and A(B=1E0) are all valid nested-by-value effects. However, if the data value 1 is formatted as 'one', then A(B='one') is a valid effect, but A(B=1) is not, because the formatted nested value (1) does not match the formatted data value (one).
To ensure correct nested-by-value effects, look at the tables of population and response profiles. These are displayed by default, and they contain the formatted data values. In addition, the population and response profiles are displayed when you specify the ONEWAY option in the MODEL statement.
Data Levels | | Design Columns | |
X1 | X2 | X1 | X2 |
1 | 1 | 1 | 1 |
2 | 4 | 2 | 4 |
3 | 9 | 3 | 9 |
Unless there is a POPULATION statement that excludes the direct variables, the direct variables help to define the sample populations. In general, the variables should not be continuous in the sense that every subject has a different value because this would induce a separate population for each subject (note, however, that such a strategy is used purposely for logistic regression).
If there is a POPULATION statement that omits mention of the direct variables, then the values of the direct variables must be identical for all subjects in a given population since there can only be one independent variable profile for each population.
The following subsections illustrate the effect of specifying (or not specifying) an AVERAGED model type. This section does not apply to log-linear models; for these models, see the "Log-Linear Model Design Matrices" section.
Suppose the independent variable A has two levels and you specify
   proc catmod;
      model Y=A;
   run;
If the variable Y has two levels, then there is only one response function per population, and the design matrix is as follows.
Design Matrix | ||
Sample | Intercept | A |
1 | 1 | 1 |
2 | 1 | -1 |
But if the variable Y has three levels, then there are two response functions per population, and the preceding design matrix is assumed to hold for each of the two response functions. The response functions are always ordered so that the multiple response functions within a population are grouped together. For this example, the design matrix would be as follows.
| | Design Matrix | | | |
Sample | Response Function Number | Intercept | | A | |
1 | 1 | 1 | 0 | 1 | 0 |
1 | 2 | 0 | 1 | 0 | 1 |
2 | 1 | 1 | 0 | -1 | 0 |
2 | 2 | 0 | 1 | 0 | -1 |
Since the same submatrix applies to each of the multiple response functions, PROC CATMOD displays only the submatrix (that is, the one it would create if there were only one response function per population) rather than the entire design matrix. The table of parameter estimates, however, contains one estimate for each column of the entire design matrix, as in the following example.
Effect | Parameter | Estimate |
Intercept | 1 | 1.4979 |
Intercept | 2 | 0.8404 |
A | 3 | 0.1116 |
A | 4 | -0.3296 |
Notice that the intercept and the A effect each have two parameter estimates associated with them. The first estimate in each pair is associated with the first response function, and the second in each pair is associated with the second response function. Consequently, 0.1116 is the effect of the first level of A on the first response function. In any table of parameter estimates displayed by PROC CATMOD, as you read down the column of estimates, the response function level changes before levels of the variables making up the effect.
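The ordering rule in the preceding paragraph amounts to a nested loop in which the response function index varies fastest. The following sketch only illustrates that rule; the function name `parameter_order` and the tuple layout are hypothetical.

```python
def parameter_order(effects, q):
    """List parameters in displayed order.

    effects: list of (effect name, number of design columns) pairs.
    q: number of response functions per population.
    Returns (effect, column, response function) tuples, with the
    response function changing before the effect column does.
    """
    order = []
    for name, n_columns in effects:
        for column in range(1, n_columns + 1):
            for function in range(1, q + 1):
                order.append((name, column, function))
    return order


# Two-level A (one design column) and a three-level response (q = 2).
order = parameter_order([("Intercept", 1), ("A", 1)], q=2)
for number, entry in enumerate(order, start=1):
    print(number, entry)
```

With this ordering, parameter 3 is the first A column on the first response function, which is how the estimate 0.1116 is read in the table.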
For example, suppose the dependent variable Y has three levels, the independent variable A has two levels, and you specify
   proc catmod;
      response marginals;
      model y=a / averaged;
   run;
Then there are two response functions per population, and the response functions are always ordered so that the multiple response functions within a population are grouped together. For this example, the design matrix would be as follows.
| | Design Matrix | |
Sample | Response Function Number | Intercept | A |
1 | 1 | 1 | 1 |
1 | 2 | 1 | 1 |
2 | 1 | 1 | -1 |
2 | 2 | 1 | -1 |
Note that the model now has only two degrees of freedom. The remaining two degrees of freedom in the residual correspond to variation among the three levels of the dependent variable. Generally, that variation tends to be statistically significant and therefore should not be left out of the model. You can include it in the model by including the two effects, _RESPONSE_ and _RESPONSE_*A, but if the study is not a repeated measurement study, those sources of variation tend to be uninteresting. Thus, the usual solution for this type of study (one dependent variable) is to exclude the AVERAGED option from the MODEL statement.
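The averaged design matrix and the degrees-of-freedom count can be checked with a small sketch. It assumes the two-population, two-response-function example above; `averaged_design` is a hypothetical name, and the residual count assumes the design columns are linearly independent.

```python
def averaged_design(submatrix, q):
    """Repeat each population's design row once per response function."""
    return [list(row) for row in submatrix for _ in range(q)]


# Submatrix for one response function per population: Intercept, A.
submatrix = [[1, 1], [1, -1]]
design = averaged_design(submatrix, q=2)

n_functions = len(design)       # response functions across all populations
n_parameters = len(design[0])   # model degrees of freedom
residual_df = n_functions - n_parameters

for row in design:
    print(row)
print("model df:", n_parameters, "residual df:", residual_df)
```

Because both response functions in a population share the same design row, nothing in the model distinguishes them, which is why that variation ends up in the residual.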
An AVERAGED model type is automatically induced whenever you use the _RESPONSE_ keyword in the MODEL statement. The _RESPONSE_ effect models variation among the q response functions per population. If there is no REPEATED, FACTORS, or LOGLIN statement, then PROC CATMOD builds a main effect with q-1 degrees of freedom. For example, three response functions would induce the following design columns.
| Design Columns | |
Response Function Number | _Response_ | |
1 | 1 | 0 |
2 | 0 | 1 |
3 | -1 | -1 |
If there is more than one population, then the _RESPONSE_ effect is averaged over the populations. Also, the _RESPONSE_ effect can be crossed with any other effect, or it can be nested within an effect.
If there is a REPEATED statement that contains only one repeated measurement factor, then PROC CATMOD builds the design columns for _RESPONSE_ in the same way, except that the output labels the main effect with the factor name rather than with the word _RESPONSE_. For example, suppose an independent variable A has two levels, and the input statements are
   proc catmod;
      response marginals;
      model Time1*Time2=A _response_ A*_response_;
      repeated Time 2 / _response_=Time;
   run;
If Time1 and Time2 each have two levels (so that they each have one independent marginal probability), then the RESPONSE statement causes PROC CATMOD to compute two response functions per population. Thus, the design matrix is as follows.
| | Design Matrix | | | |
Sample | Response Function Number | Intercept | A | Time | A*Time |
1 | 1 | 1 | 1 | 1 | 1 |
1 | 2 | 1 | 1 | -1 | -1 |
2 | 1 | 1 | -1 | 1 | -1 |
2 | 2 | 1 | -1 | -1 | 1 |
However, if Time1 and Time2 each have three levels (so that they each have two independent marginal probabilities), then the RESPONSE statement causes PROC CATMOD to compute four response functions per population. In that case, since Time has two levels, PROC CATMOD groups the functions into sets of 2 (=4/2) and constructs the preceding submatrix for each function in the set. This results in the following design matrix, which is obtained from the previous one by multiplying each element by an identity matrix of order two.
| | Design Matrix | | | | | | | |
Sample | Response Function | Intercept | | A | | Time | | A*Time | |
1 | P(Time1=1) | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
1 | P(Time1=2) | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
1 | P(Time2=1) | 1 | 0 | 1 | 0 | -1 | 0 | -1 | 0 |
1 | P(Time2=2) | 0 | 1 | 0 | 1 | 0 | -1 | 0 | -1 |
2 | P(Time1=1) | 1 | 0 | -1 | 0 | 1 | 0 | -1 | 0 |
2 | P(Time1=2) | 0 | 1 | 0 | -1 | 0 | 1 | 0 | -1 |
2 | P(Time2=1) | 1 | 0 | -1 | 0 | -1 | 0 | 1 | 0 |
2 | P(Time2=2) | 0 | 1 | 0 | -1 | 0 | -1 | 0 | 1 |
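Multiplying each element by an identity matrix of order two is a Kronecker product, so the preceding matrix can be reproduced from the smaller one. The sketch below uses plain Python lists; `kron` and `identity` are hypothetical helper names, and the submatrix is copied from the two-level example.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]


def identity(n):
    """Identity matrix of order n."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


# Columns: Intercept, A, Time, A*Time (two-level Time1 and Time2).
submatrix = [
    [1,  1,  1,  1],
    [1,  1, -1, -1],
    [1, -1,  1, -1],
    [1, -1, -1,  1],
]

# Three-level Time1 and Time2: two marginal probabilities per time point.
full = kron(submatrix, identity(2))
for row in full:
    print(row)
```

The result has eight rows and eight columns: each original row expands into one row per marginal probability, and each original column into one column per marginal probability.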
If there is a REPEATED statement that contains two or more repeated measurement factors, then PROC CATMOD builds the design columns for _RESPONSE_ according to the definition of _RESPONSE_ in the REPEATED statement. For example, suppose you specify
   proc catmod;
      response marginals;
      model R11*R12*R21*R22=_response_;
      repeated Time 2, Place 2 / _response_=Time Place;
   run;
If each of the dependent variables has two levels, then PROC CATMOD builds four response functions. The _RESPONSE_ effect generates a main effects model with respect to Time and Place.
| | | | Design Matrix | | |
Response Function Number | Variable | Time | Place | Intercept | _Response_ | |
1 | R11 | 1 | 1 | 1 | 1 | 1 |
2 | R12 | 1 | 2 | 1 | 1 | -1 |
3 | R21 | 2 | 1 | 1 | -1 | 1 |
4 | R22 | 2 | 2 | 1 | -1 | -1 |
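The preceding table can be rebuilt by enumerating the Time and Place levels, with Place varying fastest, and coding each two-level factor as 1 or -1. This is a sketch of the main-effects case only; the helper name `code` and the tuple layout are hypothetical.

```python
from itertools import product


def code(level):
    """Effect coding for a two-level factor: level 1 -> 1, level 2 -> -1."""
    return 1 if level == 1 else -1


variables = ["R11", "R12", "R21", "R22"]

rows = []
# product() varies its last argument fastest, so Place changes before Time.
for name, (time, place) in zip(variables, product([1, 2], [1, 2])):
    # Columns: Intercept, Time main effect, Place main effect.
    rows.append((name, time, place, [1, code(time), code(place)]))

for number, row in enumerate(rows, start=1):
    print(number, *row)
```

The order in which the dependent variables are listed in the MODEL statement therefore has to agree with the factor levels they represent.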
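For example, suppose X and Y are dependent variables with two levels each, and you request a saturated log-linear model analysis:
   proc catmod;
      model X*Y=_response_;
      loglin X Y X*Y;
   run;
Then the cross-classification of X and Y yields four response probabilities, p_{11}, p_{12}, p_{21}, and p_{22}, which are then reduced to three generalized logit response functions, F_{1} = log(p_{11}/p_{22}), F_{2} = log(p_{12}/p_{22}), and F_{3} = log(p_{21}/p_{22}).
Since the saturated log-linear model implies that
   log(p_{ij}) = μ + x_{i} β_{1} + y_{j} β_{2} + x_{i} y_{j} β_{3}
where x_{1} = y_{1} = 1, x_{2} = y_{2} = -1, β_{1}, β_{2}, and β_{3} are the X, Y, and X*Y parameters, and μ is a normalizing constant, it follows that
   F_{1} = 2 β_{1} + 2 β_{2}
   F_{2} = 2 β_{1} - 2 β_{3}
   F_{3} = 2 β_{2} - 2 β_{3}
so the design matrix is as follows.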
| | Design Matrix | | |
Sample | Response Function Number | X | Y | X*Y |
1 | 1 | 2 | 2 | 0 |
1 | 2 | 2 | 0 | -2 |
1 | 3 | 0 | 2 | -2 |
Design matrices for reduced models are constructed similarly. For example, suppose you request a main-effects log-linear model analysis of the factors X and Y:
   proc catmod;
      model X*Y=_response_;
      loglin X Y;
   run;
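Since the main-effects log-linear model implies that log(p_{ij}) = μ + x_{i} β_{1} + y_{j} β_{2}, where x_{1} = y_{1} = 1, x_{2} = y_{2} = -1, β_{1} and β_{2} are the X and Y parameters, and μ is a normalizing constant, it follows that F_{1} = 2 β_{1} + 2 β_{2}, F_{2} = 2 β_{1}, and F_{3} = 2 β_{2}, so the design matrix is as follows.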
| | Design Matrix | |
Sample | Response Function Number | X | Y |
1 | 1 | 2 | 2 |
1 | 2 | 2 | 0 |
1 | 3 | 0 | 2 |
Since it is difficult to tell from the final design matrix whether PROC CATMOD used the parameterization that you intended, the procedure displays the untransformed _RESPONSE_ matrix for log-linear models. For example, the main-effects model in the preceding example induces the display of the following matrix.
| _Response_ Matrix | |
Response Function Number | 1 | 2 |
1 | 1 | 1 |
2 | 1 | -1 |
3 | -1 | 1 |
4 | -1 | -1 |
You can suppress the display of this matrix by specifying the NORESPONSE option in the MODEL statement.
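The link between the untransformed _RESPONSE_ matrix and the final design matrix can be sketched as a row subtraction. Because each generalized logit compares one cell with the last cell, each design row is the corresponding _RESPONSE_ row minus the last _RESPONSE_ row. This is an illustration under that assumption, not a description of the procedure's internals; `logit_design` is a hypothetical name.

```python
def logit_design(response_matrix):
    """Design rows for generalized logits taken against the last cell."""
    last = response_matrix[-1]
    return [
        [z - r for z, r in zip(row, last)]
        for row in response_matrix[:-1]
    ]


# Untransformed _RESPONSE_ matrix for the main-effects model (columns X, Y),
# with rows for the cells (1,1), (1,2), (2,1), (2,2).
main_effects = [[1, 1], [1, -1], [-1, 1], [-1, -1]]

# Saturated model: append the X*Y column as the product of the X and Y columns.
saturated = [row + [row[0] * row[1]] for row in main_effects]

print(logit_design(main_effects))
print(logit_design(saturated))
```

Both results agree with the design matrices shown for the main-effects and saturated models, which makes the untransformed matrix a convenient check on the parameterization.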
Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.