The GLM Procedure

Parameterization of PROC GLM Models

The GLM procedure constructs a linear model according to the specifications in the MODEL statement. Each effect generates one or more columns in a design matrix X. This section shows precisely how X is built.


All models include a column of 1s by default to estimate an intercept parameter \mu. You can use the NOINT option to suppress the intercept.

Regression Effects

Regression effects (covariates) have the values of the variables copied into the design matrix directly. Polynomial terms are multiplied out and then installed in X.
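
As a rough illustration of "multiplied out", the following Python sketch (not SAS, and not how PROC GLM is implemented) builds the design columns for a hypothetical model with effects X1, X2, X1*X1, and X1*X2. The function name `regression_row` and the data values are assumptions made only for this example.

```python
def regression_row(x1, x2):
    """Design columns for the effects X1, X2, X1*X1, and X1*X2.

    Covariate values are copied as-is; polynomial and product terms are
    computed by multiplying the covariate values together.
    """
    return [x1, x2, x1 * x1, x1 * x2]


# Two made-up observations (x1, x2)
rows = [regression_row(x1, x2) for x1, x2 in [(2, 5), (3, 7)]]
for row in rows:
    print(row)
```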

Main Effects

If a class variable has m levels, PROC GLM generates m columns in the design matrix for its main effect. Each column is an indicator variable for one of the levels of the class variable. The default order of the columns is the sort order of the values of their levels; this order can be controlled with the ORDER= option in the PROC GLM statement. The following table shows the design matrix for main effects A (two levels) and B (three levels).

Data     Design Matrix
                 A          B
A  B     \mu   A1  A2   B1  B2  B3
1  1      1     1   0    1   0   0
1  2      1     1   0    0   1   0
1  3      1     1   0    0   0   1
2  1      1     0   1    1   0   0
2  2      1     0   1    0   1   0
2  3      1     0   1    0   0   1

There are more columns for these effects than there are degrees of freedom for them; in other words, PROC GLM is using an over-parameterized model.
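
The construction of main-effect columns can be sketched in a few lines of Python. This is only an illustration of the rule described above, using the A and B values from the table; the helper name `indicator_columns` is hypothetical.

```python
def indicator_columns(values, levels):
    """Return one 0/1 indicator per level, in the given level order."""
    return [[1 if v == lev else 0 for lev in levels] for v in values]


# (A, B) pairs from the main-effects table
data = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3)]
a_levels = sorted({a for a, _ in data})  # default: sort order of the levels
b_levels = sorted({b for _, b in data})

a_cols = indicator_columns([a for a, _ in data], a_levels)
b_cols = indicator_columns([b for _, b in data], b_levels)

# Intercept column, then A1 A2, then B1 B2 B3
X = [[1] + ra + rb for ra, rb in zip(a_cols, b_cols)]
for row in X:
    print(row)

# Over-parameterization: A1 + A2 and B1 + B2 + B3 each reproduce the
# intercept column, so the six columns are linearly dependent.
dependent = all(sum(ra) == 1 and sum(rb) == 1 for ra, rb in zip(a_cols, b_cols))
print(dependent)
```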

Crossed Effects

First, PROC GLM reorders the terms to correspond to the order of the variables in the CLASS statement; thus, B*A becomes A*B if A precedes B in the CLASS statement. Then, PROC GLM generates columns for all combinations of levels that occur in the data. The order of the columns is such that the rightmost variables in the cross index faster than the leftmost variables. No columns are generated corresponding to combinations of levels that do not occur in the data.

Data     Design Matrix
                 A          B                       A*B
A  B     \mu   A1  A2   B1  B2  B3   A1B1  A1B2  A1B3  A2B1  A2B2  A2B3
1  1      1     1   0    1   0   0      1     0     0     0     0     0
1  2      1     1   0    0   1   0      0     1     0     0     0     0
1  3      1     1   0    0   0   1      0     0     1     0     0     0
2  1      1     0   1    1   0   0      0     0     0     1     0     0
2  2      1     0   1    0   1   0      0     0     0     0     1     0
2  3      1     0   1    0   0   1      0     0     0     0     0     1

In this matrix, main-effects columns are not linearly independent of crossed-effect columns; in fact, the column space for the crossed effects contains the space of the main effect.
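
The following Python sketch illustrates the crossed-effect rules stated above: one column per combination of levels that occurs in the data, with the rightmost variable indexing fastest. It also checks the dependency just described by recovering the A1 main-effect column from the A*B columns. The function name `crossed_columns` and the unbalanced data set are assumptions for the example.

```python
def crossed_columns(data):
    """Indicator columns for A*B, one per (A, B) combination present in data."""
    # Sorting tuples varies A slowest and B (the rightmost variable) fastest
    cells = sorted(set(data))
    labels = [f"A{a}B{b}" for a, b in cells]
    rows = [[1 if obs == cell else 0 for cell in cells] for obs in data]
    return labels, rows


full = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3)]
labels, rows = crossed_columns(full)
print(labels)

# The A1 main-effect column equals A1B1 + A1B2 + A1B3 in every row
a1_recovered = [
    sum(v for lab, v in zip(labels, row) if lab.startswith("A1")) for row in rows
]
print(a1_recovered)

# Hypothetical unbalanced data: the cell A=2, B=3 never occurs, so it gets no column
partial_labels, _ = crossed_columns(full[:-1])
print(partial_labels)
```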

Nested Effects

Nested effects are generated in the same manner as crossed effects. Hence, the design columns generated by the following statements are the same (but the ordering of the columns is different):

   model y=a b(a);   /* B nested within A */
   model y=a a*b;    /* omitted main effect for B */

The nesting operator in PROC GLM is more a notational convenience than an operation distinct from crossing. Nested effects are characterized by the property that the nested variables never appear as main effects. The order of the variables within nesting parentheses is made to correspond to the order of these variables in the CLASS statement. The order of the columns is such that variables outside the parentheses index faster than those inside the parentheses, and the rightmost nested variables index faster than the leftmost variables.

Data     Design Matrix
                 A                    B(A)
A  B     \mu   A1  A2   B1A1  B2A1  B3A1  B1A2  B2A2  B3A2
1  1      1     1   0      1     0     0     0     0     0
1  2      1     1   0      0     1     0     0     0     0
1  3      1     1   0      0     0     1     0     0     0
2  1      1     0   1      0     0     0     1     0     0
2  2      1     0   1      0     0     0     0     1     0
2  3      1     0   1      0     0     0     0     0     1
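
To see why nesting and crossing yield the same columns while the column order can differ, consider this Python sketch. It builds the B(A) columns using the rule that the variable outside the parentheses indexes faster, and compares them with a crossed effect under the assumption, made only for this example, that the CLASS statement lists B before A, so that the cross is titled B*A and A indexes fastest.

```python
# (A, B) pairs from the table above
data = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3)]


def cell_columns(data, cells):
    """One indicator column per cell, each stored as a tuple over observations."""
    return [tuple(1 if obs == cell else 0 for obs in data) for cell in cells]


# B(A): B is outside the parentheses, so B indexes faster than A
nested_cells = sorted(set(data), key=lambda ab: (ab[0], ab[1]))
nested_labels = [f"B{b}A{a}" for a, b in nested_cells]
nested_cols = cell_columns(data, nested_cells)

# Assumed alternative: CLASS B A, so the cross is B*A and A (rightmost) indexes fastest
crossed_cells = sorted(set(data), key=lambda ab: (ab[1], ab[0]))
crossed_labels = [f"B{b}A{a}" for a, b in crossed_cells]
crossed_cols = cell_columns(data, crossed_cells)

print(nested_labels)
print(crossed_labels)
print(set(nested_cols) == set(crossed_cols))  # same columns
print(nested_cols == crossed_cols)            # different ordering
```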

Continuous-Nesting-Class Effects

When a continuous variable nests with a class variable, the design columns are constructed by multiplying the continuous values into the design columns for the class effect.

Data     Design Matrix
                 A          X(A)
X   A    \mu   A1  A2   X(A1)  X(A2)
21  1     1     1   0      21      0
24  1     1     1   0      24      0
22  1     1     1   0      22      0
28  2     1     0   1       0     28
19  2     1     0   1       0     19
23  2     1     0   1       0     23

This model estimates a separate slope for X within each level of A.
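
A minimal Python sketch of this construction, using the X and A values from the table: each X(A) column is the continuous value multiplied by the corresponding class indicator. The function name `nested_slope_row` is hypothetical.

```python
# (X, A) pairs from the table above
data = [(21, 1), (24, 1), (22, 1), (28, 2), (19, 2), (23, 2)]
a_levels = sorted({a for _, a in data})


def nested_slope_row(x, a, levels):
    """Row of the design matrix for the model with effects A and X(A)."""
    indicators = [1 if a == lev else 0 for lev in levels]
    # Multiply the continuous value into each class indicator column
    slopes = [x * ind for ind in indicators]
    return [1] + indicators + slopes


X = [nested_slope_row(x, a, a_levels) for x, a in data]
for row in X:
    print(row)
```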

Continuous-by-Class Effects

Continuous-by-class effects generate the same design columns as continuous-nesting-class effects. The two models differ by the presence of the continuous variable as a regressor by itself, in addition to being a contributor to X*A.

Data     Design Matrix
                      A         X*A
X   A    \mu    X   A1  A2   X*A1  X*A2
21  1     1    21    1   0     21     0
24  1     1    24    1   0     24     0
22  1     1    22    1   0     22     0
28  2     1    28    0   1      0    28
19  2     1    19    0   1      0    19
23  2     1    23    0   1      0    23

Continuous-by-class effects are used to test the homogeneity of slopes. If the continuous-by-class effect is nonsignificant, the effect can be removed so that the response with respect to X is the same for all levels of the class variables.
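
The Python sketch below builds the rows of the table above and shows that the X column is the sum of the X*A columns. Once X is in the model, the X*A columns therefore carry only the differences among the per-level slopes, which is what the homogeneity-of-slopes test examines. The function name `by_class_row` is hypothetical.

```python
# (X, A) pairs from the table above
data = [(21, 1), (24, 1), (22, 1), (28, 2), (19, 2), (23, 2)]
a_levels = sorted({a for _, a in data})


def by_class_row(x, a, levels):
    """Row of the design matrix for the model with effects X, A, and X*A."""
    indicators = [1 if a == lev else 0 for lev in levels]
    return [1, x] + indicators + [x * ind for ind in indicators]


X = [by_class_row(x, a, a_levels) for x, a in data]
for row in X:
    print(row)

# In every row, X (column 1) equals X*A1 + X*A2 (columns 4 and 5)
x_is_sum = all(row[1] == row[4] + row[5] for row in X)
print(x_is_sum)
```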

General Effects

An example that combines all the effects is

X1*X2*A*B*C(D E)

The continuous list comes first, followed by the crossed list, followed by the nested list in parentheses.

The sequencing of parameters is important to learn if you use the CONTRAST or ESTIMATE statement to compute or test some linear function of the parameter estimates.

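Effects may be retitled by PROC GLM to correspond to ordering rules. For example, B*A(E D) may be retitled A*B(D E) to satisfy two rules: class variables outside the parentheses (crossed effects) are sorted in the order in which they appear in the CLASS statement, and variables within the parentheses (nested effects) are sorted in the same way.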

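The sequencing of the parameters generated by an effect can be described by which variables have their levels indexed faster: variables in the crossed list index faster than variables in the nested list, and within a crossed or nested list, variables to the right index faster than variables to the left.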

For example, suppose a model includes four effects (A, B, C, and D), each having two levels, 1 and 2. If the CLASS statement is

   class A B C D;

then the order of the parameters for the effect B*A(C D), which is retitled A*B(C D), is as follows.

A1 B1 C1 D1
A1 B2 C1 D1
A2 B1 C1 D1
A2 B2 C1 D1
A1 B1 C1 D2
A1 B2 C1 D2
A2 B1 C1 D2
A2 B2 C1 D2
A1 B1 C2 D1
A1 B2 C2 D1
A2 B1 C2 D1
A2 B2 C2 D1
A1 B1 C2 D2
A1 B2 C2 D2
A2 B1 C2 D2
A2 B2 C2 D2

Note that first the crossed effects B and A are sorted in the order in which they appear in the CLASS statement so that A precedes B in the parameter list. Then, for each combination of the nested effects in turn, combinations of A and B appear. The B effect changes fastest because it is rightmost in the (renamed) cross list. Then A changes next fastest. The D effect changes next fastest, and C is the slowest since it is leftmost in the nested list.
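
The ordering rules can be checked with a short Python sketch that regenerates the list above. The function name `parameter_order` and its arguments are assumptions for this example; it sorts each list into CLASS statement order and then lets the nested variables vary slowest and the rightmost crossed variable vary fastest.

```python
from itertools import product

class_order = ["A", "B", "C", "D"]  # order in the CLASS statement
levels = {"A": [1, 2], "B": [1, 2], "C": [1, 2], "D": [1, 2]}


def parameter_order(crossed, nested):
    """List parameter labels for crossed(nested) in their column order."""
    crossed = sorted(crossed, key=class_order.index)
    nested = sorted(nested, key=class_order.index)
    # itertools.product varies its last argument fastest, so arrange the
    # variables from slowest to fastest: nested list first, then crossed list.
    slow_to_fast = nested + crossed
    labels = []
    for combo in product(*(levels[v] for v in slow_to_fast)):
        value = dict(zip(slow_to_fast, combo))
        labels.append(" ".join(f"{v}{value[v]}" for v in crossed + nested))
    return labels


order = parameter_order(crossed=["B", "A"], nested=["C", "D"])
for label in order:
    print(label)
```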

When numeric class variables are used, their levels are sorted by their character format, which may not correspond to their numeric sort sequence. Therefore, it is advisable to include a format for numeric class variables or to use the ORDER=INTERNAL option in the PROC GLM statement to ensure that levels are sorted by their internal values.
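
The difference between the two sort sequences can be illustrated with a small Python sketch. The format mapping below is a hypothetical user-defined format, and the code mimics only the idea of sorting by formatted value versus internal value; it does not reproduce how SAS formats numbers.

```python
levels = [1, 2, 3]

# Hypothetical format assigned to the numeric class variable
fmt = {1: "Low", 2: "Medium", 3: "High"}

# Sorting by the formatted (character) value
by_formatted = sorted(levels, key=lambda v: fmt[v])

# Sorting by the internal (numeric) value
by_internal = sorted(levels)

print([fmt[v] for v in by_formatted])
print([fmt[v] for v in by_internal])
```

In this sketch the formatted order places level 3 ("High") first, so the first design column for the effect would correspond to a different level than under the internal order.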

Degrees of Freedom

For models with classification (categorical) effects, there are more design columns constructed than there are degrees of freedom for the effect. Thus, there are linear dependencies among the columns. In this event, the parameters are not jointly estimable; there is an infinite number of least-squares solutions. The GLM procedure uses a generalized (g2) inverse to obtain values for the estimates; see the "Computational Method" section for more details. The solution values are not produced unless the SOLUTION option is specified in the MODEL statement. The solution has the characteristic that an estimate is zero whenever the design column for that parameter is a linear combination of previous columns. (Strictly speaking, the solution values should not be called estimates, since the parameters may not be formally estimable.) With this full parameterization, hypothesis tests are constructed to test linear functions of the parameters that are estimable.

Other procedures (such as the CATMOD procedure) reparameterize models to full rank using certain restrictions on the parameters. PROC GLM does not reparameterize, making the hypotheses that are commonly tested more understandable. See Goodnight (1978) for additional reasons for not reparameterizing.

PROC GLM does not actually construct the entire design matrix X; rather, a row x_i of X is constructed for each observation in the data set and used to accumulate the crossproduct matrix X'X = \sum_i x_i'x_i.
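
The accumulation X'X = \sum_i x_i'x_i can be sketched in Python as a running sum of outer products, so that only one row is needed at a time. This is a plain illustration of the formula, not PROC GLM's code; the function name `accumulate_crossproducts` is hypothetical, and the rows are those of the main-effects design matrix shown earlier.

```python
def accumulate_crossproducts(rows, p):
    """Accumulate X'X from an iterable of design rows, each of length p."""
    xtx = [[0] * p for _ in range(p)]
    for x in rows:  # one observation at a time; X itself is never stored
        for j in range(p):
            for k in range(p):
                xtx[j][k] += x[j] * x[k]
    return xtx


# Rows (intercept, A1, A2, B1, B2, B3) for the six observations
design_rows = [
    [1, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 1],
    [1, 0, 1, 1, 0, 0],
    [1, 0, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]

# Passing an iterator emphasizes that rows are consumed as they arrive
xtx = accumulate_crossproducts(iter(design_rows), 6)
for row in xtx:
    print(row)
```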


Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.