
# A Module for Linear Regression

The previous method may be more familiar to statisticians when different notation is used. A linear model is usually written

y = Xb + e
where y is the vector of responses, X is the design matrix, and b is a vector of unknown parameters estimated by minimizing the sum of squares of e, the error or residual.
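For reference, the estimate used below follows from minimizing the error sum of squares: setting the derivative of e'e = (y - Xb)'(y - Xb) with respect to b to zero yields the normal equations

X'X b = X'y

so that, when X'X is nonsingular, b = inv(X'X)X'y. This is the traditional formula computed later in this section.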

The following example illustrates the programming techniques involved in performing linear regression. It is not meant to replace regression procedures such as the REG procedure, which are more efficient for regressions and offer a multitude of diagnostic options.

Suppose that you have response data y measured at five values of the independent variable x and you want to perform a quadratic regression.

Submit the PROC IML statement to begin the procedure.

```
> proc iml;
```
Input the design matrix X and the data vector y as matrix literals.
```
> x={1 1 1,
>    1 2 4,
>    1 3 9,
>    1 4 16,
>    1 5 25};

X           5 rows      3 cols    (numeric)

1         1         1
1         2         4
1         3         9
1         4        16
1         5        25

> y={1,5,9,23,36};

Y           5 rows      1 col     (numeric)

1
5
9
23
36
```
Compute the least-squares estimate of b using the traditional formula.
```
> b=inv(x`*x)*x`*y;

B             3 rows      1 col     (numeric)

2.4
-3.2
2
```

The predicted values are simply the X matrix multiplied by the parameter estimates, and the residuals are the difference between actual and predicted y.

```
> yhat=x*b;

YHAT          5 rows      1 col     (numeric)

1.2
4
10.8
21.6
36.4

> r=y-yhat;

R           5 rows      1 col     (numeric)

-0.2
1
-1.8
1.4
-0.4
```
To estimate the variance of the responses, calculate the sum of squared errors (SSE), its degrees of freedom (DFE), and the mean squared error (MSE). Note that in computing the degrees of freedom, you use the NROW and NCOL functions to return the number of rows and columns of X.
```
> sse=ssq(r);

SSE           1 row       1 col     (numeric)

6.4

> dfe=nrow(x)-ncol(x);

DFE           1 row       1 col     (numeric)

2

> mse=sse/dfe;

MSE           1 row       1 col     (numeric)

3.2
```
Notice that each calculation has required one simple line of code.
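If you want to check these numbers outside SAS/IML, the same chain of calculations can be sketched in NumPy (NumPy is an assumption here, not part of the original session):

```python
import numpy as np

# Design matrix and response vector from the example above
x = np.array([[1, 1, 1],
              [1, 2, 4],
              [1, 3, 9],
              [1, 4, 16],
              [1, 5, 25]], dtype=float)
y = np.array([1, 5, 9, 23, 36], dtype=float)

b = np.linalg.inv(x.T @ x) @ (x.T @ y)   # b = inv(X'X)X'y
yhat = x @ b                             # predicted values
r = y - yhat                             # residuals
sse = np.sum(r ** 2)                     # sum of squared errors
dfe = x.shape[0] - x.shape[1]            # error degrees of freedom
mse = sse / dfe                          # mean squared error

print(b)              # approximately [ 2.4 -3.2  2. ]
print(sse, dfe, mse)  # approximately 6.4 2 3.2
```

(In production Python code, `numpy.linalg.lstsq` is preferred over forming inv(X'X) explicitly, for the same numerical reasons the text gives for PROC REG.)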

Now suppose you want to solve the problem repeatedly on new data sets without reentering the code. To do this, define a module (or subroutine). Modules begin with a START statement and end with a FINISH statement, with the program statements in between. The following statements define a module named REGRESS to perform linear regression.

```
> start regress;                  /* begin module */
>    xpxi=inv(t(x)*x);         /* inverse of X'X          */
>    beta=xpxi*(t(x)*y);       /* parameter estimate      */
>    yhat=x*beta;              /* predicted values        */
>    resid=y-yhat;                 /* residuals           */
>    sse=ssq(resid);               /* SSE                 */
>    n=nrow(x);                    /* sample size         */
>    dfe=nrow(x)-ncol(x);          /* error DF            */
>    mse=sse/dfe;                  /* MSE                 */
>    cssy=ssq(y-sum(y)/n);         /* corrected total SS  */
>    rsquare=(cssy-sse)/cssy;      /* RSQUARE             */
>    print,"Regression Results",
>       sse dfe mse rsquare;
>    stdb=sqrt(vecdiag(xpxi)*mse); /* std of estimates    */
>    t=beta/stdb;                  /* parameter t-tests   */
>    prob=1-probf(t#t,1,dfe);      /* p-values            */
>    print,"Parameter Estimates",,
>       beta stdb t prob;
>    print,y yhat resid;
> finish regress;                  /* end module          */
```
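For readers following along outside IML, here is a rough Python analogue of the REGRESS module (a NumPy-only sketch; the PROBF p-values are omitted, since computing them would require an F-distribution routine such as SciPy's):

```python
import numpy as np

def regress(x, y):
    """Sketch of the REGRESS module in NumPy (no p-values)."""
    xpxi = np.linalg.inv(x.T @ x)           # inverse of X'X
    beta = xpxi @ (x.T @ y)                 # parameter estimates
    yhat = x @ beta                         # predicted values
    resid = y - yhat                        # residuals
    sse = np.sum(resid ** 2)                # SSE
    n = x.shape[0]                          # sample size
    dfe = n - x.shape[1]                    # error DF
    mse = sse / dfe                         # MSE
    cssy = np.sum((y - y.sum() / n) ** 2)   # corrected total SS
    rsquare = (cssy - sse) / cssy           # R-square
    stdb = np.sqrt(np.diag(xpxi) * mse)     # std of estimates
    t = beta / stdb                         # parameter t-tests
    return beta, stdb, t, sse, dfe, mse, rsquare

x = np.array([[1, 1, 1], [1, 2, 4], [1, 3, 9],
              [1, 4, 16], [1, 5, 25]], dtype=float)
y = np.array([1, 5, 9, 23, 36], dtype=float)
beta, stdb, t, sse, dfe, mse, rsquare = regress(x, y)
print(rsquare)   # approximately 0.9923518
```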

Submit the module REGRESS for execution.

```
> reset noprint;
> run regress;                     /* execute module      */

Regression Results

SSE       DFE       MSE    RSQUARE
6.4         2       3.2  0.9923518

Parameter Estimates

     BETA       STDB          T       PROB
      2.4  3.8366652  0.6255432  0.5954801
     -3.2  2.9237940  -1.094468  0.3879690
        2  0.4780914  4.1833001  0.0526691

Y      YHAT     RESID
1       1.2      -0.2
5         4         1
9      10.8      -1.8
23      21.6       1.4
36      36.4      -0.4
```

At this point, you still have all of the matrices defined if you want to continue calculations. Suppose that you want to correlate the estimates. First, calculate the estimated covariance matrix of the parameter estimates; then, scale the covariance into a correlation matrix with values of 1 on the diagonal.

```
> reset print;             /* turn on auto printing   */
> covb=xpxi*mse;           /* covariance of estimates */

COVB          3 rows      3 cols    (numeric)

    14.72     -10.56        1.6
   -10.56  8.5485714  -1.371429
      1.6  -1.371429  0.2285714

> s=1/sqrt(vecdiag(covb));

S             3 rows      1 col     (numeric)

0.260643
0.3420214
2.0916501

> corrb=diag(s)*covb*diag(s);   /* correlation of estimates */

CORRB         3 rows      3 cols    (numeric)

        1  -0.941376  0.8722784
-0.941376          1  -0.981105
0.8722784  -0.981105          1
```
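This diag-scaling step generalizes: any covariance matrix can be converted to a correlation matrix the same way. A NumPy sketch of the scaling, rebuilding COVB from the quadratic fit above (NumPy again being an assumption):

```python
import numpy as np

# Rebuild COVB = inv(X'X) * MSE for the quadratic fit
x = np.array([[1, 1, 1], [1, 2, 4], [1, 3, 9],
              [1, 4, 16], [1, 5, 25]], dtype=float)
y = np.array([1, 5, 9, 23, 36], dtype=float)
xpxi = np.linalg.inv(x.T @ x)
resid = y - x @ (xpxi @ (x.T @ y))
mse = np.sum(resid ** 2) / (x.shape[0] - x.shape[1])
covb = xpxi * mse                  # covariance of estimates

# Scale covariance to correlation: diag(s) * COVB * diag(s)
s = 1 / np.sqrt(np.diag(covb))     # reciprocal standard errors
corrb = np.diag(s) @ covb @ np.diag(s)
print(np.round(corrb, 6))
```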

Your module REGRESS remains available to do another regression, in this case, an orthogonalized version of the last polynomial example. In general, the columns of X are not orthogonal. You can use the ORPOL function to generate orthogonal polynomials for the regression. Using them provides greater numerical accuracy and reduced computing time. When using orthogonal polynomial regression, you expect the statistics of fit to be the same and the estimates to be more stable and uncorrelated.

To perform an orthogonal regression on the data, first create a vector containing the values of the independent variable x, which is the second column of the design matrix X. Then use the ORPOL function to generate orthogonal second-degree polynomials.

```
> x1={1,2,3,4,5};               /* second column of X */

X1            5 rows      1 col     (numeric)

1
2
3
4
5

> x=orpol(x1,2);    /* generates orthogonal polynomials */

X             5 rows      3 cols    (numeric)

0.4472136  -0.632456  0.5345225
0.4472136  -0.316228  -0.267261
0.4472136          0  -0.534522
0.4472136  0.3162278  -0.267261
0.4472136  0.6324555  0.5345225

> reset noprint;           /* turns off auto printing   */
> run regress;             /* run REGRESS */

Regression Results

SSE       DFE       MSE   RSQUARE
6.4         2       3.2 0.9923518

Parameter Estimates

     BETA       STDB          T       PROB
33.093806  1.7888544       18.5  0.0029091
27.828043  1.7888544  15.556349  0.0041068
7.4833148  1.7888544  4.1833001  0.0526691

Y      YHAT     RESID
1       1.2      -0.2
5         4         1
9      10.8      -1.8
23      21.6       1.4
36      36.4      -0.4

> reset print;
> covb=xpxi*mse;

COVB          3 rows      3 cols    (numeric)

      3.2  -2.73E-17  4.693E-16
-2.73E-17        3.2  -2.18E-15
4.693E-16  -2.18E-15        3.2

> s=1/sqrt(vecdiag(covb));

S             3 rows      1 col     (numeric)

0.559017
0.559017
0.559017

> corrb=diag(s)*covb*diag(s);

CORRB         3 rows      3 cols    (numeric)

        1  -8.54E-18  1.467E-16
-8.54E-18          1   -6.8E-16
1.467E-16   -6.8E-16          1
```
Note that the off-diagonal values are displayed in scientific notation; they are close to zero but not exactly zero because of the imprecision of floating-point arithmetic. To clean up the appearance of the correlation matrix, use the FUZZ option, which prints very small numbers as zero.
```
> reset fuzz;
> corrb=diag(s)*covb*diag(s);

CORRB         3 rows      3 cols    (numeric)

1         0         0
0         1         0
0         0         1
```
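If ORPOL itself is not at hand, an orthonormal polynomial basis with the same column space can be sketched via a QR decomposition of the Vandermonde matrix. The NumPy sketch below is an assumption, not ORPOL's algorithm, and QR sign conventions may differ, hence the explicit sign fix:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
v = np.vander(x1, 3, increasing=True)  # columns: 1, x, x**2
q, r = np.linalg.qr(v)
q = q * np.sign(np.diag(r))            # make leading coefficients positive

y = np.array([1, 5, 9, 23, 36], dtype=float)
beta = q.T @ y                         # X'X = I, so no inverse is needed
sse = np.sum((y - q @ beta) ** 2)
print(round(sse, 10))                  # same fit as before: 6.4
```

Because the columns are orthonormal, X'X is the identity; that is why the estimates from the orthogonal fit are uncorrelated and all share the same standard error.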
