
Introduction to Regression Procedures

In most applications, regression models are merely useful approximations. Reality is often so complicated that you cannot know what the true model is. You may have to choose a model more on the basis of what variables can be measured and what kinds of models can be estimated than on a rigorous theory that explains how the universe really works. However, even in cases where theory is lacking, a regression model may be an excellent predictor of the response if the model is carefully formulated from a large sample. The interpretation of statistics such as parameter estimates may nevertheless be highly problematical.

Statisticians usually use the word
"prediction" in a technical sense.
*Prediction* in this sense does not refer to
"predicting the future" (statisticians call that
*forecasting*) but rather to guessing the response
from the values of the regressors in an observation
taken under the same circumstances as the sample
from which the regression equation was estimated.
If you developed a regression model for predicting
consumer preferences in 1958, it may not give very good
predictions in 1988 no matter how well it did in 1958.
If it is the future you want to predict, your model must
include whatever relevant factors may change over time.
If the process you are studying does in fact change over time, you
must take observations at several, perhaps many, different times.
Analysis of such data is the province of SAS/ETS
procedures such as AUTOREG and STATESPACE.
Refer to the *SAS/ETS User's Guide*
for more information on these procedures.

The comments in the rest of this section are directed toward linear least-squares regression. Nonlinear regression and non-least-squares regression often introduce further complications.

For more detailed discussions of the interpretation of regression statistics, see Darlington (1968), Mosteller and Tukey (1977), Weisberg (1985), and Younger (1979).

If the nonstatistical aspects of the experiment are also treated with sufficient care (including such things as use of placebos and double blinds), then you can state conclusions in causal terms; that is, this change in a regressor causes that change in the response. Causality can never be inferred from statistical results alone or from an observational study.

If the model that you fit is not the true model, then the parameter estimates may depend strongly on the particular values of the regressors used in the experiment. For example, if the response is actually a quadratic function of a regressor but you fit a linear function, the estimated slope may be a large negative value if you use only small values of the regressor, a large positive value if you use only large values of the regressor, or near zero if you use both large and small regressor values. When you report the results of an experiment, it is important to include the values of the regressors. It is also important to avoid extrapolating the regression equation outside the range of regressors in the sample.
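This range dependence is easy to see numerically. The following sketch (in NumPy rather than a SAS procedure; the quadratic response (x - 5)² and the three sampling ranges are invented for illustration) fits a straight line to an exactly quadratic response:

```python
import numpy as np

def fitted_slope(x):
    """Slope from an ordinary least-squares fit of y = b0 + b1*x,
    where the true response is the quadratic y = (x - 5)**2."""
    y = (x - 5.0) ** 2                          # true model is quadratic, not linear
    X = np.column_stack([np.ones_like(x), x])   # design matrix: intercept + x
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

small = fitted_slope(np.linspace(0.0, 4.0, 50))    # only small regressor values
large = fitted_slope(np.linspace(6.0, 10.0, 50))   # only large regressor values
both  = fitted_slope(np.linspace(0.0, 10.0, 50))   # both small and large values

print(small, large, both)   # negative, positive, and near-zero slopes
```

The same response surface yields a negative, a positive, or a near-zero estimated slope depending solely on where the regressor was observed.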

If you conduct an observational study and if you do not know the true form of the model, interpretation of parameter estimates becomes even more convoluted. A coefficient must then be interpreted as an average over the sampled population of expected differences in response of observations that differ by one unit on only one regressor. The considerations that are discussed under controlled experiments for which the true model is not known also apply.

Sometimes standardized regression coefficients are used to compare the effects of regressors measured in different units. Standardizing the variables effectively makes the standard deviation the unit of measurement. This makes sense only if the standard deviation is a meaningful quantity, which usually is the case only if the observations are sampled from a well-defined population. In a controlled experiment, the standard deviation of a regressor depends on the values of the regressor selected by the experimenter. Thus, you can make a standardized regression coefficient large by using a large range of values for the regressor.
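A sketch of this effect (again NumPy; the true slope of 2, the noise level, the two ranges, and the seed are all invented): the standardized coefficient for the same underlying process grows as the experimenter widens the regressor's range.

```python
import numpy as np

rng = np.random.default_rng(0)

def std_coef(x):
    """Standardized slope from a simple regression of y on x,
    where y = 2*x + noise: the raw slope times sd(x)/sd(y)."""
    y = 2.0 * x + rng.normal(scale=5.0, size=x.size)
    slope = np.polyfit(x, y, 1)[0]              # raw least-squares slope
    return slope * x.std(ddof=1) / y.std(ddof=1)

narrow = std_coef(np.linspace(0.0, 1.0, 200))   # narrow experimental range
wide   = std_coef(np.linspace(0.0, 50.0, 200))  # same process, wider range

print(narrow, wide)   # the wider range gives a much larger standardized coefficient
```

The underlying relationship (a slope of 2) is identical in both runs; only the experimenter's choice of range changes.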

In some applications you may be able to compare regression coefficients in terms of the practical range of variation of a regressor. Suppose that each independent variable in an industrial process can be set to values only within a certain range. You can rescale the variables so that the smallest possible value is zero and the largest possible value is one. Then the unit of measurement for each regressor is the maximum possible range of the regressor, and the parameter estimates are comparable in that sense. Another possibility is to scale the regressors in terms of the cost of setting a regressor to a particular value, so comparisons can be made in monetary terms.
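A minimal sketch of the unit-range rescaling (NumPy; the process variables, their allowable ranges, and the response coefficients are hypothetical):

```python
import numpy as np

def unit_range(x):
    """Rescale a regressor so its observed range becomes [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

# hypothetical process settings: temperature in [150, 200], pressure in [1, 5]
temp = np.array([150.0, 200.0, 160.0, 190.0, 175.0])
pres = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 0.1 * temp + 2.0 * pres                    # assumed true response

X = np.column_stack([np.ones(5), unit_range(temp), unit_range(pres)])
b = np.linalg.lstsq(X, y, rcond=None)[0]
print(b[1], b[2])   # full-range effects: 0.1*50 = 5 and 2.0*4 = 8
```

Comparing b[1] = 5 with b[2] = 8 now says something directly useful: moving pressure across its full allowable range changes the response more than moving temperature across its full range, even though the raw coefficient on temperature (0.1) is the smaller number only because of its units.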

If the regressors are correlated, it becomes difficult to disentangle the effects of one regressor from another, and the parameter estimates may be highly dependent on which regressors are used in the model. Two correlated regressors may be nonsignificant when tested separately but highly significant when considered together. If two regressors have a correlation of 1.0, it is impossible to separate their effects.
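The instability can be sketched numerically (NumPy; the correlation structure, coefficients, noise level, and seed are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=n)   # x2 strongly correlated with x1
y = x1 + x2 + 0.3 * rng.normal(size=n)      # both regressors truly matter

def coef_of_x1(*regressors):
    """Coefficient on x1 from a least-squares fit with the given regressors."""
    X = np.column_stack([np.ones(n), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

alone    = coef_of_x1(x1)        # x1 alone absorbs x2's contribution: near 1.95
together = coef_of_x1(x1, x2)    # with x2 included, x1's coefficient falls near 1

print(alone, together)
```

The estimate for x1 roughly halves when x2 enters the model, even though the data-generating process never changed.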

It may be possible to recode correlated regressors to make interpretation easier. For example, if *X* and *Y* are highly correlated, they can be replaced in a linear regression by *X*+*Y* and *X*-*Y* without changing the fit of the model; if *X* and *Y* have equal variances, the recoded regressors are uncorrelated.
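A quick numerical check of this kind of recoding (NumPy; the data and correlation level are invented). If the two regressors are first standardized so that they have equal sample variances, their sum and difference are uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(2)

def standardize(v):
    return (v - v.mean()) / v.std(ddof=1)

x = rng.normal(size=1000)
y = x + 0.2 * rng.normal(size=1000)       # y strongly correlated with x

xs, ys = standardize(x), standardize(y)   # equalize the sample variances
s, d = xs + ys, xs - ys                   # recoded regressors: sum and difference

print(np.corrcoef(x, y)[0, 1])            # near 1: badly entangled
print(np.corrcoef(s, d)[0, 1])            # essentially 0: disentangled
```

Because (s, d) is an invertible linear recoding of (xs, ys), a regression on the recoded pair produces exactly the same fitted values; only the interpretation of the individual coefficients changes.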

The *p*-values are always approximations. The assumptions required to compute exact *p*-values are never satisfied in practice, so reported *p*-values should be treated as rough guides rather than precise probabilities.

*R*² is easiest to interpret when the observations, including the values of both the regressors and response, are randomly sampled from a well-defined population. Nonrandom sampling can greatly distort *R*².

In a controlled experiment, *R*² depends on the values chosen for the regressors. A wide range of regressor values generally yields a larger *R*² than a narrow range.

Whether a given *R*² value is considered to be large or small depends on the context of the particular study. A social scientist analyzing noisy survey data might consider an *R*² of 0.30 to be large, while an engineer modeling a tightly controlled physical process might consider the same value alarmingly small.

You can always get an *R*² arbitrarily close to 1.0 by including a large number of completely unrelated regressors in the equation. If the number of regressors is close to the sample size, *R*² is very biased; with as many fitted parameters as observations, *R*² reaches 1.0 even when the regressors have nothing to do with the response.
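This is easy to demonstrate (NumPy; the sample size and seed are arbitrary, and every regressor is pure noise):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
y = rng.normal(size=n)            # response unrelated to every regressor

def r_squared(k):
    """R^2 from regressing y on k columns of pure noise plus an intercept."""
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    total = y - y.mean()
    return 1.0 - (resid @ resid) / (total @ total)

low, high = r_squared(2), r_squared(29)
print(low)    # small: a couple of junk regressors explain little
print(high)   # with 29 junk regressors and n = 30, the fit is perfect
```

With 29 noise regressors plus an intercept, the design matrix is square and (almost surely) invertible, so the residuals vanish and *R*² equals 1.0 despite the complete absence of any real relationship.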

If you fit many different models and choose the model with the largest *R*², all the statistics are biased and the *p*-values for the parameter estimates are not valid.


Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.