However, we are often interested in testing whether a dependent variable
(y) is related to more than one independent variable (e.g. x1,
x2, x3).
We could perform regressions based on the following models:
y = ß0 + ß1x1
+ e
y = ß0 + ß2x2
+ e
y = ß0 + ß3x3
+ e
And indeed, this is commonly done. However it is possible that the
independent variables could obscure each other's effects. For example,
an animal's mass could be a function of both age and diet. The age effect
might override the diet effect, leading to a regression for diet which
would not appear very interesting.
One possible solution is to perform a regression with one independent variable, and then test whether a second independent variable is related to the residuals from this regression. You continue with a third variable, etc. A problem with this is that you are putting some variables in privileged positions.
A multiple regression allows the simultaneous testing and modelling of multiple independent variables. (Note: multiple regression is still not considered a "multivariate" test because there is only one dependent variable).
The model for a multiple regression takes the form:
y = ß0 + ß1x1
+ ß2x2 + ß3x3
+ ..... + e
And we wish to estimate the ß0, ß1,
ß2, etc. by obtaining
^
y1 = b0 + b1x1
+ b2x2 + b3x3
+ .....
The b's are termed the "regression coefficients". Instead of fitting a line to data, we are now fitting a plane (for 2 independent variables), a space (for 3 independent variables), etc.
The estimation can still be done according the principles of linear
least squares.
The formulae for the solution (i.e. finding all the b's) are
UGLY. However, the matrix solution is elegant:
As with simple regression, the y-intercept disappears if all variables are standardized (see Statistics) .
Consider the model:
y = ß0 + ß1x1
+ ß2x2 + ß3x3
+ ..... + e
Since y is a combination of linear functions, it is termed a
linear combination of the x's. The following
models are not linear combinations of the x's:
y = ß0 + ß1/x1
+ ß2x22 + e
y = exp(ß0 + ß1x1
+ ß2x2 + ß3x3
+ e)
But you can still use multiple regression if you transform variables.
For the first example, create two new variables:
x1' = 1/x1 and x2'
= x22
For the second example, take the logarithm of both sides:
log(y) = ß0 + ß1x1+
ß2x2 + ß3x3
+ e
There are some models which cannot be "linearizable", and hence linear
regression cannot be used, e.g.:
y = (ß0 - ß1x1)/3x2
+ e
Or, subtly,
y = exp(ß0 + ß1x1
+ ß2x2 + ß3x3)
+ e
These must be solved with nonlinear regression techniques. Unfortunately,
it is hard to find the solution to such nonlinear equations if there are
many parameters.
Note that:
y = ax3 + bx2 +
cx + d + e
can be expressed as:
y = ß0 + ß1x1+
ß2x2 + ß3x3
+ e
if x1 = x1, x2 = x2, x3 = x3
So polynomial regression is considered a special case of linear regression.
This is handy, because even if polynomials do not represent the true
model, they take a variety of forms, and may be close enough for a variety
of purposes.
Fitting a response surface is often useful:
y = ß0 + ß1x1+
ß2x12 + ß3x2
+ ß4x22 + ß4x1x2
+ e
This can fit simple ridges, peaks, valleys, pits, slopes, and saddles.
The significance tests are conditional: This means given all the other variables are in the model. The null hypothesis is: "This independent variable does not explain any of the variation in y, beyond the variation explained by the other variables". Therefore, an independent variable which is quite redundant with other independent variables is not likely to be significant.
Sometimes, an ANOVA table is included.
The following is an example SYSTAT output of a multiple regression:
Dep Var: LOGSP N: 1412 Multiple R: 0.7565 Squared multiple R: 0.5723 Adjusted squared multiple R: 0.5717 Standard error of estimate: 0.2794 Effect Coefficient Std Error Std Coef Tolerance t P(2 Tail) CONSTANT 2.4442 0.0345 0.0 . 70.8829 0.0000 LOGAREA 0.1691 0.0040 0.7652 0.9496 42.7987 0.0000 LAT -0.0136 0.0008 -0.2993 0.9496 -16.7381 0.0000 Analysis of Variance Source Sum-of-Squares DF Mean-Square F-Ratio P Regression 147.1326 2 73.5663 942.6110 0.0000 Residual 109.9657 1409 0.0780
Plant species richness is often correlated with soil pH, and it is often strongly correlated with soil calcium. But since soil pH and soil calcium are strongly related to each other, neither explains significantly more variation than the other.
This is called the problem of multicollinearity (though whether it is a 'problem', or something that yields new insight, is a matter of perspective).
It is also possible that nonsignificant patterns in simple regression
become significant in multiple regression, e.g. the effect of age and diet
on animal size.
Suppose you set a cutoff of p=0.05. If H0
is always true, then you would reject it 5% of the time. But
if you had two independent tests, you would falsely reject at least one
H0
1-(1-.05)2 = 0.0975, or almost 10% of the time.
If you had 20 independent tests, you would falsely reject at least one
H0
1-(1-.05)20 = 0.6415, or almost 2/3 of the time.
There are ways to adjust for the problem of multiple comparison, the
most famous being the Bonferroni test and the Scheffe test. But the Bonferroni
test is very conservative, and the Scheffe test is often difficult to implement.
For the Bonferroni test, you simply multiply each observed p-value
by the number of tests you perform.
Holm's method for correcting for multiple comparisons is less well-known, and is also less conservative (see Legendre and Legendre, p. 18).
Examples: Suppose you perform an experiment in which tadpoles are raised at different temperatures, and you wish to study adult frog size. You might want to "factor out" the effects of tadpole mass.
In the invertebrate species richness example, Species Richness is related
to area, but everyone knows that. If we are interested in fertilization
effects, it might be justifiable to "cancel out" the effects of lake area.
Often, you don't really care about statistical inference, but would really like a regression model that fits the data well. However, a model such as:
y = ß0 + ß1x1 + ß2x2 + ß3x3 + ß4x4 + ß5x5 + ß6x6 + ß7x7 + ß8x8 + ß9x9 + ß10x10 + ß11x11 + ß12x12 + ß13x13 + ß14x14 + ß14x14 + e
Is far too ungainly to use! It might be much more useful to choose a subset of the independent variables which "best" explains the dependent variable.
There are three basic approaches:
1) Forward Selection
Although stepwise methods can find meaningful patterns in data, it is also notorious for finding false patterns. If you doubt this, then try running a stepwise procedure using only random numbers. If you include enough variables, you will almost invariably find 'significant' results.
To
the ordination web page