The GLM Procedure
The following classification is due to Hsu (1996). Multiple comparison procedures (MCPs) can be categorized in two ways: by the comparisons they make and by the strength of inference they provide. With respect to which comparisons are made, the GLM procedure offers two types: comparisons between all pairs of means, and comparisons between a control and all other means.
Table 28.3 and Table 28.4 display MCPs available in PROC GLM for all pairwise comparisons and comparisons with a control, respectively, along with associated strength of inference and the syntax (when applicable) for both the MEANS and the LSMEANS statements.
Table 28.3: Multiple Comparisons Procedures for All Pairwise Comparison
| Method | Strength of Inference | MEANS Syntax | LSMEANS Syntax |
| Student's t | Individual | T | PDIFF ADJUST=T |
| Duncan | Individual | DUNCAN | |
| Student-Newman-Keuls | Inhomogeneity | SNK | |
| REGWQ | Inequalities | REGWQ | |
| Tukey-Kramer | Intervals | TUKEY | PDIFF ADJUST=TUKEY |
| Bonferroni | Intervals | BON | PDIFF ADJUST=BON |
| Sidak | Intervals | SIDAK | PDIFF ADJUST=SIDAK |
| Scheffé | Intervals | SCHEFFE | PDIFF ADJUST=SCHEFFE |
| SMM | Intervals | SMM | PDIFF ADJUST=SMM |
| Gabriel | Intervals | GABRIEL | |
| Simulation | Intervals | | PDIFF ADJUST=SIMULATE |
Table 28.4: Multiple Comparisons Procedures for Comparisons with a Control
| Method | Strength of Inference | MEANS Syntax | LSMEANS Syntax |
| Student's t | Individual | | PDIFF=CONTROL ADJUST=T |
| Dunnett | Intervals | DUNNETT | PDIFF=CONTROL ADJUST=DUNNETT |
| Bonferroni | Intervals | | PDIFF=CONTROL ADJUST=BON |
| Sidak | Intervals | | PDIFF=CONTROL ADJUST=SIDAK |
| Scheffé | Intervals | | PDIFF=CONTROL ADJUST=SCHEFFE |
| SMM | Intervals | | PDIFF=CONTROL ADJUST=SMM |
| Simulation | Intervals | | PDIFF=CONTROL ADJUST=SIMULATE |
Note: One-sided Dunnett's tests are also available from the MEANS statement with the DUNNETTL and DUNNETTU options and from the LSMEANS statement with PDIFF=CONTROLL and PDIFF=CONTROLU.
Details of these multiple comparison methods are given in the following sections.


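The simplest approach to multiple comparisons is to do a t test on every pair of means (the T option in the MEANS statement, ADJUST=T in the LSMEANS statement). For the ith and jth means, you can reject the null hypothesis that the population means are equal if

$$\frac{|\bar{y}_i - \bar{y}_j|}{s\sqrt{1/n_i + 1/n_j}} \ge t(\alpha; \nu)$$

where $\bar{y}_i$ and $\bar{y}_j$ are the means, $n_i$ and $n_j$ are the numbers of observations in the two cells, $s$ is the root mean square error based on $\nu$ degrees of freedom, $\alpha$ is the significance level, and $t(\alpha; \nu)$ is the two-tailed critical value from a Student's t distribution.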
There is a problem with repeated t tests, however. Suppose there are ten means and each t test is performed at the 0.05 level. There are 10(10-1)/2=45 pairs of means to compare, each with a 0.05 probability of a type 1 error (a false rejection of the null hypothesis). The chance of making at least one type 1 error is much higher than 0.05. It is difficult to calculate the exact probability, but you can derive a pessimistic approximation by assuming that the comparisons are independent, giving an upper bound to the probability of making at least one type 1 error (the experimentwise error rate) of

$$1 - (1 - 0.05)^{45} \approx 0.90$$
If you decide to control the individual type 1 error rates for each comparison, you are controlling the individual or comparisonwise error rate. On the other hand, if you want to control the overall type 1 error rate for all the comparisons, you are controlling the experimentwise error rate. It is up to you to decide whether to control the comparisonwise error rate or the experimentwise error rate, but there are many situations in which the experimentwise error rate should be held to a small value. Statistical methods for comparing three or more means while controlling the probability of making at least one type 1 error are called multiple comparisons procedures.
It has been suggested that the experimentwise error rate can be held to the $\alpha$ level by performing the overall ANOVA F-test at the $\alpha$ level and making further comparisons only if the F-test is significant, as in Fisher's protected LSD. This assertion is false if there are more than three means (Einot and Gabriel 1975). Consider again the situation with ten means. Suppose that one population mean differs from the others by a sufficiently large amount that the power (probability of correctly rejecting the null hypothesis) of the F-test is near 1 but that all the other population means are equal to each other. There will be 9(9-1)/2=36 t tests of true null hypotheses, with an upper limit of 0.84 on the probability of at least one type 1 error. Thus, you must distinguish between the experimentwise error rate under the complete null hypothesis, in which all population means are equal, and the experimentwise error rate under a partial null hypothesis, in which some means are equal but others differ. The following abbreviations are used in the discussion:
CER: comparisonwise error rate
EERC: experimentwise error rate under the complete null hypothesis
MEER: maximum experimentwise error rate under any complete or partial null hypothesis
These error rates are associated with the different strengths of inference: individual tests control the CER; tests for inhomogeneity of means control the EERC; tests that yield confidence inequalities or confidence intervals control the MEER. A preliminary F-test controls the EERC but not the MEER.
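The two upper bounds quoted above (0.90 for the 45 comparisons among ten means, and 0.84 for the 36 comparisons among nine equal means) follow from the same independence approximation. The following is a minimal sketch of that arithmetic; the function names are illustrative and are not part of PROC GLM.

```python
def n_pairs(k: int) -> int:
    """Number of pairwise comparisons among k means."""
    return k * (k - 1) // 2


def experimentwise_upper_bound(n_true_nulls: int, cer: float = 0.05) -> float:
    """Probability of at least one type 1 error among n_true_nulls tests,
    each run at comparisonwise level cer, treating the tests as independent."""
    return 1.0 - (1.0 - cer) ** n_true_nulls


# Ten means, all pairs compared at the 0.05 level.
bound_ten_means = experimentwise_upper_bound(n_pairs(10))

# Nine equal means (the tenth differs), so 36 true null hypotheses.
bound_nine_equal = experimentwise_upper_bound(n_pairs(9))

print(round(bound_ten_means, 2), round(bound_nine_equal, 2))
```

Because real pairwise comparisons share data and are therefore correlated, these values are approximations under the independence assumption rather than exact error rates.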
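You can control the MEER at the $\alpha$ level by setting the CER to a sufficiently small value. The Bonferroni inequality (Miller 1981) has been widely used for this purpose. If

$$\mathrm{CER} = \frac{\alpha}{c}$$

where $c$ is the total number of comparisons, then the MEER is less than $\alpha$. Bonferroni t tests (the BON option in the MEANS statement, ADJUST=BON in the LSMEANS statement) declare two means to be significantly different if

$$\frac{|\bar{y}_i - \bar{y}_j|}{s\sqrt{1/n_i + 1/n_j}} \ge t(\epsilon; \nu), \qquad \epsilon = \frac{2\alpha}{k(k-1)}$$

for comparison of $k$ means, where $s$ is the root mean square error based on $\nu$ degrees of freedom and $n_i$ and $n_j$ are the cell sizes.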
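Sidak (1967) has provided a tighter bound, showing that

$$\mathrm{CER} = 1 - (1 - \alpha)^{1/c}$$

also ensures that the MEER is no greater than $\alpha$ for any set of $c$ comparisons. For pairwise comparison of $k$ means (the SIDAK option), each t test is therefore performed at level $\epsilon = 1 - (1 - \alpha)^{2/(k(k-1))}$.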
You can use the Bonferroni additive inequality and the Sidak multiplicative inequality to control the MEER for any set of contrasts or other hypothesis tests, not just pairwise comparisons. The Bonferroni inequality can provide simultaneous inferences in any statistical application requiring tests of more than one hypothesis. Other methods discussed in this section for pairwise comparisons can also be adapted for general contrasts (Miller 1981).
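As a rough illustration of the two inequalities, the sketch below computes the comparisonwise level implied by each for a target MEER of $\alpha$ across $c$ comparisons. It assumes the standard forms CER = $\alpha/c$ (Bonferroni) and CER = $1 - (1-\alpha)^{1/c}$ (Sidak); the function names are hypothetical.

```python
def bonferroni_cer(alpha: float, c: int) -> float:
    """Comparisonwise level from the Bonferroni additive inequality."""
    return alpha / c


def sidak_cer(alpha: float, c: int) -> float:
    """Comparisonwise level from the Sidak multiplicative inequality."""
    return 1.0 - (1.0 - alpha) ** (1.0 / c)


k = 10                      # number of means
c = k * (k - 1) // 2        # all pairwise comparisons: 45
alpha = 0.05

bon = bonferroni_cer(alpha, c)
sid = sidak_cer(alpha, c)

# The Sidak level is slightly larger, so each Sidak test is slightly less strict.
print(f"Bonferroni CER = {bon:.6f}, Sidak CER = {sid:.6f}")
```

The difference between the two levels is small, which is why the Sidak tests are only marginally more powerful than the Bonferroni tests in practice.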
Scheffé (1953, 1959) proposes another method to control the MEER for any set of contrasts or other linear hypotheses in the analysis of linear models, including pairwise comparisons, obtained with the SCHEFFE option. Two means are declared significantly different if

$$\frac{|\bar{y}_i - \bar{y}_j|}{s\sqrt{1/n_i + 1/n_j}} \ge \sqrt{(k-1)\,F(\alpha; k-1, \nu)}$$

where $k$ is the number of means, $s$ is the root mean square error based on $\nu$ degrees of freedom, $n_i$ and $n_j$ are the cell sizes, and $F(\alpha; k-1, \nu)$ is the $\alpha$-level critical value of an F distribution with $k-1$ numerator degrees of freedom and $\nu$ denominator degrees of freedom.
Scheffé's test is compatible with the overall ANOVA F-test in that Scheffé's method never declares a contrast significant if the overall F-test is nonsignificant. Most other multiple comparison methods can find significant contrasts when the overall F-test is nonsignificant and, therefore, suffer a loss of power when used with a preliminary F-test.
Scheffé's method may be more powerful than the Bonferroni or Sidak methods if the number of comparisons is large relative to the number of means. For pairwise comparisons, Sidak t tests are generally more powerful.
Tukey (1952, 1953) proposes a test designed specifically for pairwise comparisons based on the studentized range, sometimes called the "honestly significant difference test," that controls the MEER when the sample sizes are equal. Tukey (1953) and Kramer (1956) independently propose a modification for unequal cell sizes. The Tukey or Tukey-Kramer method is provided by the TUKEY option in the MEANS statement and the ADJUST=TUKEY option in the LSMEANS statement. This method has fared extremely well in Monte Carlo studies (Dunnett 1980). In addition, Hayter (1984) gives a proof that the Tukey-Kramer procedure controls the MEER for means comparisons, and Hayter (1989) describes the extent to which the Tukey-Kramer procedure has been proven to control the MEER for LS-means comparisons.
The Tukey-Kramer method is more powerful than the Bonferroni, Sidak, or Scheffé methods for pairwise comparisons. Two means are considered significantly different by the Tukey-Kramer criterion if

$$\frac{|\bar{y}_i - \bar{y}_j|}{s\sqrt{(1/n_i + 1/n_j)/2}} \ge q(\alpha; k, \nu)$$

where $q(\alpha; k, \nu)$ is the $\alpha$-level critical value of a studentized range distribution of $k$ independent normal random variables with $\nu$ degrees of freedom, $s$ is the root mean square error, and $n_i$ and $n_j$ are the cell sizes.
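Hochberg (1974) devised a method (the GT2 or SMM option) similar to Tukey's, but it uses the studentized maximum modulus instead of the studentized range and employs Sidak's (1967) uncorrelated t inequality. It is proven to hold the MEER at a level not exceeding $\alpha$ with unequal sample sizes. It is generally less powerful than the Tukey-Kramer method and always less powerful than Tukey's test for equal cell sizes. Two means are declared significantly different if

$$\frac{|\bar{y}_i - \bar{y}_j|}{s\sqrt{1/n_i + 1/n_j}} \ge m(\alpha; c, \nu)$$

where $m(\alpha; c, \nu)$ is the $\alpha$-level critical value of the studentized maximum modulus distribution of $c$ independent normal random variables with $\nu$ degrees of freedom, $c = k(k-1)/2$ is the number of pairwise comparisons among $k$ means, $s$ is the root mean square error, and $n_i$ and $n_j$ are the cell sizes.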
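Gabriel (1978) proposes another method (the GABRIEL option) based on the studentized maximum modulus. This method is applicable only to arithmetic means. It rejects if

$$\frac{|\bar{y}_i - \bar{y}_j|}{s\left(\dfrac{1}{\sqrt{2n_i}} + \dfrac{1}{\sqrt{2n_j}}\right)} \ge m(\alpha; k, \nu)$$

where $m(\alpha; k, \nu)$ is the $\alpha$-level critical value of the studentized maximum modulus distribution of $k$ independent normal random variables with $\nu$ degrees of freedom, $s$ is the root mean square error, and $n_i$ and $n_j$ are the cell sizes.
For equal cell sizes, Gabriel's test is equivalent to Hochberg's GT2 method. For unequal cell sizes, Gabriel's method is more powerful than GT2 but may become liberal with highly disparate cell sizes (refer also to Dunnett 1980). Gabriel's test is the only method for unequal sample sizes that lends itself to a graphical representation as intervals around the means. Assuming $\bar{y}_i > \bar{y}_j$, you can rewrite the preceding inequality as

$$\bar{y}_i - m(\alpha; k, \nu)\,\frac{s}{\sqrt{2n_i}} \ge \bar{y}_j + m(\alpha; k, \nu)\,\frac{s}{\sqrt{2n_j}}$$

The expression on the left does not depend on j, nor does the expression on the right depend on i. Hence, you can form what Gabriel calls an (l,u)-interval around each sample mean and declare two means to be significantly different if their (l,u)-intervals do not overlap. See Hsu (1996, section 5.2.1.1) for a discussion of other methods of graphically representing all pairwise comparisons.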
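The (l,u)-interval construction can be sketched as follows. This is an illustration only, assuming the interval for mean $i$ is $\bar{y}_i \pm m\,s/\sqrt{2n_i}$, where $m$ is the studentized maximum modulus critical value. That critical value is passed in as `m_crit` because the Python standard library does not provide it, and the number used in the example is a made-up placeholder, not a tabled value. The function names and data are hypothetical.

```python
import math


def gabriel_intervals(means, sizes, s, m_crit):
    """Return one (l, u) interval per mean: ybar_i -/+ m_crit * s / sqrt(2 * n_i)."""
    intervals = []
    for ybar, n in zip(means, sizes):
        half_width = m_crit * s / math.sqrt(2.0 * n)
        intervals.append((ybar - half_width, ybar + half_width))
    return intervals


def significantly_different(a, b):
    """Two means differ if their (l, u) intervals do not overlap."""
    return a[0] >= b[1] or b[0] >= a[1]


# Hypothetical data: three cell means with unequal cell sizes.
means = [10.0, 12.0, 15.0]
sizes = [5, 20, 10]
s = 2.0          # root mean square error (assumed)
m_crit = 2.5     # placeholder critical value, NOT taken from a table

intervals = gabriel_intervals(means, sizes, s, m_crit)
for i in range(len(means)):
    for j in range(i + 1, len(means)):
        print(i, j, significantly_different(intervals[i], intervals[j]))
```

Checking for overlap of the intervals gives the same decisions as evaluating the pairwise inequality directly, which is what makes the graphical display possible with unequal cell sizes.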


However, other important situations that do not result in a correlation matrix R that has the structure for exact computation include
In these situations, exact calculation of the critical value $q^t(\alpha, \nu, R)$ (the quantile of the maximum absolute value of a multivariate t vector with $\nu$ degrees of freedom and correlation matrix $R$) is intractable in general. Most of the preceding methods can be viewed as using various approximations for $q^t(\alpha, \nu, R)$. When the sample sizes are unequal, the Tukey-Kramer test is equivalent to another approximation. For comparisons with a control when the correlation $R$ does not have a factor analytic structure, Hsu (1992) suggests approximating $R$ with a matrix $R^*$ that does have such a structure and correspondingly approximating $q^t(\alpha, \nu, R)$ with $q^t(\alpha, \nu, R^*)$. When you request Dunnett's test for LS-means (the PDIFF=CONTROL and ADJUST=DUNNETT options), the GLM procedure automatically uses Hsu's approximation when appropriate.
Finally, Edwards and Berry (1987) suggest calculating the critical value $q^t(\alpha, \nu, R)$ by simulation. Multivariate t vectors are sampled from a distribution with the appropriate $\nu$ and $R$ parameters, and Edwards and Berry (1987) suggest estimating $q^t(\alpha, \nu, R)$ by $\hat{q}$, the $100(1-\alpha)$th percentile of the observed values of $\max(|t_1|, \ldots, |t_n|)$. Sufficient samples are generated for the true $P(\max(|t_1|, \ldots, |t_n|) > \hat{q})$ to be within a certain accuracy radius $\gamma$ of $\alpha$ with accuracy confidence $100(1-\epsilon)\%$. You can approximate $q^t(\alpha, \nu, R)$ by simulation for comparisons between LS-means by specifying ADJUST=SIM (with either PDIFF=ALL or PDIFF=CONTROL). By default, $\gamma = 0.005$ and $\epsilon = 0.01$, so that the upper tail area of $\hat{q}$ is within 0.005 of $\alpha$ with 99% confidence. You can use the ACC= and EPS= options with ADJUST=SIM to reset $\gamma$ and $\epsilon$, or you can use the NSAMP= option to set the sample size directly. You can also control the random number sequence with the SEED= option.
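The simulation idea can be illustrated with a small sketch. This is not PROC GLM's implementation: it uses a fixed number of samples instead of deriving the sample size from an accuracy radius and confidence, it assumes the correlation matrix is positive definite (as for comparisons with a control in a balanced design, where the correlations are 0.5), and the names `cholesky`, `simulate_max_t_quantile`, `nsamp`, and `seed` are hypothetical Python names chosen for this example.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = sum(lower[i][m] * lower[j][m] for m in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - acc)
            else:
                lower[i][j] = (matrix[i][j] - acc) / lower[j][j]
    return lower


def simulate_max_t_quantile(corr, df, alpha=0.05, nsamp=20000, seed=1):
    """Estimate the (1 - alpha) quantile of max|t_i| for a multivariate t
    vector with correlation matrix corr and df degrees of freedom."""
    rng = random.Random(seed)
    lower = cholesky(corr)
    n = len(corr)
    maxima = []
    for _ in range(nsamp):
        # Correlated standard normals: z = L * z0.
        z0 = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = [sum(lower[i][m] * z0[m] for m in range(i + 1)) for i in range(n)]
        # Shared scale: sqrt(chi-square(df) / df), chi-square as Gamma(df/2, 2).
        scale = math.sqrt(rng.gammavariate(df / 2.0, 2.0) / df)
        maxima.append(max(abs(v) / scale for v in z))
    maxima.sort()
    index = min(nsamp - 1, math.ceil((1.0 - alpha) * nsamp) - 1)
    return maxima[index]


# Three comparisons with a control, balanced cells: pairwise correlation 0.5.
corr = [[1.0, 0.5, 0.5],
        [0.5, 1.0, 0.5],
        [0.5, 0.5, 1.0]]
q_hat = simulate_max_t_quantile(corr, df=20)
print(round(q_hat, 2))
```

The estimate varies with the seed and the number of samples; increasing `nsamp` narrows that variation, which is the role the accuracy radius and accuracy confidence play in the text above.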
Hsu and Nelson (1998) suggest a more accurate simulation method for estimating the critical value $q^t(\alpha, \nu, R)$, using a control variate adjustment technique. The same independent, standardized normal variates that are used to generate multivariate t vectors from a distribution with the appropriate $\nu$ and $R$ parameters are also used to generate multivariate t vectors from a distribution for which the exact value of the critical value is known. The value of $\max(|t_1|, \ldots, |t_n|)$ for the second sample is used as a control variate for adjusting the quantile estimate based on the first sample; refer to Hsu and Nelson (1998) for more details. The control variate adjustment has the drawback that it takes somewhat longer than the crude technique of Edwards and Berry (1987), but it typically yields an estimate that is many times more accurate. In most cases, if you are using ADJUST=SIM, then you should specify ADJUST=SIM(CVADJUST). You can also specify ADJUST=SIM(CVADJUST REPORT) to display a summary of the simulation that includes, among other things, the actual accuracy radius $\gamma$, which should be substantially smaller than the target accuracy radius (0.005 by default).
Step-down multiple-stage tests (MSTs) first test the homogeneity of all $k$ means at a level $\gamma_k$. If the test results in a rejection, then each subset of k-1 means is tested at level $\gamma_{k-1}$; otherwise, the procedure stops. In general, if the hypothesis of homogeneity of a set of p means is rejected at the $\gamma_p$ level, then each subset of p-1 means is tested at the $\gamma_{p-1}$ level; otherwise, the set of p means is considered not to differ significantly and none of its subsets are tested. The many varieties of MSTs that have been proposed differ in the levels $\gamma_p$ and the statistics on which the subset tests are based. Clearly, the EERC of a step-down MST is not greater than $\gamma_k$, and the CER is not greater than $\gamma_2$, but the MEER is a complicated function of $\gamma_p$, p = 2, ... ,k.
With unequal cell sizes, PROC GLM uses the harmonic mean of the cell sizes as the common sample size. However, since the resulting operating characteristics can be undesirable, MSTs are recommended only for the balanced case. When the sample sizes are equal and if the range statistic is used, you can arrange the means in ascending or descending order and test only contiguous subsets. But if you specify the F statistic, this shortcut cannot be taken. For this reason, only range-based MSTs are implemented. It is common practice to report the results of an MST by writing the means in such an order and drawing lines parallel to the list of means spanning the homogeneous subsets. This form of presentation is also convenient for pairwise comparisons with equal cell sizes.
The best known MSTs are the Duncan (the DUNCAN option) and Student-Newman-Keuls (the SNK option) methods (Miller 1981). Both use the studentized range statistic and, hence, are called multiple range tests. Duncan's method is often called the "new" multiple range test despite the fact that it is one of the oldest MSTs in current use. The Duncan and SNK methods differ in the significance levels $\gamma_p$ used for testing subsets of $p$ means. For Duncan's method, they are

$$\gamma_p = 1 - (1 - \alpha)^{p-1}$$

whereas the SNK method uses

$$\gamma_p = \alpha$$
The SNK method holds the EERC to the $\alpha$ level but does not control the MEER (Einot and Gabriel 1975). Consider ten population means that occur in five pairs such that means within a pair are equal, but there are large differences between pairs. If you make the usual sampling assumptions and also assume that the sample sizes are very large, all subset homogeneity hypotheses for three or more means are rejected. The SNK method then comes down to five independent tests, one for each pair, each at the $\alpha$ level. Letting $\alpha$ be 0.05, the probability of at least one false rejection is

$$1 - (1 - 0.05)^5 = 0.23$$
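A variety of MSTs that control the MEER have been proposed, but these methods are not as well known as those of Duncan and SNK. An approach developed by Ryan (1959, 1960), Einot and Gabriel (1975), and Welsch (1977) sets the level for testing a subset of $p$ out of $k$ means to

$$\gamma_p = \begin{cases} 1 - (1 - \alpha)^{p/k} & \text{for } p < k-1 \\ \alpha & \text{for } p \ge k-1 \end{cases}$$

This approach, based on range statistics, is available with the REGWQ option.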
REGWQ appears to be the most powerful step-down MST in the current literature (for example, Ramsey 1978). Use of a preliminary F-test decreases the power of all the other multiple comparison methods discussed previously except for Scheffé's test.
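To compare how the step-down methods set their subset levels, the sketch below tabulates $\gamma_p$ for Duncan, SNK, and REGWQ. It assumes the commonly cited forms: $\gamma_p = 1 - (1-\alpha)^{p-1}$ for Duncan, $\gamma_p = \alpha$ for SNK, and for REGWQ $\gamma_p = 1 - (1-\alpha)^{p/k}$ when $p < k-1$ and $\alpha$ otherwise. The function names are illustrative.

```python
def duncan_level(p: int, alpha: float = 0.05) -> float:
    """Level for testing a subset of p means under Duncan's method."""
    return 1.0 - (1.0 - alpha) ** (p - 1)


def snk_level(p: int, alpha: float = 0.05) -> float:
    """Level for testing a subset of p means under SNK (constant in p)."""
    return alpha


def regwq_level(p: int, k: int, alpha: float = 0.05) -> float:
    """Level for testing a subset of p out of k means under REGWQ."""
    if p >= k - 1:
        return alpha
    return 1.0 - (1.0 - alpha) ** (p / k)


k = 10
for p in range(2, k + 1):
    print(p, round(duncan_level(p), 4), snk_level(p), round(regwq_level(p, k), 4))

# Five independent pair tests, each at level 0.05, as in the SNK example above.
snk_five_pairs = 1.0 - (1.0 - 0.05) ** 5
print(round(snk_five_pairs, 2))
```

For small subsets REGWQ tests at a level well below $\alpha$, which is how it keeps the MEER under control, while Duncan's levels grow with $p$ and SNK's stay fixed at $\alpha$.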



Multiple comparisons can also lead to counter-intuitive results when the cell sizes are unequal. Consider four cells labeled A, B, C, and D, with sample means in the order A>B>C>D. If A and D each have two observations, and B and C each have 10,000 observations, then the difference between B and C may be significant, while the difference between A and D is not.
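The four-cell example can be made concrete by comparing the standard errors of the two differences. The sketch assumes the usual standard error $s\sqrt{1/n_i + 1/n_j}$ for a difference of two cell means and takes the root mean square error `s` to be 1 purely for illustration.

```python
import math


def se_of_difference(s: float, n_i: int, n_j: int) -> float:
    """Standard error of the difference between two cell means."""
    return s * math.sqrt(1.0 / n_i + 1.0 / n_j)


s = 1.0  # root mean square error, set to 1 for illustration

se_ad = se_of_difference(s, 2, 2)            # A versus D, two observations each
se_bc = se_of_difference(s, 10000, 10000)    # B versus C, 10,000 observations each

print(f"SE(A-D) = {se_ad:.4f}, SE(B-C) = {se_bc:.4f}, ratio = {se_ad / se_bc:.1f}")
```

The difference between A and D would have to be roughly 70 times as large as the difference between B and C to produce the same t statistic, which is why the smaller inner difference can be significant while the larger outer one is not.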
Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.