 The GLM Procedure

### Multiple Comparisons

When comparing more than two means, an ANOVA F-test tells you whether the means are significantly different from each other, but it does not tell you which means differ from which other means. Multiple comparison procedures (MCPs), also called mean separation tests, give you more detailed information about the differences among the means. The goal in multiple comparisons is to compare the average effects of three or more "treatments" (for example, drugs or groups of subjects) to decide which treatments are better, which ones are worse, and by how much, while controlling the probability of making an incorrect decision. A variety of multiple comparison methods are available with the MEANS and LSMEANS statements in the GLM procedure.

The following classification is due to Hsu (1996). Multiple comparison procedures can be categorized in two ways: by the comparisons they make and by the strength of inference they provide. With respect to which comparisons are made, the GLM procedure offers two types:

• comparisons between all pairs of means
• comparisons between a control and all other means
The strength of inference says what can be inferred about the structure of the means when a test is significant; it is related to what type of error rate the MCP controls. MCPs available in the GLM procedure provide one of the following types of inference, in order from weakest to strongest.

• Individual: differences between means, unadjusted for multiplicity
• Inhomogeneity: means are different
• Inequalities: which means are different
• Intervals: simultaneous confidence intervals for mean differences
Methods that control only individual error rates are not true MCPs at all. Methods that yield the strongest level of inference, simultaneous confidence intervals, are usually preferred, since they enable you not only to say which means are different but also to put confidence bounds on how much they differ, making it easier to assess the practical significance of a difference. They are also less likely to lead nonstatisticians to the invalid conclusion that nonsignificantly different sample means imply equal population means. Interval MCPs are available for both arithmetic means and LS-means via the MEANS and LSMEANS statements, respectively.

Table 30.3 and Table 30.4 display MCPs available in PROC GLM for all pairwise comparisons and comparisons with a control, respectively, along with associated strength of inference and the syntax (when applicable) for both the MEANS and the LSMEANS statements.

Table 30.3: Multiple Comparisons Procedures for All Pairwise Comparison

| Method | Strength of Inference | MEANS Syntax | LSMEANS Syntax |
|---|---|---|---|
| Student's t | Individual | T | PDIFF ADJUST=T |
| Duncan | Individual | DUNCAN | |
| Student-Newman-Keuls | Inhomogeneity | SNK | |
| REGWQ | Inequalities | REGWQ | |
| Tukey-Kramer | Intervals | TUKEY | PDIFF ADJUST=TUKEY |
| Bonferroni | Intervals | BON | PDIFF ADJUST=BON |
| Sidak | Intervals | SIDAK | PDIFF ADJUST=SIDAK |
| Scheffé | Intervals | SCHEFFE | PDIFF ADJUST=SCHEFFE |
| SMM | Intervals | SMM | PDIFF ADJUST=SMM |
| Gabriel | Intervals | GABRIEL | |
| Simulation | Intervals | | PDIFF ADJUST=SIMULATE |

Table 30.4: Multiple Comparisons Procedures for Comparisons with a Control

Note: One-sided Dunnett's tests are also available from the MEANS statement with the DUNNETTL and DUNNETTU options and from the LSMEANS statement with PDIFF=CONTROLL and PDIFF=CONTROLU.

Details of these multiple comparison methods are given in the following sections.

#### Pairwise Comparisons

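All the methods discussed in this section depend on the standardized pairwise differences $t_{ij} = (\bar{y}_i - \bar{y}_j)/\hat{\sigma}_{ij}$, where
• $i$ and $j$ are the indices of two groups
• $\bar{y}_i$ and $\bar{y}_j$ are the means or LS-means for groups $i$ and $j$
• $\hat{\sigma}_{ij}$ is the square root of the estimated variance of $\bar{y}_i - \bar{y}_j$. For simple arithmetic means, $\hat{\sigma}_{ij}^2 = s^2(1/n_i + 1/n_j)$, where $n_i$ and $n_j$ are the sizes of groups $i$ and $j$, respectively, and $s^2$ is the mean square for error, with $\nu$ degrees of freedom. For weighted arithmetic means, $\hat{\sigma}_{ij}^2 = s^2(1/w_i + 1/w_j)$, where $w_i$ and $w_j$ are the sums of the weights in groups $i$ and $j$, respectively. Finally, for LS-means defined by the linear combinations $l_i'b$ and $l_j'b$ of the parameter estimates, $\hat{\sigma}_{ij}^2 = s^2 (l_i - l_j)'(X'X)^{-}(l_i - l_j)$, where $X$ is the design matrix and $(X'X)^{-}$ is a generalized inverse.
Furthermore, all of the methods are discussed in terms of significance tests of the form
$$|t_{ij}| \ge c(\alpha)$$
where $c(\alpha)$ is some constant depending on the significance level. Such tests can be inverted to form confidence intervals of the form
$$(\bar{y}_i - \bar{y}_j) - \hat{\sigma}_{ij}\, c(\alpha) \;\le\; \mu_i - \mu_j \;\le\; (\bar{y}_i - \bar{y}_j) + \hat{\sigma}_{ij}\, c(\alpha)$$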

The simplest approach to multiple comparisons is to do a t test on every pair of means (the T option in the MEANS statement, ADJUST=T in the LSMEANS statement). For the $i$th and $j$th means, you can reject the null hypothesis that the population means are equal if

$$|t_{ij}| \ge t(\alpha;\nu)$$

where $\alpha$ is the significance level, $\nu$ is the number of error degrees of freedom, and $t(\alpha;\nu)$ is the two-tailed critical value from a Student's t distribution. If the cell sizes are all equal to, say, $n$, the preceding formula can be rearranged to give
$$|\bar{y}_i - \bar{y}_j| \ge t(\alpha;\nu)\, s \sqrt{\frac{2}{n}}$$
the value of the right-hand side being Fisher's least significant difference (LSD).
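The unadjusted t test and the LSD can be sketched in a few lines of Python. This is only an illustration, not PROC GLM output: the group means, cell size, and mean square for error are made-up values, and the critical value 2.086 is the two-tailed 5% point of a t distribution with 20 degrees of freedom, taken from a standard t table rather than computed.

```python
import math

def pairwise_t(mean_i, mean_j, n_i, n_j, mse):
    """Standardized difference t_ij for two simple arithmetic means."""
    se = math.sqrt(mse * (1.0 / n_i + 1.0 / n_j))
    return (mean_i - mean_j) / se

def fisher_lsd(t_crit, mse, n):
    """Least significant difference when every cell has n observations."""
    return t_crit * math.sqrt(mse) * math.sqrt(2.0 / n)

# Hypothetical inputs: 5 groups of 5 observations (20 error df), MSE = 4.0.
T_CRIT = 2.086  # t(0.05; 20), two-tailed, from a t table (assumed input)
t_12 = pairwise_t(12.0, 9.0, 5, 5, 4.0)
lsd = fisher_lsd(T_CRIT, 4.0, 5)

print(f"t_12 = {t_12:.3f}, LSD = {lsd:.3f}")
print("reject by t test:", abs(t_12) >= T_CRIT)
print("reject by LSD:   ", abs(12.0 - 9.0) >= lsd)
```

With equal cell sizes the two checks are the same test written on different scales, so they always agree.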

There is a problem with repeated t tests, however. Suppose there are ten means and each t test is performed at the 0.05 level. There are 10(10-1)/2=45 pairs of means to compare, each with a 0.05 probability of a type 1 error (a false rejection of the null hypothesis). The chance of making at least one type 1 error is much higher than 0.05. It is difficult to calculate the exact probability, but you can derive a pessimistic approximation by assuming that the comparisons are independent, giving an upper bound to the probability of making at least one type 1 error (the experimentwise error rate) of

$$1 - (1 - 0.05)^{45} = 0.90$$

The actual probability is somewhat less than 0.90, but as the number of means increases, the chance of making at least one type 1 error approaches 1.
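The pessimistic bound is easy to reproduce. The following Python sketch (the function name is illustrative only) counts the pairs and evaluates one minus the probability that none of the tests, treated as independent, rejects.

```python
def independent_bound(num_means, alpha=0.05):
    """Number of pairwise tests and the bound 1 - (1 - alpha)^c,
    assuming the c tests are independent."""
    c = num_means * (num_means - 1) // 2
    return c, 1.0 - (1.0 - alpha) ** c

pairs, bound = independent_bound(10)
print(pairs, f"{bound:.2f}")  # 45 0.90
```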

If you decide to control the individual type 1 error rates for each comparison, you are controlling the individual or comparisonwise error rate. On the other hand, if you want to control the overall type 1 error rate for all the comparisons, you are controlling the experimentwise error rate. It is up to you to decide whether to control the comparisonwise error rate or the experimentwise error rate, but there are many situations in which the experimentwise error rate should be held to a small value. Statistical methods for comparing three or more means while controlling the probability of making at least one type 1 error are called multiple comparisons procedures.

It has been suggested that the experimentwise error rate can be held to the $\alpha$ level by performing the overall ANOVA F-test at the $\alpha$ level and making further comparisons only if the F-test is significant, as in Fisher's protected LSD. This assertion is false if there are more than three means (Einot and Gabriel 1975). Consider again the situation with ten means. Suppose that one population mean differs from the others by such a sufficiently large amount that the power (probability of correctly rejecting the null hypothesis) of the F-test is near 1 but that all the other population means are equal to each other. There will be 9(9 - 1)/2=36 t tests of true null hypotheses, with an upper limit of 0.84 on the probability of at least one type 1 error. Thus, you must distinguish between the experimentwise error rate under the complete null hypothesis, in which all population means are equal, and the experimentwise error rate under a partial null hypothesis, in which some means are equal but others differ. The following abbreviations are used in the discussion:

CER
comparisonwise error rate

EERC
experimentwise error rate under the complete null hypothesis

MEER
maximum experimentwise error rate under any complete or partial null hypothesis

These error rates are associated with the different strengths of inference: individual tests control the CER; tests for inhomogeneity of means control the EERC; tests that yield confidence inequalities or confidence intervals control the MEER. A preliminary F-test controls the EERC but not the MEER.

You can control the MEER at the $\alpha$ level by setting the CER to a sufficiently small value. The Bonferroni inequality (Miller 1981) has been widely used for this purpose. If
$$\mathrm{CER} = \frac{\alpha}{c}$$
where $c$ is the total number of comparisons, then the MEER is less than $\alpha$. Bonferroni t tests (the BON option in the MEANS statement, ADJUST=BON in the LSMEANS statement) with $c = k(k-1)/2$ declare two means to be significantly different if
$$|t_{ij}| \ge t(\epsilon;\nu)$$
where
$$\epsilon = \frac{2\alpha}{k(k-1)}$$
for comparison of $k$ means.

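Sidak (1967) has provided a tighter bound, showing that
$$\mathrm{CER} = 1 - (1-\alpha)^{1/c}$$
also ensures that $\mathrm{MEER} \le \alpha$ for any set of $c$ comparisons. A Sidak t test (Games 1977), provided by the SIDAK option, is thus given by
$$|t_{ij}| \ge t(\epsilon;\nu)$$
where
$$\epsilon = 1 - (1-\alpha)^{\frac{2}{k(k-1)}}$$
for comparison of $k$ means.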
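To see how close the two adjustments are, the per-comparison levels can be tabulated. This Python sketch assumes all $k(k-1)/2$ pairwise comparisons are made; the function names are illustrative and are not SAS options.

```python
def bonferroni_level(alpha, k):
    """Per-comparison level alpha / c for c = k(k-1)/2 pairwise tests."""
    c = k * (k - 1) // 2
    return alpha / c

def sidak_level(alpha, k):
    """Per-comparison level 1 - (1 - alpha)^(1/c) for c = k(k-1)/2 tests."""
    c = k * (k - 1) // 2
    return 1.0 - (1.0 - alpha) ** (1.0 / c)

for k in (3, 5, 10):
    print(k, f"{bonferroni_level(0.05, k):.6f}", f"{sidak_level(0.05, k):.6f}")
```

The Sidak level is always slightly larger than the Bonferroni level, so the Sidak critical value is slightly smaller. That is the sense in which the Sidak bound is tighter.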

You can use the Bonferroni additive inequality and the Sidak multiplicative inequality to control the MEER for any set of contrasts or other hypothesis tests, not just pairwise comparisons. The Bonferroni inequality can provide simultaneous inferences in any statistical application requiring tests of more than one hypothesis. Other methods discussed in this section for pairwise comparisons can also be adapted for general contrasts (Miller 1981).

Scheffé (1953, 1959) proposes another method to control the MEER for any set of contrasts or other linear hypotheses in the analysis of linear models, including pairwise comparisons, obtained with the SCHEFFE option. Two means are declared significantly different if
$$|t_{ij}| \ge \sqrt{(k-1)\, F(\alpha; k-1, \nu)}$$
where $F(\alpha; k-1, \nu)$ is the $\alpha$-level critical value of an F distribution with $k-1$ numerator degrees of freedom and $\nu$ denominator degrees of freedom.

Scheffé's test is compatible with the overall ANOVA F-test in that Scheffé's method never declares a contrast significant if the overall F-test is nonsignificant. Most other multiple comparison methods can find significant contrasts when the overall F-test is nonsignificant and, therefore, suffer a loss of power when used with a preliminary F-test.

Scheffé's method may be more powerful than the Bonferroni or Sidak methods if the number of comparisons is large relative to the number of means. For pairwise comparisons, Sidak t tests are generally more powerful.

Tukey (1952, 1953) proposes a test designed specifically for pairwise comparisons based on the studentized range, sometimes called the "honestly significant difference test," that controls the MEER when the sample sizes are equal. Tukey (1953) and Kramer (1956) independently propose a modification for unequal cell sizes. The Tukey or Tukey-Kramer method is provided by the TUKEY option in the MEANS statement and the ADJUST=TUKEY option in the LSMEANS statement. This method has fared extremely well in Monte Carlo studies (Dunnett 1980). In addition, Hayter (1984) gives a proof that the Tukey-Kramer procedure controls the MEER for means comparisons, and Hayter (1989) describes the extent to which the Tukey-Kramer procedure has been proven to control the MEER for LS-means comparisons. The Tukey-Kramer method is more powerful than the Bonferroni, Sidak, or Scheffé methods for pairwise comparisons. Two means are considered significantly different by the Tukey-Kramer criterion if
$$|t_{ij}| \ge \frac{q(\alpha; k, \nu)}{\sqrt{2}}$$
where $q(\alpha; k, \nu)$ is the $\alpha$-level critical value of a studentized range distribution of $k$ independent normal random variables with $\nu$ degrees of freedom.

Hochberg (1974) devised a method (the GT2 or SMM option) similar to Tukey's, but it uses the studentized maximum modulus instead of the studentized range and employs Sidak's (1967) uncorrelated t inequality. It is proven to hold the MEER at a level not exceeding $\alpha$ with unequal sample sizes. It is generally less powerful than the Tukey-Kramer method and always less powerful than Tukey's test for equal cell sizes. Two means are declared significantly different if
$$|t_{ij}| \ge m(\alpha; c, \nu)$$
where $m(\alpha; c, \nu)$ is the $\alpha$-level critical value of the studentized maximum modulus distribution of $c$ independent normal random variables with $\nu$ degrees of freedom and $c = k(k-1)/2$.

Gabriel (1978) proposes another method (the GABRIEL option) based on the studentized maximum modulus. This method is applicable only to arithmetic means. It rejects if

$$\frac{|\bar{y}_i - \bar{y}_j|}{s\left(\dfrac{1}{\sqrt{2n_i}} + \dfrac{1}{\sqrt{2n_j}}\right)} \ge m(\alpha; k, \nu)$$

For equal cell sizes, Gabriel's test is equivalent to Hochberg's GT2 method. For unequal cell sizes, Gabriel's method is more powerful than GT2 but may become liberal with highly disparate cell sizes (refer also to Dunnett 1980). Gabriel's test is the only method for unequal sample sizes that lends itself to a graphical representation as intervals around the means. Assuming $\bar{y}_i > \bar{y}_j$, you can rewrite the preceding inequality as

$$\bar{y}_i - m(\alpha; k, \nu)\frac{s}{\sqrt{2n_i}} \;\ge\; \bar{y}_j + m(\alpha; k, \nu)\frac{s}{\sqrt{2n_j}}$$

The expression on the left does not depend on $j$, nor does the expression on the right depend on $i$. Hence, you can form what Gabriel calls an (l,u)-interval around each sample mean and declare two means to be significantly different if their (l,u)-intervals do not overlap. See Hsu (1996, section 5.2.1.1) for a discussion of other methods of graphically representing all pairwise comparisons.
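A minimal Python sketch of the (l,u)-interval idea follows. It assumes each interval is the sample mean plus or minus $m\,s/\sqrt{2n}$ for that group, where $m$ is the studentized maximum modulus critical value. The value of `m_crit` below is a placeholder chosen for illustration, not a tabulated quantile, and the means and cell sizes are made up.

```python
import math

def gabriel_intervals(means, sizes, s, m_crit):
    """(l, u) = mean -/+ m_crit * s / sqrt(2 n) for each group."""
    intervals = []
    for ybar, n in zip(means, sizes):
        half = m_crit * s / math.sqrt(2.0 * n)
        intervals.append((ybar - half, ybar + half))
    return intervals

def differ(a, b):
    """True when two (l, u)-intervals do not overlap."""
    return a[0] >= b[1] or b[0] >= a[1]

# Hypothetical data: three groups with unequal cell sizes, root MSE s = 2.0.
iv = gabriel_intervals([10.0, 12.0, 15.0], [8, 20, 5], s=2.0, m_crit=2.5)
for label, (lo, hi) in zip("ABC", iv):
    print(label, f"({lo:.3f}, {hi:.3f})")
print("A vs B differ:", differ(iv[0], iv[1]))
print("B vs C differ:", differ(iv[1], iv[2]))
```

Because each interval depends only on its own group, one plot of the intervals displays every pairwise decision at once.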

#### Comparing All Treatments to a Control

One special case of means comparison is that in which the only comparisons that need to be tested are between a set of new treatments and a single control. In this case, you can achieve better power by using a method that is restricted to test only comparisons to the single control mean. Dunnett (1955) proposes a test for this situation that declares a mean significantly different from the control if
$$|t_{i0}| \ge d(\alpha; k, \nu, \rho_1, \ldots, \rho_{k-1})$$
where $\bar{y}_0$ is the control mean and $d(\alpha; k, \nu, \rho_1, \ldots, \rho_{k-1})$ is the critical value of the "many-to-one t statistic" (Miller 1981; Krishnaiah and Armitage 1966) for $k$ means to be compared to a control, with $\nu$ error degrees of freedom and correlations $\rho_1, \ldots, \rho_{k-1}$, $\rho_i = n_i/(n_0 + n_i)$. The correlation terms arise because each of the treatment means is being compared to the same control. Dunnett's test holds the MEER to a level not exceeding the stated $\alpha$.
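The correlation terms can be computed directly from the cell sizes. The Python sketch below assumes $\rho_i = n_i/(n_0 + n_i)$ and uses the standard result, not stated in the text above, that for independent group means with a common variance the correlation between the comparisons of treatments $i$ and $j$ with the control is $\sqrt{\rho_i \rho_j}$. The function name and sample sizes are illustrative.

```python
import math

def dunnett_correlations(n_control, n_treatments):
    """Return the rho_i values and the correlation matrix of the
    treatment-minus-control differences (independent means assumed)."""
    rho = [n / (n_control + n) for n in n_treatments]
    lam = [math.sqrt(r) for r in rho]
    size = len(lam)
    corr = [[1.0 if a == b else lam[a] * lam[b] for b in range(size)]
            for a in range(size)]
    return rho, corr

# Hypothetical design: a control with 12 observations and three treatments.
rho, corr = dunnett_correlations(12, [6, 12, 24])
print([round(r, 3) for r in rho])
for row in corr:
    print([round(v, 3) for v in row])
```

Each off-diagonal entry is a product of one factor per treatment, which is the kind of product structure that makes the Dunnett quantile computable.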

#### Approximate and Simulation-based Methods

Both Tukey's and Dunnett's tests are based on the same general quantile calculation:
$$q^t(\alpha, \nu, R) = \{\, q : P(\max(|t_1|, \ldots, |t_n|) > q) = \alpha \,\}$$
where the $t_i$ have a joint multivariate t distribution with $\nu$ degrees of freedom and correlation matrix $R$. In general, evaluating $q^t(\alpha, \nu, R)$ requires repeated numerical calculation of an $(n+1)$-fold integral. This is usually intractable, but the problem reduces to a feasible 2-fold integral when $R$ has a certain symmetry in the case of Tukey's test, and a factor analytic structure (cf. Hsu 1992) in the case of Dunnett's test. The $R$ matrix has the required symmetry for exact computation of Tukey's test if the $t_i$ are studentized differences between
• $k(k-1)/2$ pairs of $k$ uncorrelated means with equal variances, that is, equal sample sizes
• $k(k-1)/2$ pairs of $k$ LS-means from a variance-balanced design (for example, a balanced incomplete block design)
Refer to Hsu (1992, 1996) for more information. The $R$ matrix has the factor analytic structure for exact computation of Dunnett's test if the $t_i$ are studentized differences between
• $k-1$ means and a control mean, all uncorrelated. (Dunnett's one-sided methods depend on a similar probability calculation, without the absolute values.) Note that it is not required that the variances of the means (that is, the sample sizes) be equal.
• $k-1$ LS-means and a control LS-mean from either a variance-balanced design, or a design in which the other factors are orthogonal to the treatment factor (for example, a randomized block design with proportional cell frequencies).

However, other important situations that do not result in a correlation matrix R that has the structure for exact computation include
• all pairwise differences with unequal sample sizes
• differences between LS-means in many unbalanced designs

In these situations, exact calculation of $q^t(\alpha, \nu, R)$ is intractable in general. Most of the preceding methods can be viewed as using various approximations for $q^t(\alpha, \nu, R)$. When the sample sizes are unequal, the Tukey-Kramer test is equivalent to another approximation. For comparisons with a control when the correlation $R$ does not have a factor analytic structure, Hsu (1992) suggests approximating $R$ with a matrix $R^*$ that does have such a structure and correspondingly approximating $q^t(\alpha, \nu, R)$ with $q^t(\alpha, \nu, R^*)$. When you request Dunnett's test for LS-means (the PDIFF=CONTROL and ADJUST=DUNNETT options), the GLM procedure automatically uses Hsu's approximation when appropriate.

Finally, Edwards and Berry (1987) suggest calculating $q^t(\alpha, \nu, R)$ by simulation. Multivariate t vectors are sampled from a distribution with the appropriate $\nu$ and $R$ parameters, and Edwards and Berry (1987) suggest estimating $q^t(\alpha, \nu, R)$ by $\hat{q}$, the upper $\alpha$ percentile (that is, the $100(1-\alpha)$th percentile) of the observed values of $\max(|t_1|, \ldots, |t_n|)$. Sufficient samples are generated for the true $P(\max(|t_1|, \ldots, |t_n|) > \hat{q})$ to be within a certain accuracy radius $\gamma$ of $\alpha$ with accuracy confidence $100(1-\epsilon)\%$. You can approximate $q^t(\alpha, \nu, R)$ by simulation for comparisons between LS-means by specifying ADJUST=SIM (with either PDIFF=ALL or PDIFF=CONTROL). By default, $\gamma = 0.005$ and $\epsilon = 0.01$, so that the tail area of $\hat{q}$ is within 0.005 of $\alpha$ with 99% confidence. You can use the ACC= and EPS= options with ADJUST=SIM to reset $\gamma$ and $\epsilon$, or you can use the NSAMP= option to set the sample size directly. You can also control the random number sequence with the SEED= option.
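The crude simulation idea can be sketched with the Python standard library. This is an illustration under stated assumptions, not the algorithm PROC GLM runs: it covers all pairwise differences of independent arithmetic means with unequal cell sizes, uses a fixed number of samples instead of the accuracy-driven sample size, and all function and parameter names are made up for the example.

```python
import math
import random

def simulate_max_abs_t(sizes, nu, rng):
    """One draw of max |t_ij| over all pairs when all population means are equal."""
    # With sigma = 1, each sample mean is N(0, 1/n_i).
    ybar = [rng.gauss(0.0, 1.0 / math.sqrt(n)) for n in sizes]
    # s^2 / sigma^2 is chi-square(nu) / nu; chi-square(nu) is Gamma(nu/2, scale 2).
    s = math.sqrt(rng.gammavariate(nu / 2.0, 2.0) / nu)
    best = 0.0
    for i in range(len(sizes)):
        for j in range(i + 1, len(sizes)):
            se = s * math.sqrt(1.0 / sizes[i] + 1.0 / sizes[j])
            best = max(best, abs(ybar[i] - ybar[j]) / se)
    return best

def estimate_quantile(sizes, nu, alpha=0.05, nsamp=20000, seed=1):
    """Estimate the critical value as the 100(1 - alpha)th percentile of the draws."""
    rng = random.Random(seed)
    draws = sorted(simulate_max_abs_t(sizes, nu, rng) for _ in range(nsamp))
    index = min(nsamp - 1, math.ceil((1.0 - alpha) * nsamp) - 1)
    return draws[index]

# Hypothetical unbalanced one-way layout: 38 observations, 3 groups, 35 error df.
q_hat = estimate_quantile([4, 9, 25], nu=35)
print(round(q_hat, 2))
```

The estimate changes with the seed and the number of samples, which is why the accuracy radius and confidence described above matter in practice.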

Hsu and Nelson (1998) suggest a more accurate simulation method for estimating $q^t(\alpha, \nu, R)$, using a control variate adjustment technique. The same independent, standardized normal variates that are used to generate multivariate t vectors from a distribution with the appropriate $\nu$ and $R$ parameters are also used to generate multivariate t vectors from a distribution for which the exact value of $q^t(\alpha, \nu, R)$ is known. $\max(|t_1|, \ldots, |t_n|)$ for the second sample is used as a control variate for adjusting the quantile estimate based on the first sample; refer to Hsu and Nelson (1998) for more details. The control variate adjustment has the drawback that it takes somewhat longer than the crude technique of Edwards and Berry (1987), but it typically yields an estimate that is many times more accurate. In most cases, if you are using ADJUST=SIM, then you should specify ADJUST=SIM(CVADJUST). You can also specify ADJUST=SIM(CVADJUST REPORT) to display a summary of the simulation that includes, among other things, the actual accuracy radius $\gamma$, which should be substantially smaller than the target accuracy radius (0.005 by default).

#### Multiple-Stage Tests

You can use all of the methods discussed so far to obtain simultaneous confidence intervals (Miller 1981). By sacrificing the facility for simultaneous estimation, you can obtain simultaneous tests with greater power using multiple-stage tests (MSTs). MSTs come in both step-up and step-down varieties (Welsch 1977). The step-down methods, which have been more widely used, are available in SAS/STAT software.

Step-down MSTs first test the homogeneity of all of the means at a level $\gamma_k$. If the test results in a rejection, then each subset of $k-1$ means is tested at level $\gamma_{k-1}$; otherwise, the procedure stops. In general, if the hypothesis of homogeneity of a set of $p$ means is rejected at the $\gamma_p$ level, then each subset of $p-1$ means is tested at the $\gamma_{p-1}$ level; otherwise, the set of $p$ means is considered not to differ significantly and none of its subsets are tested. The many varieties of MSTs that have been proposed differ in the levels $\gamma_p$ and the statistics on which the subset tests are based. Clearly, the EERC of a step-down MST is not greater than $\gamma_k$, and the CER is not greater than $\gamma_2$, but the MEER is a complicated function of $\gamma_p$, $p = 2, \ldots, k$.

With unequal cell sizes, PROC GLM uses the harmonic mean of the cell sizes as the common sample size. However, since the resulting operating characteristics can be undesirable, MSTs are recommended only for the balanced case. When the sample sizes are equal and if the range statistic is used, you can arrange the means in ascending or descending order and test only contiguous subsets. But if you specify the F statistic, this shortcut cannot be taken. For this reason, only range-based MSTs are implemented. It is common practice to report the results of an MST by writing the means in such an order and drawing lines parallel to the list of means spanning the homogeneous subsets. This form of presentation is also convenient for pairwise comparisons with equal cell sizes.

The best known MSTs are the Duncan (the DUNCAN option) and Student-Newman-Keuls (the SNK option) methods (Miller 1981). Both use the studentized range statistic and, hence, are called multiple range tests. Duncan's method is often called the "new" multiple range test despite the fact that it is one of the oldest MSTs in current use.

The Duncan and SNK methods differ in the $\gamma_p$ values used. For Duncan's method, they are
$$\gamma_p = 1 - (1-\alpha)^{p-1}$$
whereas the SNK method uses
$$\gamma_p = \alpha$$
Duncan's method controls the CER at the $\alpha$ level. Its operating characteristics appear similar to those of Fisher's unprotected LSD or repeated t tests at level $\alpha$ (Petrinovich and Hardyck 1969). Since repeated t tests are easier to compute, easier to explain, and applicable to unequal sample sizes, Duncan's method is not recommended. Several published studies (for example, Carmer and Swanson 1973) have claimed that Duncan's method is superior to Tukey's because of greater power without considering that the greater power of Duncan's method is due to its higher type 1 error rate (Einot and Gabriel 1975).
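The difference between the two sets of levels is easiest to see in a table. This Python sketch assumes Duncan's levels are $1-(1-\alpha)^{p-1}$ and the SNK levels are $\alpha$ for every subset size $p$; the function names are illustrative.

```python
def duncan_level(p, alpha=0.05):
    """Duncan's test level for a subset of p means."""
    return 1.0 - (1.0 - alpha) ** (p - 1)

def snk_level(p, alpha=0.05):
    """Student-Newman-Keuls test level: the same alpha for every subset size."""
    return alpha

print(" p  Duncan   SNK")
for p in range(2, 11):
    print(f"{p:2d}  {duncan_level(p):.4f}  {snk_level(p):.4f}")
```

The two methods agree only at $p = 2$. For larger subsets Duncan's level grows quickly, which is the source of both its extra power and its higher type 1 error rate.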

The SNK method holds the EERC to the $\alpha$ level but does not control the MEER (Einot and Gabriel 1975). Consider ten population means that occur in five pairs such that means within a pair are equal, but there are large differences between pairs. If you make the usual sampling assumptions and also assume that the sample sizes are very large, all subset homogeneity hypotheses for three or more means are rejected. The SNK method then comes down to five independent tests, one for each pair, each at the $\alpha$ level. Letting $\alpha$ be 0.05, the probability of at least one false rejection is
$$1 - (1 - 0.05)^5 = 0.23$$
As the number of means increases, the MEER approaches 1. Therefore, the SNK method cannot be recommended.
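The same arithmetic extends to any number of equal-mean pairs. The Python sketch below evaluates the probability of at least one false rejection among independent within-pair tests, each at level $\alpha$. The configuration (well-separated pairs, very large samples) is the one described above, and the function name is illustrative.

```python
def snk_paired_meer(num_pairs, alpha=0.05):
    """P(at least one false rejection) over num_pairs independent level-alpha tests."""
    return 1.0 - (1.0 - alpha) ** num_pairs

for pairs in (5, 10, 25, 50):
    print(f"{2 * pairs:3d} means: {snk_paired_meer(pairs):.2f}")
```

For ten means (five pairs) this prints 0.23, and the value climbs toward 1 as pairs are added.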

A variety of MSTs that control the MEER have been proposed, but these methods are not as well known as those of Duncan and SNK. An approach developed by Ryan (1959, 1960), Einot and Gabriel (1975), and Welsch (1977) sets
$$\gamma_p = \begin{cases} 1 - (1-\alpha)^{p/k} & \text{for } p < k-1 \\ \alpha & \text{for } p \ge k-1 \end{cases}$$
You can use range statistics, leading to what is called the REGWQ method after the authors' initials. If you assume that the sample means have been arranged in descending order from $\bar{y}_1$ through $\bar{y}_k$, the homogeneity of means $\bar{y}_i, \ldots, \bar{y}_j$, $i < j$, is rejected by REGWQ if
$$\bar{y}_i - \bar{y}_j \ge q(\gamma_p; p, \nu)\frac{s}{\sqrt{n}}$$
where $p = j-i+1$ (Einot and Gabriel 1975). To ensure that the MEER is controlled, the current implementation checks whether $q(\gamma_p; p, \nu)$ is monotonically increasing in $p$. If not, then a set of critical values that are increasing in $p$ is substituted instead.
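The Ryan-Einot-Gabriel-Welsch levels can be tabulated in the same way. This Python sketch assumes $\gamma_p = 1-(1-\alpha)^{p/k}$ for $p < k-1$ and $\gamma_p = \alpha$ otherwise. It covers only the levels, not the studentized range critical values or the monotonicity check, and the function name is illustrative.

```python
def regwq_level(p, k, alpha=0.05):
    """REGW test level for a subset of p means out of k."""
    if p >= k - 1:
        return alpha
    return 1.0 - (1.0 - alpha) ** (p / k)

k = 10
for p in range(2, k + 1):
    print(f"{p:2d}  {regwq_level(p, k):.4f}")
```

Unlike Duncan's levels, these never exceed $\alpha$, and small subsets are tested at much stricter levels.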

REGWQ appears to be the most powerful step-down MST in the current literature (for example, Ramsey 1978). Use of a preliminary F-test decreases the power of all the other multiple comparison methods discussed previously except for Scheffé's test.

#### Bayesian Approach

Waller and Duncan (1969) and Duncan (1975) take an approach to multiple comparisons that differs from all the methods previously discussed in minimizing the Bayes risk under additive loss rather than controlling type 1 error rates. For each pair of population means $\mu_i$ and $\mu_j$, null ($H_0^{ij}$) and alternative ($H_a^{ij}$) hypotheses are defined:
$$H_0^{ij}: \mu_i - \mu_j \le 0 \qquad\qquad H_a^{ij}: \mu_i - \mu_j > 0$$
For any $i$, $j$ pair, let $d_0$ indicate a decision in favor of $H_0^{ij}$ and $d_a$ indicate a decision in favor of $H_a^{ij}$, and let $\delta = \mu_i - \mu_j$. The loss function for the decision on the $i$, $j$ pair is
$$L(d_0 \mid \delta) = \begin{cases} 0 & \text{if } \delta \le 0 \\ \delta & \text{if } \delta > 0 \end{cases} \qquad\qquad L(d_a \mid \delta) = \begin{cases} -k\delta & \text{if } \delta \le 0 \\ 0 & \text{if } \delta > 0 \end{cases}$$
where $k$ represents a constant that you specify rather than the number of means. The loss for the joint decision involving all pairs of means is the sum of the losses for each individual decision. The population means are assumed to have a normal prior distribution with unknown variance, the logarithm of the variance of the means having a uniform prior distribution. For the $i$, $j$ pair, the null hypothesis is rejected if
$$\bar{y}_i - \bar{y}_j \ge t_B\, s\sqrt{\frac{2}{n}}$$
where $t_B$ is the Bayesian t value (Waller and Kemp 1976) depending on $k$, the F statistic for the one-way ANOVA, and the degrees of freedom for F. The value of $t_B$ is a decreasing function of F, so the Waller-Duncan test (specified by the WALLER option) becomes more liberal as F increases.
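The additive loss structure can be written out as a short Python sketch. It assumes the loss is $\delta$ for wrongly keeping the null when $\delta > 0$, and $-k\delta$ for wrongly rejecting it when $\delta \le 0$. The value $k = 100$ and the $\delta$ values are hypothetical, and the function names are illustrative. The sketch does not compute the Bayesian t value.

```python
def pair_loss(decision, delta, k):
    """Loss for one pair: decision is 'd0' (keep H0) or 'da' (favor Ha)."""
    if decision == "d0":
        return delta if delta > 0 else 0.0
    if decision == "da":
        return -k * delta if delta <= 0 else 0.0
    raise ValueError("decision must be 'd0' or 'da'")

def joint_loss(decisions, deltas, k):
    """Additive loss: the sum of the per-pair losses."""
    return sum(pair_loss(d, delta, k) for d, delta in zip(decisions, deltas))

# Hypothetical true differences and decisions for three pairs, with k = 100.
print(joint_loss(["d0", "da", "da"], [1.5, -0.2, 0.8], k=100))
```

A larger $k$ makes a false rejection more costly relative to a missed difference, so it plays a role comparable to choosing a smaller significance level.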

#### Recommendations

In summary, if you are interested in several individual comparisons and are not concerned about the effects of multiple inferences, you can use repeated t tests or Fisher's unprotected LSD. If you are interested in all pairwise comparisons or all comparisons with a control, you should use Tukey's or Dunnett's test, respectively, in order to make the strongest possible inferences. If you have weaker inferential requirements and, in particular, if you don't want confidence intervals for the mean differences, you should use the REGWQ method. Finally, if you agree with the Bayesian approach and Waller and Duncan's assumptions, you should use the Waller-Duncan test.

#### Interpretation of Multiple Comparisons

When you interpret multiple comparisons, remember that failure to reject the hypothesis that two or more means are equal should not lead you to conclude that the population means are, in fact, equal. Failure to reject the null hypothesis implies only that the difference between population means, if any, is not large enough to be detected with the given sample size. A related point is that nonsignificance is nontransitive: that is, given three sample means, the largest and smallest may be significantly different from each other, while neither is significantly different from the middle one. Nontransitive results of this type occur frequently in multiple comparisons.

Multiple comparisons can also lead to counter-intuitive results when the cell sizes are unequal. Consider four cells labeled A, B, C, and D, with sample means in the order A>B>C>D. If A and D each have two observations, and B and C each have 10,000 observations, then the difference between B and C may be significant, while the difference between A and D is not.
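A small numeric sketch in Python shows why this happens. The means and the common standard deviation are invented for illustration. Only the cell sizes come from the example above, and 1.96 is used as an approximate unadjusted two-tailed 5% critical value for the very large error degrees of freedom.

```python
import math

def t_stat(mean_a, mean_b, n_a, n_b, s):
    """Standardized difference between two arithmetic means."""
    return (mean_a - mean_b) / (s * math.sqrt(1.0 / n_a + 1.0 / n_b))

# Hypothetical means with A > B > C > D and root MSE s = 1.
means = {"A": 10.5, "B": 10.2, "C": 10.1, "D": 9.8}
sizes = {"A": 2, "B": 10000, "C": 10000, "D": 2}
s = 1.0

t_ad = t_stat(means["A"], means["D"], sizes["A"], sizes["D"], s)
t_bc = t_stat(means["B"], means["C"], sizes["B"], sizes["C"], s)
print(f"A vs D: difference 0.7, t = {t_ad:.2f}")
print(f"B vs C: difference 0.1, t = {t_bc:.2f}")
```

The larger difference has the smaller test statistic because its standard error is about seventy times larger.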
