
The NPAR1WAY Procedure

Exact tests can be useful in situations where the
asymptotic assumptions are not met and the asymptotic
*p*-values are not close approximations for the true *p*-values.
Standard asymptotic methods involve the assumption that
the test statistic follows a particular distribution when the
sample size is sufficiently large. When the sample size is
not large, asymptotic results may not be valid, with the
asymptotic *p*-values differing perhaps substantially from the
exact *p*-values. Asymptotic results may also be unreliable
when the distribution of the data is sparse, skewed, or heavily
tied. Refer to Agresti (1996) and Bishop, Fienberg, and Holland (1975).
Exact computations are based on the statistical theory of exact
conditional inference for contingency tables, reviewed by
Agresti (1992).

In addition to computation of exact *p*-values, PROC NPAR1WAY
provides the option of estimating exact *p*-values by Monte
Carlo simulation. This can be useful for
problems that are so large that exact computations require
a great amount of time and memory, but for which asymptotic
approximations may not be sufficient.

The following sections summarize the exact computational
algorithms, define the exact *p*-values that PROC NPAR1WAY
computes, discuss the computational resource
requirements, and describe the Monte Carlo estimation option.

PROC NPAR1WAY constructs a contingency table from the input data,
with rows formed by the levels of the classification variable and
columns formed by the response variable values. The reference set
for a given contingency table is the set of all contingency tables
with the observed marginal row and column sums.

Corresponding to this reference set, the network algorithm forms a
directed acyclic network consisting of nodes in a number of stages.
A path through the network corresponds to a distinct table in the
reference set. The distances between nodes are defined so that the
total distance of a path through the network is the corresponding
value of the test statistic. At each node, the algorithm computes
the shortest and longest path distances for all the paths that pass
through that node.

For the two-sample linear rank statistics, which can be expressed
as a linear combination of cell frequencies multiplied by increasing
row and column scores, PROC NPAR1WAY computes shortest and longest
path distances using the algorithm given in Agresti, Mehta, and
Patel (1990). For the multisample one-way test statistics, PROC
NPAR1WAY computes an upper bound for the longest path and a lower
bound for the shortest path, following the approach of Valz and
Thompson (1994).

The longest and shortest path distances or bounds for a node
are compared to the value of the test statistic to determine
whether all paths through the node contribute to the *p*-value,
none of the paths through the node contribute to the *p*-value,
or neither of these situations occurs. If all paths through the node
contribute, the *p*-value is incremented accordingly, and these
paths are eliminated from further analysis. If no paths contribute, these
paths are eliminated from the analysis. Otherwise, the algorithm
continues, still processing this node and the associated paths. The
algorithm finishes when all nodes have been accounted for.
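
The quantity that the network algorithm computes can be illustrated on a very small problem by brute force. The sketch below is not the network algorithm: it simply enumerates every way of assigning the pooled observations to the first sample, which is equivalent to visiting every table in the reference set, and it assumes a two-sample rank-sum statistic with midranks for ties. The function names `midranks` and `exact_right_sided_p` are hypothetical and exist only for this illustration.

```python
from fractions import Fraction
from itertools import combinations


def midranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [Fraction(0)] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = Fraction((i + 1) + (j + 1), 2)  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def exact_right_sided_p(group1, group2):
    """Exact P(rank sum of sample 1 >= observed) by full enumeration.

    Every assignment of n1 pooled observations to sample 1 is equally
    likely under the null hypothesis, so the p-value is a simple ratio.
    """
    ranks = midranks(list(group1) + list(group2))
    n1 = len(group1)
    observed = sum(ranks[:n1])
    hits = total = 0
    for subset in combinations(range(len(ranks)), n1):
        total += 1
        if sum(ranks[i] for i in subset) >= observed:
            hits += 1
    return Fraction(hits, total)


print(exact_right_sided_p([4, 5, 6], [1, 2, 3]))
```

Exact rational arithmetic is used here so that comparisons with the observed statistic are not affected by rounding. Full enumeration becomes infeasible very quickly as the sample size grows, which is the motivation for pruning whole groups of paths at once as described above.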

In applying the network algorithm, PROC NPAR1WAY uses full
precision to represent all statistics, row and column scores, and other
quantities involved in the computations. Although rounding could
reduce the time and memory that the algorithm requires,
PROC NPAR1WAY does not round, because rounding can reduce the
accuracy of the *p*-values.

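For the two-sample linear rank tests, the one-sided *p*-value
$P_1$ can be expressed as

$$
P_1 =
\begin{cases}
\mathrm{Prob}(\text{Test Statistic} \ge S) & \text{if } S > \mathit{Mean} \\
\mathrm{Prob}(\text{Test Statistic} \le S) & \text{if } S \le \mathit{Mean}
\end{cases}
$$

where *S* is the observed value of the test statistic and *Mean*
is the expected value of the test statistic under the null hypothesis.
PROC NPAR1WAY computes the two-sided *p*-value as the sum of the
one-sided *p*-value and the corresponding area in the opposite tail
of the distribution of the statistic, equidistant from the expected value.
The two-sided *p*-value $P_2$ can be expressed as

$$
P_2 = \mathrm{Prob}\left( |\text{Test Statistic} - \mathit{Mean}| \ge |S - \mathit{Mean}| \right)
$$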

For multisample data, the tests are based on one-way ANOVA
statistics. For a test of this form, large values of the
test statistic indicate a departure from the null hypothesis;
the test is inherently two-sided.
The exact *p*-value is the sum of probabilities for those
tables having a test statistic greater than or equal to the
value of the observed test statistic.
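
As a rough illustration of the two-sample definitions, the sketch below builds the exact null distribution of a rank sum for a tiny untied problem and then evaluates a one-sided and a two-sided *p*-value from it. It assumes that the one-sided tail is chosen by comparing the observed statistic with its null expectation, and that the two-sided value adds the equidistant opposite tail. The helper names `rank_sum_distribution` and `one_and_two_sided` are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def rank_sum_distribution(n1, n):
    """Exact null distribution of the rank sum of n1 out of n untied ranks."""
    counts = Counter(sum(c) for c in combinations(range(1, n + 1), n1))
    total = sum(counts.values())
    return {t: Fraction(k, total) for t, k in counts.items()}


def one_and_two_sided(dist, s):
    """Return (one-sided, two-sided) p-values for observed statistic s."""
    mean = sum(t * p for t, p in dist.items())
    if s > mean:
        # Right-sided: tables with a statistic at least as large as s.
        p1 = sum(p for t, p in dist.items() if t >= s)
    else:
        # Left-sided: tables with a statistic at most as large as s.
        p1 = sum(p for t, p in dist.items() if t <= s)
    # Two-sided: everything at least as far from the mean as s, in either tail.
    p2 = sum(p for t, p in dist.items() if abs(t - mean) >= abs(s - mean))
    return p1, p2


dist = rank_sum_distribution(2, 4)
print(one_and_two_sided(dist, 7))
```

For a multisample statistic only the upper-tail sum applies, so the same distribution dictionary would be summed over values greater than or equal to the observed statistic.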

No formula exists that can
predict in advance how much time and memory
are needed to compute an exact *p*-value for a
given problem. The time and memory required depend on
several factors, including which test is being performed,
the total sample size, the number of rows and columns,
and the specific arrangement of the observations into
table cells. Generally, larger problems
(in terms of total sample size, number of rows, and number of
columns) tend to require more time and memory. Additionally,
for a fixed total sample size, time and memory requirements tend
to increase as the number of rows and columns increases,
since this corresponds to an increase in the number of
tables in the reference set. Also, for a fixed sample size,
time and memory requirements increase as the marginal
row and column totals become more homogeneous. Refer to
Agresti, Mehta, and Patel (1990) and Gail and Mantel (1977).
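
The effect of the margins on the size of the reference set can be seen in the simplest case. The sketch below assumes untied data, so that every column total is 1 and the number of tables in the reference set is the multinomial coefficient of the row totals. The function name `reference_set_size` is hypothetical, and the count is only a proxy for the work involved: the actual time and memory also depend on the test and on how many paths the algorithm can prune.

```python
from math import factorial


def reference_set_size(row_totals):
    """Count tables with the given row totals when all column totals are 1.

    This is the multinomial coefficient n! / (r_1! * r_2! * ... * r_k!).
    """
    size = factorial(sum(row_totals))
    for r in row_totals:
        size //= factorial(r)
    return size


# Same total sample size (20), different margins.
for rows in ([2, 18], [10, 10], [5, 5, 5, 5]):
    print(rows, reference_set_size(rows))
```

With 20 observations, moving from row totals of 2 and 18 to 10 and 10 makes the reference set much larger, and splitting the same observations across four equal rows makes it larger again.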

At any time while PROC NPAR1WAY is computing exact *p*-values,
you can terminate the computations by pressing the system
interrupt key sequence (refer to the *SAS Companion* for your system)
and choosing to stop computations. After you terminate exact
computations, PROC NPAR1WAY completes all other remaining tasks.
The procedure produces the requested output and reports missing
values for any exact *p*-values not computed by the time of
termination.

You can also use the MAXTIME= option in the EXACT statement to
limit the amount of time PROC NPAR1WAY uses for exact computations.
You specify a MAXTIME= value that is the maximum amount of
time (in seconds) that PROC NPAR1WAY can use to compute an
exact *p*-value. If PROC NPAR1WAY does not finish computing
an exact *p*-value within that time, it terminates the
computation and completes all other remaining tasks.

To compute a Monte Carlo estimate of an exact *p*-value, PROC
NPAR1WAY generates a random sample of tables with the same total
sample size, row totals, and column totals as the observed table.
PROC NPAR1WAY uses the algorithm of Agresti, Wackerly, and Boyett (1979),
which generates tables in proportion to their hypergeometric
probabilities conditional on the marginal frequencies.
For each sample table, PROC NPAR1WAY computes the value of the test
statistic and compares it to the value for the observed table.
When estimating a right-sided *p*-value, PROC NPAR1WAY counts all
sample tables for which the test statistic is greater than or
equal to the observed test statistic. Then the *p*-value
estimate equals the number of these tables divided by the total
number of tables sampled.
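
A minimal sketch of this estimate follows. It does not implement the table-generation algorithm of Agresti, Wackerly, and Boyett (1979). Instead it shuffles the pooled scores and reassigns them to the two samples, which also keeps the sample sizes and the set of scores fixed and draws tables with their conditional probabilities. The statistic is assumed to be the sum of the scores in the first sample, and the function name, the default number of samples, and the default seed are hypothetical.

```python
import random


def monte_carlo_right_sided(scores1, scores2, n_samples=10000, seed=12345):
    """Estimate a right-sided p-value by random relabeling.

    Returns (m, n): m is the number of sampled tables whose statistic is
    greater than or equal to the observed one, and n is the number sampled.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    pooled = list(scores1) + list(scores2)
    n1 = len(scores1)
    observed = sum(scores1)
    m = 0
    for _ in range(n_samples):
        rng.shuffle(pooled)  # one random table with the observed margins
        if sum(pooled[:n1]) >= observed:
            m += 1
    return m, n_samples


m, n = monte_carlo_right_sided([4, 5, 6], [1, 2, 3])
print(m / n)
```

For this small example the exact right-sided *p*-value is 1/20, so the printed estimate should be close to 0.05. The precise value depends on the seed and on the number of samples.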

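Let *M* be the number of sample tables counted in this way and
*N* the total number of tables sampled, so that the Monte Carlo
estimate is $\hat{p} = M/N$.
The variable *M* is a binomially distributed variable with
*N* trials and success probability *p*. It follows that
the asymptotic standard error of the Monte Carlo estimate is

$$
\mathrm{se}(\hat{p}) = \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{N - 1}}
$$

PROC NPAR1WAY constructs asymptotic confidence limits for the
*p*-values according to

$$
\hat{p} \pm z_{\alpha/2} \, \mathrm{se}(\hat{p})
$$

where $z_{\alpha/2}$ is the $100(1 - \alpha/2)$ percentile of the
standard normal distribution, and the confidence level is
determined by the ALPHA= option in the EXACT statement.

When the Monte Carlo estimate equals 0, then PROC
NPAR1WAY computes the confidence limits for the *p*-value as

$$
\left( 0, \; 1 - \alpha^{1/N} \right)
$$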
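
The confidence limits for a Monte Carlo estimate can be sketched as follows. The code assumes a standard error of sqrt(p̂(1 − p̂)/(N − 1)), normal-theory limits p̂ ± z·se, and the limits (0, 1 − α^(1/N)) when the estimate is 0. The function name and the default `alpha=0.05` are hypothetical and are not taken from the procedure. An estimate equal to 1 is not treated specially here.

```python
from math import sqrt
from statistics import NormalDist


def monte_carlo_confidence_limits(m, n, alpha=0.05):
    """Return (lower, upper) limits for a Monte Carlo p-value estimate m/n."""
    p_hat = m / n
    if m == 0:
        # No sampled table reached the observed statistic.
        return 0.0, 1.0 - alpha ** (1.0 / n)
    se = sqrt(p_hat * (1.0 - p_hat) / (n - 1))
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # standard normal percentile
    return p_hat - z * se, p_hat + z * se


print(monte_carlo_confidence_limits(50, 101))
print(monte_carlo_confidence_limits(0, 100))
```

Because the standard error shrinks roughly with the square root of the number of samples, halving the width of the interval takes about four times as many sampled tables.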


Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.