
Introduction to Categorical Data Analysis Procedures

Sometimes the observed data do not come from a random sample but instead represent a complete set of observations on some population. For example, suppose a class of 100 students is classified according to sex and favorite color. The results are shown in Table 5.4.

In this case, you could argue that all of the frequencies are fixed since the entire population is observed; therefore, there is no sampling error. On the other hand, you could hypothesize that the observed table has only fixed marginals and that the cell frequencies represent one realization of a conceptual process of assigning color preferences to individuals. The assignment process is open to hypothesis, which means that you can hypothesize restrictions on the joint probabilities.

| Sex \ Favorite Color | Red | Blue | Green | Total |
|----------------------|-----|------|-------|-------|
| Male                 | 16  | 21   | 20    | 57    |
| Female               | 12  | 20   | 11    | 43    |
| Total                | 28  | 41   | 31    | 100   |
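The "conceptual process of assigning color preferences" with fixed marginals can be imitated by shuffling the 100 color labels and dealing them out to the 57 males and 43 females. The following is only a sketch of that idea in Python, not anything PROC FREQ itself runs; the function name `random_table` and the seed are illustrative assumptions.

```python
import random
from collections import Counter

COLORS = ("Red", "Blue", "Green")
ROW_TOTALS = {"Male": 57, "Female": 43}             # row margins from the table
COL_TOTALS = {"Red": 28, "Blue": 41, "Green": 31}   # column margins from the table


def random_table(rng):
    """Return one random table that keeps both sets of margins fixed.

    Hypothetical helper: it draws each row as a simple random sample
    (without replacement) from the pool of color labels.
    """
    # One label per student: 28 "Red", 41 "Blue", 31 "Green".
    labels = [color for color in COLORS for _ in range(COL_TOTALS[color])]
    rng.shuffle(labels)  # random assignment of preferences to students
    table, start = {}, 0
    for sex, size in ROW_TOTALS.items():
        counts = Counter(labels[start:start + size])
        table[sex] = [counts[color] for color in COLORS]
        start += size
    return table


rng = random.Random(0)  # arbitrary seed, used only for reproducibility
for _ in range(3):
    print(random_table(rng))
```

Every table produced this way has row totals 57 and 43 and column totals 28, 41, and 31; only the interior cell frequencies vary from one realization to the next.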

The usual hypothesis (sometimes called *randomness*)
is that the distribution of the column variable (Favorite
Color) does not depend on the row variable (Sex).
This implies that, for each row of the table, the
assignment process corresponds to a simple random
sample (without replacement) from the finite population
represented by the column marginal totals (or by the column
marginal subtotals that remain after sampling other rows).
The hypothesis of randomness induces a probability
distribution on the frequencies in the table;
it is called the *hypergeometric distribution*
(in its multivariate form when a variable has more than two levels).
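As a rough illustration of the distribution that randomness induces, the sketch below computes the expected cell frequencies (row total times column total, divided by the table total) and the probability of the observed table given its margins, which is the product of the factorials of all marginal totals divided by the product of the factorial of the table total and the factorials of all cell frequencies. The function names are illustrative assumptions, and exact rational arithmetic is used only to avoid overflow and rounding.

```python
from fractions import Fraction
from math import factorial, prod

# Observed frequencies from the Sex by Favorite Color table.
observed = [[16, 21, 20],
            [12, 20, 11]]


def expected_counts(table):
    """Expected cell frequencies under randomness: row total * column total / n."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return [[Fraction(r * c, n) for c in cols] for r in rows]


def table_probability(table):
    """Probability of this exact table, given its margins, under randomness."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    numerator = prod(factorial(m) for m in rows + cols)
    denominator = factorial(n) * prod(factorial(x) for row in table for x in row)
    return Fraction(numerator, denominator)


for row in expected_counts(observed):
    print([float(e) for e in row])
print(float(table_probability(observed)))
```

For a table with two rows, the same probability can be written as a product of binomial coefficients, one per column, divided by the binomial coefficient for choosing the first row from the whole population.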

If the same row and column variables are observed for
each of several populations, then the probability
distribution of all the frequencies can be called
the *multiple hypergeometric distribution.*
Each population is called a *stratum*, and an analysis
that draws information from each stratum and then summarizes
across them is called a *stratified analysis* (or a
*blocked analysis* or a *matched analysis*).
PROC FREQ does such a stratified analysis,
computing test statistics and measures of association.

In general, the populations are formed on the basis of cross-classifications of independent variables. Stratified analysis is a method of adjusting for the effect of these variables without being forced to estimate parameters for them.

The multiple hypergeometric distribution is the one used by PROC
FREQ for the computation of Cochran-Mantel-Haenszel statistics.
These statistics are in the class of
*randomization model test statistics*,
which require minimal assumptions for their validity.
PROC FREQ uses the multiple hypergeometric distribution
to compute the mean and the covariance matrix of a
function vector in order to measure the deviation
between the observed and expected frequencies with
respect to a particular type of alternative hypothesis.
If the cell frequencies are sufficiently large, then the
function vector is approximately normally distributed as
a result of the central limit theorem, and PROC FREQ uses this
result to compute a quadratic form that has an approximate chi-square
distribution when the null hypothesis is true.
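The computation described above can be sketched for the simplest case. The Python code below is an illustration under stated assumptions, not PROC FREQ's implementation: it handles only strata that are 2 x 3 tables, takes the function vector to be the first two cells of the first row (the general association alternative), and sums the deviations and the hypergeometric covariance matrices across strata before forming the quadratic form. All function names are hypothetical.

```python
from math import exp

# The single stratum given by the Sex by Favorite Color table.
observed = [[16, 21, 20],
            [12, 20, 11]]


def cmh_general_association(strata):
    """Quadratic form d' V^-1 d for a list of 2 x 3 tables (one per stratum)."""
    d = [0.0, 0.0]                    # summed observed minus expected
    v = [[0.0, 0.0], [0.0, 0.0]]      # summed covariance matrix
    for table in strata:
        rows = [sum(r) for r in table]
        cols = [sum(c) for c in zip(*table)]
        n = sum(rows)
        scale = rows[0] * rows[1] / (n * n * (n - 1))
        for j in range(2):
            d[j] += table[0][j] - rows[0] * cols[j] / n
            for k in range(2):
                # Variance on the diagonal, negative covariance off it.
                v[j][k] += scale * cols[j] * ((n if j == k else 0) - cols[k])
    det = v[0][0] * v[1][1] - v[0][1] * v[1][0]
    return (v[1][1] * d[0] ** 2
            - 2 * v[0][1] * d[0] * d[1]
            + v[0][0] * d[1] ** 2) / det


def pearson_chi_square(table):
    """Ordinary Pearson chi-square, included only for comparison."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return sum((table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i in range(len(rows)) for j in range(len(cols)))


q = cmh_general_association([observed])
p_value = exp(-q / 2)  # chi-square upper tail with 2 degrees of freedom
print(q, p_value, pearson_chi_square(observed))
```

With a single stratum of size n, this quadratic form equals (n - 1)/n times the Pearson chi-square, which is why the two printed statistics are close for the 100 students. Passing several tables in the list gives the stratified version, in which each stratum contributes its own deviations and covariance matrix.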


Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.