
Introduction to Categorical Data Analysis Procedures

Several procedures in SAS/STAT software can be used for the analysis of categorical data:

- CATMOD
- fits linear models to functions of categorical data, facilitating
such analyses as regression, analysis of variance, linear modeling,
log-linear modeling, logistic regression, and repeated measures
analysis. Maximum likelihood estimation is used for the analysis
of logits and generalized logits, and weighted least squares
analysis is used for fitting models to other response functions.
- CORRESP
- performs simple and multiple correspondence analyses, using a
contingency table, Burt table, binary table, or raw categorical data as
input. For more on PROC CORRESP, see Chapter 6, "Introduction to Multivariate Procedures,"
and Chapter 24, "The CORRESP Procedure."
- FREQ
- builds frequency tables or contingency tables and produces
numerous tests and measures of association including
chi-square statistics, odds ratios, correlation
statistics, and Fisher's exact test for two-way tables of any size. In
addition, it performs stratified analysis, computing
Cochran-Mantel-Haenszel statistics and estimates of the
common relative risk. It also performs a test of binomial
proportions and computes tests and measures of agreement such as
McNemar's test, kappa, and weighted kappa.
- GENMOD
- fits generalized linear models with maximum-likelihood methods.
This family includes logistic, probit, and complementary log-log
regression models for binomial data,
Poisson regression models for count data, and multinomial models
for ordinal response data.
It performs likelihood ratio and Wald tests for type I, type III, and
user-defined contrasts. It analyzes repeated measures data with
generalized estimating equation (GEE) methods.
- LOGISTIC
- fits linear logistic regression models for binary
or ordinal response data with maximum-likelihood methods.
It performs stepwise regression and provides regression diagnostics.
The logit link function in the logistic regression models can be replaced
by the normit function or the complementary log-log function.
- PROBIT
- computes maximum-likelihood estimates of regression parameters and optional threshold parameters for binary or ordinal response data.

A *categorical variable* is defined as one that
can assume only a limited number of discrete values.
The measurement scale for such a variable is unrestricted.
It can be *nominal*, which means
that the observed levels are not ordered.
It can be *ordinal*, which means that
the observed levels are ordered in some way.
Or it can be *interval*, which means that the observed
levels are ordered and numeric and that any interval
of one unit on the scale of measurement represents the
same amount, regardless of its location on the scale.
One example of a categorical variable is litter size;
another is the number of times a subject has been married.
A variable that lies on a nominal scale is sometimes called
a *qualitative* or *classification variable*.
Categorical data result from observations on
multiple subjects where one or more categorical
variables are observed for each subject.
If there is only one categorical variable, then the data are
generally represented by a *frequency table*, which lists each
observed value of the variable and its frequency of occurrence.
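
As a rough illustration of a one-variable frequency table, the following Python sketch counts how often each value occurs. It is not SAS code and does not reproduce PROC FREQ output; the litter sizes are made-up data, and the names `litter_sizes` and `freq_table` are assumptions introduced for this example.

```python
from collections import Counter

# Hypothetical litter sizes for ten subjects (illustrative data only).
litter_sizes = [3, 5, 4, 3, 6, 4, 4, 5, 3, 4]

# A frequency table pairs each observed value with its frequency of occurrence.
freq_table = Counter(litter_sizes)

# List the observed values in ascending order with their counts.
for value in sorted(freq_table):
    print(f"{value}: {freq_table[value]}")
```

Only values that actually occur appear in the table, which matches the definition above: the table lists each observed value, not every possible one.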

If there are two or more categorical variables,
then a subject's *profile* is defined as the
subject's observed values for each of the variables.
Such categorical data can be represented by a frequency table
that lists each observed profile and its frequency of occurrence.

If there are exactly two categorical variables, then the data are
often represented by a two-dimensional *contingency table*,
which has one row for each level of variable 1 and one column for
each level of variable 2. The intersections of rows and columns,
called *cells*, correspond to variable profiles, and each
cell contains the frequency of occurrence of the corresponding profile.
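
The step from profiles to a two-dimensional contingency table can be sketched in Python as follows. This is an illustration under assumed data, not SAS code: the variables (sex and a yes/no response), the eight subjects, and the names `subjects`, `profile_counts`, and `table` are all hypothetical.

```python
from collections import Counter

# Hypothetical profiles (variable 1 = sex, variable 2 = response); illustrative only.
subjects = [
    ("F", "yes"), ("F", "no"), ("M", "yes"), ("F", "yes"),
    ("M", "no"), ("M", "no"), ("F", "yes"), ("M", "yes"),
]

# Frequency of occurrence of each observed profile.
profile_counts = Counter(subjects)

# One row per level of variable 1, one column per level of variable 2.
row_levels = sorted({row for row, _ in subjects})
col_levels = sorted({col for _, col in subjects})

# Each cell holds the frequency of the corresponding profile (0 if never observed).
table = [[profile_counts[(r, c)] for c in col_levels] for r in row_levels]

print("     " + "  ".join(f"{c:>3}" for c in col_levels))
for r, counts in zip(row_levels, table):
    print(f"{r:>3}  " + "  ".join(f"{n:>3}" for n in counts))
```

The profile frequency table (`profile_counts`) and the contingency table (`table`) carry the same information; the contingency table simply arranges the counts by row and column level.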

If there are more than two categorical variables, then the data
can be represented by a *multidimensional contingency table*.
There are two commonly used methods for displaying such tables, and
both require that the variables be divided into two sets.

In the first method, one set contains a row variable and a column variable for a two-dimensional contingency table, and the second set contains all of the other variables. The variables in the second set are used to form a set of profiles. Thus, the data are represented as a series of two-dimensional contingency tables, one for each profile. This is the data representation used by PROC FREQ. For example, if you request tables for RACE*SEX*AGE*INCOME, the FREQ procedure represents the data as a series of contingency tables: the row variable is AGE, the column variable is INCOME, and the combinations of levels of RACE and SEX form a set of profiles.
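
A minimal Python sketch of this first representation is shown below. It mimics only the grouping idea (one AGE by INCOME table per RACE and SEX profile) and makes no claim about how PROC FREQ is implemented; the records, level names, and the name `strata` are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical records of (race, sex, age, income); illustrative data only.
records = [
    ("A", "F", "young", "low"),
    ("A", "F", "young", "high"),
    ("A", "F", "old", "high"),
    ("A", "M", "young", "low"),
    ("B", "F", "old", "low"),
    ("B", "F", "old", "low"),
    ("B", "M", "young", "high"),
]

# The first two variables form the profile that selects a table;
# the last two are the row and column variables within that table.
strata = defaultdict(Counter)
for race, sex, age, income in records:
    strata[(race, sex)][(age, income)] += 1

# Print one small age-by-income table per (race, sex) profile.
for profile in sorted(strata):
    print(f"race={profile[0]}, sex={profile[1]}")
    for (age, income), count in sorted(strata[profile].items()):
        print(f"  age={age}, income={income}: {count}")
```

Each key of `strata` corresponds to one two-dimensional table in the series, so the number of tables equals the number of observed profiles of the second set of variables.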

In the second method, one set contains the independent variables,
and the other set contains the dependent variables. Profiles based on
the independent variables are called *population profiles*,
whereas those based on
the dependent variables are called *response
profiles*. A two-dimensional contingency table is then formed, with
one row for each population profile and one column for each response
profile. Since any subject can have only one population profile and
one response profile, the contingency table is uniquely defined. This
is the data representation used by PROC CATMOD.
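
The second representation can be sketched in Python in the same spirit. This is an illustration of the population-by-response layout only, not of PROC CATMOD itself; the independent variables (treatment and sex), the dependent variable (outcome), the data, and all names are hypothetical.

```python
from collections import Counter

# Each observation pairs a population profile (treatment, sex)
# with a response profile (outcome,). Illustrative data only.
observations = [
    (("drug", "F"), ("improved",)),
    (("drug", "F"), ("improved",)),
    (("drug", "M"), ("same",)),
    (("placebo", "F"), ("same",)),
    (("placebo", "F"), ("improved",)),
    (("placebo", "M"), ("same",)),
    (("drug", "M"), ("improved",)),
    (("placebo", "M"), ("same",)),
]

counts = Counter(observations)

# One row per population profile, one column per response profile.
populations = sorted({pop for pop, _ in observations})
responses = sorted({resp for _, resp in observations})
table = [[counts[(pop, resp)] for resp in responses] for pop in populations]

for pop, row in zip(populations, table):
    print(pop, row)
```

Because every subject contributes to exactly one (population, response) cell, each row total is the sample size for that population profile.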


Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.