| The UNIVARIATE Procedure |
PROC UNIVARIATE uses standard algorithms to compute the moment statistics (such as the mean, variance, skewness, and kurtosis). See SAS Elementary Statistics Procedures for the statistical formulas. The computational details for confidence limits, hypothesis test statistics, and quantile statistics follow.
| Confidence Limits for Parameters |
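A two-sided $100(1-\alpha)$ percent confidence interval for the mean has upper and lower limits
$$\bar{x} \pm t_{(1-\alpha/2;\,n-1)} \frac{s}{\sqrt{n}}$$
where $s$ is $\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}$ and $t_{(1-\alpha/2;\,n-1)}$ is the $(1-\alpha/2)$ critical value of the Student's t statistic with $n-1$ degrees of freedom.
A one-sided $100(1-\alpha)$ percent confidence interval is computed as
$$\bar{x} + t_{(1-\alpha;\,n-1)} \frac{s}{\sqrt{n}} \quad \text{(upper)}$$
$$\bar{x} - t_{(1-\alpha;\,n-1)} \frac{s}{\sqrt{n}} \quad \text{(lower)}$$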
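A two-sided $100(1-\alpha)$ percent confidence interval for the standard deviation has lower and upper limits
$$s\sqrt{\frac{n-1}{\chi^2_{(1-\alpha/2;\,n-1)}}}, \qquad s\sqrt{\frac{n-1}{\chi^2_{(\alpha/2;\,n-1)}}}$$
where $\chi^2_{(1-\alpha/2;\,n-1)}$ and $\chi^2_{(\alpha/2;\,n-1)}$ are the $(1-\alpha/2)$ and $\alpha/2$ critical values of the chi-square statistic with $n-1$ degrees of freedom. A one-sided $100(1-\alpha)$ percent confidence interval is computed by replacing $\alpha/2$ with $\alpha$.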
A $100(1-\alpha)$ percent confidence interval for the variance has upper and lower limits equal to the squares of the corresponding upper and lower limits for the standard deviation.
When you use the WEIGHT statement and VARDEF=DF in the PROC statement, the $100(1-\alpha)$ percent confidence interval for the weighted mean has upper and lower limits
$$\bar{y}_w \pm t_{(1-\alpha/2)} \frac{s_w}{\sqrt{\sum_{i=1}^{n} w_i}}$$
where $\bar{y}_w$ is the weighted mean, $s_w$ is the weighted standard deviation, $w_i$ is the weight for the $i$th observation, and $t_{(1-\alpha/2)}$ is the $(1-\alpha/2)$ critical value for the Student's t distribution with $n-1$ degrees of freedom.
| Tests for Location |
PROC UNIVARIATE computes tests for location that include the Student's t test, the sign test, and the Wilcoxon signed rank test. All three tests produce a test statistic for the null hypothesis that the mean or median is equal to a given value $\mu_0$ against the two-sided alternative that the mean or median is not equal to $\mu_0$. By default, PROC UNIVARIATE sets the value of $\mu_0$ to zero. Use the MU0= option in the PROC UNIVARIATE statement to test that the mean or median is equal to another value.
The Student's t test is appropriate when the data are from an approximately normal population; otherwise, use nonparametric tests such as the sign test or the signed rank test. For large sample situations, the Student's t test is asymptotically equivalent to a z test.
If you use the WEIGHT statement, PROC UNIVARIATE computes only one weighted test for location, the Student's t test. You must use the default value for the VARDEF= option in the PROC statement.
You can also compare
means or medians of paired data. Data are said to be paired when
subjects or units are matched in pairs according to one or more variables,
such as pairs of subjects with the same age and gender. Paired data also
occur when each subject or unit is measured at two times or under two conditions.
To compare the means or medians of the two times, create an analysis variable
that is the difference between the two measures. Testing that the mean or median
of the new analysis variable equals zero is equivalent to testing that the means
or medians of the two original variables are equal. For an example, see Performing a Sign Test Using Paired Data.
PROC UNIVARIATE calculates the t statistic as
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
where $\bar{x}$ is the sample mean, $n$ is the number of nonmissing values for a variable, and $s$ is the sample standard deviation. Under the null hypothesis, the population mean equals $\mu_0$. When the data values are approximately normally distributed, the probability under the null hypothesis of a t statistic as extreme as, or more extreme than, the observed value (the p-value) is obtained from the t distribution with $n-1$ degrees of freedom. For large $n$, the t statistic is asymptotically equivalent to a z test.
When you use the WEIGHT statement and the default value of VARDEF=, which is DF, the Student's t statistic is calculated as
$$t_w = \frac{\bar{y}_w - \mu_0}{s_w \big/ \sqrt{\sum_{i=1}^{n} w_i}}$$
where $\bar{y}_w$ is the weighted mean, $s_w$ is the weighted standard deviation, and $w_i$ is the weight for the $i$th observation. The $t_w$ statistic is treated as having a Student's t distribution with $n-1$ degrees of freedom. If you specify the EXCLNPWGT option in the PROC statement, $n$ is the number of nonmissing observations when the value of the WEIGHT variable is positive. By default, $n$ is the number of nonmissing observations for the WEIGHT variable.
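The weighted t statistic can be sketched in a few lines of Python. This is an illustration only, not PROC UNIVARIATE's implementation: the function name `weighted_t` is hypothetical, and the weighted variance is assumed to be the weighted sum of squared deviations divided by $n-1$, which is how the VARDEF=DF divisor is understood here.

```python
import math

def weighted_t(xs, ws, mu0=0.0):
    """Weighted Student's t statistic for H0: mean == mu0 (sketch)."""
    n = len(xs)
    sum_w = sum(ws)
    # Weighted mean: sum(w_i * x_i) / sum(w_i)
    mean_w = sum(w * x for x, w in zip(xs, ws)) / sum_w
    # Assumption: VARDEF=DF divides the weighted sum of squares by n - 1
    var_w = sum(w * (x - mean_w) ** 2 for x, w in zip(xs, ws)) / (n - 1)
    std_err = math.sqrt(var_w) / math.sqrt(sum_w)
    return (mean_w - mu0) / std_err

data = [1, 2, 3, 4, 5]
t_unit = weighted_t(data, [1, 1, 1, 1, 1], mu0=2)
t_double = weighted_t(data, [2, 2, 2, 2, 2], mu0=2)
```

With unit weights the result reduces to the ordinary t statistic, and under the assumed variance definition, multiplying every weight by the same constant leaves the statistic unchanged.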
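PROC UNIVARIATE calculates the sign test statistic as
$$M = \frac{n^{+} - n^{-}}{2}$$
where $n^{+}$ is the number of values that are greater than $\mu_0$ and $n^{-}$ is the number of values that are less than $\mu_0$. Values equal to $\mu_0$ are discarded.
Under the null hypothesis that the population median is equal to $\mu_0$, the p-value for the observed statistic M is
$$\Pr(|M| \ge |M_{\mathrm{obs}}|) = 0.5^{(n_t - 1)} \sum_{j=0}^{\min(n^{+},\,n^{-})} \binom{n_t}{j}$$
where $n_t = n^{+} + n^{-}$ is the number of $x_i$ values not equal to $\mu_0$.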
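As a minimal sketch of the sign test computation, the following Python function counts the values above and below the hypothesized median and sums the binomial tail. The name `sign_test` is hypothetical, and capping the p-value at 1 is a guard added here (the doubled tail can exceed 1 when the two counts are equal); it is not taken from the procedure's documentation.

```python
from math import comb

def sign_test(xs, mu0=0.0):
    """Return (M, p_value) for the two-sided sign test (sketch)."""
    n_plus = sum(1 for x in xs if x > mu0)
    n_minus = sum(1 for x in xs if x < mu0)
    n_t = n_plus + n_minus          # values equal to mu0 are discarded
    m = (n_plus - n_minus) / 2
    if n_t == 0:
        return m, 1.0
    tail = sum(comb(n_t, j) for j in range(min(n_plus, n_minus) + 1))
    # Added guard: the doubled tail can exceed 1 when n_plus == n_minus
    p_value = min(1.0, 0.5 ** (n_t - 1) * tail)
    return m, p_value

m_stat, p_val = sign_test([1, 2, 3, 4, -1, 0])
```

For this sample, four values lie above zero, one lies below, and the zero is dropped, so the p-value is $0.5^4 (1 + 5) = 0.375$.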
PROC UNIVARIATE calculates the Wilcoxon signed rank test statistic as
$$S = \sum r_i^{+} - \frac{n_t(n_t+1)}{4}$$
where $r_i^{+}$ is the rank of $|x_i - \mu_0|$ after discarding values of $x_i$ equal to $\mu_0$, $n_t$ is the number of $x_i$ values not equal to $\mu_0$, and the sum is calculated for values of $x_i - \mu_0$ greater than 0. Average ranks are used for tied values.
The p-value is the probability of obtaining a signed rank statistic greater in absolute value than the absolute value of the observed statistic S. If $n_t \le 20$, the significance level of $S$ is computed from the exact distribution of $S$, which can be enumerated under the null hypothesis that the distribution is symmetric about $\mu_0$. When $n_t > 20$, the significance level of $S$ is computed by treating
$$S\sqrt{\frac{n_t - 1}{n_t V - S^2}}$$
as a Student's t variate with $n_t - 1$ degrees of freedom. $V$ is computed as
$$V = \frac{1}{24}\, n_t(n_t+1)(2n_t+1) - \frac{1}{48} \sum t_i(t_i+1)(t_i-1)$$
where the sum is calculated over groups that are tied in absolute value, and $t_i$ is the number of tied values in the $i$th group (Iman 1974; Conover 1980).
The Wilcoxon signed rank test assumes that the distribution is symmetric.
If the assumption is not valid, you can use the sign test to test that the
median is $\mu_0$. See Lehmann (1975) for more details.
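The signed rank statistic with average ranks for ties, together with the tie-corrected variance term used in the large-sample approximation, might be computed as in the sketch below. The function name `signed_rank` is hypothetical, ties are detected by exact equality of absolute differences (adequate for an illustration, fragile for floating-point data), and the exact-distribution p-value for small samples is not attempted.

```python
def signed_rank(xs, mu0=0.0):
    """Return (S, V, n_t) for the Wilcoxon signed rank test (sketch)."""
    diffs = [x - mu0 for x in xs if x != mu0]   # discard values equal to mu0
    n_t = len(diffs)
    order = sorted(range(n_t), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n_t
    tie_sizes = []
    start = 0
    while start < n_t:
        end = start
        while end + 1 < n_t and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        # Average of the 1-based ranks start+1 .. end+1
        avg_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = avg_rank
        tie_sizes.append(end - start + 1)
        start = end + 1
    s = sum(r for r, d in zip(ranks, diffs) if d > 0) - n_t * (n_t + 1) / 4
    v = (n_t * (n_t + 1) * (2 * n_t + 1) / 24
         - sum(t * (t + 1) * (t - 1) for t in tie_sizes) / 48)
    return s, v, n_t

s_stat, v_stat, n_used = signed_rank([1, 2, 3, 4, -1])
```

In this sample the absolute values 1 and 1 tie and share rank 1.5, so the positive ranks sum to 13.5 and the statistic is 13.5 - 7.5 = 6.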
| Tests for Normality |
When you specify the NORMAL option, the procedure computes test statistics for the null hypothesis that the values of the analysis variable are a random sample from a normal distribution. The test statistics depend on the sample size. The procedure calculates goodness-of-fit tests based on the empirical distribution function (EDF): the Kolmogorov-Smirnov D statistic, the Anderson-Darling statistic, and the Cramer-von Mises statistic. In addition, if the sample size is less than or equal to 2000, PROC UNIVARIATE calculates the Shapiro-Wilk W statistic.
You determine whether to reject the null hypothesis of normality by examining the probability that is associated with a test statistic. A small p-value indicates nonnormal data. When the p-value is less than the predetermined critical value (alpha value), you reject the null hypothesis and conclude that the data do not come from a normal distribution.
If you want to test the normality assumptions underlying analysis of
variance methods, beware of using a statistical test for normality alone.
A test's ability to reject the null hypothesis (known as the power
of the test) increases with the sample size. As the sample size becomes larger,
increasingly smaller departures from normality can be detected. Since small
deviations from normality do not severely affect the validity of analysis
of variance tests, it is important to examine other statistics and plots to
make a final assessment of normality. The skewness and kurtosis measures
and the plots that are provided by the PLOTS option can be very helpful.
For small sample sizes, power is low for detecting larger departures from
normality that may be important. To increase the test's ability to detect
such deviations, you may want to declare significance at higher levels, such
as 0.15 or 0.20 rather than the often-used 0.05 level. Again, consulting
plots and additional statistics will help you assess the severity of the deviations
from normality.
If the sample size is less than or equal to 2000, PROC UNIVARIATE computes the Shapiro-Wilk statistic, W. The W statistic is the ratio of the best estimator of the variance (based on the square of a linear combination of the order statistics) to the usual corrected sum of squares estimator of the variance (Shapiro, 1965). W must be greater than zero and less than or equal to one. Small values of W lead to the rejection of the null hypothesis of normality. The distribution of W is highly skewed. Seemingly large values of W (such as 0.90) may be considered small and lead you to reject the null hypothesis. When the sample size is greater than three, the coefficients to compute the linear combination of the order statistics are approximated by the method of Royston (1992).
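The W statistic is transformed to an approximately standard normal variate:
$$Z_n = \frac{-\log\left(\gamma - \log(1 - W_n)\right) - \mu}{\sigma}$$
when $4 \le n \le 11$ and
$$Z_n = \frac{\log(1 - W_n) - \mu}{\sigma}$$
when $12 \le n \le 2000$, where $\gamma$, $\mu$, and $\sigma$ are functions of $n$, obtained from simulation results, and $Z_n$ is a standard normal variate. Large values of $Z_n$ indicate departure from normality.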
When you fit a normal distribution, PROC UNIVARIATE provides goodness-of-fit tests that are based on the empirical distribution function (EDF). The empirical distribution function is defined for a set of $n$ independent observations $X_1, \ldots, X_n$ with a common distribution function $F(x)$. Denote the observations that are ordered from smallest to largest as $X_{(1)}, \ldots, X_{(n)}$. The empirical distribution function, $F_n(x)$, is defined as
$$F_n(x) = \begin{cases} 0, & x < X_{(1)} \\ i/n, & X_{(i)} \le x < X_{(i+1)},\ i = 1, \ldots, n-1 \\ 1, & X_{(n)} \le x \end{cases}$$
Note that $F_n(x)$ is a step function that takes a step of height $1/n$ at each observation. This function estimates the distribution function $F(x)$. At any value $x$, $F_n(x)$ is the proportion of observations that are less than or equal to $x$, while $F(x)$ is the theoretical probability of an observation that is less than or equal to $x$. EDF statistics measure the discrepancy between $F_n(x)$ and $F(x)$.
The computational formulas for the EDF statistics use the probability integral transformation $U = F(X)$. If $F(X)$ is the distribution function of $X$, the random variable $U$ is uniformly distributed between 0 and 1.
Given $n$ observations $X_{(1)}, \ldots, X_{(n)}$, PROC UNIVARIATE computes the values $U_{(i)} = F(X_{(i)})$ by applying the transformation.
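The NORMAL option in the PROC UNIVARIATE statement provides the following EDF tests: Kolmogorov-Smirnov, Anderson-Darling, and Cramer-von Mises. These tests are based on various measures of the discrepancy between the empirical distribution function $F_n(x)$ and the proposed normal cumulative distribution function $F(x)$.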
Once the EDF test statistics are computed, the associated p-values
must be calculated. PROC UNIVARIATE uses internal tables of probability levels
that are similar to those given by D'Agostino and Stephens (1986). If the
value is between two probability levels, then linear interpolation is used
to estimate the probability value.
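The Kolmogorov-Smirnov statistic (D) is defined as
$$D = \sup_x \left| F_n(x) - F(x) \right|$$
The Kolmogorov-Smirnov statistic belongs to the supremum class of EDF statistics. This class of statistics is based on the largest vertical difference between $F(x)$ and $F_n(x)$.
The Kolmogorov-Smirnov statistic is computed as the maximum of $D^{+}$ and $D^{-}$. $D^{+}$ is the largest vertical distance between the EDF and the distribution function when the EDF is greater than the distribution function. $D^{-}$ is the largest vertical distance when the EDF is less than the distribution function.
$$D^{+} = \max_i\left(\frac{i}{n} - U_{(i)}\right), \qquad D^{-} = \max_i\left(U_{(i)} - \frac{i-1}{n}\right), \qquad D = \max(D^{+}, D^{-})$$
where $U_{(i)} = F(X_{(i)})$ are the transformed ordered observations.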
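The following Python sketch illustrates how D could be computed from the transformed ordered values $U_{(i)}$. The function names are hypothetical, and `ks_normal` assumes the fitted normal distribution uses the sample mean and the sample standard deviation with an $n-1$ divisor. Only the statistic is computed; the p-value lookup is not reproduced.

```python
from statistics import NormalDist, mean, stdev

def ks_from_uniform(u_sorted):
    """Kolmogorov-Smirnov D from ordered values U_(1) <= ... <= U_(n)."""
    n = len(u_sorted)
    # D+ : EDF above the distribution function (i is 1-based in the formula)
    d_plus = max((i + 1) / n - u for i, u in enumerate(u_sorted))
    # D- : EDF below the distribution function
    d_minus = max(u - i / n for i, u in enumerate(u_sorted))
    return max(d_plus, d_minus)

def ks_normal(xs):
    """D against a normal distribution fitted with the sample mean and std."""
    fitted = NormalDist(mean(xs), stdev(xs))
    return ks_from_uniform([fitted.cdf(x) for x in sorted(xs)])

d_example = ks_from_uniform([0.1, 0.5, 0.9])
d_data = ks_normal([1, 2, 3, 4, 5])
```

For the three values 0.1, 0.5, and 0.9, both one-sided distances equal 7/30, so D is 7/30.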
PROC UNIVARIATE uses a modified Kolmogorov D statistic to test the data against a normal distribution with mean and variance equal to the sample mean and variance.
The procedure compares the calculated statistic against a set of five critical
values. If the calculated value falls outside the interval bounded by the two
extreme critical values, the probability of obtaining a more extreme value is
shown as >.15 or <.01.
The Anderson-Darling statistic and the Cramer-von Mises statistic belong to the quadratic class of EDF statistics. This class of statistics is based on the squared difference $(F_n(x) - F(x))^2$. Quadratic statistics have the following general form:
$$Q = n \int_{-\infty}^{+\infty} \left(F_n(x) - F(x)\right)^2 \psi(x)\, dF(x)$$
The function $\psi(x)$ weights the squared difference $(F_n(x) - F(x))^2$.
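The Anderson-Darling statistic ($A^2$) is defined as
$$A^2 = n \int_{-\infty}^{+\infty} \left(F_n(x) - F(x)\right)^2 \left[F(x)\left(1 - F(x)\right)\right]^{-1} dF(x)$$
where the weight function is $\psi(x) = \left[F(x)\left(1 - F(x)\right)\right]^{-1}$.
The Anderson-Darling statistic is computed as
$$A^2 = -n - \frac{1}{n} \sum_{i=1}^{n} \left[(2i-1)\log U_{(i)} + (2n+1-2i)\log\left(1 - U_{(i)}\right)\right]$$
The Cramer-von Mises statistic ($W^2$) is defined as
$$W^2 = n \int_{-\infty}^{+\infty} \left(F_n(x) - F(x)\right)^2 dF(x)$$
where the weight function is $\psi(x) = 1$.
The Cramer-von Mises statistic is computed as
$$W^2 = \sum_{i=1}^{n} \left(U_{(i)} - \frac{2i-1}{2n}\right)^2 + \frac{1}{12n}$$
where $U_{(i)} = F(X_{(i)})$ are the transformed ordered observations.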
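The two quadratic EDF statistics have simple computing forms once the ordered values $U_{(i)} = F(X_{(i)})$ are available. The sketch below uses the standard computing formulas for the Anderson-Darling and Cramer-von Mises statistics; the function names are hypothetical, and every $U_{(i)}$ is assumed to lie strictly between 0 and 1 so that the logarithms are defined.

```python
import math

def anderson_darling(u_sorted):
    """A^2 from ordered values U_(1) <= ... <= U_(n), each in (0, 1)."""
    n = len(u_sorted)
    total = sum(
        (2 * i - 1) * math.log(u) + (2 * n + 1 - 2 * i) * math.log(1 - u)
        for i, u in enumerate(u_sorted, start=1)
    )
    return -n - total / n

def cramer_von_mises(u_sorted):
    """W^2 from ordered values U_(1) <= ... <= U_(n)."""
    n = len(u_sorted)
    return sum(
        (u - (2 * i - 1) / (2 * n)) ** 2
        for i, u in enumerate(u_sorted, start=1)
    ) + 1 / (12 * n)

u_values = [0.1, 0.5, 0.9]
a2 = anderson_darling(u_values)
w2 = cramer_von_mises(u_values)
```

Both statistics shrink toward their minimum as the $U_{(i)}$ approach the evenly spaced values $(2i-1)/(2n)$, which is what a good fit looks like on the uniform scale.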
| Robust Estimators |
A statistical method is robust if the
method is insensitive to slight departures from the assumptions that justify
the method. PROC UNIVARIATE provides several methods for robust estimation
of location and scale.
When outliers are present in the data, the Winsorized mean is a robust estimator of the location that is relatively insensitive to the outlying values. The $k$-times Winsorized mean is calculated as
$$\bar{x}_{wk} = \frac{1}{n}\left((k+1)\,x_{(k+1)} + \sum_{i=k+2}^{n-k-1} x_{(i)} + (k+1)\,x_{(n-k)}\right)$$
The Winsorized mean is computed after the $k$ smallest observations are replaced by the $(k+1)$st smallest observation, and the $k$ largest observations are replaced by the $(k+1)$st largest observation. In other words, the observations are Winsorized at each end, so all $n$ values are still used in the computations.
For a symmetric distribution, the symmetrically Winsorized mean is an unbiased estimate of the population mean. But the Winsorized mean does not have a normal distribution even if the data are from a normal population.
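The Winsorized sum of squared deviations is defined as
$$s^2_{wk} = (k+1)\left(x_{(k+1)} - \bar{x}_{wk}\right)^2 + \sum_{i=k+2}^{n-k-1}\left(x_{(i)} - \bar{x}_{wk}\right)^2 + (k+1)\left(x_{(n-k)} - \bar{x}_{wk}\right)^2$$
where $\bar{x}_{wk}$ is the $k$-times Winsorized mean and $x_{(i)}$ is the $i$th ordered observation.
A Winsorized t test is given by
$$t_{wk} = \frac{\bar{x}_{wk} - \mu_0}{\mathrm{STDERR}(\bar{x}_{wk})}$$
where the standard error of the Winsorized mean is
$$\mathrm{STDERR}(\bar{x}_{wk}) = \frac{n-1}{n-2k-1} \cdot \frac{s_{wk}}{\sqrt{n(n-1)}}$$
When the data are from a symmetric distribution, the distribution of the Winsorized t statistic $t_{wk}$ is approximated by a Student's t distribution with $n-2k-1$ degrees of freedom (Tukey and McLaughlin 1963, Dixon and Tukey 1968).
A $100(1-\alpha)$ percent confidence interval for the Winsorized mean has upper and lower limits
$$\bar{x}_{wk} \pm t_{(1-\alpha/2)}\, \mathrm{STDERR}(\bar{x}_{wk})$$
and the $(1-\alpha/2)$ critical value of the Student's t statistic has $n-2k-1$ degrees of freedom.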
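A small Python sketch may help make the Winsorizing step concrete. It builds the Winsorized sample explicitly, then derives the mean, the sum of squared deviations, and the standard error following the Tukey and McLaughlin construction. The function name `winsorized_stats` is hypothetical, and $k$ is assumed to be given directly as a count rather than as a proportion.

```python
import math

def winsorized_stats(xs, k):
    """Return (mean, sum of squared deviations, std error) after k-times Winsorizing."""
    x = sorted(xs)
    n = len(x)
    if k < 0 or 2 * k + 1 >= n:
        raise ValueError("k must satisfy 0 <= k and n - 2k - 1 > 0")
    # Replace the k smallest values by x_(k+1) and the k largest by x_(n-k)
    w = [x[k]] * k + x[k:n - k] + [x[n - k - 1]] * k
    mean_wk = sum(w) / n
    ss_wk = sum((v - mean_wk) ** 2 for v in w)
    std_err = (n - 1) / (n - 2 * k - 1) * math.sqrt(ss_wk) / math.sqrt(n * (n - 1))
    return mean_wk, ss_wk, std_err

w_mean, w_ss, w_se = winsorized_stats([1, 2, 3, 4, 100], k=1)
```

Here the outlier 100 is pulled in to 4 and the value 1 is pulled in to 2, so the Winsorized sample is 2, 2, 3, 4, 4 with mean 3.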
When outliers are present in the data, the trimmed mean is a robust estimator of the location that is relatively insensitive to the outlying values. The $k$-times trimmed mean is calculated as
$$\bar{x}_{tk} = \frac{1}{n-2k} \sum_{i=k+1}^{n-k} x_{(i)}$$
The trimmed mean is computed after the $k$ smallest and $k$ largest observations are deleted from the sample. In other words, the observations are trimmed at each end.
For a symmetric distribution, the symmetrically trimmed mean is an unbiased estimate of the population mean. But the trimmed mean does not have a normal distribution even if the data are from a normal population.
A robust estimate of the variance of the trimmed mean $\bar{x}_{tk}$ can be based on the Winsorized sum of squared deviations $s^2_{wk}$ (Tukey and McLaughlin 1963). The resulting trimmed t test is given by
$$t_{tk} = \frac{\bar{x}_{tk} - \mu_0}{\mathrm{STDERR}(\bar{x}_{tk})}$$
where the standard error of the trimmed mean is
$$\mathrm{STDERR}(\bar{x}_{tk}) = \frac{s_{wk}}{\sqrt{(n-2k)(n-2k-1)}}$$
and $s_{wk}$ is the square root of the Winsorized sum of squared deviations.
When the data are from a symmetric distribution, the distribution of the trimmed t statistic $t_{tk}$ is approximated by a Student's t distribution with $n-2k-1$ degrees of freedom (Tukey and McLaughlin 1963, Dixon and Tukey 1968).
A $100(1-\alpha)$ percent confidence interval for the trimmed mean has upper and lower limits
$$\bar{x}_{tk} \pm t_{(1-\alpha/2)}\, \mathrm{STDERR}(\bar{x}_{tk})$$
and the $(1-\alpha/2)$ critical value of the Student's t statistic has $n-2k-1$ degrees of freedom.
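The trimmed mean and its standard error can be sketched the same way. The function name `trimmed_stats` is hypothetical; following the Tukey and McLaughlin approach, the standard error is assumed to be the square root of the Winsorized sum of squared deviations divided by $\sqrt{(n-2k)(n-2k-1)}$.

```python
import math

def trimmed_stats(xs, k):
    """Return (trimmed mean, std error) after deleting k values at each end."""
    x = sorted(xs)
    n = len(x)
    if k < 0 or 2 * k + 1 >= n:
        raise ValueError("k must satisfy 0 <= k and n - 2k - 1 > 0")
    kept = x[k:n - k]
    mean_tk = sum(kept) / (n - 2 * k)
    # The variance estimate is based on the Winsorized sample, not the trimmed one
    w = [x[k]] * k + kept + [x[n - k - 1]] * k
    mean_wk = sum(w) / n
    ss_wk = sum((v - mean_wk) ** 2 for v in w)
    std_err = math.sqrt(ss_wk) / math.sqrt((n - 2 * k) * (n - 2 * k - 1))
    return mean_tk, std_err

t_mean, t_se = trimmed_stats([1, 2, 3, 4, 100], k=1)
```

With one value trimmed from each end, the mean of 2, 3, 4 is 3, while the untrimmed mean of the same sample is 22.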
The sample standard deviation is a commonly used estimator of the population scale. However, it is sensitive to outliers and may not remain bounded when a single data point is replaced by an arbitrary number. With robust scale estimators, the estimates remain bounded even when a portion of the data points are replaced by arbitrary numbers.
PROC UNIVARIATE computes robust measures of scale that include the interquartile range, Gini's mean difference G, MAD, $Q_n$, and $S_n$, with their corresponding estimates of $\sigma$.
The interquartile range is a simple robust scale estimator, which is the difference between the upper and lower quartiles. For a normal population, the standard deviation $\sigma$ can be estimated by dividing the interquartile range by 1.34898.
Gini's mean difference is also a robust estimator of the standard deviation $\sigma$. For a normal population, Gini's mean difference has expected value $2\sigma/\sqrt{\pi}$. Thus, multiplying Gini's mean difference by $\sqrt{\pi}/2$ yields a robust estimator of the standard deviation when the data are from a normal sample. The constructed estimator has high efficiency for the normal distribution relative to the usual sample standard deviation. It is also less sensitive to the presence of outliers than the sample standard deviation.
Gini's mean difference is computed as
$$G = \frac{1}{\binom{n}{2}} \sum_{i<j} |x_i - x_j|$$
If the observations are from a normal distribution, then $\sqrt{\pi}\,G/2$ is an unbiased estimator of the standard deviation $\sigma$.
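Gini's mean difference is the average absolute difference over all distinct pairs of observations, which translates directly into code. In this sketch the function names are hypothetical, and the straightforward pairwise loop is quadratic in the sample size, which is fine for illustration.

```python
import math
from itertools import combinations

def gini_mean_difference(xs):
    """Average of |x_i - x_j| over all pairs i < j."""
    total = sum(abs(a - b) for a, b in combinations(xs, 2))
    return total / math.comb(len(xs), 2)

def gini_sigma_estimate(xs):
    """Estimate of sigma for normal data: sqrt(pi) / 2 times G."""
    return math.sqrt(math.pi) / 2 * gini_mean_difference(xs)

g = gini_mean_difference([1, 2, 4])
sigma_g = gini_sigma_estimate([1, 2, 4])
```

For the sample 1, 2, 4 the three pairwise distances are 1, 3, and 2, so G is 2.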
A very robust scale estimator is the MAD, the median absolute deviation about the median (Hampel, 1974).
$$\mathrm{MAD} = \operatorname{med}_i\left(\left|x_i - \operatorname{med}_j(x_j)\right|\right)$$
where the inner median, $\operatorname{med}_j(x_j)$, is the median of the $n$ observations and the outer median, $\operatorname{med}_i$, is the median of the $n$ absolute values of the deviations about the median.
For a normal distribution, 1.4826·MAD can be used to estimate the standard deviation $\sigma$.
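As a quick illustration of why the MAD resists outliers, the sketch below computes it with Python's standard library. The function name `mad` is hypothetical.

```python
from statistics import median

def mad(xs):
    """Median absolute deviation about the median."""
    center = median(xs)                          # inner median
    return median(abs(x - center) for x in xs)   # outer median

sample = [1, 2, 3, 4, 100]
mad_value = mad(sample)
sigma_mad = 1.4826 * mad_value   # estimate of sigma for normal data
```

The outlier 100 contributes only one large deviation, so the median of the deviations 2, 1, 0, 1, 97 stays at 1.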
The MAD statistic has low efficiency for normal distributions, and it may not be appropriate for asymmetric distributions, because it treats deviations above and below the median alike. Rousseeuw and Croux (1993) proposed two new statistics as alternatives to the MAD statistic.
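The first statistic is
$$S_n = 1.1926\, \operatorname{med}_i\left(\operatorname{med}_j\left(|x_i - x_j|\right)\right)$$
where the outer median, $\operatorname{med}_i$, is the median of the $n$ medians of $\{|x_i - x_j|;\ j = 1, 2, \ldots, n\}$.
To reduce the small-sample bias, $c_{sn} S_n$ is used to estimate the standard deviation $\sigma$, where $c_{sn}$ is the correction factor (Croux and Rousseeuw, 1992).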
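The second statistic is
$$Q_n = 2.2219\, \{|x_i - x_j|;\ i < j\}_{(k)}$$
where $k = \binom{h}{2}$, $h = [n/2] + 1$, and $[n/2]$ is the integer part of $n/2$. That is, $Q_n$ is 2.2219 times the $k$th order statistic of the $\binom{n}{2}$ distances between data points.
The bias-corrected statistic, $c_{qn} Q_n$, is used to estimate the standard deviation $\sigma$, where $c_{qn}$ is a correction factor.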
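The two Rousseeuw and Croux statistics can be sketched directly from their definitions. This is a naive quadratic-time illustration with hypothetical function names. It uses the ordinary median throughout, which is a simplifying assumption (the choice matters only for even sample sizes), and it omits the small-sample correction factors.

```python
from itertools import combinations
from math import comb
from statistics import median

def s_n(xs):
    """1.1926 times the median over i of the median over j of |x_i - x_j|."""
    inner = [median(abs(xi - xj) for xj in xs) for xi in xs]
    return 1.1926 * median(inner)

def q_n(xs):
    """2.2219 times the k-th smallest pairwise distance, k = C(h, 2), h = n // 2 + 1."""
    n = len(xs)
    h = n // 2 + 1
    k = comb(h, 2)
    distances = sorted(abs(a - b) for a, b in combinations(xs, 2))
    return 2.2219 * distances[k - 1]   # k is 1-based

points = [1, 2, 4, 7, 11]
sn_value = s_n(points)
qn_value = q_n(points)
```

For these five points the inner medians are 3, 2, 3, 4, 7 and the third smallest of the ten pairwise distances is 3, so both statistics are their constant times 3.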
| Calculating Percentiles |
The UNIVARIATE procedure automatically computes the minimum, 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, 99th, and maximum percentiles. You use the PCTLDEF= option in the PROC UNIVARIATE statement to specify one of five methods to compute quantile statistics. See Percentile and Related Statistics for more information.
To compute the quantile that each observation
falls in, use PROC RANK
with the GROUP= option. To calculate percentiles other than the default percentiles,
use PCTLPTS= and PCTLPRE= in the OUTPUT statement.
The CIPCTLDF option and CIPCTLNORMAL option compute confidence limits for quantiles using methods described in Hahn and Meeker (1991).
When
, the two-sided
percent confidence interval for quantiles that are based
on normal data has lower and upper limits
where
is the percentile
.
When
, the lower and upper limits are
A one-sided $100(1-\alpha)$ percent confidence interval is computed by replacing $\alpha/2$ with $\alpha$. The factor $g'(\gamma, p, n)$ is described in Owen and Hua (1977) and Odeh and Owen (1980).
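The two-sided distribution-free $100(1-\alpha)$% confidence interval for quantiles from a sample of size $n$ has lower and upper limits
$$X_{(l)}, \qquad X_{(u)}$$
where $X_{(j)}$ is the $j$th order statistic. The lower rank $l$ and upper rank $u$ are integers that are symmetric or nearly symmetric around $[np]+1$, where $[np]$ is the integral part of $np$.
The $l$ and $u$ are chosen so that the order statistics $X_{(l)}$ and $X_{(u)}$ are approximately symmetric about $X_{([np]+1)}$, are as close to $X_{([np]+1)}$ as possible, and satisfy the coverage probability requirement
$$Q_b(u-1;\,n,p) - Q_b(l-1;\,n,p) \ge 1-\alpha$$
where $Q_b$ is the cumulative binomial probability, $0 < l < u \le n$, and $0 < p < 1$.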
The coverage probability is sometimes less than $1-\alpha$. This can occur in the tails of the distribution when the sample size is small. To avoid this problem, you can specify the option TYPE=ASYMMETRIC, which causes PROC UNIVARIATE to use asymmetric values of $l$ and $u$. However, PROC UNIVARIATE first attempts to compute confidence limits that satisfy all three conditions. If the last condition is not satisfied, then the first condition is relaxed. Thus, some of the confidence limits are symmetric while others, especially in the extremes, are not.
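A one-sided distribution-free lower $100(1-\alpha)$ percent confidence interval is computed as $X_{(l)}$ when $l$ is the largest integer that satisfies the inequality
$$1 - Q_b(l-1;\,n,p) \ge 1-\alpha$$
where $Q_b$ is the cumulative binomial probability, $0 < l \le n$, and $0 < p < 1$. Likewise, a one-sided distribution-free upper $100(1-\alpha)$% confidence interval is computed as $X_{(u)}$ when $u$ is the smallest integer that satisfies the inequality
$$Q_b(u-1;\,n,p) \ge 1-\alpha$$
where $0 < u \le n$, and $0 < p < 1$.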
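The rank selection for the one-sided distribution-free limits amounts to a search over cumulative binomial probabilities. The Python sketch below is an illustration under stated assumptions, not the procedure's algorithm: the function names are hypothetical, `binom_cdf` plays the role of the cumulative binomial probability, and `None` is returned when no rank meets the coverage requirement. The search for symmetric two-sided ranks is not attempted; `coverage` only evaluates a given pair of ranks.

```python
from math import comb

def binom_cdf(k, n, p):
    """Pr(B <= k) for B ~ Binomial(n, p)."""
    if k < 0:
        return 0.0
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j)
               for j in range(min(k, n) + 1))

def lower_rank(n, p, alpha):
    """Largest l in 1..n with 1 - cdf(l - 1) >= 1 - alpha, else None."""
    best = None
    for l in range(1, n + 1):
        if 1 - binom_cdf(l - 1, n, p) >= 1 - alpha:
            best = l
    return best

def upper_rank(n, p, alpha):
    """Smallest u in 1..n with cdf(u - 1) >= 1 - alpha, else None."""
    for u in range(1, n + 1):
        if binom_cdf(u - 1, n, p) >= 1 - alpha:
            return u
    return None

def coverage(l, u, n, p):
    """Coverage probability of the interval [X_(l), X_(u)] for the p-th quantile."""
    return binom_cdf(u - 1, n, p) - binom_cdf(l - 1, n, p)

l_rank = lower_rank(10, 0.5, 0.05)
u_rank = upper_rank(10, 0.5, 0.05)
cov = coverage(l_rank, u_rank, 10, 0.5)
```

For the median of a sample of size 10 at the 95 percent level, the search selects the 2nd order statistic as the one-sided lower limit and the 9th as the one-sided upper limit.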
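When you use the WEIGHT statement, the percentiles are computed as follows. Let $x_i$ be the $i$th ordered nonmissing value, $x_1 \le x_2 \le \cdots \le x_n$. Then for a given value of $p$ between 0 and 1, the $p$th weighted quantile (or $100p$th weighted percentile), $y$, is computed from the empirical distribution function with averaging
$$y = \begin{cases} \frac{1}{2}\left(x_i + x_{i+1}\right) & \text{if } \sum_{j=1}^{i} w_j = pW \\ x_{i+1} & \text{if } \sum_{j=1}^{i} w_j < pW < \sum_{j=1}^{i+1} w_j \end{cases}$$
where $w_i$ is the weight associated with $x_i$, and $W = \sum_{i=1}^{n} w_i$ is the sum of the weights.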
When the observations have identical weights, the weighted percentiles are the same as the unweighted percentiles with PCTLDEF=5.
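A weighted quantile based on the empirical distribution function with averaging could be sketched as follows. The function name `weighted_quantile` is hypothetical, all weights are assumed to be positive, and the equality between a cumulative weight and $pW$ is tested with a floating-point tolerance, which is a choice made for this sketch.

```python
import math

def weighted_quantile(xs, ws, p):
    """p-th weighted quantile from the weighted EDF with averaging (sketch)."""
    pairs = sorted(zip(xs, ws))
    target = p * sum(w for _, w in pairs)   # p * W
    cum = 0.0
    for i, (x, w) in enumerate(pairs):
        cum += w
        if math.isclose(cum, target):
            # Cumulative weight equals p*W: average this value and the next one
            nxt = pairs[i + 1][0] if i + 1 < len(pairs) else x
            return (x + nxt) / 2
        if cum > target:
            # p*W falls strictly inside this observation's weight
            return x
    return pairs[-1][0]

q_half = weighted_quantile([1, 2, 3, 4], [1, 1, 1, 1], 0.5)
q_low = weighted_quantile([1, 2, 3, 4], [1, 1, 1, 1], 0.3)
q_heavy = weighted_quantile([1, 2, 3, 4], [1, 1, 1, 5], 0.5)
```

With equal weights the median of 1, 2, 3, 4 is the average 2.5, while giving the last observation a weight of 5 moves the weighted median to 4.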
| Calculating the Mode |
The mode is the value that occurs most often in the data. PROC UNIVARIATE counts repetitions of the actual values or, if you specify the ROUND= option, the rounded values. If a tie occurs for the most frequent value, the procedure reports the lowest value. To list all possible modes, use the MODES option in the PROC UNIVARIATE statement. When no repetitions occur in the data (as with truly continuous data), the procedure does not report the mode.
The WEIGHT statement has no effect on the mode.
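The mode rule described above (report the lowest of the tied values, and report nothing when no value repeats) can be sketched in Python. The function name `lowest_mode` is hypothetical, and the optional `round_to` parameter only loosely imitates the idea of the ROUND= option by rounding to a number of decimal places, which is an assumption rather than the option's actual semantics.

```python
from collections import Counter

def lowest_mode(xs, round_to=None):
    """Most frequent value, lowest on ties; None when no value repeats."""
    values = [round(x, round_to) for x in xs] if round_to is not None else list(xs)
    if not values:
        return None
    counts = Counter(values)
    top = max(counts.values())
    if top < 2:
        return None              # no repetitions, so no mode is reported
    return min(v for v, c in counts.items() if c == top)

tied_mode = lowest_mode([3, 1, 3, 1, 2])
no_mode = lowest_mode([1.5, 2.5, 3.5])
rounded_mode = lowest_mode([1.01, 1.02, 2.5], round_to=1)
```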
Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.