The UNIVARIATE Procedure

Statistical Computations

PROC UNIVARIATE uses standard algorithms to compute the moment statistics (such as the mean, variance, skewness, and kurtosis). See SAS Elementary Statistics Procedures for the statistical formulas. The computational details for confidence limits, hypothesis test statistics, and quantile statistics follow.


Confidence Limits for Parameters

A two-sided [IMAGE] percent confidence interval for the mean has upper and lower limits

[IMAGE]

where [IMAGE] is [IMAGE] and [IMAGE] is the ([IMAGE]) critical value of the Student's t statistic with [IMAGE] degrees of freedom.
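As an illustration outside SAS, the two-sided limits can be sketched in Python, assuming the usual form of the interval: the sample mean plus or minus the t critical value times the standard error of the mean. The function name `mean_confidence_limits` and the `t_crit` argument are hypothetical; the critical value is supplied by the caller rather than computed here.

```python
import math
import statistics


def mean_confidence_limits(values, t_crit):
    """Return (lower, upper) limits: mean -/+ t_crit * s / sqrt(n).

    Assumption: t_crit is the Student's t critical value for n - 1
    degrees of freedom, looked up by the caller.
    """
    n = len(values)
    mean = statistics.fmean(values)
    s = statistics.stdev(values)  # sample standard deviation, n - 1 divisor
    half_width = t_crit * s / math.sqrt(n)
    return mean - half_width, mean + half_width


# 2.776 is approximately the 0.975 critical value of t with 4 degrees of freedom.
lower, upper = mean_confidence_limits([2.0, 4.0, 4.0, 5.0, 7.0], t_crit=2.776)
```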

A one-sided [IMAGE] percent confidence interval is computed as

[IMAGE]

A two-sided [IMAGE] percent confidence interval for the standard deviation has lower and upper limits

[IMAGE]

where [IMAGE] and [IMAGE] are the [IMAGE] and [IMAGE] critical values of the chi-square statistic with [IMAGE] degrees of freedom. A one-sided [IMAGE] percent confidence interval is computed by replacing [IMAGE] with [IMAGE].

A [IMAGE] percent confidence interval for the variance has upper and lower limits equal to the squares of the corresponding upper and lower limits for the standard deviation.

When you use the WEIGHT statement and VARDEF=DF in the PROC statement, the [IMAGE] percent confidence interval for the weighted mean has upper and lower limits

[IMAGE]

where [IMAGE] is the weighted mean, [IMAGE] is the weighted standard deviation, [IMAGE] is the weight for the [IMAGE] observation, and [IMAGE] is the [IMAGE] critical value for the Student's t distribution with [IMAGE] degrees of freedom.


Tests for Location

PROC UNIVARIATE computes tests for location that include the Student's t test, the sign test, and the Wilcoxon signed rank test. All three tests produce a test statistic for the null hypothesis that the mean or median is equal to a given value [IMAGE] against the two-sided alternative that the mean or median is not equal to [IMAGE]. By default, PROC UNIVARIATE sets the value of [IMAGE] to zero. Use the MU0= option in the PROC UNIVARIATE statement to test that the mean or median is equal to another value.

The Student's t test is appropriate when the data are from an approximately normal population; otherwise, use nonparametric tests such as the sign test or the signed rank test. For large sample situations, the Student's t test is asymptotically equivalent to a z test.

If you use the WEIGHT statement, PROC UNIVARIATE computes only one weighted test for location, the Student's t test. You must use the default value for the VARDEF= option in the PROC statement.

You can also compare means or medians of paired data. Data are said to be paired when subjects or units are matched in pairs according to one or more variables, such as pairs of subjects with the same age and gender. Paired data also occur when each subject or unit is measured at two times or under two conditions. To compare the means or medians of the two times, create an analysis variable that is the difference between the two measures. Testing that the new analysis variable's mean or median equals zero is equivalent to testing that the means or medians of the two original variables are equal. For an example, see Performing a Sign Test Using Paired Data.

Student's t Test

PROC UNIVARIATE calculates the t statistic as

[IMAGE]

where [IMAGE] is the sample mean, [IMAGE] is the number of nonmissing values for a variable, and [IMAGE] is the sample standard deviation. Under the null hypothesis, the population mean equals [IMAGE]. When the data values are approximately normally distributed, the probability under the null hypothesis of a t statistic as extreme as, or more extreme than, the observed value (the p-value) is obtained from the t distribution with [IMAGE] degrees of freedom. For large [IMAGE], the t test is asymptotically equivalent to a z test.
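The unweighted statistic can be sketched in Python, assuming the standard one-sample form (sample mean minus the hypothesized value, divided by the standard error). The name `student_t` and the `mu0` argument are hypothetical and only mirror the MU0= option; the p-value lookup is not shown.

```python
import math
import statistics


def student_t(values, mu0=0.0):
    """One-sample t statistic: (mean - mu0) / (s / sqrt(n))."""
    n = len(values)
    mean = statistics.fmean(values)
    s = statistics.stdev(values)  # sample standard deviation, n - 1 divisor
    return (mean - mu0) / (s / math.sqrt(n))


t_default = student_t([2.0, 4.0, 4.0, 5.0, 7.0])           # tests mean = 0
t_shifted = student_t([2.0, 4.0, 4.0, 5.0, 7.0], mu0=4.4)  # tests mean = 4.4
```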

When you use the WEIGHT statement and the default value of VARDEF=, which is DF, the Student's t statistic is calculated as

[IMAGE]

where [IMAGE] is the weighted mean, [IMAGE] is the weighted standard deviation, and [IMAGE] is the weight for the [IMAGE] observation. The [IMAGE] statistic is treated as having a Student's t distribution with [IMAGE] degrees of freedom. If you specify the EXCLNPWGT option in the PROC statement, [IMAGE] is the number of nonmissing observations for which the value of the WEIGHT variable is positive. By default, [IMAGE] is the number of nonmissing observations for the WEIGHT variable.

Sign Test

PROC UNIVARIATE calculates the sign test statistic as

[IMAGE]

where [IMAGE] is the number of values that are greater than [IMAGE] and [IMAGE] is the number of values that are less than [IMAGE]. Values equal to [IMAGE] are discarded.

Under the null hypothesis that the population median is equal to [IMAGE], the p-value for the observed statistic M is

[IMAGE]

where [IMAGE] is the number of [IMAGE] values not equal to [IMAGE].
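A minimal Python sketch of the sign test follows. It assumes the statistic is half the difference between the counts above and below the hypothesized value, and that the two-sided p-value is the doubled binomial tail with success probability 0.5. The names `sign_test` and `mu0` are hypothetical, and capping the p-value at 1 is a choice made for this sketch.

```python
from math import comb


def sign_test(values, mu0=0.0):
    """Return (M, p_value) for the two-sided sign test about mu0."""
    n_plus = sum(1 for x in values if x > mu0)
    n_minus = sum(1 for x in values if x < mu0)
    n_t = n_plus + n_minus  # values equal to mu0 are discarded
    m = (n_plus - n_minus) / 2
    if n_t == 0:
        return m, 1.0
    # Doubled lower binomial tail: 2 * sum_{j <= min} C(n_t, j) * 0.5**n_t
    tail = sum(comb(n_t, j) for j in range(min(n_plus, n_minus) + 1))
    p_value = min(1.0, tail * 0.5 ** (n_t - 1))
    return m, p_value


m_stat, p_val = sign_test([1, 2, 3, 4, -1, 0])
```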

Wilcoxon Signed Rank Test

PROC UNIVARIATE calculates the Wilcoxon signed rank test statistic as

[IMAGE]

where [IMAGE] is the rank of [IMAGE] after discarding values of [IMAGE] equal to [IMAGE], [IMAGE] is the number of [IMAGE] values not equal to [IMAGE], and the sum is calculated for values of [IMAGE] greater than 0. Average ranks are used for tied values.
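The ranking and summation described above can be sketched in Python. This sketch assumes the centered form of the statistic, that is, the sum of the ranks of the positive differences minus its null expectation of one quarter of the product of the retained count and that count plus one. The function name is hypothetical, and only the statistic is computed, not its p-value.

```python
def signed_rank_statistic(values, mu0=0.0):
    """Centered signed rank statistic with average ranks for ties."""
    diffs = [x - mu0 for x in values if x != mu0]  # discard values equal to mu0
    n_t = len(diffs)
    order = sorted(range(n_t), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n_t
    start = 0
    while start < n_t:
        end = start
        # Extend the block while the absolute differences are tied.
        while end + 1 < n_t and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1  # mean of 1-based ranks in the block
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    positive_rank_sum = sum(r for r, d in zip(ranks, diffs) if d > 0)
    return positive_rank_sum - n_t * (n_t + 1) / 4


s_no_ties = signed_rank_statistic([1, -2, 3, 4])
s_ties = signed_rank_statistic([1, -1, 2, 0])
```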

The p-value is the probability of obtaining a signed rank statistic greater in absolute value than the absolute value of the observed statistic S. If [IMAGE], the significance level of [IMAGE] is computed from the exact distribution of [IMAGE], which can be enumerated under the null hypothesis that the distribution is symmetric about [IMAGE]. When [IMAGE], the significance level of [IMAGE] is computed by treating

[IMAGE]

as a Student's t variate with [IMAGE] degrees of freedom. [IMAGE] is computed as

[IMAGE]

where the sum is calculated over groups that are tied in absolute value, and [IMAGE] is the number of tied values in the [IMAGE]th group (Iman 1974; Conover 1980).

The Wilcoxon signed rank test assumes that the distribution is symmetric. If the assumption is not valid, you can use the sign test to test that the median is [IMAGE]. See Lehmann (1975) for more details.


Tests for Normality

When you specify the NORMAL option, the procedure computes test statistics for the null hypothesis that the values of the analysis variable are a random sample from a normal distribution. The test statistics depend on the sample size. The procedure calculates goodness-of-fit tests based on the empirical distribution function (EDF): the Kolmogorov-Smirnov D statistic, the Anderson-Darling statistic, and the Cramer-von Mises statistic. In addition, if the sample size is less than or equal to 2000, PROC UNIVARIATE calculates the Shapiro-Wilk W statistic.

You determine whether to reject the null hypothesis of normality by examining the probability that is associated with a test statistic. A small p-value indicates nonnormal data. When the p-value is less than the predetermined significance level (alpha value), you reject the null hypothesis and conclude that the data do not come from a normal distribution.

If you want to test the normality assumptions underlying analysis of variance methods, beware of using a statistical test for normality alone. A test's ability to reject the null hypothesis (known as the power of the test) increases with the sample size. As the sample size becomes larger, increasingly smaller departures from normality can be detected. Since small deviations from normality do not severely affect the validity of analysis of variance tests, it is important to examine other statistics and plots to make a final assessment of normality. The skewness and kurtosis measures and the plots that are provided by the PLOTS option can be very helpful. For small sample sizes, power is low for detecting larger departures from normality that may be important. To increase the test's ability to detect such deviations, you may want to declare significance at higher levels, such as 0.15 or 0.20 rather than the often-used 0.05 level. Again, consulting plots and additional statistics will help you assess the severity of the deviations from normality.

Shapiro-Wilk Statistic

If the sample size is less than or equal to 2000, PROC UNIVARIATE computes the Shapiro-Wilk statistic, W. The W statistic is the ratio of the best estimator of the variance (based on the square of a linear combination of the order statistics) to the usual corrected sum of squares estimator of the variance (Shapiro, 1965). W must be greater than zero and less than or equal to one. Small values of W lead to the rejection of the null hypothesis of normality. The distribution of W is highly skewed. Seemingly large values of W (such as 0.90) may be considered small and lead you to reject the null hypothesis. When the sample size is greater than three, the coefficients to compute the linear combination of the order statistics are approximated by the method of Royston (1992).

[IMAGE]

when [IMAGE] and

[IMAGE]

when [IMAGE], where [IMAGE] and [IMAGE] are functions of [IMAGE], obtained from simulation results, and [IMAGE] is a standard normal variate. Large values of [IMAGE] indicate departure from normality.

EDF Goodness-of-Fit Tests

When you fit a normal distribution, PROC UNIVARIATE provides goodness-of-fit tests that are based on the empirical distribution function (EDF). The empirical distribution function is defined for a set of [IMAGE] independent observations [IMAGE] with a common distribution function [IMAGE]. Denote the observations that are ordered from smallest to largest as [IMAGE]. The empirical distribution function, [IMAGE], is defined as

[IMAGE]

Note that [IMAGE] is a step function that takes a step of height [IMAGE] at each observation. This function estimates the distribution function [IMAGE]. At any value [IMAGE] is the proportion of observations that are less than or equal to [IMAGE], while [IMAGE] is the theoretical probability of an observation that is less than or equal to [IMAGE]. EDF statistics measure the discrepancy between [IMAGE] and [IMAGE].
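The step-function behavior can be illustrated with a short Python sketch. It evaluates the proportion of observations less than or equal to a given value; the function name `empirical_distribution_function` is hypothetical.

```python
from bisect import bisect_right


def empirical_distribution_function(sample):
    """Return a function giving the proportion of observations <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def f_n(x):
        # bisect_right counts how many ordered observations are <= x.
        return bisect_right(ordered, x) / n

    return f_n


f_n = empirical_distribution_function([3, 1, 2, 2])
steps = [f_n(0), f_n(1), f_n(2), f_n(2.5), f_n(3)]
```

Each distinct observation raises the function by its multiplicity divided by the sample size, so the repeated value 2 produces a step of twice the usual height.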

The computational formulas for the EDF statistics use the probability integral transformation [IMAGE]. If [IMAGE] is the distribution function of [IMAGE], the random variable [IMAGE] is uniformly distributed between 0 and 1.

Given [IMAGE] observations [IMAGE], PROC UNIVARIATE computes the values [IMAGE] by applying the transformation, as follows.

The NORMAL option in the PROC UNIVARIATE statement provides the following EDF tests: the Kolmogorov-Smirnov test, the Anderson-Darling test, and the Cramer-von Mises test.

These tests are based on various measures of the discrepancy between the empirical distribution function [IMAGE] and the proposed normal cumulative distribution function [IMAGE].

Once the EDF test statistics are computed, the associated p-values must be calculated. PROC UNIVARIATE uses internal tables of probability levels that are similar to those given by D'Agostino and Stephens (1986). If the value is between two probability levels, then linear interpolation is used to estimate the probability value.

Kolmogorov D Statistic

The Kolmogorov-Smirnov statistic (D) is defined as

[IMAGE]

The Kolmogorov-Smirnov statistic belongs to the supremum class of EDF statistics. This class of statistics is based on the largest vertical difference between [IMAGE] and [IMAGE].

The Kolmogorov-Smirnov statistic is computed as the maximum of [IMAGE] and [IMAGE]. [IMAGE] is the largest vertical distance between the EDF and the distribution function when the EDF is greater than the distribution function. [IMAGE] is the largest vertical distance when the EDF is less than the distribution function.

[IMAGE]

PROC UNIVARIATE uses a modified Kolmogorov D statistic to test the data against a normal distribution with mean and variance equal to the sample mean and variance.
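The computation can be sketched in Python under two assumptions: the fitted normal distribution uses the sample mean and the sample standard deviation with an n - 1 divisor, and the one-sided distances take the usual order-statistic form (i/n minus the transformed value, and the transformed value minus (i - 1)/n). The name `kolmogorov_d` is hypothetical, and the table lookup for the p-value is not shown.

```python
from statistics import NormalDist, fmean, stdev


def kolmogorov_d(sample):
    """Kolmogorov-Smirnov D against a normal fitted to the sample."""
    ordered = sorted(sample)
    n = len(ordered)
    fitted = NormalDist(fmean(ordered), stdev(ordered))
    u = [fitted.cdf(x) for x in ordered]  # probability integral transform
    # EDF above the fitted distribution function (i is 0-based here).
    d_plus = max((i + 1) / n - u[i] for i in range(n))
    # EDF below the fitted distribution function.
    d_minus = max(u[i] - i / n for i in range(n))
    return max(d_plus, d_minus)


d_stat = kolmogorov_d([-1.0, 0.0, 1.0])
```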

The procedure compares the calculated statistic against a set of five critical values. If the calculated value falls outside the interval defined by the smallest and largest of these critical values, the probability of obtaining a more extreme value is shown as >.15 or <.01.

Anderson-Darling Statistic

The Anderson-Darling statistic and the Cramer-von Mises statistic belong to the quadratic class of EDF statistics. This class of statistics is based on the squared difference [IMAGE]. Quadratic statistics have the following general form:

[IMAGE]

The function [IMAGE] weights the squared difference [IMAGE].

The Anderson-Darling statistic ([IMAGE]) is defined as

[IMAGE]

where the weight function is [IMAGE].

The Anderson-Darling statistic is computed as

[IMAGE]
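A Python sketch of the computing formula follows, assuming the standard order-statistic form of the Anderson-Darling statistic applied to values that have already been transformed to the unit interval. The name `anderson_darling` and its argument `u` are hypothetical.

```python
import math


def anderson_darling(u):
    """A-squared from probability-integral-transformed values in (0, 1)."""
    u = sorted(u)
    n = len(u)
    total = sum(
        # u[i - 1] is the i-th smallest value; u[n - i] is the (n + 1 - i)-th.
        (2 * i - 1) * (math.log(u[i - 1]) + math.log(1.0 - u[n - i]))
        for i in range(1, n + 1)
    )
    return -n - total / n


a_squared = anderson_darling([0.75, 0.25])
```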


Cramer-von Mises Statistic

The Cramer-von Mises statistic ([IMAGE]) is defined as

[IMAGE]

where the weight function is [IMAGE].

The Cramer-von Mises statistic is computed as

[IMAGE]
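A Python sketch of the computing formula follows, assuming the standard order-statistic form of the Cramer-von Mises statistic: the sum of squared differences between each transformed order statistic and (2i - 1)/(2n), plus 1/(12n). The name `cramer_von_mises` and its argument `u` are hypothetical.

```python
def cramer_von_mises(u):
    """W-squared from probability-integral-transformed values in [0, 1]."""
    u = sorted(u)
    n = len(u)
    squared = sum(
        (u[i - 1] - (2 * i - 1) / (2 * n)) ** 2 for i in range(1, n + 1)
    )
    return squared + 1 / (12 * n)


w_squared = cramer_von_mises([0.75, 0.25])
```

In this example the transformed values sit exactly at the reference points 0.25 and 0.75, so only the constant term remains.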


Robust Estimators

A statistical method is robust if the method is insensitive to slight departures from the assumptions that justify the method. PROC UNIVARIATE provides several methods for robust estimation of location and scale.

Winsorized Means

When outliers are present in the data, the Winsorized mean is a robust estimator of the location that is relatively insensitive to the outlying values. The k-times Winsorized mean is calculated as

[IMAGE]

The Winsorized mean is computed after the [IMAGE] smallest observations are replaced by the ([IMAGE]) smallest observation, and the [IMAGE] largest observations are replaced by the ([IMAGE]) largest observation. In other words, the extreme observations at each end are replaced rather than discarded, so every observation still contributes to the computation.
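The replacement step can be sketched in Python. The function name `winsorized_mean` is hypothetical, and the argument `k` is the number of observations replaced at each end.

```python
def winsorized_mean(values, k):
    """Mean after replacing the k smallest and k largest observations."""
    ordered = sorted(values)
    n = len(ordered)
    if not 0 <= 2 * k < n:
        raise ValueError("k must satisfy 0 <= 2k < n")
    low = ordered[k]           # (k+1)th smallest observation
    high = ordered[n - k - 1]  # (k+1)th largest observation
    clipped = [min(max(x, low), high) for x in ordered]
    return sum(clipped) / n


w_mean = winsorized_mean([1, 2, 3, 4, 100], k=1)
```

Here the outlier 100 is replaced by 4 and the value 1 by 2, so the mean is taken over 2, 2, 3, 4, 4.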

For a symmetric distribution, the symmetrically Winsorized mean is an unbiased estimate of the population mean. But the Winsorized mean does not have a normal distribution even if the data are from a normal population.

The Winsorized sum of squared deviations is defined as

[IMAGE]

A Winsorized t test is given by

[IMAGE]

where the standard error of the Winsorized mean is

[IMAGE]

When the data are from a symmetric distribution, the distribution of the Winsorized t statistic [IMAGE] is approximated by a Student's t distribution with [IMAGE] degrees of freedom (Tukey and McLaughlin 1963, Dixon and Tukey 1968).

A [IMAGE] percent confidence interval for the Winsorized mean has upper and lower limits

[IMAGE]

and the ([IMAGE]) critical value of the Student's t statistic has [IMAGE] degrees of freedom.

Trimmed Means

When outliers are present in the data, the trimmed mean is a robust estimator of the location that is relatively insensitive to the outlying values. The [IMAGE]-times trimmed mean is calculated as

[IMAGE]

The trimmed mean is computed after the [IMAGE] smallest and [IMAGE] largest observations are deleted from the sample. In other words, the observations are trimmed at each end.
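The deletion step can be sketched in Python. The function name `trimmed_mean` is hypothetical, and the argument `k` is the number of observations deleted at each end.

```python
def trimmed_mean(values, k):
    """Mean after deleting the k smallest and k largest observations."""
    ordered = sorted(values)
    n = len(ordered)
    if not 0 <= 2 * k < n:
        raise ValueError("k must satisfy 0 <= 2k < n")
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)


t_mean = trimmed_mean([1, 2, 3, 4, 100], k=1)
```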

For a symmetric distribution, the symmetrically trimmed mean is an unbiased estimate of the population mean. But the trimmed mean does not have a normal distribution even if the data are from a normal population.

A robust estimate of the variance of the trimmed mean [IMAGE] can be based on the Winsorized sum of squared deviations (Tukey and McLaughlin 1963). The resulting trimmed t test is given by

[IMAGE]

where the standard error of the trimmed mean is

[IMAGE]

and [IMAGE] is the square root of the Winsorized sum of squared deviations.

When the data are from a symmetric distribution, the distribution of the trimmed t statistic [IMAGE] is approximated by a Student's t distribution with [IMAGE] degrees of freedom (Tukey and McLaughlin 1963, Dixon and Tukey 1968).

A [IMAGE] percent confidence interval for the trimmed mean has upper and lower limits

[IMAGE]

and the ([IMAGE]) critical value of the Student's t statistic has [IMAGE] degrees of freedom.

Robust Measures of Scale

The sample standard deviation is a commonly used estimator of the population scale. However, it is sensitive to outliers and may not remain bounded when a single data point is replaced by an arbitrary number. With robust scale estimators, the estimates remain bounded even when a portion of the data points are replaced by arbitrary numbers.

PROC UNIVARIATE computes robust measures of scale that include the interquartile range, Gini's mean difference G, MAD, [IMAGE], and [IMAGE], with their corresponding estimates of [IMAGE].

The interquartile range is a simple robust scale estimator, which is the difference between the upper and lower quartiles. For a normal population, the standard deviation [IMAGE] can be estimated by dividing the interquartile range by 1.34898.

Gini's mean difference is also a robust estimator of the standard deviation [IMAGE]. For a normal population, Gini's mean difference has expected value [IMAGE]. Thus, multiplying Gini's mean difference by [IMAGE] yields a robust estimator of the standard deviation when the data are from a normal sample. The constructed estimator has high efficiency for the normal distribution relative to the usual sample standard deviation. It is also less sensitive to the presence of outliers than the sample standard deviation.

Gini's mean difference is computed as

[IMAGE]

If the observations are from a normal distribution, then [IMAGE] is an unbiased estimator of the standard deviation [IMAGE].
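Gini's mean difference can be sketched in Python, assuming the usual definition as the average absolute difference over all distinct pairs of observations. The function names are hypothetical. The scaling by the square root of pi divided by 2 follows the text above.

```python
from itertools import combinations
from math import comb, pi, sqrt


def gini_mean_difference(values):
    """Average of |x_i - x_j| over all pairs with i < j."""
    n = len(values)
    return sum(abs(a - b) for a, b in combinations(values, 2)) / comb(n, 2)


def gini_sigma_estimate(values):
    """Scale G so that it estimates sigma for normal data."""
    return gini_mean_difference(values) * sqrt(pi) / 2


g = gini_mean_difference([1, 2, 4])
sigma_from_g = gini_sigma_estimate([1, 2, 4])
```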

A very robust scale estimator is the MAD, the median absolute deviation about the median (Hampel 1974).

[IMAGE]

where the inner median, [IMAGE], is the median of the [IMAGE] observations and the outer median, [IMAGE], is the median of the [IMAGE] absolute values of the deviations about the median.

For a normal distribution, 1.4826·MAD can be used to estimate the standard deviation [IMAGE].
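The MAD and its scaled estimate of the standard deviation can be sketched in Python. The function name `mad_sigma_estimate` is hypothetical; the factor 1.4826 is the one quoted above.

```python
from statistics import median


def mad_sigma_estimate(values):
    """Return (MAD, 1.4826 * MAD)."""
    center = median(values)                         # inner median
    mad = median(abs(x - center) for x in values)   # outer median
    return mad, 1.4826 * mad


mad, sigma_from_mad = mad_sigma_estimate([1, 2, 3, 4, 100])
```

The outlier 100 leaves the MAD at 1, whereas the sample standard deviation of the same data is dominated by that single value.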

The MAD statistic has low efficiency for normal distributions, and it may not be appropriate for asymmetric distributions. Rousseeuw and Croux (1993) proposed two new statistics as alternatives to the MAD statistic.

The first statistic is

[IMAGE]

where the outer median, [IMAGE], is the median of the [IMAGE] medians of [IMAGE].

To reduce the small-sample bias, [IMAGE] is used to estimate the standard deviation [IMAGE], where [IMAGE] is a correction factor (Croux and Rousseeuw 1992).

The second statistic is

[IMAGE]

where [IMAGE], and [IMAGE] is the integer part of [IMAGE]. That is, [IMAGE] is 2.2219 times the [IMAGE]th order statistic of the [IMAGE] distances between data points.
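The second statistic can be sketched in Python. This sketch assumes the definition given by Rousseeuw and Croux (1993): with h equal to the integer part of n/2 plus 1, take the order statistic of rank "h choose 2" among all pairwise distances. The function name `q_n` is hypothetical, the small-sample correction factor is not applied, and the brute-force enumeration of pairs is for illustration only.

```python
from itertools import combinations
from math import comb


def q_n(values):
    """Uncorrected Q_n: 2.2219 times the k-th smallest pairwise distance."""
    n = len(values)
    h = n // 2 + 1   # assumption: h = [n/2] + 1
    k = comb(h, 2)   # assumption: k = "h choose 2"
    distances = sorted(abs(a - b) for a, b in combinations(values, 2))
    return 2.2219 * distances[k - 1]


q = q_n([1, 2, 4, 7, 11])
```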

The bias-corrected statistic, [IMAGE], is used to estimate the standard deviation [IMAGE], where [IMAGE] is a correction factor.


Calculating Percentiles

The UNIVARIATE procedure automatically computes the minimum, 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, 99th, and maximum percentiles. You use the PCTLDEF= option in the PROC UNIVARIATE statement to specify one of five methods to compute quantile statistics. See Percentile and Related Statistics for more information.

To compute the quantile that each observation falls in, use PROC RANK with the GROUP= option. To calculate percentiles other than the default percentiles, use PCTLPTS= and PCTLPRE= in the OUTPUT statement.

Confidence Limits for Quantiles

The CIPCTLDF option and CIPCTLNORMAL option compute confidence limits for quantiles using methods described in Hahn and Meeker (1991).

When [IMAGE], the two-sided [IMAGE] percent confidence interval for quantiles that are based on normal data has lower and upper limits

[IMAGE]

where [IMAGE] is the percentile [IMAGE].

When [IMAGE], the lower and upper limits are

[IMAGE]

A one-sided [IMAGE] percent confidence interval is computed by replacing [IMAGE] with [IMAGE]. The factor [IMAGE] is described in Owen and Hua (1977) and Odeh and Owen (1980).

The two-sided distribution-free [IMAGE]% confidence interval for quantiles from a sample of size [IMAGE] has lower and upper limits

[IMAGE]

where [IMAGE] is the jth order statistic. The lower rank [IMAGE] and upper rank [IMAGE] are integers that are symmetric or nearly symmetric around [IMAGE], where [IMAGE] is the integer part of [IMAGE].

The [IMAGE] and [IMAGE] are chosen so that the order statistics [IMAGE] and [IMAGE]

The coverage probability is sometimes less than [IMAGE]. This can occur in the tails of the distribution when the sample size is small. To avoid this problem, you can specify the option TYPE=ASYMMETRIC, which causes PROC UNIVARIATE to use asymmetric values of [IMAGE] and [IMAGE]. However, PROC UNIVARIATE first attempts to compute confidence limits that satisfy all three conditions. If the last condition is not satisfied, then the first condition is relaxed. Thus, some of the confidence limits are symmetric while others, especially in the extremes, are not.
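The coverage probability of an interval formed by two order statistics can be checked with a short Python sketch. It assumes a continuous distribution, for which the number of observations at or below the quantile is binomial, so the interval between the order statistics of ranks l and u covers the quantile with probability equal to the binomial mass on l through u - 1. The function name and arguments are hypothetical.

```python
from math import comb


def coverage_probability(n, p, lower_rank, upper_rank):
    """Pr(X_(l) <= p-th quantile <= X_(u)) for a continuous distribution."""
    return sum(
        comb(n, j) * p ** j * (1 - p) ** (n - j)
        for j in range(lower_rank, upper_rank)  # j = l, ..., u - 1
    )


median_cover = coverage_probability(n=10, p=0.5, lower_rank=2, upper_rank=9)
tail_cover = coverage_probability(n=10, p=0.99, lower_rank=9, upper_rank=10)
```

For the median of ten observations the ranks 2 and 9 give coverage well above 95 percent, while for the 99th percentile even the two largest order statistics cover the quantile with low probability, which is the small-sample tail problem described above.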

A one-sided distribution-free lower [IMAGE] percent confidence interval is computed as [IMAGE] when [IMAGE] is the largest integer that satisfies the inequality

[IMAGE]

where [IMAGE], and [IMAGE]. Likewise, a one-sided distribution-free upper [IMAGE]% confidence interval is computed as [IMAGE] when [IMAGE] is the smallest integer that satisfies the inequality

[IMAGE]

where [IMAGE], and [IMAGE].

Weighted Quantiles

When you use the WEIGHT statement, the percentiles are computed as follows. Let [IMAGE] be the [IMAGE]th ordered nonmissing value, [IMAGE]. Then for a given value of [IMAGE] between 0 and 1, the [IMAGE]th weighted quantile (or 100[IMAGE]th weighted percentile), [IMAGE], is computed from the empirical distribution function with averaging

[IMAGE]

where [IMAGE] is the weight associated with [IMAGE], [IMAGE] is the sum of the weights, and [IMAGE] is the weight for the [IMAGE]th observation.

When the observations have identical weights, the weighted percentiles are the same as the unweighted percentiles with PCTLDEF=5.
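A Python sketch of a weighted quantile based on the empirical distribution function with averaging follows. It assumes this rule: if the cumulative weight through an ordered value equals exactly p times the total weight, average that value with the next one; otherwise return the first ordered value whose cumulative weight exceeds p times the total. The function name is hypothetical, and exact fractions are used so that the equality test is reliable.

```python
from fractions import Fraction


def weighted_quantile(values, weights, p):
    """Weighted quantile from the EDF, averaging at exact boundaries."""
    pairs = sorted(zip(values, weights))
    target = Fraction(p) * sum(Fraction(w) for _, w in pairs)
    cumulative = Fraction(0)
    for index, (x, w) in enumerate(pairs):
        cumulative += Fraction(w)
        if cumulative == target and index + 1 < len(pairs):
            return (x + pairs[index + 1][0]) / 2  # average adjacent values
        if cumulative >= target:
            return x
    return pairs[-1][0]


q50 = weighted_quantile([1, 2, 3, 4], [1, 1, 1, 1], Fraction(1, 2))
q30 = weighted_quantile([1, 2, 3, 4], [1, 1, 1, 1], Fraction(3, 10))
```

Passing p as a `Fraction` (or as a string such as "0.3") avoids binary floating-point values that would almost never hit a boundary exactly.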


Calculating the Mode

The mode is the value that occurs most often in the data. PROC UNIVARIATE counts repetitions of the actual values or, if you specify the ROUND= option, the rounded values. If a tie occurs for the most frequent value, the procedure reports the lowest value. To list all possible modes, use the MODES option in the PROC UNIVARIATE statement. When no repetitions occur in the data (as with truly continuous data), the procedure does not report the mode.
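The tie-breaking rule can be sketched in Python. The function name `reported_mode` is hypothetical, and rounding as done by the ROUND= option is not modeled.

```python
from collections import Counter


def reported_mode(values):
    """Lowest most-frequent value, or None when no value repeats."""
    counts = Counter(values)
    if not counts:
        return None
    top = max(counts.values())
    if top == 1:
        return None  # no repetitions, so no mode is reported
    return min(v for v, c in counts.items() if c == top)


tied_mode = reported_mode([3, 1, 3, 1, 2])
no_mode = reported_mode([1.5, 2.5, 3.5])
```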

The WEIGHT statement has no effect on the mode.



Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.