The UNIVARIATE Procedure

# Concepts

When you specify ROUND=u, PROC UNIVARIATE rounds a variable by using the rounding unit to divide the number line into intervals with midpoints u*i, where u is the nonnegative rounding unit and i ranges over the integers (..., -4, -3, -2, -1, 0, 1, 2, 3, 4, ...). The interval width is u. Any variable value that falls in an interval rounds to the midpoint of that interval. A variable value that is midway between two midpoints, and is therefore on the boundary of two intervals, rounds to the even midpoint. Even midpoints occur when i is an even integer (0, ±2, ±4, ...).

When ROUND=1 and the analysis variable values are between -2.5 and 2.5, the intervals are as follows:

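| i | Interval | Midpoint | Left endpoint rounds to | Right endpoint rounds to |
|---|---|---|---|---|
| -2 | [-2.5, -1.5] | -2 | -2 | -2 |
| -1 | [-1.5, -0.5] | -1 | -2 | 0 |
| 0 | [-0.5, 0.5] | 0 | 0 | 0 |
| 1 | [0.5, 1.5] | 1 | 0 | 2 |
| 2 | [1.5, 2.5] | 2 | 2 | 2 |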

When ROUND=.5 and the analysis variable values are between -1.25 and 1.25, the intervals are as follows:

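| i | Interval | Midpoint | Left endpoint rounds to | Right endpoint rounds to |
|---|---|---|---|---|
| -2 | [-1.25, -0.75] | -1.0 | -1 | -1 |
| -1 | [-0.75, -0.25] | -0.5 | -1 | 0 |
| 0 | [-0.25, 0.25] | 0.0 | 0 | 0 |
| 1 | [0.25, 0.75] | 0.5 | 0 | 1 |
| 2 | [0.75, 1.25] | 1.0 | 1 | 1 |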

As the rounding unit increases, the interval width also increases. This reduces the number of unique values and decreases the amount of memory that PROC UNIVARIATE needs.
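The rounding rule can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not PROC UNIVARIATE's implementation: the function name `round_to_unit` is hypothetical, the sketch assumes a strictly positive unit, and it uses decimal arithmetic so that boundary values such as 0.25 are compared exactly.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def round_to_unit(value, unit):
    """Round value to the nearest multiple u*i of unit; ties go to even i."""
    u = Decimal(str(unit))
    if u <= 0:
        # Assumption for this sketch: only a strictly positive unit is handled.
        raise ValueError("unit must be positive")
    # i is the index of the interval that contains value; a value on the
    # boundary of two intervals goes to the interval whose index i is even.
    i = (Decimal(str(value)) / u).quantize(Decimal(1), rounding=ROUND_HALF_EVEN)
    return float(i * u)

# Interval endpoints taken from the ROUND=1 and ROUND=.5 tables
print(round_to_unit(-1.5, 1), round_to_unit(0.5, 1), round_to_unit(1.5, 1))
print(round_to_unit(-0.75, 0.5), round_to_unit(0.25, 0.5), round_to_unit(0.75, 0.5))

# A larger unit leaves fewer unique values to store
values = [0.12, 0.37, 0.49, 0.51, 0.88, 1.02, 1.13]
print(len({round_to_unit(v, 0.1) for v in values}),
      len({round_to_unit(v, 1) for v in values}))
```

For the sample list (made-up values), rounding with a unit of 0.1 leaves six unique values, whereas a unit of 1 leaves two, which is the effect on memory described above.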

The PLOTS option in the PROC UNIVARIATE statement provides up to four diagnostic line printer plots to examine the data distribution. These plots are the stem-and-leaf plot or horizontal bar chart, the box plot, the normal probability plot, and the side-by-side box plots. If you specify the WEIGHT statement, PROC UNIVARIATE provides a weighted histogram, a weighted box plot based on the weighted quantiles, and a weighted normal probability plot.

### Stem-and-Leaf Plot

The first plot in the output is either a stem-and-leaf plot (Tukey 1977) or a horizontal bar chart. If any single interval contains more than 49 observations, the horizontal bar chart appears. Otherwise, the stem-and-leaf plot appears. The stem-and-leaf plot is like a horizontal bar chart in that both plots provide a method to visualize the overall distribution of the data. The stem-and-leaf plot provides more detail because each point in the plot represents an individual data value.

To change the number of stems that the plot displays, use PLOTSIZE= to increase or decrease the number of rows. Instructions that appear below the plot explain how to determine the values of the variable. If no instructions appear, you multiply Stem.Leaf by 1 to determine the values of the variable. For example, if the stem value is 10 and the leaf value is 1, then the variable value is approximately 10.1.

For the stem-and-leaf plot, the procedure rounds a variable value to the nearest leaf. If the variable value is exactly halfway between two leaves, the value rounds to the nearest leaf with an even integer value. For example, a variable value of 3.15 has a stem value of 3 and a leaf value of 2.
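As a rough illustration of this rounding, the following Python sketch splits a value into a stem and a single leaf digit. It assumes nonnegative values and a Stem.Leaf multiplier of 1 (the case with no instructions below the plot); the helper name `stem_and_leaf` is hypothetical and is not part of SAS.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def stem_and_leaf(value):
    """Split a nonnegative value into (stem, leaf), one leaf digit per value."""
    # Round to one decimal place; a value exactly halfway between two
    # leaves goes to the leaf with the even digit.
    rounded = Decimal(str(value)).quantize(Decimal("0.1"), rounding=ROUND_HALF_EVEN)
    stem = int(rounded)                # integer part
    leaf = int((rounded - stem) * 10)  # first decimal digit
    return stem, leaf

print(stem_and_leaf(3.15))
print(stem_and_leaf(10.1))
```

With these inputs the sketch returns `(3, 2)` and `(10, 1)`, which matches the two examples in the text.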

### Box Plot

The box plot, also known as a schematic plot, appears beside the stem-and-leaf plot. Both plots use the same vertical scale. The box plot provides a visual summary of the data and identifies outliers. The bottom and top edges of the box correspond to the sample 25th (Q1) and 75th (Q3) percentiles. The box length is one interquartile range (Q3 - Q1). The center horizontal line with asterisk endpoints corresponds to the sample median. The central plus sign (+) corresponds to the sample mean. If the mean and median are equal, the plus sign falls on the line inside the box. The vertical lines that project out from the box, called whiskers, extend as far as the data extend, up to a distance of 1.5 interquartile ranges. Values farther away are potential outliers. The procedure identifies the extreme values with a zero or an asterisk (*). If zero appears, the value is between 1.5 and 3 interquartile ranges from the top or bottom edge of the box. If an asterisk appears, the value is more extreme.

To generate a box plot by using high-resolution graphics, use the BOXPLOT procedure in SAS/STAT software.
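The rule for marking extreme values on the line printer box plot can be sketched as follows. This is a minimal Python illustration, not the procedure's code: the function name is hypothetical, the quartiles are passed in by the caller (so the sketch does not depend on a particular percentile definition), and the treatment of values that lie exactly 1.5 or exactly 3 interquartile ranges from the box is an assumption.

```python
def classify_box_plot_value(value, q1, q3):
    """Return 'in range', '0', or '*' for a value, given the box edges."""
    iqr = q3 - q1
    # Distance beyond the nearer box edge (zero for values inside the box)
    if value > q3:
        distance = value - q3
    elif value < q1:
        distance = q1 - value
    else:
        distance = 0.0
    if distance <= 1.5 * iqr:
        return "in range"  # inside the box or within reach of a whisker
    if distance <= 3 * iqr:
        return "0"         # between 1.5 and 3 interquartile ranges away
    return "*"             # more extreme

# Made-up example: Q1 = 10 and Q3 = 20, so the interquartile range is 10
for v in (15, 35, 36, 51, -21):
    print(v, classify_box_plot_value(v, 10, 20))
```

In this example 35 is still within whisker range, 36 is marked with a zero, and 51 and -21 are marked with an asterisk.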

### Normal Probability Plot

The normal probability plot is a quantile-quantile plot of the data. The procedure plots the empirical quantiles against the quantiles of a standard normal distribution. Asterisks (*) indicate the data values. The plus signs (+) provide a straight reference line that is drawn by using the sample mean and standard deviation. If the data are from a normal distribution, the asterisks tend to fall along the reference line. The vertical coordinate is the data value, and the horizontal coordinate is $\Phi^{-1}(v_i)$, where

$$v_i = \frac{r_i - \frac{3}{8}}{n + \frac{1}{4}}$$

and where $\Phi^{-1}$ is the inverse of the standard normal distribution function, $r_i$ is the rank of the data value when ordered from smallest to largest, and $n$ is the number of nonmissing data values.

For a weighted normal probability plot, the ith ordered observation is plotted against the normal quantile $\Phi^{-1}(v_i)$, where $\Phi^{-1}$ is the inverse standard cumulative normal distribution and

$$v_i = \frac{\sum_{j=1}^{i} w_{(j)} \left(1 - \frac{3}{8i}\right)}{W \left(1 + \frac{1}{4n}\right)}$$

where $w_{(j)}$ is the weight that is associated with $y_{(j)}$ for the jth ordered observation and $W = \sum_{i=1}^{n} w_i$ is the sum of the individual weights.

When each observation has an identical weight, $w_j = w$, the formula for $v_i$ reduces to the expression for $v_i$ in the unweighted normal probability plot:

$$v_i = \frac{i - \frac{3}{8}}{n + \frac{1}{4}}$$

When the value of VARDEF= is WDF or WEIGHT, PROC UNIVARIATE draws a reference line with intercept $\hat{\mu}$ and slope $\hat{\sigma}$. When the value of VARDEF= is DF or N, the slope is $\hat{\sigma} / \sqrt{\bar{w}}$, where $\bar{w} = W/n$ is the average weight.

When each observation has an identical weight and the value of VARDEF= is DF, N, or WEIGHT, the reference line reduces to the usual reference line with intercept $\hat{\mu}$ and slope $\hat{\sigma}$ in the unweighted normal probability plot.

If the data are normally distributed with mean $\mu$ and standard deviation $\sigma$, and each observation has an identical weight $w$, then, as in the unweighted normal probability plot, the points on the plot should lie approximately on a straight line. The intercept is $\mu$ and the slope is $\sigma$ when VARDEF= is WDF or WEIGHT, and the slope is $\sigma / \sqrt{w}$ when VARDEF= is DF or N.

### Side-by-Side Box Plots

When you use a BY statement with the PLOT option, PROC UNIVARIATE produces full-page side-by-side box plots, one for each BY group. The box plots (also known as schematic plots) use a common scale that allows you to compare the data distribution across BY groups. This plot appears after the univariate analyses of all BY groups. Use the NOBYPLOT option to suppress this plot.

For more information about how to interpret these plots, see SAS System for Elementary Statistical Analysis and SAS System for Statistical Graphics.

If your site licenses SAS/GRAPH software, you can use the HISTOGRAM statement, PROBPLOT statement, and QQPLOT statement to create high-resolution graphs.

The HISTOGRAM statement generates histograms and comparative histograms that allow you to examine the data distribution. You can optionally fit families of density curves and superimpose kernel density estimates on the histograms. For additional information about the fitted distributions and kernel density estimates, see Formulas for Fitted Continuous Distributions.

The PROBPLOT statement generates a probability plot, which compares ordered values of a variable with percentiles of a specified theoretical distribution. The QQPLOT statement generates a quantile-quantile plot, which compares ordered values of a variable with quantiles of a specified theoretical distribution. Thus, you can use these plots to determine how well a theoretical distribution models a set of measures.

### Quantile-Quantile and Probability Plots

The following figure illustrates how to construct a Q-Q plot for a specified theoretical distribution with the QQPLOT statement.

Construction of a Q-Q Plot

First, the $n$ nonmissing values of the variable are ordered from smallest to largest: $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$. Then, the ith ordered value $x_{(i)}$ is represented on the plot by a point whose $y$-coordinate is $x_{(i)}$ and whose $x$-coordinate is $F^{-1}\left(\frac{i - 0.375}{n + 0.25}\right)$, where $F(\cdot)$ is the theoretical distribution with a zero location parameter and a unit scale parameter. For additional information about the theoretical distributions that you can request, see Theoretical Distributions for Quantile-Quantile and Probability Plots.

You can modify the adjustment constants -0.375 and 0.25 with the RANKADJ= and NADJ= options. The default combination is recommended by Blom (1958). For additional information, see Chambers et al. (1983). Because $x_{(i)}$ is a quantile of the empirical cumulative distribution function (ecdf), a Q-Q plot compares quantiles of the ecdf with quantiles of a theoretical distribution. Probability plots are constructed the same way, except that the $x$-axis is scaled nonlinearly in percentiles.
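The construction can be sketched in Python as follows. This is an illustration under stated assumptions rather than the QQPLOT implementation: the function `qq_points` and its parameter names are hypothetical (chosen to echo RANKADJ= and NADJ=), missing values are represented by `None`, and the theoretical distribution defaults to the standard normal from Python's `statistics` module.

```python
from statistics import NormalDist

def qq_points(values, rank_adj=-0.375, n_adj=0.25, dist=NormalDist()):
    """Return (x, y) pairs for a Q-Q plot against dist (standard normal by default)."""
    ordered = sorted(v for v in values if v is not None)  # drop missing values
    n = len(ordered)
    points = []
    for i, y in enumerate(ordered, start=1):
        p = (i + rank_adj) / (n + n_adj)     # adjusted plotting position
        points.append((dist.inv_cdf(p), y))  # (theoretical quantile, data value)
    return points

# Made-up measurements, including one missing value
for x, y in qq_points([4.1, None, 3.7, 5.2, 4.4, 4.0]):
    print(round(x, 3), y)
```

Passing different values for `rank_adj` and `n_adj` changes the plotting positions in the same way that the adjustment constants do in the description above.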

### Interpreting Quantile-Quantile and Probability Plots

If the data distribution matches the theoretical distribution, the points on the plot form a linear pattern. Thus, you can use a Q-Q plot or a probability plot to determine how well a theoretical distribution models a set of measurements. The following properties of these plots make them useful diagnostics to test how well a specified theoretical distribution fits a set of measurements:

• If the quantiles of the theoretical and data distributions agree, the plotted points fall on or near the line $y = x$.

• If the theoretical and data distributions differ only in their location or scale, the points on the plot fall on or near the line $y = ax + b$. The slope $a$ and intercept $b$ are visual estimates of the scale and location parameters of the theoretical distribution.
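The idea of reading location and scale off the point pattern can be made concrete with a short Python sketch. Note the substitution: instead of a visual estimate, the sketch fits a least squares line to the points of a normal Q-Q plot and reports its intercept and slope. The function name is hypothetical, the plotting positions use the constants -0.375 and 0.25, and at least two distinct values are assumed.

```python
from statistics import NormalDist, fmean

def estimate_location_scale(values, rank_adj=-0.375, n_adj=0.25):
    """Fit a line to normal Q-Q points; return (intercept, slope)."""
    ys = sorted(values)
    n = len(ys)
    std_normal = NormalDist()
    # Theoretical quantiles at the adjusted plotting positions
    xs = [std_normal.inv_cdf((i + rank_adj) / (n + n_adj)) for i in range(1, n + 1)]
    mx, my = fmean(xs), fmean(ys)
    # Ordinary least squares slope and intercept
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return intercept, slope  # estimates of location and scale

# Made-up measurements
location, scale = estimate_location_scale([9.1, 10.4, 8.7, 11.9, 10.0, 9.6, 12.3])
print(location, scale)
```

A least squares fit gives every point equal influence, so a few outliers can pull the line; a visual estimate lets you discount such points.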

Q-Q plots are more convenient than probability plots for graphical estimation of the location and scale parameters because the $x$-axis of a Q-Q plot is scaled linearly. On the other hand, probability plots are more convenient for estimating percentiles or probabilities. There are many reasons why the point pattern in a Q-Q plot may not be linear. Chambers et al. (1983) and Fowlkes (1987) discuss the interpretations of commonly encountered departures from linearity, and these are summarized in the following table.

Quantile-Quantile Plot Diagnostics

| Description of Point Pattern | Possible Interpretation |
|---|---|
| All but a few points fall on a line | Outliers in the data |
| Left end of pattern is below the line; right end of pattern is above the line | Long tails at both ends of the data distribution |
| Left end of pattern is above the line; right end of pattern is below the line | Short tails at both ends of the distribution |
| Curved pattern with slope increasing from left to right | Data distribution is skewed to the right |
| Curved pattern with slope decreasing from left to right | Data distribution is skewed to the left |
| Staircase pattern (plateaus and gaps) | Data have been rounded or are discrete |

In some applications, a nonlinear pattern may be more revealing than a linear pattern. However, as noted by Chambers et al. (1983), departures from linearity can also be due to chance variation.

Because PROC UNIVARIATE computes quantile statistics, it requires additional memory to store a copy of the data. By default, the report procedures PROC MEANS, PROC SUMMARY, and PROC TABULATE require less memory because they do not automatically compute quantiles. These procedures also provide an option to use a new fixed-memory quantile estimation method that is usually less memory intensive. For more information, see Quantiles.

The only factor that limits the number of variables that you can analyze is the computer resources that are available. The amount of temporary storage and CPU time that PROC UNIVARIATE requires depends on the statements and the options that you specify. To calculate the computer resources the procedure needs, let

• $N$ be the number of observations in the data set

• $V$ be the number of variables in the VAR statement

• $U_i$ be the number of unique values for the ith variable.

Then the minimum memory requirement in bytes to process all variables is

$$M = 24 \sum_{i} U_i$$

If $M$ bytes are not available, PROC UNIVARIATE must process the data multiple times to compute all the statistics. This reduces the minimum memory requirement to

$$M = 24 \max_i (U_i)$$

ROUND= reduces the number of unique values ($U_i$), thereby reducing memory requirements. ROBUSTSCALE requires $40 U_i$ bytes of temporary storage.

Several factors affect the CPU time requirement:

• The time to create $V$ tree structures to internally store the observations is proportional to $NV \log(N)$.

• The time to compute moments and quantiles for the ith variable is proportional to $U_i$.

• The time to compute the NORMAL option test statistics is proportional to $N$.

• The time to compute the ROBUSTSCALE option test statistics is proportional to $U_i \log(U_i)$.

• The time to compute the exact significance level of the sign rank statistic may increase when the number of nonzero values is less than or equal to 20.

Each of these factors has a different constant of proportionality. For additional information on how to optimize CPU performance and memory usage, see the SAS documentation for your operating environment.