The STDIZE Procedure 
The following example demonstrates how you can use the STDIZE procedure to obtain location and scale measures of your data.
In the following hypothetical data set, a random sample of grade 12 students is selected from a number of coeducational schools. Each school is classified as one of two types: Urban or Rural. There are 40 observations.
The variables are id (student identification), Type (type of school attended: 'urban' = urban area and 'rural' = rural area), and total (total assessment score in History, Geometry, and Chemistry).
The following DATA step creates the SAS data set TotalScores.
data TotalScores;
   title 'High School Scores Data';
   input id Type $ total;
   datalines;
 1 rural 135
 2 rural 125
 3 rural 223
 4 rural 224
 5 rural 133
 6 rural 253
 7 rural 144
 8 rural 193
 9 rural 152
10 rural 178
11 rural 120
12 rural 180
13 rural 154
14 rural 184
15 rural 187
16 rural 111
17 rural 190
18 rural 128
19 rural 110
20 rural 217
21 urban 192
22 urban 186
23 urban  64
24 urban 159
25 urban 133
26 urban 163
27 urban 130
28 urban 163
29 urban 189
30 urban 144
31 urban 154
32 urban 198
33 urban 150
34 urban 151
35 urban 152
36 urban 151
37 urban 127
38 urban 167
39 urban 170
40 urban 123
;
run;
Suppose you would now like to standardize the total scores in different types of schools prior to any further analysis. Before standardizing the total scores, you can use the Schematic Plots from PROC UNIVARIATE to summarize the total scores for both types of schools.
proc univariate data=TotalScores plot;
   var total;
   by Type;
run;
The PLOT option in the PROC UNIVARIATE statement creates the Schematic Plots and several other types of plots. The Schematic Plots display side-by-side box plots for each BY group (Figure 59.1). The vertical axis represents the total scores, and the horizontal axis displays two box plots: the one on the left is for the rural scores and the one on the right is for the urban scores.
Inspection reveals that one urban score is a low outlier. Also, if you compare the lengths of the two box plots, there seems to be about twice as much dispersion for the rural scores as for the urban scores.

Figure 59.2 displays the table from PROC UNIVARIATE for the lowest and highest five total scores for urban schools. The outlier (Obs = 3), marked in Figure 59.1 by the symbol '0', has a score of 64.
The following statements use the traditional standardization method (METHOD=STD) to compute the location and scale measures:
proc stdize data=totalscores method=std pstat;
   title2 'METHOD=STD';
   var total;
   by Type;
run;

Figure 59.3 displays the table of location and scale measures from the PROC STDIZE statement. PROC STDIZE uses the mean as the location measure and the standard deviation as the scale measure for standardizing. The PSTAT option displays this table; otherwise, no display is created.
The ratio of the scale of rural scores to the scale of urban scores is approximately 1.4 (41.96/30.07). This ratio is smaller than the dispersion ratio observed in the previous Schematic Plots.
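The mean and standard deviation quoted above can be cross-checked outside SAS. The following is a minimal Python sketch, not part of the SAS example; the list names `rural` and `urban` are illustrative, and the values are copied from the DATA step. It assumes the scale measure is the sample standard deviation (divisor n - 1), which is what `statistics.stdev` computes.

```python
from statistics import mean, stdev

# Total scores copied from the TotalScores DATA step, split by Type.
rural = [135, 125, 223, 224, 133, 253, 144, 193, 152, 178,
         120, 180, 154, 184, 187, 111, 190, 128, 110, 217]
urban = [192, 186, 64, 159, 133, 163, 130, 163, 189, 144,
         154, 198, 150, 151, 152, 151, 127, 167, 170, 123]

# Location = mean, scale = sample standard deviation (divisor n - 1).
rural_sd = stdev(rural)
urban_sd = stdev(urban)
std_ratio = rural_sd / urban_sd

print(f"rural: location={mean(rural):.2f} scale={rural_sd:.2f}")
print(f"urban: location={mean(urban):.2f} scale={urban_sd:.2f}")
print(f"scale ratio rural/urban = {std_ratio:.2f}")
```

The two scale values should agree with the 41.96 and 30.07 quoted in the text, giving a ratio of about 1.4.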
The STDIZE procedure provides several location and scale measures that are resistant to outliers. The following statements invoke three different standardization methods and display the Location and Scale Measures tables:
proc stdize data=totalscores method=mad pstat;
   title2 'METHOD=MAD';
   var total;
   by Type;
run;

proc stdize data=totalscores method=iqr pstat;
   title2 'METHOD=IQR';
   var total;
   by Type;
run;

proc stdize data=totalscores method=abw(4) pstat;
   title2 'METHOD=ABW(4)';
   var total;
   by Type;
run;
The results from this analysis are displayed in the following figures.

Figure 59.4 displays the table of location and scale measures when the standardization method is MAD. The location measure is the median, and the scale measure is the median absolute deviation from the median. The ratio of the scale of rural scores to the scale of urban scores is approximately 2.06 (32.0/15.5) and is close to the dispersion ratio observed in the previous Schematic Plots.
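As a cross-check outside SAS, the MAD figures can be reproduced with a short Python sketch. The helper name `mad` and the list names are illustrative; the data are copied from the DATA step. The sketch takes the median of the absolute deviations from the group median, with no rescaling constant applied.

```python
from statistics import median

# Total scores copied from the TotalScores DATA step, split by Type.
rural = [135, 125, 223, 224, 133, 253, 144, 193, 152, 178,
         120, 180, 154, 184, 187, 111, 190, 128, 110, 217]
urban = [192, 186, 64, 159, 133, 163, 130, 163, 189, 144,
         154, 198, 150, 151, 152, 151, 127, 167, 170, 123]

def mad(values):
    """Median absolute deviation from the median (unscaled)."""
    center = median(values)
    return median([abs(x - center) for x in values])

rural_mad = mad(rural)
urban_mad = mad(urban)
mad_ratio = rural_mad / urban_mad

print(f"rural: location={median(rural)} scale={rural_mad}")
print(f"urban: location={median(urban)} scale={urban_mad}")
print(f"scale ratio rural/urban = {mad_ratio:.2f}")
```

Because the single low urban score moves the median of the absolute deviations very little, the urban scale stays small and the ratio remains near 2.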

Figure 59.5 displays the table of location and scale measures when the standardization method is IQR. The location measure is the median, and the scale measure is the interquartile range. The ratio of the scale of rural scores to the scale of urban scores is approximately 2.03 (61/30). Because the length of each box in a box plot is the interquartile range, this is, in fact, the dispersion ratio observed in the previous Schematic Plots.
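The interquartile ranges can also be checked by hand. Quartile conventions differ between packages, so the following Python sketch makes an explicit assumption: quartiles are taken from the empirical distribution function with averaging (if n*p is a whole number j, average the j-th and (j+1)-th ordered values; otherwise take the next ordered value). That rule is an assumption on my part about what the procedure does here, chosen because it reproduces the 61 and 30 quoted above. The helper name `quartile` is illustrative.

```python
# Total scores copied from the TotalScores DATA step, split by Type.
rural = [135, 125, 223, 224, 133, 253, 144, 193, 152, 178,
         120, 180, 154, 184, 187, 111, 190, 128, 110, 217]
urban = [192, 186, 64, 159, 133, 163, 130, 163, 189, 144,
         154, 198, 150, 151, 152, 151, 127, 167, 170, 123]

def quartile(values, k):
    """k-th quartile (k = 1 or 3), empirical distribution with averaging.

    Assumed rule: write n*k/4 = j + g. If g == 0, average the j-th and
    (j+1)-th ordered values; otherwise return the (j+1)-th ordered value.
    """
    xs = sorted(values)
    j, r = divmod(len(xs) * k, 4)      # integer arithmetic avoids float error
    if r == 0:
        return (xs[j - 1] + xs[j]) / 2
    return xs[j]

def iqr(values):
    return quartile(values, 3) - quartile(values, 1)

rural_iqr = iqr(rural)
urban_iqr = iqr(urban)
iqr_ratio = rural_iqr / urban_iqr

print(f"rural IQR = {rural_iqr}, urban IQR = {urban_iqr}")
print(f"scale ratio rural/urban = {iqr_ratio:.2f}")
```

Other quartile definitions, such as linear interpolation, give slightly different interquartile ranges for these data, so the convention matters when comparing against the table.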

Figure 59.6 displays the table of location and scale measures when the standardization method is ABW. The location measure is the biweight 1-step M-estimate, and the scale measure is the biweight A-estimate. Note that the initial estimate for ABW is MAD. The tuning constant (4) of ABW is obtained by the following steps:
Refer to Goodall (1983, Chapter 11) for details on the tuning constant. The ratio of the scale of rural scores to the scale of urban scores is approximately 2.06 (32.0/15.5). It is also close to the dispersion ratio observed in the previous Schematic Plots.
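The steps for choosing the tuning constant are not reproduced in this text. As a hedged illustration only, the following Python sketch assumes that the tuning constant acts as a cutoff on scores standardized by the initial MAD estimates, which matches how biweight estimators give zero weight to observations more than c scale units from the center. The names `mad_standardized` and `TUNING_CONSTANT` are illustrative, not PROC STDIZE syntax.

```python
from statistics import median

# Total scores copied from the TotalScores DATA step, split by Type.
rural = [135, 125, 223, 224, 133, 253, 144, 193, 152, 178,
         120, 180, 154, 184, 187, 111, 190, 128, 110, 217]
urban = [192, 186, 64, 159, 133, 163, 130, 163, 189, 144,
         154, 198, 150, 151, 152, 151, 127, 167, 170, 123]

TUNING_CONSTANT = 4  # the value used in METHOD=ABW(4)

def mad_standardized(values):
    """Return (score, z) pairs with z = (score - median) / MAD."""
    center = median(values)
    scale = median([abs(x - center) for x in values])
    return [(x, (x - center) / scale) for x in values]

rural_max = max(abs(z) for _, z in mad_standardized(rural))
urban_max = max(abs(z) for _, z in mad_standardized(urban))

# Scores at or beyond the cutoff would receive zero biweight weight
# under the assumption stated in the lead-in.
rural_beyond = [x for x, z in mad_standardized(rural) if abs(z) >= TUNING_CONSTANT]
urban_beyond = [x for x, z in mad_standardized(urban) if abs(z) >= TUNING_CONSTANT]

print(f"largest |z|: rural={rural_max:.2f}, urban={urban_max:.2f}")
print(f"beyond cutoff: rural={rural_beyond}, urban={urban_beyond}")
```

Under this assumption, a constant of 4 sits between the largest standardized rural score and the standardized urban outlier, so only the score of 64 falls beyond the cutoff.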
The preceding analysis shows that METHOD=MAD, METHOD=IQR, and METHOD=ABW all provide better dispersion ratios than does METHOD=STD.
For comparison, you can recompute the standard deviation after deleting the outlier from the original data set. The following statements create a data set NoOutlier that excludes the outlier from the TotalScores data set and then invoke PROC STDIZE with METHOD=STD.
data NoOutlier;
   set totalscores;
   if (total = 64) then delete;
run;

proc stdize data=NoOutlier method=std pstat;
   title2 'after removing outlier, METHOD=STD';
   var total;
   by Type;
run;

Figure 59.7 displays the location and scale measures after deleting the outlier. The lack of resistance of the standard deviation to outliers is clearly illustrated: if you delete the outlier, the sample standard deviation of urban scores changes from 30.07 to 22.09. The new ratio of the scale of rural scores to the scale of urban scores is approximately 1.90 (41.96/22.09).
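The effect of the single outlier on the standard deviation can be confirmed with a small Python sketch outside SAS. The list names are illustrative, the data are copied from the DATA step, and the filter mirrors the `if (total = 64) then delete;` statement. The sample standard deviation (divisor n - 1) is assumed.

```python
from statistics import stdev

# Total scores copied from the TotalScores DATA step, split by Type.
rural = [135, 125, 223, 224, 133, 253, 144, 193, 152, 178,
         120, 180, 154, 184, 187, 111, 190, 128, 110, 217]
urban = [192, 186, 64, 159, 133, 163, 130, 163, 189, 144,
         154, 198, 150, 151, 152, 151, 127, 167, 170, 123]

# Drop the outlying score, as the NoOutlier DATA step does.
urban_no_outlier = [x for x in urban if x != 64]

urban_sd_before = stdev(urban)
urban_sd_after = stdev(urban_no_outlier)
ratio_after = stdev(rural) / urban_sd_after

print(f"urban scale: {urban_sd_before:.2f} -> {urban_sd_after:.2f}")
print(f"scale ratio rural/urban after deletion = {ratio_after:.2f}")
```

Removing one observation out of twenty shrinks the urban standard deviation by roughly a quarter, which is the lack of resistance the text describes.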
Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.