 The SURVEYREG Procedure

Example 62.2: Simple Random Cluster Sampling

This example illustrates the use of regression analysis in a simple random cluster sampling design. The data are from Särndal, Swensson, and Wretman (1992, p. 652).

A total of 284 Swedish municipalities are grouped into 50 clusters of neighboring municipalities. Five clusters with a total of 32 municipalities are randomly selected. The results from the regression analysis in which clusters are used in the sample design are compared to the results of a regression analysis that ignores the clusters. The linear relationship between the population in 1975 and in 1985 is investigated.

The 32 selected municipalities in the sample are saved in the data set Municipalities.

   data Municipalities;
      input Municipality Cluster Population85 Population75;
      datalines;
205   37    5    5
206   37   11   11
207   37   13   13
208   37    8    8
209   37   17   19
6    2   16   15
7    2   70   62
8    2   66   54
9    2   12   12
10    2   60   50
94   17    7    7
95   17   16   16
96   17   13   11
97   17   12   11
98   17   70   67
99   17   20   20
100   17   31   28
101   17   49   48
276   50    6    7
277   50    9   10
278   50   24   26
279   50   10    9
280   50   67   64
281   50   39   35
282   50   29   27
283   50   10    9
284   50   27   31
52   10    7    6
53   10    9    8
54   10   28   27
55   10   12   11
56   10  107  108
;


The variable Municipality identifies the municipalities in the sample; the variable Cluster indicates the cluster to which a municipality belongs; and the variables Population85 and Population75 contain the municipality populations in 1985 and in 1975 (in thousands), respectively. The following statements use PROC SURVEYREG with a CLUSTER statement to perform the regression analysis.

   title1 'Regression Analysis for Swedish Municipalities';
   title2 'Cluster Simple Random Sampling';
   proc surveyreg data=Municipalities total=50;
      cluster Cluster;
      model Population85=Population75;
   run;


The TOTAL=50 option specifies the total number of clusters in the sampling frame.
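As a sketch of an equivalent specification, the first-stage sampling rate can be supplied with the RATE= option instead of the cluster total. This assumes the rate is the ratio of sampled clusters to clusters in the frame, here 5/50 = 0.1. The title text is illustrative only.

```sas
/* Sketch: same design as above, with the sampling rate given directly. */
/* Assumption: 5 sampled clusters out of 50 gives a rate of 0.1.        */
title1 'Regression Analysis for Swedish Municipalities';
title2 'Cluster Simple Random Sampling (RATE= specification)';
proc surveyreg data=Municipalities rate=0.1;
   cluster Cluster;
   model Population85=Population75;
run;
```

Either option gives the procedure the same information for the finite population correction, so the standard errors should agree with those obtained with TOTAL=50.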

Output 62.2.1: Regression Analysis for Simple Random Cluster Sampling

 Regression Analysis for Swedish Municipalities
 Cluster Simple Random Sampling

 The SURVEYREG Procedure
 Regression Analysis for Dependent Variable Population85

 Data Summary
 Number of Observations           32
 Mean of Population85       27.50000
 Sum of Population85       880.00000

 Design Summary
 Number of Clusters                5

 Fit Statistics
 R-square                     0.9860
 Root MSE                     3.0488
 Denominator DF                    4

 Estimated Regression Coefficients
 Parameter        Estimate   Standard Error   t Value   Pr > |t|
 Intercept      -0.0191292       0.89204053     -0.02     0.9839
 Population75    1.0546253       0.05167565     20.41     <.0001

 NOTE: The denominator degrees of freedom for the t tests is 4.

Output 62.2.1 displays the data summary, design summary, fit statistics, and regression coefficient estimates. Because the sample design includes clusters, the procedure displays the number of clusters in the sample in the "Design Summary" table. In the "Estimated Regression Coefficients" table, the estimated slope for the linear relationship is 1.05, which is significant at the 5% level, whereas the intercept is not significant. This suggests that a regression line passing through the origin can describe the relationship between the populations in 1975 and in 1985.
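The t values and p-values in the "Estimated Regression Coefficients" table can be checked by hand. The following DATA step is a minimal sketch that divides each estimate by its standard error and refers the result to a t distribution with 4 degrees of freedom (the number of sampled clusters minus one for this unstratified design). The variable names are hypothetical, and the numbers are copied from Output 62.2.1.

```sas
/* Sketch: reproduce the t tests in Output 62.2.1 from the printed values. */
data _null_;
   df = 5 - 1;                                   /* denominator DF reported by the procedure */

   t_slope = 1.0546253 / 0.05167565;             /* Population75: estimate / standard error  */
   p_slope = 2 * (1 - probt(abs(t_slope), df));  /* two-sided p-value                        */

   t_int = -0.0191292 / 0.89204053;              /* Intercept: estimate / standard error     */
   p_int = 2 * (1 - probt(abs(t_int), df));

   put t_slope= 8.2 p_slope= pvalue6.4;
   put t_int=   8.2 p_int=   pvalue6.4;
run;
```

The values written to the log should agree, up to rounding, with the t Value and Pr > |t| columns of the table.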

The CLUSTER statement is necessary in PROC SURVEYREG in order to incorporate the sample design. If you do not specify a CLUSTER statement in the regression analysis, the standard errors of the regression coefficients are estimated incorrectly, as the following statements illustrate.

   title1 'Regression Analysis for Swedish Municipalities';
   title2 'Simple Random Sampling';
   proc surveyreg data=Municipalities total=284;
      model Population85=Population75;
   run;


This analysis ignores the clusters in the sample and treats the sample design as simple random sampling of municipalities. Therefore, the TOTAL= option specifies the total number of municipalities, which is 284.
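The two TOTAL= values imply different sampling fractions, and therefore different finite population corrections. The following DATA step is a small sketch of that arithmetic, assuming the correction takes the usual form 1 - n/N, with n and N counted in clusters for the cluster analysis and in municipalities for the analysis that ignores clusters. The variable names are hypothetical.

```sas
/* Sketch: finite population corrections implied by the two TOTAL= values. */
data _null_;
   fpc_cluster = 1 - 5/50;     /* 5 of 50 clusters sampled                     */
   fpc_srs     = 1 - 32/284;   /* 32 of 284 municipalities, clustering ignored */
   put fpc_cluster= 6.4 fpc_srs= 6.4;
run;
```

The corrections are 0.90 and about 0.887, so they differ only slightly and account for little of the difference between the standard errors of the two analyses.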

Output 62.2.2: Regression Analysis for Simple Random Sampling

 Regression Analysis for Swedish Municipalities
 Simple Random Sampling

 The SURVEYREG Procedure
 Regression Analysis for Dependent Variable Population85

 Data Summary
 Number of Observations           32
 Mean of Population85       27.50000
 Sum of Population85       880.00000

 Fit Statistics
 R-square                     0.9860
 Root MSE                     3.0488
 Denominator DF                   31

 Estimated Regression Coefficients
 Parameter        Estimate   Standard Error   t Value   Pr > |t|
 Intercept      -0.0191292       0.67417606     -0.03     0.9775
 Population75    1.0546253       0.03668414     28.75     <.0001

 NOTE: The denominator degrees of freedom for the t tests is 31.

Output 62.2.2 displays the regression results when the clusters are ignored. The regression coefficient estimates are the same as in Output 62.2.1. However, without the clusters, the regression coefficients have smaller variance estimates. With the clusters, the estimated regression coefficient for the effect Population75 is 1.05 with an estimated standard error of 0.05 (Output 62.2.1); without the clusters, the estimate is again 1.05, but the estimated standard error is 0.04 (Output 62.2.2). To estimate the variance of the regression coefficients correctly, you should include the clustering information in the regression analysis.
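To put a number on the difference, the two standard errors for Population75 can be compared directly. The following DATA step is a rough sketch that uses the unrounded values from Output 62.2.1 and Output 62.2.2. The variable names are hypothetical. The ratio is only an informal summary, because the two analyses also use different TOTAL= values.

```sas
/* Sketch: compare the Population75 standard errors from the two analyses. */
data _null_;
   se_cluster = 0.05167565;            /* with the CLUSTER statement (Output 62.2.1) */
   se_srs     = 0.03668414;            /* clusters ignored (Output 62.2.2)           */
   se_ratio   = se_cluster / se_srs;   /* ratio of standard errors                   */
   var_ratio  = se_ratio**2;           /* ratio of variance estimates                */
   put se_ratio= 6.3 var_ratio= 6.3;
run;
```

The standard error is about 1.41 times as large when the clusters are taken into account, which corresponds to a variance estimate that is nearly twice as large.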
