Simple Regression & Correlation Example, Dr. Usip, Economics

# Simple Regression & Correlation Example

Estimation: The Ordinary Least Squares (OLS) Method.
The scattergram examined earlier contains a discussion of both the Problem Description and the Data used in deriving the results presented here. The estimation method is classical Ordinary Least Squares (OLS), which is programmed into the SPSS/win statistical package. The Linear Regression Model (LRM) has the form Yi = A + BXi + Ei, where Y is the DV (in this case, annual Family Food Expenditure), X is the IV (in this case, annual Family Income), and E is the random error term, a proxy for all the uncertain factors that may also affect family food expenditure. In regression analysis, the Classical Assumptions of the LRM apply essentially to the error term. A and B are the regression parameters whose numerical values we seek to estimate; in so doing, we will have succeeded in estimating the underlying Population Regression Line (PRL) using the OLS method. SPSS/win implements this method automatically when the command sequence presented earlier is used.
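For reference, the OLS estimates a and b are the values that minimize the sum of squared residuals. The standard closed-form solution for the simple regression case is sketched below; the sample means written with bars are notation assumed here, not symbols defined in the text above.

$$
\min_{a,\,b} \sum_{i=1}^{n} \left(Y_i - a - bX_i\right)^2
\quad\Longrightarrow\quad
b = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad
a = \bar{Y} - b\,\bar{X}
$$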

Discussion of the Outputs/Results and Related Tests
The results will be discussed in the order in which SPSS/win generates the outputs. These outputs are presented in the tables below. For instance, the discussion in part I pertains to the DESCRIPTIVE STATISTICS table, followed by part II which pertains to the CORRELATIONS table, and so on. This approach permits a critical analysis of the results and their implications.

Descriptive Statistics

| | Mean | Std. Deviation | N |
|---|---|---|---|
| Annual Food Expenditure (\$000) | 7.965 | 4.664 | 20 |
| Annual Income (\$000) | 45.50 | 23.96 | 20 |

Correlations

| | | Annual Food Expenditure (\$000) | Annual Income (\$000) |
|---|---|---|---|
| Pearson Correlation | Annual Food Expenditure (\$000) | 1.000 | .946 |
| | Annual Income (\$000) | .946 | 1.000 |
| Sig. (1-tailed) | Annual Food Expenditure (\$000) | . | .000 |
| | Annual Income (\$000) | .000 | . |
| N | Annual Food Expenditure (\$000) | 20 | 20 |
| | Annual Income (\$000) | 20 | 20 |

Model Summary(a,b)

| Model | Variables Entered | Variables Removed | R | R Square | Adjusted R Square | Std. Error of the Estimate | Durbin-Watson |
|---|---|---|---|---|---|---|---|
| 1 | Annual Income (\$000)(c,d) | . | .946 | .894 | .888 | 1.559 | 2.834 |

a Dependent Variable: Annual Food Expenditure (\$000)

b Method: Enter

c Independent Variables: (Constant), Annual Income (\$000)

d All requested variables entered.

ANOVA(a)
| Model | | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 369.573 | 1 | 369.573 | 151.975 | .000(b) |
| | Residual | 43.773 | 18 | 2.432 | | |
| | Total | 413.346 | 19 | | | |

a Dependent Variable: Annual Food Expenditure (\$000)

b Independent Variables: (Constant), Annual Income (\$000)

Coefficients(a)

| Model | | Unstandardized B | Unstandardized Std. Error | Standardized Beta | t | Sig. |
|---|---|---|---|---|---|---|
| 1 | (Constant) | -.412 | .764 | | -.539 | .596 |
| | Annual Income (\$000) | .184 | .015 | .946 | 12.328 | .000 |

a Dependent Variable: Annual Food Expenditure (\$000)

I. Descriptive Statistics Table
1. Annual Food Expenditure
a) The sample mean is 7.965 thousands of dollars. This means that an average family in the sample spends \$7,965 annually on food.

b) The sample standard deviation of 4.664 (thousands of dollars) is equivalent to one standard deviation of \$4,664 about the mean value of \$7,965. If food expenditure is approximately normally distributed, this implies that about 68.3% of the families spend between \$3,301 and \$12,629 annually on food.

2. Annual Income
a) The sample mean is 45.50 thousands of dollars. In terms of income, this implies that an average family in the sample makes \$45,500 annually.

b) The sample standard deviation of 23.96 (thousands of dollars) is equivalent to ±\$23,960 about the mean income of \$45,500. Thus, under the same normality assumption, about 68.3% of the families could be said to make between \$21,540 and \$69,460 annually.

3. Sample size N (actually 'n') = 20 simply means that all 20 observations were used in the estimation; there were no missing values.
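The one-standard-deviation ranges discussed in this part are simple arithmetic on the Descriptive Statistics table. A minimal sketch follows; it uses the tabulated standard deviations without further rounding, and the function and variable names are illustrative only.

```python
# Summary statistics copied from the Descriptive Statistics table (in $000).
stats = {
    "food": {"mean": 7.965, "sd": 4.664},
    "income": {"mean": 45.50, "sd": 23.96},
}


def one_sd_band(mean, sd):
    """Return (lower, upper) bounds of mean +/- one standard deviation, in dollars."""
    return round((mean - sd) * 1000), round((mean + sd) * 1000)


food_band = one_sd_band(**stats["food"])
income_band = one_sd_band(**stats["income"])

print("Food expenditure band:", food_band)
print("Income band:", income_band)
```

Reading these bands as covering roughly 68.3% of families relies on the variables being approximately normally distributed, which a sample of 20 cannot firmly establish.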

II. Correlations Table
This table contains the Pearson sample correlation coefficients of variable i with variable j (denoted as ri,j), which are the key analytical tools of Correlation Analysis. This is the same Karl Pearson that I mentioned in the historical footnote under the discussion of the Chi-square test of Independence (and also in the glossary under regression analysis). Let us focus for now on the top part of the table. It is a 2 by 2 matrix (i = 1, 2; and j = 1, 2). The following conclusions are obvious:

1. The correlation of annual Food Expenditure with itself is perfect, linear, and direct since r1,1 = 1.000.

2. The correlation of annual Food Expenditure with Income is quite strong, linear, and direct because r1,2 = .946.

3. The 2 x 2 matrix is symmetric about the main diagonal; hence, all the information about the type and strength of relationship between the two variables can be obtained from the correlation coefficients either above the main diagonal or below it.

4. The middle portion of the table contains the p-values (Sig. = significance) for a one-tailed test of Ho: ρi,j = 0 against Ha: ρi,j > 0, where ρ (rho) is the population correlation coefficient, whose value is unknown. The p-values (i.e., the computed/observed significance levels) of .000, meaning less than .0005, indicate that Ho can be rejected unequivocally at the critical level of alpha = .01. Thus, the conclusions in (1) and (2) above are indeed valid.

5. Again, N (i.e. 'n' ) = 20 since all the observations were used in the estimation.
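The one-tailed test reported in the middle of the Correlations table can be approximated by hand with the usual t statistic for a correlation coefficient, t = r·√(n − 2) / √(1 − r²), with n − 2 degrees of freedom. The sketch below is an illustration rather than the SPSS/win computation itself; the critical value 2.552 for alpha = .01 and 18 degrees of freedom is taken from a standard t table.

```python
import math

r = 0.946   # Pearson correlation from the Correlations table (rounded)
n = 20      # sample size

# Test statistic for H0: rho = 0, distributed as t with n - 2 degrees of freedom.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)

# One-tailed critical value t(.01, 18), read from a standard t table.
t_crit = 2.552

reject_h0 = t_stat > t_crit
print(f"t = {t_stat:.2f}, critical value = {t_crit}, reject H0: {reject_h0}")
```

Because r is rounded to three decimals, the statistic comes out near 12.38 rather than matching the slope's reported t of 12.328 exactly; in simple regression the two tests are equivalent.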

III. Model Summary Table
This table contains the necessary summary statistics for assessing the accuracy of the estimated sample regression line (SRL) yi = a + bXi + ei, where 'a' and 'b' are the estimators of A and B, respectively; and 'e' denotes the residual as an estimator of the random error term E previously defined.

Before examining the meaning of the summary statistics, some remarks about the appended footnotes are in order. Footnotes a and c are self-explanatory. Footnote b relates to the algorithm that SPSS/win uses to estimate the model/SRL.

'Enter' simply means that annual Family Food Expenditure (DV) is regressed on both the constant term ‘a’ and the income (IV) using the OLS method. For those students who might take Econ. 825, I should mention that SPSS/win can also implement STEPWISE, BACKWARD, and FORWARD algorithms based on the OLS estimation procedure. These algorithms are however useful only in the context of multiple regression analysis where the goal is to select from among many potential IVs those that significantly influence the DV. That said, let us now discuss the results reported in the table.

1. R is the sample correlation coefficient (the standard notation is r as discussed earlier). The meaning of r =.946 is the same as the one given earlier - that the relationship between annual family Food expenditure and Income is quite strong, positive and linear.

2. R-square or R² is the sample Coefficient of Determination (r² is commonly used in simple regression analysis, while R² is appropriately reserved for multiple regression analysis). It measures the goodness-of-fit of the estimated SRL in terms of the proportion of the variation in the DV explained by the fitted sample regression equation or SRL. Thus, the value of r² = .894 simply means that 89.4% of the variation in annual Family Food Expenditure is explained or accounted for by the estimated SRL/equation ŷi = -.412 + .184Xi, which is reported in the Coefficients table (the last one). This information is quite useful in assessing the overall accuracy of the model. Notice that r² is the square of r = .946. Conversely, the value of r can be recovered from the simple rule r = ±√(r²), where ± is the sign of the estimated slope coefficient ‘b’. In this example, the + sign applies since b = .184.

3. Adjusted R-Square (or r² with a bar over it) is the sample Coefficient of Determination after adjusting for the degrees of freedom lost in the process of estimating the regression parameters. In this case, only two parameters, A and B, were estimated; thus, the remaining degrees of freedom are v = n - 2. Because it accounts for these lost degrees of freedom, the adjusted r-square is a better measure of the goodness-of-fit of the estimated SRL than its nominal/unadjusted counterpart. It is always smaller in value than the unadjusted measure. I will examine the adjusted coefficient of determination in some detail in Econ. 853 and 976.

4. Standard Error of the Estimate (standard notation is Se). This summary statistic measures the overall accuracy or quality of the estimated SRL in terms of the average/standardized unexplained variation in the DV. Such variation may be due to errors that originate from (i) chance errors of sampling (sampling errors), which cause the values of ‘a’ and ‘b’ to differ from the true but unknown values of the parameters ‘A’ and ‘B’; and (ii) possible variation in the parameters, which, according to the Classical Assumptions, are presumed constant. If these errors are small, on average, then the value of Se approaches zero (it is exactly zero if the estimated values of the DV, denoted here as ŷi, equal their actual/observed counterparts yi for all i = 1, 2, ..., n). Otherwise, the value of Se grows without bound, in which case the estimated SRL must be considered useless, especially if the application involves predicting the DV outside the sample period. Note that Se is an estimator of the standard deviation around the true conditional PRL µy/x = A + BXi, which is denoted as σy/x; strictly speaking, it is Se² that is an unbiased estimator of the variance σ²y/x.

In this example, Se = 1.559 means that, on average, the predicted values of annual family Food Expenditure could vary by ±\$1,559 about the estimated regression equation for each value of Income during the sample period, and by a much larger amount outside the sample period. This is why prediction outside the sample period requires the use of the standard errors of the estimators ‘a’ and ‘b’ (denoted, respectively, as Sa and Sb) for establishing confidence intervals. Note that Sa and Sb take into account the aforementioned chance errors of sampling. Accounting for parameter variation requires advanced econometric techniques, which are beyond the scope of the undergraduate material.

5. The Durbin-Watson (DW) statistic measures the presence, or lack thereof, of Serial Correlation (also known as Autocorrelation) among the errors from one observation (or time period) to other observations (or time periods). Details about the implications of autocorrelation will be examined in Econ. 853 and Econ. 976. For now, suffice it to say that a value of DW = 2.834 means that the residuals ei = yi - ŷi (for all i = 1, 2, ..., n) from the estimated regression model are negatively correlated, and strongly so. This is undesirable according to the Classical Assumptions. The ideal value is 2.00, indicating no autocorrelation.
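The Model Summary figures can be cross-checked against the sums of squares in the ANOVA table. The sketch below recomputes r², adjusted r², and Se from those tabulated values. The last line uses the common approximation DW ≈ 2(1 − ρ̂) to back out a rough first-order autocorrelation of the residuals; that approximation is an addition here, not something SPSS/win reports. Variable names are illustrative.

```python
import math

# Values copied from the ANOVA and Model Summary tables.
ess, rss, tss = 369.573, 43.773, 413.346   # explained, residual, total sums of squares
n, k = 20, 1                               # observations, number of IVs
dw = 2.834                                 # Durbin-Watson statistic

# Coefficient of determination: share of total variation explained by the SRL.
r2 = ess / tss

# Adjusted for degrees of freedom: compares residual and total mean squares.
adj_r2 = 1 - (rss / (n - k - 1)) / (tss / (n - 1))

# Standard error of the estimate: square root of the residual mean square.
se = math.sqrt(rss / (n - k - 1))

# Rough first-order autocorrelation implied by DW (approximation DW ~ 2 * (1 - rho)).
rho_hat = 1 - dw / 2

print(f"r2 = {r2:.3f}, adjusted r2 = {adj_r2:.3f}, Se = {se:.3f}, rho_hat = {rho_hat:.3f}")
```

The negative value of rho_hat is consistent with the negative autocorrelation noted in item 5.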

IV. ANOVA Table
The summary measures reported here are used in partitioning the total variation in the DV according to the identity relation TSS = ESS + RSS, where TSS is the Total Sum of Squares in the DV, ESS is the Explained Sum of Squares due to the fitted regression equation or model, and RSS is the Residual (remaining) Sum of Squares that is unexplained and hence attributable to errors (i.e., chance sampling errors, and those resulting from parameter variation). Note the following: (1) The smaller RSS is relative to TSS (or the larger ESS is relative to TSS), the better the estimated regression equation appears to fit the data. (2) The underlying principle in the partition of TSS is similar to that of the One-way ANOVA technique examined earlier. As in that technique, the identity relation carries over to the associated degrees of freedom in this manner: v = v1 + v2, where v1 = k - 1 and v2 = n - k, so that v = n - 1; here k denotes the number of parameters that are estimated. (3) If k is instead defined as the number of IVs in the model, then v1 = k and v2 = n - k - 1; again, v = v1 + v2 = n - 1.

Caution: Some authors use RSS (regression sum of squares) instead of ESS (explained sum of squares), and ESS (error sum of squares) instead of RSS (residual sum of squares) so that the identity is stated as TSS = RSS + ESS.  So pay attention to how these acronyms are defined.

From the table, under the df column, v1 = 1, v2 = 18, v = 19, and Fov = 151.975. In the context of simple regression analysis, the F-test is not very useful: since there is only one IV in the model, the overall significance of the estimated model can be assessed by performing a simple t-test on the slope coefficient of the IV. After all, from the formal conceptual definitions of the t- and F-distributions, F = t². (As a check, Fov = 151.975 while t² = (12.328)² = 151.97958; the minor difference is due to rounding during estimation. This equality holds only in simple regression; in multiple regression, the F- and t-tests are quite different.) Thus, the t-test of the significance of the causal influence of the only IV provides an adequate assessment of the significance of the whole model. The next and final discussion presents the t-test.
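The partition of the sums of squares and the F = t² relation can be verified directly from the tabulated values. A minimal sketch, with illustrative variable names:

```python
# Values copied from the ANOVA and Coefficients tables.
ess, rss, tss = 369.573, 43.773, 413.346
df_reg, df_res, df_total = 1, 18, 19
t_slope = 12.328                  # observed t for the Income coefficient

# Identity relations: TSS = ESS + RSS and v = v1 + v2.
identity_holds = abs((ess + rss) - tss) < 1e-6
df_identity_holds = (df_reg + df_res) == df_total

# F is the ratio of the regression mean square to the residual mean square.
mean_sq_reg = ess / df_reg
mean_sq_res = rss / df_res
f_stat = mean_sq_reg / mean_sq_res

print(f"MS residual = {mean_sq_res:.3f}, F = {f_stat:.2f}, t squared = {t_slope**2:.2f}")
```

The recomputed F and t² agree with the reported Fov = 151.975 to within the rounding of the table entries.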

V. Coefficients Table
This table contains the estimated regression coefficients (a = -.412, b = .184), and hence the estimated SRL/equation written as ŷi = -.412 + .184Xi. The following interpretations apply:

1. b = .184 represents the marginal effect of annual family Income on Food Expenditure. The estimated positive sign implies that such effect is positive while the absolute value implies that Food Expenditure would increase by \$184 for every \$1000 increase in Income.

2. a = -.412 has no interpretable meaning because the average level of family Food Expenditure could not be negative even when no member of the family is gainfully employed. Relatives, friends, or Uncle Sam can help such a family. Nonetheless, this value should not be discarded; it plays an important role when the estimated regression line/equation is used for prediction.

3. Standard errors of the estimators, Sa = .764 and Sb = .015, measure the precision with which a = -.412 and b = .184, respectively, estimate the true but unknown values of the corresponding regression parameters A and B. The closer the values of Sa and Sb are to zero, the higher the precision of the estimates, suggesting that chance errors due to sampling are not severe. The converse would suggest the opposite. Thus, Sb = .015 implies that b = .184 is a precise estimate, close to the true value of B; whereas Sa = .764 implies quite the opposite for a = -.412, coupled with the fact that the estimated sign contradicts common sense or reality.

4. Standardized Coefficient (also called beta coefficient) for the only IV is the same as the correlation coefficient r = .946. This simply means that family Income is an important determinant of family Food Expenditure with a strong positive effect. Application of the standardized coefficient is, however, useful when there are two or more IVs in the model so that their relative importance can be ranked according to the size (i.e., the absolute value) of the beta coefficients. See the multiple regression tutorial.

5. Observed/computed t statistic (tov): T-test of Significance and the Sign of the Regression Coefficient (B)
As part of investigating the accuracy of the fitted SRL, it is often useful to verify both the statistical significance and the economic significance (i.e., the sign) of the regression parameter/coefficient B. For statistical significance, the null hypothesis is stated as H0: B = 0 against the alternative Ha: B is not equal to zero. Stated otherwise, H0 says that Income has no significant causal influence on Food Expenditure; Ha asserts the opposite. For alpha = .05 and v = n - k = 20 - 2 = 18, the critical t-value is tcv = t.025,18 = ±2.101. But tov = 12.328; thus, H0 has to be rejected in favor of Ha, in which case family Income can be said to have a significant influence on family Food Expenditure.

An interesting variation of the t-test is to verify the economic significance of the parameter with respect to the direction of causality of the associated IV. In this case, the null is phrased as H0: B is at most zero, against Ha: B > 0 (i.e., its value is strictly positive, according to economic theory). At the level of alpha = .05, the critical t-value is tcv = t.05,18 = +1.734. But tov = 12.328; thus, H0 of a negative or no effect of Income has to be rejected unequivocally.

6. Prediction -- using the estimated SRL
Suppose a typical or ith family drawn from the same population had an annual net Income of \$30,000 in 1993 (this is the 8th family in our sample). Its estimated annual Food Expenditure, corresponding to Xi = 30 (thousands of dollars), would be ŷi = -.412 + .184 × 30 = 5.111 thousands of dollars (the rounded coefficients shown give 5.108; the reported 5.111 presumably reflects the unrounded estimates). Thus, \$5,111 is the best estimate of the average annual Food Expenditure for this family. But this family actually spent 5.8 thousands of dollars, or \$5,800. Hence, the positive residual of \$689 (i.e., e8 = 5800 - 5111) is the amount by which the estimated SRL has underpredicted the annual food expenditure for this family.
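The t-tests in item 5 and the prediction in item 6 can be reproduced with a few lines of arithmetic. The sketch below works from the rounded coefficients in the Coefficients table, so its results differ slightly from the reported ones (for example, b / Sb gives about 12.27 rather than 12.328). The function name `predict_food` is hypothetical, and the critical values are those quoted in the text.

```python
# Rounded estimates copied from the Coefficients table.
a, b = -0.412, 0.184          # intercept and slope
s_b = 0.015                   # standard error of the slope
t_ov = 12.328                 # observed t reported by SPSS/win

# Critical values for 18 degrees of freedom, as quoted in the text.
t_crit_two_tailed = 2.101     # alpha = .05, H0: B = 0 vs Ha: B != 0
t_crit_one_tailed = 1.734     # alpha = .05, H0: B <= 0 vs Ha: B > 0

# The t statistic is the estimate divided by its standard error.
t_from_rounded = b / s_b

reject_two_tailed = abs(t_ov) > t_crit_two_tailed
reject_one_tailed = t_ov > t_crit_one_tailed


def predict_food(income_thousands):
    """Predicted annual food expenditure ($000) from the estimated SRL."""
    return a + b * income_thousands


y_hat = predict_food(30)      # the 8th family, income of $30,000
residual = 5.8 - y_hat        # actual minus predicted, in $000

print(f"t from rounded values = {t_from_rounded:.2f}")
print(f"predicted = {y_hat:.3f}, residual = {residual:.3f}")
```

A positive residual means the fitted line underpredicts this family's food expenditure, whichever set of coefficients is used.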