Motivation: Oftentimes, it may not be realistic to conclude that only one factor or independent variable (IV) influences the behavior of the dependent variable (DV). In such situations, a researcher needs to carefully identify those other possible factors and explicitly include them in the Linear Regression Model (LRM). Existing economic theory or common sense should constitute a basis for selecting the IVs; and where data on a theoretically construed variable are not readily available, a proxy should be carefully chosen.
This tutorial will illustrate the key steps involved in
using multiple regression and correlation to solve real world problems. The example will
consider a multiple LRM which typically has the form:
Yi = A + B1Xi,1 + B2Xi,2 + ... + BKXi,K + Ei
where the Xj are the IVs; A is the intercept; the Bj (j = 1, 2, ..., K) are the regression parameters or coefficients, each reflecting the partial effect of the associated IV, holding the effects of all other IVs constant; K is the number of IVs in the model; and Ei is the random error term. Again, note that in regression analysis, all of the underlying classical assumptions essentially apply to this random error term. In multiple regression the three most crucial ones are the assumptions of no multicollinearity among the IVs, of no heteroskedasticity in the error variances, and of no autocorrelation in the errors for all i.
Step 1: Formulate the LRM and State the Expected Signs
of the Regression Parameters
When specifying an LRM, theory or common sense should be your guide in stipulating, a priori, the expected signs of the regression parameters.
Let us return to the family food expenditure example that we introduced in the simple regression tutorial. In that tutorial, the only factor that was explicitly identified as the predictor of annual family Food Expenditure (Y) was Income (henceforth denoted as X1). The effects of all other predictors were assumed away or held constant. We now extend the model to include another important determinant, viz., the family size (X2) which is easily measured in terms of the number of people in a family. The representative LRM has the form:
Yi = A + B1Xi,1 + B2Xi,2 + Ei
Based on economic theory, we should expect the signs of B1 and B2 to be positive; in other words, both the family Income and Size, respectively, are expected to have positive effects on the family food expenditure. Note that B1 measures the partial effect of Income on family food expenditure, holding family size constant; whereas B2 measures the partial effect of family Size on Food Expenditure, holding income constant. Also, note that holding one IV constant while examining the effect of the other assumes that there is no collinearity between the two IVs. The sign of A could be positive or negative and indeed A may or may not have an interpretable meaning. Nonetheless, always include the intercept term in your model -- more on this in Econs 853 and 976.
Step 2: Examine the Data Visually for Inherent Patterns
with a Scatterplot Matrix.
It is always advisable to do some exploratory analysis of the data to uncover inherent patterns as to the type and strength of relationship among the variables as well as the presence of outliers in the data. The scatterplot matrix is a useful graphical device for doing so. While a strong linear association between the DV and each of the IVs is highly desirable, a strong linear association between (or among) the IVs is highly undesirable since it is indicative of the presence of a collinearity (or multicollinearity) problem in the model. The consequences of collinearity/multicollinearity will be treated in Econs 853 and 976.
For this example, the data set
for the simple regression analysis has been augmented to include data on X2.
The results of the preliminary analysis of the data are discussed separately in the
scatterplot matrix component.
After studying the results for reasonable inferences, the next phase of the data analysis is to estimate the LRM. Estimating the embedded parameters of the population regression plane (PRP) is accomplished by fitting the sample regression plane (SRP) to a sample of data on all the variables of the model.
Step 3: Estimate the SRP
Again, the estimation method is the classical Ordinary Least Squares (OLS) technique which is applied to the sample regression plane (SRP) that has the form:
yi = a + b1Xi,1 + b2Xi,2 + ei or ýi = a + b1Xi,1 + b2Xi,2
Note that yi and ýi are the actual/observed and the predicted/estimated values of Y, respectively (for all i = 1, 2, ..., n). 'a', 'b1', and 'b2' are the estimators of A, B1, and B2, respectively. 'ei' denotes the residual (defined as ei = yi - ýi) and is the estimator of the random error term Ei.
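To make the least-squares idea concrete, here is a minimal Python sketch of the calculation for a model with two IVs. The five observations are made up purely for illustration (they are not the tutorial's sample), and the helper names `cross_dev` and `fit_two_iv_ols` are my own. The y values are built exactly from a = 1.0, b1 = 0.1, and b2 = 0.5, so the fit should recover those numbers.

```python
# Hypothetical data, NOT the tutorial's sample of 20 families.
x1 = [20.0, 30.0, 40.0, 50.0, 60.0]   # income ($000), made up
x2 = [1.0, 3.0, 2.0, 5.0, 4.0]        # family size, made up
y = [1.0 + 0.1 * u + 0.5 * v for u, v in zip(x1, x2)]  # exact plane, no error term


def mean(values):
    return sum(values) / len(values)


def cross_dev(u, v):
    """Sum of cross-products of deviations of u and v from their means."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))


def fit_two_iv_ols(y, x1, x2):
    """Return (a, b1, b2) that minimize the sum of squared residuals."""
    s11, s22, s12 = cross_dev(x1, x1), cross_dev(x2, x2), cross_dev(x1, x2)
    s1y, s2y = cross_dev(x1, y), cross_dev(x2, y)
    det = s11 * s22 - s12 ** 2  # equals zero when X1 and X2 are perfectly collinear
    if det == 0:
        raise ValueError("X1 and X2 are perfectly collinear; no unique solution")
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    a = mean(y) - b1 * mean(x1) - b2 * mean(x2)
    return a, b1, b2


a, b1, b2 = fit_two_iv_ols(y, x1, x2)
residuals = [yi - (a + b1 * u + b2 * v) for yi, u, v in zip(y, x1, x2)]
print("a =", round(a, 6), "b1 =", round(b1, 6), "b2 =", round(b2, 6))
```

The determinant check shows why perfect collinearity is fatal: when it is zero, the two slope coefficients cannot be separated.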
The OLS method is programmed into the SPSS/win statistical package. Using the command sequence presented earlier will automatically implement this method. The following outputs contain the necessary results, which are based on selected options that are accessible via the 'Statistics...' button.
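Descriptive Statistics

| | Mean | Std. Deviation | N |
|---|---|---|---|
| Annual Food Expenditure ($000) | 7.965 | 4.664 | 20 |
| Annual Income ($000) | 45.50 | 23.96 | 20 |
| Family Size | 2.95 | 1.61 | 20 |

Correlations

| | | Annual Food Expenditure ($000) | Annual Income ($000) | Family Size |
|---|---|---|---|---|
| Pearson Correlation | Annual Food Expenditure ($000) | 1.000 | .946 | .787 |
| | Annual Income ($000) | .946 | 1.000 | .676 |
| | Family Size | .787 | .676 | 1.000 |
| Sig. (1-tailed) | Annual Food Expenditure ($000) | . | .000 | .000 |
| | Annual Income ($000) | .000 | . | .001 |
| | Family Size | .000 | .001 | . |
| N | Annual Food Expenditure ($000) | 20 | 20 | 20 |
| | Annual Income ($000) | 20 | 20 | 20 |
| | Family Size | 20 | 20 | 20 |

Model Summary

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | Durbin-Watson |
|---|---|---|---|---|---|
| 1 | .967 | .935 | .927 | 1.261 | 2.616 |

a Predictors: (Constant), Family Size, Annual Income ($000)
b Dependent Variable: Annual Food Expenditure ($000)

ANOVA

| Model | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Regression | | 2 | | 121.470 | |
| Residual | | 17 | | | |
| Total | | 19 | | | |

a Predictors: (Constant), Family Size, Annual Income ($000)
b Dependent Variable: Annual Food Expenditure ($000)

Coefficients

| | Unstandardized Coefficients: B | Unstandardized Coefficients: Std. Error | Standardized Coefficients: Beta | t | Sig. |
|---|---|---|---|---|---|
| (Constant) | -1.118 | .655 | | | |
| Annual Income ($000) | .148 | .016 | .761 | 9.049 | .000 |
| Family Size | .793 | .244 | .273 | 3.245 | |

a Dependent Variable: Annual Food Expenditure ($000)

Residuals Statistics

| | Minimum | Maximum | Mean | Std. Deviation | N |
|---|---|---|---|---|---|
| Std. Predicted Value | -1.050 | 2.722 | .000 | 1.000 | 20 |

a Dependent Variable: Annual Food Expenditure ($000)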
Step 4: Discuss the Results and Summarize your
Similar to the presentation in the simple regression tutorial, I will discuss the results in the order in which SPSS/win generates the outputs, beginning with the descriptive statistics tables. This approach permits a critical analysis of all the results and their implications.
I. Descriptive Statistics
1. Annual Food Expenditure
a) The sample mean is 7.965 thousands of dollars. This means that an average family in the sample spends $7965 annually on food.
b) The sample standard deviation of 4.664 (thousands of dollars) is equivalent to one standard deviation of $4664 about the mean value of $7965. If food expenditure is roughly normally distributed, this implies that about 68.3% of the families spend between $3301 and $12,629 annually on food.
2. Annual Income
a) The sample mean is 45.50 thousands of dollars. In terms of income, this implies that an average family in the sample makes $45,500 annually.
b) The sample standard deviation of 23.96 (thousands of dollars) is equivalent to ±$23,960 about the mean income of $45,500. Thus, 68.3% of the families could be said to make between $21,540 and $69,460 annually.
3. Family Size (measured in terms of the number of people in a family during a year)
a) The sample mean of 2.95 means that an average family consisted of about 3 persons during the year.
b) The sample standard deviation of 1.61 means that there were between 1.34 and 4.56 (in whole persons, roughly between 1 and 5) members in approximately 68.3% of the families during the year.
4. Sample size N (actually 'n') = 20 simply means that there were no missing values during estimation.
II. Correlations Analysis
This table contains the Pearson sample correlation coefficients of variable i with variable j (denoted as ri,j), which are the key tools of Correlation Analysis. This is the same Karl Pearson that I mentioned in the historical footnote under the discussion of the Chi-square test of Independence (and also in the glossary under regression analysis). Let us focus for now on the top part of the table. It is a 3 by 3 matrix. The following conclusions are obvious:
1. The correlation of annual Food Expenditure with itself is perfect, linear, and direct since ry,y = 1.000. Similar interpretations apply to Income (r1,1 = 1) and family Size (r2,2 = 1).
2. The correlation of annual Food expenditure with Income is quite strong, linear and direct because ry,1 = .946
3. The correlation of annual Food expenditure with Family Size is relatively strong, linear and direct because ry,2 = .787
4. The correlation of annual Income with Family Size is also strong (albeit undesirable), linear and direct because r1,2 = .676
5. The 3 x 3 matrix is symmetric about the main diagonal; hence, all the information about the type and strength of relationship between the two variables can be obtained from the correlation coefficients either above the main diagonal or below it.
6. The middle portion of the table contains the p-values (Sig. = significance) for testing Ho: Pi,j = 0 (for i not equal to j), where P (rho) is the population correlation coefficient whose value is unknown. As the row label indicates, SPSS/win reports them for a one-tailed test; for the two-tailed alternative Ha: Pi,j ≠ 0, double each value. The reported p-values (i.e., computed/observed values or alphaov) of .000 and .001 remain below .01 even after doubling, so Ho can be rejected unequivocally at the critical level of alpha = .01. Thus, the conclusions in (2) through (4) above are indeed valid.
7. Again, N (i.e., 'n') = 20 since all the observations were used in the estimation.
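The p-values in the table come from a t-test on each correlation coefficient. As a rough check, the sketch below computes t = r x sqrt(n - 2) / sqrt(1 - r^2) for the three off-diagonal coefficients and compares each with the two-tailed critical value for alpha = .01 and n - 2 = 18 df. The value 2.878 is taken from a standard t table, and the function name `correlation_t` is my own.

```python
import math

n = 20
T_CRITICAL = 2.878  # t table: two-tailed alpha = .01, df = n - 2 = 18

# Off-diagonal sample correlations reported in the Correlations table.
correlations = {"ry,1": 0.946, "ry,2": 0.787, "r1,2": 0.676}


def correlation_t(r, n):
    """Observed t statistic for Ho: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r ** 2)


t_values = {name: correlation_t(r, n) for name, r in correlations.items()}
for name, t_obs in t_values.items():
    decision = "reject Ho" if abs(t_obs) > T_CRITICAL else "do not reject Ho"
    print(name, round(t_obs, 3), decision)
```

All three statistics exceed the critical value, which agrees with the small p-values in the table.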
III. Model Summary and Evaluation with Se,
R, R2, and DW Statistics
From the 'Coefficients' table, the OLS method produces the following estimated SRP:
ýi = -1.118 + .148Xi,1 + .793Xi,2
From the 'Model Summary' table the following summary statistics are reported: R = .967, R2 = .935, Adjusted R2 = .927, Se = 1.261 and Durbin Watson (DW) statistic = 2.616. Let us explore their implications for the accuracy of the estimated SRP.
1. The sample multiple correlation coefficient R =.967 measures the degree of relationship between the actual values (yi) and the predicted values (ýi) of the annual family food expenditure. Because the ýi values are obtained as a linear combination of Income (X1) and family Size (X2), the coefficient value of .967 indicates that the relationship between family food expenditure and the two IVs is quite strong and positive.
2. The sample Coefficient of Determination, R-square or R2 (r2 is commonly used in simple regression analysis while R2 is appropriately reserved for multiple regression analysis), measures the goodness-of-fit of the estimated SRP in terms of the proportion of the variation in the DV explained by the fitted sample regression equation or SRP. Thus, the value of R2 = .935 simply means that about 94% of the variation in annual Family Food Expenditure is explained or accounted for by the estimated SRP that uses Income and family Size as the IVs. This information is quite useful in assessing the overall accuracy of the model. Notice that R2 is simply the square of R: (.967)2 = .935.
3. Adjusted R-Square (or R2 with a bar over it) is the sample Coefficient of Determination after adjusting for the degrees of freedom lost in the process of estimating the regression parameters. In this case, three parameters (A, B1, and B2) were estimated so that three degrees of freedom (df) have been lost; thus, the remaining df are v = n - k = 20 - 3 = 17, where k here denotes the number of parameters in the LRM, intercept included. Hence, the adjusted R-square is a better measure of the goodness-of-fit of the estimated SRP than its nominal/unadjusted counterpart. It is smaller in value than the unadjusted R2 whenever the model contains at least one IV and the fit is not perfect. I will examine the adjusted coefficient of determination in some detail in Econs. 853 and 976.
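The adjustment can be reproduced by hand with the usual formula, Adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k), where k counts all estimated parameters including the intercept. The short sketch below applies it to the reported R2; the function name `adjusted_r_squared` is my own.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjust R2 for degrees of freedom; k counts every estimated
    parameter, the intercept included."""
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - k)


# Values from the Model Summary table: R2 = .935, n = 20, k = 3 (A, B1, B2).
adj_r2 = adjusted_r_squared(0.935, n=20, k=3)
print(round(adj_r2, 3))
```

With n = 20 and k = 3 the result rounds to the .927 shown in the Model Summary table.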
4. Standard Error of the Estimate (standard notation is Se). This summary statistic measures the overall accuracy or quality of the estimated SRP in terms of the average/standardized unexplained variation in the DV that may be due to errors originating from (i) chance errors of sampling or sampling errors, which cause the values of 'a' and 'b' to differ from the true but unknown values of the parameters 'A' and 'B'; and (ii) possible variation in the parameters, which, according to the Classical Assumptions, are presumed constant. If these errors are small, on average, then the value of Se approaches zero (it is exactly equal to zero if the estimated values of the DV, denoted here as ýi, equal their actual/observed counterparts yi for all i = 1, 2, ..., n). Otherwise, the value of Se grows without bound, in which case the estimated SRP must be considered useless, especially if the application involves the prediction of the DV outside the sample period. Note that Se is the sample estimator of the standard deviation around the true conditional PRP µy/x = A + B1Xi,1 + B2Xi,2, which is denoted as σy/x (its square, Se2, is an unbiased estimator of the error variance).
In this example, Se = 1.261 means that, on average, the actual values of annual family Food Expenditure deviate by about ±$1261 from the values predicted by the estimated regression equation for each value of Income and Family Size during the sample period, and by a much larger amount outside the sample period. This is why prediction outside the sample period requires the use of the standard errors of the estimators 'a', 'b1' and 'b2' (denoted, respectively, as Sa, Sb1, and Sb2) for establishing confidence intervals about the conditional mean values µy/x. Note that Sa, Sb1, and Sb2 take into account the chance errors of sampling mentioned earlier. Accounting for parameter variation will require the application of advanced econometric techniques, which are beyond the scope of the undergraduate material.
5. The Durbin Watson (DW) statistic measures the presence, or lack thereof, of Serial Correlation (also known as Autocorrelation) among the errors from one observation (or time period) to the next. Details about the implications of the existence of autocorrelation will be examined in Econs 853 and 976 classes. For now, suffice it to say that the ideal value of the DW statistic is 2.00, which indicates the absence of autocorrelation; values below 2 point to positive autocorrelation and values above 2 to negative autocorrelation. Here, DW = 2.616 means that successive residuals ei = yi - ýi (for all i = 1, 2, ..., n) from the estimated regression model are negatively correlated, suggesting the presence of negative autocorrelation in the error terms (Ei). According to the Classical Assumptions, this is undesirable. Again, a detailed discussion of autocorrelation will be presented in Econs. 856 & 976.
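For readers who want to see how the statistic is built, the sketch below computes DW as the sum of squared changes in successive residuals divided by the sum of squared residuals, and uses the common approximation DW = 2(1 - r) to back out the implied first-order correlation r of the residuals. The four alternating residuals are made up for illustration, and the function names are my own.

```python
def durbin_watson(residuals):
    """DW = sum of squared successive differences / sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def implied_first_order_correlation(dw):
    """Rearranges the approximation DW = 2 * (1 - r) to solve for r."""
    return 1.0 - dw / 2.0


# Reported DW from the Model Summary table.
r_implied = implied_first_order_correlation(2.616)
print(round(r_implied, 3))

# Hypothetical residuals that flip sign every period (negative autocorrelation).
demo_residuals = [1.0, -1.0, 1.0, -1.0]
dw_demo = durbin_watson(demo_residuals)
print(dw_demo)
```

A DW value above 2, as reported here, corresponds to a negative implied r; residuals that alternate in sign push DW toward its upper limit of 4.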
IV. ANOVA Table: Testing the Significance of the
The summary measures reported here are used in the partitioning of the total variation in the DV according to the identity relation TSS = ESS + RSS, where TSS is the Total Sum of Squares in the DV, ESS is the Explained Sum of Squares due to the fitted regression equation or model, and RSS is the Residual (remaining) Sum of Squares that is unexplained and hence attributable to errors (i.e., chance sampling errors, and those resulting from parameter variation). Note the following: (1) The smaller RSS is relative to TSS (or the larger ESS is relative to TSS), the better the estimated regression equation fits the data. (2) The underlying principle in the partition of TSS is similar to that of the ANOVA technique. As in that technique, the identity relation carries over to the associated degrees of freedom in this manner: v = v1 + v2, where v1 = k-1 and v2 = n-k so that v = n-1; here k denotes the number of parameters that are estimated. (3) If k is instead defined as the number of IVs in the model, then v1 = k and v2 = n-k-1; again, v = v1 + v2 = n-1.
Note: Some authors use RSS (regression sum of squares) instead of ESS (explained sum of squares), and ESS (error sum of squares) instead of RSS (residual sum of squares) so that the identity is stated as TSS = RSS + ESS. So pay attention to how these acronyms are defined.
The null hypothesis (Ho) to verify is that all of the IVs in the model, considered together, have no causal effect on the DV; in which case the LRM that relates these IVs to the DV does not exist. The alternative hypothesis (Ha) is that this is not the case; indeed at least one, if not all, of the IVs significantly influences the DV. The formats of both Ho and Ha are:
Ho: B1 = B2 = 0 against Ha: They are not all equal to zero; at least one is nonzero
From the ANOVA table, under the df column, v1 = 2, v2 = 17, v = 19, and Fov = 121.470. Using the significance level of .05, the critical F-value from the F distribution table is Fcv = F.05, 2, 17 = 3.59. Because Fov = 121.470 far exceeds Fcv = 3.59, we can reject Ho in favor of Ha. This means that the LRM that has been estimated is not a mere theoretical construct; indeed it does exist and is statistically significant.
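The observed F statistic is tied to R2 through F = (R2/v1) / ((1 - R2)/v2). As a hedged cross-check, the sketch below plugs in the rounded R2 = .935 with v1 = 2 and v2 = 17; because R2 is rounded to three decimals, the result only approximately matches the reported Fov = 121.470. The function name `f_from_r_squared` is my own.

```python
def f_from_r_squared(r_squared, v1, v2):
    """Overall F statistic implied by R2 with v1 and v2 degrees of freedom."""
    return (r_squared / v1) / ((1.0 - r_squared) / v2)


F_CRITICAL = 3.59  # F.05, 2, 17 as quoted from the F distribution table

f_observed = f_from_r_squared(0.935, v1=2, v2=17)
reject_ho = f_observed > F_CRITICAL
print(round(f_observed, 2), "reject Ho" if reject_ho else "do not reject Ho")
```

The computed value comes out near 122, close to the SPSS/win figure and far above the critical value.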
V. Coefficients Table: T-Test of the
Significance of the Regression Coefficients
This table contains the estimated regression coefficients (a = -1.118, b1 = .148, and b2 = .793); hence, the estimated SRP/equation can be written as ýi = -1.118 + .148Xi,1 + .793Xi,2. The estimated coefficients have the following interpretations:
1. a = -1.118 has no interpretable meaning because the average level of family Food Expenditure could not be negative even when no member of the family is gainfully employed. Moreover, it is unrealistic to think of the existence of a family that has no income and no members and yet incurs expenditure on food. Nonetheless, this value should not be discarded; it plays an important role when using the estimated regression line/equation for prediction.
2. b1 = .148 represents the partial effect of annual family Income on Food Expenditure, holding family Size constant. The estimated positive sign implies that such effect is positive while the absolute value implies that Food Expenditure would increase by $148 for every $1000 increase in Income.
3. b2 = .793 represents the partial effect of family Size on Food Expenditure, holding family Income constant. The estimated positive sign implies that such effect is positive while the absolute value implies that Food Expenditure would increase by $793 for every additional member to the family either by marriage, birth or adoption. Note that the addition to a family by marriage is a possibility because there were some families in the sample with only one person.
4. Standard errors of the estimators: Assessing the precision of 'a', 'b1', and 'b2'
Sa = .655, Sb1 = .016, and Sb2 = .244 measure, respectively, the precision with which a = -1.118, b1 = .148, and b2 = .793 estimate the true but unknown values of the corresponding regression parameters A, B1, and B2. The closer the values of Sa, Sb1, and Sb2 are to zero, the higher the precision of the estimates, suggesting that chance errors due to sampling are not severe. The converse would suggest the opposite. Thus Sb1 = .016 implies that b1 = .148 is much closer to the true value of B1 than b2 = .793 is to B2; and Sa = .655 implies quite the opposite for 'a', coupled with the fact that its estimated sign contradicts common sense or reality.
5. Standardized Coefficients: Assessing the Relative Importance of the IVs
The standardized coefficients are useful for determining the relative importance of the IVs in the model. In effect, the importance of the IVs can be ranked according to the size (i.e., the absolute value) of the beta coefficients. In this example, the beta coefficient for income is b*1 = .148(23.96/4.66) = .761 (reported under the "Beta" column), where 23.96 and 4.66 are the sample standard deviations of family Income and Food Expenditure, respectively. The beta coefficient for family Size is b*2 = .793(1.61/4.66), or approximately .273, where 1.61 is the sample standard deviation of the family Size variable. Thus the estimated SRP can be expressed in terms of the beta coefficients, with all variables measured in standard-deviation units, as ýi = .761Xi,1 + .273Xi,2. Because the absolute value of the beta coefficient for income is larger, it can be concluded that income is relatively a more important predictor of family food expenditure than the size of the family.
Suppose we had included a third IV (X3, say, the local price level for each family, assuming families were randomly selected from a national pool) and came up with an estimated beta coefficient of -.825. Then the ranking of the IVs according to their relative importance in predicting/explaining family food expenditure would be as follows: 1 for X3, 2 for X1, and 3 for X2.
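The beta calculation and the ranking rule can be sketched in a few lines of Python. The inputs are the coefficients and sample standard deviations reported in the SPSS/win tables; the X3 entry is the hypothetical third IV from the paragraph above, and the function names are my own.

```python
S_Y, S_X1, S_X2 = 4.664, 23.96, 1.61   # sample standard deviations
B1, B2 = 0.148, 0.793                  # unstandardized coefficients


def beta_coefficient(b, s_x, s_y):
    """Standardized (beta) coefficient: b rescaled by s_x / s_y."""
    return b * s_x / s_y


def rank_by_importance(betas):
    """Names of the IVs ordered by the absolute value of their betas."""
    return sorted(betas, key=lambda name: abs(betas[name]), reverse=True)


betas = {
    "Income (X1)": beta_coefficient(B1, S_X1, S_Y),
    "Family Size (X2)": beta_coefficient(B2, S_X2, S_Y),
}
ranking = rank_by_importance(betas)

# Hypothetical third IV with the beta value assumed in the text.
betas_with_x3 = dict(betas)
betas_with_x3["Price level (X3)"] = -0.825
ranking_with_x3 = rank_by_importance(betas_with_x3)

print({name: round(value, 3) for name, value in betas.items()})
print(ranking_with_x3)
```

Small differences in the third decimal place relative to the "Beta" column are to be expected, since the inputs here are already rounded. Note that the ranking uses absolute values, so the negative beta for X3 still ranks first.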
6. Observed/computed t statistic (tov): T-test of the Significance and Signs of the Regression Parameters.
As part of investigating the accuracy of the fitted SRP, it is often useful to verify both the statistical significance and the sign (i.e., economic significance) of the regression parameters/coefficients (B1, B2) individually. For statistical significance, the maintained hypothesis is that the IV or Xj has no causal effect on the DV or Y. Thus, the null is H0: Bj = 0 (i.e., Xj has no causal effect on the DV) against the alternative that Ha: Bj is not equal to zero (i.e., Xj does indeed have some causal effect on the DV; such effect may be direct or indirect).
a. Testing for Statistical Significance of Bj
With respect to income, the null is H0: B1 = 0 (i.e., Income has no causal effect on Food Expenditure), against the alternative that Ha: B1 is not equal to zero (i.e., income indeed does have some causal effect on food expenditure). For the Family Size, the null is H0: B2 = 0 (i.e., Family Size has no causal effect on food expenditure), against the alternative that Ha: B2 is not equal to zero (i.e., Family Size indeed does have some causal effect on food expenditure). For alpha = .05 and v = n -k-1 = 20 -2-1 = 17, this implies a critical t-value of tcv = t.025,17 = ±2.110. For Income, tov = 9.049. Thus, Ho must unequivocally be rejected in favor of Ha; in which case, family Income can be said to have a significant influence on family Food Expenditure. For family Size, tov = 3.245. So, Ho must be rejected in favor of Ha; in which case, family Size can be said to have a significant influence on family Food Expenditure.
b. Testing for Economic/practical Significance of Bj
An interesting variation of the t-test is to verify the economic significance of the parameter with respect to the direction of causality of the associated IV. In this case, the null is phrased as H0: Bj has a value that is at the most zero, against Ha: Bj > 0 (i.e., its value is strictly positive according to the underlying economic theory). If the sign of the parameter was expected to be negative on the basis of theory or common sense, then the null is phrased as H0: Bj has a value that is at the least zero, against Ha: Bj < 0 (i.e., its value is strictly negative according to the underlying economic theory).
Consider, for example, family Size, where the sign of B2 is expected to be positive. The hypotheses are H0: B2 has a value that is at the most zero, against Ha: B2 > 0. At the level of alpha = .05, the critical t-value is tcv = t.05,17 = +1.740. Since tov = 3.245 exceeds this value, Ho of negative or no effect of family Size must be rejected unequivocally.
Note that in the test for economic significance of a parameter the alpha value is not divided by two since this is always a one-tailed test; whereas, it is divided by 2 in the test for statistical significance since this is always a two-tailed test.
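Both versions of the t-test rest on the same observed statistic, tov = bj / Sbj; only the critical value changes. The sketch below recomputes tov from the rounded coefficients and standard errors and applies the two critical values quoted above (2.110 two-tailed, 1.740 one-tailed, both for 17 df). Because the inputs are rounded, the results differ slightly from the 9.049 and 3.245 that SPSS/win reports. The variable names are my own.

```python
T_CRIT_TWO_TAILED = 2.110  # t.025,17: test of statistical significance
T_CRIT_ONE_TAILED = 1.740  # t.05,17: test of a positive sign

# (coefficient, standard error) pairs from the Coefficients table.
estimates = {
    "Income (B1)": (0.148, 0.016),
    "Family Size (B2)": (0.793, 0.244),
}

results = {}
for name, (b, se) in estimates.items():
    t_obs = b / se
    results[name] = {
        "t": t_obs,
        "reject_two_tailed": abs(t_obs) > T_CRIT_TWO_TAILED,
        "reject_one_tailed": t_obs > T_CRIT_ONE_TAILED,
    }
    print(name, round(t_obs, 3), results[name])
```

Both IVs pass both tests, in line with the conclusions drawn from the SPSS/win output.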
7. Prediction --using the estimated SRP
Suppose a typical or ith family drawn from the same population had an annual Income of $30,000 in 1993 with a family size of 2 members (this is the 8th family in our sample). Its estimated/predicted annual Food Expenditure, corresponding to X1,8 = 30 (in thousands of dollars) and X2,8 = 2, would be ý8 = -1.118 + .148 x 30 + .793 x 2 = 4.908 thousands of dollars. Thus, $4908 is the best estimate of the average annual Food Expenditure for this family. But this family actually spent 5.8 thousands of dollars or $5800. Hence, the positive residual of $892 (i.e., e8 = 5800 - 4908) is the amount by which the estimated SRP has underpredicted the annual Food Expenditure for this family.
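The same arithmetic can be wrapped in a small function so that other Income and Size combinations are easy to try. This is only a sketch that uses the rounded coefficients from the 'Coefficients' table; the function name `predict_food_expenditure` is my own.

```python
A_HAT, B1_HAT, B2_HAT = -1.118, 0.148, 0.793  # estimated a, b1, b2


def predict_food_expenditure(income_thousands, family_size):
    """Predicted annual food expenditure in thousands of dollars."""
    return A_HAT + B1_HAT * income_thousands + B2_HAT * family_size


# The 8th family in the sample: income of $30,000, 2 members, actual spending 5.8.
predicted = predict_food_expenditure(30, 2)
residual = 5.8 - predicted  # positive residual means the SRP underpredicts

print(round(predicted, 3), round(residual, 3))
```

As noted in Step 4, predictions for Income or Size values far outside the sample range should be treated with caution.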
Copyright© 1996, Ebenge Usip, all rights reserved.
Last revised: Wednesday, July 10, 2013.