Motivation: Oftentimes, it may not be realistic to conclude that only one factor or IV influences the behavior of the DV. In such situations, a researcher needs to carefully identify those other possible factors and explicitly include them in the Linear Regression Model (LRM). Existing economic theory or common sense should constitute a basis for selecting the IVs; and where data on a theoretically construed variable is not readily available a proxy should be carefully chosen.
This tutorial will illustrate the key steps involved in
using multiple regression and correlation to solve real world problems. The example will
consider a multiple LRM which typically has the form:
Yi = A + B1Xi,1 + B2Xi,2 + ... + BKXi,K + Ei
where the Xj are the IVs; A and Bj (j = 1, 2, ..., K) are the regression parameters or coefficients, with each Bj reflecting the partial effect of the associated IV, holding the effects of all other IVs constant; K is the number of IVs in the model; and Ei is the random error term.
Again, note that in regression analysis, all of the
underlying classical assumptions
essentially apply to this random error term. In multiple
regression the three most crucial ones are the assumptions of no multicollinearity among the
IVs, of no heteroskedasticity
in the error variances, and of no autocorrelation
in the errors for all i.
Step 1: Formulate the LRM and State the Expected Signs of the Regression Parameters
When specifying an LRM, theory or common sense should be your guide in stipulating, a
priori, the expected signs of the regression parameters.
Let us return to the family food expenditure example that we introduced in the simple regression tutorial. In that
tutorial, the only factor that was explicitly identified as the predictor of annual
family Food Expenditure (Y) was Income (henceforth
denoted as X1). The effects of all other predictors were assumed away or
held constant. We now extend the model to include another important determinant,
viz., the family size (X2) which is easily measured in terms of the number of
people in a family. The representative LRM has the form:
Yi = A + B1Xi,1 + B2Xi,2 + Ei
Based on economic theory, we should expect the signs
of B1 and B2 to be positive; in
other words, both the family Income and Size,
respectively, are expected to have positive effects on the family food expenditure. Note
that B1
measures the partial effect of Income on family food expenditure, holding family size
constant; whereas B2 measures the partial effect of family Size on Food Expenditure, holding
income constant. Also, note that holding one IV constant while examining the effect
of the other assumes that there is no collinearity between the two IVs. The sign of A could be positive or negative and
indeed A may or may not
have an interpretable meaning. Nonetheless, always include the intercept term in your
model -- more on this in Econs 853 and 976.
Step 2: Examine the Data Visually for Inherent Patterns with a Scatterplot Matrix
It is always advisable to do some exploratory analysis of the data to uncover inherent
patterns as to the type and strength of relationship among the variables as well as the
presence of outliers in the data. The scatterplot matrix is a useful graphical
device for doing so. While a strong linear association between the DV and each of the IVs
is highly desirable, a strong linear association between (or among) the IVs is highly
undesirable since it is indicative of the presence of collinearity (or
multicollinearity) problem in the model. The consequences of
collinearity/multicollinearity will be treated in Econs 853 and 976.
For this example, the data set
for the simple regression analysis has been augmented to include data on X2.
The results of the preliminary analysis of the data are discussed separately in the
scatterplot matrix component.
After studying the results for reasonable inferences, the next phase of the data analysis
is to estimate the LRM. Estimating the embedded parameters of the population regression
plane (PRP) is accomplished by fitting the sample regression plane (SRP) to a sample of
data on all the variables of the model.
Step 3: Estimate the SRP
Again, the estimation method is the classical Ordinary Least Squares (OLS) technique which
is applied to the sample regression plane (SRP) that has the form:
yi = a + b1Xi,1 + b2Xi,2 + ei or ýi = a + b1Xi,1 + b2Xi,2
Note that yi and ýi are the actual/observed and the predicted/estimated values of Y, respectively (for all i = 1, 2, ..., n); 'a', 'b1', and 'b2' are the estimators of A, B1, and B2, respectively; and 'ei' denotes the residual (defined as ei = yi - ýi), which is the estimator of the random error term Ei.
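To make the mechanics of OLS concrete, the following is a minimal sketch of how 'a', 'b1', and 'b2' can be obtained by solving the normal equations (X'X)b = X'y. The six observations are hypothetical (they are generated from an assumed plane with no error term, not taken from the tutorial's data set), and the helper names `solve_linear_system` and `fit_ols` are illustrative only; SPSS performs the equivalent computation internally.

```python
# Minimal OLS sketch for a sample regression plane with two IVs.
# The data are HYPOTHETICAL: y is generated exactly from
# y = 1.0 + 0.15*x1 + 0.8*x2, so OLS should recover those values.

def solve_linear_system(m, v):
    """Solve m x = v by Gauss-Jordan elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_ols(x1, x2, y):
    """Return (a, b1, b2) that minimize the sum of squared residuals."""
    rows = [[1.0, u, w] for u, w in zip(x1, x2)]  # design matrix with intercept
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(3)]
    return solve_linear_system(xtx, xty)  # normal equations


income = [20.0, 30.0, 45.0, 60.0, 80.0, 100.0]  # hypothetical X1 ($000)
size = [1.0, 2.0, 2.0, 4.0, 3.0, 5.0]           # hypothetical X2 (persons)
food = [1.0 + 0.15 * u + 0.8 * w for u, w in zip(income, size)]

a, b1, b2 = fit_ols(income, size, food)
residuals = [t - (a + b1 * u + b2 * w) for u, w, t in zip(income, size, food)]
print(round(a, 3), round(b1, 3), round(b2, 3))
```

With real data the residuals would not all be zero, but they would still sum to zero because the model includes an intercept.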
The OLS method is programmed into the SPSS/win statistical package. Using the
command sequence presented earlier will automatically implement this method. The
following outputs contain the necessary results which are based on selected options that
are accessible via the 'Statistics...' button.
| | Mean | Std. Deviation | N |
|---|---|---|---|
| Annual Food Expenditure ($000) | 7.965 | 4.664 | 20 |
| Annual Income ($000) | 45.50 | 23.96 | 20 |
| Family Size | 2.95 | 1.61 | 20 |
| | | Annual Food Expenditure ($000) | Annual Income ($000) | Family Size |
|---|---|---|---|---|
| Pearson Correlation | Annual Food Expenditure ($000) | 1.000 | .946 | .787 |
| | Annual Income ($000) | .946 | 1.000 | .676 |
| | Family Size | .787 | .676 | 1.000 |
| Sig. (1-tailed) | Annual Food Expenditure ($000) | . | .000 | .000 |
| | Annual Income ($000) | .000 | . | .001 |
| | Family Size | .000 | .001 | . |
| N | Annual Food Expenditure ($000) | 20 | 20 | 20 |
| | Annual Income ($000) | 20 | 20 | 20 |
| | Family Size | 20 | 20 | 20 |
| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | Durbin-Watson |
|---|---|---|---|---|---|
| 1 | .967(a) | .935 | .927 | 1.261 | 2.616 |
| a Predictors: (Constant), Family Size , Annual Income ($000) | |||||
| b Dependent Variable: Annual Food Expenditure ($000) | |||||
| Model | | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 386.313 | 2 | 193.156 | 121.470 | .000(a) |
| | Residual | 27.033 | 17 | 1.590 | | |
| | Total | 413.346 | 19 | | | |
| a Predictors: (Constant), Family Size , Annual Income ($000) | ||||||
| b Dependent Variable: Annual Food Expenditure ($000) | ||||||
| Model | | Unstandardized Coefficients: B | Unstandardized Coefficients: Std. Error | Standardized Coefficients: Beta | t | Sig. |
|---|---|---|---|---|---|---|
| 1 | (Constant) | -1.118 | .655 | | -1.708 | .106 |
| | Annual Income ($000) | .148 | .016 | .761 | 9.049 | .000 |
| | Family Size | .793 | .244 | .273 | 3.245 | .005 |
| a Dependent Variable: Annual Food Expenditure ($000) | ||||||
| | Minimum | Maximum | Mean | Std. Deviation | N |
|---|---|---|---|---|---|
| Predicted Value | 3.232 | 20.240 | 7.965 | 4.509 | 20 |
| Residual | -2.586 | 2.206 | 1.110E-16 | 1.193 | 20 |
| Std. Predicted Value | -1.050 | 2.722 | .000 | 1.000 | 20 |
| Std. Residual | -2.051 | 1.750 | .000 | .946 | 20 |
| a Dependent Variable: Annual Food Expenditure ($000) | |||||
Step 4: Discuss the Results and Summarize your Findings
Similar to the presentation in the simple regression tutorial, I will discuss the
results in the order in which SPSS/win generates the outputs beginning with the
descriptive statistics tables. This approach permits a critical analysis of all the
results and their implications.
I. Descriptive Statistics
1. Annual Food Expenditure
a) The sample mean is 7.965 thousands of dollars. This
means that an average family in the sample spends $7965 annually on food.
b) The sample standard deviation of 4.664 (thousands of
dollars) is equivalent to one standard deviation of $4,664 about the mean value of
$7,965. If expenditures are approximately normally distributed, this implies that about
68.3% of the families spend between $3,301 and $12,629 annually on food.
2. Annual Income
a) The sample mean is 45.50 thousands of dollars. In terms of
income, this implies that an average family in the sample makes $45,500 annually.
b) The sample standard deviation of 23.96 (thousands of
dollars) is equivalent to ±$23,960 about the mean income of $45,500. Thus, about 68.3% of the
families could be said to make between $21,540 and $69,460 annually.
3. Family Size (measured in terms of the number of people in a
family during a year)
a) The sample mean of 2.95 means that an average family comprised about 3 persons
during the year.
b) The sample standard deviation of 1.61 (or 2) means that there were between 1 and
5 members in approximately 68.3% of the families during the year.
4. Sample size N (actually 'n') =
20 simply means that there were no missing values during estimation.
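The one-standard-deviation ranges quoted above can be checked with a few lines of arithmetic. This is only a sketch: the means and standard deviations are copied from the descriptive statistics table, while the dictionary keys and the helper name `one_sd_interval` are illustrative choices.

```python
# Means and standard deviations copied from the descriptive statistics table.
stats = {
    "food_expenditure_000": (7.965, 4.664),
    "income_000": (45.50, 23.96),
    "family_size": (2.95, 1.61),
}


def one_sd_interval(mean, sd):
    """Return (mean - sd, mean + sd), rounded to three decimals."""
    return round(mean - sd, 3), round(mean + sd, 3)


intervals = {name: one_sd_interval(m, s) for name, (m, s) in stats.items()}
for name, (low, high) in intervals.items():
    print(f"{name}: {low} to {high}")
```

Reading these ranges as covering about 68.3% of families relies on the variables being roughly normally distributed, which a sample of 20 can only suggest.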
II. Correlations Analysis
This table contains the Pearson
sample correlation coefficients of variable i
with variable j ( denoted as ri,j ), which are the key tools of Correlation Analysis. This
is the same Karl Pearson that I mentioned in the historical footnote
under the discussion of the Chi-square test of Independence (and also, in glossary under regression analysis). Let us
focus for now on the top part of the table. It is a 3 by 3 matrix. The
following conclusions are obvious:
1. The correlation of annual Food
Expenditure with itself is perfect, linear, and direct since ry,y = 1.000. Similar
interpretations apply to Income (r1,1
= 1) and family Size (r2,2
= 1).
2. The correlation of annual Food
expenditure with Income is quite strong, linear and direct because
ry,1 = .946
3. The correlation of annual
Food expenditure with Family Size is relatively strong, linear and direct
because ry,2 = .787
4. The correlation of annual
Income with Family Size is also strong (albeit undesirable), linear and
direct because r1,2 = .676
5. The 3 x 3 matrix is symmetric about the
main diagonal; hence, all the information about the type and strength of
relationship between the two variables can be obtained from the correlation coefficients
either above the main diagonal or below it.
6. The middle portion of the table contains the p-values (Sig. = significance). As the row label indicates, these are for a one-tailed test of Ho: Pi,j = 0 against a one-sided alternative (here Ha: Pi,j > 0, since all the sample correlations are positive; for i not equal to j), where P (rho) is the population
correlation coefficient whose value is unknown. Doubling them gives the p-values for the two-tailed alternative Ha: Pi,j ≠ 0. The probability
or p-values (i.e., computed/observed values or alphaov) of .000 and .001 mean
that Ho can be rejected unequivocally at the critical level of
alpha = .01. Thus, the conclusions in (2) through (4) above are indeed
valid.
7. Again, N (i.e., 'n') = 20
since all the observations were used in the estimation.
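The significance of each correlation coefficient can also be checked by hand with the usual t statistic, t = r·sqrt(n - 2)/sqrt(1 - r²), with n - 2 degrees of freedom. The sketch below uses the correlations from the table; the one-tailed critical value 2.552 (alpha = .01, 18 df) is taken from a standard t table and is an assumption, not part of the SPSS output.

```python
import math


def correlation_t(r, n):
    """t statistic for Ho: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


n = 20
T_CRIT_01_DF18 = 2.552  # one-tailed, alpha = .01, from a standard t table (assumed)

pairs = {"food~income": 0.946, "food~size": 0.787, "income~size": 0.676}
t_stats = {name: correlation_t(r, n) for name, r in pairs.items()}

for name, t in t_stats.items():
    decision = "reject Ho" if t > T_CRIT_01_DF18 else "fail to reject Ho"
    print(name, round(t, 2), decision)
```

All three statistics exceed the critical value, which agrees with the small p-values in the table.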
III. Model Summary and Evaluation with Se, R, R2, and DW Statistics
From the 'Coefficients' table, the OLS method produces the following
estimated SRP:
ýi = -1.118 + .148Xi,1 + .793Xi,2
From the 'Model Summary' table the following summary
statistics are reported R = .967, R2= .935, Adjusted R2 = .927,
Se = 1.261 and Durbin Watson (DW) statistic = 2.616. Let us explore their implications for
the accuracy of the estimated SRP.
1. The sample multiple correlation
coefficient R =.967 measures the degree of
relationship between the actual values (yi)
and the predicted values (ýi) of
the annual family food expenditure. Because the ýi
values are obtained as a linear combination of Income (X1)
and family Size (X2), the
coefficient value of .967 indicates
that the relationship between family food expenditure and the two IVs is quite strong and
positive.
2. The sample Coefficient of Determination
R-square or R2 (r2 is commonly used in simple regression analysis while
R2 is appropriately reserved for multiple regression analysis).
It measures the goodness-of-fit of the estimated SRP in terms of the proportion of the
variation in the DV explained by the fitted sample regression equation or SRP. Thus, the
value of R2 = .935 simply means
that about 94% of the variation in annual Family Food Expenditure is
explained or accounted for by the estimated SRP that uses Income and family Size as the
IVs. This information is quite useful in assessing the overall accuracy of the
model. Notice that R2 = (R)2 = (.967)2 = .935.
3. Adjusted R-Square (or R2
with a bar over it) is the sample Coefficient of Determination
after adjusting for the degrees of freedom lost in the process of estimating the
regression parameters. In this case, three parameters (A, B1, and B2) were
estimated so that three degrees of freedom (df) have been lost; thus, the remaining df can
be determined as v = n - k = 20 - 3 = 17, where k denotes the number of parameters in the LRM.
Hence, the adjusted R-square is a better measure of the
goodness-of-fit of the estimated SRP than its nominal/unadjusted counterpart. It is
smaller in value than the unadjusted R2 (here .927 versus .935) unless the fit is perfect. I will examine the adjusted coefficient of
determination in some detail in Econs. 853 and 976.
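Both coefficients of determination can be reproduced from the sums of squares in the ANOVA table. The sketch below assumes the usual definitions R2 = ESS/TSS and adjusted R2 = 1 - [RSS/(n - k)]/[TSS/(n - 1)], with k = 3 estimated parameters; the variable names are illustrative.

```python
# Sums of squares copied from the ANOVA table.
ess = 386.313   # explained (regression) sum of squares
rss = 27.033    # residual sum of squares
tss = 413.346   # total sum of squares

n = 20          # sample size
k = 3           # estimated parameters: A, B1, B2

r2 = ess / tss
adj_r2 = 1.0 - (rss / (n - k)) / (tss / (n - 1))

print(round(r2, 3), round(adj_r2, 3))
```

The adjustment penalizes the fit for each parameter estimated, which is why the adjusted value is slightly smaller.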
4. Standard Error of the Estimate (standard notation is Se). This summary statistic measures the overall accuracy or quality of the estimated SRP in terms of the average/standardized unexplained variation in the DV that may be due to possible errors that could originate from (i) chance errors of sampling or sampling errors, thereby causing the values of ‘a’ and ‘b’ to differ significantly from the true but unknown values of the parameters ‘A’ and ‘B’; and (ii) possible variation in the parameters which, according to the Classical Assumptions, are presumed constant. If these errors are small, on average, then the value of Se could approach zero (exactly equal to zero if the estimated values of the DV, denoted here as ýi, equal their actual/observed counterparts yi for all i = 1, 2, ..., n). Otherwise, the value of Se grows without bound, in which case the estimated SRP must be considered useless, especially if the application involves the prediction of the DV outside the sample period. Note that Se is an estimator of σy/x, the standard deviation around the true conditional PRP µy/x = A + B1Xi,1 + B2Xi,2 (strictly speaking, it is the square of Se that is an unbiased estimator of the error variance).
In this example, Se = 1.261 means that, on average, the actual values of annual family Food Expenditure deviate by about ±$1,261 from the values predicted by the estimated regression equation for given values of Income and Family Size during the sample period -- and by a much larger amount outside the sample period. This is why prediction outside the sample period requires the use of the standard errors of the estimators ‘a’, ‘b1’ and ‘b2’ (denoted, respectively, as Sa, Sb1, and Sb2) for establishing confidence intervals about the conditional mean values µy/x. Note that Sa, Sb1, and Sb2 take into account the chance errors of sampling mentioned earlier. Accounting for parameter variation will require the application of advanced econometric techniques, which are beyond the scope of the undergraduate material.
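As a quick check, Se can be recovered from the ANOVA table as the square root of the residual mean square, Se = sqrt(RSS/(n - k)). The sketch below also shows, for comparison, that dividing by n - 1 instead reproduces the residual standard deviation of 1.193 in the last output table; the variable names are illustrative.

```python
import math

rss = 27.033   # residual sum of squares from the ANOVA table
n = 20         # sample size
k = 3          # estimated parameters: A, B1, B2

se = math.sqrt(rss / (n - k))          # standard error of the estimate
resid_sd = math.sqrt(rss / (n - 1))    # plain standard deviation of the residuals

print(round(se, 3), round(resid_sd, 3), round(resid_sd / se, 3))
```

The ratio of the two (about .946) is the standard deviation of the standardized residuals shown in the same table.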
5. The Durbin-Watson (DW) Statistic
measures the presence, or lack thereof, of Serial
Correlation (also known as Autocorrelation) among the errors from one
observation (or time period) to other observations (or time periods). Details about the
implications of the existence of autocorrelation will be examined in Econs 853 and 976
classes. For now, suffice it to say that a value of DW =
2.616 means that the residuals ei = yi - ýi (for all i = 1, 2, ..., n) from
the estimated regression model are negatively correlated (since DW is approximately 2(1 - r), the implied first-order correlation r between successive residuals is about -.31) -- suggesting the
presence of negative autocorrelation in the error terms (Ei). According
to the Classical Assumptions, this is undesirable. The ideal value of the DW
statistic is 2.00, which indicates the
absence of autocorrelation; values below 2 point toward positive autocorrelation and values above 2 toward negative autocorrelation. Again, detailed discussion of autocorrelation will be presented
in Econs. 856 & 976.
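The DW statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below implements that definition and the common approximation r ≈ 1 - DW/2. The two residual series are hypothetical and chosen only to show the two extremes, because the tutorial's own residuals are not listed here; the function names are illustrative.

```python
def durbin_watson(residuals):
    """DW = sum of squared successive differences / sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den


def implied_first_order_correlation(dw):
    """Approximate residual autocorrelation implied by DW (r is about 1 - DW/2)."""
    return 1.0 - dw / 2.0


# HYPOTHETICAL residual series, for illustration only.
alternating = [1.0, -1.0, 1.0, -1.0]            # sign flips: negative autocorrelation
trending = [1.0, 0.8, 0.6, -0.6, -0.8, -1.0]    # long runs: positive autocorrelation

print(durbin_watson(alternating))   # well above 2
print(durbin_watson(trending))      # well below 2
print(round(implied_first_order_correlation(2.616), 3))  # for the reported DW
```

A formal conclusion requires comparing DW with tabulated lower and upper bounds for the given n and number of IVs.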
IV. ANOVA Table: Testing the Significance of the Model
The summary measures reported here are used in the partitioning of the total variation in the DV according to the identity relation TSS = ESS + RSS, where TSS is the Total Sum of Squares in the DV, ESS is the Explained Sum of Squares due to the fitted regression equation or model, and RSS is the Residual (remaining) Sum of Squares that is unexplained and hence attributable to errors (i.e., chance sampling errors, and those resulting from parameter variation). Note the following: (1) The smaller RSS is relative to TSS (or the larger ESS is relative to TSS), the better the estimated regression equation fits the data. (2) The underlying principle in the partition of TSS is similar to that of the ANOVA technique. As in that technique, the identity relation carries over to the associated degrees of freedom in this manner: v = v1 + v2, where v1 = k - 1 and v2 = n - k, so that v = n - 1; here k denotes the number of parameters that are estimated. (3) If k is instead defined as the number of IVs in the model, then v1 = k and v2 = n - k - 1; again, v = v1 + v2 = n - 1.
Note: Some authors use RSS (regression sum of squares) instead of ESS (explained sum of squares), and ESS (error sum of squares) instead of RSS (residual sum of squares) so that the identity is stated as TSS = RSS + ESS. So pay attention to how these acronyms are defined.
The null hypothesis (Ho) to verify is that all of the IVs in the model, considered together, have no causal effect on the DV; in which case the LRM that relates these IVs to the DV does not exist. The alternative hypothesis (Ha) is that that is not the case; indeed at least one, if not all, of the IVs significantly influences the DV. The formats of both Ho and Ha are:
Ho: B1 = B2 = 0 against Ha: They are not all equal to zero; at least one is nonzero
From the ANOVA table, under the df column, v1 = 2, v2 = 17, v = 19, and Fov = 121.470. Using the significance level of .05 implies the critical F-value Fcv = F.05, 2, 17 = 3.59 from the F distribution table. Because Fov = 121.470 far exceeds Fcv = 3.59, we
can reject Ho in favor of Ha. This means that the LRM that has been
estimated is not a mere theoretical construct; indeed it does exist and is statistically
significant.
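The observed F statistic is the ratio of the two mean squares, Fov = (ESS/v1)/(RSS/v2), and it can equivalently be written in terms of R2. The sketch below reproduces it from the ANOVA table; the critical value 3.59 is the one quoted above from the F table, and the variable names are illustrative.

```python
ess, rss = 386.313, 27.033   # explained and residual sums of squares
v1, v2 = 2, 17               # numerator and denominator degrees of freedom
F_CRIT_05 = 3.59             # F(.05; 2, 17), as quoted from the F table

f_ov = (ess / v1) / (rss / v2)

# Equivalent form using R-square.
r2 = ess / (ess + rss)
f_from_r2 = (r2 / v1) / ((1.0 - r2) / v2)

reject_h0 = f_ov > F_CRIT_05
print(round(f_ov, 2), round(f_from_r2, 2), reject_h0)
```

The second form makes explicit that the overall F-test is a test of whether R2 is significantly greater than zero.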
V. Coefficients Table: T-Test of the Significance of the Regression Coefficients
This table contains the estimated regression coefficients (a = -1.118, b1 =
.148, and b2 = .793); hence, the estimated SRP/equation can be written as
ýi = -1.118 + .148Xi,1 + .793Xi,2. The estimated coefficients
have the following interpretations:
1. a = -1.118 has no interpretable meaning
because the average level of family Food Expenditure could not be negative even when no
member of the family is gainfully employed. Moreover, it is unrealistic to think of the existence
of a family that has no income and no members and yet incurs expenditure on food.
Nonetheless, this value should not be discarded; it plays an important role when using the
estimated regression line/equation for prediction.
2. b1 = .148 represents the
partial effect of annual family Income on Food Expenditure, holding family Size constant.
The estimated positive sign implies that such effect is positive while the absolute value
implies that Food Expenditure would increase by $148 for every $1000 increase in Income.
3. b2 = .793 represents the
partial effect of family Size on Food Expenditure, holding family Income constant. The
estimated positive sign implies that such effect is positive while the absolute value
implies that Food Expenditure would increase by $793 for every additional member to the
family either by marriage, birth or adoption. Note that the addition to a family by
marriage is a possibility because there were some families in the sample with only one
person.
4. Standard errors of the estimators: Assessing
the precision of 'a', 'b1', and 'b2'
Sa = .655, Sb1 =
.016, and Sb2 =
.244, respectively, measure the precision of the estimated values of a = -1.118, b1
= .148, and b2 = .793,
in taking on or estimating the true but unknown values of the corresponding regression
parameters A and B1 and B2.
The closer the values of Sa, Sb1, and Sb2 are to zero, the higher the precision of the estimates, suggesting
that chance errors due to sampling are not severe. The converse would suggest the opposite.
Thus Sb1 = .016 implies that b1 = .148 is much closer to the true value
of B1 than is b2 = .793 (with Sb2 = .244) to B2; and Sa = .655 implies quite the opposite, coupled with the
fact that the estimated sign contradicts common sense or reality.
5. Standardized Coefficients: Assessing
the Relative Importance of the IVs
The standardized coefficients are useful for determining the relative importance of the
IVs in the model. In effect, the importance of the IVs can be ranked according to the size (i.e.,
the absolute value) of the beta coefficients. In this example, the beta coefficient
for income is b*1 = .148(23.96/4.66) = .761 (under the "Beta" column), where 23.96 and 4.66 are the
sample standard deviations of family Income and Food Expenditure, respectively. The
beta coefficient for family Size is b*2 = .793(1.61/4.66), or about .273 as reported under the "Beta" column, where 1.61 is the sample standard deviation of the
family Size variable. Thus the estimated SRP can be expressed in terms of the beta
coefficients (with all variables standardized) as ýi = .761Xi,1 + .273Xi,2. Because the absolute value of the beta
coefficient for income is larger, it can be concluded that income is relatively a more
important predictor of family food expenditure than the size of the family.
Suppose we had included a third IV (X3,
say, the local price level for each family assuming families were randomly selected from a
national pool) and came up with an estimated beta coefficient of -.825. Then the
ranking of the IVs according to their relative importance in predicting/explaining family
food expenditure would be as follows: 1 for X3,
2 for X1, and 3 for X2.
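The conversion from unstandardized to standardized coefficients, b*j = bj(Sxj/Sy), and the resulting ranking can be sketched as follows. The inputs are the rounded values from the output tables, so the results agree with the "Beta" column only up to rounding; the function and variable names are illustrative.

```python
def beta_coefficient(b, s_x, s_y):
    """Standardized coefficient: b scaled by the ratio of standard deviations."""
    return b * s_x / s_y


s_food = 4.664                       # Sy from the descriptive statistics
betas = {
    "income": beta_coefficient(0.148, 23.96, s_food),
    "family_size": beta_coefficient(0.793, 1.61, s_food),
}

# Rank the IVs by the absolute size of their beta coefficients.
ranking = sorted(betas, key=lambda name: abs(betas[name]), reverse=True)

for name in ranking:
    print(name, round(betas[name], 3))
```

Ranking by absolute value matters because a large negative beta, such as the hypothetical -.825 above, would still indicate a strong predictor.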
6. Observed/computed t statistic (tov): T-test of the Significance and Signs of the Regression Parameters.
As part of investigating the accuracy of the fitted SRP, it is often useful to verify both
the statistical significance and the sign (i.e., economic significance) of the regression
parameters/coefficients (B1, B2) individually. For statistical
significance, the maintained hypothesis is that the IV or Xj has no causal
effect on the DV or Y. Thus, the null is H0: Bj
= 0 (i.e., Xj has no causal effect on the DV) against the
alternative that Ha: Bj is not equal to
zero (i.e., Xj does indeed have some causal effect on the DV;
such effect may be direct or indirect).
a. Testing for Statistical Significance of Bj
With respect to income, the null is H0: B1
= 0 (i.e., Income has no causal effect on Food Expenditure), against the
alternative that Ha: B1 is not equal to
zero (i.e., income indeed does have some causal effect on food
expenditure). For the Family Size, the null is H0: B2 = 0 (i.e., Family Size has no
causal effect on food expenditure), against the alternative that Ha: B2 is not equal to zero (i.e.,
Family Size indeed does have some causal effect on food expenditure). For alpha =
.05 and v = n -k-1 = 20 -2-1 = 17, this implies a critical t-value of tcv = t.025,17 = ±2.110. For Income, tov = 9.049. Thus,
Ho must unequivocally be rejected
in favor of Ha; in which case,
family Income can be said to have a significant influence on family Food Expenditure. For family Size, tov = 3.245.
So, Ho must be rejected in favor
of Ha; in which case, family Size
can be said to have a significant influence on family Food Expenditure.
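Each observed t statistic is simply the coefficient divided by its standard error, tov = bj/Sbj. The sketch below recomputes them from the rounded table values (so they differ slightly from the SPSS figures, which use unrounded inputs) and, as an added illustration not shown in the output, forms 95% confidence intervals bj ± tcv·Sbj using the critical value 2.110 quoted above.

```python
T_CRIT = 2.110  # t(.025, 17), as quoted in the text

# (coefficient, standard error) pairs copied from the 'Coefficients' table.
coefficients = {
    "constant": (-1.118, 0.655),
    "income": (0.148, 0.016),
    "family_size": (0.793, 0.244),
}

results = {}
for name, (b, se) in coefficients.items():
    t_ov = b / se
    results[name] = {
        "t": t_ov,
        "significant": abs(t_ov) > T_CRIT,                 # two-tailed decision
        "ci_95": (b - T_CRIT * se, b + T_CRIT * se),       # illustrative interval
    }

for name, res in results.items():
    low, high = res["ci_95"]
    print(name, round(res["t"], 3), res["significant"],
          (round(low, 3), round(high, 3)))
```

An interval that excludes zero corresponds to rejecting Ho: Bj = 0 at the same alpha; only the constant's interval includes zero here.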
b. Testing for Economic/practical Significance of Bj
An interesting variation of the t-test is to verify the economic significance of the
parameter with respect to the direction of causality of the associated IV. In this
case, the null is phrased as H0: Bj has a
value that is at most zero, against Ha:
Bj > 0
(i.e., its value is strictly positive according to the underlying economic theory). If the
sign of the parameter was expected to be negative on the basis of theory or common sense,
then the null is phrased as H0: Bj has a
value that is at least zero, against Ha:
Bj < 0
(i.e., its value is strictly negative according to the underlying economic theory).
Consider, for example, family size where the sign of B2
is expected to be positive. H0: B2 has a value
that is at the most zero against Ha:
B2 > 0.
At the level of alpha = .05, the critical t-value is tcv = t.05,17 =
+1.740. But the tov = 3.245 , thus Ho of negative or no effect of
family Size must be rejected unequivocally.
Note that in the test for economic significance of a parameter the alpha value is not
divided by two since this is always a one-tailed test; whereas, it is divided by 2 in the
test for statistical significance since this is always a two-tailed test.
7. Prediction -- using the estimated SRP
Suppose a typical or ith family drawn from
the same population had an annual Income of $30,000 in 1993 with a family size of 2
members (this is the 8th family in our sample). Its estimated/predicted annual Food Expenditure,
corresponding to X1,8 = 30 (thousands of dollars) and X2,8 = 2, would be ý8 = -1.118 + .148 x 30 + .793 x 2 = 4.908 thousands of dollars. Thus, $4,908 is
the best estimate of the average annual Food Expenditure for this family. But this
family actually spent 5.8 thousands of dollars or $5,800. Hence, the positive residual of
$892 (i.e., e8 = 5800 - 4908) is the amount by which the estimated
SRP has underpredicted the annual Food Expenditure for this family.
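The prediction and its residual can be reproduced with a small function that encodes the estimated SRP. This is a sketch only: the function name `predict_food_expenditure` is illustrative, and the default coefficients are the rounded estimates from the 'Coefficients' table.

```python
def predict_food_expenditure(income_000, family_size,
                             a=-1.118, b1=0.148, b2=0.793):
    """Predicted annual food expenditure in $000 from the estimated SRP."""
    return a + b1 * income_000 + b2 * family_size


predicted = predict_food_expenditure(30, 2)   # family 8: $30,000 income, 2 members
actual = 5.8                                  # observed expenditure in $000
residual = actual - predicted                 # positive means underprediction

print(round(predicted, 3), round(residual, 3))
```

Both figures are in thousands of dollars, so they correspond to the $4,908 prediction and $892 residual discussed above.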
Copyright© 1996, Ebenge Usip, all rights reserved.
Last revised:
Wednesday, July 10, 2013.