Model Building in Regression and Comparing Models
Model building in regression takes many forms, but one common form is for individuals with subject-matter knowledge to develop theories about potential sources of variation in some quantity of interest. This tutorial will focus on this approach.
Overview:
Develop theories
Define variables
Translate theories into statistical model statements
Collect data
Obtain computer output for each model
Assess the usefulness of each model (i.e., whether the model should continue to be considered). [This should not be confused with evaluating the specific theory. A theory may be very specific and not completely supported by the data, yet the model developed from this theory may be better than some other model from a more general theory that is supported by the data.]
If multiple models appear to be useful, determine which one provides the best information (the best of those considered).
Example:
Suppose that a professor wants to understand variability in mid-term grades in BUSA 3110, a second statistics course. The professor might develop theories such as:
Once the theories are stated, variables can be defined and statistical model statements can be written to correspond to the theories. For example, the following variables might be used for this situation:
AVERAGE: Numerical grade for a student midway through the semester
2400 GRADE: Letter grade for MATH 2400 (the prerequisite course for BUSA 3110)
STAT1: A numerical representation of the 2400 GRADE (A=4, B=3, C=2, D=1, and F=0)
MAJOR: The student’s major (only students majoring in one of the four business majors were considered) [A qualitative variable that must be converted to dummy variables for analysis.]
ABSENCES: Number of class periods for which the student has not signed the attendance sheet
Model statements that could be used to analyze this situation include:
Model 1: AVERAGE = β₀ + β₁·ABSENCES + ε
Model 2: AVERAGE = β₀ + β₁·STAT1 + ε
Model 3: AVERAGE = β₀ + β₁·ABSENCES + β₂·STAT1 + ε
Model 4: AVERAGE = β₀ + β₁·ABSENCES + β₂·STAT1 + β₃·ABSSTAT + ε
Where ABSSTAT is an interaction term between ABSENCES and STAT1
Model 5: AVERAGE = β₀ + β₁·ABSENCES + β₂·STAT1 + β₃·ACCT + β₄·FINC + β₅·MGMT + ε
Where
ACCT = 1 for accounting majors; 0 otherwise
FINC = 1 for finance majors; 0 otherwise
MGMT = 1 for management majors; 0 otherwise
Note: It would be incorrect to define a fourth dummy variable since there
are only four majors included in the data set; the fourth major is included by
recognizing that someone who is not an accounting major, not a finance major,
and not a management major must be a marketing major!
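As an illustration of these variable definitions, the following Python sketch turns one student's raw record into the predictors used in Models 1 through 5. The function name, the record layout, and the lowercase major labels are hypothetical, and the sketch assumes that the interaction term ABSSTAT is the product of ABSENCES and STAT1, which is the usual construction but is not stated in the tutorial.

```python
# Illustrative sketch: build the predictors for one student.
# The helper name and the record layout are hypothetical.

GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}  # STAT1 coding from the text


def encode_student(absences, grade_2400, major):
    """Return a dict of predictor values for one student."""
    stat1 = GRADE_POINTS[grade_2400]
    return {
        "ABSENCES": absences,
        "STAT1": stat1,
        # Assumption: the interaction term is the product of its two parts.
        "ABSSTAT": absences * stat1,
        # Three dummies cover four majors; marketing is the baseline (all zeros).
        "ACCT": 1 if major == "accounting" else 0,
        "FINC": 1 if major == "finance" else 0,
        "MGMT": 1 if major == "management" else 0,
    }


row = encode_student(absences=3, grade_2400="B", major="marketing")
print(row)
```

A marketing major is represented by zeros in all three dummy columns, which is why a fourth dummy variable is not needed.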
The data for this example and the output for these models are available in an Excel file. The data are on the first worksheet; the output is on the second worksheet (click the tab at the lower left of the spreadsheet to move from worksheet to worksheet within the file). An α of .05 will be used for this exercise.
Checking the p value in the ANOVA table (Excel calls this "Significance F") reveals that all five models provide some useful information, since the output for each model produces a p value in the ANOVA table that is below the selected α. If any of the models had resulted in a p value in the ANOVA table that was larger than the selected α, those models would be omitted from further consideration. [NOTE: This does not say whether the data support the specific theories; further analysis of the output would be required to address these statements.]
The models will be compared pairwise: at each step, the "better" model remains in consideration and the other is dropped from further consideration. The following flow diagram will be used to help with the comparisons:
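The decision logic applied in the comparisons that follow can be sketched in Python. This is one reading of the procedure, inferred from the worked comparisons in this tutorial; the function name and the returned labels are hypothetical.

```python
def choose_comparison(model_a_vars, model_b_vars):
    """Pick the rule used to compare two candidate models.

    Each argument is a collection of independent-variable names.
    """
    # Order the two models so that "shorter" has no more variables than "longer".
    shorter, longer = sorted((set(model_a_vars), set(model_b_vars)), key=len)

    # First question: are all variables of the shorter model also in the longer one?
    if not shorter <= longer:
        return "adjusted R-squared"

    added = longer - shorter
    if not added:
        raise ValueError("the two models contain the same variables")
    if len(added) == 1:
        return "p value of the added variable"
    return "F test on the added variables"


print(choose_comparison(["ABSENCES"], ["STAT1"]))
print(choose_comparison(["ABSENCES"], ["ABSENCES", "STAT1"]))
print(choose_comparison(["ABSENCES"], ["ABSENCES", "STAT1", "ABSSTAT"]))
```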

For this example:
Comparing Models 1 and 2:
AVERAGE = β₀ + β₁·ABSENCES + ε and AVERAGE = β₀ + β₁·STAT1 + ε
Model 1 has one independent variable (ABSENCES); Model 2 has one independent variable (STAT1). Neither model contains the other model's variable, so the answer to the first question is "no." The models are therefore compared on adjusted R². Adjusted R² for Model 1 is .348; adjusted R² for Model 2 is .193. Therefore, Model 1 is better than Model 2.
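For reference, adjusted R² can be computed from the ordinary R², the number of observations, and the number of independent variables using the standard formula. The sketch below uses made-up numbers (not values from the Excel output) only to show that, for the same R², the model with more variables receives the lower adjusted R².

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Standard adjusted R-squared: 1 - (1 - R2)(n - 1)/(n - k - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical inputs, chosen only for illustration.
one_var = adjusted_r2(0.50, n_obs=30, n_predictors=1)
two_var = adjusted_r2(0.50, n_obs=30, n_predictors=2)
print(round(one_var, 3), round(two_var, 3))
```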
Comparing Models 1 (the better in the previous comparison) and 3:
AVERAGE = β₀ + β₁·ABSENCES + ε and AVERAGE = β₀ + β₁·ABSENCES + β₂·STAT1 + ε
Model 1 has one independent variable (ABSENCES); Model 3 has two independent variables (ABSENCES and STAT1). Since all of the variables in Model 1 are also in Model 3, the answer to the first question is "yes." The only additional variable is STAT1, so we look at the p value associated with STAT1 in the coefficients section of the output for Model 3. This p value is .122. Since this p value is not less than the selected α of .05, the "longer" model is not considered to be better. Model 3 will not be considered further.
Comparing Models 1 (the better in the previous comparison) and 4:
AVERAGE = β₀ + β₁·ABSENCES + ε and AVERAGE = β₀ + β₁·ABSENCES + β₂·STAT1 + β₃·ABSSTAT + ε
Model 1 has one independent variable (ABSENCES); Model 4 has this same variable plus two more independent variables. Therefore, an F statistic and the associated p value must be obtained to see if the two new variables should be added. This F statistic is used to test H₀: β₂ = β₃ = 0 vs. Hₐ: at least one of these β’s is not 0. The F statistic is obtained by determining the reduction in unexplained variation per additional variable and comparing this to the MSE (mean square error) for the "longer" model.
F = [(SSE_shorter - SSE_longer) / (number of new variables)] / MSE_longer
Using the numbers from the printout, F=[(2979.328119-2636.063345)/2]/119.8210611 = 1.4324
Excel is used to find the p value associated with this F statistic. The numerator degrees of freedom equal the number of additional variables in the "longer" model, and the denominator degrees of freedom are those in the error (residual) row of the output for the "longer" model (Model 4). The Excel command used to find the p value is
=FDIST(calculated F, numerator degrees of freedom, denominator degrees of freedom)
For this comparison the command is =FDIST(1.4324,2,22); the p value returned is .2601. Since this p value is not less than the selected α of .05, Model 4 is not considered further.
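The same calculation can be sketched in Python as a cross-check on the spreadsheet. The sketch assumes SciPy is available; its `scipy.stats.f.sf` function returns the right-tail probability of the F distribution, which is what Excel's FDIST returns. The function name `partial_f_test` is hypothetical, and the MSE is recomputed here as the SSE of the "longer" model divided by its error degrees of freedom.

```python
from scipy.stats import f


def partial_f_test(sse_shorter, sse_longer, n_new, df_error_longer):
    """F statistic and p value for adding n_new variables to the shorter model."""
    mse_longer = sse_longer / df_error_longer
    f_stat = ((sse_shorter - sse_longer) / n_new) / mse_longer
    # Right-tail area of the F distribution, like Excel's FDIST.
    p_value = f.sf(f_stat, n_new, df_error_longer)
    return f_stat, p_value


# SSE values and degrees of freedom as quoted in the text for Models 1 and 4.
f_stat, p_value = partial_f_test(2979.328119, 2636.063345, n_new=2, df_error_longer=22)
print(round(f_stat, 4), round(p_value, 4))
```

With the numbers from the printout, this should agree with the F statistic of 1.4324 and the p value of .2601 reported above.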
Comparing Models 1 (the better in the previous comparison) and 5:
AVERAGE = β₀ + β₁·ABSENCES + ε and AVERAGE = β₀ + β₁·ABSENCES + β₂·STAT1 + β₃·ACCT + β₄·FINC + β₅·MGMT + ε
Model 1 has one independent variable (ABSENCES); Model 5 has this same variable plus four more independent variables. Therefore, an F statistic and the associated p value must be obtained to see if the four new variables should be added. The calculated F statistic is 1.827 with an associated p value of .1632. Since this p value is not less than the selected α of .05, Model 5 is not considered further.
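The p value for this comparison can be checked the same way. The sketch below assumes SciPy is available and assumes 20 error degrees of freedom for Model 5; that figure is inferred rather than quoted, since Model 4 has 22 error degrees of freedom with three variables and Model 5 has two more variables fitted to the same data.

```python
from scipy.stats import f

f_stat = 1.827         # F statistic reported in the text
df_numerator = 4       # four variables added to Model 1
df_denominator = 20    # assumption: inferred from Model 4's 22 error df

# Right-tail area, the same quantity Excel's FDIST returns.
p_value = f.sf(f_stat, df_numerator, df_denominator)
print(round(p_value, 4))
```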
As a result, Model 1 appears to be the best model that has been considered.