Model Building in Regression
and Comparing Models

 

Model building in regression takes many forms, but one common form is for individuals with subject-matter knowledge to develop theories about potential sources of variation in some quantity of interest.  This tutorial will focus on this approach. 

Overview:
Develop theories
Define variables
Translate theories into statistical model statements
Collect data
Obtain computer output for each model
Assess the usefulness of each model (i.e., whether the model should continue to be considered) [This should not be confused with evaluating the specific theory.  A theory may be very specific and not completely supported by the data, yet the model developed from this theory may be better than some other model from a more general theory that is supported by the data.]
If multiple models appear to be useful, determine which one provides the best information (the best of those considered).

Example:

Suppose that a professor wants to understand variability in mid-term grades in BUSA 3110, a second statistics course.  The professor might develop theories such as:

  1. Attendance is important.  The more absences that a student has, the lower the grade will be.
  2. Mastery of the pre-requisite material is important.  The better a student did in the first statistics course, the better that student would be expected to do in the second statistics course.
  3. Both attendance in the current course and performance in the prerequisite course need to be considered when estimating how a student will do in the current course.
  4. Attendance is more important for students who struggled in the prerequisite course than for students who did well in the earlier course.
  5. Even when you consider attendance and performance in the prerequisite course, you still need to take into account the major of the student, since some majors tend to draw more quantitatively oriented students than others.

Once the theories are stated, variables can be defined and statistical model statements can be written to correspond to the theories.  For example, the following variables might be used for this situation:

      AVERAGE      Numerical grade for a student midway through the semester
      2400 GRADE  Letter grade for MATH 2400 (the prerequisite course for BUSA 3110) 
      STAT1            A numerical representation of the 2400 GRADE (A=4, B=3, C=2, D=1, and F=0)
      MAJOR          The student’s major (only students majoring in one of the four business majors were considered) 
                              [A qualitative variable that must be converted to dummy variables for analysis.]
      ABSENCES    Number of class periods that the student has not signed the attendance sheet
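
The STAT1 coding above can be sketched in Python as a simple lookup.  The function name and the sample letter grades are illustrative assumptions; they are not taken from the course data.

```python
# Grade points as defined for STAT1: A=4, B=3, C=2, D=1, F=0.
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

def stat1_from_grade(letter):
    """Convert a MATH 2400 letter grade to its STAT1 value."""
    return GRADE_POINTS[letter.strip().upper()]

# Hypothetical letter grades, for illustration only.
sample_grades = ["A", "c", "B"]
stat1_values = [stat1_from_grade(g) for g in sample_grades]
print(stat1_values)  # [4, 2, 3]
```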

Model statements that could be used to analyze this situation include:

Model 1:            AVERAGE = β0 + β1ABSENCES + ε
Model 2:            AVERAGE = β0 + β1STAT1 + ε
Model 3:            AVERAGE = β0 + β1ABSENCES + β2STAT1 + ε
Model 4:            AVERAGE =  β0 + β1ABSENCES + β2STAT1 + β3ABSSTAT + ε
                              Where ABSSTAT is an interaction term between ABSENCES and STAT1
Model 5:            AVERAGE = β0 + β1ABSENCES + β2STAT1 + β3ACCT + β4FINC + β5MGMT + ε
                              Where
                                 ACCT = 1 for accounting majors; 0 otherwise
                                 FINC = 1 for finance majors; 0 otherwise
                                 MGMT = 1 for management majors; 0 otherwise
Note:  It would be incorrect to define a fourth dummy variable since there are only four majors included in the data set; the fourth major is included by recognizing that someone who is not an accounting major, not a finance major, and not a management major must be a marketing major!
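
As a rough sketch, the derived predictors for Models 4 and 5 could be built for one student as follows.  Two things here are assumptions rather than details from the tutorial: the major codes ("ACCT", "FINC", "MGMT", "MKTG") are hypothetical labels, and the interaction term is taken to be the product of ABSENCES and STAT1, which is the usual construction.

```python
MAJORS = ("ACCT", "FINC", "MGMT", "MKTG")  # hypothetical codes for the four majors

def design_row(absences, stat1, major):
    """Build the derived predictors for one student."""
    if major not in MAJORS:
        raise ValueError(f"unexpected major: {major}")
    return {
        "ABSENCES": absences,
        "STAT1": stat1,
        "ABSSTAT": absences * stat1,  # assumed form of the interaction term
        "ACCT": 1 if major == "ACCT" else 0,
        "FINC": 1 if major == "FINC" else 0,
        "MGMT": 1 if major == "MGMT" else 0,
        # No MKTG column: marketing is the baseline (all three dummies are 0).
    }

# Hypothetical student: 3 absences, a C in the prerequisite, marketing major.
row = design_row(absences=3, stat1=2, major="MKTG")
print(row)
```

Only three dummy columns are created for the four majors, matching the note above.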

The data for this example and the output for these models are available in an Excel file.  The data are on the first worksheet; the output is on the second worksheet (click the tab at the lower left of the spreadsheet to move from worksheet to worksheet within the file).  An α of .05 will be used for this exercise.

Checking the p value in the ANOVA table (Excel calls this "Significance F") reveals that all five models provide some useful information, since each model's p value in the ANOVA table is below the selected α.  If any model had produced a p value in the ANOVA table larger than the selected α, that model would have been omitted from further consideration.   [NOTE:  This does not say whether the data support the specific theories; further analysis of the output would be required to address these statements.]

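The models will be compared pairwise, with the "better" model at each step remaining in consideration and the other dropped from further consideration.  The following flow diagram will be used to help with the comparisons:

[Flow diagram not reproduced.  As applied in the comparisons that follow, its first question asks whether all of the variables in one model are also in the other.  If not, the model with the higher adjusted R2 is better.  If so, and the "longer" model adds one variable, the p value for that variable's coefficient is compared with α; if it adds two or more variables, an F test on the added variables is used.]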
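
The decision logic used in the comparisons in this tutorial can be sketched in Python.  This is an interpretation inferred from the worked comparisons, not a transcription of the original diagram; the function name and return strings are hypothetical.

```python
def comparison_method(vars_a, vars_b):
    """Pick how to compare two regression models from their predictor names."""
    a, b = set(vars_a), set(vars_b)
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    # First question: are all variables of the shorter model in the longer one?
    if not shorter < longer:  # "<" tests for a proper subset
        return "adjusted R2"
    extra = longer - shorter
    if len(extra) == 1:
        return "p value of the added coefficient"
    return "F test on the added variables"

print(comparison_method(["ABSENCES"], ["STAT1"]))
print(comparison_method(["ABSENCES"], ["ABSENCES", "STAT1"]))
print(comparison_method(["ABSENCES"], ["ABSENCES", "STAT1", "ABSSTAT"]))
```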

For this example:

Comparing Models 1 and 2
AVERAGE = β0 + β1ABSENCES + ε  and   AVERAGE = β0 + β1STAT1 + ε
Model 1 has one independent variable (ABSENCES); Model 2 has one independent variable (STAT1).  Neither model contains all of the variables in the other, so the answer to the first question is "no."  Adjusted R2 for Model 1 is .348; adjusted R2 for Model 2 is .193.  Therefore, Model 1 is better than Model 2.
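
For reference, adjusted R2 penalizes R2 for the number of independent variables in the model.  A minimal sketch of the standard formula follows; the numbers passed to it are made up for illustration and are not from the Excel output.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 for a model with k independent variables fit to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical values, for illustration only: R2 = 0.5, n = 11, k = 2.
example = adjusted_r2(0.5, 11, 2)
print(example)  # 0.375
```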

Comparing Models 1 (the better in the previous comparison) and 3
AVERAGE = β0 + β1ABSENCES + ε  and   AVERAGE = β0 + β1ABSENCES + β2STAT1 + ε
Model 1 has one independent variable (ABSENCES); Model 3 has two independent variables (STAT1 and ABSENCES).  Since all of the variables in Model 1 are also in Model 3, the answer to the first question is "yes."  Also, the only additional variable is STAT1, so we will look at the p value associated with STAT1 in the coefficients section of the output for Model 3.  This p value is .122.  Since this p value is not less than the selected α of .05, the "longer" model is not considered to be better.  Model 3 will not be considered further.

Comparing Models 1 (the better in the previous comparison) and 4:
AVERAGE = β0 + β1ABSENCES + ε  and   AVERAGE = β0 + β1ABSENCES + β2STAT1 + β3ABSSTAT + ε
Model 1 has one independent variable (ABSENCES); Model 4 has this same variable plus two more independent variables.  Therefore, an F statistic and the associated p value must be obtained to see if the two new variables should be added.  This F statistic is used to test H0: β2 = β3 = 0  vs. Ha: at least one of these β's is not 0.  The F statistic is obtained by determining the reduction in unexplained variation per additional variable and comparing this to the MSE (mean square error) for the "longer" model.

F = [(SSE"shorter" - SSE"longer") / number of new variables] / MSE"longer"

Using the numbers from the printout, F=[(2979.328119-2636.063345)/2]/119.8210611 = 1.4324
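
The same arithmetic can be checked with a short Python sketch.  The function and argument names are hypothetical; the SSE values come from the printout, and the 22 error degrees of freedom for Model 4 are the denominator degrees of freedom used in this comparison.

```python
def partial_f(sse_shorter, sse_longer, n_new, df_error_longer):
    """F statistic for testing whether the added variables improve the model."""
    mse_longer = sse_longer / df_error_longer  # MSE of the "longer" model
    return ((sse_shorter - sse_longer) / n_new) / mse_longer

# Model 1 (shorter) vs. Model 4 (longer), two new variables.
f_stat = partial_f(2979.328119, 2636.063345, n_new=2, df_error_longer=22)
print(round(f_stat, 4))  # 1.4324
```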

Excel is used to look up the p value associated with this F statistic.  This F statistic needs to be looked up in an F table with numerator degrees of freedom determined by the number of additional variables in the "longer" model and denominator degrees of freedom based on the degrees of freedom in the error (residual) row for the "longer" model (Model 4).  The Excel command used to find the p value is

=FDIST(calculated F, numerator degrees of freedom, denominator degrees of freedom)

For this comparison the command is =FDIST(1.4324,2,22); the p value returned is .2601.  Therefore, Model 4 is not considered further.
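
Outside of Excel, this particular p value can be reproduced with only the Python standard library, because the upper-tail probability of an F distribution has a simple closed form when the numerator degrees of freedom equal 2.  The function name is hypothetical, and the shortcut does not apply to other numerator degrees of freedom.

```python
def f_pvalue_2_numerator_df(f_stat, df_denominator):
    """Upper-tail probability P(F > f_stat) for an F(2, df_denominator) distribution."""
    return (1 + 2 * f_stat / df_denominator) ** (-df_denominator / 2)

p_value = f_pvalue_2_numerator_df(1.4324, 22)
print(round(p_value, 4))  # approximately .2601, in line with the FDIST result

alpha = 0.05
print(p_value < alpha)  # False, so the two added variables are not kept
```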

Comparing Models 1 (the better in the previous comparison) and 5:
AVERAGE = β0 + β1ABSENCES + ε  and   AVERAGE = β0 + β1ABSENCES + β2STAT1 + β3ACCT + β4FINC + β5MGMT + ε
Model 1 has one independent variable (ABSENCES); Model 5 has this same variable plus four more independent variables.  Therefore, an F statistic and the associated p value must be obtained to see if the four new variables should be added.  The calculated F statistic is 1.827 with an associated p value of .1632.  Therefore, Model 5 is not considered further.
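
As a cross-check, the reported p value can be reproduced in Python under one inferred assumption: the denominator degrees of freedom for Model 5 are taken to be 20.  That figure is not stated in the tutorial; it follows from Model 4 having 22 error degrees of freedom with three independent variables (so 26 observations), and Model 5 having five independent variables.  The closed form below holds only when the numerator degrees of freedom equal 4, and the function name is hypothetical.

```python
def f_pvalue_4_numerator_df(f_stat, df_denominator):
    """Upper-tail probability P(F > f_stat) for an F(4, df_denominator) distribution."""
    b = df_denominator / 2
    y = 4 * f_stat / (4 * f_stat + df_denominator)
    return (1 - y) ** b * (1 + b * y)

# Denominator df = 20 is inferred (26 observations - 5 variables - 1), not given.
p_model5 = f_pvalue_4_numerator_df(1.827, 20)
print(round(p_model5, 4))  # approximately .1632, in line with the reported p value
```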

As a result, Model 1 appears to be the best model that has been considered.
