# Rutgers University Newark Regression Methods Worksheet

Based on *Linear Regression Analysis*, 6E, by Montgomery, Peck & Vining.
4.1 Introduction

Assumptions:
1. The relationship between the response and the regressors is linear (at least approximately).
2. The error term ε has zero mean.
3. The error term ε has constant variance σ².
4. The errors are uncorrelated.
5. The errors are normally distributed (required for tests and intervals).
4.2 Residual Analysis
• Definition of residual (= data − fit):

ei = yi − ŷi,  i = 1, 2, …, n

• Approximate average variance of the residuals:

Σ ei² / (n − p) = SSRes / (n − p) = MSRes
4.2.2 Methods for Scaling Residuals
• Scaling helps in identifying outliers or extreme values
Four methods:
1. Standardized residuals
2. Studentized residuals
3. PRESS residuals
4. R-student residuals
4.2.2 Methods for Scaling Residuals
1. Standardized Residuals

di = ei / √MSRes,  i = 1, 2, …, n

– The di have mean zero and variance approximately equal to 1.
– A large value of di (|di| > 3) may indicate an outlier.
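The standardized residuals above can be sketched in a few lines of numpy. The data here is hypothetical (not the textbook's), and the fit is a plain least-squares line:

```python
import numpy as np

# Hypothetical dataset: a straight-line relationship plus noise.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 20)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, size=x.size)

X = np.column_stack([np.ones_like(x), x])      # model matrix [1, x]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares fit
e = y - X @ beta                               # ordinary residuals
n, p = X.shape
ms_res = e @ e / (n - p)                       # MS_Res = SS_Res / (n - p)

d = e / np.sqrt(ms_res)                        # standardized residuals
```

With an intercept in the model the residuals sum to zero, so the di have mean zero; their variance is close to, but slightly below, 1.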
4.2.2 Methods for Scaling Residuals
2. Studentized Residuals
– MSRes is only an approximation of the variance of the ith residual; the exact variance is Var(ei) = σ²(1 − hii).
– Improve the scaling by dividing ei by the exact standard deviation.
The studentized residuals are then

ri = ei / √[MSRes(1 − hii)],  i = 1, 2, …, n

– The ri have mean zero and unit variance.
– Studentized residuals are generally larger than the corresponding standardized residuals.
4.2.2 Methods for Scaling Residuals
3. PRESS Residuals
Examine the differences

e(i) = yi − ŷ(i),  i = 1, 2, …, n

– These are the differences between the actual response for the ith data point and the fitted value of the response at the ith data point, computed using all observations except the ith one.
• Logic: If the ith point is unusual, then it can “overly”
influence the regression model.
– If the ith point is used in fitting the model, then the residual for
the ith point will be small.
– If the ith point is not used in fitting the model, then the residual
will better reflect how unusual that point is.
• Prediction error:

e(i) = yi − ŷ(i)

• Calculated for each point, these are called PRESS residuals – they will be used later to calculate the “prediction error sum of squares.”
• Calculate the PRESS residuals using

e(i) = ei / (1 − hii)
• The standardized PRESS residuals are

e(i) / √[Var(e(i))] = [ei / (1 − hii)] / √[σ² / (1 − hii)] = ei / √[σ²(1 − hii)]

Note: these are the studentized residuals when MSRes is used as the estimate of σ².
4. R-Student
• MSRes is an “internal” estimate of variance.
• Use instead a variance estimate based on all observations except the ith observation:

S(i)² = [(n − p)MSRes − ei²/(1 − hii)] / (n − p − 1)
• The R-student residual is

ti = ei / √[S(i)²(1 − hii)],  i = 1, 2, …, n

This is an externally studentized residual.
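The internally studentized residuals ri and the externally studentized (R-student) residuals ti can be computed together, since S(i)² has the closed form above and no refitting is needed. A minimal sketch on hypothetical data:

```python
import numpy as np

# Hypothetical data; any linear model with an intercept works the same way.
rng = np.random.default_rng(7)
x = np.linspace(0, 10, 25)
y = 3.0 - 0.8 * x + rng.normal(0, 0.5, size=x.size)

X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
h = np.diag(H)                                 # leverages h_ii
beta = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta
n, p = X.shape
ms_res = e @ e / (n - p)

# internally studentized: r_i = e_i / sqrt(MS_Res (1 - h_ii))
r = e / np.sqrt(ms_res * (1 - h))

# externally studentized (R-student), using the deletion variance
# S^2_(i) = [(n - p) MS_Res - e_i^2 / (1 - h_ii)] / (n - p - 1)
s2_i = ((n - p) * ms_res - e**2 / (1 - h)) / (n - p - 1)
t = e / np.sqrt(s2_i * (1 - h))
```

A useful check: the two scalings are related exactly by ti = ri·√[(n − p − 1)/(n − p − ri²)], so R-student inflates large residuals more than studentization does.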
Leverage and influence
4.2.3 Residual Plots
• Normal Probability Plot of Residuals
– Checks the normality assumption
• Residuals against fitted values, ŷi
– Checks for nonconstant variance
– Checks for nonlinearity
– Look for potential outliers
• Do not plot residuals versus yi (why?)
• Residuals against regressors in the model
– Checks for nonconstant variance
– Checks for nonlinearity
• Residuals against regressors not in the model
– If a pattern appears, it could indicate that adding that regressor might improve the model fit
• Residuals against time order
– Checks for correlated errors
Example 4.4 The Delivery Time Data

Plot of Residuals in Time Sequence
4.2.4 Partial Regression and Partial Residual Plots
Partial Regression Plots
• Why are these used?
– To determine if the correct relationship between y and xi
has been identified.
– To determine the marginal contribution of a variable,
given all other variables are in the model.
• Method
Say we want to know the importance/relationship between y
and some regressor variable, xi.
– Regress y against all variables except xi and calculate
residuals.
– Regress xi against all other regressor variables and calculate
residuals.
– Plot these two sets of residuals against each other.
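The two residual sets in the method above can be constructed directly. This sketch uses hypothetical data with two regressors and checks the known property (stated later in this section) that the slope of the partial regression plot equals the full-model coefficient of the variable of interest:

```python
import numpy as np

# Hypothetical data: y depends on x1 and x2, which are mildly correlated.
rng = np.random.default_rng(3)
n = 40
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.3, size=n)

def resid(A, v):
    """Residuals from regressing v on the columns of A (with intercept)."""
    A = np.column_stack([np.ones(len(v)), A])
    b, *_ = np.linalg.lstsq(A, v, rcond=None)
    return v - A @ b

e_y = resid(x1, y)    # y adjusted for all regressors except x2
e_x = resid(x1, x2)   # x2 adjusted for the other regressors
slope = (e_x @ e_y) / (e_x @ e_x)   # slope of the partial regression plot

# For comparison: the coefficient of x2 in the full model.
Xfull = np.column_stack([np.ones(n), x1, x2])
b_full, *_ = np.linalg.lstsq(Xfull, y, rcond=None)
```

Plotting e_y against e_x gives the partial regression (added-variable) plot; the equality of `slope` and `b_full[2]` is exact, not approximate.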
• Interpretation
– If the plot appears to be linear, then a linear relationship between
y and xi seems reasonable.
– If the plot is curvilinear, a transformation such as xi² or 1/xi may be needed instead.
– If xi is a candidate variable, and a horizontal “band” appears,
then that variable adds no new information.
Example 4.5
4.2.4 Partial Regression and Partial Residual Plots
• Use with caution, they only suggest possible relationships.
• Do not generally detect interaction effects.
• If multicollinearity is present, partial regression plots can give incorrect information.
• The slope of the partial regression plot is the regression
coefficient for the variable of interest!
4.2.5 Other Residual Plotting and Analysis Methods
• Plotting regressors against each other can give
information about the relationship between the two:
– may indicate correlation between the regressors.
– may uncover remote points.
Note the location of these two points in the x-space.
4.3 The PRESS Statistic
• PRESS residual:

e(i) = ei / (1 − hii)

• Prediction Error Sum of Squares (PRESS) statistic:

PRESS = Σ [ei / (1 − hii)]²,  summed over i = 1, …, n

• A small value of the PRESS statistic is desired.
• See Table 4.1
4.3 The PRESS Statistic
R² for Prediction Based on PRESS

R²prediction = 1 − PRESS / SST

• Interpretation: we expect the model to explain about 100·R²prediction percent of the variability in predicting a new observation.
• PRESS is a valuable statistic for comparison of models.
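PRESS never requires n refits: the identity e(i) = ei/(1 − hii) gives every leave-one-out prediction error from a single fit. A sketch on hypothetical data, with the R² for prediction computed from it:

```python
import numpy as np

# Hypothetical data for a single-regressor model.
rng = np.random.default_rng(11)
x = np.linspace(1, 30, 25)
y = 4.0 + 0.7 * x + rng.normal(0, 2.0, size=x.size)

X = np.column_stack([np.ones_like(x), x])
h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)  # hat diagonals
beta = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta

press_resid = e / (1 - h)               # exact leave-one-out errors
press = np.sum(press_resid**2)          # prediction error sum of squares
ss_t = np.sum((y - y.mean())**2)        # corrected total sum of squares
r2_pred = 1 - press / ss_t
```

Since each ei is divided by (1 − hii) < 1, PRESS is always at least SSRes, so R²prediction is always below the ordinary R².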
4.4 Outliers
• An outlier is an observation that is considerably different
from the others.
• Formal tests for outliers
• Points with large residuals may be outliers
• Impact can be assessed by removing the points and refitting
• How should they be treated?
4.5 Lack of Fit of the Regression Model
A Formal Test for Lack of Fit
• Assumes
– normality, independence, constant variance assumptions have
been met.
– Only the first-order or straight line model is in doubt.
• Requires
– replication of y for at least one level of x.
• With replication, we can obtain a “model-independent” estimate of σ².
• Say there are ni observations of the response at the ith level of the regressor, xi, i = 1, 2, …, m.
• yij denotes the jth observation on the response at xi, j = 1, 2, …, ni.
• The total number of observations is n = Σ ni.
• Partitioning of the residual sum of squares:
SSRes = SSPE + SSLOF
• SSPE – pure error sum of squares
• SSLOF – lack of fit sum of squares
• Note that the (ij)th residual can be partitioned as yij − ŷi = (yij − ȳi) + (ȳi − ŷi); squaring and summing over i and j gives SSRes = SSPE + SSLOF.
• If the assumption of constant variance is satisfied, then SSPE is a
“model-independent” measure of pure error.
• If the regression function really is linear, then the level means ȳi will be very close to the fitted values ŷi and SSLOF will be quite small.
• Test statistic:

F0 = [SSLOF/(m − 2)] / [SSPE/(n − m)] = MSLOF / MSPE

• If F0 > Fα, m−2, n−m, conclude that the regression function is not linear. Why?
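The partition and the test statistic can be computed directly when the x-levels are replicated. This sketch uses hypothetical data; the level means play the role of the model-independent fit:

```python
import numpy as np

# Hypothetical replicated data: n_i observations at each of m levels of x.
x_levels = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
reps = [3, 2, 4, 2, 3]                          # n_i at each level
rng = np.random.default_rng(5)
x = np.repeat(x_levels, reps)
y = 1.0 + 2.0 * x + rng.normal(0, 0.4, size=x.size)

# Fit the straight-line model and get SS_Res.
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
ss_res = np.sum((y - X @ beta)**2)

# Pure-error SS: deviations of replicates from their level means.
ss_pe = sum(np.sum((y[x == xl] - y[x == xl].mean())**2) for xl in x_levels)
ss_lof = ss_res - ss_pe                         # lack-of-fit SS by subtraction

n, m = len(y), len(x_levels)
f0 = (ss_lof / (m - 2)) / (ss_pe / (n - m))     # compare with F_{alpha, m-2, n-m}
```

SSLOF obtained by subtraction equals Σ ni(ȳi − ŷi)² exactly, which is the direct form of the lack-of-fit sum of squares.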
• If the test indicates lack of fit, abandon the model, try a
different one.
• If the test indicates no lack of fit, then MSLOF and MSPE are combined to estimate σ².
Example 4.8

An Approximate Procedure Based on Estimating Error from Near-Neighbors
See Example 4.10, pg. 167.
Chapter 6
Diagnostics for Leverage and Influence
6.1 Importance of Detecting Influential Observations
• Leverage Point:
– unusual x-value;
– very little effect on the regression coefficients.
• Influence Point:
– unusual in both x and y;
– noticeably pulls the fitted model toward itself.
6.2 Leverage
• The hat matrix is

H = X(X′X)⁻¹X′

• The diagonal elements of the hat matrix are given by

hii = xi′(X′X)⁻¹xi
• hii – standardized measure of the distance of the ith
observation from the center of the x-space.
• The average size of the hat diagonal is p/n.
• Traditionally, any hii > 2p/n indicates a leverage point.
• An observation with a large hii and a large residual is likely to be influential.
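The hat diagonals and the 2p/n cutoff can be sketched as follows, using hypothetical data with one deliberately remote x-value so that a leverage point is actually flagged:

```python
import numpy as np

# Hypothetical data: 19 points with x in [0, 1], one remote point at x = 10.
rng = np.random.default_rng(2)
x = np.append(rng.uniform(0, 1, 19), 10.0)
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, size=20)

X = np.column_stack([np.ones_like(x), x])
h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)  # hat diagonals
n, p = X.shape
cutoff = 2 * p / n                               # traditional leverage cutoff
flagged = np.where(h > cutoff)[0]                # indices of leverage points
```

The hat diagonals always sum to p (the trace of H), which is why p/n is their average size.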
Example 6.1 The Delivery Time Data
• Examine Table 6.1; if some possibly influential points are removed
here is what happens to the coefficient estimates and model statistics:
6.3 Measures of Influence
The influence measures discussed here are those that measure the effect of deleting the ith observation:
1. Cook’s Di, which measures the effect on β̂.
2. DFBETASj(i), which measures the effect on β̂j.
3. DFFITSi, which measures the effect on ŷi.
4. COVRATIOi, which measures the effect on the variance-covariance matrix of the parameter estimates.
6.3 Measures of Influence: Cook’s D
Cook’s distance is

Di = (β̂(i) − β̂)′X′X(β̂(i) − β̂) / (p·MSRes) = ri² hii / [p(1 − hii)]

What contributes to Di:
1. How well the model fits the ith observation, yi (through ri²).
2. How far that point is from the remaining data (through hii/(1 − hii)).
Large values of Di indicate an influential point, usually if Di > 1.
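Both forms of Di (the deletion form and the ri²·hii form) agree exactly, so no refitting is needed. A sketch on hypothetical data with one perturbed response:

```python
import numpy as np

# Hypothetical data; observation 10 is perturbed to make it influential.
rng = np.random.default_rng(4)
x = np.linspace(0, 5, 20)
y = 1.0 + 1.0 * x + rng.normal(0, 0.2, size=20)
y[10] += 3.0                                    # shift one response upward

X = np.column_stack([np.ones_like(x), x])
n, p = X.shape
h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)
beta = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta
ms_res = e @ e / (n - p)
r = e / np.sqrt(ms_res * (1 - h))               # studentized residuals

D = r**2 * h / (p * (1 - h))                    # Cook's distance
```

Ranking the Di (or comparing against the Di > 1 rule of thumb) points at the perturbed observation.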
6.4 Measures of Influence: DFFITS and DFBETAS
• DFBETAS measures how much the jth regression coefficient changes, in standard deviation units, if the ith observation is removed:

DFBETASj,i = (β̂j − β̂j(i)) / √(S(i)² Cjj)

where Cjj is the jth diagonal element of (X′X)⁻¹ and β̂j(i) is the estimate of the jth coefficient when the ith observation is removed.
• A large |DFBETASj,i| indicates that the ith observation has considerable influence on the jth coefficient. In general, |DFBETASj,i| > 2/√n is taken as the cutoff.
6.4 Measures of Influence: DFFITS and DFBETAS
DFFITS measures the influence of the ith observation on the fitted value, again in standard deviation units:

DFFITSi = (ŷi − ŷ(i)) / √(S(i)² hii)

Cutoff: if |DFFITSi| > 2√(p/n), the point is most likely influential.
6.4 Measures of Influence: DFFITS and DFBETAS
Equivalencies
• See the computational equivalents of both DFBETAS
and DFFITS (page 223). You will see that they are both
functions of R-student and hii.
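Those computational equivalents make both diagnostics cheap: DFFITS comes from R-student and hii, and all DFBETAS come from the rows of (X′X)⁻¹X′. A sketch on hypothetical data:

```python
import numpy as np

# Hypothetical single-regressor data.
rng = np.random.default_rng(8)
x = np.linspace(0, 8, 22)
y = 0.5 + 1.2 * x + rng.normal(0, 0.3, size=22)

X = np.column_stack([np.ones_like(x), x])
n, p = X.shape
XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
beta = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta
ms_res = e @ e / (n - p)
s2_i = ((n - p) * ms_res - e**2 / (1 - h)) / (n - p - 1)
t = e / np.sqrt(s2_i * (1 - h))                 # R-student

dffits = t * np.sqrt(h / (1 - h))               # DFFITS from R-student and h_ii

# DFBETAS_{j,i} = R_{j,i} e_i / (1 - h_ii) / sqrt(S^2_(i) C_jj),
# where R = (X'X)^{-1} X' and C_jj = diag((X'X)^{-1}).
Rmat = XtX_inv @ X.T
C = np.diag(XtX_inv)
dfbetas = (Rmat * (e / (1 - h))) / np.sqrt(np.outer(C, s2_i))
```

The resulting values match brute-force deletion (refitting without observation i) exactly.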
6.5 A Measure of Model Performance
• Information about the overall precision of estimation can be obtained through another statistic, COVRATIOi:

COVRATIOi = det[S(i)² (X(i)′X(i))⁻¹] / det[MSRes (X′X)⁻¹]
6.5 A Measure of Model Performance
Cutoffs and Interpretation
• If COVRATIOi > 1, the ith observation improves the precision of estimation.
• If COVRATIOi < 1, the ith observation can degrade the precision.
• Cutoffs: COVRATIOi > 1 + 3p/n or COVRATIOi < 1 − 3p/n (the lower limit is only useful if n > 3p).
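COVRATIO can be computed by direct deletion, one refit per observation. A sketch on hypothetical data, with the rough cutoffs above:

```python
import numpy as np

# Hypothetical single-regressor data.
rng = np.random.default_rng(6)
x = np.linspace(0, 6, 18)
y = 2.0 + 0.5 * x + rng.normal(0, 0.25, size=18)

X = np.column_stack([np.ones_like(x), x])
n, p = X.shape
beta = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta
ms_res = e @ e / (n - p)
denom = np.linalg.det(ms_res * np.linalg.inv(X.T @ X))

covratio = np.empty(n)
for i in range(n):                               # delete point i and refit
    mask = np.arange(n) != i
    Xi, yi = X[mask], y[mask]
    bi = np.linalg.solve(Xi.T @ Xi, Xi.T @ yi)
    s2 = np.sum((yi - Xi @ bi)**2) / (n - 1 - p)  # S^2_(i)
    covratio[i] = np.linalg.det(s2 * np.linalg.inv(Xi.T @ Xi)) / denom

cut_hi, cut_lo = 1 + 3 * p / n, 1 - 3 * p / n    # rough cutoffs (n > 3p assumed)
```

A closed form avoids the loop: COVRATIOi = (S(i)²/MSRes)ᵖ / (1 − hii), which follows from the determinant update for a rank-one deletion.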
6.6 Detecting Groups of Influential Observations
• The previous diagnostics were “single-observation” diagnostics.
• It is possible that a group of points has high leverage or exerts undue influence on the regression model.
• Multiple-observation deletion diagnostics can be implemented.
6.6 Detecting Groups of Influential Observations
• Cook’s D can be extended to incorporate multiple observations:

Di = (β̂(i) − β̂)′X′X(β̂(i) − β̂) / (p·MSRes)

where i denotes the m × 1 vector of indices specifying the points to be deleted.
• Large values of Di indicate that the set of m points is influential.
6.7 Treatment of Influential Observations
Should an influential point be discarded?
Yes, if:
– there is an error in recording a measured value;
– the sample point is invalid; or,
– the observation is not part of the population that was intended to be
sampled.
No, if:
– the influential point is a valid observation.
6.7 Treatment of Influential Observations
Robust estimation techniques
– These techniques offer an alternative to deleting an influential
observation.
– Observations are retained but downweighted in proportion to
residual magnitude or influence.
Chapter 7
Polynomial Regression Models
7.1 Introduction
A second-order polynomial in one variable:

y = β0 + β1x + β2x² + ε

A second-order polynomial in two variables:

y = β0 + β1x1 + β2x2 + β11x1² + β22x2² + β12x1x2 + ε
7.2 Polynomial Models in One Variable
• One-variable form, again:

y = β0 + β1x + β2x² + ε

• If we let x1 = x and x2 = x², we have the same type of model as in previous chapters – standard linear regression analysis applies.
• The expectation of y for a one-variable second-order polynomial model is

E(y) = β0 + β1x + β2x²
7.2 Polynomial Models in One Variable
Cautions in fitting a polynomial in one-variable:
1. Keep the order of the model as low as possible.
– This is especially true if you are using the model as a predictor.
– Transformations are often preferred over higher-order models.
– Parsimony is a good thing: try to fit the data using the simplest model possible.
– Remember: You can always fit an n – 1 order model to a set of
data with n points, but this is undesirable.
2. Model Building Strategy
– One approach is fitting the lowest order polynomial possible and build up
(forward selection).
– Second approach is fitting the highest order polynomial of interest, and
removing terms (backward elimination).
– In general, you may not get the same result from the two approaches. You
should always try to fit the lowest-order model possible.
3. Extrapolation
– Can be dangerous when the model is a higher-order polynomial. The nature of the true underlying relationship may change, or be completely different from the system that produced the data used to fit the model.
4. Ill-conditioning I
– Ill-conditioning refers to the fact that as the order of the model increases, the X′X matrix inversion becomes inaccurate, introducing error into the parameter estimates.
– As the order of the model increases, multicollinearity increases.
– Centering the variables first may remove some ill-conditioning, but not all.
5. Ill-conditioning II
– Narrow ranges on the x variables can result in significant ill-conditioning and multicollinearity problems.
6. Hierarchy
– A hierarchical model of order n contains all terms of order n and below:

y = β0 + β1x + β2x² + ⋯ + βn−1x^(n−1) + βnx^n + ε

– Two schools of thought: 1) maintain hierarchy; 2) maintaining hierarchy is not important.
– What to do? Fit the model with only the significant terms, and use knowledge and understanding of the process to determine whether a hierarchical model is necessary (if you do not have one).
Centering
– Sometimes, centering the regressor variables can minimize or
eliminate at least some of the ill-conditioning that may be present
in a polynomial model.
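The mechanism is easy to see numerically: on a narrow range far from zero, x and x² are almost perfectly correlated, and centering breaks that correlation. A minimal sketch with hypothetical x-values:

```python
import numpy as np

# Hypothetical regressor on a narrow range far from zero.
x = np.linspace(10, 12, 21)

# Correlation between the linear and quadratic columns before centering:
corr_raw = np.corrcoef(x, x**2)[0, 1]           # very close to 1

# After centering, the quadratic column is nearly orthogonal to the linear one
# (exactly so here, because the x-values are symmetric about their mean).
xc = x - x.mean()
corr_cent = np.corrcoef(xc, xc**2)[0, 1]
```

This is exactly what happens in the hardwood example below: the VIFs of 17.1 collapse to 1.1 once x is centered.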
Centering
Consider the hardwood data in Example 7.1. The regression analysis is provided below:

The regression equation is
y = -6.67 + 11.8 x - 0.635 x2

Predictor    Coef       SE Coef    T        P       VIF
Constant    -6.674      3.400     -1.96     0.067
x           11.764      1.003     11.73     0.000   17.1
x2          -0.63455    0.06179  -10.27     0.000   17.1

S = 4.420    R-Sq = 90.9%

Analysis of Variance
Source           DF    SS        MS        F       P
Regression        2    3104.2    1552.1    79.43   0.000
Residual Error   16     312.6      19.5
Total            18    3416.9

Note that the variance inflation factors indicate that multicollinearity may be a problem.
Now, center the data using the mean of the regressor variable (x̄ = 7.2632). The new data are:

xi − 7.2632    (xi − 7.2632)²    y
 -6.2632         39.2277          6.3
 -5.7632         33.2145         11.1
 -5.2632         27.7013         20.0
 -4.2632         18.1749         24.0
 -3.2632         10.6485         26.1
 -2.7632          7.6353         30.0
 -2.2632          5.1221         33.8
 -1.7632          3.1089         34.0
 -1.2632          1.5957         38.1
 -0.7632          0.5825         39.9
 -0.2632          0.0693         42.0
  0.7368          0.5429         46.1
  1.7368          3.0165         53.1
  2.7368          7.4901         52.0
  3.7368         13.9637         52.5
  4.7368         22.4373         48.0
  5.7368         32.9109         42.8
  6.7368         45.3845         27.8
  7.7368         59.8581         21.9
Now, a new model is fit:

y = β0 + β1(x − 7.2632) + β2(x − 7.2632)² + ε

The regression equation is
y = 45.3 + 2.55 xcent - 0.635 x2cent

Predictor    Coef       SE Coef    T        P       VIF
Constant    45.295      1.483     30.55     0.000
xcent        2.5463     0.2538    10.03     0.000   1.1
x2cent      -0.63455    0.06179  -10.27     0.000   1.1

S = 4.420    R-Sq = 90.9%

Analysis of Variance
Source           DF    SS        MS        F       P
Regression        2    3104.2    1552.1    79.43   0.000
Residual Error   16     312.6      19.5
Total            18    3416.9

Note that the VIFs drop from 17.1 to 1.1 after centering, while the fit (S, R-Sq, ANOVA) is unchanged.
7.2.2 Piecewise Polynomial Fitting (Splines)
This is a technique that can be used when a particular function behaves differently over different ranges of x. Generally, divide the range of x into “homogeneous” segments and fit an appropriate function in each segment.
Splines:
a. Splines are piecewise polynomials of order k.
b. Splines have knots – the points at which the segments are joined.
Too many knots can result in “overfitting” and will not necessarily
provide more insight into the system.
c. Usually, a cubic spline is sufficient – polynomial of order 3.
Cubic Spline with continuous first and second derivatives.
Say there are h knots, t1 < t2 < … < th. The cubic spline is given by

E(y) = S(x) = Σ_{j=0..3} β0j x^j + Σ_{i=1..h} βi (x − ti)³₊

with

(x − ti)₊ = x − ti if x > ti, and 0 if x ≤ ti.

Think of (x − ti)₊ as an “indicator variable” – that is, “on” or “off.”

• If the continuity restrictions are not appropriate, then the general spline is

E(y) = S(x) = Σ_{j=0..3} β0j x^j + Σ_{i=1..h} Σ_{j=0..3} βij (x − ti)^j₊

To illustrate, consider the data in Example 7.2 – the voltage drop data. First, look at the plot of the data (y against time x): the behavior of y clearly changes over the range of x.

• If we attempt to fit a standard quadratic model to this data, we would obtain the following Minitab output:

The regression equation is
y = 5.27 + 1.49 x - 0.0652 x2

Predictor    Coef        SE Coef     T        P       VIF
Constant     5.2657      0.4807     10.95     0.000
x            1.4872      0.1112     13.37     0.000   15.3
x2          -0.065198    0.005375  -12.13     0.000   15.3

S = 1.076    R-Sq = 83.2%    R-Sq(adj) = 82.4%

Analysis of Variance
Source           DF    SS        MS        F       P
Regression        2    218.66    109.33    94.35   0.000
Residual Error   38     44.03      1.16
Total            40    262.69

• This looks as though it may be a good fit, but examine the residual plots: the plot of residuals versus fitted values shows a clear pattern. Obviously, something is missing. (Note: even if you include the x3 term, the residual plots are not acceptable.)

Example 7.2 (continued)
• A cubic spline is now investigated. Based on the plot of the original data and knowledge of the process, two knots are chosen.
It appears that voltage behaves differently between time 0 and 6.5 seconds than it does between 6.5 and 13 seconds, and differently yet again after 13 seconds. Therefore, h = 2 knots are chosen: t1 = 6.5 and t2 = 13.

Example 7.2 (continued)
• The cubic spline model is

y = β00 + β01 x + β02 x² + β03 x³ + β1 (x − 6.5)³₊ + β2 (x − 13)³₊ + ε

• Putting the original data in Minitab and adding four new columns (one for each term beyond x), we obtain the following results from the regression analysis:

y = 8.47 - 1.45 x + 0.490 x2 - 0.0295 x3 + 0.0247 x65 + 0.0271 x13

Predictor    Coef         SE Coef     T        P
Constant     8.4657       0.2005     42.22     0.000
x           -1.4531       0.1816     -8.00     0.000
x2           0.48989      0.04302    11.39     0.000
x3          -0.029467     0.002848  -10.35     0.000
x65          0.024706     0.004039    6.12     0.000
x13          0.027112     0.003578    7.58     0.000

Chapter 9
Multicollinearity

9.1 Introduction
• Multicollinearity is a problem that plagues many regression models. It impacts the estimates of the individual regression coefficients.
• Uses of regression:
1. Identifying the relative effects of the regressor variables
2. Prediction and/or estimation, and
3. Selection of an appropriate set of variables for the model.
• If all regressors are orthogonal, then multicollinearity is not a problem. This is a rare situation in regression analysis.
• More often than not, there are near-linear dependencies among the regressors such that

Σ_{j=1..p} tj Xj = 0

is approximately true. If this sum holds exactly for a subset of regressors, then (X′X)⁻¹ does not exist.
9.2 Sources of Multicollinearity
Four primary sources:
1. The data collection method employed
2. Constraints on the model or in the population
3. Model specification
4. An overdefined model

Data collection method employed
– Occurs when only a subsample of the entire sample space has been selected. (Soft drink delivery: number of cases and distance tend to be correlated; we may have data where only small numbers of cases are paired with short distances and large numbers of cases with longer distances.) We may be able to reduce this multicollinearity through the sampling technique used – there is no physical reason why you can’t sample in the rest of the space.

Constraints on the model or in the population
– (Electricity consumption: two variables, x1 – family income and x2 – house size.) When physical constraints are present, multicollinearity will exist regardless of the collection method.

Model specification
– Polynomial terms can cause ill-conditioning in the X′X matrix. This is especially true if the range of a regressor variable x is small.

Overdefined model
– More regressor variables than observations. The best way to counter this is to remove regressor variables. Recommendations: 1) redefine the model using a smaller set of regressors; 2) do preliminary studies using subsets of the regressors; or 3) use principal-components-type regression methods to remove regressors.
9.3 Effects of Multicollinearity
• Strong multicollinearity can result in large variances and covariances for the least-squares estimates of the coefficients. Recall from Chapter 3 that C = (X′X)⁻¹ (in correlation form) and

Cjj = 1 / (1 − Rj²)

Strong multicollinearity between xj and any other regressor will cause Rj² to be large, and thus Cjj to be large. In other words, the variance of the least-squares estimate of the coefficient will be very large.
• Strong multicollinearity can also produce least-squares estimates of the coefficients that are too large in absolute value. The squared distance between the least-squares estimate and the true parameter is

L1² = (β̂ − β)′(β̂ − β), with E(L1²) = σ² Tr[(X′X)⁻¹]

9.4 Multicollinearity Diagnostics
• Ideal characteristics of a multicollinearity diagnostic:
1. The procedure should correctly indicate whether multicollinearity is present; and,
2. The procedure should provide some insight as to which regressors are causing the problem.

9.4.1 Examination of the Correlation Matrix
• If we center and scale the regressors, X′X becomes the correlation matrix; its off-diagonal elements are the pairwise correlations rij between regressors xi and xj.
• If |rij| is close to unity, there may be an indication of multicollinearity. But the opposite does not always hold.
That is, there may be instances when multicollinearity is present but the pairwise correlations do not indicate a problem. This can happen when more than two variables are involved: for example, the correlation matrix fails to identify the multicollinearity problem in the Mason, Gunst & Webster data in Table 9.4, page 304.

9.4.2 Variance Inflation Factors
• As discussed in Chapter 3, variance inflation factors are very useful in determining if multicollinearity is present:

VIFj = Cjj = (1 − Rj²)⁻¹

• VIFs > 5 to 10 are considered significant. The regressors that have high VIFs probably have poorly estimated regression coefficients.
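VIFj is computed exactly as the definition says: regress xj on the other regressors, take the resulting Rj², and form 1/(1 − Rj²). A sketch on hypothetical data where x1 and x2 are deliberately collinear:

```python
import numpy as np

# Hypothetical regressors: x2 is nearly a multiple of x1; x3 is independent.
rng = np.random.default_rng(9)
n = 60
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
Xr = np.column_stack([x1, x2, x3])

def vif(Xr, j):
    """VIF_j = 1 / (1 - R_j^2) from regressing x_j on the other regressors."""
    others = np.delete(Xr, j, axis=1)
    A = np.column_stack([np.ones(len(Xr)), others])
    b, *_ = np.linalg.lstsq(A, Xr[:, j], rcond=None)
    resid = Xr[:, j] - A @ b
    r2 = 1 - resid @ resid / np.sum((Xr[:, j] - Xr[:, j].mean())**2)
    return 1 / (1 - r2)

vifs = [vif(Xr, j) for j in range(3)]
```

Here the two collinear regressors get large VIFs (well past the 5-to-10 rule of thumb) while the independent one stays near 1.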
9.4.2 Variance Inflation Factors
VIFs: A Second Look and Interpretation
• The length of the normal-theory confidence interval on the jth regression coefficient can be written as

Lj = 2 (Cjj σ̂²)^(1/2) t_{α/2, n−p−1}
• The length of the corresponding normal-theory confidence interval based on a design with orthogonal regressors (with the same sample size and same root-mean-square values) is

L* = 2 σ̂ t_{α/2, n−p−1}
• Take the ratio of these two: Lj / L* = Cjj^(1/2). That is, the square root of the jth VIF gives us a measure of how much longer the confidence interval for the jth regression coefficient is because of multicollinearity.
• For example, say VIF3 = 10. Then √VIF3 ≈ 3.3. This tells us that the confidence interval is 3.3 times longer than if the regressors had been orthogonal (the best-case scenario).
9.4.3 Eigensystem Analysis of X′X
• The eigenvalues of X′X (denoted λ1, λ2, …, λp) can be used to measure multicollinearity. Small eigenvalues are indications of multicollinearity.
• The condition number of X′X is

κ = λmax / λmin

This number measures the spread in the eigenvalues:
– κ < 100: no serious problem
– 100 ≤ κ ≤ 1000: moderate to strong multicollinearity
– κ > 1000: severe multicollinearity.
• A large condition number indicates that multicollinearity exists; it does not tell us how many regressors are involved.
• The condition indices of X′X are

κj = λmax / λj,  j = 1, 2, …, p

• The number of condition indices that are large (greater than 1000) provides a measure of the number of near-linear dependencies in X′X.
• In SAS PROC REG, the MODEL-statement option COLLIN will output the eigenvalues, condition indices, etc.
9.5 Methods for Dealing with Multicollinearity
• Collect more data
• Respecify the model
• Ridge regression and related techniques (PC regression, LASSO, etc.)
9.5 Methods for Dealing with Multicollinearity
• Least squares estimation gives an unbiased estimate, E(β̂) = β, with minimum variance – but this variance may still be very large, resulting in unstable estimates of the coefficients.
– Alternative: find an estimate that is biased but has smaller variance than the unbiased estimator.
9.5 Methods for Dealing with Multicollinearity
Ridge Estimator β̂_R:
β̂_R = (X'X + kI)^{-1} X'y
    = (X'X + kI)^{-1} X'X β̂
    = Z_k β̂
k is a “biasing parameter,” usually between 0 and 1.
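A minimal sketch of the ridge computation for two centered and scaled regressors, using the closed-form 2×2 inverse (illustrative numbers, not from the text); k = 0 recovers ordinary least squares:

```python
def ridge_2x2(xtx, xty, k):
    """Solve (X'X + kI) beta = X'y for a 2x2 system: beta_R = (X'X + kI)^-1 X'y."""
    a = xtx[0][0] + k
    b = xtx[0][1]
    c = xtx[1][0]
    d = xtx[1][1] + k
    det = a * d - b * c
    g1, g2 = xty
    return ((d * g1 - b * g2) / det, (a * g2 - c * g1) / det)

xtx = [[1.0, 0.95], [0.95, 1.0]]   # correlation-form X'X, highly collinear pair
xty = [0.9, 0.9]
beta_ols = ridge_2x2(xtx, xty, 0.0)    # k = 0: unbiased least squares
beta_ridge = ridge_2x2(xtx, xty, 0.1)  # small k: biased but shrunk and more stable
```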
9.5 Methods for Dealing with Multicollinearity
The effect of k on the MSE
Recall: MSE(β̂*) = Var(β̂*) + (bias)²
Now:
MSE(β̂_R) = Var(β̂_R) + (bias)²
          = σ² Σ_j λ_j / (λ_j + k)² + k² β'(X'X + kI)^{-2} β
As k increases, the variance decreases and the bias increases.
Choose k such that the reduction in variance exceeds the increase in bias.
The residual sum of squares is SS_Res = (y − Xβ̂_R)'(y − Xβ̂_R).
9.5 Methods for Dealing with Multicollinearity
• Ridge Trace
– A plot of the coefficient estimates β̂_R against k. If multicollinearity is severe, the ridge trace will show it. Choose k such that β̂_R is stable, and hope the resulting MSE is acceptable.
– Ridge regression is a good alternative if the model user wants to keep all regressors in the model.
• Methods for choosing k
• Relationship to other estimators
• Ridge regression and variable selection
• Generalized ridge regression (a procedure with a biasing parameter for each regressor)
Generalized Regression Techniques
9.5.4 Principal-Component Regression
The eigenvalues suggest that a model based on 4 or 5 of the PCs may be adequate.
Models D and E are
pretty similar
Chapter 10
Variable Selection and
Model Building
10.1 Introduction
In this chapter, we will cover some variable selection
techniques. Keep in mind the following:
1. None of the variable selection techniques can guarantee the best
regression equation for the dataset of interest.
2. The techniques may very well give different results.
3. Complete reliance on the algorithm's output is to be avoided. Other valuable information, such as experience with and knowledge of the data and the problem, should also be used.
10.1.1 Model-Building Problem
Two “conflicting” goals in regression model building:
1. Want as many regressors as possible so that the “information
content” in the variables will influence ŷ
2. Want as few regressors as necessary because the variance of ŷ
will increase as the number of regressors increases. (Also,
more regressors can cost more money in data
collection/model maintenance)
A compromise between the two hopefully leads to the best
regression equation.
10.1.2 Consequences of Model Misspecification
Say there are K regressor variables under investigation in a
problem. Then
y = Xβ + ε
where X can be partitioned into two submatrices:
1) a matrix containing the intercept and the p – 1 regressors that
are significant (to be retained in the model) – denoted Xp ;
and,
2) a matrix containing the remaining r regressors that are not
significant and should be deleted from the model – denoted
Xr.
Note that K + 1 = p + r. Our model is rewritten as
y = X_p β_p + X_r β_r + ε
For the full model:
1) β̂* = (X'X)^{-1} X'y; β̂* consists of two parts, β̂*_p and β̂*_r
2) σ̂*² = [y'y − β̂*'X'y] / (n − K − 1) = y'[I − X(X'X)^{-1}X']y / (n − K − 1)
3) Fitted values are ŷ*_i
For the subset model:
1) β̂_p = (X_p'X_p)^{-1} X_p'y
2) σ̂² = [y'y − β̂_p'X_p'y] / (n − p) = y'[I − X_p(X_p'X_p)^{-1}X_p']y / (n − p)
3) Fitted values are ŷ_i
What is the difference between β̂*_p and β̂_p?
10.1.2 Consequences of Model Misspecification
Some properties of the estimates β̂_p and σ̂² are:
1. E(β̂_p) = β_p + Aβ_r, where A = (X_p'X_p)^{-1}X_p'X_r. So β̂_p is a biased estimate of β_p unless the regression coefficients of the insignificant (deleted) variables are zero or the retained and deleted variables are orthogonal (X_p'X_r = 0).
2. Var(β̂_p) = σ²(X_p'X_p)^{-1} and Var(β̂*) = σ²(X'X)^{-1}. Var(β̂*_p) − Var(β̂_p) is a matrix such that all variances of regression coefficients in the full model are greater than or equal to the variances of the corresponding coefficients in the subset model. In other words, deleting unnecessary variables will not increase the variance of the remaining coefficients.
10.1.2 Consequences of Model Misspecification
Some properties of the estimates β̂_p and σ̂² are:
3. MSE(β̂_p) < MSE(β̂*_p) when each deleted coefficient in β_r is smaller in magnitude than the standard error of its full-model estimate. In a nutshell, the MSE for the subset model is better (smaller) than the MSE for the same coefficients when the full model is employed – if the deleted variables really are insignificant.
4. For the subset model,
E(σ̂²) = σ² + β_r'X_r'[I − X_p(X_p'X_p)^{-1}X_p']X_r β_r / (n − p)
that is, for this model σ̂² is a biased-upward estimate of σ².
5. Prediction: from the full model ŷ* = x'β̂*, and from the subset model ŷ = x_p'β̂_p; then Var(ŷ*) ≥ MSE(ŷ).

10.1.2 Consequences of Model Misspecification
The summary of the five statements is:
• Deleting variables improves the precision of the parameter estimates of the retained variables.
• Deleting variables improves the precision of the variance of the predicted response.
• Deleting variables can induce bias into the estimates of the coefficients and the variance of the predicted response. (But if the deleted variables are “insignificant,” the MSE of the biased estimates will be less than the variance of the unbiased estimates.)
• Retaining insignificant variables can increase the variance of the parameter estimates and the variance of the predicted response.

10.1.3 Criteria for Evaluating Subset Regression Models
Coefficient of Multiple Determination
• Say we are investigating a model with p terms,
R²_p = SS_R(p)/SS_T = 1 − SS_Res(p)/SS_T
• Models with large values of R²_p are preferred, but adding terms will always increase this value.

10.1.3 Criteria for Evaluating Subset Regression Models
Adjusted R²
• Say we are investigating a model with p terms,
R²_adj,p = 1 − [(n − 1)/(n − p)](1 − R²_p)
• This value will not necessarily increase as additional terms are introduced into the model. We want a model with the maximum adjusted R².

10.1.3 Criteria for Evaluating Subset Regression Models
Residual Mean Square
• The MS_Res for a subset regression model is
MS_Res(p) = SS_Res(p)/(n − p)
• MS_Res(p) need not decrease as p increases; it increases when the reduction in SS_Res(p) from adding a regressor to the model is not sufficient to compensate for the loss of one residual degree of freedom. We want a model with a minimum MS_Res(p).

10.1.3 Criteria for Evaluating Subset Regression Models
Mallows' Cp Statistic
• This criterion is related to the MSE of the fitted value, that is,
E[ŷ_i − E(y_i)]² = [E(y_i) − E(ŷ_i)]² + Var(ŷ_i)
where [E(y_i) − E(ŷ_i)]² is the squared bias. The total squared bias for a p-term model is
SS_B(p) = Σ_{i=1}^{n} [E(y_i) − E(ŷ_i)]²
• The standardized total squared error is
Γ_p = (1/σ²){ Σ_{i=1}^{n} [E(y_i) − E(ŷ_i)]² + Σ_{i=1}^{n} Var(ŷ_i) }
    = SS_B(p)/σ² + (1/σ²) Σ_{i=1}^{n} Var(ŷ_i)
• Making some appropriate substitutions, we can find the estimate of Γ_p, denoted Cp:
Cp = SS_Res(p)/σ̂² − n + 2p
• It can be shown that if the bias is zero, the expected value of Cp is
E[Cp | Bias = 0] = (n − p)σ²/σ² − n + 2p = p
Notes:
1. Cp is a measure of the variance in the fitted values plus the (bias)². (Large bias can result from important variables being left out of the model.)
2. If Cp >> p, there is significant bias.
3. Small Cp values are desirable.
4. Beware of negative values of Cp. These can result because the MSE for the full model overestimates the true σ².
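The four criteria above can be computed together from a few summary quantities. A small helper (a sketch; σ̂² would normally be MS_Res from the full K-term model):

```python
def subset_criteria(ss_res_p, ss_t, n, p, sigma2_hat):
    """R^2, adjusted R^2, MS_Res, and Mallows' Cp for a p-term subset model."""
    r2 = 1.0 - ss_res_p / ss_t
    r2_adj = 1.0 - (n - 1) / (n - p) * (1.0 - r2)
    ms_res = ss_res_p / (n - p)
    cp = ss_res_p / sigma2_hat - n + 2 * p
    return r2, r2_adj, ms_res, cp

# an unbiased p-term model has E[SS_Res(p)] = (n - p) * sigma^2, so Cp = p
r2, r2_adj, ms_res, cp = subset_criteria(34.0, 100.0, 20, 3, 2.0)
```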
10.1.3 Criteria for Evaluating Subset Regression Models
Uses of Regression and Model Evaluation Criteria

Regression equations may be used to make predictions. So,
minimizing the MSE for prediction may be an important
criterion. The PRESS statistic can be used for comparisons of
candidate models.
PRESS_p = Σ_{i=1}^{n} (y_i − ŷ_(i))² = Σ_{i=1}^{n} [e_i / (1 − h_ii)]²
We want models with small values of PRESS.
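For simple linear regression the hat diagonals have the closed form h_ii = 1/n + (x_i − x̄)²/S_xx, so PRESS can be computed without refitting the model n times. A sketch with made-up data:

```python
def press_simple(x, y):
    """PRESS for simple linear regression via e_(i) = e_i / (1 - h_ii)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    press = 0.0
    for xi, yi in zip(x, y):
        e = yi - (b0 + b1 * xi)               # ordinary residual
        h = 1.0 / n + (xi - xbar) ** 2 / sxx  # leverage h_ii
        press += (e / (1.0 - h)) ** 2         # squared PRESS residual
    return press
```

An exact linear relationship gives PRESS = 0; with any noise, PRESS is at least as large as SS_Res, since each e_i is divided by 1 − h_ii < 1.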
10.2 Computational Techniques for Variable Selection
10.2.1 All Possible Regressions

Assume the intercept term is in all equations considered. Then, if there are K regressors, we would investigate 2^K possible regression equations. Use the criteria above to determine some candidate models, then complete a regression analysis on each.
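The 2^K candidate models are easy to enumerate. The sketch below lists every regressor subset (intercept implicit), as would be done for the K = 4 Hald data:

```python
from itertools import combinations

def all_subsets(regressors):
    """All 2^K regressor subsets for all-possible-regressions (the empty
    tuple is the intercept-only model)."""
    subsets = []
    for size in range(len(regressors) + 1):
        subsets.extend(combinations(regressors, size))
    return subsets

models = all_subsets(["x1", "x2", "x3", "x4"])  # 2^4 = 16 equations
```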
Example 10.1 Hald Cement Data (Appendix Table B21)
10.2 Computational Techniques for Variable Selection
10.2.1 All Possible Regressions
Notes:
• Once some candidate models have been identified, run regression
analysis on each one individually and make comparisons (include
the PRESS statistic).
• A caution about the regression coefficients: if the estimates of a particular coefficient tend to “jump around,” this could be an indication of multicollinearity. (“Jumping around” is a technical term – for example, some estimates being positive and others negative.)
10.2 Computational Techniques for Variable Selection
10.2.2 Stepwise Regression Methods
Three types of stepwise regression methods:
1. Forward selection
2. Backward elimination
3. Stepwise regression — combination of forward and backward
10.2 Computational Techniques for Variable Selection
10.2.2 Stepwise Regression Methods
Forward Selection

Procedure is based on the idea that no variables are in the model originally, but
are added one at a time. The selection procedure is:
1. The first regressor selected to be entered into the model is the one with the
highest correlation with the response. If the F statistic corresponding to the
model containing this variable is significant (larger than some predetermined
value, Fin), then that regressor is left in the model.
2. The second regressor examined is the one with the largest partial
correlation with the response. If the F-statistic corresponding to the addition of
this variable is significant, the regressor is retained.
3. This process continues until all regressors are examined.
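The three steps above can be sketched as a loop; `partial_f` here is a hypothetical callback standing in for the partial F statistic of a candidate regressor given the current model (in practice it would refit the model each time):

```python
def forward_selection(candidates, partial_f, f_in):
    """Forward selection skeleton: repeatedly add the candidate with the
    largest partial F statistic, stopping when no candidate exceeds F-in.
    `partial_f(model, x)` is a hypothetical scoring callback."""
    model = []
    remaining = list(candidates)
    while remaining:
        best = max(remaining, key=lambda x: partial_f(model, x))
        if partial_f(model, best) < f_in:
            break  # no remaining regressor meets the F-in threshold
        model.append(best)
        remaining.remove(best)
    return model

# toy scores: only x1 and x2 clear an F-in of 4.0
scores = {"x1": 10.0, "x2": 5.0, "x3": 1.0}
chosen = forward_selection(["x1", "x2", "x3"], lambda m, x: scores[x], 4.0)
```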
10.2 Computational Techniques for Variable Selection
10.2.2 Stepwise Regression Methods
Backward Elimination
Procedure is based on the idea that all variables are in the model originally, examined
one at a time and removed if not significant.
1. The partial F statistic is calculated for each variable as if it were the last one
added to the model. The regressor with the smallest F statistic is examined
first and will be removed if this value is less than some predetermined value
Fout.
2. If this regressor is removed, then the model is refit with the remaining
regressor variables and the partial F statistics calculated again. The regressor
with the smallest partial F statistic will be removed if that value is less than
Fout.
3. The process continues until all regressors are examined.
10.2 Computational Techniques for Variable Selection
10.2.2 Stepwise Regression Methods
Stepwise Regression
This procedure is a modification of forward selection.
1. The contribution of each regressor variable that is put into the model is reassessed by way of its partial F statistic.
2. A regressor that makes it into the model may later be removed if it is found to be insignificant once other variables have been added. If its partial F statistic is less than F_out, the variable is removed.
3. Stepwise regression requires both an F_in value and an F_out value.
10.2 Computational Techniques for Variable Selection
10.2.2 Stepwise Regression Methods
Cautions:
• No one model may be the “best”
• The three stepwise techniques could result in different models
• Inexperienced analysts may use the final model simply because the procedure spit it out
Please look over the discussion on “Stopping Rules for Stepwise Procedures” on page 283.
Strategy for Regression Model Building
Chapter 13
Generalized Linear Models
Generalized Linear Models
• Traditional applications of linear models, such as DOX and
multiple linear regression, assume that the response variable
is
– Normally distributed
– Constant variance
– Independent
• There are many situations where these assumptions are
inappropriate
– The response is either binary (0,1), or a count
– The response is continuous, but nonnormal
Some Approaches to These Problems
• Data transformation
– Induce approximate normality
– Stabilize variance
– Simplify model form
• Weighted least squares
– Often used to stabilize variance
• Generalized linear models (GLM)
– Approach is about 30 years old, unifies linear and nonlinear regression
models
– Response distribution is a member of the exponential family (normal,
exponential, gamma, binomial, Poisson)
Generalized Linear Models
• Original applications were in biopharmaceutical sciences
• Lots of recent interest in GLMs in industrial statistics
• GLMs are simple models; include linear regression and OLS as a
special case
• Parameter estimation is by maximum likelihood (assume that the
response distribution is known)
• Inference on parameters is based on large-sample or asymptotic
theory
• We will consider logistic regression, Poisson regression, then the
GLM
References
• Montgomery, D. C., Peck, E. A., and Vining, G. G. (2021), Introduction to
Linear Regression Analysis, 6th Edition, Wiley, New York (see Chapter 13)
• Myers, R. H., Montgomery, D. C., Vining, G. G. and Robinson, T.J. (2010),
Generalized Linear Models with Applications in Engineering and the Sciences,
2nd edition, Wiley, New York
• Hosmer, D. W. and Lemeshow, S. (2000), Applied Logistic Regression, 2nd
Edition, Wiley, New York
• Lewis, S. L., Montgomery, D. C., and Myers, R. H. (2001), “Confidence Interval
Coverage for Designed Experiments Analyzed with GLMs”, Journal of Quality
Technology 33, pp. 279-292
• Lewis, S. L., Montgomery, D. C., and Myers, R. H. (2001), “Examples of
Designed Experiments with Nonnormal Responses”, Journal of Quality
Technology 33, pp. 265-278
• Myers, R. H. and Montgomery, D. C. (1997), “A Tutorial on Generalized Linear
Models”, Journal of Quality Technology 29, pp. 274-291
Binary Response Variables
• The outcome (or response, or endpoint) values 0, 1 can represent “success” and “failure”
• Occurs often in the biopharmaceutical field; dose-response
studies, bioassays, clinical trials
• Industrial applications include failure analysis, fatigue testing,
reliability testing
• For example, functional electrical testing on a semiconductor
can yield:
– “success” in which case the device works
– “failure” due to a short, an open, or some other failure mode
Binary Response Variables
• Possible model:
y_i = β_0 + Σ_{j=1}^{k} β_j x_ij + ε_i = x_i'β + ε_i,  i = 1, 2, …, n;  y_i = 0 or 1
• The response y_i is a Bernoulli random variable:
P(y_i = 1) = π_i, with 0 ≤ π_i ≤ 1
P(y_i = 0) = 1 − π_i
E(y_i) = μ_i = x_i'β = π_i
Var(y_i) = σ²_{y_i} = π_i(1 − π_i)
Problems With This Model
• The error terms take on only two values, so they can’t
possibly be normally distributed
• The variance of the observations is a function of the mean
(see previous slide)
• A linear response function could result in predicted values that fall outside the 0 to 1 range, and this is impossible because
0 ≤ E(y_i) = μ_i = x_i'β = π_i ≤ 1
Binary Response Variables – The Challenger Data

Temperature   At Least One      Temperature   At Least One
at Launch     O-ring Failure    at Launch     O-ring Failure
53            1                 70            1
56            1                 70            1
57            1                 72            0
63            0                 73            0
66            0                 75            0
67            0                 75            1
67            0                 76            0
67            0                 76            0
68            0                 78            0
69            0                 79            0
70            0                 80            0
70            1                 81            0

Data for space shuttle launches and static tests prior to the launch of Challenger.
[Scatter plot: O-ring failure (0 or 1) versus temperature at launch, 50 to 80 °F.]
Binary Response Variables
• There is a lot of empirical evidence that the response
function should be nonlinear; an “S” shape is quite logical
• See the scatter plot of the Challenger data
• The logistic response function is a common choice:
E(y) = exp(x'β) / [1 + exp(x'β)] = 1 / [1 + exp(−x'β)]
The Logistic Response Function
• The logistic response function can be easily linearized. Let η = x'β and E(y) = π.
• Define
η = ln[π / (1 − π)]
• This is called the logit transformation
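The logit and the logistic response function are inverses of each other, which is what makes the linearization work. A two-line check (plain Python, not from the text):

```python
import math

def logit(pi):
    """eta = ln(pi / (1 - pi)), mapping (0, 1) onto the whole real line."""
    return math.log(pi / (1.0 - pi))

def logistic(eta):
    """pi = 1 / (1 + exp(-eta)), the inverse of the logit."""
    return 1.0 / (1.0 + math.exp(-eta))
```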
Logistic Regression Model
• Model:
y_i = E(y_i) + ε_i
where
E(y_i) = π_i = exp(x_i'β) / [1 + exp(x_i'β)]
• The model parameters are estimated by the method of maximum likelihood (MLE)
A Logistic Regression Model for the Challenger Data (Using Minitab)

Binary Logistic Regression: O-Ring Fail versus Temperature

Link Function: Logit

Response Information
Variable   Value   Count
O-Ring F   1           7   (Event)
           0          17
           Total      24

Logistic Regression Table
                                              Odds    95% CI
Predictor      Coef   SE Coef      Z      P  Ratio   Lower  Upper
Constant     10.875     5.703   1.91  0.057
Temperat   -0.17132   0.08344  -2.05  0.040   0.84    0.72   0.99

Log-Likelihood = -11.515
A Logistic Regression Model for the Challenger Data

Test that all slopes are zero: G = 5.944, DF = 1, P-Value = 0.015

Goodness-of-Fit Tests
Method             Chi-Square   DF      P
Pearson                14.049   15  0.522
Deviance               15.759   15  0.398
Hosmer-Lemeshow        11.834    8  0.159
ŷ = exp(10.875 − 0.17132x) / [1 + exp(10.875 − 0.17132x)]
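Plugging the Minitab coefficients into the fitted function shows how sharply the failure probability rises at low temperature (the 31 °F value is the extrapolation discussed on the next slide):

```python
import math

def p_fail(temp, b0=10.875, b1=-0.17132):
    """Fitted probability of at least one O-ring failure at a given
    launch temperature, using the Minitab coefficients above."""
    eta = b0 + b1 * temp
    return math.exp(eta) / (1.0 + math.exp(eta))
```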
Logistic Regression Model for Challenger Data
Note that the fitted function has been
extended down to 31 deg F, the
temperature at which Challenger
was launched
Maximum Likelihood Estimation in Logistic Regression
• The distribution of each observation y_i is
f_i(y_i) = π_i^{y_i} (1 − π_i)^{1−y_i},  i = 1, 2, …, n
• The likelihood function is
L(y, β) = Π_{i=1}^{n} f_i(y_i) = Π_{i=1}^{n} π_i^{y_i} (1 − π_i)^{1−y_i}
• We usually work with the log-likelihood:
ln L(y, β) = ln Π_{i=1}^{n} f_i(y_i) = Σ_{i=1}^{n} y_i ln[π_i / (1 − π_i)] + Σ_{i=1}^{n} ln(1 − π_i)
Maximum Likelihood Estimation in Logistic Regression
• The maximum likelihood estimators (MLEs) of the model parameters
are those values that maximize the likelihood (or log-likelihood)
function
• ML has been around since the first part of the previous century
• Often gives estimators that are intuitively pleasing
• MLEs have nice properties; unbiased (for large samples), minimum
variance (or nearly so), and they have an approximate normal
distribution when n is large
Maximum Likelihood Estimation in Logistic Regression
• If we have n_i trials at each observation, we can write the log-likelihood as
ln L(y, β) = y'Xβ − Σ_{i=1}^{n} n_i ln[1 + exp(x_i'β)]
• The derivative of the log-likelihood is
∂ ln L(y, β)/∂β = X'y − Σ_{i=1}^{n} [n_i exp(x_i'β) / (1 + exp(x_i'β))] x_i
                = X'y − Σ_{i=1}^{n} n_i π_i x_i
                = X'y − X'μ   (because μ_i = n_i π_i)
Maximum Likelihood Estimation in Logistic Regression
• Setting this last result to zero gives the maximum likelihood score equations
X'(y − μ) = 0
• These equations look easy to solve… we've actually seen them before in linear regression:
y = Xβ + ε = μ + ε
X'(y − μ) = 0 results from OLS or ML with normal errors.
Since μ = Xβ, X'(y − μ) = X'(y − Xβ) = 0, so
X'Xβ̂ = X'y, and β̂ = (X'X)^{-1}X'y (OLS, or the normal-theory MLE)
Maximum Likelihood Estimation in Logistic Regression
• Solving the ML score equations in logistic regression isn't quite as easy, because
μ_i = n_i / [1 + exp(−x_i'β)],  i = 1, 2, …, n
• Logistic regression is a nonlinear model
• It turns out that the solution is actually fairly easy, and is based on iteratively
reweighted least squares or IRLS (see Appendix for details)
• An iterative procedure is necessary because parameter estimates must be
updated from an initial “guess” through several steps
• Weights are necessary because the variance of the observations is not constant
• The weights are functions of the unknown parameters
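A minimal IRLS sketch for simple logistic regression on ungrouped 0/1 data (each Newton step solves the 2×2 weighted least-squares system in closed form; an illustration under simplifying assumptions, not the book's code):

```python
import math

def irls_logistic(x, y, steps=25):
    """Fit pi = 1/(1 + exp(-(b0 + b1*x))) by iteratively reweighted least
    squares: solve (X'WX) d = X'(y - pi) with W = diag(pi*(1 - pi)),
    then update beta <- beta + d."""
    b0 = b1 = 0.0
    for _ in range(steps):
        # clamp the linear predictor to avoid exp overflow in early steps
        eta = [max(-30.0, min(30.0, b0 + b1 * xi)) for xi in x]
        pi = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        w = [p * (1.0 - p) for p in pi]
        # X'WX entries for X = [1, x]
        s0 = sum(w)
        s1 = sum(wi * xi for wi, xi in zip(w, x))
        s2 = sum(wi * xi * xi for wi, xi in zip(w, x))
        # score vector X'(y - pi)
        g0 = sum(yi - pii for yi, pii in zip(y, pi))
        g1 = sum((yi - pii) * xi for yi, pii, xi in zip(y, pi, x))
        det = s0 * s2 - s1 * s1
        b0 += (s2 * g0 - s1 * g1) / det
        b1 += (s0 * g1 - s1 * g0) / det
    return b0, b1
```

Run on the Challenger data tabulated earlier, this should reproduce the Minitab fit (β̂0 ≈ 10.875, β̂1 ≈ −0.17132) in a handful of iterations.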
Interpretation of the Parameters in Logistic Regression
• The log-odds at x is
η̂(x) = ln[π̂(x) / (1 − π̂(x))] = β̂_0 + β̂_1 x
• The log-odds at x + 1 is
η̂(x + 1) = ln[π̂(x + 1) / (1 − π̂(x + 1))] = β̂_0 + β̂_1(x + 1)
• The difference in the log-odds is
η̂(x + 1) − η̂(x) = β̂_1
Interpretation of the Parameters in Logistic Regression
• The odds ratio is found by taking antilogs:
ÔR = Odds_{x+1} / Odds_x = e^{β̂_1}
• The odds ratio is interpreted as the estimated multiplicative change in the odds of “success” associated with a one-unit increase in the value of the predictor variable
Odds Ratio for the Challenger Data
ÔR = e^{−0.17132} = 0.84
This implies that every decrease of one degree in temperature increases the odds of O-ring failure by about 1/0.84 = 1.19, or 19 percent.
The temperature at the Challenger launch was 22 degrees below the lowest observed launch temperature, so now
ÔR = e^{22(−0.17132)} = 0.0231
This results in an increase in the odds of failure of 1/0.0231 = 43.34, or about 4200 percent!!
There's a big extrapolation here, but if you knew this prior to launch, what decision would you have made?
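The arithmetic on this slide is easy to verify (plain Python; the coefficient is the Minitab temperature estimate):

```python
import math

b1 = -0.17132                         # fitted temperature coefficient
odds_ratio = math.exp(b1)             # odds multiplier per +1 degree F: ~0.84
per_degree_drop = 1.0 / odds_ratio    # odds multiplier per -1 degree: ~1.19
or_22_down = 1.0 / math.exp(22 * b1)  # 22-degree-colder extrapolation: ~43.3
```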
Inference on the Model Parameters
Inference on the Model Parameters
See slide 15;
Minitab calls
this “G”.
Testing Goodness of Fit
Pearson chi-square goodness-of-fit statistic:
The Hosmer-Lemeshow goodness-of-fit statistic:
Refer to slide 15 for the Minitab output showing all three goodness-of-fit
statistics for the Challenger data
Likelihood Inference on the Model Parameters
• Deviance can also be used to test hypotheses about subsets of the model
parameters (analogous to the extra SS method)
• Procedure:
Partition the linear predictor as Xβ = X_1β_1 + X_2β_2, where the full model has p parameters and β_2 has r parameters.
This full model has deviance D(β).
H_0: β_2 = 0
H_1: β_2 ≠ 0
The reduced model is X_1β_1, with deviance D(β_1).
The difference in deviance between the full and reduced models is
D(β_2 | β_1) = D(β_1) − D(β), with r degrees of freedom
D(β_2 | β_1) has a chi-square distribution under H_0: β_2 = 0
Large values of D(β_2 | β_1) imply that H_0: β_2 = 0 should be rejected
Inference on the Model Parameters
• Tests on individual model coefficients can also be done using Wald inference.
• Uses the result that the MLEs have an approximate normal distribution, so the
distribution of
Z_0 = β̂_j / se(β̂_j)
is standard normal if the true value of the parameter is zero. Some computer
programs report the square of Z (which is chi-square), and others calculate the
P-value using the t distribution.
See slide 14 for the Wald test on the temperature parameter for the Challenger
data
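A Wald-test sketch using the standard normal reference distribution (via math.erf); the values below are the Challenger temperature term from the Minitab table:

```python
import math

def wald_test(coef, se):
    """Wald statistic Z0 = coef/se and its two-sided standard normal p-value."""
    z = coef / se
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))  # standard normal CDF
    return z, 2.0 * (1.0 - phi)

z0, p_value = wald_test(-0.17132, 0.08344)  # matches Z = -2.05, P = 0.040
```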
Another Logistic Regression Example: The Pneumoconiosis Data
• A 1959 article in Biometrics reported the data:
The fitted model:
Diagnostic Checking
Consider Fitting a More Complex Model
A More Complex Model
Is the expanded model useful? The Wald test on the term (Years)2 indicates that
the term is probably unnecessary.
Consider the difference in deviance between the full and reduced models:
D((Years)² | β_1) = D(β_1) − D(β), with 1 df (chi-square P-value = 0.0961)
Compare the P-values for the Wald and deviance tests
Other Models for Binary Response Data
Logit model
Probit model
Complementary log-log model
More than two categorical outcomes
Poisson Regression
• Consider now the case where the response is a count of some relatively
rare event:
– Defects in a unit of product
– Software bugs
– Particulate matter or some pollutant in the environment
– Number of Atlantic hurricanes
• We wish to model the relationship between the count response and one
or more regressor or predictor variables
• A logical model for the count response is the Poisson distribution
f(y) = e^{−μ} μ^y / y!,  y = 0, 1, …, with μ > 0
Poisson Regression
• Poisson regression is another case where the response variance is related to the mean; in fact, in the Poisson distribution
E(y) = μ and Var(y) = μ
• The Poisson regression model is
y_i = E(y_i) + ε_i = μ_i + ε_i,  i = 1, 2, …, n
• We assume that there is a function g that relates the mean of the response to a linear predictor
g(μ_i) = η_i = β_0 + β_1 x_i1 + … + β_k x_ik = x_i'β
Poisson Regression
• The function g is called a link function
• The relationship between the mean of the response distribution
and the linear predictor is
μi = g⁻¹(ηi) = g⁻¹(xi′β)
• Choice of the link function:
– Log link (natural for the Poisson, since it cannot produce negative predicted values)
g(μi) = ln(μi) = xi′β
μi = g⁻¹(xi′β) = e^(xi′β)
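Whatever the value of the linear predictor, the inverse log link returns a strictly positive mean. A small sketch with made-up coefficients for a single regressor:

```python
import math

# Hypothetical coefficients beta = (b0, b1) for one regressor
b0, b1 = -1.0, 0.5

def mean_response(x):
    # mu = g^{-1}(x'beta) = exp(b0 + b1 * x) under the log link
    return math.exp(b0 + b1 * x)

for x in (-10.0, 0.0, 10.0):
    print(x, mean_response(x))  # always positive, even for very negative eta
```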
Poisson Regression
• The usual form of the Poisson regression model is
yi = e^(xi′β) + εi,  i = 1, 2, …, n
• This is a special case of the GLM; Poisson response and a log link
• Parameter estimation in Poisson regression is essentially equivalent
to logistic regression; maximum likelihood, implemented by IRLS
• Wald (large-sample) and deviance (likelihood-based) inference are
carried out the same way as in the logistic regression model
An Example of Poisson Regression
• The aircraft damage data
• Response y = the number of locations where damage was inflicted
on the aircraft
• Regressors:
x1 = type of aircraft (0 = A-4, 1 = A-6)
x2 = bomb load (tons)
x3 = total months of crew experience
The table contains data from 30 strike missions
There is a lot of multicollinearity in this data; the A-6 has a two-man
crew and is capable of carrying a heavier bomb load than the A-4
All three regressors tend to increase monotonically
Based on the full model, we can remove x3
However, when x3 is removed, x1 (type of
aircraft) is no longer significant – this is not
shown, but easily verified
This is probably multicollinearity at work
Note the Type 1 and Type 3 analyses for
each variable
Note also that the P-values for the Wald
tests and the Type 3 analysis (based on
deviance) don’t agree
Let’s consider all of the subset regression models:
Deleting either x1 or x2 results in a two-variable model that is worse than the full model
Removing x3 gives a model equivalent to the full model, but as noted before, x1 is
insignificant
One of the single-variable models (x2) is equivalent to the full model
The one-variable model with x2 displays no lack of fit (deviance/df = 1.1791)
The prediction equation is
ŷ = e^(−1.6491 + 0.2282 x2)
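The fitted equation is easy to evaluate; a sketch predicting the mean number of damaged locations at a few illustrative x2 values (the inputs below are made up, not rows from the table):

```python
import math

def predicted_damage(x2):
    # yhat = exp(-1.6491 + 0.2282 * x2) from the one-variable model
    return math.exp(-1.6491 + 0.2282 * x2)

for x2 in (4.0, 7.0, 14.0):  # illustrative regressor values
    print(x2, predicted_damage(x2))
```

Because the coefficient on x2 is positive, the predicted damage count grows exponentially with x2.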
Another Example Involving
Poisson Regression
• The mine fracture data
• The response is a count of the number of fractures in the mine
• The regressors are:
x1 = inner burden thickness (feet)
x2 = percent extraction of the lower previously mined seam
x3 = lower seam height (feet)
x4 = time in years that mine has been open
The * indicates the best model of a specific subset size
Note that the addition of a term cannot increase the
deviance (reinforcing the analogy between deviance and
the “usual” residual sum of squares)
To compare the model with only x1, x2, and x4 to the
full model, evaluate the difference in deviance:
38.03 – 37.86 = 0.17
with 1 df. This is not significant.
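This comparison can be checked numerically. For 1 df the chi-square P-value is erfc(sqrt(d/2)), so the observed deviance difference of 0.17 gives a large P-value:

```python
import math

# Difference in deviance between the (x1, x2, x4) model and the full model
d_diff = 38.03 - 37.86  # = 0.17 with 1 df

# P(chi2_1 > d) = erfc(sqrt(d / 2))
p = math.erfc(math.sqrt(d_diff / 2.0))
print(p)  # well above 0.05, so dropping x3 costs nothing significant
```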
There is no indication of lack of fit: deviance/df = 0.9508
The final model is:
ŷ = e^(−3.721 − 0.0015 x1 + 0.0627 x2 − 0.0317 x4)
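Evaluating the fitted model shows how each regressor moves the predicted fracture count: extraction percentage (x2) raises it, while thicker inner burden (x1) and longer time open (x4) lower it. The input values below are illustrative, not rows from the data set:

```python
import math

def predicted_fractures(x1, x2, x4):
    # yhat = exp(-3.721 - 0.0015*x1 + 0.0627*x2 - 0.0317*x4)
    return math.exp(-3.721 - 0.0015 * x1 + 0.0627 * x2 - 0.0317 * x4)

# Illustrative values: thickness 100 ft, 80% extraction, open 5 years
print(predicted_fractures(100.0, 80.0, 5.0))
```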
The Generalized Linear Model
• Poisson and logistic regression are two special cases of the GLM:
– Binomial response with a logistic link
– Poisson response with a log link
• In the GLM, the response distribution must be a member of the exponential family:
f(yi, θi, φ) = exp{[yi θi − b(θi)] / a(φ) + h(yi, φ)}
φ = scale parameter
θi = natural location parameter(s)
• This includes the binomial, Poisson, normal, inverse normal, exponential, and gamma
distributions
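For the Poisson, the exponential-family form is obtained with θ = ln μ, b(θ) = e^θ, a(φ) = 1, and h(y, φ) = −ln y!. A sketch verifying the two parameterizations agree:

```python
import math

def poisson_pmf(y, mu):
    # Direct form: e^(-mu) * mu^y / y!
    return math.exp(-mu) * mu ** y / math.factorial(y)

def poisson_expfam(y, mu):
    # Exponential-family form: exp{y*theta - b(theta) + h(y)}
    # with theta = ln(mu), b(theta) = exp(theta), h(y) = -ln(y!)
    theta = math.log(mu)
    return math.exp(y * theta - math.exp(theta) - math.lgamma(y + 1))

for y in range(6):
    assert abs(poisson_pmf(y, 3.0) - poisson_expfam(y, 3.0)) < 1e-12
print("both forms agree")
```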
The Generalized Linear Model
• The relationship between the mean of the response
distribution and the linear predictor is determined by the link function:
μi = g⁻¹(ηi) = g⁻¹(xi′β)
• The canonical link is specified when
ηi = θi
• The canonical link depends on the choice of the response
distribution
• You do not have to use the canonical link; it just simplifies
some of the mathematics.
• In fact, the log (non-canonical) link is very often used with
the exponential and gamma distributions, especially when
the response variable is nonnegative.
• Other links can be based on the power family (as in power
family transformations) or the complementary log-log
function.
Parameter Estimation and Inference in the GLM
• Estimation is by maximum likelihood (and IRLS); for the
canonical link the score function is
X(y − ) = 0
• For the case of a non-canonical link,
X (y − ) = 0
 = diag (d i / di )
• Wald inference and deviance-based inference is conducted
just as in logistic and Poisson regression
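Solving the canonical-link score equations X′(y − μ) = 0 is a small Newton/IRLS iteration, since the expected information for the Poisson log link is X′WX with W = diag(μ). A self-contained sketch for a one-regressor Poisson model on toy data (the counts and x values are made up):

```python
import math

# Toy data (made up): counts y rising with the single regressor x
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.0, 3.0, 5.0, 7.0, 12.0]

# Start from the intercept-only fit: b0 = log(mean(y)), slope 0
b0, b1 = math.log(sum(y) / len(y)), 0.0

for _ in range(50):
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    # Score vector X'(y - mu)
    s0 = sum(yi - mi for yi, mi in zip(y, mu))
    s1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
    # Expected information X'WX with W = diag(mu) (log link is canonical)
    i00 = sum(mu)
    i01 = sum(xi * mi for xi, mi in zip(x, mu))
    i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
    det = i00 * i11 - i01 * i01
    # Newton/IRLS step: beta += inverse(information) @ score
    b0 += (i11 * s0 - i01 * s1) / det
    b1 += (i00 * s1 - i01 * s0) / det

print(b0, b1)  # at convergence the score X'(y - mu) is zero
```

In practice the same iteration is run by any GLM routine; the point of the sketch is that the canonical link makes both the score and the information depend on the data only through X and μ.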
This is “classical data”, analyzed by many.
y = cycles to failure, x1 = cycle length,
x2 = amplitude, x3 = load
The experimental design is a 3³ factorial
Most analysts begin by fitting a full quadratic model by ordinary least squares
[Figure: Design-Expert Box-Cox plot for power transforms, Ln(Residual SS) vs. λ.
Current λ = 1; Best λ = −0.19; confidence interval −0.54 to 0.22.
Recommended transform: log (λ = 0).]
Design-Expert V6 was used to analyze the data
A log transform is suggested
The Final Model is First-Order:
ŷ = e^(6.34 + 0.83 x1 − 0.63 x2 − 0.39 x3)
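Since the 3³ factorial is usually run in coded −1/0/+1 factor levels (an assumption here), the fitted equation is easy to explore; predicted cycles are longest at long cycle length, low amplitude, and low load:

```python
import math

def predicted_cycles(x1, x2, x3):
    # yhat = exp(6.34 + 0.83*x1 - 0.63*x2 - 0.39*x3), coded factor levels
    return math.exp(6.34 + 0.83 * x1 - 0.63 * x2 - 0.39 * x3)

print(predicted_cycles(0.0, 0.0, 0.0))   # center of the design
print(predicted_cycles(1.0, -1.0, -1.0)) # most favorable corner
```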
Response: Cycles    Transform: Natural log
ANOVA for Response Surface Linear Model
Analysis of variance table [Partial sum of squares]

Source      Sum of Squares   DF   Mean Square   F Value   Prob > F
Model            22.32        3      7.44        213.50     0.000
A                12.47        1     12.47        357.87
B                 7.11        1      7.11        204.04
C                 2.74        1      2.74         78.57
Residual          0.80       23      0.035
Cor Total        23.12       26