Chapter 7 Exercises
7.1 Linear models
7.1.1 Speed and stopping distances of cars
Note: this example is taken from Wood (2006).
This exercise is about modelling the relationship between stopping distance of a car and its speed at the moment that the driver is signalled to stop. Data on this are provided in R data frame cars. The data can be loaded with
data("cars", package = "datasets")
print(head(cars))
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
It takes a more or less fixed ‘reaction time’ for a driver to apply the car’s brakes, so that the car will travel a distance directly proportional to its speed before beginning to slow. A car’s kinetic energy is proportional to the square of its speed, but the brakes can only dissipate that energy, and slow the car, at a roughly constant rate per unit distance travelled: so we expect that once braking starts, the car will travel a distance proportional to the square of its initial speed, before stopping.
- Given the information provided above, fit a model of the form
\[
\texttt{dist}_i = \beta_0 + \beta_1 \texttt{speed}_i + \beta_2 \texttt{speed}^2_i + \varepsilon_i
\]
to the data in cars, and from this starting model, select the most appropriate model for the data using both AIC and hypothesis testing methods. (A sketch of a possible starting point follows this list.)
- From your selected model, estimate the average time that it takes a driver to apply the brakes (there are 5280 feet in a mile).
- When selecting between different polynomial models for a set of data, it is often claimed that one should not leave in higher powers of some continuous predictor, while removing lower powers of that predictor. Is this sensible?
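A minimal sketch of a possible starting point (not a full solution); the unit conversion in the final comment holds because dist is measured in feet and speed in miles per hour:
data("cars", package = "datasets")
b2 <- lm(dist ~ speed + I(speed^2), data = cars)  ## starting model
b1 <- update(b2, . ~ . - I(speed^2))              ## drop the quadratic term
b0 <- update(b2, . ~ . - speed)                   ## drop the linear term instead
AIC(b0, b1, b2)  ## AIC comparison of the candidate models
anova(b1, b2)    ## F-test for the quadratic term
## The coefficient on speed has units feet per (mile/hour); multiplying it
## by 3600/5280 converts it into a reaction time in seconds.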
7.1.2 Climate reconstruction data
The data set moberg2005 contains information on reconstructions of the average yearly temperature on the Northern hemisphere for the last 2000 years, which allows us to study trends in the temperature evolution. For the following analyses, we assume that the average yearly temperature is normally distributed with a constant variance but time-dependent expectation, i.e.,
\[
\texttt{temperature} \sim \mathrm{N}(f(\texttt{time}), \sigma^2).
\]
- Estimate a polynomial model for the temporal trend function.
- Search for the best fitting polynomial using the BIC() function.
- Visualize the best fitting model including the raw observations, the fitted line, and estimated 95% confidence intervals. (A sketch covering all three steps follows this list.)
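A minimal sketch, assuming moberg2005 is already loaded as a data frame with columns time and temperature, ordered by time (check the actual variable names in your copy):
fits <- lapply(1:15, function(k)
  lm(temperature ~ poly(time, k), data = moberg2005))
k_best <- which.min(sapply(fits, BIC))  ## BIC-based search over the degree
p <- predict(fits[[k_best]], interval = "confidence", level = 0.95)
plot(temperature ~ time, data = moberg2005, col = "lightgray")
matlines(moberg2005$time, p, lty = c(1, 2, 2), col = 1)  ## fit and 95% CI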
7.1.3 Heteroskedasticity
Note: the data are taken from Wooldridge (2016).
The data set hprice.rds
contains \(321\) observations of
house prices in the US. Consider the following linear predictor
\[
\texttt{price} = \beta_0 + f_1(\texttt{area}) + f_2(\texttt{age}) + f_3(\texttt{intst}),
\]
where each function \(f_j( \cdot )\) should be modeled using orthogonal polynomials
of degree \(3\). Is there a heteroskedasticity problem? If yes, estimate a heteroskedasticity-corrected model. What happens if you use log(price) as the dependent variable?
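A minimal sketch, assuming hprice.rds lies in the working directory and holds columns price, area, age, and intst; the Breusch-Pagan test from the lmtest package and the feasible GLS step are just one possible route:
d <- readRDS("hprice.rds")
m <- lm(price ~ poly(area, 3) + poly(age, 3) + poly(intst, 3), data = d)
plot(fitted(m), resid(m))  ## a fan-shaped cloud indicates heteroskedasticity
if(require("lmtest")) print(bptest(m))  ## Breusch-Pagan test
## feasible GLS: model the error variance with an auxiliary regression and
## refit using the implied inverse-variance weights
aux <- lm(log(resid(m)^2) ~ poly(area, 3) + poly(age, 3) + poly(intst, 3), data = d)
m_wls <- update(m, weights = 1 / exp(fitted(aux)))
m_log <- update(m, log(price) ~ .)  ## a log response often stabilizes the variance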
7.2 Semiparametric regression
7.2.1 Motorcycle accident data
For this exercise load the mcycle
data set from the MASS package.
- According to AIC(), find the best fitting polynomial model.
- Compute the normalized residuals and check whether they are in line with the assumptions of the linear regression model. Are there any outliers?
- Check for autocorrelation in the residuals using function acf().
- Now, set up a function tp() with arguments tp(z, degree = 3, knots = seq(min(z), max(z), length = 10)) that returns a design matrix of basis functions of a polynomial spline in truncated powers representation. Argument z is the covariate used to set up the design matrix, degree is the maximum degree of the polynomial, and knots are the knots that split the range of z into intervals; the default is 9 intervals defined by 10 equidistant knots. (A sketch of tp(), and of GCV() from the next item, follows this list.)
- It is computationally inefficient to compute the cross-validation (CV) score the way it is done in the slides. A more efficient alternative is given by
\[
V_g = \frac{n \sum_{i = 1}^n (y_i - \hat{f}_i)^2}{[\operatorname{tr}(\mathbf{I} - \mathbf{A})]^2},
\]
where \(\mathbf{A}\) is the hat matrix; this is also called the generalized cross-validation (GCV) score. Set up a new function GCV() that computes the GCV criterion and has the arguments GCV(y, Z), where y is the response and Z is a design matrix that is used for estimating a linear model by ordinary least squares.
- From the code of 6., find a model using function tp() that best fits the mcycle data according to the GCV criterion returned by function GCV() (try out different knot locations and degrees by hand).
- Finally, check whether the best fitting spline model is better than the best fitting polynomial model of 1. according to the GCV score.
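A minimal sketch of both functions; the basis construction and the GCV formula follow the definitions above, while names and defaults are only illustration. Note that the boundary knot at max(z) produces an all-zero column, which the pivoting in lm.fit() silently drops:
library("MASS")  ## provides the mcycle data

tp <- function(z, degree = 3, knots = seq(min(z), max(z), length = 10)) {
  ## polynomial part: z^0 (intercept), z, ..., z^degree
  Z <- outer(z, 0:degree, "^")
  ## truncated power part: (z - knot)^degree for z > knot, else 0
  for (k in knots) Z <- cbind(Z, pmax(z - k, 0)^degree)
  Z
}

GCV <- function(y, Z) {
  fit <- lm.fit(Z, y)  ## ordinary least squares with design matrix Z
  n <- length(y)
  ## for OLS, trace(A) equals the rank of the design matrix,
  ## so trace(I - A) = n - rank
  n * sum(fit$residuals^2) / (n - fit$rank)^2
}

Z <- tp(mcycle$times, degree = 3)
GCV(mcycle$accel, Z)  ## compare across degrees and knot settings by hand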
7.3 Bayesian models
7.3.1 Slice sampling
Take a close look at the implementation of the slice sampler in section Slice MCMC. If we want to perform slice MCMC for models with more parameters, how does the implementation need to be changed? Try to adapt it and estimate a regression model using the following simulated data: \[ y = 1.2 + \sin(x) + \varepsilon, \quad \varepsilon \sim \mathrm{N}(0, 0.3^2). \]
set.seed(123)
n <- 300
x <- runif(n, -2 * pi, 2 * pi)
y <- 1.2 + sin(x) + rnorm(n, sd = 0.3)

par(mar = c(4.1, 4.1, 0.1, 0.1))
plot(x, y)
See also the Linear model example for how the target posterior distribution is set up for the slice sampler.
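One way to generalize the univariate sampler is to update each parameter in turn, holding the others fixed. Below is a minimal sketch of that idea (stepping-out and shrinkage as in Neal, 2003), using the simulated x and y from above; log_post(), the flat priors, and the log(sigma) parameterization are illustrative assumptions, not the course implementation:
## log-posterior of y = beta0 + beta1 * sin(x) + eps under flat priors,
## with theta = c(beta0, beta1, log(sigma)) (illustrative parameterization)
log_post <- function(theta) {
  mu <- theta[1] + theta[2] * sin(x)
  sum(dnorm(y, mean = mu, sd = exp(theta[3]), log = TRUE))
}

## one univariate slice update of coordinate j, all others held fixed
slice_update <- function(theta, j, w = 1, m = 50) {
  z <- log_post(theta) - rexp(1)  ## log height of the auxiliary slice
  L <- R <- theta
  L[j] <- theta[j] - runif(1) * w
  R[j] <- L[j] + w
  while (m > 0 && log_post(L) > z) { L[j] <- L[j] - w; m <- m - 1 }  ## step out left
  while (m > 0 && log_post(R) > z) { R[j] <- R[j] + w; m <- m - 1 }  ## step out right
  repeat {  ## sample from the interval, shrinking it after each rejection
    prop <- theta
    prop[j] <- runif(1, L[j], R[j])
    if (log_post(prop) > z) return(prop)
    if (prop[j] < theta[j]) L[j] <- prop[j] else R[j] <- prop[j]
  }
}

theta <- c(0, 0, 0)
samples <- matrix(0, 1000, 3)
for (i in 1:1000) {
  for (j in 1:3) theta <- slice_update(theta, j)
  samples[i, ] <- theta
}
colMeans(samples[-(1:200), ])  ## posterior means after burn-in; beta0 should be near 1.2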
7.4 Advanced modeling
7.4.1 Boston housing
A quite popular data set that has been used extensively throughout the literature to benchmark algorithms is the Boston Housing data set.
Harrison, D. and Rubinfeld, D.L. (1978). Hedonic Prices and the Demand for Clean Air. Journal of Environmental Economics and Management, 5, 81-102.
The data set is available in the mlbench package and can be loaded with
if(!require("mlbench")) {
install.packages("mlbench")
}
## Loading required package: mlbench
data("BostonHousing2", package = "mlbench")
The data set consists of 506 observations and 19 variables.
print(head(BostonHousing2))
## town tract lon lat medv cmedv crim zn indus chas nox
## 1 Nahant 2011 -70.9550 42.2550 24.0 24.0 0.00632 18 2.31 0 0.538
## 2 Swampscott 2021 -70.9500 42.2875 21.6 21.6 0.02731 0 7.07 0 0.469
## 3 Swampscott 2022 -70.9360 42.2830 34.7 34.7 0.02729 0 7.07 0 0.469
## 4 Marblehead 2031 -70.9280 42.2930 33.4 33.4 0.03237 0 2.18 0 0.458
## 5 Marblehead 2032 -70.9220 42.2980 36.2 36.2 0.06905 0 2.18 0 0.458
## 6 Marblehead 2033 -70.9165 42.3040 28.7 28.7 0.02985 0 2.18 0 0.458
## rm age dis rad tax ptratio b lstat
## 1 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
In this example, the aim is to model the corrected median value of owner-occupied homes (variable cmedv) using a smooth additive model. To do so, split the data into an 80% training and a 20% testing set with
set.seed(123)
id <- sample(1:2, size = nrow(BostonHousing2), replace = TRUE, prob = c(0.8, 0.2))
dtrain <- subset(BostonHousing2, id == 1)
dtest <- subset(BostonHousing2, id == 2)
print(nrow(dtrain))
## [1] 412
print(nrow(dtest))
## [1] 94
Using the methods from the sections on linear models through smooth additive models, find a model that minimizes the out-of-sample MSE on the test data (data frame dtest),
\[
\text{MSE} = \frac{1}{n} \sum_{i = 1}^n (\text{cmedv}_i - \widehat{\text{cmedv}}_i)^2.
\]
Visualize the estimated effects and interpret the results. To document the analysis, build a statistical report using R Markdown that also shows all the model building steps. (A sketch of one candidate model follows.)
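A minimal sketch of one candidate model (certainly not the MSE-minimizing one), assuming the mgcv package is used for the smooth additive model; the chosen covariates are only a starting point:
if(!require("mgcv")) {
  install.packages("mgcv")
}
b <- gam(cmedv ~ s(crim) + s(rm) + s(lstat) + s(dis) + s(nox) + s(ptratio),
  data = dtrain)
p <- predict(b, newdata = dtest)
mean((dtest$cmedv - p)^2)  ## out-of-sample MSE on dtest
plot(b, pages = 1)         ## estimated smooth effects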