---
output: html_document
editor_options:
chunk_output_type: console
---
```{r, include=FALSE}
set.seed(42)
```
# Exercise - Multiple Linear Regression {-}
| Formula | Meaning | Details |
|----------------|--------------------------|------------------------------|
| `y~x_1` | $y=a_0 +a_1*x_1$ | Slope+Intercept |
| `y~x_1 - 1` | $y=a_1*x_1$ | Slope, no intercept |
| `y~I(x_1^2)` | $y=a_0 + a_1*(x_1^2)$ | Quadratic effect |
| `y~x_1+x_2` | $y=a_0+a_1*x_1+a_2*x_2$ | Multiple linear regression (two variables) |
| `y~x_1:x_2` | $y=a_0+a_1*(x_1*x_2)$ | Interaction between x~1~ and x~2~ |
| `y~x_1*x_2` | $y=a_0+a_1*(x_1*x_2)+a_2*x_1+a_3*x_2$ | Interaction and main effects |
: Formula syntax
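As a quick illustration of how these formulas translate into `lm()` calls, here is a minimal sketch using simulated data (the variables `x_1`, `x_2` and `y` are made up purely for this example and the chunk is not evaluated here):

```{r, eval=FALSE}
# simulated data, only to illustrate the formula syntax
x_1 <- runif(100)
x_2 <- runif(100)
y <- 1 + 2 * x_1 - x_2 + 0.5 * x_1 * x_2 + rnorm(100, sd = 0.2)

lm(y ~ x_1)        # slope + intercept
lm(y ~ x_1 - 1)    # slope, no intercept
lm(y ~ I(x_1^2))   # quadratic effect
lm(y ~ x_1 + x_2)  # two predictors, no interaction
lm(y ~ x_1:x_2)    # interaction only
lm(y ~ x_1 * x_2)  # main effects and their interaction
```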
In this exercise you will:
- perform multiple linear regressions
- interpret regression output and check the residuals
- plot model predictions including interactions
Before you start, remember to clean your global environment (if you haven't already) using `rm(list=ls())`.
To conduct the exercise, please load the following packages:
```{r, eval=TRUE}
library(effects)
library(MuMIn)
```
You will work with the following datasets:
- mtcars
- Cement{MuMIn}
The second dataset is from the MuMIn package (as shown by the curly brackets).
## Useful functions
**for multiple linear regression**
`lm()` - fit linear model\
`summary(fit)` - apply to fitted model object to display regression table\
`plot(fit)` - plot residuals for model validation\
`anova(fit)` - apply type I ANOVA (variables included sequentially) to a fitted model to test for effects across all levels of a factor\
`Anova(fit)` - `car` package; use type II ANOVA (effects for predictors when all other predictors are already included) for overall effects\
`scale()` - scale variable\
`sqrt()` - square-root\
`log()` - calculates natural logarithm\
`plot(allEffects(fit))` - apply to fitted model object to plot marginal effect; `effects` package\
`par()` - change graphical parameters\
use `oldpar <- par(mfrow = c(number_rows, number_cols))` to change the figure layout and include more than one plot per figure\
use `par(oldpar)` to reset the graphical parameters
**for model selection**
`stepAIC(fullModel)` - perform stepwise AIC model selection; apply to full model object, `MASS` package\
`dredge(fullModel)` - perform global model selection, `MuMIn` package\
`model.avg()` - perform model averaging\
`AIC(fit)` - get AIC for a fitted model\
`anova(fit1, fit2)` - compare two fitted models via Likelihood Ratio Test (LRT)
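As a rough sketch of how these functions are typically combined (here `fit` and `fullModel` are placeholders for fitted model objects, so this chunk is not run):

```{r, eval=FALSE}
# residual checks in a 2 x 2 layout, then reset the graphical parameters
oldpar <- par(mfrow = c(2, 2))
plot(fit)
par(oldpar)

# marginal effect plots (effects package)
plot(allEffects(fit))

# stepwise and global model selection (MASS and MuMIn packages)
library(MASS)
library(MuMIn)
reduced <- stepAIC(fullModel, direction = "backward")
options(na.action = "na.fail")  # dredge() refuses to run without this
candidates <- dredge(fullModel)
```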
## Analyzing the `mtcars` dataset
Imagine a start-up company wants to rebuild a car with a nice retro look from the 70s. The car should be modern though, meaning the fuel consumption should be as low as possible. They've discovered the `mtcars` dataset with all the necessary measurements, they've somehow heard about you and your R skills, and they've asked you for help. And of course you promised to help, kind as you are.
The company wants you to find out which of the following characteristics affect the fuel consumption, measured in miles per gallon (`mpg`); lower values of `mpg` reflect a higher fuel consumption. The company wants you to include the following variables in your analysis:
- number of cylinders (`cyl`)
- weight (`wt`)
- horsepower (`hp`)
- whether the car is driven manually or with automatic (`am`)
In addition, Pawl, one of the founders of the company, suggested that the effect of weight (`wt`) might be irrelevant for powerful cars (high `hp` values). You are thus asked to also test for this interaction in your analysis.
::: {.callout-warning}
### Question
Carry out the following tasks:
- Perform a multiple linear regression (change class for `cyl` and `am` to factor)
- Check the model residuals
- Interpret and plot all effects
You may need the following functions:
- `as.factor()`
- `lm()`
- `summary()`
- `anova()`
- `plot()`
- `allEffects()`
Use your results to answer the questions:
**Which of the following statements are correct? (Several are correct).**
```{r}
#| include: false
opts_p <- c(
answer = "Cars with 6 cylinders do not differ significantly in their fuel consumption as compared to cars with 4 cylinders.",
"Stronger cars (hp) use less fuel (mpg).",
answer = "Overall, heavier cars (wt) use significantly more fuel (their range is smaller; mpg)"
)
```
`r longmcq(opts_p)`
**Concerning the interaction between weight (wt) and horsepower (hp), which of the following statements is correct?**
```{r}
#| include: false
opts_p <- c(
"Pawl was wrong. There is a significant interaction between weight and horsepower, but the direction is opposite to what Pawl thought: The effect of weight is stronger for stronger cars.",
"Pawl was wrong, the effect of weight is independent of horsepower.",
answer = "Pawl was right. The effect of weight is strong for weaker cars, but becomes indeed irrelevant for stronger cars."
)
```
`r longmcq(opts_p)`
:::
`r hide("Click here to see the solution")`
This is the code that you need to interpret the results.
```{r mtcars}
# change am and cyl from numeric to factor
mtcars$am <- as.factor(mtcars$am)
mtcars$cyl <- as.factor(mtcars$cyl)
# multiple linear regression and results:
# (we need to scale (standardize) the predictors wt and hp, since we include their interaction)
carsfit <- lm(mpg ~ am + cyl + scale(wt) * scale(hp), data = mtcars)
# weight is included as the first predictor in order to have
# it as the grouping factor in the allEffects plot
summary(carsfit)
# The first level of each factor is used as the reference, i.e. here automatic transmission (am = 0) and 4 cylinders (cyl = 4).
# From the coefficient cyl6 we see that there is no significant difference in fuel consumption (= our response) between 4 cylinders (the reference) and 6 cylinders.
# In contrast, the predictors weight (wt) and horsepower (hp) have a significant negative effect on mpg, i.e. they both increase fuel consumption.
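# A quick way to confirm the reference categories: the first level returned
# by levels() is the reference used by lm()
levels(mtcars$am)   # "0" (automatic) comes first
levels(mtcars$cyl)  # "4" comes first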
# check residuals
old.par = par(mfrow = c(2, 2))
plot(carsfit)
par(old.par)
# plot effects
plot(allEffects(carsfit))
# We can see in the wt*hp plot that for high values of hp, wt has no effect on the response mpg. We conclude that Pawl was right.
```
`r unhide()`
<!-- Take from https://theoreticalecology.github.io/DataScienceInR/4A-Exercise.html -->
::: {.callout-warning}
#### Question
1. What is the meaning of "An effect is not significant"?
2. Is an effect with three \*\*\* more significant / certain than an effect with one \*?
:::
`r hide("Click here to see the solution")`
1. You should NOT say that the effect is zero, or that the null hypothesis has been accepted. The standard phrasing is "there is no significant evidence for an effect (p = XXX)". If we want to assess what that means, some people run a post-hoc power analysis (which effect size could have been detected), but it is typically better to discuss the confidence interval, i.e. look at the confidence interval and say: if there is an effect, we are relatively certain that it is smaller than X, given the confidence interval of XYZ.
2. Many people view it that way, and some even write "highly significant" for \*\*\*. It is probably true that we should have slightly higher confidence in a very small p-value, but strictly speaking there is only *significant* or *not significant*. Interpreting the p-value as a measure of certainty is a slight misinterpretation. Again, if we want to say how certain we are about the effect, it is better to look at the confidence interval, i.e. the standard error, and use this to discuss the precision of the estimate (small confidence interval / standard error = high precision / certainty).
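If you want to look at confidence intervals in practice, `confint()` reports them directly; a minimal sketch using the `carsfit` model from the previous solution:

```{r, eval=FALSE}
# 95% confidence intervals for all coefficients of the mtcars model
confint(carsfit)
# intervals that include 0 correspond to non-significant effects (at the 5% level);
# narrow intervals indicate precisely estimated effects
```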
`r unhide()`
## Interactions with the `plantHeight` dataset
::: {.callout-warning}
#### Plant Height revisited
Revisit our previous analysis of `EcoData::plantHeight`:
```{r}
library(EcoData)
model = lm(loght ~ temp, data = plantHeight)
```
Use (separate) multiple regressions to test:
1. Whether temp or NPP (net primary productivity) is the more important predictor (importance = absolute standardized effect size).
2. Whether growth forms (variable `growthform`) differ in their temperature effects (use an interaction).
3. Whether the effect of temp remains significant if we include latitude and an interaction of latitude with temp. If not, why? Tip: plot temp \~ lat.
:::
`r hide("Click here to see the solution")`
```{r}
plantHeight$sTemp = scale(plantHeight$temp)
plantHeight$sLat = scale(plantHeight$lat)
plantHeight$sNPP = scale(plantHeight$NPP)
# relevel
plantHeight$growthform2 = relevel(as.factor(plantHeight$growthform), "Herb")
```
1) **NPP or Temp?**
```{r}
fit = lm(loght ~ sTemp + sNPP, data = plantHeight)
summary(fit)
```
NPP is slightly more important (its absolute standardized effect size is larger).
2) **Interaction with growth form**
```{r}
fit = lm(loght ~ growthform2 * sTemp , data = plantHeight)
summary(fit)
```
Yes, because (some of) the interactions are significant.
Note that the n.s. main effect of sTemp is the temperature effect for the reference growth form only (after the releveling above this is Herb; the alphabetically first level, Fern, has just one observation). In a standard multiple regression you don't get p-values that test the temperature effect against 0 for the other growth forms, because the interaction terms are tested against the reference. What one usually does is run an ANOVA (see the chapter on ANOVA) to check whether temp is significant overall.
```{r}
anova(lm(loght ~ growthform * sTemp , data = plantHeight))
```
Alternatively, if you want to test whether a specific growth form has a significant temperature effect, you could either extract the p-value from the multiple regression (a bit complicated) or just run a univariate regression for this growth form:
```{r}
fit = lm(loght ~ sTemp + 0, data = plantHeight[plantHeight$growthform == "Tree",])
summary(fit)
```
Or you could fit the interaction but turn off the intercept (by adding `+0` or `-1`), which removes the growth-form intercepts and gives you one temperature slope per growth form:
```{r}
fit = lm(loght ~ sTemp:growthform + 0, data = plantHeight[,])
summary(fit)
```
3) **Interaction with lat**
```{r}
fit = lm(loght ~ sTemp * sLat, data = plantHeight)
summary(fit)
```
All effects are n.s. ... how did that happen? If we check the correlation between temp and lat, we see that the two predictors are highly collinear.
```{r}
cor(plantHeight$temp, plantHeight$lat)
```
In principle, the regression model should still be able to separate them, but the higher the collinearity, the more difficult it becomes for the regression to infer whether the effect is caused by one or the other predictor.
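If you want a numeric check on top of the correlation, variance inflation factors are a common diagnostic; a minimal sketch using `vif()` from the `car` package (which is not loaded above, so treat this as an optional extra):

```{r, eval=FALSE}
library(car)
# variance inflation factors for the main-effects model; values well above
# 5-10 are usually taken as a sign of problematic collinearity
vif(lm(loght ~ sTemp + sLat, data = plantHeight))
```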
`r unhide()`
## Model-selection with the `Cement` dataset
The process of cement hardening involves exothermic chemical reactions and thus produces heat. The amount of heat produced by the cement depends on the mixture of its constituents. The `Cement` dataset includes heat measurements for different types of cement that consist of different relative amounts of calcium aluminate (`X1`), tricalcium silicate (`X2`), tetracalcium alumino ferrite (`X3`) and dicalcium silicate (`X4`). A cement-producing company wants to optimize the composition of its product and wants to know which of these compounds are mainly responsible for heat production.
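To get a first impression of the data before modelling, a quick look at the variables and their pairwise relationships can help (a minimal sketch):

```{r, eval=FALSE}
library(MuMIn)
str(Cement)    # overview of the variables
pairs(Cement)  # pairwise scatterplots of heat (y) and the four compounds
```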
::: callout-note
We only do a model selection here for educational reasons. For your own analysis, if your goal is not a predictive model, think about the model structure before you do the analysis and then stick to it! See the section about [p-hacking](https://theoreticalecology.github.io/AdvancedRegressionModels/3C-ModelSelection.html#pHacking), and also consider that AIC selection can remove confounders, which violates the causal structure and can lead to spurious correlations!
:::
::: {.callout-warning}
### Questions
Carry out the following tasks:
- Perform a multiple linear regression including all predictor variables and all two-way interactions (remember the notation `(var1 + var2 + var3)^2`).
- Perform forward, backward, and global model selection and compare the results
- Fit the model considered optimal by global model selection and compare it with the full model based on AIC (or AICc) and LRT.
You may need the following functions:
- `lm()`
- `summary()`
- `stepAIC()` from the `MASS` package (`library(MASS)`)
- `options()`
- `dredge()`
- `AIC()` or `AICc()` (for small datasets)
- `anova()`
Use your results to answer the following questions:
**1. You tested 3 different model selection methods: forward stepwise AIC selection, backward stepwise AIC selection and global model selection. How many terms ( = intercept + predictor effects + interactions) did each of the reduced models include?**
- Forward selection `r fitb(3)`
- Backward selection `r fitb(10)`
- global model selection `r fitb(3)`
**2. You compared the full model with the reduced model from global model selection based on AIC and LRT (using the `anova()` function).
Which of the two models would you choose based on their AIC? And which would you choose based on the LRT?**
- AIC `r mcq(c(answer = "The full model", "I don't know. Both models fit equally well.", "Also the full model", "The reduced model"))`
- LRT `r mcq(c("The full model", answer = "I don't know. Both models fit equally well.", "Also the full model", "The reduced model"))`
**3. Here's a quote from Wikipedia on the AIC: ["When the sample size is small, there is a substantial probability that AIC will select models that have too many parameters, i.e. that AIC will overfit."](https://en.wikipedia.org/wiki/Akaike_information_criterion)
Check the sample size of the Cement dataset.
How do you now interpret the AIC values for the full model as compared to the reduced model from global model selection? (Several are correct)**
```{r}
#| include: false
opts_p <- c(
"The AIC for the full model is smaller. The full model thus fits better.",
answer = "I would not trust AIC model selection in this case, because the sample size is too small to fit the number of parameters necessary for the full model.",
answer = "Instead of using the AIC for model comparison, I would now prefer the AICc, which corrects for small sample sizes."
)
```
`r longmcq(opts_p)`
:::
`r hide("Click here to see the solution")`
This is the code that you need to obtain the results.
```{r Cement}
#| eval: false
library(MuMIn)
library(MASS)
# full model -> has 11 coefficients
full = lm(y ~ (X1 + X2 + X3 + X4)^2, data = Cement)
summary(full)
# forward model selection
ms_forw = stepAIC(full, direction = "forward")
summary(ms_forw)
# lists 11 coefficients (i.e. it keeps the full model: starting from the full model, forward selection has nothing left to add)
# backward model selection
ms_back = stepAIC(full, direction = "backward")
summary(ms_back)
# lists 10 coefficients
# global model selection
options(na.action = "na.fail")
dd = dredge(full)
head(dd)
# The first row lists the best performing model: it includes only the intercept and effects for X1 and X2 (= 3 coefficients).
# Fit the model considered optimal by global model selection and compare it with the full model based on AIC (or AICc) and LRT:
opt = lm(y ~ X1 + X2, data = Cement)
summary(opt)
AIC(opt,full) # full model is better according to AIC (lower AIC)
anova(opt, full) # -> LRT: no significant difference between the models
# sample size in the Cement dataset:
str(Cement) # or
nrow(Cement)
# If the sample size is small, a corrected version of the AIC (the AICc) is recommended to avoid overfitting:
AICc(opt, full) # the AICc of the full model is Inf! -> the reduced (optimal) model is better according to AICc
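# Why Inf? AICc adds a small-sample correction to the AIC:
#   AICc = AIC + 2k(k+1) / (n - k - 1)
# The full model estimates k = 12 parameters (11 coefficients + residual
# standard deviation) from n = 13 observations, so the denominator is 0
# and the AICc becomes infinite.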
```
`r unhide()`