---
output: html_document
editor_options:
chunk_output_type: console
---
```{r, include=FALSE}
set.seed(42)
```
# Exercise - Multiple Linear Regression {-}
| Formula | Meaning | Details |
|----------------|--------------------------|------------------------------|
| `y~x_1` | $y=a_0 +a_1*x_1$ | Slope+Intercept |
| `y~x_1 - 1` | $y=a_1*x_1$ | Slope, no intercept |
| `y~I(x_1^2)` | $y=a_0 + a_1*(x_1^2)$ | Quadratic effect |
| `y~x_1+x_2` | $y=a_0+a_1*x_1+a_2*x_2$ | Multiple linear regression (two variables) |
| `y~x_1:x_2` | $y=a_0+a_1*(x_1*x_2)$ | Interaction between x~1~ and x~2~ |
| `y~x_1*x_2` | $y=a_0+a_1*(x_1*x_2)+a_2*x_1+a_3*x_2$ | Interaction and main effects |
: Formula syntax
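As a quick illustration of how these formulas translate into `lm()` calls, here is a minimal sketch using simulated data (the variables `x_1`, `x_2` and `y` are made up purely for this example and the chunk is not evaluated here):

```{r, eval=FALSE}
# simulated data, only to illustrate the formula syntax
x_1 <- runif(100)
x_2 <- runif(100)
y <- 1 + 2 * x_1 - x_2 + 0.5 * x_1 * x_2 + rnorm(100, sd = 0.2)

lm(y ~ x_1)        # slope + intercept
lm(y ~ x_1 - 1)    # slope, no intercept
lm(y ~ I(x_1^2))   # quadratic effect
lm(y ~ x_1 + x_2)  # two predictors, no interaction
lm(y ~ x_1:x_2)    # interaction only
lm(y ~ x_1 * x_2)  # main effects and their interaction
```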
In this exercise you will:
- perform multiple linear regressions
- interpret regression output and check the residuals
- plot model predictions including interactions
Before you start, remember to clean your global environment (if you haven't already) using `rm(list=ls())`.
To conduct the exercise, please load the following packages:
```{r, eval=TRUE}
library(effects)
library(MuMIn)
```
You will work with the following datasets:
- mtcars
- Cement{MuMIn}
The second dataset is from the MuMIn package (as shown by the curly brackets).
## Useful functions
**for multiple linear regression**
`lm()` - fit linear model\
`summary(fit)` - apply to fitted model object to display regression table\
`plot(fit)` - plot residuals for model validation\
`anova(fit)` - apply type I ANOVA (variables included sequentially) to a fitted model to test for effects across all levels of a factor\
`Anova(fit)` - `car` package; use type II ANOVA (effects for predictors when all other predictors are already included) for overall effects\
`scale()` - scale variable\
`sqrt()` - square-root\
`log()` - calculates natural logarithm\
`plot(allEffects(fit))` - apply to fitted model object to plot marginal effect; `effects` package\
`par()` - change graphical parameters\
use `oldpar <- par(mfrow = c(number_rows, number_cols))` to change the figure layout and include more than one plot per figure\
use `par(oldpar)` to reset the graphical parameters
**for model selection**
`stepAIC(fullModel)` - perform stepwise AIC model selection; apply to full model object, `MASS` package\
`dredge(fullModel)` - perform global model selection, `MuMIn` package\
`model.avg()` - perform model averaging\
`AIC(fit)` - get AIC for a fitted model\
`anova(fit1, fit2)` - compare two fitted models via Likelihood Ratio Test (LRT)
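As a rough sketch of how these functions are typically combined (here `fit` and `fullModel` are placeholders for fitted model objects, so this chunk is not run):

```{r, eval=FALSE}
# residual checks in a 2 x 2 layout, then reset the graphical parameters
oldpar <- par(mfrow = c(2, 2))
plot(fit)
par(oldpar)

# marginal effect plots (effects package)
plot(allEffects(fit))

# stepwise and global model selection (MASS and MuMIn packages)
library(MASS)
library(MuMIn)
reduced <- stepAIC(fullModel, direction = "backward")
options(na.action = "na.fail")  # dredge() refuses to run without this
candidates <- dredge(fullModel)
```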
## Analyzing the `mtcars` dataset
Imagine a start-up company wants to rebuild a car with a nice retro look from the 70s. The car should be modern though, meaning the fuel consumption should be as low as possible. They've discovered the `mtcars` dataset with all the necessary measurements, they've somehow heard about you and your R skills, and they've asked you for help. And of course you promised to help, kind as you are.
The company wants you to find out which of the following characteristics affect the fuel consumption, measured in miles per gallon (`mpg`); lower values of `mpg` reflect a higher fuel consumption. The company wants you to include the following variables in your analysis:
- number of cylinders (`cyl`)
- weight (`wt`)
- horsepower (`hp`)
- whether the car is driven manually or with automatic (`am`)
In addition, Pawl, one of the founders of the company, suggested that the effect of weight (`wt`) might be irrelevant for powerful cars (high `hp` values). You are thus asked to also test for this interaction in your analysis.
::: {.callout-warning}
### Question
Carry out the following tasks:
- Perform a multiple linear regression (change class for `cyl` and `am` to factor)
- Check the model residuals
- Interpret and plot all effects
You may need the following functions:
- `as.factor()`
- `lm()`
- `summary()`
- `anova()`
- `plot()`
- `allEffects()`
Use your results to answer the questions:
**Which of the following statements are correct? (Several are correct).**
```{r}
#| include: false
opts_p <- c(
answer = "Cars with 6 cylinders do not differ significantly in their fuel consumption as compared to cars with 4 cylinders.",
"Stronger cars (hp) use less fuel (mpg).",
answer = "Overall, heavier cars (wt) use significantly more fuel (their range is smaller; mpg)"
)
```
`r longmcq(opts_p)`
**Concerning the interaction between weight (wt) and horsepower (hp), which of the following statements is correct?**
```{r}
#| include: false
opts_p <- c(
"Pawl was wrong. There is a significant interaction between weight and horsepower, but the direction is opposite to what Pawl thought: The effect of weight is stronger for stronger cars.",
"Pawl was wrong, the effect of weight is independent of horsepower.",
answer = "Pawl was right. The effect of weight is strong for weaker cars, but becomes indeed irrelevant for stronger cars."
)
```
`r longmcq(opts_p)`
:::
`r hide("Click here to see the solution")`
This is the code that you need to interpret the results.
```{r mtcars}
# change am and cyl from numeric to factor
mtcars$am <- as.factor(mtcars$am)
mtcars$cyl <- as.factor(mtcars$cyl)
# multiple linear regression and results:
# (we need to scale (standardize) the predictors wt and hp, since we include their interaction)
carsfit <- lm(mpg ~ am + cyl + scale(wt) * scale(hp), data = mtcars)
# weight is included as the first predictor in order to have
# it as the grouping factor in the allEffects plot
summary(carsfit)
# The first level of each factor is used as the reference, i.e. here automatic transmission (am = 0) and 4 cylinders (cyl = 4).
# From the coefficient cyl6 we see that there is no significant difference in fuel consumption (= our response) between 4 cylinders (the reference) and 6 cylinders.
# In contrast, the predictors weight (wt) and horsepower (hp) have a significant negative effect on mpg, i.e. they both increase fuel consumption.
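# A quick way to confirm the reference categories: the first level returned
# by levels() is the reference used by lm()
levels(mtcars$am)   # "0" (automatic) comes first
levels(mtcars$cyl)  # "4" comes first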
# check residuals
old.par = par(mfrow = c(2, 2))
plot(carsfit)
par(old.par)
# plot effects
plot(allEffects(carsfit))
# We can see in the wt*hp plot that for high values of hp, wt has no effect on the response mpg. We conclude that Pawl was right.
```
`r unhide()`
<!-- Take from https://theoreticalecology.github.io/DataScienceInR/4A-Exercise.html -->
::: {.callout-warning}
#### Question
1. What is the meaning of "An effect is not significant"?
2. Is an effect with three \*\*\* more significant / certain than an effect with one \*?
:::
`r hide("Click here to see the solution")`
1. You should NOT say that the effect is zero, or that the null hypothesis has been accepted. The standard phrasing is "there is no significant evidence for an effect (p = XXX)". If we want to assess what that means, some people run a post-hoc power analysis (which effect size could have been detected), but it is typically better to discuss the confidence interval, i.e. look at the confidence interval and say: if there is an effect, we are relatively certain that it is smaller than X, given the confidence interval of XYZ.
2. Many people view it that way, and some even write "highly significant" for \*\*\*. It is probably true that we should have slightly higher confidence in a very small p-value, but strictly speaking there is only *significant* or *not significant*. Interpreting the p-value as a measure of certainty is a slight misinterpretation. Again, if we want to say how certain we are about the effect, it is better to look at the confidence interval, i.e. the standard error, and use this to discuss the precision of the estimate (small confidence interval / standard error = high precision / certainty).
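If you want to look at confidence intervals in practice, `confint()` reports them directly; a minimal sketch using the `carsfit` model from the previous solution:

```{r, eval=FALSE}
# 95% confidence intervals for all coefficients of the mtcars model
confint(carsfit)
# intervals that include 0 correspond to non-significant effects (at the 5% level);
# narrow intervals indicate precisely estimated effects
```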
`r unhide()`
## Interactions with the `plantHeight` dataset
::: {.callout-warning}
#### Plant Height revisited
Revisit our previous analysis of `EcoData::plantHeight`:
```{r}
library(EcoData)
model = lm(loght ~ temp, data = plantHeight)
```
Use (separate) multiple regressions to test:
1. Whether temp or NPP (net primary productivity) is the more important predictor (importance = absolute standardized effect size).
2. Whether growth forms (variable `growthform`) differ in their temperature effects (use an interaction).
3. Whether the effect of temp remains significant if we include latitude and an interaction of latitude with temp. If not, why? Tip: plot temp \~ lat.
:::
`r hide("Click here to see the solution")`
```{r}
plantHeight$sTemp = scale(plantHeight$temp)
plantHeight$sLat = scale(plantHeight$lat)
plantHeight$sNPP = scale(plantHeight$NPP)
# relevel
plantHeight$growthform2 = relevel(as.factor(plantHeight$growthform), "Herb")
```
1) **NPP or Temp?**
```{r}
fit = lm(loght ~ sTemp + sNPP, data = plantHeight)
summary(fit)
```
NPP is slightly more important (its absolute standardized effect size is larger).
2) **Interaction with growth form**
```{r}
fit = lm(loght ~ growthform2 * sTemp , data = plantHeight)
summary(fit)
```
Yes, because (some of) the interactions are significant.
Note that the n.s. main effect of sTemp is the temperature effect for the reference growth form only (after the releveling above this is Herb; the alphabetically first level, Fern, has just one observation). In a standard multiple regression you don't get p-values that test the temperature effect against 0 for the other growth forms, because the interaction terms are tested against the reference. What one usually does is run an ANOVA (see the chapter on ANOVA) to check whether temp is significant overall.
```{r}
anova(lm(loght ~ growthform * sTemp , data = plantHeight))
```
Alternatively, if you want to test whether a specific growth form has a significant temperature effect, you could either extract the p-value from the multiple regression (a bit complicated) or just run a univariate regression for this growth form:
```{r}
fit = lm(loght ~ sTemp + 0, data = plantHeight[plantHeight$growthform == "Tree",])
summary(fit)
```
Or you could fit the interaction but turn off the intercept (by adding `+0` or `-1`), which removes the growth-form intercepts and gives you one temperature slope per growth form:
```{r}
fit = lm(loght ~ sTemp:growthform + 0, data = plantHeight[,])
summary(fit)
```
3) **Interaction with lat**
```{r}
fit = lm(loght ~ sTemp * sLat, data = plantHeight)
summary(fit)
```
All effects are n.s. ... how did that happen? If we check the correlation between temp and lat, we see that the two predictors are highly collinear.
```{r}
cor(plantHeight$temp, plantHeight$lat)
```
In principle, the regression model should still be able to separate them, but the higher the collinearity, the more difficult it becomes for the regression to infer whether the effect is caused by one or the other predictor.
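If you want a numeric check on top of the correlation, variance inflation factors are a common diagnostic; a minimal sketch using `vif()` from the `car` package (which is not loaded above, so treat this as an optional extra):

```{r, eval=FALSE}
library(car)
# variance inflation factors for the main-effects model; values well above
# 5-10 are usually taken as a sign of problematic collinearity
vif(lm(loght ~ sTemp + sLat, data = plantHeight))
```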
`r unhide()`
## Model-selection with the `Cement` dataset
The process of cement hardening involves exothermic chemical reactions and thus produces heat. The amount of heat produced by the cement depends on the mixture of its constituents. The `Cement` dataset includes heat measurements for different types of cement that consist of different relative amounts of calcium aluminate (`X1`), tricalcium silicate (`X2`), tetracalcium alumino ferrite (`X3`) and dicalcium silicate (`X4`). A cement-producing company wants to optimize the composition of its product and wants to know which of these compounds are mainly responsible for heat production.
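To get a first impression of the data before modelling, a quick look at the variables and their pairwise relationships can help (a minimal sketch):

```{r, eval=FALSE}
library(MuMIn)
str(Cement)    # overview of the variables
pairs(Cement)  # pairwise scatterplots of heat (y) and the four compounds
```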
::: callout-note
We only do a model selection here for educational reasons. For your own analysis, if your goal is not a predictive model, think about the model structure before you do the analysis and then stick to it! See the section about [p-hacking](https://theoreticalecology.github.io/AdvancedRegressionModels/3C-ModelSelection.html#pHacking), and also consider that AIC selection can remove confounders, which violates the causal structure and can lead to spurious correlations!
:::
::: {.callout-warning}
### Questions
Carry out the following tasks:
- Perform a multiple linear regression including all predictor variables and all two-way interactions (remember the notation `(var1 + var2 + var3)^2`).
- Perform forward, backward, and global model selection and compare the results
- Fit the model considered optimal by global model selection and compare it with the full model based on AIC (or AICc) and LRT.
You may need the following functions:
- `lm()`
- `summary()`
- `stepAIC()` from the `MASS` package (`library(MASS)`)
- `options()`
- `dredge()`
- `AIC()` or `AICc()` (for small datasets)
- `anova()`
Use your results to answer the following questions:
**1. You tested 3 different model selection methods: forward stepwise AIC selection, backward stepwise AIC selection and global model selection. How many terms ( = intercept + predictor effects + interactions) did each of the reduced models include?**
- Forward selection `r fitb(3)`
- Backward selection `r fitb(10)`
- global model selection `r fitb(3)`
**2. You compared the full model with the reduced model from global model selection based on AIC and LRT (using the `anova()` function).
Which of the two models would you choose based on their AIC? And which would you choose based on the LRT?**
- AIC `r mcq(c(answer = "The full model", "I don't know. Both models fit equally well.", "Also the full model", "The reduced model"))`
- LRT `r mcq(c("The full model", answer = "I don't know. Both models fit equally well.", "Also the full model", "The reduced model"))`
**3. Here's a quote from Wikipedia on the AIC: ["When the sample size is small, there is a substantial probability that AIC will select models that have too many parameters, i.e. that AIC will overfit."](https://en.wikipedia.org/wiki/Akaike_information_criterion)
Check the sample size of the Cement dataset.
How do you now interpret the AIC values for the full model as compared to the reduced model from global model selection? (Several are correct)**
```{r}
#| include: false
opts_p <- c(
"The AIC for the full model is smaller. The full model thus fits better.",
answer = "I would not trust AIC model selection in this case, because the sample size is too small to fit the number of parameters necessary for the full model.",
answer = "Instead of using the AIC for model comparison, I would now prefer the AICc, which corrects for small sample sizes."
)
```
`r longmcq(opts_p)`
:::
`r hide("Click here to see the solution")`
This is the code that you need to obtain the results.
```{r Cement}
#| eval: false
library(MuMIn)
library(MASS)
# full model -> has 11 coefficients
full = lm(y ~ (X1 + X2 + X3 + X4)^2, data = Cement)
summary(full)
# forward model selection
ms_forw = stepAIC(full, direction = "forward")
summary(ms_forw)
# lists 11 coefficients (i.e. it keeps the full model: starting from the full model, forward selection has nothing left to add)
# backward model selection
ms_back = stepAIC(full, direction = "backward")
summary(ms_back)
# lists 10 coefficients
# global model selection
options(na.action = "na.fail")
dd = dredge(full)
head(dd)
# The first row lists the best performing model: it includes only the intercept and effects for X1 and X2 (= 3 coefficients).
# Fit the model considered optimal by global model selection and compare it with the full model based on AIC (or AICc) and LRT:
opt = lm(y ~ X1 + X2, data = Cement)
summary(opt)
AIC(opt,full) # full model is better according to AIC (lower AIC)
anova(opt, full) # -> LRT: no significant difference between the models
# sample size in the Cement dataset:
str(Cement) # or
nrow(Cement)
# If the sample size is small, a corrected version of the AIC (the AICc) is recommended to avoid overfitting:
AICc(opt, full) # the AICc of the full model is Inf! -> the reduced (optimal) model is better according to AICc
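# Why Inf? AICc adds a small-sample correction to the AIC:
#   AICc = AIC + 2k(k+1) / (n - k - 1)
# The full model estimates k = 12 parameters (11 coefficients + residual
# standard deviation) from n = 13 observations, so the denominator is 0
# and the AICc becomes infinite.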
```
`r unhide()`