07-grouping-structures.Rmd

# R formula and random effect grouping syntax

# Goals

- Review R formula syntax including some lesser known shortcuts
- Be able to identify possible random effect grouping structures
- Be able to write out random effect formulas for `lme4::lmer()`
- Be able to convert `lme4::lmer()` random effect syntax to `nlme::lme()` syntax

# Formula syntax

A formula in R uses the `~` character. E.g.

```{r}
f <- y ~ x
class(f)
```

We covered R's formula syntax in the slides. Below are some exercises to solidify your understanding. 

## Challenge 1

For this course, I'm assuming you're already somewhat familiar with R formulas. Let's review some more edge-case R formula syntax quickly. (Hint: read `?formula` as needed.)

Re-write the following in shortform:

```{r}
y ~ x1 + x2 + x1:x2
```

Answer:

```{r}
y ~ x1 * x2 # exercise
```

What is the longhand way of writing the following?

```{r}
y ~ (x1 + x2 + x3)^2
```

Answer:

```{r}
y ~ x1 + x2 + x3 + x1:x2 + x2:x3 + x1:x3 # exercise
```

Are the following the same? Why or why not?

```{r}
y ~ x^2
y ~ I(x^2)
```

What does `y ~ x^2` actually mean in simplified form?

```{r}
y ~ x # exercise
```

What are 2 ways to remove the intercept in an R formula? Hint: read `?formula`. When might you want to do this? 

```{r}
y ~ -1 + x # exercise
y ~ 0 + x # exercise
```

# Random effect syntax

We'll cover random effect syntax in the slides. 

Here's a summary for reference:

```{r, eval=FALSE}
# nlme::lme()
y ~ x, random = ~1 | group # varying intercepts
y ~ x, random = ~1 + x | group # varying intercepts and slopes
y ~ x, random = list(group = pdDiag(~ x)) # uncorrelated varying intercepts and slopes
y ~ x, random = ~1 | group/subgroup # nested
# crossed... some structures possible but not easy,
# e.g. Pinheiro and Bates 2000 p163
# e.g. http://stackoverflow.com/questions/36643713/how-to-specify-different-random-effects-in-nlme-vs-lme4

# lme4::lmer()
y ~ x + (1 | group) # varying intercepts
y ~ x + (1 + x | group) # varying intercepts and slopes
y ~ x + (1 | group) + (x - 1 | group) # uncorrelated varying intercepts and slopes
y ~ x + (1 | group/subgroup) # nested
y ~ x + (1 | group1) + (1 | group2) # varying intercepts, crossed
y ~ x + (1 + x | group1) + (1 + x | group2) # varying intercepts and slopes, crossed
```

Below are some exercises.

## Challenge 2

A group of grad students are trying to model log(frog density) as a function of vegetation characteristics. What might their model look like in the following cases (using `lme4::lmer`)? Give cases with random intercepts only and cases with random intercepts and slopes.

Jerry measures frog density and vegetation characteristics from 1 transect (`transect`) within 6 separate ponds (`pond`).

```{r}
# random intercepts
log(frog_dens) ~ vegetation + 
  (1 | pond) # exercise
# random slopes and intercepts:
log(frog_dens) ~ vegetation + 
  (1 + vegetation | pond) # exercise
```

Emily measures frog density and vegetation characteristics from 5 transects within 1 pond.

```{r}
log(frog_dens) ~ vegetation + 
  (1 | transect) # exercise
log(frog_dens) ~ vegetation + 
  (1 + vegetation | transect) # exercise
```

Bonus: what are 2 ways you can write the above models that have a random slope? Hint: do you need to specify the random intercept explicitly? Are there advantages/disadvantages to each syntax?

```{r}
log(frog_dens) ~ vegetation + (1 + vegetation | transect) # exercise
log(frog_dens) ~ vegetation + (vegetation | transect) # exercise
```

Bonus 2: How can you specify one of the above models with a random slope but no random intercept? Hint, see the examples in `?lmer`. When might you use this?

```{r}
log(frog_dens) ~ vegetation + (0 + vegetation | transect) # exercise
```

## Challenge 3

Jane measures frog density and vegetation characteristics from 2 transects (`transect`) within 4 ponds (`pond`). Assume her data are structured like this:

```{r}
library(dplyr)
set.seed(99)
d <- tibble::tibble(
  frog_dens = rlnorm(100), 
  vegetation = rnorm(100), 
  pond = gl(25, 4), # generates factor levels
  transect = rep(gl(2, 2), 25))
d
```

*Ignoring* that Jane probably hasn't sampled enough transects to get much benefit from a mixed effect model (we're trying to keep this example data set small):

What is one way she could write the model with a random intercept for transects nested within ponds?

```{r}
log(frog_dens) ~ vegetation + 
  (1 | pond/transect) # exercise
```

What's another way to write the same model?

```{r}
log(frog_dens) ~ vegetation + 
  (1 | pond) + (1 | pond:transect) # exercise
```

What if she had coded her transects as in the `transect2` column in the following code chunk?

```{r}
d <- d %>% mutate(transect2 = gl(50, 2))
```

She could write her model the same way as above with `transect`, but she has one new option. What is that?

```{r}
log(frog_dens) ~ vegetation + 
  (1 | pond) + (1 | transect2) # exercise
```

Let's paste those into `lmer()` for proof that they are all the same. First, fill in the last line with the answer to the last exercise. Note that the surrounding parentheses just cause R to print the output while running the line.

```{r}
library(lme4)
(m1 <- lmer(log(frog_dens) ~ vegetation + (1 | pond/transect), data = d))
(m2 <- lmer(log(frog_dens) ~ vegetation + (1 | pond) + (1 | pond:transect), data = d))
(m3 <- lmer(log(frog_dens) ~ vegetation +  
    (1 | pond) + (1 | transect2), data = d)) # exercise 
```

Bonus: but the following *is not* the same, and is probably *not* what you want to do. Why is that? What does this one mean?

```{r}
(m_wrong <- lmer(log(frog_dens) ~ vegetation + (1 | pond) + (1 | transect), data = d))
```

Bonus 2: can you show what `m_wrong` is estimating as random effects for `transect`? Compare that to `m3`. Why is this happening?

```{r}
ranef(m_wrong) # exercise
ranef(m3) # exercise
```

## Challenge 4

For practice, how can we write the following model using `nlme::lme()` syntax?

lme4:

```{r}
(m_lmer <- lmer(log(frog_dens) ~ vegetation + (1 | pond/transect), data = d))
```

nlme:

```{r}
(m_nlme <- nlme::lme(log(frog_dens) ~ vegetation, 
  random =  ~ 1 | pond/transect, data = d)) # exercise
```

Compare the output. Are they the same?