# This and That
```{r ch25-setup, include = FALSE}
library(CalvinBayes)
library(rstanarm)
library(brms)
library(loo)
set.seed(2512345)
```
## Wells in Bangladesh
Some things to learn from this example:
* We can use `update()` to speed up fitting multiple models.
* We can combine ideas to build up models with multiple predictors.
* `marginal_effects()` can simplify making certain plots that show how
the model thinks the response depends on one of the predictors. It is
a little clunky to use, but it saves a lot of work.
* Transforming predictors adds to our palette of models.
* Evaluating the model prediction at specific values can help us understand
what the model says.
* For logistic regression, there is a handy short-cut to help understand
the coefficients.
### The data
The rstanarm package contains a data set called `wells` with data from
a survey of 3020 residents in a small area of Bangladesh suffering from arsenic
contamination of groundwater. Respondents with elevated arsenic levels in their
wells had been encouraged to switch their water source to a safe public or
private well in the nearby area. The survey was conducted several years later
to learn which of the affected residents had switched wells.
The data include several variables.
* `switch` Indicator for well-switching (1 = switched, 0 = did not switch)
* `arsenic` Arsenic level in respondent’s well
* `dist` Distance (meters) from the respondent’s house to the nearest well with safe drinking water.
* `assoc` Indicator for member(s) of household participating in community organizations
* `educ` Years of education (of the head of household)
```{r ch25-wells}
library(rstanarm)
glimpse(wells)
```
### The Question
Our goal is to use this data to determine which factors impact the decision to switch.
### Distance as a predictor
It seems reasonable that people might be more likely to switch to a well
if it isn't too far away. Let's see.
```{r ch25-wells1-brm, cache = TRUE, results = "hide"}
wells1_brm <-
brm(switch ~ dist, data = wells, family = bernoulli(link = logit))
```
```{r ch25-wells1-brm-look}
wells1_brm
```
What do we make of `dist` as a predictor? Ideally, we'd like some more digits.
```{r ch25-wells1-post}
hdi(posterior(wells1_brm), regex_pars = "b_")
mcmc_combo(as.mcmc(wells1_brm), regex_pars = "b_")
```
`b_dist` is "small", but well separated from 0. Keep in mind, though, that our
unit of distance is meters, so this is telling us about the change in log
odds of switching *per meter* that the clean well is farther away. One meter
probably doesn't matter much. Perhaps 100 or 1000 meters would matter more,
however. This model predicts a *decrease* in log odds of roughly 0.6 for every
100 meters. **Don't ignore small coefficients if they get multiplied by
large variables!**
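As a quick check on the 100-meter scale, we can rescale the posterior draws of
the per-meter slope directly (a small sketch, assuming `posterior()` returns
one row per draw with a `b_dist` column, as above).
```{r ch25-dist-rescale}
# rescale the per-meter slope to a per-100 m change in log odds
post1 <- posterior(wells1_brm)
quantile(100 * post1$b_dist, c(0.025, 0.5, 0.975))
```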
So what does our model "look like"? It would be nice to see how the probability
of switching depends on distance from a clean well. The `marginal_effects()`
function can help us make such a plot.
```{r ch25-wells1-marginal-1}
marginal_effects(wells1_brm)
```
If we would like to add to the plot, we have some work to do:
* `marginal_effects()` doesn't return the plot, it prints it, so we have to
tell it not to do that.
* The result will be a **list of plots** (because in a more complicated model
there would be multiple predictors), so to get the plot we want, we have
to select it from the list.
```{r ch25-wells1-marginal-2}
p <- marginal_effects(wells1_brm) %>% plot(plot = FALSE)
p[[1]] %>%
gf_jitter(switch ~ dist, data = wells,
height = 0.2, width = 0, alpha = 0.4,
inherit = FALSE) %>%
gf_labs(x = "distance (m)")
```
### Updating a model without recompiling
It seems a shame to recompile our Stan model just to use the new distance variable.
Fortunately, brms includes an `update()` function for updating models, and it will
avoid recompiling when it can. For example, there is no need to recompile if
* we use a different data set (that still has the needed variables for the model)
* we want to change the number of iterations or chains.
Let's give it a try.
```{r ch25-wells-update, cache = TRUE, results = "hide"}
wells <- wells %>% mutate(dist100 = dist / 100)
wells2_brm <- update(wells1_brm, switch ~ dist100, newdata = wells)
```
```{r ch25-wells2-brm-look}
wells2_brm
p <- marginal_effects(wells2_brm) %>% plot(plot = FALSE)
p[[1]] %>%
gf_jitter(switch ~ dist100, data = wells,
height = 0.2, width = 0, alpha = 0.4,
inherit = FALSE) %>%
gf_labs(x = "distance (100 m)")
```
```{r ch25-wells3-update, cache = TRUE, results = "hide"}
# Two ways to use a log transformation on distance
wells <- wells %>% mutate(log10dist = log10(dist))
wells3_brm <- update(wells1_brm, newdata = wells, formula. = switch ~ log10(dist))
wells4_brm <- update(wells1_brm, newdata = wells, formula. = switch ~ log10dist)
```
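Note the difference between these two (equivalent) fits: `wells3_brm` applies
the transformation inside the formula, so downstream tools like
`marginal_effects()` work on the original `dist` scale, while `wells4_brm` uses
the pre-computed `log10dist` variable, so its plots and summaries are expressed
on the log scale. Compare the two plots below.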
```{r ch25-wells3-brm-look}
wells3_brm
p <- marginal_effects(wells3_brm) %>% plot(plot = FALSE)
p[[1]] %>%
gf_jitter(switch ~ dist, data = wells, height = 0.2, width = 0, alpha = 0.4,
inherit = FALSE) %>%
gf_labs(x = "distance")
```
```{r ch25-wells4-brm-look}
wells4_brm
p <- marginal_effects(wells4_brm) %>% plot(plot = FALSE)
p[[1]] %>%
gf_jitter(switch ~ log10(dist), data = wells,
height = 0.2, width = 0, alpha = 0.4,
inherit = FALSE) %>%
gf_labs(x = "log distance")
```
```{r ch25-wells-loo, cache = TRUE}
compare(loo(wells1_brm), loo(wells2_brm), loo(wells3_brm), loo(wells4_brm))
```
### Interpreting coefficients -- discrete change
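One direct way to interpret a logistic regression coefficient is to pick two
values of the predictor and compute the change in the predicted probability
between them. Here is a minimal sketch (assuming, as above, that `posterior()`
returns a data frame of draws; `plogis()` is the inverse logit in base R):
```{r ch25-discrete-change}
# discrete change: drop in P(switch) between wells 100 m and 200 m away,
# using posterior means of the intercept and slope
post2 <- posterior(wells2_brm)
a <- mean(post2$b_Intercept)
b <- mean(post2$b_dist100)
plogis(a + b * 1) - plogis(a + b * 2)
```
The next section gives a handy approximation to this sort of calculation.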
### Interpreting coefficients -- the divide by 4 trick
Rather than consider a discrete change in our predictor $x$,
we can compute the derivative of the logistic curve at the central value.
<!-- in this case x ̄ = 3.1. -->
Differentiating the function $\mathrm{ilogit}(\alpha + \beta x)$
with respect to $x$ yields
$$
\begin{align*}
\frac{d}{dx} \mathrm{ilogit}(\alpha + \beta x)
&=
\beta e^{\alpha + \beta x}/ (1 + e^{\alpha + \beta x})^2
\end{align*}
$$
Now consider the $x$ value for which model predicts a probability of 50%.
That is
$$
\begin{align*}
0.5 &= \mathrm{ilogit}(\alpha + \beta x) \\
\mathrm{logit}(0.5) &= \alpha + \beta x \\
0 &= \alpha + \beta x \\
x &= \frac{-\alpha}{\beta} \\
\end{align*}
$$
Plugging this into the derivative, we get
$$
\begin{align*}
\beta e^{\alpha + \beta x} / (1 + e^{\alpha + \beta x})^2
&=
\beta e^{\alpha + \beta \frac{-\alpha}{\beta}} /
(1 + e^{\alpha + \beta \frac{-\alpha}{\beta}})^2
\\
&=
\beta e^{0} / (1 + e^{0})^2
\\
&=
\beta / 4
\end{align*}
$$
This is the steepest slope on the logistic curve. So $\beta/4$ gives us an
upper bound on how much the probability changes as $x$ changes. This upper
bound is a good approximation for values near this central point.
Applying this to our example,
* The central value of $x$ is approximately (plugging in the posterior means
for the parameters) $- \frac{0.61}{- 0.62} = `r round(0.61/.62, 2)` \approx 1$.
* So near that central value, an increase of 100 meters decreases the probability
of switching by about 15 percentage points
($\beta / 4 \approx -0.62 / 4 \approx -0.15$).
* Let's see how this compares to the direct calculation of the change.
Again using the posterior mean parameter values, we get
    * probability of switching at 100 m:
    $\mathrm{ilogit}(\hat \alpha + \hat \beta \cdot 1) = \mathrm{ilogit}(0.61 - 0.62) =$
    `r round(plogis(0.61 - 0.62), 2)`;
    * probability of switching at 200 m:
    $\mathrm{ilogit}(\hat \alpha + \hat \beta \cdot 2) = \mathrm{ilogit}(0.61 - 0.62 \cdot 2) =$
    `r round(plogis(0.61 - 0.62 * 2), 2)`;
    * a difference of about 0.15.
That's pretty close to our $\beta/4$ estimate.
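The divide-by-4 trick also applies draw by draw, giving a posterior distribution
for the (maximal) change in probability per 100 m (a small sketch):
```{r ch25-divide-by-4-posterior}
# posterior distribution of the steepest change in probability per 100 m
post2 <- posterior(wells2_brm)
quantile(post2$b_dist100 / 4, c(0.025, 0.5, 0.975))
```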
### Other predictors: arsenic
If a person's well is heavily contaminated with arsenic, perhaps they
are more likely to switch.
```{r ch25-wells5-update, cache = TRUE, results = "hide"}
wells5_brm <-
update(wells1_brm, newdata = wells, switch ~ dist100 + arsenic)
```
```{r ch25-wells6-update, cache = TRUE, results = "hide"}
wells6_brm <-
update(wells1_brm, switch ~ dist100 + log(arsenic), newdata = wells)
```
```{r ch25-wells5-compare}
compare(waic(wells2_brm), waic(wells5_brm), waic(wells6_brm))
```
Looks like a log transformation on arsenic is useful.
```{r ch25-wells6-look}
wells6_brm
```
```{r ch25-dist-vs-arsenic, cache = TRUE, results = "hide"}
gf_point(log(arsenic) ~ dist100, data = wells) %>%
gf_smooth()
gf_point(arsenic ~ dist100, data = wells) %>%
gf_smooth()
```
```{r ch25-wells7-update, cache = TRUE, results = "hide"}
wells7_brm <-
update(wells1_brm, switch ~ dist100 * log10(arsenic), newdata = wells)
```
```{r ch25-wells7-compare}
compare(waic(wells2_brm), waic(wells6_brm), waic(wells7_brm))
```
Looks like the interaction between distance and arsenic may be useful too.
```{r ch25-wells7-look}
wells7_brm
mcmc_areas(as.mcmc.list(stanfit(wells7_brm)), regex_pars = "b_")
```
### Still more predictors
```{r ch25-wells8-update, cache = TRUE, results = "hide"}
wells8_brm <-
update(wells1_brm, switch ~ dist100 * log10(arsenic) + educ + assoc,
newdata = wells)
```
```{r ch25-wells8-brm-look}
wells8_brm
```
Marginal effects are getting more interesting now.
```{r ch25-wells8-marginal}
conditions <-
make_conditions(
expand.grid(
educ = c(5, 12)
),
vars = c("educ")
)
marginal_effects(wells8_brm, effects = "dist100:arsenic", conditions = conditions)
```
## Gelman/Hill Principles for Building Models
[@Gelman:2006] offers a nice set of
"general principles for building regression models for prediction".
1. Include all input variables that, for substantive reasons, might be expected to be important in predicting the outcome.
2. It is not always necessary to include these inputs as separate predictors -- for example, sometimes several inputs can be averaged or summed to create a "total score" that can be used as a single predictor in the model.
3. For inputs that have large effects, consider including their interactions as well.[^25-1]
4. We suggest the following strategy for decisions regarding whether to exclude
a variable from a prediction model based on expected sign and statistical
significance (typically measured at the 5% level; that is, a coefficient is
"statistically significant" if its estimate is more than 2 standard errors from
zero):
a. If a predictor is **not statistically significant** and has the
**expected sign**,
it is generally fine to keep it in. It may not help predictions
dramatically but is also probably not hurting them.
b. If a predictor is **not statistically significant** and
**does not have the expected sign**
(for example, incumbency having a negative effect on vote
share), consider removing it from the model (that is, setting its
coefficient to zero).
c. If a predictor is **statistically significant** and
**does not have the expected sign**,
then think hard if it makes sense. (For example, perhaps
this is a country such as India in which incumbents are generally
unpopular; see Linden, 2006.) Try to gather data on potential lurking
variables and include them in the analysis.
d. If a predictor is **statistically significant** and has the
**expected sign**, then by all means keep it in the model.
They conclude by saying
> These strategies do not completely solve our problems but they help keep us
from making mistakes such as discarding important information. They are
predicated on having thought hard about these relationships before fitting the
model. It’s always easier to justify a coefficient’s sign after the fact than to
think hard ahead of time about what we expect. On the other hand, an explanation
that is determined after running the model can still be valid. We should be able
to adjust our theories in light of new information.
Since we are doing things in a Bayesian context, we should replace
"statistically significant" with an equivalent notion based on the posterior
distribution (using a posterior probability or an HDI, for example).
[^25-1]: Be careful how you interpret the word "large". Without context,
no number is large or small. Put parameter estimates in context by keeping
three things in mind: the units involved, "statistical significance" (i.e., the
shape of the posterior distribution, not just a one-number summary),
and impact on the predictions (which for a term of the form $\beta x$ includes
understanding what typical values of $x$ might be).
In addition I'll add:
5. Use interactions if it makes sense that the effect of one predictor might
depend on the value of another predictor.
6. Linear models are inherently monotonic. If you suspect instead a
**maximum or minimum effect**,
consider including both $x$ and $x^2$ (or something equivalent) as predictors.
(Unlike lines, which either always rise or always
fall, parabolas have a maximum or a minimum.)
If a parabola isn't the "right shape", additional transformations of
$x$ or $y$ may be able to improve the fit. For example, we might use
$\log(x)$ and $(\log(x))^2$.
7. Consider **transformation** of either a predictor or the response variable if
a. There is a natural reason to prefer the transformed variable, perhaps
because it makes the model more interpretable or corresponds to intuition
about the situation at hand.
b. Transforming the variable improves the fit either by improving the
"average" or by making the shape of the "noise" match the model's family
better.
8. **Don't fret the intercept.** The intercept should nearly always be
included. The rules about statistical significance and sign do not apply
to the intercept since often it has no meaningful interpretation.
* If you want to make the intercept somewhat more meaningful,
**centering** the predictors (subtracting the mean) may help;
see the sketch just after this list.
(See if you can figure out why.)
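Here is a minimal sketch of centering using the wells data (the variable
`dist100_c` and the model name are ours, invented for illustration). After
centering, the intercept gives the log odds of switching at the *average*
distance -- a distance that actually occurs in the data -- rather than at
distance 0.
```{r ch25-centering, eval = FALSE}
# center dist100 so the intercept describes a household at the average distance
wells <- wells %>% mutate(dist100_c = dist100 - mean(dist100))
wells9_brm <- update(wells1_brm, switch ~ dist100_c, newdata = wells)
```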
### Example: Bicycling
Suppose we want to predict the maximum speed a bicyclist can ride based on
two predictors: the gear ratio they are using and the steepness of the road
they are riding on.
* We expect both variables to impact speed, so we include both in our model.
* We expect the effect of steepness to be monotonic. So no quadratic term required.
* We expect gear ratio to have a maximum effect -- there is a gear ratio with
which we can go fastest. Choosing a gear ratio that is lower than this will
make our legs need to move too fast. Choosing a gear ratio that is higher will
make it too hard to pedal. So a quadratic term for gear ratio seems reasonable.
* The best gear ratio to use depends on steepness (that's why bikes have multiple
gears), so it makes sense to include an interaction.
That sort of reasoning gives us a good starting point for exploring model options.
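For instance, the reasoning above might translate into a model formula like
the one below (a hypothetical sketch: `max_speed`, `steepness`, `gear_ratio`,
and `bike_data` are invented names, and no such data set is loaded here).
```{r ch25-bike-sketch, eval = FALSE}
# quadratic in gear ratio, and both gear ratio terms interact with steepness
bike_brm <- brm(
  max_speed ~ steepness * (gear_ratio + I(gear_ratio^2)),
  data = bike_data
)
```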
## Electric Company
### Interpreting a model with interaction
Here is a model for the *Electric Company* data that includes an interaction.
```{r ch25-ec-mod3, results = "hide", cache = TRUE }
Grade2 <- ElecComp %>% filter(grade == 2)
mod3 <- brm(post ~ pre * group, data = Grade2)
```
```{r ch25-ec-mod3-summary}
mod3
```
Let $x$ be the pre score; then this model is
$$
\begin{align*}
\mu_{\mathrm{post}}
&= \beta_0
+ \beta_1 x
+ \left(\beta_2 [\![\mathrm{treatment\ group}]\!]
+ \beta_3 [\![\mathrm{treatment\ group}]\!] x \right)
\\
&= \left(\beta_0
+ \beta_1 x \right)
+ (\beta_2 + \beta_3 x) [\![\mathrm{treatment\ group}]\!]
\\
\end{align*}
$$
The treatment effect is $\beta_2 + \beta_3 x$, which depends on $x$. So to compute
it, we must consider a particular value of $x$. Given the range of scores, we might
look at pre scores of 50 and 90 (near the low end and near the high end). That will
give us a sense of the range of the treatment effect over the bulk of the sample data.
```{r ch25-ec-pre-dist}
gf_histogram(~ pre, data = Grade2)
```
```{r ch25-ec-mod3-treatment-effect}
post3 <-
mod3 %>% posterior() %>%
mutate(
teffect_lo = b_grouptreatment + `b_pre:grouptreatment` * 50,
teffect_hi = b_grouptreatment + `b_pre:grouptreatment` * 90
)
gf_dens(~teffect_lo, data = post3, color = ~"lo") %>%
gf_dens(~teffect_hi, data = post3, color = ~"hi")
plot_post(post3$teffect_lo, quietly = TRUE)
plot_post(post3$teffect_hi, quietly = TRUE)
```
### Comparing models
Here are two comparison models that do not include the interaction.
```{r ch25-ec-models1-2, results = "hide", cache = TRUE}
mod2 <- update(mod3, post ~ pre + group)
mod1 <- update(mod3, post ~ group)
```
```{r}
mod1
mod2
```
Treatment effects for these models can be read directly off the summaries above
by looking at the `grouptreatment` parameter.
The posterior distributions of the treatment effects for models 2 and 3 have
similar peaks, but model 2 gives a more precise estimate (and is more sure that
the effect is positive). One reason for this is that there is strong correlation
between the parameters in the posterior distribution for model 3.
This makes it difficult to estimate individual parameters since one parameter can
be (nearly equivalently) exchanged for another.
```{r ch25-ec-mod3-pairs}
mcmc_pairs(as.mcmc(mod3), regex_pars = "b_", grid_args = list(top="Model 3"))
mcmc_pairs(as.mcmc(mod2), regex_pars = "b_", grid_args = list(top="Model 2"))
```
The treatment effect
for model 1 has a higher posterior mean, but also a much wider distribution.
Model 1 seems to be missing a key piece of the puzzle (as we see in model 2), and
Model 3 seems more challenging to interpret. Let's see what our elpd estimates
say about the models.
```{r}
loo_compare(loo(mod1), loo(mod2), loo(mod3))
```
### Comparing Treatment Effects by City
The contrast of interest here is
$$
(\mu_{YT} - \mu_{YC}) - (\mu_{FT} - \mu_{FC})
$$
We can only estimate this in a model that doesn't force this to be 0.
In the model `post ~ pre + city*group`, this will be
$$
\begin{align*}
(\mu_{YT} - \mu_{YC}) - (\mu_{FT} - \mu_{FC})
&= \left[ (\beta_0 + \beta_1 x + \beta_Y + \beta_T + \beta_{Y:T}) -
(\beta_0 + \beta_1 x + \beta_Y) \right] \\
&\quad - \left[ (\beta_0 + \beta_1 x + \beta_T) -
(\beta_0 + \beta_1 x) \right] \\
&= \beta_{Y:T}
\end{align*}
$$
So we can assess this by looking at the posterior for the interaction term in the model.
If we drop the interaction term, we will be unable to estimate this because the model
will force the difference to be 0.
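Here is a sketch of how we might fit that model and inspect the interaction
term (the parameter name below assumes `city` is coded with Fresno as the
reference level; check the actual names in `posterior(mod4)` first).
```{r ch25-ec-city-group, eval = FALSE}
# fit the model with a city-by-group interaction, reusing the compiled model
mod4 <- update(mod3, post ~ pre + city * group)
post4 <- posterior(mod4)
# the posterior for the interaction term is the contrast of interest
plot_post(post4$`b_cityYoungstown:grouptreatment`, quietly = TRUE)
```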