---
output:
pdf_document: default
html_document: default
---
```{r ch20-setup, include = FALSE}
library(CalvinBayes)
library(brms)
library(ggformula)   # for gf_point(), gf_line()
```
# Multiple Nominal Predictors
## Crop Yield by Till Method and Fertilizer
The data in `CalvinBayes::SplitPlotAgri` are from an agricultural study in which
different tilling methods and different fertilizers were used and the crop yield (in
bushels per acre) was subsequently measured.
```{r ch20-splitplot-plot}
gf_point(Yield ~ Fert | ~ Till, data = SplitPlotAgri, alpha = 0.4, size = 4)
```
Here are two models. See if you can figure out what they are.
(How can you use R to check if you are correct?)
* What parameters does each model have?
* Write a formula that describes each model. Be sure to clarify what the variables
mean.
* How would you use each model to estimate the mean yield when using ridge tilling
and deep fertilizer? (Imagine that you already have the posterior distribution in
hand.)
```{r ch20-fert1, results = "hide", cache = TRUE}
fert1_brm <-
brm(Yield ~ Till + Fert, data = SplitPlotAgri)
```
\vfill
```{r ch20-fert2, results = "hide", cache = TRUE}
fert2_brm <-
brm(Yield ~ Till * Fert, data = SplitPlotAgri)
```
\vfill
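One way to check your answers in R (a brief sketch, not part of the original exercise)
is to inspect the design matrix that R builds from each formula; `model.matrix()` is
base R and shows exactly which indicator (and interaction) columns each model uses.
```{r ch20-model-matrix}
# indicator variables used by the additive model
head(model.matrix(Yield ~ Till + Fert, data = SplitPlotAgri))
# indicator variables plus interaction columns used by the second model
head(model.matrix(Yield ~ Till * Fert, data = SplitPlotAgri))
```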
In each of these models, the response (yield) is normally distributed
around a mean value that depends on the type of fertilizer and tilling method
used:
\begin{align*}
Y_i &\sim \mu_i + {\sf Norm}(0, \sigma) \\
Y_i &\sim {\sf Norm}(\mu_i, \sigma)
\end{align*}
In model 1, the two nominal predictors are converted into indicator variables:
\begin{align*}
x_1 &= [\![ \mathrm{Till} = \mathrm{Moldbrd} ]\!] \\
x_2 &= [\![ \mathrm{Till} = \mathrm{Ridge} ]\!] \\
x_3 &= [\![ \mathrm{Fert} = \mathrm{Deep} ]\!] \\
x_4 &= [\![ \mathrm{Fert} = \mathrm{Surface} ]\!] \\
\end{align*}
So the model becomes (omitting the subscripted $i$):
\begin{align*}
\mu &= \beta0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4
\\
&=
\beta_0 +
\beta_1 [\![ \mathrm{Till} = \mathrm{Moldbrd} ]\!] +
\beta_2 [\![ \mathrm{Till} = \mathrm{Ridge} ]\!]
\beta_3 [\![ \mathrm{Fert} = \mathrm{Deep} ]\!] +
\beta_4 [\![ \mathrm{Fert} = \mathrm{Surface} ]\!] +
\end{align*}
We can visualize this in a tabular form as
Fert / Till | Chisel | Moldbrd | Ridge
----------- | ------ | ------- | -----
Broad   | $\beta_0$ | $\beta_0 + \beta_1$ | $\beta_0 + \beta_2$
Deep    | $\beta_0 + \beta_3$ | $\beta_0 + \beta_1 + \beta_3$ | $\beta_0 + \beta_2 + \beta_3$
Surface | $\beta_0 + \beta_4$ | $\beta_0 + \beta_1 + \beta_4$ | $\beta_0 + \beta_2 + \beta_4$
```{r ch20-fert1-summary}
fert1_brm
```
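As a sketch of how the earlier question could be answered for the additive model:
the mean yield with ridge tilling and deep fertilizer is $\beta_0 + \beta_2 + \beta_3$,
so we can add up the corresponding columns of the posterior samples. (The coefficient
names below, such as `b_TillRidge` and `b_FertDeep`, are brms's defaults; confirm them
against the summary above.)
```{r ch20-fert1-ridge-deep}
post1 <- posterior_samples(fert1_brm)
# assumes brms's default coefficient names; check the model summary above
mu_ridge_deep <- post1$b_Intercept + post1$b_TillRidge + post1$b_FertDeep
c(mean = mean(mu_ridge_deep), quantile(mu_ridge_deep, c(0.025, 0.975)))
```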
Note that this model implies that the difference in yield between any two
fertilizers is the same for each of the three tilling methods and that the
difference due to tilling method is the same for each of the three fertilizers.
This may not be a reasonable assumption. Perhaps some fertilizers work better
with certain tilling methods than with others. Model 2 allows for this.
The interaction (`Till * Fert`) creates additional variables of the form
$x_i x_j$ where $i = 1$ or $2$ and $j = 3$ or $4$.
For example,
\begin{align*}
x_1 x_3 &=
[\![ \mathrm{Till} = \mathrm{Moldbrd} ]\!] \cdot
[\![ \mathrm{Fert} = \mathrm{Deep} ]\!] \\
& =
[\![ \mathrm{Till} = \mathrm{Moldbrd} \mathrm{\ and \ }
\mathrm{Fert} = \mathrm{Deep} ]\!]
\end{align*}
If we let $\beta_{i:j}$ be the coefficient on $x_i x_j$, then our table for
$\mu$ becomes
Fert / Till | Chisel | Moldbrd | Ridge
----------- | ------ | ------- | -----
Broad   | $\beta_0$ | $\beta_0 + \beta_1$ | $\beta_0 + \beta_2$
Deep    | $\beta_0 + \beta_3$ | $\beta_0 + \beta_1 + \beta_3 + \beta_{1:3}$ | $\beta_0 + \beta_2 + \beta_3 + \beta_{2:3}$
Surface | $\beta_0 + \beta_4$ | $\beta_0 + \beta_1 + \beta_4 + \beta_{1:4}$ | $\beta_0 + \beta_2 + \beta_4 + \beta_{2:4}$
```{r ch20-fert2-summary}
fert2_brm
```
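For the interaction model, the same cell mean also involves $\beta_{2:3}$. Rather than
tracking down the name of the interaction coefficient, one option (a sketch, not from
the original text) is to let `fitted()` do the bookkeeping:
```{r ch20-fert2-ridge-deep}
# posterior mean and 95% interval for the mean yield when Till = Ridge and Fert = Deep
fitted(fert2_brm, newdata = data.frame(Till = "Ridge", Fert = "Deep"))
```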
In this model, there are no dependencies among the various group means, and the
interaction parameters ($\beta_{i:j}$) measure how much that extra flexibility matters.
(If they are close to 0, then this model will behave very much like the additive model.)
As before, we can opt to fit the model without an intercept. This produces a
different parameterization of the same model.
```{r ch20-fert2a, results = "hide", cache = TRUE}
fert2a_brm <-
brm(Yield ~ 0 + Till * Fert, data = SplitPlotAgri)
```
```{r ch20-fert2a-summary}
fert2a_brm
```
### What does $\sigma$ represent?
In each of these models, $\sigma$ is the standard deviation of yield for
all plots (not just those in our data) with a given combination of fertilizer
and tilling method. These models specify that the standard deviation is the same
in each of these groups (but we could modify that assumption and estimate
separate standard deviations in each group if we wanted). The estimate of $\sigma$
is a bit smaller for the model with interaction because the added flexibility
allows the estimated group means to track the data more closely.
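If we did want group-specific standard deviations, brms supports this through a
distributional model in which `sigma` gets its own formula via `bf()`. A minimal
sketch (this model is not fit or discussed further in this chapter, and the name
`fert2b_brm` is made up here):
```{r ch20-fert2b, eval = FALSE}
fert2b_brm <-
  brm(
    bf(Yield ~ Till * Fert,
       sigma ~ Till * Fert),   # sigma is modeled on the log scale by default
    data = SplitPlotAgri
  )
```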
\newpage
## Split Plot Design
There is a bit more to our story. The study used 33 different fields. Each field
was divided into 3 sections and a different fertilizer was applied to each of the
three sections. (Which fertilizer was used on which section was determined at random.)
This is called a "split-plot design" (even if it is applied to things that are not
fields of crops).
It would have been possible to divide each field into 9 sub-plots and use all
combinations of tilling and fertilizer, but that's not how this study was done.
The tilling method was the same for the entire field -- likely because it was
much more efficient to plow the fields this way.
The plot below indicates that different fields appear to have different baseline
yields since the dots associated with one field tend to be near the top or bottom
of each of the fertilizer clusters. We can add an additional variable to our
model to handle this situation.
```{r ch20-splitplot-plot-with-fields}
gf_point(Yield ~ Fert | ~ Till, data = SplitPlotAgri, alpha = 0.4, size = 4) %>%
gf_line(group = ~Field)
```
```{r ch20-fert3, results = "hide", cache = TRUE}
fert3_brm <-
# the use of factor() is important here because the field ids are numbers
# factor converts this into a factor (ie, a nominal variable)
brm(Yield ~ Till * Fert + factor(Field), data = SplitPlotAgri)
```
```{r ch20-fert3-summary}
fert3_brm
```
That's a lot of output. And the model is performing badly. Fortunately,
we don't really want this model anyway.
We now have an adjustment for each field, and there
were 33 fields. But we are not really interested in predicting
the yield *for a given field*. Our primary interest is in which fertilizers
and tilling methods work well. We hope our results apply generally to all fields.
So field plays a different role in this study.
We are only comparing 3 fertilizers and 3 tilling methods,
but there are many more fields than the 33 in our study. Those 33 are intended to
be representative of all fields (and of their variable quality for producing large
yields).
If we think that field quality might be described by a normal distribution (or
some other distribution), we might be more interested in the parameters of that
distribution than in the specific estimates for the particular fields in this
study. The kind of model we want for this is called a **hierarchical** or **multi-level** model, and `brm()` makes it easy to describe such a model.
Here's a way to think about such a model:
* Each field has a baseline productivity.
* The baseline productivities are normally distributed with some mean and standard
deviation that tell us about the distribution of productivity among
fields. Our 33 fields should help us estimate this distribution.
* That baseline productivity can be adjusted up or down depending on the
tilling method and fertilizer used.
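In symbols, the model we are about to fit looks roughly like this (a sketch using the
indicator variables defined earlier; the interaction terms and the priors that brms
chooses are not written out):
\begin{align*}
Y_i &\sim {\sf Norm}(\mu_i, \sigma) \\
\mu_i &= \beta_0 + b_{\mathrm{field}[i]} +
  \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} + \cdots \\
b_{f} &\sim {\sf Norm}(0, \sigma_{\mathrm{field}})
\end{align*}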
In `brm()` lingo, the effect of field is to adjust the intercept, so we can write
it like this:
```{r ch20-fert4, results = "hide", cache = TRUE}
fert4_brm <-
brm(Yield ~ Till * Fert + (1 | Field), data = SplitPlotAgri)
```
We can see in the output below that the variability from field to field is
estimated by a standard deviation of roughly 8 to 15. Individual field
estimates are hidden in this report, but you can see them if you type
`stanfit(fert4_brm)`.
```{r ch20-fert4-summary}
fert4_brm
```
The three groupings of parameters shown are
* group-level effects
This is where we find the standard deviation associated with
`Field`. We are interested in the fields as a group, not as individual
fields.
* population-level effects
The parameters for `Till` and `Fert` go here.
* family-specific parameters
This is where we find parameters associated with the "noise" of the model,
in this case the standard deviation of the normal distribution.
If we used a t-distribution, we would find `nu` and `sigma` here.
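A brief aside (not in the original text): `ranef()` is a standard brms accessor that
returns posterior summaries of the individual field adjustments mentioned above, one
per field.
```{r ch20-fert4-ranef}
# posterior summaries of the field-by-field intercept adjustments
ranef(fert4_brm)
```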
## Which model should we use?
### Modeling Choices
Now that we are able to create more and more kinds of models, model selection
is going to become a bigger issue.
Here are just some of the choices we now have when constructing a model.
1. What variables?
If our primary goal is to study the association between the
response and certain predictor variables, we need to include
those variables in the model.
But additional variables can be helpful if they explain some of the
variation in the response variable in a way that makes it easier
to see the association with the variables of interest.
Since we have information about fields, and it seems plausible that
productivity varies by field, we prefer models that include `Field`.
Similarly, even if we were only interested in one of fertilizer or tilling
method, it may be useful to include both.
We might wish for some additional variables for our study of crop yields,
perhaps additional information about the fields (soil type, hilly or flat,
watershed or basin, the previous year's crop, etc.).
Any of these might help explain the variation from field
to field.
But adding too many variables can actually make things worse!
* If variables are correlated in our data (collinearity of predictors),
including both will usually make our posterior distributions
for the associated parameters much wider.
* Additional variables can lead to over-fitting. Additional variables
will always make our model fit the current data set better, but eventually
we will begin fitting the idiosyncrasies of our particular data set
rather than patterns that are likely to extend to new observations.
2. Interaction terms?
When we use more than one predictor, we need to decide whether to include
interaction terms. Interaction is important to include if we are open to
the possibility that the "effect" of one variable on the response
may depend on the value of some other variable.
3. Noise distribution?
Normal distributions are traditional and relatively easy to interpret.
T distributions are more robust against unusual observations or
heavy-tailed distributions. Both of these are symmetric distributions.
If we expect or see evidence of a lack of symmetry, we may need to
use transformations of the data or other families of distributions.
4. What priors?
Bayesian inference adds another layer: prior selection. For most of our
models we have used "weakly informative priors". These priors avoid parameter
values that are impossible (negative values for standard deviations, for
example) and provide crude order-of-magnitude guidance. They can also
have a mild regularizing effect (shrinking parameter estimates toward 0,
for example, to guard against over-fitting). If more information is
available, it is possible to use more informative priors.
Choice of prior matters more for small data sets than for large data sets.
This makes sense both mathematically and intuitively. Intuitively,
if we don't have much new data, it won't change our beliefs much from what
they were before. But if we have a lot of data, we will come to roughly
the same conclusion no matter what we believed before.
5. Multi-level?
Are the values of nominal variables in our data exhaustive of the
possibilities (or of our interests)? Or are they just representative
of a larger set of possible values?
In the latter case, a multi-level model may be called for.
To clarify this, it can be good to imagine expanding the data set. If you were
to collect additional data, would the variables take on new values?
In our crop yield study, adding more data would require using new fields,
but we could still use the same three fertilizers and same three tilling
methods. Furthermore, the three tilling methods selected are not likely
representative of some distribution of tilling methods the way the fields
studied might be representative of many fields in a given region.
These are only some of the questions we need to answer when constructing a model.
But how do we decide? Part of the decision is based on things we know or believe
in advance. Our model may be designed to reflect a theory about how data
are generated or may be informed by other studies that have been done in similar
situations. But there are also ways to investigate and compare models.
### Measuring a Model -- Prediction Error
#### Prediction vs. Observation
One way to measure how well a model is working is to compare the predictions
the model makes for the response variable $\hat y_i$ to the observed
response values in the data $y_i$. To simplify things, we would like to convert
these $n$ predictions and $n$ observations into a single number.
If you have taken a statistics course before, you may have done this using
**Sum of Squared Errors** (SSE) or **Mean Squared Error** (MSE).
\begin{align*}
SSE & = \sum_{i = 1}^n (y_i - \hat y_i)^2 \\
MSE & = \frac{1}{n} SSE = \frac{1}{n} \sum_{i = 1}^n (y_i - \hat y_i)^2
\end{align*}
If you are familiar with $r^2$, it is related to MSE:
\begin{align*}
SSE &= \sum_{i = 1}^n (y_i - \hat y_i)^2 \\
SST &= \sum_{i = 1}^n (y_i - \overline{y})^2 \\
r^2 &= 1 - \frac{SSE}{SST}
\end{align*}
We are working with Bayesian models, so $SSE$, $MSE$ and $r^2$ have posterior
distributions, since they depend on (the posterior distribution of) $\theta$.
Each posterior value of $\theta$ leads to a value of $\hat y_i = E(y_i \mid \theta)$,
and that in turn leads to values of $SSE$, $MSE$, and $r^2$.
Putting that all together to highlight the dependence on $\theta$, we get
$$MSE = \frac{1}{n} \sum_{i = 1}^n (y_i - E(y_i \mid \theta))^2$$
The intuition behind all three quantities is that model fit can be measured
by how close the model prediction $\hat y_i$ is to the observed response $y_i$.
$MSE$ adjusts for sample size to make it easier to compare values
across data sets of different sizes. $r^2$ makes a further normalization to
put things on a 0-1 scale. (1 is a perfect fit. 0 means the model always gives
the same prediction, so it isn't doing anything useful.)
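As a concrete sketch (my own, not from the original text), here is one way to compute
the posterior distribution of $MSE$ for one of our models, together with brms's
built-in Bayesian $R^2$:
```{r ch20-post-mse}
# S x n matrix of E(y_i | theta^s), one row per posterior draw
yhat <- fitted(fert1_brm, summary = FALSE)
# squared residuals for each draw, averaged over the n observations
MSE <- rowMeans(sweep(yhat, 2, SplitPlotAgri$Yield)^2)
quantile(MSE, c(0.025, 0.5, 0.975))
# brms also provides a Bayesian version of R^2
bayes_R2(fert1_brm)
```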
#### (Log) predictive density
Another option is to compute **log predictive density** (lpd):
$$\mathrm{lpd}(\theta; y) = \log p(y \mid \theta)$$
Once again, $y$ is fixed, so this is a function of $\theta$.
In fact, it is just the log likelihood function.
For a given value of $\theta$,
lpd measures (on a log scale) the probability of observing the data.
A larger value indicates a better fit.
Once again, because lpd is a function of $\theta$,
it also has a posterior distribution.
Assuming that the values of $y$ are independent given the parameters
(and the predictor values $x$), this can be written as
$$
\mathrm{lpd}(\theta; y)
= \log p(y \mid \theta)
= \log \prod_{i = 1}^n p(y_i \mid \theta)
= \sum_{i = 1}^n \log p(y_i \mid \theta)
$$
In this case, we can compute the log predictive density pointwise and add.
In practice, this is often done even when independence does not hold. So
technically we are working with the **log pointwise predictive density**:
$$
\mathrm{lppd}(\theta; y)
= \sum_{i = 1}^n \log p(y_i \mid \theta)
$$
As with $SSE$, $MSE$, and $r^2$ this assigns a score to each $i$ and then sums over
those scores.
For linear models with normal noise and uniform priors,
lpd is a (decreasing) linear function of $MSE$ (and of $SSE$).
[^20-1]
[^20-1]: In this case, the posterior and the likelihood are the same, and
the noise distribution is normal.
The relationship can be confirmed by observing that the "interesting part"
of $\log p(y_i \mid \theta)$ is $(y_i - \hat y_i)^2$.
#### Predictors
In the notation above, we have been hiding the role of predictors $x$ (and we will
continue to do so below).
A model with predictors makes different predictions depending on the
values of the predictors. In all our examples, $x$ will be fixed, but we could
include it in the notation if we wanted. For example,
$$
\mathrm{lpd}(\theta; y, x)
= \log p(y \mid \theta, x)
$$
#### Numbers from distributions
We can convert a measure $\mathrm{lpd}(\theta; y)$, which depends on $\theta$,
into a single number in several ways. We illustrate two of them below.
1. We could replace $\theta$ with a particular number $\hat \theta$.
($\hat \theta$ might be the mean, median, or mode of the posterior distribution
or the mode of the likelihood function, for example).
If we do this we get the number
$$
\mathrm{lpd}(\hat \theta; y)
= \log p(y \mid \hat\theta)
= \sum_{i = 1}^n \log p(y_i \mid \hat\theta)
$$
This is sometimes called a "plug-in" estimate since we are plugging in a single
number for $\theta$.
2. Instead of summarizing $\theta$ with a single number, we could summarize
$p(y_i \mid \theta)$ with a single number by averaging over the posterior
sample values $p(y_i \mid \theta^s)$. ($\theta^s$ denotes the value of
$\theta$ in row $s$ of our $S$ posterior samples.)
If we sum over $i$, we get the **log pointwise predictive density** (lppd):
\begin{align}
\mathrm{lppd}
&\approx
\sum_{i = 1}^n \log \left( \frac{1}{S} \sum_{s = 1}^S p(y_i \mid \theta^s)\right)
\end{align}
This is an approximation because our posterior samples are only an approximation
to the true posterior distribution. But if the effective sample size of the
posterior is large, this approximation should be very good. (See the sketch just
below for how to compute it from a fitted brms model.)
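Here is a minimal sketch (mine, not from the text) of this computation for one of our
models; `log_lik()` from brms returns an $S \times n$ matrix whose $(s, i)$ entry is
$\log p(y_i \mid \theta^s)$:
```{r ch20-lppd}
ll <- log_lik(fert1_brm)              # S x n matrix of log p(y_i | theta^s)
lppd <- sum(log(colMeans(exp(ll))))   # average over draws, take logs, sum over observations
# (for very small likelihoods, a log-sum-exp version would be numerically safer)
lppd
```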
<!-- 1. Posterior predictive checks. -->
<!-- If our model is reflective of the way the data were generated, then we should be -->
<!-- able to use it to generate new data that is similar to the actual data. Posterior -->
<!-- predictive checks of various sorts can be use to investigate. -->
Unfortunately, both of these measures ($MSE$ and log predictive density)
have a problem.
They measure how well the model fits the data used to fit the model, but
we are more interested in how well the model might fit new data
(generated by the same random process that generated the current data).
Using them to choose a model leads to **overfitting** and favors **larger, more complex
models**, since the extra flexibility of these models makes it easier for them to
"fit the data".
<!-- * We aren't really interested in how well our model fits the current data, but how -->
<!-- well it would fit other data (new data to be collected in the future or hypothetical -->
<!-- data that could have been collected). -->
<!-- * It may be better to incorporate elements of multiple models into a final -->
<!-- model than to simply choose one of them. -->
### Out-of-sample prediction error
More interesting would be to measure how well the models would fit **new data**.
This is referred to as **out-of-sample prediction**, in contrast to
**in-sample prediction**.
So let's consider how well our model predicts new data $\tilde y$ rather than
the observed data $y$:
<!-- \mathrm{E}_{\mathrm{post}}(\mathrm{lpd}) -->
$$
\mathrm{lpd}(\theta; \tilde y)
= \log p(\tilde y \mid \theta)
= \log \prod_{i = 1}^n p(\tilde y_i \mid \theta)
= \sum_{i = 1}^n \log p(\tilde y_i \mid \theta)
$$
which we can convert into a single number by plugging in a point estimate for $\theta$
or by averaging over the posterior, as before.
<!-- $$ -->
<!-- \mathrm{lpd}(\hat \theta; \tilde y) -->
<!-- = \log p(\tilde y \mid \hat \theta) -->
<!-- = \log \prod_{i = 1}^n p(\tilde y_i \mid \hat \theta) -->
<!-- = \sum_{i = 1}^n \log p(\tilde y_i \mid \hat \theta) -->
<!-- $$ -->
<!-- or by averaging over the posterior distribution -->
<!-- $$ -->
<!-- \log p_{\mathrm{post}}(\tilde y) -->
<!-- = -->
<!-- \log \mathrm{E}_{\mathrm{post}}(p(\tilde y \mid \theta)) -->
<!-- = -->
<!-- \sum_{i = 1}^n -->
<!-- \log \left(\frac{1}{S} \sum_{s = 1}^S p(\tilde y_i \mid \theta_i)\right) -->
<!-- $$ -->
And since $\tilde y$ is not fixed (like $y$ was), we take an additional
step and compute the expected value (average) of this quantity over the
distribution of $\tilde y_i$ to get the **expected log (pointwise) predictive density** for a new response $\tilde y_i$:[^20-2]
[^20-2]: Technically, we should be computing the average using an integral
instead of averaging over our posterior samples. But since this is a quantity
we can't compute anyway, I've expressed this in terms of an average over
our posterior samples.
$$
\mathrm{elppd}
=
\mathrm{E}\left(\sum_{i = 1}^n \log p_{\mathrm{post}}(\tilde y_i)\right)
\approx
\sum_{i = 1}^n \mathrm{E}\left(\log
\frac{1}{S} \sum_{s = 1}^S p(\tilde y_i \mid \theta^s)\right)
$$
This expected value is taken over the true distribution of $\tilde y_i$ (which is a
problem, stay tuned.)
### Approximating out-of-sample prediction error
What we would ideally want (elppd), we cannot compute
since it requires us to know the distribution of out-of-sample data
($\tilde y_i$).
This leads us to the following impossible set of goals for our ideal
measure of model (predictive) performance (borrowed from [@Gelman:2014]):
* an **unbiased** and **accurate** measure
* of **out-of-sample prediction** error (elppd)
* that will be valid over a **general** class of models,
* and that **requires minimal computation** beyond that needed to fit the model
in the first place.
Here are three approaches to solving this problem.
1. Use within-sample predictive accuracy.
But this isn't ideal since it overestimates the performance of the model
(and more so for more complicated models).
2. Adjust within-sample predictive accuracy.
Within-sample predictive accuracy will over-estimate out-of-sample predictive
accuracy. If we knew (or could estimate) by how much, we could adjust
by that amount to eliminate (or reduce) the bias. Quantities like
AIC (Akaike's information criterion),
DIC (deviance information criterion),
and WAIC (widely applicable information criterion)
take the approach of subtracting something from lppd that
depends on the complexity of the model.
3. Use cross-validation
The main idea here is to use some of the data to fit the model and the rest
of the data to evaluate prediction error. This is a poor person's version of
"out-of-sample". We will focus on **leave one out** (LOO) cross validation
where we fit the model $n$ times, each time leaving out one row of the data
and using the resulting model to predict the removed row.
If we really needed to recompute the model $n$ times,
this would be too computationally expensive for large data sets and complex
models. But there are (more) efficient
approximations to LOO-cv that make it doable. They are based on the idea
that the posterior distribution using $y_{(-i)}$ (all but row $i$ of the data)
should usually be similar to the posterior distribution using $y$ (all of the data).
So we can recycle the work done to compute our original posterior.
The result is only an approximation, and it doesn't always work well,
so sometimes we have to recreate the posterior from scratch, at least for
some of the rows.
The formulas for the **estimated out-of-sample predictive density** are
\begin{align*}
\widehat{\mathrm{elppd}}_{\mathrm{AIC}}
&= \mathrm{lpd}(\hat\theta_{\mathrm{mle}}, y) - p_{\mathrm{AIC}} \\
\widehat{\mathrm{elppd}}_{\mathrm{DIC}}
&= \mathrm{lpd}(\hat\theta_{\mathrm{Bayes}}, y) - p_{\mathrm{DIC}} \\
\widehat{\mathrm{elppd}}_{\mathrm{WAIC}}
&= \mathrm{lppd} - p_{\mathrm{WAIC}} \\
\widehat{\mathrm{elppd}}_{\mathrm{LOO}}
&= \sum_{i=1}^n \log p_{\mathrm{post}(-i)}(y_i)
\approx \sum_{i=1}^n \log \left( \frac{1}{S} \sum_{s = 1}^S p(y_i \mid \theta^{is})\right)
\end{align*}
and the associated **effective numbers of parameters** are
\begin{align*}
p_{\mathrm{AIC}} &= \mbox{number of parameters in the model}\\
p_{\mathrm{DIC}} &= 2 \mathrm{var}_{\mathrm{post}}(\log p(y \mid \theta)) \\
p_{\mathrm{WAIC}} &= \sum_{i = 1}^n \mathrm{var}_{\mathrm{post}}(\log p(y_i \mid \theta)) \\
p_{\mathrm{LOO}} &= \mathrm{lppd} - \widehat{\mathrm{elppd}}_{\mathrm{LOO}} \\
\end{align*}
Notes
1. $\theta^{is}$ is the value of $\theta$ in row $s$ of the posterior distribution
*when row $i$ has been removed from the data*. What makes LOO practical is
that this can be approximated without refitting the model $n$ times.
1. AIC and DIC differ from WAIC and LOO in that they use a point estimate
for $\theta$ (the maximum likelihood estimate for AIC and the
mean of the posterior distribution for DIC) rather than using the
full posterior distribution.
1. AIC penalizes a model by 1 for each parameter. This is correct for linear
models with normal noise and uniform priors, but is not correct in general.
You can think of DIC and WAIC as estimating the effective number of
parameters by looking at how much variation there is in $\log(p(y_i \mid \theta))$.
The more this quantity changes with changes in $\theta$, the more flexible
the model is (and the more it should be penalized).
1. LOO doesn't work by adjusting for an estimated number of parameters;
it attempts to estimate elppd directly.
But we can reverse engineer things to get an estimated number of parameters
by taking the difference between
the (estimated) within-sample and out-of-sample predictive density.
1. LOO and WAIC are asymptotically equivalent (that is, they give more
and more similar values as the sample size increases), but LOO typically
performs a bit better on small data sets, so the authors of the loo package
recommend LOO over WAIC as the go-to measure for comparing models.
1. Historically, information criteria have been expressed on the "deviance scale".
To convert from the log predictive density scale to the deviance scale, we multiply by -2.
On the deviance scale, smaller is better.
On the log predictive density scale, larger is better
(but the values are usually negative). The `waic()` and `loo()` functions
compute both values.
1. The output from `loo()` and `waic()` labels things elpd rather than elppd.
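For example, comparing two of this chapter's models might look like the sketch below
(these particular calls are not part of the chapter's analysis; `loo()` and
`loo_compare()` come from the loo package and are available through brms):
```{r ch20-loo-compare, eval = FALSE}
fert2_loo <- loo(fert2_brm)           # interaction model without field effects
fert4_loo <- loo(fert4_brm)           # multi-level model with field effects
loo_compare(fert2_loo, fert4_loo)     # models are listed best (largest elpd) first
```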
<!-- The problem is that we have used all of our data to create our -->
<!-- model. What to do? -->
<!-- a. Training and Testing -->
<!-- We could split our data into two portions. The first part (training data) -->
<!-- would be used to fit the model and the second (test data) would be used -->
<!-- to measure how well the model performs. Models that are overfitting may do -->
<!-- well on the training data but do poorly with test data. Other models may -->
<!-- not do as well with the training data, but might do better with the test data. -->
<!-- We should prefer the latter sort of model. -->
<!-- The downside is that we have lost a portion of our data for fitting purposes. -->
<!-- b. Cross validation. -->
<!-- Cross validation is a bit like Train/Test repeated many times. Each time we fit -->
<!-- the model to a portion of the data and see how it performs on the rest -->
<!-- of the data. LOO (leave one out) cross validation does this by leaving out -->
<!-- one row of the data and seeing how well a model fit to the remaining rows -->
<!-- predicts it. If we have $n$ rows of data, we can create $n$ leave-one-out -->
<!-- models. -->
<!-- One downside of cross validation is that it can be computationally expensive. -->
<!-- (That's a lot of bayesian models to fit.) To speed things up, approximations -->
<!-- are used. The loo R package provides tools for computing approximate LOO -->
<!-- cross-validation on Bayesian models. The method employed goes by the name -->
<!-- Pareto smoothed importance-sampling leave-one-out cross-validation -->
<!-- (PSIS-LOO). -->
<!-- c. Information criteria. -->
<!-- AIC (Aikeke's Information Criterion), -->
<!-- DIC (Deviance Information Criterion), -->
<!-- and WAIC (Widely Applicable Information Criterion) -->
<!-- are successively more complicated (and better) ways of estimating -->
<!-- out-of-sample prediction. As the names suggest, they are all based -->
<!-- on a concept called information. The loo R package also provides -->
<!-- methods for computing WAIC (but it recommends using LOO). -->
<!-- ## Measuring fit -->
<!-- ### $R^2$ -->
<!-- Using $R^2$ alone as a measure of fit has the problem that $R^2$ increases -->
<!-- as we add complexity to the model, which pushes us toward overfitting. -->
<!-- ```{r} -->
<!-- Brains.R2 <- -->
<!-- data_frame( -->
<!-- degree = 0:6, -->
<!-- R2 = sapply( list(m7, m1, m2, m3, m4, m5, m6), mosaic::rsquared) -->
<!-- ) -->
<!-- Brains.R2 -->
<!-- gf_point(R2 ~ degree, data = Brains.R2) %>% -->
<!-- gf_line(R2 ~ degree, data = Brains.R2, alpha = 0.5) %>% -->
<!-- gf_labs(x = "degree of polynomial", y = expression(R^2)) -->
<!-- ``` -->
<!-- There are ways to adjust $R^2$ to reduce this problem, but we are going to introduce -->
<!-- other methods of measuring fit. -->
<!-- ```{r} -->
<!-- gf_point(R2 ~ degree, color = ~ factor(degree), data = Brains.R2) %>% -->
<!-- gf_line(R2 ~ degree, data = Brains.R2, alpha = 0.5) %>% -->
<!-- gf_segment(R2 + 1 ~ degree + 7, color = ~ factor(degree), -->
<!-- data = Brains.R2, alpha = 0.3) %>% -->
<!-- gf_labs(x = "degree of polynomial", y = expression(R^2)) -->
<!-- ``` -->
<!-- ### Weather Prediction Accuracy -->
<!-- Consider the predictions of two weather people over the same set of 10 days. -->
<!-- Which one did a better job of predicting? How should we measure this? -->
<!-- * **First Weather Person:** -->
<!-- ![](images/weather1.png) -->
<!-- * **Second Weather Person:** -->
<!-- ![](images/weather2.png) -->
<!-- Last time we discussed some ways to compare which weather person makes the best predictions. -->
<!-- Here is one more: Given each weather person's "model" as a means of generating data, which -->
<!-- one makes the observed weather most likely? Now weather person 1 wins handily: -->
<!-- ```{r} -->
<!-- # WP #1 -->
<!-- 1^3 * 0.4^7 -->
<!-- # WP #2 -- no chance! -->
<!-- 0^3 * 1^7 -->
<!-- ``` -->
<!-- This has two advantages for us: -->
<!-- 1. This is just the likelihood, an important part of our Bayesian modeling system. -->
<!-- 2. It is based on joint probability rather than average probability. Weather person 2 is taking unfair advantage of average probability by making predictions we know are "impossible". -->
<!-- ## Shannon Entropy and related notions -->
<!-- Now let's take a bit of a detour on the road to another method of assessing the predictive -->
<!-- accuracy of a model. The route will look something like this: -->
<!-- $$ -->
<!-- \mbox{Information} \to \mbox{(Shannon) Entropy} \to \mbox{Divergence} \to \mbox{Deviance} \to \mbox{Inforation Criteria (DIC and WAIC)} -->
<!-- $$ -->
<!-- DIC (Deviance Information Criterion) and WAIC (Widely Applicable Information Criterion) are -->
<!-- where we are heading. For now, you can think of them as improvements to (adjusted) $R^2$ -->
<!-- that will work better for Bayesian models. -->
<!-- ### Information -->
<!-- Let's begin by considering the amount of -->
<!-- information we gain when we observe some random process. -->
<!-- Suppose that the event we observed has probability $p$. -->
<!-- Let $I(p)$ be the amount of information we gain from observing -->
<!-- this outcome. $I(p)$ depends on $p$ but not on the outcome itself, -->
<!-- and should satisfy the following properties. -->
<!-- 1. $I(1)$ = 0. -->
<!-- Since the outcome was certain, we didn't learn anything by observing. -->
<!-- 2. $I(0)$ is undefined. -->
<!-- We won't observe an outcome with probability 0. -->
<!-- 3. $I()$ is a decreasing function of $p$. -->
<!-- The more unusual the event, the more information we obtain when it occurs. -->
<!-- In particular, $I(p) \ge 0$ for all $p \in (0, 1]$. -->
<!-- 4. $I(p_1 p_2) = I(p_1) + I(p_2)$. -->
<!-- This is motivated by independent events. If we observe two independent -->
<!-- events $A_1$ and $A_2$ with probabilities $p_1$ and $p_2$, we can consider this as a single event with -->
<!-- probability $p_1 p_2$. -->
<!-- The function $I()$ should remind you of a function you have seen before. -->
<!-- Logarithms satisfy these properties 1, 2, and 4, but logarithms are increaseing -->
<!-- functions. We get the function we want if we define -->
<!-- $$ -->
<!-- I(p) = - \log(p) = \log(1/p) -->
<!-- $$ -->
<!-- We can choose any base we like: 2, $e$, and $10$ are common choices. -->
<!-- Our text chooses natural logarithms. -->
<!-- In can be shown that negative logarithms are the only -->
<!-- functions that have our desired properties. -->
<!-- ### Entropy -->
<!-- Now consider a random process $X$ with $n$ outcomes having probabilities -->
<!-- $\mathbf{p} = p_1, p_2, \dots, p_n$. That is, -->
<!-- $$ -->
<!-- P(X = x_i) = p_i, -->
<!-- $$ -->
<!-- The amount of information for each outcome depends on $p_i$. -->
<!-- The **Shannon entropy** (denoted $H$) is the average (usually called "expected") -->
<!-- amount of information gained from each observation of the random process: -->
<!-- $$ -->
<!-- H(X) = H(\mathrm{p}) = \mathrm{expected \ information} = \sum p_i \cdot I(p_i) = - \sum p_i \log(p_i) -->
<!-- $$ -->
<!-- Note that -->
<!-- * $H(X) \ge 0$ since $p_i \ge 0$ and $I(p_i) = - \log(p_i) \ge 0$. -->
<!-- * Outcomes with probability 0 must be removed from the list -->
<!-- (alternatively, we can treat $0 \log(0)$ as $0$ for the purposes of entropy. -->
<!-- Note: $\lim_{q \to \infty} q \log(q) = 0$, so this is a continuous extension.) -->
<!-- * $H(X)$, like $I(p_i)$ depends only on the probabilities, not on the outcomes themselves. -->
<!-- * $H$ is a **continuous** function. -->
<!-- * Among all distributions with a fixed number of outcomes, $H$ is **maximized** -->
<!-- when all outcomes are equally likely (for a fixed number of outcomes) -->
<!-- * among equiprobable distributions $H$ **increases as the number of outcomes increases**. -->
<!-- * $H$ is **additive** in the following sense: if $X$ and $Y$ are independent, then -->
<!-- $H(\langle X, Y\rangle) = H(X) + H(Y)$. -->
<!-- $H$ can be thought of as a measure of **uncertainty**. -->
<!-- Uncertainty decreases as we make observations. -->
<!-- * Consider a random variable that takes on only one value (all the time). -->
<!-- There is nothing uncertain, and $H(X) = 1 \cdot \log(1) = 0$. -->
<!-- ```{r, chunk6.9a} -->
<!-- p <- 1 -->
<!-- - sum(p * log(p)) -->
<!-- ``` -->
<!-- * A random coin toss has entropy 1 if we use base 2 logarithms. (In this case the unit is called a **shannon** -->
<!-- or a bit of uncertainty.) -->
<!-- Applied to a 50-50 coin we get: -->
<!-- ```{r, chunk6.9b} -->
<!-- p <- c(0.5, 0.5) -->
<!-- # one shannon of uncertainty -->
<!-- - sum(p * log2(p)) -->
<!-- ``` -->
<!-- * It is more common in statistics to use natural logarithms. -->
<!-- In that case, the unit for entropy is called a **nat**, -->
<!-- and the entropy of a fair coin toss is -->
<!-- ```{r, chunk6.9c} -->
<!-- p <- c(0.5, 0.5) -->
<!-- # uncertainty of a fair coin in nats -->
<!-- - sum(p * log(p)) -->
<!-- ``` -->
<!-- We can write a little function to compute entropy for cases where there are only -->
<!-- a finite number of outcomes. -->
<!-- ```{r} -->
<!-- H <- function(p, base = exp(1)) { -->
<!-- - sum(p * log(p, base = base)) -->
<!-- } -->
<!-- # in nats -->
<!-- H(c(0.5, 0.5)) -->
<!-- H(c(0.3, 0.7)) -->
<!-- # in shannons -->
<!-- H(c(0.5, 0.5), base = 2) -->
<!-- H(c(0.3, 0.7), base = 2) -->
<!-- ``` -->
<!-- ### Decrease in Entropy = Gained Information -->
<!-- Decreases in this uncertainty are gained information. -->
<!-- #### R code 6.9 -->
<!-- Applied to a 30-70 coin we get: -->
<!-- ```{r, chunk6.9} -->
<!-- p <- c(0.3, 0.7) -->
<!-- - sum(p * log(p)) -->
<!-- ``` -->
<!-- ### Divergence -->
<!-- Kullback-Leibler divergence compares two distributions and asks "if we are -->
<!-- anticipating \mathrm{q}, but get \mathrm{p}, how much more surprised will -->
<!-- we be than if we had been expecting \mathrm{p} in the first place?" -->
<!-- Here's the definition -->
<!-- $$ -->
<!-- D_{KL}(\mathrm{p}, \mathrm{q}) = -->
<!-- \mathrm{expected\ difference\ in\ ``surprise"} -->
<!-- = \sum p_i \left( I(q_i) - I(p_i) \right) -->
<!-- = \sum p_i I(q_i) - \sum p_i I(p_i) -->
<!-- $$ -->
<!-- This looks like the difference between two entropies. It alsmost is. -->
<!-- The first one is actually a **cross entropy** where we use probabilities -->
<!-- from one distribution and information from the other. We denote this -->
<!-- $$ -->
<!-- H(\mathbf{p}, \mathbf{q}) = \sum p_i I(q_i) = - \sum p_i \log(q_i) -->
<!-- $$ -->
<!-- Note that $H(\mathbf{p}) = H(\mathbf{p}, \mathbf{p})$, so -->
<!-- \begin{align*} -->
<!-- D_{KL} &= H(\mathrm{p}, \mathrm{q}) - H(\mathrm{p}) -->
<!-- \\ -->
<!-- &= -->
<!-- \sum p_i \log(p_i) - \sum p_i \log(q_i) -->
<!-- \\ -->
<!-- &= \sum p_i (\log(p_i) - \log(q_i)). -->
<!-- \end{align*} -->
<!-- #### R code 6.10 -->
<!-- ```{r, chunk6.10} -->
<!-- # fit model with lm -->
<!-- m1 <- lm(brain_size ~ body_mass, data = Brains) -->
<!-- # compute deviance by cheating -->
<!-- (-2) * logLik(m1) -->
<!-- ``` -->
<!-- #### R code 6.11 -->
<!-- ```{r, chunk6.11} -->
<!-- # standardize the body_mass before fitting -->
<!-- Brains <- -->
<!-- Brains %>% mutate(body_mass.s = zscore(body_mass)) -->
<!-- m8 <- map( -->
<!-- alist(brain_size ~ dnorm(mu, sigma), -->
<!-- mu <- a + b * body_mass.s), -->
<!-- data = Brains, -->
<!-- start = list( -->
<!-- a = mean(Brains$brain_size), -->
<!-- b = 0, -->
<!-- sigma = sd(Brains$brain_size) -->
<!-- ), -->
<!-- method = "Nelder-Mead" -->
<!-- ) -->
<!-- # extract MAP estimates -->
<!-- theta <- coef(m8); theta -->
<!-- # compute deviance -->
<!-- dev <- (-2) * sum(dnorm( -->
<!-- Brains$brain_size, -->
<!-- mean = theta[1] + theta[2] * Brains$body_mass.s, -->
<!-- sd = theta[3], -->
<!-- log = TRUE -->
<!-- )) -->
<!-- dev %>% setNames("dev") # setNames just labels the out put -->
<!-- -2 * logLik(m8) # for comparison -->
<!-- ``` -->
## Using loo
The loo package provides functions for computing WAIC and LOO estimates of
elpd (and their information criterion counterparts).
While the definitions are a bit involved,
using WAIC or LOO to compare models is relatively easy.
WAIC can be faster, but LOO performs better (according to the authors of
the loo package).
```{r ch20-waic-loo, cache = TRUE}
library(loo)
waic(fert4_brm)