# Stan
## Why Stan might work better
Stan is sometimes (but not always) better or faster than JAGS. The reason is
that the HMC (Hamiltonian Monte Carlo) algorithm that it uses avoids some of the
potential problems of the Metropolis algorithm and the Gibbs sampler.
You can think of HMC as a generalization of the Metropolis algorithm.
Recall that in the Metropolis algorithm

* There is always a current vector of parameter values (like the
current island in our story).
* A new vector of parameter values is proposed.
* The proposal is accepted or rejected by comparing the ratio
of the (unnormalized) posterior densities of the current and proposed vectors.
* It is important that we only need this ratio, because the normalizing
constant would be prohibitively expensive to compute.
The main change is in how HMC chooses its proposals. Recall that in
the basic Metropolis algorithm
* the proposal distribution is symmetric
* the proposal distribution is the same no matter where
the current parameter vector is.
This has some potential negative consequences:

* When the current position is in a region of relatively low posterior density,
the algorithm is as likely to propose moves that go farther from the mode
as toward it. This can be inefficient.
* The behavior of the algorithm can be greatly affected by the "step size"
(how likely the proposal is to be close to or far from the current position).
HMC addresses these by using a proposal distribution that
* Changes depending on the current position
* Is more likely to make proposals in the direction of the mode
Unlike Gibbs samplers, HMC is not guided by the fixed directions corresponding
to letting only one parameter value change at a time. This makes it easier for
HMC to navigate posteriors that have narrow "ridges" that don't follow one of
these primary directions, so Stan is less disturbed by correlations in the
posterior distribution than JAGS is.
The basic idea of the HMC sampler in Stan is to turn the log posterior
upside down, so that it is bowl-shaped with its mode at the "bottom", and to
imagine a small particle sliding along this surface after receiving a
"flick" to get it moving. If the flick is in the direction of the mode,
the particle slides downhill and travels relatively far.
If it is away from the mode, the particle may go
uphill for a while and then turn around.
(The same thing may happen if it travels in the direction of the mode and
overshoots.)
If the flick is in some other direction,
the particle will take a curved path that bends toward the mode.
A proposal is generated by specifying

* A direction and "force" for the flick (the momentum).
* The amount of "time" to let the particle move.
At the end of the specified amount of time, the particle will be at
the proposal position.
* A level of discretization used to simulate the motion of the particle.

In principle (i.e., by the physics), every proposal could be accepted
(as in the Gibbs sampler).
In practice, because the simulated movement is discretized into a
sequence of small segments, an acceptance rule is used that involves both the
ratio of the posterior values and the ratio of the momenta of the
current and proposed positions.
If the motion is simulated with many small segments, the proposal will nearly
always be accepted, but it will take longer to do the simulation. On the other
hand, if a cruder approximation is used, the simulation is faster, but the
proposal is more likely to be rejected. "Time" is represented by the product of
the number of steps used and the size of the steps: `steps * eps` (`eps` is short
for epsilon, but "steps times eps" has a certain ring to it). The step size (`eps`) is a
tuning parameter of the algorithm, and things seem to work most efficiently if
roughly 2/3 of proposals are accepted. The value of `eps` can be adjusted to
attain something close to this goal.
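
To make this more concrete, here is a small R sketch of a single HMC proposal
for a one-parameter model with a standard normal posterior. This is only an
illustration of the idea sketched above, not Stan's actual implementation
(Stan adds NUTS and many refinements); the function names, the toy posterior,
and the tuning values here are all invented for this example.

```{r ch14-hmc-sketch}
# Toy log posterior (standard normal) and its derivative.
log_post <- function(theta) dnorm(theta, 0, 1, log = TRUE)
grad_log_post <- function(theta) -theta

# One HMC proposal: flick the particle, simulate its motion in `steps`
# segments of size `eps`, then accept or reject.
hmc_step <- function(theta, eps = 0.1, steps = 20) {
  m0 <- rnorm(1)                    # the random "flick" (initial momentum)
  m <- m0
  theta_new <- theta
  for (i in 1:steps) {              # discretized simulation of the motion
    m         <- m + eps / 2 * grad_log_post(theta_new)
    theta_new <- theta_new + eps * m
    m         <- m + eps / 2 * grad_log_post(theta_new)
  }
  # accept/reject based on the posterior values and the momenta
  log_ratio <- (log_post(theta_new) - m^2 / 2) -
               (log_post(theta)     - m0^2 / 2)
  if (log(runif(1)) < log_ratio) theta_new else theta
}

# Iterating this step gives a (toy) HMC chain.
chain <- numeric(2000)
chain[1] <- 2
for (i in 2:2000) chain[i] <- hmc_step(chain[i - 1])
```

The values in `chain` should look like samples from the standard normal
"posterior"; changing `eps` and `steps` illustrates the tuning trade-offs
described above.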
Stan adds an extra bit to this algorithm. To avoid the inefficiency
of overshooting and turning around, it tries to estimate when this will
happen. The result is its No-U-Turn Sampler (NUTS). There are a number
of other features that contribute to the complexity of Stan, including

* Automatic differentiation to determine the gradient of the log posterior.
* Simulation techniques for the physics that minimize the inaccuracy
created by discretization.
* Techniques for dealing with parameters with bounded support.
* An initial warm-up phase that helps set the tuning parameters: step size (`eps`),
time (`steps * eps`), and the distribution from which to sample the
initial momentum.
Because of all these technical details, it is easy to see why Stan may be
much slower than JAGS in situations where JAGS works well. The flip side is
that Stan may work in situations where JAGS fails altogether or takes too
much time to be of practical use. Generally, as models become more complex,
Stan gains the advantage; for simple models, JAGS is often faster.
## Describing a model to Stan
Coding Stan models is also a bit more complicated, but what we have learned
about JAGS is helpful. RStudio also offers excellent support for Stan, so
we won't have to use tricks like writing a "function" in R that isn't really
R code to describe a JAGS model.
To use Stan in R we will load the `rstan` package and take advantage of
[RStudio's Stan chunks](https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started).
```{r ch14-stan-setup}
library(rstan)
rstan_options(auto_write = TRUE) # saves some compiling
```
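
Optionally, if your machine has multiple cores, you can also ask Stan to run
its chains in parallel. Nothing below requires this, but it can save time on
larger models:

```{r ch14-stan-cores, eval = FALSE}
options(mc.cores = parallel::detectCores())  # optional: run chains in parallel
```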
Stan descriptions have several sections (not all of which are required):
* data -- declarations of variables to hold the data
* transformed data -- transformations of data
* parameters -- declaration of parameters
* transformed parameters -- transformations of parameters
* model -- description of prior and likelihood
* generated quantities -- used to keep track of additional
values Stan can compute at each step.
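
These blocks, when they are used, must appear in the order listed above.
Here is a bare skeleton (with comments in place of real contents) just to show
the layout; the worked example below uses only the `data`, `parameters`, and
`model` blocks.

```
data {
  // declarations of the data variables
}
transformed data {
  // quantities computed once from the data
}
parameters {
  // declarations of the parameters to be sampled
}
transformed parameters {
  // quantities recomputed from the parameters at each step
}
model {
  // priors and likelihood
}
generated quantities {
  // additional values to record at each step
}
```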
Here is an example of a simple Stan model:
```{stan ch14-simple-stan, output.var = "simple_stan", cache = TRUE}
data {
int<lower=0> N; // N is a non-negative integer
int y[N]; // y is an array of N integers
}
parameters {
real<lower=0,upper=1> theta; // theta is between 0 and 1
}
model {
theta ~ beta (1,1);
y ~ bernoulli(theta);
}
```
See if you can figure out what this model is doing.
You will also see that Stan requires some extra information compared to
JAGS. In particular, we need to tell Stan which quantities are integers
and which are reals, and also whether there are any restrictions on their domains.
**Note:** running the chunk above takes a little while.
This is when Stan compiles the C++ code for the model and sets up the
machinery for computing the gradient (derivatives). The result is a
**dynamic shared object** (**DSO**).
To use this model in RStudio, put the code in a Stan chunk (one of the options
in the insert menu) and set the `output.var` chunk option to the name of the R
variable that will store the result. In this case, we have named it
`simple_stan` using `output.var = "simple_stan"`. Behind the scenes, RStudio is
calling `stan_model()` to pass information between R and Stan and to
get Stan to do the compilation.
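
If you ever want to do this compilation step yourself, outside of a Stan
chunk, the rough equivalent is a call like the one below (here
`simple_stan_code` is assumed to be a string containing the Stan program):

```{r ch14-stan-model-byhand, eval = FALSE}
# roughly what the Stan chunk does for us; simple_stan_code is assumed
# to be a character string holding the Stan program above
simple_stan <- stan_model(model_code = simple_stan_code)
```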
```{r ch14-simple-stan-look}
class(simple_stan) # what kind of thing is this?
simple_stan # let's take a look
```
We still need to provide Stan with some data and ask it to produce
some posterior samples. We do this with the `sampling()`
function. Because this is a separate step, we can use the same
compiled model with different data sets or different settings
(more iterations, for example) without having to recompile.
```{r ch14-simple-stanfit}
simple_stanfit <-
sampling(
simple_stan,
data = list(
N = 50,
y = c(rep(1, 15), rep(0, 35))
),
chains = 3, # default is 4
iter = 1000, # default is 2000
warmup = 200 # default is half of iter
)
```
The output below looks similar to what we have seen from JAGS.
```{r ch14-simple-stanfit-look}
simple_stanfit
```
There are a number of functions that can extract information from
stanfit objects.
```{r ch14-simple-stanfit-methods}
methods(class = "stanfit")
```
Unfortunately, some of these have the same names as functions elsewhere (in
the coda package, for example). We generally adopt an approach that
keeps things as similar as possible to what we did with JAGS.

* Use `CalvinBayes::posterior()` to create a data frame of posterior
samples. These can be plotted or explored using `ggformula` or other
familiar tools.
* Use `as.matrix()` or `as.mcmc.list()` to create an object that can
be used with `bayesplot` just as we did when we used JAGS.
```{r ch14-simple-stan-plots}
gf_dens(~theta, data = posterior(simple_stanfit))
simple_mcmc <- as.matrix(simple_stanfit)
mcmc_areas(simple_mcmc, prob = 0.9, pars = "theta")
mcmc_areas(as.mcmc.list(simple_stanfit), prob = 0.9, pars = "theta")
```
```{r ch14-simple-stan-diagplot, fig.height = 5}
diag_mcmc(as.mcmc.list(simple_stanfit))
```
## Sampling from the prior
In JAGS, to sample from the prior, we just "removed the data".
For any parameter values, the likelihood of not having any data if
we don't collect any data is 1, so the posterior is the same as the
prior.
Unlike JAGS, Stan does not allow missing data, so we need a different
way to sample from the prior. In Stan, we will remove the likelihood.
To understand why this works, let's think a little bit about how Stan
operates.
* All internal work is done on the log scale.
log prior, log likelihood, log posterior.
* Additive constants on the log scale
(multiplicative constants on the natural scale) don't matter...
... at least not for generating posterior samples,
so they can be ignored or chosen conveniently.
The distribution functions in Stan are really logs of kernels
with constants chosen to optimize efficiency of computation.
* log(posterior) = log(prior) + log(likelihood) + constant

So if we don't add in the likelihood part, we just get the prior back
as our "posterior".
A line like

```
y ~ bernoulli(theta);
```

is just telling Stan to add the log of the Bernoulli pmf for each value
of the data vector `y`, evaluated at the current value of `theta`. If we
comment out that line, no log likelihood will be added.
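
In fact, Stan lets us write this accumulation explicitly. Inside the model
block, the line above has the same effect (up to constant terms, which the
`~` notation is allowed to drop) as

```
target += bernoulli_lpmf(y | theta);
```

Commenting out the sampling statement for `y` in our simple model gives a
program whose "posterior" is just the prior: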
```{stan ch14-simple0-stan, output.var = "simple0_stan", cache = TRUE}
data {
int<lower=0> N; // N is a non-negative integer
int y[N]; // y is an array of N integers
}
parameters {
real<lower=0,upper=1> theta; // theta is between 0 and 1
}
model {
theta ~ beta (1,1);
// y ~ bernoulli(theta); // comment out to remove likelihood
}
```
```{r ch14-simple0-stanfit, results = "hide"}
simple0_stanfit <-
sampling(
simple0_stan,
data = list(
N = 50,
y = c(rep(1, 15), rep(0, 35))
),
chains = 3, # default is 4
iter = 1000, # default is 2000
warmup = 200 # default is half of iter
)
```
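
As a quick check, the sampled values of `theta` should now look like draws
from the Beta(1, 1) (i.e., uniform) prior rather than from the posterior:

```{r ch14-simple0-stan-plot}
gf_dens(~theta, data = posterior(simple0_stanfit))
```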
## Exercises {#ch14-exercises}
<!-- Exercise 14.1. [Purpose: Transformed parameters in Stan, and comparison with JAGS.] -->
1. Let's compare Stan and JAGS on the therapeutic touch example from
Chapter 9. (See Figure 9.7 on page 236.) Stan and JAGS code for this example are below. The data are in `TherapeuticTouch`.
```{stan ch14-prob-tt-stan, eval = FALSE, output.var = "tt_stan"}
data {
int<lower=1> Nsubj;
int<lower=1> Ntotal;
int<lower=0,upper=1> y[Ntotal];
int<lower=1> s[Ntotal]; // notice Ntotal not Nsubj
}
parameters {
real<lower=0,upper=1> theta[Nsubj]; // individual prob correct
real<lower=0,upper=1> omega; // group mode
real<lower=0> kappaMinusTwo; // group concentration minus two
}
transformed parameters {
real<lower=0> kappa;
kappa <- kappaMinusTwo + 2;
}
model {
omega ~ beta(1, 1);
kappaMinusTwo ~ gamma(1.105125, 0.1051249 ); // mode=1, sd=10
theta ~ beta(omega * (kappa-2) + 1, (1 - omega) * (kappa-2) + 1);
for ( i in 1:Ntotal ) {
y[i] ~ bernoulli(theta[s[i]]);
}
}
```
```{r ch14-prob-tt-jags}
jags_model <- function() {
for ( i in 1:Ntotal ) {
y[i] ~ dbern( theta[s[i]] )
}
# use j (not s) as the loop index, since s is already the subject data vector
for ( j in 1:Nsubj ) {
theta[j] ~ dbeta(omega * (kappa-2) + 1, (1-omega) * (kappa-2) + 1)
}
omega ~ dbeta(1, 1)
kappa <- kappaMinusTwo + 2
kappaMinusTwo ~ dgamma(1.105125, 0.1051249) # mode=1, sd=10
}
```
Now answer the following questions.
a. What does the transformed parameters block of the Stan code do?
b. In the Stan code, there are two lines with `~` in them. One is
inside a for loop and the other not. Why?
c. Compile the Stan program, and note how long it takes.
d. Now generate posterior samples using both the JAGS and Stan versions.
Do they produce the same posterior distribution? How do the effective
sample sizes compare?
e. Tweak the settings until you get similar effective sample sizes
and Rhat values from both Stan and JAGS.
(ESS is the best metric of how much work they have done, and we want to be
sure both algorithms think they are converging to get a fair comparison.)
Once you have done that, compare their speeds.
Which is faster in this example? By how much?
(If you want R to help automate the timing, you can use `system.time()`.)
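
For example, you might wrap the call to `sampling()` something like this,
where `tt_data` is a placeholder for whatever data list you construct from
`TherapeuticTouch`:

```{r ch14-prob-tt-timing, eval = FALSE}
# tt_data is a placeholder for your data list (Nsubj, Ntotal, y, s)
system.time(
  tt_stanfit <- sampling(tt_stan, data = tt_data)
)
```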