When we use preconditioning with a mass matrix to improve the performance of HMC
based on previous draws, we often ignore information that we have already computed:
the gradients of the log posterior density at those samples. But those gradients
contain a lot of information about the posterior geometry, and therefore also
about possible preconditioners. If, for example, we assume that the posterior is
an $n$-dimensional normal distribution, then knowing the gradient at $n + 1$
locations identifies the covariance matrix, and with it the optimal
preconditioner of the posterior, exactly.
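To make that concrete, here is a small numpy sketch (the posterior, the evaluation points, and the function names are made up for illustration): for a normal posterior the gradient of the log-density is $-\Sigma^{-1}(x - \mu)$, so gradient differences at $n + 1$ points give a linear system for $\Sigma^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# Hypothetical "true" posterior N(mu, Sigma), used only to generate gradients.
A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)
mu = rng.normal(size=n)
prec = np.linalg.inv(Sigma)

def grad_logp(x):
    # Gradient of the normal log-density: -Sigma^{-1} (x - mu)
    return -prec @ (x - mu)

# Gradients at n + 1 points: g_i - g_0 = -Sigma^{-1} (x_i - x_0),
# so stacking the differences gives a linear system for Sigma^{-1}.
xs = rng.normal(size=(n + 1, n))
gs = np.array([grad_logp(x) for x in xs])
dx = (xs[1:] - xs[0]).T
dg = (gs[1:] - gs[0]).T
Sigma_recovered = np.linalg.inv(-dg @ np.linalg.inv(dx))

assert np.allclose(Sigma_recovered, Sigma)
```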
We can evaluate a preconditioning matrix $\hat{\Sigma}$ by thinking of it and a
mean $\hat{\mu}$ as a normal distribution
$\hat{p}(x) = N(x\mid \hat{\mu}, \hat{\Sigma})$ that approximates the posterior density $p$, such that

$$
F(\hat{\mu}, \hat{\Sigma}) = \int \left\lVert \nabla \log p(x) - \nabla \log \hat{p}(x) \right\rVert^2_{\hat{\Sigma}} \, p(x) \, dx
$$

is small. (Here $\lVert v\rVert_{\hat{\Sigma}}$ is the norm defined by the
preconditioner, $\lVert v\rVert^2_{\hat{\Sigma}} = v^T\hat{\Sigma}v$.) Equivalently, we can think of it as an affine transformation
$T(y) = \hat{\Sigma}^{1/2}y + \hat{\mu}$
such that the transformed posterior, with density $q(y) \propto p(T(y))$, is close to a standard normal in the sense that

$$
F = \int \left\lVert \nabla \log q(y) - \nabla \log N(y \mid 0, I) \right\rVert^2 q(y) \, dy
$$

is small.
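As a rough illustration (a sketch, not the package's implementation; the function and argument names are invented here), a Monte Carlo estimate of $F$ from stored draws and gradients could look like this:

```python
import numpy as np

def fisher_loss(draws, grads, mu_hat, Sigma_hat):
    """Monte Carlo estimate of F for a candidate normal approximation.

    draws, grads: arrays of shape (m, n) with posterior draws and the
    gradients of log p at those draws, one per row.
    Uses the norm induced by the preconditioner, ||v||^2 = v^T Sigma_hat v.
    """
    # Gradient of log N(x | mu_hat, Sigma_hat) is -Sigma_hat^{-1} (x - mu_hat)
    resid = grads + np.linalg.solve(Sigma_hat, (draws - mu_hat).T).T
    return np.einsum("ij,jk,ik->i", resid, Sigma_hat, resid).mean()
```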
Given an arbitrary but sufficiently nice posterior $p$, $F$ is minimal if
$\hat{\Sigma}$ is the geodesic mean of the covariance of $p$ and the inverse
of the covariance of $\nabla \log p$. If $p$ is normal, then $\operatorname{Cov}(\nabla \log p) = \operatorname{Cov}(p)^{-1}$, so the minimum is reached at the covariance matrix itself.
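Written out (with $B = \operatorname{Cov}(p)$ and $A = \operatorname{Cov}(\nabla \log p)$, names introduced here only for brevity), one explicit form of this geodesic mean is

$$
\hat{\Sigma} = A^{-1/2}\left(A^{1/2} B A^{1/2}\right)^{1/2} A^{-1/2},
$$

the unique symmetric positive definite solution of $\hat{\Sigma} A \hat{\Sigma} = B$.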
If we only allow diagonal preconditioning matrices, we can find the minimum
analytically as
$$
C = \text{diag}\left(\sqrt{\frac{\text{Var}(p)}{\text{Var}(\nabla \log p)}}\right).
$$
This diagonal preconditioner is already implemented in PyMC and nuts-rs.
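As a rough sketch (not the actual PyMC or nuts-rs code; the function name is made up), the diagonal estimate from a window of draws and gradients is just:

```python
import numpy as np

def diag_preconditioner(draws, grads):
    """Diagonal mass-matrix estimate from posterior draws and log-density gradients.

    draws, grads: arrays of shape (m, n), one draw / gradient per row.
    Returns the diagonal of C = sqrt(Var(x) / Var(grad log p)).
    """
    var_draws = draws.var(axis=0)
    var_grads = grads.var(axis=0)
    return np.sqrt(var_draws / var_grads)
```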
If we approximate the integral in $F$ with a finite number of samples using a Monte Carlo estimate, we find that $F$ is minimal if

$$
\hat{\Sigma} \hat{G} \hat{\Sigma} = \hat{S},
$$

where $\hat{S}$ is the sample covariance of the draws and $\hat{G}$ is the sample covariance of the gradients.
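When there are more draws than dimensions, this equation has an explicit solution of the geodesic-mean form above; a hedged numpy/scipy sketch (function name invented here, not the package's API):

```python
import numpy as np
from scipy.linalg import sqrtm

def full_preconditioner(draws, grads):
    """Solve Sigma_hat @ G @ Sigma_hat = S for a symmetric positive definite
    Sigma_hat, where S and G are the sample covariances of draws and gradients."""
    S = np.cov(draws, rowvar=False)
    G = np.cov(grads, rowvar=False)
    # sqrtm can pick up a tiny imaginary part for nearly singular inputs.
    G_half = np.real(sqrtm(G))
    G_half_inv = np.linalg.inv(G_half)
    middle = np.real(sqrtm(G_half @ S @ G_half))
    return G_half_inv @ middle @ G_half_inv
```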
If we have more dimensions than draws, however, the equation does not have a unique solution,
so we introduce regularization. Some regularization terms based on the log-determinant or trace of $\hat\Sigma$ or $\hat\Sigma^{-1}$ still allow more or less explicit solutions via algebraic Riccati equations, which can sometimes be made to scale reasonably with
the dimension, but in my experiments the squared geodesic distance to $I$, $R(\hat\Sigma)=\sum_i\log(\sigma_i)^2$ (where the $\sigma_i$ are the eigenvalues of $\hat\Sigma$), seems to work better.
To avoid memory and computational costs that grow quadratically with the dimensionality,
we write $\hat{\Sigma} = D(I + Q\Sigma Q^T - QQ^T)D$, where $Q\in\mathbb{R}^{N\times k}$ has orthonormal columns and $D$ ($N\times N$) and $\Sigma$ ($k\times k$) are diagonal, so that we can perform
all operations necessary for HMC or NUTS in $O(Nk)$.
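For instance, matrix-vector products with $\hat\Sigma$ and $\hat\Sigma^{-1}$ only require $O(Nk)$ work; a minimal sketch with invented names, making no claim to match the package's internals:

```python
import numpy as np

class LowRankPreconditioner:
    """Sketch of O(N k) operations with Sigma_hat = D (I + Q S Q^T - Q Q^T) D,
    where D and S are diagonal and Q has orthonormal columns."""

    def __init__(self, d, Q, s):
        self.d = d  # diagonal of D, shape (N,)
        self.Q = Q  # shape (N, k), with Q^T Q = I
        self.s = s  # diagonal of S (called Sigma in the text), shape (k,)

    def matvec(self, v):
        # Sigma_hat @ v, using I + Q S Q^T - Q Q^T = I + Q (S - I) Q^T,
        # without ever forming the N x N matrix.
        w = self.d * v
        w = w + self.Q @ ((self.s - 1.0) * (self.Q.T @ w))
        return self.d * w

    def inv_matvec(self, v):
        # Sigma_hat^{-1} @ v, using (I + Q (S - I) Q^T)^{-1} = I + Q (S^{-1} - I) Q^T.
        w = v / self.d
        w = w + self.Q @ ((1.0 / self.s - 1.0) * (self.Q.T @ w))
        return w / self.d
```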
We can now define a Riemannian metric on the space of all $(D, Q, \Sigma)$
as a pullback of the Fisher information metric of $N(0, \hat\Sigma)$
and minimize $F$ using natural gradient descent. If we do this during tuning, we get behavior similar to stochastic natural gradient descent, and
we can avoid saddle points during the optimization.
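For reference, the Fisher information metric of $N(0, \hat\Sigma)$ with respect to the covariance parameter (the metric being pulled back here) can be written as

$$
ds^2 = \tfrac{1}{2}\operatorname{tr}\!\left(\hat\Sigma^{-1}\, d\hat\Sigma\, \hat\Sigma^{-1}\, d\hat\Sigma\right),
$$

which is the affine-invariant metric on symmetric positive definite matrices.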
Acknowledgment
A lot of the work that went into this package was done during my time at Quantopian,
while trying to improve sampling of a (pretty awesome) model for portfolio
optimization. Thanks a lot for making that possible!