Sampling from the posterior with Markov-chain Monte Carlo

Posted on: 6 Aug. 2019

John K. Kruschke’s book, Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan (2nd ed.) (Amazon, official site), gives a very quick and practical introduction to Bayesian analysis. Compared to BDA3, it contains fewer proofs, but also less jargon, more informal explanations, and more introductions to the basics. As such, I would recommend it to someone who hasn’t had much exposure to statistics yet, or who is neither a mathematician nor a programmer.

The book includes thorough and nicely visualized descriptions of multiple Markov-chain Monte Carlo methods for sampling from a posterior distribution, of which I’ll try to summarize the most basic one in this post.

Goal of sampling

Given the prior (p(θ)) and the likelihood (p(\D\given θ)), we want samples from the posterior (p(θ\given \D)). In the following sections I’ll use the fact that the unnormalized posterior (i.e. the joint density) is equal to the prior multiplied by the likelihood: (p(θ, \D) = p(θ)\,p(\D \given θ)). Here, I’ll talk only about continuous probability spaces; discrete spaces can be sampled similarly.
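As a concrete sketch of that product, here is how the unnormalized posterior could be computed in log space for a coin-flip model. The model itself is my own assumption for illustration (a Beta(2, 2) prior and a sequence of flips summarized as `heads` and `tails`), not an example taken from the book, and the function names are made up.

```python
import math

def log_prior(theta, a=2.0, b=2.0):
    """Log density of a Beta(a, b) prior (assumed here for illustration)."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # outside the support the density is zero
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1.0) * math.log(theta) + (b - 1.0) * math.log(1.0 - theta) - log_norm

def log_likelihood(theta, heads, tails):
    """Log probability of one particular sequence of coin flips."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

def log_unnormalized_posterior(theta, heads, tails):
    """log p(theta, D) = log p(theta) + log p(D | theta)."""
    return log_prior(theta) + log_likelihood(theta, heads, tails)

print(log_unnormalized_posterior(0.5, heads=3, tails=1))
```

Working with logarithms turns the product into a sum and avoids underflow when the likelihood is a product of many small numbers.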

Metropolis algorithm

Just like the other MCMC methods, the Metropolis algorithm starts with a seed value for (θ) – let’s call it (θ_0). (I assume that in practice (θ_0) is sampled from the prior.) Then, given the current value (θ_i), repeat the following two steps for a prespecified number of iterations, or until a target effective sample size is reached.

1. Sample a proposal (θ'_{i+1}) from a symmetric proposal distribution centered on (θ_i), for example a Gaussian: (θ'_{i+1} \sim \N(θ_i, Σ)).

2. If (p(θ_i, \D) \le p(θ'_{i+1}, \D)) – i.e. if (p(θ_i \given \D) \le p(θ'_{i+1} \given \D)) – then accept the proposed parameter value: (θ_{i+1} := θ'_{i+1}).
Otherwise, accept the proposed value with a probability equal to the ratio of the posterior at the proposed value to the posterior at the current value, and reject it (keeping (θ_i)) with the remaining probability:

[\begin{gathered} p = \frac{p(θ'_{i+1}, \D)}{p(θ_i, \D)} = \frac{p(θ'_{i+1} \given \D)}{p(θ_i \given \D)}, \\ b \sim \text{Bernoulli}(p), \\ θ_{i+1} = \begin{cases} θ'_{i+1} & \text{if } b=1, \\ θ_i & \text{if } b=0. \end{cases} \end{gathered}]
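The two steps can be sketched in a few lines of Python. This is a minimal one-dimensional version under my own assumptions, not code from the book: the function names are made up, the proposal is a Gaussian with standard deviation `proposal_sd`, and the target is a coin-flip model (Beta(2, 2) prior, 12 heads and 18 tails) chosen so that the posterior is Beta(14, 20).

```python
import math
import random

def log_joint(theta, heads=12, tails=18, a=2.0, b=2.0):
    """Log of prior times likelihood, up to an additive constant (assumed model)."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return (a - 1.0 + heads) * math.log(theta) + (b - 1.0 + tails) * math.log(1.0 - theta)

def metropolis(log_p, theta0, n_iter, proposal_sd, rng):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    theta, log_p_theta = theta0, log_p(theta0)
    if log_p_theta == -math.inf:
        raise ValueError("the seed value must have non-zero density")
    chain = []
    for _ in range(n_iter):
        proposal = rng.gauss(theta, proposal_sd)  # step 1: propose
        log_p_proposal = log_p(proposal)
        # Step 2: accept with probability min(1, ratio). Since log(u) <= 0,
        # a proposal that is at least as probable is always accepted.
        u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is defined
        if math.log(u) <= log_p_proposal - log_p_theta:
            theta, log_p_theta = proposal, log_p_proposal
        chain.append(theta)  # a rejection repeats the current value
    return chain

rng = random.Random(0)
chain = metropolis(log_joint, theta0=0.5, n_iter=20_000, proposal_sd=0.2, rng=rng)
kept = chain[1_000:]  # discard an (arbitrarily chosen) burn-in period
print(sum(kept) / len(kept))  # should land near the Beta(14, 20) mean, 14/34
```

Comparing log densities instead of densities is only a numerical convenience; the accept/reject rule is the same as the one written out above.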

It can be proven that after a so-called “burn-in” period, the values of the chain are distributed according to the posterior: (θ_n \sim p(θ_n \given \D)) if (n \gg 1). Therefore, if you run the procedure long enough, you’ll end up with many samples from the posterior. Note that the effective sample size will be much lower than the number of iterations (N), because neighboring samples are strongly correlated, so we have to drop most of the (θ_i) values so obtained.
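To make “effective sample size” a bit more tangible, here is a rough sketch of one way to estimate it from the autocorrelations of a chain: divide the chain length by one plus twice the sum of the autocorrelations. The rule of stopping the sum at the first non-positive autocorrelation is a simplification I picked, and this is not necessarily the estimator used in the book or in any particular library. The correlated chain below is a synthetic AR(1) series standing in for MCMC output.

```python
import random

def autocorrelation(xs, lag):
    """Sample autocorrelation of xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag))
    return cov / var

def effective_sample_size(xs, max_lag=200):
    """Crude estimate: N / (1 + 2 * sum of the leading positive autocorrelations)."""
    n = len(xs)
    total = 0.0
    for lag in range(1, min(max_lag, n - 1) + 1):
        rho = autocorrelation(xs, lag)
        if rho <= 0.0:  # simple truncation rule (an assumption of this sketch)
            break
        total += rho
    return n / (1.0 + 2.0 * total)

rng = random.Random(1)
independent = [rng.gauss(0.0, 1.0) for _ in range(5000)]
correlated = [0.0]
for _ in range(4999):
    # Each value is mostly a copy of its neighbor, like a slowly moving chain.
    correlated.append(0.9 * correlated[-1] + rng.gauss(0.0, 1.0))

ess_independent = effective_sample_size(independent)
ess_correlated = effective_sample_size(correlated)
print(ess_independent, ess_correlated)
```

Both series have 5000 values, but the correlated one should come out with a far smaller effective sample size.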

The beauty of this algorithm is that during this whole procedure, we only need to be able to compute the unnormalized posterior – so the algorithm can be easily used for sampling using the prior and the likelihood, even when the model is specified only up to a multiplicative constant (as in an undirected graphical model).
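Written out, the reason is that any constant factor cancels in the acceptance ratio. In the sketch below, (Z) and (\tilde{p}) are my own notation for an unknown normalizing constant and the part of the model we can actually compute; the other symbols are the ones used above.

```latex
\begin{aligned}
\frac{p(θ'_{i+1} \given \D)}{p(θ_i \given \D)}
  &= \frac{p(θ'_{i+1}, \D) / p(\D)}{p(θ_i, \D) / p(\D)}
   = \frac{p(θ'_{i+1}, \D)}{p(θ_i, \D)}, \\
\text{and if } p(θ, \D) = \tfrac{1}{Z}\,\tilde{p}(θ, \D): \quad
\frac{p(θ'_{i+1}, \D)}{p(θ_i, \D)}
  &= \frac{\tilde{p}(θ'_{i+1}, \D) / Z}{\tilde{p}(θ_i, \D) / Z}
   = \frac{\tilde{p}(θ'_{i+1}, \D)}{\tilde{p}(θ_i, \D)}.
\end{aligned}
```

So neither the evidence (p(\D)) nor (Z) ever has to be evaluated.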

This algorithm doesn’t easily escape a “probability island” – i.e. a region that is surrounded by a wide region of probability 0. (Although if the proposal distribution is wide enough, then the algorithm is theoretically able to make that jump eventually, which may in practice mean “approximately never”.)
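A toy experiment shows the effect. The target here is entirely made up for illustration: a density that is flat on two islands, [0, 1] and [9, 10], and zero everywhere else. The chain starts on the left island and is run once with a narrow and once with a wide Gaussian proposal.

```python
import math
import random

def log_density(x):
    """Unnormalized log density: flat on [0, 1] and [9, 10], zero elsewhere."""
    if 0.0 <= x <= 1.0 or 9.0 <= x <= 10.0:
        return 0.0
    return -math.inf

def metropolis(log_p, x0, n_iter, proposal_sd, rng):
    """Random-walk Metropolis with a Gaussian proposal."""
    x, chain = x0, []
    for _ in range(n_iter):
        proposal = rng.gauss(x, proposal_sd)
        u = 1.0 - rng.random()  # uniform on (0, 1]
        if math.log(u) <= log_p(proposal) - log_p(x):
            x = proposal
        chain.append(x)
    return chain

rng = random.Random(0)
narrow = metropolis(log_density, 0.5, 20_000, 0.5, rng)
wide = metropolis(log_density, 0.5, 20_000, 5.0, rng)

# Fraction of the chain spent on the right island (the true answer is 1/2).
frac_narrow = sum(x >= 9.0 for x in narrow) / len(narrow)
frac_wide = sum(x >= 9.0 for x in wide) / len(wide)
print(frac_narrow, frac_wide)
```

With the narrow proposal, reaching the other island needs a jump of at least 16 standard deviations, so the chain never leaves its starting island in any run of realistic length. The wide proposal does cross the gap, at the price of having most of its proposals rejected.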

One downside of this basic algorithm is that the proposal distribution needs to be fine-tuned for the individual application: differences in effective sample size can be orders of magnitude, even for a simple (\text{Beta}(14,20)) distribution (i.e. a 1-dimensional unimodal distribution with bounded support).
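The sensitivity to the proposal width is easy to reproduce. The sketch below samples from Beta(14, 20) with three Gaussian proposal widths that I picked arbitrarily, and reports the acceptance rate and the lag-1 autocorrelation of each chain as a cheap stand-in for the effective sample size. The function names are my own.

```python
import math
import random

def log_beta_14_20(theta):
    """Unnormalized log density of Beta(14, 20)."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return 13.0 * math.log(theta) + 19.0 * math.log(1.0 - theta)

def run_chain(proposal_sd, n_iter=20_000, seed=0):
    """Return (chain, acceptance rate) of one random-walk Metropolis run."""
    rng = random.Random(seed)
    theta, accepted, chain = 0.4, 0, []
    for _ in range(n_iter):
        proposal = rng.gauss(theta, proposal_sd)
        log_ratio = log_beta_14_20(proposal) - log_beta_14_20(theta)
        if math.log(1.0 - rng.random()) <= log_ratio:
            theta, accepted = proposal, accepted + 1
        chain.append(theta)
    return chain, accepted / n_iter

def lag1_autocorrelation(xs):
    """Correlation between each value and the next one in the chain."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var

results = {}
for sd in (0.005, 0.2, 5.0):
    chain, rate = run_chain(sd)
    results[sd] = (rate, lag1_autocorrelation(chain))
    print(f"proposal sd {sd}: acceptance {rate:.3f}, lag-1 autocorrelation {results[sd][1]:.3f}")
```

A very narrow proposal accepts almost everything but barely moves, while a very wide one is rejected most of the time and the chain stays put. Both leave neighboring samples almost perfectly correlated, and the middle setting mixes best.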

Another downside is that in multiple dimensions this random walk is quite inefficient, and even more dependent on a correct choice of the covariance matrix (Σ) – but apart from the obvious reason that “high-dimensional spaces are big”, I couldn’t tell why.

The well-known Metropolis–Hastings algorithm, Gibbs sampling and Hamiltonian Monte Carlo are different twists on this core idea, and they are also described in the book.

Allegedly, credit for this method is due more to Marshall and Arianna Rosenbluth – if there is agreement on that, we could rename it to Rosenbluthsian Monte Carlo.

For more information…

If you want to learn about sampling, or Bayesian data analysis, consider reading the book; it’s a great read, judging from what I’ve read so far.

Stay tuned for more of Bayes, or Curry, or Euler, or McCarthy.