Back to Overview

The Metropolis-Hastings Algorithm - MCMC methods (1)

— 19 July 2021

This is a preliminary version. The algorithm was originally proposed in (Metropolis et al., 1953) and later extended in (Hastings, 1970). For an exhaustive description of the algorithm and other MC methods I refer the reader to (Robert & Casella, 2005).

Quick overview:

Prelims

We start with a couple of definitions to set the stage for the algorithm and why it works. Recall that a stationary distribution $\pi$ of the chain $K$ is characterized by

\[\pi(Y) = \int_X \pi(x) \ K(x,Y) \ dx,\]

for all events $Y$. Note that for the discrete finite case, if $K$ is just a transition matrix, this tranlates to

\[\pi = \pi K.\]

Moreover, for the Markov chain $(X_n)$ defined by $X_n \sim K(X_{n-1}, X_n)$, if $X_0 \sim \pi$ then $X_n \sim \pi$ for every $n$. What is a bit more involved is showing that the chain converges to $\pi$ (cf. (Robert & Casella, 2005)), i.e. showing that (for some appropriate norm) we have

\[\lim_{n\to\infty} \| K^n(x,.) - \pi \| \stackrel{!}{=} 0.\]

A Markov chain with transition kernel $K$ satisfies the detailed balance condition if there is a function $f$ such that for every $x,y$

\[f(x) \ K(x,y) = f(y) \ K(y,x).\]

(This is slightly “weaker” than reversibility.) If $f$ is a probability density – as it will be in the MH case below – it implies that the chain is reversible and $f$ is a stationary distribution of the chain. Simply compute

\[\begin{eqnarray*} \int_X f(x) \ K(x,Y) \ dx &=& \int_X \int_Y f(x) \ K(x,y) \ dy \ dx \\ &=& \int_X \int_Y f(y) \ K(y,x) \ dy \ dx \\ &=& \int_Y f(y) \ \Big[ \int_X K(y, x) \ dx \Big] \ dy \\ &=& \int_Y \ f(y) \ dy \\ &=& f(Y) . \end{eqnarray*}\]

Metropolis-Hastings transition kernel

Given a distribtution $p(x)$ we will now create a Markov chain $(X_1,X_2,\ldots)$ with stationary distribution $p(x)$. For a family of proposal distributions $q(y \mid x)$ we define the acceptance probability (of a proposed state transition $x \to y$) as

\[a(x, y) := \min \Big\{ 1, \frac{q(x \mid y)}{q(y \mid x)} \frac{p(y)}{p(x)} \Big\}.\]

As the name suggests $a(x,y)$ guides which proposals to accept and which to reject. The way it is set up ensures that $p$ is the stationary distribution of the chain. But we are getting ahead of our self. Let us define the Markov chain first.

Pick an initial state $x_0$ and sample $x_1, x_2, x_3, \ldots$ by iterating the following process for $t \geq 1$:

\[\left.\begin{aligned} y_t \ & \sim \ q(y \mid x_{t-1}) \\ a_t \ & \sim \ \text{Bernoulli}\big(a({x_{t-1}, y_t}) \big) \end{aligned}\right. \ \ \ \ \longrightarrow \ \ \ \ \ x_t := \begin{cases} y_t & \text{if $a_t$ is true,} \\ x_{t-1} & \text{else.} \end{cases}\]

We can also write down the MH kernel explicityly: Let $r(x) = 1 - \int a(x,\tilde{y}) \ q(\tilde{y} \mid x) d\tilde{y}$ denote the mass of ending up in the second branch in the above if-else statement, or equivalently the mass of the event of any proposal being rejected. Then we have

\[K(x,y) = a(x,y) \ q(y \mid x) + r(x) \ \delta_x(y).\]

Detailed balance for MH kernel

The above Markov chain satisfies the detailed balance condition: Note that the detailed balance condition is always satisfied for $x=y$ which is the only case for which the second term is non-vanishing. It remains to show detailed balance for the first term, which can be written as

\[\begin{eqnarray*} p(x) \ \big( a(x,y) \ q(y \mid x) \big) &=& \min \Big\{ 1, \frac{q(x \mid y) \ p(y)}{q(y \mid x) \ p(x)} \Big\} \ q(y \mid x) \ p(x) \\ &=& \min \Big\{ q(y \mid x) \ p(x), q(x \mid y) \ p(y) \Big\}. \end{eqnarray*}\]

The expression on the right side is clearly symmetric in $x$ and $y$ and thus implies detailed balance.

Intuition

Note that we consider the probability ratio of the transitions of the reversed transition $y \to x$ over the proposed transition $x \to y$ and we distinguish two cases:

\[\text{(i)} \ \ a=1 \ \ \ \text{and} \ \ \ \text{(ii)} \ \ a<1 .\]

The first case translates to $p(y \to x) \geq p(x \to y)$. In order to reestablish balance we have to increase the denominator $p(x \to y)$, which means we want to accept the proposed transition. Conversely, the second case translates to $p(y \to x) < p(x \to y)$ and we want to reject the proposed transition (proportionally to how bad we are out of balance) to avoid amplifying the imbalance.

References

  1. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of State Calculations by Fast Computing Machines. The Journal of Chemical Physics, 21(6), 1087–1092. https://doi.org/10.1063/1.1699114
  2. Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1), 97–109. https://doi.org/10.1093/biomet/57.1.97
  3. Robert, C. P., & Casella, G. (2005). Monte Carlo Statistical Methods (Springer Texts in Statistics). Springer-Verlag.