This is a preliminary version. The algorithm was originally proposed in (Metropolis et al., 1953) and later extended in (Hastings, 1970). For an exhaustive description of the algorithm and other MC methods I refer the reader to (Robert & Casella, 2005).
Quick overview:
- Defns of detailed balance and stationary distribution of a Markov chain.
- Detailed balance implies stationarity.
- Defn of MH Kernel.
- MH Kernel satisfies detailed balance: formal and intuitive proof.
Prelims
We start with a couple of definitions to set the stage for the algorithm and why it works. Recall that a stationary distribution $\pi$ of the chain $K$ is characterized by
\[\pi(Y) = \int_X \pi(x) \ K(x,Y) \ dx,\]for all events $Y$. Note that in the discrete finite case, where $K$ is just a transition matrix, this translates to
\[\pi = \pi K.\]Moreover, for the Markov chain $(X_n)$ defined by $X_n \sim K(X_{n-1}, \cdot)$, if $X_0 \sim \pi$ then $X_n \sim \pi$ for every $n$. What is a bit more involved is showing that no matter the initial distribution the chain converges to $\pi$ (cf. (Robert & Casella, 2005)), i.e. showing that for any state $x$ (and for some appropriate norm) we have
\[\lim_{n\to\infty} \| K^n(x,.) - \pi \| \stackrel{!}{=} 0.\]A Markov chain with transition kernel $K$ satisfies the detailed balance condition if there is a function $f$ such that for every $x,y$
\[f(x) \ K(x,y) = f(y) \ K(y,x).\]If $f$ is a probability density – as it will be in the MH case below – then detailed balance implies that the chain is reversible and that $f$ is a stationary distribution of the chain. Simply compute
\[\begin{eqnarray*} \int_X f(x) \ K(x,Y) \ dx &=& \int_X \int_Y f(x) \ K(x,y) \ dy \ dx \\ &=& \int_X \int_Y f(y) \ K(y,x) \ dy \ dx \\ &=& \int_Y f(y) \ \Big[ \int_X K(y, x) \ dx \Big] \ dy \\ &=& \int_Y \ f(y) \ dy \\ &=& f(Y) . \end{eqnarray*}\]
Metropolis-Hastings transition kernel
Given a distribution $p(x)$ we will now construct a Markov chain $(X_1, X_2, \ldots)$ with stationary distribution $p(x)$. For a family of proposal distributions $q(y \mid x)$ we define the acceptance probability (of a proposed state transition $x \to y$) as
\[a(x, y) := \min \Big\{ 1, \frac{q(x \mid y)}{q(y \mid x)} \frac{p(y)}{p(x)} \Big\}.\]As the name suggests, $a(x,y)$ guides which proposals to accept and which to reject. The way it is set up ensures that $p$ is the stationary distribution of the chain. But we are getting ahead of ourselves. Let us define the Markov chain first.
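In code, the acceptance probability might look as follows – a minimal sketch, assuming a 1-D standard-normal target $p$ and a Gaussian random-walk proposal $q(y \mid x) = \mathcal{N}(y; x, \sigma^2)$, both of which are illustrative choices not taken from the text:

```python
import math

def log_p(x):
    """Unnormalised log-density of the (assumed) target: standard normal."""
    return -0.5 * x * x

def log_q(y, x, sigma=1.0):
    """Log-density of the (assumed) proposal q(y | x) = N(y; x, sigma^2)."""
    return -0.5 * ((y - x) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def acceptance(x, y):
    """a(x, y) = min{ 1, [q(x|y) p(y)] / [q(y|x) p(x)] }, computed in log-space
    for numerical stability."""
    log_ratio = (log_q(x, y) - log_q(y, x)) + (log_p(y) - log_p(x))
    return min(1.0, math.exp(log_ratio))
```

Since this particular proposal is symmetric, $q(x \mid y) = q(y \mid x)$, the $q$-ratio cancels and the acceptance probability reduces to $\min\{1, p(y)/p(x)\}$ – the original Metropolis rule.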
Pick an initial state $x_0$ and sample $x_1, x_2, x_3, \ldots$ by iterating the following process for $t \geq 1$:
\[\left.\begin{aligned} y_t \ & \sim \ q(y \mid x_{t-1}) \\ a_t \ & \sim \ \text{Bernoulli}\big(a({x_{t-1}, y_t}) \big) \end{aligned}\right. \ \ \ \ \longrightarrow \ \ \ \ \ x_t := \begin{cases} y_t & \text{if } a_t = 1, \\ x_{t-1} & \text{otherwise.} \end{cases}\]We can also write down the MH kernel explicitly: let $r(x) = 1 - \int a(x,\tilde{y}) \ q(\tilde{y} \mid x) \ d\tilde{y}$ denote the probability of ending up in the second branch of the case distinction above, or equivalently the probability that the proposal is rejected. Then we have
\[K(x,y) = a(x,y) \ q(y \mid x) + r(x) \ \delta_x(y).\]
Detailed balance for MH kernel
The above Markov chain satisfies the detailed balance condition: for $x = y$ detailed balance holds trivially, and this is the only case in which the second term $r(x) \, \delta_x(y)$ is non-vanishing. It remains to show detailed balance for the first term, which can be written as
\[\begin{eqnarray*} p(x) \ \big( a(x,y) \ q(y \mid x) \big) &=& \min \Big\{ 1, \frac{q(x \mid y) \ p(y)}{q(y \mid x) \ p(x)} \Big\} \ q(y \mid x) \ p(x) \\ &=& \min \Big\{ q(y \mid x) \ p(x), q(x \mid y) \ p(y) \Big\}. \end{eqnarray*}\]The expression on the right side is clearly symmetric in $x$ and $y$ and thus implies detailed balance.
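The detailed balance and stationarity identities above can be checked numerically on a small finite state space. The target $p$ and proposal matrix $q$ below are arbitrary illustrative choices; the kernel is assembled exactly as $K(x,y) = a(x,y)\, q(y \mid x) + r(x)\, \delta_x(y)$:

```python
# Numerical sanity check of p(x) K(x,y) = p(y) K(y,x) and pi = pi K
# on a 3-state space.  The numbers below are arbitrary choices.

p = [0.2, 0.5, 0.3]                        # target distribution
q = [[0.1, 0.6, 0.3],                      # q[x][y] = q(y | x); rows sum to 1
     [0.4, 0.2, 0.4],
     [0.5, 0.25, 0.25]]
n = len(p)

def a(x, y):
    """Acceptance probability a(x, y) = min{1, q(x|y) p(y) / (q(y|x) p(x))}."""
    return min(1.0, (q[y][x] * p[y]) / (q[x][y] * p[x]))

# First term: a(x, y) q(y | x) for every pair (x, y) ...
K = [[a(x, y) * q[x][y] for y in range(n)] for x in range(n)]
# ... then add the rejection mass r(x) = 1 - sum_y a(x, y) q(y | x)
# to the diagonal (the delta term).
for x in range(n):
    K[x][x] += 1.0 - sum(K[x])

# Detailed balance: p(x) K(x, y) = p(y) K(y, x)
for x in range(n):
    for y in range(n):
        assert abs(p[x] * K[x][y] - p[y] * K[y][x]) < 1e-12

# Stationarity: (pi K)(y) = pi(y)
for y in range(n):
    assert abs(sum(p[x] * K[x][y] for x in range(n)) - p[y]) < 1e-12
```

The check passes for any valid $p$ and $q$, which mirrors the proof: the product $p(x)\, a(x,y)\, q(y \mid x) = \min\{q(y \mid x)\, p(x),\ q(x \mid y)\, p(y)\}$ is symmetric in $x$ and $y$ by construction.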
Intuition
In the acceptance probability we consider the ratio of the probability flow of the reversed transition, $p(y \to x) := p(y) \, q(x \mid y)$, over that of the proposed transition, $p(x \to y) := p(x) \, q(y \mid x)$, and we distinguish two cases:
\[\text{(i)} \ \ a=1 \ \ \ \text{and} \ \ \ \text{(ii)} \ \ a<1 .\]The first case translates to $p(y \to x) \geq p(x \to y)$. In order to reestablish balance we have to increase the denominator $p(x \to y)$, which means we want to accept the proposed transition. Conversely, the second case translates to $p(y \to x) < p(x \to y)$, and we want to reject the proposed transition (in proportion to how far we are out of balance) to avoid amplifying the imbalance.
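Putting everything together, the MH iteration defined above can be sketched in a few lines of Python. As before, the standard-normal target and the symmetric Gaussian random-walk proposal (for which the $q$-ratio in $a(x,y)$ cancels) are illustrative assumptions, as are the step size and sample count:

```python
import math
import random

def metropolis_hastings(log_p, x0, n_steps, step=1.0, seed=0):
    """Random-walk Metropolis sampler targeting exp(log_p) (unnormalised).
    A minimal sketch; the symmetric proposal makes the q-ratio cancel."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        y = x + rng.gauss(0.0, step)            # propose y ~ q(. | x)
        a = min(1.0, math.exp(log_p(y) - log_p(x)))  # acceptance prob a(x, y)
        if rng.random() < a:                    # a ~ Bernoulli(a(x, y))
            x = y                               # accept: move to y
        samples.append(x)                       # reject: stay at x
    return samples

# Sample from a standard normal and inspect the first two moments.
samples = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_steps=20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

With enough steps the empirical mean and variance should approach $0$ and $1$, the moments of the target – a quick, informal check that the chain indeed has $p$ as its stationary distribution.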
References
- Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of State Calculations by Fast Computing Machines. The Journal of Chemical Physics, 21(6), 1087–1092. https://doi.org/10.1063/1.1699114
- Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1), 97–109. https://doi.org/10.1093/biomet/57.1.97
- Robert, C. P., & Casella, G. (2005). Monte Carlo Statistical Methods (Springer Texts in Statistics). Springer-Verlag.