Markov Chain Monte Carlo and Variational Inference: Bridging the Gap

Tim Salimans, Diederik P. Kingma, Max Welling

ICML 2015 | arXiv

Objectives: benefit from both worlds: 1) the fast posterior approximation of variational inference (maximizing an explicit, approximate objective), and 2) the option of trading additional computation for additional accuracy offered by MCMC (nonparametric/flexible and asymptotically exact).

Variational inference:

Fit a parameterized approximation \(q_\phi(z|x)\) to the posterior by maximizing a lower bound on the marginal likelihood:

\[ \log p(x) = D_{KL}(q_\phi(z|x) || p(z|x)) + E_{q_\phi(z|x)}[\log p(x,z) - \log q_\phi(z|x)] \]

Maximizing the second term on the RHS of the previous equation (\(\mathcal L\), the lower bound) w.r.t. \(\phi\) leaves the marginal likelihood \(\log p(x)\) unchanged (it does not depend on \(\phi\)) and thus drives the non-negative KL divergence toward 0.
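As a concrete illustration (not from the paper): a minimal numpy sketch of a Monte Carlo estimate of \(\mathcal L\) for a toy conjugate model \(p(z) = N(0,1)\), \(p(x|z) = N(z,1)\), with a reparameterized Gaussian \(q_\phi(z|x) = N(\mu, \sigma^2)\). All function and variable names are illustrative.

```python
import numpy as np

def log_normal(x, mean, std):
    """Log density of N(mean, std^2) evaluated at x."""
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((x - mean) / std) ** 2

def elbo_estimate(x, mu, sigma, n_samples=1000, seed=0):
    """Monte Carlo estimate of L = E_q[log p(x,z) - log q(z|x)]."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)
    z = mu + sigma * eps                                        # reparameterized draws z ~ q(z|x)
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0) # log p(x,z) = log p(z) + log p(x|z)
    log_q = log_normal(z, mu, sigma)                            # log q(z|x)
    return np.mean(log_joint - log_q)

# For this toy model the exact posterior is N(x/2, 1/2) and log p(x) = log N(x; 0, 2),
# so plugging in the exact posterior parameters makes the bound tight (KL = 0):
x = 1.3
print(elbo_estimate(x, mu=x / 2, sigma=np.sqrt(0.5)))  # ~ log p(x)
print(log_normal(x, 0.0, np.sqrt(2.0)))                # exact log marginal likelihood
```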

MCMC:

Rather than optimizing the distribution from which we draw the latent variable \(z\), we apply a stochastic transition operator to a random initial draw \(z_0\), i.e.

\[ z_t \sim q(z_t | z_{t-1}, x), \qquad t = 1, \dots, T \]

If the transition operator is chosen judiciously, the last draw \(z_T\) converges in distribution to the exact posterior \(p(z|x)\) as \(T \to \infty\). ==CAVEAT== how large must \(T\) be to get a good approximation?
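For concreteness, a minimal sketch of one valid transition operator: a Gaussian random-walk Metropolis-Hastings step targeting the toy posterior from the snippet above (the paper's experiments instead use Hamiltonian dynamics); `mh_step` and `step` are illustrative names, and `log_normal` is reused from the first snippet.

```python
import numpy as np

def mh_step(z, log_target, step=0.5, rng=None):
    """One random-walk Metropolis-Hastings transition; leaves log_target invariant."""
    if rng is None:
        rng = np.random.default_rng()
    prop = z + step * rng.standard_normal()    # symmetric Gaussian proposal
    if np.log(rng.uniform()) < log_target(prop) - log_target(z):
        return prop                            # accept the proposal
    return z                                   # reject: stay at the current state

# Target: p(z|x) ∝ p(z) p(x|z) for the toy model above
x = 1.3
log_target = lambda z: log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)

rng = np.random.default_rng(0)
z = rng.standard_normal()         # arbitrary initial draw z_0
for t in range(1000):             # larger T trades computation for accuracy
    z = mh_step(z, log_target, rng=rng)
# after enough steps, z is approximately a draw from p(z|x) = N(x/2, 1/2)
```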

Bridging the gap:

Interpret the stochastic Markov chain as a variational approximation:

\[ q(z|x) = q(z_T|x), \qquad q(y, z_T|x) = q(z_0|x) \prod_{t=1}^{T} q_t(z_t | x, z_{t-1}) \]

in an expanded space where \(y = (z_0, \dots, z_{T-1})\) is a set of auxiliary random variables.

We then integrate those auxiliary variables into the lower bound:

\[ \begin{align} \mathcal L_{aux} & = E_{q(y,z_T|x)}\left[\log \left[p(x,z_T)r(y|x,z_T)\right] - \log q(y,z_T|x)\right]\\ & = \mathcal L - E_{q(z_T|x)}[D_{KL}(q(y|z_T,x) || r(y|z_T,x))]\\ & \leq \mathcal L \leq \log p(x) \end{align} \]

where \(r(y|x,z_T)\) is an auxiliary inference distribution which we are free to choose; the second equality follows from splitting \(\log q(y,z_T|x) = \log q(z_T|x) + \log q(y|z_T,x)\), and the first inequality holds because the KL term is non-negative. Here \(q(z_T|x) = \int q(y,z_T|x)\,dy\), i.e. a mixture of distributions of the form \(q(z_T|x,y)\). In the paper, \(r(y|x,z_T)\) has a Markov structure as well:

\[ r(y|x,z_T) = r(z_0, \dots, z_{T-1}|x,z_T) = \prod_{t=1}^{T} r_t(z_{t-1}|x,z_t) \]

The variational lower bound can be written as:

\[ \mathcal L_{aux} = E_q\left[\log\left[p(x,z_T)/q(z_0|x)\right] + \sum_{t=1}^{T}\log\left[r_t(z_{t-1}|x,z_t)/q_t(z_t|x,z_{t-1})\right]\right] \]

At each step of the Markov chain we can choose different transition operators \(q_t\) and inverse models \(r_t\).
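The paper estimates this bound stochastically by simulating a single chain and summing the per-step log-ratios. A minimal sketch under stated assumptions: `log_normal` is reused from the first snippet, and \(q(z_0|x)\), each \(q_t\), and each \(r_t\) are stand-in Gaussian conditionals with fixed, arbitrary parameters; in the paper their parameters are learned jointly by stochastic gradient ascent on \(\mathcal L_{aux}\). The bound remains valid for any such (normalized) choice, just looser.

```python
import numpy as np

def estimate_L_aux(x, T=10, seed=0):
    """Unbiased single-sample estimate of L_aux for the toy model above."""
    rng = np.random.default_rng(seed)
    z = x + rng.standard_normal()                 # z_0 ~ q(z_0|x) = N(x, 1) (illustrative)
    log_w = -log_normal(z, x, 1.0)                # - log q(z_0|x)
    for t in range(T):
        z_prev = z
        z = 0.5 * z_prev + rng.standard_normal()  # z_t ~ q_t(z_t|x, z_{t-1}) = N(0.5 z_{t-1}, 1)
        log_w += log_normal(z_prev, 0.5 * z, 1.0) # + log r_t(z_{t-1}|x, z_t) = N(0.5 z_t, 1)
        log_w -= log_normal(z, 0.5 * z_prev, 1.0) # - log q_t(z_t|x, z_{t-1})
    log_w += log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)  # + log p(x, z_T)
    return log_w

# Averaging many single-sample estimates converges to L_aux, which must fall
# below the exact log p(x) computed in the first snippet:
est = np.mean([estimate_L_aux(1.3, seed=s) for s in range(2000)])
print(est)
```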