Diederik P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, Max Welling
VAE: uses neural networks for both the generative model and the inference network \(q\_\phi(z|x)\).
In order to improve the flexibility of the inference network, the authors use autoregressive functions that take an input with some specified ordering (a multidimensional tensor or a sequence) and output a mean and std for each element of the input, conditioned on the previous ones, e.g. RNNs, PixelCNN and WaveNet (van den Oord et al.), or MADE (Germain et al.). Such functions can be turned into invertible nonlinear transformations of the input with a simple ==Jacobian determinant==.
In the scenario of a multidimensional input variable we will consider a context variable \(c\), and our approximate posterior will thus be noted \(q(z|x,c)\).
The inference model \(q(z|x)\) is required to be ==computationally cheap== to compute, differentiate and sample from (operations needed at each point of the minibatch). Further efficiency is gained if those operations are parallelizable across the dimensions of \(z\). These requirements limit the choice of families for the approximate posterior (generally diagonal Gaussians). Yet we need \(q\) to be flexible enough to match the true posterior!
Normalizing flows (Rezende et al.): chain invertible transformations of an initial sample to build a more flexible posterior:\[ z^0 \sim q(z^0|x,c),\:\: z^t = f^t(z^{t-1},x,c),\:\forall t=1,\dots,T \]
As long as the Jacobian of each of these transformations can be computed:\[ \log q(z^T|x,c) = \log q(z^0|x,c) - \sum_{t=1}^{T}\log \left|\det\frac{df^t(z^{t-1},x,c)}{dz^{t-1}}\right| \]
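A minimal NumPy sketch of this log-density bookkeeping for a chain of transforms (the `transforms` interface and the toy affine map are hypothetical, not from the paper):

```python
import numpy as np

def flow_log_density(z0, log_q0, transforms):
    """Apply a chain of invertible maps z^t = f^t(z^{t-1}) and update the
    log-density with the change-of-variables formula."""
    z, log_q = z0, log_q0
    for f in transforms:
        z, log_det = f(z)            # each f returns (new z, log|det df/dz|)
        log_q = log_q - log_det      # log q(z^T) = log q(z^0) - sum_t log|det J^t|
    return z, log_q

# toy example: an elementwise affine map z -> a*z + b (log|det| = sum log|a|)
a, b = np.array([2.0, 0.5]), np.array([0.0, 1.0])
affine = lambda z: (a * z + b, np.sum(np.log(np.abs(a))))

z0 = np.random.randn(2)
log_q0 = -0.5 * np.sum(z0 ** 2) - np.log(2 * np.pi)   # standard normal log-density in 2D
zT, log_qT = flow_log_density(z0, log_q0, [affine, affine])
```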
Rezende et al. experimented only with: \( f^t(z^{t-1}) = z^{t-1} + u\, h(w^T z^{t-1} + b) \) where \(h\) is a non-linearity.
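For reference, a sketch of that planar transformation and its log-Jacobian, taking \(h=\tanh\) (the constraint on \(u, w\) that guarantees invertibility is omitted here):

```python
import numpy as np

def planar_flow(z, u, w, b):
    """One planar step f(z) = z + u * tanh(w^T z + b) and log|det df/dz|."""
    pre = np.dot(w, z) + b                      # scalar w^T z + b
    f_z = z + u * np.tanh(pre)
    psi = (1.0 - np.tanh(pre) ** 2) * w         # h'(w^T z + b) * w
    log_det = np.log(np.abs(1.0 + np.dot(u, psi)))   # det(I + u psi^T) = 1 + u^T psi
    return f_z, log_det
```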
==CHECKME: Hamiltonian flow (Salimans et al. 2014)==
Objective: find a more powerful normalizing flow (still computationally cheap)! ==Autoregressive Gaussian model==: let \(y\) be a random vector (or tensor) with an ordering on its elements. On \(y = \\{y\_i\\}\_{1\leq i\leq D}\) we define an autoregressive Gaussian generative model as:
\[ \begin{align} & y_0 = \mu_0 + \sigma_0 z_0\\ & y_i = \mu_i(y_{1:i-1}) + \sigma_i(y_{1:i-1}) z_i\\ & z_i \sim \mathcal N(0,1)\:\:\forall i \end{align}\]
where \(\mu_i, \sigma_i\) are neural networks with parameters \(\theta\). Such models include LSTMs (which take the previous elements of \(y\) and predict the mean and std of the next element) and Gaussian MADE models.
==CAVEAT:== the \(y\_i\) have to be generated sequentially!
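A sketch of why sampling is sequential, with hypothetical `mu_fn`/`sigma_fn` callables standing in for the autoregressive network:

```python
import numpy as np

def sample_autoregressive(mu_fn, sigma_fn, D, rng=np.random):
    """Ancestral sampling of the autoregressive Gaussian model:
    y_i = mu_i(y_{1:i-1}) + sigma_i(y_{1:i-1}) * z_i.
    mu_fn(prefix, i) and sigma_fn(prefix, i) return the mean/std of
    element i given the already generated prefix."""
    y = np.zeros(D)
    z = rng.standard_normal(D)
    for i in range(D):                 # D sequential steps: cannot be parallelized
        y[i] = mu_fn(y[:i], i) + sigma_fn(y[:i], i) * z[i]
    return y
```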
The autoregressive model thus transforms \(z \sim\mathcal N(0,I)\) into a vector \(y\) with a more complicated distribution. Provided \(\sigma\_i > 0\), the transformation is invertible:\[ z_i = \frac{y_i - \mu_i(y_{1:i-1})}{\sigma_i(y_{1:i-1})} \]
This inverse transformation whitens the data into an i.i.d. standard normal distribution. What's more, the \(z\_i\) can be computed in parallel, and vectorized:\[ z = \frac{y - \mu(y)}{\sigma(y)} \phantom{abcde}\text{(elementwise operations)} \]
This inverse autoregressive transformation has a lower triangular Jacobian matrix whose diagonal elements are \(1/\sigma_i(y)\):\[ \log \left|\det\frac{dz}{dy}\right| = - \sum_{i=1}^D \log \sigma_i(y) \]
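In code, the inverse pass and its log-Jacobian are just elementwise operations (assuming `mu` and `sigma` are the outputs of one forward pass of the autoregressive network on the full \(y\)):

```python
import numpy as np

def inverse_autoregressive(y, mu, sigma):
    """Vectorized inverse (whitening) transform z = (y - mu(y)) / sigma(y).
    Valid in one shot because mu_i, sigma_i only depend on y_{1:i-1}, so all
    elements of z are obtained in parallel."""
    z = (y - mu) / sigma
    log_det = -np.sum(np.log(sigma))   # log|det dz/dy| = -sum_i log sigma_i(y)
    return z, log_det
```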
Now we use the inverse whitening transformation above as a normalizing flow for variational inference.
When using IAF in the posterior approximation, we use a factorized Gaussian for \(z^0 \sim q(z^0|x,c)\), and then perform \(T\) steps of IAF (as shown in the figure above):
\[ (\forall t=1,..,T)\;\:z^t = f^t(z^{t-1},x,c)=\frac{z^{t-1} - \mu^t(z^{t-1},x,c)}{\sigma^t(z^{t-1},x,c)} \]
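A sketch of the resulting posterior sampler, accumulating \(\log q(z^T|x,c)\) along the way (the `steps` interface is illustrative; conditioning on \((x,c)\) is assumed to be handled inside each step's network):

```python
import numpy as np

def iaf_posterior_sample(mu0, sigma0, eps, steps):
    """Draw z^T ~ q(z^T|x,c): a factorized Gaussian sample z^0 followed by
    T inverse autoregressive steps. Each element of `steps` is a callable
    z -> (mu^t, sigma^t)."""
    z = mu0 + sigma0 * eps
    # log q(z^0|x,c) of the factorized Gaussian
    log_q = -np.sum(np.log(sigma0) + 0.5 * eps ** 2 + 0.5 * np.log(2 * np.pi))
    for autoregressive_nn in steps:
        mu_t, sigma_t = autoregressive_nn(z)
        z = (z - mu_t) / sigma_t                 # z^t = (z^{t-1} - mu^t) / sigma^t
        log_q = log_q + np.sum(np.log(sigma_t))  # -log|det| = +sum_i log sigma^t_i
    return z, log_q
```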
To introduce non-linear dependencies between the elements of \(z\), we use MADE as the autoregressive NN: masks are applied to the weight matrices in such a way that \(\mu(y), \sigma(y)\) are ==autoregressive, i.e. \(\partial\mu\_i(y)/\partial y\_j = 0,\:j\geq i\)== (a mask-building sketch follows below).
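A sketch of how such masks can be built for a single hidden layer (degree-based masking in the spirit of Germain et al.; the exact construction here is illustrative, not the paper's implementation):

```python
import numpy as np

def made_masks(D, H, rng=np.random):
    """Degree-based masks for a one-hidden-layer Gaussian MADE. With these
    masks, output i depends only on inputs j < i, i.e. d mu_i / d y_j = 0
    for j >= i."""
    deg_in = np.arange(1, D + 1)                 # input degrees 1..D
    deg_hid = rng.randint(1, D, size=H)          # hidden degrees in {1,..,D-1}
    deg_out = np.arange(1, D + 1)                # one (mu_i, sigma_i) head per i
    mask_in = (deg_hid[:, None] >= deg_in[None, :]).astype(float)   # H x D
    mask_out = (deg_out[:, None] > deg_hid[None, :]).astype(float)  # D x H
    return mask_in, mask_out

# the masks multiply the weight matrices elementwise, e.g.:
#   h  = relu((W1 * mask_in) @ y + b1)
#   mu = (W_mu * mask_out) @ h + b_mu     (same mask_out for sigma's weights)
```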
We can also use LSTMs as the autoregressive NN. They are more powerful than MADE ==but the computation of \(\mu, \sigma\) cannot be parallelized==.
Check http://bjlkeng.github.io/posts/variational-autoencoders-with-inverse-autoregressive-flows/