Carl Doersch
arXiv | Courville's slides | Kingma & Welling's slides
The VAE has little to do with classic autoencoders; the name comes from the fact that the final training setup has an encoder-decoder architecture.
Generative modeling means learning a model distribution \(P(X)\) over a space \(\mathcal X\), e.g. natural images.
Objective: produce more examples like those already in a database (but not exactly the same). This is formalized as drawing \(X\) from the ground-truth distribution \(P_{gt}\). Existing methods typically suffer from one of these 3 drawbacks: strong assumptions about the structure of the data, severe approximations leading to suboptimal models, or computationally expensive inference procedures (e.g. MCMC).
The VAE makes weak assumptions, its training is fast via back-propagation, and the approximations it makes arguably introduce only small errors.
The generative model is conditioned on a latent variable \(z\) that somehow directs the model toward a specific category/class of the data points in \(\mathcal X\). A model is representative of the data if for every \(X\) there is at least one setting of the latent variable \(z\) that causes the model to generate \(X\). The latent variable is easy to sample, with a pdf \(P(z)\) over the latent space \(\mathcal Z\). From there we define a family of deterministic functions \(f(z;\theta)\) whose outputs are random variables in \(\mathcal X\). The parameterized problem is to find \(\theta\) such that sampling \(z\) from \(P(z)\) and computing \(f(z;\theta)\) yields samples that match the \(X\)'s in our data.

### The objective function:

Mathematically speaking, \(f(z;\theta)\) is replaced by the conditional probability \(P(X|z;\theta)\). We now aim to maximize \(P(X)\) for every \(X\) in the data, with
\[P(X) = \int P(X|z;\theta)P(z)dz\phantom{abcdefg}(a)\] In the VAE, the output distribution is often chosen to be Gaussian: \[P(X|z;\theta) = \mathcal N(X|f(z;\theta), \sigma^2 I)\phantom{abcdefg}(b)\]
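To see what evaluating (a) involves, one can estimate the integral by naive Monte Carlo over \(P(z)\). Below is a 1-D linear-Gaussian toy (all numbers illustrative, not from the text) where the exact answer is known; the estimate matches it here, but note that most sampled \(z\) contribute essentially nothing, which is what makes this approach impractical in high dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D model (illustrative choices): prior P(z) = N(0, 1),
# decoder f(z) = 2z + 1, output noise sigma = 0.1.
sigma = 0.1

def f(z):
    return 2.0 * z + 1.0

def log_gauss(x, mu, s):
    # Log-density of N(x | mu, s^2)
    return -0.5 * np.log(2 * np.pi * s**2) - 0.5 * ((x - mu) / s) ** 2

# Naive Monte Carlo estimate of (a): P(X) ~ mean_i P(X | z_i), z_i ~ P(z)
X = 1.0
z = rng.standard_normal(200_000)
p_x = np.mean(np.exp(log_gauss(X, f(z), sigma)))

# Exact value for this linear-Gaussian toy: X ~ N(1, 2^2 + sigma^2)
exact = np.exp(log_gauss(X, 1.0, np.sqrt(4 + sigma**2)))
print(p_x, exact)
```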
In (a) two problems arise: how to define \(z\), and how to handle the intractable integral. Intuitively, \(z\) summarizes all the parameters required to generate a sample. As for the integral, for most \(z\) the term \(P(X|z)\) is nearly zero, so sampling directly from \(P(z)\) is wasteful; instead we introduce a distribution \(Q(z)\) placing its mass on values of \(z\) likely to have produced \(X\), and relate \(E_{z\sim Q}[P(X|z)]\) to \(P(X)\) through the KL divergence:\[D[Q(z)||P(z|X)] = E_{z\sim Q}[\log Q(z) - \log P(z|X)]\]
Applying Bayes' rule, we find:\[\log P(X) - D[Q(z)||P(z|X)] = E_{z\sim Q}[\log P(X|z)] - D[Q(z) ||P(z)]\]
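Spelled out, the identity follows by expanding \(\log P(z|X)\) with Bayes' rule, \(\log P(z|X) = \log P(X|z) + \log P(z) - \log P(X)\), inside the KL term: \[D[Q(z)||P(z|X)] = E_{z\sim Q}[\log Q(z) - \log P(X|z) - \log P(z)] + \log P(X)\] since \(\log P(X)\) does not depend on \(z\). Grouping \(E_{z\sim Q}[\log Q(z) - \log P(z)] = D[Q(z)||P(z)]\) and rearranging recovers the identity.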
It makes sense to let \(Q\) depend on \(X\), i.e. \(Q(z|X)\):\[\log P(X) - D[Q(z|X)||P(z|X)] = E_{z\sim Q}[\log P(X|z)] - D[Q(z|X) ||P(z)]\phantom{abcdefg}(c)\]
The LHS is our objective function plus an error term measuring how well \(Q\) finds the right \(z\)'s to produce \(X\), while the RHS is something we can optimize via SGD. Moreover, the RHS mimics the encoder-decoder paradigm, with \(Q\) encoding \(X\) into \(z\) and \(P\) decoding \(z\) back into \(X\).
In this equation we choose a high-capacity model for \(Q(z|X)\), usually:
\[Q(z|X) = \mathcal N(z|\mu(X;\theta), \Sigma(X;\theta))\] where \(\mu\) and \(\Sigma\) (diagonal) are deterministic functions, e.g. implemented via neural networks. The KL divergence on the RHS is now between two multivariate Gaussians and can be computed in closed form. In our case: \[ D[\mathcal N(\mu(X), \Sigma(X)) || \mathcal N(0,I)]= \frac{1}{2}\left(tr(\Sigma(X)) + \mu(X)^T\mu(X) -dim(z)-\log det(\Sigma(X))\right) \] This term can be viewed as a regularizer; moreover, the \(\sigma\) parameter of (b) acts as a weighting factor between the two terms of the RHS.
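The closed-form KL can be checked numerically; a minimal numpy sketch, where the encoder outputs \(\mu\) and \(\Sigma\) are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative encoder outputs for one X (diagonal covariance).
mu = np.array([0.5, -1.0, 0.2])
sigma2 = np.array([0.8, 1.5, 0.3])          # diagonal of Sigma(X)

# Closed form: D[N(mu, Sigma) || N(0, I)]
#   = 0.5 * (tr(Sigma) + mu^T mu - dim(z) - log det(Sigma))
kl_closed = 0.5 * (sigma2.sum() + mu @ mu - mu.size - np.log(sigma2).sum())

# Monte Carlo check of the same quantity: E_{z~Q}[log Q(z) - log P(z)]
z = mu + np.sqrt(sigma2) * rng.standard_normal((500_000, 3))
log_q = -0.5 * (np.log(2 * np.pi * sigma2) + (z - mu) ** 2 / sigma2).sum(axis=1)
log_p = -0.5 * (np.log(2 * np.pi) + z ** 2).sum(axis=1)
kl_mc = np.mean(log_q - log_p)
print(kl_closed, kl_mc)
```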
As for the first term \(E_{z\sim Q}[\log P(X|z)]\), we can evaluate it for a single \(z\) instead of averaging many samples. However, simply computing \(P(X|z)\) from (b) for a given \(z\) shows no dependence on \(Q\), so the VAE would not learn a decoder \(P\) adapted to the encoder \(Q\). In other words, we need to propagate the error back through the sampler \(Q(z|X)\). SGD can handle stochastic inputs, but not stochastic units within the network. That is why we introduce the reparameterization trick, which moves the sampling to an input layer: to sample from \(Q(z|X)\), we first sample \(\epsilon \sim \mathcal N(0,I)\), then compute \(z = \mu(X) + \Sigma^{1/2}(X)\epsilon\).
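The trick fits in a few lines of numpy; the \(\mu\) and \(\Sigma\) values below are illustrative placeholders, not learned encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(2)

# Reparameterization trick: write z ~ N(mu(X), Sigma(X)) as a deterministic
# function of (mu, Sigma) plus exogenous noise, so gradients can flow
# through mu and Sigma while only eps is sampled.
mu = np.array([1.0, -2.0])                  # placeholder mu(X)
sigma2 = np.array([0.25, 4.0])              # placeholder diag Sigma(X)

eps = rng.standard_normal((1_000_000, 2))   # eps ~ N(0, I): the only sampling
z = mu + np.sqrt(sigma2) * eps              # z = mu(X) + Sigma(X)^{1/2} eps

print(z.mean(axis=0), z.var(axis=0))        # close to mu and diag(Sigma)
```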
On our training set \(\mathcal D\), the objective function is: \[ E_{X\sim \mathcal D}[E_{\epsilon \sim \mathcal N(0,I)}[\log P(X|z=\mu(X) + \Sigma(X)^{1/2}\epsilon)] - D[Q(z|X)||P(z)]] \]
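Putting the two terms together for a single training sample, a toy numpy sketch; the encoder outputs, the decoder \(f\), and \(\sigma\) are all illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy single-sample objective with Gaussian decoder P(X|z) = N(X | f(z), sigma^2 I).
sigma = 0.5                                  # decoder noise scale
X = np.array([0.3, -0.7])

mu = np.array([0.1, 0.0])                    # placeholder encoder mean mu(X)
sigma2 = np.array([0.9, 1.1])                # placeholder diag Sigma(X)

def f(z):                                    # placeholder decoder network
    return np.tanh(z)

# One-sample reparameterized estimate of E_eps[log P(X | mu + Sigma^{1/2} eps)]
eps = rng.standard_normal(2)
z = mu + np.sqrt(sigma2) * eps
recon = -0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (X - f(z)) ** 2 / sigma**2)

# Closed-form KL term D[Q(z|X) || N(0, I)]
kl = 0.5 * (sigma2.sum() + mu @ mu - mu.size - np.log(sigma2).sum())

elbo = recon - kl                            # lower-bounds log P(X)
print(elbo)
```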
At test time we can use the RHS as a lower bound on \(\log P(X)\) to estimate the probability of a test sample under the model.
How much error is introduced by \(D[Q(z|X)||P(z|X)]\)?
The tractability of the VAE relies on the assumption that \(Q(z|X)\) is Gaussian. For the KL divergence to converge to 0, \(P(z|X)\) should be Gaussian too, which is not necessarily the case for an arbitrary \(f\) defining the distribution \(P\). Thus we need to choose \(f\) so that we can simultaneously maximize \(\log P(X)\) and ensure that \(P(z|X)\) is Gaussian for all \(X\). This is possible provided \(\sigma\) is small enough w.r.t. the ground-truth distribution's CDF.

#### Information theory perspective:

Check minimum description length. Check bits back (keeping the neural networks simple by minimizing the description length of the weights, Hinton & van Camp 1993; autoencoders, minimum description length and Helmholtz free energy, Hinton & Zemel 1994).

#### Conditional variational AE:

This time the generative process of the VAE is conditioned on an input. The CVAE tackles problems where the input-to-output mapping is one-to-many.
Given an input-output pair \((X,Y)\), we want to learn a model \(P(Y|X)\) defined via a latent variable \(z\): \[ P(Y|z,X) = \mathcal N(Y|f(z,X), \sigma^2I)\]
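The one-to-many behaviour at sampling time can be sketched as follows; the decoder \(f\) here is a hypothetical stand-in for a trained network:

```python
import numpy as np

rng = np.random.default_rng(4)

# For a fixed input X, different prior samples z give different plausible
# outputs Y = f(z, X). f is a placeholder for a learned decoder network.
def f(z, X):
    return X + z

X = np.array([1.0, 2.0])
samples = [f(rng.standard_normal(2), X) for _ in range(5)]
# Five distinct Y's for the same X: the one-to-many mapping CVAE targets.
print(samples)
```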