Semi-Supervised Learning with Deep Generative Models

Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, Max Welling

NIPS 2014 | arXiv

We are given paired data \(\\{(x\_i, y\_i)\\}\_{1\leq i \leq N}\) with labels \(y\_i \in \\{1,\dots,L\\}\); the observations have corresponding latent variables \(z\_i\). The empirical distributions over the labelled and unlabelled subsets are denoted \(\tilde p\_l(x,y)\) and \(\tilde p\_u(x)\).

#### Latent feature discriminative model (M1):

  1. Train an encoder / generative model that provides a feature representation.
  2. Train a separate classifier in the latent space.

The generative model used is \(p(z) = \mathcal N(z|0,I)\) with \(p\_\theta(x|z) = f(x;z,\theta)\), where \(f\) is a suitable likelihood function whose probabilities are formed by a non-linear transformation of \(z\), i.e. a deep neural network.
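
A minimal PyTorch sketch of the M1 generative model (not the authors' original implementation); the Bernoulli likelihood and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class M1Decoder(nn.Module):
    """p_theta(x|z): an MLP mapping z to the parameters of a Bernoulli likelihood."""
    def __init__(self, z_dim=50, h_dim=600, x_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.Softplus(),
            nn.Linear(h_dim, x_dim),          # Bernoulli logits for each pixel
        )

    def forward(self, z):
        return self.net(z)                    # logits of p_theta(x|z)

def log_p_z(z):
    # log p(z) under the standard normal prior N(0, I), summed over latent dims.
    return torch.distributions.Normal(0.0, 1.0).log_prob(z).sum(dim=-1)
```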

The features are set to approximate samples from the posterior \(p(z|x)\), and a classifier such as a transductive SVM (which encourages few unlabelled points near the margins) or multinomial regression is trained on these features.

#### Generative semi-supervised model (M2):

A probabilistic model that describes the data as being generated by a latent class variable \(y\) as well as the continuous latent variable \(z\): \(p(y) = Cat(y|\pi)\); \(p(z) = \mathcal N(z|0,I)\) and \(p\_\theta(x|y,z) = f(x;y,z,\theta)\).

\(Cat(y|\pi)\) is the multinomial distribution, and the class label is treated as a latent variable when the data point is unlabelled. ==The two latent variables are marginally independent==.
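
A corresponding sketch of the M2 generative model, again an illustrative rendering: the decoder \(f(x;y,z,\theta)\) consumes a one-hot label concatenated with \(z\), and \(\pi\) is assumed uniform. Names and dimensions are not from the paper.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class M2Decoder(nn.Module):
    """p_theta(x|y, z): an MLP over the concatenation of a one-hot label and z."""
    def __init__(self, z_dim=50, n_classes=10, h_dim=500, x_dim=784):
        super().__init__()
        self.n_classes = n_classes
        self.net = nn.Sequential(
            nn.Linear(z_dim + n_classes, h_dim), nn.Softplus(),
            nn.Linear(h_dim, x_dim),            # Bernoulli logits
        )

    def forward(self, y, z):
        y_onehot = F.one_hot(y, self.n_classes).float()
        return self.net(torch.cat([y_onehot, z], dim=-1))

def log_p_y(y, n_classes=10):
    # log p(y) under a uniform categorical prior Cat(y | pi), pi_k = 1/K.
    return torch.full(y.shape, -math.log(n_classes))
```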

#### Stacked generative semi-supervised model (M1+M2):

  1. learn a latent variable \(z\_1\) using M1.
  2. use M2 on \(z\_1\) instead of the raw data (see the sketch below).
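
A brief sketch of the stacking step, assuming a hypothetical trained M1 inference network `m1_encoder` returning the mean and log-variance of \(q\_\phi(z\_1|x)\):

```python
import torch

# `m1_encoder` and `x_batch` are hypothetical names for a trained M1 encoder
# and a minibatch of observations.
with torch.no_grad():
    mu1, log_var1 = m1_encoder(x_batch)                          # q_phi(z1 | x)
    z1 = mu1 + torch.randn_like(mu1) * (0.5 * log_var1).exp()    # sample z1
# z1 now replaces x as the observed data when training model M2.
```
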
In all models a VAE approach is used: the true posterior is approximated with a distribution \(q\_\phi\) with variational parameters \(\phi\). For M1 we use a Gaussian inference network \(q\_\phi(z|x)\), and for M2 we assume a factorized form:

\[ q_\phi(z, y|x) = q_\phi(z|x)q_\phi(y|x)\]

specified as gaussian and multinomial distributions.

\[ M1: q_\phi(z|x) = \mathcal N(z | \mu_\phi(x), diag(\sigma^2_\phi(x)))\\ M2: q_\phi(z|y, x) = \mathcal N(z | \mu_\phi(y,x), diag(\sigma^2_\phi(x)));\:q_\phi(y|x) = Cat(y|\pi_\phi(x)) \]
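
A sketch of these inference networks as MLPs (illustrative PyTorch, not the paper's code): a Gaussian encoder for \(q\_\phi(z|x)\) (M1) or \(q\_\phi(z|y,x)\) (M2, with a one-hot \(y\) appended to the input), and a softmax classifier for \(q\_\phi(y|x)\).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianEncoder(nn.Module):
    """q_phi(z|x) for M1, or q_phi(z|y,x) for M2 when `in_dim` includes a one-hot y."""
    def __init__(self, in_dim=784, h_dim=500, z_dim=50):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(in_dim, h_dim), nn.Softplus())
        self.mu = nn.Linear(h_dim, z_dim)
        self.log_var = nn.Linear(h_dim, z_dim)     # log of diag(sigma^2_phi)

    def forward(self, inp):
        h = self.hidden(inp)
        return self.mu(h), self.log_var(h)

class Classifier(nn.Module):
    """q_phi(y|x) = Cat(y | pi_phi(x)), with pi_phi given by a softmax MLP."""
    def __init__(self, x_dim=784, h_dim=500, n_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim, h_dim), nn.Softplus(),
            nn.Linear(h_dim, n_classes),
        )

    def forward(self, x):
        return F.log_softmax(self.net(x), dim=-1)  # log pi_phi(x)
```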

The functions parameterized by \(\phi\) are represented as MLPs. For M1, the variational bound is:

\[ -\mathcal J(x) = E_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - KL[q_\phi(z|x)|| p_\theta(z)] \]
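
A sketch of this bound with a single-sample reparameterized estimate of the expectation and the closed-form Gaussian KL, reusing the hypothetical encoder/decoder modules above and assuming inputs in \([0,1]\):

```python
import torch
import torch.nn.functional as F

def m1_elbo(x, encoder, decoder):
    # -J(x) = E_q[log p_theta(x|z)] - KL[q_phi(z|x) || p(z)]
    mu, log_var = encoder(x)                                   # q_phi(z|x)
    z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()      # reparameterization
    log_px_z = -F.binary_cross_entropy_with_logits(
        decoder(z), x, reduction="none").sum(dim=-1)           # log p_theta(x|z)
    # KL between N(mu, diag(sigma^2)) and N(0, I), in closed form.
    kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1.0).sum(dim=-1)
    return log_px_z - kl                                       # per-example -J(x)
```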

While for M2, we distinguish between the labelled and unlabelled bound:

\[ \begin{align} -\mathcal L(x, y) & = E_{q_\phi(z|x,y)}\left[\log p_\theta(x|y,z) + \log p_\theta(y) + \log p(z) - \log q_\phi(z|x,y)\right]\\ -\mathcal U(x) & = \sum_y q_\phi(y|x)(-\mathcal L(x,y)) + \mathcal H(q_\phi(y|x))\\ \mathcal J & = \sum_{(x,y)\sim \tilde p_l} \mathcal L(x,y) + \sum_{x \sim \tilde p_u} \mathcal U(x) \end{align} \]
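
A sketch of the two M2 bounds, reusing the hypothetical modules above (`GaussianEncoder` fed \([x, \text{one-hot}(y)]\), `M2Decoder`, `Classifier`): the labelled bound uses the observed label, the unlabelled bound enumerates the label under \(q\_\phi(y|x)\) and adds its entropy. A uniform \(p(y)\), a single-sample estimate, and a closed-form Gaussian KL (equivalent in expectation to the \(\log p(z) - \log q\_\phi(z|x,y)\) terms) are assumptions of this sketch.

```python
import math
import torch
import torch.nn.functional as F

def m2_labelled_bound(x, y, encoder, decoder, n_classes=10):
    # -L(x, y), estimated with one sample of z ~ q_phi(z|x, y).
    y_onehot = F.one_hot(y, n_classes).float()
    mu, log_var = encoder(torch.cat([x, y_onehot], dim=-1))     # q_phi(z|x, y)
    z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()
    log_px_yz = -F.binary_cross_entropy_with_logits(
        decoder(y, z), x, reduction="none").sum(dim=-1)         # log p_theta(x|y,z)
    log_py = -math.log(n_classes)                               # uniform p(y)
    kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1.0).sum(dim=-1)
    return log_px_yz + log_py - kl                              # per-example -L(x,y)

def m2_unlabelled_bound(x, encoder, decoder, classifier, n_classes=10):
    # -U(x) = sum_y q_phi(y|x) (-L(x, y)) + H(q_phi(y|x))
    log_qy = classifier(x)                                      # log q_phi(y|x)
    qy = log_qy.exp()
    bounds = []
    for k in range(n_classes):                                  # enumerate labels
        y = torch.full((x.size(0),), k, dtype=torch.long)
        bounds.append(m2_labelled_bound(x, y, encoder, decoder, n_classes))
    bounds = torch.stack(bounds, dim=-1)                        # (batch, K)
    entropy = -(qy * log_qy).sum(dim=-1)
    return (qy * bounds).sum(dim=-1) + entropy                  # per-example -U(x)
```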

Unfortunately, the label predictive distribution \(q\_\phi(y|x)\) appears only in the unlabelled term of the bound. To remedy this, an additional classification loss on the labelled data is added:

\[ \mathcal J^\alpha = \mathcal J + \alpha E_{\tilde p_l(x,y)}\left[- \log q_\phi(y|x)\right] \]

In the experiments \(\alpha = 0.1 \cdot N\). The optimization of the variational lower bound is performed via AEVB.
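
Putting it together, a sketch of one training step on the extended objective, reusing the hypothetical helpers above; a standard optimizer step stands in for the AEVB update. In the text \(\mathcal J\) sums over the whole data set with \(\alpha = 0.1 \cdot N\); with per-example minibatch means, as here, a small constant weight (0.1) plays the analogous role.

```python
import torch

def j_alpha_step(x_l, y_l, x_u, encoder, decoder, classifier, optimizer,
                 alpha=0.1, n_classes=10):
    # Minibatch estimate of J^alpha = J + alpha * E_{p_l}[-log q_phi(y|x)].
    loss_l = -m2_labelled_bound(x_l, y_l, encoder, decoder, n_classes).mean()
    loss_u = -m2_unlabelled_bound(x_u, encoder, decoder, classifier, n_classes).mean()
    class_loss = torch.nn.functional.nll_loss(classifier(x_l), y_l)  # -log q_phi(y|x)
    loss = loss_l + loss_u + alpha * class_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                              # SGVB / AEVB-style gradient update
    return loss.item()
```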