Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, Max Welling
NIPS 2014 | arXiv
We are given paired data \(\\{(x\_i, y\_i)\\}\_{1\leq i \leq N}\) with labels \(y\_i \in \\{1,\dots,L\\}\); each observation \(x\_i\) has a corresponding latent variable \(z\_i\). The empirical distributions over the labelled and unlabelled subsets are denoted \(\tilde p\_l(x,y)\) and \(\tilde p\_u(x)\).

#### Latent feature discriminative model (M1):

1) Train an encoder / generative model that provides a feature representation. 2) Train a separate classifier in the latent space. The generative model used is \(p(z) = \mathcal N(z|0,I)\) with \(p\_\theta(x|z) = f(x;z,\theta)\), where \(f\) is a suitable likelihood function whose probabilities are formed by a non-linear transformation of \(z\), i.e. a deep neural network.
The features are set to approximate samples from the posterior \(p(z|x)\), and a classifier such as a transductive SVM (which encourages few unlabelled points to lie near the margin) or multinomial regression is trained on these features, as in the sketch below.
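A rough illustration of the two-stage M1 pipeline; the layer sizes, the softplus non-linearity, and the `LogisticRegression` stand-in for the transductive SVM are assumptions, not the paper's exact setup:

```python
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression  # stand-in for the TSVM

class Encoder(nn.Module):
    """Hypothetical inference network q_phi(z|x) -> (mu, log sigma^2)."""
    def __init__(self, x_dim=784, h_dim=500, z_dim=50):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Softplus())
        self.mu = nn.Linear(h_dim, z_dim)
        self.log_var = nn.Linear(h_dim, z_dim)

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.log_var(h)

# Step 1 (omitted): fit (Encoder, decoder) as an ordinary VAE on all inputs.
# Step 2: embed the labelled subset and fit the classifier in latent space.
encoder = Encoder()
x_l = torch.rand(100, 784)          # placeholder labelled inputs
y_l = torch.randint(0, 10, (100,))  # placeholder labels in {0, ..., 9}
with torch.no_grad():
    mu, log_var = encoder(x_l)
    z_l = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # posterior sample
clf = LogisticRegression(max_iter=1000).fit(z_l.numpy(), y_l.numpy())
```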
#### Generative semi-supervised model (M2):

A probabilistic model that describes the data as being generated by a latent class variable \(y\) as well as a continuous latent variable \(z\): \(p(y) = Cat(y|\pi)\); \(p(z) = \mathcal N(z|0,I)\); \(p\_\theta(x|y,z) = f(x;y,z,\theta)\). \(Cat(y|\pi)\) is the multinomial distribution, and the class label is treated as a latent variable when the data point is unlabelled. ==The two latent variables are marginally independent==. The sketch below illustrates this generative process.
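A minimal sketch of M2's generative story under the stated priors; the `decoder` network standing in for \(f(x;y,z,\theta)\) is hypothetical:

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical, Normal

L, z_dim = 10, 50
pi = torch.full((L,), 1.0 / L)            # uniform class prior
y = Categorical(probs=pi).sample()        # y ~ Cat(y|pi)
z = Normal(0.0, 1.0).sample((z_dim,))     # z ~ N(0, I), independent of y
y_onehot = F.one_hot(y, L).float()
# x ~ p_theta(x|y,z) = f(x; y, z, theta), e.g. x_logits = decoder(y_onehot, z)
```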
For M2, the approximate posterior factorizes as \[ q_\phi(z, y|x) = q_\phi(z|y, x)\,q_\phi(y|x),\] with the factors specified as Gaussian and multinomial distributions:
\[ M1: q_\phi(z|x) = \mathcal N(z | \mu_\phi(x), diag(\sigma^2_\phi(x)))\\ M2: q_\phi(z|y, x) = \mathcal N(z | \mu_\phi(y,x), diag(\sigma^2_\phi(x)));\quad q_\phi(y|x) = Cat(y|\pi_\phi(x)) \]
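The M2 inference networks might look as follows (a sketch; the layer sizes, softplus non-linearities, and shared trunk are assumptions, and the M1 encoder was already sketched above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class M2Inference(nn.Module):
    """Hypothetical MLPs for q_phi(y|x) and q_phi(z|y,x)."""
    def __init__(self, x_dim=784, y_dim=10, h_dim=500, z_dim=50):
        super().__init__()
        # q_phi(y|x) = Cat(y|pi_phi(x)): logits, softmax applied in q_y
        self.classifier = nn.Sequential(
            nn.Linear(x_dim, h_dim), nn.Softplus(), nn.Linear(h_dim, y_dim))
        # q_phi(z|y,x) = N(mu_phi(y,x), diag(sigma^2_phi(x)))
        self.body = nn.Sequential(
            nn.Linear(x_dim + y_dim, h_dim), nn.Softplus())
        self.mu = nn.Linear(h_dim, z_dim)
        self.log_var = nn.Linear(h_dim, z_dim)

    def q_y(self, x):
        return F.softmax(self.classifier(x), dim=-1)

    def q_z(self, x, y_onehot):
        h = self.body(torch.cat([x, y_onehot], dim=-1))
        return self.mu(h), self.log_var(h)
```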
The functions parameterized by \(\phi\) are represented as MLPs. For M1, the variational bound is:\[ -\mathcal J(x) = E_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - KL[q_\phi(z|x)|| p_\theta(z)] \]
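In code, a single-sample estimate of \(-\mathcal J(x)\) with the usual reparameterization trick and the analytic KL against \(\mathcal N(0,I)\) could read as follows (assumes the hypothetical `Encoder` above, binarized inputs, and a Bernoulli likelihood, i.e. a `decoder` returning logits):

```python
import torch
import torch.nn.functional as F

def m1_elbo(x, encoder, decoder):
    """Single-sample estimate of -J(x) = E_q[log p(x|z)] - KL[q(z|x) || p(z)]."""
    mu, log_var = encoder(x)
    z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterization
    log_px = -F.binary_cross_entropy_with_logits(
        decoder(z), x, reduction="none").sum(-1)           # log p_theta(x|z)
    kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(-1)
    return log_px - kl
```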
For M2, we distinguish between the labelled and unlabelled bounds:
\[ \begin{align} -\mathcal L(x, y) & = E_{q_\phi(z|x,y)}\left[\log p_\theta(x|y,z) + \log p_\theta(y) + \log p(z) - \log q_\phi(z|x,y)\right]\\ -\mathcal U(x) & = \sum_y q_\phi(y|x)(-\mathcal L(x,y)) + \mathcal H(q_\phi(y|x))\\ \mathcal J & = \sum_{(x,y)\sim \tilde p_l} \mathcal L(x,y) + \sum_{x \sim \tilde p_u} \mathcal U(x) \end{align} \]
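A sketch of the two bounds, enumerating the classes for the unlabelled case; it reuses the hypothetical `M2Inference` and Bernoulli `decoder` from above, and `log_py` is the constant \(\log p\_\theta(y) = -\log L\) for a uniform class prior:

```python
import math
import torch
import torch.nn.functional as F

def labelled_bound(x, y_onehot, net, decoder, log_py=math.log(0.1)):
    """Single-sample estimate of -L(x, y); log_py assumes L = 10 classes."""
    mu, log_var = net.q_z(x, y_onehot)
    z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()
    log_px = -F.binary_cross_entropy_with_logits(
        decoder(torch.cat([y_onehot, z], -1)), x, reduction="none").sum(-1)
    kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(-1)
    return log_px + log_py - kl   # log p(z) - log q(z|x,y) folded into the -KL

def unlabelled_bound(x, net, decoder, n_classes=10):
    """-U(x): weight -L(x,y) by q_phi(y|x) and add the entropy H(q_phi(y|x))."""
    pi = net.q_y(x)                                  # (batch, n_classes)
    total = torch.zeros(x.size(0))
    for k in range(n_classes):
        y = F.one_hot(torch.full((x.size(0),), k, dtype=torch.long),
                      n_classes).float()
        total = total + pi[:, k] * labelled_bound(x, y, net, decoder)
    entropy = -(pi * pi.clamp_min(1e-8).log()).sum(-1)
    return total + entropy
```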
Unfortunately, the label predictive distribution \(q\_\phi(y|x)\) appears only in the unlabelled term of the bound. To remedy this, an additional classification loss on the labelled data is added:\[ \mathcal J^\alpha = \mathcal J + \alpha E_{\tilde p_l(x,y)}\left[- \log q_\phi(y|x)\right] \]
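Hypothetical glue code assembling \(\mathcal J^\alpha\) from the bounds above, where `x_l, y_l_idx` is a labelled mini-batch and `x_u` an unlabelled one; note the paper's \(\mathcal J\) sums over the whole dataset, so with mini-batch means \(\alpha\) would be rescaled accordingly:

```python
N = 50_000        # illustrative dataset size, only to mirror alpha = 0.1 * N
alpha = 0.1 * N
# L and U are the negated ELBOs returned above, and F.cross_entropy gives
# E[-log q_phi(y|x)] on the labelled batch.
loss = (-labelled_bound(x_l, F.one_hot(y_l_idx, 10).float(), net, decoder).mean()
        - unlabelled_bound(x_u, net, decoder).mean()
        + alpha * F.cross_entropy(net.classifier(x_l), y_l_idx))
```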
In the experiments, \(\alpha = 0.1\,N\). The optimization of the variational lower bound is performed via AEVB (auto-encoding variational Bayes).
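A bare-bones AEVB-style update loop might look like the sketch below; the semi-supervised batch `loader`, the `objective` wrapper around the loss above, and the optimizer choice are all assumptions:

```python
import torch

opt = torch.optim.Adam(
    list(net.parameters()) + list(decoder.parameters()), lr=3e-4)
for x_l, y_l_idx, x_u in loader:         # hypothetical semi-supervised batches
    loss = objective(x_l, y_l_idx, x_u)  # J^alpha assembled as above
    opt.zero_grad()
    loss.backward()                      # reparameterized gradients (AEVB)
    opt.step()
```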