Understanding the Variational Autoencoder
Let us consider some dataset $X = \{x^{(i)}\}_{i=1}^{N}$ consisting of N i.i.d. samples of some continuous or discrete variable x. We assume that the data are generated by some random process involving an unobserved continuous random variable z. The process consists of two steps: (1) a value $z^{(i)}$ is generated from some prior distribution p(z); (2) a value $x^{(i)}$ is generated from some conditional distribution p(x|z). Given the N data points $X = \{x^{(1)}, \dots, x^{(N)}\}$ we typically aim at maximizing the marginal log-likelihood

$$\log p(X) = \sum_{i=1}^{N} \log p(x^{(i)})$$
with respect to the parameters. This task can be troublesome because the integral of the marginal likelihood $p(x) = \int p(z)\, p(x|z)\, dz$ is intractable (so we cannot evaluate or differentiate the marginal likelihood). These intractabilities are quite common and appear already for moderately complicated likelihood functions p(x|z), e.g. a neural network with a nonlinear hidden layer. To overcome this issue one can introduce an inference model (an encoder) q(z|x) and optimize the variational lower bound

$$\mathcal{L}(x) = \mathbb{E}_{q(z|x)}[\log p(x|z)] - KL(q(z|x)\,\|\,p(z)) \le \log p(x),$$
where p(x|z) is called a decoder and p(z) = N(z|0, I) is the prior. There are various ways of optimizing this lower bound, but for continuous z this can be done efficiently through a re-parameterization of q(z|x). The resulting architecture is called a variational auto-encoder (VAE).
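As a concrete illustration of the re-parameterization trick, here is a minimal sketch in PyTorch (the function name and tensor shapes are illustrative, not taken from the referenced papers): instead of sampling z directly from q(z|x) = N(µ, σ²), we sample ε ~ N(0, I) and compute z = µ + σ·ε, so that gradients can flow back through µ and σ.

```python
import torch

def reparameterize(mu, logvar):
    """Sample z ~ N(mu, sigma^2) as z = mu + sigma * eps, with eps ~ N(0, I).

    Writing the sample as a deterministic function of (mu, logvar) and the
    external noise eps keeps the path differentiable w.r.t. the encoder output.
    """
    std = torch.exp(0.5 * logvar)   # sigma = exp(log(sigma^2) / 2)
    eps = torch.randn_like(std)     # noise drawn independently of the parameters
    return mu + eps * std

# Hypothetical usage: a batch of 4 latent means / log-variances of dimension 2.
mu = torch.zeros(4, 2, requires_grad=True)
logvar = torch.zeros(4, 2, requires_grad=True)
z = reparameterize(mu, logvar)
z.sum().backward()                  # gradients reach mu and logvar through the sample
```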
The variational approach exploits sampling from an auxiliary "inference" distribution Q(z|X), hopefully producing values for z more likely to effectively contribute to the (re)generation of X. The relation between P(X) and $\mathbb{E}_{z\sim Q(z|X)} P(X|z)$ is given by the following equation, where KL denotes the Kullback-Leibler divergence:

$$\log P(X) - KL(Q(z|X)\,\|\,P(z|X)) = \mathbb{E}_{z\sim Q(z|X)}[\log P(X|z)] - KL(Q(z|X)\,\|\,P(z)).$$
The KL-divergence is always non-negative, so the term on the right-hand side provides a lower bound to the log-likelihood log P(X), known as the Evidence Lower Bound (ELBO).
In traditional implementations, we additionally assume that Q(z|X) is normally distributed around an encoding function µ_θ(X), with variance σ²_θ(X); similarly, P(X|z) is normally distributed around a decoder function d_θ(z). The functions µ_θ, σ²_θ, and d_θ are approximated by deep neural networks. Since the encoder outputs both the mean and the variance of the latent variables, we can sample from Q(z|X) during training via the re-parameterization trick described above.
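A minimal sketch of this setup, assuming small fully connected networks in PyTorch and reusing the reparameterize helper from the previous snippet (layer sizes and names are illustrative choices, not the architecture of any referenced paper):

```python
import torch
import torch.nn as nn

class GaussianVAE(nn.Module):
    """Encoder heads mu_theta(X), sigma^2_theta(X) and decoder d_theta(z) as small MLPs."""

    def __init__(self, x_dim=784, h_dim=256, z_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu_head = nn.Linear(h_dim, z_dim)       # mu_theta(X)
        self.logvar_head = nn.Linear(h_dim, z_dim)   # log sigma^2_theta(X)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))  # d_theta(z)

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        z = reparameterize(mu, logvar)               # sampling during training
        return self.dec(z), mu, logvar
```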
Provided the model for the decoder function d_θ(z) is sufficiently expressive, the shape of the prior distribution P(z) for the latent variables can be arbitrary, and for simplicity we may assume it is a standard normal distribution P(z) = G(0, 1). The term KL(Q(z|X)||P(z)) is hence the KL-divergence between two Gaussian distributions G(µ_θ(X), σ²_θ(X)) and G(0, 1), which can be computed in closed form:

$$KL\big(G(\mu_\theta(X), \sigma^2_\theta(X))\,\|\,G(0, 1)\big) = \frac{1}{2}\big(\mu_\theta(X)^2 + \sigma^2_\theta(X) - \log\sigma^2_\theta(X) - 1\big).$$
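As a hedged sketch rather than a reference implementation, the closed-form expression above translates to a few lines of PyTorch, summing over the latent dimensions and averaging over the batch (the function name is illustrative):

```python
import torch

def kl_standard_normal(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, I) ) computed from mu and log sigma^2.

    Per dimension: 0.5 * (mu^2 + sigma^2 - log sigma^2 - 1), summed over the
    latent dimensions and averaged over the batch.
    """
    kl_per_dim = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0)
    return kl_per_dim.sum(dim=1).mean()
```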
Regularization effect of KL Loss
The loss function is composed of two parts: the first is just the log-likelihood of the reconstruction, while the second is a term meant to enforce a known prior distribution P(z) over the latent space, typically a spherical normal distribution. Technically, this is achieved by minimizing the Kullback-Leibler divergence between Q(z|X) and the prior distribution P(z); as a side effect, this also improves the similarity of the aggregate inference distribution Q(z) = E_X Q(z|X) to the desired prior, which is our final objective.
Log-likelihood and KL-divergence are typically balanced by a suitable λ parameter (called β in the terminology of β-VAE), since they have somewhat contrasting effects: the former tries to improve the quality of the reconstruction, neglecting the shape of the latent space; the latter normalizes and smooths the latent space, possibly at the cost of some additional "overlapping" between latent variables, eventually resulting in a noisier encoding. If not properly tuned, the KL-divergence can also easily induce a sub-optimal use of network capacity, where only a limited number of latent variables are exploited for generation: this is the so-called over-pruning/variable-collapse/sparsity phenomenon.
Tuning down λ typically reduces the number of collapsed variables and improves the quality of the reconstructed images. However, this may not result in better quality of the generated samples, since we lose control of the shape of the latent space, which becomes harder for a random generator to exploit.
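The balancing described above can be made concrete with a sketch of the full loss. It assumes a Gaussian decoder with fixed variance, so that the reconstruction log-likelihood reduces to a squared error up to constants, and it reuses the kl_standard_normal helper from the previous snippet; the signature and the name lam are illustrative choices.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_rec, mu, logvar, lam=1.0):
    """Reconstruction term plus lambda-weighted KL regularizer.

    lam < 1 favours sharp reconstructions at the cost of a less regular latent
    space; lam > 1 (the beta-VAE regime) trades reconstruction fidelity for a
    smoother, more factorized latent space.
    """
    rec = F.mse_loss(x_rec, x, reduction="none").sum(dim=1).mean()  # -log p(x|z) up to constants
    kl = kl_standard_normal(mu, logvar)
    return rec + lam * kl
```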
In the variational autoencoder, the prior P(z) is specified as a standard normal distribution with mean zero and variance one, P(z) = G(0, 1). If the encoder outputs representations z that are different from those of a standard normal distribution, it receives a penalty in the loss. This regularizer term means to keep the representations z of each digit sufficiently diverse. If we didn't include the regularizer, the encoder could learn to cheat and give each datapoint a representation in a different region of Euclidean space. This is bad because then two images of the same number (say a 2 written by different people, 2_alice and 2_bob) could end up with very different representations z_alice and z_bob. We want the representation space of z to be meaningful, so we penalize this behavior. This has the effect of keeping the representations of similar numbers close together.
Defining the Variational Autoencoder in a constrained setting
In this setting, we introduce an adjustable hyperparameter β into the original objective:

$$\mathcal{L}(\theta, \phi; x, z, \beta) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta\, KL(q_\phi(z|x)\,\|\,p(z)).$$
The above equation is derived from the optimization problem given by:

$$\max_{\theta, \phi}\; \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] \quad \text{subject to} \quad KL(q_\phi(z|x)\,\|\,p(z)) < \epsilon.$$
Re-writing the above as a Lagrangian under the KKT conditions gives:

$$F(\theta, \phi, \beta; x, z) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta\,\big(KL(q_\phi(z|x)\,\|\,p(z)) - \epsilon\big),$$
where the KKT multiplier β is the regularisation coefficient that constrains the capacity of the latent information channel z and puts implicit independence pressure on the learnt posterior due to the isotropic nature of the Gaussian prior p(z). Since β, ε ≥ 0, according to the complementary slackness KKT condition this can be written as:

$$F(\theta, \phi, \beta; x, z) \ge \mathcal{L}(\theta, \phi; x, z, \beta) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta\, KL(q_\phi(z|x)\,\|\,p(z)).$$
Well-chosen values of β (usually β > 1) result in more disentangled latent representations. When β = 1, β-VAE becomes equivalent to the original VAE framework. It was suggested that the stronger pressure for the posterior q_φ(z|x) to match the factorized unit Gaussian prior p(z) introduced by the β-VAE objective puts extra constraints on the implicit capacity of the latent bottleneck and extra pressure for it to be factorized, while still being sufficient to reconstruct the data. The higher values of β necessary to encourage disentangling often lead to a trade-off between the fidelity of β-VAE reconstructions and the disentangled nature of its latent code, due to the loss of information as it passes through the restricted-capacity latent bottleneck.
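Putting the pieces together, a hedged end-to-end training step could look as follows, reusing the GaussianVAE, vae_loss, and reparameterize sketches from above; the batch of random data and the value β = 4 (passed as lam) are purely illustrative, not recommendations from the referenced papers.

```python
import torch

# Hypothetical data: a batch of 32 flattened 28x28 images.
x = torch.rand(32, 784)

model = GaussianVAE(x_dim=784, h_dim=256, z_dim=16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x_rec, mu, logvar = model(x)
loss = vae_loss(x, x_rec, mu, logvar, lam=4.0)  # beta-VAE regime: lam > 1

optimizer.zero_grad()
loss.backward()
optimizer.step()
```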
References:
- Auto-Encoding Variational Bayes https://arxiv.org/pdf/1312.6114.pdf
- β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework https://openreview.net/pdf?id=Sy2fzU9gl