GENERATIVE LEARNING TRILEMMA
At present, generative learning frameworks are unable to meet three essential criteria simultaneously, which are often necessary for their extensive use in real-world applications. These criteria include (i) producing high-quality samples, (ii) covering all modes and generating diverse samples, and (iii) generating samples quickly and with low computational costs.
Diffusion models typically assume that the denoising distribution can be approximated by a Gaussian distribution. However, this assumption holds only when the denoising steps are infinitesimally small. As a result, the reverse process requires a large number of steps.
TRADITIONAL DIFFUSION MODEL:
Forward process:
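In the standard formulation, the forward process gradually adds Gaussian noise to the data over T steps. A sketch in common DDPM-style notation follows; the symbols x_0 (a data sample), T (the number of steps), and I (the identity matrix) are assumptions introduced here, while β_t is the per-step variance referred to later in this text:

```latex
q(x_{1:T} \mid x_0) = \prod_{t \ge 1} q(x_t \mid x_{t-1}),
\qquad
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)
```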
Reverse process:
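The reverse process is usually parameterized as a chain of learned Gaussian denoising steps. A sketch under that assumption follows; the mean network μ_θ, the per-step variance σ_t², and the prior p(x_T) are notation assumed here:

```latex
p_\theta(x_{0:T}) = p(x_T) \prod_{t \ge 1} p_\theta(x_{t-1} \mid x_t),
\qquad
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right)
```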
Loss function:
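Training is commonly described as maximizing an evidence lower bound (ELBO) on the log-likelihood, which amounts to minimizing a sum of per-step KL divergences between the true and the learned denoising distributions. A sketch of that objective, with C denoting terms that do not depend on θ (an assumed symbol):

```latex
\mathcal{L} = -\sum_{t \ge 1} \mathbb{E}_{q(x_t)}
\left[ D_{\mathrm{KL}}\!\left( q(x_{t-1} \mid x_t) \,\|\, p_\theta(x_{t-1} \mid x_t) \right) \right] + C,
\qquad
\mathcal{L} \le \log p_\theta(x_0)
```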
DENOISING DIFFUSION GANS:
The literature on diffusion models often relies on the assumption that the denoising distribution, q(x_{t-1}|x_t), can be approximated with a Gaussian distribution. It is worth examining when this approximation is accurate and when it is not.
According to Bayes' rule, the true denoising distribution, q(x_{t-1}|x_t), is proportional to the product of the forward Gaussian diffusion, q(x_t|x_{t-1}), and the marginal distribution at step t-1, q(x_{t-1}). When the step size β_t is infinitesimally small, this product is dominated by q(x_t|x_{t-1}), and it has been shown that the true denoising distribution then takes a Gaussian form. In this regime, the Gaussian approximation used by current diffusion models is accurate.
Another situation where the denoising distribution, q(x_{t-1}|x_t), takes a Gaussian form is when the data marginal, q(x_t), is itself a Gaussian distribution. The idea of using a VAE encoder to bring the data distribution closer to Gaussian was recently explored in LSGM (Vahdat et al., 2021). However, transforming the data to a Gaussian distribution is itself a difficult problem that VAE encoders cannot solve perfectly. As a result, even with a VAE-based approach, LSGM still requires a considerable number of steps (tens to hundreds) to be effective on complex datasets.
If the denoising step is large and the data distribution is non-Gaussian, there is no guarantee that the Gaussian assumption on the denoising distribution holds.
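The effect of the step size can be checked numerically on a toy problem. The sketch below is an illustration, not part of the original method: it assumes a one-dimensional, two-component Gaussian mixture as a stand-in for the marginal q(x_{t-1}), evaluates the unnormalized posterior q(x_{t-1}|x_t) ∝ q(x_t|x_{t-1}) q(x_{t-1}) on a grid, and counts its local maxima. All function names and parameter values are hypothetical.

```python
import numpy as np


def log_normal(x, mean, var):
    """Log-density of a 1-D Gaussian N(mean, var) evaluated at x."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)


def denoising_posterior_log_density(x_prev, x_t, beta):
    """Unnormalized log q(x_{t-1} | x_t) via Bayes' rule on a toy problem.

    Assumption: q(x_{t-1}) is an equal mixture of N(-1, 0.5^2) and N(+1, 0.5^2).
    """
    log_prior = np.log(0.5) + np.logaddexp(
        log_normal(x_prev, -1.0, 0.25), log_normal(x_prev, 1.0, 0.25)
    )
    # Forward diffusion step: q(x_t | x_{t-1}) = N(sqrt(1 - beta) * x_{t-1}, beta)
    log_lik = log_normal(x_t, np.sqrt(1.0 - beta) * x_prev, beta)
    return log_prior + log_lik


def count_modes(beta, x_t=0.0):
    """Count strict local maxima of the posterior over x_{t-1} on a fixed grid."""
    grid = np.linspace(-4.0, 4.0, 2001)
    logp = denoising_posterior_log_density(grid, x_t, beta)
    interior = logp[1:-1]
    is_peak = (interior > logp[:-2]) & (interior > logp[2:])
    return int(np.sum(is_peak))


small_step_modes = count_modes(beta=0.01)  # small denoising step
large_step_modes = count_modes(beta=0.9)   # large denoising step
print(small_step_modes, large_step_modes)
```

With these toy settings the script prints `1 2`: for the small step the posterior at x_t = 0 has a single peak, as a Gaussian would, while for the large step it keeps both modes of the data distribution, which a single Gaussian cannot represent.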
If the reverse process uses larger step sizes, meaning that there are fewer denoising steps, a non-Gaussian multimodal distribution is needed to model the denoising distribution accurately. This is because, in settings such as image synthesis, multiple plausible clean images can correspond to the same noisy image, which makes the denoising distribution multimodal. The denoising diffusion GAN models the denoising distributions with conditional GANs; its formulation is as follows.
Our forward diffusion model is structured in a manner similar to the diffusion models described in Equation 1, with the key difference being that we assume that T is small (T ≤ 8) and each diffusion step has a larger β_t value. To train our model, we use an adversarial loss to match the conditional GAN generator p_θ(x_{t-1}|x_t) and the denoising distribution q(x_{t-1}|x_t) by minimizing a divergence measure, D_adv, for each denoising step.
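Following the description above, the per-step matching objective can be written as a sum over denoising steps, with the expectation taken over noisy samples x_t (a sketch in the notation already used in this section):

```latex
\min_\theta \sum_{t \ge 1} \mathbb{E}_{q(x_t)}
\left[ D_{\mathrm{adv}}\!\left( q(x_{t-1} \mid x_t) \,\|\, p_\theta(x_{t-1} \mid x_t) \right) \right]
```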
where D_adv can be the Wasserstein distance, the Jensen-Shannon divergence, or an f-divergence, depending on the adversarial training setup. The generator is trained with the following loss function:
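Assuming the non-saturating GAN formulation and a time-dependent discriminator D_φ(x_{t-1}, x_t, t) with parameters φ (both are assumptions about the setup, with notation introduced here), the generator objective can be sketched as:

```latex
\max_\theta \sum_{t \ge 1} \mathbb{E}_{q(x_t)} \,
\mathbb{E}_{p_\theta(x_{t-1} \mid x_t)}
\left[ \log D_\phi(x_{t-1}, x_t, t) \right]
```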
The discriminator is trained with the following loss function:
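Under the same assumptions (non-saturating GAN formulation, time-dependent discriminator D_φ(x_{t-1}, x_t, t) with parameters φ), the discriminator is trained to tell real pairs drawn from q(x_{t-1}|x_t) apart from generated pairs drawn from p_θ(x_{t-1}|x_t). A sketch of the objective:

```latex
\min_\phi \sum_{t \ge 1} \mathbb{E}_{q(x_t)} \Big[
  \mathbb{E}_{q(x_{t-1} \mid x_t)} \left[ -\log D_\phi(x_{t-1}, x_t, t) \right]
  + \mathbb{E}_{p_\theta(x_{t-1} \mid x_t)} \left[ -\log\!\left( 1 - D_\phi(x_{t-1}, x_t, t) \right) \right]
\Big]
```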
It is widely known that GANs are prone to training instability and mode collapse. Our model mitigates these issues by breaking the generation process down into several conditional denoising diffusion steps, each of which is relatively simple to model due to the strong conditioning on x_t. Additionally, the diffusion process smooths the data distribution, which reduces the likelihood of the discriminator overfitting.
Although diffusion models offer high-quality and diverse samples, their expensive sampling can limit their applicability in many real-world problems. The denoising diffusion GAN, by using only a few denoising steps, significantly reduces the sampling cost of diffusion models, making them more practical and cost-effective for real-world applications.
Related works:
- TACKLING THE GENERATIVE LEARNING TRILEMMA WITH DENOISING DIFFUSION GANS. https://openreview.net/pdf?id=JprM0p-q0Co