Latent Diffusion (Stable Diffusion)

High-Resolution Image Synthesis with Latent Diffusion Models



Image Synthesis

Problems with DMs

Goal: Reducing the computational demands of DMs without impairing their performance.

Semantic Compression and Perceptual Compression

Illustrating perceptual and semantic compression.

Learning a likelihood-based model (e.g., DMs) can be roughly divided into two stages:

  1. Perceptual Compression: The model removes high-frequency details but still learns little semantic variation.
  2. Semantic Compression: The actual generative model learns the semantic and conceptual composition of the data.

The distortion decreases steeply in the low-rate region of the rate-distortion plot, indicating that the majority of the bits are indeed allocated to imperceptible distortion. While DMs allow suppressing this semantically meaningless information by minimizing the responsible loss term, gradients (during training) and the neural network backbone (training and inference) still need to be evaluated on all pixels, leading to superfluous computations and unnecessarily expensive optimization and inference.





The architecture of LDM.

Perceptual Image Compression (Autoencoder)

The architecture of the perceptual compression model is based on VQGAN.

| Encoder | Decoder |
| --- | --- |
| \(x\in\mathbb{R}^{H\times W\times C}\) | \(z\in\mathbb{R}^{h\times w\times c}\) |
| \(\text{Conv2D}\rightarrow\mathbb{R}^{H\times W\times C'}\) | \(\text{Conv2D}\rightarrow\mathbb{R}^{h\times w\times C''}\) |
| \(m\times\{\text{ResBlock, Downsample}\}\rightarrow\mathbb{R}^{h\times w\times C''}\) | \(\text{ResBlock}\rightarrow\mathbb{R}^{h\times w\times C''}\) |
| \(\text{ResBlock}\rightarrow\mathbb{R}^{h\times w\times C''}\) | \(\text{Non-Local Block}\rightarrow\mathbb{R}^{h\times w\times C''}\) |
| \(\text{Non-Local Block}\rightarrow\mathbb{R}^{h\times w\times C''}\) | \(\text{ResBlock}\rightarrow\mathbb{R}^{h\times w\times C''}\) |
| \(\text{ResBlock}\rightarrow\mathbb{R}^{h\times w\times C''}\) | \(m\times\{\text{ResBlock, Upsample}\}\rightarrow\mathbb{R}^{H\times W\times C'}\) |
| \(\text{GroupNorm, Swish, Conv2D}\rightarrow\mathbb{R}^{h\times w\times c}\) | \(\text{GroupNorm, Swish, Conv2D}\rightarrow\mathbb{R}^{H\times W\times C}\) |
  1. Given an RGB image \(x\in\mathbb{R}^{H\times W\times 3}\).
  2. The encoder \(\mathcal{E}\) encodes \(x\) into a latent representation \(z=\mathcal{E}(x)\in\mathbb{R}^{h\times w\times c}\). The encoder downsamples the image by a factor \(f = H/h = W/w = 2^m\).
  3. The decoder \(\mathcal{D}\) reconstructs the image from the latent, giving \(\tilde{x}=\mathcal{D}(z)=\mathcal{D}(\mathcal{E}(x))\).
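
The encode/decode pipeline above can be sketched as a toy PyTorch autoencoder. This is a minimal illustration, not the paper's exact VQGAN architecture: the channel widths, `SiLU` activations, and use of transposed convolutions for upsampling are my simplifications; only the structure (\(m\) stride-2 stages giving \(f = 2^m\)) mirrors the table.

```python
import torch
import torch.nn as nn

class ToyAutoencoder(nn.Module):
    """Minimal sketch of the perceptual compression autoencoder.

    m stride-2 convolutions give a downsampling factor f = 2**m.
    Layer choices here are illustrative, not the paper's exact blocks.
    """
    def __init__(self, in_ch=3, latent_ch=4, m=3):
        super().__init__()
        enc = [nn.Conv2d(in_ch, 64, 3, padding=1)]
        for _ in range(m):  # each stride-2 conv halves H and W
            enc += [nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.SiLU()]
        enc += [nn.Conv2d(64, latent_ch, 3, padding=1)]
        self.encoder = nn.Sequential(*enc)

        dec = [nn.Conv2d(latent_ch, 64, 3, padding=1)]
        for _ in range(m):  # each transposed conv doubles H and W
            dec += [nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.SiLU()]
        dec += [nn.Conv2d(64, in_ch, 3, padding=1)]
        self.decoder = nn.Sequential(*dec)

x = torch.randn(1, 3, 256, 256)
ae = ToyAutoencoder(m=3)      # f = 2**3 = 8
z = ae.encoder(x)             # latent: (1, 4, 32, 32)
x_rec = ae.decoder(z)         # reconstruction: (1, 3, 256, 256)
```

With \(m = 3\), a \(256\times 256\) image maps to a \(32\times 32\) latent, so the DM later operates on \(64\times\) fewer spatial positions.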

To avoid arbitrarily high-variance latent spaces, the authors experiment with two kinds of regularizations: KL-reg, a slight KL penalty pushing the learned latent toward a standard normal (as in a VAE, but with a very small weight), and VQ-reg, a vector quantization layer inside the decoder.
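
The KL-regularized option can be sketched as the usual closed-form KL term between a diagonal Gaussian \(q(z|x)\) and \(\mathcal{N}(0, I)\), scaled by a small weight; the specific weight value below is illustrative.

```python
import torch

def kl_reg(mean, logvar, weight=1e-6):
    """KL(q(z|x) || N(0, I)) for a diagonal Gaussian, element-wise.

    The weight is kept very small so the latent stays near unit variance
    without being over-regularized (the value here is an assumption).
    """
    kl = 0.5 * (mean.pow(2) + logvar.exp() - 1.0 - logvar)
    return weight * kl.mean()

# When the posterior already matches the prior, the penalty vanishes.
mean = torch.zeros(2, 4, 32, 32)
logvar = torch.zeros(2, 4, 32, 32)
loss = kl_reg(mean, logvar)
```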

Because the subsequent DM is designed to work with the 2D structure of the learned latent space \(z = \mathcal{E}(x)\), the authors use mild compression rates and achieve very good reconstructions. This is in contrast to previous works (e.g. VQGAN), which relied on an arbitrary 1D ordering of the learned space \(z\) to model its distribution autoregressively and thereby ignored much of the inherent structure of \(z\).

Latent Diffusion Models

Diffusion Models (on Image Domain)

Generative Modeling of Latent Representations

Conditioning Mechanisms

DMs are capable of modeling conditional distributions of the form \(p(z\mid y)\).

Augmenting the UNet backbone with the cross-attention mechanism

  1. Use a domain-specific encoder \(\tau_\theta\) to project \(y\) to an intermediate representation \(\tau_\theta(y)\in\mathbb{R}^{M\times d_\tau}\).
  2. The intermediate (flattened) representation of the UNet, \(\varphi_i(z_t)\in\mathbb{R}^{N\times d_\epsilon^i}\), is fused with \(\tau_\theta(y)\) via a cross-attention layer:

    \[\begin{aligned} \text{Attention}(Q, K, V)&=\text{softmax}(QK^T/\sqrt{d})V\\ Q=W_Q^{(i)}\varphi_i(z_t),\quad K&=W_K^{(i)}\tau_\theta(y),\quad V=W_V^{(i)}\tau_\theta(y) \end{aligned}\]
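
The cross-attention equations above translate directly into code. The sketch below is single-head for clarity (the paper uses multi-head attention), and the dimensions in the usage example are assumptions in the spirit of Stable Diffusion, not values taken from the paper.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention: queries come from the flattened UNet
    features phi_i(z_t); keys and values come from tau_theta(y)."""
    def __init__(self, d_eps, d_tau, d=64):
        super().__init__()
        self.scale = d ** -0.5
        self.W_Q = nn.Linear(d_eps, d, bias=False)
        self.W_K = nn.Linear(d_tau, d, bias=False)
        self.W_V = nn.Linear(d_tau, d, bias=False)
        self.out = nn.Linear(d, d_eps, bias=False)  # project back to d_eps

    def forward(self, phi, tau):      # phi: (B, N, d_eps), tau: (B, M, d_tau)
        Q, K, V = self.W_Q(phi), self.W_K(tau), self.W_V(tau)
        attn = torch.softmax(Q @ K.transpose(1, 2) * self.scale, dim=-1)
        return self.out(attn @ V)     # (B, N, d_eps): same shape as phi

phi = torch.randn(2, 1024, 320)   # e.g. a flattened 32x32 UNet feature map
tau = torch.randn(2, 77, 768)     # e.g. 77 token embeddings from a text encoder
layer = CrossAttention(d_eps=320, d_tau=768)
out = layer(phi, tau)             # (2, 1024, 320)
```

Because the output has the same shape as \(\varphi_i(z_t)\), it can be added back residually and reshaped into the 2D feature map at each UNet level.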

Loss for conditional LDM:

\[L_{LDM} = \mathbb{E}_{\mathcal{E}(x),y,\epsilon\sim\mathcal{N}(0,1),t}\left[ || \epsilon - \epsilon_\theta(z_t,t,\tau_\theta(y)) ||_2^2 \right].\]
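
One evaluation of this objective can be sketched as follows. The function mirrors the expectation above term by term; the stand-in `encoder`, `tau`, and `eps_model` callables and the noise schedule in the demo are hypothetical placeholders, not the paper's models.

```python
import torch

def ldm_loss(eps_model, encoder, tau, x, y, alphas_cumprod):
    """One sample of the conditional LDM objective:
    || eps - eps_theta(z_t, t, tau_theta(y)) ||_2^2, averaged over the batch."""
    z = encoder(x)                              # diffuse in latent space, not pixels
    t = torch.randint(0, len(alphas_cumprod), (z.shape[0],))
    eps = torch.randn_like(z)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    z_t = a.sqrt() * z + (1 - a).sqrt() * eps   # forward process q(z_t | z)
    return ((eps - eps_model(z_t, t, tau(y))) ** 2).mean()

# Hypothetical stand-ins, just to exercise the objective:
encoder = lambda x: x                                  # identity "E"
tau = lambda y: y                                      # identity "tau_theta"
eps_model = lambda z_t, t, c: torch.zeros_like(z_t)    # dummy denoiser
x = torch.randn(2, 4, 8, 8)
y = torch.zeros(2, 1)
loss = ldm_loss(eps_model, encoder, tau, x, y, torch.linspace(0.99, 0.01, 10))
```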


On Perceptual Compression Tradeoffs

Train class-conditional LDMs on ImageNet with different downsampling factors \(f\).

Train LDMs on CelebA-HQ and ImageNet with different \(f\) and plot sampling speed against FID scores.

Image Generation with Latent Diffusion

Train unconditional LDMs and evaluate:


Conditional Latent Diffusion

Text-to-Image Synthesis

Layout-to-Image Synthesis

Semantic Synthesis

Super-Resolution with Latent Diffusion

LDMs can be efficiently trained for super-resolution by directly conditioning on low-resolution images via concatenation.
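
A sketch of this conditioning: the low-resolution image is brought to the latent's spatial size and concatenated along the channel axis, so the denoiser's first convolution simply takes the extra channels. The shapes and the bicubic upsampling step here are illustrative assumptions about how the concatenation is wired up.

```python
import torch
import torch.nn.functional as F

def sr_condition(z_t, lr_image):
    """Concatenate an upsampled low-resolution image to the noisy latent
    along the channel axis (illustrative wiring, not the paper's exact code)."""
    lr_up = F.interpolate(lr_image, size=z_t.shape[-2:], mode="bicubic",
                          align_corners=False)
    return torch.cat([z_t, lr_up], dim=1)

z_t = torch.randn(1, 4, 64, 64)   # noisy latent
lr = torch.randn(1, 3, 16, 16)    # low-resolution RGB conditioning image
inp = sr_condition(z_t, lr)       # (1, 7, 64, 64): 4 latent + 3 image channels
```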

LDM-SR (trained with bicubic degradation)

LDM-BSR (a generic model trained with more diverse degradations)

Inpainting with Latent Diffusion

Inpainting is the task of filling masked regions of an image with new content either because parts of the image are corrupted or to replace existing but undesired content within the image.

Use the general approach for conditional image generation (concatenation).

The best model:

Convolutional Sampling

The SNR induced by the variance of the latent space (i.e., \(\text{Var}(z)/\sigma_t^2\)) significantly affects the results for convolutional sampling.
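
This ratio is easy to compute and control: rescaling the latent by its component-wise standard deviation brings \(\text{Var}(z)\) to one, so the SNR depends only on the noise schedule. The helper below is a sketch of that bookkeeping.

```python
import torch

def latent_snr(z, sigma_t):
    """SNR induced by the latent variance: Var(z) / sigma_t**2."""
    return z.var().item() / sigma_t ** 2

z = torch.randn(4, 4, 32, 32) * 3.0   # a high-variance latent space
z_rescaled = z / z.std()              # rescale to (approximately) unit variance

# After rescaling, the SNR at noise level sigma_t is just 1 / sigma_t**2,
# making convolutional sampling behave consistently across trained models.
```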