Generative Sampling with Variational Autoencoders


Previously we derived the ELBO. The variational autoencoder paper (Kingma & Welling, 2013) introduces a differentiable, unbiased estimator for the ELBO, and extends simple parametric forms to complex ones parameterized by feed-forward neural nets.


Sampling

Before diving into the math, here's a quick overview of how to generate new datapoints with the VAE framework.

In an ideal world, we'd like to sample from the true posterior $p(\bold{z}|\bold{x})$, but we don't have access to it. The good news is that optimizing the ELBO encourages the learned posterior to stay close to the prior, giving us some assurance that, using latents drawn from the prior, the decoder will generate novel datapoints that are in-distribution with respect to the training data. The reasoning goes something like:

  1. The learned posterior is regularized toward the prior via the KL divergence term in the ELBO, so the two distributions end up close to one another.
  2. The decoder generates data points that are in-distribution with respect to the training data when fed latents from the learned posterior/encoder (perfect reconstructions, in fact, under perfect optimization).
  3. Due to 1, the decoder also generates in-distribution data points when fed latents from the prior.

Therefore the procedure is as simple as sampling a latent vector $\bold{z}^{(i)}$ from the prior $p(\bold{z})$ used during training, passing it to the decoder, and sampling $\bold{x}^{(i)} \sim p_{\pmb{\theta}}(\bold{x}|\bold{z}^{(i)})$. Alternatively, you can average a batch of samples of $\bold{x}^{(i)}$ to estimate the mean of $p_{\pmb{\theta}}(\bold{x}|\bold{z}^{(i)})$, which for a Gaussian decoder coincides with its mode, i.e. the most likely datapoint given $\bold{z}^{(i)}$.
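
To make the procedure concrete, here is a minimal PyTorch sketch. Nothing below is prescribed by the paper: TinyDecoder, its layer sizes, and the latent/data dimensionalities are hypothetical stand-ins for whatever trained two-headed Gaussian decoder you actually have.

```python
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    """Hypothetical two-headed decoder parameterizing p_theta(x | z)."""
    def __init__(self, latent_dim=2, data_dim=4):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU())
        self.mu_head = nn.Linear(32, data_dim)       # mean of p_theta(x | z)
        self.logvar_head = nn.Linear(32, data_dim)   # log-variance of p_theta(x | z)

    def forward(self, z):
        h = self.body(z)
        return self.mu_head(h), self.logvar_head(h)

decoder = TinyDecoder()
latent_dim = 2

# 1. Sample a latent from the prior p(z) = N(0, I) used during training.
z = torch.randn(1, latent_dim)

# 2. The decoder outputs the parameters of p_theta(x | z).
mu_x, logvar_x = decoder(z)

# 3. Either sample x ~ p_theta(x | z), or take mu_x, the mean (and mode) of the Gaussian.
x_sample = torch.distributions.Normal(mu_x, torch.exp(0.5 * logvar_x)).sample()
x_mode = mu_x
```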


Setup

Given a dataset $\bold{X}=\{\bold{x}^{(i)}\}_{i=1}^N$ consisting of $N$ i.i.d. samples, we want a model $p_{\hat{\pmb{\theta}}}(\bold{x})$ whose parameters maximize the marginal likelihood of our observations. \begin{align*} \hat{\pmb{\theta}} &= \argmax_{\pmb{\theta}} \log p_{\pmb{\theta}} (\bold{x}^{(1)}, ..., \bold{x}^{(N)}) \\ &= \argmax_{\pmb{\theta}} \log \prod_{i=1}^{N} p_{\pmb{\theta}}(\bold{x}^{(i)}) \\ &= \argmax_{\pmb{\theta}} \sum_{i=1}^N \log p_{\pmb{\theta}}(\bold{x}^{(i)}) \end{align*} However, there are only a handful of closed-form parametric distributions, and most are too simple for complex datasets, so more expressive functions such as neural nets are needed. Further, assume the data is generated by some continuous latent random vector $\bold{z}$ through the following random process:

  1. a value $\bold{z}^{(i)}$ is generated from a continuous prior $p(\bold{z})$
  2. a value $\bold{x}^{(i)}$ is generated from a continuous or discrete conditional distribution $p(\bold{x} | \bold{z}^{(i)})$

Under this assumption we seek to model the joint distribution of the data and latents by optimizing a lower bound on the evidence $\log p(\bold{x})$, the ELBO. The fact that variational autoencoders assume data generation proceeds from the tractable procedure above is what characterizes them as latent variable models (see the latent-variables assumption of the world).
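
As a toy illustration of this two-step generative story (not a VAE; the linear map W, offset b, and noise scale sigma are made up for the example), here's a short sketch:

```python
import torch

# Hypothetical parameters of a linear-Gaussian instance of the assumed process.
J, D = 2, 4                                   # latent and data dimensionality
W, b, sigma = torch.randn(D, J), torch.randn(D), 0.1

# 1. A value z^(i) is generated from a continuous prior p(z) = N(0, I).
z = torch.randn(J)

# 2. A value x^(i) is generated from the conditional p(x | z^(i)) = N(Wz + b, sigma^2 I).
x = W @ z + b + sigma * torch.randn(D)
```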

Importantly, this latent-variable assumption is not always accurate, and other models, such as energy-based models, eschew it altogether.

On the other hand, since energy-based models ignore the normalizing constant altogether, the exact likelihood of a sample can't be computed and the procedure for generating data points is more involved (Song & Kingma, 2021).


The ELBO Estimator


We have $\log p_{\pmb{\theta}}(\bold{x}^{(i)}) \geq \mathcal{L}(\pmb{\theta}, \pmb{\phi}; \bold{x}^{(i)})$, where $\mathcal{L}(\pmb{\theta}, \pmb{\phi}; \bold{x}^{(i)})$ is the ELBO for sample $\bold{x}^{(i)}$. Since $\mathcal{L}$ is a lower bound, maximizing $\mathcal{L}$ w.r.t. $\pmb{\theta}$ and $\pmb{\phi}$ serves as a proxy for maximizing the marginal likelihood of $\bold{x}^{(i)}$. Here we derive Kingma and Welling's estimator for the ELBO.

ELBO for a Single Datapoint $\bold{x}^{(i)}$

\begin{align*} \mathcal{L}(\pmb{\theta}, \pmb{\phi}; \bold{x}^{(i)}) = \underbrace{ E_{\bold{z} \sim q_{\pmb{\phi}}}[ \log p_{\pmb{\theta}}(\bold{x}^{(i)}|\bold{z}) ] }_{\text{reconstruction term}} - \underbrace{ D_{KL}(q_{\pmb{\phi}}(\bold{z}|\bold{x}^{(i)}) || p(\bold{z})) }_{\text{KL divergence term}} \end{align*} The optimization is formulated as a joint optimization over the parameters $\pmb{\phi}$ and $\pmb{\theta}$ of a probabilistic encoder $q_{\pmb{\phi}}$ and a probabilistic decoder $p_{\pmb{\theta}}$. By joint optimization we mean that after every backward pass, we perform a gradient update on both $\pmb{\theta}$ and $\pmb{\phi}$.

Calculating the KL Divergence Term

Suppose the probabilistic encoder $q_{\pmb{\phi}}$ consists of a two-headed neural net parameterizing a multivariate Gaussian with diagonal covariance. Recall $q_{\pmb{\phi}}$ is an approximation of the true posterior $p(\bold{z} | \bold{x}^{(i)})$. \begin{align*} q_{\pmb{\phi}}(\bold{z} | \bold{x}^{(i)}) = N( \pmb{\mu}_{\pmb{\phi}}(\bold{x}^{(i)}), \operatorname{diag}(\pmb{\sigma}^2_{\pmb{\phi}}(\bold{x}^{(i)})) ) \end{align*} Here $\pmb{\mu}_{\pmb{\phi}}(\bold{x}^{(i)}) \in \mathbb{R}^J$ is a vector coming from one head, while the other outputs $\log \pmb{\sigma}^2_{\pmb{\phi}}(\bold{x}^{(i)}) \in \mathbb{R}^J$. $J$ is the dimensionality of the latent space. In other words, we assume the latents are conditionally independent of one another and Gaussian. The covariance matrix is diagonal with entries \begin{align*} \pmb{\sigma}^2_{\pmb{\phi}}(\bold{x}^{(i)}) = (\sigma_1^2, ..., \sigma_J^2). \end{align*} Similarly \begin{align*} \pmb{\mu}_{\pmb{\phi}}(\bold{x}^{(i)}) = ( \mu_1, ..., \mu_J). \end{align*} Since the KL divergence between two Gaussians has a closed-form expression, assuming a standard normal prior $p(\bold{z})$, we have \begin{align*} D_{KL}(q_{\pmb{\phi}}(\bold{z} | \bold{x}^{(i)}) ||p(\bold{z})) = \frac{1}{2} \sum_{j=1}^J \left( - \ln\sigma_j^2 - 1 + \sigma_j^2 +\mu_j^2 \right) \end{align*}
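
As a sanity check, here's a short PyTorch sketch comparing the closed-form expression against torch.distributions; mu and logvar are stand-ins for the two encoder heads evaluated at a single $\bold{x}^{(i)}$.

```python
import torch

J = 16
mu = torch.randn(J)        # mu_phi(x^(i))
logvar = torch.randn(J)    # log sigma^2_phi(x^(i))

# Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ) from above.
kl_closed_form = 0.5 * torch.sum(-logvar - 1.0 + logvar.exp() + mu**2)

# Cross-check against torch.distributions (diagonal Gaussians factorize per dimension).
q = torch.distributions.Normal(mu, (0.5 * logvar).exp())
p = torch.distributions.Normal(torch.zeros(J), torch.ones(J))
kl_reference = torch.distributions.kl_divergence(q, p).sum()

assert torch.allclose(kl_closed_form, kl_reference, atol=1e-5)
```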

Estimating the Reconstruction Term

The sample mean is an unbiased estimator of the expectation, so we can estimate the reconstruction term with $L$ samples drawn from $q_{\pmb{\phi}}(\bold{z} | \bold{x}^{(i)})$. $L$ is a hyperparameter (the paper notes that $L=1$ suffices when the minibatch size is large enough). Similar to the encoder, the decoder $p_{\pmb{\theta}}(\bold{x}^{(i)}|\bold{z})$ consists of a two-headed neural net parameterizing a multivariate Gaussian with diagonal covariance.

\begin{align*} E_{\bold{z} \sim q_{\pmb{\phi}}}[ \log p_{\pmb{\theta}}(\bold{x}^{(i)}|\bold{z}) ] \approx \frac{1}{L} \sum_{\ell=1}^L \log p_{\pmb{\theta}}(\bold{x}^{(i)} | \bold{z}^{(i, \ell)}) \end{align*} However, the act of drawing samples from a distribution is not a differentiable operation.

Reparameterization Trick: Instead, draw $L$ samples $\bold{\epsilon}^{(i, \ell)} \sim N(\bold{0}, \bold{I})$, where $\bold{\epsilon}^{(i, \ell)} \in \mathbb{R}^{J}$. Then scale each sample by $\pmb{\sigma}_{\pmb{\phi}}(\bold{x}^{(i)})$ (giving it variance $\pmb{\sigma}^2_{\pmb{\phi}}(\bold{x}^{(i)})$) and shift its mean to $\pmb{\mu}_{\pmb{\phi}}(\bold{x}^{(i)})$. The $\odot$ denotes an element-wise product. \begin{align*} \bold{z}^{(i, \ell)} = \pmb{\mu}_{\pmb{\phi}} + \pmb{\sigma}_{\pmb{\phi}} \odot \bold{\epsilon}^{(i, \ell)} \end{align*} Since $\bold{\epsilon}$ is not a function of $\pmb{\phi}$, we have \begin{align*} \frac{\partial \bold{z}^{(i, \ell)}}{\partial \pmb{\phi}} = \frac{\partial \pmb{\mu}_{\pmb{\phi}}}{\partial \pmb{\phi}} + \frac{\partial \pmb{\sigma}_{\pmb{\phi}}}{\partial \pmb{\phi}} \odot \bold{\epsilon}^{(i, \ell)} \end{align*} thus making $\bold{z}^{(i, \ell)}$ differentiable with respect to $\pmb{\phi}$.

Remark: While $\pmb{\mu}_{\pmb{\phi}}$ and $\pmb{\sigma}_{\pmb{\phi}}$ are deterministic functions, $\bold{z}^{(i, \ell)}$ is stochastic through the random variable $\bold{\epsilon}$.
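
A minimal PyTorch sketch of the trick; mu and logvar stand in for the encoder heads at one $\bold{x}^{(i)}$, and the backward pass confirms gradients reach them even though sampling is involved.

```python
import torch

J, L = 16, 8                                # latent dimensionality and sample count

# Stand-ins for the encoder heads; requires_grad mimics outputs of a network.
mu = torch.randn(J, requires_grad=True)
logvar = torch.randn(J, requires_grad=True)

# Draw eps ~ N(0, I), then scale and shift: z = mu + sigma * eps.
eps = torch.randn(L, J)                     # no gradient flows through eps
z = mu + (0.5 * logvar).exp() * eps         # shape (L, J), differentiable in phi

# Any loss built from z now backpropagates into mu and logvar.
z.sum().backward()
assert mu.grad is not None and logvar.grad is not None
```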

The estimator for the ELBO of a single datapoint $\bold{x}^{(i)}$: \begin{align*} \hat{\mathcal{L}}(\pmb{\theta}, \pmb{\phi}; \bold{x}^{(i)}) = \frac{1}{L} \sum_{\ell=1}^L \log p_{\pmb{\theta}}(\bold{x}^{(i)} | \bold{z}^{(i, \ell)}) - \frac{1}{2} \sum_{j=1}^J \left( - \ln\sigma_j^2 - 1 + \sigma_j^2 +\mu_j^2 \right). \end{align*}
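
Putting the two terms together for one datapoint, here's a sketch under the same assumptions as above; decoder is a hypothetical two-headed Gaussian decoder mapping latents of shape (L, J) to mean and log-variance heads of shape (L, D).

```python
import torch
from torch.distributions import Normal

def elbo_single(x, mu_z, logvar_z, decoder, L=1):
    """Monte Carlo ELBO estimate for one datapoint x, given the encoder heads
    mu_z, logvar_z (shape (J,)) and a hypothetical two-headed Gaussian decoder."""
    # Reparameterized samples z^(i, l), shape (L, J).
    eps = torch.randn(L, mu_z.shape[0])
    z = mu_z + (0.5 * logvar_z).exp() * eps

    # Reconstruction term: (1/L) sum_l log p_theta(x | z^(i, l)).
    mu_x, logvar_x = decoder(z)                                    # each (L, D)
    log_px = Normal(mu_x, (0.5 * logvar_x).exp()).log_prob(x).sum(dim=-1)
    recon = log_px.mean()

    # Closed-form KL term against the standard normal prior.
    kl = 0.5 * torch.sum(-logvar_z - 1.0 + logvar_z.exp() + mu_z**2)

    return recon - kl
```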

Estimator for ELBO of the Whole Training Set

We can estimate the evidence lower bound of the whole dataset $\bold{X} = \{ \bold{x}^{(i)}\}_{i=1}^N$ by randomly selecting a minibatch $\bold{X}^M$ of size $M$ (also a hyperparameter). \begin{align*} \hat{\mathcal{L}}(\pmb{\theta}, \pmb{\phi}; \bold{X}^M) = \frac{N}{M} \sum_{j=1}^M \hat{\mathcal{L}}(\pmb{\theta}, \pmb{\phi}; \bold{x}^{(j)}) \approx \mathcal{L}(\pmb{\theta}, \pmb{\phi}; \bold{X}) \end{align*} Here $\frac{1}{M} \sum_{j=1}^M \hat{\mathcal{L}}(\pmb{\theta}, \pmb{\phi}; \bold{x}^{(j)})$ is the minibatch estimate of the average single-point ELBO. Without the factor of $N$ we'd have an estimator of that average; the paper scales by $N$ so the estimator targets the ELBO of the full dataset. Since the two differ only by a constant factor, either works as a training objective.

Proof:
\begin{align*} \text{Evidence of }\bold{X} &= \log \left( p_{\pmb{\theta}}(\bold{x}^{(1)}, \bold{x}^{(2)}, ..., \bold{x}^{(N)}) \right) \\ &= \log \left( p_{\pmb{\theta}}(\bold{x}^{(1)}) \cdot p_{\pmb{\theta}}(\bold{x}^{(2)}) \cdot ... \cdot p_{\pmb{\theta}}(\bold{x}^{(N)}) \right) \\ &= \sum_{i=1}^N \log p_{\pmb{\theta}}(\bold{x}^{(i)}) \\ &\geq \sum_{i=1}^N \mathcal{L}(\pmb{\theta}, \pmb{\phi}; \bold{x}^{(i)}) = \mathcal{L}(\pmb{\theta}, \pmb{\phi}; \bold{X}) \end{align*} In the second line we used the i.i.d. assumption, and in the last line we invoked $\log p_{\pmb{\theta}}(\bold{x}^{(i)}) \geq \mathcal{L}(\pmb{\theta}, \pmb{\phi}; \bold{x}^{(i)})$ for all $i$. Approximating each single-point ELBO $\mathcal{L}(\pmb{\theta}, \pmb{\phi}; \bold{x}^{(i)})$ in the last line by the minibatch average $\frac{1}{M} \sum_{j=1}^M \hat{\mathcal{L}}(\pmb{\theta}, \pmb{\phi}; \bold{x}^{(j)})$ gives \begin{align*} \mathcal{L}(\pmb{\theta}, \pmb{\phi}; \bold{X}) &\approx \sum_{i=1}^N \frac{1}{M} \sum_{j=1}^M \hat{\mathcal{L}}(\pmb{\theta}, \pmb{\phi}; \bold{x}^{(j)}) \\ &= \frac{N}{M} \sum_{j=1}^M \hat{\mathcal{L}}(\pmb{\theta}, \pmb{\phi}; \bold{x}^{(j)}) = \hat{\mathcal{L}}(\pmb{\theta}, \pmb{\phi}; \bold{X}^M) \end{align*}
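
In code, the scaling is a one-liner; per_example_elbos below stands for the $M$ single-point estimates $\hat{\mathcal{L}}(\pmb{\theta}, \pmb{\phi}; \bold{x}^{(j)})$ computed with a sketch like the one above.

```python
def elbo_minibatch(per_example_elbos, N):
    """Full-dataset ELBO estimate from M per-example estimates and dataset size N."""
    M = len(per_example_elbos)
    return (N / M) * sum(per_example_elbos)
```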

Training Algorithm

\begin{align*} &L, M \leftarrow \textnormal{Select hyperparams} \\ &\pmb{\theta}, \pmb{\phi} \leftarrow \text{Initialize params} \\ &\textbf{Repeat until convergence of } \pmb{\theta}, \pmb{\phi} \\ &\quad \bold{X}^M \leftarrow \text{Random minibatch of size } M \\ &\quad \textbf{for } \bold{x}^{(j)} \in \bold{X}^M \textbf{ do:} \\ &\qquad \pmb{\mu}_{\pmb{\phi}}(\bold{x}^{(j)}), \pmb{\sigma}^2_{\pmb{\phi}}(\bold{x}^{(j)}) \leftarrow \text{Heads of the encoder network for } \bold{x}^{(j)} \\ &\qquad \textbf{for } \ell \in \{1, ..., L \} \textbf{ do:} \\ &\qquad\quad \bold{\epsilon}^{(\ell)} \leftarrow \text{Random sample from } N(\bold{0}, \bold{I}) \\ &\qquad\quad \bold{z}^{(j, \ell)} \leftarrow \pmb{\mu}_{\pmb{\phi}}(\bold{x}^{(j)}) + \pmb{\sigma}_{\pmb{\phi}}(\bold{x}^{(j)}) \odot \bold{\epsilon}^{(\ell)} \\ &\qquad \hat{\mathcal{L}}(\pmb{\theta}, \pmb{\phi}; \bold{x}^{(j)}) \leftarrow \text{ELBO estimate for } \bold{x}^{(j)} \\ &\quad \hat{\mathcal{L}}(\pmb{\theta}, \pmb{\phi}; \bold{X}^M) \leftarrow \text{Minibatch ELBO estimate} \\ &\quad \bold{g} \leftarrow \nabla_{\pmb{\theta}, \pmb{\phi}} \hat{\mathcal{L}}(\pmb{\theta}, \pmb{\phi}; \bold{X}^M) \\ &\quad \pmb{\theta}, \pmb{\phi} \leftarrow \text{Simultaneous gradient update using } \bold{g} \text{ (e.g. Adam, SGD)}\\ &\textbf{Return }\pmb{\theta}, \pmb{\phi} \\ \end{align*}
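
For reference, the algorithm above maps almost line for line onto a PyTorch training loop. Everything here is a hypothetical sketch (architecture sizes, synthetic data, learning rate), and the optimizer minimizes the negative of the ELBO estimate rather than ascending it directly.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class VAE(nn.Module):
    """Minimal sketch: two-headed Gaussian encoder and decoder."""
    def __init__(self, data_dim=4, latent_dim=2, hidden=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(data_dim, hidden), nn.ReLU())
        self.enc_mu = nn.Linear(hidden, latent_dim)
        self.enc_logvar = nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU())
        self.dec_mu = nn.Linear(hidden, data_dim)
        self.dec_logvar = nn.Linear(hidden, data_dim)

    def forward(self, x, L=1):
        # Encoder heads: mu_phi(x) and log sigma^2_phi(x), each of shape (M, J).
        h = self.enc(x)
        mu_z, logvar_z = self.enc_mu(h), self.enc_logvar(h)

        # Reparameterization: z = mu + sigma * eps with eps ~ N(0, I); shape (L, M, J).
        eps = torch.randn(L, *mu_z.shape)
        z = mu_z + (0.5 * logvar_z).exp() * eps

        # Decoder heads and reconstruction term, averaged over the L samples.
        hd = self.dec(z)
        mu_x, logvar_x = self.dec_mu(hd), self.dec_logvar(hd)
        recon = Normal(mu_x, (0.5 * logvar_x).exp()).log_prob(x).sum(-1).mean(0)

        # Closed-form KL against the standard normal prior, per example.
        kl = 0.5 * torch.sum(-logvar_z - 1.0 + logvar_z.exp() + mu_z**2, dim=-1)
        return recon - kl                       # per-example ELBO estimates, shape (M,)

N, M, L = 1024, 64, 1                           # dataset size and hyperparameters
X = torch.randn(N, 4)                           # synthetic stand-in for the dataset
model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1000):
    idx = torch.randint(0, N, (M,))             # random minibatch X^M
    elbo = (N / M) * model(X[idx], L=L).sum()   # full-dataset ELBO estimate
    loss = -elbo                                # maximize ELBO = minimize -ELBO
    opt.zero_grad()
    loss.backward()
    opt.step()                                  # simultaneous update of theta and phi
```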