Previously we derived the ELBO. The variational autoencoder paper (Kingma & Welling, 2013) introduces a differentiable, unbiased estimator of the ELBO, and extends simple parametric forms to complex ones parameterized by feed-forward neural nets.
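As a concrete reference, here is a minimal single-sample sketch of that estimator in PyTorch, assuming a diagonal-Gaussian encoder that produces `mu` and `log_var` for a batch `x`, and a Bernoulli decoder `decode` that returns logits (all names here are hypothetical):

```python
import torch
import torch.nn.functional as F

def elbo_estimate(x, mu, log_var, decode):
    # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    # keeping the Monte Carlo sample differentiable w.r.t. mu and log_var.
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps
    # Reconstruction term log p_theta(x | z), here for a Bernoulli decoder
    # whose network outputs logits.
    log_px_given_z = -F.binary_cross_entropy_with_logits(
        decode(z), x, reduction="sum"
    )
    # KL(q_phi(z|x) || N(0, I)) in closed form for a diagonal-Gaussian encoder.
    kl = -0.5 * torch.sum(1.0 + log_var - mu.pow(2) - log_var.exp())
    return log_px_given_z - kl
```

Because `z` is a deterministic function of the noise `eps`, gradients flow through `mu` and `log_var`, which is what makes the estimator differentiable.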
In an ideal world, we'd like to sample from the true posterior $p(\bold{z}|\bold{x})$, but we don't have access to it. The good news is that optimizing the ELBO encourages the learned posterior to stay close to the prior, giving us some assurance that, using latents drawn from the prior, the decoder will generate novel datapoints that are in-distribution with respect to the training data. The reasoning goes something like:

1. The KL term in the ELBO penalizes $q_{\pmb{\phi}}(\bold{z}|\bold{x})$ for straying from the prior $p(\bold{z})$.
2. After training, latents drawn from $p(\bold{z})$ therefore look much like the latents the decoder saw during training.
3. The reconstruction term has trained the decoder to map such latents to plausible datapoints.
Therefore the procedure is as simple as sampling a latent vector $\bold{z}^{(i)}$ from the prior $p(\bold{z})$ used during training, passing it to the decoder, and sampling $\bold{x}^{(i)} \sim p_{\pmb{\theta}}(\bold{x}|\bold{z}^{(i)})$. Alternatively, skip the final sampling step and take the mode of $p_{\pmb{\theta}}(\bold{x}|\bold{z}^{(i)})$ directly (for a Gaussian decoder, its predicted mean), which gives a MAP estimate of $\bold{x}$ given $\bold{z}^{(i)}$.
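A sketch of this generation loop, reusing the hypothetical Bernoulli decoder from above:

```python
import torch

@torch.no_grad()
def generate(decode, n_samples, latent_dim):
    # Draw latents from the same prior p(z) = N(0, I) used during training.
    z = torch.randn(n_samples, latent_dim)
    # Decode into the parameters of p_theta(x | z) and sample datapoints.
    probs = torch.sigmoid(decode(z))
    return torch.bernoulli(probs)
```

In practice many implementations display `probs` (the conditional mean of each pixel) rather than the binarized samples.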
Given a dataset $\bold{X}=\{\bold{x}^{(i)}\}_{i=1}^N$ consisting of $N$ i.i.d. samples, we want some model $p_{\hat{\pmb{\theta}}}(\bold{x})$ that maximizes the marginal likelihood of our observations. \begin{align*} \hat{\pmb{\theta}} &= \argmax_{\pmb{\theta}} \log p_{\pmb{\theta}} (\bold{x}^{(1)}, \dots, \bold{x}^{(N)}) \\ &= \argmax_{\pmb{\theta}} \log \prod_{i=1}^{N} p_{\pmb{\theta}}(\bold{x}^{(i)}) \\ &= \argmax_{\pmb{\theta}} \sum_{i=1}^N \log p_{\pmb{\theta}}(\bold{x}^{(i)}) \end{align*} However, there are only a handful of closed-form parametric distributions, and most are too simple for complex datasets. Therefore, more expressive functions such as neural nets are needed. Further, assume the data is generated by some latent continuous random vector $\bold{z}$ through the following random process: first a latent $\bold{z}^{(i)}$ is drawn from a prior $p_{\pmb{\theta}}(\bold{z})$, then a datapoint $\bold{x}^{(i)}$ is drawn from the conditional $p_{\pmb{\theta}}(\bold{x}|\bold{z}^{(i)})$.
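Under this process, the per-datapoint likelihood marginalizes over the latents, which is exactly the quantity the ELBO from earlier lower-bounds: \begin{align*} p_{\pmb{\theta}}(\bold{x}^{(i)}) = \int p_{\pmb{\theta}}(\bold{z}) \, p_{\pmb{\theta}}(\bold{x}^{(i)}|\bold{z}) \, d\bold{z} \end{align*} This integral is intractable for neural-net decoders, hence the need for the variational bound.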
Importantly, this latent-variable assumption does not always hold, and other models, such as energy-based models, eschew it entirely.
On the other hand, since energy-based models ignore the normalizing constant altogether, the exact likelihood of a sample can't be computed and the procedure for generating data points is more involved (Song & Kingma, 2021).
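To make the contrast concrete, an energy-based model defines the density only up to its normalizing constant: \begin{align*} p_{\pmb{\theta}}(\bold{x}) = \frac{\exp(-E_{\pmb{\theta}}(\bold{x}))}{Z_{\pmb{\theta}}}, \qquad Z_{\pmb{\theta}} = \int \exp(-E_{\pmb{\theta}}(\bold{x})) \, d\bold{x} \end{align*} where the energy $E_{\pmb{\theta}}$ is typically a neural net. The integral $Z_{\pmb{\theta}}$ is intractable in high dimensions, so exact likelihoods are unavailable, and drawing samples requires MCMC procedures such as Langevin dynamics.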