Latent Spaces and the ELBO


$\gdef\expect{ E_{\bold{z} \sim q_{ \pmb{\phi} }}}$

Here we'll make some remarks about latent spaces and derive the evidence lower bound (ELBO).

Used in variational Bayesian inference, the ELBO serves as a tractable proxy for the evidence, and maximizing it learns a model of the true posterior over latent variables. The framework assumes that in the real world, an observation $\bold{x}$ is generated by some underlying set of latent or "hidden" variables $\bold{z} = [z_1, z_2, ...]$, which are the true representation of the object, whereas $\bold{x} = [x_1, x_2, ...]$ are the measurable features. Latent variable models make this assumption about the world, and one can maximize the ELBO as part of an algorithm for generating novel data points.
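To make the generative assumption concrete, here is a minimal sketch of ancestral sampling from a latent variable model (my own illustration, assuming PyTorch, a standard normal prior over $\bold{z}$, and a hypothetical `decoder` network standing in for $p(\bold{x}|\bold{z})$):

```python
import torch

# Hypothetical decoder: maps a 16-dim latent z to the parameters of p(x|z),
# here Bernoulli probabilities over a 784-dim observation.
decoder = torch.nn.Sequential(
    torch.nn.Linear(16, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 784), torch.nn.Sigmoid(),
)

def sample_observation(n: int = 1) -> torch.Tensor:
    """Ancestral sampling: z ~ p(z), then x ~ p(x | z)."""
    z = torch.randn(n, 16)         # latents drawn from the standard normal prior
    probs = decoder(z)             # parameters of the likelihood p(x | z)
    return torch.bernoulli(probs)  # measurable features x

x_new = sample_observation(4)      # 4 "novel" data points (random here, since the decoder is untrained)
```

A trained model would learn the decoder's parameters so that samples produced this way resemble the observed data.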


Latent Spaces

Plato's Allegory

Luo gave an analogy between Plato's Allegory of the Cave and latent spaces. In the allegory, cavemen see each other as shadows cast by campfires. These shadows have measurable features in the cavemen's observable world (the shadow's height and width, the shape of the ears, and so forth), while a set of latents represents a complete description (temperament, language, caloric intake, etc.). The latents are a rich representation, while the observations are a compressed but measurable manifestation of the cavemen. From the observer's point of view, many aspects of the real objects are "lost in translation".

On the other hand, the latent space can be of lower dimension than the data, in which case the latents represent a compressed version of the measured data. Whether such a compression is good or bad depends on how accurately one can reconstruct the inputs from their latents.
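To make the reconstruction criterion concrete, here is a minimal sketch of an undercomplete autoencoder in PyTorch (my own illustration; the dimensions are arbitrary): the encoder compresses each input into a lower-dimensional latent, and the quality of the compression is judged by how well the decoder reconstructs the input from it.

```python
import torch
from torch import nn

class Autoencoder(nn.Module):
    """Undercomplete autoencoder: 784-dim inputs squeezed through 16 latents."""
    def __init__(self, input_dim: int = 784, latent_dim: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))   # reconstruct x from its latent

model = Autoencoder()
x = torch.rand(32, 784)                        # a stand-in batch of data
reconstruction_error = nn.functional.mse_loss(model(x), x)  # the yardstick for the compression
```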

Empiricism

Since learning a higher-dimensional latent space requires priors with strong opinions about the structure of the latent variable distribution (posterior regularization), most generative algorithms specify a latent space of equal or lower dimension.

Conversely, if the data is highly complex or contains intricate patterns, the optimal latent dimension may be higher, meaningfully capturing the underlying complexity of the data. In a previous project, we found cases of overcomplete autoencoders performing better on an anomaly detection task vis-à-vis their undercomplete counterparts. In practice, the optimal dimensionality probably depends on your particular task and dataset.

Long-haired Dogs

It's more difficult to interpret the meaning of individual latents in higher-dimensional spaces than in lower-dimensional ones, because it's hard to guess which additional characteristics of the objects the extra dimensions encode. In both cases interpretation takes additional work, but the benefit is some level of control at generation time. Suppose we're generating images of dogs, and we know that a particular latent affects hair length. We could generate images of long-haired dogs by manually fixing the value of this latent, as sketched below.
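Here is what that intervention could look like (hypothetical names throughout: a trained `decoder`, a latent index `HAIR_LENGTH_DIM` we've somehow identified, and a value `LONG_HAIR_VALUE` observed to produce long hair):

```python
import torch

HAIR_LENGTH_DIM = 7    # hypothetical: the latent we believe controls hair length
LONG_HAIR_VALUE = 3.0  # hypothetical: a value of that latent which yields long hair

def generate_long_haired_dogs(decoder: torch.nn.Module, n: int, latent_dim: int = 16) -> torch.Tensor:
    """Sample latents from the prior, then manually fix the hair-length latent."""
    z = torch.randn(n, latent_dim)             # z ~ p(z)
    z[:, HAIR_LENGTH_DIM] = LONG_HAIR_VALUE    # intervention on a single latent
    return decoder(z)                          # decode to images of (hopefully) long-haired dogs
```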


ELBO

As with any likelihood-based model, the objective is to maximize the marginal log-likelihood of the observed data. There exists some true joint distribution $p(\bold{x},\bold{z})$, so a naïve way to proceed is to marginalize out the latents.

$$
\begin{equation*}
p(\bold{x}) = \int_{\bold{z}} p(\bold{x}, \bold{z}) d\bold{z}
\end{equation*}
$$

But this integral is intractable for all but simple parametric forms; furthermore, the latents $\bold{z}$ are unobserved. Instead, start with the evidence $\log p(\bold{x})$ and use the chain rule of probability. All distributions below are the true distributions, except the approximate posterior $q_{\pmb{\phi}}$.

$$
\begin{align*}
\log p(\bold{x}) &= \log p(\bold{x}) \int q_{\pmb{\phi}}(\bold{z}|\bold{x})d\bold{z} \\
&= \int \log p(\bold{x}) \, q_{\pmb{\phi}}(\bold{z}|\bold{x})d\bold{z} \\
&= \int \log \frac{p(\bold{x}, \bold{z})}{p(\bold{z}|\bold{x})} q_{\pmb{\phi}}(\bold{z}|\bold{x})d\bold{z} \\
&= \expect \left[\log\frac{p(\bold{x}, \bold{z})}{p(\bold{z}|\bold{x})}\right] \\
&= \expect \left[ \log\frac{p(\bold{x}, \bold{z})}{p(\bold{z}|\bold{x})}\cdot\frac{q_{\pmb{\phi}}(\bold{z}|\bold{x})}{q_{\pmb{\phi}}(\bold{z}|\bold{x})} \right] \\
&= \expect \left[ \log \frac{p(\bold{x}, \bold{z})}{q_{\pmb{\phi}}(\bold{z}|\bold{x})} \right] + \expect \left[ \log \frac{q_{\pmb{\phi}}(\bold{z}|\bold{x})}{p(\bold{z}|\bold{x})}\right] \\
&= ELBO + D_{KL}(q_{\pmb{\phi}}(\bold{z} | \bold{x}) || p(\bold{z} | \bold{x}))
\end{align*}
$$

Note that the way $q_{\pmb{\phi}}$ was introduced does not violate equality. The integral in the first line equals 1 by the definition of a probability density function. In the second line we brought the evidence into the integral, in the third line we applied the chain rule $p(\bold{x}, \bold{z}) = p(\bold{z}|\bold{x})p(\bold{x})$, and in the last two lines we multiplied by $q_{\pmb{\phi}}/q_{\pmb{\phi}} = 1$ inside the log and split the result into two expectations.
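As a numerical sanity check on the last line (a toy setup of my own, not from the derivation: a 1-D conjugate Gaussian model in which the evidence and the true posterior are available in closed form), the evidence should equal the ELBO plus the KL gap even for a deliberately wrong $q_{\pmb{\phi}}$:

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)

# Toy model where everything is tractable: z ~ N(0, 1), x | z ~ N(z, sigma^2).
sigma = 0.5
x = torch.tensor(1.3)  # a single observation

prior = Normal(0.0, 1.0)
likelihood = lambda z: Normal(z, sigma)

# Closed forms for this conjugate Gaussian model.
evidence = Normal(0.0, (1 + sigma**2) ** 0.5).log_prob(x)                    # log p(x)
posterior = Normal(x / (1 + sigma**2), (sigma**2 / (1 + sigma**2)) ** 0.5)   # p(z | x)

# An arbitrary (deliberately wrong) approximate posterior q_phi(z | x).
q = Normal(0.2, 0.8)

# Monte Carlo estimate of the ELBO: E_q[ log p(x, z) - log q(z | x) ].
z = q.sample((100_000,))
elbo = (likelihood(z).log_prob(x) + prior.log_prob(z) - q.log_prob(z)).mean()

gap = kl_divergence(q, posterior)              # D_KL(q || p(z | x)), closed form
print(evidence.item(), (elbo + gap).item())    # the two numbers agree up to Monte Carlo noise
```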

First Form

$$
\begin{align*}
ELBO &= \expect \left[ \log \frac{p(\bold{x}, \bold{z})}{q_{\pmb{\phi}}(\bold{z}|\bold{x})} \right] \\
&= \log p(\bold{x}) - D_{KL}(q_{\pmb{\phi}}(\bold{z}|\bold{x}) || p(\bold{z} | \bold{x})) \\
&\leq \log p(\bold{x})
\end{align*}
$$

This first form introduces the definition of the ELBO and shows that it is a lower bound on the evidence, due to the non-negativity of KL divergence. Since the evidence $\log p(\bold{x})$ does not depend on $\pmb{\phi}$, maximizing the ELBO w.r.t. $\pmb{\phi}$ is equivalent to minimizing $D_{KL}(q_{\pmb{\phi}}(\bold{z}|\bold{x}) || p(\bold{z} | \bold{x}))$, i.e. learning an approximate posterior $q_{\hat{\pmb{\phi}}}$.

Second Form

However, when training a network you must estimate the loss on every forward pass via some (hopefully unbiased) estimator, and you don't know the true posterior $p(\bold{z} | \bold{x})$ that appears in the first form. Instead, we tease out the prior.

$$
\begin{align*}
ELBO &= \expect \left[ \log \frac{p(\bold{x}, \bold{z})}{q_{\pmb{\phi}}(\bold{z} | \bold{x})} \right] \\
&= \expect \left[ \log \frac{p(\bold{z})p(\bold{x} | \bold{z})}{q_{\pmb{\phi}}(\bold{z} | \bold{x})} \right] \\
&= \expect \left[ \log p(\bold{x} | \bold{z}) \right] - D_{KL}(q_{\pmb{\phi}}(\bold{z} | \bold{x}) || p(\bold{z}))
\end{align*}
$$

Kingma & Welling (2013) assume that the true likelihood $p(\bold{x} | \bold{z})$ comes from some parametric family of distributions $p_{\pmb{\theta}}$ (Gaussian for continuous data, Bernoulli for binary data). In other words, assume the parametric form is known, but the parameters must be estimated. Under this assumption we have, with equality,

$$
\begin{align*}
ELBO = \expect \left[ \log p_{\pmb{\theta}}(\bold{x} | \bold{z}) \right] - D_{KL}(q_{\pmb{\phi}}(\bold{z} | \bold{x}) || p(\bold{z}))
\end{align*}
$$

This form shows that maximizing the ELBO w.r.t. $\pmb{\theta}$ and $\pmb{\phi}$ simultaneously learns a posterior $q_{\hat{\pmb{\phi}}}$ that moves toward maximizing the data likelihood $p_{\hat{\pmb{\theta}}}(\bold{x} | \bold{z})$, while regularizing the posterior by keeping it close to the prior $p(\bold{z})$. That is, it learns a posterior that produces effective latents for reconstruction, while (for smooth priors such as the standard normal) discouraging $q_{\hat{\pmb{\phi}}}$ from overfitting and collapsing into Dirac deltas.
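Here is a sketch of how this second form typically becomes a training objective (my assumptions, in the spirit of Kingma & Welling: a diagonal Gaussian $q_{\pmb{\phi}}$ sampled with the reparameterization trick, a Bernoulli $p_{\pmb{\theta}}(\bold{x}|\bold{z})$ over binary data, and a standard normal prior so that the KL term is analytic; `encoder` and `decoder` are hypothetical networks):

```python
import torch
from torch.distributions import Normal, Bernoulli, kl_divergence

def elbo_estimate(x, encoder, decoder):
    """Single-sample estimate of the second form of the ELBO, averaged over a batch.

    encoder(x) -> (mu, log_std) of the diagonal Gaussian q_phi(z | x)
    decoder(z) -> Bernoulli probabilities of p_theta(x | z); x is assumed binary
    """
    mu, log_std = encoder(x)
    q = Normal(mu, log_std.exp())    # q_phi(z | x)
    z = q.rsample()                  # reparameterized sample: keeps gradients w.r.t. phi

    log_px_given_z = Bernoulli(probs=decoder(z)).log_prob(x).sum(dim=-1)  # one-sample estimate of E_q[log p_theta(x|z)]
    kl = kl_divergence(q, Normal(0.0, 1.0)).sum(dim=-1)                   # D_KL(q_phi(z|x) || p(z)), analytic

    return (log_px_given_z - kl).mean()
```

In practice one minimizes the negative of this quantity with gradient descent, updating the decoder's parameters $\pmb{\theta}$ and the encoder's parameters $\pmb{\phi}$ jointly.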

Third Form

The (differential) Shannon entropy of a distribution $p$ over a random variable $x$ is

$$
\begin{equation*}
H(p) = E_{x \sim p(x)}[-\log p(x)]
\end{equation*}
$$

Starting from the definition of the ELBO, we invoke the chain rule the other way, factoring $p(\bold{x}, \bold{z}) = p(\bold{x})p(\bold{z} | \bold{x})$.

$$
\begin{align*}
ELBO &= \expect \left[ \log \frac{p(\bold{x}, \bold{z})}{q_{\pmb{\phi}}(\bold{z} | \bold{x})} \right] \\
&= \expect \left[ \log \frac{p(\bold{x})p(\bold{z} | \bold{x})}{q_{\pmb{\phi}}(\bold{z} | \bold{x})} \right] \\
&= \log p(\bold{x}) + \expect[\log p(\bold{z} | \bold{x})] + \expect[-\log q_{\pmb{\phi}}(\bold{z} | \bold{x})] \\
&= \log p(\bold{x}) + \expect[\log p(\bold{z} | \bold{x})] + H(q_{\pmb{\phi}})
\end{align*}
$$

This last form shows that maximizing the ELBO w.r.t. $\pmb{\phi}$ includes a term that rewards the entropy of $q_{\pmb{\phi}}$,

$$
\begin{equation*}
H(q_{\pmb{\phi}}) = - \int_{\bold{z}} q_{\pmb{\phi}} \log q_{\pmb{\phi}} \, d\bold{z}
\end{equation*}
$$

Because of the negative sign, high entropy corresponds to a $q_{\pmb{\phi}}$ that spreads its probability mass over a wide region of the latent space rather than concentrating it. This encourages exploration of different parts of the latent space and implies a degree of uncertainty or variability in the inferred latents, which can be beneficial when the true posterior is complex and multimodal.
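For intuition, if we assume a diagonal Gaussian $q_{\pmb{\phi}}(\bold{z}|\bold{x}) = \mathcal{N}(\pmb{\mu}, \mathrm{diag}(\pmb{\sigma}^2))$ (a common choice, not required by the derivation above), the entropy has a closed form:

$$
\begin{equation*}
H(q_{\pmb{\phi}}) = \frac{d}{2}\left(1 + \log 2\pi\right) + \sum_{i=1}^{d} \log \sigma_i
\end{equation*}
$$

It grows with each $\log \sigma_i$ and tends to $-\infty$ as any $\sigma_i \to 0$, so the entropy term explicitly pushes $q_{\pmb{\phi}}$ away from collapsing into a Dirac delta.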