Latent Spaces and the ELBO


$\gdef\expect{ E_{\bold{z} \sim q_{ \pmb{\phi} }}}$

Here we'll make some remarks about latent spaces and derive the evidence lower bound (ELBO).

Used in variational Bayesian inference, the ELBO serves as a tractable proxy for the evidence, and maximizing it learns a model of the true posterior over latent variables. The framework assumes that in the real world, an observation $\bold{x}$ is generated by some underlying set of latent or "hidden" variables $\bold{z} = [z_1, z_2, ...]$, which are the true representation of the object, whereas $\bold{x} = [x_1, x_2, ...]$ are the measurable features. Latent variable models make this assumption about the world, and one can maximize the ELBO as part of an algorithm for generating novel data points.
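To make the generative assumption concrete, here is a minimal sketch of ancestral sampling from a latent variable model (my own illustration, assuming PyTorch, a standard normal prior over $\bold{z}$, and a hypothetical `decoder` network standing in for $p(\bold{x}|\bold{z})$):

```python
import torch

# Hypothetical decoder: maps a 16-dim latent z to the parameters of p(x|z),
# here Bernoulli probabilities over a 784-dim observation.
decoder = torch.nn.Sequential(
    torch.nn.Linear(16, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 784), torch.nn.Sigmoid(),
)

def sample_observation(n: int = 1) -> torch.Tensor:
    """Ancestral sampling: z ~ p(z), then x ~ p(x | z)."""
    z = torch.randn(n, 16)         # latents drawn from the standard normal prior
    probs = decoder(z)             # parameters of the likelihood p(x | z)
    return torch.bernoulli(probs)  # measurable features x

x_new = sample_observation(4)      # 4 "novel" data points (random here, since the decoder is untrained)
```

A trained model would learn the decoder's parameters so that samples produced this way resemble the observed data.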


Latent Spaces

Plato's Allegory

Luo gave an analogy between Plato's Allegory of the Cave and latent spaces. In the allegory, cavemen see each other as shadows cast by campfires. These shadows have measurable features in the cavemen's observable world (the shadow's height and width, the shape of the ears, and so forth), while a set of latents represents a complete description (temperament, language, caloric intake, etc.). The latents are a rich representation, while the observations are a compressed but measurable manifestation of the cavemen. From the observer's point of view, many aspects of the real objects are "lost in translation".

On the other hand, the latent space can be of lower dimension than the data, in which case the latents represent a compressed version of the measured data. Whether such a compression is good or bad depends on how accurately one can reconstruct the inputs from their latents.
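To make the reconstruction criterion concrete, here is a minimal sketch of an undercomplete autoencoder in PyTorch (my own illustration; the dimensions are arbitrary): the encoder compresses each input into a lower-dimensional latent, and the quality of the compression is judged by how well the decoder reconstructs the input from it.

```python
import torch
from torch import nn

class Autoencoder(nn.Module):
    """Undercomplete autoencoder: 784-dim inputs squeezed through 16 latents."""
    def __init__(self, input_dim: int = 784, latent_dim: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))   # reconstruct x from its latent

model = Autoencoder()
x = torch.rand(32, 784)                        # a stand-in batch of data
reconstruction_error = nn.functional.mse_loss(model(x), x)  # the yardstick for the compression
```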

Empiricism

Since learning a higher-dimensional latent space requires priors with strong opinions about the structure of the latent variable distribution (posterior regularization), most generative algorithms specify a latent space of equal or lower dimension.

Conversely, if the data is highly complex or contains intricate patterns, the optimal latent dimension may be higher, meaningfully capturing the underlying complexity of the data. In a previous project, we found cases of overcomplete autoencoders performing better on an anomaly detection task vis-à-vis their undercomplete counterparts. In practice, the optimal dimensionality probably depends on your particular task and dataset.

Long-haired Dogs

It's more difficult to interpret the meaning of individual latents in higher-dimensional spaces than in lower-dimensional ones, because it's hard to guess which additional characteristics of the objects the extra dimensions encode. In both cases interpretation takes additional work, but the benefit is some level of control at generation time. Suppose we're generating images of dogs, and we know that a particular latent affects hair length. We could generate images of long-haired dogs by manually fixing the value of this latent, as sketched below.
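Here is what that intervention could look like (hypothetical names throughout: a trained `decoder`, a latent index `HAIR_LENGTH_DIM` we've somehow identified, and a value `LONG_HAIR_VALUE` observed to produce long hair):

```python
import torch

HAIR_LENGTH_DIM = 7    # hypothetical: the latent we believe controls hair length
LONG_HAIR_VALUE = 3.0  # hypothetical: a value of that latent which yields long hair

def generate_long_haired_dogs(decoder: torch.nn.Module, n: int, latent_dim: int = 16) -> torch.Tensor:
    """Sample latents from the prior, then manually fix the hair-length latent."""
    z = torch.randn(n, latent_dim)             # z ~ p(z)
    z[:, HAIR_LENGTH_DIM] = LONG_HAIR_VALUE    # intervention on a single latent
    return decoder(z)                          # decode to images of (hopefully) long-haired dogs
```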


ELBO

As with any likelihood-based model, the objective is to maximize the marginal log-likelihood of the observed data. There exists some true joint distribution $p(\bold{x},\bold{z})$, so a naïve way to proceed is to marginalize out the latents.

$$
\begin{equation*}
p(\bold{x}) = \int_{\bold{z}} p(\bold{x}, \bold{z}) d\bold{z}
\end{equation*}
$$

But this integral is intractable for all but simple parametric forms; furthermore, the latents $\bold{z}$ are unobserved. Instead, start with the evidence $\log p(\bold{x})$ and use the chain rule of probability. All distributions below are the true distributions, except the approximate posterior $q_{\pmb{\phi}}$.

$$
\begin{align*}
\log p(\bold{x}) &= \log p(\bold{x}) \int q_{\pmb{\phi}}(\bold{z}|\bold{x})d\bold{z} \\
&= \int \log p(\bold{x}) \, q_{\pmb{\phi}}(\bold{z}|\bold{x})d\bold{z} \\
&= \int \log \frac{p(\bold{x}, \bold{z})}{p(\bold{z}|\bold{x})} q_{\pmb{\phi}}(\bold{z}|\bold{x})d\bold{z} \\
&= \expect \left[\log\frac{p(\bold{x}, \bold{z})}{p(\bold{z}|\bold{x})}\right] \\
&= \expect \left[ \log\frac{p(\bold{x}, \bold{z})}{p(\bold{z}|\bold{x})}\cdot\frac{q_{\pmb{\phi}}(\bold{z}|\bold{x})}{q_{\pmb{\phi}}(\bold{z}|\bold{x})} \right] \\
&= \expect \left[ \log \frac{p(\bold{x}, \bold{z})}{q_{\pmb{\phi}}(\bold{z}|\bold{x})} \right] + \expect \left[ \log \frac{q_{\pmb{\phi}}(\bold{z}|\bold{x})}{p(\bold{z}|\bold{x})}\right] \\
&= ELBO + D_{KL}(q_{\pmb{\phi}}(\bold{z} | \bold{x}) || p(\bold{z} | \bold{x}))
\end{align*}
$$

Note that the way $q_{\pmb{\phi}}$ was introduced does not violate equality. The integral in the first line equals 1 by the definition of a probability density function. In the second line we brought the evidence into the integral, in the third line we applied the chain rule $p(\bold{x}, \bold{z}) = p(\bold{z}|\bold{x})p(\bold{x})$, and in the last two lines we multiplied by $q_{\pmb{\phi}}/q_{\pmb{\phi}} = 1$ inside the log and split the result into two expectations.
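As a numerical sanity check on the last line (a toy setup of my own, not from the derivation: a 1-D conjugate Gaussian model in which the evidence and the true posterior are available in closed form), the evidence should equal the ELBO plus the KL gap even for a deliberately wrong $q_{\pmb{\phi}}$:

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)

# Toy model where everything is tractable: z ~ N(0, 1), x | z ~ N(z, sigma^2).
sigma = 0.5
x = torch.tensor(1.3)  # a single observation

prior = Normal(0.0, 1.0)
likelihood = lambda z: Normal(z, sigma)

# Closed forms for this conjugate Gaussian model.
evidence = Normal(0.0, (1 + sigma**2) ** 0.5).log_prob(x)                    # log p(x)
posterior = Normal(x / (1 + sigma**2), (sigma**2 / (1 + sigma**2)) ** 0.5)   # p(z | x)

# An arbitrary (deliberately wrong) approximate posterior q_phi(z | x).
q = Normal(0.2, 0.8)

# Monte Carlo estimate of the ELBO: E_q[ log p(x, z) - log q(z | x) ].
z = q.sample((100_000,))
elbo = (likelihood(z).log_prob(x) + prior.log_prob(z) - q.log_prob(z)).mean()

gap = kl_divergence(q, posterior)              # D_KL(q || p(z | x)), closed form
print(evidence.item(), (elbo + gap).item())    # the two numbers agree up to Monte Carlo noise
```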

First Form

$$
\begin{align*}
ELBO &= \expect \left[ \log \frac{p(\bold{x}, \bold{z})}{q_{\pmb{\phi}}(\bold{z}|\bold{x})} \right] \\
&= \log p(\bold{x}) - D_{KL}(q_{\pmb{\phi}}(\bold{z}|\bold{x}) || p(\bold{z} | \bold{x})) \\
&\leq \log p(\bold{x})
\end{align*}
$$

This first form introduces the definition of the ELBO and shows that it is a lower bound on the evidence, due to the non-negativity of KL divergence. Since the evidence $\log p(\bold{x})$ does not depend on $\pmb{\phi}$, maximizing the ELBO w.r.t. $\pmb{\phi}$ is equivalent to minimizing $D_{KL}(q_{\pmb{\phi}}(\bold{z}|\bold{x}) || p(\bold{z} | \bold{x}))$, i.e. learning an approximate posterior $q_{\hat{\pmb{\phi}}}$.

Second Form

However, when training a network you must estimate the loss on every forward pass via some (hopefully unbiased) estimator, and you don't know the true posterior $p(\bold{z} | \bold{x})$ that appears in the first form. Instead, we tease out the prior.

$$
\begin{align*}
ELBO &= \expect \left[ \log \frac{p(\bold{x}, \bold{z})}{q_{\pmb{\phi}}(\bold{z} | \bold{x})} \right] \\
&= \expect \left[ \log \frac{p(\bold{z})p(\bold{x} | \bold{z})}{q_{\pmb{\phi}}(\bold{z} | \bold{x})} \right] \\
&= \expect \left[ \log p(\bold{x} | \bold{z}) \right] - D_{KL}(q_{\pmb{\phi}}(\bold{z} | \bold{x}) || p(\bold{z}))
\end{align*}
$$

Kingma & Welling (2013) assume that the true likelihood $p(\bold{x} | \bold{z})$ comes from some parametric family of distributions $p_{\pmb{\theta}}$ (Gaussian for continuous data, Bernoulli for binary data). In other words, assume the parametric form is known, but the parameters must be estimated. Under this assumption we have, with equality,

$$
\begin{align*}
ELBO = \expect \left[ \log p_{\pmb{\theta}}(\bold{x} | \bold{z}) \right] - D_{KL}(q_{\pmb{\phi}}(\bold{z} | \bold{x}) || p(\bold{z}))
\end{align*}
$$

This form shows that maximizing the ELBO w.r.t. $\pmb{\theta}$ and $\pmb{\phi}$ simultaneously learns a posterior $q_{\hat{\pmb{\phi}}}$ that moves toward maximizing the data likelihood $p_{\hat{\pmb{\theta}}}(\bold{x} | \bold{z})$, while regularizing the posterior by keeping it close to the prior $p(\bold{z})$. That is, it learns a posterior that produces effective latents for reconstruction, while (for smooth priors such as the standard normal) discouraging $q_{\hat{\pmb{\phi}}}$ from overfitting and collapsing into Dirac deltas.
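Here is a sketch of how this second form typically becomes a training objective (my assumptions, in the spirit of Kingma & Welling: a diagonal Gaussian $q_{\pmb{\phi}}$ sampled with the reparameterization trick, a Bernoulli $p_{\pmb{\theta}}(\bold{x}|\bold{z})$ over binary data, and a standard normal prior so that the KL term is analytic; `encoder` and `decoder` are hypothetical networks):

```python
import torch
from torch.distributions import Normal, Bernoulli, kl_divergence

def elbo_estimate(x, encoder, decoder):
    """Single-sample estimate of the second form of the ELBO, averaged over a batch.

    encoder(x) -> (mu, log_std) of the diagonal Gaussian q_phi(z | x)
    decoder(z) -> Bernoulli probabilities of p_theta(x | z); x is assumed binary
    """
    mu, log_std = encoder(x)
    q = Normal(mu, log_std.exp())    # q_phi(z | x)
    z = q.rsample()                  # reparameterized sample: keeps gradients w.r.t. phi

    log_px_given_z = Bernoulli(probs=decoder(z)).log_prob(x).sum(dim=-1)  # one-sample estimate of E_q[log p_theta(x|z)]
    kl = kl_divergence(q, Normal(0.0, 1.0)).sum(dim=-1)                   # D_KL(q_phi(z|x) || p(z)), analytic

    return (log_px_given_z - kl).mean()
```

In practice one minimizes the negative of this quantity with gradient descent, updating the decoder's parameters $\pmb{\theta}$ and the encoder's parameters $\pmb{\phi}$ jointly.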

Third Form

The (differential) Shannon entropy of a distribution $p$ over a random variable $x$ is

$$
\begin{equation*}
H(p) = E_{x \sim p(x)}[-\log p(x)]
\end{equation*}
$$

Starting from the definition of the ELBO, we invoke the chain rule the other way, factoring $p(\bold{x}, \bold{z}) = p(\bold{x})p(\bold{z} | \bold{x})$.

$$
\begin{align*}
ELBO &= \expect \left[ \log \frac{p(\bold{x}, \bold{z})}{q_{\pmb{\phi}}(\bold{z} | \bold{x})} \right] \\
&= \expect \left[ \log \frac{p(\bold{x})p(\bold{z} | \bold{x})}{q_{\pmb{\phi}}(\bold{z} | \bold{x})} \right] \\
&= \log p(\bold{x}) + \expect[\log p(\bold{z} | \bold{x})] + \expect[-\log q_{\pmb{\phi}}(\bold{z} | \bold{x})] \\
&= \log p(\bold{x}) + \expect[\log p(\bold{z} | \bold{x})] + H(q_{\pmb{\phi}})
\end{align*}
$$

This last form shows that maximizing the ELBO w.r.t. $\pmb{\phi}$ includes a term that rewards the entropy of $q_{\pmb{\phi}}$,

$$
\begin{equation*}
H(q_{\pmb{\phi}}) = - \int_{\bold{z}} q_{\pmb{\phi}} \log q_{\pmb{\phi}} \, d\bold{z}
\end{equation*}
$$

Because of the negative sign, high entropy corresponds to a $q_{\pmb{\phi}}$ that spreads its probability mass over a wide region of the latent space rather than concentrating it. This encourages exploration of different parts of the latent space and implies a degree of uncertainty or variability in the inferred latents, which can be beneficial when the true posterior is complex and multimodal.
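For intuition, if we assume a diagonal Gaussian $q_{\pmb{\phi}}(\bold{z}|\bold{x}) = \mathcal{N}(\pmb{\mu}, \mathrm{diag}(\pmb{\sigma}^2))$ (a common choice, not required by the derivation above), the entropy has a closed form:

$$
\begin{equation*}
H(q_{\pmb{\phi}}) = \frac{d}{2}\left(1 + \log 2\pi\right) + \sum_{i=1}^{d} \log \sigma_i
\end{equation*}
$$

It grows with each $\log \sigma_i$ and tends to $-\infty$ as any $\sigma_i \to 0$, so the entropy term explicitly pushes $q_{\pmb{\phi}}$ away from collapsing into a Dirac delta.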