Useful KL Divergence Results


\gdef\expect{E_{x \sim p(x)}} \gdef\expectb{E_{\bold{x} \sim p(\bold{x})}}

The KL divergence of distribution $p$ from $q$ is a measure of dissimilarity, taken as the expectation of the log difference over the support of $p$. It's used a lot in Bayesian inference (specifically, the divergence of a posterior $p$ from some prior $q$), so we're listing some properties and closed-form equations here.

For discrete distributions over a sample space $\mathcal{X}$, we have: \begin{equation} D_{KL}(p||q) = \expect\left[\log\left(\frac{p(x)}{q(x)}\right)\right]= \sum_{x \in \mathcal{X}} p(x)\log\left(\frac{p(x)}{q(x)}\right) \end{equation} whereas for continuous distributions we have the integral: \begin{equation} D_{KL}(p||q) = \int_{-\infty}^{\infty} p(x)\log\left(\frac{p(x)}{q(x)}\right)dx \end{equation} Remark: the KL divergence has units of "bits" when using log base 2, and "nats" when using the natural log.
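The discrete definition is a one-liner in code. Below is a minimal sketch (the function name `kl_divergence` and the example distributions are ours, not from the text) that computes $D_{KL}(p||q)$ in either nats or bits:

```python
import numpy as np

def kl_divergence(p, q, base=np.e):
    """D_KL(p||q) for discrete distributions given as probability arrays.

    base=np.e yields nats; base=2 yields bits (see the remark above).
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # by convention, terms with p(x) = 0 contribute zero
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) / np.log(base)

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl_divergence(p, q))          # divergence in nats
print(kl_divergence(p, q, base=2))  # same divergence in bits
```

Note the `p > 0` mask: outcomes with $p(x) = 0$ contribute nothing to the sum, which matches the usual convention $0 \log 0 = 0$.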


Properties

  1. $D_{KL}(p||q) = 0$ if and only if $p=q$.

    If $p=q$ then $\log p(x)/q(x) = \log(1) = 0$ for all $x$, so $D_{KL}(p||q)=0$. The forward implication is a little trickier; it follows from the fact that the bound $\log(a) \leq a-1$ used in Property 2 holds with equality only at $a=1$, so $D_{KL}(p||q)=0$ forces $q(x)/p(x)=1$ wherever $p(x)>0$.

  2. Non-negativity ($D_{KL} \geq 0$).

    Since $\log(a) \leq a-1$ for all $a > 0$, we have:
    \begin{align} D_{KL}(p||q)&=-\sum p(x) \log \left(\frac{q(x)}{p(x)}\right) \tag{Invert the log} \\ &\geq -\sum p(x)\left(\frac{q(x)}{p(x)} - 1\right) \tag{letting $a=q/p$} \end{align} The right-hand side equals zero, which completes the proof: \begin{align} -\sum p(x)\left(\frac{q(x)}{p(x)} - 1\right) &=-\sum q(x) + \sum p(x) \tag*{}\\ &=-1 + 1 \tag*{}\\ &=0 \tag*{}\\ D_{KL}(p||q) &\geq 0 \tag*{} \end{align}

  3. However, the KL divergence is asymmetric. \begin{equation} \tag*{} D_{KL}(p||q) \neq D_{KL}(q||p) \end{equation}

  4. And it doesn't satisfy the triangle inequality. \begin{equation} \tag*{} D_{KL}(p||r) \nleq D_{KL}(p||q) + D_{KL}(q||r) \end{equation}
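Properties 2–4 are easy to illustrate numerically. A quick sketch (the three distributions $p$, $q$, $r$ below are arbitrary examples we chose; $r$ sits "beyond" $q$ from $p$, which is what makes the triangle inequality fail):

```python
import numpy as np

def kl(p, q):
    """Discrete D_KL(p||q) in nats; assumes strictly positive probabilities."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = [0.5, 0.5]
q = [0.3, 0.7]
r = [0.1, 0.9]

print(kl(p, q) >= 0 and kl(q, p) >= 0)   # non-negativity holds
print(np.isclose(kl(p, q), kl(q, p)))    # asymmetry: the two differ
print(kl(p, r) <= kl(p, q) + kl(q, r))   # triangle inequality fails here
```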

Remark: Due to the last two properties, the KL divergence doesn't qualify as a distance metric, which is why it's called a divergence rather than a distance. Because of the asymmetry, we usually say that $D_{KL}(p||q)$ is the "divergence of $p$ from $q$" rather than "between $p$ and $q$".


Useful Results

We often encounter the KL divergence of a Gaussian from another Gaussian. As you'd expect, the algebra works out nicely to closed-form expressions. We list them here for the univariate (Eq. 3) and multivariate (Eq. 4) cases, as well as for the special case of a diagonal-covariance posterior and a standard normal prior (Eq. 5).

\underline{ p(x)=N(\mu_1, \sigma_1^2), \quad q(x)=N(\mu_2, \sigma_2^2) }

The KL divergence of a univariate Gaussian posterior $p$ from a Gaussian prior $q$ is given by Eq. (3).

\begin{align*} D_{KL}(p||q) &= \int_{-\infty}^{\infty} p(x)\log\left(\frac{p(x)}{q(x)}\right)dx \\ &= \int_{-\infty}^{\infty} p(x) \log \left( \frac{\sigma_2}{\sigma_1} \exp \left[ -\frac{(x-\mu_1)^2}{2\sigma_1^2} + \frac{(x-\mu_2)^2}{2\sigma_2^2} \right] \right) dx \\ \tag{Use ln} &=\ln \left( \frac{\sigma_2}{\sigma_1} \right) - \frac{1}{2\sigma_1^2} \underbrace{\int_{-\infty}^{\infty}p(x)(x-\mu_1)^2dx}_{\sigma_1^2} + \frac{1}{2\sigma_2^2} \int_{-\infty}^{\infty}p(x)(x-\mu_2)^2dx \\ &=\ln \left( \frac{\sigma_2}{\sigma_1} \right) - \frac{1}{2} + \frac{1}{2\sigma_2^2} \int_{-\infty}^{\infty}p(x)(x^2-2\mu_2x + \mu_2^2)dx \\ &=\ln \left( \frac{\sigma_2}{\sigma_1} \right) - \frac{1}{2} + \frac{1}{2\sigma_2^2} \left( \expect[x^2] - 2\mu_1\mu_2 + \mu_2^2 \right) \\ \tag{$E[x^2] = \sigma^2 + \mu^2$} &=\ln \left( \frac{\sigma_2}{\sigma_1} \right) - \frac{1}{2} + \frac{1}{2\sigma_2^2} \left( \sigma_1^2 + \mu_1^2 - 2\mu_1\mu_2 + \mu_2^2 \right) \\\\ \tag{3} D_{KL}(p||q) &=\ln \left( \frac{\sigma_2}{\sigma_1} \right) - \frac{1}{2} + \frac{1}{2\sigma_2^2} \left( \sigma_1^2 + (\mu_1 - \mu_2)^2 \right) \\ \end{align*}


\underline{ p(x)=N(\mu, \sigma^2), \quad q(x)=N(0, 1) }

For a standard normal prior, $D_{KL}$ depends only on the mean and variance of the posterior, $\mu$ and $\sigma^2$. \begin{align*} D_{KL}(p||q) = -\ln(\sigma) - \frac{1}{2} + \frac{1}{2}(\sigma^2 + \mu^2) \end{align*}
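The special case can be checked against the general Eq. (3) by substituting $\mu_2=0$, $\sigma_2=1$. A minimal sketch (both helper names are ours):

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    """General univariate closed form, Eq. (3)."""
    return np.log(s2 / s1) - 0.5 + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2)

def kl_std_normal(mu, sigma):
    """Special case: q = N(0, 1)."""
    return -np.log(sigma) - 0.5 + 0.5 * (sigma**2 + mu**2)

# The specialization should match the general formula for any (mu, sigma)
for mu, sigma in [(0.0, 1.0), (1.5, 0.3), (-2.0, 2.0)]:
    assert np.isclose(kl_gauss(mu, sigma, 0.0, 1.0), kl_std_normal(mu, sigma))
print("special case matches Eq. (3)")
```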


\gdef\quadxmu#1#2#3{ (\bold{x}-\pmb{\mu_#1})^T\bold{\Sigma_#2^{-1}}(\bold{x}-\pmb{\mu_#3}) } \gdef\quadmumu#1#2#3{ (\pmb{\mu_#1}-\pmb{\mu_#2})^T\bold{\Sigma_#3^{-1}}(\pmb{\mu_#1}-\pmb{\mu_#2}) }

\underline{ p(\bold{x})=N(\pmb{\mu_1}, \bold{\Sigma_1}), \quad q(\bold{x})=N(\pmb{\mu_2}, \bold{\Sigma_2}) }

For multivariate Gaussian distributions we have: \begin{align*} D_{KL}(p||q) &= \int_{-\infty}^{\infty} p(\bold{x})\log\left(\frac{p(\bold{x})}{q(\bold{x})}\right)d\bold{x} \\ &= \int_{-\infty}^{\infty} p(\bold{x}) \left( \frac{1}{2} \ln \left( \frac{\begin{vmatrix}\bold{\Sigma_2}\end{vmatrix}}{\begin{vmatrix}\bold{\Sigma_1}\end{vmatrix}} \right) - \frac{1}{2} \quadxmu{1}{1}{1} + \frac{1}{2} \quadxmu{2}{2}{2} \right) d\bold{x} \\ &=\frac{1}{2}\ln \left( \frac{\begin{vmatrix}\bold{\Sigma_2}\end{vmatrix}}{\begin{vmatrix}\bold{\Sigma_1}\end{vmatrix}} \right) -\frac{1}{2} \underbrace{ \expectb\left[ \quadxmu{1}{1}{1} \right] }_{\text{(a)}} +\frac{1}{2} \underbrace{ \expectb\left[ \quadxmu{2}{2}{2} \right] }_{\text{(b)}} \end{align*}

To evaluate (a) and (b), we use the trace trick for the expectation of quadratic forms. Recall that the trace of a square matrix is the sum of its main diagonal. The quadratic forms inside (a) and (b) evaluate to $1 \times 1$ matrices (each containing a single quadratic polynomial in $\bold{x}$), the trace of which is the polynomial itself. This allows us to write (a) as:
\begin{align*} \expectb \left[ tr \left( \quadxmu{1}{1}{1} \right) \right] \end{align*} Now, traces are invariant under cyclic permutation, i.e. $tr(ABC)=tr(BCA)=tr(CAB)$, and the expectation of the trace of a matrix is equal to the trace of the expectation, $E[tr(A)]=tr(E[A])$. Using both properties we get: \begin{align*} tr\left( \expectb \left[ (\bold{x} - \pmb{\mu_1})(\bold{x} - \pmb{\mu_1})^T \bold{\Sigma_1^{-1}} \right] \right) \end{align*} Since the inverse covariance matrix is constant w.r.t. $\bold{x}$, we move it out of the expectation and find that (a) equals $n$. \begin{align*} tr\left( \bold{\Sigma_1^{-1}} \expectb \left[ (\bold{x}-\pmb{\mu_1})(\bold{x}-\pmb{\mu_1})^T \right] \right) = tr\left( \bold{\Sigma_1^{-1}}\bold{\Sigma_1} \right) = tr\left(\bold{I}\right) = n \end{align*} Using the same trick we evaluate (b). Start by introducing $\pmb{\mu_1}$: \begin{align*} \tag{Introduce $\pmb{\mu_1}$} &\expectb [ ((\bold{x} - \pmb{\mu_1}) + (\pmb{\mu_1} - \pmb{\mu_2}))^T \bold{\Sigma_2^{-1}} ((\bold{x} - \pmb{\mu_1}) + (\pmb{\mu_1} - \pmb{\mu_2})) ] \\ \tag{Distribute $\bold{\Sigma_2^{-1}}$} &= \expectb [ \left( (\bold{x}-\pmb{\mu_1})^T \bold{\Sigma_2^{-1}} + (\pmb{\mu_1}-\pmb{\mu_2})^T \bold{\Sigma_2^{-1}} \right) ((\bold{x} - \pmb{\mu_1}) + (\pmb{\mu_1} - \pmb{\mu_2})) ] \\ &= \expectb [ \quadxmu{1}{2}{1} + (\bold{x} - \pmb{\mu_1})^T \bold{\Sigma_2^{-1}} (\pmb{\mu_1} - \pmb{\mu_2}) \\ & \quad + (\pmb{\mu_1} - \pmb{\mu_2})^T \bold{\Sigma_2^{-1}} (\bold{x} - \pmb{\mu_1}) + \quadmumu{1}{2}{2} ] \\ \tag{Move constants} &= \expectb [\quadxmu{1}{2}{1}] + \expectb [(\bold{x} - \pmb{\mu_1})]^T \bold{\Sigma_2^{-1}} (\pmb{\mu_1} - \pmb{\mu_2}) \\ & \quad + (\pmb{\mu_1} - \pmb{\mu_2})^T \bold{\Sigma_2^{-1}} \expectb[(\bold{x} - \pmb{\mu_1})] + \quadmumu{1}{2}{2} \\ \end{align*} Since $\expectb [(\bold{x} - \pmb{\mu_1})] = \expectb[\bold{x}] - \pmb{\mu_1} = \bold{0}$, we are left with: \begin{align*} \expectb [\quadxmu{1}{2}{1}] + \quadmumu{1}{2}{2} \end{align*} \begin{align*} \tag{Trace trick} \begin{split} &= \expectb[tr((\bold{x} - \pmb{\mu_1})(\bold{x} - \pmb{\mu_1})^T \bold{\Sigma_2^{-1}})] + \quadmumu{1}{2}{2} \\ &= tr(\expectb[(\bold{x} - \pmb{\mu_1})(\bold{x} - \pmb{\mu_1})^T \bold{\Sigma_2^{-1}}]) + \quadmumu{1}{2}{2} \end{split} \\ \end{align*} \begin{align*} \tag{Move $\bold{\Sigma_2^{-1}}$} &= tr(\expectb[(\bold{x} - \pmb{\mu_1})(\bold{x} - \pmb{\mu_1})^T] \bold{\Sigma_2^{-1}}) + \quadmumu{1}{2}{2} \\ \tag{Definition of $\bold{\Sigma_1}$} &= tr(\bold{\Sigma_1} \bold{\Sigma_2^{-1}}) + \quadmumu{1}{2}{2} \end{align*} Plugging the simplified expressions for (a) and (b) back in yields: \begin{equation} \tag{4} D_{KL}(p||q) = \frac{1}{2} \left( \ln \left( \frac{\begin{vmatrix}\bold{\Sigma_2}\end{vmatrix}}{\begin{vmatrix}\bold{\Sigma_1}\end{vmatrix}} \right) - n + tr(\bold{\Sigma_1} \bold{\Sigma_2^{-1}}) + \quadmumu{1}{2}{2} \right) \end{equation}
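Eq. (4) is straightforward to implement with NumPy. Below is a sketch (the function name `kl_mvn` and the test matrices are ours) together with two sanity checks: the divergence of a distribution from itself is zero, and with $n=1$ the formula reduces to the univariate Eq. (3).

```python
import numpy as np

def kl_mvn(mu1, S1, mu2, S2):
    """Closed-form Eq. (4): KL of N(mu1, S1) from N(mu2, S2), in nats."""
    n = len(mu1)
    S2_inv = np.linalg.inv(S2)
    d = mu1 - mu2
    return 0.5 * (np.log(np.linalg.det(S2) / np.linalg.det(S1))
                  - n + np.trace(S1 @ S2_inv) + d @ S2_inv @ d)

mu = np.array([1.0, -1.0])
S = np.array([[2.0, 0.3], [0.3, 0.5]])
assert np.isclose(kl_mvn(mu, S, mu, S), 0.0)  # D_KL(p||p) = 0

# n = 1 recovers the univariate closed form, Eq. (3)
mu1, s1, mu2, s2 = 1.0, 0.5, -0.5, 2.0
uni = np.log(s2 / s1) - 0.5 + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2)
multi = kl_mvn(np.array([mu1]), np.array([[s1**2]]),
               np.array([mu2]), np.array([[s2**2]]))
assert np.isclose(uni, multi)
print("Eq. (4) passes both sanity checks")
```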


\underline{ p(\bold{x})=N(\pmb{\mu}, diag(\sigma_1^2, ..., \sigma_n^2)), \quad q(\bold{x})=N(\bold{0}, \bold{I}) }

For a diagonal-covariance Gaussian posterior and the standard normal prior, we have $\pmb{\mu_1} = \pmb{\mu}$, $\bold{\Sigma_1}=diag(\sigma_1^2, ..., \sigma_n^2)$, $\pmb{\mu_2}=\bold{0}$ and $\bold{\Sigma_2}=\bold{I}$. \begin{align*} D_{KL}(p||q) &= \frac{1}{2} \left( \ln \left( \frac{\begin{vmatrix}\bold{\Sigma_2}\end{vmatrix}}{\begin{vmatrix}\bold{\Sigma_1}\end{vmatrix}} \right) - n + tr(\bold{\Sigma_1} \bold{\Sigma_2^{-1}}) + \quadmumu{1}{2}{2} \right) \\ &= \frac{1}{2} \left( \ln \left( \frac{\begin{vmatrix}\bold{I}\end{vmatrix}}{\begin{vmatrix}\bold{\Sigma_1}\end{vmatrix}} \right) - n + tr(\bold{\Sigma_1} \bold{I}) + (\pmb{\mu} - \bold{0})^T\bold{I}(\pmb{\mu} - \bold{0}) \right) \\ &= \frac{1}{2} \left(\ln \left( \frac{1}{\begin{vmatrix}diag(\sigma_1^2, ..., \sigma_n^2)\end{vmatrix}} \right) - n + tr(diag(\sigma_1^2, ..., \sigma_n^2)) + \pmb{\mu}^T\pmb{\mu} \right) \\ &= \frac{1}{2} \left( -\ln \left( \prod_{i=1}^n \sigma_i^2 \right) - \sum_{i=1}^n 1 + \sum_{i=1}^n \sigma_i^2 + \sum_{i=1}^n \mu_i^2 \right) \\ &= \frac{1}{2} \left( - \sum_{i=1}^n \ln\sigma_i^2 + \sum_{i=1}^n \left( -1 + \sigma_i^2 + \mu_i^2 \right) \right) \\ \tag{5} &= \frac{1}{2} \sum_{i=1}^n \left( - \ln\sigma_i^2 - 1 + \sigma_i^2 +\mu_i^2 \right) \end{align*}
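Finally, a sketch verifying that Eq. (5) agrees with the general Eq. (4) when $\bold{\Sigma_1}$ is diagonal and $q = N(\bold{0}, \bold{I})$ (function names and test values are ours):

```python
import numpy as np

def kl_diag_std(mu, sig2):
    """Eq. (5): KL of N(mu, diag(sig2)) from N(0, I), in nats."""
    mu, sig2 = np.asarray(mu, dtype=float), np.asarray(sig2, dtype=float)
    return 0.5 * np.sum(-np.log(sig2) - 1.0 + sig2 + mu**2)

def kl_mvn(mu1, S1, mu2, S2):
    """Eq. (4), for comparison."""
    n = len(mu1)
    S2_inv = np.linalg.inv(S2)
    d = mu1 - mu2
    return 0.5 * (np.log(np.linalg.det(S2) / np.linalg.det(S1))
                  - n + np.trace(S1 @ S2_inv) + d @ S2_inv @ d)

mu = np.array([0.5, -1.0, 2.0])
sig2 = np.array([0.25, 1.0, 4.0])
assert np.isclose(kl_diag_std(mu, sig2),
                  kl_mvn(mu, np.diag(sig2), np.zeros(3), np.eye(3)))
print(kl_diag_std(mu, sig2))
```

This sum-over-dimensions form is the one typically used as the regularization term in variational autoencoders, where the encoder outputs $\pmb{\mu}$ and $\log \sigma_i^2$ per latent dimension.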