The KL divergence of distribution p from q is a measure of dissimilarity,
taken as the expectation of the log difference over the support of p.
It's used a lot in Bayesian inference (specifically, the divergence of a posterior p
from some prior q), so we're listing some properties and closed-form equations here.
For discrete distributions over sample space X, we have:
DKL(p∣∣q)=Ex∼p(x)[log(q(x)p(x))]=x∈X∑p(x)log(q(x)p(x))
whereas for continuous distributions we have the integral:
DKL(p∣∣q)=∫−∞∞p(x)log(q(x)p(x))dxRemark: KL divergence has units of "bits" when using log base 2; "nats" when using natural log.
Properties
DKL(p∣∣q)=0 if and only if p=q.
If p=q then logp(x)/q(x)=log(1)=0
for all x and DKL(p∣∣q)=0. The proof of the forward implication
is a little bit trickier.
Non-negativity (DKL≥0).
Since log(a)≤a−1 for all a>0, we have: DKL(p∣∣q)=−∑p(x)log(p(x)q(x))≥−∑p(x)(p(x)q(x)−1)(Invert the log)(letting a=q/p)
The right hand side is equal to zero, which completes the proof:
−∑p(x)(p(x)q(x)−1)DKL(p∣∣q)=−∑q(x)+∑p(x)=−1+1=0≥0
However the KL divergence is asymmetric.
DKL(p∣∣q)=DKL(q∣∣p)
And it doesn't satisfy the triangle inequality.
DKL(p∣∣r)≰DKL(p∣∣q)+DKL(q∣∣r)
Remark: Due to the last two properties, the KL divergence doesn't qualify as a distance metric,
which is why it's called the KL divergence. Because of asymmetry, we usually say that
DKL(p∣∣q) is the "divergence of p from q" rather than "between p and q".
Useful Results
We often encounter the KL divergence of a Gaussian
from another Gaussian. As you'd expect, the algebra works out nicely to closed-formed
equations. We list them here for the univariate (Eq. 3) and multivariate (Eq. 4) cases,
as well as for the special case of an isotropic posterior from a standard normal prior (Eq. 5).
p(x)=N(μ1,σ12),q(x)=N(μ2,σ22)
The KL divergence of an univariate Gaussian posterior p from a Gaussian prior q is given by Eq. (3).
For a standard normal prior, DKL only depends on the mean and variance of the posterior,
μ and σ2.
DKL(p∣∣q)=−ln(σ)−21+21(σ2+μ2)
p(x)=N(μ1,Σ1),q(x)=N(μ2,Σ2)
For multivariate Gaussian distributions we have:
DKL(p∣∣q)=∫−∞∞p(x)log(q(x)p(x))dx=∫−∞∞p(x)(21ln(Σ1Σ2)−21(x−μ1)TΣ1−1(x−μ1)+21(x−μ2)TΣ2−1(x−μ2))dx=21ln(Σ1Σ2)−21(a)Ex∼p(x)[(x−μ1)TΣ1−1(x−μ1)]+21(b)Ex∼p(x)[(x−μ2)TΣ2−1(x−μ2)]
To evaluate (a) and (b), we use the trace trick for the expectation of quadratic forms.
Recall that the trace of a
square matrix is the sum of the main diagonal. The quadratic forms inside (a) and (b)
evaluate to 1 by 1 matrices (each containing a single quadratic polynomial in x),
the trace of which is the polynomial itself. This allows us to write (a) as: Ex∼p(x)[tr((x−μ1)TΣ1−1(x−μ1))]
Now, traces are invariant under cyclic permutation, i.e. tr(ABC)=tr(BCA)=tr(CAB), and
the expectation of the trace of a matrix is equal to the trace of the expectation,
E[tr(A)]=tr(E[A]). Using both properties we get:
tr(Ex∼p(x)[(x−μ1)(x−μ1)TΣ1−1])
Since the inverse covariance matrix is constant w.r.t. x, we move it out of the
expectation and find that (a) equals n.
tr(Σ1−1Ex∼p(x)[(x−μ1)(x−μ1)T])=tr(Σ1−1Σ1)=tr(I)=n
Using the same trick we evaluate (b). Start by introducing μ1:
Ex∼p(x)[((x−μ1)+(μ1−μ2))TΣ2−1((x−μ1)+(μ1−μ2))]=Ex∼p(x)[((x−μ1)TΣ2−1+(μ1−μ2)TΣ2−1)((x−μ1)+(μ1−μ2))]=Ex∼p(x)[(x−μ1)TΣ2−1(x−μ1)+(x−μ1)TΣ2−1(μ1−μ2)+(μ1−μ2)TΣ2−1(x−μ1)+(μ1−μ2)TΣ2−1(μ1−μ2)]=Ex∼p(x)[(x−μ1)TΣ2−1(x−μ1)]+(μ1−μ2)Σ2−1Ex∼p(x)[(x−μ1)T]+(μ1−μ2)TΣ2−1Ex∼p(x)[(x−μ1)]+(μ1−μ2)TΣ2−1(μ1−μ2)(Introduce μ1)(Distribute Σ2−1)(Move constants)
Since Ex∼p(x)[(x−μ1)]=Ex∼p(x)[x]−μ1=0,
we are left with:
Ex∼p(x)[(x−μ1)TΣ2−1(x−μ1)]+(μ1−μ2)TΣ2−1(μ1−μ2)=Ex∼p(x)[tr((x−μ1)(x−μ1)TΣ2−1)]+(μ1−μ2)TΣ2−1(μ1−μ2)=tr(Ex∼p(x)[(x−μ1)(x−μ1)TΣ2−1])+(μ1−μ2)TΣ2−1(μ1−μ2)(Trace trick)=tr(Ex∼p(x)[(x−μ1)(x−μ1)T]Σ2−1)+(μ1−μ2)TΣ2−1(μ1−μ2)=tr(Σ1Σ2−1)+(μ1−μ2)TΣ2−1(μ1−μ2)(Move Σ2−1)(Definition of Σ1)
Plugging back the simplified expressions for (a) and (b) yields:
DKL(p∣∣q)=21(ln(Σ1Σ2)−n+tr(Σ1Σ2−1)+(μ1−μ2)TΣ2−1(μ1−μ2))(4)
p(x)=N(μ,diag(σ12,...,σn2)),q(x)=N(0,I)
For an isotropic Gaussian posterior and the standard normal prior, we have
μ1=μ,Σ1=diag(σ12,...,σn2),μ2=0
and Σ2=I.
DKL(p∣∣q)=21(ln(Σ1Σ2)−n+tr(Σ1Σ2−1)+(μ1−μ2)TΣ2−1(μ1−μ2))=21(ln(Σ1I)−n+tr(Σ1I)+(μ−0)TI(μ−0))=21(ln(diag(σ12,...,σn2)1)−n+tr(diag(σ12,...,σn2))+μTμ)=21(−ln(i=1∏nσi2)−i=1∑n1+i=1∑nσi2+i=1∑nμi2)=21(−i=1∑nlnσi2+i=1∑n−1+σi2+μi2)=21i=1∑n(−lnσi2−1+σi2+μi2)(5)