KL-Divergence

Definition

\[ \begin{align} D_{\mathrm{KL}}\left( p\|q \right) &= \int p(x) \log\frac{p(x)}{q(x)} \,dx \\ &= -\int p(x) \log\frac{q(x)}{p(x)} \,dx \end{align} \]

Measures the divergence between two distributions \(p(x)\) and \(q(x)\). It is always non-negative; by Jensen's inequality (\(\log\) is concave): \[ D_{\mathrm{KL}}\left( p\|q \right) = -\int p(x) \log\frac{q(x)}{p(x)} \,dx \geq -\log \int p(x) \frac{q(x)}{p(x)} \,dx = -\log \int q(x) \,dx = 0 \]
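
As a concrete check, here is a minimal Monte Carlo sketch for two univariate Gaussians (the parameters are illustrative), comparing a sample estimate of \(D_{\mathrm{KL}}\left( p\|q \right)\) with the Gaussian closed form; both should be close and non-negative:

```python
import numpy as np
from scipy.stats import norm

# Monte Carlo estimate of D_KL(p || q) for two univariate Gaussians,
# compared against the Gaussian closed form (parameters are illustrative).
rng = np.random.default_rng(0)
mu_p, sd_p = 0.0, 1.0
mu_q, sd_q = 1.0, 2.0

x = rng.normal(mu_p, sd_p, size=100_000)                     # samples from p
log_ratio = norm.logpdf(x, mu_p, sd_p) - norm.logpdf(x, mu_q, sd_q)
kl_mc = log_ratio.mean()                                      # E_p[log p(x)/q(x)]

# Closed form: log(sd_q/sd_p) + (sd_p^2 + (mu_p - mu_q)^2) / (2 sd_q^2) - 1/2
kl_exact = np.log(sd_q / sd_p) + (sd_p**2 + (mu_p - mu_q)**2) / (2 * sd_q**2) - 0.5
print(kl_mc, kl_exact)                                        # both close, both >= 0
```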

KL-Divergence is Not Symmetric

The KL-divergence is not symmetric: \[ D_{\mathrm{KL}}\left( p\|q \right) \neq D_{\mathrm{KL}}\left( q\|p \right) \] When fitting a model distribution \(q(x)\) to a data distribution \(p(x)\), minimizing the forward KL \(D_{\mathrm{KL}}\left( p\|q \right)\) yields a spread-out \(q(x)\) that covers all modes of \(p(x)\):

Figure 1: \(D_{\mathrm{KL}}\left( p\|q \right)\) yields a spread-out \(q(x)\)1

Conversely, minimizing the reverse KL \(D_{\mathrm{KL}}\left( q\|p \right)\) yields a \(q(x)\) concentrated at some modes of \(p(x)\):

Figure 2: \(D_{\mathrm{KL}}\left( q\|p \right)\) yields a \(q(x)\) concentrated at some modes of \(p(x)\)
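
A minimal numerical sketch of this asymmetry, assuming a bimodal \(p(x)\) built from two Gaussians and a single-Gaussian \(q(x)\) fitted on a grid (all choices here are illustrative):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Illustrative sketch: fit a single Gaussian q to a bimodal p under the
# forward KL D(p||q) vs. the reverse KL D(q||p), on a dense grid.
x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]
p = 0.5 * norm.pdf(x, -3, 1) + 0.5 * norm.pdf(x, 3, 1)   # bimodal data distribution

def kl(a, b):
    # discretized D_KL(a || b); small epsilon avoids log(0)
    return np.sum(a * (np.log(a + 1e-12) - np.log(b + 1e-12))) * dx

def fit(direction):
    def obj(params):
        mu, log_sd = params
        q = norm.pdf(x, mu, np.exp(log_sd))
        return kl(p, q) if direction == "forward" else kl(q, p)
    return minimize(obj, x0=[1.0, 0.0], method="Nelder-Mead").x

print(fit("forward"))  # mean near 0, large sd: q spreads over both modes
print(fit("reverse"))  # mean near +3, sd near 1: q locks onto one mode
```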

Estimating KL-Divergence with a Classifier

Theory

Define a classifier that predicts whether a sample \(x\) was drawn from \(p\): \[ p(x\in p|x) = \frac{1}{1+\exp(-r(x))} \] When it is trained on the mixture distribution \[ \tilde{p}(x) = 0.5 (p(x) + q(x)), \] the optimal classifier \(p^\star(x\in p|x)\) and its logit \(r^{\star}(x)\) are, by Bayes' rule with equal class priors: \[ \begin{align} p^\star(x\in p|x) &= \frac{p(x)}{p(x)+q(x)} \\ r^\star(x) &= \log \frac{p(x)}{q(x)} \end{align} \] Thus the learned \(r^{\star}(x)\) can be used to estimate the KL-divergence: \[ D_{\mathrm{KL}}\left( p\|q \right) = \mathbb{E}_{p(x)}\left[ r^{\star}(x) \right] \]
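
A minimal sketch of this estimator, assuming a linear logistic regression classifier and two Gaussians with equal variance (so the optimal logit \(r^\star(x)\) is linear in \(x\) and the model can represent it exactly):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Estimate D_KL(p || q) with a binary classifier.
# p = N(0, 1), q = N(1, 1); the true KL here is 0.5 and log p/q is linear in x.
rng = np.random.default_rng(0)
n = 100_000
xp = rng.normal(0.0, 1.0, size=(n, 1))   # label 1: samples from p
xq = rng.normal(1.0, 1.0, size=(n, 1))   # label 0: samples from q

X = np.vstack([xp, xq])
y = np.concatenate([np.ones(n), np.zeros(n)])

clf = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)   # near-unregularized
r = clf.decision_function(xp)            # r(x) ~= log p(x)/q(x) on p-samples
print(r.mean())                          # ~= 0.5, the true KL in this setup
```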

Training objective

Maximizing \(\mathcal{L}\) recovers \(r(x) = \log\frac{p(x)}{q(x)}\), where \(\mathcal{L}\) is the binary cross-entropy objective (dropping the constant factor \(1/2\) coming from \(\tilde{p}\)): \[ \begin{align} \mathcal{L} &= \mathbb{E}_{\tilde{p}(x)}\left[ I(x\in p) \log p(x\in p|x) + (1 - I(x \in p)) \log (1 - p(x\in p|x)) \right] \\ &= \mathbb{E}_{p(x)}\left[ \log \frac{1}{1+\exp(-r(x))} \right] + \mathbb{E}_{q(x)}\left[ \log \frac{\exp(-r(x))}{1+\exp(-r(x))} \right] \\ &= -\mathbb{E}_{p(x)}\left[ \log \left( 1+\exp(-r(x)) \right) \right] - \mathbb{E}_{q(x)}\left[ \log \left( 1+\exp(r(x)) \right) \right] \end{align} \]
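
In softplus form, \(-\mathcal{L}\) can be minimized directly. A minimal sketch, assuming a hypothetical model `r_model` that maps a batch of samples to the scalar logit \(r(x)\):

```python
import torch
import torch.nn.functional as F

# Negated objective -L, suitable for gradient descent; `r_model` is assumed
# to map a batch of samples to the scalar logit r(x).
def kl_classifier_loss(r_model, x_p, x_q):
    r_p = r_model(x_p).squeeze(-1)   # r(x) on samples from p
    r_q = r_model(x_q).squeeze(-1)   # r(x) on samples from q
    # -L = E_p[log(1 + e^{-r})] + E_q[log(1 + e^{r})]
    return F.softplus(-r_p).mean() + F.softplus(r_q).mean()
```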

The Donsker-Varadhan representation

Donsker and Varadhan (1983) proposed a dual representation of the KL-divergence: \[ D_{\mathrm{KL}}\left( p\|q \right) = \sup_{T:\Omega \to \mathbb{R}} \left\{ \mathbb{E}_{p}[T(x)] - \log\left( \mathbb{E}_{q}[e^{T(x)}] \right) \right\} \]

Moreover, the following inequality holds for \(T \in \mathcal{F}\), a restricted family of functions: \[ D_{\mathrm{KL}}\left( p\|q \right) \geq \sup_{T \in \mathcal{F}} \left\{ \mathbb{E}_{p}[T(x)] - \log\left( \mathbb{E}_{q}[e^{T(x)}] \right) \right\} \] The inequality is tight for \(T^*\) satisfying: \[ p(x) \,dx = \frac{1}{Z} e^{T^*(x)}\,q(x)\,dx, \quad \text{where } Z = \mathbb{E}_{q}\left[e^{T^*(x)}\right] \]
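
A minimal sketch of the bound, reusing the two Gaussians from the classifier example: plugging in the optimal critic \(T^\star = \log\frac{p}{q}\) recovers the true KL, while an arbitrary critic gives a strictly looser lower bound (the choice of critics here is illustrative):

```python
import numpy as np
from scipy.stats import norm

# Donsker-Varadhan lower bound E_p[T] - log E_q[e^T] for fixed critics T.
rng = np.random.default_rng(0)
n = 200_000
xp = rng.normal(0.0, 1.0, size=n)   # samples from p = N(0, 1)
xq = rng.normal(1.0, 1.0, size=n)   # samples from q = N(1, 1)

def dv_bound(T):
    return T(xp).mean() - np.log(np.mean(np.exp(T(xq))))

T_opt = lambda x: norm.logpdf(x, 0, 1) - norm.logpdf(x, 1, 1)  # T* = log p/q
T_sub = lambda x: -0.5 * x                                     # an arbitrary critic

print(dv_bound(T_opt))  # ~= 0.5, the true KL (the bound is tight at T*)
print(dv_bound(T_sub))  # < 0.5, a strictly looser lower bound
```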

References

Donsker, Monroe D., and S. R. Srinivasa Varadhan. 1983. “Asymptotic Evaluation of Certain Markov Process Expectations for Large Time. IV.” Communications on Pure and Applied Mathematics 36 (2): 183–212.


  1. Figures from https://wiseodd.github.io/techblog/2016/12/21/forward-reverse-kl/.↩︎