# KL-Divergence

## Definition

\begin{align} D_{\mathrm{KL}}\left( p\|q \right) &= \int p(x) \log\frac{p(x)}{q(x)} \,dx \\ &= -\int p(x) \log\frac{q(x)}{p(x)} \,dx \end{align}

The KL divergence measures the discrepancy between two distributions $$p(x)$$ and $$q(x)$$. It is always non-negative, since by Jensen's inequality: $D_{\mathrm{KL}}\left( p\|q \right) = -\int p(x) \log\frac{q(x)}{p(x)} \,dx \geq -\log \int p(x) \frac{q(x)}{p(x)} \,dx = -\log \int q(x) \,dx = 0$
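As a quick numerical sanity check (a sketch using NumPy; the two Gaussians and all constants are hand-picked for illustration, not from the text), a Monte Carlo estimate of $$D_{\mathrm{KL}}(p\|q)$$ under samples from $$p$$ matches the closed form for two Gaussians:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hand-picked example: p = N(0, 1), q = N(1, 2^2)
mu_p, sig_p = 0.0, 1.0
mu_q, sig_q = 1.0, 2.0

def log_gauss(x, mu, sig):
    return -0.5 * np.log(2 * np.pi * sig**2) - (x - mu)**2 / (2 * sig**2)

# Monte Carlo estimate: D_KL(p||q) = E_p[log p(x) - log q(x)]
x = rng.normal(mu_p, sig_p, size=200_000)
kl_mc = np.mean(log_gauss(x, mu_p, sig_p) - log_gauss(x, mu_q, sig_q))

# Closed form for two 1-D Gaussians
kl_exact = np.log(sig_q / sig_p) + (sig_p**2 + (mu_p - mu_q)**2) / (2 * sig_q**2) - 0.5

print(kl_mc, kl_exact)  # both around 0.44
```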

### KL-Divergence is Not Symmetric

KL divergence is not symmetric: $D_{\mathrm{KL}}\left( p\|q \right) \neq D_{\mathrm{KL}}\left( q\|p \right)$. When training a model distribution $$q(x)$$ to match a data distribution $$p(x)$$, minimizing the forward KL $$D_{\mathrm{KL}}\left( p\|q \right)$$ yields a spread-out (mode-covering) $$q(x)$$, since $$q$$ is heavily penalized wherever $$p(x) > 0$$ but $$q(x) \approx 0$$:

Conversely, minimizing the reverse KL $$D_{\mathrm{KL}}\left( q\|p \right)$$ yields a $$q(x)$$ concentrated at some modes of $$p(x)$$ (mode-seeking), since $$q$$ is heavily penalized wherever $$q(x) > 0$$ but $$p(x) \approx 0$$:
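The asymmetry is easy to see numerically. A minimal sketch with two hand-picked discrete distributions (a bimodal $$p$$ and a $$q$$ concentrated between its modes): the two directions give clearly different values.

```python
import numpy as np

def kl(p, q):
    # D_KL(p||q) for discrete distributions with full shared support
    return np.sum(p * np.log(p / q))

# Hand-picked: p puts mass on the outer bins, q on the inner bins
p = np.array([0.45, 0.05, 0.05, 0.45])
q = np.array([0.10, 0.40, 0.40, 0.10])

print(kl(p, q), kl(q, p))  # the two directions differ
```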

## Estimating KL-Divergence with a Classifier

### Theory

Define a classifier with logit $$r(x)$$ that predicts whether a sample came from $$p$$: $p(x\in p|x) = \frac{1}{1+\exp(-r(x))}$

Train it on the balanced mixture of the two distributions: $\tilde{p}(x) = 0.5 (p(x) + q(x))$

By Bayes' rule with equal class priors, the optimal classifier $$p^\star(x\in p|x)$$ and its logit $$r^{\star}(x)$$ are: \begin{align} p^\star(x\in p|x) &= \frac{p(x)}{p(x)+q(x)} \\ r^\star(x) &= \log \frac{p(x)}{q(x)} \end{align}

Thus we can use the learned logit $$r^{\star}(x)$$ to estimate the KL divergence: $D_{\mathrm{KL}}\left( p\|q \right) = \mathbb{E}_{p(x)}\left[ r^{\star}(x) \right]$
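These identities can be verified directly for discrete distributions (a sketch; the two distributions are hand-picked): the sigmoid of $$r^\star(x) = \log\frac{p(x)}{q(x)}$$ equals $$\frac{p(x)}{p(x)+q(x)}$$, and $$\mathbb{E}_{p}[r^\star]$$ recovers the exact KL divergence.

```python
import numpy as np

# Hand-picked discrete p and q on 4 outcomes
p = np.array([0.5, 0.2, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.3, 0.3])

r_star = np.log(p / q)            # optimal classifier logit
post = 1 / (1 + np.exp(-r_star))  # sigmoid of the logit

# Sigmoid of log(p/q) equals the optimal posterior p/(p+q)
assert np.allclose(post, p / (p + q))

# E_p[r*] recovers the exact KL divergence
kl_from_r = np.sum(p * r_star)
kl_exact = np.sum(p * np.log(p / q))
print(kl_from_r, kl_exact)
```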

### Training objective

The classifier is trained by maximizing the binary cross-entropy objective $$\mathcal{L}$$, whose maximizer is $$r^\star(x) = \log \frac{p(x)}{q(x)}$$ (from which $$D_{\mathrm{KL}}\left( p\|q \right) = \mathbb{E}_{p(x)}[r^\star(x)]$$ follows). The objective $$\mathcal{L}$$ is formulated as: \begin{align} \mathcal{L} &= \mathbb{E}_{\tilde{p}(x)}\left[ I(x\in p) \log p(x\in p|x) + (1 - I(x \in p)) \log (1 - p(x\in p|x)) \right] \\ &= \mathbb{E}_{p(x)}\left[ \log \frac{1}{1+\exp(-r(x))} \right] + \mathbb{E}_{q(x)}\left[ \log \frac{\exp(-r(x))}{1+\exp(-r(x))} \right] \\ &= -\mathbb{E}_{p(x)}\left[ \log \left( 1+\exp(-r(x)) \right) \right] - \mathbb{E}_{q(x)}\left[ \log \left( 1+\exp(r(x)) \right) \right] \end{align}
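The whole recipe can be sketched end to end with NumPy (all choices here are illustrative assumptions: two hand-picked Gaussians, quadratic features $$[1, x, x^2]$$, which suffice because $$\log\frac{p}{q}$$ is quadratic for Gaussians, and plain full-batch gradient ascent on $$\mathcal{L}$$). The learned logit averaged over samples from $$p$$ approximates the closed-form KL:

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

rng = np.random.default_rng(1)

# Samples from p = N(0, 1) and q = N(1, 2^2) (hand-picked example)
xp = rng.normal(0.0, 1.0, size=20_000)
xq = rng.normal(1.0, 2.0, size=20_000)

def feats(x):
    # log p(x)/q(x) is quadratic for Gaussians, so [1, x, x^2] suffices
    return np.stack([np.ones_like(x), x, x**2], axis=1)

X = np.vstack([feats(xp), feats(xq)])
y = np.concatenate([np.ones(len(xp)), np.zeros(len(xq))])  # label 1: from p

# Full-batch gradient ascent on the binary cross-entropy objective L
w = np.zeros(3)
for _ in range(5000):
    sigma = expit(X @ w)
    w += 0.05 * X.T @ (y - sigma) / len(y)

# Estimate D_KL(p||q) = E_p[r*(x)] with the learned logit
kl_est = np.mean(feats(xp) @ w)

# Closed form for these two Gaussians
kl_exact = np.log(2.0) + (1.0 + 1.0) / 8.0 - 0.5
print(kl_est, kl_exact)
```

With enough samples and training steps the estimate lands close to the exact value; in practice the quality of the estimate is limited by how well the classifier family can represent $$\log\frac{p}{q}$$.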

Donsker and Varadhan (1983) proposed a dual representation of the KL divergence: $\mathrm{D}_{KL}(p\|q) = \sup_{T:\Omega \to \mathbb{R}} \left\{ \mathbb{E}_{p}[T(x)] - \log\left( \mathbb{E}_{q}[e^{T(x)}] \right) \right\}$
Restricting the supremum to a specific family of functions $$\mathcal{F}$$ yields a lower bound: $\mathrm{D}_{KL}(p\|q) \geq \sup_{T \in \mathcal{F}} \left\{ \mathbb{E}_{p}[T(x)] - \log\left( \mathbb{E}_{q}[e^{T(x)}] \right) \right\}$ where the inequality is tight for $$T^*$$ satisfying: $p(x) \,dx = \frac{1}{Z} e^{T^*(x)}\,q(x)\,dx, \quad \text{where } Z = \mathbb{E}_{q}\left[e^{T^*(x)}\right]$
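The dual representation can be checked numerically for discrete distributions (a sketch; the distributions are hand-picked): the critic $$T^* = \log\frac{p}{q}$$ attains the supremum exactly (here $$Z = 1$$), while randomly chosen critics never exceed the true KL.

```python
import numpy as np

# Hand-picked discrete p and q on 4 outcomes
p = np.array([0.5, 0.2, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.3, 0.3])
kl_exact = np.sum(p * np.log(p / q))

def dv_value(T):
    # Donsker-Varadhan objective: E_p[T] - log E_q[e^T]
    return np.sum(p * T) - np.log(np.sum(q * np.exp(T)))

# The optimal critic T* = log(p/q) attains the supremum exactly
attained = dv_value(np.log(p / q))

# Any other critic gives a value at most the true KL
rng = np.random.default_rng(2)
bounds = [dv_value(rng.normal(size=4)) for _ in range(100)]
print(attained, kl_exact, max(bounds) <= kl_exact + 1e-12)
```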