\[ \begin{align} D_{\mathrm{KL}}\left( p\|q \right) &= \int p(x) \log\frac{p(x)}{q(x)} \,dx \\ &= -\int p(x) \log\frac{q(x)}{p(x)} \,dx \end{align} \]

Measures the divergence between two distributions: \(p(x)\) and \(q(x)\). It is always non-negative, since: \[ D_{\mathrm{KL}}\left( p\|q \right) = -\int p(x) \log\frac{q(x)}{p(x)} \,dx \geq -\log \int p(x) \frac{q(x)}{p(x)} = 0 \]

KL-Divergence is Not Symmetric

There is a fact that \[ D_{\mathrm{KL}}\left( p\|q \right) \neq D_{\mathrm{KL}}\left( q\|p \right) \] To train model distribution \(q(x)\) according to data distribution \(p(x)\), \(D_{\mathrm{KL}}\left( p\|q \right)\) will deduce a spread-out \(q(x)\) as follows:

Figure 1: \(D_{\mathrm{KL}}\left( p\|q \right)\) will deduce a spread-out \(q(x)\)1

Conversely, the reverse KL \(D_{\mathrm{KL}}\left( q\|p \right)\) will deduce a \(q(x)\) concentrated at some modes of \(p(x)\):

Figure 2: \(D_{\mathrm{KL}}\left( p\|q \right)\) will deduce a \(q(x)\) concentrated at some modes of \(p(x)\)

Estimate KL-Divergence with Classifier


Define a classifier: \[ p(x\in p|x) = \frac{1}{1+\exp(-r(x))} \] then for the mixed training data distribution: \[ \tilde{p}(x) = 0.5 (p(x) + q(x)) \] the \(r^{\star}(x)\) for the optimal classifier \(p^\star(x\in p|x)\) is: \[ \begin{align} p^\star(x\in p|x) &= \frac{p(x)}{p(x)+q(x)} \\ r^\star(x) &= \log \frac{p(x)}{q(x)} \end{align} \] Thus we can use the learned \(r^{\star}(x)\) to estimate the KL-divergence: \[ D_{\mathrm{KL}}\left( p\|q \right) = \mathbb{E}_{p(x)}\left[ r^{\star}(x) \right] \]

Training objective

Maximizing \(\mathcal{L}\) can obtain \(r(x) = D_{\mathrm{KL}}\left( p\|q \right)\), where \(\mathcal{L}\) is formulated as: \[ \begin{align} \mathcal{L} &= \mathbb{E}_{\tilde{p}(x)}\left[ I(x\in p) \log p(x\in p|x) + (1 - I(x \in p)) \log (1 - p(x\in p|x)) \right] \\ &= \mathbb{E}_{p(x)}\left[ \log \frac{1}{1+\exp(-r(x))} \right] + \mathbb{E}_{q(x)}\left[ \log \frac{\exp(-r(x))}{1+\exp(-r(x))} \right] \\ &= -\mathbb{E}_{p(x)}\left[ \log \left( 1+\exp(-r(x)) \right) \right] - \mathbb{E}_{q(x)}\left[ \log \left( 1+\exp(r(x)) \right) \right] \end{align} \]

The Donsker-Varadhan representation

Donsker and Varadhan (1983) proposed a dual representation of the KL-Divergence: \[ \mathrm{D}_{KL}(p\|q) = \sup_{T:\Omega \to \mathbb{R}} \left\{ \mathbb{E}_{p}[T(\mathbf{x})] - \log\left( \mathbb{E}_{q}[e^{T(x,y)}] \right) \right\} \]

And also, the following inequilty holds for \(T \in \mathcal{F}\), a specific family of functions: \[ \mathrm{D}_{KL}(p\|q) \geq \sup_{T \in \mathcal{F}} \left\{ \mathbb{E}_{p}[T(\mathbf{x})] - \log\left( \mathbb{E}_{q}[e^{T(x,y)}] \right) \right\} \] where the inequilty is tight for \(T^*\) satisfying: \[ p(x) \,dx = \frac{1}{Z} e^{T^*(x,y)}\,q(x)\,dx, \quad \text{where } Z = \mathbb{E}_{q}\left[e^{T^*(x,y)}\right] \]


Donsker, Monroe D., and SR Srinivasa Varadhan. 1983. “Asymptotic Evaluation of Certain Markov Process Expectations for Large Time. IV.” Communications on Pure and Applied Mathematics 36 (2): 183–212.

  1. Figures from https://wiseodd.github.io/techblog/2016/12/21/forward-reverse-kl/.↩︎

Graph Auto-Encoder

Auto-Encoding the Feature Matrix Given Adjacency Matrix

The task is to learn the encoder \(f\) and the decoder \(g\) that: \[ \begin{align} \mathbf{Z} &= f(\mathbf{A},\mathbf{X}) \\ \tilde{\mathbf{X}} &= g(\mathbf{A},\mathbf{Z}) \end{align} \]

Graph U-Net

Figure 1: Graph U-Net (view pdf)
Figure 2: gUnpool (view pdf)

Gao and Ji (2019) proposed the gPool, and gUnpool as the building blocks of graph auto-encoder. It also mimics the hierarchical structure of U-Net (Ronneberger, Fischer, and Brox 2015), thus named as Graph U-Net.

  • gPool: is similar to that proposed by Cangea et al. (2018), see Graph Convolution Network.

  • gUnpool: the gUnpool layer that restores the input of a gUnpool layer can be formulated as; \[ \tilde{\mathbf{X}}^{(l)} = \mathop{\mathrm{distribute}}\left( \mathbf{X}^{(l+1)}, \mathbf{idx}^{(l)} \right) \] where \(\mathbf{X}^{(l+1)}\) is output of the \(l\)-th gPool layer, \(\mathbf{idx}^{(l)}\) is the top-\(k\) nodes of the \(l\)-th gPool layer. The function \(\text{distribute}(\cdot)\) assigns back each row of \(\mathbf{X}^ {(l+1)}\) to the \(\tilde{\mathbf{X}}^{(l)}\), according to the selection index \(\mathbf{idx}^{(l)}\). The shape of \(\tilde{\mathbf{X}}^{(l)}\) is the same as the input of the \(l\)-th gPool layer (i.e., \(\mathbf{X}^{(l)}\)). The un-assigned rows are zero vectors.


Cangea, Cătălina, Petar Veličković, Nikola Jovanović, Thomas Kipf, and Pietro Liò. 2018. “Towards Sparse Hierarchical Graph Classifiers.” arXiv Preprint arXiv:1811.01287.

Gao, Hongyang, and Shuiwang Ji. 2019. “Graph U-Nets.” arXiv:1905.05178 [Cs, Stat], May. http://arxiv.org/abs/1905.05178.

Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. 2015. “U-Net: Convolutional Networks for Biomedical Image Segmentation.” arXiv:1505.04597 [Cs], May. http://arxiv.org/abs/1505.04597.

Image Segmentation


For each pixel \(x_{ij}\) on an image, predict its segmentation class \(c_{ij}\).

Supervised Methods

Fully convolutional networks for semantic segmentation

Figure 1: The Architecture of "Fully convolutional networks for semantic segmentation" (view pdf)

Long, Shelhamer, and Darrell (2015) proposed to use deconvolutional layers to up-sample intermediate feature maps at different levels from a pre-trained convolutional neural network, in order to compose the pixel-wise classification output.


Figure 2: The Architecture of "U-Net" (view pdf)
  • Hierarchical deconvolution at different levels.
  • Weighted cross-entropy loss: to balance the loss of different classes, and to enforce the net to put emphasis on the cell boundaries. \[ \begin{align} \mathcal{L} &= \sum w(\mathbf{x}) \log p(\mathbf{x}) \\ w(\mathbf{x}) &= w_c(\mathbf{x}) + w_0 \cdot \exp \left( -\frac{d_1(\mathbf{x}) + d_2(\mathbf{x})^2}{2 \sigma^2} \right) \end{align} \] where \(w_c(\mathbf{x})\) is the normalizing weight for the class of the pixel, while \(w_0\) is a hyper-parameter. \(d_1\) and \(d_2\) is the distance from the background pixel \(\mathbf{x}\) to the closest and second closest cell.



Let \(n_{ij}\) be the number of pixels of class \(i\) being predicted to belong to class \(j\). Suppose there are \(k\) different classes, and \(t_i = \sum_{j} n_{ij}\) be the total number of pixels of class \(i\). Then we have the following metrics for image segmentation (Long, Shelhamer, and Darrell 2015):

  • Pixel accuracy \[ \text{Pixel Acc} = \frac{\sum_i n_{ii}}{\sum_i t_i} \]

  • Mean accuracy \[ \text{Mean Acc} = \frac{1}{k} \sum_i \frac{n_{ii}}{t_i} \]

  • Mean IoU (Intersection over Union): \[ \text{Mean IoU} = \frac{1}{k} \sum_{i} \frac{n_ii}{t_i + \sum_j n_{ji} - n_{ii}} \]

  • Weighted IoU \[ \text{Weighted IoU} = \frac{1}{\sum_j t_j} \cdot\frac{\sum_i t_i n_{ii}}{t_i + \sum_j n_{ji} -n_{ii}} \]


Long, Jonathan, Evan Shelhamer, and Trevor Darrell. 2015. “Fully Convolutional Networks for Semantic Segmentation.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3431–40.