KL-Divergence

Definition

\begin{align} D_{\mathrm{KL}}\left( p\|q \right) &= \int p(x) \log\frac{p(x)}{q(x)} \,dx \\ &= -\int p(x) \log\frac{q(x)}{p(x)} \,dx \end{align}

Measures the divergence between two distributions $$p(x)$$ and $$q(x)$$. It is always non-negative, by Jensen's inequality (since $$-\log$$ is convex): $D_{\mathrm{KL}}\left( p\|q \right) = -\int p(x) \log\frac{q(x)}{p(x)} \,dx \geq -\log \int p(x) \frac{q(x)}{p(x)} \,dx = -\log 1 = 0$

KL-Divergence is Not Symmetric

The KL-divergence is not symmetric: in general, $D_{\mathrm{KL}}\left( p\|q \right) \neq D_{\mathrm{KL}}\left( q\|p \right)$. When training a model distribution $$q(x)$$ to match a data distribution $$p(x)$$, minimizing the forward KL $$D_{\mathrm{KL}}\left( p\|q \right)$$ induces a spread-out $$q(x)$$ that covers all the modes of $$p(x)$$.

Conversely, minimizing the reverse KL $$D_{\mathrm{KL}}\left( q\|p \right)$$ induces a $$q(x)$$ concentrated at a few modes of $$p(x)$$.

Estimate KL-Divergence with Classifier

Theory

Define a classifier: $p(x\in p|x) = \frac{1}{1+\exp(-r(x))}$ then for the mixed training data distribution: $\tilde{p}(x) = 0.5 (p(x) + q(x))$ the $$r^{\star}(x)$$ for the optimal classifier $$p^\star(x\in p|x)$$ is: \begin{align} p^\star(x\in p|x) &= \frac{p(x)}{p(x)+q(x)} \\ r^\star(x) &= \log \frac{p(x)}{q(x)} \end{align} Thus we can use the learned $$r^{\star}(x)$$ to estimate the KL-divergence: $D_{\mathrm{KL}}\left( p\|q \right) = \mathbb{E}_{p(x)}\left[ r^{\star}(x) \right]$

Training objective

Maximizing $$\mathcal{L}$$ yields the optimal $$r^{\star}(x) = \log\frac{p(x)}{q(x)}$$, where $$\mathcal{L}$$ is formulated as: \begin{align} \mathcal{L} &= \mathbb{E}_{\tilde{p}(x)}\left[ I(x\in p) \log p(x\in p|x) + (1 - I(x \in p)) \log (1 - p(x\in p|x)) \right] \\ &= \mathbb{E}_{p(x)}\left[ \log \frac{1}{1+\exp(-r(x))} \right] + \mathbb{E}_{q(x)}\left[ \log \frac{\exp(-r(x))}{1+\exp(-r(x))} \right] \\ &= -\mathbb{E}_{p(x)}\left[ \log \left( 1+\exp(-r(x)) \right) \right] - \mathbb{E}_{q(x)}\left[ \log \left( 1+\exp(r(x)) \right) \right] \end{align}
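The classifier-based estimator above can be sketched numerically. The following is a minimal NumPy example (not from the original text): it fits a logistic classifier $$r(x) = ax + b$$ by full-batch gradient ascent on $$\mathcal{L}$$, then estimates $$D_{\mathrm{KL}}\left( p\|q \right) = \mathbb{E}_{p}[r^{\star}(x)]$$ for two unit-variance Gaussians, whose true KL is $$0.5$$. The linear form of $$r$$ is an assumption that happens to be sufficient for equal-variance Gaussians.

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from p = N(0, 1) and q = N(1, 1); the analytic KL(p||q) is 0.5.
n = 20000
xp = rng.normal(0.0, 1.0, n)
xq = rng.normal(1.0, 1.0, n)

# Logistic classifier r(x) = a*x + b, trained by full-batch gradient ascent
# on the log-likelihood L (labels: 1 for p-samples, 0 for q-samples).
x = np.concatenate([xp, xq])
y = np.concatenate([np.ones(n), np.zeros(n)])
a, b = 0.0, 0.0
lr = 0.1
for _ in range(2000):
    r = a * x + b
    s = 1.0 / (1.0 + np.exp(-r))   # p(x in p | x) = sigmoid(r)
    grad = y - s                   # dL/dr for the Bernoulli log-likelihood
    a += lr * np.mean(grad * x)
    b += lr * np.mean(grad)

# KL estimate: E_p[r*(x)], the mean logit over samples from p.
kl_est = np.mean(a * xp + b)
print(kl_est)  # close to the true value 0.5
```

For these two Gaussians the optimal logit is $$r^{\star}(x) = 0.5 - x$$, so the learned $$(a, b)$$ should approach $$(-1, 0.5)$$.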

Donsker and Varadhan (1983) proposed a dual representation of the KL-Divergence: $\mathrm{D}_{KL}(p\|q) = \sup_{T:\Omega \to \mathbb{R}} \left\{ \mathbb{E}_{p}[T(\mathbf{x})] - \log\left( \mathbb{E}_{q}[e^{T(\mathbf{x})}] \right) \right\}$

Restricting $$T$$ to a specific family of functions $$\mathcal{F}$$ gives a lower bound: $\mathrm{D}_{KL}(p\|q) \geq \sup_{T \in \mathcal{F}} \left\{ \mathbb{E}_{p}[T(\mathbf{x})] - \log\left( \mathbb{E}_{q}[e^{T(\mathbf{x})}] \right) \right\}$ where the inequality is tight for $$T^*$$ satisfying: $p(x) \,dx = \frac{1}{Z} e^{T^*(x)}\,q(x)\,dx, \quad \text{where } Z = \mathbb{E}_{q}\left[e^{T^*(x)}\right]$
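The dual bound can be checked with Monte-Carlo estimates. A small sketch (my own illustration, not from the original text), again using $$p = N(0,1)$$ and $$q = N(1,1)$$: the optimal critic is $$T^*(x) = \log\frac{p(x)}{q(x)} = 0.5 - x$$, at which the bound equals the true KL of $$0.5$$, while any other critic gives a smaller value.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100000
xp = rng.normal(0.0, 1.0, n)   # samples from p = N(0, 1)
xq = rng.normal(1.0, 1.0, n)   # samples from q = N(1, 1)

def dv_bound(T):
    """Monte-Carlo estimate of E_p[T(x)] - log E_q[exp(T(x))]."""
    return np.mean(T(xp)) - np.log(np.mean(np.exp(T(xq))))

# Optimal critic T*(x) = log p(x)/q(x) = 0.5 - x: bound equals KL = 0.5.
kl_opt = dv_bound(lambda x: 0.5 - x)
# A suboptimal (scaled-down) critic gives a strictly smaller value.
kl_sub = dv_bound(lambda x: 0.5 * (0.5 - x))
print(kl_opt, kl_sub)
```

This kind of parameterized critic, trained by gradient ascent on the bound, is the idea behind neural KL/mutual-information estimators.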

References

Donsker, Monroe D., and SR Srinivasa Varadhan. 1983. “Asymptotic Evaluation of Certain Markov Process Expectations for Large Time. IV.” Communications on Pure and Applied Mathematics 36 (2): 183–212.

Graph Auto-Encoder

Auto-Encoding the Feature Matrix Given Adjacency Matrix

The task is to learn the encoder $$f$$ and the decoder $$g$$ that: \begin{align} \mathbf{Z} &= f(\mathbf{A},\mathbf{X}) \\ \tilde{\mathbf{X}} &= g(\mathbf{A},\mathbf{Z}) \end{align}
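The encoder/decoder pair above can be instantiated with one-layer graph convolutions. A minimal NumPy sketch, assuming GCN-style layers with symmetric adjacency normalization; the weight matrices `W_e` and `W_d` are hypothetical placeholders, not part of the original formulation:

```python
import numpy as np

def normalize_adj(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} with self-loops."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def encode(A_norm, X, W_e):
    """Z = f(A, X): one GCN layer with ReLU."""
    return np.maximum(A_norm @ X @ W_e, 0.0)

def decode(A_norm, Z, W_d):
    """X~ = g(A, Z): one linear GCN layer back to feature space."""
    return A_norm @ Z @ W_d

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # path graph
X = rng.normal(size=(3, 4))           # 3 nodes, 4 features
W_e = rng.normal(size=(4, 2))         # encoder weights (hypothetical)
W_d = rng.normal(size=(2, 4))         # decoder weights (hypothetical)
A_norm = normalize_adj(A)
X_rec = decode(A_norm, encode(A_norm, X, W_e), W_d)
print(X_rec.shape)  # same shape as X: (3, 4)
```

Training would minimize a reconstruction loss such as $$\|\mathbf{X} - \tilde{\mathbf{X}}\|_F^2$$ over the weights.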

Graph U-Net

Gao and Ji (2019) proposed the gPool and gUnpool layers as the building blocks of a graph auto-encoder. The architecture mimics the hierarchical structure of U-Net (Ronneberger, Fischer, and Brox 2015), hence the name Graph U-Net.

• gPool: similar to the pooling layer proposed by Cangea et al. (2018); see Graph Convolution Network.

• gUnpool: the gUnpool layer restores the input of the corresponding gPool layer, and can be formulated as: $\tilde{\mathbf{X}}^{(l)} = \mathop{\mathrm{distribute}}\left( \mathbf{X}^{(l+1)}, \mathbf{idx}^{(l)} \right)$ where $$\mathbf{X}^{(l+1)}$$ is the output of the $$l$$-th gPool layer, and $$\mathbf{idx}^{(l)}$$ holds the indices of the top-$$k$$ nodes selected by the $$l$$-th gPool layer. The function $$\text{distribute}(\cdot)$$ assigns each row of $$\mathbf{X}^{(l+1)}$$ back into $$\tilde{\mathbf{X}}^{(l)}$$ according to the selection index $$\mathbf{idx}^{(l)}$$; the un-assigned rows are zero vectors. The shape of $$\tilde{\mathbf{X}}^{(l)}$$ is the same as the input of the $$l$$-th gPool layer (i.e., $$\mathbf{X}^{(l)}$$).
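The $$\text{distribute}(\cdot)$$ operation is just a scatter by index. A minimal NumPy sketch (the function name and example sizes are my own, not from the paper):

```python
import numpy as np

def distribute(x_pooled, idx, n_nodes):
    """Scatter the k pooled node features back to their original row
    positions; rows not selected by gPool remain zero vectors."""
    out = np.zeros((n_nodes, x_pooled.shape[1]))
    out[idx] = x_pooled
    return out

# Hypothetical example: 5 nodes, gPool kept nodes 0, 2, and 4 (k = 3).
x_pooled = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
idx = np.array([0, 2, 4])
restored = distribute(x_pooled, idx, n_nodes=5)
print(restored)
```

Rows 1 and 3 of `restored` are zeros, matching the "un-assigned rows are zero vectors" rule.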

References

Cangea, Cătălina, Petar Veličković, Nikola Jovanović, Thomas Kipf, and Pietro Liò. 2018. “Towards Sparse Hierarchical Graph Classifiers.” arXiv Preprint arXiv:1811.01287.

Gao, Hongyang, and Shuiwang Ji. 2019. “Graph U-Nets.” arXiv:1905.05178 [Cs, Stat], May. http://arxiv.org/abs/1905.05178.

Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. 2015. “U-Net: Convolutional Networks for Biomedical Image Segmentation.” arXiv:1505.04597 [Cs], May. http://arxiv.org/abs/1505.04597.

Image Segmentation

Overview

For each pixel $$x_{ij}$$ on an image, predict its segmentation class $$c_{ij}$$.

Supervised Methods

Fully convolutional networks for semantic segmentation

Long, Shelhamer, and Darrell (2015) proposed to use deconvolutional layers to up-sample intermediate feature maps at different levels from a pre-trained convolutional neural network, in order to compose the pixel-wise classification output.

U-Net

• Hierarchical deconvolution at different levels.
• Weighted cross-entropy loss: to balance the loss across classes, and to force the network to put emphasis on the cell boundaries. \begin{align} \mathcal{L} &= -\sum_{\mathbf{x}} w(\mathbf{x}) \log p_{\ell(\mathbf{x})}(\mathbf{x}) \\ w(\mathbf{x}) &= w_c(\mathbf{x}) + w_0 \cdot \exp \left( -\frac{\left( d_1(\mathbf{x}) + d_2(\mathbf{x}) \right)^2}{2 \sigma^2} \right) \end{align} where $$p_{\ell(\mathbf{x})}(\mathbf{x})$$ is the predicted probability of the true class $$\ell(\mathbf{x})$$, $$w_c(\mathbf{x})$$ is the class-balancing weight for the class of the pixel, and $$w_0$$ is a hyper-parameter. $$d_1(\mathbf{x})$$ and $$d_2(\mathbf{x})$$ are the distances from the background pixel $$\mathbf{x}$$ to the border of the nearest and the second nearest cell.
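The weight map $$w(\mathbf{x})$$ is straightforward to compute once the distance maps are available. A minimal NumPy sketch, assuming $$d_1$$ and $$d_2$$ have already been obtained (e.g., via a Euclidean distance transform per cell instance); the function name and default hyper-parameters follow the paper's reported choices but are otherwise my own:

```python
import numpy as np

def unet_weight_map(d1, d2, w_c, w0=10.0, sigma=5.0):
    """Pixel-wise weight map: class-balancing term w_c plus the border
    emphasis term w0 * exp(-(d1 + d2)^2 / (2 sigma^2)).
    d1/d2: distances to the border of the nearest / second nearest cell."""
    return w_c + w0 * np.exp(-((d1 + d2) ** 2) / (2.0 * sigma ** 2))

# Toy 2x2 distance maps: top-left pixel lies on a boundary (d1 = 0).
d1 = np.array([[0.0, 2.0], [5.0, 1.0]])
d2 = np.array([[1.0, 3.0], [6.0, 2.0]])
w = unet_weight_map(d1, d2, w_c=1.0)
print(w)  # largest weight at the boundary pixel
```

Pixels squeezed between two nearby cells (small $$d_1 + d_2$$) receive the largest weights, which is exactly the separation-border emphasis the loss is designed for.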

Evaluation

Metrics

Let $$n_{ij}$$ be the number of pixels of class $$i$$ predicted to belong to class $$j$$. Suppose there are $$k$$ different classes, and let $$t_i = \sum_{j} n_{ij}$$ be the total number of pixels of class $$i$$. Then we have the following metrics for image segmentation (Long, Shelhamer, and Darrell 2015):

• Pixel accuracy $\text{Pixel Acc} = \frac{\sum_i n_{ii}}{\sum_i t_i}$

• Mean accuracy $\text{Mean Acc} = \frac{1}{k} \sum_i \frac{n_{ii}}{t_i}$

• Mean IoU (Intersection over Union): $\text{Mean IoU} = \frac{1}{k} \sum_{i} \frac{n_{ii}}{t_i + \sum_j n_{ji} - n_{ii}}$

• Weighted IoU $\text{Weighted IoU} = \frac{1}{\sum_j t_j} \sum_i \frac{t_i \, n_{ii}}{t_i + \sum_j n_{ji} - n_{ii}}$
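All four metrics can be read off a single $$k \times k$$ confusion matrix. A minimal NumPy sketch (the function name and toy matrix are my own illustration):

```python
import numpy as np

def segmentation_metrics(n):
    """Metrics from a k x k confusion matrix n, where n[i, j] counts
    pixels of class i predicted as class j."""
    t = n.sum(axis=1)                  # t_i: total pixels of class i
    tp = np.diag(n)                    # n_ii: correctly predicted pixels
    union = t + n.sum(axis=0) - tp     # t_i + sum_j n_ji - n_ii
    pixel_acc = tp.sum() / t.sum()
    mean_acc = np.mean(tp / t)
    iou = tp / union
    mean_iou = np.mean(iou)
    weighted_iou = np.sum(t * iou) / t.sum()
    return pixel_acc, mean_acc, mean_iou, weighted_iou

# Toy 2-class confusion matrix: 10 pixels per class.
n = np.array([[8, 2],
              [1, 9]])
pa, ma, mi, wi = segmentation_metrics(n)
print(pa, ma, mi, wi)
```

Here pixel accuracy and mean accuracy both come out to 0.85; the IoU-based metrics are lower because the union in the denominator also counts false positives.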

References

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. 2015. “Fully Convolutional Networks for Semantic Segmentation.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3431–40.