Mutual Information

Definition

\[ \begin{align} I(X;Y) &= D_{\mathrm{KL}}\left( p(x,y) \,\|\, p(x)\,p(y) \right) \\ &= \iint p(x,y)\, \log \frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy \end{align} \]
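As a quick sanity check (an illustration added here, not part of the original derivation), the discrete analogue of this definition can be evaluated directly on a toy \(2 \times 2\) joint distribution:

```python
import numpy as np

# Toy joint distribution p(x, y): rows index x, columns index y.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1, keepdims=True)  # marginal p(x), shape (2, 1)
p_y = p_xy.sum(axis=0, keepdims=True)  # marginal p(y), shape (1, 2)

# I(X;Y) = KL(p(x,y) || p(x) p(y)); a sum replaces the integral for discrete variables.
mi = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))
print(mi)  # about 0.193 nats; it would be 0 if X and Y were independent
```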

Invariant Under Reparameterization

If \(f\) and \(g\) are homeomorphisms, then \[ I(X;Y) = I(f(X);g(Y)) \] that is, mutual information is invariant under reparameterization (Brakel and Bengio 2017).
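A one-line sketch of why this holds when \(f\) and \(g\) are smooth and invertible and densities exist (writing \(u = f(x)\), \(v = g(y)\)): the Jacobian factors introduced by the change of variables cancel in the density ratio, \[ \frac{p_{U,V}(u,v)}{p_U(u)\,p_V(v)} = \frac{p_{X,Y}(x,y)\,|J_{f^{-1}}(u)|\,|J_{g^{-1}}(v)|}{p_X(x)\,|J_{f^{-1}}(u)|\; p_Y(y)\,|J_{g^{-1}}(v)|} = \frac{p_{X,Y}(x,y)}{p_X(x)\,p_Y(y)}, \] and since \(p_{U,V}(u,v)\,\mathrm{d}u\,\mathrm{d}v = p_{X,Y}(x,y)\,\mathrm{d}x\,\mathrm{d}y\), the integrals defining \(I(f(X);g(Y))\) and \(I(X;Y)\) coincide.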

Estimators

InfoNCE

(to be completed)

MINE

Belghazi et al. (2018) proposed a mutual information estimator based on the Donsker-Varadhan dual representation (Donsker and Varadhan 1983). \[ \begin{align} I(X;Z) &\geq \sup_{T_{\psi}} \left\{ \mathbb{E}_{p(x,z)}\left[T_{\psi}(x,z)\right] - \log \left( \mathbb{E}_{p(x)\,p(z)}\left[e^{T_{\psi}(x,z)}\right] \right) \right\} \\ &\approx \sup_{T_{\psi}} \left\{ \frac{1}{b} \sum_{i=1}^b T_{\psi}(x^{(i)},z^{(i)}) - \log \left( \frac{1}{b} \sum_{i=1}^b e^{T_{\psi}(x^{(i)},\tilde{z}^{(i)})} \right) \right\} \end{align} \] where \(x^{(i)}, z^{(i)} \sim p(x,z)\), \(\tilde{z}^{(i)} \sim p(z)\), and \(b\) is the batch size.
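A minimal numpy sketch of the batch estimate above, with `T` standing in for the learned statistics network \(T_{\psi}\) (the toy data and the hand-picked critic are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def dv_lower_bound(T, x, z, rng):
    """Batch estimate of the Donsker-Varadhan lower bound.

    x, z: arrays of shape (b, d) drawn jointly from p(x, z).
    T:    callable T(x, z) -> array of shape (b,), the critic T_psi.
    """
    z_marg = z[rng.permutation(len(z))]              # shuffling z approximates p(x) p(z)
    joint_term = T(x, z).mean()                      # (1/b) sum_i T(x_i, z_i)
    marg_term = np.log(np.exp(T(x, z_marg)).mean())  # log of (1/b) sum_i e^{T(x_i, z~_i)}
    return joint_term - marg_term

# Toy usage: a correlated Gaussian pair and a hand-picked (not learned) critic.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 1))
z = x + 0.5 * rng.normal(size=(1000, 1))
T = lambda x, z: 0.5 * (x * z).sum(axis=1)  # stand-in for a trained network
print(dv_lower_bound(T, x, z, rng))         # a (loose) lower bound on I(X;Z)
```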

The gradient of the MINE estimator may have a very large scale, in which case adaptive gradient clipping can be applied. Belghazi et al. (2018) also proposed to correct the bias of the naive minibatch gradient by using a moving average of the partition term in the denominator of the gradient, as sketched below.
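A hedged PyTorch-style sketch of one MINE update with this moving-average correction (the network architecture, the decay value, and the surrogate-loss trick of dividing by a detached moving average are illustrative implementation choices, not verbatim from the paper):

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Statistics network T_psi(x, z); the architecture here is a placeholder."""
    def __init__(self, x_dim, z_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=1)).squeeze(1)

def mine_step(critic, optimizer, x, z, ema, decay=0.99):
    """One gradient step on the DV bound with a moving-average bias correction."""
    z_marg = z[torch.randperm(z.size(0))]         # approximate samples from p(x) p(z)
    t_joint = critic(x, z).mean()                 # E_{p(x,z)}[T]
    e_marg = torch.exp(critic(x, z_marg)).mean()  # E_{p(x)p(z)}[e^T]

    ema = decay * ema + (1.0 - decay) * e_marg.detach()
    mi_lb = t_joint - torch.log(e_marg)           # value of the bound (for logging)
    # Surrogate whose gradient uses the moving average in the denominator,
    # instead of the (biased) per-batch estimate of E[e^T].
    loss = -(t_joint - e_marg / ema)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return mi_lb.item(), ema
```

Here `ema` would be initialized to something like `torch.tensor(1.0)` and carried across steps.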

Deep InfoMax

Hjelm et al. (2018) proposed a mutual information estimator based on the Jensen-Shannon divergence (JSD), via the f-GAN formulation: \[ \mathcal{I}^{(JSD)}(X;Z) = \mathbb{E}_{p(x,z)}\left[ -\text{sp}(-T_{\psi}(x,z)) \right] - \mathbb{E}_{p(x)\,p(z)}\left[ \text{sp}(T_{\psi}(x,z)) \right] \] where \(\text{sp}(a) = \log(1 + e^a)\) is the softplus function.

For an encoder \(z = E_{\theta}(x)\) with \(p(x) = p_d(x)\), it can also be formulated as: \[ \begin{align} \mathcal{I}^{(JSD)}(X;Z) &= \mathbb{E}_{\mathbb{P}}\left[ -\text{sp}(-T_{\psi}(x,E_{\theta}(x))) \right] - \mathbb{E}_{\mathbb{P}\times\tilde{\mathbb{P}}}\left[ \text{sp}(T_{\psi}(\tilde{x},E_{\theta}(x))) \right] \\ &\approx \frac{1}{b} \sum_{i=1}^b \left[ -\text{sp}\left(-T_{\psi}(x^{(i)},E_{\theta}(x^{(i)}))\right) \right] - \frac{1}{b} \sum_{i=1}^b \left[ \text{sp}\left({T_{\psi}(\tilde{x}^{(i)},E_{\theta}(x^{(i)}))}\right) \right] \end{align} \] where \(\mathbb{P} = \tilde{\mathbb{P}} = p(x)\); \(x^{(i)}\) and \(\tilde{x}^{(i)}\) are samples from \(\mathbb{P}\) and \(\tilde{\mathbb{P}}\), respectively, and \(b\) is the batch size.
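A small numpy sketch of this batch estimate, with `T` and `encode` as placeholders for the critic \(T_{\psi}\) and the encoder \(E_{\theta}\) (illustrative names, not from the paper's code):

```python
import numpy as np

def softplus(a):
    # Numerically stable sp(a) = log(1 + e^a).
    return np.logaddexp(0.0, a)

def jsd_mi_estimate(T, encode, x, x_tilde):
    """Batch estimate of the JSD-based objective above.

    x, x_tilde: two independent batches drawn from p(x), shape (b, d).
    encode:     callable for the encoder E_theta, z = encode(x).
    T:          callable T(x, z) -> array of shape (b,), the critic T_psi.
    """
    z = encode(x)
    pos = -softplus(-T(x, z))       # positive pairs (x, E_theta(x))
    neg = softplus(T(x_tilde, z))   # negative pairs (x_tilde, E_theta(x))
    return pos.mean() - neg.mean()
```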

Note: Deep InfoMax contains much more than is summarized here; see the original paper for the architecture, the global/local mutual information maximization objectives, and the prior matching technique.

Mutual Information as a Regularizer

Avoid Mode Collapse in GAN

Maximizing the entropy \(H(X) = H(G(Z))\) of a GAN generator can help avoid mode collapse (Chen et al. 2016; Belghazi et al. 2018; Kumar et al. 2019).

Since \(X=G(Z)\) is a deterministic mapping, we have \(H(G(Z)|Z) \equiv 0\), and therefore: \[ H(G(Z)) = I(G(Z);Z) + H(G(Z)|Z) = I(G(Z);Z) \]

Bound the Reconstruction Error

Given the encoder \(q(\mathbf{z}|\mathbf{x})\), the decoder \(p(\mathbf{x}|\mathbf{z})\), and the prior distribution \(p(\mathbf{z})\), and assuming \(q(\mathbf{x}) = p(\mathbf{x}) = p_d(\mathbf{x})\), the reconstruction error \(\mathcal{R}\), defined as: \[ \mathcal{R} = \mathbb{E}_{p_d(\mathbf{x})} \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[ -\log p(\mathbf{x}|\mathbf{z}) \right] \] can be decomposed as: \[ \begin{align} \mathcal{R} &= \mathbb{E}_{q(\mathbf{x},\mathbf{z})}\left[ -\log p(\mathbf{x}|\mathbf{z}) \right] \\ &= \mathbb{E}_{q(\mathbf{x},\mathbf{z})}\left[ -\log p(\mathbf{x},\mathbf{z}) + \log p(\mathbf{z}) \right] \\ &= \mathbb{E}_{q(\mathbf{x},\mathbf{z})}\left[ \log \frac{q(\mathbf{x},\mathbf{z})}{p(\mathbf{x},\mathbf{z})} - \log q(\mathbf{x},\mathbf{z}) + \log p(\mathbf{z}) \right] \\ &= D_{\mathrm{KL}}(q(\mathbf{x},\mathbf{z})\,\|\,p(\mathbf{x},\mathbf{z})) + H_{q}(X,Z) + \mathbb{E}_{q(\mathbf{x},\mathbf{z})}\left[ \log p(\mathbf{z}) \right] \end{align} \] where \[ \begin{align} \mathbb{E}_{q(\mathbf{x},\mathbf{z})}\left[ \log p(\mathbf{z}) \right] &= \iint q(\mathbf{x},\mathbf{z}) \log p(\mathbf{z}) \,\mathrm{d}\mathbf{z}\,\mathrm{d}\mathbf{x} \\ &= \int \left( \int q(\mathbf{x},\mathbf{z}) \,\mathrm{d}\mathbf{x} \right)\log p(\mathbf{z}) \,\mathrm{d}\mathbf{z} \\ &= \mathbb{E}_{q(\mathbf{z})}\left[ \log p(\mathbf{z}) \right] \\ &= \mathbb{E}_{q(\mathbf{z})}\left[ -\log \frac{q(\mathbf{z})}{p(\mathbf{z})} + \log q(\mathbf{z}) \right] \\ &= -D_{\mathrm{KL}}(q(\mathbf{z})\,\|\,p(\mathbf{z})) - H_{q}(Z) \end{align} \] and since \[ H_q(X,Z) - H_q(Z) = H_q(X|Z) = H_q(X) - I_q(X;Z) \] we have (Belghazi et al. 2018): \[ \begin{align} \mathcal{R} &= D_{\mathrm{KL}}(q(\mathbf{x},\mathbf{z})\,\|\,p(\mathbf{x},\mathbf{z})) + H_{q}(X,Z) - D_{\mathrm{KL}}(q(\mathbf{z})\,\|\,p(\mathbf{z})) - H_{q}(Z) \\ &= D_{\mathrm{KL}}(q(\mathbf{x},\mathbf{z})\,\|\,p(\mathbf{x},\mathbf{z})) - D_{\mathrm{KL}}(q(\mathbf{z})\,\|\,p(\mathbf{z})) + H_q(X) - I_q(X;Z) \end{align} \]

As long as \(q(\mathbf{z})\) is trained to match \(p(\mathbf{z})\), maximizing \(I_q(X;Z)\) can be an effective way to drive down the reconstruction error.

Information Bottleneck

For a latent variable model \(X \to Z \to Y\), mutual information can serve as a regularizer that limits the amount of information passed through \(Z\), so that only the most relevant information reaches the output variable \(Y\).

Minimizing the Information Bottleneck Lagrangian with respect to the encoder \(q(z|x)\) enforces this constraint: \[ \mathcal{L}(q(z|x)) = H(Y|Z) + \beta I(X;Z) \]
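In practice the two terms are usually replaced by variational surrogates: the cross-entropy of a decoder upper-bounds \(H(Y|Z)\) (up to a constant), and \(\mathbb{E}_{x}\left[D_{\mathrm{KL}}(q(z|x)\,\|\,r(z))\right]\) for any prior \(r(z)\) upper-bounds \(I(X;Z)\). A hedged PyTorch-style sketch assuming a Gaussian encoder and a standard normal prior (these modelling choices are assumptions, not specified above):

```python
import torch
import torch.nn.functional as F

def ib_lagrangian(mu, logvar, logits, y, beta):
    """Variational surrogate for H(Y|Z) + beta * I(X;Z).

    mu, logvar: parameters of the Gaussian encoder q(z|x), shape (b, z_dim).
    logits:     decoder outputs for y given a sample z ~ q(z|x), shape (b, num_classes).
    y:          integer class labels, shape (b,).
    """
    ce = F.cross_entropy(logits, y)  # upper-bounds H(Y|Z) up to a constant
    # Closed-form KL(q(z|x) || N(0, I)), averaged over the batch.
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=1).mean()
    return ce + beta * kl
```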

References

Belghazi, Mohamed Ishmael, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R. Devon Hjelm. 2018. “MINE: Mutual Information Neural Estimation.” arXiv Preprint arXiv:1801.04062.

Brakel, Philemon, and Yoshua Bengio. 2017. “Learning Independent Features with Adversarial Nets for Non-Linear ICA.” arXiv Preprint arXiv:1710.05050.

Chen, Xi, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. “InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 2172–80. Curran Associates, Inc.

Donsker, Monroe D., and S. R. Srinivasa Varadhan. 1983. “Asymptotic Evaluation of Certain Markov Process Expectations for Large Time. IV.” Communications on Pure and Applied Mathematics 36 (2): 183–212.

Hjelm, R. Devon, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. 2018. “Learning Deep Representations by Mutual Information Estimation and Maximization.” arXiv Preprint arXiv:1808.06670.

Kumar, Rithesh, Anirudh Goyal, Aaron Courville, and Yoshua Bengio. 2019. “Maximum Entropy Generators for Energy-Based Models.” arXiv:1901.08508 [Cs, Stat], January. http://arxiv.org/abs/1901.08508.