Gradient Tricks

Clipping

Adaptive Clipping

To keep the optimizer from letting one loss component (for example, an adversarial loss) dominate the others, adaptive clipping can be adopted (Belghazi et al. 2018) to match the gradient scales of the different losses.

For example, Belghazi et al. (2018) rescale \(g_m\) to match the scale of \(g_u\), where \(g_u\) is the gradient of the main loss and \(g_m\) is the gradient of the mutual information regularizer: \[ \tilde{g}_m = \min\left( \|g_u\|, \|g_m\| \right) \frac{g_m}{\|g_m\|} \]
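
A minimal NumPy sketch of this rule, assuming `g_u` and `g_m` are the two gradients flattened into vectors (the function name is only for illustration):

```python
import numpy as np

def adaptive_clip(g_m, g_u):
    """Rescale the regularizer gradient g_m so that its norm never
    exceeds the norm of the main-loss gradient g_u."""
    norm_u = np.linalg.norm(g_u)
    norm_m = np.linalg.norm(g_m)
    if norm_m == 0.0:
        return g_m
    return min(norm_u, norm_m) * g_m / norm_m
```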

References

Belghazi, Mohamed Ishmael, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R. Devon Hjelm. 2018. “MINE: Mutual Information Neural Estimation.” arXiv Preprint arXiv:1801.04062.

f-GAN

Proposed by Nowozin, Cseke, and Tomioka (2016), f-GAN generalizes the Jensen-Shannon divergence implicitly minimized by the original GAN framework (Goodfellow et al. 2014) to the broader family of f-divergences.

Theoretical Framework

f-Divergence

\[ D_{f}(P \| Q) = \int q(\mathbf{x}) \,f\left( \frac{p(\mathbf{x})}{q(\mathbf{x})} \right)\,\mathrm{d}\mathbf{x} \]

where \(f: \mathbb{R}^+ \to \mathbb{R}\) is a convex, lower-semicontinuous function satisfying \(f(1)=0\).

Fenchel conjugate

The Fenchel conjugate of \(f(u)\) is: \[ f^*(t) = \sup_{u \in \text{dom}_f}\left\{ ut - f(u) \right\} \] Since \(f\) is convex and lower-semicontinuous, \(f^{**} = f\), and therefore: \[ f(u) = \sup_{t \in \text{dom}_{f^*}} \left\{ tu - f^*(t) \right\} \] Substituting this into the f-divergence gives: \[ D_{f}(P \| Q) = \int q(\mathbf{x}) \, \sup_{t \in \text{dom}_{f^*}} \left\{ t\,\frac{p(\mathbf{x})}{q(\mathbf{x})} - f^*(t) \right\} \,\mathrm{d}\mathbf{x} \]
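
As a quick check, take the KL generator \(f(u) = u \log u\) (also listed in the table below): the supremum over \(u\) is attained where \(t - \log u - 1 = 0\), i.e. at \(u = e^{t-1}\), so \[ f^*(t) = \sup_{u > 0}\left\{ ut - u\log u \right\} = t\,e^{t-1} - e^{t-1}(t-1) = \exp(t-1) \] which matches the conjugate used for the KL divergence below.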

Variational Estimation of f-Divergence

Using a function \(T: \mathcal{X} \to \mathbb{R}\) from a family \(\mathcal{T}\), we can obtain the following lower bound (Nowozin, Cseke, and Tomioka 2016): \[ \begin{align} D_{f}(P \| Q) &= \int q(\mathbf{x}) \, \sup_{t \in \text{dom}_{f^*}} \left\{ t\,\frac{p(\mathbf{x})}{q(\mathbf{x})} - f^*(t) \right\} \,\mathrm{d}\mathbf{x} \\ &\geq \sup_{T \in \mathcal{T}} \left\{ \int q(\mathbf{x}) \left( T(\mathbf{x})\,\frac{p(\mathbf{x})}{q(\mathbf{x})} - f^*(T(\mathbf{x})) \right) \,\mathrm{d}\mathbf{x} \right\} \\ &= \sup_{T \in \mathcal{T}} \left\{ \int p(\mathbf{x})\,T(\mathbf{x}) \,\mathrm{d}\mathbf{x} - \int q(\mathbf{x}) \,f^*(T(\mathbf{x})) \,\mathrm{d}\mathbf{x} \right\} \\ &= \sup_{T \in \mathcal{T}} \left\{ \mathbb{E}_{p(\mathbf{x})}\left[ T(\mathbf{x}) \right] - \mathbb{E}_{q(\mathbf{x})} \left[ f^*(T(\mathbf{x})) \right] \right\} \end{align} \] It is a lower bound, because the family \(\mathcal{T}\) might not cover all possible functions. In fact, the bound is tight for \[ T^*(\mathbf{x}) = f'\left(\frac{p(\mathbf{x})}{q(\mathbf{x})}\right) \] where \(f'\) is the first derivative of \(f\).
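
Indeed, plugging \(T^*\) back in and using \(f^*(f'(u)) = u\,f'(u) - f(u)\) (the conjugate evaluated at the point where its supremum is attained), the integrand becomes \[ \frac{p(\mathbf{x})}{q(\mathbf{x})}\,f'\!\left(\frac{p(\mathbf{x})}{q(\mathbf{x})}\right) - f^*\!\left(f'\!\left(\frac{p(\mathbf{x})}{q(\mathbf{x})}\right)\right) = f\!\left(\frac{p(\mathbf{x})}{q(\mathbf{x})}\right) \] so the bound recovers \(D_{f}(P \| Q)\) exactly.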

Variational Divergence Minimization

Using the variational lower bound of the f-divergence \(D_{f}(P \| Q)\), we can obtain the objective \(\mathcal{F}(\theta,\omega)\) for the generator \(q_{\theta}(\mathbf{x})\), which should be minimized w.r.t. \(\theta\) and maximized w.r.t. \(\omega\):

\[ \mathcal{F}(\theta,\omega) = \mathbb{E}_{p_d(\mathbf{x})}\left[ T_{\omega}(\mathbf{x}) \right] - \mathbb{E}_{q_{\theta}(\mathbf{x})}\left[ f^*(T_{\omega}(\mathbf{x})) \right] \]

Note: \(\mathcal{F}(\theta,\omega)\) is only a lower bound of \(D_{f}(P \| Q)\), so minimizing this lower bound w.r.t. \(\theta\) does not necessarily minimize the divergence itself (my own comment).

To ensure that the output of \(T_{\omega}\) stays within the domain of \(f^*\), it can be further parameterized as: \[ T_{\omega}(\mathbf{x}) = g_f(V_{\omega}(\mathbf{x})) \] where \(V_{\omega}: \mathcal{X} \to \mathbb{R}\), and \(g_f: \mathbb{R} \to \text{dom}_{f^*}\) is an output activation function chosen according to \(f\). The above objective can then be rewritten as: \[ \mathcal{F}(\theta,\omega) = \mathbb{E}_{p_d(\mathbf{x})}\left[ g_f(V_{\omega}(\mathbf{x})) \right] + \mathbb{E}_{q_{\theta}(\mathbf{x})}\left[ -f^*(g_f(V_{\omega}(\mathbf{x}))) \right] \]
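
As a concrete illustration, here is a minimal PyTorch sketch of \(\mathcal{F}(\theta,\omega)\) for the original-GAN choice of \(f\) listed below, where \(g_f(v) = -\log(1+e^{-v}) = -\text{softplus}(-v)\) and \(-f^*(g_f(v))\) simplifies to \(-\text{softplus}(v)\); `critic` plays the role of \(V_{\omega}\) and the names are illustrative:

```python
import torch
import torch.nn.functional as F

def fgan_objective(critic, x_real, x_fake):
    """F(theta, omega) for the original-GAN f-divergence.
    Uses g_f(v) = -softplus(-v) and -f*(g_f(v)) = -softplus(v)."""
    v_real = critic(x_real)   # V_omega on data samples ~ p_d(x)
    v_fake = critic(x_fake)   # V_omega on generator samples ~ q_theta(x)
    return (-F.softplus(-v_real)).mean() + (-F.softplus(v_fake)).mean()
```

The critic maximizes this quantity w.r.t. \(\omega\), while the generator minimizes it w.r.t. \(\theta\) (or uses the alternative update described below).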

Single-Step Gradient Method

Instead of alternating between optimizing \(\omega\) and \(\theta\) in separate steps, Nowozin, Cseke, and Tomioka (2016) proposed to optimize the two sets of parameters in a single step: \[ \begin{align} \omega^{t+1} &= \omega^{t} + \eta \,\nabla_{\omega} \mathcal{F}(\theta^t,\omega^t) \\ &= \omega^{t} + \eta \,\nabla_{\omega} \left( \mathbb{E}_{p_d(\mathbf{x})}\left[ g_f(V_{\omega}(\mathbf{x})) \right] + \mathbb{E}_{q_{\theta}(\mathbf{x})}\left[ -f^*(g_f(V_{\omega}(\mathbf{x}))) \right] \right) \\ \theta^{t+1} &= \theta^{t} - \eta \,\nabla_{\theta} \mathcal{F}(\theta^t,\omega^t) \\ &= \theta^{t} - \eta \,\nabla_{\theta} \left( \mathbb{E}_{q_{\theta}(\mathbf{x})}\left[ -f^*(g_f(V_{\omega}(\mathbf{x}))) \right] \right) \end{align} \]

Similar to Goodfellow et al. (2014), Nowozin, Cseke, and Tomioka (2016) also proposed an alternative update for the parameter \(\theta\), so as to speed up training: \[ \theta^{t+1} = \theta^{t} + \eta \,\nabla_{\theta} \mathbb{E}_{q_{\theta}(\mathbf{x})}\left[ g_f(V_{\omega}(\mathbf{x})) \right] \] Also note that, in many architectures, the gradient \(\nabla_{\theta}\, \mathbb{E}_{q_{\theta}(\mathbf{x})}[\cdot]\) can be computed with the reparameterization trick, i.e., by expressing \(\mathbf{x}\) through a differentiable function \(\mathbf{x} = \mathbf{x}_{\theta}(\boldsymbol{\epsilon})\), where \(\boldsymbol{\epsilon}\) is a random noise independent of \(\theta\).
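
A sketch of one such simultaneous update with the alternative generator objective, again using the original-GAN instantiation from the sketch above; the gradient steps are written out manually, and `generator`, `critic`, and `eta` are placeholder names:

```python
import torch
import torch.nn.functional as F

def single_step(generator, critic, x_real, z, eta=1e-4):
    """One simultaneous update: ascent on F for omega, and ascent on
    the alternative objective E_q[g_f(V_omega(x))] for theta."""
    x_fake = generator(z)
    v_real, v_fake = critic(x_real), critic(x_fake)

    # F(theta, omega) for the original-GAN f (as in fgan_objective above).
    obj_critic = (-F.softplus(-v_real)).mean() + (-F.softplus(v_fake)).mean()
    # Alternative generator objective: E_q[g_f(V_omega(x))].
    obj_gen = (-F.softplus(-v_fake)).mean()

    grads_c = torch.autograd.grad(obj_critic, list(critic.parameters()),
                                  retain_graph=True)
    grads_g = torch.autograd.grad(obj_gen, list(generator.parameters()))

    with torch.no_grad():
        for p, g in zip(critic.parameters(), grads_c):
            p.add_(eta * g)   # omega <- omega + eta * grad_omega F
        for p, g in zip(generator.parameters(), grads_g):
            p.add_(eta * g)   # theta <- theta + eta * grad_theta E_q[g_f(V)]
```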

Formulations for Various f-Divergence

More details can be found in Nowozin, Cseke, and Tomioka (2016), including the formulations for the Pearson \(\chi^2\) divergence and the squared Hellinger distance.

Kullback-Leibler

\[ \begin{align} D_{f}(P \| Q) &= D_{\mathrm{KL}}(P\|Q) \\ &= \int p(\mathbf{x}) \log \frac{p(\mathbf{x})}{q(\mathbf{x})}\,\mathrm{d}\mathbf{x} \\ f(u) &= u \log u \\ f^*(t) &= \exp(t - 1) \\ g_f(v) &= v \end{align} \]

Reverse KL

\[ \begin{align} D_{f}(P \| Q) &= D_{\mathrm{KL}}(Q\|P) \\ &= \int q(\mathbf{x}) \log \frac{q(\mathbf{x})}{p(\mathbf{x})}\,\mathrm{d}\mathbf{x} \\ f(u) &= -\log u \\ f^*(t) &= -1 - \log(-t) \\ g_f(v) &= -\exp(-v) \end{align} \]

Jensen-Shannon

\[ \begin{align} D_{f}(P \| Q) &= \text{JSD}(P\|Q) = \frac{1}{2}\Big( D_{\mathrm{KL}}(P\|M) +D_{\mathrm{KL}}(Q\|M) \Big) \\ &= \frac{1}{2}\int \left( p(\mathbf{x}) \log \frac{2p(\mathbf{x})}{p(\mathbf{x}) + q(\mathbf{x})} + q(\mathbf{x}) \log \frac{2 q(\mathbf{x})}{p(\mathbf{x}) + q(\mathbf{x})} \right)\,\mathrm{d}\mathbf{x} \\ f(u) &= u\log u - (u+1)\log \frac{u+1}{2} \\ f^*(t) &= -\log(2-\exp(t)) \\ g_f(v) &= \log 2 - \log (1+\exp(-v)) \end{align} \]

where \(M = \frac{P+Q}{2}\).

Original GAN (Goodfellow et al. 2014)

\[ \begin{align} D_{f}(P \| Q) &= \int \left( p(\mathbf{x}) \log \frac{2p(\mathbf{x})}{p(\mathbf{x}) + q(\mathbf{x})} + q(\mathbf{x}) \log \frac{2 q(\mathbf{x})}{p(\mathbf{x}) + q(\mathbf{x})} \right)\mathrm{d}\mathbf{x} - \log 4 \\ f(u) &= u \log u - (u+1) \log (u+1) \\ f^*(t) &= -\log(1-\exp(t)) \\ g_f(v) &= -\log(1+\exp(-v)) \end{align} \]

References

Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. “Generative Adversarial Nets.” In Advances in Neural Information Processing Systems, 2672–80.

Nowozin, Sebastian, Botond Cseke, and Ryota Tomioka. 2016. “f-GAN: Training Generative Neural Samplers Using Variational Divergence Minimization.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 271–79. Curran Associates, Inc.

Mutual Information

Definition

\[ \begin{align} I(X;Y) &= D_{\mathrm{KL}}\left( p(x,y) \,\|\, p(x)\,p(y) \right) \\ &= \iint p(x,y)\, \log \frac{p(x,y)}{p(x)\,p(y)}\,\mathrm{d}x\,\mathrm{d}y \end{align} \]
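
For example, if \(X\) and \(Y\) are jointly Gaussian with correlation coefficient \(\rho\), this evaluates to \[ I(X;Y) = -\frac{1}{2}\log\left(1-\rho^2\right) \] which is zero for uncorrelated (hence independent) Gaussians and diverges as \(|\rho| \to 1\).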

Invariant Under Reparameterization

If \(f\) and \(g\) are homeomorphisms, then \[ I(X;Y) = I(f(X);g(Y)) \] that is, mutual information is invariant under reparameterization (Brakel and Bengio 2017).

Estimators

InfoNCE

(to be completed)

MINE

Belghazi et al. (2018) proposed a mutual information estimator based on the Donsker-Varadhan dual representation of the KL divergence (Donsker and Varadhan 1983). \[ \begin{align} I(X;Z) &\geq \sup_{T_{\psi}} \left\{ \mathbb{E}_{p(x,z)}\left[T_{\psi}(x,z)\right] - \log \left( \mathbb{E}_{p(x)\,p(z)}\left[e^{T_{\psi}(x,z)}\right] \right) \right\} \\ &\approx \sup_{T_{\psi}} \left\{ \frac{1}{b} \sum_{i=1}^b T_{\psi}(x^{(i)},z^{(i)}) - \log \left( \frac{1}{b} \sum_{i=1}^b e^{T_{\psi}(x^{(i)},\tilde{z}^{(i)})} \right) \right\} \end{align} \] where \(x^{(i)}, z^{(i)} \sim p(x,z)\), \(\tilde{z}^{(i)} \sim p(z)\), and \(b\) is the batch size.
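
A minimal PyTorch sketch of the batch estimate, assuming a statistics network `T` that maps a batch of \((x, z)\) pairs to a vector of scalar scores (shuffling \(z\) within the batch stands in for sampling from \(p(x)\,p(z)\)):

```python
import math
import torch

def mine_lower_bound(T, x, z):
    """Batch estimate of the Donsker-Varadhan lower bound on I(X;Z)."""
    b = x.size(0)
    z_tilde = z[torch.randperm(b)]                 # approximate samples from p(z)
    joint_term = T(x, z).mean()                    # (1/b) sum T(x_i, z_i)
    marginal_term = torch.logsumexp(T(x, z_tilde), dim=0) - math.log(b)
    return joint_term - marginal_term
```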

The gradient of the MINE estimator may have a much larger scale than the gradient of the main loss, in which case the Adaptive Clipping described above can be applied. Belghazi et al. (2018) also proposed to correct the bias of the naive gradient estimator, which comes from the batch estimate of the denominator inside the logarithm, by replacing that batch estimate with a moving average.
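
A sketch of one common way to implement this correction (the exact recipe here is an assumption, not the paper's pseudocode; `alpha` is the moving-average rate and `ema` has to be carried across training steps). Dividing the exponential term by a detached moving average yields the corrected gradient, at the cost of the returned loss no longer being the exact DV bound:

```python
import torch

def mine_loss_with_ema(T, x, z, ema, alpha=0.99):
    """MINE loss with a moving-average correction of the gradient bias."""
    b = x.size(0)
    z_tilde = z[torch.randperm(b)]
    t_joint = T(x, z).mean()
    exp_marginal = torch.exp(T(x, z_tilde)).mean()
    # Moving average of E_{p(x)p(z)}[exp(T)].
    ema = alpha * ema + (1.0 - alpha) * exp_marginal.detach()
    # The gradient of this loss matches the corrected estimator:
    # grad E[T]_joint - E[exp(T) * grad T]_marginal / ema.
    loss = -(t_joint - exp_marginal / ema)
    return loss, ema
```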

Deep InfoMax

Hjelm et al. (2018) proposed a mutual information estimator based on the Jensen-Shannon divergence, via the f-GAN formulation: \[ \mathcal{I}^{(JSD)}(X;Z) = \mathbb{E}_{p(x,z)}\left[ -\text{sp}(-T_{\psi}(x,z)) \right] - \mathbb{E}_{p(x)\,p(z)}\left[ \text{sp}(T_{\psi}(x,z)) \right] \] where \(\text{sp}(a) = \log(1 + e^a)\) is the softplus function.

For an encoder architecture \(z = E_{\theta}(x)\) with \(p(x) = p_d(x)\), it can also be formulated as: \[ \begin{align} \mathcal{I}^{(JSD)}(X;Z) &= \mathbb{E}_{\mathbb{P}}\left[ -\text{sp}(-T_{\psi}(x,E_{\theta}(x))) \right] - \mathbb{E}_{\mathbb{P}\times\tilde{\mathbb{P}}}\left[ \text{sp}(T_{\psi}(\tilde{x},E_{\theta}(x))) \right] \\ &\approx \frac{1}{b} \sum_{i=1}^b \left[ -\text{sp}\left(-T_{\psi}(x^{(i)},E_{\theta}(x^{(i)}))\right) \right] - \frac{1}{b} \sum_{i=1}^b \left[ \text{sp}\left({T_{\psi}(\tilde{x}^{(i)},E_{\theta}(x^{(i)}))}\right) \right] \end{align} \] where \(\mathbb{P} = \tilde{\mathbb{P}} = p(x)\), \(x^{(i)}\) and \(\tilde{x}^{(i)}\) are samples from \(\mathbb{P}\) and \(\tilde{\mathbb{P}}\), respectively, and \(b\) is the batch size.
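
A minimal PyTorch sketch of this batch estimate, where negative pairs are formed by pairing each code \(E_{\theta}(x)\) with a different input from the same batch (the network names are placeholders):

```python
import torch
import torch.nn.functional as F

def jsd_mi_estimate(T, encoder, x):
    """Batch estimate of the JSD-based mutual information objective."""
    z = encoder(x)                                  # E_theta(x)
    x_tilde = x[torch.randperm(x.size(0))]          # within-batch negatives
    pos = (-F.softplus(-T(x, z))).mean()            # E_P[-sp(-T(x, E(x)))]
    neg = F.softplus(T(x_tilde, z)).mean()          # E_{PxP~}[sp(T(x~, E(x)))]
    return pos - neg
```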

Note: Deep InfoMax actually contains much more than is summarized here; see the original paper for the architecture, the global/local mutual information maximization objectives, and the prior matching technique.

Mutual Information as Regularizers

Avoid Mode Collapse in GAN

Maximizing the entropy \(H(X) = H(G(Z))\) of a GAN generator could help avoid mode collapse (Chen et al. 2016; Belghazi et al. 2018; Kumar et al. 2019).

Since \(X=G(Z)\) is a deterministic mapping, we have \(H(G(Z)|Z) \equiv 0\), and therefore: \[ H(G(Z)) = I(G(Z);Z) + H(G(Z)|Z) = I(G(Z);Z) \] so the generator entropy can be maximized by maximizing the mutual information \(I(G(Z);Z)\), e.g., with one of the estimators above.

Bound the Reconstruction Error

Given the encoder \(q(\mathbf{z}|\mathbf{x})\), the decoder \(p(\mathbf{x}|\mathbf{z})\), and the prior distribution \(p(\mathbf{z})\), and assuming \(q(\mathbf{x}) = p(\mathbf{x}) = p_d(\mathbf{x})\), the reconstruction error \(\mathcal{R}\), defined as: \[ \mathcal{R} = \mathbb{E}_{p_d(\mathbf{x})} \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[ -\log p(\mathbf{x}|\mathbf{z}) \right] \] can be decomposed as: \[ \begin{align} \mathcal{R} &= \mathbb{E}_{q(\mathbf{x},\mathbf{z})}\left[ -\log p(\mathbf{x}|\mathbf{z}) \right] \\ &= \mathbb{E}_{q(\mathbf{x},\mathbf{z})}\left[ -\log p(\mathbf{x},\mathbf{z}) + \log p(\mathbf{z}) \right] \\ &= \mathbb{E}_{q(\mathbf{x},\mathbf{z})}\left[ \log \frac{q(\mathbf{x},\mathbf{z})}{p(\mathbf{x},\mathbf{z})} - \log q(\mathbf{x},\mathbf{z}) + \log p(\mathbf{z}) \right] \\ &= D_{\mathrm{KL}}(q(\mathbf{x},\mathbf{z})\,\|\,p(\mathbf{x},\mathbf{z})) + H_{q}(X,Z) + \mathbb{E}_{q(\mathbf{x},\mathbf{z})}\left[ \log p(\mathbf{z}) \right] \end{align} \] where \[ \begin{align} \mathbb{E}_{q(\mathbf{x},\mathbf{z})}\left[ \log p(\mathbf{z}) \right] &= \iint q(\mathbf{x},\mathbf{z}) \log p(\mathbf{z}) \,\mathrm{d}\mathbf{z}\,\mathrm{d}\mathbf{x} \\ &= \int \left( \int q(\mathbf{x},\mathbf{z}) \,\mathrm{d}\mathbf{x} \right)\log p(\mathbf{z}) \,\mathrm{d}\mathbf{z} \\ &= \mathbb{E}_{q(\mathbf{z})}\left[ \log p(\mathbf{z}) \right] \\ &= \mathbb{E}_{q(\mathbf{z})}\left[ -\log \frac{q(\mathbf{z})}{p(\mathbf{z})} + \log q(\mathbf{z}) \right] \\ &= -D_{\mathrm{KL}}(q(\mathbf{z})\,\|\,p(\mathbf{z})) - H_{q}(Z) \end{align} \] and since \[ H_q(X,Z) - H_q(Z) = H_q(X|Z) = H_q(X) - I_q(X;Z) \] we have (Belghazi et al. 2018): \[ \begin{align} \mathcal{R} &= D_{\mathrm{KL}}(q(\mathbf{x},\mathbf{z})\,\|\,p(\mathbf{x},\mathbf{z})) + H_{q}(X,Z) - D_{\mathrm{KL}}(q(\mathbf{z})\,\|\,p(\mathbf{z})) - H_{q}(Z) \\ &= D_{\mathrm{KL}}(q(\mathbf{x},\mathbf{z})\,\|\,p(\mathbf{x},\mathbf{z})) - D_{\mathrm{KL}}(q(\mathbf{z})\,\|\,p(\mathbf{z})) + H_q(X) - I_q(X;Z) \end{align} \]

Since \(q(\mathbf{x}) = p_d(\mathbf{x})\) is fixed, \(H_q(X)\) is a constant; therefore, as long as \(q(\mathbf{z})\) is trained to match \(p(\mathbf{z})\), maximizing \(I_q(X;Z)\) is an effective way to bound the reconstruction error.

Information Bottleneck

For a latent variable model \(X \to Z \to Y\), mutual information can serve as a regularizer that limits the amount of information passing through \(Z\), so that only the most relevant information reaches the output variable \(Y\).

This constraint can be imposed by minimizing the Information Bottleneck Lagrangian w.r.t. the encoder \(q(z|x)\): \[ \mathcal{L}(q(z|x)) = H(Y|Z) + \beta\, I(X;Z) \]
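
One common variational instantiation (an assumption here, not detailed in this note) replaces \(H(Y|Z)\) with the prediction cross-entropy under \(z \sim q(z|x)\) and upper-bounds \(I(X;Z)\) by \(\mathbb{E}_{p_d(x)}\left[ D_{\mathrm{KL}}(q(z|x)\,\|\,p(z)) \right]\). A minimal PyTorch sketch with a Gaussian encoder and a standard normal prior:

```python
import torch
import torch.nn.functional as F

def ib_loss(mu, log_var, logits, y, beta):
    """Information-bottleneck-style loss: cross-entropy (for H(Y|Z)) plus
    beta times KL(q(z|x) || N(0, I)) (an upper bound on I(X;Z)).
    `mu`, `log_var` parameterize the Gaussian encoder q(z|x);
    `logits` are classifier outputs for a z sampled from q(z|x)."""
    ce = F.cross_entropy(logits, y)
    kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(dim=1).mean()
    return ce + beta * kl
```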

References

Belghazi, Mohamed Ishmael, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R. Devon Hjelm. 2018. “MINE: Mutual Information Neural Estimation.” arXiv Preprint arXiv:1801.04062.

Brakel, Philemon, and Yoshua Bengio. 2017. “Learning Independent Features with Adversarial Nets for Non-Linear ICA.” arXiv Preprint arXiv:1710.05050.

Chen, Xi, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. “InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 2172–80. Curran Associates, Inc.

Donsker, Monroe D., and SR Srinivasa Varadhan. 1983. “Asymptotic Evaluation of Certain Markov Process Expectations for Large Time. IV.” Communications on Pure and Applied Mathematics 36 (2): 183–212.

Hjelm, R. Devon, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. 2018. “Learning Deep Representations by Mutual Information Estimation and Maximization.” arXiv Preprint arXiv:1808.06670.

Kumar, Rithesh, Anirudh Goyal, Aaron Courville, and Yoshua Bengio. 2019. “Maximum Entropy Generators for Energy-Based Models.” arXiv:1901.08508 [Cs, Stat], January. http://arxiv.org/abs/1901.08508.