f-GAN

Proposed by Nowozin, Cseke, and Tomioka (2016), f-GAN generalizes the original GAN framework (Goodfellow et al. 2014), whose objective corresponds to the Jensen-Shannon divergence, to the whole family of f-divergences.

Theoretical Framework

f-Divergence

\[ D_{f}(P \| Q) = \int q(\mathbf{x}) \,f\left( \frac{p(\mathbf{x})}{q(\mathbf{x})} \right)\,\mathrm{d}\mathbf{x} \]

where \(f: \mathbb{R}^+ \to \mathbb{R}\) is a convex, lower-semicontinuous function satisfying \(f(1)=0\).
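As a quick sanity check on these conditions: since \(f\) is convex, Jensen's inequality gives \[ D_{f}(P \| Q) = \mathbb{E}_{q(\mathbf{x})}\left[ f\!\left(\frac{p(\mathbf{x})}{q(\mathbf{x})}\right) \right] \geq f\!\left( \mathbb{E}_{q(\mathbf{x})}\left[\frac{p(\mathbf{x})}{q(\mathbf{x})}\right] \right) = f(1) = 0, \] so every f-divergence is non-negative, and it vanishes when \(P = Q\).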

Fenchel conjugate

The Fenchel conjugate of \(f(u)\) is: \[ f^*(t) = \sup_{u \in \text{dom}_f}\left\{ ut - f(u) \right\} \] Since \(f\) is convex and lower-semicontinuous, \(f^{**} = f\), and therefore: \[ f(u) = \sup_{t \in \text{dom}_{f^*}} \left\{ tu - f^*(t) \right\} \] Substituting this into the definition of the f-divergence gives: \[ D_{f}(P \| Q) = \int q(\mathbf{x}) \, \sup_{t \in \text{dom}_{f^*}} \left\{ t\,\frac{p(\mathbf{x})}{q(\mathbf{x})} - f^*(t) \right\} \,\mathrm{d}\mathbf{x} \]
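As a worked example (consistent with the KL row in the tables below), take \(f(u) = u \log u\). Setting the derivative of \(ut - u\log u\) w.r.t. \(u\) to zero gives \(t - \log u - 1 = 0\), i.e. \(u = e^{t-1}\), hence \[ f^*(t) = \sup_{u > 0}\left\{ ut - u\log u \right\} = t\,e^{t-1} - e^{t-1}(t-1) = e^{t-1}. \]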

Variational Estimation of f-Divergence

Using a function \(T: \mathcal{X} \to \mathbb{R}\) from a family \(\mathcal{T}\), we can obtain the following lower bound (Nowozin, Cseke, and Tomioka 2016): \[ \begin{align} D_{f}(P \| Q) &= \int q(\mathbf{x}) \, \sup_{t \in \text{dom}_{f^*}} \left\{ t\,\frac{p(\mathbf{x})}{q(\mathbf{x})} - f^*(t) \right\} \,\mathrm{d}\mathbf{x} \\ &\geq \sup_{T \in \mathcal{T}} \left\{ \int q(\mathbf{x}) \left( T(\mathbf{x})\,\frac{p(\mathbf{x})}{q(\mathbf{x})} - f^*(T(\mathbf{x})) \right) \,\mathrm{d}\mathbf{x} \right\} \\ &= \sup_{T \in \mathcal{T}} \left\{ \int p(\mathbf{x})\,T(\mathbf{x}) \,\mathrm{d}\mathbf{x} - \int q(\mathbf{x}) \,f^*(T(\mathbf{x})) \,\mathrm{d}\mathbf{x} \right\} \\ &= \sup_{T \in \mathcal{T}} \left\{ \mathbb{E}_{p(\mathbf{x})}\left[ T(\mathbf{x}) \right] - \mathbb{E}_{q(\mathbf{x})} \left[ f^*(T(\mathbf{x})) \right] \right\} \end{align} \] This is only a lower bound, because the family \(\mathcal{T}\) might not contain the function that attains the supremum at every \(\mathbf{x}\). In fact, the bound is tight for \[ T^*(\mathbf{x}) = f'\left(\frac{p(\mathbf{x})}{q(\mathbf{x})}\right) \] where \(f'\) is the first derivative of \(f\).
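The tightness at \(T^*\) can be checked numerically. Below is a small Monte Carlo sketch of my own (not from the paper), using two 1-D Gaussians and the KL case \(f(u) = u\log u\), for which \(f'(u) = \log u + 1\) and \(f^*(t) = e^{t-1}\); plugging \(T^*\) into the bound should recover the closed-form KL divergence.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy example (my own choice): P = N(0, 1), Q = N(1, 1.5^2), f(u) = u*log(u).
m1, s1, m2, s2 = 0.0, 1.0, 1.0, 1.5
xp = rng.normal(m1, s1, 200_000)   # samples from P
xq = rng.normal(m2, s2, 200_000)   # samples from Q

def T_star(x):
    # T*(x) = f'(p(x)/q(x)) = log(p(x)/q(x)) + 1
    return norm.logpdf(x, m1, s1) - norm.logpdf(x, m2, s2) + 1.0

# E_p[T*(x)] - E_q[f*(T*(x))], with f*(t) = exp(t - 1), estimated by Monte Carlo
bound = T_star(xp).mean() - np.exp(T_star(xq) - 1.0).mean()

# closed-form KL(P || Q) between the two Gaussians, for comparison
kl = np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5
print(f"variational bound at T*: {bound:.4f}   exact KL: {kl:.4f}")
```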

Variational Divergence Minimization

Using the variational lower bound of the f-divergence \(D_{f}(P \| Q)\), we obtain the objective \(\mathcal{F}(\theta,\omega)\) for the generator \(q_{\theta}(\mathbf{x})\) and the variational function \(T_{\omega}\), which is minimized w.r.t. \(\theta\) and maximized w.r.t. \(\omega\):

\[ \mathcal{F}(\theta,\omega) = \mathbb{E}_{p_d(\mathbf{x})}\left[ T_{\omega}(\mathbf{x}) \right] - \mathbb{E}_{q_{\theta}(\mathbf{x})}\left[ f^*(T_{\omega}(\mathbf{x})) \right] \]

Note: \(\mathcal{F}(\theta,\omega)\) is only a lower bound of \(D_{f}(P \| Q)\); minimizing a lower bound does not necessarily minimize the divergence itself (my comment).

To ensure that the output of \(T_{\omega}\) stays within the domain of \(f^*\), it can be further reparameterized as: \[ T_{\omega}(\mathbf{x}) = g_f(V_{\omega}(\mathbf{x})) \] where \(V_{\omega}: \mathcal{X} \to \mathbb{R}\), and \(g_f: \mathbb{R} \to \text{dom}_{f^*}\) is an output activation function chosen according to \(f\). The objective can then be rewritten as: \[ \mathcal{F}(\theta,\omega) = \mathbb{E}_{p_d(\mathbf{x})}\left[ g_f(V_{\omega}(\mathbf{x})) \right] + \mathbb{E}_{q_{\theta}(\mathbf{x})}\left[ -f^*(g_f(V_{\omega}(\mathbf{x}))) \right] \]
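As a minimal sketch of how this objective can be assembled for a minibatch (the names `vdm_objective`, `g_f`, and `f_star` are mine, not from the paper): given the raw outputs of \(V_{\omega}\) on real and generated samples, the two expectations are just sample means.

```python
import torch

def vdm_objective(v_real, v_fake, g_f, f_star):
    # v_real: V_omega(x) on samples from p_d; v_fake: V_omega(x) on samples from q_theta.
    # g_f and f_star encode the chosen f-divergence (see the tables below).
    t_real = g_f(v_real)   # T_omega(x) = g_f(V_omega(x))
    t_fake = g_f(v_fake)
    return t_real.mean() - f_star(t_fake).mean()

# Example with the KL pair listed below: g_f(v) = v, f*(t) = exp(t - 1)
value = vdm_objective(
    torch.randn(64), torch.randn(64),
    g_f=lambda v: v,
    f_star=lambda t: torch.exp(t - 1.0),
)
```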

Single-Step Gradient Method

Instead of alternating between optimizing \(\omega\) and \(\theta\) in separate steps, Nowozin, Cseke, and Tomioka (2016) proposed updating both sets of parameters in a single step: \[ \begin{align} \omega^{t+1} &= \omega^{t} + \eta \,\nabla_{\omega} \mathcal{F}(\theta^t,\omega^t) \\ &= \omega^{t} + \eta \,\nabla_{\omega} \left( \mathbb{E}_{p_d(\mathbf{x})}\left[ g_f(V_{\omega}(\mathbf{x})) \right] + \mathbb{E}_{q_{\theta}(\mathbf{x})}\left[ -f^*(g_f(V_{\omega}(\mathbf{x}))) \right] \right) \\ \theta^{t+1} &= \theta^{t} - \eta \,\nabla_{\theta} \mathcal{F}(\theta^t,\omega^t) \\ &= \theta^{t} - \eta \,\nabla_{\theta} \left( \mathbb{E}_{q_{\theta}(\mathbf{x})}\left[ -f^*(g_f(V_{\omega}(\mathbf{x}))) \right] \right) \end{align} \]
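A minimal PyTorch sketch of the single-step update, under assumptions of my own (a 1-D toy data distribution, small MLPs for the generator and for \(V_{\omega}\), and the GAN variant of \(g_f\) and \(f^*\), for which \(g_f(v) = \log\sigma(v)\) and \(-f^*(g_f(v)) = \log\sigma(-v)\)):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

gen  = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))  # q_theta sampler
disc = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))  # V_omega
eta  = 1e-3

for step in range(2000):
    x_real = 0.5 * torch.randn(64, 1) + 2.0   # toy data distribution N(2, 0.25)
    x_fake = gen(torch.randn(64, 1))          # x = x_theta(eps), reparameterized
    v_real, v_fake = disc(x_real), disc(x_fake)

    # F(theta^t, omega^t) on this minibatch (GAN variant of g_f and f*)
    obj = F.logsigmoid(v_real).mean() + F.logsigmoid(-v_fake).mean()

    # gradients of the same objective for both parameter sets (single step)
    d_grads = torch.autograd.grad(obj, list(disc.parameters()), retain_graph=True)
    g_grads = torch.autograd.grad(obj, list(gen.parameters()))

    with torch.no_grad():
        for p, g in zip(disc.parameters(), d_grads):
            p += eta * g                      # omega^{t+1} = omega^t + eta * grad_omega F
        for p, g in zip(gen.parameters(), g_grads):
            p -= eta * g                      # theta^{t+1} = theta^t - eta * grad_theta F
```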

Similar to Goodfellow et al. (2014), Nowozin, Cseke, and Tomioka (2016) also proposed an alternative update for the parameter \(\theta\) to speed up training: \[ \theta^{t+1} = \theta^{t} + \eta \,\nabla_{\theta} \mathbb{E}_{q_{\theta}(\mathbf{x})}\left[ g_f(V_{\omega}(\mathbf{x})) \right] \] Also note that, in many architectures, the gradient \(\nabla_{\theta}\, \mathbb{E}_{q_{\theta}(\mathbf{x})}\left[\cdot\right]\) can be estimated with the reparameterization trick, i.e., by expressing \(\mathbf{x}\) through a differentiable function \(\mathbf{x} = \mathbf{x}_{\theta}(\boldsymbol{\epsilon})\), where \(\boldsymbol{\epsilon}\) is random noise independent of \(\theta\).
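Continuing the toy sketch above (reusing its hypothetical `gen`, `disc`, and `eta`), the alternative \(\theta\) update ascends \(\mathbb{E}_{q_{\theta}}[g_f(V_{\omega}(\mathbf{x}))]\), and sampling through `gen(noise)` is exactly the reparameterization \(\mathbf{x} = \mathbf{x}_{\theta}(\boldsymbol{\epsilon})\):

```python
# Alternative generator update (GAN variant of g_f), continuing the sketch above.
x_fake = gen(torch.randn(64, 1))              # x = x_theta(eps), differentiable in theta
alt_obj = F.logsigmoid(disc(x_fake)).mean()   # E_q[g_f(V_omega(x))]
g_grads = torch.autograd.grad(alt_obj, list(gen.parameters()))
with torch.no_grad():
    for p, g in zip(gen.parameters(), g_grads):
        p += eta * g                          # theta^{t+1} = theta^t + eta * grad_theta
```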

Formulations for Various f-Divergence

More details can be found in Nowozin, Cseke, and Tomioka (2016), including the formulations for the Pearson \(\chi^2\) and squared Hellinger divergences.

Kullback-Leibler

\[ \begin{align} D_{f}(P \| Q) &= D_{\mathrm{KL}}(P\|Q) \\ &= \int p(\mathbf{x}) \log \frac{p(\mathbf{x})}{q(\mathbf{x})}\,\mathrm{d}\mathbf{x} \\ f(u) &= u \log u \\ f^*(t) &= \exp(t - 1) \\ g_f(v) &= v \end{align} \]

Reverse KL

\[ \begin{align} D_{f}(P \| Q) &= D_{\mathrm{KL}}(Q\|P) \\ &= \int q(\mathbf{x}) \log \frac{q(\mathbf{x})}{p(\mathbf{x})}\,\mathrm{d}\mathbf{x} \\ f(u) &= -\log u \\ f^*(t) &= -1 - \log(-t) \\ g_f(v) &= -\exp(-v) \end{align} \]

Jensen-Shannon

\[ \begin{align} D_{f}(P \| Q) &= \text{JSD}(P\|Q) = \frac{1}{2}\Big( D_{\mathrm{KL}}(P\|M) +D_{\mathrm{KL}}(Q\|M) \Big) \\ &= \frac{1}{2}\int \left( p(\mathbf{x}) \log \frac{2p(\mathbf{x})}{p(\mathbf{x}) + q(\mathbf{x})} + q(\mathbf{x}) \log \frac{2 q(\mathbf{x})}{p(\mathbf{x}) + q(\mathbf{x})} \right)\,\mathrm{d}\mathbf{x} \\ f(u) &= u\log u - (u+1)\log \frac{u+1}{2} \\ f^*(t) &= -\log(2-\exp(t)) \\ g_f(v) &= \log 2 - \log (1+\exp(-v)) \end{align} \]

where \(M = \frac{P+Q}{2}\).

Original GAN (Goodfellow et al. 2014)

\[ \begin{align} D_{f}(P \| Q) &= \int \left( p(\mathbf{x}) \log \frac{2p(\mathbf{x})}{p(\mathbf{x}) + q(\mathbf{x})} + q(\mathbf{x}) \log \frac{2 q(\mathbf{x})}{p(\mathbf{x}) + q(\mathbf{x})} \right)\mathrm{d}\mathbf{x} - \log 4 \\ f(u) &= u \log u - (u+1) \log (u+1) \\ f^*(t) &= -\log(1-\exp(t)) \\ g_f(v) &= -\log(1+\exp(-v)) \end{align} \]
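For reference, the \((g_f, f^*)\) pairs from the four cases above can be transcribed into PyTorch expressions as follows (my transcription; the dictionary and its keys are my own naming):

```python
import math
import torch
import torch.nn.functional as F

# (g_f, f*) pairs from the formulations above; g_f maps V_omega's output into dom(f*).
DIVERGENCES = {
    "kl":         (lambda v: v,                                # g_f(v) = v
                   lambda t: torch.exp(t - 1.0)),              # f*(t) = exp(t - 1)
    "reverse_kl": (lambda v: -torch.exp(-v),                   # g_f(v) = -exp(-v)
                   lambda t: -1.0 - torch.log(-t)),            # f*(t) = -1 - log(-t)
    "js":         (lambda v: math.log(2.0) + F.logsigmoid(v),  # log 2 - log(1 + e^{-v})
                   lambda t: -torch.log(2.0 - torch.exp(t))),  # f*(t) = -log(2 - exp(t))
    "gan":        (lambda v: F.logsigmoid(v),                  # -log(1 + e^{-v})
                   lambda t: -torch.log(-torch.expm1(t))),     # f*(t) = -log(1 - exp(t))
}
```

Any of these pairs can be plugged into the `vdm_objective` sketch given earlier.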

References

Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. “Generative Adversarial Nets.” In Advances in Neural Information Processing Systems, 2672–80.

Nowozin, Sebastian, Botond Cseke, and Ryota Tomioka. 2016. “f-GAN: Training Generative Neural Samplers Using Variational Divergence Minimization.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 271–79. Curran Associates, Inc.