Original GAN

Goodfellow et al. (2014) proposed the following first GAN architecture:

  • Prior: \(p_g(\mathbf{z}) = \mathcal{N}(\mathbf{0},\mathbf{I})\)
  • Generator: \(G(\mathbf{z})\)
  • Discriminator: \(D(\mathbf{x}) \in [0, 1]\)

The overall loss from their theory: \[ \mathcal{L} = \min_G \max_D \left\{ \mathbb{E}_{p_d(\mathbf{x})}\left[ \log D(\mathbf{x}) \right] + \mathbb{E}_{p_g(\mathbf{z})}\left[ \log(1 - D(G(\mathbf{z}))) \right] \right\} \] The discriminator loss (to maximize): \[ \mathcal{L}_D = \mathbb{E}_{p_d(\mathbf{x})} \left[ \log D(\mathbf{x}) \right] + \mathbb{E}_{p_g(\mathbf{z})}\left[ \log(1 - D(G(\mathbf{z}))) \right] \]

The theoretical generator loss (to minimize): \[ \mathcal{L}_G = \mathbb{E}_{p_g(\mathbf{z})}\left[ \log(1 - D(G(\mathbf{z}))) \right] \] The actual generator loss (to maximize) in experiments, to avoid saturation in the early stage of training: \[ \mathcal{L}_G = \mathbb{E}_{p_g(\mathbf{z})}\left[ \log(D(G(\mathbf{z}))) \right] \]

GAN Training Algorithm

The most widely adopted GAN training algorithm, which is proposed in this work, alternates between training the discriminator and the generator. That is, in each training iteration:

  • Repeat for n_critics iterations

    • Sample \(\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(b)}\) from \(p_d(\mathbf{x})\), \(\mathbf{z}^{(1)}, \dots, \mathbf{z}^{(b)}\) from \(p_g(\mathbf{z})\).

    • Update the discriminator by: \[ \theta_D = \theta_D + \eta \, \nabla_{\theta_D}\left( \frac{1}{b}\sum_{i=1}^b \left[ \log D(\mathbf{x}^{(i)}) \right] + \frac{1}{b}\sum_{i=1}^b \left[ \log(1 - D(G(\mathbf{z}^{(i)}))) \right] \right) \]

  • Sample \(\mathbf{z}^{(1)}, \dots, \mathbf{z}^{(b)}\) from \(p_g(\mathbf{z})\).

  • Update the generator by: \[ \theta_G = \theta_G - \eta\,\nabla_{\theta_G}\left( \frac{1}{b} \sum_{i=1}^b \log(1 - D(G(\mathbf{z}^{(i)}))) \right) \] or alternatively, \[ \theta_G = \theta_G + \eta\,\nabla_{\theta_G}\left( \frac{1}{b} \sum_{i=1}^b \log(D(G(\mathbf{z}^{(i)}))) \right) \]


Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. “Generative Adversarial Nets.” In Advances in Neural Information Processing Systems, 2672–80.

Evaluation Metrics


Negative Log-Likelihood

The negative log-likelihood (NLL) for \(p_{\theta}(\mathbf{x})\) is defined as:

\[ \begin{align} \text{NLL} &= \mathbb{E}_{p_d(\mathbf{x})} \left[ -\log p_{\theta}(\mathbf{x}) \right] \end{align} \]

Find the Original NLL using Scaled Data

If the original data \(\mathbf{x}\) is \(k\)-dimensional ,and is scaled by \(\frac{1}{\sigma}\) at each of its dimensions, such that the data fed into the model is \(\tilde{\mathbf{x}} = \frac{1}{\sigma} \mathbf{x}\), then: \[ \begin{align} p_d(\mathbf{x}) &= \tilde{p}_d(\tilde{\mathbf{x}}) \left| \det\left( \frac{\mathrm{d}\tilde{\mathbf{x}}}{\mathrm{d}\mathbf{x}} \right) \right| = \tilde{p}_d(\tilde{\mathbf{x}})\left| \det\left( \frac{\mathrm{d}(\mathbf{x} / \sigma)}{\mathrm{d}\mathbf{x}} \right) \right| = \frac{1}{\sigma^k}\,\tilde{p}_d(\tilde{\mathbf{x}}) \\ p_{\theta}(\mathbf{x}) &= \frac{1}{\sigma^k}\, \tilde{p}_{\theta}(\tilde{\mathbf{x}}) \end{align} \] Thus the computed NLL for \(\mathbf{x}\) and \(\tilde{\mathbf{x}}\) has the following relationship: \[ \begin{align} \text{NLL} &= \mathbb{E}_{p_d(\mathbf{x})} \left[ -\log p_{\theta}(\mathbf{x}) \right] \\ &= -\int p_d(\mathbf{x}) \log p_{\theta}(\mathbf{x})\,\mathrm{d}\mathbf{x} \\ &= -\int \frac{1}{\sigma^k}\,\tilde{p}_{d}(\tilde{\mathbf{x}}) \log \left( \frac{1}{\sigma^k}\, \tilde{p}_{\theta}(\tilde{\mathbf{x}}) \right)\left| \det\left( \frac{\mathrm{d}\mathbf{x}}{\mathrm{d}\tilde{\mathbf{x}}} \right) \right|\mathrm{d}\tilde{\mathbf{x}} \\ &= -\int \tilde{p}_{d}(\tilde{\mathbf{x}}) \left[ \log \tilde{p}_{\theta}(\tilde{\mathbf{x}}) - k\log \sigma \right]\mathrm{d}\tilde{\mathbf{x}} \\ &= \mathbb{E}_{\tilde{p}_d(\mathbf{x})} \left[ -\log \tilde{p}_{\theta}(\mathbf{x}) \right] + k\log \sigma \\ &= \widetilde{NLL} + k\log\sigma \end{align} \]

Continuous NLL as an Upper-Bound of Discrete NLL

To train a continuous model upon discrete data (e.g., images), one may add a uniform noise to the data, and obtain an upper-bound of the discrete data NLL with the augmented data.

For pixel integer-valued \(\mathbf{x}\) ranging from 0 to 255, adding a uniform noise \(\mathbf{u} \sim \mathcal{U}[0, 1)\), such that \(\tilde{\mathbf{x}} = \mathbf{x} + \mathbf{u}\), we have (Theis, Oord, and Bethge 2015): \[ \begin{align} -\int \tilde{p}_d(\tilde{\mathbf{x}}) \log \tilde{p}_{\theta}(\tilde{\mathbf{x}}) \,\mathrm{d}\tilde{\mathbf{x}} &= -\sum_{\mathbf{x}} P_d(\mathbf{x}) \int \log \tilde{p}_{\theta}(\mathbf{x} + \mathbf{u}) \,\mathrm{d}\mathbf{u} \\ &\geq -\sum_{\mathbf{x}} P_d(\mathbf{x}) \log \int \tilde{p}_{\theta}(\mathbf{x} + \mathbf{u}) \,\mathrm{d}\mathbf{u} \\ \\ &= -\sum_{\mathbf{x}} P_d(\mathbf{x}) \log P_{\theta}(\mathbf{x}) \\ \end{align} \] where we define the probability of the true discrete data to be: \[ P_{\theta}(\mathbf{x}) = \int \tilde{p}_{\theta}(\mathbf{x} + \mathbf{u}) \,\mathrm{d}\mathbf{u} \] That is to say, the NLL of the augmented continuous random variable \(\tilde{\mathbf{x}}\) can serve as an upper-bound as the true discrete data NLL.

Image Quality


(Davis and Goadrich 2006)

  1. There is a one-to-one correspondence between the points on ROC and AUC curves.

  2. A curve dominates the ROC (fpr-tpr curve) \(\Leftrightarrow\) dominates the AUC (recall-precision curve).

  3. Interpolation between two points \(A\) and \(B\):

    1. On ROC: linear interpolation.

    2. On AUC: \[ \left( \frac{TP_A + x}{\text{Total Pos}}, \frac{TP_A + x}{TP_A + x + FP_A + \frac{FP_B - FP_A}{TP_B - TP_A} x} \right) \]

  4. Compute the area: include the interpolation and use composite trapezoidal method.

    1. Incorrect interpoluation for computing AUC-PR will cause over-estimate.
  5. Optimize Area under ROC and AUC curves: not exactly the same. (especially when not one algorithm dominates the curve?)


Davis, Jesse, and Mark Goadrich. 2006. “The Relationship Between Precision-Recall and ROC Curves.” In Proceedings of the 23rd International Conference on Machine Learning - ICML ’06, 233–40. Pittsburgh, Pennsylvania: ACM Press. https://doi.org/10.1145/1143844.1143874.

Theis, Lucas, Aäron van den Oord, and Matthias Bethge. 2015. “A Note on the Evaluation of Generative Models.” arXiv Preprint arXiv:1511.01844.

Energy GAN


Given the discriminator \(E_{\theta}(\mathbf{x})\), a density function can be derived as: \[ \begin{align} p_{\theta}(\mathbf{x}) &= \frac{1}{Z_{\theta}} e^{-E_{\theta}(\mathbf{x})} \\ Z_{\theta} &= \int e^{-E_{\theta}(\mathbf{x})}\,\mathrm{d}\mathbf{x} \end{align} \]

Maximum Entropy Generators for Energy-Based Models

Kumar et al. (2019) proposed the following architecture:

  • Prior: \(p_z(\mathbf{z})\)
  • Generator: \(G_{\omega}(\mathbf{z})\)
  • Discriminator: \(E_{\theta}(\mathbf{x}) \in (-\infty, \infty)\), the energy function
    • The density function: \(p_{\theta}(\mathbf{x}) = \frac{1}{Z_{\theta}} e^{-E_{\theta}(\mathbf{x})}\)
  • Discriminator for the mutual information estimator: \(T_{\phi}(\mathbf{x},\mathbf{z}) \in (-\infty,\infty)\)

The discriminator loss (to minimize): \[ \begin{align} \mathcal{L}_E &= \mathbb{E}_{p_d(\mathbf{x})}\left[ E_{\theta}(\mathbf{x}) \right] - \mathbb{E}_{p_G(\mathbf{x})}\left[ E_{\theta}(\mathbf{x}) \right] + \Omega \\ \Omega &= \lambda\,\mathbb{E}_{p_d(\mathbf{x})} \left[ \left\| \nabla_{\mathbf{x}} E_{\theta}(\mathbf{x}) \right\|^2 \right] \end{align} \] The generator loss (to minimize) \[ \begin{align} \mathcal{L}_G &= \mathbb{E}_{p_z(\mathbf{z})}\left[ E_{\theta}(G_{\omega}(\mathbf{z})) \right] - H_{p_G}[X] \\ &= \mathbb{E}_{p_z(\mathbf{z})}\left[ E_{\theta}(G_{\omega}(\mathbf{z})) \right] - I_{p_G}(X;Z) \end{align} \] where \(H_{p_G}[X] = I_{p_G}(X;Z) + H(G_{\omega}(Z)|Z)\), and \(H(G_{\omega}(Z)|Z) \equiv 0\) since \(G_{\omega}(\mathbf{z})\) is a deterministic mapper.

The mutual information \(I_{p_G}(X;Z)\) is estimated via the Deep InfoMax estimator by Kumar et al. (2019), formulated as: \[ \begin{align} I_{p_G}(X;Z) &= \mathbb{E}_{p_z(\mathbf{z})}\left[ -\text{sp}(-T_{\phi}(G_{\omega}(\mathbf{z}),\mathbf{z})) \right] - \mathbb{E}_{p_z(\mathbf{z})\times\tilde{p}_z(\tilde{\mathbf{z}})}\left[ \text{sp}(T_{\phi}(G_{\omega}(\mathbf{z}),\tilde{\mathbf{z}})) \right] \\ &= \mathbb{E}_{p_z(\mathbf{z})}\left[ \log \sigma(T_{\phi}(G_{\omega}(\mathbf{z}),\mathbf{z})) \right] - \mathbb{E}_{p_z(\mathbf{z})\times\tilde{p}_z(\tilde{\mathbf{z}})}\left[ \log\left(1 - \sigma(T_{\phi}(G_{\omega}(\mathbf{z}),\tilde{\mathbf{z}}))\right) \right] \end{align} \]

where \(\text{sp}(a) = \log(1 + e^a)\) is the SoftPlus function, and \(\sigma(a) = \frac{1}{1 + e^{-a}}\) is the sigmoid function.

Training Algorithm

Kumar et al. (2019) proposed the following training algorithm:

  • Repeat for n_critics Iterations

    • Sample \(\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(b)}\) from \(p_d(\mathbf{x})\), \(\mathbf{z}^{(1)}, \dots, \mathbf{z}^{(b)}\) from \(p_z(\mathbf{z})\).

    • Obtain \(\tilde{\mathbf{x}}^{(i)} = G_{\omega}(\mathbf{z}^{(i)})\).

    • Calculate: \[ \mathcal{L}_E = \frac{1}{b} \left[ \sum_{i=1}^b E_{\theta}(\mathbf{x}^{(i)}) - \sum_{i=1}^b E_{\theta}(\tilde{\mathbf{x}}^{(i)}) + \lambda\sum_{i=1}^b \left\| \nabla_{\mathbf{x}} E_{\theta}(\mathbf{x}^{(i)}) \right\|^2 \right] \]

    • Gradient descent: \[ \theta^{t+1} = \theta^{t} - \eta \, \nabla_{\theta} \mathcal{L}_E \]

  • Sample \(\mathbf{z}^{(1)}, \dots, \mathbf{z}^{(b)}\) from \(p_z(\mathbf{z})\).

  • Per-dimensional shuffle of \(\mathbf{z}\), yielding \(\tilde{\mathbf{z}}^{(1)}, \dots, \tilde{\mathbf{z}}^{(b)}\).

    (Is this really better than re-sampling from the prior \(p_z(\mathbf{z})\)?)

  • Obtain \(\tilde{\mathbf{x}}^{(i)} = G_{\omega}(\mathbf{z}^{(i)})\).

  • Calculate: \[ \begin{align} \mathcal{L}_H &= \frac{1}{b} \sum_{i=1}^b \left[ \log \sigma(T_{\phi}(\tilde{\mathbf{x}}^{(i)},\mathbf{z}^{(i)})) - \log\left(1 - \sigma(T_{\phi}(\tilde{\mathbf{x}}^{(i)},\tilde{\mathbf{z}}^{(i)}))\right) \right] \\ \mathcal{L}_G &= \frac{1}{b} \sum_{i=1}^b \left[ E_{\theta}(\tilde{\mathbf{x}}^{(i)}) \right] - \mathcal{L}_H \end{align} \]

  • Gradient descent: \[ \begin{align} \omega^{t+1} &= \omega^{t} - \eta \, \nabla_{\omega} \mathcal{L}_G \\ \phi^{t+1} &= \phi^{t} - \eta \, \nabla_{\phi} \mathcal{L}_H \\ \end{align} \]

In summary, this algorithm:

  • Minimize \(\theta\) w.r.t. \(\mathcal{L}_E\) =>
    • Minimizes \(E_{\theta}\) in the discriminator loss \(\mathcal{L}_E\).
  • Minimize \(\phi\) w.r.t. \(L_H\) =>
    • Minimizes \(T_{\phi}\) in the mutual information regularizer \(I_{p_G}(X;Z)\).
  • Minimize \(\omega\) w.r.t. \(L_G\) =>
    • Maximizes \(G_{\omega}\) in the mutual information regularizer \(I_{p_G}(X;Z)\);
    • Minimizes \(G_{\omega}\) in the generator loss \(\mathbb{E}_{p_z(\mathbf{z})}\left[ E_{\theta}(G_{\omega}(\mathbf{z})) \right]\).

Latent Space MCMC

Kumar et al. (2019) also proposed an MCMC method to refine the \(\mathbf{z}\) samples obtained from the prior \(p_z(\mathbf{z})\), according to the energy function on \(\mathbf{z}\), derived as: \[ E(\mathbf{z}) = E_{\theta}(G_{\omega}(\mathbf{z})) \] Then Metropolis-adjusted Langevin algorithm is adopted to sample \(\mathbf{z}\):

  • Langevin dynamics: \[ \mathbf{z}^{(t+1)} = \mathbf{z}^{(t)} - \alpha \frac{\partial E_{\theta}(G_{\omega}(\mathbf{z}^{(t)}))}{\partial \mathbf{z}^{(t)}} + \epsilon \sqrt{2 * \alpha} \] where \(\epsilon \sim \mathcal{N}(\mathbf{0},\mathbf{I})\).

  • Metropolis-Hastings Algorithm: \[ \begin{align} r &= \min\left\{ 1, \frac{p(\mathbf{z}^{(t+1)})}{p(\mathbf{z}^{(t)})} \cdot \frac{q(\mathbf{z}^{(t)}|\mathbf{z}^{(t+1)})}{q(\mathbf{z}^{(t+1)}|\mathbf{z}^{(t)})} \right\} \\ p(\mathbf{z}^{(t)}) &\propto \exp\left\{ -E_{\theta}(G_{\omega}(\mathbf{z}^{(t)})) \right\} \\ q(\mathbf{z}^{(t+1)}|\mathbf{z}^{(t)}) &\propto \exp\left( -\frac{1}{4 \alpha}\left\| \mathbf{z}^{(t+1)} - \mathbf{z}^{(t)} + \alpha \frac{\partial E_{\theta}(G_{\omega}(\mathbf{z}^{(t)}))}{\partial \mathbf{z}^{(t)}} \right\|^2_2 \right) \end{align} \]


Kumar, Rithesh, Anirudh Goyal, Aaron Courville, and Yoshua Bengio. 2019. “Maximum Entropy Generators for Energy-Based Models.” arXiv:1901.08508 [Cs, Stat], January. http://arxiv.org/abs/1901.08508.