# Energy GAN

## Overview

Given the discriminator $$E_{\theta}(\mathbf{x})$$, a density function can be derived as: \begin{align} p_{\theta}(\mathbf{x}) &= \frac{1}{Z_{\theta}} e^{-E_{\theta}(\mathbf{x})} \\ Z_{\theta} &= \int e^{-E_{\theta}(\mathbf{x})}\,\mathrm{d}\mathbf{x} \end{align}
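As a minimal PyTorch sketch (the `EnergyNet` module and its sizes are hypothetical, not from the paper), the unnormalized log-density $$-E_{\theta}(\mathbf{x})$$ is cheap to evaluate, while the partition function $$Z_{\theta}$$ is generally intractable:

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Hypothetical MLP energy function E_theta: R^dim -> R."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # One scalar energy per sample in the batch.
        return self.net(x).squeeze(-1)

energy = EnergyNet(dim=2)
x = torch.randn(16, 2)
log_p_unnormalized = -energy(x)  # equals log p_theta(x) + log Z_theta
```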

## Maximum Entropy Generators for Energy-Based Models

Kumar et al. (2019) proposed the following architecture:

• Prior: $$p_z(\mathbf{z})$$
• Generator: $$G_{\omega}(\mathbf{z})$$
• Discriminator: $$E_{\theta}(\mathbf{x}) \in (-\infty, \infty)$$, the energy function
• The density function: $$p_{\theta}(\mathbf{x}) = \frac{1}{Z_{\theta}} e^{-E_{\theta}(\mathbf{x})}$$
• Discriminator for the mutual information estimator: $$T_{\phi}(\mathbf{x},\mathbf{z}) \in (-\infty,\infty)$$

The discriminator loss (to minimize): \begin{align} \mathcal{L}_E &= \mathbb{E}_{p_d(\mathbf{x})}\left[ E_{\theta}(\mathbf{x}) \right] - \mathbb{E}_{p_G(\mathbf{x})}\left[ E_{\theta}(\mathbf{x}) \right] + \Omega \\ \Omega &= \lambda\,\mathbb{E}_{p_d(\mathbf{x})} \left[ \left\| \nabla_{\mathbf{x}} E_{\theta}(\mathbf{x}) \right\|^2 \right] \end{align} The generator loss (to minimize): \begin{align} \mathcal{L}_G &= \mathbb{E}_{p_z(\mathbf{z})}\left[ E_{\theta}(G_{\omega}(\mathbf{z})) \right] - H_{p_G}[X] \\ &= \mathbb{E}_{p_z(\mathbf{z})}\left[ E_{\theta}(G_{\omega}(\mathbf{z})) \right] - I_{p_G}(X;Z) \end{align} where $$H_{p_G}[X] = I_{p_G}(X;Z) + H(G_{\omega}(Z)|Z)$$, and $$H(G_{\omega}(Z)|Z) \equiv 0$$ since $$G_{\omega}(\mathbf{z})$$ is a deterministic mapping.
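A minimal PyTorch sketch of the discriminator loss $$\mathcal{L}_E$$, assuming an `energy` module like the hypothetical one above; `create_graph=True` makes the gradient penalty itself differentiable w.r.t. $$\theta$$. (The generator loss needs the mutual information estimator, sketched in the next section.)

```python
def discriminator_loss(energy, x_real, x_fake, lam=10.0):
    # E_{p_d}[E_theta(x)] - E_{p_G}[E_theta(x)]
    loss = energy(x_real).mean() - energy(x_fake).mean()

    # Omega = lambda * E_{p_d}[ ||grad_x E_theta(x)||^2 ],
    # built with create_graph=True so the penalty can be backpropagated through.
    x = x_real.detach().requires_grad_(True)
    grad_x = torch.autograd.grad(energy(x).sum(), x, create_graph=True)[0]
    return loss + lam * grad_x.flatten(1).pow(2).sum(dim=1).mean()
```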

The mutual information $$I_{p_G}(X;Z)$$ is estimated with the Deep InfoMax (DIM) estimator adopted by Kumar et al. (2019), formulated as: \begin{align} I_{p_G}(X;Z) &\approx \mathbb{E}_{p_z(\mathbf{z})}\left[ -\text{sp}(-T_{\phi}(G_{\omega}(\mathbf{z}),\mathbf{z})) \right] - \mathbb{E}_{p_z(\mathbf{z})\times\tilde{p}_z(\tilde{\mathbf{z}})}\left[ \text{sp}(T_{\phi}(G_{\omega}(\mathbf{z}),\tilde{\mathbf{z}})) \right] \\ &= \mathbb{E}_{p_z(\mathbf{z})}\left[ \log \sigma(T_{\phi}(G_{\omega}(\mathbf{z}),\mathbf{z})) \right] + \mathbb{E}_{p_z(\mathbf{z})\times\tilde{p}_z(\tilde{\mathbf{z}})}\left[ \log\left(1 - \sigma(T_{\phi}(G_{\omega}(\mathbf{z}),\tilde{\mathbf{z}}))\right) \right] \end{align}

where $$\text{sp}(a) = \log(1 + e^a)$$ is the softplus function and $$\sigma(a) = \frac{1}{1 + e^{-a}}$$ is the sigmoid function; the two forms agree because $$-\text{sp}(-a) = \log \sigma(a)$$ and $$-\text{sp}(a) = \log(1 - \sigma(a))$$.
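A sketch of this estimator, assuming a hypothetical statistics network `statistics_net` implementing $$T_{\phi}(\mathbf{x},\mathbf{z})$$ that takes an $$(\mathbf{x},\mathbf{z})$$ pair and returns one scalar score per sample:

```python
import torch.nn.functional as F

def estimate_mi(statistics_net, x, z, z_marginal):
    """Deep InfoMax (JSD) estimate of I(X; Z) for pairs x = G(z)."""
    t_joint = statistics_net(x, z)             # (G(z), z): drawn from the joint
    t_product = statistics_net(x, z_marginal)  # (G(z), z~): product of marginals
    # E[-sp(-T)] - E[sp(T)]  ==  E[log sigma(T)] + E[log(1 - sigma(T))]
    return (-F.softplus(-t_joint)).mean() - F.softplus(t_product).mean()
```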

### Training Algorithm

Kumar et al. (2019) proposed the following training algorithm (a PyTorch sketch of one full iteration follows the list):

• Repeat for $$n_{\text{critic}}$$ iterations:

  • Sample $$\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(b)}$$ from $$p_d(\mathbf{x})$$ and $$\mathbf{z}^{(1)}, \dots, \mathbf{z}^{(b)}$$ from $$p_z(\mathbf{z})$$.

  • Obtain $$\tilde{\mathbf{x}}^{(i)} = G_{\omega}(\mathbf{z}^{(i)})$$.

  • Calculate: $\mathcal{L}_E = \frac{1}{b} \left[ \sum_{i=1}^b E_{\theta}(\mathbf{x}^{(i)}) - \sum_{i=1}^b E_{\theta}(\tilde{\mathbf{x}}^{(i)}) + \lambda\sum_{i=1}^b \left\| \nabla_{\mathbf{x}} E_{\theta}(\mathbf{x}^{(i)}) \right\|^2 \right]$

  • Gradient descent: $\theta^{t+1} = \theta^{t} - \eta \, \nabla_{\theta} \mathcal{L}_E$

• Sample $$\mathbf{z}^{(1)}, \dots, \mathbf{z}^{(b)}$$ from $$p_z(\mathbf{z})$$.

• Shuffle each dimension of $$\mathbf{z}^{(1)}, \dots, \mathbf{z}^{(b)}$$ independently across the batch, yielding $$\tilde{\mathbf{z}}^{(1)}, \dots, \tilde{\mathbf{z}}^{(b)}$$ that follow the product of the marginals.

(Is this really better than re-sampling from the prior $$p_z(\mathbf{z})$$?)

• Obtain $$\tilde{\mathbf{x}}^{(i)} = G_{\omega}(\mathbf{z}^{(i)})$$.

• Calculate: \begin{align} \mathcal{L}_H &= \frac{1}{b} \sum_{i=1}^b \left[ \log \sigma(T_{\phi}(\tilde{\mathbf{x}}^{(i)},\mathbf{z}^{(i)})) + \log\left(1 - \sigma(T_{\phi}(\tilde{\mathbf{x}}^{(i)},\tilde{\mathbf{z}}^{(i)}))\right) \right] \\ \mathcal{L}_G &= \frac{1}{b} \sum_{i=1}^b E_{\theta}(\tilde{\mathbf{x}}^{(i)}) - \mathcal{L}_H \end{align}

• Gradient descent on $$\omega$$ and gradient ascent on $$\phi$$ (so that $$T_{\phi}$$ tightens the mutual information estimate): \begin{align} \omega^{t+1} &= \omega^{t} - \eta \, \nabla_{\omega} \mathcal{L}_G \\ \phi^{t+1} &= \phi^{t} + \eta \, \nabla_{\phi} \mathcal{L}_H \end{align} Equivalently, descend $$\mathcal{L}_G$$ w.r.t. both $$\omega$$ and $$\phi$$, since $$\nabla_{\phi} \mathcal{L}_G = -\nabla_{\phi} \mathcal{L}_H$$.
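Putting the pieces together, a sketch of one outer iteration under the assumptions above (`real_loader` is assumed to be an iterator over data batches; the optimizers are ordinary PyTorch optimizers; the prior is taken to be a standard normal). A single backward pass through $$\mathcal{L}_G$$ gives descent for $$\omega$$ and, because $$\nabla_{\phi} \mathcal{L}_G = -\nabla_{\phi} \mathcal{L}_H$$, ascent on $$\mathcal{L}_H$$ for $$\phi$$:

```python
def train_step(energy, generator, statistics_net,
               opt_E, opt_G, opt_T, real_loader,
               z_dim, n_critic=5, b=64, lam=10.0):
    # --- Discriminator (energy) updates ---
    for _ in range(n_critic):
        x_real = next(real_loader)          # batch from p_d
        z = torch.randn(b, z_dim)           # batch from p_z (standard normal assumed)
        x_fake = generator(z).detach()      # block generator gradients
        opt_E.zero_grad()
        discriminator_loss(energy, x_real, x_fake, lam).backward()
        opt_E.step()

    # --- Generator / statistics-network update ---
    z = torch.randn(b, z_dim)
    # Per-dimension shuffle: an independent batch permutation in every latent dim.
    z_marginal = torch.stack(
        [z[torch.randperm(b), j] for j in range(z_dim)], dim=1)
    x_fake = generator(z)
    mi = estimate_mi(statistics_net, x_fake, z, z_marginal)  # L_H
    loss_G = energy(x_fake).mean() - mi                      # L_G

    opt_G.zero_grad()
    opt_T.zero_grad()
    loss_G.backward()  # theta also gets grads here, but opt_E zeroes them before each of its steps
    opt_G.step()       # descends L_G w.r.t. omega
    opt_T.step()       # grad_phi L_G = -grad_phi L_H, so this step ascends L_H w.r.t. phi
```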

In summary, this algorithm:

• Minimizes $$\mathcal{L}_E$$ w.r.t. $$\theta$$ =>
• Pushes the energy $$E_{\theta}$$ down on data samples and up on generated samples, subject to the gradient penalty $$\Omega$$.
• Maximizes $$\mathcal{L}_H$$ w.r.t. $$\phi$$ =>
• Tightens the estimate of the mutual information $$I_{p_G}(X;Z)$$ produced by $$T_{\phi}$$.
• Minimizes $$\mathcal{L}_G$$ w.r.t. $$\omega$$ =>
• Maximizes the mutual information regularizer $$I_{p_G}(X;Z)$$, i.e. the entropy proxy $$H_{p_G}[X]$$;
• Minimizes the energy of generated samples, $$\mathbb{E}_{p_z(\mathbf{z})}\left[ E_{\theta}(G_{\omega}(\mathbf{z})) \right]$$.

### Latent Space MCMC

Kumar et al. (2019) also proposed an MCMC procedure that refines the $$\mathbf{z}$$ samples drawn from the prior $$p_z(\mathbf{z})$$, using the energy function induced on $$\mathbf{z}$$: $E(\mathbf{z}) = E_{\theta}(G_{\omega}(\mathbf{z}))$ The Metropolis-adjusted Langevin algorithm (MALA) is then used to sample $$\mathbf{z}$$:

• Langevin dynamics (the proposal step): $\mathbf{z}^{(t+1)} = \mathbf{z}^{(t)} - \alpha \frac{\partial E_{\theta}(G_{\omega}(\mathbf{z}^{(t)}))}{\partial \mathbf{z}^{(t)}} + \sqrt{2\alpha}\,\epsilon$ where $$\epsilon \sim \mathcal{N}(\mathbf{0},\mathbf{I})$$.

• Metropolis-Hastings correction: \begin{align} r &= \min\left\{ 1, \frac{p(\mathbf{z}^{(t+1)})}{p(\mathbf{z}^{(t)})} \cdot \frac{q(\mathbf{z}^{(t)}|\mathbf{z}^{(t+1)})}{q(\mathbf{z}^{(t+1)}|\mathbf{z}^{(t)})} \right\} \\ p(\mathbf{z}^{(t)}) &\propto \exp\left\{ -E_{\theta}(G_{\omega}(\mathbf{z}^{(t)})) \right\} \\ q(\mathbf{z}^{(t+1)}|\mathbf{z}^{(t)}) &\propto \exp\left( -\frac{1}{4 \alpha}\left\| \mathbf{z}^{(t+1)} - \mathbf{z}^{(t)} + \alpha \frac{\partial E_{\theta}(G_{\omega}(\mathbf{z}^{(t)}))}{\partial \mathbf{z}^{(t)}} \right\|^2_2 \right) \end{align} The proposal $$\mathbf{z}^{(t+1)}$$ is accepted with probability $$r$$; otherwise $$\mathbf{z}^{(t)}$$ is kept.
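A sketch of this latent-space MALA loop under the same assumptions (the step size `alpha` and the number of steps `n_steps` are hypothetical hyperparameters; `z` is a batch of latent vectors of shape `(b, z_dim)`):

```python
@torch.no_grad()
def latent_mala(energy, generator, z, alpha=0.01, n_steps=50):
    """Refine latent samples with MALA on E(z) = E_theta(G_omega(z))."""
    def grad_E(z):
        with torch.enable_grad():
            z = z.detach().requires_grad_(True)
            return torch.autograd.grad(energy(generator(z)).sum(), z)[0]

    def log_q(z_to, z_from):
        # log q(z_to | z_from), up to an additive constant
        mean = z_from - alpha * grad_E(z_from)
        return -((z_to - mean) ** 2).flatten(1).sum(dim=1) / (4 * alpha)

    for _ in range(n_steps):
        noise = torch.randn_like(z)
        z_prop = z - alpha * grad_E(z) + (2 * alpha) ** 0.5 * noise
        # log acceptance ratio: [-E(z') + log q(z|z')] - [-E(z) + log q(z'|z)]
        log_r = (-energy(generator(z_prop)) + log_q(z, z_prop)) \
              - (-energy(generator(z)) + log_q(z_prop, z))
        accept = (torch.rand(z.size(0)) < log_r.exp()).float().unsqueeze(1)
        z = accept * z_prop + (1.0 - accept) * z
    return z
```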

## References

Kumar, Rithesh, Anirudh Goyal, Aaron Courville, and Yoshua Bengio. 2019. "Maximum Entropy Generators for Energy-Based Models." arXiv:1901.08508 [cs, stat]. http://arxiv.org/abs/1901.08508.