Gradient Tricks


Adaptive Clipping

To avoid the optimizer putting too much attention on just one of the loss components (or adversarial losses), adaptive clipping can be adopted (Belghazi et al. 2018), to match the gradient scales of different losses.

For example, Belghazi et al. (2018) adapts \(g_m\) to match the scale of \(g_u\), where \(g_u\) is the main loss gradient, and \(g_m\) is the gradient of the mutual information regularizer: \[ \tilde{g}_m = \min\left( \|g_u\|, \|g_m\| \right) \frac{g_m}{\|g_m\|} \]


Belghazi, Mohamed Ishmael, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R. Devon Hjelm. 2018. “Mine: Mutual Information Neural Estimation.” arXiv Preprint arXiv:1801.04062.