For example, Belghazi et al. (2018) rescale $$g_m$$ so that its norm never exceeds that of $$g_u$$, where $$g_u$$ is the gradient of the main loss and $$g_m$$ is the gradient of the mutual information regularizer. The direction of $$g_m$$ is kept and only its magnitude is clipped:

$$
\tilde{g}_m = \min\left( \|g_u\|, \|g_m\| \right) \frac{g_m}{\|g_m\|}
$$
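
The clipping rule can be sketched in a few lines of plain Python. This is only an illustration of the formula, not the authors' implementation: the function names `l2_norm` and `adapt_mi_gradient`, the flat-list representation of gradients, and the `eps` guard against a zero-norm $$g_m$$ are all assumptions made for this example.

```python
import math


def l2_norm(g):
    """Euclidean norm of a gradient given as a flat list of floats."""
    return math.sqrt(sum(x * x for x in g))


def adapt_mi_gradient(g_u, g_m, eps=1e-12):
    """Rescale g_m so that its norm is at most ||g_u||, keeping its direction.

    Implements g_m_tilde = min(||g_u||, ||g_m||) * g_m / ||g_m||.
    The eps guard (an assumption, not part of the formula) avoids
    dividing by zero when g_m vanishes.
    """
    norm_u = l2_norm(g_u)
    norm_m = l2_norm(g_m)
    if norm_m < eps:
        return list(g_m)
    scale = min(norm_u, norm_m) / norm_m
    return [scale * x for x in g_m]


g_u = [3.0, 4.0]       # main loss gradient, norm 5
g_large = [6.0, 8.0]   # regularizer gradient with norm 10: gets clipped
g_small = [0.3, 0.4]   # regularizer gradient with norm 0.5: left as is

clipped = adapt_mi_gradient(g_u, g_large)
unchanged = adapt_mi_gradient(g_u, g_small)
```

When $$\|g_m\| \le \|g_u\|$$ the scale factor is 1 and the regularizer gradient passes through untouched; otherwise it is shrunk to have exactly the norm of $$g_u$$, so the regularizer cannot dominate the update.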