## First-Order Methods

For simplicity, we first introduce the following common notations:

• $$J(\theta)$$: The objective function, which should be minimized w.r.t. $$\theta$$.
• $$g(\theta)$$: The gradient of $$J(\theta)$$, i.e., $$g(\theta) = \nabla J(\theta)$$.
• $$g_t$$: The abbreviation for $$g(\theta_t)$$.

One key difference between this article and (“An Overview of Gradient Descent Optimization Algorithms” 2016) is that $$\eta$$ is applied to the whole delta when updating the parameters $$\theta_t$$, including the momentum term. This is because, in practice, one may expect $$\eta$$ to serve as a direct constraint on how much the parameters $$\theta_t$$ are updated within each training step.

### SGD

$\theta_{t+1} = \theta_t - \eta \, g_t$
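As a minimal sketch of this update rule (using a scalar parameter on the toy objective $$J(\theta) = \theta^2$$, whose gradient is $$2\theta$$; the function name is illustrative, not from any particular library):

```python
def sgd_step(theta, grad, eta=0.1):
    """One step of vanilla SGD: theta <- theta - eta * g(theta)."""
    return theta - eta * grad

# Minimize J(theta) = theta**2 (gradient: 2 * theta) from theta = 1.0.
theta = 1.0
for _ in range(100):
    theta = sgd_step(theta, 2.0 * theta)
```

With $$\eta = 0.1$$, each step multiplies $$\theta$$ by $$1 - 2\eta = 0.8$$, so the iterate shrinks geometrically toward the minimizer $$0$$.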

### Momentum SGD

\begin{align} v_t &= \gamma \, v_{t-1} + g_t \\ \theta_{t+1} &= \theta_t - \eta \,v_t \end{align}

It generally converges faster than naive stochastic gradient descent, because the accumulated $$v_t$$ helps cancel out the unrelated directions of the gradient and amplify the most consistent directions. The amplification ratio is roughly $$\frac{1}{1 - \gamma}$$.
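A minimal scalar sketch of this update, with $$\eta$$ scaling the whole velocity (including the momentum term) as this article's convention prescribes; tested on the toy objective $$J(\theta) = \theta^2$$:

```python
def momentum_step(theta, v, grad, eta=0.1, gamma=0.9):
    """Momentum SGD; eta scales the whole velocity v_t, per this article."""
    v = gamma * v + grad          # v_t = gamma * v_{t-1} + g_t
    theta = theta - eta * v       # theta_{t+1} = theta_t - eta * v_t
    return theta, v

# Minimize J(theta) = theta**2 from theta = 1.0.
theta, v = 1.0, 0.0
for _ in range(200):
    theta, v = momentum_step(theta, v, 2.0 * theta)
```

On this quadratic the iterates spiral toward the minimizer, contracting by roughly $$\sqrt{\gamma}$$ per step.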

### Nesterov Momentum SGD

\begin{align} v_t &= \gamma \,v_{t-1} + g(\theta_t - \eta\, \gamma \,v_{t-1}) \\ \theta_{t+1} &= \theta_t - \eta\, v_t \end{align}

#### Alternative Form

Dozat (2016) proposed to combine the two momentum steps into one, resulting in: \begin{align} v_t &= \gamma \, v_{t-1} + g_t \\ \theta_{t+1} &= \theta_t - \eta \, (\gamma \,v_t + g_t) \end{align} which may be preferable in practice.
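A scalar sketch of this single-step form (illustrative names; toy objective $$J(\theta)=\theta^2$$ again):

```python
def nesterov_step(theta, v, grad, eta=0.1, gamma=0.9):
    """Dozat's one-step form of Nesterov momentum SGD."""
    v = gamma * v + grad                      # v_t = gamma * v_{t-1} + g_t
    theta = theta - eta * (gamma * v + grad)  # look-ahead folded into the update
    return theta, v

# Minimize J(theta) = theta**2 from theta = 1.0.
theta, v = 1.0, 0.0
for _ in range(200):
    theta, v = nesterov_step(theta, v, 2.0 * theta)
```

Note the only difference from the momentum step above is that the parameter update uses $$\gamma\,v_t + g_t$$ (the *current* velocity) instead of $$v_t = \gamma\,v_{t-1} + g_t$$.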

Proof

Let $$\hat{\theta}_t = \theta_t - \eta\,\gamma\,v_{t-1}$$ and $$\hat{g}_t = g(\hat{\theta}_t) = g(\theta_t - \eta\, \gamma \,v_{t-1})$$, and substitute into the original form; we then obtain: \begin{align} v_t &= \gamma\,v_{t-1} + \hat{g}_t \\ \hat{\theta}_{t+1} &= \theta_{t+1} - \eta\,\gamma\,v_t \\ &= \theta_t - \eta\,v_t - \eta\,\gamma\,v_t \\ &= \theta_t - \eta(\gamma\,v_{t-1} + \hat{g}_t) - \eta\,\gamma\,v_t \\ &= (\theta_t - \eta\,\gamma\,v_{t-1}) - \eta\,(\hat{g}_t + \gamma\,v_t) \\ &= \hat{\theta}_t - \eta\,(\gamma\,v_t + \hat{g}_t) \end{align} Dropping all the $$\hat{ }$$ marks then gives the conclusion.

#### Convergence Analysis

This is a modified version of the above momentum SGD, which generally converges faster. (“比Momentum更快：揭开Nesterov Accelerated Gradient的真面目,” n.d.) suggests that this difference may be caused by the (approximately) second-order property of Nesterov momentum SGD, since if we let: \begin{align} \hat{\theta}_t &= \theta_t - \eta\,\gamma\,v_{t-1} \\ \hat{v}_t &= \gamma^2 \,v_{t-1} + (\gamma + 1) \, g(\hat{\theta}_t) \end{align} we can obtain the following iterative equations: \begin{align} \hat{v}_t &= \gamma\,\hat{v}_{t-1} + g(\hat{\theta}_t) + \gamma\left[ g(\hat{\theta}_t) - g(\hat{\theta}_{t-1}) \right] \\ \hat{\theta}_{t+1} &= \hat{\theta}_t - \eta \, \hat{v}_t \end{align} This suggests that Nesterov momentum SGD uses an approximation of second-order information (the gradient difference $$g(\hat{\theta}_t) - g(\hat{\theta}_{t-1})$$) to revise the trajectory produced by the first-order gradient $$g(\hat{\theta}_t)$$.

Proof

From the original equation of Nesterov momentum SGD, we have: \begin{align} \theta_{t+1} - \eta\,\gamma\,v_t &= \theta_t - \eta\,(\gamma+1)\,v_t \\ &= \theta_t - \eta\,(\gamma+1)\,\big( \gamma\,v_{t-1} + g(\theta_t - \eta\,\gamma\,v_{t-1}) \big) \\ &= \theta_t - \eta\, \gamma\,v_{t-1} - \eta\,\gamma^2\,v_{t-1} - \eta\,(\gamma+1)\,g(\theta_t - \eta\,\gamma\,v_{t-1}) \end{align} Substituting $$\hat{\theta}_t$$ and $$\hat{v}_t$$ into the above equation, we have: $\hat{\theta}_{t+1} = \hat{\theta}_t - \eta \, \hat{v}_t$ Next, we deal with the term $$\hat{v}_t$$. Substituting $$v_t$$, we have: \begin{align} \hat{v}_t &= (\gamma+1) \, g(\hat{\theta}_t) + \gamma^2\,v_{t-1} \\ &= (\gamma+1) \, g(\hat{\theta}_t) + \gamma^2 \left( \gamma \, v_{t-2} + g(\hat{\theta}_{t-1})\right) \\ &= (\gamma+1) \, g(\hat{\theta}_t) + \gamma^2 \, g(\hat{\theta}_{t-1}) + \gamma^3\left( \gamma\,v_{t-3} + g(\hat{\theta}_{t-2}) \right) \\ &= (\gamma+1) \, g(\hat{\theta}_t) + \gamma^2 \, g(\hat{\theta}_{t-1}) + \gamma^3 \, g(\hat{\theta}_{t-2}) + \dots \\ \hat{v}_{t-1} &= \qquad \qquad \quad \;\, (\gamma+1) \, g(\hat{\theta}_{t-1}) + \gamma^2 \, g(\hat{\theta}_{t-2}) + \dots \end{align}

Subtracting $$\gamma \, \hat{v}_{t-1}$$ from $$\hat{v}_t$$, we have: $\hat{v}_t - \gamma\,\hat{v}_{t-1} = (\gamma+1) \, g(\hat{\theta}_t) - \gamma \, g(\hat{\theta}_{t-1})$ We thus obtain: $\hat{v}_t = \gamma \, \hat{v}_{t-1} + g(\hat{\theta}_t) + \gamma\left[ g(\hat{\theta}_t) - g(\hat{\theta}_{t-1}) \right]$

### Adagrad

Duchi, Hazan, and Singer (2011) proposed a method that adapts the learning rate according to the update rate of each parameter: frequently updated parameters get a smaller learning rate, and infrequently updated parameters get a larger one. This mechanism is well-suited for sparse data (“An Overview of Gradient Descent Optimization Algorithms” 2016). \begin{align} G_t &= \sum_{\tau=1}^t g_{\tau}^2 \\ \theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t \end{align} where $$\odot$$ denotes the element-wise multiplication between two vectors. Here $$g_t^2 = g_t \odot g_t$$ is the element-wise square of $$g_t$$, and $$G_t$$ is the sum of squares of all gradients of $$\theta$$ since the beginning of training. The small constant $$\epsilon$$ is added to avoid division by zero.
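A scalar sketch of the accumulation and update (illustrative names; $$\eta = 0.5$$ here is an arbitrary choice for the toy objective $$J(\theta)=\theta^2$$, not a suggested default):

```python
def adagrad_step(theta, G, grad, eta=0.5, eps=1e-8):
    """Adagrad: the effective learning rate shrinks as squared gradients accumulate."""
    G = G + grad * grad                          # G_t = G_{t-1} + g_t^2
    theta = theta - eta / (G + eps) ** 0.5 * grad
    return theta, G

# Minimize J(theta) = theta**2 from theta = 1.0.
theta, G = 1.0, 0.0
for _ in range(100):
    theta, G = adagrad_step(theta, G, 2.0 * theta)
```

Because $$G_t$$ only grows, the effective learning rate $$\eta/\sqrt{G_t+\epsilon}$$ decreases monotonically over training, which motivates the Adadelta and RMSprop variants below.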

### Adadelta

To avoid the vanishing learning rate of Adagrad, Zeiler (2012) proposed to use an exponential moving average to estimate the expectation of $$g_t^2$$ (denoted as $$E[g^2]$$), instead of summing up all squared gradients.

It also maintains a moving average of the squared parameter updates $$\Delta \theta$$ (denoted as $$E[\Delta\theta^2]$$) to match the units of $$E[g^2]_t$$. \begin{align} E[g^2]_t &= \gamma \,E[g^2]_{t-1} + (1-\gamma) \,g_t^2 \\ E[\Delta\theta^2]_t &= \gamma\,E[\Delta\theta^2]_{t-1} + (1-\gamma)\,(\Delta\theta_t)^2 \\ RMS[g]_t &= \sqrt{E[g^2]_t + \epsilon} \\ RMS[\Delta\theta]_t &= \sqrt{E[\Delta\theta^2]_t + \epsilon} \\ \Delta \theta_t &= -\frac{RMS[\Delta\theta]_{t-1}}{RMS[g]_t} \odot g_t \\ \theta_{t+1} &= \theta_t + \eta \,\Delta \theta_t \end{align} where the learning rate $$\eta$$ is chosen to be 1 in the original paper. Note that the numerator uses $$RMS[\Delta\theta]_{t-1}$$ (from the previous step), while the denominator uses $$RMS[g]_t$$ (from the current step).
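A scalar sketch of one Adadelta step, following the order of operations in Zeiler's paper: update $$E[g^2]$$, compute $$\Delta\theta_t$$ from the previous $$E[\Delta\theta^2]$$, then update $$E[\Delta\theta^2]$$ (names illustrative; $$\rho = 0.95$$, $$\epsilon = 10^{-6}$$ are the paper's suggested values):

```python
def adadelta_step(theta, Eg2, Edx2, grad, gamma=0.95, eps=1e-6):
    """One Adadelta step (eta = 1, as in the original paper)."""
    Eg2 = gamma * Eg2 + (1 - gamma) * grad * grad
    rms_g = (Eg2 + eps) ** 0.5
    rms_dx = (Edx2 + eps) ** 0.5              # uses the *previous* E[dtheta^2]
    dtheta = -(rms_dx / rms_g) * grad
    Edx2 = gamma * Edx2 + (1 - gamma) * dtheta * dtheta
    return theta + dtheta, Eg2, Edx2

# One step from theta = 1.0 on J(theta) = theta**2 (gradient 2.0).
theta, Eg2, Edx2 = adadelta_step(1.0, 0.0, 0.0, 2.0)
```

On the very first step the numerator is just $$\sqrt{\epsilon}$$, so early updates are tiny; the step size then calibrates itself from the accumulated update statistics.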

### RMSprop

Hinton proposed an unpublished SGD method based on Adagrad, also avoiding a vanishing learning rate (see Lecture 6e of his Coursera class). \begin{align} E[g^2]_t &= \gamma \,E[g^2]_{t-1} + (1-\gamma)\,g_t^2 \\ \theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \odot g_t \end{align} with the denominator exactly the same as in Adadelta. The moving-average decay factor $$\gamma$$ is suggested to be 0.9, while the initial learning rate $$\eta$$ is suggested to be 0.001.
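A scalar sketch using the suggested defaults (names illustrative):

```python
def rmsprop_step(theta, Eg2, grad, eta=0.001, gamma=0.9, eps=1e-8):
    """RMSprop: Adagrad's sum replaced by an exponential moving average."""
    Eg2 = gamma * Eg2 + (1 - gamma) * grad * grad
    theta = theta - eta / (Eg2 + eps) ** 0.5 * grad
    return theta, Eg2

# One step from theta = 1.0 with gradient 2.0:
# E[g^2] = 0.1 * 4 = 0.4, so the step is 0.001 / sqrt(0.4) * 2.
theta, Eg2 = rmsprop_step(1.0, 0.0, 2.0)
```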

### Adam

In addition to tracking the moving average of the squared gradients, Kingma and Ba (2017) proposed to also track the moving average of the gradients themselves. \begin{align} m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,g_t \\ v_t &= \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2 \end{align} The authors also proposed to apply a bias-correction (“zero-debias”) term to the moving-average estimates: \begin{align} \hat{m}_t &= \frac{m_t}{1-\beta_1^t} \\ \hat{v}_t &= \frac{v_t}{1-\beta_2^t} \end{align} The parameters are then updated accordingly: $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \odot \hat{m}_t$ By default, the hyper-parameters are suggested to be $$\beta_1 = 0.9$$, $$\beta_2 = 0.999$$, $$\epsilon = 10^{-8}$$.
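A scalar sketch of one Adam step (names illustrative; `t` is the 1-based step count used for bias correction):

```python
def adam_step(theta, m, v, t, grad, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with bias-corrected first and second moment estimates."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)           # zero-debias the first moment
    v_hat = v / (1 - beta2 ** t)           # zero-debias the second moment
    theta = theta - eta * m_hat / (v_hat ** 0.5 + eps)
    return theta, m, v

# At t = 1, m_hat = g and v_hat = g^2, so the first step has magnitude
# almost exactly eta regardless of the gradient's scale.
theta, m, v = adam_step(1.0, 0.0, 0.0, 1, 2.0)
```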

### Adamax

As a variant of Adam, Adamax uses the $$l_{\infty}$$ norm instead of the $$l_2$$ norm for the denominator: \begin{align} m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,g_t \\ \hat{m}_t &= \frac{m_t}{1-\beta_1^t} \\ u_t &= \left( \beta_2^{\infty} \, u_{t-1}^{\infty} + (1-\beta_2^{\infty}) \,\left| g_t \right|^{\infty} \right)^{1/\infty} \\ &= \max\left( \beta_2\,u_{t-1}, \left| g_t \right| \right) \\ \theta_{t+1} &= \theta_t - \frac{\eta}{u_t} \, \hat{m}_t \end{align} By default, the hyper-parameters are suggested to be $$\beta_1 = 0.9$$ and $$\beta_2 = 0.999$$; no $$\epsilon$$ is needed here, since the max-based accumulator $$u_t$$ is not biased towards zero.
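A scalar sketch of one Adamax step (names illustrative; $$\eta = 0.002$$ is the default suggested in the Adam paper's Adamax section):

```python
def adamax_step(theta, m, u, t, grad, eta=0.002, beta1=0.9, beta2=0.999):
    """One Adamax step; the l-infinity accumulator u_t needs no epsilon."""
    m = beta1 * m + (1 - beta1) * grad
    m_hat = m / (1 - beta1 ** t)
    u = max(beta2 * u, abs(grad))        # exponentially weighted infinity norm
    theta = theta - eta / u * m_hat
    return theta, m, u

# First step from theta = 1.0 with gradient 2.0: u = 2, m_hat = 2,
# so the update magnitude is exactly eta.
theta, m, u = adamax_step(1.0, 0.0, 0.0, 1, 2.0)
```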

### Nadam

Dozat (2016) incorporates Nesterov momentum into Adam.

Comparing the momentum method: \begin{align} v_t &= \gamma \, v_{t-1} + g_t \\ \theta_{t+1} &= \theta_t - \eta \,v_t \\ &= \theta_t - \eta\,(\gamma\,v_{t-1} + g_t) \end{align}

with the “alternative form” of Nesterov Momentum method: \begin{align} v_t &= \gamma \, v_{t-1} + g_t \\ \theta_{t+1} &= \theta_t - \eta \, (\gamma \,v_t + g_t) \end{align}

It is clear that we can obtain the Nesterov momentum method by replacing $$v_{t-1}$$ with $$v_t$$ in the term $$\gamma\,v_{t-1} + g_t$$ of the momentum method.

We first seek to write Adam in such a form. Given that: \begin{align} m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,g_t \\ \hat{m}_t &= \frac{m_t}{1-\beta_1^t} \end{align} We have: \begin{align} \theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \odot \hat{m}_t \\ &= \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \odot \frac{m_t}{1-\beta_1^t} \\ &= \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \odot \left( \frac{\beta_1}{1-\beta_1^t} \cdot m_{t-1} + \frac{1-\beta_1}{1-\beta_1^t} \cdot g_t \right) \end{align}

If we use the approximation $$m_{t-1}/(1-\beta_1^t) \approx m_{t-1}/(1-\beta_1^{t-1}) = \hat{m}_{t-1}$$, which becomes exact as $$t \to \infty$$, we obtain: $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\odot \left( \beta_1\,\hat{m}_{t-1} + \frac{1-\beta_1}{1-\beta_1^t} \cdot g_t \right)$ Replacing $$\hat{m}_{t-1}$$ with $$\hat{m}_t$$, in analogy to the Nesterov momentum method, we finally obtain: $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\odot \left( \beta_1\,\hat{m}_t + \frac{1-\beta_1}{1-\beta_1^t} \cdot g_t \right)$

Note that $$\hat{v}_t$$ is not changed in Nadam. In summary, we get the following update equations for Nadam: \begin{align} v_t &= \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2 \\ \hat{v}_t &= \frac{v_t}{1-\beta_2^t} \\ m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,g_t \\ \hat{m}_t &= \frac{m_t}{1-\beta_1^t} \\ \theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\odot \left( \beta_1\,\hat{m}_t + \frac{1-\beta_1}{1-\beta_1^t} \cdot g_t \right) \end{align}
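The summary equations can be sketched for a scalar parameter as follows (names illustrative):

```python
def nadam_step(theta, m, v, t, grad, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Nadam step: Adam with a Nesterov-style look-ahead on m_hat."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # The look-ahead term beta1 * m_hat + (1 - beta1) / (1 - beta1**t) * g_t
    lookahead = beta1 * m_hat + (1 - beta1) / (1 - beta1 ** t) * grad
    theta = theta - eta * lookahead / (v_hat ** 0.5 + eps)
    return theta, m, v

# First step from theta = 1.0 with gradient 2.0:
# m_hat = 2, v_hat = 4, lookahead = 0.9 * 2 + 1.0 * 2 = 3.8.
theta, m, v = nadam_step(1.0, 0.0, 0.0, 1, 2.0)
```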

### AMSGrad

The aggressive moving average strategy used by Adam may prevent it from converging on some problems (Reddi, Kale, and Kumar 2019). AMSGrad was thus proposed as a modified version of Adam: \begin{align} m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,g_t \\ v_t &= \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2 \\ \hat{v}_t &= \max(\hat{v}_{t-1}, v_t) \\ \theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \odot m_t \end{align} The authors discarded the bias-correction terms $$1 / (1 - \beta_1^t)$$ and $$1 / (1 - \beta_2^t)$$ for simplicity, but in practice some implementations still apply them, resulting in: \begin{align} m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,g_t \\ v_t &= \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2 \\ u_t &= \max(u_{t-1}, v_t) \\ \hat{m}_t &= \frac{m_t}{1 - \beta_1^t} \\ \hat{u}_t &= \frac{u_t}{1 - \beta_2^t} \\ \theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{\hat{u}_t} + \epsilon} \odot \hat{m}_t \end{align}
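A scalar sketch of the paper's (uncorrected) variant; the key difference from Adam is the `max`, which keeps the denominator from ever shrinking (names illustrative):

```python
def amsgrad_step(theta, m, v, v_max, grad,
                 eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad step (no bias correction, as in the original paper)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    v_max = max(v_max, v)          # monotone: the denominator never decreases
    theta = theta - eta * m / (v_max ** 0.5 + eps)
    return theta, m, v, v_max

# First step from theta = 1.0 with gradient 2.0.
theta, m, v, v_max = amsgrad_step(1.0, 0.0, 0.0, 0.0, 2.0)
```

This monotone denominator guarantees a non-increasing effective learning rate, which is the property the convergence proof relies on.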

# References

“An Overview of Gradient Descent Optimization Algorithms.” 2016. https://ruder.io/optimizing-gradient-descent/.

Dozat, Timothy. 2016. “Incorporating Nesterov Momentum into Adam.”

Duchi, John, Elad Hazan, and Yoram Singer. 2011. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” Journal of Machine Learning Research 12 (61): 2121–59.

Kingma, Diederik P., and Jimmy Ba. 2017. “Adam: A Method for Stochastic Optimization.” arXiv:1412.6980 [Cs], January. http://arxiv.org/abs/1412.6980.

Reddi, Sashank J., Satyen Kale, and Sanjiv Kumar. 2019. “On the Convergence of Adam and Beyond.” arXiv:1904.09237 [Cs, Math, Stat], April. http://arxiv.org/abs/1904.09237.

Zeiler, Matthew D. 2012. “ADADELTA: An Adaptive Learning Rate Method.” arXiv:1212.5701 [Cs], December. http://arxiv.org/abs/1212.5701.

“比Momentum更快：揭开Nesterov Accelerated Gradient的真面目 [Faster Than Momentum: Unveiling the True Face of Nesterov Accelerated Gradient].” n.d. Zhihu column (知乎专栏). https://zhuanlan.zhihu.com/p/22810533.

# Visualizing High Dimensional Space

## Orthogonal Projection

Izmailov et al. (2018) proposed to project weight tensors $$w$$ onto a 2-d plane (i.e., to obtain the projection $$w(x,y)$$) by constructing an orthogonal basis $$(\hat{u},\hat{v})$$ from three given weight tensors $$(w_1, w_2, w_3)$$: \begin{align} u &= w_2 - w_1 \\ v &= (w_3 - w_1) - \frac{\left<w_3-w_1, w_2-w_1\right>}{\left\| w_2-w_1 \right\|^2} \cdot (w_2 - w_1) \\ \hat{u} &= \frac{u}{\left\| u \right\|} \\ \hat{v} &= \frac{v}{\left\| v \right\|} \\ w(x,y) &= w_1 + x\cdot \hat{u} + y \cdot \hat{v} \end{align}
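This construction is plain Gram-Schmidt on $$w_2 - w_1$$ and $$w_3 - w_1$$. A sketch with Python lists standing in for flattened weight tensors (function names illustrative):

```python
def projection_basis(w1, w2, w3):
    """Orthonormal basis (u_hat, v_hat) of the plane through w1, w2, w3."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    u = [b - a for a, b in zip(w1, w2)]                 # u = w2 - w1
    d = [c - a for a, c in zip(w1, w3)]                 # d = w3 - w1
    coef = dot(d, u) / dot(u, u)                        # <d, u> / ||u||^2
    v = [di - coef * ui for ui, di in zip(u, d)]        # remove u-component
    nu, nv = dot(u, u) ** 0.5, dot(v, v) ** 0.5
    return [x / nu for x in u], [x / nv for x in v]

def point_on_plane(w1, u_hat, v_hat, x, y):
    """w(x, y) = w1 + x * u_hat + y * v_hat."""
    return [a + x * ui + y * vi for a, ui, vi in zip(w1, u_hat, v_hat)]

# Toy example in 3-d: the plane through the three points is the xy-plane.
u_hat, v_hat = projection_basis([0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0])
```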

# References

Izmailov, Pavel, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. 2018. “Averaging Weights Leads to Wider Optima and Better Generalization.” arXiv Preprint arXiv:1803.05407.

# Loss Surface and Generalization

## Stochastic Weight Averaging

Izmailov et al. (2018) suggested that averaging the weights collected along the SGD trajectory (Stochastic Weight Averaging) leads to wider optima and better generalization.

The corresponding loss-surface visualization method can be found at Orthogonal Projection.

# References

Izmailov, Pavel, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. 2018. “Averaging Weights Leads to Wider Optima and Better Generalization.” arXiv Preprint arXiv:1803.05407.