Adaptive Learning Rate Method

Gradient descent

Before adaptive learning rate methods were introduced, gradient descent algorithms including Batch Gradient Descent (BGD), Stochastic Gradient Descent (SGD) and mini-Batch Gradient Descent (mini-BGD, a mixture of BGD and SGD) were state of the art. In essence, these methods update the weights $\begin{array}{l}\mathrm{\theta}\end{array}$ of the network with the help of a learning rate $\begin{array}{l}\mathrm{\eta}\end{array}$, the objective function $\begin{array}{l}\mathrm{J(\theta)}\end{array}$ and its gradient $\begin{array}{l}\mathrm{\nabla\,J(\theta)}\end{array}$. What all gradient descent algorithms and their improvements have in common is the goal of minimizing $\begin{array}{l}\mathrm{J(\theta)}\end{array}$ in order to find the optimal weights $\begin{array}{l}\mathrm{\theta}\end{array}$.

The simplest of the three is the BGD.

$\begin{array}{l}\theta = \theta - \eta \cdot \nabla_{\theta} J(\theta)\end{array}$

It tries to reach the minimum of $\begin{array}{l}\mathrm{J(\theta)}\end{array}$ by subtracting from $\begin{array}{l}\mathrm{\theta}\end{array}$ the gradient of $\begin{array}{l}\mathrm{J(\theta)}\end{array}$, scaled by the learning rate $\begin{array}{l}\mathrm{\eta}\end{array}$ (refer to Figure 1 for a visualization). For each update, the algorithm computes the gradient over the whole data set. This makes BGD the slowest of the three and prevents it from updating online. Additionally, it performs redundant computations for large data sets, because it recomputes the gradients of similar examples at each update, and it converges to the closest minimum for the given data, which can lead to suboptimal results.
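The BGD update can be tried out on a small least-squares problem. The following is a minimal sketch, not part of the original article: the objective $J(\theta) = \frac{1}{2N}\sum_i (\theta_0 + \theta_1 x_i - y_i)^2$, the five data points and the names `batch_gradient` and `bgd` are assumptions chosen for illustration.

```python
def batch_gradient(theta, xs, ys):
    """Gradient of J(theta) = 1/(2N) * sum((theta0 + theta1*x - y)^2) over ALL examples."""
    n = len(xs)
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        err = theta[0] + theta[1] * x - y
        g0 += err
        g1 += err * x
    return [g0 / n, g1 / n]


def bgd(xs, ys, eta=0.1, nb_epochs=2000):
    """Batch gradient descent: exactly one update per pass over the whole data set."""
    theta = [0.0, 0.0]
    for _ in range(nb_epochs):
        grad = batch_gradient(theta, xs, ys)
        theta = [t - eta * g for t, g in zip(theta, grad)]
    return theta


# Toy data generated from y = 2x + 1 (illustrative assumption).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
theta = bgd(xs, ys)  # approaches intercept 1 and slope 2
```

Every one of the 2000 updates touches all examples, which is harmless for five points but illustrates why the cost per update grows with the size of the data set.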

An often used algorithm is the SGD.

$\begin{array}{l}\theta = \theta - \eta \cdot \nabla_{\theta}J(\theta; x^{(i)}; y^{(i)})\end{array}$

Contrary to BGD, SGD performs an update for each training example $\begin{array}{l}\mathrm{(x^{(i)};\,y^{(i)})}\end{array}$, so every step is based on a single example. These frequent, high-variance updates cause the objective function to fluctuate. This fluctuation enables SGD to jump to minima farther away, potentially reaching a better minimum. However, the same fluctuation also means that SGD can overshoot. This can be counteracted by slowly decreasing the learning rate. In the example code shown in Figure 2, the SGD and mini-BGD algorithms additionally call a shuffle function, which BGD does not. Shuffling is often preferable because it avoids a meaningful order in the data and thereby avoids biasing the optimization algorithm. Sometimes, however, better results can be achieved with ordered data; in that case the shuffle operation should be removed.
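The per-example update, the shuffling and a slowly decreasing learning rate can be combined as in the following sketch. It is an illustration under assumptions, not the article's code: the decay schedule `eta0 / (1 + decay * epoch)`, the toy data and the function name `sgd` are hypothetical choices.

```python
import random


def sgd(xs, ys, eta0=0.05, decay=0.001, nb_epochs=1000, seed=0):
    """SGD for the line y = theta[0] + theta[1] * x, one example per update."""
    rng = random.Random(seed)          # fixed seed keeps the run reproducible
    theta = [0.0, 0.0]
    order = list(range(len(xs)))
    for epoch in range(nb_epochs):
        eta = eta0 / (1.0 + decay * epoch)  # slowly decreasing learning rate
        rng.shuffle(order)                  # avoid a meaningful example order
        for i in order:
            err = theta[0] + theta[1] * xs[i] - ys[i]
            # Gradient of 0.5 * err**2 for this single example only.
            theta = [theta[0] - eta * err, theta[1] - eta * err * xs[i]]
    return theta


# Noise-free toy data from y = 2x + 1 (illustrative assumption).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
theta = sgd(xs, ys)
```

To train on ordered data instead, the `rng.shuffle(order)` line would simply be dropped.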

Lastly, there is the mini-BGD.

$\begin{array}{l}\theta = \theta - \eta \cdot \nabla_{\theta} J(\theta;x^{(i:i+n)};y^{(i:i+n)})\end{array}$

The mini-BGD performs an update for every mini-batch of $\begin{array}{l}\mathrm{n}\end{array}$ training examples. This leads to a more stable convergence by reducing the variance of the parameter updates. When people talk about an SGD algorithm, they often refer to this version.
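The variance reduction can be made concrete by comparing gradient estimates from single examples with estimates from mini-batches at the same $\theta$. The sketch below is an assumption-laden illustration: the objective $J(\theta) = \frac{1}{2n}\sum (\theta - y)^2$, the data and this implementation of a `get_batches` helper are hypothetical.

```python
import random
from statistics import mean, pvariance


def get_batches(data, batch_size):
    """Yield consecutive slices of `data` with at most `batch_size` items."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


def gradient(theta, batch):
    """Gradient of J(theta) = 1/(2n) * sum((theta - y)^2) over the given batch."""
    return mean(theta - y for y in batch)


rng = random.Random(0)
data = [float(i) for i in range(100)]
rng.shuffle(data)

theta = 0.0
# Gradient estimates at the same theta, from single examples and from batches of 10.
single_grads = [gradient(theta, b) for b in get_batches(data, 1)]
batch_grads = [gradient(theta, b) for b in get_batches(data, 10)]

spread_single = pvariance(single_grads)
spread_batch = pvariance(batch_grads)  # smaller: averaging damps the noise
```

Because each batch gradient is an average over ten examples, the estimates scatter less around the full gradient, which is what makes the resulting parameter updates more stable.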

batch gradient descent (BGD)

    for i in range(nb_epochs):
        # one update per epoch, using the whole data set
        params_grad = evaluate_gradient(loss_function, data, params)
        params = params - learning_rate * params_grad

stochastic gradient descent (SGD)

    for i in range(nb_epochs):
        np.random.shuffle(data)
        for example in data:
            # one update per training example
            params_grad = evaluate_gradient(loss_function, example, params)
            params = params - learning_rate * params_grad

mini-batch gradient descent (mini-BGD)

    for i in range(nb_epochs):
        np.random.shuffle(data)
        for batch in get_batches(data, batch_size=50):
            # one update per mini-batch of 50 examples
            params_grad = evaluate_gradient(loss_function, batch, params)
            params = params - learning_rate * params_grad

Figure 2: ⁽¹⁾ Pseudo code of the three gradient descent algorithms

Figure 1:⁽⁵⁾ Local minima may occur in $\begin{array}{l}\mathrm{J(\theta)}\end{array}$ (here $\begin{array}{l}\mathrm{J(w)}\end{array}$ ), which may result in a suboptimal solution for some gradient descent methods.

Adaptive Learning Rate Method

As an improvement to traditional gradient descent algorithms, the adaptive gradient descent optimization algorithms or adaptive learning rate methods can be utilized. Several versions of these algorithms are described below.

Momentum can be seen as an evolution of the SGD.

$\begin{array}{l}v_t = \gamma v_{t-1} + \eta \nabla_{\theta}J(\theta)\\\theta = \theta - v_t\end{array}$

While SGD has problems when the objective function curves much more steeply in one direction than in another, Momentum circumvents that by adding the update vector of the previous time step, multiplied by a factor $\begin{array}{l}\mathrm{\gamma}\end{array}$ (usually around 0.9 ⁽¹⁾), to the current update. As an analogy, one can think of a ball rolling down a slope, gathering momentum (hence the name), while still being slowed by air resistance (0 < $\begin{array}{l}\mathrm{\gamma}\end{array}$ < 1).
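The effect of the momentum term can be checked on a function that is much steeper in one direction than in the other. The following is a minimal sketch under assumptions: the quadratic $J(\theta) = \frac{1}{2}(\theta_0^2 + 50\,\theta_1^2)$, the start point, the step count and the names `grad`, `loss` and `descend` are illustrative and not from the original article. The full gradient is used so that the comparison is deterministic.

```python
def grad(theta):
    """Gradient of the toy objective J(theta) = 0.5 * (theta_0^2 + 50 * theta_1^2)."""
    return [theta[0], 50.0 * theta[1]]


def loss(theta):
    return 0.5 * (theta[0] ** 2 + 50.0 * theta[1] ** 2)


def descend(theta, eta=0.02, gamma=0.9, steps=100):
    """Momentum update v_t = gamma * v_{t-1} + eta * grad; gamma = 0 is plain descent."""
    v = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta)
        v = [gamma * v_i + eta * g_i for v_i, g_i in zip(v, g)]
        theta = [t - v_i for t, v_i in zip(theta, v)]
    return theta


start = [10.0, 1.0]
plain_loss = loss(descend(start, gamma=0.0))     # no momentum
momentum_loss = loss(descend(start, gamma=0.9))  # with momentum
```

With the same learning rate and the same number of steps, the run with $\gamma = 0.9$ ends at a lower loss on this toy problem, because the accumulated velocity speeds up progress along the shallow direction.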

Nesterov accelerated gradient can be seen as a further enhancement to momentum.

$\begin{array}{l}v_t = \gamma v_{t-1} + \eta \nabla_{\theta}J(\theta - \gamma v_{t-1})\\\theta = \theta - v_t\end{array}$

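This algorithm adds a guess of the next position in the form of the term $\begin{array}{l}\mathrm{\gamma\,v_{t-1}}\end{array}$: the gradient is evaluated at the approximate future position $\begin{array}{l}\mathrm{\theta - \gamma\,v_{t-1}}\end{array}$ instead of at the current $\begin{array}{l}\mathrm{\theta}\end{array}$. A comparison of the first two steps of Momentum and Nesterov accelerated gradient can be found in Figure 4. Because the gradient is taken at the look-ahead position, the update can correct the momentum step in advance, which accelerates progress in comparison to Momentum.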
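A single update step shows the difference between the two rules. The sketch below assumes the one-dimensional toy objective $J(\theta) = \theta^2$ and hypothetical starting values $\theta = 1$, $v_{t-1} = 0.5$; the function names are illustrative.

```python
def grad(theta):
    """Gradient of the toy objective J(theta) = theta**2."""
    return 2.0 * theta


def momentum_step(theta, v_prev, eta=0.1, gamma=0.9):
    """Momentum: gradient taken at the current position."""
    v = gamma * v_prev + eta * grad(theta)
    return theta - v, v


def nesterov_step(theta, v_prev, eta=0.1, gamma=0.9):
    """Nesterov accelerated gradient: gradient taken at the look-ahead position."""
    lookahead = theta - gamma * v_prev
    v = gamma * v_prev + eta * grad(lookahead)
    return theta - v, v


theta_m, v_m = momentum_step(1.0, 0.5)  # v = 0.45 + 0.1 * 2.0 = 0.65, theta = 0.35
theta_n, v_n = nesterov_step(1.0, 0.5)  # v = 0.45 + 0.1 * 1.1 = 0.56, theta = 0.44
```

The look-ahead position 0.55 is already closer to the minimum at 0, so its gradient is smaller and the Nesterov step is shorter than the Momentum step. This is the anticipatory correction described above.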

Contrary to Nesterov accelerated gradient, Adagrad adapts its learning rate $\begin{array}{l}\mathrm{\eta}\end{array}$ during run time and updates each parameter $\begin{array}{l}\mathrm{\theta_i}\end{array}$ separately at each time step $\begin{array}{l}\mathrm{t}\end{array}$. This per-parameter update is necessary because $\begin{array}{l}\mathrm{\eta}\end{array}$ is adapted for every $\begin{array}{l}\mathrm{\theta_i}\end{array}$ individually.

$\begin{array}{l}\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii}+ \epsilon}} \cdot\nabla_{\theta} J(\theta_{t,i})\end{array}$

$\begin{array}{l}\mathrm{G_t}\end{array}$ is a diagonal matrix whose diagonal element $\begin{array}{l}\mathrm{G_{t,ii}}\end{array}$ is the sum of the squares of the past gradients with respect to $\begin{array}{l}\mathrm{\theta_i}\end{array}$ up to time step $\begin{array}{l}\mathrm{t}\end{array}$.

$\begin{array}{l}\mathrm{\epsilon}\end{array}$ is a smoothing term which avoids division by zero and is generally very small (~ $\begin{array}{l}\mathrm{10^{-8}}\end{array}$ ).

Due to the accumulation of the squared gradients in $\begin{array}{l}\mathrm{G_t}\end{array}$, the effective learning rate $\begin{array}{l}\mathrm{\frac{\eta}{\sqrt{G_{t,ii}+\,\epsilon}}}\end{array}$ gets smaller over time and eventually becomes vanishingly small, at which point the algorithm is no longer able to learn anything new.
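The per-parameter scaling and the shrinking effective learning rate can both be observed in a few lines. This is a minimal sketch under assumptions: the objective $J(\theta) = \frac{1}{2}(\theta_0^2 + 100\,\theta_1^2)$, the value $\eta = 0.5$ and the names `grad` and `adagrad` are illustrative, and only the diagonal of $G_t$ is stored.

```python
import math


def grad(theta):
    """Gradient of the toy objective J(theta) = 0.5 * (theta_0^2 + 100 * theta_1^2)."""
    return [theta[0], 100.0 * theta[1]]


def adagrad(theta, eta=0.5, eps=1e-8, steps=50):
    """Adagrad with one accumulator per parameter (the diagonal of G_t)."""
    G = [0.0] * len(theta)
    theta_history, rate_history = [], []
    for _ in range(steps):
        g = grad(theta)
        G = [G_i + g_i ** 2 for G_i, g_i in zip(G, g)]        # accumulate squared gradients
        rates = [eta / math.sqrt(G_i + eps) for G_i in G]     # effective learning rates
        theta = [t - r * g_i for t, r, g_i in zip(theta, rates, g)]
        theta_history.append(theta)
        rate_history.append(rates)
    return theta_history, rate_history


theta_history, rate_history = adagrad([1.0, 1.0])
```

In the first step both parameters move by about $\eta$ even though their gradients differ by a factor of 100, because each gradient is divided by its own accumulated magnitude. The entries of `rate_history` never increase from one step to the next, which is the decay of the learning rate described above.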

Figure 3:⁽¹⁾ Visualization of the analogy for Momentum using $\begin{array}{l}\mathrm{\gamma\,= 0.9}\end{array}$ .

Figure 4:⁽¹⁾ Comparison of Momentum (blue) and Nesterov accelerated gradient (green). The brown arrow is the first prediction of the Nesterov accelerated gradient, before the gradient is calculated. The green arrow is the final result of the Nesterov accelerated gradient, now with the gradient taken into account.

Literature

⁽¹⁾Ruder, Sebastian. "An overview of gradient descent optimization algorithms." arXiv preprint arXiv:1609.04747 (2016).

⁽²⁾Duchi, John, Elad Hazan, and Yoram Singer. "Adaptive subgradient methods for online learning and stochastic optimization." Journal of Machine Learning Research 12.Jul (2011): 2121-2159.

Weblinks

⁽³⁾ http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

⁽⁴⁾ https://www.quora.com/Why-do-we-need-adaptive-learning-rates-for-Deep-Learning

⁽⁵⁾ http://sebastianraschka.com/Articles/2015_singlelayer_neurons.html
