Gradient descent
Before the adaptive learning rate methods were introduced, the gradient descent algorithms, including Batch Gradient Descent (BGD), Stochastic Gradient Descent (SGD) and mini-Batch Gradient Descent (mini-BGD, a mixture of BGD and SGD), were state-of-the-art. In essence, these methods update the weights $\theta$ of the network with the help of a learning rate $\eta$, the objective function $J(\theta)$ and its gradient $\nabla_{\theta} J(\theta)$. What all gradient descent algorithms and their improvements have in common is the goal of minimizing $J(\theta)$ in order to find the optimal weights $\theta$.
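As a minimal illustration of this generic update rule, the following sketch (not part of the original text; the one-dimensional objective $J(\theta) = (\theta - 3)^2$ and all variable names are assumptions chosen for illustration) applies $\theta = \theta - \eta \cdot \nabla_{\theta} J(\theta)$ repeatedly and converges to the minimum at $\theta = 3$:

```python
# Minimal sketch of the generic gradient descent update on an assumed 1D objective.
# J(theta) = (theta - 3)**2 has its minimum at theta = 3; its gradient is 2*(theta - 3).

def grad_J(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0   # initial weight
eta = 0.1     # learning rate

for step in range(50):
    theta = theta - eta * grad_J(theta)   # theta <- theta - eta * dJ/dtheta

print(theta)  # approaches 3.0
```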
The simplest of the three is the BGD.
$$\theta = \theta - \eta \cdot \nabla_{\theta} J(\theta)$$
It tries to reach the minimum of $J(\theta)$ by subtracting the gradient of $J(\theta)$ from $\theta$ (refer to Figure 3 for a visualization). The algorithm always computes over the whole set of data for each update. This makes BGD the slowest of the three and prevents it from updating online. Additionally, it performs redundant operations for big data sets, computing similar examples at each update, and it converges to the closest minimum depending on the given data, resulting in potentially suboptimal results.
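To make the cost of BGD concrete, the following sketch (illustrative only; the toy linear-regression data, the mean-squared-error objective and all variable names are assumptions) performs exactly one parameter update per epoch, each time computing the gradient over the entire data set:

```python
import numpy as np

# Assumed toy data: y = 2*x + 1 plus noise; theta = [bias, slope].
rng = np.random.default_rng(0)
X = np.c_[np.ones(200), rng.uniform(-1, 1, 200)]
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.standard_normal(200)

theta = np.zeros(2)
eta = 0.1

for epoch in range(100):
    # One update per epoch, computed over the *whole* data set.
    grad = 2.0 / len(y) * X.T @ (X @ theta - y)   # gradient of the MSE objective
    theta = theta - eta * grad

print(theta)  # approaches [1.0, 2.0]
```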
An often-used algorithm is SGD.
$$\theta = \theta - \eta \cdot \nabla_{\theta} J(\theta; x^{(i)}; y^{(i)})$$
Contrary to BGD, SGD updates for each training example $(x^{(i)}, y^{(i)})$, i.e. it performs one update step per example. This fluctuation enables SGD to jump to minima farther away, potentially reaching a better minimum. However, the same fluctuation also makes SGD prone to overshooting. This can be counteracted by slowly decreasing the learning rate. In the exemplary code shown in Figure 2, a shuffle function is additionally used in the SGD and mini-BGD algorithms, compared to BGD. This is done because it is often preferable to avoid a meaningful order of the data and thereby avoid biasing the optimization algorithm, although sometimes better results can be achieved with the data in order; in that case the shuffle operation should be removed.
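The per-example behaviour can be sketched as follows (again purely illustrative; the same assumed toy regression problem is used, with shuffling each epoch and a slowly decaying learning rate to damp the fluctuation):

```python
import numpy as np

# Assumed toy data as before: y = 2*x + 1 plus noise; theta = [bias, slope].
rng = np.random.default_rng(0)
X = np.c_[np.ones(200), rng.uniform(-1, 1, 200)]
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.standard_normal(200)

theta = np.zeros(2)
eta = 0.1

for epoch in range(20):
    # Shuffle so the update order carries no meaningful structure.
    order = rng.permutation(len(y))
    for i in order:
        x_i, y_i = X[i], y[i]
        grad = 2.0 * x_i * (x_i @ theta - y_i)   # gradient for a single example
        theta = theta - eta * grad
    eta *= 0.9   # slowly decrease the learning rate to counteract overshooting

print(theta)  # approaches [1.0, 2.0]
```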
Lastly, there is the mini-BGD.
$$\theta = \theta - \eta \cdot \nabla_{\theta} J(\theta; x^{(i:i+n)}; y^{(i:i+n)})$$
The mini-BGD updates for every mini-batch of $n$ training examples. This leads to a more stable convergence by reducing the variance of the parameter updates. When people talk about an SGD algorithm, they often refer to this version.
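A corresponding mini-batch sketch (same assumed toy problem; the batch size of 50 mirrors the pseudo code in Figure 2) averages the gradient over each batch, which is what reduces the variance of the updates:

```python
import numpy as np

# Assumed toy data as before: y = 2*x + 1 plus noise; theta = [bias, slope].
rng = np.random.default_rng(0)
X = np.c_[np.ones(200), rng.uniform(-1, 1, 200)]
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.standard_normal(200)

theta = np.zeros(2)
eta = 0.1
batch_size = 50

for epoch in range(100):
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        X_b, y_b = X[idx], y[idx]
        # Gradient averaged over the mini-batch: less noisy than single-example SGD.
        grad = 2.0 / len(idx) * X_b.T @ (X_b @ theta - y_b)
        theta = theta - eta * grad

print(theta)  # approaches [1.0, 2.0]
```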
batch gradient descent (BGD):

```python
for i in range(nb_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad
```

stochastic gradient descent (SGD):

```python
for i in range(nb_epochs):
    np.random.shuffle(data)
    for example in data:
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad
```

mini-batch gradient descent (mini-BGD):

```python
for i in range(nb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad
```
Figure 2: (1) Pseudo code of the three gradient descent algorithms