Gradient descent

Introduction
The gradient method, also called the method of steepest descent, is a numerical method for solving general optimization problems. For a minimization problem, one starts from an approximate value and proceeds in the direction of the negative gradient (which indicates the direction of steepest descent from that point) until no further numerical improvement is achieved.

Wiki2Reveal
This page can be displayed as Wiki2Reveal slides. Individual sections are regarded as slides, and modifications to the page immediately affect the content of the slides.

Remark Convergence
The method often converges very slowly, since it approaches the optimum with a strong zigzag course. For the solution of symmetric positive definite linear systems of equations, the method of conjugate gradients offers an immense improvement here. A related method that does not require gradients is the hill climbing algorithm.

The optimization problem
The gradient method is applicable when the minimization of a real-valued differentiable function $ f: \mathbb{R}^n \rightarrow \mathbb{R} $ is involved, i.e., the optimization problem

$$ \underset{x \in \mathbb{R}^n}{\min} \ f(x). $$

This is a problem of optimization without constraints, also called an unconstrained optimization problem.

Essential steps
 * The gradient points in the direction of the steepest increase.
 * The negative gradient therefore points in the direction in which the function values of $ f $ decrease.
 * It can happen that one jumps over the local minimum of the function $ f $ during an iteration step. Then one would decrease the step size accordingly to further minimize and more accurately approximate the function value of $ f $.

Termination condition
 * A termination condition for the gradient descent method would be when we have found a location $ x^{(k)} \in \mathbb{R}^n $ in the iteration at which the gradient of $ f $ is zero:

$$ \nabla(f)(x^{(k)})= 0\in \mathbb{R}^n. $$

In general, the gradient at a point $ x^{(k)} \in \mathbb{R}^n $ for the $ k $-th iteration step is defined as follows via the partial derivatives:

$$ \nabla(f)(x^{(k)}) := \left(\frac{\partial f}{\partial x_{1}}(x^{(k)}),\ldots ,\frac{\partial f}{\partial x_{n}}(x^{(k)})\right). $$

Example of a gradient
Let $ f(x_1,x_2):= x_1^3\cdot x_2 + x_2^2 $ :
 * $$\nabla(f)(x_1,x_2) = \left(\frac{\partial f}{\partial x_{1}}(x_1,x_2),\frac{\partial f}{\partial x_{2}}(x_1,x_2)\right) = \left( 3 x_1^2 x_2, x_1^3 +2 x_2 \right)$$.

This allows us to calculate the gradient at a given point in the domain of definition:

$$ \nabla(f)(1,2)=(6,5) $$
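The analytic gradient above can be checked numerically. The following Python sketch (the function names are illustrative, not part of the article) compares the analytic gradient of $ f $ with central finite differences:

```python
# Check the analytic gradient of f(x1, x2) = x1^3 * x2 + x2^2
# against a central finite-difference approximation.

def f(x1, x2):
    return x1**3 * x2 + x2**2

def grad_f(x1, x2):
    # Analytic gradient: (3*x1^2*x2, x1^3 + 2*x2)
    return (3 * x1**2 * x2, x1**3 + 2 * x2)

def numerical_grad(func, x1, x2, h=1e-6):
    # Central differences in each coordinate direction.
    d1 = (func(x1 + h, x2) - func(x1 - h, x2)) / (2 * h)
    d2 = (func(x1, x2 + h) - func(x1, x2 - h)) / (2 * h)
    return (d1, d2)

print(grad_f(1, 2))             # (6, 5), as in the example above
print(numerical_grad(f, 1, 2))  # approximately (6.0, 5.0)
```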

Example - normalized gradient
Using a gradient different from the zero vector, here $ \nabla(f)(1,2)=(6,5) $, one can create a normalized negative gradient:
 * $$ d^{(j)} := -\frac{ \nabla(f)(1,2) }{ \left\| \nabla(f)(1,2) \right\| } = \frac{ (-6,-5) }{ \sqrt{36+25} } = \left( -\frac{6}{\sqrt{61}}, -\frac{5}{\sqrt{61}} \right) $$
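This normalization can be sketched in Python (the function name is illustrative, not from the article):

```python
import math

def normalized_descent_direction(grad):
    # Divide the negative gradient by its Euclidean norm
    # so that the direction vector has length 1.
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0:
        raise ValueError("zero gradient: no descent direction")
    return tuple(-g / norm for g in grad)

d = normalized_descent_direction((6, 5))
print(d)  # approximately (-0.7682, -0.6402), i.e. (-6/sqrt(61), -5/sqrt(61))
```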

The procedure
A simplified step size calculation is used as an introduction to the gradient descent procedure.
 * The procedure terminates if the gradient is the zero vector.
 * If the gradient is not the zero vector, the negative gradient is first normalized to length 1 and multiplied by the step size $ \alpha_j $.
 * The step size is halved if the function value (e.g., cost) does not decrease after the iteration step.
 * Another termination condition for the iteration is if the step size falls below an accuracy limit $ \varepsilon > 0 $ (i.e., $ \alpha_j < \varepsilon $).
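The rules above can be sketched as a minimal gradient descent loop in Python. This is a sketch under the stated simplifications; the function names and the test function are illustrative, not part of the article:

```python
import math

def gradient_descent(f, grad, x0, alpha=1.0, eps=1e-8, max_iter=10000):
    """Simplified gradient descent with step-size halving.

    Terminates when the gradient is (numerically) the zero vector
    or the step size alpha falls below the accuracy limit eps.
    """
    x = list(x0)
    for _ in range(max_iter):
        g = grad(*x)
        norm = math.sqrt(sum(gi * gi for gi in g))
        if norm == 0:            # gradient is the zero vector
            break
        if alpha < eps:          # step size below accuracy limit
            break
        d = [-gi / norm for gi in g]              # normalized negative gradient
        candidate = [xi + alpha * di for xi, di in zip(x, d)]
        if f(*candidate) < f(*x):                 # improvement: accept the step
            x = candidate
        else:                                     # no improvement: halve alpha
            alpha /= 2
    return x

# Usage: minimize f(x1, x2) = x1^2 + x2^2, which has its minimum at (0, 0).
xmin = gradient_descent(lambda x1, x2: x1**2 + x2**2,
                        lambda x1, x2: (2 * x1, 2 * x2),
                        x0=(3.0, 4.0))
print(xmin)  # close to (0.0, 0.0)
```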

Start of the optimization
As starting point, a point $ x^{(0)} $ is chosen from the domain of definition of the function $ f $, whose local minima one wants to approximate with the gradient descent method.

Direction of the steepest descent
Starting from the starting point $ x^{(0)} $, or from the current point $ x^{(j)} $ for the next iteration step, the direction of steepest descent is given by $ -\nabla\left( f(x^{(j)})\right) $, where $ \nabla\left( f(x^{(j)})\right)\in \mathbb{R}^n $ is the gradient of $ f $ at the location $ x^{(j)}\in \mathbb{R}^n $. The gradient points in the direction of the steepest increase; the negative sign ensures that the iteration steps move in the direction of the strongest decrease (e.g., minimization of a cost/error function $ f $).

Normalization of the direction vector
The simplified iteration procedure terminates on the condition $ \left\| \nabla \left( f(x^{(j)}) \right) \right\| < \varepsilon $. Otherwise, the direction vector is normalized for the following iteration step (normalizing the direction is optional and fixes the step length in the learning algorithm):
 * $$ d^{(j)} := -\frac{ \nabla\left( f(x^{(j)}) \right) }{  \left\| \nabla\left( f(x^{(j)}) \right) \right\| } $$ with the Euclidean norm $$ \| x \| := \| (x_1,\ldots ,x_n ) \| := \sqrt{ \sum_{k=1}^n x_k^2 } $$

Iteration step
Formally, one notates this iteration step as follows:


 * $$ x^{(j+1)} = \begin{cases} x^{(j)} + \alpha^{(j)} d^{(j)}, & \text{if } f(x^{(j)} + \alpha^{(j)} d^{(j)}) < f(x^{(j)}) \text{ (improvement)} \\ x^{(j)}, & \text{otherwise} \end{cases} $$

If there is no improvement, the step size is decreased (e.g., halved).
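The case distinction of the iteration step can be sketched as a small Python function (a minimal illustration; the names and the test values are not from the article):

```python
def iteration_step(f, x, d, alpha):
    # Accept the trial point only if the cost decreases (improvement);
    # otherwise keep the current point unchanged.
    trial = [xi + alpha * di for xi, di in zip(x, d)]
    return trial if f(*trial) < f(*x) else list(x)

# Example: f(x1, x2) = x1^2 + x2^2 at x = (3, 4) with d = (-0.6, -0.8).
x_new = iteration_step(lambda x1, x2: x1**2 + x2**2,
                       [3.0, 4.0], [-0.6, -0.8], alpha=1.0)
print(x_new)  # the accepted trial point, since the cost decreases from 25 to 16
```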

Setting the step size
The step size is kept for the next iteration step until the cost function $ f $ would increase with the subsequent step. In this introductory example, the step size $ \alpha^{(j)} $ is then halved. Formally:
 * $$ \alpha^{(j+1)} = \begin{cases} \alpha^{(j)}, & \text{if } f(x^{(j)} + \alpha^{(j)} d^{(j)}) < f(x^{(j)}) \text{ (improvement)} \\ \frac{ \alpha^{(j)} }{2}, & \text{otherwise} \end{cases} $$

Alternative step size reduction
In general, halving the step size can also be replaced by a reduction with a factor $ \delta $ with $ 0 < \delta < 1 $ via

$$ \alpha^{(j+1)} := \alpha^{(j)} \cdot \delta. $$

Step size definition per iteration step
Here $ \alpha^{(j)} > 0 $ is the step size in the $ j $-th iteration step. This step size must be determined in each step of the iteration procedure. In general, there are different ways to do this, such as reducing the step size determination to a one-dimensional optimization problem. The step size optimization used here was chosen as an introduction to the topic.

Gradient descent in spreadsheet
The following ZIP file from GitHub contains a LibreOffice file with an example gradient descent for the cost function

$$ f(x_1,x_2) := \sin(x_1) + \cos(x_2) + 3. $$

The cost function has infinitely many local minima on its domain of definition $ \mathbb{R}^2 $. The minimum value of the cost function is $ 1 $, since $ \sin $ and $ \cos $ each attain the minimum value $ -1 $. In each table row, we perform an iteration step and check that the cost function actually decreases after the iteration step.


 * ZIP file from GitHub
 * GradientDescent repository on GitHub
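The spreadsheet experiment can be reproduced with the simplified procedure described above. The following Python sketch (variable names and starting point are illustrative, not taken from the spreadsheet) runs gradient descent on this cost function:

```python
import math

def cost(x1, x2):
    return math.sin(x1) + math.cos(x2) + 3

def grad(x1, x2):
    # Partial derivatives of the cost function.
    return (math.cos(x1), -math.sin(x2))

x = [1.0, 1.0]   # arbitrary starting point
alpha = 1.0
for _ in range(100000):
    g = grad(*x)
    norm = math.sqrt(g[0] ** 2 + g[1] ** 2)
    if norm == 0 or alpha < 1e-10:
        break
    trial = [x[0] - alpha * g[0] / norm, x[1] - alpha * g[1] / norm]
    if cost(*trial) < cost(*x):   # accept only improving steps
        x = trial
    else:
        alpha /= 2                # otherwise halve the step size

print(cost(*x))  # close to the minimum value 1
```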

Regression to a one-dimensional optimization problem
One method is to determine $ \alpha^{(j)} $ by minimizing the function on the (one-dimensional) "ray" $ x^{(j)}(\alpha) $ that starts from $ x^{(j)} $ and points in the direction of the negative gradient. The one-dimensional function $ M $ to be minimized is defined as follows.



$$ M: \mathbb{R}^{+} \rightarrow \mathbb{R}, \qquad \alpha \mapsto M(\alpha):= f\left( x^{(j)}(\alpha) \right) = f\left( x^{(j)} + \alpha d^{(j)} \right) $$


 * with $ x^{(j)}(\alpha) = x^{(j)} + \alpha d^{(j)} $.

In this case, one computes $ x^{(j+1)}:=x^{(j)}(\alpha_o) $ with $ \alpha_o > 0 $ such that $ M(\alpha_o) $ becomes minimal, i.e.:



$$ f\left(x^{(j+1)}\right)=\underset{\alpha >0}{\min}\ f\left( x^{(j)}(\alpha) \right). $$

This is a simple one-dimensional optimization problem, for which there are special methods of step size determination.
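One such one-dimensional method can be sketched as a ternary search on $ M(\alpha) $. This is an illustrative sketch, not one of the methods named by the article; it assumes $ M $ is unimodal on the chosen bracket, and the bracket and tolerance are arbitrary choices:

```python
def line_search(f, x, d, hi=10.0, tol=1e-10):
    # Minimize M(alpha) = f(x + alpha * d) over [0, hi] by ternary search,
    # assuming M is unimodal on that interval.
    lo = 0.0
    M = lambda a: f(*[xi + a * di for xi, di in zip(x, d)])
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if M(m1) < M(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

# For f(x1, x2) = x1^2 + x2^2 at x = (3, 4) with d = (-0.6, -0.8),
# M(alpha) = (5 - alpha)^2, which is minimal at alpha = 5.
alpha_opt = line_search(lambda x1, x2: x1**2 + x2**2, [3.0, 4.0], [-0.6, -0.8])
print(alpha_opt)  # close to 5.0
```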

Step sizes and iterated step size reduction
Another method is to make $ \alpha^{(j)} $ dependent on the minimization of the function $ f $, i.e., on the condition $ f(x^{(j+1)}) < f(x^{(j)}) $. If an initial step size $ \alpha_0 > 0 $ does not decrease the function value, one decreases the step size, e.g. with $ \alpha_{k+1} :=\alpha_k \cdot s $ with $$0< s < 1$$ until the step starting from $ x^{(j)} $ in the direction of the negative gradient actually yields a function value $ f(x^{(j)} + \alpha_k d^{(j)}) < f(x^{(j)}) $, and one sets $ x^{(j+1)} = x^{(j)} + \alpha_k d^{(j)} $.
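This iterated step-size reduction can be sketched in Python (the function name, the shrink factor, and the retry limit are illustrative assumptions):

```python
def backtracking_step(f, x, d, alpha0=1.0, s=0.5, max_tries=60):
    # Shrink the step size alpha by the factor s until a step in
    # direction d actually decreases the function value.
    alpha = alpha0
    fx = f(*x)
    for _ in range(max_tries):
        trial = [xi + alpha * di for xi, di in zip(x, d)]
        if f(*trial) < fx:       # decrease achieved: accept this step
            return trial, alpha
        alpha *= s               # otherwise shrink the step size
    return list(x), alpha        # no decrease found within max_tries

# Example: f(x1, x2) = x1^2 + x2^2 at x = (0.5, 0) with d = (-1, 0).
# The initial step 4.0 overshoots; alpha is shrunk until f decreases.
x_next, a = backtracking_step(lambda x1, x2: x1**2 + x2**2,
                              [0.5, 0.0], [-1.0, 0.0], alpha0=4.0)
print(x_next, a)  # [0.0, 0.0] with accepted step size 0.5
```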

If one has reached a point $ x^{(j+1)} \in \mathbb{R}^n $ with $ \nabla\left( f(x^{(j+1)})\right) = \mathbf{0} \in \mathbb{R}^n $ in the iteration procedure, a local extremum of $ f $ may be present (check, or terminate the iteration procedure).

Termination condition
A key criterion for a termination condition is that $ d^{(j)} = -\nabla\left( f(x^{(j)})\right) = \mathbf{0} \in \mathbb{R}^n $. As in one-dimensional real analysis, there need not be a local minimum at this point $ x^{(j)} \in \mathbb{R}^n $ (in one dimension a saddle point, in several dimensions e.g. a saddle surface). If $ f:U\rightarrow \mathbb{R} $ is twice continuously differentiable and the Hessian matrix is positive definite at this point, this is a sufficient criterion for a local minimum at $ x^{(j)} \in \mathbb{R}^n $. The point is output and the iteration is terminated.

If the algorithm is executed on a computer, a further possible termination criterion is the step size: the iteration stops if it becomes smaller than a lower limit $ \varepsilon > 0 $, i.e., if $ \alpha^{(j)}<\varepsilon $.

Furthermore, the gradient descent procedure can be terminated if the improvement of the optimization of $ f $ in the iteration steps becomes smaller than a lower bound.

Such termination criteria algorithmically ensure that the gradient descent procedure does not end up in an infinite loop of iterations.


 * CAS4Wiki Start-Link - Derivatives and Gradient: [https://niebert.github.io/WikiversityDoc/cas4wiki.html?filename=commands4cas.json&casdata=%7B%22castype%22%3A%22maxima%22%2C%22commands%22%3A%5B%7B%22cmdtitle%22%3A%22Derivative%20x%5E2%2B3x%22%2C%22cmd%22%3A%22d(x%5E2%2B3x)%22%2C%22result4cmd%22%3A%22%22%7D%2C%7B%22cmdtitle%22%3A%22Partial%20Derivative%20g(x%2Cy)%20%22%2C%22cmd%22%3A%22g(x%2Cy)%5Cnd(g(x%2Cy)%2Cy)%22%2C%22result4cmd%22%3A%22%22%7D%2C%7B%22cmdtitle%22%3A%22Gradient%20r(x%2Cy)%22%2C%22cmd%22%3A%22r(x%2Cy)%3A%3Dsqrt(x%5E2%2By%5E2)%5Cnd(r(x%2Cy)%2C%5Bx%2Cy%5D)%22%2C%22result4cmd%22%3A%22%22%7D%5D%2C%22casfunctions%22%3A%5B%7B%22name%22%3A%22g%22%2C%22args%22%3A%22x%2Cy%22%2C%22def%22%3A%22x%5E3%2By%5E2%22%7D%2C%7B%22name%22%3A%22f%22%2C%22args%22%3A%22x%22%2C%22def%22%3A%22x%5E5%22%7D%2C%7B%22name%22%3A%22h%22%2C%22args%22%3A%22x%22%2C%22def%22%3A%2210*sin(x)%22%7D%2C%7B%22name%22%3A%22cur%22%2C%22args%22%3A%22t%22%2C%22def%22%3A%22%5Bcos(t)%2Csin(t)%2Ct%5D%22%7D%2C%7B%22name%22%3A%22K%22%2C%22args%22%3A%22t%22%2C%22def%22%3A%22v1*%20(1-t)%5E3%20%2Bv2*3*(1-t)%5E2*t%2B%20v3*3*(1-t)*t%5E2%20%2Bv4*t%5E3%22%7D%2C%7B%22name%22%3A%22r%22%2C%22args%22%3A%22x%2Cy%22%2C%22def%22%3A%22sqrt(x%5E2%2By%5E2)%22%7D%5D%2C%22casvariables%22%3A%5B%7B%22name%22%3A%22c1%22%2C%22def%22%3A%2212!%22%7D%2C%7B%22name%22%3A%22c2%22%2C%22def%22%3A%2223%5E5-4%2Bsin(13)%22%7D%2C%7B%22name%22%3A%22c3%22%2C%22def%22%3A%22f1(x)%22%7D%2C%7B%22name%22%3A%22v1%22%2C%22def%22%3A%22%5B3%2C4%2C5%5D%22%7D%2C%7B%22name%22%3A%22v2%22%2C%22def%22%3A%22%5B5%2C4%2C-3%5D%22%7D%2C%7B%22name%22%3A%22v3%22%2C%22def%22%3A%22%5B-6%2C-6%2C6%5D%22%7D%2C%7B%22name%22%3A%22v4%22%2C%22def%22%3A%22%5B-3%2C-7%2C0%5D%22%7D%5D%7D&title=Derivatives%20Gradient Derivatives and Gradient]

Videos

 * Contour lines, gradient, vector field, vector analysis, ... YouTube video (07/21/2015) by Daniel Jung
 * Gradient and total differential, YouTube video (07/21/2015) by Daniel Jung

Page Information
You can display this page as Wiki2Reveal slides

Wiki2Reveal
The Wiki2Reveal slides were created for the course Numerical Analysis, and the link for the Wiki2Reveal slides was created with the link generator.


 * This page is designed as a PanDocElectron-SLIDE document type.
 * Source: Wikiversity https://en.wikiversity.org/wiki/Gradient%20descent
 * see Wiki2Reveal for the functionality of Wiki2Reveal.

German original: Gradientenabstiegsverfahren