Problem 1 (A One-Dimensional Problem): (approx. 20 pts)

We consider the optimization problem

min_{x} f(x) := −(x − 1)^{2} log( (x + 1) / (2(x − 1)) ) − x   s.t.   x ∈ [1.5, 4.5].

Implement the bisection or the golden section method to solve this problem and output a solution with accuracy at least 10^{−5}.
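A minimal golden section sketch in Python. The objective below uses the reconstruction of f stated above, so treat its exact form as an assumption; `golden_section` itself only needs a unimodal function and an interval, and the iteration cap is our own safeguard:

```python
import math

def golden_section(f, a, b, tol=1e-6, maxit=200):
    # Golden section search for a minimizer of a unimodal f on [a, b].
    phi = (math.sqrt(5.0) - 1.0) / 2.0        # inverse golden ratio, ~0.618
    x1 = b - phi * (b - a)
    x2 = a + phi * (b - a)
    f1, f2 = f(x1), f(x2)
    it = 0
    while b - a > tol and it < maxit:
        if f1 <= f2:                          # minimizer lies in [a, x2]
            b, x2, f2 = x2, x1, f1
            x1 = b - phi * (b - a)
            f1 = f(x1)
        else:                                 # minimizer lies in [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + phi * (b - a)
            f2 = f(x2)
        it += 1
    return 0.5 * (a + b)

def f(x):
    # objective of Problem 1, as reconstructed from the sheet (an assumption)
    return -(x - 1.0) ** 2 * math.log((x + 1.0) / (2.0 * (x - 1.0))) - x

x_star = golden_section(f, 1.5, 4.5, tol=1e-6)
print(x_star)   # ~3.0 for this reconstruction of f
```

Each iteration shrinks the bracket by the factor 0.618, so roughly 31 iterations suffice to reach a width of 10^{−6} starting from [1.5, 4.5].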
Problem 2 (The Gradient Method): (approx. 30 pts)

In this exercise, we want to solve the optimization problem

min_{x∈R^{2}} f(x) := x_{1}^{4} + (2/3)x_{1}^{3} + (1/2)x_{1}^{2} − 2x_{1}^{2}x_{2} + (4/3)x_{2}^{2}   (1)
via the gradient descent method. (This is problem 1 discussed in sheet 6).
Implement the gradient method that was presented in the lecture as a function gradient_method in MATLAB or Python.
The following input functions and parameters should be considered:

obj, grad – function handles that calculate and return the objective function f(x) and the gradient ∇f(x) at an input vector x ∈ R^{n}. You can treat these handles as functions or fields of a class or structure f or you can use f and ∇f directly in your code. (For example, your function can have the form gradient_method(obj,grad,…)).

x^{0} – the initial point.

tol – a tolerance parameter. The method should stop whenever the current iterate x^{k} satisfies the criterion ‖∇f(x^{k})‖ ≤ tol.
We want to investigate the performance of the gradient method for different step size strategies. In particular, we want to test and compare backtracking and exact line search. The following parameters will be relevant for these strategies:

σ, γ ∈ (0, 1) – parameters for backtracking and the Armijo condition. (At iteration k, we choose α_{k} as the largest element in {1, σ, σ^{2}, …} satisfying the condition f(x^{k} − α_{k}∇f(x^{k})) − f(x^{k}) ≤ −γα_{k} · ‖∇f(x^{k})‖^{2}.)

You can use the golden section method to determine the exact step size α_{k}. The parameters for the golden section method are: maxit (maximum number of iterations), tol (stopping tolerance), [0, a] (the interval of the step size).
You can organize the latter parameters in an appropriate options class or structure. It is also possible to implement separate algorithms for backtracking and exact line search. The method(s) should return the final iterate x^{k} that satisfies the stopping criterion.
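A minimal Python sketch of `gradient_method` with Armijo backtracking. The signature follows the inputs listed above; the iteration cap and the 10^{−12} step size safeguard are our own additions, not prescribed by the sheet:

```python
import numpy as np

def gradient_method(obj, grad, x0, tol=1e-5, sigma=0.5, gamma=0.1, maxit=10000):
    # Gradient descent with backtracking: alpha_k is the largest element of
    # {1, sigma, sigma^2, ...} satisfying the Armijo condition
    #   f(x - alpha*g) - f(x) <= -gamma * alpha * ||g||^2.
    x = np.asarray(x0, dtype=float)
    for _ in range(maxit):
        g = grad(x)
        if np.linalg.norm(g) <= tol:          # stopping criterion
            break
        alpha = 1.0
        while obj(x - alpha * g) - obj(x) > -gamma * alpha * g.dot(g) and alpha > 1e-12:
            alpha *= sigma                    # backtrack
        x = x - alpha * g
    return x
```

Exact line search can be obtained by replacing the backtracking loop with a golden section search for α_{k} over the interval [0, a].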

Apply the gradient method with backtracking and parameters (σ, γ) = (0.5, 0.1) and exact line search (maxit = 100, tol = 10^{−6}, a = 2) to solve the problem min_{x} f(x).
The algorithms should use the stopping tolerance tol = 10^{−5}. Test the methods using the initial point x^{0} = (3, 3)^{T} and report the behavior and performance of the methods. In particular, compare the number of iterations and the point to which the different gradient descent methods converged.

Let us define the set of initial points

X^{0} := { (−3, −3)^{T}, (3, −3)^{T}, (−3, 3)^{T}, (3, 3)^{T} }.
Run the methods:
– Gradient descent method with backtracking and (σ, γ) = (0.5, 0.1),
– Gradient method with exact line search and maxit = 100, tol = 10^{−6}, a = 2,
again for every initial point in the set X^{0} using the tolerance tol = 10^{−5}. For each algorithm/step size strategy create a single figure that contains all of the solution paths generated for the different initial points. The initial points and limit points should be clearly visible. Add a contour plot of the function f in the background of each figure.
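One way to produce such a figure is sketched below. The objective is the reconstruction of (1) above (treat the coefficients as an assumption), and the grid range, contour levels, markers, and output filename are our choices:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                      # headless backend; figure is saved to disk
import matplotlib.pyplot as plt

def obj(x):
    # objective (1) as reconstructed above (an assumption)
    x1, x2 = x[0], x[1]
    return x1**4 + (2.0/3.0)*x1**3 + 0.5*x1**2 - 2.0*x1**2*x2 + (4.0/3.0)*x2**2

def grad(x):
    x1, x2 = x[0], x[1]
    return np.array([4.0*x1**3 + 2.0*x1**2 + x1 - 4.0*x1*x2,
                     -2.0*x1**2 + (8.0/3.0)*x2])

def descent_path(x0, tol=1e-5, sigma=0.5, gamma=0.1, maxit=5000):
    # gradient descent with Armijo backtracking, recording all iterates
    x, path = np.array(x0, dtype=float), [np.array(x0, dtype=float)]
    for _ in range(maxit):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        alpha = 1.0
        while obj(x - alpha*g) - obj(x) > -gamma*alpha*g.dot(g) and alpha > 1e-12:
            alpha *= sigma
        x = x - alpha*g
        path.append(x.copy())
    return np.array(path)

X0 = [(-3.0, -3.0), (3.0, -3.0), (-3.0, 3.0), (3.0, 3.0)]
paths = [descent_path(x0) for x0 in X0]

# contour plot of f in the background, all solution paths on top
xs = np.linspace(-4.0, 4.0, 300)
X, Y = np.meshgrid(xs, xs)
Z = obj((X, Y))                            # obj evaluates componentwise on the grid
fig, ax = plt.subplots()
ax.contour(X, Y, Z, levels=40)
for p in paths:
    ax.plot(p[:, 0], p[:, 1], "-o", markersize=2)
    ax.plot(p[0, 0], p[0, 1], "ks")        # initial point
    ax.plot(p[-1, 0], p[-1, 1], "r*")      # limit point
fig.savefig("paths_backtracking.png", dpi=150)
```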
Problem 3 (A Limited-Memory Version of Adagrad): (approx. 25 pts)
In this exercise, we investigate a limited-memory version of the gradient descent method with adaptive diagonal scaling. The update of the method is given by

x^{k+1} = x^{k} − α_{k}D_{k}^{−1}∇f(x^{k}),   (2)
where α_{k} ≥ 0 is a suitable step size and D_{k} ∈ R^{n×n} is a diagonal matrix that is chosen as follows: D_{k} = diag(v_{1}^{k}, v_{2}^{k}, …, v_{n}^{k}) and

v_{i}^{k} = √( ε + Σ_{j=t_{m}(k)}^{k} (∇f(x^{j}))_{i}^{2} ),   t_{m}(k) = max{0, k − m},   ∀ i = 1, …, n.
Here, ε > 0, the memory constant m ∈ N, and the initial point x^{0} ∈ R^{n} are given parameters.

Show that the direction d^{k} = −D_{k}^{−1}∇f(x^{k}) is a descent direction for all k ∈ N (assuming ∇f(x^{k}) ≠ 0).
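The computation hinges on the diagonal structure of D_{k}; a sketch in the notation above:

```latex
\nabla f(x^k)^\top d^k
  = -\nabla f(x^k)^\top D_k^{-1} \nabla f(x^k)
  = -\sum_{i=1}^{n} \frac{(\nabla f(x^k))_i^2}{v_i^k} < 0,
```

since every v_{i}^{k} ≥ √ε > 0 makes D_{k}^{−1} positive definite, and ∇f(x^{k}) ≠ 0 makes at least one summand strictly positive.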

Write a MATLAB or Python code and implement the gradient method with adaptive diagonal scaling (Adagrad) and backtracking (i.e., implement the abstract descent method with d^{k} as descent direction). Run Adagrad on problem (1) introduced in the last exercise and compare its performance with the tested approaches in Problem 2 using the same initial points and stopping tolerance as in Problem 2 b). You can use the following parameters:
– (σ, γ) = (0.5, 0.1) – parameter for backtracking.
– ε = 10^{−6} – scaling parameter in the update (2).
– m = 25 – memory parameter.
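A minimal Python sketch of this scheme (the function name `adagrad_lm`, the list-based gradient history, and the step size safeguard are our choices; the update follows (2) and the definition of v_{i}^{k} above):

```python
import numpy as np

def adagrad_lm(obj, grad, x0, tol=1e-5, sigma=0.5, gamma=0.1,
               eps=1e-6, m=25, maxit=5000):
    # Limited-memory Adagrad with backtracking: D_k is built from the last
    # gradients j = t_m(k), ..., k only, i.e.
    #   v_i^k = sqrt(eps + sum_{j=max(0,k-m)}^{k} (grad f(x^j))_i^2),
    # and the step is taken along d^k = -D_k^{-1} grad f(x^k).
    x = np.asarray(x0, dtype=float)
    history = []                              # squared gradients in the window
    for _ in range(maxit):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        history.append(g**2)
        if len(history) > m + 1:              # drop gradients older than t_m(k)
            history.pop(0)
        v = np.sqrt(eps + np.sum(history, axis=0))
        d = -g / v                            # d^k = -D_k^{-1} grad f(x^k)
        alpha = 1.0                           # Armijo backtracking along d^k
        while obj(x + alpha * d) - obj(x) > gamma * alpha * g.dot(d) and alpha > 1e-12:
            alpha *= sigma
        x = x + alpha * d
    return x
```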
Generate a plot of the different solution paths of Adagrad using the different initial points from X^{0} and add a contour plot of f in the background.

How does Adagrad behave when different memory parameters m are chosen? (Suitable other choices might include m ∈ {5, 10, 15, 25, …, maxit}.) Does the performance or convergence behavior change?
Problem 4 (Globalized Newton’s Method): (approx. 25 pts)
Implement the globalized Newton method with backtracking that was presented in the lecture as a function newton_glob in MATLAB or Python.
The pseudocode for the full Newton method is given as follows:
Algorithm 1: The Globalized Newton Method

Initialization: Select an initial point x^{0} ∈ R^{n}, parameters γ, γ_{1}, γ_{2}, σ ∈ (0, 1), and tol.
for k = 0, 1, … do

If ‖∇f(x^{k})‖ ≤ tol, then STOP and x^{k} is the output.

Compute the Newton direction s^{k} as solution of the linear system of equations:
∇^{2}f(x^{k})s^{k} = −∇f(x^{k}).

If −∇f(x^{k})^{T}s^{k} ≥ γ_{1} min{1, ‖s^{k}‖^{γ_{2}}}‖s^{k}‖^{2}, then accept the Newton direction and set d^{k} = s^{k}.
Otherwise set d^{k} = −∇f(x^{k}).

Choose a step size α_{k} by backtracking and calculate x^{k+1} = x^{k} + α_{k}d^{k}.
The following input functions and parameters should be considered:

obj, grad, hess – function handles that calculate and return the objective function f(x), the gradient ∇f(x), and the Hessian ∇^{2}f(x) at an input vector x ∈ R^{n}. You can treat these handles as functions or fields of a class or structure f or you can use f, ∇f, and ∇^{2}f from part a) and b) directly in the algorithm. (For example, your function can have the form newton_glob(obj,grad,hess,…)).

x^{0} – the initial point.

tol – a tolerance parameter. The method should stop whenever the current iterate x^{k} satisfies the criterion krf(x^{k})k ≤ tol.

γ_{1}, γ_{2} > 0 – parameters for the Newton condition.

σ, γ ∈ (0, 1) – parameters for backtracking and the Armijo condition.
You can again organize the latter parameters in an appropriate options class or structure. You can use the backslash operator A\b in MATLAB or numpy.linalg.solve(A,b) to solve the linear system of equations Ax = b. If the computed Newton step s^{k} = −∇^{2}f(x^{k})^{−1}∇f(x^{k}) is a descent direction and satisfies

−∇f(x^{k})^{T}s^{k} ≥ γ_{1} min{1, ‖s^{k}‖^{γ_{2}}}‖s^{k}‖^{2},

we accept it as next direction d^{k}. Otherwise, the gradient direction d^{k} = −∇f(x^{k}) is chosen. The method should return the final iterate x^{k} that satisfies the stopping criterion.
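A possible Python sketch of `newton_glob` following Algorithm 1, together with the Rosenbrock test problem from part a). The singular-Hessian fallback, the iteration cap, and the 10^{−12} step size guard are our own safeguards, not prescribed by the sheet:

```python
import numpy as np

def newton_glob(obj, grad, hess, x0, tol=1e-7, sigma=0.5, gamma=1e-4,
                gamma1=1e-6, gamma2=0.1, maxit=500):
    # Globalized Newton method: try the Newton direction, fall back to the
    # negative gradient if the acceptance test fails, then backtrack (Armijo).
    x = np.asarray(x0, dtype=float)
    for _ in range(maxit):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        try:
            s = np.linalg.solve(hess(x), -g)          # Newton system
            ns = np.linalg.norm(s)
            ok = -g.dot(s) >= gamma1 * min(1.0, ns**gamma2) * ns**2
        except np.linalg.LinAlgError:                 # singular Hessian
            ok = False
        d = s if ok else -g
        alpha = 1.0
        while obj(x + alpha * d) - obj(x) > gamma * alpha * g.dot(d) and alpha > 1e-12:
            alpha *= sigma
        x = x + alpha * d
    return x

# Rosenbrock test problem from part a)
def f(x):
    return 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

def df(x):
    return np.array([-400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
                     200.0 * (x[1] - x[0]**2)])

def d2f(x):
    return np.array([[1200.0 * x[0]**2 - 400.0 * x[1] + 2.0, -400.0 * x[0]],
                     [-400.0 * x[0], 200.0]])

x_star = newton_glob(f, df, d2f, np.array([-1.0, -0.5]))
print(x_star)   # close to (1, 1), the minimizer of the Rosenbrock function
```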
a) Test your approach on the Rosenbrock function f : R^{2} → R
f(x) = 100(x_{2} − x_{1}^{2})^{2} + (1 − x_{1})^{2}
with initial point x^{0} = (−1, −0.5)^{T} and parameters (σ, γ) = (0.5, 10^{−4}) and (γ_{1}, γ_{2}) = (10^{−6}, 0.1). (Notice that γ is smaller here.) Besides the globalized Newton method, also run
the gradient method with backtracking ((σ, γ) = (0.5, 10^{−4})) on this problem and compare the performance of the two approaches using the tolerance tol = 10^{−7}.
Does the Newton method always utilize the Newton direction? Does the method always use full step sizes α_{k} = 1?

Repeat the performance test described in Problem 2 b) and Problem 3 b) for problem (1) using the globalized Newton method. You can use the parameters (σ, γ) = (0.5, 0.1), (γ_{1}, γ_{2}) = (10^{−6}, 0.1), and tol = 10^{−5}.

Plot all of the solution paths obtained by Newton's method for the different initial points in X^{0} in one figure (with a contour plot of f in the background).
Information: MATLAB and Python Code

For those questions that ask you to write code to solve the problem, please attach the code to the homework. Please also state the optimal solution and the optimal value or show the obtained plot / figure in your solution sheet (depending on the specific question).
Sheet 8 is due on May 6th. Submit your solutions before May 6th, 12:00 pm (noon).