Optimization

This notebook has been adapted from one of the tutorials presented during a workshop at the Applied Machine Learning Days 2020.

Here, we use the gradients computed by Autograd to optimize an objective function. We do this in two ways: (1) "by hand", and (2) with the optimizers provided by PyTorch's torch.optim module.

Table of Contents

1. Optimizing by hand

2. Optimizing using an optimizer

import sys
import torch

# We use a live-plot script to visualize what happens when we optimize a function.

if 'google.colab' in sys.modules: # Execute if you're using Google Colab
    !wget -q https://raw.githubusercontent.com/theevann/amld-pytorch-workshop/master/live_plot.py -O live_plot.py
    !pip install -q ipympl

%matplotlib ipympl
torch.set_printoptions(precision=3)
from live_plot import anim_2d

Optimizing by hand

We will start by minimizing the square function. Let's define it below:

def f(x):
    return x ** 2

Let's plot it

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(-2, 2, 0.1)
y = f(x)

plt.plot(x, y, '-k')
plt.show()

We can see that our square function attains its minimum at x = 0. Let's now minimize it using the gradient descent algorithm.

The update step of the algorithm is given as:

$$x_{t+1} = x_{t} - \lambda \nabla_x f (x_t)$$
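
As a worked instance of this update (a sketch for the square function used here, with $\lambda$ the learning rate): since $f(x) = x^2$ has gradient $\nabla_x f(x) = 2x$, each step simply rescales $x$:

$$x_{t+1} = x_t - \lambda \cdot 2 x_t = (1 - 2\lambda)\, x_t \quad\Longrightarrow\quad x_t = (1 - 2\lambda)^t\, x_0$$

For example, with $\lambda = 0.1$ the factor is $0.8$, so $x$ shrinks by 20% per step; starting from $x_0 = 2.5$, forty steps give $x_{40} = 2.5 \cdot 0.8^{40} \approx 3.3 \times 10^{-4}$. The iteration converges to the minimum at $x = 0$ whenever $|1 - 2\lambda| < 1$, that is, for $0 < \lambda < 1$.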

When implementing this update with PyTorch's autograd, keep the following points in mind:

  • The gradient $\nabla_x f (x)$ is stored in x.grad once we call y.backward().
  • Gradients accumulate across backward calls by default, so we need to clear x.grad after each iteration.
  • We need to use with torch.no_grad(): since we want to change x in place but don't want autograd to track this change.
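
The accumulation behaviour mentioned in the list above can be checked in isolation. The following is a minimal sketch (the variable names are illustrative, not from the original notebook): calling backward twice without clearing x.grad sums the two gradients.

```python
import torch

x = torch.tensor([2.0], requires_grad=True)

# First backward pass: d(x^2)/dx = 2x = 4 at x = 2
(x ** 2).sum().backward()
grad_first = x.grad.clone()

# Second backward pass without clearing: the new gradient is added to the old one
(x ** 2).sum().backward()
grad_accumulated = x.grad.clone()

# Clearing the gradient restores the expected value on the next pass
x.grad = None
(x ** 2).sum().backward()
grad_after_reset = x.grad.clone()

print(grad_first, grad_accumulated, grad_after_reset)
```

This is why the loop in this section resets x.grad at the end of every iteration.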

x0 = 2.5  # initial value
lr = 0.1  # learning rate
iterations = 40
points = []

x_range = torch.arange(-3, 3, 0.1)
x = torch.tensor([x0], requires_grad=True)

for i in range(iterations):
    y = f(x)
    y.backward()  # compute dy/dx and store it in x.grad
    points.append((x.item(), y.item()))
    with torch.no_grad():  # update x in place without autograd tracking it
        x -= lr * x.grad
    x.grad = None  # clear the gradient so it does not accumulate

    # print(y.item())  # Uncomment to see the value of the function at each step

anim_2d(x_range, f, points, 400)

What a visual treat! We can see what happens when we optimize our square function. The algorithm starts from the user-supplied initial value of x and performs one gradient descent step per iteration, updating x each time until all the iterations are complete.


Optimizing using an optimizer

An optimizer is an object that automatically loops through all the numerous parameters of your model and performs the (potentially complex) update step for you. Let us now use an optimizer to perform the optimization of the function.

We first need to import torch.optim.

import torch.optim as optim

Below are some of the most commonly used optimizers. Each has its own specific parameters, which you can check in the PyTorch documentation.

parameters = [x]  # This should be the list of model parameters

# Each line below overwrites `optimizer`; in practice you would pick one.
optimizer = optim.SGD(parameters, lr=0.01, momentum=0.9)
optimizer = optim.Adam(parameters, lr=0.01)
optimizer = optim.Adadelta(parameters, lr=0.01)
optimizer = optim.Adagrad(parameters, lr=0.01)
optimizer = optim.RMSprop(parameters, lr=0.01)
optimizer = optim.LBFGS(parameters, lr=0.01)  # note: LBFGS.step() requires a closure
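
Most of these optimizers are used in the same way, but LBFGS is a notable exception: its step() method needs a closure that re-evaluates the function and returns its value. Below is a minimal sketch on the square function, assuming default settings apart from the learning rate; the names `square` and `closure` are illustrative.

```python
import torch
import torch.optim as optim

def square(x):
    return x ** 2

x = torch.tensor([2.5], requires_grad=True)
optimizer = optim.LBFGS([x], lr=1.0)

def closure():
    # LBFGS may call this several times per step to re-evaluate the function
    optimizer.zero_grad()
    y = square(x).sum()
    y.backward()
    return y

for _ in range(5):
    optimizer.step(closure)

print(x.item())
```

The other optimizers listed above also accept an optional closure, but they do not require one.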

Now, let's use an optimizer to do the optimization!

We will need two new methods:

  • optimizer.zero_grad(): resets the gradients of the parameters (x here) so that they do not accumulate across iterations. Recent PyTorch versions reset them to None by default rather than to 0.
  • optimizer.step(): applies one update step to the parameters.
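
To make the role of these two methods concrete, the sketch below performs one plain gradient descent step twice: once by hand and once with optim.SGD (without momentum). Under that assumption the two updates coincide. The variable names are illustrative.

```python
import torch
import torch.optim as optim

lr = 0.1

# Manual update, as in the first part of this notebook
x_manual = torch.tensor([2.5], requires_grad=True)
(x_manual ** 2).sum().backward()
with torch.no_grad():
    x_manual -= lr * x_manual.grad
x_manual.grad = None

# The same update performed by an optimizer
x_optim = torch.tensor([2.5], requires_grad=True)
optimizer = optim.SGD([x_optim], lr=lr)
optimizer.zero_grad()
(x_optim ** 2).sum().backward()
optimizer.step()

print(x_manual.item(), x_optim.item())
```

Other optimizers such as Adam follow the same zero_grad / backward / step pattern, but apply a different update rule inside step().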

Let's define a new function

def func_2d(x):
    return x ** 2 / 20 + x.sin().tanh()

x0 = 8.0
lr = 2
iterations = 15
points = []

x_range = torch.arange(-10, 10, 0.1)
x = torch.tensor([x0], requires_grad=True)

# Let's use the Adam optimizer
optimizer = torch.optim.Adam([x], lr=lr)

for i in range(iterations):
    optimizer.zero_grad()  # clear the gradient from the previous iteration
    y = func_2d(x)  # named y (not f) so the function f defined earlier is not overwritten
    y.backward()
    points.append((x.item(), y.item()))
    optimizer.step()  # apply one Adam update to x

anim_2d(x_range, func_2d, points, 400)

Once again, we can see what happens when we optimize our function. Note that func_2d is more complex than the square function: it has several local minima.

We can also combine an optimizer with a learning rate scheduler, which adjusts the learning rate during training. Please refer to the original notebook to learn more about schedulers.
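
As a brief illustration of the idea, here is a minimal sketch using StepLR from torch.optim.lr_scheduler, which multiplies the learning rate by gamma every step_size scheduler steps. The values chosen for step_size and gamma are illustrative assumptions, not taken from the original notebook.

```python
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

x = torch.tensor([8.0], requires_grad=True)
optimizer = optim.Adam([x], lr=2.0)
scheduler = StepLR(optimizer, step_size=5, gamma=0.5)  # halve the rate every 5 steps

lrs = []
for i in range(10):
    optimizer.zero_grad()
    y = (x ** 2 / 20 + x.sin().tanh()).sum()
    y.backward()
    optimizer.step()
    lrs.append(scheduler.get_last_lr()[0])  # learning rate used for this update
    scheduler.step()  # call after optimizer.step()

print(lrs)
```

The scheduler only changes the learning rate stored in the optimizer; the update rule itself is still the optimizer's.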

More good learning resources are listed below:

  1. https://cs231n.github.io/optimization-1/
  2. https://ruder.io/optimizing-gradient-descent/

Let me know if you have any feedback or comments!