Reviewing the Steps of Gradient Descent

Explore the fundamental steps of gradient descent applied to a simple linear regression problem using PyTorch. Understand how to generate synthetic data, compute predictions, calculate mean squared error loss, find gradients with respect to parameters, and update those parameters iteratively. This lesson outlines batch, stochastic, and mini-batch gradient descent types and shows how repeated epochs train the model.

Simple linear regression

Most tutorials start with some nice and pretty image classification problems to illustrate how to use PyTorch. It may seem cool, but I believe it distracts you from learning how PyTorch works.

For this reason, in this first example, we will stick with a simple and familiar problem: a linear regression with a single feature x! It does not get much simpler than that. It has the following equation:

y = b + wx + \epsilon

It is also possible to think of it as the simplest neural network possible: one input, one output, and no activation function (that is, linear).
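As a quick illustration (not part of this lesson's NumPy walkthrough), this one-input, one-output model corresponds to a single linear layer in PyTorch:

Python 3.5
import torch
import torch.nn as nn

# One input, one output, no activation: yhat = b + w * x
# (nn.Linear holds both the weight w and the bias b)
model = nn.Linear(in_features=1, out_features=1)
x = torch.rand(100, 1)           # 100 points, one feature each
yhat = model(x)                  # forward pass
print(model.weight, model.bias)  # w and b, randomly initialized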

Note: This lesson serves as a review for the previous chapter.

Data generation

Let us start generating some synthetic data. We start with a vector of 100 (N) points for our feature (x) and create our labels (y) using b = 1, w = 2, and some Gaussian noise (epsilon).

Synthetic data generation

The following code generates our synthetic data:

Python 3.5
import numpy as np

# Variables initialized
true_b = 1
true_w = 2
N = 100
# Data generation process
np.random.seed(42)
x = np.random.rand(N, 1)
epsilon = (.1 * np.random.randn(N, 1))
y = true_b + true_w * x + epsilon
# Displaying the data that we generated (first 5 values)
print("X:", x[:5], "\n\nY:", y[:5])

Splitting data

Next, let us split our synthetic data into train and validation sets, shuffling the array of indexes and using the first 80 shuffled points for training.

Python 3.5
# Shuffles the indices
idx = np.arange(N)
np.random.shuffle(idx)
# Uses first 80 random indices for train
train_idx = idx[:int(N*.8)]
# Uses the remaining indices for validation
val_idx = idx[int(N*.8):]
# Generates train and validation sets
x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]
# Displaying the training set which we will be using in this chapter (first 5 values)
print('x_train: {}'.format(x_train[:5]))

The following figure shows subplots of both the training set (x_train, y_train) and the validation set (x_val, y_val) of the generated data.

We know that b = 1 and w = 2, but now let us see how close we can get to the true values by using gradient descent and the 80 points in the training set (for training: N = 80).

Gradient descent

We will now cover the five basic steps you need to go through to use gradient descent, along with the corresponding NumPy code.

Step 0 - Random initialization

For training a model, you need to randomly initialize the parameters/weights. In our case, we have only two: b and w.

Python 3.5
import numpy as np
# Step 0 - initializes parameters "b" and "w" randomly
np.random.seed(42)
b = np.random.randn(1)
w = np.random.randn(1)
print(b, w)

Step 1 - Compute model’s predictions

This is the forward pass. It simply computes the model’s predictions using the current values of the parameters/weights.

At the very beginning, we will be producing really bad predictions, as we started with random values from Step 0.

Python 3.5
# Step 1 - computes our model's predicted output - forward pass
yhat = b + w * x_train

You can see the values of these predictions by running the following code:

Python 3.5
# Step 0 - initializes parameters "b" and "w" randomly
np.random.seed(42)
b = np.random.randn(1)
w = np.random.randn(1)
# Step 1 - computes our model's predicted output - forward pass
yhat = b + w * x_train
# Displaying the model's predicted output using current parameter values (first five values)
print(yhat[:5])

Step 2 - Compute the loss

For a regression problem, the loss is given by the Mean Squared Error (MSE). As a reminder, MSE is the average of all squared errors, meaning the average of all squared differences between labels (y) and predictions (b + wx).
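Written out for n data points, with predictions \hat{y}_i = b + w x_i, the loss is:

MSE = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \frac{1}{n} \sum_{i=1}^{n} (b + w x_i - y_i)^2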

Let us now compute the loss using Python.

In the code below, we are using all data points of the training set to compute the loss, so n = N = 80; this means we are performing batch gradient descent.

Python 3.5
# Step 2 - computing the loss
# We are using ALL data points, so this is BATCH gradient
# descent. How wrong is our model? That's the error!
error = (yhat - y_train)
# It is a regression, so it computes mean squared error (MSE)
loss = (error ** 2).mean()
print(loss)

Gradient descent types:

  • If we use all points in the training set (n = N) to compute the loss, we are performing batch gradient descent.
  • If we were to use a single point (n = 1) each time, it would be stochastic gradient descent.
  • Anything in between, with n between 1 and N, characterizes mini-batch gradient descent (see the sketch below).
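The only thing that changes between the three variants is which points go into the loss. Here is a minimal sketch, assuming x_train, y_train, b, and w from the snippets above (the batch size of 16 is a hypothetical choice):

Python 3.5
# Batch gradient descent: loss over all N training points
batch_loss = (((b + w * x_train) - y_train) ** 2).mean()

# Stochastic gradient descent: loss over a single random point
i = np.random.randint(len(x_train))
sgd_loss = (((b + w * x_train[i]) - y_train[i]) ** 2).mean()

# Mini-batch gradient descent: loss over a random subset of n points
batch_size = 16
mb_idx = np.random.choice(len(x_train), batch_size, replace=False)
mb_loss = (((b + w * x_train[mb_idx]) - y_train[mb_idx]) ** 2).mean()

print(batch_loss, sgd_loss, mb_loss)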

Step 3 - Compute the gradients

A gradient is a partial derivative. Why partial? Because one computes it with respect to (w.r.t.) a single parameter. Since we have two parameters, b and w, we must compute two partial derivatives.

A derivative tells you how much a given quantity changes when you slightly vary some other quantity. In our case, how much does our MSE loss change when we vary each one of our two parameters separately?

Gradient = how much the loss changes if one parameter changes a little bit!
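For the MSE loss above, taking the partial derivative w.r.t. each parameter gives:

\frac{\partial MSE}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \qquad \frac{\partial MSE}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} x_i (\hat{y}_i - y_i)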

Using the equations above, we will now compute the gradients with respect to the b and w coefficients.

Python 3.5
# Step 3 - computes gradients for both "b" and "w" parameters
b_grad = 2 * error.mean()
w_grad = 2 * (x_train * error).mean()
print(b_grad, w_grad)
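If you want to verify these analytic gradients, one option (a sanity check, not part of the original steps) is a finite-difference approximation, which should closely match b_grad and w_grad:

Python 3.5
# Nudge each parameter by a tiny eps and watch the loss respond
# (assumes x_train, y_train, b, and w from above)
def mse(b, w):
    return (((b + w * x_train) - y_train) ** 2).mean()

eps = 1e-6
b_grad_approx = (mse(b + eps, w) - mse(b - eps, w)) / (2 * eps)
w_grad_approx = (mse(b, w + eps) - mse(b, w - eps)) / (2 * eps)
print(b_grad_approx, w_grad_approx)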

Step 4 - Update the parameters

In the final step, we use the gradients to update the parameters. Since we are trying to minimize our loss, we reverse the sign of the gradient for the update.

There is still another hyperparameter to consider: the learning rate, denoted by the Greek letter eta (η, which looks like the letter n), is the multiplicative factor that we apply to the gradient for the parameter update.
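In equation form, the update rule is:

b \leftarrow b - \eta \frac{\partial MSE}{\partial b} \qquad w \leftarrow w - \eta \frac{\partial MSE}{\partial w}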

In our example, let us start with a value of 0.1 for the learning rate (which is a relatively big value as far as learning rates are concerned!).

Python 3.5
# Sets learning rate - this is "eta" ~ the "n"-like Greek letter
lr = 0.1
print(b, w)
# Step 4 - updates parameters using gradients and the learning rate
b = b - lr * b_grad
w = w - lr * w_grad
print(b, w)

Step 5 - Rinse and repeat!

Now we use the updated parameters to go back to Step 1 and restart the process.

Definition of epoch:

An epoch is complete whenever every point in the training set (N) has already been used in all steps: forward pass, computing loss, computing gradients, and updating parameters.

During one epoch, we perform at least one update, but no more than N updates.

The number of updates (N/n) will depend on the type of gradient descent being used:

  • For batch (n = N) gradient descent, this is trivial, as it uses all points for computing the loss. One epoch is the same as one update.
  • For stochastic (n = 1) gradient descent, one epoch means N updates since every individual data point is used to perform an update.
  • For mini-batch (of size n), one epoch has N/n updates since a mini-batch of n data points is used to perform an update.

Repeating this process over and over for many epochs is training a model in a nutshell.
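Putting Steps 0 through 4 together, a minimal sketch of the full batch gradient descent loop for our problem (assuming x_train and y_train from above; the 1,000 epochs are an arbitrary choice) could look like this:

Python 3.5
import numpy as np

# Step 0 - initializes parameters "b" and "w" randomly
np.random.seed(42)
b = np.random.randn(1)
w = np.random.randn(1)

lr = 0.1          # learning rate (eta)
n_epochs = 1000   # arbitrary number of epochs for this sketch

for epoch in range(n_epochs):
    # Step 1 - forward pass
    yhat = b + w * x_train
    # Step 2 - MSE loss over ALL points (batch gradient descent)
    error = yhat - y_train
    loss = (error ** 2).mean()
    # Step 3 - gradients w.r.t. "b" and "w"
    b_grad = 2 * error.mean()
    w_grad = 2 * (x_train * error).mean()
    # Step 4 - parameter update
    b = b - lr * b_grad
    w = w - lr * w_grad

print(b, w)  # should land close to the true values b = 1 and w = 2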

Practice

Try to solve this short quiz to test your understanding of the concepts explained in this lesson:

Technical Quiz

1. Which steps are executed inside the gradient descent training loop? (Select all that apply.)

A. Computing loss
B. Making predictions (forward pass)
C. Randomly initializing parameters
D. Computing gradients
E. Updating parameters