Optimizing BCE Loss

Learn how to minimize BCE loss using gradient descent.

In the previous lesson, we saw how logistic regression uses the sigmoid function to output a probability $\hat{y}_i = p(y_i=1\mid\mathbf{x}_i)$. Now, the critical question is: How do we find the optimal weight vector ($\mathbf{w}$) that makes these predicted probabilities as accurate as possible?

This process is called optimization. We must choose a proper loss function that measures the error between our prediction ($\hat{y}_i$) and the true label ($y_i$), and then use an algorithm to minimize that error.

This lesson introduces the binary cross-entropy (BCE) loss as the preferred measure of error for probabilistic classifiers, highlighting its essential property of convexity: the loss surface has a single global minimum with no deceptive local minima, making optimization smooth and reliable. Because BCE is convex for linear models, it allows the training process to converge efficiently to the best solution.

Optimization

Logistic regression aims to learn a parameter vector $\mathbf{w}$ by minimizing a chosen loss function. While the squared loss $L_s(\mathbf{w})=\sum_{i=1}^n\left(y_i-\frac{1}{1+e^{-\mathbf{w}^T\phi(\mathbf{x}_i)}}\right)^2$ might appear to be a natural choice, it is not convex. Fortunately, we have the flexibility to consider alternative loss functions that are convex. One such loss function is the binary cross-entropy (BCE) loss, denoted as $L_{BCE}$, which is convex. The BCE loss can be defined as:

$$L_{BCE}(\mathbf{w})=-\sum_{i=1}^n\big(y_i\log(\hat{y}_i)+(1-y_i)\log(1-\hat{y}_i)\big)$$
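
To see the convexity difference concretely, the short sketch below evaluates both losses along a one-dimensional sweep of a single weight and checks the curvature of each curve with second finite differences. The tiny dataset, the identity feature map, and the curvature check are illustrative assumptions, not part of the lesson's setup:

Python 3.10.4
import numpy as np

# Hypothetical 1-D data with an identity feature map phi(x) = x (an illustrative assumption)
x = np.array([-2.0, -1.0, 1.5, 3.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squared_loss(w):
    y_hat = sigmoid(w * x)
    return np.sum((y - y_hat) ** 2)

def bce_loss(w):
    z = w * x
    # Numerically stable form of -(y*log(y_hat) + (1 - y)*log(1 - y_hat))
    return np.sum(y * np.logaddexp(0.0, -z) + (1 - y) * np.logaddexp(0.0, z))

# Sweep the single weight and check curvature via second finite differences:
# a convex curve never bends downward.
ws = np.linspace(-10.0, 10.0, 401)
bce_curve = np.array([bce_loss(w) for w in ws])
sq_curve = np.array([squared_loss(w) for w in ws])
print("BCE curve always bends upward:         ", np.all(np.diff(bce_curve, 2) >= -1e-9))
print("Squared-loss curve always bends upward:", np.all(np.diff(sq_curve, 2) >= -1e-9))

The BCE check comes out true, while the squared-loss curve flattens into plateaus for extreme weights, which is exactly the non-convex behavior that can stall optimization.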

Explanation of BCE loss

Let’s delve into the explanation of the BCE loss. The function has two parts, one active when the true label $y_i=1$ and the other when $y_i=0$:

  • Case 1: True label is $y_i=1$. The loss simplifies to $-\log(\hat{y}_i)$.

    • If the prediction $\hat{y}_i \approx 1$ (Correct!), the loss $-\log(\hat{y}_i) \approx 0$.
    • Conversely, if $\hat{y}_i \approx 0$ (Wrong!), the loss becomes significantly large (approaching $\infty$).
  • Case 2: True label is $y_i=0$. The loss simplifies to $-\log(1-\hat{y}_i)$.

    • If $\hat{y}_i \approx 0$ (Correct!), the loss $-\log(1-\hat{y}_i) \approx 0$.
    • Conversely, if $\hat{y}_i \approx 1$ (Wrong!), the loss becomes significantly large.

This structure ensures that the loss strongly penalizes confident, incorrect predictions, which is ideal for a probabilistic model. The code snippet provided below illustrates the computation of the BCE loss for a single example:

Python 3.10.4
import numpy as np

def BCE_loss(y, y_hat):
    """
    Compute the binary cross-entropy (BCE) loss for a given
    target label and predicted probability.

    Args:
        y: Target label (0 or 1)
        y_hat: Predicted probability
    Returns:
        BCE loss value
    """
    if y == 1:
        return -np.log(y_hat)
    else:
        return -np.log(1 - y_hat)

# Iterate over different combinations of y and y_hat
for y in [0, 1]:
    for y_hat in [0.0001, 0.99]:
        # Compute and print the BCE loss for each combination
        print(f"y = {y}, y_hat = {y_hat}, BCE_loss = {BCE_loss(y, y_hat)}")

By utilizing the BCE loss, we can effectively capture the dissimilarity between the target labels and predicted probabilities, enabling convex optimization during the parameter estimation process of logistic regression.
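
In practice, the per-example function above is usually vectorized over the whole dataset so that the summed loss $L_{BCE}$ is computed in one call. Below is a minimal sketch; the clipping constant `eps` is an illustrative safeguard against `log(0)`, not something introduced in the lesson:

Python 3.10.4
import numpy as np

def BCE_loss_vectorized(y, y_hat, eps=1e-12):
    """Summed BCE loss over arrays of labels and predicted probabilities."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # keep probabilities away from 0 and 1
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([0, 1, 1, 0])
y_hat = np.array([0.1, 0.8, 0.99, 0.3])
print(BCE_loss_vectorized(y, y_hat))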

Minimizing BCE loss

To minimize the BCE loss, we need to find the model parameters (that is, $\mathbf{w}$) that result in the smallest loss value. The BCE loss is defined as:

$$\begin{align*} L_{BCE}(\mathbf{w}) &= \sum_{i=1}^n L_i \\ L_i &= -\big(y_i\log(\hat{y}_i)+(1-y_i)\log(1-\hat{y}_i)\big) \\ \hat{y}_i &= \sigma(z_i)=\frac{1}{1+e^{-z_i}} \\ z_i &= \mathbf{w}^T\phi(\mathbf{x}_i) \end{align*}$$ ...
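
As a preview of how this minimization can be carried out with gradient descent, the sketch below uses the standard gradient of the summed BCE loss for a linear model, $\nabla_{\mathbf{w}} L_{BCE} = \sum_{i=1}^n (\hat{y}_i - y_i)\phi(\mathbf{x}_i)$, and repeatedly steps $\mathbf{w}$ in the opposite direction. The synthetic data, the bias-augmented feature map, the learning rate, and the step count are illustrative assumptions:

Python 3.10.4
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative data: each row of X is phi(x) = [1, x], adding a bias term (an assumption)
X = np.array([[1.0, -2.0], [1.0, -0.5], [1.0, 1.0], [1.0, 2.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = np.zeros(X.shape[1])
learning_rate = 0.1

for step in range(1000):
    y_hat = sigmoid(X @ w)       # predicted probabilities for all examples
    grad = X.T @ (y_hat - y)     # gradient of the summed BCE loss w.r.t. w
    w -= learning_rate * grad    # gradient-descent update

print("learned weights:", w)
print("predicted probabilities:", sigmoid(X @ w))

Because the BCE loss is convex in $\mathbf{w}$, these updates steadily drive the loss toward its global minimum rather than getting trapped elsewhere.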
