Ridge and Lasso Regression

Learn about ridge and lasso regression, how they compare, and why the intersection of their penalty contours with the MSE contours matters.

In the previous lesson, we saw how regularization helps control overfitting by penalizing large weights and balancing the bias-variance trade-off. We also introduced L1 and L2 penalties and discussed their general effects on model behavior. Now, we focus specifically on ridge and lasso regression and examine how these penalties change the solution. Using the same linear model and squared loss, we compare ridge and lasso through their objective functions and visualize their behavior using MSE contours. This geometric perspective helps explain why ridge shrinks all coefficients, while lasso can drive some coefficients exactly to zero.

Ridge and Lasso objectives

Both Ridge and Lasso regression are special forms of regularized linear regression. They use the simplest model type (linear model) and the standard way to measure error (squared loss), differing only in their regularization penalty.
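To make the "same loss, different penalty" point concrete, here is a minimal NumPy sketch. The function names, the regularization strength lam, and the choice to penalize every weight (including the bias) are our own simplifications for illustration, not prescribed by the lesson:

```python
import numpy as np

def squared_loss(w, X, y):
    """Mean squared error of the linear model: predictions are X @ w (bias folded into w)."""
    residuals = X @ w - y
    return np.mean(residuals ** 2)

def ridge_objective(w, X, y, lam):
    """Squared loss plus an L2 penalty: MSE + lam * sum(w_j ** 2)."""
    return squared_loss(w, X, y) + lam * np.sum(w ** 2)

def lasso_objective(w, X, y, lam):
    """Squared loss plus an L1 penalty: MSE + lam * sum(|w_j|)."""
    return squared_loss(w, X, y) + lam * np.sum(np.abs(w))
```

Off-the-shelf implementations (for example, scikit-learn's Ridge and Lasso estimators) follow the same structure: one shared squared-loss term, and a penalty term that differs between the two.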

The core model and loss function

Before introducing the penalty, we must define the model that makes a prediction and the loss function that measures the error.

Linear model ($f_{\mathbf{w}}$)

A linear model assumes the output ($\hat{y}_i$, the prediction) is a simple, weighted sum of the inputs ($\mathbf{x}_i$). The goal is to find the best set of weights ($\mathbf{w}$) that connect the inputs to the output.

  • We have $n$ training examples, $D = \{(\mathbf{x}_i, y_i) \mid 1 \le i \le n\}$. Each input $\mathbf{x}_i$ has $d$ features.
  • The model expression:

$$f_{\mathbf{w}}(\mathbf{x}_i) = w_0 + w_1 x_{i1} + w_2 x_{i2} + \cdots + w_d x_{id}$$

  • $w_0$ is the intercept (or bias).
  • $w_1$ to $w_d$ are the slopes or feature weights.

To simplify the math, we often combine $w_0$ with the other weights by adding a constant $1$ to the start of the feature vector: $\hat{\mathbf{x}}_i = (1, x_{i1}, \dots, x_{id})$ ...
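As a quick illustration of this bias-absorption trick, the sketch below (with made-up numbers, purely for illustration) shows that prepending a constant $1$ to each feature vector lets us write the whole model as a single dot product:

```python
import numpy as np

# Hypothetical data: 3 examples, each with d = 2 features.
X = np.array([[0.5, 1.2],
              [1.0, 0.3],
              [2.0, 1.8]])

w0 = 0.1                   # intercept (bias) w_0
w = np.array([0.4, -0.7])  # feature weights w_1 ... w_d

# Prediction as an explicit weighted sum: y_hat_i = w_0 + w_1 * x_i1 + ... + w_d * x_id
y_hat = w0 + X @ w

# Bias-absorbed form: prepend a constant 1 to every feature vector and w_0 to the weights.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # each row is (1, x_i1, ..., x_id)
w_aug = np.concatenate([[w0], w])                 # (w_0, w_1, ..., w_d)
y_hat_aug = X_aug @ w_aug                         # same predictions as above

assert np.allclose(y_hat, y_hat_aug)
```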
