
Cosine Similarity

Explore how to measure similarity between data observations using cosine similarity. Learn to calculate it manually and apply scikit-learn's cosine_similarity function to identify similarities within and between datasets to support clustering tasks.

Chapter Goals:

  • Understand what defines similarity between data observations
  • Learn how to calculate cosine similarity
  • Apply cosine_similarity from scikit-learn with NumPy

Understanding cosine similarity

To find similarities between data observations, we first need to understand how to actually measure similarity. The most common measurement of similarity is the cosine similarity metric.

A data observation with numeric features is essentially just a vector of real numbers. Cosine similarity is used in mathematics as a similarity metric for real-valued vectors, so it makes sense to use it as a similarity metric for data observations. The cosine similarity for two data observations is a number between -1 and 1. It specifically measures the proportional similarity of the feature values between the two data observations (i.e. the ratio between feature columns).

Cosine similarity values closer to 1 represent greater similarity between the observations, while values closer to -1 represent more divergence. A value of 0 means that the two data observations have no correlation (neither similar nor dissimilar).
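The three regimes described above can be illustrated with a small NumPy sketch (the 2-D vectors here are hypothetical examples chosen to make each case exact):

```python
import numpy as np

def cossim(u, v):
    # Cosine similarity: dot product divided by the product of L2 norms.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0])

print(cossim(u, np.array([2.0, 4.0])))    # proportional features -> 1.0
print(cossim(u, np.array([-1.0, -2.0])))  # opposite direction    -> -1.0
print(cossim(u, np.array([-2.0, 1.0])))   # orthogonal            -> 0.0
```

Note that `[2.0, 4.0]` has exactly twice each feature value of `u`, so the similarity is 1 even though the vectors have different magnitudes; this is the "proportional similarity" property mentioned above.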

Review question

1. How is cosine similarity used in machine learning?

How to calculate cosine similarity

The cosine similarity for two vectors, u and v, is calculated as the dot product of the L2-normalized vectors. The exact formula for cosine similarity is:

$$\text{cossim}(u, v) = \frac{u}{||u||_2} \cdot \frac{v}{||v||_2}$$
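The formula can be computed directly with NumPy, and the result matches scikit-learn's `cosine_similarity` function (the vectors below are hypothetical example data):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

u = np.array([1.0, 0.0, 3.0])
v = np.array([2.0, 2.0, 1.0])

# Manual calculation: dot product of the L2-normalized vectors.
u_norm = u / np.linalg.norm(u)   # u / ||u||_2
v_norm = v / np.linalg.norm(v)   # v / ||v||_2
manual = np.dot(u_norm, v_norm)

# scikit-learn expects a 2-D array: one row per data observation.
# It returns the pairwise similarity matrix for all rows.
data = np.array([u, v])
sim_matrix = cosine_similarity(data)  # shape (2, 2)

print(manual)
print(sim_matrix[0, 1])  # same value as the manual calculation
```

Because `cosine_similarity` computes all pairwise similarities, the diagonal of the result is always 1 (every observation is perfectly similar to itself), and the off-diagonal entries give the similarity between distinct observations — which is what you use when comparing observations within or between datasets.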