
Cosine Similarity

Explore how to measure similarity between data observations using cosine similarity. Learn to calculate it manually and apply scikit-learn's cosine_similarity function to identify similarities within and between datasets to support clustering tasks.

Chapter Goals:

  • Understand what defines similarity between data observations
  • Learn how to calculate cosine similarity
  • Apply cosine_similarity from scikit-learn with NumPy

Understanding cosine similarity

To find similarities between data observations, we first need to understand how to actually measure similarity. The most common measurement of similarity is the cosine similarity metric.

A data observation with numeric features is essentially just a vector of real numbers. Cosine similarity is used in mathematics as a similarity metric for real-valued vectors, so it makes sense to use it as a similarity metric for data observations. The cosine similarity for two data observations is a number between -1 and 1. It specifically measures the proportional similarity of the feature values between the two data observations (i.e. the ratio between feature columns).

Cosine similarity values closer to 1 represent greater similarity between the observations, while values closer to -1 represent more divergence. A value of 0 means that the two data observations have no correlation (neither similar nor dissimilar).
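The three regimes described above can be illustrated with a small NumPy sketch (the 2-D vectors here are hypothetical examples chosen to make each case exact):

```python
import numpy as np

def cossim(u, v):
    # Cosine similarity: dot product divided by the product of L2 norms.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0])

print(cossim(u, np.array([2.0, 4.0])))    # proportional features -> 1.0
print(cossim(u, np.array([-1.0, -2.0])))  # opposite direction    -> -1.0
print(cossim(u, np.array([-2.0, 1.0])))   # orthogonal            -> 0.0
```

Note that `[2.0, 4.0]` has exactly twice each feature value of `u`, so the similarity is 1 even though the vectors have different magnitudes; this is the "proportional similarity" property mentioned above.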

Review question

1. How is cosine similarity used in machine learning?

How to calculate cosine similarity

The cosine similarity for two vectors, u and v, is calculated as the dot product of the L2-normalized vectors. The exact formula for cosine similarity is:

$$\text{cossim}(u, v) = \frac{u}{||u||_2} \cdot \frac{v}{||v||_2}$$
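The formula can be computed directly with NumPy, and the result matches scikit-learn's `cosine_similarity` function (the vectors below are hypothetical example data):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

u = np.array([1.0, 0.0, 3.0])
v = np.array([2.0, 2.0, 1.0])

# Manual calculation: dot product of the L2-normalized vectors.
u_norm = u / np.linalg.norm(u)   # u / ||u||_2
v_norm = v / np.linalg.norm(v)   # v / ||v||_2
manual = np.dot(u_norm, v_norm)

# scikit-learn expects a 2-D array: one row per data observation.
# It returns the pairwise similarity matrix for all rows.
data = np.array([u, v])
sim_matrix = cosine_similarity(data)  # shape (2, 2)

print(manual)
print(sim_matrix[0, 1])  # same value as the manual calculation
```

Because `cosine_similarity` computes all pairwise similarities, the diagonal of the result is always 1 (every observation is perfectly similar to itself), and the off-diagonal entries give the similarity between distinct observations — which is what you use when comparing observations within or between datasets.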