Divergence Measures
Discover how to measure differences between probability distributions using divergence measures such as entropy, relative entropy, Kullback-Leibler divergence, Jensen-Shannon divergence, total variation distance, and Wasserstein distance. Learn their definitions, applications, and significance in machine learning contexts, including GANs and NLP.
Here we’ll focus on an important aspect of statistical learning. This lesson is advanced and can reasonably be skipped if needed.
Introduction
We can easily compare two scalar values by taking their difference or their ratio. Similarly, we can compare two vectors by taking the L1 or L2 norm of their difference.
Extending this notion to probability distributions, however, requires better-suited measures. There are several real-world applications where we need to quantify the similarity (or difference) between two distributions: for example, comparison between two sequences in bioinformatics, text comparison in Natural Language Processing (NLP), comparison of images generated by Generative Adversarial Networks (GANs), and so on.
Entropy
Let’s begin with the fundamental measure. The entropy of a probability vector $p = (p_1, \ldots, p_n)$ is defined as:

$$H(p) = -\sum_{i=1}^{n} p_i \log p_i$$

Usually, the base of the log is taken as either $2$ or $e$.
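As a quick illustration, here is a minimal sketch of the computation in JAX (the helper name `entropy` and the example vector are our own, purely for illustration):

```python
import jax.numpy as jnp

def entropy(p, base=jnp.e):
    """Entropy of a probability vector p, with a configurable log base."""
    # By convention 0 * log(0) = 0, so zero entries contribute nothing.
    terms = jnp.where(p > 0, p * jnp.log(p), 0.0)
    return -jnp.sum(terms) / jnp.log(base)

p = jnp.array([0.5, 0.25, 0.25])
print(entropy(p, base=2))  # 1.5 (bits)
print(entropy(p))          # ~1.04 (nats)
```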
Relative entropy
The relative entropy between two (element-wise positive) vectors $u$ and $v$ is defined as:

$$\sum_{i=1}^{n} u_i \log \frac{u_i}{v_i}$$
Since the equation involves an element-wise ratio as well as a logarithm, we must make sure to check for zeros.
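A minimal sketch in JAX follows (the helper `rel_entropy`, the guard via `jnp.maximum`, and the Dirichlet sampling are illustrative choices of ours, not prescribed by the lesson):

```python
import jax
import jax.numpy as jnp

def rel_entropy(u, v, eps=1e-12):
    """Relative entropy between element-wise positive vectors u and v."""
    # Guard against zeros before taking the ratio and the logarithm.
    u = jnp.maximum(u, eps)
    v = jnp.maximum(v, eps)
    return jnp.sum(u * jnp.log(u / v))

# Split one PRNG key into two so that the sampled vectors are independent.
key = jax.random.PRNGKey(0)
key_u, key_v = jax.random.split(key)

u = jax.random.dirichlet(key_u, jnp.ones(5))  # a random probability vector
v = jax.random.dirichlet(key_v, jnp.ones(5))  # another, independent one

print(rel_entropy(u, v))
```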
Splitting the PRNG key, as in the sketch above, is a convenient way to sample diverse vectors from the two distributions.
Kullback-Leibler (KL) divergence
Relative entropy is generalized as the Kullback-Leibler divergence, denoted as $D_{\mathrm{kl}}(u \,\|\, v)$ or simply $D_{\mathrm{kl}}(u, v)$, and defined as:

$$D_{\mathrm{kl}}(u, v) = \sum_{i=1}^{n} \left( u_i \log \frac{u_i}{v_i} - u_i + v_i \right)$$

When $u$ and $v$ are probability vectors (so $\sum_i u_i = \sum_i v_i = 1$), the $-u_i + v_i$ terms cancel and $D_{\mathrm{kl}}(u, v)$ reduces to the relative entropy.
Kullback-Leibler divergence finds many applications and is a frequently used term in machine learning publications.
Note: Some authors refer to relative entropy itself as KL divergence, but we’ll follow a more precise definition from Convex Optimization, by Stephen Boyd and Lieven Vandenberghe, Cambridge University Press, 2004.
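A sketch of this generalized form in JAX might look as follows (again, `kl_divergence` and the example vectors are our own illustrative names):

```python
import jax.numpy as jnp

def kl_divergence(u, v, eps=1e-12):
    """Generalized KL divergence D_kl(u, v) for element-wise positive vectors."""
    u = jnp.maximum(u, eps)
    v = jnp.maximum(v, eps)
    return jnp.sum(u * jnp.log(u / v) - u + v)

# For probability vectors the (-u + v) terms cancel, so the result
# coincides with the relative entropy.
u = jnp.array([0.2, 0.5, 0.3])
v = jnp.array([0.1, 0.6, 0.3])
print(kl_divergence(u, v))
```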
Jensen-Shannon divergence
As we saw, KL divergence is asymmetric:

$$D_{\mathrm{kl}}(u, v) \neq D_{\mathrm{kl}}(v, u)$$

Thus it cannot be used as a distance measure. The Jensen-Shannon (JS) divergence neatly overcomes this issue:

$$D_{\mathrm{JS}}(u, v) = \frac{1}{2} D_{\mathrm{kl}}(u, m) + \frac{1}{2} D_{\mathrm{kl}}(v, m), \qquad m = \frac{u + v}{2}$$
JS divergence is symmetric and always finite, and its square root is a proper distance metric, which makes it applicable in many more settings.
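A rough sketch of JS divergence, reusing the illustrative `kl_divergence` helper from above, shows the symmetry directly:

```python
import jax.numpy as jnp

def kl_divergence(u, v, eps=1e-12):
    u = jnp.maximum(u, eps)
    v = jnp.maximum(v, eps)
    return jnp.sum(u * jnp.log(u / v) - u + v)

def js_divergence(u, v):
    """Jensen-Shannon divergence: a symmetric combination of two KL terms."""
    m = 0.5 * (u + v)
    return 0.5 * kl_divergence(u, m) + 0.5 * kl_divergence(v, m)

u = jnp.array([0.2, 0.5, 0.3])
v = jnp.array([0.1, 0.6, 0.3])
print(js_divergence(u, v), js_divergence(v, u))  # the two values are identical
```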
Total variation distance
Total variation (or TV) distance is another common measure. It is the maximum possible difference between the probabilities that the two distributions assign to the same event $A$. This is formally defined as:

$$\delta(P, Q) = \sup_{A \in \mathcal{F}} \left| P(A) - Q(A) \right|$$
Note: The supremum operator, or sup, is nothing but a generalization of the maximum operator to infinite sets. We use it here for the sake of completeness.
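For finite discrete distributions the supremum is attained, and the TV distance works out to half the L1 distance between the probability vectors. A minimal sketch in JAX (the name `tv_distance` is ours):

```python
import jax.numpy as jnp

def tv_distance(p, q):
    """Total variation distance between two discrete probability vectors.

    For finite distributions this equals half the L1 distance.
    """
    return 0.5 * jnp.sum(jnp.abs(p - q))

p = jnp.array([0.2, 0.5, 0.3])
q = jnp.array([0.1, 0.6, 0.3])
print(tv_distance(p, q))  # 0.1
```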
Total variation norm is often used for regularization in computer vision applications.
Wasserstein distance
In 2017, Martin Arjovsky et al. set the machine learning community ablaze by introducing yet another type of GAN: the Wasserstein GAN.
The Wasserstein GAN has since become a state-of-the-art model. At the heart of its architecture lies a different divergence measure: the Wasserstein distance.
Understanding the Wasserstein distance and its implementation requires some background, which is well complemented by a dedicated library in JAX. We will cover it in the following lesson.
Remember: It’s quite okay to use WGANs in applied areas like computer vision without much theoretical background, so you can skip much of the theory in the next lesson and focus purely on the applied part if needed.