made by https://cneuralnets.netlify.app/
Dr. Maya, a data scientist at CityCare Hospital, noticed long patient wait times in the emergency room. To improve efficiency, she needed to model the true distribution of wait times and compare it against proposed scheduling models.

So she started gathering data about the wait times, and found that the values follow a right-skewed curve.

This is the empirical curve of the data. Now she needs a model that captures this distribution, so she proposes two candidates —
Normal Distribution
$$ Q_1(x)=\frac{1}{\sqrt{2\pi \sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$
Exponential Distribution
$$ Q_2(x)=\lambda e^{-\lambda x} $$
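To make the comparison concrete, here is a minimal sketch of the two candidate densities evaluated at a few wait times. The parameter values (`mu`, `sigma`, `lam`) are purely illustrative assumptions, not fits from any real hospital data:

```python
import math

# Illustrative parameters (assumed, not fitted):
mu, sigma = 30.0, 12.0   # normal model: mean 30 min, std 12 min
lam = 1 / 30.0           # exponential model: rate for a 30 min mean

def q1(x):
    """Normal density Q1(x)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def q2(x):
    """Exponential density Q2(x)."""
    return lam * math.exp(-lam * x)

# Compare the two models at short, typical, and long waits
for x in (5, 30, 90):
    print(f"x={x:3d} min  Q1={q1(x):.5f}  Q2={q2(x):.5f}")
```

Note how the exponential model puts most of its mass near zero and decays slowly to the right, which is why it is a natural candidate for a right-skewed wait-time curve.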

Imagine you run a shipping company. Each day, you receive packages labeled with different codes (say, A, B, C...), and you know the true frequency (probability) of each code from your records. Entropy tells you the minimum average number of bits you need to label these packages if you use the most efficient code possible — one that matches the true frequencies exactly.
$$ H(p)=-\sum_x p(x)\log p(x) $$
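A small sketch of this formula, using a hypothetical set of package-code frequencies (base-2 log, so the answer is in bits):

```python
import math

# Hypothetical true package-code frequencies P
p = {"A": 0.5, "B": 0.25, "C": 0.25}

def entropy(p):
    """H(p) = -sum over x of p(x) * log2 p(x), in bits."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

print(entropy(p))  # → 1.5 bits: the best achievable average code length
```

With these frequencies, an optimal code would use 1 bit for A and 2 bits each for B and C, averaging 0.5·1 + 0.25·2 + 0.25·2 = 1.5 bits per package.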
Now, suppose you lose your records and have to guess the frequencies. You use a model distribution Q, which may not match reality. You design your codes based on Q, but packages still arrive according to P.
$$ H(p,q)=-\sum_x p(x)\log q(x) $$
You’re still shipping the same packages, but your codes are now less efficient because your guesses don’t match reality. On average, you’ll need more bits per package.
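The extra cost of a wrong guess can be sketched with the same hypothetical frequencies: P is the true distribution, and Q is a mistaken model you build your codes from:

```python
import math

p = {"A": 0.5, "B": 0.25, "C": 0.25}   # true frequencies P
q = {"A": 0.25, "B": 0.25, "C": 0.5}   # guessed model Q (wrong about A and C)

def cross_entropy(p, q):
    """H(p, q) = -sum p(x) * log2 q(x): average bits when codes
    are designed for Q but packages actually arrive according to P."""
    return -sum(p[x] * math.log2(q[x]) for x in p)

print(cross_entropy(p, q))  # → 1.75 bits, worse than the 1.5-bit optimum
```

The code lengths are chosen for Q, so the most common package (A, 50% of traffic) gets a needlessly long 2-bit code, pushing the average up.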
KL divergence is the extra shipping cost you pay because you used the wrong map (Q) instead of the true one (P). It's the difference between the code length you actually use (cross-entropy) and the best possible code length (entropy):
$$ D_{KL}(p\,\|\,q)=H(p,q)-H(p)=\sum_x p(x)\log \frac{p(x)}{q(x)} $$
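Putting the three quantities together, a short sketch (same illustrative P and Q as above) checks that the difference between cross-entropy and entropy equals the direct KL formula:

```python
import math

p = {"A": 0.5, "B": 0.25, "C": 0.25}   # true frequencies P
q = {"A": 0.25, "B": 0.25, "C": 0.5}   # guessed model Q

def entropy(p):
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def cross_entropy(p, q):
    return -sum(p[x] * math.log2(q[x]) for x in p)

def kl_divergence(p, q):
    """D_KL(p || q) = sum p(x) * log2(p(x) / q(x)), in bits."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)

extra_bits = kl_divergence(p, q)
print(extra_bits)                                   # → 0.25 bits of extra cost per package
print(cross_entropy(p, q) - entropy(p))             # → 0.25, the same number: H(p,q) - H(p)
```

That 0.25 bits per package is exactly the gap between the 1.75-bit cross-entropy and the 1.5-bit entropy: the price of coding with the wrong map.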